Updated March 29th, 2023
As digitization and process improvement continue, boosted by globalization, remote work, and the general reduction of paper-based and manual processes, the reliance on document management systems increases in step. As the use of cloud-based applications and storage also increases, there is an opportunity to leverage cloud-native services to create working document management systems (DMS) that can keep pace with the growth of any organization.
Scalability is the ability of a software solution to maintain functionality no matter the size of your data or the amount of processing required. This is not only important for large data sets and complex processes; it's essential that the solution also handles smaller data sets and simpler processes efficiently. In other words, a small business and a large business should both be able to make effective use of the solution, with no minimum level of size or complexity required for a scalable solution to be economical.
There are two main methods to achieve scalability: optimizing the software so that it does more with its existing capacity, and increasing capacity.
While optimization is an important part of all software development, the most effective way to achieve scalability consistently is by increasing capacity, and that is often achieved by removing any barriers that prevent or restrict that capacity increase. This is commonly done through cloud-based software, which avoids two major logistical barriers: the need to employ a hardware infrastructure team and the need to source compute and networking hardware for capacity increases.
In addition, greater efficiency in scaling capacity can be found by utilizing serverless computing in your solution design.
(For more information on scalability itself, see Foundations of Scalable Systems by Ian Gorton.)
Serverless computing is a cloud-native architecture that allows developers to build and run applications without having to manage servers.
The term "serverless computing" can be considered imprecise, as the software does still run on servers; the term relates to the fact that the servers are abstracted from the software, and that any configuration and maintenance, and even the number and types of servers, are managed by the cloud provider to meet real-time demand. This abstraction not only simplifies the infrastructure management required for the functioning of the software, but also provides a real-time cost model for computer processing, where the customer is only charged for the amount of time the code is being executed.
In a traditional server-based architecture, the software provider configures and maintains a specific number of always-on servers according to predetermined specifications. If demand increases, either more servers are required to be added to the pool of available servers, or the servers themselves need to be upgraded by the addition of more memory, additional CPUs, an increase of storage space, or a combination of all three. If demand wanes, whether through long-term contraction or just due to an overnight, weekend, or holiday period, the servers would need to be reduced in number or specifically downgraded in order to prevent wasting computing power.
While it's possible to use automation to scale servers, and while cloud providers can allow autoscaling by providing additional servers on demand, there are limitations to the minimum and maximum size of servers, and it's ultimately the customer's responsibility to ensure that they have mechanisms in place to scale up and down as needed.
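The scaling decision described above can be sketched as a small function. This is a minimal illustration, not any cloud provider's autoscaling API: the function name, the per-server capacity, and the minimum and maximum bounds are all hypothetical values chosen for the example.

```python
import math


def desired_server_count(requests_per_second, capacity_per_server,
                         min_servers=1, max_servers=10):
    """Return how many servers a simple autoscaling rule would ask for.

    All parameters are hypothetical: capacity_per_server is the number of
    requests per second one server is assumed to handle, and the min/max
    bounds represent the limits the customer has to configure.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Round up so that demand is always covered by whole servers.
    needed = math.ceil(requests_per_second / capacity_per_server)
    # Clamp to the configured bounds; demand beyond max_servers goes unmet.
    return max(min_servers, min(max_servers, needed))


print(desired_server_count(450, 100))   # moderate demand
print(desired_server_count(0, 100))     # overnight lull, still one server
print(desired_server_count(5000, 100))  # spike, capped at max_servers
```

Even in this simplified form, two limitations are visible: the pool never drops below the minimum during idle periods, and demand above the maximum is not served unless someone raises the bound.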
In addition, traditional server-based architecture often requires routine tasks, such as managing the operating system and file system, keeping up with security patches, and setting up and maintaining logging and monitoring. Cloud providers can abstract some of these tasks (for example, Amazon EC2 can provide logging through CloudWatch and file handling through Elastic File System), but some tasks will always be the responsibility of the customer.
Many software solutions combine server-based and serverless components; AWS customers often make use of S3 to store files, AWS Lambda functions to run tasks, DynamoDB for a NoSQL database, and CloudFront as a content delivery network. All of these components are part of AWS' serverless offerings, where they fulfill workflows without any requirement to maintain and configure servers or to set up scaling automation, i.e., autoscaling.
For document management systems, expanding the scope and responsibilities of serverless components allows for better scalability by reducing the number of components that require server configuration and scaling mechanisms. While some functionality may not be possible with serverless components, such as providing full-text search using Amazon OpenSearch, using serverless whenever possible significantly reduces the restrictions on scaling.
Serverless is not free, though in the case of smaller workloads, serverless can often be run almost entirely within the free tier provided by the major cloud providers. For some workflows, on-demand serverless can be more expensive than a stable server-based workflow, particularly when no cost optimizations have been performed. In the case of AWS, there are upfront cost commitments that can be made for processing workflows, such as DynamoDB Provisioned Capacity and Compute Savings Plans. There are also optimizations available for storage, specifically storage tiers (and intelligent tiering, when available) for products such as S3 and DynamoDB.
Where serverless excels is in dynamic scaling, which is exactly how it achieves its scalability. In the case of storage, this scalability means that your S3 or DynamoDB storage will never run out of space, though your costs will grow as your storage increases. In the case of compute workflows such as AWS Fargate or AWS Lambda, it's possible to run hundreds or even thousands of tasks concurrently. As Fargate tasks can use 4 virtual CPUs and up to 30GB of memory each, this level of compute concurrency should be able to meet the needs of upwards of 99.999% of workloads.
Where server-based excels is when your workloads use the majority of your server-based capacity. As this theoretical study from AWS indicates when comparing EC2 to Fargate for AWS ECS, if your workload is able to remain at near-full utilization of your provisioned EC2 servers, you will see some savings over serverless Fargate; the study estimated the savings on an ECS cluster of fully utilized EC2 instances at 20% over the same cluster using AWS Fargate. But in cases where the ECS cluster has little to no utilization, Fargate can be up to 87% cheaper than EC2.
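To make the trade-off concrete, here is a minimal sketch of a deliberately simplified cost model. It assumes (and this is an assumption of the example, not a claim from the study) that the serverless cost scales linearly with utilization while the provisioned cost is fixed. The only figure taken from the text above is the 20% saving at full utilization; the function names and the cost of 100 units are hypothetical.

```python
def ec2_cost(fargate_cost_at_full_load, savings_at_full_load=0.20):
    """Fixed cost of a provisioned cluster, paid regardless of utilization.

    Assumes the cluster costs 20% less than the serverless equivalent
    when both run at full load (the figure quoted in the text).
    """
    return fargate_cost_at_full_load * (1.0 - savings_at_full_load)


def fargate_cost(fargate_cost_at_full_load, utilization):
    """Serverless cost under the simplifying assumption that it scales
    linearly with utilization (0.0 = idle, 1.0 = fully utilized)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return fargate_cost_at_full_load * utilization


def cheaper_option(utilization, fargate_cost_at_full_load=100.0,
                   savings_at_full_load=0.20):
    """Return which model is cheaper at the given utilization."""
    provisioned = ec2_cost(fargate_cost_at_full_load, savings_at_full_load)
    serverless = fargate_cost(fargate_cost_at_full_load, utilization)
    return "fargate" if serverless < provisioned else "ec2"


for utilization in (0.1, 0.5, 1.0):
    print(utilization, cheaper_option(utilization))
```

Under these assumptions the break-even point sits at roughly 80% utilization: below it the pay-per-use model wins, above it the provisioned cluster does. Real pricing is more nuanced, so the sketch is only meant to show the shape of the comparison.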
The key component of systems that would benefit from a serverless model is a requirement for flexibility in usage. Some workloads are consistent enough that it can be more cost-effective to provision server-based infrastructure; for example, if a web application has a consistent level of traffic, with little to no variance, the per-request cost of a server-based architecture will likely be lower than a serverless architecture.
However, even in cases like this, there are some advantages to serverless that may outweigh the moderate increase in cost:
While there are benefits and drawbacks to serverless depending on the system being designed, there are specific tasks within the document management system workflows that can leverage serverless components or managed services for a lower total cost of ownership over a server-based architecture.
As document management systems often include documents of varying confidentiality and differing ownership, it’s essential for a document management system to include authentication and authorization functionality.
In a server-based architecture, authentication often utilizes a database instance as well as the compute instance; in cases of federated logins, the database instance can be replaced by a reliance on an existing identity provider such as Microsoft Active Directory or Google Workspace.
By leveraging a managed authentication service like Amazon Cognito, a document management system can use the built-in authentication or a federated authentication, with no requirement to store user information within a specific database instance.
For authorization, a server-based architecture often implements a module within the application code. This adds some processing overhead to the application and the server instances that host it. A serverless API management service, like Amazon API Gateway, can be used to offload most of the authorization processing, whether through combining that service with a managed authentication service like Cognito, or by relying on internal cloud-based identity and access management such as AWS IAM.
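As an illustration of the kind of authorization module an application might otherwise carry itself, here is a minimal sketch of a group-based permission check. The role names, the permission sets, and the shape of the `claims` dictionary are all hypothetical; a managed service would supply its own token format and policy model.

```python
# Hypothetical mapping from user groups to permitted document actions.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}


def is_authorized(claims, action):
    """Return True if any of the user's groups permits the action.

    `claims` stands in for the decoded contents of an authentication
    token; the "groups" key is an assumption made for this example.
    """
    groups = claims.get("groups", [])
    return any(action in ROLE_PERMISSIONS.get(group, set())
               for group in groups)


editor = {"sub": "user-123", "groups": ["editor"]}
print(is_authorized(editor, "write"))   # editors may write
print(is_authorized(editor, "delete"))  # but may not delete
```

When this check runs in the API management layer instead of the application, requests that fail it never reach the application code at all.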
In a server-based architecture, the storage of documents can be handled through the use of a file server, or even by storing on a single application server, but scaling in either of these models can be challenging.
Using a managed object storage service like Amazon S3 removes any scaling challenges, and most include storage tiers for better cost efficiency for long-term infrequent-access storage.
Amazon S3 is generally more reliable than a local file server, due to its high fault tolerance, durability, and availability. Because S3 has no minimum storage requirements and offers intelligent tiering, the cost will generally be lower than the overhead of a local file server.
A server-based architecture may include a mail server for receiving documents via email, and may also include an application module to receive documents via an API. Scaling can be an important consideration for both of these methods for importing.
It’s possible to mitigate those scaling concerns through the use of a managed email service like Amazon SES and an API management service like Amazon API Gateway. In addition, the object storage service (e.g., Amazon S3) can also provide functionality to assist in importing objects, such as signed URLs for secure uploads, and a command-line interface (CLI) for uploading objects directly from a workstation or file server.
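The idea behind a signed upload URL can be sketched with the standard library alone. This is a conceptual illustration only: it is not the signing algorithm Amazon S3 uses, and the secret, the URL, the parameter names, and both function names are hypothetical. The point is that the URL carries an expiry time and a signature, so the receiving service can verify that the upload was authorized without the client ever holding long-lived credentials.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key, known only to the service


def sign_upload_url(base_url, key, expires_at, secret=SECRET):
    """Build an upload URL carrying an expiry time and an HMAC signature."""
    message = f"{key}\n{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"key": key, "expires": expires_at,
                       "signature": signature})
    return f"{base_url}?{query}"


def verify_upload_url(url, now, secret=SECRET):
    """Accept the URL only if it is unexpired and the signature matches."""
    params = parse_qs(urlparse(url).query)
    try:
        key = params["key"][0]
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    message = f"{key}\n{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(signature, expected)


url = sign_upload_url("https://example.invalid/upload",
                      "invoices/2023/a.pdf", expires_at=1000)
print(verify_upload_url(url, now=900))   # still valid
print(verify_upload_url(url, now=1001))  # expired
```

A tampered URL (for example, one where the object key has been changed) fails verification because the signature no longer matches the signed message.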
As the workloads for both a mail server and an API will be variable, the total cost will likely be lower for managed services vs. configuring and running an email and application server for import tasks.
A server-based architecture can include an OCR module, which would need to be designed to work with inconsistent workloads, possibly with queuing functionality.
A managed service like Amazon Textract allows for offloading of OCR processing, with other managed services, such as Amazon SQS, handling the queueing and orchestration. It's also possible to use a well-configured OCR module that runs on serverless compute with an open source OCR library such as Tesseract, which may cost less than Amazon Textract, though results and functionality will vary between the two OCR engines. For instance, Tesseract does not process PDF files, while Textract not only works with PDF, but also enables embedding the OCR results as a searchable layer within a modified version of the source PDF.
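The queue-and-worker pattern behind this kind of OCR pipeline can be sketched in-process. This is only an illustration under stated assumptions: Python's `queue.Queue` stands in for a managed queue service, threads stand in for serverless workers, and `fake_ocr` is a hypothetical stub rather than a call to Textract or Tesseract.

```python
import queue
import threading


def fake_ocr(document_id):
    """Hypothetical stand-in for a real OCR engine call."""
    return f"text of {document_id}"


def process_documents(document_ids, worker_count=4):
    """Queue documents and let a pool of workers OCR them concurrently."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            document_id = tasks.get()
            if document_id is None:  # sentinel: no more work
                tasks.task_done()
                break
            text = fake_ocr(document_id)
            with lock:  # protect the shared results dictionary
                results[document_id] = text
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for document_id in document_ids:
        tasks.put(document_id)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results


print(process_documents(["doc-1", "doc-2", "doc-3"]))
```

The queue absorbs bursts of incoming documents, and the number of workers can change without touching the code that submits work, which is the property a managed queue and serverless compute provide at much larger scale.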
It may also make sense to use other intelligent processing services such as Amazon Comprehend or Google Cloud's Document AI for specific use cases; by using serverless components such as a queue service or step functions, or even running customized Natural Language Processing and Large Language Models using containers on services like AWS Fargate, it's possible to queue OCR or other intelligent document processing tasks for a better combined result.
Document management systems require the storage of metadata for each document to assist in classification and search. For server-based systems, this would usually include a database instance.
A serverless architecture could involve a managed NoSQL database service like Amazon DynamoDB, which can store metadata in a flexible key-value model. This not only enables easier scaling than a server-based database cluster, but can also remove the need for data migration on system updates.
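The flexibility of a key-value metadata model can be shown with a small in-memory sketch. A plain dictionary stands in for the managed database here, and the attribute names (`documentId`, `contentType`, `invoiceNumber`, `department`) and both function names are hypothetical.

```python
# In-memory stand-in for a key-value metadata table, keyed by document ID.
metadata_table = {}


def put_metadata(document_id, **attributes):
    """Store arbitrary attributes for a document; no fixed schema."""
    metadata_table[document_id] = {"documentId": document_id, **attributes}


def find_by_attribute(name, value):
    """Return the IDs of documents whose attribute matches the value."""
    return sorted(document_id
                  for document_id, item in metadata_table.items()
                  if item.get(name) == value)


# Two documents with different attribute sets live side by side.
put_metadata("doc-1", contentType="application/pdf", invoiceNumber="A-100")
put_metadata("doc-2", contentType="image/png")
# A later system update introduces a new attribute; older items are untouched.
put_metadata("doc-3", contentType="application/pdf", department="finance")

print(find_by_attribute("contentType", "application/pdf"))
print(find_by_attribute("department", "finance"))
```

Because each item carries only the attributes it needs, introducing `department` required no change to the existing items, which is the sense in which a schema migration can be avoided.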
For more robust search, such as full-text search, it’s possible to experiment with a managed serverless relational database, like Amazon Aurora Serverless, or to interact with a server-based full-text search system like Elasticsearch or Amazon OpenSearch. While the full-text search functionality of Lucene/Elasticsearch/OpenSearch is currently not available as a serverless component, a serverless architecture can still leverage these server-based features, given an appropriate mechanism for having a server-based component as a dependency.
A server-based system may include a full-stack monolithic application that includes the application controller layer and the presentation of visual information, or it may split these responsibilities between an API/middleware and one or more front-end clients. This could involve one or more application server clusters, and the use of auto-scaling configurations could prevent most cases of failure due to an overload of requests.
While there is no reason why a server-based architecture cannot be used for a document management system, the importance of scalability for a DMS, as well as the specific use-cases of a DMS that are well-suited to managed services and serverless components, makes a serverless architecture a clear competitor, and in the case of a document management system hosted in a cloud provider like AWS, a low-risk and high-value choice.