A FormKiQ Whitepaper: Scalability in Document Management Systems

Comparing Server-Based and Serverless Architectures

As digitization and process improvement continue, boosted by globalization, remote work, and the general reduction of paper-based and manual processes, reliance on document management systems (DMS) increases in step. As the use of cloud-based applications and storage also grows, there is an opportunity to leverage cloud-native services to create document management systems that can keep pace with the growth of any organization.

What is scalability and why is it essential?

Scalability is the ability of a software solution to maintain functionality no matter the size of your data or the amount of processing required. This is not only important for large data sets and complex processes; it's essential that smaller data sets and simpler processes are also handled efficiently. In other words, a small business and a large business should both be able to make effective use of the solution, with no minimum level of size or complexity required for a scalable solution to be economical.

For document management systems, scalability matters for multiple reasons:

  • unlimited capacity is available for document storage
  • system performance, i.e., the organization and retrieval of documents, does not degrade with the growth in the number of documents that is expected over time
  • creating a replica of the live application for development or testing can be done at low cost
  • the system can not only scale up with growth but can also scale down if the organization contracts

How is scalability achieved?

There are two main methods to achieve scalability:

  1. Increase the capacity of the system, often through replicating the system, i.e., adding new database instances and/or processing instances
  2. Optimize the performance of the system, allowing more input, output, and processing within the same system capacity

While optimization is an important part of all software development, the most effective way to achieve scalability consistently is by increasing capacity, and that is often achieved by removing any barriers that prevent or restrict that capacity increase. This is commonly done through cloud-based software, which avoids two major logistical barriers: the need to employ a hardware infrastructure team and the need to source compute and networking hardware for capacity increases.

In addition, greater efficiency in scaling capacity can be found by utilizing serverless computing in your solution design.

What is serverless computing?

Serverless computing is a cloud-native architecture that allows developers to build and run applications without having to manage servers.

The term "serverless computing" can be considered imprecise, as the software does still run on servers; the term relates to the fact that the servers are abstracted from the software, and that any configuration and maintenance, and even the number and types of servers, are managed by the cloud provider to meet real-time demand. This abstraction not only simplifies the infrastructure management required for the functioning of the software, but also provides a real-time cost model for computer processing, where the customer is only charged for the amount of time the code is being executed.
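
The pay-per-use cost model described above can be sketched as simple arithmetic. This is a minimal illustration rather than a pricing calculator: the function name and both rates are hypothetical placeholders, not actual provider prices.

```python
def pay_per_use_cost(invocations, avg_seconds, memory_gb,
                     rate_per_gb_second, rate_per_request):
    """Estimate a pay-per-use compute bill: charges accrue only while code runs."""
    compute = invocations * avg_seconds * memory_gb * rate_per_gb_second
    requests = invocations * rate_per_request
    return compute + requests


# Hypothetical rates, for illustration only (not actual provider pricing).
RATE_PER_GB_SECOND = 0.00002
RATE_PER_REQUEST = 0.0000002

busy_month = pay_per_use_cost(1_000_000, 0.5, 0.5, RATE_PER_GB_SECOND, RATE_PER_REQUEST)
idle_month = pay_per_use_cost(0, 0.5, 0.5, RATE_PER_GB_SECOND, RATE_PER_REQUEST)
print(f"busy month: ${busy_month:.2f}, idle month: ${idle_month:.2f}")
```

Under this model, a month with no invocations costs nothing for compute, whereas an always-on server would be billed for the whole month.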

How does serverless compare to server-based?

In a traditional server-based architecture, the software provider configures and maintains a specific number of always-on servers according to predetermined specifications. If demand increases, either more servers are required to be added to the pool of available servers, or the servers themselves need to be upgraded by the addition of more memory, additional CPUs, an increase of storage space, or a combination of all three. If demand wanes, whether through long-term contraction or just due to an overnight, weekend, or holiday period, the servers would need to be reduced in number or specifically downgraded in order to prevent wasting computing power.

While it's possible to use automation to scale servers, and while cloud providers can allow autoscaling by providing additional servers on demand, there are limitations to the minimum and maximum size of servers, and it's ultimately the customer's responsibility to ensure that they have mechanisms in place to scale up and down as needed.
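
A scaling rule with minimum and maximum bounds can be sketched as follows. This is a simplified illustration of the idea, not the algorithm of any particular autoscaling service; the function name, the per-server capacity, and the bounds are all assumptions chosen for the example.

```python
import math


def desired_servers(demand_rps, capacity_per_server_rps, min_servers, max_servers):
    """Return how many servers to run, clamped to the configured bounds."""
    needed = math.ceil(demand_rps / capacity_per_server_rps)
    return max(min_servers, min(needed, max_servers))


# Hypothetical setup: each server handles 100 requests per second,
# and the group is configured to run between 2 and 10 servers.
for demand in (0, 450, 5000):
    servers = desired_servers(demand, 100, min_servers=2, max_servers=10)
    print(demand, "req/s ->", servers, "servers")
```

With these example numbers, zero overnight demand still keeps the minimum of 2 servers running, and a spike to 5000 requests per second is capped at 10 servers even though 50 would be needed. Choosing and maintaining those bounds remains the customer's responsibility.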

In addition, traditional server-based architecture often requires routine tasks, such as managing the operating system and file system, keeping up with security patches, and setting up and maintaining logging and monitoring. Cloud providers can abstract some of these tasks, such as logging for Amazon EC2 through CloudWatch or file handling through Amazon Elastic File System (EFS), but some tasks will always be the responsibility of the customer.

How serverless is being used

Many software solutions combine server-based and serverless components; AWS customers often make use of S3 to store files, AWS Lambda functions to run tasks, DynamoDB for a NoSQL database, and CloudFront as a content delivery network. All of these components are part of AWS' serverless offerings, where they fulfill workflows without any requirement to maintain and configure servers or to set up scaling automation, i.e., autoscaling.

It's common to see static websites or JavaScript-based client applications that are stored in S3 and served by CloudFront; in many cases, these sites or applications may have some limited back-end functionality handled by AWS Lambda and API Gateway, for instance when handling a contact form.

For document management systems, expanding the scope and responsibilities of serverless components allows for better scalability by reducing the number of components that require server configuration and scaling mechanisms. While some functionality may not be possible with serverless components, such as providing full-text search using Amazon OpenSearch, using serverless whenever possible significantly reduces the restrictions on scaling.

How does serverless affect cost?

Serverless is not free, though in the case of smaller workloads, serverless can often be run almost entirely within the free tier provided by the major cloud providers. For some workflows, on-demand serverless can be more expensive than a stable server-based workflow, particularly when no cost optimizations have been performed. In the case of AWS, there are upfront cost commitments that can be made for processing workflows, such as DynamoDB Provisioned Capacity and Compute Savings Plans. There are also optimizations available for storage, specifically storage tiers (and intelligent tiering, when available) for products such as S3 and DynamoDB.

Where serverless excels is in dynamic scaling, which is exactly how it achieves its scalability. In the case of storage, this means that your S3 or DynamoDB storage will not run out of space, though your costs will grow as your storage increases. In the case of compute services such as AWS Fargate or AWS Lambda, it's possible to run hundreds or even thousands of tasks concurrently. As Fargate tasks can use 4 virtual CPUs and up to 30GB of memory each, this level of compute concurrency should be able to meet the needs of the vast majority of workloads.

Where server-based excels is when your workloads use the majority of your server-based capacity. As this theoretical study from AWS indicates when comparing EC2 to Fargate for AWS ECS, if your workload is able to remain at near-full utilization of your provisioned EC2 servers, you will see some savings over serverless Fargate; the study estimated the savings of an ECS cluster of fully utilized EC2 instances at 20% over the same cluster using AWS Fargate. But in cases where the ECS cluster has little to no utilization, Fargate can be up to 87% cheaper than EC2.
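
The trade-off between the two models can be made concrete with a rough break-even calculation. The sketch below takes only the 20% full-utilization figure from the comparison above and adds a simplifying assumption of its own: that Fargate cost scales linearly with utilization while the cost of a provisioned EC2 cluster stays fixed. The function names are hypothetical, and the model is not the method used in the AWS study.

```python
def fargate_cost_relative_to_ec2(utilization, ec2_savings_at_full=0.20):
    """Fargate cost as a multiple of a fixed EC2 cluster cost.

    Assumption (simplification): Fargate cost scales linearly with
    utilization, while the provisioned EC2 cost is constant.
    """
    ec2_cost = 1.0 - ec2_savings_at_full  # relative to Fargate at 100% utilization
    fargate_cost = utilization            # pay only for the capacity actually used
    return fargate_cost / ec2_cost


def break_even_utilization(ec2_savings_at_full=0.20):
    """Utilization at which both options cost the same under this model."""
    return 1.0 - ec2_savings_at_full


print(f"break-even utilization: {break_even_utilization():.0%}")
for u in (1.0, 0.8, 0.5, 0.25):
    ratio = fargate_cost_relative_to_ec2(u)
    print(f"utilization {u:.0%}: Fargate costs {ratio:.2f}x the EC2 cluster")
```

Under these assumptions, the server-based cluster is only cheaper while utilization stays above the break-even point; below it, the pay-per-use option costs less.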

Flexibility: considering serverless despite a higher usage cost

The key component of systems that would benefit from a serverless model is a requirement for flexibility in usage. Some workloads are consistent enough that it can be more cost-effective to provision server-based infrastructure; for example, if a web application has a consistent level of traffic, with little to no variance, the per-request cost of a server-based architecture will likely be lower than a serverless architecture.

However, even in cases like this, there are some advantages to serverless that may outweigh the moderate increase in cost:

  • even with relative consistency in workload, a server-based system carries a higher risk of failure during any large spike in capacity needs
  • some server-based components still grow in size despite a consistent level of usage, such as log files, databases, or cache stores, and when hard limits are reached, the risk of failure increases
  • it can be cost-prohibitive to replicate server-based systems for non-prod environments, depending on the components required
  • each server-based component requires networking and security configurations that are generally more complex than connecting together various serverless components and managed services

Looking at serverless for specific document management system tasks

While there are benefits and drawbacks to serverless depending on the system being designed, there are specific tasks within the document management system workflows that can leverage serverless components or managed services for a lower total cost of ownership over a server-based architecture.

Authentication and authorization

As document management systems often include documents of varying confidentiality and differing ownership, it’s essential for a document management system to include authentication and authorization functionality.

In a server-based architecture, authentication often utilizes a database instance as well as the compute instance; in cases of federated logins, the database instance can be replaced by a reliance on an existing identity provider such as Microsoft Active Directory or Google Workspace.

By leveraging a managed authentication service like Amazon Cognito, a document management system can use the built-in authentication or a federated authentication, with no requirement to store user information within a specific database instance.

For authorization, a server-based architecture often implements a module within the application code. This adds some processing overhead to the application and the server instances that host it. A serverless API management service, like Amazon API Gateway, can be used to offload most of the authorization processing, whether through combining that service with a managed authentication service like Cognito, or by relying on internal cloud-based identity and access management such as AWS IAM.

Document storage

In a server-based architecture, the storage of documents can be handled through the use of a file server, or even by storing on a single application server, but scaling in either of these models can be challenging.

Using a managed object storage service like Amazon S3 removes most scaling challenges, and most such services include storage tiers that improve cost efficiency for long-term, infrequently accessed storage.

Amazon S3 is generally more reliable than a local file server, due to its high durability, fault tolerance, and availability. Because S3 has no minimum storage requirements and offers intelligent tiering, the cost will generally be lower than the overhead of a local file server.

Document import

A server-based architecture may include a mail server for receiving documents via email, and may also include an application module to receive documents via an API. Scaling can be an important consideration for both of these methods for importing.

It’s possible to mitigate those scaling concerns through the use of a managed email service like Amazon SES and an API management service like Amazon API Gateway. In addition, the object storage service (e.g., Amazon S3) can also provide functionality to assist in importing objects, such as signed URLs for secure uploads, and a command-line interface (CLI) for uploading objects directly from a workstation or file server.
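
The general idea behind a signed upload URL can be sketched with a keyed hash. This is a simplified illustration only: it does not reproduce the signing scheme Amazon S3 actually uses, and in practice the provider's SDK generates these URLs. The secret, the function names, and the example.com host are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"hypothetical-server-side-secret"  # never sent to the client


def _signature(object_key, expires):
    """Sign the allowed action, the object key, and the expiry time together."""
    message = f"PUT\n{object_key}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_upload_url(base_url, object_key, ttl_seconds=300, now=None):
    """Build a time-limited upload URL for one specific object key."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(object_key, expires)})
    return f"{base_url}/{object_key}?{query}"


def is_valid_upload(object_key, expires, signature, now=None):
    """Check the expiry and the signature before accepting an upload."""
    current = now if now is not None else time.time()
    if current > expires:
        return False
    return hmac.compare_digest(signature, _signature(object_key, expires))


url = sign_upload_url("https://uploads.example.com", "invoices/2024-001.pdf", now=1_000)
params = parse_qs(urlparse(url).query)
expires, signature = int(params["expires"][0]), params["signature"][0]
print(is_valid_upload("invoices/2024-001.pdf", expires, signature, now=1_100))
```

Because the signature covers both the object key and the expiry time, the client can upload one specific document for a limited period without ever holding long-lived credentials.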

As the workloads for both a mail server and an API will be variable, the total cost will likely be lower for managed services vs. configuring and running an email and application server for import tasks.

Optical character recognition and intelligent document processing

A server-based architecture can include an OCR module, which would need to be designed to work with inconsistent workloads.

A managed service like Amazon Textract allows OCR processing to be offloaded. Alternatively, a well-configured OCR module that runs on serverless compute with an OCR library such as Tesseract may be able to run OCR at a lower cost than Amazon Textract, though results may vary between the two OCR engines.

It may also make sense to use other intelligent processing services such as Amazon Comprehend or Google Cloud’s Document AI for specific use cases; by using serverless components such as a queue service or step functions, it’s possible to queue OCR or other document processing tasks for a better combined result.
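
Queuing documents for multi-step processing can be sketched with an in-process queue. This is a minimal illustration using Python's standard library in place of a managed queue service; run_ocr and classify are hypothetical stand-ins for calls to an OCR engine and a follow-up processing service.

```python
import queue


def run_ocr(document_id):
    """Hypothetical stand-in for an OCR call; returns the extracted text."""
    return f"text of {document_id}"


def classify(text):
    """Hypothetical stand-in for a later processing step."""
    return "invoice" if "invoice" in text else "other"


def process_queue(tasks):
    """Drain the queue, running each document through both steps in order."""
    results = {}
    while not tasks.empty():
        document_id = tasks.get()
        text = run_ocr(document_id)
        results[document_id] = {"text": text, "category": classify(text)}
        tasks.task_done()
    return results


tasks = queue.Queue()
for document_id in ("invoice-001.pdf", "memo-17.pdf"):
    tasks.put(document_id)

results = process_queue(tasks)
print(results)
```

The queue decouples document intake from processing, so a burst of uploads simply lengthens the queue instead of overloading the OCR step.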

Document search

Document management systems require the storage of metadata for each document to assist in classification and search. For server-based systems, this would usually include a database instance.

A serverless architecture could involve a managed NoSQL database service like Amazon DynamoDB, which stores metadata in a flexible key-value model. This not only enables easier scaling than a server-based database cluster, but can also remove the need for data migration on system updates.
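
The flexibility of a key-value metadata model can be sketched with a plain dictionary. This is an in-memory stand-in rather than DynamoDB itself, and the table, function, and attribute names are hypothetical.

```python
# An in-memory stand-in for a key-value metadata table (hypothetical names).
metadata_table = {}


def put_metadata(document_id, **attributes):
    """Store any set of attributes under the document's key."""
    metadata_table.setdefault(document_id, {}).update(attributes)


def find_documents(**criteria):
    """Return the IDs of documents whose metadata matches every criterion."""
    return sorted(
        doc_id
        for doc_id, attrs in metadata_table.items()
        if all(attrs.get(key) == value for key, value in criteria.items())
    )


put_metadata("doc-1", content_type="application/pdf", department="finance")
# A later release starts recording a new attribute; older items need no migration.
put_metadata("doc-2", content_type="application/pdf", retention_years=7)

print(find_documents(content_type="application/pdf"))
print(find_documents(department="finance"))
```

Each item carries only the attributes it needs, so adding a new attribute does not require altering existing records. A real table would typically be queried through keys and indexes rather than by scanning every item as this sketch does.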

For more robust search, such as full-text search, it’s possible to experiment with a managed serverless relational database, like Amazon Aurora Serverless, or to interact with a server-based full-text search system like Elasticsearch or Amazon OpenSearch. While the full-text search functionality of Lucene/Elasticsearch/OpenSearch is currently not available as a serverless component, a serverless architecture can still leverage these server-based features, provided there is an appropriate mechanism for handling a server-based component as a dependency.

Client interface

A server-based system may include a full-stack monolithic application that includes the application controller layer and the presentation of visual information, or it may split these responsibilities between an API/middleware and one or more front-end clients. This could involve one or more application server clusters, and the use of auto-scaling configurations could prevent most cases of failure due to an overload of requests.

A serverless system would likely include separation between the API and the client, though that is not guaranteed, and could make use of a managed API service like Amazon API Gateway for the API, while using a managed object storage like Amazon S3 and a CDN like Amazon CloudFront to service a static front-end client. This client could use a JavaScript client framework like React or Angular to interact with the API, without requiring an application server instance.

Conclusion

While there is no reason why a server-based architecture cannot be used for a document management system, the importance of scalability for a DMS, together with the specific DMS use cases that are well-suited to managed services and serverless components, makes a serverless architecture a clear competitor. For a document management system hosted with a cloud provider like AWS, it is a low-risk and high-value choice.
