What We've Learned So Far with Our Journey into AI for Document Management

Our findings from the post-ChatGPT world

by Regan Wolfrom

We often see rocket ship imagery with startups (and we’ve used it a few times ourselves), where the ideal path is up, up, and away, while a just-as-common trajectory is up-down-EXPLODE.

Maybe this rocket will make it...

I think of what AI’s imagery should be. It’s not a robot holding a laptop, which is a common result on image searches. Based on what we’ve learned so far at FormKiQ, AI has been like riding a high-speed train high up into the Himalayas.

There are some dips, and plenty of dangerous curves, and eventually AI may climb so high that we may not be able to breathe when we reach the end of the journey.

But that’s not the biggest (or only) takeaway we’ve had since November 30th, 2022, when ChatGPT was released. We’ve learned quite a bit, and while some of it has turned our pre-ChatGPT AI strategy on its head, other parts feel like maybe the train track is still heading in the same direction.

All aboard for the AI Express

Ownership, Confidentiality, and the Big Data Cow-Catcher

While people are often more impressed than bothered by ChatGPT, for more visual generative AI such as Midjourney, there are obvious concerns about where the model’s training data came from. When a generated image has a distorted version of the Getty Images watermark, it’s not hard to wonder how much of what you’re using is taken from other creators. Hint: basically all of it.

Since Artificial Intelligence is not Artificial General Intelligence, i.e., not an artificial reflection of the human brain, the generated results from any AI model are just a gumbo (or paella) of the data that was used to train it.

Some tech philosophers might claim, echoing the line often attributed to Picasso, that good models copy and great models steal, and that it’s no different from what humans themselves are doing as they trace over their favorite artworks or write stirring passages of Melrose Place fan fiction. That may be an argument for the future, when artificial minds actually think, but for now it’s not a compelling one for a Reddit thread or a court of law.

But beyond the concepts of plagiarism and fair use and credit to creators, there is the issue of security. While specific AI companies may provide assurances that data submitted to an AI API for the purpose of generating things like summaries, translations, etc., will be kept private, many companies are still reluctant to share proprietary information via an API, and there may also be regulatory and compliance risks when it comes to the personal information of customers and employees.

In other words: does the AI Cowcatcher Catch it All?

This has led to a murky understanding of just which AI services are safe to use with confidential information and which are not, and that means that for most organizations, an overabundance of caution is warranted.

That could mean trying to find models that can be run on personal computers or on-prem data centers, or it could just mean ensuring that the models are used within cloud accounts within that organization’s control.

For most, it means that just paying for access and sending confidential data over an API request to OpenAI or other AI APIs is not a top choice. Instead, the big cloud providers are providing a walled garden approach, where the model is brought into the cloud account and data that is included in AI prompts never leaves the yard.

For FormKiQ, that means that we are looking at offering the OpenAI API using a bring-your-own-key model, while also working on using both Amazon SageMaker and Amazon Comprehend for more guarded data, using models from providers like Cohere.

One possible workflow would be to use Amazon Comprehend to remove Personally Identifiable Information (PII), at which point data that is not considered proprietary IP could be sent to an external AI API.
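As a rough illustration of that first workflow, here is a minimal Python sketch, not FormKiQ's actual implementation. It assumes entities shaped like the output of Amazon Comprehend's DetectPiiEntities operation (dicts with "Type", "BeginOffset", "EndOffset", and "Score"). The helper names redact_pii and detect_pii, the 0.8 score threshold, and the sample entities are all made up for this example.

```python
from typing import Dict, List


def redact_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected PII span with a [TYPE] placeholder.

    `entities` is assumed to follow the shape of a Comprehend
    DetectPiiEntities response: "Type", "BeginOffset", "EndOffset".
    """
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


def detect_pii(text: str, min_score: float = 0.8) -> List[Dict]:
    """Ask Amazon Comprehend for PII entities (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so redact_pii works without AWS access

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode="en")
    # The threshold is an arbitrary choice for this sketch.
    return [e for e in response["Entities"] if e["Score"] >= min_score]


if __name__ == "__main__":
    sample = "Contact Jane Doe at jane@example.com."
    # Hand-written entities standing in for a real Comprehend response.
    fake_entities = [
        {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16, "Score": 0.99},
        {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36, "Score": 0.99},
    ]
    print(redact_pii(sample, fake_entities))
```

Only the redacted text would then be passed along to an external AI API, and only if the remaining content is not considered proprietary.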

Another workflow would be to stay completely within the AWS account, using Amazon SageMaker with a preloaded AI model.

A third workflow would be to refine models and/or train new models, all within a cloud account under the customer’s control. An example of this, i.e., why you’d bother training your own model, would be to create a customized document classification model based on a set variety of documents that are highly specific to the organization’s workflows, where a more generic document classification model would not return results that are granular enough.

In essence, your actual AI workflow will depend on your specific needs, and it’s important for platforms like FormKiQ (and AWS) to provide enough flexibility to meet those needs.
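To make the choice between these three workflows a little more concrete, here is a small Python sketch of how a routing decision might look. It is only an illustration under simplified assumptions: the DocumentProfile fields, the Workflow names, and the order of the checks are hypothetical, and a real policy would weigh many more factors (regulation, cost, data residency).

```python
from dataclasses import dataclass
from enum import Enum


class Workflow(Enum):
    REDACT_THEN_EXTERNAL_API = "remove PII, then call an external AI API"
    IN_ACCOUNT_MODEL = "use a preloaded model inside the cloud account"
    CUSTOM_TRAINED_MODEL = "train or refine a model inside the cloud account"


@dataclass
class DocumentProfile:
    # Both flags are hypothetical inputs an organization would set for itself.
    contains_proprietary_ip: bool
    needs_custom_classes: bool


def choose_workflow(profile: DocumentProfile) -> Workflow:
    """Pick one of the three workflows described above."""
    # Highly specific classification needs call for a custom model.
    if profile.needs_custom_classes:
        return Workflow.CUSTOM_TRAINED_MODEL
    # Proprietary content should not leave the account at all.
    if profile.contains_proprietary_ip:
        return Workflow.IN_ACCOUNT_MODEL
    # Otherwise, PII removal followed by an external API may be acceptable.
    return Workflow.REDACT_THEN_EXTERNAL_API


if __name__ == "__main__":
    profile = DocumentProfile(contains_proprietary_ip=True, needs_custom_classes=False)
    print(choose_workflow(profile).value)
```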

The Open Source vs. Commercial Debate Rages On

Open Source vs. Commercial AI Models: the Age-Old Public/Private Infrastructure Debate

We recently completed a project with CANARIE, the Canadian Network for the Advancement of Research, Industry, and Education, that looked at document classification built entirely with free and open source components. The end result is a DAIR BoosterPack, a free, curated package of resources on a specific emerging technology, and in our case, it leverages what was the current state of Open Source AI as of Q1 2023.

NOTE: we are presenting a webinar on Wednesday, July 26th, 2023 at 12pm EDT / 9am PDT that will walk through this FormKiQ Automated Document Classification and Discovery BoosterPack. If that sounds interesting, you can register to attend.

What we learned is that at the time, the open source models could not compete with OpenAI. The results were inconsistent in quality, with the same prompt producing wildly different responses of wildly different accuracy.

Here are the models we used:

One interesting situation came from using the Document Image Transformer from Microsoft; it was really good at determining the document type, assuming the document type was within the specific document types that were included in the training (some academic, some business-oriented), but oftentimes documents in other areas, such as legal documents, were classified as some form of scientific literature.

What is convenient about OpenAI’s large language models is that they can perform many different kinds of tasks, with a very robust response to prompts. For instance, FormKiQ’s Document Tagging action can not only ask OpenAI to determine specified tags based on document content, it can also specify the return format and even the specific notation used for key names, e.g., we can ask for keys to be named using camel case (“camelCase”) or we can ask for snake case (“snake_case”).
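As a sketch of what such a request could look like, the Python below builds a tagging prompt that names the tags, the return format, and the key notation, then normalizes the keys of the JSON that comes back. This is not FormKiQ's Document Tagging code. The function names, the prompt wording, and the sample response are assumptions for illustration, and the call to the model itself is left out.

```python
import json
import re
from typing import Dict, List


def build_tagging_prompt(content: str, tags: List[str], key_style: str = "camelCase") -> str:
    """Assemble a prompt that fixes the tags, the format, and the key notation."""
    tag_list = ", ".join(tags)
    return (
        f"Extract the following tags from the document: {tag_list}.\n"
        f"Respond with a single JSON object only, with keys written in {key_style}.\n\n"
        f"Document:\n{content}"
    )


def to_snake_case(name: str) -> str:
    # Split camelCase boundaries, then turn spaces and hyphens into underscores.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return re.sub(r"[\s\-]+", "_", name).lower()


def to_camel_case(name: str) -> str:
    parts = to_snake_case(name).split("_")
    return parts[0] + "".join(p.capitalize() for p in parts[1:])


def parse_tags(response_text: str, key_style: str = "camelCase") -> Dict[str, str]:
    """Parse the model's JSON reply and enforce the requested key notation."""
    convert = to_camel_case if key_style == "camelCase" else to_snake_case
    return {convert(key): value for key, value in json.loads(response_text).items()}


if __name__ == "__main__":
    print(build_tagging_prompt("Invoice INV-42 ...", ["invoice number", "document type"]))
    # A made-up reply with inconsistent key styles, as a model might return.
    reply = '{"invoice_number": "INV-42", "documentType": "invoice"}'
    print(parse_tags(reply, key_style="snake_case"))
```

Normalizing the keys on the client side is a defensive choice: a model will usually follow the requested notation, but the parser does not have to trust that it always will.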

That was the case in Q1 of 2023. I don’t expect this to be the same finding we’d make if we tried this all over again in Q1 of 2024. As the technology around models and transformers advances and the cost of creating new models declines (which is likely despite the increased demand for GPUs), we do expect that open source models will continually advance to near-parity with commercial models.

I say near-parity, because it’s not yet clear that open source can meet or exceed the models created by OpenAI and Google. It’s definitely possible, but for now, we’re hedging our bets and assuming that both commercial and open source models will be key components for AI strategies for a good while.

Riding together in style

Efficiencies of Scale: Most of these Rail Cars are Headed in the Same Direction

Ultimately, while there will be variations in AI strategy across industries, geographies, and business strategies, the result will be a fairly small variety of building blocks, along with the flexibility to connect these pieces together.

As platforms like FormKiQ develop new components and the API infrastructure to access those components, I believe the end result will be solutions where organizations will have the ability to choose specific AI items from the buffet table.

It will be the platforms themselves that will be in charge of tracking new developments and providing their best tooling and counsel to their customers, who can then grab a new plate and sample the latest offerings, knowing that their data will be safeguarded as required.

For more information on how you can leverage these new AI components, contact us or schedule a consultation call.