How to Consistently Extract Metadata from Complex Documents


Documents often contain vast amounts of important information. However, this information is, in many cases, buried deep inside the contents of the documents, and is thus hard to use for downstream tasks. In this article, I'll discuss how to consistently extract metadata from your documents, covering the main approaches to metadata extraction and the challenges you'll face along the way.

This article is a high-level overview of metadata extraction from documents, highlighting the different considerations you have to make along the way.

This infographic highlights the main contents of this article. I'll first discuss why we need to extract document metadata and how it's useful for downstream tasks. I'll then discuss approaches to extracting metadata: regex, OCR + LLM, and vision LLMs. Lastly, I'll discuss different challenges you face when performing metadata extraction, such as choosing between OCR + LLM and vision LLMs, handwritten text, and dealing with long documents. Image by ChatGPT.

Why extract document metadata

First, it's important to clarify why we need to extract metadata from documents at all. After all, if the information is already present in the documents, can we not simply find it using RAG or similar approaches?

In many cases, RAG would indeed be able to find specific data points, but pre-extracting metadata simplifies a lot of downstream tasks. Using metadata, you can, for example, filter your documents based on fields such as the following (a short filtering sketch follows the list):

  • Document type
  • Addresses
  • Dates
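
To make this concrete, here's a minimal sketch of what filtering on pre-extracted metadata could look like. The document records and field names (`doc_type`, `doc_date`) are purely illustrative, not a specific schema.

```python
from datetime import date

# Illustrative document records with pre-extracted metadata (hypothetical schema).
documents = [
    {"id": "doc-1", "doc_type": "lease_agreement", "doc_date": date(2023, 5, 2)},
    {"id": "doc-2", "doc_type": "invoice", "doc_date": date(2024, 1, 15)},
    {"id": "doc-3", "doc_type": "lease_agreement", "doc_date": date(2024, 3, 9)},
]

# Filter on the metadata fields instead of searching the full document text.
leases_from_2024 = [
    doc for doc in documents
    if doc["doc_type"] == "lease_agreement" and doc["doc_date"].year == 2024
]
print([doc["id"] for doc in leases_from_2024])  # -> ['doc-3']
```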

Moreover, if you have a RAG system in place, it will in many cases benefit from additionally provided metadata. This is because you present the extra information (the metadata) more explicitly to the LLM. For example, suppose you ask a question related to dates. In that case, it's easier to simply provide the pre-extracted document dates to the model, instead of having the model extract the dates at inference time. This saves on both cost and latency, and is likely to improve the quality of your RAG responses.

How to extract metadata

I'll highlight three main approaches to extracting metadata, going from simplest to most complex:

  • Regex
  • OCR + LLM
  • Vision LLMs

This image highlights the three main approaches to extracting metadata. The simplest approach is regex, though it doesn't work in many situations. A more powerful approach is OCR + LLM, which works well in most cases, but falls short in situations where you depend on visual information. If visual information is important, you can use vision LLMs, the most powerful approach. Image by ChatGPT.

Regex

Regex is the simplest and most consistent approach to extracting metadata. It works well if you know the exact format of the data beforehand. For example, if you're processing lease agreements, and you know the date is always written as dd.mm.yyyy, right after the words “Date: ”, then regex is the way to go.
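
As a minimal sketch of that scenario, the snippet below pulls out a date that always follows “Date: ” in dd.mm.yyyy format. The sample text is made up for illustration.

```python
import re

# Works only because we know the exact format: "Date: " followed by dd.mm.yyyy.
text = "Lease Agreement\nDate: 01.09.2024\nTenant: ..."

match = re.search(r"Date:\s*(\d{2})\.(\d{2})\.(\d{4})", text)
if match:
    day, month, year = match.groups()
    print(f"{year}-{month}-{day}")  # -> 2024-09-01
```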

Unfortunately, most document processing is more complex than this. You'll have to deal with inconsistent documents, with challenges like:

  • Dates are written in different places in the document
  • The text is missing some characters due to poor OCR
  • Dates are written in different formats (e.g., mm.dd.yyyy, 22nd of October, December 22, etc.)

For this reason, we often have to move on to more complex approaches, like OCR + LLM, which I'll describe in the next section.

OCR + LLM

A more powerful approach to extracting metadata is to use OCR + LLM. This process starts with applying OCR to a document to extract its text contents. You then take the OCR-ed text and prompt an LLM to extract the date from the document. This often works remarkably well, because LLMs are good at understanding context (which dates are relevant and which are not), and can understand dates written in all sorts of formats. LLMs will, in many cases, also be able to handle both the European (dd.mm.yyyy) and American (mm.dd.yyyy) date conventions.
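
As a rough sketch of this pipeline, the example below OCRs a page with pytesseract and then asks an OpenAI-style chat model for the date. The file name, model choice, and prompt are assumptions for illustration, not recommendations.

```python
import pytesseract
from PIL import Image
from openai import OpenAI

# Step 1: OCR the scanned page into plain text.
page_text = pytesseract.image_to_string(Image.open("lease_page_1.png"))

# Step 2: Prompt an LLM to pick the relevant date out of the OCR-ed text.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Extract the agreement date as YYYY-MM-DD. Reply with the date only."},
        {"role": "user", "content": page_text},
    ],
)
print(response.choices[0].message.content)
```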

This figure shows the OCR + LLM approach. On the right side, you see that we first perform OCR on the document, which extracts the document text. We can then prompt the LLM to read that text and extract a date from the document. The LLM then outputs the extracted date. Image by the author.

However, in some scenarios, the metadata you want to extract requires visual information. In those scenarios, you need to apply the most advanced technique: vision LLMs.

Vision LLMs

Using vision LLMs is the most complex approach, with both the highest latency and the highest cost. In most scenarios, running vision LLMs will be far more expensive than running purely text-based LLMs.

When running vision LLMs, you usually have to ensure the images have high resolution, so the vision LLM can read the text in the documents. This requires a lot of visual tokens, which makes the processing expensive. However, vision LLMs with high-resolution images will often be able to extract complex information that OCR + LLM cannot, for example, the information shown in the image below.

This image highlights a task where you need to use vision LLMs. If you OCR this image, you'll be able to extract the words “Document 1, Document 2, Document 3,” but the OCR will completely miss the filled-in checkbox. This is because OCR is trained to extract characters, not figures like a checkbox with a circle in it. Trying to use OCR + LLM will thus fail in this scenario. However, if you instead use a vision LLM on this problem, it will easily be able to extract which document is checked off. Image by the author.
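
As a rough sketch of how you could pose the checkbox question to a vision-capable model, the snippet below sends the page image alongside a text prompt. The file name and model choice are assumptions for illustration.

```python
import base64
from openai import OpenAI

# Encode the scanned page so it can be sent inline to a vision-capable model.
with open("checkbox_form.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which document is checked off on this form? Reply with the document name only."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```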

Vision LLMs also work well in scenarios with handwritten text, where OCR might struggle.

Challenges when extracting metadata

As I pointed out earlier, documents are complex and come in various formats. There are thus a lot of challenges you have to deal with when extracting metadata from documents. I'll highlight three of the biggest ones:

  • When to use vision LLMs vs OCR + LLM
  • Dealing with handwritten text
  • Dealing with long documents

When to use vision LLMs vs OCR + LLM

Ideally, we would use vision LLMs for all metadata extraction. However, this is usually not feasible due to the cost of running vision LLMs. We thus have to decide when to use vision LLMs and when to use OCR + LLM.

One thing you can do is decide whether the metadata point you want to extract requires visual information or not. If it's a date, OCR + LLM will work well in almost all scenarios. However, if you know you're dealing with checkboxes like in the example task I discussed above, you need to apply vision LLMs.
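
One simple way to encode that decision is a per-field routing table. The field names below are hypothetical examples, not a fixed taxonomy.

```python
# Hypothetical routing: fields that depend on visual cues (checkboxes, stamps,
# signatures) go to a vision LLM; everything else goes to OCR + LLM.
FIELDS_NEEDING_VISION = {"checked_document", "signature_present", "stamp_present"}

def choose_extractor(field_name: str) -> str:
    return "vision_llm" if field_name in FIELDS_NEEDING_VISION else "ocr_plus_llm"

print(choose_extractor("doc_date"))          # -> ocr_plus_llm
print(choose_extractor("checked_document"))  # -> vision_llm
```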

Dealing with handwritten text

One issue with the approach mentioned above is that some documents might contain handwritten text, which traditional OCR is not particularly good at extracting. If your OCR is poor, the LLM extracting metadata will also perform poorly. Thus, if you know you're dealing with handwritten text, I recommend applying vision LLMs, as they're far better at handling handwriting, based on my own experience. It's important to keep in mind that many documents will contain both born-digital text and handwriting.

Dealing with long documents

In many cases, you'll also have to deal with extremely long documents. In that case, you have to consider how far into the document a metadata point might appear.

The reason this is a consideration is that you want to minimize cost, and if you need to process extremely long documents, you need a lot of input tokens for your LLMs, which is expensive. Usually, the important piece of information (the date, for example) will appear early in the document, in which case you won't need many input tokens. In other situations, however, the relevant piece of information might be on page 94, in which case you need a lot of input tokens.

The difficulty, of course, is that you don't know beforehand which page the metadata is on. Thus, you essentially have to make a trade-off, like only processing the first 100 pages of a given document, and assuming the metadata is available within those first 100 pages for almost all documents. You'll miss a data point on the rare occasion where the information is on page 101 or later, but you'll save substantially on cost.
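
A minimal sketch of that trade-off, assuming you already have the document split into one text string per page, could look like this. The 100-page cutoff mirrors the example above and should be tuned to your own documents.

```python
MAX_PAGES = 100  # cutoff from the example above; tune this to your own documents

def text_for_extraction(pages: list[str], max_pages: int = MAX_PAGES) -> str:
    """Join only the first max_pages pages to keep LLM input tokens (and cost) bounded."""
    return "\n\n".join(pages[:max_pages])

# pages = ocr_document("very_long_contract.pdf")  # hypothetical helper returning one string per page
# prompt_text = text_for_extraction(pages)
```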

Conclusion

In this article, I've discussed how you can consistently extract metadata from your documents. This metadata is often critical for downstream tasks like filtering your documents based on data points. Moreover, I discussed three main approaches to metadata extraction: regex, OCR + LLM, and vision LLMs, and I covered some challenges you'll face when extracting metadata. I think metadata extraction remains a task that doesn't require a lot of effort, but that can provide a lot of value in downstream tasks. I thus believe metadata extraction will remain important in the coming years, though I expect we'll see more and more metadata extraction move to purely using vision LLMs, instead of OCR + LLM.

👉 Find me on socials:

🧑‍💻 Get in contact

📩 Subscribe to my newsletter

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium

