Fine-tuning olmOCR to be a faithful OCR engine

Johannes Esslinger, TNG

At TNG, we created a fine-tuned Optical Character Recognition (OCR) model based on olmOCR to help automate our internal document processing workflows.

Recently, the Allen Institute for Artificial Intelligence introduced olmOCR, a model that has demonstrated impressive results in converting PDFs into clean, linearized plain text while preserving structured content. However, olmOCR's primary use case is to supply more training data for Large Language Models, and it therefore ignores extraneous information such as headers and footers in documents. While this is useful for generating training data, it is insufficient for real business applications, where crucial information often resides in exactly those parts of the document (consider invoices, for instance). This article describes how we fine-tuned the original olmOCR-7B-0225-preview to be a faithful OCR engine, enabling it to support a broader range of applications.

In this example, crucial information in the header and footer (marked in red) is ignored by olmOCR.



The problem with pipeline-based OCR engines

Optical Character Recognition has broad applications across business use cases. For a long time, the predominant paradigm for AI-based OCR engines has been pipeline-based systems: multiple machine-learning components, e.g. section segmentation, table parsing, and character recognition, chained together. One fundamental flaw of this approach is that the extracted results often do not flatten the content in a way that adheres to the logical reading order, a process also known as linearization. This is especially difficult for layout-rich documents with floating elements, such as multi-column documents with floating diagrams, headers, and footers. With the recent advent of Vision Language Models, a lot of effort has been put into leveraging them as alternative OCR systems that tackle this problem.



Starting Point: olmOCR-7B-0225-preview

While testing the olmOCR model published on Hugging Face for business applications like invoice parsing, we observed a consistent omission of essential information in the headers and footers. This was expected: the dataset used to train olmOCR, olmOCR-mix-0225, intentionally excludes extraneous information in these areas to maintain a natural reading flow, since such information cannot be meaningfully predicted in the context of next-token prediction during training.

To address this limitation and enable comprehensive information extraction, we used Qwen2.5-VL-72B-Instruct to generate a dataset of 8,000 documents that captures all relevant information, as one would expect from a reliable OCR engine. We based our training setup on the open-sourced olmOCR training pipeline, using 4 gradient accumulation steps on an 8xH100 Nvidia node, for a total of 2.5 epochs. The default hyperparameters worked quite well for us, eliminating the need for a resource-intensive hyperparameter search. Experiment tracking with MLflow showed the following results:

[Figure: MLflow experiment tracking of the fine-tuning run]
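
For readers who want a concrete picture of the setup, here is a minimal sketch of the corresponding knobs expressed in Hugging Face TrainingArguments terms. The olmOCR pipeline is configured through its own config files, and the per-device batch size, bf16 flag, and output path below are assumptions rather than values from our run:

```python
# Sketch only: the olmOCR training pipeline uses its own configuration format;
# these are the equivalent knobs in Hugging Face TrainingArguments terms.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="olmocr-7b-faithful",    # hypothetical output path
    per_device_train_batch_size=1,      # assumption; not stated in this article
    gradient_accumulation_steps=4,      # as used in our run
    num_train_epochs=2.5,               # as used in our run
    bf16=True,                          # assumption; typical on H100 GPUs
    report_to=["mlflow"],               # experiment tracking with MLflow
)

# Effective batch size on the 8xH100 node:
# 8 GPUs x 1 sample per GPU x 4 accumulation steps = 32 samples per optimizer step.
```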

For evaluation, we used a customized version of the olmOCR-mix-0225 eval datasets that includes header and footer information, likewise generated with Qwen2.5-VL-72B-Instruct.
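
As a rough illustration, the sketch below shows how a ground-truth transcription for a single rasterized page can be generated with Qwen2.5-VL-72B-Instruct via transformers; the prompt text and file name are illustrative placeholders, not our exact production setup:

```python
# Minimal sketch: annotate one page with Qwen2.5-VL-72B-Instruct.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-72B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

page = Image.open("page.png").convert("RGB")  # one rasterized PDF page
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Transcribe this page into plain text in natural "
                             "reading order. Include all header and footer content."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[page], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```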

Once the training was done, it was time to put our model to the test.



Comparison between original and fine-tuned olmOCR

We performed a qualitative assessment on documents where crucial information was missing after parsing. Our inference setup is identical to olmOCR's, using a special prompting strategy called document anchoring that preserves any born-digital content from each page. This technique extracts the raw text blocks and position information and feeds them to the VLM alongside the rasterized page image.
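
To make the idea concrete, here is an illustrative sketch of document anchoring using PyMuPDF; olmOCR's actual implementation is part of its open-source pipeline and uses a different serialization format:

```python
# Illustrative sketch of document anchoring with PyMuPDF (pip install pymupdf);
# not olmOCR's actual implementation.
import fitz

def anchor_page(pdf_path: str, page_index: int, max_words: int = 300):
    """Return (rasterized page as PNG bytes, anchor text with word positions)."""
    doc = fitz.open(pdf_path)
    page = doc[page_index]
    # Rasterized image that the VLM sees.
    image_png = page.get_pixmap(dpi=150).tobytes("png")
    # Born-digital text with coordinates, serialized into the text prompt.
    words = page.get_text("words")[:max_words]  # tuples: (x0, y0, x1, y1, word, ...)
    anchor = "\n".join(f"[{x0:.0f},{y0:.0f}] {word}" for x0, y0, _, _, word, *_ in words)
    doc.close()
    return image_png, anchor
```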

Below, we show several examples comparing the original model's response with that of the fine-tuned model. Crucial missing information is marked in red.

The original PDF is an invoice containing essential information at the top and bottom. Unlike olmOCR-7B-0225-preview, our fine-tuned version extracts all of it.

Again, the example contains information essential for downstream tasks. Our model extracts all of it and is still able to parse simple tables.

Complex multi-column layouts can also be parsed with olmOCR; now the additional information in the header is extracted as well.

One last example 😉

Overall, we are happy with the results: all information, including extraneous data, can now be extracted, and the model is still able to parse simple tables. Notably, we noticed that for some examples, the quality of the output can change significantly with different sampling temperatures.



Summary

OCR is crucial for extracting structured information from documents. The ability of end-to-end systems like olmOCR to linearize content gives them a strong advantage over traditional pipeline systems. With our fine-tuned version, we can now faithfully extract text, including header and footer sections, from a variety of documents, which is crucial for business use cases such as invoice parsing. We are curious to see how future models will evolve in this fast-paced domain. Special thanks to the Allen Institute for AI for open-sourcing their model, dataset, and code.

If you are curious to try out our fine-tuned olmOCR model, we have open-sourced it on Hugging Face.
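
A minimal sketch for trying it out with transformers is shown below. The repository id is a placeholder (see our Hugging Face page for the actual one), and since olmOCR is based on Qwen2-VL-7B-Instruct, the standard Qwen2-VL classes apply:

```python
# Sketch only: replace the placeholder repository id with the one from our
# Hugging Face page. olmOCR builds on Qwen2-VL-7B-Instruct.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "..."  # placeholder: our fine-tuned olmOCR model on Hugging Face
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

page = Image.open("invoice_page.png").convert("RGB")  # illustrative input
messages = [{"role": "user", "content": [
    {"type": "image"},
    # In practice, the text part would carry the document-anchoring prompt
    # described above; this plain instruction is a simplified stand-in.
    {"type": "text", "text": "Transcribe this page into plain text in natural reading order."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[page], return_tensors="pt").to(model.device)
# Keep the temperature low: output quality varied noticeably with temperature.
output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.1)
print(processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```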


