
Recently, the Allen Institute for Artificial Intelligence introduced olmOCR, a model that has demonstrated impressive results in converting PDFs into clean, linearized plain text while preserving structured content. However, olmOCR's primary use case is to supply more training data for Large Language Models, and it consequently ignores extraneous information like headers and footers in documents. While this is useful for generating training data, it is insufficient for real business applications, where crucial information often resides in the header and footer sections of a document (consider invoices, for instance). This article describes how we fine-tuned the original olmOCR-7B-0225-preview to be a faithful OCR engine, enabling it to support a broader range of applications.
In this example, crucial information in the header and footer, marked in red, is ignored by olmOCR.

The problem with pipeline-based OCR engines
Optical Character Recognition has broad applications across business use cases. For a long time, the predominant paradigm for AI-based OCR engines has been pipeline-based systems. These consist of multiple machine-learning components, e.g., section segmentation, table parsing, character recognition, etc., chained together. However, one fundamental flaw of this approach is that the extracted results mostly do not flatten the content in a way that adheres to logical reading order, also known as linearization. This is especially difficult for layout-rich documents with floating elements, such as multi-column documents with floating diagrams, headers, footers, etc. With the recent advent of Vision Language Models, a lot of effort has been put into leveraging them as alternative OCR systems to tackle this problem.
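To make the idea of linearization concrete, here is a deliberately simplified sketch. The column-splitting heuristic and block format below are our own illustration, not part of any real OCR pipeline; real layout-rich documents require far more sophisticated layout analysis.

```python
# Minimal illustration of linearization: ordering raw OCR text blocks
# into logical reading order for a two-column page. The heuristic
# (split blocks at the page midline, then read each column top to
# bottom) is a toy example only.

def linearize(blocks, page_width):
    """Order (x, y, text) blocks: left column first, then right column."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= mid), key=lambda b: b[1])
    return [text for _, _, text in left + right]

blocks = [
    (50, 300, "left column, paragraph 2"),
    (450, 100, "right column, paragraph 1"),
    (50, 100, "left column, paragraph 1"),
]
print(linearize(blocks, page_width=800))
# ['left column, paragraph 1', 'left column, paragraph 2', 'right column, paragraph 1']
```

A pipeline system that simply emits blocks in detection order would interleave the columns; an end-to-end model like olmOCR learns to produce the linearized order directly.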
Starting Point: olmOCR-7B-0225-preview
During our testing of the olmOCR model published on Hugging Face for business applications like invoice parsing, we observed a consistent omission of essential information in the headers and footers. This was expected, because the dataset used to train olmOCR, olmOCR-mix-0225, intentionally excludes extraneous information in these areas to maintain a natural reading flow, as such information cannot be meaningfully predicted in the context of next-token prediction during training.
To address this limitation and enable comprehensive information extraction, we used Qwen2.5-VL-72B-Instruct to generate a dataset of 8,000 documents that captures all relevant information, as one would expect from a reliable OCR engine. We based our training setup on the open-sourced olmOCR training pipeline, using 4 gradient accumulation steps on an 8xH100 Nvidia node, for a total of 2.5 epochs. The default hyperparameters worked well for us, eliminating the need for a resource-intensive hyperparameter search. Experiment tracking with MLflow showed the following results:
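As a back-of-the-envelope check of what gradient accumulation means for the setup above: the GPU count (8xH100) and 4 accumulation steps are taken from the text, while the per-device micro-batch size of 1 is an assumption for illustration, not a value from the actual config.

```python
# Effective global batch size under gradient accumulation:
# gradients are summed over several forward/backward passes per device
# before the optimizer steps, multiplying the effective batch size.
per_device_batch = 1   # assumption for illustration, not from the real config
num_gpus = 8           # one 8xH100 node, as described above
grad_accum_steps = 4   # gradient accumulation steps, as described above

effective_batch = per_device_batch * num_gpus * grad_accum_steps
print(effective_batch)  # 32 under these assumptions
```

This is why accumulation is attractive for 7B-scale VLM fine-tuning: it buys a larger effective batch without the GPU memory cost of larger per-device batches.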
For evaluation, we used a custom version of the olmOCR-mix-0225 eval datasets that includes header and footer information, also generated with Qwen2.5-VL-72B-Instruct.
Once training was done, it was time to put our model to the test.
Comparison between original and fine-tuned olmOCR
We performed a qualitative assessment on documents where crucial information was missing after parsing. Our inference setup mirrors that of olmOCR, using a special prompting strategy called document anchoring that preserves any born-digital content from each page. This technique extracts the raw text blocks and their position information and feeds them to the VLM alongside the rasterized image.
We provide several examples of the original response and the response of the fine-tuned model below.
Crucial missing information is marked in red.




Overall, we are happy with the results: all information, including extraneous data, can now be extracted, and the model is still able to parse simple tables. Notably, we observed that for some examples, the quality of the resulting output can change significantly with different temperatures.
Summary
OCR is crucial for extracting structured information from documents. The ability of end-to-end systems like olmOCR to linearize content gives them a strong advantage over traditional systems. With our fine-tuned version, we can now faithfully extract text, including header and footer sections, from a variety of documents, which is essential for business use cases such as invoice parsing. We are curious to see how future models evolve in this fast-paced domain. Special thanks to the Allen Institute for AI for open-sourcing their model, dataset, and code.
If you are curious to try out our fine-tuned olmOCR model, we have open-sourced it on Hugging Face.

