We’ve added Chandra and OlmOCR-2 to this blog, along with the models’ OlmOCR Benchmark scores 🫡
TL;DR: The rise of powerful vision-language models has transformed document AI. Each model comes with unique strengths, making it tricky to choose the right one. Open-weight models offer better cost efficiency and privacy. To help you get started with them, we’ve put together this guide.
In this guide, you’ll learn:
- The landscape of current models and their capabilities
- When to fine-tune models vs. use models out-of-the-box
- Key aspects to consider when choosing a model for your use case
- How to move beyond OCR with multimodal retrieval and document QA
By the end, you’ll know how to select the right OCR model, start building with it, and gain deeper insights into document AI. Let’s go!
Brief Introduction to Modern OCR
Optical Character Recognition (OCR) is one of the earliest and longest-running challenges in computer vision. Many of AI’s first practical applications focused on turning printed text into digital form.
With the surge of vision-language models (VLMs), OCR has advanced significantly. Recently, many OCR models have been developed by fine-tuning existing VLMs. But today’s capabilities extend far beyond OCR: you can retrieve documents by query or answer questions about them directly. Thanks to stronger vision features, these models can also handle low-quality scans, interpret complex elements like tables, charts, and images, and fuse text with visuals to answer open-ended questions across documents.
Model Capabilities
Transcription
Recent models transcribe text into a machine-readable format.
The input can include:
- Handwritten text
- Various scripts like Latin, Arabic, and Japanese characters
- Mathematical expressions
- Chemical formulas
- Image/Layout/Page number tags
OCR models convert these into machine-readable text that comes in many different formats, like HTML, Markdown, and more.
Handling complex components in documents
On top of text, some models can also recognize and handle images, tables, and charts embedded in documents.
Some models know where images are located in the document, extract their coordinates, and insert them appropriately between passages of text. Other models generate captions for images and insert them where they appear. This is especially useful if you are feeding the machine-readable output into an LLM. Example models are OlmOCR by AllenAI, or PaddleOCR-VL by PaddlePaddle.
Models use different machine-readable output formats, such as DocTags, HTML, or Markdown (explained in the next section, Output Formats). The way a model handles tables and charts often depends on the output format it uses. Some models treat charts like images: they are kept as is. Other models convert charts into Markdown tables or JSON; e.g., a bar chart might be converted as follows.
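For example, a simple quarterly revenue bar chart (hypothetical values, purely for illustration) might come out as a Markdown table like this:

| Quarter | Revenue (M$) |
|---|---|
| Q1 | 1.4 |
| Q2 | 1.9 |
| Q3 | 2.3 |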
Similarly for tables, cells are converted into a machine-readable format while retaining context from headings and columns.
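As an illustrative sketch (structure only; the exact markup varies by model), the same data rendered as an HTML table keeps the header-to-cell relationships explicit:

<table>
  <tr><th>Quarter</th><th>Revenue (M$)</th></tr>
  <tr><td>Q1</td><td>1.4</td></tr>
  <tr><td>Q2</td><td>1.9</td></tr>
  <tr><td>Q3</td><td>2.3</td></tr>
</table>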
Output formats
Different OCR models have different output formats. Briefly, here are the common output formats used by modern models.
- DocTags: DocTags is an XML-like format for documents that expresses location, text formatting, component-level information, and more. This format is used by the open Docling models; a rough sketch of what it looks like follows this list.
- HTML: HTML is one of the most popular output formats used for document parsing because it properly encodes structure and hierarchical information.
- Markdown: Markdown is the most human-readable format. It’s simpler than HTML but not as expressive. For instance, it can’t represent split-column tables.
- JSON: JSON is not a format that models use for the whole output, but it can be used to represent information in tables or charts.
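Here is a rough, hand-written sketch of what DocTags output can look like (tag names and location tokens are approximate; see the Docling documentation for the exact schema):

<doctag>
  <section_header_level_1><loc_57><loc_40><loc_443><loc_62>Brief Introduction to Modern OCR</section_header_level_1>
  <text><loc_57><loc_70><loc_443><loc_120>Optical Character Recognition (OCR) is one of the earliest...</text>
  <picture><loc_57><loc_130><loc_443><loc_280></picture>
</doctag>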
The right model depends on how you plan to use its outputs:
- Digital reconstruction: To reconstruct documents digitally, select a model with a layout-preserving format (e.g., DocTags or HTML).
- LLM input or Q&A: If the use case involves passing outputs to an LLM, pick a model that outputs Markdown and image captions, since they’re closer to natural language.
- Programmatic use: If you need to pass your outputs to a program (e.g., for data analysis), go for a model that generates structured outputs like JSON.
Locality Awareness
Documents can have complex structures, like multi-column text blocks and floating figures. Older OCR models handled these documents by detecting words and then manually reconstructing the page layout in post-processing to render the text in reading order, which is brittle. Modern OCR models, on the other hand, incorporate layout metadata to help preserve reading order and accuracy. This metadata is called an “anchor” and can come in the form of bounding boxes. The process is also called “grounding” or “anchoring”, since it helps reduce hallucination.
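As a schematic illustration (not any particular model’s format), grounded output attaches coordinates to each block so the reading order can be reconstructed:

# Schematic layout metadata for a single block: a bounding box in page
# coordinates plus the transcribed text. Real models encode this differently
# (bounding boxes, location tokens, or anchor text extracted from the PDF).
block = {
    "type": "paragraph",
    "bbox": [72, 96, 540, 180],  # x0, y0, x1, y1
    "text": "Documents can have complex structures, like multi-column text blocks...",
}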
Model Prompting
OCR models take in images and, optionally, a text prompt; this depends on the model architecture and the pre-training setup.
Some OCR models support prompt-based task switching, e.g., granite-docling can parse a whole page with the prompt “Convert this page to Docling”, while it can also take prompts like “Convert this formula to LaTeX” together with a page full of formulas.
Other models, however, are trained only for parsing entire pages, and they are conditioned to do so through a system prompt.
For instance, OlmOCR by AllenAI takes a long conditioning prompt. Like many others, OlmOCR is technically an OCR fine-tuned version of a VLM (Qwen2.5-VL in this case), so you can prompt it for other tasks, but its performance will not be on par with its OCR capabilities.
Cutting-edge Open OCR Models
We’ve seen an incredible wave of new models this past year. Because so much of this work is happening in the open, these players build on and benefit from one another’s work. A good example is AllenAI’s release of OlmOCR, which included not only a model but also the dataset used to train it, so others can build upon it in new directions. The field is incredibly active, but it’s not always obvious which model to use.
Comparing Latest Models
To make things a bit easier, we’re putting together a non-exhaustive comparison of some of our current favorite models. All of the models below are layout-aware and can parse tables, charts, and math equations. The full list of languages each model supports is detailed in its model card, so make sure to check them if you’re interested. All models below have an open-source license, except for Chandra, which has an OpenRAIL license, and Nanonets, whose license is unclear. The average scores are taken from the model cards of Chandra and OlmOCR, evaluated on the OlmOCR Benchmark, which is English-only.
Many of the models in this collection have been fine-tuned from Qwen2.5-VL or Qwen3-VL, so we also include Qwen3-VL in the table below.
| Model Name | Output formats | Features | Model Size | Multilingual? | Average Score on OlmOCR Benchmark |
|---|---|---|---|---|---|
| Nanonets-OCR2-3B | Structured Markdown with semantic tagging (plus HTML tables, etc.) | Captions images in the documents, signature & watermark extraction, handles checkboxes, flowcharts, and handwriting | 4B | ✅ Supports English, Chinese, French, Arabic, and more | N/A |
| PaddleOCR-VL | Markdown, JSON, HTML tables and charts | Handles handwriting and old documents, allows prompting, converts tables & charts to HTML, extracts and inserts images directly | 0.9B | ✅ Supports 109 languages | N/A |
| dots.ocr | Markdown, JSON | Grounding, extracts and inserts images, handles handwriting | 3B | ✅ Multilingual (language list not available) | 79.1 ± 1.0 |
| OlmOCR-2 | Markdown, HTML, LaTeX | Grounding, optimized for large-scale batch processing | 8B | ❎ English-only | 82.3 ± 1.1 |
| Granite-Docling-258M | DocTags | Prompt-based task switching, can prompt for element locations with location tokens, rich output | 258M | ✅ Supports English, Japanese, Arabic, and Chinese | N/A |
| DeepSeek-OCR | Markdown, HTML | Supports general visual understanding, can parse and re-render charts, tables, and more into HTML, handles handwriting, memory-efficient (processes text through images) | 3B | ✅ Supports nearly 100 languages | 75.4 ± 1.0 |
| Chandra | Markdown, HTML, JSON | Grounding, extracts and inserts images as is | 9B | ✅ Supports 40+ languages | 83.1 ± 0.9 |
| Qwen3-VL | Vision-language model, can output in all formats | Recognizes ancient text, handles handwriting, extracts and inserts images as is | 9B | ✅ Supports 32 languages | N/A |
While Qwen3-VL itself is a strong and versatile vision-language model post-trained for document understanding and other tasks, it isn’t optimized for a single, universal OCR prompt. In contrast, the other models were fine-tuned using one or a few fixed prompts specifically designed for OCR tasks. So to use Qwen3-VL, we recommend experimenting with prompts.
Here’s a small demo for you to try some of the latest models and compare their outputs.
Evaluating Models
Benchmarks
There’s no single best model, as every problem has different needs. Should tables be rendered in Markdown or HTML? Which elements should we extract? How should we quantify text accuracy and error rates? 👀
While there are many evaluation datasets and tools, many don’t answer these questions. So we recommend using the following benchmarks:
- OmniDocBench: This widely used benchmark stands out for its diverse document types: books, magazines, and textbooks. Its evaluation criteria are well designed, accepting tables in both HTML and Markdown formats. A novel matching algorithm evaluates the reading order, and formulas are normalized before evaluation. Most metrics rely on edit distance or tree edit distance (for tables). Notably, the annotations used for evaluation are not solely human-generated but are acquired through SoTA VLMs or conventional OCR methods.
- OlmOCR-Bench: OlmOCR-Bench takes a different approach: it treats the evaluation as a set of unit tests. For instance, table evaluation is done by checking the relations between selected cells of a given table. It uses PDFs from public sources, and annotations are done using a wide selection of closed-source VLMs. This benchmark is quite effective for evaluating on English.
- CC-OCR (Multilingual): Compared to the previous benchmarks, CC-OCR is less preferred when picking models, due to lower document quality and variety. Nonetheless, it’s the only benchmark that includes evaluation beyond English and Chinese! While the evaluation is far from perfect (images are photos with few words), it’s still the best you can do for multilingual evaluation.
When testing different OCR models, we have found that performance varies a lot across different document types, languages, and so on. Your domain might not be well represented in existing benchmarks! To make effective use of this new generation of VLM-based OCR models, we recommend gathering a dataset of representative examples from your task domain and testing a few different models to compare their performance.
Cost-efficiency
Most OCR models are small, with between 3B and 7B parameters; you can even find models with fewer than 1B parameters, like PaddleOCR-VL. However, the cost also depends on the availability of optimized implementations for specialized inference frameworks. For instance, OlmOCR-2 comes with vLLM and SGLang implementations, and the cost per million pages is 178 dollars (assuming an H100 at $2.69/hour). DeepSeek-OCR can process 200k+ pages per day on a single A100 with 40GB VRAM. With napkin math, we see that the cost per million pages is roughly similar to OlmOCR’s (although it depends on your A100 provider). If your use case remains unaffected, you can also opt for quantized versions of the models. The cost of running open-source models heavily depends on the hourly cost of the instance and the optimizations the model includes, but it’s guaranteed to be cheaper than many closed-source offerings at larger scale.
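For a rough sense of that napkin math (the A100 hourly price below is an assumption for illustration; your provider’s rate will differ):

# DeepSeek-OCR throughput from above: 200k+ pages per day on a single A100 40GB.
pages_per_day = 200_000
a100_hourly_usd = 1.50  # assumed hourly price; varies by provider

hours_for_1m_pages = (1_000_000 / pages_per_day) * 24          # 120 hours
cost_per_million_pages = hours_for_1m_pages * a100_hourly_usd  # ~$180
print(f"~${cost_per_million_pages:.0f} per million pages")     # same ballpark as OlmOCR-2's $178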
Open OCR Datasets
While the past year has seen a surge in open OCR models, this hasn’t been matched by as many open training and evaluation datasets. An exception is AllenAI’s olmOCR-mix-0225, which has been used to train at least 72 models on the Hub – likely more, since not all models document their training data.
Sharing more datasets could unlock even greater advances in open OCR models. There are several promising approaches for creating these datasets:
- Synthetic data generation (e.g., isl_synthetic_ocr)
- VLM-generated transcriptions filtered manually or through heuristics
- Using existing OCR models to generate training data for new, potentially more efficient models in specific domains
- Leveraging existing corrected datasets like the Medical History of British India dataset, which includes extensively human-corrected OCR for historical documents
It’s worth noting that many such datasets exist but remain underused. Making them more available as ‘training-ready’ datasets holds substantial potential for the open-source community.
Tools to Run Models
We’ve received many questions about getting started with OCR models, so here are a few ways you can use local inference tools and host remotely with Hugging Face.
Locally
Most cutting-edge models come with vLLM support and a transformers implementation. You can get more info about how to serve each from the models’ own cards. For convenience, we show how to run inference locally using vLLM here. The code below can differ from model to model, but for most models it looks like the following.
vllm serve nanonets/Nanonets-OCR2-3B
And then you can query it as follows using, e.g., the OpenAI client.
from openai import OpenAI
import base64
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM doesn't check the API key
model = "nanonets/Nanonets-OCR2-3B"
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def infer(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000,
    )
    return response.choices[0].message.content

img_base64 = encode_image(your_img_path)
print(infer(img_base64))
Transformers
Transformers provides standard model definitions for straightforward inference and fine-tuning. Models available in transformers ship either an official transformers implementation (model definitions within the library) or a “remote code” implementation. The latter is defined by the model owners to enable easy loading of models through the transformers interface, so you don’t have to go through the model implementation yourself. Below is an example loading the Nanonets model using the transformers implementation.
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
model = AutoModelForImageTextToText.from_pretrained(
    "nanonets/Nanonets-OCR2-3B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()
processor = AutoProcessor.from_pretrained("nanonets/Nanonets-OCR2-3B")
def infer(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]
result = infer(image_path, model, processor, max_new_tokens=15000)
print(result)
MLX
MLX is an open-source machine learning framework for Apple Silicon. MLX-VLM is built on top of MLX to serve vision language models easily. You can explore all the OCR models available in MLX format here. They also come in quantized versions.
You can install MLX-VLM as follows.
pip install -U mlx-vlm
wget https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/throughput_smolvlm.png
python -m mlx_vlm.generate --model ibm-granite/granite-docling-258M-mlx --max-tokens 4096 --temperature 0.0 --prompt "Convert this chart to JSON." --image throughput_smolvlm.png
Remotely
Inference Endpoints for Managed Deployment
You can deploy OCR models compatible with vLLM or SGLang on Hugging Face Inference Endpoints, either from a model repository’s “Deploy” option or directly through the Inference Endpoints interface. Inference Endpoints serves cutting-edge models in a fully managed environment with GPU acceleration, auto-scaling, and monitoring, without manually managing the infrastructure.
Here is a simple approach to deploying Nanonets using vLLM as the inference engine.
- Navigate to the model repository nanonets/Nanonets-OCR2-3B
- Click the “Deploy” button and select “HF Inference Endpoints”
- Configure the deployment setup within seconds
- After the endpoint is created, you can consume it using the OpenAI client snippet we provided in the previous section.
You can learn more about it here.
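Consuming the endpoint is then just a matter of pointing the same OpenAI client at your endpoint’s URL (placeholder values below; copy the real URL and a Hugging Face token from the endpoint page):

import os
from openai import OpenAI

# Placeholder endpoint URL; vLLM-backed endpoints expose an OpenAI-compatible /v1 route.
client = OpenAI(
    base_url="https://your-endpoint-name.endpoints.huggingface.cloud/v1",
    api_key=os.environ["HF_TOKEN"],
)
# Reuse the infer() helper from the "Locally" section with this client and your model name.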
Hugging Face Jobs for Batch Inference
For many OCR applications, you want to do efficient batch inference, i.e., run a model across thousands of images as cheaply and efficiently as possible. A good approach is to use vLLM’s offline inference mode. As discussed above, many recent VLM-based OCR models are supported by vLLM, which efficiently batches images and generates OCR outputs at scale.
To make this even easier, we have created uv-scripts/ocr, a set of ready-to-run OCR scripts that work with Hugging Face Jobs. These scripts let you run OCR on any dataset without needing your own GPU. Simply point the script at your input dataset, and it will:
- Process all images in a dataset column using one of many open OCR models
- Add OCR results as a new markdown column to the dataset
- Push the updated dataset with OCR results to the Hub
For instance, to run OCR on 100 images:
hf jobs uv run --flavor l4x1 \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/nanonets-ocr.py \
    your-input-dataset your-output-dataset \
    --max-samples 100
The scripts handle all of the vLLM configuration and batching automatically, making batch OCR accessible without infrastructure setup.
Going Beyond OCR
If you are interested in document AI beyond OCR, here are some of our recommendations.
Visual Document Retrievers
Visual document retrieval is the task of retrieving the most relevant top-k documents given a text query. If you have previously worked with retriever models, the difference is that you search directly over a stack of PDFs. Aside from using them standalone, you can also build multimodal RAG pipelines by combining them with a vision language model (find out how to do so here). You can find all of them on the Hugging Face Hub.
There are two types of visual document retrievers: single-vector and multi-vector models. Single-vector models are more memory-efficient but less performant, while multi-vector models are more memory-hungry and more performant. Most of these models come with vLLM and transformers integrations, so you can index documents with them and then search easily using a vector DB.
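As a minimal sketch of the multi-vector flavor, here is how indexing and scoring can look with the colpali-engine library and a ColQwen2 checkpoint (the checkpoint name is just one example; check the model card of the retriever you pick for its exact API):

import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

# Load a multi-vector visual retriever (checkpoint name used for illustration).
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Embed page images and text queries separately.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]  # hypothetical page renders
queries = ["What was the revenue in 2024?"]

with torch.no_grad():
    page_embeddings = model(**processor.process_images(pages).to(model.device))
    query_embeddings = model(**processor.process_queries(queries).to(model.device))

# Late-interaction similarity between each query and each page; higher = more relevant.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)  # shape: [num_queries, num_pages]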
Using Vision Language Models for Document Query Answering
If you have a task at hand that only requires answering questions over documents, you can use one of the vision language models that included document tasks in their training. We’ve observed users trying to convert documents into text and passing the output to LLMs, but if your document has a complex layout, your converted output renders charts and the like in HTML, or images are captioned incorrectly, the LLM will miss out. Instead, feed your document and query to one of the advanced vision language models like Qwen3-VL so you don’t lose any context.
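As a hedged sketch (the checkpoint name is one of several Qwen3-VL sizes, and this assumes a transformers version with Qwen3-VL support), document QA looks much like the transformers snippet above, just with a question instead of an OCR prompt:

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # example checkpoint; pick the size you need
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("report_page.png")  # hypothetical document page
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "What was the total revenue reported for 2024?"},
]}]

# Tokenize the chat template together with the image, then generate an answer.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)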
Wrapping up
In this blog post, we wanted to give you an overview of how to pick your OCR model, the existing cutting-edge models and their capabilities, and the tools to get you started with OCR.
If you want to learn more about OCR and vision language models, we encourage you to read the resources below.





