Enterprises are full of documents containing knowledge that isn't accessible by digital workflows. These documents can vary from letters and invoices to forms, reports, and receipts. With the improvements in text, vision, and multimodal AI, it's now possible to unlock that information. This post shows you how your teams can use open-source models to build custom solutions for free!
Document AI includes many data science tasks, from image classification, image to text, and document question answering to table question answering and visual question answering. This post starts with a taxonomy of use cases within Document AI and the best open-source models for those use cases. Next, the post focuses on licensing, data preparation, and modeling. Throughout this post, there are links to web demos, documentation, and models.
Use Cases
There are at least six general use cases for building document AI solutions. These use cases differ in the kind of document inputs and outputs. A combination of approaches is often necessary when solving enterprise Document AI problems.
Turning typed, handwritten, or printed text into machine-encoded text is known as Optical Character Recognition (OCR). It is a widely studied problem with many well-established open-source and commercial offerings. The figure shows an example of converting handwriting into text.
OCR is a backbone of Document AI use cases, as it's essential to transform the text into something readable by a computer. Some widely available OCR models that operate at the document level are EasyOCR or PaddleOCR. There are also models like TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, which runs on single-text-line images. This model works with a text detection model like CRAFT, which first identifies the individual "pieces" of text in a document in the form of bounding boxes. The relevant metrics for OCR are Character Error Rate (CER) and word-level precision, recall, and F1. Try this Space to see a demonstration of CRAFT and TrOCR.
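As a rough sketch, recognizing a single cropped text line with TrOCR in Transformers might look like the following (the image path is a placeholder; in practice, a detector such as CRAFT would crop each line first):

```python
# Minimal sketch: single-text-line handwriting recognition with TrOCR.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# "text_line.png" is a placeholder for a cropped image containing a single line of text
image = Image.open("text_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```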
Classifying documents into the appropriate category, such as forms, invoices, or letters, is known as document image classification. Classification may use either one or both of the document's image and text. The recent addition of multimodal models that use the visual structure and the underlying text has dramatically increased classifier performance.
A basic approach is applying OCR on a document image, after which a BERT-like model is used for classification. However, relying only on a BERT model doesn't take any layout or visual information into account. The figure from the RVL-CDIP dataset shows how visual structure differs across document types.
This is where models like LayoutLM and Donut come into play. By incorporating not only text but also visual information, these models can dramatically increase accuracy. For comparison, on RVL-CDIP, an important benchmark for document image classification, a BERT-base model achieves 89% accuracy using the text alone. A DiT (Document Image Transformer) is a pure vision model (i.e., it doesn't take text as input) and can reach 92% accuracy. But models like LayoutLMv3 and Donut, which combine the text and visual information using a multimodal Transformer, can achieve 95% accuracy! These multimodal models are changing how practitioners solve Document AI use cases.
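As a quick illustration, classifying a scanned page with the DiT checkpoint fine-tuned on RVL-CDIP could look roughly like this (the image path is a placeholder):

```python
# Sketch: document image classification with a DiT model fine-tuned on RVL-CDIP.
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/dit-base-finetuned-rvlcdip")

# "scanned_document.png" is a placeholder for your own document image
predictions = classifier("scanned_document.png")
print(predictions[0])  # e.g. {'label': 'invoice', 'score': ...}
```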
Document layout analysis is the task of determining the physical structure of a document, i.e., identifying the individual building blocks that make up a document, like text segments, headers, and tables. This task is often solved by framing it as an image segmentation/object detection problem. The model outputs a set of segmentation masks/bounding boxes, along with class names.
Models that are currently state-of-the-art for document layout analysis are LayoutLMv3 and DiT (Document Image Transformer). Both models use the classic Mask R-CNN framework for object detection as a backbone. This document layout analysis Space illustrates how DiT can be used to identify text segments, titles, and tables in documents. An example of DiT detecting different parts of a document is shown here.
Document layout analysis with DiT.
Document layout analysis typically uses the mAP (mean average precision) metric, often used for evaluating object detection models. An important benchmark for layout analysis is the PubLayNet dataset. LayoutLMv3, the state-of-the-art at the time of writing, achieves an overall mAP score of 0.951 (source).
A step beyond layout analysis is document parsing. Document parsing is identifying and extracting key information (often in the form of key-value pairs) from a document, such as names, items, and totals from an invoice form. This LayoutLMv2 Space shows how to parse a document to recognize questions, answers, and headers.
The first version of LayoutLM (now known as LayoutLMv1) was released in 2020 and dramatically improved over existing benchmarks, and it's still one of the most popular models on the Hugging Face Hub for Document AI. LayoutLMv2 and LayoutLMv3 incorporate visual features during pre-training, which provides a further improvement. The LayoutLM family produced a step change in Document AI performance. For example, on the FUNSD benchmark dataset, a BERT model has an F1 score of 60%, but with LayoutLM, it is possible to get to 90%!
LayoutLMv1 now has many successors, including ERNIE-Layout, which shows promising results as demonstrated in this Space. For multilingual use cases, there are multilingual variants of LayoutLM, like LayoutXLM and LiLT. This figure from the LayoutLM paper shows LayoutLM analyzing a few different documents.
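A minimal sketch of document parsing framed as token classification with a LayoutLMv3-style model is shown below. The checkpoint name is hypothetical and stands in for your own model fine-tuned on FUNSD-like labels; the processor is assumed to run Tesseract OCR internally (its default behavior), so pytesseract must be installed.

```python
# Sketch: document parsing as token classification with LayoutLMv3.
from PIL import Image
from transformers import AutoProcessor, AutoModelForTokenClassification

# Hypothetical checkpoint name: substitute your own model fine-tuned on FUNSD-style labels.
checkpoint = "your-org/layoutlmv3-finetuned-funsd"
processor = AutoProcessor.from_pretrained(checkpoint)  # runs OCR via Tesseract by default
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

image = Image.open("form.png").convert("RGB")  # placeholder path
encoding = processor(image, return_tensors="pt")  # input_ids, bbox, pixel_values, ...

outputs = model(**encoding)
predicted_ids = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[i] for i in predicted_ids]  # e.g. B-QUESTION, B-ANSWER, B-HEADER
```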
Many successors of LayoutLM adopt a generative, end-to-end approach. This began with the Donut model, which simply takes a document's image as input and produces text as output, without relying on any separate OCR engine.

Donut model consisting of an encoder-decoder Transformer. Taken from the Donut paper.
After Donut, various similar models were released, including Pix2Struct by Google and UDOP by Microsoft. Nowadays, larger vision-language models such as LLaVa-NeXT and Idefics2 can be fine-tuned to perform document parsing in an end-to-end manner. As a matter of fact, these models can be fine-tuned to perform any Document AI task, from document image classification to document parsing, as long as the task can be defined as an image-text-to-text task. See, for instance, the tutorial notebook on fine-tuning Google's PaliGemma (a smaller vision-language model) to return a JSON from receipt images.

Vision-language models such as PaliGemma can be fine-tuned on any image-text-to-text task. See the tutorial notebook.
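To give a feel for the OCR-free approach, here is a rough sketch of receipt parsing with the publicly available Donut checkpoint fine-tuned on CORD (the image path is a placeholder):

```python
# Sketch: OCR-free document parsing with Donut fine-tuned on CORD receipts.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")  # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task-specific start token.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<s_cord-v2>", "", sequence, count=1).strip()  # drop the task prompt
print(processor.token2json(sequence))  # structured key-value pairs from the receipt
```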
Data scientists are finding document layout analysis and extraction to be key use cases for enterprises. Current commercial solutions typically cannot handle the diversity of most enterprise data, in content and structure. Consequently, data science teams can often surpass commercial tools by fine-tuning their own models.
Documents often contain tables, and most OCR tools don't work incredibly well out-of-the-box on tabular data. Table detection is the task of identifying where tables are located, and table extraction creates a structured representation of that information. Table structure recognition is the task of identifying the individual pieces that make up a table, like rows, columns, and cells. Table functional analysis (FA) is the task of recognizing the keys and values of the table. The figure from the Table Transformer illustrates the difference between the various subtasks.
The approach for table detection and structure recognition is similar to document layout analysis in using object detection models that output a set of bounding boxes and corresponding classes.
The latest approaches, like Table Transformer, can enable table detection and table structure recognition with the same model. The Table Transformer is a DETR-like object detection model, trained on PubTables-1M (a dataset comprising one million tables). Evaluation for table detection and structure recognition typically uses the average precision (AP) metric. The Table Transformer performance is reported as having an AP of 0.966 for table detection and an AP of 0.912 for table structure recognition + functional analysis on PubTables-1M.
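As an illustration, table detection with the Table Transformer in Transformers might look roughly like this (the image path and score threshold are placeholders):

```python
# Sketch: table detection with the Table Transformer (DETR-like object detection).
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-detection"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

image = Image.open("report_page.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to bounding boxes in the original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```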
Table detection and extraction is an exciting approach, but the results may differ on your data. In our experience, the quality and formatting of tables vary widely and can affect how well the models perform. Additional fine-tuning on some custom data will greatly improve performance.
Question answering on documents has dramatically changed how people interact with AI. Recent advancements have made it possible to ask models to answer questions about an image – this is known as document visual question answering, or DocVQA for short. After being given a question, the model analyzes the image and responds with an answer. An example from the DocVQA dataset is shown in the figure below. The user asks, "Mention the ZIP code written?" and the model responds with the answer.
In the past, building a DocVQA system would often require multiple models working together. There could be separate models for analyzing the document layout, performing OCR, extracting entities, and then answering a question. The latest DocVQA models enable question answering in an end-to-end manner, comprising only a single (multimodal) model.
DocVQA is typically evaluated using the Average Normalized Levenshtein Similarity (ANLS) metric. For more details regarding this metric, we refer to this guide. The current open-source state-of-the-art on the DocVQA benchmark is LayoutLMv3, which achieves an ANLS score of 83.37. However, this model consists of a pipeline of OCR + multimodal Transformer.
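For reference, a rough sketch of the ANLS computation (per prediction, against one or more ground-truth answers, with the usual 0.5 threshold) could look like this, assuming the python-Levenshtein package:

```python
# Rough sketch of the ANLS metric (Average Normalized Levenshtein Similarity).
import Levenshtein  # from the python-Levenshtein package

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Best normalized Levenshtein similarity against the ground truths, zeroed below tau."""
    best = 0.0
    for gt in ground_truths:
        pred, ref = prediction.lower().strip(), gt.lower().strip()
        distance = Levenshtein.distance(pred, ref)
        length = max(len(pred), len(ref), 1)
        similarity = 1.0 - distance / length
        best = max(best, similarity if similarity >= tau else 0.0)
    return best

# The dataset-level score is the mean of anls() over all questions.
print(anls("invoice total", ["Invoice Total"]))  # 1.0 for an exact match (case-insensitive)
```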
Newer models such as Donut, LLaVa-NeXT, and Idefics2 solve the task in an end-to-end manner using a single Transformer-based neural network, without relying on OCR. Impira hosts an exciting Space that illustrates LayoutLM and Donut for DocVQA.
Visual question answering is compelling; however, there are many considerations for using it successfully. Having accurate training data, evaluation metrics, and post-processing is vital. For teams taking on this use case, be aware that DocVQA can be difficult to get working properly. In some cases, responses can be unpredictable, and the model can "hallucinate" by giving an answer that doesn't appear within the document. Visual question answering models can also inherit biases in the data, raising ethical issues. Ensuring proper model setup and post-processing is integral to building a successful DocVQA solution.
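For a quick experiment, the document-question-answering pipeline in Transformers wraps this workflow; a minimal sketch with Impira's LayoutLM checkpoint (which requires pytesseract for OCR) might look like this, with a placeholder image path:

```python
# Sketch: document visual question answering with the document-question-answering pipeline.
from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# "letter.png" is a placeholder for your own document image.
result = doc_qa(image="letter.png", question="Mention the ZIP code written?")
print(result)  # e.g. [{'answer': ..., 'score': ..., 'start': ..., 'end': ...}]
```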
What are Licensing Issues in Document AI?
Industry and academia make enormous contributions to advancing Document AI. There is a wide assortment of models and datasets available for data scientists to use. However, licensing can be a non-starter for building an enterprise solution. Some well-known models have restrictive licenses that forbid the model from being used for commercial purposes. Most notably, Microsoft's LayoutLMv2 and LayoutLMv3 checkpoints cannot be used commercially. When you start a project, we advise carefully evaluating the license of prospective models. Knowing which models you may use is essential at the outset, since that will affect data collection and annotation. A table of the popular models with their license information is at the end of this post.
What are Data Prep Issues in Document AI?
Data preparation for Document AI is critical and challenging. It's crucial to have properly annotated data. Here are some lessons we have learned along the way about data preparation.
First, machine learning depends on the scale and quality of your data. If the image quality of your documents is poor, you can't expect AI to magically read them. Similarly, if your training data is small with many classes, your performance may be poor. Document AI is like other problems in machine learning where more data will generally provide better performance.
Second, be flexible in your approaches. You may need to test several different methodologies to find the best solution. A great example is OCR, where you can use an open-source product like Tesseract, a commercial solution like Cloud Vision API, or the OCR capability inside an open-source multimodal model like Donut.
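For instance, a quick open-source OCR baseline with pytesseract (which wraps the Tesseract binary, installed separately) could be as simple as the following sketch, with a placeholder image path:

```python
# Sketch: a quick OCR baseline using pytesseract (requires the Tesseract binary).
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("invoice.png"))  # placeholder path
print(text)
```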
Third, start small with annotating data and pick your tools wisely. In our experience, you can get good results with several hundred documents. So start small and carefully evaluate your performance. Once you have narrowed your overall approach, you can begin to scale up the data to maximize your predictive accuracy. When annotating, remember that some tasks like layout identification and document extraction require identifying a specific region within a document. Be sure your annotation tool supports bounding boxes.
What are Modeling Issues in Document AI?
The flexibility of building your own models leads to many options for data scientists. Our strong suggestion for teams is to start with the pre-trained open-source models. These models can be fine-tuned to your specific documents, and this is generally the quickest path to a good model.
For teams considering building their own pre-trained model, be aware that this can involve millions of documents and can easily take several weeks to train. Building a pre-trained model requires significant effort and is not recommended for most data science teams. Instead, start by fine-tuning one, but ask yourself these questions first.
Do you need the model to handle the OCR? For example, Donut doesn't require the document to be OCRed and works directly on full-resolution images, so there is no need for OCR before modeling. However, depending on your problem setup, it may be simpler to obtain OCR separately.
Should you use higher-resolution images? When using images with LayoutLMv2, it downscales them to 224 by 224, which destroys the original aspect ratio of the images. Newer models such as Donut, Pix2Struct, and Idefics2 use the full high-resolution image, keeping the original aspect ratio. Research has shown that performance dramatically increases with a higher image resolution, as it allows models to "see" a lot more. However, it also comes at the cost of additional memory required for training and inference.

Effect of image resolution on downstream performance. Taken from the Pix2Struct paper.
How will you evaluate the model? Watch out for misaligned bounding boxes. You should ensure the bounding boxes provided by the OCR engine of your choice align with the model processor. Verifying this can save you from unexpectedly poor results. Second, let your project requirements guide your evaluation metrics. For example, in some tasks like token classification or question answering, a 100% match may not be the best metric. A metric like partial match could allow for many more potential tokens to be considered, such as counting "Acme" and "inside Acme" as a match (see the sketch below). Finally, consider ethical issues during your evaluation, as these models may be working with biased data or provide unstable outcomes that could be biased against certain groups of people.
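As a simple illustration, a hypothetical partial-match check along those lines might look like this:

```python
# Illustrative (hypothetical) partial-match metric for answer evaluation.
def _normalize(s: str) -> str:
    # Lowercase and collapse whitespace before comparison.
    return " ".join(s.lower().split())

def partial_match(prediction: str, ground_truth: str) -> bool:
    # Count a prediction as correct if the normalized ground truth appears inside it.
    return _normalize(ground_truth) in _normalize(prediction)

print(partial_match("inside Acme", "Acme"))    # True  – counted as a match
print(partial_match("Acme Corp", "Beta Inc"))  # False
```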
Next Steps
Are you seeing the possibilities of Document AI? Every day we work with enterprises to unlock valuable data using state-of-the-art vision and language models. We included links to various demos throughout this post, so use them as a starting point. The last section of the post contains resources for starting to code up your own models, such as visual question answering. Once you're ready to start building your solutions, the Hugging Face public hub is a great starting point. It hosts a vast array of Document AI models.
If you want to accelerate your Document AI efforts, Hugging Face can help. Through our Enterprise Acceleration Program, we partner with enterprises to provide guidance on AI use cases. For Document AI, this could involve helping build a pre-trained model, improving accuracy on a fine-tuning task, or providing overall guidance on tackling your first Document AI use case.
We can also provide bundles of compute credits to use our training (AutoTrain) or inference (Spaces or Inference Endpoints) products at scale.
Resources
Notebooks and tutorials for many Document AI models can be found at:
What are Popular Open-Source Models for Document AI?
A table of the currently available Transformers models achieving state-of-the-art performance on Document AI tasks. An important trend is that we see more and more vision-language models that perform Document AI tasks in an end-to-end manner, taking the document image(s) as input and producing text as output.
This was last updated in June 2024.
What are Metrics and Datasets for Document AI?
A table of the typical metrics and benchmark datasets for common Document AI tasks. This was last updated in November 2022.
| task | typical metrics | benchmark datasets |
|---|---|---|
| Optical Character Recognition | Character Error Rate (CER) | |
| Document Image Classification | Accuracy, F1 | RVL-CDIP |
| Document layout analysis | mAP (mean average precision) | PubLayNet, XFUND (Forms) |
| Document parsing | Accuracy, F1 | FUNSD, SROIE, CORD |
| Table Detection and Extraction | mAP (mean average precision) | PubTables-1M |
| Document visual question answering | Average Normalized Levenshtein Similarity (ANLS) | DocVQA |






