With this blog we’re releasing Docmatix – an enormous dataset for Document Visual Question Answering (DocVQA) that is hundreds of times larger than previously available datasets. Ablations fine-tuning Florence-2 on this dataset show a 20% increase in performance on DocVQA.

An example from the dataset
We first had the idea to create Docmatix after we developed The Cauldron, an extensive collection of 50 datasets for fine-tuning Vision-Language Models (VLMs), and Idefics2 specifically. Through this process, we identified a significant gap in the availability of large-scale Document Visual Question Answering (DocVQA) datasets. The primary dataset we relied on for Idefics2 was DocVQA, which contains 10,000 images and 39,000 question-answer (Q/A) pairs. Despite fine-tuning on this and other datasets, open-source models still show a large performance gap compared to closed-source ones.
To address this limitation, we are excited to introduce Docmatix, a DocVQA dataset featuring 2.4 million images and 9.5 million Q/A pairs derived from 1.3 million PDF documents. This is a 240x increase in scale compared to previous datasets.

Comparing Docmatix to other DocVQA datasets
Here you can explore the dataset yourself and see the kinds of documents and question-answer pairs contained in Docmatix.
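If you prefer to explore it programmatically, a minimal sketch along these lines should work with the `datasets` library. The repository id `HuggingFaceM4/Docmatix`, the `images` config, and the column names are assumptions here; check the dataset card on the Hub for the exact names.

```python
from datasets import load_dataset

# Stream the dataset so you don't have to download all 2.4M images up front.
# Repo id, config, and column names are assumptions; see the dataset card.
docmatix = load_dataset("HuggingFaceM4/Docmatix", "images", split="train", streaming=True)

sample = next(iter(docmatix))
print(sample["texts"])   # Q/A pairs for this document (assumed schema)
print(sample["images"])  # one PIL image per PDF page (assumed schema)
```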
Docmatix is generated from PDFA, an extensive OCR dataset containing 2.1 million PDFs. We took the transcriptions from PDFA and used a Phi-3-small model to generate Q/A pairs. To ensure the dataset’s quality, we filtered the generations, discarding 15% of the Q/A pairs identified as hallucinations. To do so, we used regular expressions to detect code and removed answers that contained the keyword “unanswerable”.
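The exact regular expressions aren’t reproduced in this post, but the filtering step is conceptually a simple keep/drop function over the generated pairs, roughly like the following sketch (the patterns and sample pairs below are illustrative assumptions):

```python
import re

# Illustrative heuristics only; the actual patterns used for Docmatix are not published here.
CODE_PATTERN = re.compile(r"```|\bdef |\bimport |</?\w+>")

def keep_pair(question: str, answer: str) -> bool:
    """Drop Q/A pairs flagged as hallucinations: code-like text or 'unanswerable' markers."""
    if "unanswerable" in answer.lower():
        return False
    if CODE_PATTERN.search(answer) or CODE_PATTERN.search(question):
        return False
    return True

generated_pairs = [
    ("What is the invoice total?", "The total is $1,250.00."),
    ("Who signed the form?", "unanswerable"),
]
kept = [(q, a) for q, a in generated_pairs if keep_pair(q, a)]
print(kept)  # the 'unanswerable' pair is dropped
```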
The dataset contains a row for each PDF. We converted the PDFs to images at a resolution of 150 dpi and uploaded the processed images to the Hugging Face Hub for quick access.
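The post doesn’t specify which conversion tool was used; as one possible sketch, `pdf2image` (a Poppler wrapper) can rasterize pages at the same 150 dpi resolution:

```python
from pdf2image import convert_from_path  # pip install pdf2image; requires poppler

# Rasterize every page of a PDF at 150 dpi, matching the resolution used for Docmatix.
pages = convert_from_path("example.pdf", dpi=150)
for i, page in enumerate(pages):
    page.save(f"example_page_{i:03d}.png")
```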
All the original PDFs in Docmatix can be traced back to the original PDFA dataset, providing transparency and reliability. Still, we uploaded the processed images for convenience, because converting large numbers of PDFs to images can be resource-intensive.

Processing pipeline to generate Docmatix
After processing the first small batch of the dataset, we performed several ablation studies to optimize the prompts. We aimed to generate around four Q/A pairs per page: too many pairs indicate a large overlap between them, while too few pairs suggest a lack of detail.
Moreover, we aimed for answers to be human-like, avoiding excessively short or long responses. We also prioritized diversity in the questions, ensuring minimal repetition. Interestingly, when we prompted the Phi-3 model to ask questions based on the specific information in the document (e.g., “What are the titles of John Doe?”), the questions showed very few repetitions. The following plot presents some key statistics from our evaluation:

Evaluation of Docmatix per prompt
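The exact prompt used for the generation step described above isn’t reproduced in this post; the sketch below is a hypothetical illustration of how a Phi-3-small instruct model could be asked for roughly four grounded, diverse Q/A pairs per page via `transformers` (the prompt wording and checkpoint are assumptions):

```python
from transformers import pipeline

# Hypothetical prompt; the real Docmatix prompt is not published in this post.
PROMPT = (
    "Below is the OCR transcription of one document page. "
    "Write 4 diverse question/answer pairs grounded in specific facts from the page "
    "(names, dates, amounts). Answers should be concise but human-like.\n\n"
    "Transcription:\n{transcription}"
)

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-small-8k-instruct",  # assumed checkpoint; ships custom code
    trust_remote_code=True,
)

ocr_text = "Invoice #4821, issued 2023-05-12 to John Doe, total due $1,250.00."
output = generator(PROMPT.format(transcription=ocr_text), max_new_tokens=512)
print(output[0]["generated_text"])
```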
To evaluate Docmatix’s performance, we conducted ablation studies using the Florence-2 model. We trained two versions of the model for comparison. The first version was trained over several epochs on the DocVQA dataset. The second version was trained for one epoch on Docmatix (20% of the images and 4% of the Q/A pairs), followed by one epoch on DocVQA to ensure the model produced the correct format for the DocVQA evaluation.
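As a rough sketch of what one supervised fine-tuning step for Florence-2 looks like with `transformers` (the checkpoint name, optimizer settings, and preprocessing are assumptions, not the exact training code used for these ablations):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Florence-2 ships custom modeling code, hence trust_remote_code=True.
model_id = "microsoft/Florence-2-base-ft"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(image: Image.Image, question: str, answer: str) -> float:
    """One step: encode the page image with the question, use the answer as labels."""
    inputs = processor(text=question, images=image, return_tensors="pt")
    labels = processor.tokenizer(answer, return_tensors="pt").input_ids
    loss = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```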
The results are significant: training on this small portion of Docmatix yielded a relative improvement of nearly 20%. Moreover, the 0.7B Florence-2 model performed only 5% worse than the 8B Idefics2 model, which is significantly larger and was trained on a mixture of datasets.
| Dataset | ANLS on DocVQA | Model size |
|---|---|---|
| Florence-2 fine-tuned on DocVQA | 60.1 | 700M |
| Florence-2 fine-tuned on Docmatix | 71.4 | 700M |
| Idefics2 | 74.0 | 8B |
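For reference, the ANLS metric reported above (Average Normalized Levenshtein Similarity, the standard DocVQA metric) can be computed roughly as in the following sketch; the 0.5 threshold follows the DocVQA challenge definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over (prediction, gold answers) pairs."""
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores)

print(anls(["$1,250.00"], [["$1,250.00", "1250.00"]]))  # 1.0
```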
Conclusion
In this post, we presented Docmatix, a gigantic dataset for DocVQA. We showed that by using Docmatix we can achieve a 20% increase in DocVQA performance when fine-tuning Florence-2. This dataset should help bridge the gap between proprietary and open-source VLMs. We encourage the open-source community to leverage Docmatix and train new, amazing DocVQA models! We can’t wait to see your models on the 🤗 Hub!
Useful Resources
We would like to thank merve and leo for their reviews and thumbnails for this blog.
