We’re excited to release the Llama Nemotron VLM Dataset V1, with 3 million samples of high-quality vision language model (VLM) training data focused on enterprise use cases such as optical character recognition (OCR), visual question answering (VQA), and captioning.
This dataset is a compilation of the high-quality VLM post-training datasets we used for training our recently released Llama 3.1 Nemotron Nano VL 8B V1. Llama Nemotron Nano VL is a model designed specifically for advanced intelligent document processing and understanding. The model topped the OCRBench V2 leaderboard, highlighting the quality of this training dataset.
The release represents a significant step forward in openness and transparency in model development. By releasing the training data, along with the training tools and final model weights, NVIDIA supports continued improvement of open source models.
What’s in the Dataset and How We Built It
At a high level, the Llama Nemotron VLM Dataset V1 consists of 67.0% VQA samples, 28.4% OCR samples, and 4.6% image captioning samples. Developers can use the dataset in its entirety, or curate it further with NVIDIA NeMo Curator to produce high-quality datasets for training highly accurate VLM models.
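The stated composition translates into approximate per-task sample counts. A minimal sketch (the counts below are rounded estimates derived from the announced percentages, not official figures):

```python
# Approximate sample counts implied by the stated composition of the
# 3-million-sample dataset. Percentages are from the announcement;
# the derived counts are rounded estimates.
TOTAL_SAMPLES = 3_000_000

composition = {
    "VQA": 0.670,
    "OCR": 0.284,
    "Image Captioning": 0.046,
}

# Shares should cover the whole dataset
assert abs(sum(composition.values()) - 1.0) < 1e-9

counts = {task: round(TOTAL_SAMPLES * share) for task, share in composition.items()}

for task, n in counts.items():
    print(f"{task}: ~{n:,} samples")
# VQA: ~2,010,000 samples
# OCR: ~852,000 samples
# Image Captioning: ~138,000 samples
```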
Let’s dive deeper into how we constructed this dataset.
This dataset introduces high-quality annotations that support world-class vision-language understanding. One area of focus was re-annotating popular visual question answering datasets with open-source technologies, so that the data can be permissively used for training. To extract additional rich information from input images that is not fully captured in the existing QA annotations, we used commercially permissive models and pipelines to generate fine-grained descriptions of the images.
Examples of augmentation include adding chain-of-thought explanations, rule-based QA generation (e.g., with templates), expanding short answers into longer responses, and proper reformatting. More details can be found in the paper Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models.
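To illustrate the template-based strategy, here is a minimal sketch of rule-based QA generation; the record fields and templates are illustrative assumptions, not the actual pipeline:

```python
# A minimal sketch of rule-based QA generation with templates, one of the
# augmentation strategies mentioned above. The annotation fields
# ("object", "attribute", "value") and the templates are hypothetical.

TEMPLATES = [
    ("What is the {attribute} of the {object} in the image?",
     "{value}"),
    ("Describe the {attribute} of the {object}.",
     "The {object}'s {attribute} is {value}."),
]

def generate_qa(record):
    """Expand one annotation record into template-based QA pairs."""
    pairs = []
    for q_tmpl, a_tmpl in TEMPLATES:
        pairs.append({
            "question": q_tmpl.format(**record),
            "answer": a_tmpl.format(**record),
        })
    return pairs

sample = {"object": "traffic sign", "attribute": "color", "value": "red"}
for qa in generate_qa(sample):
    print(qa["question"], "->", qa["answer"])
```

Each annotation record fans out into several QA pairs, which is one way a small pool of source annotations can be expanded into a much larger training set.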
Optical character recognition (OCR) is particularly important for document understanding, which involves tables and figures from varied sources and with diverse layouts. OCR is also essential for use cases like IT support and customer service. A VLM trained with this dataset gains a deeper comprehension of images containing text, tabular content, icons, and documents with structural variety. Details can be found in Éclair – Extracting Content and Layout with Integrated Reading Order for Documents. As part of this dataset release, we share:
- Entirely or partially synthetic OCR datasets (annotations + images) for character-, word-, and page-level English and Chinese OCR;
- A number of curated annotations for publicly available table and document OCR datasets;
- An internally annotated table OCR dataset.
The Llama Nemotron VLM Training Dataset was purpose-built to support enterprise use cases and production-grade AI development. It is ready for commercial use.
Data Example

Example 1: VQA_4, image: chartqa/train/png/multi_col_948.png. Shared under the GPL-3.0 license.
Prompt: "What company ranked second in the microprocessor market in 2020? Please provide a detailed explanation for your answer."
Response:
**Problem Restatement:** Determine the company that ranked second in the microprocessor market in 2020 based on the provided data.
**Step-by-Step Process:**
1. Identify the Market Share Data for 2020:
- Intel: 19.5%
- TSMC: 11.2%
- Qualcomm: 10.2%
- SK Hynix: 7.7%
- Broadcom: 7%
- Samsung: 6%
- Nvidia: 4.6%
- Sony: 4.6%
- Micron: 4.4%
2. Rank the Companies Based on Market Share:
- Intel has the highest market share at 19.5%.
- TSMC follows with a market share of 11.2%.
3. Conclusion:
- Since Intel has the highest market share, the company with the second-highest market share is TSMC.
**Final Answer:** TSMC
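The ranking step in the worked response above can be reproduced in a few lines; the figures are taken directly from the example:

```python
# Reproduce the ranking step from the worked example: sort companies by
# 2020 microprocessor market share (figures from the example above).
market_share = {
    "Intel": 19.5, "TSMC": 11.2, "Qualcomm": 10.2, "SK Hynix": 7.7,
    "Broadcom": 7.0, "Samsung": 6.0, "Nvidia": 4.6, "Sony": 4.6,
    "Micron": 4.4,
}

# Sort company names by share, highest first
ranked = sorted(market_share, key=market_share.get, reverse=True)

print("Second place:", ranked[1])  # Second place: TSMC
```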
Get Started
To summarize, NVIDIA has released a 3-million-sample dataset for OCR, visual question answering, and captioning tasks. Download the dataset from Hugging Face [here] and use it for your projects; we’re excited to see what you build!

