NVIDIA Releases 8 Million Sample Open Dataset and Tooling for OCR, Image Reasoning, Image and Video QA Tasks

Just two months ago, the NVIDIA Nemotron team released Llama Nemotron VLM Dataset V1 with 3 million samples of OCR, visual question answering, and captioning tasks. Today, we're releasing the subsequent version, Nemotron VLM Dataset V2, adding 8M samples, i.e., almost 3 times as many high-quality samples on top of the previous release. The combination of these two high-quality VLM post-training open datasets features 11M samples and was used for training our recently released Nemotron Nano 2 VL.

This time, our focus was on three main areas: adding new data modalities like video, expanding our chain-of-thought reasoning data, and providing the community with a toolchain to generate OCR training data.

We found that for further improvements, our models needed to learn not only the correct answer, but also the thought process to get there. Adding more targeted chain-of-thought datasets proved to be the key to breaking the plateau for a number of benchmarks.

We're excited to share this data with the community, and we're looking forward to seeing what you build with it.

This release is part of NVIDIA's commitment to open and transparent model development. By sharing training data, along with the training tools and final model weights, NVIDIA supports the continued improvement of open-source models.

NVIDIA is dedicated to ethical and safe AI, as demonstrated by the comprehensive safety and compliance reviews conducted on this dataset. These reviews included verifying permissive data sources, utilizing open-source or proprietary tools and labeling models, and performing CSAM and content-toxicity scans.



What's in the Dataset and How Did We Build It?

A pie chart showing the category distribution of Nemotron VLM Dataset V2

With this release, we're broadening the dataset scope to allow for training more capable models. We added:

New Modalities and Domains: We have added a substantial amount of new data covering UI understanding, complex charts, and diagrams. For the first time, we're also including video understanding tasks.

Focus on Reasoning: We have been able to break benchmark plateaus by adding more chain-of-thought SFT data, some of which we generated by auto-labeling thinking traces for existing samples. We found that providing those traces helped especially for samples that the previous model struggled with.

Improved OCR: We further improved on the highly competitive OCR capabilities of our first VL model by adding an even larger number of training samples, including multilingual data for ten languages.

Our OCR training data was generated using a novel LaTeX compilation pipeline that we used to process arXiv papers. Unlike "convert-to-HTML" methods that lose layout and semantics, our pipeline patches the TeX engine to emit precise glyph positions and semantic context, compiling LaTeX directly into richly labeled datasets with PDF page images, 2D bounding boxes, formatted text in Markdown, and semantic classes such as Title, Section-Header, Footnote, and Math. We're open-sourcing the pipeline so you can compile millions of LaTeX documents into training datasets. In addition, you can use our augmentation pipeline to obtain diverse layouts, fonts, and languages, and to translate content for multilingual data generation.
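To make the pipeline's output concrete, here is a minimal sketch in Python of what a single labeled page record could look like. The record layout and field names (`page_image`, `blocks`, `bbox`, `class`, `markdown`) are illustrative assumptions, not the pipeline's actual schema:

```python
import json

# Hypothetical JSON-lines record for one compiled LaTeX page.
# Field names are illustrative; consult the released pipeline for the real schema.
record = {
    "page_image": "paper_0001_page_03.png",     # rendered PDF page image
    "language": "en",
    "blocks": [
        {
            "class": "Section-Header",           # semantic class from the patched TeX engine
            "bbox": [72.0, 96.5, 540.0, 118.2],  # 2D bounding box in page coordinates
            "markdown": "## 3 Method",
        },
        {
            "class": "Math",
            "bbox": [120.4, 310.0, 480.9, 342.7],
            "markdown": "$L = -\\sum_i y_i \\log \\hat{y}_i$",
        },
    ],
}

# One record per line, as a JSON-lines training file might store it.
print(json.dumps(record))
```

Each block pairs a semantic class with its bounding box and Markdown-formatted text, which is the kind of supervision an OCR model needs to learn both layout and content.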

The Nemotron VLM Dataset V2 consists of 55% image QA samples, 25% OCR samples, 10% video QA samples, and 10% image reasoning samples. Developers can use the dataset in its entirety or curate it further using NVIDIA NeMo Curator to produce high-quality datasets for training highly accurate VLM models.

We re-annotated popular visual question answering datasets with proprietary and open-source technologies, so that the data can be permissively used for training. To extract additional rich information from input images that may not be fully captured in the existing QA annotations, we used commercially permissive models and pipelines to generate fine-grained descriptions of the images.

Examples of augmentation include adding chain-of-thought explanations, rule-based QA generation (e.g., with templates), expanding short answers into longer responses, as well as proper reformatting.
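To illustrate the rule-based, template-driven QA generation mentioned above, here is a minimal sketch that pairs chart metadata with question templates. The metadata fields and templates are invented for illustration (they mirror the PlotQA example shown later), not the actual pipeline:

```python
# Minimal sketch of template-based QA generation from structured chart metadata.
# The metadata fields and templates are hypothetical examples.
chart_meta = {
    "x_axis_title": "Year",
    "legend_labels": ["Self-employed", "Unpaid family employees", "Wage employees"],
}

templates = [
    ("How many legend labels are there?", lambda m: str(len(m["legend_labels"]))),
    ("What is the label or title of the X-axis?", lambda m: m["x_axis_title"]),
]

# Instantiate each template against the metadata to produce QA pairs.
qa_pairs = [(question, answer_fn(chart_meta)) for question, answer_fn in templates]
for question, answer in qa_pairs:
    print(f"Q: {question}\nA: {answer}")
```

Because the answers are derived mechanically from the chart's ground-truth metadata, such pairs are cheap to generate at scale and are correct by construction.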

| Category        | No. of Samples in V1 | No. of Samples Added in V2 |
|-----------------|----------------------|----------------------------|
| Image QA        | 1,917,755            | 4,466,777                  |
| OCR             | 814,381              | 2,118,483                  |
| Captioning      | 131,718              | N/A                        |
| Video QA        | N/A                  | 829,201                    |
| Image Reasoning | N/A                  | 733,138                    |
| Total           | 2,863,854            | 8,147,599                  |

The Nemotron VLM Training Dataset was purpose-built to support enterprise use cases and production-grade AI development. It is ready for commercial use.



Data Examples



Example 1: Video Understanding

example 1 - a single frame from a video in the dataset

Dataset: ego_exo_learn, Single frame from video: beead82c-ac78-11ee-819f-80615f12b59e.mp4. Shared under MIT license

Prompt: From the camera carrier's perspective, narrate the movements and interactions involving their hands.

Response: The left hand injects the solution into the small test tube filled with white powder.



Example 2: Chart Reasoning with Chain of Thought

example 2 - a chart from this dataset

Dataset: plotqa_cot, 8106.png. Shared under CC-BY-4.0 license

Prompt: How many legend labels are there? Just give me the answer, no need for the process.

Response: Got it, let's take a look at the legend on the right side of the image. The legend has three labels: "Self-employed", "Unpaid family employees", and "Wage employees". So we need to count how many labels there are. Let's count them: the first one is Self-employed, the second is Unpaid family employees, the third is Wage employees. So that's three labels.

3

Prompt: What is the label or title of the X-axis? Directly provide the answer.

Response: Got it, let's take a look at the graph. The X-axis has labels like "2009" and "2012", and the title of the X-axis is "Year". Wait, let me check. The X-axis label is "Year" because that is what's written below the axis. So the title of the X-axis is "Year".

Year



How to Get It

👉 Explore the dataset here on Hugging Face to use it for your projects!
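For a quick first look, you can stream samples with the Hugging Face `datasets` library, as in the sketch below. The repository id and split name here are assumptions; check the dataset card for the exact values:

```python
from datasets import load_dataset

# Stream the dataset so you don't have to download all 8M samples up front.
# The repo id and split name are assumptions; see the dataset card on Hugging Face.
ds = load_dataset("nvidia/Nemotron-VLM-Dataset-v2", split="train", streaming=True)

# Inspect a few records before wiring up a full training pipeline.
for sample in ds.take(3):
    print(sample)
```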


