A collaborative approach to scaling trustworthy AI systems and agents.
AI progress is usually framed in terms of model capability and efficiency. In reality, every training pipeline ultimately rests on a knowledge layer that determines how those models behave.
As agentic systems become more autonomous, the data they’re trained on increasingly determines what they know, how they reason, and what they can safely do. Yet much of today’s training data remains opaque, fragmented, or siloed across teams.
Open data access changes that equation. It gives developers a faster, less expensive path to building high-quality models, while making evaluation and improvement easier across the ecosystem. That’s why NVIDIA releases open datasets alongside its open models, tools, and training techniques.
AI-Data Bottlenecks
Building high-quality datasets remains one of the biggest bottlenecks in AI development. Organizations often spend millions of dollars and months, or even more than a year, collecting, annotating, and validating data before a single model training run begins. Even once models are deployed, access to domain expertise and evaluation frameworks remains an evergreen challenge.
NVIDIA aims to reduce this friction by publishing permissively licensed datasets on Hugging Face, along with training recipes and evaluation frameworks on GitHub that developers can build on immediately. So far, we’ve shared more than 2 petabytes of AI-ready training data across more than 180 datasets and 650+ open models. And we’re just getting started.
Real-World Open Datasets
NVIDIA’s open data releases span multiple domains, from robotics and autonomous systems to sovereign AI, biology, and evaluation benchmarks. Built by teams across NVIDIA, these datasets demonstrate how shared data can accelerate real-world AI development.
Here are a few examples from across our ecosystem:
Physical AI Collection
Robotics systems require structured, multimodal data. This collection includes 500K+ robotics trajectories, 57M grasps, and 15TB of multimodal data, including assets used to develop the NVIDIA GR00T reasoning vision-language-action model across multiple gripper types and sensor configurations. The dataset has been downloaded more than 10 million times, including by companies such as Runway, which developed its recently released GWM-Robotics world model using the open GR00T dataset, and robotics simulation company Lightwheel, which is using the dataset to refine robotics policies.
This collection also includes one of the most geographically diverse AV datasets available, with more than 1,700 hours of multi-sensor data that includes 7-camera configurations plus LiDAR and radar, spanning 25 countries and over 2,500 cities. That breadth supports perception benchmarking across varied driving environments and complements academic datasets with broader commercial usability.
Nemotron Personas Collection
Nemotron Personas are fully synthetic persona datasets grounded in real-world demographic distributions, producing culturally authentic, diverse individuals across regions and languages at scale.
The collection supports sovereign AI development and currently includes population-scale datasets for:
- United States – 6M personas
- Japan – 6M personas
- India – 21M personas
- Brazil – 6M personas (developed with WideLabs)
- Singapore – 888K personas (developed with AI Singapore)
These datasets are already powering real deployments globally. CrowdStrike used 2M personas to improve NL→CQL translation accuracy from 50.7% to 90.4%. In Japan, NTT Data and APTO used the datasets to bootstrap domain-specific intelligence with minimal proprietary data, improving legal QA accuracy from 15.3% to 79.3% and reducing attack success rates from 7% to 0%.
The datasets also supported the development of NVIDIA Nemotron-Nano-9B-v2-Japanese, a state-of-the-art sub-10B model that reached the top of the Nejumi leaderboard.
La Proteina
La Proteina is a fully synthetic, atomistic protein dataset designed for biological modeling and drug discovery workflows. With 455,000 structures and a state-of-the-art 73% structural diversity boost over prior baselines, it provides design-ready molecular representations without PII or licensing constraints. It is a scientific achievement made possible by an open collaboration with researchers from Oxford, Mila, and CIFAR.
SPEED-Bench
SPEED-Bench is a standardized benchmark for evaluating speculative decoding performance. It features two splits: a Qualitative Split that maximizes semantic diversity across 11 text categories, and a Throughput Split organized into input sequence length buckets (1K–32K) for constructing accurate throughput Pareto curves using real semantic data rather than random tokens. Already adopted internally as the primary benchmark for Nemotron MTP performance, SPEED-Bench gives teams a consistent methodology for evaluating draft performance across prompt complexities and context lengths.
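To illustrate how a throughput split like this can be assembled, the sketch below groups prompts into power-of-two length buckets. The exact bucket boundaries and helper names here are assumptions for illustration, not SPEED-Bench’s actual implementation:

```python
# Group prompts into power-of-two input-length buckets (1K-32K), in the
# spirit of a throughput split. Boundaries are illustrative assumptions.
from collections import defaultdict

BUCKETS = [1024, 2048, 4096, 8192, 16384, 32768]

def bucket_for(n_tokens: int) -> int:
    """Return the smallest bucket that holds n_tokens, capped at 32K."""
    for b in BUCKETS:
        if n_tokens <= b:
            return b
    return BUCKETS[-1]

def group_by_length(prompt_lengths: dict) -> dict:
    """Map bucket size -> prompt ids, given token counts per prompt id."""
    groups = defaultdict(list)
    for pid, n in prompt_lengths.items():
        groups[bucket_for(n)].append(pid)
    return dict(groups)
```

With prompts binned this way, per-bucket throughput numbers can be plotted against bucket size to form a Pareto curve over context lengths.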
Retrieval-Synthetic-NVDocs-v1
This synthetic retrieval dataset provides 110,000 triplets of query, passage, and answer generated from 15,000 files of NVIDIA public documentation. Designed to train and evaluate embedding and RAG systems, the dataset features semantically rich QA pairs spanning multiple reasoning types (factual, relational, procedural, inferential, temporal, causal, and visual), alongside diverse query types including structural, multi-hop, and contextual queries. In-domain fine-tuning of embedding models demonstrates substantial gains: fine-tuning nvidia/llama-nemotron-embed-1b-v2 on this dataset yields an 11% increase in NDCG@10. The dataset can be generated in roughly 3–4 days, and fine-tuning takes about two hours on 8×A100 GPUs, enabling rapid iteration from dataset to deployed model.
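For reference, NDCG@10, the retrieval metric cited above, rewards rankings that place relevant passages near the top by discounting each result’s relevance by the log of its rank. A minimal, self-contained sketch:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, a ranking that buries the only relevant passage at position 2 instead of position 1 scores `log2(2)/log2(3) ≈ 0.63` rather than 1.0, which is why in-domain fine-tuning that reorders results can move NDCG@10 substantially.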
Nemotron-ClimbMix
ClimbMix is a 400B-token pre-training dataset built using the CLIMB algorithm, which uses embedding-based clustering and iterative refinement to discover higher-quality data mixtures for language model training. The dataset has already gained strong community traction: Andrej Karpathy highlighted Nemotron-ClimbMix as delivering the biggest improvement on the Time-to-GPT-2 leaderboard, leading to its adoption as the default data recipe in NanoChat Speedrun and reducing H100 compute time by roughly 33% compared with the previous FineWeb-Edu setup. ClimbMix is released under the CC-BY-NC-4.0 license.
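The clustering step behind a CLIMB-style mixture search can be sketched at toy scale: assign documents to centroids in embedding space, then treat cluster proportions as candidate mixture weights to be refined iteratively. This is a simplified illustration under invented names and toy 2-D “embeddings,” not the actual CLIMB implementation:

```python
def nearest(centroids, point):
    """Index of the closest centroid by squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def mixture_weights(centroids, embeddings):
    """Fraction of documents falling into each cluster; these proportions
    serve as a starting mixture that iterative refinement then reweights."""
    counts = [0] * len(centroids)
    for e in embeddings:
        counts[nearest(centroids, e)] += 1
    total = len(embeddings)
    return [c / total for c in counts]
```

In the full algorithm, these cluster-level weights are perturbed and re-evaluated against a proxy model’s loss to search for higher-quality mixtures.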
These releases reflect an ongoing investment in the shared reference layer that AI developers depend on across modalities and model lifecycle stages.
Nemotron Training Datasets
One major component of NVIDIA’s open data work is the set of datasets used to train and align the Nemotron model family. Over the past year, these datasets have evolved to better support reasoning, coding, and multilingual capabilities in frontier language models.
Nemotron Pre-Training Evolution

Nemotron Pre-Training Evolution Chart
Earlier releases relied heavily on general web corpora, while newer releases emphasize higher-signal domains such as math, code, and STEM knowledge. This increased signal density enables models to learn stronger reasoning and problem-solving capabilities.
The Nemotron pre-training stack includes several curated datasets designed for various capabilities:
- Nemotron-CC – globally deduplicated web data rewritten for higher signal density
- Nemotron-CC-Math and Nemotron-CC-Code – math and code reasoning preserving LaTeX and code formatting
- Nemotron-Pretraining-Code – curated programming datasets from large code repositories
- Nemotron-Pretraining-Specialized – synthetic datasets to boost capabilities in key domains such as algorithms, economics, logic, and STEM reasoning
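To make “globally deduplicated” concrete, the simplest form is exact deduplication by normalized content hash. Production pipelines typically also use fuzzy near-duplicate methods such as MinHash; this sketch shows only the exact-match case and is not the actual Nemotron-CC pipeline:

```python
import hashlib

def dedup_exact(docs):
    """Keep the first occurrence of each document, comparing documents by a
    hash of their whitespace-normalized, lowercased text."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Hashing (rather than storing full texts in the seen-set) keeps memory bounded, which matters when deduplicating web-scale corpora.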
Together, these datasets form the foundation for general-purpose language models capable of reasoning, coding, and multilingual understanding. They power Nemotron as well as partner frontier models such as AI security company Trend Micro’s Primus-Labor-70B.
Nemotron-Post-Training Evolution

Nemotron Post-Training Evolution Chart
As models become more capable, post-training data plays a larger role in shaping model behavior. Newer releases emphasize multilingual diversity, structured reasoning supervision, and agent-style interaction data.
Key datasets in the Nemotron post-training stack include:
- Nemotron-Instruction-Following-Chat – structured conversational supervision
- Nemotron-Science – synthetic science reasoning datasets
- Nemotron-Math-Proofs – formal mathematical reasoning datasets
- Nemotron-Agentic – datasets supporting multi-step planning and tool use
- Nemotron-SWE – instruction tuning datasets for software engineering tasks
These datasets provide structured supervision that helps models follow complex instructions, generate reasoning traces, and perform reliably in multi-step tasks. Early iterations were blended with domain data to develop ServiceNow’s Apriel Nemotron 15B / Apriel 1.6 Thinker, which surpassed Gemini 2.5 Flash and Qwen3 at the 15B parameter scale, and Hugging Face’s SmolLM3, a popular small language model.
NVIDIA is also expanding this work with open safety and reinforcement learning datasets, including Nemotron-Agentic-Safety (11K labeled telemetry traces from tool-use workflows) and Nemotron-RL, a 900K-task corpus spanning math, coding, tools, puzzles, and reasoning that gives models a true training “gym.”
Extreme Co-Design
Designing high-quality datasets at this scale is a team sport. It requires close collaboration between data strategists, AI researchers, infrastructure engineers, and policy experts.
At NVIDIA, we approach data the same way we approach any software and hardware engineering problem: through what we call extreme co-design, designing all components together to eliminate bottlenecks at scale.
When possible, we release the datasets as well as the methods behind them. The open community and our partners then stress-test them, surface edge cases, and extend the datasets into new domains. Those insights feed directly into the next iteration, improving both our internal systems and the broader AI ecosystem.

CES 2026 Keynote
NVIDIA also collaborates with partners through initiatives like ViDoRe and CVDP, two consortia that bring together industry and academic partners to develop open benchmarks and evaluation frameworks for emerging AI systems.
Start Cooking in the Open Kitchen
At NVIDIA, we think about open data much like an open kitchen. The ingredients are visible, the recipes are shared, and everyone can learn from how the dish is prepared.
We encourage anyone passionate about data science and model building to explore NVIDIA’s open datasets on Hugging Face, try our tutorials and Nemotron labs, and join the Nemotron community on Discord to collaborate on future datasets.
The next generation of trustworthy AI models and agentic systems will be built on shared foundations. Open data is one of them.
