NVIDIA is doubling down on its commitment to open, high-quality AI with the release of Nemotron-Pre-Training-Dataset-v1, a pretraining dataset comprising 6.6 trillion tokens of premium math, code, and multilingual Q&A data, built from rigorously curated, high-signal web content and large-scale synthetic data generation.
Released alongside the NVIDIA Nemotron Nano 2 family of large language models, this dataset is not just a research artifact; it is the very data used to train these leading open models.
The results speak for themselves:

Figure. Comparison from the tech report of Nemotron Nano V2 and Qwen3-8B in terms of accuracy and throughput. NVIDIA-Nemotron-Nano-v2-9B achieves comparable or higher accuracy on complex reasoning benchmarks while achieving up to 6.3 times higher throughput for such workloads. We abbreviate input sequence length as ISL and output sequence length as OSL.
Dataset Breakdown
The Nemotron-Pre-Training-Dataset-v1 collection is organized into four core categories, plus a small sampled release:
- Nemotron-CC-v2: Follow-up to Nemotron-CC (Su et al., 2025) with eight additional Common Crawl snapshots (2024–2025). The data has undergone global deduplication and synthetic rephrasing using Qwen3-30B-A3B. It also includes synthetic diverse QA pairs translated into 15 languages, supporting robust multilingual reasoning and general knowledge pretraining.
- Nemotron-CC-Math-v1: A 133B-token math-focused dataset derived from Common Crawl using NVIDIA’s Lynx + LLM pipeline, which preserves equations and code formatting while standardizing math content to LaTeX. This ensures critical math and code snippets remain intact, resulting in high-quality pretraining data that outperforms prior math datasets on benchmarks. See our blog for more details.
- Nemotron-Pretraining-Code-v1: A large-scale curated code dataset sourced from GitHub and filtered through multi-stage deduplication, license enforcement, and heuristic quality checks. It also includes LLM-generated code question–answer pairs in 11 programming languages.
- Nemotron-Pretraining-SFT-v1: A synthetically generated dataset covering STEM, academic, reasoning, and multilingual domains. This includes complex multiple-choice and analytical questions derived from high-quality math and science seeds, graduate-level academic texts, and instruction-tuned SFT data spanning math, code, general QA, and reasoning tasks.
- Nemotron-Pretraining-Dataset-sample: A small sampled version of the dataset with 10 representative subsets, offering insight into high-quality QA data, math-focused extractions, code metadata, and SFT-style instruction data (see the loading sketch after this list).
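To get a quick feel for these subsets, the sampled release can be browsed with the 🤗 Datasets library. The following is a minimal sketch: the repository id is inferred from the dataset name above and the split name is an assumption, so check the dataset card for the exact configs and schema.

```python
from datasets import get_dataset_config_names, load_dataset

# List the subsets (configs) shipped with the sampled release.
# NOTE: the repository id is inferred from the dataset name above;
# the exact config names are documented on the dataset card.
configs = get_dataset_config_names("nvidia/Nemotron-Pretraining-Dataset-sample")
print(configs)

# Stream one subset and inspect a single record without downloading
# the full file. The "train" split name is an assumption.
ds = load_dataset(
    "nvidia/Nemotron-Pretraining-Dataset-sample",
    configs[0],
    split="train",
    streaming=True,
)
print(next(iter(ds)))
```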
Token distribution
| Dataset Category | Token Count (B) |
|---|---|
| English Common Crawl | 3360.1 |
| English Synthetic CC | 1257.1 |
| Diverse QA | 692.4 |
| Translated Diverse QA | 558.1 |
| Math | 206.3 |
| Math SFT | 190.4 |
| Synthetic Code | 175.1 |
| MMLU SFT | 81.7 |
| Code SFT | 58.5 |
| General SFT | 5.8 |
| TOTAL | 6585.4 |
Moreover, we release metadata to reproduce a 747.4B-token curated code dataset.
What’s in the Dataset and How Did We Build It?
Math
In constructing this dataset, we paid special attention to preserving high-value mathematical and code content from Common Crawl, data that is often lost or corrupted in typical pretraining pipelines. Our work (see full details in the math blog post) introduces a new extraction process that:
- Correctly renders math equations in multiple formats (MathJax, KaTeX, MathML, LaTeX) using a layout-aware text browser,
- Uses a lightweight LLM pass to clean boilerplate, standardize equations to LaTeX, and fix formatting errors,
- Retains code blocks with full syntax and indentation, instead of flattening them into plain text as many previous datasets do.
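To make the extraction step concrete, here is a minimal sketch of a layout-aware text rendering pass using the lynx text browser, with the LLM cleanup represented by a placeholder. The example URL and function names are illustrative; the production prompts and model are described in the math blog post and are not reproduced here.

```python
import subprocess

def extract_text(url: str) -> str:
    """Render a page with the lynx text browser so equation markup and
    code indentation survive, rather than stripping HTML tags naively.
    Assumes lynx is installed locally."""
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def llm_cleanup(text: str) -> str:
    """Placeholder for the lightweight LLM pass that removes boilerplate
    and standardizes math to LaTeX. Swap in a model call of your choice;
    this sketch simply returns its input."""
    return text

if __name__ == "__main__":
    # The URL below is purely illustrative.
    page = extract_text("https://example.com/some-math-page")
    print(llm_cleanup(page)[:500])
```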
The result is 133B tokens of math-rich documents in our full corpus, with a 52B-token highest-quality subset. This high-quality set is 5.5× larger than the best previous open math dataset (FineMath-4+). We also regenerated the Nemotron-MIND dataset using our highest-quality subset, resulting in a 73B-token synthetic dataset that consistently improves math reasoning and general knowledge (MMLU, MMLU-Pro, MMLU-STEM) and gains +14.4 points on MATH over the prior MIND version.
Because our pipeline preserves structure instead of stripping it away, we also capture a large incidental set of code snippets (over 4.3M code-containing documents), making the data useful for both mathematical reasoning and code generation.
In internal pretraining experiments, models trained with Nemotron-CC-Math data saw:
- +4.8 to +12.6 points on MATH over the strongest baselines,
- +4.6 to +14.3 points on MBPP+ for code generation,
- +2 to +5 points on STEM-heavy general knowledge benchmarks like MMLU-STEM.
This means the dataset not only boosts mathematical ability, but also strengthens logical reasoning, coding skills, and general-domain knowledge.
Code
Our curated code dataset comprises 747.4B tokens of GitHub-sourced files that underwent multi-stage deduplication, license enforcement, and heuristic filtering. We are releasing the metadata needed to reproduce this dataset. In addition, we generate and release large-scale synthetic question–answer pairs across 11 programming languages, totaling 175.1B tokens.
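As a rough illustration of the kind of heuristic quality checks applied during curation, here is a minimal sketch; the thresholds and rules below are assumptions for illustration, not the filters actually used for the release.

```python
def passes_heuristic_filter(source: str,
                            max_line_len: int = 1000,
                            min_alnum_ratio: float = 0.25) -> bool:
    """Toy quality filter for a source file.

    The thresholds here are illustrative assumptions, not the production
    rules used to curate Nemotron-Pretraining-Code-v1.
    """
    lines = source.splitlines()
    if not lines:
        return False
    # Reject files dominated by extremely long (likely minified) lines.
    if max(len(line) for line in lines) > max_line_len:
        return False
    # Reject files that are mostly non-alphanumeric noise.
    alnum = sum(ch.isalnum() for ch in source)
    if alnum / max(len(source), 1) < min_alnum_ratio:
        return False
    return True

# Example: ordinary code is kept; a minified one-liner would be rejected.
print(passes_heuristic_filter("def add(a, b):\n    return a + b\n"))  # True
```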
Diverse QA & Multilingual
We generated high-quality multilingual question–answer data from two main sources. First, we translated our English Diverse QA dataset into 15 languages using Qwen3-30B-A3B, ensuring accurate linguistic and contextual alignment. Second, we generated synthetic QA pairs directly in these languages from Wikipedia articles, prompting the model to write both questions and answers in the target language. Additionally, a subset of our GSM8K STEM augmentation data was translated, with each solution post-processed to append a clear concluding sentence indicating the final answer (e.g., “La respuesta es …” in Spanish, “Die Antwort lautet …” in German).
This multilingual pipeline provides broad linguistic coverage and a strong problem-solving focus. In our ablation studies, including this translated diverse QA data boosted average Global-MMLU accuracy to 47.0, compared to 37.0 when using only multilingual Common Crawl data.
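A minimal sketch of the concluding-sentence post-processing step might look like the following; the helper name and the language table are illustrative, and only the two phrasings quoted above come from the text.

```python
# Language-specific concluding phrases. The Spanish and German entries
# are quoted from the text above; other target languages would need
# their own entries.
ANSWER_PHRASES = {
    "es": "La respuesta es {answer}.",
    "de": "Die Antwort lautet {answer}.",
}

def append_final_answer(solution: str, answer: str, lang: str) -> str:
    """Append a clear concluding sentence stating the final answer.

    Illustrative helper, not the exact post-processing code used for the
    translated GSM8K augmentation data.
    """
    phrase = ANSWER_PHRASES[lang].format(answer=answer)
    return solution.rstrip() + "\n" + phrase

print(append_final_answer("2 + 2 = 4", "4", "es"))
# 2 + 2 = 4
# La respuesta es 4.
```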
SFT Data
We included synthetically generated SFT-style data to strengthen model reasoning, code generation, and instruction-following abilities. This covers:
- Code SFT: solving programming problems across multiple languages.
- Math SFT: complex reasoning and problem-solving.
- MMLU-style SFT: diverse query–answer examples across knowledge domains.
- General instruction SFT: broad instruction-following tasks.
The data spans multiple difficulty levels and topics, ensuring comprehensive pretraining that complements our STEM, academic, and multilingual datasets.
Data Examples
Example: Our pipeline preserves both math and code, unlike prior pretraining datasets that often lose or corrupt math equations.
How to Use It
All of the datasets in the collection can be accessed using the 🤗 Datasets library. For example:
```python
from datasets import load_dataset
ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
```
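With streaming=True the call returns an iterable dataset, so records can be inspected without downloading the full corpus. A short follow-up sketch, assuming the subset exposes a train split and a text field (the actual schema is documented on the dataset card):

```python
# Peek at the first few records of the streamed subset.
# The split and field names are assumptions; consult the dataset card
# for the actual schema.
for i, record in enumerate(ds["train"]):
    print(record.get("text", record))
    if i == 2:
        break
```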
👉 Explore the dataset collection here and get in touch to discuss enterprise or research use cases.
References
👉 For more details, please see the following paper and technical report.
Contributors
Nemotron-CC-Math-v1. Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye.
Nemotron-CC-v2. Ying Lin, Dan Su, Kezhi Kong, Joseph Jennings, Brandon Norick, Arham Mehta, Ayush Dattagupta, Ranjit Rajan, Sarah Yurick, Vineeth Kalluru, Markus Kliegl.
Nemotron-Pretraining-Code-v1. Brandon Norick, Joseph Jennings, Miguel Martinez, Vitaly Kurin, Rabeeh Karimi Mahabadi.
Nemotron-Pretraining-SFT-v1. Abhinav Khattar, Aleksander Ficek, Brandon Norick, Dan Su, Daria Gitman, Evelina Bakhturina, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jane Polak Scowcroft, Jocelyn Huang, Joseph Jennings, Jupinder Parmar, Markus Kliegl, Matvei Novikov, Mehrzad Samadi, Miguel Martinez, Pavlo Molchanov, Pritam Gundecha, Rabeeh Karimi Mahabadi, Rima Shahbazyan, Sanjeev Satheesh, Sean Narenthiran, Seungju Han, Shizhe Diao, Shrimai Prabhumoye, Shubham Toshniwal, Siddhartha Jain, Somshubra Majumdar, Syeda Nahida Akter, Vahid Noroozi, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Ximing Lu, Yejin Choi, Ying Lin.
Legal and Compliance. Barnaby Simkin, Dina Yared, Iain Cunningham, Katherine Cheung, Laya Sleiman, Meredith Price, Michael Boone, Nikki Pope, Saori Kaji.
Project Management. Amy Shen, Ann Guan, Ashton Sharabiani, Krzysztof Pawelec, Negar Habibi, Twinkle Vashishth.
Leadership. Jane Polak Scowcroft, Boris Ginsburg, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro.


