Authors: Dhruv Nathawani, Shuoyang Ding, Vitaly Lavrukhin, Jane Polak Scowcroft, Oleksii Kuchaiev
NVIDIA continues releasing permissive datasets in support of the open ecosystem, this time with the 6 Million Multilingual Reasoning Dataset.
Continuing the success of the recent Nemotron Post-Training Dataset v1 release used in the Llama Nemotron Super model, and our Llama Nemotron Post-Training Dataset release earlier this year, we're excited to release the reasoning dataset translated into five target languages: French, Spanish, German, Italian, and Japanese.
The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge with leading accuracy and efficiency, using a hybrid Transformer–Mamba architecture and a configurable thinking budget, so you can dial accuracy, throughput, and cost to match your real-world needs.
Model Highlights (TL;DR)
- Model size: 9B parameters
- Architecture: Hybrid Transformer–Mamba (Mamba-2 plus a small number of attention layers) for higher throughput at accuracy similar to Transformer-only peers
- Throughput: Up to 6× higher token generation than other leading models in its size class
- Cost: The thinking budget lets you control how many "thinking" tokens are used, reducing reasoning costs by up to 60%
- Target use cases: Agents for customer service, support chatbots, analytics copilots, and edge/RTX deployments
- Availability: Model weights are available on Hugging Face, you can try the endpoint on build.nvidia.com, and the model will be available as an NVIDIA NIM for high throughput and low latency
- License: nvidia-open-model-license
This release represents a significant step forward in our continued commitment to openness and transparency in model development. By releasing the training data, along with the training tools and final model weights, NVIDIA supports continued improvement of open-weight models.
What's in the dataset and how we built it
At a high level, the Nemotron Post-Training Dataset v2 takes our previously released English reasoning data and translates it into five target languages (French, German, Italian, Japanese, Spanish). To best take advantage of the English knowledge instilled during pre-training, we translate the user prompt and model response while preserving the original English reasoning chain.
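To make this concrete, the sketch below shows how a single example might be processed under this scheme. It is a minimal illustration rather than the production pipeline: the field names (prompt, response, language), the use of <think>...</think> tags to delimit the reasoning chain, and the translate_text callable are all assumptions made for the purpose of the example.

```python
import re

# Assumption: the reasoning chain is stored inline as <think>...</think>.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def translate_example(example: dict, translate_text, target_lang: str) -> dict:
    """Translate the user prompt and model response into target_lang,
    while leaving the English reasoning chain untouched.

    translate_text(text, target_lang) is a placeholder for the line-by-line
    LLM translation step sketched later in this post.
    """
    response = example["response"]  # hypothetical field name

    # Pull out the English reasoning chain so it is preserved verbatim.
    match = THINK_RE.search(response)
    reasoning = match.group(0) if match else ""
    answer = THINK_RE.sub("", response)

    translated = dict(example)
    translated["prompt"] = translate_text(example["prompt"], target_lang)
    # Keep the original English reasoning ahead of the translated answer.
    translated["response"] = reasoning + translate_text(answer, target_lang)
    translated["language"] = target_lang
    return translated
```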
Based on results from the WMT 2024 general translation shared task, LLMs are achieving state-of-the-art results on machine translation tasks. However, for synthetic generation of post-training data, our preliminary studies have shown that:
- LLMs are more susceptible to hallucinations when translating SFT datasets compared with common machine translation test sets (e.g., FLORES).
- The translation quality and hallucination rate of open-source LLMs deteriorate significantly as input length increases.
Hence, we incorporate several mechanisms to maintain high translation quality and make hallucination detection straightforward. To summarize (a minimal sketch of these steps follows Table 1):
- We break the text down by newline and translate line by line. If a line is non-translatable (e.g., only tabs) or is part of a code block, it is not translated.
- We enforce a specific output format ("Wrap the translated text in brackets 〘〙") and use this special matching bracket to extract translations. Examples that do not follow the format are discarded (see Table 1).
- We run fastText language ID on the translated prompts to filter out off-target data points. This discarded another 55,567 examples (roughly 1.1% of all multilingual examples).
Table 1: Ratio of discarded data (measured in bytes) when enforcing the output format
| Language | code | qa | math |
|---|---|---|---|
| de | 2.28% | 1.11% | 2.47% |
| es | 26.14% | 5.15% | 6.38% |
| fr | 11.01% | 1.37% | 1.96% |
| it | 4.94% | 1.36% | 0.75% |
| ja | 7.68% | 2.51% | 3.86% |
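A minimal sketch of these mechanisms, under stated assumptions, is shown below. The exact instruction wording, the handling of code fences, and the fastText model file (lid.176.bin) are illustrative choices rather than the exact production setup; translate_prompt stands in for a call to the translation LLM described next.

```python
import re
import fasttext  # pip install fasttext; lid.176.bin is the pretrained language-ID model

BRACKETS_RE = re.compile(r"〘(.*?)〙", flags=re.DOTALL)
LANG_ID = fasttext.load_model("lid.176.bin")

def translate_line_by_line(text: str, translate_prompt, target_lang: str):
    """Translate text line by line; return None if any line breaks the format."""
    out_lines, in_code_block = [], False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_code_block = not in_code_block
            out_lines.append(line)          # keep code fences as-is
            continue
        if in_code_block or not stripped:
            out_lines.append(line)          # code and non-translatable lines are untouched
            continue
        raw = translate_prompt(
            f"Translate the following text into {target_lang}. "
            f"Wrap the translated text in brackets 〘〙.\n{line}"
        )
        match = BRACKETS_RE.search(raw)
        if match is None:
            return None                     # enforce the format; discard the example
        out_lines.append(match.group(1))
    return "\n".join(out_lines)

def is_on_target(translated_prompt: str, target_lang: str) -> bool:
    """Keep only examples whose translated prompt matches the target language."""
    labels, _ = LANG_ID.predict(translated_prompt.replace("\n", " "))
    return labels[0] == f"__label__{target_lang}"   # e.g., "__label__fr"
```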
After benchmarking, we selected Qwen2.5-32B-Instruct-AWQ (for German) and Qwen2.5-14B-Instruct (for the other languages) to perform the translation. The considerations for choosing these models include the following (a minimal inference sketch follows this list):
- Robust translation quality
- Can fit onto a single A100 GPU for inference
- Wide domain coverage in training data
- Open license (Apache 2.0)
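For reference, the raw translation call (the translate_prompt placeholder above) could be implemented with the selected model via Hugging Face Transformers roughly as follows. This is a simplified sketch with greedy decoding and no batching or optimized serving; it is not the exact setup used to produce the dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-14B-Instruct"  # Qwen2.5-32B-Instruct-AWQ was used for German
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def translate_prompt(prompt: str) -> str:
    """Single-turn chat completion returning the raw model output."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```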
How to use it
```python
from datasets import load_dataset

# Download the full multilingual post-training dataset from Hugging Face.
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2")
```
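Once loaded, you can inspect the splits and columns and subset the data as needed. The split and column names used below are assumptions for illustration; check the dataset card for the actual schema.

```python
# Print the available splits and their features.
print(ds)

# Example: keep only one language, assuming a "language"-style column exists
# (check the dataset card for the actual column name).
# french_only = ds["train"].filter(lambda ex: ex["language"] == "fr")
```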
👉 Explore the dataset here: Hugging Face dataset page

