Authors: Dhruv Nathawani, Shuoyang Ding, Vitaly Lavrukhin, Jane Polak Scowcroft, Oleksii Kuchaiev
NVIDIA continues releasing permissive datasets in support of the open ecosystem, this time with the 6 Million Multilingual Reasoning Dataset.
Continuing the success of the recent Nemotron Post-Training Dataset v1 release used in the Llama Nemotron Super model, and our Llama Nemotron Post-Training Dataset release earlier this year, we're excited to release the reasoning dataset translated into five target languages: French, Spanish, German, Italian, and Japanese.
The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge with leading accuracy and efficiency, using a hybrid Transformer–Mamba architecture and a configurable thinking budget, so you can dial accuracy, throughput, and cost to match your real-world needs.
Model Highlights (TL;DR)
- Model size: 9B parameters
- Architecture: Hybrid Transformer–Mamba (Mamba-2 plus a small number of attention layers) for higher throughput at accuracy similar to Transformer-only peers
- Throughput: Up to 6× higher token generation than other leading models in its size class
- Cost: The thinking budget lets you control how many "thinking" tokens are used, reducing reasoning costs by up to 60%
- Target use cases: Agents for customer service, support chatbots, analytics copilots, and edge/RTX deployments
- Availability: Model weights are available on Hugging Face, you can try the endpoint on build.nvidia.com, and the model will be available as an NVIDIA NIM for high throughput and low latency
- License: nvidia-open-model-license
This release represents a significant step forward in our continued commitment to openness and transparency in model development. By releasing the training data, along with the training tools and final model weights, NVIDIA supports continued improvement of open-weight models.
What's in the dataset and how we built it
At a high level, the Nemotron Post-Training Dataset v2 takes our previously released English reasoning data and translates it into five target languages (French, German, Italian, Japanese, Spanish). To best take advantage of the English knowledge instilled during pre-training, we translate the user prompt and model response while preserving the original English reasoning chain.
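To make this concrete, the sketch below shows how a single example might be processed under this scheme. It is a minimal illustration rather than the production pipeline: the field names (prompt, response, language), the use of <think>...</think> tags to delimit the reasoning chain, and the translate_text callable are all assumptions made for the purpose of the example.

```python
import re

# Assumption: the reasoning chain is stored inline as <think>...</think>.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def translate_example(example: dict, translate_text, target_lang: str) -> dict:
    """Translate the user prompt and model response into target_lang,
    while leaving the English reasoning chain untouched.

    translate_text(text, target_lang) is a placeholder for the line-by-line
    LLM translation step sketched later in this post.
    """
    response = example["response"]  # hypothetical field name

    # Pull out the English reasoning chain so it is preserved verbatim.
    match = THINK_RE.search(response)
    reasoning = match.group(0) if match else ""
    answer = THINK_RE.sub("", response)

    translated = dict(example)
    translated["prompt"] = translate_text(example["prompt"], target_lang)
    # Keep the original English reasoning ahead of the translated answer.
    translated["response"] = reasoning + translate_text(answer, target_lang)
    translated["language"] = target_lang
    return translated
```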
Based on results from the WMT 2024 general translation shared task, LLMs are achieving state-of-the-art results on machine translation tasks. However, for synthetic generation of post-training data, our preliminary studies have shown that:
- LLMs are more susceptible to hallucinations when translating SFT datasets compared with common machine translation test sets (e.g., FLORES).
- The translation quality and hallucination rate of open-source LLMs deteriorate significantly as input length increases.
Hence, we incorporate several mechanisms to maintain high translation quality and make hallucination detection straightforward. To summarize (a minimal sketch of these steps follows Table 1):
- We break the text down by newline and translate line by line. If a line is non-translatable (e.g., only tabs) or is part of a code block, it is not translated.
- We enforce a specific output format ("Wrap the translated text in brackets 〘〙") and use this special matching bracket to extract translations. Examples that do not follow the format are discarded (see Table 1).
- We run fastText language ID on the translated prompts to filter out off-target data points. This discarded another 55,567 examples (roughly 1.1% of all multilingual examples).
Table 1: Ratio of discarded data (measured in bytes) when enforcing the output format
| Language | code | qa | math |
|---|---|---|---|
| de | 2.28% | 1.11% | 2.47% |
| es | 26.14% | 5.15% | 6.38% |
| fr | 11.01% | 1.37% | 1.96% |
| it | 4.94% | 1.36% | 0.75% |
| ja | 7.68% | 2.51% | 3.86% |
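A minimal sketch of these mechanisms, under stated assumptions, is shown below. The exact instruction wording, the handling of code fences, and the fastText model file (lid.176.bin) are illustrative choices rather than the exact production setup; translate_prompt stands in for a call to the translation LLM described next.

```python
import re
import fasttext  # pip install fasttext; lid.176.bin is the pretrained language-ID model

BRACKETS_RE = re.compile(r"〘(.*?)〙", flags=re.DOTALL)
LANG_ID = fasttext.load_model("lid.176.bin")

def translate_line_by_line(text: str, translate_prompt, target_lang: str):
    """Translate text line by line; return None if any line breaks the format."""
    out_lines, in_code_block = [], False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_code_block = not in_code_block
            out_lines.append(line)          # keep code fences as-is
            continue
        if in_code_block or not stripped:
            out_lines.append(line)          # code and non-translatable lines are untouched
            continue
        raw = translate_prompt(
            f"Translate the following text into {target_lang}. "
            f"Wrap the translated text in brackets 〘〙.\n{line}"
        )
        match = BRACKETS_RE.search(raw)
        if match is None:
            return None                     # enforce the format; discard the example
        out_lines.append(match.group(1))
    return "\n".join(out_lines)

def is_on_target(translated_prompt: str, target_lang: str) -> bool:
    """Keep only examples whose translated prompt matches the target language."""
    labels, _ = LANG_ID.predict(translated_prompt.replace("\n", " "))
    return labels[0] == f"__label__{target_lang}"   # e.g., "__label__fr"
```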
After benchmarking, we selected Qwen2.5-32B-Instruct-AWQ (for German) and Qwen2.5-14B-Instruct (for the other languages) to perform the translation. The considerations for choosing these models include the following (a minimal inference sketch follows this list):
- Robust translation quality
- Can fit onto a single A100 GPU for inference
- Wide domain coverage in training data
- Open license (Apache 2.0)
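For reference, the raw translation call (the translate_prompt placeholder above) could be implemented with the selected model via Hugging Face Transformers roughly as follows. This is a simplified sketch with greedy decoding and no batching or optimized serving; it is not the exact setup used to produce the dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-14B-Instruct"  # Qwen2.5-32B-Instruct-AWQ was used for German
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def translate_prompt(prompt: str) -> str:
    """Single-turn chat completion returning the raw model output."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```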
How to use it
```python
from datasets import load_dataset

# Download the full multilingual post-training dataset from Hugging Face.
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2")
```
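Once loaded, you can inspect the splits and columns and subset the data as needed. The split and column names used below are assumptions for illustration; check the dataset card for the actual schema.

```python
# Print the available splits and their features.
print(ds)

# Example: keep only one language, assuming a "language"-style column exists
# (check the dataset card for the actual column name).
# french_only = ds["train"].filter(lambda ex: ex["language"] == "fr")
```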
👉 Explore the dataset here: Hugging Face dataset page

