Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate



Agentic AI systems increasingly depend on collections of cooperating agents—retrievers, planners, tool executors, verifiers—working together across large contexts and long time spans. These systems demand models that deliver fast throughput, strong reasoning accuracy, and sustained coherence over large inputs. In addition, they require a level of openness that lets developers customize, extend, and deploy models wherever they operate.

The NVIDIA Nemotron 3 family of open models (Nano, Super, Ultra), datasets, and techniques was designed for building specialized agentic AI for this new era.

It introduces a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture, reinforcement learning (RL) across interactive environments, and a native 1M-token context window that permits high-throughput, long-horizon reasoning for multi-agent applications.

What’s new in Nemotron 3

Nemotron 3 introduces several innovations that directly address the needs of agentic systems:

  • A hybrid Mamba-Transformer MoE backbone for superior test-time efficiency and long-range reasoning.
  • Multi-environment reinforcement learning designed around real-world agentic tasks.
  • A 1M-token context length supporting deep multi-document reasoning and long-running agent memory.
  • An open, transparent training pipeline, including data, weights, and recipes.
  • Immediate availability of Nemotron 3 Nano with ready-to-use cookbooks. Super and Ultra to follow.

Simple prompt example
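As a minimal sketch, here is how a chat request to Nemotron 3 Nano might be assembled for an OpenAI-compatible endpoint such as the one vLLM exposes. The model identifier and sampling parameters below are illustrative assumptions; check the model card and cookbook for the exact values.

```python
import json

# Hypothetical model identifier, used only for illustration.
MODEL_ID = "nvidia/Nemotron-3-Nano"

def build_chat_request(user_prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.6,   # illustrative sampling settings
        "max_tokens": 512,
    }

payload = build_chat_request("Summarize the Nemotron 3 architecture in two sentences.")
print(json.dumps(payload, indent=2))
# Send with e.g.: requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```

The same payload works unchanged against any OpenAI-compatible serving stack, which is what makes the cookbooks below interchangeable at the client level.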

Key technologies for Nemotron 3 models

Hybrid Mamba-Transformer MoE

Nemotron 3 integrates three architectures into a single backbone: 

  • Mamba layers for efficient sequence modeling, 
  • Transformer layers for precision reasoning, and 
  • MoE routing for scalable compute efficiency. 

Mamba excels at tracking long-range dependencies with minimal memory overhead, enabling sustained performance even when processing hundreds of thousands of tokens. Transformer layers complement this with detailed attention mechanisms that capture the structural and logical relationships required for tasks such as code manipulation, math reasoning, or complex planning.

The MoE component amplifies effective parameter count without incurring the cost of dense computation. Only a subset of experts is activated for each token, reducing latency and improving throughput. This architecture is especially well suited to agent clusters where many lightweight agents must operate concurrently—each generating plans, inspecting context, or executing tool-based workflows.
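The sparse-activation idea can be sketched in a few lines: a router scores every expert per token, only the top-k experts run, and their outputs are mixed by softmax gates. This toy NumPy version (linear "experts", random weights) is a simplification of the real FFN experts, not Nemotron's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Toy experts: one weight matrix each (real experts are full FFNs).
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                        # softmax over the selected experts only
        for gate, e in zip(gates, top[t]):
            out[t] += gate * (x[t] @ experts[e])    # only top_k of n_experts ever run
    return out

tokens = rng.standard_normal((3, d_model))
y = moe_forward(tokens)
print(y.shape)  # (3, 8)
```

Only `top_k / n_experts` of the expert compute is spent per token, which is why effective parameters can grow without a matching growth in latency.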

Layer pattern diagram for Nemotron 3 showing repeating blocks: 5 repetitions of Mamba-2/MoE pairs with one attention layer, followed by 3 Mamba-2/MoE pairs, then 1 block with attention, and finally 4 Mamba-2/MoE pairs ending with a single Mamba-2 layer.
Figure 1. Nemotron 3 hybrid architecture. The model interleaves Mamba-2 and MoE layers with only a few self-attention layers, maximizing inference throughput while maintaining state-of-the-art accuracy.

Multi-environment reinforcement learning (RL) training

To align Nemotron 3 with real agentic behavior, the model is post-trained with reinforcement learning across many environments in NeMo Gym, an open-source library for building and scaling RL environments. These environments evaluate the model’s ability to perform sequences of actions, going beyond single-turn responses, such as generating correct tool calls, writing functional code, or producing multi-part plans that satisfy verifiable criteria.

This trajectory-based reinforcement produces a model that behaves reliably under multi-step workflows, reduces reasoning drift, and handles the kinds of structured operations common in agentic pipelines. Because NeMo Gym is open, developers can reuse, extend, or even create their own environments when customizing models for domain-specific tasks.
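The core pattern behind these environments is a programmatic verifier that scores each step of a trajectory against checkable criteria. The snippet below is a toy stand-in for that pattern, not the NeMo Gym API: it rewards a step only when the model's output is a well-formed tool call.

```python
import json

def verifier(answer: str) -> float:
    """Reward 1.0 only if the step is valid JSON with the expected tool-call keys."""
    try:
        call = json.loads(answer)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if {"tool", "args"} <= call.keys() else 0.0

# A two-step trajectory: one valid tool call, one malformed one.
trajectory = [
    '{"tool": "search", "args": {"query": "Nemotron 3"}}',
    'search(Nemotron 3)',
]
rewards = [verifier(step) for step in trajectory]
print(rewards)  # [1.0, 0.0]
```

Because the reward is computed, not annotated, environments like this scale to millions of trajectories without human labeling, which is what makes multi-environment RL practical.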

These environments and RL datasets are being made available, alongside NeMo Gym, for those interested in using the environments to train their own models.

The graph from Artificial Analysis plots small language reasoning models by intelligence index on the y-axis and output tokens per second on the x-axis.
Figure 2. Nemotron 3 Nano delivers the highest throughput efficiency using the hybrid MoE architecture and leading accuracy with advanced reinforcement learning using NeMo Gym.

1M token context length

Nemotron 3’s 1M-token context enables sustained reasoning across large codebases, long documents, prolonged conversations, and aggregated retrieved content. Instead of relying on fragmented chunking heuristics, agents can keep entire evidence sets, history buffers, and multi-stage plans in a single context window.

This long context window is enabled by Nemotron 3’s hybrid Mamba-Transformer architecture, which processes extremely large sequences efficiently. MoE routing also keeps per-token compute low, making these large sequences practical at inference time.

For enterprise-scale retrieval-augmented generation, compliance analysis, multi-hour agent sessions, or monolithic repository understanding, the 1M-token window significantly improves factual grounding and reduces context fragmentation.

Key technologies coming in Nemotron 3 Super and Ultra

Latent MoE

Nemotron 3 Super and Ultra introduce latent MoE, where experts operate on a shared latent representation before outputs are projected back to token space. This approach allows the model to call on 4x more experts at the same inference cost, enabling higher specialization around subtle semantic structures, domain abstractions, or multi-hop reasoning patterns.

Side-by-side comparison of standard MoE and latent MoE architectures. Standard MoE (left) shows self-attention feeding into a router that dispatches to 4 experts plus a shared expert, then combines outputs. Latent MoE (right) adds a latent down-projection before routing and an up-projection after combining, enabling 8 experts instead of 4 while reducing all-to-all communication overhead.
Figure 3. Standard MoE vs. latent MoE architectures. In latent MoE, tokens are projected into a smaller latent dimension for expert routing and computation, reducing communication costs while enabling more experts and better accuracy per byte.
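Structurally, latent MoE wraps ordinary top-k routing between a shared down-projection and up-projection, so routing and expert math happen in a smaller space. This NumPy sketch illustrates that shape change only (toy linear experts, invented dimensions), not Nemotron's actual layer.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, n_experts, top_k = 16, 4, 8, 2

W_down = rng.standard_normal((d_model, d_latent)) * 0.1   # shared down-projection
W_up = rng.standard_normal((d_latent, d_model)) * 0.1     # shared up-projection
experts = [rng.standard_normal((d_latent, d_latent)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_latent, n_experts)) * 0.1

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                                  # tokens move to the small latent space
    logits = z @ router
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top-k routing as in standard MoE
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        sel = logits[t, top[t]]
        g = np.exp(sel - sel.max())
        g /= g.sum()
        for gate, e in zip(g, top[t]):
            out[t] += gate * (z[t] @ experts[e])    # expert compute stays in latent dim
    return out @ W_up                               # back to token space

y = latent_moe(rng.standard_normal((3, d_model)))
print(y.shape)  # (3, 16)
```

Because only the small latent vectors are dispatched to experts, the all-to-all traffic per token shrinks, which is what pays for the larger expert count at equal cost.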

Multi-token prediction (MTP)

MTP enables the model to predict several future tokens in a single forward pass, significantly increasing throughput for long reasoning sequences and structured outputs. For planning, trajectory generation, extended chain-of-thought, or code generation, MTP reduces latency and improves agent responsiveness.
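Mechanically, one common way to realize multi-token prediction is to attach one output head per future position to the same hidden state, so a single forward pass yields a short draft of k tokens. The sketch below is a simplified stand-in with toy linear heads, not Nemotron's MTP design.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab, k = 8, 20, 3   # predict k future tokens per forward pass

# One output head per future position (simplified stand-in for MTP heads).
heads = [rng.standard_normal((d_model, vocab)) for _ in range(k)]

def predict_k_tokens(hidden: np.ndarray) -> list:
    """Greedy multi-token prediction from a single hidden state."""
    return [int(np.argmax(hidden @ h)) for h in heads]

hidden = rng.standard_normal(d_model)
draft = predict_k_tokens(hidden)
print(len(draft))  # 3
```

A decoder can then accept this draft outright or verify it against single-token decoding, trading a little per-step compute for up to k tokens of progress per pass.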

NVFP4 training

Super and Ultra are pretrained in NVFP4, NVIDIA’s 4-bit floating-point format that provides best-in-class cost-accuracy for training and inference. An updated NVFP4 recipe was designed for Nemotron 3 to ensure accurate and stable pretraining on our 25T-token pretraining dataset. The vast majority of floating-point multiply-accumulate operations during pretraining are performed in NVFP4. 
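To give a feel for what 4-bit floating point means, the sketch below rounds a block of values onto the E2M1 (FP4) magnitude grid after applying a per-block scale. This is a deliberate simplification of NVFP4 (which uses fine-grained block scaling in a dedicated scale format), shown only to illustrate the quantize-with-scale idea.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; NVFP4 pairs such 4-bit values
# with per-block scale factors (details simplified here).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x: np.ndarray) -> np.ndarray:
    """Round a block to FP4 after scaling so its max magnitude maps to 6.0."""
    scale = np.abs(x).max() / FP4_GRID[-1]
    if scale == 0:
        return np.zeros_like(x)
    scaled = np.abs(x) / scale
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)  # nearest grid point
    return np.sign(x) * FP4_GRID[idx] * scale      # dequantized values

block = np.array([0.1, -0.7, 2.3, -6.0])
q = quantize_fp4_block(block)
print(q)  # values snapped to the scaled FP4 grid
```

Each value now needs only 4 bits plus a shared scale per block, which is where the memory and bandwidth savings during pretraining come from.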

Ongoing commitment to open models

Nemotron 3 reinforces NVIDIA’s commitment to transparency and developer empowerment. The model weights are openly released under the NVIDIA Open Model License. NVIDIA’s synthetic pretraining corpus––nearly 10 trillion tokens––can be inspected or repurposed. Developers also have access to detailed training and post-training recipes in the Nemotron GitHub repository, enabling complete reproducibility and customization.

Nemotron 3 Nano is available now, forming the foundation for high-throughput, long-context agentic systems. Super and Ultra, coming in the first half of 2026, will extend this foundation with greater reasoning depth and efficiency-minded architectural enhancements.

Nemotron 3 Nano: available now

Available today is our first model in the series: Nemotron 3 Nano. This 30B-total, 3B-active-parameter model is specifically designed for DGX Spark, H100, and B200 GPUs, allowing you to build with the most efficient model in our Nemotron 3 family.

If you want to learn more about the technical details of Nemotron 3 Nano, you can find a detailed Hugging Face blog, or read the technical report.

This model delivers the highest throughput efficiency, achieves a leading score on the Artificial Analysis Intelligence Index, and preserves the Artificial Analysis Openness Index score that NVIDIA Nemotron Nano V2 achieved, showcasing its effectiveness for multi-agent tasks while remaining transparent and customizable.

Bar chart ranking 12 models by Intelligence Index score across 10 evaluations. NVIDIA Nemotron 3 Nano scores 52.
Figure 5. On the Artificial Analysis Intelligence Index v3.0, Nemotron 3 Nano achieves leading accuracy (52) among similarly sized models.

Developers can start using Nemotron 3 Nano today across multiple deployment and development workflows:

Launch the model with NVIDIA cookbooks

We’re providing ready-to-use cookbooks for several major inference engines:

  • vLLM Cookbook – Deploy Nemotron 3 Nano with high-throughput continuous batching and streaming.
  • SGLang Cookbook – Run fast, lightweight inference optimized for multi-agent tool-calling workloads.
  • TRT-LLM Cookbook – Deploy fully optimized TensorRT-LLM engines for low-latency, production-grade environments.

Each cookbook includes configuration templates, performance tips, and reference scripts so you can get Nemotron 3 Nano running within minutes.
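For the vLLM path, deployment typically reduces to a single serve command. The model identifier and flags below are assumptions for illustration; consult the cookbook for the exact values:

```shell
# Hypothetical model ID and flags; see the vLLM cookbook for exact values.
vllm serve nvidia/Nemotron-3-Nano \
  --max-model-len 131072 \
  --tensor-parallel-size 1
```

Once the server is up, any OpenAI-compatible client can send chat-completions requests to it.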

In addition, start today with Nemotron on any NVIDIA GPU – from GeForce RTX desktops and laptops, to RTX Pro workstations, to DGX Spark – using top frameworks and tools such as Llama.cpp, LM Studio, and Unsloth.

Build with Nemotron open training datasets

NVIDIA is also releasing the open datasets used throughout the model’s development, providing unprecedented transparency into how high-performance, trustworthy models are built.

New dataset highlights include:

  • Nemotron-pretraining – A new 3-trillion-token dataset with richer coverage of code, math, and reasoning, enhanced through synthetic augmentation and annotation pipelines.
  • Nemotron-post-training 3.0 – A 13-million-sample corpus for supervised fine-tuning and reinforcement learning that powers Nemotron 3 Nano’s alignment and reasoning.
  • Nemotron-RL datasets – A curated collection of RL datasets and environments for tool use, planning, and multi-step reasoning.
  • Nemotron agentic safety dataset – A set of nearly 11,000 AI agent workflow traces designed to help researchers evaluate and mitigate emerging safety and security risks in agentic systems.

Paired with the NVIDIA NeMo Gym, RL, Data Designer, and Evaluator open libraries, these open datasets enable developers to train, enhance, and evaluate their own Nemotron models.

Explore the Nemotron GitHub: pre-training & RL recipes

NVIDIA maintains an open Nemotron GitHub repository that features:

  • Pre-training recipes (already available) showing how Nemotron 3 Nano was trained
  • RL alignment recipes for multi-environment optimization
  • Data-processing pipelines, tokenizer configuration, and long-context setup
  • Additional post-training and fine-tuning recipes, coming in future updates

If you want to train your own Nemotron, extend Nano, or produce a domain-specialized variant, the GitHub repository provides the documentation, configurations, and tooling to reproduce key steps end-to-end.

This openness completes the story: you can run the model, deploy the model, inspect how the model was built, and even train your own—all using NVIDIA open resources.

Nemotron 3 Nano is available now. Start building long-context, high-throughput agentic systems today using NVIDIA open models, open tools, open data, and open training infrastructure.

Join the Nemotron Model Reasoning Challenge

Accelerating open research is a core priority for the Nemotron team. With that in mind, we’re excited to announce a new community competition focused on improving Nemotron’s reasoning performance using Nemotron’s open models and datasets.

Register here to be the first to know when details are released.

And stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.


