If 2025 was the year of AI agents, then 2026 is gearing up to be the year of, well, multi-agents. This leap to the next step requires producing a lot of tokens, generated by lightweight yet accurate models.
However, this transition also forces difficult tradeoffs. Smaller models are fast and cheap but often lack the reasoning depth, robustness, and long-context capability needed for advanced multi-agent systems. Larger models deliver strong accuracy, but they are too slow and expensive when many agents run in parallel. As agentic systems grow, inference costs spiral, context windows become a bottleneck, and reliability starts to degrade, making efficiency paramount.
Striking the right balance is what led NVIDIA to build the NVIDIA Nemotron 3 Nano 30B A3B, part of our Nemotron 3 family of models (Nano, Super, and Ultra).
Nano uses a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture with a 1M-token context window, enabling developers to build high-throughput, reliable agents that are more accurate, more scalable, and capable of specialized sub-tasks in long-running, multi-step workflows.
- Hybrid Mamba-Transformer MoE architecture: Mamba‑2 for long-context, low-latency inference combined with transformer attention for high-accuracy, fine-grained reasoning
- 31.6B total parameters, ~3.6B active per token: Designed for high throughput and low latency
- Exceptional inference efficiency: Up to 4x faster than Nemotron Nano 2 and up to 3.3x faster than leading models in its size category
- Best-in-class reasoning accuracy: Across reasoning, coding, tools, and multi-step agentic tasks
- Reasoning controls: Reasoning ON/OFF modes plus a configurable thinking budget to cap “thinking” tokens and keep inference cost predictable
- 1M-token context window: Ideal for long-horizon workflows, retrieval-augmented tasks, and persistent memory
- Fully open: Open weights, datasets, training recipes, and frameworks
- A full open data stack: 3T new high-quality pre-training tokens, 13M cross-disciplinary post-training samples, 10+ RL environments with datasets covering more than 900k tasks in math, coding, reasoning, and tool use, and ~11k agent-safety traces
- Easy deployment: Seamless serving with vLLM and SGLang, and integration via OpenRouter, popular inference service providers, and build.nvidia.com endpoints
- License: Released under the nvidia-open-model-license
Figure 1: Nemotron 3 Nano matches or exceeds the accuracy of Qwen3-30B and GPT-OSS-20B while delivering dramatically higher throughput. In an 8K input / 16K output configuration on a single H200 GPU, Nano achieves 3.3x higher throughput than Qwen3-30B and 2.2x higher than GPT-OSS-20B.
Nemotron 3 Nano (30B/A3B) is our latest small-but-powerful reasoning model, building on the success of Nemotron Nano 2’s hybrid Mamba-2 + Transformer architecture, reasoning ON/OFF modes, and explicit thinking budgets, while introducing a major architectural upgrade: a sparse Mixture-of-Experts (MoE) design.
At a high level:
- 31.6B total parameters
- ~3.6B active parameters per token, thanks to MoE routing
- Hybrid layer stack with interleaved Mamba‑2 layers and grouped-query attention (GQA) Transformer layers
- A learned multi-layer perceptron (MLP) router that activates 6 of 128 experts on each forward pass, delivering both efficiency and reasoning accuracy
This combination enables Nemotron 3 Nano to behave like a much larger model in terms of reasoning quality, while maintaining the speed and cost profile expected of a lightweight architecture.
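To make the routing concrete, here is a minimal sketch of top-k expert routing in the spirit described above: an MLP router scores 128 experts per token and activates only 6 of them. The dimensions, module structure, and naive dispatch loop are illustrative assumptions, not the released Nemotron implementation.

```python
# Illustrative sketch of sparse MoE routing (not the released Nemotron code).
# A learned MLP router scores 128 experts per token and activates only the top 6,
# so per-token compute stays close to that of a small dense model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=2048, d_ff=1024, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Learned MLP router that scores every expert for each token.
        self.router = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, n_experts)
        )
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                       # x: [tokens, d_model]
        scores = self.router(x)                                 # [tokens, n_experts]
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                    # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                          # naive loop; real kernels batch this
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] += w * self.experts[int(e)](x[mask])
        return out
```

Because only 6 of 128 experts run per token, total parameters (31.6B) and active parameters (~3.6B) diverge sharply, which is where the efficiency comes from.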
Figure 2: Nemotron 3 Nano architecture. It uses a hybrid Mamba-Transformer backbone, similar to Nemotron Nano 2, but replaces standard feed-forward network (FFN) layers with sparse MoE layers to significantly boost efficiency and scalability.
Nemotron 3 Nano is built for agentic, reasoning, tool-use, and chat tasks and supports a context length of up to 1M tokens.
It extends the Nemotron model family we released earlier in the year, continuing the progression toward increasingly accurate and efficient open models for reasoning and agent development.
Figure 3: The NVIDIA Nemotron family of open models is engineered for advanced reasoning and agentic tasks, delivering leading accuracy and best-in-class efficiency.
We employed a multi-stage pipeline combining massive-scale pre-training, specialized supervised fine-tuning (SFT), and advanced reinforcement learning techniques to refine the reasoning abilities and agentic behavior.
Pre-Training
Nemotron 3 Nano was trained on a 25-trillion-token corpus (including 2.5T of new Common Crawl tokens), spanning web crawls, code and math, Wikipedia and academic text, and multilingual content (15 languages).
Pre-training followed a two-phase strategy:
- Phase 1: Diversity (first 94%): A broad, diverse mixture to maximize coverage and generalization.
- Phase 2: Quality (final 6%): High-quality sources such as Wikipedia to refine accuracy and consistency.
Long-Context Extension
Context length for Nemotron 3 Nano was extended by adding a continued pre-training (CPT) stage at 512k sequence length. A mixture of 512k and 4k sequence-length training preserved short-context benchmark scores while extending the context length. We included synthetic data designed to support long-range retrieval, multi-hop reasoning, multi-document information aggregation, and related capabilities across different stages of training.
We’re releasing a large portion of these pretraining datasets openly on Hugging Face. These additions contribute 3 trillion new tokens to the Nemotron-Pretraining series, with higher-fidelity coverage of code, math, and reasoning. Enhanced synthetic augmentation and annotation pipelines increase data density and structure, improving training efficiency and directly contributing to Nemotron 3 Nano’s strong quality profile.
With Nemotron 3, we’ve learned that quantity without quality isn’t useful. Our pre-training data continues to shift toward efficient data: smarter filtering, rewritten and improved samples, and nearly half a trillion tokens of rescued math and code that previous pipelines would have discarded. This focus on signal over noise directly enables smarter, smaller models that are cheaper to train and run, without sacrificing accuracy.
Post-Training
Post-training included supervised fine-tuning (SFT) and two distinct stages of reinforcement learning: RLVR and RLHF. These stages specialize the model for agentic workflows, tool use, high-quality reasoning, and chat tasks.
Supervised Fine-Tuning
Our SFT recipe was improved over Nemotron Nano 2 to better support complex agentic behaviors. Improvements included greater dataset diversity, higher data quality, and explicit training for multi-step and multi-turn reasoning.
The model learns both reasoning ON and OFF modes directly from the chat template (a request-level sketch follows the list below):
- Reasoning ON: multi-step mode, where the model preserves and builds upon its prior chain-of-thought within a task.
- Reasoning OFF: multi-turn mode, where reasoning content is not carried over across turns, ensuring concise responses.
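As a rough illustration, this is how toggling reasoning could look against an OpenAI-compatible endpoint such as a local vLLM server. The `enable_thinking` chat-template argument and the model identifier are assumptions for illustration; check the model card for the exact control exposed by the released chat template.

```python
# Hedged sketch: toggling reasoning ON/OFF via an OpenAI-compatible API.
# The chat-template knob name ("enable_thinking") and model id are assumptions;
# consult the Nemotron 3 Nano model card for the released controls.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"  # placeholder model id

def ask(question: str, reasoning: bool) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        # vLLM forwards chat_template_kwargs to the model's chat template.
        extra_body={"chat_template_kwargs": {"enable_thinking": reasoning}},
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", reasoning=True))   # reasoning ON: chain-of-thought preserved within the task
print(ask("What is 17 * 24?", reasoning=False))  # reasoning OFF: concise answer, no carried-over reasoning
```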
Figure 4: Nemotron 3 Nano delivers the highest throughput efficiency using the hybrid MoE architecture and leading accuracy with advanced reinforcement learning using NeMo Gym.
We’re releasing the vast majority of the SFT datasets and the codebase openly.
Our new post-training data release also expands the intelligence of the model by design. We added 13 million new post-training samples, nearly tripling our previous release and making this the largest openly available post-training corpus by 2.5x. To reach higher reasoning accuracy, we blended cross-disciplinary domains including code, math, physics, and chemistry to create novel, multi-step problems that don’t exist in scraped web data. This helps the model reason about questions that fall between fields, where real scientific and technical progress often happens.
Multi-Environment Reinforcement Learning from Verifiable Rewards (RLVR)
Nemotron 3 Nano was trained concurrently across many distinct environments, spanning math, code, question answering, instruction following, multi-step tool use, multi-turn conversations, and structured output, among others, using synchronous GRPO (Group Relative Policy Optimization). This multi-environment RLVR stage ensures uniform improvement across domains, reduces overfitting to any single benchmark, and yields more reliable agentic behavior in real-world workflows.
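For reference, the core idea of GRPO is to score each rollout relative to other rollouts of the same prompt, removing the need for a separate value model. Below is a minimal sketch of the group-relative advantage computation under that assumption; the production recipe in NeMo RL has many more moving parts.

```python
# Minimal sketch of GRPO's group-relative advantage (simplified).
# For each prompt, sample a group of rollouts, score them with a verifiable
# reward, and normalize each reward against the group's mean and std.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape [group_size], verifiable rewards for rollouts of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 rollouts of one math prompt; reward 1.0 if the checker verifies the answer.
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
# Correct rollouts get positive advantage, incorrect ones negative; the policy
# gradient then up-weights tokens from above-average rollouts within each group.
print(advantages)
```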
Figure 5: Uniform improvements due to training concurrently on multiple RL environments.
Models need more than textbooks to train; they need a gym. NVIDIA is one of the only open model providers releasing both reinforcement learning datasets and the environments used to train them. This allows developers to test agents, capture critical edge cases, and prevent model drift over time. In this release, we’re adding 10+ new RL environments covering competitive coding, advanced math, and even real-world calendar scheduling.
We’re also open-sourcing all of the essential RLVR infrastructure: the environments, including their datasets, and the code used to build and scale them. These components form the foundation of the new NVIDIA NeMo Gym library, which enables scalable RL environment construction.
Training at scale is executed using NVIDIA NeMo RL, our high-performance RL library.
Reinforcement Learning from Human Feedback (RLHF)
To further refine the model’s conversational quality, we trained a generative reward model (GenRM) using GRPO on Qwen3-235B-A22B.
Given a conversation history, a new user query, and two candidate assistant responses, the GenRM explicitly reasons about the strengths and weaknesses of each response, produces individual helpfulness scores, and generates a relative ranking between the candidates. These reward signals are then used in an RLHF stage to improve helpfulness, coherence, correctness, and overall chat experience in Nemotron 3 Nano.
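As a hedged sketch, a pairwise GenRM query could look roughly like the following; the prompt wording, endpoint, model name, and JSON parsing here are illustrative assumptions, not the exact template used in training.

```python
# Illustrative pairwise GenRM query (prompt format, endpoint, and parsing are assumptions).
# The reward model reasons about both candidates, scores each for helpfulness,
# and emits a relative ranking that the RLHF stage can turn into a reward signal.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")  # hypothetical GenRM endpoint

def compare(history: str, question: str, resp_a: str, resp_b: str) -> dict:
    prompt = (
        "Given the conversation history, the new user question, and two candidate "
        "assistant responses, analyze the strengths and weaknesses of each, then return "
        'JSON: {"score_a": <1-10>, "score_b": <1-10>, "preferred": "A" or "B"}.\n\n'
        f"History:\n{history}\n\nQuestion:\n{question}\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}"
    )
    out = client.chat.completions.create(
        model="genrm",  # placeholder name for the trained reward model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(out.choices[0].message.content)
```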
The combined post-training pipeline (SFT + RLVR + RLHF) produces the final Nemotron 3 Nano 30B-A3B model.
As models evolve into multi-step agents that use tools, they face entirely new safety and security challenges. To support responsible deployment, we’re releasing an agentic safety dataset featuring nearly 11,000 labeled traces from realistic, tool-using workflows. This gives developers the data they need to evaluate, diagnose, and mitigate safety risks before agentic systems reach production.
Why We Needed Better RL Infrastructure
During development, the limitations of existing RL tooling became clear. Training large reasoning models with RL is difficult because:
- Multi-step rollouts are complex to orchestrate
- Tool integrations are sometimes brittle
- Orchestration logic can conflict with training loop design
- Collecting rollout data at scale is slow and difficult
- Most high-quality RL environments are closed and proprietary
As a result, meaningful RL training has historically been accessible only to major AI labs.
NeMo Gym: Opening RL to Everyone
To overcome these challenges, NVIDIA built NeMo Gym, an open-source, standardized library for building and scaling RL environments.
NeMo Gym powers the reinforcement learning pipelines used in Nemotron 3 Nano, and now gives developers:
- Ready-to-use RL environments across math, code, tool use, multi-turn reasoning, and agentic workflows
- The ability to build custom RL environments with verifiable reward logic
- Ecosystem interoperability with NeMo RL and other training frameworks (TRL, Unsloth, VeRL underway)
- High-throughput rollout orchestration, enabling large-scale RL training
- A practical pathway to perform RL on their own models
NeMo Gym is a versatile open-source library for building and running RL training environments. It is part of the broader NVIDIA NeMo software suite for end-to-end model training and provides infrastructure for designing, running, and scaling complex RL environments.
Battle-tested through the development of the entire Nemotron 3 model family, NeMo Gym includes the core environment development infrastructure, a growing collection of ready-to-use training environments alongside the datasets used in RLVR, and integration with NeMo RL, the high-performance, efficient RL training engine with support for advanced RL training algorithms, end-to-end FP8 training, and async RL.
With NeMo Gym, teams can quickly assemble environments using modular server components and templates, integrate external tools, systems, or databases, and orchestrate long-context, multi-step, multi-turn rollouts. This allows training environments to be iterated on and shared independently of the training loop.
Figure 6: How NeMo Gym fits into the RL training loop: The RL training framework (e.g., NeMo RL) sends task prompts to NeMo Gym, which operates as a set of independent HTTP services. Within NeMo Gym, the agent server orchestrates rollouts by coordinating the policy model server (generation) and the external resources server (tools and rewards). NeMo Gym returns model trajectories and rewards to the training framework, which then updates and refits the policy model.
By decoupling RL environments from RL training frameworks, NeMo Gym works seamlessly with many popular training frameworks (such as NeMo RL), supports high-throughput, concurrent rollout collection, and enables large-scale distributed RL training. This separation of concerns makes it easy to scale RL workflows and adapt environments as training objectives evolve.
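To make the separation of concerns concrete, here is a deliberately simplified sketch of a verifiable-reward environment exposed as an HTTP service, in the spirit of Figure 6. This is not the NeMo Gym API; the route, payload, and reward logic are invented for illustration, and the real library ships ready-made server components for this.

```python
# Illustrative only: a toy verifiable-reward environment served over HTTP,
# mirroring the decoupled pattern in Figure 6. This is NOT the NeMo Gym API;
# routes and payloads here are invented for illustration.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Rollout(BaseModel):
    prompt: str           # task prompt sent by the training framework
    response: str         # model trajectory produced by the policy server
    expected_answer: str  # ground truth used by the verifier

@app.post("/score")
def score(rollout: Rollout) -> dict:
    # Verifiable reward: 1.0 if the final answer after "####" matches, else 0.0.
    predicted = rollout.response.split("####")[-1].strip()
    reward = 1.0 if predicted == rollout.expected_answer.strip() else 0.0
    return {"reward": reward}

# Run with: uvicorn env_server:app --port 9000
# The RL trainer (e.g., NeMo RL) collects trajectories from the policy server and
# posts them here for rewards, keeping environment logic out of the training loop.
```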
To accelerate experimentation, NeMo Gym ships with an expanding RL Hub, a catalog of ready-to-use, domain-specific environments that developers can use immediately or extend. Current domains include math, coding, instruction following, multi-step tool use, and multi-turn structured conversations. Practitioners can fine-tune models on these environments out of the box, reuse community contributions, or publish their own.
Nemotron 3 Nano (30B A3B) delivers state-of-the-art accuracy in an exceptionally cost-efficient package. It offers up to 3.3x higher throughput than leading open-source models of comparable size (see Figure 1), while supporting a 1M-token context window and performing well on long-context reasoning benchmarks.
Built for high-volume, real-time execution, Nemotron 3 Nano excels in math and coding, multi-step tool calling, and multi-turn agentic workflows. It also retains the classic Nemotron Thinking ON/OFF modes and Thinking Budget controls (sketched below), giving developers the ability to tune exactly how much the model thinks for each task.
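One common way to implement a thinking budget, sketched below under the assumption of an OpenAI-compatible completions endpoint: cap the tokens the model may spend inside its reasoning span, force-close the span if the cap is hit, and then let the model finish its answer. The span markers and two-phase mechanics are illustrative; the released chat template and cookbooks document the supported controls.

```python
# Hedged sketch of a thinking-budget control (markers and mechanics are illustrative).
# Idea: let the model think for at most `budget` tokens, force-close the reasoning
# span if needed, then generate the final answer.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"  # placeholder model id

def answer_with_budget(question: str, budget: int = 512) -> str:
    # Phase 1: spend at most `budget` tokens on reasoning (assumed <think> span marker).
    thinking = client.completions.create(
        model=MODEL,
        prompt=f"<think>\nQuestion: {question}\n",
        max_tokens=budget,
        stop=["</think>"],
    ).choices[0].text
    # Phase 2: close the reasoning span and generate the final answer.
    final = client.completions.create(
        model=MODEL,
        prompt=f"<think>\nQuestion: {question}\n{thinking}\n</think>\nAnswer:",
        max_tokens=256,
    ).choices[0].text
    return final.strip()
```

Capping the reasoning span rather than the whole response keeps answer quality intact while making per-request cost predictable.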
With this release, we’re also introducing NeMo Gym, containing ready-to-use training environments we developed over the course of Nemotron 3 training, and the infrastructure to build your own training environments and scale rollout collection.
We’re releasing:
- Full model weights
- The entire training recipe, including SFT, RLVR, and RLHF
- Most of the datasets (pre-training and post-training) used throughout the training pipeline
- Training frameworks that power Nemotron 3
Everything you need to study, reproduce, or extend the model is available openly.
Start with Nemotron 3 Nano:
- Download the model: Now available on Hugging Face.
- Try hosted endpoints: Run queries immediately on OpenRouter or build.nvidia.com.
- Deploy at scale: Use our cookbooks for vLLM, TRT-LLM, and SGLang (a minimal vLLM sketch follows this list)
- Experiment, develop, and run at the edge: Available on edge devices such as NVIDIA RTX AI PCs and workstations and DGX Spark via Llama.cpp, LM Studio, and Unsloth
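As a quick-start sketch for the vLLM path: serve the model, then query it through the OpenAI-compatible API. The model identifier and serving flags below are assumptions; the cookbooks list the recommended settings.

```python
# Quick-start sketch (model id and flags are assumptions; see the cookbooks).
# 1) Serve the model with vLLM:
#    vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B --max-model-len 131072
# 2) Query it through the OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize the tradeoffs of MoE models in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```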
For a deep dive into the architecture, datasets, and benchmarks, read the complete Nemotron 3 Nano Technical Report.






