Agentic AI systems need models with the specialized depth to solve dense technical problems autonomously. They must excel at reasoning, coding, and long-context analysis, while remaining efficient enough to run repeatedly at scale.
Multi-agent systems generate up to 15x the tokens of normal chats, re-sending history, tool outputs, and reasoning steps at every turn. Over long tasks, this “context explosion” causes goal drift, where agents gradually lose alignment with the original objective. And using massive reasoning models for every sub-task—the “thinking tax”—makes multi-agent applications too expensive and sluggish for practical use.
Today, we’re releasing Nemotron 3 Super to address these limitations. The new Super model is a 120B total, 12B active-parameter model that delivers maximum compute efficiency and accuracy for complex multi-agent applications such as software development and cybersecurity triaging. This model follows the introduction of Nemotron 3 Nano in December.
Super addresses the “thinking tax” with its hybrid mixture-of-experts (MoE) architecture, delivering over 5x the throughput of the previous Nemotron Super. It tackles the “context explosion” with a native 1M-token context window that gives agents long-term memory for aligned, high-accuracy reasoning. The model is fully open, with open weights, datasets, and recipes, so developers can easily customize, optimize, and deploy it on their own infrastructure.
What makes Nemotron 3 Super different
Nemotron 3 Super isn’t just a bigger Nano. It introduces architectural innovations that mitigate some of the typical efficiency-accuracy tradeoffs of high-capacity reasoning models:
- Latent MoE that consults 4x as many experts for the same inference cost by compressing tokens before they reach the experts.
- Multi-token prediction (MTP) that predicts multiple future tokens in a single forward pass, dramatically reducing generation time for long sequences and enabling built-in speculative decoding.
- Hybrid Mamba-Transformer backbone integrating Mamba layers for sequence efficiency with Transformer layers for precision reasoning, delivering higher throughput with 4x improved memory and compute efficiency.
- Native NVFP4 pretraining optimized for NVIDIA Blackwell, significantly cutting memory requirements and speeding up inference by 4x on NVIDIA B200 compared to FP8 on NVIDIA H100, while maintaining accuracy.
- Multi-environment reinforcement learning (RL) post-training across 21 environment configurations using NVIDIA NeMo Gym and NVIDIA NeMo RL, with more than 1.2 million environment rollouts.
These benefits come together to create a model that’s well suited to long-running autonomous agents. On PinchBench—a new benchmark for measuring how well LLMs perform as the brain of an OpenClaw agent—Nemotron 3 Super scores 85.6% across the full test suite, making it the best open model in its class.
See it in action
If you want to go hands-on with Nemotron 3 Super, follow the tutorial video below. It walks you through how to use the model, from build.nvidia.com to OpenCode.
Diving deep into the architecture
Hybrid Mamba-Transformer MoE backbone
Super builds on the same hybrid philosophy as Nano but at a fundamentally different scale. The backbone interleaves three layer types:
Mamba-2 layers handle the vast majority of sequence processing. State space models (SSMs) provide linear-time complexity with respect to sequence length, which is what makes the 1M-token context window practical rather than theoretical. When an agent must reason over an entire codebase, a long conversation history, or a stack of retrieved documents, Mamba layers keep the memory footprint manageable.
Transformer attention layers are interleaved at key depths. Pure SSMs can struggle with precise associative recall—the kind of task where you need to find one specific fact buried in a long context. The attention layers preserve this capability, ensuring that Super maintains high-fidelity retrieval even when the “needle” sits in the middle of a haystack of conflicting information.
MoE layers scale effective parameter count without the cost of dense computation. Only a subset of experts activates per token, keeping latency low and throughput high—critical when many agents are running concurrently in a shared deployment.
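To make the interleaving concrete, here is a minimal schematic in Python. The layer counts and the every-Nth placement of attention and MoE layers are illustrative assumptions, not the published Nemotron 3 Super configuration.

```python
# Schematic sketch of a hybrid Mamba-Transformer MoE stack.
# Layer counts and placement intervals are invented for illustration.
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str   # "mamba2" | "attention" | "moe"
    index: int

def build_hybrid_stack(n_layers: int = 48,
                       attn_every: int = 8,
                       moe_every: int = 4) -> list[LayerSpec]:
    """Mostly Mamba-2 layers, with attention interleaved at key depths
    for associative recall and MoE layers for sparse capacity."""
    stack = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            kind = "attention"   # precise retrieval over long context
        elif (i + 1) % moe_every == 0:
            kind = "moe"         # sparse expert capacity
        else:
            kind = "mamba2"      # linear-time sequence processing
        stack.append(LayerSpec(kind, i))
    return stack

if __name__ == "__main__":
    for spec in build_hybrid_stack(16):
        print(spec.index, spec.kind)
```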


Latent MoE
Standard MoE architectures route tokens directly from the model’s full hidden dimension to the experts. As models grow, this routing layer becomes a bottleneck—it increases compute costs and limits how many experts you can practically deploy.
Super introduces latent MoE: Before routing decisions are made, token embeddings are projected into a compressed, low-rank latent space. Expert computation happens in this smaller dimension, and results are projected back to the full model dimension afterward.
Why this matters in practice:
More experts, same cost. By compressing tokens before they reach the experts, latent MoE enables the model to consult 4x as many experts for the same computational cost.
Finer-grained specialization. With more experts available, the model can afford highly specialized routing—for instance, activating distinct experts for Python syntax versus SQL logic—that fires only when strictly necessary. This granularity is especially valuable in agentic settings, where a single conversation may span tool calls, code generation, data analysis, and conversational reasoning within a few turns.
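As a rough illustration of the mechanism, here is a minimal latent-MoE sketch in PyTorch. The dimensions, expert count, and top-k value are invented for the example; only the compress-route-expand structure reflects the description above.

```python
# Minimal latent-MoE sketch: compress, route and compute in the latent
# space, then expand. All sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_experts=64, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress
        self.up = nn.Linear(d_latent, d_model, bias=False)     # restore
        self.router = nn.Linear(d_latent, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent),
                          nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        z = self.down(x)                       # route in the latent space
        weights = F.softmax(self.router(z), dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):            # naive loop, for clarity
            for e in range(len(self.experts)):
                mask = top_i[:, k] == e
                if mask.any():
                    out[mask] += top_w[mask, k, None] * self.experts[e](z[mask])
        return self.up(out)                    # back to full model width
```

Because the router and experts all operate on d_latent rather than d_model, the per-token expert cost shrinks with the compression ratio, which is what lets the expert count grow at fixed inference cost.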


Multi-token prediction (MTP)
Standard language models are trained to predict one token at a time—a fundamentally myopic objective. Super is trained with MTP, where specialized prediction heads forecast several future tokens simultaneously from each position.
This has two concrete advantages:
Stronger reasoning during training. Predicting multiple future tokens forces the model to internalize longer-range structure and logical dependencies. Rather than learning to guess plausible next words, the model must learn to anticipate coherent sequences. This produces measurable gains on chain-of-thought tasks where each step must follow logically from the last.
Built-in speculative decoding at inference. By predicting multiple future tokens simultaneously in a single forward pass, MTP dramatically reduces the time required to generate long sequences. The MTP heads provide draft predictions that can be verified in parallel, enabling up to 3x wall-clock speedups for structured generation tasks like code and tool calls—without requiring a separate draft model.
Both advantages stem from the same design decision. Unlike architectures that train independent prediction heads per offset, Super uses a shared-weight design across all MTP heads. This keeps the parameter overhead minimal while improving training stability—the heads learn to agree on coherent continuations rather than diverging into offset-specific shortcuts. The same weight sharing also makes the speculative drafts more consistent at longer draft lengths, which is where independently trained heads typically degrade.
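Here is a minimal sketch of a shared-weight MTP head. The weight sharing across offsets follows the design described above; the learned offset embedding used to distinguish prediction positions is an assumption for illustration.

```python
# Shared-weight multi-token prediction head: one projection serves every
# future offset, conditioned on a small offset embedding (an assumed detail).
import torch
import torch.nn as nn

class SharedMTPHead(nn.Module):
    def __init__(self, d_model=1024, vocab=32000, n_future=4):
        super().__init__()
        self.n_future = n_future
        self.proj = nn.Linear(d_model, vocab, bias=False)  # shared weights
        self.offset_emb = nn.Embedding(n_future, d_model)  # which offset?

    def forward(self, hidden):               # hidden: (batch, seq, d_model)
        logits = []
        for k in range(self.n_future):
            h = hidden + self.offset_emb.weight[k]    # condition on offset k
            logits.append(self.proj(h))               # (batch, seq, vocab)
        return torch.stack(logits, dim=1)    # drafts for t+1 .. t+n_future
```

At inference time, the per-offset logits serve as the speculative drafts that the main model verifies in parallel.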
Native NVFP4 pretraining
Most quantized models start as full-precision and get compressed after training, which inevitably introduces accuracy loss. Super takes a different approach: The vast majority of floating-point multiply-accumulate operations during pretraining run in NVFP4, the NVIDIA 4-bit floating-point format. Optimized for Blackwell, this significantly cuts memory requirements and speeds up inference compared to FP8, while maintaining accuracy.
Training natively in reduced precision means the model learns to be accurate within the constraints of 4-bit arithmetic from the very first gradient update. The result is a model that’s mathematically stable and accurate despite running on a significantly reduced memory footprint.
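For intuition, the following sketch simulates 4-bit block-scaled quantization in the spirit of NVFP4, snapping values to the E2M1 magnitude grid with a per-16-element block scale. It is a numerical simulation for building intuition, not the hardware datapath, and the scaling scheme is simplified.

```python
# Simulated 4-bit block-scaled quantize/dequantize round trip.
# E2M1 magnitudes and the 16-element block follow the FP4 format;
# everything else here is a simplification.
import numpy as np

E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    x = x.reshape(-1, block)
    # One scale per block maps the block's max magnitude onto the grid.
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_LEVELS[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Snap each magnitude to the nearest representable level, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_LEVELS).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_LEVELS[idx] * scale).ravel()

if __name__ == "__main__":
    w = np.random.randn(64).astype(np.float32)
    print("max abs error:", np.abs(w - fake_nvfp4(w)).max())
```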
How we trained Nemotron 3 Super
Nemotron 3 Super is trained in three sequential phases, each building on the last. Pretraining establishes broad world knowledge and language understanding at scale. Supervised fine-tuning shapes the model’s behavior around the task types it will encounter in deployment. Reinforcement learning then refines that behavior against verifiable outcomes across diverse agentic environments.
Pretraining
Super is pretrained on 25 trillion tokens using NVFP4, the NVIDIA 4-bit floating-point format optimized for NVIDIA Blackwell. Rather than quantizing a full-precision model after the fact, Super trains natively in reduced precision from the first gradient update—meaning the model learns to be accurate within the constraints of 4-bit arithmetic throughout pretraining, not only at inference. The pretraining corpus spans 10 trillion unique curated tokens, with the model seeing 25 trillion total tokens across the run, including additional compute focused on reasoning and coding.
Supervised fine-tuning
Before reinforcement learning, Super undergoes supervised fine-tuning on about 7 million SFT samples. These are drawn from a broader post-training corpus of 40 million samples covering reasoning, instruction following, coding, safety, and multi-step agent tasks. This stage establishes the behavioral foundation that RL then refines. The model learns the format and structure of correct responses across task types, giving the subsequent RL phase a stable starting point rather than optimizing from a raw pretrained checkpoint.
Multi-environment reinforcement learning
To align Super with real agentic behavior, the model is post-trained using reinforcement learning across diverse environments in NeMo Gym, the NVIDIA open source library for building and scaling RL training environments. These environments evaluate the model’s ability to perform sequences of actions—generating correct tool calls, writing functional code, producing multi-part plans that satisfy verifiable criteria—not just producing satisfying single-turn responses. These trajectories form the core training data for running reinforcement learning at scale with the NeMo RL open library.
This trajectory-based reinforcement produces a model that behaves reliably in multi-step workflows, reduces reasoning drift, and handles the kinds of structured operations common in agentic pipelines.
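The sketch below shows what trajectory-based collection can look like in the abstract. The Environment protocol, verify() reward, and collect_rollout helper are hypothetical stand-ins for illustration, not the NeMo Gym API.

```python
# Hypothetical sketch of trajectory collection for agentic RL.
# The interfaces here are illustrative, not NeMo Gym's.
from typing import Protocol

class Environment(Protocol):
    def reset(self) -> str: ...                           # initial task prompt
    def step(self, action: str) -> tuple[str, bool]: ...  # observation, done
    def verify(self) -> float: ...                        # verifiable reward

def collect_rollout(env: Environment, policy, max_turns: int = 16):
    """Roll the policy through one multi-step task and attach the
    verifiable end-of-episode reward to the whole trajectory."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(obs)                 # tool call, code, or plan step
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break
    return trajectory, env.verify()          # reward from a checked outcome
```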
Benchmarking Nemotron 3 Super
Nemotron 3 Super achieves leading accuracy across numerous important agentic benchmarks while maintaining high throughput.


The “Super + Nano” deployment pattern
Nemotron 3 Nano is an excellent choice for achieving high accuracy on targeted, individual steps within an agentic workflow. However, when multi-agent applications escalate to complex, multi-step activities, they require a high-capacity model for superior planning and reasoning. Consider a computer-use agent that must decide between different modalities of tools in order to, say, create a presentation with 10 high-quality slides.
Nemotron 3 Super is ideal for this use. For example, in software development, simple merge requests can be addressed by Nemotron 3 Nano, while complex coding tasks that require deeper understanding of the codebase can be handled by Nemotron 3 Super. Expert-level coding tasks can be escalated to proprietary models.
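A minimal sketch of this routing pattern is below; the model names and the complexity heuristic are illustrative assumptions.

```python
# Hypothetical "Super + Nano" router: send simple steps to the small
# model, escalate complex multi-step work to the high-capacity one.
NANO = "nemotron-3-nano"    # targeted, individual steps
SUPER = "nemotron-3-super"  # planning and deep reasoning

def pick_model(task: str, n_files_touched: int, needs_planning: bool) -> str:
    if needs_planning or n_files_touched > 5:
        return SUPER        # complex coding task or multi-step plan
    return NANO             # simple merge request, single tool call

print(pick_model("fix typo in README", n_files_touched=1, needs_planning=False))
```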
Building with Super’s open resources
Nemotron 3 Super is fully open—weights, datasets, and recipes—so developers can easily customize, optimize, and deploy the model on their own infrastructure for maximum privacy and security.
Model weights
Full parameter checkpoints for Nemotron 3 Super are available on Hugging Face and through NVIDIA NIM. The NVIDIA Nemotron Open Model License gives enterprises the flexibility to maintain data control and deploy anywhere.
End-to-end training and evaluation recipes
We’re releasing the complete training and evaluation recipe for Nemotron 3 Super, covering the full pipeline from pretraining through alignment. This enables developers to reproduce Super’s training, adapt the recipe for domain-specific variants, or use it as a starting point for their own hybrid architecture research.
Deployment cookbooks
We’ve built ready-to-use cookbooks for major inference engines, each with configuration templates, performance tuning guidance, and reference scripts; a minimal serving sketch follows the list:
- vLLM Cookbook: High-throughput continuous batching and streaming for Super.
- SGLang Cookbook: Fast, lightweight inference optimized for multi-agent tool-calling workloads.
- NVIDIA TensorRT LLM Cookbook: Fully optimized TensorRT LLM engines with latent MoE kernels for production-grade, low-latency deployment.
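As a minimal example of the serving side, the following uses vLLM's offline inference API. The Hugging Face model ID is a placeholder; check the model card for the actual checkpoint name.

```python
# Minimal vLLM offline inference sketch (placeholder model ID).
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Nemotron-3-Super")          # placeholder ID
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Summarize this merge request: ..."], params)
print(outputs[0].outputs[0].text)
```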
Fine-tuning cookbooks
Explore our Nemotron 3 Super customization cookbooks to efficiently fine-tune the model for your domain (LoRA/SFT) or advance its agentic reasoning capabilities (GRPO/DAPO). A minimal LoRA sketch is shown below.
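Here is a minimal LoRA setup using Hugging Face PEFT; the model ID and target module names are placeholders, and the cookbooks above carry the tested configurations.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT.
# Model ID and target_modules are placeholder assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-3-Super", trust_remote_code=True)    # placeholder ID
config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],  # assumed names
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # adapters are a tiny fraction of 120B
```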
Open datasets
Nemotron 3 Super is built on a fully open, end-to-end data pipeline that spans pretraining, post-training, and interactive reinforcement learning—giving developers reproducible building blocks for agentic AI.
- Pretraining corpora: 10 trillion curated tokens, trained over 25 trillion total seen tokens, plus an additional 10 billion tokens focused on reasoning and 15 million coding problems. All aggressively deduplicated and quality-filtered to maximize signal-to-noise.
- Post-training datasets: 40 million new supervised and alignment samples covering reasoning, instruction following, coding, safety, and multi-step agent tasks across supervised fine-tuning, preference data, and RL trajectories (about 7 million used directly for SFT).
- RL tasks and environments: Interactive RL across 21 environment configurations and 37 datasets (~10 of which are being released), including software-engineering-style agent training and tool-augmented search/planning tasks—moving beyond static text into dynamic, verifiable execution workflows and generating ~1.2 million environment rollouts during training.
Open training and evaluation infrastructure
NVIDIA publishes development techniques and tools, giving researchers and enterprises the flexibility to customize Nemotron 3 Super or build their own reasoning models. All recipes integrate with the Nemotron GitHub repository, NeMo Gym, NeMo RL, NVIDIA NeMo Data Designer, NVIDIA NeMo Curator, and NVIDIA NeMo Evaluator—providing a complete, reproducible pipeline from data to deployment.
All Nemotron models are released with an open evaluation approach, including a published evaluation recipe that enables anyone to rerun and inspect the full evaluation pipeline for Nemotron 3 Super.
Get started
Nemotron 3 Super is live now. Available across leading inference platforms and packaged as NVIDIA NIM, Super can run anywhere from the workstation to the cloud. Try it on Perplexity with a Pro subscription, or through the API, OpenRouter, or build.nvidia.com.
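As a quick-start sketch, the snippet below calls the OpenAI-compatible NVIDIA API endpoint; the model ID is a placeholder, so check build.nvidia.com for the exact name, and set NVIDIA_API_KEY in your environment.

```python
# Quick-start call against the OpenAI-compatible NVIDIA endpoint.
# Model ID is a placeholder; see build.nvidia.com for the exact name.
import os
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])
resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super",            # placeholder ID
    messages=[{"role": "user", "content": "Plan a 10-slide deck outline."}],
    max_tokens=512)
print(resp.choices[0].message.content)
```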
Download the weights from Hugging Face, launch an optimized instance through NVIDIA NIM, fine-tune with Unsloth, or start with the cookbooks to get running in minutes.
Super is also available through Baseten, Cloudflare, DeepInfra, Fireworks AI, FriendliAI, Inference.net, Lightning AI, Modal, Nebius, and Together AI.
Check out our GitHub repository, which has getting-started instructions for platforms like OpenCode, OpenHands, and OpenClaw.
For the complete technical details, read the Nemotron 3 Super technical report.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. Visit the Nemotron developer page for resources to get started. Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com. And engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.
