Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that may understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities.
At GTC 2026, NVIDIA introduced a brand new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack:
Along with open data, training recipes, and NVIDIA NeMo tools, the Nemotron family of models provides an end-to-end toolkit to construct, evaluate, and optimize production-grade agentic AI systems.
This blog explores the newest Nemotron 3 models, their performance, and the way developers can use them to construct scalable, multimodal, and real-time AI agents.
Power multi-agent systems with NVIDIA Nemotron 3 Super
Multi-agent systems suffer from “context explosion” with massive token histories 15x that of normal chat and a “pondering tax” with chain-of-thought reasoning for each decision. NVIDIA Nemotron 3 Super is an open hybrid mixture-of-experts (MoE) model that prompts just 12B parameters per pass, delivering high accuracy and efficiency for a fraction of the compute.
A hybrid architecture with Mamba and Transformer layers, multi‑token prediction, and NVFP4 precision on NVIDIA Blackwell GPUs delivers as much as 5x higher throughput than the previous generation while reducing memory footprint and value. A configurable “pondering budget” lets developers certain chain‑of‑thought to maintain latency and spend predictable, even for continuous agent workloads.
With a 1M-token context window and reinforcement learning across 10+ environments, Nemotron 3 Super excels at coding, math, instruction following, and function-calling, making it ideal for multi-agent applications—with significantly higher throughput on Blackwell when running in NVFP4.


Nemotron 3 Super uses latent MoE to call 4 expert specialists for the inference cost of just one, compressing tokens before they reach the experts.
External evaluations back this up. On the Artificial Evaluation Intelligence Index for open‑weight models under 250B parameters, Nemotron 3 Super NVFP4 ranks among the many top models, matching the best intelligence scores from leading alternatives.


Within the intelligence‑versus‑efficiency plot, Nemotron 3 Super lands in essentially the most attractive upper‑right quadrant—combining strong task performance with high output throughput per GPU—making it a compelling alternative for cost‑sensitive production agents.
Nemotron 3 Super—with open weights, open training data, and open development recipes—is right for software development, deep research, cybersecurity, and the financial services industry.
Keep agents secure with Nemotron 3 Content Safety
As agents expand from text‑only to multimodal workflows, safety guardrails must evolve across inputs, retrieval, and outputs. They have to even be applicable in use cases like enterprise copilots and user-generated content (think dating apps or social media), and detect prompt injection in agentic systems similar to healthcare, where self-harm is a priority.
Nemotron 3 Content Safety is a compact 4B‑parameter multimodal safety model that detects unsafe or sensitive content across text and pictures. Built on the Gemma‑3‑4B backbone with an adapter‑based classification head, it delivers high‑accuracy safety classification at low latency that’s ideal for production agentic pipelines. It fuses visual and language features to supply an easy secure/unsafe decision, with optional granular category labels. A fast keyword toggle lets developers choose from fast binary classification and full taxonomy reporting, supporting each low‑latency paths and deeper inspection.
On a collection of multimodal, multilingual safety benchmarks, Nemotron 3 Content Safety reaches roughly 84% accuracy, outperforming alternative safety models across the identical tasks while keeping latency low enough for in‑line moderation in production pipelines.


The model uses the identical 23‑category taxonomy as Aegis 1–3, covering classes similar to hate, harassment, violence, sexual content, plagiarism, and unauthorized advice. Trained on high‑quality Aegis datasets and human‑annotated real‑world images—relatively than primarily synthetic data—the model performs strongly across multimodal benchmarks in its 12 supported languages, with solid zero‑shot generalization beyond them.
Natural conversations with Nemotron 3 VoiceChat
Traditional voice AI relies on cascaded pipelines, automatic speech recognition (ASR), a big language model (LLM), and text-to-speech (TTS)—all of which introduce latency, complexity, and multiple points of failure.
Nemotron 3 VoiceChat is a 12B-parameter end-to-end speech model for full-duplex, real-time conversational AI, currently in early access. Unlike cascaded stacks, VoiceChat directly analyzes audio input and generates audio output in a unified and streaming LLM architecture. Using this single model eliminates multi-model orchestration. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech (Parakeet encoder) and TTS decoder, VoiceChat delivers natural, interruptible conversations with low latency.
This model, in its early-access stage, has landed in essentially the most attractive upper right quadrant of the Artificial Evaluation Speech to Speech leaderboard. The graphic below plots conversational dynamics against speech reasoning performance, where Nemotron 3 VoiceChat lands within the highlighted upper‑right quadrant, alongside NVIDIA PersonaPlex, a full duplex, 7B-parameter research model. This implies developers get each responsive turn‑taking behavior and robust reasoning over audio; each are critical for assistants that must sound natural and stay on task.


With a streamlined end-to-end pipeline, VoiceChat targets sub-300ms end-to-end latency, processing 80ms audio chunks faster than real-time. A single model means fewer points of failure, reduced technical debt, and easier deployment for conversational agents in healthcare, financial services, telecommunications, gaming, and more.
Understand the world with NVIDIA Nemotron 3 Omni
Agentic systems increasingly need to grasp real-world data in several formats: video, audio, documents, UI screens, and reason across modalities. Existing solutions are either closed source or face compliance challenges for global enterprise deployment.
NVIDIA Nemotron 3 Nano Omni is the primary open, production-ready native omni-understanding foundation model delivering high-context video reasoning enhanced through audio transcription. Nano Omni is powered by NVIDIA Nemotron speech (Parakeet encoder), state-of-the-art optical character recognition (OCR) reasoning with a Nemotron 3 Nano language backbone, and NVIDIA’s first GUI-trained system for real agentic applications.
The architecture uses 3D convolution layers (Conv3D) for efficient handling of temporal-spatial data in video, and efficient video sampling (EVS) enables processing of longer videos at the identical computational cost by identifying and pruning temporally static patches. Stay tuned for release updates about this model.
Improve multimodal search with Llama Nemotron Embed VL and Rerank VL
Agentic RAG pipelines depend on retrieval to ground generation on evidence, not only prompts. But enterprise data lives in PDFs with charts, scanned contracts, tables, and slide decks—formats that text-only retrieval misses entirely.
Llama Nemotron Embed VL and Llama Nemotron Rerank VL are compact multimodal models that enable accurate visual document retrieval while remaining compatible with standard vector databases. On the ViDoRe V3/MTEB Pareto curve, which plots retrieval accuracy versus tokens processed per second on a single NVIDIA H100 GPU, Llama Nemotron Embed VL occupies the Pareto frontier. It delivers competitive or higher accuracy at high throughput relative to each open and industrial alternatives.


Llama Nemotron Embed VL is a 1.7B-parameter dense embedding model that encodes page images and text right into a single-dimensional vector, with support for Matryoshka embeddings. Built on NVIDIA Eagle—a frontier vision-language model with a Llama 3.2 1B backbone and SigLip2 400M vision encoder—it uses contrastive learning for query-document similarity and enables millisecond-latency search with standard vector databases.
Llama Nemotron Rerank VL is a 1.7B-parameter cross-encoder reranker that scores query-page relevance. When paired with the Llama Nemotron Embed VL model, it further increases accuracy by reranking retrieved text chunks and pictures.
Evaluate and optimize with NVIDIA NeMo
Constructing production agents requires not only strong models but additionally robust tools for evaluation and optimization. NVIDIA NeMo provides tools to guage, compare, and tune agentic systems:
- NVIDIA NeMo Evaluator, enables robust, reproducible benchmarking with support for agentic evaluation. By providing standardized evaluation setups, developers can benchmark performance, validate outputs, and compare models under consistent conditions.
- NVIDIA NeMo Agent Toolkit is an open source framework for profiling and optimizing agentic systems end-to-end. Bring agents from LangChain, AutoGen, AWS Strands, or other frameworks—without code changes—and get visibility into latency bottlenecks, token costs, and orchestration overhead to ship performant agents at scale.
Start constructing with Nemotron
Agentic AI is a shift from systems that reply to systems that act. It’s a coordinated stack of models, tools, memory, and guardrails that may plan, execute, critique, and adapt. If it’s just a much bigger model in the identical chat window, it’s not agentic.
The Nemotron family of models, released under the NVIDIA permissive open model licenses, is built for this multi‑model reality. Nemotron 3 Super anchors long‑context reasoning and planning. Nemotron 3 Content Safety watches every step, moderating multimodal inputs, retrieved content, and outputs. Nemotron 3 VoiceChat turns that intelligence into full‑duplex, real‑time conversations. Nemotron 3 Nano Omni (coming soon) gives agents eyes and ears across video, audio, documents, charts, and GUIs. Around them, NeMo tools add retrieval, tool‑calling, evaluation, and judge models so agents can rating their very own work and improve.
Efficiency is the hidden requirement that makes production viable. Real agents make dozens or tons of of model calls per task, so Nemotron models are right‑sized and optimized for throughput, latency, and value. And since they’re open and customizable, teams can tune behaviors, align to their very own data, and deploy where their security and compliance teams need them.
With Nemotron and NVIDIA NeMo, you’re getting the constructing blocks for trustworthy, repeatable, and scalable digital assistants on your production agentic systems.
Start today:
Not sleep-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
Visit the Nemotron developer page for resources to start. Explore open Nemotron models and datasets on Hugging Face and Blueprints on construct.nvidia.com .
Engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.
