Building an agent is more than just calling an API: it requires stitching together retrieval, speech, safety, and reasoning components so that they behave like one cohesive system. Each layer has its own interface, latency constraints, and integration challenges, and you start to feel them as soon as you move beyond a simple prototype.
In this tutorial, you’ll learn how to build a voice-powered RAG agent with guardrails using the latest NVIDIA Nemotron models released at CES 2026 for speech, RAG, safety, and reasoning. By the end, you’ll have an agent that:
- Listens to spoken input
- Uses multimodal RAG to ground itself in your data
- Reasons over long context
- Applies guardrails before responding
- Returns a safe answer as audio
You can start on your local GPU for development, then deploy the same code to a scalable NVIDIA environment, whether that’s a managed GPU service, an on-demand cloud workspace, or a production-ready API runtime, without changing your workflow.
Prerequisites
Before you start this tutorial, you’ll need:
- NVIDIA API Key for cloud-hosted reasoning models (get one free)
- Local deployment requires:
- ~20GB of disk space
- NVIDIA GPU with at least 24GB of VRAM
- Operating system with Bash (Ubuntu, macOS, or Windows Subsystem for Linux)
- Python 3.10+ environment
- One hour of free time
What you’ll build


Step 1: Set up the environment
To build a voice agent, you’ll run several NVIDIA Nemotron models together (shown above). The speech, embedding, reranking, and safety models run locally via Transformers and NVIDIA NeMo, while the reasoning models use the NVIDIA API.
The companion notebook handles all environment configuration. Set your NVIDIA API key for the cloud-hosted reasoning models, and you’re ready to go.
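If you’re working outside the notebook, the only required configuration is the key itself. A minimal sketch is below; the NVIDIA_API_KEY variable name is the one read by the LangChain and OpenAI client snippets later in this tutorial:

import os

# Used by the cloud-hosted reasoning and safety calls later in the tutorial.
# NVIDIA API keys start with "nvapi-".
os.environ["NVIDIA_API_KEY"] = "nvapi-..."  # replace with your own key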
Step 2: Ground the agent with multimodal RAG
Retrieval is the backbone of a reliable agent. With the new Llama Nemotron multimodal embedding and reranking models, you can embed text and images (including scanned documents) and store them directly in a vector index without extra preprocessing. This provides the grounded context that the reasoning model depends on, ensuring the agent references real enterprise data rather than hallucinating.


The llama-nemotron-embed-vl-1b-v2 model supports three input modes—text-only, image-only, and combined image and text—allowing you to index everything from plain documents to slide decks and technical diagrams. In this tutorial, we embed an example that combines both image and text. The embedding model loads via Transformers with flash attention enabled:
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "nvidia/llama-nemotron-embed-vl-1b-v2",
    trust_remote_code=True,
    device_map="auto"
).eval()

# Embed queries and documents (documents is a placeholder list of text passages;
# the companion notebook builds it from your own data)
documents = ["Example passage about how AI improves robotics workflows."]
query_embedding = model.encode_queries(["How does AI improve robotics?"])
doc_embeddings = model.encode_documents(texts=documents)
After initial retrieval, the llama-nemotron-rerank-vl-1b-v2 model reranks the results using both text and images to improve accuracy post-retrieval. In benchmarks, adding reranking improves accuracy by roughly 6 to 7%, a meaningful gain when precision matters.
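A reranking pass might look like the sketch below. The score method and its arguments are assumptions about the interface exposed through trust_remote_code (check the model card for the actual method names), and candidates stands in for the passages returned by first-pass retrieval:

from transformers import AutoModel

# Hypothetical interface: the method name and arguments below are assumptions
# for illustration; consult the model card for the real signature.
rerank_model = AutoModel.from_pretrained(
    "nvidia/llama-nemotron-rerank-vl-1b-v2",
    trust_remote_code=True,
    device_map="auto"
).eval()

candidates = ["passage one...", "passage two..."]  # first-pass retrieval results
scores = rerank_model.score(query="How does AI improve robotics?", documents=candidates)
ranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]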
Step 3: Add real‑time speech with Nemotron Speech ASR
With grounding in place, the next step is enabling natural interaction through speech.


The Nemotron Speech ASR model is a streaming model trained on tens of thousands of hours of English audio from the Granary dataset and a wide range of public speech corpora, optimized for ultra-low-latency, real-time decoding. Developers stream audio to the ASR service, receive text results as they arrive, and feed that output directly into the RAG pipeline.
import nemo.collections.asr as nemo_asr

# Load the streaming ASR model and transcribe a local audio file
model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/nemotron-speech-streaming-en-0.6b"
)
transcription = model.transcribe(["audio.wav"])[0]
The model has configurable latency settings: at its lowest-latency setting of 80 ms, well below the one-second threshold critical for voice assistants, field tools, and hands-free workflows, it achieves 8.53% average WER, and at a 1.1 s latency setting WER improves to 7.16%.
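To connect speech to retrieval, the transcript becomes the query for the Step 2 embedder. A minimal sketch, assuming the embedding model from Step 2 is kept in a separate variable (here embed_model) and that documents and doc_embeddings are still in memory; the companion notebook’s vector index replaces the brute-force similarity shown here:

import torch
import torch.nn.functional as F

# The ASR transcript from Step 3 becomes the retrieval query for Step 2's embedder.
query_emb = torch.as_tensor(embed_model.encode_queries([transcription])).float().cpu()
doc_embs = torch.as_tensor(doc_embeddings).float().cpu()

# Brute-force cosine similarity stands in for the notebook's vector index.
scores = F.cosine_similarity(query_emb, doc_embs)
top_docs = [documents[i] for i in scores.topk(k=min(3, len(documents))).indices.tolist()]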
Step 4: Implement safety with Nemotron Content Safety and PII Models
AI agents operating across regions and languages must understand not only harmful content, but also cultural nuance and context-dependent meaning.


The llama-3.1-nemotron-safety-guard-8b-v3 model provides multilingual content safety across 20+ languages and real-time PII detection across 23 safety categories.
Available via the NVIDIA API, the model makes it straightforward to add input and output filtering without hosting additional infrastructure. It distinguishes between similar phrases that carry different meanings depending on language, dialect, and cultural context, which is especially important when processing real-time ASR output that may be noisy or informal.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

safety_guard = ChatNVIDIA(model="nvidia/llama-3.1-nemotron-safety-guard-8b-v3")
result = safety_guard.invoke([
    {"role": "user", "content": query},
    {"role": "assistant", "content": response},
])
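How you act on result depends on the guard’s output schema. The check below assumes a short text verdict containing the word “unsafe,” a common guard-model convention, so adjust it to the format documented on the model card:

# Assumed output format: a short text verdict such as "unsafe" plus category codes.
# If the guard flags the exchange, replace the answer with a refusal before it
# reaches speech synthesis or the user.
if "unsafe" in result.content.lower():
    response = "I can't help with that request."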
Step 5: Add long‑context reasoning with Nemotron 3 Nano
NVIDIA Nemotron 3 Nano provides the reasoning capability for the agent, combining an efficient mixture-of-experts (MoE), hybrid Mamba-Transformer architecture with a 1M-token context window. This lets the model include retrieved documents, user history, and intermediate steps in a single inference request.


When retrieved documents contain images, the agent first uses Nemotron Nano VL to describe them in context, then passes all information to Nemotron 3 Nano for the final response. The model supports an optional thinking mode for more complex reasoning tasks:
import os
from openai import OpenAI

# The NVIDIA API exposes an OpenAI-compatible endpoint
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])
completion = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-30b-a3b",
    messages=[{"role": "user", "content": prompt}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
The output routes through the safety filter before being returned, transforming your retrieval-augmented lookup into a full reasoning-capable agent.
Step 6: Wire it all together with LangGraph
LangGraph orchestrates the whole workflow as a directed graph. Each node handles one stage—transcription, retrieval, image description, generation, and safety checking—with clean handoffs between components:
Voice Input → ASR → Retrieve → Rerank → Describe Images → Reason → Safety → Response
The agent state flows through each node, accumulating context as it progresses. This structure makes it straightforward to add conditional logic, retry failed steps, or branch based on content type. The complete implementation in the companion notebook shows how to define each node and wire them into a production-ready pipeline.
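A minimal sketch of that wiring is below. The node names, state fields, and three-node graph are simplified placeholders, not the notebook’s exact definitions; the rerank, image-description, and safety nodes are added the same way:

from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    audio_path: str
    transcript: str
    context: List[str]
    answer: str

def transcribe(state: AgentState) -> dict:
    # Placeholder: run the Step 3 ASR model on state["audio_path"]
    return {"transcript": "..."}

def retrieve(state: AgentState) -> dict:
    # Placeholder: embed the transcript and fetch top-ranked documents (Step 2)
    return {"context": []}

def reason(state: AgentState) -> dict:
    # Placeholder: call Nemotron 3 Nano with the transcript and context (Step 5)
    return {"answer": "..."}

graph = StateGraph(AgentState)
graph.add_node("transcribe", transcribe)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.add_edge(START, "transcribe")
graph.add_edge("transcribe", "retrieve")
graph.add_edge("retrieve", "reason")
graph.add_edge("reason", END)
agent = graph.compile()

result = agent.invoke({"audio_path": "audio.wav"})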
Step 7: Deploy the agent
You can deploy your agent anywhere once it runs cleanly on your machine. Use NVIDIA DGX Spark when you need distributed ingestion, embedding generation, or large-scale batch vector indexing. Nemotron models can be optimized, packaged, and run as NVIDIA NIM (a set of prebuilt, GPU-accelerated inference microservices for deploying AI models on NVIDIA infrastructure) and can be called directly from Spark for scalable processing. Use NVIDIA Brev when you want an on-demand GPU workspace where your notebook runs as-is, with no system setup, plus remote access to your Spark cluster that you can easily share with your team.
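Because NIM microservices expose an OpenAI-compatible API, the same client code from Step 5 can target a self-hosted endpoint; the URL, port, and model name below are placeholders for wherever your NIM container is running:

from openai import OpenAI

# Placeholder endpoint: a locally deployed NIM container serves an
# OpenAI-compatible API (commonly on port 8000); swap in your host and model.
nim_client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
completion = nim_client.chat.completions.create(
    model="nvidia/nemotron-3-nano-30b-a3b",
    messages=[{"role": "user", "content": "Summarize today's retrieved documents."}],
)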
If you want to see the same deployment patterns applied to a physical robot assistant, check out the Reachy Mini personal assistant tutorial built with Nemotron and DGX Spark.
Both environments use the same code path, so you can move easily from experimentation to production with minimal changes.
What you’ve built
You now have the core structure of a Nemotron-powered agent with four components: speech ASR for voice interaction, multimodal RAG for grounding, multilingual content-safety filtering that accounts for cultural nuance, and Nemotron 3 Nano for long-context reasoning. The same code runs from local development to production GPU clusters without changes.
| Component | Purpose |
|---|---|
| Multimodal RAG | Ground responses in real enterprise data |
| Speech ASR | Enable natural voice interaction |
| Safety | Identify unsafe content across languages and cultural contexts |
| Long-Context LLM | Generate accurate responses with reasoning |
Each section in this tutorial aligns directly with a section in the notebook, so you can implement and test the pipeline incrementally. Once it works end-to-end, the same code scales to production deployment.
Ready to build? Open the companion notebook and follow along step-by-step:
If you’d like to explore the underlying components, each of the following is a collection of Nemotron models available on Hugging Face, along with the tools used to orchestrate the agent:
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.
