Develop Specialized AI Agents with Recent NVIDIA Nemotron Vision, RAG, and Guardrail Models

-



Agentic AI
is an ecosystem where specialized language and vision models work together. They handle planning, reasoning, retrieval, and safety guardrailing.

Developers need specialized AI agents for domain-specific workflows, real-world deployment, and compliance. Constructing specialized AI requires 4 critical ingredients: open models that may be fine-tuned, robust datasets, recipes for optimum model accuracy and compute, and efficient inference for deploying them at scale.

At NVIDIA GTC DC, we’re unveiling reasoning, vision-language, retrieval-augmented generation (RAG), and safety models with open data and recipes that deliver accuracy, compute efficiency, and openness.

This blog covers the features, performance, and tutorials on using the brand new Nemotron models for constructing multimodal agents, RAG pipelines, and AI with content safety. 

The image shows the new NVIDIA Nemotron models launched at GTC DC. This includes models for document intelligence, video understanding, multilingual content safety, and information retrieval.The image shows the new NVIDIA Nemotron models launched at GTC DC. This includes models for document intelligence, video understanding, multilingual content safety, and information retrieval.
Figure 1. Recent Nemotron models for document intelligence, video understanding, multilingual content safety, and data retrieval

Enable agents to think efficiently with NVIDIA Nemotron Nano 3

NVIDIA Nemotron Nano 3 is an efficient and accurate 32B parameter MoE with 3.6B lively parameters designed for developers to construct specialized agentic AI systems. Available soon, this model delivers higher throughput in comparison with similarly-sized dense models, enabling it to explore a bigger search space, do higher self-reflection, and supply higher accuracy across scientific reasoning, coding, math, and tool-calling benchmarks. Moreover, the MoE architecture reduces compute costs and latency.

Add multimodal understanding and reasoning with NVIDIA Nemotron Nano 2 VL

NVIDIA Nemotron Nano 2 VL, a number one model on OCRBenchV2, is an open 12B multimodal reasoning model for document intelligence and video understanding. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and videos. This makes the model helpful for agents focused on data evaluation, document processing, and visual understanding in applications like generating reports, curating videos, and dense captioning for media asset management and retrieval-augmented search. 

Video 1. Constructing multimodal AI agents for document and video intelligence using NVIDIA Nemotron VLMs

At its core, this vision-language model (VLM) incorporates a hybrid Mamba-Transfomer architecture delivering on-par accuracy, high token throughput, and low latency for efficient large-scale reasoning for visual and text tasks. This model is trained on the Nemotron VLM Dataset V2 with over 11M high-quality samples covering several tasks corresponding to image Q&A, OCR, dense captioning, video Q&A, and multi-image reasoning. Read more concerning the dataset. We used FP8 for faster speed and context parallelism to administer longer inputs, resulting in greater efficiency and accuracy for video and long-document tasks.

The bar chart shows accuracy of Nemotron Nano VL and Nemotron Nano 2 VL models across visual benchmarks for multi-image understanding, document intelligence, and video captioning.The bar chart shows accuracy of Nemotron Nano VL and Nemotron Nano 2 VL models across visual benchmarks for multi-image understanding, document intelligence, and video captioning.
Figure 2. Nemotron Nano 2 VL delivers improved accuracy across visual benchmarks for multi-image understanding, document intelligence, and video captioning

This model introduces the Efficient Video Sampling (EVS) method that identifies and prunes temporally static patches in video sequences. EVS reduces token redundancy, preserving essential semantics, for the model to process longer clips and deliver results more swiftly.

The line graph shows accuracy of two video benchmarks across various levels of tokens dropped with EVS. The graphs stay largely flat in terms of accuracy and slope down slightly after 50% token drops.The line graph shows accuracy of two video benchmarks across various levels of tokens dropped with EVS. The graphs stay largely flat in terms of accuracy and slope down slightly after 50% token drops.
Figure 3. EVS enables Nemotron Nano 2 VL to attain as much as 2.5x higher throughput without sacrificing accuracy

Quantized for FP4, FP8, and BF16, this model is supported by vLLM and TRT-LLM inference engines and is obtainable as an NVIDIA NIM. Developers can use the NVIDIA AI Blueprint for video search and summarization (VSS) to investigate long videos and NVIDIA NeMo to curate multimodal datasets and customize or construct their very own models. The technical report also guides developers on the models for constructing custom, optimized models with Nemotron techniques.

Improve document intelligence with NVIDIA Nemotron Parse 1.1

We’re also releasing NVIDIA Nemotron Parse 1.1, a compact 1B parameter VLM-based document parser for enhanced document intelligence. Given a picture, this model extracts structured text and tables with bounding boxes and semantic classes, enabling downstream applications corresponding to improved retriever accuracy, richer large language model (LLM) training data, and improved document processing pipelines.

he bar chart shows accuracy comparison of Nemotron Parse 1.1 with a leading open popular model. The Nemotron model delivers significant accuracy improvements on PubTabNet benchmark, designed to evaluate image-based table recognition.he bar chart shows accuracy comparison of Nemotron Parse 1.1 with a leading open popular model. The Nemotron model delivers significant accuracy improvements on PubTabNet benchmark, designed to evaluate image-based table recognition.
Figure 4. Nemotron Parse 1.1 delivers leading accuracy on the PubTabNet benchmark for image-based table recognition

Nemotron Parse delivers comprehensive text, tables, and layout understanding to be used in retriever and curator workflows. Its extraction datasets and structured outputs support each LLM and VLM training, and boost inference accuracy for VLMs at runtime.

Ground agents with open RAG models

NVIDIA Nemotron RAG is a collection of models for constructing RAG pipelines and real-time business insights. It ensures data privacy and connects securely to proprietary data across environments, supporting enterprise-grade retrieval. As a core component of NVIDIA AI-Q and the NVIDIA RAG Blueprint, Nemotron RAG provides a scalable and production-ready foundation for intelligent, retrieval-based AI applications.

It enables the event of a wide selection of applications—from multi-agent systems where AI agents perceive, plan, and act to attain complex goals, to generative co-pilots powered by specialized large language models that assist with IT support, HR operations, and customer support. It also supports AI assistants that interact naturally with developers using company data and summarization tools that create written reports or visual media highlights.

The embedding models have consistently led on industry leaderboards like ViDoRe and MTEB for visual and multimodal retrieval, MMTEB for multilingual text retrieval, making them well-suited for constructing best-in-class RAG pipelines. The brand new models are actually available on Hugging Face.

Video 2. Developing custom AI agents powered with information retrieval using NVIDIA Nemotron RAG

Make AI safer with the Llama 3.1 Nemotron Safety Guard

As developers construct agentic AI systems that may reason, retrieve, and act autonomously, safety becomes essential to forestall harmful or unintended behavior. LLMs may be misused, prompted into unsafe outputs, or miss cultural nuance—especially in non-English contexts—making reliable moderation models critical to responsible development.

The brand new Llama 3.1 Nemotron Safety Guard 8B V3 is a multilingual content safety model. It’s fine-tuned on the Nemotron Safety Guard dataset, a culturally diverse dataset with greater than 386K samples covering 23 regionally adapted safety categories, including examples of adversarial and jailbreak prompts inside each category.

The model detects unsafe or policy-violating content in each prompts and responses across 23 safety categories and nine languages, corresponding to Arabic, Hindi, and Japanese. Figure 4 illustrates our model’s performance comparison on a per-language basis. 

Bar chart comparing Llama 3.1 Nemotron Safety Guard’s performance across multiple languages.Bar chart comparing Llama 3.1 Nemotron Safety Guard’s performance across multiple languages.
Figure 5. A comparison of the Llama 3.1 Nemotron Safety Guard model performance across languages

The model achieves 84.2% harmful content classification accuracy with minimal latency, as seen in Figure 5. Two novel techniques power its performance: 1) LLM-driven cultural adaptation aligns prompts and responses with local idioms and sensitivities, and a couple of) consistency filtering removes noisy or misaligned samples for high-quality fine-tuning.

Bar chart showing average scores of 4 safety models being tested across 8 datasets, 23 safety categories, and 8 languages and their average harmful content classification accuracy.Bar chart showing average scores of 4 safety models being tested across 8 datasets, 23 safety categories, and 8 languages and their average harmful content classification accuracy.
Figure 6. In benchmark testing across eight datasets, the Llama 3.1 Nemotron Safety Guard model delivers best-in-class performance across 23 safety categories

Lightweight and deployable on a single GPU or as an NVIDIA NIM, it integrates with NeMo Guardrails for real-time, multilingual content safety in agentic AI pipelines. Explore the model and dataset on HuggingFace or construct.nvidia.com to start out constructing safer, globally aligned AI systems.

Video 3. Power AI with culturally-aware LLM guardrails using Nemotron Safety Guard

Evaluate your models and optimize AI agents with NVIDIA NeMo

To make sure LLM capabilities are measured reliably, the NVIDIA NeMo Evaluator SDK was recently open sourced. This SDK enables reproducible benchmarking, giving developers confidence in real-world performance beyond reported scores. 

NeMo Evaluator can now also assess models on dynamic, interactive workflows with support for ProfBench, a benchmark suite designed to guage agentic AI behaviors, including multi-step reasoning and power usage. 

By open-sourcing standardized evaluation setups, developers can benchmark performance, validate outputs, and compare models under consistent conditions. 

NeMo Agent Toolkit is an open-source framework integrated with industry standards like MCP and compatible with other frameworks, including Semantic Kernel, Google ADK, LangChain, and CrewAI. The toolkit’s latest Agent Optimizer feature robotically tunes key hyperparameters—LLM type, temperature, max tokens—and optimizes for accuracy, groundedness, latency, token usage, and custom metrics. This reduces trial-and-error and accelerates agent, tool, and workflow development. 

Try it now with our GitHub notebook.

Start constructing your AI with Nemotron now 

On this blog post, we’ve introduced the most recent members of the Nemotron family and a small sample of what is feasible with them.

To start, download the Nemotron models and datasets from Hugging Face. 

Nemotron Nano 2 VL can also be hosted by inference providers including Baseten, Deep Infra, Fireworks, Hyperbolic, Nebius, and Replicate to supply  an efficient path from development to production for agentic AI.

You may as well evaluate the NVIDIA-hosted API endpoints on construct.nvidia.com and OpenRouter.

Not sleep up to now on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.



Source link

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x