Build a Retrieval-Augmented Generation (RAG) Agent with NVIDIA Nemotron



Unlike traditional LLM-based systems, which can be limited by their training data, retrieval-augmented generation (RAG) improves text generation by incorporating relevant external information. Agentic RAG goes a step further by leveraging autonomous systems integrated with LLMs and retrieval mechanisms. This enables these systems to make decisions, adapt to changing requirements, and perform complex reasoning tasks dynamically.

In this guide to the self-paced workshop for building a RAG agent, you’ll gain: 

  • Understanding of the core principles of agentic RAG, including NVIDIA Nemotron, an open model family with open data and weights.
  • Knowledge of how to build agentic RAG systems using LangGraph.
  • A turnkey, portable development environment.
  • Your personal customized agentic RAG system, ready to share as an NVIDIA Launchable.

Video walkthrough

Video 1. Build a RAG Agent with NVIDIA Nemotron

Opening the workshop

Launch the workshop as an NVIDIA Launchable:

Figure 1. Click the “Deploy Now” button to deploy the NVIDIA DevX Workshop in the cloud.

With your Jupyter Lab environment running, locate the NVIDIA DevX Learning Path section of the JupyterLab Launcher. Select the Agentic RAG tile to open the lab instructions and get started.

Figure 2. Click on the “Agentic RAG” tile in NVIDIA DevX Learning Path to open lab instructions.

Setting up secrets

To follow along with this workshop, you’ll need to collect and configure a few project secrets.

  • NGC API Key: This permits access to NVIDIA software, models, containers, and more
  • (optional) LangSmith API Key: This connects the workshop to LangChain’s platform for tracing and debugging your AI Agent

You can use the Secrets Manager tile under the NVIDIA DevX Learning Path section of the JupyterLab Launcher to configure these secrets for your workshop development environment. Confirm in the logs tab that the secrets have been added successfully.

Figure 3. Use the “Secrets Manager” tile under the NVIDIA DevX Learning Path section to configure project secrets (API keys).

Introduction to RAG architecture

Once your workshop environment has been set up, the next step is understanding the architecture of the agentic RAG system you’ll build.

RAG enhances the capabilities of LLMs by incorporating relevant external information during output text generation. Traditional language models generate responses based solely on the knowledge captured in their training data, which can be a limiting factor, especially when dealing with rapidly changing information, highly specialized knowledge domains, or enterprise confidential data. RAG, on the other hand, is a powerful tool for generating responses based on relevant unstructured data retrieved from an external knowledge base.

Figure 4. Traditionally, RAG uses a user prompt to retrieve contextually relevant documents, providing them as context to the LLM for a more informed response.

The standard flow for a RAG system is:

  1. Prompt: A user generates a natural language query.
  2. Embedding Model: The prompt is converted into vectors.
  3. Vector Database Search: After the user’s prompt is embedded into a vector, the system searches a vector database filled with semantically indexed document chunks, enabling fast retrieval of contextually relevant data chunks.
  4. Reranking Model: The retrieved data chunks are reranked to prioritize the most relevant data.
  5. LLM: The LLM generates responses informed by the retrieved data.

This approach ensures that the language model can access up-to-date and specific information beyond its training data, making it more versatile and effective.

Understanding ReAct agent architecture

Unlike traditional LLM-based applications, agents can dynamically select tools, incorporate complex reasoning, and adapt their approach based on the situation at hand.

Figure 5. A ReAct agent can iteratively reason and call out to user-defined tools to generate a higher-quality RAG-based response.

ReAct agents are a simple agentic architecture that uses “reasoning and acting” via tool-calling LLMs. If the LLM requests any tool calls after taking in the prompt, those tools will be run, their results added to the chat history, and the updated conversation sent back to the model to be invoked again.
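
To make the “reasoning and acting” loop concrete, here is a minimal sketch of how a tool-calling LLM can drive it by hand. This is illustrative only and assumes an llm that supports bind_tools and a tools list such as the retriever tool built later in this workshop; the actual workshop code delegates this loop to LangGraph’s prebuilt ReAct agent.

# Minimal ReAct-style loop sketch (assumes `llm` supports tool calling and `tools` is a list of LangChain tools)
from langchain_core.messages import HumanMessage, ToolMessage

llm_with_tools = llm.bind_tools(tools)
tools_by_name = {t.name: t for t in tools}

messages = [HumanMessage("How do I reset my VPN password?")]
while True:
    ai_msg = llm_with_tools.invoke(messages)
    messages.append(ai_msg)
    if not ai_msg.tool_calls:
        break  # no tool requested, the model answered directly
    for call in ai_msg.tool_calls:
        # Run the requested tool and feed its output back into the chat history
        result = tools_by_name[call["name"]].invoke(call["args"])
        messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))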

RAG works well, but it’s limited because the LLM can’t determine how data is retrieved, control for data quality, or choose among data sources. Agentic RAG takes the concept of RAG a step further by combining the strengths of LLMs, such as language comprehension, contextual reasoning, and flexible generation, with dynamic tool usage and advanced retrieval mechanisms such as semantic search, hybrid retrieval, reranking, and data source selection. Creating a ReAct agent for RAG just requires giving it the retrieval chain as a tool so the agent can decide when and how to search for information.

Figure 6. The complete agentic RAG pipeline will involve adding the ReAct agent to the Retrieval Chain where the contextual documents are stored.

Agentic RAG employs a ReAct agent architecture in which the reasoning LLM systematically decides whether to retrieve information via tool calling or respond directly, activating the retrieval pipeline only when additional context is required to better address the user’s request.

Learn and implement the code

Now that we understand the concepts, let’s dive into the technical implementation. We’ll start with the foundational components before building up to the complete agentic RAG system:

  1. Models
  2. Tools
  3. Data Ingestion
  4. Text Splitting
  5. Vector Database Ingestion
  6. Document Retriever and Reranker
  7. Retriever Tool Creation
  8. Agent Configuration

Foundations: the models

The workshop relies on NVIDIA NIM endpoints for the core model powering the agent. NVIDIA NIM provides high-performance inference capabilities, including:

  • Tool binding: Native support for function calling.
  • Structured output: Built-in support for Pydantic models.
  • Async operations: Full async/await support for concurrent processing.
  • Enterprise reliability: Production-grade inference infrastructure.

This example shows the ChatNVIDIA LangChain connector using NVIDIA NIM:

from langchain_nvidia_ai_endpoints import ChatNVIDIA
LLM_MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"
llm = ChatNVIDIA(model=LLM_MODEL, temperature=0.6, top_p=0.95, max_tokens=8192)
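
As a quick illustration of the structured output support listed above, the same llm object can parse responses directly into a Pydantic model. This is a hedged sketch assuming ChatNVIDIA follows the standard LangChain chat-model interface; the TicketSummary schema is a hypothetical example, not part of the workshop code.

from pydantic import BaseModel

class TicketSummary(BaseModel):
    """Hypothetical schema used only to illustrate structured output."""
    title: str
    priority: str

# Responses are parsed into the Pydantic model instead of free-form text
structured_llm = llm.with_structured_output(TicketSummary)
summary = structured_llm.invoke("Summarize: my laptop won't boot and I have a demo tomorrow.")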

To ensure the quality of the LLM-based application, it’s crucial that the agent receives clear instructions that guide decision-making, remove ambiguity, and clarify how it should treat retrieved documents. One such example from code/rag_agent.py is provided below:

SYSTEM_PROMPT = (
    "You are an IT help desk support agent.\n"
    "- Use the 'company_llc_it_knowledge_base' tool for questions likely covered by the internal IT knowledge base.\n"
    "- Always write grounded answers. If unsure, say you don't know.\n"
    "- Cite sources inline using [KB] for knowledge base snippets.\n"
    "- If the knowledge base does not contain sufficient information, clearly state what information is missing.\n"
    "- Keep answers brief, to the point, and conversational."
)

This prompt shows a few key principles of reliable LLM prompting for RAG-based applications:

  • Role specification: Clear definition of the agent’s expertise and responsibilities.
  • Tool utilization: Instruct the agent on which tools to use for specific tasks.
  • Grounding: Emphasize providing answers based on reliable sources and admitting uncertainty.
  • Source citation: Provide guidelines for citing sources to ensure transparency.
  • Communication style: Specify the desired communication style.

In code/rag_agent.py we define the models needed for the IT Help Desk agent to respond to user queries using the knowledge base:

  • The LLM, Nemotron Nano 9B V2, is the primary reasoning model used for generating responses.
  • The NVIDIA NeMo Retriever embedding model, Llama 3.2 EmbedQA 1B V2, is used for converting documents into vector embedding representations for storage and retrieval.
  • The NeMo Retriever reranking model, Llama 3.2 RerankQA 1B V2, is used for reranking retrieved documents so the most relevant data is considered first.

These models collectively enable the IT Help Desk agent to reply user queries accurately by leveraging a mix of language generation, document retrieval, and reranking capabilities.
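
For reference, the model names are held in constants that the rest of rag_agent.py refers to (LLM_MODEL, RETRIEVER_EMBEDDING_MODEL, RETRIEVER_RERANK_MODEL). A sketch of what those definitions look like is shown below; the exact embedding and reranker ID strings are assumptions based on NVIDIA API Catalog naming, so verify them against build.nvidia.com.

# Model identifiers referenced throughout this post (ID strings below are assumed; confirm on build.nvidia.com)
LLM_MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"
RETRIEVER_EMBEDDING_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"
RETRIEVER_RERANK_MODEL = "nvidia/llama-3.2-nv-rerankqa-1b-v2"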

Foundations: the tools

Our RAG agent will have access to the knowledge base provided at ./data/it-knowledge-base, which contains markdown files documenting common IT-related procedures. The retriever tool enables the agent to search the internal IT knowledge base for documents relevant to the user’s query.

A vector database stores, indexes, and queries numerical representations of vectorized embeddings, allowing for fast similarity searches over unstructured data like text, images, and audio. For our purposes, we use an in-memory FAISS database, which is efficient for spinning up small databases. For data ingestion from the knowledge base, we’ll focus on text; additional features like multimodality should be considered for production use cases.

Foundations: data ingestion

The embedding model used is NeMo Retriever llama-3.2-nv-embedqa-1b-v2. This model creates embeddings for documents and queries, which help efficiently retrieve relevant documents from the knowledge base by comparing the semantic similarity between the query and the documents.

To ingest the documents, we’ll chunk the documents, embed those chunks into vectors, and then insert the vectors into the database. Before doing that, we need to load the data from our ./data/it-knowledge-base directory using the LangChain DirectoryLoader.

from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Read the data
_LOGGER.info(f"Reading knowledge base data from {DATA_DIR}")
data_loader = DirectoryLoader(
    DATA_DIR,
    glob="**/*",
    loader_cls=TextLoader,
    show_progress=True,
)
docs = data_loader.load()

Foundations: text splitting

Document splitting is controlled by two things: chunk size and chunk overlap.

Chunk size defines the maximum length of each text chunk. This ensures that each chunk is an optimal size for processing by language models and retrieval systems. A chunk size that is too large may contain information less relevant to specific queries, while one that is too small may miss necessary context.

Chunk overlap defines the number of tokens that overlap between consecutive chunks. The goal is to ensure continuity and preserve context across chunks, thereby maintaining coherence in the retrieved information.

To perform text splitting efficiently, we use the RecursiveCharacterTextSplitter. This tool recursively splits documents into smaller chunks based on character length, so each chunk adheres to the defined chunk size and overlap parameters. It’s particularly useful for processing large documents, improving the overall accuracy of retrieval.

from langchain.text_splitter import RecursiveCharacterTextSplitter
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120

_LOGGER.info(f"Ingesting {len(docs)} documents into FAISS vector database.")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
chunks = splitter.split_documents(docs)

Foundations: vector database ingestion

To facilitate efficient retrieval of relevant information, we need to ingest our corpus of documents into a vector database. Now that we have broken our documents down into manageable chunks, we use the embedding model to generate vector embeddings for each document chunk.

These embeddings are numerical representations of the semantic content of the chunks. High-quality embeddings enable efficient similarity searches, allowing the system to quickly identify and retrieve the most relevant chunks in response to a user’s query.

The next step is to store the generated embeddings in an in-memory FAISS database, which ensures fast indexing and querying for real-time information retrieval. In this example, we take advantage of the fact that LangChain’s FAISS `from_documents` method conveniently generates the embeddings for the document chunks and also stores them in the FAISS vector store in a single function call.

from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embeddings = NVIDIAEmbeddings(model=RETRIEVER_EMBEDDING_MODEL, truncate="END")
vectordb = FAISS.from_documents(chunks, embeddings)

By following these steps and taking advantage of the embedding model, we ensure that the IT Help Desk agent can efficiently retrieve and process relevant information from the knowledge base.

Foundations: document retriever and reranker

With our vector database populated, we can build a chain for content retrieval. This involves creating a seamless workflow that includes both the embedding step and the lookup step.

Figure 7. A basic retrieval chain consists of an embedding model and a database to store the converted vector embeddings.

In the embedding step, user queries are converted into embeddings using the same model that we previously used for document chunks. This ensures that both the queries and the document chunks are represented in the same semantic space, enabling accurate similarity comparisons.

To initialize the retriever in this example, we’ll use semantic similarity search and return the top six results for our query.

# imports already handled
kb_retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 6})

During the lookup step, the embeddings of the user’s queries are compared against the embeddings stored in the vector database. The system retrieves the most similar document chunks, which are then used to generate responses.
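
As a quick sanity check, the retriever can be queried directly before it is wrapped into a tool. This is an assumed usage example, not part of the workshop code:

# Retrieve the six most semantically similar chunks for an example query
results = kb_retriever.invoke("How do I request VPN access?")
for doc in results:
    print(doc.metadata.get("source"), doc.page_content[:80])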

Figure 8. A more complex retrieval chain attaches a reranking model to reorganize retrieved context and place the most relevant chunks first.

For both the embedding and the reranking models, we’ll use NIM microservices from NVIDIA NeMo Retriever. LangChain allows us to easily create a basic retrieval chain from our vector database object that includes both the embedding step and the lookup step.

To improve the relevance and order of retrieved documents, we can use the NVIDIARerank class, built on the NVIDIA NeMo Retriever reranking model. The reranker evaluates and ranks the retrieved document chunks based on their relevance to the user’s query so that the most pertinent information is presented to the user first. In this example, we initialize the reranker as follows:

from langchain_nvidia_ai_endpoints import NVIDIARerank
reranker = NVIDIARerank(model=RETRIEVER_RERANK_MODEL)

Foundations: retriever tool creation

Taking the document retriever and the document reranker, we can now create the final document retriever as shown below:

from langchain.retrievers import ContextualCompressionRetriever

RETRIEVER = ContextualCompressionRetriever(
    base_retriever=kb_retriever,
    base_compressor=reranker,
)

The LangChain ContextualCompressionRetriever makes it easy to combine a retriever with additional processing steps, attaching the retrieval chain to the reranking model. Now we can create the retriever tool that enables our ReAct agent.

In this example, we can initialize the retriever tool by using the LangChain tools package, passing in our initialized retriever:

from langchain.tools.retriever import create_retriever_tool
RETRIEVER_TOOL = create_retriever_tool(
    retriever=RETRIEVER,
    name="company_llc_it_knowledge_base",
    description=(
        "Search the interior IT knowledge base for Company LLC IT related questions and policies."
    ),
)

Foundations: agent configuration

With our vector database and retriever chain in place, we’re ready to build the agent graph. This agent graph acts as a kind of flowchart, mapping out the possible steps the model can take to perform its task. In traditional, step-by-step LLM applications, these are called “chains.” When the workflow involves more dynamic, non-linear decision-making, we refer to them as “graphs.” The agent can choose different paths based on the context and requirements of the task at hand, branching out into different decision nodes.

Given the prevalence of the ReAct agent architecture, LangGraph provides a prebuilt function that creates ReAct agent graphs. In this example, we use it as follows:

from langgraph.prebuilt import create_react_agent
AGENT = create_react_agent(
    model=llm,
    tools=[RETRIEVER_TOOL],
    prompt=SYSTEM_PROMPT,
)

By constructing an agent graph, we create a dynamic and versatile workflow that allows our IT Help Desk agent to handle complex decision-making processes. This approach ensures that the agent can efficiently retrieve and process information, provide accurate responses, and adapt to a variety of scenarios.
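
Before wiring the agent into a client, you can sanity-check it directly in Python. A minimal, assumed usage sketch:

# Invoke the compiled ReAct graph with a single user message and print the final answer
result = AGENT.invoke(
    {"messages": [{"role": "user", "content": "How do I request a new laptop?"}]}
)
print(result["messages"][-1].content)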

Running your agent

Congratulations! You have successfully built your agent! Now, the next step is to try it out.

To run your agent from your terminal, cd into the code directory that contains the Python file with your agent code. Once there, start your agent API with the LangGraph CLI. Your agent will automatically reload as you make changes and save your code.

To chat with your agent, a simple Streamlit app has been included as the Simple Agents Client. You can also access the Streamlit client from the Jupyter Launcher page. In the sidebar, make sure the rag_agent client is selected and try chatting!

Figure 9. Click on the “Simple Agents Client” tile in NVIDIA DevX Learning Path to open the Streamlit chat application.

As your agents become more sophisticated, managing their internal complexity can become difficult. Tracing helps visualize each step your agent takes, which makes it easier to debug and optimize your agent’s behavior. In the workshop, you can optionally configure the LANGSMITH_API_KEY and view traces on the LangSmith dashboard.
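
If you opt into tracing, the LangSmith environment variables can be set before the agent process starts. A minimal sketch, assuming the variable names from current LangSmith documentation (only LANGSMITH_API_KEY is mentioned in the workshop itself):

import os

# Enable LangSmith tracing for this process (assumed variable names; see LangSmith docs)
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"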

Migrate to local NIM microservices

This workshop uses the nvidia-nemotron-nano-9b-v2 LLM from the NVIDIA API Catalog. These APIs are useful for evaluating many models and for quick experimentation, and getting started is free. However, for the performance and control needed in production, deploy models locally with NVIDIA NIM microservice containers.

In a typical development workflow, both your agent and NIM containers would run in the background, allowing you to multitask and iterate quickly. For this exercise, we’ll run the NIM in the foreground to easily monitor its output and confirm that it starts up properly.

First, you must log in to the NGC container registry as follows:

echo $NVIDIA_API_KEY | \
  docker login nvcr.io \
  --username '$oauthtoken' \
  --password-stdin

The next step is to create a location for NIM containers to save their downloaded model files.

docker volume create nim-cache

Now, we use a Docker run command to pull the NIM container image and model data files before hosting the model behind a local, OpenAI-compatible API.

docker run -it --rm \
    --name nemotron \
    --network workbench \
    --gpus 1 \
    --shm-size=16GB \
    -e NGC_API_KEY=$NVIDIA_API_KEY \
    -v nim-cache:/opt/nim/.cache \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest

After letting it run for a few minutes, you’ll know the NIM is ready for inference when it logs Application startup complete.

INFO 2025-09-10 16:31:52.7 on.py:48] Waiting for application startup.
INFO 2025-09-10 16:31:52.239 on.py:62] Application startup complete.
INFO 2025-09-10 16:31:52.240 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
...
INFO 2025-09-10 16:32:05.957 metrics.py:386] Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 1.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Now that your NIM is running locally, we need to update the agent you created in rag_agent.py to use it:

llm = ChatNVIDIA(
    base_url="http://nemotron:8000/v1",
    model=LLM_MODEL,
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192
)

With your LangGraph server still running, return to the Simple Agents Client and try prompting the agent again. If everything was successful, you should notice no change!

Congratulations! You have now migrated your LangGraph agent to local NIM microservices!

Conclusion and next steps

This workshop provides a comprehensive path from basic concepts to sophisticated agentic systems, emphasizing hands-on learning with production-grade tools and techniques.

By completing this workshop, developers gain practical experience with:

  • Fundamental concepts: Understanding the difference between standard and agentic RAG.
  • State management: Implementing complex state transitions and persistence.
  • Tool integration: Creating and managing agentic tool-calling capabilities.
  • Modern AI stack: Working with LangGraph, NVIDIA NIM, and associated tooling.

Learn more

For hands-on learning, tips, and tricks, watch our Nemotron Labs livestream replay, “Build a RAG Agent with NVIDIA Nemotron”.

Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.


