Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline

We’re thrilled to announce that NVIDIA NeMo Retriever team has developed a brand new agentic retrieval pipeline that has officially secured the #1 spot on the ViDoRe v3 pipeline leaderboard. As well as, this very same pipeline architecture achieved the #2 spot on the highly demanding, reasoning-intensive BRIGHT leaderboard.

Within the rapidly evolving landscape of AI retrieval, many solutions are highly specialized, engineered to perform exceptionally well on specific, narrow tasks. Nonetheless, real-world enterprise applications rarely have the luxurious of perfectly curated, single-domain data. They require systems that may seamlessly adapt to a wide selection of challenges—from parsing complex visual layouts to executing deep logical reasoning.

That’s the reason we prioritized generalizability in our design. Somewhat than counting on dataset-specific heuristics, we built an agentic pipeline that dynamically adapts its search and reasoning technique to the info at hand. This permits us to deliver state-of-the-art performance across vastly different benchmarks without requiring any underlying architectural changes.

Here’s a take a look at how we built it.

The Motivation: Why Semantic Similarity Is not Enough

For years, dense retrieval based on semantic similarity has been the usual for locating information. Nonetheless, because the applications of retrieval expand, finding relevant documents goes beyond semantic similarity alone. Complex document search requires reasoning skills, an understanding of real-world systems, and iterative exploration.

There may be a fundamental gap: LLMs are great at pondering and reasoning but cannot process hundreds of thousands of documents directly. Conversely, retrievers can easily sift through hundreds of thousands of documents but possess limited reasoning skills. Agentic retrieval bridges this gap by creating an lively, iterative loop between the LLM and the retriever.

How It Works: The Agentic Loop

Agentic retrieval pipeline overview

Our agentic retrieval pipeline relies on a ReACT architecture. As an alternative of a single “one-and-done” query, the agent iteratively searches, evaluates, and refines its approach.

The agent utilizes built-in tools like think to plan its approach and final_results to output the precise documents needed for a given query, alongside a retrieve (query, top_k) tool to explore the corpus. Through this loop, we observed successful search patterns emerge naturally:

Generating higher queries: The agent dynamically adjusts its search queries based on newly discovered information.
Persistent rephrasing: It continually rephrases queries until useful information is found.
Breaking down complexity: It translates complex, multi-part queries into multiple simpler queries with clear goals.

Finally, to synthesize the iterative discoveries, the agent calls a final_results tool to output probably the most relevant documents, ranked by their relevance to a given query. As a security net—for instance, when the agent hits the utmost variety of steps or the context length limit—the pipeline falls back to Reciprocal Rank Fusion (RRF), which scores documents based on their ranks across all retrieval attempts within the agent trajectory.

Engineering for Speed and Scale

Agentic workflows are notoriously slow and resource-intensive. To make this pipeline viable for leaderboard-scale evaluation, we needed to rethink how the LLM agent and the retriever communicate.

Initially, the retriever was exposed to the agent via a Model Context Protocol (MCP) server—a natural alternative, since MCP is designed precisely for giving LLMs access to external tools. But in practice, this architecture imposed a compounding tax on experiment velocity. Every run required spinning up a separate MCP server, loading the precise dataset corpus into GPU memory, and orchestrating the lifecycle of each client and server. Each of those steps was a possibility for silent misconfiguration or a server freeze under high-volume requests. The network round-trips added latency to each retrieval call, and the general cognitive burden of managing the two-process setup made it significantly harder for other teams to adopt and iterate on the pipeline.

To resolve this, we replaced the MCP server with a thread-safe singleton retriever that lives in-process. The singleton loads the model and corpus embeddings once, protects all access with a reentrant lock, and exposes the identical retrieve() interface to arbitrarily many concurrent agent tasks—achieving the important thing good thing about an MCP server (secure, shared access to a GPU-resident retriever from multiple threads) without incurring network serialization overhead or requiring a separate server process. This single architectural change eliminated a whole class of deployment errors and dramatically improved each GPU utilization and experiment throughput.

Generalization vs. Specialization Across Benchmarks

A typical commentary in modern retrieval evaluation is that solutions highly optimized for one specific kind of task often experience a performance gap when applied to a very different domain.

Pipeline	ViDoRe v3
NeMo Agentic Retrieval (Opus 4.5 + nemotron-colembed-vl-8b-v2)	69.22 (#1)
Dense retrieval (nemotron-colembed-vl-8b-v2)	64.36
INF-X-Retriever (INF-Query-Aligner + nemotron-colembed-vl-8b-v2)	62.31
INF-X-Retriever	51.01

Pipeline	BRIGHT
INF-X-Retriever	63.40 (#1)
NeMo Agentic Retrieval (Opus 4.5 + nemotron-reasoning-3b)	50.90 (#2)

We placed #2 on the reasoning-intensive BRIGHT leaderboard with an NDCG@10 of fifty.90. The #1 solution on that leaderboard, INF-X-Retriever, achieves a formidable 63.40. Nonetheless, to check cross-domain adaptability, we benchmarked the INF-X pipeline (coupled with the identical nemotron-colembed-vl-8b-v2 embedding model utilized in our agentic pipeline) on ViDoRe v3, a dataset specializing in visually wealthy and diverse enterprise documents. On this different task, its performance landed at an NDCG@10 of 62.31, lower than the dense retrieval rating 64.36. In other words, INF-Query-Aligner doesn’t improve over the dense retrieval baseline on ViDoRe v3.

In contrast, our same agentic pipeline (pairing Opus 4.5 with nemotron-colembed-vl-8b-v2) achieved the #1 spot on ViDoRe v3 with a rating of 69.22.

This highlights a core strength of our approach: generalizability. Somewhat than counting on dataset-specific heuristics or a query-rewriter/aligner, our agentic loop naturally adapts its technique to the dataset at hand, whether it requires multi-step logical reasoning or parsing complex visual layouts.

Ablation Studies: Open vs. Closed Models

ViDoRe v3

Agent	Embedding Model	NDCG @10	Average sec/query	Total input tokens (M)	Total output tokens (M)	Average retrieval calls
Opus 4.5	nemotron-colembed-vl-8b-v2	69.22	136.3	1837	15	9.2
gpt-oss-120b	nemotron-colembed-vl-8b-v2	66.38	78.6	1860	13	2.4
gpt-oss-120b	llama-nemotron-embed-vl-1b-v2	62.42	78.1	1459	13	2.5
–	nemotron-colembed-vl-8b-v2	64.36	0.67	–	–	–
–	llama-nemotron-embed-vl-1b-v2	55.83	0.02	–	–	–

BRIGHT

Agent	Embedding Model	NDCG @10	Average sec/query	Total input tokens (M)	Total output tokens (M)	Average retrieval calls
Opus 4.5	llama-embed-nemotron-reasoning-3b	50.79	148.2	1251	11	11.8
gpt-oss-120b	llama-embed-nemotron-reasoning-3b	41.27	92.8	1546	11	4.5
gpt-oss-120b	llama-nemotron-embed-vl-1b-v2	33.85	139.1	1516	12	6.6
–	llama-embed-nemotron-reasoning-3b	38.28	0.11	–	–	–
–	llama-nemotron-embed-vl-1b-v2	19.56	0.08	–	–	–

We conducted extensive ablations to grasp the tradeoff between frontier closed models and open-weights alternatives:

Model Selection: On ViDoRe v3, swapping Opus 4.5 for the open gpt-oss-120b resulted in a small accuracy drop (69.22 to 66.38 NDCG@10) and makes much fewer retrieval calls. On BRIGHT, the gap was wider, indicating that deeper reasoning tasks still profit heavily from frontier models like Opus 4.5.
Embeddings: Pairing the agent with specialized embeddings (nemotron-colembed-vl-8b-v2 for ViDoRe and llama-embed-nemotron-reasoning-3b for BRIGHT) yielded the perfect results, proving that a powerful baseline retriever provides the next ceiling for the agent to achieve.
It is also interesting to notice that the agent can close the gap between stronger and weaker embedding models. For instance, on ViDoRe, the gap between the stronger nemotron-colembed-vl-8b-v2 and the weaker llama-nemotron-embed-vl-1b-v2 is about 8.5 in dense retrieval, but when coupled with gpt-oss-120b agent, the gap shrinks to about 4. Similarly, llama-embed-nemotron-reasoning-3b is about 19 points higher than llama-nemotron-embed-vl-1b-v2 on BRIGHT, however the lead shrinks to about 7.5 when coupled with gpt-oss-120b agent.

The Cost of Autonomy and What’s Next

There isn’t a free lunch. Agentic retrieval is dearer and slower than standard dense retrieval. our ViDoRe v3 results, the agent averages 136 seconds per query and consumes roughly 760k input and 6.3k output tokens per query. (Note: the sequential latency is measured on a single A100 GPU with a single concurrent Claude API call—i.e., not searching multiple queries at the identical time—to reflect true search time in real-world use cases).

Nonetheless, we consider agentic retrieval is a highly viable approach for high-stakes, complex queries. Our immediate next steps give attention to cost reduction: we’re actively researching ways to distill these agentic reasoning patterns into smaller, specialized open-weight agents. By fine-tuning smaller models to orchestrate the think and retrieve loop natively, we aim to deliver Opus-level accuracy at a fraction of the latency and value.

Construct Your Own Agentic Pipeline

While our leaderboard-topping runs explored combos like Claude Opus and gpt-oss alongside our research embedding models, the true strength of this architecture is its modularity. For production-ready deployments, we highly encourage you to try pairing your agent of alternative with our robust business embedding model llama-nemotron-embed-vl-1b-v2. To explore these models, dive into the tools, and begin constructing your personal highly generalizable retrieval workflows, visit the NeMo Retriever library today.

Source link

Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline

The Motivation: Why Semantic Similarity Is not Enough

How It Works: The Agentic Loop

Engineering for Speed and Scale

Generalization vs. Specialization Across Benchmarks

Ablation Studies: Open vs. Closed Models

The Cost of Autonomy and What’s Next

Construct Your Own Agentic Pipeline

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

DenseNet Paper Walkthrough: All Connected

I Replaced Vector DBs with Google’s Memory Agent Pattern for my notes in Obsidian

AI just made the billion-dollar solo founder real

Bringing AI Closer to the Edge and On-Device with Gemma 4

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline

The Motivation: Why Semantic Similarity Is not Enough

How It Works: The Agentic Loop

Engineering for Speed and Scale

Generalization vs. Specialization Across Benchmarks

Ablation Studies: Open vs. Closed Models

The Cost of Autonomy and What’s Next

Construct Your Own Agentic Pipeline

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.