Contributors: Raja Biswas, Divyansh Jain, Ivan Sorokin, Alessio Devoto, Chantal D Gama Rose, Ajay Thorve, David Austin, Jean-Francois Puget
NVIDIA AI-Q deep research agent recently achieved first place on both DeepResearch Bench (55.95) and DeepResearch Bench II (54.50), the two primary benchmarks for evaluating deep research agents. This marks a meaningful step for open, portable deep research. One configurable stack leading on both shows that developer-accessible models and tooling can power state-of-the-art agentic research.
What sets AI-Q apart? AI-Q is an open blueprint for building AI agents that reason over enterprise and web data to deliver well-cited responses. AI-Q provides a completely open and modular architecture that enterprises can own, inspect, customize, and configure per use case. The deep researcher is one workflow inside the larger AI-Q blueprint, which also features intent routing, query clarification, and shallow research. The deep researcher adopts a multi-agent architecture consisting of a planner, researcher, and orchestrator built on NVIDIA NeMo Agent Toolkit and fine-tuned NVIDIA Nemotron 3 Super models, with an optional ensemble and report refiner for maximum report quality. One stack – flexible by design, tunable to your needs.
Why Winning Both Benchmarks Matters
DeepResearch Bench and DeepResearch Bench II evaluate research agents in complementary ways.
- DeepResearch Bench scores report quality against a reference report along comprehensiveness, depth of insight, instruction-following, and readability dimensions. Doing well here rewards polished, well-structured narratives and strong synthesis.
- DeepResearch Bench II uses 70+ fine-grained, binary rubrics per task to assess whether an agent retrieves the right information (Information Recall), synthesizes it into higher-level analysis (Evaluation), and presents findings clearly (Presentation). Doing well here rewards granular factual correctness and analytical rigor.
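To make the rubric mechanics concrete, here is a minimal sketch of how binary-rubric scoring works in principle: each task carries many pass/fail rubrics grouped into dimensions, and the score is the fraction satisfied. The dimension names and grouping below are illustrative, not taken from the benchmark itself.

```python
# Illustrative sketch of binary-rubric scoring: each task has many
# pass/fail rubrics grouped into dimensions; the task score is the
# fraction satisfied, with a per-dimension breakdown.
from collections import defaultdict

def score_rubrics(results):
    """results: list of (dimension, passed) tuples for one task."""
    per_dim = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dim, passed in results:
        per_dim[dim][0] += int(passed)
        per_dim[dim][1] += 1
    overall = (sum(p for p, _ in per_dim.values())
               / sum(t for _, t in per_dim.values()))
    return overall, {d: p / t for d, (p, t) in per_dim.items()}

# Hypothetical rubric outcomes for one task.
results = [
    ("information_recall", True), ("information_recall", False),
    ("evaluation", True), ("presentation", True),
]
overall, by_dim = score_rubrics(results)  # overall = 0.75
```

Scoring many small binary checks, rather than one holistic grade, is what makes this benchmark sensitive to granular factual correctness.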
Leading on both benchmarks means the AI-Q deep researcher produces polished, well-cited reports and gets the underlying retrieval and reasoning right.
Architecture at a Glance
The AI-Q deep researcher architecture behind both results centers on three components: an orchestrator that coordinates the research loop, a planner that maps the data landscape and designs an evidence-grounded research plan, and a researcher that dispatches parallel specialists to collect and synthesize evidence across multiple analytical lenses. Each agent can be powered by a different LLM. An optional ensemble runs multiple agents in parallel and merges their outputs for maximum report quality and data coverage. Figure 1 shows the complete architecture.
Figure 1. AI-Q deep researcher: orchestrator, planner, and researcher pipeline (right) with optional ensemble (left).
Core Stack: NVIDIA and Deep Research
The same underlying stack powers both leaderboard submissions: open, reproducible, and built on:
- NVIDIA NeMo Agent Toolkit for workflow wiring, function registration, and evaluation. The NeMo Agent Toolkit open source library provides config-driven composition of LLMs and tools and the ability to plug in different agent graphs.
- LangChain DeepAgents for the multi-phase planner–researcher–orchestrator flow with subagent middleware where applicable.
- NVIDIA Nemotron 3 LLMs powering the agent pipeline. Nemotron models can be fine-tuned to excel at research synthesis and long-horizon tool calling, and can be served via NVIDIA Build or NVIDIA NIM for model inference.
The core is always multi-step research (plan → gather → synthesize), web search (Tavily) and academic paper search (Serper), and citation-backed reports. Optionally, an ensemble layer and report refiner can be added on top for maximum report quality.
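The plan → gather → synthesize core can be sketched as a simple loop. This is a minimal illustration, not the blueprint's actual API: `plan`, `search`, and `synthesize` are hypothetical placeholders standing in for LLM and tool calls.

```python
# Minimal sketch of the plan -> gather -> synthesize loop.
# plan(), search(), and synthesize() are hypothetical placeholders
# for LLM/tool calls, not the blueprint's actual interfaces.
def deep_research(question, plan, search, synthesize, max_rounds=3):
    research_plan = plan(question)           # evidence-grounded plan
    evidence = []
    for task in research_plan[:max_rounds]:  # one focused task per round
        evidence.append(search(task))        # web / academic search
    return synthesize(question, evidence)    # citation-backed report

# Toy stand-ins so the loop shape is visible end to end.
report = deep_research(
    "example question",
    plan=lambda q: ["subtask A", "subtask B"],
    search=lambda t: f"evidence for {t}",
    synthesize=lambda q, ev: " | ".join(ev),
)
```

The real system wraps each of these steps in its own agent, but the control flow follows this same shape.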
Key Ingredients in AI-Q
Four ingredients were central to the result:
- Multi-agent architecture with evidence-grounded planning and specialist researchers, built on NVIDIA NeMo Agent Toolkit and LangChain DeepAgents.
- Fine-tuned NVIDIA Nemotron 3 Super: roughly 67k SFT trajectories from a few seed datasets of research questions, filtered with a principle-based judge. This model powers the researcher and its sub-agents.
- Custom middleware for long-horizon reliability. NeMo Agent Toolkit and LangChain middleware are extended with components that improve reliability and robustness.
- Ensemble researcher and report refiner (optional): parallel pipeline outputs merged by an LLM, with a post-hoc refiner for maximum report quality.
Each is detailed in the sections that follow.
Fine-Tuned NVIDIA Nemotron 3 Super: Data and Training
A major component in these results is a custom fine-tuned NVIDIA Nemotron-3-Super-120B-A12B model. We selected it for this workflow because it aligns well with multi-step agentic reasoning, tool use, and citation-grounded reporting; fine-tuning on real search-and-synthesis trajectories makes it effective for planner, researcher, and orchestrator roles at scale.
Trajectory generation
- We collected research questions from multiple open-source datasets: about 17k questions from OpenScholar, 21k from ResearchQA, and 2,457 from Fathom-DeepResearch-SFT.
- We then generated ~80k trajectories for the full workflow using the open-source GPT-OSS-120B model. Each trajectory covers planner, researcher, and orchestrator behavior.
- It’s worth noting that these trajectories include real web search results from the Tavily and Serper APIs, so the model learns to navigate and perform multi-step searches and synthesis on real data.
Principle-based filtering
- Many of the trajectories didn’t complete in time or were stopped for exceeding the tool-call limit; for those that did produce the expected results, we additionally applied filtering with a judge model.
- The finished trajectories were scored with the nvidia/Qwen3-Nemotron-32B-GenRM-Principle judge model, which predicts quality along dimensions such as comprehensiveness, readability, accuracy, and relevance.
- After filtering, ~67k trajectories were retained for training.
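The filtering stage above can be sketched as a two-gate pipeline: drop unfinished trajectories, then keep only those whose judge score clears a threshold. The `judge` callable and the 0.7 threshold below are illustrative assumptions, not the values used in training.

```python
# Sketch of principle-based trajectory filtering: keep only finished
# trajectories whose judge score clears a threshold. The judge call
# and the 0.7 threshold are illustrative assumptions.
def filter_trajectories(trajectories, judge, threshold=0.7):
    kept = []
    for traj in trajectories:
        if not traj["finished"]:   # timed out or hit the tool-call cap
            continue
        score = judge(traj)        # e.g. mean over quality dimensions
        if score >= threshold:
            kept.append(traj)
    return kept

trajs = [
    {"finished": True,  "quality": 0.9},
    {"finished": False, "quality": 0.95},  # dropped: never completed
    {"finished": True,  "quality": 0.4},   # dropped: low judge score
]
kept = filter_trajectories(trajs, judge=lambda t: t["quality"])  # 1 kept
```

Filtering on both completion and judged quality is what shrinks ~80k generated trajectories to the ~67k used for SFT.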
SFT training
- Model: NVIDIA Nemotron-3-Super-120B-A12B
- Setup: One epoch, 5,615 steps, roughly 25 hours on 16×8 NVIDIA H100 GPUs.
AI-Q Deep Researcher
AI-Q deep researcher adopts a multi-agent architecture (Orchestrator, Planner, and Researcher) with iterative plan → gather → synthesize loops, citation management, and custom middleware for long-horizon reliability. An optional ensemble and report refiner layer can be enabled for maximum report quality. The multi-agent design also serves as a long-context strategy: each subagent works within its own context window and returns only its synthesized output, so the orchestrator never sees the raw tool responses. This keeps the orchestrator’s context focused and prevents long, noisy search results from degrading its reasoning.
Orchestrator: Coordinates the full research loop. It calls the Planner to produce an evidence-grounded research plan, then calls the Researcher multiple times with focused research tasks derived from that plan. After research completes, the orchestrator reviews the plan’s quality constraints, dispatches targeted gap-filling research, and writes the long-form report. An optional refiner step makes edits to the report using raw researcher briefs in a fresh context window – a second evidence-recovery point.
Planner: Runs in two phases. A Scout subagent first maps the data landscape through broad searches. An Architect subagent then designs the research plan including report outline, targeted search queries, and quality constraints, while running its own searches to validate structural decisions.
Evidence-grounded planning is essential to producing reliable, high-quality reports. Our planner knows the data landscape before it commits to a structure. It decides where to go deep and broad based on what it actually found, not assumptions.
Researcher: Dispatches multiple specialist subagents in parallel, each with a definite lens:
- Evidence Gatherer: facts, statistics, specific numbers from authoritative sources
- Mechanism Explorer: causal explanations, theoretical frameworks
- Comparator: benchmarks, head-to-head data, trade-off analyses
- Critic: counterarguments, limitations, failure cases
- Horizon Scanner: recent developments, emerging trends
They share the same search tools, but with different analytical framing. Diverse specialists researching the same topic often surface evidence that a single generalist would miss.
The researcher synthesizes specialist findings into a unified, cited brief. An LLM then cross-checks this synthesis against the raw specialist outputs in a fresh context window, recovering any relevant information the synthesis missed.
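The parallel specialist dispatch can be sketched with a thread pool: each subagent receives the same topic under a different analytical framing and returns only its synthesized findings. The framing strings and `run_specialist` below are hypothetical placeholders for the real LLM-plus-search subagents.

```python
# Sketch of the researcher's parallel specialist dispatch: the same
# topic goes to each lens; only synthesized findings come back.
# run_specialist() is a hypothetical stand-in for an LLM + search call.
from concurrent.futures import ThreadPoolExecutor

LENSES = {
    "evidence_gatherer": "facts, statistics, specific numbers",
    "mechanism_explorer": "causal explanations, frameworks",
    "comparator": "benchmarks, trade-off analyses",
    "critic": "counterarguments, limitations",
    "horizon_scanner": "recent developments, trends",
}

def run_specialist(name, framing, topic):
    # Stand-in for a subagent that searches and synthesizes in its
    # own context window, returning only a short brief.
    return f"[{name}] {topic}: {framing}"

def research(topic):
    with ThreadPoolExecutor(max_workers=len(LENSES)) as pool:
        futures = {
            name: pool.submit(run_specialist, name, framing, topic)
            for name, framing in LENSES.items()
        }
        return {name: f.result() for name, f in futures.items()}

briefs = research("quantum error correction")  # one brief per lens
```

Because each specialist runs in its own context and returns only a brief, the downstream synthesis step never has to wade through raw search output.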
Config-Driven Flexibility
Every component is swappable. LLMs, tools, and agent graphs can be configured through YAML. The planner, researcher, and orchestrator can each be powered by a different LLM. For the benchmark submission, a fine-tuned Nemotron 3 drives the researcher, which processes 4x more tokens than the planner and orchestrator combined.
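As a rough illustration of what config-driven composition looks like, consider the hypothetical YAML below. The keys, section names, and model identifiers are invented for this sketch; the actual NeMo Agent Toolkit schema differs.

```yaml
# Hypothetical sketch of per-agent LLM and tool wiring; the real
# NeMo Agent Toolkit config schema and model names may differ.
llms:
  planner_llm:
    model_name: nemotron-3-super
  researcher_llm:
    model_name: nemotron-3-super-ft   # fine-tuned researcher model
workflow:
  orchestrator:
    llm: planner_llm
  researcher:
    llm: researcher_llm
    tools: [tavily_search, serper_search]
```

The point is that swapping a model or tool is a config edit, not a code change.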
Custom Middleware for Long-Horizon Reliability
Each agent and subagent interleaves LLM and tool calls across many steps (often 32+). At that scale, the system can fail in ways that short interactions never expose. Our agent harness provides custom middleware to handle and mitigate these failures:
- Tool name sanitization: LLMs may hallucinate tool names mid-run. This middleware applies pattern-based cleansing, alias resolution, and fuzzy matching to recover the intended tool.
- Reasoning-aware retry: LLMs with reasoning sometimes produce thinking tokens with no tool call or final response, which can silently terminate the agent loop. Middleware detects this, preserves the reasoning in context, and retries.
- Budget enforcement: Each agent and subagent has its own tool-call cap. When the limit is reached, middleware nudges the LLM to synthesize first, then removes tools entirely to force a text-only response.
- Report validation: Before returning output, middleware checks minimum length and section structure. Incomplete reports get retried with a continuation prompt.
Each middleware addresses failure patterns observed in agent traces. Together they keep long-horizon runs reliable.
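The tool-name sanitization step can be sketched with the standard library alone: normalize the raw name, resolve known aliases, then fall back to fuzzy matching. The tool names, aliases, and 0.6 cutoff below are illustrative assumptions, not the production values.

```python
# Sketch of tool-name sanitization middleware: pattern-based
# cleansing, alias resolution, then fuzzy matching as a fallback.
# Tool names, aliases, and the cutoff are illustrative assumptions.
import difflib
import re

TOOLS = ["tavily_search", "serper_search", "write_report"]
ALIASES = {"web_search": "tavily_search", "paper_search": "serper_search"}

def resolve_tool(name):
    # Normalize case, separators, and stray punctuation.
    cleaned = re.sub(r"[^a-z0-9_]", "",
                     name.strip().lower().replace("-", "_"))
    if cleaned in TOOLS:
        return cleaned
    if cleaned in ALIASES:                   # known alias
        return ALIASES[cleaned]
    # Fuzzy-match near-miss hallucinations like "tavly_search".
    match = difflib.get_close_matches(cleaned, TOOLS, n=1, cutoff=0.6)
    return match[0] if match else None       # None -> reject the call

resolved = resolve_tool("Tavily-Search!")    # -> "tavily_search"
```

Recovering the intended tool this way turns a run-killing hallucination into a harmless normalization step.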
Ensemble
When enabled, N independent deep-research pipelines run in parallel. An LLM reads all N outputs, selects one as the structural base, and integrates unique content from the others. The ensemble produces broader evidence coverage than any single pipeline, directly improving comprehensiveness and information recall. A proofread pass removes process artifacts so the output reads as a single-authored work.
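The ensemble step can be sketched as: run N pipelines concurrently, pick one output as the structural base, and hand the base plus the rest to a merge step. `pipeline`, `merge_llm`, and the pick-the-longest heuristic are hypothetical stand-ins for the LLM-driven selection and integration.

```python
# Sketch of the ensemble: N independent pipelines run in parallel;
# one output becomes the structural base and unique content from the
# others is merged in. merge_llm() is a hypothetical stand-in for
# the LLM merge call; picking the longest report is an illustrative
# base-selection heuristic.
from concurrent.futures import ThreadPoolExecutor

def ensemble(question, pipeline, merge_llm, n=3):
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(pipeline, question) for _ in range(n)]
        reports = [f.result() for f in futures]
    base = max(reports, key=len)             # choose the fullest report
    others = [r for r in reports if r is not base]
    return merge_llm(base, others)           # integrate unique content

# Toy stand-ins to show the control flow.
merged = ensemble(
    "example question",
    pipeline=lambda q: f"report on {q}",
    merge_llm=lambda base, others: base,
)
```

In the real system, base selection and merging are both done by an LLM reading all N reports, followed by the proofread pass described above.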
Post-hoc Refiner
An optional final report refiner step can run over the report with structured instructions to quantify vague claims, deepen entity coverage, cut scaffolding, ground risks, build comparison tables, and strengthen causal reasoning. The rewriting prompt is derived via self-supervised meta-learning against reference reports generated from our pipeline with frontier LLMs only.
Takeaways
NVIDIA AI-Q reached first place on both DeepResearch Bench and DeepResearch Bench II with a single stack: a multi-agent deep researcher built on NVIDIA NeMo Agent Toolkit, fine-tuned NVIDIA Nemotron 3 models, and custom middleware, with an optional ensemble and refiner when maximum report quality is required. The stack is open, reproducible, and configurable to your needs. State-of-the-art results without compromising on transparency or control.
Join us at NVIDIA GTC in San Jose the week of March 16, 2026 to learn more.
- S81706 – Evaluation-Driven Development: Best Practices for Constructing Reliable Agents
- DLIT81725 – Develop Production Agents with Eval-Driven Design (Dhruv Nandakumar)
- S81570 – From Data to Decisions: Enabling AI Agents with Business Knowledge
- S81569 – Self-Coding Agents: Architectures, Data Flywheels, and Autonomous Code Repair
- S81789 – Open Source AI Shaping the Next Era of Intelligent Digital Employees

