Measuring Open-Source Llama Nemotron Models on DeepResearch Bench

Contributors: David Austin, Raja Biswas, Gilberto Titericz Junior, NVIDIA

NVIDIA’s AI-Q Blueprint, the leading portable, open deep research agent, recently climbed to the top of the Hugging Face “LLM with Search” leaderboard on DeepResearch Bench. This is a significant step forward for the open-source AI stack, showing that developer-accessible models can power advanced agentic workflows that rival or surpass closed alternatives.

What sets AI-Q apart? It fuses two high-performance open LLMs—Llama 3.3-70B Instruct and Llama-3.3-Nemotron-Super-49B-v1.5—to orchestrate long-context retrieval, agentic reasoning, and robust synthesis.



Core Stack: Model Selections and Technical Innovations

  • Llama 3.3-70B Instruct: Handles fluent, structured report generation. Derived from Meta’s Llama series and openly licensed for unrestricted deployment.
  • Llama-3.3-Nemotron-Super-49B-v1.5: An optimized, reasoning-focused variant. Built via Neural Architecture Search (NAS), knowledge distillation, and successive rounds of supervised and reinforcement learning, it excels at multi-step reasoning, query planning, tool use, and reflection—all with a reduced memory footprint for efficient deployment on standard GPUs.

The AI-Q reference example’s architecture supports parallel, low-latency search over local and web data, making it well suited to use cases that demand privacy, compliance, or on-premises deployment for reduced latency.



Deep Reasoning with Llama Nemotron

NVIDIA Llama Nemotron Super isn’t just a fine-tuned instruct model; it is post-trained for explicit agentic reasoning and supports a reasoning ON/OFF toggle via the system prompt. You can use it as a standard chat LLM or switch to deep, chain-of-thought reasoning for agent pipelines, enabling dynamic, context-sensitive workflows.
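The toggle is just a system-prompt convention, so it is easy to wire into any chat pipeline. A minimal sketch, assuming the "detailed thinking on"/"detailed thinking off" system-prompt convention used across the Nemotron family (check the model card for the exact variant you deploy):

```python
# Sketch: toggling Nemotron reasoning mode via the system prompt.
# The "detailed thinking on"/"detailed thinking off" strings follow the
# Nemotron model-card convention; verify them for your specific checkpoint.

def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build an OpenAI-style chat payload with the reasoning toggle set."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Deep chain-of-thought mode, e.g. for query planning in an agent pipeline:
agent_messages = build_messages("Plan a multi-hop search for topic X.", reasoning=True)

# Standard chat mode for lightweight requests:
chat_messages = build_messages("Summarize this paragraph.", reasoning=False)
```

Because the switch lives in the message list rather than in model weights or server flags, an orchestrator can flip it per request, paying for deep reasoning only on the steps that need it.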

Key highlights:

  • Multi-phase post-training: Combines instruction following, mathematical/programmatic reasoning, and tool-calling skills.
  • Transparent model lineage: Directly traceable from open Meta weights, with additional openness around synthetic data and tuning datasets.
  • Efficiency: 49B parameters with a context window of up to 128K tokens, able to run on a single H100 GPU, keeping inference fast and costs predictable.



Evaluation: Transparency and Robustness in Metrics

One of the core strengths of AI-Q is transparency, not only in outputs but in reasoning traces and intermediate steps. During development, the NVIDIA team leveraged both standard and newer metrics, such as:

  • Hallucination detection: Each factual claim is checked at generation.
  • Multi-source synthesis: Synthesis of new insights from disparate evidence.
  • Citation trustworthiness: Automated assessment of claim-evidence links.
  • RAGAS metrics: Automated scoring of retrieval-augmented generation accuracy.

The architecture lends itself well to granular, stepwise evaluation and debugging, one of the biggest pain points in agentic pipeline development.



Benchmark Results: DeepResearch Bench

DeepResearch Bench evaluates agent stacks on a set of 100+ long-context, real-world research tasks (across science, finance, art, history, software, and more). Unlike traditional QA, the tasks require report-length synthesis and complex multi-hop reasoning:

  • AI-Q achieved an overall score of 40.52 in the LLM with Search category as of August 2025, currently holding the top spot for any fully open-licensed stack.
  • Strongest metrics: comprehensiveness (depth of report), insightfulness (quality of research), and citation quality.


For the Hugging Face Developer Community

  • Both Llama-3.3-Nemotron-Super-49B-v1.5 and Llama 3.3-70B Instruct are available for direct use and download on Hugging Face. Try them in your own pipelines with a few lines of Python, or deploy with vLLM for fast inference and tool-calling support (see the model cards for code and serving examples).
  • Open post-training data, transparent evaluation methods, and permissive licensing enable experimentation and reproducibility.
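For an OpenAI-compatible vLLM deployment, a request can be assembled as below. This is a sketch under stated assumptions: the model identifier and sampling parameters are placeholders, and the "detailed thinking on" system prompt follows the Nemotron model-card convention; adapt both to your deployment.

```python
# Sketch: building a chat-completions payload for a vLLM server hosting
# a Nemotron model. POST the resulting dict as JSON to the server's
# /v1/chat/completions endpoint (model id and parameters are placeholders).

import json

def research_request(question: str, model: str) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "detailed thinking on"},
            {"role": "user", "content": question},
        ],
        "temperature": 0.6,
        "max_tokens": 2048,
    }

payload = research_request(
    "Survey recent work on open deep-research agents.",
    model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
)
body = json.dumps(payload)  # ready to send with any HTTP client
```

Keeping the request construction separate from the transport layer makes it easy to reuse the same payload against a local vLLM instance or a hosted endpoint.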



Takeaways

The open-source ecosystem is rapidly closing the gap, and in some areas leading, on the real-world agent tasks that matter. AI-Q, built on Llama Nemotron, demonstrates that you don’t have to compromise on transparency or control to achieve state-of-the-art results.

Try the stack, or adapt it to your own research agent projects, from Hugging Face or build.nvidia.com.


