How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report


As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they're not just answering simple factual questions; they're tackling "deep research" tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output.

This emerging capability is now being marketed under different brand names by the major labs: OpenAI calls it "Deep Research", Anthropic refers to it as "Extended Thinking", Google's Gemini offers "Search + Pro" features, and Perplexity labels theirs "Pro Search" or "Deep Research". But how effective are these offerings in practice? A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date, and the results reveal both impressive capabilities and significant shortcomings.

What Is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to evaluate AI agents' performance on multi-step, web-based research tasks. These aren't simple questions with straightforward answers; they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories, such as:

  • Find Number: e.g. "How many FDA Class II medical device recalls occurred?"
  • Validate Claim: e.g. “Is ChatGPT 10x more energy-intensive than Google Search?”
  • Compile Dataset: e.g. “Job trends for US software developers from 2019–2023”

Each task type is carefully structured with human-verified answers and evaluated using a frozen dataset of scraped web pages, known as RetroSearch. This ensures consistency across model evaluations, avoiding the fluctuating state of the live web.
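To make that concrete, here is a minimal sketch of how such a frozen evaluation could be wired up. The field names, the `RetroArchive` class, and the keyword-matching search are illustrative assumptions on my part, not the actual DRB schema or harness.

```python
from dataclasses import dataclass


# Illustrative task record; the field names are assumptions, not the real DRB schema.
@dataclass
class ResearchTask:
    task_id: str
    category: str       # e.g. "Find Number", "Validate Claim", "Compile Dataset"
    prompt: str
    human_answer: str   # human-verified reference answer used for scoring


class RetroArchive:
    """Stand-in for a frozen web snapshot: queries hit stored pages, never the live web."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages  # url -> page text, scraped once and never updated

    def search(self, query: str) -> list[str]:
        # Naive keyword match over the archive; a real harness would use a proper index.
        return [url for url, text in self.pages.items() if query.lower() in text.lower()]
```

The point of freezing the pages is simply that every agent queries the same stored content, so score differences reflect the agent rather than day-to-day changes in the live web.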

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for "Reason + Act." This approach mimics how a human researcher might tackle a problem: by thinking through the task, taking an action such as performing a web search, observing the results, and then deciding whether to iterate or conclude.

While earlier models follow this loop explicitly, newer "thinking" models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom-built, static version of the web. Rather than relying on the live web, which constantly changes, agents tap into a curated archive of web pages scraped using tools like Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as "Gather Evidence," RetroSearch can provide access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
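The loop itself is easy to sketch. Below is a minimal, hypothetical ReAct-style driver in Python; `llm_step` and the `archive.search` call stand in for whatever model API and frozen archive a real harness would use.

```python
def react_agent(task_prompt: str, archive, llm_step, max_steps: int = 10) -> str:
    """Minimal Reason + Act loop: think, act (search), observe, then repeat or finish.

    `llm_step` is a placeholder for the model call; it is assumed to return either
    {"action": "search", "query": ...} or {"action": "finish", "answer": ...}.
    """
    transcript = f"Task: {task_prompt}\n"
    for _ in range(max_steps):
        # Reason: the model reads the transcript so far and proposes its next move.
        move = llm_step(transcript)

        if move["action"] == "finish":
            return move["answer"]

        # Act: run the search against the frozen archive rather than the live web.
        results = archive.search(move["query"])

        # Observe: fold the results back into the context for the next reasoning step.
        transcript += f"\nSearched: {move['query']}\nTop results: {results[:5]}\n"

    return "No answer reached within the step budget."
```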

Which AI Agents Perform Best?

Among all the contenders, OpenAI's o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that may sound modest, it's important to understand the benchmark's difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8, which the researchers call the "noise ceiling." In other words, even the best models today still fall short of well-informed, methodical human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, showing strong performance across nearly all task types. Claude 3.7 Sonnet from Anthropic followed closely, demonstrating versatility in both its "thinking" and "non-thinking" modes. Gemini 2.5 Pro, Google's flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise, keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer, "thinking-enabled" models consistently outperformed their earlier counterparts, and closed-source models maintained a notable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating issues I've personally encountered, especially during long research or content creation sessions, is when an AI agent simply forgets what we were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly the responses feel disjointed or aimless. At some point, I've learned it's often better to cut my losses and start from scratch, even if it means throwing away everything that's been generated so far.

That sort of forgetfulness isn't just anecdotal; it's the most significant predictor of failure in the Deep Research Bench evaluation. But it's not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily keyword-matching instead of thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions, delivering a half-formed answer that technically checks the box but falls short of real insight.
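Some of these failure modes are easy to guard against at the harness level. The check below is a small illustrative sketch of my own, not something described in the report: it simply refuses to rerun a query the agent has already issued, forcing it to rephrase or conclude.

```python
def is_new_query(query: str, seen: set[str]) -> bool:
    """Return True if this search hasn't been run yet; False if the agent is looping."""
    normalized = " ".join(query.lower().split())
    if normalized in seen:
        return False  # identical search already ran; the agent should rephrase or wrap up
    seen.add(normalized)
    return True
```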

Even among the top models, the differences are stark. GPT-4 Turbo, for instance, showed a notable tendency to forget prior steps, while DeepSeek-R1 was more prone to hallucinate or invent plausible-sounding but incorrect information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who's relied on AI for serious work, these issues will feel all too familiar, and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls "toolless" agents: language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they learned during training. In practice, this means they can't look anything up or verify information; they're guessing based on what they "remember."
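In harness terms, a toolless run is just the same question with no action loop at all: one prompt in, one answer out. The sketch below is an assumption about how such a condition could be wired, not the report's actual setup; `llm_complete` is a placeholder for any plain completion call.

```python
def toolless_agent(task_prompt: str, llm_complete) -> str:
    """One prompt in, one answer out: no search loop, no retrieval, no tools."""
    return llm_complete(
        "Answer the following research question using only what you already know. "
        "Assume you cannot look anything up.\n\n" + task_prompt
    )
```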

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For instance, on the Validate Claim task, where the goal is to assess the plausibility of a statement, they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truthfulness of common claims without needing to search the web.

But on more demanding tasks, like Derive Number, which requires piecing together multiple values from various sources, or Gather Evidence, which depends on finding and evaluating diverse facts in context, these toolless models completely fell apart. Without fresh information or real-time lookup capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today's LLMs can simulate "knowing" a great deal, deep research depends not only on recall, but on reasoning with up-to-date, verifiable information, something only tool-augmented agents can truly deliver.

Final Thoughts

The DRB report makes one thing clear: while today's best AI agents can outpace average humans on narrowly defined tasks, they still lag behind expert generalist researchers, especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially obvious during long or complex sessions, something I've experienced firsthand, where an agent gradually loses track of the task's purpose, leading to a frustrating breakdown in coherence and utility.

What makes Deep Research Bench so valuable is that it doesn't just test surface-level knowledge; it probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8K.

As LLMs continue to integrate into serious knowledge work, FutureSearch tools like DRB will be essential for assessing not just what these systems know, but how well they actually work.
