Yesterday, OpenAI released Deep Research, a system that browses the web to summarize content and answer questions based on that summary. The system is impressive and blew our minds when we tried it for the first time.
One of the main results in the blog post is a striking improvement in performance on the General AI Assistants benchmark (GAIA), a benchmark we've also been playing with recently, where they reached nearly 67% correct answers on 1-shot on average, and 47.6% on the especially difficult “level 3” questions that involve multiple steps of reasoning and tool usage (see below for a presentation of GAIA).
Deep Research consists of an LLM (which can be chosen from the current list of LLMs provided by OpenAI: 4o, o1, o3, etc.) and an internal “agentic framework” which guides the LLM to use tools like web search and to organize its actions in steps.
While powerful LLMs are now freely available in open source (see e.g. the recent DeepSeek R1 model), OpenAI didn’t disclose much about the agentic framework underlying Deep Research…
So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!
The clock is ticking, let’s go! ⏱️
What are agent frameworks and why do they matter?
An agent framework is a layer on top of an LLM that makes the LLM execute actions (like browsing the web or reading PDF documents) and organizes its operations in a series of steps.
For a quick intro to agents, check out this great interview by Andrew Ng and our introduction blog post to the smolagents library. For a more detailed dive into agents you can subscribe to our agents course that starts in just a few days: link here.
Almost everyone has already experienced how powerful LLMs can be simply by playing with chatbots. However, what not everyone is aware of yet is that integrating these LLMs into agentic systems can give them real superpowers!
Here is a recent example comparing the performance of a few frontier LLMs with and without an agentic framework (in this case the simple smolagents library) – using an agentic framework bumps performance by as much as 60 points!
In fact, OpenAI also highlighted in its release blog post how Deep Research performed dramatically better than standalone LLMs on the knowledge-intensive “Humanity’s Last Exam” benchmark.
So, what happens when we integrate our current top LLM into an agentic framework, to work toward an open Deep Research?
A quick note: we’ll benchmark our results on the same GAIA challenge, but keep in mind that this is a work in progress. Deep Research is a massive achievement and its open reproduction will take time. In particular, full parity would require improved browser use and interaction like OpenAI Operator is providing, i.e. going beyond the text-only web interaction we explore in this first step.
Let’s first understand the scope of the challenge: GAIA.
The GAIA benchmark
GAIA is arguably the most comprehensive benchmark for agents. Its questions are very difficult and hit on many challenges of LLM-based systems. Here is an example of a hard question:
Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.
You can see this question involves several challenges:
- Answering in a constrained format,
- Using multimodal capabilities (to extract the fruits from the image),
- Gathering several pieces of information, some depending on others:
  - Identifying the fruits in the image
  - Finding which ocean liner was used as a floating prop for “The Last Voyage”
  - Finding the October 1949 breakfast menu for the above ocean liner
- Chaining together a problem-solving trajectory in the right order.
Solving this requires both high-level planning abilities and rigorous execution, which are two areas where LLMs struggle when used alone.
So it’s an excellent test set for agent systems!
On GAIA’s public leaderboard, GPT-4 does not even reach 7% on the validation set when used without any agentic setup. On the other side of the spectrum, with Deep Research, OpenAI reached a 67.36% score on the validation set, so an order of magnitude better! (Though we don’t know how they would actually fare on the private test set.)
Let’s see if we can do better with open-source tools!
Building an open Deep Research
Using a CodeAgent
The first improvement over traditional AI agent systems we’ll tackle is to use a so-called “code agent”. As shown by Wang et al. (2024), letting the agent express its actions in code has several advantages, most notably that code is specifically designed to express complex sequences of actions.
Consider this example given by Wang et al.:
This highlights several advantages of using code:
- Code actions are much more concise than JSON.
  - Need to run 4 parallel streams of 5 consecutive actions? In JSON, you would need to generate 20 JSON blobs, each in its own separate step; in code it’s just 1 step (see the sketch after this list).
  - On average, the paper shows that code actions require 30% fewer steps than JSON, which amounts to an equivalent reduction in generated tokens. Since LLM calls are often the dominant cost of agent systems, your agent system runs ~30% cheaper.
- Code enables re-using tools from common libraries.
- Better performance in benchmarks, due to two reasons:
  - A more intuitive way to express actions
  - Extensive exposure of LLMs to code in training
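To make the contrast concrete, here is a minimal sketch of a single code action next to its JSON-style equivalent. The `web_search` tool name and the queries are purely illustrative, not the API of any specific library; in a code agent, tools like this are injected as callables into the execution environment:

```python
# Hypothetical code action: the agent emits this whole block as ONE step.
# `web_search` is an illustrative tool name injected by the agent framework.
queries = [
    "film 'The Last Voyage' ocean liner floating prop",
    "ocean liner October 1949 breakfast menu",
]
results = [web_search(q) for q in queries]  # two tool calls, still a single step
menu_page = results[1]

# The JSON-style equivalent needs one separate blob (and one LLM call) per tool call:
# {"tool": "web_search", "arguments": {"query": "film 'The Last Voyage' ocean liner floating prop"}}
# {"tool": "web_search", "arguments": {"query": "ocean liner October 1949 breakfast menu"}}
```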
The advantages above were confirmed by our experiments on the agent_reasoning_benchmark.
From building smolagents we can also cite a notable additional advantage, which is better handling of state: this is very useful for multimodal tasks in particular. Need to store this image/audio/other artifact for later use? No problem, just assign it as a variable in your state and you can re-use it 4 steps later if needed. In JSON you would have to let the LLM name it in a dictionary key and trust that the LLM will later understand it can still use it.
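As a rough illustration (the tool names `download_image` and `image_qa` are hypothetical, standing in for whatever multimodal tools the agent has), a code agent can simply keep an artifact around as a variable:

```python
# Step 2: download the painting once and keep it in a variable.
# `download_image` and `image_qa` are hypothetical tool names used for illustration.
painting = download_image("https://example.com/embroidery_from_uzbekistan.jpg")

# ... several unrelated steps can happen in between ...

# Step 6: the variable is still in scope, so there is no need to re-fetch it
# or to hope the LLM remembers a dictionary key from four steps ago.
fruits = image_qa(image=painting, question="Which fruits are shown, clockwise from 12 o'clock?")
```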
Making the right tools 🛠️
Now we need to provide the agent with the right set of tools.
1. A web browser. While fully fledged web browser interaction like Operator’s will likely be needed to reach full performance, we started with an extremely simple text-based web browser for this first proof of concept. You can find the code here.
2. A simple text inspector, to be able to read a bunch of text file formats, find it here.
These tools were taken from the excellent Magentic-One agent by Microsoft Research, kudos to them! We didn’t change them much, as our goal was to get as high a performance as possible with the lowest complexity possible.
Here is a short roadmap of improvements which we feel would really boost these tools’ performance (feel free to open a PR and contribute!):
- extending the number of file formats which can be read.
- proposing a more fine-grained handling of files.
- replacing the web browser with a vision-based one, which we’ve started doing here.
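To give an idea of how these pieces fit together, here is a minimal sketch using smolagents. It is a simplified setup rather than the exact configuration we benchmarked: the built-in DuckDuckGoSearchTool stands in for the text browser and text inspector described above, and the model choice is just an example:

```python
from smolagents import CodeAgent, HfApiModel, DuckDuckGoSearchTool

# Simplified setup: the actual open Deep Research agent plugs in the text-browser
# and text-inspector tools adapted from Magentic-One instead of a plain search tool.
model = HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct")  # example model choice

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    additional_authorized_imports=["requests"],  # extra libraries code actions may import
)

answer = agent.run(
    "Which ocean liner was used as a floating prop for the film 'The Last Voyage'?"
)
print(answer)
```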
Results 🏅
In our 24h+ reproduction sprint, we’ve already seen steady improvements in the performance of our agent on GAIA!
We’ve quickly climbed from the previous SoTA with an open framework, around 46% for Magentic-One, to our current performance of 55.15% on the validation set.
This bump in performance is due mostly to letting our agents write their actions in code! Indeed, when switching to a standard agent that writes actions in JSON instead of code, the performance of the same setup immediately degrades to a 33% average on the validation set.
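For reference, switching to the JSON tool-calling baseline is essentially a one-class swap in smolagents (same caveats as the simplified sketch above):

```python
from smolagents import ToolCallingAgent, HfApiModel, DuckDuckGoSearchTool

# Same model and tools, but actions are emitted as JSON tool calls instead of code.
json_agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
)
```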
Here is the final agentic system.
We’ve set up a live demo here for you to try it out!
However, this is only the beginning, and there are lots of things to improve! Our open tools can be made better, the smolagents framework can also be tuned, and we’d love to explore the performance of better open models to support the agent.
We welcome the community to come join us in this endeavour, so we can leverage the power of open research together to build a great open-source agentic framework! It would allow anyone to run a DeepResearch-like agent at home, with their favorite models, using a completely local and customized approach!
Community Reproductions
While we were working on this and focusing on GAIA, other great open implementations of Deep Research emerged from the community, specifically from
Each of these implementations uses different libraries for indexing data, browsing the web, and querying LLMs. In this project, we would like to reproduce the benchmarks presented by OpenAI (pass@1 average score), benchmark and document our findings with switching to open LLMs (like DeepSeek R1) and using vision LMs, and benchmark traditional tool calling against code-native agents.
Most important next steps
OpenAI’s Deep Research is likely boosted by the excellent web browser they introduced with Operator.
So we’re tackling that next! In a more general sense: we’re going to build GUI agents, i.e. “agents that view your screen and can act directly with mouse & keyboard”. If you’re excited about this project and want to help everyone get access to such cool capabilities through open source, we’d love to get your contribution!
We’re also hiring a full-time engineer to help us work on this and more; apply if you’re interested 🙂


