Our Transformers Code Agent beats the GAIA benchmark 🏅

Aymeric Roucher, Sergei Petrov

After some experiments, we were impressed by the performance of Transformers Agents for building agentic systems, so we wanted to see how good it really was! We tested a Code Agent built with the library on the GAIA benchmark, arguably the most difficult and comprehensive agent benchmark… and ended up on top!

The framework transformers.agents used in this blog post has since been upgraded to the stand-alone library smolagents! The two libraries have very similar APIs, so switching is easy.
Check out the smolagents introduction blog here.



GAIA: a tricky benchmark for Agents

What are agents?

In one sentence: an agent is any LLM-based system that can call external tools (or not, depending on the needs of the current use case) and iterate on further steps based on the LLM output. Tools can include anything from a web search API to a Python interpreter.

For a visual analogy: all programs can be described as graphs. Do A, then do B. If/else switches are forks in the graph, but they don't change its structure. We define agents as systems where the LLM output changes the structure of the graph. An agent decides to call tool A, tool B, or nothing; it decides to run one more step or not: these decisions change the structure of the graph. You could integrate an LLM in a fixed workflow, as in an LLM judge, without it being an agent system, since the LLM output does not change the structure of the graph.

Here is an illustration of two different systems that perform Retrieval Augmented Generation: one is classical, with a fixed graph. The other is agentic: one loop in the graph can be repeated as needed.

Classical vs Agentic RAG

Agent systems give LLMs superpowers. For more detail, read our earlier blog post on the release of Transformers Agents 2.0.

GAIA is arguably the most comprehensive benchmark for agents. Its questions are very difficult and highlight certain shortcomings of LLM-based systems.

Here is an example of a hard question:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

You can see that this question involves several difficulties:

  • Answering in a constrained format.
  • Multimodal abilities to read the fruits from the image.
  • Several pieces of information to gather, some depending on others:
    • The fruits in the image
    • The identity of the ocean liner used as a floating prop for “The Last Voyage”
    • The October 1949 breakfast menu for the above ocean liner
  • The above forces the correct solving trajectory to use several chained steps.

Solving this requires both high-level planning abilities and rigorous execution, which are precisely two areas where LLMs struggle.

Therefore, it's an excellent test set for agent systems!

On GAIA's public leaderboard, GPT-4-Turbo doesn't reach 7% on average. The top submission is (was) an Autogen-based solution with a complex multi-agent system that makes use of OpenAI's tool-calling functions; it reaches 40%.

Let’s take them on. 🥊

Let's fight



Building the right tools 🛠️

We used three main tools to solve GAIA questions:

a. Web browser

For web browsing, we mostly reused the Markdown web browser from the Autogen team's submission. It comprises a Browser class storing the current browser state, and several tools for web navigation, like visit_page, page_down or find_in_page. This tool returns markdown representations of the current viewport. Using markdown compresses web page information a lot, which could lead to some misses compared to other solutions like taking a screenshot and using a vision model. However, we found that the tool overall performed well without being too complex to use or edit.

Note: we think a good way to improve this tool in the future would be to load pages using the selenium package rather than requests. This would allow us to load javascript (many pages cannot load properly without it) and accept cookies to access some pages.
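To make the idea concrete, here is a minimal sketch of such a markdown browser tool. It is not the Autogen implementation: the method names mirror the ones mentioned above, but the class name and internals (requests plus markdownify, a fixed-size character viewport) are simplifying assumptions.

```python
import requests
from markdownify import markdownify  # converts HTML to markdown


class MarkdownBrowser:
    """Minimal stateful browser: keeps the current page as markdown plus a viewport position."""

    def __init__(self, viewport_chars: int = 5000):
        self.viewport_chars = viewport_chars
        self.page_markdown = ""
        self.viewport_start = 0

    def visit_page(self, url: str) -> str:
        """Download a page, convert it to markdown, and return the first viewport."""
        response = requests.get(url, timeout=20)
        response.raise_for_status()
        self.page_markdown = markdownify(response.text)
        self.viewport_start = 0
        return self._current_viewport()

    def page_down(self) -> str:
        """Scroll the viewport forward by one viewport length."""
        last_start = max(len(self.page_markdown) - self.viewport_chars, 0)
        self.viewport_start = min(self.viewport_start + self.viewport_chars, last_start)
        return self._current_viewport()

    def find_in_page(self, query: str) -> str:
        """Jump the viewport to the first occurrence of `query`, if any."""
        index = self.page_markdown.lower().find(query.lower())
        if index == -1:
            return f"'{query}' not found on this page."
        self.viewport_start = index
        return self._current_viewport()

    def _current_viewport(self) -> str:
        return self.page_markdown[self.viewport_start : self.viewport_start + self.viewport_chars]
```

Each method returns a markdown snippet sized to fit comfortably in the LLM context, which is the whole point of the design.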

b. File inspector

Many GAIA questions rely on attached files of various types, such as .xls, .mp3, .pdf, etc. These files need to be properly parsed. Once more, we use Autogen's tool since it works very well.

Many thanks to the Autogen team for open-sourcing their work. Using these tools sped up our development process by weeks! 🤗

c. Code interpreter

We will have no need for this, since our agent naturally generates and executes Python code: see more below.



Code Agent 🧑‍💻



Why a Code Agent?

As shown by Wang et al. (2024), letting the agent express its actions in code has several advantages compared to using dictionary-like outputs such as JSON. For us, the main advantage is that code is a very optimized way to express complex sequences of actions. Arguably, if there were a better way to rigorously express detailed actions than our current programming languages, it would have become a new programming language!

Consider this example given in their paper:

Code agents are just more intuitive than JSON

It highlights several benefits of using code:

  • Code actions are much more concise than JSON (see the sketch after this list).
    • Need to run 4 parallel streams of 5 consecutive actions? In JSON, you would need to generate 20 JSON blobs, each in its own separate step; in Code it's only one step.
    • On average, the paper shows that Code actions require 30% fewer steps than JSON, which amounts to an equivalent reduction in the number of generated tokens. Since LLM calls are often the dimensioning cost of agent systems, this means your agent system runs are ~30% cheaper.
  • Code enables the re-use of tools from common libraries.
  • Using code gets better performance in benchmarks, due to two reasons:
    • It's a more intuitive way to express actions.
    • LLMs have lots of code in their training data, which possibly makes them more fluent in code-writing than in JSON-writing.
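As an illustration of the conciseness point, here is a hedged sketch of the "4 parallel streams of 5 consecutive actions" case written as a single code action. The tool names and file names are hypothetical placeholders for tools an agent would have been given; they are not from the paper or our submission.

```python
# Hypothetical code action: process four documents through five chained steps each.
# extract_text, summarize, extract_entities, tabulate_entities stand for agent-provided tools;
# expressed in JSON, this would require 20 separate tool-call blobs, one per step.
results = {}
for document in ["report_a.pdf", "report_b.pdf", "report_c.pdf", "report_d.pdf"]:
    text = extract_text(document)          # step 1
    summary = summarize(text)              # step 2
    entities = extract_entities(summary)   # step 3
    table = tabulate_entities(entities)    # step 4
    results[document] = table              # step 5
print(results)
```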

We confirmed these points during our experiments on agent_reasoning_benchmark.

From our latest experiments building transformers agents, we also observed additional advantages:

  • It is much easier to store an element as a named variable in code. For example, need to store this rock image generated by a tool for later use?
    • No problem in code: writing rock_image = image_generation_tool("An image of a rock") will store the variable under the key "rock_image" in your dictionary of variables. Later, the LLM can just use its value in any code blob by referring to it again as "rock_image".
    • In JSON, you would have to do some complicated gymnastics to create a name under which to store this image, so that the LLM later knows how to access it again. For instance, save any output of the image generation tool under "image_{i}.png", and trust that the LLM will later understand that image_4.png is the output of the tool call that precedes it in memory? Or let the LLM also output an "output_name" key to choose under which name to store the variable, thus complicating the structure of your action JSON?
  • Agent logs are considerably more readable.



Implementation of Transformers Agents’ CodeAgent

The thing with LLM-generated code is that it can be really unsafe to execute as is. If you let an LLM write and execute code without guardrails, it could hallucinate anything: for instance that all your personal files need to be erased by copies of the Dune lore, or that this audio of you singing the Frozen theme should be shared on your blog!

So for our agents, we had to make code execution secure. The usual approach is top-down: “use a fully functional Python interpreter, but forbid certain actions”.

To be safer, we preferred to go the opposite way and build an LLM-safe Python interpreter from the ground up. Given a Python code blob provided by the LLM, our interpreter starts from the Abstract Syntax Tree representation of the code given by the ast Python module. It executes the tree nodes one by one, following the tree structure, and stops at any operation that was not explicitly authorized.

For instance, an import statement will first check if the import is explicitly mentioned in the user-defined list of authorized_imports: if not, it is not executed. We include a default list of built-in standard Python functions, comprising for instance print and range. Anything outside of it will not be executed unless explicitly authorized by the user. For instance, open (as in with open("path.txt", "w") as file:) is not authorized.

When encountering a function call (ast.Call), if the function name is one of the user-defined tools, the tool is called with the arguments of the call. If it is another function defined and allowed earlier, it gets run normally.
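Here is a heavily simplified sketch of the idea, not the actual Transformers Agents interpreter: walk the AST, refuse imports outside an allow-list, and route calls to tools, default-allowed built-ins, or previously defined values. Only a handful of node types are handled, just to show the allow-list principle.

```python
import ast

SAFE_BUILTINS = {"print": print, "range": range}  # default allow-list of built-ins


class UnauthorizedOperation(Exception):
    pass


def evaluate_code(code: str, tools: dict, authorized_imports: list) -> None:
    """Sketch of an allow-list AST interpreter: execute nodes one by one, refuse the rest."""
    state = {}  # variables assigned by the LLM-generated code live here
    for node in ast.parse(code).body:
        evaluate_node(node, state, tools, authorized_imports)


def evaluate_node(node, state, tools, authorized_imports):
    if isinstance(node, ast.Import):
        for alias in node.names:
            if alias.name not in authorized_imports:
                raise UnauthorizedOperation(f"Import of '{alias.name}' is not authorized")
            state[alias.asname or alias.name] = __import__(alias.name)
    elif isinstance(node, ast.Assign):
        value = evaluate_node(node.value, state, tools, authorized_imports)
        for target in node.targets:  # assumes simple `name = value` targets
            state[target.id] = value
    elif isinstance(node, ast.Expr):
        return evaluate_node(node.value, state, tools, authorized_imports)
    elif isinstance(node, ast.Call):
        func_name = node.func.id  # assumes a plain name, not an attribute access
        args = [evaluate_node(arg, state, tools, authorized_imports) for arg in node.args]
        if func_name in tools:              # user-defined tool
            return tools[func_name](*args)
        if func_name in SAFE_BUILTINS:      # default-allowed built-ins like print, range
            return SAFE_BUILTINS[func_name](*args)
        if callable(state.get(func_name)):  # something defined and allowed earlier
            return state[func_name](*args)
        raise UnauthorizedOperation(f"Call to '{func_name}' is not authorized")
    elif isinstance(node, ast.Constant):
        return node.value
    elif isinstance(node, ast.Name):
        return state[node.id]
    else:
        raise UnauthorizedOperation(f"{type(node).__name__} is not supported in this sketch")
```

With this sketch, evaluate_code('print(range(3))', tools={}, authorized_imports=[]) runs fine, while evaluate_code('import os', tools={}, authorized_imports=[]) raises UnauthorizedOperation.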

We also make several tweaks to help with LLM usage of the interpreter (a rough sketch follows the list):

  • We cap the number of operations per execution to prevent infinite loops caused by issues in LLM-generated code: at each operation, a counter is incremented, and if it reaches a certain threshold, the execution is interrupted.
  • We cap the number of lines in print outputs to avoid flooding the LLM's context length with junk. For instance, if the LLM reads a 1M-line text file and decides to print every line, at some point this output gets truncated, so that the agent memory doesn't explode.
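Both safeguards are simple counters and truncations. A hedged sketch of how they could be wired into the node-evaluation loop above; the threshold values are made up for illustration:

```python
MAX_OPERATIONS = 10_000   # assumed cap on evaluated AST nodes per execution
MAX_PRINT_LINES = 1_000   # assumed cap on lines kept from print outputs

operations_count = 0


def count_operation() -> None:
    """Call at every node evaluation; interrupts runaway (likely infinite-loop) executions."""
    global operations_count
    operations_count += 1
    if operations_count > MAX_OPERATIONS:
        raise RuntimeError("Execution interrupted: too many operations (possible infinite loop).")


def truncate_print_output(output: str) -> str:
    """Keep only the first MAX_PRINT_LINES lines so the agent memory doesn't explode."""
    lines = output.splitlines()
    if len(lines) <= MAX_PRINT_LINES:
        return output
    return "\n".join(lines[:MAX_PRINT_LINES]) + f"\n... (truncated {len(lines) - MAX_PRINT_LINES} lines)"
```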



Basic multi-agent orchestration

Web browsing is a very context-rich activity, but most of the retrieved context is actually useless. For instance, in the above GAIA question, the only important information to get is the image of the painting “Embroidery from Uzbekistan”. Anything around it, like the content of the blog we found it on, is generally useless for the broader task solving.

To solve this, a multi-agent setup makes sense! For example, we can create a manager agent and a web search agent. The manager agent should solve the higher-level task and assign specific web search tasks to the web search agent. The web search agent should return only the useful outputs of its search, so that the manager isn't cluttered with useless information.

We created exactly this multi-agent orchestration in our workflow:

  • The top-level agent is a ReactCodeAgent. It natively handles code, since its actions are formulated and executed in Python. It has access to these tools:
    • file_inspector to read text files, with an optional question argument: instead of returning the whole content of the file, it only returns the answer to the specific question based on that content.
    • visualizer to specifically answer questions about images.
    • search_agent to browse the web. More specifically, this tool is just a wrapper around a web search agent, which is a JSON agent (JSON still works well for strictly sequential tasks, like web browsing, where you scroll down, then navigate to a new page, and so on). This agent in turn has access to the web browsing tools:
      • informational_web_search
      • page_down
      • find_in_page
      • … (full list at this line)

This embedding of an agent as a tool is a naive way to do multi-agent orchestration, but we wanted to see how far we could push it, and it turns out it goes quite far!
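A hedged sketch of this pattern with the transformers.agents API: the class names ReactCodeAgent, ReactJsonAgent and Tool come from the library, but the exact Tool attribute schema varies across versions, the tool variables are placeholders for the tools described above, and this is not our actual submission code (linked in the Results section below).

```python
from transformers.agents import ReactCodeAgent, ReactJsonAgent, Tool

llm_engine = ...  # any LLM engine accepted by the library; our submission used GPT-4o

# The web search agent is a JSON agent: browsing is strictly sequential.
# informational_web_search, page_down, find_in_page stand for the browsing tools above.
web_search_agent = ReactJsonAgent(
    tools=[informational_web_search, page_down, find_in_page],
    llm_engine=llm_engine,
)


class SearchAgentTool(Tool):
    """Wraps the whole web search agent as a single tool for the manager agent."""
    name = "search_agent"
    description = "Performs a web search for your query and returns only the useful findings."
    inputs = {"query": {"type": "text", "description": "The web search task to perform."}}
    output_type = "text"

    def forward(self, query: str) -> str:
        return web_search_agent.run(query)


# The manager is a code agent: its actions are Python code run by the safe interpreter.
manager_agent = ReactCodeAgent(
    tools=[file_inspector, visualizer, SearchAgentTool()],
    llm_engine=llm_engine,
)

answer = manager_agent.run("Which of the fruits shown in the 2008 painting ...?")
```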



Planning component 🗺️

There is now a whole zoo of planning strategies, so we opted for a relatively simple plan-ahead workflow. Every N steps we generate two things:

  • a summary of facts we know or can derive from context, and facts we need to discover
  • a step-by-step plan of how to solve the task, given fresh observations and the factual summary above

The parameter N can be tuned for better performance on the target use case: we chose N=2 for the manager agent and N=5 for the web search agent.

An interesting discovery was that if we don't provide the previous version of the plan as input, the score goes up. An intuitive explanation is that LLMs are commonly biased strongly towards any relevant information available in the context. If the previous version of the plan is present in the prompt, an LLM is likely to heavily reuse it instead of re-evaluating the approach and re-generating a plan when needed.

Both the summary of facts and the plan are then used as additional context to generate the next action. Planning encourages an LLM to choose a better trajectory by having all the steps to achieve the goal and the current state of affairs in front of it.
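A pseudocode-level sketch of this plan-ahead loop; the generate_* and execute functions are placeholders for the LLM calls and tool execution, not the library API:

```python
PLANNING_INTERVAL = 2  # N: 2 for the manager agent, 5 for the web search agent


def run_with_planning(task: str, max_steps: int = 20):
    """Plan-ahead loop: every N steps, re-derive a facts summary and a fresh plan."""
    memory = []  # past (action, observation) pairs
    facts, plan = "", ""
    for step in range(max_steps):
        if step % PLANNING_INTERVAL == 0:
            # Deliberately do NOT pass the previous plan as input: the LLM tends to
            # copy it instead of re-evaluating the approach.
            facts = generate_facts_summary(task, memory)
            plan = generate_plan(task, memory, facts)
        action = generate_next_action(task, memory, facts, plan)
        observation = execute(action)
        memory.append((action, observation))
        if is_final_answer(action):
            return observation
    return None
```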



Results 🏅

Here is the final code used for our submission.

We get 44.2% on the validation set: this means Transformers Agents' ReactCodeAgent is now #1 overall, 4 points above the second place! On the test set, we get 33.3%, so we rank #2, ahead of Microsoft Autogen's submission, and we get the best average score on the hardcore Level 3 questions.

We did it!

This is a data point supporting the claim that Code actions work better. Given their efficiency, we expect Code actions will soon replace the JSON/OAI format as the standard for agents writing their actions.

To our knowledge, LangChain and LlamaIndex don't support Code actions out of the box. Microsoft's Autogen has some support for Code actions (executing code in Docker containers), but it looks like an annex to JSON actions. So Transformers Agents is the only library to make this format central!



Next steps

We hope you enjoyed reading this blog post! The work is just getting started, as we'll keep improving Transformers Agents along several axes:

  • LLM engine: Our submission was done with GPT-4o (alas), without any fine-tuning. Our hypothesis is that using a fine-tuned open-source model would allow us to get rid of parsing errors and score a bit higher!
  • Multi-agent orchestration: ours is a naive one; with more seamless orchestration, we could probably go a long way!
  • Web browser tool: using the selenium package, we could have a web browser that gets past cookie banners and loads javascript, thus allowing us to read many pages that are for now not accessible.
  • Improve planning further: We're running some ablation tests with other options from the literature to see which method works best. We also plan to try alternative implementations of existing components as well as some new components. We'll publish our updates when we have more insights!

Keep an eye on Transformers Agents in the next few months! 🚀

And don't hesitate to reach out to us with your use cases: now that we have built internal expertise on Agents, we'll be happy to help! 🤝


