In an ideal world, AI agents would be reliable assistants. Given a question, they would effortlessly handle ambiguity in instructions, construct step-by-step plans, correctly identify the necessary resources, execute those plans without getting sidetracked, and adapt to unexpected events, all while staying accurate and avoiding hallucinations.
However, developing agents and testing these behaviors is no small feat: if you have ever tried to debug your own agent, you have probably noticed how tedious and frustrating it can be. Existing evaluation environments are tightly coupled with the tasks they evaluate, lack real-world flexibility, and don’t reflect the messy reality of open-world agents: simulated pages never fail to load, events don’t spontaneously emerge, and asynchronous chaos is absent.
That’s why we’re very happy to introduce Gaia2, the follow-up to the agentic benchmark GAIA, which allows evaluation of considerably more complex behaviors. Gaia2 is released alongside the open Meta Agents Research Environments (ARE) framework to run, debug, and evaluate agents. ARE simulates complex, real-world-like conditions and can be customized to further study agent behaviors. The Gaia2 dataset is released under the CC BY 4.0 license, and ARE under the MIT license.
Gaia2: Agentic Evaluation on Real Life Assistant Tasks
GAIA is an agentic benchmark published in 2023, with three levels of information-retrieval questions requiring tools, web browsing, and reasoning to solve. In two years, the easiest levels have become too easy for models, and the community is getting close to solving the hardest questions, so it was time for a completely new and harder agent benchmark!
Here comes Gaia2, a follow-up to GAIA, going way beyond it in terms of the capabilities studied!
Where GAIA was read-only, Gaia2 is a read-and-write benchmark, focusing on interactive behavior and complexity management. Agents are now evaluated not only on search and retrieval, but also on instruction following over ambiguous or time-sensitive queries, in a noisy environment with controlled failures – reflecting real-world conditions more than any other simulated environment. We want to test how agents manage tools or APIs that sometimes fail, plan sequences of actions within very specific time frames, and adapt to new events – a whole new range of complexity!
To do this, we use the following task groups (thanks to 1,000 brand-new human-created scenarios):
- Execution: Multi-step instruction following and tool-use (e.g., contact updates)
- Search: Cross-source information gathering (e.g., friend cities from WhatsApp)
- Ambiguity Handling: Clarification of conflicting requests (e.g., scheduling conflicts)
- Adaptability: Response to changes in the simulation (e.g., updating an email using follow-up information)
- Temporal Reasoning: Time-sensitive actions (e.g., cab orders after 3-minute delays)
- Agent-to-Agent Collaboration: Communication between agents without direct API access
- Noise Tolerance: Robustness to API failures and environmental instability
In the spirit of GAIA, scenarios don’t require specialized knowledge: humans should in principle be able to get 100%, which allows easy debugging for model developers.
Want to explore the benchmark? Check out our dataset, which you can explore more conveniently in our demo here.
How does Gaia2 run?
Gaia2 runs with ARE, an execution environment where an agent of your choice has access to a set of applications and associated pre-populated data.
For Gaia2, we created a smartphone mock-up environment, simulating what a human would use in their daily life. It contains real-world-like applications such as messaging (Email), utilities (Calendar, Contacts, Shopping, a FileSystem, …), and a chat interface to converse with the agent. All applications are also accessible to the agents through tool calling. Last but not least, the demo also contains a simulated persona’s history of conversations and app interactions.
All agent interactions are automatically recorded as structured traces during execution for deep dives and evaluation: they include tool calls, API responses, model thoughts, timing metrics (e.g., response latency), user interactions, and so on – and can all be exported as JSON.
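As a minimal sketch, here is how you might inspect such an exported trace offline. The file path and field names below are assumptions for illustration only; check the schema of the JSON your ARE export actually produces.

```python
import json
from collections import Counter

# Load a trace exported from ARE (the path and keys are hypothetical).
with open("trace.json") as f:
    trace = json.load(f)

# Count which tools were called and how often, assuming each recorded
# step carries a "tool_name" field when the agent issued a tool call.
tool_calls = Counter(
    step["tool_name"] for step in trace.get("steps", []) if "tool_name" in step
)
print(tool_calls.most_common(5))
```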
Results
For reference, we compare a range of large open- and closed-source models: Llama 3.3-70B Instruct, Llama-4-Maverick, GPT-4o, Qwen3-235B-MoE, Grok-4, Kimi K2, Gemini 2.5 Pro, Claude 4 Sonnet, and GPT-5 in all reasoning modes.
All models are evaluated using the same setup (a uniform ReAct loop for consistency, a temperature of 0.5, and a generation limit of 16K tokens), with a mix of model-as-a-judge (Llama 3.3 Instruct 70B) and exact-match evaluation depending on the task. All 101 tools (and the general environment description) are provided in the system prompt.
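For readers unfamiliar with the pattern, here is a minimal, illustrative sketch of a ReAct-style loop. It is not ARE’s actual agent implementation: the `call_llm` stub and the toy tool are placeholders standing in for a real model call and the 101 environment tools.

```python
# Illustrative ReAct-style loop: the model alternates between reasoning,
# tool calls, and observations until it emits a final answer or runs out
# of steps. Not ARE's actual agent; names below are placeholders.

def call_llm(messages):
    # Stand-in for a real model call; this stub always answers immediately.
    return {"thought": "…", "action": "final_answer", "action_input": "done"}

TOOLS = {
    "get_contacts": lambda _args: ["Anna", "George"],  # toy tool for illustration
}

def react_loop(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)
        if step["action"] == "final_answer":
            return step["action_input"]
        # Otherwise, execute the requested tool and feed the observation back.
        observation = TOOLS[step["action"]](step["action_input"])
        messages.append({"role": "assistant", "content": str(step)})
        messages.append({"role": "tool", "content": str(observation)})
    return None

print(react_loop("Text the Renne family about the party"))
```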
Among the evaluated models, the top-scoring model overall as of September 2025 is GPT-5 with high reasoning, and the best open-source model is Kimi K2.
Some capabilities already appear to be close to solved by the best models: execution of simple tool calls and instruction following (execution), and overall search (as we could have guessed from current results on GAIA). The ambiguity, adaptability, and noise splits remain difficult for all models for now, and it’s interesting to see that performance on what used to be considered complex agentic tasks (instruction following and search) is not a good proxy for performance on closer-to-real-world tasks. Last but not least, the hardest split for all models at the moment is the time one: it is currently very hard for models to correctly handle time-sensitive actions (though this could likely be mitigated through the use of specialized tools and better temporal reasoning). A detailed analysis of these results can be found in the paper.
However, we believe it’s important to push reporting beyond raw scores: if a model is correct but took several thousand tokens to reach the right solution, or ran for several hours, it is “not as good” as a model that succeeded orders of magnitude faster. We therefore also normalize scores for cost, quantified as the average number of LLM calls and output tokens (which together define a cost-performance Pareto frontier). In the paper, you will find score vs. monetary cost and time.
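As a toy illustration of the idea (the run records below are made up, and the exact normalization used in the paper may differ), pairing a raw score with an average-cost axis looks like this:

```python
# Made-up run records: each entry is one scenario attempt by one model.
runs = [
    {"success": True,  "llm_calls": 12, "output_tokens": 3_500},
    {"success": False, "llm_calls": 40, "output_tokens": 15_800},
    {"success": True,  "llm_calls": 8,  "output_tokens": 2_100},
]

# Raw score plus the two cost proxies mentioned above: average LLM calls
# and average output tokens per scenario.
score = sum(r["success"] for r in runs) / len(runs)
avg_calls = sum(r["llm_calls"] for r in runs) / len(runs)
avg_tokens = sum(r["output_tokens"] for r in runs) / len(runs)

print(f"score={score:.2f}, avg calls={avg_calls:.1f}, avg output tokens={avg_tokens:.0f}")
```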
Compare with your favorite models! Evaluating on Gaia2
If you want to evaluate your model on Gaia2, you can follow these steps:
First, install Meta Agents Research Environments in your Python environment of choice (uv, conda, virtualenv, …):
pip install meta-agents-research-environments
Then, run the benchmark for all configurations: execution, search, adaptability, time, and ambiguity. Don’t forget to upload all results to the Hub with the hf_upload kwarg!
are-benchmark run --hf meta-agents-research-environments/Gaia2 --split validation --config CONFIGURATION --model YOUR_MODEL --model_provider YOUR_PROVIDER --agent default --max_concurrent_scenarios 2 --scenario_timeout 300 --output_dir ./monitored_test_results --hf_upload YOUR_HUB_DATASET_TO_SAVE_RESULTS
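If you want to sweep all five configurations in one go, a small wrapper script is one option; it simply reuses the flags from the command above (the placeholder values and configuration spellings should be checked against the dataset card):

```python
import subprocess

# Run the Gaia2 validation split for every configuration, reusing the
# command-line flags shown above. Replace the placeholders with your own
# model, provider, and Hub dataset; check the dataset card for the exact
# configuration names.
CONFIGS = ["execution", "search", "adaptability", "time", "ambiguity"]

for config in CONFIGS:
    subprocess.run(
        [
            "are-benchmark", "run",
            "--hf", "meta-agents-research-environments/Gaia2",
            "--split", "validation",
            "--config", config,
            "--model", "YOUR_MODEL",
            "--model_provider", "YOUR_PROVIDER",
            "--agent", "default",
            "--max_concurrent_scenarios", "2",
            "--scenario_timeout", "300",
            "--output_dir", f"./monitored_test_results/{config}",
            "--hf_upload", "YOUR_HUB_DATASET_TO_SAVE_RESULTS",
        ],
        check=True,
    )
```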
Next, run the judge to get your aggregated score file:
are-benchmark judge --hf meta-agents-research-environments/Gaia2 --split validation --config CONFIGURATION --agent default --max_concurrent_scenarios 2 --scenario_timeout 300 --output_dir ./monitored_test_results --hf_upload YOUR_HUB_DATASET_TO_SAVE_RESULTS
Finally, add all the relevant details about your model in the README, and share it on the leaderboard to centralize Gaia2 traces here!
Beyond Gaia2: study your agents with ARE
Beyond benchmark scenarios, you can use Gaia2 apps and content in ARE to see if a model is able to correctly solve less verifiable tasks such as loading emails, writing follow-ups, adding events to the calendar, or booking meetings – in short, providing the right setup to evaluate your AI assistants through interaction!
You can also easily customize the environment by 1) connecting your own tools (via MCP or directly) to test your agents on them; 2) implementing your own scenarios, including defining trigger or timed events (e.g., after 2 minutes, the Mail app receives a new email from a contact), to see how the agent adapts to an evolving environment – a sketch of the idea follows below.
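Conceptually, a timed-event scenario pairs an instruction with events scheduled to fire during the run. The class and field names below are hypothetical placeholders for illustration, not the actual ARE scenario API; refer to the ARE documentation for the real interface.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins (not ARE's API): two minutes into the run, a new
# email lands in the Mail app and the agent is expected to react to it.

@dataclass
class TimedEvent:
    at_seconds: float
    app: str
    payload: dict

@dataclass
class Scenario:
    instruction: str
    events: list = field(default_factory=list)

scenario = Scenario(
    instruction="Summarize any new email and forward it to Anna.",
    events=[
        TimedEvent(
            at_seconds=120,
            app="Mail",
            payload={"from": "contact@example.com", "subject": "Follow-up"},
        )
    ],
)
print(scenario)
```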
(Since the agents are JSON agents by default, they can’t mess up your machine – unless of course you connect them to external apps with unsafe permissions. So, operate with caution when adding your own apps or using untrusted MCPs.)
Here are several use cases we have used ARE for:
- Vibe-check any agent on real or simulated data, to test a variety of setups, with their own rules, tools, content, and verifications
- Test agent tool calling and orchestration capabilities, either with local apps or MCP tools
- Generate your own tool-calling traces to fine-tune tool-calling models (see the sketch after this list)
- Easily gather and reproduce existing agentic benchmarks in a unified framework
- Debug and study agent-to-agent interactions on the fly within the user interface
- Study model limitations in noisy environments (with API timeouts and ambiguity)
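On the fine-tuning point mentioned above, here is a minimal sketch of turning an exported trace into chat-style training records with tool calls. Both the trace schema and the target record format are assumptions for illustration; adapt them to your actual export and training stack.

```python
import json

# Convert a hypothetical ARE trace export into chat-style training examples
# containing tool calls. All field names here are illustrative assumptions.
def trace_to_examples(trace):
    examples = []
    for step in trace.get("steps", []):
        if "tool_name" not in step:
            continue
        examples.append({
            "messages": [
                {"role": "user", "content": trace.get("task", "")},
                {
                    "role": "assistant",
                    "tool_call": {
                        "name": step["tool_name"],
                        "arguments": step.get("arguments", {}),
                    },
                },
            ]
        })
    return examples

with open("trace.json") as f:
    print(json.dumps(trace_to_examples(json.load(f))[:1], indent=2))
```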
We recorded 3 videos so you can check out some of these use cases (but of course, we hope the community gets creative with ARE 🤗). For these videos, we use the default demo described above, which contains the simulated life of Linda Renne, a PhD student in machine learning.
1) Testing an agent on a simple task: event organisation
To test how good the default model is at event organisation, let’s plan a birthday party!
We first ask the agent to text everyone in the Renne family about the user’s 30th birthday party on November 7. The default universe has 21 contacts in the list, including 5 Renne family members – Linda, the simulation “owner”; George and Stephie, her parents; Anna, her sister; and Morgan, her grandfather. The agent successfully goes through the contact list, finds the 4 family members, and texts them.
Next, we ask the agent to create a calendar invite and add them as invitees. The agent remembers the above context! It creates a calendar invite on the correct date and correctly adds the family members to it.
2) Understanding agents: deep diving into the traces
ARE also lets us inspect the traces behind the actions taken by the agent.
Upon opening the Agent logs tool on the left, we can see the system prompt, the chain of thought, the multi-step actions taken with the tools called, and the results, all as neatly organised logs. Everything can be exported as JSON if you want to consult things offline!
3) Fooling around and extending the demo: connecting the agent to your own MCPs
In this last example, we connect ARE to a remote robot arm via MCP so it can gesture to us, then ask the agent to answer our yes-or-no questions by waving the robot arm! Here’s what it looks like.
But these examples are only very simple starting points, and we’re really looking forward to what you’ll build! (For more advanced users, you can even directly install and edit the Meta-ARE code here.)
Conclusion
Gaia2 and ARE are new research tools that we hope will empower anyone to easily build more reliable and adaptable AI agents – by allowing easy experimentation, making real-world evaluation accessible to everyone, and improving trust through transparent, reproducible benchmarks and debuggable traces.
We’d love to see what you do with this project!




