The most comprehensive evaluation suite for GUI Agents!



TL;DR

Over the past few weeks, we’ve been working tirelessly on making GUI agents more open, accessible and straightforward to integrate. Along the way, we created the largest benchmarking suite for GUI agent performance 👉 let us introduce ScreenSuite.

We’re very excited to share it with you today: ScreenSuite is the most comprehensive and easiest way to evaluate Vision Language Models (VLMs) across many agentic capabilities!



WTF is a GUI Agent?

GUI Agents in action – courtesy of OSWorld

In brief, an AI Agent is a robot that acts in the virtual world. (more thorough definition here)

Specifically, a “GUI Agent” is an agent that lives in a GUI. Think “an agent that can click and navigate on my desktop or my phone”, à la Claude Computer Use.

This means, in essence, that the AI model powering the agent will be given a task like “Fill the rest of this Excel column”, along with screen captures of the GUI. Using this information, it will then decide to take actions on the system: click(x=130, y=540) to open a web browser, type("Value for XYZ in 2025"), scroll(down=2) to read further… To see a GUI agent in action, you can try our Open Computer Agent, powered by Qwen2.5-VL-72B.
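
To make this concrete, here is a minimal, hypothetical sketch of how such action calls could be represented and dispatched in Python. It is not the actual Open Computer Agent or ScreenSuite code, and it assumes the pyautogui library for driving the mouse and keyboard:

```python
from dataclasses import dataclass

import pyautogui  # assumed here purely for illustration


@dataclass
class ClickAction:
    x: int
    y: int


@dataclass
class TypeAction:
    text: str


@dataclass
class ScrollAction:
    down: int


def execute(action) -> None:
    """Dispatch one model-predicted action to the GUI (illustrative only)."""
    if isinstance(action, ClickAction):
        pyautogui.click(action.x, action.y)
    elif isinstance(action, TypeAction):
        pyautogui.write(action.text)
    elif isinstance(action, ScrollAction):
        pyautogui.scroll(-action.down)  # negative values scroll down
    else:
        raise ValueError(f"Unknown action: {action!r}")


# The three actions from the example above:
for action in [ClickAction(130, 540), TypeAction("Value for XYZ in 2025"), ScrollAction(down=2)]:
    execute(action)
```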

A good GUI agent will be able to navigate a computer just like we do, thus unlocking all computer tasks: scrolling through Google Maps, editing a file, buying an item online. This involves a wide range of capabilities that can be hard to evaluate.



Introducing ScreenSuite 🥳

The literature, for instance Xu et al. (2025) or Qin et al. (2025), generally splits GUI agent abilities into several categories:

  1. Perception: accurately perceiving the information displayed on screen
  2. Grounding: understanding the position of elements – this is paramount for clicking the right place
  3. Single-step actions: correctly solving an instruction in a single action
  4. Multi-step agents: solving a higher-level goal through several actions in a GUI environment.

So our first contribution is to gather and unify a comprehensive suite of 13 benchmarks spanning the full range of these GUI agent capabilities.

Looking at the last category listed above, evaluating multi-step agentic capabilities is particularly difficult because it requires virtual machines to run the agent’s environment, be it Windows, Android, Ubuntu… To address this, we provide support for E2B desktop remote sandboxes, and we also created from scratch a new way to easily launch Ubuntu or Android virtual machines in Docker!

Implementation details

We’ve carefully designed our benchmark suite with modularity and consistency in mind, ensuring strong alignment across tasks and environments. When required, especially for online benchmarks, we leverage smolagents as the framework layer to streamline agent execution and orchestration.
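
As a rough illustration of what a smolagents-based setup can look like, here is a minimal sketch. The click_screen tool and the model choice are hypothetical placeholders, not the actual ScreenSuite orchestration code:

```python
from smolagents import InferenceClientModel, ToolCallingAgent, tool


@tool
def click_screen(x: int, y: int) -> str:
    """Click at the given screen coordinates (hypothetical tool, for illustration only).

    Args:
        x: Horizontal pixel coordinate of the click.
        y: Vertical pixel coordinate of the click.
    """
    # In a real setup this would forward the click to the sandboxed desktop or Android VM.
    return f"Clicked at ({x}, {y})"


# Any VLM endpoint could back the agent; this model id is only an example.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-VL-72B-Instruct")
agent = ToolCallingAgent(tools=[click_screen], model=model)
agent.run("Open the web browser and search for 'ScreenSuite'")
```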

To support reproducibility and ease of use, we’ve built custom Dockerized containers that allow local deployment of full Ubuntu Desktop or Android environments.
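
Such a container can also be driven programmatically, for instance with the Docker SDK for Python. This is only a sketch, and the image name below is a placeholder rather than the real ScreenSuite image:

```python
import docker  # the Docker SDK for Python: pip install docker

client = docker.from_env()

# Launch a desktop-environment container and expose its VNC port.
container = client.containers.run(
    "screensuite/ubuntu-desktop:latest",  # hypothetical image name
    detach=True,
    ports={"5900/tcp": 5900},
    shm_size="2g",
)

try:
    print("Environment running:", container.short_id)
    # ... point the agent and benchmark at the container here ...
finally:
    container.stop()
    container.remove()
```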

Unlike many existing GUI benchmarks that rely on accessibility trees or other metadata alongside visual input, our stack is intentionally vision-only. While this can result in different scores on some established leaderboards, we believe it creates a more realistic and challenging setup, one that better reflects how humans perceive and interact with graphical interfaces.

  • All agentic frameworks (Android World, OSWorld, GAIAWeb, Mind2Web) use smolagents and rely solely on vision, without any accessibility tree or DOM added (in contrast with evaluation settings reported in other sources).
  • Mind2Web (Multimodal) originally used element-name-based multiple-choice selection over the accessibility tree and screenshots, but was adapted here to clicking within element bounding boxes using vision only, which significantly increases task difficulty (see the sketch below).
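
As a rough sketch of what this vision-only scoring amounts to (simplified, not the exact ScreenSuite metric), a predicted click counts as correct only if it lands inside the target element’s bounding box:

```python
def click_in_bbox(click_x: float, click_y: float, bbox: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click lands inside the target bounding box.

    bbox is (left, top, right, bottom) in pixel coordinates.
    """
    left, top, right, bottom = bbox
    return left <= click_x <= right and top <= click_y <= bottom


# Example: the model predicts a click at (412, 278) for a button spanning (400, 260) to (480, 300).
assert click_in_bbox(412, 278, (400, 260, 480, 300))
```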



Ranking leading VLMs on ScreenSuite 📊

We’ve evaluated leading VLMs on the benchmark:

  • The Qwen-2.5-VL series of models from 3B to 72B. These models are known for their amazing localization capabilities; in other words, they can give the coordinates of any element in an image, which makes them well suited for GUI agents that must click precisely.
  • UI-Tars-1.5-7B, the all-rounder by ByteDance.
  • Holo1-7B, the latest model by H Company, showing extremely strong localization performance for its size.
  • GPT-4o

Our scores are broadly in agreement with the scores reported in various sources, with the caveat that we evaluate on vision only, which causes some differences (see the implementation details above).

💡 Note that ScreenSuite does not aim to exactly reproduce benchmarks published in the industry: we evaluate models on GUI agentic capabilities based on vision only. Consequently, on benchmarks like Mind2Web, where other setups give the agent information-rich context such as the DOM or accessibility tree, our evaluation setting is much harder, so ScreenSuite scores do not match other sources.



Start your custom evaluation in 30s ⚡️

Head to the repository.

  1. Clone the repository with submodules: git clone --recurse-submodules git@github.com:huggingface/screensuite.git
  2. Install the package: uv sync --extra submodules --python 3.11
  3. Run python run.py
    • Alternatively, run python examples/run_benchmarks.py for more fine-grained control, like running evaluations for several models in parallel.

The multi-step benchmarks require a bare-metal machine to run and deploy the desktop/mobile environment emulators (see README.md).



Next steps 🚀

Being able to run consistent and meaningful evaluations easily allows the community to iterate quickly and make progress in this field, as we’ve seen with the EleutherAI LM Evaluation Harness, the Open LLM Leaderboard and the Chatbot Arena.

We hope to see much more capable open models in the coming months that can run a wide range of tasks reliably, and even run locally!

To support this effort, check out and contribute to the repository!


