TL;DR: This work shows how a lightweight vision–language model can acquire GUI-grounded skills and evolve into an agentic GUI coder. We release all training recipes, data-processing tools, the resulting model, a demo, and datasets to enable full reproducibility and foster further research 🫡. Find the collection here.
Introduction
Graphical User Interface (GUI) automation is one of the most difficult frontiers in computer vision. Developing models that can see and interact with user interfaces enables AI agents to navigate mobile, desktop, and web platforms, reshaping the future of digital interaction.
In this blog post, we present a comprehensive approach to training vision-language models for GUI automation through a multi-phase training strategy. We demonstrate how to transform a model with zero grounding capabilities into an agentic coder able to understand and interact with graphical interfaces.
Rather than aiming for a SOTA model, our goal is to demonstrate the entire process, from data processing to model training, and, in doing so, show how to unlock GUI-grounding capabilities in VLMs.
GUI capabilities combine understanding of the interface with precise element localization. These abilities enable the model to translate high-level tasks into low-level GUI actions such as clicking, typing, and so on.
Our approach leverages SmolVLM2-2.2B-Instruct as the baseline model, a small yet powerful vision-language model that originally has no grounding capabilities for GUI tasks. This makes it an ideal candidate to demonstrate the effectiveness of our training methodology. Through our two-phase training process, we first instill grounding capabilities in the model, then enhance it with agentic reasoning abilities using Supervised Fine-Tuning (SFT).
We evaluate our approach on an established perception benchmark, ScreenSpot-v2, which tests the model's ability to understand and locate elements within screenshots. Our process is inspired by the AGUVIS paper, and we leverage their carefully curated datasets to build upon their foundational work.
Evolution of ScreenSpot-v2 performance through the training phases of the base model SmolVLM2-2.2B-Instruct.
1. Data Transformation and Unified Action Space
This section explains how we convert heterogeneous GUI action formats from multiple datasets into a single unified format. By standardizing function names, signatures, and parameters, we create consistent, high-quality data that forms the foundation for effective model training.
The Challenge of Inconsistent Action Spaces
One of the primary challenges when working with multiple GUI automation datasets is the lack of standardization in action representations. Different datasets use varying function signatures, parameter naming conventions, and action taxonomies, making it difficult to train a unified model across diverse data sources.
Our Unified Approach
We took the open-source datasets (xlangai/aguvis-stage1, xlangai/aguvis-stage2) originally used by AGUVIS and implemented a comprehensive data transformation pipeline to create a unified action space. Our approach involved:
- Function Parsing and Normalization: We developed a function parser (see utils/function_parser.py) that can extract and parse function calls from the various formats used across the datasets. This parser supports any function signature format, handles complex parameter structures, and can reconstruct function calls with proper parameter ordering (a minimal parsing sketch follows this list).
- Action Space Unification: We implemented a comprehensive action conversion system (see preprocessing/action_conversion.py) that transforms all original action representations into a standardized function naming and argument structure. This process highlighted the numerous inconsistencies in function signatures across different datasets and allowed us to:
  - Remove undesired or redundant actions
  - Standardize parameter naming conventions
  - Create a cohesive action vocabulary
- (Bonus) Flexible Adaptation Framework: Our transformation pipeline includes utilities that allow users to:
  - Adapt the entire dataset to their own action space naming conventions using the utils/action_space_converter.py tool
  - Extract and analyze the existing action space structure
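To make the parsing step concrete, here is a minimal, self-contained sketch of what extracting a function call involves, using Python's ast module. It is an illustration only, not the actual utils/function_parser.py implementation:

```python
import ast

def parse_call(call_str: str):
    """Parse a call string such as "click(x=0.5, y=0.3)" into (name, args, kwargs)."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"Not a function call: {call_str}")
    name = ast.unparse(node.func)  # dotted names like pyautogui.click are kept as-is
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs

print(parse_call("pyautogui.click(x=0.8102, y=0.9463)"))
# ('pyautogui.click', [], {'x': 0.8102, 'y': 0.9463})
```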
Example Data Transformation
Here are real examples from our action conversion system (preprocessing/action_conversion.py), showing how we transform heterogeneous action representations into our unified format (grounding coordinates normalized to [0,1]):
Before (Original Action Dataset Formats):
mobile.home()
mobile.open_app(app_name='drupe')
mobile.swipe(from_coord=[0.581, 0.898], to_coord=[0.601, 0.518])
mobile.long_press(x=0.799, y=0.911)
mobile.terminate(status='success')
pyautogui.click(x=0.8102, y=0.9463)
pyautogui.doubleClick(x=0.8102, y=0.9463)
pyautogui.hotkey(keys=['ctrl', 'c'])
pyautogui.scroll(page=-0.1)
pyautogui.write(message='bread buns')
pyautogui.dragTo(from_coord=[0.87, 0.423], to_coord=[0.8102, 0.9463])
After (Unified Action Dataset Formats):
navigate_home()
open_app(app_name='drupe')
swipe(from_coord=[0.581, 0.898], to_coord=[0.601, 0.518])
long_press(x=0.799, y=0.911)
final_answer('success')
click(x=0.8102, y=0.9463)
double_click(x=0.8102, y=0.9463)
press(keys=['ctrl', 'c'])
scroll(direction='up', amount=10)
type(text='bread buns')
drag(from_coord=[0.87, 0.423], to_coord=[0.8102, 0.9463])
This unification process was essential for creating coherent training data that enables the model to learn consistent action patterns across diverse GUI environments.
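As a rough illustration of how such a mapping can be expressed, here is a simplified sketch that follows the conventions shown above; it is not the actual preprocessing/action_conversion.py code:

```python
def unify_action(name: str, kwargs: dict) -> tuple[str, dict]:
    """Map a parsed source action (name + kwargs) onto the unified vocabulary.

    Simplified sketch; the real converter in preprocessing/action_conversion.py
    covers many more actions and edge cases.
    """
    if name == "mobile.home":
        return "navigate_home", {}
    if name == "mobile.terminate":
        # Rendered as final_answer('success') in the unified datasets
        return "final_answer", {"answer": kwargs.get("status", "success")}
    if name == "pyautogui.hotkey":
        return "press", {"keys": kwargs["keys"]}
    if name == "pyautogui.write":
        return "type", {"text": kwargs["message"]}
    if name == "pyautogui.doubleClick":
        return "double_click", kwargs
    if name == "pyautogui.dragTo":
        return "drag", kwargs
    if name == "pyautogui.scroll":
        # Matches the example above: page=-0.1 -> direction='up', amount=10
        page = kwargs.get("page", 0.0)
        return "scroll", {"direction": "up" if page < 0 else "down",
                          "amount": round(abs(page) * 100)}
    # click, long_press, swipe, open_app, ... keep their arguments and simply
    # drop the mobile./pyautogui. prefix
    return name.split(".")[-1], kwargs
```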
💡 Why Normalized Coordinates?
Using raw pixel coordinates in a text-action datapoint (e.g., click(x=302, y=63)) ties it to a single image size. Vision Language Models (VLMs) often resize images, causing pixel coordinates to break and require adjustment. Normalized coordinates (relative to the image size) remain valid at any resolution and keep the dataset consistent.
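A minimal sketch of the coordinate arithmetic involved (the helper names here are hypothetical, not part of the released tools):

```python
def to_normalized(x_px: float, y_px: float, width: int, height: int) -> tuple[float, float]:
    """Convert pixel coordinates to resolution-independent [0, 1] coordinates."""
    return round(x_px / width, 4), round(y_px / height, 4)

def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized coordinates back to pixels for a given (resized) image."""
    return round(x * width), round(y * height)

# click(x=302, y=63) on a 1920x1080 screenshot ...
x, y = to_normalized(302, 63, 1920, 1080)   # (0.1573, 0.0583)
# ... stays valid after the image is resized to 1152x648
print(to_pixels(x, y, 1152, 648))           # (181, 38)
```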
(Bonus) Custom Action Space Adaptation with the Action Space Converter
To maximize flexibility for various use cases, we developed the Action Space Converter (utils/action_space_converter.py), a tool that lets users easily adapt from one action space to their own custom action vocabularies and naming conventions.
You can use this tool to transform one action signature (function names, parameter names, parameter value changes, …) into another:
Before
assistant_message: "Action: click(x=0.5, y=0.3)"
After
assistant_message: "Action: touch(x_coord=200, y_coord=300)"
Key Features
The Action Space Converter provides:
- Configurable Mappings: Define custom mappings between unified actions and your chosen action names
- Parameter Transformation: Rename parameters, apply value transformations, and set default values
- Flexible Architecture: Support for both simple parameter mappings and complex custom transformation functions
- Validation: Built-in validation to ensure mapping configurations are valid
Usage Example
from utils.action_space_converter import ActionSpaceConverter, ActionMapping, ParameterMapping
from utils.function_parser import parse_function_call

# Map the unified actions onto a custom vocabulary
mappings = [
    ActionMapping(
        source_function="click",
        target_function="touch",
        parameter_mappings=[
            ParameterMapping(source_name="x", target_name="x_coord"),
            ParameterMapping(source_name="y", target_name="y_coord"),
        ],
        description="Touch screen at coordinates",
    ),
    ActionMapping(
        source_function="type",
        target_function="write",
        parameter_mappings=[
            ParameterMapping(source_name="text", target_name="content"),
        ],
        description="Input text",
    ),
]

assistant_message = "I'll interact at those coordinates for you. click(x=0.5, y=0.3) Now I'll input the text. type(text='hello world')"

# Parse the function calls out of the message, convert them to the custom
# action space, and substitute the converted calls back into the text
parsed_function_calls = parse_function_call(assistant_message)
converter = ActionSpaceConverter(mappings)
converted_actions = converter.convert_actions(parsed_function_calls)
for new_function_call, old_function_call in zip(converted_actions, parsed_function_calls):
    assistant_message = assistant_message.replace(
        old_function_call.to_string(), new_function_call.to_string()
    )
print(assistant_message)
This tool enables researchers and practitioners to:
- Customize Training Data: Adapt the dataset to match their specific action vocabulary requirements
- Domain Adaptation: Transform actions for various platforms (mobile vs. desktop vs. web)
- Framework Integration: Easily align training data with existing automation frameworks
- Rapid Experimentation: Quickly test different action space configurations
- Release Preparation: Standardize action spaces for production deployment with consistent naming conventions
The Action Space Converter is especially valuable for preparing datasets for training, as it ensures consistent action vocabularies across different deployment environments while maintaining compatibility with existing automation frameworks.
Transformed and Released Datasets
Through this pipeline, we transform the open-source datasets xlangai/aguvis-stage1 and xlangai/aguvis-stage2 into our unified action space (see here). The output of this process is released as two new fully formatted datasets: smolagents/aguvis-stage-1 and smolagents/aguvis-stage-2.
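If you want to inspect the released data, it can be pulled from the Hub with the datasets library. The snippet below is a minimal sketch: the repositories may be organized into subsets/configs and splits, so adjust the arguments after checking the dataset cards:

```python
from datasets import load_dataset

# Minimal sketch: a config name (second argument) and/or a split may be
# required, depending on how the dataset repository is organized.
ds = load_dataset("smolagents/aguvis-stage-1")
print(ds)  # shows the available splits, number of rows, and column names
```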
2. Phase 1: From Zero to Perception
Training Data
Phase 1 leverages the smolagents/aguvis-stage-1 dataset, which introduces GUI grounding by pairing low-level instructions with diverse executable actions (expressed in code form). For instance, a user/assistant turn in smolagents/aguvis-stage-1 follows the structure:
{
"user": "click on more button",
"assistant": "click(x=0.8875, y=0.2281)",
}
Each sample links a screenshot with multi-turn user/assistant interactions, enabling the model to learn fine-grained action grounding across dialogue turns. During fine-tuning, the data collator masks everything except the assistant's answers when computing the loss.
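The core of this masking step looks roughly as follows; this is a minimal sketch of assistant-only loss masking, not the exact collator used in the recipe:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def mask_non_assistant_labels(input_ids: torch.Tensor,
                              assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Build labels that only supervise the assistant's tokens.

    `assistant_spans` holds (start, end) token indices of each assistant answer,
    e.g. the tokens of "click(x=0.8875, y=0.2281)" in the rendered chat template.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy example: a 10-token sequence where tokens 6..9 are the assistant's answer
input_ids = torch.arange(10)
print(mask_non_assistant_labels(input_ids, [(6, 10)]))
# tensor([-100, -100, -100, -100, -100, -100,    6,    7,    8,    9])
```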
Optimization Experiments
Before proceeding with full-scale Phase 1 training, we conducted comprehensive ablation studies to determine the optimal training configuration.
Image Resolution and Coordinate System Evaluation
We experimented with different image sizes and coordinate representation systems to identify the optimal configuration for SmolVLM2:
- Image Sizes Tested: 384px, 768px, 1152px
- Coordinate Systems: Pixel coordinates vs. normalized coordinates (0-1 range)
- Training Data: 400K samples from Aguvis datasets
Some SOTA GUI VLMs (e.g., Qwen-VL) also appear to use a different normalized range (0–1000), which was not tested in this experiment.
| Configuration (coords / image size) | Screenspot-v2 (%) |
|---|---|
| Normalized coordinates | |
| Base / – | 0.47 |
| Normalized / 384 | 31.28 |
| Normalized / 768 | 32.32 |
| Normalized / 1152 | 33.72 |
| Pixel coordinates | |
| Base / – | 0.55 |
| Pixel / 384 | 1.17 |
| Pixel / 768 | 2.67 |
| Pixel / 1152 | 4.32 |
Table 1: Ablation results on HuggingFaceTB/SmolVLM2-2.2B-Instruct (400k samples, aguvis-stage-1). Higher is better.
As shown in our benchmark results, the SmolVLM2-2.2B-Instruct base model initially achieved near-zero performance (below 1%) on perception benchmarks like ScreenSpot-v2. This near-total lack of grounding capability gave us a clean slate to evaluate the effectiveness of our training methodology.
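For context, ScreenSpot-style grounding is typically scored by checking whether the predicted click point falls inside the ground-truth bounding box of the target element. A minimal sketch of that accuracy computation (not the benchmark's official evaluation code) is shown below:

```python
def click_in_bbox(pred: tuple[float, float],
                  bbox: tuple[float, float, float, float]) -> bool:
    """True if the predicted (x, y) point lies inside the (x0, y0, x1, y1) box.

    Both the point and the box are assumed to use normalized [0, 1] coordinates.
    """
    x, y = pred
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_accuracy(predictions, bboxes) -> float:
    hits = sum(click_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return 100.0 * hits / len(bboxes)

# Toy example: one hit and one miss -> 50% accuracy
print(grounding_accuracy([(0.89, 0.23), (0.10, 0.90)],
                         [(0.85, 0.20, 0.93, 0.26), (0.40, 0.40, 0.60, 0.60)]))
```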
Key Findings
From our experiments, we determined that:
- Image Size: 1152px
- Coordinate System: Normalized coordinates (0-1 range) proved most effective for SmolVLM2
- Note: The optimal choice between pixel and normalized coordinates may vary depending on the base model's pre-training approach
Phase 1 Results
Using the optimal configuration (1152px resolution with normalized coordinates), we trained for two epochs on the smolagents/aguvis-stage-1 dataset. The results were remarkable: an improvement of roughly 41 points over the baseline on ScreenSpot-v2.
This dramatic improvement demonstrates that our Phase 1 training successfully instilled fundamental grounding capabilities in the model, enabling it to understand and locate visual elements within screenshots.
| Configuration (coords / image size) | Screenspot-v2 (%) |
|---|---|
| Normalized coordinates / 1152 | 41.27 |
Table 2: HuggingFaceTB/SmolVLM2-2.2B-Instruct after Phase 1 fine-tuning (2 epochs, aguvis-stage-1).
3. Phase 2: From Perception to Cognition
Whereas Phase 1 provided grounding capabilities, Phase 2 targets agentic reasoning: the ability to deliberate and plan before acting. This stage transforms the model from a reactive system that identifies GUI elements into a proactive agent capable of executing complex, multi-step interactions.
Training Data
Phase 2 uses the smolagents/aguvis-stage-2 dataset, which introduces agentic scenarios:
- Explicit reasoning about upcoming actions
- Context consistency across multiple interaction steps
- High-level instructions that require multi-step, low-level actions
For example, a smolagents/aguvis-stage-2 chat message looks like this:
{
"system": "You are a helpful GUI agent. ...",
"user": "Please generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: What information does the site provide about Judith Lauand's career, works and exhibitions?\n\nPrevious actions:\nNone",
"assistant": "\nClick on the link labeled 'Judith Lauand: Brazilian 1922-2022' to explore more about her career and exhibitions.\n\n\nclick(x=0.41, y=0.178)\n",
}
Each sample links a screenshot with a system/user/assistant turn. During fine-tuning, the data collator masks everything except the assistant's answers when computing the loss.
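At inference time, this format suggests a simple agent loop in which each generated action is appended to the "Previous actions" part of the next prompt. The sketch below is hypothetical: the generate_action and take_screenshot callables and the exact prompt wording are assumptions, not part of the released code:

```python
PROMPT_TEMPLATE = (
    "Please generate the next move according to the UI screenshot, "
    "instruction and previous actions.\n\n"
    "Instruction: {instruction}\n\n"
    "Previous actions:\n{previous}"
)

def run_episode(instruction: str, take_screenshot, generate_action, max_steps: int = 10):
    """Hypothetical agent loop: prompt -> action -> updated history -> next prompt."""
    history: list[str] = []
    for _ in range(max_steps):
        prompt = PROMPT_TEMPLATE.format(
            instruction=instruction,
            previous="\n".join(history) if history else "None",
        )
        # generate_action wraps the model call (screenshot + prompt -> action string)
        action = generate_action(take_screenshot(), prompt)
        history.append(action)
        if action.startswith("final_answer"):
            break
    return history
```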
Phase 2 Results
Starting from the Phase 1 checkpoint (1152px resolution, normalized coordinates), we fine-tuned the model for 2 epochs on smolagents/aguvis-stage-2. The accuracy on ScreenSpot-v2 increased from 41% to 61%, indicating that explicit reasoning improves GUI grounding performance.
| Configuration (coords / image size) | Screenspot-v2 (%) |
|---|---|
| Normalized coordinates / 1152 | 61.71 |
Table 3: HuggingFaceTB/SmolVLM2-2.2B-Instruct after Phase 2 fine-tuning (2 epochs, aguvis-stage-2), starting from the Phase 1 checkpoint.
4. All you need is Open Source
All training code, data-processing pipelines, datasets, and the model are open-source!
- Training Recipe (recipe.ipynb): Complete training pipeline for both Phase 1 and Phase 2, including dataset mixture configurations and training orchestration. We leverage the TRL library to train our models.
- Datasets (smolagents/aguvis-stage-1, smolagents/aguvis-stage-2): all datasets used are open-source.
- Model (smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI): the model produced by applying the training recipe described above. A minimal inference sketch follows this list.
- Preprocessing Tools:
  - Function Parser (utils/function_parser.py): Utilities for parsing, normalizing, and reconstructing function calls from diverse dataset formats. Supports complex parameter structures, positional arguments, and multiple function call extraction.
  - Action Conversion System (preprocessing/action_conversion.py): Core unification engine transforming mobile and PyAutoGUI desktop actions into a standardized API format. Features smart coordinate handling, direction detection for scroll actions, and comprehensive parameter normalization.
  - Action Space Converter (utils/action_space_converter.py): Flexible tool for adapting the unified action space to custom vocabularies and naming conventions. Enables domain-specific customization through configurable parameter mappings.
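As a quick start, the released model can be loaded like any other SmolVLM2 checkpoint with transformers. The snippet below is a minimal sketch; the instruction wording and image path are placeholders, and the exact prompt format the model expects should be checked against the model card:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

# Ask for a grounded action on a local screenshot
# ("path" points to a local file; use "url" for a remote image instead)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "screenshot.png"},
        {"type": "text", "text": "click on more button"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=64)
new_tokens = generated[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
# Expected output is an action string along the lines of: click(x=0.88, y=0.22)
```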
5. Conclusion
Our experiments reveal that high-quality, reasoning-oriented data can substantially improve GUI grounding, even for small VLMs, using only supervised fine-tuning (SFT). Beyond raw performance gains, these results show that GUI grounding capabilities are largely determined by the quality of the data. Carefully curated datasets teach models the structure and semantics of user interfaces, providing the grounding needed for accurate action prediction.
To support the development of GUI agents, we're open-sourcing everything: our complete pipeline, datasets, and trained model. You can reproduce our results, experiment with different models and architectures, or adapt our approach to new domains. The future of agentic AI relies on researchers like you pushing these boundaries further!
What’s Next?
While SFT excels at supervised tasks, emerging methods such as Reinforcement Learning (RL) and Direct Preference Optimization (DPO) help develop stronger reasoning capabilities and enable real-time adaptation. These advances point toward a new generation of GUI agents that learn and improve through interaction rather than relying solely on static datasets.
Let's build the future of GUI agents together 🤗