The latest family of GUI automation VLMs powering the GUI agent Surfer-H

By Mats L. Richter and Pierre-Louis Cedoz

Today, at H Company, we’re releasing Holo1, a family of Action Vision-Language Models (VLMs), and WebClick, a new multimodal localization benchmark, on the Hugging Face Hub.

Surfer-H, a web-native agent that interacts with browsers like a human, relies on Holo1.

Technical Report



Holo1

Holo1 is a family of open-source Action VLMs designed specifically for deep web UI understanding and precise localization. The family includes the Holo1-3B and Holo1-7B models, with the latter achieving 76.2% average accuracy on common UI localization benchmarks, the best among models of this size. H Company has released these models as open source on Hugging Face, together with the WebClick benchmark containing 1,639 human-like UI tasks.
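If you want to explore the benchmark itself, it can be pulled with the datasets library. This is a minimal sketch: we assume the benchmark is published under the Hub id Hcompany/WebClick, and leave the exact split and field names to the dataset card.

```python
from datasets import load_dataset

# Assumed Hub id; check the dataset card for the exact splits and fields.
webclick = load_dataset("Hcompany/WebClick")
print(webclick)
```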



Use with Transformers

Holo1 models are based on the Qwen2.5-VL architecture and are fully compatible with transformers. Here is a straightforward usage example.
You can load the model and the processor as follows.

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

# Load the model in bfloat16. flash_attention_2 requires the flash-attn
# package and a compatible GPU; drop that argument to use the default attention.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("Hcompany/Holo1-3B")

Next, load the image and build the prompt.

image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg"

guidelines = "Localize an element on the GUI image according to my instructions and output a click position as Click(x, y) with x the number of pixels from the left edge and y the number of pixels from the top edge."
instruction = "Select July 14th as the check-out date"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": image_url,
            },
            {"type": "text", "text": f"{guidelines}\n{instruction}"},
        ],
    }
]


inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

We can now run inference and decode the output.

generated_ids = model.generate(**inputs, max_new_tokens=128)

decoded = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded[0])  # the answer ends with a click position of the form Click(x, y)
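The model answers with a textual click position. As a small convenience, here is one way to pull the coordinates out of the decoded string with a regular expression; the `parse_click` helper is our own illustration, not part of the Holo1 API.

```python
import re

def parse_click(text: str) -> tuple[int, int] | None:
    """Extract the last Click(x, y) occurrence from the decoded output."""
    matches = re.findall(r"Click\((\d+),\s*(\d+)\)", text)
    if not matches:
        return None
    x, y = matches[-1]
    return int(x), int(y)

coords = parse_click(decoded[0])
print(coords)  # pixel coordinates measured from the top-left of the screenshot
```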




Surfer-H

Web automation is one of AI’s most practical applications for businesses, but until now, solutions have often sacrificed cost-efficiency for performance. With our Holo1 Action models available on Hugging Face, users can now build web automation solutions that achieve 92.2% accuracy on real-world web tasks at only $0.13 per task.

Surfer-H relies on the Holo1 family of open-weight models. It is a modular architecture for complete web task automation, performing reading, thinking, clicking, scrolling, typing, and validating. It is composed of three independent components: a Policy model that plans and drives the agent’s behavior, a Localizer model that understands visual UIs for precise interactions, and a Validator model that confirms whether tasks are accomplished successfully. Unlike other agents that depend on custom APIs or brittle wrappers, Surfer-H operates purely through the browser, just like a real user. A sketch of how these pieces fit together follows below.
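To make the three-component split concrete, here is a minimal sketch of the agent loop. The `propose_action`, `locate`, and `is_done` interfaces, the `Action` type, and the `browser` methods are all hypothetical stand-ins for illustration; Surfer-H’s actual implementation lives in its own codebase.

```python
from dataclasses import dataclass

# Hypothetical action type; each component would wrap a VLM call in practice.
@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll"
    target: str = ""   # natural-language description of the UI element
    text: str = ""     # text to type, if any

def surfer_h_step(policy, localizer, browser) -> None:
    """One iteration of the (sketched) agent loop."""
    screenshot = browser.screenshot()
    action = policy.propose_action(screenshot)              # Policy: plan the next step
    if action.kind == "click":
        x, y = localizer.locate(screenshot, action.target)  # Localizer: Holo1 grounds it
        browser.click(x, y)
    elif action.kind == "type":
        browser.type(action.text)
    elif action.kind == "scroll":
        browser.scroll()

def run(policy, localizer, validator, browser, task: str, max_steps: int = 30) -> bool:
    for _ in range(max_steps):
        surfer_h_step(policy, localizer, browser)
        # Validator: decide from the latest screenshot whether the task is complete.
        if validator.is_done(browser.screenshot(), task):
            return True
    return False
```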

Together, these releases represent a new frontier in web automation, achieving state-of-the-art localization performance and setting the Pareto frontier in cost-efficient web navigation on the WebVoyager benchmark.

We’re looking forward to seeing what you build with Holo1! Let’s meet in the discussion tab of this blog post and the model repository!



Citation

@misc{andreux2025surferhmeetsholo1costefficient,
      title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, 
      author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
      year={2025},
      eprint={2506.02865},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.02865}, 
}


