Earlier this year, we announced that we're bringing computer use capabilities to developers via the Gemini API. Today, we're releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro's visual understanding and reasoning capabilities that powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.
While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a critical next step in building powerful, general-purpose agents.
How it works
The model's core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions, or supply additional custom functions to include.
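To make the loop concrete, here is a minimal sketch in Python. It assumes the `google-genai` SDK; the exact `ComputerUse` tool configuration, the model name, and the environment helpers (`take_screenshot`, `execute_action`) are illustrative assumptions standing in for a real browser-automation backend, not a definitive implementation.

```python
# A minimal sketch of the computer-use agent loop, assuming the
# google-genai Python SDK. The tool configuration, model name, and the
# take_screenshot / execute_action helpers are illustrative placeholders
# for a real browser-automation backend (e.g., Playwright).
from google import genai
from google.genai import types

client = genai.Client()

# Assumed: enable the computer_use tool for a browser environment.
config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))]
)

def take_screenshot() -> bytes:
    """Placeholder: capture the current viewport as PNG bytes."""
    raise NotImplementedError

def execute_action(function_call) -> None:
    """Placeholder: translate a model-proposed UI action
    (click, type, scroll, ...) into a real browser command."""
    raise NotImplementedError

# Initial turn: the user request plus a screenshot of the environment.
contents = [
    types.Content(role="user", parts=[
        types.Part(text="Find the cheapest direct flight on this page."),
        types.Part.from_bytes(data=take_screenshot(),
                              mime_type="image/png"),
    ])
]

while True:
    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview",  # assumed model id
        contents=contents,
        config=config,
    )
    candidate = response.candidates[0]
    # Keep the model's proposed actions in context as the action history.
    contents.append(candidate.content)

    function_calls = [p.function_call for p in candidate.content.parts
                      if p.function_call]
    if not function_calls:
        break  # no further UI action proposed: the task is complete

    for fc in function_calls:
        execute_action(fc)
        # Report the outcome and a fresh screenshot so the model can
        # observe the effect of its action before choosing the next one.
        contents.append(types.Content(role="user", parts=[
            types.Part.from_function_response(
                name=fc.name,
                response={"status": "ok"},  # placeholder result payload
            ),
            types.Part.from_bytes(data=take_screenshot(),
                                  mime_type="image/png"),
        ]))
```

The key design point the sketch illustrates is that the client, not the model, executes each action: the model only proposes UI operations, and the loop feeds back a new screenshot after every step so the model can ground its next decision in the updated state.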
