Deploy your full-stack Desktop Agent



By Amir Mahla and Aymeric Roucher


TL;DR: ScreenEnv is a powerful Python library that lets you create isolated Ubuntu desktop environments in Docker containers for testing and deploying GUI agents (aka Computer Use agents). With built-in support for the Model Context Protocol (MCP), it’s never been easier to deploy desktop agents that can see, click, and interact with real applications.



What’s ScreenEnv?



Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that can interact with software. This used to require complex VM setups and brittle automation frameworks.

ScreenEnv changes this by providing a sandboxed desktop environment that runs in a Docker container. Think of it as a complete virtual desktop session that your code can fully control: not just clicking buttons and typing text, but managing the entire desktop experience, including launching applications, organizing windows, handling files, executing terminal commands, and recording the whole session.



Why ScreenEnv?

  • 🖥️ Full Desktop Control: Complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording
  • 🤖 Dual Integration Modes: Supports both the Model Context Protocol (MCP) for AI systems and a direct Sandbox API, adapting to any agent or backend logic
  • 🐳 Docker Native: No complex VM setup, just Docker. The environment is isolated, reproducible, and can be deployed anywhere in under 10 seconds. Supports both AMD64 and ARM64 architectures.



🎯 One-Line Setup

from screenenv import Sandbox
sandbox = Sandbox()  # spins up an isolated Ubuntu desktop in a Docker container



Two Integration Approaches

ScreenEnv provides two complementary ways to integrate with your agents and backend systems, giving you the flexibility to choose the approach that best fits your architecture:



Option 1: Direct Sandbox API

Perfect for custom agent frameworks, existing backends, or when you need fine-grained control:

from io import BytesIO

from PIL import Image
from screenenv import Sandbox

sandbox = Sandbox(headless=False)           # set headless=True on a server
sandbox.launch("xfce4-terminal")            # start an application
sandbox.write("echo 'Custom agent logic'")  # type into the focused window
screenshot_bytes = sandbox.screenshot()     # capture the screen as PNG bytes
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()



Option 2: MCP Server Integration

Ideal for AI systems that support the Model Context Protocol:

import asyncio
import base64
from io import BytesIO

from PIL import Image
from screenenv import MCPRemoteServer
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")


async def mcp_session():
    async with streamablehttp_client(server.server_url) as streams:
        async with ClientSession(*streams) as session:
            await session.initialize()
            print(await session.list_tools())

            # Call a tool and decode the base64-encoded screenshot it returns
            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

asyncio.run(mcp_session())
server.close()
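The screenshot tool above returns its image as a base64 string inside `response.content`. As a standalone sketch of just the decode step (the `decode_screenshot` helper below is illustrative, not part of screenenv), you can validate the payload before handing it to PIL:

```python
import base64

# Magic bytes that begin every valid PNG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(b64_data: str) -> bytes:
    """Decode a base64-encoded screenshot and sanity-check the PNG header."""
    raw = base64.b64decode(b64_data, validate=True)
    if not raw.startswith(PNG_MAGIC):
        raise ValueError("decoded payload does not look like a PNG image")
    return raw


# Round-trip demo with a synthetic payload
fake_png = PNG_MAGIC + b"\x00" * 16
assert decode_screenshot(base64.b64encode(fake_png).decode()) == fake_png
```

Catching a malformed payload here gives a clearer error than letting `Image.open` fail later.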

This dual approach means ScreenEnv adapts to your existing infrastructure rather than forcing you to change your agent architecture.



✨ Create a Desktop Agent with screenenv and smolagents

screenenv natively supports smolagents, making it easy to build your own custom Desktop Agent for automation. Here’s how to create your own AI-powered Desktop Agent in just a few steps:



1. Select Your Model

Pick the backend VLM you want to power your agent.

import os

# Pick ONE of the following backends.

# Option 1: OpenAI API
from smolagents import OpenAIServerModel
model = OpenAIServerModel(
    model_id="gpt-4.1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Option 2: Hugging Face Inference Providers
from smolagents import HfApiModel
model = HfApiModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    token=os.getenv("HF_TOKEN"),
    provider="nebius",
)

# Option 3: local model via transformers
from smolagents import TransformersModel
model = TransformersModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Option 4: LiteLLM (e.g. Anthropic)
from smolagents import LiteLLMModel
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")




2. Define Your Custom Desktop Agent

Inherit from DesktopAgentBase and implement the _setup_desktop_tools method to build your own action space!

from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel
from typing import List

class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""
        
        
        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            return f"Clicked at ({x}, {y})"
        
        self.tools["click"] = click
        

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys
            Args:
                key: The key to press (e.g. "enter", "space", "backspace", etc.) or a combination of keys, for instance "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press
        
        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser at the specified URL, or opens a file with its default application.
            Args:
                file_or_url: The URL or file to open
            """

            self.desktop.open(file_or_url)
            
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ... 
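The press tool above accepts either a single key ("enter") or a "+"-separated combination ("ctrl+shift+a"). As a hedged illustration (split_key_combo is a hypothetical helper, not part of screenenv), such a combo string could be normalized and validated before being forwarded to the desktop:

```python
def split_key_combo(combo: str) -> list[str]:
    """Split a key-combination string like 'Ctrl+Shift+A' into normalized parts."""
    parts = [part.strip().lower() for part in combo.split("+")]
    # Reject empty segments, e.g. "ctrl++a" or a trailing "+"
    if not all(parts):
        raise ValueError(f"malformed key combination: {combo!r}")
    return parts


assert split_key_combo("enter") == ["enter"]
assert split_key_combo("Ctrl+Shift+A") == ["ctrl", "shift", "a"]
```

Validating tool arguments like this up front tends to produce clearer error messages for the agent than a failure deep inside the sandbox.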



3. Run the Agent on a Desktop Task

from screenenv import Sandbox


sandbox = Sandbox(headless=False, resolution=(1920, 1080))


agent = CustomDesktopAgent(
    model=model,
    data_dir="data",
    desktop=sandbox,
)


task = "Open LibreOffice, write a report of approximately 300 words on the topic ‘AI Agent Workflow in 2025’, and save the document."

result = agent.run(task)
print(f"📄 Result: {result}")

sandbox.close()

If you encounter a Docker "access denied" error, you can try running your script with sudo -E, or add your user to the docker group.
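Adding yourself to the docker group is standard Docker setup rather than anything screenenv-specific; a typical sequence looks like this (system configuration, run at your own discretion):

```shell
# Add the current user to the docker group (requires sudo)
sudo usermod -aG docker "$USER"

# Apply the new group membership without logging out
newgrp docker

# Verify that Docker now works without sudo
docker run --rm hello-world
```

After this, the agent script should be able to start its sandbox container without elevated privileges.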

💡 For a comprehensive implementation, see this CustomDesktopAgent source on GitHub.



Get Started Today


# Install from PyPI
pip install screenenv

# Or try the demo agent from source
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent



What’s Next?

ScreenEnv aims to expand beyond Linux to support Android, macOS, and Windows, unlocking true cross-platform GUI automation. This will enable developers and researchers to build agents that generalize across environments with minimal setup.

These advancements pave the way for creating reproducible, sandboxed environments ideal for benchmarking and evaluation.

Repository: https://github.com/huggingface/screenenv


