TL;DR: ScreenEnv is a powerful Python library that lets you create isolated Ubuntu desktop environments in Docker containers for testing and deploying GUI agents (aka Computer Use agents). With built-in support for the Model Context Protocol (MCP), it’s never been easier to deploy desktop agents that can see, click, and interact with real applications.
What’s ScreenEnv?
Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that can interact with software. This used to require complex VM setups and brittle automation frameworks.
ScreenEnv changes this by providing a sandboxed desktop environment that runs in a Docker container. Think of it as a complete virtual desktop session that your code can fully control – not just clicking buttons and typing text, but managing the entire desktop experience: launching applications, organizing windows, handling files, executing terminal commands, and recording the whole session.
Why ScreenEnv?
- 🖥️ Full Desktop Control: Complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording
- 🤖 Dual Integration Modes: Supports both the Model Context Protocol (MCP) for AI systems and a direct Sandbox API – adapting to any agent or backend logic
- 🐳 Docker Native: No complex VM setup – just Docker. The environment is isolated, reproducible, and easily deployed anywhere in under 10 seconds. Supports AMD64 and ARM64 architectures.
🎯 One-Line Setup
```python
from screenenv import Sandbox

sandbox = Sandbox()
```
Two Integration Approaches
ScreenEnv provides two complementary ways to integrate with your agents and backend systems, giving you the flexibility to choose the approach that best fits your architecture:
Option 1: Direct Sandbox API
Perfect for custom agent frameworks, existing backends, or whenever you need fine-grained control:
```python
from io import BytesIO

from PIL import Image

from screenenv import Sandbox

sandbox = Sandbox(headless=False)
sandbox.launch("xfce4-terminal")
sandbox.write("echo 'Custom agent logic'")
screenshot_bytes = sandbox.screenshot()
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()
```
Option 2: MCP Server Integration
Ideal for AI systems that support the Model Context Protocol:
```python
import asyncio
import base64
from io import BytesIO

from PIL import Image

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

from screenenv import MCPRemoteServer

server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")

async def mcp_session():
    async with streamablehttp_client(server.server_url) as streams:
        async with ClientSession(*streams) as session:
            await session.initialize()
            print(await session.list_tools())
            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

asyncio.run(mcp_session())
server.close()
```
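The screenshot round-trip above relies on the MCP convention of returning image data as base64 text. Here is a minimal sketch of just that decode step, using a synthetic payload instead of a live server (the `payload` variable stands in for `response.content[0].data`):

```python
import base64
from io import BytesIO

# Simulate what an MCP "screenshot" tool result carries: base64-encoded image bytes.
fake_png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # placeholder payload, not a full image
payload = base64.b64encode(fake_png_bytes).decode("ascii")

# Decoding mirrors the client code above: text -> raw bytes -> file-like object.
image_bytes = base64.b64decode(payload)
buffer = BytesIO(image_bytes)
print(buffer.read(8) == b"\x89PNG\r\n\x1a\n")  # the PNG magic number survives the round trip
```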
This dual approach means ScreenEnv adapts to your existing infrastructure rather than forcing you to change your agent architecture.
✨ Create a Desktop Agent with screenenv and smolagents
screenenv natively supports smolagents, making it easy to build your own custom Desktop Agent for automation. Here’s how to create your own AI-powered Desktop Agent in just a few steps:
1. Select Your Model
Pick the backend VLM you want to power your agent.
```python
import os

# OpenAI API
from smolagents import OpenAIServerModel

model = OpenAIServerModel(
    model_id="gpt-4.1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Hugging Face Inference Providers
from smolagents import HfApiModel

model = HfApiModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    token=os.getenv("HF_TOKEN"),
    provider="nebius",
)

# Local transformers model
from smolagents import TransformersModel

model = TransformersModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# LiteLLM (e.g. Anthropic)
from smolagents import LiteLLMModel

model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")
```
2. Define Your Custom Desktop Agent
Inherit from DesktopAgentBase and implement the _setup_desktop_tools method to build your own action space!
```python
from typing import List

from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel

class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""

        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            return f"Clicked at ({x}, {y})"

        self.tools["click"] = click

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys.
            Args:
                key: The key to press (e.g. "enter", "space", "backspace", etc.) or a multi-key string, for example "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press

        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser with the specified URL, or opens a file with the default application.
            Args:
                file_or_url: The URL or file to open
            """
            self.desktop.open(file_or_url)
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ...
```
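The pattern in `_setup_desktop_tools` is always the same: define a function that wraps a sandbox call, then register it under a name in `self.tools`. Stripped of smolagents and the sandbox, the registry idea can be sketched in plain Python (the `FakeDesktop` and `MiniAgent` classes below are illustrative stand-ins, not screenenv APIs):

```python
from typing import Callable

class FakeDesktop:
    """Stand-in for screenenv's Sandbox, recording actions instead of performing them."""
    def __init__(self) -> None:
        self.actions: list[str] = []

    def left_click(self, x: int, y: int) -> None:
        self.actions.append(f"click({x},{y})")

    def write(self, text: str) -> None:
        self.actions.append(f"write({text})")

class MiniAgent:
    """Registers named tool callables, mirroring how the agent fills self.tools."""
    def __init__(self, desktop: FakeDesktop) -> None:
        self.desktop = desktop
        self.tools: dict[str, Callable[..., str]] = {}
        self._setup_desktop_tools()

    def _setup_desktop_tools(self) -> None:
        def click(x: int, y: int) -> str:
            self.desktop.left_click(x, y)
            return f"Clicked at ({x}, {y})"
        self.tools["click"] = click

        def write(text: str) -> str:
            self.desktop.write(text)
            return f"Typed text: '{text}'"
        self.tools["write"] = write

agent = MiniAgent(FakeDesktop())
print(agent.tools["click"](10, 20))  # Clicked at (10, 20)
print(agent.desktop.actions)         # ['click(10,20)']
```

Because every tool lives in one dictionary, the agent loop can dispatch any model-chosen action by name without knowing the tools in advance.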
3. Run the Agent on a Desktop Task
```python
from screenenv import Sandbox

sandbox = Sandbox(headless=False, resolution=(1920, 1080))

agent = CustomDesktopAgent(
    model=model,
    data_dir="data",
    desktop=sandbox,
)

task = "Open LibreOffice, write a report of approximately 300 words on the topic 'AI Agent Workflow in 2025', and save the document."
result = agent.run(task)
print(f"📄 Result: {result}")

sandbox.close()
```
If you encounter a Docker "access denied" error, you can try running the agent with `sudo -E python -m test.py`, or add your user to the `docker` group.
💡 For a comprehensive implementation, see this CustomDesktopAgent source on GitHub.
Get Started Today
```shell
pip install screenenv

# Try the example agent
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent
```
What’s Next?
ScreenEnv aims to expand beyond Linux to support Android, macOS, and Windows, unlocking true cross-platform GUI automation. This will enable developers and researchers to build agents that generalize across environments with minimal setup.
These advances pave the way for creating reproducible, sandboxed environments ideal for benchmarking and evaluation.
Repository: https://github.com/huggingface/screenenv
