Like Anthropic’s Computer Use and Google DeepMind’s Mariner, Operator takes screenshots of a computer screen and scans the pixels to figure out what actions it can take. CUA, the model behind it, is trained to interact with the same graphical user interfaces (buttons, text boxes, menus) that people use when they do things online. It scans the screen, takes an action, scans the screen again, takes another action, and so on. That lets the model carry out tasks on most websites that a person can use.
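The scan-act-scan cycle described above can be pictured as a simple loop. The Python sketch below is only an illustration under stated assumptions: `Action`, `FakePage`, `toy_policy`, and `run_agent` are hypothetical names, the "screenshot" is a small dictionary rather than real pixels, and the hand-written policy stands in for a trained model. It is not OpenAI's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "type", "click", or "done" (hypothetical action set)
    target: str = ""
    text: str = ""


class FakePage:
    """Stand-in for a browser page with one search box and one button."""

    def __init__(self) -> None:
        self.box_text = ""
        self.submitted = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch runnable.
        return {"box_text": self.box_text, "submitted": self.submitted}

    def perform(self, action: Action) -> None:
        if action.kind == "type":
            self.box_text = action.text
        elif action.kind == "click" and action.target == "search_button":
            self.submitted = True


def toy_policy(screen: dict, history: List[Action]) -> Action:
    """Hand-written rules standing in for the model's decision."""
    if not screen["box_text"]:
        return Action("type", target="search_box", text="cheap flights")
    if not screen["submitted"]:
        return Action("click", target="search_button")
    return Action("done")


def run_agent(
    take_screenshot: Callable[[], dict],
    choose_action: Callable[[dict, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Look at the screen, act, look again, until done or out of steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()
        action = choose_action(screen, history)
        if action.kind == "done":
            break
        perform(action)
        history.append(action)
    return history


if __name__ == "__main__":
    page = FakePage()
    for step in run_agent(page.screenshot, toy_policy, page.perform):
        print(step)
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running forever if the policy never decides it is done.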
“Traditionally the way models have used software is through specialized APIs,” says Reiichiro Nakano, a scientist at OpenAI. (An API, or application programming interface, is a piece of code that acts as a kind of connector, allowing different bits of software to be hooked up to one another.) That puts a lot of apps and most websites off limits, he says: “But if you create a model that can use the same interface that humans use every day, it opens up a whole new range of software that was previously inaccessible.”
CUA also breaks tasks down into smaller steps and tries to work through them one by one, backtracking when it gets stuck. OpenAI says CUA was trained with techniques similar to those used for its so-called reasoning models, o1 and o3.
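To give a feel for what "backtracking when it gets stuck" means, here is a minimal Python sketch of classic depth-first backtracking over a list of steps. It is a textbook technique used purely as an analogy: CUA's behavior is learned, and OpenAI has not said it works this way. The function `solve`, the step names, and the `still_possible` check are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def solve(
    steps: Sequence[Sequence[str]],
    still_possible: Callable[[List[str]], bool],
    chosen: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Pick one option per step; undo a choice when it leads to a dead end."""
    chosen = [] if chosen is None else chosen
    if len(chosen) == len(steps):
        return list(chosen)              # every step has a working option
    for option in steps[len(chosen)]:
        chosen.append(option)            # try this option
        if still_possible(chosen):
            result = solve(steps, still_possible, chosen)
            if result is not None:
                return result
        chosen.pop()                     # stuck: backtrack and try the next one
    return None


# Hypothetical task: book a table, where site A turns out to have no booking page.
plan = [
    ["open site A", "open site B"],
    ["search for restaurant"],
    ["book table"],
]


def still_possible(path: List[str]) -> bool:
    return not ("open site A" in path and "book table" in path)


if __name__ == "__main__":
    print(solve(plan, still_possible))
```

In this toy run the search first commits to site A, hits the dead end at the booking step, undoes its choices, and succeeds with site B.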
OpenAI has tested CUA against a number of industry benchmarks designed to assess the ability of an agent to carry out tasks on a computer. The company claims that its model beats Computer Use and Mariner in all of them.
For example, on OSWorld, which tests how well an agent performs tasks such as merging PDF files or manipulating an image, CUA scores 38.1% to Computer Use’s 22.0%. In comparison, humans score 72.4%. On a benchmark called WebVoyager, which tests how well an agent performs tasks in a browser, CUA scores 87%, Mariner 83.5%, and Computer Use 56%. (Mariner can only carry out tasks in a browser and therefore does not score on OSWorld.)
For now, Operator can also only carry out tasks in a browser. OpenAI plans to make CUA’s wider abilities available in the future via an API that other developers can use to build their own apps. This is how Anthropic released Computer Use in December.
OpenAI says it has tested CUA’s safety, using red teams to explore what happens when users ask it to do unacceptable tasks (such as researching how to make a bioweapon), when websites contain hidden instructions designed to derail it, and when the model itself breaks down. “We’ve trained the model to stop and ask the user for information before doing anything with external negative effects,” says Casey Chu, another researcher on the team.
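The stop-and-ask behavior Chu describes can be pictured as a gate in front of certain actions. The Python sketch below is an assumption-laden illustration, not a description of how Operator is built: in the real system the behavior is trained into the model, whereas here a fixed, hypothetical list (`EXTERNAL_EFFECT_ACTIONS`) decides which actions need the user's approval, and `execute_with_confirmation` is an invented helper.

```python
from typing import Callable, List

# Hypothetical examples of actions that change something outside the browser session.
EXTERNAL_EFFECT_ACTIONS = {"submit_payment", "send_email", "place_order"}


def execute_with_confirmation(
    action: str,
    perform: Callable[[str], None],
    ask_user: Callable[[str], bool],
) -> bool:
    """Run the action, but pause for approval first if it has external effects."""
    if action in EXTERNAL_EFFECT_ACTIONS and not ask_user(f"Allow '{action}'?"):
        return False                     # user declined, so nothing is performed
    perform(action)
    return True


if __name__ == "__main__":
    done: List[str] = []
    always_no = lambda question: False   # simulated user who declines everything
    print(execute_with_confirmation("scroll_down", done.append, always_no))
    print(execute_with_confirmation("send_email", done.append, always_no))
    print(done)
```

Harmless actions such as scrolling pass straight through, while anything on the list waits for a yes from the user.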
Look! No hands
To use Operator, you simply type instructions into a text box. But instead of calling up the browser on your computer, Operator sends your instructions to a remote browser running on an OpenAI server. OpenAI claims that this makes the system more efficient. It’s another key difference between Operator, Computer Use and Mariner (which runs inside Google’s Chrome browser on your own computer).