
A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, at a fraction of the cost.
OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.
That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.
"Traditional LLM training feeds a considerable amount of text corpus into the model. The model learns to supply text," Qin said in an exclusive interview with VentureBeat. "In contrast, our model learns to supply actions. The model is trained with a considerable amount of computer screenshots and motion sequences, allowing it to supply actions to regulate the pc."
The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.
Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.
Why university researchers built a tougher benchmark to test AI agents, and what they found
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.
Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites, everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.
The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many new systems, despite heavy investment and marketing fanfare, failed to outperform SeeAct, a comparatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.
"It seemed that highly capable and practical agents were possibly indeed just months away," the researchers wrote in a blog post accompanying their paper. "Nonetheless, we’re also well aware that there are still many fundamental gaps in research to completely autonomous agents, and current agents are probably not as competent because the reported benchmark numbers may depict."
The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and corporations.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.
Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.
Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.
"The motion allows the model to actively explore the pc environment, and such exploration generates recent knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This can be a naturally self-evolving process, where a greater model produces higher exploration, higher exploration produces higher knowledge, and higher knowledge results in a greater model."
This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continually improve by generating its own training data through exploration.
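In schematic terms, such a loop alternates between exploring with the current model and folding the resulting trajectories back into training. The sketch below is purely illustrative: OpenAGI has not published its training code, and every name and data structure here is invented for the example.

```python
# Illustrative sketch of a self-reinforcing explore-and-train loop.
# All names are hypothetical; this does not reflect OpenAGI's actual code.
import random

def explore(model, task):
    """The model attempts a task, recording (screen state, action) pairs."""
    trajectory = []
    state = f"initial screen for {task}"
    for _ in range(3):  # a few steps per task
        action = model["policy"](state)
        trajectory.append((state, action))
        state = f"{state} -> {action}"  # the action changes the screen
    return trajectory

def train(model, trajectory):
    """Fold explored trajectories back into the model (stubbed out here)."""
    model["experience"].extend(trajectory)
    return model

# A toy "model": an action policy plus accumulated experience.
model = {"policy": lambda s: random.choice(["click", "type", "scroll"]),
         "experience": []}

for generation in range(2):
    for task in ["book a flight", "fill out a form"]:
        model = train(model, explore(model, task))

print(len(model["experience"]))  # experience grows with each generation
```

The point of the sketch is the feedback cycle: each generation's exploration produces data that the next round of training consumes, so no fixed external dataset is needed.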
OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications: spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that could substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.
The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to "copy my bank details and paste them into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
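Conceptually, this is a policy gate that inspects a request before any action executes. The toy version below uses a simple keyword filter; the term list and function names are invented for illustration and bear no relation to Lux's real safeguard, which would need far more robust semantics.

```python
# Minimal illustration of a pre-execution safety gate for a computer-use agent.
# The keyword filter is a toy stand-in, not OpenAGI's implementation.
SENSITIVE_TERMS = {"bank details", "password", "social security"}

def safety_check(request: str):
    """Return (allowed, message) before the agent executes any action."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return False, f"Refused: request involves sensitive data ({term})."
    return True, "OK"

allowed, message = safety_check(
    "copy my bank details and paste them into a new Google doc")
print(allowed, message)  # prints: False Refused: request involves sensitive data (bank details).
```

The design point is that the check runs before execution, so a refused request never reaches the click-and-type layer; a benign request like "open my calendar" passes through untouched.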
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, in which malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
The MIT researcher who built two of GitHub's most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.
His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, collected roughly 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had about a billion interactions with agents on the platform, according to the company.
Inside the billion-dollar race to build AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available starting today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an eight-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the biggest checkbooks but through the cleverest architectures, that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen that story before. It rarely stays true for long.
