MIT’s Harsh Review of Manus (AI Agent)

Good morning. It’s Wednesday, March 12th.

Today in tech history: On this day in 1989, Sir Tim Berners-Lee submitted his proposal for an information management system to CERN, laying the groundwork for what would become the World Wide Web.

OpenAI scales Agents, Storytelling, and more
Agibot’s GO-1 Robot
Google’s Gemma 3
Manus AI Review
5 Latest AI Tools
Latest AI Research Papers

You read. We listen. Tell us what you think by replying to this email.

Construct and run highly customized evals with Atla Selene 1

Good evals are critical to making sure AI apps perform as intended. This usually means relying less on default metrics like ‘hallucination’ and more on custom metrics fit to your needs, such as ‘detect responses that veer into medical advice’ or ‘flag statements that contradict company policy.’

Atla, an AI safety startup, introduces tools that help users run accurate, customized evals that measure what matters for specific evaluation needs.

Selene 1: An LLM Judge trained specifically for evals. Selene outperforms frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for evaluation tasks including scoring, classifying, and pairwise comparisons.

Alignment Platform: Generate, test, and refine custom eval metrics with just an outline of your task—minimal prompt engineering required. Deploy and have Selene run evals with a custom metric that’s aligned to your use case.

Evaluate your GenAI products with Selene and ship with confidence.

Selene 1 is accessible via API/SDK. The Alignment Platform is accessible to all users and comes with a tailored onboarding session with our team.

Thanks for supporting our sponsors!

Today’s trending AI news stories

OpenAI scales AI agents, storytelling, and compute—while models learn to play the system

OpenAI is expanding its AI ecosystem with new tools, models, and infrastructure plays. The Responses API and open-source Agents SDK give developers modular building blocks for AI applications, enabling autonomous workflows without complex orchestration. The Responses API merges chat reasoning with built-in tools for web search, file parsing, and computer control, while the Agents SDK enables cross-model coordination and real-time task execution, even integrating competitors’ models from Anthropic, Google, and Meta.

we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i’ve been really struck by something written by AI; it got the vibe of metafiction so right.

PROMPT:

Please write a metafictional literary short story… x.com/i/web/status/1…

– Sam Altman (@sama)
6:58 pm • Mar 11, 2025

Sam Altman also recently teased OpenAI’s latest creative writing model, calling its AI-generated metafictional short story the first to really resonate with him. However, its release remains uncertain amid ongoing legal disputes over training data.

Agibot’s GO-1 gives its humanoid robots brains, not just brawn

Agibot just handed humanoid robots a serious upgrade with Genie Operator-1 (GO-1), an AI model built to ditch preprogrammed rigidity in favor of real-world adaptability. Instead of following static instructions, GO-1 trains on massive datasets of images and videos, learning to interpret human actions and break tasks into step-by-step execution, with no micromanagement required.

Unlike legacy automation, GO-1 doesn’t just react; it reasons. Vision-language learning sharpens perception, while advanced planning algorithms let robots adapt on the fly. This isn’t about novelty—it’s about taking humanoid machines from scripted routines to autonomous decision-making. Read more.

Google announces Gemma 3 as ‘world’s best single-accelerator model’

Google rolls out Gemma 3, a streamlined AI model built for top-tier efficiency on single GPUs and TPUs. Offered in 1B, 4B, 12B, and 27B sizes, it outperforms Llama-405B, DeepSeek-V3, and o3-mini in LMArena, setting a new bar for single-accelerator performance.

Beyond text generation, Gemma 3 enhances multimodal reasoning across images, text, and short videos in models 4B and above. It supports structured function calling for AI-driven workflows and debuts official quantized versions, optimizing computational efficiency without sacrificing accuracy. A 128k-token context window enables deeper interactions, with over 35 languages natively supported. Google also introduces ShieldGemma 2, an automatic image safety classifier, alongside stringent governance measures. Gemma 3 is now available via Google AI Studio, with model downloads accessible on Kaggle and Hugging Face. Read more.
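For a rough sense of why quantized versions matter on a single accelerator, the sketch below estimates the memory needed just to hold the weights of each Gemma 3 size at a few common precisions. It is back-of-the-envelope arithmetic, not a figure from Google: the parameter counts are the nominal ones implied by the size names, the function name and the precisions chosen are assumptions for illustration, and the totals exclude activations, the KV cache, and runtime overhead, so real requirements will be higher.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Nominal parameter counts are read off the size names (1B, 4B, 12B, 27B);
# the actual counts differ slightly.
SIZES_BILLIONS = {"1B": 1, "4B": 4, "12B": 12, "27B": 27}

# Bytes per parameter for a few common storage formats (illustrative choice).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight storage in GB, using 1 GB = 1e9 bytes."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, billions in SIZES_BILLIONS.items():
        row = ", ".join(
            f"{p}: {weight_memory_gb(billions, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{name:>3} -> {row}")
```

Under these assumptions, the 27B model needs about 54 GB for 16-bit weights but about 13.5 GB at 4 bits.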

MIT: Manus AI shows flashes of brilliance, if it can stay online

MIT Technology Review tested Manus, a new general AI agent from China’s Butterfly Effect, designed to operate autonomously using multiple AI models like Claude 3.5 Sonnet and fine-tuned Qwen variants. Unlike traditional chatbots, Manus decomposes tasks, navigates the web, and executes actions through an interactive interface that permits user intervention.
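The decompose-then-execute pattern with room for user intervention can be sketched in a few lines. This is a toy illustration of the general idea, not Manus's actual implementation: `decompose`, `run_agent`, `execute`, and `intervene` are all hypothetical names, and the planner here is a hard-coded stand-in for what would normally be a language model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    done: bool = False
    result: Optional[str] = None


def decompose(task: str) -> List[Step]:
    # Hypothetical stand-in planner: a real agent would ask a model for a plan.
    verbs = ("search", "read sources", "summarize")
    return [Step(f"{verb} for: {task}") for verb in verbs]


def run_agent(
    task: str,
    execute: Callable[[str], str],
    intervene: Optional[Callable[[Step], Optional[str]]] = None,
) -> List[Step]:
    """Run each planned step, letting the user edit or skip it first."""
    steps = decompose(task)
    for step in steps:
        if intervene is not None:
            edited = intervene(step)
            if edited is None:  # the user chose to skip this step
                continue
            step.description = edited  # the user may have rewritten the step
        step.result = execute(step.description)
        step.done = True
    return steps


if __name__ == "__main__":
    # Toy executor that just echoes the step; the user skips the reading step.
    finished = run_agent(
        "humanoid robots",
        execute=lambda d: f"did '{d}'",
        intervene=lambda s: None if s.description.startswith("read") else s.description,
    )
    for s in finished:
        print(s.done, s.description)
```

The point of the `intervene` hook is that the plan stays visible and editable between steps, which is the property the review singles out.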

On paper, it’s a powerhouse—adapting on the fly, refining research, and structuring workflows with minimal input. In practice, it’s a mixed bag. The AI struggles with paywalls, slows under heavy loads, and crashes often enough to kill momentum. While it outperforms ChatGPT DeepResearch on select tasks, it takes longer to get there. Its transparency—letting users watch and tweak its logic in real time—sets it apart, but reliability issues hold it back.

At a fraction of DeepResearch’s cost, it could be a game-changer if it stops tripping over itself. Access remains scarce, with most waitlisted users still locked out. Read more.