The Multi-Agent Trap


Klarna’s AI assistant has handled 2.3 million customer conversations in a single month. That’s the workload of 700 full-time human agents. Resolution time dropped from 11 minutes to under 2. Repeat inquiries fell 25%. Customer satisfaction scores climbed 47%. Cost per service transaction: $0.32 all the way down to $0.19. Total savings through late 2025: roughly $60 million.

The system runs on a multi-agent architecture built with LangGraph.

Here’s the opposite side. Gartner predicted that over 40% of agentic AI projects will be canceled by the end of 2027. Not scaled back. Not paused. Canceled. Escalating costs, unclear business value, and inadequate risk controls.

Same technology. Same year. Wildly different outcomes.

If you’re building a multi-agent system (or deciding whether you should), the gap between these two stories contains everything you need to know. This playbook covers three architecture patterns that work in production, the five failure modes that kill projects, and a framework comparison to help you choose the right tool. You’ll walk away with a pattern selection guide and a pre-deployment checklist you can use on Monday morning.


Why More AI Agents Often Makes Things Worse

The intuition feels solid. Split complex tasks across specialized agents, let each handle what it’s best at. Divide and conquer.

In December 2025, a Google DeepMind team led by Yubin Kim tested this assumption rigorously. They ran 180 configurations across 5 agent architectures and three Large Language Model (LLM) families. The finding should be taped above every AI team’s monitor:

Unstructured multi-agent networks amplify errors by up to 17.2 times compared to single-agent baselines.

Not 17% worse. Seventeen times worse.

When agents are thrown together without structured topology (what the paper calls a “bag of agents”), each agent’s output becomes the next agent’s input. Errors don’t cancel. They cascade.

Picture a pipeline where Agent 1 extracts customer intent from a support ticket. It misreads “billing dispute” as “billing inquiry” (subtle, right?). Agent 2 pulls the wrong response template. Agent 3 generates a reply that addresses the wrong problem entirely. Agent 4 sends it. The customer responds, angrier now. The system processes the angry reply through the same broken chain. Each loop amplifies the original misinterpretation. That’s the 17x effect in practice: not a catastrophic failure, but a quiet compounding of small errors that produces confident nonsense.

The same study found a saturation threshold: coordination gains plateau beyond 4 agents. Below that number, adding agents to a structured system helps. Above it, coordination overhead consumes the benefits.

This isn’t an isolated finding. The Multi-Agent Systems Failure Taxonomy (MAST) study, published in March 2025, analyzed 1,642 execution traces across 7 open-source frameworks. Failure rates ranged from 41% to 86.7%. The biggest failure category: coordination breakdowns at 36.9% of all failures.

The obvious counter-argument: these failure rates reflect immature tooling, not a fundamental architecture problem. As models improve, the compound reliability issue shrinks. There’s truth in this. Between January 2025 and January 2026, single-agent task completion rates improved significantly (Carnegie Mellon benchmarks showed the best agents reaching 24% on complex office tasks, up from near-zero). But even at 99% per-step reliability, the compound math still applies. Better models shift the curve. They don’t eliminate the compound effect. Architecture still determines whether you land in the 60% or the 40%.


The Compound Reliability Problem

Here’s the arithmetic that most architecture documents skip.

A single agent completes a step with 99% reliability. Sounds great. Chain 10 sequential steps: 0.99^10 ≈ 90.4% overall reliability.

Drop to 95% per step (still strong for many AI tasks). Ten steps: 0.95^10 ≈ 59.9%. Twenty steps: 0.95^20 ≈ 35.8%.

Compound reliability decay: agents that succeed individually produce systems that fail collectively. Image by the author.

You started with agents that succeed 19 out of 20 times. You ended with a system that fails nearly two-thirds of the time.
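The decay is one line of arithmetic. Here is a quick sanity check you can run against your own chain before shipping (a minimal sketch using the per-step rates from the text):

```python
# Compound reliability across a sequential agent chain.
# Each step multiplies in its own failure probability, so chains
# decay geometrically even when every individual step looks strong.

def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success rate of `steps` sequential agent calls."""
    return per_step ** steps

print(f"{chain_reliability(0.99, 10):.1%}")  # 90.4%
print(f"{chain_reliability(0.95, 10):.1%}")  # 59.9%
print(f"{chain_reliability(0.95, 20):.1%}")  # 35.8%
```

Run this with your measured per-step success rate and your real chain length; if the result lands below 80%, shorten the chain or add verification checkpoints.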

Token costs compound too. A document evaluation workflow consuming 10,000 tokens with a single agent requires 35,000 tokens across a 4-agent implementation. That’s a 3.5x cost multiplier before you account for retries, error handling, and coordination messages.

This is why Klarna’s architecture works and most copies of it don’t. The difference isn’t agent count. It’s topology.


Three Multi-Agent Patterns That Work in Production

Flip the question. Instead of asking “how many agents do I need?”, ask: “how would I guarantee failure at multi-agent AI?” The research answers clearly. By chaining agents without structure. By ignoring coordination overhead. By treating every problem as a multi-agent problem when a single well-prompted agent would suffice.

Three patterns avoid these failure modes. Each serves a different task shape.

Plan-and-Execute

A capable model creates the whole plan. Cheaper, faster models execute each step. The planner handles reasoning; the executors handle doing.

This is close to what Klarna runs. A frontier model analyzes the customer’s intent and maps resolution steps. Smaller models execute each step: pulling account data, processing refunds, generating responses. The planning model touches the task once. Execution models handle the volume.

The cost impact: routing planning to one capable model and execution to cheaper models cuts costs by up to 90% compared to using frontier models for everything.

When it works: Tasks with clear goals that decompose into sequential steps. Document processing, customer support workflows, research pipelines.

When it breaks: Environments that change mid-execution. If the original plan becomes invalid halfway through, you need re-planning checkpoints or a different pattern entirely. This is a one-way door if your task environment is volatile.
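The control flow is simple enough to sketch in plain Python. The `planner` and `executor` callables below stand in for LLM calls (one capable model, many cheap calls); the names and toy stand-ins are illustrative, not any framework’s actual API:

```python
from typing import Callable

def plan_and_execute(task: str,
                     planner: Callable[[str], list],
                     executor: Callable[[str, str], str]) -> list:
    """Plan once with a capable model, then execute each step cheaply."""
    steps = planner(task)            # one call to the capable model
    results, context = [], ""
    for step in steps:               # many calls to the cheap model
        output = executor(step, context)
        context += f"\n{step}: {output}"   # each step sees prior results
        results.append(output)
    return results

# Toy stand-ins showing only the control flow:
fake_planner = lambda task: ["look up account", "compute refund", "draft reply"]
fake_executor = lambda step, ctx: f"done: {step}"
print(plan_and_execute("refund request", fake_planner, fake_executor))
```

The structural point: the expensive model is called exactly once per task, regardless of how many execution steps the plan contains.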

Supervisor-Worker

A supervisor agent manages routing and decisions. Worker agents handle specialized subtasks. The supervisor breaks down requests, delegates, monitors progress, and consolidates outputs.

Google DeepMind’s research validates this directly. A centralized control plane suppresses the 17x error amplification that “bag of agents” networks produce. The supervisor acts as a single coordination point, preventing the failure mode where (for instance) a support agent approves a refund while a compliance agent simultaneously blocks it.

When it works: Heterogeneous tasks requiring different specializations. Customer support with escalation paths, content pipelines with review stages, financial analysis combining multiple data sources.

When it breaks: When the supervisor becomes a bottleneck. If every decision routes through one agent, you’ve recreated the monolith you were trying to escape. The fix: give workers bounded autonomy on decisions within their domain, escalate only edge cases.
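The bounded-autonomy fix can be sketched in a few lines. The worker functions below are hypothetical stand-ins for specialist agents; the key property is that workers decide within their domain and only edge cases route back through the single coordination point:

```python
# Supervisor-Worker sketch: one routing point, bounded worker autonomy.
# `workers` maps a task category to a handler; names are illustrative.

def supervisor(ticket: dict, workers: dict) -> str:
    worker = workers.get(ticket.get("category"))
    if worker is None:
        return "escalate:human"          # unknown work never free-routes
    result = worker(ticket)
    if result.get("needs_escalation"):   # workers act within their domain;
        return "escalate:human"          # edge cases return to the supervisor
    return result["reply"]

workers = {
    "billing": lambda t: {"reply": "refund issued", "needs_escalation": False},
    "compliance": lambda t: {"reply": "", "needs_escalation": True},
}
print(supervisor({"category": "billing"}, workers))     # refund issued
print(supervisor({"category": "compliance"}, workers))  # escalate:human
```

Because every routing decision passes through `supervisor`, contradictory simultaneous actions (a refund approved and blocked at once) cannot happen by construction.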

Swarm (Decentralized Handoffs)

No supervisor. Agents hand off to one another based on context. Agent A handles intake, determines it’s a billing issue, and passes to Agent B (billing specialist). Agent B resolves it or passes to Agent C (escalation) if needed.

OpenAI’s original Swarm framework was educational only (they said so explicitly in the README). Their production-ready Agents Software Development Kit (SDK), released in March 2025, implements this pattern with guardrails: each agent declares its handoff targets, and the framework enforces that handoffs follow declared paths.

When it works: High-volume, well-defined workflows where routing logic is embedded in the task itself. Chat-based customer support, multi-step onboarding, triage systems.

When it breaks: Complex handoff graphs. Without a supervisor, debugging “why did the user end up at Agent F instead of Agent D?” requires production-grade observability tools. If you don’t have distributed tracing, don’t use this pattern.
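The declared-handoff guardrail is worth seeing concretely. This sketch mirrors the idea (each agent may only hand off to targets it declared up front) without reproducing the SDK’s actual API; the agent names are illustrative:

```python
# Swarm sketch with enforced handoff declarations. An undeclared
# handoff fails loudly instead of silently routing the user somewhere
# nobody can explain later.

HANDOFFS = {
    "intake": {"billing", "escalation"},
    "billing": {"escalation"},
    "escalation": set(),            # terminal: no further handoffs
}

def hand_off(current: str, target: str) -> str:
    if target not in HANDOFFS.get(current, set()):
        raise ValueError(f"{current} may not hand off to {target}")
    return target

agent = "intake"
agent = hand_off(agent, "billing")   # allowed: intake declared billing
# hand_off("billing", "intake")      # would raise: undeclared path
print(agent)
```

Logging every `hand_off` call also gives you the trace you need to answer “why did the user end up here?” after the fact.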

Pattern selection decision tree. When in doubt, start simple and graduate up. Image by the author.

Which Multi-Agent Framework to Use

Three frameworks dominate production multi-agent deployments right now. Each reflects a different philosophy about how agents should be organized.

LangGraph uses graph-based state machines. 34.5 million monthly downloads. Typed state schemas enable precise checkpointing and inspection. This is what Klarna runs in production. Best for stateful workflows where you need human-in-the-loop intervention, branching logic, and durable execution. The trade-off: steeper learning curve than alternatives.

CrewAI organizes agents as role-based teams. 44,300 GitHub stars and growing. Lowest barrier to entry: define agent roles, assign tasks, and the framework handles coordination. Deploys teams roughly 40% faster than LangGraph for simple use cases. The trade-off: limited support for cycles and complex state management.

OpenAI Agents SDK provides lightweight primitives (Agents, Handoffs, Guardrails). The only major framework with equal Python and TypeScript/JavaScript support. Clean abstraction for the Swarm pattern. The trade-off: tighter coupling to OpenAI’s models.

Downloads don’t tell the whole story (CrewAI has more GitHub stars), but they’re the best proxy for production adoption. Image by the author.

One protocol worth knowing: Model Context Protocol (MCP) has become the de facto interoperability standard for agent tooling. Anthropic donated it to the Linux Foundation in December 2025 (co-founded by Anthropic, Block, and OpenAI under the Agentic AI Foundation). Over 10,000 active public MCP servers exist. All three frameworks above support it. If you’re evaluating tools, MCP compatibility is table stakes.

A starting point: If you’re unsure, start with Plan-and-Execute on LangGraph. It’s the most battle-tested combination. It handles the widest range of use cases. And switching patterns later is a reversible decision (a two-way door, in decision theory terms). Don’t over-architect on day one.


Five Ways Multi-Agent Systems Fail

The MAST study identified 14 failure modes across 3 categories. The five below account for the vast majority of production failures. Each includes a specific prevention measure you can implement before your next deployment.

Pre-Deployment Checklist: The Five Failure Modes

  1. Compound Reliability Decay
    Calculate your end-to-end reliability before you ship. Multiply per-step success rates across your full chain. If the number drops below 80%, reduce the chain length or add verification checkpoints.
     Keep chains under 5 sequential steps. Insert a verification agent at step 3 and step 5 that checks output quality before passing downstream. If verification fails, route to a human or a fallback path (not a retry of the same chain).
  2. Coordination Tax (36.9% of all MAS failures)
    When two agents receive ambiguous instructions, they interpret them differently. A support agent approves a refund; a compliance agent blocks it. The user receives contradictory signals.
     Explicit input/output contracts between every agent pair. Define the data schema at every boundary and validate it. No implicit shared state. If Agent A’s output feeds Agent B, both agents must agree on the format before deployment, not at runtime.
  3. Cost Explosion
    Token costs multiply across agents (3.5x in documented cases). Retry loops can burn through $40 or more in Application Programming Interface (API) fees within minutes, with no useful output to show for it.
     Set hard per-agent and per-workflow token budgets. Implement circuit breakers: if an agent exceeds its budget, halt the workflow and surface an error rather than retrying. Log cost per completed workflow to catch regressions early.
  4. Security Gaps
    The Open Worldwide Application Security Project (OWASP) Top 10 for LLM Applications found prompt injection vulnerabilities in 73% of assessed production deployments. In multi-agent systems, a compromised agent can propagate malicious instructions to every downstream agent.
     Input sanitization at every agent boundary, not just the entry point. Treat inter-agent messages with the same suspicion you’d apply to external user input. Run a red-team exercise against your agent chain before production launch.
  5. Infinite Retry Loops
    Agent A fails. It retries. Fails again. In multi-agent systems, Agent A’s failure triggers Agent B’s error handler, which calls Agent A again. The loop runs until your budget runs out.
     Maximum 3 retries per agent per workflow execution. Exponential backoff between retries. Dead-letter queues for tasks that fail past the retry limit. And one absolute rule: never let one agent trigger another without a cycle check in the orchestration layer.
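Several of these guardrails (hard token budgets, a retry ceiling, a dead-letter path) fit in one small wrapper. A minimal sketch; the limits and the `call_agent` stand-in are illustrative, and a production version would add exponential backoff between attempts:

```python
# Guardrail sketch: budget circuit breaker + bounded retries + dead-letter.

MAX_RETRIES = 3
TOKEN_BUDGET = 50_000

dead_letter = []  # tasks parked after exhausting retries

def run_with_guards(call_agent, task):
    """call_agent(task) -> (result_or_None, tokens_spent)."""
    tokens_used = 0
    for attempt in range(1, MAX_RETRIES + 1):
        result, cost = call_agent(task)
        tokens_used += cost
        if tokens_used > TOKEN_BUDGET:
            # Circuit breaker: halt and surface the error, never retry past budget.
            raise RuntimeError("token budget exceeded; halting workflow")
        if result is not None:
            return result
    dead_letter.append(task)  # past the retry limit: park it, don't loop
    return None

# Toy agent that fails twice, then succeeds:
attempts = []
def flaky_agent(task):
    attempts.append(task)
    return ("ok", 1000) if len(attempts) == 3 else (None, 1000)

print(run_with_guards(flaky_agent, "refund #123"))  # succeeds on 3rd try
```

The cycle check from rule 5 lives one level up: before any agent invokes another, the orchestrator should verify the target isn’t already on the current call path.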

Prompt injection was found in 73% of production LLM deployments assessed during security audits. In multi-agent systems, one compromised agent can propagate the attack downstream.


Tool vs. Employee: The $60 Million Architecture Gap

In February 2026, the National Bureau of Economic Research (NBER) published a study surveying nearly 6,000 executives across the US, UK, Germany, and Australia. The finding: 89% of firms reported zero change in productivity from AI. Ninety percent of managers said AI had no impact on employment. These firms averaged 1.5 hours per week of AI use per executive.

Fortune called it a resurrection of Robert Solow’s 1987 paradox: “You can see the computer age everywhere but in the productivity statistics.” History is repeating, forty years later, with a different technology and the same pattern.

The 90% seeing zero impact deployed AI as a tool. The companies saving millions deployed AI as workers.

The contrast with Klarna isn’t about better models or bigger compute budgets. It’s a structural choice. The 90% treated AI as a copilot: a tool that assists a human in a loop, used 1.5 hours per week. The companies seeing real returns (Klarna, Ramp, Reddit via Salesforce Agentforce) treated AI as a workforce: autonomous agents executing structured workflows with human oversight at decision boundaries, not at every step.

That’s not a technology gap. It’s an architecture gap. The opportunity cost is staggering: the same engineering budget producing zero Return on Investment (ROI) versus $60 million in savings. The variable isn’t spend. It’s structure.

Forty percent of agentic AI projects will be canceled by 2027. The other sixty percent will ship. The difference won’t be which LLM they chose or how much they spent on compute. It will be whether they understood three patterns, ran the compound reliability math, and built their system to survive the five failure modes that kill everything else.

Klarna didn’t deploy 700 agents to replace 700 humans. They built a structured multi-agent system where a smart planner routes work to low-cost executors, where every handoff has an explicit contract, and where the architecture was designed to fail gracefully rather than cascade.

You have the same patterns, the same frameworks, and the same failure data. The playbook is open. What you build with it is the only remaining variable.


References

  1. Kim, Y. et al. “Towards a Science of Scaling Agent Systems.” Google DeepMind, December 2025.
  2. Cemri, M., Pan, M.Z., Yang, S. et al. “MAST: Multi-Agent Systems Failure Taxonomy.” March 2025.
  3. Coshow, T. and Zamanian, K. “Multiagent Systems in Enterprise AI.” Gartner, December 2025.
  4. Gartner. “Over 40 Percent of Agentic AI Projects Will Be Canceled by End of 2027.” June 2025.
  5. LangChain. “Klarna: AI-Powered Customer Service at Scale.” 2025.
  6. Klarna. “AI Assistant Handles Two-Thirds of Customer Service Chats in Its First Month.” 2024.
  7. Bloom, N. et al. “Firm Data on AI.” National Bureau of Economic Research, Working Paper #34836, February 2026.
  8. Fortune. “Hundreds of CEOs Just Admitted AI Had No Impact on Employment or Productivity.” February 2026.
  9. Moran, S. “Why Your Multi-Agent System Is Failing: Escaping the 17x Error Trap.” Towards Data Science, January 2026.
  10. Carnegie Mellon University. “AI Agents Fail at Office Tasks.” 2025.
  11. Redis. “AI Agent Architecture: Patterns and Best Practices.” 2025.
  12. DataCamp. “CrewAI vs LangGraph vs AutoGen: Comparison Guide.” 2025.