Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap of the “Bag of Agents”


A new paper landed on arXiv just before Christmas 2025, very much an early present from the team at Google DeepMind, with the title Towards a Science of Scaling Agent Systems. I found it a genuinely useful read for engineers and data scientists. It’s peppered with concrete, measurement-driven advice and full of takeaways you can apply immediately. The authors run a large multi-factorial study, enabled by the tremendous compute available at these frontier labs, systematically varying key design parameters in order to understand what really drives performance in agentic systems.

Like many industry AI practitioners, I spend a lot of time building Multi-Agent Systems (MAS). This involves taking a complex, multi-step workflow and dividing it across a set of agents, each specialised for a specific task. While the dominant paradigm for AI chatbots is zero-shot, request-response interaction, Multi-Agent Systems offer a more compelling promise: the ability to autonomously “divide and conquer” complex tasks. By parallelizing research, reasoning, and tool use, these systems can significantly boost effectiveness over monolithic models.

To move beyond simple interactions, the DeepMind research highlights that MAS performance is determined by the interplay of four factors:

  • Agent Quantity: The number of specialized units deployed.
  • Coordination Structure: The topology (Centralised, Decentralised, etc.) governing how they interact.
  • Model Capability: The baseline intelligence of the underlying LLMs.
  • Task Properties: The inherent complexity and requirements of the work.

The paper suggests that MAS success is found at the intersection of these four factors. If we get the balance wrong, we end up scaling noise rather than results. This post will help you find that secret sauce for your own tasks, in a way that reliably helps you build a performant and robust MAS that will impress your stakeholders.

A compelling recent success story of an optimal balance being found for a complex task comes from Cursor, the AI-powered software development company behind a popular IDE. They describe using large numbers of agents working in concert to automate complex tasks over prolonged runs, including generating substantial amounts of code to build a web browser (you can see the code here) and translating codebases (e.g., from Solid to React). Their write-up on scaling agentic AI to difficult tasks over weeks of computation is an interesting read.

Cursor report that prompt engineering is critical to performance, and that the right agent coordination architecture is essential. Specifically, they report better results with a structured planner–worker decomposition than with a flat swarm (or bag) of agents. The role of coordination is especially interesting, and it’s an aspect of MAS design we’ll return to in this article. Much like real-world teams benefit from a manager, the Cursor team found that a hierarchical setup, with a planner in charge, was essential. This worked far better than a free-for-all in which agents picked tasks at will. The planner enabled controlled delegation and accountability, ensuring worker agents tackled the right sub-tasks and delivered concrete project progress. Interestingly, they also find that it’s necessary to match the right model to the right role, reporting that GPT-5.2 was the most effective planner and worker agent compared to Claude Opus 4.5.

However, despite this early glimpse of success from the Cursor team, Multi-Agent System development in the real world is still at the boundary of scientific knowledge and therefore a difficult task. Multi-Agent Systems can be messy, with unreliable outputs, token budgets lost to coordination chatter, and performance that drifts, sometimes worsening instead of improving. Careful thought is required in the design, mapping it to the particulars of a given use case.

When developing a MAS, I kept coming back to the same questions: when should I split a step across multiple agents, and what criteria should drive that decision? What coordination architecture should I choose? With so many permutations of decomposition and agent roles, it’s easy to end up overwhelmed. Moreover, how should I be thinking about the kinds of agents available and their roles?

That gap between promise and reality is what makes developing a MAS such a compelling engineering and data science problem. Getting these systems to work reliably, and to deliver tangible business value, still involves a lot of trial, error, and hard-won intuition. In some ways, the field can feel like you’re operating off the beaten path, currently without enough shared theory or standard practice.

This paper by the DeepMind team helps a lot. It brings structure to the space and, importantly, proposes a quantitative way to predict when a given agent architecture is likely to shine and when it’s more likely to underperform.

In the rush to build complex AI, most developers fall into the trap of throwing more LLMs at a problem and hoping for emergent intelligence. But as the recent DeepMind research shows, a bag of agents is not an effective team; rather, it can be a source of 17.2x error amplification. Without the structure and discipline of a proper topology to constrain the agentic interaction, we end up scaling noise rather than an intelligent capability that is likely to solve a business task.

I have little doubt that mastering the standardised build-out of Multi-Agent Systems is the next great technical moat. Firms that can quickly bridge the gap between ‘messy’ autonomous agents and rigorous, plane-based topologies will reap the dividends of maximum efficiency and high-fidelity output at scale, bringing massive competitive advantages within their market.

According to the DeepMind scaling research, multi-agent coordination yields the highest returns when the single-agent baseline is below 45%. If your base model is already hitting 80%, adding more agents may actually introduce more noise than value.

In this post I distill nuggets like the above into a playbook that gives you the right mental map to build effective Multi-Agent Systems. You will find golden rules for constructing a Multi-Agent System for a given task, touching on what agents to build and how they should be coordinated. Further, I’ll define a set of ten core agent archetypes to help you map the landscape of capabilities and make it easier to choose the right setup for your use case. I’ll then draw out the main design lessons from the DeepMind paper, using them to show how to configure and coordinate these agents effectively for different kinds of work.

Defining a Taxonomy of Core Agent Archetypes

In this section we will map out the design space of agents, focusing on the overarching types available for solving complex tasks. It’s useful to distill agents down into 10 basic types, following closely the definitions in the 2023 autonomous agent survey of Wang et al.: Orchestrator, Planner, Executor, Evaluator, Synthesiser, Critic, Retriever, Memory Keeper, Mediator, and Monitor. In my experience, nearly all useful Multi-Agent Systems can be designed by using a combination of these agents connected together in a specific topology.

So far we have a loose collection of agent types; it helps to have an associated mental reference point for how they can be grouped and applied. We can organize these agents through two complementary lenses: a static architecture of Functional Control Planes and a dynamic runtime cycle of Plan–Do–Confirm. The way to think about this is that the control planes provide the structure, while the runtime cycle drives the workflow:

  • Plan: The Orchestrator (Control Plane) defines the high-level objective and constraints. It delegates to the Planner, which decomposes the objective into a task graph — mapping dependencies, priorities, and steps. As new information surfaces, the Orchestrator sequences and revises this plan dynamically.
  • Do: The Executor translates abstract tasks into tangible outputs (artifacts, code changes, decisions). To ensure this work is grounded and efficient, the Retriever (Context Plane) supplies the Executor with just-in-time context, such as relevant files, documentation, or prior evidence.
  • Confirm: This is the quality gate. The Evaluator validates outputs against objective acceptance criteria, while the Critic probes for subjective weaknesses like edge cases or hidden assumptions. Feedback loops back to the Planner for iteration. Concurrently, the Monitor watches for systemic health — tracking drift, stalls, or budget spikes — ready to trigger a reset if the cycle degrades. (A minimal code sketch of this cycle follows after Figure 2.)
Figure 2: This diagram represents a Multi-Agent Cognitive Architecture. Instead of asking one AI to “do everything,” this design splits the cognitive load into specialised roles (archetypes), much like how a software engineering team works. 📖 Source: image by author.
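
To make the Plan–Do–Confirm cycle concrete, here is a minimal Python sketch of the runtime loop. It is illustrative only: the plan, execute, evaluate, critique, and monitor_ok callables are hypothetical stand-ins for whatever LLM-backed agent implementations you plug in.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    description: str
    result: str | None = None

def plan_do_confirm(
    objective: str,
    plan: Callable[[str, List[str]], List[Task]],   # Planner: objective + feedback -> tasks
    execute: Callable[[Task], str],                  # Executor: task -> output artifact
    evaluate: Callable[[str], bool],                 # Evaluator: objective pass/fail check
    critique: Callable[[str], List[str]],            # Critic: subjective issues found
    monitor_ok: Callable[[int], bool],               # Monitor: budget/health check per round
    max_rounds: int = 5,
) -> List[Task]:
    tasks: List[Task] = []
    feedback: List[str] = []
    for round_no in range(max_rounds):
        if not monitor_ok(round_no):          # Monitor pulls the emergency brake
            break
        tasks = plan(objective, feedback)     # Plan: decompose the objective into tasks
        for task in tasks:                    # Do: produce an artifact for each task
            task.result = execute(task)
        issues = [issue for t in tasks for issue in critique(t.result)]
        passed = all(evaluate(t.result) for t in tasks)
        if passed and not issues:             # Confirm: both quality gates pass
            return tasks
        feedback = issues or ["evaluation failed; revise the plan"]
    return tasks
```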

As illustrated in our Multi-Agent Cognitive Architecture (Figure 2), these specialised agents are situated within functional planes, which group capabilities by functional responsibility. This structure transforms a chaotic “Bag of Agents” into a high-fidelity system by compartmentalising information flow.

To ground these technical layers, we can extend our human workplace analogy to define how these planes function:

The Control Layer — The Management

  • The Orchestrator: Think of this as the Project Manager. It holds the high-level objective. It doesn’t write code or search the web; its job is to delegate. It decides who does what next.
  • The Monitor: This is the “Health & Safety” officer. It watches the Orchestrator. If the agent gets stuck in a loop, burns too much money (tokens), or drifts away from the original goal, the Monitor pulls the emergency brake or triggers an alert.

The Planning Layer — The Strategy

  • The Planner: Before acting, the agent must think. The Planner takes the Orchestrator’s goal and breaks it down.
  • Task Graph / Backlog (Artifact): This is the “To-Do List.” It’s dynamic — as the agent learns new things, steps may be added or removed. The Planner continually updates this graph so the Orchestrator knows the candidate next steps.

The Context Layer — The Memory

  • The Retriever: The Librarian. It fetches specific information (docs, previous code) needed for the current task.
  • The Memory Keeper: The Archivist. Not everything needs to be remembered. This role summarises (compresses) what happened and decides what is important enough to store in the Context Store for the future.

The Execution Layer — The Staff

  • The Executor: The Specialist. This acts on the plan. It writes the code, calls the API, or generates the text.
  • The Synthesiser: The Editor. The Executor’s output might be messy or too verbose. The Synthesiser cleans it up and formats it into a clear result for the Orchestrator to review.

The Assurance Layer — Quality Control

  • The Evaluator: Checks for objective correctness. (e.g., “Did the code compile?” “Did the output adhere to the JSON schema?”)
  • The Critic: Checks for subjective risks or edge cases. (e.g., “This code runs, however it has a security vulnerability,” or “This logic is flawed.”)
  • Feedback Loops (Dotted Arrows): Notice the dotted lines going back up to the Planner in Figure 2. If the Assurance layer fails the work, the agent updates the plan to fix the specific errors found.

The Mediation Layer — Conflict Resolution

  • The Mediator: Sometimes the Evaluator says “Pass” but the Critic says “Fail.” Or perhaps the Planner wants to do something the Monitor flags as dangerous. The Mediator acts as the tie-breaker to stop the system from freezing in a deadlock.
Figure 3: The Agent Archetype Taxonomy. Moving beyond a “Bag of Agents” requires a structured division of labor. This taxonomy maps 10 core archetypes into functional planes (Control, Planning, Context, Execution, Assurance, and Mediation). Each role acts as a specific “Architecture Defense” against the common failure modes of scaling, such as error amplification, logic drift, and tool-coordination overhead. 📖 Source: table by author.
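
For quick reference, the same taxonomy can be captured as a simple lookup structure (a sketch; the grouping simply mirrors the planes described above):

```python
# Archetypes grouped by functional plane, mirroring the taxonomy above.
FUNCTIONAL_PLANES = {
    "control":   ["Orchestrator", "Monitor"],
    "planning":  ["Planner"],
    "context":   ["Retriever", "Memory Keeper"],
    "execution": ["Executor", "Synthesiser"],
    "assurance": ["Evaluator", "Critic"],
    "mediation": ["Mediator"],
}

def plane_of(archetype: str) -> str:
    """Return the functional plane an archetype belongs to."""
    for plane, members in FUNCTIONAL_PLANES.items():
        if archetype in members:
            return plane
    raise KeyError(f"unknown archetype: {archetype}")
```

For example, plane_of("Critic") returns "assurance", which is handy when routing messages or enforcing rules such as “only the Mediation plane may resolve Assurance conflicts.”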

To see these archetypes in motion, we can trace a single request through the system. Imagine we submit the following objective: scrape a target website and load the extracted data into our database. The request doesn’t just go to a “worker”; it triggers a choreographed sequence across the functional planes.

Step 1: Initialisation — Control Lane

The Orchestrator receives the objective. It checks its “guardrails” (e.g., “Do we have permission to scrape? What’s the budget?”).

  • The Action: It hands the goal to the Planner.
  • The Monitor starts a “stopwatch” and a “token counter” to make sure the agent doesn’t spend $50 trying to scrape a $5 site.

Step 2: Decomposition — Planning Lane

The Planner realises this is actually four sub-tasks: (1) Research the site structure, (2) Write the scraper, (3) Map the data schema, (4) Write the DB ingestion logic.

  • The Action: It populates the Task Graph / Backlog with these steps and identifies dependencies (e.g., you can’t map the schema until you’ve researched the site). A minimal sketch of such a task graph follows below.
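
Here is a minimal sketch of that Task Graph using Python’s standard-library graphlib; the task names and dependency edges are hypothetical stand-ins for what a real Planner would generate and revise dynamically.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
task_graph = {
    "research_site_structure": set(),
    "write_scraper":           {"research_site_structure"},
    "map_data_schema":         {"research_site_structure"},
    "write_db_ingestion":      {"map_data_schema", "write_scraper"},
}

# The Orchestrator can read off a valid execution order (and spot parallelisable steps:
# here the scraper and the schema mapping can proceed in parallel).
order = list(TopologicalSorter(task_graph).static_order())
print(order)
# e.g. ['research_site_structure', 'write_scraper', 'map_data_schema', 'write_db_ingestion']
```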

Step 3: Grounding — Context Lane

Before the worker starts, the Retriever looks into the Context Store.

  • The Action: It pulls our “Database Schema Docs” and “Scraping Best Practices” and hands them to the Executor.
  • This prevents the agent from “hallucinating” a database structure that doesn’t exist.

Step 4: Production — Execution Lane

The Executor writes the Python code for Steps 1 and 2.

  • The Action: It places the code in the Workspace / Outputs.
  • The Synthesiser might take that raw code and wrap it in a “Status Update” for the Orchestrator, summarising that the first two steps are complete and ready for review.

Step 5: The “Trial” — Assurance Lane

This is where the dotted feedback lines spring into action:

  • The Evaluator runs the code. It fails due to a 403 Forbidden error (anti-scraping bot). It sends a Dotted Arrow back to the Planner reporting that the scraping step failed and needs a revised approach.
  • The Critic looks at the code and sees that the database password is hardcoded. It sends a Dotted Arrow to the Planner flagging the security issue.

Step 6: Conflict & Resolution — Mediation Lane

Imagine the Planner tries to fix the Critic’s security concern, but the Evaluator says the new code now breaks the DB connection. They’re stuck in a loop.

  • The Action: The Mediator steps in, weighs the two conflicting “opinions,” and decides how to break the tie (for example, prioritising the security fix and then re-planning the DB connection step).
  • The Orchestrator receives this resolution and updates the state.

Step 7: Final Delivery

The loop repeats until the Evaluator (Code works) and Critic (Code is secure) each give a “Pass.”

  • The Action: The Synthesiser takes the final, verified code and returns it to the user.
  • The Memory Keeper summarises the “403 Forbidden” encounter and stores it in the Context Store so that next time the agent is asked to scrape a site, it remembers to use header rotation from the start.
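
To illustrate that final hand-off, here is a small sketch of how a Memory Keeper might persist the lesson. The JSONL context store and the hand-written summary are assumptions purely for illustration; in practice the summary would come from an LLM call and the store would likely be a vector or document database.

```python
import json
from pathlib import Path

CONTEXT_STORE = Path("context_store.jsonl")  # hypothetical on-disk context store

def remember(topic: str, lesson: str) -> None:
    """Memory Keeper: append a compressed 'lesson learned' for future runs."""
    with CONTEXT_STORE.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"topic": topic, "lesson": lesson}) + "\n")

def recall(topic: str) -> list[str]:
    """Retriever: fetch prior lessons for a topic before the Executor starts work."""
    if not CONTEXT_STORE.exists():
        return []
    lessons = []
    with CONTEXT_STORE.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["topic"] == topic:
                lessons.append(entry["lesson"])
    return lessons

# Memory Keeper stores the lesson from this run...
remember("web_scraping", "Target site returns 403 for default headers; rotate headers from the start.")
# ...and a future run surfaces it before any scraper code is written.
print(recall("web_scraping"))
```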

Core Tool Archetypes: The 10 Building Blocks of Reliable Agentic Systems

In the same way we defined common agent archetypes, we can undertake a similar exercise for their tools. Tool archetypes define how work is grounded, executed, and verified, and which failure modes are contained as the system scales.

Figure 5: Core Tool Archetypes. Ten tool archetypes that underpin reliable multi-agent systems, grouped by functional plane (Context, Planning, Control, Execution, Assurance). Each tool type acts as an “architecture defence,” containing common scaling failure modes such as hallucination, thrashing, runaway cost, unsafe actions, and silent correctness errors. 📖 Source: image by author.

As we cover in Figure 5 above, retrieval tools can prevent hallucination by forcing evidence; schema validators and test harnesses prevent silent failures by making correctness machine-checkable; budget meters and observability prevent runaway loops by exposing (and constraining) token burn and drift; and permission gates plus sandboxes prevent unsafe side effects by limiting what agents can do in the real world.
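
To give a flavour of two of these tool archetypes in code, below is a minimal sketch of a schema validator and a budget meter. The schema fields and limits are illustrative assumptions; a production system would more likely use a library such as jsonschema or pydantic for validation.

```python
# Schema validator: makes correctness machine-checkable so bad outputs fail loudly.
def validate_record(record: dict) -> list[str]:
    errors = []
    for field, expected_type in {"url": str, "title": str, "price": float}.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors

# Budget meter: exposes and constrains token burn to prevent runaway loops.
class BudgetMeter:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(f"budget exceeded: {self.used}/{self.max_tokens} tokens")

meter = BudgetMeter(max_tokens=50_000)
meter.charge(1_200)  # each agent step reports its token usage
print(validate_record({"url": "https://example.com", "title": "Item", "price": 9.99}))  # -> []
```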

The Bag of Agents Anti-Pattern

Before exploring the Scaling Laws of Agency, it’s instructive to define the anti-pattern currently stalling most agentic AI deployments, and that we aim to improve upon: the “Bag of Agents.” In this naive setup, developers throw multiple LLMs at a problem without a formal topology, resulting in a system that typically exhibits three fatal characteristics:

  • Flat Topology: Every agent has an open line to every other agent. There is no hierarchy, no gatekeeper, and no specialized planes to compartmentalize information flow (see the channel-count sketch after this list).
  • Noisy Chatter: Without an Orchestrator, agents descend into circular logic or “hallucination loops,” where they echo and validate one another’s mistakes rather than correcting them.
  • Open-Loop Execution: Information flows unchecked through the group. There is no dedicated Assurance Plane (Evaluators or Critics) to verify data before it reaches the next stage of the workflow.
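
The communication arithmetic behind the flat-topology problem is simple: when every agent can talk to every other agent, the number of open channels grows quadratically, whereas routing all traffic through an Orchestrator keeps it linear. A tiny illustration:

```python
def flat_channels(n: int) -> int:
    """Open lines in a fully connected 'bag of agents': n choose 2."""
    return n * (n - 1) // 2

def hub_channels(n: int) -> int:
    """Open lines when all traffic routes through a single Orchestrator."""
    return n  # one channel per worker agent

for n in (4, 8, 16):
    print(n, flat_channels(n), hub_channels(n))
# 4 -> 6 vs 4, 8 -> 28 vs 8, 16 -> 120 vs 16: chatter grows far faster without a hub.
```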

To use a workplace analogy: the jump from a “Bag” to a “System” is the same leap a startup makes when it hires its first manager. Early-stage teams quickly realize that headcount doesn’t equal output without an organizational structure to contain it. As Brooks put it in The Mythical Man-Month, “Adding manpower to a late software project makes it later.”

But how much structure is enough?

The answer lies in the Scaling Laws of Agency from the DeepMind paper. This research uncovers the precise mathematical trade-offs between adding more LLM “brains” and the growing friction of their coordination.

The Scaling Laws of Agency: Coordination, Topology, and Trade-offs

The core discovery of the Towards a Science of Scaling Agent Systems paper is that cranking up agent quantity is not a silver bullet for better performance. Rather, there exists a rigorous trade-off between coordination overhead and task complexity. Without a deliberate topology, adding agents is like adding engineers to a project without an orchestrating manager: you usually don’t get more valuable output; you just get more meetings, undirected and potentially wasteful work, and noisy chatter.

Figure 6: The Frontiers of Agentic Scaling. Comparative evaluation of Single-Agent Systems (SAS) versus Centralized and Decentralized Multi-Agent Systems (MAS). The data illustrates the “Coordination Tax”, where accuracy gains begin to saturate or fluctuate as agent quantity increases. This highlights the necessity of a structured topology to maintain performance beyond the 4-agent threshold. 📖 Source: Kim, Y., et al. (2025). “Towards a Science of Scaling Agent Systems.” arXiv preprint.

Experimental Setup for MAS Evaluation

The DeepMind team evaluated their Multi-Agent Systems across four task suites, testing the agent designs on very different workloads to avoid tying the conclusions to a specific task or benchmark:

  • BrowseComp-Plus (2025): Web browsing / information retrieval, framed as multi-website information location — a test of search, navigation, and evidence gathering.
  • Finance-Agent (2025): Finance tasks designed to mimic entry-level analyst performance — tests structured reasoning, quantitative interpretation, and decision support.
  • PlanCraft (2024): Agent planning in a Minecraft environment — a classic long-horizon planning setup with state, constraints, and sequencing.
  • WorkBench (2024): Planning and tool selection for common business activities — tests whether agents can pick tools/actions and execute practical workflows.

Five coordination topologies are examined in the paper: a Single-Agent System (SAS) and four Multi-Agent System (MAS) designs — Independent, Decentralised, Centralised, and Hybrid. Topology matters because it determines whether adding agents buys useful parallel work or simply buys more communication.

  • SAS (Single Agent): One agent does everything sequentially. Minimal coordination overhead, but limited parallelism.
  • MAS (Independent): Many agents work in parallel, then outputs are synthesised into a final answer. Strong for breadth (research, ideation), weaker for tightly coupled reasoning chains.
  • MAS (Decentralised): Agents debate over multiple rounds and decide via majority vote. This can improve robustness, but communication grows quickly and errors can compound through repeated cross-talk.
  • MAS (Centralised): A single Orchestrator coordinates specialist sub-agents (sketched below). This “manager + team” design is often more stable at scale because it constrains chatter and contains failure modes.
  • MAS (Hybrid): Central orchestration plus targeted peer-to-peer exchange. More flexible, but also the most complex to manage and the easiest to overbuild.
Figure 7: A visual comparison of the architectures studied in Towards a Science of Scaling Agent Systems: Single-Agent (SAS), Independent MAS (parallel workers + synthesis), Decentralized MAS (multi-round debate + majority vote), Centralized MAS (orchestrator coordinating specialist sub-agents), and Hybrid MAS (orchestrator plus targeted peer-to-peer messaging). The key takeaway is that topology determines whether scaling agents increases useful parallel work or mainly increases coordination overhead and error propagation. 📖 Source: image by author via GPT-5.2.
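
Below is a schematic sketch of a single Centralised MAS round (assign → collect → synthesise). It is not the paper’s implementation; call_model is a hypothetical wrapper around whichever LLM API you use, and the prompts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def centralised_round(
    objective: str,
    subtasks: list[str],
    call_model: Callable[[str], str],   # hypothetical LLM wrapper: prompt -> completion
) -> str:
    # The Orchestrator fans the sub-tasks out to specialist sub-agents in parallel...
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        worker_outputs = list(pool.map(
            lambda task: call_model(f"Objective: {objective}\nYour sub-task: {task}"),
            subtasks,
        ))
    # ...then synthesises their outputs into one answer, acting as the single
    # verification point that contains error propagation.
    joined = "\n\n".join(worker_outputs)
    return call_model(f"Objective: {objective}\nCombine these partial results:\n{joined}")

# Usage with a stub model, purely for illustration:
if __name__ == "__main__":
    stub = lambda prompt: f"[answer to: {prompt[:40]}...]"
    print(centralised_round("summarise Q3 revenue drivers", ["parse filings", "check news"], stub))
```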

This framing also clarifies why unstructured “bag of agents” designs can be very dangerous. Kim et al. report up to 17.2× error amplification in poorly coordinated networks, while centralised coordination contains this to ~4.4× by acting as a circuit breaker.

Importantly, the paper shows these dynamics are benchmark-dependent.

  • On Finance-Agent (highly decomposable financial reasoning), MAS delivers the largest gains — Centralised +80.8%, Decentralised +74.5%, Hybrid +73.1% over SAS — because agents can split the work into parallel analytic threads and then synthesise.
  • On BrowseComp-Plus (dynamic web navigation and synthesis), improvements are modest and topology-sensitive: Decentralised performs best (+9.2%), while Centralised is essentially flat (+0.2%) and Independent can collapse (−35%) due to unchecked error propagation and noisy cross-talk.
  • WorkBench sits in the middle, showing only marginal movement (~−11% to +6% overall; Decentralised +5.7%), suggesting a near-balance between orchestration benefit and coordination tax.
  • And on PlanCraft (strictly sequential, state-dependent planning), every MAS variant degrades performance (~−39% to −70%), because coordination overhead consumes budget without providing real parallel advantage.

The practical antidote is to impose structure on your MAS by mapping agents into functional planes and using central control to suppress error propagation and coordination sprawl.

That takes us to the paper’s core contribution: the Scaling Laws of Agency.

The Scaling Laws of Agency

Based on the findings from Kim et al., we can derive three scaling laws for constructing an effective agentic coordination architecture:

  • The 17x Rule: Unstructured networks amplify errors exponentially. Our Centralized Control Plane (the Orchestrator) suppresses this by acting as a single point of verification.
  • The Tool-Coordination Trade-off: More tools require more grounding. Our Context Plane (Retriever) ensures agents don’t “guess” how to use tools, reducing the noise that leads to overhead.
  • The 45% Saturation Point: Agent coordination yields the highest returns when single-agent performance is low. As models get smarter, we should lean on the Monitor agent to simplify the topology and avoid unnecessary complexity. (A rough pre-flight check based on these rules follows after Figure 8.)
Figure 8: The agent architecture design space as a series of filters. Without them, the system is Open-Loop (errors escape). With them, the system is Closed-Loop (errors are recycled into improvements). 📖 Source: table by author.
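
These rules can be folded into a rough pre-flight check before committing to a multi-agent build. In the sketch below, the 45% threshold comes from the paper, while the other flags and recommendations are my own simplifying assumptions:

```python
def recommend_topology(
    single_agent_accuracy: float,   # baseline accuracy of one agent on your eval set (0-1)
    decomposable: bool,             # can the task be split into largely independent parts?
    sequential: bool,               # is the task a long, tightly coupled reasoning chain?
) -> str:
    if sequential:
        # Coordination overhead tends to break long dependency chains (cf. the PlanCraft results).
        return "single-agent"
    if single_agent_accuracy >= 0.45:
        # Above the saturation point, extra agents mostly add noise and cost.
        return "single-agent (or minimal MAS)"
    if decomposable:
        # Low baseline + parallelisable work is where centralised MAS pays off most.
        return "centralised MAS (orchestrator + specialist workers)"
    return "decentralised MAS (debate + vote), validated on a small probe run"

print(recommend_topology(0.30, decomposable=True, sequential=False))
# -> 'centralised MAS (orchestrator + specialist workers)'
```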

In my experience the Assurance layer is often the biggest differentiator in improving MAS performance. The “Assurance → Planner” loop transforms our MAS from an Open-Loop (fire-and-forget) system into a Closed-Loop (self-correcting) system that contains error propagation and allows intelligence to scale to more complex tasks.

Mixed-Model Agent Teams: When Heterogeneous LLMs Help (and When They Hurt)

The DeepMind team explicitly test heterogeneous teams, in other words using a different base LLM for the Orchestrator than for the sub-agents, and mixing capabilities in decentralised debate. The lessons here are very interesting from a practical standpoint. They report three main findings (shown on the BrowseComp-Plus task/dataset):

  1. Centralised MAS: mixing might help or hurt depending on model family
  • For Anthropic, a low-capability orchestrator + high-capability sub-agents beats an all–high-capability centralised team (0.42 vs 0.32, +31%).
  • For OpenAI and Gemini, heterogeneous centralised setups degrade versus homogeneous high-capability.

Takeaway: a weak orchestrator can become a bottleneck in some model families, even when the workers are strong (since it is the routing/synthesis chokepoint).

2. Decentralised MAS: mixed-capability debate is surprisingly robust

  • Mixed-capability decentralised debate is near-optimal or sometimes better than homogeneous high-capability baselines (they give examples: OpenAI 0.53 vs 0.50; Anthropic 0.47 vs 0.37; Gemini 0.42 vs 0.43).

Takeaway: voting/peer verification can “average out” weaker agents, so heterogeneity hurts less than you might expect.

3. In centralised systems, sub-agent capability matters more than orchestrator capability.

Across all families, configurations with high-capability sub-agents outperform those with high-capability orchestrators.

The practical message from the DeepMind team is that if you’re spending money selectively, spend it on the workers (the agents producing the substance), not necessarily on the manager; but do validate within your model family, since the “cheap orchestrator + strong workers” pattern didn’t generalise uniformly.
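
In configuration terms, this suggests A/B-testing setups along the following lines, where capability is spent on the sub-agents first. The model identifiers and structure are placeholders, and the cheap-orchestrator variant should be validated per model family before adoption:

```python
# Two candidate centralised-MAS configurations to compare on your own eval set.
homogeneous_high = {
    "orchestrator_model": "frontier-large",   # placeholder model identifier
    "worker_model":       "frontier-large",
    "num_workers":        4,
}

cheap_orchestrator = {
    "orchestrator_model": "frontier-small",   # cheaper router/synthesiser
    "worker_model":       "frontier-large",   # capability spent on the substance producers
    "num_workers":        4,
}
```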

Understanding the Cost of a Multi-Agent System

A frequently asked question is how to define the cost of a MAS, i.e., the token budget that ultimately translates into dollars. Topology determines whether adding agents buys parallel work or just buys more communication. To make cost concrete, we can model it with a small set of knobs:

  • k = max iterations per agent (how many plan/act/reflect steps each agent is allowed)
  • n = number of agents (how many “workers” we spin up)
  • r = orchestrator rounds (how many assign → collect → revise cycles we run)
  • d = debate rounds (how many back-and-forth rounds before a vote/decision)
  • p = peer-communication rounds (how often agents talk directly to one another)
  • m = average peer requests per round (how many peer messages each agent sends per peer round)

A practical technique to take into consideration total cost is:

Total MAS cost ≈ Work cost + Coordination cost

  • Work cost is driven mainly by n × k (how many agents you run, and how many steps each takes).
  • Coordination cost is driven by rounds × fan-out, i.e. how many times we re-coordinate (r, d, p) multiplied by how many messages are exchanged (n, m) — plus the hidden tax of agents having to read all that extra context.

To convert this into dollars, first use the knobs (n, k, r, d, p, m) to estimate total input/output tokens generated and consumed, then multiply by your model’s per-token price:

$ Cost ≈ (InputTokens ÷ 1M × $/1M_input) + (OutputTokens ÷ 1M × $/1M_output)

Where:

  • InputTokens include everything agents read (shared transcript, retrieved docs, tool outputs, other agents’ messages).
  • OutputTokens include everything agents generate (plans, intermediate reasoning, debate messages, final synthesis).

This is why decentralised and hybrid topologies can get expensive very fast: debate and peer messaging inflate both message volume and context length, so we pay twice, as agents generate more text and everybody has to read more text. In practice, once agents begin broadly communicating with one another, coordination costs can start to feel closer to an n² effect. A minimal cost-estimator sketch follows below.
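
Here is a minimal sketch of that cost model as code. The tokens-per-step, tokens-per-message, context-read factor, and prices are placeholder assumptions purely for illustration; plug in your own measurements and your provider’s actual rates.

```python
def mas_cost_usd(
    n: int, k: int, r: int = 0, d: int = 0, p: int = 0, m: int = 0,
    tokens_per_step_out: int = 800,      # assumed avg tokens an agent generates per step
    tokens_per_message: int = 400,       # assumed avg tokens per coordination message
    context_read_factor: float = 3.0,    # assumed: each generated token is re-read ~3x as context
    price_in_per_m: float = 1.0,         # $ per 1M input tokens (placeholder)
    price_out_per_m: float = 4.0,        # $ per 1M output tokens (placeholder)
) -> float:
    # Work cost: n agents taking k steps each.
    work_out = n * k * tokens_per_step_out
    # Coordination cost: orchestrator rounds + debate rounds + peer messages.
    coord_messages = r * n + d * n + p * n * m
    coord_out = coord_messages * tokens_per_message
    output_tokens = work_out + coord_out
    input_tokens = output_tokens * context_read_factor   # everything generated gets read again
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Example: 4 workers, 6 steps each, 2 orchestrator rounds, with and without 3 debate rounds.
print(round(mas_cost_usd(n=4, k=6, r=2), 3))        # baseline centralised run
print(round(mas_cost_usd(n=4, k=6, r=2, d=3), 3))   # debate rounds noticeably inflate the bill
```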

The key takeaway is that agent scaling is only helpful if the task gains more from parallelism than it loses to coordination overhead. We should use more agents when the work can be cleanly parallelised (research, search, independent solution attempts). Conversely, we should be cautious when the task is sequential and tightly coupled (multi-step reasoning, long dependency chains), because extra rounds and cross-talk can break the logic chain and turn “more agents” into “more noise.”

MAS Architecture Scaling Laws: Making MAS Design Data-Driven Instead of Exhaustive

A natural question is whether multi-agent systems have “architecture scaling laws,” analogous to the empirical scaling laws for LLM parameters. Kim et al. argue the answer is yes. To tackle the combinatorial search problem — topology × agent count × rounds × model family — they evaluated 180 configurations across four benchmarks, then trained a predictive model on coordination traces (e.g., efficiency vs. overhead, error amplification, redundancy). The model can forecast which topology is likely to perform best, reaching R² ≈ 0.513 and picking the best coordination strategy for ~87% of held-out configurations. The practical shift is from “try everything” to running a small set of short probe configurations (much cheaper and faster), measuring early coordination dynamics, and only then committing full budget to the architectures the model predicts will win.
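
As a toy illustration of that probe-then-predict workflow (not the paper’s actual predictor), one can run short probes, record a few coordination features, fit a simple regression, and rank candidate architectures by predicted score. The feature choices and all numbers below are invented placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Probe results: [coordination_overhead, error_amplification, redundancy] -> observed score.
# These values are fabricated purely to show the shape of the workflow.
probe_features = np.array([
    [0.10, 1.2, 0.05],   # centralised probe
    [0.35, 4.0, 0.20],   # decentralised probe
    [0.05, 1.0, 0.40],   # independent probe
    [0.30, 2.5, 0.15],   # hybrid probe
])
probe_scores = np.array([0.62, 0.48, 0.41, 0.55])

predictor = LinearRegression().fit(probe_features, probe_scores)

# Score unseen candidate configurations from their early coordination traces,
# and spend the full budget only on the most promising one.
candidates = {
    "centralised_6_agents": [0.15, 1.5, 0.07],
    "hybrid_6_agents":      [0.33, 2.8, 0.18],
}
ranked = sorted(candidates, key=lambda c: predictor.predict([candidates[c]])[0], reverse=True)
print(ranked[0])
```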

Conclusions & Final Thoughts

In this post, we reviewed DeepMind’s Towards a Science of Scaling Agent Systems and distilled the most practical lessons for building higher-performing multi-agent systems. These design rules help us avoid poking around in the dark and hoping for the best. The headline takeaway is that more agents is not a guaranteed path to better results. Agent scaling involves real trade-offs, governed by measurable scaling laws, and the “right” number of agents depends on task difficulty, the base model’s capability, and how the system is organised.

Here’s what the research suggests about agent quantity:

  • Diminishing returns (saturation): Adding agents doesn’t produce indefinite gains. In many experiments, performance rises initially, then plateaus — often around ~4 agents — after which additional agents contribute little.
  • The “45% rule”: Extra agents help most when the base model performs poorly on the task (below ~45%). When the base model is already strong, adding agents can trigger capability saturation, where performance stagnates or becomes noisy rather than improving.
  • Topology matters: Quantity alone is not the story; organisation dominates outcomes.
  • Centralised designs tend to scale more reliably, with an Orchestrator helping contain errors and enforce structure.
  • Decentralised “bag of agents” designs can become volatile as the group grows, sometimes amplifying errors instead of refining reasoning.
  • Parallel vs. sequential work: More agents shine on parallelisable tasks (e.g., broad research), where they can materially increase coverage and throughput. But for sequential reasoning, adding agents can degrade performance because the “logic chain” weakens when passed through too many steps.
  • The coordination tax: Every additional agent adds overhead: more messages, more latency, more opportunities for drift. If the task isn’t complex enough to justify that overhead, coordination costs outweigh the benefits of extra LLM “brains”.

With these Golden Rules of Agency in mind, I want to end by wishing you the best in your MAS build-out. Multi-Agent Systems sit right at the frontier of current applied AI, primed to bring the next level of business value in 2026, and they come with the kinds of technical trade-offs that make this work genuinely interesting: balancing capability, coordination, and design to get reliable performance. In building your own MAS you will undoubtedly discover Golden Rules of your own that expand our knowledge of this uncharted territory.

A final thought on DeepMind’s “45%” threshold: multi-agent systems are, in some ways, a workaround for the limits of today’s LLMs. As base models become more capable, fewer tasks will sit in the low-accuracy regime where extra agents add real value. Over time, we may need less decomposition and coordination, and more problems may be solvable by a single model end-to-end, as we move toward artificial general intelligence (AGI).

Paraphrasing Tolkien: one model may yet rule them all.

📚 Further Learning

  • Yubin Kim et al. (2025) — Towards a Science of Scaling Agent Systems — A large controlled study deriving quantitative scaling principles for agent systems across multiple coordination topologies, model families, and benchmarks (including saturation effects and coordination overhead).
  • Cursor Team (2026) — A practical case study describing how Cursor coordinates large numbers of coding agents over prolonged runs (including their “build a web browser” experiment) and what that implies for planner–worker style topologies.
  • Lei Wang et al. (2023) — A Survey on Large Language Model Based Autonomous Agents — A comprehensive survey that maps the agent design space (planning, memory, tools, interaction patterns), useful for grounding the 10-archetype taxonomy in prior literature.
  • Qingyun Wu et al. (2023) — AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation — A framework paper on programming multi-agent conversations and interaction patterns, with empirical results across tasks and agent configurations.
  • Shunyu Yao et al. (2022) — ReAct: Synergizing Reasoning and Acting in Language Models — Introduces the interleaved “reason + act” loop that underpins many modern tool-using agent designs and helps reduce hallucination via grounded actions.
  • Noah Shinn et al. (2023) — Reflexion: Language Agents with Verbal Reinforcement Learning — A clean template for closed-loop improvement: agents reflect on feedback and store “lessons learned” in memory to improve subsequent attempts without weight updates.
  • LangChain (n.d.) — Practical documentation covering common multi-agent patterns and how to structure interaction so you can avoid uncontrolled “bag of agents” chatter.
  • LangChain (n.d.) — A focused overview of orchestration features that matter for real systems (durable execution, human-in-the-loop, streaming, and control flow).
  • langchain-ai (n.d.) — The reference implementation for a graph-based orchestration layer, useful if readers want to inspect concrete design choices and primitives for stateful agents.
  • Frederick P. Brooks, Jr. (1975) — The Mythical Man-Month — The classic coordination lesson (Brooks’s Law) that translates surprisingly well to agent systems: adding more “workers” can increase overhead and slow progress without the right structure.