2025 has been dubbed the "Year of the Agent" by investors and tech media. Companies like Manus, Lovart, Fellou, and many others have captured headlines with their AI agent applications, which are software systems that can autonomously perform tasks on your behalf, from browsing the web to analyzing documents.
Over the past two years, I've built multi-agent systems for clients across different industries, using a range of AI models and agent frameworks. A pattern keeps emerging: these projects look impressive in demos but struggle to work reliably in production. The same questions come up again and again: why isn't adding more agents helping? Why doesn't giving the system more tokens (via prompt engineering or Retrieval-Augmented Generation (RAG) pipelines), more tool calls, or more compute budget improve results?
The industry has embraced two assumptions that seem logical on the surface:
- More agents = better results. Since a single AI agent has limited capabilities, having multiple agents collaborate should solve more complex problems.
- More compute = better performance. If results aren't good enough, just give the AI more time to think (more "tokens") and more tools to work with.
But are these assumptions actually true?
Recent research tells a very different story. A report from UC Berkeley, "Measuring Agents in Production" (December 2025), combined with two papers from Google DeepMind, systematically debunks both assumptions:
- More agents ≠ better results. Multi-agent systems often perform worse than single agents due to coordination overhead.
- More compute ≠ better performance. Agents don't know how to effectively use extra resources. They leave 85% of their budget untouched.
These studies reveal that current AI agents have fundamental limitations that no amount of scaling can easily fix. Let me walk you through what the research actually shows.
The Reality Check: What Berkeley Found in Production Systems
The Berkeley team surveyed 306 practitioners and conducted 20 in-depth case studies with organizations actually running AI agents in production, including Accenture, Amazon, AMD, Anyscale, Broadcom Inc., Google, IBM, Intel, Intesa Sanpaolo, Lambda, Mibura Inc, Samsung SDS, and SAP. Crucially, they filtered out demo-stage or conceptual projects, focusing only on systems generating real business value.
Their findings paint a surprisingly conservative picture:
Most agents are kept on a very short leash. 68% of production systems limit agents to 10 steps or fewer. Only 16.7% allow dozens of steps, and a mere 6.7% give agents unlimited autonomy.
Companies build safety barriers between agents and real systems. Rather than letting agents directly call production APIs, engineering teams create simplified "wrapper APIs", bundling multiple complex operations into single, safer commands. For example, instead of making an agent call three separate database queries, engineers package them into one pre-tested function. This reduces what could go wrong.
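To make that concrete, here is a minimal sketch of such a wrapper in Python; the table names, queries, and function name are hypothetical, not taken from the Berkeley report. The point is that the agent gets one safe, pre-tested tool instead of three raw database calls it could sequence incorrectly.

```python
import sqlite3

def get_customer_order_summary(db_path: str, customer_id: int) -> dict:
    """Hypothetical wrapper tool: bundles three queries the agent would
    otherwise have to sequence itself into one pre-tested, read-only
    operation with a fixed output shape."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        customer = conn.execute(
            "SELECT id, name, tier FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        if customer is None:
            return {"error": f"unknown customer {customer_id}"}
        order_count = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (customer_id,)
        ).fetchone()[0]
        total_spend = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()[0]
    # The agent only ever sees this single tool and its fixed schema.
    return {
        "customer": dict(customer),
        "order_count": order_count,
        "total_spend": total_spend,
    }
```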
Human-designed workflows dominate. 80% of successful deployments use "structured control flow", meaning humans draw the flowchart and the AI simply fills in the blanks at predetermined decision points. The agent isn't autonomously planning; it's following a script.
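As a sketch of what that looks like in code (the ticket-routing scenario is invented for illustration, and `call_llm` is a hypothetical stand-in for whatever model API you use): the branching logic is fixed by a human, and the model only makes bounded choices at predetermined points.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model API of choice."""
    raise NotImplementedError

def handle_ticket(ticket_text: str) -> str:
    """Human-designed workflow: the flowchart lives in ordinary code, and the
    LLM only fills in the blanks (classify, draft) at fixed decision points."""
    category = call_llm(
        "Classify this support ticket as exactly one of "
        "'billing', 'bug', or 'other':\n" + ticket_text
    ).strip().lower()

    if category == "billing":
        # The branch itself is predetermined; the agent never decides whether to escalate.
        return "escalate_to_billing_team"
    if category == "bug":
        draft = call_llm(
            "Draft a reproduction checklist for this bug report:\n" + ticket_text
        )
        return "file_bug_report:\n" + draft
    # Anything the model can't confidently route falls back to a human.
    return "route_to_human_agent"
```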
Agents require massive instruction sets. 12% of deployed systems use prompts exceeding 10,000 tokens (roughly 7,500 words of instructions). These aren't lightweight assistants, they're heavily engineered systems with extensive guardrails.
In essence, today's successful AI agents work like tireless interns with good reading comprehension, useful within a tightly defined process, capable of handling some ambiguity, but not the autonomous problem-solvers the marketing suggests.
So why are production systems so constrained? Two papers from Google DeepMind, published in late 2025, may provide the answers by systematically disproving the core assumptions behind agent scaling:
Limitation #1: More Agents ≠ Better Performance
DeepMind's paper "Towards a Science of Scaling Agent Systems" tackled a seductive idea: if one AI isn't smart enough, why not create a whole team? Imagine GPT handling product management, Claude writing code, and Gemini running tests—a virtual software company where PhD-level AIs collaborate to solve any problem.
It sounds logical. After all, that's how human organizations scale. But 180 controlled experiments later, DeepMind proved this intuition wrong.
The Experiment Setup
The researchers tested five different ways to organize AI agents:
- Single Agent: One AI handles everything (think: a solo developer)
- Independent Multi-Agent: Multiple AIs work on the same problem separately, then their answers are combined through voting (think: getting multiple opinions, then picking the consensus)
- Decentralized Multi-Agent: Agents communicate directly with each other to negotiate solutions (think: a peer-to-peer discussion group)
- Centralized Multi-Agent: One "manager" agent assigns tasks and verifies results (think: a team with a project manager)
- Hybrid: A combination of centralized coordination with peer communication
They tested these architectures using top models from OpenAI, Google, and Anthropic across four different task types: financial analysis, web browsing, game planning (Minecraft-style crafting), and general workflows.
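To make two of these patterns concrete, here is a minimal sketch of the independent (vote) and centralized (manager) setups. This is an illustration only, not DeepMind's experimental code, and `call_llm` is again a hypothetical stand-in for a model API.

```python
from collections import Counter

def call_llm(prompt: str, model: str) -> str:
    """Placeholder for whatever model API you use."""
    raise NotImplementedError

def independent_vote(task: str, models: list[str]) -> str:
    """Independent multi-agent: each model answers alone, then majority vote."""
    answers = [call_llm(task, model=m) for m in models]
    return Counter(answers).most_common(1)[0][0]

def centralized(task: str, manager: str, workers: list[str]) -> str:
    """Centralized multi-agent: a manager decomposes the task, farms out the
    subtasks, then verifies and merges the workers' results."""
    plan = call_llm(
        f"Break this task into {len(workers)} subtasks, one per line:\n{task}",
        model=manager,
    )
    subtasks = [line for line in plan.splitlines() if line.strip()][: len(workers)]
    results = [call_llm(sub, model=w) for sub, w in zip(subtasks, workers)]
    return call_llm(
        "Verify and combine these partial results into one answer:\n"
        + "\n---\n".join(results),
        model=manager,
    )
```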
The Core Finding
DeepMind discovered a formula that predicts agent system performance with average 87% accuracy:
Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)
The key insight: the costs often outweigh the benefits. When coordination overhead, miscommunication, and tool management burden exceed the gains from parallelization, adding more agents makes systems worse, not better.
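Plugging in some illustrative numbers (mine, not the paper's) shows how quickly the balance can flip:

```python
def net_performance(individual: float, collaboration: float,
                    coordination: float, communication: float,
                    tool_complexity: float) -> float:
    """Toy version of the scaling formula, in arbitrary 'performance points'."""
    return (individual + collaboration) - (coordination + communication + tool_complexity)

# A well-structured task: parallel work more than pays for its overhead.
print(net_performance(60, 25, coordination=5, communication=5, tool_complexity=3))    # 72

# An open-ended task: overhead swamps the gains and the team scores below
# the single agent's baseline of 60.
print(net_performance(60, 10, coordination=20, communication=15, tool_complexity=10))  # 25
```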
The results varied dramatically by task type:
- Financial analysis: Multi-agent systems helped (up to 81% improvement with centralized architecture)
- Web browsing: Minimal benefit; errors actually got amplified
- Game planning (PlanCraft): Multi-agent systems performed significantly worse than single agents
- General workflows: Mixed results; decentralized approaches slightly better
Why Multi-Agent Systems Fail: Three Key Reasons
1. The Coordination Tax
In complex, open-ended tasks, adding more agents makes the system dumber, not smarter.
Consider the PlanCraft benchmark (a Minecraft-style planning task). When Anthropic's Claude model was put into a multi-agent setup, performance dropped by 35%. Why? Every agent must understand tool interfaces, maintain conversation context, and process results from other agents. When the tool count exceeds a threshold, agents spend all their "mental bandwidth" on coordination, reading documentation and attending virtual meetings, with no capacity left for actual problem-solving.
2. The Capability Saturation Effect
When a single agent can already solve a problem with greater than 45% accuracy, adding more agents typically provides diminishing or negative returns.
The logic is straightforward: if one agent can correctly answer "What is 2+2?", having three agents debate the answer for an hour won't improve accuracy, it just wastes resources.
3. Error Amplification (The Most Surprising Finding)
Intuitively, you'd expect that having three agents vote on an answer would reduce errors, the wisdom-of-crowds effect. But DeepMind found the opposite.
In independent multi-agent systems (where agents work separately then vote), errors don't cancel out—they multiply. The paper quantifies this with an "error amplification factor" of 17.2. This means if a single agent has a 5% error rate, an independent multi-agent system can have an error rate as high as 86% (5% × 17.2).
Why does this happen? Without cross-verification during reasoning, each agent makes errors based on its own flawed logic. These errors become self-reinforcing within each agent's context. When you aggregate three independently wrong answers through voting, you're not getting wisdom, you're getting confident wrongness.
Limitation #2: More Thinking Time ≠ Better Results
If adding more agents doesn't work, what about giving a single agent more time to think?
After OpenAI released its o1 model, "test-time compute" became a buzzword. The idea: let AI models "think longer" by giving them more computational budget during inference. Search more, reason more, and eventually they'll find the answer, right?
DeepMind's paper "Budget-Aware Tool-Use Enables Effective Agent Scaling" tested this assumption—and found it largely false.
The Experiment
Researchers increased an agent's "tool-call budget", the number of web searches, API calls, or other actions it could perform, from 10 to 100. The expectation: 10x more resources should yield significantly better results.
The reality: doubling the budget improved accuracy by only 0.2 percentage points.
Even more telling: when given a budget of 100 tool calls, agents only used an average of 14.24 searches and 1.36 browsing sessions. They left 85% of their budget untouched.
Why Agents Can't Use Extra Resources Effectively
The core problem: agents don't know what they don't know, and they don't track their remaining budget.
When an agent goes down a wrong path, say, searching for a paper title that doesn't exist, it has no concept of opportunity cost. It will keep digging deeper into a dead end rather than trying a different approach. Give it unlimited compute, and it just digs a deeper hole.
Making matters worse, long conversation contexts cause "attention drift". After a dozen failed searches, the agent gets lost in its own accumulated noise, the search results, error messages, and dead ends it generated. Performance actually declines as context grows.
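One mitigation I've found useful in my own projects (my practice, not something from the paper) is to stop carrying full transcripts of failed attempts and instead keep a compact ledger of what was tried and why it failed, so dead ends still inform the next step without drowning it in noise:

```python
def compact_failure(query: str, error: str, max_len: int = 120) -> str:
    """Reduce a failed attempt to a single short line instead of keeping the
    full transcript (raw search results, stack traces, etc.) in the context."""
    return f"TRIED: {query[:max_len]} -> FAILED: {error[:max_len]}"

def build_context(task: str, failures: list[str], latest_evidence: str) -> str:
    """Context passed to the model at each step: the task, a capped list of
    dead ends to avoid repeating, and only the most recent useful evidence."""
    tried = "\n".join(failures[-10:])  # cap even the compact ledger
    return (
        f"Task: {task}\n\n"
        f"Approaches already tried and failed (do not repeat):\n{tried}\n\n"
        f"Most recent evidence:\n{latest_evidence}"
    )
```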
A Potential Solution: Budget-Aware Agents (BATS)
DeepMind proposed a framework called BATS (Budget-Aware Test-time Scaling) that addresses these issues with two key mechanisms, sketched in code after the list:
1. Budget-Aware Planning: Instead of making a fixed plan upfront, the agent maintains a dynamic task tree. Each node tracks its status (pending, completed, failed) and resource consumption. When budget is plentiful, expand exploration; when budget is tight, focus on verification.
2. Budget-Aware Verification: After proposing an answer, a separate verification step checks constraints: What's confirmed? What's contradicted? What can't be verified? Based on this assessment and remaining budget, the agent decides whether to dig deeper or abandon the current path.
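Here is a minimal sketch of the budget-aware planning idea as I read it from the paper's description; the data structures, field names, and the 30% threshold are my own simplifications, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node in the dynamic task tree: what to do, how it went, what it cost."""
    goal: str
    status: str = "pending"          # pending | completed | failed
    tool_calls_spent: int = 0
    children: list["TaskNode"] = field(default_factory=list)

@dataclass
class Budget:
    total: int
    spent: int = 0

    @property
    def remaining_fraction(self) -> float:
        return max(0.0, (self.total - self.spent) / self.total)

def choose_mode(budget: Budget) -> str:
    """Budget-aware planning: explore while resources are plentiful, switch to
    verifying the current best answer when they run low."""
    return "explore" if budget.remaining_fraction > 0.3 else "verify"

def record_tool_call(node: TaskNode, budget: Budget, succeeded: bool) -> None:
    """Update the tree and the budget after each tool call; cut losses on a
    branch that keeps failing instead of digging deeper."""
    node.tool_calls_spent += 1
    budget.spent += 1
    if not succeeded and node.tool_calls_spent >= 3:
        node.status = "failed"
```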
The results were significant:
- BrowseComp benchmark: 24.6% accuracy (95% improvement over standard approaches at 12.6%)
- BrowseComp-ZH: 46.0% accuracy (46% improvement over 31.5%)
- Cost efficiency: 40% lower total cost (tokens + tool calls) at equivalent accuracy
The lesson: raw thinking time isn't enough. Agents need structured self-reflection, the ability to recognize dead ends, and the wisdom to cut losses early.
What's Actually Needed for AI Agents to Work?
Let's return to DeepMind's core formula:
Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)
The problem isn't a lack of smarter models or more compute. The problem is that the negative factors (coordination overhead, communication noise, and tool complexity) are overwhelming the positive ones. All of these boil down to one root cause: inefficient use of context.
Every token spent on coordination, error recovery, or tool documentation is a token not spent on actual problem-solving. To make agents work, we need to reduce this context burden, not pile on more resources.
Three Directions That Actually Work
1. Smarter Tool Management
The financial analysis result (up to 81% improvement with a centralized multi-agent setup) is instructive: it worked because the task had clear boundaries and well-defined steps.
Financial analysis follows a predictable pattern: read report → extract data → calculate ratios → generate summary. Each agent fills in blanks within a predetermined framework, no creative planning required.
This tells us something important: current AI models cannot self-organize division of labor. They can handle easily parallelizable tasks (like financial analysis) or consensus-based error correction (like multi-path search), but not emergent collaboration.
The implication? For complex tasks, human-designed task decomposition (SOPs) remains necessary. The dream of throwing agents together and watching them spontaneously develop hierarchies has been empirically disproven.
This is why Anthropic's Skills mechanism matters: it lets agents accumulate reusable capability modules instead of starting from scratch, reducing the cognitive load of tool management.
2. Built-in Self-Verification
BATS works because it formalizes verification. The system explicitly tracks constraints: what's satisfied, what's contradicted, what can't be verified yet. This isn't emergent behavior, it's enforced through careful prompt engineering.
Without structured verification, errors accumulate silently. Each mistake pollutes the context with garbage that degrades future reasoning. Formal verification catches errors early, preventing context pollution.
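A sketch of what formalized verification can look like in practice (a minimal version of the idea in my own code, not BATS itself): every candidate answer is checked against an explicit constraint ledger before it is accepted.

```python
from dataclasses import dataclass

@dataclass
class Constraint:
    description: str                 # e.g. "the paper was published before 2020"
    status: str = "unverified"       # satisfied | contradicted | unverified

def verdict(constraints: list[Constraint], budget_left: int) -> str:
    """Decide what to do with a candidate answer based on the ledger."""
    if any(c.status == "contradicted" for c in constraints):
        return "reject"              # the answer is provably wrong
    unverified = [c for c in constraints if c.status == "unverified"]
    if unverified and budget_left > 0:
        return "keep_searching"      # spend remaining budget on checks
    return "accept"                  # everything checkable checks out

# Example: one confirmed constraint, one still open, five tool calls left.
ledger = [Constraint("author affiliation matches", "satisfied"),
          Constraint("publication year is 2019", "unverified")]
print(verdict(ledger, budget_left=5))  # -> "keep_searching"
```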
3. Efficient Inter-Agent Communication
Current agents coordinate via natural language, which is verbose, ambiguous, and requires constant clarification. This high-volume message traffic is inherently wasteful.
Future improvements might come from several directions (the first is sketched in code after this list):
- Structured communication protocols (like Google's A2A framework)
- Latent-space communication where models exchange compressed representations rather than text
- Shared memory architectures that reduce redundant information exchange
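As an illustration of the first point, here is what a structured inter-agent message might look like compared with free-form prose. The schema is hypothetical and far simpler than Google's actual A2A specification; it only shows the idea of fixed, machine-readable fields.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AgentMessage:
    """Hypothetical structured message: fixed fields instead of free-form prose,
    so the receiving agent spends no tokens working out what is being asked."""
    sender: str
    intent: str            # e.g. "request_subtask", "return_result", "report_error"
    task_id: str
    payload: dict
    requires_response: bool = True

msg = AgentMessage(
    sender="researcher",
    intent="return_result",
    task_id="q-17",
    payload={"answer": "42", "sources": ["search_run_03"]},
    requires_response=False,
)
print(json.dumps(asdict(msg), indent=2))
```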
Until these three capabilities mature (smarter tool management, built-in self-verification, and efficient inter-agent communication), multi-agent systems will continue to underperform their theoretical potential.
The Bottom Line
Despite the hype, the "Year of the Agent" hasn't truly arrived. The research tells us:
- Production agents are heavily constrained: most limited to 10 steps or fewer, running within human-designed workflows
- More agents often means worse performance: coordination costs overwhelm collaboration benefits
- More compute doesn't help much: agents don't know how to use extra resources effectively
- The path forward is reducing overhead, not adding more power
Current successful AI agents are best understood as capable interns with good reading comprehension, working within strict SOPs. They handle ambiguity better than traditional software, but they're not autonomous problem-solvers.
For practitioners, the implication is clear: invest in workflow design, tool abstraction, and structured verification rather than chasing multi-agent architectures or unlimited compute budgets. The engineering fundamentals—not the scaling laws—determine success.
References
- Measuring Agents in Production - UC Berkeley (December 2025)
- Towards a Science of Scaling Agent Systems - Google DeepMind (December 2025)
- Budget-Aware Tool-Use Enables Effective Agent Scaling - Google DeepMind (November 2025)
- Agent Skills - Claude Docs - Anthropic
