TL;DR
- Stanford (Tran & Kiela, arXiv 2604.02460) tested single-agent vs multi-agent systems with identical thinking-token budgets
- Single agent wins on accuracy AND on compute, across three model families
- The mechanism is information theory — every handoff loses information (Data Processing Inequality)
- The Gemini 2.5 API has token-budget enforcement artifacts that biased a year of prior benchmarks
The hidden variable nobody controls for
When you compare a single-agent LLM to a multi-agent orchestration (CrewAI, AutoGen, LangGraph), most published benchmarks let the multi-agent system spend 2–4x more reasoning tokens than the single agent — longer traces, more intermediate steps, more coordination passes.
The variable nobody controls for is the thinking-token budget. The multi-agent system wins because it's allowed to think for longer.
Pin the budget. The advantage disappears.
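To make the confound concrete, here is a minimal accounting sketch (the token counts are hypothetical, chosen only to show the shape of the bias):

```python
# Hypothetical thinking-token counts for ONE benchmark question.
single_agent_trace = [2_000]                     # one end-to-end reasoning pass
multi_agent_trace = [1_500, 1_200, 1_800, 900]   # planner, 2 workers, judge

sas_total = sum(single_agent_trace)
mas_total = sum(multi_agent_trace)

print(f"single agent: {sas_total} thinking tokens")
print(f"multi-agent:  {mas_total} thinking tokens ({mas_total / sas_total:.1f}x)")
# The multi-agent run quietly spends 2.7x more thinking.
# A fair benchmark pins these two totals to the same number.
```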
What Stanford did
Tran and Kiela built the experiment around one strict constraint: they fixed the thinking-token budget — the number of tokens spent on intermediate reasoning, separate from the input prompt and the final answer.
Models tested:
- Qwen3
- DeepSeek-R1-Distill-Llama
- Gemini 2.5
Datasets: FRAMES, MuSiQue 4-hop (multi-hop reasoning)
Budgets: 100 to 10,000 thinking tokens
Architectures compared: a single-agent system (SAS) vs. five multi-agent system (MAS) variants (Sequential, Subtask-parallel, Parallel-roles, Debate, Ensemble); a sketch of the comparison grid follows below
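A minimal sketch of that grid as I read it from the setup above; `run_condition` is a hypothetical stand-in for the actual evaluation harness:

```python
from itertools import product

MODELS = ["qwen3", "deepseek-r1-distill-llama", "gemini-2.5"]
ARCHITECTURES = ["single-agent", "sequential", "subtask-parallel",
                 "parallel-roles", "debate", "ensemble"]
BUDGETS = [100, 500, 1_000, 5_000, 10_000]   # total thinking tokens per question
DATASETS = ["frames", "musique-4hop"]

def run_condition(model: str, arch: str, budget: int, dataset: str) -> float:
    """Hypothetical harness call. The one hard requirement: `budget` caps
    the TOTAL thinking tokens across ALL agent calls, not each call."""
    return float("nan")  # replace with a real evaluation

accuracy = {
    cond: run_condition(*cond)
    for cond in product(MODELS, ARCHITECTURES, BUDGETS, DATASETS)
}
```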
Across all three model families, with budget held constant, single agent produced higher-accuracy answers and consumed less compute on average than the multi-agent systems.
The Gemini 2.5 API bias
The methodology section has a line that hits harder than the headline result:
"significant artifacts in API-based budget control, particularly in Gemini 2.5"
In plain language: when researchers request a fixed number of thinking tokens from the API, the single agent often stops well short of it. The multi-agent system, running multiple separate calls, surfaces more total visible thinking under the same nominal budget.
The cap is not the cap. Every prior multi-agent benchmark that trusted those labels as the fairness control was comparing two things that were never the same size.
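You can probe this yourself. A sketch using the google-genai SDK (this assumes the `thinking_budget` config and `thoughts_token_count` usage field as currently documented; verify against your SDK version):

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

REQUESTED = 2_048  # thinking tokens we ask for

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Which was founded first, the Bodleian Library or the Ambrosiana?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=REQUESTED),
    ),
)

used = response.usage_metadata.thoughts_token_count or 0
print(f"requested {REQUESTED}, model actually spent {used} thinking tokens")
# If `used` routinely lands far below REQUESTED, the budget is a ceiling,
# not a spend -- exactly the "cap is not the cap" artifact described above.
```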
A year of architecture decisions, framework adoption, vendor pitches — much of it stacked on benchmarks that didn't measure what they claimed to measure.
Why it works this way (information theory)
The theoretical argument the paper builds on is the Data Processing Inequality — a foundational result from Shannon (1948) and Fano (1952).
The principle: once you have a piece of information, no amount of further processing can add information to it. You can only preserve it or lose it.
When you split a reasoning task across multiple agents, every handoff is a processing step. Each agent receives a summary of what the previous agent did, not the full chain of reasoning. The summary is lossy by definition.
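Formally (the textbook statement, not notation from the paper): if agent B sees only agent A's summary, then the task input X, the summary Y, and B's output Z form a Markov chain, and the Data Processing Inequality bounds what B can recover about X:

```latex
% If Z depends on X only through Y (a Markov chain), then
% the information Z carries about X cannot exceed what Y carries:
X \to Y \to Z \quad \Longrightarrow \quad I(X;Z) \le I(X;Y)
% Equality requires Y to be a sufficient statistic for X, i.e. a
% lossless summary -- which real agent handoffs rarely are.
```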
A single agent reasoning end-to-end never has to compress and re-expand its own thinking through someone else. The chain stays intact.
More agents do not add intelligence. They add stages where information can leak out.
When multi-agent still wins
To be fair to the architecture, the paper identifies the conditions under which MAS wins:
Context fragmentation — when the input is so long or heterogeneous that one agent can't hold the relevant pieces in working memory. Splitting it across specialists with cleaner, smaller contexts recovers ground (see the sketch after these conditions).
More compute is allowed — if the budget is genuinely larger, more agents can buy more accuracy.
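A sketch of the fragmentation case (the `ask` helper is hypothetical; the point is the shape, not the API): each specialist works over one small, clean slice of the input instead of one agent juggling the whole corpus:

```python
def ask(context: str, question: str) -> str:
    """Hypothetical single LLM call scoped to one slice of the input."""
    raise NotImplementedError  # wire up to your provider of choice

def fragmented_answer(documents: list[str], question: str) -> str:
    # Each specialist sees ONE document: a small, clean context.
    notes = [ask(doc, question) for doc in documents]
    # The synthesizer reasons over short notes, not the raw corpus.
    return ask("\n\n".join(notes), f"Combine these notes to answer: {question}")
```

Note that this only pays off when no single context window can comfortably hold the relevant pieces; if one can, the handoff through notes is pure information loss.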
Decision boundary:
| Problem type | Architecture |
|---|---|
| Reasoning depth (multi-hop logic, chained inference) | Single agent |
| Context fragmentation (long heterogeneous docs, parallel sub-tasks) | Multi-agent |
Most multi-agent deployments in the wild are reasoning-depth problems mislabeled as fragmentation problems.
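The same boundary condensed into a first-pass routing heuristic (my reading of the table above, not code from the paper):

```python
def choose_architecture(fits_one_context: bool) -> str:
    """First-pass routing heuristic condensed from the decision table."""
    if fits_one_context:
        return "single agent"  # reasoning depth: keep the chain intact
    return "multi-agent"       # fragmentation: split the context, not the logic
```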
The cheap experiment to run first
Before you build the next multi-agent system (a runnable sketch follows this checklist):
- Take the task you were going to give to four agents
- Give it to one agent
- Match the total token budget — give the single agent room to think for as long as your multi-agent system would have, in aggregate
- Add an explicit pre-answer analysis prompt — tell it to reason step by step before responding
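A minimal sketch of that experiment. `call_llm` is a hypothetical wrapper around your provider, and the pre-answer prompt wording is mine, not the paper's:

```python
def call_llm(prompt: str, max_thinking_tokens: int) -> str:
    """Hypothetical provider wrapper. It must enforce (and ideally report)
    thinking tokens actually spent, not just the requested cap."""
    return "(model answer would appear here)"  # replace with a real API call

# Match the aggregate budget of the multi-agent design you were planning.
N_AGENTS = 4
PER_AGENT_BUDGET = 2_000
TOTAL_BUDGET = N_AGENTS * PER_AGENT_BUDGET  # 8,000 thinking tokens

# Explicit pre-answer analysis prompt (hypothetical wording).
PRE_ANSWER = (
    "Before answering, reason step by step: list the sub-questions, "
    "answer each in turn, then combine them into one final answer."
)

task = "Your multi-hop question here."
answer = call_llm(
    prompt=f"{PRE_ANSWER}\n\nTask: {task}",
    max_thinking_tokens=TOTAL_BUDGET,  # all the thinking, none of the handoffs
)
```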
If the single agent matches the multi-agent result — and the paper says, on reasoning tasks, it usually will — you've just saved yourself an orchestration layer, a coordination cost, a debugging surface, and the latency of every handoff.
The paper's quieter finding: single-agent prompts with explicit pre-answer analysis recover most of what looks like a "collaboration benefit" in multi-agent traces. The collaboration wasn't the source of the gain. The extra thinking was.
You can have the extra thinking without the extra agents.
Full breakdown in video form
I walked through the methodology, the Gemini 2.5 API bias, the information theory, and the practical decision boundary in 12 minutes.
Sources
- Tran, D., Kiela, D. — Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. arXiv:2604.02460. https://arxiv.org/abs/2604.02460
- Data Processing Inequality — Shannon (1948), Fano (1952), Cover & Thomas, Elements of Information Theory (2006)