TL;DR
- Stanford (Tran & Kiela, arXiv 2604.02460) tested single-agent vs multi-agent systems with identical thinking-token budgets
- Single agent wins on accuracy AND on compute, across three model families
- The mechanism is information theory — every handoff loses information (Data Processing Inequality)
- The Gemini 2.5 API has token-budget enforcement artifacts that biased a year of prior benchmarks
The hidden variable nobody controls for
When you compare a single-agent LLM to a multi-agent orchestration (CrewAI, AutoGen, LangGraph), most published benchmarks let the multi-agent system spend 2–4x more reasoning tokens than the single agent — longer traces, more intermediate steps, more coordination passes.
The variable nobody controls for is the thinking-token budget. The multi-agent system wins because it's allowed to think for longer.
Pin the budget. The advantage disappears.
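To make the confound concrete, here is a minimal accounting sketch (the token counts are hypothetical, chosen only to show the shape of the bias):

```python
# Hypothetical thinking-token counts for ONE benchmark question.
single_agent_trace = [2_000]                     # one end-to-end reasoning pass
multi_agent_trace = [1_500, 1_200, 1_800, 900]   # planner, 2 workers, judge

sas_total = sum(single_agent_trace)
mas_total = sum(multi_agent_trace)

print(f"single agent: {sas_total} thinking tokens")
print(f"multi-agent:  {mas_total} thinking tokens ({mas_total / sas_total:.1f}x)")
# The multi-agent run quietly spends 2.7x more thinking.
# A fair benchmark pins these two totals to the same number.
```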
What Stanford did
Tran and Kiela built the experiment around one strict constraint: they fixed the thinking-token budget — the number of tokens spent on intermediate reasoning, separate from the input prompt and the final answer.
Models tested:
- Qwen3
- DeepSeek-R1-Distill-Llama
- Gemini 2.5
Datasets: FRAMES, MuSiQue 4-hop (multi-hop reasoning)
Budgets: 100 to 10,000 thinking tokens
Architectures compared: a single-agent system (SAS) vs. five multi-agent system (MAS) variants (Sequential, Subtask-parallel, Parallel-roles, Debate, Ensemble); a sketch of the comparison grid follows below
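A minimal sketch of that grid as I read it from the setup above; `run_condition` is a hypothetical stand-in for the actual evaluation harness:

```python
from itertools import product

MODELS = ["qwen3", "deepseek-r1-distill-llama", "gemini-2.5"]
ARCHITECTURES = ["single-agent", "sequential", "subtask-parallel",
                 "parallel-roles", "debate", "ensemble"]
BUDGETS = [100, 500, 1_000, 5_000, 10_000]   # total thinking tokens per question
DATASETS = ["frames", "musique-4hop"]

def run_condition(model: str, arch: str, budget: int, dataset: str) -> float:
    """Hypothetical harness call. The one hard requirement: `budget` caps
    the TOTAL thinking tokens across ALL agent calls, not each call."""
    return float("nan")  # replace with a real evaluation

accuracy = {
    cond: run_condition(*cond)
    for cond in product(MODELS, ARCHITECTURES, BUDGETS, DATASETS)
}
```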
Across all three model families, with budget held constant, single agent produced higher-accuracy answers and consumed less compute on average than the multi-agent systems.
The Gemini 2.5 API bias
The methodology section has a line that hits harder than the headline result:
"significant artifacts in API-based budget control, particularly in Gemini 2.5"
In plain language: when researchers request a fixed number of thinking tokens from the API, the single agent often stops well short of it. The multi-agent system, running multiple separate calls, surfaces more total visible thinking under the same nominal budget.
The cap is not the cap. Every prior multi-agent benchmark that trusted those labels as the fairness control was comparing two things that were never the same size.
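You can probe this yourself. A sketch using the google-genai SDK (this assumes the `thinking_budget` config and `thoughts_token_count` usage field as currently documented; verify against your SDK version):

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

REQUESTED = 2_048  # thinking tokens we ask for

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Which was founded first, the Bodleian Library or the Ambrosiana?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=REQUESTED),
    ),
)

used = response.usage_metadata.thoughts_token_count or 0
print(f"requested {REQUESTED}, model actually spent {used} thinking tokens")
# If `used` routinely lands far below REQUESTED, the budget is a ceiling,
# not a spend -- exactly the "cap is not the cap" artifact described above.
```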
A year of architecture decisions, framework adoption, vendor pitches — much of it stacked on benchmarks that didn't measure what they claimed to measure.
Why it works this way (information theory)
The theoretical argument the paper builds on is the Data Processing Inequality — a foundational result from Shannon (1948) and Fano (1952).
The principle: once you have a piece of information, no amount of further processing can add information to it. You can only preserve it or lose it.
When you split a reasoning task across multiple agents, every handoff is a processing step. Each agent receives a summary of what the previous agent did, not the full chain of reasoning. The summary is lossy by definition.
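Formally (the textbook statement, not notation from the paper): if agent B sees only agent A's summary, then the task input X, the summary Y, and B's output Z form a Markov chain, and the Data Processing Inequality bounds what B can recover about X:

```latex
% If Z depends on X only through Y (a Markov chain), then
% the information Z carries about X cannot exceed what Y carries:
X \to Y \to Z \quad \Longrightarrow \quad I(X;Z) \le I(X;Y)
% Equality requires Y to be a sufficient statistic for X, i.e. a
% lossless summary -- which real agent handoffs rarely are.
```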
A single agent reasoning end-to-end never has to compress and re-expand its own thinking through someone else. The chain stays intact.
More agents do not add intelligence. They add stages where information can leak out.
When multi-agent still wins
To be fair to the architecture, the paper identifies the conditions under which MAS wins:
Context fragmentation — when the input is so long or heterogeneous that one agent can't hold the relevant pieces in working memory. Splitting it across specialists with cleaner, smaller contexts recovers ground (see the sketch after these conditions).
More compute is allowed — if the budget is genuinely larger, more agents can buy more accuracy.
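A sketch of the fragmentation case (the `ask` helper is hypothetical; the point is the shape, not the API): each specialist works over one small, clean slice of the input instead of one agent juggling the whole corpus:

```python
def ask(context: str, question: str) -> str:
    """Hypothetical single LLM call scoped to one slice of the input."""
    raise NotImplementedError  # wire up to your provider of choice

def fragmented_answer(documents: list[str], question: str) -> str:
    # Each specialist sees ONE document: a small, clean context.
    notes = [ask(doc, question) for doc in documents]
    # The synthesizer reasons over short notes, not the raw corpus.
    return ask("\n\n".join(notes), f"Combine these notes to answer: {question}")
```

Note that this only pays off when no single context window can comfortably hold the relevant pieces; if one can, the handoff through notes is pure information loss.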
Decision boundary:
| Problem type | Architecture |
|---|---|
| Reasoning depth (multi-hop logic, chained inference) | Single agent |
| Context fragmentation (long heterogeneous docs, parallel sub-tasks) | Multi-agent |
Most multi-agent deployments in the wild are reasoning-depth problems mislabeled as fragmentation problems.
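The same boundary condensed into a first-pass routing heuristic (my reading of the table above, not code from the paper):

```python
def choose_architecture(fits_one_context: bool) -> str:
    """First-pass routing heuristic condensed from the decision table."""
    if fits_one_context:
        return "single agent"  # reasoning depth: keep the chain intact
    return "multi-agent"       # fragmentation: split the context, not the logic
```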
The cheap experiment to run first
Before you build the next multi-agent system (a runnable sketch follows this checklist):
- Take the task you were going to give to four agents
- Give it to one agent
- Match the total token budget — give the single agent room to think for as long as your multi-agent system would have, in aggregate
- Add an explicit pre-answer analysis prompt — tell it to reason step by step before responding
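A minimal sketch of that experiment. `call_llm` is a hypothetical wrapper around your provider, and the pre-answer prompt wording is mine, not the paper's:

```python
def call_llm(prompt: str, max_thinking_tokens: int) -> str:
    """Hypothetical provider wrapper. It must enforce (and ideally report)
    thinking tokens actually spent, not just the requested cap."""
    return "(model answer would appear here)"  # replace with a real API call

# Match the aggregate budget of the multi-agent design you were planning.
N_AGENTS = 4
PER_AGENT_BUDGET = 2_000
TOTAL_BUDGET = N_AGENTS * PER_AGENT_BUDGET  # 8,000 thinking tokens

# Explicit pre-answer analysis prompt (hypothetical wording).
PRE_ANSWER = (
    "Before answering, reason step by step: list the sub-questions, "
    "answer each in turn, then combine them into one final answer."
)

task = "Your multi-hop question here."
answer = call_llm(
    prompt=f"{PRE_ANSWER}\n\nTask: {task}",
    max_thinking_tokens=TOTAL_BUDGET,  # all the thinking, none of the handoffs
)
```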
If the single agent matches the multi-agent result — and the paper says, on reasoning tasks, it usually will — you've just saved yourself an orchestration layer, a coordination cost, a debugging surface, and the latency of every handoff.
The paper's quieter finding: single-agent prompts with explicit pre-answer analysis recover most of what looks like a "collaboration benefit" in multi-agent traces. The collaboration wasn't the source of the gain. The extra thinking was.
You can have the extra thinking without the extra agents.
Full breakdown in video form
I walked through the methodology, the Gemini 2.5 API bias, the information theory, and the practical decision boundary in 12 minutes.
Sources
- Tran, D., Kiela, D. — Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. arXiv:2604.02460. https://arxiv.org/abs/2604.02460
- Data Processing Inequality — Shannon (1948), Fano (1952), Cover & Thomas, Elements of Information Theory (2006)