Adding more agents to LLM-driven multi-agent systems degrades performance past a task-dependent optimum, with weaker models peaking at 4 agents and stronger ones at 2.
A new study from researchers at multiple institutions finds that adding more agents to single-LLM multi-agent systems degrades performance past a task-dependent optimum. The paper, shared on X by @omarsar0, reports that weaker models like Llama-3.2-3B peak at 4 agents while stronger models like Llama-3.1-8B top out at 2.
Key facts
- Optimal agent count: 4 for 3B models, 2 for 8B models
- Adding agents past optimum reduces MATH-500 accuracy
- Study tested Llama-3.2-3B, Llama-3.1-8B, GPT-4o-mini
- Information redundancy and coordination overhead identified as failure modes
- Interaction design matters more than agent plurality
The prevailing assumption in multi-agent system design has been that more agents yield better collective intelligence. The preprint challenges that directly, showing that performance rises to a peak and then falls as agents are added, rather than increasing monotonically.
How the study worked
According to the arXiv preprint, the researchers tested single-LLM-driven multi-agent systems across several base models (Llama-3.2-3B, Llama-3.1-8B, GPT-4o-mini) on reasoning benchmarks including MATH-500, GSM8K, and MMLU. They varied agent count from 1 to 10 while keeping the interaction protocol (agent-to-agent communication via structured messages) fixed.
Key finding: For weaker base models (3B parameters), performance climbs from 1 to 4 agents, then declines. For stronger models (8B parameters), the optimum is just 2 agents — adding more reduces accuracy on complex math and reasoning tasks. GPT-4o-mini showed similar early-peak behavior.
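The sweep described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `sweep_agent_counts`, the `evaluate` callback, and the `toy_accuracy` curve are all hypothetical names, and the toy numbers are invented to show the shape of the search rather than taken from the paper.

```python
from typing import Callable, Dict, Tuple


def sweep_agent_counts(
    evaluate: Callable[[int], float],
    min_agents: int = 1,
    max_agents: int = 10,
) -> Tuple[int, Dict[int, float]]:
    """Score every agent count in [min_agents, max_agents] and return the best.

    `evaluate` is a hypothetical callback: it should run the benchmark with
    the given number of agents (interaction protocol held fixed) and return
    an accuracy.
    """
    scores = {n: evaluate(n) for n in range(min_agents, max_agents + 1)}
    # Break ties in favor of fewer agents, since each extra agent adds cost.
    best = min(scores, key=lambda n: (-scores[n], n))
    return best, scores


def toy_accuracy(n_agents: int) -> float:
    # Invented rise-then-fall curve peaking at 4 agents; NOT data from the paper.
    return 0.50 - 0.01 * (n_agents - 4) ** 2


if __name__ == "__main__":
    best, scores = sweep_agent_counts(toy_accuracy)
    print("best agent count:", best)
```

In practice `evaluate` would wrap a real benchmark run, so the sweep costs one full evaluation per agent count.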
Why more agents hurts
Per the arXiv preprint, the paper identifies two failure modes: information redundancy and coordination overhead. As agent count increases, agents produce overlapping reasoning traces, and the single LLM acting as both the agent and the orchestrator struggles to integrate conflicting outputs. "Collective intelligence emerges from interaction design rather than from agent plurality," the authors write.
This echoes findings from earlier work on mixture-of-experts architectures, where routing quality degrades past a certain number of experts. The study extends that insight to multi-agent systems, suggesting that the bottleneck is the base model's capacity to process multi-source inputs, not the number of agents per se.
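One rough way to make the redundancy failure mode concrete is to measure how much agents' reasoning traces overlap. The sketch below uses mean pairwise Jaccard similarity over word sets. That metric is an assumption chosen for illustration and is not the measure used in the paper, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty traces are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_redundancy(traces: List[str]) -> float:
    """Average Jaccard similarity over all pairs of agent traces.

    Returns 0.0 when there are fewer than two traces, since a single agent
    cannot be redundant with itself.
    """
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

If this score climbs as agents are added while accuracy stays flat or drops, the extra agents are probably repeating each other rather than contributing new reasoning.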
Practical implications
For engineers building multi-agent workflows: the default of "add more agents for better reasoning" is likely wrong. The optimal agent count is a function of both the base model's capability and the task complexity — and it is almost always below 5. The paper recommends starting with 2 agents for strong models and 4 for weak ones, then tuning downward.
This result suggests that multi-agent systems are not a free lunch for scaling reasoning. The real lever is interaction design (prompt structure, communication protocol, agent roles), not headcount. Multi-agent frameworks such as CrewAI and AutoGen may need to recalibrate their default configurations.
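The "start at 2 or 4, then tune downward" recommendation can be sketched as a simple descending search. This is an illustrative sketch under stated assumptions: `tune_downward`, `STARTING_AGENTS`, the `evaluate` callback, and the `tolerance` parameter are hypothetical and do not come from the paper or from any framework's API.

```python
from typing import Callable

# Starting points reported in the article: 2 agents for strong base models,
# 4 for weak ones. The "strong"/"weak" keys are illustrative labels.
STARTING_AGENTS = {"strong": 2, "weak": 4}


def tune_downward(
    evaluate: Callable[[int], float],
    start: int,
    tolerance: float = 0.0,
) -> int:
    """Walk down from `start` agents, keeping a smaller count while it holds up.

    `evaluate` is a hypothetical callback returning task accuracy for a given
    agent count. A smaller count is accepted if its score is within
    `tolerance` of the best score seen so far; the search stops at the first
    count that falls below that bar.
    """
    if start < 1:
        raise ValueError("start must be at least 1")
    best_n = start
    best_score = evaluate(start)
    for n in range(start - 1, 0, -1):
        score = evaluate(n)
        if score < best_score - tolerance:
            break  # removing another agent hurts accuracy; stop here
        best_n = n
        best_score = max(best_score, score)
    return best_n
```

A nonzero `tolerance` trades a small amount of accuracy for fewer agents, which may be worthwhile when each agent call is expensive.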
What to watch
Watch for follow-up work testing this scaling behavior with larger base models (70B+) and more sophisticated interaction protocols like role-based delegation. Also monitor whether CrewAI and AutoGen ship updated default agent counts based on this finding.
Originally published on gentic.news