"The agents got confused."
"It went off the rails."
That is how most multi-agent post-mortems read. The vocabulary is too imprecise to be actionable. A planning failure requires a different fix from a communication failure or a verification failure, and treating them as a single category produces interventions that address none of them.
Cemri et al. (2025) built a taxonomy bottom-up from 1,642 annotated execution traces across seven popular frameworks. The result is 14 failure modes in 3 categories, with inter-annotator agreement at Cohen's kappa 0.88. Published at NeurIPS 2025.
The MAST taxonomy
The seven frameworks include AutoGPT, AgentVerse, MetaGPT, and ChatDev. Six expert annotators labeled the traces, with multiple rounds of refinement to resolve disagreements. A Cohen's kappa of 0.88 is conventionally read as strong agreement.
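For readers unfamiliar with the statistic, Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators. The labels are made up for illustration, and how the paper aggregated agreement across its six annotators is not shown here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: fraction of items where both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each annotator labeled at random
    # according to their own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four traces (not data from the paper)
annotator_1 = ["spec", "spec", "verify", "misalign"]
annotator_2 = ["spec", "spec", "verify", "verify"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four traces (p_o = 0.75), chance agreement is p_e = 0.375, and kappa comes out at about 0.6, noticeably lower than the raw agreement rate.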
14 failure modes. 3 categories.
Category 1: Specification and system design (44.2%)
The largest category, and the most tractable.
These failures are introduced at design time. The agent does not break; it faithfully executes a flawed setup.
The five modes:
- Step repetition: the single most frequent mode at 15.7% of all failures
- Disobey task specification (11.8%)
- Disobey role specification
- Loss of conversation history
- Unaware of termination conditions
All five are addressable before the coordination layer exists. A precise task spec, enforced roles, and explicit stop conditions prevent a large share of everything downstream.
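As a rough illustration of explicit stop conditions, here is a minimal sketch of a guard that an agent loop could consult before each step. `LoopGuard`, its thresholds, and the string-valued actions are all hypothetical and not taken from any of the frameworks in the study.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopGuard:
    """Hypothetical guard enforcing explicit stop conditions for one agent run."""
    max_steps: int = 20    # hard step budget (assumed value)
    max_repeats: int = 2   # identical actions tolerated before stopping
    _seen: dict = field(default_factory=dict)
    _steps: int = 0

    def check(self, action: str) -> Optional[str]:
        """Return a stop reason, or None if the run may continue."""
        self._steps += 1
        if self._steps > self.max_steps:
            return "step budget exhausted"
        count = self._seen.get(action, 0) + 1
        self._seen[action] = count
        if count > self.max_repeats:
            return f"step repeated {count} times: {action!r}"
        return None


guard = LoopGuard(max_steps=10, max_repeats=2)
reasons = [guard.check(a) for a in ["plan", "search", "search", "search"]]
```

The third identical `search` trips the guard. A real system would likely compare normalized actions rather than raw strings, but the point is that the stop condition lives in the setup, not in the agent's judgment.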
Category 2: Inter-agent misalignment (32.3%)
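These are the failures that arise from interaction between agents.
Conversation resets. Agents that withhold information another agent needs. Task derailment. Agents that ignore each other's output. Reasoning-action mismatch (13.2%), where an agent decides one thing and does another.
Several of these, such as withholding information or ignoring another agent's output, cannot occur in a single-agent system at all. Others, such as reasoning-action mismatch, can appear with one agent but compound when several agents act on partial context.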
Cognition identified the mechanism: "Actions carry implicit decisions, and conflicting decisions carry bad results." (Walden Yan, Don't Build Multi-Agents, June 2025.) When agents operate from partial context, their decisions conflict in ways that are not visible to any individual agent.
Mitigation requires sharing full agent execution traces, not just inter-agent messages. This is architectural, not a configuration change.
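The difference between message passing and trace sharing can be sketched with a small data structure. Everything here (`TraceEvent`, `SharedTrace`, the event kinds, the example agents) is a hypothetical illustration, not Cognition's implementation or any framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    agent: str
    kind: str      # "thought", "action", or "message" (labels assumed here)
    content: str


class SharedTrace:
    """Append-only log that every agent can read in full before acting."""

    def __init__(self) -> None:
        self._events = []

    def record(self, agent: str, kind: str, content: str) -> None:
        self._events.append(TraceEvent(agent, kind, content))

    def messages_only(self) -> list:
        """What a message-passing design exposes to other agents."""
        return [e for e in self._events if e.kind == "message"]

    def full_context(self) -> list:
        """What a trace-sharing design exposes: thoughts and actions too."""
        return list(self._events)


trace = SharedTrace()
trace.record("planner", "action", "chose REST over GraphQL")
trace.record("planner", "message", "build the API client")
trace.record("coder", "thought", "no protocol given, assuming GraphQL")
```

With `messages_only()`, the coder never sees the planner's implicit decision and the conflict stays invisible. With `full_context()`, the decision is in the log before the coder acts.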
Category 3: Verification and termination (23.5%)
The smallest category, and often the highest-leverage to fix.
Premature termination. No verification. Incorrect verification.
Cemri et al. tested a direct intervention on ChatDev: adding a high-level verification step improved task success by 15.6 percentage points. Tightening role specifications improved it by 9.4 percentage points.
A verification step is contained, measurable, and has the strongest documented effect size in the MAST intervention studies.
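To show why a verification step is easy to contain, here is a minimal sketch of a wrapper that refuses to terminate until a verifier raises no objection. It is not the intervention Cemri et al. applied to ChatDev; `run_with_verification`, `produce`, and `verify` are hypothetical names, and the toy producer and verifier stand in for real agents.

```python
from typing import Callable, Optional


def run_with_verification(
    produce: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple:
    """Accept output only once `verify` returns None (no objection).

    `produce` receives the previous verifier feedback (None on the first
    attempt). Returns (last_output, verified).
    """
    feedback = None
    output = ""
    for _ in range(max_attempts):
        output = produce(feedback)
        feedback = verify(output)
        if feedback is None:
            return output, True
    # Attempt budget spent without passing verification
    return output, False


# Toy stand-ins: the first draft is a placeholder, the second is real
drafts = iter(["TODO", "final report"])
result, ok = run_with_verification(
    produce=lambda feedback: next(drafts),
    verify=lambda out: "still a placeholder" if out == "TODO" else None,
)
```

The wrapper touches only the termination path, which is why this kind of change can be added and measured without redesigning how the agents talk to each other.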
The detection problem
A taxonomy tells you what to look for. Detecting failures automatically after the fact is a separate, harder problem.
Zhang et al. (ICML 2025) benchmarked automated failure attribution on failure logs from 127 LLM multi-agent systems. The best method reached 53.5% accuracy at identifying the responsible agent and only 14.2% at pinpointing the responsible step. Frontier reasoning models performed below the automated baseline on step attribution.
Failures are usually cascades. An early specification ambiguity surfaces ten steps later as a verification failure. The trace does not announce the link.
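The two accuracy figures above measure different things, and a small sketch makes the gap concrete. It assumes each prediction and each ground-truth label is an `(agent, step)` pair and that a step-level hit requires both to match; the data and the function name are hypothetical, and this is not the benchmark's evaluation code.

```python
def attribution_accuracy(predictions, ground_truth):
    """Return (agent_accuracy, step_accuracy) over paired (agent, step) labels."""
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(predictions)
    # Agent-level: only the blamed agent has to match
    agent_hits = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth))
    # Step-level: the blamed agent and the exact step both have to match
    step_hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return agent_hits / n, step_hits / n


# Hypothetical labels for four failed runs
truth = [("coder", 4), ("planner", 1), ("tester", 9), ("coder", 7)]
preds = [("coder", 4), ("planner", 3), ("coder", 9), ("coder", 2)]
agent_acc, step_acc = attribution_accuracy(preds, truth)
```

Here the method names the right agent in three of four runs but the right step in only one. Blaming the right agent is a much weaker claim than locating where the cascade started.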
Where to spend effort, in order
Fix specification failures first. They are the cheapest, they are front-loaded, and preventing them reduces exposure across all three categories.
Add a verification step. It is contained, has a measured effect size, and is the most straightforward architectural addition.
Address misalignment structurally. Share full agent execution traces rather than individual messages. This requires architectural commitment, not a single patch.
Sources: Cemri et al., Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657, NeurIPS 2025; Zhang et al., Which Agent Causes Task Failures and When? arXiv:2505.00212, ICML 2025; Cognition, Don't Build Multi-Agents (Jun 2025).