clearloop for OpenWalrus

Posted on • Originally published at openwalrus.xyz

Why multi-agent workflows fail in production

Multi-agent sounds like the obvious answer: parallelize work, specialize agents,
go faster. And for demos, it works — you can show three agents collaborating on
a feature and it looks impressive.

In production, the failures are consistent enough that Cognition — the team behind
Devin — published a post titled "Don't Build Multi-Agents." The GitHub Blog ran
"Multi-agent workflows often fail. Here's how to engineer ones that don't."

These aren't fringe complaints. They're structural.

Context doesn't travel

The foundational problem: each subagent starts fresh. The only information that
passes between agents is the task prompt string. Everything the parent agent
discovered — the codebase structure, constraints, decisions already made — has
to be re-communicated explicitly or re-discovered from scratch.

The Claude Code docs acknowledge this
directly:

"Subagents might miss the strategic goal or important constraints known to
the parent agent, leading to solutions that are technically correct but not
perfectly aligned with the user's original intent."

In practice this plays out as "context amnesia." One documented case: a user asked
Claude Code to fix failing tests and it repeatedly spawned subagents for work that
could have been done in the main context — burning through tokens with no benefit
because each subagent re-explored files the parent already understood.
GitHub issue #11712
captures a related failure: when agents are resumed, they lose the user prompt that
initiated the resumption, so the resumed agent lacks the context that explains why
it exists.

The community workaround is "Main Agent as Project Manager with State Awareness":
the parent agent maintains a shared context document and explicitly passes relevant
state to each subagent's prompt. This works, but it's manual prompt engineering —
the developer is doing the coordination work that the system should handle.
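The workaround can be sketched in a few lines. This is an illustrative shape, not a real API: the class and function names (`SharedContext`, `subagent_prompt`) are assumptions, and the point is only that the parent serializes its state into every subagent prompt by hand.

```python
# Minimal sketch of "Main Agent as Project Manager": the parent keeps a
# shared context document and prepends it to every subagent prompt.
# All names here are illustrative, not part of any real tool's API.

from dataclasses import dataclass, field

@dataclass
class SharedContext:
    """State the parent maintains so subagents don't re-discover it."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"Goal: {self.goal}", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines += ["Decisions already made:"]
        lines += [f"- {d}" for d in self.decisions]
        return "\n".join(lines)

def subagent_prompt(ctx: SharedContext, task: str) -> str:
    # By default the task string is all a subagent gets; prepending the
    # rendered context is the manual part of the workaround.
    return f"{ctx.render()}\n\nYour task: {task}"

ctx = SharedContext(
    goal="Fix failing tests in the payments module",
    constraints=["Do not modify the public API"],
    decisions=["Tests fail due to a changed fixture, not app code"],
)
prompt = subagent_prompt(ctx, "Update the fixture in tests/payments/")
```

Every new constraint or decision has to be added to this document by hand, which is exactly the coordination overhead the post describes.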

Parallel agents conflict

When agents run in parallel, they make independent decisions about shared state.
Cognition's analysis makes the
problem concrete:

"If a task is 'build a Flappy Bird clone' divided into subtasks, one subagent
might build a Super Mario Bros. background while another builds an incompatible
bird, leaving the final agent to combine these miscommunications."

The GitHub Blog identifies the systemic version of this:

"Agents may close issues that other agents just opened, or ship changes that fail
downstream checks they didn't know existed, because agents make implicit assumptions
about state, ordering, and validation without explicit instructions."

The failure mode compounds. From Towards Data Science:

"When one agent decides something incorrectly, downstream agents assume it's true,
and by discovery time, 10 downstream decisions are built on that error."

This is why Devin avoids parallel agents entirely. It's not a capability limitation —
it's an architectural choice based on the failure modes.

Cost and latency explode

Multi-agent token consumption doesn't scale linearly. The GitHub Blog documents the
production gap:

  • 3-agent workflows that cost $5–50 in demos reach $18,000–90,000/month at scale
  • Response times jump from 1–3 seconds to 10–40 seconds per request
  • Reliability drops from 95–98% in pilots to 80–87% under production load

The underlying cause: every inter-agent handoff requires token-intensive context reconstruction.
The parent encodes its state into a prompt; the subagent re-processes the entire relevant context
from scratch. Multiplied across many agents and many calls, the token budget explodes.
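A back-of-envelope calculation shows why the growth is multiplicative. The numbers below are assumed for illustration, not taken from the GitHub Blog's figures:

```python
# Illustrative arithmetic (all numbers assumed): each handoff re-sends
# the relevant context, so input tokens scale with the number of agents,
# not with the amount of new work.

def total_input_tokens(context_tokens: int, task_tokens: int, agents: int) -> int:
    # Single agent: the context is processed once.
    # Multi-agent: each subagent re-processes the context per handoff.
    return agents * (context_tokens + task_tokens)

single = total_input_tokens(20_000, 1_000, agents=1)  # 21,000 tokens
multi = total_input_tokens(20_000, 1_000, agents=5)   # 105,000 tokens
print(multi / single)  # 5.0 — five times the input-token bill, same context
```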

Cursor's background agents add a different failure dimension: cloud environment
reliability. User-reported failures include Docker builds failing during apt-get
update, git branch push failures, connection dropouts that stall agents mid-task,
and cloud environment initialization errors. Because the compute is remote and
shared, failures that never appear locally show up at scale.

Where each system struggles

[Interactive chart — see original post]
The chart reflects the research above. Claude Code is strong on environment reliability
(local execution) but has no mechanism for context continuity or parallel conflict handling.
Cursor partially addresses parallelism through Git worktrees but has the opposite reliability
profile — cloud execution introduces environment failures. Devin avoids parallel agents
entirely and invests heavily in error recovery through its review agent, which is why
it scores high on those axes but zero on parallel conflict handling.

No system in the current survey scores well across all five dimensions. Context continuity
is the universal weak spot.

Why better models don't fix this

The 2026 AI Agent Report
is direct:

"Most multi-agent failures aren't caused by weak models — they're caused by weak
reasoning architecture. Orchestrating multiple agents with divergent goals, conflicting
information, and cascading failures requires architectural discipline."

Code quality compounds the issue. A January 2026 Stack Overflow Blog analysis
found that AI-generated code includes bugs at 1.5–2x the rate of human-written code when
supervision gaps exist, with 3x the readability issues. Multi-agent workflows create
supervision gaps by design — no single reviewer sees the whole picture.

The integration layer is where failures originate: how agents hand off state, coordinate
writes, report progress, and signal when they're stuck. Models are getting better;
orchestration architecture largely isn't.

What the research says works

The GitHub Blog identifies several patterns that prevent the most common failures:

Typed schemas for inter-agent messages. Without explicit contracts between agents,
every handoff is a natural language interpretation problem. Typed schemas eliminate a
class of coordination errors before they happen.
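A typed message can be as simple as a frozen dataclass with explicit serialization. This is a sketch using only the standard library; the field names are assumptions, not a schema from the GitHub Blog post:

```python
# Sketch of a typed inter-agent message. Field names are illustrative;
# the point is that a missing field fails loudly at parse time instead
# of becoming a natural-language interpretation problem downstream.

from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class HandoffMessage:
    task_id: str
    instruction: str
    constraints: tuple[str, ...]
    parent_decisions: tuple[str, ...]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "HandoffMessage":
        data = json.loads(raw)
        # KeyError here is the contract doing its job: the sender
        # violated the schema, and we find out at the handoff boundary.
        return cls(
            task_id=data["task_id"],
            instruction=data["instruction"],
            constraints=tuple(data["constraints"]),
            parent_decisions=tuple(data["parent_decisions"]),
        )
```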

Explicit handoff contracts. The orchestrator maintains state; workers are stateless
and only know what the orchestrator tells them per-invocation. This is the "Main Agent
as Project Manager" pattern formalized. It's more overhead to design but dramatically
reduces inter-agent confusion.
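The orchestrator/stateless-worker split might look like the sketch below. `call_model` is a placeholder for whatever LLM call the system actually makes; it and the class names are assumptions for illustration:

```python
# Sketch of an orchestrator that owns all state, with stateless workers
# that only know what they are told per invocation.

def call_model(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"<result for: {prompt[:40]}...>"

class Orchestrator:
    """Owns the goal and history; workers carry no memory between tasks."""

    def __init__(self, goal: str):
        self.goal = goal
        self.completed: list[tuple[str, str]] = []  # (task, result)

    def run_worker(self, task: str) -> str:
        # The worker is just a function call: everything it needs is in
        # the prompt, and everything it learns comes back as the result.
        history = "\n".join(f"{t}: {r}" for t, r in self.completed)
        prompt = f"Goal: {self.goal}\nDone so far:\n{history}\nTask: {task}"
        result = call_model(prompt)
        self.completed.append((task, result))
        return result
```

The design cost is that the orchestrator must decide, per task, what state is relevant; the benefit is that no worker can act on stale or implicit state it was never given.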

Budget meters and permission gates. Catching runaway token consumption before it
becomes a $90,000 surprise requires active monitoring. Permission gates before
destructive or expensive operations give the system a chance to pause.
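A budget meter and permission gate together can be a few dozen lines. The thresholds and the `approve` callback are illustrative assumptions, standing in for whatever policy or human-in-the-loop check a real system would use:

```python
# Minimal budget meter and permission gate, assuming a per-call token
# count is available from the underlying API. Thresholds are illustrative.

class BudgetExceeded(Exception):
    pass

class BudgetMeter:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            # Stop the run instead of letting cost compound silently.
            raise BudgetExceeded(f"{self.used} > {self.max_tokens} tokens")

def gated(action: str, destructive: bool, approve) -> bool:
    # Pause before destructive or expensive operations; `approve` is any
    # callable that asks a human or a policy engine for a yes/no.
    return approve(action) if destructive else True

meter = BudgetMeter(max_tokens=50_000)
meter.charge(30_000)  # within budget; a second 30k charge would raise
```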

Observable task state. When agents can report their current status to a shared
registry — not just to their own context — the orchestrator and user can see what's
happening and intervene. This is the problem the
task registry design addresses.
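A shared registry can be sketched as a dict that agents report into. The status vocabulary below is an assumption, not the task registry design the post links to:

```python
# Sketch of a shared task registry: agents report status to a common
# store, so the orchestrator and the user can poll for stuck work.

from enum import Enum

class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    BLOCKED = "blocked"
    DONE = "done"

class TaskRegistry:
    def __init__(self):
        self._tasks: dict[str, tuple[Status, str]] = {}

    def report(self, task_id: str, status: Status, note: str = "") -> None:
        self._tasks[task_id] = (status, note)

    def stuck(self) -> list[str]:
        # Blocked tasks become visible mid-run, so someone can intervene
        # instead of discovering the failure when everything finishes.
        return [t for t, (s, _) in self._tasks.items() if s is Status.BLOCKED]

reg = TaskRegistry()
reg.report("tests", Status.RUNNING)
reg.report("docs", Status.BLOCKED, "waiting on API decision")
print(reg.stuck())  # ['docs']
```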

Checkpointing over re-discovery. Explicit handoff documents (a structured summary
of what's been done, what constraints apply, what decisions have been made) reduce
context amnesia. The cost of writing a handoff document is cheaper than the cost of
a subagent re-exploring the same territory.
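A handoff document is just a structured checkpoint. The field names below are illustrative; the idea is that writing this once per handoff is cheaper than letting the next agent re-explore the same files:

```python
# Sketch of a handoff document as a structured checkpoint. Field names
# are assumptions; any stable, explicit structure serves the purpose.

from dataclasses import dataclass

@dataclass
class HandoffDoc:
    done: list[str]            # what has been completed
    constraints: list[str]     # rules the next agent must not violate
    decisions: list[str]       # choices already made, not to be re-litigated
    open_questions: list[str]  # known unknowns, flagged instead of hidden

    def render(self) -> str:
        sections = [
            ("Done", self.done),
            ("Constraints", self.constraints),
            ("Decisions", self.decisions),
            ("Open questions", self.open_questions),
        ]
        out = []
        for title, items in sections:
            out.append(f"## {title}")
            out += [f"- {item}" for item in items]
        return "\n".join(out)
```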
