Wilson

Posted on May 25

Why Your Multi-Agent System Breaks at 3 AM: Orchestration Patterns That Survive Production

#agents #ai #architecture #production

In one production deployment, the supervisor pattern handled 96.3% of requests without human intervention. The fully emergent "just let five agents figure it out" pattern achieves something closer to a debugging nightmare. The difference isn't the model. It's everything that surrounds the agent when things go wrong.

And they go wrong. At 3 AM on a Friday, when a vendor changes their API format without notice. When a user sends a scanned PDF at 72 DPI with diagonal text. When the model decides the best way to classify an invoice is to re-read it fourteen times because it can't reach a satisfactory confidence score.

This article covers what actually works in multi-agent systems, based on real production deployments and hard-won failure data, not demos.

The Five Patterns That Actually Run in Production

Most articles on multi-agent orchestration list three patterns. Production systems run five, and the distinction matters:

1. Supervisor + Specialists (The Default)

A central supervisor decomposes tasks and routes subtasks to specialist agents. The supervisor consolidates results.

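# Claude Agent SDK: supervisor with specialized subagents.
# Written against the Python SDK's ClaudeAgentOptions / AgentDefinition /
# async query(); check these names against your installed version.
# The prompt strings are placeholders.
import asyncio

from claude_agent_sdk import AgentDefinition, ClaudeAgentOptions, ResultMessage, query

options = ClaudeAgentOptions(
    agents={
        "researcher": AgentDefinition(
            description="Deep research on technical topics. Use for information gathering.",
            prompt="You are a research specialist. Gather and summarize sources.",
            model="sonnet",
        ),
        "writer": AgentDefinition(
            description="Technical writing and editing. Use for content production.",
            prompt="You are a technical writer. Draft clear, accurate prose.",
            model="sonnet",
        ),
        "reviewer": AgentDefinition(
            description="Quality review and fact-checking. Use for verification.",
            prompt="You are a reviewer. Check facts and flag unsupported claims.",
            model="sonnet",
        ),
    },
)


async def main() -> None:
    # query() is an async iterator; the ResultMessage carries the final answer.
    async for message in query(
        prompt="Research, write, and review an article about multi-agent orchestration",
        options=options,
    ):
        if isinstance(message, ResultMessage):
            print(message.result)


asyncio.run(main())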

Production data (from abemon's order processing deployment): 96.3% of requests handled without human intervention, mean cost $0.08/request, p95 latency 12 seconds with four sub-agents.

Why it works: Fault containment. If the document extraction sub-agent fails, the supervisor retries, falls back, or escalates — without losing work from other sub-agents that already completed.

Where it breaks: The supervisor is a single point of failure. When it goes down, everything stops. The fix: run two supervisors in hot standby, or add a health-check circuit breaker.
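A health-check circuit breaker can be sketched in a few lines. This is a minimal illustration, not part of any SDK: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a probe after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: skipping call")
            # Cooldown elapsed: let one probe call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the supervisor's health check in `breaker.call(...)` lets callers switch to a standby or a human queue as soon as the circuit opens, instead of waiting on timeouts.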

2. Pipeline (Sequential Specialists)

Task flows through a fixed sequence: researcher → writer → editor. Each agent has a clear input/output contract.

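# Pipeline pattern: sequential processing with validation gates
import asyncio
from claude_agent_sdk import ResultMessage, query

pipeline_steps = [
    ("research", "Gather facts about multi-agent orchestration patterns"),
    ("draft", "Write a technical article based on these research notes"),
    ("edit", "Review for accuracy, clarity, and completeness"),
]

async def run_step(prompt: str) -> str:
    # Drain the message stream and keep only the final result text.
    final = ""
    async for message in query(prompt=prompt):
        if isinstance(message, ResultMessage):
            final = message.result or ""
    return final

async def run_pipeline() -> str:
    context = ""
    for step_name, prompt in pipeline_steps:
        output = await run_step(f"{prompt}\n\nContext:\n{context}")
        # Validation gate: reject empty or truncated output before it reaches the next stage
        if len(output.strip()) < 50:
            raise ValueError(f"Pipeline step '{step_name}' produced insufficient output")
        context += f"\n## {step_name} output:\n{output}"
    return context

print(asyncio.run(run_pipeline()))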

Why it works: Predictable cost, easy to eval each step independently, low latency overhead. Perfect for tasks that naturally decompose into linear stages.

Where it breaks: No parallelism. If any step is slow, the whole pipeline is slow. If any step fails, the whole pipeline stalls. Mitigation: add retry logic and timeout circuit breakers at each stage.
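The retry-and-timeout mitigation might look like the sketch below. It uses only the standard library; `run_stage`, `TransientError`, and the default limits are hypothetical names and values chosen for illustration, and `stage` stands in for any async callable that runs one pipeline step.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for whatever retryable error your stage raises."""


async def run_stage(stage, payload, *, timeout_s=30.0, max_attempts=3, base_delay_s=0.5):
    """Run one async stage with a per-attempt timeout and bounded, backed-off retries."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(stage(payload), timeout=timeout_s)
        except (asyncio.TimeoutError, TransientError) as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"stage failed after {max_attempts} attempts") from last_error
```

Because the attempt count is bounded, a stalled stage surfaces as one clear error rather than an open-ended hang, and the caller decides whether to fall back or escalate.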

3. Fan-Out / Parallel Specialists

Multiple agents work on the same task simultaneously, then results are merged.

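# Fan-out pattern: parallel execution with aggregation
import asyncio

from claude_agent_sdk import AgentDefinition, ClaudeAgentOptions, ResultMessage, query


def specialist(name: str, description: str) -> ClaudeAgentOptions:
    # One subagent per call; the prompt text is a placeholder.
    return ClaudeAgentOptions(
        agents={name: AgentDefinition(description=description, prompt=f"You handle: {description}")}
    )


security_scanner = specialist("security-scanner", "Security vulnerability analysis")
style_checker = specialist("style-checker", "Code style and best practices review")
test_coverage = specialist("test-coverage", "Test coverage analysis and gap identification")


async def run(prompt: str, options: ClaudeAgentOptions) -> str:
    # Drain the message stream and keep only the final result text.
    final = ""
    async for message in query(prompt=prompt, options=options):
        if isinstance(message, ResultMessage):
            final = message.result or ""
    return final


async def main() -> str:
    # Run all three in parallel
    security, style, coverage = await asyncio.gather(
        run("Scan for security vulnerabilities in this codebase", security_scanner),
        run("Review code style and best practices", style_checker),
        run("Analyze test coverage gaps", test_coverage),
    )
    # Merge results: each subagent returns only its final message, so the
    # parent context sees 3 summaries, not 3× full conversation histories
    return f"""
## Security: {security}
## Style: {style}
## Coverage: {coverage}
"""


merged_report = asyncio.run(main())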

Why it works: Dramatic speed improvement for tasks where independent perspectives add value. Code review with multiple lenses is genuinely better.

Where it breaks: Cost scales linearly with agent count, so three specialists cost roughly three times one. Even two-agent debate patterns run ~2.5× single-model cost. Mitigation: only fan out when the task genuinely benefits from multiple perspectives.
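Fan-out also raises the question of what to do when one specialist fails while the others succeed. A minimal sketch, assuming each specialist is an ordinary coroutine (the `fan_out` helper and its return shape are illustrative, not from any SDK):

```python
import asyncio


async def fan_out(tasks):
    """Run named coroutines concurrently; keep successes, record failures.

    tasks: dict mapping a label to a coroutine object.
    Returns (report, failed) as two dicts keyed by label.
    """
    names = list(tasks)
    # return_exceptions=True stops one failure from cancelling the whole batch.
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)
    report, failed = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            failed[name] = repr(outcome)
        else:
            report[name] = outcome
    return report, failed
```

The merged report can then state explicitly which lenses are missing, so a crashed scanner is not mistaken for a clean scan.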

4. Debate / Negotiator

Two agents negotiate until they agree. Proposer + critic. Buyer + seller. The smallest useful "multi-agent" pattern.

Why it works: Forces reasoning depth without the cost explosion of larger swarms. Two heads are genuinely better than one, with manageable coordination overhead.

Where it breaks: Can loop forever if neither agent concedes. Mitigation: set a maximum round count and force a resolution strategy.
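A round cap with a forced resolution can be sketched as follows. The `propose` and `critique` callables are stand-ins for the two agents, and "keep the last proposal and flag it" is just one possible resolution strategy, assumed here for illustration.

```python
def debate(propose, critique, task, max_rounds=3):
    """Proposer/critic loop with a hard round cap and a forced resolution.

    propose(task, feedback=...) returns a proposal string.
    critique(proposal) returns (accepted: bool, feedback: str).
    """
    proposal = propose(task, feedback=None)
    for round_no in range(1, max_rounds + 1):
        accepted, feedback = critique(proposal)
        if accepted:
            return {"answer": proposal, "rounds": round_no, "forced": False}
        if round_no < max_rounds:
            proposal = propose(task, feedback=feedback)
    # Forced resolution (an assumption): return the last proposal, flagged for review.
    return {"answer": proposal, "rounds": max_rounds, "forced": True}
```

The `forced` flag matters downstream: a forced answer is a candidate for human review, not a consensus result.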

5. Swarm (Large-Scale Parallel)

Kimi K2.6 runs 300-agent swarms for complex research. Each agent works independently on a subtask, coordinating through shared state or a message bus.

Why it works: Unmatched throughput for massive parallel tasks like comprehensive research reviews or large-scale data processing.

Where it breaks: Debugging nightmare. When 300 agents run simultaneously, tracing which agent introduced an error is like finding a needle in a haystack of needles. Cost is astronomical — only viable when the task value justifies it.
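Two cheap habits make swarms more tractable: cap concurrency, and tag every result with the agent that produced it. The sketch below assumes a generic async `worker` callable; `run_swarm` and the result fields are hypothetical names.

```python
import asyncio


async def run_swarm(worker, subtasks, max_concurrency=20):
    """Run one worker per subtask, capped by a semaphore, tagging each result."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(agent_id, subtask):
        async with gate:  # at most max_concurrency workers run at once
            try:
                output = await worker(subtask)
                return {"agent_id": agent_id, "subtask": subtask, "ok": True, "output": output}
            except Exception as exc:
                # Record the failure against this agent instead of losing it.
                return {"agent_id": agent_id, "subtask": subtask, "ok": False, "error": repr(exc)}

    # gather preserves input order, so results line up with subtasks.
    return await asyncio.gather(*(run_one(i, s) for i, s in enumerate(subtasks)))
```

With every record carrying an `agent_id`, finding the agent that introduced an error becomes a filter over the results rather than a log search.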

The Cascade Problem: Why Agents Break at Scale

An inventory management agent hallucinated a SKU. The item didn't exist. The agent returned it as verified stock with a price, quantity, and warehouse location. That output passed to three downstream agents. Each treated it as legitimate data. Within two hours, the hallucinated item appeared in purchase orders, shipping manifests, and customer-facing inventory pages.

This is the cascade problem (identified by Tian Pan's research): it's not a model failure or a prompt failure — it's a systems failure that unit tests structurally cannot catch, because unit tests execute in isolation by design.

The question testing asks is: "does this agent produce correct output given this input?" The question production asks is: "what happens when 100 copies of this agent run simultaneously against the same database, filesystem, and external APIs?"

These are different questions. The gap between them is where cascades live.

Three cascade mechanisms to guard against:

TOCTOU races: Two agents read the same "next unprocessed item" before either marks it done → the same task gets processed twice.
Retry amplification: An agent fails and retries three times, and each failed retry triggers a failure handler that spawns three more attempts → a single transient error becomes nine requests.
Shared state corruption: Two agents updating the same config file → last writer wins, changes silently lost.
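The TOCTOU race in the first item is usually closed by making the claim itself conditional, so that a stale read cannot win. A minimal sketch using SQLite from the standard library; the `tasks` table, its columns, and `claim_next` are assumptions for the example, not a schema from any of the systems described.

```python
import sqlite3

SCHEMA = "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT NOT NULL, owner TEXT)"


def claim_next(conn, worker_id):
    """Claim one pending task with a compare-and-set; return its id or None."""
    while True:
        row = conn.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commit on success, roll back on error
            cur = conn.execute(
                "UPDATE tasks SET status = 'claimed', owner = ? "
                "WHERE id = ? AND status = 'pending'",
                (worker_id, row[0]),
            )
        if cur.rowcount == 1:
            return row[0]
        # Another worker claimed it between our read and write: try the next one.
```

The read may still be stale, but the `AND status = 'pending'` guard means only one worker's update can succeed for a given task.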

Subagents: When Isolation Is the Feature

The Claude Agent SDK's subagent system addresses the cascade problem directly through context isolation. Each subagent runs in its own fresh conversation — intermediate tool calls stay inside the subagent, and only the final message returns to the parent.

From the official docs: "A research subagent can read 40 files, evaluate them, and return a 200-word summary. The parent never sees the 40 files."

This isn't just about token efficiency. It's about blast radius containment. When a subagent goes wrong — hallucinates, loops, or produces garbage — the damage is contained to that subagent's context window. The parent sees only the final output, which it can validate before passing downstream.
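Validating the final output before it moves downstream is where the hallucinated-SKU scenario would have been caught. A minimal sketch, assuming the subagent is asked to return JSON; the field names and the `known_skus` lookup are hypothetical.

```python
import json

REQUIRED_FIELDS = {"sku": str, "quantity": int, "warehouse": str}


def validate_stock_report(raw_output, known_skus):
    """Parse a subagent's final message and reject anything the parent cannot verify."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    # Ground-truth check: the model's claim must match a system of record.
    if data["sku"] not in known_skus:
        raise ValueError(f"unknown SKU {data['sku']!r}: refusing to pass downstream")
    return data
```

The schema check catches malformed output; the catalog lookup catches well-formed output that is simply untrue, which is the more dangerous case.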

Key rule: spawn a subagent when the task involves more information than the parent needs to remember. Handle inline when the task is short and the parent will reference the output repeatedly.

Anthropic's own multi-agent research system beat single-agent Claude Opus 4 by 90.2% on their internal research eval, but Anthropic also reports that multi-agent systems use roughly 15x the tokens of a chat interaction. Subagents are not free. They are a quality lever you pull when the task value justifies the spend.

Production Checklist: What Actually Keeps Agents Alive

Based on production deployments and failure analysis:

Before deployment:

[ ] Each agent has a clear input/output contract (JSON schema validation)
[ ] Timeout circuit breakers on every agent call
[ ] Retry logic with exponential backoff, not infinite loops
[ ] Idempotency keys on all state-mutating operations
[ ] Health checks that verify the agent can complete a simple task
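Of the items above, idempotency keys are the one most often skipped. A minimal sketch of the idea follows; `IdempotentExecutor` is a hypothetical helper, and the in-memory dict stands in for whatever durable store a real deployment would use.

```python
import hashlib
import json


class IdempotentExecutor:
    """Run a state-mutating operation at most once per idempotency key."""

    def __init__(self):
        self._results = {}  # in-memory stand-in for a durable store

    @staticmethod
    def key_for(operation, payload):
        # Canonical JSON so the same payload always hashes to the same key.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{operation}:{canonical}".encode()).hexdigest()

    def run(self, operation, payload, fn):
        key = self.key_for(operation, payload)
        if key not in self._results:
            self._results[key] = fn(payload)
        return self._results[key]  # a retry gets the stored result, not a second write
```

With this in place, a retried or duplicated agent call replays the recorded result instead of creating a second purchase order.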

During operation:

[ ] Structured logging of every agent's final output (not just errors)
[ ] Cost monitoring per agent, with alerts at 2x baseline
[ ] Deduplication on shared state writes
[ ] Circuit breakers that fail fast when downstream services degrade
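Per-agent cost alerting at 2x baseline needs very little machinery. The sketch below keeps a rolling window of request costs per agent; the `CostMonitor` class, window size, and alert text are illustrative assumptions.

```python
from collections import defaultdict, deque


class CostMonitor:
    """Track a rolling mean cost per agent and flag drift above a baseline."""

    def __init__(self, baselines, window=50, alert_ratio=2.0):
        self.baselines = baselines  # agent name -> expected USD per request
        self.alert_ratio = alert_ratio
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, agent, cost_usd):
        """Record one request's cost; return an alert string if over threshold."""
        costs = self.recent[agent]
        costs.append(cost_usd)
        mean = sum(costs) / len(costs)
        limit = self.baselines[agent] * self.alert_ratio
        if mean > limit:
            return f"{agent}: mean ${mean:.3f}/request exceeds {self.alert_ratio:g}x baseline"
        return None
```

A sudden cost jump is often the first visible symptom of a looping agent, well before error rates move.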

When things break:

[ ] Fallback to a simpler agent or human review, not infinite retry
[ ] Rollback mechanism for state mutations
[ ] Alerting on cascade indicators (retry rate > baseline, duplicate outputs)
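The two cascade indicators named in the last item can be computed from the structured logs already on the checklist. A minimal sketch, assuming each log event records whether it was a retry and the agent's final output; the event shape and `cascade_alerts` are hypothetical.

```python
import hashlib
from collections import Counter


def cascade_alerts(events, baseline_retry_rate=0.05):
    """Check a batch of events for cascade indicators.

    events: list of dicts with 'agent', 'is_retry' (bool), and 'output' (str).
    Returns a list of human-readable alert strings (empty if all is normal).
    """
    alerts = []
    if not events:
        return alerts
    retry_rate = sum(e["is_retry"] for e in events) / len(events)
    if retry_rate > baseline_retry_rate:
        alerts.append(f"retry rate {retry_rate:.0%} above baseline {baseline_retry_rate:.0%}")
    # Identical outputs from separate runs suggest the same task was processed twice.
    digests = Counter(hashlib.sha256(e["output"].encode()).hexdigest() for e in events)
    duplicates = sum(n - 1 for n in digests.values() if n > 1)
    if duplicates:
        alerts.append(f"{duplicates} duplicate outputs detected")
    return alerts
```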

The Decision Tree

Choosing an orchestration pattern isn't a style preference — it determines your cost structure, failure surface, and which frameworks support what you need:

Is the task naturally sequential?
├── Yes → Pipeline
└── No → Does the task benefit from multiple perspectives?
    ├── Yes, 2 perspectives → Debate/Negotiator
    ├── Yes, 3+ perspectives → Fan-Out
    └── No → Does it need massive scale (100+ agents)?
        ├── Yes → Swarm
        └── No → Is the task decomposable into clear subtasks?
            ├── Yes → Supervisor + Specialists
            └── No → Single agent (don't over-engineer)

Default choice: Supervisor + Specialists. It's the 96.3% success rate pattern for a reason. Start here. Add complexity only when the task demands it and the data supports it.
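For teams that want the tree in a design-review script, it reduces to a short function. This is one reading of the tree, with the questions asked in a fixed order; the parameter names are assumptions for the example.

```python
def choose_pattern(sequential, perspectives=1, agent_count=1, decomposable=False):
    """Map task traits to an orchestration pattern, following the decision tree."""
    if sequential:
        return "Pipeline"
    if perspectives == 2:
        return "Debate/Negotiator"
    if perspectives >= 3:
        return "Fan-Out"
    if agent_count >= 100:
        return "Swarm"
    if decomposable:
        return "Supervisor + Specialists"
    return "Single agent"
```

For example, `choose_pattern(sequential=False, decomposable=True)` lands on the default, Supervisor + Specialists.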

Sources

Abemon, "AI Agent Orchestration: 96% Success Rate with Supervisor Pattern" (2026)
Balys Kriksciunas, "Multi-Agent Orchestration Infrastructure: Lessons from Production" (TURION.AI, 2026)
Ranjan Kumar, "Multi-Agent Pipeline Orchestration and Failure Propagation: Designing for Blast Radius" (2026)
Tian Pan, "The Cascade Problem: Why Agent Side Effects Explode at Scale" (2026)
Digital Applied, "Multi-Agent Orchestration: 5 Patterns That Work in 2026" (2026)
Anthropic, Claude Agent SDK Subagents Documentation (2026)
Growth Engineer, "How to Build Subagents with the Claude Agent SDK" (2026)
