You built a multi-agent system. You tested it. It worked.
Then you put it in production and two agents deadlocked, a third hung silently, and the orchestrator kept dispatching work into the void for eleven minutes before your monitoring caught it.
Welcome to the failure modes nobody talks about in the tutorials.
This post covers the five orchestration mistakes I see most often and the specific patterns that fix them.
The Problem with Most Multi-Agent Tutorials
Most tutorials show you the happy path:
Orchestrator -> Research Agent -> Writing Agent -> Review Agent -> Output
Clean. Sequential. Works great in a notebook.
What they don't show you: what happens when Research Agent returns garbage, Writing Agent hangs for 45 seconds, or Review Agent's context window fills up mid-task.
These aren't edge cases. These are your Monday morning incidents.
Failure Mode #1: The Silent Hang
Your orchestrator dispatches a task. The sub-agent starts working. Nothing comes back. No error. No timeout. Just silence.
Many agent frameworks don't enforce timeouts at the call site by default. If the underlying LLM call never returns, your orchestrator waits forever.
Fix: set an explicit timeout at every agent call site and retry with exponential backoff.
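A minimal sketch of that fix using asyncio, assuming each agent call is exposed as a coroutine function. The name `call_agent_with_retry` and the default values are illustrative, not taken from any particular framework.

```python
import asyncio
import random


async def call_agent_with_retry(agent_call, *, timeout_s=30.0,
                                max_attempts=3, base_delay_s=1.0):
    """Await agent_call() with a hard timeout, retrying with backoff.

    agent_call is a zero-argument coroutine function (hypothetical shape).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            # wait_for cancels the call and raises TimeoutError on expiry
            return await asyncio.wait_for(agent_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base, 2x base, 4x base, plus jitter
                delay = base_delay_s * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(
        f"agent call failed after {max_attempts} attempts"
    ) from last_error
```

One caveat: `asyncio.wait_for` can only interrupt code that yields to the event loop. If the agent call blocks synchronously, the timeout has to be set on the HTTP or SDK client instead.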
Failure Mode #2: The Garbage Output Problem
Your sub-agent returns a 200. The orchestrator treats it as success. Downstream agents get garbage.
You're checking "did the agent return something," not "did the agent return something valid."
Fix: define Pydantic output contracts for every agent handoff. Not documentation - enforced validation.
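As a rough illustration, here is what an enforced contract could look like. The `ResearchResult` model, its fields, and the `InvalidAgentOutput` exception are hypothetical; the sketch also assumes the agent returns a JSON string.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class InvalidAgentOutput(Exception):
    """Raised when an agent's output violates its contract."""


class ResearchResult(BaseModel):
    """Hypothetical contract for a research -> writing handoff."""
    topic: str = Field(min_length=1)
    summary: str = Field(min_length=20)
    confidence: float = Field(ge=0.0, le=1.0)


def parse_research_output(raw: str) -> ResearchResult:
    """Validate raw agent output before anything downstream sees it."""
    try:
        return ResearchResult(**json.loads(raw))
    except (ValidationError, json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, wrong shape, or out-of-range values all count
        # as a failed call, not as a success with odd contents.
        raise InvalidAgentOutput(str(exc)) from exc
```

The orchestrator can then treat `InvalidAgentOutput` like any other agent failure: retry, fall back, or stop the pipeline.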
Failure Mode #3: The Noisy Neighbor
Agent A is failing repeatedly. Your orchestrator keeps retrying it. Meanwhile, Agents B, C, and D are blocked. Your entire system degrades because of one flaky component.
Fix: implement circuit breakers. After N failures, fast-fail and route to fallback. Stop the retry death spiral.
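A bare-bones circuit breaker might look like the sketch below. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration; it is single-threaded and skips the locking a real deployment would likely need.

```python
import time


class CircuitBreaker:
    """Fast-fail to a fallback after repeated failures (illustrative)."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the agent entirely
            # Cool-down elapsed: fall through and allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            return fallback()
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

Keeping one breaker per agent means a flaky Agent A trips only its own circuit while the others keep running.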
Failure Mode #4: The Context Avalanche
Agent 1 outputs 2,000 tokens. Agent 2 adds 3,000 more. By Agent 5 you're at 25,000 tokens and haven't started the interesting work.
Fix: define structured handoff contracts. Compress each output to its key findings only, for example 3,000 tokens -> 120 tokens. Downstream agents get what they need without the bloat.
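One way to enforce a compact handoff is sketched below. It assumes the upstream agent has already been prompted to emit its key findings as a list of short strings; the `Handoff` type, `build_handoff`, and the size limits are hypothetical. Limits are in characters rather than tokens to keep the sketch tokenizer-free.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Handoff:
    """Hypothetical handoff payload passed between agents."""
    task_id: str
    key_findings: Tuple[str, ...]
    open_questions: Tuple[str, ...] = ()


def build_handoff(task_id: str, findings: Iterable[str],
                  open_questions: Iterable[str] = (),
                  max_findings: int = 5, max_chars: int = 200) -> Handoff:
    """Cap the number and length of findings before passing them on."""
    trimmed = tuple(
        f.strip()[:max_chars] for f in findings if f.strip()
    )[:max_findings]
    if not trimmed:
        raise ValueError("handoff needs at least one key finding")
    questions = tuple(
        q.strip()[:max_chars] for q in open_questions if q.strip()
    )[:max_findings]
    return Handoff(task_id, trimmed, questions)
```

The downstream agent receives only the `Handoff`, never the upstream agent's full transcript, so context size stays bounded regardless of pipeline depth.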
Failure Mode #5: The Trust Boundary Blur
Sub-agents are calling tools the orchestrator doesn't know about, writing to shared state, and you have no idea who changed what.
Fix: explicit permission tiers per agent role. Orchestrator gets dispatch tools. Research agents get search tools. Executors get write tools. Nothing crosses tiers.
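A small sketch of tiered permissions, assuming all tool calls are routed through a single gate function. The role names follow the tiers above; the tool names (`dispatch_task`, `web_search`, `write_file`) and the `invoke_tool` helper are invented for illustration.

```python
from enum import Enum


class Role(Enum):
    ORCHESTRATOR = "orchestrator"
    RESEARCHER = "researcher"
    EXECUTOR = "executor"


# Hypothetical tool names; each role gets exactly one tier of tools.
ALLOWED_TOOLS = {
    Role.ORCHESTRATOR: frozenset({"dispatch_task"}),
    Role.RESEARCHER: frozenset({"web_search"}),
    Role.EXECUTOR: frozenset({"write_file"}),
}


def invoke_tool(role, tool_name, registry, audit_log, **kwargs):
    """Run a tool only if the role's tier allows it; log every attempt."""
    if tool_name not in ALLOWED_TOOLS.get(role, frozenset()):
        audit_log.append((role.value, tool_name, "denied"))
        raise PermissionError(f"{role.value} may not call {tool_name}")
    audit_log.append((role.value, tool_name, "allowed"))
    return registry[tool_name](**kwargs)
```

Because every attempt lands in `audit_log`, the question of who changed what has an answer even for denied calls.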
The Production Checklist
Before going live:
- Every agent call has an explicit timeout
- Every agent output has a validated schema
- Circuit breakers on all external agent calls
- Handoff format defined and enforced
- Tool permissions defined and enforced per agent role
- At least one degraded-mode path tested
- Observability: you can see which agent is running, for how long, returning what
Seven items. None optional.
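For the observability item, a lightweight starting point is a context manager wrapped around each agent call. This is only a sketch: `traced_agent_call`, the record fields, and the in-memory `records` list are assumptions, and a production system would more likely ship these events to a tracing or metrics backend.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("orchestrator")


@contextmanager
def traced_agent_call(agent_name, records):
    """Record which agent ran, for how long, and how it ended."""
    record = {"agent": agent_name, "status": "running",
              "duration_s": None, "output_preview": None}
    records.append(record)
    start = time.monotonic()
    logger.info("agent=%s status=running", agent_name)
    try:
        yield record  # caller may fill in record["output_preview"]
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = f"error: {type(exc).__name__}"
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        logger.info("agent=%s status=%s duration_s=%.3f",
                    agent_name, record["status"], record["duration_s"])
```

Usage is a `with traced_agent_call("research", records) as rec:` block around the call, setting `rec["output_preview"]` to a short slice of the result.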
Where to Go Deeper
MAC-009 in Machina Market has the full production toolkit: 3 orchestrator templates, agent role schemas, circuit breaker implementations, handoff protocol specs, and a 25-item production readiness checklist.
Pay with ETH directly. No accounts needed. Full catalog at machinamarket.surge.sh/catalog.json
- Manfred