Gabriel Anhaia

The 5 Failure Modes of Multi-Agent Systems Nobody Warns You About


You build the demo with two agents and it sings. A planner. A worker. A handoff. The Slack channel gets a screenshot. Three weeks later you wake up to a $312 bill from one conversation, an empty result folder, and a trace tab so wide you cannot scroll to the end of it.

Multi-agent systems fail at the seams. The prompts are usually fine. Five seams show up in every production codebase. Examples use the Anthropic SDK. The same patterns apply to the OpenAI Agents SDK, LangGraph, and homegrown loops.

1. Loop-of-loops

Each agent has its own retry policy. The planner retries on tool errors, the worker on LLM errors, the orchestrator on agent errors. None of them know about each other. A flaky API blips once and the planner's three retries each spawn the worker's three retries, each of which retries the LLM three times. One blip, 27 LLM calls.

The OTel signal. Span tree depth on agent.turn parents balloons past your normal p99. Look at the ratio of llm.chat spans to agent.turn spans per conversation. Healthy is 1–4. A loop-of-loops shows 20+.
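That ratio is cheap to compute offline. A minimal sketch, assuming you can dump spans from your backend as (conversation id, span name) pairs; `loop_suspects` and the threshold are illustrative, not a standard OTel query:

```python
from collections import Counter

def loop_suspects(spans, ratio_threshold=20):
    """spans: iterable of (conversation_id, span_name) pairs.
    Flags conversations whose llm.chat / agent.turn ratio
    suggests nested retry loops."""
    turns, chats = Counter(), Counter()
    for conv_id, name in spans:
        if name == "agent.turn":
            turns[conv_id] += 1
        elif name == "llm.chat":
            chats[conv_id] += 1
    return [
        conv for conv in turns
        if chats[conv] / turns[conv] >= ratio_threshold
    ]
```

Run it nightly over yesterday's traces and alert on any non-empty result; a healthy fleet returns nothing.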

The fix. One retry policy per turn, declared at the top, enforced by a step counter that every agent increments through.

```python
from dataclasses import dataclass

@dataclass
class StepBudget:
    max_steps: int = 12
    used: int = 0

    def spend(self, n: int = 1) -> None:
        self.used += n
        if self.used > self.max_steps:
            raise RuntimeError(
                f"step budget exhausted: "
                f"{self.used}/{self.max_steps}"
            )
```

Pass the same StepBudget instance to every agent in the run. Each LLM call, tool call, and handoff calls budget.spend(). The whole system shares one ceiling. Sub-layers cannot quietly add their own retries because the budget lives in the call frame, not any single agent's config.
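A sketch of the wiring, with a hypothetical planner and worker; the StepBudget class is repeated from above so the snippet runs standalone:

```python
from dataclasses import dataclass

@dataclass
class StepBudget:
    max_steps: int = 12
    used: int = 0

    def spend(self, n: int = 1) -> None:
        self.used += n
        if self.used > self.max_steps:
            raise RuntimeError(
                f"step budget exhausted: {self.used}/{self.max_steps}"
            )

def worker(task: str, budget: StepBudget) -> str:
    budget.spend()  # each LLM or tool call costs one step
    return f"done: {task}"

def planner(tasks: list[str], budget: StepBudget) -> list[str]:
    results = []
    for task in tasks:
        budget.spend()  # the handoff itself costs a step too
        results.append(worker(task, budget))
    return results

budget = StepBudget(max_steps=4)
planner(["invoices", "email"], budget)  # 4 steps: exactly fits the cap
```

A third task would push the shared counter past the ceiling and fail the run, no matter which layer triggered it.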

2. Ownership ambiguity

The user types "summarize my last 10 invoices and email it to my accountant." The system has a research agent and a comms agent. Both pick up the task. Research fetches the invoices and tries to send the email. Comms waits for a draft, then composes its own from the user's raw words. The user gets two emails. One is wrong.

In the demo the orchestrator hard-coded the route. In production the orchestrator is itself an LLM, and "which agent owns this query" becomes a probability instead of a contract.

The OTel signal. Two agent.turn spans with the same conversation.id and overlapping wall-clock windows. If your trace UI shows them side-by-side instead of stacked, you have a race.

The fix. Make ownership a typed field. One agent owns the run at a time, the field is updated atomically, and any agent that tries to act without ownership raises.

```python
from dataclasses import dataclass
from typing import Literal

Owner = Literal["planner", "research", "comms", "done"]

@dataclass
class RunState:
    run_id: str
    user_query: str
    owner: Owner = "planner"
    output: str = ""

def assert_owner(state: RunState, who: Owner) -> None:
    if state.owner != who:
        raise RuntimeError(
            f"{who} acted while owner is {state.owner}"
        )
```

Every agent's first line is assert_owner(state, "research"). The handoff function is the only thing allowed to mutate state.owner. The contract is visible to the type checker, not buried in a prompt.
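End to end, a turn looks like this; `research_turn` and `hand_off` are illustrative names, and RunState/assert_owner are repeated so the sketch runs standalone:

```python
from dataclasses import dataclass
from typing import Literal

Owner = Literal["planner", "research", "comms", "done"]

@dataclass
class RunState:
    run_id: str
    user_query: str
    owner: Owner = "planner"
    output: str = ""

def assert_owner(state: RunState, who: Owner) -> None:
    if state.owner != who:
        raise RuntimeError(f"{who} acted while owner is {state.owner}")

def hand_off(state: RunState, to: Owner) -> None:
    # the only function allowed to mutate ownership
    state.owner = to

def research_turn(state: RunState) -> None:
    assert_owner(state, "research")  # first line of every agent
    state.output = f"findings for: {state.user_query}"
    hand_off(state, "comms")
```

Once research hands off, a second `research_turn` on the same state raises instead of racing comms for the run.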

3. Shared-state race

Two agents writing to the same memory store. Research appends a source. The critic, running in parallel, filters the list down to confidence > 0.5. They interleave. Half the sources disappear. Nobody reproduces it locally because the timing only goes wrong under load.

The OTel signal. Tool spans named state.write whose parents are different agent.turn spans within the same conversation. Sort by start time and look for overlap. Races also show up as ghost values in replays: fields that change between trace queries because writes are still flying after the trace closed.

The fix. Single-writer rule. One agent owns each region of state. Other agents get a read-only view.

```python
from copy import deepcopy
from typing import Callable

class StateView:
    def __init__(self, state: RunState, owner: Owner):
        self._state = state
        self._owner = owner

    def read(self) -> RunState:
        return deepcopy(self._state)

    def write(self, mutator: Callable[[RunState], None]) -> None:
        if self._state.owner != self._owner:
            raise RuntimeError(
                "write attempted by non-owner"
            )
        mutator(self._state)
```

Every agent gets a StateView, not the raw state. Reads are deep copies. Writes pass through the owner check. If you actually need concurrency, replace the check with an asyncio.Lock per region and split state into regions owned by different agents from the start.
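The lock-per-region variant, as a sketch; `RegionState` and the region names are hypothetical, and the essential property is that every read-modify-write happens entirely inside one region's lock:

```python
import asyncio

class RegionState:
    """Run state split into independently locked regions."""

    def __init__(self, regions: list[str]):
        self._data = {r: {} for r in regions}
        self._locks = {r: asyncio.Lock() for r in regions}

    async def update(self, region: str, key: str, value) -> None:
        # the whole read-modify-write happens under the region lock
        async with self._locks[region]:
            self._data[region][key] = value

    async def read(self, region: str) -> dict:
        async with self._locks[region]:
            return dict(self._data[region])  # a copy, never a live view

async def main() -> dict:
    state = RegionState(["sources", "drafts"])
    # research writes sources, comms writes drafts: disjoint regions,
    # so neither can clobber the other mid-write
    await asyncio.gather(
        state.update("sources", "s1", {"confidence": 0.9}),
        state.update("drafts", "d1", "summary v1"),
    )
    return await state.read("sources")

asyncio.run(main())
```

Two writers in the same region still serialize through that region's lock; two writers in different regions never contend at all.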

4. Infinite handoff ping-pong

Agent A hands off to B with reason: "needs review". B decides the work is incomplete and hands back with reason: "needs more detail". A adds two sentences. B is still not satisfied. The bounce continues until something else kills it. Neither agent is wrong inside its own turn; the loop is in the topology.

The OTel signal. A flat fan of agent.handoff spans under one agent.turn ancestor, alternating between the same two destination agents. The shape is the giveaway. Handoffs should have a direction; symmetric ones are a smell.

The fix. A handoff counter, a hard cap, and a topology rule that forbids reverse edges.

```python
from dataclasses import dataclass

@dataclass
class HandoffPayload:
    from_agent: str
    to_agent: str
    reason: str

def handoff(
    state: RunState, to: Owner, reason: str, count: dict
) -> HandoffPayload:
    count["n"] = count.get("n", 0) + 1
    if count["n"] > 4:
        state.owner = "done"
        raise RuntimeError(
            f"handoff cap exceeded: {count['n']}"
        )
    payload = HandoffPayload(
        from_agent=state.owner,
        to_agent=to,
        reason=reason,
    )
    state.owner = to
    return payload
```

Four handoffs is generous for two agents and tight for three. The number is not the point. The cap is. If you hit it, the run fails loud instead of bouncing for fourteen minutes. The OpenAI Agents SDK and LangGraph expose handoff primitives where you wire this guard at framework level; rolling your own, the snippet above is the whole pattern.

If your design genuinely needs A → B → A, add a critic between them with a terminal "approve/reject" output. A critic that can only say one of two things is not a participant in a loop.
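A sketch of that terminal critic; `Verdict` and `critic_review` are illustrative, with the length check standing in for an LLM call whose output is constrained to the two decisions:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Verdict:
    decision: Literal["approve", "reject"]
    note: str = ""

def critic_review(draft: str) -> Verdict:
    # stand-in for an LLM call forced to pick one of two decisions
    if len(draft.split()) >= 5:
        return Verdict("approve")
    return Verdict("reject", note="needs more detail")
```

The orchestrator reads the verdict: approve ships the draft, reject reruns A at most once with the note attached. The B → A edge never exists, so the ping-pong cannot start.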

5. Cost runaway

The orchestrator spawns three sub-agents. Each decides it needs two helpers. The helpers have tools that wrap other agents. Fan-out is unbounded because every layer decides in isolation and no layer sees the total. The first call costs $0.04. The 50th costs $40, because the context now contains every nested agent's transcript.

A multi-agent system built on LangChain ran up a $47K bill in eleven days before the team caught it (Tech Startups, Nov 2025). Orchestrator agents that spawn sub-agents now compound the same problem a layer deeper.

The OTel signal. A budget metric on the run, exported per turn. Plot token cost against wall-clock; a runaway shows as a near-vertical line. Before that, you can see it coming in span counts: agent.turn spans per conversation.id crossing a threshold no healthy run would cross.

The fix. A total budget cap, owned at the run level, decremented by every agent and every tool. Same shape as the step budget in failure mode #1, but denominated in dollars.

```python
from dataclasses import dataclass

import anthropic

client = anthropic.Anthropic()

PRICE_PER_MTOK_IN = 3.00
PRICE_PER_MTOK_OUT = 15.00

@dataclass
class CostBudget:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, in_tok: int, out_tok: int) -> None:
        cost = (
            in_tok * PRICE_PER_MTOK_IN / 1_000_000
            + out_tok * PRICE_PER_MTOK_OUT / 1_000_000
        )
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            raise RuntimeError(
                f"cost cap exceeded: "
                f"${self.spent_usd:.2f}/${self.cap_usd:.2f}"
            )

def call_claude(messages, budget: CostBudget):
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=messages,
    )
    budget.charge(
        resp.usage.input_tokens,
        resp.usage.output_tokens,
    )
    return resp
```

Pass the same CostBudget to every agent and tool that calls an LLM. The first agent to push over the cap fails the whole run. Verify model and current pricing in the Anthropic pricing docs before you ship — token rates change.

Use a per-conversation cap one or two orders of magnitude smaller than your daily fleet cap. A $0.50 cap on a normal turn is harmless; the day a runaway hits, it stops at $0.50 instead of burning through the fleet's entire day.

The shape that ties them together

Four of these failure modes share a structural fix: a single ceiling held at the run level, not inside any one agent. Steps. Writes. Handoffs. Dollars. Ownership ambiguity is what gives that ceiling teeth. If two agents can hold the run at the same time, no ceiling is real.
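One way to make that concrete, as a sketch: fold the ceilings from the sections above into a single hypothetical RunBudget that every agent, tool, and handoff goes through:

```python
from dataclasses import dataclass

@dataclass
class RunBudget:
    """All run-level ceilings in one object, passed to every agent."""
    max_steps: int = 12
    max_handoffs: int = 4
    cap_usd: float = 0.50
    steps: int = 0
    handoffs: int = 0
    spent_usd: float = 0.0

    def _check(self, used, cap, name: str) -> None:
        if used > cap:
            raise RuntimeError(f"{name} exceeded: {used}/{cap}")

    def spend_step(self) -> None:
        self.steps += 1
        self._check(self.steps, self.max_steps, "step budget")

    def spend_handoff(self) -> None:
        self.handoffs += 1
        self._check(self.handoffs, self.max_handoffs, "handoff cap")

    def charge_usd(self, usd: float) -> None:
        self.spent_usd += usd
        self._check(self.spent_usd, self.cap_usd, "cost cap")
```

Whichever ceiling trips first fails the run; no agent can opt out because no agent owns the object.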

Instrument the seams first. Span on every handoff. Attribute every tool span with a step counter. Emit a token-cost gauge per turn. The agents can be black boxes; the seams cannot.


If this was useful

The five above are the headline failures. The AI Agents Pocket Guide walks through about a dozen with the same code-level depth, plus the supervisor, swarm, planner-executor, and critic-loop patterns and the failure modes each one ships with. If you are putting two or more agents in front of users this quarter, it is the kind of book you read in an afternoon and reach for in code review.

AI Agents Pocket Guide
