Most of the multi-agent demos you'll see are a single-agent architecture wearing a costume.
They show you Agent A doing something, then Agent B doing something else. What they don't show you is what happens when Agent A's output doesn't match what Agent B expects — or when the handoff silently fails and the whole chain keeps running as if nothing happened.
I've shipped three multi-agent systems in production this year. The agents themselves were never the hard part. The handoffs were.
What "Handoff" Actually Means in Practice
A handoff isn't just passing output from one agent to another. It's:
- Schema alignment — Agent B needs to parse Agent A's output reliably
- Failure propagation — when one agent fails, the chain needs to know
- Context window hygiene — every handoff is a chance to accumulate noise
The most common mistake is treating agents as black boxes connected by a string. You prompt Agent A, get a result, stuff it into Agent B. It works until it doesn't, and when it breaks, you have no idea where.
Here's a concrete example. A common pattern: a planner agent decomposes a task, then a set of worker agents execute sub-tasks in parallel.
The naive version:
# Naive handoff — no contract, no error handling
planner_output = planner_agent.run(task)
worker_results = [worker.run(subtask) for subtask in planner_output["subtasks"]]
final = aggregator.run(worker_results)
This will fail in production. Not because the agents are bad, but because planner_output["subtasks"] might be a list one run and a string the next. Or the planner might return {"subtasks": []} and the workers silently do nothing. Or a worker throws an exception and the whole thing eats it.
The explicit contract version:
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class SubTask(BaseModel):
    id: str
    description: str
    priority: int


class PlannerOutput(BaseModel):
    subtasks: list[SubTask]
    reasoning: str
    confidence: float  # Lets downstream agents calibrate trust


class WorkerResult(BaseModel):
    task_id: str
    status: Literal["success", "failed", "skipped"]
    output: Optional[str] = None
    error: Optional[str] = None


# planner_agent, worker, and aggregator are the same agent objects as in the naive version.
# Planner produces a typed contract
plan = planner_agent.run(task, output_schema=PlannerOutput)

# Workers validate input and produce typed output
results = []
for subtask in plan.subtasks:
    try:
        result = worker.run(subtask, input_schema=SubTask, output_schema=WorkerResult)
        results.append(result)
    except ValidationError as e:
        # Record the failure against the task instead of losing it
        results.append(WorkerResult(task_id=subtask.id, status="failed", error=str(e)))

# Aggregator receives structured data it can actually reason about
summary = aggregator.run(results, input_schema=list[WorkerResult])
Now when something breaks, you know exactly which task, which agent, and what went wrong.
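To see what the contract catches without wiring up real agents, here is a minimal, dependency-free sketch of the same boundary check. It uses plain dataclasses instead of Pydantic, and `parse_plan` and `HandoffError` are names invented for this illustration, not part of any framework.

```python
from dataclasses import dataclass


class HandoffError(Exception):
    """Raised when a payload does not match the expected contract (hypothetical name)."""


@dataclass
class SubTask:
    id: str
    description: str
    priority: int


def parse_plan(raw: dict) -> list[SubTask]:
    """Validate a raw planner response and return typed subtasks."""
    subtasks = raw.get("subtasks")
    if not isinstance(subtasks, list):
        raise HandoffError(f"subtasks must be a list, got {type(subtasks).__name__}")
    if not subtasks:
        raise HandoffError("planner returned no subtasks")
    parsed = []
    for index, item in enumerate(subtasks):
        if not isinstance(item, dict):
            raise HandoffError(f"subtask {index} must be an object, got {type(item).__name__}")
        try:
            task = SubTask(**item)
        except TypeError as exc:  # missing or unexpected keys
            raise HandoffError(f"subtask {index} is malformed: {exc}") from exc
        if not isinstance(task.priority, int):
            raise HandoffError(f"subtask {index} has a non-integer priority")
        parsed.append(task)
    return parsed


# A well-formed plan parses into typed objects.
good = {"subtasks": [{"id": "t1", "description": "fetch data", "priority": 1}]}
print(parse_plan(good)[0].id)

# A string instead of a list, and an empty list, are both rejected at the boundary.
for bad in ({"subtasks": "fetch data"}, {"subtasks": []}):
    try:
        parse_plan(bad)
    except HandoffError as exc:
        print(f"rejected: {exc}")
```

Both bad payloads are the ones the naive version would have passed along: a string where a list was expected, and an empty list that makes the workers silently do nothing.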
The Three Failure Modes Nobody Talks About
1. Silent truncation. Agent A produces 2,000 tokens. Agent B's model has a 128k context window, but you're running the worker on a 4k budget, and the worker's own prompt has to fit in that budget too. The output gets silently truncated. Agent B processes a partial result and returns confident nonsense. The fix: measure actual token counts at every handoff and fail explicitly if you exceed budget.
2. Schema drift. Your planner prompt changes slightly. Now it returns reasoning as a single word instead of a paragraph. Agent B was doing string matching on reasoning, and that matching stops working. The fix: use structured output (Pydantic, JSON Schema) everywhere, and validate against it instead of trusting the prompt to keep the format stable.
3. Parallel agent race conditions. You launch 5 workers in parallel. Three finish. Two are still running. Your aggregator starts processing. It gets partial results and returns. This is especially nasty because it works fine in testing with small workloads and fails in production with real latency. The fix: use a barrier. asyncio.gather waits for every task when you pass return_exceptions=True, and raises on the first failure when you leave it at the default False (the remaining tasks keep running unless you cancel them). A result collector that waits for all or fails fast works too.
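For the first failure mode, the check can be as small as this sketch. `check_budget`, `BudgetExceededError`, and `rough_token_count` are hypothetical names, and the whitespace-based counter is only a placeholder: a real system should count with the tokenizer that matches its model.

```python
from typing import Callable


class BudgetExceededError(Exception):
    """Raised when a handoff payload is larger than the receiver's budget (hypothetical name)."""


def rough_token_count(text: str) -> int:
    # Placeholder only: whitespace splitting is NOT a real tokenizer.
    # Swap in the tokenizer that matches your model.
    return len(text.split())


def check_budget(
    payload: str,
    budget: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> int:
    """Return the measured token count, or raise if it exceeds the budget."""
    used = count_tokens(payload)
    if used > budget:
        raise BudgetExceededError(f"handoff payload is {used} tokens, budget is {budget}")
    return used


# Fits within the budget, so the measured count is returned.
check_budget("summarize the quarterly report", budget=10)

# Too large: the handoff fails loudly instead of being truncated.
try:
    check_budget("word " * 50, budget=10)
except BudgetExceededError as exc:
    print(f"handoff rejected: {exc}")
```

Raising here is a design choice. If you would rather truncate deliberately, do it in the same place and record a warning on the handoff so the receiving agent knows its input is partial.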
A Minimal Production Pattern That Actually Works
After burning through all three failure modes, I settled on this structure:
import asyncio
from dataclasses import dataclass, field
from enum import Enum

# PlannerOutput and WorkerResult are the Pydantic models defined earlier.
# PipelineConfig, generate_trace_id, planner, worker_pool, and aggregator
# are application-specific and not shown here.


class PipelineError(Exception):
    """Raised when the pipeline cannot continue safely."""


class AgentRole(Enum):
    PLANNER = "planner"
    WORKER = "worker"
    AGGREGATOR = "aggregator"


@dataclass
class HandoffEnvelope:
    """Every handoff gets wrapped in metadata."""
    source: AgentRole
    target: AgentRole
    payload: dict
    trace_id: str  # For debugging across agents
    confidence: float
    # e.g., ["output truncated from 2048 to 1024 tokens"]
    warnings: list[str] = field(default_factory=list)


async def run_pipeline(task: str, config: PipelineConfig) -> str:
    trace_id = generate_trace_id()

    # Phase 1: Plan with explicit output contract
    plan_output = await planner.run(task, output_schema=PlannerOutput)
    if plan_output.confidence < config.min_confidence:
        raise PipelineError(f"Plan confidence {plan_output.confidence} below threshold")
    if not plan_output.subtasks:
        # An empty plan would otherwise make the workers silently do nothing
        raise PipelineError("Planner returned no subtasks")
    plan_envelope = HandoffEnvelope(
        source=AgentRole.PLANNER,
        target=AgentRole.WORKER,
        payload=plan_output.model_dump(),
        trace_id=trace_id,
        confidence=plan_output.confidence,
    )

    # Phase 2: Execute workers with error isolation
    worker_tasks = [
        worker_pool.run(subtask, envelope=plan_envelope)
        for subtask in plan_output.subtasks
    ]
    # return_exceptions=True waits for every worker and returns failures as values
    worker_results = await asyncio.gather(*worker_tasks, return_exceptions=True)

    # Phase 3: Aggregate with partial-result tolerance
    valid_results = [r for r in worker_results if isinstance(r, WorkerResult)]
    failed_count = len(worker_results) - len(valid_results)
    aggregate_envelope = HandoffEnvelope(
        source=AgentRole.WORKER,
        target=AgentRole.AGGREGATOR,
        payload={"results": [r.model_dump() for r in valid_results]},
        trace_id=trace_id,
        confidence=len(valid_results) / len(worker_results),  # Ratio as confidence
        # Only warn when something actually failed
        warnings=[f"{failed_count}/{len(worker_results)} workers failed"] if failed_count else [],
    )
    return await aggregator.run(valid_results, envelope=aggregate_envelope)
This is not elegant. It's verbose and explicit. That's the point.
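The pipeline above depends on agent objects that aren't shown, so here is a runnable sketch of just the Phase 2 and Phase 3 behavior using stub coroutines. `stub_worker` and `run_workers` are made up for the illustration, and the delays and the forced failure are arbitrary.

```python
import asyncio


async def stub_worker(task_id: str, delay: float, fail: bool = False) -> dict:
    """Stand-in for a real worker call (hypothetical)."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"worker for {task_id} crashed")
    return {"task_id": task_id, "status": "success"}


async def run_workers() -> tuple[list[dict], list[BaseException]]:
    outcomes = await asyncio.gather(
        stub_worker("t1", 0.01),
        stub_worker("t2", 0.03, fail=True),
        stub_worker("t3", 0.02),
        return_exceptions=True,  # wait for every worker; exceptions come back as values
    )
    # gather preserves input order, so results line up with the submitted tasks
    results = [o for o in outcomes if not isinstance(o, BaseException)]
    errors = [o for o in outcomes if isinstance(o, BaseException)]
    return results, errors


results, errors = asyncio.run(run_workers())
print(f"{len(results)} succeeded, {len(errors)} failed")
```

The slowest worker is also the one that fails, yet nothing downstream runs until all three have settled, and the crash shows up as a value you can count and report rather than as a lost task.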
What I Learned
The first multi-agent system I built looked clever. It used dynamic routing, context-aware agent selection, and implicit handoffs based on agent names. It worked great until I ran it on 50 concurrent tasks at 3 AM and woke up to a mess of partial results and silent failures.
The second version was ugly but correct. Every handoff was a typed contract. Every failure was explicit. Every agent was isolated.
The third version — the current one — is the first version's elegance built on the second version's discipline. The agents still use structured output. The handoffs still carry metadata. But I've hidden the boilerplate behind a thin framework so the actual agent logic stays clean.
If you're building multi-agent systems: start with the ugly-correct version. Get it wrong in production first. Then make it elegant.
The handoff problem doesn't get easier — but you stop being surprised by it.