Everyone building multi-agent systems is focused on making agents smarter. Nobody talks about what happens when your agents are smart enough but your state files are three days stale.
I run 39 agents daily. The system that breaks isn't the one with dumb agents. It's the one where nobody can tell what the agents were looking at when they made their decisions. You built the agents, you defined their roles, you wired the routing. But when the system produces a result, can you trace the reasoning chain? Can you tell what Agent 3 decided, what context it received, what it chose to ignore?
Probably not. And that invisible middle is where your worst bugs live.
Logging is not observability
The first instinct is to add logging. Log every agent invocation, every tool call, every response. Some frameworks do this by default. You end up with thousands of lines per task, and the signal-to-noise ratio approaches zero.
Logging tells you what happened. Observability tells you why it happened. The difference matters because in a multi-agent system, the "what" is usually obvious. Agent A called Agent B. Agent B produced a summary. Agent C made a decision based on that summary. The "why" is where things get interesting.
Why did Agent B summarize the document that way? What context did it receive? Was there information it should have seen but didn't? When Agent C made its decision, was it responding to the actual document or to Agent B's interpretation of the document?
These questions can't be answered with log lines.
The confident wrong answer
The worst failure mode in a multi-agent system isn't a crash. Crashes are loud. You notice them. You fix them. The worst failure mode is the confident wrong answer.
Agent A retrieves the right documents. Agent B summarizes them but subtly mischaracterizes one key point. Agent C makes a decision based on that summary. Agent D formats the output beautifully. The final result looks correct, reads professionally, and is wrong in a way that nobody catches until a human notices the downstream damage days later.
This failure mode exists because each agent in the chain operated correctly within its own scope. Agent B didn't fail. It summarized. The summarization just lost a critical nuance. And since nobody is watching the intermediate representations, the error propagates silently through the system.
```
# What most systems track
{
  "agent": "summarizer",
  "input_tokens": 4200,
  "output_tokens": 380,
  "latency_ms": 1240,
  "status": "success"
}
```

```
# What observability actually requires
{
  "agent": "summarizer",
  "task_id": "review-q1-financials",
  "input_context": {
    "documents": ["q1-report.pdf", "budget-variance.csv"],
    "scoped_to": ["financial_data"],
    "excluded": ["employee_records"]
  },
  "reasoning_trace": {
    "key_points_extracted": 7,
    "points_included_in_summary": 5,
    "points_omitted": [
      "Q1 variance exceeded threshold by 12%",
      "Vendor contract renewal pending"
    ],
    "omission_reason": "below relevance threshold (0.6)"
  },
  "downstream_consumers": ["decision_agent", "audit_trail"],
  "confidence": 0.82
}
```
The first record tells you the agent ran. The second tells you what it thought it was doing. That difference is the entire gap between debugging and guessing.
Three layers of agent observability
I've been running a 39-agent system for a few months now. Three observability layers consistently matter:
1. Context tracing
For every agent invocation, capture what context the agent received, not just what it produced. This includes scoped documents, upstream agent outputs, and any system state it had access to. When something goes wrong, the first question is always "what did this agent actually see?" Without context tracing, you're reconstructing the answer from logs and hope.
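One way to make context tracing structural rather than optional is to wrap every invocation so the received context is recorded alongside the output. A minimal sketch; the names (`ContextTrace`, `invoke_with_trace`) are hypothetical, not from any particular framework:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ContextTrace:
    """What an agent was actually given, captured at call time."""
    agent: str
    documents: list          # documents scoped into this call
    upstream_outputs: dict   # outputs of earlier agents this one saw
    excluded: list = field(default_factory=list)  # explicitly withheld

def invoke_with_trace(agent_name, agent_fn, documents, upstream_outputs, excluded=()):
    """Run the agent and return (output, trace) as one unit, so the
    trace can never drift out of sync with what the agent received."""
    trace = ContextTrace(agent_name, list(documents), dict(upstream_outputs), list(excluded))
    output = agent_fn(documents, upstream_outputs)
    return output, asdict(trace)
```

Because the trace is built from the same arguments passed to the agent, "what did this agent actually see?" has a recorded answer instead of a reconstructed one.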
2. Decision boundaries
Agents make decisions. Summarizers decide what to include and what to omit. Routers decide which agent handles a task. Reviewers decide whether work passes or fails. For each decision point, capture the inputs to the decision, the decision itself, and the threshold or reasoning that produced it. This turns opaque agent behavior into auditable decision records.
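A decision record can be as simple as capturing the scores and the threshold together, so the include/omit choice is auditable after the fact. A sketch, assuming a relevance-scored summarizer; the helper name and field layout are illustrative:

```python
def record_decision(agent, inputs, threshold, scores):
    """Turn an include/omit choice into an auditable record:
    inputs to the decision, the decision itself, and the threshold
    that produced it."""
    included = [point for point, s in scores.items() if s >= threshold]
    omitted = [point for point, s in scores.items() if s < threshold]
    return {
        "agent": agent,
        "inputs": inputs,
        "threshold": threshold,
        "included": included,
        "omitted": omitted,
        "omission_reason": f"below relevance threshold ({threshold})",
    }
```

The record is what lets you later ask whether "Q1 variance exceeded threshold by 12%" was dropped by a bug or by a threshold working as configured.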
3. Propagation tracking
When Agent B's output becomes Agent C's input, track that lineage explicitly. Not just "B ran before C," but "C's context included B's output, specifically these fields." When a confident wrong answer emerges at the end of a chain, propagation tracking lets you walk backward through the chain and find exactly where the signal degraded.
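Lineage can be stored as a small graph keyed by output ID: each step records which upstream outputs (and which of their fields) it consumed, and walking backward is a graph traversal. A minimal in-memory sketch with hypothetical helper names:

```python
# Lineage store: output_id -> which agent produced it and what it consumed.
lineage = {}

def record_step(agent, output_id, consumed):
    """consumed: list of (upstream_output_id, fields_used) pairs."""
    lineage[output_id] = {"agent": agent, "consumed": consumed}

def walk_back(output_id):
    """Walk backward from a final output to every contributing step."""
    chain = []
    frontier = [output_id]
    while frontier:
        oid = frontier.pop()
        step = lineage.get(oid)
        if step is None:
            continue  # external input, nothing upstream to follow
        chain.append((oid, step["agent"]))
        frontier.extend(upstream for upstream, _fields in step["consumed"])
    return chain
```

When the confident wrong answer surfaces, `walk_back` gives you the ordered chain of agents whose intermediate outputs you need to inspect, instead of a pile of interleaved logs.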
Implementation without overhead
The practical concern is always performance. Adding observability shouldn't double your latency or token costs. Three approaches that keep overhead minimal:
Structured metadata, not full traces. You don't need to capture every token. Capture the decision-relevant metadata: what context was scoped, what was included vs. excluded, what threshold was applied. This is typically 5-10% of the full trace size.
Sampling for healthy paths. Trace 100% of failures and anomalies. Sample 10-20% of successful paths. You'll catch degradation patterns without drowning in data.
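The sampling rule above fits in a few lines. A sketch; the anomaly threshold and default rate are illustrative, and the injectable `rng` exists only to make the behavior testable:

```python
import random

def should_trace(status, anomaly_score, sample_rate=0.15, rng=random.random):
    """Trace 100% of failures and anomalies; sample a fraction of
    healthy paths. Thresholds here are illustrative, not prescriptive."""
    if status != "success" or anomaly_score > 0.8:
        return True
    return rng() < sample_rate
```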
Async emission. Don't block agent execution to write observability data. Emit events asynchronously to a separate store. The agent keeps working. The trace data arrives slightly behind, which is fine because you're not reading it in real time anyway. You're reading it when something goes wrong.
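The async pattern is a plain producer/consumer queue: the agent enqueues and moves on, and a background thread drains events into the store. A minimal sketch using the standard library; the list-backed `sink` stands in for whatever store you actually write to:

```python
import json
import queue
import threading

trace_queue = queue.Queue()

def emit(event):
    """Called from the agent's hot path: just enqueue, never block on I/O."""
    trace_queue.put(event)

def writer(sink):
    """Background consumer: drains the queue into a separate store."""
    while True:
        event = trace_queue.get()
        if event is None:  # sentinel signals shutdown
            break
        sink.append(json.dumps(event))

sink = []
t = threading.Thread(target=writer, args=(sink,), daemon=True)
t.start()

emit({"agent": "summarizer", "status": "success"})

trace_queue.put(None)  # flush and stop the writer
t.join()
```

The trade-off is exactly the one described above: trace data lags execution slightly, which costs nothing because you read it after the fact, not in real time.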
The observability question
Before you add another agent to your system, try answering these questions about the agents you already have:
- When Agent B summarizes a document, can you see what it omitted and why?
- When the final output is wrong, can you trace backward to the specific agent that introduced the error?
- Can you tell the difference between an agent that failed and an agent that succeeded at the wrong task?
If you can't answer these, you're operating a black box. The fact that you built the box doesn't mean you can see inside it.
The pattern holds across every complex system. Capability without observability is a liability. If you can't watch your agents think, you're just waiting for the confident wrong answer to find its way to production.
I build and operate multi-agent systems daily. Writing about what breaks and what works at The Alignment Layer.
Sigil (cryptographic audit trails for AI agents): github.com/sly-the-fox/sigil
Top comments (1)
The confident wrong answer failure mode is real, but I'd push back on logging intermediate representations as the fix — at scale that doubles your token spend per task. Have you found a way to sample traces selectively without missing the subtle mischaracterizations?