Everyone building multi-agent systems is focused on making agents smarter. Nobody talks about what happens when your agents are smart enough but your state files are three days stale.
I run 39 agents daily. The system that breaks isn't the one with dumb agents. It's the one where nobody can tell what the agents were looking at when they made their decisions. You built the agents, you defined their roles, you wired the routing. But when the system produces a result, can you trace the reasoning chain? Can you tell what Agent 3 decided, what context it received, what it chose to ignore?
Probably not. And that invisible middle is where your worst bugs live.
Logging is not observability
The first instinct is to add logging. Log every agent invocation, every tool call, every response. Some frameworks do this by default. You end up with thousands of lines per task, and the signal-to-noise ratio approaches zero.
Logging tells you what happened. Observability tells you why it happened. The difference matters because in a multi-agent system, the "what" is usually obvious. Agent A called Agent B. Agent B produced a summary. Agent C made a decision based on that summary. The "why" is where things get interesting.
Why did Agent B summarize the document that way? What context did it receive? Was there information it should have seen but didn't? When Agent C made its decision, was it responding to the actual document or to Agent B's interpretation of the document?
These questions can't be answered with log lines.
The confident wrong answer
The worst failure mode in a multi-agent system isn't a crash. Crashes are loud. You notice them. You fix them. The worst failure mode is the confident wrong answer.
Agent A retrieves the right documents. Agent B summarizes them but subtly mischaracterizes one key point. Agent C makes a decision based on that summary. Agent D formats the output beautifully. The final result looks correct, reads professionally, and is wrong in a way that nobody catches until a human notices the downstream damage days later.
This failure mode exists because each agent in the chain operated correctly within its own scope. Agent B didn't fail. It summarized. The summarization just lost a critical nuance. And since nobody is watching the intermediate representations, the error propagates silently through the system.
```
# What most systems track
{
  "agent": "summarizer",
  "input_tokens": 4200,
  "output_tokens": 380,
  "latency_ms": 1240,
  "status": "success"
}
```

```
# What observability actually requires
{
  "agent": "summarizer",
  "task_id": "review-q1-financials",
  "input_context": {
    "documents": ["q1-report.pdf", "budget-variance.csv"],
    "scoped_to": ["financial_data"],
    "excluded": ["employee_records"]
  },
  "reasoning_trace": {
    "key_points_extracted": 7,
    "points_included_in_summary": 5,
    "points_omitted": [
      "Q1 variance exceeded threshold by 12%",
      "Vendor contract renewal pending"
    ],
    "omission_reason": "below relevance threshold (0.6)"
  },
  "downstream_consumers": ["decision_agent", "audit_trail"],
  "confidence": 0.82
}
```
The first record tells you the agent ran. The second tells you what it thought it was doing. That difference is the entire gap between debugging and guessing.
Three layers of agent observability
I've been running a 39-agent system for a few months now. Three observability layers consistently matter:
1. Context tracing
For every agent invocation, capture what context the agent received, not just what it produced. This includes scoped documents, upstream agent outputs, and any system state it had access to. When something goes wrong, the first question is always "what did this agent actually see?" Without context tracing, you're reconstructing the answer from logs and hope.
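One way to make context tracing structural rather than optional is to wrap every invocation so the received context is recorded alongside the output. A minimal sketch; the names (`ContextTrace`, `invoke_with_trace`) are hypothetical, not from any particular framework:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ContextTrace:
    """What an agent was actually given, captured at call time."""
    agent: str
    documents: list          # documents scoped into this call
    upstream_outputs: dict   # outputs of earlier agents this one saw
    excluded: list = field(default_factory=list)  # explicitly withheld

def invoke_with_trace(agent_name, agent_fn, documents, upstream_outputs, excluded=()):
    """Run the agent and return (output, trace) as one unit, so the
    trace can never drift out of sync with what the agent received."""
    trace = ContextTrace(agent_name, list(documents), dict(upstream_outputs), list(excluded))
    output = agent_fn(documents, upstream_outputs)
    return output, asdict(trace)
```

Because the trace is built from the same arguments passed to the agent, "what did this agent actually see?" has a recorded answer instead of a reconstructed one.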
2. Decision boundaries
Agents make decisions. Summarizers decide what to include and what to omit. Routers decide which agent handles a task. Reviewers decide whether work passes or fails. For each decision point, capture the inputs to the decision, the decision itself, and the threshold or reasoning that produced it. This turns opaque agent behavior into auditable decision records.
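A decision record can be as simple as capturing the scores and the threshold together, so the include/omit choice is auditable after the fact. A sketch, assuming a relevance-scored summarizer; the helper name and field layout are illustrative:

```python
def record_decision(agent, inputs, threshold, scores):
    """Turn an include/omit choice into an auditable record:
    inputs to the decision, the decision itself, and the threshold
    that produced it."""
    included = [point for point, s in scores.items() if s >= threshold]
    omitted = [point for point, s in scores.items() if s < threshold]
    return {
        "agent": agent,
        "inputs": inputs,
        "threshold": threshold,
        "included": included,
        "omitted": omitted,
        "omission_reason": f"below relevance threshold ({threshold})",
    }
```

The record is what lets you later ask whether "Q1 variance exceeded threshold by 12%" was dropped by a bug or by a threshold working as configured.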
3. Propagation tracking
When Agent B's output becomes Agent C's input, track that lineage explicitly. Not just "B ran before C," but "C's context included B's output, specifically these fields." When a confident wrong answer emerges at the end of a chain, propagation tracking lets you walk backward through the chain and find exactly where the signal degraded.
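Lineage can be stored as a small graph keyed by output ID: each step records which upstream outputs (and which of their fields) it consumed, and walking backward is a graph traversal. A minimal in-memory sketch with hypothetical helper names:

```python
# Lineage store: output_id -> which agent produced it and what it consumed.
lineage = {}

def record_step(agent, output_id, consumed):
    """consumed: list of (upstream_output_id, fields_used) pairs."""
    lineage[output_id] = {"agent": agent, "consumed": consumed}

def walk_back(output_id):
    """Walk backward from a final output to every contributing step."""
    chain = []
    frontier = [output_id]
    while frontier:
        oid = frontier.pop()
        step = lineage.get(oid)
        if step is None:
            continue  # external input, nothing upstream to follow
        chain.append((oid, step["agent"]))
        frontier.extend(upstream for upstream, _fields in step["consumed"])
    return chain
```

When the confident wrong answer surfaces, `walk_back` gives you the ordered chain of agents whose intermediate outputs you need to inspect, instead of a pile of interleaved logs.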
Implementation without overhead
The practical concern is always performance. Adding observability shouldn't double your latency or token costs. Three approaches that keep overhead minimal:
Structured metadata, not full traces. You don't need to capture every token. Capture the decision-relevant metadata: what context was scoped, what was included vs. excluded, what threshold was applied. This is typically 5-10% of the full trace size.
Sampling for healthy paths. Trace 100% of failures and anomalies. Sample 10-20% of successful paths. You'll catch degradation patterns without drowning in data.
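The sampling rule above fits in a few lines. A sketch; the anomaly threshold and default rate are illustrative, and the injectable `rng` exists only to make the behavior testable:

```python
import random

def should_trace(status, anomaly_score, sample_rate=0.15, rng=random.random):
    """Trace 100% of failures and anomalies; sample a fraction of
    healthy paths. Thresholds here are illustrative, not prescriptive."""
    if status != "success" or anomaly_score > 0.8:
        return True
    return rng() < sample_rate
```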
Async emission. Don't block agent execution to write observability data. Emit events asynchronously to a separate store. The agent keeps working. The trace data arrives slightly behind, which is fine because you're not reading it in real time anyway. You're reading it when something goes wrong.
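The async pattern is a plain producer/consumer queue: the agent enqueues and moves on, and a background thread drains events into the store. A minimal sketch using the standard library; the list-backed `sink` stands in for whatever store you actually write to:

```python
import json
import queue
import threading

trace_queue = queue.Queue()

def emit(event):
    """Called from the agent's hot path: just enqueue, never block on I/O."""
    trace_queue.put(event)

def writer(sink):
    """Background consumer: drains the queue into a separate store."""
    while True:
        event = trace_queue.get()
        if event is None:  # sentinel signals shutdown
            break
        sink.append(json.dumps(event))

sink = []
t = threading.Thread(target=writer, args=(sink,), daemon=True)
t.start()

emit({"agent": "summarizer", "status": "success"})

trace_queue.put(None)  # flush and stop the writer
t.join()
```

The trade-off is exactly the one described above: trace data lags execution slightly, which costs nothing because you read it after the fact, not in real time.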
The observability question
Before you add another agent to your system, try answering these questions about the agents you already have:
- When Agent B summarizes a document, can you see what it omitted and why?
- When the final output is wrong, can you trace backward to the specific agent that introduced the error?
- Can you tell the difference between an agent that failed and an agent that succeeded at the wrong task?
If you can't answer these, you're operating a black box. The fact that you built the box doesn't mean you can see inside it.
The pattern holds across every complex system. Capability without observability is a liability. If you can't watch your agents think, you're just waiting for the confident wrong answer to find its way to production.
I build and operate multi-agent systems daily. Writing about what breaks and what works at The Alignment Layer.
Sigil (cryptographic audit trails for AI agents): github.com/sly-the-fox/sigil
Top comments (1)
The confident wrong answer failure mode is real, but I'd push back on logging intermediate representations as the fix — at scale that doubles your token spend per task. Have you found a way to sample traces selectively without missing the subtle mischaracterizations?