DEV Community

SAI

Traces are trees. Multi-agent failures are graphs.

Multi-agent AI systems fail in a way that's hard to catch. Nothing crashes. No error is thrown. Agent A hands work to Agent B, who hands it back to Agent A. Or Agent A produces bad output that quietly corrupts Agent B's input, which cascades to Agent C. The agents look busy. Your only signal is a billing dashboard days later, or a user reporting wrong results that nobody can trace back to a root cause.

This actually happened. One team deployed a research tool with four AI agents coordinating to help users find market data. Two of the agents got stuck delegating to each other. The loop ran for 11 days before anyone noticed. The bill was $47,000. The only reason it was caught at all was that someone happened to look at the billing dashboard.

These are different from the agent failures making headlines. Meta's rogue agent SEV1, Replit's database deletion, Claude Code running terraform destroy on production -- those are single-agent safety failures. They're bad, but they're loud. Something visibly breaks. Someone notices. Coordination failures between agents are the quiet kind.

And they're common. The UC Berkeley MAST study analyzed 1,642 multi-agent traces across 7 frameworks and found that 36.9% of all failures are coordination problems between agents. Step repetition alone -- agents doing the same work over and over -- accounts for 17.14%. Google DeepMind's scaling research found 17x error amplification in independent multi-agent architectures. These aren't edge cases. They're the most common failure mode.

Why existing tools miss this

I started looking at what observability tools were available. LangSmith, Langfuse, Arize, Datadog's LLM monitoring, AgentOps, Braintrust. They're all good tools. They all use the same data model: traces. A tree of parent-child spans, inherited from traditional APM.

This makes sense for single-agent workflows. User request goes to agent, agent calls tool, tool calls LLM, LLM returns. That's a tree. Traces represent it perfectly.

But when Agent A delegates to Agent B, who delegates to Agent C, who delegates back to Agent A -- that's a cycle. Trees can't represent cycles. In a trace, a loop between three agents just looks like a very long list of spans. There's nothing structurally wrong with it according to the data model. The loop is invisible because the data model literally cannot express it.
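Here's a minimal sketch of the difference (the event data is hypothetical). The same three delegation events are just a flat list in a trace, but as soon as you treat them as edges in a directed graph, a plain DFS finds the cycle:

```python
from collections import defaultdict

# Delegation events as (from_agent, to_agent) edges. In a trace these
# are three unrelated spans; as a graph they form A -> B -> C -> A.
events = [("A", "B"), ("B", "C"), ("C", "A")]

graph = defaultdict(list)
for src, dst in events:
    graph[src].append(dst)

def find_cycle(graph):
    """DFS with an explicit recursion stack: returns one cycle, or None."""
    visited, stack = set(), []

    def dfs(node):
        visited.add(node)
        stack.append(node)
        for nxt in graph[node]:
            if nxt in stack:                       # back edge = cycle
                return stack[stack.index(nxt):] + [nxt]
            if nxt not in visited:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        return None

    for node in list(graph):
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

print(find_cycle(graph))  # ['A', 'B', 'C', 'A']
```

Nothing exotic: cycle detection via DFS back edges is a textbook algorithm. The point is that it has something to run on only once the data model stores edges instead of parent-child spans.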

Same problem with cascade failures. If Agent A's bad output corrupts Agent B, which corrupts Agent C, the causal path exists in the relationships between agents, not within any individual agent's span. Traces model relationships as parent-child. The actual failure pattern is a directed graph with arbitrary edge patterns.

I kept coming back to a simple observation: the data model doesn't match the problem domain. Multi-agent systems are graphs. Agents talk to other agents in arbitrary patterns. Cycles, fan-outs, cascades, bottlenecks -- these are all graph structures. But every tool models them as trees.

What I built

The fix seemed obvious: model agents as nodes and interactions as edges in a directed graph. Then loops are just cycles, cascades are paths, and bottlenecks are high-centrality nodes. These are all solved problems in graph theory.
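To make that concrete, here's an illustrative sketch (with made-up event data, not AgentSonar's internals) of how two of those patterns reduce to standard graph algorithms: a cascade is just reachability from the corrupted agent, and a bottleneck is the node most traffic flows through:

```python
from collections import defaultdict, deque

# Hypothetical interaction edges: B sits between everyone.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("E", "B")]
graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

def cascade_scope(graph, bad_agent):
    """BFS from a corrupted agent: every agent its bad output can reach."""
    seen, queue = {bad_agent}, deque([bad_agent])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {bad_agent}

def bottleneck(edges):
    """Highest total degree: the agent the most interactions pass through."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return max(degree, key=degree.get)

print(cascade_scope(graph, "A"))  # blast radius of A's bad output
print(bottleneck(edges))          # the high-centrality agent
```

Degree is the crudest centrality measure; a real implementation might use betweenness or something richer. But even this toy version answers questions a span tree structurally cannot.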

I wanted to see if this actually worked in practice, so I started building. The result is AgentSonar, a Python SDK that hooks into CrewAI and LangGraph as a passive listener and builds a runtime interaction graph as agents execute.

# CrewAI -- 2 lines
from agentsonar import AgentSonarListener
sonar = AgentSonarListener()
# ...run your crew normally. Detection happens automatically.

# LangGraph -- 2 lines
from agentsonar import monitor
graph = monitor(graph)
result = graph.invoke(input)

It currently detects three coordination failure classes: cyclic delegation (agents stuck in a loop), repetitive delegation (one agent hammering another without progress), and resource exhaustion (runaway throughput that would burn your token budget). All three fire in real time while your crew is still running.
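For a sense of what "fires in real time" means, here's a hedged sketch of the second detector class. The actual AgentSonar thresholds and API may differ; this just shows the core idea of a sliding time window over (delegator, delegate) pairs:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 30
MAX_DELEGATIONS = 13  # hypothetical threshold: A->B this often in-window

class RepetitiveDelegationDetector:
    def __init__(self):
        # (src, dst) -> deque of event timestamps
        self.history = defaultdict(deque)

    def record(self, src, dst, now=None):
        """Record one delegation; return an alert string if the pair is hot."""
        now = time.monotonic() if now is None else now
        window = self.history[(src, dst)]
        window.append(now)
        # Evict timestamps that have aged out of the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_DELEGATIONS:
            return f"repetitive delegation: {src}->{dst} x{len(window)} in {WINDOW_SECONDS}s"
        return None

detector = RepetitiveDelegationDetector()
alert = None
for i in range(13):
    alert = detector.record("A", "B", now=i)  # 13 delegations in 13 seconds
print(alert)
```

Because each `record` call is O(1) amortized, a detector like this can run inline with the event stream instead of as a post-hoc analysis pass, which is what makes mid-run alerts possible.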

Every run produces four output files, including a self-contained HTML report you can email or attach to a bug ticket:

[Screenshot: AgentSonar HTML report]

The report shows each coordination event as a card with its severity, failure class, the agent topology involved, and the thresholds that triggered the alert. There's also a live JSONL timeline you can tail -f from a second terminal to watch failures surface the moment they happen, before the run finishes.
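One nice property of JSONL as a timeline format is that each line is a complete event, so any follower can parse whatever has been written so far. A sketch of consuming it (the field names here are hypothetical; the real AgentSonar schema may differ):

```python
import io
import json

# Stand-in for the live timeline file; each line is one JSON event.
timeline = io.StringIO(
    '{"ts": 1.0, "type": "delegation", "src": "A", "dst": "B"}\n'
    '{"ts": 2.5, "type": "alert", "class": "cyclic_delegation"}\n'
)

alerts = []
for line in timeline:
    event = json.loads(line)
    if event.get("type") == "alert":
        alerts.append(event)

print(alerts[0]["class"])  # cyclic_delegation
```

The same loop works against the real file with `open(path)`, which is why a plain `tail -f` in a second terminal is enough to watch failures as they land.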

It's designed to be lightweight: no accounts, no API keys, no cloud service required. Just pip install agentsonar and two lines of code. The SDK is on PyPI and the public repo (issues, discussions, changelog) is on GitHub.

What I don't know

Observability tools like Langfuse or Datadog could theoretically add cycle detection. But their data models are trace-based -- trees of parent-child spans. Bolting graph analysis onto a tree isn't a feature toggle, it's an architectural change. It's the difference between adding a column to a database and switching from SQL to a graph database.

Frameworks like CrewAI and LangGraph could build loop detection natively (AutoGen is already discussing it). But frameworks are built to run agents, not to analyze how they coordinate. CrewAI knows that Agent A delegated to Agent B. But it's not asking whether A has delegated to B thirteen times in the last thirty seconds, or whether A→B→C→A is forming a cycle, or whether this delegation pattern costs $14 every time it fires. That pattern-level analysis is a different layer, the same way your web framework handles requests but a separate tool (Datadog, Sentry) detects when something is wrong with the patterns across those requests.

But I could be wrong. Maybe better models just make the problem go away. One person I talked to said they haven't had coordination issues since switching to newer frontier models. Maybe the problem gets solved by smarter agents rather than better observability.

If you're running multi-agent systems in production, I'd like to hear your experience. What coordination failures have you hit? How did you find them? Did you build internal tooling, or just add retry limits and move on?

GitHub: https://github.com/agentsonar/agentsonar
PyPI: pip install agentsonar
