Recently I came across a GitHub issue that fundamentally changed how I think about AI agent reliability.
A developer reported that their agent was supposed to call a tool: query a database, hit an API, whatever the task required. Instead, the agent skipped the tool call entirely and fabricated the result. It generated plausible-looking data that never came from any real source.
No error was thrown. No trace was logged. The output looked completely normal.
I'd heard about hallucination in LLM outputs before. But this was different. This wasn't the model making up facts in a chat response. This was an agent in an orchestrated workflow pretending it executed an action and generating fake evidence that it did.
That distinction matters. A lot.
I went looking for more
Over the next two weeks, I went through GitHub issues and community discussions across the major agent frameworks and observability tools. I wanted to understand: is this a one-off? A specific framework issue? Or something more fundamental?
What I found is that multi-agent observability is a genuinely unsolved problem and it's not because people aren't trying.
Why this is so hard
Traditional software observability (think Datadog, New Relic, Sentry) is built around a request/response model. A request comes in, gets processed through a known set of services, and a response goes out. You can trace the whole thing with distributed tracing.
Agent workflows don't work that way.
An agent might reason through a problem for dozens of steps. At each step, it decides which tool to call, what arguments to pass, and how to interpret the result. It might delegate to sub-agents, who make their own decisions. It might loop, retry, or abandon a path and try a different approach.
The execution path is non-deterministic. And that breaks the assumptions that traditional observability tools were built on.
Here's what I kept seeing across GitHub issues and discussions:
Tracing collapses in multi-agent setups. When a supervisor agent orchestrates multiple sub-agents, the trace often shows the overall input rather than what each individual agent received. You can see the final output, but you can't tell which agent did what.
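To make the failure mode concrete, here's a toy sketch (not any real framework's API; all names are hypothetical) of what per-agent tracing needs to capture: each sub-agent span records the input *that agent* received, not just the supervisor's overall request.

```python
# Toy tracing sketch: a supervisor fans work out to sub-agents, and each
# sub-agent span carries its own input. The collapse described above happens
# when only the root span's input is recorded.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def run_subagent(name: str, task: str) -> Span:
    # The sub-agent's span records the input it actually received.
    span = Span(name=name, attributes={"agent.input": task})
    span.attributes["agent.output"] = f"{name} handled: {task}"
    return span

def run_supervisor(user_request: str) -> Span:
    root = Span(name="supervisor", attributes={"agent.input": user_request})
    # The supervisor decomposes the request; each piece is traced separately.
    for i, task in enumerate(user_request.split(";")):
        root.children.append(run_subagent(f"sub-agent-{i}", task.strip()))
    return root

trace = run_supervisor("fetch sales data; summarize trends")
for child in trace.children:
    print(child.name, "->", child.attributes["agent.input"])
```

With this shape, "which agent did what" is answerable by walking the span tree; without the per-child `agent.input`, you only ever see the root request.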
Basic visibility is still being built. Some major frameworks have open feature requests that are months old, asking for the ability to see what agents are actually doing: which tools they called, what data they accessed, what decisions they made. These aren't edge cases. This is core observability.
The standard is still in development. OpenTelemetry is working on semantic conventions for AI agents. As of early 2026, the status is "Development." Not experimental. Not stable. Still being defined. Every team building agent tracing today is building on conventions that haven't been finalized yet.
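For a sense of what "still being defined" means in practice, here's an illustrative pair of spans using attribute names from the OpenTelemetry GenAI semantic conventions draft. These are plain dicts, not a real exporter, and because the conventions are in Development status, the names and allowed values may change before stabilization; the model and tool names are made-up examples.

```python
# Illustrative only: attribute names follow the OpenTelemetry GenAI semantic
# conventions draft (status: Development) and may change. No SDK involved.
llm_call_span = {
    "name": "chat gpt-4o",
    "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "gpt-4o",       # example model name
        "gen_ai.usage.input_tokens": 412,       # example token counts
        "gen_ai.usage.output_tokens": 98,
    },
}

tool_call_span = {
    "name": "execute_tool query_database",
    "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "query_database",   # hypothetical tool name
    },
}

print(llm_call_span["attributes"]["gen_ai.usage.input_tokens"])
```

Teams instrumenting agents today are betting on names like these staying stable; if they shift, every dashboard and query built on them needs migrating.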
The gap in the numbers
A survey of over 1,300 developers building with AI agents found that over 89% say they have some form of observability implemented. But only 62% say they can actually inspect what their agents do at each individual step.
That 27% gap is significant. These are teams who have set up some form of monitoring, but can't trace through agent decisions, tool calls, and outputs step by step.
When an agent fabricates a tool output (like the issue I started with), you need step-level inspection to catch it. Aggregate logging won't show you anything is wrong. The response looks clean. The metrics look normal. The failure is silent.
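One workaround sketch for exactly this failure: cross-check the tool results the agent *claims* in its transcript against the tool calls the runtime actually executed. The data shapes below are assumptions for illustration, not any framework's real schema; a claimed result with no matching invocation is a fabrication candidate.

```python
# Step-level fabrication check (assumed data shapes, not a real framework):
# any "tool_result" in the transcript that has no matching entry in the
# runtime's invocation log was never actually executed.

def find_fabricated_steps(transcript: list[dict],
                          invocation_log: list[dict]) -> list[dict]:
    executed = {(call["tool"], call["call_id"]) for call in invocation_log}
    suspicious = []
    for step in transcript:
        if step["type"] == "tool_result":
            if (step["tool"], step["call_id"]) not in executed:
                suspicious.append(step)  # claimed result, no real call
    return suspicious

transcript = [
    {"type": "tool_result", "tool": "query_db", "call_id": "1", "content": "42 rows"},
    {"type": "tool_result", "tool": "query_db", "call_id": "2", "content": "7 rows"},
]
invocation_log = [{"tool": "query_db", "call_id": "1"}]  # call 2 never ran

print(find_fabricated_steps(transcript, invocation_log))  # flags call_id "2"
```

The catch, of course, is that this requires an invocation log recorded outside the agent's own output, which is precisely the step-level instrumentation most stacks don't have yet.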
The cost dimension
Lack of visibility doesn't just cause quality problems; it causes cost problems too.
I found cases where teams experienced 300% cost spikes from a single workflow change. Without per-operation cost tracking, they couldn't figure out which agent or step was burning money. When agents loop, retry, and make multiple model calls, costs compound in ways that are invisible without granular tracing.
I also found reports of popular tracing tools miscalculating costs by nearly 5x. When even your observability tool's cost estimates are wrong, you're flying blind.
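The per-operation tracking these teams lacked can be as simple as a ledger keyed by (agent, step). A minimal sketch, with made-up placeholder prices and model names:

```python
# Minimal per-step cost ledger (prices and model names are placeholders):
# attribute every model call to an (agent, step) pair so a spike can be
# traced back to the operation that caused it.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.03}  # hypothetical rates

class CostLedger:
    def __init__(self):
        self.by_step = defaultdict(float)

    def record(self, agent: str, step: int, model: str, tokens: int):
        self.by_step[(agent, step)] += tokens / 1000 * PRICE_PER_1K_TOKENS[model]

    def top_spenders(self, n: int = 3):
        # Sort steps by accumulated cost, most expensive first.
        return sorted(self.by_step.items(), key=lambda kv: -kv[1])[:n]

ledger = CostLedger()
ledger.record("planner", 1, "model-b", 3000)
ledger.record("worker", 2, "model-a", 1000)
ledger.record("worker", 2, "model-a", 5000)  # retries compound silently
print(ledger.top_spenders())
```

Note how the worker's retries accumulate under one key: without this granularity, a loop that triples your bill shows up only as an unexplained total.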
This is an ecosystem problem, not a blame game
I want to be clear about something: I'm not pointing fingers at any framework or tool.
Agent orchestration is evolving at an incredible pace. New capabilities ship every week. The observability tooling is trying to keep up, but the problem space is genuinely hard. Non-deterministic execution, multi-model routing, tool fabrication detection: these aren't simple engineering problems.
The OpenTelemetry community is doing good work on agent-level semantic conventions. Framework teams are adding tracing support. Observability tools are building out multi-agent features. Everyone is moving in the right direction.
But right now, today, there's a gap. And developers shipping agents to production are feeling it.
What I'm trying to understand
I'm researching this space: mapping out where the gaps are, how teams are working around them, and what would actually help.
If you're building with AI agents in production and you've dealt with any of this:
Silent failures where the output looked correct but the data was wrong
Cost spikes you couldn't trace back to a specific agent or step
Debugging with print statements because your tracing tool doesn't capture enough detail
Multi-agent workflows where you can't tell which agent did what
I'd genuinely love to hear about your experience. What's your current debugging setup? Where does it break down? What would you want to exist?
Drop a comment or send me a message. If you're open to a quick 15-minute call to walk me through your workflow, here's my calendar: