DEV Community

PracticeOverflow


Observability for Agentic Systems: Why Your Dashboards Are Lying to You

Only 14% of organizations run observability on their LLM workloads. Up from 5% a year ago, sure. But still: 86% are flying blind.

Meanwhile, agents are making 6-27 tool calls per investigation. They loop. They branch. They backtrack when a tool returns garbage. They spawn sub-agents that spawn sub-agents. And every one of those interactions generates traces that look nothing like the HTTP request-response pairs your Grafana dashboards were designed to render.

We spent fifteen years perfecting observability for services that receive a request and return a response. Agents don't do that. And the gap between what we can see and what's actually happening is growing wider every week.


5-Minute Skim

If you're short on time, here's the shape:

  • Traditional distributed tracing captures the "what" but misses the "why." A span tree shows you an agent called a tool. It doesn't show you the reasoning chain that decided which tool to call, or why it retried three times before switching strategies.
  • OpenTelemetry's gen_ai.* semantic conventions are the emerging standard. They add model name, token counts, prompt content, and tool invocation metadata to spans. Red Hat demonstrated full W3C context propagation across MCP server boundaries.
  • Discord's Envelope pattern solves the actor-model tracing problem. By wrapping every message in an observable envelope, they trace fanout across millions of recipients with adaptive sampling -- 100% for single-recipient messages, 0.1% for 10K+ fanouts.
  • The Three Villains of observability data -- retention, sampling, and rollups -- hit agents harder than traditional services. Agent traces need 30-365 days of full-fidelity data, not the 7-14 day window most platforms default to. ClickHouse argues this is achievable at $0.0005/GB/month.
  • Auto-instrumentation gets you 60% of the way. The remaining 40% requires tool-level decorators plus manual spans on reasoning steps, tool selection logic, and agent-to-agent delegation.

What Does an Agent Trace Actually Look Like?

Here's the problem, side by side.

A traditional microservice trace is a tree. Request comes in, fans out to three services, each returns, done. Clean. Predictable. Your Jaeger UI renders it beautifully.

An agent trace is something else entirely.

See the difference? The agent trace has cycles. The planner calls a tool, gets a result, reasons about it, calls another tool, reasons again, maybe decides the first result was wrong and retries. The trace isn't a tree -- it's a directed graph with loops. And the most important information isn't in the spans themselves. It's in the transitions between them: why did the agent choose tool B after tool A returned?

Traditional tracing captures I/O. Agent observability needs to capture intent.


Why Does Request-Response Tracing Break?

Three reasons. Each one is a paper cut. Together, they bleed out your entire observability strategy.

Reason one: agents are stateful across turns. A microservice handles a request and forgets. An agent accumulates context across a session that might last minutes or hours. The "trace" isn't one request -- it's a conversation. Your trace ID scoping, which assumes one ID per request-response cycle, can't represent this.

Reason two: tool calls cross trust boundaries. Red Hat's work on distributed tracing for agentic workflows showed that when an agent calls an MCP server, the trace context needs to propagate across a protocol boundary that wasn't designed for observability. W3C traceparent headers work for HTTP. MCP uses JSON-RPC over stdio or SSE. The context propagation mechanism is completely different.

Reason three: the cardinality explosion. Every prompt variation, every tool argument, every intermediate reasoning step is a unique attribute. Traditional services might have 50-100 distinct span attribute combinations. An agent interacting with 5 tools across 10 reasoning steps can generate thousands. Your metrics backend charges by series cardinality. Do the math.
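The math, in a hedged sketch. The counts are illustrative, not measurements from any particular metrics backend:

```python
# Back-of-envelope series-cardinality comparison. Each distinct
# (tool, step, argument-shape) combination becomes its own label set,
# i.e. its own billable time series in a metrics backend.

def series_count(tools: int, steps: int, arg_variants: int) -> int:
    """Rough upper bound on distinct span-attribute combinations."""
    return tools * steps * arg_variants

# A traditional service: one handler, one "step", ~80 attribute combos.
traditional = series_count(tools=1, steps=1, arg_variants=80)

# An agent: 5 tools, 10 reasoning steps, a few dozen argument shapes each.
agent = series_count(tools=5, steps=10, arg_variants=40)

print(traditional, agent)  # 80 2000
```

Twenty-five times the series count before you've added a second agent version or a sixth tool.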


How Did Discord Solve Tracing at Actor-Model Scale?

Discord processes billions of messages daily. Their architecture is built on Elixir's actor model -- millions of lightweight processes, each handling a slice of state. This is structurally similar to agentic systems: many autonomous units, communicating through messages, with no central orchestrator.

Their solution was the Envelope pattern.

Every message in the system gets wrapped in an Envelope -- a lightweight wrapper that carries trace context, sampling decisions, and causal metadata. The Envelope isn't the message. It's the observable skin around the message.

The key insight is fanout-aware sampling. When a message goes to one recipient, Discord samples at 100%. When a message fans out to 10,000+ recipients, they drop to 0.1%. The reasoning: a message that reaches 10,000 actors is structurally identical across all of them. You don't need 10,000 traces to understand what happened. You need ten.
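Here's the rate schedule as a sketch. The 100% and 0.1% endpoints come from the article; the interpolation in between and the deterministic hashing scheme are my assumptions, not Discord's actual implementation:

```python
import hashlib

def sample_rate(fanout: int) -> float:
    """More recipients -> structurally redundant traces -> lower rate."""
    if fanout <= 1:
        return 1.0              # single-recipient: keep everything
    if fanout < 10_000:
        # assumed interpolation between full sampling and the 0.1% floor
        return max(0.001, 1.0 / fanout)
    return 0.001                # 10K+ fanout: 0.1%

def should_sample(trace_id: str, fanout: int) -> bool:
    """Deterministic per-trace decision so every hop agrees."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 1_000_000
    return bucket < sample_rate(fanout) * 1_000_000

print(sample_rate(1), sample_rate(50_000))  # 1.0 0.001
```

Hashing the trace ID rather than rolling a die matters: the sampling decision rides along in the envelope, so every actor that touches the message keeps or drops the same traces.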

This is directly applicable to agentic systems. When a coordinator agent delegates to 20 sub-agents running the same analysis on different data shards, you don't need to trace all 20. You need enough to detect the outlier -- the one that failed, the one that took 10x longer, the one that produced a different result.

The Envelope pattern gives you that. And it keeps your trace storage from growing linearly with your agent count.


What Are the Three Villains Destroying Your Agent Data?

ClickHouse published a sharp analysis this week that names three structural problems in how we store observability data. They're bad for traditional systems. They're catastrophic for agents.

Villain 1: Retention. Most observability platforms default to 7-14 days of trace retention. For request-response services, that's usually fine. You debug the incident, you move on. But agents learn. They build context over sessions. When an agent misbehaves on day 15, and the root cause was a subtle prompt drift that started on day 3, your data is already gone. Agent traces need 30-365 day retention. ClickHouse claims this is feasible at $0.0005/GB/month using tiered storage.

Villain 2: Sampling. Head-based sampling decides at trace start whether to keep or drop. For agents, this is a disaster. The most interesting traces -- the ones where the agent looped 14 times, switched strategies, and eventually produced a wrong answer -- are the long, expensive ones that sampling is biased to discard. You're systematically deleting your most valuable debugging data.

Tail-based sampling helps. It waits until the trace completes and keeps interesting ones. But "interesting" for agents means something different than "interesting" for HTTP services. Latency alone doesn't cut it. You need to sample based on reasoning depth, tool retry count, and output confidence -- metrics that only exist inside the agent's cognitive loop.
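What does an agent-aware keep/drop policy look like? A minimal sketch, with hypothetical attribute names and thresholds:

```python
from dataclasses import dataclass

@dataclass
class CompletedTrace:
    reasoning_steps: int
    tool_retries: int
    output_confidence: float   # 0.0-1.0, emitted by the agent itself
    error: bool

def keep_trace(t: CompletedTrace,
               max_steps: int = 10,
               max_retries: int = 2,
               min_confidence: float = 0.7) -> bool:
    """Keep anything anomalous in the cognitive loop, not just slow/errored."""
    return (t.error
            or t.reasoning_steps > max_steps
            or t.tool_retries > max_retries
            or t.output_confidence < min_confidence)

# The 14-loop, wrong-answer trace that head-based sampling is biased to discard:
bad = CompletedTrace(reasoning_steps=14, tool_retries=3,
                     output_confidence=0.4, error=False)
print(keep_trace(bad))  # True
```

Note that `error=False` here: the agent completed "successfully" by HTTP standards. Only the cognitive-loop signals flag it.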

Villain 3: Rollups. Pre-aggregating raw data into summary metrics destroys dimensions. When you roll up "average agent latency per tool" into a 1-minute bucket, you lose the ability to answer "which specific reasoning chain caused the latency spike?" Agents need full-fidelity data because the debugging questions are always about specific chains of decisions, not averages.

The compounding effect is brutal. After retention deletes old data, sampling discards rare-but-critical traces, and rollups flatten what's left into averages, you have maybe 2-5% of the information you'd need to debug a complex agent failure. You just don't know which 2-5%.


How Do You Actually Instrument an Agent with OTel?

OpenTelemetry's gen_ai.* semantic conventions, which stabilized in early 2026, give you a vocabulary for agent telemetry. Here's the layered approach that Red Hat demonstrated and Uptrace documents.

Layer 1: Auto-instrumentation. Libraries like opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic hook into the SDK client and automatically emit spans for every LLM call. You get model name, token counts (input, output, total), latency, and error status without writing a single line of instrumentation code. This is your baseline. It takes five minutes to set up and covers the LLM-call layer.

Layer 2: Tool-call spans. Each tool invocation needs its own span, nested under the parent agent span. The span should carry gen_ai.tool.name, the input arguments (scrubbed of PII), the output summary, and the latency. Most MCP frameworks are starting to emit these automatically, but coverage is inconsistent. Red Hat's decorator pattern -- wrapping each tool handler in a span-emitting decorator -- is the pragmatic approach.

Layer 3: Reasoning spans. This is the manual layer. When the agent decides which tool to call, when it evaluates a result and decides to retry, when it synthesizes multiple tool outputs into a response -- these reasoning steps are invisible to auto-instrumentation. You need to manually create spans around them with attributes like agent.reasoning.step, agent.strategy.selected, and agent.confidence.score.
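The decorator pattern is easier to see in code. This is a stdlib-only sketch with a hypothetical in-memory recorder standing in for a real tracer; in production the decorator would create OpenTelemetry spans rather than append to a list:

```python
import functools
import time

# Hypothetical in-memory recorder. A real implementation would call an
# OpenTelemetry tracer here; the list exists only to make the shape visible.
SPANS: list = []

def tool_span(tool_name: str):
    """Layer 2: wrap each tool handler in a span-emitting decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                SPANS.append({
                    "gen_ai.tool.name": tool_name,
                    "duration_ms": (time.monotonic() - start) * 1000,
                })
        return wrapper
    return decorator

@tool_span("search_docs")
def search_docs(query: str) -> str:
    # stand-in tool handler
    return f"results for {query}"

search_docs("retention policy")
print(SPANS[0]["gen_ai.tool.name"])  # search_docs
```

Layer 3 works the same way, just with `agent.reasoning.step`-style attributes and spans opened around decision points instead of tool handlers.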

The ratio in practice is roughly 60% auto, 25% semi-auto (tool-level decorators), and 15% manual (reasoning instrumentation). That last 15% is where all the debugging value lives.

Context propagation across MCP boundaries deserves special attention. When your agent calls an MCP server running in a separate process, the trace context must survive the JSON-RPC boundary. Red Hat's approach: inject the W3C traceparent into the MCP request metadata, and extract it on the server side before creating the child span. It's the same pattern as HTTP header propagation, just over a different transport.
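A minimal sketch of that propagation. The traceparent string follows the W3C format; carrying it in the request's `_meta` field mirrors MCP's metadata convention, but the exact placement here is my assumption:

```python
import json
import os

def make_traceparent(trace_id: bytes, span_id: bytes, sampled: bool = True) -> str:
    """W3C Trace Context format: version-traceid-spanid-flags."""
    return f"00-{trace_id.hex()}-{span_id.hex()}-{'01' if sampled else '00'}"

def inject(request: dict, traceparent: str) -> dict:
    """Client side: stash the context in the JSON-RPC request metadata."""
    request.setdefault("params", {}).setdefault("_meta", {})["traceparent"] = traceparent
    return request

def extract(request: dict):
    """Server side: recover the context before creating the child span."""
    return request.get("params", {}).get("_meta", {}).get("traceparent")

rpc = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
       "params": {"name": "search_docs"}}
tp = make_traceparent(os.urandom(16), os.urandom(8))
wire = json.dumps(inject(rpc, tp))        # survives the stdio/SSE hop as JSON
assert extract(json.loads(wire)) == tp    # parent context for the server's span
```

Same idea as an HTTP `traceparent` header; the only thing that changed is the envelope it rides in.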


What Should a Unified Agent Dashboard Show?

Gravitee's work this week on AI observability for MCP tools points to what the dashboard of the future looks like. It's not just traces. It's traces plus cost plus reasoning quality in a single view.

Five panels, minimum:

Agent session timeline. Not a waterfall chart. A directed graph showing the actual flow of reasoning, including loops and backtracks. Each node is a step (LLM call, tool call, reasoning checkpoint). Color-coded by latency. Clickable to see the full prompt and response.

Token economics. Input tokens, output tokens, cache hits, cost per session. Broken down by model (because agents often use different models for different steps -- a cheap model for classification, an expensive one for synthesis). Gravitee shows this as a running cost ticker alongside the trace.

Tool reliability. Success rate, latency P50/P95/P99, and error classification for each tool the agent uses. When a tool starts returning errors, you want to see it before the agent's output quality degrades -- not after users report bad answers.

Reasoning depth distribution. A histogram of how many reasoning steps agents take per session. A sudden rightward shift means agents are struggling -- looping more, retrying more, working harder to produce answers. This is the leading indicator that something changed in your tools or data.
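A minimal shift detector for that panel, with an illustrative threshold and made-up session data:

```python
import statistics

def depth_shifted(baseline: list, current: list, ratio: float = 1.5) -> bool:
    """Flag a rightward shift: median reasoning depth grew by 50%+.
    The 1.5x threshold is an assumption; tune it to your own variance."""
    return statistics.median(current) >= ratio * statistics.median(baseline)

baseline = [3, 4, 4, 5, 3, 4, 5, 4]      # typical sessions
struggling = [6, 9, 7, 11, 8, 7, 10, 9]  # agents looping and retrying

print(depth_shifted(baseline, struggling))  # True
```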

SLO burn rate. Conf42's Signal-to-Context Framework maps golden signals (latency, error rate, throughput, saturation) to agent-specific SLOs. The burn rate panel tells you whether you're consuming your error budget faster than expected. For agents, the SLO isn't just "respond within 2 seconds." It's "produce a correct, grounded answer within the token budget 99.5% of the time."
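The burn-rate arithmetic itself is simple. The 99.5% target comes from the text above; the observed counts are invented:

```python
def burn_rate(bad: int, total: int, slo_target: float = 0.995) -> float:
    """How many times faster than budgeted you are consuming error budget.
    1.0 = exactly on budget; >1.0 = the budget exhausts before the window ends."""
    error_budget = 1.0 - slo_target          # 0.5% of answers may be bad
    observed_bad_ratio = bad / total
    return observed_bad_ratio / error_budget

# 30 ungrounded or over-budget answers out of 2,000 sessions:
print(round(burn_rate(30, 2_000), 6))  # 3.0 -> burning budget 3x too fast
```

The agent-specific part isn't the formula, it's the numerator: "bad" must count ungrounded and over-budget answers, not just 5xx responses.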


Auto vs. Manual Instrumentation: Where's the Line?

This is the trade-off that every team hits.

Auto-instrumentation is low effort and high coverage for the mechanical parts -- LLM calls, HTTP requests, database queries. You install the SDK, add three lines to your entrypoint, and you get spans. The problem: auto-instrumentation treats the agent like a black box. You see inputs and outputs. You don't see thinking.

Manual instrumentation is high effort and irreplaceable for the cognitive parts. Nobody except the developer who wrote the agent's reasoning loop knows where the critical decision points are. No library can automatically detect "this is where the agent decided to abandon strategy A and try strategy B."

The pragmatic approach: start with auto-instrumentation everywhere. Run it for two weeks. Look at the traces when debugging real incidents. Every time you find yourself saying "I can see what happened but I don't know why," that's where you add a manual span. Let production incidents guide your instrumentation investment.

Red Hat's hybrid auto+manual pattern formalizes this. Auto-instrumentation covers the infrastructure layer. Manual spans cover the cognitive layer. The two are connected through standard OTel parent-child span relationships.

One warning: don't over-instrument reasoning. I've seen teams add spans to every line of their agent's decision logic. The result is traces with 500+ spans per session that are harder to read than the code itself. Instrument decision boundaries, not decision internals.


Sampling vs. Full-Fidelity: Can You Afford to Keep Everything?

The standard observability answer is "sample aggressively, keep summaries." For agents, that answer is wrong.

Here's why. Agent failures are rare but high-impact. When an agent produces a hallucinated answer that a customer acts on, you need the full trace -- every prompt, every tool response, every reasoning step. If you sampled that trace away, you can't debug it. You can't even confirm it happened.

ClickHouse's argument: full-fidelity storage at $0.0005/GB/month makes the economics work. A typical agent session generates 10-50 KB of trace data. At 1 million sessions per day, that's 10-50 GB daily, or 300 GB-1.5 TB monthly. At their pricing, that's $0.15-$0.75/month for full-fidelity retention. The storage cost is a rounding error compared to the LLM inference cost.
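The arithmetic checks out (decimal gigabytes, 30-day month):

```python
def monthly_cost(sessions_per_day: float, kb_per_session: float,
                 price_per_gb_month: float = 0.0005) -> float:
    """Full-fidelity trace storage cost at the article's quoted price."""
    gb_per_day = sessions_per_day * kb_per_session * 1e3 / 1e9
    return gb_per_day * 30 * price_per_gb_month

low = monthly_cost(1e6, 10)   # 300 GB/month
high = monthly_cost(1e6, 50)  # 1.5 TB/month
print(round(low, 2), round(high, 2))  # 0.15 0.75
```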

But storage isn't the only cost. Query performance on full-fidelity data matters too. Column-oriented stores like ClickHouse handle this well because agent traces are highly compressible -- lots of repeated model names, tool names, and boilerplate prompt text. Compression ratios of 10-20x are common.

Discord's fanout sampling is the middle ground for systems that genuinely can't store everything. Sample 100% of novel traces (new tools, new agent versions, error cases). Sample proportionally for repetitive fanout. Never sample below a floor that guarantees statistical significance for anomaly detection.

The bottom line: if your agent traces cost less than 1% of your inference bill to store, keep them all. You'll thank yourself during the next postmortem.


What Should You Actually Do This Quarter?

Add gen_ai.* semantic conventions to your OTel configuration. Even if you're not ready for full agent observability, start collecting model name, token counts, and tool call metadata on every LLM interaction. The data is cheap to store and invaluable when you need it.

Extend your trace retention to 90 days for agent workloads. The 7-14 day default is designed for stateless request-response services. Agents accumulate behavioral drift over weeks. If your observability vendor can't do 90 days affordably, that's a signal to evaluate alternatives.

Instrument reasoning boundaries, not reasoning internals. Add manual spans at the five to ten decision points in your agent's logic -- tool selection, strategy switches, confidence thresholds, delegation to sub-agents. Skip the internal chain-of-thought details unless you're debugging a specific failure.

Adopt tail-based sampling with agent-aware criteria. Sample based on reasoning depth, tool retry count, and output confidence -- not just latency and error status. Keep 100% of traces where the agent exceeded its reasoning budget or produced low-confidence outputs.

Treat token cost as a first-class observability signal. A cost spike is often the earliest indicator of an agent behavior change. If your agent suddenly consumes 3x more tokens per session, something changed in its reasoning pattern, its tool responses, or its prompt. Surface this in your dashboards alongside latency and errors.
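The simplest version of that check, with invented numbers and the 3x threshold from above:

```python
def cost_spike(baseline_tokens_per_session: float,
               current_tokens_per_session: float,
               threshold: float = 3.0) -> bool:
    """Alert when tokens per session jump well above a trailing baseline.
    The 3x threshold is illustrative; a real alert would use a rolling
    baseline and some smoothing to avoid flapping."""
    return current_tokens_per_session >= threshold * baseline_tokens_per_session

print(cost_spike(4_000, 13_500))  # True: ~3.4x the baseline
```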


Deep Dive Resources

  • Red Hat: Distributed Tracing for Agentic Workflows with OpenTelemetry -- W3C context propagation across MCP servers, decorator-pattern instrumentation, hybrid auto+manual for agents. redhat.com
  • InfoQ: Discord's Envelope Pattern -- Elixir actor-model tracing, fanout-aware sampling at billion-message scale. infoq.com
  • Gravitee: AI Observability for MCP Tools -- Unified dashboards for agent traffic, LLM costs, and tool reliability. gravitee.io
  • Uptrace: OpenTelemetry gen_ai Semantic Conventions -- Auto-instrumentation for OpenAI, Anthropic, and LangChain with agent trace hierarchies. uptrace.dev
  • ClickHouse: Three Villains of Observability -- Retention, sampling, and rollup anti-patterns with cost analysis for full-fidelity storage. clickhouse.com
  • Grafana: Observability Survey 2026 -- 92% find AI valuable for anomaly detection, 14% observe LLM workloads. grafana.com
  • Conf42 SRE: Signal-to-Context Framework -- SLO-focused golden signals and agentic auto-remediation strategies. conf42.com

Sources

  1. Red Hat, "Distributed Tracing for Agentic Workflows with OpenTelemetry," April 6, 2026
  2. InfoQ / Discord Engineering, "The Envelope Pattern: Distributed Tracing in Elixir at Scale," March 28, 2026
  3. Gravitee, "AI Observability: Monitoring MCP Tools and Agent Traffic," April 10, 2026
  4. Uptrace, "OpenTelemetry for LLMs and AI Agents," 2026
  5. ClickHouse, "The Three Villains of Observability Data," April 8, 2026
  6. Grafana Labs, "State of Observability 2026 Survey," 2026
  7. Conf42 SRE, "Signal-to-Context: Observability for Agentic Systems," 2026
