Teams that ship AI-powered features often discover, six months in, that their observability stack was designed for a different kind of software. Traditional monitoring tells you when a service returns a 500, when latency spikes, when a queue backs up. These signals are still necessary for AI workloads. They are no longer sufficient.
The distinctive failure modes of production AI systems — silent regressions, confident wrong answers, cost blowouts from a single looping agent, drift introduced by an upstream model update — all happen inside the boundary where traditional monitoring stops looking. A system can be 99.9 percent available and still be wrong half the time. That is a category of failure most monitoring stacks were never designed to detect.
This post is a practical framework for the observability layer you actually need around language models and agents in production.
The three signal categories
AI observability divides naturally into three categories. Treating them as one stack is how teams end up with dashboards that look thorough and miss every real incident.
Operational signals are the ones your existing monitoring already handles: request volume, error rates, latency percentiles, token throughput, cost per request. These tell you whether the system is running. They do not tell you whether it is working.
Quality signals measure whether the answers the system produces are correct or useful for the task. This is the category most teams skip or implement weakly because it requires thinking clearly about what “correct” means for each workload.
Behavioral signals capture how the agent or model is making decisions over time. Are tool calls succeeding? Are multi-step reasoning chains getting longer or shorter? Is the model increasingly routing through expensive paths? These are the signals that detect drift before it becomes a quality problem.
Building the quality signal
The hardest part of AI observability is measuring quality in production without human reviewers in the loop for every request. Three patterns tend to work.
Golden set regression. Maintain a curated set of representative inputs with known good outputs. Run the production system against the golden set on a schedule — daily is usually enough — and alert on regression. This catches the case where an upstream model update or a prompt change silently degrades quality on inputs your team cares about.
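A golden-set check can be as small as a scheduled script. Here is a minimal sketch in Python; the `run_model` callable, the example shape, and the exact-match pass criterion are stand-ins for your own model client and a task-specific scorer:

```python
def golden_set_pass_rate(examples, run_model):
    """Run each golden example through the model and return the pass rate.

    `examples` is a list of {"input": ..., "expected": ...} dicts;
    `run_model` is whatever callable wraps your production model.
    Exact match is the simplest criterion; most real workloads need
    a task-specific comparison here instead.
    """
    passed = sum(
        1 for ex in examples
        if run_model(ex["input"]).strip() == ex["expected"].strip()
    )
    return passed / len(examples)

def regressed(today_rate, baseline_rate, tolerance=0.02):
    """Alert when today's pass rate drops more than `tolerance` below baseline."""
    return baseline_rate - today_rate > tolerance
```

Run it from whatever scheduler you already have, and alert on `regressed` rather than on an absolute threshold, so a gradual decline against last week's baseline still fires.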
Downstream signals. If the AI system feeds a workflow that eventually produces an observable outcome — a ticket gets resolved, a document gets approved, a recommendation gets accepted — track the outcome rate over time, segmented by whether the AI path was involved. A ten percent drop in resolution rate on tickets handled by the agent, while the non-agent path is stable, is a quality regression even if every operational signal is green.
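The segmentation above reduces to a two-bucket comparison. A sketch, assuming each completed ticket is logged as an `(ai_involved, resolved)` pair; both the event shape and the ten percent threshold are illustrative:

```python
def outcome_rates(events):
    """Compute resolution rates for the AI and non-AI paths.

    `events` is an iterable of (ai_involved: bool, resolved: bool) pairs.
    Returns (ai_rate, non_ai_rate); either may be None if its bucket is empty.
    """
    counts = {True: [0, 0], False: [0, 0]}  # bucket -> [resolved, total]
    for ai_involved, resolved in events:
        counts[ai_involved][1] += 1
        counts[ai_involved][0] += int(resolved)

    def rate(bucket):
        resolved, total = counts[bucket]
        return resolved / total if total else None

    return rate(True), rate(False)

def quality_regression(ai_rate, non_ai_rate, drop=0.10):
    """Flag when the AI path underperforms the control path by more than `drop`."""
    if ai_rate is None or non_ai_rate is None:
        return False
    return non_ai_rate - ai_rate > drop
```

The non-AI path acts as the control: a drop that hits both paths equally is a process change, not an AI regression.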
Model-as-judge scoring. For a sample of production traffic, have a separate model score the primary model’s output against a rubric. This is less reliable than the first two patterns and is prone to its own drift, but it scales to workloads where ground-truth labels are expensive. Use it to detect large regressions, not small ones.
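A sampled judge pipeline is mostly plumbing. In this sketch, `judge(input, output)` is an assumed interface returning a score in [0, 1]; in practice it wraps a call to a separate model with your rubric in the prompt:

```python
import random

def sampled_judge_score(requests, judge, sample_rate=0.05, seed=None):
    """Score a random sample of production traffic with a judge model.

    `requests` is an iterable of {"input": ..., "output": ...} dicts;
    `judge(input, output)` is an assumed callable returning a 0-1 score.
    Returns the mean score over the sample, or None if nothing was sampled.
    """
    rng = random.Random(seed)
    scores = [
        judge(r["input"], r["output"])
        for r in requests
        if rng.random() < sample_rate
    ]
    return sum(scores) / len(scores) if scores else None
```

Because judge scores are noisy, trend the mean over days and alert only on large movements, which matches the advice above to use this pattern for coarse detection.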
Tracing an agent is not tracing a request
A request through a traditional microservice stack produces a trace with tens of spans. A request through an agent can produce a trace with hundreds, spread over minutes. The planning loop fires repeatedly, each tool call is its own sub-request, memory is read and written, and failures at any step can trigger a fallback path that looks nothing like the original request.
Two practical consequences. First, the shape of a useful agent trace is hierarchical rather than flat — you want to see the planning decisions as parent spans, with tool calls and model invocations nested beneath them. Flat tracing views turn an agent trace into a wall of text that nobody reads. Second, sampling strategy matters differently. Head-based sampling (one-in-N requests traced in full) loses the signal on the rare, expensive agent runs — which are precisely the ones worth investigating. Tail-based sampling, where the decision to keep a trace is made after the run completes based on its characteristics, is more appropriate for agent workloads.
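A tail-based keep/drop decision can be a single function evaluated when the run finishes. The field names and thresholds below are illustrative, not a standard schema:

```python
import random

def keep_trace(trace, max_cost_usd=1.0, max_tool_calls=20,
               max_seconds=60, base_rate=0.01, rng=None):
    """Decide, after the run completes, whether to retain the full trace.

    Keep every errored run and every outlier (cost, tool calls, or
    duration over threshold); keep a small random baseline of the rest
    so normal behavior stays observable too.
    """
    rng = rng or random.Random()
    if trace.get("error"):
        return True
    if (trace.get("cost_usd", 0) > max_cost_usd
            or trace.get("tool_calls", 0) > max_tool_calls
            or trace.get("duration_s", 0) > max_seconds):
        return True
    return rng.random() < base_rate
```

This inverts head-based sampling: instead of deciding up front and hoping the interesting run lands in the sample, you buffer spans until the run ends and keep exactly the runs worth investigating.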
Cost is a first-class signal
A bug in a traditional service might leak memory or retry a failing request too many times. The cost of those bugs is real but bounded. A bug in an agent — a planning loop that decides to gather more context repeatedly, a tool call that returns noise and triggers more tool calls — can burn through months of budget in hours. We have seen real incidents where a single runaway agent consumed a six-figure budget over a weekend.
This is why cost per request, not just aggregate cost, has to be an alerting signal for AI workloads. Alert on requests that exceed a threshold of tokens consumed, tool calls made, or wall-clock duration. These are cheap alarms to set up and they catch a category of incident that traditional APM will not.
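A per-request guard along these lines is a few lines of code. The limit values here are placeholders; set them from your own cost-per-request distribution:

```python
# Illustrative per-request limits; tune these to your own workload.
LIMITS = {"tokens": 50_000, "tool_calls": 25, "wall_clock_s": 120}

def cost_alarms(request_stats, limits=LIMITS):
    """Return the names of every limit this single request exceeded.

    `request_stats` maps signal names to observed values, e.g.
    {"tokens": 80_000, "tool_calls": 3}. An empty result means no alarm.
    """
    return [name for name, limit in limits.items()
            if request_stats.get(name, 0) > limit]
```

Emitting the alarm per request, rather than waiting for aggregate spend to move, is what catches the runaway-agent weekend before Monday.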
The drift problem
Model providers update their models. Sometimes they announce this. Sometimes they do not. Sometimes the announcement is in a changelog nobody reads, and the first signal your team has that the underlying model has shifted is when production behavior changes in a way that no diff in your own codebase can explain.
The defense is layered. Pin model versions explicitly where the provider supports it. Monitor behavioral signals — average response length, tool-use rates, refusal rates — over time, so that a shift is visible before it becomes a quality problem. Keep the golden-set regression running so that when a shift does happen, you can quantify its impact on workloads that matter.
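A simple way to make a behavioral shift visible is to compare the current window of a signal, say response length in tokens, against a baseline window, measured in units of the baseline's standard deviation. A sketch; the three-sigma threshold is an illustrative default, not a recommendation:

```python
from statistics import mean, pstdev

def drift_score(baseline, current):
    """Standardized shift of a behavioral signal between two windows.

    `baseline` and `current` are sequences of observed values
    (e.g. daily mean response lengths). Returns the absolute shift
    of the current mean in baseline standard deviations.
    """
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return 0.0 if mean(current) == mu else float("inf")
    return abs(mean(current) - mu) / sigma

def drifted(baseline, current, z=3.0):
    """Flag when the signal has moved more than `z` sigmas from baseline."""
    return drift_score(baseline, current) > z
```

Run the same check across each behavioral signal (response length, tool-use rate, refusal rate); a shift in several at once is the signature of an upstream model change.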
The providers are getting better about this, but the responsibility for noticing drift remains with the consumer. A team that does not instrument for drift will learn about it from a customer complaint.
What a useful AI observability dashboard looks like
Three panels tend to be the minimum for a dashboard that a team will actually look at:
Today’s quality: golden-set pass rate, downstream outcome rate, sampled model-as-judge score — each trended against last week and last month.
Today’s cost: total spend, spend per feature, distribution of cost per request, a count of requests that exceeded the cost alarm threshold.
Today’s behavior: average tool calls per request, distribution of response length, refusal rate, top ten slowest or most expensive runs linked to their traces.
The operational dashboard stays where it is. These three panels go next to it. When someone says “is the AI feature healthy?”, all three have to be green.
Where to start
If your AI workloads are in production and your observability answer is “we have the same dashboards we had before AI,” there is a gap. The highest-leverage first investment is a golden-set regression, because it catches the broadest class of silent failures and it keeps working even when the team that set it up has moved on. Cost-per-request alerting is a close second. Everything else is refinement on top of those two.
Visibility into AI systems is not optional infrastructure. It is the difference between a feature you can trust in production and one you are hoping is still working.