Why AI Agents Fail Silently and How to Build an Observability Monitor


A normal service fails loudly. The process crashes, the health check turns red, and your pager goes off. An LLM-powered agent fails differently. It returns a 200, exits with code 0, and hands you a confident answer that happens to be wrong. Nothing in your existing monitoring stack reacts, because by every metric it watches, nothing broke.

That gap is the problem. Uptime checks, error-rate dashboards, and latency alerts all watch the transport layer. An agent can keep that layer green while quietly producing garbage, burning your API budget, or looping for thirty steps where it used to take four. We ran a handful of agent workloads behind standard HTTP monitoring and watched the dashboard stay green through failures a human reviewer caught in seconds.

Four ways an agent fails without telling you

Hallucinated output. The agent invents an API parameter, a function name, or a citation. The response is still well-formed text or valid JSON, so a schema check passes it. The mistake only surfaces downstream — a failed deploy, a wrong number in a report, a support ticket.

Rate-limit degradation. When a provider returns a 429, a naive retry layer either hammers the API in a retry storm or silently falls back to a smaller, cheaper model. The agent keeps running, but output quality drops, and unless you log which model actually answered, nothing records that the run was degraded.

Cost overruns. A retry loop, a runaway tool call, or a prompt injection can multiply token usage. There is no exception thrown for "this run cost $4.10 instead of $0.03." You find out on the monthly invoice.

Truncated responses. The model hits its output token ceiling and stops mid-sentence. The API tells you this — OpenAI returns finish_reason: "length", Anthropic returns stop_reason: "max_tokens" — but only if you read that field. Most agent code reads the content and ignores the stop reason entirely.
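
Reading that field takes only a few lines. Here is a minimal sketch, assuming response objects shaped like Anthropic messages (a top-level stop_reason) or OpenAI chat completions (finish_reason on each entry of choices). The helper name isTruncated is made up for illustration.

```javascript
// Hypothetical helper: returns true when the provider reports that the
// model stopped because it hit its output token limit.
function isTruncated(response) {
  // Anthropic-style response: top-level stop_reason.
  if (response.stop_reason !== undefined) {
    return response.stop_reason === "max_tokens";
  }
  // OpenAI-style chat completion: finish_reason on each choice.
  if (Array.isArray(response.choices)) {
    return response.choices.some((c) => c.finish_reason === "length");
  }
  // Unknown shape: this sketch reports "not truncated", but a stricter
  // monitor might prefer to flag responses it cannot classify.
  return false;
}
```

Calling this right after every model call, before the agent acts on the content, is usually enough to stop a half-answer from flowing downstream.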

Retries amplify every other failure. A retry-on-error loop wrapped around a degraded model can turn one bad run into dozens of paid calls before anyone notices. Cap retries and count them — an unbounded retry is a cost incident waiting to happen.
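
One way to cap and count retries is sketched below. The wrapper name withRetryCap, its option names, and the default values are all assumptions for illustration, not a recommendation for specific limits.

```javascript
// Hypothetical wrapper: retries a failing async call a bounded number of
// times and reports how many attempts were made.
async function withRetryCap(fn, { maxAttempts = 3, baseDelayMs = 200 } = {}) {
  let attempts = 0;
  let lastError;
  while (attempts < maxAttempts) {
    attempts += 1;
    try {
      const result = await fn(attempts);
      // Return the count with the result so it can be logged per run.
      return { result, attempts };
    } catch (err) {
      lastError = err;
      if (attempts < maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
        const delay = baseDelayMs * 2 ** (attempts - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Give up loudly, and carry the attempt count and the last error along.
  throw Object.assign(new Error(`gave up after ${attempts} attempts`), {
    attempts,
    cause: lastError,
  });
}
```

Because the attempt count comes back on both the success and the failure path, a run that succeeded on its fifth try is distinguishable from one that succeeded on its first.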

What a monitor actually needs to watch

Because the transport layer stays green, a useful monitor has to watch one layer up: the semantics of what the model returned. Four signal categories cover most silent failures.

Cost. Track input and output tokens per call, per run, and cumulatively. A per-run token budget turns an invisible overrun into an alert.
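
A per-run budget can be as small as a counter with a limit. The sketch below assumes each call reports tokensIn and tokensOut. The class name RunBudget and the limit you pass in are placeholders to adapt.

```javascript
// Hypothetical per-run budget tracker. maxTokens is a limit you choose.
class RunBudget {
  constructor(maxTokens) {
    this.maxTokens = maxTokens;
    this.tokensIn = 0;
    this.tokensOut = 0;
    this.calls = 0;
  }

  // Record one call's usage. Returns true while the run is within budget.
  record({ tokensIn, tokensOut }) {
    this.tokensIn += tokensIn;
    this.tokensOut += tokensOut;
    this.calls += 1;
    return !this.exceeded();
  }

  total() {
    return this.tokensIn + this.tokensOut;
  }

  exceeded() {
    return this.total() > this.maxTokens;
  }
}
```

An agent loop can check the return value of record after each call and stop, or at least raise an alert, the moment it turns false.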

Shape. Does the output parse? Does it match the schema the agent expects? Did the stop reason come back clean, or was it length / max_tokens? These are cheap, deterministic checks that need no model to evaluate.
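
As a rough illustration of a deterministic shape check, the sketch below parses the output as JSON and compares each expected field against a typeof string. The function name checkShape and the field-to-type map are assumptions. A real project would more likely reach for a full schema validator.

```javascript
// Hypothetical shape check: parse the model's text as JSON and verify that
// each expected field is present with the expected typeof.
function checkShape(text, expectedFields) {
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch (err) {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }
  const errors = [];
  for (const [field, type] of Object.entries(expectedFields)) {
    if (!(field in parsed)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof parsed[field] !== type) {
      errors.push(`field ${field} should be ${type}, got ${typeof parsed[field]}`);
    }
  }
  return { ok: errors.length === 0, errors, value: parsed };
}
```

For example, checkShape(output, { action: "string", replicas: "number" }) rejects a response where replicas came back as the string "3". It will not catch a well-formed answer that is simply wrong, which is why shape is only one of the four categories.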

Behavior. Track tool-call success rate, retry count, fallback-model usage, and step count. An agent that suddenly takes thirty steps to finish a task it used to do in four is looping, even if it eventually returns something.
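
These behavior signals are plain counters. A sketch, assuming the agent loop can call a method at each step, tool call, retry, and fallback. The class RunBehavior and the default factor of 3 are illustrative choices, not measured values.

```javascript
// Hypothetical per-run counters for the behavior signals.
class RunBehavior {
  constructor() {
    this.steps = 0;
    this.toolCalls = 0;
    this.toolFailures = 0;
    this.retries = 0;
    this.fallbacks = 0;
  }

  step() { this.steps += 1; }
  retry() { this.retries += 1; }
  fallback() { this.fallbacks += 1; }

  toolCall(succeeded) {
    this.toolCalls += 1;
    if (!succeeded) this.toolFailures += 1;
  }

  // Fraction of tool calls that succeeded. Defined as 1 when none were made.
  toolSuccessRate() {
    if (this.toolCalls === 0) return 1;
    return (this.toolCalls - this.toolFailures) / this.toolCalls;
  }

  // Flags a run whose step count is far above the typical count for the task.
  looksLikeLoop(typicalSteps, factor = 3) {
    return this.steps > typicalSteps * factor;
  }
}
```

With a typical count of four steps and the default factor, a thirty-step run is flagged even if it eventually returns an answer.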

Drift. Track response length, refusal rate, and latency against a rolling baseline rather than a fixed threshold. This is the category that catches failures you did not predict. You cannot define in advance what a degraded output looks like, but you can detect that it does not look like last week's.

Drift detection is the part teams skip and the part that pays off. Fixed thresholds only catch the failure modes you already imagined. A baseline catches the ones you didn't.
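
A rolling baseline does not need a statistics library. The sketch below keeps a window of recent values, such as output length or latency, and flags a new value that deviates from the window's median by more than a relative tolerance. The class name RollingBaseline, the window size, the minimum sample count, and the reading of the tolerance as a relative deviation are all assumptions to tune.

```javascript
// Hypothetical rolling baseline for one metric (output length, latency, ...).
class RollingBaseline {
  constructor(windowSize = 100, minSamples = 10) {
    this.windowSize = windowSize;
    this.minSamples = minSamples;
    this.values = [];
  }

  // Median of the current window, or null when the window is empty.
  median() {
    if (this.values.length === 0) return null;
    const sorted = [...this.values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  // Compare a new value with the baseline, then add it to the window.
  // tolerance is a relative deviation: 0.4 means 40% away from the median.
  observe(value, tolerance = 0.4) {
    const baseline = this.median();
    const enoughData = this.values.length >= this.minSamples;
    const drifted =
      enoughData &&
      baseline > 0 &&
      Math.abs(value - baseline) / baseline > tolerance;
    this.values.push(value);
    if (this.values.length > this.windowSize) this.values.shift();
    return { baseline, drifted };
  }
}
```

This version adds every value to the window, including the ones it flags, so a sustained shift eventually becomes the new baseline. Whether that is desirable depends on the workload. Excluding flagged values keeps the baseline stricter but can leave an alert firing indefinitely.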

Building a minimal monitor

You don't need a new platform. Start with a wrapper around the LLM call itself:

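import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment.
const client = new Anthropic();

// Placeholder sink: swap in your log pipeline or a database writer.
const emit = (event) => console.log(JSON.stringify(event));

// Wraps one model call and emits a structured event describing it.
async function tracedCall(params) {
  const start = Date.now();
  const res = await client.messages.create(params);
  emit({
    model: params.model, // the model that was requested
    modelUsed: res.model, // the model that actually answered
    tokensIn: res.usage.input_tokens,
    tokensOut: res.usage.output_tokens,
    stopReason: res.stop_reason,
    latencyMs: Date.now() - start,
  });
  return res;
}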

Every call now emits a structured event. From there, the monitor is a set of small, boring rules:

Assert on the stop reason. If it is max_tokens, the response is truncated — flag the run instead of acting on a half-answer.
Validate the parsed output against a schema before the agent acts on it, not after.
Sum tokens per run against a budget. A reasonable starting alert is anything above three times your median run cost — tighten it once you have real data.
Store the events somewhere queryable: a Postgres table, your existing log pipeline, whatever you already operate.
Compute a rolling median of output length and alert when a run falls well below it. A forty percent drop is a sane place to begin, not a measured constant.
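
To show how little code these rules take, here is a sketch that applies two of them to the events of a single run. It assumes events carry the tokensIn, tokensOut, and stopReason fields emitted by the wrapper, and it uses total tokens as a stand-in for cost. The function names and the history array of earlier run totals are hypothetical.

```javascript
// Median of a list of numbers, or null for an empty list.
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical rule runner.
// events:  the structured events emitted during one run.
// history: total token counts of earlier runs, used as the cost baseline.
function evaluateRun(events, history, { budgetFactor = 3 } = {}) {
  const alerts = [];
  const totalTokens = events.reduce(
    (sum, e) => sum + e.tokensIn + e.tokensOut,
    0
  );
  // Rule: any truncated call taints the whole run.
  if (events.some((e) => e.stopReason === "max_tokens")) {
    alerts.push("truncated");
  }
  // Rule: total tokens above budgetFactor times the median of past runs.
  const typical = median(history);
  if (typical !== null && totalTokens > typical * budgetFactor) {
    alerts.push("over_budget");
  }
  return { totalTokens, alerts };
}
```

The function is pure: it takes data in and returns a verdict, so it can run inside the agent process, in a cron job over stored events, or in a test.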

None of those rules need a model to evaluate them, so the monitor adds no model calls or token cost per run and cannot hallucinate. The wrappers, schema validators, and alerting glue are mostly boilerplate, the kind of code an AI editor writes quickly while you focus on which signals matter for your agent.

A monitor like this won't make your agent smarter. It will make its failures visible on the same day they happen instead of the day a user complains — which, for anything running unattended, is the difference between a quick fix and a quiet outage.

Originally published at pickuma.com.