- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
It is 02:14 on a Tuesday and the page wakes somebody up.
A customer on the enterprise tier has emailed support to say the product's new AI assistant is "confidently making things up." The ticket is tagged P1 because the customer's logo is on the landing page. The on-call opens the laptop.
They do what they were trained to do. They open the dashboard.
p99 latency on /v1/assistant/chat is 1.8 seconds. Flat green for the last six hours. Error rate is 0.00%. No 5xx. No 4xx worth looking at. The Datadog APM flame graph for the offending trace shows the outbound call to api.anthropic.com took 1.42 seconds and returned 200 OK. The vector DB span is green. The Redis span is green. Every span in the trace is green.
They stare at the trace. Everything the tooling was built to measure is telling them the system is healthy. The customer is telling them the system is broken. Both statements are true.
That is the failure mode this entire book is about.
The gap in one sentence
The RED method (rate, errors, duration) and the USE method (utilization, saturation, errors) were built for a world where the correct behavior of a service was a question about its transport layer. A 200 meant the thing worked. A 500 meant it didn't. That world had the decency to make failure loud.
LLM applications have no such decency. They fail quietly, inside the 200.
Your APM is not wrong. It is answering a different question than the one your users are asking. Until you understand which questions it cannot answer, you will keep waking up at 02:14 and finding a green dashboard.
Four failure modes that return 200
The book walks through eight. Here are four that have put real engineers on real pages in the last twelve months.
1. Silent provider-side model drift
In August and September 2025, users of Claude began reporting that the model had gotten worse. Diffuse reports. Worse at coding. Occasionally answering English prompts with Thai or Chinese characters. Sometimes refusing tasks it had cheerfully done a week earlier.
When Anthropic published the postmortem, it described three overlapping infrastructure bugs. A context-window routing error on Sonnet 4. A TPU misconfiguration on Sonnet 4, Opus 4, and Opus 4.1. An XLA:TPU approximate top-k miscompile on Haiku 3.5. None of the three triggered a server-side error. None changed the shape of the API response. All three changed the content.
Simon Willison's note on it is the one to remember: "The evaluations Anthropic ran simply didn't capture the degradation users were reporting." A world-class eval team missed what users were catching by eye, because their instrument asked "does the model score well on our benchmarks?" when it should have asked "does this week's behavior match last week's on real user queries?"
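That second question is checkable. A minimal sketch, assuming you replay a fixed set of real user queries weekly and diff the answers; the Jaccard token overlap here is a deliberately crude stand-in for an embedding-similarity or LLM-as-judge scorer, and the queries and answers are hypothetical:

```python
def overlap_score(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets: 1.0 means identical vocabulary."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def drift_report(baseline: dict[str, str], current: dict[str, str],
                 threshold: float = 0.5) -> list[str]:
    """Return the query ids whose answer this week diverged from last week's."""
    return [q for q in baseline
            if q in current and overlap_score(baseline[q], current[q]) < threshold]

last_week = {"q1": "The refund window is 30 days from delivery."}
this_week = {"q1": "ขอบคุณสำหรับคำถามของคุณ"}  # model suddenly answers in Thai

print(drift_report(last_week, this_week))  # ['q1']
```

Nothing in that loop cares about status codes. It only cares whether behavior moved.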
2. Hallucinations
A RAG assistant ships into a product. A user asks a question. The assistant returns a factual-sounding answer with a citation. The citation points to a document that does not exist, or a document that says the opposite.
The APM view: HTTP 200, 1.4 seconds, 312 output tokens, a few cents of output cost. The response is a well-formed string. The string happens to describe a universe that does not exist.
Worse: hallucination is not uniform across your traffic. It concentrates on the slices where the model has the least grounding: rare entities, new products, edge-case phrasings. A 1% hallucination rate across all traffic can hide a 40% rate on the slice that asks about customers added in the last seven days. Your aggregate dashboard cannot isolate that slice. Your users can.
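The arithmetic is worth seeing once. A toy illustration with made-up numbers, nothing here is from a real system:

```python
from collections import defaultdict

def rates_by_slice(records):
    """records: (slice_name, hallucinated) pairs.
    Returns per-slice hallucination rates and the aggregate rate."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [hallucinated, total]
    for slice_name, bad in records:
        counts[slice_name][0] += bad
        counts[slice_name][1] += 1
    per_slice = {s: h / t for s, (h, t) in counts.items()}
    total_bad = sum(h for h, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return per_slice, total_bad / total

# 990 queries about long-standing customers, 10 about customers added last week
records = ([("established", False)] * 985 + [("established", True)] * 5
           + [("new_customers", True)] * 4 + [("new_customers", False)] * 6)

per_slice, aggregate = rates_by_slice(records)
print(round(aggregate, 3))                    # 0.009 -> looks fine on a dashboard
print(round(per_slice["new_customers"], 1))   # 0.4  -> 40% on the slice users hit
```

A 0.9% aggregate and a 40% slice coexist in the same thousand requests. The dashboard shows the first number.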
3. Retrieval drift in RAG
You ship a RAG pipeline in March. It works. In June, support tickets start drifting up. Not a spike nobody could miss; a slow climb that takes weeks to register. The vector DB dashboards are clean. You cannot find anything wrong.
The retrieval is wrong anyway. The embedding provider pushed a new model version while your index still holds vectors from the old one, and the two spaces no longer align. Or your knowledge base has accumulated six months of new documents and the chunking strategy that worked for the original corpus no longer works. Or the user-query distribution has drifted.
None of these show up as errors. You need context-relevance and faithfulness evals running on a rolling window of real traffic. That is not something your current stack has.
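A minimal sketch of the rolling-window half of that. Keyword overlap stands in for a real context-relevance scorer (embedding similarity or LLM-as-judge in practice); the class name, window size, and sample data are all illustrative:

```python
from collections import deque

def context_relevance(query: str, contexts: list[str]) -> float:
    """Fraction of retrieved chunks sharing any term with the query.
    A crude stand-in for an embedding or LLM-as-judge scorer."""
    terms = set(query.lower().split())
    if not contexts:
        return 0.0
    hits = sum(1 for c in contexts if terms & set(c.lower().split()))
    return hits / len(contexts)

class RollingRelevance:
    """Mean context-relevance over the last `window` sampled requests."""
    def __init__(self, window: int = 500):
        self.scores = deque(maxlen=window)

    def observe(self, query: str, contexts: list[str]) -> None:
        self.scores.append(context_relevance(query, contexts))

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

monitor = RollingRelevance(window=3)
monitor.observe("refund policy", ["Our refund policy covers 30 days.",
                                  "Shipping takes 5 days."])
print(round(monitor.mean(), 2))  # 0.5
```

Alert when the rolling mean drops below a baseline you recorded at ship time. That is the signal the June ticket climb was trying to send you in March's units.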
4. Tool-call misfires in agents
An agent has six tools. A customer asks to cancel an order. The agent calls search_orders, finds the order, then calls email_customer with a polite explanation of the cancellation policy. It never calls refund_order. The customer is not refunded. The agent returns a confident message saying the refund is in progress.
Every span in the trace returned 200. Each tool call succeeded. The agent just picked the wrong next action.
The unit of failure is the relationship between spans. No single span shows it. A correct trace and an incorrect trace can look structurally identical. Same number of spans. Same tool names. Same token counts. The only difference is whether the agent picked the tool a human reviewer would have picked. APM has no vocabulary for that.
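You can give yourself a small piece of that vocabulary by asserting on the relationship between spans rather than on the spans. A minimal sketch that checks one policy over the ordered tool names in a trace; the rule itself is a hypothetical example built on the scenario above:

```python
def violates_refund_policy(tool_calls: list[str]) -> bool:
    """Flag a trace that confirms a cancellation to the customer
    without refund_order having run first. Each call may be a 200;
    the failure lives in the ordering."""
    if "email_customer" not in tool_calls:
        return False
    if "refund_order" not in tool_calls:
        return True
    return tool_calls.index("refund_order") > tool_calls.index("email_customer")

good = ["search_orders", "refund_order", "email_customer"]
bad = ["search_orders", "email_customer"]  # every span green, customer never refunded

print(violates_refund_policy(good))  # False
print(violates_refund_policy(bad))   # True
```

Checks like this run fine as online evals over sampled traces: cheap, deterministic, and aimed at exactly the thing no single span can show.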
Its worse cousin is the $47,000 agent loop: four LangChain agents that looped for 11 days on a misclassified retry, on a legitimate production system, before anyone noticed. Each iteration was a green span.
The scale of the problem
PwC's 2026 AI Agent Survey reports 79% of organizations have adopted AI agents in some form, and most cannot trace failures through multi-step workflows or measure quality systematically.
Four-fifths of the industry is running this stuff in production. Most of them have no instrument for knowing whether it works.
Braintrust put the gap crisply in their 2026 piece on the three pillars of AI observability: "APM tools treat AI like any other service — they capture latency, error rates, and token counts, but don't evaluate whether the model's response was faithful, relevant, or safe."
Faithful. Relevant. Safe. None of those are HTTP status codes.
What this book is
Observability for LLM Applications: Tracing, Evals, and Shipping AI You Can Trust. 18 chapters. ~80,000 words. Book 1 of The AI Engineer's Library.
Five parts:
- Why LLMs break silently in production. The thesis above, expanded. Non-determinism, drift, the cost of being wrong.
- Tracing LLM applications. OpenTelemetry GenAI semantic conventions (the vendor-neutral standard you should already be using). Your first instrumented LLM call. Tracing agents, tools, multi-step workflows. Tracing RAG: retrievals, vector search, reranking.
- Evaluating LLM applications. What an eval actually measures. Offline evals, test sets, regressions, CI gates. Online evals, sampling live traffic. LLM-as-judge, code-based evals, human-in-the-loop.
- The tooling landscape. Langfuse self-hosted. LangSmith and Arize Phoenix compared. Braintrust, DeepEval, Helicone. Building your own stack with OTel Collector + ClickHouse + Grafana.
- Shipping and operating. Cost tracking and token accounting. Alerting, drift detection, regression handling. Incident response and the production readiness checklist.
The reader is a working backend or platform engineer who ships production systems, suddenly has an LLM-powered feature in prod, knows OpenTelemetry, and does not want a theory book about transformers. The tone is Cormac McCarthy meets runbook: direct, second-person, no hedging, dry humor when it lands. Every tool version, every CLI invocation, every API endpoint is anchored to April 2026 and verified against live documentation.
Paperback and hardcover are live on Amazon today. The ebook drops on April 22.
If this was useful
If you're running LLM features in production and your dashboard has been suspiciously green, the failure is probably in the 200s. The book teaches you to build the instruments that see into them.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — if you ship with Claude Code and other AI tools, this is what I'm building for you. GitHub here.
- Me: xgabriel.com · github.com/gabrielanhaia.