- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Every failure mode in this post has three things in common.
- It returns HTTP 200.
- Your APM does not see it.
- It has put real engineers on real pages in the last twelve months.
Each section covers one failure class, one concrete example from public postmortems or 2026 literature, and the single instrument you need to catch it.
1. Silent provider-side model drift
The failure. The model ID you call and the artifact your request is run against are not the same thing. The vendor can change the mapping without a version bump, for a subset of traffic, for a subset of request shapes.
The public example. The Anthropic three-bug cascade from August–September 2025. A context-window routing error, a TPU misconfiguration, an XLA:TPU miscompile. None triggered a server-side error. Users noticed within hours; Anthropic confirmed weeks later.
The instrument. A canary eval: 10–20 fixed prompts, run hourly against every production model ID, scored by an LLM-judge from a different provider to avoid self-preference bias. Alert on a rolling-mean drop greater than 0.15 sustained over two windows.
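The alert logic itself is small; a minimal sketch, assuming one judge score in [0, 1] per canary run per hour, with illustrative window size and function names:

```python
from collections import deque

def make_drift_detector(baseline: float, window: int = 5,
                        drop: float = 0.15, sustained: int = 2):
    """Return a function fed one canary score per run. It returns True
    once the rolling mean sits more than `drop` below baseline for
    `sustained` consecutive observations."""
    scores = deque(maxlen=window)
    breaches = 0

    def observe(score: float) -> bool:
        nonlocal breaches
        scores.append(score)
        rolling_mean = sum(scores) / len(scores)
        breaches = breaches + 1 if baseline - rolling_mean > drop else 0
        return breaches >= sustained

    return observe
```

The arithmetic is trivial; the part that matters is running the judge on a different provider's model so a degraded model never grades itself.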
2. Hallucinations on long-tail slices
The failure. A RAG assistant returns a confident answer citing a document that does not exist, or a document that says the opposite. Hallucination is not uniform across traffic — it concentrates on slices where the model has the least grounding: rare entities, new products, edge-case phrasings.
The stat. A 1% hallucination rate in aggregate can hide a 40% rate on the slice asking about records created in the last seven days. Your aggregate dashboard cannot isolate that slice. Your users can.
The instrument. An online eval with a faithfulness judge that grades each response against its retrieved context. Slice by query attributes (recency, entity type, customer tier). Alert per slice, not globally. Arize and DeepEval both ship this out of the box.
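Once judge verdicts carry slice attributes, the per-slice alerting reduces to a grouped rate check; a sketch with hypothetical slice keys:

```python
from collections import defaultdict

def slice_alerts(graded, threshold=0.05):
    """graded: iterable of (slice_key, faithful: bool) judge verdicts.
    Returns slices whose hallucination rate exceeds `threshold`, even
    when the aggregate rate across all traffic looks healthy."""
    totals, bad = defaultdict(int), defaultdict(int)
    for key, faithful in graded:
        totals[key] += 1
        bad[key] += not faithful
    return {k: bad[k] / totals[k] for k in totals
            if bad[k] / totals[k] > threshold}
```

On the numbers above, 4 unfaithful answers out of 10 on a "recent records" slice trips the alert while the aggregate rate sits at 0.4%.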
3. Retrieval drift in RAG
The failure. Your vector index worked in March and does not in June. Provider pushed a new embedding model; your old vectors and new queries no longer live in the same space. Or new documents changed the distribution. Or user queries drifted.
The public example. The 2026 review Ten Failure Modes of RAG Nobody Talks About lays out the full taxonomy. "Each individual retrieval may return plausible documents... aggregate metrics decline slowly enough that teams attribute degradation to changing user behavior rather than technical drift."
The instrument. A context-relevance eval on a rolling sample of production traces. Grade each retrieved chunk against the user query. Track mean top-1 and mean top-5 relevance over a 7-day rolling window.
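A sketch of the two metrics, assuming each sampled trace carries judge-graded relevance scores for its retrieved chunks in rank order (the rolling-window bookkeeping is omitted):

```python
def retrieval_metrics(traces):
    """traces: list of per-query relevance score lists in [0, 1],
    ordered by retrieval rank. Returns (mean top-1, mean top-5)."""
    top1 = [t[0] for t in traces if t]
    top5 = [sum(t[:5]) / len(t[:5]) for t in traces if t]
    return sum(top1) / len(top1), sum(top5) / len(top5)
```

Tracking both matters: top-1 drifting while top-5 holds points at ranking, both drifting together points at the embedding space itself.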
4. Tool-call misfires in agents
The failure. An agent has six tools. It calls the wrong one. Every span in the trace returned 200. Each tool call succeeded. The agent just picked the wrong next action.
The shape. Two traces can look structurally identical. Same number of spans. Same tool names. Same token counts. The only difference is whether the agent picked the tool a human reviewer would have picked given the conversation so far. APM has no vocabulary for this.
The instrument. A tool-choice judge applied per execute_tool span. Grade the chosen tool against the preceding chat context. Track per-tool success rate. Alert on a per-tool drop greater than 5 percentage points from baseline.
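The baseline comparison is a few lines once the judge has produced per-tool success rates; tool names here are illustrative:

```python
def tool_regressions(baseline, current, drop_pp=5.0):
    """baseline / current: {tool_name: success_rate_percent} from the
    tool-choice judge. Returns tools that fell more than `drop_pp`
    percentage points below their baseline."""
    return {t: baseline[t] - current[t]
            for t in baseline
            if t in current and baseline[t] - current[t] > drop_pp}
```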
Its worse cousin is the $47,000 agent loop: four LangChain agents that called the same verify tool for 11 days before anyone noticed. For that one, you want a repeat-tool-call counter under the invoke_agent parent span, tripping at 8.
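The counter is the simplest instrument in this post; a minimal version over the ordered tool-call names under one invoke_agent parent span:

```python
from collections import Counter

def loop_tripwire(spans, limit=8):
    """spans: ordered tool-call names under one invoke_agent span.
    Returns the offending tool name the moment any tool repeats
    `limit` times, else None."""
    counts = Counter()
    for name in spans:
        counts[name] += 1
        if counts[name] >= limit:
            return name
    return None
```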
5. Context-window truncation
The failure. Your prompt assembler builds a 190,000-token context against a model you believed had a 200K window. You are on a tier where the effective window is smaller, or your client library is configured against a stale limit. The library silently truncates the middle. The facts the user's question depended on were in the middle.
Nothing errors. POST /v1/messages returns 200. Your service returns the response. The user is told the opposite of what the documents said.
The instrument. Token accounting before the API call. Emit gen_ai.request.input_tokens as a span attribute and alert when it crosses the model's documented ceiling. The full fix is three lines of tokenizer arithmetic; the common mistake is doing it after the call fires.
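A sketch of the pre-call check; the window table and model name are illustrative placeholders, not real limits, and the token count is whatever your tokenizer reports for the assembled prompt:

```python
# Documented context ceilings per model ID -- illustrative values only.
MODEL_WINDOW = {"example-model": 200_000}

def fits_window(model: str, input_tokens: int, max_output_tokens: int) -> bool:
    """Run BEFORE the API call: True if input plus the reserved output
    budget fits the model's documented window, so nothing is silently
    truncated out of the middle."""
    return input_tokens + max_output_tokens <= MODEL_WINDOW[model]
```

Emit the same input_tokens figure as the gen_ai.request.input_tokens span attribute so the alert and the trace agree on one number.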
6. Prompt injection
The failure. A piece of text the model ingests (a RAG document, a web page the agent scraped, a calendar event) contains instructions. The model follows them. At Black Hat 2025, researchers demonstrated taking over a user's smart home by planting commands in a Google Calendar invite that Gemini later summarized.
The instrument. An injection-resistance eval: a rolling set of known injection payloads inserted into your RAG or tool-call path, scored on whether the model followed the injected instruction. Run it on every prompt-template change as a CI gate.
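The gate reduces to a filter once you have a payload set, a way to run each payload through your real pipeline, and a judge for whether the injected instruction was obeyed; all three callables below are stand-ins for your own:

```python
def injection_gate(payloads, run_pipeline, followed_injection):
    """CI gate: push each known injection payload through the RAG or
    tool-call path and collect the ones the model obeyed. An empty
    result means the prompt-template change passes."""
    return [p for p in payloads if followed_injection(run_pipeline(p))]
```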
7. Silent cost runaway
The failure. Every single request is within per-request budget. The feature nobody noticed is a tool-call loop that multiplies each user turn by 400. Or the prompt-caching strategy someone broke, which 10x'ed input cost overnight. Or the new tenant who found a cheap way to ask expensive questions.
The public example. The $47K LangChain loop. The 1.67 billion-token Claude Code recursion from July 2025. Both were caught on the monthly bill.
The instrument. Cost per tenant per hour, with a rolling baseline. Alert when cost-per-tenant exceeds 3× the 7-day mean for that tenant for 10 minutes. Also alert on cache-hit ratio dropping below 50% of baseline — that is the cheapest cost-regression detector you will build.
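Both alerts are baseline-relative comparisons; a sketch of the per-tenant rule with illustrative numbers (the 10-minute sustain logic is omitted):

```python
def cost_alerts(hourly_cost, weekly_hourly_mean, factor=3.0):
    """hourly_cost / weekly_hourly_mean: {tenant: dollars}. Flags tenants
    whose current hourly spend exceeds factor x their own 7-day hourly
    mean. Tenants with no baseline yet are skipped, not flagged."""
    return {t for t, cost in hourly_cost.items()
            if cost > factor * weekly_hourly_mean.get(t, float("inf"))}
```

The key design choice is that each tenant is compared against its own history, not a global threshold, so one naturally expensive tenant never masks a runaway in another.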
8. Provider fallback-tier degradation
The failure. You have a primary provider and a secondary fallback configured in LiteLLM or Portkey. When the primary browns out, you fall back. The fallback is degraded in a way steady-state traffic never surfaces — wrong tokenizer, smaller context, different instruction-following profile. Your product gets worse the moment it is supposed to get resilient.
The instrument. Run your online judge on the fallback tiers in steady state, not just during the incident. A tertiary tier that scores 0.55 on a Tuesday afternoon is a tertiary tier that will fail you during the outage that forces you to use it.
The pattern
Five of the eight share the same shape: the transport layer says success, the content says failure, and the only instrument that sees the gap is a quality signal running continuously on real traffic. That quality signal is the whole argument of the book.
- Tracing layer: OpenTelemetry GenAI semantic conventions (gen_ai.* attributes on every LLM, tool, agent, and retrieval span). Chapter 4 of the book.
- Eval layer: a multi-axis judge sampled from production traffic, scored continuously, alerted per slice. Chapters 8–11.
- Ops layer: baseline-relative thresholds, multi-burn-rate SLO alerts, a quality-aware circuit breaker. Chapters 17–18.
You can build all three on open-source tooling — Langfuse, Arize Phoenix, DeepEval, OpenTelemetry Collector + ClickHouse + Grafana. You can also buy it — Braintrust, LangSmith, Helicone, Datadog LLM Observability. Either way the shape is the same and the semconv underneath is vendor-neutral.
If this was useful
The eight failure modes are Chapter 1 of Observability for LLM Applications. The rest of the book walks through how to build the instruments that see them.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
