- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
If you read every public AI-incident postmortem from the last two years, six patterns repeat. Not four. Not eight. Six. One of them is almost certainly in your stack right now.
The source material is public. The AI Incident Database indexes roughly a thousand harm reports. On top of that sit the engineering-blog postmortems the providers actually published: Anthropic's August–September 2025 three-bug cascade, OpenAI's GPT-4o sycophancy writeup, the status-page retros for Anthropic's April 6, 2026 outage, the dev.to and Hacker News incident threads where teams posted their own bills and their own traces, the court records and tribunal decisions where an AI system finally got argued in public.
Different stacks. Different products. Different price tags. The same six holes, over and over.
Here they are, each paired with the instrument that would have caught it.
1. No online eval. Only offline benchmarks.
The incident: OpenAI's GPT-4o sycophancy rollout, April 25–28, 2025. OpenAI shipped a GPT-4o update that, in the company's own words, became "noticeably more sycophantic" and "overly flattering or agreeable." A new reward signal trained on user thumbs-up weakened the anti-sycophancy objective. Offline evals passed. TechCrunch covered the rollback. The fix took four days.
Every downstream product that used GPT-4o for customer-facing generation shipped a quality regression that week and did not see it until a user complained on Twitter.
Why offline missed it: offline evals run on a frozen, curated dataset. Real users do not send you the curated dataset. They send you edge cases, adversarial prompts, and the kinds of emotional context that reward sycophancy. The regression was invisible until the model met real traffic.
The instrument: an online LLM-as-judge over a sample of production traffic, scoring a safety rubric that includes "does the response push back when the user is factually wrong." Score distribution shifts on day one of the rollout. You page before the PR thread does.
If you only have offline evals, you are grading your homework with last week's answer key.
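The sampling half of that instrument fits in a few lines. A minimal sketch — the rubric text, the 2% sampling rate, and the `scorer` interface are all illustrative assumptions, not any provider's actual eval harness:

```python
import random

# Illustrative rubric text -- write your own against your product's failure modes.
SAFETY_RUBRIC = (
    "Score 0-1: does the response push back when the user is factually "
    "wrong, or does it flatter and agree? 1.0 = pushes back with evidence."
)

def sample_and_score(traffic, scorer, rate=0.02, seed=None):
    """Run an LLM-as-judge over a random slice of production traffic.

    `scorer(prompt, response) -> float` wraps a call to a judge model,
    ideally on a different provider family than the generator (see
    pattern 6). The returned scores feed your day-one drift alert.
    """
    rng = random.Random(seed)
    sampled = [t for t in traffic if rng.random() < rate]
    return [scorer(t["prompt"], t["response"]) for t in sampled]
```

In production the `scorer` is a network call and the scores land on a time-series dashboard; the point is that the sample runs continuously against live traffic, not a frozen dataset.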
2. Alerts on availability. Not on quality.
The incident: Anthropic's August–September 2025 three-bug cascade. Three separate infrastructure defects layered on each other. A context-window routing error sent 0.8% of Sonnet 4 requests to servers configured for the upcoming 1M context starting August 5, amplified to 16% of traffic by a load-balancing change on August 29. A TPU server misconfiguration assigned high probability to the wrong tokens, which surfaced as Thai and Chinese characters scattered through English-language responses. An approximate top-K XLA miscompilation returned wrong token rankings depending on batch size.
Zero of these fired a 5xx alert. Every response came back HTTP 200. Every downstream system's availability dashboard stayed green for most of a month.
Why availability missed it: an LLM provider's response body is the payload. A 200 with a garbled body is indistinguishable from a 200 with a good body at the transport layer. The APM stack is blind by design.
The instrument: a character-distribution and language-detection signal on a sample of outputs, plus an online judge for quality. In the cascade window, English-only products saw their non-ASCII character ratio climb by orders of magnitude within hours. Any team running a detect_language span attribute and alerting on drift saw it on day one. The teams that caught it caught it that way.
If your dashboards only track "did the API respond," you will miss every quality incident the industry has ever written up.
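The character-distribution signal itself is nearly free to compute. A minimal sketch — the 10x drift factor is an assumption to tune against your own corpus, not a published threshold:

```python
def non_ascii_ratio(text: str) -> float:
    """Fraction of characters outside ASCII -- for an English-only product,
    this is the signal that climbed by orders of magnitude in the cascade."""
    return sum(ord(c) > 127 for c in text) / max(len(text), 1)

def drifted(sampled_outputs, baseline_ratio: float, factor: float = 10.0) -> bool:
    """Alert when the sampled ratio exceeds the rolling baseline by `factor`.

    `baseline_ratio` is your normal non-ASCII rate (legitimate names,
    symbols, emoji); the factor keeps ordinary variation from paging you.
    """
    current = sum(non_ascii_ratio(o) for o in sampled_outputs) / len(sampled_outputs)
    return current > baseline_ratio * factor
```

Attach the ratio as a span attribute on sampled responses and alert on the aggregate; a proper language-detection library sharpens the signal, but even this crude ratio fires within hours on a Thai-characters-in-English incident.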
3. No multi-provider fallback. Or one that was never exercised.
The incident: Anthropic's April 6, 2026 outage. Roughly ten hours of cascading degradation across every Claude surface, triggered by a routine config change. Covered by IBTimes Australia and AI Daily. Same pattern as OpenAI's January 23, 2025 outage. Same pattern as the Cloudflare incident that took half the web with it.
Every team that depended on a single provider had a yellow status page. Teams that had a LiteLLM or Portkey router configured fell into two groups. The ones who had exercised the fallback rode the outage out with a note in the release channel. The ones who had the YAML committed but never tested spent the first hour debugging their own router and the second hour rolling it out. The fallback that has never been run in anger is not a fallback. It is a todo.
Why config alone missed it: a fallbacks: block that has never handled real traffic has never been stressed against rate limits, auth edge cases, or response-schema mismatches between providers. The first time it runs is during the incident that requires it to work.
The instrument: a monthly fallback drill. Pick a Tuesday. Flip one provider's key to a known-bad value. Watch the secondary take over. If it does not, fix it. Book Chapter 18 has the full config; the short version is that allowed_fails and cooldown_time are the two knobs that turn a flapping provider into a clean skip instead of a retry storm.
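The semantics the drill has to exercise are easy to state in code. A minimal sketch of the `allowed_fails` / `cooldown_time` behavior — this mirrors the knobs' intent, not LiteLLM's actual implementation:

```python
import time

class FallbackRouter:
    """Sketch of allowed_fails / cooldown_time semantics.

    A provider that fails `allowed_fails` times enters cooldown for
    `cooldown_time` seconds and is skipped cleanly -- no retry storm.
    Requests fall through to the next provider in order.
    """

    def __init__(self, providers, allowed_fails=2, cooldown_time=60.0):
        self.providers = providers  # name -> callable(prompt) -> response
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fails = {name: 0 for name in providers}
        self.cooling_until = {name: 0.0 for name in providers}

    def complete(self, prompt):
        now = time.monotonic()
        for name, call in self.providers.items():
            if now < self.cooling_until[name]:
                continue  # clean skip while the provider cools down
            try:
                return name, call(prompt)
            except Exception:
                self.fails[name] += 1
                if self.fails[name] >= self.allowed_fails:
                    self.cooling_until[name] = now + self.cooldown_time
                    self.fails[name] = 0
        raise RuntimeError("all providers failing or cooling down")
```

The drill is then: point the primary's callable at a known-bad key, send real traffic shapes through `complete`, and confirm the secondary answers. If the equivalent test has never run against your actual router config, assume it will fail during the incident.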
4. No cost ceiling per tenant.
The incident: the $47,000 LangChain agent loop, November 2025. Four agents on a production system entered a conversation cycle. An Analyzer and a Verifier disagreed on a misclassified error; each retry escalated; the loop ran for 11 days. Every LLM call returned 200. Every tool call succeeded. A monthly budget alert fired on day 9, two days too late to matter.
Paired with Sanjeet Uppal's July 2025 writeup of the Claude Code recursion loop that burned 1.67 billion tokens in five hours before detection. Same failure class. Different number of zeros.
Why account-level budget alerts miss it: a monthly budget alert by definition fires late. It is backward-looking, aggregate, and has no idea which tenant or which request spiked. By the time it pages, the bill is the bill.
The instrument: a hard cost ceiling enforced at the gateway, per tenant, per unit time. A circuit-breaker on tokens-per-conversation that trips before the 50th tool call, not after the thousandth. A span attribute (gen_ai.usage.total_tokens) summed over the trace and rejected inline when the conversation exceeds your budget. The instrument is preventive, not detective. You cannot alert your way out of a runaway loop. You have to stop it.
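The breaker itself is small. A minimal sketch of an inline per-conversation ceiling — the specific limits are illustrative, and a real gateway would also enforce a per-tenant, per-hour sum:

```python
class BudgetExceeded(RuntimeError):
    """Raised inline to stop the conversation, not logged for later."""

class ConversationBudget:
    """Hard ceiling on a single conversation, enforced before the bill exists.

    `charge` is called with gen_ai.usage.total_tokens after each LLM
    response; when either limit trips, the conversation is rejected
    inline -- preventive, not detective.
    """

    def __init__(self, max_total_tokens=50_000, max_calls=50):
        self.max_total_tokens = max_total_tokens
        self.max_calls = max_calls
        self.total_tokens = 0
        self.calls = 0

    def charge(self, usage_total_tokens: int) -> None:
        self.calls += 1
        self.total_tokens += usage_total_tokens
        if self.calls > self.max_calls or self.total_tokens > self.max_total_tokens:
            raise BudgetExceeded(
                f"conversation hit {self.total_tokens} tokens "
                f"in {self.calls} calls"
            )
```

An 11-day agent loop dies at call 51 or at 50k tokens, whichever comes first, instead of at the monthly budget alert.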
5. No prompt-version attribution in traces.
The incident: the generic "we shipped a prompt tweak on Tuesday and Wednesday's quality regressed" incident. This is the most common shape in the dev.to and engineering-blog corpus, precisely because it is always self-inflicted and almost always cheap to prevent. The AI Incident Database is full of variations (product X gave user Y a wrong answer) where the postmortem hinges on "we cannot actually tell which version of the prompt was live when that trace happened."
Cursor's "Sam" incident, April 2025 (AI Incident Database #1039) is the cleanest named example. A support bot invented a "one device per subscription" policy and emailed it to paying users. The mitigation was a single feature-flag flip to a baseline prompt. The reason the flip worked: Cursor knew which prompt was which.
Why logs without attribution miss it: if your traces carry gen_ai.request.model but not gen_ai.prompt.version or gen_ai.prompt.hash, you cannot diff the traces that regressed against the traces that did not. You are reduced to correlating by timestamp, which loses resolution the moment a rollout is partial.
The instrument: a gen_ai.prompt.version span attribute on every chat span, set from your prompt-registry commit SHA. On a regression, you filter the trace viewer by prompt version and the scatter plot of quality scores resolves into two populations. The rollback is one feature-flag flip. Without the attribute, the rollback is a bisect.
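In code, the attribute and the diff it enables look like this. A minimal sketch — the registry dict, the version string, and the trace-record shape are hypothetical stand-ins for your prompt registry and trace store:

```python
import hashlib

# Hypothetical prompt registry: version (commit SHA) -> prompt text.
PROMPT_REGISTRY = {
    "a1b2c3d": "You are a support agent. Cite only documented policy.",
}

def chat_span_attributes(model: str, prompt_version: str) -> dict:
    """Attributes to set on every chat span, so regressed traces can be
    diffed by prompt version instead of correlated by timestamp."""
    prompt = PROMPT_REGISTRY[prompt_version]
    return {
        "gen_ai.request.model": model,
        "gen_ai.prompt.version": prompt_version,
        "gen_ai.prompt.hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],
    }

def split_by_prompt_version(traces):
    """Group quality scores by prompt version -- the two populations
    the scatter plot resolves into during a regression."""
    groups = {}
    for t in traces:
        groups.setdefault(t["gen_ai.prompt.version"], []).append(t["score"])
    return groups
```

With the attribute in place, "which prompt caused Wednesday's regression" is a group-by, and the rollback target is unambiguous even on a partial rollout.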
6. No meta-evaluation of the judge.
The incident: the composite shape of every LLM-as-judge postmortem that reached the trade press in 2025. A team runs an online judge for weeks. The dashboard reads 94%. A user files a ticket on an answer the judge scored 0.92. The judge was the same model family as the system under test. It had learned to rate its own voice high.
The research is now thick enough to be embarrassing. Shi et al., Judging the Judges (arXiv:2406.07791) on position bias. Justice or Prejudice (arXiv:2410.02736) on verbosity bias. Self-Preference Bias in LLM-as-a-Judge (arXiv:2410.21819) on the self-preference effect. A two- to four-point lift on paired preference scores when the judge and the generator share a model family. Enough to silently flip an A/B test from inconclusive to "ship it."
Why an uncalibrated judge misses it: a judge without a human-labeled reference set is not a measurement. It is a number-shaped vibe. The dashboard looks like observability. It is not.
The instrument: a meta-evaluation held-out set. Two hundred human-labeled examples per rubric. Every week, run the judge against the set and compute agreement with the human labels. If agreement is under 0.7 Cohen's kappa, the judge is broken and the dashboard that depends on it is lying. Use a different provider family for the judge than for the generator. Rotate judges quarterly. Randomize order on pairwise comparisons and throw out verdicts that flip with position.
A judge you have not meta-evaluated is telling you what you want to hear.
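The weekly agreement check needs nothing more than Cohen's kappa over the held-out set. A minimal self-contained sketch (the 0.7 threshold comes from the text above; the label encoding is an assumption):

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement beyond chance between judge verdicts and human labels.

    Labels can be any hashable values (e.g. 0/1 pass-fail verdicts).
    1.0 = perfect agreement; 0.0 = no better than chance.
    """
    assert judge_labels and len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    # Observed agreement
    po = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement by chance, from each rater's label frequencies
    jc, hc = Counter(judge_labels), Counter(human_labels)
    pe = sum(jc[label] * hc[label] for label in set(jc) | set(hc)) / (n * n)
    if pe == 1.0:
        return 1.0
    return (po - pe) / (1 - pe)

def judge_is_broken(judge_labels, human_labels, threshold=0.7):
    """The weekly check: below-threshold kappa means the dashboard is lying."""
    return cohens_kappa(judge_labels, human_labels) < threshold
```

Run it every week against the 200-example set; the day `judge_is_broken` returns True, stop trusting the quality dashboard until the judge is recalibrated.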
The pattern
Six mistakes. Different companies. Different stacks. The pattern is the same shape in every postmortem.
The thing that breaks is not the model. The thing that breaks is the assumption that the tooling you already own (the APM, the uptime monitor, the monthly budget alert, the offline eval suite, the judge prompt you wrote in an afternoon) generalizes from classical services to LLM-powered ones. It does not. LLMs fail in ways that look like success at every layer of the stack except the output itself.
The instruments that catch these incidents have one thing in common. They read the payload, not the envelope. Online judges read the response body. Character-distribution alarms read the tokens. Prompt-version attribution reads the inputs. Per-tenant cost ceilings read the running sum. Fallback drills read what happens when a provider stops answering. Meta-evaluation reads the judge against a human ground truth.
If you are building an LLM product and your observability stack does not read the payload, you are not doing observability. You are doing uptime monitoring on a different colored dashboard.
Pick the one of the six that fits your stack least. Fix it this week. Then fix the next one.
If this was useful
The book goes deeper on each of these six. The traces, the span attributes, the exact judge prompts, the gateway config for the fallback tier, the 50-item production-readiness checklist that turns the six into tick-boxes. It is the field manual that would have saved the teams behind the postmortems above from having to write them.
- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub

