- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Your pager goes off at 03:14. The chat says llm_judge_score_avg is down 0.14 against the 24h baseline. Latency is fine. Cost is fine. Every dashboard except one is green. You have thirty minutes before this becomes an executive problem. Here is the order.
This is the runbook I wish I'd had when I started writing the book on LLM observability. The five triage branches come from the book's incident-response chapter, which pulls from two years of published postmortems: Anthropic's August 2025 three-bug cascade, the April 6, 2026 ten-hour outage, the $47K LangChain loop, the GPT-4o sycophancy rollback, and Air Canada's invented fare policy. Each branch has a first check, the commands to run, and the fallback move.
00:00 — The page fires
The first instinct is to open the model playground and start poking. Do not. You cannot debug a distributed system by asking it one question at a time. Open the incident channel. Paste the alert. Read the last five messages in #deploys and #llm-gateway.
One sentence in the channel, not a thread:
> Paged on `llm_judge_score_avg` down 0.14 vs 24h baseline. Investigating. No hypothesis yet.
This is the entire value of the incident channel for the first minute. It tells the next person who shows up that you are on it and what you know.
00:02 — Shape of the change
Do not look at the model. Look at the shape of the change. There are exactly five shapes an LLM incident can take, and getting the shape right in the first five minutes saves you the whole half hour.
- Provider availability. HTTP errors, elevated latency, outright outage.
- Provider quality. 200 OK everywhere, silently worse outputs.
- Self-inflicted quality. You shipped something. The model did not change; your inputs did.
- Cost. Dollars move. Latency and quality do not.
- Regulatory / reputational. Nothing in your stack misbehaved by its own lights. The output is the problem.
The triage question is never "is it up." The triage question is: what changed, and on whose side.
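Picking the shape can be made mechanical. A sketch only — the inputs are the coarse yes/no answers the three commands in the next section give you, not real telemetry:

```shell
# Illustrative sketch: the five branches as a decision function.
# Inputs are coarse yes/no observations from the triage commands.
pick_branch() {
  local provider_degraded="$1" judge_drop="$2" cost_spike="$3" we_shipped="$4"
  if [ "$provider_degraded" = yes ]; then
    echo "branch 1: provider availability"
  elif [ "$judge_drop" = yes ] && [ "$we_shipped" = yes ]; then
    echo "branch 3: self-inflicted quality"
  elif [ "$judge_drop" = yes ]; then
    echo "branch 2: provider quality"
  elif [ "$cost_spike" = yes ]; then
    echo "branch 4: cost"
  else
    echo "branch 5: regulatory / reputational"
  fi
}

pick_branch no yes no yes   # → branch 3: self-inflicted quality
```

The ordering encodes the priors: a degraded provider trumps everything, and a quality drop that coincides with your own deploy is branch 3 until proven otherwise.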
00:05 — The three commands that pick a branch
Three commands. In this order. Do not skip one because you think you know the answer.
```bash
# 1. Are the upstream providers healthy?
curl -s https://status.anthropic.com/api/v2/status.json \
  | jq '.status.indicator'
curl -s https://status.openai.com/api/v2/status.json \
  | jq '.status.indicator'
```
Indicator values are `none`, `minor`, `major`, and `critical`. If anything is non-`none` on a provider you depend on, you are likely in branch 1. Pin the page in the channel and keep going — you still need the next two commands to rule out a compounding incident on your side.
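If you want that check to gate a script rather than an eyeball, `jq -e` turns the indicator into an exit code. A sketch, shown against a canned payload so it runs offline:

```shell
# Exit 0 (degraded) when a Statuspage-style indicator is anything
# but "none"; jq -e sets the exit status from the boolean result.
degraded() { ! jq -e '.status.indicator == "none"' >/dev/null; }

# Offline example with a canned payload:
echo '{"status":{"indicator":"minor"}}' | degraded && echo "branch 1 candidate"
```

Pipe the real `curl -s https://status.anthropic.com/api/v2/status.json` into `degraded` to use it live.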
```bash
# 2. What does the last hour of your own traffic look like?
# $PROM_URL is your Prometheus server, e.g. http://localhost:9090.
# -v-1H is BSD/macOS date syntax; on GNU coreutils use
# --start="$(date -u -d '1 hour ago' +%s)".
promtool query range \
  --start="$(date -u -v-1H +%s)" \
  --end="$(date -u +%s)" --step=60s \
  "$PROM_URL" \
  'sum by (gen_ai_provider_name, gen_ai_request_model) (
    rate(gen_ai_client_operation_duration_count[5m])
  )'
```
You are looking for three patterns. Error rate spiking on one provider but not the others, which points at branch 1. Error rate flat but judge score falling, which points at branch 2 or 3. Judge score flat but cost per request climbing, which points at branch 4.
```bash
# 3. What did your own team ship in the last hour?
git log --since='1 hour ago' --oneline \
  -- prompts/ configs/ tools/ retrieval/
```
If that command returns a commit, your prior just shifted hard toward branch 3. An LLM incident that coincides with a prompt or retrieval change is almost always self-inflicted until proven otherwise.
State your hypothesis in the channel as a sentence. "Judge score regressed on summarize-v7; prompts/ had a commit 42 minutes ago; rolling back." The sentence is the incident.
00:10 — Branch 1: provider availability
Signal: the `gen_ai.response.status_code` distribution shifted to 5xx or `overloaded_error`. The status page is yellow or red.
First check:
```bash
# Per-provider error rate over the last 15 minutes.
promtool query instant "$PROM_URL" \
  'sum by (gen_ai_provider_name) (
    rate(gen_ai_client_operation_duration_count{
      gen_ai_response_status_code=~"5.."
    }[15m])
  )
  /
  sum by (gen_ai_provider_name) (
    rate(gen_ai_client_operation_duration_count[15m])
  )'
```
Fallback move: trip the router. If you run LiteLLM, Portkey, or OpenRouter, the fallback tier is a config flip, not a deploy. If you have never exercised it, this is the incident where you learn whether it works. The April 6, 2026 Anthropic outage split teams into two groups: those whose fallback was committed and tested, and those whose fallback was committed and untested. The gap between them is one practice drill.
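For LiteLLM specifically, the flip lives in the proxy config. A sketch only — the model names are placeholders and the exact settings keys have shifted across LiteLLM versions, so verify against the docs for yours:

```shell
# Hypothetical LiteLLM proxy config with a fallback tier.
# Key names follow LiteLLM's proxy docs at time of writing;
# model names are placeholders.
cat > litellm-config.yaml <<'EOF'
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
  - model_name: secondary
    litellm_params:
      model: openai/gpt-4o
router_settings:
  fallbacks:
    - primary: ["secondary"]
EOF
```

The point is that this file is committed, reviewed, and drilled — the incident-day action is flipping which tier serves, not writing YAML at 03:20.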
If you do not have a router, your only fallback is a prepared canned response on the user-facing surface. Serve it. Update the public status page within ten minutes of declaring the incident.
00:10 — Branch 2: provider quality
Signal: 200 OK everywhere, but your online judge score or a character-distribution signal shifted. The canonical example is Anthropic's August–September 2025 bug cascade — three defects shipped as HTTP 200s, peaking at 16% of Sonnet 4 requests on August 31. No availability SLO fired. Only quality drift showed it.
First check:
```bash
# Judge score over the last hour, bucketed by model.
# Shift both timestamps back 24h to get the baseline
# for the same hour yesterday.
promtool query range \
  --start="$(date -u -v-1H +%s)" \
  --end="$(date -u +%s)" --step=60s \
  "$PROM_URL" \
  'avg by (gen_ai_response_model) (
    llm_judge_score
  )'
```
Then scan a sample of traces for character-distribution anomalies — stray Thai or Chinese characters in English responses were one of the August 2025 symptoms. A one-liner against your trace store:
```bash
# Count of outputs containing non-ASCII where they
# shouldn't. Adjust the feature filter to yours.
curl -s "$TRACE_API/search?feature=summarize&limit=500" \
  | jq '[.traces[].output
      | test("[^\\x00-\\x7F]")] | map(select(.)) | length'
```
Fallback move: route traffic to the secondary tier and open a support ticket with trace IDs attached. Providers are faster to confirm a quality bug when you hand them span data, not screenshots. Keep the primary in shadow mode so you can confirm recovery without rolling back.
00:10 — Branch 3: self-inflicted quality
Signal: `git log --since='1 hour ago'` returned something, or your feature-flag system shows a rollout that started in the last two hours. The judge-score regression is localized to the feature that changed.
First check:
```bash
# What's running, per feature, right now.
curl -s "$FLAGS_API/state" \
  | jq '.features[]
    | select(.updated_at > (now - 7200))'
```
Then bisect. You have four axes of change: the prompt version, the model version, the retrieval index, and the tool definitions. A trace backend worth using (Langfuse, Phoenix, Braintrust, Datadog LLM Observability) lets you filter on each independently and diff the last hour against the same hour yesterday in a three-click operation. If your backend cannot do that, the incident is now also a retrospective action item.
Fallback move: roll back the prompt behind its feature flag. No redeploy. The rollback budget should be under ten minutes. If it takes longer, flip to the kill-switch prompt. The kill-switch is a known-good conservative baseline, reviewed quarterly, that exists for exactly this moment. Cursor's April 2025 "Sam" incident is the canonical example: recovery was fast because the flip to a baseline prompt already existed as a single action.
00:10 — Branch 4: cost
Signal: latency flat, judge score flat, but cost per request or cost per tenant has climbed. The dashboards for this branch accumulate rather than ring, which is why cost incidents are almost always found late.
First check:
```bash
# Cost per hour, per tenant, over the last 6 hours.
# Compare against each tenant's 7-day average.
promtool query range \
  --start="$(date -u -v-6H +%s)" \
  --end="$(date -u +%s)" --step=300s \
  "$PROM_URL" \
  'sum by (app_tenant_id) (
    rate(app_llm_cost_usd[10m])
  )'
```
If a single tenant is at three times their seven-day rolling average, you likely have a loop. The July 2025 Claude Code recursion loop burned 1.67B tokens in five hours before detection. Look at `invoke_agent` spans and count `execute_tool` children under a single parent:
```bash
# Tool-call fanout per agent run, top 10 offenders.
curl -s "$TRACE_API/query?span=invoke_agent&limit=1000" \
  | jq '[.traces[]
      | {id: .trace_id,
         tool_calls: ([.spans[]
           | select(.name == "execute_tool")]
           | length)}]
      | sort_by(-.tool_calls) | .[0:10]'
```
Fallback move: trip the circuit breaker at the gateway. A hard per-tenant budget cap that refuses requests past threshold is the only brake that works in real time. Cap cumulative tokens per agent session. If you do not have a gateway-level cap, your rollback is the per-feature flag that disables the agent surface until the loop is diagnosed.
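The cap itself is a one-comparison decision; what matters is that it runs in the request path, not in a nightly report. A toy sketch of the gateway-side check — the cap and spend figures are illustrative:

```shell
# Toy gateway-side budget check; amounts are illustrative.
# Returns non-zero (reject) once tenant spend exceeds the cap.
budget_check() {
  local spent_usd="$1" cap_usd="$2"
  # awk handles the float comparison portably
  awk -v s="$spent_usd" -v c="$cap_usd" 'BEGIN { exit (s > c) ? 1 : 0 }'
}

budget_check 12.40 50 && echo "allow"
budget_check 61.75 50 || echo "reject: tenant over hourly cap"
```

In a real gateway this sits in front of the provider call, with the spend counter updated from the same cost metric the dashboard reads.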
00:10 — Branch 5: regulatory / reputational
Signal: nothing in your metrics fired. Someone on the support team or a customer on social media surfaced a specific output that is legally or reputationally bad. The canonical case is Moffatt v. Air Canada: the airline's chatbot invented a bereavement-fare policy, and the BC tribunal held the airline liable for it. What your LLM says, your company said.
First check:
```bash
# Pull the exact trace by user_id and timestamp window.
curl -s "$TRACE_API/search?\
user_id=$REPORTED_USER&since=$REPORTED_TS_MINUS_10M&\
until=$REPORTED_TS_PLUS_10M" | jq '.traces[0]'
```
You need three things from the trace: the prompt version that was live, the retrieval context that was pulled, and the full model output as stored. PII redaction should have scrubbed sensitive data at write time. If it did not, the incident has a data-handling component too.
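Getting those three things out of the trace is one jq call if your schema stores them at predictable paths. A sketch against a canned trace — the field names (`metadata.prompt_version`, a `retrieve` span) are assumptions about your own span schema:

```shell
# Canned trace standing in for $TRACE_API output; field names
# are assumptions about your own span schema.
cat > /tmp/trace.json <<'EOF'
{"metadata": {"prompt_version": "support-v12"},
 "spans": [{"name": "retrieve", "output": "fares.md#bereavement"}],
 "output": "Bereavement fares may be claimed retroactively."}
EOF

jq '{prompt_version: .metadata.prompt_version,
     retrieval_context: [.spans[] | select(.name == "retrieve") | .output],
     model_output: .output}' /tmp/trace.json
```

Paste that object, not a screenshot, into the incident channel and the legal thread.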
Fallback move: flip the feature to the kill-switch prompt or disable the surface entirely. This is the one branch where "serve a canned response" is the correct first action, not the last. Legal gets looped in immediately, not after the bisect. For user-facing LLM output, an AI disclosure label and an output moderation layer are the controls that prevent the next occurrence — if they are not live, the postmortem action item writes itself.
00:25 — Communicate, then fix
By the 25-minute mark you should have declared the branch, stated the hypothesis, and applied the mitigation. The public status page is updated. The incident channel has a single pinned message with the current state. If the mitigation is a rollback, the rollback is complete. If the mitigation is a router flip, the flip is in effect and you are watching the secondary tier's judge score to confirm the fallback is actually producing acceptable output.
One detail most teams miss: the fallback tier needs its own online judge, running in steady state, not only during the incident. A secondary model that scores 0.55 against your rubric on a quiet Tuesday is a latent incident — you find out about it the first time primary fails over. Run the online judge on the fallback. Confirm the number before you need it.
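One way to make that standing check concrete is an always-on alert on the fallback tier's judge score. A sketch of a Prometheus rule — the `tier` label and the 0.7 threshold are assumptions; use your own schema and your own rubric's floor:

```shell
# Sketch: standing Prometheus alert on the fallback tier's judge
# score. The tier label and 0.7 threshold are illustrative.
cat > fallback-judge.rules.yml <<'EOF'
groups:
  - name: fallback-quality
    rules:
      - alert: FallbackJudgeScoreLow
        expr: avg(llm_judge_score{tier="fallback"}) < 0.7
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Fallback tier judge score below 0.7 in steady state"
EOF
```

A warning-level page on a quiet Tuesday is cheap; discovering the secondary scores 0.55 mid-failover is not.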
00:30 — The postmortem template
When it is over, write it down. Within 48 hours. Memory degrades on the wrong axis for postmortems: you remember the fix, you forget the false leads, and the false leads are where the organizational learning lives.
```markdown
# Postmortem: <incident title>

**Was this detectable from metrics alone? If not, what signal
would have caught it, and what do we need to build to capture
that signal next time?**

## Summary
<Two sentences. What broke, how long, user impact.>

## Timeline (UTC)
- HH:MM — first signal
- HH:MM — page fired
- HH:MM — hypothesis stated
- HH:MM — mitigation applied
- HH:MM — resolved

## Root cause
<Technical cause. Link traces, dashboards, provider status.>

## What went well

## What went badly

## Action items
<Each item has an owner and a date. No exceptions.>
```
The bolded question at the top is the entire point. LLM incidents fail the metrics-alone test more often than traditional incidents do, and every time the answer is "no, not from metrics alone," you have found a gap in your observability that the next incident will walk through. Treat every "no" as a ticket. Assign it. Close it before the next quarter.
The five branches on one page
Pin this to the wall next to the on-call laptop:
| Branch | First signal | First check | Fallback |
|---|---|---|---|
| Provider availability | 5xx / `overloaded_error` rate up | `curl` status.anthropic.com | Router flip to secondary tier |
| Provider quality | 200 OK, judge score down | Char-distribution scan, per-model judge | Route to secondary + support ticket with trace IDs |
| Self-inflicted quality | `git log --since='1 hour ago'` returns | Bisect prompt / model / retrieval / tools | Feature-flag rollback to baseline |
| Cost | $/tenant above 3x 7d avg | `execute_tool` fanout per `invoke_agent` | Gateway per-tenant cap, kill agent surface |
| Regulatory / reputational | Support ticket, social media | Pull trace by user + timestamp | Kill-switch + AI disclosure + legal |
The order you read this table matters. Availability and cost are visible in metrics. Quality and regulatory are not. Most teams wire alerts for the first two and learn about the second two from customers. The job is to close that gap before the next incident does it for you.
If this was useful
The playbook above is a one-page compression of the book's chapter on incident response. The full version covers the OTel span schema that makes the bisect a three-click operation, the LiteLLM router config that makes the failover a config flip, the quality-aware circuit breaker that trips on judge score and not just HTTP status, and the fifty-item production readiness checklist that blocks a launch. That is what the book is for.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.

