Incident Response When Your Agent Is on Fire: A Runbook

#ai #agents #python #llm

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
Me: xgabriel.com | GitHub

The dashboard is green. Latency is flat. The error rate has not
moved in six hours. Then support forwards an email: a customer is
asking why the agent emailed their landlord instead of their lawyer.
By the time you finish reading, it has done it again, to someone
else.

That is what an agent incident looks like. Not a spike, not a page.
An email. The platform is fine and the product is broken. Agents
fail quietly: the HTTP call succeeds, the model returns well-formed
JSON, every span is a 200. What broke lives between the spans. The
agent picked the wrong tool, or looped on the right one, or was
steered by a string it read from an untrusted input.

This is the first five minutes. Not the postmortem, not the fix. The
five minutes where your job is to stop the bleeding before you
understand the cut. Print this and tape it to the wall.

Minute one: flip the kill switch

You do not debug first. You stop the agent first. The reference case
is the Replit agent that ran destructive commands against a
production database during a code freeze, deleting live records
(reported by Fortune and logged as AI Incident Database entry 1152).
The window between "something is wrong" and "everything is off" is
where the blast radius grows.

The kill switch is a boolean gating every agent invocation, wired at
the outermost entrypoint so any on-call engineer can find it in
thirty seconds.

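# agent/service.py
import ldclient

# AgentResponse, build_context, static_fallback_response, and agent are
# assumed to be defined elsewhere in the service.

def run_agent(user_id: str, task: str) -> AgentResponse:
    flags = ldclient.get()          # shared LaunchDarkly client
    ctx = build_context(user_id)    # evaluation context for this user
    # Default is False: if the flag cannot be evaluated, the agent stays off.
    if not flags.variation("agent.enabled", ctx, False):
        return static_fallback_response(task)
    return agent.invoke({"task": task, "user": user_id})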

The flag defaults to False. A misconfigured flag client turns the
agent off, not on. That is the correct direction of failure for a
tool that can email your customers. The fallback is not clever. It
is a pre-written, legal-approved message: "We're investigating an
issue with this feature. Your request has been recorded and a human
will follow up within one business day." Write it before the
incident. During the incident you do not want to be debating wording
with PR while the meter runs.
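The same fail-closed direction can be enforced one layer above the flag client, so that an exception (not only a missing flag) also lands on the fallback. What follows is a minimal sketch, not the service's actual code: `flag_lookup` is a hypothetical callable standing in for whatever flag SDK you use, and `is_agent_enabled`, `handle`, and `FALLBACK_MESSAGE` are illustrative names.

```python
from typing import Callable

# Pre-written, approved wording quoted in the paragraph above.
FALLBACK_MESSAGE = (
    "We're investigating an issue with this feature. Your request has "
    "been recorded and a human will follow up within one business day."
)


def is_agent_enabled(flag_lookup: Callable[[str], object]) -> bool:
    """Return True only if the lookup explicitly returns True.

    Exceptions, None, and any non-True value all count as "off", so a
    broken or misconfigured flag client fails closed.
    """
    try:
        return flag_lookup("agent.enabled") is True
    except Exception:
        return False


def handle(task: str,
           flag_lookup: Callable[[str], object],
           run: Callable[[str], str]) -> str:
    """Gate one agent invocation behind the kill switch."""
    if not is_agent_enabled(flag_lookup):
        return FALLBACK_MESSAGE
    return run(task)
```

The strict `is True` comparison is a deliberate choice in this sketch: a flag that comes back as a string or `None` is treated as off rather than truthy.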

Minute two: freeze the budget

Some incidents do not want the whole agent off. A runaway loop wants
its spend capped. Claude Code Issue #44726 (April 2025) is the case:
users filed a billing bug reporting input-to-output token ratios of
74:1 and 175:1 on sessions that normally run 5:1. The reported
symptom was a compounding loop where context grew unbounded across
tool calls. The slogan from that class of incident, via sanj.dev:
"AI agents don't crash, they spend."

A budget freeze converts an unbounded financial event into a bounded
one. The agent still fails; the invoice stops metastasising. If you
already ship per-trajectory budgets, an incident freeze is a global
tightening of the same knob.

# agent/budget.py
def check_budget(spent_usd: float, caps: dict) -> str | None:
    if spent_usd >= caps["hard"]:
        return "hard_cap"          # stop the trajectory
    if spent_usd >= caps["freeze"]:
        return "freeze"            # incident mode: refuse new runs
    return None

During an incident you drop caps["freeze"] to near the current p50
so new trajectories refuse to start while in-flight ones drain. Wire
this check before the model call. If you check after, you already
paid for the call that put you over.

Minute three: triage the trajectory

Now you look. Not at Grafana, not at Kibana, not at the deploy log.
Open the trajectory viewer in whatever agent-observability tool you
wired up (Langfuse, Arize, Braintrust, Phoenix) and pull the
failing trace_id. The alert should have carried that ID as a link.
If it didn't, that is the first thing you fix after the incident.

Read the last five tool calls before the failure. Read their inputs,
their outputs, and what the model wrote between them. You are hunting
for the step where the trajectory stopped being the one you would
have chosen. That step is almost never the final step. It is three
or four steps upstream, where the agent made a small commitment to a
wrong path and spent the rest of the run defending it.

One rule for reading the reasoning text: trust the tool calls,
distrust the prose. The model writes "I will carefully verify the
target environment before executing any mutation," and the next tool
call is execute_sql against db://prod, with no verification step in
between. The prose is post-hoc rationalisation. The tool calls are
what happened.

1. Grab trace_id from the alert.
2. Open the trajectory viewer; load the full run.
3. Scroll to the end, then walk backwards.
4. For each tool call: right tool? right arguments?
5. For each model turn: did reasoning match the inputs?
6. Diff against a golden run for the same input class.
7. Name the off-rails step in one sentence.

Keep a golden trajectory for every major input class: a known-good
run captured during development. The diff against the failing run is
faster than reading either one alone.

Minute four: check the blast radius

You have a hypothesis. Before you fix anything, answer one question:
how many users has this already reached, and is it still spreading?

The intake shape of an agent incident hides this. It arrives through
one support ticket, so it feels like one user. It is usually not one
user. Query the trace store for every trajectory in the incident
window that matches the off-rails signature.

# scripts/blast_radius.py
def blast_radius(traces, signature, window):
    # Assumes window exposes both .start and .end; without the upper
    # bound, traces that began after the incident would be counted too.
    hit = [
        t for t in traces
        if window.start <= t.start <= window.end
        and matches(t, signature)
    ]
    users = {t.user_id for t in hit}
    return {
        "trajectories": len(hit),
        "distinct_users": len(users),
        "still_open": any(t.status == "running" for t in hit),
        "user_ids": sorted(users),
    }

Two branches. If still_open is true, your kill switch or freeze did
not fully take: go back to minute one and confirm the flag actually
flipped in every region. If it is false but distinct_users is large,
you have a customer-comms problem, not just a code problem. A
Canadian tribunal held Air Canada liable for what its chatbot told a
customer (CBC): companies own what their agents say. If the agent
promised something wrong to 200 people, legal needs the list, not a
summary.
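The blast-radius script leans on a `matches(t, signature)` helper that is not shown. One plausible shape, offered only as an assumption, is to express the off-rails signature as "this tool was called with arguments satisfying this predicate". The `Signature` and `Trace` classes below are hypothetical stand-ins for whatever your trace store returns.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Signature:
    tool: str                               # tool name to look for
    arg_predicate: Callable[[dict], bool]   # True if the args look off-rails


@dataclass
class Trace:
    user_id: str
    # Ordered (tool_name, arguments) pairs recorded for the run.
    tool_calls: list[tuple[str, dict]] = field(default_factory=list)


def matches(trace: Trace, signature: Signature) -> bool:
    """True if any tool call in the trace fits the off-rails signature."""
    return any(
        name == signature.tool and signature.arg_predicate(args)
        for name, args in trace.tool_calls
    )


# Example signature: an email sent to a landlord contact.
wrong_recipient = Signature(
    tool="send_email",
    arg_predicate=lambda a: a.get("recipient_role") == "landlord",
)
```

Keeping the signature as data rather than an ad hoc query means the same object can be reused later when you write the regression case.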

Minute five: capture the eval before you fix

The fix is the easy part and the trap. You patch the guardrail, the
agent stops misbehaving, everyone goes back to bed, and the same
class of failure ships again next quarter with a different trace ID.
The thing that stops the repeat is the eval, and you capture it now,
while the failing input is still in front of you, not later.

Pull the exact input that sent the agent off the rails and add it to
the frozen eval set as a regression case.

# evals/regressions.py
def add_regression(trace, verdict, eval_path):
    case = {
        "id": trace.trace_id,
        "input": trace.root_input,
        "off_rails_step": trace.flagged_step,
        "expected": verdict,       # what the agent should do
        "incident": trace.incident_id,
    }
    append_case(eval_path, case)
    return case["id"]

This is the step most agent postmortems skip, and the one that makes
the difference. A traditional postmortem asks "was a deploy
involved?" For an agent the answer is often no: the deploy was two
weeks ago, the model started drifting last Tuesday, nobody touched
anything. The question that generalises is "what input broke it, and
is it in the eval set now?" No incident is closed until that case is
merged and the suite runs green against your fix.

The muscle you build before the fire

None of these five moves work if the first time you try them is
during a real incident. Run a gameday once a quarter: inject a
failure mode into staging and page the on-call engineer for real.
Time how long it takes them to flip the kill switch, freeze the
budget, find the off-rails step, and check the blast radius. The
first gameday always goes badly. That is the point. It tells you
which runbook lines are load-bearing and which are fiction, before a
customer finds out for you.
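Timing a gameday does not need tooling beyond a stopwatch, but a few lines of code make the numbers comparable from quarter to quarter. This is a hypothetical sketch: `GamedayTimer` and the step names are made up for illustration, and the clock is injectable so the timer itself can be tested.

```python
import time
from typing import Callable

# The four timed moves named in the paragraph above.
STEPS = ("kill_switch", "budget_freeze", "off_rails_step", "blast_radius")


class GamedayTimer:
    """Record seconds elapsed from the page to each runbook step."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start = clock()            # the moment on-call was paged
        self.marks: dict[str, float] = {}

    def mark(self, step: str) -> None:
        """Record the first time a step was completed."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.marks.setdefault(step, self._clock() - self._start)

    def missing(self) -> list[str]:
        """Steps the on-call engineer never reached."""
        return [s for s in STEPS if s not in self.marks]
```

The `missing()` list is often the more useful output: a step nobody reached is a runbook line that exists only on paper.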

The AI Engineer's Library is the long version of this page. Agents
in Production covers building and shipping the kill switches,
budgets, and gamedays so they are real instead of theoretical, and
Observability for LLM Applications covers the tracing, judge
scores, and eval sets that make the trajectory readable at 03:00.
The runbook only works if the trace was there before the fire
started.