DEV Community

Logan for Waxell

Posted on • Originally published at waxell.ai

When Your AI Agent Has an Incident, Your Runbook Isn't Ready

Your on-call engineer gets paged at 2am. The alert says your customer-facing AI agent is misbehaving — producing garbled outputs, possibly taking unintended actions, burning through tokens at ten times the expected rate. They open the runbook.

The runbook says: check the error rate, examine the trace, identify the failing component, roll back or patch.

None of that applies. The error rate is fine — the agent is executing successfully, it's just doing the wrong thing. The "trace" is a wall of LLM completions with no clear causal structure. There's no component to isolate because the failure is in reasoning, not in code. And rolling back the agent deployment doesn't roll back whatever it already did.

This is the gap that most engineering teams discover at the worst possible time: 57% of organizations now have agents running in production, according to LangChain's 2026 State of Agent Engineering report, but the same research found that quality (cited by 32% of respondents as their top blocker) and latency (20%) — not capability — are the primary barriers teams fight in production. Most of them don't have runbooks written for how agents actually fail.

AI agent incident response is the set of practices, tools, and procedures for detecting, containing, investigating, and learning from failures in autonomous AI agent systems running in production. It differs from traditional software incident response in three fundamental ways: agent failures are often non-deterministic and non-reproducible, the blast radius of an agent incident extends across tool calls and external systems rather than a single service boundary, and the failure mode is frequently behavioral — the agent did something unexpected — rather than operational — the service went down. Effective agent incident response requires instrumentation, containment capability, and ownership structures that most SRE practices were not designed to provide.


Why does AI agent incident response differ from standard SRE?

The SRE toolkit was built for deterministic systems. Given the same input, a correctly implemented service returns the same output. Incidents have causes you can identify, reproduce, and fix. Error rates measure something meaningful because errors are discrete, catchable events.

Agents break all three of these assumptions.

An agent that produced a wrong output at 14:32 may not produce the same wrong output if you replay the exact same session at 14:45. Model temperature, context window effects, subtle differences in retrieved data, and the inherent non-determinism of LLM sampling mean that replayable reproduction of agent failures is the exception, not the rule. Your on-call engineer cannot reliably reproduce what happened — which means standard debugging practices (write a test that replicates the failure, fix it, verify the test passes) don't apply cleanly.

Error rates are the wrong metric. Most agent failures don't look like errors. The API call succeeded. The session completed. The agent returned something. It just returned something wrong, or took an action outside its intended scope, or consumed 50x the expected token budget while doing it. Error rate monitoring tells you when your agent is down. It tells you nothing about whether your agent is doing the right thing.

The investigation workflow changes completely. When a service misbehaves, you look at logs, metrics, and the trace for the specific request that failed. When an agent misbehaves, you need the full execution record for the session — every LLM call, every tool invocation, every piece of data that entered the context window — because the failure may be distributed across a dozen steps rather than localized to a single function call. Without a session-level execution record that captures this graph, you're reading tea leaves.
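A session-level execution record can be as simple as an append-only log of typed steps. The sketch below is a minimal, hypothetical schema — field names like `step_type`, `target`, and `payload` are illustrative, not a standard — showing how every LLM call and tool invocation in a session gets captured in one place:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SessionStep:
    step_type: str        # "llm_call", "tool_call", "retrieval", ...
    target: str           # model name, tool name, or external system
    payload: dict[str, Any]
    tokens_in: int = 0
    tokens_out: int = 0

@dataclass
class SessionRecord:
    session_id: str
    steps: list[SessionStep] = field(default_factory=list)

    def record(self, step: SessionStep) -> None:
        self.steps.append(step)

    def total_tokens(self) -> int:
        return sum(s.tokens_in + s.tokens_out for s in self.steps)

# Every agent action appends a step, so the full causal chain across
# the session is reconstructable after an incident.
session = SessionRecord(session_id="sess-0142")
session.record(SessionStep("llm_call", "gpt-4o", {"prompt": "plan"}, 900, 120))
session.record(SessionStep("tool_call", "crm.update_record", {"record_id": "A17"}))
```

The point of the structure is that the record exists per session, not per request — when the failure is distributed across a dozen steps, you read one artifact instead of a dozen logs.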

This is why the standard runbook breaks. It was written for incidents with discrete causes. Agent incidents often have diffuse causes that only become visible when you can see the entire session history at once.


Why is ownership of AI agent incidents so fragmented?

Here's the question your runbook should answer and probably doesn't: who owns the agent?

Not who built it. Who is responsible for its behavior in production — right now, when something is wrong, at 2am?

In most organizations, this is unclear in ways that don't exist for traditional services. A microservice has an owning team, a deployment pipeline, a Slack channel for incidents, an on-call rotation. The ownership model for AI agents is younger and less established, and the fragmentation runs deep. The team that built the agent may not be the team that deployed it. The team that deployed it may not have authority over the model configuration. The compliance team may have opinions about agent behavior that nobody operationalized as policy. And the security team may discover the agent is using a database credential they didn't know existed.

When an agent incident hits, the first thirty minutes are often spent answering questions that should have been answered before the agent ever deployed: Who can change the agent's configuration? Who can roll it back? Who has authority to shut it down entirely? Who is responsible for informing affected users?

The last question — who can shut it down — is particularly costly when the answer is unclear. An agent that's behaving incorrectly in a way that affects customers needs to be stopped. If no one has a pre-established path to do that, the incident compounds while ownership is being negotiated. This is the specific failure mode that ends agent deployments: not the technical incident itself, but the organizational unreadiness to respond to it.

Organizations that have worked through this are explicit about agentic governance at the operational level — not just policy documents, but defined human ownership for every agent in production, with a RACI for incidents, a documented kill-switch procedure, and an on-call rotation that includes someone with authority to act.
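One way to make that ownership executable rather than aspirational is a machine-readable registry that the on-call process queries first. This is an illustrative sketch — every team name, channel, and runbook path below is a placeholder, and a real deployment would likely back this with a service catalog rather than an in-process dict:

```python
# Hypothetical agent ownership registry; all field values are placeholders.
AGENT_REGISTRY = {
    "support-triage-agent": {
        "owning_team": "customer-platform",
        "incident_channel": "#inc-support-agent",
        "oncall_rotation": "pagerduty:customer-platform-primary",
        "kill_switch_runbook": "runbooks/support-agent-kill-switch.md",
        "can_shut_down": ["oncall-primary", "eng-director"],
    },
}

def incident_contacts(agent_name: str) -> dict:
    """Answer the 2am questions before the incident: who owns this agent,
    where to page them, and who has authority to stop it."""
    entry = AGENT_REGISTRY.get(agent_name)
    if entry is None:
        raise LookupError(f"No registered owner for agent {agent_name!r}")
    return entry
```

An agent with no registry entry fails loudly — which is exactly the signal you want at deploy time, not at incident time.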


What is the blast radius of an AI agent incident?

When a microservice throws a 500, the blast radius is bounded: the clients that called it got an error. They retry or they don't. The failure is contained within a service boundary.

An agent's blast radius doesn't respect service boundaries. It extends to every system the agent had access to during the session — every API it called, every database it queried or wrote to, every external message it sent.

This matters for incident response because the alert is almost never the full picture. An alert that fires because an agent session consumed abnormal token volume tells you there was a problem. It doesn't tell you what the agent did during those tokens. By the time the alert fired, the agent may have already written records to your CRM, sent messages to a customer, made API calls to external services, or processed sensitive data through systems that weren't in scope for the original task.

Incident containment for agents has to start with a blast radius audit, not just service restoration. You need to know: what actions did the agent actually take before it was stopped? What data did it access? What external systems did it touch? What's the status of every write operation it initiated?

Standard agent monitoring — token counts, latency, error rates — tells you the operational facts of your session. Production telemetry like this is the cost dashboard of your incident. What it doesn't tell you is the action history: which records the agent read, which it modified, what it sent to which external system, and in what sequence. That's the session execution record — the artifact that makes a blast radius audit possible. Without it, you're left trying to reconstruct the agent's activity from the scattered logs of every downstream system it touched — a forensic exercise that takes hours and frequently misses things.
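Given a session execution record, the first pass of a blast radius audit is mechanical: filter the action history down to writes and external sends, in order. The sketch below assumes a simplified step shape of `(kind, target, action)` tuples — a hypothetical format, not any particular tool's schema:

```python
# Sketch of a blast-radius audit over a session execution record.
# Each step is (kind, target, action) — a simplified, illustrative shape.
WRITE_ACTIONS = {"create", "update", "delete", "send"}

def blast_radius(steps):
    """Return the external systems the agent wrote to, in first-touch order.
    This is the containment checklist: every entry needs verification
    against the downstream system's own logs."""
    touched = []
    for kind, target, action in steps:
        if kind == "tool_call" and action in WRITE_ACTIONS and target not in touched:
            touched.append(target)
    return touched

steps = [
    ("llm_call", "gpt-4o", "complete"),
    ("tool_call", "crm", "update"),
    ("tool_call", "billing_db", "read"),
    ("tool_call", "email_gateway", "send"),
    ("tool_call", "crm", "update"),
]
# Writes hit the CRM and the email gateway; billing_db was read-only.
```

Note what the read-only entry buys you: being able to say definitively that a system was accessed but not modified is half the value of the audit.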


What does a runbook for AI agents actually need?

A runbook written for agent incidents looks different from a service runbook in four specific ways.

Session trace first, error trace second. The first action in an agent incident is pulling the complete session record for the affected session — every LLM call, every tool invocation, token count per step, the full context window at each decision point. This is your primary diagnostic artifact. If you don't have it, you're working blind. Your monitoring setup needs to capture this proactively, before incidents happen, not reactively as a consequence of them.

Pre-established kill-switch. Before an agent deploys, the runbook must document exactly how to stop it — completely, immediately, for a specific session or for all sessions. This is not "submit a PR and redeploy." It's a procedure that can be executed in under five minutes by an on-call engineer who did not build the agent. Whether that's a kill-switch policy, a circuit breaker configuration, a feature flag, or an emergency API call — document it, test it in staging, and make it executable without needing the original author.
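Whatever the backing mechanism, the contract is the same: a flag the agent loop checks before every step, trippable by someone who didn't write the agent. The sketch below is a minimal in-process illustration — in production the flag would live in a feature-flag service or governance layer, not a `threading.Event`:

```python
import threading

# Minimal kill-switch sketch: a flag checked before every agent step.
# In-process Event used for illustration only; a real switch must be
# trippable from outside the agent process.
class KillSwitch:
    def __init__(self):
        self._stopped = threading.Event()
        self.reason = None

    def trip(self, reason: str) -> None:
        self.reason = reason
        self._stopped.set()

    def check(self) -> None:
        if self._stopped.is_set():
            raise RuntimeError(f"Agent halted by kill switch: {self.reason}")

switch = KillSwitch()

def run_step(step_fn):
    switch.check()           # refuse to act once the switch is tripped
    return step_fn()

run_step(lambda: "ok")       # executes normally before the trip
switch.trip("behavioral threshold breached")
# Any further run_step call now raises RuntimeError instead of acting.
```

The design choice that matters is checking before each step, not each session: a tripped switch stops the agent mid-session, which is the whole point during an incident.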

Blast radius protocol. The runbook needs a documented procedure for auditing what the agent did before containment. This means a checklist of every external system the agent has write access to, a procedure for querying each system's logs for activity from the affected session, and clear ownership for contacting those systems' teams when needed. This protocol should be defined before the incident — not assembled under pressure during it.

Behavioral SLA, not just operational SLA. Traditional SLAs cover availability: the service must respond within X ms, error rate must stay below Y%. Agents need a second tier: behavioral SLAs that define what constitutes unacceptable agent behavior. Token spend above X per session. Output confidence below Y. Tool call to a system outside the agent's defined scope. These behavioral thresholds are what trigger your incident response process before a customer complains — not after.

Most teams skip the behavioral SLA entirely because it requires answering a hard question: what, exactly, does your agent's correct behavior look like? You can't define acceptable without defining expected. The teams that survive their first agent incident are the ones that wrote down the answer before they needed it.
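Once written down, a behavioral SLA is just a set of per-session checks. This sketch assumes hypothetical thresholds and session fields — the specific numbers and tool names are placeholders you would replace with your agent's actual scope:

```python
# Illustrative behavioral SLA evaluated per session; thresholds and
# session fields are placeholders, not recommended values.
BEHAVIORAL_SLA = {
    "max_session_tokens": 50_000,
    "min_output_confidence": 0.6,
    "allowed_tools": {"crm.read", "kb.search", "ticket.update"},
}

def sla_violations(session: dict, sla: dict = BEHAVIORAL_SLA) -> list[str]:
    """Return every behavioral threshold this session crossed.
    Any non-empty result should page on-call — before a customer does."""
    violations = []
    if session["tokens"] > sla["max_session_tokens"]:
        violations.append("token budget exceeded")
    if session["confidence"] < sla["min_output_confidence"]:
        violations.append("output confidence below floor")
    out_of_scope = set(session["tools_called"]) - sla["allowed_tools"]
    if out_of_scope:
        violations.append(f"out-of-scope tools: {sorted(out_of_scope)}")
    return violations

session = {"tokens": 180_000, "confidence": 0.9,
           "tools_called": ["crm.read", "billing.refund"]}
```

Wiring `sla_violations` into the alerting path is what turns "the agent is behaving oddly" from a customer complaint into a pageable condition.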


How Waxell handles this

Waxell Observe instruments AI agents across any framework to capture the complete session execution graph — every LLM call, tool invocation, external API request, token cost, and timing data — stored as a durable execution record that exists before an incident happens, not assembled during it. When the 2am alert fires, your on-call engineer opens the session trace and sees exactly what the agent did, step by step, rather than piecing together scattered logs from downstream systems. Operational governance policies — circuit breakers, per-session token budgets, behavioral thresholds — give you the kill-switch mechanism that every runbook needs: a pre-configured rule that terminates the session automatically when a behavioral threshold is crossed, before the incident compounds. You define the behavioral SLA once at the governance layer; Waxell enforces it every session. The production telemetry layer surfaces session-level cost, latency, and behavioral metrics in real time — so the behavioral threshold breach is an alert you act on, not a customer complaint you explain.

To build the session-level visibility your runbook requires, get early access to Waxell.


Frequently Asked Questions

What is AI agent incident response?
AI agent incident response is the set of practices and procedures for detecting, containing, investigating, and learning from failures in autonomous AI agent systems running in production. It differs from traditional software incident response because agent failures are often non-deterministic (cannot be reliably reproduced), behavioral (the agent did something unexpected rather than throwing an error), and have a blast radius that extends across every system the agent had tool access to during the failing session — not just the service that generated an alert.

How is an AI agent incident different from a software service incident?
A software service incident usually involves a discrete failure you can identify, reproduce, and fix: a 500 error, a timeout, a null pointer. Agent incidents are often behavioral — the agent completed successfully but did something wrong. Error rates don't capture them because the API calls succeeded. Blast radius is broader because agents have tool access to multiple external systems. And reproduction is difficult because agent behavior is non-deterministic; the same input may not produce the same failure twice. Standard runbook steps (trace the failing request, reproduce in staging, fix and verify) require significant adaptation for agent incidents.

What should be in a runbook for AI agents?
An agent runbook needs four elements that service runbooks typically don't have: a procedure for pulling the complete session execution trace (not just the error log), a pre-established kill-switch procedure that can be executed in under five minutes without the original author, a blast radius audit protocol listing every external system the agent can write to and how to check their logs, and documented behavioral SLAs that define what constitutes unacceptable agent behavior — token spend per session, output confidence thresholds, tool scope violations — so incidents are triggered by policy breaches rather than customer complaints.

Why do AI agents get shut down after their first production incident?
The pattern is consistent across organizations: an agent hits a novel edge case, behaves unexpectedly, and causes some customer-visible harm. The incident response process is slow and disorganized because nobody defined incident ownership before the agent deployed. By the time the root cause is understood, confidence in the agent is gone and the deployment is rolled back indefinitely. The technical failure was recoverable. The organizational unreadiness wasn't. Agents that survive their first production incident are almost always ones whose teams pre-defined ownership, kill-switch procedures, and behavioral SLAs before deploying — not as bureaucracy, but as operational prerequisites.

