Iyanu David

The Runbook Is Already Lying to You.

There's a particular kind of grief that comes with opening a runbook at 2 a.m. and realizing, halfway through step four, that whoever wrote this was describing infrastructure that no longer exists. The load balancer they're referring to got renamed in a Terraform refactor eight months ago. The S3 bucket path is wrong. The Slack channel they say to ping was archived when the team reorganized. You're standing in the dark, flashlight pointed at a map of a city that was demolished and rebuilt while you were sleeping.

This is not a rare experience. It is, for most SREs working in organizations with any meaningful deployment velocity, the default experience.

Static runbooks have a half-life. In a system where services redeploy multiple times a day, where infrastructure is provisioned and torn down by CI pipelines nobody reads carefully, and where team boundaries shift and ownership gets murky—a document that was accurate when written becomes fiction surprisingly quickly. Not maliciously. Just entropically. The person who wrote it moved to a different squad. The service they were describing got absorbed into a platform team. Nobody updated the runbook because updating runbooks is not on anybody's OKRs.

So you work around it. You've learned to treat runbooks as approximate, not authoritative. You read them with one eye while your other eye is in the actual system—CloudWatch, Datadog, the GitHub blame on the service config. The runbook is a starting hypothesis. You're the reasoning layer on top of it.

The question that's been nagging at the field for the past two years is whether that reasoning layer can be partially offloaded to something that doesn't need sleep.

What an Agent Actually Does (Mechanically)
Before anything else, let's be concrete about what "AI agent for incident response" actually means at the implementation level, because the vendor language obscures more than it reveals.

A modern LLM-based incident agent is, at its core, a retrieval-augmented generation system with tool-use capabilities bolted on. Here's the rough architecture:

When an alert fires, the agent receives a structured payload—PagerDuty event, CloudWatch alarm, whatever your alerting fabric emits. It parses that payload and uses it to construct a semantic query against a vector index that contains your runbooks, postmortems, architecture docs, service READMEs, and recent incident notes. The retrieval step surfaces the three or five most semantically relevant chunks of institutional knowledge. These get stuffed into the LLM's context window alongside the raw alert data, recent log excerpts, and current metric snapshots.
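To make that pipeline concrete, here's a minimal sketch in Python. The `vector_index` and `embed` objects are stand-ins for whatever embedding store and model you actually run (pgvector, OpenSearch, a managed service), and the payload field names are assumptions, not any vendor's schema.

```python
# A minimal sketch of the retrieval step, under the assumptions above.
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str         # "runbook", "postmortem", "architecture-doc"
    text: str
    last_modified: str  # ISO-8601 with offset; used later for staleness checks

def build_context(alert: dict, vector_index, embed, k: int = 5) -> str:
    # Turn the structured alert payload into a semantic query string.
    query = f"{alert['alarm_name']} {alert['service']} {alert.get('description', '')}"
    # Pull the k most semantically similar chunks of institutional knowledge.
    chunks = vector_index.search(embed(query), top_k=k)
    retrieved = "\n---\n".join(f"[{c.source}] {c.text}" for c in chunks)
    # Stuff retrieval results into the prompt alongside the raw signal.
    return (
        f"ALERT PAYLOAD:\n{alert}\n\n"
        f"RETRIEVED CONTEXT:\n{retrieved}\n\n"
        f"RECENT LOGS:\n{alert.get('log_tail', '')}"
    )
```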

Then the model reasons. Not reasoning in the philosophical sense—reasoning in the pragmatic sense of "Given this alert signature plus these retrieved steps plus this log tail, what is the most probable next action?" It generates a structured response: either an observation ("this looks like the same OOM pattern from the March 14th incident") or an action ("restart the worker pod in the payments namespace").

The action part is where it gets interesting and dangerous. Tool use means the agent has access to actual API calls. It can invoke kubectl. It can call your restart endpoint. It can open a PagerDuty incident and auto-populate the summary. It can query your CMDB to find the service owner. It can pull a Grafana snapshot and attach it to the incident ticket. All of this happens without a human typing a single command.
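A rough sketch of what that tool-use layer looks like in practice: the model emits a structured action, and a thin dispatcher maps it onto real API calls through an explicit allowlist. The tool name and JSON shape here are illustrative assumptions, not any particular framework's interface.

```python
import json
import subprocess

def restart_pod(namespace: str, deployment: str) -> str:
    # `kubectl rollout restart` is idempotent and bounded in blast
    # radius, which is what makes it a plausible autonomous action.
    result = subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# An explicit allowlist: the model can only name tools that exist here.
TOOLS = {"restart_pod": restart_pod}

def dispatch(model_output: str) -> str:
    # Expected shape: {"tool": "restart_pod", "args": {"namespace": ..., "deployment": ...}}
    action = json.loads(model_output)
    if action["tool"] not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {action['tool']}")
    return TOOLS[action["tool"]](**action["args"])
```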

Datadog's Bits agent—their flagship demonstration of this pattern—reportedly took routine triage tasks that previously required an engineer to manually correlate telemetry across four or five surfaces and compressed them into automated sequences. The 95% MTTR reduction number gets cited constantly, and while I'm appropriately skeptical of vendor-reported benchmarks (there's selection bias baked into every case study), the directionality is consistent with what teams report in practice. Triage time drops dramatically. Resolution time drops less dramatically because the hard part of incident response was never locating the problem; it was understanding why and deciding what to change.

Where the Architecture Fractures
Here's what the vendor pitch doesn't linger on: The quality of everything the agent retrieves is only as good as the quality of everything you've indexed. Which means the garbage-in, garbage-out problem doesn't go away. It moves.

Previously, the garbage was in the runbook that nobody updated. Now, the garbage is in your vector index that nobody curated. Same problem, different substrate. If your postmortems are inconsistently structured—some written with meticulous RCA detail, others knocked out in ten minutes by someone who just wanted to close the ticket—the retrieval quality degrades proportionally. The agent will surface the badly written postmortem with the same confidence as the brilliant one, because semantic similarity is not the same as epistemic quality.

This is the first fracture point: knowledge quality degrades, and the agent has no native way to express uncertainty about the provenance of its retrieved context. It doesn't know that the runbook it pulled was last updated two years ago. You have to build that into the metadata and the retrieval logic yourself, filtering by last-modified date or flagging staleness explicitly in the chunk.
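Here's one way that staleness flagging might look, assuming each indexed chunk carries a `last_modified` timestamp in its metadata (which you have to populate at ingestion time; nothing does it for you). The 180-day cutoff is an arbitrary placeholder.

```python
from datetime import datetime, timezone

STALE_AFTER_DAYS = 180  # arbitrary cutoff; tune to your doc-rot rate

def annotate_staleness(chunks):
    # Expects the Chunk shape from the earlier sketch, with
    # `last_modified` stored as ISO-8601 including a UTC offset.
    now = datetime.now(timezone.utc)
    for c in chunks:
        age_days = (now - datetime.fromisoformat(c.last_modified)).days
        if age_days > STALE_AFTER_DAYS:
            # Don't silently drop stale knowledge; flag it so the model
            # (and the human reading the trace) can weigh it accordingly.
            c.text = f"[WARNING: last updated {age_days} days ago] {c.text}"
    return chunks
```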

The second fracture is dynamic state. An agent reasoning about a production incident needs to know the current topology. Not the topology as described in a document from Q3. It needs to know which pods are actually running right now, what the current traffic split is, and whether the circuit breaker for the upstream service is open or closed. This requires live tool calls into your observability and orchestration layers—and those integrations are nontrivial to build reliably. A misconfigured tool call that returns stale data is worse than no tool call because it creates confident wrongness.
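One defensive pattern, sketched below under the assumption that your live-state queries can report when their data was observed: reject anything that isn't demonstrably fresh rather than handing it to the model.

```python
import time

MAX_STALENESS_SECONDS = 60  # placeholder; tune per data source

def fresh_or_fail(fetch, *args):
    # `fetch` is a stand-in for your real live-state query; assumed to
    # return {"observed_at": <epoch seconds>, "data": ...}.
    result = fetch(*args)
    age = time.time() - result["observed_at"]
    if age > MAX_STALENESS_SECONDS:
        # Better to tell the model "state unknown" than to let it
        # reason over a topology snapshot from twenty minutes ago.
        raise RuntimeError(f"Live state is {age:.0f}s old; treating as unknown")
    return result["data"]
```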

The third fracture, and the one that makes experienced operators most nervous: blast radius under automation. A human SRE who makes a wrong call during an incident can typically course-correct within seconds. They see the system respond to their action, recognize the misfire, and countermand it. An agent executing a sequence of tool calls can get two or three steps down a wrong branch before the feedback loop closes. Restarting a service is usually safe. Restarting a service while simultaneously adjusting its upstream rate limits while modifying a feature flag is a combination of state changes that might be fine in isolation and catastrophic together.

This is why every serious deployment I've heard about starts with the agent operating in advisory mode for months before it's given actual execution permissions. You let it run alongside the human responder, generating recommendations in parallel, and you audit whether those recommendations were correct. You build a calibration dataset. Only when the agent's recommendation accuracy on your specific systems, with your specific failure modes, exceeds some threshold do you start granting it write access.
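In advisory mode, the only artifact that matters is the calibration log. A minimal sketch, assuming you can capture both the agent's recommendation and the responder's actual action per incident (the field names are mine, not a product's):

```python
import json
from datetime import datetime, timezone

CALIBRATION_LOG = "agent_calibration.jsonl"  # placeholder path

def record_shadow_run(incident_id: str, agent_recommendation: dict,
                      human_action: dict, agreed: bool) -> None:
    # One line per incident: what the agent would have done, what the
    # human actually did, and whether the responder endorsed the call.
    with open(CALIBRATION_LOG, "a") as f:
        f.write(json.dumps({
            "incident_id": incident_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": agent_recommendation,
            "human": human_action,
            "agreed": agreed,
        }) + "\n")

def recommendation_accuracy() -> float:
    # The number that decides when (and whether) write access is granted.
    with open(CALIBRATION_LOG) as f:
        runs = [json.loads(line) for line in f]
    return sum(r["agreed"] for r in runs) / len(runs) if runs else 0.0
```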

The Tiered Model Is Correct but Undersold
PagerDuty's tiered automation framework—Tier-1 fully automated, Tier-2 agent-assisted, and Tier-3 human-led—maps well to operational reality, but the framing undersells how much judgment goes into drawing those tier boundaries.

Tier-1 is for incidents with a known, mechanical response: the auto-scaling group needs to scale out, the cron job needs a kick, the dead-letter queue needs to be flushed. These are incidents where the diagnostic step is essentially zero—the alert is self-describing—and the remediation is idempotent and bounded in blast radius. An agent is not just appropriate here; it's better than a human, because it's faster and doesn't need to be woken up.

But the boundary between Tier-1 and Tier-2 is not a clean line. It's a fuzzy region that shifts based on the current state of the system, the time of day, recent deployment activity, and whether you're in a change freeze. An auto-scaling action that's safe at 10 a.m. on a Tuesday might be dangerous at 11 p.m. when a half-deployed migration is in flight. Building agents that are contextually aware of these situational factors requires more than basic runbook retrieval—it requires the agent to have a coherent model of operational risk at this moment, which is a significantly harder problem.

This is where I'd push back on the "just map your runbook steps to agent actions" framing from onboarding checklists. The mapping is necessary but insufficient. What you actually need to encode is the conditional logic: when is this action safe to take autonomously, and when should it escalate? That logic is often implicit in experienced engineers' heads. It never made it into the runbook because it was the kind of context-sensitive judgment that seemed obvious to the person writing the doc. Externalizing it is hard work.
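Externalized, that conditional logic looks less like a runbook step and more like a safety predicate. A sketch, where every input (the change-freeze flag, deploy recency, the idempotent-action list, severity) is an assumed feed from your own systems:

```python
from datetime import datetime, timedelta, timezone

def safe_to_autoexecute(action: str, state: dict) -> bool:
    # All fields of `state` are assumptions about what your own
    # systems can report; nothing provides these for free.
    now = datetime.now(timezone.utc)
    if state["change_freeze"]:
        return False  # nothing runs unattended during a freeze
    if now - state["last_deploy_at"] < timedelta(hours=1):
        return False  # a half-deployed change may still be in flight
    if action not in state["idempotent_actions"]:
        return False  # only bounded, repeatable remediations qualify
    if not (9 <= now.hour < 18) and state["high_severity"]:
        return False  # off-hours plus high severity: escalate instead
    return True
```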

The Operational Transformation Nobody Talks About
Here's the thing that changes most when you deploy effective incident automation: the nature of on-call fatigue shifts but doesn't disappear.

Pre-agent, on-call fatigue is primarily about volume and interruptions. You're getting paged at 3 a.m. for something an agent could handle. That's the exhausting part. Post-agent, that fatigue decreases meaningfully, probably. The 81% of executives who say they'd trust agents to act during crises are not wrong to feel that way in principle.

But a different fatigue emerges. Your on-call engineers are no longer triaging alerts. They're increasingly triaging agent decisions. They're reviewing the audit trail of what the agent did and verifying it made sense. They're being paged for Tier-3 escalations, which by definition are the hardest incidents—the ones with novel failure modes, cascading dependencies, and unclear ownership. The stuff that didn't fit any pattern in the training data. On-call is now a job for senior people with deep system knowledge, because the easy stuff got automated away.

Whether that's better depends on the person. Some engineers will love it. The mechanical toil evaporates; what remains is genuinely interesting. Others will find it more cognitively taxing. You're not getting paged less often; you're getting paged for harder problems. The variance in incident severity goes up even if the mean severity goes down.

What You'd Actually Do on Monday Morning
Start with the audit, not the tooling. Before you think about which agent framework to evaluate, spend a week cataloguing your actual incident response patterns over the last six months. What alert types recur most frequently? For each, what's the median response sequence—the actual steps an engineer takes, not the steps the runbook prescribes? Those two things are often different, and the delta between them is where the institutional knowledge lives that never got documented.

Then ask: Which of those sequences are fully mechanical? Not "mostly mechanical" or "mechanical unless something weird is happening"—fully mechanical. Those are your Tier-1 candidates. The list will be shorter than you expect.

For the vector index: structure your postmortems before you index them. Postmortems that include a structured "failure signature" field—a brief, queryable description of the error pattern—retrieve dramatically better than unstructured prose. Same for runbooks: adding a YAML frontmatter block that specifies the alert types the runbook applies to, the services involved, and the last-validated date will improve retrieval precision considerably. This is ingestion engineering, not AI engineering, and it's where most teams underinvest.
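As a sketch of what that frontmatter might contain, with an ingestion-time check that rejects runbooks missing it (the field names are an assumption; the point is to pick one schema and enforce it):

```python
import yaml  # PyYAML

# Illustrative frontmatter for a single runbook.
EXAMPLE_FRONTMATTER = """
alert_types: ["HighOOMKillRate", "WorkerQueueBacklog"]
services: ["payments-worker"]
owner: "team-payments"
last_validated: "2025-06-01"
"""

REQUIRED_FIELDS = {"alert_types", "services", "owner", "last_validated"}

def validate_frontmatter(raw: str) -> dict:
    meta = yaml.safe_load(raw)
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        # Reject at ingestion: a runbook the index can't filter on is
        # a runbook the agent will surface at the wrong moment.
        raise ValueError(f"Runbook frontmatter missing fields: {missing}")
    return meta
```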

Then instrument your agent to be paranoid about its own confidence. A recommendation with a confidence score below your threshold should page the engineer with the agent's reasoning, not just the raw alert. "I retrieved three relevant postmortems, but they all describe slightly different failure modes; please review before I take action" is more useful than either silent execution or a generic escalation. The agent's uncertainty is a signal. Preserve it.
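Sketched as code, the gate itself is small; the hard part is calibrating the threshold against your advisory-mode data rather than a hunch. The 0.8 value and the recommendation fields here are placeholders:

```python
CONFIDENCE_THRESHOLD = 0.8  # placeholder; calibrate against shadow-mode data

def act_or_escalate(recommendation: dict, execute, page_engineer) -> None:
    if recommendation["confidence"] >= CONFIDENCE_THRESHOLD:
        execute(recommendation["action"])
    else:
        # Escalate with the reasoning attached; the uncertainty itself
        # is the signal, so don't flatten it into a generic page.
        page_engineer(
            summary=recommendation["action"],
            reasoning=recommendation["reasoning"],
            sources=recommendation["retrieved_sources"],
        )
```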

Finally: treat the agent's action log as a first-class artifact of your incident process. Every step the agent takes, every tool call it makes, and every piece of context it retrieves—that should be surfaced in your incident retrospective. You will learn things about how your systems fail by watching what the agent noticed that humans typically don't, and vice versa. The agent is a collaborator. Its reasoning trace is worth reading.

The runbook was never the problem. The problem was the gap between what the runbook said and what the system currently was. Agents don't eliminate that gap automatically—but they give you the infrastructure to narrow it continuously, if you're willing to do the unglamorous work of keeping the knowledge base honest.
