DevHelm

Originally published at devhelm.io

Runbooks: Anatomy, Examples, and the AI-Executable Format

The wiki page nobody opens. The Confluence doc that's six months stale. The Notion entry that gets read once during the postmortem and then forgotten. Most "runbooks" fail because they were written for nobody in particular — neither a fresh on-caller at 3 AM, nor a tenured engineer who already knows the system, nor an AI agent that might be the first responder. They serve no one, and they rot quietly.

A useful runbook is a specific, narrow thing: a tightly scoped, executable procedure that turns one known failure into one known recovery. This post pins down what a runbook actually is (and what it isn't), shows the seven sections a good one contains, walks through a worked example you can copy, and ends with the structure that makes a runbook executable by an AI agent — because increasingly that's who reads it first.

What is a runbook (and what it isn't)

A runbook is a document that tells you how to handle one specific operational situation, end-to-end. The trigger that brings you to it, the symptoms you should see, the commands that confirm what's wrong, the steps that fix it, and the checks that prove it's fixed. One runbook covers one failure mode.

It's not the same as some adjacent documents people lump under the term:

| Document | Scope | Audience | When you reach for it |
| --- | --- | --- | --- |
| Runbook | One specific failure mode (e.g. "API p95 latency above SLO") | On-caller, AI agent, or a teammate paged into an active incident | When that exact alert fires |
| SOP (standard operating procedure) | Routine, non-incident operations (e.g. "Rotate database credentials quarterly") | Operator on a schedule | On a calendar trigger |
| Playbook | A class of incidents with branching (e.g. "Customer reports degraded API performance") | Incident commander making routing decisions | At the start of an unknown incident |
| Dashboard | A live view of system state | Anyone investigating | Continuously, during and outside incidents |

The most common mistake is conflating runbooks with playbooks. A playbook is a tree of questions ("Is the database the bottleneck? If yes, go to runbook X. If no, check Y."). A runbook is a leaf of that tree — the actual recovery procedure once you've narrowed down which failure you're looking at. (The PagerDuty incident response guide is a good example of a playbook that links to many runbook-like procedures.) If your "runbook" is more than ~500 lines or covers more than one failure mode, it's a playbook and the runbooks it would link to don't exist yet.

The second common mistake is writing one runbook per service. A service has dozens of failure modes; lumping them all into one document means nobody can find the relevant section under pressure. One runbook, one failure mode, one alert. A slow DNS lookup and an SSL certificate error are two different failure modes — they get two different runbooks, even though they may live on the same load balancer.

The anatomy of a useful runbook

Most runbook templates you'll find on the internet ask for a dozen sections: purpose, scope, owners, dependencies, change history, related links, escalation matrix, last-reviewed date. Almost none of that is useful while an alert is paging. The reader has perhaps 30 seconds of attention to spare and is looking for one thing: what to do.

A good runbook contains exactly seven sections:

  1. Trigger — the precise alert or signal that brought the reader here. Not "this is for API issues"; "this runbook is opened when the api-latency-p95-high alert fires."
  2. Symptoms — what the reader can confirm right now. Specific commands, expected output. "The p95 latency panel shows >1s for 5+ minutes; error-rate panel is flat (rules out a 5xx storm)."
  3. Diagnosis — commands to confirm the failure and rule out lookalikes. Each command in a fenced code block; expected output annotated.
  4. Mitigation steps — ordered, idempotent, each with a runnable command. If a step depends on the previous one succeeding, say so.
  5. Verification — how the reader knows it worked. Concrete checks: "the http_request_duration_seconds p95 drops below 500ms for 10 consecutive scrape intervals."
  6. RTO and what data you lose — expected duration of the recovery and any acceptable data loss. The reader needs to know whether this is a 30-second fix or a 30-minute restore so they can communicate up.
  7. Escalation path — when and to whom you escalate if the steps don't work. Real names or rotation references, not "the DBA team."

That's it. Everything else (owners, related links, last-reviewed date) belongs in the file's front-matter or repository metadata, not in the body the on-caller reads while their phone is buzzing. For more on why RTO matters as a success criterion, see MTTR Full Form.
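One way to keep the seven-section shape from drifting is a small lint check in CI. The sketch below rests on assumptions this post does not prescribe: each runbook is a Markdown file, each section is a `## ` heading named exactly as in the list above, and `missing_sections` is a hypothetical helper name.

```python
from __future__ import annotations

import re

# The seven section names from the list above, in reading order.
REQUIRED_SECTIONS = [
    "Trigger",
    "Symptoms",
    "Diagnosis",
    "Mitigation steps",
    "Verification",
    "RTO and what data you lose",
    "Escalation path",
]

# Assumption: sections are level-2 Markdown headings ("## Trigger").
HEADING = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no matching heading."""
    found = {m.group(1).lower() for m in HEADING.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]
```

Run against a draft that stops after Diagnosis, it returns the four sections still to be written, which makes a reasonable CI failure message.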

Worked example: API p95 latency runbook

Below is a condensed runbook for a common SaaS failure mode — API latency crossing an SLO threshold while error rates stay flat (often a saturation or dependency slowdown, not a hard outage). The names are illustrative; swap in your service labels and metric names.

Scenario: API p95 latency above SLO

Trigger: the api-latency-p95-high alert (Prometheus rule: p95 > 1s for 5m, error rate < 1%).

Symptoms: Grafana "API latency" panel red; "API errors" panel green. Recent deploy in the last 30 minutes (check CI) OR no deploy (points to dependency or traffic spike).

Diagnosis: (1) kubectl get pods -n api -l app=api — any Not Ready? (2) curl -s http://api.internal/health | jq '.status' — expect "UP". (3) Compare p95 by route in Grafana — one route or all routes?

Mitigation: if post-deploy → roll back to previous revision (kubectl rollout undo deployment/api -n api). If all routes slow and health is UP → check upstream dependency status pages; throttle non-critical traffic if you have a feature flag.

Verification: p95 < 500ms for 10 consecutive scrape intervals; error rate unchanged; no new pages in 15 minutes.

RTO: 5–15 minutes for rollback path; 30–60 minutes for dependency-wait path.

Escalation: if rollback fails twice or p95 still >1s after 30 minutes → page platform lead with dashboard link and deploy SHA.

Notice the shape: each section has a single job. There's no preamble about "the importance of SLOs." The reader who arrived from the alert wants four things in this order — is this the right runbook, what should I see, what should I run, did it work — and the document delivers all four within the first screen.
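The Verification line in the example ("p95 < 500ms for 10 consecutive scrape intervals") is easy to misjudge by eye on a dashboard. A minimal sketch of the same check, assuming you have already fetched the p95 samples in seconds, oldest first, from whatever query API you use; `recovery_verified` and its parameter names are hypothetical.

```python
from __future__ import annotations


def recovery_verified(
    p95_samples: list[float],
    threshold_s: float = 0.5,
    required: int = 10,
) -> bool:
    """True if the most recent `required` samples are all below the threshold."""
    if len(p95_samples) < required:
        # Not enough data since the fix; keep waiting rather than declare success.
        return False
    return all(sample < threshold_s for sample in p95_samples[-required:])
```

Only the trailing window counts, so earlier high samples from the incident itself do not block verification once ten clean intervals have passed.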

AI-readable runbooks: structure that an agent can execute

Increasingly the first responder to an incident is not a human. An on-call agent (Cursor, Claude Code, or a dedicated SRE bot) can receive the same alert payload as a human and start triage before anyone is paged — if the alert carries a runbook_url and the runbook body is structured for machines, not just humans.

For that to work, the runbook has to be structured so an agent can extract steps and act on them. The seven sections above are necessary but not sufficient. Five additional properties make a runbook AI-executable:

  1. The trigger is a machine-parseable query, not a description. "Looks slow" can't be matched against telemetry; histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m])) > 1.0 can.
  2. Commands live in fenced code blocks with language tags (bash, sql, yaml). The agent (and any markdown parser) needs structural cues to know what's executable.
  3. Expected output is colocated with the command. A step that says "run kubectl get pods" without telling the agent what success looks like is non-executable — there's no way to verify the step worked before moving on.
  4. Failure modes branch explicitly. "If health is UP but p95 is still high, go to Check 3 (dependency status); if pods are Not Ready, go to Check 2 (roll back)" is executable. "If needed, escalate" is not — the agent can't decide what "needed" means.
  5. No prose-only sections in the recovery body. Every step has a runnable artifact or a verifiable check. Background narrative belongs in a separate "Why this happens" section that the agent can skip if it's already remediating.
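Properties 2 and 3 above can be checked mechanically. The sketch below pulls tagged fenced blocks out of a Markdown runbook and pairs each with an "Expected:" line if one follows it. The "Expected:" prefix matches the example later in this post; the `Step` class and both function names are hypothetical, and the parser is a deliberately small regex rather than a full Markdown implementation.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

FENCE = "`" * 3  # three backticks, built this way so this example nests cleanly

# A fenced block with a language tag, up to its closing fence.
BLOCK = re.compile(
    rf"^{FENCE}(\w+)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


@dataclass
class Step:
    language: str          # fence language tag, e.g. "bash"
    command: str           # body of the fenced block
    expected: str | None   # text after "Expected:" if it directly follows the block


def extract_steps(markdown_text: str) -> list[Step]:
    steps = []
    for match in BLOCK.finditer(markdown_text):
        # Look at the first non-blank line after the closing fence.
        tail = markdown_text[match.end():].lstrip("\n")
        first_line = tail.split("\n", 1)[0].strip()
        expected = None
        if first_line.startswith("Expected:"):
            expected = first_line[len("Expected:"):].strip()
        steps.append(Step(match.group(1), match.group(2).strip(), expected))
    return steps


def non_executable(steps: list[Step]) -> list[Step]:
    """Steps an agent cannot verify because no expected output is colocated."""
    return [step for step in steps if step.expected is None]
```

A CI job could fail the pull request whenever `non_executable` returns anything, which turns property 3 from a style preference into an enforced rule.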

A human-only version of a step:

"Check whether latency is still elevated. You can look at the metrics in Grafana, or curl the health endpoint. If it's still slow, you'll want to investigate why."

The same step, AI-executable:

Check 1 — is the API still degraded?

curl -s http://api.internal/health | jq -r '.status'

Expected: UP. Then confirm latency in Prometheus:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))

Expected: below 0.5 (500ms). If above 1.0 for two consecutive evaluations, proceed to Check 2 (recent deploy).

Same information, but the agent can run it, parse the output, and decide whether to advance. That's the bar.
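To make "decide whether to advance" concrete, here is a rough sketch of the decision an agent could derive from Check 1. The two thresholds come from the step above. The action labels, the function name, and the fallback for readings between 0.5 and 1.0 are assumptions, since the step does not say what to do in that range.

```python
from __future__ import annotations


def next_action(health_status: str, p95_evaluations: list[float]) -> str:
    """Decide what follows Check 1 from the health status and recent p95 values (seconds)."""
    # From the step: above 1.0 for two consecutive evaluations -> Check 2.
    if len(p95_evaluations) >= 2 and all(v > 1.0 for v in p95_evaluations[-2:]):
        return "check-2"
    # From the step: health is UP and p95 is below 0.5, so Check 1 passes.
    if health_status == "UP" and p95_evaluations and p95_evaluations[-1] < 0.5:
        return "done"
    # Assumption: anything else is undecided, so evaluate again later.
    return "re-evaluate"
```

Writing the branch out this way also exposes gaps in the runbook: any input that lands in the fallback is a case the prose never covered.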

Runbook hygiene: where to store them, how to find them at 3 AM

A runbook that exists but can't be found in an incident is worse than no runbook — it costs minutes while the on-caller searches for it. Three rules cover most of the discoverability problem:

Store runbooks in Git, next to the code. Confluence and Notion fail in two ways: they can become unreachable during the very outage you are working, because they may depend on the same DNS or auth provider as your own stack, and they have no review workflow that catches stale content. A runbook in runbooks/api-latency-p95-high.md sits in the path of every pull request that changes the surrounding service, so reviewers can require authors to update the runbook or explain why not.

Link every alert to its runbook. Use the annotation field your alerting system provides. For Prometheus / Grafana, that's runbook_url:

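- alert: ApiLatencyP95High
  # The error ratio is aggregated with sum() before dividing. A raw
  # rate(5xx) / rate(all) pairs each 5xx series only with itself, so the
  # ratio is always 1 and the alert could never fire. "or vector(0)" covers
  # the case where no 5xx series exist at all.
  expr: |
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m])) > 1
    and on()
    (sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) or vector(0))
      / sum(rate(http_requests_total{job="api"}[5m])) < 0.01
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "API p95 latency above 1s for 5m with error rate below 1%."
    runbook_url: "https://docs.your-company.com/runbooks/api-latency-p95-high"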

The alert payload that reaches the pager (and the AI agent, if you run one) carries the URL. The on-caller's first click is straight into the right procedure.
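On the receiving side, pulling the link out of the payload takes a few lines. This sketch assumes the payload follows the Alertmanager webhook shape (a top-level `alerts` list whose items carry `status`, `labels`, and `annotations`); other pagers nest these fields differently, and `runbook_urls` is a hypothetical helper name.

```python
from __future__ import annotations


def runbook_urls(payload: dict) -> dict[str, str]:
    """Map alert name -> runbook_url for firing alerts that carry the annotation."""
    urls = {}
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no runbook
        url = alert.get("annotations", {}).get("runbook_url")
        if url:
            name = alert.get("labels", {}).get("alertname", "unknown")
            urls[name] = url
    return urls


# Trimmed sample payload; real payloads carry more fields.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ApiLatencyP95High", "severity": "page"},
            "annotations": {
                "runbook_url": "https://docs.your-company.com/runbooks/api-latency-p95-high"
            },
        }
    ],
}
```

An alert that fires without the annotation simply does not appear in the result, which is a useful signal in itself: it is a page with no runbook behind it.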

One runbook per failure mode, named for the failure. api-latency-p95-high.md, not api.md. When the page fires, the alert name and the file name match — no search needed.
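If your alert rules use CamelCase names (as `ApiLatencyP95High` does in the rule above) while files use kebab-case, the match can still be mechanical. A small sketch; the function name and the `.md` suffix are assumptions about your repository layout.

```python
import re


def runbook_filename(alert_name: str) -> str:
    """Convert a CamelCase alert name to its kebab-case runbook filename."""
    # Insert a hyphen wherever a lowercase letter or digit is followed by an uppercase letter.
    kebab = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", alert_name).lower()
    return f"{kebab}.md"
```

With this rule, `ApiLatencyP95High` maps to `api-latency-p95-high.md`, and a name that is already kebab-case passes through unchanged.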

For decay management: review each runbook quarterly, archive any with zero hits in 90 days, and treat a stale runbook found mid-incident as a sev3 of its own — the on-caller files a ticket to fix it; otherwise nobody does.
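The decay rules can be turned into a periodic report. This is a sketch under assumptions: it treats "quarterly" as 90 days, and it presumes you can obtain a last-reviewed date (for example from front-matter) and a last-hit date (for example from link analytics) per runbook. The input shape and the name `decay_actions` are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta


def decay_actions(runbooks: dict[str, dict], today: date) -> dict[str, str]:
    """Map runbook filename -> "archive" or "review"; healthy runbooks are omitted.

    Each value looks like {"last_reviewed": date, "last_hit": date | None}.
    """
    cutoff = today - timedelta(days=90)
    actions = {}
    for name, meta in runbooks.items():
        last_hit = meta["last_hit"]
        if last_hit is None or last_hit < cutoff:
            actions[name] = "archive"   # zero hits in the last 90 days
        elif meta["last_reviewed"] < cutoff:
            actions[name] = "review"    # still used, but overdue for review
    return actions
```

Archiving takes priority over review here, on the reasoning that a runbook nobody opens is not worth a reviewer's time; swap the branches if your team prefers the opposite.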

How DevHelm fits runbooks into your incident flow

DevHelm is built for the moment an alert fires and someone (human or agent) needs context fast. What's shipped today:

  • Alert channels (PagerDuty, Slack, webhook, email) pass through the payload your upstream system sends. If your Prometheus or Grafana alert includes a runbook_url annotation, that URL can ride along in the notification DevHelm dispatches.
  • Vendor status context on dependency status pages — when latency looks like an upstream problem, the runbook's "check dependency status" step has a concrete destination instead of a generic Google search.
  • Resource groups (see MTTR Full Form) collapse multiple monitors that share one failure mode into one incident — so the runbook link in the notification matches one root cause, not three duplicate pages.

What's not yet shipped: a first-class runbook_url field on DevHelm monitors — the kind that would let you set it once on the monitor and have it flow into every notification and MCP tool response automatically. Until then, put the URL in the monitor description and your alert template. The reliability page covers how we operate our own stack; you don't need our internal runbook repo to apply the patterns in this post.

Where to start

Pick your noisiest recurring alert — the one that woke someone up twice last quarter — and write one runbook for it. Seven sections, under 500 lines, stored in Git, linked from the alert annotation. That's the whole commitment.

If you've been troubleshooting slow DNS lookups or SSL certificate errors, you've already done most of the work: those investigations follow exactly the trigger → diagnosis → fix → verify shape described above. Turning them into a runbook is a matter of formatting what you already know so the next person (or agent) doesn't have to rediscover it. And once the runbook exists, measuring whether it actually shortens recovery is what MTTR is for.

Spin up a free account at app.devhelm.io and connect your first dependency status feed in 60 seconds — useful when your runbook's diagnosis step says "check if the vendor is degraded." For AI-native setup, npx devhelm skills install --target cursor installs the skill bundle that can create monitors from your editor.


Originally published on DevHelm.
