Samson Tanimawo

Posted on May 6

Building an Incident Response Playbook Library

#incidents #sre #runbooks #oncall

The Folder Full of Stale Runbooks

Every engineering org has a Confluence folder of incident runbooks. Every runbook was written during or after an incident. Each is a snapshot of how to fix one specific thing.

After 2 years, the folder has 400 runbooks. Nobody knows which are current. During the next incident, nobody opens any of them.

A playbook library is supposed to help. Usually it just generates debt.

What Playbooks Should Actually Do

A good playbook library:

Reduces cognitive load during incidents
Provides confidence to junior responders
Captures institutional knowledge
Trains new hires

A bad playbook library:

Is out of date
Is impossible to search
Has no ownership
Contradicts itself across runbooks

The difference is process, not content.

The Structure

Every runbook follows the same template:

# Title: [Specific Problem Being Fixed]

## When to Use This
- Alert that fires: [exact alert name]
- Symptoms: [what the user sees]
- Impact level: [SEV-1/2/3/4]

## Quick Fix (90 seconds)
1. Command to run
2. Command to run
3. Verification step

## Deeper Investigation (if quick fix fails)
- Check this
- Look at this dashboard
- Tail these logs

## Root Cause Category
- Known cause 1: [link to deeper doc]
- Known cause 2: [link to deeper doc]

## Escalation
- Primary: [role]
- Secondary: [role]
- SME: [specific person]

## Related Runbooks
- [Similar problem A]
- [Similar problem B]

## Metadata
- Owner: [team]
- Last Verified: [date]
- Expires: [date + 3 months]

Every field is required. Missing fields fail CI.

The Ownership Rule

Every runbook has an owner. The owner is responsible for:

Keeping it current
Re-verifying it quarterly
Updating when the underlying system changes
Deleting it when no longer relevant

If a runbook has no owner, delete it. A stale runbook is worse than no runbook: it wastes time during incidents and may contain wrong instructions.

The Expiration Date

Every runbook has a 90-day expiration. After 90 days:

CI warns the owner
After 30 more days, CI fails builds that reference it
After 60 more days, the runbook is auto-moved to archive

The owner must re-verify and reset the expiration date. Re-verification means:

Read it end to end
Try the commands in staging
Update anything that's changed
Set a new expiration date

This forces continuous maintenance. It's painful. It also means the runbooks are trustworthy.
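The escalation schedule above can be sketched as a small function that CI calls for each runbook. This is a sketch under one stated assumption: both grace periods (30 and 60 days) are counted from the expiration date. If you read "60 more days" as starting after the 30-day mark, change the constant. The status names are made up for illustration.

```python
from datetime import date

# Assumption: both grace periods are measured from the expiration date.
FAIL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 60


def runbook_status(expires: date, today: date) -> str:
    """Map a runbook's expiration date to the action CI should take."""
    days_overdue = (today - expires).days
    if days_overdue < 0:
        return "ok"
    if days_overdue < FAIL_AFTER_DAYS:
        return "warn"  # notify the owner
    if days_overdue < ARCHIVE_AFTER_DAYS:
        return "fail"  # fail builds that reference the runbook
    return "archive"  # move the runbook to the archive


print(runbook_status(date(2024, 4, 1), date(2024, 5, 10)))  # fail
```

Keeping the thresholds as named constants makes the policy easy to tune without touching the logic.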

The Discovery Problem

A library of 400 runbooks is useless if you can't find the right one during an incident.

Three techniques:

1. Alert-to-runbook mapping

Every alert includes a runbook link:

alert: HighErrorRate
annotations:
  summary: "API error rate above 5%"
  runbook: "https://runbooks.internal/api/high-error-rate"

When the alert fires, the runbook is one click away. No searching.
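The mapping only helps if every alert actually has a link, so it is worth linting. Here is a rough sketch, assuming the alert rules have already been parsed into dictionaries shaped like the example above (for instance by a YAML loader). The function name and the `DiskFull` rule are hypothetical.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules that have no usable runbook link."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        link = annotations.get("runbook") or ""
        # Treat anything that is not an http(s) URL as a missing link.
        if not link.startswith(("http://", "https://")):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {
        "alert": "HighErrorRate",
        "annotations": {
            "summary": "API error rate above 5%",
            "runbook": "https://runbooks.internal/api/high-error-rate",
        },
    },
    {"alert": "DiskFull", "annotations": {"summary": "Disk above 90%"}},
]
print(alerts_missing_runbook(rules))  # ['DiskFull']
```

This checks only that a link is present. Confirming that the link resolves to a live runbook would need a separate check.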

2. Symptom-based search

We tag runbooks by symptom:

symptoms:
- "slow response time"
- "database queries timing out"
- "connection pool exhausted"

During an incident, you search by symptom, not by service name.
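As an illustration, symptom search can be as simple as matching query words against the tags. The sketch below assumes the tags have been collected into a dictionary keyed by runbook path. The paths, the `5xx responses` tag, and the matching rule (every query word must appear in a single tag) are assumptions for the example.

```python
def search_by_symptom(runbooks: dict[str, list[str]], query: str) -> list[str]:
    """Return runbook paths with a symptom tag containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = []
    for path, symptoms in runbooks.items():
        if any(all(word in symptom.lower() for word in words) for symptom in symptoms):
            matches.append(path)
    return sorted(matches)


runbooks = {
    "db/pool-exhausted": ["connection pool exhausted", "database queries timing out"],
    "api/high-latency": ["slow response time"],
    "api/high-error-rate": ["5xx responses", "slow response time"],
}
print(search_by_symptom(runbooks, "slow response"))
# ['api/high-error-rate', 'api/high-latency']
```

Plain substring matching is crude, but it is predictable under stress, which matters more than recall during an incident.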

3. AI-assisted search

A Slack bot takes a natural-language description and returns the top 3 relevant runbooks. This only works if the runbooks are well structured.

The Writing Rule: Before You Have the Outage

The worst time to write a runbook is during an incident. You're stressed and in a hurry, so you'll write something incomplete.

The best time to write a runbook is:

During a change: new service? write the runbook before launch
During a quiet week: pick a service you're familiar with, write the runbook
During a post-mortem: document what you just learned while it's fresh
During a drill: tabletop exercises expose gaps

Companies that write runbooks proactively have better incident response than those that write them only reactively.

The Post-Incident Rule

Every incident with a post-mortem generates at least one runbook update. Either:

Create a new runbook if this was a novel problem
Update an existing runbook if the old instructions didn't work
Delete a runbook if it was wrong or misleading

No post-mortem is complete without a runbook change. We enforce this in the template.

The Hidden Cost of Playbooks

Maintaining a library of 100 current runbooks takes real time. Approximately:

Write new runbook: 2 hours
Quarterly verification: 30 min/runbook/quarter
Post-incident update: 1 hour/incident
Search index maintenance: 2 hours/month

Total: ~10 hours/week for a mature library

Budget this time. If you don't, your library will rot.
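To budget for your own library, it helps to work the numbers through. The sketch below uses the per-task costs listed above. The number of new runbooks and incidents per week are assumptions you need to supply, since they are not given in the list.

```python
WEEKS_PER_QUARTER = 13
WEEKS_PER_MONTH = 52 / 12


def weekly_maintenance_hours(
    runbooks: int,
    new_runbooks_per_week: float,
    incidents_per_week: float,
) -> float:
    """Estimate weekly upkeep from the per-task costs listed above."""
    writing = new_runbooks_per_week * 2.0  # 2 hours per new runbook
    verification = runbooks * 0.5 / WEEKS_PER_QUARTER  # 30 min/runbook/quarter
    post_incident = incidents_per_week * 1.0  # 1 hour per incident
    search_index = 2.0 / WEEKS_PER_MONTH  # 2 hours per month
    return writing + verification + post_incident + search_index


# Fixed upkeep for 100 runbooks, with no new runbooks or incidents.
print(round(weekly_maintenance_hours(100, 0, 0), 1))  # 4.3
# Assumed load: 1 new runbook and 3 incidents per week.
print(round(weekly_maintenance_hours(100, 1, 3), 1))  # 9.3
```

Under these assumptions, verification and search upkeep account for a bit over 4 hours a week, and the rest scales with how often you ship new services and have incidents.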

When NOT to Write a Runbook

Counter-intuitive: sometimes the right answer is not to write a runbook.

If the fix is:

Obvious from the error message → no runbook needed
Already automated → delete the manual path
Only applicable once (one-time migration) → write a ticket, not a runbook
Constantly changing → document the general approach, not specific commands

Runbooks are for recurring, stable, manual procedures. Everything else belongs elsewhere.

The "Death Star" Runbook

The biggest runbook anti-pattern: one giant runbook that covers "incident response" generically. 50 pages long. No specific triggers. No specific steps.

These exist because nobody wanted to write specific ones, so they wrote one massive generic one.

Nobody reads them during incidents. They're doorstops.

Delete them. Replace with specific, focused runbooks tied to specific alerts.

The Starter Kit

If you're starting a runbook library from scratch:

List the 20 alerts that fired most often in the last 6 months
Write a runbook for each of them
Link each alert to its runbook
Set quarterly verification cycles
Assign owners
Put them in version control
Review monthly with your team

After 6 months, you'll have 30-50 high-quality runbooks that are actually used. This is infinitely more valuable than 400 stale ones.
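Finding those top 20 alerts is a counting exercise. Here is a minimal sketch, assuming you can export the names of fired alerts from your alerting system as a flat list, one entry per firing. The alert names other than `HighErrorRate` are hypothetical.

```python
from collections import Counter


def top_alerts(fired_alert_names: list[str], n: int = 20) -> list[tuple[str, int]]:
    """Return the n most frequently fired alerts with their counts."""
    return Counter(fired_alert_names).most_common(n)


history = ["HighErrorRate"] * 3 + ["DiskFull"] * 2 + ["CertExpiring"]
print(top_alerts(history, n=2))  # [('HighErrorRate', 3), ('DiskFull', 2)]
```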

The Ultimate Test

During your next incident, watch the responder:

Did they open a runbook? (good)
Was the runbook current? (better)
Did it contain the right answer? (best)
Did it resolve the incident faster than winging it would have? (mission accomplished)

If the answer to any of these is "no," your library has work to do.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

DEV Community