The Folder Full of Stale Runbooks
Every engineering org has a Confluence folder of incident runbooks. Every runbook was written during or after an incident. Each is a snapshot of how to fix one specific thing.
After 2 years, the folder has 400 runbooks. Nobody knows which are current. During the next incident, nobody opens any of them.
A playbook library is supposed to help. Usually it just generates debt.
What Playbooks Should Actually Do
A good playbook library:
- Reduces cognitive load during incidents
- Provides confidence to junior responders
- Captures institutional knowledge
- Trains new hires
A bad playbook library:
- Is out of date
- Is impossible to search
- Has no ownership
- Contradicts itself across runbooks
The difference is process, not content.
The Structure
Every runbook follows the same template:
# Title: [Specific Problem Being Fixed]
## When to Use This
- Alert that fires: [exact alert name]
- Symptoms: [what the user sees]
- Impact level: [SEV-1/2/3/4]
## Quick Fix (90 seconds)
1. Command to run
2. Command to run
3. Verification step
## Deeper Investigation (if quick fix fails)
- Check this
- Look at this dashboard
- Tail these logs
## Root Cause Category
- Known cause 1: [link to deeper doc]
- Known cause 2: [link to deeper doc]
## Escalation
- Primary: [role]
- Secondary: [role]
- SME: [specific person]
## Related Runbooks
- [Similar problem A]
- [Similar problem B]
## Metadata
- Owner: [team]
- Last Verified: [date]
- Expires: [Last Verified date + 90 days]
Every field is required. Missing fields fail CI.
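A minimal sketch of what such a CI check could look like, assuming runbooks are stored as Markdown files that follow the template above. The function name `find_missing_fields` and the sample draft are hypothetical, not part of any existing tool:

```python
import re

# Section headings and metadata fields taken from the template above.
REQUIRED_SECTIONS = [
    "When to Use This",
    "Quick Fix",
    "Deeper Investigation",
    "Root Cause Category",
    "Escalation",
    "Related Runbooks",
    "Metadata",
]
REQUIRED_METADATA = ["Owner", "Last Verified", "Expires"]


def find_missing_fields(markdown_text: str) -> list[str]:
    """Return the required sections and metadata fields that are absent."""
    missing = []
    lines = markdown_text.splitlines()
    if not any(line.startswith("# ") for line in lines):
        missing.append("Title")
    headings = [line[3:].strip() for line in lines if line.startswith("## ")]
    for section in REQUIRED_SECTIONS:
        # Prefix match, so "Quick Fix (90 seconds)" satisfies "Quick Fix".
        if not any(h.startswith(section) for h in headings):
            missing.append(section)
    for field in REQUIRED_METADATA:
        # A field only counts if it has a non-empty value on the same line.
        pattern = rf"^- {re.escape(field)}:[ \t]*\S"
        if not re.search(pattern, markdown_text, re.MULTILINE):
            missing.append(field)
    return missing


# Hypothetical, deliberately incomplete draft used for illustration.
draft = """# Title: API high error rate
## When to Use This
- Alert that fires: HighErrorRate
## Quick Fix (90 seconds)
1. Restart the API pods
## Metadata
- Owner: platform-team
- Last Verified:
"""

print(find_missing_fields(draft))
```

A CI job could run this over every runbook file and exit non-zero whenever the returned list is not empty.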
The Ownership Rule
Every runbook has an owner. The owner is responsible for:
- Keeping it current
- Re-verifying it quarterly
- Updating when the underlying system changes
- Deleting it when no longer relevant
If a runbook has no owner, delete it. A stale runbook is worse than no runbook: it wastes time during incidents and may contain wrong instructions.
The Expiration Date
Every runbook has a 90-day expiration. After 90 days:
- CI warns the owner
- After 30 more days, CI fails builds that reference it
- After 60 more days, the runbook is auto-moved to archive
The owner must re-verify and reset the expiration date. Re-verification means:
- Read it end to end
- Try the commands in staging
- Update anything that's changed
- Set a new expiration date
This forces continuous maintenance. It's painful. It also means the runbooks are trustworthy.
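The expiration stages could be computed along these lines. This is a sketch under one reading of the policy: both grace periods (30 and 60 days) are assumed to be counted from the expiration date, and all names here are hypothetical:

```python
from datetime import date, timedelta
from enum import Enum


class RunbookState(Enum):
    CURRENT = "current"
    WARN_OWNER = "warn_owner"
    FAIL_BUILD = "fail_build"
    ARCHIVE = "archive"


VALIDITY = timedelta(days=90)
# Assumption: both grace periods are measured from the expiration date.
FAIL_AFTER = timedelta(days=30)
ARCHIVE_AFTER = timedelta(days=60)


def next_expiry(verified_on: date) -> date:
    """Expiration date to set after a successful re-verification."""
    return verified_on + VALIDITY


def runbook_state(expires: date, today: date) -> RunbookState:
    """Map how far past expiration a runbook is onto the CI action to take."""
    overdue = today - expires
    if overdue <= timedelta(0):
        return RunbookState.CURRENT
    if overdue >= ARCHIVE_AFTER:
        return RunbookState.ARCHIVE
    if overdue >= FAIL_AFTER:
        return RunbookState.FAIL_BUILD
    return RunbookState.WARN_OWNER


expires = date(2024, 1, 1)
print(runbook_state(expires, date(2024, 1, 2)).value)   # warn_owner
print(runbook_state(expires, date(2024, 1, 31)).value)  # fail_build
print(runbook_state(expires, date(2024, 3, 1)).value)   # archive
```

If your policy counts the 60 days from the build-failure stage instead, only `ARCHIVE_AFTER` needs to change.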
The Discovery Problem
A library of 400 runbooks is useless if you can't find the right one during an incident.
Three techniques:
1. Alert-to-runbook mapping
Every alert includes a runbook link:
alert: HighErrorRate
annotations:
  summary: "API error rate above 5%"
  runbook: "https://runbooks.internal/api/high-error-rate"
When the alert fires, the runbook is one click away. No searching.
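The mapping only helps if every alert actually carries the link, so it may be worth linting for it. A minimal sketch, assuming alert rules have already been parsed into dictionaries shaped like the example above (loading the YAML is left out, and `DiskAlmostFull` is a made-up second alert):

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules without a usable `runbook` annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        link = str(annotations.get("runbook") or "").strip()
        # Treat anything that is not an https link as missing.
        if not link.startswith("https://"):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {
        "alert": "HighErrorRate",
        "annotations": {
            "summary": "API error rate above 5%",
            "runbook": "https://runbooks.internal/api/high-error-rate",
        },
    },
    # Hypothetical alert that was added without a runbook link.
    {"alert": "DiskAlmostFull", "annotations": {"summary": "Disk above 90%"}},
]

print(alerts_missing_runbook(rules))  # ['DiskAlmostFull']
```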
2. Symptom-based search
We tag runbooks by symptom:
symptoms:
- "slow response time"
- "database queries timing out"
- "connection pool exhausted"
During an incident, you search by symptom, not by service name.
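One simple way to implement symptom search is to rank runbooks by word overlap between the query and their symptom tags. The sketch below assumes the tags are already loaded into a dictionary; the runbook names are invented for illustration, and a real search index would likely do more (stemming, synonyms):

```python
def search_by_symptom(
    query: str, runbooks: dict[str, list[str]], limit: int = 3
) -> list[str]:
    """Rank runbooks by how many query words appear in their symptom tags."""
    query_words = set(query.lower().split())
    scored = []
    for name, symptoms in runbooks.items():
        tag_words = set(" ".join(symptoms).lower().split())
        score = len(query_words & tag_words)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical runbook names mapped to their symptom tags.
runbooks = {
    "db-connection-pool": [
        "slow response time",
        "database queries timing out",
        "connection pool exhausted",
    ],
    "cdn-cache-miss": ["slow response time", "high origin traffic"],
    "cert-expiry": ["tls handshake failures"],
}

print(search_by_symptom("database queries timing out", runbooks))
# ['db-connection-pool']
print(search_by_symptom("slow response", runbooks))
# ['cdn-cache-miss', 'db-connection-pool']
```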
3. AI-assisted search
A Slack bot that takes a natural-language description and returns the top 3 relevant runbooks. Only works if runbooks are well-structured.
The Writing Rule: Before You Have the Outage
The worst time to write a runbook is during an incident. You're stressed and in a hurry, so you'll write something incomplete.
The best time to write a runbook is:
- During a change: new service? write the runbook before launch
- During a quiet week: pick a service you're familiar with, write the runbook
- During a post-mortem: document what you just learned while it's fresh
- During a drill: tabletop exercises expose gaps
Teams that write runbooks proactively tend to respond to incidents better than teams that only write them reactively.
The Post-Incident Rule
Every incident with a post-mortem generates at least one runbook update. Either:
- Create a new runbook if this was a novel problem
- Update an existing runbook if the old instructions didn't work
- Delete a runbook if it was wrong or misleading
No post-mortem is complete without a runbook change. We enforce this in the template.
The Hidden Cost of Playbooks
Maintaining a library of 100 current runbooks takes real time. Approximately:
- Write new runbook: 2 hours
- Quarterly verification: 30 min/runbook/quarter
- Post-incident update: 1 hour/incident
- Search index maintenance: 2 hours/month
- Total: ~10 hours/week for a mature library
Budget this time. If you don't, your library will rot.
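To see how the per-item figures could add up to roughly 10 hours a week, here is a back-of-the-envelope calculation. The per-item costs come from the list above; the rates of two new runbooks and two incidents per week are assumptions chosen for illustration, so substitute your own:

```python
WEEKS_PER_QUARTER = 13
WEEKS_PER_MONTH = 52 / 12


def weekly_maintenance_hours(
    runbooks: int, new_runbooks_per_week: float, incidents_per_week: float
) -> float:
    """Estimate weekly upkeep hours from the per-item costs listed above."""
    writing = 2.0 * new_runbooks_per_week          # 2 hours per new runbook
    verification = 0.5 * runbooks / WEEKS_PER_QUARTER  # 30 min/runbook/quarter
    post_incident = 1.0 * incidents_per_week       # 1 hour per incident
    search_index = 2.0 / WEEKS_PER_MONTH           # 2 hours per month
    return writing + verification + post_incident + search_index


# Fixed cost of 100 runbooks with no new writing and no incidents.
print(round(weekly_maintenance_hours(100, 0, 0), 1))  # 4.3

# Assumed rates: 2 new runbooks and 2 incidents per week.
print(round(weekly_maintenance_hours(100, 2, 2), 1))  # 10.3
```

Verification alone accounts for almost 4 hours a week at 100 runbooks, which is why the library's size matters more than any single runbook's cost.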
When NOT to Write a Runbook
Counter-intuitive: sometimes the right answer is not to write a runbook.
If the fix is:
- Obvious from the error message → no runbook needed
- Already automated → delete the manual path
- Only applicable once (one-time migration) → write a ticket, not a runbook
- Constantly changing → document the general approach, not specific commands
Runbooks are for recurring, stable, manual procedures. Everything else belongs elsewhere.
The "Death Star" Runbook
The biggest runbook anti-pattern: one giant runbook that covers "incident response" generically. 50 pages long. No specific triggers. No specific steps.
These exist because nobody wanted to write specific ones, so they wrote one massive generic one.
Nobody reads them during incidents. They're doorstops.
Delete them. Replace with specific, focused runbooks tied to specific alerts.
The Starter Kit
If you're starting a runbook library from scratch:
- List your top 20 most-fired alerts in the last 6 months
- Write a runbook for each of them
- Link each alert to its runbook
- Set quarterly verification cycles
- Assign owners
- Put them in version control
- Review monthly with your team
After 6 months, you'll have 30-50 high-quality runbooks that are actually used. This is far more valuable than 400 stale ones.
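The first step, finding the most-fired alerts, can be sketched as follows. This assumes you can export alert history as (alert name, timestamp) pairs and treats 6 months as 180 days; the function name and sample data are hypothetical:

```python
from collections import Counter
from datetime import datetime, timedelta


def top_alerts(
    firings: list[tuple[str, datetime]],
    now: datetime,
    days: int = 180,
    limit: int = 20,
) -> list[tuple[str, int]]:
    """Count firings inside the look-back window and return the most frequent."""
    cutoff = now - timedelta(days=days)
    counts = Counter(name for name, fired_at in firings if fired_at >= cutoff)
    return counts.most_common(limit)


# Made-up alert history for illustration.
now = datetime(2024, 7, 1)
firings = [
    ("HighErrorRate", datetime(2024, 6, 1)),
    ("HighErrorRate", datetime(2024, 6, 15)),
    ("HighErrorRate", datetime(2024, 6, 20)),
    ("DiskAlmostFull", datetime(2024, 5, 3)),
    ("OldAlert", datetime(2023, 1, 1)),  # outside the window, ignored
]

print(top_alerts(firings, now))
# [('HighErrorRate', 3), ('DiskAlmostFull', 1)]
```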
The Ultimate Test
During your next incident, watch the responder:
- Did they open a runbook? (good)
- Was the runbook current? (better)
- Did it contain the right answer? (best)
- Did it resolve the incident faster than winging it would have? (mission accomplished)
If the answer to any of these is "no," your library has work to do.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com