LinkedIn Draft — Insight (2026-03-28)
One insight that changed how I design systems:
Runbook quality decays silently — and that decay kills MTTR
Runbooks that haven't been run recently are wrong. Not outdated — wrong. The service changed. The tool was deprecated. The endpoint moved. Nobody updated the doc because nobody reads it until 3am. And at 3am, a wrong runbook is worse than no runbook — it sends engineers down confident paths that dead-end.
Runbook decay curve:
Quality
│▓▓▓▓▓▓▓▓▓▓
│     ▓▓▓▓▓▓▓
│          ▓▓▓▓▓
│             ▓▓▓▓░░░
│                ░░░░░░░░  ← "last validated 8 months ago"
└────────────────────────────────────▶
 Write    Month 1    Month 3    Month 6    Month 9
The non-obvious part:
→ The highest-leverage runbook improvement isn't better writing — it's a validation date and a quarterly review reminder. A runbook with 'last validated: 2 weeks ago' that's 70% accurate is worth more than a beautifully written one from 8 months ago that's 40% accurate.
My rule:
→ Every runbook gets a 'last validated' date. Anything older than 3 months is assumed broken until proven otherwise. Review is part of the on-call rotation, not optional.
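That rule is cheap to automate. A minimal sketch of the staleness check, assuming each runbook's 'last validated' date is already extracted somewhere (the runbook names and the 90-day window standing in for "3 months" are illustrative):

```python
from datetime import date, timedelta

# ~3 months: anything older is assumed broken until re-validated
STALE_AFTER = timedelta(days=90)

def is_stale(last_validated: date, today: date) -> bool:
    """True if the runbook is past the freshness window."""
    return today - last_validated > STALE_AFTER

def triage(runbooks: dict[str, date], today: date) -> list[str]:
    """Runbook names to push onto the on-call review queue, sorted."""
    return sorted(name for name, validated in runbooks.items()
                  if is_stale(validated, today))
```

Run it from a cron job or CI and open a ticket per stale name, so review lands in the rotation instead of relying on memory. For example, `triage({"db-failover": date(2026, 1, 2), "cache-purge": date(2025, 6, 1)}, today=date(2026, 3, 28))` flags only `cache-purge`.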
Worth reading:
▸ PagerDuty Incident Response guide — runbook standards and validation cadence
▸ Post-incident review template — make 'Did the runbook help, mislead, or was it missing?' a standard question
What's the version of this that your org gets wrong? Drop it below.