LinkedIn Draft — Insight (2026-03-28)
One insight that changed how I design systems:
Runbook quality decays silently — and that decay kills MTTR
Runbooks that haven't been run recently are wrong. Not outdated — wrong. The service changed. The tool was deprecated. The endpoint moved. Nobody updated the doc because nobody reads it until 3am. And at 3am, a wrong runbook is worse than no runbook — it sends engineers down confident paths that dead-end.
Runbook decay curve:
Quality
│▓▓▓▓▓▓▓▓▓▓
│     ▓▓▓▓▓▓▓
│          ▓▓▓▓▓
│             ▓▓▓▓░░░
│                ░░░░░░░░  ← "last validated 8 months ago"
└────────────────────────────────────▶
 Write    Month 1    Month 3    Month 6    Month 9
The non-obvious part:
→ The highest-leverage runbook improvement isn't better writing — it's a validation date and a quarterly review reminder. A runbook with 'last validated: 2 weeks ago' that's 70% accurate is worth more than a beautifully written one from 8 months ago that's 40% accurate.
My rule:
→ Every runbook gets a 'last validated' date. Anything older than 3 months is assumed broken until proven otherwise. Review is part of the on-call rotation, not optional.
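That rule is cheap to automate. A minimal sketch of the staleness check, assuming each runbook's 'last validated' date is already extracted somewhere (the runbook names and the 90-day window standing in for "3 months" are illustrative):

```python
from datetime import date, timedelta

# ~3 months: anything older is assumed broken until re-validated
STALE_AFTER = timedelta(days=90)

def is_stale(last_validated: date, today: date) -> bool:
    """True if the runbook is past the freshness window."""
    return today - last_validated > STALE_AFTER

def triage(runbooks: dict[str, date], today: date) -> list[str]:
    """Runbook names to push onto the on-call review queue, sorted."""
    return sorted(name for name, validated in runbooks.items()
                  if is_stale(validated, today))
```

Run it from a cron job or CI and open a ticket per stale name, so review lands in the rotation instead of relying on memory. For example, `triage({"db-failover": date(2026, 1, 2), "cache-purge": date(2025, 6, 1)}, today=date(2026, 3, 28))` flags only `cache-purge`.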
Worth reading:
▸ PagerDuty Incident Response guide — runbook standards and validation cadence
▸ Post-incident review template — make 'Did the runbook help, mislead, or was it missing?' a standard question
What's the version of this that your org gets wrong? Drop it below.