Here's an idea I've been advocating for the last year: write the runbook before you ship the feature.
It sounds backwards, but it changes how teams design features. Let me explain.
The usual way
Team ships feature → feature breaks at 3 AM → on-call engineer tries to debug without context → writes runbook after the post-mortem.
The problem: the runbook gets written under duress, ends up incomplete, and often never gets written at all. Worse, the first on-call engineer suffers needlessly.
The runbook-first way
Before shipping, the team writes:
- What alerts this feature will introduce
- What each alert means
- The first 3 things to check for each alert
- The most likely causes
- Escalation path
The runbook becomes a design review artifact. If you can't write the runbook, your design isn't clear.
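To make "if you can't write the runbook" concrete, here is a minimal sketch of a per-alert entry with a completeness check a team could run during design review. The class name, field names, and the example alert are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEntry:
    """One alert's runbook entry. Field names are illustrative assumptions."""
    name: str
    meaning: str = ""
    first_checks: list[str] = field(default_factory=list)
    likely_causes: list[str] = field(default_factory=list)
    escalation: str = ""


def gaps(entry: AlertEntry) -> list[str]:
    """Return the parts of the entry that are still unwritten."""
    missing = []
    if not entry.meaning.strip():
        missing.append("meaning")
    if len(entry.first_checks) < 3:
        # The list above asks for the first 3 things to check.
        missing.append("first_checks (need 3)")
    if not entry.likely_causes:
        missing.append("likely_causes")
    if not entry.escalation.strip():
        missing.append("escalation")
    return missing


# Hypothetical draft written early in a design review.
draft = AlertEntry(
    name="checkout_latency_high",
    meaning="p99 checkout latency is above the agreed threshold",
    first_checks=["Check the latency dashboard"],
)
print(gaps(draft))
```

Any non-empty result is a question the design has not answered yet, which is the point of treating the runbook as a review artifact.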
What this exposes
Writing the runbook forces the team to answer questions most feature specs don't:
- How will this fail?
- What will on-call see when it does?
- Who owns the fix?
- What metrics should we add to make this debuggable?
I've seen feature designs get reworked just because writing the runbook exposed observability gaps. That's the signal you're doing it right.
The format
Keep the runbook to one page. Structure:
- New alerts (what they mean)
- New dashboards (where to find them)
- Common failure modes
- First actions for on-call
- Owner and escalation path
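As a starting point, the structure above could be generated as an empty skeleton for each new feature. This is only a sketch: the function name, the `TODO` convention, the example feature name, and the hint text for the last three sections are assumptions.

```python
# Section titles come from the list above; hints for the last three
# sections are assumed wording, not part of the original format.
SECTIONS = [
    ("New alerts", "what they mean"),
    ("New dashboards", "where to find them"),
    ("Common failure modes", "how this feature is expected to fail"),
    ("First actions for on-call", "what to check first"),
    ("Owner and escalation path", "who owns the fix and who to page next"),
]


def runbook_skeleton(feature: str) -> str:
    """Return a plain-text runbook skeleton with one TODO per section."""
    lines = [f"Runbook: {feature}", ""]
    for title, hint in SECTIONS:
        lines.append(title)
        lines.append(f"  TODO: {hint}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# "bulk-export" is a hypothetical feature name.
print(runbook_skeleton("bulk-export"))
```

Five headings with one prompt each keeps the result well within a single page, and the remaining `TODO` markers show at a glance what is still unwritten.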
The hard part
Engineers resist this because it feels like extra work. Frame it as 'this is part of the design, not an add-on.' Block merges until the runbook exists.
After 3 months, the team will thank you. Their 3 AM pages will come with actual context, and their post-mortems won't start from zero.
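Blocking merges until the runbook exists can be automated with a small check in CI. The sketch below assumes the runbook lives in a text file inside the repository and uses the five section titles from the format above. The file location and the function names are hypothetical, and wiring it into a specific CI system is left out.

```python
from pathlib import Path

# Section titles taken from the one-page format described earlier.
REQUIRED_SECTIONS = [
    "New alerts",
    "New dashboards",
    "Common failure modes",
    "First actions for on-call",
    "Owner and escalation path",
]


def missing_sections(text: str) -> list[str]:
    """Return required section titles that do not appear in the text."""
    lowered = text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]


def check_runbook(path: Path) -> int:
    """Return 0 if the runbook exists with all sections, 1 otherwise."""
    if not path.is_file():
        print(f"FAIL: no runbook at {path}")
        return 1
    missing = missing_sections(path.read_text(encoding="utf-8"))
    if missing:
        print("FAIL: runbook is missing sections: " + ", ".join(missing))
        return 1
    print("OK: runbook present with all sections")
    return 0


# A CI step could end with something like:
#   sys.exit(check_runbook(Path("RUNBOOK.md")))  # hypothetical path
```

A non-zero return value fails the pipeline step, so the merge stays blocked until the runbook is filled in. The check only looks for section titles, so reviewers still need to judge whether the content is useful.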
The bigger insight
You cannot ship reliable software without thinking about how it fails. Runbook-first makes that thinking explicit and early. It's the cheapest reliability investment you can make.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com