DEV Community

Samson Tanimawo


Building Your First Runbook: A Template That Actually Works

Most runbooks are useless. Either they're too abstract ('check the logs') or they're a 40-page Confluence doc that nobody reads at 3 AM.

Here is the template I use. It fits on one page and works.

The template

1. Trigger. The exact alert name and what it means.

2. Impact. Who is affected? What are they seeing? Is this user-facing?

3. First 5 minutes. The single most useful command to run. One. Not five.

4. Common causes. The 3 things that most often cause this alert, in order of likelihood.

5. Fix per cause. For each common cause, the exact fix. Copy-paste-ready.

6. Escalation. Who to page if none of the above works. Include their timezone.

7. Post-incident. What to update after the incident is done (ticket, dashboard, doc).
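
One way to keep every runbook on this shape is to store it as structured data and lint it. The sketch below is a minimal illustration, not a prescribed tool: the `Runbook` and `Cause` classes, the `lint` function, the alert name, and the example commands are all hypothetical, and the "at most 3 causes" rule is my reading of step 4.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str  # step 4: what usually goes wrong
    fix: str          # step 5: the exact, copy-paste-ready fix


@dataclass
class Runbook:
    trigger: str         # step 1: exact alert name and what it means
    impact: str          # step 2: who is affected and what they see
    first_command: str   # step 3: the single most useful command
    causes: list[Cause]  # steps 4 and 5: most likely cause first
    escalation: str      # step 6: who to page, with their timezone
    post_incident: list[str] = field(default_factory=list)  # step 7


def lint(rb: Runbook) -> list[str]:
    """Return a list of problems; an empty list means the template is filled in."""
    problems = []
    if not rb.trigger.strip():
        problems.append("trigger: name the exact alert")
    if not rb.impact.strip():
        problems.append("impact: say who is affected")
    # "One. Not five." A multi-line or empty command fails the check.
    if len(rb.first_command.strip().splitlines()) != 1:
        problems.append("first_command: give exactly one command")
    if not 1 <= len(rb.causes) <= 3:
        problems.append("causes: list 1 to 3, most likely first")
    for cause in rb.causes:
        if not cause.fix.strip():
            problems.append(f"fix missing for cause: {cause.description}")
    if not rb.escalation.strip():
        problems.append("escalation: say who to page and their timezone")
    if not rb.post_incident:
        problems.append("post_incident: list what to update afterwards")
    return problems


# Hypothetical example; the alert name, hosts, and commands are illustrative only.
example = Runbook(
    trigger="DiskUsageHigh: root volume above 90% on an app host",
    impact="Writes may start failing; user-facing uploads return errors",
    first_command="df -h /",
    causes=[
        Cause("Journal logs grew unbounded", "sudo journalctl --vacuum-size=500M"),
        Cause("Old container images piled up", "docker image prune -a"),
    ],
    escalation="Page the storage on-call (UTC+1)",
    post_incident=["Close the ticket", "Update the disk dashboard", "Revise this runbook"],
)

print(lint(example))  # prints []
```

Running a check like this in CI would catch a runbook that drifts back toward "check the logs" or grows a second and third first command.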

Why this works

At 3 AM, your brain is running at 60%. You need a runbook that gives you the next action in under 30 seconds. A 40-page doc makes you think. A one-pager tells you what to do.

Start with your noisiest alert. Write the runbook. Test it on a new team member. If they can follow it without you, it works.

Repeat for your top 10 alerts. That's 90% of your on-call load handled.
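
To find your noisiest alerts and see how much of your load the top 10 really cover, you can count pages by alert name. This is a rough sketch that assumes you can export your paging history as a flat list of alert names; the `top_alerts` function and the alert names are hypothetical, and the share your own top 10 cover may differ from the 90% figure above.

```python
from collections import Counter


def top_alerts(pages: list[str], n: int = 10) -> list[tuple[str, int, float]]:
    """Rank alerts by page count.

    Returns (alert name, page count, cumulative share of all pages)
    for the n most frequent alerts, noisiest first.
    """
    counts = Counter(pages)
    total = len(pages)
    ranked = []
    covered = 0
    for name, count in counts.most_common(n):
        covered += count
        ranked.append((name, count, covered / total))
    return ranked


# Hypothetical paging history: one entry per page received.
pages = ["DiskUsageHigh"] * 5 + ["HighLatency"] * 3 + ["CertExpiring"] * 2

for name, count, share in top_alerts(pages, n=2):
    print(f"{name}: {count} pages, {share:.0%} of load covered so far")
# DiskUsageHigh: 5 pages, 50% of load covered so far
# HighLatency: 3 pages, 80% of load covered so far
```

The first row is the alert to write a runbook for today; the last row's cumulative share tells you how much of your on-call load the list would cover.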


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
