Most runbook steps are about what to do when something is broken. The most useful one I've added in years is about what to do before anything is broken: a step that captures what the system looks like when it's healthy, in the moment you're reading the runbook.
It sounds trivial. It is not.
The story
Got paged at 3am. Symptoms: queue depth high, some user reports of slow responses. Opened the runbook. The runbook said:
- Check if the queue depth is elevated
- Check if any consumers are crashlooping
- Check error logs for patterns
- Restart consumers if needed
Queue depth was at 12,000. Was that elevated? I had no idea. The runbook didn't say. I went looking through dashboards for a baseline and couldn't find a clean one (the time-series only went back two weeks; anything before that had been rolled up). I burned 25 minutes figuring out that 12,000 was actually slightly below normal for that hour of day, and the real problem was a single slow consumer that happened to be consuming the most expensive job type.
The runbook had me looking at the wrong number with no frame of reference.
The step I add now
The first step in every runbook I write or edit:
Before touching anything: capture "normal" in the same dashboard you're about to use. Open the dashboard, take a screenshot of the last 7 days, and note the band that "healthy" sits in for the metric you care about. Then compare the current value to that band.
The goal is not the screenshot. The goal is forcing the responder to anchor on a baseline before they look at the live number, because the live number has no meaning without one.
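The anchoring itself can even be mechanized. Here's a minimal sketch of the "note the band, then compare" part, assuming you can pull a week of healthy samples for the metric; the function names and the sample values are mine, not from any real tooling:

```python
import statistics

def baseline_band(samples, k=3.0):
    # Median and median absolute deviation rather than mean/stddev:
    # both are robust to the incident spikes that inevitably
    # pollute any 7-day window of real metrics.
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    return (med - k * mad, med + k * mad)

def classify(current, band):
    low, high = band
    if current < low:
        return "below band"
    if current > high:
        return "above band"
    return "in band"

# Hypothetical hourly queue-depth samples from a healthy week.
healthy = [11800, 12400, 13100, 12900, 12200, 11500, 12700, 13300]
band = baseline_band(healthy)
print(classify(12000, band))  # → "in band"
```

Against that band, the 12,000 from the story classifies as "in band" — which is exactly the answer I spent 25 minutes reconstructing by hand.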
Why it works
It's the engineering equivalent of "before you adjust the thermostat, look at what temperature it currently is." Most pages where the responder makes things worse start with the responder treating an in-band number as out-of-band, or vice versa.
A few related habits that came out of this:
- The runbook screenshots get committed alongside the runbook. When the dashboard URL eventually rots (and it will), the screenshot of "what 'healthy' looked like in 2024" is the last surviving baseline.
- Every alert threshold in the runbook gets its rationale next to it. "Page if queue > 50,000 (rationale: 99th percentile of last 30 days was 38,000; saw a real incident at 47,000 in March)." When I'm woken at 3am the rationale matters more than the threshold.
- The dashboard panel order matches the runbook step order. If step 2 says check consumer health, the consumer health panel is the second one on the dashboard, not buried under nine other panels. This sounds obvious. Almost no team I've worked with does it.
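The threshold-plus-rationale habit is also easy to generate rather than hand-write. A sketch, assuming 30 days of samples for the metric; the function and the margin factor are my own illustration, not a standard:

```python
import statistics

def threshold_with_rationale(samples, margin=1.3):
    # Anchor the page threshold to an observed percentile, and keep
    # the rationale next to the number so 3am-you can sanity-check it.
    p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile
    threshold = round(p99 * margin)
    rationale = (f"99th percentile of last 30 days was {p99:.0f}; "
                 f"threshold = p99 * {margin}")
    return threshold, rationale
```

The output goes into the runbook verbatim, e.g. "Page if queue > 130 (rationale: 99th percentile of last 30 days was 100; threshold = p99 * 1.3)" — the same shape as the queue example above, just regenerated instead of going stale.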
What it changed
My mean-time-to-actually-doing-something went down by maybe 60%. The page-but-no-action-needed rate went up (correctly — I was no longer treating in-band numbers as incidents). The number of times I've made a system worse during an on-call dropped to almost zero.
The expensive failure mode of on-call isn't acting too slowly. It's acting on the wrong information. The "what does normal look like right now" step is cheap insurance against that.