The Runbook Nobody Reads
We had runbooks. Beautiful, detailed, Google-Docs runbooks. 47 pages long. Nobody read them at 3am.
The problem isn't the documentation. The problem is expecting a sleep-deprived human to follow a 47-step procedure correctly.
The Automation Ladder
I think about runbook automation as a ladder:
Level 0: No runbook (tribal knowledge)
Level 1: Written runbook (Google Doc)
Level 2: Structured runbook (checklist format)
Level 3: Semi-automated (scripts for each step)
Level 4: Fully automated (one-click remediation)
Level 5: Self-healing (no human needed)
Most teams are at Level 1-2. The goal is Level 4-5 for your top 10 incidents.
Identifying Automation Candidates
Not everything should be automated. Start with high-frequency, well-understood procedures:
-- Query your incident database (PostgreSQL syntax)
SELECT
    root_cause_category,
    COUNT(*) AS frequency,
    AVG(resolution_time_minutes) AS avg_mttr,
    COUNT(*) * AVG(resolution_time_minutes) AS total_impact_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY root_cause_category
ORDER BY total_impact_minutes DESC
LIMIT 10;
For us, the top 5 were:
- Disk full on log volumes (2x/week)
- Memory leak requiring pod restart (1x/week)
- Certificate expiry (1x/month, but high impact)
- Database connection pool exhaustion (1x/week)
- Stuck deployment (2x/week)
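If your incidents live in a spreadsheet export rather than a SQL database, the same ranking can be approximated on the command line. This is a minimal sketch, not the query we ran: it assumes a hypothetical headerless CSV with two columns (category, resolution time in whole minutes) and no commas inside fields. The function name `rank_incidents` is made up for illustration.

```shell
#!/usr/bin/env bash
# Rank incident categories by total impact, mirroring the SQL query above.
# Input (stdin): hypothetical CSV rows of the form category,resolution_minutes
# Output: category,frequency,avg_mttr,total_impact_minutes (top 10, highest first)
rank_incidents() {
    awk -F',' '
        # Accumulate a count and a running total of minutes per category.
        NF >= 2 { count[$1]++; total[$1] += $2 }
        END {
            for (c in count)
                printf "%s,%d,%.1f,%d\n", c, count[c], total[c] / count[c], total[c]
        }
    ' | sort -t',' -k4,4nr | head -n 10
}
```

You would call it as `rank_incidents < incidents.csv`, where `incidents.csv` stands in for whatever export your incident tracker produces.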
Example: Disk Full Auto-Remediation
Before (Level 1 — runbook):
1. SSH to the affected host
2. Run df -h to confirm
3. Check /var/log for large files
4. Run logrotate manually
5. If still full, find and remove old files
6. If still full, expand the volume
7. Verify service recovered
After (Level 5 — self-healing):
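#!/bin/bash
# disk-remediation.sh: triggered by monitoring alert
# Usage: disk-remediation.sh <host>
HOST="${1:?usage: $0 <host>}"
THRESHOLD=90

# Placeholder notifiers: swap in your real Slack and paging integrations.
notify_slack() { echo "[slack] $*"; }
page_oncall() { echo "[page] $*"; }

# Print /var/log usage on the remote host as a bare integer (GNU df).
disk_usage() {
    ssh "$HOST" "df /var/log --output=pcent | tail -n 1 | tr -dc '0-9'"
}

USAGE=$(disk_usage)
if [ -z "$USAGE" ]; then
    page_oncall "Could not read disk usage on ${HOST}. Manual check needed."
    exit 1
fi

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "[Auto-Remediation] Disk at ${USAGE}% on ${HOST}"
    # Step 1: Rotate logs
    ssh "$HOST" "sudo logrotate -f /etc/logrotate.conf"
    # Step 2: Clean old compressed logs (>7 days); sudo because /var/log is usually root-owned
    ssh "$HOST" "sudo find /var/log -type f -name '*.gz' -mtime +7 -delete"
    # Step 3: Clean temp files (>3 days); files only, so directories are left alone
    ssh "$HOST" "sudo find /tmp -type f -mtime +3 -delete 2>/dev/null"
    # Verify: resolved means back at or below the threshold that triggered us
    NEW_USAGE=$(disk_usage)
    if [ -n "$NEW_USAGE" ] && [ "$NEW_USAGE" -le "$THRESHOLD" ]; then
        echo "[Auto-Remediation] Resolved. ${USAGE}% -> ${NEW_USAGE}%"
        notify_slack "Disk full on ${HOST} auto-resolved (${USAGE}% -> ${NEW_USAGE}%)"
    else
        echo "[Auto-Remediation] Still at ${NEW_USAGE:-unknown}%. Escalating."
        page_oncall "Disk full on ${HOST} - auto-remediation failed. Manual intervention needed."
    fi
fi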
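The same pattern carries over to the memory-leak incident, where the fix is a restart rather than a cleanup. Here is a hedged sketch of what a one-click restart could look like, assuming the service runs as a Kubernetes Deployment. The function name, the `KUBECTL` override, and the default timeout are illustrative choices, not our production tooling.

```shell
#!/usr/bin/env bash
# One-click restart for a leaking service (sketch, assumes a Kubernetes Deployment).
# Set KUBECTL=echo to print the commands instead of running them (dry run).
restart_deployment() {
    local name="$1" namespace="$2" timeout="${3:-120s}"
    local kubectl="${KUBECTL:-kubectl}"
    # Trigger a rolling restart so pods are replaced gradually, not all at once.
    $kubectl rollout restart "deployment/${name}" -n "$namespace" || return 1
    # Block until the rollout completes or the timeout expires.
    $kubectl rollout status "deployment/${name}" -n "$namespace" --timeout="$timeout"
}
```

A dry run such as `KUBECTL=echo restart_deployment web prod` prints the two commands without touching the cluster, which is a cheap way to review what the button will do before wiring it to an alert. The names `web` and `prod` are placeholders.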
The Results
| Incident Type | Before (MTTR) | After (MTTR) | Automation Level |
|---|---|---|---|
| Disk full | 25 min | 90 sec | Self-healing |
| Memory leak | 15 min | 45 sec | One-click |
| Cert expiry | 45 min | 0 (prevented) | Proactive |
| DB conn pool | 20 min | 60 sec | Self-healing |
| Stuck deploy | 30 min | 2 min | One-click |
Total monthly incident time: 14 hours → 45 minutes.
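The "Proactive" entry for certificate expiry means catching the certificate before it lapses instead of reacting afterwards. One way to sketch that check, assuming the certificate is available as a PEM file and `openssl` is installed, is shown below. The function name and the 30-day window in the usage note are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Returns 0 if the certificate in $1 expires within $2 days, 1 otherwise.
# An unreadable file also returns 0, so a broken check gets attention
# rather than passing silently.
cert_expires_within() {
    local cert_file="$1" days="$2"
    # openssl exits 0 when the cert is still valid after the given number of seconds.
    if openssl x509 -checkend "$((days * 86400))" -noout -in "$cert_file" >/dev/null 2>&1; then
        return 1   # valid beyond the window, nothing to do
    fi
    return 0       # expiring, expired, or unreadable
}
```

Run on a schedule, something like `cert_expires_within /path/to/cert.pem 30 && page_oncall "Certificate expires within 30 days"` turns an outage into a routine renewal ticket. The path is a placeholder, and `page_oncall` is the placeholder helper from the disk script.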
The Golden Rule
If you've fixed the same incident three times manually, it's time to automate. The third time pays for the automation effort. Everything after that is pure savings.
If you're tired of repetitive incident response and want to automate your runbooks with AI, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com