Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

#sre #automation #runbooks #devops

The Runbook Nobody Reads

We had runbooks. Beautiful, detailed Google Docs runbooks, 47 pages long. Nobody read them at 3 a.m.

The problem isn't the documentation. The problem is expecting a sleep-deprived human to follow a 47-page procedure correctly.

The Automation Ladder

I think about runbook automation as a ladder:

Level 0: No runbook (tribal knowledge)
Level 1: Written runbook (Google Doc)
Level 2: Structured runbook (checklist format)
Level 3: Semi-automated (scripts for each step)
Level 4: Fully automated (one-click remediation)
Level 5: Self-healing (no human needed)

Most teams are at Level 1-2. The goal is Level 4-5 for your top 10 incidents.

Identifying Automation Candidates

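Not everything should be automated. Start with high-frequency, well-understood procedures. A query like this one ranks root causes by the total time they cost you:

-- Query your incident database (PostgreSQL interval syntax)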
SELECT 
  root_cause_category,
  COUNT(*) as frequency,
  AVG(resolution_time_minutes) as avg_mttr,
  COUNT(*) * AVG(resolution_time_minutes) as total_impact_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY root_cause_category
ORDER BY total_impact_minutes DESC
LIMIT 10;
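
If your incident history lives in a CSV export rather than a SQL database, roughly the same ranking can be produced in shell. This is a minimal sketch, assuming a hypothetical header-less export with two columns (root cause category, resolution time in whole minutes). The function name rank_incidents is illustrative, not something from our tooling.

```bash
#!/bin/bash
# rank_incidents: read "category,resolution_minutes" rows on stdin
# (assumed export format, no header row) and print
# "category,count,total_minutes", highest total first, top 10 only.
rank_incidents() {
  awk -F',' '
    NF >= 2 { count[$1]++; total[$1] += $2 }
    END { for (c in count) printf "%s,%d,%d\n", c, count[c], total[c] }
  ' | sort -t',' -k3,3nr | head -n 10
}

# Example with three made-up incidents:
printf 'disk_full,25\nstuck_deploy,30\ndisk_full,20\n' | rank_incidents
# disk_full,2,45
# stuck_deploy,1,30
```

Summing the minutes per category is equivalent to the COUNT(*) * AVG(...) column in the query when every row has a resolution time.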

For us, the top 5 were:

1. Disk full on log volumes (2x/week)
2. Memory leak requiring pod restart (1x/week)
3. Certificate expiry (1x/month, but high impact)
4. Database connection pool exhaustion (1x/week)
5. Stuck deployment (2x/week)

Example: Disk Full Auto-Remediation

Before (Level 1 — runbook):

1. SSH to the affected host
2. Run df -h to confirm
3. Check /var/log for large files
4. Run logrotate manually
5. If still full, find and remove old files
6. If still full, expand the volume
7. Verify service recovered

After (Level 5 — self-healing):

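#!/bin/bash
# disk-remediation.sh: triggered by a monitoring alert
# Usage: disk-remediation.sh <host>

set -u

HOST="${1:?usage: disk-remediation.sh <host>}"
THRESHOLD=90

# notify_slack and page_oncall are placeholders for your own integrations.
# Replace these stubs with real Slack and paging calls.
notify_slack() { echo "[slack] $1"; }
page_oncall()  { echo "[page] $1"; }

# Print the usage percentage of the filesystem holding /var/log (GNU df).
disk_usage() {
  ssh "$HOST" "df --output=pcent /var/log | tail -1 | tr -dc '0-9'"
}

USAGE=$(disk_usage)
if [ -z "$USAGE" ]; then
  # An empty result means SSH or df failed; do not guess, escalate.
  page_oncall "Disk check on ${HOST} failed: could not read usage over SSH."
  exit 1
fi

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "[Auto-Remediation] Disk at ${USAGE}% on ${HOST}"

  # Step 1: Rotate logs
  ssh "$HOST" "sudo logrotate -f /etc/logrotate.conf"

  # Step 2: Clean old compressed logs (>7 days); sudo because /var/log is root-owned
  ssh "$HOST" "sudo find /var/log -type f -name '*.gz' -mtime +7 -delete"

  # Step 3: Clean temp files (>3 days); only helps if /tmp shares the volume
  ssh "$HOST" "find /tmp -mindepth 1 -type f -mtime +3 -delete 2>/dev/null"

  # Verify: resolved means back at or below the alert threshold
  NEW_USAGE=$(disk_usage)

  if [ -n "$NEW_USAGE" ] && [ "$NEW_USAGE" -le "$THRESHOLD" ]; then
    echo "[Auto-Remediation] Resolved. ${USAGE}% -> ${NEW_USAGE}%"
    notify_slack "Disk full on ${HOST} auto-resolved (${USAGE}% -> ${NEW_USAGE}%)"
  else
    echo "[Auto-Remediation] Still at ${NEW_USAGE:-unknown}%. Escalating."
    page_oncall "Disk full on ${HOST} - auto-remediation failed. Manual intervention needed."
  fi
fi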
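
Because this script deletes files, it is worth rehearsing before you wire it to a live alert. One possible approach is a small wrapper that prints commands instead of running them when a flag is set. This is only a sketch: the names run and DRY_RUN and the host web-01 are assumptions for illustration and are not part of the script above.

```bash
#!/bin/bash
# run: execute a command, or only print it when DRY_RUN=1.
# `run` and `DRY_RUN` are hypothetical names for this sketch.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# With DRY_RUN=1 nothing is deleted; the command is only printed.
DRY_RUN=1 run ssh "web-01" "find /var/log -name '*.gz' -mtime +7 -delete"
```

Prefixing each destructive ssh call with run would let you trigger the alert path in staging and read exactly what would have been executed.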

The Results

Incident Type	Before (MTTR)	After (MTTR)	Automation Level
Disk full	25 min	90 sec	Self-healing
Memory leak	15 min	45 sec	One-click
Cert expiry	45 min	0 (prevented)	Proactive
DB conn pool	20 min	60 sec	Self-healing
Stuck deploy	30 min	2 min	One-click

Total monthly incident time: 14 hours → 45 minutes.
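
The "prevented" entry for certificate expiry implies a scheduled check that fires before the certificate lapses. The check itself isn't shown here, so treat the following as a sketch of one way to do it. It uses openssl x509 -checkend, which exits non-zero when the certificate expires within the given number of seconds. The function names, the 14-day window, and the certificate path are all assumptions.

```bash
#!/bin/bash
# days_to_seconds: convert a warning window in days to seconds.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# cert_expiring_soon: succeed (exit 0) if the PEM certificate in $1
# expires within $2 days. A missing or unreadable file also counts as
# "expiring", which errs on the side of alerting.
cert_expiring_soon() {
  ! openssl x509 -checkend "$(days_to_seconds "$2")" -noout -in "$1" >/dev/null 2>&1
}

# Hypothetical usage from a daily cron job (path and window are examples):
#   if cert_expiring_soon /etc/ssl/certs/example.pem 14; then
#     echo "certificate expires within 14 days, renew now"
#   fi
```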

The Golden Rule

If you've fixed the same incident three times manually, it's time to automate. The third time pays for the automation effort. Everything after that is pure savings.
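
The rule of three can be sanity-checked with simple arithmetic: the number of incidents needed to recoup the automation effort is the effort divided by the time saved per incident, rounded up. A rough sketch follows. The one-hour effort figure is purely an assumed example, while the 25-minute and 90-second values are the disk-full numbers from the results table.

```bash
#!/bin/bash
# breakeven_incidents EFFORT_S BEFORE_S AFTER_S
# Prints how many incidents it takes for the time saved to cover the
# automation effort (all values in seconds, integer ceiling division).
breakeven_incidents() {
  effort=$1
  saved=$(( $2 - $3 ))
  if [ "$saved" -le 0 ]; then
    echo "automation saves no time" >&2
    return 1
  fi
  echo $(( (effort + saved - 1) / saved ))
}

# Assumed 1 hour (3600 s) to write the script; disk-full MTTR 25 min -> 90 s.
breakeven_incidents 3600 1500 90   # prints 3
```

Larger automation efforts push the break-even point out, which is one reason to start with the highest-frequency incidents.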

If you're tired of repetitive incident response and want to automate your runbooks with AI, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com