Samson Tanimawo

Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

The Runbook Nobody Reads

We had runbooks. Beautiful, detailed, Google-Docs runbooks. 47 pages long. Nobody read them at 3am.

The problem isn't the documentation. The problem is expecting a sleep-deprived human to follow a 47-step procedure correctly.

The Automation Ladder

I think about runbook automation as a ladder:

Level 0: No runbook (tribal knowledge)
Level 1: Written runbook (Google Doc)
Level 2: Structured runbook (checklist format)
Level 3: Semi-automated (scripts for each step)
Level 4: Fully automated (one-click remediation)
Level 5: Self-healing (no human needed)

Most teams sit at Levels 1-2. The goal is Level 4 or 5 for your top 10 incident types.

Identifying Automation Candidates

Not everything should be automated. Start with high-frequency, well-understood procedures, ranked by total time lost (frequency × average resolution time):

-- Rank root causes by total time lost over the last 6 months (PostgreSQL syntax)
SELECT 
  root_cause_category,
  COUNT(*) as frequency,
  AVG(resolution_time_minutes) as avg_mttr,
  COUNT(*) * AVG(resolution_time_minutes) as total_impact_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '6 months'
GROUP BY root_cause_category
ORDER BY total_impact_minutes DESC
LIMIT 10;

For us, the top 5 were:

  1. Disk full on log volumes (2x/week)
  2. Memory leak requiring pod restart (1x/week)
  3. Certificate expiry (1x/month, but high impact)
  4. Database connection pool exhaustion (1x/week)
  5. Stuck deployment (2x/week)
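If your incidents live in a spreadsheet export rather than a SQL database, a similar ranking can be sketched in shell. This is a minimal sketch, not the query we ran: the CSV layout (`category,resolution_minutes`, no header) and the function name `rank_incidents` are assumptions for illustration, and it skips the six-month date filter.

```shell
#!/bin/bash
# Sketch: rank incident categories by total minutes lost, reading CSV on stdin.
# Assumed input (hypothetical export): category,resolution_minutes
# Output columns: category,frequency,avg_mttr,total_impact_minutes
rank_incidents() {
  local limit="${1:-10}"
  awk -F, '
    NF >= 2 { count[$1]++; total[$1] += $2 }
    END {
      for (c in count)
        printf "%s,%d,%.1f,%d\n", c, count[c], total[c] / count[c], total[c]
    }
  ' | sort -t, -k4,4nr | head -n "$limit"
}

# Example with made-up rows: highest total impact is printed first.
printf '%s\n' 'disk_full,25' 'stuck_deploy,30' 'disk_full,35' | rank_incidents 5
```

Sorting on the fourth column mirrors `ORDER BY total_impact_minutes DESC`, so a rare but slow incident can still outrank a frequent, quick one.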

Example: Disk Full Auto-Remediation

Before (Level 1 — runbook):

1. SSH to the affected host
2. Run df -h to confirm
3. Check /var/log for large files
4. Run logrotate manually
5. If still full, find and remove old files
6. If still full, expand the volume
7. Verify service recovered

After (Level 5 — self-healing):

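#!/bin/bash
# disk-remediation.sh: triggered by a monitoring alert
# Usage: disk-remediation.sh <host>
set -uo pipefail

HOST="${1:?usage: disk-remediation.sh <host>}"
THRESHOLD=90

# Placeholder hooks: swap in your real Slack and paging integrations.
notify_slack() { echo "[slack] $*"; }
page_oncall()  { echo "[page] $*"; }

# Print the /var/log usage percentage on the remote host, digits only.
disk_usage() {
  ssh "$HOST" "df /var/log --output=pcent | tail -1 | tr -dc '0-9'"
}

USAGE=$(disk_usage)
if [ -z "$USAGE" ]; then
  # An empty reading means ssh or df failed; do not guess, escalate.
  page_oncall "Could not read disk usage on ${HOST}. Manual intervention needed."
  exit 1
fi

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "[Auto-Remediation] Disk at ${USAGE}% on ${HOST}"

  # Step 1: Rotate logs
  ssh "$HOST" "sudo logrotate -f /etc/logrotate.conf"

  # Step 2: Clean old compressed logs (>7 days); sudo because /var/log is usually root-owned
  ssh "$HOST" "sudo find /var/log -type f -name '*.gz' -mtime +7 -delete"

  # Step 3: Clean temp files (>3 days), regular files only
  ssh "$HOST" "find /tmp -type f -mtime +3 -delete 2>/dev/null"

  # Verify
  NEW_USAGE=$(disk_usage)

  if [ -n "$NEW_USAGE" ] && [ "$NEW_USAGE" -le "$THRESHOLD" ]; then
    echo "[Auto-Remediation] Resolved. ${USAGE}% -> ${NEW_USAGE}%"
    notify_slack "Disk full on ${HOST} auto-resolved (${USAGE}% -> ${NEW_USAGE}%)"
  else
    echo "[Auto-Remediation] Still at ${NEW_USAGE:-unknown}%. Escalating."
    page_oncall "Disk full on ${HOST} - auto-remediation failed. Manual intervention needed."
  fi
fi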
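Because the script only acts over SSH, the parsing and threshold logic is hard to check without a real host. One option is to pull those two pieces into small functions and feed them canned `df` output. The names `parse_pcent` and `needs_remediation` below are hypothetical helpers for illustration, not part of the script above.

```shell
#!/bin/bash
# Sketch: the local, testable half of the disk check.

# Reads `df --output=pcent` style output on stdin and prints the digits
# from the last line, using the same pipeline as the remediation script.
parse_pcent() {
  tail -1 | tr -dc '0-9'
}

# needs_remediation USAGE THRESHOLD
# Succeeds only when USAGE is non-empty and strictly above THRESHOLD.
needs_remediation() {
  local usage=$1 threshold=$2
  [ -n "$usage" ] && [ "$usage" -gt "$threshold" ]
}

# Canned df output standing in for a real host:
sample=$(printf 'Use%%\n 93%%\n' | parse_pcent)
if needs_remediation "$sample" 90; then
  echo "would remediate at ${sample}%"
fi
```

Treating an empty reading as "do not remediate" keeps a failed SSH call from being mistaken for a healthy or a full disk; the caller can escalate that case separately.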

The Results

| Incident Type | Before (MTTR) | After (MTTR) | Automation Level |
| --- | --- | --- | --- |
| Disk full | 25 min | 90 sec | Self-healing |
| Memory leak | 15 min | 45 sec | One-click |
| Cert expiry | 45 min | 0 (prevented) | Proactive |
| DB conn pool | 20 min | 60 sec | Self-healing |
| Stuck deploy | 30 min | 2 min | One-click |
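The cert expiry row is the odd one out: the incident is prevented rather than fixed faster. A proactive check can be as small as the sketch below, which relies on the `-checkend` option of `openssl x509`. The function name, certificate path, and 30-day window are placeholders, not the setup we run.

```shell
#!/bin/bash
# Sketch: flag a certificate that expires within N days.
# cert_expires_within CERT_FILE DAYS
# Succeeds (exit 0) when the cert expires within DAYS or cannot be read,
# so an unreadable file errs on the side of alerting.
cert_expires_within() {
  local cert_file=$1 days=$2
  # `openssl x509 -checkend N` exits 0 only if the cert is still valid
  # N seconds from now, so the result is negated here.
  ! openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1
}

# Hypothetical usage from a daily scheduled job:
# if cert_expires_within /etc/ssl/certs/example.pem 30; then
#   echo "certificate expires within 30 days, renew now"
# fi
```

Run daily, a check like this turns a 45-minute outage into a routine renewal ticket.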

Total monthly incident time: 14 hours → 45 minutes.
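To see how a single row contributes to that total, here is the arithmetic for disk full alone, using the 2x/week frequency and the MTTR figures from the table. The 52/12 weeks-per-month conversion is my assumption, and this covers one incident type only, not the full 14-hour figure.

```shell
#!/bin/bash
# Sketch: monthly time spent on one incident type.
# monthly_seconds PER_WEEK MTTR_SECONDS
# Assumption: a month is 52/12 weeks (integer arithmetic, rounds down).
monthly_seconds() {
  local per_week=$1 mttr_seconds=$2
  echo $(( per_week * mttr_seconds * 52 / 12 ))
}

# Disk full: 2x/week, 25 min (1500 s) before, 90 s after.
before=$(monthly_seconds 2 1500)
after=$(monthly_seconds 2 90)
echo "before: $(( before / 60 )) min/month, after: $(( after / 60 )) min/month"
# prints "before: 216 min/month, after: 13 min/month"
```

That is roughly 3.6 hours a month down to 13 minutes for this one incident type.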

The Golden Rule

If you've fixed the same incident three times manually, it's time to automate. The third time pays for the automation effort. Everything after that is pure savings.

If you're tired of repetitive incident response and want to automate your runbooks with AI, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
