Incident Response Runbook Template for DevOps

Incidents are stressful when the team is improvising. A simple runbook reduces MTTR (mean time to recovery) by making response repeatable, not heroic.

This post provides a ready-to-use incident response runbook template, plus a practical Linux triage checklist you can run from any box.

What this runbook optimizes for

  • Fast acknowledgement and clear ownership (Incident Commander + roles).
  • Early impact assessment and severity assignment to avoid under/over‑reacting.
  • Communication cadence and “known/unknown/next update” structure that builds trust.
  • Evidence capture (commands + logs) to support post‑incident review.

The incident runbook template

Copy this into your internal wiki, README, Notion, or ops repo.

1. Trigger

Triggers:

  • Monitoring alert / SLO breach

  • Customer report escalated

  • Internal detection (logs, latency spikes, error spikes)

2. Acknowledge (0–5 minutes)

  • Acknowledge page/alert in your paging system.

  • Create an incident channel: #inc-YYYYMMDD-service-shortdesc.

  • Assign Incident Commander (IC) and Comms Lead.

  • Start an incident document: timeline + links + decisions.
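
A minimal incident-document skeleton you can paste into that doc (structure only; swap in your own names and links):

Incident: <one-line description>
Severity: TBD | IC: <name> | Comms Lead: <name>
Status: investigating / mitigating / monitoring / resolved
Timeline (UTC):
  14:02 alert fired (link)
  14:06 channel created, IC assigned
Links: dashboards | logs | deploy history | status page
Decisions and actions: ...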

3. Assess severity (5–10 minutes)

Answer quickly:
- What’s impacted (service, region, feature)?
- How many users / revenue / compliance impact?
- Is impact ongoing and spreading?

Suggested severity:
- SEV1: Major outage / severe user impact; immediate coordination.
- SEV2: Partial outage / significant degradation; urgent but controlled.
- SEV3: Minor impact; can be handled async.

4. Stabilize first (10–30 minutes)

Goal: stop the bleeding before chasing root cause.

Typical mitigations (example commands after the list):

  • Roll back the last deploy/config change.

  • Disable a feature flag.

  • Scale up/out temporarily.

  • Fail over if safe.

  • Rate-limit or block abusive traffic.
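
As a concrete sketch, assuming the affected service runs as a Kubernetes Deployment (the name checkout is hypothetical), the rollback and scale-out mitigations might look like:

# Roll back the most recent rollout and watch it converge
kubectl rollout undo deployment/checkout
kubectl rollout status deployment/checkout

# Temporarily add replicas for headroom while you investigate
kubectl scale deployment/checkout --replicas=6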

5. Triage checklist (host-level)

Run these to establish the baseline quickly (copy/paste friendly).

CPU

ps aux --sort=-%cpu | head -15

Alert cue: any single process sustained above 50% CPU.

Memory

free -h

Alert cue: available memory below 20% of total RAM.
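
If that cue trips, check whether the kernel has been OOM-killing processes (may need elevated privileges):

dmesg -T 2>/dev/null | grep -iE "out of memory|oom-killer"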

Disk

df -h
du -sh /var/log/* 2>/dev/null | sort -h | tail -10

Alert cue: any filesystem above 90% used.
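
If df says "full" but du can't account for the space, a deleted file still held open by a process is a common culprit; where lsof is installed:

lsof +L1 2>/dev/null | head -15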

Disk I/O

iostat -x 1 3

Alert cue: %util >80%, await >20ms.
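
To see which process is driving the I/O, pidstat (from the same sysstat package as iostat) helps:

pidstat -d 1 3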

Network listeners

ss -tuln

Alert cue: unexpected listeners/ports.
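
To map an unexpected listener to its owning process, add -p (usually needs sudo):

sudo ss -tulnp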

Logs (example: nginx)

journalctl -u nginx -f

Alert cue: 5xx errors spiking.
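
To make evidence capture easy, the whole checklist can be bundled into a one-shot snapshot script (a sketch; the nginx unit name is an assumption, swap in your own service):

#!/usr/bin/env bash
# Capture a host triage snapshot into a timestamped file for the incident doc.
out="/tmp/triage-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== CPU ==";        ps aux --sort=-%cpu | head -15
  echo "== Memory ==";     free -h
  echo "== Disk ==";       df -h
  echo "== Disk I/O ==";   iostat -x 1 3
  echo "== Listeners =="; ss -tuln
  echo "== nginx logs =="; journalctl -u nginx -n 200 --no-pager
} > "$out" 2>&1
echo "Triage snapshot written to $out"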

6. Comms cadence (keep it boring)

SEV1: updates every 10–15 minutes.
SEV2: updates every 30 minutes.
SEV3: async updates acceptable.

Use this structure (see the example after the list):

  • What we know
  • What we don’t know
  • What we’re doing now
  • Next update at: TIME
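
A filled-in example (all numbers and names hypothetical):

[14:32 UTC] SEV2 - checkout latency
What we know: p99 latency up ~4x in eu-west since 14:05; error rate still <1%.
What we don't know: whether the 14:00 deploy is the trigger.
What we're doing now: rolling back the deploy and watching error/latency dashboards.
Next update at: 15:00 UTC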

7. Verify resolution

  • Confirm user impact is gone (synthetic checks + error rate + latency).

  • Confirm saturation is back to normal (CPU, memory, disk, I/O).

  • Watch for 30–60 minutes for regression.
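
A minimal synthetic-check sketch, assuming an HTTP health endpoint (the URL is a placeholder):

# Hit the endpoint every 5 seconds; print status code and total request time
for i in $(seq 1 12); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://your-service.example.com/healthz
  sleep 5
done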

8. Close and learn (post-incident)

  • Write a brief timeline (detection → mitigation → resolution).
  • Capture what worked, what didn’t, and what to automate.
  • Create follow-ups: alerts tuning, runbook updates, tests, guardrails.

Bonus: “Golden signals” lens for incidents

When you’re lost, anchor on the four golden signals:

  • Latency (are requests slower?)
  • Traffic (is demand abnormal?)
  • Errors (is failure rate rising?)
  • Saturation (are resources hitting limits?)

This keeps triage focused on user impact and system limits, not vanity metrics.
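
For example, a quick errors-by-status-code breakdown straight from an nginx access log (assumes the default combined log format and path):

tail -n 1000 /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -rn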

Download / reuse

If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.
