Incident Response Runbook Template for DevOps
Incidents are stressful when the team is improvising. A simple runbook reduces MTTR by making response repeatable, not heroic.
This post provides a ready-to-use incident response runbook template, plus a practical Linux triage checklist you can run from any box.
What this runbook optimizes for
- Fast acknowledgement and clear ownership (Incident Commander + roles).
- Early impact assessment and severity assignment to avoid under- or over-reacting.
- Communication cadence and “known/unknown/next update” structure that builds trust.
- Evidence capture (commands + logs) to support post‑incident review.
The incident runbook template
Copy this into your internal wiki, README, Notion, or ops repo.
1. Trigger
Common triggers:
- Monitoring alert / SLO breach
- Customer report escalated
- Internal detection (logs, latency spikes, error spikes)
2. Acknowledge (0–5 minutes)
- Acknowledge the page/alert in your paging system.
- Create an incident channel: #inc-YYYYMMDD-service-shortdesc.
- Assign an Incident Commander (IC) and a Comms Lead.
- Start an incident document: timeline + links + decisions (see the sketch below).
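If you want the incident document to exist before anyone opens the wiki, the sketch below seeds it from a shell. It assumes a local ops directory; the INC_ID value, service name, and path are placeholders, not a required convention.

# Sketch: seed the incident doc from the shell (ID, service name, and path are placeholders)
INC_ID="inc-$(date +%Y%m%d)-payments-checkout"
mkdir -p ~/ops/incidents
{
  echo "Incident: $INC_ID"
  echo "Severity: TBD   IC: TBD   Comms Lead: TBD"
  echo "Timeline (UTC):"
  echo "Links (dashboards, logs, deploy history):"
  echo "Decisions:"
} > ~/ops/incidents/"$INC_ID.md"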
3. Assess severity (5–10 minutes)
Answer quickly:
- What’s impacted (service, region, feature)?
- How many users are affected? Is there revenue or compliance impact?
- Is impact ongoing and spreading?
Suggested severity:
- SEV1: Major outage / severe user impact; immediate coordination.
- SEV2: Partial outage / significant degradation; urgent but controlled.
- SEV3: Minor impact; can be handled async.
4. Stabilize first (10–30 minutes)
Goal: stop the bleeding before chasing root cause.
Typical mitigations (a rollback sketch follows this list):
- Roll back the last deploy/config change.
- Disable a feature flag.
- Scale up/out temporarily.
- Fail over if safe.
- Rate-limit or block abusive traffic.
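The rollback sketch referenced above, assuming the service runs as a Kubernetes Deployment; the deployment name is a placeholder and your deploy tooling may differ.

# Assumption: Kubernetes deploys; "checkout-api" is a placeholder name
kubectl rollout history deployment/checkout-api     # confirm what shipped last
kubectl rollout undo deployment/checkout-api        # roll back one revision
kubectl rollout status deployment/checkout-api --timeout=120s
kubectl scale deployment/checkout-api --replicas=6  # temporary scale-out while investigating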
5. Triage checklist (host-level)
Run these to establish a baseline quickly (copy/paste friendly); a combined snapshot script follows at the end of this checklist.
CPU
ps aux --sort=-%cpu | head -15
Alert cue: any single process sustained above 50% CPU.
Memory
free -h
Alert cue: available <20% total RAM.
Disk
df -h
du -sh /var/log/* 2>/dev/null | sort -h | tail -10
Alert cue: any filesystem >90%.
Disk I/O
iostat -x 1 3
Alert cue: %util >80%, await >20ms.
Network listeners
ss -tuln
Alert cue: unexpected listeners/ports.
Logs (example: nginx)
journalctl -u nginx -f
Alert cue: 5xx errors spiking.
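The combined snapshot script mentioned at the top of this checklist: one way to run everything above in a single pass and paste the output into the incident doc. The nginx access-log path and combined log format are assumptions; adjust for your services.

#!/usr/bin/env bash
# One-shot host triage snapshot (bundles the commands from this checklist)
set -u
echo "== $(hostname) @ $(date -u '+%Y-%m-%dT%H:%M:%SZ') =="
echo "--- Top CPU ---";      ps aux --sort=-%cpu | head -15
echo "--- Memory ---";       free -h
echo "--- Disk usage ---";   df -h
echo "--- Largest logs ---"; du -sh /var/log/* 2>/dev/null | sort -h | tail -10
echo "--- Disk I/O ---";     iostat -x 1 3
echo "--- Listeners ---";    ss -tuln
# Assumes nginx combined access log at the default path; field 9 is the status code
echo "--- Recent 5xx ---";   tail -n 2000 /var/log/nginx/access.log 2>/dev/null | awk '$9 ~ /^5/ {print $9}' | sort | uniq -c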
6. Comms cadence (keep it boring)
- SEV1: updates every 10–15 minutes.
- SEV2: updates every 30 minutes.
- SEV3: async updates acceptable.
Use this structure (a webhook posting sketch follows the list):
- What we know
- What we don’t know
- What we’re doing now
- Next update at: TIME
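If updates go to Slack, here is the posting sketch referenced above, assuming an incoming webhook is already configured; the URL and the message details are placeholders, not real values.

# Assumption: Slack incoming webhook; URL and message contents are placeholders
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"text":"SEV2 update\nWhat we know: elevated 5xx on checkout\nWhat we do not know: whether other regions are affected\nWhat we are doing now: rolling back the last deploy\nNext update at: 14:30 UTC"}' \
  "$WEBHOOK_URL"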
7. Verify resolution
- Confirm user impact is gone (synthetic checks + error rate + latency); a simple check loop follows this list.
- Confirm saturation is back to normal (CPU, memory, disk, I/O).
- Watch for 30–60 minutes for regression.
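The check loop mentioned above, which you can leave running during the watch window; the URL is a placeholder and the 10-second interval is arbitrary.

# Assumption: /healthz stands in for a real user-facing endpoint
URL="https://example.com/healthz"
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$URL")
  echo "$(date -u '+%H:%M:%S') status=$code"
  sleep 10
done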
8. Close and learn (post-incident)
- Write a brief timeline (detection → mitigation → resolution); a review-doc skeleton follows this list.
- Capture what worked, what didn’t, and what to automate.
- Create follow-ups: alerts tuning, runbook updates, tests, guardrails.
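To seed the review doc referenced above from the same shell session, a minimal sketch; the ID and path mirror the earlier placeholders.

# Sketch: seed the post-incident review file (ID and path are placeholders)
INC_ID="inc-$(date +%Y%m%d)-payments-checkout"
cat > ~/ops/incidents/"$INC_ID-review.md" <<'EOF'
Timeline (UTC): detection -> mitigation -> resolution
What worked:
What did not work:
Follow-ups (alert tuning, runbook updates, tests, guardrails):
EOF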
Bonus: “Golden signals” lens for incidents
When you’re lost, anchor on the four golden signals:
- Latency (are requests slower?)
- Traffic (is demand abnormal?)
- Errors (is failure rate rising?)
- Saturation (are resources hitting limits?)
This keeps triage focused on user impact and system limits, not vanity metrics.
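One way to put rough numbers behind three of the four signals from a single host; latency usually needs your APM or a log format that records request time. The nginx combined access-log path is an assumption.

# Assumption: nginx combined access log; adapt paths and fields to your stack
LOG=/var/log/nginx/access.log
# Errors: share of 5xx in the last 2000 requests
tail -n 2000 "$LOG" | awk '{t++} $9 ~ /^5/ {e++} END {if (t) printf "5xx rate: %.2f%%\n", 100*e/t}'
# Traffic: requests per minute over the most recent lines
tail -n 5000 "$LOG" | awk -F'[][]' '{print substr($2,1,17)}' | uniq -c | tail -10
# Saturation: quick CPU / memory / I/O snapshot
uptime; free -h; iostat -x 1 2 | tail -20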
Download / reuse
If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.