Incident Management: Building Effective On-Call Rotations and Runbooks

#incident #oncall #sre #devops

Introduction

Every engineering team dreads the 3 AM page. How your team handles these moments largely determines how reliable your service is for the people who depend on it.

This guide covers on-call rotation design, runbook creation, incident response processes, and blameless post-mortems.

On-Call Best Practices

Designing Sustainable Rotations

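Rotation Length: Weekly rotations work best for most teams
Minimum Team Size: Never fewer than 4 engineers, so that on a weekly rotation nobody is on call more than one week in four
Compensation: Pay engineers fairly for on-call work, including time spent outside normal hours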
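
As a rough illustration of the rotation guidelines above, here is a minimal sketch of a round-robin weekly schedule. The function name `weekly_rotation`, the `MIN_TEAM_SIZE` constant, and the engineer names are hypothetical and chosen for this example; a real team would more likely configure this in its paging tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice

# Assumption: mirrors the "never fewer than 4 engineers" guideline above.
MIN_TEAM_SIZE = 4


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs for a round-robin weekly rotation."""
    if len(engineers) < MIN_TEAM_SIZE:
        raise ValueError(
            f"need at least {MIN_TEAM_SIZE} engineers, got {len(engineers)}"
        )
    # cycle() repeats the roster; islice() takes one engineer per week.
    on_call = islice(cycle(engineers), weeks)
    return [(start + timedelta(weeks=i), eng) for i, eng in enumerate(on_call)]


if __name__ == "__main__":
    # Hypothetical roster and start date.
    for week_start, engineer in weekly_rotation(
        ["ana", "ben", "chi", "dev"], date(2024, 1, 1), 6
    ):
        print(week_start.isoformat(), engineer)
```

With four engineers, each person comes back on call every fourth week; a smaller roster is rejected outright rather than silently producing a heavier schedule.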

Reducing Alert Fatigue

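Actionable alerts only: page a human only when there is something they can do about it
Consolidate related alerts: group alerts that share a cause so that one failure produces one page
Regular alert reviews: periodically review which alerts fired, then tune or delete the noisy ones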
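
To make alert consolidation concrete, here is a minimal sketch that groups alerts sharing a service and alert name when they fire within a time window. The `Alert` fields, the `consolidate` function, and the 300-second default window are all assumptions made for this example; most alerting tools offer their own grouping rules, which would be the better choice in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape for this sketch.
    service: str
    name: str
    timestamp: float  # seconds since epoch


def consolidate(alerts, window_seconds=300):
    """Group alerts with the same (service, name) that fire within
    window_seconds of the first alert in their group."""
    groups = []
    latest_group = {}  # (service, name) -> index of the most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.name)
        idx = latest_group.get(key)
        if idx is not None and (
            alert.timestamp - groups[idx][0].timestamp <= window_seconds
        ):
            groups[idx].append(alert)
        else:
            # Start a new group, which would correspond to one page.
            latest_group[key] = len(groups)
            groups.append([alert])
    return groups
```

Each returned group would correspond to a single page, so a burst of identical alerts from one failing service wakes the on-call engineer once instead of many times.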

Runbook Template Example

# Runbook: Database Connection Pool Exhausted

## Impact Assessment
- User Impact: New requests may fail
- Affected Services: All services using primary database

## Immediate Actions
1. Check current connections:
   SELECT count(*), state FROM pg_stat_activity GROUP BY state;

2. Identify connection hogs:
   SELECT usename, application_name, count(*)
   FROM pg_stat_activity GROUP BY usename, application_name
   ORDER BY count(*) DESC;

## Resolution Steps
- If long-running query: SELECT pg_terminate_backend(<pid>);
- If traffic spike: Scale up read replicas
- If connection leak: Restart affected service
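
The "long-running query" branch of the runbook could be supported by a small helper that picks candidate backends to terminate. This is only a sketch: the `Backend` tuple, the `long_running_pids` function, and the 300-second threshold are hypothetical, and it assumes the caller has already fetched `pid` and `state` from `pg_stat_activity` and computed each query's duration from `query_start`.

```python
from typing import NamedTuple


class Backend(NamedTuple):
    # Hypothetical row shape built from pg_stat_activity.
    pid: int
    state: str
    query_seconds: float  # assumed: now() - query_start, computed by the caller


def long_running_pids(backends, threshold_seconds=300.0):
    """Return pids of active backends whose current query exceeds the threshold."""
    return [
        b.pid
        for b in backends
        if b.state == "active" and b.query_seconds > threshold_seconds
    ]
```

The helper only reports candidates. Deciding whether to terminate a backend should stay with the on-call engineer, since an idle connection or a legitimate batch job can look similar from the outside.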

Incident Response Process

Phase 1: Detection and Triage (0-5 minutes)

Acknowledge the alert
Assess severity
Open the relevant runbook
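
Assessing severity goes faster when the criteria are written down in advance. The sketch below shows one way to encode them; the `SEV1` to `SEV3` levels, the `assess_severity` function, and the error-rate thresholds are illustrative assumptions, not values from this guide, and each team should set its own.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity scale for this sketch.
    SEV1 = 1  # major customer-facing outage
    SEV2 = 2  # degraded service or significant internal impact
    SEV3 = 3  # minor issue, can wait for business hours


def assess_severity(customer_facing: bool, error_rate: float) -> Severity:
    """Map two triage signals to a severity level.

    error_rate is the fraction of failing requests, from 0.0 to 1.0.
    Thresholds are placeholders chosen for illustration.
    """
    if customer_facing and error_rate >= 0.25:
        return Severity.SEV1
    if customer_facing or error_rate >= 0.05:
        return Severity.SEV2
    return Severity.SEV3
```

For example, `assess_severity(True, 0.5)` returns `Severity.SEV1`, while a non-customer-facing service with a 1% error rate returns `Severity.SEV3`.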

Phase 2: Response Coordination

Incident Commander: Coordinates response
Technical Lead: Hands-on debugging
Communications Lead: Updates stakeholders

Blameless Post-Mortems

Focus on systems and processes, not individuals. The goal is learning and improvement.

Conclusion

Effective incident management is built on preparation, not heroics. Well-designed rotations, comprehensive runbooks, and blameless post-mortems turn incident response from a source of stress into a competitive advantage.