On-Call Best Practices: An SRE Guide to Incident Response

#sre #incidents #oncall #devops

On-Call Best Practices: Rotations, Escalation Policies, and Reducing Alert Fatigue

On-call is where reliability engineering meets human sustainability. A poorly designed on-call rotation burns out engineers, produces alert fatigue that causes real incidents to be ignored, and creates a culture where nobody wants to be on-call. A well-designed rotation distributes load fairly, ensures every alert is actionable, and gives on-call engineers the context and authority to resolve incidents quickly.

Start with the alerts themselves. Every page should be actionable, meaning the on-call engineer can do something about it right now. Alerts on symptoms (error rate above 1%, P99 latency above 2 seconds) are actionable because they reflect what users are experiencing. Alerts on causes (CPU at 80%) usually are not: high CPU might be normal during a traffic spike and resolve on its own. Apply the test: if the alert fires and the correct response is "wait and see," it should not page. Move informational alerts to a dashboard or low-priority channel. Target fewer than 2 pages per on-call shift. A higher page volume points to systemic issues that need engineering investment, not more alert tuning.
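As a rough illustration of the "symptoms page, causes go to a dashboard" test, here is a minimal Python sketch. The thresholds come from the examples above; the `Metrics` class, the `route_alert` and `shift_needs_investment` functions, and the routing labels are hypothetical names invented for this example, not part of any monitoring tool. It also reads "fewer than 2 pages per shift" as meaning a shift with 2 or more pages has missed the target, which is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the examples in the text; tune them per service.
ERROR_RATE_THRESHOLD = 0.01    # error rate above 1%
P99_LATENCY_THRESHOLD_S = 2.0  # P99 latency above 2 seconds
CPU_INFO_THRESHOLD = 0.80      # CPU at 80%: informational only
PAGE_TARGET_PER_SHIFT = 2      # target is fewer than 2 pages per shift


@dataclass
class Metrics:
    """Hypothetical snapshot of service metrics (names are assumptions)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_s: float    # P99 latency in seconds
    cpu_utilization: float  # fraction of CPU in use, 0.0 to 1.0


def route_alert(m: Metrics) -> str:
    """Return "page", "dashboard", or "ok" for a metrics snapshot."""
    # Symptoms: users are affected, so someone can and should act now.
    if m.error_rate > ERROR_RATE_THRESHOLD or m.p99_latency_s > P99_LATENCY_THRESHOLD_S:
        return "page"
    # Cause without a symptom: the right response is "wait and see",
    # so it goes to a dashboard or low-priority channel instead of a page.
    if m.cpu_utilization >= CPU_INFO_THRESHOLD:
        return "dashboard"
    return "ok"


def shift_needs_investment(pages_in_shift: int) -> bool:
    """True when a shift missed the fewer-than-2-pages target (assumed reading)."""
    return pages_in_shift >= PAGE_TARGET_PER_SHIFT
```

For example, `route_alert(Metrics(error_rate=0.001, p99_latency_s=0.5, cpu_utilization=0.9))` routes to the dashboard: CPU is high, but no user-facing symptom has crossed a threshold.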

Rotation design matters for team health. Weekly rotations are the most common, but follow-the-sun rotations across time zones prevent overnight pages entirely. Implement a primary and secondary on-call - the secondary is the escalation path and provides backup. Escalation policies should auto-escalate to the secondary after 10 minutes of no acknowledgment, then to the engineering manager after 20 minutes. Compensate on-call fairly (additional pay or time off), and track on-call load per engineer to ensure equitable distribution. Post-incident reviews should evaluate whether each alert was necessary and whether runbooks need updating.
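The escalation timing and load tracking described above could be sketched roughly as follows. This is an illustration only: the function names and role labels are hypothetical, and it assumes both the 10-minute and 20-minute thresholds are measured from the moment the page was first sent, which the text does not spell out. Real paging tools express this as configuration rather than code.

```python
from collections import Counter

# Assumption: both thresholds count minutes since the original page.
SECONDARY_AFTER_MIN = 10
MANAGER_AFTER_MIN = 20


def escalation_target(minutes_unacknowledged: float) -> str:
    """Return who should currently be notified for an unacknowledged page."""
    if minutes_unacknowledged >= MANAGER_AFTER_MIN:
        return "engineering_manager"
    if minutes_unacknowledged >= SECONDARY_AFTER_MIN:
        return "secondary"
    return "primary"


def pages_per_engineer(page_log: list[str]) -> Counter:
    """Count pages per engineer from a log of who was paged (hypothetical format)."""
    return Counter(page_log)


def load_spread(page_log: list[str]) -> int:
    """Difference between the most-paged and least-paged engineer in the log."""
    counts = pages_per_engineer(page_log)
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())
```

A large `load_spread` over a quarter is one simple signal that the rotation is not distributing pages evenly. Note that it only sees engineers who appear in the log, so someone who was never paged would need to be added with a zero count.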

Need help building your on-call practice? InstaDevOps helps teams design sustainable on-call rotations and monitoring strategies. Book a free consultation.
