3 on-call rotation mistakes that burn out your best engineers first

#sre #devops #infrastructure #career

The engineers who leave over on-call are rarely the ones who complain about it. They're the ones who quietly absorb everything, resolve incidents fast, never escalate, and one day accept an offer somewhere else. By the time you notice the pattern, you've already lost the person the rotation was grinding down.

Three mistakes create that outcome.

1. Measuring shifts per engineer instead of load per engineer

Equal shifts are not equal load. A week with two P1 incidents resolved in 20 minutes each is not the same as a week with twelve alerts that each require 45 minutes of investigation at 2 a.m. If you track only who was on call and not what each shift actually cost, you will consistently underestimate the burden on your senior engineers, who resolve things faster but get paged more often because they're trusted to handle anything. Track actionable pages per shift per engineer. If one person consistently receives 3x the load of the others, the rotation is broken regardless of how the calendar looks. The fix is alert hygiene first (delete any alert that nobody has acted on for 30 consecutive days), then rebalancing the schedule based on load data, not headcount fairness.
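
The two measurements above (actionable pages per shift, and alerts nobody has acted on in 30 days) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Page` record, its field names, and the choice to compare each engineer against the median of everyone else are all assumptions, and real data would come from whatever export your paging tool provides.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Page:
    """Hypothetical page record; field names are illustrative."""
    engineer: str       # who was paged
    alert: str          # name of the alert rule that fired
    fired_at: datetime  # when the page went out
    actioned: bool      # True if the responder actually had to do something


def load_per_shift(pages, shifts):
    """Actionable pages per shift for each engineer.

    `shifts` maps engineer name -> number of shifts worked in the period.
    """
    counts = Counter(p.engineer for p in pages if p.actioned)
    return {eng: counts[eng] / n for eng, n in shifts.items() if n > 0}


def overloaded(per_shift, ratio=3.0):
    """Engineers whose load is at least `ratio` times the median of the others."""
    flagged = []
    for eng, load in per_shift.items():
        others = [v for e, v in per_shift.items() if e != eng]
        if others and load > 0 and load >= ratio * median(others):
            flagged.append(eng)
    return sorted(flagged)


def stale_alerts(pages, now, window_days=30):
    """Alerts that fired inside the window but were never acted on."""
    cutoff = now - timedelta(days=window_days)
    recent = [p for p in pages if p.fired_at >= cutoff]
    fired = {p.alert for p in recent}
    acted = {p.alert for p in recent if p.actioned}
    return sorted(fired - acted)
```

Feeding `overloaded` the output of `load_per_shift` gives the list of people to rebalance around, and `stale_alerts` gives the deletion candidates to review first. The 3x ratio and the 30-day window come from the text; comparing against the median rather than the mean is a design choice so that one extreme outlier does not hide another.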

2. Putting engineers on independent on-call before shadow shifts

The correct progression before anyone carries the pager alone: observer phase (receive all the same pages, take no action, watch how the primary responds), then reverse shadow (lead the response with an experienced engineer watching), then independent. Skipping this costs you higher MTTR on every incident that engineer handles alone, plus an experience that makes on-call feel dangerous rather than manageable. Four to six weeks of partial senior engineer time upfront costs significantly less than the first major incident where an unprepared engineer makes it worse.
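
If you want the schedule tooling to enforce that progression rather than rely on memory, the three stages can be modeled as an ordered enum with a gate on promotion. This is a rough sketch under stated assumptions: the `Stage` names mirror the phases above, but `required_shifts=2` is a made-up threshold, and the right number depends on how your four to six weeks are split.

```python
from enum import IntEnum


class Stage(IntEnum):
    OBSERVER = 1        # receives the same pages, takes no action
    REVERSE_SHADOW = 2  # leads the response while an experienced engineer watches
    INDEPENDENT = 3     # carries the pager alone


def next_stage(stage, shifts_completed, required_shifts=2):
    """Advance exactly one stage once enough shifts are done; never skip a stage.

    `required_shifts` is an illustrative threshold, not a recommendation.
    """
    if stage is Stage.INDEPENDENT or shifts_completed < required_shifts:
        return stage
    return Stage(stage + 1)


def can_carry_pager_alone(stage):
    """Only engineers who finished both shadow phases go on the rotation alone."""
    return stage is Stage.INDEPENDENT
```

The point of the gate is that promotion is always one step at a time, so nobody lands on the independent rotation without having led at least some responses under supervision.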

3. Treating on-call as part of the job with no additional recognition

An engineer who is paged three times outside business hours in a single week and is then expected to deliver full sprint capacity the following week is being asked to absorb a cost that nobody is acknowledging. Fixing this doesn't require complex compensation structures. Any of the following is enough: time in lieu for overnight pages, a reduced sprint commitment after heavy on-call weeks, or explicit acknowledgment in performance reviews. The failure mode is pretending the cost doesn't exist.

If Opsgenie is still in your stack: end-of-support is April 5, 2027. If your runbooks and escalation policies live inside it, export everything now. The format doesn't migrate cleanly into alternatives.

Top comments (1)

Rahul Joshi • Apr 29

Spot on about measuring 'load' instead of just 'shifts'—treating headcount fairness as equivalent to operational fairness is exactly how senior talent gets quietly ground down.