The On-Call Burnout Epidemic
I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustainable.
We were spending $500K+ recruiting replacements for a problem that better practices could have fixed at a fraction of that cost.
The Warning Signs
Before someone quits, they show these signals:
- Cynicism in post-mortems — "This will never get fixed"
- Alert numbness — Slow to respond, missed pages
- Vacation avoidance — "I can't take time off, who would cover?"
- Scope creep rejection — "That's not my problem"
- Meeting silence — Previously engaged, now checked out
If you see three or more of these in someone on your team, they're already halfway out the door.
What We Changed
1. Hard Cap on Pages
We set a maximum of 2 pages per 8-hour on-call shift. If the primary gets paged more than that, the secondary automatically takes over and the overflow is flagged as a process failure for the weekly ops review.
on_call_policy:
  max_pages_per_shift: 2
  shift_duration_hours: 8
  overflow_action: "escalate_to_secondary"
  overflow_review: "weekly_ops_review"
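To make the cap concrete, here is a minimal Python sketch of how the policy above could be enforced. It is an illustration, not a real paging integration: `OnCallPolicy`, `ShiftState`, and `route_page` are hypothetical names, and the sketch assumes that the first two pages in a shift go to the primary and every page after that goes to the secondary and is flagged for review.

```python
from dataclasses import dataclass


@dataclass
class OnCallPolicy:
    # Field names mirror the keys in the policy config above.
    max_pages_per_shift: int = 2
    shift_duration_hours: int = 8
    overflow_action: str = "escalate_to_secondary"
    overflow_review: str = "weekly_ops_review"


@dataclass
class ShiftState:
    # Hypothetical per-shift counter of pages already sent to the primary.
    pages_to_primary: int = 0


def route_page(policy: OnCallPolicy, shift: ShiftState) -> tuple[str, bool]:
    """Decide who receives the next page.

    Returns (target, needs_review). Assumption: once the primary has
    received max_pages_per_shift pages, all further pages in the shift
    go to the secondary and are marked for the weekly ops review.
    """
    if shift.pages_to_primary < policy.max_pages_per_shift:
        shift.pages_to_primary += 1
        return "primary", False
    return "secondary", True
```

With the default policy, the first two calls to `route_page` return `("primary", False)` and the third returns `("secondary", True)`. Resetting `ShiftState` at each shift handover is left out of the sketch.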
2. Follow-the-Sun Rotation
We stopped asking people to be on-call at 3am. With team members across US time zones, we created overlapping business-hours shifts:
Shift A (Eastern): 6am - 2pm ET
Shift B (Central): 11am - 7pm CT
Shift C (Pacific): 2pm - 10pm PT
Overnight: Managed by alert automation + escalation
Nobody gets paged between 10pm and 6am their local time unless it's a true P1.
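The overlaps are easier to see once all three shifts are put on one clock. The sketch below converts each shift to Eastern hours and reports which shifts are staffed at a given hour. It assumes fixed offsets between the zones (Central one hour behind Eastern, Pacific three hours behind), which ignores the few hours around daylight-saving transitions. The function names are hypothetical.

```python
# Hours relative to Eastern Time (assumed constant year-round).
OFFSET_FROM_ET = {"ET": 0, "CT": -1, "PT": -3}

# (name, zone, local start hour, local end hour), from the schedule above.
SHIFTS = [
    ("Shift A", "ET", 6, 14),
    ("Shift B", "CT", 11, 19),
    ("Shift C", "PT", 14, 22),
]


def shift_window_et(zone, start, end):
    """Convert a local shift window to hours on the Eastern clock."""
    delta = -OFFSET_FROM_ET[zone]  # local hour + delta = Eastern hour
    return start + delta, end + delta


def shifts_covering(hour_et):
    """Return the names of the shifts staffed at an Eastern hour (0-23)."""
    covering = []
    for name, zone, start, end in SHIFTS:
        start_et, end_et = shift_window_et(zone, start, end)
        # A window can run past midnight Eastern (end_et > 24),
        # so also check the hour shifted into the next day.
        if start_et <= hour_et < end_et or start_et <= hour_et + 24 < end_et:
            covering.append(name)
    return covering


def unstaffed_hours_et():
    """Eastern hours with no human shift, left to automation + escalation."""
    return [h for h in range(24) if not shifts_covering(h)]
```

On the Eastern clock the shifts run 6am to 2pm, 12pm to 8pm, and 5pm to 1am, so `shifts_covering(13)` returns `["Shift A", "Shift B"]` and `unstaffed_hours_et()` returns `[1, 2, 3, 4, 5]`: the overnight window that automation handles.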
3. On-Call Compensation
We implemented:
- $500 flat fee per on-call week
- $200 per off-hours page
- A comp day after any overnight incident lasting more than 30 minutes
- On-call swaps require zero management approval
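As a worked example of how these rules add up for one on-call week, here is a small Python sketch. The function name and inputs are hypothetical, and it assumes one comp day is earned per qualifying overnight incident, which the list above does not spell out.

```python
FLAT_FEE_PER_WEEK = 500          # dollars, per on-call week
FEE_PER_OFF_HOURS_PAGE = 200     # dollars, per off-hours page
COMP_DAY_THRESHOLD_MINUTES = 30  # overnight incidents longer than this qualify


def weekly_compensation(off_hours_pages, overnight_incident_minutes):
    """Return (cash_dollars, comp_days) for one on-call week.

    off_hours_pages: number of pages received outside shift hours.
    overnight_incident_minutes: duration of each overnight incident.
    Assumption: each incident over the threshold earns one comp day.
    """
    cash = FLAT_FEE_PER_WEEK + FEE_PER_OFF_HOURS_PAGE * off_hours_pages
    comp_days = sum(
        1 for minutes in overnight_incident_minutes
        if minutes > COMP_DAY_THRESHOLD_MINUTES
    )
    return cash, comp_days
```

For a week with two off-hours pages and overnight incidents of 45 and 10 minutes, `weekly_compensation(2, [45, 10])` returns `(900, 1)`: $500 flat, $400 in page fees, and one comp day.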
4. The "Toil Budget"
Each engineer gets a toil budget: a maximum of 30% of their time on operational work. If their toil exceeds 30%, they're pulled from on-call until the team automates the excess.
Weekly toil tracking:
Alert response: 4 hours
Manual deployments: 2 hours
Config updates: 1 hour
Ad-hoc debugging: 3 hours
─────────────────────────
Total: 10 hours (25% of 40hr week) ✓
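The same check is easy to automate. This is a minimal sketch assuming a 40-hour week and hours logged per category. `toil_report` is a hypothetical helper, not part of any tracking tool.

```python
TOIL_BUDGET = 0.30     # maximum share of time spent on operational work
WORK_WEEK_HOURS = 40   # assumed standard week, as in the tally above


def toil_report(hours_by_category):
    """Return (total_hours, fraction_of_week, within_budget)."""
    total = sum(hours_by_category.values())
    fraction = total / WORK_WEEK_HOURS
    # Toil "exceeds" the budget only when it goes above 30%.
    return total, fraction, fraction <= TOIL_BUDGET


week = {
    "Alert response": 4,
    "Manual deployments": 2,
    "Config updates": 1,
    "Ad-hoc debugging": 3,
}
total, fraction, ok = toil_report(week)
print(f"Total: {total} hours ({fraction:.0%} of {WORK_WEEK_HOURS}hr week)",
      "within budget" if ok else "over budget")
```

For the week above this reports 10 hours, 25%, within budget. An engineer logging 13 hours (32.5%) would be over and would come off the rotation until the excess is automated.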
5. Quarterly On-Call Reviews
Every quarter, we review:
- Pages per person
- Off-hours disruptions
- Toil percentages
- Team sentiment survey
- Attrition risk signals
The Results
| Metric | Before | After (6 months) |
|---|---|---|
| Attrition rate | 40%/year | 8%/year |
| Pages per shift | 4.7 | 1.2 |
| Off-hours pages | 12/week | 2/week |
| Team NPS | -15 | +45 |
| Recruitment cost saved | - | ~$400K/year |
The Key Insight
On-call wellness isn't a perk. It's a business decision. Replacing a senior SRE costs $150-200K in recruiting, onboarding, and lost productivity. Preventing burnout costs almost nothing.
If you're looking to reduce on-call toil and protect your team from burnout, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com