LinkedIn Draft — Workflow (2026-04-13)
Not in any textbook — learned this from a 3am page:
On-call burnout is an alert design problem, not a schedule problem
Every team I've seen fights burnout by rotating people faster. The actual fix is almost always the same: the alerts are wrong.
Alert quality spectrum:
Noisy ◀───────────────────────────────▶ Actionable
[cpu > 80%]   [pod restart]   [error budget burn]   [customer impact]
 ignore me       maybe?          investigate!          wake me up
Where it breaks:
▸ Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.
▸ Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.
▸ Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.
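The cause-vs-symptom split above can be sketched as a routing decision. A minimal, hypothetical example (the alert names and two-tier page/ticket routing are assumptions, not anyone's production config):

```python
# Hypothetical split: symptom alerts page a human, cause alerts file a ticket.
SYMPTOM = "symptom"  # user-visible impact, e.g. latency spike
CAUSE = "cause"      # internal condition, e.g. disk filling up

ALERTS = {
    "p99_latency_high": SYMPTOM,
    "error_budget_burn_fast": SYMPTOM,
    "disk_usage_over_85pct": CAUSE,
    "pod_restart_loop": CAUSE,
}

def route(alert_name: str) -> str:
    """Symptom alerts wake someone up; cause alerts can wait for business hours."""
    kind = ALERTS.get(alert_name, SYMPTOM)  # unknown alerts default to paging
    return "page" if kind == SYMPTOM else "ticket"
```

The default-to-paging choice is deliberate: an unclassified alert is a symptom until proven otherwise.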
The rule I keep coming back to:
→ Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.
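The three questions make a natural pre-ship gate. A sketch, assuming a hypothetical `AlertSpec` shape (field names are mine, not from any tool):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertSpec:
    name: str
    owner: Optional[str]           # who acts on it?
    runbook_url: Optional[str]     # what do they do?
    inaction_cost: Optional[str]   # cost of 30 minutes of doing nothing?

def ready_to_ship(spec: AlertSpec) -> bool:
    """An alert ships only when all three questions have answers."""
    return all([spec.owner, spec.runbook_url, spec.inaction_cost])
```

Wiring a check like this into alert-config CI review turns the rule from a norm into a gate.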
How I sanity-check it:
▸ Weekly alert review ritual: tag every last-week page as actionable / noisy / redundant. Kill the bottom two categories.
▸ PagerDuty/OpsGenie grouping + escalation policies — reduce interrupt rate without hiding real incidents.
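The weekly review ritual can be reduced to a few lines. A sketch with made-up page data (alert names and tags are illustrative): an alert lands on the kill list only if every one of its pages last week was tagged noisy or redundant.

```python
# Hypothetical tags from a weekly review of last week's pages.
tagged_pages = [
    ("db-primary-cpu-high", "noisy"),
    ("checkout-error-budget-burn", "actionable"),
    ("node-disk-pressure", "redundant"),
    ("checkout-error-budget-burn", "actionable"),
    ("db-primary-cpu-high", "noisy"),
]

def kill_list(pages):
    """Alerts whose pages were only ever noisy or redundant get deleted."""
    verdicts = {}
    for name, tag in pages:
        verdicts.setdefault(name, set()).add(tag)
    return sorted(name for name, tags in verdicts.items()
                  if tags <= {"noisy", "redundant"})
```

One actionable page is enough to keep an alert alive; the ritual only kills the ones that never earned their interrupt.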
Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.
Hiring managers: engineers who think about this in interviews are the ones worth calling back.