LinkedIn Draft — Workflow (2026-04-13)
Not in any textbook — learned this from a 3am page:
On-call burnout is an alert design problem, not a schedule problem
Every team I've seen fights burnout by rotating people faster. The actual fix is almost always the same: the alerts are wrong.
Alert quality spectrum:
Noisy ◀───────────────────────────────▶ Actionable
[cpu > 80%]   [pod restart]   [error budget burn]   [customer impact]
 ignore me       maybe?          investigate!          wake me up
Where it breaks:
▸ Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.
▸ Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.
▸ Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.
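The cause-vs-symptom split above can be sketched as a routing decision. A minimal, hypothetical example (the alert names and two-tier page/ticket routing are assumptions, not anyone's production config):

```python
# Hypothetical split: symptom alerts page a human, cause alerts file a ticket.
SYMPTOM = "symptom"  # user-visible impact, e.g. latency spike
CAUSE = "cause"      # internal condition, e.g. disk filling up

ALERTS = {
    "p99_latency_high": SYMPTOM,
    "error_budget_burn_fast": SYMPTOM,
    "disk_usage_over_85pct": CAUSE,
    "pod_restart_loop": CAUSE,
}

def route(alert_name: str) -> str:
    """Symptom alerts wake someone up; cause alerts can wait for business hours."""
    kind = ALERTS.get(alert_name, SYMPTOM)  # unknown alerts default to paging
    return "page" if kind == SYMPTOM else "ticket"
```

The default-to-paging choice is deliberate: an unclassified alert is a symptom until proven otherwise.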
The rule I keep coming back to:
→ Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.
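The three questions make a natural pre-ship gate. A sketch, assuming a hypothetical `AlertSpec` shape (field names are mine, not from any tool):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertSpec:
    name: str
    owner: Optional[str]           # who acts on it?
    runbook_url: Optional[str]     # what do they do?
    inaction_cost: Optional[str]   # cost of 30 minutes of doing nothing?

def ready_to_ship(spec: AlertSpec) -> bool:
    """An alert ships only when all three questions have answers."""
    return all([spec.owner, spec.runbook_url, spec.inaction_cost])
```

Wiring a check like this into alert-config CI review turns the rule from a norm into a gate.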
How I sanity-check it:
▸ Weekly alert review ritual: tag every last-week page as actionable / noisy / redundant. Kill the bottom two categories.
▸ PagerDuty/OpsGenie grouping + escalation policies — reduce interrupt rate without hiding real incidents.
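The weekly review ritual can be reduced to a few lines. A sketch with made-up page data (alert names and tags are illustrative): an alert lands on the kill list only if every one of its pages last week was tagged noisy or redundant.

```python
# Hypothetical tags from a weekly review of last week's pages.
tagged_pages = [
    ("db-primary-cpu-high", "noisy"),
    ("checkout-error-budget-burn", "actionable"),
    ("node-disk-pressure", "redundant"),
    ("checkout-error-budget-burn", "actionable"),
    ("db-primary-cpu-high", "noisy"),
]

def kill_list(pages):
    """Alerts whose pages were only ever noisy or redundant get deleted."""
    verdicts = {}
    for name, tag in pages:
        verdicts.setdefault(name, set()).add(tag)
    return sorted(name for name, tags in verdicts.items()
                  if tags <= {"noisy", "redundant"})
```

One actionable page is enough to keep an alert alive; the ritual only kills the ones that never earned their interrupt.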
Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.
Hiring managers: engineers who think about this in interviews are the ones worth calling back.