Most on-call engineers don't burn out from the hours; they burn out from the noise. Alert fatigue isn't just a human problem; it's a design failure. Here's how to build an on-call process that keeps both systems and people healthy.
The Human Side of Alerting
It's 3 AM. Your phone buzzes. Again. Third time tonight. You drag yourself out of bed, squint at PagerDuty, and see: "Disk usage at 76% on server-web-03."
Not urgent. Not even close to critical. But you're awake now.
This is alert fatigue, and it's quietly destroying engineering teams everywhere.
The numbers tell a grim story:
- 40% of on-call engineers report burnout symptoms
- Teams with high alert volumes have 3x higher attrition rates
- 70% of alerts don't require immediate action
- Engineers spend an average of 8 hours per week on alert noise
But here's what the numbers don't capture: the anticipatory anxiety, the inability to plan personal time, the erosion of trust between teams, and the normalization of dysfunctional systems.
Alert fatigue isn't just inconvenient; it's a slow-motion disaster for team health and system reliability.
Understanding Decision Fatigue
Every alert requires decisions:
- Is this real or false?
- Is it urgent or can it wait?
- Do I need to wake someone else?
- Should I roll back or investigate?
- Is this getting worse?
On a quiet shift, you might handle 2-3 decisions. On a noisy one? Try 30-50. Your brain's decision-making capacity isn't unlimited; it depletes like a battery.
After hours of low-quality alerts, when the real crisis hits, you're running on empty. You miss signals, make poor choices, and take longer to resolve incidents.
This is why alert fatigue kills reliability. It's not that engineers don't care; it's that their capacity to respond effectively has been drained by noise.
The Long-Term Burnout Cycle
- Week 1: Alert volume is manageable, you respond dutifully
- Month 2: You start dreading on-call, anxiety builds before shifts
- Month 4: You develop "alert numbness," dismissing notifications without reading
- Month 6: You're applying to other companies
The tragedy? This cycle is avoidable. Alert fatigue is a symptom of broken observability, not an inevitable cost of running services.
Metrics That Matter
To fix alert fatigue, you need to measure it. Here are the key metrics:
Mean Time to Acknowledge (MTTA)
How quickly do alerts get acknowledged?
- Healthy: Under 5 minutes for critical alerts
- Warning: 5-15 minutes
- Critical: Over 15 minutes (indicates alert fatigue or alert blindness)
What it tells you: If MTTA is climbing, your team is either overwhelmed or has stopped trusting alerts.
Mean Time to Resolve (MTTR)
How long does it take to fix issues?
- Good: Under 1 hour for P1 incidents
- Average: 1-4 hours
- Needs improvement: Over 4 hours
What it tells you: Long MTTRs might indicate complex systems, but could also point to poor runbooks or alert context.
Mean Time Between Wake-Ups
How often is on-call disturbed during off-hours?
- Healthy: 0-2 pages per night
- Tolerable: 3-5 pages per night
- Unsustainable: More than 5 pages per night
What it tells you: If engineers are woken up frequently, they can't rest and recharge. This is the most direct predictor of burnout.
Alert-to-Incident Ratio
What percentage of alerts represent real incidents?
- Excellent: 80%+ of alerts require action
- Acceptable: 50-80%
- Problematic: Under 50% (most alerts are false positives or non-urgent)
What it tells you: Low ratios mean you're training your team to ignore alerts.
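These first four metrics fall out of an exported alert log with a few lines of code. A minimal sketch, assuming each record carries fired_at, acknowledged_at, and resolved_at datetimes plus an actionable flag (the field names are hypothetical):

    from statistics import mean

    def alert_health(alerts):
        """Compute MTTA, MTTR, and alert-to-incident ratio from alert records."""
        ack_minutes = [
            (a["acknowledged_at"] - a["fired_at"]).total_seconds() / 60
            for a in alerts if a.get("acknowledged_at")
        ]
        resolve_minutes = [
            (a["resolved_at"] - a["fired_at"]).total_seconds() / 60
            for a in alerts if a.get("resolved_at")
        ]
        actionable = sum(1 for a in alerts if a.get("actionable"))
        return {
            "mtta_minutes": mean(ack_minutes) if ack_minutes else None,
            "mttr_minutes": mean(resolve_minutes) if resolve_minutes else None,
            "alert_to_incident_ratio": actionable / len(alerts) if alerts else None,
        }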
Alert Fatigue Index (Custom Metric)
Combine metrics into a single health score:
    alert_fatigue_index = (
        (false_positive_rate * 0.3) +
        (avg_alerts_per_shift / 20 * 0.3) +
        (night_wake_ups_per_week / 5 * 0.2) +
        (mtta_minutes / 10 * 0.2)
    )

    # Score interpretation:
    # 0.0-0.3: Healthy
    # 0.3-0.6: Warning
    # 0.6-1.0: Critical
Track this monthly and set targets for improvement.
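As a sketch, the index can be wrapped in a small function and fed from whatever metrics export you already have (the 1.0 cap is an addition here so one terrible week doesn't distort the trend line):

    def alert_fatigue_index(false_positive_rate, avg_alerts_per_shift,
                            night_wake_ups_per_week, mtta_minutes):
        """Combine the inputs above into a single 0-1 fatigue score."""
        score = (
            (false_positive_rate * 0.3)
            + (avg_alerts_per_shift / 20 * 0.3)
            + (night_wake_ups_per_week / 5 * 0.2)
            + (mtta_minutes / 10 * 0.2)
        )
        return min(score, 1.0)  # cap at 1.0 so outliers don't skew the trend

    # Example: 30% false positives, 12 alerts/shift, 4 night pages/week, 8-minute MTTA
    print(alert_fatigue_index(0.30, 12, 4, 8))  # ~0.59 -> Warning, close to Critical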
Designing Sane Alerts
Most alert problems stem from poor alert design. Here's how to fix them:
Principle 1: Every Alert Must Be Actionable
Bad Alert: "CPU usage is high"
- What should I do? Is this normal? Is it getting worse?
Good Alert: "CPU usage above 80% for 10 minutes on api-server. Check recent deployments or scale horizontally. Runbook: [link]"
- Clear threshold, context, suggested actions, and resources
Rule: If an alert doesn't suggest an action, it's not an alert; it's a log entry. And if the on-call engineer needs to pull in someone else just to understand what it means, it's being routed to the wrong person.
Principle 2: Define Clear Severity Levels
Create and enforce a severity taxonomy:
P0 - Critical (page immediately):
- Complete service outage affecting customers
- Data loss or corruption in progress
- Security breach
- SLA violations in progress
P1 - High (page during business hours, notify off-hours):
- Partial service degradation
- Single-region outage
- Approaching SLA violations
P2 - Medium (ticket, no page):
- Performance degradation not affecting users yet
- Non-critical component failure with redundancy
- Resource usage trends requiring attention
P3 - Low (ticket, no urgency):
- Information for trending analysis
- Maintenance reminders
Rule: Only P0 should wake people up. P1 can page during working hours.
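One way to keep the taxonomy honest is to encode it directly in the alert pipeline, so a severity level always implies a delivery channel. A minimal sketch, with hypothetical channel names and business-hours logic:

    from datetime import datetime

    SEVERITY_ROUTES = {
        "P0": "pager",                     # page immediately, any hour
        "P1": "pager-if-business-hours",   # page 9-18 on weekdays, otherwise notify
        "P2": "ticket-queue",              # ticket, no page
        "P3": "ticket-queue",              # ticket, no urgency
    }

    def is_business_hours(now=None):
        now = now or datetime.now()
        return now.weekday() < 5 and 9 <= now.hour < 18

    def route_for(severity):
        route = SEVERITY_ROUTES.get(severity, "ticket-queue")
        if route == "pager-if-business-hours":
            return "pager" if is_business_hours() else "team-channel"
        return route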
Principle 3: Use Deduplication and Correlation
Don't fire 20 alerts when one service dies:
    # Example: Alert deduplication in Prometheus Alertmanager
    route:
      receiver: 'team-pager'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s        # Wait to batch similar alerts
      group_interval: 5m     # Send grouped alerts every 5 min
      repeat_interval: 3h    # Don't re-alert for 3 hours

    inhibit_rules:
      # If the API is down, silence dependent service alerts
      - source_match:
          alertname: 'APIServerDown'
        target_match:
          alertname: 'HighLatency'
        equal: ['cluster']
Rule: Root cause alerts should suppress symptom alerts.
Principle 4: Provide Rich Context
Alerts without context force engineers to investigate from scratch every time:
    {
      "alert": "High Error Rate",
      "severity": "P0",
      "service": "payment-api",
      "current_value": "15% errors",
      "threshold": "5% errors",
      "duration": "10 minutes",
      "affected_endpoints": ["/checkout", "/refund"],
      "recent_deployments": [
        {
          "version": "v2.3.1",
          "deployed_at": "2025-10-29T13:45:00Z",
          "deployed_by": "jenkins"
        }
      ],
      "dashboard": "https://grafana.example.com/d/payments",
      "runbook": "https://wiki.example.com/runbooks/payment-errors",
      "related_alerts": ["DatabaseConnectionPool-Exhausted"],
      "suggested_actions": [
        "Check recent deployment v2.3.1",
        "Review database connection pool metrics",
        "Consider rollback if errors persist"
      ]
    }
This alert gives the on-call engineer everything they need to start investigating immediately.
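Payloads like this stay consistent only if a single enrichment step builds them before routing. A rough sketch, assuming hypothetical deploy_api and runbook_index lookup helpers:

    def enrich_alert(alert, deploy_api, runbook_index):
        """Attach deployment history, runbook, and suggested actions to a raw alert."""
        enriched = dict(alert)
        enriched["recent_deployments"] = deploy_api.list_recent(
            service=alert["service"], hours=2
        )
        enriched["runbook"] = runbook_index.lookup(alert["alert"])
        enriched["suggested_actions"] = enriched.get("suggested_actions") or [
            "Check the most recent deployment",
            "Open the linked dashboard and compare against the last 24h baseline",
        ]
        return enriched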
Principle 5: Implement Alert Fatigue Circuit Breakers
Automatically suppress excessive alerting:
    from datetime import datetime

    class AlertCircuitBreaker:
        def __init__(self, threshold=10, window_minutes=30):
            self.threshold = threshold
            self.window = window_minutes
            self.recent_alerts = []

        def should_send(self, alert):
            now = datetime.now()
            # Remove old alerts outside the window
            self.recent_alerts = [
                a for a in self.recent_alerts
                if (now - a['timestamp']).total_seconds() < self.window * 60
            ]
            # Check if we've hit the threshold
            if len(self.recent_alerts) >= self.threshold:
                # We're in an alert storm - only send P0 alerts
                if alert['severity'] == 'P0':
                    self.recent_alerts.append({'timestamp': now, 'alert': alert})
                    return True
                self.send_to_team_channel(
                    f"⚠️ Alert storm detected: {len(self.recent_alerts)} alerts "
                    f"in {self.window} minutes. Suppressing non-critical alerts."
                )
                return False
            # Normal operation
            self.recent_alerts.append({'timestamp': now, 'alert': alert})
            return True
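A brief usage sketch, assuming alerts arrive as dicts with a severity field and a hypothetical page_on_call delivery function:

    breaker = AlertCircuitBreaker(threshold=10, window_minutes=30)

    for alert in incoming_alerts:      # e.g. consumed from your alert queue
        if breaker.should_send(alert):
            page_on_call(alert)        # hypothetical delivery function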
Rule: Protect humans from alert storms by temporarily raising the alert threshold.
Automation That Helps (Not Harms)
Smart automation reduces on-call burden without creating new problems:
Self-Healing for Known Issues
    from dataclasses import dataclass

    # Assumed result type for remediation attempts
    @dataclass
    class RemediationResult:
        success: bool
        action: str = ""
        error: str = ""

    class SelfHealingAlert:
        def trigger(self, alert):
            # Try automated remediation first
            remediation_result = self.attempt_remediation(alert)
            if remediation_result.success:
                # Fixed! Just notify for visibility
                self.send_notification(
                    f"✅ Auto-resolved: {alert.name}. Action taken: {remediation_result.action}"
                )
                self.create_postmortem_ticket(alert, remediation_result)
            else:
                # Couldn't fix automatically - escalate to a human
                self.page_on_call(alert, context={
                    'attempted_remediation': remediation_result.action,
                    'failure_reason': remediation_result.error,
                })

        def attempt_remediation(self, alert):
            remediation_playbook = {
                'DiskSpaceHigh': self.clean_old_logs,
                'MemoryLeak': self.restart_service,
                'CertExpiringSoon': self.renew_certificate,
            }
            handler = remediation_playbook.get(alert.name)
            if handler:
                return handler(alert)
            return RemediationResult(success=False, error="No playbook available")
Common self-healing patterns:
- Restart crashed services
- Clear disk space by rotating logs
- Scale up resources temporarily
- Drain and replace unhealthy nodes
- Renew expiring certificates
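As one concrete instance of the pattern, here's what a disk-space handler might look like, returning the RemediationResult type from the sketch above (the log path, filename pattern, and retention window are purely illustrative):

    import time
    from pathlib import Path

    def clean_old_logs(alert, log_dir="/var/log/app", keep_days=7):
        """Delete rotated log files older than keep_days and report the space freed."""
        cutoff = time.time() - keep_days * 86400
        freed_bytes = 0
        for path in Path(log_dir).glob("*.log.*"):   # rotated files like app.log.1, app.log.2.gz
            if path.stat().st_mtime < cutoff:
                freed_bytes += path.stat().st_size
                path.unlink()
        if freed_bytes:
            return RemediationResult(success=True,
                                     action=f"Removed {freed_bytes / 1e6:.1f} MB of rotated logs")
        return RemediationResult(success=False, error="No rotated logs old enough to remove")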
Context-Aware Alerts
Use recent activity to inform alert routing:
    def route_alert(alert, context):
        # If there's an active incident, route related alerts to the incident channel
        active_incidents = get_active_incidents()
        for incident in active_incidents:
            if alert.service in incident.affected_services:
                return incident.slack_channel

        # If it's a known issue with recent activity, route to the team channel
        recent_tickets = search_tickets(alert.service, days=7)
        if recent_tickets:
            return 'team-channel-non-urgent'

        # If there was a recent deployment, loop in the deployer
        recent_deploy = get_recent_deployment(alert.service, hours=2)
        if recent_deploy:
            return f"@{recent_deploy.author} - potential impact from your deployment"

        # Otherwise, standard on-call page
        return 'on-call-pager'
Progressive Alert Escalation
Don't page humans until automation has tried first:
    alert_policy:
      - name: HighErrorRate
        conditions:
          - error_rate > 5%
          - duration: 5 minutes
        escalation:
          - stage: 1
            delay: 0 seconds
            action: attempt_auto_remediation
          - stage: 2
            delay: 3 minutes
            action: post_to_team_slack
            condition: not_resolved
          - stage: 3
            delay: 10 minutes
            action: page_on_call
            condition: not_resolved AND error_rate > 10%
This gives systems a chance to self-recover before waking humans.
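Something has to execute a policy like this. One rough sketch of the stage loop, with the actions and resolution check passed in as plain Python callables (not any real alerting tool's API):

    import time

    def run_escalation(alert, stages, is_resolved):
        """Walk escalation stages in order, stopping as soon as the alert resolves."""
        fired_at = time.time()
        for stage in stages:
            # Wait until this stage's delay (relative to the alert firing) has passed
            time.sleep(max(0, fired_at + stage["delay_seconds"] - time.time()))
            if is_resolved(alert):
                return "resolved"
            condition = stage.get("condition", lambda a: True)
            if condition(alert):
                stage["action"](alert)
        return "escalation exhausted"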
Cultural Shifts: Making On-Call Sustainable
Technology alone won't fix alert fatigue. You need cultural changes too:
1. Fair Compensation
On-call work is different from regular work. Compensate accordingly:
Options:
- On-call stipend ($X per shift, regardless of pages)
- Per-incident bonus for off-hours responses
- Time-in-lieu: every hour worked off-hours = 1.5 hours of flex time
- Shift differentials (nights/weekends pay more)
Rule: Don't treat on-call as "just part of the job." It's extra labor and should be compensated.
2. Rotation Hygiene
Shift length: 1 week is standard. Longer creates extended stress; shorter creates planning chaos.
Handoff rituals: Schedule 30-minute handoffs where outgoing on-call briefs incoming on:
- Active issues and workarounds
- Ongoing investigations
- Known flaky alerts
- Recent deployments
Follow-the-sun rotations: For global teams, rotate on-call by region to minimize overnight pages.
No solo on-call: Always have a backup. Primary handles pages; backup covers if primary is unavailable or needs help.
3. Post-Incident Care
After a hard on-call shift:
Immediate:
- Give engineers the next day off or at least a late start
- Don't schedule meetings first thing after on-call
- Have leads check in personally
Within a week:
- Conduct blameless postmortems
- Track action items to prevent recurrence
- Celebrate quick resolutions
Monthly:
- Review on-call metrics with the team
- Discuss what's getting better (or worse)
- Prioritize alert quality improvements
Rule: Treat incident response like an injury: give people time to recover.
4. Psychological Safety
Make it safe to:
- Escalate unclear situations
- Ask for help without judgment
- Make mistakes during incidents
- Push back on noisy alerts
- Say "I need a break from on-call"
Anti-patterns to avoid:
- Heroism culture ("Jane fixed it alone at 3 AM again!")
- Shaming slow responses
- Hiding alert fatigue metrics
- Forcing people onto on-call rotation against their will
SLO-Driven On-Call: Using Service Levels as Noise Filters
Service Level Objectives (SLOs) are a powerful tool for reducing alert noise:
Define Error Budgets
Instead of alerting on every error, alert when you're burning your error budget too fast:
    slo:
      name: API Availability
      target: 99.9%        # Allows ~43 minutes of downtime per month
      error_budget: 0.1%   # 1 - 0.999
      burn_rate_alerts:
        - severity: P0
          condition: error_budget_consumed > 10% in 1 hour
          # Burning budget ~72x faster than sustainable
        - severity: P1
          condition: error_budget_consumed > 50% in 24 hours
          # Burning budget ~15x faster than sustainable
This focuses alerts on what matters: are we violating customer expectations?
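The burn-rate multiples in those comments come from a simple ratio: how fast the budget is being consumed versus spending it evenly across the SLO window. A minimal sketch for a 30-day window:

    def burn_rate(budget_fraction_consumed, hours_elapsed, window_hours=30 * 24):
        """How many times faster than 'sustainable' the error budget is burning."""
        sustainable_per_hour = 1.0 / window_hours          # even spend over the whole window
        actual_per_hour = budget_fraction_consumed / hours_elapsed
        return actual_per_hour / sustainable_per_hour

    print(burn_rate(0.10, 1))    # 10% of budget in 1 hour   -> ~72x
    print(burn_rate(0.50, 24))   # 50% of budget in 24 hours -> ~15x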
Multi-Window, Multi-Burn-Rate Alerting
Reduce false positives with time-based context:
    def should_alert(service_slo):
        # Check both short and long windows
        short_window = service_slo.burn_rate(window='1h')
        long_window = service_slo.burn_rate(window='6h')

        # Alert only if BOTH windows show a high burn rate
        if short_window > 14 and long_window > 14:
            return True, "P0: Severe error budget burn"
        if short_window > 6 and long_window > 6:
            return True, "P1: Elevated error budget burn"
        return False, None
Why this works: Requires sustained problems, not transient blips.
Stop Alerting on Fixed Thresholds
Traditional alerts use arbitrary thresholds:
- "CPU > 80%" (based on what?)
- "Latency > 200ms" (is that actually bad?)
SLO-based alerts use customer impact:
- "Latency p99 > 500ms for 10 minutes" (our SLO is p99 < 500ms)
- "Error rate > 1% for 5 minutes" (our SLO allows max 0.1% errors)
Rule: Alert on SLO violations, not infrastructure metrics.
Building the Improvement Flywheel
Fixing alert fatigue is a continuous process:
Week 1: Audit Current State
Analyze your alert data:
- Total alerts last month: ?
- False positive rate: ?
- Overnight pages: ?
- Average resolution time: ?
- Top 10 noisiest alerts: ?
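If your alerting tool can export history (most can, as CSV or JSON), the audit itself is a few lines of analysis. A sketch assuming a list of alert dicts with name, fired_at, and actionable fields (a hypothetical schema):

    from collections import Counter

    def audit(alerts):
        """Summarize a month of alert history into the numbers above."""
        total = len(alerts)
        false_positives = sum(1 for a in alerts if not a.get("actionable"))
        overnight = sum(1 for a in alerts
                        if a["fired_at"].hour >= 22 or a["fired_at"].hour < 7)
        return {
            "total_alerts": total,
            "false_positive_rate": false_positives / total if total else 0,
            "overnight_pages": overnight,
            "top_10_noisiest": Counter(a["name"] for a in alerts).most_common(10),
        }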
Week 2-4: Quick Wins
- Delete or tune the top 5 noisiest alerts
- Add runbooks to alerts without them
- Implement alert deduplication
- Set up self-healing for top 3 repetitive issues
Month 2: Systematic Improvements
- Migrate to SLO-based alerting for critical services
- Implement alert severity taxonomy
- Add context-rich alert templates
- Set up alert fatigue metrics dashboard
Month 3: Cultural Changes
- Introduce fair on-call compensation
- Establish handoff rituals
- Start monthly on-call retrospectives
- Create on-call feedback loops
Ongoing: Never Stop Improving
- Review metrics quarterly
- Celebrate alert reduction milestones
- Share learnings across teams
- Continuously prune and tune alerts
What Fixing It Looks Like (Examples)
Example 1:
- Before: 300 alerts/week, 8 overnight pages per night, 45% attrition in the on-call pool
- After 6 months: 40 alerts/week, 1 overnight page per night, 10% attrition
- Key changes: SLO-based alerting, self-healing automation, tripled on-call compensation
Example 2:
- Before: 15-minute MTTA, 4-hour MTTR, constant firefighting
- After 1 year: 3-minute MTTA, 45-minute MTTR, proactive issue prevention
- Key changes: context-rich alerts, runbook automation, follow-the-sun rotations
Example 3:
- Before: an "always-on" culture, engineers checking phones at dinner
- After 9 months: work-life boundaries restored, team satisfaction up 40%
- Key changes: alert circuit breakers, post-incident recovery time, psychological safety initiatives
Healthy systems need healthy humans. A sustainable on-call culture protects both.
Alert fatigue isn't inevitable. It's a design problem with engineering solutions:
- Better alerts (actionable, contextual, deduplicated)
- Smart automation (self-healing, progressive escalation)
- Fair compensation (recognize the burden)
- Cultural safety (make it okay to push back)
Start small. Pick one metric to improve this month. Celebrate progress. And remember: the goal isn't zero alerts; it's zero noise!
Your engineers will sleep better. Your systems will run better. And when real incidents happen, your team will have the energy and focus to handle them effectively.
That's what a healthy on-call culture looks like. Build it intentionally, or watch burnout build it for you.