Most on-call engineers don't burn out from the hours; they burn out from the noise. Alert fatigue isn't just a human problem; it's a design failure. Here's how to build an on-call process that keeps both systems and people healthy.
The Human Side of Alerting
It's 3 AM. Your phone buzzes. Again. Third time tonight. You drag yourself out of bed, squint at PagerDuty, and see: "Disk usage at 76% on server-web-03."
Not urgent. Not even close to critical. But you're awake now.
This is alert fatigue, and it's quietly destroying engineering teams everywhere.
The numbers tell a grim story:
- 40% of on-call engineers report burnout symptoms
- Teams with high alert volumes have 3x higher attrition rates
- 70% of alerts don't require immediate action
- Engineers spend an average of 8 hours per week on alert noise
But here's what the numbers don't capture: the anticipatory anxiety, the inability to plan personal time, the erosion of trust between teams, and the normalization of dysfunctional systems.
Alert fatigue isn't just inconvenient; it's a slow-motion disaster for team health and system reliability.
Understanding Decision Fatigue
Every alert requires decisions:
- Is this real or false?
- Is it urgent or can it wait?
- Do I need to wake someone else?
- Should I roll back or investigate?
- Is this getting worse?
On a quiet shift, you might handle 2-3 decisions. On a noisy one? Try 30-50. Your brain's decision-making capacity isn't unlimited; it depletes like a battery.
After hours of low-quality alerts, when the real crisis hits, you're running on empty. You miss signals, make poor choices, and take longer to resolve incidents.
This is why alert fatigue kills reliability. It's not that engineers don't care; it's that their capacity to respond effectively has been drained by noise.
The Long-Term Burnout Cycle
- Week 1: Alert volume is manageable, you respond dutifully
- Month 2: You start dreading on-call, anxiety builds before shifts
- Month 4: You develop "alert numbness," dismissing notifications without reading
- Month 6: You're applying to other companies
The tragedy? This cycle is avoidable. Alert fatigue is a symptom of broken observability, not an inevitable cost of running services.
Metrics That Matter
To fix alert fatigue, you need to measure it. Here are the key metrics:
Mean Time to Acknowledge (MTTA)
How quickly do alerts get acknowledged?
- Healthy: Under 5 minutes for critical alerts
- Warning: 5-15 minutes
- Critical: Over 15 minutes (indicates alert fatigue or alert blindness)
What it tells you: If MTTA is climbing, your team is either overwhelmed or has stopped trusting alerts.
Mean Time to Resolve (MTTR)
How long does it take to fix issues?
- Good: Under 1 hour for P1 incidents
- Average: 1-4 hours
- Needs improvement: Over 4 hours
What it tells you: Long MTTRs might indicate complex systems, but could also point to poor runbooks or alert context.
Mean Time Between Wake-Ups
How often is on-call disturbed during off-hours?
- Healthy: 0-2 pages per night
- Tolerable: 3-5 pages per night
- Unsustainable: More than 5 pages per night
What it tells you: If engineers are woken up frequently, they can't rest and recharge. This is the most direct predictor of burnout.
Alert-to-Incident Ratio
What percentage of alerts represent real incidents?
- Excellent: 80%+ of alerts require action
- Acceptable: 50-80%
- Problematic: Under 50% (most alerts are false positives or non-urgent)
What it tells you: Low ratios mean you're training your team to ignore alerts.
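These first four metrics fall out of an exported alert log with a few lines of code. A minimal sketch, assuming each record carries fired_at, acknowledged_at, and resolved_at datetimes plus an actionable flag (the field names are hypothetical):

    from statistics import mean

    def alert_health(alerts):
        """Compute MTTA, MTTR, and alert-to-incident ratio from alert records."""
        ack_minutes = [
            (a["acknowledged_at"] - a["fired_at"]).total_seconds() / 60
            for a in alerts if a.get("acknowledged_at")
        ]
        resolve_minutes = [
            (a["resolved_at"] - a["fired_at"]).total_seconds() / 60
            for a in alerts if a.get("resolved_at")
        ]
        actionable = sum(1 for a in alerts if a.get("actionable"))
        return {
            "mtta_minutes": mean(ack_minutes) if ack_minutes else None,
            "mttr_minutes": mean(resolve_minutes) if resolve_minutes else None,
            "alert_to_incident_ratio": actionable / len(alerts) if alerts else None,
        }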
Alert Fatigue Index (Custom Metric)
Combine metrics into a single health score:
    alert_fatigue_index = (
        (false_positive_rate * 0.3) +
        (avg_alerts_per_shift / 20 * 0.3) +
        (night_wake_ups_per_week / 5 * 0.2) +
        (mtta_minutes / 10 * 0.2)
    )

    # Score interpretation:
    # 0.0-0.3: Healthy
    # 0.3-0.6: Warning
    # 0.6-1.0: Critical
Track this monthly and set targets for improvement.
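As a sketch, the index can be wrapped in a small function and fed from whatever metrics export you already have (the 1.0 cap is an addition here so one terrible week doesn't distort the trend line):

    def alert_fatigue_index(false_positive_rate, avg_alerts_per_shift,
                            night_wake_ups_per_week, mtta_minutes):
        """Combine the inputs above into a single 0-1 fatigue score."""
        score = (
            (false_positive_rate * 0.3)
            + (avg_alerts_per_shift / 20 * 0.3)
            + (night_wake_ups_per_week / 5 * 0.2)
            + (mtta_minutes / 10 * 0.2)
        )
        return min(score, 1.0)  # cap at 1.0 so outliers don't skew the trend

    # Example: 30% false positives, 12 alerts/shift, 4 night pages/week, 8-minute MTTA
    print(alert_fatigue_index(0.30, 12, 4, 8))  # ~0.59 -> Warning, close to Critical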
Designing Sane Alerts
Most alert problems stem from poor alert design. Here's how to fix them:
Principle 1: Every Alert Must Be Actionable
Bad Alert: "CPU usage is high"
- What should I do? Is this normal? Is it getting worse?
Good Alert: "CPU usage above 80% for 10 minutes on api-server. Check recent deployments or scale horizontally. Runbook: [link]"
- Clear threshold, context, suggested actions, and resources
Rule: If an alert doesn't suggest an action, it's not an alert; it's a log entry. And if the on-call engineer needs to pull in someone else just to understand what it means, it's being routed to the wrong person.
Principle 2: Define Clear Severity Levels
Create and enforce a severity taxonomy:
P0 - Critical (page immediately):
- Complete service outage affecting customers
- Data loss or corruption in progress
- Security breach
- SLA violations in progress
P1 - High (page during business hours, notify off-hours):
- Partial service degradation
- Single-region outage
- Approaching SLA violations
P2 - Medium (ticket, no page):
- Performance degradation not affecting users yet
- Non-critical component failure with redundancy
- Resource usage trends requiring attention
P3 - Low (ticket, no urgency):
- Information for trending analysis
- Maintenance reminders
Rule: Only P0 should wake people up. P1 can page during working hours.
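One way to keep the taxonomy honest is to encode it directly in the alert pipeline, so a severity level always implies a delivery channel. A minimal sketch, with hypothetical channel names and business-hours logic:

    from datetime import datetime

    SEVERITY_ROUTES = {
        "P0": "pager",                     # page immediately, any hour
        "P1": "pager-if-business-hours",   # page 9-18 on weekdays, otherwise notify
        "P2": "ticket-queue",              # ticket, no page
        "P3": "ticket-queue",              # ticket, no urgency
    }

    def is_business_hours(now=None):
        now = now or datetime.now()
        return now.weekday() < 5 and 9 <= now.hour < 18

    def route_for(severity):
        route = SEVERITY_ROUTES.get(severity, "ticket-queue")
        if route == "pager-if-business-hours":
            return "pager" if is_business_hours() else "team-channel"
        return route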
Principle 3: Use Deduplication and Correlation
Don't fire 20 alerts when one service dies:
    # Example: Alert deduplication in Prometheus Alertmanager
    route:
      receiver: 'team-pager'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s        # Wait to batch similar alerts
      group_interval: 5m     # Send grouped alerts every 5 min
      repeat_interval: 3h    # Don't re-alert for 3 hours

    inhibit_rules:
      # If the API is down, silence dependent service alerts
      - source_match:
          alertname: 'APIServerDown'
        target_match:
          alertname: 'HighLatency'
        equal: ['cluster']
Rule: Root cause alerts should suppress symptom alerts.
Principle 4: Provide Rich Context
Alerts without context force engineers to investigate from scratch every time:
    {
      "alert": "High Error Rate",
      "severity": "P0",
      "service": "payment-api",
      "current_value": "15% errors",
      "threshold": "5% errors",
      "duration": "10 minutes",
      "affected_endpoints": ["/checkout", "/refund"],
      "recent_deployments": [
        {
          "version": "v2.3.1",
          "deployed_at": "2025-10-29T13:45:00Z",
          "deployed_by": "jenkins"
        }
      ],
      "dashboard": "https://grafana.example.com/d/payments",
      "runbook": "https://wiki.example.com/runbooks/payment-errors",
      "related_alerts": ["DatabaseConnectionPool-Exhausted"],
      "suggested_actions": [
        "Check recent deployment v2.3.1",
        "Review database connection pool metrics",
        "Consider rollback if errors persist"
      ]
    }
This alert gives the on-call engineer everything they need to start investigating immediately.
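Payloads like this stay consistent only if a single enrichment step builds them before routing. A rough sketch, assuming hypothetical deploy_api and runbook_index lookup helpers:

    def enrich_alert(alert, deploy_api, runbook_index):
        """Attach deployment history, runbook, and suggested actions to a raw alert."""
        enriched = dict(alert)
        enriched["recent_deployments"] = deploy_api.list_recent(
            service=alert["service"], hours=2
        )
        enriched["runbook"] = runbook_index.lookup(alert["alert"])
        enriched["suggested_actions"] = enriched.get("suggested_actions") or [
            "Check the most recent deployment",
            "Open the linked dashboard and compare against the last 24h baseline",
        ]
        return enriched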
Principle 5: Implement Alert Fatigue Circuit Breakers
Automatically suppress excessive alerting:
    from datetime import datetime

    class AlertCircuitBreaker:
        def __init__(self, threshold=10, window_minutes=30):
            self.threshold = threshold
            self.window = window_minutes
            self.recent_alerts = []

        def should_send(self, alert):
            now = datetime.now()
            # Remove old alerts outside the window
            self.recent_alerts = [
                a for a in self.recent_alerts
                if (now - a['timestamp']).total_seconds() < self.window * 60
            ]
            # Check if we've hit the threshold
            if len(self.recent_alerts) >= self.threshold:
                # We're in an alert storm - only send P0 alerts
                if alert['severity'] == 'P0':
                    self.recent_alerts.append({'timestamp': now, 'alert': alert})
                    return True
                self.send_to_team_channel(
                    f"⚠️ Alert storm detected: {len(self.recent_alerts)} alerts "
                    f"in {self.window} minutes. Suppressing non-critical alerts."
                )
                return False
            # Normal operation
            self.recent_alerts.append({'timestamp': now, 'alert': alert})
            return True
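A brief usage sketch, assuming alerts arrive as dicts with a severity field and a hypothetical page_on_call delivery function:

    breaker = AlertCircuitBreaker(threshold=10, window_minutes=30)

    for alert in incoming_alerts:      # e.g. consumed from your alert queue
        if breaker.should_send(alert):
            page_on_call(alert)        # hypothetical delivery function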
Rule: Protect humans from alert storms by temporarily raising the alert threshold.
Automation That Helps (Not Harms)
Smart automation reduces on-call burden without creating new problems:
Self-Healing for Known Issues
    from dataclasses import dataclass

    # Assumed result type for remediation attempts
    @dataclass
    class RemediationResult:
        success: bool
        action: str = ""
        error: str = ""

    class SelfHealingAlert:
        def trigger(self, alert):
            # Try automated remediation first
            remediation_result = self.attempt_remediation(alert)
            if remediation_result.success:
                # Fixed! Just notify for visibility
                self.send_notification(
                    f"✅ Auto-resolved: {alert.name}. Action taken: {remediation_result.action}"
                )
                self.create_postmortem_ticket(alert, remediation_result)
            else:
                # Couldn't fix automatically - escalate to a human
                self.page_on_call(alert, context={
                    'attempted_remediation': remediation_result.action,
                    'failure_reason': remediation_result.error,
                })

        def attempt_remediation(self, alert):
            remediation_playbook = {
                'DiskSpaceHigh': self.clean_old_logs,
                'MemoryLeak': self.restart_service,
                'CertExpiringSoon': self.renew_certificate,
            }
            handler = remediation_playbook.get(alert.name)
            if handler:
                return handler(alert)
            return RemediationResult(success=False, error="No playbook available")
Common self-healing patterns:
- Restart crashed services
- Clear disk space by rotating logs
- Scale up resources temporarily
- Drain and replace unhealthy nodes
- Renew expiring certificates
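As one concrete instance of the pattern, here's what a disk-space handler might look like, returning the RemediationResult type from the sketch above (the log path, filename pattern, and retention window are purely illustrative):

    import time
    from pathlib import Path

    def clean_old_logs(alert, log_dir="/var/log/app", keep_days=7):
        """Delete rotated log files older than keep_days and report the space freed."""
        cutoff = time.time() - keep_days * 86400
        freed_bytes = 0
        for path in Path(log_dir).glob("*.log.*"):   # rotated files like app.log.1, app.log.2.gz
            if path.stat().st_mtime < cutoff:
                freed_bytes += path.stat().st_size
                path.unlink()
        if freed_bytes:
            return RemediationResult(success=True,
                                     action=f"Removed {freed_bytes / 1e6:.1f} MB of rotated logs")
        return RemediationResult(success=False, error="No rotated logs old enough to remove")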
Context-Aware Alerts
Use recent activity to inform alert routing:
    def route_alert(alert, context):
        # If there's an active incident, route related alerts to the incident channel
        active_incidents = get_active_incidents()
        for incident in active_incidents:
            if alert.service in incident.affected_services:
                return incident.slack_channel

        # If it's a known issue with recent activity, route to the team channel
        recent_tickets = search_tickets(alert.service, days=7)
        if recent_tickets:
            return 'team-channel-non-urgent'

        # If there was a recent deployment, loop in the deployer
        recent_deploy = get_recent_deployment(alert.service, hours=2)
        if recent_deploy:
            return f"@{recent_deploy.author} - potential impact from your deployment"

        # Otherwise, standard on-call page
        return 'on-call-pager'
Progressive Alert Escalation
Don't page humans until automation has tried first:
    alert_policy:
      - name: HighErrorRate
        conditions:
          - error_rate > 5%
          - duration: 5 minutes
        escalation:
          - stage: 1
            delay: 0 seconds
            action: attempt_auto_remediation
          - stage: 2
            delay: 3 minutes
            action: post_to_team_slack
            condition: not_resolved
          - stage: 3
            delay: 10 minutes
            action: page_on_call
            condition: not_resolved AND error_rate > 10%
This gives systems a chance to self-recover before waking humans.
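Something has to execute a policy like this. One rough sketch of the stage loop, with the actions and resolution check passed in as plain Python callables (not any real alerting tool's API):

    import time

    def run_escalation(alert, stages, is_resolved):
        """Walk escalation stages in order, stopping as soon as the alert resolves."""
        fired_at = time.time()
        for stage in stages:
            # Wait until this stage's delay (relative to the alert firing) has passed
            time.sleep(max(0, fired_at + stage["delay_seconds"] - time.time()))
            if is_resolved(alert):
                return "resolved"
            condition = stage.get("condition", lambda a: True)
            if condition(alert):
                stage["action"](alert)
        return "escalation exhausted"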
Cultural Shifts: Making On-Call Sustainable
Technology alone won't fix alert fatigue. You need cultural changes too:
1. Fair Compensation
On-call work is different from regular work. Compensate accordingly:
Options:
- On-call stipend ($X per shift, regardless of pages)
- Per-incident bonus for off-hours responses
- Time-in-lieu: every hour worked off-hours = 1.5 hours of flex time
- Shift differentials (nights/weekends pay more)
Rule: Don't treat on-call as "just part of the job." It's extra labor and should be compensated.
2. Rotation Hygiene
Shift length: 1 week is standard. Longer creates extended stress; shorter creates planning chaos.
Handoff rituals: Schedule 30-minute handoffs where outgoing on-call briefs incoming on:
- Active issues and workarounds
- Ongoing investigations
- Known flaky alerts
- Recent deployments
Follow-the-sun rotations: For global teams, rotate on-call by region to minimize overnight pages.
No solo on-call: Always have a backup. Primary handles pages; backup covers if primary is unavailable or needs help.
3. Post-Incident Care
After a hard on-call shift:
Immediate:
- Give engineers the next day off or at least a late start
- Don't schedule meetings first thing after on-call
- Have leads check in personally
Within a week:
- Conduct blameless postmortems
- Track action items to prevent recurrence
- Celebrate quick resolutions
Monthly:
- Review on-call metrics with the team
- Discuss what's getting better (or worse)
- Prioritize alert quality improvements
Rule: Treat incident response like an injury: give people time to recover.
4. Psychological Safety
Make it safe to:
- Escalate unclear situations
- Ask for help without judgment
- Make mistakes during incidents
- Push back on noisy alerts
- Say "I need a break from on-call"
Anti-patterns to avoid:
- Heroism culture ("Jane fixed it alone at 3 AM again!")
- Shaming slow responses
- Hiding alert fatigue metrics
- Forcing people onto on-call rotation against their will
SLO-Driven On-Call: Using Service Levels as Noise Filters
Service Level Objectives (SLOs) are a powerful tool for reducing alert noise:
Define Error Budgets
Instead of alerting on every error, alert when you're burning your error budget too fast:
    slo:
      name: API Availability
      target: 99.9%        # Allows ~43 minutes of downtime per month
      error_budget: 0.1%   # 1 - 0.999
      burn_rate_alerts:
        - severity: P0
          condition: error_budget_consumed > 10% in 1 hour
          # Burning budget ~72x faster than sustainable
        - severity: P1
          condition: error_budget_consumed > 50% in 24 hours
          # Burning budget ~15x faster than sustainable
This focuses alerts on what matters: are we violating customer expectations?
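The burn-rate multiples in those comments come from a simple ratio: how fast the budget is being consumed versus spending it evenly across the SLO window. A minimal sketch for a 30-day window:

    def burn_rate(budget_fraction_consumed, hours_elapsed, window_hours=30 * 24):
        """How many times faster than 'sustainable' the error budget is burning."""
        sustainable_per_hour = 1.0 / window_hours          # even spend over the whole window
        actual_per_hour = budget_fraction_consumed / hours_elapsed
        return actual_per_hour / sustainable_per_hour

    print(burn_rate(0.10, 1))    # 10% of budget in 1 hour   -> ~72x
    print(burn_rate(0.50, 24))   # 50% of budget in 24 hours -> ~15x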
Multi-Window, Multi-Burn-Rate Alerting
Reduce false positives with time-based context:
    def should_alert(service_slo):
        # Check both short and long windows
        short_window = service_slo.burn_rate(window='1h')
        long_window = service_slo.burn_rate(window='6h')

        # Alert only if BOTH windows show a high burn rate
        if short_window > 14 and long_window > 14:
            return True, "P0: Severe error budget burn"
        if short_window > 6 and long_window > 6:
            return True, "P1: Elevated error budget burn"
        return False, None
Why this works: Requires sustained problems, not transient blips.
Stop Alerting on Fixed Thresholds
Traditional alerts use arbitrary thresholds:
- "CPU > 80%" (based on what?)
- "Latency > 200ms" (is that actually bad?)
SLO-based alerts use customer impact:
- "Latency p99 > 500ms for 10 minutes" (our SLO is p99 < 500ms)
- "Error rate > 1% for 5 minutes" (our SLO allows max 0.1% errors)
Rule: Alert on SLO violations, not infrastructure metrics.
Building the Improvement Flywheel
Fixing alert fatigue is a continuous process:
Week 1: Audit Current State
Analyze your alert data:
- Total alerts last month: ?
- False positive rate: ?
- Overnight pages: ?
- Average resolution time: ?
- Top 10 noisiest alerts: ?
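If your alerting tool can export history (most can, as CSV or JSON), the audit itself is a few lines of analysis. A sketch assuming a list of alert dicts with name, fired_at, and actionable fields (a hypothetical schema):

    from collections import Counter

    def audit(alerts):
        """Summarize a month of alert history into the numbers above."""
        total = len(alerts)
        false_positives = sum(1 for a in alerts if not a.get("actionable"))
        overnight = sum(1 for a in alerts
                        if a["fired_at"].hour >= 22 or a["fired_at"].hour < 7)
        return {
            "total_alerts": total,
            "false_positive_rate": false_positives / total if total else 0,
            "overnight_pages": overnight,
            "top_10_noisiest": Counter(a["name"] for a in alerts).most_common(10),
        }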
Week 2-4: Quick Wins
- Delete or tune the top 5 noisiest alerts
- Add runbooks to alerts without them
- Implement alert deduplication
- Set up self-healing for top 3 repetitive issues
Month 2: Systematic Improvements
- Migrate to SLO-based alerting for critical services
- Implement alert severity taxonomy
- Add context-rich alert templates
- Set up alert fatigue metrics dashboard
Month 3: Cultural Changes
- Introduce fair on-call compensation
- Establish handoff rituals
- Start monthly on-call retrospectives
- Create on-call feedback loops
Ongoing: Never Stop Improving
- Review metrics quarterly
- Celebrate alert reduction milestones
- Share learnings across teams
- Continuously prune and tune alerts
What Fixing It Looks Like (Examples)
Example 1:
- Before: 300 alerts/week, 8 overnight pages per night, 45% attrition in the on-call pool
- After 6 months: 40 alerts/week, 1 overnight page per night, 10% attrition
- Key changes: SLO-based alerting, self-healing automation, tripled on-call compensation
Example 2:
- Before: 15-minute MTTA, 4-hour MTTR, constant firefighting
- After 1 year: 3-minute MTTA, 45-minute MTTR, proactive issue prevention
- Key changes: context-rich alerts, runbook automation, follow-the-sun rotations
Example 3:
- Before: an "always-on" culture, engineers checking phones at dinner
- After 9 months: work-life boundaries restored, team satisfaction up 40%
- Key changes: alert circuit breakers, post-incident recovery time, psychological safety initiatives
Healthy systems need healthy humans. A sustainable on-call culture protects both.
Alert fatigue isn't inevitable. It's a design problem with engineering solutions:
- Better alerts (actionable, contextual, deduplicated)
- Smart automation (self-healing, progressive escalation)
- Fair compensation (recognize the burden)
- Cultural safety (make it okay to push back)
Start small. Pick one metric to improve this month. Celebrate progress. And remember: the goal isn't zero alerts; it's zero noise!
Your engineers will sleep better. Your systems will run better. And when real incidents happen, your team will have the energy and focus to handle them effectively.
That's what a healthy on-call culture looks like. Build it intentionally, or watch burnout build it for you.