Anderson Leite

Building a Sustainable On-Call Culture: Escaping Alert Fatigue Before It Breaks You

Most on-call engineers don't burn out from the hours; they burn out from the noise. Alert fatigue isn't just a human problem; it's a design failure. Here's how to build an on-call process that keeps both systems and people healthy.

The Human Side of Alerting

It's 3 AM. Your phone buzzes. Again. Third time tonight. You drag yourself out of bed, squint at PagerDuty, and see: "Disk usage at 76% on server-web-03."

Not urgent. Not even close to critical. But you're awake now.

This is alert fatigue, and it's quietly destroying engineering teams everywhere.

The numbers tell a grim story:

  • 40% of on-call engineers report burnout symptoms
  • Teams with high alert volumes have 3x higher attrition rates
  • 70% of alerts don't require immediate action
  • Engineers spend an average of 8 hours per week on alert noise

But here's what the numbers don't capture: the anticipatory anxiety, the inability to plan personal time, the erosion of trust between teams, and the normalization of dysfunctional systems.

Alert fatigue isn't just inconvenient; it's a slow-motion disaster for team health and system reliability.

Understanding Decision Fatigue

Every alert requires decisions:

  1. Is this real or false?
  2. Is it urgent or can it wait?
  3. Do I need to wake someone else?
  4. Should I roll back or investigate?
  5. Is this getting worse?

On a quiet shift, you might handle 2-3 decisions. On a noisy one? Try 30-50. Your brain's decision-making capacity isn't unlimited; it depletes like a battery.

After hours of low-quality alerts, when the real crisis hits, you're running on empty. You miss signals, make poor choices, and take longer to resolve incidents.

This is why alert fatigue kills reliability. It's not that engineers don't care; it's that their capacity to respond effectively has been drained by noise.

The Long-Term Burnout Cycle

  1. Week 1: Alert volume is manageable, you respond dutifully
  2. Month 2: You start dreading on-call, anxiety builds before shifts
  3. Month 4: You develop "alert numbness," dismissing notifications without reading
  4. Month 6: You're applying to other companies

The tragedy? This cycle is avoidable. Alert fatigue is a symptom of broken observability, not an inevitable cost of running services.

Metrics That Matter

To fix alert fatigue, you need to measure it. Here are the key metrics:

Mean Time to Acknowledge (MTTA)

How quickly do alerts get acknowledged?

  • Healthy: Under 5 minutes for critical alerts
  • Warning: 5-15 minutes
  • Critical: Over 15 minutes (indicates alert fatigue or alert blindness)

What it tells you: If MTTA is climbing, your team is either overwhelmed or has stopped trusting alerts.

Mean Time to Resolve (MTTR)

How long does it take to fix issues?

  • Good: Under 1 hour for P1 incidents
  • Average: 1-4 hours
  • Needs improvement: Over 4 hours

What it tells you: Long MTTRs might indicate complex systems, but could also point to poor runbooks or alert context.

Mean Time Between Wake-Ups

How often is on-call disturbed during off-hours?

  • Healthy: 0-2 pages per night
  • Tolerable: 3-5 pages per night
  • Unsustainable: More than 5 pages per night

What it tells you: If engineers are woken up frequently, they can't rest and recharge. This is the most direct predictor of burnout.

Alert-to-Incident Ratio

What percentage of alerts represent real incidents?

  • Excellent: 80%+ of alerts require action
  • Acceptable: 50-80%
  • Problematic: Under 50% (most alerts are false positives or non-urgent)

What it tells you: Low ratios mean you're training your team to ignore alerts.
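
If your alerting platform can export alert history, these metrics take only a few lines to compute. A minimal Python sketch, assuming each exported record has hypothetical fired_at, acknowledged_at, resolved_at, required_action, and woke_someone_up fields (adapt the names to whatever your paging tool actually exports):

from datetime import datetime
from statistics import mean

# Hypothetical record shape; real exports (PagerDuty, Opsgenie, etc.) will differ.
alerts = [
    {
        "fired_at": datetime(2025, 10, 29, 3, 12),
        "acknowledged_at": datetime(2025, 10, 29, 3, 19),
        "resolved_at": datetime(2025, 10, 29, 4, 2),
        "required_action": True,
        "woke_someone_up": True,
    },
    # ... the rest of last month's alerts ...
]

def minutes(delta):
    return delta.total_seconds() / 60

mtta = mean(minutes(a["acknowledged_at"] - a["fired_at"]) for a in alerts)
mttr = mean(minutes(a["resolved_at"] - a["fired_at"]) for a in alerts)
actionable_ratio = sum(a["required_action"] for a in alerts) / len(alerts)
wake_ups = sum(a["woke_someone_up"] for a in alerts)

print(f"MTTA: {mtta:.1f} min  MTTR: {mttr:.1f} min")
print(f"Alert-to-incident ratio: {actionable_ratio:.0%}  Overnight wake-ups: {wake_ups}")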

Alert Fatigue Index (Custom Metric)

Combine metrics into a single health score:

alert_fatigue_index = (
    (false_positive_rate * 0.3) +
    (avg_alerts_per_shift / 20 * 0.3) +
    (night_wake_ups_per_week / 5 * 0.2) +
    (mtta_minutes / 10 * 0.2)
)

# Score interpretation:
# 0.0-0.3: Healthy
# 0.3-0.6: Warning
# 0.6-1.0: Critical

Track this monthly and set targets for improvement.

Designing Sane Alerts

Most alert problems stem from poor alert design. Here's how to fix them:

Principle 1: Every Alert Must Be Actionable

Bad Alert: "CPU usage is high"

  • What should I do? Is this normal? Is it getting worse?

Good Alert: "CPU usage above 80% for 10 minutes on api-server. Check recent deployments or scale horizontally. Runbook: [link]"

  • Clear threshold, context, suggested actions, and resources

Rule: If an alert doesn't suggest an action, it's not an alert; it's a log entry. And if the on-call engineer needs someone else just to understand it, the alert is being routed to the wrong person.
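
One way to enforce this is to lint alert definitions before they ship, failing CI if an alert has no runbook or suggested action. A minimal sketch, assuming alerts are defined as dicts with hypothetical runbook_url, suggested_actions, and threshold_description fields:

REQUIRED_FIELDS = ["threshold_description", "suggested_actions", "runbook_url"]

def lint_alert(alert: dict) -> list:
    """Return the problems that make this alert definition non-actionable."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not alert.get(field)]
    if alert.get("severity") in ("P0", "P1") and not alert.get("runbook_url"):
        problems.append("paging alerts must link a runbook")
    return problems

alert = {
    "name": "HighCPU",
    "severity": "P1",
    "threshold_description": "CPU > 80% for 10 minutes on api-server",
    # No suggested_actions or runbook_url: this definition should fail the lint.
}

for problem in lint_alert(alert):
    print(f"{alert['name']}: {problem}")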

Principle 2: Define Clear Severity Levels

Create and enforce a severity taxonomy:

P0 - Critical (page immediately):

  • Complete service outage affecting customers
  • Data loss or corruption in progress
  • Security breach
  • SLA violations in progress

P1 - High (page during business hours, notify off-hours):

  • Partial service degradation
  • Single-region outage
  • Approaching SLA violations

P2 - Medium (ticket, no page):

  • Performance degradation not affecting users yet
  • Non-critical component failure with redundancy
  • Resource usage trends requiring attention

P3 - Low (ticket, no urgency):

  • Information for trending analysis
  • Maintenance reminders

Rule: Only P0 should wake people up. P1 can page during working hours.
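
The taxonomy only works if routing enforces it. A small sketch of the paging decision, assuming local business hours of 9:00-18:00 on weekdays (adjust to your rotation's timezone):

from datetime import datetime, time

def route(severity, now=None):
    """Map a severity level to a delivery channel based on the time of day."""
    now = now or datetime.now()
    business_hours = now.weekday() < 5 and time(9) <= now.time() <= time(18)

    if severity == "P0":
        return "page"                      # wakes people up, any hour
    if severity == "P1":
        return "page" if business_hours else "notify-team-channel"
    if severity == "P2":
        return "create-ticket"
    return "create-ticket-low-priority"    # P3

print(route("P1", datetime(2025, 10, 29, 3, 0)))    # notify-team-channel (3 AM)
print(route("P0", datetime(2025, 10, 29, 3, 0)))    # page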

Principle 3: Use Deduplication and Correlation

Don't fire 20 alerts when one service dies:

# Example: Alert deduplication in Prometheus Alertmanager
route:
  receiver: 'team-pager'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait to batch similar alerts
  group_interval: 5m     # Send grouped alerts every 5 min
  repeat_interval: 3h    # Don't re-alert for 3 hours

inhibit_rules:
  # If API is down, silence dependent service alerts
  - source_match:
      alertname: 'APIServerDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['cluster']

Rule: Root cause alerts should suppress symptom alerts.

Principle 4: Provide Rich Context

Alerts without context force engineers to investigate from scratch every time:

{
  "alert": "High Error Rate",
  "severity": "P0",
  "service": "payment-api",
  "current_value": "15% errors",
  "threshold": "5% errors",
  "duration": "10 minutes",
  "affected_endpoints": ["/checkout", "/refund"],
  "recent_deployments": [
    {
      "version": "v2.3.1",
      "deployed_at": "2025-10-29T13:45:00Z",
      "deployed_by": "jenkins"
    }
  ],
  "dashboard": "https://grafana.example.com/d/payments",
  "runbook": "https://wiki.example.com/runbooks/payment-errors",
  "related_alerts": ["DatabaseConnectionPool-Exhausted"],
  "suggested_actions": [
    "Check recent deployment v2.3.1",
    "Review database connection pool metrics",
    "Consider rollback if errors persist"
  ]
}

This alert gives the on-call engineer everything they need to start investigating immediately.

Principle 5: Implement Alert Fatigue Circuit Breakers

Automatically suppress excessive alerting:

from datetime import datetime

class AlertCircuitBreaker:
    def __init__(self, threshold=10, window_minutes=30):
        self.threshold = threshold
        self.window = window_minutes  # sliding window size, in minutes
        self.recent_alerts = []

    def should_send(self, alert):
        now = datetime.now()

        # Drop alerts that have aged out of the sliding window
        self.recent_alerts = [
            a for a in self.recent_alerts
            if (now - a['timestamp']).total_seconds() < self.window * 60
        ]

        # Check if we've hit the threshold
        if len(self.recent_alerts) >= self.threshold:
            # We're in alert storm - only send P0 alerts
            if alert['severity'] == 'P0':
                self.recent_alerts.append({
                    'timestamp': now,
                    'alert': alert
                })
                return True
            else:
                self.send_to_team_channel(
                    f"⚠️ Alert storm detected: {len(self.recent_alerts)} alerts in {self.window} minutes. Suppressing non-critical alerts."
                )
                return False

        # Normal operation
        self.recent_alerts.append({'timestamp': now, 'alert': alert})
        return True

Rule: Protect humans from alert storms by temporarily raising the alert threshold.

Automation That Helps (Not Harms)

Smart automation reduces on-call burden without creating new problems:

Self-Healing for Known Issues

class SelfHealingAlert:
    def trigger(self, alert):
        # Try automated remediation first
        remediation_result = self.attempt_remediation(alert)

        if remediation_result.success:
            # Fixed! Just notify for visibility
            self.send_notification(
                f"✅ Auto-resolved: {alert.name}. Action taken: {remediation_result.action}"
            )
            self.create_postmortem_ticket(alert, remediation_result)
        else:
            # Couldn't fix automatically - escalate to human
            self.page_on_call(alert, context={
                'attempted_remediation': remediation_result.action,
                'failure_reason': remediation_result.error
            })

    def attempt_remediation(self, alert):
        remediation_playbook = {
            'DiskSpaceHigh': self.clean_old_logs,
            'MemoryLeak': self.restart_service,
            'CertExpiringSoon': self.renew_certificate,
        }

        handler = remediation_playbook.get(alert.name)
        if handler:
            return handler(alert)
        return RemediationResult(success=False, error="No playbook available")

Common self-healing patterns (one handler is sketched after this list):

  • Restart crashed services
  • Clear disk space by rotating logs
  • Scale up resources temporarily
  • Drain and replace unhealthy nodes
  • Renew expiring certificates
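
As a concrete example, here's what a standalone version of the clean_old_logs handler from the playbook above might look like. A rough sketch only: the log directory, retention window, glob pattern, and the RemediationResult type are all assumptions, and a real handler would want guardrails such as a dry-run mode:

import time
from dataclasses import dataclass
from pathlib import Path

@dataclass
class RemediationResult:
    # Assumed result type returned by every remediation handler.
    success: bool
    action: str = ""
    error: str = ""

def clean_old_logs(alert, log_dir="/var/log/app", max_age_days=7):
    """Delete rotated log files older than max_age_days and report how much space was freed."""
    cutoff = time.time() - max_age_days * 86400
    freed_bytes = 0
    try:
        for path in Path(log_dir).glob("*.log.*"):   # rotated files only, never the live log
            if path.stat().st_mtime < cutoff:
                freed_bytes += path.stat().st_size
                path.unlink()
        return RemediationResult(success=True,
                                 action=f"Deleted rotated logs, freed {freed_bytes / 1e6:.1f} MB")
    except OSError as exc:
        return RemediationResult(success=False, error=str(exc))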

Context-Aware Alerts

Use recent activity to inform alert routing:

def route_alert(alert, context):
    # If there's an active incident, route related alerts to incident channel
    active_incidents = get_active_incidents()
    for incident in active_incidents:
        if alert.service in incident.affected_services:
            return incident.slack_channel

    # If it's a known issue with recent activity, route to team channel
    recent_tickets = search_tickets(alert.service, days=7)
    if recent_tickets:
        return 'team-channel-non-urgent'

    # If there was a recent deployment, loop in the deployer
    recent_deploy = get_recent_deployment(alert.service, hours=2)
    if recent_deploy:
        return f"@{recent_deploy.author} - potential impact from your deployment"

    # Otherwise, standard on-call page
    return 'on-call-pager'

Progressive Alert Escalation

Don't page humans until automation has tried first:

alert_policy:
  - name: HighErrorRate
    conditions:
      - error_rate > 5%
      - duration: 5 minutes

    escalation:
      - stage: 1
        delay: 0 seconds
        action: attempt_auto_remediation

      - stage: 2
        delay: 3 minutes
        action: post_to_team_slack
        condition: not_resolved

      - stage: 3
        delay: 10 minutes
        action: page_on_call
        condition: not_resolved AND error_rate > 10%

This gives systems a chance to self-recover before waking humans.

Cultural Shifts: Making On-Call Sustainable

Technology alone won't fix alert fatigue. You need cultural changes too:

1. Fair Compensation

On-call work is different from regular work. Compensate accordingly:

Options:

  • On-call stipend ($X per shift, regardless of pages)
  • Per-incident bonus for off-hours responses
  • Time-in-lieu: every hour worked off-hours = 1.5 hours of flex time
  • Shift differentials (nights/weekends pay more)

Rule: Don't treat on-call as "just part of the job." It's extra labor and should be compensated.

2. Rotation Hygiene

Shift length: 1 week is standard. Longer creates extended stress; shorter creates planning chaos.

Handoff rituals: Schedule 30-minute handoffs where outgoing on-call briefs incoming on:

  • Active issues and workarounds
  • Ongoing investigations
  • Known flaky alerts
  • Recent deployments

Follow-the-sun rotations: For global teams, rotate on-call by region to minimize overnight pages.

No solo on-call: Always have a backup. Primary handles pages; backup covers if primary is unavailable or needs help.

3. Post-Incident Care

After a hard on-call shift:

Immediate:

  • Give engineers the next day off or at least a late start
  • Don't schedule meetings first thing after on-call
  • Have leads check in personally

Within a week:

  • Conduct blameless postmortems
  • Track action items to prevent recurrence
  • Celebrate quick resolutions

Monthly:

  • Review on-call metrics with the team
  • Discuss what's getting better (or worse)
  • Prioritize alert quality improvements

Rule: Treat a hard incident like an injury: give people time to recover.

4. Psychological Safety

Make it safe to:

  • Escalate unclear situations
  • Ask for help without judgment
  • Make mistakes during incidents
  • Push back on noisy alerts
  • Say "I need a break from on-call"

Anti-patterns to avoid:

  • Heroism culture ("Jane fixed it alone at 3 AM again!")
  • Shaming slow responses
  • Hiding alert fatigue metrics
  • Forcing people onto on-call rotation against their will

SLO-Driven On-Call: Using Service Levels as Noise Filters

Service Level Objectives (SLOs) are a powerful tool for reducing alert noise:

Define Error Budgets

Instead of alerting on every error, alert when you're burning your error budget too fast:

slo:
  name: API Availability
  target: 99.9%  # Allows 43 minutes of downtime per month

  error_budget: 0.1%  # 1 - 0.999

  burn_rate_alerts:
    - severity: P0
      condition: error_budget_consumed > 10% in 1 hour
      # ~72x the sustainable burn rate (10% of a 30-day month's budget in one hour)

    - severity: P1
      condition: error_budget_consumed > 50% in 24 hours
      # ~15x the sustainable burn rate (50% of the budget in one day)

This focuses alerts on what matters: are we violating customer expectations?
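
The arithmetic behind these numbers is simple enough to keep in a script. A quick sketch of where the "43 minutes per month" figure and the burn-rate multipliers come from, assuming a 30-day month:

# Error budget math for a 99.9% availability SLO over a 30-day month.
slo_target = 0.999
month_minutes = 30 * 24 * 60                       # 43,200 minutes
month_hours = 30 * 24                              # 720 hours

error_budget_minutes = (1 - slo_target) * month_minutes
print(f"Allowed downtime: {error_budget_minutes:.1f} min/month")      # ~43.2

def burn_rate(budget_fraction_consumed, window_hours):
    """How many times faster than 'exactly on budget' the budget is being consumed."""
    return budget_fraction_consumed * month_hours / window_hours

print(f"10% of budget in 1 hour:   {burn_rate(0.10, 1):.0f}x")        # ~72x
print(f"50% of budget in 24 hours: {burn_rate(0.50, 24):.0f}x")       # ~15x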

Multi-Window, Multi-Burn-Rate Alerting

Reduce false positives with time-based context:

def should_alert(service_slo):
    # Check both short and long windows
    short_window = service_slo.burn_rate(window='1h')
    long_window = service_slo.burn_rate(window='6h')

    # Alert only if BOTH windows show high burn rate
    if short_window > 14 and long_window > 14:
        return True, "P0: Severe error budget burn"

    if short_window > 6 and long_window > 6:
        return True, "P1: Elevated error budget burn"

    return False, None

Why this works: Requires sustained problems, not transient blips.

Stop Alerting on Fixed Thresholds

Traditional alerts use arbitrary thresholds:

  • "CPU > 80%" (based on what?)
  • "Latency > 200ms" (is that actually bad?)

SLO-based alerts use customer impact:

  • "Latency p99 > 500ms for 10 minutes" (our SLO is p99 < 500ms)
  • "Error rate > 1% for 5 minutes" (our SLO allows max 0.1% errors)

Rule: Alert on SLO violations, not infrastructure metrics.
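
The same contrast in code form; a sketch only, with thresholds taken from the SLO examples above:

# Fixed-threshold style: fires on an arbitrary infrastructure number.
def cpu_alert(cpu_percent):
    return cpu_percent > 80                      # based on what?

# SLO-based style: fires only when a customer-facing objective is at risk for a sustained period.
SLO_P99_LATENCY_MS = 500   # SLO: p99 < 500 ms
ALERT_ERROR_RATE = 0.01    # SLO allows 0.1% errors; alert when we're clearly above it

def slo_alert(p99_latency_ms, error_rate, sustained_minutes):
    latency_breach = p99_latency_ms > SLO_P99_LATENCY_MS and sustained_minutes >= 10
    error_breach = error_rate > ALERT_ERROR_RATE and sustained_minutes >= 5
    return latency_breach or error_breach

print(cpu_alert(85))                             # True, but is any customer affected?
print(slo_alert(620, 0.0004, 12))                # True: latency SLO violated for 12 minutes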

Building the Improvement Flywheel

Fixing alert fatigue is a continuous process:

Week 1: Audit Current State

# Analyze alert data
- Total alerts last month: ?
- False positive rate: ?
- Overnight pages: ?
- Average resolution time: ?
- Top 10 noisiest alerts: ?
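
Most of these audit numbers fall out of a quick pass over exported alert history. For example, the "top 10 noisiest alerts" is one collections.Counter away (the field names below are assumptions):

from collections import Counter

# alert_history: last month's alerts exported from your paging tool.
alert_history = [
    {"name": "DiskSpaceHigh", "required_action": False},
    {"name": "HighLatency", "required_action": True},
    {"name": "DiskSpaceHigh", "required_action": False},
    # ... remaining records ...
]

noisiest = Counter(a["name"] for a in alert_history).most_common(10)
false_positive_rate = 1 - sum(a["required_action"] for a in alert_history) / len(alert_history)

print("Top noisiest alerts:", noisiest)
print(f"False positive rate: {false_positive_rate:.0%}")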

Week 2-4: Quick Wins

  • Delete or tune the top 5 noisiest alerts
  • Add runbooks to alerts without them
  • Implement alert deduplication
  • Set up self-healing for top 3 repetitive issues

Month 2: Systematic Improvements

  • Migrate to SLO-based alerting for critical services
  • Implement alert severity taxonomy
  • Add context-rich alert templates
  • Set up alert fatigue metrics dashboard

Month 3: Cultural Changes

  • Introduce fair on-call compensation
  • Establish handoff rituals
  • Start monthly on-call retrospectives
  • Create on-call feedback loops

Ongoing: Never Stop Improving

  • Review metrics quarterly
  • Celebrate alert reduction milestones
  • Share learnings across teams
  • Continuously prune and tune alerts

What fixing it can look like (examples)

  • Before: 300 alerts/week, 8 overnight pages/night, 45% attrition in on-call pool
  • After 6 months: 40 alerts/week, 1 overnight page/night, 10% attrition
  • Key changes: SLO-based alerting, self-healing automation, tripled on-call compensation

  • Before: 15-minute MTTA, 4-hour MTTR, constant firefighting
  • After 1 year: 3-minute MTTA, 45-minute MTTR, proactive issue prevention
  • Key changes: Context-rich alerts, runbook automation, follow-the-sun rotations

  • Before: "Always-on" culture, engineers checking phones at dinner
  • After 9 months: Work-life boundaries restored, team satisfaction up 40%
  • Key changes: Alert circuit breakers, post-incident recovery time, psychological safety initiatives

Healthy systems need healthy humans. A sustainable on-call culture protects both.

Alert fatigue isn't inevitable. It's a design problem with engineering solutions:

  • Better alerts (actionable, contextual, deduplicated)
  • Smart automation (self-healing, progressive escalation)
  • Fair compensation (recognize the burden)
  • Cultural safety (make it okay to push back)

Start small. Pick one metric to improve this month. Celebrate progress. And remember: the goal isn't zero alerts; it's zero noise.

Your engineers will sleep better. Your systems will run better. And when real incidents happen, your team will have the energy and focus to handle them effectively.

That's what a healthy on-call culture looks like. Build it intentionally, or watch burnout build it for you.
