DEV Community

Samson Tanimawo

I Reduced Our Alert Volume by 90%. Here's the Playbook

The Night I Almost Quit

Three months into my SRE role, I was averaging 47 alerts per on-call shift. Most were noise. I was exhausted, making bad decisions at 3am, and seriously considering switching careers.

Then I decided to fix it instead of complaining about it.

Step 1: The Alert Audit

Before touching a single threshold, I spent two weeks categorizing every alert into four buckets:

  1. Actionable — Someone needs to do something right now
  2. Informational — Useful context but not urgent
  3. Redundant — Another alert already covers this
  4. Stale — The service changed but the alert didn't

The results were brutal:

```text
Actionable:    12%
Informational: 28%
Redundant:     35%
Stale:         25%
```

Only 12% of our alerts actually required human intervention.
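The audit itself was just bookkeeping. Here's a minimal sketch of the tally, assuming each tagged alert is a dict with a `category` field (the records below are illustrative, not our real data):

```python
from collections import Counter

# The four audit buckets from Step 1.
CATEGORIES = ("actionable", "informational", "redundant", "stale")

def audit_summary(alerts):
    """Return each category's share of total alerts, as a percentage."""
    counts = Counter(a["category"] for a in alerts)
    total = sum(counts.values())
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}

# Illustrative records; in practice these came from two weeks of hand-tagging.
sample = (
    [{"category": "actionable"}] * 12
    + [{"category": "informational"}] * 28
    + [{"category": "redundant"}] * 35
    + [{"category": "stale"}] * 25
)
print(audit_summary(sample))
# {'actionable': 12.0, 'informational': 28.0, 'redundant': 35.0, 'stale': 25.0}
```

The only hard part is the tagging discipline; the arithmetic is trivial.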

Step 2: Kill the Redundancies

We had monitoring on every layer — infrastructure, application, and synthetic. The problem was overlap. A single database slowdown would trigger:

  • High CPU alert on the DB host
  • Slow query alert from APM
  • Elevated p99 latency from the application
  • Failed synthetic check from the health endpoint
  • SLO burn rate alert

That's five alerts for one problem. I created dependency maps and established a hierarchy:

```yaml
# Before: 5 alerts fire
alerts:
  - db_cpu_high
  - slow_queries
  - app_p99_high
  - synthetic_health_fail
  - slo_burn_rate

# After: 1 alert fires with context
alerts:
  - slo_burn_rate:
      context:
        - db_cpu: "{{ db_cpu_percent }}%"
        - slow_queries: "{{ slow_query_count }}"
        - p99_latency: "{{ p99_ms }}ms"
```
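Mechanically, enforcing the hierarchy is simple: when a parent alert is firing, drop its children and fold their values in as context. A hedged sketch, where the dependency map and the alert shape are made up for illustration:

```python
# Hypothetical dependency map: parent alert -> child alerts it subsumes.
DEPENDENCIES = {
    "slo_burn_rate": [
        "db_cpu_high", "slow_queries", "app_p99_high", "synthetic_health_fail",
    ],
}

def collapse(firing):
    """Keep only top-level alerts; fold suppressed children in as context."""
    firing = dict(firing)  # name -> metric value; copy so we can pop
    result = []
    for parent, children in DEPENDENCIES.items():
        if parent in firing:
            context = {c: firing.pop(c) for c in children if c in firing}
            result.append({"alert": parent,
                           "value": firing.pop(parent),
                           "context": context})
    # Anything left has no firing parent, so it stands on its own.
    result.extend({"alert": n, "value": v, "context": {}}
                  for n, v in firing.items())
    return result

pages = collapse({
    "slo_burn_rate": 0.9, "db_cpu_high": 97, "slow_queries": 42,
    "app_p99_high": 1800, "synthetic_health_fail": 1,
})
print(len(pages))  # 1: one page, with the other four signals as context
```

The on-call engineer still sees all five data points, but as context on one page instead of five separate notifications.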

Step 3: Raise the Thresholds

Most thresholds were set by gut feeling when the service launched. I replaced them with data-driven baselines:

```python
import numpy as np

# Pull 30 days of metrics (get_metric_history is a stand-in for
# whatever query your metrics store exposes)
values = get_metric_history('cpu_percent', days=30)

# Set threshold at p95 + 2 standard deviations
baseline = np.percentile(values, 95)
stddev = np.std(values)
new_threshold = baseline + (2 * stddev)

print("Old threshold: 80%")
print(f"New threshold: {new_threshold:.1f}%")
# Output: New threshold: 92.3%
```

Step 4: Add Time-Based Suppression

Some alerts are expected during deployments, batch jobs, or maintenance windows. Instead of people mentally filtering these, I automated it:

```yaml
suppression_rules:
  - name: "deploy_window"
    match: "service=api AND severity<critical"
    during: "deploy_in_progress"
  - name: "batch_jobs"
    match: "service=etl AND metric=cpu"
    window: "02:00-04:00"  # daily batch window
```
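Evaluating rules like these takes only a few lines. A sketch, assuming alerts carry `service`, `metric`, and `severity` labels (the rule format above is a simplified stand-in for whatever your alert manager actually supports):

```python
from datetime import time

SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

def in_window(now, start, end):
    """True if a time of day falls inside [start, end)."""
    return start <= now < end

def suppress(alert, now, deploy_in_progress=False):
    """Return True if the alert should be dropped rather than routed."""
    # Rule 1: during a deploy, hold back sub-critical alerts from the API.
    if (deploy_in_progress and alert["service"] == "api"
            and SEVERITY_ORDER[alert["severity"]] < SEVERITY_ORDER["critical"]):
        return True
    # Rule 2: ETL CPU alerts are expected during the 2am-4am batch window.
    if (alert["service"] == "etl" and alert["metric"] == "cpu"
            and in_window(now, time(2, 0), time(4, 0))):
        return True
    return False

print(suppress({"service": "etl", "metric": "cpu", "severity": "warning"},
               now=time(3, 15)))  # True: inside the batch window
```

Note that critical alerts always get through; suppression only applies to the noise you already know is coming.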

Step 5: The Routing Overhaul

Not every alert needs to wake someone up. I established clear escalation tiers:

  • P1 (Page): Customer-facing outage, data loss risk
  • P2 (Slack + ticket): Degraded but functional, fix within 4 hours
  • P3 (Ticket only): Non-urgent, fix within sprint
  • P4 (Dashboard): FYI, review in next planning
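The tiers translate directly into a routing table. A minimal sketch, with channel names that are illustrative rather than tied to any particular tool:

```python
# Priority -> notification channels (hypothetical names).
ROUTES = {
    "P1": ["page"],
    "P2": ["slack", "ticket"],
    "P3": ["ticket"],
    "P4": ["dashboard"],
}

def route(alert):
    """Map an alert's priority to its notification channels."""
    # Unknown priorities default to the dashboard: never page by accident.
    return ROUTES.get(alert["priority"], ["dashboard"])

print(route({"priority": "P1"}))  # ['page']
print(route({"priority": "P3"}))  # ['ticket']
```

The important design choice is the default: an unclassified alert lands on the dashboard, not on someone's phone.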

The Results

After eight weeks of implementation:

| Metric | Before | After |
|---|---|---|
| Alerts per shift | 47 | 5 |
| Mean time to acknowledge | 12 min | 2 min |
| False positive rate | 62% | 8% |
| On-call satisfaction (1-5) | 1.8 | 4.2 |

The team stopped dreading on-call. Response times improved because every alert actually mattered.

Key Takeaway

Alert fatigue is a systems problem, not a people problem. You can't train your way out of bad signal-to-noise ratios. Audit ruthlessly, set data-driven thresholds, and make every page count.

If you're drowning in alerts and looking for a smarter way to manage incident noise, check out what we're building at Nova AI Ops.
