The Night I Almost Quit
Three months into my SRE role, I was averaging 47 alerts per on-call shift. Most were noise. I was exhausted, making bad decisions at 3am, and seriously considering switching careers.
Then I decided to fix it instead of complaining about it.
Step 1: The Alert Audit
Before touching a single threshold, I spent two weeks categorizing every alert into four buckets:
- Actionable — Someone needs to do something right now
- Informational — Useful context but not urgent
- Redundant — Another alert already covers this
- Stale — The service changed but the alert didn't
The results were brutal:
- Actionable: 12%
- Informational: 28%
- Redundant: 35%
- Stale: 25%
Only 12% of our alerts actually required human intervention.
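The tally itself needs nothing fancy. A minimal sketch of how the audit numbers fall out of a triage log (the alert names and sample entries here are made up for illustration, not our real data):

```python
from collections import Counter

# Hypothetical audit log: one (alert_name, bucket) entry per triaged alert.
audit = [
    ("db_cpu_high", "redundant"),
    ("slow_queries", "redundant"),
    ("disk_full", "actionable"),
    ("cert_expiry_30d", "informational"),
    ("legacy_queue_depth", "stale"),
    ("api_5xx_spike", "actionable"),
    ("synthetic_health_fail", "redundant"),
    ("deploy_started", "informational"),
]

counts = Counter(bucket for _, bucket in audit)
total = len(audit)
for bucket, n in counts.most_common():
    print(f"{bucket:15s} {n:3d}  ({100 * n / total:.0f}%)")
```

Two weeks of this, one line per alert, and the percentages write themselves.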
Step 2: Kill the Redundancies
We had monitoring on every layer — infrastructure, application, and synthetic. The problem was overlap. A single database slowdown would trigger:
- High CPU alert on the DB host
- Slow query alert from APM
- Elevated p99 latency from the application
- Failed synthetic check from the health endpoint
- SLO burn rate alert
That's five alerts for one problem. I created dependency maps and established a hierarchy:
```yaml
# Before: 5 alerts fire
alerts:
  - db_cpu_high
  - slow_queries
  - app_p99_high
  - synthetic_health_fail
  - slo_burn_rate

# After: 1 alert fires with context
alerts:
  - slo_burn_rate:
      context:
        - db_cpu: "{{ db_cpu_percent }}%"
        - slow_queries: "{{ slow_query_count }}"
        - p99_latency: "{{ p99_ms }}ms"
```
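The consolidation logic behind that hierarchy can be sketched in a few lines: when a parent alert is firing, its children get folded in as context instead of paging separately. (The `PARENT_OF` map and `consolidate` helper below are illustrative, not our production code.)

```python
# Dependency map: child alert -> the parent that subsumes it.
# Alert names mirror the YAML example above.
PARENT_OF = {
    "db_cpu_high": "slo_burn_rate",
    "slow_queries": "slo_burn_rate",
    "app_p99_high": "slo_burn_rate",
    "synthetic_health_fail": "slo_burn_rate",
}

def consolidate(firing):
    """Return the alerts to page on, folding suppressed children in as context."""
    firing = set(firing)
    pages = {}
    for alert in firing:
        parent = PARENT_OF.get(alert)
        if parent in firing:
            pages.setdefault(parent, []).append(alert)  # attach as context
        else:
            pages.setdefault(alert, [])                 # page on its own
    return pages

incident = ["db_cpu_high", "slow_queries", "app_p99_high",
            "synthetic_health_fail", "slo_burn_rate"]
print(consolidate(incident))
# One page (slo_burn_rate) instead of five, with the other four as context.
```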
Step 3: Raise the Thresholds
Most thresholds were set by gut feeling when the service launched. I replaced them with data-driven baselines:
```python
import numpy as np

# Pull 30 days of metrics
values = get_metric_history('cpu_percent', days=30)

# Set threshold at p95 + 2 standard deviations
baseline = np.percentile(values, 95)
stddev = np.std(values)
new_threshold = baseline + (2 * stddev)

print("Old threshold: 80%")
print(f"New threshold: {new_threshold:.1f}%")
# Output: New threshold: 92.3%
```
Step 4: Add Time-Based Suppression
Some alerts are expected during deployments, batch jobs, or maintenance windows. Instead of relying on people to filter these out mentally, I automated the suppression:
```yaml
suppression_rules:
  - name: "deploy_window"
    match: "service=api AND severity<critical"
    during: "deploy_in_progress"
  - name: "batch_jobs"
    match: "service=etl AND metric=cpu"
    schedule: "0 2 * * * - 0 4 * * *"  # 2am-4am daily
```
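Evaluating rules like these is simple. Here's a simplified sketch of the idea (the rule shape, `SEVERITY_RANK` table, and `DEPLOY_IN_PROGRESS` flag are assumptions for illustration, not the real config format):

```python
from datetime import datetime, time

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
DEPLOY_IN_PROGRESS = False  # in reality, fed by the deploy system

# Simplified rules: field matchers plus an "is this window active now?" check.
RULES = [
    {"name": "deploy_window", "service": "api", "below_severity": "critical",
     "active": lambda now: DEPLOY_IN_PROGRESS},
    {"name": "batch_jobs", "service": "etl", "metric": "cpu",
     "active": lambda now: time(2, 0) <= now.time() < time(4, 0)},
]

def is_suppressed(alert, now):
    """Return the name of the first matching active rule, or None."""
    for rule in RULES:
        if rule.get("service") and alert.get("service") != rule["service"]:
            continue
        if rule.get("metric") and alert.get("metric") != rule["metric"]:
            continue
        if ("below_severity" in rule and
                SEVERITY_RANK[alert["severity"]] >= SEVERITY_RANK[rule["below_severity"]]):
            continue  # rule only covers alerts below this severity
        if rule["active"](now):
            return rule["name"]
    return None

alert = {"service": "etl", "metric": "cpu", "severity": "warning"}
print(is_suppressed(alert, datetime(2024, 1, 1, 3, 0)))  # batch_jobs
```

Critical alerts deliberately fall through every rule: suppression should never eat a real page.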
Step 5: The Routing Overhaul
Not every alert needs to wake someone up. I established clear escalation tiers:
- P1 (Page): Customer-facing outage, data loss risk
- P2 (Slack + ticket): Degraded but functional, fix within 4 hours
- P3 (Ticket only): Non-urgent, fix within sprint
- P4 (Dashboard): FYI, review in next planning
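In code, the routing layer is little more than a priority-to-channels lookup. A sketch of the tiers above (the channel names and `route` helper are illustrative):

```python
# Escalation tiers from the list above, mapped to notification channels.
ROUTES = {
    "P1": ["page_oncall", "slack", "ticket"],
    "P2": ["slack", "ticket"],
    "P3": ["ticket"],
    "P4": ["dashboard"],
}

def route(alert):
    """Return the notification channels for an alert's priority tier."""
    # Unknown priorities fall through to the quietest tier, never to a page.
    return ROUTES.get(alert["priority"], ["dashboard"])

print(route({"priority": "P2", "summary": "elevated p99 on checkout"}))
# ['slack', 'ticket']
```

The important design choice is the default: anything unclassified lands on a dashboard, not in someone's pocket at 3am.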
The Results
After eight weeks of implementation:
| Metric | Before | After |
|---|---|---|
| Alerts per shift | 47 | 5 |
| Mean time to acknowledge | 12 min | 2 min |
| False positive rate | 62% | 8% |
| On-call satisfaction (1-5) | 1.8 | 4.2 |
The team stopped dreading on-call. Response times improved because every alert actually mattered.
Key Takeaway
Alert fatigue is a systems problem, not a people problem. You can't train your way out of bad signal-to-noise ratios. Audit ruthlessly, set data-driven thresholds, and make every page count.
If you're drowning in alerts and looking for a smarter way to manage incident noise, check out what we're building at Nova AI Ops.