The Night I Almost Quit
Three months into my SRE role, I was averaging 47 alerts per on-call shift. Most were noise. I was exhausted, making bad decisions at 3am, and seriously considering switching careers.
Then I decided to fix it instead of complaining about it.
Step 1: The Alert Audit
Before touching a single threshold, I spent two weeks categorizing every alert into four buckets:
- Actionable — Someone needs to do something right now
- Informational — Useful context but not urgent
- Redundant — Another alert already covers this
- Stale — The service changed but the alert didn't
The results were brutal:
- Actionable: 12%
- Informational: 28%
- Redundant: 35%
- Stale: 25%
Only 12% of our alerts actually required human intervention.
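The tally itself needs nothing fancy. A minimal sketch of how the audit numbers fall out of a triage log (the alert names and sample entries here are made up for illustration, not our real data):

```python
from collections import Counter

# Hypothetical audit log: one (alert_name, bucket) entry per triaged alert.
audit = [
    ("db_cpu_high", "redundant"),
    ("slow_queries", "redundant"),
    ("disk_full", "actionable"),
    ("cert_expiry_30d", "informational"),
    ("legacy_queue_depth", "stale"),
    ("api_5xx_spike", "actionable"),
    ("synthetic_health_fail", "redundant"),
    ("deploy_started", "informational"),
]

counts = Counter(bucket for _, bucket in audit)
total = len(audit)
for bucket, n in counts.most_common():
    print(f"{bucket:15s} {n:3d}  ({100 * n / total:.0f}%)")
```

Two weeks of this, one line per alert, and the percentages write themselves.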
Step 2: Kill the Redundancies
We had monitoring on every layer — infrastructure, application, and synthetic. The problem was overlap. A single database slowdown would trigger:
- High CPU alert on the DB host
- Slow query alert from APM
- Elevated p99 latency from the application
- Failed synthetic check from the health endpoint
- SLO burn rate alert
That's five alerts for one problem. I created dependency maps and established a hierarchy:
```yaml
# Before: 5 alerts fire
alerts:
  - db_cpu_high
  - slow_queries
  - app_p99_high
  - synthetic_health_fail
  - slo_burn_rate

# After: 1 alert fires with context
alerts:
  - slo_burn_rate:
      context:
        - db_cpu: "{{ db_cpu_percent }}%"
        - slow_queries: "{{ slow_query_count }}"
        - p99_latency: "{{ p99_ms }}ms"
```
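The consolidation logic behind that hierarchy can be sketched in a few lines: when a parent alert is firing, its children get folded in as context instead of paging separately. (The `PARENT_OF` map and `consolidate` helper below are illustrative, not our production code.)

```python
# Dependency map: child alert -> the parent that subsumes it.
# Alert names mirror the YAML example above.
PARENT_OF = {
    "db_cpu_high": "slo_burn_rate",
    "slow_queries": "slo_burn_rate",
    "app_p99_high": "slo_burn_rate",
    "synthetic_health_fail": "slo_burn_rate",
}

def consolidate(firing):
    """Return the alerts to page on, folding suppressed children in as context."""
    firing = set(firing)
    pages = {}
    for alert in firing:
        parent = PARENT_OF.get(alert)
        if parent in firing:
            pages.setdefault(parent, []).append(alert)  # attach as context
        else:
            pages.setdefault(alert, [])                 # page on its own
    return pages

incident = ["db_cpu_high", "slow_queries", "app_p99_high",
            "synthetic_health_fail", "slo_burn_rate"]
print(consolidate(incident))
# One page (slo_burn_rate) instead of five, with the other four as context.
```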
Step 3: Raise the Thresholds
Most thresholds were set by gut feeling when the service launched. I replaced them with data-driven baselines:
```python
import numpy as np

# Pull 30 days of metrics
values = get_metric_history('cpu_percent', days=30)

# Set threshold at p95 + 2 standard deviations
baseline = np.percentile(values, 95)
stddev = np.std(values)
new_threshold = baseline + (2 * stddev)

print("Old threshold: 80%")
print(f"New threshold: {new_threshold:.1f}%")
# Output: New threshold: 92.3%
```
Step 4: Add Time-Based Suppression
Some alerts are expected during deployments, batch jobs, or maintenance windows. Instead of relying on people to filter these out mentally, I automated the suppression:
```yaml
suppression_rules:
  - name: "deploy_window"
    match: "service=api AND severity<critical"
    during: "deploy_in_progress"
  - name: "batch_jobs"
    match: "service=etl AND metric=cpu"
    schedule: "0 2 * * * - 0 4 * * *"  # 2am-4am daily
```
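Evaluating rules like these is simple. Here's a simplified sketch of the idea (the rule shape, `SEVERITY_RANK` table, and `DEPLOY_IN_PROGRESS` flag are assumptions for illustration, not the real config format):

```python
from datetime import datetime, time

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
DEPLOY_IN_PROGRESS = False  # in reality, fed by the deploy system

# Simplified rules: field matchers plus an "is this window active now?" check.
RULES = [
    {"name": "deploy_window", "service": "api", "below_severity": "critical",
     "active": lambda now: DEPLOY_IN_PROGRESS},
    {"name": "batch_jobs", "service": "etl", "metric": "cpu",
     "active": lambda now: time(2, 0) <= now.time() < time(4, 0)},
]

def is_suppressed(alert, now):
    """Return the name of the first matching active rule, or None."""
    for rule in RULES:
        if rule.get("service") and alert.get("service") != rule["service"]:
            continue
        if rule.get("metric") and alert.get("metric") != rule["metric"]:
            continue
        if ("below_severity" in rule and
                SEVERITY_RANK[alert["severity"]] >= SEVERITY_RANK[rule["below_severity"]]):
            continue  # rule only covers alerts below this severity
        if rule["active"](now):
            return rule["name"]
    return None

alert = {"service": "etl", "metric": "cpu", "severity": "warning"}
print(is_suppressed(alert, datetime(2024, 1, 1, 3, 0)))  # batch_jobs
```

Critical alerts deliberately fall through every rule: suppression should never eat a real page.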
Step 5: The Routing Overhaul
Not every alert needs to wake someone up. I established clear escalation tiers:
- P1 (Page): Customer-facing outage, data loss risk
- P2 (Slack + ticket): Degraded but functional, fix within 4 hours
- P3 (Ticket only): Non-urgent, fix within sprint
- P4 (Dashboard): FYI, review in next planning
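In code, the routing layer is little more than a priority-to-channels lookup. A sketch of the tiers above (the channel names and `route` helper are illustrative):

```python
# Escalation tiers from the list above, mapped to notification channels.
ROUTES = {
    "P1": ["page_oncall", "slack", "ticket"],
    "P2": ["slack", "ticket"],
    "P3": ["ticket"],
    "P4": ["dashboard"],
}

def route(alert):
    """Return the notification channels for an alert's priority tier."""
    # Unknown priorities fall through to the quietest tier, never to a page.
    return ROUTES.get(alert["priority"], ["dashboard"])

print(route({"priority": "P2", "summary": "elevated p99 on checkout"}))
# ['slack', 'ticket']
```

The important design choice is the default: anything unclassified lands on a dashboard, not in someone's pocket at 3am.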
The Results
After eight weeks of implementation:
| Metric | Before | After |
|---|---|---|
| Alerts per shift | 47 | 5 |
| Mean time to acknowledge | 12 min | 2 min |
| False positive rate | 62% | 8% |
| On-call satisfaction (1-5) | 1.8 | 4.2 |
The team stopped dreading on-call. Response times improved because every alert actually mattered.
Key Takeaway
Alert fatigue is a systems problem, not a people problem. You can't train your way out of bad signal-to-noise ratios. Audit ruthlessly, set data-driven thresholds, and make every page count.
If you're drowning in alerts and looking for a smarter way to manage incident noise, check out what we're building at Nova AI Ops.