By Meena Nukala
Senior DevOps Engineer | 10+ years | Ex-on-call firefighter turned platform builder
Published: 12 December 2025
In January 2024 my team of 7 was getting ~400 PagerDuty alerts every single night in our UK financial services environment.
By September 2024 we were down to 8 — and none of them woke anyone up unless a real customer was actually affected.
Here’s the exact, battle-tested playbook we used. No magic, no vendor sales pitch — just ruthless elimination of noise.
The Starting Nightmare (Jan 2024)
- 400+ alerts/night, with a 38 % false-positive rate
- Average MTTR: 4 h 12 min (because we were exhausted)
- On-call burnout: 3 engineers left in 6 months
- Weekly “alert review” meetings that nobody wanted to attend
The 5-Phase Noise Massacre
Phase 1 – Alert Ruthlessness Day (Week 1)
We ran this single query across all Prometheus alerts for the previous 30 days:
-- Alerts that fired more than 100 times in the last 30 days
SELECT alertname, COUNT(*) AS fires
FROM prometheus_alerts
WHERE fired_at > now() - interval '30 days'
GROUP BY alertname
HAVING COUNT(*) > 100
ORDER BY fires DESC;
Result: 312 alerts fired more than 100 times — none had ever led to a customer-impacting incident.
Action: Deleted 287 of them on day one.
Rule: “If it has never caused a customer outage in 12 months, it is not an alert — it is a metric.”
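Deleting the alert doesn't mean losing the signal: anything demoted from alert to metric can keep living on a dashboard via a recording rule. A minimal sketch, assuming standard node_exporter metrics (the group name, record name, and the disk-usage example itself are illustrative):
# recording-rules.yaml — keep the signal, drop the page
groups:
  - name: demoted-alerts
    rules:
      # Former disk-usage warning alert, now just a metric on the capacity dashboard
      - record: node:filesystem_used:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          )
Graph it, review it in daylight hours, but nobody gets paged for it.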
Phase 2 – SLO-Based Alerting Only (Weeks 2–4)
We defined real SLOs for every critical service (99.95 % uptime, 400 ms P95 latency for payments API, etc.) and replaced everything with burn-rate alerts.
Example that replaced 47 old alerts:
# alerts.yaml — payments-api
- alert: PaymentsApiHighErrorRate
  # Error *ratio*: 5xx responses as a share of all requests (job label assumed; adjust to your metric labels)
  expr: |
    sum(rate(http_requests_total{job="payments-api", status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="payments-api"}[5m])) > 0.005
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Payments API error budget burning fast — 5 % of monthly budget gone in last hour"
Phase 3 – Runbook Quality Gate (Weeks 5–8)
Every remaining alert now requires a runbook with:
- Is the customer impacted right now? (Yes/No)
- One-click link to exact dashboard
- Exact commands to verify/fix
- Escalation path
If the runbook is missing or older than 90 days → the alert is automatically muted.
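"Automatically muted" only works if the runbook metadata is machine-readable. A minimal sketch of one convention, assuming the runbook URL and review date live as annotations on the rule itself and a small CI lint job (not shown) enforces them; runbook_last_reviewed and the internal URL are illustrative, not built-in Prometheus features:
# Annotations added to the PaymentsApiHighErrorRate rule from Phase 2
annotations:
  summary: "Payments API error budget burning fast"
  # Question 1 of the runbook, answered up front: is the customer impacted right now?
  description: "Check the payments success-rate dashboard before touching anything else"
  runbook_url: "https://runbooks.internal.example/payments-api/high-error-rate"
  # The CI lint job mutes any rule whose runbook link 404s or whose review
  # date is more than 90 days old.
  runbook_last_reviewed: "2024-09-01"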
Phase 4 – Alert Deduplication & Correlation (Weeks 9–12)
Deployed Cortex + Alertmanager with route grouping:
route:
  # One notification per service + environment group, re-sent no sooner than every 4 h
  group_by: ['service', 'environment']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
One disk-full on a 50-node cluster → one page instead of 50.
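Grouping covers the deduplication half; for correlation, Alertmanager's inhibition rules let a root-cause alert suppress its symptom alerts. A minimal sketch (the NodeDown alert name and the node label are illustrative; match them to whatever your rules actually emit):
inhibit_rules:
  # While a node is paging as down, suppress the per-node warnings
  # (disk, memory, process checks) that are just symptoms of it.
  - source_matchers:
      - severity="page"
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ['node', 'environment']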
Phase 5 – Feedback Loop That Stuck (Ongoing)
Every single page now ends with a mandatory 2-question post-mortem:
- Was this page actionable? (Yes/No)
- How can we prevent this page forever?
If “No” → alert deleted within 24 h.
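To keep those two answers greppable instead of buried in Slack, one lightweight option is a small review record committed next to the alert rules. This is a sketch only; the file layout, field names, and the example entry are all illustrative, not a PagerDuty feature:
# page-reviews/2024-09-14-payments-api-high-error-rate.yaml  (example entry)
alert: PaymentsApiHighErrorRate
paged_at: "2024-09-14T03:12:00Z"
actionable: false        # Question 1: was this page actionable?
prevention: >-           # Question 2: how can we prevent this page forever?
  Fired on a known nightly batch job; delete the threshold alert and rely on
  the SLO burn-rate rule instead.
follow_up: delete-alert  # actionable == false, so the rule is removed within 24 h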
Final Numbers (Dec 2025)
| Metric | Jan 2024 | Sep 2024 | Improvement |
|---|---|---|---|
| Alerts per night | 400+ | 8 | 98 % reduction |
| Pages per week | 41 | 0.7 | 98 % reduction |
| MTTR (customer-impacting) | 4 h 12 min | 11 min | 96 % faster |
| False positive rate | 38 % | 0 % | Eliminated |
| Engineer retention (same team) | -3 in 6 mo | +4 in 6 mo | |
The One-Page Playbook You Can Run Next Week
- Delete every alert that never caused a customer outage in the last year
- Replace everything with SLO burn-rate alerts
- Make runbooks mandatory and auto-expire them
- Group + deduplicate aggressively
- Force post-mortem feedback on every page
Full open-source repo with all our final alert rules, runbooks, and the “alert-ruthlessness” SQL dashboard:
https://github.com/meenanukala/sre-alert-massacre-2025
Closing Thought
In 2025, alert fatigue is a choice.
400 alerts/night is not “normal” — it’s organisational debt with compound interest.
We paid ours off.
Your team can too.
— Meena Nukala
Senior DevOps Engineer | UK
GitHub: github.com/meena-nukala-devops
LinkedIn: linkedin.com/in/meena-nukala
(Published 12 December 2025 — clap 50 times if you’ve ever been woken up by a disk-space alert at 3 a.m.)