# From 400 Alerts/Night to 8: The SRE Playbook That Saved My Team’s Sanity

By Meena Nukala

Senior DevOps Engineer | 10+ years | Ex-on-call firefighter turned platform builder

Published: 12 December 2025

In January 2024, my team of 7 was getting ~400 PagerDuty alerts every single night in our UK financial services environment.

By September 2024 we were down to 8 — and none of them woke anyone up unless a real customer was actually affected.

Here’s the exact, battle-tested playbook we used. No magic, no vendor sales pitch — just ruthless elimination of noise.

## The Starting Nightmare (Jan 2024)

- 400+ alerts/night, 38 % of them false positives
- Average MTTR: 4 h 12 min (because we were exhausted)
- On-call burnout: 3 engineers left in 6 months
- Weekly “alert review” meetings that nobody wanted to attend

## The 5-Phase Noise Massacre

### Phase 1 – Alert Ruthlessness Day (Week 1)

We ran this single query across all Prometheus alerts for the previous 30 days:

```sql
-- Alerts that fired more than 100 times in the last 30 days
SELECT alertname, COUNT(*) AS fires
FROM prometheus_alerts
WHERE fired_at > now() - INTERVAL '30 days'
GROUP BY alertname
HAVING COUNT(*) > 100
ORDER BY fires DESC;
```

Result: 312 alerts fired more than 100 times — none had ever led to a customer-impacting incident.

Action: Deleted 287 of them on day one.

Rule: “If it has never caused a customer outage in 12 months, it is not an alert — it is a metric.”
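
A demoted alert does not have to vanish entirely. As an illustrative sketch (the rule and metric names below are hypothetical, not from our repo), a Prometheus recording rule keeps the old signal on a dashboard without ever paging anyone:

```yaml
# recording-rules.yaml (illustrative only; names are hypothetical)
groups:
  - name: demoted-alerts
    rules:
      # The old disk I/O-wait alert becomes a series you can graph,
      # not something that wakes a human at 3 a.m.
      - record: instance:node_disk_io_time_seconds:rate5m
        expr: avg by (instance) (rate(node_disk_io_time_seconds_total[5m]))
```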

### Phase 2 – SLO-Based Alerting Only (Weeks 2–4)

We defined real SLOs for every critical service (99.95 % uptime, 400 ms P95 latency for payments API, etc.) and replaced everything with burn-rate alerts.

Example that replaced 47 old alerts:

```yaml
# alerts.yaml - payments-api
- alert: PaymentsApiHighErrorRate
  # Error ratio (5xx / all requests) over 5 minutes, not the raw 5xx rate,
  # so the threshold holds regardless of traffic volume.
  # The job label value is illustrative; use whatever selects your service.
  expr: |
    sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="payments-api"}[5m])) > 0.005
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Payments API error budget burning fast: 5 % of monthly budget gone in the last hour"
```
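
The rule above is the simplified single-window version. For reference, a multi-window burn-rate rule in the Google SRE Workbook style could look like the sketch below for the same 99.95 %/30-day SLO; the job label and exact thresholds are illustrative assumptions, not rules from our repo:

```yaml
# Hypothetical fast-burn rule: at a 14.4x burn rate, a 30-day error budget
# (0.05 % for a 99.95 % SLO) would be exhausted in roughly two days.
- alert: PaymentsApiFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
        /
      sum(rate(http_requests_total{job="payments-api"}[1h]))
    ) > (14.4 * 0.0005)
    and
    (
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{job="payments-api"}[5m]))
    ) > (14.4 * 0.0005)
  labels:
    severity: page
  annotations:
    summary: "Payments API burning error budget at ~14x the sustainable rate"
```

The long window keeps a brief blip from paging at all; the short window stops the page from lingering once the incident is over.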

### Phase 3 – Runbook Quality Gate (Weeks 5–8)

Every remaining alert now requires a runbook with:

  1. Is the customer impacted right now? (Yes/No)
  2. One-click link to exact dashboard
  3. Exact commands to verify/fix
  4. Escalation path

If the runbook was missing or older than 90 days → alert automatically muted.
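
One possible way to wire that auto-mute (a sketch, not the only way to do it): give every alert rule a runbook label, let a CI or cron job strip the label from rules whose runbook is missing or older than 90 days, and route label-less alerts to a receiver that goes nowhere. Receiver and label names below are hypothetical:

```yaml
# alertmanager.yaml (fragment) - hypothetical auto-mute wiring
route:
  receiver: on-call                # illustrative default receiver name
  routes:
    # Anything without a runbook label never reaches a human.
    - matchers:
        - runbook = ""
      receiver: blackhole
receivers:
  - name: on-call                  # PagerDuty/webhook config omitted here
  - name: blackhole                # no notifier configs: matched alerts are dropped
```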

### Phase 4 – Alert Deduplication & Correlation (Weeks 9–12)

Deployed Cortex + Alertmanager with route grouping:

```yaml
route:
  group_by: ['service', 'environment']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```

One disk-full on a 50-node cluster → one page instead of 50.
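
Grouping handles the fan-out; if you also run cluster-level rollup alerts, an inhibit rule can stop the per-node alerts from notifying at all while the rollup is firing. The alert names here are hypothetical:

```yaml
# Hypothetical inhibit rule: a cluster-level alert silences its per-node children.
inhibit_rules:
  - source_matchers:
      - alertname = ClusterDiskPressure
    target_matchers:
      - alertname = NodeDiskFull
    equal: ['service', 'environment']
```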

### Phase 5 – Feedback Loop That Stuck (Ongoing)

Every single page now ends with a mandatory 2-question post-mortem:

  1. Was this page actionable? (Yes/No)
  2. How can we prevent this page forever?

If “No” → alert deleted within 24 h.

## Final Numbers (Dec 2025)

| Metric | Jan 2024 | Sep 2024 | Improvement |
| --- | --- | --- | --- |
| Alerts per night | 400+ | 8 | 98 % reduction |
| Pages per week | 41 | 0.7 | 98 % reduction |
| MTTR (customer-impacting) | 4 h 12 min | 11 min | 97 % faster |
| False positive rate | 38 % | 0 % | |
| Engineer retention (same team) | -3 in 6 mo | +4 in 6 mo | |

## The One-Page Playbook You Can Run Next Week

  1. Delete every alert that never caused a customer outage in the last year
  2. Replace everything with SLO burn-rate alerts
  3. Make runbooks mandatory and auto-expire them
  4. Group + deduplicate aggressively
  5. Force post-mortem feedback on every page

Full open-source repo with all our final alert rules, runbooks, and the “alert-ruthlessness” SQL dashboard:

https://github.com/meenanukala/sre-alert-massacre-2025

## Closing Thought

In 2025, alert fatigue is a choice.

400 alerts/night is not “normal” — it’s organisational debt with compound interest.

We paid ours off.

Your team can too.

— Meena Nukala

Senior DevOps Engineer | UK
GitHub: github.com/meena-nukala-devops
LinkedIn: linkedin.com/in/meena-nukala

(Published 12 December 2025 — clap 50 times if you’ve ever been woken up by a disk-space alert at 3 a.m.)
