# From 400 Alerts/Night to 8: The SRE Playbook That Saved My Team’s Sanity

By Meena Nukala

Senior DevOps Engineer | 10+ years | Ex-on-call firefighter turned platform builder

Published: 12 December 2025

In January 2024, my team of 7 was getting ~400 PagerDuty alerts every single night in our UK financial services environment.

By September 2024 we were down to 8 — and none of them woke anyone up unless a real customer was actually affected.

Here’s the exact, battle-tested playbook we used. No magic, no vendor sales pitch — just ruthless elimination of noise.

## The Starting Nightmare (Jan 2024)

- 400+ alerts/night, 38 % of them false positives
- Average MTTR: 4 h 12 min (because we were exhausted)
- On-call burnout: 3 engineers left in 6 months
- Weekly “alert review” meetings that nobody wanted to attend

## The 5-Phase Noise Massacre

### Phase 1 – Alert Ruthlessness Day (Week 1)

We ran this single query across all Prometheus alerts for the previous 30 days:

```sql
-- Alerts that fired more than 100 times in the last 30 days
SELECT alertname, COUNT(*) AS fires
FROM prometheus_alerts
WHERE fired_at > now() - INTERVAL '30 days'
GROUP BY alertname
HAVING COUNT(*) > 100
ORDER BY fires DESC;
```

Result: 312 alerts fired more than 100 times — none had ever led to a customer-impacting incident.

Action: Deleted 287 of them on day one.

Rule: “If it has never caused a customer outage in 12 months, it is not an alert — it is a metric.”
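
A demoted alert does not have to vanish entirely. As an illustrative sketch (the rule and metric names below are hypothetical, not from our repo), a Prometheus recording rule keeps the old signal on a dashboard without ever paging anyone:

```yaml
# recording-rules.yaml (illustrative only; names are hypothetical)
groups:
  - name: demoted-alerts
    rules:
      # The old disk I/O-wait alert becomes a series you can graph,
      # not something that wakes a human at 3 a.m.
      - record: instance:node_disk_io_time_seconds:rate5m
        expr: avg by (instance) (rate(node_disk_io_time_seconds_total[5m]))
```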

### Phase 2 – SLO-Based Alerting Only (Weeks 2–4)

We defined real SLOs for every critical service (99.95 % uptime, 400 ms P95 latency for payments API, etc.) and replaced everything with burn-rate alerts.

Example that replaced 47 old alerts:

```yaml
# alerts.yaml - payments-api
- alert: PaymentsApiHighErrorRate
  # Error ratio (5xx / all requests) over 5 minutes, not the raw 5xx rate,
  # so the threshold holds regardless of traffic volume.
  # The job label value is illustrative; use whatever selects your service.
  expr: |
    sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="payments-api"}[5m])) > 0.005
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Payments API error budget burning fast: 5 % of monthly budget gone in the last hour"
```
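
The rule above is the simplified single-window version. For reference, a multi-window burn-rate rule in the Google SRE Workbook style could look like the sketch below for the same 99.95 %/30-day SLO; the job label and exact thresholds are illustrative assumptions, not rules from our repo:

```yaml
# Hypothetical fast-burn rule: at a 14.4x burn rate, a 30-day error budget
# (0.05 % for a 99.95 % SLO) would be exhausted in roughly two days.
- alert: PaymentsApiFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
        /
      sum(rate(http_requests_total{job="payments-api"}[1h]))
    ) > (14.4 * 0.0005)
    and
    (
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{job="payments-api"}[5m]))
    ) > (14.4 * 0.0005)
  labels:
    severity: page
  annotations:
    summary: "Payments API burning error budget at ~14x the sustainable rate"
```

The long window keeps a brief blip from paging at all; the short window stops the page from lingering once the incident is over.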

### Phase 3 – Runbook Quality Gate (Weeks 5–8)

Every remaining alert now requires a runbook with:

  1. Is the customer impacted right now? (Yes/No)
  2. One-click link to exact dashboard
  3. Exact commands to verify/fix
  4. Escalation path

If the runbook was missing or older than 90 days → alert automatically muted.
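
One possible way to wire that auto-mute (a sketch, not the only way to do it): give every alert rule a runbook label, let a CI or cron job strip the label from rules whose runbook is missing or older than 90 days, and route label-less alerts to a receiver that goes nowhere. Receiver and label names below are hypothetical:

```yaml
# alertmanager.yaml (fragment) - hypothetical auto-mute wiring
route:
  receiver: on-call                # illustrative default receiver name
  routes:
    # Anything without a runbook label never reaches a human.
    - matchers:
        - runbook = ""
      receiver: blackhole
receivers:
  - name: on-call                  # PagerDuty/webhook config omitted here
  - name: blackhole                # no notifier configs: matched alerts are dropped
```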

### Phase 4 – Alert Deduplication & Correlation (Weeks 9–12)

Deployed Cortex + Alertmanager with route grouping:

```yaml
route:
  group_by: ['service', 'environment']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```

One disk-full on a 50-node cluster → one page instead of 50.
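
Grouping handles the fan-out; if you also run cluster-level rollup alerts, an inhibit rule can stop the per-node alerts from notifying at all while the rollup is firing. The alert names here are hypothetical:

```yaml
# Hypothetical inhibit rule: a cluster-level alert silences its per-node children.
inhibit_rules:
  - source_matchers:
      - alertname = ClusterDiskPressure
    target_matchers:
      - alertname = NodeDiskFull
    equal: ['service', 'environment']
```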

### Phase 5 – Feedback Loop That Stuck (Ongoing)

Every single page now ends with a mandatory 2-question post-mortem:

  1. Was this page actionable? (Yes/No)
  2. How can we prevent this page forever?

If “No” → alert deleted within 24 h.

## Final Numbers (Dec 2025)

| Metric | Jan 2024 | Sep 2024 | Improvement |
| --- | --- | --- | --- |
| Alerts per night | 400+ | 8 | 98 % reduction |
| Pages per week | 41 | 0.7 | 98 % reduction |
| MTTR (customer-impacting) | 4 h 12 min | 11 min | 97 % faster |
| False positive rate | 38 % | 0 % | |
| Engineer retention (same team) | -3 in 6 mo | +4 in 6 mo | |

## The One-Page Playbook You Can Run Next Week

  1. Delete every alert that never caused a customer outage in the last year
  2. Replace everything with SLO burn-rate alerts
  3. Make runbooks mandatory and auto-expire them
  4. Group + deduplicate aggressively
  5. Force post-mortem feedback on every page

Full open-source repo with all our final alert rules, runbooks, and the “alert-ruthlessness” SQL dashboard:

https://github.com/meenanukala/sre-alert-massacre-2025

## Closing Thought

In 2025, alert fatigue is a choice.

400 alerts/night is not “normal” — it’s organisational debt with compound interest.

We paid ours off.

Your team can too.

— Meena Nukala

Senior DevOps Engineer | UK
GitHub: github.com/meena-nukala-devops
LinkedIn: linkedin.com/in/meena-nukala

(Published 12 December 2025 — clap 50 times if you’ve ever been woken up by a disk-space alert at 3 a.m.)
