How We Killed Our Worst Alert (And What We Learned)

#sre #devops #alerts #postmortem

For two years, one alert dominated our on-call pages. It accounted for roughly 40% of all pages. Nobody had fixed it because 'it was important.' Here's how we finally killed it.

The alert

'Kafka consumer lag > 100k messages on topic X.'

Sounds reasonable. In practice it fired 3-4 times a day, always between 8 and 9 AM, and almost never corresponded to a real problem. The 'fix' was to wait 10 minutes and let the consumer catch up.

Why nobody killed it

Everyone was afraid to. 'What if the one time we ignore it, it's real?' Classic alert fatigue excuse.

What we did

We spent two days doing nothing but studying this alert. We reviewed every fire in 6 months of history.

Findings:

98% of fires were auto-resolved within 15 minutes
The 2% that weren't were correlated with a specific upstream service's cold start
The 'lag > 100k' threshold was arbitrary, set 18 months ago when traffic was 1/10 of current
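An audit like this can be scripted once the alert history is exported. The following is a minimal sketch, not the tooling we actually used: the `AlertFire` record and its `human_intervened` field are hypothetical, and it assumes each fire carries a fired and resolved timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertFire:
    """One historical fire of the alert (hypothetical export format)."""
    fired_at: datetime
    resolved_at: datetime
    human_intervened: bool  # assumed field: did someone have to act?


def is_quick_auto_resolve(fire, window):
    """True if the fire cleared on its own within `window`."""
    return (not fire.human_intervened
            and fire.resolved_at - fire.fired_at <= window)


def auto_resolve_rate(fires, window=timedelta(minutes=15)):
    """Fraction of fires that cleared on their own within `window`."""
    if not fires:
        return 0.0
    quick = sum(1 for f in fires if is_quick_auto_resolve(f, window))
    return quick / len(fires)


def fires_worth_investigating(fires, window=timedelta(minutes=15)):
    """The remainder: fires that did not clear quickly by themselves."""
    return [f for f in fires if not is_quick_auto_resolve(f, window)]
```

The second function matters as much as the first: the small remainder is where to look for a shared cause, such as the upstream cold start.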

The fix

Raised the threshold to 500k (safely above the routine morning peak, still low enough to catch real problems)
Added a 20-minute duration requirement (the alert only fires if lag stays elevated for 20 minutes)
Added a second alert for 'upstream service Y cold start detected' so we catch the real underlying cause directly
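The threshold and duration changes combine into one rule: page only when lag has stayed above the threshold continuously for the whole hold period. Here is a minimal sketch of that logic, assuming lag samples arrive as time-ordered (timestamp, lag) pairs. The function and constant names are illustrative, not taken from our alerting config. Many alerting systems offer this as a built-in setting, for example the `for` clause in Prometheus alerting rules.

```python
from datetime import datetime, timedelta

LAG_THRESHOLD = 500_000           # messages, the raised threshold
HOLD_FOR = timedelta(minutes=20)  # lag must stay elevated this long


def should_page(samples, threshold=LAG_THRESHOLD, hold_for=HOLD_FOR):
    """Decide whether to page, given time-ordered (timestamp, lag) samples.

    Returns True only if every sample from some point up to the latest
    one is above `threshold`, and that unbroken stretch spans at least
    `hold_for`.
    """
    breach_start = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts  # a new breach begins here
        else:
            breach_start = None    # lag recovered, reset the clock
    if breach_start is None:
        return False
    latest_ts = samples[-1][0]
    return latest_ts - breach_start >= hold_for
```

A ten-minute morning spike resets the clock as soon as lag recovers, so it never pages. Sustained lag, the kind that came with the upstream cold start, still does.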

Result: alert volume on this one metric dropped from 1,200/month to 3/month. The 3 that remain are always real problems.

The lesson

Every noisy alert is a threshold problem, a duration problem, or a cause problem. Not 'we need to care more.'

Fix the system, not the attitude.

The bonus lesson

After killing this alert, three engineers independently told me they sleep better now. That alert had been waking one of them at 5 AM twice a week for a year.

Reliability work is human work. Never forget the human on the other end of the page.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com