For two years, one alert dominated our on-call pages. It accounted for roughly 40% of all pages. Nobody had fixed it because 'it was important.' Here's how we finally killed it.
The alert
'Kafka consumer lag > 100k messages on topic X.'
Sounds reasonable. In practice it fired 3-4 times a day, always between 8 and 9 AM, and almost never corresponded to a real problem. The 'fix' was to wait 10 minutes and let the consumer catch up.
Why nobody killed it
Everyone was afraid to. 'What if the one time we ignore it, it's real?' Classic alert fatigue excuse.
What we did
We spent two days doing nothing but watching this alert. We checked every fire over 6 months of history.
Findings:
- 98% of fires were auto-resolved within 15 minutes
- The 2% that weren't were correlated with a specific upstream service's cold start
- The 'lag > 100k' threshold was arbitrary, set 18 months ago when traffic was 1/10 of current
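The first finding is simple arithmetic over alert history. Here is a minimal sketch of that calculation, assuming each fire has been exported as a (started, resolved) pair of timestamps. The function name `auto_resolve_rate` and the sample data are hypothetical, made up for illustration; they are not our real history.

```python
from datetime import datetime, timedelta

def auto_resolve_rate(fires, window=timedelta(minutes=15)):
    """Return the fraction of fires that resolved within `window`."""
    if not fires:
        return 0.0
    quick = sum(1 for started, resolved in fires if resolved - started <= window)
    return quick / len(fires)

# Hypothetical sample: four fires, one of which took 47 minutes to clear.
t0 = datetime(2024, 1, 8, 8, 5)
sample_fires = [
    (t0, t0 + timedelta(minutes=9)),
    (t0 + timedelta(days=1), t0 + timedelta(days=1, minutes=12)),
    (t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=47)),
    (t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=6)),
]
rate = auto_resolve_rate(sample_fires)
print(f"{rate:.0%} of fires auto-resolved within 15 minutes")
```

The fires that fall outside the window are the ones worth reading one by one. That is how a pattern like the upstream cold start shows up.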
The fix
- Raised the threshold to 500k (well above the routine morning spike, still low enough to catch real problems)
- Added a 20-minute duration requirement (the alert only fires if lag stays elevated for 20 minutes)
- Added a second alert for 'upstream service Y cold start detected' so we catch the real underlying cause directly
Result: alert volume on this one metric dropped from 1,200/month to 3/month. The 3 that remain are always real problems.
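The threshold and duration changes combine into one rule: page only when lag stays above 500k for 20 minutes straight. Most alerting systems let you express this as configuration. The sketch below is not our actual rule, just the logic written out in plain Python. The class name `LagAlert` and its fields are hypothetical, and it assumes lag is sampled periodically with a timestamp in seconds.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LagAlert:
    threshold: int = 500_000      # messages of consumer lag
    hold_seconds: int = 20 * 60   # lag must stay elevated this long
    _breach_started: Optional[float] = field(default=None, init=False)

    def observe(self, timestamp: float, lag: int) -> bool:
        """Record one lag sample; return True if the alert should fire."""
        if lag <= self.threshold:
            # Lag recovered, so any breach in progress is forgotten.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.hold_seconds

alert = LagAlert()

# A 10-minute spike, sampled once a minute: never fires.
spike = [alert.observe(t * 60, 600_000) for t in range(10)]
alert.observe(10 * 60, 50_000)  # the consumer catches up

# A sustained breach: fires once lag has been elevated for 20 minutes.
sustained = [alert.observe((11 + t) * 60, 600_000) for t in range(25)]
print(any(spike), sustained.index(True))
```

Resetting the timer on any sample at or below the threshold is the design choice that matters here: a short burst that drains on its own never accumulates toward a page.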
The lesson
Every noisy alert is a threshold problem, a duration problem, or a cause problem. Not 'we need to care more.'
Fix the system, not the attitude.
The bonus lesson
After killing this alert, three engineers independently told me they sleep better now. That alert had been waking one of them at 5 AM twice a week for a year.
Reliability work is human work. Never forget the human on the other end of the page.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com