DEV Community

ant
70% of Your Alerts Are Noise — And They're Burning Out Your On-Call Team

An AI alert triage layer for teams drowning in false alarms, duplicate pages, and midnight noise.


Last Tuesday at 2:47 AM, I got paged. High CPU on the payments service. I dragged myself out of bed, opened my laptop, squinted at Grafana for 20 minutes, grepped through CloudWatch logs, checked the deploy history — and found... nothing. A brief spike. Already resolved. No customer impact. No action needed.

I went back to bed. Couldn't fall asleep. Got paged again at 4:15 AM. Same pattern. Another false alarm.

At 9 AM, I joined standup with bloodshot eyes. My manager asked if everything was okay. I said: "Yeah, just alert noise."

He nodded. Like it was normal. Because it is.

This is the life of a lot of modern on-call teams. Especially teams running Prometheus, Grafana, Datadog, ELK, PagerDuty, or Opsgenie — where the monitoring stack works, but the paging experience is still broken.

If you're an SRE, DevOps, platform, or backend lead dealing with noisy alerts, I'm building this now.

Join the waitlist


I Analyzed 90 Days of Alerts

I pulled alert data from three real teams I've worked with — a 12-person SaaS startup, a 30-person fintech, and a 60-person e-commerce platform. All running fairly standard stacks: Prometheus + Grafana, ELK for logs, PagerDuty or Opsgenie for routing.

This wasn't a polished vendor benchmark. It was a practical audit of real alert streams, incident timelines, and postmortem context to answer one question:

How many alerts actually required a human to wake up, investigate, and act?

Here's what I found across 18,749 alerts over 90 days:

The Numbers Don't Lie

| Metric | Result |
| --- | --- |
| Total alerts (90 days) | 18,749 |
| Alerts that required human action | 5,512 (29.4%) |
| Alerts that were non-actionable | 13,237 (70.6%) |
| Off-hours alerts | 3,847 |
| Off-hours alerts that were actionable | 891 (23.2%) |
| Average time to determine "it's nothing" | 14.3 minutes |
| Engineer-hours spent on non-actionable alerts (90 days) | ~3,155 hours |

Let that last number sink in. ~3,155 engineer-hours spent investigating alerts that didn't require action — in just 90 days.
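That last figure is just arithmetic on the two numbers above it: 13,237 non-actionable alerts times an average 14.3 minutes of investigation each.

```python
non_actionable = 13_237      # alerts that required no action
avg_triage_minutes = 14.3    # average time to conclude "it's nothing"

wasted_hours = non_actionable * avg_triage_minutes / 60
print(round(wasted_hours))   # ≈ 3155 engineer-hours in 90 days
```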

For teams with formal on-call rotations, this isn't just inefficiency. It's burnout, slower incident response, and eventually lost trust in the paging system itself.

If that's your world, this product is being designed for you.


Breaking Down the Noise

I categorized every non-actionable alert to understand why it was noise:

Category breakdown of 13,237 noise alerts:

  Auto-resolved before human looked      ████████████████  38.2%
  Duplicate of another alert             ████████████      28.7%
  Threshold too sensitive                ████████          18.4%
  Known issue / expected behavior        ████              9.1%
  Misconfigured alert rule               ██                5.6%

The top two categories — auto-resolved and duplicates — account for 67% of all noise. These are the low-hanging fruit.

That means the first version of a useful solution probably doesn't need magic. It needs to do two things well:

  1. Group related alerts into one incident
  2. Avoid paging humans for alerts that disappear on their own

Do those two things well, and you can remove a huge chunk of pain without replacing the rest of the stack.


The 3 AM Problem

Let's zoom in on off-hours pages, because that's where alert fatigue does the most damage:

Off-hours alert distribution:

  12AM-2AM  ██████████████████ 412 alerts (78 actionable = 19%)
  2AM-4AM   ████████████████   387 alerts (71 actionable = 18%)
  4AM-6AM   ██████████████     341 alerts (82 actionable = 24%)
  6AM-8AM   ████████████████   398 alerts (103 actionable = 26%)
  8PM-10PM  ████████████████   389 alerts (94 actionable = 24%)
  10PM-12AM ██████████████████ 421 alerts (87 actionable = 21%)

  Average off-hours actionability rate: 23.2%

In this sample, roughly 77% of off-hours alerts turned out to be non-actionable.

I talked to the engineers on these teams. Here's what they told me:

"I've started putting my phone on Do Not Disturb. I know that's terrible, but I can't keep waking up for nothing."

"Last month I slept through a real P0 because I'd already been woken up 3 times that week for false alarms. My brain just stopped trusting the alert sound."

"My wife asked me to quit. Not because of the hours — because of the constant anxiety. The phone might ring at any moment."

This isn't just a tooling problem. It's a health problem, a trust problem, and eventually a reliability problem too — because engineers stop believing alerts deserve immediate attention.


What's Actually Causing the Noise?

After digging deeper, I found three root patterns:

Pattern 1: The "Alert on Everything" Culture

After every major incident, teams add more alerts. Nobody ever removes them. Over 2 years, you end up with 400+ alert rules, most of which are:

  • Default thresholds copy-pasted from a blog post
  • Alerts on symptoms instead of impact
  • Alerts for scenarios that were fixed months ago

Pattern 2: The Duplicate Cascade

One root cause triggers a chain reaction:

Database connection pool exhausted (root cause)
  → API latency > 2s (symptom)
    → Error rate > 5% (symptom)
      → Health check failed (symptom)
        → Pod restart (symptom)
          → New pod failing health check (symptom)

Result: 1 incident → 37 alerts → 37 pages → 1 very angry engineer

No one in the alert chain knows that these 37 alerts are the same incident. Prometheus doesn't know. PagerDuty groups some but misses most. The on-call engineer has to mentally correlate them at 3 AM.

Pattern 3: The Transient Spike

CPU hits 92% for 45 seconds during a batch job. Alert fires. By the time you look, it's back to 30%. This happens 3 times a week. Every time, you check. Every time, it's fine. But you can't not check, because the one time you don't...


Why Existing Tools Still Leave a Gap

I've used them all. Here's the honest truth:

| Tool | What it does well | What it doesn't do |
| --- | --- | --- |
| PagerDuty | Alert routing, escalation, scheduling | Doesn't understand why alerts fired or whether they're related |
| Datadog | Beautiful dashboards, APM | Watchdog AI is a black box; can't be used standalone; $$$ |
| Grafana OnCall | Free, open-source on-call | Zero intelligence; just routes alerts |
| incident.io / Rootly | Great incident management | Kicks in after a human declares an incident |

The gap is clear: most tools help route alerts, visualize systems, or manage incidents after they start — but very few decide whether a human should be interrupted in the first place.

That's the gap I'm focused on.


What I'm Building First

Based on the data, here's the MVP I believe teams actually need first:

1. Intelligent Grouping

When 37 alerts fire within a 3-minute window and all relate to the same service dependency chain — that's one incident, not 37. Group them. Show the probable root cause at the top.
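A minimal sketch of that grouping logic: alerts arriving within a fixed window collapse into one incident. The 3-minute window and the `(timestamp, service)` alert shape are illustrative assumptions; a real version would key on the dependency chain, not just the service name.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=3)  # assumed grouping window

def group_alerts(alerts):
    """Collapse alerts for the same service that fire within WINDOW
    of the incident's first alert into a single incident."""
    incidents = []      # list of alert lists, oldest first
    open_incident = {}  # service -> (start_ts, alert_list)
    for ts, service in sorted(alerts):
        current = open_incident.get(service)
        if current and ts - current[0] <= WINDOW:
            current[1].append((ts, service))  # same incident, no new page
        else:
            bucket = (ts, [(ts, service)])    # new incident starts here
            open_incident[service] = bucket
            incidents.append(bucket[1])
    return incidents
```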

2. Smart Waiting

If an alert auto-resolves within 2-5 minutes, it probably wasn't a real incident. Hold the page. If it persists, then escalate. Most monitoring tools fire instantly — but a 2-minute buffer would eliminate the single largest category of noise.
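A sketch of that hold-then-escalate behavior using one timer per alert. The `page_fn` callback and alert IDs are placeholders; the 120-second default mirrors the 2-minute buffer described above.

```python
import threading

class AlertDebouncer:
    """Hold each page for `hold_seconds`; cancel it if the alert
    resolves first, escalate if it is still firing afterwards."""

    def __init__(self, page_fn, hold_seconds=120):
        self.page_fn = page_fn
        self.hold_seconds = hold_seconds
        self.pending = {}  # alert_id -> threading.Timer

    def on_fire(self, alert_id, payload):
        timer = threading.Timer(self.hold_seconds, self._escalate,
                                args=(alert_id, payload))
        self.pending[alert_id] = timer
        timer.start()

    def on_resolve(self, alert_id):
        timer = self.pending.pop(alert_id, None)
        if timer:
            timer.cancel()  # auto-resolved: no human was woken up

    def _escalate(self, alert_id, payload):
        self.pending.pop(alert_id, None)
        self.page_fn(payload)  # still firing after the hold window
```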

3. Context-Rich Notifications

When you do get paged, a more useful alert could look like this:

🔴 payments-service: P99 latency > 2s (persisting 4 min)

📍 Probable cause: Deploy v2.3.1 by @alice (12 min ago)
   - DB query count increased 340% (N+1 query suspected)
   - Connection pool at 95% capacity
   - Similar to incident INC-847 on Jan 15 (resolved by rollback)

🛠️ Suggested action: Rollback to v2.3.0
📊 Confidence: 87%

That's not science fiction. With today's LLMs and the right context layer — metrics, logs, deploy history, topology, and past incidents — this is buildable.

4. Learning from History

Every time an alert fires and gets resolved, the system should learn: what was the root cause? What fixed it? Next time a similar pattern appears, match it instantly instead of making a human re-investigate from scratch.
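One lightweight way to sketch that matching: fingerprint each incident by its alert labels and compare fingerprints with Jaccard similarity. The label names, threshold, and history shape here are all illustrative assumptions.

```python
def fingerprint(labels):
    """Turn an alert's label dict into a comparable set."""
    return {f"{k}={v}" for k, v in labels.items()}

def best_match(new_labels, history, threshold=0.6):
    """Return the most similar past incident, or None if nothing
    clears the similarity threshold."""
    fp = fingerprint(new_labels)
    best, best_score = None, 0.0
    for incident in history:
        past = fingerprint(incident["labels"])
        score = len(fp & past) / len(fp | past)  # Jaccard similarity
        if score > best_score:
            best, best_score = incident, score
    return best if best_score >= threshold else None
```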

But the first release won't try to do everything. The priority is simpler:

  • Reduce duplicate pages
  • Suppress short-lived noise
  • Deliver enough context that on-call can decide faster

If that sounds useful for your team, I'd love to show you what I'm building as it evolves.


I'm Building This

I've spent the last few weeks designing an AI-powered alert triage system that sits between your monitoring stack and your on-call team. It:

  • Ingests alerts from Prometheus, Grafana, Datadog, CloudWatch — via webhook
  • Groups and deduplicates related alerts into incidents
  • Analyzes root cause by pulling correlated metrics, logs, and recent changes
  • Delivers context-rich summaries to Slack/Discord/Teams
  • Learns from your incident history to get smarter over time
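For the ingestion step, here's a minimal sketch of parsing an Alertmanager-style webhook payload. The field names follow Prometheus Alertmanager's documented webhook format; Datadog, Grafana, and CloudWatch sources would each need their own small adapters.

```python
def parse_alertmanager(payload):
    """Extract (alertname, status) pairs from an Alertmanager
    webhook payload; status is 'firing' or 'resolved'."""
    return [
        (alert.get("labels", {}).get("alertname", "unknown"),
         alert.get("status", "firing"))
        for alert in payload.get("alerts", [])
    ]
```

A receiver endpoint would feed these pairs into the grouping and hold-window logic rather than paging directly.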

The goal is simple: cut duplicate and short-lived alert noise, reduce unnecessary off-hours pages, and help responders get to the probable cause faster.

Not by replacing your monitoring stack. Not by being another expensive enterprise platform. But by being the AI layer that makes your existing tools actually work together.

Who this is for

This is most relevant if your team:

  • Has a real on-call rotation
  • Already uses tools like PagerDuty, Opsgenie, Grafana, Datadog, Prometheus, or CloudWatch
  • Gets repeated, low-value alerts outside working hours
  • Wants fewer pages without ripping out existing observability tooling

What I'm looking for right now

I'm looking for early teams who want to:

  • Join the waitlist
  • Give feedback on the product direction
  • Share sample alert workflows or incident pain points
  • Potentially become early design partners

Want Early Access?

Before I go deeper on implementation, I want to work with teams who actually feel this pain.

If you run on-call and any part of this sounds painfully familiar, join the waitlist.

What you'll get by joining:

  • Early product updates
  • Priority access to the first beta
  • A chance to shape the workflow before it's locked in
  • Optional invite for a short research call if you're a strong fit

If you found this useful, consider sharing it with your on-call team. They'll thank you at 3 AM.


Discussion welcome. I know "AI for DevOps" triggers skepticism — and honestly, it should. I'm not claiming AI replaces SRE judgment. I'm claiming it can reduce the repetitive triage work that currently burns out humans and slows down response.

If you're curious, skeptical, or actively looking for a better way to handle alert noise, join the waitlist here:

Join the waitlist

Top comments (1)


Author here 👋 A couple of things I'm genuinely curious about from this community:

  1. What's your team's noise ratio? My data landed at ~70% non-actionable. Is that higher or lower than what you're seeing?
  2. Has anyone tried the "hold for 2-5 minutes before paging" approach? That was the single biggest lever in my analysis, but I'd love to hear if it's caused issues in practice (e.g., delayed response to a real P0).

Also — if there's something about the data or methodology you'd push back on, I'd rather hear it now than after I've built the wrong thing. Constructive skepticism welcome.