<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ant</title>
    <description>The latest articles on DEV Community by ant (@ant_ed4c32e7bb29a18).</description>
    <link>https://dev.to/ant_ed4c32e7bb29a18</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837341%2F1c35286d-a983-4a84-935a-2b4aa40bb8d9.png</url>
      <title>DEV Community: ant</title>
      <link>https://dev.to/ant_ed4c32e7bb29a18</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ant_ed4c32e7bb29a18"/>
    <language>en</language>
    <item>
      <title>70% of Your Alerts Are Noise — And They're Burning Out Your On-Call Team</title>
      <dc:creator>ant</dc:creator>
      <pubDate>Sat, 21 Mar 2026 16:42:20 +0000</pubDate>
      <link>https://dev.to/ant_ed4c32e7bb29a18/70-of-your-alerts-are-noise-and-theyre-burning-out-your-on-call-team-570m</link>
      <guid>https://dev.to/ant_ed4c32e7bb29a18/70-of-your-alerts-are-noise-and-theyre-burning-out-your-on-call-team-570m</guid>
      <description>&lt;p&gt;&lt;em&gt;An AI alert triage layer for teams drowning in false alarms, duplicate pages, and midnight noise.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Last Tuesday at 2:47 AM, I got paged. High CPU on the payments service. I dragged myself out of bed, opened my laptop, squinted at Grafana for 20 minutes, grepped through CloudWatch logs, checked the deploy history — and found... nothing. A brief spike. Already resolved. No customer impact. No action needed.&lt;/p&gt;

&lt;p&gt;I went back to bed. Couldn't fall asleep. Got paged again at 4:15 AM. Same pattern. Another false alarm.&lt;/p&gt;

&lt;p&gt;At 9 AM, I joined standup with bloodshot eyes. My manager asked if everything was okay. I said: &lt;em&gt;"Yeah, just alert noise."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;He nodded. Like it was normal. Because it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the life of a lot of modern on-call teams.&lt;/strong&gt; Especially teams running Prometheus, Grafana, Datadog, ELK, PagerDuty, or Opsgenie — where the monitoring stack works, but the paging experience is still broken.&lt;/p&gt;

&lt;p&gt;If you're an SRE, DevOps engineer, platform engineer, or backend lead dealing with noisy alerts, &lt;strong&gt;I'm building this now&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://tally.so/r/Gx046k" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;I Analyzed 90 Days of Alerts&lt;/h2&gt;

&lt;p&gt;I pulled alert data from three real teams I've worked with — a 12-person SaaS startup, a 30-person fintech, and a 60-person e-commerce platform. All running fairly standard stacks: Prometheus + Grafana, ELK for logs, PagerDuty or Opsgenie for routing.&lt;/p&gt;

&lt;p&gt;This wasn't a polished vendor benchmark. It was a practical audit of real alert streams, incident timelines, and postmortem context to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How many alerts actually required a human to wake up, investigate, and act?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's what I found across &lt;strong&gt;18,749 alerts&lt;/strong&gt; over 90 days:&lt;/p&gt;

&lt;h3&gt;The Numbers Don't Lie&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total alerts (90 days)&lt;/td&gt;
&lt;td&gt;18,749&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts that required human action&lt;/td&gt;
&lt;td&gt;5,512 (29.4%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts that were non-actionable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13,237 (70.6%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-hours alerts&lt;/td&gt;
&lt;td&gt;3,847&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-hours alerts that were actionable&lt;/td&gt;
&lt;td&gt;891 (23.2%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average time to determine "it's nothing"&lt;/td&gt;
&lt;td&gt;14.3 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total engineer-hours spent on non-actionable alerts (90 days)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3,155 hours&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let that last number sink in. 13,237 non-actionable alerts × ~14.3 minutes each works out to &lt;strong&gt;~3,155 engineer-hours&lt;/strong&gt; spent investigating alerts that didn't require action — in just 90 days.&lt;/p&gt;

&lt;p&gt;For teams with formal on-call rotations, this isn't just inefficiency. It's burnout, slower incident response, and eventually lost trust in the paging system itself.&lt;/p&gt;

&lt;p&gt;If that's your world, &lt;strong&gt;this product is being designed for you.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;Breaking Down the Noise&lt;/h2&gt;

&lt;p&gt;I categorized every non-actionable alert to understand &lt;em&gt;why&lt;/em&gt; it was noise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Category breakdown of 13,237 noise alerts:

  Auto-resolved before human looked      ████████████████  38.2%
  Duplicate of another alert             ████████████      28.7%
  Threshold too sensitive                ████████          18.4%
  Known issue / expected behavior        ████              9.1%
  Misconfigured alert rule               ██                5.6%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The top two categories — auto-resolved and duplicates — account for 67% of all noise.&lt;/strong&gt; These are the low-hanging fruit.&lt;/p&gt;

&lt;p&gt;That means the first version of a useful solution probably doesn't need magic. It needs to do two things well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Group related alerts into one incident&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid paging humans for alerts that disappear on their own&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do those two things well, and you can remove a huge chunk of pain without replacing the rest of the stack.&lt;/p&gt;




&lt;h2&gt;The 3 AM Problem&lt;/h2&gt;

&lt;p&gt;Let's zoom in on off-hours pages, because that's where alert fatigue does the most damage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Off-hours alert distribution:

  12AM-2AM  ██████████████████ 412 alerts (78 actionable = 19%)
  2AM-4AM   ████████████████   387 alerts (71 actionable = 18%)
  4AM-6AM   ██████████████     341 alerts (82 actionable = 24%)
  6AM-8AM   ████████████████   398 alerts (103 actionable = 26%)
  8PM-10PM  ████████████████   389 alerts (94 actionable = 24%)
  10PM-12AM ██████████████████ 421 alerts (87 actionable = 21%)

  Average off-hours actionability rate: 23.2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In this sample, roughly 77% of off-hours alerts turned out to be non-actionable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I talked to the engineers on these teams. Here's what they told me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I've started putting my phone on Do Not Disturb. I know that's terrible, but I can't keep waking up for nothing."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Last month I slept through a real P0 because I'd already been woken up 3 times that week for false alarms. My brain just stopped trusting the alert sound."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"My wife asked me to quit. Not because of the hours — because of the constant anxiety. The phone might ring at any moment."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't just a tooling problem. It's a &lt;strong&gt;health problem&lt;/strong&gt;, a &lt;strong&gt;trust problem&lt;/strong&gt;, and eventually a &lt;strong&gt;reliability problem&lt;/strong&gt; too — because engineers stop believing alerts deserve immediate attention.&lt;/p&gt;




&lt;h2&gt;What's Actually Causing the Noise?&lt;/h2&gt;

&lt;p&gt;After digging deeper, I found three root patterns:&lt;/p&gt;

&lt;h3&gt;Pattern 1: The "Alert on Everything" Culture&lt;/h3&gt;

&lt;p&gt;After every major incident, teams add more alerts. Nobody ever removes them. Over 2 years, you end up with 400+ alert rules, most of which are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default thresholds copy-pasted from a blog post&lt;/li&gt;
&lt;li&gt;Alerts on &lt;em&gt;symptoms&lt;/em&gt; instead of &lt;em&gt;impact&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Alerts for scenarios that were fixed months ago&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Pattern 2: The Duplicate Cascade&lt;/h3&gt;

&lt;p&gt;One root cause triggers a chain reaction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database connection pool exhausted (root cause)
  → API latency &amp;gt; 2s (symptom)
    → Error rate &amp;gt; 5% (symptom)
      → Health check failed (symptom)
        → Pod restart (symptom)
          → New pod failing health check (symptom)

Result: 1 incident → 37 alerts → 37 pages → 1 very angry engineer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No one in the alert chain knows that these 37 alerts are the same incident. Prometheus doesn't know. PagerDuty groups some but misses most. The on-call engineer has to mentally correlate them at 3 AM.&lt;/p&gt;

&lt;h3&gt;Pattern 3: The Transient Spike&lt;/h3&gt;

&lt;p&gt;CPU hits 92% for 45 seconds during a batch job. Alert fires. By the time you look, it's back to 30%. This happens 3 times a week. Every time, you check. Every time, it's fine. But you can't &lt;em&gt;not&lt;/em&gt; check, because the one time you don't...&lt;/p&gt;




&lt;h2&gt;Why Existing Tools Still Leave a Gap&lt;/h2&gt;

&lt;p&gt;I've used them all. Here's the honest truth:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does well&lt;/th&gt;
&lt;th&gt;What it doesn't do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert routing, escalation, scheduling&lt;/td&gt;
&lt;td&gt;Doesn't understand &lt;em&gt;why&lt;/em&gt; alerts fired or if they're related&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Beautiful dashboards, APM&lt;/td&gt;
&lt;td&gt;Watchdog AI is a black box; can't use it standalone; $$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana OnCall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free, open-source on-call&lt;/td&gt;
&lt;td&gt;Zero intelligence — just routes alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incident.io / Rootly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Great incident &lt;em&gt;management&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Kicks in after a human declares an incident&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap is clear: &lt;strong&gt;most tools help route alerts, visualize systems, or manage incidents after they start — but very few decide whether a human should be interrupted in the first place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the gap I'm focused on.&lt;/p&gt;




&lt;h2&gt;What I'm Building First&lt;/h2&gt;

&lt;p&gt;Based on the data, here's the MVP I believe teams actually need first:&lt;/p&gt;

&lt;h3&gt;1. Intelligent Grouping&lt;/h3&gt;

&lt;p&gt;When 37 alerts fire within a 3-minute window and all relate to the same service dependency chain — that's &lt;strong&gt;one incident&lt;/strong&gt;, not 37. Group them. Show the probable root cause at the top.&lt;/p&gt;
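&lt;p&gt;As a rough sketch (the service key and the 3-minute window are illustrative, and a real version would also consult a service dependency graph rather than a single label), the grouping core can be small:&lt;/p&gt;

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 180  # illustrative 3-minute grouping window


@dataclass
class Incident:
    service: str
    first_seen: float
    alerts: list = field(default_factory=list)


def group_alerts(alerts):
    """Group alerts that share a service and arrive within the window.

    `alerts` is a list of dicts with 'service' and 'ts' (unix seconds),
    assumed sorted by timestamp.
    """
    incidents = []
    open_incidents = {}  # service -> most recent open Incident
    for alert in alerts:
        inc = open_incidents.get(alert["service"])
        if inc and alert["ts"] - inc.first_seen <= WINDOW_SECONDS:
            inc.alerts.append(alert)  # same burst: fold into one incident
        else:
            inc = Incident(alert["service"], alert["ts"], [alert])
            incidents.append(inc)
            open_incidents[alert["service"]] = inc
    return incidents
```

With this logic, a 37-alert cascade inside the window becomes one incident and one page instead of 37.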

&lt;h3&gt;2. Smart Waiting&lt;/h3&gt;

&lt;p&gt;If an alert auto-resolves within 2-5 minutes, it probably wasn't a real incident. Hold the page. If it persists, &lt;em&gt;then&lt;/em&gt; escalate. Most monitoring tools fire instantly — but a 2-minute buffer would eliminate the single largest category of noise.&lt;/p&gt;
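&lt;p&gt;A minimal sketch of that hold-then-escalate decision, with an illustrative 2-minute buffer:&lt;/p&gt;

```python
HOLD_SECONDS = 120  # illustrative 2-minute buffer before paging


def alerts_to_page(events, now):
    """Decide which firing alerts actually deserve a page.

    `events` maps alert id -> (fired_at, resolved_at or None), in unix
    seconds. An alert pages only if it is still a problem after the
    hold window; auto-resolving blips never reach a human.
    """
    pages = []
    for alert_id, (fired_at, resolved_at) in events.items():
        if resolved_at is not None and resolved_at - fired_at <= HOLD_SECONDS:
            continue  # resolved inside the buffer: suppress the page
        if now - fired_at >= HOLD_SECONDS:
            pages.append(alert_id)  # persisted past the buffer: escalate
    return pages
```

A 45-second CPU blip that resolves itself never pages; an alert still firing after two minutes escalates as usual.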

&lt;h3&gt;3. Context-Rich Notifications&lt;/h3&gt;

&lt;p&gt;When you &lt;em&gt;do&lt;/em&gt; get paged, a more useful alert could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 payments-service: P99 latency &amp;gt; 2s (persisting 4 min)

📍 Probable cause: Deploy v2.3.1 by @alice (12 min ago)
   - DB query count increased 340% (N+1 query suspected)
   - Connection pool at 95% capacity
   - Similar to incident INC-847 on Jan 15 (resolved by rollback)

🛠️ Suggested action: Rollback to v2.3.0
📊 Confidence: 87%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not science fiction. With today's LLMs and the right context layer — metrics, logs, deploy history, topology, and past incidents — this is buildable.&lt;/p&gt;
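&lt;p&gt;To be concrete about what "context layer" means, here's a sketch of the assembly step. Every field name is hypothetical; a real system would query Prometheus, the log store, the deploy pipeline, and an incident database to fill these in before handing the bundle to an LLM (or a human):&lt;/p&gt;

```python
DEPLOY_LOOKBACK_SECONDS = 30 * 60  # illustrative: deploys in the last 30 min


def build_triage_context(incident, deploys, past_incidents):
    """Assemble the context needed to triage one incident.

    `incident`, `deploys`, and `past_incidents` are plain dicts with
    hypothetical fields; the structure, not the names, is the point.
    """
    recent_deploys = [
        d for d in deploys
        if d["service"] == incident["service"]
        and incident["started_at"] - d["at"] <= DEPLOY_LOOKBACK_SECONDS
    ]
    similar = [
        p for p in past_incidents
        if p["service"] == incident["service"]
        and p["symptom"] == incident["symptom"]
    ]
    lines = [f"Incident: {incident['service']}: {incident['symptom']}"]
    for d in recent_deploys:
        lines.append(f"Recent deploy: {d['version']} by {d['author']}")
    for p in similar:
        lines.append(f"Similar past incident: {p['id']} (fix: {p['fix']})")
    return "\n".join(lines)
```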

&lt;h3&gt;4. Learning from History&lt;/h3&gt;

&lt;p&gt;Every time an alert fires and gets resolved, the system should learn: what was the root cause? What fixed it? Next time a similar pattern appears, match it instantly instead of making a human re-investigate from scratch.&lt;/p&gt;
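&lt;p&gt;One simple way to start, before any machine learning, is fingerprint matching on alert labels. The Jaccard similarity and the 0.6 threshold below are illustrative choices, not the final design:&lt;/p&gt;

```python
def similarity(labels_a, labels_b):
    """Jaccard similarity between two alerts' label sets: a cheap,
    explainable way to compare a new alert against incident history."""
    a, b = set(labels_a), set(labels_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_history(alert_labels, history, threshold=0.6):
    """Return past incidents whose label fingerprint resembles the alert.

    `history` is a list of (incident_id, labels) pairs; results are
    sorted best-match first.
    """
    scored = [(inc_id, similarity(alert_labels, labels))
              for inc_id, labels in history]
    return sorted(((i, s) for i, s in scored if s >= threshold),
                  key=lambda pair: -pair[1])
```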

&lt;p&gt;But the first release won't try to do everything. The priority is simpler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce duplicate pages&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Suppress short-lived noise&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deliver enough context that on-call can decide faster&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that sounds useful for your team, I'd love to show you what I'm building as it evolves.&lt;/p&gt;




&lt;h2&gt;I'm Building This&lt;/h2&gt;

&lt;p&gt;I've spent the last few weeks designing an AI-powered alert triage system that sits between your monitoring stack and your on-call team. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingests alerts&lt;/strong&gt; from Prometheus, Grafana, Datadog, CloudWatch — via webhook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groups and deduplicates&lt;/strong&gt; related alerts into incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzes root cause&lt;/strong&gt; by pulling correlated metrics, logs, and recent changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivers context-rich summaries&lt;/strong&gt; to Slack/Discord/Teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learns from your incident history&lt;/strong&gt; to get smarter over time&lt;/li&gt;
&lt;/ul&gt;
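&lt;p&gt;The ingestion side can stay thin: one small adapter per source, all emitting a common schema. Here's a sketch for the standard Prometheus Alertmanager webhook payload (the common schema shown is my assumption, not a spec; Datadog and CloudWatch would get their own adapters):&lt;/p&gt;

```python
def normalize_alertmanager(payload):
    """Normalize an Alertmanager webhook payload into a common schema.

    Reads the standard Alertmanager webhook fields ('alerts', 'status',
    'labels', 'startsAt'). The output keys are an illustrative common
    schema for the triage layer, not an established standard.
    """
    normalized = []
    for a in payload.get("alerts", []):
        labels = a.get("labels", {})
        normalized.append({
            "source": "alertmanager",
            "status": a.get("status"),  # "firing" or "resolved"
            "service": labels.get("service") or labels.get("job", "unknown"),
            "name": labels.get("alertname", "unknown"),
            "started_at": a.get("startsAt"),
            "labels": labels,
        })
    return normalized
```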

&lt;p&gt;The goal is simple: &lt;strong&gt;cut duplicate and short-lived alert noise, reduce unnecessary off-hours pages, and help responders get to the probable cause faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not by replacing your monitoring stack. Not by being another expensive enterprise platform. But by being the &lt;strong&gt;AI layer&lt;/strong&gt; that makes your existing tools actually work together.&lt;/p&gt;

&lt;h3&gt;Who this is for&lt;/h3&gt;

&lt;p&gt;This is most relevant if your team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has a real on-call rotation&lt;/li&gt;
&lt;li&gt;Already uses tools like PagerDuty, Opsgenie, Grafana, Datadog, Prometheus, or CloudWatch&lt;/li&gt;
&lt;li&gt;Gets repeated, low-value alerts outside working hours&lt;/li&gt;
&lt;li&gt;Wants fewer pages &lt;strong&gt;without&lt;/strong&gt; ripping out existing observability tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;What I'm looking for right now&lt;/h3&gt;

&lt;p&gt;I'm looking for early teams who want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join the waitlist&lt;/li&gt;
&lt;li&gt;Give feedback on the product direction&lt;/li&gt;
&lt;li&gt;Share sample alert workflows or incident pain points&lt;/li&gt;
&lt;li&gt;Potentially become early design partners&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Want Early Access?&lt;/h2&gt;

&lt;p&gt;Before I go deeper on implementation, I want to work with teams who actually feel this pain.&lt;/p&gt;

&lt;p&gt;If you run on-call and any part of this sounds painfully familiar, join the waitlist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll get by joining:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early product updates&lt;/li&gt;
&lt;li&gt;Priority access to the first beta&lt;/li&gt;
&lt;li&gt;A chance to shape the workflow before it's locked in&lt;/li&gt;
&lt;li&gt;Optional invite for a short research call if you're a strong fit&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, consider sharing it with your on-call team. They'll thank you at 3 AM.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Discussion welcome.&lt;/strong&gt; I know "AI for DevOps" triggers skepticism — and honestly, it should. I'm not claiming AI replaces SRE judgment. I'm claiming it can reduce the repetitive triage work that currently burns out humans and slows down response.&lt;/p&gt;

&lt;p&gt;If you're curious, skeptical, or actively looking for a better way to handle alert noise, &lt;strong&gt;join the waitlist here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://tally.so/r/Gx046k" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
