LinChuang

Posted on Mar 9

Alert Fatigue Is Real — Here's What It's Actually Costing Your Team

#devops #monitoring #opensource #sre

VigilOps Team | February 2026

The Alert That Cried Wolf

You know the pattern. Your team sets up monitoring, writes alert rules, and connects them to Slack or PagerDuty. For the first week, every notification gets attention. By month three, the alert channel is muted. By month six, someone creates a "real-alerts" channel because the original one is useless.

This isn't a configuration problem. It's a structural problem with how monitoring systems work.

Most monitoring tools are designed to detect threshold violations and send notifications. They're very good at this. Too good, in fact — because the bar for "something worth alerting about" and "something that requires human intervention" are wildly different, and most systems make no distinction between the two.

The result is alert fatigue: the gradual erosion of trust in your monitoring system, leading to slower response times and, eventually, missed real incidents.

What the Data Says

Let's be careful with numbers here. The monitoring industry loves throwing around statistics like "teams receive 500+ alerts per day" or "80% of alerts are noise." These figures get repeated so often they've become urban legend.

Here's what we can say with more confidence:

PagerDuty's State of Digital Operations reports (published annually) consistently show that high-performing teams have fewer, more actionable alerts — not more alerts with better tools. Their data suggests that teams with lower alert volumes per on-call engineer have better MTTR (Mean Time to Resolution).

Gartner retired the term "AIOps" in 2024-2025, rebranding it as "Event Intelligence," partly because AIOps products over-promised and under-delivered on noise reduction. Their assessment: most so-called AI-based alert correlation is actually rule-based statistical analysis.

ServiceNow's 2025 report found that less than 1% of enterprises have achieved truly autonomous remediation. That means 99%+ of organizations are still relying on humans to respond to every alert that comes through.

The takeaway: alert fatigue is an industry-wide problem, and nobody has solved it cleanly yet.

Why Alerts Multiply

Understanding the mechanism helps. Alerts tend to grow for predictable reasons:

Fear-driven rules. After every incident where monitoring "missed" something, teams add more rules. The rules rarely get removed because nobody wants to be responsible for the next miss.

Microservice multiplication. When you go from a monolith to 20 microservices, your alert surface area doesn't just grow — it explodes. Each service has its own CPU, memory, error rate, and latency thresholds. Cross-service failures trigger cascading alerts.

Copy-paste thresholds. Most teams start with recommended alert thresholds from blog posts or Prometheus recording rules. These defaults rarely match the actual baseline of your specific infrastructure.

No alert lifecycle management. Unlike code, which gets reviewed and refactored, alert rules tend to accumulate forever. Most teams have never done an "alert rule audit" to ask: which of these rules actually led to useful action in the past 90 days?

What Existing Tools Do (and Don't Do)

AlertManager (Prometheus ecosystem)

Good at: Grouping related alerts, silencing during maintenance, inhibiting secondary alerts when a primary is firing.

Doesn't do: Context-aware analysis. It can group alerts by label, but it can't tell you "these 5 alerts are all caused by the same upstream failure."

PagerDuty Event Intelligence

Good at: ML-based alert aggregation, reducing notification volume. PagerDuty reports their customers see significant noise reduction.

Doesn't do: Root cause analysis or remediation. It reduces the number of notifications you receive, but you still need to investigate and fix things manually. Also, it's a separate paid product ($29+/user/month for Teams tier).

Grafana OnCall

Good at: Routing alerts to the right person based on schedules and escalation policies.

Doesn't do: Reduce alert volume. It ensures the right person gets paged, but it doesn't question whether the page was worth sending.

The Gap

No mainstream open-source tool today combines: (1) alert detection, (2) AI-powered root cause analysis, and (3) automated remediation in a single package. This is the gap VigilOps is trying to fill.

How VigilOps Approaches This

VigilOps takes a different philosophy: instead of just telling you about problems, try to fix them.

When an alert fires in VigilOps:

1. Alert triggers (standard threshold check)
       ↓
2. AI analysis engine (DeepSeek LLM):
   - Gathers recent metrics, logs, active alerts
   - Analyzes root cause and severity
       ↓
3. If a Runbook matches:
   - Safety checks (confirm the runbook is appropriate)
   - Execute auto-remediation
   - Log the result
       ↓
4. If no Runbook matches:
   - Attach AI analysis to the alert
   - Notify on-call via normal channels

The 6 built-in Runbooks handle common scenarios:

disk_cleanup — Clear temp files and old logs when disk is full
service_restart — Gracefully restart a failed service
memory_pressure — Kill memory-hogging processes
log_rotation — Rotate oversized logs
zombie_killer — Terminate zombie processes
connection_reset — Reset stuck connection pools

These aren't exotic scenarios. They're the bread-and-butter issues that wake people up at night and could be handled by a script — if someone had written and maintained that script.

What This Looks Like in Practice

Scenario: "Server web-03 disk usage at 93%."

Traditional flow: On-call gets paged → SSHs into server → Runs du -sh /var/* → Identifies /var/log growing → Manually cleans old logs → Verifies disk drops → Goes back to bed. Time: 15-30 minutes.

VigilOps flow: Alert fires → AI analyzes metrics and identifies /var/log growth → Matches disk_cleanup runbook → Automatically clears files older than 7 days in /tmp and rotated logs → Disk drops to 62% → Alert auto-resolves. On-call sees a "resolved automatically" record in the morning.

Try It Yourself

git clone https://github.com/LinChuang2008/vigilops.git
cd vigilops
cp .env.example .env   # Add your DeepSeek API key
docker compose up -d
# Open http://localhost:3001

Or try the live demo: http://139.196.210.68:3001 — Login: demo@vigilops.io / demo123 (read-only)

In the demo, check out:

The alert list — notice the AI analysis field
The Runbook page — see the logic of each built-in remediation
The audit log — see records of automated actions

Practical Advice (With or Without VigilOps)

Regardless of what tools you use, here are concrete steps to reduce alert fatigue:

1. Audit your alert rules. Export every rule. Sort by trigger frequency in the last 30 days. The top 10 most-triggered rules are your biggest noise sources. Review each: Is the threshold wrong? Is this even alertable?

2. Separate signals from noise with alert tiers.

P0: Wake someone up (service down, data loss risk)
P1: Slack notification (degraded but functional)
P2: Dashboard-only (informational)

If more than 10% of your alerts are P0, your tiers are wrong.

3. Track alert quality metrics.

Noise ratio: % of alerts that trigger but require no action
Miss rate: Incidents that happened without an alert
Target: noise ratio < 30%, miss rate → 0

4. Do monthly alert reviews. Like sprint retrospectives, but for alerts. What fired most? What was never acted on? What can be deleted?

Honest Caveats

VigilOps is an early-stage project. We don't claim to "eliminate alert fatigue" — that depends on your environment, your alert rules, and your team's practices.

What we do believe: monitoring systems should be able to handle simple, predictable issues without waking someone up. That's the direction we're building toward.

If you're experiencing alert fatigue and want to experiment with AI-assisted remediation, give VigilOps a try. And if it doesn't work for your use case, we'd genuinely like to know why — GitHub Discussions.

VigilOps is an Apache 2.0 open source project. GitHub

DEV Community