On-call rotations are quietly burning out engineers. The thing nobody talks about: most pages are noise — duplicates, flapping alerts, monitor rules firing for the same underlying outage from three different angles. The 5% that matter get drowned in the 95% that don't.
I built Saneops over 6 weeks to test a hypothesis: most alert fatigue isn't a tooling-volume problem, it's a correlation problem. Once a system can group, dedupe, and explain alerts the way a senior SRE does in their head, the L1 layer mostly disappears.
Here's what worked and what didn't.
The naive approach (and why it fails)
The first instinct when alert volume is high is to write more rules. Group by alertname, group by service tag, set up Alertmanager inhibit_rules. If you're on PagerDuty, you lean on its event-grouping settings (we wrote up the differences in Saneops vs PagerDuty if you're comparing). It works for known-shape alerts and breaks the moment your stack evolves. Three problems:
- Rules require predicting failure modes you haven't seen yet. A new service rolls out, three monitors trigger, none of them group because the labels don't match the rules you wrote.
- Inhibit rules are write-only. Six months in, nobody on the team can explain why inhibit_rules[12] exists, so nobody touches it, so it rots.
- Time-window grouping treats the symptom, not the cause. Two unrelated outages happening within 5 minutes get glued into one incident; later, somebody pages on the wrong service (a sketch of this failure mode follows the list).
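To make that last failure mode concrete, here's roughly what pure time-window grouping looks like — illustrative Python, not Saneops code: any alert that arrives inside the window joins the most recent incident, labels never consulted.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def naive_group(alerts):
    """Pure time-window grouping: an alert joins the latest incident
    if it fired within WINDOW of that incident's last alert."""
    incidents = []  # each incident is just a list of alerts
    for alert in sorted(alerts, key=lambda a: a["fired_at"]):
        if incidents and alert["fired_at"] - incidents[-1][-1]["fired_at"] <= WINDOW:
            incidents[-1].append(alert)  # glued on; labels are never checked
        else:
            incidents.append([alert])
    return incidents

# Two unrelated outages, 3 minutes apart -> one incident, one (wrong) page.
alerts = [
    {"alertname": "HighErrorRate", "service": "checkout",  "fired_at": datetime(2026, 5, 1, 9, 0)},
    {"alertname": "DiskFull",      "service": "analytics", "fired_at": datetime(2026, 5, 1, 9, 3)},
]
print(len(naive_group(alerts)))  # 1
```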
What we did instead — three signals, one decision
For each new alert, Saneops asks: does this belong to an existing open incident, or is it a new one?
The decision is a weighted score across:
- Strong-label match. Service, namespace, cluster, deployment, job, app, pod — these are the dimensions SREs actually correlate on. A match here gets a heavy boost.
- Label Jaccard similarity. For everything else (env, region, customer, etc.), Jaccard over key=value pairs.
- Semantic similarity. Embedding the concatenation of alertname + summary + description and comparing cosine distance against the incident's centroid. Cheap (we use simple hashing embeddings — pgvector overkill at our scale) but surprisingly effective for catching alerts that share meaning but not labels.
Plus one veto: strong-label conflict. If the alert and the incident both carry the same strong label (say, service) but with different values, no amount of text similarity overrides it; it's a hard separation.
That's it. ~40 lines of correlation code. We tune the weights per tenant; the default similarity threshold is 0.3, which catches most real clusters while keeping false groupings rare. Conceptually this overlaps with what Keep does on the open-source side; the differences are mostly in defaults and how the LLM piece is wired (more on that below).
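For the curious, here's a rough sketch of the shape of that decision. The weights, the strong-label list, and the toy embed() helper are illustrative stand-ins rather than Saneops internals; the 0.3 default threshold and the hard veto are the parts taken straight from the description above.

```python
import hashlib
import math

STRONG_LABELS = {"service", "namespace", "cluster", "deployment", "job", "app", "pod"}

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy hashing embedding: hash each token into a bucket and count.
    Stands in for whatever cheap embedding you use; no pgvector needed."""
    vec = [0.0] * dims
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dims] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def correlation_score(alert_labels: dict, alert_text: str,
                      incident_labels: dict, incident_centroid: list[float]):
    """Score one alert against one open incident. Returns (score, vetoed).
    incident_centroid is the mean embedding of the incident's alert texts."""
    # Hard veto: both sides carry the same strong label with different values.
    for key in STRONG_LABELS:
        if key in alert_labels and key in incident_labels \
                and alert_labels[key] != incident_labels[key]:
            return 0.0, True

    # Strong-label matches get the heavy boost.
    strong = sum(1 for k in STRONG_LABELS
                 if k in alert_labels and alert_labels[k] == incident_labels.get(k))

    # Jaccard over the remaining key=value pairs.
    a = {(k, v) for k, v in alert_labels.items() if k not in STRONG_LABELS}
    b = {(k, v) for k, v in incident_labels.items() if k not in STRONG_LABELS}
    jaccard = len(a & b) / len(a | b) if a | b else 0.0

    # Semantic similarity of alertname + summary + description vs the centroid.
    semantic = cosine(embed(alert_text), incident_centroid)

    # Illustrative weights; Saneops tunes these per tenant.
    score = 0.5 * min(strong, 2) / 2 + 0.2 * jaccard + 0.3 * semantic
    return score, False

# Attach the alert to the best-scoring incident if score >= 0.3 (the default
# threshold) and nothing vetoed; otherwise open a new incident.
```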
The unexpected thing: time-based auto-resolve
The original design assumed sources would always send resolved webhooks. Production told us: Grafana Alertmanager defaults to send_resolved: false. So a third of "stuck open" incidents weren't bugs — they were never closed because nobody told us they ended.
Borrowed Keep's pattern: any open incident with no new alerts for auto_resolve_after_minutes (default 24h) gets auto-closed with resolved_by: "auto-idle". Fixed the zombie-incident problem in one commit. The Grafana side is documented in our Grafana integration setup — the right config is send_resolved: true on the receiver, but the idle-resolve sweep is the safety net for when customers haven't flipped that flag.
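The sweep itself is a small periodic job. A minimal sketch, assuming a psycopg2-style connection and a Postgres incidents table with status and last_alert_at columns (names are illustrative, not the real schema):

```python
# Periodic idle-resolve sweep: closes any open incident that has received
# no alerts for auto_resolve_after_minutes. Column names are assumptions.
IDLE_RESOLVE_SQL = """
UPDATE incidents
   SET status      = 'resolved',
       resolved_by = 'auto-idle',
       resolved_at = now()
 WHERE status = 'open'
   AND last_alert_at < now() - (%(idle_minutes)s * interval '1 minute')
"""

def sweep_idle_incidents(conn, idle_minutes: int = 24 * 60) -> int:
    """Close incidents that went quiet; returns how many were auto-resolved."""
    with conn.cursor() as cur:
        cur.execute(IDLE_RESOLVE_SQL, {"idle_minutes": idle_minutes})
        conn.commit()
        return cur.rowcount
```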
LLM RCA — keep it boring
The LLM bit is the part everyone wants to talk about, but it's the least interesting piece. We pass the alert payloads + clustered alert centroid + service topology to whatever LLM the tenant configured (Anthropic, OpenAI, Gemini, Ollama, whatever) and ask for a 3-bullet hypothesis. Conceptually similar to Datadog's Bits AI — but multi-source: it correlates Datadog alerts alongside Grafana, PagerDuty, and Prometheus signals in one incident, where Bits AI only sees Datadog. It's not the system of record — it's a starting point that the on-call engineer either confirms or rejects in 30 seconds.
The prompt structure that converged after 4 iterations:
You are an SRE. You have these N alerts that fired together within M minutes.
Common labels: {…}
Affected services: {…}
First-fire timestamp: {…}
In 3 bullets:
1. What is the most likely root cause?
2. What's the safest first thing to check?
3. What would make this incident worse?
Be specific. Cite alert fields. If you don't know, say "insufficient signal".
The "insufficient signal" line was the breakthrough — without it, the LLM hallucinates root causes for noisy clusters.
What we measured
In the first beta tenant (anonymised): 800 alerts/day → ~140 incidents → ~30 actually paged on-call after severity gating. Page reduction: ~80%. False-grouping rate (two unrelated outages clustered): ~3% — caught by the strong-label veto in 95% of cases.
The honest part
This is in closed beta as of May 2026. 10 design partners. Free for 60 days. If you've been on a real on-call rotation and want to try it, the signup is at app.saneops.in/signup — no card, 1,000 alerts/month free. Saneops plugs into Grafana, Datadog, and PagerDuty via webhook — full setup steps on each integration page.
If you're explicitly evaluating it as a PagerDuty alternative, that page covers the migration patterns (most teams keep PagerDuty for the rotation engine and use Saneops as the upstream noise filter; some replace entirely with Slack/email).
Or just steal the ideas. The strong-label veto + the "insufficient signal" prompt line are the two things I'd implement in any AIOps system, including ones I'm not building.
Built with FastAPI + Postgres + Render + Vercel. Total infra cost: $6/mo. Feedback wanted, especially edge cases the correlation breaks on.
— Om