
Stella Lin

Posted on • Originally published at theculprit.ai

Two SQL primitives for when alert clustering gets it wrong



Every alert-correlation system gets things wrong. The interesting question is what the on-call engineer can do about it at 2 a.m.

The bad answer is: nothing. The system grouped seven events into one incident; six of them are the database connection-pool storm and the seventh is a totally unrelated TLS-handshake failure that happens to share some token overlap with the rest. The on-call sees one incident in the dashboard, acks it, fixes the connection pool, and goes back to sleep. The TLS failure quietly stops alerting because it's already attached to a "resolved" incident, and customers find it for you in the morning.

The slightly less-bad answer is: file a ticket with the vendor. Wait two weeks. They tweak a threshold somewhere. The same shape of mis-clustering happens again on a different pair of unrelated events.

The right answer is: the on-call engineer can split the wrong-grouped event off into its own incident, in the UI, in five seconds, with a complete audit trail of who did it and why. And for the inverse case — two incidents that should be one — they can merge them with the same ergonomics.

This piece is about the two Postgres functions that implement that override mechanic, the two columns that record the lineage, and the one race condition we hit when shipping it.

01 — Why most AIOps tools punt on this

Vector-based clustering — embedding alert payloads, computing cosine similarity, attaching events to incidents above a threshold — is the dominant correlation approach right now. It works well in the median case. It fails in two specific ways that no threshold-tuning fixes:

False positives. Two unrelated alerts share enough surface tokens (service name, region, error class) that they cluster together even though they're describing different underlying problems. A connection-pool exhaustion alert and a TLS handshake failure can score 0.87 cosine similarity if they both happen to mention "production-west-2 db-primary timeout." Above the 0.85 attach threshold; into the same incident.

False negatives. The same root cause produces two alerts whose surface text diverges enough — different runbooks, different metric names, one from the application logs and one from the infra layer — that they cluster into separate incidents. The on-call sees two pages, fixes one, and the other auto-resolves later because the underlying issue went away. Now they have two incidents in the post-mortem record describing the same event, with no link between them.
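The attach decision both failure modes hinge on can be sketched as a single pgvector query. This is a hedged illustration, not the product's actual query: it assumes an `incidents.embedding` vector column, pgvector's `<=>` cosine-distance operator, and the 0.85 threshold from the example above.

```sql
-- Hypothetical sketch: find the best-matching open incident for a new
-- event's embedding. Assumes incidents.embedding is a pgvector column;
-- <=> is pgvector's cosine-distance operator, so similarity = 1 - distance.
SELECT id,
       1 - (embedding <=> $1) AS cosine_similarity
FROM incidents
WHERE tenant_id = $2
  AND service   = $3
  AND status    = 'open'
ORDER BY embedding <=> $1   -- nearest incident first
LIMIT 1;
-- Application side: attach if cosine_similarity >= 0.85, otherwise open a
-- new incident. Both failure modes above are this threshold firing (or
-- not firing) on misleading surface-token overlap.
```

Nothing in this query knows about root causes; it only sees text geometry, which is exactly why an override mechanic is needed at all.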

Most AIOps platforms ship the clustering and stop there. The reasoning, if you ask, is some combination of "it's hard to expose this safely" and "it would imply our clustering isn't perfect" and "users don't actually understand incident lineage well enough to use it." The first is true and solvable. The second is product theater — every clustering system makes mistakes, and pretending otherwise is what makes engineers distrust the system. The third is condescending; the on-call engineer who's been awake for three hours understands incident lineage better than anyone.

The right principle is the inverse: AI assists humans, humans correct AI. The clustering system makes its best guess. The human reviews the guess. When the human disagrees, the system records what they did and learns from it. The clustering algorithm doesn't have to be perfect; it has to be correctable.

02 — The two RPCs

In Postgres, "correctable" looks like two SECURITY DEFINER functions:

incident_split(p_incident_id uuid, p_event_ids uuid[]) RETURNS uuid
incident_merge(p_source_id uuid, p_target_id uuid)     RETURNS void

incident_split takes an incident and a subset of its events. It creates a new incident on the same tenant + service, re-parents the chosen events to the new incident, and recomputes both incidents' representative embeddings so the next round of clustering doesn't re-cluster them back together. It returns the new incident's UUID.

incident_merge takes a source and target incident on the same service. It moves all of the source's events into the target, marks the source as merged with a merged_into pointer, and recomputes the target's representative embedding to incorporate the absorbed events.
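As a rough sketch of the merge side (the split's guards are quoted later in the piece), the body of `incident_merge` might look like the following. Table and column names beyond those described in the text, in particular `tenant_audit_log` and its columns, are assumptions for illustration; the embedding recompute is elided.

```sql
-- Simplified sketch of incident_merge's body. Guards are elided: the real
-- function also checks auth, tenant membership, and the same-service rule.
-- Everything runs in one transaction; if any statement fails, all roll back.

-- 1. Re-parent every event from source to target.
UPDATE sanitized_events
SET incident_id = p_target_id
WHERE incident_id = p_source_id;

-- 2. Mark the source as merged and record the lineage pointer.
UPDATE incidents
SET status = 'merged', merged_into = p_target_id
WHERE id = p_source_id;

-- 3. Re-derive the target's event count from the source of truth
--    (never cached_old + delta; see the race-condition section).
UPDATE incidents
SET event_count = (SELECT count(*) FROM sanitized_events
                   WHERE incident_id = p_target_id)
WHERE id = p_target_id;

-- 4. Audit row in the same transaction, so it can never be dropped.
--    tenant_audit_log and its columns are hypothetical names here.
INSERT INTO tenant_audit_log (tenant_id, actor_id, action, metadata)
VALUES (v_tenant_id, v_caller_id, 'incident_merge',
        jsonb_build_object('source', p_source_id, 'target', p_target_id));
```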

Both functions live in the database, not the application layer. That choice is load-bearing for three reasons:

  1. Atomicity. Re-parenting events, updating counts, recomputing embeddings, and writing the audit row need to happen in one transaction. If any step fails, the whole operation rolls back and the on-call doesn't end up looking at a half-split incident at the worst possible time.
  2. Defense in depth. The route handler checks the caller's capability ("can this user split incidents?") before calling the RPC. The RPC also checks tenant membership directly via auth.uid() against the tenant_members table. Two layers, one for ergonomics, one for the case where the application layer has a bug.
  3. Auditability. Both RPCs write a row to a tenant audit log inside the same transaction. The audit row is impossible to drop, because if the audit insert fails the whole split/merge fails.

The SECURITY DEFINER modifier means the function runs with the privileges of the function owner, not the caller — necessary because the caller (an authenticated user) doesn't have UPDATE privileges on incidents or sanitized_events directly. The function does the privileged work on the caller's behalf, gated by the in-function membership check. SET search_path = public, extensions, pg_temp is mandatory on any SECURITY DEFINER function that touches vector types — pgvector's schema location varies between local and managed Postgres deployments, and a missing search_path produces a "type vector does not exist" error that only fires in production.
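Put together, the declaration looks something like this. A sketch under the same assumptions as the text, with the body elided; `auth.uid()` is the Supabase-style helper the article mentions.

```sql
-- Declaration sketch: SECURITY DEFINER runs the body with the function
-- owner's privileges, and the pinned search_path makes pgvector's types
-- resolve identically on local and managed Postgres.
CREATE OR REPLACE FUNCTION incident_split(
  p_incident_id uuid,
  p_event_ids   uuid[]
) RETURNS uuid
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, extensions, pg_temp
AS $$
DECLARE
  v_caller_id uuid := auth.uid();  -- authenticated caller, per the text
  v_new_id    uuid;
BEGIN
  -- guards, re-parenting, count re-derivation, audit row (sections 03-04)
  RETURN v_new_id;
END;
$$;
```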

03 — Validating a split

incident_split has five guards that fire in order:

-- Guard 1: caller must be authenticated.
IF v_caller_id IS NULL THEN RAISE EXCEPTION 'forbidden'; END IF;

-- Guard 2: event_ids must be a non-empty array.
IF p_event_ids IS NULL OR array_length(p_event_ids, 1) IS NULL THEN
  RAISE EXCEPTION 'invalid_event_ids';
END IF;

-- Guard 3: caller must be a member of the parent incident's tenant.
IF NOT EXISTS (
  SELECT 1 FROM tenant_members
  WHERE tenant_id = v_tenant_id AND user_id = v_caller_id
) THEN
  RAISE EXCEPTION 'forbidden';
END IF;

-- Guard 4: every requested event_id must actually belong to this incident.
SELECT count(*) INTO v_split_count
FROM sanitized_events
WHERE incident_id = p_incident_id AND id = ANY(p_event_ids);
IF v_split_count <> array_length(p_event_ids, 1) THEN
  RAISE EXCEPTION 'invalid_event_ids';
END IF;

-- Guard 5: cannot split all events (would leave the parent empty).
IF v_split_count = v_total_events THEN
  RAISE EXCEPTION 'cannot_split_all';
END IF;

Guard 4 is the interesting one. The naive implementation would just UPDATE sanitized_events SET incident_id = ... WHERE id = ANY(p_event_ids) and trust the caller. That works fine until a malicious or buggy client passes event IDs from a different incident — possibly belonging to a different tenant. The guard counts how many of the requested IDs actually live in this incident; if the count doesn't match the array length, at least one ID is wrong and the whole split fails.

Guard 5 protects against a UI bug where the user accidentally selects every event in the incident and clicks split. Without the guard, you'd end up with two incidents — one empty, one new with all the events — and the empty one would sit in the dashboard forever as a phantom. The simpler rule "you must keep at least one event in the parent" makes the post-condition obvious.

04 — The race condition we found

The first version of incident_split updated event counts the obvious way:

-- BEFORE (broken under concurrent attaches)
UPDATE incidents SET event_count = event_count - v_split_count
WHERE id = p_incident_id;

The split moved N events out, so subtract N from the parent's count. Pure arithmetic.

It's racy. Picture two transactions running concurrently:

  • Transaction A is the pipeline correlator. A new event for this incident has just arrived. The correlator computes the cosine similarity, finds it's above the threshold, and starts attaching the event: UPDATE incidents SET event_count = event_count + 1 WHERE id = p_incident_id.
  • Transaction B is the on-call engineer running a split. They've selected three events to move out. The split RPC fires UPDATE incidents SET event_count = event_count - 3 WHERE id = p_incident_id.

If A commits between B's SELECT count(*) (which gave us v_total_events) and B's UPDATE, B's subtraction-against-original-count clobbers A's increment. The new event is attached (the sanitized_events row is correct) but the event_count column is one too low. The next time anyone reads the incident, the displayed event count under-reports reality by one.

The fix is to re-derive the count from the actual rows, not from the cached starting count:

-- AFTER (correct under concurrent attaches)
UPDATE incidents
SET event_count = (
  SELECT count(*) FROM sanitized_events WHERE incident_id = p_incident_id
)
WHERE id = p_incident_id;

The subquery runs inside the same transaction as the update, with row-level locks on sanitized_events from B's earlier re-parenting, so it sees a consistent view of the world. If A's increment commits before B's subquery runs, B sees the new row and the count comes out right. If A's increment commits after B's transaction commits, A's event_count + 1 is now +1 against a re-derived (correct) count and the world is consistent again.

This same pattern applies to incident_merge, where a concurrent attach on the target incident during the merge window would cause event_count + v_moved_count to either double-count or clobber. Both mutations re-derive from sanitized_events for the same reason.

The general principle: when you have a denormalized count column maintained by transactions that can run concurrently, never compute the new value as cached_old + delta. Always re-derive from the source-of-truth. The cost is one extra count(*) per write; the benefit is correctness under any interleaving.

05 — The lineage columns

ALTER TABLE incidents
  ADD COLUMN merged_into  uuid REFERENCES incidents(id),
  ADD COLUMN derived_from uuid REFERENCES incidents(id);

merged_into lives on the source of a merge — the incident that was absorbed. After incident_merge(A, B), A's status flips to 'merged' and A's merged_into points at B. A is now read-only; the UI knows to show a banner ("This incident was merged into ↗ B") and the post-mortem record preserves the fact that A existed.

derived_from lives on the child of a split. After incident_split(A, [event_ids]) returns the new incident ID B, B's derived_from points at A. The UI shows a reciprocal banner ("Split from ↖ A") so the post-mortem record can reconstruct the full picture: A had nine events, the on-call moved three of them into B, A now has six events, B has three, and both are open and being investigated separately.

Notice what these columns are not: they're not ON DELETE CASCADE. The reference is informational, not a foreign-key-enforced hierarchy. If A is later deleted as part of a tenant-data-purge workflow, B's derived_from has to be cleared first (with a plain REFERENCES constraint, Postgres refuses the delete while a live pointer remains), and that's fine: losing the link costs nothing. The audit log row for the split still exists with both UUIDs in its metadata, and the post-mortem reconstruction relies on the audit log, not on the live derived_from pointer.

The reason matters: lineage columns are a denormalization for fast UI rendering ("show me the link without joining the audit log"). The audit log is the authoritative record. If the two ever disagree, trust the audit log.
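For a post-mortem tool that wants the whole split chain rather than one hop, a recursive walk over derived_from is one option. A sketch, assuming only the two columns added above:

```sql
-- Sketch: walk an incident's split lineage back toward its root.
-- The recursion stops naturally when derived_from is NULL or when a
-- parent row no longer exists (the audit log stays authoritative).
WITH RECURSIVE lineage AS (
  SELECT id, derived_from, 0 AS depth
  FROM incidents
  WHERE id = $1
  UNION ALL
  SELECT i.id, i.derived_from, l.depth + 1
  FROM incidents i
  JOIN lineage l ON i.id = l.derived_from
)
SELECT id, depth FROM lineage ORDER BY depth;
```

For the single-hop banners the UI actually renders, the denormalized column alone is enough; the recursive form only matters for reconstruction across multiple splits.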

06 — The same-service constraint

incident_merge rejects a cross-service merge:

-- Guard: same-service constraint.
IF v_src_service <> v_tgt_service THEN
  RAISE EXCEPTION 'cross_service_merge_denied';
END IF;

The reason isn't ergonomic; it's semantic. Each service has its own token namespace — the placeholder <TOKEN_a1b2c3> in service X and the same string in service Y refer to completely different underlying values, because the tokenizer scopes its dictionary by service. Merging an incident from service X into one from service Y would produce an incident whose displayed text is incoherent: half the tokens would resolve under X's dictionary, half under Y's, and a UI walking the events would either show wrong rehydrated values or have to special-case "rehydrate this event under its original service's dictionary, not the parent incident's."

The simpler rule "you can't merge across services" sidesteps the whole problem. If two incidents on different services are actually the same root cause — say a database service and an application service both alerting on the same connection-pool exhaustion — the right model isn't "merge them," it's "link them in the post-mortem record." That's a separate feature with different ergonomics.

07 — The capability gate

The route handler that calls incident_split does this:

const ctx = await getActiveTenantContext();
if (!ctx) return NextResponse.json({ error: 'no_tenant' }, { status: 403 });
const cap = { role: ctx.role, permissionMode: ctx.permissionMode };
if (!can(cap, 'split_incident')) {
  return NextResponse.json({ error: 'forbidden' }, { status: 403 });
}

can(cap, 'split_incident') is a typed capability check. Owners and admins can always split. Members can split only if the tenant is in "flat" permission mode (the default for small teams). Members in "strict" mode get a 403, with the UI hiding the Split button entirely so they don't see an action they can't perform.

The capability layer is where this decision lives, not the RPC. The RPC checks tenant membership but doesn't distinguish roles — that's an application-layer concern that can be tuned per-tenant without a migration. A team that wants to lock down splits to admins-only flips one config value; the RPC doesn't change.

This is the same pattern as rename_tenant and ack_incident: route handler enforces the capability, RPC enforces the membership, audit log records the actor. Three layers, each with one job.

08 — What we tell users

The Split button on the incident page opens a modal with one checkbox per event in the incident. Pick the events to split off; the gate is "at least one but not all." The modal text reads:

Pick at least one event but not all. The selected events become a new incident; the rest stay here.

That's the entire UX. No confirmation step, no "are you sure," no permission elevation flow. The audit log records the actor and the timestamp; if it was the wrong call, the on-call can merge the new incident back into the parent (incident_merge(new, parent)) and the round-trip is recorded too.

The Merge button opens a picker with all open incidents on the same service. Pick the target; the source incident becomes merged and its events show up in the target's timeline. The same "no confirmation" rule applies — the action is recorded, reversible (well, mergeable-in-the-other-direction), and surfaces in the audit feed for anyone reviewing the incident later.

The principle is: low-friction reversible action with a complete audit trail beats high-friction confirmation gates with no audit trail. The on-call engineer at 2 a.m. shouldn't have to click through a modal to do their job. The post-mortem author the next morning needs to be able to see exactly what happened, in order, with timestamps. Both of those are served by the same design.

09 — What this looks like in the post-mortem

When the post-mortem author opens the incident the next morning, the audit panel shows:

14:32:01  incident created (auto, from event-stream correlator)
14:32:04  event attached (cosine 0.87, threshold 0.85)
14:32:11  event attached (cosine 0.91)
14:32:18  event attached (cosine 0.86)
...
14:38:42  incident_split  →  a1b2c3...  (3 events moved by alice@team)
14:39:01  incident_merge  ←  d4e5f6...  (12 events absorbed by bob@team)
14:42:33  ack_incident          (acked by alice@team)

Every state transition has a timestamp, an actor, and a count. The author can reconstruct exactly what happened, who did what, and what state the incident was in at each step. The lineage badges in the header show the split-from and merged-into links so the author can navigate to the related incidents and read their audit trails too.

That's the deliverable: an incident whose state can be reconstructed entirely from its rows, with no implicit knowledge of what humans did between events, and no missing transitions. The clustering algorithm makes a guess; humans correct it; the system records both. The whole thing is one Postgres function call away from any client that holds an authenticated session.

10 — The takeaway

If your AIOps tool clusters alerts into incidents, ask the vendor what happens when it gets it wrong. There are exactly three answers:

  1. "It doesn't get it wrong." Walk away. Every clustering system makes mistakes; the question is just whether you can correct them. A vendor who claims otherwise either hasn't run their system at scale or isn't being honest about what they've seen.
  2. "You can file a support ticket and we'll re-tune the thresholds." This is the same as "no" with extra steps. The on-call engineer at 2 a.m. doesn't have a support engineer to call.
  3. "You can split or merge the incident in the UI, with a full audit trail." This is the answer. If this answer comes with a 12-step approval workflow and a confirmation dialog with "type SPLIT to confirm," the answer is closer to (2) than (3).

The split/merge primitives aren't complicated to build. Two SQL functions, two lineage columns, one capability check, one audit row per action, one re-derive-don't-subtract trick to avoid races. The total surface area is under 300 lines of SQL plus a couple hundred lines of UI. The reason most tools don't ship it isn't technical complexity; it's that admitting "the clustering can be wrong" runs against the marketing pitch.

We took the opposite bet: the clustering is good but not infallible, the override is fast and well-instrumented, and the audit trail is complete. The on-call engineer is the ground truth. The system records what they decide, and the next round of clustering uses the corrected representative embedding so it doesn't re-cluster the same way next time.

That's a more honest pitch than "our AI never makes mistakes." It's also a better product.


Culprit is a push-based AIOps platform with edge tokenization, vector-based correlation, and LLM root-cause analysis. The split/merge mechanic shipped to production on 2026-05-08; full implementation is open in the public design spec.
