Sid Probstein

Posted on Jul 1

Making RAG admit when it's guessing: source-grounded hallucination checks

#ai #rag #llm

The failure mode that scares me most in RAG isn't a wrong answer. It's a confident wrong answer with three citations that don't actually say what the answer claims.

So in SWIRL 5 I stopped trusting the model to police itself and added a check that runs after generation.

The flow:

1. Generate the answer with its citations, as usual.
2. Split the answer into atomic claims, roughly one assertion per sentence.
3. For each claim, pull the specific spans from the retrieved passages the model cited.
4. Run an entailment check: does the cited text actually support this claim, contradict it, or neither?
5. Any claim that isn't supported gets flagged in the UI, inline, before the user reads a word of it.
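The flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SWIRL's implementation: the `Claim` type, the `[doc1]` citation syntax, `check_answer`, and especially `toy_entails` (a word-overlap stand-in for a real entailment model) are all hypothetical names invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types and names; none of them come from SWIRL.
@dataclass
class Claim:
    text: str
    cited_ids: List[str]

# An entailment function takes (evidence, claim) and returns
# "supported", "contradicted", or "neutral".
Entailment = Callable[[str, str], str]

def split_claims(answer: str) -> List[Claim]:
    """Naive one-claim-per-sentence split; citations are assumed to look like [doc1]."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        ids = re.findall(r"\[(\w+)\]", sentence)
        text = re.sub(r"\s*\[\w+\]", "", sentence).strip()
        claims.append(Claim(text=text, cited_ids=ids))
    return claims

def check_answer(
    answer: str, passages: Dict[str, str], entails: Entailment
) -> List[Tuple[str, str, bool]]:
    """Return (claim text, verdict, flagged) for every claim in the answer."""
    results = []
    for claim in split_claims(answer):
        evidence = " ".join(passages[i] for i in claim.cited_ids if i in passages)
        # A claim with no usable evidence is treated as neutral, so it gets flagged.
        verdict = entails(evidence, claim.text) if evidence else "neutral"
        results.append((claim.text, verdict, verdict != "supported"))
    return results

def toy_entails(premise: str, hypothesis: str) -> str:
    """Stand-in for a real NLI model: pure word overlap, for demonstration only."""
    def words(s: str) -> set:
        return set(re.findall(r"[a-z0-9%]+", s.lower()))
    return "supported" if words(hypothesis) <= words(premise) else "neutral"

passages = {
    "doc1": "Revenue grew 12% in 2023.",
    "doc2": "The company is based in Boston.",
}
answer = "Revenue grew 12% in 2023 [doc1]. Headcount grew 40% in 2023 [doc2]."
for text, verdict, flagged in check_answer(answer, passages, toy_entails):
    print(("FLAG " if flagged else "ok   ") + text + " -> " + verdict)
```

The second claim cites a real document that says nothing about headcount, which is the gap the check exists to catch.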

The interesting part wasn't the entailment model; it was everything around it.

Claim segmentation is harder than it sounds. Naive sentence splitting produces claims that are unverifiable on their own because the subject lives two sentences up.
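To make the problem concrete, here is a small sketch of naive sentence splitting producing a claim whose subject is a dangling pronoun. The `looks_self_contained` heuristic and the `PRONOUN_OPENERS` list are assumptions for illustration; a crude opener check like this only detects the symptom and is no substitute for real coreference resolution.

```python
import re
from typing import List

# Hypothetical, deliberately small list of openers that usually point backwards.
PRONOUN_OPENERS = {"it", "this", "that", "they", "these", "those", "he", "she"}

def naive_split(text: str) -> List[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def looks_self_contained(claim: str) -> bool:
    """Crude check: a claim that opens with a pronoun probably needs context."""
    tokens = re.findall(r"[A-Za-z']+", claim)
    return bool(tokens) and tokens[0].lower() not in PRONOUN_OPENERS

text = "Acme reported revenue of $4.2M in Q3. It rose 12% year over year."
for claim in naive_split(text):
    print(looks_self_contained(claim), "|", claim)
```

The second sentence is a perfectly good sentence and a useless claim: nothing in "It rose 12% year over year." says what rose.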

Citations lie by omission. A model will cite a document that's topically relevant but doesn't contain the specific number it just quoted. The whole point of the check is to catch exactly that gap.

Latency budget. An honesty layer nobody waits for is an honesty layer nobody ships. To stay inside that budget, SWIRL 5 batches passage embeddings, optionally caches them, and more.
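One way batching and caching can fit together, sketched under assumptions: the `EmbeddingCache` class and `fake_embed_batch` are hypothetical and are not SWIRL's API. The idea is to embed only the passages that have not been seen before, and to do that in a single batched call.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]

class EmbeddingCache:
    """Hypothetical in-memory cache that embeds only cache misses, in one batch."""

    def __init__(self, embed_batch: Callable[[Sequence[str]], List[Vector]]):
        self._embed_batch = embed_batch
        self._store: Dict[str, Vector] = {}
        self.batch_calls = 0  # exposed so the effect of caching is observable

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, passages: Sequence[str]) -> List[Vector]:
        # Collect unique passages that are not cached yet, preserving order.
        missing: List[str] = []
        seen = set()
        for p in passages:
            k = self._key(p)
            if k not in self._store and k not in seen:
                seen.add(k)
                missing.append(p)
        if missing:
            self.batch_calls += 1
            for p, vec in zip(missing, self._embed_batch(missing)):
                self._store[self._key(p)] = vec
        return [self._store[self._key(p)] for p in passages]

def fake_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Stand-in for a real embedding model: length and space count only."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
cache.get_many(["a b", "cd", "a b"])   # one batched call, duplicates embedded once
cache.get_many(["cd"])                 # served from cache, no model call
print(cache.batch_calls)
```

Keying on a content hash rather than a document ID is a design choice in this sketch: it keeps the cache valid when the same passage shows up under different result IDs.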

The result isn't "SWIRL never hallucinates." Nothing can promise that. The result is: when it's on thin ice, it tells you, and it points at the exact sentence.

That's the version of trustworthy I can actually build.

Top comments (9)

Vasyl • Jul 2

The part nobody talks about with these honesty layers: how do you eval the checker itself? A false flag on a correct claim costs you differently than a miss. In my experience short numeric claims misfire the most: the entailment model can't tell "supported by the table" from "supported by that exact cell". And once users see a few wrong flags they start ignoring all of them, which quietly kills the whole feature. Do you track a false-flag rate against a labeled set, or tune the threshold by feel?

Sid Probstein • Jul 6

You are pointing at the real work!

The checker is a model too, so if you do not eval it you have just moved the trust problem down a layer. We treat the two errors asymmetrically on purpose: a miss costs you one bad claim, a false flag costs you the whole feature. As you said, after a small number of bogus flags people stop reading them and you are worse off than with no check at all. So you bias toward precision and measure a false-flag rate against a labeled set; tuning by feel is how you ship something that feels strict and is actually just noise.
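Measuring that is plain confusion-matrix arithmetic. A minimal sketch, assuming a labeled set of `(truly_unsupported, flagged)` pairs; the `CheckerEval` and `evaluate_checker` names are hypothetical:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class CheckerEval:
    false_flag_rate: float  # share of correct claims that got flagged anyway
    miss_rate: float        # share of unsupported claims that slipped through
    flag_precision: float   # share of flags that were deserved

def _ratio(num: int, den: int) -> float:
    return num / den if den else 0.0

def evaluate_checker(rows: Iterable[Tuple[bool, bool]]) -> CheckerEval:
    """Each row is (truly_unsupported, flagged) from a hand-labeled set."""
    tp = fp = fn = tn = 0
    for truly_unsupported, flagged in rows:
        if truly_unsupported and flagged:
            tp += 1
        elif truly_unsupported:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return CheckerEval(
        false_flag_rate=_ratio(fp, fp + tn),
        miss_rate=_ratio(fn, fn + tp),
        flag_precision=_ratio(tp, tp + fp),
    )

labeled = [
    (True, True), (True, True), (True, False),          # unsupported claims
    (False, True), (False, False), (False, False),      # correct claims
    (False, False), (False, False),
]
print(evaluate_checker(labeled))
```

Biasing toward precision then means picking the checker threshold that keeps `false_flag_rate` under whatever budget you set, and accepting the `miss_rate` that comes with it.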

And yes, short numeric claims are the worst case, exactly the table-versus-cell problem. IMO the fix there is narrowing what the claim is checked against, not reaching for a smarter NLI model.

Vasyl • Jul 6

The asymmetry framing settles it: one miss costs a claim, one false flag costs the feature. Measuring a false-flag rate against a labeled set instead of tuning by feel is the discipline part most write-ups skip. And agreed on narrowing the comparison window for numeric claims. A smarter NLI model just fails with more confidence.
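One possible reading of "narrowing the comparison window", as a sketch: before any entailment call, restrict the evidence to the sentences or lines that contain every number the claim mentions. The `narrow_evidence` helper and the number regex are assumptions for illustration, and the matching is literal, so "12%" will not match a bare "12" in a table whose header carries the unit.

```python
import re
from typing import List

# Digits with optional thousands separators, decimals, and a trailing percent sign.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?%?")

def numbers_in(text: str) -> List[str]:
    """Extract numeric tokens, with commas stripped so 1,200 equals 1200."""
    return [m.replace(",", "") for m in NUMBER.findall(text)]

def narrow_evidence(claim: str, passage: str) -> List[str]:
    """Return only the sentences/lines that contain every number in the claim.

    A claim with no numbers is left alone and checked against the whole passage.
    An empty result means no single unit states those numbers together.
    """
    wanted = set(numbers_in(claim))
    if not wanted:
        return [passage]
    units = [u.strip() for u in re.split(r"(?<=[.!?])\s+|\n+", passage) if u.strip()]
    return [u for u in units if wanted <= set(numbers_in(u))]

passage = (
    "Revenue was $4.2M in 2022. Revenue grew 12% in 2023. "
    "Headcount grew 40% in 2023."
)
print(narrow_evidence("Revenue grew 12% in 2023.", passage))
print(narrow_evidence("Revenue grew 13% in 2023.", passage))
```

The entailment model then only has to judge the narrowed units, and an empty list can be treated as unsupported without calling the model at all.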

Vinicius Pereira • Jul 1

This is the right instinct, moving the check after generation instead of trying to prompt the generator into honesty. A model grading its own groundedness in the same pass that produced the claim is just asking the fox about the henhouse. The verification has to be a separate step that's allowed to disagree with the generator.

One thing I'd push on from building similar checks: verify each claim against the whole retrieved set, not just the spans the model chose to cite. If you only check a claim against its own citation you're trusting the model's pointer, and the exact failure you describe (cites a topically relevant doc that doesn't contain the number) is really two different bugs wearing the same coat. Sometimes the number is supported by a passage it just didn't cite, sometimes nothing in the retrieval supports it at all, and those want different UI: one is a citation error and the other is a real hallucination. Same for claims with no citation at all: those should default to unsupported rather than skipped, otherwise the cleanest-looking answers are the ones that cited nothing.

And imo the whole thing lives or dies on the segmentation step you flagged, which is the honest hard part. The "subject lives two sentences up" problem is decontextualization, and it's the same wall FActScore-style atomic-fact decomposition runs into. If the claim isn't self-contained the entailment verdict is meaningless no matter how good the NLI model is. Curious whether you resolve coreference into the claims before checking or generate self-contained claims directly. Also curious how you treat NLI neutral, since most hallucinations land there rather than in contradiction, so neutral-as-fail is basically the whole policy. Good writeup, this is the version of trustworthy worth building.

Sid Probstein • Jul 6

You caught the two-bugs-in-one-coat problem better than the post did!

Checking a claim only against its own citation trusts the model's pointer, which is the one thing you are trying to verify.

Check against the whole retrieved set and split the outcomes: supported by an uncited passage is a citation error, supported by nothing is a hallucination, and they deserve different UI. And no citation should default to unsupported, never skipped, or you reward the answer that cited nothing, which is backwards.
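That split can be written down directly. A minimal sketch, assuming a `supports(passage, claim)` predicate backed by whatever entailment check is in use; the `Outcome` enum and `classify_claim` are hypothetical names, and the substring-based `toy_supports` is only a stand-in.

```python
from enum import Enum
from typing import Callable, Dict, Sequence

class Outcome(Enum):
    SUPPORTED = "supported"            # a cited passage backs the claim
    CITATION_ERROR = "citation_error"  # only an uncited passage backs it
    HALLUCINATION = "hallucination"    # nothing retrieved backs it

def classify_claim(
    claim: str,
    cited_ids: Sequence[str],
    retrieved: Dict[str, str],
    supports: Callable[[str, str], bool],
) -> Outcome:
    """Check the claim against every retrieved passage, not just the cited ones."""
    supporting = {pid for pid, text in retrieved.items() if supports(text, claim)}
    if supporting & set(cited_ids):
        return Outcome.SUPPORTED
    if supporting:
        return Outcome.CITATION_ERROR
    return Outcome.HALLUCINATION

def toy_supports(passage: str, claim: str) -> bool:
    """Stand-in for a real entailment check: case-insensitive substring match."""
    return claim.lower() in passage.lower()

retrieved = {
    "a": "Revenue grew 12% in 2023.",
    "b": "The company is based in Boston.",
}
print(classify_claim("Revenue grew 12% in 2023", ["a"], retrieved, toy_supports))
print(classify_claim("Revenue grew 12% in 2023", ["b"], retrieved, toy_supports))
print(classify_claim("Revenue grew 30% in 2023", ["a"], retrieved, toy_supports))
print(classify_claim("Revenue grew 12% in 2023", [], retrieved, toy_supports))
```

Because an empty `cited_ids` can never intersect the supporting set, a claim with no citation cannot come back as `SUPPORTED`: it is flagged either way, never skipped.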

Segmentation is the honest hard part. If the claim is not self-contained the entailment verdict is meaningless no matter how good the NLI model is; that is the FActScore wall. You have to make the claim self-contained first, resolve the coreference, so "it" and "that figure" become the real subject.

And neutral has to count as fail, since that is where most hallucinations actually live; neutral-as-pass ships a checker that never catches anything.

This is the version of the argument worth having, IMO. Thanks for commenting!!

Vinicius Pereira • Jul 6

totally with you that neutral can't pass, that part's non-negotiable. the wrinkle i'd add is neutral is doing double duty and it's worth splitting before you route it all to fail. some neutrals are genuinely unsupported, the number's just not in the retrieval. but some are neutral because segmentation left the claim un-self-contained, so the NLI had nothing real to bite on and shrugged. those two look identical in the verdict and they're opposite bugs: one is the model hallucinating, the other is your own decomposition failing upstream. collapse both into fail and you lose twice, you spam the user with flags on claims that were actually fine, and you bury your segmentation bugs under a hallucination label so you never go fix them.

so where i landed is neutral doesn't map to fail, it maps to a third state, abstain or couldn't-verify, that's explicitly not the same as contradicted. contradicted is we found evidence against, abstain is we couldn't stand behind this either way. the un-self-contained ones then get routed back to re-segment instead of counting as dishonesty, and what's left is the real unsupported set. binary pass/fail on a noisy NLI signal either cries wolf or ships blind, that third bucket is what keeps the flag list something people actually keep reading. good thread, you pushed it past where the post left it.

Sid Probstein • Jul 6

This is the right cut, and you said it better than the post did!

A neutral verdict is really diagnosing two different systems: unsupported is a statement about the model, un-self-contained is a statement about your own pipeline. Same verdict, opposite owners. Route both to fail and you don't just cry wolf at the user, you throw away the signal that your decomposition is leaking.

So yeah, three states. Contradicted means evidence against. Abstain means we couldn't stand behind it either way. Collapsing those is exactly how a flag list turns into noise people stop reading.
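A sketch of the three-state routing, with the abstention carrying its owner. The enum names, the `route` function, and the assumption that the NLI model emits the labels "entailment", "contradiction", and "neutral" are all illustrative, not taken from any particular system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    CONFIRMED = "confirmed"
    CONTRADICTED = "contradicted"
    ABSTAIN = "abstain"

class AbstainReason(Enum):
    UNSUPPORTED = "unsupported"                # owner: the model
    NOT_SELF_CONTAINED = "not_self_contained"  # owner: our own segmentation

@dataclass(frozen=True)
class Result:
    verdict: Verdict
    reason: Optional[AbstainReason] = None

def route(nli_label: str, self_contained: bool) -> Result:
    """Map an NLI label to a verdict, splitting neutral by who owns the problem."""
    if nli_label == "entailment":
        return Result(Verdict.CONFIRMED)
    if nli_label == "contradiction":
        return Result(Verdict.CONTRADICTED)
    # Neutral (or any unknown label) never passes; it abstains with a reason.
    if not self_contained:
        return Result(Verdict.ABSTAIN, AbstainReason.NOT_SELF_CONTAINED)
    return Result(Verdict.ABSTAIN, AbstainReason.UNSUPPORTED)

print(route("neutral", self_contained=True))
print(route("neutral", self_contained=False))
```

Only the `UNSUPPORTED` abstentions belong in front of the user; the `NOT_SELF_CONTAINED` ones are a queue for re-segmentation.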

This is the part the post pointed at and didn't finish.

Vinicius Pereira • Jul 6

same verdict, opposite owners is the cleanest phrasing this thread has produced, i'm keeping that one. the operational consequence is they belong on different dashboards too: unsupported-rate is a model quality metric, un-self-contained-rate is a pipeline regression metric, and the second one is the one worth alerting on, because it only moves when someone changes your own code. decomposition edits are supposed to be invisible, so a bump in couldn't-verify right after a segmenter change is the diff telling on itself. collapse the three states and that canary drowns in model noise, which is exactly the flag-list-nobody-reads failure you named.
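A rough sketch of that canary: compute verdict rates over a fixed evaluation set before and after a segmenter change, and alert only on the un-self-contained rate. The verdict label strings and the 2-point `max_increase` threshold are arbitrary assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

NOT_SELF_CONTAINED = "abstain:not_self_contained"  # hypothetical label

def rates(verdicts: Iterable[str]) -> Dict[str, float]:
    """Share of each verdict label in a run; empty input gives an empty dict."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

def segmenter_regressed(
    before: Iterable[str], after: Iterable[str], max_increase: float = 0.02
) -> bool:
    """True when the un-self-contained rate rose by more than max_increase."""
    delta = rates(after).get(NOT_SELF_CONTAINED, 0.0) - rates(before).get(
        NOT_SELF_CONTAINED, 0.0
    )
    return delta > max_increase

before = ["confirmed"] * 90 + ["abstain:unsupported"] * 8 + [NOT_SELF_CONTAINED] * 2
after = ["confirmed"] * 84 + ["abstain:unsupported"] * 8 + [NOT_SELF_CONTAINED] * 8
print(segmenter_regressed(before, after))
```

In this made-up run the unsupported rate is flat while the un-self-contained rate jumps, which points at the segmenter change rather than the model.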

liked this cut enough that i wired it into the last system i shipped, abstain as a first-class verdict next to confirmed and contradicted, with the abstention carrying which side of the boundary it came from. good thread, this is what comment sections are supposed to do.

Kartik N V J K • Jul 1

Running the entailment check after generation and flagging unsupported claims inline is the honesty layer I wish more RAG systems shipped, especially the catch on citations that are topically relevant but do not contain the actual number. Claim segmentation being the hard part tracks for me too, since a claim whose subject lives two sentences up is unverifiable on its own. Are you resolving those references before segmentation, or scoring the claim against the whole cited passage and accepting some noise to protect the latency budget?