DEV Community

Sid Probstein

Making RAG admit when it's guessing: source-grounded hallucination checks

The failure mode that scares me most in RAG isn't a wrong answer. It's a confident wrong answer with three citations that don't actually say what the answer claims.

So in SWIRL 5 I stopped trusting the model to police itself and added a check that runs after generation.

The flow:

  1. Generate the answer with its citations, as usual.
  2. Split the answer into atomic claims — roughly one assertion per sentence.
  3. For each claim, pull the specific spans from the retrieved passages the model cited.
  4. Run an entailment check: does the cited text actually support this claim, contradict it, or neither?
  5. Any claim that isn't supported gets flagged in the UI, inline, before the user reads a word of it.
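The flow above can be sketched in a few lines. This is a minimal illustration, not SWIRL's actual code: the `[1]`-style citation markers, the `check_answer` and `CheckedClaim` names, and the use of a whole passage as the cited span are all assumptions. `toy_entails` is a word-overlap placeholder standing in for a real entailment model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

SUPPORTED, CONTRADICTED, NEUTRAL = "supported", "contradicted", "neutral"
CITATION = re.compile(r"\s*\[(\w+)\]")  # assumed marker format: [1], [2], ...


@dataclass
class CheckedClaim:
    text: str
    cited_ids: List[str]
    verdict: str
    flagged: bool


def split_claims(answer: str) -> List[str]:
    # Step 2, naively: one claim per sentence.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def check_answer(answer: str, passages: Dict[str, str],
                 entails: Callable[[str, str], str]) -> List[CheckedClaim]:
    results = []
    for sentence in split_claims(answer):
        ids = CITATION.findall(sentence)    # step 3: which passages were cited
        claim = CITATION.sub("", sentence)  # claim text without the markers
        evidence = " ".join(passages[i] for i in ids if i in passages)
        # Step 4: a claim with no usable evidence counts as neutral, not skipped.
        verdict = entails(evidence, claim) if evidence else NEUTRAL
        # Step 5: anything other than "supported" gets flagged.
        results.append(CheckedClaim(claim, ids, verdict, verdict != SUPPORTED))
    return results


def _words(text: str) -> Set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def toy_entails(premise: str, hypothesis: str) -> str:
    # Placeholder for a real NLI model: "supported" only if every longer word
    # of the claim also appears in the evidence. It never returns "contradicted".
    return SUPPORTED if _words(hypothesis) <= _words(premise) else NEUTRAL


passages = {
    "1": "The bridge opened to traffic in 1932.",
    "2": "The bridge is a popular tourist attraction.",
}
answer = "The bridge opened in 1932 [1]. It carries 40000 vehicles a day [2]."
for result in check_answer(answer, passages, toy_entails):
    print("FLAG" if result.flagged else "ok  ", result.text)
```

In this toy run the second sentence is flagged: passage 2 is about the same bridge but says nothing about traffic volume.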

The interesting part wasn't the entailment model; it was everything around it.

Claim segmentation is harder than it sounds. Naive sentence splitting produces claims that are unverifiable on their own because the subject lives two sentences up.
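To make the problem concrete, here is a small sketch of naive splitting plus a crude repair. The pronoun list and the `with_context` helper are illustrative assumptions, not how SWIRL resolves references; a real system would more likely use coreference resolution or have a model rewrite each claim to be self-contained.

```python
import re
from typing import List

# Sentences that open with one of these probably lean on earlier context.
PRONOUN_START = re.compile(r"^(it|they|this|that|these|those|he|she)\b", re.IGNORECASE)


def naive_split(answer: str) -> List[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def needs_context(claim: str) -> bool:
    return bool(PRONOUN_START.match(claim))


def with_context(sentences: List[str]) -> List[str]:
    # Prefix each dependent sentence with the most recent sentence that
    # named its own subject, however many sentences back that was.
    claims, anchor = [], ""
    for sentence in sentences:
        if needs_context(sentence) and anchor:
            claims.append(f"{anchor} {sentence}")
        else:
            anchor = sentence
            claims.append(sentence)
    return claims


text = ("The 2023 audit covered three data centers. "
        "It ran for six weeks. It found 14 critical issues.")
for claim in with_context(naive_split(text)):
    print(claim)
```

On its own, "It found 14 critical issues." cannot be checked against anything. With the anchor sentence attached, the entailment step at least knows the claim is about the audit.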

Citations lie by omission. A model will cite a document that's topically relevant but doesn't contain the specific number it just quoted. The whole point of the check is to catch exactly that gap.
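One cheap way to catch this particular gap, shown here as a sketch rather than as the check SWIRL runs: pull the numbers out of a claim and confirm each one literally appears in the cited text. The `unsupported_numbers` helper is hypothetical, it assumes citation markers have already been stripped from the claim, and it would complement an entailment model rather than replace it.

```python
import re
from typing import Set

# Integers, decimals, thousands separators, and an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def numbers_in(text: str) -> Set[str]:
    # Drop thousands separators so "1,200" and "1200" compare equal.
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(claim: str, cited_text: str) -> Set[str]:
    # Numbers the claim states that the cited text never mentions.
    return numbers_in(claim) - numbers_in(cited_text)


claim = "Revenue grew 12.5% in 2023."
cited = "The 2023 annual report discusses revenue growth across all regions."
print(unsupported_numbers(claim, cited))
```

The cited passage is on topic and even mentions the year, but the growth figure is nowhere in it, so the claim should be flagged.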

Latency budget. An honesty layer nobody waits for is an honesty layer nobody ships. To stay inside that budget, SWIRL 5 batches and optionally caches passage embeddings, among other optimizations.
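A rough sketch of what batching plus caching can look like, under stated assumptions: `EmbeddingCache` and `fake_embed_batch` are hypothetical names, the cache is a plain in-memory dict keyed by a content hash, and the embedder is a stub. None of this is taken from SWIRL's code.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class EmbeddingCache:
    """Embed only passages not seen before, in one batched call per request."""

    def __init__(self, embed_batch: Callable[[Sequence[str]], List[Vector]]):
        self._embed_batch = embed_batch
        self._store: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, passages: Sequence[str]) -> List[Vector]:
        # Collect cache misses once each, preserving first-seen order.
        missing, seen = [], set()
        for passage in passages:
            key = self._key(passage)
            if key not in self._store and key not in seen:
                seen.add(key)
                missing.append(passage)
        if missing:
            # One batched call instead of one call per passage.
            for passage, vector in zip(missing, self._embed_batch(missing)):
                self._store[self._key(passage)] = vector
        return [self._store[self._key(p)] for p in passages]


calls: List[List[str]] = []


def fake_embed_batch(texts: Sequence[str]) -> List[Vector]:
    # Stub embedder that records each batch it is asked to embed.
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed_batch)
cache.get_many(["passage a", "passage b", "passage a"])
cache.get_many(["passage b", "passage c"])
print(calls)
```

The second request only pays for "passage c"; everything else comes from the cache.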

The result isn't "SWIRL never hallucinates." Nothing can promise that. The result is: when it's on thin ice, it tells you, and it points at the exact sentence.

That's the version of trustworthy I can actually build.

Top comments (2)

Kartik N V J K

Running the entailment check after generation and flagging unsupported claims inline is the honesty layer I wish more RAG systems shipped, especially the catch on citations that are topically relevant but do not contain the actual number. Claim segmentation being the hard part tracks for me too, since a claim whose subject lives two sentences up is unverifiable on its own. Are you resolving those references before segmentation, or scoring the claim against the whole cited passage and accepting some noise to protect the latency budget?

Vinicius Pereira

This is the right instinct, moving the check after generation instead of trying to prompt the generator into honesty. A model grading its own groundedness in the same pass that produced the claim is just asking the fox about the henhouse. The verification has to be a separate step that's allowed to disagree with the generator.

One thing I'd push on from building similar checks: verify each claim against the whole retrieved set, not just the spans the model chose to cite. If you only check a claim against its own citation, you're trusting the model's pointer, and the exact failure you describe (citing a topically relevant doc that doesn't contain the number) is really two different bugs wearing the same coat. Sometimes the number is supported by a passage the model just didn't cite; sometimes nothing in the retrieval supports it at all. Those want different UI: one is a citation error and the other is a real hallucination. The same goes for claims with no citation at all. Those should default to unsupported rather than skipped; otherwise the cleanest-looking answers are the ones that cited nothing.
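A minimal sketch of that triage, with hypothetical names throughout: `triage_claim` and the three labels are illustrative, and `toy_supports` is a substring placeholder for a real entailment model. It is one interpretation of the suggestion: an uncited claim can never come back as supported. It is reported as a citation error if some retrieved passage backs it and as unsupported otherwise.

```python
from typing import Callable, Dict, List


def triage_claim(claim: str, cited_ids: List[str], retrieved: Dict[str, str],
                 supports: Callable[[str, str], bool]) -> str:
    # First check the passages the model actually pointed at.
    cited = [retrieved[i] for i in cited_ids if i in retrieved]
    if any(supports(passage, claim) for passage in cited):
        return "supported"
    # Then check the rest of the retrieved set.
    others = [p for i, p in retrieved.items() if i not in cited_ids]
    if any(supports(passage, claim) for passage in others):
        return "citation_error"  # true per retrieval, but wrong or missing pointer
    return "unsupported"         # nothing retrieved backs the claim


def toy_supports(passage: str, claim: str) -> bool:
    # Placeholder for an entailment model: literal containment only.
    return claim.lower().rstrip(".") in passage.lower()


retrieved = {
    "a": "Overview of the product line.",
    "b": "The device weighs 340 grams.",
}
claim = "The device weighs 340 grams."
print(triage_claim(claim, ["b"], retrieved, toy_supports))
print(triage_claim(claim, ["a"], retrieved, toy_supports))
print(triage_claim(claim, [], retrieved, toy_supports))
print(triage_claim("The device weighs 500 grams.", ["a"], retrieved, toy_supports))
```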

And imo the whole thing lives or dies on the segmentation step you flagged, which is the honest hard part. The "subject lives two sentences up" problem is decontextualization, and it's the same wall FActScore-style atomic-fact decomposition runs into. If the claim isn't self-contained the entailment verdict is meaningless no matter how good the NLI model is. Curious whether you resolve coreference into the claims before checking or generate self-contained claims directly. Also curious how you treat NLI neutral, since most hallucinations land there rather than in contradiction, so neutral-as-fail is basically the whole policy. Good writeup, this is the version of trustworthy worth building.