DEV Community

Brian Barbour

# How do you know an LLM answer is actually grounded, not just plausible? I measured it across 7 models and 4 regulated domains

I built a pipeline, solo, that audits LLM answers against the source text they're supposed to be grounded in, and ran it across 7 models and 4 regulated corpora. I'm sharing the method and the full results, and I'd like technical criticism. The project is Veritrooper.

## The problem

When an LLM answers over your documents, "sounds right" and "is right" are different things. A fluent answer can quietly fabricate a number, flip a direction, or confidently answer a question the source
doesn't actually address. In a regulated domain — tax, safety, drug labels, financial filings — that gap matters.

## What it does

For a body of written rules, the harness:

  • Generates ~1,000 questions from the source text, including deliberately unanswerable ones, to catch a model that bluffs instead of saying "not in the evidence."
  • Has the model answer with the source in context, then runs deterministic code over the answer — numbers must trace back to the source, direction/polarity must match, refusals on unanswerable questions are credited. The model is kept to prose; the checking is code.
  • Routes flagged answers to a diagnostic stage, and sends only the contested ones (~4%) to a frontier model from a different vendor — so no model gets the final word on its own output.
  • Detects malformed auto-generated questions and drops them from both arms' denominators, symmetrically.
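To make the "checking is code" step concrete, here is a minimal sketch of one such check: every number in the answer must appear in the source, and a refusal is credited only on an unanswerable question. This is not the harness's actual code. The function names, the refusal phrases, and the number regex are all assumptions made for illustration, and the polarity check and the handling of refusals on answerable questions are left out.

```python
import re
from decimal import Decimal

# Hypothetical refusal phrases; the post does not list the real ones.
REFUSAL_MARKERS = ("not in the evidence", "not stated in the source")

# Matches integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[Decimal]:
    """Collect numeric values in text, so that 14,600 and 14600.0 compare equal."""
    return {Decimal(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)}


def is_refusal(answer: str) -> bool:
    """True if the answer contains one of the assumed refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def check_answer(answer: str, source: str, answerable: bool) -> dict:
    """Deterministic check: no LLM is involved in this step."""
    if not answerable:
        # On an unanswerable question, only a refusal passes.
        return {"passed": is_refusal(answer), "untraced": []}
    # On an answerable question, every number must trace back to the source.
    untraced = sorted(extract_numbers(answer) - extract_numbers(source))
    return {"passed": not untraced, "untraced": untraced}


source = "The standard deduction is $14,600 for single filers in 2024."
grounded = check_answer("The deduction is $14,600.", source, answerable=True)
fabricated = check_answer("The deduction is $15,000.", source, answerable=True)
refused = check_answer("That is not in the evidence.", source, answerable=False)
bluffed = check_answer("It is 30 days.", source, answerable=False)
```

In this sketch `grounded` and `refused` pass, while `fabricated` and `bluffed` fail. A flagged answer like `fabricated` is what would then go on to the diagnostic stage.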

## The experiment

Four regulated corpora: IRS tax code, OSHA 29 CFR, FDA drug labels, SEC 10-Ks. Seven models, from a 7B local model up through three frontier vendors.

The baseline isn't a straw man: it's the same model with ordinary BM25 retrieval (top-5 chunks), a normal RAG setup. The audited column is the same model given the correct source passage, with its answer then checked by the harness. The gap is roughly the accuracy that ordinary retrieval leaves on the table.
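For readers who haven't used it, the baseline's retrieval step can be sketched in a few lines. This is a plain Okapi BM25 scorer written for illustration, not the pipeline's retriever: the tokenizer, the `k1` and `b` defaults, and the example chunks are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query: str, chunks: list[str], k: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return the indices of the k highest-scoring chunks under Okapi BM25."""
    if not chunks:
        return []
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))

    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


chunks = [
    "Employers must provide fall protection at heights of six feet.",
    "The standard deduction applies to single filers.",
    "Drug labels list contraindications and dosage.",
]
top = bm25_top_k("fall protection heights", chunks, k=5)
```

Here `top[0]` is the fall-protection chunk. When lexical overlap is weaker than in this toy case, the correct passage can fall outside the top 5, which is the kind of miss the baseline column absorbs.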

A slice (IRS tax code):

| Model | Baseline (BM25 RAG) | Audited |
|---|---|---|
| Claude Opus 4.8 | 94.36% | 100.00% |
| GPT-5.5 | 93.04% | 99.70% |
| Qwen 2.5 72B | 86.76% | 98.19% |
| Qwen 2.5 7B | 86.58% | 96.67% |
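The baseline-to-audited gap for each row can be read off as a difference in percentage points. A small sketch, using only the four rows shown in this slice (the variable and function names are mine):

```python
# Percent accuracy copied from the IRS slice: (baseline, audited).
irs_slice = {
    "Claude Opus 4.8": (94.36, 100.00),
    "GPT-5.5": (93.04, 99.70),
    "Qwen 2.5 72B": (86.76, 98.19),
    "Qwen 2.5 7B": (86.58, 96.67),
}


def gap_points(baseline: float, audited: float) -> float:
    """Audited minus baseline, in percentage points, rounded to 2 decimals."""
    return round(audited - baseline, 2)


gaps = {model: gap_points(b, a) for model, (b, a) in irs_slice.items()}
for model, gap in gaps.items():
    print(f"{model}: +{gap:.2f} points")
```

On these four rows the gaps come out to 5.64, 6.66, 11.43 and 10.09 points, so in this slice the two Qwen models gain more than the two frontier models.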

Across all 7 models, audited accuracy lands in the ~95–100% band on most runs. The full 7×4 matrix — baseline and audited, nothing cherry-picked, including a weak 70B model that stays low — is here:

https://veritrooper.com/audit.html

Every number reconstructs from downloadable run records: every question, the model's answer, the expected answer, and the grading rationale. You can check the work rather than take my word for it.
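As a rough idea of what reconstructing a number looks like, the sketch below recomputes per-arm accuracy from a list of records and drops malformed questions from both arms' denominators. The record schema here (`question_id`, `arm`, `correct`, `malformed`) is hypothetical; the actual run records may be shaped differently.

```python
from collections import defaultdict


def accuracy_by_arm(records: list[dict]) -> dict[str, float]:
    """Percent correct per arm, after dropping malformed questions symmetrically."""
    # A question flagged malformed in ANY arm is removed from EVERY arm.
    malformed_ids = {r["question_id"] for r in records if r["malformed"]}

    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        if r["question_id"] in malformed_ids:
            continue
        totals[r["arm"]] += 1
        hits[r["arm"]] += int(r["correct"])
    return {arm: 100.0 * hits[arm] / totals[arm] for arm in totals}


# Hypothetical records: q3 is flagged malformed in the baseline arm only.
records = [
    {"question_id": "q1", "arm": "baseline", "correct": True, "malformed": False},
    {"question_id": "q1", "arm": "audited", "correct": True, "malformed": False},
    {"question_id": "q2", "arm": "baseline", "correct": False, "malformed": False},
    {"question_id": "q2", "arm": "audited", "correct": True, "malformed": False},
    {"question_id": "q3", "arm": "baseline", "correct": False, "malformed": True},
    {"question_id": "q3", "arm": "audited", "correct": True, "malformed": False},
]
scores = accuracy_by_arm(records)
```

Dropping `q3` from both arms, rather than only from the arm where it was flagged, keeps the two columns comparable: both are scored over the same two questions, giving 50% for the baseline and 100% for the audited arm in this toy data.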

## What I'm curious about

How do others measure whether a RAG answer is genuinely grounded vs. merely plausible — a single judge-LLM, or something cross-model? Where would you poke holes in this?
