I built a pipeline, solo, that audits LLM answers against the source text they're supposed to be grounded in, and ran it across 7 models and 4 regulated corpora. I'm sharing the method and the full results, and I'd like technical criticism. (Project: Veritrooper.)
## The problem
When an LLM answers over your documents, "sounds right" and "is right" are different things. A fluent answer can quietly fabricate a number, flip a direction, or confidently answer a question the source
doesn't actually address. In a regulated domain — tax, safety, drug labels, financial filings — that gap matters.
## What it does
For a body of written rules, the harness:
- Generates ~1,000 questions from the source text, including deliberately unanswerable ones, to catch a model that bluffs instead of saying "not in the evidence."
- Has the model answer with the source in context, then runs deterministic code over the answer — numbers must trace back to the source, direction/polarity must match, refusals on unanswerable questions are credited. The model is kept to prose; the checking is code.
- Routes flagged answers to a diagnostic stage, and sends only the contested ones (~4%) to a frontier model from a different vendor — so no model gets the final word on its own output.
- Detects malformed auto-generated questions and drops them from both arms' denominators, symmetrically.
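
To make the "checking is code" step concrete, here is a minimal sketch of the number-tracing and refusal checks. It is an illustration under assumptions, not the harness itself: the function names, the regex, and the refusal marker string are all hypothetical, and the direction/polarity check is left out.

```python
import re
from decimal import Decimal

# Hypothetical pattern: digits with optional thousands separators and decimals.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

# Assumed refusal marker; the real harness may detect refusals differently.
REFUSAL_MARKER = "not in the evidence"


def extract_numbers(text: str) -> set[Decimal]:
    """Collect numeric values, ignoring thousands separators (14,600 == 14600)."""
    return {Decimal(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)}


def untraceable_numbers(answer: str, source: str) -> set[Decimal]:
    """Numbers stated in the answer that never appear in the source."""
    return extract_numbers(answer) - extract_numbers(source)


def passes_check(answer: str, source: str, answerable: bool) -> bool:
    """Deterministic verdict for one answer.

    Assumption: refusing an unanswerable question is credited, and
    refusing an answerable one counts as a failure.
    """
    refused = REFUSAL_MARKER in answer.lower()
    if not answerable:
        return refused
    if refused:
        return False
    return not untraceable_numbers(answer, source)


source = "The standard deduction is $14,600 for single filers in 2024."
print(passes_check("The deduction is $14,600.", source, answerable=True))
print(passes_check("The deduction is $15,600.", source, answerable=True))
print(passes_check("That is not in the evidence.", source, answerable=False))
```

Comparing values as `Decimal` rather than as strings means `3.0` in an answer still traces to `3` in the source, which is one of several normalization choices a real checker would have to make.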
## The experiment
Four regulated corpora: IRS tax code, OSHA 29 CFR, FDA drug labels, SEC 10-Ks. Seven models, from a 7B local model up through three frontier vendors.
The baseline isn't a straw man: it's the same model with ordinary BM25 retrieval (top-5 chunks), a normal RAG setup. The audited column is that same model handed the correct source passage and then
checked. The gap between the two columns is roughly the accuracy that ordinary retrieval leaves on the table.
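
For readers who haven't implemented the baseline side, this is roughly what "BM25, top-5 chunks" means. It is a self-contained sketch of standard Okapi BM25, not my retrieval code: the tokenizer, the `k1` and `b` defaults, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query: str, chunks: list[str], k: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return indices of the k highest-scoring chunks under Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))

    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


chunks = [
    "Employers must provide fall protection at heights of six feet.",
    "The standard deduction applies to single filers.",
    "Drug labels list contraindications and warnings.",
]
print(bm25_top_k("fall protection heights", chunks, k=1))
```

Because scoring is purely lexical, a question phrased differently from the rule text can miss the right chunk entirely, which is the failure mode the baseline column is meant to expose.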
A slice (IRS tax code):
| Model | Baseline (BM25 RAG) | Audited |
|---|---|---|
| Claude Opus 4.8 | 94.36% | 100.00% |
| GPT-5.5 | 93.04% | 99.70% |
| Qwen 2.5 72B | 86.76% | 98.19% |
| Qwen 2.5 7B | 86.58% | 96.67% |
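
Reading the gap off the slice above as audited minus baseline, in percentage points, is plain arithmetic. The snippet only restates the four rows of the table; `Decimal` is used so the subtraction is exact.

```python
from decimal import Decimal

# (baseline %, audited %) copied from the IRS tax code slice above.
slice_irs = {
    "Claude Opus 4.8": (Decimal("94.36"), Decimal("100.00")),
    "GPT-5.5": (Decimal("93.04"), Decimal("99.70")),
    "Qwen 2.5 72B": (Decimal("86.76"), Decimal("98.19")),
    "Qwen 2.5 7B": (Decimal("86.58"), Decimal("96.67")),
}

# Gap in percentage points: audited minus baseline.
gaps = {model: audited - baseline for model, (baseline, audited) in slice_irs.items()}

for model, gap in gaps.items():
    print(f"{model}: +{gap} pp")
# Claude Opus 4.8: +5.64 pp
# GPT-5.5: +6.66 pp
# Qwen 2.5 72B: +11.43 pp
# Qwen 2.5 7B: +10.09 pp
```
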
Across all 7 models, audited accuracy lands in the ~95–100% band on most runs. The full 7×4 matrix — baseline and audited, nothing cherry-picked, including a weak 70B model that stays low — is here:
https://veritrooper.com/audit.html
Every number reconstructs from downloadable run records: every question, the model's answer, the expected answer, and the grading rationale. You can check the work, not take my word for it.
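
If you want to recompute a cell yourself, the logic is roughly the following. This is a sketch under assumptions: the `Record` fields and the `malformed_ids` set are hypothetical stand-ins, and the downloadable records may use a different schema. The point it illustrates is the symmetric exclusion, where a malformed question leaves both arms' denominators at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One graded answer (hypothetical schema, not the published one)."""
    question_id: str
    arm: str        # "baseline" or "audited"
    correct: bool


def accuracy_by_arm(records: list[Record], malformed_ids: set[str]) -> dict[str, float]:
    """Accuracy per arm after dropping malformed questions from every arm."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for rec in records:
        if rec.question_id in malformed_ids:
            continue  # excluded from both arms, regardless of who got it right
        totals[rec.arm] = totals.get(rec.arm, 0) + 1
        hits[rec.arm] = hits.get(rec.arm, 0) + int(rec.correct)
    return {arm: hits[arm] / totals[arm] for arm in totals}


records = [
    Record("q1", "baseline", True),
    Record("q1", "audited", True),
    Record("q2", "baseline", False),
    Record("q2", "audited", True),
    Record("q3", "baseline", True),   # q3 is malformed, so it counts for neither arm
    Record("q3", "audited", False),
]
print(accuracy_by_arm(records, malformed_ids={"q3"}))
```

Filtering by question ID before counting, rather than per arm, is what keeps the exclusion from quietly favoring whichever arm happened to fail the bad question.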
## What I'm curious about
How do others measure whether a RAG answer is genuinely grounded vs. merely plausible — a single judge-LLM, or something cross-model? Where would you poke holes in this?