I'm building a small, fully-local research assistant: a RAG over my own papers, running on Ollama, nothing leaving the box. The risk that actually worries me isn't speed or cost. A research tool that cites a wrong number while sounding sure of itself is worse than no tool, because you'll believe it.
Andrej Karpathy's llm-wiki note had a piece I kept thinking about. Instead of re-retrieving from scratch each query, you have the model build a persistent wiki, and during ingest a lint pass checks the pages against each other for contradictions. I wanted something adjacent at answer time: after the RAG drafts an answer, break it into claims and check each against the sources, then flag whatever a source doesn't actually support.
I should be precise about what that is, since the post is partly about citation accuracy. Karpathy's lint compares wiki pages to each other during ingest. What I built compares each answer-claim to its retrieved passage at answer time. That's groundedness (or faithfulness) checking, the same family as the RAGAS faithfulness metric and various self-check methods. The idea to bolt it on came from llm-wiki; the mechanism is standard groundedness checking, run locally on a small model.
I built it, measured it, and the honest version of the result is better than a clean win would have been. It includes the part where I was wrong twice about what was in my own corpus.
The verify layer
About 80 lines on top of my existing RAG. After the normal retrieve-and-answer step:
- Decompose the draft into atomic claims (one local LLM call).
- For each claim, an LLM-as-judge call returns
{supported, cite, why}. Supported only when a specific excerpt states it. - Flag the unsupported ones.
Here's one verdict verbatim, so it's concrete. The claim was a deliberately corrupted "AUROC is 0.92," checked against the passage that reports 0.804:
{ "supported": false,
"cite": null,
"why": "passage states AUROC 0.804; 0.92 does not appear" }
Cost, since this is a local-8B context and it matters: verify turns one answer call into roughly N+2 calls (one to decompose, one per claim to check). For a five-claim answer on my 1080 Ti that's about 15-20 extra seconds. Not free, not painful.
First eval, and almost shipping a false finding
To see if it catches fabrication I wrote some questions I assumed weren't answerable from my three-paper corpus. One asked for the AUROC of the synergy model "on the held-out test set." The baseline answered "0.804 [1]," and my verify layer passed it. I wrote it up as a miss: verify let a fabricated statistic through.
Then I grepped my own corpus for 0.804. It was there, seven times. So I rewrote it the other way: the number was real, the model was right, verify passed it correctly. A tidier story, and I almost shipped that one too.
It's also wrong. Look at what the passages actually say. Every 0.804 is reported as a GroupKFold cross-validation result, and one line states it outright: "No separate held-out test set was used due to the limited sample size." My question asked for a held-out test AUROC. There is no held-out test set. The model took a real cross-validation number and pinned it to an evaluation that doesn't exist, and verify passed it because the digits 0.804 were sitting right there in the context.
So I was wrong twice about my own corpus, in opposite directions, before landing on what happened: a right-number-wrong-context hallucination that claim-checking sailed straight past. The first lesson is awkward but worth stating plainly: you can't measure hallucination without ground truth, and "the number is real" is not the same as "the answer is right."
Doing it properly
I threw out the vibe-based questions and built a controlled benchmark: eight pairs of claims. Each pair has one true claim (a fact I confirmed by grep) and one false claim, the same statement with a number or entity corrupted to something I confirmed was absent from the corpus. AUROC 0.804 against AUROC 0.92. "pathway membership and gene essentiality scores" against "patient age, BMI, and smoking status." "implicates the hippocampus, amygdala, prefrontal cortex" against "leaves the hippocampus and amygdala unaffected."
That labeling step earned its keep immediately. Three of my first-draft "false" claims used terms (cerebellum, CRISPR, Loewe) that grep found were actually in the corpus, so they weren't false at all. The same mistake as the AUROC, caught before it counted.
Then I feed the verifier the context that supports the true claim and ask it to judge both claims against that same context.
The benchmark result
Both the lenient prompt and a stricter "numbers must match verbatim" prompt scored the same: 8 of 8 fabrications caught, 0 of 8 true claims wrongly flagged.
Two caveats keep that honest. With n=8 it's a point estimate, not a guarantee; the Wilson 95% interval on 8/8 runs from about 67% to 100%, so read it as "no failures in eight trials." And note the setup is the easiest possible one: the supporting passage is guaranteed present and the corrupted value is guaranteed absent. This measures the judge when retrieval is already perfect, not the pipeline. The strict prompt changing nothing isn't evidence it's useless either, because no pair in the set would separate the two prompts. Every corruption is far from its real value. 0.804 vs 0.81 would be the test, and I don't have one.
So: given perfect context and a blatant corruption, a local 8B model judges reliably. Good to know, and not the hard case.
The hard case the benchmark can't reach
The benchmark used false claims I wrote by hand. That sidesteps the scenario I actually care about for a co-scientist: a hallucination the model itself produces, judged by a verifier that is the same model and may share its blind spots.
So I ran a second eval. Six false-premise or absent-specific questions, ground-truth checked. The model's own draft is the unit under test, and I judge each draft twice: once with the same model (qwen3:8b) and once with a different one (gemma4 12B QAT). One caveat I owe you: that judge is larger than the 8B answerer, so any cross-model gain here is "different and bigger," not cleanly "different." A same-size cross-model check would isolate that, and I didn't run one.
On four of six the model abstained correctly. Asked the dose of meropenem (never in the corpus), the sample size of a meta-analysis (the stress paper is a scoping review with none), it said the context doesn't contain it. Grounding works when the gap is obvious.
On the other two it fabricated, and both were right-number-wrong-context:
- "AUROC on the held-out test set" → "0.804" again. The held-out AUROC the question presumes does not exist.
- "AUROC after removing the pathway-membership features" → "0.627." That 0.627 is real, but it's the number for removing ribosome-targeting combinations, a different ablation. The model grabbed a nearby real value and reattributed it.
The same-model judge caught zero of those two. It rubber-stamped both of its own fabrications. The cross-model judge caught one: gemma flagged the 0.627, with the reason "that AUROC resulted from removing ribosome-targeting combinations." A different model checked what the number was attached to; the original model, having produced the misattribution, did not re-examine it. Neither judge caught the held-out one. The digits 0.804 are real, so there's nothing in the number to contradict, and catching it means knowing that no held-out set exists. When I checked, the sentence that says so, "no separate held-out test set was used," wasn't even in the retrieved context. The model never saw the thing that made the premise false.
That's the measured core of this whole exercise, and the failures split three ways, not two. A value simply absent from the context (0.92) gets caught reliably, even when the same model judges. A value that's real but pinned to the wrong thing (0.627, from a different ablation) slips past the same-model judge and is only sometimes recovered by a second one. A false premise, like a held-out AUROC when no held-out set exists, gets through both, and that one bottomed out in retrieval since the sentence refuting it never reached the context. The layer I built reliably catches only the first kind. A model refereeing its own output inherits its own blind spots; a second model recovers some of the misattributions, not the premise-level error.
What the flags were actually pointing at
One more thread from the messy first eval. Back then the verifier had flagged a couple of true claims ("prefrontal cortex," "gene essentiality scores") as unsupported. After the clean benchmark, where it never did that, those looked inconsistent.
The resolution: verify checks claims against the retrieved context, not the whole corpus. Those claims were true and in the corpus, but the passage proving them wasn't among the chunks retrieved for that question. The verifier said "not in what you gave me," correctly.
A flag, then, can mean three things, and at answer time you can't tell them apart without ground truth: a retrieval miss, a real fabrication, or a true fact that lives outside the corpus. In the cases from my first eval that I could actually check, the flags were retrieval misses, which is why "re-retrieve and re-check" is a better default reaction than "delete the claim." I didn't measure the proportion, so I won't put a number on how often each happens.
Takeaways
What I actually measured, on one corpus with qwen3:8b:
- Given good context and a blatantly corrupted value, the verifier is reliable (8/8, with the n=8 caveat).
- The model abstains on clearly-absent information (4/6 here).
- Both fabrications were real-number-wrong-context: one a misattributed value (
0.627, the wrong ablation), one a false premise (a held-out AUROC where no held-out set exists). The same-model judge caught neither (0/2); it rubber-stamped its own output. - The cross-model judge caught the misattributed value but not the false premise (1/2). The premise error was the harder kind, and the sentence that would refute it wasn't even retrieved.
What I suspect but did not measure, so take it as an impression: in a grounded RAG, most of what looks like the model inventing facts is really retrieval not surfacing the passage, or the prompt not grounding hard enough. If that holds, the leverage is in retrieval and grounding, not a bigger model. I'd want a labeled run before saying it harder.
And the practical version: claim-checking reliably catches only values that are absent from the context. It misses a real number attached to the wrong question, and misses a false premise outright. Use a different model to judge than to answer, not on the strength of one recovered case out of two, but because a model that produced a misattribution won't re-examine it while an unrelated one cross-checks what the number is attached to. Expect it to recover some misattributed values, not false premises. Treat a flag as "go re-retrieve," not "delete." And you cannot evaluate any of this without ground-truth labels. I almost shipped two false findings about my own corpus, and grep fixed both.
Limitations are most of the honesty here: one small corpus, an 8B answerer, eight hand-built pairs plus six probes, no held-out scoring, no subtle-but-true corruptions like "significant" vs "trending." This is a first measurement, not a verdict.
Credit: the inspiration is Karpathy's llm-wiki pattern and the full desktop implementation at nashsu/llm_wiki. What I built is plain groundedness checking moved to answer time, run on a local box. The code and eval scripts are in my paper-rag repo if you want to poke holes in the setup. Please do.
Top comments (0)