<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Volk</title>
    <description>The latest articles on DEV Community by Jason Volk (@jason_volk_d5cb94161bd0e8).</description>
    <link>https://dev.to/jason_volk_d5cb94161bd0e8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2167904%2Ff8a83932-e7ab-440f-a25f-fe19718747b3.png</url>
      <title>DEV Community: Jason Volk</title>
      <link>https://dev.to/jason_volk_d5cb94161bd0e8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jason_volk_d5cb94161bd0e8"/>
    <language>en</language>
    <item>
      <title>The Problem with Pointwise Verification</title>
      <dc:creator>Jason Volk</dc:creator>
      <pubDate>Thu, 30 Apr 2026 17:12:21 +0000</pubDate>
      <link>https://dev.to/jason_volk_d5cb94161bd0e8/the-problem-with-pointwise-verification-44cf</link>
      <guid>https://dev.to/jason_volk_d5cb94161bd0e8/the-problem-with-pointwise-verification-44cf</guid>
      <description>&lt;p&gt;Every hallucination detector I’ve seen does the same thing: take a claim, take a source, compute some similarity score. Cosine similarity. NLI. LLM-as-judge. They check claims one at a time against sources one at a time.&lt;/p&gt;

&lt;p&gt;This is a local check. And local checks have a structural problem that no amount of model scaling fixes.&lt;/p&gt;

&lt;p&gt;Consider a RAG system checking five claims against a contract. Each claim individually matches a source passage with high similarity. A pointwise verifier returns “grounded” for all five. But claim 1 says the term is 12 months. Claim 3 references a 24-month renewal period. Claim 5 assumes quarterly payments over the 12-month term.&lt;/p&gt;

&lt;p&gt;Locally, each claim is grounded. Globally, they cannot all be true simultaneously. The renewal period and the payment schedule are inconsistent with the stated term.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical failure mode. It is the structural reason why RAG hallucination rates remain high even with retrieval: the retrieval grounds individual claims but does not verify their joint consistency.&lt;/p&gt;

&lt;p&gt;What Sheaf Cohomology Actually Does&lt;br&gt;
A presheaf assigns data to each local region. A sheaf goes further: it requires that locally compatible data can be glued into a globally consistent section. When local data cannot be glued, the obstruction lives in the first cohomology group H1.&lt;/p&gt;

&lt;p&gt;This is not a confidence score. It is binary. Either the sections glue (H1 = 0) or they do not (H1 &amp;gt; 0). When they don’t, you have a certified structural contradiction. Not “likely contradictory.” Structurally impossible to reconcile.&lt;/p&gt;
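
&lt;p&gt;As a concrete picture, here is a toy sketch under my own assumptions, not SATYA’s implementation: given the coboundary maps of the cellular complex as 0/1 matrices, the dimension of H1 over F2 falls out of two GF(2) ranks.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy sketch: dim H^1 over F2 from coboundary matrices (numpy 0/1 arrays).
# d0 maps vertex data to edges (shape: n_edges x n_vertices);
# d1 maps edge data to 2-cells (shape: n_faces x n_edges); d1 @ d0 = 0.
import numpy as np

def rank_gf2(m):
    """Rank of a 0/1 matrix over F2 via Gaussian elimination mod 2."""
    a = (np.array(m) % 2).astype(np.uint8)
    rank = 0
    for col in range(a.shape[1]):
        pivot = next((r for r in range(rank, a.shape[0]) if a[r, col]), None)
        if pivot is None:
            continue
        a[[rank, pivot]] = a[[pivot, rank]]  # move the pivot row up
        for r in range(a.shape[0]):
            if r != rank and a[r, col]:
                a[r] ^= a[rank]              # eliminate modulo 2
        rank += 1
    return rank

def h1_dim(d0, d1):
    """dim H^1 = dim ker(d1) - dim im(d0), both computed over F2."""
    n_edges = d0.shape[0]
    return (n_edges - rank_gf2(d1)) - rank_gf2(d0)
&lt;/code&gt;&lt;/pre&gt;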

&lt;p&gt;The second tool is the cokernel of the evidence morphism. If the source documents don’t cover a claim at all, the cokernel is non-zero at that stalk. That’s not “low confidence.” That’s algebraic proof that no section of the source sheaf maps to that claim.&lt;/p&gt;

&lt;p&gt;H1 catches contradictions. The cokernel catches fabrications. Both are deterministic and exact over F2.&lt;/p&gt;
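
&lt;p&gt;In the same toy setting, the evidence morphism can be pictured as a 0/1 matrix with one row per claim stalk and one column per source section; the shapes and the field are my assumptions for illustration. A claim whose row is identically zero has no source section mapping to it, which is the simplest witness that the cokernel is non-zero at that stalk.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy sketch: spot claims outside the image of the evidence morphism.
# evidence[i, j] = 1 means source section j supports claim stalk i.
# Full dim coker(E) = n_claims - rank_F2(E), using the GF(2) rank above.
import numpy as np

def ungrounded_claims(evidence):
    """Indices of claim stalks that no source section maps to (zero rows)."""
    e = np.array(evidence) % 2
    return [i for i in range(e.shape[0]) if not e[i].any()]

evidence = np.array([[1, 0, 0],
                     [0, 1, 1],
                     [0, 0, 0]])   # claim 2: nothing in the sources supports it
print(ungrounded_claims(evidence))  # prints [2]
&lt;/code&gt;&lt;/pre&gt;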

&lt;p&gt;SATYA: The Implementation&lt;br&gt;
I built this as a working system called SATYA. Claims and source passages are mapped to a cellular sheaf. H1 detects global contradictions. The cokernel of the evidence morphism identifies ungrounded claims. The computation is deterministic, exact over F2, and signed with Ed25519. Implementation details are covered by patent application 19/649,080.&lt;/p&gt;

&lt;p&gt;The tech stack is aggressively minimal: Python, spaCy for sentence splitting, SciPy’s sparse eigensolver, and Ed25519 for signing. No PyTorch. No TensorFlow. No HuggingFace. No GPU.&lt;/p&gt;

&lt;p&gt;The Cokernel as a Retrieval Trigger&lt;br&gt;
This is the part that surprised me.&lt;/p&gt;

&lt;p&gt;Standard retrieval-augmented generation searches for the user’s entire query. That’s O(query). Every question triggers a full search regardless of what’s already grounded.&lt;/p&gt;

&lt;p&gt;SATYA computes the cokernel first, identifying exactly which claims lack source grounding. Those specific claims become targeted retrieval queries against external knowledge bases (CrossRef, Semantic Scholar, Wikipedia, arXiv, PubMed).&lt;/p&gt;

&lt;p&gt;If 85% of claims are grounded in the user’s documents and 15% fall in the cokernel, retrieval fires only for the 15%. That’s O(gap), not O(query). Retrieval cost drops 85%.&lt;/p&gt;
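
&lt;p&gt;A hedged sketch of the control flow, with search_external standing in as a placeholder client rather than SATYA’s API: only the claims the cokernel flags become retrieval queries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of O(gap) retrieval: fire external search only for the claims
# the cokernel flags as ungrounded, never for the whole query.
def targeted_retrieval(claims, ungrounded_indices, search_external):
    """search_external is a placeholder for a CrossRef/arXiv/PubMed client."""
    results = {}
    for i in ungrounded_indices:
        results[i] = search_external(claims[i])  # one targeted query per gap
    return results

# If 85% of claims are grounded locally, this loop only touches the other 15%.
&lt;/code&gt;&lt;/pre&gt;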

&lt;p&gt;The cokernel doesn’t just detect ungrounded claims. It identifies exactly which claims need external verification and turns them into search queries. The mathematical structure of the verification failure itself defines what needs to be fetched.&lt;/p&gt;

&lt;p&gt;The Halvorsen Test&lt;br&gt;
I tested the system with a fabricated academic citation: a paper titled “Adaptive Social Quantum Networks” by “E. R. Halvorsen,” supposedly published in 2024. This paper does not exist.&lt;/p&gt;

&lt;p&gt;Standard LLMs will fabricate a plausible abstract for it. Standard RAG returns a “low confidence” or “likely false” score.&lt;/p&gt;

&lt;p&gt;SATYA searched CrossRef, Semantic Scholar, Wikipedia, arXiv, and PubMed in parallel. Found zero corroboration. The cokernel was non-zero at every claim stalk. The system returned UNGROUNDED with a signed cryptographic certificate: input hash, source hashes, H1 dimension, cokernel dimension, timestamp, Ed25519 signature.&lt;/p&gt;

&lt;p&gt;Not “we think it’s fake.” A mathematical certificate of ungroundedness in the queried academic record. 355 milliseconds. On a laptop.&lt;/p&gt;
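
&lt;p&gt;To make the certificate concrete, here is a minimal sketch that uses the third-party cryptography package for Ed25519; the field names are my assumptions based on the description above, not SATYA’s actual schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative certificate: hashes, sheaf invariants, timestamp, signature.
# Field names are assumed for illustration; signing uses the cryptography package.
import hashlib
import json
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def make_certificate(claim_text, sources, h1_dim, coker_dim, key):
    cert = {
        "input_hash": hashlib.sha256(claim_text.encode()).hexdigest(),
        "source_hashes": [hashlib.sha256(s.encode()).hexdigest() for s in sources],
        "h1_dim": h1_dim,
        "cokernel_dim": coker_dim,
        "verdict": "UNGROUNDED" if coker_dim else "GROUNDED",
        "timestamp": int(time.time()),
    }
    payload = json.dumps(cert, sort_keys=True).encode()
    cert["signature"] = key.sign(payload).hex()  # Ed25519 over the canonical payload
    return cert

key = Ed25519PrivateKey.generate()
print(make_certificate("Adaptive Social Quantum Networks, Halvorsen 2024",
                       ["(no corroborating source found)"], 0, 1, key))
&lt;/code&gt;&lt;/pre&gt;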

&lt;p&gt;Try it yourself: invariant.pro/receipts&lt;/p&gt;

&lt;p&gt;Click “Fabricated paper” or type any claim. The verdicts cannot be pre-computed.&lt;/p&gt;

&lt;p&gt;Benchmarks&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Examples&lt;/th&gt;&lt;th&gt;Balanced Accuracy&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;AggreFact (overall)&lt;/td&gt;&lt;td&gt;30,420&lt;/td&gt;&lt;td&gt;66.4%&lt;/td&gt;&lt;td&gt;11 datasets, faithfulness domain&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;AggreFact (bridge)&lt;/td&gt;&lt;td&gt;13,208&lt;/td&gt;&lt;td&gt;87.2%&lt;/td&gt;&lt;td&gt;43% of traffic, high-confidence sheaf&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;ContractNLI&lt;/td&gt;&lt;td&gt;2,091&lt;/td&gt;&lt;td&gt;71.4%&lt;/td&gt;&lt;td&gt;Held-out legal NDA verification&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;State-of-the-art LLM-based detectors score higher on overall balanced accuracy. That is a fact. The tradeoff is explicit: they provide probabilistic assessments with higher coverage. SATYA provides deterministic, cryptographically signed verdicts on the traffic it handles.&lt;/p&gt;

&lt;p&gt;For applications where the verification result must survive a courtroom, a regulatory audit, or a compliance review, determinism and provability matter more than coverage.&lt;/p&gt;

&lt;p&gt;What Doesn’t Work&lt;br&gt;
When the sheaf doesn’t fire (insufficient edges for cohomology), a flat subject-verb-object comparator handles the remaining traffic at 52% balanced accuracy. I’ve tried dependency tree parsing, value normalization, and negation detection to improve it. All regressed or flat-lined.&lt;/p&gt;

&lt;p&gt;This suggests a boundary: structural verification via topology works well for claims with sufficient relational structure (contracts, policies, structured documents), but degrades on claims that lack extractable predicates. Whether this boundary is fundamental or an artifact of my extraction pipeline is an open question.&lt;/p&gt;

&lt;p&gt;The syntactic claim extractor was built for structured document language. Short natural language sentences sometimes fail to produce the concept keys needed for sheaf construction. A raw text fallback using spaCy NER is in place but less precise than the template-based extraction.&lt;/p&gt;

&lt;p&gt;The Tradeoff, Plainly&lt;br&gt;
I am not claiming this replaces ML-based hallucination detectors. I am claiming that a deterministic, zero-parameter verification layer is structurally possible, practically useful, and provides guarantees that no learned model can offer: determinism, reproducibility, and cryptographic provability.&lt;/p&gt;

&lt;p&gt;Different tools for different requirements. When you need coverage, use a classifier. When you need proof, use the topology.&lt;/p&gt;

&lt;p&gt;Live Demo: invariant.pro/receipts&lt;/p&gt;

&lt;p&gt;Type any claim. Real or fabricated. Every verdict is signed.&lt;/p&gt;

&lt;p&gt;Verified Chat: invariant.pro/chat&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
