Alex

Posted on Jul 4

Catch LLM multi-hop hallucinations with zero extra tokens

#ai #llm #python #machinelearning

LLMs don't usually fail at facts. They fail at composing facts.

Ask a model "who is Alice's parent?" and it answers reliably. Ask "is Alice an
ancestor of Dave?" — a conclusion that requires chaining three parent facts —
and accuracy falls off a cliff. Here is DeepSeek on CLUTRR, a public
kinship-reasoning benchmark, measured by the number of composition steps:

hop:      2     3     4     5     6     7     8
DeepSeek: 83%   42%   25%   25%   42%   17%   8%

83% at two hops. 8% at eight hops. And the failures aren't shy "I don't
know" answers — the model confidently fabricates relatives that don't exist.

The usual fixes all have the same problem: they fight an LLM with more LLM.
Ask the model to double-check itself and you pay for a second call — I measured
+110% tokens — and it still hallucinates (34% precision on the check).
Sample five chains-of-thought and vote, and you multiply your bill by five for
a statistical improvement with no guarantee attached.

There's an older, cheaper tool for this specific job.

Relations are matrices

If your facts are triples — (alice, parent, bob) — then each relation is a
boolean matrix R where R[i][j] = 1 iff the fact holds. And then:

Composition is matrix multiplication. "Grandparent" is literally R_parent · R_parent.
Transitive closure is a sum of powers. "Ancestor" is Σ R_parentᵏ.
A claim is true iff a path exists — and you can return the path as a proof.

None of this is new math (it's reachability, the Katz index, the Neumann
series — decades old). The point is what it buys you when you bolt it onto an
LLM: a verifier that accepts a multi-hop claim if and only if a proof path
exists in the facts. It cannot be sweet-talked into accepting a fabricated
relation, it costs zero model tokens, and when it says yes, it shows its work.

Ten lines to a guaranteed check

pip install grounded-reasoning

from grounded_reasoning import GroundedReasoner

gr = GroundedReasoner()
gr.add_facts([
    ("alice", "parent", "bob"),
    ("bob",   "parent", "carol"),
    ("carol", "parent", "dave"),
])

gr.verify("alice", "dave", via="parent")
# Verdict(grounded=True, proof=['alice', 'bob', 'carol', 'dave'], ...)

gr.verify("alice", "zed", via="parent")
# Verdict(grounded=False, proof=None)   <- this is what a hallucination looks like

The typical integration is a post-filter: the LLM emits relational claims, and
only the ones with an evidence path reach the user.

llm_claims = [("alice", "carol", "parent"),   # true composition
              ("alice", "zed",   "parent")]   # fabricated

for claim, verdict in gr.filter_claims(llm_claims):
    print("KEPT" if verdict.grounded else "BLOCKED", claim)

Measured on real DeepSeek runs (48 multi-hop questions over supplied one-hop
facts, two seeds): raw multi-hop precision was 33–38%, with dozens of invented
names. After the filter: 100% precision, zero correct answers dropped —
the filter provably never rejects a claim that has a real path. On the same
CLUTRR chart from the top of this post, the grounded solver holds ~100% flat
from 2 to 10 hops.

The worst raw precision I've seen anywhere in this repo came from a denser,
deliberately anti-commonsense ontology — reversed and counter-intuitive
"is-a" relations layered into the same hierarchy, forcing the model past
whatever it thinks it already knows. Raw precision there fell to 31%, 106
fabricated edges. The filter caught 106 of 106, again with zero correct
claims dropped.

Detecting contradictions comes free, too. If a relation is supposed to be a
hierarchy (is-a, part-of, causes) and the asserted facts contain a cycle,
that's a certificate that something is wrong — and the library hands you the
cycle:

gr2 = GroundedReasoner()
gr2.add_facts([("cat", "is_a", "mammal"),
               ("mammal", "is_a", "animal"),
               ("animal", "is_a", "cat")])      # oops
gr2.contradictions("is_a")
# [['cat', 'mammal', 'animal']]

"But I don't have a knowledge base"

Fair — most people don't. Two answers, both shipped:

Use the LLM's own facts against it. Models are far more reliable on atomic
one-hop facts than on composition. So: ask the model for its one-hop facts,
build the closure locally, and reject any of the model's multi-hop claims
that contradict its own closure. Self-contradiction is the hallucination
signal. On a real taxonomy task with DeepSeek this took precision from 78% to
100% using no external knowledge whatsoever — with a proven two-sided bound on
what it costs you in recall, and a documented failure condition (it breaks in
domains where the model's atomic facts are themselves unreliable, like
"a whale is a fish" trap worlds).

Extracted, noisy graph? Trade the hard guarantee for a coverage guarantee.
If an LLM extracts the facts from raw text, edges go missing and the hard
proof-path guarantee no longer applies. A split-conformal threshold over the
operator's confidence score keeps a distribution-free guarantee of the form
"≥ 90% of true claims are kept." In the end-to-end test, DeepSeek's extraction
silently dropped 31% of the edges — and coverage still held at 93%. What
degrades under noise is efficiency (more false positives), never validity.

What this is not

Honesty section, because every tool post needs one:

Not a general reasoner. It verifies relational/transitive claims against a graph. It does nothing for free-standing factual errors ("the Eiffel Tower is in Rome") — that's a different problem.
Not new math. Reachability, Katz, conformal prediction, Horn logic — all classical. The contribution is the packaging: a unification theorem (the fuzzy-diffusion view, the operator view, and the spectral view are provably the same operator, to zero numerical error), measured guarantees with their token cost on a real LLM, and the no-KB / noisy-KB modes.
Only as good as the graph it's given. If the graph is wrong, the hard guarantee is about the graph, not about the world. The conformal mode softens exactly this, and nothing else does.

Don't take my word for any of this

The whole point of a verification tool is that you shouldn't have to trust its
author. Every number in this post has an offline lock:

git clone https://github.com/ALEXaquarius/grounded-reasoning
cd grounded-reasoning && pip install -e ".[dev]"
pytest tests/     # 116 tests, no API key — includes numerical verification
                  # of all seven theorems behind the guarantees

Or run the Colab quickstart
— it executes the theorem verifications live in about 30 seconds.

The repo (MIT) ships the library, an OpenAI/Anthropic function-calling tool,
and an MCP server, so an agent can call verify_relation like any other tool:
https://github.com/ALEXaquarius/grounded-reasoning

If you can construct a case where the guard accepts a claim with no path — or
rejects one that has a path — that would falsify the core theorem, and I very
much want to see it.

Top comments (1)

Dipankar Sarkar • Jul 5

The matrix reformulation is clean, and reachability-as-proof is the real win: you get an auditable path, not just a verdict. The place I would stress-test the 'zero extra tokens' claim is upstream of the matrix.

The triples have to come from somewhere. NL -> (alice, parent, bob) extraction is itself an LLM step, and that is exactly where a fabricated relative gets minted before your matrix ever sees it. Garbage triple in, valid path out, and now the hallucination has a proof attached.

Also R times R assumes relations compose transitively and cleanly. Real KBs have typed, probabilistic, non-transitive edges where the product is not meaningful. Within CLUTRR-style kinship it is a great fit. How far does it stretch once the relations stop being clean synthetic ones?