I’m genuinely obsessed with this phase of life where you can learn so much so fast — and suddenly you notice: oh wow, there’s real potential here. ✨
And yes, AI is controversial. A lot of the critique is valid and can't simply be debated away.
But if you zoom in on the technical side — the architecture, the retrieval mechanics, the verification problem — it’s an insanely interesting space.
So… last night I went full hyperfocus and wrote a research paper + reference implementation for something I call:
Open Hallucination Index (OHI)
OHI is a sovereign architectural framework that tries to move us from “Generative AI” to “Verifiable AI” by adding a deterministic Trust Layer after generation.
Motto: “LLMs Hallucinate – We Verify.”
The core problem: the epistemic deficit
LLMs are optimized for plausibility, not veracity.
They can sound true while being completely ungrounded — what my paper frames as stochastic fabulation.
On the human side, there’s also a psychological trap: people tend to grant fluent systems a kind of epistemic authority, which fuels automation bias (trusting the output even when it clashes with evidence).
So the goal here is not “make models perfect” — it’s:
make truth-checking systematic, auditable, and configurable.
Why “naive RAG” doesn’t solve it
Retrieval-Augmented Generation is a good direction, but the naive version still fails:
- vector similarity can fetch irrelevant chunks
- the generator can still override retrieved context with parametric memory
- “semantic proximity” is not “logical truth”
In the paper, I argue verification must be extrinsic and deterministic — not another stochastic step inside the generation loop.
OHI in one breath: a post-generation Trust Layer
Instead of trusting an answer as a blob of text, OHI verifies it at claim-level granularity.
1) Atomic Claim Decomposition (granularity is everything)
A paragraph can contain 9 correct facts and 1 subtle fabrication — binary “true/false” labels won’t cut it.
OHI decomposes a response into atomic claims, each represented as a tuple:
- (S, P, O) → subject / predicate / object
This is aligned with ideas used in fine-grained factuality metrics like FActScore:
T (text) → A = {c1, c2, ... cn} (atomic claims)
FActScore = (1 / |A|) * Σ 𝟙(claim is supported)
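To make that concrete, here's a tiny sketch of my own (not the reference implementation) showing what a claim set and that score reduce to:

from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicClaim:
    subject: str
    predicate: str
    obj: str
    supported: bool  # filled in later by the verification oracles

def factscore(claims: list[AtomicClaim]) -> float:
    """Fraction of atomic claims supported by the knowledge source."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)

claims = [
    AtomicClaim("Marie Curie", "won", "Nobel Prize in Physics", supported=True),
    AtomicClaim("Marie Curie", "born_in", "Paris", supported=False),  # the subtle fabrication
]
print(factscore(claims))  # 0.5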
2) Hybrid evidence retrieval (Graph + Vector + MCP)
For each atomic claim, OHI uses multiple verification oracles:
- Neo4j graph matching (deterministic structure: “does this relation/path exist?”)
- Qdrant vector retrieval (semantic candidates, “fuzzy” evidence)
- MCP evidence (live, standardized tool access — e.g., Wikipedia / Context7 docs)
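To show the division of labor, here's a rough sketch of what the two local oracles could look like behind a single claim check. The connection details, node labels, and collection name are my illustrative assumptions, not the actual OHI schema:

from neo4j import GraphDatabase          # deterministic structure check
from qdrant_client import QdrantClient   # fuzzy semantic evidence

graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
vectors = QdrantClient(host="localhost", port=6333)

def graph_supports(subject: str, predicate: str, obj: str) -> bool:
    """Exact structural check: does this (S, P, O) relation exist in the graph?"""
    query = (
        "MATCH (s:Entity {name: $s})-[r:RELATION {type: $p}]->(o:Entity {name: $o}) "
        "RETURN count(r) > 0 AS found"
    )
    with graph.session() as session:
        record = session.run(query, s=subject, p=predicate, o=obj).single()
        return bool(record and record["found"])

def vector_candidates(claim_embedding: list[float], k: int = 5):
    """Semantic neighbors: chunks that might support (or contradict) the claim."""
    return vectors.search(
        collection_name="ground_truth_chunks",
        query_vector=claim_embedding,
        limit=k,
    )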
3) Classify each claim + compute a Trust Score
Each claim becomes one of:
- Supported
- Contradicted
- Unverifiable (no decisive evidence)
Then the system aggregates per-claim signals into a final OHI Trust Score (0.0–1.0) and returns a visual overlay (green/red/gray) per sentence/claim.
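In code, the per-claim verdict is basically a tiny enum plus a color map. This is my own simplified sketch; the real thresholds are configurable and may differ:

from enum import Enum

class ClaimStatus(Enum):
    SUPPORTED = "supported"        # decisive evidence for the claim
    CONTRADICTED = "contradicted"  # decisive evidence against it
    UNVERIFIABLE = "unverifiable"  # no decisive evidence either way

# the visual overlay idea: green / red / gray per claim
OVERLAY_COLOR = {
    ClaimStatus.SUPPORTED: "green",
    ClaimStatus.CONTRADICTED: "red",
    ClaimStatus.UNVERIFIABLE: "gray",
}

def classify(trust: float, contradicted: bool,
             support_threshold: float = 0.7) -> ClaimStatus:
    """Illustrative thresholds only, not the paper's exact cut-offs."""
    if contradicted:
        return ClaimStatus.CONTRADICTED
    if trust >= support_threshold:
        return ClaimStatus.SUPPORTED
    return ClaimStatus.UNVERIFIABLE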
A mental model I like from the paper:
like nutritional labels for epistemic quality — not vibes, but a visible score.
The system architecture (sovereign by design)
OHI is designed for local sovereignty:
- no external OpenAI/Anthropic calls needed for verification
- local control over Ground Truth
- avoids the “fox guarding the henhouse” scenario
High-level components:
- vLLM hosting Qwen2.5 (paper discusses 7B + 32B variants; reference setup uses a quantized 7B AWQ)
- Neo4j as ontology store (index-free adjacency for multi-hop traversal)
- Qdrant as semantic store (embeddings for initial retrieval)
- Redis as cache (reduce repeated lookup cost)
- MCP servers as standardized “truth adapters” (Wikipedia, Context7)
In the paper I also highlight vLLM’s PagedAttention as a throughput enabler for parallel verification workloads.
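In practice, "sovereign" means the decomposition and verification prompts never leave the box: vLLM exposes an OpenAI-compatible endpoint locally, so the orchestrator just points its client at localhost. A minimal sketch, where the URL, model name, and prompt are placeholders for whatever the deployment actually runs:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API locally, so no external provider is involved.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # whichever model the local vLLM instance hosts
    messages=[
        {"role": "system", "content": "Decompose the user's text into atomic (subject, predicate, object) claims, as JSON."},
        {"role": "user", "content": "Marie Curie won the Nobel Prize in Physics and was born in Paris."},
    ],
    temperature=0.0,  # verification should be as deterministic as possible
)
print(response.choices[0].message.content)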
The algorithmic heart: Hybrid Verification Oracle
The paper formalizes the scoring logic as a hybrid oracle + weighted scorer.
Simplified version:
For each claim c:
S_graph = 1.0 if exact graph match else 0.0
S_vec = semantic similarity from vector retrieval
S_mcp = evidence signal from MCP tools
Trust(c) = α*S_graph + β*S_vec + γ*S_mcp
OHI Score = average Trust(c) across all claims
Default weights in the paper:
- α = 0.6 graph exact match (deterministic truth)
- β = 0.3 vector semantic match (plausibility)
- γ = 0.1 MCP evidence (contextual grounding)
Graph can effectively act as a “veto” against misleading semantic matches.
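Translated into a few lines of Python (my paraphrase of the formula, not the repo's actual scorer):

def trust_score(s_graph: float, s_vec: float, s_mcp: float,
                alpha: float = 0.6, beta: float = 0.3, gamma: float = 0.1) -> float:
    """Weighted hybrid oracle for a single claim, in [0.0, 1.0]."""
    return alpha * s_graph + beta * s_vec + gamma * s_mcp

def ohi_score(per_claim_scores: list[float]) -> float:
    """Final OHI Trust Score: average over all atomic claims."""
    return sum(per_claim_scores) / len(per_claim_scores) if per_claim_scores else 0.0

# Example: exact graph match, decent semantic support, weak MCP evidence
print(trust_score(s_graph=1.0, s_vec=0.8, s_mcp=0.5))  # 0.89

With the default weights, a claim with no exact graph match can never score above β + γ = 0.4, which is exactly that veto effect.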
Two tiny code peeks (from the reference implementation)
1) MCP over SSE: a clean “truth sidecar” session
OHI uses MCP to standardize tool access via JSON-RPC 2.0, and the Wikipedia adapter keeps it lightweight with SSE transport.
Here’s the fallback session context manager:
from contextlib import asynccontextmanager
from mcp import ClientSession
from mcp.client.sse import sse_client

@asynccontextmanager
async def _session_fallback(self):
    """Create a new MCP session (non-pooled, for fallback)."""
    async with sse_client(self._mcp_url) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session
In the paper, I also discuss session pooling to reduce repeated SSE handshake overhead.
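One way that pooling could look (a minimal sketch of my own, keeping a single long-lived session alive with AsyncExitStack rather than a full pool):

import asyncio
from contextlib import AsyncExitStack

from mcp import ClientSession
from mcp.client.sse import sse_client

class MCPSessionReuse:
    """Keep one initialized MCP session alive and reuse it across verification
    calls, so each claim doesn't pay the SSE handshake again. Illustrative only."""

    def __init__(self, mcp_url: str):
        self._mcp_url = mcp_url
        self._stack = AsyncExitStack()
        self._session: ClientSession | None = None
        self._lock = asyncio.Lock()

    async def get(self) -> ClientSession:
        async with self._lock:
            if self._session is None:
                read, write = await self._stack.enter_async_context(sse_client(self._mcp_url))
                self._session = await self._stack.enter_async_context(ClientSession(read, write))
                await self._session.initialize()
            return self._session

    async def aclose(self) -> None:
        await self._stack.aclose()
        self._session = None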
2) Streaming Wikipedia dump ingestion (memory-efficient)
Your “Ground Truth” is only as good as your ingestion pipeline.
To build the Neo4j graph from a massive Wikipedia XML dump, the importer uses streaming iterparse + aggressive element clearing (slightly simplified excerpt, wrapped as a generator for readability):

from xml.etree.ElementTree import iterparse

def iter_pages(file_handle, start_after_page_id=0):
    """Stream (title, page_id, revision_id, text) tuples from a Wikipedia XML dump."""
    in_page = False
    current_page = {}
    context = iterparse(file_handle, events=("start", "end"))

    for event, elem in context:
        tag = strip_namespace(elem.tag)  # existing helper: drops the MediaWiki XML namespace

        if event == "start" and tag == "page":
            in_page = True
            current_page = {}
        elif event == "end":
            if not in_page:
                elem.clear()
                continue

            # ... collect title, ids, text, etc. ...

            if tag == "page":
                in_page = False

                # resume support: skip until checkpoint
                page_id = current_page.get("page_id", 0)
                if page_id <= start_after_page_id:
                    current_page = {}
                    elem.clear()
                    continue

                if current_page.get("text"):
                    yield (
                        current_page.get("title", ""),
                        current_page.get("page_id", 0),
                        current_page.get("revision_id", 0),
                        current_page.get("text", ""),
                    )

                current_page = {}

            # Clear element to free memory
            elem.clear()
That’s how you survive multi-GB XML without melting RAM.
Performance reality check (what the paper reports)
The system is fast on DB queries, but the bottlenecks are:
- claim decomposition (LLM inference): ~200–500ms per text segment
- orchestration overhead in Python (GIL + I/O overhead accumulates under concurrency)
The paper frames a clear direction: porting core orchestration to Rust for near-real-time constraints.
Minimum “sovereign mode” hardware (paper summary)
- ~16GB RAM + NVIDIA GPU
- ~8GB VRAM for Qwen2.5-7B-AWQ
- Neo4j: 4–16GB RAM
- Qdrant: ~4GB RAM per 1M vectors (depends on setup)
“LLMs as Gardeners of the Graph” 🌿
This analogy is my favorite part of the future-facing discussion.
Imagine the “Graph” as a floating monolith of structured knowledge:
nodes = entities/facts, edges = relations.
LLMs aren’t “truth” — they’re gardeners:
- they shape, prune, and recombine knowledge into fluent language
- but they can also cross-breed the wrong things and produce believable nonsense
The future vision in the paper is recursive:
- LLMs help grow the graph (extract triples from new text into Neo4j)
- OHI then audits the same models against the structure they helped build
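A rough sketch of what that "gardening" write path could look like. The node labels, relationship type, and Cypher here are illustrative assumptions, not the project's actual schema:

from neo4j import GraphDatabase

MERGE_TRIPLE = (
    "MERGE (s:Entity {name: $subject}) "
    "MERGE (o:Entity {name: $object}) "
    "MERGE (s)-[:RELATION {type: $predicate}]->(o)"
)

def plant_triples(driver, triples):
    """Write LLM-extracted (subject, predicate, object) triples into the graph.
    These additions then become part of the ground truth OHI audits against."""
    with driver.session() as session:
        for subject, predicate, obj in triples:
            session.run(MERGE_TRIPLE, subject=subject, predicate=predicate, object=obj)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
plant_triples(driver, [("Qwen2.5", "developed_by", "Alibaba Cloud")])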
A self-correcting loop… with one big philosophical question:
who owns the graph? 👀
Try it / read it / fork it
- Research Paper (PDF): https://publuu.com/flip-book/1040591/2304964
- Demo: https://openhallucination.xyz
- Open Source Docker Image + API: https://github.com/shiftbloom-studio/open-hallucination-index-api
- Open Source Demo Website: https://github.com/shiftbloom-studio/open-hallucination-index
If you’re into epistemology, verification, GraphRAG, or just building systems that refuse to hand-wave truth…
I’d love feedback, critiques, and “this will break when…” comments. That’s the good stuff. 😄🧠