Open Hallucination Index (OHI): turning “Plausible AI” into “Verifiable AI” (LLMs as Gardeners of the Graph) 🌱🧠

I’m genuinely obsessed with this phase of life where you can learn so much so fast — and suddenly you notice: oh wow, there’s real potential here. ✨

And yes, AI is controversial. A lot of critique is valid and not “debatable away”.
But if you zoom in on the technical side — the architecture, the retrieval mechanics, the verification problem — it’s an insanely interesting space.

So… last night I went full hyperfocus and wrote a research paper + reference implementation for something I call:

Open Hallucination Index (OHI)

OHI is a sovereign architectural framework that tries to move us from “Generative AI” to “Verifiable AI” by adding a deterministic Trust Layer after generation.

Motto: “LLMs Hallucinate – We Verify.”


The core problem: the epistemic deficit

LLMs are optimized for plausibility, not veracity.

They can sound true while being completely ungrounded — what my paper frames as stochastic fabulation.

On the human side, there’s also a psychological trap: people tend to grant fluent systems a kind of epistemic authority, which fuels automation bias (trusting the output even when it clashes with evidence).

So the goal here is not “make models perfect” — it’s:

make truth-checking systematic, auditable, and configurable.


Why “naive RAG” doesn’t solve it

Retrieval-Augmented Generation is a good direction, but the naive version still fails:

  • vector similarity can fetch irrelevant chunks
  • the generator can still override retrieved context with parametric memory
  • “semantic proximity” is not “logical truth”

In the paper, I argue verification must be extrinsic and deterministic — not another stochastic step inside the generation loop.


OHI in one breath: a post-generation Trust Layer

Instead of trusting an answer as a blob of text, OHI verifies it at claim-level granularity.

1) Atomic Claim Decomposition (granularity is everything)

A paragraph can contain 9 correct facts and 1 subtle fabrication — binary “true/false” labels won’t cut it.

OHI decomposes a response into atomic claims, represented like a tuple:

  • (S, P, O) → subject / predicate / object

This is aligned with ideas used in fine-grained factuality metrics like FActScore:

T (text)  →  A = {c1, c2, ... cn} (atomic claims)

FActScore = (1 / |A|) * Σ 𝟙(claim is supported)
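As a rough sketch, assuming a toy is_supported oracle (the AtomicClaim dataclass and graph_has_edge are illustrative names, not the paper's actual API):

from dataclasses import dataclass

@dataclass
class AtomicClaim:
    subject: str
    predicate: str
    obj: str  # together: the (S, P, O) triple

def fact_score(claims: list[AtomicClaim], is_supported) -> float:
    """FActScore: the fraction of atomic claims that the evidence supports."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c)) / len(claims)

# e.g. fact_score(claims, lambda c: graph_has_edge(c.subject, c.predicate, c.obj))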

2) Hybrid evidence retrieval (Graph + Vector + MCP)

For each atomic claim, OHI uses multiple verification oracles (the first two are sketched in code right after this list):

  • Neo4j graph matching (deterministic structure: “does this relation/path exist?”)
  • Qdrant vector retrieval (semantic candidates, “fuzzy” evidence)
  • MCP evidence (live, standardized tool access — e.g., Wikipedia / Context7 docs)
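
The first two oracles could look roughly like this (the :Entity/predicate schema, the "evidence" collection name, the endpoints, and the embed callable are my placeholders, not the repo's actual setup):

from neo4j import GraphDatabase
from qdrant_client import QdrantClient

graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
vectors = QdrantClient(url="http://localhost:6333")

def graph_oracle(s: str, p: str, o: str) -> float:
    """Deterministic check: does an (S)-[P]->(O) edge exist in the ontology?"""
    query = (
        "MATCH (:Entity {name: $s})-[r {predicate: $p}]->(:Entity {name: $o}) "
        "RETURN count(r) AS hits"
    )
    with graph.session() as session:
        record = session.run(query, s=s, p=p, o=o).single()
    return 1.0 if record and record["hits"] > 0 else 0.0

def vector_oracle(claim_text: str, embed) -> float:
    """Fuzzy check: best similarity among semantically close evidence passages."""
    hits = vectors.search(
        collection_name="evidence",      # placeholder collection name
        query_vector=embed(claim_text),  # embedding function supplied by the caller
        limit=5,
    )
    return max((h.score for h in hits), default=0.0)

The MCP oracle is the third signal; the first code peek further down shows how those sessions are opened.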

3) Classify each claim + compute a Trust Score

Each claim becomes one of:

  • Supported
  • Contradicted
  • Unverifiable (no decisive evidence)

Then the system aggregates per-claim signals into a final OHI Trust Score (0.0–1.0) and returns a visual overlay (green/red/gray) per sentence/claim.
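
A rough sketch of the per-claim bookkeeping (the enum and dataclass names are illustrative, not the reference implementation's types):

from dataclasses import dataclass
from enum import Enum

class ClaimStatus(Enum):
    SUPPORTED = "supported"        # rendered green in the overlay
    CONTRADICTED = "contradicted"  # rendered red
    UNVERIFIABLE = "unverifiable"  # rendered gray

@dataclass
class VerifiedClaim:
    text: str
    status: ClaimStatus
    trust: float  # per-claim score in [0.0, 1.0]

def ohi_score(claims: list[VerifiedClaim]) -> float:
    """Aggregate per-claim trust into the final 0.0-1.0 OHI Trust Score."""
    return sum(c.trust for c in claims) / len(claims) if claims else 0.0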

A mental model I like from the paper:

like nutritional labels for epistemic quality — not vibes, but a visible score.


The system architecture (sovereign by design)

OHI is designed for local sovereignty:

  • no external OpenAI/Anthropic calls needed for verification
  • local control over Ground Truth
  • avoids the “fox guarding the henhouse” scenario

High-level components:

  • vLLM hosting Qwen2.5 (paper discusses 7B + 32B variants; reference setup uses a quantized 7B AWQ)
  • Neo4j as ontology store (index-free adjacency for multi-hop traversal)
  • Qdrant as semantic store (embeddings for initial retrieval)
  • Redis as cache (reduce repeated lookup cost)
  • MCP servers as standardized “truth adapters” (Wikipedia, Context7)

In the paper I also highlight vLLM’s PagedAttention as a throughput enabler for parallel verification workloads.
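
For the two components the code peeks below don't cover (the vLLM-hosted model and the Redis cache), a minimal client-side sketch might look like this; the endpoints, the model id, and the cache-key scheme are my assumptions, not the repo's actual config:

import hashlib

import redis
from openai import OpenAI  # vLLM exposes an OpenAI-compatible API

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local vLLM server
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def decompose_claims(text: str) -> list[str]:
    """Ask the local Qwen2.5 model to split text into atomic claims, with caching."""
    key = "ohi:claims:" + hashlib.sha256(text.encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return hit.splitlines()

    resp = llm.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # assumed model id
        messages=[
            {"role": "system", "content": "Split the text into atomic factual claims, one per line."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    claims = [ln.strip() for ln in resp.choices[0].message.content.splitlines() if ln.strip()]
    cache.set(key, "\n".join(claims), ex=3600)  # avoid repeating the expensive decomposition
    return claims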


The algorithmic heart: Hybrid Verification Oracle

The paper formalizes the scoring logic as a hybrid oracle + weighted scorer.

Simplified version:

For each claim c:
  S_graph = 1.0 if exact graph match else 0.0
  S_vec   = semantic similarity from vector retrieval
  S_mcp   = evidence signal from MCP tools

Trust(c) = α*S_graph + β*S_vec + γ*S_mcp
OHI Score = average Trust(c) across all claims

Default weights in the paper:

  • α = 0.6 graph exact match (deterministic truth)
  • β = 0.3 vector semantic match (plausibility)
  • γ = 0.1 MCP evidence (contextual grounding)

Graph can effectively act as a “veto” against misleading semantic matches.
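
With those defaults, the scorer plus the veto behaviour could look roughly like this (the graph_contradicts flag is my simplification of how a contradicting edge would be signalled):

def trust(s_graph: float, s_vec: float, s_mcp: float,
          alpha: float = 0.6, beta: float = 0.3, gamma: float = 0.1,
          graph_contradicts: bool = False) -> float:
    """Weighted hybrid oracle: deterministic graph evidence dominates fuzzy signals."""
    if graph_contradicts:
        # Veto: a contradicting edge overrides even a high semantic similarity.
        return 0.0
    return alpha * s_graph + beta * s_vec + gamma * s_mcp

# Fluent but ungrounded: semantically close passages, yet no supporting edge.
print(trust(s_graph=0.0, s_vec=0.9, s_mcp=0.2))  # 0.29 -> low trust
# Backed by an exact graph match:
print(trust(s_graph=1.0, s_vec=0.7, s_mcp=0.5))  # 0.86 -> high trust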


Two tiny code peeks (from the reference implementation)

1) MCP over SSE: a clean “truth sidecar” session

OHI uses MCP to standardize tool access via JSON-RPC 2.0, and the Wikipedia adapter keeps it lightweight with SSE transport.

Here’s the fallback session context manager:

from contextlib import asynccontextmanager
from mcp import ClientSession
from mcp.client.sse import sse_client

@asynccontextmanager
async def _session_fallback(self):
    """Create a new MCP session (non-pooled, for fallback)."""
    async with sse_client(self._mcp_url) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session

In the paper, I also discuss session pooling to reduce repeated SSE handshake overhead.
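
In rough outline, that pooling idea might look like this; it's a sketch of the concept, not the repo's actual pool:

import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

from mcp import ClientSession
from mcp.client.sse import sse_client

class MCPSessionPool:
    """Keep a few initialized MCP sessions open and hand them out on demand."""

    def __init__(self, mcp_url: str, size: int = 4):
        self._mcp_url = mcp_url
        self._size = size
        self._stack = AsyncExitStack()  # keeps the SSE transports alive between calls
        self._idle: asyncio.Queue = asyncio.Queue()
        self._created = 0

    async def _new_session(self) -> ClientSession:
        # Pay the SSE handshake once; transport and session stay open on the stack.
        read, write = await self._stack.enter_async_context(sse_client(self._mcp_url))
        session = await self._stack.enter_async_context(ClientSession(read, write))
        await session.initialize()
        return session

    @asynccontextmanager
    async def acquire(self):
        if self._idle.empty() and self._created < self._size:
            self._created += 1
            session = await self._new_session()
        else:
            session = await self._idle.get()
        try:
            yield session
        finally:
            await self._idle.put(session)

    async def close(self):
        await self._stack.aclose()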

2) Streaming Wikipedia dump ingestion (memory-efficient)

Your “Ground Truth” is only as good as your ingestion pipeline.

To build the Neo4j graph from a massive Wikipedia XML dump, the importer uses streaming iterparse + aggressive element clearing:

from xml.etree.ElementTree import iterparse

context = iterparse(file_handle, events=("start", "end"))

in_page = False    # are we currently inside a <page> element?
current_page = {}  # fields collected for the page being parsed

for event, elem in context:
    tag = strip_namespace(elem.tag)  # helper that drops the MediaWiki XML namespace prefix

    if event == "start" and tag == "page":
        in_page = True
        current_page = {}

    elif event == "end":
        if not in_page:
            elem.clear()
            continue

        # ... collect title, ids, text, etc. ...

        if tag == "page":
            in_page = False

            # resume support: skip until checkpoint
            page_id = current_page.get("page_id", 0)
            if page_id <= start_after_page_id:
                current_page = {}
                elem.clear()
                continue

            if current_page.get("text"):
                yield (
                    current_page.get("title", ""),
                    current_page.get("page_id", 0),
                    current_page.get("revision_id", 0),
                    current_page.get("text", ""),
                )

            current_page = {}

        # Clear element to free memory
        elem.clear()

That’s how you survive multi-GB XML without melting RAM.
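
Downstream, the generator can feed batched graph writes. A hedged sketch, assuming a parse_pages() wrapper around the loop above and a simple :Page node schema:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

UPSERT = (
    "UNWIND $rows AS row "
    "MERGE (p:Page {page_id: row.page_id}) "
    "SET p.title = row.title, p.revision_id = row.revision_id"
)

def ingest(file_handle, batch_size: int = 500):
    """Stream pages out of the dump and UNWIND them into Neo4j in batches."""
    batch = []
    with driver.session() as session:
        for title, page_id, revision_id, text in parse_pages(file_handle):
            # text would feed the claim/triple extraction stage; only metadata is stored here
            batch.append({"title": title, "page_id": page_id, "revision_id": revision_id})
            if len(batch) >= batch_size:
                session.run(UPSERT, rows=batch)
                batch = []
        if batch:
            session.run(UPSERT, rows=batch)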


Performance reality check (what the paper reports)

The system is fast on DB queries, but the bottlenecks are:

  • claim decomposition (LLM inference): ~200–500ms per text segment
  • orchestration overhead in Python (GIL + I/O overhead accumulates under concurrency)

The paper frames a clear direction: porting core orchestration to Rust for near-real-time constraints.

Minimum “sovereign mode” hardware (paper summary)

  • ~16GB RAM + NVIDIA GPU
  • ~8GB VRAM for Qwen2.5-7B-AWQ
  • Neo4j: 4–16GB RAM
  • Qdrant: ~4GB RAM per 1M vectors (depends on setup)

“LLMs as Gardeners of the Graph” 🌿

This analogy is my favorite part of the future-facing discussion.

Imagine the “Graph” as a floating monolith of structured knowledge:
nodes = entities/facts, edges = relations.

LLMs aren’t “truth” — they’re gardeners:

  • they shape, prune, and recombine knowledge into fluent language
  • but they can also cross-breed the wrong things and produce believable nonsense

The future vision in the paper is recursive:

  • LLMs help grow the graph (extract triples from new text into Neo4j; sketched below)
  • OHI then audits the same models against the structure they helped build
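
A minimal sketch of that growth step (the model id, the node/edge labels, and the prompt are my assumptions, not the paper's actual schema):

from neo4j import GraphDatabase
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local vLLM
graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Ask the local model for (subject | predicate | object) triples, one per line."""
    resp = llm.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # assumed model id
        messages=[{"role": "user",
                   "content": "Extract (subject | predicate | object) triples, one per line:\n" + text}],
        temperature=0.0,
    )
    triples = []
    for line in resp.choices[0].message.content.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def plant(text: str):
    """Let the 'gardener' grow the graph; OHI later audits claims against these edges."""
    with graph.session() as session:
        for s, p, o in extract_triples(text):
            session.run(
                "MERGE (a:Entity {name: $s}) MERGE (b:Entity {name: $o}) "
                "MERGE (a)-[:CLAIMS {predicate: $p}]->(b)",
                s=s, p=p, o=o,
            )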

A self-correcting loop… with one big philosophical question:
who owns the graph? 👀


Try it / read it / fork it

  • Research Paper (PDF): https://publuu.com/flip-book/1040591/2304964
  • Demo: https://openhallucination.xyz
  • Open Source Docker Image + API: https://github.com/shiftbloom-studio/open-hallucination-index-api
  • Open Source Demo Website: https://github.com/shiftbloom-studio/open-hallucination-index

If you’re into epistemology, verification, GraphRAG, or just building systems that refuse to hand-wave truth…

I’d love feedback, critiques, and “this will break when…” comments. That’s the good stuff. 😄🧠
