Does Graph Beat Tokens? Engineering a GraphRAG Benchmark on TigerGraph

LLM token costs explode at scale. The TigerGraph GraphRAG Inference Hackathon
poses one question: can a knowledge graph make inference cheaper without
losing answer quality? I built three pipelines on 95 PubMed papers (~1M tokens)
about type 2 diabetes drug interactions and let the numbers decide.

Key design choice: all three pipelines use the same LLM (gpt-4o-mini).
So every difference is the retrieval architecture, not the model.

The Three Pipelines

  • LLM-Only — prompt in, answer out, no retrieval. The floor.
  • Basic RAG — FAISS vector search, top-5 chunks dumped into the prompt.
  • GraphRAG — TigerGraph knowledge graph (33,969 entities, 1.75M relationships, 2,459 community summaries). Path B: I customized the repo.
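To make the comparison concrete, here is a minimal sketch of the three pipelines sharing one LLM call. This is not the repo's code: the FAISS index, the embed helper, and retrieve_graph_context are stand-ins for the real ingestion and TigerGraph retrieval.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # identical model for all three pipelines

def ask_llm(question: str, context: str = "") -> str:
    """One shared LLM call; the only thing that varies is the context string."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}" if context else question
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def llm_only(question: str) -> str:
    return ask_llm(question)  # no retrieval at all: the floor

def basic_rag(question: str, index, chunks, embed) -> str:
    """FAISS top-5: embed the question, grab the nearest chunks, dump them into the prompt."""
    _, ids = index.search(embed(question), 5)  # embed() returns a (1, d) float32 array
    context = "\n\n".join(chunks[i] for i in ids[0])
    return ask_llm(question, context)

def graph_rag(question: str, retrieve_graph_context) -> str:
    """retrieve_graph_context stands in for the TigerGraph retrieval step
    (entity, relationship, and community-summary lookup in the repo)."""
    return ask_llm(question, retrieve_graph_context(question))
```

Because ask_llm is identical everywhere, any accuracy or token difference has to come from how the context was built.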

The Headline Result

On 3-hop reasoning — questions requiring connections across documents,
exactly what graphs are built for:

Pipeline     3-hop Accuracy   Tokens/Query
LLM-Only     90%              526
Basic RAG    60%              1,424
GraphRAG     90%              438

GraphRAG matches the best accuracy at the lowest token cost — 69% fewer
tokens than Basic RAG on the reasoning that matters. Across all 30 questions,
GraphRAG cut tokens ~95% vs Basic RAG.
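The 3-hop figure is just arithmetic on the table above (the ~95% number needs the per-question token counts from the full run, so I'm not re-deriving it here):

```python
basic_rag_tokens, graphrag_tokens = 1_424, 438
print(f"{1 - graphrag_tokens / basic_rag_tokens:.0%} fewer tokens per 3-hop query")
# 69% fewer tokens per 3-hop query
```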

Honest full picture: the architectures are complementary. GraphRAG
dominates multi-hop synthesis (90% vs 60%); Basic RAG leads precise
single-fact lookup (80% vs 50%). I'm not claiming a clean sweep — I'm
showing where graph structure wins, and why.

The Engineering (this is the real story)

Lever 1 — Chunking strategy

Reading the repo source, I found that only the semantic and character chunkers
are wired into ingestion. I kept semantic as the baseline for a specific reason:
entity-relationship extraction needs a complete fact inside one chunk.
Semantic splitting keeps "drug A increases drug B's AUC 2-fold" intact so
the extractor captures the relationship.

I tested fixed-size chunking in an isolated experiment — a separate
graph so the validated baseline was never at risk. CharacterChunker
(1000 chars / 200 overlap) produced 8,689 chunks vs the baseline's 4,083
(2.1× more), proving chunking materially reshapes the graph. But blind
character cuts fragment the precise facts I was trying to fix. The run was
interrupted by a resource limit before completion — reported honestly as a
documented finding and future work, not a finished claim.
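For intuition on why blind cuts worry me, here is a toy fixed-size splitter with the same 1000/200 parameters. It is a stand-in, not the repo's CharacterChunker, and the drug fact is made up; the point is that when the subject and the effect sit further apart than the overlap, no single chunk contains the whole relationship.

```python
def character_chunks(text: str, size: int = 1000, overlap: int = 200):
    """Toy fixed-size splitter: 1000-char windows, 200-char overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Hypothetical fact: the drug name and the AUC effect sit ~390 chars apart,
# wider than the 200-char overlap, so a window boundary can separate them.
doc = "x" * 700 + "Drug A " + "x" * 390 + "raises Drug B's AUC 2-fold." + "x" * 2000
chunks = character_chunks(doc)
print(any("Drug A" in c and "AUC 2-fold" in c for c in chunks))  # False
```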

Lever 2 — Retrieval (single-variable ablations)

  • Hop depth: num_hops=1 beat num_hops=2 — better BERTScore and fewer tokens. Two hops wandered into tangential context that diluted precision.
  • Method: tested hybrid / community / similarity. I hypothesized similarity would win fact-lookup — it didn't; hybrid was best or tied everywhere. Kept as an honest negative result.
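The ablation harness itself is simple; a sketch of its shape is below. run_graphrag is a stub for the repo's query call (I'm assuming it returns the answer plus the tokens consumed), and BERTScore comes from the bert_score package.

```python
from statistics import mean
from bert_score import score as bert_score

def run_graphrag(question: str, num_hops: int, method: str):
    # Stub for the repo's GraphRAG query call; wire the real retrieval in here.
    # Assumed to return (answer_text, tokens_used).
    ...

def evaluate(questions, references, num_hops=1, method="hybrid"):
    """Run one configuration and report BERTScore F1 plus average tokens."""
    answers, tokens = [], []
    for q in questions:
        answer, used = run_graphrag(q, num_hops=num_hops, method=method)
        answers.append(answer)
        tokens.append(used)
    _, _, f1 = bert_score(answers, references, lang="en")
    return {"bertscore_f1": f1.mean().item(), "avg_tokens": mean(tokens)}

# One variable per run: num_hops=1 vs num_hops=2,
# then method="hybrid" vs "community" vs "similarity".
```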

Lever 3 — Prompt design

GraphRAG over-abstained ("no information available" when the answer was in
the graph). I traced it to a prompt clause forbidding synthesis, surgically
swapped only that clause, kept the load-bearing JSON-format line
byte-identical. Measured: BERTScore 0.8648 → 0.8623 — no gain, still
abstained. Reverted. Conclusion: the abstention is a graph-retrieval
limitation, not prompt wording.
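For the curious, the swap itself was mechanical. The clause text below is illustrative (the real wording lives in the repo's prompt template), but the pattern is the point: replace exactly one clause and assert that the JSON-format line survives byte-for-byte.

```python
# Illustrative clauses; the real wording lives in the repo's prompt template.
OLD_CLAUSE = "Only state facts that appear verbatim in the retrieved context."
NEW_CLAUSE = "You may combine facts from the retrieved context to answer."
JSON_LINE = 'Respond in JSON: {"answer": "...", "sources": [...]}'

def patch_prompt(prompt: str) -> str:
    patched = prompt.replace(OLD_CLAUSE, NEW_CLAUSE, 1)          # touch one clause only
    assert JSON_LINE in prompt and JSON_LINE in patched          # load-bearing line intact
    assert patched.replace(NEW_CLAUSE, OLD_CLAUSE, 1) == prompt  # nothing else changed
    return patched
```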

Discipline throughout: change one variable, measure, keep only what the
data supports. Every bundled change broke something.

Reproducible

Public repo, live dashboard, all 30 questions and scores visible. Nothing
hidden.

Built on the TigerGraph GraphRAG repo
for #GraphRAGInferenceHackathon.
