Does Graph Beat Tokens? Engineering a GraphRAG Benchmark on TigerGraph

LLM token costs explode at scale. The TigerGraph GraphRAG Inference Hackathon
poses one question: can a knowledge graph make inference cheaper without
losing answer quality? I built three pipelines on 95 PubMed papers (~1M tokens)
about type 2 diabetes drug interactions and let the numbers decide.

Key design choice: all three pipelines use the same LLM (gpt-4o-mini).
So every difference is the retrieval architecture, not the model.

The Three Pipelines

  • LLM-Only — prompt in, answer out, no retrieval. The floor.
  • Basic RAG — FAISS vector search, top-5 chunks dumped into the prompt.
  • GraphRAG — TigerGraph knowledge graph (33,969 entities, 1.75M relationships, 2,459 community summaries). Path B: I customized the repo.
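To make the comparison concrete, here is a minimal sketch of the three pipelines sharing one LLM call. This is not the repo's code: the FAISS index, the embed helper, and retrieve_graph_context are stand-ins for the real ingestion and TigerGraph retrieval.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # identical model for all three pipelines

def ask_llm(question: str, context: str = "") -> str:
    """One shared LLM call; the only thing that varies is the context string."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}" if context else question
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def llm_only(question: str) -> str:
    return ask_llm(question)  # no retrieval at all: the floor

def basic_rag(question: str, index, chunks, embed) -> str:
    """FAISS top-5: embed the question, grab the nearest chunks, dump them into the prompt."""
    _, ids = index.search(embed(question), 5)  # embed() returns a (1, d) float32 array
    context = "\n\n".join(chunks[i] for i in ids[0])
    return ask_llm(question, context)

def graph_rag(question: str, retrieve_graph_context) -> str:
    """retrieve_graph_context stands in for the TigerGraph retrieval step
    (entity, relationship, and community-summary lookup in the repo)."""
    return ask_llm(question, retrieve_graph_context(question))
```

Because ask_llm is identical everywhere, any accuracy or token difference has to come from how the context was built.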

The Headline Result

On 3-hop reasoning — questions requiring connections across documents,
exactly what graphs are built for:

Pipeline     3-hop Accuracy   Tokens/Query
LLM-Only     90%              526
Basic RAG    60%              1,424
GraphRAG     90%              438

GraphRAG matches the best accuracy at the lowest token cost — 69% fewer
tokens than Basic RAG on the reasoning that matters. Across all 30 questions,
GraphRAG cut tokens ~95% vs Basic RAG.
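The 3-hop figure is just arithmetic on the table above (the ~95% number needs the per-question token counts from the full run, so I'm not re-deriving it here):

```python
basic_rag_tokens, graphrag_tokens = 1_424, 438
print(f"{1 - graphrag_tokens / basic_rag_tokens:.0%} fewer tokens per 3-hop query")
# 69% fewer tokens per 3-hop query
```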

Honest full picture: the architectures are complementary. GraphRAG
dominates multi-hop synthesis (90% vs 60%); Basic RAG leads precise
single-fact lookup (80% vs 50%). I'm not claiming a clean sweep — I'm
showing where graph structure wins, and why.

The Engineering (this is the real story)

Lever 1 — Chunking strategy

Reading the repo source, I found that only the semantic and character chunkers
are wired into ingestion. I kept semantic as the baseline for a specific reason:
entity-relationship extraction needs a complete fact inside one chunk.
Semantic splitting keeps "drug A increases drug B's AUC 2-fold" intact so
the extractor captures the relationship.

I tested fixed-size chunking in an isolated experiment — a separate
graph so the validated baseline was never at risk. CharacterChunker
(1000 chars / 200 overlap) produced 8,689 chunks vs the baseline's 4,083
(2.1× more), proving chunking materially reshapes the graph. But blind
character cuts fragment the precise facts I was trying to fix. The run was
interrupted by a resource limit before completion — reported honestly as a
documented finding and future work, not a finished claim.
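For intuition on why blind cuts worry me, here is a toy fixed-size splitter with the same 1000/200 parameters. It is a stand-in, not the repo's CharacterChunker, and the drug fact is made up; the point is that when the subject and the effect sit further apart than the overlap, no single chunk contains the whole relationship.

```python
def character_chunks(text: str, size: int = 1000, overlap: int = 200):
    """Toy fixed-size splitter: 1000-char windows, 200-char overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Hypothetical fact: the drug name and the AUC effect sit ~390 chars apart,
# wider than the 200-char overlap, so a window boundary can separate them.
doc = "x" * 700 + "Drug A " + "x" * 390 + "raises Drug B's AUC 2-fold." + "x" * 2000
chunks = character_chunks(doc)
print(any("Drug A" in c and "AUC 2-fold" in c for c in chunks))  # False
```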

Lever 2 — Retrieval (single-variable ablations)

  • Hop depth: num_hops=1 beat num_hops=2 — better BERTScore and fewer tokens. Two hops wandered into tangential context that diluted precision.
  • Method: tested hybrid / community / similarity. I hypothesized similarity would win fact-lookup — it didn't; hybrid was best or tied everywhere. Kept as an honest negative result.
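The ablation harness itself is simple; a sketch of its shape is below. run_graphrag is a stub for the repo's query call (I'm assuming it returns the answer plus the tokens consumed), and BERTScore comes from the bert_score package.

```python
from statistics import mean
from bert_score import score as bert_score

def run_graphrag(question: str, num_hops: int, method: str):
    # Stub for the repo's GraphRAG query call; wire the real retrieval in here.
    # Assumed to return (answer_text, tokens_used).
    ...

def evaluate(questions, references, num_hops=1, method="hybrid"):
    """Run one configuration and report BERTScore F1 plus average tokens."""
    answers, tokens = [], []
    for q in questions:
        answer, used = run_graphrag(q, num_hops=num_hops, method=method)
        answers.append(answer)
        tokens.append(used)
    _, _, f1 = bert_score(answers, references, lang="en")
    return {"bertscore_f1": f1.mean().item(), "avg_tokens": mean(tokens)}

# One variable per run: num_hops=1 vs num_hops=2,
# then method="hybrid" vs "community" vs "similarity".
```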

Lever 3 — Prompt design

GraphRAG over-abstained ("no information available" when the answer was in
the graph). I traced it to a prompt clause forbidding synthesis, surgically
swapped only that clause, kept the load-bearing JSON-format line
byte-identical. Measured: BERTScore 0.8648 → 0.8623 — no gain, still
abstained. Reverted. Conclusion: the abstention is a graph-retrieval
limitation, not prompt wording.
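For the curious, the swap itself was mechanical. The clause text below is illustrative (the real wording lives in the repo's prompt template), but the pattern is the point: replace exactly one clause and assert that the JSON-format line survives byte-for-byte.

```python
# Illustrative clauses; the real wording lives in the repo's prompt template.
OLD_CLAUSE = "Only state facts that appear verbatim in the retrieved context."
NEW_CLAUSE = "You may combine facts from the retrieved context to answer."
JSON_LINE = 'Respond in JSON: {"answer": "...", "sources": [...]}'

def patch_prompt(prompt: str) -> str:
    patched = prompt.replace(OLD_CLAUSE, NEW_CLAUSE, 1)          # touch one clause only
    assert JSON_LINE in prompt and JSON_LINE in patched          # load-bearing line intact
    assert patched.replace(NEW_CLAUSE, OLD_CLAUSE, 1) == prompt  # nothing else changed
    return patched
```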

Discipline throughout: change one variable, measure, keep only what the
data supports. Every bundled change broke something.

Reproducible

Public repo, live dashboard, all 30 questions and scores visible. Nothing
hidden.

Built on the TigerGraph GraphRAG repo
for #GraphRAGInferenceHackathon.
