LLM token costs explode at scale. The TigerGraph GraphRAG Inference Hackathon
poses one question: can a knowledge graph make inference cheaper without
losing answer quality? I built three pipelines on 95 PubMed papers (~1M tokens)
about type 2 diabetes drug interactions and let the numbers decide.
Key design choice: all three pipelines use the same LLM (gpt-4o-mini), so every
difference in the results comes from the retrieval architecture, not the model.
The Three Pipelines
- LLM-Only — prompt in, answer out, no retrieval. The floor.
- Basic RAG — FAISS vector search, top-5 chunks dumped into the prompt (sketched below).
- GraphRAG — TigerGraph knowledge graph (33,969 entities, 1.75M relationships, 2,459 community summaries). Path B: I customized the repo.
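To make the middle option concrete, here is a minimal sketch of a Basic-RAG-style retriever. The embedding model and placeholder corpus are illustrative assumptions, not the project's actual configuration:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative setup: any sentence embedder works; this one is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["<chunk 1 text>", "<chunk 2 text>"]  # placeholder for the real chunk list

chunk_vecs = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])  # cosine similarity via inner product
index.add(chunk_vecs)

def basic_rag_prompt(question: str, k: int = 5) -> str:
    """Retrieve the top-k chunks and dump them straight into the prompt."""
    q_vec = model.encode([question], normalize_embeddings=True)
    _, ids = index.search(q_vec, k)
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

LLM-Only skips retrieval entirely; GraphRAG replaces this flat vector lookup with traversal over entities, relationships, and community summaries in the TigerGraph knowledge graph.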
The Headline Result
On 3-hop reasoning — questions requiring connections across documents,
exactly what graphs are built for:
| Pipeline | 3-hop Accuracy | Tokens/Query |
|---|---|---|
| LLM-Only | 90% | 526 |
| Basic RAG | 60% | 1,424 |
| GraphRAG | 90% | 438 |
GraphRAG matches the best accuracy at the lowest token cost — 69% fewer
tokens than Basic RAG on the reasoning that matters. Across all 30 questions,
GraphRAG cut tokens ~95% vs Basic RAG.
Honest full picture: the architectures are complementary. GraphRAG
dominates multi-hop synthesis (90% vs 60%); Basic RAG leads precise
single-fact lookup (80% vs 50%). I'm not claiming a clean sweep — I'm
showing where graph structure wins, and why.
The Engineering (this is the real story)
Lever 1 — Chunking strategy
Reading the repo source showed that only the semantic and character chunkers are
wired into ingestion. I kept semantic as the baseline for a specific reason:
entity-relationship extraction needs a complete fact inside one chunk.
Semantic splitting keeps "drug A increases drug B's AUC 2-fold" intact so
the extractor captures the relationship.
I tested fixed-size chunking in an isolated experiment — a separate
graph so the validated baseline was never at risk. CharacterChunker
(1000 chars / 200 overlap) produced 8,689 chunks vs the baseline's 4,083
(2.1× more), showing that the chunking choice materially reshapes the graph. But
blind character cuts fragment exactly the precise facts the extractor needs
intact. The run was
interrupted by a resource limit before completion — reported honestly as a
documented finding and future work, not a finished claim.
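For reference, a fixed-size chunker in the spirit of that 1000-char / 200-overlap setting can be sketched in a few lines (an illustration, not the repo's CharacterChunker):

```python
def character_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Slide a fixed-size window over the text, reusing `overlap` chars per step."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# The failure mode: a cut landing mid-sentence can split
# "drug A increases drug B's AUC 2-fold" across two chunks,
# so the extractor never sees the complete relationship in one place.
```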
Lever 2 — Retrieval (single-variable ablations)
- Hop depth: num_hops=1 beat num_hops=2 — better BERTScore and fewer tokens. Two hops wandered into tangential context that diluted precision.
- Method: tested hybrid / community / similarity. I hypothesized similarity would win on fact lookup — it didn't; hybrid was best or tied everywhere. Kept as an honest negative result. (The ablation loop is sketched below.)
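The discipline behind both ablations is easier to see as code. Below is a sketch of the single-variable sweep; query_graphrag and score_answer are hypothetical stand-ins, not the repo's actual API:

```python
BASELINE = {"num_hops": 1, "method": "hybrid"}

def query_graphrag(question: str, num_hops: int, method: str) -> tuple[str, int]:
    """Hypothetical stand-in: return (answer, tokens_used) for one configuration."""
    raise NotImplementedError

def score_answer(answer: str, reference: str) -> float:
    """Hypothetical stand-in for the scorer (e.g. BERTScore F1 vs the reference)."""
    raise NotImplementedError

def ablate(questions, references, param, values):
    """Vary exactly one retrieval parameter; hold the rest at the baseline."""
    results = {}
    for value in values:
        cfg = {**BASELINE, param: value}
        scores, tokens = [], []
        for q, ref in zip(questions, references):
            answer, n_tokens = query_graphrag(q, **cfg)
            scores.append(score_answer(answer, ref))
            tokens.append(n_tokens)
        results[value] = (sum(scores) / len(scores), sum(tokens) / len(tokens))
    return results

# ablate(questions, references, "num_hops", [1, 2])
# ablate(questions, references, "method", ["hybrid", "community", "similarity"])
```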
Lever 3 — Prompt design
GraphRAG over-abstained ("no information available" when the answer was in
the graph). I traced it to a prompt clause forbidding synthesis, surgically
swapped only that clause, kept the load-bearing JSON-format line
byte-identical. Measured: BERTScore 0.8648 → 0.8623 — no gain, still
abstained. Reverted. Conclusion: the abstention is a graph-retrieval limitation,
not a prompt-wording problem.
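The before/after measurement is just a BERTScore comparison over the two answer sets. A minimal sketch with the bert-score package (the answer lists are placeholders, and this may not be the exact harness in the repo):

```python
from bert_score import score

# Placeholders: model answers under the original prompt, under the edited
# prompt, and the gold reference answers for the same questions.
baseline_answers = ["..."]
edited_answers = ["..."]
references = ["..."]

_, _, f1_base = score(baseline_answers, references, lang="en")
_, _, f1_edit = score(edited_answers, references, lang="en")
print(f"baseline F1: {f1_base.mean().item():.4f}")
print(f"edited   F1: {f1_edit.mean().item():.4f}")
```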
Discipline throughout: change one variable, measure, keep only what the
data supports. Every bundled change broke something.
Reproducible
Public repo, live dashboard, all 30 questions and scores visible. Nothing
hidden.
- GitHub: https://github.com/kamisettysba2027-source/graphrag-inference-hackathon
- Live dashboard: https://graphrag-inference-hackathon-pdtqvncbcctvdlqsacqlkr.streamlit.app/
Built on the TigerGraph GraphRAG repo for #GraphRAGInferenceHackathon.