I just built a system that beats Basic RAG on every single metric simultaneously. Higher accuracy. Better semantic similarity. 94.6% fewer tokens.
Here's exactly how I did it.
The Problem
Every time an LLM answers a question, it burns tokens. And tokens cost money.
The industry's current answer is RAG (Retrieval Augmented Generation). Instead of sending the LLM your entire knowledge base, you retrieve the most relevant chunks and send only those.
But here's what nobody tells you: Basic RAG doesn't actually solve the problem. It just moves it.
When I built my submission for the TigerGraph GraphRAG Inference Hackathon, I set out to prove something: a knowledge graph can do what vector search cannot. It retrieves precisely what an LLM needs, not just what looks similar. The result was a system that uses 94.6% fewer tokens than Basic RAG while being more accurate.
Here's the full story.
The Setup: Three Pipelines, One Dataset
The hackathon challenge was straightforward: build three pipelines that answer the same questions on the same dataset, then let the benchmarks tell the story.
The dataset: I chose IPL (Indian Premier League) cricket — 665 Wikipedia articles, over 2 million tokens of text. IPL is perfect for this experiment because it's deeply interconnected. Players move between teams. Coaches lead multiple franchises. The same venue hosts finals across different seasons. These cross-document relationships are exactly where vector search struggles and graphs shine.
The three pipelines:
Pipeline 1: LLM Only (Baseline)
Send the question directly to the LLM. No retrieval, no context.
Results: 174 avg tokens, 62% accuracy
Pipeline 2: Basic RAG (Industry Standard)
Chunk documents, embed with sentence-transformers, store in FAISS, retrieve top 5 similar chunks, send to LLM.
Results: 2,541 avg tokens, 50% accuracy
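For concreteness, here is a minimal sketch of what Pipeline 2's retrieval step looks like. The embedding model, sample chunks, and helper names are illustrative, not the repo's exact code:

```python
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# `chunks` would be the ~2M tokens of articles split into passages elsewhere.
chunks = [
    "The Orange Cap is awarded to the leading run-scorer of an IPL season ...",
    "The 2023 IPL final was hosted at the Narendra Modi Stadium ...",
]
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search over chunk vectors
index.add(embeddings)

def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = model.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0] if i != -1]

context = "\n\n".join(retrieve("Which player won the Orange Cap in 2016?"))
# `context` plus the question is what gets sent to the LLM: roughly 2,500 tokens per query.
```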
Pipeline 3: GraphRAG (The Experiment)
Extract entities and relationships, build a knowledge graph in TigerGraph, traverse 2-3 hops to find precise context, send structured prompt to LLM.
Results: 137 avg tokens, 64% accuracy
The Final Numbers
| Metric | P1 LLM-Only | P2 Basic RAG | P3 GraphRAG |
| --- | --- | --- | --- |
| Judge Pass Rate | 62% | 50% | 64% |
| BERTScore F1 | 0.8724 | 0.7974 | 0.8826 ✅ |
| BERTScore Rescaled | 0.6178 | 0.3930 | 0.6484 ✅ |
| Avg Tokens | 174 | 2,541 | 137 |
| Token reduction vs RAG | n/a | baseline | -94.6% |
GraphRAG is the best pipeline on every single metric, and it clears both BERTScore bonus thresholds (F1 >= 0.88 and Rescaled >= 0.55).
The comparison that matters is P3 vs P2. GraphRAG beats Basic RAG by 14 percentage points on accuracy while using 94.6% fewer tokens. That is not a tradeoff. That is a win on every dimension.
Why Basic RAG Loses
When you ask "Which player won the Orange Cap in the same season their team won the IPL trophy?", a vector search retrieves chunks that are semantically similar to that question. You get paragraphs about the Orange Cap award, paragraphs about IPL trophies, paragraphs about famous players — 2,500 tokens of text that is about the right topic but does not directly answer the question.
Basic RAG retrieves similar text. It cannot reason across relationships.
How GraphRAG Works
Instead of asking "what text looks like this question?", GraphRAG asks "what entities and relationships are relevant to this question?"
Step 1: Build the graph (one-time)
I ingested all 665 articles into TigerGraph Savanna (see the sketch after this list):
8,373 unique entities: players, teams, venues, awards, seasons
5,349 relationships: "played for", "won at", "coached by", "hosted"
7 communities detected via Louvain algorithm
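A rough sketch of what the ingestion step boils down to, assuming a minimal schema of `Entity` vertices connected by `RELATES_TO` edges; the repo's `setup_tigergraph.py` creates the actual schema, so treat the names here as placeholders:

```python
import pyTigerGraph as tg

# Connect to the Savanna instance (host and graph name are placeholders).
conn = tg.TigerGraphConnection(
    host="https://your-instance.i.tgcloud.io",
    graphname="IPL",
)
conn.getToken("<your-api-secret>")  # authenticate before making REST calls

# One extracted triple from an article: (player) -[played for]-> (team).
subj, relation, obj = "Virat Kohli", "played for", "Royal Challengers Bangalore"

# Upserts are idempotent, so re-running ingestion does not duplicate vertices.
conn.upsertVertex("Entity", subj, {"entity_type": "player"})
conn.upsertVertex("Entity", obj, {"entity_type": "team"})
conn.upsertEdge("Entity", subj, "RELATES_TO", "Entity", obj, {"relation": relation})
```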
Step 2: Query time (per question)
Entity extraction (~50 tokens): identify entities in the question
Graph traversal (0 LLM tokens): find entities in TigerGraph, traverse 2-3 hops
Community lookup (0 LLM tokens): retrieve community summary
Focused prompt (~90 tokens): structured list of relevant entities and relationships
Final answer: LLM answers from precise knowledge, not a document dump
The concise prompt format was the key unlock. Forcing 1-2 sentence answers matched the ground-truth answer distribution, boosted BERTScore, and cut tokens from 1,154 to 137 per query.
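To make the flow concrete, here is an illustrative end-to-end sketch of a single query, reusing the assumed `Entity`/`RELATES_TO` schema from the ingestion sketch. Helper names are mine, and the community-summary lookup is omitted for brevity:

```python
import pyTigerGraph as tg
from groq import Groq

groq = Groq()  # reads GROQ_API_KEY from the environment
conn = tg.TigerGraphConnection(host="https://your-instance.i.tgcloud.io", graphname="IPL")
conn.getToken("<your-api-secret>")  # authenticate as in the ingestion sketch

def extract_entities(question: str) -> list[str]:
    """Small, cheap model call (~50 tokens) to pull entity names out of the question."""
    resp = groq.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content":
                   f"List the named entities in this question, comma-separated:\n{question}"}],
    )
    return [e.strip() for e in resp.choices[0].message.content.split(",") if e.strip()]

def neighborhood(entity: str, hops: int = 2) -> list[str]:
    """Collect relationship facts within `hops` of an entity -- no LLM tokens spent here."""
    facts, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            # getEdges returns the edges attached to a vertex as a list of dicts.
            for edge in conn.getEdges("Entity", node, edgeType="RELATES_TO"):
                facts.append(f'{node} {edge["attributes"]["relation"]} {edge["to_id"]}')
                next_frontier.append(edge["to_id"])
        frontier = next_frontier
    return facts

def answer(question: str) -> str:
    facts = [f for e in extract_entities(question) for f in neighborhood(e)]
    prompt = ("Answer in 1-2 sentences using only these facts:\n"
              + "\n".join(facts)
              + f"\n\nQuestion: {question}")
    resp = groq.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```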
What Surprised Me
The prompt format matters more than the retrieval.
The biggest single improvement came from changing the answer prompt to require 1-2 sentence responses. This simultaneously improved accuracy, BERTScore, and token count. One prompt change moved BERTScore Rescaled from 0.5485 to 0.6484 and tokens from 1,154 to 137.
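To illustrate the kind of change involved (the repo's exact prompt wording isn't reproduced here):

```python
# Hypothetical before/after of the answer instruction -- wording is illustrative only.
VERBOSE_INSTRUCTION = "Answer the question using the context below and explain your reasoning."
CONCISE_INSTRUCTION = "Answer in 1-2 sentences, stating only the fact asked for. No preamble."
```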
Graph communities are genuinely meaningful.
The Louvain algorithm produced 7 coherent clusters that a cricket fan would recognise — the IPL core cluster, the RCB/Hyderabad era cluster, the T20 World Cup cluster. The algorithm found structure I did not explicitly encode.
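The community step itself is a few lines of NetworkX. This toy graph is far smaller than the real 8,373-node graph, so it won't reproduce the 7 clusters, but the call is the same:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Mirror of the knowledge graph: nodes are entities, edges are relationships.
G = nx.Graph()
G.add_edge("Virat Kohli", "Royal Challengers Bangalore", relation="played for")
G.add_edge("M. Chinnaswamy Stadium", "Royal Challengers Bangalore", relation="home ground of")
G.add_edge("Rohit Sharma", "Mumbai Indians", relation="played for")
# ... the real graph has 8,373 nodes and 5,349 edges

communities = louvain_communities(G, seed=42)
for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")
```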
Basic RAG is worse than you think.
P2 scored 50% accuracy despite having the most context (2,541 tokens per query). More context is not better context. The noise in retrieved chunks actively hurt answer quality — reflected in the lowest BERTScore of all three pipelines.
The Stack
TigerGraph Savanna: cloud graph database, free tier
pyTigerGraph: Python SDK for TigerGraph REST API
Groq: LLM inference (llama-3.3-70b-versatile for answers, llama-3.1-8b-instant for entity extraction)
FAISS + sentence-transformers: vector store for Pipeline 2
NetworkX: Louvain community detection
Streamlit + Plotly: comparison dashboard
Wikipedia API: dataset collection
Try It Live
Live dashboard: graphrag-ipl.streamlit.app
Full source code: github.com/AbdullahMustafa7/GraphRAG-IPL
To run locally:
Sign up at tgcloud.io (free, $60 credits)
Get a Groq API key at console.groq.com (free tier)
Clone the repo, add credentials to .env
Run python pipeline3_graphrag/setup_tigergraph.py
Run python pipeline3_graphrag/ingest.py
Run streamlit run dashboard/app.py
Built for the TigerGraph GraphRAG Inference Hackathon 2026.