Can graph-structured retrieval outperform traditional vector-based RAG while using fewer tokens, costing less, and delivering better answers?
That was the core question I set out to answer.
To test it rigorously, I built a production-grade benchmarking platform that compares:
- LLM-only
- Basic RAG (ChromaDB)
- GraphRAG (NetworkX)
The benchmark runs on a 2 million token scientific paper corpus, evaluates each pipeline on 40 benchmark questions, and measures:
- Token usage
- Latency
- Cost
- LLM-as-a-Judge pass rate
- BERTScore F1
The entire system is deployed publicly with:
- Frontend: Next.js on Vercel
- Backend: FastAPI on Hugging Face Spaces
This article covers the architecture, methodology, lessons learned, and the results that demonstrate why GraphRAG is a powerful retrieval paradigm.
Why This Benchmark Matters
Large Language Models are impressive, but they suffer from three core limitations:
- High token usage
- Hallucinations
- Poor multi-hop reasoning across documents
Traditional Retrieval-Augmented Generation (RAG) improves grounding by retrieving relevant chunks from a vector database.
However, Basic RAG has a structural limitation:
It retrieves semantically similar chunks, but does not explicitly model relationships between entities.
GraphRAG addresses this limitation by converting documents into a knowledge graph and traversing meaningful connections.
This enables:
- Multi-hop reasoning
- Relationship-aware retrieval
- Reduced context size
- More focused answers
The Research Question
The benchmark was designed to answer:
Does GraphRAG provide better efficiency and accuracy than LLM-only and Basic RAG?
Specifically:
- Does it use fewer tokens?
- Does it reduce latency?
- Does it lower cost?
- Does it maintain or improve answer quality?
Dataset Selection
Domain: Scientific Research Papers
I selected the Hugging Face dataset:
armanc/scientific_papers (arXiv split)
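To get a feel for the data, here is a minimal sketch of loading the arXiv split with the datasets library. Streaming is my choice here to avoid downloading the full corpus up front, and depending on your datasets version you may also need trust_remote_code=True:

```python
from datasets import load_dataset

# Stream the arXiv split so the full dataset is never downloaded at once.
papers = load_dataset("armanc/scientific_papers", "arxiv", split="train", streaming=True)

first = next(iter(papers))
print(list(first.keys()))       # expected fields: article, abstract, section_names
print(first["abstract"][:200])  # quick sanity check on the content
```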
Why Scientific Papers?
Scientific papers are ideal for GraphRAG because they contain:
- Methods
- Datasets
- Tasks
- Concepts
- Experimental results
- Cross-document dependencies
Questions often require connecting information across multiple sections and documents.
Dataset Scale
A custom pipeline sampled approximately:
- 2,000,000 tokens
- Hundreds of research papers
- Thousands of chunks
- Thousands of graph nodes and edges
This scale satisfies the benchmark requirement while remaining practical to reproduce.
The Three Pipelines
1. LLM-only
The question is sent directly to the language model.
Advantages
- Simple
- No retrieval infrastructure
Limitations
- High token usage
- Hallucinations
- No grounding
2. Basic RAG
Workflow:
- Chunk documents
- Generate embeddings
- Store vectors in ChromaDB
- Retrieve top-k chunks
- Send chunks to the LLM
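To make the workflow concrete, here is a minimal retrieve-then-prompt sketch. It assumes the chunks have already been embedded into a ChromaDB collection called "papers"; the embedding model and local path are illustrative, not the project's exact configuration:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative embedding model
client = chromadb.PersistentClient(path="chroma_db")   # illustrative local path
collection = client.get_collection("papers")           # assumes the index already exists

def basic_rag_context(question: str, k: int = 5) -> str:
    """Retrieve the top-k most similar chunks and join them into one context string."""
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    return "\n\n".join(results["documents"][0])

question = "Which datasets are used to evaluate the proposed methods?"
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{basic_rag_context(question)}\n\n"
    f"Question: {question}"
)
```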
Advantages
- Grounded retrieval
- Lower hallucination risk
Limitations
- No explicit relationships
- Weak multi-hop reasoning
3. GraphRAG
Workflow:
- Extract entities and relationships
- Build a knowledge graph
- Traverse graph neighborhoods
- Retrieve only connected evidence
- Generate answers
Advantages
- Multi-hop reasoning
- Relationship-aware retrieval
- Smaller, more relevant context
System Architecture
User
↓
Next.js Frontend (Vercel)
↓
FastAPI Backend (Hugging Face Spaces)
↓
Benchmark Orchestrator
├── LLM-only
├── Basic RAG (ChromaDB)
└── GraphRAG (NetworkX)
↓
Accuracy Evaluation
├── LLM-as-a-Judge
└── BERTScore
↓
Interactive Dashboard
Tech Stack
Frontend
- Next.js (App Router)
- TypeScript
- Tailwind CSS
- shadcn/ui
- Framer Motion
- Recharts
Backend
- Python
- FastAPI
- tiktoken
- datasets
- sentence-transformers
Retrieval
- ChromaDB
- NetworkX
Evaluation
- huggingface_hub
- evaluate
- bert-score
Deployment
- Vercel
- Hugging Face Spaces
- Docker
Benchmark Methodology
Step 1: Build the 2M Token Corpus
A token-aware dataset builder:
- Downloads scientific papers
- Counts tokens using tiktoken
- Stops at ~2 million tokens
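A minimal sketch of such a builder, assuming the cl100k_base encoding (the encoding actually used in the project may differ):

```python
import tiktoken
from datasets import load_dataset

TOKEN_BUDGET = 2_000_000
encoding = tiktoken.get_encoding("cl100k_base")   # assumed encoding

papers = load_dataset("armanc/scientific_papers", "arxiv", split="train", streaming=True)

corpus, total_tokens = [], 0
for paper in papers:
    text = paper["article"]
    n_tokens = len(encoding.encode(text, disallowed_special=()))
    if total_tokens + n_tokens > TOKEN_BUDGET:
        break
    corpus.append(text)
    total_tokens += n_tokens

print(f"Sampled {len(corpus)} papers totalling {total_tokens:,} tokens")
```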
Step 2: Preprocess and Chunk
Documents are split into:
- 500-token chunks
- 80-token overlap
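A minimal token-based chunker with those parameters:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")   # assumed encoding

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 80) -> list[str]:
    """Split text into ~500-token chunks, each overlapping the previous one by 80 tokens."""
    tokens = encoding.encode(text, disallowed_special=())
    chunks = []
    for start in range(0, len(tokens), chunk_size - overlap):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
    return chunks
```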
Step 3: Build Retrieval Layers
Basic RAG
- Generate embeddings
- Store in ChromaDB
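A minimal indexing sketch for this layer (model name, path, and collection name are illustrative):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("papers")

def index_chunks(chunks: list[str]) -> None:
    """Embed every chunk and store it in ChromaDB with a stable id."""
    embeddings = embedder.encode(chunks).tolist()
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(ids=ids, documents=chunks, embeddings=embeddings)
```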
GraphRAG
- Extract methods, datasets, tasks, and concepts
- Build graph nodes and edges using NetworkX
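The entity extractor itself is not shown here; the sketch below assumes each chunk already comes with a list of (name, type) entities and shows how the NetworkX nodes and edges can be assembled and serialized:

```python
import pickle
from itertools import combinations

import networkx as nx

graph = nx.Graph()

def add_chunk_to_graph(chunk_id: str, entities: list[tuple[str, str]]) -> None:
    """entities: (name, type) pairs, e.g. ("BERT", "method") or ("SQuAD", "dataset")."""
    graph.add_node(chunk_id, kind="chunk")
    for name, entity_type in entities:
        graph.add_node(name, kind=entity_type)
        graph.add_edge(chunk_id, name, relation="mentions")
    # Entities that co-occur in the same chunk get a direct edge, enabling multi-hop traversal.
    for (a, _), (b, _) in combinations(entities, 2):
        graph.add_edge(a, b, relation="co-occurs_with")

with open("graph.gpickle", "wb") as f:   # illustrative path for the serialized graph file
    pickle.dump(graph, f)
```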
Step 4: Create Evaluation Questions
Generated 40 questions across four categories:
- Factual
- Entity-based
- Multi-hop
- Synthesis
Step 5: Run All Pipelines
Each question is processed by:
- LLM-only
- Basic RAG
- GraphRAG
Step 6: Record Metrics
Per query:
- Tokens
- Latency
- Cost
- Answer text
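Steps 5 and 6 can be expressed as one driver loop. The pipeline callables and the flat per-token price below are placeholders, not the project's actual pricing:

```python
import json
import time

def run_benchmark(questions, pipelines, cost_per_token=3e-6):
    """Run every question through every pipeline and record per-query metrics."""
    records = []
    for question in questions:
        for name, pipeline in pipelines.items():     # {"llm_only": ..., "basic_rag": ..., "graphrag": ...}
            start = time.perf_counter()
            answer, tokens_used = pipeline(question)  # each pipeline returns (answer_text, token_count)
            records.append({
                "pipeline": name,
                "question": question,
                "answer": answer,
                "tokens": tokens_used,
                "latency_s": round(time.perf_counter() - start, 3),
                "cost_usd": round(tokens_used * cost_per_token, 6),
            })
    with open("benchmark_results.json", "w") as f:    # JSON benchmark artifact
        json.dump(records, f, indent=2)
    return records
```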
Step 7: Evaluate Accuracy
Using two independent methods.
Accuracy Evaluation
Method 1: LLM-as-a-Judge
A separate language model grades answers as PASS or FAIL.
Judge model:
meta-llama/Llama-3.1-8B-Instruct
This captures:
- Factual correctness
- Hallucinations
- Missing information
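A minimal judge call through huggingface_hub. The prompt wording is illustrative, not the exact rubric used in the project, and a Hugging Face token with access to the model is required:

```python
from huggingface_hub import InferenceClient

judge = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct")

def judge_answer(question: str, reference: str, answer: str) -> bool:
    """Return True if the judge model replies PASS for the candidate answer."""
    prompt = (
        "You are grading an answer against a reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with exactly one word: PASS if the candidate is factually correct "
        "and complete, otherwise FAIL."
    )
    response = judge.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```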
Method 2: BERTScore
Measures semantic similarity between:
- Predicted answer
- Ground-truth answer
This captures:
- Meaning preservation
- Paraphrasing equivalence
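Computing it with the evaluate and bert-score packages from the stack looks roughly like this:

```python
import evaluate

bertscore = evaluate.load("bertscore")   # requires the bert-score package

predictions = ["GraphRAG retrieves connected evidence through multi-hop traversal."]
references = ["GraphRAG uses multi-hop graph traversal to collect connected evidence."]

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
mean_f1 = sum(scores["f1"]) / len(scores["f1"])
print(f"BERTScore F1: {mean_f1:.2f}")
```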
Why Two Metrics?
One metric can be misleading.
Using both:
- Provides independent validation
- Creates more defensible results
Benchmark Results
| Metric | LLM-only | Basic RAG | GraphRAG |
|---|---|---|---|
| Avg Tokens / Query | 3,200 | 640 | 316 |
| Avg Cost / Query | $0.0096 | $0.0019 | $0.00095 |
| Avg Latency | 6.23 s | 2.85 s | 1.22 s |
| LLM Judge Pass Rate | 82% | 89% | 94% |
| BERTScore F1 | 0.57 | 0.64 | 0.71 |
Key Improvements
GraphRAG vs Basic RAG
- 50.6% fewer tokens
- 57.2% lower latency
- ~50% lower cost
- +5% higher pass rate
- +0.07 BERTScore F1
GraphRAG vs LLM-only
- 90.1% fewer tokens
- 80.4% lower latency
- ~90% lower cost
- +12% higher pass rate
- +0.14 BERTScore F1
Why GraphRAG Wins
Basic RAG retrieves chunks independently.
GraphRAG retrieves connected evidence.
Example:
- Paper A introduces a method
- Paper B evaluates it
- Paper C improves it
Basic RAG may retrieve only one paper.
GraphRAG traverses all connected nodes, producing a complete answer with less context.
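A sketch of that traversal, reusing the graph built earlier: entities named in the question act as seeds, and nx.ego_graph collects every chunk within two hops. The entity-matching step is deliberately simplified here:

```python
import networkx as nx

def graphrag_context(graph: nx.Graph, question: str, hops: int = 2) -> list[str]:
    """Return ids of chunk nodes within `hops` of any entity mentioned in the question."""
    # Simplified seed selection: entities whose name literally appears in the question.
    seeds = [
        node for node, data in graph.nodes(data=True)
        if data.get("kind") != "chunk" and str(node).lower() in question.lower()
    ]
    chunk_ids: set[str] = set()
    for seed in seeds:
        neighborhood = nx.ego_graph(graph, seed, radius=hops)
        chunk_ids.update(
            node for node, data in neighborhood.nodes(data=True)
            if data.get("kind") == "chunk"
        )
    return sorted(chunk_ids)
```

Only the chunks reachable through these relationships go into the prompt, which is why the context stays small while still covering evidence from all three papers.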
TigerGraph Challenges and Pivot to NetworkX
The project originally explored TigerGraph as the graph database backend.
TigerGraph is an impressive enterprise-grade platform and was valuable in shaping the system design.
However, during development I encountered:
- Authentication issues
- Docker setup complexity
- Resource overhead
- Infrastructure troubleshooting
These operational challenges shifted focus away from the benchmark itself.
The Core Realization
The benchmark was about proving:
Graph-based retrieval methodology is superior.
It was not about dependence on a specific graph database.
Why NetworkX Became the Primary Engine
NetworkX offered:
- Multi-hop traversal
- Relationship-aware retrieval
- Zero infrastructure overhead
- Pure Python execution
- Easy reproducibility
TigerGraph remained an optional enterprise connector.
NetworkX vs TigerGraph
| Capability | NetworkX | TigerGraph |
|---|---|---|
| Multi-hop graph traversal | Yes | Yes |
| Benchmark-ready | Yes | Yes |
| Runs on any laptop | Yes | No |
| Zero setup overhead | Yes | No |
| Enterprise scalability | Limited | Excellent |
| Required for benchmark | Yes | No |
Deployment
Frontend
- Deployed on Vercel
Backend
- Deployed on Hugging Face Spaces
Data Layer
- ChromaDB
- NetworkX serialized graph files
- JSON benchmark artifacts
The deployment is fully reproducible and lightweight.
Interactive Dashboard
The web application includes:
- Benchmark runner
- Accuracy metrics
- Architecture overview
- Dataset information
- Comparative charts
- Project narrative
The dashboard presents:
- Tokens
- Latency
- Cost
- Pass Rate
- BERTScore
- Reduction percentages
Key Insights
1. Graph Structure Matters
Explicit relationships enable better retrieval.
2. Smaller Context Can Improve Accuracy
GraphRAG used fewer tokens and produced better answers.
3. Evaluation Must Include Accuracy
Token reduction alone is meaningless without quality validation.
4. Methodology Matters More Than Infrastructure
GraphRAG benefits are independent of the graph database vendor.
5. Reproducibility Is Essential
Lightweight tooling makes the benchmark easier for others to verify.
Reproducibility
Anyone can reproduce the benchmark by:
- Cloning the repository
- Building the 2M-token corpus
- Rebuilding indexes
- Running the benchmark
- Evaluating accuracy
- Viewing the dashboard
The project includes:
- Secure deployment configs
- Validation scripts
- Backup scripts
- Full documentation
GitHub Repository
Public repository:
VEDANTDHAVAN/graphrag-benchmark
2M-token benchmark comparing LLM-only, Basic RAG, and NetworkX-based GraphRAG with scientific papers, accuracy evaluation, and interactive dashboard.
Live Demo
Frontend:
Backend API: https://vedantdhavan-graphrag-benchmark.hf.space/health
Final Conclusion
This benchmark demonstrates that GraphRAG is not just an academic concept—it delivers measurable benefits in real-world retrieval systems.
Compared with LLM-only and Basic RAG, GraphRAG achieved:
- Lower token usage
- Lower latency
- Lower cost
- Higher answer accuracy
Most importantly, the results show that:
The power of GraphRAG lies in graph-structured retrieval and multi-hop reasoning, not dependence on a specific graph database.
By focusing on methodology and reproducibility, this project provides a transparent, production-grade benchmark for evaluating GraphRAG at scale.
Connect With Me
Vedant Dhavan
Full-Stack Developer | Generative AI Explorer | Agentic AI Enthusiast
- GitHub: https://github.com/VEDANTDHAVAN
- LinkedIn: https://www.linkedin.com/in/vedant-dhavan

