Can graph-structured retrieval outperform traditional vector-based RAG while using fewer tokens, costing less, and delivering better answers?
That was the core question I set out to answer.
To test it rigorously, I built a production-grade benchmarking platform that compares:
- LLM-only
- Basic RAG (ChromaDB)
- GraphRAG (NetworkX)
The benchmark runs on a 2 million token scientific paper corpus, evaluates each pipeline on 40 benchmark questions, and measures:
- Token usage
- Latency
- Cost
- LLM-as-a-Judge pass rate
- BERTScore F1
The entire system is deployed publicly with:
- Frontend: Next.js on Vercel
- Backend: FastAPI on Hugging Face Spaces
This article covers the architecture, methodology, lessons learned, and the results that demonstrate why GraphRAG is a powerful retrieval paradigm.
Why This Benchmark Matters
Large Language Models are impressive, but they suffer from three core limitations:
- High token usage
- Hallucinations
- Poor multi-hop reasoning across documents
Traditional Retrieval-Augmented Generation (RAG) improves grounding by retrieving relevant chunks from a vector database.
However, Basic RAG has a structural limitation:
It retrieves semantically similar chunks, but does not explicitly model relationships between entities.
GraphRAG addresses this limitation by converting documents into a knowledge graph and traversing meaningful connections.
This enables:
- Multi-hop reasoning
- Relationship-aware retrieval
- Reduced context size
- More focused answers
The Research Question
The benchmark was designed to answer:
Does GraphRAG provide better efficiency and accuracy than LLM-only and Basic RAG?
Specifically:
- Does it use fewer tokens?
- Does it reduce latency?
- Does it lower cost?
- Does it maintain or improve answer quality?
Dataset Selection
Domain: Scientific Research Papers
I selected the Hugging Face dataset:
armanc/scientific_papers (arXiv split)
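To get a feel for the data, here is a minimal sketch of loading the arXiv split with the datasets library. Streaming is my choice here to avoid downloading the full corpus up front, and depending on your datasets version you may also need trust_remote_code=True:

```python
from datasets import load_dataset

# Stream the arXiv split so the full dataset is never downloaded at once.
papers = load_dataset("armanc/scientific_papers", "arxiv", split="train", streaming=True)

first = next(iter(papers))
print(list(first.keys()))       # expected fields: article, abstract, section_names
print(first["abstract"][:200])  # quick sanity check on the content
```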
Why Scientific Papers?
Scientific papers are ideal for GraphRAG because they contain:
- Methods
- Datasets
- Tasks
- Concepts
- Experimental results
- Cross-document dependencies
Questions often require connecting information across multiple sections and documents.
Dataset Scale
A custom pipeline sampled approximately:
- 2,000,000 tokens
- Hundreds of research papers
- Thousands of chunks
- Thousands of graph nodes and edges
This scale satisfies the benchmark requirement while remaining practical to reproduce.
The Three Pipelines
1. LLM-only
The question is sent directly to the language model.
Advantages
- Simple
- No retrieval infrastructure
Limitations
- High token usage
- Hallucinations
- No grounding
2. Basic RAG
Workflow:
- Chunk documents
- Generate embeddings
- Store vectors in ChromaDB
- Retrieve top-k chunks
- Send chunks to the LLM
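To make the workflow concrete, here is a minimal retrieve-then-prompt sketch. It assumes the chunks have already been embedded into a ChromaDB collection called "papers"; the embedding model and local path are illustrative, not the project's exact configuration:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative embedding model
client = chromadb.PersistentClient(path="chroma_db")   # illustrative local path
collection = client.get_collection("papers")           # assumes the index already exists

def basic_rag_context(question: str, k: int = 5) -> str:
    """Retrieve the top-k most similar chunks and join them into one context string."""
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    return "\n\n".join(results["documents"][0])

question = "Which datasets are used to evaluate the proposed methods?"
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{basic_rag_context(question)}\n\n"
    f"Question: {question}"
)
```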
Advantages
- Grounded retrieval
- Lower hallucination risk
Limitations
- No explicit relationships
- Weak multi-hop reasoning
3. GraphRAG
Workflow:
- Extract entities and relationships
- Build a knowledge graph
- Traverse graph neighborhoods
- Retrieve only connected evidence
- Generate answers
Advantages
- Multi-hop reasoning
- Relationship-aware retrieval
- Smaller, more relevant context
System Architecture
User
↓
Next.js Frontend (Vercel)
↓
FastAPI Backend (Hugging Face Spaces)
↓
Benchmark Orchestrator
├── LLM-only
├── Basic RAG (ChromaDB)
└── GraphRAG (NetworkX)
↓
Accuracy Evaluation
├── LLM-as-a-Judge
└── BERTScore
↓
Interactive Dashboard
Tech Stack
Frontend
- Next.js (App Router)
- TypeScript
- Tailwind CSS
- shadcn/ui
- Framer Motion
- Recharts
Backend
- Python
- FastAPI
- tiktoken
- datasets
- sentence-transformers
Retrieval
- ChromaDB
- NetworkX
Evaluation
- huggingface_hub
- evaluate
- bert-score
Deployment
- Vercel
- Hugging Face Spaces
- Docker
Benchmark Methodology
Step 1: Build the 2M Token Corpus
A token-aware dataset builder:
- Downloads scientific papers
- Counts tokens using tiktoken
- Stops at ~2 million tokens
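A minimal sketch of such a builder, assuming the cl100k_base encoding (the encoding actually used in the project may differ):

```python
import tiktoken
from datasets import load_dataset

TOKEN_BUDGET = 2_000_000
encoding = tiktoken.get_encoding("cl100k_base")   # assumed encoding

papers = load_dataset("armanc/scientific_papers", "arxiv", split="train", streaming=True)

corpus, total_tokens = [], 0
for paper in papers:
    text = paper["article"]
    n_tokens = len(encoding.encode(text, disallowed_special=()))
    if total_tokens + n_tokens > TOKEN_BUDGET:
        break
    corpus.append(text)
    total_tokens += n_tokens

print(f"Sampled {len(corpus)} papers totalling {total_tokens:,} tokens")
```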
Step 2: Preprocess and Chunk
Documents are split into:
- 500-token chunks
- 80-token overlap
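A minimal token-based chunker with those parameters:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")   # assumed encoding

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 80) -> list[str]:
    """Split text into ~500-token chunks, each overlapping the previous one by 80 tokens."""
    tokens = encoding.encode(text, disallowed_special=())
    chunks = []
    for start in range(0, len(tokens), chunk_size - overlap):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
    return chunks
```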
Step 3: Build Retrieval Layers
Basic RAG
- Generate embeddings
- Store in ChromaDB
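A minimal indexing sketch for this layer (model name, path, and collection name are illustrative):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("papers")

def index_chunks(chunks: list[str]) -> None:
    """Embed every chunk and store it in ChromaDB with a stable id."""
    embeddings = embedder.encode(chunks).tolist()
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(ids=ids, documents=chunks, embeddings=embeddings)
```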
GraphRAG
- Extract methods, datasets, tasks, and concepts
- Build graph nodes and edges using NetworkX
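The entity extractor itself is not shown here; the sketch below assumes each chunk already comes with a list of (name, type) entities and shows how the NetworkX nodes and edges can be assembled and serialized:

```python
import pickle
from itertools import combinations

import networkx as nx

graph = nx.Graph()

def add_chunk_to_graph(chunk_id: str, entities: list[tuple[str, str]]) -> None:
    """entities: (name, type) pairs, e.g. ("BERT", "method") or ("SQuAD", "dataset")."""
    graph.add_node(chunk_id, kind="chunk")
    for name, entity_type in entities:
        graph.add_node(name, kind=entity_type)
        graph.add_edge(chunk_id, name, relation="mentions")
    # Entities that co-occur in the same chunk get a direct edge, enabling multi-hop traversal.
    for (a, _), (b, _) in combinations(entities, 2):
        graph.add_edge(a, b, relation="co-occurs_with")

with open("graph.gpickle", "wb") as f:   # illustrative path for the serialized graph file
    pickle.dump(graph, f)
```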
Step 4: Create Evaluation Questions
Generated 40 questions across four categories:
- Factual
- Entity-based
- Multi-hop
- Synthesis
Step 5: Run All Pipelines
Each question is processed by:
- LLM-only
- Basic RAG
- GraphRAG
Step 6: Record Metrics
Per query:
- Tokens
- Latency
- Cost
- Answer text
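Steps 5 and 6 can be expressed as one driver loop. The pipeline callables and the flat per-token price below are placeholders, not the project's actual pricing:

```python
import json
import time

def run_benchmark(questions, pipelines, cost_per_token=3e-6):
    """Run every question through every pipeline and record per-query metrics."""
    records = []
    for question in questions:
        for name, pipeline in pipelines.items():     # {"llm_only": ..., "basic_rag": ..., "graphrag": ...}
            start = time.perf_counter()
            answer, tokens_used = pipeline(question)  # each pipeline returns (answer_text, token_count)
            records.append({
                "pipeline": name,
                "question": question,
                "answer": answer,
                "tokens": tokens_used,
                "latency_s": round(time.perf_counter() - start, 3),
                "cost_usd": round(tokens_used * cost_per_token, 6),
            })
    with open("benchmark_results.json", "w") as f:    # JSON benchmark artifact
        json.dump(records, f, indent=2)
    return records
```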
Step 7: Evaluate Accuracy
Using two independent methods.
Accuracy Evaluation
Method 1: LLM-as-a-Judge
A separate language model grades answers as PASS or FAIL.
Judge model:
meta-llama/Llama-3.1-8B-Instruct
This captures:
- Factual correctness
- Hallucinations
- Missing information
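A minimal judge call through huggingface_hub. The prompt wording is illustrative, not the exact rubric used in the project, and a Hugging Face token with access to the model is required:

```python
from huggingface_hub import InferenceClient

judge = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct")

def judge_answer(question: str, reference: str, answer: str) -> bool:
    """Return True if the judge model replies PASS for the candidate answer."""
    prompt = (
        "You are grading an answer against a reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with exactly one word: PASS if the candidate is factually correct "
        "and complete, otherwise FAIL."
    )
    response = judge.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```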
Method 2: BERTScore
Measures semantic similarity between:
- Predicted answer
- Ground-truth answer
This captures:
- Meaning preservation
- Paraphrasing equivalence
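Computing it with the evaluate and bert-score packages from the stack looks roughly like this:

```python
import evaluate

bertscore = evaluate.load("bertscore")   # requires the bert-score package

predictions = ["GraphRAG retrieves connected evidence through multi-hop traversal."]
references = ["GraphRAG uses multi-hop graph traversal to collect connected evidence."]

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
mean_f1 = sum(scores["f1"]) / len(scores["f1"])
print(f"BERTScore F1: {mean_f1:.2f}")
```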
Why Two Metrics?
One metric can be misleading.
Using both:
- Provides independent validation
- Creates more defensible results
Benchmark Results
| Metric | LLM-only | Basic RAG | GraphRAG |
|---|---|---|---|
| Avg Tokens / Query | 3,200 | 640 | 316 |
| Avg Cost / Query | $0.0096 | $0.0019 | $0.00095 |
| Avg Latency | 6.23 s | 2.85 s | 1.22 s |
| LLM Judge Pass Rate | 82% | 89% | 94% |
| BERTScore F1 | 0.57 | 0.64 | 0.71 |
Key Improvements
GraphRAG vs Basic RAG
- 50.6% fewer tokens
- 57.2% lower latency
- ~50% lower cost
- +5% higher pass rate
- +0.07 BERTScore F1
GraphRAG vs LLM-only
- 90.1% fewer tokens
- 80.4% lower latency
- ~90% lower cost
- +12% higher pass rate
- +0.14 BERTScore F1
Why GraphRAG Wins
Basic RAG retrieves chunks independently.
GraphRAG retrieves connected evidence.
Example:
- Paper A introduces a method
- Paper B evaluates it
- Paper C improves it
Basic RAG may retrieve only one paper.
GraphRAG traverses all connected nodes, producing a complete answer with less context.
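A sketch of that traversal, reusing the graph built earlier: entities named in the question act as seeds, and nx.ego_graph collects every chunk within two hops. The entity-matching step is deliberately simplified here:

```python
import networkx as nx

def graphrag_context(graph: nx.Graph, question: str, hops: int = 2) -> list[str]:
    """Return ids of chunk nodes within `hops` of any entity mentioned in the question."""
    # Simplified seed selection: entities whose name literally appears in the question.
    seeds = [
        node for node, data in graph.nodes(data=True)
        if data.get("kind") != "chunk" and str(node).lower() in question.lower()
    ]
    chunk_ids: set[str] = set()
    for seed in seeds:
        neighborhood = nx.ego_graph(graph, seed, radius=hops)
        chunk_ids.update(
            node for node, data in neighborhood.nodes(data=True)
            if data.get("kind") == "chunk"
        )
    return sorted(chunk_ids)
```

Only the chunks reachable through these relationships go into the prompt, which is why the context stays small while still covering evidence from all three papers.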
TigerGraph Challenges and Pivot to NetworkX
The project originally explored TigerGraph as the graph database backend.
TigerGraph is an impressive enterprise-grade platform and was valuable in shaping the system design.
However, during development I encountered:
- Authentication issues
- Docker setup complexity
- Resource overhead
- Infrastructure troubleshooting
These operational challenges shifted focus away from the benchmark itself.
The Core Realization
The benchmark was about proving:
Graph-based retrieval methodology is superior.
It was not about dependence on a specific graph database.
Why NetworkX Became the Primary Engine
NetworkX offered:
- Multi-hop traversal
- Relationship-aware retrieval
- Zero infrastructure overhead
- Pure Python execution
- Easy reproducibility
TigerGraph remained an optional enterprise connector.
NetworkX vs TigerGraph
| Capability | NetworkX | TigerGraph |
|---|---|---|
| Multi-hop graph traversal | Yes | Yes |
| Benchmark-ready | Yes | Yes |
| Runs on any laptop | Yes | No |
| Zero setup overhead | Yes | No |
| Enterprise scalability | Limited | Excellent |
| Required for benchmark | Yes | No |
Deployment
Frontend
- Deployed on Vercel
Backend
- Deployed on Hugging Face Spaces
Data Layer
- ChromaDB
- NetworkX serialized graph files
- JSON benchmark artifacts
The deployment is fully reproducible and lightweight.
Interactive Dashboard
The web application includes:
- Benchmark runner
- Accuracy metrics
- Architecture overview
- Dataset information
- Comparative charts
- Project narrative
The dashboard presents:
- Tokens
- Latency
- Cost
- Pass Rate
- BERTScore
- Reduction percentages
Key Insights
1. Graph Structure Matters
Explicit relationships enable better retrieval.
2. Smaller Context Can Improve Accuracy
GraphRAG used fewer tokens and produced better answers.
3. Evaluation Must Include Accuracy
Token reduction alone is meaningless without quality validation.
4. Methodology Matters More Than Infrastructure
GraphRAG benefits are independent of the graph database vendor.
5. Reproducibility Is Essential
Lightweight tooling makes the benchmark easier for others to verify.
Reproducibility
Anyone can reproduce the benchmark by:
- Cloning the repository
- Building the 2M-token corpus
- Rebuilding indexes
- Running the benchmark
- Evaluating accuracy
- Viewing the dashboard
The project includes:
- Secure deployment configs
- Validation scripts
- Backup scripts
- Full documentation
GitHub Repository
Public repository:
VEDANTDHAVAN/graphrag-benchmark
2M-token benchmark comparing LLM-only, Basic RAG, and NetworkX-based GraphRAG with scientific papers, accuracy evaluation, and interactive dashboard.
Live Demo
Frontend:
Backend API: https://vedantdhavan-graphrag-benchmark.hf.space/health
Final Conclusion
This benchmark demonstrates that GraphRAG is not just an academic concept—it delivers measurable benefits in real-world retrieval systems.
Compared with LLM-only and Basic RAG, GraphRAG achieved:
- Lower token usage
- Lower latency
- Lower cost
- Higher answer accuracy
Most importantly, the results show that:
The power of GraphRAG lies in graph-structured retrieval and multi-hop reasoning, not dependence on a specific graph database.
By focusing on methodology and reproducibility, this project provides a transparent, production-grade benchmark for evaluating GraphRAG at scale.
Connect With Me
Vedant Dhavan
Full-Stack Developer | Generative AI Explorer | Agentic AI Enthusiast
- GitHub: https://github.com/VEDANTDHAVAN
- LinkedIn: https://www.linkedin.com/in/vedant-dhavan

