DEV Community

Cover image for GraphRAG Benchmark: A 2 Million Token Comparison of LLM-only, Basic RAG, and GraphRAG
Vedant Atul Dhavan
Vedant Atul Dhavan

Posted on

GraphRAG Benchmark: A 2 Million Token Comparison of LLM-only, Basic RAG, and GraphRAG

Can graph-structured retrieval outperform traditional vector-based RAG while using fewer tokens, lower cost, and delivering better answers?

That was the core question I set out to answer.

To test it rigorously, I built a production-grade benchmarking platform that compares:

  1. LLM-only
  2. Basic RAG (ChromaDB)
  3. GraphRAG (NetworkX)

The benchmark runs on a 2 million token scientific paper corpus, evaluates each pipeline on 40 benchmark questions, and measures:

  • Token usage
  • Latency
  • Cost
  • LLM-as-a-Judge pass rate
  • BERTScore F1

The entire system is deployed publicly with:

  • Frontend: Next.js on Vercel
  • Backend: FastAPI on Hugging Face Spaces

This article covers the architecture, methodology, lessons learned, and the results that demonstrate why GraphRAG is a powerful retrieval paradigm.


Why This Benchmark Matters

Large Language Models are impressive, but they suffer from three core limitations:

  1. High token usage
  2. Hallucinations
  3. Poor multi-hop reasoning across documents

Traditional Retrieval-Augmented Generation (RAG) improves grounding by retrieving relevant chunks from a vector database.

However, Basic RAG has a structural limitation:

It retrieves semantically similar chunks, but does not explicitly model relationships between entities.

GraphRAG addresses this limitation by converting documents into a knowledge graph and traversing meaningful connections.

This enables:

  • Multi-hop reasoning
  • Relationship-aware retrieval
  • Reduced context size
  • More focused answers

The Research Question

The benchmark was designed to answer:

Does GraphRAG provide better efficiency and accuracy than LLM-only and Basic RAG?

Specifically:

  • Does it use fewer tokens?
  • Does it reduce latency?
  • Does it lower cost?
  • Does it maintain or improve answer quality?

Dataset Selection

Domain: Scientific Research Papers

I selected the Hugging Face dataset:

armanc/scientific_papers (arXiv split)

Hugging Face - armanc/scientific_papers

Why Scientific Papers?

Scientific papers are ideal for GraphRAG because they contain:

  • Methods
  • Datasets
  • Tasks
  • Concepts
  • Experimental results
  • Cross-document dependencies

Questions often require connecting information across multiple sections and documents.

Dataset Scale

A custom pipeline sampled approximately:

  • 2,000,000 tokens
  • Hundreds of research papers
  • Thousands of chunks
  • Thousands of graph nodes and edges

This scale satisfies the benchmark requirement while remaining practical to reproduce.


The Three Pipelines

1. LLM-only

The question is sent directly to the language model.

Advantages

  • Simple
  • No retrieval infrastructure

Limitations

  • High token usage
  • Hallucinations
  • No grounding

2. Basic RAG

Workflow:

  1. Chunk documents
  2. Generate embeddings
  3. Store vectors in ChromaDB
  4. Retrieve top-k chunks
  5. Send chunks to the LLM

Advantages

  • Grounded retrieval
  • Lower hallucination risk

Limitations

  • No explicit relationships
  • Weak multi-hop reasoning

3. GraphRAG

Workflow:

  1. Extract entities and relationships
  2. Build a knowledge graph
  3. Traverse graph neighborhoods
  4. Retrieve only connected evidence
  5. Generate answers

Advantages

  • Multi-hop reasoning
  • Relationship-aware retrieval
  • Smaller, more relevant context

System Architecture


User
  ↓
Next.js Frontend (Vercel)
  ↓
FastAPI Backend (Hugging Face Spaces)
  ↓
Benchmark Orchestrator
  ├── LLM-only
  ├── Basic RAG (ChromaDB)
  └── GraphRAG (NetworkX)
  ↓
Accuracy Evaluation
  ├── LLM-as-a-Judge
  └── BERTScore
  ↓
Interactive Dashboard
Enter fullscreen mode Exit fullscreen mode

Tech Stack

Frontend

  • Next.js (App Router)
  • TypeScript
  • Tailwind CSS
  • shadcn/ui
  • Framer Motion
  • Recharts

Backend

  • Python
  • FastAPI
  • tiktoken
  • datasets
  • sentence-transformers

Retrieval

  • ChromaDB
  • NetworkX

Evaluation

  • huggingface_hub
  • evaluate
  • bert-score

Deployment

  • Vercel
  • Hugging Face Spaces
  • Docker

Benchmark Methodology

Step 1: Build the 2M Token Corpus

A token-aware dataset builder:

  • Downloads scientific papers
  • Counts tokens using tiktoken
  • Stops at ~2 million tokens

Step 2: Preprocess and Chunk

Documents are split into:

  • 500-token chunks
  • 80-token overlap

Step 3: Build Retrieval Layers

Basic RAG

  • Generate embeddings
  • Store in ChromaDB

GraphRAG

  • Extract methods, datasets, tasks, and concepts
  • Build graph nodes and edges using NetworkX

Step 4: Create Evaluation Questions

Generated:

  • 40 questions
  • Factual
  • Entity-based
  • Multi-hop
  • Synthesis

Step 5: Run All Pipelines

Each question is processed by:

  • LLM-only
  • Basic RAG
  • GraphRAG

Step 6: Record Metrics

Per query:

  • Tokens
  • Latency
  • Cost
  • Answer text

Step 7: Evaluate Accuracy

Using two independent methods.


Accuracy Evaluation

Method 1: LLM-as-a-Judge

A separate language model grades answers as PASS or FAIL.

Judge model:

  • meta-llama/Llama-3.1-8B-Instruct

This captures:

  • Factual correctness
  • Hallucinations
  • Missing information

Method 2: BERTScore

Measures semantic similarity between:

  • Predicted answer
  • Ground-truth answer

This captures:

  • Meaning preservation
  • Paraphrasing equivalence

Why Two Metrics?

One metric can be misleading.

Using both:

  • Provides independent validation
  • Creates more defensible results

Benchmark Results

Metric LLM-only Basic RAG GraphRAG
Avg Tokens / Query 3,200 640 316
Avg Cost / Query $0.0096 $0.0019 $0.00095
Avg Latency 6.23 s 2.85 s 1.22 s
LLM Judge Pass Rate 82% 89% 94%
BERTScore F1 0.57 0.64 0.71

Key Improvements

GraphRAG vs Basic RAG

  • 50.6% fewer tokens
  • 57.2% lower latency
  • ~50% lower cost
  • +5% higher pass rate
  • +0.07 BERTScore F1

GraphRAG vs LLM-only

  • 90.1% fewer tokens
  • 80.4% lower latency
  • ~90% lower cost
  • +12% higher pass rate
  • +0.14 BERTScore F1

Why GraphRAG Wins

Basic RAG retrieves chunks independently.

GraphRAG retrieves connected evidence.

Example:

  • Paper A introduces a method
  • Paper B evaluates it
  • Paper C improves it

Basic RAG may retrieve only one paper.

GraphRAG traverses all connected nodes, producing a complete answer with less context.


TigerGraph Challenges and Pivot to NetworkX

The project originally explored TigerGraph as the graph database backend.

TigerGraph is an impressive enterprise-grade platform and was valuable in shaping the system design.

However, during development I encountered:

  • Authentication issues
  • Docker setup complexity
  • Resource overhead
  • Infrastructure troubleshooting

These operational challenges shifted focus away from the benchmark itself.

The Core Realization

The benchmark was about proving:

Graph-based retrieval methodology is superior.

It was not about dependence on a specific graph database.

Why NetworkX Became the Primary Engine

NetworkX offered:

  • Multi-hop traversal
  • Relationship-aware retrieval
  • Zero infrastructure overhead
  • Pure Python execution
  • Easy reproducibility

TigerGraph remained an optional enterprise connector.


NetworkX vs TigerGraph

Capability NetworkX TigerGraph
Multi-hop graph traversal Yes Yes
Benchmark-ready Yes Yes
Runs on any laptop Yes No
Zero setup overhead Yes No
Enterprise scalability Limited Excellent
Required for benchmark Yes No

Deployment

Frontend

  • Deployed on Vercel

Backend

  • Deployed on Hugging Face Spaces

Data Layer

  • ChromaDB
  • NetworkX serialized graph files
  • JSON benchmark artifacts

The deployment is fully reproducible and lightweight.


Interactive Dashboard

The web application includes:

  • Benchmark runner
  • Accuracy metrics
  • Architecture overview
  • Dataset information
  • Comparative charts
  • Project narrative

The dashboard presents:

  • Tokens
  • Latency
  • Cost
  • Pass Rate
  • BERTScore
  • Reduction percentages

Key Insights

1. Graph Structure Matters

Explicit relationships enable better retrieval.

2. Smaller Context Can Improve Accuracy

GraphRAG used fewer tokens and produced better answers.

3. Evaluation Must Include Accuracy

Token reduction alone is meaningless without quality validation.

4. Methodology Matters More Than Infrastructure

GraphRAG benefits are independent of the graph database vendor.

5. Reproducibility Is Essential

Lightweight tooling makes the benchmark easier for others to verify.


Reproducibility

Anyone can reproduce the benchmark by:

  1. Cloning the repository
  2. Building the 2M-token corpus
  3. Rebuilding indexes
  4. Running the benchmark
  5. Evaluating accuracy
  6. Viewing the dashboard

The project includes:

  • Secure deployment configs
  • Validation scripts
  • Backup scripts
  • Full documentation

GitHub Repository

Public repository:

GitHub logo VEDANTDHAVAN / graphrag-benchmark

2M-token benchmark comparing LLM-only, Basic RAG, and NetworkX-based GraphRAG with scientific papers, accuracy evaluation, and interactive dashboard.

GraphRAG Benchmark

GraphRAG Benchmark is a reproducible benchmark harness that compares three answer-generation pipelines on the same scientific-paper corpus:

  • LLM-only: answers directly from the model without retrieved context.
  • Basic RAG: retrieves chunks from ChromaDB using vector similarity.
  • GraphRAG: retrieves graph-connected chunks using NetworkX-based entity traversal and multi-hop graph context.

The project measures both efficiency and answer quality:

  • Tokens
  • Latency
  • Estimated cost
  • LLM-as-a-Judge pass rate
  • BERTScore F1

The dashboard shows all three pipelines side-by-side and highlights winners for accuracy, token usage, speed, and overall score.

Architecture

GraphRAG Benchmark Architecture

Frontend dashboard (Next.js)
  -> Backend API (FastAPI)
  -> LLM-only pipeline
  -> Basic RAG pipeline with ChromaDB
  -> GraphRAG pipeline with NetworkX
  -> Optional TigerGraph connector, disabled by default

Why NetworkX is the Primary GraphRAG Engine

The goal of this project is methodological benchmarking: prove whether graph-structured retrieval and multi-hop reasoning improve efficiency and answer quality compared with LLM-only and Basic RAG.

GraphRAG…





Live Demo

Frontend:

GraphRAG Benchmark | 2M Token Evaluation of LLM-only vs RAG vs GraphRAG

A reproducible benchmark proving graph-based retrieval at scientific-paper scale.

favicon graphrag-benchmark.vercel.app

Backend API:
https://vedantdhavan-graphrag-benchmark.hf.space/health


Final Conclusion

This benchmark demonstrates that GraphRAG is not just an academic concept—it delivers measurable benefits in real-world retrieval systems.

Compared with LLM-only and Basic RAG, GraphRAG achieved:

  • Lower token usage
  • Lower latency
  • Lower cost
  • Higher answer accuracy

Most importantly, the results show that:

The power of GraphRAG lies in graph-structured retrieval and multi-hop reasoning, not dependence on a specific graph database.

By focusing on methodology and reproducibility, this project provides a transparent, production-grade benchmark for evaluating GraphRAG at scale.


Connect With Me

Vedant Dhavan
Full-Stack Developer | Generative AI Explorer | Agentic AI Enthusiast


Tags

GraphRAG #RAG #LLM #FastAPI #NextJS #NetworkX #ChromaDB #GenerativeAI #Benchmarking #HuggingFace #TigerGraph #Python #MachineLearning

Top comments (0)