<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vedant Atul Dhavan</title>
    <description>The latest articles on DEV Community by Vedant Atul Dhavan (@vedantdhavan).</description>
    <link>https://dev.to/vedantdhavan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3934929%2Feb5ed05e-db6b-4777-8921-26ef2b3feb19.jpeg</url>
      <title>DEV Community: Vedant Atul Dhavan</title>
      <link>https://dev.to/vedantdhavan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vedantdhavan"/>
    <language>en</language>
    <item>
      <title>GraphRAG Benchmark: A 2 Million Token Comparison of LLM-only, Basic RAG, and GraphRAG</title>
      <dc:creator>Vedant Atul Dhavan</dc:creator>
      <pubDate>Sat, 16 May 2026 15:02:00 +0000</pubDate>
      <link>https://dev.to/vedantdhavan/graphrag-benchmark-a-2-million-token-comparison-of-llm-only-basic-rag-and-graphrag-16ph</link>
      <guid>https://dev.to/vedantdhavan/graphrag-benchmark-a-2-million-token-comparison-of-llm-only-basic-rag-and-graphrag-16ph</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Can graph-structured retrieval outperform traditional vector-based RAG while using fewer tokens, incurring lower cost, and delivering better answers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was the core question I set out to answer.&lt;/p&gt;

&lt;p&gt;To test it rigorously, I built a production-grade benchmarking platform that compares:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;LLM-only&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Basic RAG (ChromaDB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GraphRAG (NetworkX)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The benchmark runs on a &lt;strong&gt;2 million token scientific paper corpus&lt;/strong&gt;, evaluates each pipeline on &lt;strong&gt;40 benchmark questions&lt;/strong&gt;, and measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;LLM-as-a-Judge pass rate&lt;/li&gt;
&lt;li&gt;BERTScore F1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire system is deployed publicly with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js on Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; FastAPI on Hugging Face Spaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article covers the architecture, methodology, lessons learned, and the results that demonstrate why GraphRAG is a powerful retrieval paradigm.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why This Benchmark Matters
&lt;/h1&gt;

&lt;p&gt;Large Language Models are impressive, but they suffer from three core limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High token usage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hallucinations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor multi-hop reasoning across documents&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional Retrieval-Augmented Generation (RAG) improves grounding by retrieving relevant chunks from a vector database.&lt;/p&gt;

&lt;p&gt;However, Basic RAG has a structural limitation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It retrieves semantically similar chunks, but does not explicitly model relationships between entities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GraphRAG addresses this limitation by converting documents into a knowledge graph and traversing meaningful connections.&lt;/p&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-hop reasoning&lt;/li&gt;
&lt;li&gt;Relationship-aware retrieval&lt;/li&gt;
&lt;li&gt;Reduced context size&lt;/li&gt;
&lt;li&gt;More focused answers&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The Research Question
&lt;/h1&gt;

&lt;p&gt;The benchmark was designed to answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does GraphRAG provide better efficiency and accuracy than LLM-only and Basic RAG?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it use fewer tokens?&lt;/li&gt;
&lt;li&gt;Does it reduce latency?&lt;/li&gt;
&lt;li&gt;Does it lower cost?&lt;/li&gt;
&lt;li&gt;Does it maintain or improve answer quality?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Dataset Selection
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Domain: Scientific Research Papers
&lt;/h2&gt;

&lt;p&gt;I selected the Hugging Face dataset:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;armanc/scientific_papers (arXiv split)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/datasets/armanc/scientific_papers?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Hugging Face - armanc/scientific_papers&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Scientific Papers?
&lt;/h2&gt;

&lt;p&gt;Scientific papers are ideal for GraphRAG because they contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Methods&lt;/li&gt;
&lt;li&gt;Datasets&lt;/li&gt;
&lt;li&gt;Tasks&lt;/li&gt;
&lt;li&gt;Concepts&lt;/li&gt;
&lt;li&gt;Experimental results&lt;/li&gt;
&lt;li&gt;Cross-document dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Questions often require connecting information across multiple sections and documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset Scale
&lt;/h2&gt;

&lt;p&gt;A custom pipeline sampled approximately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;2,000,000 tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Hundreds of research papers&lt;/li&gt;
&lt;li&gt;Thousands of chunks&lt;/li&gt;
&lt;li&gt;Thousands of graph nodes and edges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This scale satisfies the benchmark requirement while remaining practical to reproduce.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Three Pipelines
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. LLM-only
&lt;/h2&gt;

&lt;p&gt;The question is sent directly to the language model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;No retrieval infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High token usage&lt;/li&gt;
&lt;li&gt;Hallucinations&lt;/li&gt;
&lt;li&gt;No grounding&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Basic RAG
&lt;/h2&gt;

&lt;p&gt;Workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Chunk documents&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store vectors in ChromaDB&lt;/li&gt;
&lt;li&gt;Retrieve top-k chunks&lt;/li&gt;
&lt;li&gt;Send chunks to the LLM&lt;/li&gt;
&lt;/ol&gt;
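
&lt;p&gt;As an illustration only, the retrieve step can be sketched with a toy in-memory retriever (pure Python, no ChromaDB; &lt;code&gt;embed&lt;/code&gt; here is a stand-in bag-of-words vector, not the sentence-transformers model the real pipeline uses):&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term counts.
    # The real pipeline stores sentence-transformers vectors in ChromaDB.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query, chunks, k=2):
    # Score every chunk against the query and keep the k best.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "graph neural networks for molecules",
    "transformer attention mechanisms",
    "attention is all you need summary",
]
top = retrieve_top_k("how does attention work", chunks, k=2)
```

&lt;p&gt;ChromaDB replaces the full sorted scan with an indexed nearest-neighbor search, but the retrieval contract is the same: top-k chunks by vector similarity.&lt;/p&gt;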

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Grounded retrieval&lt;/li&gt;
&lt;li&gt;Lower hallucination risk&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No explicit relationships&lt;/li&gt;
&lt;li&gt;Weak multi-hop reasoning&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. GraphRAG
&lt;/h2&gt;

&lt;p&gt;Workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract entities and relationships&lt;/li&gt;
&lt;li&gt;Build a knowledge graph&lt;/li&gt;
&lt;li&gt;Traverse graph neighborhoods&lt;/li&gt;
&lt;li&gt;Retrieve only connected evidence&lt;/li&gt;
&lt;li&gt;Generate answers&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-hop reasoning&lt;/li&gt;
&lt;li&gt;Relationship-aware retrieval&lt;/li&gt;
&lt;li&gt;Smaller, more relevant context&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  System Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f6fd5zg6wepyuyy8u9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f6fd5zg6wepyuyy8u9f.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  ↓
Next.js Frontend (Vercel)
  ↓
FastAPI Backend (Hugging Face Spaces)
  ↓
Benchmark Orchestrator
  ├── LLM-only
  ├── Basic RAG (ChromaDB)
  └── GraphRAG (NetworkX)
  ↓
Accuracy Evaluation
  ├── LLM-as-a-Judge
  └── BERTScore
  ↓
Interactive Dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Tech Stack
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Next.js (App Router)&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;Tailwind CSS&lt;/li&gt;
&lt;li&gt;shadcn/ui&lt;/li&gt;
&lt;li&gt;Framer Motion&lt;/li&gt;
&lt;li&gt;Recharts&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Backend
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;FastAPI&lt;/li&gt;
&lt;li&gt;tiktoken&lt;/li&gt;
&lt;li&gt;datasets&lt;/li&gt;
&lt;li&gt;sentence-transformers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Retrieval
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ChromaDB&lt;/li&gt;
&lt;li&gt;NetworkX&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;huggingface_hub&lt;/li&gt;
&lt;li&gt;evaluate&lt;/li&gt;
&lt;li&gt;bert-score&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vercel&lt;/li&gt;
&lt;li&gt;Hugging Face Spaces&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Benchmark Methodology
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Step 1: Build the 2M Token Corpus
&lt;/h2&gt;

&lt;p&gt;A token-aware dataset builder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads scientific papers&lt;/li&gt;
&lt;li&gt;Counts tokens using &lt;code&gt;tiktoken&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stops at ~2 million tokens&lt;/li&gt;
&lt;/ul&gt;
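
&lt;p&gt;A minimal sketch of that budget loop (the real builder counts BPE tokens with &lt;code&gt;tiktoken&lt;/code&gt;; here &lt;code&gt;count_tokens&lt;/code&gt; is a whitespace stand-in so the sketch stays dependency-free):&lt;/p&gt;

```python
TOKEN_BUDGET = 2_000_000

def count_tokens(text):
    # Stand-in tokenizer; the real builder uses tiktoken's BPE encoder.
    return len(text.split())

def build_corpus(papers, budget=TOKEN_BUDGET):
    corpus, total = [], 0
    for paper in papers:
        n = count_tokens(paper)
        if total + n > budget:
            break  # stop once the next paper would exceed the target
        corpus.append(paper)
        total += n
    return corpus, total

# Tiny demonstration with a 5-token budget instead of 2M.
docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
sampled, used = build_corpus(docs, budget=5)
```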
&lt;h2&gt;
  
  
  Step 2: Preprocess and Chunk
&lt;/h2&gt;

&lt;p&gt;Documents are split into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500-token chunks&lt;/li&gt;
&lt;li&gt;80-token overlap&lt;/li&gt;
&lt;/ul&gt;
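
&lt;p&gt;The 500/80 split is a sliding window with a step of 420 tokens. A sketch (token boundaries here are plain list items, standing in for real tokenizer output):&lt;/p&gt;

```python
def chunk_tokens(tokens, size=500, overlap=80):
    step = size - overlap  # 420 for the benchmark settings
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, size=500, overlap=80)
```

&lt;p&gt;The overlap means the last 80 tokens of each chunk reappear at the start of the next, so sentences cut at a boundary stay retrievable from at least one chunk.&lt;/p&gt;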
&lt;h2&gt;
  
  
  Step 3: Build Retrieval Layers
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Basic RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store in ChromaDB&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GraphRAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extract methods, datasets, tasks, and concepts&lt;/li&gt;
&lt;li&gt;Build graph nodes and edges using NetworkX&lt;/li&gt;
&lt;/ul&gt;
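
&lt;p&gt;Conceptually, the graph layer links entities that co-occur in a chunk. A dict-based sketch of what the NetworkX build step does (entity extraction itself is elided, and the entity pairs below are invented examples):&lt;/p&gt;

```python
from collections import defaultdict

def build_graph(extractions):
    # extractions: (entity_a, entity_b) pairs found in the same chunk.
    graph = defaultdict(set)
    for a, b in extractions:
        graph[a].add(b)  # undirected edge: store both directions
        graph[b].add(a)
    return graph

pairs = [
    ("BERT", "GLUE"),          # method evaluated on dataset
    ("BERT", "fine-tuning"),   # method uses concept
    ("GLUE", "NLU"),           # dataset targets task
]
kg = build_graph(pairs)
```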
&lt;h2&gt;
  
  
  Step 4: Create Evaluation Questions
&lt;/h2&gt;

&lt;p&gt;Generated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40 questions&lt;/li&gt;
&lt;li&gt;Factual&lt;/li&gt;
&lt;li&gt;Entity-based&lt;/li&gt;
&lt;li&gt;Multi-hop&lt;/li&gt;
&lt;li&gt;Synthesis&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 5: Run All Pipelines
&lt;/h2&gt;

&lt;p&gt;Each question is processed by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-only&lt;/li&gt;
&lt;li&gt;Basic RAG&lt;/li&gt;
&lt;li&gt;GraphRAG&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 6: Record Metrics
&lt;/h2&gt;

&lt;p&gt;Per query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Answer text&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 7: Evaluate Accuracy
&lt;/h2&gt;

&lt;p&gt;Using two independent methods.&lt;/p&gt;


&lt;h1&gt;
  
  
  Accuracy Evaluation
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Method 1: LLM-as-a-Judge
&lt;/h2&gt;

&lt;p&gt;A separate language model grades answers as PASS or FAIL.&lt;/p&gt;

&lt;p&gt;Judge model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Factual correctness&lt;/li&gt;
&lt;li&gt;Hallucinations&lt;/li&gt;
&lt;li&gt;Missing information&lt;/li&gt;
&lt;/ul&gt;
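
&lt;p&gt;A sketch of the grading call (the prompt wording and the &lt;code&gt;call_judge&lt;/code&gt; stub are illustrative, not the project's exact prompt; the real judge is Llama-3.1-8B-Instruct behind an inference API):&lt;/p&gt;

```python
JUDGE_TEMPLATE = (
    "You are grading an answer against a reference.\n"
    "Question: {q}\nReference: {ref}\nAnswer: {ans}\n"
    "Reply with exactly PASS or FAIL."
)

def grade(question, reference, answer, call_judge):
    # call_judge is any callable that sends a prompt to the judge model
    # and returns its text reply (e.g. a huggingface_hub client wrapper).
    prompt = JUDGE_TEMPLATE.format(q=question, ref=reference, ans=answer)
    reply = call_judge(prompt).strip().upper()
    return reply.startswith("PASS")

# Stubbed judge for illustration: passes if the answer names the entity.
fake_judge = lambda p: "PASS" if "ChromaDB" in p.split("Answer:")[1] else "FAIL"
verdict = grade("Which vector store?", "ChromaDB", "It uses ChromaDB.", fake_judge)
```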
&lt;h2&gt;
  
  
  Method 2: BERTScore
&lt;/h2&gt;

&lt;p&gt;Measures semantic similarity between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicted answer&lt;/li&gt;
&lt;li&gt;Ground-truth answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Meaning preservation&lt;/li&gt;
&lt;li&gt;Paraphrasing equivalence&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why Two Metrics?
&lt;/h2&gt;

&lt;p&gt;One metric can be misleading.&lt;/p&gt;

&lt;p&gt;Using both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides independent validation&lt;/li&gt;
&lt;li&gt;Creates more defensible results&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Benchmark Results
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;LLM-only&lt;/th&gt;
&lt;th&gt;Basic RAG&lt;/th&gt;
&lt;th&gt;GraphRAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg Tokens / Query&lt;/td&gt;
&lt;td&gt;3,200&lt;/td&gt;
&lt;td&gt;640&lt;/td&gt;
&lt;td&gt;316&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Cost / Query&lt;/td&gt;
&lt;td&gt;$0.0096&lt;/td&gt;
&lt;td&gt;$0.0019&lt;/td&gt;
&lt;td&gt;$0.00095&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Latency&lt;/td&gt;
&lt;td&gt;6.23 s&lt;/td&gt;
&lt;td&gt;2.85 s&lt;/td&gt;
&lt;td&gt;1.22 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Judge Pass Rate&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERTScore F1&lt;/td&gt;
&lt;td&gt;0.57&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
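
&lt;p&gt;The headline percentages in the next section follow directly from this table; a quick arithmetic check:&lt;/p&gt;

```python
def pct_reduction(before, after):
    # Percentage reduction from `before` to `after`.
    return 100.0 * (before - after) / before

tokens_vs_rag  = pct_reduction(640, 316)    # about 50.6 percent fewer tokens
latency_vs_rag = pct_reduction(2.85, 1.22)  # about 57.2 percent lower latency
tokens_vs_llm  = pct_reduction(3200, 316)   # about 90.1 percent fewer tokens
latency_vs_llm = pct_reduction(6.23, 1.22)  # about 80.4 percent lower latency
```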


&lt;h1&gt;
  
  
  Key Improvements
&lt;/h1&gt;
&lt;h2&gt;
  
  
  GraphRAG vs Basic RAG
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;50.6% fewer tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;57.2% lower latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~50% lower cost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5 percentage points higher pass rate&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;+0.07 BERTScore F1&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  GraphRAG vs LLM-only
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;90.1% fewer tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;80.4% lower latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~90% lower cost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 percentage points higher pass rate&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;+0.14 BERTScore F1&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Why GraphRAG Wins
&lt;/h1&gt;

&lt;p&gt;Basic RAG retrieves chunks independently.&lt;/p&gt;

&lt;p&gt;GraphRAG retrieves &lt;em&gt;connected evidence&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paper A introduces a method&lt;/li&gt;
&lt;li&gt;Paper B evaluates it&lt;/li&gt;
&lt;li&gt;Paper C improves it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basic RAG may retrieve only one paper.&lt;/p&gt;

&lt;p&gt;GraphRAG traverses all connected nodes, producing a complete answer with less context.&lt;/p&gt;
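
&lt;p&gt;That Paper A to B to C chain is a 2-hop traversal. A BFS sketch over a toy adjacency dict (standing in for the project's NetworkX graph; the node names are invented):&lt;/p&gt;

```python
from collections import deque

def neighborhood(graph, start, max_hops=2):
    # Collect every node reachable from `start` within max_hops edges.
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return set(seen)

graph = {
    "MethodX": ["PaperA", "PaperB"],  # introduced in A, evaluated in B
    "PaperB": ["PaperC"],             # C improves on B's evaluation
}
evidence = neighborhood(graph, "MethodX", max_hops=2)
```

&lt;p&gt;A similarity search seeded on "MethodX" might surface only Paper A; the traversal pulls in all three connected papers while keeping the context small.&lt;/p&gt;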


&lt;h1&gt;
  
  
  TigerGraph Challenges and Pivot to NetworkX
&lt;/h1&gt;

&lt;p&gt;The project originally explored TigerGraph as the graph database backend.&lt;/p&gt;

&lt;p&gt;TigerGraph is an impressive enterprise-grade platform and was valuable in shaping the system design.&lt;/p&gt;

&lt;p&gt;However, during development I encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication issues&lt;/li&gt;
&lt;li&gt;Docker setup complexity&lt;/li&gt;
&lt;li&gt;Resource overhead&lt;/li&gt;
&lt;li&gt;Infrastructure troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These operational challenges shifted focus away from the benchmark itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Core Realization
&lt;/h2&gt;

&lt;p&gt;The benchmark was about proving:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Graph-based retrieval methodology is superior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was &lt;strong&gt;not&lt;/strong&gt; about dependence on a specific graph database.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why NetworkX Became the Primary Engine
&lt;/h2&gt;

&lt;p&gt;NetworkX offered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-hop traversal&lt;/li&gt;
&lt;li&gt;Relationship-aware retrieval&lt;/li&gt;
&lt;li&gt;Zero infrastructure overhead&lt;/li&gt;
&lt;li&gt;Pure Python execution&lt;/li&gt;
&lt;li&gt;Easy reproducibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TigerGraph remained an optional enterprise connector.&lt;/p&gt;


&lt;h1&gt;
  
  
  NetworkX vs TigerGraph
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;NetworkX&lt;/th&gt;
&lt;th&gt;TigerGraph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-hop graph traversal&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark-ready&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runs on any laptop&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero setup overhead&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Required for benchmark&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h1&gt;
  
  
  Deployment
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deployed on Vercel&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Backend
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deployed on Hugging Face Spaces&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Data Layer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ChromaDB&lt;/li&gt;
&lt;li&gt;NetworkX serialized graph files&lt;/li&gt;
&lt;li&gt;JSON benchmark artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment is fully reproducible and lightweight.&lt;/p&gt;


&lt;h1&gt;
  
  
  Interactive Dashboard
&lt;/h1&gt;

&lt;p&gt;The web application includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark runner&lt;/li&gt;
&lt;li&gt;Accuracy metrics&lt;/li&gt;
&lt;li&gt;Architecture overview&lt;/li&gt;
&lt;li&gt;Dataset information&lt;/li&gt;
&lt;li&gt;Comparative charts&lt;/li&gt;
&lt;li&gt;Project narrative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard presents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Pass Rate&lt;/li&gt;
&lt;li&gt;BERTScore&lt;/li&gt;
&lt;li&gt;Reduction percentages&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Key Insights
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1. Graph Structure Matters
&lt;/h2&gt;

&lt;p&gt;Explicit relationships enable better retrieval.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Smaller Context Can Improve Accuracy
&lt;/h2&gt;

&lt;p&gt;GraphRAG used fewer tokens and produced better answers.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Evaluation Must Include Accuracy
&lt;/h2&gt;

&lt;p&gt;Token reduction alone is meaningless without quality validation.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Methodology Matters More Than Infrastructure
&lt;/h2&gt;

&lt;p&gt;GraphRAG benefits are independent of the graph database vendor.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. Reproducibility Is Essential
&lt;/h2&gt;

&lt;p&gt;Lightweight tooling makes the benchmark easier for others to verify.&lt;/p&gt;


&lt;h1&gt;
  
  
  Reproducibility
&lt;/h1&gt;

&lt;p&gt;Anyone can reproduce the benchmark by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cloning the repository&lt;/li&gt;
&lt;li&gt;Building the 2M-token corpus&lt;/li&gt;
&lt;li&gt;Rebuilding indexes&lt;/li&gt;
&lt;li&gt;Running the benchmark&lt;/li&gt;
&lt;li&gt;Evaluating accuracy&lt;/li&gt;
&lt;li&gt;Viewing the dashboard&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secure deployment configs&lt;/li&gt;
&lt;li&gt;Validation scripts&lt;/li&gt;
&lt;li&gt;Backup scripts&lt;/li&gt;
&lt;li&gt;Full documentation&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  GitHub Repository
&lt;/h1&gt;

&lt;p&gt;Public repository:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/VEDANTDHAVAN" rel="noopener noreferrer"&gt;
        VEDANTDHAVAN
      &lt;/a&gt; / &lt;a href="https://github.com/VEDANTDHAVAN/graphrag-benchmark" rel="noopener noreferrer"&gt;
        graphrag-benchmark
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      2M-token benchmark comparing LLM-only, Basic RAG, and NetworkX-based GraphRAG with scientific papers, accuracy evaluation, and interactive dashboard.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;GraphRAG Benchmark&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;GraphRAG Benchmark is a reproducible benchmark harness that compares three answer-generation pipelines on the same scientific-paper corpus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-only&lt;/strong&gt;: answers directly from the model without retrieved context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic RAG&lt;/strong&gt;: retrieves chunks from ChromaDB using vector similarity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG&lt;/strong&gt;: retrieves graph-connected chunks using NetworkX-based entity traversal and multi-hop graph context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project measures both efficiency and answer quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Estimated cost&lt;/li&gt;
&lt;li&gt;LLM-as-a-Judge pass rate&lt;/li&gt;
&lt;li&gt;BERTScore F1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard shows all three pipelines side-by-side and highlights winners for accuracy, token usage, speed, and overall score.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/VEDANTDHAVAN/graphrag-benchmark/docs/assets/graphrag-benchmark-architecture.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FVEDANTDHAVAN%2Fgraphrag-benchmark%2FHEAD%2Fdocs%2Fassets%2Fgraphrag-benchmark-architecture.png" alt="GraphRAG Benchmark Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;Frontend dashboard (Next.js)
  -&amp;gt; Backend API (FastAPI)
  -&amp;gt; LLM-only pipeline
  -&amp;gt; Basic RAG pipeline with ChromaDB
  -&amp;gt; GraphRAG pipeline with NetworkX
  -&amp;gt; Optional TigerGraph connector, disabled by default
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why NetworkX is the Primary GraphRAG Engine&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;The goal of this project is methodological benchmarking: prove whether graph-structured retrieval and multi-hop reasoning improve efficiency and answer quality compared with LLM-only and Basic RAG.&lt;/p&gt;

&lt;p&gt;GraphRAG…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/VEDANTDHAVAN/graphrag-benchmark" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;





&lt;h1&gt;
  
  
  Live Demo
&lt;/h1&gt;

&lt;p&gt;Frontend:&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://graphrag-benchmark.vercel.app/" rel="noopener noreferrer" class="c-link"&gt;
            GraphRAG Benchmark | 2M Token Evaluation of LLM-only vs RAG vs GraphRAG
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A reproducible benchmark proving graph-based retrieval at scientific-paper scale.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraphrag-benchmark.vercel.app%2Ffavicon.ico%3Ffavicon.0x3dzn~oxb6tn.ico" width="256" height="256"&gt;
          graphrag-benchmark.vercel.app
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Backend API:&lt;br&gt;
&lt;a href="https://vedantdhavan-graphrag-benchmark.hf.space/health" rel="noopener noreferrer"&gt;https://vedantdhavan-graphrag-benchmark.hf.space/health&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Conclusion
&lt;/h1&gt;

&lt;p&gt;This benchmark demonstrates that GraphRAG is not just an academic concept—it delivers measurable benefits in real-world retrieval systems.&lt;/p&gt;

&lt;p&gt;Compared with LLM-only and Basic RAG, GraphRAG achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower token usage&lt;/li&gt;
&lt;li&gt;Lower latency&lt;/li&gt;
&lt;li&gt;Lower cost&lt;/li&gt;
&lt;li&gt;Higher answer accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, the results show that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The power of GraphRAG lies in graph-structured retrieval and multi-hop reasoning, not dependence on a specific graph database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By focusing on methodology and reproducibility, this project provides a transparent, production-grade benchmark for evaluating GraphRAG at scale.&lt;/p&gt;




&lt;h1&gt;
  
  
  Connect With Me
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Vedant Dhavan&lt;/strong&gt;&lt;br&gt;
Full-Stack Developer | Generative AI Explorer | Agentic AI Enthusiast&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/VEDANTDHAVAN" rel="noopener noreferrer"&gt;https://github.com/VEDANTDHAVAN&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/vedant-dhavan" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/vedant-dhavan&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Tags
&lt;/h1&gt;

&lt;p&gt;
  #GraphRAG #RAG #LLM #FastAPI #NextJS #NetworkX #ChromaDB #GenerativeAI #Benchmarking #HuggingFace #TigerGraph #Python #MachineLearning
&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>computerscience</category>
    </item>
  </channel>
</rss>
