GraphRAG Benchmark Analysis

VanshDeo — Sun, 17 May 2026 19:14:48 +0000

We Built a Clinical GraphRAG Benchmark That Proves Graph Databases Aren't Just Hype

How TigerGraph-powered graph traversal beat vector search on polypharmacy reasoning — and why token efficiency tells a more interesting story than accuracy alone.

The Problem With "Just Use RAG"

Every production AI team eventually hits the same wall: your LLM hallucinates a drug interaction, misses a temporal dependency, or confidently answers a counterfactual question it has no business answering. The standard response is "add RAG." But which RAG?

Vector similarity search (Pinecone, Weaviate, etc.) is great at retrieving semantically similar chunks. It's not great at answering questions like:

"If omeprazole is stopped, which drug interaction paths resolve?"
"Which guidelines conflict on aspirin use in elderly patients?"
"Patient is taking warfarin, fluconazole, and aspirin. Trace the full interaction cascade."

These aren't similarity problems. They're graph traversal problems.

That's the thesis behind GraphRAG Inference Core V2 — a clinical benchmarking system we built to quantify exactly how much graph-structured retrieval improves over baseline approaches.

The Architecture: Three Pipelines, One Benchmark

We designed a multi-pipeline evaluation harness that runs every query through three systems simultaneously:

🔶 LLM-Only (Baseline)

Pure Gemma inference. No external context. Training-knowledge baseline. This answers: "What does the model already know?"

🔷 RAG Hybrid Core (Vector Baseline)

Gemma + Pinecone vector retrieval. Standard embedding-based context injection. This answers: "Does retrieval help, and by how much?"

🟢 GraphRAG Sentinel (Our System)

Gemma + TigerGraph V3 route-aware traversal. Queries are classified and routed to a Cypher/GSQL generator, and the retrieved context is returned as compressed structured JSON.
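To make the three-pipeline idea concrete, here is a minimal sketch of a harness that fans one query out to all three pipelines at once and records per-pipeline latency. The pipeline functions are hypothetical stubs (they make no Gemma, Pinecone, or TigerGraph calls), and the names `run_benchmark` and `PipelineResult` are assumptions for illustration, not the project's actual API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PipelineResult:
    pipeline: str
    answer: str
    latency_ms: float


# Hypothetical stand-ins for the three pipelines. Real versions would call
# the LLM, the vector store, and the graph database respectively.
def llm_only(query: str) -> str:
    return f"[llm-only] {query}"


def basic_rag(query: str) -> str:
    return f"[vector-rag] {query}"


def graph_rag(query: str) -> str:
    return f"[graph-rag] {query}"


PIPELINES: Dict[str, Callable[[str], str]] = {
    "LLM-Only": llm_only,
    "Basic RAG": basic_rag,
    "GraphRAG": graph_rag,
}


def _timed(name: str, fn: Callable[[str], str], query: str) -> PipelineResult:
    """Run one pipeline and measure its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    answer = fn(query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return PipelineResult(name, answer, elapsed_ms)


def run_benchmark(query: str) -> Dict[str, PipelineResult]:
    """Submit the same query to every pipeline concurrently and collect results."""
    with ThreadPoolExecutor(max_workers=len(PIPELINES)) as pool:
        futures = {
            name: pool.submit(_timed, name, fn, query)
            for name, fn in PIPELINES.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


results = run_benchmark(
    "If omeprazole is stopped, which drug interaction paths resolve?"
)
for name, res in results.items():
    print(f"{name}: {res.latency_ms:.3f} ms")
```

Running the pipelines on the same query in the same harness keeps the comparison apples-to-apples: every system sees identical input, and latency is measured the same way for each.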

The benchmark covers 100 clinical questions across five reasoning categories — 20 each:

Category	What It Tests
Temporal	Time-dependent drug effects, dosing windows
Contradiction	Conflicting clinical guidelines
Multi-Hop	Enzyme-mediated cascade interactions
Counterfactual	"What if X is stopped/added?" reasoning
Cross-Entity	Drug ↔ Disease ↔ Symptom ↔ Guideline reasoning

Targets: 90% LLM-Judge score, BERTScore ≥ 0.55

The Results: Token Efficiency Is the Real Story

Here's the comparative matrix from our live benchmark run on the omeprazole query:

Metric	LLM-Only	Basic RAG	GraphRAG
Total Tokens	389 ⭐	1,281	770
Latency	57,931ms	49,919ms ⭐	189,845ms
Cost	$0.000029 ⭐	$0.000090	$0.000058

At first glance this looks like GraphRAG loses: it doesn't take the top spot on any metric, and it is by far the slowest. That's the wrong read.

The key insight: GraphRAG uses 770 tokens to do what Basic RAG needs 1,281 tokens for — a 39.9% token reduction while retrieving structured, graph-verified context instead of raw semantic chunks. The LLM-only baseline uses 389 tokens but provides no external grounding at all.
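The 39.9% figure can be reproduced directly from the token counts in the table. The snippet below is just that arithmetic; the helper name `reduction_pct` is ours, for illustration.

```python
# Token counts from the comparative matrix (single omeprazole query).
tokens = {"LLM-Only": 389, "Basic RAG": 1281, "GraphRAG": 770}


def reduction_pct(baseline: int, candidate: int) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline * 100.0


saving = reduction_pct(tokens["Basic RAG"], tokens["GraphRAG"])
print(f"GraphRAG vs Basic RAG: {saving:.1f}% fewer tokens")  # 39.9%
```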

In production, that reduction scales linearly with volume. At 1M queries/month, $0.000058 vs $0.000090 per query works out to roughly $58 vs $90, about 36% less spend than Basic RAG, and the absolute gap grows in proportion to per-token price and query volume.
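As a rough projection, the per-query costs from the table can be scaled to a monthly volume. This sketch assumes the single-query costs are representative of the whole workload, which one benchmark run does not establish; the 1M figure comes from the text above.

```python
from decimal import Decimal

# Per-query costs from the comparative matrix (USD).
cost_per_query = {
    "LLM-Only": Decimal("0.000029"),
    "Basic RAG": Decimal("0.000090"),
    "GraphRAG": Decimal("0.000058"),
}
queries_per_month = 1_000_000

# Linear projection: monthly cost = per-query cost x query volume.
monthly = {name: cost * queries_per_month for name, cost in cost_per_query.items()}

for name, total in monthly.items():
    print(f"{name}: ${total:.2f}/month")

# Relative saving of GraphRAG over Basic RAG.
saving_pct = (monthly["Basic RAG"] - monthly["GraphRAG"]) / monthly["Basic RAG"] * 100
print(f"GraphRAG vs Basic RAG: {saving_pct:.1f}% lower cost")  # 35.6%
```

`Decimal` is used instead of floats so that very small per-query prices multiply out without rounding noise.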

How the Graph Query Routing Works

When a query hits the GraphRAG Sentinel pipeline, the system:

1. Classifies the query type (we detected GENERATE_CYPHER for the omeprazole question)
2. Selects a retriever: CYPHER for structural traversal, with hop-depth configured per query class
3. Generates and executes a TigerGraph GSQL/Cypher query against our clinical knowledge graph
4. Compresses the result into structured JSON context before LLM injection
5. Synthesizes a response through four verified stages: Entity Extraction → Community Summary Retrieval → Global Aggregation → Response Synthesis
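The first, second, and fourth steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the keyword classifier, the routing table, the hop-depth values, and the function names are all assumptions. Only the `GENERATE_CYPHER` label and the `CYPHER` retriever name come from the run described here.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Route:
    label: str      # route label, e.g. GENERATE_CYPHER
    retriever: str  # retriever name, e.g. CYPHER
    max_hops: int   # hop depth per query class (illustrative values)


# Hypothetical routing table keyed by query class.
ROUTES: Dict[str, Route] = {
    "counterfactual": Route("GENERATE_CYPHER", "CYPHER", 3),
    "multi_hop": Route("GENERATE_CYPHER", "CYPHER", 4),
    "default": Route("GENERATE_CYPHER", "CYPHER", 2),
}


def classify(query: str) -> str:
    """Toy keyword classifier; a real system would likely use a model."""
    q = query.lower()
    if "if " in q and ("stopped" in q or "added" in q):
        return "counterfactual"
    if "cascade" in q or "trace" in q:
        return "multi_hop"
    return "default"


def compress(rows: List[dict]) -> str:
    """Drop empty fields and emit compact JSON for prompt injection."""
    slim = [
        {k: v for k, v in row.items() if v not in (None, "", [])}
        for row in rows
    ]
    return json.dumps(slim, separators=(",", ":"), sort_keys=True)


query = "If omeprazole is stopped, which drug interaction paths resolve?"
route = ROUTES[classify(query)]
print(route.label, route.retriever, route.max_hops)

# Pretend the graph query returned this row (hypothetical result shape).
rows = [{"src": "omeprazole", "rel": "INHIBITS", "dst": "CYP2C19", "note": ""}]
context = compress(rows)
print(context)  # [{"dst":"CYP2C19","rel":"INHIBITS","src":"omeprazole"}]
```

Stripping empty fields and whitespace before injection is one plausible way to get the kind of token savings over raw text chunks reported in the results table.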

The clinical graph schema includes: Drugs, Diseases, Symptoms, Enzymes, Adverse Events, and Clinical Guidelines — with typed edges for direct interactions, enzyme-mediated cascades, and contraindication relationships.
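One way to picture that schema is as typed vertices and typed edges with endpoint constraints. The sketch below is a plain-Python model, not the project's TigerGraph schema; the enum members, edge names, and the `ALLOWED` table are assumptions chosen to mirror the vertex and edge kinds listed above.

```python
from dataclasses import dataclass
from enum import Enum


class VertexType(Enum):
    DRUG = "Drug"
    DISEASE = "Disease"
    SYMPTOM = "Symptom"
    ENZYME = "Enzyme"
    ADVERSE_EVENT = "AdverseEvent"
    GUIDELINE = "ClinicalGuideline"


class EdgeType(Enum):
    INTERACTS_WITH = "interacts_with"          # direct drug-drug interaction
    INHIBITS = "inhibits"                      # drug -> enzyme (cascade link)
    METABOLIZED_BY = "metabolized_by"          # drug -> enzyme (cascade link)
    CONTRAINDICATED_IN = "contraindicated_in"  # drug -> disease


# Hypothetical endpoint constraints: which vertex types each edge may connect.
ALLOWED = {
    EdgeType.INTERACTS_WITH: (VertexType.DRUG, VertexType.DRUG),
    EdgeType.INHIBITS: (VertexType.DRUG, VertexType.ENZYME),
    EdgeType.METABOLIZED_BY: (VertexType.DRUG, VertexType.ENZYME),
    EdgeType.CONTRAINDICATED_IN: (VertexType.DRUG, VertexType.DISEASE),
}


@dataclass(frozen=True)
class Vertex:
    name: str
    vtype: VertexType


@dataclass(frozen=True)
class Edge:
    src: Vertex
    dst: Vertex
    etype: EdgeType

    def __post_init__(self) -> None:
        # Reject edges whose endpoints do not match the declared types.
        if (self.src.vtype, self.dst.vtype) != ALLOWED[self.etype]:
            raise ValueError(
                f"{self.etype.value} cannot connect "
                f"{self.src.vtype.value} -> {self.dst.vtype.value}"
            )


omeprazole = Vertex("omeprazole", VertexType.DRUG)
cyp2c19 = Vertex("CYP2C19", VertexType.ENZYME)
edge = Edge(omeprazole, cyp2c19, EdgeType.INHIBITS)
print(edge.src.name, edge.etype.value, edge.dst.name)
```

Typed edges are what let a traversal ask for "enzyme-mediated" paths specifically, rather than any text that happens to mention both drugs.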

This is exactly the schema you need to answer "which interaction paths resolve when omeprazole stops" — not a cosine similarity index.

The Demo: "The Cascade Collapse"

Our flagship demo query traces what happens across a polypharmacy regimen when a key CYP2C19 inhibitor is removed.

The GraphRAG system correctly:

Identifies omeprazole as a CYP2C19 inhibitor (Proton Pump Inhibitor class)
Traverses the enzyme-mediated interaction graph to find CYP2C19-dependent drugs (clopidogrel, etc.)
Determines that removing the inhibition restores the metabolic pathway
Surfaces the clinical implication: clopidogrel's antiplatelet efficacy is restored, which brings bleeding risk back into consideration
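The traversal behind those steps can be illustrated on a toy in-memory graph. This is a sketch under stated assumptions: the edge list is a tiny illustrative subset rather than the project's knowledge graph or a clinical reference, and `resolved_paths` is a hypothetical helper, not the generated GSQL/Cypher.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Toy edges for illustration only.
EDGES: List[Edge] = [
    ("omeprazole", "INHIBITS", "CYP2C19"),
    ("clopidogrel", "METABOLIZED_BY", "CYP2C19"),
    ("fluconazole", "INHIBITS", "CYP2C9"),
    ("warfarin", "METABOLIZED_BY", "CYP2C9"),
]


def resolved_paths(
    edges: List[Edge], regimen: List[str], stopped: str
) -> List[Tuple[str, str, str]]:
    """Return (stopped_drug, enzyme, affected_drug) paths that resolve.

    A path resolves when the stopped drug inhibited an enzyme, no drug
    still in the regimen inhibits that enzyme, and a remaining drug
    depends on it.
    """
    remaining = set(regimen) - {stopped}
    inhibited = {
        dst for src, rel, dst in edges if src == stopped and rel == "INHIBITS"
    }
    paths = []
    for enzyme in sorted(inhibited):
        still_inhibited = any(
            src in remaining and rel == "INHIBITS" and dst == enzyme
            for src, rel, dst in edges
        )
        if still_inhibited:
            continue  # another drug keeps the enzyme inhibited
        for src, rel, dst in edges:
            if rel == "METABOLIZED_BY" and dst == enzyme and src in remaining:
                paths.append((stopped, enzyme, src))
    return paths


regimen = ["omeprazole", "clopidogrel", "warfarin", "fluconazole"]
paths = resolved_paths(EDGES, regimen, stopped="omeprazole")
print(paths)  # [('omeprazole', 'CYP2C19', 'clopidogrel')]
```

The check for other remaining inhibitors is the part a similarity search cannot do on its own: whether a path resolves depends on the rest of the regimen, not just on the two drugs named in the question.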

The LLM-Only response gets the answer directionally correct from training data. The Basic RAG response retrieves relevant paragraphs but can't traverse the interaction graph. Only GraphRAG surfaces the full cascade path with structural verification.

Stack

Graph DB: TigerGraph V3 (GSQL + Cypher query generation)
Vector Store: Pinecone (Basic RAG baseline)
LLM: Gemma (all three pipelines)
Frontend: Next.js dashboard with live system console
Benchmark Runtime: STITCH_OS v2.4.0
Security: AES-256-GCM channel encryption

What's Next

The benchmark harness is open. We're working toward full 100-question eval runs with automated LLM-Judge scoring and BERTScore computation across all five reasoning categories. The graph schema is extensible — drug-gene interactions, trial enrollment criteria, and payer formulary data are natural next nodes.

Try It / Contribute

The full codebase — benchmark runner, graph schema, retriever implementations, and dashboard — is open source.

GitHub: https://github.com/Sayandeep-the-coder/graphrag-benchmark

If you're building clinical AI, drug safety tooling, or just want to see a real GraphRAG vs RAG comparison with live metrics, this is the codebase to clone.

Tags: TigerGraph graphrag GraphRAGInferenceHackathon rag clinical-ai knowledge-graph polypharmacy llm-benchmarking gsql cypher drug-interactions open-source

DEV Community: VanshDeo