🚀 Beating the Token Explosion: How GraphRAG Outperforms Vector Search in Medical AI
As Large Language Models (LLMs) scale across industries, developers are hitting a wall: the token explosion. Shoving entire document dumps into an LLM's context window isn't just slow; it's incredibly expensive. And in domains like healthcare, where precision is everything, hallucinating because the model lost track of complex relationships isn't just an error; it's a critical failure.
For the 🏆 TigerGraph GraphRAG Inference Hackathon, my teammate and I wanted to prove that graphs make LLM inference faster, cheaper, and fundamentally smarter. The goal wasn't just to build a pipeline, but to benchmark token reduction while rigorously maintaining answer accuracy through a custom-built, interactive UI.
Here is a technical breakdown of how we built an interactive GraphRAG benchmark using Python and tkinter, the three pipelines we compared, and the data that proves why graphs are the future of retrieval.
🏥 The Problem and The Medical Dataset
Building a robust architecture for healthcare applications requires absolute precision. That is why we chose a dense medical dataset for this hackathon: one that maps specific diseases to their overlapping symptoms.
Medical data is inherently a graph. A symptom links to multiple possible diseases, and a disease links to specific treatments. Standard vector search struggles to connect these dots safely because it treats text as isolated, independent chunks. We needed a knowledge graph to preserve the true multi-hop relationships.
🖥️ System Architecture & The Tkinter Dashboard
To make the comparison tangible and user-friendly, we bypassed standard web frameworks and built a lightweight, interactive GUI using Python's native tkinter library.
This dashboard acts as our command center. When you enter a single patient symptom query, the GUI simultaneously routes it through all three pipelines, displaying the responses side-by-side along with real-time performance metrics.
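The full GUI lives in the repo linked at the end, but here is a minimal sketch of the side-by-side layout idea. The widget arrangement and the `run_pipeline` stub are illustrative, not our exact hackathon code:

```python
import tkinter as tk
from tkinter import ttk

PIPELINES = ["LLM-Only", "Basic RAG", "GraphRAG"]

def run_pipeline(name: str, query: str) -> str:
    # Placeholder: in the real app this calls the LLM-only, vector,
    # or TigerGraph pipeline and returns the answer plus metrics.
    return f"[{name}] answer for: {query}"

root = tk.Tk()
root.title("GraphRAG Benchmark Dashboard")

query_var = tk.StringVar()
ttk.Entry(root, textvariable=query_var, width=80).grid(
    row=0, column=0, columnspan=3, padx=8, pady=8
)

# One output pane per pipeline, displayed side by side.
outputs = []
for col, name in enumerate(PIPELINES):
    ttk.Label(root, text=name).grid(row=1, column=col)
    box = tk.Text(root, width=40, height=20, wrap="word")
    box.grid(row=2, column=col, padx=4, pady=4)
    outputs.append(box)

def on_submit():
    # Route the same query through all three pipelines at once.
    for box, name in zip(outputs, PIPELINES):
        box.delete("1.0", tk.END)
        box.insert(tk.END, run_pipeline(name, query_var.get()))

ttk.Button(root, text="Compare", command=on_submit).grid(row=3, column=1, pady=8)
root.mainloop()
```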
🧠 The Three Inference Pipelines: A Medical Case Study
To truly measure the impact, our app runs three pipelines side by side on the exact same symptom-based queries. Let's look at how they handle a complex, multi-hop query like: "What diseases share the symptoms of chronic migraines, acute nausea, and severe vertigo?"
❌ Pipeline 1: LLM-Only (The Baseline)
- How it works: A prompt goes in, and an output comes out. Zero external retrieval.
- The Medical Result: The LLM relies entirely on its pre-trained weights. It frequently generates generic advice or hallucinates a completely unrelated disease. In healthcare AI, this worst-case baseline is highly risky. (A bare-bones version is sketched after this list.)
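As a reference point, the baseline really is this small. A sketch assuming an OpenAI-style client and model; we are not claiming this is the exact client used in the project:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_only(query: str) -> str:
    # No retrieval at all: the model answers from its pre-trained weights alone.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content
```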
⚠️ Pipeline 2: Basic RAG (Vector + LLM)
- How it works: The current industry standard. Vector embeddings retrieve mathematically similar text chunks related to the listed symptoms and dump them into the context window. (A minimal retrieval sketch follows this list.)
- The Medical Result: It pulls a massive, noisy context dump of raw symptom descriptions. The LLM has to sift through thousands of tokens to find the overlap, and it frequently misses crucial relationship data because vector search lacks explicit entity linking.
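Stripped down, that retrieval step looks something like the sketch below, assuming sentence-transformers for embeddings; our real corpus, chunking, and model choices may differ:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Raw text chunks about symptoms and diseases (stand-ins for the real corpus).
chunks = [
    "Chronic migraines are recurring headaches that ...",
    "Vertigo is a spinning sensation often linked to ...",
    "Acute nausea can accompany many neurological conditions ...",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity reduces to a dot product on normalized vectors.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# The retrieved chunks get dumped into the prompt: similar text,
# but no explicit Disease -> Symptom links.
context = "\n".join(retrieve("migraines, nausea and vertigo"))
```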
✅ Pipeline 3: GraphRAG (Graph + LLM)
- How it works: Built using the TigerGraph GraphRAG framework. We modeled the data as interconnected entities (`Disease` and `Symptom`) and relationships (`HAS_SYMPTOM`); see the traversal sketch after this list.
- The Medical Result: TigerGraph performs true multi-hop reasoning. It identifies the symptoms, traces the graph edges to find the exact diseases where they intersect, and filters out the noise. The LLM receives a clean, highly structured, and tightly filtered prompt.
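We won't reproduce the whole TigerGraph GraphRAG framework here, but conceptually the retrieval reduces to a multi-hop traversal like the one below, called through pyTigerGraph. The host, graph name, and query name are all illustrative; the real wiring is in the repo:

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://your-instance.i.tgcloud.io",  # illustrative host
    graphname="MedicalGraph",                   # illustrative graph name
)
conn.getToken(conn.createSecret())  # authenticate against the instance

# Conceptually, the installed query walks Symptom -HAS_SYMPTOM-> Disease
# edges and keeps only diseases matched by ALL requested symptoms.
symptoms = ["chronic migraines", "acute nausea", "severe vertigo"]
results = conn.runInstalledQuery(
    "diseases_matching_all_symptoms",  # illustrative query name
    params={"symptom_names": symptoms},
)

# The result is a small, structured candidate list: a far tighter
# prompt than a pile of raw text chunks.
print(results)
```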
📊 Evaluating the Metrics: The Comparison Dashboard
The heart of this project is the interactive comparison dashboard. Here is how the performance metrics stacked up during testing:
| Metric | Pipeline 1 (LLM-Only) | Pipeline 2 (Basic RAG) | Pipeline 3 (GraphRAG) |
|---|---|---|---|
| Tokens Used | Baseline (Prompt only) | Extremely High (Massive context dump) | Significantly Reduced |
| Cost Per Query | Lowest | Highest | Optimized & Predictable |
| Response Latency | Ultra-Fast | Slowest (Processing massive chunks) | Balanced & Efficient |
| Multi-Hop Reasoning | Fails completely | Fails frequently | Excels (Traverses explicit edges) |

Caption: Live side-by-side comparison inside our Python GUI dashboard.
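The qualitative rankings in the table come from concrete measurements: for every query we counted the tokens each pipeline actually sent. A sketch using tiktoken as the tokenizer (an assumption; use whichever tokenizer matches your model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is illustrative

def count_tokens(prompt: str) -> int:
    return len(enc.encode(prompt))

# Hypothetical prompts built by each pipeline for the same query.
prompts = {
    "LLM-Only": "What diseases share chronic migraines, acute nausea, vertigo?",
    "Basic RAG": "Context:\n" + ("<raw symptom chunk>\n" * 50) + "Question: ...",
    "GraphRAG": "Candidates: Disease A, Disease B (matched 3/3 symptoms)\nQuestion: ...",
}
for name, prompt in prompts.items():
    print(f"{name}: {count_tokens(prompt)} tokens")
```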
🔬 Deep-Diving Into the Knowledge Graph Data
To visualize how our application structures facts, look at the schema: it connects diseases directly to symptoms instead of scattering them across raw text paragraphs.

Caption: A view of our interconnected nodes inside TigerGraph.
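In GSQL terms, the schema behind that view boils down to a few statements, submitted here through pyTigerGraph. This is a hedged sketch of the shape, not our exact attribute list:

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(host="https://your-instance.i.tgcloud.io")  # illustrative

# Minimal schema: diseases and symptoms as vertices, one edge type between them.
schema = """
CREATE VERTEX Disease (PRIMARY_ID name STRING, description STRING)
CREATE VERTEX Symptom (PRIMARY_ID name STRING)
CREATE UNDIRECTED EDGE HAS_SYMPTOM (FROM Disease, TO Symptom)
CREATE GRAPH MedicalGraph (Disease, Symptom, HAS_SYMPTOM)
"""
print(conn.gsql(schema))
```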
🎯 Ensuring Answer Accuracy (The Hugging Face Evaluation)
Cutting tokens is useless if the answer quality drops—especially with medical data. To prove Pipeline 3's superiority, every response was rigorously evaluated using two complementary Hugging Face approaches:
- LLM-as-a-Judge: A hosted Hugging Face model graded each answer on a PASS/FAIL basis, targeting a pass rate of ≥ 90%.
- BERTScore: We measured the semantic similarity of the generated response against the ground-truth correct answer, aiming for a rescaled F1 score of ≥ 0.55. (Both checks are sketched below.)
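Here is roughly what both checks look like in code. The judge model and prompt wording are assumptions on our part; the `rescale_with_baseline=True` flag in bert-score is what produces the rescaled F1 we threshold against:

```python
from bert_score import score
from huggingface_hub import InferenceClient

judge = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative judge

def judge_answer(question: str, answer: str, reference: str) -> bool:
    # LLM-as-a-Judge: ask a hosted model for a strict PASS/FAIL verdict.
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate answer: {answer}\nReply with exactly PASS or FAIL."
    )
    reply = judge.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return "PASS" in reply.choices[0].message.content.upper()

def bertscore_f1(answers: list[str], references: list[str]) -> list[float]:
    # Rescaled F1 against ground truth; we targeted >= 0.55.
    _, _, f1 = score(answers, references, lang="en", rescale_with_baseline=True)
    return f1.tolist()
```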
Through iterative parameter tuning, the GraphRAG pipeline consistently hit these high-accuracy thresholds while utilizing a fraction of the tokens required by Basic RAG.
💡 The Verdict
The numbers tell the story. By feeding the LLM structured graph relationships instead of raw vector chunks, TigerGraph extracts exactly what is needed for complex, symptom-to-disease routing. This means optimized API costs, faster generation times, and highly accurate, hallucination-resistant answers.
🛠️ Open Source & Code Walkthrough
We have open-sourced our benchmarking suite! Check out how we wired up the graph traversal and built the GUI:
- 💻 GitHub Repository: https://github.com/Lochan-Visnu/GraphRAG-Hackathon-Lcube
- 🎥 Demo Video: https://drive.google.com/file/d/1eF7Ahm6laaiajQhQDBlMzY-m1IrczWZJ/view?usp=drivesdk
A huge thank you to the TigerGraph team for providing the open-source repo, the Savanna environment, and the opportunity to tackle this industry-wide problem during the hackathon!