Building a GraphRAG Benchmark Platform During the TigerGraph Hackathon

#devchallenge #llm #performance #rag

Large Language Models are becoming increasingly powerful, but filling their growing context windows also increases token usage, latency, and inference cost. Traditional Retrieval-Augmented Generation (RAG) systems improve grounding by retrieving similar text chunks, yet they still struggle to reason across relationships between entities and concepts. This limitation is the core idea behind the TigerGraph GraphRAG Inference Hackathon, which asks whether graph-based retrieval can make AI systems faster, cheaper, and more context-aware.
To explore this, I built an interactive benchmarking dashboard that compares three pipelines side-by-side: LLM-Only, Basic RAG, and GraphRAG. Using an arXiv AI research dataset containing nearly 10,000 papers, the system evaluates token usage, latency, cost, and response quality to analyze how different retrieval architectures impact modern AI inference systems.
**The Problem**
Modern AI systems rely heavily on large amounts of context to generate accurate responses. While this improves answer quality, it also increases token consumption and inference cost significantly. In production environments, even small inefficiencies in retrieval pipelines can lead to slower responses, higher API expenses, and unnecessary context processing.
Basic RAG systems attempt to solve this by retrieving semantically similar chunks using vector search. However, similarity-based retrieval often lacks structural understanding. It may retrieve related text, but it cannot naturally capture deeper relationships such as entity connections, topic dependencies, or multi-hop reasoning across documents.
For domains like research papers, where concepts are highly interconnected, this becomes a major limitation. The challenge was to explore whether GraphRAG could provide more focused and relationship-aware retrieval while maintaining answer accuracy with lower token usage.
**What I Built**
I developed a GraphRAG benchmarking dashboard that executes and compares three different AI inference pipelines using the same user query and dataset. The platform was designed to visually demonstrate how retrieval strategy affects system performance, cost, and reasoning quality in real-world AI workflows.
The dashboard includes:

- an LLM-Only pipeline for direct generation,
- a Basic RAG pipeline using TF-IDF-based retrieval, and
- a GraphRAG pipeline that uses relationship-aware retrieval with graph-style reasoning.

Each pipeline generates responses independently and is evaluated using metrics such as total tokens consumed, response latency, estimated inference cost, and answer quality. The system also includes automated evaluation methods like LLM-as-a-Judge and BERTScore to benchmark semantic accuracy across pipelines.
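
To make the LLM-as-a-Judge step concrete, here is a minimal sketch of how such a grader could be wired up. The prompt wording, the 1-to-5 scale, the use of a reference answer, and the `call_llm` callable are all assumptions for illustration, not the project's actual implementation.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the real dashboard's wording is not shown in this post.
JUDGE_TEMPLATE = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single integer score from 1 (wrong) to 5 (fully correct)."
)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template with one question/answer pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(
    question: str,
    reference: str,
    candidate: str,
    call_llm: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a score; call_llm is any prompt -> text function."""
    reply = call_llm(build_judge_prompt(question, reference, candidate))
    return parse_score(reply)
```

Passing the model call in as a plain function keeps the judge independent of any one provider, and returning `None` on an unparseable reply lets the caller decide whether to retry or skip that sample.
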
The entire application was built as an interactive Streamlit dashboard for real-time experimentation and comparison.
**System Architecture**
The project was designed as a modular benchmarking system where all three pipelines operate independently but share the same input query and evaluation workflow. This structure made it easier to compare retrieval strategies under identical conditions and measure their impact on inference efficiency.

Pipeline 1 — LLM-Only
The first pipeline acts as the baseline system. The user query is directly sent to the language model without any external retrieval or contextual grounding. While this approach is simple, it often leads to higher token usage and less controlled responses.
Query → LLM → Response

Pipeline 2 — Basic RAG
The second pipeline introduces retrieval before generation. Relevant chunks are fetched from the arXiv dataset using TF-IDF similarity search and passed as context to the language model. This improves grounding by supplying supporting information related to the query.
Query → TF-IDF Retrieval → Context → LLM → Response
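
The retrieval step in this pipeline can be sketched in plain Python. This is an illustrative, dependency-free version of TF-IDF with cosine similarity; the function names, the smoothed IDF formula, and the sample chunks are assumptions, and the actual project may well use a library vectorizer instead.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Build an L2-normalized TF-IDF vector; terms unknown to the index are dropped."""
    counts = Counter(tokens)
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(chunks: List[str]) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Compute smoothed IDF weights and one TF-IDF vector per chunk."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + count)) + 1.0 for t, count in df.items()}
    return idf, [to_vector(tokens, idf) for tokens in tokenized]


def retrieve(query, chunks, idf, vectors, top_k=2):
    """Return the top_k (chunk, cosine similarity) pairs for the query."""
    q = to_vector(tokenize(query), idf)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], scores[i]) for i in ranked[:top_k]]


# Made-up sample chunks, only for demonstration.
chunks = [
    "Transformers use self-attention to model long-range dependencies.",
    "Reinforcement learning trains agents with reward signals.",
    "Graph neural networks propagate messages along edges.",
]
idf, vectors = build_index(chunks)
top = retrieve("How does attention work in transformers?", chunks, idf, vectors)
```

The retrieved chunks would then be concatenated into the prompt as context before the query is sent to the model.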

Pipeline 3 — GraphRAG
The GraphRAG pipeline extends retrieval beyond similarity matching by incorporating entity relationships and multi-hop reasoning. Instead of retrieving isolated chunks, the system builds structured contextual connections between related concepts before generating a response.
Query → Graph Retrieval → Relationship Context → LLM → Response
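
As a rough illustration of relationship-aware retrieval, the sketch below runs a bounded breadth-first search over a tiny in-memory graph and turns the reached edges into context lines. It is a stand-in, not TigerGraph code: the graph contents, the `related_to` edge label, and the function names are all hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy knowledge graph: entity -> directly related entities.
GRAPH: Dict[str, List[str]] = {
    "transformer": ["self-attention", "bert"],
    "self-attention": ["transformer"],
    "bert": ["transformer", "masked language modeling"],
    "masked language modeling": ["bert"],
    "graph neural network": ["message passing"],
    "message passing": ["graph neural network"],
}


def multi_hop_neighbors(graph, seeds, max_hops=2) -> Dict[str, int]:
    """Breadth-first search returning {entity: hop distance} within max_hops."""
    distances = {s: 0 for s in seeds if s in graph}
    queue = deque(distances)
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def build_relationship_context(graph, seeds, max_hops=2) -> str:
    """Render each edge between reached entities once, as a line of context."""
    reached = multi_hop_neighbors(graph, seeds, max_hops)
    lines = []
    for node in reached:
        for neighbor in graph.get(node, []):
            # node < neighbor keeps one copy of each undirected edge.
            if neighbor in reached and node < neighbor:
                lines.append(f"{node} -- related_to -- {neighbor}")
    return "\n".join(sorted(lines))


context = build_relationship_context(GRAPH, ["transformer"], max_hops=1)
```

Raising `max_hops` widens the neighborhood that ends up in the prompt, which trades extra tokens for more relational coverage.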

All three pipelines are evaluated through a unified benchmarking layer that measures token consumption, latency, cost, and answer quality side-by-side.
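
A unified benchmarking layer of this kind could be sketched as follows. The whitespace-based token estimate, the flat `price_per_1k_tokens` rate, and all names here are assumptions for illustration; a real run would use the provider's reported token counts and pricing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    pipeline: str
    answer: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cost_usd: float


def estimate_tokens(text: str) -> int:
    """Crude whitespace approximation; a real tokenizer would give different counts."""
    return len(text.split())


def run_pipeline(
    name: str,
    build_prompt: Callable[[str], str],
    call_llm: Callable[[str], str],
    query: str,
    price_per_1k_tokens: float = 0.001,  # hypothetical flat rate in USD
) -> RunResult:
    """Time one pipeline end to end (retrieval plus generation) and record its metrics."""
    start = time.perf_counter()
    prompt = build_prompt(query)
    answer = call_llm(prompt)
    latency = time.perf_counter() - start
    prompt_tokens = estimate_tokens(prompt)
    completion_tokens = estimate_tokens(answer)
    cost = (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens
    return RunResult(name, answer, prompt_tokens, completion_tokens, latency, cost)


def benchmark(
    query: str,
    pipelines: Dict[str, Callable[[str], str]],
    call_llm: Callable[[str], str],
) -> List[RunResult]:
    """Run every pipeline on the same query with the same model call."""
    return [run_pipeline(name, bp, call_llm, query) for name, bp in pipelines.items()]


# Usage with stand-in prompt builders and a stubbed model call.
pipelines = {
    "LLM-Only": lambda q: q,
    "Basic RAG": lambda q: "context chunk\n" + q,
}
results = benchmark("what is attention", pipelines, lambda prompt: "stub answer")
```

Because each pipeline is reduced to a prompt-building function, the only thing that varies between runs is the retrieval strategy, which is what keeps the comparison fair.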
**Tech Stack**
The project was built using a combination of AI, retrieval, visualization, and evaluation tools to create a complete end-to-end benchmarking workflow.

Frontend Dashboard: Streamlit

Programming Language: Python

LLM Provider: Groq API (Llama 3 70B)

Retrieval System: TF-IDF Retrieval with Cosine Similarity

Graph Retrieval: TigerGraph GraphRAG with mock graph retrieval fallback

Dataset Processing: Pandas, NumPy

Evaluation Metrics: BERTScore and LLM-as-a-Judge

Visualization: Plotly

Dataset: arXiv AI research papers dataset (approximately 10,000 records with titles, summaries, and categories)

The modular structure allowed different components such as retrieval, evaluation, and inference pipelines to operate independently, making the system easier to debug, benchmark, and extend during development.
**Dataset**
To benchmark the three retrieval pipelines on a realistic knowledge-heavy domain, I used an arXiv AI research papers dataset containing approximately 10,000 research records and over 1 million tokens of text. The dataset included paper titles, summaries, categories, and research descriptions, making it well-suited for testing retrieval and reasoning performance across interconnected topics.
Research datasets are particularly effective for GraphRAG systems because concepts such as neural networks, transformers, reinforcement learning, and graph neural networks frequently reference and relate to one another across documents. This creates a naturally connected knowledge structure that graph-based retrieval can leverage more effectively than isolated similarity matching.
Before benchmarking, the dataset was cleaned, merged into structured text fields, and prepared for retrieval through preprocessing and chunking pipelines.
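
The merge-and-chunk step might look something like the sketch below. The field layout, the word-based chunk size, and the overlap value are assumptions chosen for illustration rather than the settings used in the project.

```python
from typing import List


def record_to_text(title: str, summary: str, categories: str) -> str:
    """Merge the fields of one paper record into a single text field."""
    return f"Title: {title}. Categories: {categories}. Summary: {summary}"


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the end of the text
    return chunks


# Tiny demonstration with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated tokens.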
**Challenges Faced During Development**
One of the biggest challenges during development was managing LLM API compatibility and quota limitations. The project initially used Gemini APIs, but repeated model access errors, SDK inconsistencies, and rate-limit issues disrupted the inference pipeline. To stabilize the system, the backend was later migrated to Groq, which required restructuring the LLM utility layer and updating imports across multiple modules.
Another challenge involved maintaining consistent retrieval quality across pipelines. Selecting appropriate chunk sizes, retrieval depth, and context limits required several iterations to balance accuracy with token efficiency. Smaller contexts reduced cost but occasionally removed important information, while larger contexts increased latency and token consumption.
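
One simple way to enforce a context limit is to pack ranked chunks greedily under a token budget. The sketch below assumes chunks arrive already sorted by relevance and uses a whitespace word count as a stand-in for a real tokenizer; the function name and default counter are hypothetical.

```python
from typing import Callable, List, Tuple


def select_context(
    ranked_chunks: List[str],
    max_context_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Tuple[List[str], int]:
    """Greedily keep the highest-ranked chunks that fit within the token budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            continue  # too large for the remaining budget; a later, smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected, used


kept, used_tokens = select_context(["a b c", "d e f g", "h"], max_context_tokens=4)
```

Sweeping `max_context_tokens` across a few values and re-running the benchmark is one way to see where answer quality starts to drop.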
Integrating GraphRAG behavior without a fully active TigerGraph server was also difficult. To prevent failures during demos and testing, a mock graph retrieval fallback system was implemented to simulate relationship-aware retrieval when the graph backend was unavailable.
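
The fallback pattern described here can be sketched as a small wrapper. The function and logger names are hypothetical, and catching every `Exception` is a deliberate simplification for demo stability; a production version would likely catch only connection and timeout errors.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("graphrag_benchmark")  # hypothetical logger name


def retrieve_with_fallback(
    query: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> Tuple[List[str], str]:
    """Try the graph backend first; on any error, use the mock retriever instead.

    Returns the retrieved context plus a label saying which backend produced it.
    """
    try:
        return primary(query), "graph"
    except Exception as exc:  # broad on purpose: a demo should never crash here
        logger.warning("Graph backend unavailable (%s); using mock retrieval", exc)
        return fallback(query), "mock"


def unavailable_backend(query: str) -> List[str]:
    """Stand-in for a graph server that cannot be reached."""
    raise ConnectionError("graph server not reachable")


result = retrieve_with_fallback(
    "what is attention", unavailable_backend, lambda q: ["mock context"]
)
```

Returning the backend label alongside the context makes it possible to mark in the results which runs used simulated graph retrieval, so mock and live numbers are not mixed up.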
**Benchmark Results**
The benchmarking process showed clear differences in how each retrieval pipeline handled context generation and inference efficiency. The LLM-Only pipeline produced responses quickly but consumed the highest number of tokens because the model relied entirely on internal reasoning without external retrieval support.
The Basic RAG pipeline improved contextual grounding by retrieving relevant chunks from the dataset before generation. This reduced hallucinations and improved response relevance, but token usage still increased when multiple large chunks were added to the prompt.
The GraphRAG pipeline demonstrated better token efficiency by constructing more focused contextual inputs using entity relationships and multi-hop retrieval patterns. Instead of supplying large volumes of loosely related text, the system generated structured context centered around connected concepts. This helped reduce unnecessary prompt expansion while maintaining comparable response quality across benchmark queries.
The dashboard also provided side-by-side comparisons for latency, estimated inference cost, and semantic evaluation scores, making the performance trade-offs between all three pipelines easier to analyze visually.
**Key Learnings**
Building this project highlighted how strongly retrieval architecture influences the overall behavior of AI systems. The experiments showed that improving retrieval quality can often be more impactful than simply increasing model size or context length. Structured retrieval helped generate more focused prompts, which directly affected token efficiency and response consistency.
Another major learning was the importance of modular system design. Separating the project into independent components for retrieval, evaluation, inference, and visualization made debugging significantly easier during rapid development and API migrations.
The project also demonstrated that benchmarking AI systems requires more than observing generated answers. Metrics such as token usage, latency, semantic similarity, and evaluation consistency provide a much clearer understanding of real-world system performance. Comparing pipelines side-by-side made these trade-offs far more visible than testing individual models in isolation.
**Conclusion**
This project explored how different retrieval strategies affect the efficiency and reasoning capability of modern AI systems. By benchmarking LLM-Only, Basic RAG, and GraphRAG pipelines on the same dataset and queries, the system provided a practical comparison of how contextual retrieval impacts token consumption, response quality, and inference performance.
The results demonstrated that graph-oriented retrieval can generate more focused contextual information compared to traditional similarity-based approaches, especially in highly connected domains such as research literature. Beyond the technical implementation, the project emphasized the growing importance of evaluation-driven AI engineering, where system architecture, retrieval quality, and benchmarking metrics are treated as equally important as the language model itself.
The dashboard ultimately served as both a research experiment and a scalable prototype for analyzing retrieval efficiency in production-style AI workflows.

Top comments (1)

Harjot Singh • May 31

Building a benchmark platform for GraphRAG rather than just shipping a GraphRAG is the move that earns trust, because graph retrieval is exactly the kind of thing everyone assumes is better without measuring whether it actually beats plain vector search on their data. Sometimes the graph structure genuinely helps (multi-hop questions, relationship-heavy queries where the answer is in how entities connect, not in any single chunk), and sometimes it's expensive machinery that loses to a good embedding retriever on straightforward lookups. The only way to know which case you're in is a benchmark, so building the measuring stick first is the right order of operations. The thing I'd watch for in the eval design is that the test set has to contain the queries where graph should win, multi-hop, relational, the questions a flat retriever structurally can't answer, otherwise you're benchmarking GraphRAG on vector-friendly questions and concluding the graph adds nothing when really your test just never exercised it. Match the benchmark to the capability you're trying to measure. Measure whether the graph earns its complexity, don't assume it. That benchmark-before-you-believe instinct is core to how I think about retrieval in Moonshift. In your results so far, is GraphRAG winning mostly on the multi-hop queries, or are you seeing it help on single-fact lookups too?