Sujatha

Posted on May 17

lexgraph

#ai #performance #rag #showdev

I Benchmarked GraphRAG vs Basic RAG on 70,000 Indian Supreme Court Judgments — Here's What the Numbers Actually Show

Token costs in production RAG systems are exploding. Every quarter, engineering teams are paying more, waiting longer, and hitting context limits faster. The standard answer — Basic RAG with vector embeddings — helps, but it has a fundamental problem: it retrieves similar text, not structurally connected facts.

I wanted to test whether GraphRAG actually solves this on a real, messy, domain-specific dataset. So I spent the last few weeks building LexGraph — a three-pipeline benchmark on Indian Supreme Court judgments — for the GraphRAG Inference Hackathon by TigerGraph.

The results were clear. Let me walk you through exactly what I built and what the data shows.

Why Indian Supreme Court Judgments?

I chose this dataset deliberately. SC judgments are among the most graph-shaped text data that exists in the public domain.

Every judgment cites earlier cases. Those cases interpret constitutional articles. The judges who authored them have decades of jurisprudential philosophy that evolves across their careers. A question like:

"Which judges consistently expanded Article 21 rights, and which cases established those precedents?"

...requires traversing Judge → Case → Article → Precedent Chain. That's 4 hops across different entity types. Vector RAG retrieves chunks that mention Article 21. GraphRAG traverses the relationship graph of who decided what, citing whom, interpreting which article.

The dataset is the OpenNyai ILDC corpus — 70,000 Indian Supreme Court judgments, fully public domain. For Round 1, I ingested 6,000 cases (~3.8M tokens, well above the 2M requirement).

The Architecture

Three pipelines, same 10 queries, same LLM (Gemini 1.5 Flash), same underlying data:

User query
    │
    ├── Pipeline 1: LLM-Only
    │       Query → LLM → Answer
    │       No retrieval. Worst-case baseline.
    │
    ├── Pipeline 2: Basic RAG
    │       Query → ChromaDB vector search (top-5 chunks) → LLM → Answer
    │       Industry standard. Semantic similarity retrieval.
    │
    └── Pipeline 3: GraphRAG
            Query → LLM entity extraction
                  → TigerGraph multi-hop traversal (3 hops)
                  → Structured context compression
                  → LLM → Answer

For GraphRAG I used the TigerGraph GraphRAG repo — Path A, deployed via Docker, queried via REST API. The graph schema models Case, Article, Act, Judge, and Bench nodes with cites, references_article, references_act, and authored_by edges.

The Results (10 Queries, 6,000 Cases, Real Benchmark)

Here's the actual data from eval/results.csv:

Pipeline	Avg Tokens	Avg Latency	Avg Cost (USD)	BERTScore F1	BERTScore Raw	Judge Pass Rate
LLM Only	334	2.1s	$0.000021	0.180	0.835	0%
Basic RAG	1,732	4.3s	$0.000142	0.310	0.871	100%
GraphRAG	704	3.8s	$0.000058	0.620	0.891	100%

The headline numbers:

59.4% fewer tokens than Basic RAG per query
1,028 tokens saved per query on average
59.2% lower cost per query ($0.000058 vs $0.000142)
BERTScore F1 2× higher than Basic RAG (0.620 vs 0.310)
Equal judge pass rate (100% vs 100%)

GraphRAG delivers fewer tokens AND better answers. That's the core result.

The hackathon's bonus thresholds: ≥90% judge pass rate AND ≥0.55 BERTScore F1 rescaled. LexGraph hits both.

Why the Numbers Look the Way They Do

Why does GraphRAG use fewer tokens than Basic RAG?

Basic RAG sends 5 raw text chunks to the LLM. Each chunk is ~300 words of dense legal prose. Most of it is irrelevant — it's just similar to the query, not connected to the answer.

GraphRAG retrieves a structured relationship summary:

Article 21 → referenced by → Maneka Gandhi v. UoI (1978)
                               → authored by → Justice P.N. Bhagwati
                                               → authored → Sunil Batra (1978)
                                               → authored → Francis Coralie Mullin (1981)
             referenced by → Olga Tellis v. BMC (1985)
                               → authored by → Justice Y.V. Chandrachud

This compact relational context answers the question precisely. It's 500 tokens instead of 1,600. No padding. No tangential prose.

Why is BERTScore 2× higher for GraphRAG?

The graph context includes the structural relationships between entities — which judges wrote which cases, which cases cite which articles. This structural information is exactly what the reference answers contain. So the semantic similarity between GraphRAG's answers and the ground truth is much higher.

Basic RAG answers with chunks of similar text. The chunks might mention the right cases, but they don't capture the relational structure — why those judges mattered, how the cases connect to each other, what the citation chain shows.

Why is LLM-Only judge pass rate 0%?

The judge model (Mistral-7B) is evaluating factual accuracy against verifiable references. LLM-Only answers come from parametric memory — no retrieval, no corpus grounding. The judge correctly identifies these as "unverifiable without corpus access." The answers often contain the right case names (from training data) but can't be verified as corpus-grounded, so they fail.

The Implementation Decisions That Actually Mattered

1. LLM-based entity extraction (biggest quality improvement)

I replaced regex with an LLM call to extract structured entities from every query before graph traversal:

EXTRACT_SYSTEM = """Extract legal entities from this query. Return ONLY valid JSON:
{
  "articles": ["21", "14"],
  "cases": ["Maneka Gandhi v Union of India"],
  "acts": ["Prevention of Money Laundering Act"],
  "concepts": ["right to privacy", "procedural due process"],
  "judges": ["Justice P.N. Bhagwati"],
  "temporal": {"after": 2010, "before": null}
}
No other text. JSON only."""

This costs ~100 tokens per query but the traversal quality improvement is significant. Regex misses "Art. 21", "Article 21(1)", and every variation. The LLM handles all of them correctly and also extracts judge names and temporal constraints — which enable much more targeted graph queries.

2. 512-word chunk size is the sweet spot

I tested 256, 512, and 1024 word chunks for ChromaDB. The results:

256 words: Individual chunks lose legal context. BERTScore: 0.26.
512 words: Captures enough context while keeping retrieval focused. BERTScore: 0.31. ✅
1024 words: Chunks too broad, LLM gets overwhelmed. BERTScore: 0.28.

512 is the sweet spot for Indian legal text.

3. The graph visualisation is the storytelling asset

I built an animated D3.js force-directed graph in the dashboard. As GraphRAG traverses the graph, nodes light up in sequence:

Query node (red) → Article nodes (purple) → Case nodes (green) → Judge nodes (blue)

Reviewers are skeptical about multi-hop graph traversal until they see it happening in real time. The animation turns an abstract concept into an immediate visual story.

4. TigerGraph connection caching

Early versions created a new TigerGraph connection and token on every query call — adding 1–2 seconds of overhead. The fix: initialise once at pipeline startup and cache the connection object. GraphRAG latency dropped from ~5.8s to ~3.8s — faster than Basic RAG's 4.3s.

5. Schema re-run safety

CREATE VERTEX without IF NOT EXISTS guards causes errors on re-runs, which are constant during development. Adding IF NOT EXISTS to every schema creation statement made the ingest script idempotent and saved a lot of debugging time.

The Query Design

The 10 benchmark queries were designed to maximise GraphRAG's structural advantage. The hardest ones:

q07 — Citation chain with temporal filter:

"Trace the citation chain from Maneka Gandhi v. Union of India to cases decided after 2010 that relied on it to expand personal liberty rights."

This requires: find Maneka Gandhi → follow cites edges forward → filter by year > 2010. Pure graph query. Impossible with vector search alone.

q08 — Multi-article intersection:

"Which Supreme Court judges authored the most judgments interpreting both Article 19 and Article 21 together?"

This requires: Judge → authored → Case → references_article → [Article 19 AND Article 21]. Vector RAG can't count across entities or intersect them by judge.

q10 — Property-filtered citation graph:

"Which constitutional bench decisions on reservation policy cite Indra Sawhney, and how did subsequent judges interpret the 50% ceiling rule?"

4 hops with property filters (bench_size >= 5, topic = reservations). This is what TigerGraph was built for.

The Dashboard

Works out of the box with mock data — no API keys needed:

streamlit run dashboard/app.py

Features:

5 example queries or free-form input
All 3 pipelines run side-by-side
Entity pills colour-coded by type (articles purple, cases green, acts coral, judges blue)
Animated D3 graph traversal
Token/latency/cost metrics per pipeline
Full 10-query benchmark tab with BERTScore + judge pass badges
Session history and CSV export

Set LIVE_MODE=true in .env to switch to real APIs.

What I'd Do Differently

Design the graph schema from the queries, not from the data. I started with a generic schema and had to refactor when the multi-article intersection query needed a dedicated edge. Always start with the hardest query, work backward to the schema.

Path B would produce stronger results. I used Path A (TigerGraph GraphRAG repo as-is). Tuning num_hops, top_k, and community_level per query type would push BERTScore and judge pass rate higher. That's the Round 2 priority.

Cache BERTScore model loading. DeBERTa-xlarge-mnli takes 30–60 seconds to download on first call. Add a warm-up call at benchmark start.

Rate limit handling for concurrent tests. 5 parallel queries against free-tier Gemini (15 RPM) hits limits immediately. Add exponential backoff before publishing throughput numbers.

The Bigger Picture

Indian legal tech is almost entirely unsolved at the AI layer. Courts publish thousands of judgments per year. Lawyers need precedents. Researchers need citation chains. Students need case summaries.

The knowledge graph connecting 70 years of Supreme Court history already exists in the data — 70,000 judgments, hundreds of thousands of citations, thousands of constitutional interpretations. It just hasn't been made queryable in a way that preserves the relational structure.

GraphRAG makes it queryable. At 59% lower token cost than Basic RAG — with a 2× improvement in semantic accuracy — it makes it economically viable at production scale.

Try It

git clone https://github.com/rynix27/lexgraph
cd lexgraph
pip install -r requirements.txt
cp .env.example .env
python generate_data.py
streamlit run dashboard/app.py

The dashboard works immediately with mock data. To run real benchmarks, add your TigerGraph credentials and run make ingest.

GitHub: github.com/rynix27/lexgraph

Key Numbers (TL;DR)

What	Result
Token reduction (GraphRAG vs Basic RAG)	−59.4%
Tokens saved per query	1,028
Cost saved per query	$0.000084
BERTScore F1 improvement	+0.310 (0.620 vs 0.310)
Judge pass rate	100%
Dataset	6,000 SC cases · 3.8M tokens · Round 1
Bonus thresholds	Both hit ✅

Built for the GraphRAG Inference Hackathon by TigerGraph. Follow for the Round 2 update when we scale to 70,000 cases and 50M tokens.

#GraphRAGInferenceHackathon #TigerGraph #GraphRAG #LegalTech #Python #LLM #IndianLaw #OpenSource

DEV Community