saurabh naik

Posted on May 18

GraphRAG vs vector RAG: when the knowledge graph pays for itself

#ai #llm #rag #python

Ask your vector RAG pipeline "what are the main themes in this corpus?" and watch it return three random chunks that share a keyword. Flat vector retrieval is built for "find me the chunk that matches this query." It is not built for holistic, sense-making questions over a whole corpus.

GraphRAG, from Microsoft Research, was the headline fix for that gap. It builds an LLM-extracted knowledge graph plus hierarchical community summaries, then answers global queries by map-reducing over those summaries. The catch, which Microsoft Research itself documented in its LazyGraphRAG benchmark, is that the indexing pipeline costs roughly 1000x more than a naive vector index. This post walks through what GraphRAG actually does, when it earns that cost, and what to reach for when it doesn't.

The failure mode flat vector RAG hides

Say you have 500 internal incident reports. A new hire asks: "What categories of incidents have we hit most often this year?"

Vector RAG embeds the question, retrieves the top-k chunks by cosine similarity, and stuffs them into the prompt. You get an answer based on whichever 5 chunks happened to score highest — usually the ones with the highest keyword overlap, not a representative sample of the corpus. The model can only summarize what it sees, and it never sees the whole picture.
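To make the failure mode concrete, here is a minimal sketch of that retrieval step. The 3-dimensional vectors are placeholders standing in for real embeddings, and `cosine` and `top_k` are illustrative helpers rather than any library's API. Whatever the corpus size, the prompt only ever receives `k` chunks.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Indices of the k chunks most similar to the query, best first."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Placeholder 3-d vectors for 500 incident-report chunks.
# A real pipeline gets these from an embedding model.
chunk_vecs = [[(i % 7) + 1, (i % 11) + 1, (i % 13) + 1] for i in range(500)]
query_vec = [1.0, 0.0, 0.0]

hits = top_k(query_vec, chunk_vecs, k=5)
print(f"the prompt sees {len(hits)} of {len(chunk_vecs)} chunks")
```

The other 495 chunks never reach the model, no matter how relevant they are to a "what categories" question.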

This is the failure mode GraphRAG was built for: queries where the right answer requires reasoning over the whole corpus, not just retrieving the closest passage.

How GraphRAG fixes it

The indexing pipeline does four things:

Chunk and extract. An LLM reads each chunk and extracts entities, relationships, and claims — with weighted edges and source provenance back to the original text.
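Build a typed graph. Entities become nodes, relationships become edges. The reference implementation writes the graph out as Parquet tables and keeps embeddings in a vector store (LanceDB by default); the result can also be loaded into a graph database such as Neo4j.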
Run Leiden community detection. GraphRAG applies the Leiden algorithm hierarchically, partitioning the graph into nested communities: small tight clusters inside larger thematic ones.
Generate community reports. For every community at every level, the LLM writes a natural-language summary. These summaries are what global queries actually answer against.
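A toy, stdlib-only sketch of those four steps helps make the data flow concrete. Everything here is a labeled stand-in: the triples are hypothetical output of LLM extraction, connected components replace Leiden (a much coarser grouping with no hierarchy), and the "report" is string formatting where GraphRAG would call an LLM. None of these names come from the graphrag package.

```python
from collections import defaultdict

# Step 1 stand-in: triples an LLM *might* extract, with the source chunk
# kept for provenance. (source, relation, target, chunk_id); hypothetical data.
extracted = [
    ("AuthService", "depends_on", "Redis", "chunk-1"),
    ("Redis", "failed_during", "Outage-17", "chunk-1"),
    ("AuthService", "owned_by", "Team-Identity", "chunk-2"),
    ("BillingJob", "writes_to", "Postgres", "chunk-3"),
    ("Postgres", "failed_during", "Outage-22", "chunk-4"),
]

# Step 2: typed graph. Each edge keeps its relation type and source chunk.
graph = defaultdict(list)
for src, rel, dst, chunk_id in extracted:
    graph[src].append((dst, rel, chunk_id))
    graph[dst].append((src, rel, chunk_id))

# Step 3 stand-in: connected components instead of hierarchical Leiden.
def components(g):
    seen, groups = set(), []
    for start in g:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nbr, _, _ in g[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        groups.append(sorted(group))
    return groups

communities = components(graph)

# Step 4 stand-in: one "report" per community, citing its source chunks.
def community_report(members, g):
    sources = sorted({c for m in members for _, _, c in g[m]})
    return (f"Community of {len(members)} entities: {', '.join(members)} "
            f"(sources: {', '.join(sources)})")

reports = [community_report(c, graph) for c in communities]
for r in reports:
    print(r)
```

The provenance column is what later lets every answer cite its source chunks.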

Steps 1 and 4 are where the token bill explodes. You pay an LLM to read every chunk during extraction, then pay it again to summarize every community at every hierarchy level, and all of it happens up front at index time.

Local Search vs Global Search

GraphRAG ships two query modes, and the difference matters:

Local Search is for specific entity-centric questions ("what did we ship in the Q3 release?"). It matches the query to entities, expands to their neighborhoods (linked entities, relationships, source text), and feeds that subgraph as context.
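A minimal sketch of the neighborhood expansion behind Local Search, assuming a small in-memory graph. The entities and doc IDs are hypothetical, and `match_entities` uses substring matching as a stand-in for the embedding match GraphRAG actually does.

```python
# Hypothetical entity graph: entity -> list of (neighbor, relation, source doc).
graph = {
    "Q3 release": [("Search v2", "shipped", "doc-12"), ("Billing API", "shipped", "doc-14")],
    "Search v2": [("Q3 release", "shipped_in", "doc-12"), ("Team-Discovery", "owned_by", "doc-3")],
    "Billing API": [("Q3 release", "shipped_in", "doc-14")],
    "Team-Discovery": [("Search v2", "owns", "doc-3")],
}

def match_entities(query, graph):
    """Stand-in for embedding match: entities whose name appears in the query."""
    q = query.lower()
    return [e for e in graph if e.lower() in q]

def local_context(query, graph, hops=1):
    """Matched entities plus their neighborhood up to `hops` away, with facts."""
    frontier = match_entities(query, graph)
    seen = set(frontier)
    facts = []
    for _ in range(hops):
        nxt = []
        for entity in frontier:
            for nbr, rel, src in graph.get(entity, []):
                facts.append(f"{entity} -{rel}-> {nbr} [{src}]")
                if nbr not in seen:
                    seen.add(nbr)
                    nxt.append(nbr)
        frontier = nxt
    return sorted(seen), facts

entities, facts = local_context("What did we ship in the Q3 release?", graph)
print(entities)
print(facts)
```

The facts, each tagged with its source doc, become the subgraph context handed to the answering LLM.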

Global Search is for thematic, aggregative questions ("what are the recurring failure modes across these incidents?"). It map-reduces over the precomputed community reports — each report contributes a partial answer, then a reducer combines them.
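The map-reduce shape of Global Search can be sketched like this. The reports are hypothetical, and `map_step` scores each report by overlap with a hand-picked term list, standing in for the per-report LLM call that rates how helpful its partial answer is. In GraphRAG the reduce step asks an LLM to write the final answer from the ranked points; here it just joins them.

```python
# Hypothetical community reports; GraphRAG has an LLM write these at index time.
reports = {
    "c1": "Most incidents trace to Redis connection exhaustion in AuthService.",
    "c2": "Postgres lock contention caused two billing outages.",
    "c3": "The cafeteria menu changed in March.",
}

# Stand-in for the LLM's judgment of relevance to the question
# "What are the recurring failure modes across these incidents?"
RELEVANT_TERMS = {"incident", "incidents", "outage", "outages", "failure", "failures"}

def map_step(report_id, text):
    """One 'LLM call' per report: return (helpfulness score, partial answer)."""
    words = {w.strip(".,").lower() for w in text.split()}
    score = len(RELEVANT_TERMS & words)
    return score, f"[{report_id}] {text}"

def reduce_step(partials, top_n=5):
    """Drop zero-score partials, rank the rest, and combine into one context."""
    kept = sorted((p for p in partials if p[0] > 0), key=lambda p: p[0], reverse=True)
    return "\n".join(answer for _, answer in kept[:top_n])

partials = [map_step(rid, text) for rid, text in reports.items()]
combined = reduce_step(partials)
print(combined)
```

Because the map step touches every report, Global Search sees the whole corpus (through its summaries) rather than k chunks.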

If you only need Local Search, you arguably do not need GraphRAG — entity-anchored hybrid retrieval gets you most of the way there. Global Search is the unique capability, and it is also the one that justifies the indexing cost.

A minimal run

pip install graphrag

Initialize a workspace:

python -m graphrag.index --init --root ./ragtest

That scaffolds a settings.yaml. The fields you will edit first:

llm:
  type: openai_chat
  model: gpt-4o-mini
  api_key: ${GRAPHRAG_API_KEY}

embeddings:
  llm:
    type: openai_embedding
    model: text-embedding-3-small

chunks:
  size: 1200
  overlap: 100

community_reports:
  max_length: 2000
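The `chunks` settings describe a sliding window over tokens. Here is a minimal sketch of what `size: 1200` and `overlap: 100` mean, assuming the text is already tokenized. The integers are placeholder token IDs, and `chunk_tokens` is a hypothetical helper, not graphrag's own chunker, whose boundary handling may differ.

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split tokens into windows of `size`, each repeating the last
    `overlap` tokens of the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

tokens = list(range(3000))  # placeholder token IDs
chunks = chunk_tokens(tokens)
print(len(chunks))  # each new chunk starts 1100 tokens after the previous one
```

Every chunk is at least one extraction call at index time, so smaller chunks mean more LLM calls.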

Drop your text files in ./ragtest/input/, then run the index:

python -m graphrag.index --root ./ragtest

Issue a global query:

python -m graphrag.query \
  --root ./ragtest \
  --method global \
  "What are the main themes across these documents?"

The first run on a small corpus is illuminating — you can watch the token meter while the LLM summarizes communities.

Warning: Run this on a 5MB corpus before you point it at a 5GB one. Indexing cost is driven by LLM work (at least one extraction call per chunk, plus a report for every community), so it grows with the size of the corpus, and that work is not cheap.

The cost wall

This is the part most blog posts skip. Microsoft Research's own LazyGraphRAG benchmark on AP News measured the original GraphRAG indexing cost at ~$1,544 per million tokens versus ~$1.45 per million tokens for vector RAG. That is roughly 1000x.
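A quick back-of-the-envelope check of those figures, assuming both rates are dollars per million tokens of source text, and assuming roughly 4 characters per token for English (a common rule of thumb, not a number from the benchmark):

```python
# Indexing rates quoted above, assumed to be dollars per million source tokens.
GRAPHRAG_PER_M = 1544.0
VECTOR_PER_M = 1.45

def indexing_cost(corpus_tokens, rate_per_million):
    """Dollar cost to index `corpus_tokens` at a per-million-token rate."""
    return corpus_tokens / 1_000_000 * rate_per_million

ratio = GRAPHRAG_PER_M / VECTOR_PER_M  # about 1065, i.e. "roughly 1000x"

# Assumption: ~4 characters per token.
CHARS_PER_TOKEN = 4
for label, chars in [("5MB", 5_000_000), ("5GB", 5_000_000_000)]:
    tokens = chars / CHARS_PER_TOKEN
    print(f"{label}: GraphRAG ${indexing_cost(tokens, GRAPHRAG_PER_M):,.0f} "
          f"vs vector RAG ${indexing_cost(tokens, VECTOR_PER_M):,.2f}")
```

Under those assumptions a 5MB corpus lands around $1,930 for GraphRAG, and a 5GB corpus around $1.9 million, which is why the small trial run matters.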

The same post introduced LazyGraphRAG, which defers graph construction to query time, using cheaper NLP for entity extraction plus on-the-fly LLM ranking. On the same benchmark, LazyGraphRAG matched or beat GraphRAG's answer quality at ~0.1% of the indexing cost. At its highest query budget, it outperformed GraphRAG Global Search by 16.96% on comprehensiveness and 25.7% on diversity win rates for local queries.
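To show what "cheaper NLP" extraction can look like, here is a crude sketch that builds a concept co-occurrence graph with no LLM calls at all. It is a toy stand-in, not LazyGraphRAG's implementation: a regex for capitalized words replaces real noun-phrase extraction, and the chunks are hypothetical. The query-time LLM ranking is not shown.

```python
import re
from collections import Counter
from itertools import combinations

def concepts(chunk):
    """Crude stand-in for NLP noun-phrase extraction: capitalized words."""
    return sorted(set(re.findall(r"\b[A-Z][a-zA-Z0-9]+\b", chunk)))

def cooccurrence_graph(chunks):
    """Edge weight = number of chunks in which two concepts appear together.
    No LLM calls, so this costs about as much as tokenizing the corpus."""
    edges = Counter()
    for chunk in chunks:
        for a, b in combinations(concepts(chunk), 2):
            edges[(a, b)] += 1
    return edges

chunks = [
    "Redis timeouts broke AuthService logins.",
    "AuthService retries amplified Redis load.",
    "Postgres vacuum stalled BillingJob.",
]
edges = cooccurrence_graph(chunks)
print(edges.most_common(1))
```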

The authors of that LazyGraphRAG post, Darren Edge and Ha Trinh, are also among the authors of the original GraphRAG paper. In other words, the team behind GraphRAG is itself saying the upfront graph is overkill for most workloads.

The cheaper default: hybrid + rerank

When the corpus is not heavily reused, the practical pattern is:

Hybrid retrieval. BM25 for lexical recall + dense embeddings for semantic recall, union the candidates.
LLM reranker. Pass the top ~50 candidates to a small cheap LLM with a relevance prompt, keep the top 5–10.
Generate. Feed those into the answer LLM.
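Here is a stdlib-only sketch of the first two steps under stated assumptions. BM25 is the standard Okapi formula; `embed` is a toy letter-frequency vector standing in for a real embedding model; and `overlap_judge` is a hypothetical stand-in for the small LLM's relevance prompt. In production you would swap both stand-ins for real model calls.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that drops basic punctuation."""
    return text.lower().replace(".", " ").replace(",", " ").replace("?", " ").split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every doc for the query (lexical recall)."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def embed(text):
    """Toy stand-in for a dense embedding model: letter-frequency vector."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [counts[chr(c)] for c in range(ord("a"), ord("z") + 1)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_candidates(query, docs, per_retriever=25, max_candidates=50):
    """Union of top BM25 hits and top dense hits, interleaved, deduplicated."""
    lexical = bm25_scores(query, docs)
    qv = embed(query)
    dense = [cosine(qv, embed(d)) for d in docs]
    ids = range(len(docs))
    top_lex = sorted(ids, key=lambda i: lexical[i], reverse=True)[:per_retriever]
    top_dense = sorted(ids, key=lambda i: dense[i], reverse=True)[:per_retriever]
    merged = []
    for pair in zip(top_lex, top_dense):
        for i in pair:
            if i not in merged:
                merged.append(i)
    return merged[:max_candidates]

def overlap_judge(query, doc):
    """Hypothetical stand-in for the LLM relevance prompt: shared-term count."""
    return len(set(tokenize(query)) & set(tokenize(doc)))

def rerank(query, docs, candidate_ids, keep=5, judge=overlap_judge):
    """Score each candidate with the judge and keep the best `keep`."""
    ranked = sorted(candidate_ids, key=lambda i: judge(query, docs[i]), reverse=True)
    return ranked[:keep]

docs = [
    "Redis connection pool exhausted during login spike.",
    "Quarterly planning notes for the design team.",
    "Postgres failover took twelve minutes.",
    "Errors traced to Redis timeouts.",
]
query = "redis login incidents"
candidates = hybrid_candidates(query, docs)
context = rerank(query, docs, candidates, keep=2)
print(context)
```

Interleaving the two ranked lists before capping at ~50 keeps either retriever from crowding out the other.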

This recovers most of the gain GraphRAG offers for entity-anchored queries, with no indexing-time graph build. The tradeoff is that you pay more per query, because every question runs the rerank step. For corpora that are queried rarely, that trade is the right one. For corpora that are queried thousands of times a day on the same content, GraphRAG amortizes better.
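The amortization argument reduces to a break-even count. A minimal sketch, treating the hybrid pipeline's index as roughly free and taking per-query costs as inputs you would measure yourself (the numbers below are hypothetical). Costs are integer cents to keep the division exact.

```python
import math

def break_even_queries(graph_index_cents, graph_query_cents, hybrid_query_cents):
    """Smallest query count at which GraphRAG's total cost is no higher than hybrid's.

    GraphRAG total: index + Q * graph_query
    Hybrid total:   Q * hybrid_query   (its BM25/vector index treated as ~free)
    """
    saving_per_query = hybrid_query_cents - graph_query_cents
    if saving_per_query <= 0:
        return math.inf  # GraphRAG never pays for itself on cost alone
    return math.ceil(graph_index_cents / saving_per_query)

# Hypothetical: $1,500 index, 1 cent per GraphRAG query, 4 cents per hybrid+rerank query.
print(break_even_queries(150_000, 1, 4))
```

If your GraphRAG queries (for example, Global Search map-reduce over many reports) cost as much as the rerank step, the function returns infinity, and the case for the graph has to rest on provenance or thematic coverage instead.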

When the graph still wins

Three signals say "build the graph upfront":

Reuse. The same corpus is queried heavily — knowledge bases, support docs, contract repositories — so the indexing cost amortizes over thousands of queries.
Provenance. Regulated domains where every answer needs a citation trail back to source documents. The graph's edge-level source tracking is the cleanest way to deliver that.
Repeatable thematic queries. Same kinds of "what are the patterns across X" questions, over and over. Community reports are precisely the precomputation that makes those cheap at query time.

If your workload misses all three, LazyGraphRAG or hybrid+rerank is almost certainly the right default.

A three-question decision

Before you pip install graphrag in production:

Will you reissue similar queries against the same corpus more than ~1000 times? If no, defer the graph.
Do answers need an auditable citation trail? If no, defer the graph.
Are your hardest queries thematic ("what are the main X across the whole corpus")? If no, hybrid retrieval is likely enough.

Three nos means hybrid retrieval plus an LLM reranker. Three yeses means GraphRAG earns its index cost. Mixed answers mean LazyGraphRAG is probably the right middle.
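The rule is small enough to write down directly. A minimal sketch; the function and argument names are illustrative, one boolean per question above.

```python
def choose_retrieval(heavy_reuse: bool, needs_citations: bool, thematic_queries: bool) -> str:
    """Three yeses -> GraphRAG, three nos -> hybrid + reranker, mixed -> LazyGraphRAG."""
    yeses = sum([heavy_reuse, needs_citations, thematic_queries])
    if yeses == 3:
        return "GraphRAG"
    if yeses == 0:
        return "hybrid retrieval + LLM reranker"
    return "LazyGraphRAG"

print(choose_retrieval(heavy_reuse=True, needs_citations=False, thematic_queries=True))
```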

Wrapping up

GraphRAG is real engineering, not hype. It solves a problem flat vector RAG genuinely cannot solve. But the cost profile is severe enough that Microsoft itself shipped a variant with roughly 0.1% of the indexing cost later the same year. Treat the choice as an economics question, not a capability question: does your query-to-index ratio amortize an indexing bill of roughly $1,500 per million tokens, and do your answers need the provenance the graph gives you?

If you want to go deeper, the LazyGraphRAG announcement on the Microsoft Research blog has the full benchmark numbers, and the microsoft/graphrag repo has reference settings for several backends. Both are worth reading before you commit to a path.

What query-to-index ratio made GraphRAG worth it in your stack? Or did you end up landing on hybrid retrieval instead?
