DEV Community

Nitheesh gaddam

How GraphRAG Works

GraphRAG has two main phases: Indexing (preprocessing the dataset) and Querying (answering questions).

  1. Indexing Phase (Offline, Expensive but Done Once)

Text Chunking — Split the input text into manageable chunks.
Entity Extraction — Use an LLM to identify entities (people, places, organizations, concepts) and relationships from each chunk.
Build Knowledge Graph — Create a graph where nodes are entities and edges are relationships (with descriptions).
Community Detection — Apply graph algorithms (e.g., Leiden algorithm) to identify clusters of closely related entities (communities).
Hierarchical Summarization — Generate summaries for each community at multiple levels (bottom-up hierarchy: detailed low-level communities → higher-level aggregated summaries).
The result is a structured index: the graph + pre-generated community summaries.

This captures implicit connections across the entire dataset that vector embeddings alone miss.
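The graph-building and community-detection steps above can be sketched in plain Python. The triples below are toy stand-ins for LLM-extracted output, and connected components stand in for the Leiden algorithm (which additionally finds finer-grained, hierarchical clusters inside connected regions):

```python
from collections import defaultdict

# Toy (entity, relation, entity) triples standing in for LLM extraction;
# a real pipeline would prompt an LLM over each text chunk.
triples = [
    ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Alan Turing", "proposed", "Turing Machine"),
]

def build_graph(triples):
    """Nodes are entities; each edge carries a relationship description."""
    adj = defaultdict(set)
    edge_labels = {}
    for subj, rel, obj in triples:
        adj[subj].add(obj)
        adj[obj].add(subj)
        edge_labels[frozenset((subj, obj))] = rel
    return adj, edge_labels

def communities(adj):
    """Connected components: a simplified stand-in for Leiden clustering."""
    seen, result = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

adj, edge_labels = build_graph(triples)
comps = communities(adj)
# Two clusters emerge: the Lovelace/Babbage group and the Turing group.
```

In the full pipeline, each community (at each hierarchy level) would then be passed to the LLM to produce the pre-generated summaries that power global queries.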

  2. Querying Phase (Online, Fast)

Local Queries (specific details): Retrieve relevant subgraphs or text chunks near mentioned entities.
Global Queries (broad understanding):
  1. Select relevant community summaries (based on similarity to the query).
  2. Use the LLM to generate a partial answer from each selected summary.
  3. Aggregate and summarize the partial answers into a final coherent response.

This "map-reduce" style over communities enables holistic reasoning.
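The global-query map-reduce can be sketched as follows. Here `relevance` is a crude keyword-overlap stand-in for embedding similarity, and `call_llm` is a hypothetical placeholder for a real LLM call:

```python
# Toy pre-generated community summaries from the indexing phase.
summaries = {
    "c0": "Early computing pioneers and their machine designs.",
    "c1": "Modern GPU hardware and deep learning frameworks.",
}

def relevance(query, summary):
    """Keyword overlap as a crude stand-in for embedding similarity."""
    q, s = set(query.lower().split()), set(summary.lower().split())
    return len(q & s)

def call_llm(prompt):
    """Hypothetical placeholder; a real system would call an LLM here."""
    return f"[answer based on: {prompt}]"

def global_query(query, summaries, top_k=1):
    # Select: rank community summaries by relevance to the query.
    ranked = sorted(summaries.values(),
                    key=lambda s: relevance(query, s), reverse=True)
    # Map: one partial answer per selected community summary.
    partials = [call_llm(f"{query}\n{s}") for s in ranked[:top_k]]
    # Reduce: aggregate partials into one final response.
    return call_llm("Combine these partial answers: " + " | ".join(partials))

result = global_query("Who were the early computing pioneers?", summaries)
```

Because each map step only sees one community summary, the approach scales to corpora far larger than any single context window.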
Why It's Better Than Standard RAG

Comprehensiveness: Captures broader themes and connections → answers are more complete.
Diversity: Reduces repetition and surfaces varied perspectives.
Empowerment: Provides grounded, evidence-based insights for complex datasets (e.g., conflicting news sources).
Experiments in the paper (on datasets of roughly 1 million tokens each) show GraphRAG winning around 70-80% of head-to-head LLM-judged comparisons against baseline vector RAG on metrics like comprehensiveness and diversity for global questions.

Practical Details

Open-source implementation: Available on GitHub (microsoft/graphrag).
Costs: Indexing is LLM-intensive (many calls for extraction and summarization), but querying is efficient.
Later improvements (post-paper) include LazyGraphRAG (far more cost-efficient indexing), DRIFT search (blending local and global search), dynamic community selection, and auto-tuning of extraction prompts for new domains.
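A rough quick-start sketch for the open-source implementation is below. The exact commands and flags have changed across graphrag releases, so verify against the docs for your installed version:

```shell
# Hypothetical quick-start; check microsoft/graphrag docs for your release.
pip install graphrag

# Initialize a workspace (generates settings.yaml and prompt templates),
# then place input documents under ./ragtest/input.
graphrag init --root ./ragtest

# Indexing: the expensive, LLM-heavy step (extraction + summarization).
graphrag index --root ./ragtest

# Querying: global (map-reduce over community summaries) or local search.
graphrag query --root ./ragtest --method global --query "What are the main themes?"
```

Model endpoints and API keys are configured in the generated settings file before indexing.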

In summary, GraphRAG represents a major advancement in making LLMs reason over large, private, narrative-rich datasets by leveraging graph structures for "global sensemaking." It's particularly useful when standard RAG gives incomplete or superficial answers.
