GraphRAG has two main phases: Indexing (preprocessing the dataset) and Querying (answering questions).
- Indexing Phase (Offline, Expensive but Done Once)
1. Text Chunking — Split the input text into manageable chunks.
2. Entity Extraction — Use an LLM to identify entities (people, places, organizations, concepts) and relationships from each chunk.
3. Build Knowledge Graph — Create a graph where nodes are entities and edges are relationships (with descriptions).
4. Community Detection — Apply graph algorithms (e.g., the Leiden algorithm) to identify clusters of closely related entities (communities).
5. Hierarchical Summarization — Generate summaries for each community at multiple levels (bottom-up hierarchy: detailed low-level communities → higher-level aggregated summaries).
The result is a structured index: the graph + pre-generated community summaries.
This captures implicit connections across the entire dataset that vector embeddings alone miss.
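The five indexing steps can be sketched in miniature. This is an illustrative toy, not the microsoft/graphrag API: a capitalized-word heuristic stands in for LLM entity extraction, and connected components stand in for Leiden clustering.

```python
from collections import defaultdict

def chunk_text(text, size=200):
    # Step 1: fixed-size character chunks (real systems chunk by tokens).
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_triples(chunk):
    # Step 2 stand-in: a real pipeline prompts an LLM for
    # (entity, relation, entity) triples; here we just pair up
    # consecutive capitalized words.
    entities = [w.strip(".,;") for w in chunk.split() if w[:1].isupper()]
    return [(a, "related_to", b) for a, b in zip(entities, entities[1:])]

def build_graph(triples):
    # Step 3: nodes are entities, edges carry relationship descriptions.
    graph = defaultdict(dict)
    for a, rel, b in triples:
        graph[a][b] = rel
        graph[b][a] = rel
    return graph

def detect_communities(graph):
    # Step 4 stand-in: connected components instead of Leiden clustering.
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node])
        seen |= members
        communities.append(members)
    return communities

def summarize_communities(communities):
    # Step 5 stand-in: a real pipeline asks an LLM to summarize each
    # community's entities and relationships, bottom-up by level.
    return [", ".join(sorted(c)) for c in communities]
```

Chaining these on a corpus produces exactly the two artifacts the query phase reads: the graph and the per-community summaries.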
- Querying Phase
Local Queries (specific details): Retrieve relevant subgraphs or text chunks near mentioned entities.
Global Queries (broad understanding):
Select relevant community summaries (based on similarity to the query).
Use the LLM to generate partial answers from each summary.
Aggregate and summarize the partial answers into a final coherent response.
This "map-reduce" style over communities enables holistic reasoning.
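Both query modes can be sketched end-to-end. In this sketch, `llm` is a stub callable and a word-overlap scorer is a crude stand-in for embedding similarity; the function names and heuristics are illustrative, not the shipped graphrag API.

```python
def local_query(entity, graph, chunks_by_entity):
    # Local search: gather the entity's immediate graph neighborhood plus
    # the source chunks that mention it, as context for the LLM.
    return {
        "entity": entity,
        "neighbors": dict(graph.get(entity, {})),
        "chunks": chunks_by_entity.get(entity, []),
    }

def overlap(query, text):
    # Crude relevance score: count shared lowercase words
    # (a real system would use embedding similarity).
    return len(set(query.lower().split()) & set(text.lower().split()))

def global_query(question, community_summaries, llm, top_k=3):
    # 1. Select the community summaries most relevant to the question.
    ranked = sorted(community_summaries,
                    key=lambda s: overlap(question, s), reverse=True)
    # 2. Map: produce a partial answer from each selected summary.
    partials = [llm(f"Answer '{question}' using only:\n{s}")
                for s in ranked[:top_k]]
    # 3. Reduce: fold the partial answers into one final response.
    return llm(f"Combine into one answer to '{question}':\n"
               + "\n".join(partials))
```

The map step is embarrassingly parallel, which is what keeps global queries tractable even when many communities are selected.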
- Why It's Better Than Standard RAG
Comprehensiveness: Captures broader themes and connections → answers are more complete.
Diversity: Reduces repetition and surfaces varied perspectives.
Empowerment: Provides grounded, evidence-based insights for complex datasets (e.g., conflicting news sources).
Experiments in the paper (on datasets of roughly 1 million tokens each) show GraphRAG winning head-to-head comparisons against vector-based baseline RAG about 70-80% of the time on metrics like comprehensiveness and diversity for global sensemaking questions.
- Practical Details
Open-source implementation: Available on GitHub (microsoft/graphrag).
Costs: Indexing is LLM-intensive (many calls for extraction and summarization), but querying is efficient.
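The indexing/querying cost asymmetry can be made concrete with a back-of-envelope call count. Every parameter and number below is hypothetical, chosen only to illustrate the shape of the cost, not taken from the paper or the graphrag defaults.

```python
def estimate_index_calls(num_tokens, chunk_tokens=600, gleanings=1,
                         num_communities=50):
    """Rough count of LLM calls needed to index a corpus.

    Assumptions (all hypothetical): each chunk gets one extraction call
    plus `gleanings` follow-up passes, and each detected community gets
    one summarization call.
    """
    num_chunks = -(-num_tokens // chunk_tokens)  # ceiling division
    extraction_calls = num_chunks * (1 + gleanings)
    return extraction_calls + num_communities
```

Under these assumptions a 1M-token corpus needs a few thousand indexing calls, while each global query needs only one call per selected community summary plus one reduce call, and a local query often needs just one.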
Later improvements (post-paper) include LazyGraphRAG (deferring most LLM work to query time for far cheaper indexing), DRIFT search (blending local and global search), dynamic community selection, and auto-tuning of prompts for new domains.
In summary, GraphRAG represents a major advancement in making LLMs reason over large, private, narrative-rich datasets by leveraging graph structures for "global sensemaking." It's particularly useful when standard RAG gives incomplete or superficial answers.