Retrieval-Augmented Generation (RAG) has evolved far beyond simple vector similarity search. Modern RAG systems need to handle massive documents, understand complex entity relationships, and judge the quality of their own retrievals. LongRAG, Self-RAG, and GraphRAG represent the cutting edge of these capabilities.
The Evolution Beyond Basic RAG
Traditional RAG systems follow a simple pattern: chunk documents, embed them into vectors, retrieve similar chunks via cosine similarity, and feed them to an LLM. While effective for many use cases, this approach struggles with three critical scenarios:
- Long-range dependencies: Important context might span thousands of tokens across multiple chunks
- Retrieval confidence: The system has no way to assess whether retrieved content is actually relevant
- Relationship complexity: Vector similarity cannot capture intricate connections between entities
Advanced RAG variants address these limitations with specialized architectures tailored to specific challenges.
LongRAG: Conquering Extended Context
Architecture Overview
LongRAG fundamentally rethinks the chunking strategy by leveraging LLMs with extended context windows (32K, 100K, or even 1M tokens). Instead of breaking documents into small 512-token chunks, LongRAG uses a hierarchical approach:
Document-Level Embedding: The entire document (or very large sections) is processed as a single unit. A document-level embedding captures the overall semantic meaning, while maintaining the full text for downstream processing.
Minimal Fragmentation: When chunking is necessary, LongRAG uses much larger chunks (4K-8K tokens) with significant overlap (20-30%). This preserves narrative flow and reduces context fragmentation.
Context Assembly: At retrieval time, LongRAG returns complete documents or large coherent sections rather than scattered fragments. The LLM receives continuous context that preserves structural and semantic relationships.
Implementation Strategy
Here's a conceptual implementation using Python and modern embedding models:
```python
from typing import List, Dict
import numpy as np


class LongRAGRetriever:
    def __init__(self, model, chunk_size=8000, overlap=1600):
        self.model = model
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.doc_embeddings = []
        self.documents = []

    def create_long_chunks(self, text: str) -> List[str]:
        """Create overlapping large chunks"""
        chunks = []
        start = 0
        while start < len(text):
            end = start + self.chunk_size
            chunk = text[start:end]
            chunks.append(chunk)
            start += (self.chunk_size - self.overlap)
        return chunks

    def index_document(self, doc: str, metadata: Dict):
        """Index document with hierarchical embedding"""
        # Embed entire document
        doc_embedding = self.model.embed(doc)

        # Create large chunks with overlap
        chunks = self.create_long_chunks(doc)
        chunk_embeddings = [self.model.embed(c) for c in chunks]

        self.doc_embeddings.append({
            'doc_id': len(self.documents),
            'doc_embedding': doc_embedding,
            'chunk_embeddings': chunk_embeddings,
            'chunks': chunks,
            'full_text': doc,
            'metadata': metadata
        })
        self.documents.append(doc)

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Retrieve relevant long-form content"""
        query_embedding = self.model.embed(query)

        # Score at document level first
        doc_scores = [
            np.dot(query_embedding, doc['doc_embedding'])
            for doc in self.doc_embeddings
        ]

        # Get top documents
        top_doc_indices = np.argsort(doc_scores)[-top_k:][::-1]

        results = []
        for idx in top_doc_indices:
            doc_data = self.doc_embeddings[idx]

            # For each document, find best chunks
            chunk_scores = [
                np.dot(query_embedding, emb)
                for emb in doc_data['chunk_embeddings']
            ]
            best_chunk_idx = np.argmax(chunk_scores)

            # Return extended context around best chunk
            context_chunks = self._get_extended_context(
                doc_data['chunks'],
                best_chunk_idx
            )

            results.append({
                'text': ''.join(context_chunks),
                'score': doc_scores[idx],
                'metadata': doc_data['metadata']
            })

        return results

    def _get_extended_context(self, chunks: List[str],
                              center_idx: int) -> List[str]:
        """Get extended context around relevant chunk"""
        start = max(0, center_idx - 1)
        end = min(len(chunks), center_idx + 2)
        return chunks[start:end]
```
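To show how the retriever above might be wired up, here's a hedged usage sketch. It assumes a small wrapper exposing the `embed()` interface the class expects, built on sentence-transformers with normalized vectors so the dot products behave like cosine similarity; the file path and query are placeholders, and a real LongRAG deployment would use a long-context embedding model rather than MiniLM.

```python
from sentence_transformers import SentenceTransformer
import numpy as np


class STEmbedder:
    """Hypothetical wrapper exposing the embed() interface assumed above."""
    def __init__(self, name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(name)

    def embed(self, text: str) -> np.ndarray:
        # Normalize so dot products act as cosine similarity.
        # Note: MiniLM truncates long inputs; use a long-context embedding
        # model for document-level embeddings in practice.
        return self.model.encode(text, normalize_embeddings=True)


retriever = LongRAGRetriever(model=STEmbedder(), chunk_size=8000, overlap=1600)
retriever.index_document(open("report.txt").read(),
                         metadata={"source": "report.txt"})  # placeholder path

for hit in retriever.retrieve("What were the key findings?", top_k=2):
    print(round(float(hit["score"]), 3), hit["metadata"]["source"])
```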
Use Cases and Performance
LongRAG excels in scenarios where context matters:
- Legal document analysis: Contracts and legal briefs often have dependencies spanning dozens of pages
- Research paper retrieval: Understanding methodology requires coherent sections, not isolated paragraphs
- Code repositories: Functions and classes must be understood within their module context
Performance characteristics:
- Latency: Higher due to processing large chunks (2-5x slower than standard RAG)
- Accuracy: 15-25% improvement on long-form QA benchmarks
- Memory: Requires 3-4x more memory for context windows
Self-RAG: Reflective Retrieval
Core Principles
Self-RAG introduces a metacognitive layer to RAG systems. Instead of blindly retrieving and generating, the system actively reflects on its own processes through special reflection tokens:
Retrieve Token: Decides whether retrieval is necessary for a given query
Relevance Token: Evaluates if retrieved documents are actually relevant
Support Token: Checks if the generated answer is supported by the retrieved content
Critique Token: Assesses the overall quality of the generated response
Architecture Components
The Self-RAG architecture consists of three interleaved phases:
```python
class SelfRAGSystem:
    def __init__(self, retriever, generator, critic):
        self.retriever = retriever
        self.generator = generator
        self.critic = critic

    def generate_with_reflection(self, query: str,
                                 max_iterations: int = 3):
        """Generate answer with self-reflection"""
        # Phase 1: Decide if retrieval needed
        retrieve_decision = self.critic.should_retrieve(query)

        if not retrieve_decision:
            # Direct generation without retrieval
            return self.generator.generate(query)

        # Phase 2: Retrieve and evaluate relevance
        retrieved_docs = self.retriever.retrieve(query)
        relevant_docs = []

        for doc in retrieved_docs:
            relevance_score = self.critic.assess_relevance(
                query, doc
            )
            if relevance_score > 0.7:  # Threshold
                relevant_docs.append(doc)

        if not relevant_docs:
            # Fallback to generation without retrieval
            return self.generator.generate(query)

        # Phase 3: Generate and verify support
        best_answer = None
        best_score = -1

        for _ in range(max_iterations):
            # Generate candidate answer
            answer = self.generator.generate(
                query, context=relevant_docs
            )

            # Evaluate support and quality
            support_score = self.critic.check_support(
                answer, relevant_docs
            )
            quality_score = self.critic.assess_quality(answer)

            total_score = 0.6 * support_score + 0.4 * quality_score

            if total_score > best_score:
                best_score = total_score
                best_answer = answer

            # Early stopping if high quality achieved
            if total_score > 0.9:
                break

        return {
            'answer': best_answer,
            'confidence': best_score,
            'sources': relevant_docs,
            'reflections': {
                'retrieved': retrieve_decision,
                'relevance': len(relevant_docs),
                'support': support_score
            }
        }
```
Training the Reflection Mechanisms
Self-RAG requires training the critic component to make reliable assessments. This typically involves:
- Supervised fine-tuning on datasets annotated with relevance judgments
- Reinforcement learning with rewards for accurate predictions
- Contrastive learning to distinguish supported vs. unsupported claims
The reflection tokens can be implemented as:
- Special tokens in the vocabulary (like [RETRIEVE], [RELEVANT])
- Separate classifier heads on the model
- External critic models (ensemble approach)
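As an illustration of the first option, here's a minimal sketch of registering reflection tokens with a Hugging Face tokenizer and causal LM. The token names follow the four categories listed earlier rather than any specific published vocabulary, and `gpt2` is only a small stand-in for whatever generator you actually fine-tune.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; replace with your base generator
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the reflection tokens as special tokens (illustrative token set)
reflection_tokens = ["[RETRIEVE]", "[RELEVANT]", "[SUPPORT]", "[CRITIQUE]"]
tokenizer.add_special_tokens({"additional_special_tokens": reflection_tokens})

# Grow the embedding matrix so the new tokens get trainable embeddings
model.resize_token_embeddings(len(tokenizer))
```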
Production Considerations
When deploying Self-RAG in production systems:
Latency Trade-offs: Each reflection step adds 20-40% inference overhead. Balance thoroughness with response time requirements.
Confidence Thresholds: Tune reflection thresholds based on your use case. Legal or medical applications need higher confidence than general chatbots.
Monitoring: Track reflection decisions to identify patterns. If retrieval is rarely needed, you might benefit from a simpler architecture.
GraphRAG: Knowledge Graph-Enhanced Retrieval
Conceptual Foundation
GraphRAG transforms the retrieval problem from vector similarity to graph traversal. Instead of finding semantically similar text chunks, GraphRAG identifies relevant subgraphs of connected entities and relationships.
Entity Extraction: Identify named entities, concepts, and their types
Relationship Mapping: Extract relationships between entities (temporal, causal, hierarchical)
Graph Construction: Build a knowledge graph with entities as nodes and relationships as edges
Subgraph Retrieval: Given a query, find relevant connected subgraphs
Graph Construction Pipeline
Building a knowledge graph from unstructured text involves several stages:
```python
class GraphRAGBuilder:
    def __init__(self, entity_extractor, relation_extractor):
        self.entity_extractor = entity_extractor
        self.relation_extractor = relation_extractor
        self.graph = NetworkGraph()

    def build_graph(self, documents: List[str]):
        """Build knowledge graph from documents"""
        for doc in documents:
            # Extract entities
            entities = self.entity_extractor.extract(doc)

            # Add entities as nodes
            for entity in entities:
                self.graph.add_node(
                    entity['text'],
                    entity_type=entity['type'],
                    context=entity['surrounding_text']
                )

            # Extract relationships
            relations = self.relation_extractor.extract(
                doc, entities
            )

            # Add relationships as edges
            for rel in relations:
                self.graph.add_edge(
                    rel['source'],
                    rel['target'],
                    relation_type=rel['type'],
                    confidence=rel['score'],
                    evidence=rel['text_span']
                )

    def enrich_graph(self):
        """Add derived relationships and metadata"""
        # Compute node importance (PageRank, etc.)
        self.graph.compute_centrality()

        # Identify communities/clusters
        self.graph.detect_communities()

        # Add temporal ordering if timestamps available
        self.graph.add_temporal_edges()
```
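The `NetworkGraph` class above is a conceptual placeholder. A minimal sketch of the same enrichment step on a plain `networkx` graph (assuming nodes and edges have already been added) might look like this; `greedy_modularity_communities` stands in for the Leiden/Louvain algorithms mentioned later.

```python
import networkx as nx


def enrich(graph: nx.Graph) -> nx.Graph:
    # Node importance via PageRank, stored as a node attribute
    pagerank = nx.pagerank(graph)
    nx.set_node_attributes(graph, pagerank, "importance")

    # Community detection via greedy modularity maximization
    communities = nx.algorithms.community.greedy_modularity_communities(graph)
    for community_id, members in enumerate(communities):
        for node in members:
            graph.nodes[node]["community"] = community_id

    return graph
```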
Query Processing with Graphs
GraphRAG queries involve multi-hop reasoning across the knowledge graph:
```python
class GraphRAGRetriever:
    def __init__(self, graph, embedder, entity_extractor):
        self.graph = graph
        self.embedder = embedder
        self.entity_extractor = entity_extractor

    def retrieve_subgraph(self, query: str,
                          max_hops: int = 2,
                          max_nodes: int = 50):
        """Retrieve relevant subgraph for query"""
        # Identify seed entities in query
        query_entities = self.entity_extractor.extract(query)

        # Find matching nodes in graph
        seed_nodes = []
        for entity in query_entities:
            matches = self.graph.find_similar_nodes(
                entity['text'],
                similarity_threshold=0.85
            )
            seed_nodes.extend(matches)

        # Expand subgraph via traversal
        subgraph = self.graph.create_subgraph()
        visited = set()

        for seed in seed_nodes:
            self._expand_from_node(
                seed,
                subgraph,
                visited,
                current_hop=0,
                max_hops=max_hops
            )

        # Rank nodes by relevance (ranking helper omitted for brevity)
        ranked_nodes = self._rank_subgraph_nodes(
            subgraph, query
        )

        # Extract and format context
        context = self._format_graph_context(
            ranked_nodes[:max_nodes],
            subgraph
        )

        return context

    def _expand_from_node(self, node, subgraph, visited,
                          current_hop, max_hops):
        """Recursively expand subgraph"""
        if current_hop >= max_hops or node in visited:
            return

        visited.add(node)
        subgraph.add_node(node)

        # Get neighbors
        neighbors = self.graph.get_neighbors(node)

        for neighbor, edge_data in neighbors:
            # Add edge to subgraph
            subgraph.add_edge(node, neighbor, edge_data)

            # Recursively expand
            self._expand_from_node(
                neighbor,
                subgraph,
                visited,
                current_hop + 1,
                max_hops
            )

    def _format_graph_context(self, nodes, subgraph):
        """Convert subgraph to textual context"""
        context_parts = []

        for node in nodes:
            # Add node context
            context_parts.append(f"Entity: {node.text}")
            context_parts.append(f"Type: {node.entity_type}")

            # Add relationship information
            edges = subgraph.get_edges(node)
            for edge in edges:
                context_parts.append(
                    f"- {edge.relation_type} -> {edge.target.text}"
                )

        return "\n".join(context_parts)
```
Microsoft's GraphRAG Implementation
Microsoft's GraphRAG takes a unique approach by generating community summaries:
- Build initial graph from documents using LLM-based entity/relation extraction
- Detect communities using Leiden algorithm or similar
- Generate summaries for each community using LLMs
- Hierarchical structure: Build multiple levels of community abstractions
- Query time: Retrieve relevant communities and traverse to specific entities
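To make the community-summary idea concrete, here is a rough conceptual sketch, not Microsoft's actual implementation: it assumes a networkx-style graph whose nodes carry a `context` attribute, plus hypothetical `summarize_with_llm` and `embed` helpers.

```python
import numpy as np


def build_community_summaries(graph, communities, summarize_with_llm):
    """One LLM-written summary per community (summarize_with_llm is hypothetical)."""
    summaries = {}
    for community_id, members in enumerate(communities):
        # Gather the stored context snippets of the community's entities
        evidence = "\n".join(graph.nodes[n].get("context", "") for n in members)
        summaries[community_id] = summarize_with_llm(
            "Summarize the key entities, relationships and themes:\n" + evidence
        )
    return summaries


def answer_global_query(query, summaries, embed, summarize_with_llm, top_k=3):
    """Route a corpus-level query to the most relevant community summaries."""
    query_vec = embed(query)
    scored = sorted(
        summaries.items(),
        key=lambda item: float(np.dot(query_vec, embed(item[1]))),
        reverse=True,
    )
    context = "\n\n".join(text for _, text in scored[:top_k])
    return summarize_with_llm(
        f"Answer using this context:\n{context}\n\nQuestion: {query}"
    )
```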
This approach is particularly effective for:
- Exploratory queries ("What are the main themes in this corpus?")
- Multi-hop reasoning ("How is A connected to C through B?")
- Temporal analysis ("How did this entity's relationships evolve?")
Comparative Analysis
When to Use Each Variant
Use LongRAG when:
- Documents have strong internal coherence
- Context windows of your LLM support large inputs (32K+)
- Query answers require understanding long-range dependencies
- You're working with structured documents (reports, papers, books)
Use Self-RAG when:
- Accuracy and trustworthiness are critical
- You need explainable retrieval decisions
- False positives from irrelevant retrieval are costly
- Query complexity varies widely (some need retrieval, others don't)
Use GraphRAG when:
- Your domain has rich entity relationships
- Queries involve multi-hop reasoning
- Temporal or hierarchical relationships matter
- You need to understand connections between entities
Performance Metrics Comparison
| Metric | Standard RAG | LongRAG | Self-RAG | GraphRAG |
|---|---|---|---|---|
| Indexing Time | 1x | 0.8x | 1.1x | 3-5x |
| Query Latency | 1x | 2-3x | 1.4x | 1.5-2x |
| Memory Usage | 1x | 3-4x | 1.2x | 2-3x |
| Accuracy (QA) | baseline | +15-25% | +20-30% | +25-40%* |
| Interpretability | Low | Medium | High | High |
*GraphRAG improvements highly domain-dependent
Hybrid Approaches
The most powerful production systems often combine multiple techniques:
LongRAG + GraphRAG: Use graph structure to identify relevant document clusters, then retrieve full documents rather than fragments
Self-RAG + GraphRAG: Apply reflection mechanisms to graph traversal decisions (which paths to follow, when to stop expansion)
Three-stage pipeline: Use GraphRAG for initial entity-based retrieval → Self-RAG for relevance filtering → LongRAG for context assembly
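As a sketch of what that three-stage pipeline might look like in code, reusing the conceptual interfaces from earlier sections (`retrieve_subgraph`, `assess_relevance`, `retrieve`, and a `generator.generate` that accepts a context list are the assumptions here, not a real library):

```python
def three_stage_answer(query, graph_retriever, critic, long_retriever, generator):
    """GraphRAG -> Self-RAG-style filtering -> LongRAG context assembly."""
    # Stage 1: GraphRAG surfaces entities and relations relevant to the query
    graph_context = graph_retriever.retrieve_subgraph(query, max_hops=2)

    # Stage 2: a Self-RAG-style critic decides whether the graph context is
    # actually relevant; irrelevant context is dropped rather than passed on
    use_graph = critic.assess_relevance(query, graph_context) > 0.7

    # Stage 3: LongRAG assembles coherent long-form passages
    # (simplified here to a fresh long-chunk retrieval)
    passages = [hit["text"] for hit in long_retriever.retrieve(query, top_k=2)]
    context = ([graph_context] if use_graph else []) + passages

    return generator.generate(query, context=context)
```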
Implementation Considerations
Embedding Models
Different RAG variants have different embedding requirements:
LongRAG: Needs embeddings that work well on both document-level and chunk-level. Consider models trained with contrastive learning on long sequences.
Self-RAG: Benefits from embeddings that capture semantic nuances for fine-grained relevance assessment.
GraphRAG: Requires entity-aware embeddings. Models fine-tuned on entity linking tasks perform better.
The choice of embedding model significantly impacts performance. When working with local models, tools like Ollama provide a straightforward way to experiment with different embedding models before committing to a production deployment.
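For example, a quick way to try a local embedding model is to call a running Ollama server over its REST API. This sketch assumes Ollama is installed, listening on the default port, and that you have already pulled an embedding model such as `nomic-embed-text`.

```python
import requests


def ollama_embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # /api/embeddings is Ollama's embedding endpoint; it returns {"embedding": [...]}
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]


vector = ollama_embed("LongRAG uses large overlapping chunks.")
print(len(vector))  # dimensionality depends on the chosen model
```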
Chunking Strategies Revisited
Traditional fixed-size chunking is insufficient for advanced RAG:
Semantic chunking: Break at natural boundaries (paragraphs, sections, topic shifts)
Recursive chunking: Create hierarchical chunks with parent-child relationships
Sliding window: Use overlapping chunks to preserve context at boundaries
Structure-aware: Respect document structure (markdown headers, XML tags, code blocks)
For Python-based implementations, libraries like LangChain and LlamaIndex provide built-in support for these chunking strategies.
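As a minimal framework-free illustration of structure-aware chunking, the sketch below splits markdown text at headers and only falls back to sliding-window splitting when a section is too large; the size limits are arbitrary examples.

```python
import re
from typing import List


def structure_aware_chunks(markdown_text: str, max_chars: int = 8000,
                           overlap: int = 800) -> List[str]:
    # Split at markdown headers so chunks align with the document's own structure
    sections = re.split(r"\n(?=#{1,6} )", markdown_text)

    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Fall back to overlapping fixed-size windows for oversized sections
        start = 0
        while start < len(section):
            chunks.append(section[start:start + max_chars])
            start += max_chars - overlap
    return chunks
```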
Reranking Integration
Reranking dramatically improves retrieval quality across all RAG variants. After initial retrieval, a specialized reranking model re-scores results based on query-document interaction features. This provides a significant accuracy boost (10-20%) with minimal latency impact when integrated thoughtfully.
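A common way to add reranking is a cross-encoder that scores each query-document pair jointly. Here's a minimal sketch with sentence-transformers; the model name is one publicly available example, and the candidate list would come from your first-stage retriever.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder sees query and document together, capturing
    # interactions that bi-encoder (vector) retrieval misses
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```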
Scaling to Production
Indexing pipeline:
- Use distributed processing (Ray, Dask) for large document corpora
- Implement incremental indexing for real-time updates
- Store embeddings in optimized vector databases (Pinecone, Weaviate, Qdrant)
Query optimization:
- Cache frequent queries and their results
- Implement query routing (different RAG variants for different query types)
- Use approximate nearest neighbor search for sub-linear scaling
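To illustrate the approximate nearest neighbor point, here's a brief sketch using FAISS's HNSW index, which trades a small amount of recall for much faster search than a flat index; the dimensionality and data are placeholders (384 matches all-MiniLM-L6-v2 embeddings).

```python
import faiss
import numpy as np

dimension = 384  # e.g., all-MiniLM-L6-v2 output size
embeddings = np.random.rand(100_000, dimension).astype("float32")  # placeholder data

# HNSW graph index: approximate search with sub-linear query cost
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = neighbors per graph node
index.hnsw.efSearch = 64                    # higher = better recall, slower queries
index.add(embeddings)

query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, k=10)
```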
Monitoring:
- Track retrieval relevance scores
- Monitor reflection decisions in Self-RAG
- Measure graph traversal paths and depths
- Log confidence scores and user feedback
Real-World Applications
Technical Documentation Search
A major cloud provider implemented GraphRAG for their documentation:
- Entities: API endpoints, parameters, error codes, service names
- Relationships: Dependencies, version compatibilities, migration paths
- Result: 35% reduction in support tickets, 45% faster resolution time
Legal Discovery
A legal tech company combined Self-RAG with LongRAG:
- Self-RAG filters irrelevant documents early
- LongRAG preserves context in retained documents
- Lawyers review 60% fewer false positives
- Critical context preservation improved from 71% to 94%
Research Literature Review
Academic search engine using hybrid approach:
- GraphRAG identifies citation networks and research communities
- LongRAG retrieves full sections maintaining methodology context
- 40% improvement in relevant paper discovery
- Reduced time to literature review from weeks to days
Advanced Topics
Multi-Modal RAG
Extending these variants to handle images, tables, and code:
- Visual grounding: Link text entities to images in documents
- Table understanding: Parse structured data into graph format
- Code analysis: Build dependency graphs from codebases
Adaptive RAG
Dynamic selection of RAG strategy based on query characteristics:
- Query complexity classifier
- Document type detector
- Cost-benefit optimizer for strategy selection
Privacy-Preserving RAG
Implementing these variants with privacy constraints:
- Federated retrieval across data silos
- Differential privacy in embeddings
- Encrypted similarity search
Getting Started
Quick Start with Python
For those looking to implement these techniques, starting with a solid Python foundation is essential. Python's rich ecosystem for machine learning makes it the natural choice for RAG development.
Here's a simple starting point for experimentation:
```python
# Install dependencies
# pip install sentence-transformers faiss-cpu langchain

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Basic setup for experimenting with long chunks
model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    # Your long-form documents here
]

# Create embeddings
embeddings = model.encode(documents)

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype('float32'))

# Query
query = "Your question here"
query_embedding = model.encode([query])
distances, indices = index.search(
    query_embedding.astype('float32'), k=3
)
```
Framework Selection
LangChain: Best for rapid prototyping, extensive integrations
LlamaIndex: Optimized for document indexing and retrieval
Haystack: Production-ready, strong pipeline abstractions
Custom: When you need full control and optimization
Evaluation Framework
Implement rigorous evaluation before production deployment:
Retrieval metrics:
- Precision@K, Recall@K, MRR (Mean Reciprocal Rank)
- NDCG (Normalized Discounted Cumulative Gain)
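The retrieval metrics above are straightforward to compute directly; here's a small self-contained sketch of Precision@K and MRR over ranked result IDs.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    # Fraction of the top-k results that are actually relevant
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_reciprocal_rank(all_ranked: List[List[str]],
                         all_relevant: List[Set[str]]) -> float:
    # Average of 1/rank of the first relevant result per query (0 if none found)
    reciprocal_ranks = []
    for ranked_ids, relevant_ids in zip(all_ranked, all_relevant):
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```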
Generation metrics:
- ROUGE, BLEU for text similarity
- BERTScore for semantic similarity
- Human evaluation for quality assessment
End-to-end metrics:
- Task success rate
- User satisfaction scores
- Latency percentiles (p50, p95, p99)
Conclusion
The landscape of RAG systems has matured significantly beyond basic vector similarity search. LongRAG, Self-RAG, and GraphRAG each address specific limitations of traditional approaches:
LongRAG solves the context fragmentation problem by embracing extended context windows and minimal chunking. It's the go-to choice when document coherence matters and you have the computational resources to handle large contexts.
Self-RAG adds critical self-awareness to retrieval systems. By reflecting on its own decisions, it reduces false positives and improves trustworthiness—essential for high-stakes applications where accuracy matters more than speed.
GraphRAG unlocks the power of structured knowledge representation. When your domain involves complex relationships between entities, graph-based retrieval can surface connections that vector similarity completely misses.
The future of RAG likely involves hybrid approaches that combine the strengths of these variants. A production system might use GraphRAG to identify relevant entity clusters, Self-RAG to filter and validate retrievals, and LongRAG to assemble coherent context for the LLM.
As LLMs continue to improve and context windows expand, we'll see even more sophisticated RAG variants emerge. The key is understanding your specific use case requirements—document structure, query patterns, accuracy demands, and computational constraints—and selecting the appropriate technique or combination thereof.
The tooling ecosystem is maturing rapidly, with frameworks like LangChain, LlamaIndex, and Haystack providing increasingly sophisticated support for these advanced patterns. Combined with powerful local LLM runtimes and embedding models, it's never been easier to experiment with and deploy production-grade RAG systems.
Start with the basics, measure performance rigorously, and evolve your architecture as requirements dictate. The advanced RAG variants covered here provide a roadmap for that evolution.
Useful links
- Python Cheatsheet
- Reranking with embedding models
- LLMs and Structured Output: Ollama, Qwen3 & Python or Go
- Cloud LLM Providers
- LLMs Comparison: Qwen3:30b vs GPT-OSS:20b
External References
- Microsoft GraphRAG: A Modular Graph-Based Retrieval-Augmented Generation System
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
- Retrieval-Augmented Generation for Large Language Models: A Survey
- FAISS: A Library for Efficient Similarity Search
- LangChain Documentation: Advanced RAG Techniques
- HuggingFace: Sentence Transformers for Embedding Models
- RAG Survey: A Comprehensive Analysis of Retrieval Augmented Generation