RAG SOTA: I Built SEQUOIA and Tested 7 Pipelines — Full Results

#rag #llm #ai #opensource

I benchmarked 7 RAG configurations against real-world tasks, using 20+ hours of compute time on local hardware. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed the alternatives.

The Full Pipeline List

Method	Core Approach	My Verdict
No-RAG	Direct LLM generation	Baseline
Classical RAG	Dense retrieval (BGE-small + FAISS)	Poor
Hybrid RAG	BM25 + Dense + RRF + reranker	Moderate
LightRAG	Key-value graph + dense hybrid	Disappointing
PageIndex	Two-stage hierarchical retrieval	Okay
GraphRAG	Entity graph + dense fallback	Complex
Agentic RAG	Multi-step reasoning pipeline	Slow, expensive
SEQUOIA	RAPTOR tree + step-back prompting	Best
SEQUOIA Pro	Multi-query + rerank + compression	SOTA
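
The Hybrid RAG row fuses BM25 and dense rankings with RRF (reciprocal rank fusion). A minimal sketch of that fusion step follows, assuming each retriever returns an ordered list of document IDs. The function name and the document IDs are hypothetical, and k=60 is a conventional default, not a setting confirmed by this benchmark.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse results
dense_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and the fused list would then go to the reranker.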

Why LightRAG Underperformed

The Twitter/LinkedIn hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals, that did not hold up:

Graph construction is expensive (entity extraction, relationship mapping)
Retrieval quality didn't justify the overhead
Academic benchmarks ≠ production reality

I call it "procedural warming": it looks sophisticated but delivers mediocre results.

Why RAPTOR Works

RAPTOR stands for Recursive Abstractive Processing for Tree-Organized Retrieval. It works in three steps:

Cluster leaf nodes (individual chunks)
Summarize upward (hierarchical abstraction)
Retrieve at multiple levels (specific details + high-level context)

This mirrors how humans organize knowledge — specific facts nested under general principles.
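
The three steps above can be sketched in a few lines. This is a toy illustration, not the SEQUOIA code: grouping neighbouring chunks stands in for real clustering over embeddings, toy_summarize stands in for an LLM summarization call, and word overlap stands in for embedding similarity. All names and sample chunks are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int  # 0 = leaf chunk, higher = more abstract summary
    children: list = field(default_factory=list)


def toy_summarize(texts):
    # Placeholder for an LLM call: keep the first 8 words of each text.
    return " / ".join(" ".join(t.split()[:8]) for t in texts)


def build_tree(chunks, group_size=2, summarize=toy_summarize):
    """Build the tree bottom-up until a single root summary remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level_nodes = [Node(text=c, level=0) for c in chunks]
    all_nodes = list(level_nodes)
    while len(level_nodes) > 1:
        parents = []
        # Stand-in for clustering: group neighbouring nodes.
        for i in range(0, len(level_nodes), group_size):
            group = level_nodes[i:i + group_size]
            parents.append(Node(
                text=summarize([n.text for n in group]),
                level=group[0].level + 1,
                children=group,
            ))
        all_nodes.extend(parents)
        level_nodes = parents
    return all_nodes


def overlap(query, text):
    # Stand-in for embedding similarity: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, nodes, top_k=3):
    """Search leaves and summaries together, so hits can come from any level."""
    return sorted(nodes, key=lambda n: overlap(query, n.text), reverse=True)[:top_k]


chunks = [  # hypothetical sample chunks
    "Wire transfers above the daily limit need manager approval.",
    "The daily limit is set per account type.",
    "Card disputes must be filed through the dispute form.",
    "Dispute outcomes are sent by secure message.",
]
tree = build_tree(chunks)
for node in retrieve("daily limit approval", tree, top_k=3):
    print(node.level, node.text)
```

Searching all levels in one pool is one possible retrieval strategy; walking the tree from the root down is another.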

Step-Back Prompting: Free Performance

Before retrieving, generalize the query:

User asks: "What's the error rate for Q3?"
Step-back: "What metrics are tracked quarterly?"
Retrieve broader context first, then narrow

Result: ~15% improvement in recall across all tested configurations. Costs nothing in latency.
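
A rough sketch of the step-back flow, assuming the generalization is done by an LLM call and retrieval by whatever index you already have. Both are passed in as plain functions here; the prompt text, the stub generalizer, the toy word-overlap retriever, and the three-line corpus are all hypothetical stand-ins.

```python
import re

# Hypothetical prompt a real `generalize` function could send to an LLM.
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the underlying "
    "topic.\nQuestion: {question}\nGeneral question:"
)


def step_back_retrieve(query, generalize, retrieve, top_k=4):
    """Retrieve for the generalized query first, then for the original one."""
    broad_query = generalize(query)
    merged, seen = [], set()
    # Broad context first, then the narrow hits, skipping duplicates.
    for q in (broad_query, query):
        for passage in retrieve(q, top_k):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return broad_query, merged


# Toy stand-ins so the sketch runs without a model or an index.
CORPUS = [
    "Quarterly reports track error rate, uptime, and ticket volume.",
    "The Q3 error rate is listed in the Q3 operations report.",
    "Office parking rules changed in March.",
]


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query, top_k):
    scored = [(len(words(query) & words(doc)), doc) for doc in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored[:top_k]]


def stub_generalize(query):
    # A real version would format STEP_BACK_PROMPT and call an LLM.
    return "What metrics are tracked quarterly?"


broad, passages = step_back_retrieve(
    "What's the error rate for Q3?", stub_generalize, toy_retrieve
)
print(broad)
print(passages)
```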

SEQUOIA Architecture

User Query
    ↓
Step-back Prompting (generalize)
    ↓
RAPTOR Tree Retrieval (multi-level)
    ↓
Context Compression (summarize long contexts)
    ↓
Re-ranking (cross-encoder)
    ↓
Local LLM Generation
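
One way to wire these stages together is a small container that takes each stage as a function and calls them in the order shown above. This is a sketch under assumptions, not the actual SEQUOIA code: the class name, field names, and signatures are hypothetical, and retrieving for both the generalized and the original query is my reading of the diagram.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SequoiaPipeline:
    step_back: Callable[[str], str]                   # query -> broader query
    tree_retrieve: Callable[[str], List[str]]         # query -> passages
    compress: Callable[[str, List[str]], List[str]]   # shorten long passages
    rerank: Callable[[str, List[str]], List[str]]     # best passages first
    generate: Callable[[str, List[str]], str]         # final answer
    top_k: int = 3

    def answer(self, query: str) -> str:
        broad = self.step_back(query)
        # Retrieve for both phrasings; dict.fromkeys drops duplicates
        # while keeping first-seen order.
        passages = list(dict.fromkeys(
            self.tree_retrieve(broad) + self.tree_retrieve(query)
        ))
        passages = self.compress(query, passages)
        passages = self.rerank(query, passages)[: self.top_k]
        return self.generate(query, passages)


# Trivial stand-in stages, only to show the call order.
demo = SequoiaPipeline(
    step_back=lambda q: "general: " + q,
    tree_retrieve=lambda q: ["passage for " + q],
    compress=lambda q, ps: [p[:60] for p in ps],
    rerank=lambda q, ps: sorted(ps, key=len),
    generate=lambda q, ps: "Answer based on %d passages" % len(ps),
)
print(demo.answer("What's the error rate for Q3?"))
```

Keeping each stage behind a plain function makes it easy to swap one out (for example, a different reranker) and rerun the same comparison.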

Local LLM Evaluation

I used a local model weaker than GPT-4 for judging and summarization. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.

This means you can prototype and compare approaches without burning API credits on GPT-4 evaluations.
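
If you want to check ranking consistency on your own runs, one option is a rank correlation between two judges' scores for the same methods. The sketch below computes Kendall's tau (the simple tau-a variant) with the standard library only. The method names and scores are placeholder values for illustration, not results from this benchmark.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two judges' scores for the same methods.

    +1.0 means both judges order every pair of methods the same way,
    -1.0 means the orders are exactly reversed.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    n_pairs = len(methods) * (len(methods) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two methods scored by both judges")
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


# Placeholder scores, NOT measured results.
local_judge = {"method_a": 0.40, "method_b": 0.55, "method_c": 0.70}
strong_judge = {"method_a": 0.52, "method_b": 0.61, "method_c": 0.83}
print(kendall_tau(local_judge, strong_judge))
```

A value close to 1.0 on a small sample scored by both judges would support using the cheaper local judge for the rest of the comparison.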