RAG SOTA: I Built SEQUOIA and Tested 7 Pipelines — Full Results
After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.
The Full Pipeline List
| Method | Core Approach | My Verdict |
|---|---|---|
| No-RAG | Direct LLM generation | Baseline |
| Classical RAG | Dense retrieval (BGE-small + FAISS) | Poor |
| Hybrid RAG | BM25 + Dense + RRF + reranker | Moderate |
| LightRAG | Key-value graph + dense hybrid | Disappointing |
| PageIndex | Two-stage hierarchical retrieval | Okay |
| GraphRAG | Entity graph + dense fallback | Complex |
| Agentic RAG | Multi-step reasoning pipeline | Slow, expensive |
| SEQUOIA | RAPTOR tree + step-back prompting | Best |
| SEQUOIA Pro | Multi-query + rerank + compression | SOTA |
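The Hybrid RAG row fuses BM25 and dense rankings with reciprocal rank fusion (RRF). Here is a minimal sketch of that fusion step, assuming each retriever returns an ordered list of document IDs. The constant `k=60` is the value commonly used for RRF, not a setting taken from this benchmark, and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and no score normalization between BM25 and cosine similarity is needed because only ranks are used.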
Why LightRAG Underperformed
The Twitter/LinkedIn hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals, it did not hold up:
- Graph construction is expensive (entity extraction, relationship mapping)
- Retrieval quality didn't justify the overhead
- Academic benchmarks ≠ production reality
I call it "procedural warming" — looks sophisticated, delivers mediocre results.
Why RAPTOR Works
RAPTOR stands for Recursive Abstractive Processing for Tree-Organized Retrieval. It works in three steps:
- Cluster leaf nodes (individual chunks)
- Summarize upward (hierarchical abstraction)
- Retrieve at multiple levels (specific details + high-level context)
This mirrors how humans organize knowledge — specific facts nested under general principles.
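The build loop (cluster, summarize, repeat) can be sketched as a toy example. This is not the implementation from the repository. `naive_cluster` just groups neighbouring nodes and `naive_summarize` just truncates, both hypothetical placeholders for the embedding-based clustering and LLM summarizer a real RAPTOR build would use. The chunk texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int                      # 0 = leaf chunk, higher = more abstract
    children: list = field(default_factory=list)


def naive_cluster(nodes, size=2):
    # Placeholder: groups neighbours in order. A real build would cluster embeddings.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def naive_summarize(texts, max_chars=80):
    # Placeholder: joins and truncates. A real build would call an LLM summarizer.
    return " / ".join(texts)[:max_chars]


def build_tree(chunks, cluster=naive_cluster, summarize=naive_summarize):
    """Return every node in the tree, leaves first, root last."""
    layer = [Node(text=c, level=0) for c in chunks]
    all_nodes = list(layer)
    while len(layer) > 1:
        parents = [
            Node(
                text=summarize([n.text for n in group]),
                level=group[0].level + 1,
                children=group,
            )
            for group in cluster(layer)
        ]
        all_nodes.extend(parents)
        layer = parents
    return all_nodes


chunks = ["card fees", "card limits", "loan rates", "loan terms"]
tree = build_tree(chunks)
for node in tree:
    print(node.level, node.text)
```

Because `build_tree` returns leaves and summaries in one flat list, a retriever can index all of them together and return specific chunks and higher-level summaries for the same query.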
Step-Back Prompting: Free Performance
Before retrieving, generalize the query:
- User asks: "What's the error rate for Q3?"
- Step-back: "What metrics are tracked quarterly?"
- Retrieve broader context first, then narrow
Result: ~15% improvement in recall across all tested configurations. The only overhead is one short LLM call to generalize the query.
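A minimal sketch of the step-back flow, assuming `llm` is any callable that maps a prompt string to a completion and `retriever` is any callable that maps `(query, k)` to a list of passages. The prompt wording, the function names, and the stand-in model and index are all hypothetical; swap in your own.

```python
STEP_BACK_TEMPLATE = (
    "Rewrite the question as a more general question about the underlying "
    "concept or category. Return only the rewritten question.\n"
    "Question: {question}\n"
    "General question:"
)


def step_back_retrieve(question, llm, retriever, k=4):
    """Retrieve for the generalized question first, then for the original one."""
    general = llm(STEP_BACK_TEMPLATE.format(question=question)).strip()
    seen, merged = set(), []
    for query in (general, question):   # broad context first, then narrow
        for doc in retriever(query, k):
            if doc not in seen:         # keep first occurrence, drop duplicates
                seen.add(doc)
                merged.append(doc)
    return general, merged


# Stand-ins so the sketch runs without a model or an index.
def fake_llm(prompt):
    return "What metrics are tracked quarterly?"


def fake_retriever(query, k):
    corpus = {
        "What metrics are tracked quarterly?": ["metrics overview", "quarterly report layout"],
        "What's the error rate for Q3?": ["Q3 error table", "metrics overview"],
    }
    return corpus.get(query, [])[:k]


general, docs = step_back_retrieve(
    "What's the error rate for Q3?", fake_llm, fake_retriever
)
print(general)
print(docs)
```

Merging both result sets keeps the broad context while still surfacing the passage that answers the specific question.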
SEQUOIA Architecture
User Query
↓
Step-back Prompting (generalize)
↓
RAPTOR Tree Retrieval (multi-level)
↓
Context Compression (summarize long contexts)
↓
Re-ranking (cross-encoder)
↓
Local LLM Generation
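The flow above can be sketched as a plain function that wires the stages together in the same order. Every stage is passed in as a callable, so the names and signatures here are assumptions for illustration rather than the repository's API. Retrieving for both the generalized and the original query, and the `top_n` cutoff, are also assumptions.

```python
from typing import Callable, List


def sequoia_answer(
    query: str,
    step_back: Callable[[str], str],
    tree_retrieve: Callable[[str], List[str]],
    compress: Callable[[str, List[str]], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_n: int = 3,
) -> str:
    general_query = step_back(query)
    # Pull candidates for both queries; dict.fromkeys drops duplicates, keeps order.
    candidates = list(
        dict.fromkeys(tree_retrieve(general_query) + tree_retrieve(query))
    )
    compressed = compress(query, candidates)   # shorten long passages
    ranked = rerank(query, compressed)         # e.g. a cross-encoder score sort
    return generate(query, ranked[:top_n])


# Trivial stand-ins so the sketch runs end to end without models.
answer = sequoia_answer(
    "What's the error rate for Q3?",
    step_back=lambda q: "What metrics are tracked quarterly?",
    tree_retrieve=lambda q: [f"node for: {q}"],
    compress=lambda q, docs: [d[:60] for d in docs],
    rerank=lambda q, docs: sorted(docs, key=len),
    generate=lambda q, docs: f"{len(docs)} passages used",
)
print(answer)
```

Keeping each stage behind a callable makes it easy to disable or swap one stage at a time when comparing configurations.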
Local LLM Evaluation
I used a local model weaker than GPT-4 for judging and summarization. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.
This means you can prototype and compare approaches without burning API credits on GPT-4 evaluations.
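One way to check that claim on your own runs is to compare the method rankings produced by two judges with a rank correlation. The sketch below computes Spearman's rho in plain Python, assuming no tied scores. The method labels and scores are placeholders, not results from this benchmark.

```python
def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score dicts over the same methods.

    Assumes no tied scores within either dict and at least two methods.
    """
    methods = sorted(scores_a)
    if methods != sorted(scores_b):
        raise ValueError("both judges must score the same methods")

    def ranks(scores):
        ordered = sorted(methods, key=lambda m: scores[m], reverse=True)
        return {m: i for i, m in enumerate(ordered, start=1)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(methods)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in methods)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Placeholder scores from a weaker local judge and a stronger judge.
weak_judge = {"A": 0.41, "B": 0.55, "C": 0.62, "D": 0.70}
strong_judge = {"A": 0.50, "B": 0.66, "C": 0.64, "D": 0.81}
rho = spearman_rho(weak_judge, strong_judge)
print(round(rho, 2))
```

A value near 1 means the two judges order the methods almost identically, even if their absolute scores differ.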
Production Recommendations
- Start with Classical RAG — establish baseline, prove value
- Add step-back prompting — free performance gain
- Move to hierarchical retrieval — when context complexity justifies it
- Avoid graph approaches — unless you have specific graph-structured data
- Measure on YOUR data — academic benchmarks are misleading
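For the last point, a small recall@k helper is often enough to start measuring retrieval on your own labeled queries. This is a sketch under the assumption that you have, per query, the retrieved document IDs in rank order and a set of IDs judged relevant. The IDs below are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical labeled queries: (ranked retrieved IDs, relevant IDs).
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),   # both relevant docs in the top 3
    (["d9", "d2", "d4"], {"d4", "d5"}),   # one of two relevant docs found
]
score = mean_recall_at_k(runs, k=3)
print(score)
```

Running the same labeled queries through each pipeline gives a like-for-like comparison on your own documents.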
Open Source
Everything is available:
🔗 https://github.com/Diyago/rag-benchmark/tree/main
Includes all implementations, evaluation dataset (anonymized), and analysis notebooks.
More RAG benchmarks, agent architectures, and production AI notes from inside a bank — follow my Telegram channel:
🚀 https://t.me/ai_tablet (Russian, technical)