RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)
After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.
The Full Pipeline List
| Method | Core Approach |
|---|---|
| No-RAG | Direct LLM generation |
| Classical RAG | Dense retrieval (BGE-small + FAISS) |
| Hybrid RAG | BM25 + Dense + RRF + reranker |
| LightRAG | Key-value graph + dense hybrid |
| PageIndex | Two-stage hierarchical retrieval |
| GraphRAG | Entity graph + dense fallback |
| Agentic RAG | Multi-step reasoning pipeline |
| SEQUOIA | RAPTOR tree + step-back prompting |
| SEQUOIA Pro | Multi-query + rerank + compression |
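The Hybrid RAG row relies on Reciprocal Rank Fusion (RRF) to merge the BM25 and dense rankings. Here is a minimal sketch of that fusion step, not the repository's actual code. The document ids are hypothetical, and `k=60` is the commonly used default constant, assumed here rather than taken from the benchmark.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; ties are broken by doc id to keep output stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical BM25 ranking
dense_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, which is why RRF needs no score normalization between BM25 and the dense retriever.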
Why LightRAG Underperformed
The hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals:
- Graph construction is expensive (entity extraction, relationship mapping)
- Retrieval quality did not justify the overhead
- Academic benchmarks do not equal production reality
Why RAPTOR Works
Recursive Abstractive Processing for Tree-Organized Retrieval:
- Cluster leaf nodes (individual chunks)
- Summarize upward (hierarchical abstraction)
- Retrieve at multiple levels (specific details + high-level context)
This mirrors how people organize knowledge: broad summaries on top, specific details underneath.
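The three steps can be sketched in a few dozen lines. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a dense encoder, `cluster` is a simple greedy grouping rather than a real clustering algorithm, and `summarize` is a placeholder for an LLM call. All function names and sample chunks are hypothetical and not taken from the repository.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.2):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters = []
    for text in texts:
        for group in clusters:
            if cosine(embed(text), embed(group[0])) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])
    return clusters

def summarize(texts):
    """Placeholder for an LLM summary: keeps the first 8 words of each text."""
    return " ".join(" ".join(t.split()[:8]) for t in texts)

def build_tree(chunks, max_levels=3):
    """Return a list of levels; level 0 holds the leaf chunks."""
    levels = [list(chunks)]
    while len(levels) < max_levels and len(levels[-1]) > 1:
        groups = cluster(levels[-1])
        if len(groups) == len(levels[-1]):
            break  # nothing merged, so another level adds no abstraction
        levels.append([summarize(g) for g in groups])
    return levels

def retrieve(levels, query, top_k=3):
    """Score leaves and summaries together, returning (score, depth, text)."""
    q = embed(query)
    nodes = [(cosine(q, embed(text)), depth, text)
             for depth, level in enumerate(levels) for text in level]
    return sorted(nodes, key=lambda n: -n[0])[:top_k]

chunks = [
    "card fraud alerts are sent by sms",
    "card fraud limits reset every day",
    "the router firmware update needs a reboot",
    "router firmware logs are kept for a week",
]
levels = build_tree(chunks)
hits = retrieve(levels, "router firmware reboot")
```

Because `retrieve` scores every level at once, a query can match either a specific leaf chunk or a higher-level summary, which is the "specific details + high-level context" behavior described above.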
Step-Back Prompting: Free Performance
Before retrieving, generalize the query:
- User asks: "What's the error rate for Q3?"
- Step-back: "What metrics are tracked quarterly?"
- Retrieve broader context first, then narrow
Result: roughly 15% better recall in my benchmark. Zero latency cost.
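A minimal sketch of the idea, assuming two caller-supplied functions: `generate(prompt)` wraps whatever LLM you use and `search(query, k)` wraps your retriever. Both names, the prompt wording, and the stub implementations below are hypothetical.

```python
from typing import Callable, List, Tuple

STEP_BACK_TEMPLATE = (
    "Rewrite the question as a more general question about the underlying "
    "topic.\nQuestion: {question}\nGeneral question:"
)

def step_back_retrieve(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str, int], List[str]],
    k: int = 4,
) -> Tuple[str, List[str]]:
    """Retrieve for the generalized question first, then for the original."""
    broad_question = generate(STEP_BACK_TEMPLATE.format(question=question)).strip()
    merged: List[str] = []
    for query in (broad_question, question):
        for passage in search(query, k):
            if passage not in merged:  # keep order, drop duplicates
                merged.append(passage)
    return broad_question, merged

# Stubs standing in for a real LLM and a real retriever.
def fake_generate(prompt: str) -> str:
    return " What metrics are tracked quarterly?\n"

def fake_search(query: str, k: int) -> List[str]:
    return [f"passage about: {query}"][:k]

broad, passages = step_back_retrieve(
    "What's the error rate for Q3?", fake_generate, fake_search
)
```

The broad passages come first in the merged list, so the generator sees the general context before the narrow hit.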
SEQUOIA Architecture
User Query
→ Step-back Prompting (generalize)
→ RAPTOR Tree Retrieval (multi-level)
→ Context Compression (summarize long contexts)
→ Re-ranking (cross-encoder)
→ Local LLM Generation
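One way to wire these stages together is to treat each as a pluggable function. The sketch below only shows the order of the calls. The class name, field names, and lambda stubs are hypothetical and do not reflect how the repository structures its code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SequoiaStylePipeline:
    step_back: Callable[[str], str]
    retrieve: Callable[[str], List[str]]
    compress: Callable[[str, List[str]], List[str]]
    rerank: Callable[[str, List[str]], List[str]]
    generate: Callable[[str, List[str]], str]

    def run(self, query: str) -> str:
        broad = self.step_back(query)
        # Retrieve for both the generalized and the original query.
        passages = self.retrieve(broad) + self.retrieve(query)
        passages = list(dict.fromkeys(passages))  # dedupe, keep order
        passages = self.compress(query, passages)
        passages = self.rerank(query, passages)
        return self.generate(query, passages)

# Stub stages so the sketch runs without any model or index.
pipeline = SequoiaStylePipeline(
    step_back=lambda q: "general: " + q,
    retrieve=lambda q: [f"doc for '{q}'", "shared doc"],
    compress=lambda q, ps: [p[:40] for p in ps],
    rerank=lambda q, ps: sorted(ps, key=len),
    generate=lambda q, ps: f"{len(ps)} passages used for: {q}",
)
answer = pipeline.run("What's the error rate for Q3?")
print(answer)
```

Keeping each stage behind a plain callable makes it easy to swap one component at a time when comparing configurations.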
Local LLM Evaluation
I used a local model weaker than GPT-4 for judging. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.
You can prototype and compare approaches without burning API credits on GPT-4 evaluations.
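If you want to check ranking consistency on your own runs, Spearman rank correlation between two judges is a simple measure. The sketch below assumes no tied scores. The method names and numbers are made-up placeholders, not results from this benchmark.

```python
def ranks(scores):
    """Map each method to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts with equal keys, no ties."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Placeholder scores from a hypothetical local judge and a stronger judge.
local_judge = {"method_a": 0.61, "method_b": 0.55, "method_c": 0.70, "method_d": 0.40}
strong_judge = {"method_a": 0.80, "method_b": 0.82, "method_c": 0.90, "method_d": 0.60}
rho = spearman(local_judge, strong_judge)
print(round(rho, 2))
```

A value near 1 means both judges order the methods the same way, even if their absolute scores differ.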
Production Recommendations
- Start with Classical RAG — establish baseline, prove value
- Add step-back prompting — free performance gain
- Move to hierarchical retrieval when context complexity justifies it
- Avoid graph approaches unless you have specific graph-structured data
- Measure on YOUR data — academic benchmarks are misleading
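For the last point, a small recall@k helper is enough to start measuring on your own labeled queries. This is a sketch: the document ids and relevance labels below are hypothetical placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved docs."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Each entry: (ranked doc ids returned by the pipeline, ids labeled relevant).
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),
    (["d9", "d2", "d4"], {"d4", "d5"}),
]
recall_2 = mean_recall_at_k(runs, k=2)
recall_3 = mean_recall_at_k(runs, k=3)
```

Running the same helper over every pipeline's output on the same queries gives a like-for-like comparison before any LLM judge is involved.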
Open Source
Everything is available:
https://github.com/Diyago/rag-benchmark/tree/main
Includes all implementations, evaluation dataset (anonymized), and analysis notebooks.
More AI Engineering Notes
I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works versus what is just hype.
Telegram channel (Russian, technical): https://t.me/ai_tablet
Have you benchmarked RAG on real data? What surprised you?
Top comments (2)
Great work benchmarking these RAG configurations! I'm curious how SEQUOIA performs with large document sets—did you notice any scalability issues?
It's based on RAPTOR, so it will work!!