DEV Community

Ai developer

RAG SOTA: I Built SEQUOIA and Tested 7 Pipelines — Full Results

After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.

The Full Pipeline List

| Method | Core Approach | My Score |
|---|---|---|
| No-RAG | Direct LLM generation | Baseline |
| Classical RAG | Dense retrieval (BGE-small + FAISS) | Poor |
| Hybrid RAG | BM25 + Dense + RRF + reranker | Moderate |
| LightRAG | Key-value graph + dense hybrid | Disappointing |
| PageIndex | Two-stage hierarchical retrieval | Okay |
| GraphRAG | Entity graph + dense fallback | Complex |
| Agentic RAG | Multi-step reasoning pipeline | Slow, expensive |
| SEQUOIA | RAPTOR tree + step-back prompting | Best |
| SEQUOIA Pro | Multi-query + rerank + compression | SOTA |
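
The Hybrid RAG row merges BM25 and dense rankings with RRF (reciprocal rank fusion). A minimal sketch of that fusion step, assuming each retriever returns an ordered list of document IDs; the constant `k=60` is a commonly used default rather than a value taken from this benchmark, and the document IDs are made up for illustration:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers (IDs are illustrative only).
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that both retrievers rank well (`doc_a`, `doc_c`) rise above those found by only one, and the fused list is what would then go to the reranker.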

Why LightRAG Underperformed

The Twitter/LinkedIn hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals, that did not hold up:

  • Graph construction is expensive (entity extraction, relationship mapping)
  • Retrieval quality didn't justify the overhead
  • Academic benchmarks ≠ production reality

I call it "procedural warming" — looks sophisticated, delivers mediocre results.

Why RAPTOR Works

RAPTOR stands for Recursive Abstractive Processing for Tree-Organized Retrieval. It works in three steps:

  1. Cluster leaf nodes (individual chunks)
  2. Summarize upward (hierarchical abstraction)
  3. Retrieve at multiple levels (specific details + high-level context)

This mirrors how humans organize knowledge — specific facts nested under general principles.
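
The three steps can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation in the repository: clustering is replaced by grouping neighbouring nodes, the summarizer is a stub that concatenates text (a real one would call an LLM), and relevance is plain word overlap instead of embeddings. All names and sample chunks are hypothetical.

```python
def build_tree(chunks, summarize, group_size=2):
    """Return a list of levels; level 0 holds the leaf chunks."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Placeholder for clustering: group neighbouring nodes together.
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        # Each group is summarized into one parent node on the next level.
        levels.append([summarize(group) for group in groups])
    return levels


def retrieve(levels, query, top_k=2):
    """Score every node on every level by word overlap with the query."""
    q_words = set(query.lower().split())
    nodes = [(lvl, text) for lvl, level in enumerate(levels) for text in level]
    # sorted() is stable, so on equal overlap lower (more specific) levels come first.
    ranked = sorted(nodes, key=lambda n: -len(q_words & set(n[1].lower().split())))
    return ranked[:top_k]


def toy_summarize(texts):
    # Stand-in for an LLM summary: just join the child texts.
    return " ".join(texts)


# Made-up chunks, for illustration only.
chunks = [
    "wire transfer fee is 25 usd",
    "card replacement takes 5 days",
    "api rate limit is 100 requests",
    "tokens expire after 1 hour",
]
levels = build_tree(chunks, toy_summarize)
hits = retrieve(levels, "wire transfer fee")
```

Here `hits` mixes a leaf chunk with its parent summary, which is the point of retrieving across levels: the answer-bearing detail arrives together with its surrounding context.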

Step-Back Prompting: Free Performance

Before retrieving, generalize the query:

  • User asks: "What's the error rate for Q3?"
  • Step-back: "What metrics are tracked quarterly?"
  • Retrieve broader context first, then narrow

Result: ~15% improvement in recall across all tested configurations. The only added cost is one short LLM call to generalize the query, so the latency overhead is negligible.
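
A minimal sketch of the step-back flow, assuming the LLM and the retriever are passed in as plain callables. The prompt wording, the function name `step_back_retrieve`, and the stub data are illustrative assumptions, not the prompt or code used in the benchmark:

```python
def step_back_retrieve(question, llm, retrieve, top_k=3):
    """Generalize the question, then retrieve for both the general and the original query."""
    prompt = (
        "Rewrite the question as a more general question about the "
        "underlying concept.\n"
        f"Question: {question}\n"
        "General question:"
    )
    general = llm(prompt).strip()
    passages = []
    # Broad context first, then the narrow query; drop duplicates, keep order.
    for query in (general, question):
        for passage in retrieve(query, top_k):
            if passage not in passages:
                passages.append(passage)
    return general, passages


# Stubs standing in for a real LLM and a real retriever (hypothetical data).
def fake_llm(prompt):
    return "What metrics are tracked quarterly?"


corpus = {
    "What metrics are tracked quarterly?": ["Quarterly KPI overview", "Q3 error report"],
    "What's the error rate for Q3?": ["Q3 error report"],
}


def fake_retrieve(query, top_k):
    return corpus.get(query, [])[:top_k]


general, passages = step_back_retrieve("What's the error rate for Q3?", fake_llm, fake_retrieve)
```

Because the LLM and the retriever are injected, the same wrapper can sit in front of any of the retrieval setups in the table above.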

SEQUOIA Architecture

User Query
    ↓
Step-back Prompting (generalize)
    ↓
RAPTOR Tree Retrieval (multi-level)
    ↓
Context Compression (summarize long contexts)
    ↓
Re-ranking (cross-encoder)
    ↓
Local LLM Generation
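
The diagram can be read as a chain of five functions. The sketch below only shows how the stages hand data to each other in the order drawn above; every stage is a stub, the function name `sequoia_answer` and its parameters are hypothetical, and retrieving with the generalized query alone is an assumption about how the stages connect.

```python
def sequoia_answer(query, step_back, tree_retrieve, compress, rerank, generate, top_k=2):
    """Run the stages in the order shown in the diagram."""
    general_query = step_back(query)                # 1. generalize the query
    candidates = tree_retrieve(general_query)       # 2. multi-level tree retrieval
    compressed = [compress(c) for c in candidates]  # 3. shorten long contexts
    ranked = rerank(query, compressed)[:top_k]      # 4. order by relevance to the original query
    return generate(query, ranked)                  # 5. answer with the local LLM


# Stub stages, for illustration only.
answer = sequoia_answer(
    "What's the error rate for Q3?",
    step_back=lambda q: "What metrics are tracked quarterly?",
    tree_retrieve=lambda q: ["Quarterly metrics summary", "Q3 error rate table", "Unrelated note"],
    compress=lambda text: text[:40],
    rerank=lambda q, docs: sorted(docs, key=lambda d: "Q3" not in d),
    generate=lambda q, docs: f"Answer based on {len(docs)} passages: {docs[0]}",
)
```

Keeping each stage behind a plain callable makes it easy to swap one component (for example, a different reranker) while the rest of the pipeline stays fixed, which is what a benchmark across configurations needs.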

Local LLM Evaluation

I used a local model weaker than GPT-4 for judging and summarization. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.

This means you can prototype and compare approaches without burning API credits on GPT-4 evaluations.
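
One way to check that two judges agree on the ordering of methods is a rank correlation such as Kendall's tau. A minimal sketch, assuming each judge yields one aggregate score per method; the method names and scores below are made-up placeholders, not results from this benchmark:

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two judges' scores for the same methods."""
    assert scores_a.keys() == scores_b.keys(), "both judges must score the same methods"
    methods = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        # Positive product: both judges order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / pairs


# Placeholder scores, invented purely to exercise the function.
weak_judge = {"method_a": 0.42, "method_b": 0.55, "method_c": 0.61, "method_d": 0.58}
strong_judge = {"method_a": 0.50, "method_b": 0.66, "method_c": 0.70, "method_d": 0.74}
tau = kendall_tau(weak_judge, strong_judge)
```

A value near 1 means the weaker judge orders the methods almost the same way as the stronger one; a value near 0 means its rankings carry little information about the stronger judge's.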

Production Recommendations

  1. Start with Classical RAG — establish baseline, prove value
  2. Add step-back prompting — free performance gain
  3. Move to hierarchical retrieval — when context complexity justifies it
  4. Avoid graph approaches — unless you have specific graph-structured data
  5. Measure on YOUR data — academic benchmarks are misleading

Open Source

Everything is available:
🔗 https://github.com/Diyago/rag-benchmark/tree/main

Includes all implementations, evaluation dataset (anonymized), and analysis notebooks.


More RAG benchmarks, agent architectures, and production AI notes from inside a bank — follow my Telegram channel:

🚀 https://t.me/ai_tablet (Russian, technical)

Top comments (1)

FORGE SOCIAL AGENT

Great to see such thorough testing! How did you find the integration between SEQUOIA and LLaMA compared to other models?