DEV Community

Ai developer

RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)

After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.

The Full Pipeline List

Method        | Core Approach
No-RAG        | Direct LLM generation
Classical RAG | Dense retrieval (BGE-small + FAISS)
Hybrid RAG    | BM25 + Dense + RRF + reranker
LightRAG      | Key-value graph + dense hybrid
PageIndex     | Two-stage hierarchical retrieval
GraphRAG      | Entity graph + dense fallback
Agentic RAG   | Multi-step reasoning pipeline
SEQUOIA       | RAPTOR tree + step-back prompting
SEQUOIA Pro   | Multi-query + rerank + compression
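
The Hybrid RAG row merges BM25 and dense rankings with reciprocal rank fusion (RRF). Here is a minimal sketch of that fusion step, assuming each retriever returns an ordered list of document IDs. The function name, the document IDs, and the constant k=60 (a commonly used default) are illustrative assumptions, not taken from the benchmark code:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and the fused list would then go to the reranker.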

Why LightRAG Underperformed

The hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals, it did not hold up:

  • Graph construction is expensive (entity extraction, relationship mapping)
  • Retrieval quality did not justify the overhead
  • Academic benchmarks do not equal production reality

Why RAPTOR Works

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a tree of summaries over the chunks and searches it at every level:

  1. Cluster leaf nodes (individual chunks)
  2. Summarize upward (hierarchical abstraction)
  3. Retrieve at multiple levels (specific details + high-level context)
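
The three steps above can be sketched in plain Python. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a dense embedding model, `summarize` is a placeholder for an LLM summary, and `cluster` is a simple greedy threshold clustering swapped in for whatever clustering the real pipeline uses. All names and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a dense model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(texts):
    """Placeholder for an LLM summary: here the texts are simply joined."""
    return " ".join(texts)


def cluster(nodes, threshold=0.25):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for node in nodes:
        for group in clusters:
            if cosine(embed(node), embed(group[0])) >= threshold:
                group.append(node)
                break
        else:
            clusters.append([node])
    return clusters


def build_tree(chunks, max_levels=3):
    """Level 0 holds the chunks; each higher level holds cluster summaries."""
    levels = [list(chunks)]
    while len(levels[-1]) > 1 and len(levels) < max_levels:
        groups = cluster(levels[-1])
        if len(groups) == len(levels[-1]):  # nothing merged, stop growing
            break
        levels.append([summarize(g) for g in groups])
    return levels


def retrieve(levels, query, top_k=2):
    """Score nodes from all levels together, so summaries compete with chunks."""
    q = embed(query)
    nodes = [(cosine(q, embed(text)), depth, text)
             for depth, level in enumerate(levels) for text in level]
    return sorted(nodes, key=lambda n: -n[0])[:top_k]


chunks = [
    "card fees are charged monthly",
    "card fees are waived for students",
    "loan rates depend on credit score",
    "loan rates are fixed for one year",
]
levels = build_tree(chunks)
hits = retrieve(levels, "card fees")
```

On this toy input the query "card fees" returns one summary node and one leaf chunk, which is the "specific details + high-level context" mix described in step 3.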

This mirrors how humans organize knowledge.

Step-Back Prompting: Free Performance

Before retrieving, generalize the query:

  • User asks: "What's the error rate for Q3?"
  • Step-back: "What metrics are tracked quarterly?"
  • Retrieve broader context first, then narrow

Result: about a 15% improvement in recall in my tests, at essentially zero latency cost.
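
A minimal sketch of the step-back flow, assuming two injected callables: `generalize` (in practice an LLM prompt that rewrites the question; stubbed here) and `retrieve` (any retriever that takes a query and a `top_k`). The passages, the keyword retriever, and all function names are hypothetical stand-ins for illustration only:

```python
import re

# Hypothetical toy passages, not from the benchmark dataset.
PASSAGES = [
    "Quarterly metrics tracked: error rate, latency, uptime.",
    "Q3 error rate was recorded in the operations report.",
    "Office opening hours are 9 to 5.",
]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(query, top_k):
    """Stand-in retriever: rank passages by word overlap with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(p)), p) for p in PASSAGES]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [p for _, p in ranked[:top_k]]


def fake_generalize(query):
    """Stub for the LLM call that produces the step-back question."""
    return "What metrics are tracked quarterly?"


def step_back_retrieve(query, generalize, retrieve, top_k=5):
    """Retrieve with the broader step-back query first, then the original."""
    broad_query = generalize(query)
    passages = retrieve(broad_query, top_k) + retrieve(query, top_k)
    # Deduplicate while keeping order: broad context first, specifics after.
    return list(dict.fromkeys(passages))


context = step_back_retrieve(
    "What's the error rate for Q3?", fake_generalize, keyword_retrieve, top_k=1
)
```

With `top_k=1`, the broad query pulls in the passage listing quarterly metrics and the original query adds the Q3-specific passage.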

SEQUOIA Architecture

User Query
    ↓ Step-back Prompting (generalize)
    ↓ RAPTOR Tree Retrieval (multi-level)
    ↓ Context Compression (summarize long contexts)
    ↓ Re-ranking (cross-encoder)
    ↓ Local LLM Generation
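
One way to wire these stages together is a small class that takes each stage as a callable. This is a sketch of the control flow only: the class name, the stage signatures, and the choice to retrieve with both the generalized and the original query are assumptions, and the lambdas are trivial stubs rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SequoiaPipeline:
    """Runs the five stages in the order shown; every stage is injected."""
    step_back: Callable[[str], str]
    tree_retrieve: Callable[[str], List[str]]
    compress: Callable[[List[str]], List[str]]
    rerank: Callable[[str, List[str]], List[str]]
    generate: Callable[[str, List[str]], str]

    def answer(self, query: str) -> str:
        broad = self.step_back(query)
        # Assumption: retrieve for both the broad and the original query.
        passages = self.tree_retrieve(broad) + self.tree_retrieve(query)
        passages = self.compress(passages)
        passages = self.rerank(query, passages)
        return self.generate(query, passages)


calls = []


def stage(name, fn):
    """Wrap a stub so the order of stage calls is recorded."""
    def wrapped(*args):
        calls.append(name)
        return fn(*args)
    return wrapped


pipeline = SequoiaPipeline(
    step_back=stage("step_back", lambda q: "general: " + q),
    tree_retrieve=stage("retrieve", lambda q: ["passage for " + q]),
    compress=stage("compress", lambda ps: [p[:40] for p in ps]),
    rerank=stage("rerank", lambda q, ps: sorted(ps, key=len)),
    generate=stage("generate", lambda q, ps: f"answer from {len(ps)} passages"),
)
result = pipeline.answer("error rate for Q3")
```

Keeping each stage behind a plain callable makes it easy to swap one component (for example, the reranker) and rerun the same benchmark.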

Local LLM Evaluation

I used a local model weaker than GPT-4 for judging. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.

You can prototype and compare approaches without burning API credits on GPT-4 evaluations.
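
To check whether two judges agree on the ordering of methods, a rank correlation such as Kendall's tau is one option. The sketch below is an assumption about how such a check could look, not the analysis from the repository, and the scores are made-up placeholders rather than benchmark results:

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two judges' per-method scores.

    Returns 1.0 when both judges order every pair of methods the same way
    and -1.0 when they disagree on every pair. Tied pairs count as neither.
    """
    methods = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Placeholder scores on a 0-10 scale, for illustration only.
local_judge = {"method_a": 5, "method_b": 6, "method_c": 7}
strong_judge = {"method_a": 6, "method_b": 8, "method_c": 9}
tau = kendall_tau(local_judge, strong_judge)
```

A tau close to 1.0 means the weaker judge ranks the methods in the same order as the stronger one, even if the absolute scores differ.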

Production Recommendations

  1. Start with Classical RAG: establish a baseline and prove value
  2. Add step-back prompting: a nearly free performance gain
  3. Move to hierarchical retrieval when context complexity justifies it
  4. Avoid graph approaches unless your data is inherently graph-structured
  5. Measure on YOUR data: academic benchmark results may not carry over to production
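
Measuring on your own data can start with a small labeled set and a recall@k loop. This is a minimal sketch, assuming you have query and relevant-document pairs and a retriever that returns ranked document IDs. The function names and the toy index are hypothetical:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def evaluate(retriever, labeled_queries, k=5):
    """Mean recall@k of a retriever over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in labeled_queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-in for a real retriever: a fixed lookup of ranked doc IDs.
toy_index = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d5"]}
labeled = [("q1", ["d1", "d4"]), ("q2", ["d2"])]
mean_recall = evaluate(toy_index.get, labeled, k=2)
```

Running the same `evaluate` call against each pipeline's retriever gives a like-for-like comparison on your documents.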

Open Source

Everything is available:
https://github.com/Diyago/rag-benchmark/tree/main

The repository includes all implementations, the anonymized evaluation dataset, and the analysis notebooks.

More AI Engineering Notes

I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works versus what is just hype.

Telegram channel (Russian, technical): https://t.me/ai_tablet


Have you benchmarked RAG on real data? What surprised you?


Top comments (2)

FORGE SOCIAL AGENT

Great work benchmarking these RAG configurations! I'm curious how SEQUOIA performs with large document sets—did you notice any scalability issues?

Ai developer

It's based on RAPTOR, so it will work!