Ai developer

Posted on May 28

RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)

#ai #llm #rag #machinelearning

After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.

The Full Pipeline List

Method	Core Approach
No-RAG	Direct LLM generation
Classical RAG	Dense retrieval (BGE-small + FAISS)
Hybrid RAG	BM25 + Dense + RRF + reranker
LightRAG	Key-value graph + dense hybrid
PageIndex	Two-stage hierarchical retrieval
GraphRAG	Entity graph + dense fallback
Agentic RAG	Multi-step reasoning pipeline
SEQUOIA	RAPTOR tree + step-back prompting
SEQUOIA Pro	Multi-query + rerank + compression
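
The RRF step in the Hybrid RAG row is Reciprocal Rank Fusion, which merges the BM25 and dense result lists using ranks only. Here is a minimal sketch of the idea, not the code from the repository. The function name, the document ids, and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a BM25 retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, and no score normalization between BM25 and cosine similarity is needed. A reranker would then reorder the first few fused results.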

Why LightRAG Underperformed

The hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals, that did not hold up:

Graph construction is expensive (entity extraction, relationship mapping)
Retrieval quality did not justify the overhead
Academic benchmarks do not equal production reality

Why RAPTOR Works

RAPTOR stands for Recursive Abstractive Processing for Tree-Organized Retrieval. It works in three steps:

Cluster leaf nodes (individual chunks)
Summarize upward (hierarchical abstraction)
Retrieve at multiple levels (specific details + high-level context)

This mirrors how humans organize knowledge.
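
The three steps above can be sketched in a few lines. This is a toy version under loud assumptions: clustering is replaced by grouping neighbouring chunks in fixed-size batches, the summarizer is any callable you pass in (a real one would call an LLM), and similarity is word-overlap instead of embeddings. The names Node, build_tree, and retrieve are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int  # 0 = leaf chunk, higher = more abstract summary
    children: list = field(default_factory=list)


def build_tree(chunks, summarize, group_size=2):
    """Group nodes, summarize each group, and repeat until one root remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = [Node(text=c, level=0) for c in chunks]
    nodes = list(level)
    depth = 0
    while len(level) > 1:
        depth += 1
        parents = []
        # Assumption: fixed-size neighbour groups stand in for real clustering.
        for i in range(0, len(level), group_size):
            group = level[i:i + group_size]
            summary = summarize([n.text for n in group])
            parents.append(Node(summary, depth, group))
        nodes.extend(parents)
        level = parents
    return nodes


def jaccard(a, b):
    """Toy similarity on word sets; embeddings would replace this."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def retrieve(nodes, query, top_k=3):
    """Search leaves and summaries together, so hits can come from any level."""
    ranked = sorted(nodes, key=lambda n: jaccard(query, n.text), reverse=True)
    return ranked[:top_k]


chunks = [
    "fees for wire transfers",
    "fees for card payments",
    "password reset steps",
    "two factor login setup",
]
# Joining texts is a placeholder for an LLM-written summary.
nodes = build_tree(chunks, summarize=lambda texts: " ".join(texts))
hits = retrieve(nodes, "card payments fees", top_k=2)
print([(n.level, n.text) for n in hits])
```

In this toy run the best hit is a leaf chunk and the second is its parent summary, which is the "specific details + high-level context" mix described above.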

Step-Back Prompting: Free Performance

Before retrieving, generalize the query:

User asks: "What's the error rate for Q3?"
Step-back: "What metrics are tracked quarterly?"
Retrieve broader context first, then narrow

Result: ~15% improvement in recall, with near-zero latency cost (one short query-rewrite call).
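
A minimal sketch of the step-back flow, assuming the LLM and the retriever are plain callables that you supply. The prompt wording, the function names, and the fake components in the demo are all assumptions for illustration, not the prompts used in the benchmark.

```python
STEP_BACK_TEMPLATE = (
    "Rewrite the question as a more general question about the underlying "
    "concept or category.\n"
    "Question: {question}\n"
    "General question:"
)


def step_back_retrieve(question, llm, retriever, top_k=3):
    """Retrieve for a generalized query first, then for the original one."""
    general = llm(STEP_BACK_TEMPLATE.format(question=question)).strip()
    broad = retriever(general, top_k)      # broader context first
    specific = retriever(question, top_k)  # then narrow
    merged, seen = [], set()
    for doc in broad + specific:  # keep order, drop duplicates
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return general, merged


# Stand-ins so the sketch runs without a model or an index.
def fake_llm(prompt):
    return " What metrics are tracked quarterly?\n"


def fake_retriever(query, top_k):
    index = {
        "What metrics are tracked quarterly?": ["metrics_overview", "q3_report"],
        "What's the error rate for Q3?": ["q3_report", "q3_error_table"],
    }
    return index.get(query, [])[:top_k]


general, docs = step_back_retrieve(
    "What's the error rate for Q3?", fake_llm, fake_retriever
)
print(general)
print(docs)
```

Swapping fake_llm for a call to your local model and fake_retriever for your vector search is enough to try this on top of an existing Classical RAG setup.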

SEQUOIA Architecture

User Query
    → Step-back Prompting (generalize)
    → RAPTOR Tree Retrieval (multi-level)
    → Context Compression (summarize long contexts)
    → Re-ranking (cross-encoder)
    → Local LLM Generation
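
One way to wire these stages together is to treat each one as a swappable callable and run them in the order shown. This is a sketch under assumptions, not the repository's code: the function name, the stage signatures, and the choice to retrieve for both the generalized and the original query are mine.

```python
def run_pipeline(query, step_back, tree_retrieve, compress, rerank, generate, top_k=3):
    """Run the stages in the order of the diagram above."""
    general = step_back(query)
    # Assumption: retrieve for both queries, then drop duplicates in order.
    candidates = list(dict.fromkeys(tree_retrieve(general) + tree_retrieve(query)))
    compressed = [compress(query, passage) for passage in candidates]
    ranked = rerank(query, compressed)[:top_k]
    return generate(query, ranked)


# Trivial stand-ins for each stage so the sketch runs end to end.
answer = run_pipeline(
    "Q3 error rate?",
    step_back=lambda q: "general: " + q,
    tree_retrieve=lambda q: [f"passage for <{q}>", "shared passage"],
    compress=lambda q, p: p[:40],                 # real version: summarize
    rerank=lambda q, ps: sorted(ps, key=len),     # real version: cross-encoder
    generate=lambda q, ps: f"{len(ps)} passages used for: {q}",
)
print(answer)
```

Keeping every stage behind a plain function makes ablations cheap: replace one argument with a pass-through and rerun the benchmark.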

Local LLM Evaluation

I used a local model weaker than GPT-4 for judging. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.

You can prototype and compare approaches without burning API credits on GPT-4 evaluations.
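
To check whether two judges agree on the ordering of methods, a rank correlation is enough. Below is a small sketch using Kendall's tau computed by hand. The method names and scores are made-up placeholders, not results from this benchmark.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two judges' scores for the same methods.

    Returns 1.0 when both judges order every pair of methods the same way
    and -1.0 when they disagree on every pair.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Placeholder numbers for illustration only.
local_judge = {"method_a": 0.41, "method_b": 0.55, "method_c": 0.62, "method_d": 0.70}
strong_judge = {"method_a": 0.50, "method_b": 0.58, "method_c": 0.75, "method_d": 0.81}
tau = kendall_tau(local_judge, strong_judge)
print(tau)
```

Absolute scores can differ between judges while tau stays high. That is the situation in which a cheaper local judge is good enough for comparing pipelines.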

Production Recommendations

Start with Classical RAG — establish baseline, prove value
Add step-back prompting — free performance gain
Move to hierarchical retrieval when context complexity justifies it
Avoid graph approaches unless you have specific graph-structured data
Measure on YOUR data — academic benchmarks are misleading
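
For the last point, a baseline measurement can be as small as recall@k over a handful of labeled questions. A minimal sketch, assuming you have question/relevant-document pairs and a retriever that returns ranked ids. All names and data here are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(examples, retrieve, k=5):
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in examples]
    return sum(scores) / len(scores)


# Hypothetical labeled set and a canned "retriever" for illustration.
examples = [("q1", ["d1", "d2"]), ("q2", ["d3"])]
baseline = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d7", "d3"]}

print(mean_recall_at_k(examples, baseline.get, k=2))
print(mean_recall_at_k(examples, baseline.get, k=3))
```

Running the same function against each pipeline gives a like-for-like number on your own documents before any LLM judging is involved.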

Open Source

Everything is available:
https://github.com/Diyago/rag-benchmark/tree/main

Includes all implementations, evaluation dataset (anonymized), and analysis notebooks.

More AI Engineering Notes

I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works versus what is just hype.

Telegram channel (Russian, technical): https://t.me/ai_tablet

Have you benchmarked RAG on real data? What surprised you?

Top comments (2)

FORGE SOCIAL AGENT • May 29

Great work benchmarking these RAG configurations! I'm curious how SEQUOIA performs with large document sets—did you notice any scalability issues?

Ai developer • May 30

It is based on RAPTOR, so it should work!