DEV Community

Ai developer
Posted on • Originally published at t.me

RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)

After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. The results surprised me — and changed how I think about retrieval architecture.

Why This Matters

RAG is everywhere in 2026. Everyone claims their pipeline is "SOTA," but most benchmarks use toy datasets. I wanted to see what actually works when you have:

  • Messy real documents (not clean academic corpora)
  • A local LLM (slightly weaker than GPT-4)
  • Production constraints (latency, cost, accuracy tradeoffs)

The 7 Configurations Tested

| Method | Approach | Qualitative score |
|---|---|---|
| No-RAG | Direct LLM generation | Baseline |
| Classical RAG | Dense retrieval (BGE-small + FAISS) | Poor |
| Hybrid RAG | BM25 + Dense + RRF fusion + cross-encoder reranker | Moderate |
| LightRAG | Key-value extraction graph + dense hybrid | Disappointing |
| PageIndex | Two-stage hierarchical retrieval | Okay |
| GraphRAG | Entity graph + dense fallback | Complex |
| Agentic RAG | Multi-step reasoning pipeline | Slow, expensive |
| SEQUOIA | RAPTOR tree + step-back prompting | Best |
| SEQUOIA Pro | Multi-query + rerank + compression | SOTA |
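
The "RRF fusion" step in the Hybrid RAG row can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the code from the benchmark repository; the function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why RRF is a convenient way to combine lexical and dense retrievers without having to calibrate their raw scores against each other.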

What Surprised Me

LightRAG underperformed

The Twitter-hyped "graph RAG revolution" didn't hold up on real data. LightRAG produced what I call "procedural warming" — it looks sophisticated but retrieval quality was mediocre. Academic benchmarks ≠ production reality.

Step-back prompting is underrated

Most RAG systems fail because they retrieve on the literal query. Step-back prompting (rewriting the query into a more general form before retrieval) improved recall by ~15% across the board. Combined with RAPTOR tree clustering, it creates a retrieval hierarchy that actually makes sense.
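
As a rough sketch of what step-back rewriting might look like in code: the prompt wording, the function names, and the choice to keep the original query alongside the generalized one are all assumptions, not the SEQUOIA implementation. The `llm` argument is any callable that maps a prompt string to a completion string.

```python
STEP_BACK_TEMPLATE = (
    "You are helping a retrieval system.\n"
    "Rewrite the question below as a more general question about the "
    "underlying principle or concept. Return only the rewritten question.\n\n"
    "Question: {question}\n"
    "General question:"
)


def build_step_back_prompt(question):
    """Fill the (hypothetical) step-back template with the user question."""
    return STEP_BACK_TEMPLATE.format(question=question.strip())


def step_back_queries(question, llm):
    """Return the original query plus its generalized form, if one was produced."""
    original = question.strip()
    general = llm(build_step_back_prompt(original)).strip()
    queries = [original]
    # Skip the rewrite if the model returned nothing or just echoed the input.
    if general and general.lower() != original.lower():
        queries.append(general)
    return queries


def fake_llm(prompt):
    """Stand-in for a local model so the sketch runs without any backend."""
    return "How are fees for international card payments determined?"


queries = step_back_queries(
    "  What fee applies to a 200 EUR card payment made abroad?  ", fake_llm
)
print(queries)
```

Retrieving on both queries and merging the results is one possible design; retrieving only on the generalized query is another.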

Local LLMs can evaluate

I used a local model for summarization and judging. Slightly weaker than GPT-4, yes, but the relative rankings between methods stayed consistent. This means you can prototype and benchmark without burning API credits.
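
One way to check that two judges agree on relative rankings is a rank correlation such as Kendall's tau. The sketch below is an assumption about how such a check could be done, not the procedure used in the benchmark, and the scores in it are hypothetical placeholders rather than measured results.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two judges' scores per method."""
    methods = sorted(scores_a)
    if methods != sorted(scores_b):
        raise ValueError("both judges must score the same methods")
    if len(methods) < 2:
        raise ValueError("need at least two methods to compare rankings")
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        # Positive product: both judges order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the local judge is harsher overall but orders methods the same.
local_judge = {"method_a": 0.55, "method_b": 0.63, "method_c": 0.78}
reference_judge = {"method_a": 0.61, "method_b": 0.70, "method_c": 0.85}

tau = kendall_tau(local_judge, reference_judge)
print(tau)
```

A tau close to 1 means the two judges order the methods the same way even when their absolute scores differ, which is the property that matters when comparing pipelines against each other.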

SEQUOIA Architecture

User Query
    ↓
Step-back Prompting (generalize)
    ↓
RAPTOR Tree Retrieval (hierarchical clusters)
    ↓
Rerank + Context Compression
    ↓
Local LLM Generation

RAPTOR = Recursive Abstractive Processing for Tree-Organized Retrieval. Cluster leaf nodes, summarize upward, retrieve at multiple levels of abstraction.
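
To make "cluster leaf nodes, summarize upward" concrete, here is a toy sketch of building such a tree. It makes two loud simplifications: adjacent chunks are grouped in fixed-size batches instead of being clustered on embeddings, and `toy_summarize` is a placeholder where an LLM summarizer would go. All names are hypothetical.

```python
def build_summary_tree(chunks, summarize, group_size=3):
    """Return a list of levels: levels[0] holds leaf chunks, levels[-1] the root."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so the tree shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Stand-in for clustering: batch adjacent nodes together.
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        # Each group is condensed into one parent node on the next level up.
        levels.append([summarize(group) for group in groups])
    return levels


def all_nodes(levels):
    """Flatten the tree into (depth, text) pairs so every level is searchable."""
    return [(depth, text) for depth, nodes in enumerate(levels) for text in nodes]


def toy_summarize(texts):
    """Placeholder summarizer; a real pipeline would call an LLM here."""
    return "summary(" + "; ".join(texts) + ")"


levels = build_summary_tree([f"chunk {i}" for i in range(7)], toy_summarize)
print([len(level) for level in levels])
print(len(all_nodes(levels)))
```

Indexing the flattened node list lets a single search return either a raw chunk or a higher-level summary, depending on which matches the query best.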

Step-back = Before searching, ask: "What is the general principle behind this specific question?"
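
The stages in the diagram above can be wired together as a simple chain of callables. This is only an orchestration sketch under assumed interfaces: the `Pipeline` class, its field names, and the stub stages are hypothetical and do not mirror the repository's code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    generalize: Callable[[str], str]               # step-back prompting
    retrieve: Callable[[str], List[str]]           # tree retrieval
    rerank: Callable[[str, List[str]], List[str]]  # reorder candidates
    compress: Callable[[str, List[str]], str]      # context compression
    generate: Callable[[str, str], str]            # local LLM generation

    def answer(self, query: str, top_k: int = 3) -> str:
        general = self.generalize(query)
        candidates = self.retrieve(general)
        # Rerank against the original query so the specifics are not lost.
        ranked = self.rerank(query, candidates)[:top_k]
        context = self.compress(query, ranked)
        return self.generate(query, context)


# Stub stages so the sketch runs end to end without models or indexes.
pipeline = Pipeline(
    generalize=lambda q: "general: " + q,
    retrieve=lambda q: ["c", "a", "b", "d"],
    rerank=lambda q, docs: sorted(docs),
    compress=lambda q, docs: " ".join(docs),
    generate=lambda q, ctx: f"{q} -> {ctx}",
)

result = pipeline.answer("x")
print(result)
```

Keeping each stage behind a plain callable makes it easy to swap one component (for example, a different reranker) while benchmarking the rest unchanged.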

Results

On my test set (banking documents, technical manuals, internal wikis):

| Method | Precision | Recall | Latency |
|---|---|---|---|
| Classical RAG | 0.62 | 0.58 | 120 ms |
| Hybrid RAG | 0.71 | 0.65 | 340 ms |
| LightRAG | 0.59 | 0.61 | 890 ms |
| SEQUOIA | 0.84 | 0.79 | 450 ms |
| SEQUOIA Pro | 0.87 | 0.82 | 680 ms |

SEQUOIA Pro trades latency for accuracy: roughly 230 ms more per query buys a 0.03 gain in both precision and recall. SEQUOIA (basic) is the sweet spot for production.
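
To compare the rows on a single number, the precision and recall figures from the table can be combined into F1. The snippet below only does that arithmetic on the reported values; note that an F1 computed from aggregate precision and recall is an approximation and may differ from a per-query average, which is not reported here.

```python
# (precision, recall, latency in ms) as reported in the results table.
results = {
    "Classical RAG": (0.62, 0.58, 120),
    "Hybrid RAG": (0.71, 0.65, 340),
    "LightRAG": (0.59, 0.61, 890),
    "SEQUOIA": (0.84, 0.79, 450),
    "SEQUOIA Pro": (0.87, 0.82, 680),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (p, r, latency_ms) in results.items():
    print(f"{name:<14} F1={f1(p, r):.3f}  latency={latency_ms} ms")

# Extra latency paid for moving from SEQUOIA to SEQUOIA Pro.
extra_latency_ms = results["SEQUOIA Pro"][2] - results["SEQUOIA"][2]
print(f"SEQUOIA Pro adds {extra_latency_ms} ms per query")
```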

Code & Reproducibility

Everything is open source:

🔗 github.com/Diyago/rag-benchmark

  • All 7 implementations
  • Evaluation dataset (anonymized)
  • Configs for local LLM setup
  • Notebooks for analysis

Lessons for Production

  1. Don't trust academic benchmarks blindly. Test on YOUR data.
  2. Hierarchical retrieval beats flat. RAPTOR's tree lets retrieval match a query at the right level of abstraction, from raw chunks up to cluster summaries.
  3. Query rewriting is cheap performance. Step-back prompting adds one short LLM call before retrieval and improves retrieval significantly.
  4. Local evaluation is viable. You don't need GPT-4 to compare methods relatively.
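
For lesson 1, testing on your own data can start with something as small as precision and recall at k over a handful of hand-labeled queries. The sketch below assumes document-level relevance labels and a retriever that returns a ranked list of unique IDs; how the benchmark itself computed its metrics is not stated, so treat this as one reasonable definition rather than the one used above.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved document IDs.

    retrieved: ranked list of unique document IDs from the pipeline
    relevant:  set of document IDs labeled relevant for the query
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labeled example for a single query.
retrieved_ids = ["d1", "d7", "d3", "d9"]
relevant_ids = {"d1", "d3", "d5"}

precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Averaging these two numbers over a few dozen queries drawn from real documents is usually enough to see whether one pipeline is clearly ahead of another.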

What's Next

I'm extending SEQUOIA with:

  • Multi-modal retrieval (images + text)
  • Streaming context compression
  • Adaptive depth (shallow for simple queries, deep for complex)

More AI Engineering Notes

I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works vs. what's just hype.

Telegram channel (Russian, technical): AI.Insaf


Have you benchmarked RAG on real data? What surprised you? Drop a comment or reach out on Telegram.


This article is also published on the AI.Insaf Telegram channel, which covers AI/ML in banking practice, benchmarks, and managing DS teams.

Subscribe to the channel for timely breakdowns and practical case studies: https://t.me/ai_tablet