Ai developer

Posted on May 28 • Originally published at t.me

RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)

#ai #career #llm #rag

After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. The results surprised me — and changed how I think about retrieval architecture.

Why This Matters

RAG is everywhere in 2026. Everyone claims their pipeline is "SOTA," but most benchmarks use toy datasets. I wanted to see what actually works when you have:

Messy real documents (not clean academic corpora)
A local LLM (slightly weaker than GPT-4)
Production constraints (latency, cost, accuracy tradeoffs)

The 7 Configurations Tested

Method	Approach	Verdict
No-RAG	Direct LLM generation	Baseline
Classical RAG	Dense retrieval (BGE-small + FAISS)	Poor
Hybrid RAG	BM25 + Dense + RRF fusion + cross-encoder reranker	Moderate
LightRAG	Key-value extraction graph + dense hybrid	Disappointing
PageIndex	Two-stage hierarchical retrieval	Okay
GraphRAG	Entity graph + dense fallback	Complex
Agentic RAG	Multi-step reasoning pipeline	Slow, expensive
SEQUOIA	RAPTOR tree + step-back prompting	Best
SEQUOIA Pro	Multi-query + rerank + compression	SOTA
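
The RRF fusion step in the Hybrid RAG row can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the benchmark's actual code: the function name `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default) are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs, best hit first.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and the fused list is what a cross-encoder reranker would then reorder.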

What Surprised Me

LightRAG underperformed

The "graph RAG revolution" hyped on Twitter didn't hold up on real data. LightRAG showed what I call "procedural warming": the pipeline looks sophisticated, but retrieval quality was mediocre. On my test set it reached 0.59 precision versus 0.62 for Classical RAG, at 890ms versus 120ms. Academic benchmarks ≠ production reality.

Step-back prompting is underrated

Most RAG systems fail because they retrieve on the literal query. Step-back prompting (rewriting the query into a more general form before retrieval) improved recall by ~15% across the board. Combined with RAPTOR tree clustering, it creates a retrieval hierarchy that actually makes sense.

Local LLMs can evaluate

I used a local model for summarization and judging. Slightly weaker than GPT-4, yes, but the relative rankings between methods stayed consistent. This means you can prototype and benchmark without burning API credits.
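
One way to check that two judges agree on relative rankings is a rank correlation such as Kendall's tau. The sketch below is an assumption about how such a check could look, not the procedure used in the benchmark. The function name `kendall_tau` and the scores are placeholders for illustration only, not measured results.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two judges scoring the same set of methods.

    Returns 1.0 when both judges order every pair the same way and -1.0
    when they disagree on every pair. Tied pairs count as neither.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both judges must score the same methods")
    methods = sorted(scores_a)
    if len(methods) < 2:
        raise ValueError("need at least two methods to compare rankings")
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / pairs


# Placeholder scores (NOT benchmark results): same ordering, different scale.
local_judge = {"method_a": 0.4, "method_b": 0.6, "method_c": 0.9}
strong_judge = {"method_a": 0.5, "method_b": 0.7, "method_c": 0.8}

tau = kendall_tau(local_judge, strong_judge)
print(tau)
```

A value close to 1.0 on a small shared sample suggests the cheaper judge is good enough for comparing methods, even if its absolute scores differ.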

SEQUOIA Architecture

User Query
    ↓
Step-back Prompting (generalize)
    ↓
RAPTOR Tree Retrieval (hierarchical clusters)
    ↓
Rerank + Context Compression
    ↓
Local LLM Generation

RAPTOR = Recursive Abstractive Processing for Tree-Organized Retrieval. Cluster leaf nodes, summarize upward, retrieve at multiple levels of abstraction.
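
The "cluster, summarize upward" loop can be sketched as follows. This is a simplified illustration under stated assumptions: it groups adjacent chunks in fixed-size batches instead of clustering on embeddings, and `toy_summarize` stands in for an LLM summarization call. The names `build_tree` and `all_nodes` are hypothetical and not taken from the SEQUOIA code.

```python
def build_tree(chunks, summarize, group_size=2):
    """Build a list of levels: level 0 holds the leaf chunks, and each
    higher level holds summaries of groups from the level below.

    Simplification: groups are adjacent fixed-size batches, a stand-in
    for real embedding-based clustering.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        groups = [current[i:i + group_size]
                  for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels


def all_nodes(levels):
    """Flatten the tree so retrieval can score leaves and summaries together."""
    return [(depth, text) for depth, level in enumerate(levels) for text in level]


def toy_summarize(texts):
    # Placeholder for an LLM call that would write an abstractive summary.
    return " / ".join(texts)


levels = build_tree(["a", "b", "c", "d", "e"], toy_summarize)
nodes = all_nodes(levels)
print(len(levels), len(nodes))
```

Indexing every node from `all_nodes`, not just the leaves, is what lets a broad question match a high-level summary while a narrow one still hits a specific chunk.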

Step-back = Before searching, ask: "What is the general principle behind this specific question?"
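
A minimal sketch of the step-back rewrite, assuming the model is exposed as a plain `llm(prompt) -> str` callable. The prompt wording, the function name `step_back_queries`, and the choice to keep the original query alongside the generalized one are assumptions for illustration; `fake_llm` is a canned stand-in for a real model.

```python
STEP_BACK_TEMPLATE = (
    "Rewrite the question below as a more general question about the "
    "underlying principle. Return only the rewritten question.\n\n"
    "Question: {question}"
)


def step_back_queries(question, llm):
    """Return the original question plus its generalized (step-back) form."""
    general = llm(STEP_BACK_TEMPLATE.format(question=question)).strip()
    # Keep the literal query too, so specific details are not lost.
    queries = [question]
    if general and general.lower() != question.lower():
        queries.append(general)
    return queries


def fake_llm(prompt):
    # Stand-in for a real model call; returns a canned generalization.
    return "How are wire transfer fees determined?\n"


queries = step_back_queries("What is the fee for a 500 EUR wire transfer?", fake_llm)
print(queries)
```

Each query in the returned list would then be sent to the retriever, with the results merged before reranking.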

Results

On my test set (banking documents, technical manuals, internal wikis):

Method	Precision	Recall	Latency
Classical RAG	0.62	0.58	120ms
Hybrid RAG	0.71	0.65	340ms
LightRAG	0.59	0.61	890ms
SEQUOIA	0.84	0.79	450ms
SEQUOIA Pro	0.87	0.82	680ms

SEQUOIA Pro trades latency for accuracy: +0.03 precision and +0.03 recall for an extra 230ms per query. SEQUOIA (basic) is the sweet spot for production.
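
To compare methods on a single number, precision and recall from the table can be combined into F1. F1 is not reported in the original results. It is derived here from the table values, and the names `results`, `f1`, and `summary` are illustrative.

```python
# (precision, recall, latency in ms), copied from the results table.
results = {
    "Classical RAG": (0.62, 0.58, 120),
    "Hybrid RAG": (0.71, 0.65, 340),
    "LightRAG": (0.59, 0.61, 890),
    "SEQUOIA": (0.84, 0.79, 450),
    "SEQUOIA Pro": (0.87, 0.82, 680),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


summary = {name: round(f1(p, r), 3) for name, (p, r, _latency) in results.items()}

# Print methods from best to worst F1, with latency for context.
for name in sorted(summary, key=summary.get, reverse=True):
    print(f"{name}: F1={summary[name]:.3f}, latency={results[name][2]}ms")
```

On this derived metric, Classical RAG and LightRAG come out nearly tied (both about 0.60), because LightRAG's higher recall offsets its lower precision.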

Code & Reproducibility

Everything is open source:

🔗 github.com/Diyago/rag-benchmark

All 7 implementations
Evaluation dataset (anonymized)
Configs for local LLM setup
Notebooks for analysis

Lessons for Production

Don't trust academic benchmarks blindly. Test on YOUR data.
Hierarchical retrieval beats flat. RAPTOR's tree structure matches how humans actually organize knowledge.
Query rewriting is nearly free performance. Step-back prompting needs only one extra short LLM call before retrieval, yet it improves retrieval significantly.
Local evaluation is viable. You don't need GPT-4 to compare methods relatively.

What's Next

I'm extending SEQUOIA with:

Multi-modal retrieval (images + text)
Streaming context compression
Adaptive depth (shallow for simple queries, deep for complex)

More AI Engineering Notes

I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works vs. what's just hype.

Telegram channel (Russian, technical): AI.Insaf

Have you benchmarked RAG on real data? What surprised you? Drop a comment or reach out on Telegram.

This article is also published on the AI.Insaf Telegram channel, which covers AI/ML in banking practice, benchmarks, and managing DS teams.

Subscribe to the channel for timely breakdowns and practical case studies: https://t.me/ai_tablet
