How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

Charles — Mon, 08 Jun 2026 18:14:35 +0000

Most "chat with your documents" demos work in an afternoon. Then you hit the last
20%: retrieval that misses the right passage, an LLM that confidently makes things
up, a reranker that wrecks your latency, chunking you re-tune ten times. And if
your documents are sensitive — legal, medical, internal — you can't just paste
them into a cloud API.

So I built a fully local RAG pipeline and, more importantly, a reproducible
benchmark to prove it actually works. Everything runs on the machine. No
OpenAI, no Anthropic, no Cohere. Here's the stack, the numbers, and what actually
moved them.

The stack (all local, permissively licensed)

Embeddings: Qwen3-Embedding-0.6B (bge-m3 as a fallback)
Vector store: Qdrant in local/embedded mode (no Docker)
Retrieval: dense + sparse BM25, fused with Reciprocal Rank Fusion (RRF)
Reranker: a cross-encoder (MiniLM) over the top-k
LLM: Gemma3:4b via Ollama
Eval judge: the same local LLM (so even evaluation makes zero external calls)

The targets (from current RAG benchmarks)

I wanted pass/fail thresholds, not vibes:

Metric	Target
Hit Rate@5	≥ 0.90
MRR	≥ 0.75
Context Precision@3	≥ 0.70
Context Recall	≥ 0.85
Faithfulness	≥ 0.90
Answer Relevancy	≥ 0.85
Retrieval latency (p50)	≤ 1.0s
End-to-end (p50)	≤ 8.0s

What actually moved the numbers

Starting from a naive dense-only baseline (5/9 passing), four changes did the work:

Hybrid + RRF took Hit Rate@5 from 0.90 (dense only) to 1.0. Keyword matching catches what embeddings miss, and vice versa.
The reranker took Context Precision@3 from 0.45 → 0.89. The single biggest precision lever. Cross-encoders are slow, so it only runs on the top-k.
A strict prompt ("answer ONLY from the context; if it's not there, say you don't know") plus temperature 0.1 took Faithfulness from 0.62 → 1.0. Most "hallucination" is really a prompt + retrieval problem.
Putting Ollama on the GPU cut end-to-end p50 from 14s → 6.5s.

Results (validated at 3 scales)

To rule out "it only works because the corpus is tiny", I ran it on 42, 124, and
274 questions with chunk-level ground truth. Scores stayed flat-to-rising as the
corpus grew 16×:

Metric	42Q	124Q	274Q
Hit Rate@5	1.00	1.00	1.00
MRR	0.95	0.98	0.98
Context Precision@3	0.89	0.92	0.93
Faithfulness	1.00	0.99	0.97
Answer Relevancy	0.88	0.90	0.92

9/9 at every scale.

Lessons

Measure first. Without an eval harness, you optimize blind. The retrieval metrics alone (no LLM) run in seconds and catch most regressions.
"Hallucination" is usually retrieval. If faithfulness is fine but relevancy is low, your problem is upstream in retrieval, not the model.
Local is a feature, not a compromise. For sensitive data it's the only option, and a small local stack hits production-grade numbers in 2026.

Want the whole thing done for you?

I packaged the full pipeline — code, the eval suite, 13 input formats, metadata
filters, a CLI and a Streamlit UI, 60+ tests, docs — as a one-time download so
you can skip the weeks of tuning: https://buy.polar.sh/polar_cl_XV4ksHBnFjkEGMnKLzFc2HFB16agYFEORQ0Ov3oo7HK

Either way, happy to answer questions about the stack or the eval methodology in
the comments.