- Book: RAG Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture this: a user types KB-2024-7831 into your support bot. The dense retriever returns three articles about ticket-tracking systems, none of which contain the string KB-2024-7831. The BM25 retriever next to it returns the exact article in position one. You have been A/B testing the dense retriever for weeks and were about to make it the default. A screenshot of that one query, dropped into team Slack, kills the rollout.
Hybrid search is the phrase you will hear at every RAG talk in 2026, and the reason is boring: pure dense retrieval is bad at proper nouns and exact identifiers, pure BM25 is bad at paraphrase, and stitching them with Reciprocal Rank Fusion recovers both signals at the cost of about 6ms per query. There is no version of this trade where either pure approach wins.
Why dense alone misses
Embeddings collapse meaning. That is the feature. The same embedding bucket holds "VPN keeps disconnecting" and "I can't stay connected to the corporate network" and "tunnel drops every 30s." It is also the failure mode. The bucket holding "KB-2024-7831" looks identical to the bucket holding "KB-2024-7832" and "ticket TX-447811." The model has no incentive to keep proper-noun strings separable in vector space; nothing in its training rewarded that.
This shows up as a precision problem on long-tail queries:
- Product SKUs, error codes, version numbers, ticket IDs.
- Company names, person names, place names that did not appear often in the embedder's training corpus.
- API method names, function names, command-line flags.
- Anything that is meaningful as a string, not as a concept.
On a typical support corpus you will see something like this: roughly a fifth of queries contain at least one identifier the embedder cannot separate from a hundred other identifiers. Recall on the identifier-bearing queries can collapse into the 30% range while recall on the rest sits comfortably in the 70s. The aggregate looks fine. That identifier-bearing fifth is where every escalated ticket lives. (Illustrative numbers, but the shape is the one teams keep reporting.)
Why BM25 alone misses
Lexical retrieval is the opposite failure. BM25 scores documents by exact term overlap with the query, weighted by inverse document frequency. It is unbeatable at finding KB-2024-7831 because the rare token has a huge IDF and pins the matching document to position one.
It also has zero idea that "VPN keeps disconnecting" and "tunnel drops every 30s" are the same question. If your knowledge base says "tunnel drops" and the user typed "VPN keeps disconnecting," BM25 returns nothing useful. That is the classic BM25 paraphrase failure: high precision when the words match, zero recall when they do not.
A team running pure BM25 on developer documentation will see this every day. Users phrase questions in their own vocabulary; the docs use the team's vocabulary; the overlap is partial. Poor recall on paraphrased queries is the failure mode that drove the entire dense-retrieval wave in 2022-2023.
The fusion pattern
The fix is to run both retrievers and merge their result lists by rank, not by raw score. BM25 scores are unbounded positive numbers tied to corpus statistics. Cosine similarities live in [-1, 1] but cluster narrowly in practice. Score normalization to combine them is fragile and locale-specific.
Reciprocal Rank Fusion (Cormack, Clarke, and Buettcher, SIGIR 2009) sidesteps the problem. For each document d, compute a fused score that depends only on its rank in each list:
RRF_score(d) = sum over rankers r of 1 / (k + rank_r(d))
The constant k is conventionally 60. It softens the contribution of top-1 hits relative to top-10, which empirically helps when one retriever is overconfident. Sort by fused score, take top-N, you have your context.
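To make the formula concrete, here is a worked example with k = 60 (a minimal sketch; the chunk labels and ranks are made up):

```python
# Worked RRF example with k = 60. Ranks are 1-based.
K = 60

# Chunk A: rank 3 in the dense list, rank 5 in the sparse list.
score_a = 1 / (K + 3) + 1 / (K + 5)   # ~0.0159 + ~0.0154 = ~0.0313

# Chunk B: rank 1 in the dense list, absent from the sparse list.
score_b = 1 / (K + 1)                  # ~0.0164

# A moderate match in both lists outranks a top match in one.
assert score_a > score_b
```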
The headline shape, as reported in one practitioner writeup, is a recall@10 jump from the high-60s-to-high-70s range you see with either retriever alone into the 90s on hybrid + RRF. Treat the exact percentages as corpus-dependent, but the direction is what teams keep reporting. The fusion step itself is essentially free.
The 50-line implementation
Postgres with pgvector for dense, plus tsvector for BM25-like full-text. Same database. No second service to operate.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    TEXT NOT NULL,
    text      TEXT NOT NULL,
    embedding VECTOR(1536),
    text_tsv  TSVECTOR
);

CREATE INDEX chunks_emb_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);

CREATE INDEX chunks_tsv_idx
    ON chunks USING gin (text_tsv);
```
Populate text_tsv on insert with to_tsvector('english', text). The HNSW index handles dense ANN. The GIN index handles full-text. Postgres's full-text search is not BM25 by default; it uses ts_rank_cd, a normalized term-frequency score. For hybrid retrieval the rank ordering is what RRF reads, so close enough. If you need true BM25, the pg_search extension from ParadeDB adds it; the code below works either way.
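If you would rather not populate `text_tsv` from application code, a stored generated column (Postgres 12 and later) keeps it in sync automatically; a sketch of that variant of the schema above:

```sql
-- Variant: text_tsv is derived from text by the database itself,
-- so inserts and updates never need to compute it in the application.
CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    TEXT NOT NULL,
    text      TEXT NOT NULL,
    embedding VECTOR(1536),
    text_tsv  TSVECTOR
        GENERATED ALWAYS AS (to_tsvector('english', text)) STORED
);
```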
The retrieval function:
```python
import asyncpg
from openai import AsyncOpenAI

oai = AsyncOpenAI()
RRF_K = 60


async def embed(text: str) -> list[float]:
    r = await oai.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        # text-embedding-3-large defaults to 3072 dimensions; request 1536
        # so the output matches the VECTOR(1536) column above.
        dimensions=1536,
    )
    return r.data[0].embedding


async def hybrid_search(pool, query, top_n=10):
    qvec = await embed(query)
    # asyncpg has no codec for the pgvector type by default, so pass the
    # embedding as a pgvector text literal and cast it in the query.
    # (Alternatively, call pgvector.asyncpg.register_vector on the
    # connection and pass the list directly.)
    qvec_literal = "[" + ",".join(map(str, qvec)) + "]"

    async with pool.acquire() as conn:
        # Dense candidates: cosine distance, smaller is better.
        dense = await conn.fetch(
            "SELECT id, text "
            "FROM chunks "
            "ORDER BY embedding <=> $1::vector "
            "LIMIT 50",
            qvec_literal,
        )
        # Sparse candidates: full-text match, ranked by ts_rank_cd.
        sparse = await conn.fetch(
            "SELECT id, text "
            "FROM chunks "
            "WHERE text_tsv @@ plainto_tsquery('english', $1) "
            "ORDER BY ts_rank_cd("
            "    text_tsv, plainto_tsquery('english', $1)"
            ") DESC "
            "LIMIT 50",
            query,
        )

    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank),
    # with ranks starting at 1.
    fused: dict[int, float] = {}
    texts: dict[int, str] = {}
    for rank, row in enumerate(dense):
        fused[row["id"]] = fused.get(row["id"], 0.0) + 1.0 / (RRF_K + rank + 1)
        texts[row["id"]] = row["text"]
    for rank, row in enumerate(sparse):
        fused[row["id"]] = fused.get(row["id"], 0.0) + 1.0 / (RRF_K + rank + 1)
        texts[row["id"]] = row["text"]

    ranked = sorted(fused.items(), key=lambda kv: -kv[1])[:top_n]
    return [(cid, texts[cid], score) for cid, score in ranked]
```
That is the full hybrid retrieval. The dense and sparse queries run back to back on the same connection, the fusion is a dict and a sort, and the top-N comes back ranked.
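Calling it is one pool and one await (a minimal sketch; the DSN and query string are placeholders):

```python
import asyncio

async def main():
    # Placeholder DSN; point it at the database holding the chunks table.
    pool = await asyncpg.create_pool("postgresql://localhost/ragdb")
    try:
        hits = await hybrid_search(pool, "how do I fix KB-2024-7831?", top_n=5)
        for chunk_id, text, score in hits:
            print(f"{score:.4f}  {chunk_id}  {text[:80]}")
    finally:
        await pool.close()

asyncio.run(main())
```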
A few production notes that bite teams the first time they ship this:
- Pull 50 from each retriever, fuse, take top 10. Pulling fewer from each side starves RRF: the value of fusion is that a document weakly matching both retrievers can outrank a document strongly matching one, and both lists need to be deep enough to capture those.
- `<=>` is cosine distance, so smaller is better and `ORDER BY embedding <=> $1` already gives you ranked-best-first. Make sure your embedder produces normalized vectors or you get garbage.
- `plainto_tsquery` is forgiving but lossy. It strips operators. For developer-doc corpora with code snippets, `websearch_to_tsquery` handles quoted phrases better (see the example after this list).
- Index size matters. HNSW indexes are RAM-hungry. On a 5M-chunk corpus, expect 8-15GB of HNSW index. Plan accordingly. The GIN index for tsvector is comparatively cheap.
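To make the `plainto_tsquery` vs `websearch_to_tsquery` point concrete, compare how each parses a quoted phrase (a sketch; the query string is made up and the outputs are shown as comments):

```sql
-- plainto_tsquery drops the quotes and simply ANDs the stemmed terms:
SELECT plainto_tsquery('english', '"tunnel drops" vpn');
--  'tunnel' & 'drop' & 'vpn'

-- websearch_to_tsquery keeps the quoted phrase as an adjacency match:
SELECT websearch_to_tsquery('english', '"tunnel drops" vpn');
--  'tunnel' <-> 'drop' & 'vpn'
```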
Adding a reranker
The recall@10 lift above is on raw fused results. If your generator can only handle top-3 in context, you want a reranker between fusion and the LLM. A cross-encoder reranker (Cohere Rerank 3, Voyage rerank-2, or a local bge-reranker) takes the fused top-50, scores (query, chunk) pairs jointly, and reorders.
```python
async def rerank(query, candidates, top_k=3):
    # candidates: (chunk_id, text, fused_score) tuples from hybrid_search.
    pairs = [(query, c[1]) for c in candidates]
    # `reranker` is whichever cross-encoder client you wired up; an async
    # .score(pairs) method returning one relevance score per pair is assumed.
    scores = await reranker.score(pairs)
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: -x[1],
    )[:top_k]
    return [c for c, _ in ranked]
```
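As one concrete option for that `reranker` handle, a minimal sketch of a local cross-encoder using sentence-transformers and the open `BAAI/bge-reranker-base` model (the `score_pairs` helper is mine, not a library API; in the async path you would call it via `asyncio.to_thread`):

```python
from sentence_transformers import CrossEncoder

# Loads once at startup; bge-reranker-base is a small open cross-encoder.
_reranker = CrossEncoder("BAAI/bge-reranker-base")

def score_pairs(query: str, texts: list[str]) -> list[float]:
    # CrossEncoder.predict scores each (query, text) pair jointly and
    # returns one relevance score per pair.
    return _reranker.predict([(query, t) for t in texts]).tolist()
```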
The reranker adds 50-150ms depending on model and batch size. On a 500ms-2s LLM call this is invisible. The precision lift is large: in my experience a 10-15 point precision@3 improvement on top of hybrid is typical.
When RRF is the wrong fusion
RRF is the default for a reason: rank-only, no normalization, almost no parameters. It is also a mediocre fit for two situations.
First, when you have score-calibrated retrievers. If both your dense and sparse scores have known meaning (say, a cross-encoder rerank score calibrated to relevance probability), throwing away the score in favor of rank is information loss. Use a weighted score sum and tune the weight on a held-out set.
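A minimal sketch of that shape, assuming both retrievers already hand back calibrated scores in [0, 1] (the 0.6 weight is a placeholder you would tune on a held-out set):

```python
def weighted_fusion(dense, sparse, w_dense=0.6, top_n=10):
    # dense / sparse: {chunk_id: calibrated relevance score in [0, 1]}.
    fused = {
        cid: w_dense * dense.get(cid, 0.0)
        + (1 - w_dense) * sparse.get(cid, 0.0)
        for cid in dense.keys() | sparse.keys()
    }
    return sorted(fused.items(), key=lambda kv: -kv[1])[:top_n]
```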
Second, when you have more than two retrievers. RRF with three retrievers (dense, sparse, knowledge-graph) gets noisy because the rank denominator dominates. Use a learned weighting (logistic regression on rank features) or a small reranker over the merged list.
In practice, two-retriever RRF covers >90% of production RAG workloads. Reach for the alternatives only when you have evidence the two-retriever shape is leaving recall on the table.
What to measure
Three numbers tell you whether hybrid is working.
- Recall@10 by query class. Slice into "contains identifier" and "no identifier" buckets. Hybrid should improve the first bucket dramatically with no regression on the second.
- Per-retriever contribution rate. Of the top-10 fused results, what fraction came from dense-only, sparse-only, or both? Healthy hybrid sits around 30/30/40. If it is 80/10/10 you are paying for two retrievers and using one.
- Tail latency. p95 retrieval latency should not exceed your dense-only baseline by more than 30-50ms. If it does, your sparse query is missing an index or pulling too many candidates.
You can ship a hybrid retriever that silently degrades to dense-only and never notice on the aggregate metric. Picture the failure mode: the sparse index is missing, so the sparse query returns empty for every request, and the fusion quietly collapses to whatever the dense side already had. Aggregate recall looks unchanged. The contribution-rate slice is the only place this shows up.
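A minimal sketch of that contribution-rate slice, assuming you log the IDs each retriever returned alongside the fused top-10 (the helper name is mine):

```python
def contribution_rates(dense_ids, sparse_ids, fused_ids):
    # dense_ids / sparse_ids: candidate IDs from each retriever;
    # fused_ids: the fused top-N actually sent to the generator.
    dense_set, sparse_set = set(dense_ids), set(sparse_ids)
    n = len(fused_ids) or 1
    both = sum(1 for c in fused_ids if c in dense_set and c in sparse_set)
    dense_only = sum(1 for c in fused_ids if c in dense_set and c not in sparse_set)
    sparse_only = sum(1 for c in fused_ids if c in sparse_set and c not in dense_set)
    return {
        "dense_only": dense_only / n,
        "sparse_only": sparse_only / n,
        "both": both / n,
    }
```

If the numbers drift toward 1/0/0 over a day of traffic, the sparse side has stopped contributing and the silent-degradation failure above is live.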
The teams that have been shipping RAG quietly since 2024 figured this out the first time their support bot whiffed on a ticket ID. The teams that learned RAG from a 2023 vector-database tutorial are about to figure it out in 2026, on stage, when someone in the audience asks how their retriever handles KB-2024-7831.
If this was useful
The RAG Pocket Guide covers the full retrieval stack: chunking, hybrid retrieval, reranking, evaluation, with the production-tested defaults that keep showing up across teams. If you are shipping RAG and the eval numbers feel like they are bouncing around for no reason, it is the book.