You watched a team ship a RAG service for an internal code-search tool. Two weeks of work, a $180 embedding bill against a 14k-document corpus. Then someone typed ERR_TLS_CERT_ALTNAME_INVALID into the search box and got back five chunks about TLS in general, none of them the runbook with that exact string in the title. The runbook ranked 31st. The cosine search smeared the error code into a neighborhood of TLS prose.
You know what fixes that. A to_tsvector column, a GIN index, ten lines of SQL, zero embedding cost. The same query lands the runbook at rank one. So the question is not whether BM25 still works. The question is when it is the only retriever you actually need.
What the benchmarks still say in 2026
Dense retrieval has pulled ahead of BM25 on the average BEIR score. The original BEIR paper (Thakur et al., 2021) showed BM25 as a strong zero-shot baseline that beat several dense models on out-of-domain corpora. Five years on, recent leaderboards show retrieval-tuned models like e5-large-instruct and the current MTEB leaders outperforming BM25 by 15-25% on the BEIR average, with hybrid approaches adding another 2-5% on top (BEIR leaderboard summary).
Look at the per-corpus breakdown and the picture is messier. BM25 still wins or ties on the corpora where queries are short and lexical, where domain shift is large, or where exact strings carry the meaning. Touché-2020 (argument retrieval), BioASQ (biomedical), and Signal-1M (news with rare entities) are the classic cases where dense models trained on MS MARCO never close the gap. The original BEIR analysis flagged this in 2021 and the qualitative shape has held.
The practitioner takeaway: averages hide which corpus you actually have. If your corpus has the shape BM25 is good at, paying an embedding API to index it buys you nothing.
The decision rule, before you write any code
Tokenize a representative batch of 200-500 queries from your logs. Tokenize your documents the same way. Compute the average overlap rate: for each query, what fraction of its content tokens appear verbatim in the documents that should match it.
- Overlap above ~70%: BM25 alone is likely your best retriever. Dense will not add enough to pay for the embedding pipeline.
- Overlap 40-70%: hybrid (BM25 + dense, fused with Reciprocal Rank Fusion) is the safe call. Each retriever covers the other's blind spots.
- Overlap below 40%: dense retrieval probably wins on its own; BM25 will mostly add noise. Common in conversational support corpora where users describe symptoms in their own vocabulary.
These thresholds are heuristics, not guarantees. Run an eval. But the cheap tokenize-and-overlap measurement is the right thing to look at before you reach for an embedding API key.
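Here is that measurement as a minimal sketch. The regex tokenizer, stopword list, and example pairs are stand-ins; for fidelity, tokenize both sides with the same analyzer you will search with.

```python
import re

# Stand-in tokenizer and stopword list; for fidelity, tokenize both
# sides with the analyzer you will search with (e.g. the output of
# Postgres's to_tsvector('english', ...)).
STOP = {"the", "a", "an", "of", "to", "in", "for", "on", "is", "how"}

def tokens(text: str) -> set[str]:
    return {t for t in re.findall(r"[a-z0-9_]+", text.lower()) if t not in STOP}

def overlap_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (query, text of a document that should match it)."""
    rates = []
    for query, doc in pairs:
        q = tokens(query)
        if q:
            rates.append(len(q & tokens(doc)) / len(q))
    return sum(rates) / len(rates)

# Hypothetical labelled pairs pulled from query logs.
pairs = [
    ("ERR_TLS_CERT_ALTNAME_INVALID",
     "Runbook: resolving ERR_TLS_CERT_ALTNAME_INVALID on the edge proxy"),
    ("rotate expired tls certificate",
     "How to renew a TLS certificate before expiry"),
]
print(f"{overlap_rate(pairs):.0%}")
```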
The Postgres-only setup
Here is the schema. One table, one tsvector column, no vector index in sight.
```sql
CREATE TABLE chunks (
    id      BIGSERIAL PRIMARY KEY,
    doc_id  TEXT NOT NULL,
    content TEXT NOT NULL,
    ts      TSVECTOR
            GENERATED ALWAYS AS
            (to_tsvector('english', content)) STORED
);

CREATE INDEX chunks_ts_idx
    ON chunks USING GIN (ts);
```
The ts column is a stored generated column, so Postgres recomputes it whenever content changes. No triggers to maintain. The GIN index on ts gives you sub-millisecond lookup against tsqueries on a 1k-document corpus, and tens of milliseconds at the 1M-chunk scale.
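Ingest, for completeness, is a plain INSERT. A minimal sketch with psycopg, where `chunk_rows` stands in for whatever your chunker emits:

```python
import os

import psycopg

# Hypothetical output of your chunker: (doc_id, content) pairs.
chunk_rows = [
    ("runbook-114", "ERR_TLS_CERT_ALTNAME_INVALID: regenerate the cert ..."),
    ("runbook-115", "TLS handshake timeouts behind the corporate proxy ..."),
]

with psycopg.connect(os.environ["PG_DSN"]) as conn:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO chunks (doc_id, content) VALUES (%s, %s)",
            chunk_rows,
        )
    # The ts column is populated by Postgres on insert; nothing else to do.
```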
Retrieval is one query.
```sql
SELECT id, content,
       ts_rank_cd(ts, plainto_tsquery('english', $1)) AS s
FROM chunks
WHERE ts @@ plainto_tsquery('english', $1)
ORDER BY s DESC
LIMIT $2;
```
ts_rank_cd is Postgres's cover-density score, not strict BM25, but the rank ordering tracks closely enough for most corpora. If you need true BM25, the pg_search extension from ParadeDB swaps it in without changing the application code.
The Python wrapper is unremarkable.
```python
import os

import psycopg

DSN = os.environ["PG_DSN"]

SQL = """
SELECT id, content,
       ts_rank_cd(
           ts, plainto_tsquery('english', %s)
       ) AS s
FROM chunks
WHERE ts @@ plainto_tsquery('english', %s)
ORDER BY s DESC
LIMIT %s
"""


def search(query: str, k: int = 10):
    with psycopg.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(SQL, (query, query, k))
            return cur.fetchall()
```
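Usage, with an illustrative query:

```python
rows = search("ERR_TLS_CERT_ALTNAME_INVALID", k=5)
for chunk_id, content, score in rows:
    print(f"{score:6.3f}  {chunk_id}  {content[:60]}")
```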
That is the whole retriever. No embedding API call, no HNSW index, no vector column. On a 1k-document corpus this typically returns top-10 in under 5ms warm. The same setup with a 1,536-dim HNSW index adds on the order of 30ms per query for the embedding API roundtrip plus another 5-10ms for the ANN search. Ingest costs land around $0.13 per 1k chunks at text-embedding-3-small rates as of April 2026 (OpenAI pricing). For an exact-match-heavy corpus, you are paying for retrieval you would discard.
When BM25 alone is enough
Three corpus shapes where the dense retriever is overhead:
- Code and config search. Function names, error codes, environment variable keys, command-line flags, version strings. The token either appears in the document or it does not. Embeddings smear `ERR_TLS_CERT_ALTNAME_INVALID` into a region of "TLS-ish errors" and put the wrong document at rank one; BM25's IDF weight pins the rare token to the right document (a toy check follows this list).
- Legal and regulatory text. Statute numbers, case citations, paragraph references like `GDPR Art. 6(1)(f)`. Legal-search teams that move from dense-only to BM25 (or BM25 + tsvector) commonly report large recall improvements on citation queries. Statutes are read by exact reference; embeddings have no incentive to preserve those references separably.
- Internal ID lookup. Ticket IDs, SKU numbers, JIRA keys, change-request numbers. Same reason as code. Anything that is meaningful as a string, not as a concept.
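The toy check on the first bullet's claim, using the rank-bm25 package (`pip install rank-bm25`); the three-document corpus is made up:

```python
from rank_bm25 import BM25Okapi

# Made-up three-document corpus: only the runbook contains the exact string.
docs = [
    "general TLS troubleshooting: certificates, handshakes, cipher suites",
    "TLS performance tuning: session resumption and OCSP stapling",
    "runbook: ERR_TLS_CERT_ALTNAME_INVALID certificate altname mismatch",
]
bm25 = BM25Okapi([d.split() for d in docs])

scores = bm25.get_scores("ERR_TLS_CERT_ALTNAME_INVALID".split())
print(scores)  # only the runbook scores above zero; IDF pins the rare token
```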
If 60% or more of your queries fall into one of those buckets, ship BM25 first. Add dense later if your eval rig says it is leaving recall on the table.
When BM25 plus RRF wins, when dense alone wins
The lazy default is hybrid. RRF (Cormack, Clarke & Buettcher, 2009) over BM25 and a dense retriever does not require you to tune a weight, the fusion step costs well under a millisecond, and on most production corpora it pulls 5-10 points of recall@10 over either side alone. If you do not know your corpus shape yet, this is the right call.
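The fusion step really is tiny. A minimal sketch, with the conventional k = 60 from the paper; the doc ids are hypothetical:

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked id lists (best first), one list per retriever."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical top-5 id lists from BM25 and a dense retriever.
fused = rrf_fuse([[14, 3, 7, 99, 2], [3, 42, 14, 7, 8]])
```

There is no weight to tune: a document only needs to rank well in one of the lists to surface in the fused top-k.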
Dense alone wins on a narrow but real set of corpora. Conversational support content where users describe a symptom they can't name and the docs use the team's vocabulary. Cross-language retrieval where a multilingual embedder bridges the language gap and BM25 has nothing to match on. Pure similarity search, "find me articles similar to this paragraph", where there are no rare tokens to anchor.
The mistake is to assume your corpus is the dense-wins case because dense is the trendy default. Look at the corpora most teams actually ship: code, error logs, structured docs, regulatory text. They have enough lexical overlap that BM25 carries the bulk of the load, and dense earns its keep only on the long-tail paraphrase queries.
What to measure before you commit
Two numbers, both cheap.
- Token-overlap rate on a labelled query/document pair set. Five lines of Python over your tsvector tokens. Tells you which retrieval shape your corpus has before you build anything.
- Recall@10 on BM25-only versus hybrid versus dense-only, run on the same labelled set (a scoring helper follows this list). If BM25-only sits within 2-3 points of hybrid, you do not need dense. If hybrid pulls ahead by 8+ points, dense is earning its keep. If dense alone matches hybrid, your corpus is mostly paraphrase and BM25 is dead weight.
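The scoring helper, a minimal sketch assuming you hold per-retriever results and labelled qrels as plain dicts (both names hypothetical):

```python
def recall_at_k(results: dict[str, list[int]],
                qrels: dict[str, set[int]],
                k: int = 10) -> float:
    """Average over queries of: fraction of relevant docs in the top k.

    results: query -> ranked doc ids from one retriever
    qrels:   query -> labelled relevant doc ids for that query
    """
    per_query = [
        len(set(results.get(q, [])[:k]) & rel) / len(rel)
        for q, rel in qrels.items() if rel
    ]
    return sum(per_query) / len(per_query)
```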
Picture the result on an internal-docs corpus: BM25 alone lands within a point or two of hybrid, and dense alone trails both. That is the shape that lets you ship BM25 only, delete the embedding column, and cut the retrieval bill to zero. The eval numbers say the same thing the token-overlap rate said: the corpus did not need dense.
The shape that is easy to miss
The vector-database vendor pitch is that embeddings are the new search. The benchmark reality is that embeddings are one retriever in a toolbox. On the corpora most teams actually ship (code, docs, errors, internal IDs), embeddings are the retriever you reach for second, after BM25 has already covered the head of the distribution.
Tokenize your queries. Tokenize your documents. Look at the overlap. Then pick the retriever that matches the shape your corpus actually has, not the one your last conference talk told you to.
If this was useful
The RAG Pocket Guide walks through retrieval selection end-to-end: when BM25 is enough, when hybrid pays for itself, the eval rig that tells you which one your corpus needs, and the production patterns that survive the second month of running it. If you are about to spin up a vector database because the tutorial said so, read it first.
