BM25 vs. Vector Search: Choosing the Right Retrieval Strategy for Production Systems

#ai #search #architecture #machinelearning

Search is deceptively complex. You can stand up Elasticsearch in an afternoon and have something that works. Whether it surfaces the right document when a user asks "how do I reset my subscription?" instead of typing "subscription reset steps" is an entirely different problem.

The two dominant retrieval paradigms — BM25 and Vector Search — are both mature and production-proven. The real question is why one fails where the other succeeds, and how to combine them.

BM25: The Probabilistic Workhorse

BM25 scores documents using term frequency, inverse document frequency, and document length normalization. It is still the default ranking algorithm in Elasticsearch and OpenSearch.

Excels at: exact keyword matching (SKUs, error codes, CLI flags), transparent debuggable ranking, sub-millisecond latency at scale, zero GPU cost.

Breaks down at: vocabulary mismatch ("cancel membership" vs "terminate subscription"), semantic intent, cross-language queries, conceptual similarity.

BM25 is fundamentally a bag-of-words model. It has no understanding of meaning.

Vector Search: Semantic Retrieval via Embeddings

Vector search transforms text into dense numerical vectors where semantically similar content is geometrically close. Retrieval becomes nearest-neighbor search in that space.

Excels at: semantic equivalence, natural language questions, cross-lingual retrieval, RAG pipelines where LLMs generate natural language queries.

Breaks down at: exact term matching (NullPointerException, order IDs), infrastructure cost (GPU for embedding, RAM for HNSW indexes), staleness when documents change frequently.

The quality of vector search is entirely determined by your embedding model — domain fit, dimensionality, and max token length all matter.

Hybrid Search: The Production Reality

In most real-world systems, neither alone is sufficient. The answer is hybrid retrieval — running both in parallel and combining scores.

Reciprocal Rank Fusion (RRF) merges ranked lists without score normalization. Documents that rank well in both lists get a substantial boost. It is rank-based, not score-based — no need to normalize cosine similarity against BM25 scores.

The full pipeline:

BM25 + Vector Search in parallel → top-K candidates each
RRF merge → combined ranked list
Cross-encoder re-ranker → precision pass on top candidates
Final top-N → LLM context or search results

Skipping the re-ranker is a common mistake. Initial retrieval optimizes for recall. The re-ranker optimizes for precision — especially critical when context window limits mean you can only pass top-5 to an LLM.

Common Architecture Mistakes

Vector search alone for RAG — misses exact-match cases that BM25 catches, creating systematic blind spots
Ignoring chunk boundaries — embedding a 5000-token document as a single vector destroys retrieval specificity
General-purpose embedding model on specialized corpus — domain-specific models significantly outperform on legal, medical, or code retrieval
Not evaluating retrieval independently — teams blame the LLM when retrieval is the actual failure point. Measure Recall@K and MRR before debugging generation.

Read the Full Article

This is a summary of my deep dive into retrieval architecture. The full article covers BM25 mechanics, vector search internals, the complete tooling landscape, chunking strategies, and a decision framework:

👉 BM25 vs. Vector Search — Full Article

The full article includes:

BM25 scoring formula breakdown and tuning parameters
Embedding model evaluation criteria and provider comparison
Head-to-head scenario table showing when each wins
Hybrid architecture pattern with RRF and re-ranking
Complete tooling landscape (Pinecone, Qdrant, Weaviate, pgvector, Elasticsearch 8.x)
Chunking strategy deep dive (fixed, semantic, hierarchical)
Decision framework for choosing your retrieval architecture