A Blind Spot in Vector Search
Suppose your knowledge base contains a document with this sentence:
"For Chinese scenarios, we recommend BAAI/bge-large-zh-v1.5, with a vector dimension of 1024."
A user asks: "What is the vector dimension of BAAI/bge-large-zh-v1.5?"
You might think this is a gimme — identical words, vector search should nail it easily.
Not necessarily. Vector search relies on semantic similarity. When the query and document share the same exact vocabulary, vector search has no particular advantage over BM25 — and sometimes performs worse. BM25 is specifically designed for exact term frequency matching. This is its home turf.
The real issue: your RAG system will inevitably face both types of queries:
- Keyword queries: contain exact model names, parameters, formulas, names — "BAAI/bge-large-zh-v1.5 dimension"
- Semantic queries: conceptual questions phrased differently — "My AI assistant keeps giving outdated answers, how do I fix this?"
Pure vector search handles the second well, but struggles with the first. Pure BM25 is the opposite.
Hybrid Search is conceptually simple: run both, then merge the results.
BM25 in Plain Terms
BM25 (Best Match 25) is the classic ranking algorithm behind Elasticsearch, Lucene, and most search engines.
Core formula:
score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
Human-readable version:
- IDF (Inverse Document Frequency): Rare words are worth more. "the" is worthless; "BAAI/bge-large-zh-v1.5" is gold.
- TF (Term Frequency): More occurrences → higher score, but with diminishing returns.
- Document length normalization: Long documents don't automatically win just because they have more words.
BM25 strengths: Purely vocabulary-based. If the query word appears in the document, it hits — precisely and reliably. Exact product names, function names, parameter values — this is its home court.
BM25 weaknesses: No semantic understanding. "knowledge cutoff" and "AI that doesn't know recent events" are completely unrelated to BM25, even though they mean the same thing.
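To make the formula concrete, here is a minimal BM25 scorer. This is a toy sketch of the formula above, not the tuned implementations inside Lucene or the `rank_bm25` package (which add IDF smoothing variants and caching); the tiny tokenized corpus is invented for illustration.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)          # docs containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # rare terms weigh more
        tf = doc_terms.count(term)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom               # diminishing returns in tf
    return score

corpus = [
    ["bge", "large", "zh", "dimension", "1024"],
    ["rag", "keeps", "answers", "current"],
    ["chunk", "size", "overlap", "tuning"],
]
print(bm25_score(["dimension", "1024"], corpus[0], corpus))  # positive: terms match
print(bm25_score(["dimension", "1024"], corpus[1], corpus))  # 0.0: no term overlap
```

Note the failure mode in the last line: a document that answers the question in different words scores exactly 0.0, which is the weakness described above.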
The RRF Fusion Algorithm
Given results from both BM25 and vector search, how do you combine them?
The naive approach is to take a weighted average of scores — but the two algorithms use completely different scoring scales, so direct addition is meaningless.
RRF (Reciprocal Rank Fusion) takes a more elegant approach: compare ranks, not scores.
Formula:
RRF_score(d) = Σ 1 / (k + rank(d))
Where:
- rank(d): the position of document d in a given retriever's results (1st, 2nd, ...)
- k: a constant, usually 60, which keeps any single top-ranked result from dominating
- The sum runs over all retrievers
Example:
| Document | BM25 Rank | Vector Rank | RRF Score (k=60) |
|---|---|---|---|
| doc-006 | 1 | 3 | 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323 |
| doc-003 | 3 | 1 | 1/(60+3) + 1/(60+1) = 0.0323 |
| doc-002 | 2 | 4 | 1/(60+2) + 1/(60+4) = 0.0161 + 0.0156 = 0.0318 |
The key benefit of RRF: no matter how different two retrievers' score ranges are, results are fused fairly based on rank alone. No manual score normalization needed.
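The fusion step itself is only a few lines of Python. A minimal sketch mirroring the table above (doc-001 and the vector retriever's second-place hit are invented to pad the rankings out):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked result lists (best first) via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_top = ["doc-006", "doc-002", "doc-003", "doc-001"]    # BM25 ranking
vector_top = ["doc-003", "doc-001", "doc-006", "doc-002"]  # vector ranking
for doc_id, score in rrf_fuse([bm25_top, vector_top]):
    print(f"{doc_id}: {score:.4f}")
```

Note that raw scores never enter the function: only positions do, which is the whole point.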
Experiment Design
6 test queries covering both scenarios:
| Type | Query | Expected Doc | What It Tests |
|---|---|---|---|
| Keyword | BAAI/bge-large-zh-v1.5 dimension | doc-003 | Exact model name |
| Keyword | RRF score sum 1/(k+rank) formula | doc-006 | Exact formula string |
| Keyword | chunk_size 256 1024 overlap recommended | doc-004 | Exact parameter values |
| Semantic | My AI assistant gives outdated answers, how do I keep it current? | doc-001 | No mention of "RAG" |
| Semantic | Multiple teams share one Q&A system — how to keep their data separate? | doc-008 | No mention of "multi-tenancy" |
| Semantic | Rephrasing the same question returns completely different results — how to fix this? | doc-007 | No mention of "Multi-Query" |
Evaluation metric: MRR (Mean Reciprocal Rank)
RR = 1 / rank (where did the correct document land?)
MRR = average RR across all queries
- Always ranks first → MRR = 1.0
- Averages second place → MRR = 0.5
- Never found → MRR = 0.0
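The metric is a few lines to compute. A minimal sketch, with invented retrieval lists covering the three cases above:

```python
def reciprocal_rank(retrieved, expected):
    """1/rank of the expected doc in the retrieved list, 0.0 if absent."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == expected:
            return 1 / rank
    return 0.0

def mrr(results):
    """Mean reciprocal rank over (retrieved_list, expected_doc) pairs."""
    return sum(reciprocal_rank(r, e) for r, e in results) / len(results)

results = [
    (["doc-003", "doc-006"], "doc-003"),  # rank 1 -> RR 1.0
    (["doc-004", "doc-006"], "doc-006"),  # rank 2 -> RR 0.5
    (["doc-005", "doc-001"], "doc-007"),  # miss   -> RR 0.0
]
print(mrr(results))  # 0.5
```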
Implementing the Three Retrievers
BM25 Retriever
Chinese text needs word segmentation first. We use jieba:
```python
import jieba
from langchain_community.retrievers import BM25Retriever

def chinese_tokenizer(text: str) -> list[str]:
    return list(jieba.cut(text))

bm25_retriever = BM25Retriever.from_documents(
    docs,
    k=3,
    preprocess_func=chinese_tokenizer,
)
```
Vector Retriever
```python
import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
)

vectorstore = Chroma.from_documents(docs, embedding=embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```
Hybrid Retriever (EnsembleRetriever + RRF)
```python
from langchain_classic.retrievers import EnsembleRetriever

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],  # equal weight — fused internally via RRF
)
```
The weights parameter in EnsembleRetriever controls each retriever's contribution to RRF scoring, not a direct score average. The implementation performs weighted RRF fusion over each retriever's ranked results.
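As a sketch of what weighted RRF means (this is not LangChain's actual code, just the idea), each retriever's reciprocal-rank contribution is scaled by its weight before summing:

```python
def weighted_rrf(rankings, weights, k=60):
    """Weighted RRF: each retriever's 1/(k+rank) term is scaled by its weight."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two retrievers that disagree on the top document (invented IDs):
bm25_top = ["doc-006", "doc-003"]
vector_top = ["doc-003", "doc-006"]
print(weighted_rrf([bm25_top, vector_top], [0.9, 0.1]))  # BM25's order wins
print(weighted_rrf([bm25_top, vector_top], [0.1, 0.9]))  # vector's order wins
```

With [0.5, 0.5] the two orderings cancel out to a tie; skewing the weights lets one retriever's ranking dominate the other's.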
Experimental Results
```
======================================================================
Per-Query Results (RR = Reciprocal Rank; Hit@1 = correct doc ranked first?)
======================================================================

[KEYWORD ] BAAI/bge-large-zh-v1.5 dimension
  Expected: doc-003
  BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-006', 'doc-004']
  Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-005', 'doc-002']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-006', 'doc-004']

[KEYWORD ] RRF score sum 1/(k+rank) formula
  Expected: doc-006
  BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-006', 'doc-002', 'doc-004']
  Vector [H@1=✗] RR=0.50 | rank=2 | retrieved: ['doc-004', 'doc-006', 'doc-003']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-006', 'doc-004', 'doc-003']

[KEYWORD ] chunk_size 256 1024 overlap recommended
  Expected: doc-004
  BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-004', 'doc-003', 'doc-006']
  Vector [H@1=✗] RR=0.50 | rank=2 | retrieved: ['doc-006', 'doc-004', 'doc-003']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-004', 'doc-006', 'doc-003']

[SEMANTIC] My AI gives outdated answers — how do I keep it current?
  Expected: doc-001
  BM25   [H@1=✗] RR=0.33 | rank=3 | retrieved: ['doc-007', 'doc-005', 'doc-001']
  Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-001', 'doc-005', 'doc-007']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-001', 'doc-007', 'doc-005']

[SEMANTIC] Multiple teams share a Q&A system — how to keep their data separate?
  Expected: doc-008
  BM25   [H@1=✗] RR=0.33 | rank=3 | retrieved: ['doc-002', 'doc-007', 'doc-008']
  Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-008', 'doc-001', 'doc-002']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-008', 'doc-002', 'doc-007']

[SEMANTIC] Rephrasing a question gives completely different results — how to fix?
  Expected: doc-007
  BM25   [H@1=✗] RR=0.00 | rank=miss | retrieved: ['doc-005', 'doc-001', 'doc-003']
  Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-007', 'doc-001', 'doc-005']
  Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-007', 'doc-001', 'doc-005']
```
MRR summary:
```
======================================================================
MRR Summary
MRR=1.0 → always ranked first | MRR=0.0 → never found
======================================================================
Query Type         BM25    Vector   Hybrid   Winner
────────────────────────────────────────────────────────
Keyword queries    1.000   0.667    1.000    BM25
Semantic queries   0.222   1.000    1.000    Vector
Overall            0.611   0.833    1.000    Hybrid
======================================================================
✓ Keyword queries: BM25 MRR is higher (exact term matching advantage)
✓ Semantic queries: Vector MRR is higher (semantic understanding advantage)
✓ Hybrid search: highest overall MRR — handles both query types
```
Reading the numbers:
- BM25 achieves a perfect 1.000 on keyword queries, but collapses to 0.222 on semantic ones — the third semantic query ("rephrasing") completely fails with no hit in the top 3.
- Vector search is perfect on semantic queries (1.000), but only 0.667 on keyword ones — two queries (the RRF formula and chunk_size) rank second instead of first.
- Hybrid search scores 1.000 across the board — it inherits BM25's keyword precision and matches vector's semantic performance.
When to Use What
| Dimension | BM25 | Vector Search |
|---|---|---|
| Strengths | Exact term matching (model names, formulas, parameters) | Semantic understanding (synonyms, paraphrases) |
| Fails when | Query and document use different words | Queries hinge on exact identifiers whose embeddings aren't distinctive |
| Typical query | "BERT-base-uncased number of layers" | "Why do pre-trained models need fine-tuning?" |
| Language | Better for English; Chinese needs tokenization | Works well for both |
| Compute cost | Low (no GPU, no API calls) | Higher (requires embedding calls) |
When you should definitely use hybrid search:
- Your knowledge base contains product names, API names, parameter names, acronyms
- Users query in diverse ways (power users ask exact terms; general users ask conceptually)
- You need high recall and can't afford to miss relevant documents
When vector-only is fine:
- Knowledge base is all natural language prose — no exact technical terms
- All queries are conceptual and semantic in nature
- Resource-constrained and want to minimize dependencies
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/10-hybrid-search
Core file:
- `hybrid_search.py` — Full comparison experiment across three retrieval strategies
How to run:
```shell
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/10-hybrid-search
cp .env.example .env  # Fill in your Embedding API key
pip install -r requirements.txt
python hybrid_search.py
```
Summary
This article ran a controlled experiment comparing three retrieval strategies:
- Pure BM25 — The keyword matching specialist. Perfect on exact terms, blind to semantics.
- Pure Vector Search — The semantic specialist. Handles paraphrasing beautifully, misses exact terms.
- Hybrid Search (RRF) — Fuses both, achieves the highest MRR across all query types.
The core idea behind RRF is worth keeping in mind: compare ranks, not scores. This lets it fairly fuse any two retrievers regardless of how different their scoring scales are.
In production, hybrid search has become the default recommendation for RAG systems. Elasticsearch, Qdrant, and Weaviate all support it natively. It's no longer an optional enhancement — it's the baseline.