WonderLab
RAG Series (10): Hybrid Search — Retrieving More, Missing Less

A Blind Spot in Vector Search

Suppose your knowledge base contains a document with this sentence:

"For Chinese scenarios, we recommend BAAI/bge-large-zh-v1.5, with a vector dimension of 1024."

A user asks: "What is the vector dimension of BAAI/bge-large-zh-v1.5?"

You might think this is a gimme — identical words, vector search should nail it easily.

Not necessarily. Vector search relies on semantic similarity. When the query and document share the same exact vocabulary, vector search has no particular advantage over BM25 — and sometimes performs worse. BM25 is specifically designed for exact term frequency matching. This is its home turf.

The real issue: your RAG system will inevitably face both types of queries:

  • Keyword queries: contain exact model names, parameters, formulas, names — "BAAI/bge-large-zh-v1.5 dimension"
  • Semantic queries: conceptual questions phrased differently — "My AI assistant keeps giving outdated answers, how do I fix this?"

Pure vector search handles the second well, but struggles with the first. Pure BM25 is the opposite.

Hybrid Search is conceptually simple: run both, then merge the results.


BM25 in Plain Terms

BM25 (Best Match 25) is the classic ranking algorithm behind Elasticsearch, Lucene, and most search engines.

Core formula:

score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Human-readable version:

  • IDF (Inverse Document Frequency): Rare words are worth more. "the" is worthless; "BAAI/bge-large-zh-v1.5" is gold.
  • TF (Term Frequency): More occurrences → higher score, but with diminishing returns.
  • Document length normalization: Long documents don't automatically win just because they have more words.

BM25 strengths: Purely vocabulary-based. If the query word appears in the document, it hits — precisely and reliably. Exact product names, function names, parameter values — this is its home court.

BM25 weaknesses: No semantic understanding. To BM25, "knowledge cutoff" and "an AI that doesn't know recent events" are completely unrelated strings, even though they mean the same thing.
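The formula above fits in a few lines of pure Python. This is a simplified illustration, not the exact Lucene implementation; the smoothed IDF variant below is one common choice:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against a tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # df[term] = number of documents containing the term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                             # term frequencies in this doc
        score = 0.0
        for term in query:
            if term not in tf:
                continue                            # no match, no contribution
            # Smoothed IDF: rare terms get a large weight, common ones near zero
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # TF with saturation (k1) and document-length normalization (b)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

A query term that appears in only one document drives that document's score up sharply, while documents without the term score zero, which is exactly the "precise but literal" behavior described above.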


The RRF Fusion Algorithm

Given results from both BM25 and vector search, how do you combine them?

The naive approach is to take a weighted average of scores — but the two algorithms use completely different scoring scales, so direct addition is meaningless.

RRF (Reciprocal Rank Fusion) takes a more elegant approach: compare ranks, not scores.

Formula:

RRF_score(d) = Σ 1 / (k + rank(d))
  • rank(d): where document d ranked in a given retriever (1st, 2nd, ...)
  • k: a constant, usually 60, to prevent the top-ranked document from dominating
  • Sum across all retrievers

Example:

| Document | BM25 Rank | Vector Rank | RRF Score (k=60) |
|----------|-----------|-------------|------------------|
| doc-006 | 1 | 3 | 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323 |
| doc-003 | 3 | 1 | 1/(60+3) + 1/(60+1) = 0.0323 |
| doc-002 | 2 | 4 | 1/(60+2) + 1/(60+4) = 0.0161 + 0.0156 = 0.0317 |

The key benefit of RRF: no matter how different two retrievers' score ranges are, results are fused fairly based on rank alone. No manual score normalization needed.
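The fusion itself takes only a few lines of Python. A minimal sketch, with rankings mirroring the example table (the doc-005 entry in the vector ranking is an illustrative filler, since the table doesn't list the second vector result):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["doc-006", "doc-002", "doc-003"]
vector_ranking = ["doc-003", "doc-005", "doc-006", "doc-002"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Notice that a document appearing in only one list still gets a score; it simply contributes nothing from the retrievers that missed it.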


Experiment Design

6 test queries covering both scenarios:

| Type | Query | Expected Doc | What It Tests |
|------|-------|--------------|---------------|
| Keyword | BAAI/bge-large-zh-v1.5 dimension | doc-003 | Exact model name |
| Keyword | RRF score sum 1/(k+rank) formula | doc-006 | Exact formula string |
| Keyword | chunk_size 256 1024 overlap recommended | doc-004 | Exact parameter values |
| Semantic | My AI assistant gives outdated answers, how do I keep it current? | doc-001 | No mention of "RAG" |
| Semantic | Multiple teams share one Q&A system — how to keep their data separate? | doc-008 | No mention of "multi-tenancy" |
| Semantic | Rephrasing the same question returns completely different results — how to fix this? | doc-007 | No mention of "Multi-Query" |

Evaluation metric: MRR (Mean Reciprocal Rank)

RR = 1 / rank  (where did the correct document land?)
MRR = average RR across all queries
  • Always ranks first → MRR = 1.0
  • Averages second place → MRR = 0.5
  • Never found → MRR = 0.0

Implementing the Three Retrievers

BM25 Retriever

Chinese text needs word segmentation first. We use jieba:

import jieba
from langchain_community.retrievers import BM25Retriever

def chinese_tokenizer(text: str) -> list[str]:
    return list(jieba.cut(text))

bm25_retriever = BM25Retriever.from_documents(
    docs,
    k=3,
    preprocess_func=chinese_tokenizer,
)

Vector Retriever

import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
)
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

Hybrid Retriever (EnsembleRetriever + RRF)

from langchain_classic.retrievers import EnsembleRetriever

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],   # Equal weight — fused internally via RRF
)

The weights parameter in EnsembleRetriever controls each retriever's contribution to RRF scoring, not a direct score average. The implementation performs weighted RRF fusion over each retriever's ranked results.
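Conceptually, that weighted fusion looks like this. This is a sketch of the idea, not LangChain's actual source; the `c` constant plays the same role as RRF's `k`:

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float],
                 c: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion over several ranked lists of doc IDs."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever's rank contribution is scaled by its weight
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With equal weights this reduces to plain RRF; skewing the weights lets you favor one retriever when you know your query mix leans heavily keyword or heavily semantic.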


Experimental Results

======================================================================
  Per-Query Results  (RR = Reciprocal Rank; Hit@1 = correct doc ranked first?)
======================================================================

  [KEYWORD ] BAAI/bge-large-zh-v1.5 dimension
    Expected: doc-003
    BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-006', 'doc-004']
    Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-005', 'doc-002']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-003', 'doc-006', 'doc-004']

  [KEYWORD ] RRF score sum 1/(k+rank) formula
    Expected: doc-006
    BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-006', 'doc-002', 'doc-004']
    Vector [H@1=✗] RR=0.50 | rank=2 | retrieved: ['doc-004', 'doc-006', 'doc-003']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-006', 'doc-004', 'doc-003']

  [KEYWORD ] chunk_size 256 1024 overlap recommended
    Expected: doc-004
    BM25   [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-004', 'doc-003', 'doc-006']
    Vector [H@1=✗] RR=0.50 | rank=2 | retrieved: ['doc-006', 'doc-004', 'doc-003']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-004', 'doc-006', 'doc-003']

  [SEMANTIC] My AI gives outdated answers — how do I keep it current?
    Expected: doc-001
    BM25   [H@1=✗] RR=0.33 | rank=3 | retrieved: ['doc-007', 'doc-005', 'doc-001']
    Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-001', 'doc-005', 'doc-007']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-001', 'doc-007', 'doc-005']

  [SEMANTIC] Multiple teams share a Q&A system — how to keep their data separate?
    Expected: doc-008
    BM25   [H@1=✗] RR=0.33 | rank=3 | retrieved: ['doc-002', 'doc-007', 'doc-008']
    Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-008', 'doc-001', 'doc-002']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-008', 'doc-002', 'doc-007']

  [SEMANTIC] Rephrasing a question gives completely different results — how to fix?
    Expected: doc-007
    BM25   [H@1=✗] RR=0.00 | rank=miss | retrieved: ['doc-005', 'doc-001', 'doc-003']
    Vector [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-007', 'doc-001', 'doc-005']
    Hybrid [H@1=✓] RR=1.00 | rank=1 | retrieved: ['doc-007', 'doc-001', 'doc-005']

MRR summary:

======================================================================
  MRR Summary
  MRR=1.0 → always ranked first  |  MRR=0.0 → never found
======================================================================

  Query Type         BM25     Vector     Hybrid  Winner
  ────────────────────────────────────────────────────────
  Keyword queries    1.000      0.667      1.000  BM25
  Semantic queries   0.222      1.000      1.000  Vector
  Overall            0.611      0.833      1.000  Hybrid
======================================================================

  ✓ Keyword queries: BM25 MRR is higher (exact term matching advantage)
  ✓ Semantic queries: Vector MRR is higher (semantic understanding advantage)
  ✓ Hybrid search: highest overall MRR — handles both query types

Reading the numbers:

  • BM25 achieves a perfect 1.000 on keyword queries, but collapses to 0.222 on semantic ones — the third semantic query ("rephrasing") completely fails with no hit in the top 3.
  • Vector search is perfect on semantic queries (1.000), but only 0.667 on keyword ones — two queries (the RRF formula and chunk_size) rank second instead of first.
  • Hybrid search scores 1.000 across the board — it inherits BM25's keyword precision and matches vector's semantic performance.

When to Use What

| Dimension | BM25 | Vector Search |
|-----------|------|---------------|
| Strengths | Exact term matching (model names, formulas, parameters) | Semantic understanding (synonyms, paraphrases) |
| Fails when | Query and document use different words | Exact technical terms don't have semantically distinct embeddings |
| Typical query | "BERT-base-uncased number of layers" | "Why do pre-trained models need fine-tuning?" |
| Language | Better for English; Chinese needs tokenization | Works well for both |
| Compute cost | Low (no GPU, no API calls) | Higher (requires embedding calls) |

When you should definitely use hybrid search:

  • Your knowledge base contains product names, API names, parameter names, acronyms
  • Users query in diverse ways (power users ask exact terms; general users ask conceptually)
  • You need high recall and can't afford to miss relevant documents

When vector-only is fine:

  • Knowledge base is all natural language prose — no exact technical terms
  • All queries are conceptual and semantic in nature
  • Resource-constrained and want to minimize dependencies

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/10-hybrid-search

Core file:

  • hybrid_search.py — Full comparison experiment across three retrieval strategies

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/10-hybrid-search
cp .env.example .env   # Fill in your Embedding API key
pip install -r requirements.txt
python hybrid_search.py

Summary

This article ran a controlled experiment comparing three retrieval strategies:

  1. Pure BM25 — The keyword matching specialist. Perfect on exact terms, blind to semantics.
  2. Pure Vector Search — The semantic specialist. Handles paraphrasing beautifully, misses exact terms.
  3. Hybrid Search (RRF) — Fuses both, achieves the highest MRR across all query types.

The core idea behind RRF is worth keeping in mind: compare ranks, not scores. This lets it fairly fuse any two retrievers regardless of how different their scoring scales are.

In production, hybrid search has become the default recommendation for RAG systems. Elasticsearch, Qdrant, and Weaviate all support it natively. It's no longer an optional enhancement — it's the baseline.

