⚠️The Need for Hybrid Search
Suppose we have documents containing Python error codes with their definitions and use cases, and a user asks: "What is ERR_404_AUTH?"
Classic RAG: will retrieve all the authentication- and error-related context it can find from the vector DB (document embeddings).
Lexical search: will search for the terms ["What", "is", "ERR_404_AUTH"].
Hybrid search: will search for the keyword "ERR_404_AUTH" and also retrieve semantically similar documents using similarity search.
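The keyword half of this idea can be sketched in a few lines of plain Python. The documents and the stop-word list below are made up for illustration; the point is that an exact-token match pins down "ERR_404_AUTH" in a way embeddings alone may not:

```python
docs = [
    "ERR_404_AUTH: raised when the auth token is missing or expired.",  # hypothetical doc
    "General overview of authentication errors in the API.",            # hypothetical doc
]
query = "What is ERR_404_AUTH?"

# Drop filler words so only the distinctive token survives
# (tiny illustrative stop-word list, not a real one).
stop_words = {"what", "is"}
keywords = [t.strip("?") for t in query.split() if t.lower() not in stop_words]

# Exact substring match on the surviving keyword.
hits = [d for d in docs if any(k in d for k in keywords)]
```

Only the first document is a hit, because only it contains the literal string "ERR_404_AUTH".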
🛠️Using BM25
Think of BM25 as an extended version of TF-IDF for keyword-based search.
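For reference, this is the textbook Okapi BM25 scoring function; k1 and b are tuning parameters (typically k1 ≈ 1.2–2.0 and b = 0.75):

```
score(q, d) = Σ_{t ∈ q} IDF(t) · [ f(t, d) · (k1 + 1) ] / [ f(t, d) + k1 · (1 − b + b · |d| / avgdl) ]
```

where f(t, d) is the frequency of term t in document d, |d| is the document length, and avgdl is the average document length across the corpus.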
Implementing BM25 in LangChain is straightforward because it provides a built-in BM25Retriever. Here is the step-wise implementation alongside the intuition.
# pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
# Chunks from your text splitter
chunks = [
    Document(page_content="The AX-705 engine uses a 4-stroke cycle."),
    Document(page_content="Maintenance for AX-705 requires synthetic oil."),
    Document(page_content="Four-stroke engines are common in modern cars."),
]
# Step: Build the BM25 index (The Inverted Index)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2 # Retrieve top 2
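The "inverted index" mentioned in the comment above is just a mapping from token to the chunks containing it. A minimal conceptual sketch in plain Python (the whitespace tokenization here is a simplified stand-in for the retriever's real tokenizer):

```python
from collections import defaultdict

chunks = [
    "The AX-705 engine uses a 4-stroke cycle.",
    "Maintenance for AX-705 requires synthetic oil.",
    "Four-stroke engines are common in modern cars.",
]

# Inverted index: token -> set of chunk ids containing that token.
inverted_index = defaultdict(set)
for doc_id, text in enumerate(chunks):
    for token in text.lower().split():
        inverted_index[token.strip(".,")].add(doc_id)

# "ax-705" appears in chunks 0 and 1; "engine" only in chunk 0
# (chunk 2 says "engines", a different exact token).
```

At query time, BM25 only has to look up each query token in this dictionary instead of scanning every chunk.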
✨Creating the Hybrid "Ensemble"
To get the best of both worlds (exact keywords + semantic meaning), you merge your vector retriever with the BM25 retriever.
from langchain.retrievers import EnsembleRetriever
# Assume 'chroma_retriever' was already created from your Chroma vector store
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.3, 0.7],  # 30% importance to keywords, 70% to meaning
)
🔍The 4 Steps of BM25 (Under the Hood)
When you call hybrid_retriever.invoke("AX-705 engine"), the BM25 part follows these steps:
1. Tokenization: The query "AX-705 engine" is split into ["ax-705", "engine"].
2. Lookup: The retriever looks into its "Inverted Index" (a dictionary) to see which document chunks contain these exact strings.
3. Scoring (f(q, d)): It calculates a score for each match using the BM25 formula:
- Rareness: Since "AX-705" is rare in the database, it's worth more points than "engine".
- Saturation: Even if a doc mentions "engine" 100 times, its score won't skyrocket (preventing "keyword stuffing" from winning).
- Length Penalty: If a tiny 10-word chunk matches both words, it ranks higher than a massive 1000-word chunk matching both.
4. Ranking: It returns a list of chunks sorted by this score.
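The four steps above can be run end-to-end with a simplified BM25 in plain Python (this is a didactic sketch, not LangChain's actual implementation; k1 and b are set to common defaults):

```python
import math

# Step 1 (done ahead of time): tokenize each chunk.
docs = [
    "the ax-705 engine uses a 4-stroke cycle",
    "maintenance for ax-705 requires synthetic oil",
    "four-stroke engines are common in modern cars",
]
corpus = [d.split() for d in docs]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N
k1, b = 1.5, 0.75  # common defaults

def idf(term):
    # Rareness: fewer matching docs -> higher IDF.
    df = sum(term in d for d in corpus)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query_tokens, doc):
    score = 0.0
    for t in query_tokens:
        f = doc.count(t)  # term frequency; Saturation: repeated hits add less and less
        # Length penalty: longer-than-average docs get a larger denominator.
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(t) * f * (k1 + 1) / denom
    return score

# Steps 2-4: score every chunk against the tokenized query and rank.
query = "ax-705 engine".split()
scores = [bm25(query, d) for d in corpus]
ranking = sorted(range(N), key=lambda i: -scores[i])
```

Chunk 0 matches both tokens and ranks first; chunk 2 matches neither exact token ("engines" ≠ "engine") and scores zero.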
👉Next step: Reciprocal Rank Fusion (RRF), the glue
When the Ensemble Retriever gets the BM25 list and the Vector list, it needs to combine them. Since their scores are on different scales (one is 0-1, the other could be 0-25), it uses RRF.
Logic: It looks at the rank (position) of a document in both lists.
Intuition: If a document is #1 in BM25 but #50 in Vector Search, it still gets a high total score because it's a "perfect" keyword match.
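A minimal sketch of weighted RRF in plain Python (the example rankings are hypothetical, and c = 60 is the constant commonly used in RRF; LangChain's EnsembleRetriever performs an equivalent weighted fusion internally):

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of doc ids using rank positions, not raw scores."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank): being near the
            # top of either list matters, regardless of score scale.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_A", "doc_B", "doc_C"]    # hypothetical keyword results
vector_ranking = ["doc_C", "doc_B", "doc_D"]  # hypothetical semantic results
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.3, 0.7])
```

Note that doc_C, ranked last by BM25 but first by the (more heavily weighted) vector retriever, ends up on top of the fused list, while doc_A, which only the keyword side found, falls to the bottom.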
I hope this was useful.