
Rushank Savant
What is Hybrid Search in RAG?

⚠️The Need for Hybrid Search

Suppose we have documents containing Python error codes with their respective definitions and use cases. A user writes the query: "What is ERR_404_AUTH?"

  • Classic RAG: will retrieve all the authentication- and error-related context it can find in the vector DB (document embeddings), but may miss the exact code.

  • Lexical search: will search for the terms ["What", "is", "ERR_404_AUTH"], giving the stopwords as much weight as the error code.

  • Hybrid search: will search for the exact keyword "ERR_404_AUTH" and also retrieve semantically similar documents using similarity search.
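To see why pure lexical matching struggles here, consider a toy term-counting sketch (made-up documents, not any library's code): the stopwords "what" and "is" make an irrelevant document look just as good as the one that actually defines the error code.

```python
docs = [
    "ERR_404_AUTH is raised when the auth token is missing",
    "This guide explains what an error code is",
]
query_terms = ["what", "is", "err_404_auth"]

def term_hits(doc, terms):
    # Count how many query terms appear as exact tokens in the doc.
    tokens = doc.lower().split()
    return sum(1 for t in terms if t in tokens)

# Naive lexical search: stopwords make BOTH docs score 2 hits each.
print([term_hits(d, query_terms) for d in docs])

# Keyword-focused search: only the doc with the exact code matches.
print([term_hits(d, ["err_404_auth"]) for d in docs])
```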


🛠️Using BM25

Think of BM25 as an extended version of TF-IDF for keyword-based search.

Implementing BM25 in LangChain is straightforward because it ships a built-in BM25Retriever.

Here is the implementation alongside the intuition.

# pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

# Chunks from your text splitter
chunks = [
    Document(page_content="The AX-705 engine uses a 4-stroke cycle."),
    Document(page_content="Maintenance for AX-705 requires synthetic oil."),
    Document(page_content="Four-stroke engines are common in modern cars.")
]

# Step: Build the BM25 index (The Inverted Index)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2  # Retrieve top 2
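The "inverted index" that powers this retriever can be pictured as a plain dictionary mapping each token to the chunks that contain it. A minimal sketch (my own illustration, not LangChain's internals):

```python
from collections import defaultdict

chunks = [
    "The AX-705 engine uses a 4-stroke cycle.",
    "Maintenance for AX-705 requires synthetic oil.",
    "Four-stroke engines are common in modern cars.",
]

def build_inverted_index(texts):
    # token -> set of chunk ids that contain that exact token
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for token in text.lower().rstrip(".").split():
            index[token].add(doc_id)
    return index

index = build_inverted_index(chunks)
print(index["ax-705"])  # chunks 0 and 1 contain the exact token
print(index["engine"])  # only chunk 0 ("engines" in chunk 2 doesn't match)
```

This is why exact strings like part numbers or error codes are BM25's strength: lookup is a dictionary access, not a similarity computation.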

✨Creating the Hybrid "Ensemble"

To get the best of both worlds (exact keywords + semantic meaning), you merge your vector retriever with the BM25 retriever.

from langchain.retrievers import EnsembleRetriever

# Assume 'chroma_retriever' was already created from your vector store
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.3, 0.7]  # 30% importance to Keywords, 70% to Meaning
)

🔍The 4 Steps of BM25 (Under the Hood)

When you call hybrid_retriever.invoke("AX-705 engine"), the BM25 part follows these steps:
1. Tokenization: The query "AX-705 engine" is split into tokens like ["AX-705", "engine"] (LangChain's default preprocessor simply splits on whitespace; pass a custom preprocess_func if you want lowercasing or stemming).

2. Lookup: The retriever looks into its "Inverted Index" (a dictionary) to see which document chunks contain these exact strings.

3. Scoring (f(q, d)): It calculates a score for each match using the BM25 formula:
- Rareness: Since "AX-705" is rare in the database, it's worth more points than "engine".
- Saturation: Even if a doc mentions "engine" 100 times, its score won't skyrocket (preventing "keyword stuffing" from winning).
- Length Penalty: If a tiny 10-word chunk matches both words, it ranks higher than a massive 1000-word chunk matching both.

4. Ranking: It returns a list of chunks sorted by this score.
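The scoring step above can be sketched in plain Python. This is the standard BM25 formula with the usual parameters k1 (saturation) and b (length penalty); k1=1.5 and b=0.75 are common defaults, and the documents are the same toy chunks from earlier:

```python
import math

k1, b = 1.5, 0.75  # saturation and length-penalty knobs

docs = [
    "the ax-705 engine uses a 4-stroke cycle".split(),
    "maintenance for ax-705 requires synthetic oil".split(),
    "four-stroke engines are common in modern cars".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def idf(term):
    # Rareness: terms appearing in fewer documents are worth more.
    n = sum(term in d for d in docs)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query_terms, doc):
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)
        # Saturation: the tf factor is bounded by (k1 + 1), so repeating
        # a word 100 times can't make the score skyrocket.
        # Length penalty: docs longer than average are discounted via b.
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(t) * tf * (k1 + 1) / norm
    return score

scores = [bm25(["ax-705", "engine"], d) for d in docs]
```

Running this, the first chunk (which contains both terms) outscores the second (only "ax-705"), and the third scores zero because "engines" is not the exact token "engine".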


👉Next steps:

Reciprocal Rank Fusion (RRF), the glue:

When the Ensemble Retriever gets the BM25 list and the vector list, it needs to combine them. Since their scores are on different scales (cosine similarity is roughly 0 to 1, while BM25 scores are unbounded and might run 0 to 25), it uses RRF, which relies on ranks rather than raw scores.

Logic: It looks at the rank (position) of a document in both lists.
Intuition: If a document is #1 in BM25 but #50 in Vector Search, it still gets a high total score because it's a "perfect" keyword match.
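A weighted RRF along these lines can be sketched in a few lines (k=60 is the conventional constant from the original RRF formulation; the doc IDs and rankings here are made up, and the weights mirror the 0.3/0.7 split from the ensemble example):

```python
def rrf(rankings, weights, k=60):
    """Fuse several ranked lists: each doc earns weight / (k + rank)."""
    scores = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked   = ["doc_A", "doc_B", "doc_C"]  # keyword matches first
vector_ranked = ["doc_C", "doc_D", "doc_A"]  # semantic matches first
fused = rrf([bm25_ranked, vector_ranked], weights=[0.3, 0.7])
print(fused)  # doc_C edges out doc_A: top of the heavier-weighted list
```

Because only positions matter, a document that is #1 for keywords still contributes strongly to the fused score even if the vector retriever ranked it far down its list.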

I hope this was useful!
