Why Your RAG System Needs Hybrid Search (And How to Actually Implement It)

#ai #llm #machinelearning #rag

Vector similarity search is powerful, but it has a well-known weakness: exact term matching. An embedding captures the general meaning of a passage, and rare identifiers, product codes, and precise terminology can get blurred in the process. If a user searches for "SOC 2 Type II report", a vector search may rank documents about SOC 1 reports, ISO 27001 certification, or security audits in general alongside, or even above, the one document that contains that exact phrase, depending on how the embedding model handles that specific terminology. A keyword index matches the exact string directly.

The solution is hybrid search: run vector similarity search and traditional keyword search side by side, then merge the results. Most of the underperforming production RAG systems I have reviewed use vector-only search, and adding hybrid search is one of the highest-leverage improvements available.

Here is how to implement it properly.

The two search types and what each catches

Dense retrieval (vector search) is good at: semantic similarity, paraphrase matching, concept-level queries, finding relevant content even when exact terms differ. It struggles with: rare terms, product names, codes, identifiers, and precise technical terminology where exact matching matters.

Sparse retrieval (keyword search) is good at: exact term matching, rare words, codes, identifiers, and queries where the user knows the specific terminology used in the document. It struggles with: synonyms, paraphrases, and concept-level queries where the words differ from the document.
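To make the sparse side concrete, here is a minimal sketch of Okapi BM25, the scoring function Elasticsearch uses by default, with its default parameters k1 = 1.2 and b = 0.75. It assumes plain lowercased whitespace tokenization; real analyzers add stemming, stop words, and smarter tokenization. The function name and the three toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict

def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Okapi BM25 score of every document for the query (lowercased whitespace tokens)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores: Dict[str, float] = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # no exact token match, no contribution: synonyms score zero
            # Rare terms (low df) get a high IDF weight; very common terms get little.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and document-length normalization (b).
            tf_part = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * tf_part
        scores[doc_id] = score
    return scores

docs = {
    "a": "our soc 2 type ii report covers access controls",
    "b": "security audit overview for the compliance team",
    "c": "security certification process and audit schedule",
}
print(bm25_scores("soc 2 type ii", docs))  # only "a" scores above zero
print(bm25_scores("security certification audit document", docs))  # "a" scores zero
```

The second query shows the flip side: the SOC 2 document shares no tokens with the paraphrase, so keyword scoring gives it nothing. That is the gap the dense side covers.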

Hybrid search combines both. You retrieve candidates from each system separately and then merge and re-rank.

Implementation with Reciprocal Rank Fusion

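A simple and robust merging strategy is Reciprocal Rank Fusion (RRF). It uses only the rank positions from each system, so you never need to reconcile their score scales: BM25 scores are unbounded, while vector stores may return cosine similarities or distances.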
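Written out, the weighted variant scores each document $d$ as follows, where $r_i(d)$ is its 1-based rank in result list $i$, $w_i$ is that list's weight, and a list that does not contain $d$ contributes nothing:

$$
\operatorname{RRF}(d) = \sum_{i \in \{\text{dense},\,\text{sparse}\}} \frac{w_i}{k + r_i(d)}
$$

As a worked example with $k = 60$ and both weights at 0.5, a document ranked 1st by dense and 3rd by sparse scores

$$
\frac{0.5}{60 + 1} + \frac{0.5}{60 + 3} \approx 0.0081967 + 0.0079365 = 0.0161332,
$$

while a document ranked 1st by dense alone scores only about 0.0081967. Agreement between the two arms outweighs a single top placement.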

```python
from typing import Dict, List, Tuple

def reciprocal_rank_fusion(
    dense_results: List[Tuple[str, float]],
    sparse_results: List[Tuple[str, float]],
    k: int = 60,
    dense_weight: float = 0.5,
    sparse_weight: float = 0.5,
) -> List[str]:
    """
    Fuse two ranked result lists with weighted Reciprocal Rank Fusion.

    dense_results: list of (doc_id, score) from vector search, best first
    sparse_results: list of (doc_id, score) from keyword search, best first
    k: RRF constant (60 is the common default); larger k flattens the rank curve
    dense_weight, sparse_weight: multipliers on each list's contribution
    Returns: list of doc_ids ranked by fused score
    """
    scores: Dict[str, float] = {}

    for results, weight in ((dense_results, dense_weight), (sparse_results, sparse_weight)):
        seen = set()
        rank = 0
        for doc_id, _score in results:  # raw scores are ignored; only order matters
            # A document can appear several times (one hit per chunk). Count only
            # its best (first) rank so documents are not rewarded for chunk count.
            if doc_id in seen:
                continue
            seen.add(doc_id)
            rank += 1  # 1-based rank among distinct documents
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)

    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Wiring it up with Elasticsearch for the sparse side

Most enterprise environments already have Elasticsearch or OpenSearch running. Use it for your sparse retrieval.
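If you are creating the index from scratch, a mapping along these lines covers the fields the query below searches. This is a sketch under assumptions: the field names are taken from the `multi_match` query, the extra `doc_id` keyword field is optional, and `KB_INDEX_MAPPINGS` and `create_kb_index` are hypothetical names. `text` fields are analyzed and scored with BM25 by default.

```python
from typing import Any, Dict

# Assumed field names, matching the multi_match query in sparse_search.
KB_INDEX_MAPPINGS: Dict[str, Any] = {
    "properties": {
        "title": {"type": "text"},
        "content": {"type": "text"},
        # Object field; the query addresses it as "metadata.section".
        "metadata": {"properties": {"section": {"type": "text"}}},
        # Canonical ID, also stored as an exact-match keyword (optional, handy for debugging).
        "doc_id": {"type": "keyword"},
    }
}

def create_kb_index(es, index: str) -> None:
    """Create the index with explicit mappings if it does not exist yet.

    `es` is assumed to be an elasticsearch-py 8.x client, which takes
    mappings as a keyword argument.
    """
    if not es.indices.exists(index=index):
        es.indices.create(index=index, mappings=KB_INDEX_MAPPINGS)
```

When indexing, pass the canonical document ID as the Elasticsearch `_id` (for example `es.index(index=..., id=doc_id, document={...})`) so that `hit["_id"]` in the search results lines up with the vector store.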

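```python
from typing import List, Tuple

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def sparse_search(query: str, index: str, top_k: int = 20) -> List[Tuple[str, float]]:
    """Keyword search (BM25 under Elasticsearch's default similarity)."""
    # elasticsearch-py 8.x accepts the query DSL as keyword arguments (query=, size=).
    response = es.search(
        index=index,
        query={
            "multi_match": {
                "query": query,
                # Field boosts: title matches weigh 3x, content 2x, section 1x.
                "fields": ["content^2", "title^3", "metadata.section"],
                "type": "best_fields",
            }
        },
        size=top_k,
    )
    # hit["_id"] must be the same canonical doc_id that the vector store holds.
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]

def dense_search(query: str, vectorstore, top_k: int = 20) -> List[Tuple[str, float]]:
    """Vector search through a LangChain-style vector store."""
    # Depending on the backend the score may be a distance (lower is better) or a
    # similarity (higher is better). RRF only uses the order, which is best-first.
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [(doc.metadata["doc_id"], score) for doc, score in results]

def hybrid_search(
    query: str,
    vectorstore,
    es_index: str,
    top_k: int = 10,
    candidates: int = 20,
    dense_weight: float = 0.5,
    sparse_weight: float = 0.5,
) -> List[str]:
    # Over-fetch candidates from each arm, fuse, then cut to the final top_k.
    dense = dense_search(query, vectorstore, top_k=candidates)
    sparse = sparse_search(query, es_index, top_k=candidates)
    fused = reciprocal_rank_fusion(
        dense, sparse, dense_weight=dense_weight, sparse_weight=sparse_weight
    )
    return fused[:top_k]
```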

Tuning the weights

The default 50/50 weight split is a reasonable starting point. For query types where exact terminology matters heavily (compliance documents, technical specifications, product names), skew toward sparse. For conceptual queries where paraphrasing is common, skew toward dense.
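One way to act on this per query is a cheap heuristic that switches to sparse-heavy weights when the query contains identifier-like tokens. This is a hypothetical rule, not a tested recipe: the regex, the `choose_weights` name, and the 30/70 split are all assumptions to tune against your own evaluation set.

```python
import re
from typing import Tuple

# Hypothetical rule: a token containing a digit ("27001", "e1234") or an all-caps
# run of 2+ letters ("SOC", "GDPR") suggests the user knows the exact terminology.
_IDENTIFIER_LIKE = re.compile(r"\b(?:\w*\d\w*|[A-Z]{2,})\b")

def choose_weights(
    query: str,
    default: Tuple[float, float] = (0.5, 0.5),
    sparse_heavy: Tuple[float, float] = (0.3, 0.7),
) -> Tuple[float, float]:
    """Return (dense_weight, sparse_weight) for this query."""
    return sparse_heavy if _IDENTIFIER_LIKE.search(query) else default

dense_w, sparse_w = choose_weights("SOC 2 Type II report")
# Then: reciprocal_rank_fusion(dense, sparse, dense_weight=dense_w, sparse_weight=sparse_w)
```

A rule like this is easy to audit, but it is a blunt instrument; if query types vary a lot, measure it against fixed weights before trusting it.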

You can measure this empirically with your evaluation set. Run 50/50, 70/30 dense-heavy, and 30/70 sparse-heavy on the same query set and compare recall at k. The results will tell you where to set the production weights.
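A minimal sketch of that comparison, assuming an evaluation set of queries with known relevant doc IDs, and that each query's dense and sparse rankings have already been retrieved and cached. Only the fusion step depends on the weights, so each query needs retrieving once, not once per weight pair. The `fuse` helper restates weighted RRF on bare IDs so the snippet runs on its own; `fuse`, `recall_at_k`, and `sweep_weights` are illustrative names.

```python
from typing import Dict, List, Sequence, Set, Tuple

def fuse(dense_ids: Sequence[str], sparse_ids: Sequence[str],
         dense_weight: float, sparse_weight: float, k: int = 60) -> List[str]:
    """Weighted RRF over two best-first lists of doc IDs."""
    scores: Dict[str, float] = {}
    for ids, weight in ((dense_ids, dense_weight), (sparse_ids, sparse_weight)):
        # dict.fromkeys drops repeated IDs (e.g. several chunks of one document),
        # keeping each ID at its first, i.e. best, position.
        for rank, doc_id in enumerate(dict.fromkeys(ids), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# One item per query: (dense_ids, sparse_ids, relevant_ids).
EvalItem = Tuple[Sequence[str], Sequence[str], Set[str]]

def sweep_weights(eval_set: List[EvalItem], weight_pairs: List[Tuple[float, float]],
                  k: int = 5) -> Dict[Tuple[float, float], float]:
    """Mean recall@k over the eval set for each (dense_weight, sparse_weight) pair."""
    results: Dict[Tuple[float, float], float] = {}
    for dense_w, sparse_w in weight_pairs:
        recalls = [recall_at_k(fuse(d, s, dense_w, sparse_w), relevant, k)
                   for d, s, relevant in eval_set]
        results[(dense_w, sparse_w)] = sum(recalls) / len(recalls)
    return results
```

For the three splits above, call `sweep_weights(eval_set, [(0.5, 0.5), (0.7, 0.3), (0.3, 0.7)], k=5)` and compare the mean recall values.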

In my experience, most enterprise knowledge base deployments benefit from a slight sparse-heavy weighting around 40/60 dense/sparse because enterprise documents tend to use precise technical terminology that benefits from exact matching. Tune to your actual content.

One gotcha

Document IDs need to be consistent between your vector store and your Elasticsearch index. If you use different identifiers in the two systems, the RRF merge will not find overlapping results correctly. Use the source document path or a stable UUID as the canonical identifier and store it in both systems at ingestion time.
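A minimal sketch of a stable identifier, assuming the source path is the natural key: `uuid.uuid5` is deterministic, so the same path yields the same ID on every ingestion run. The namespace name `kb.example.internal` and the helper names are hypothetical.

```python
import uuid
from pathlib import PurePosixPath
from typing import Dict, List

# Hypothetical namespace for this knowledge base. Any fixed UUID works as long as
# it never changes; deriving it from a name keeps it reproducible.
KB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "kb.example.internal")

def canonical_doc_id(source_path: str) -> str:
    """Deterministic UUID for a source document: same path, same ID, every run."""
    # PurePosixPath collapses "//" and "/./", so trivially different spellings agree.
    normalized = str(PurePosixPath(source_path))
    return str(uuid.uuid5(KB_NAMESPACE, normalized))

def chunk_metadata(source_path: str, chunks: List[str]) -> List[Dict[str, object]]:
    """Metadata for each chunk bound for the vector store; every chunk carries its parent doc_id."""
    doc_id = canonical_doc_id(source_path)
    return [{"doc_id": doc_id, "chunk_index": i, "source": source_path}
            for i in range(len(chunks))]
```

At ingestion, the same `doc_id` goes into each chunk's metadata (read as `doc.metadata["doc_id"]` by `dense_search`) and becomes the Elasticsearch `_id` (read as `hit["_id"]` by `sparse_search`), assuming the keyword index holds one entry per source document.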

Hybrid search does add complexity to your retrieval pipeline: a second index to keep in sync with the vector store and a fusion step on every query. But in most enterprise deployments where I have added it to a previously vector-only system, recall at k=5 improved by 15 to 25 percentage points on the evaluation set. For a knowledge base that employees rely on for accurate answers, that improvement is worth the implementation effort.

Top comments (1)

Ahmet Özel • Jun 30

Agree hybrid is usually the right default. One thing I would add from production: hybrid only helps when the BM25 and dense arms fail on different queries, otherwise you pay double for the same recall. It is worth logging which arm actually contributed to the final top-k, because often one is carrying almost everything and the fusion weight is doing nothing. The fusion method also matters more than people expect: RRF is a safe start, but per-query weighting or a small reranker over the union tends to beat a fixed alpha once query types vary. How are you tuning the blend, fixed weights or learned?