DEV Community

klement Gunndu
15 Engineering Decisions Behind RAG Hybrid Search

Most people think hybrid search in RAG is just "run BM25 and vector search, combine the results."

There are actually 15 distinct engineering decisions happening between a user's question and the 6 chunks that reach the LLM. I traced through production source code line by line. Here's every single one, with the math and code.

The Pipeline at a Glance

Before diving in, here's the full funnel:

```
100,000 chunks → BM25 + Vector Search → Score Fusion → Cross-Encoder Reranker → 6 chunks → LLM → 1 answer
```

Each stage trades speed for accuracy. The broadest, fastest stage comes first. The most accurate, slowest stage comes last and only sees a handful of candidates.


Part 1: Keyword Search (BM25) — 5 Engineering Decisions

Decision 1: IDF — Score Words by Rarity

BM25 starts with a simple question: how rare is this word across all chunks?

The formula is called IDF (Inverse Document Frequency):

```python
import math

def idf(doc_count: int, doc_freq: int) -> float:
    """Score a word by how rare it is across all chunks."""
    return math.log(
        (doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1
    )

# Example: 10,000 chunks in database
print(idf(10000, 9800))  # "the"        → 0.020 (useless)
print(idf(10000, 500))   # "learning"   → 2.996 (useful)
print(idf(10000, 5))     # "kubernetes" → 7.506 (highly discriminating)
```

"the" appears in 98% of chunks — it tells you nothing about relevance. "kubernetes" appears in 0.05% — it's extremely discriminating. IDF gives rare words high scores and common words near-zero scores.

Without IDF: The word "the" contributes as much as "kubernetes." Every query is dominated by stop words.

Decision 2: Term Frequency Saturation (k1 Parameter)

Raw word counting is broken. A chunk containing "machine" 100 times shouldn't score 100x higher than one containing it once — it's probably spam.

BM25 adds a saturation curve — each additional occurrence contributes less:

```python
def tf_saturated(freq: int, k1: float = 1.2) -> float:
    """Diminishing returns on word repetition."""
    return (freq * (k1 + 1)) / (freq + k1)

# Watch the diminishing returns (the ceiling is k1 + 1 = 2.2)
k1 = 1.2
for f in [1, 2, 5, 10, 100]:
    score = tf_saturated(f, k1)
    print(f"freq={f:3d} → score={score:.2f} ({score / (k1 + 1) * 100:.0f}% of max)")
```
```
freq=  1 → score=1.00 (45% of max)
freq=  2 → score=1.38 (63% of max)
freq=  5 → score=1.77 (81% of max)
freq= 10 → score=1.96 (89% of max)
freq=100 → score=2.17 (99% of max)
```

The first occurrence does 45% of all possible work. The next 99 together add only 54% more. The ceiling is always k1 + 1 — no matter how many times a word appears.

k1 controls saturation speed: Low k1 (0.5) = saturates fast, good for short text. High k1 (3.0) = saturates slowly, good for long documents.
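To make the k1 effect concrete, here's a quick sketch comparing how fast the curve saturates at three k1 values (the tf_saturated function is repeated so the snippet runs standalone):

```python
def tf_saturated(freq: int, k1: float) -> float:
    """BM25 term-frequency saturation; the ceiling is k1 + 1."""
    return (freq * (k1 + 1)) / (freq + k1)

# Fraction of the ceiling reached at freq=1 and freq=10
for k1 in (0.5, 1.2, 3.0):
    at_1 = tf_saturated(1, k1) / (k1 + 1)
    at_10 = tf_saturated(10, k1) / (k1 + 1)
    print(f"k1={k1}: freq=1 → {at_1:.2f} of ceiling, freq=10 → {at_10:.2f}")
```

With k1=0.5, a single occurrence already does two-thirds of the work; with k1=3.0, repetition keeps paying off much longer.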

Decision 3: Document Length Normalization (b Parameter)

An 800-token chunk naturally contains more words than a 50-token chunk. Without correction, longer chunks always win unfairly.

The b parameter penalizes chunks longer than average and boosts shorter ones:

```python
def length_factor(doc_length: int, avg_length: float, b: float = 0.75) -> float:
    """How much to adjust for document length."""
    return 1 - b + b * (doc_length / avg_length)

# Average chunk length = 200 tokens
print(length_factor(50, 200))   # 0.44 → short chunk gets boosted
print(length_factor(200, 200))  # 1.00 → average chunk, no adjustment
print(length_factor(800, 200))  # 3.25 → long chunk gets penalized
```

This factor goes in the denominator. Bigger denominator = smaller score. A word appearing twice in 50 tokens is a stronger signal than twice in 800 tokens.
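Putting Decisions 1-3 together, a single term's full contribution looks like this — a sketch assembling the three pieces the way standard BM25 does (variable names are mine, not the production code's):

```python
import math

def bm25_term_score(term_freq, doc_length, avg_length,
                    doc_count, doc_freq, k1=1.2, b=0.75):
    """IDF × saturated TF, with length normalization in the TF denominator."""
    idf = math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    length_norm = 1 - b + b * (doc_length / avg_length)
    tf = (term_freq * (k1 + 1)) / (term_freq + k1 * length_norm)
    return idf * tf

# "kubernetes" appearing twice, in a 50-token chunk vs an 800-token chunk
print(bm25_term_score(2, 50, 200, 10000, 5))   # ≈ 13.1 — short chunk
print(bm25_term_score(2, 800, 200, 10000, 5))  # ≈ 5.6  — long chunk
```

The same two occurrences of a rare word score roughly 2.3x higher in the 50-token chunk than in the 800-token one.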

Decision 4: Binary Presence for Small Chunks

Here's where production systems diverge from textbook BM25.

Standard BM25 uses the full saturation curve. But for small chunks (128-512 tokens), the difference between 1 and 2 occurrences is noise, not signal. Some production RAG systems simplify radically:

```python
import math

def production_similarity(doc_count, doc_freq, term_freq, boost):
    """Simplified scoring: binary presence × normalized IDF × field boost."""
    # IDF with corpus-size normalization
    idf_num = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    idf_den = math.log(1 + (doc_count - 0.5) / 1.5)
    normalized_idf = idf_num / idf_den

    # Binary: word exists (1) or doesn't (0) — no saturation curve
    presence = min(term_freq, 1)

    return boost * normalized_idf * presence
```

Why? In a 200-token chunk, "machine" appearing 1 time vs 2 times is noise. Binary presence with IDF is more stable than full BM25 for small chunks.

The IDF is also divided by a corpus-size normalizer — this makes scores comparable when searching across multiple knowledge bases simultaneously.
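A quick sketch of what the normalizer buys you: the rarest possible term (one that appears in a single chunk) scores exactly 1.0 regardless of corpus size, so results from a 1,000-chunk knowledge base and a 1,000,000-chunk one can be compared directly (this isolates the IDF portion of the production_similarity function above):

```python
import math

def normalized_idf(doc_count: int, doc_freq: int) -> float:
    """Raw IDF divided by the maximum IDF possible in this corpus."""
    num = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    den = math.log(1 + (doc_count - 0.5) / 1.5)
    return num / den

print(normalized_idf(1_000, 1))      # 1.0 — rarest term, small corpus
print(normalized_idf(1_000_000, 1))  # 1.0 — rarest term, huge corpus
print(normalized_idf(1_000, 900))    # near 0 — common term
```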

Decision 5: Field Boosts — WHERE a Match Happens

Not all text positions are equal. A word in the title is a stronger signal than a word buried in the body:

```python
field_weights = {
    "important_keywords": 30,  # Extracted key terms
    "important_tokens": 20,    # Key topic tokens
    "question_tokens": 20,     # Q&A headings
    "title": 10,               # Document title
    "title_small": 5,          # Lowercase title
    "content": 2,              # Body text
    "content_small": 1,        # Lowercase body (baseline)
}

# Same word, same chunk, different field:
idf_score = 0.85  # normalized IDF for "kubernetes"

title_match = 10 * idf_score    # = 8.5
body_match = 2 * idf_score      # = 1.7
keyword_match = 30 * idf_score  # = 25.5

# A keyword match is 15x more valuable than a body match
```

This hierarchy replaces term frequency as the primary ranking signal. Instead of "how many times does the word appear," the question becomes "where does it appear?"
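A minimal sketch of how per-field scoring comes together (field names follow the table above; the whitespace tokenization and matching here are simplified stand-ins, not the production logic):

```python
field_weights = {"important_keywords": 30, "title": 10, "content": 2}

def score_term(term: str, fields: dict, idf: float) -> float:
    """Sum the boosted IDF over every field where the term appears."""
    return sum(
        field_weights[name] * idf
        for name, text in fields.items()
        if term in text.lower().split()
    )

chunk = {
    "title": "Deploying Kubernetes Clusters",
    "content": "This guide covers kubernetes setup on bare metal.",
    "important_keywords": "kubernetes deployment cluster",
}
print(score_term("kubernetes", chunk, idf=0.85))  # (30 + 10 + 2) * 0.85
```

A term hitting all three fields collects every boost; a term buried only in the body collects just the baseline weight.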


Part 2: Semantic Search (Cosine Similarity) — 4 Engineering Decisions

Decision 6: Embedding Text as Vectors

An embedding model converts text into a list of numbers (a vector) that represents meaning:

```python
# Conceptual (real embeddings have 1024 dimensions)
query_vector   = embed("machine learning algorithms")  # [0.8, 0.6, 0.1, 0.3]
chunk_a_vector = embed("neural network training")      # [0.7, 0.5, 0.2, 0.4]
chunk_b_vector = embed("history of ancient Rome")      # [0.1, 0.0, 0.9, 0.2]
```

"Machine learning" and "neural network training" share zero words but get similar vectors because the meaning is similar. This is what BM25 fundamentally cannot do.

Decision 7: Cosine Similarity — Angle, Not Magnitude

Cosine similarity measures the angle between two vectors, ignoring their length:

```python
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    """Measure directional similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    dot_product = np.dot(a, b)            # How much they overlap
    magnitude_a = np.linalg.norm(a)       # Length of arrow A
    magnitude_b = np.linalg.norm(b)       # Length of arrow B
    return dot_product / (magnitude_a * magnitude_b)

query = [0.8, 0.6, 0.1, 0.3]

# Related topic — similar direction
print(cosine_similarity(query, [0.9, 0.7, 0.0, 0.2]))  # 0.988

# Unrelated topic — different direction
print(cosine_similarity(query, [0.1, 0.0, 0.9, 0.2]))  # 0.237
```

Why magnitude doesn't matter: "I like cats" (short) and "I really really like cats a lot" (long) produce vectors pointing in the same direction but with different lengths. Cosine correctly sees them as identical meaning. Raw dot product would rank the longer text higher — cosine fixes this.
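This is easy to verify: scale a chunk's vector by 5x and cosine doesn't move, while the raw dot product does (the cosine_similarity function from above is repeated so the snippet runs standalone):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = [0.8, 0.6, 0.1, 0.3]
chunk = [0.9, 0.7, 0.0, 0.2]
longer = [x * 5 for x in chunk]   # same direction, 5x the magnitude

print(cosine_similarity(query, chunk))    # ~0.988
print(cosine_similarity(query, longer))   # identical — the angle is unchanged
print(np.dot(query, longer) / np.dot(query, chunk))  # ~5 — dot product scales
```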

Decision 8: Pre-Normalized Vectors = Faster Math

When vectors have magnitude = 1 (pre-normalized), cosine simplifies to just a dot product:

```python
# If ||A|| = 1 and ||B|| = 1:
# cosine(A, B) = dot(A, B) / (1 * 1) = dot(A, B)

# Skip the expensive square root calculation entirely
# Most embedding models (OpenAI, BGE, Cohere) output normalized vectors by default
```

This is why vector databases use "dot product" as the distance metric — it gives identical results to cosine when vectors are pre-normalized, with less computation.
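A quick sketch confirming the equivalence: normalize once at index time, and a plain dot product at query time reproduces the full cosine formula:

```python
import numpy as np

def normalize(v):
    """Scale to unit length once, at indexing time."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a, b = [0.8, 0.6, 0.1, 0.3], [0.9, 0.7, 0.0, 0.2]
an, bn = normalize(a), normalize(b)

fast = np.dot(an, bn)   # no norms, no square roots at query time
full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(fast, full)  # same value
```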

Decision 9: Approximate Nearest Neighbors (HNSW)

Checking cosine similarity against all 100,000 vectors is slow. HNSW (Hierarchical Navigable Small World) builds a graph structure for approximate search:

```python
# Elasticsearch kNN search
search.knn(
    field="embedding_1024",
    k=100,                    # Return 100 nearest vectors
    num_candidates=200,       # Examine 200 candidates (2x for accuracy)
    query_vector=query_vec,
    similarity=0.1,           # Reject anything below cosine 0.1
)
```

Think of HNSW like a map with highways and local roads. Instead of visiting every address, you take a highway to the right neighborhood, then search locally. 100x faster, might miss a slightly better result.

num_candidates = 2 × k means: "examine twice as many candidates as I need, then return the best k." More candidates = more accurate but slower.
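For context, here's what the exact (non-approximate) version looks like — the brute-force scan that HNSW is built to avoid. This is a NumPy sketch with random data, not the production index:

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = rng.normal(size=(100_000, 64))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)  # pre-normalize
query = chunks[42] + rng.normal(scale=0.01, size=64)      # query near chunk 42
query /= np.linalg.norm(query)

# Exact kNN: score every vector, keep the best k — O(n) work per query
scores = chunks @ query                       # cosine via dot product
top_k = np.argpartition(scores, -100)[-100:]  # indices of the 100 best
print(42 in top_k)  # True — exact search always finds the true neighbor
```

HNSW trades this guarantee for speed: it visits only a graph-guided fraction of the 100,000 vectors, so it can occasionally miss a true neighbor.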


Part 3: Score Fusion — 3 Engineering Decisions

Decision 10: The Scale Mismatch Problem

BM25 produces scores like 1.521, 15.2, 0.149 (range: 0 to ~20). Cosine produces scores like 0.988, 0.237 (range: -1 to 1). Adding them directly is like adding kilograms and meters.

```python
# Naive addition — BM25 dominates
naive = 15.2 + 0.95  # = 16.15
# BM25 contributes 94%, cosine only 6%
# The semantic signal is drowned out
```
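The standard fix is to rescale each score list to a common [0, 1] range before combining — a min-max normalization sketch:

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale any score list to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

bm25 = [15.2, 1.521, 0.149]   # raw BM25: 0 to ~20
cosine = [0.95, 0.60, 0.237]  # raw cosine: -1 to 1

bm25_n, cos_n = min_max(bm25), min_max(cosine)
fused = [0.5 * b + 0.5 * c for b, c in zip(bm25_n, cos_n)]
print(fused)  # both signals now contribute equally
```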

Decision 11: Two-Stage Weighted Fusion

Production systems use different weights at different stages:

```python
# Stage 1: Elasticsearch retrieval (broad net, maximize recall)
es_score = 0.05 * bm25_score + 0.95 * cosine_score

# Stage 2: Python reranking (precise, maximize precision)
# Recompute BOTH scores in Python — can't unmix ES's combined score
from sklearn.metrics.pairwise import cosine_similarity as cos_sim

vector_scores = cos_sim([query_vec], chunk_vectors)[0]  # typically 0-1 for text embeddings
token_scores = token_overlap(query_keywords, chunk_keywords)  # range: [0, 1]

# Both in the 0-1 range now — fair to add
final_scores = 0.70 * vector_scores + 0.30 * token_scores
```

Why recompute in Python? Elasticsearch returns one combined score — like mixed paint, you can't unmix it back into BM25 and cosine components. Python needs the separate scores to re-weight them at 30/70 instead of 5/95.

Token overlap is simple word counting: how many query keywords appear in the chunk?

```python
def token_overlap(query_kw: list, chunk_kw: list) -> float:
    """What fraction of query words appear in the chunk?"""
    matches = sum(1 for w in query_kw if w in chunk_kw)
    return matches / len(query_kw) if query_kw else 0.0

# Query: ["nginx", "ERR_CONN_REFUSED", "error"]
# Chunk: ["nginx", "ERR_CONN_REFUSED", "error", "proxy_pass"]
print(token_overlap(
    ["nginx", "ERR_CONN_REFUSED", "error"],
    ["nginx", "ERR_CONN_REFUSED", "error", "proxy_pass"]
))  # 1.0 — perfect keyword match
```

Decision 12: RRF — The Score-Free Alternative

Reciprocal Rank Fusion ignores scores entirely and uses only rank positions:

```python
def rrf_score(ranks: dict, k: int = 60) -> float:
    """Merge rankings from multiple systems using only positions."""
    return sum(1.0 / (k + rank) for rank in ranks.values())

# D3: BM25 ranked it #1, Vector ranked it #2
print(rrf_score({"bm25": 1, "vector": 2}))   # 0.03252 — consensus winner

# D4: Vector ranked it #1, BM25 never found it (rank=1000)
print(rrf_score({"bm25": 1000, "vector": 1})) # 0.01734 — penalized

# Consensus beats individual confidence
```

With k=60, the difference between rank #1 and #2 is only 0.00026. No single ranker can dominate. A chunk ranked top 5 by BOTH systems beats a chunk ranked #1 by one system but #50 by the other.

k=60 favors consensus (safe for RAG). k=1 lets one ranker override the other (risky).
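To see that, compare a solid-consensus chunk against a one-system outlier at both k values (hypothetical ranks, reusing the rrf_score function from above):

```python
def rrf_score(ranks: dict, k: int = 60) -> float:
    """Merge rankings from multiple systems using only positions."""
    return sum(1.0 / (k + rank) for rank in ranks.values())

both_third = {"bm25": 3, "vector": 3}    # solid consensus
one_first = {"bm25": 1, "vector": 1000}  # one system loves it, the other missed it

for k in (60, 1):
    print(f"k={k}: consensus {rrf_score(both_third, k):.4f} "
          f"vs outlier {rrf_score(one_first, k):.4f}")
# k=60: consensus wins (0.0317 vs 0.0173)
# k=1:  outlier wins   (0.5000 vs 0.5010)
```

At k=1, a single #1 ranking outweighs agreement between both systems; at k=60, it can't.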


Part 4: Cross-Encoder Reranking — 3 Engineering Decisions

Decision 13: Bi-Encoder vs Cross-Encoder

Bi-encoders (embedding models) encode query and document separately — they never see each other:

```
Query: "diabetes causes"  ──→ Encoder ──→ Vector_Q
                                            │
                                     cosine similarity
                                            │
Chunk: "pancreatic cell destruction" ──→ Encoder ──→ Vector_D
```

Cross-encoders concatenate query + document and process them together:

```
Input: "[CLS] diabetes causes [SEP] pancreatic cell destruction [SEP]"
                          │
                  Full Transformer
                  (every query word attends to every chunk word)
                          │
                  Relevance score: 0.95
```

The cross-encoder sees "diabetes" and "pancreatic" in the same context and recognizes the connection. The bi-encoder compressed each text independently and might miss it.

The trade-off: Cross-encoders are far more accurate but cannot pre-compute anything. Every (query, chunk) pair must be processed from scratch.

Decision 14: Two-Stage Pipeline

Cross-encoders are too slow for full-corpus search:

```
Bi-encoder:    encode query (10ms) + compare 100K vectors (50ms) = 60ms
Cross-encoder: process 100K pairs × 0.5ms each = 50,000ms = 50 SECONDS
```

The solution — use both in stages:

```
100,000 chunks
    │
    ▼ Stage 1: BM25 + Vector (fast, ~50ms)
  200 candidates
    │
    ▼ Stage 2: Cross-Encoder (precise, ~80ms)
    6 chunks
    │
    ▼ Stage 3: LLM generates answer
```

Stage 1 maximizes recall — cast a wide net. Stage 2 maximizes precision — pick the best from what was found.

Decision 15: Precision Over Recall

The final and most important engineering decision: for RAG, precision matters more than recall.

```
Before reranking: 6 out of 10 top chunks are relevant  → Precision = 60%
After reranking:  8 out of 10 top chunks are relevant  → Precision = 80%
```

You can survive missing one relevant chunk. But one irrelevant chunk in the LLM context can poison the entire answer — the LLM might generate a response based on wrong information, and the user has no way to know.

The full reranking formula:

```python
# With cross-encoder model
final = 0.30 * token_overlap + 0.70 * cross_encoder_score + rank_features

# Without cross-encoder (fallback)
final = 0.30 * token_overlap + 0.70 * cosine_similarity + rank_features
```

The cross-encoder replaces cosine in the 70% slot. Same weights, upgraded engine. Adding a reranker is like swapping regular flour for premium flour in a recipe — the recipe stays the same, the result gets better.


The Complete Pipeline

```
User: "What are the tax implications of remote work?"
                              │
                              ▼
                    ┌─────────────────┐
                    │  Query Analysis  │
                    │                  │
                    │  Keywords: ["tax", "implications", "remote", "work"]
                    │  Vector: embed(query) → [1024 numbers]
                    └────────┬────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Hybrid Search   │
                    │                  │
                    │  BM25 (5%) + Vector (95%)
                    │  → 1,024 candidates
                    └────────┬────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Reranking       │
                    │                  │
                    │  30% token + 70% cross-encoder
                    │  + tag bonus + pagerank
                    └────────┬────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Threshold       │
                    │                  │
                    │  score ≥ 0.2 → keep
                    │  0 results? → retry at 0.17
                    │  Return top 6
                    └────────┬────────┘
                              │
                              ▼
                    Top 6 chunks → LLM → Answer with citations
```

Every stage trades speed for accuracy. 100,000 chunks become 1,024 become 6 become 1 answer.


Key Takeaways

  1. BM25 is not TF-IDF. BM25 has saturation and length normalization. For small chunks, even BM25 is overkill — binary presence + IDF works better.

  2. Cosine similarity is not a percentage. A cosine of 0.9 means an angle of ~26 degrees. What counts as "similar" depends entirely on the embedding model.
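The angle claim is one line of stdlib math — a quick sanity check:

```python
import math

for cos in (0.99, 0.9, 0.5, 0.0):
    print(f"cosine {cos} → angle {math.degrees(math.acos(cos)):.1f}°")
# cosine 0.9 is ~25.8°; whether that counts as "similar" is a model-specific call
```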

  3. Score fusion is harder than it looks. BM25 and cosine scores are on different scales. You must normalize first, or use RRF which ignores scores entirely.

  4. Cross-encoders can't fix bad retrieval. If the relevant chunk isn't in the top 200, no reranker will ever find it. Fix retrieval first.

  5. For RAG, precision beats recall. One bad chunk in the LLM context can poison the entire answer. Better to send 5 great chunks than 6 mediocre ones.


Follow @klement_gunndu for more RAG and AI engineering content. We're building in public.
