I've been building RAG (Retrieval-Augmented Generation) systems for a while now, and I recently made one change that boosted my retrieval accuracy from ~60% to ~85%.
The change? Adding BM25 keyword matching alongside my existing vector search.
That's it. No fancy model swaps. No expensive rerankers. Just combining two search strategies that complement each other's blind spots.
Let me walk you through exactly what happened, why it works, and what I learned from other engineers running RAG in production.
The Problem With Pure Vector Search
Vector search (using embeddings + cosine similarity) is incredible at understanding meaning. Ask it for "employee vacation policy" and it'll find documents about "time off benefits," "annual leave," and "PTO guidelines."
But here's the catch — it sometimes misses exact terminology.
In my test set of 50 questions against internal documentation, I kept running into this pattern:
User query: "What's the PTO policy?"
Vector search found: Chunks about "vacation time," "time off benefits," "leave of absence"
Vector search missed: The exact chunk that contained the acronym "PTO"
The embeddings understood the concept of paid time off perfectly. But when a document used a specific acronym, abbreviation, or domain-specific term, vector search would sometimes grab semantically similar but wrong chunks.
This matters a lot in production. Your users don't speak in perfect semantic paragraphs — they use acronyms, product names, jargon, and exact phrases they remember from documents.
Enter Hybrid Retrieval: Vector + BM25
BM25 is an old-school keyword matching algorithm. It doesn't understand meaning at all — it just finds documents that contain the exact words in your query. Think of it as a very sophisticated Ctrl+F.
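For the curious, the scoring behind BM25 is simple enough to sketch in pure Python. This is a minimal, illustrative implementation of Okapi BM25 (using the common defaults k1=1.5, b=0.75), not a production library — the documents and tokenization here are toy placeholders:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using Okapi BM25.
    Rewards term frequency with diminishing returns (k1) and
    penalizes long documents (b)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

docs = [
    "our PTO policy grants twenty days".split(),
    "vacation time accrues monthly".split(),
]
print(bm25_scores(["PTO", "policy"], docs))
```

Note how the second document scores zero: BM25 has no idea that "vacation time" and "PTO" mean the same thing. That blindness is exactly what vector search covers.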
Hybrid retrieval combines both:
- Vector search finds chunks that are semantically similar to your query
- BM25 finds chunks that contain exact keyword matches
- Reciprocal Rank Fusion (RRF) merges both result sets into a single ranked list
Here's a simplified view of how RRF works:
```python
def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    """
    Merge two ranked lists using RRF.
    k=60 is the standard constant that controls
    how much weight lower-ranked results get.
    """
    fused_scores = {}
    for rank, doc in enumerate(vector_results):
        fused_scores[doc.id] = fused_scores.get(doc.id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(bm25_results):
        fused_scores[doc.id] = fused_scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Sort by combined score -- docs appearing in BOTH lists rank highest
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
```
The beauty of RRF is that documents appearing in both result sets naturally bubble to the top. If a chunk is semantically relevant AND contains exact keywords, it's almost certainly the right one.
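To see that bubbling-up effect concretely, here's a small self-contained demo. The document IDs are made up for illustration, and the fusion logic mirrors the function above:

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["id"])

def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    fused = {}
    for results in (vector_results, bm25_results):
        for rank, doc in enumerate(results):
            fused[doc.id] = fused.get(doc.id, 0) + 1 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# "pto-chunk" is ranked #2 by vector search and #1 by BM25;
# appearing high in BOTH lists pushes it to the top after fusion,
# ahead of "leave-faq", which vector search alone ranked #1.
vector_results = [Doc("leave-faq"), Doc("pto-chunk"), Doc("benefits")]
bm25_results = [Doc("pto-chunk"), Doc("handbook"), Doc("leave-faq")]

fused = reciprocal_rank_fusion(vector_results, bm25_results)
print(fused[0][0])  # "pto-chunk"
```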
My Results
| Metric | Pure Vector | Hybrid (Vector + BM25) |
|---|---|---|
| Accuracy (50 questions) | ~60% | ~85% |
| Acronym/jargon queries | Poor | Strong |
| Natural language queries | Strong | Strong |
| Implementation complexity | Simple | Moderate |
The 25-percentage-point jump came almost entirely from queries involving exact terminology — acronyms, product names, specific phrases, and domain jargon.
For natural language queries like "how do I request time off?", both approaches performed similarly. The hybrid approach essentially kept vector search's strengths while patching its biggest weakness.
What I Learned From Other Engineers
After sharing these results, I got some fascinating insights from engineers running RAG in production. Here are the key takeaways:
1. "It depends on your data" is annoyingly true
For technical/QA documentation (code, APIs, specs), BM25 alone can outperform vector search because queries tend to use exact terms. For enterprise/business documents with natural language, vector search pulls more weight. Hybrid gives you the best of both worlds, but the weighting between vector and BM25 scores should be tuned for your specific corpus.
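One simple way to make that weighting tunable is a weighted RRF variant. This is a sketch, not a standard algorithm from a particular library; `alpha` is a hypothetical knob you'd sweep against your eval set:

```python
def weighted_rrf(vector_ids, bm25_ids, alpha=0.7, k=60):
    """Weighted RRF: alpha scales the vector list's contribution,
    (1 - alpha) scales BM25's. alpha=0.5 recovers plain RRF
    (up to a constant factor)."""
    fused = {}
    for rank, doc_id in enumerate(vector_ids):
        fused[doc_id] = fused.get(doc_id, 0) + alpha / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_ids):
        fused[doc_id] = fused.get(doc_id, 0) + (1 - alpha) / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Sweep alpha against your eval set: higher favors semantic matches,
# lower favors exact keyword matches
for alpha in (0.3, 0.5, 0.7):
    print(alpha, weighted_rrf(["a", "b"], ["b", "c"], alpha=alpha)[0])
```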
2. Rerankers aren't always worth the latency
Cross-encoder rerankers (a second-pass model that re-scores your top results) are often recommended as the next step after hybrid retrieval. But several production teams reported minimal improvement when their initial retrieval was already solid. One engineer measured an NDCG of 0.74 on their hybrid setup and saw almost no gain from adding a reranker — so they dropped it to reduce latency.
Takeaway: If your hybrid retrieval is already good, a reranker might just add 100-200ms of latency for marginal improvement. Measure before committing.
3. Chunk size is NOT a solved problem
I started with ~500 characters and 100-character overlap, which works OK for my use case. But the consensus from production teams is clear: there is no universal best chunk size.
The right size depends on:
- Document type: Legal docs need larger chunks (full clauses); FAQs need smaller ones
- Query patterns: How specific are your users' questions?
- Content structure: Is your data prose, tables, code, or mixed?
A good starting heuristic: think of chunks as "one complete thought." For prose, that's roughly a paragraph. For code, it's a function. For FAQs, it's a question-answer pair.
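A rough sketch of that heuristic for prose: split on paragraph boundaries and greedily pack whole paragraphs up to a size budget. This is a starting point, not a full chunker — a single paragraph longer than the budget is kept whole rather than split:

```python
def chunk_by_paragraph(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of at most max_chars,
    so each chunk stays close to 'one complete thought'.
    Oversized single paragraphs are kept intact."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

text = "First thought.\n\nSecond thought.\n\n" + "A longer paragraph. " * 30
print([len(c) for c in chunk_by_paragraph(text)])
```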
4. Dimensionality reduction is an underexplored lever
One interesting approach I came across: reducing embedding vectors from their native dimensions (e.g., 1536 for OpenAI) down to 25-100 dimensions using PCA, UMAP, or similar methods. The goal is to strip away noisy dimensions and keep only the ones that carry meaningful signal.
This could potentially improve accuracy AND speed — smaller vectors mean faster similarity search and less noise in the matching. Worth experimenting with if you're at scale.
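If you want to experiment, a bare-bones PCA via SVD takes only a few lines of NumPy. This is a sketch under toy data (random vectors standing in for real embeddings); whether reduction helps or hurts accuracy is entirely corpus-dependent, so measure against your eval set:

```python
import numpy as np

def pca_reduce(embeddings, n_components=50):
    """Project embeddings onto their top principal components.
    embeddings: (n_docs, dim) array. Returns (reduced, components, mean)
    so new query vectors can be projected with the same transform."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # SVD of the centered matrix gives the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean

# Toy stand-in for real document embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 256)).astype(np.float32)
reduced, components, mean = pca_reduce(docs, n_components=50)
print(reduced.shape)  # (200, 50)

# Project a new query with the same transform before similarity search:
# query_reduced = (query_embedding - mean) @ components.T
```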
Getting Started With Hybrid Retrieval
If you're running pure vector search and want to try hybrid, here's a practical starting point:
Option A: PostgreSQL + pgvector
If you're already using Postgres, you can run both searches in the same database:
```sql
-- Vector search (pgvector)
SELECT id, content, embedding <=> query_embedding AS vector_distance
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 20;

-- Keyword search (Postgres full-text search; note ts_rank is not true BM25 --
-- use ParadeDB's pg_search extension if you need proper BM25 scoring)
SELECT id, content, ts_rank(to_tsvector(content), plainto_tsquery('PTO policy')) AS keyword_score
FROM documents
WHERE to_tsvector(content) @@ plainto_tsquery('PTO policy')
ORDER BY keyword_score DESC
LIMIT 20;

-- Merge the two result sets in application code using RRF
```
Option B: Dedicated tools
- Weaviate, Qdrant, and Milvus all support hybrid search natively
- LangChain and LlamaIndex have hybrid retriever implementations
- NornicDB (MIT licensed) handles the full pipeline — embedding, BM25, reranking — in-process with impressive latency (~7ms on 1M documents)
Option C: Keep it simple
If you want minimal infrastructure, just run:
- Your existing vector DB for semantic search
- Elasticsearch or even SQLite FTS5 for BM25
- Merge results with the RRF function above
It's not elegant, but it works. You can optimize later.
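To show just how minimal the SQLite route can be: FTS5 ships with most Python builds of sqlite3 and has a built-in bm25() rank function. The documents here are made-up examples (this assumes your sqlite3 was compiled with FTS5 enabled, which is the default almost everywhere):

```python
import sqlite3

# In-memory SQLite database using the FTS5 full-text extension
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
conn.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [
        ("Our PTO policy grants 20 days of paid time off per year.",),
        ("Vacation time accrues monthly for all full-time employees.",),
        ("Submit leave requests through the HR portal.",),
    ],
)

# bm25() is FTS5's built-in rank function; lower scores mean better matches
rows = conn.execute(
    "SELECT rowid, content, bm25(docs) AS score "
    "FROM docs WHERE docs MATCH 'PTO' ORDER BY score LIMIT 5"
).fetchall()
print(rows)
```

Feed the rowids from this query plus the ids from your vector search into the RRF function above and you have a working hybrid retriever with zero extra infrastructure.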
The 80/20 of RAG Optimization
Here's what I've learned about where to spend your time:
High impact, do first:
- Switch from pure vector to hybrid retrieval
- Build a proper evaluation set (50+ questions with expected answers)
- Tune your chunk boundaries to match document structure
Moderate impact, do second:
- Experiment with chunk sizes for your specific data
- Try different embedding models
- Add metadata filtering (date, source, category)
Low impact unless at scale:
- Cross-encoder reranking (measure before committing)
- Dimensionality reduction on embeddings
- HyDE (Hypothetical Document Embeddings)
The single best investment? Build an eval set. Without one, you're just guessing whether your changes help or hurt.
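An eval harness doesn't need to be fancy. Here's a minimal hit-rate@k loop — `retrieve` is a placeholder for your own search function (returning ranked chunk ids), and the questions and ids below are made up for illustration:

```python
def evaluate_retrieval(retrieve, eval_set, top_k=5):
    """Hit-rate@k: fraction of questions where an expected chunk id
    appears in the top_k retrieved results."""
    hits = 0
    for question, expected_ids in eval_set:
        retrieved = retrieve(question)[:top_k]
        if any(doc_id in expected_ids for doc_id in retrieved):
            hits += 1
    return hits / len(eval_set)

# Toy stand-in retriever so the harness runs end to end
def fake_retrieve(question):
    return ["pto-chunk", "faq-1"] if "PTO" in question else ["faq-1"]

eval_set = [
    ("What's the PTO policy?", {"pto-chunk"}),
    ("How do I request time off?", {"leave-chunk"}),
]
print(evaluate_retrieval(fake_retrieve, eval_set))  # 0.5
```

Run this before and after every retrieval change (hybrid vs. pure vector, chunk sizes, rerankers) and you replace guesswork with a number.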
Wrapping Up
Hybrid retrieval isn't new or cutting-edge — BM25 has been around since 1994. But combining it with modern vector search is one of those "boring but effective" improvements that should probably be the default for any production RAG system.
If you're running RAG with pure vector search and your accuracy isn't where you want it to be, try adding BM25 before you reach for more complex solutions. It took me about half a day to implement and delivered a bigger improvement than anything else I've tried.
What does your RAG setup look like? Are you running pure vector, hybrid, or something else entirely? I'd love to hear what's working (or not working) for you in production.
Tags: rag, ai, machinelearning, tutorial