I've been building RAG (Retrieval-Augmented Generation) systems for a while now, and I recently made one change that boosted my retrieval accuracy from ~60% to ~85%.
The change? Adding BM25 keyword matching alongside my existing vector search.
That's it. No fancy model swaps. No expensive rerankers. Just combining two search strategies that complement each other's blind spots.
Let me walk you through exactly what happened, why it works, and what I learned from other engineers running RAG in production.
The Problem With Pure Vector Search
Vector search (using embeddings + cosine similarity) is incredible at understanding meaning. Ask it for "employee vacation policy" and it'll find documents about "time off benefits," "annual leave," and "PTO guidelines."
But here's the catch — it sometimes misses exact terminology.
In my test set of 50 questions against internal documentation, I kept running into this pattern:
User query: "What's the PTO policy?"
Vector search found: Chunks about "vacation time," "time off benefits," "leave of absence"
Vector search missed: The exact chunk that contained the acronym "PTO"
The embeddings understood the concept of paid time off perfectly. But when a document used a specific acronym, abbreviation, or domain-specific term, vector search would sometimes grab semantically similar but wrong chunks.
This matters a lot in production. Your users don't speak in perfect semantic paragraphs — they use acronyms, product names, jargon, and exact phrases they remember from documents.
Enter Hybrid Retrieval: Vector + BM25
BM25 is an old-school keyword matching algorithm. It doesn't understand meaning at all — it just finds documents that contain the exact words in your query. Think of it as a very sophisticated Ctrl+F.
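For the curious, the scoring behind BM25 is simple enough to sketch in pure Python. This is a minimal, illustrative implementation of Okapi BM25 (using the common defaults k1=1.5, b=0.75), not a production library — the documents and tokenization here are toy placeholders:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using Okapi BM25.
    Rewards term frequency with diminishing returns (k1) and
    penalizes long documents (b)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

docs = [
    "our PTO policy grants twenty days".split(),
    "vacation time accrues monthly".split(),
]
print(bm25_scores(["PTO", "policy"], docs))
```

Note how the second document scores zero: BM25 has no idea that "vacation time" and "PTO" mean the same thing. That blindness is exactly what vector search covers.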
Hybrid retrieval combines both:
- Vector search finds chunks that are semantically similar to your query
- BM25 finds chunks that contain exact keyword matches
- Reciprocal Rank Fusion (RRF) merges both result sets into a single ranked list
Here's a simplified view of how RRF works:
```python
def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    """
    Merge two ranked lists using RRF.
    k=60 is the standard constant that controls
    how much weight lower-ranked results get.
    """
    fused_scores = {}
    for rank, doc in enumerate(vector_results):
        fused_scores[doc.id] = fused_scores.get(doc.id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(bm25_results):
        fused_scores[doc.id] = fused_scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Sort by combined score -- docs appearing in BOTH lists rank highest
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
```
The beauty of RRF is that documents appearing in both result sets naturally bubble to the top. If a chunk is semantically relevant AND contains exact keywords, it's almost certainly the right one.
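To see that bubbling-up effect concretely, here's a small self-contained demo. The document IDs are made up for illustration, and the fusion logic mirrors the function above:

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["id"])

def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    fused = {}
    for results in (vector_results, bm25_results):
        for rank, doc in enumerate(results):
            fused[doc.id] = fused.get(doc.id, 0) + 1 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# "pto-chunk" is ranked #2 by vector search and #1 by BM25;
# appearing high in BOTH lists pushes it to the top after fusion,
# ahead of "leave-faq", which vector search alone ranked #1.
vector_results = [Doc("leave-faq"), Doc("pto-chunk"), Doc("benefits")]
bm25_results = [Doc("pto-chunk"), Doc("handbook"), Doc("leave-faq")]

fused = reciprocal_rank_fusion(vector_results, bm25_results)
print(fused[0][0])  # "pto-chunk"
```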
My Results
| Metric | Pure Vector | Hybrid (Vector + BM25) |
|---|---|---|
| Accuracy (50 questions) | ~60% | ~85% |
| Acronym/jargon queries | Poor | Strong |
| Natural language queries | Strong | Strong |
| Implementation complexity | Simple | Moderate |
The 25-percentage-point jump came almost entirely from queries involving exact terminology — acronyms, product names, specific phrases, and domain jargon.
For natural language queries like "how do I request time off?", both approaches performed similarly. The hybrid approach essentially kept vector search's strengths while patching its biggest weakness.
What I Learned From Other Engineers
After sharing these results, I got some fascinating insights from engineers running RAG in production. Here are the key takeaways:
1. "It depends on your data" is annoyingly true
For technical/QA documentation (code, APIs, specs), BM25 alone can outperform vector search because queries tend to use exact terms. For enterprise/business documents with natural language, vector search pulls more weight. Hybrid gives you the best of both worlds, but the weighting between vector and BM25 scores should be tuned for your specific corpus.
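One simple way to make that weighting tunable is a weighted RRF variant. This is a sketch, not a standard algorithm from a particular library; `alpha` is a hypothetical knob you'd sweep against your eval set:

```python
def weighted_rrf(vector_ids, bm25_ids, alpha=0.7, k=60):
    """Weighted RRF: alpha scales the vector list's contribution,
    (1 - alpha) scales BM25's. alpha=0.5 recovers plain RRF
    (up to a constant factor)."""
    fused = {}
    for rank, doc_id in enumerate(vector_ids):
        fused[doc_id] = fused.get(doc_id, 0) + alpha / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_ids):
        fused[doc_id] = fused.get(doc_id, 0) + (1 - alpha) / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Sweep alpha against your eval set: higher favors semantic matches,
# lower favors exact keyword matches
for alpha in (0.3, 0.5, 0.7):
    print(alpha, weighted_rrf(["a", "b"], ["b", "c"], alpha=alpha)[0])
```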
2. Rerankers aren't always worth the latency
Cross-encoder rerankers (a second-pass model that re-scores your top results) are often recommended as the next step after hybrid retrieval. But several production teams reported minimal improvement when their initial retrieval was already solid. One engineer measured an NDCG of 0.74 on their hybrid setup and saw almost no gain from adding a reranker — so they dropped it to reduce latency.
Takeaway: If your hybrid retrieval is already good, a reranker might just add 100-200ms of latency for marginal improvement. Measure before committing.
3. Chunk size is NOT a solved problem
I started with ~500 characters and 100-character overlap, which works OK for my use case. But the consensus from production teams is clear: there is no universal best chunk size.
The right size depends on:
- Document type: Legal docs need larger chunks (full clauses); FAQs need smaller ones
- Query patterns: How specific are your users' questions?
- Content structure: Is your data prose, tables, code, or mixed?
A good starting heuristic: think of chunks as "one complete thought." For prose, that's roughly a paragraph. For code, it's a function. For FAQs, it's a question-answer pair.
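A rough sketch of that heuristic for prose: split on paragraph boundaries and greedily pack whole paragraphs up to a size budget. This is a starting point, not a full chunker — a single paragraph longer than the budget is kept whole rather than split:

```python
def chunk_by_paragraph(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of at most max_chars,
    so each chunk stays close to 'one complete thought'.
    Oversized single paragraphs are kept intact."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

text = "First thought.\n\nSecond thought.\n\n" + "A longer paragraph. " * 30
print([len(c) for c in chunk_by_paragraph(text)])
```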
4. Dimensionality reduction is an underexplored lever
One interesting approach I came across: reducing embedding vectors from their native dimensions (e.g., 1536 for OpenAI) down to 25-100 dimensions using PCA, UMAP, or similar methods. The goal is to strip away noisy dimensions and keep only the ones that carry meaningful signal.
This could potentially improve accuracy AND speed — smaller vectors mean faster similarity search and less noise in the matching. Worth experimenting with if you're at scale.
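If you want to experiment, a bare-bones PCA via SVD takes only a few lines of NumPy. This is a sketch under toy data (random vectors standing in for real embeddings); whether reduction helps or hurts accuracy is entirely corpus-dependent, so measure against your eval set:

```python
import numpy as np

def pca_reduce(embeddings, n_components=50):
    """Project embeddings onto their top principal components.
    embeddings: (n_docs, dim) array. Returns (reduced, components, mean)
    so new query vectors can be projected with the same transform."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # SVD of the centered matrix gives the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean

# Toy stand-in for real document embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 256)).astype(np.float32)
reduced, components, mean = pca_reduce(docs, n_components=50)
print(reduced.shape)  # (200, 50)

# Project a new query with the same transform before similarity search:
# query_reduced = (query_embedding - mean) @ components.T
```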
Getting Started With Hybrid Retrieval
If you're running pure vector search and want to try hybrid, here's a practical starting point:
Option A: PostgreSQL + pgvector
If you're already using Postgres, you can run both searches in the same database:
```sql
-- Vector search (pgvector)
SELECT id, content, embedding <=> query_embedding AS vector_distance
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 20;

-- Keyword search (Postgres full-text search; note ts_rank is not true BM25 --
-- use ParadeDB's pg_search extension if you need proper BM25 scoring)
SELECT id, content, ts_rank(to_tsvector(content), plainto_tsquery('PTO policy')) AS keyword_score
FROM documents
WHERE to_tsvector(content) @@ plainto_tsquery('PTO policy')
ORDER BY keyword_score DESC
LIMIT 20;

-- Merge the two result sets in application code using RRF
```
Option B: Dedicated tools
- Weaviate, Qdrant, and Milvus all support hybrid search natively
- LangChain and LlamaIndex have hybrid retriever implementations
- NornicDB (MIT licensed) handles the full pipeline — embedding, BM25, reranking — in-process with impressive latency (~7ms on 1M documents)
Option C: Keep it simple
If you want minimal infrastructure, just run:
- Your existing vector DB for semantic search
- Elasticsearch or even SQLite FTS5 for BM25
- Merge results with the RRF function above
It's not elegant, but it works. You can optimize later.
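To show just how minimal the SQLite route can be: FTS5 ships with most Python builds of sqlite3 and has a built-in bm25() rank function. The documents here are made-up examples (this assumes your sqlite3 was compiled with FTS5 enabled, which is the default almost everywhere):

```python
import sqlite3

# In-memory SQLite database using the FTS5 full-text extension
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
conn.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [
        ("Our PTO policy grants 20 days of paid time off per year.",),
        ("Vacation time accrues monthly for all full-time employees.",),
        ("Submit leave requests through the HR portal.",),
    ],
)

# bm25() is FTS5's built-in rank function; lower scores mean better matches
rows = conn.execute(
    "SELECT rowid, content, bm25(docs) AS score "
    "FROM docs WHERE docs MATCH 'PTO' ORDER BY score LIMIT 5"
).fetchall()
print(rows)
```

Feed the rowids from this query plus the ids from your vector search into the RRF function above and you have a working hybrid retriever with zero extra infrastructure.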
The 80/20 of RAG Optimization
Here's what I've learned about where to spend your time:
High impact, do first:
- Switch from pure vector to hybrid retrieval
- Build a proper evaluation set (50+ questions with expected answers)
- Tune your chunk boundaries to match document structure
Moderate impact, do second:
- Experiment with chunk sizes for your specific data
- Try different embedding models
- Add metadata filtering (date, source, category)
Low impact unless at scale:
- Cross-encoder reranking (measure before committing)
- Dimensionality reduction on embeddings
- HyDE (Hypothetical Document Embeddings)
The single best investment? Build an eval set. Without one, you're just guessing whether your changes help or hurt.
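An eval harness doesn't need to be fancy. Here's a minimal hit-rate@k loop — `retrieve` is a placeholder for your own search function (returning ranked chunk ids), and the questions and ids below are made up for illustration:

```python
def evaluate_retrieval(retrieve, eval_set, top_k=5):
    """Hit-rate@k: fraction of questions where an expected chunk id
    appears in the top_k retrieved results."""
    hits = 0
    for question, expected_ids in eval_set:
        retrieved = retrieve(question)[:top_k]
        if any(doc_id in expected_ids for doc_id in retrieved):
            hits += 1
    return hits / len(eval_set)

# Toy stand-in retriever so the harness runs end to end
def fake_retrieve(question):
    return ["pto-chunk", "faq-1"] if "PTO" in question else ["faq-1"]

eval_set = [
    ("What's the PTO policy?", {"pto-chunk"}),
    ("How do I request time off?", {"leave-chunk"}),
]
print(evaluate_retrieval(fake_retrieve, eval_set))  # 0.5
```

Run this before and after every retrieval change (hybrid vs. pure vector, chunk sizes, rerankers) and you replace guesswork with a number.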
Wrapping Up
Hybrid retrieval isn't new or cutting-edge — BM25 has been around since 1994. But combining it with modern vector search is one of those "boring but effective" improvements that should probably be the default for any production RAG system.
If you're running RAG with pure vector search and your accuracy isn't where you want it to be, try adding BM25 before you reach for more complex solutions. It took me about half a day to implement and delivered a bigger improvement than anything else I've tried.
What does your RAG setup look like? Are you running pure vector, hybrid, or something else entirely? I'd love to hear what's working (or not working) for you in production.
Tags: rag, ai, machinelearning, tutorial