I wrote about abandoning vector-only search after SKU lookups returned random products. That post covered the problem. This one covers the solution: running BM25 and vector search in parallel, merging results, and reranking with a cross-encoder.
BM25 handles:
- Exact product names and SKUs
- Brand names ("Nike", "Bosch")
- Specific attributes ("size 38", "500ml", "red")
Vector search handles:
- Natural language descriptions ("something warm for winter")
- Intent-based queries ("gift for a coffee lover")
- Cross-language queries (customer asks in German, catalog is in English)
## How the Merge Works
Both searches return scored results. The trick is normalizing scores so they're comparable, then combining them with configurable weights.
```python
def hybrid_search(query: str, store_id: str, limit: int = 20):
    # Run both searches (sequential here for clarity; production runs them concurrently)
    bm25_results = bm25_search(query, store_id, limit=limit * 2)
    vector_results = vector_search(query, store_id, limit=limit * 2)

    # Normalize scores to a 0-1 range so they're comparable
    bm25_scores = normalize(bm25_results)
    vector_scores = normalize(vector_results)

    # Merge with weights (tuned per use case)
    merged = {}
    for product_id, score in bm25_scores.items():
        merged[product_id] = score * BM25_WEIGHT
    for product_id, score in vector_scores.items():
        merged[product_id] = merged.get(product_id, 0.0) + score * VECTOR_WEIGHT

    return sorted(merged.items(), key=lambda x: x[1], reverse=True)[:limit]
```
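The `normalize()` helper isn't shown above. A minimal min-max sketch, assuming each search returns a mapping of product ID to raw score (the exact result shape is an assumption here):

```python
def normalize(results: dict[str, float]) -> dict[str, float]:
    """Min-max normalize raw scores into the 0-1 range."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        # All scores identical: treat every hit as fully relevant
        return {pid: 1.0 for pid in results}
    return {pid: (score - lo) / (hi - lo) for pid, score in results.items()}
```

Min-max is sensitive to outlier scores; rank-based fusion (e.g. reciprocal rank fusion) is a common alternative that avoids comparing raw scores entirely.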
The weights need tuning per store. Stores with lots of SKU-based lookups benefit from higher BM25 weight. Stores where customers describe what they want (fashion, home goods) benefit from higher vector weight. I don't have a universal formula. Start at 50/50 and adjust based on your query logs.
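A toy illustration of how the weights shift rankings: two hypothetical products with already-normalized scores, one matched strongly by BM25 (a SKU hit) and one matched strongly by vector search (a descriptive hit).

```python
def merge(bm25: dict[str, float], vec: dict[str, float],
          bm25_w: float, vec_w: float) -> list[tuple[str, float]]:
    # Weighted sum of normalized scores; a missing entry counts as 0
    ids = set(bm25) | set(vec)
    scored = {pid: bm25.get(pid, 0.0) * bm25_w + vec.get(pid, 0.0) * vec_w
              for pid in ids}
    return sorted(scored.items(), key=lambda x: x[1], reverse=True)

bm25 = {"sku-match": 0.95, "desc-match": 0.20}
vec = {"sku-match": 0.30, "desc-match": 0.90}

# At the 50/50 starting point the SKU hit wins: 0.625 vs 0.55
print(merge(bm25, vec, 0.5, 0.5))
# Shift toward vector weight and the descriptive hit wins: 0.69 vs 0.495
print(merge(bm25, vec, 0.3, 0.7))
```

Same candidates, different winner: that's the whole tuning problem in miniature.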
## Cross-Encoder Reranking
After the merge, a cross-encoder reranker compares each candidate directly against the query.
Unlike bi-encoders (which encode query and product separately), cross-encoders take the pair as input and output a relevance score. More expensive, but more accurate.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    # Score each (query, product text) pair directly
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate["rerank_score"] = float(score)
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
```
I run this on the top 20-30 candidates from hybrid search, not the full catalog. This keeps response times reasonable since cross-encoders are slow on large sets.
Note: the code examples above are simplified for clarity. Production code needs error handling, async execution, and score caching.
## Cross-Language Search
One side effect of using intfloat/multilingual-e5-large: it maps 100+ languages into the same vector space. A query in French against an English catalog returns correct results because the embedding model treats meaning, not language, as the proximity metric. No translation API needed. If you sell across borders, the multilingual embedding model does the work for free.
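A minimal sketch of that cross-language lookup. The catalog strings are hypothetical; one detail worth knowing is that the E5 family expects `"query: "` and `"passage: "` prefixes on its inputs, and skipping them degrades quality.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: the proximity metric in embedding space
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large")
    # French query against an English catalog
    query = model.encode("query: une veste chaude pour l'hiver")
    catalog = model.encode([
        "passage: Insulated winter parka with hood",
        "passage: Stainless steel water bottle, 500ml",
    ])
    sims = [cosine(query, doc) for doc in catalog]
    # The French query lands nearest the English winter jacket
```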
## What This Doesn't Do
Limitations:
- Image search. Customers can't upload a photo and find matching products. This is a different problem requiring CLIP or similar models.
- Personalization. The search doesn't learn from individual user behavior. It treats every query independently.
- Typo correction. Heavy typos can throw off both BM25 and vector search. I handle this with query preprocessing, but it's not perfect.
- Real-time inventory. Search returns products that exist in the catalog. Stock availability is a separate check.
## What I've Seen in Practice
I don't have clean A/B test data to share yet. What I can say from manually testing across several store catalogs:
- Keyword-only search fails on most natural language queries. If the customer doesn't use the exact product name, they get nothing.
- Vector-only search handles descriptions well but returns wrong results for SKU lookups and specific attributes (color, size).
- Hybrid search with reranking handles both query types. SKU searches still work. Descriptive queries return relevant products.
I won't put a percentage on it until I have proper metrics. If you're building this, set up evaluation before you ship.
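Evaluation doesn't need heavy tooling to start. Recall@k over a hand-labeled query set is enough to catch regressions when you change weights or models. A minimal sketch (the labels and `search_fn` are hypothetical placeholders, not part of any real pipeline):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant products that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in retrieved[:k] if pid in relevant)
    return hits / len(relevant)

# Hand-labeled examples: query -> product IDs a human judged relevant
labels = {
    "nike air max 90": {"sku-123"},
    "warm winter jacket": {"sku-456", "sku-789"},
}

def evaluate(search_fn, k: int = 10) -> float:
    """Average recall@k across the labeled query set."""
    scores = [recall_at_k(search_fn(q), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores)
```

Re-run `evaluate()` after every weight change and you have a regression signal before anything ships.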
The data sync pipeline that feeds this search engine handles 60,000+ product catalogs with batch embeddings and hash-based skip logic. I wrote about that in Syncing 60,000 Products Without Breaking Everything.
I built this as part of Emporiqa. You can test hybrid search on your own catalog: the sandbox syncs up to 100 products in about 2 minutes.