DEV Community

Rosen Hristov
Hybrid Search for E-commerce: When Keywords Alone Fail

I wrote about abandoning vector-only search after SKU lookups returned random products. That post covered the problem. This one covers the solution: running BM25 and vector search in parallel, merging results, and reranking with a cross-encoder.

BM25 handles:

  • Exact product names and SKUs
  • Brand names ("Nike", "Bosch")
  • Specific attributes ("size 38", "500ml", "red")

Vector search handles:

  • Natural language descriptions ("something warm for winter")
  • Intent-based queries ("gift for a coffee lover")
  • Cross-language queries (customer asks in German, catalog is in English)

How the Merge Works

Both searches return scored results. The trick is normalizing scores so they're comparable, then combining them with configurable weights.

BM25_WEIGHT = 0.5
VECTOR_WEIGHT = 0.5

def normalize(results: dict[str, float]) -> dict[str, float]:
    # Min-max scale scores into the 0-1 range so BM25 and
    # vector scores are comparable
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        return {pid: 1.0 for pid in results}
    return {pid: (s - lo) / (hi - lo) for pid, s in results.items()}

def hybrid_search(query: str, store_id: str, limit: int = 20):
    # Run both searches (sequential here; run them concurrently in production)
    bm25_results = bm25_search(query, store_id, limit=limit * 2)
    vector_results = vector_search(query, store_id, limit=limit * 2)

    # Normalize scores to 0-1 range
    bm25_scores = normalize(bm25_results)
    vector_scores = normalize(vector_results)

    # Merge with weights (tuned per use case)
    merged = {}
    for product_id, score in bm25_scores.items():
        merged[product_id] = score * BM25_WEIGHT

    for product_id, score in vector_scores.items():
        merged[product_id] = merged.get(product_id, 0.0) + score * VECTOR_WEIGHT

    # Highest combined score first
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)[:limit]

The weights need tuning per store. Stores with lots of SKU-based lookups benefit from higher BM25 weight. Stores where customers describe what they want (fashion, home goods) benefit from higher vector weight. I don't have a universal formula. Start at 50/50 and adjust based on your query logs.
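One way to act on those query logs (my own assumption, not a rule from the post): detect SKU-like queries with a pattern check and lean on BM25 for those, keeping the 50/50 baseline for everything else. The pattern and the 0.8/0.2 split are illustrative, not tuned values.

```python
import re

# Hypothetical heuristic: single-token queries that mix letters/dashes
# with digits look like SKUs or model numbers, so weight BM25 heavily.
SKU_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9\-_/]*\d")

def pick_weights(query: str) -> tuple[float, float]:
    """Return (BM25_WEIGHT, VECTOR_WEIGHT) for a query."""
    q = query.strip()
    if " " not in q and SKU_PATTERN.match(q):
        return 0.8, 0.2  # SKU-style lookup: exact matching matters most
    return 0.5, 0.5      # default starting point for descriptive queries
```

A query classifier like this is cheap to run per request, so the weights can vary query by query instead of being fixed per store.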

Cross-Encoder Reranking

After the merge, a cross-encoder reranker compares each candidate directly against the query.

Unlike bi-encoders (which encode query and product separately), cross-encoders take the pair as input and output a relevance score. More expensive, but more accurate.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)

    for i, candidate in enumerate(candidates):
        candidate["rerank_score"] = float(scores[i])

    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)

I run this on the top 20-30 candidates from hybrid search, not the full catalog. This keeps response times reasonable since cross-encoders are slow on large sets.

Note: the code examples above are simplified for clarity. Production code needs error handling, async execution, and score caching.
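The score caching mentioned above can be sketched with `functools.lru_cache`, assuming reranker scores are deterministic per (query, text) pair. The `score_pair` function here is a cheap stand-in for the real cross-encoder call; the call counter just makes the caching visible.

```python
from functools import lru_cache

CALLS = {"n": 0}

def score_pair(query: str, text: str) -> float:
    # Stand-in for the expensive cross-encoder call; counts invocations
    # so the cache's effect is observable
    CALLS["n"] += 1
    return float(len(set(query.split()) & set(text.split())))

@lru_cache(maxsize=50_000)
def cached_score(query: str, text: str) -> float:
    # Repeated (query, text) pairs skip the model call entirely
    return score_pair(query, text)
```

Popular queries hit the same candidate sets repeatedly, so even a simple in-process cache like this cuts reranker load noticeably.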

Cross-Language Search

One side effect of using intfloat/multilingual-e5-large: it maps 100+ languages into the same vector space. A query in French against an English catalog returns correct results because the embedding model places text with the same meaning close together, regardless of language. No translation API needed. If you sell across borders, the multilingual embedding model does the work for free.
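"Close together" here means high cosine similarity between the query and product embeddings. A minimal sketch of the measurement (the French/English example in the comment is illustrative; note that E5 models expect "query: " and "passage: " prefixes on their inputs):

```python
import numpy as np

def cosine(a, b) -> float:
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the multilingual model, a French query and an English product
# description land near each other in the shared space:
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("intfloat/multilingual-e5-large")
#   q = model.encode("query: bouilloire électrique")
#   p = model.encode("passage: Electric kettle, 1.7L, stainless steel")
#   cosine(q, p)  # high relative to unrelated products
```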

What This Doesn't Do

Limitations:

  • Image search. Customers can't upload a photo and find matching products. This is a different problem requiring CLIP or similar models.
  • Personalization. The search doesn't learn from individual user behavior. It treats every query independently.
  • Typo correction. Heavy typos can throw off both BM25 and vector search. I handle this with query preprocessing, but it's not perfect.
  • Real-time inventory. Search returns products that exist in the catalog. Stock availability is a separate check.
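The query preprocessing I mentioned for typos can be approximated with fuzzy matching against a vocabulary built from the catalog. This is a sketch of the general idea, not my production code; the vocabulary and cutoff here are illustrative.

```python
import difflib

# Hypothetical vocabulary extracted from the catalog (brands, common
# product terms); real systems would build this during indexing.
VOCAB = {"nike", "bosch", "kettle", "stainless", "steel", "red"}

def correct_query(query: str, cutoff: float = 0.8) -> str:
    # Snap each unknown token to its closest vocabulary term, leaving
    # tokens alone when nothing scores above the similarity cutoff
    out = []
    for token in query.lower().split():
        if token in VOCAB:
            out.append(token)
            continue
        match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)
    return " ".join(out)
```

This catches light misspellings ("bosh" → "bosch") but, as noted above, heavy typos still slip through, which is why I call the preprocessing imperfect.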

What I've Seen in Practice

I don't have clean A/B test data to share yet. What I can say from manually testing across several store catalogs:

  • Keyword-only search fails on most natural language queries. If the customer doesn't use the exact product name, they get nothing.
  • Vector-only search handles descriptions well but returns wrong results for SKU lookups and specific attributes (color, size).
  • Hybrid search with reranking handles both query types. SKU searches still work. Descriptive queries return relevant products.

I won't put a percentage on it until I have proper metrics. If you're building this, set up evaluation before you ship.

The data sync pipeline that feeds this search engine handles catalogs of 60,000+ products with batch embeddings and hash-based skip logic. I wrote about that in Syncing 60,000 Products Without Breaking Everything.

I built this as part of Emporiqa. You can test hybrid search on your own catalog: the sandbox syncs up to 100 products in about 2 minutes.
