Carles

How We Rerank 565K Products Using Deep Learning

At SeeStocks, we run a price comparison engine that tracks over 565,000 products across multiple retailers in Spain. One of our biggest challenges? Making sure that when a user lands on a category page, the most relevant products appear first — not just sorted by price, but ranked by actual relevance to what they're looking for.
This is the story of how we built a multi-stage reranking pipeline using deep learning, and what we learned along the way.
The Problem With Naive Sorting
Early on, our category pages were simple: pull all products tagged under a category, sort by price, done. But this quickly broke down:

A "pepper" category would surface hot sauce bottles before actual peppercorns
A "tool bags" page showed backpacks that happened to be in the same parent taxonomy
Products with misleading titles would float to the top simply because they were cheap

We needed something smarter than keyword matching and price sorting.
Our Approach: A Three-Stage Pipeline
We settled on a multi-stage architecture that balances speed with accuracy:
Stage 1: Candidate Retrieval (Fast, Broad)
We maintain a vector index of all products using embeddings from a fine-tuned vision-language model. When a user hits a category page, we first retrieve a broad set of candidates using approximate nearest neighbor search against the category centroid — a pre-computed embedding that represents the "ideal" product in that category.
This stage is optimized for recall over precision. We intentionally cast a wide net, pulling 3-5x more candidates than we'll ultimately display.
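The post doesn't include the retrieval code, but the idea can be sketched as follows. This is a brute-force cosine-similarity top-k as a stand-in for a real ANN index (in production you'd use something like FAISS or HNSW); the array names and toy embeddings are illustrative, not SeeStocks' actual data:

```python
import numpy as np

def retrieve_candidates(product_embs: np.ndarray, centroid: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k products closest to the category centroid
    by cosine similarity (exact search standing in for ANN)."""
    # Normalize so the dot product equals cosine similarity.
    p = product_embs / np.linalg.norm(product_embs, axis=1, keepdims=True)
    c = centroid / np.linalg.norm(centroid)
    sims = p @ c
    return np.argsort(-sims)[:k]

# Toy example: 4 products in a 3-d embedding space.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
centroid = np.array([1.0, 0.0, 0.0])  # pre-computed "ideal product" for the category
top2 = retrieve_candidates(embs, centroid, k=2)
```

With real traffic, `k` would be set to that 3-5x over-fetch factor relative to the page size, and the exact scan replaced by the pre-built index.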
Stage 2: Cross-Encoder Reranking (Slow, Precise)
The candidates from Stage 1 are then passed through a cross-encoder model that scores each product against the category context. Unlike the bi-encoder in Stage 1 (which computes embeddings independently), the cross-encoder processes the product and category jointly, capturing fine-grained interactions.
We encode several signals:

Visual similarity: How well does the product image match the expected visual prototype for this category?
Taxonomic distance: How far is the product's assigned category from the target category in our taxonomy tree?
Title-category coherence: Does the product title semantically align with the category name and its parent path?
Price distribution fit: Is the product priced within a reasonable range for this category, or is it a statistical outlier?

Each signal produces a score, and we combine them using learned weights from a lightweight gradient-boosted model trained on human relevance judgments.
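As a rough illustration of the signal-combination step, here is a minimal sketch using a plain weighted sum. The real system uses a gradient-boosted model (which learns non-linear interactions, not fixed weights); the signal names and weight values below are hypothetical:

```python
def combine_signals(signals: dict, weights: dict) -> float:
    """Combine per-product relevance signals into one score.
    A linear stand-in for the gradient-boosted model described above."""
    return sum(weights[name] * signals[name] for name in weights)

# Illustrative weights; in production these come from a model trained
# on human relevance judgments.
weights = {"visual": 0.4, "taxonomy": 0.3, "title": 0.2, "price_fit": 0.1}

# One candidate product's signal scores (all in [0, 1]).
product = {"visual": 0.9, "taxonomy": 1.0, "title": 0.8, "price_fit": 0.5}
score = combine_signals(product, weights)
```

The advantage of keeping signals separate until this point is debuggability: when a product ranks oddly, you can inspect which signal pulled it up or down.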
Stage 3: Business Rules & Diversity
The final stage applies hard constraints:

Deduplicate near-identical products from different retailers (keeping the cheapest)
Ensure retailer diversity (no single store dominates the top positions)
Apply freshness decay (products not seen in recent crawls get penalized)
Enforce minimum confidence thresholds
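Two of those constraints, dedup-keep-cheapest and a per-retailer cap, can be sketched like this. The dict schema (`dedupe_key`, `retailer`, `price`, `relevance`) is an assumption for illustration:

```python
def apply_business_rules(products: list, max_per_retailer: int = 2) -> list:
    """Dedupe near-identical products (keeping the cheapest offer) and
    cap how many slots any single retailer can occupy."""
    # Keep the cheapest offer per dedupe key.
    cheapest = {}
    for p in products:
        key = p["dedupe_key"]
        if key not in cheapest or p["price"] < cheapest[key]["price"]:
            cheapest[key] = p
    # Enforce retailer diversity, preserving relevance order.
    per_retailer, result = {}, []
    for p in sorted(cheapest.values(), key=lambda p: -p["relevance"]):
        n = per_retailer.get(p["retailer"], 0)
        if n < max_per_retailer:
            result.append(p)
            per_retailer[p["retailer"]] = n + 1
    return result

products = [
    {"dedupe_key": "pepper-500g", "retailer": "A", "price": 4.0, "relevance": 0.9},
    {"dedupe_key": "pepper-500g", "retailer": "B", "price": 3.5, "relevance": 0.9},
    {"dedupe_key": "pepper-100g", "retailer": "B", "price": 1.2, "relevance": 0.8},
]
ranked = apply_business_rules(products, max_per_retailer=1)
```

In this toy run the duplicate 500g listing collapses to retailer B's cheaper offer, and the retailer cap then drops B's second product.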

The Taxonomy Challenge
Our product taxonomy follows the Google Shopping taxonomy: 5,700+ categories organized in a deep hierarchy. One thing we learned: flat classification doesn't work for ecommerce at this scale.
A product image of black ground pepper could reasonably match:

Food > Condiments > Spices > Pepper ✅
Food > Condiments > Spices > Seasoning Mixes ❌ (close, but wrong)
Food > Condiments ❌ (too broad)

We built what we call a hierarchical disambiguation layer: when the model is uncertain between sibling categories, we generate discriminative text prompts that highlight the differences between them and re-score. This reduced misclassification between sibling categories by 34%.
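The control flow of that disambiguation layer might look like the sketch below: if the top two category scores fall within a margin, a second-pass rescorer (standing in for the discriminative-prompt comparison) breaks the tie. The margin value and the `rescore` callback are hypothetical:

```python
def disambiguate(scores: dict, margin: float, rescore) -> str:
    """Pick a category. If the top two candidates are within `margin`
    of each other, defer to a second-pass rescorer (e.g. scoring against
    discriminative text prompts); otherwise keep the original winner."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] - scores[runner_up] < margin:
        return rescore(best, runner_up)
    return best

scores = {
    "Food > Condiments > Spices > Pepper": 0.81,
    "Food > Condiments > Spices > Seasoning Mixes": 0.79,
}
# Stand-in rescorer: pretend the discriminative prompts favour "Pepper".
winner = disambiguate(scores, margin=0.05, rescore=lambda a, b: a)
```

The key property is that the expensive second pass only runs on genuinely ambiguous cases, which keeps the added latency proportional to how confusable the taxonomy actually is.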
What We Run in Production
The full pipeline runs on a single GPU server:

Candidate retrieval: ~15ms per category (pre-computed index)
Cross-encoder reranking: ~120ms for 200 candidates
Business rules: ~5ms
Total latency: under 200ms end-to-end

We pre-compute rankings for our 1,347 active category pages on a nightly batch job, so users never wait for the ML pipeline — they get served from cache.
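The batch-then-cache split can be sketched as two functions with no shared code path through the ML pipeline. This is a minimal in-memory illustration; the cache, category slugs, and `rank_fn` are placeholders, not the production setup:

```python
import time

CACHE = {}  # stand-in for whatever cache/store serves page requests

def nightly_batch(categories: list, rank_fn) -> None:
    """Offline job: run the full pipeline for every active category
    and store the resulting ranking."""
    for cat in categories:
        CACHE[cat] = {"ranking": rank_fn(cat), "computed_at": time.time()}

def serve_category(cat: str) -> list:
    """Online path: cache read only; the GPU pipeline never runs here."""
    entry = CACHE.get(cat)
    return entry["ranking"] if entry else []

# Toy rank_fn emitting placeholder product IDs.
nightly_batch(["tool-bags", "spices"],
              rank_fn=lambda c: [f"{c}-product-{i}" for i in range(3)])
page = serve_category("tool-bags")
```

Decoupling the two paths is what makes the 90% GPU-cost reduction mentioned below possible: the expensive work runs 1,347 times a night instead of once per request.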
Results
After deploying the reranking pipeline:

Product relevance score (human-evaluated): 71% → 94%
Category pages with misclassified products in top 10: 23% → 3%
User engagement (click-through to retailer): +41%
Bounce rate on category pages: -28%

Lessons Learned

  1. Your taxonomy is your moat. We spent more time curating our taxonomy tree and training discriminators between confusing categories than on any model architecture decision.
  2. Embeddings are just the beginning. The bi-encoder gets you 80% of the way there. The last 20% — which is what users actually notice — comes from cross-encoder reranking and business logic.
  3. Batch > real-time for this use case. We initially tried to run the full pipeline on every request. Switching to nightly batch computation with cache cut our GPU costs by 90% and simplified everything.
  4. Outlier detection matters more than ranking. Removing the wrong products from a category page had more impact than perfecting the order of the right ones.

We're continuing to iterate on this system. Next on our roadmap: using multimodal LLMs for attribute extraction (color, material, size) to enable smarter filtering within categories.
If you're working on similar problems in ecommerce search or product categorization, I'd love to hear how you approach it. Drop a comment or find us at es.seestocks.com.