Zalando Introduces MLLM-Based Evaluation for Product Retrieval

#ai #machinelearning #research #deeplearning

Zalando presents a multimodal LLM-based evaluation for product retrieval, aiming to enhance search relevance in e-commerce. This matters as it could set a new standard for assessing AI in retail search.

What Happened

Zalando, the European fashion and lifestyle platform, has introduced a novel evaluation framework for product retrieval that leverages multimodal large language models (MLLMs). The work, reported by Let's Data Science, proposes a benchmark designed to test how well AI systems understand and retrieve products based on complex, multi-modal queries—combining text descriptions with visual inputs.

This is a direct response to a long-standing challenge in e-commerce: traditional product search often relies on text-only matching or separate image similarity models, which fail to capture nuanced user intents like "a blue dress similar to this one but with long sleeves" or "a casual jacket in the same style as this photo." MLLMs, which can process both text and images simultaneously, offer a path toward more natural and accurate retrieval.

Technical Details

The evaluation framework, as described, assesses retrieval performance across diverse product categories and query types. While specific numerical results (e.g., recall@k scores) were not detailed in the available source material, the core innovation lies in using MLLMs as both the retrieval engine and the evaluator. This dual role allows for:

Contextual understanding: MLLMs can interpret queries that mix visual examples with textual modifiers (e.g., "find shoes like this but in black leather").
Multi-modal reasoning: The model evaluates whether retrieved items match the combined text-image intent, rather than just text-to-text or image-to-image similarity.
Scalable evaluation: Automated scoring via MLLMs reduces the need for expensive human judgments, enabling faster iteration on retrieval models.

The benchmark likely includes a curated dataset of fashion products with ground-truth relevance judgments, though exact dataset size and composition were not confirmed in the summary.
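To make the evaluator role concrete, here is a minimal sketch of how judged relevance and recall@k could be computed. It is not Zalando's implementation. The judge signature, the toy label table, and the product and image identifiers are all assumptions made for illustration, and the judge is injected as a plain function so the metric code runs without any model.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (query_text, query_image_id, product_id) -> bool.
# In a real setup this would wrap an MLLM call; here it is injected so the
# metric code can run without any model.
Judge = Callable[[str, str, str], bool]


def judged_precision_at_k(judge: Judge, query_text: str, query_image_id: str,
                          ranked_ids: Sequence[str], k: int) -> float:
    """Fraction of the top-k results that the judge marks as relevant."""
    top_k = list(ranked_ids[:k])
    if not top_k:
        return 0.0
    hits = sum(judge(query_text, query_image_id, pid) for pid in top_k)
    return hits / len(top_k)


def recall_at_k(relevant_ids: set[str], ranked_ids: Sequence[str], k: int) -> float:
    """Fraction of known-relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    found = relevant_ids.intersection(ranked_ids[:k])
    return len(found) / len(relevant_ids)


# Stand-in judge for illustration only: it looks up a hand-written label table.
TOY_LABELS = {"p1": True, "p2": False, "p3": True, "p4": False}


def toy_judge(query_text: str, query_image_id: str, product_id: str) -> bool:
    return TOY_LABELS.get(product_id, False)


ranked = ["p1", "p2", "p3", "p4"]
precision = judged_precision_at_k(
    toy_judge, "shoes like this but in black leather", "img-001", ranked, k=2
)
recall = recall_at_k({"p1", "p3"}, ranked, k=2)
print(precision, recall)
```

Keeping the judge behind a function boundary means the same metric code can be run with human labels in place of model output, which is one way to check how far the two diverge.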

Retail & Luxury Implications

For retailers and luxury brands, this work has immediate practical relevance. Product discovery remains a critical pain point: customers often struggle to articulate what they want in text alone, especially for visually-driven categories like apparel, accessories, and home decor. An MLLM-based retrieval system could:

Reduce search abandonment: By understanding complex, multi-modal queries, the system can surface relevant products faster, cutting into the 40-70% of users who reportedly leave a site after a failed search.
Enable visual search at scale: Luxury brands with extensive catalogs (e.g., Richemont's watch collections, Kering's footwear lines) can allow customers to search by uploading a photo and refining with text—e.g., "a watch like this but with a leather strap."
Improve cross-selling: MLLMs can identify visually similar or complementary items, powering recommendations that feel intuitive rather than algorithmic.

However, the technology is still at an early stage. Zalando's work is a benchmark proposal, not a production system. Running an MLLM on every search query remains computationally expensive, and the latency budget for real-time retail search (sub-200ms) is tight. Luxury brands must also ensure that retrieval quality does not compromise brand perception: returning a similar-looking but lower-quality item, for example, could damage trust.

Business Impact

While Zalando has not released quantified business metrics for this specific benchmark, the broader trend toward multi-modal search is backed by industry data. According to a 2025 McKinsey report, retailers that implement advanced visual search see a 15-25% increase in conversion rates and a 10-20% reduction in return rates (as customers find what they actually want). For luxury brands, where average order values can exceed €500, even a 1% improvement in conversion can translate to significant revenue.
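A quick worked calculation shows why the reading of "1%" matters. The sketch below compares a 1% relative uplift with a one-percentage-point absolute uplift. The session count, baseline conversion rate, and order value are illustrative assumptions, not Zalando or McKinsey figures.

```python
from decimal import Decimal


def extra_revenue(sessions: int, baseline_cr: Decimal, new_cr: Decimal,
                  aov_eur: Decimal) -> Decimal:
    """Additional revenue from moving conversion from baseline_cr to new_cr."""
    return Decimal(sessions) * (new_cr - baseline_cr) * aov_eur


# All inputs are illustrative assumptions.
SESSIONS = 1_000_000            # search sessions in the period
BASELINE_CR = Decimal("0.02")   # 2% baseline conversion rate
AOV_EUR = Decimal("500")        # average order value in euros

# "1% improvement" read as a relative change: 2.00% -> 2.02%.
relative_uplift = extra_revenue(
    SESSIONS, BASELINE_CR, BASELINE_CR * Decimal("1.01"), AOV_EUR
)

# "1% improvement" read as one percentage point: 2% -> 3%.
absolute_uplift = extra_revenue(
    SESSIONS, BASELINE_CR, BASELINE_CR + Decimal("0.01"), AOV_EUR
)

print(f"1% relative uplift:       EUR {relative_uplift:,.0f}")
print(f"1 point absolute uplift:  EUR {absolute_uplift:,.0f}")
```

Under these assumptions the relative reading yields EUR 100,000 in extra revenue and the absolute reading EUR 5,000,000, so it is worth pinning down which one a business case means.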

Zalando's move also signals competitive pressure. Other platforms like ASOS, Farfetch, and Amazon have invested in visual search, but MLLM-based retrieval represents a step change in capability. If Zalando open-sources this benchmark (common in the research community), it could become a de facto standard, similar to how MLPerf accelerated AI hardware evaluation.

Implementation Approach

To adopt similar technology, retail AI teams should:

Curate multi-modal training data: Pair product images with rich text descriptions (including style, material, fit) and annotate complex queries. A dataset of at least 100k items is recommended for meaningful results.
Select an MLLM backbone: Options include Google's Gemini 3 Pro, OpenAI's GPT-4o, or open-source models like LLaVA-NeXT. The choice depends on latency requirements and budget—open-source models can be fine-tuned but may lag in reasoning quality.
Define evaluation metrics: Beyond standard recall@k, consider multi-modal relevance—whether the retrieved item matches both the visual and textual intent. Human evaluation is still needed for validation.
Optimize for latency: Use model distillation or smaller MLLMs (e.g., Gemma 4 2B) for inference at scale, and cache frequent queries.
A/B test rigorously: Run controlled experiments measuring search-to-purchase conversion, time-to-find, and return rates.
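For the A/B testing step, one common way to check whether a conversion difference is more than noise is a two-proportion z-test. The sketch below uses only the standard library. The function name and the traffic numbers are hypothetical, and a real experiment would also need a pre-registered sample size and guardrail metrics.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variance at all (e.g. zero conversions in both arms).
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical experiment: control (A) vs. MLLM-assisted search (B).
z, p = two_proportion_z_test(conv_a=400, n_a=20_000, conv_b=460, n_b=20_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same test can be applied to return rates by treating a return as the counted event instead of a purchase.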

Governance & Risk Assessment

Privacy: MLLMs processing user-uploaded images raise data protection concerns (GDPR compliance). Ensure images are not stored longer than necessary and are anonymized.
Bias: Training data may reflect biased fashion norms (e.g., limited size ranges, Eurocentric aesthetics). Regular audits are needed to avoid reinforcing stereotypes.
Maturity: This technology sits at roughly TRL 4-5 (validated in the lab and, at most, in a relevant environment). Production deployments should start with non-critical use cases (e.g., inspiration boards) before powering core search.

gentic.news Analysis

Zalando's benchmark is a smart move. By publishing an evaluation standard, they position themselves as thought leaders while solving a real problem—product retrieval quality has plateaued with traditional methods. The use of MLLMs for both retrieval and evaluation is elegant but risky: it creates a circular dependency where the evaluator's biases are baked into the benchmark. Independent human validation will be critical.
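One simple way to run that human validation is to measure chance-corrected agreement between the MLLM judge and human annotators on a shared sample, for example with Cohen's kappa. The sketch below is a plain-Python version. The label names and the eight sample judgments are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up relevance judgments for eight query-product pairs.
human = ["rel", "rel", "not", "not", "rel", "not", "rel", "not"]
mllm = ["rel", "rel", "not", "rel", "rel", "not", "not", "not"]

kappa = cohens_kappa(human, mllm)
print(f"kappa = {kappa:.2f}")
```

A low kappa on such a sample would be a signal that the automated judge is scoring something other than what human annotators consider relevant.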

For luxury brands, the takeaway is clear: multi-modal search is coming, and early adopters will have a competitive advantage. However, the technology is not yet plug-and-play. Teams should start experimenting with MLLMs for search in 2026, focusing on high-value categories (e.g., watches, handbags) where visual attributes are paramount. The cost of inference will drop as Google, OpenAI, and others compete on price—note Google's recent threat to cut Gemini reasoning model prices by 80% (June 2026).

Finally, watch for Google's own moves. With Gemini Embedding 2 and TPU infrastructure, Google Cloud is well-positioned to offer MLLM-based search as a service. Zalando's benchmark could accelerate that, making multi-modal retrieval a standard offering within two years.

Source: news.google.com

Originally published on gentic.news