by Fabian Eder
How can you identify product variants in product data? You don't need to search for them manually. We have built a pipeline that finds them for us, using a multi-step approach that combines hard constraints, an XGBoost model, and a graph clustering algorithm.
What are product variants?
Product variants are products from the same manufacturer that differ only slightly, for example in color or size. Because they fulfill the same customer needs, MediaMarktSaturn shows them next to each other on the product detail page.
The definition of product variants relies purely on product attributes, not on customer interaction. For this reason, we maintain variants in our product information management system. We put products into variant groups: products that are variants of each other are part of the same variant group. A product can belong to at most one variant group.
Why do we need to improve on finding product variants?
To ensure that all product variants are visible on the product detail page, we need complete variant groups for our whole assortment. In each of the 11 countries we operate in, our assortment comprises tens of thousands of products.
In the product information management system, content managers maintain variant groups manually: manufacturers provide spreadsheets with products for which variant groups need to be created, or content managers search the product management system for similar products and create new variant groups. As you can imagine, this process is cumbersome and time-consuming. As a consequence, our variant groups are never complete, and they are usually not reviewed or revised after they are created.
An automated solution to find product variants
To make product variant management easier, we have created a solution leveraging machine learning techniques. It reviews all products and all variant groups in our whole assortment to
- suggest new variant groups
- identify products that should be added to existing variant groups
- suggest changes to existing variant groups (split a group / merge groups)
The solution comprises the following processing steps:
1. Identify potential product variant pairs using constraints: only products from the same manufacturer and from the same product category can be variants of each other.
2. Score each potential product variant pair with an XGBoost model that estimates the probability that the two products are variants of each other (see below for further details).
3. Create a graph from the probabilities for each manufacturer and product category.
4. Identify subgraphs in each graph using a graph clustering algorithm: each subgraph represents a proposed variant group (see below for further details).
5. Compare the proposed variant groups from the graph clustering algorithm with the existing variant groups from the product information management system.
After step 5, we show the results to assortment managers and content managers. If they approve the new product variant groups and the suggested variant group changes, the new data can be imported into the product management system and distributed across the whole company from there.
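To make step 1 more concrete, here is a minimal sketch of how candidate pairs could be generated under the two hard constraints. It is an illustration only, not the production code: the record layout (`id`, `manufacturer`, `category`) and the sample products are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(products):
    """Yield pairs of product IDs that share manufacturer and category."""
    buckets = defaultdict(list)
    for product in products:
        # Hard constraints: same manufacturer AND same product category.
        key = (product["manufacturer"], product["category"])
        buckets[key].append(product["id"])
    for ids in buckets.values():
        # combinations() yields every unordered pair exactly once.
        yield from combinations(sorted(ids), 2)


# Hypothetical sample data.
products = [
    {"id": "A1", "manufacturer": "Acme", "category": "phones"},
    {"id": "A2", "manufacturer": "Acme", "category": "phones"},
    {"id": "A3", "manufacturer": "Acme", "category": "phones"},
    {"id": "B1", "manufacturer": "Bolt", "category": "phones"},
    {"id": "A4", "manufacturer": "Acme", "category": "laptops"},
]
pairs = list(candidate_pairs(products))
```

Bucketing first keeps the number of pairs manageable: pairs are only formed inside a bucket, so the quadratic growth applies per manufacturer and category rather than to the whole assortment.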
The XGBoost model to assign probabilities to product pairs
XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm that needs labelled and structured data as input. That means we need tabular data with a label column and feature columns. In our case, we have a binary classification problem: two products either are variants of each other (positive class) or are not (negative class). Each potential product variant pair (the result of step 1) is one observation.
To create the label column, we make use of the existing variant groups: product pairs from the same variant group belong to the positive class, and product pairs from different variant groups belong to the negative class. Products without a variant group are not used for model training or model evaluation.
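The labelling rule can be sketched in a few lines. This is an illustrative example under assumptions: `group_of` is a hypothetical mapping from product ID to existing variant group ID, and products missing from it are treated as having no variant group.

```python
def label_pair(id_a, id_b, group_of):
    """Return 1 (variants), 0 (not variants), or None (pair is unlabelled)."""
    group_a = group_of.get(id_a)
    group_b = group_of.get(id_b)
    if group_a is None or group_b is None:
        # At least one product has no variant group: exclude the pair.
        return None
    return 1 if group_a == group_b else 0


# Hypothetical sample data: A4 has no variant group.
group_of = {"A1": "g1", "A2": "g1", "A3": "g2"}
pairs = [("A1", "A2"), ("A1", "A3"), ("A2", "A4")]

labelled = [(a, b, label_pair(a, b, group_of)) for a, b in pairs]
training_rows = [row for row in labelled if row[2] is not None]
```

Here the pair `("A2", "A4")` is dropped because `A4` has no group, which mirrors the rule that ungrouped products are excluded from training and evaluation.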
To create the feature columns, we rely on extensive feature engineering. For example, we consider the number of product attributes that the two products have in common, the similarity of their product names, and their similarity with respect to system IDs and system timestamps.
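As an illustration of what such pair features might look like, the sketch below computes two of the feature types mentioned above. The feature names, the record layout, and the use of `difflib.SequenceMatcher` as the name similarity measure are assumptions made for this example; the actual feature set is richer.

```python
from difflib import SequenceMatcher


def pair_features(a, b):
    """Compute a few illustrative features for one product pair."""
    shared_keys = a["attributes"].keys() & b["attributes"].keys()
    equal_values = sum(
        1 for key in shared_keys if a["attributes"][key] == b["attributes"][key]
    )
    # Ratio in [0, 1]; 1.0 means the lower-cased names are identical.
    name_similarity = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    return {
        "n_shared_attributes": len(shared_keys),
        "n_equal_attributes": equal_values,
        "name_similarity": name_similarity,
    }


# Hypothetical sample products that differ only in color.
black = {
    "name": "Acme Phone X 128GB Black",
    "attributes": {"brand": "Acme", "storage": "128GB", "color": "black"},
}
blue = {
    "name": "Acme Phone X 128GB Blue",
    "attributes": {"brand": "Acme", "storage": "128GB", "color": "blue"},
}
features = pair_features(black, blue)
```

Each pair becomes one row of the training table, with these values as columns next to the label.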
After we have trained, validated and evaluated the classifier on training, validation and test sets, we apply it to all potential product variant pairs from the whole assortment. For each product pair, we get the prediction probability of the binary classifier. The prediction probability is a value between 0 and 1; the higher the value, the more likely it is that the two products are variants of each other. We use the prediction probabilities to build graphs (see below).
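Turning the scored pairs into graphs could look roughly like the following sketch. It assumes each scored pair carries its (manufacturer, category) key, and it uses a hypothetical `min_probability` cut-off for dropping weak edges; neither the data layout nor the threshold value is taken from the production pipeline.

```python
def build_graphs(scored_pairs, min_probability=0.5):
    """Build one weighted adjacency dict per (manufacturer, category) key."""
    graphs = {}
    for key, id_a, id_b, probability in scored_pairs:
        graph = graphs.setdefault(key, {})
        # Register both products as nodes even if the edge is dropped below.
        graph.setdefault(id_a, {})
        graph.setdefault(id_b, {})
        if probability >= min_probability:
            # Undirected graph: store the edge weight in both directions.
            graph[id_a][id_b] = probability
            graph[id_b][id_a] = probability
    return graphs


# Hypothetical classifier output: (key, product A, product B, probability).
scored_pairs = [
    (("Acme", "phones"), "A1", "A2", 0.97),
    (("Acme", "phones"), "A1", "A3", 0.12),
    (("Acme", "phones"), "A2", "A3", 0.08),
    (("Bolt", "tvs"), "B1", "B2", 0.91),
]
graphs = build_graphs(scored_pairs, min_probability=0.5)
```

In this example `A3` remains in the graph as an isolated node, so it would not be grouped with `A1` and `A2`.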
Graph clustering to propose variant groups
The following image shows an example graph that we created from the prediction probabilities of the XGBoost model. Each node represents one product, and each edge shows the probability that the two products are variants of each other. Shorter and thicker edges represent a higher probability. Probabilities below 0.9 are not shown.
The Louvain algorithm, the graph clustering algorithm we use, has split the graph into subgraphs, as indicated by the colors. Nodes that are strongly connected are put into the same subgraph. Each subgraph represents a proposed variant group. In other words, the graph clustering algorithm puts products that are very likely to be variants of each other into the same variant group.
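The Louvain algorithm searches for a partition of the nodes that maximizes a score called modularity. The sketch below does not implement Louvain itself; it only computes the weighted modularity of a given partition, to show why a split along weak edges scores higher than one large group. The graph format (an adjacency dict of edge weights) and the sample data are assumptions made for this example.

```python
def modularity(graph, community_of):
    """Weighted modularity Q of a partition of an undirected graph."""
    # Every undirected edge is stored twice, so this sum equals 2m.
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())
    if two_m == 0:
        return 0.0
    internal = {}  # community -> weight of internal edges, counted twice
    degree = {}    # community -> sum of weighted degrees of its nodes
    for node, nbrs in graph.items():
        c = community_of[node]
        degree[c] = degree.get(c, 0.0) + sum(nbrs.values())
        for other, w in nbrs.items():
            if community_of[other] == c:
                internal[c] = internal.get(c, 0.0) + w
    # Q = sum over communities of internal/2m - (degree/2m)^2
    return sum(
        internal.get(c, 0.0) / two_m - (degree[c] / two_m) ** 2
        for c in degree
    )


# Hypothetical graph: two tight pairs joined by one weak edge.
graph = {
    "A": {"B": 0.95},
    "B": {"A": 0.95, "C": 0.10},
    "C": {"B": 0.10, "D": 0.95},
    "D": {"C": 0.95},
}
q_split = modularity(graph, {"A": 0, "B": 0, "C": 1, "D": 1})
q_single = modularity(graph, {"A": 0, "B": 0, "C": 0, "D": 0})
```

With this data, the two-group partition scores about 0.45 while the single group scores 0, so a modularity-maximizing algorithm prefers the split. For the actual search, a library implementation is the practical choice; recent NetworkX versions, for instance, provide `louvain_communities`.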
For each manufacturer and each product category, we create such a graph and apply the graph clustering algorithm to it. Per country, the algorithm proposes around 800-3000 new variant groups covering 2000-9000 products.
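The comparison in step 5 can be sketched as a simple classification of each proposed group against the existing groups. The category names and decision rules below are a simplified assumption for illustration; a real comparison would need to handle more edge cases, such as a proposal that both merges and splits existing groups.

```python
def compare_groups(proposed, existing_group_of):
    """Classify each proposed group relative to the existing variant groups.

    proposed: list of sets of product IDs
    existing_group_of: product ID -> existing group ID (absent if ungrouped)
    """
    members = {}
    for product_id, group_id in existing_group_of.items():
        members.setdefault(group_id, set()).add(product_id)

    results = []
    for group in proposed:
        touched = {existing_group_of[p] for p in group if p in existing_group_of}
        ungrouped = {p for p in group if p not in existing_group_of}
        if not touched:
            kind = "new group"
        elif len(touched) > 1:
            kind = "merge"
        else:
            (group_id,) = touched
            if not members[group_id] <= group:
                kind = "split"  # only part of the existing group is covered
            elif ungrouped:
                kind = "add products"
            else:
                kind = "unchanged"
        results.append((sorted(group), kind))
    return results


# Hypothetical sample data.
existing_group_of = {
    "A1": "g1", "A2": "g1", "B1": "g2", "B2": "g2", "C1": "g3", "D1": "g4",
}
proposed = [{"A1", "A2", "A3"}, {"B1"}, {"B2"}, {"C1", "D1"}, {"E1", "E2"}]
kinds = [kind for _, kind in compare_groups(proposed, existing_group_of)]
```

These categories map onto the suggestion types listed earlier: new variant groups, products to add to existing groups, and splits or merges of existing groups.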
Putting everything together
To execute all steps above in a coordinated manner and on a regular schedule, we have implemented a Kubeflow pipeline which we run via Vertex AI Pipelines on the Google Cloud Platform. A Streamlit app shows the variant group proposals of the algorithm to the assortment and content managers, who either approve or reject the proposals. Finally, we have set up a Cloud Run Job that supports the import of approved variants into the product management system.
Summary
There is no need to do data maintenance tasks in a purely manual manner anymore. Machine learning offers so many great opportunities to support and improve business processes.
In our case, we have successfully built a pipeline on Google Cloud Platform that suggests new product variant groups and that improves existing variant groups for thousands of products. It leverages a multi-step approach using hard constraints, an XGBoost model, and a graph clustering algorithm.
What about you? Do you face similar challenges in your business and use data science and machine learning approaches to make your life easier? Feel free to reach out to us and share your experience!
Get to know us: https://mms.tech

