Limarc Ambalina

5 Must-read Papers on Product Categorization for Data Scientists

Product categorization, also called product classification, is the organization of products into their respective departments or categories. A large part of the process is also the design of the product taxonomy as a whole.

Product categorization was initially a text classification task that analyzed the product's title to choose the appropriate category. Since then, numerous methods have been developed that take into account the product title, description, images, and other available metadata. The following papers on product categorization represent essential reading in the field and offer novel approaches to product classification tasks.

1. Don't Classify, Translate

In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization. The experiment uses the Rakuten Data Challenge and Rakuten Ichiba datasets. Their method translates or converts a product's description into a sequence of tokens which represent a root-to-leaf path to the correct category. Using this method, they are also able to propose meaningful new paths in the taxonomy.

The researchers state that their method outperforms many of the existing classification algorithms commonly used in machine learning today.
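
To make the "translation" framing concrete, here is a minimal sketch of how a root-to-leaf category path can be treated as a target token sequence. The taxonomy, the product text, and the helper names (`path_to_tokens`, `is_known_path`) are hypothetical and only illustrate the data representation; this is not the authors' model or code.

```python
def path_to_tokens(path, sep=">"):
    """Split a root-to-leaf category path into one target token per level."""
    return [level.strip() for level in path.split(sep) if level.strip()]


def is_known_path(tokens, taxonomy_paths):
    """Check whether a generated token sequence is an existing taxonomy path."""
    return tuple(tokens) in taxonomy_paths


# Hypothetical taxonomy, stored as a set of root-to-leaf paths.
taxonomy_paths = {
    ("Shoes", "Men", "Boots"),
    ("Shoes", "Women", "Sandals"),
    ("Bags", "Women", "Totes"),
}

# One hypothetical training pair: source tokens in, path tokens out.
source_tokens = "mens waterproof leather boots".split()
target_tokens = path_to_tokens("Shoes > Men > Boots")

# A decoder emits one level at a time, so it can also produce a sequence
# that is not in the taxonomy yet, i.e. a candidate new path.
generated = ["Bags", "Men", "Totes"]

print(target_tokens)
print(is_known_path(target_tokens, taxonomy_paths))
print(is_known_path(generated, taxonomy_paths))
```

Because the output is a sequence rather than a single class ID, a sequence that is not yet in the taxonomy can be flagged for review as a possible new category path instead of being forced into an existing one.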

  • Published/Last Updated - Dec. 14, 2018
  • Authors and Contributors - Maggie Yundi Li (National University of Singapore), Stanley Kok (National University of Singapore), and Liling Tan (Rakuten Institute of Technology)

Read Now

2. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models

The authors of this paper propose attention convolutional neural network (ACNN) models as an improvement over baseline convolutional neural network (CNN) models and gradient boosted tree (GBT) classifiers. The study uses Japanese product titles taken from Rakuten Ichiba as training data. Using this data, the authors compare the performance of the three methods (ACNN, CNN, and GBT) for large-scale product categorization. While differences in accuracy can be less than 5%, at this scale even minor improvements can result in millions of additional correct categorizations. Lastly, the authors explain how an ensemble of ACNN and GBT models can further reduce false categorizations.
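
As a rough illustration of the ensemble idea, the sketch below combines class probabilities from two models with a weighted average. The averaging rule, the `weight` parameter, and the example probabilities are assumptions made for illustration; the paper's actual ensembling scheme may differ.

```python
def ensemble_predict(p_acnn, p_gbt, weight=0.5):
    """Pick the category with the highest weighted-average probability.

    p_acnn and p_gbt map category names to probabilities from each model.
    weight is the share given to the first model (an assumed parameter).
    """
    categories = sorted(set(p_acnn) | set(p_gbt))
    combined = {
        c: weight * p_acnn.get(c, 0.0) + (1.0 - weight) * p_gbt.get(c, 0.0)
        for c in categories
    }
    return max(categories, key=lambda c: combined[c])


# Hypothetical outputs for a single product title.
p_acnn = {"Shoes": 0.55, "Bags": 0.45}
p_gbt = {"Shoes": 0.30, "Bags": 0.70}

print(ensemble_predict(p_acnn, p_gbt))
```

In this toy case the two models disagree, and the more confident one decides the outcome, which is one reason combining models with different inductive biases can help.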

  • Published/Last Updated - April 2017 (EACL 2017)
  • Authors and Contributors - From the Rakuten Institute of Technology: Yandi Xia, Aaron Levine, Pradipto Das, Giuseppe Di Fabbrizio, Keiji Shinzato, and Ankur Datta

Read Now

3. Atlas: A Dataset and Benchmark for Ecommerce Clothing Product Classification

Researchers at the University of Colorado and Ericsson Research (Chennai, India) have created a large product dataset known as Atlas. In this paper, the team presents their dataset, which includes over 186,000 images of clothing products along with their product titles. They also review related work in the field that influenced their study. Finally, they test their dataset using a ResNet34 classification model and a sequence-to-sequence (Seq2Seq) model to categorize the products.

The data is taken from Indian ecommerce stores, so some of the categories used may not be applicable to Western markets. However, the dataset has been open-sourced and is available on GitHub.

  • Published/Last Updated - Aug. 19, 2019
  • Authors and Contributors - Venkatesh Umaashankar (Ericsson Research), Girish Shanmugam (Ericsson Research), and Aditi Prakash (University of Colorado)

Read Now

4. Large Scale Product Categorization using Structured and Unstructured Attributes

In this study, a team at WalmartLabs compares hierarchical models to flat models for product categorization. The researchers employ deep learning-based models which extract features from each product to create a product signature.

In the paper, the researchers describe a multi-LSTM and multi-CNN based approach to this extreme classification task. Furthermore, they present a novel way to use structured attributes. The team states that their methods can be scaled to take into account any number of product attributes during categorization.
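
The difference between flat and hierarchical prediction can be shown with a toy example. In the sketch below, the taxonomy and the per-node scores are made up, and a simple score lookup stands in for the LSTM/CNN models described in the paper.

```python
# Hypothetical two-level taxonomy: parent -> children.
TAXONOMY = {
    "root": ["Electronics", "Clothing"],
    "Electronics": ["Phones", "Laptops"],
    "Clothing": ["Shirts", "Shoes"],
}


def hierarchical_predict(scores, taxonomy, root="root"):
    """Walk down from the root, picking the best-scoring child at each level."""
    node, path = root, []
    while node in taxonomy:
        node = max(taxonomy[node], key=lambda child: scores[child])
        path.append(node)
    return path


def flat_predict(scores, taxonomy):
    """Score every leaf directly and ignore the intermediate levels."""
    children = {c for kids in taxonomy.values() for c in kids}
    leaves = sorted(c for c in children if c not in taxonomy)
    return max(leaves, key=lambda leaf: scores[leaf])


# Made-up scores for one product; a real system would get these from a model.
scores = {
    "Electronics": 0.4, "Clothing": 0.6,
    "Phones": 0.9, "Laptops": 0.1,
    "Shirts": 0.3, "Shoes": 0.2,
}

print(hierarchical_predict(scores, TAXONOMY))
print(flat_predict(scores, TAXONOMY))
```

Here the two strategies disagree: a wrong choice at the top level locks the hierarchical walk into the wrong subtree, while the flat model has to separate all leaves at once. That trade-off is the kind of thing a flat-versus-hierarchical comparison is meant to measure.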

  • Published/Last Updated - Mar. 1, 2019
  • Authors and Contributors - From WalmartLabs: Abhinandan Krishnan and Abilash Amarthaluri

Read Now

5. Multi-Label Product Categorization Using Multi-Modal Fusion Models

In this paper, researchers from New York University and U.S. Bank investigate multi-modal approaches to categorize products on Amazon. Their approach uses separate classifiers, each trained on one type of input data from the product listings. Using a dataset of 9.4 million Amazon products, they developed a tri-modal model for product classification based on product images, titles, and descriptions. Their tri-modal late fusion model achieves an F1 score of 88.2%.

The findings of their study demonstrate that increasing the number of modalities could improve performance in multi-label product categorization.
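
Late fusion means each modality gets its own classifier and the outputs are combined at the decision level. The following sketch averages per-label scores across modalities and keeps every label above a threshold. The averaging rule, the 0.5 threshold, and the example scores are assumptions for illustration, not the fusion method or numbers from the paper.

```python
def late_fusion(scores_by_modality, threshold=0.5):
    """Average per-label scores across modalities, then threshold.

    scores_by_modality maps a modality name to a {label: score} dict.
    Returns every label whose mean score reaches the threshold, so a
    product can receive several labels (multi-label classification).
    """
    labels = sorted({l for s in scores_by_modality.values() for l in s})
    n = len(scores_by_modality)
    fused = {
        l: sum(s.get(l, 0.0) for s in scores_by_modality.values()) / n
        for l in labels
    }
    return [l for l in labels if fused[l] >= threshold]


# Hypothetical per-modality outputs for one product listing.
scores_by_modality = {
    "image": {"Clothing": 0.9, "Shoes": 0.4, "Toys": 0.1},
    "title": {"Clothing": 0.8, "Shoes": 0.7, "Toys": 0.2},
    "description": {"Clothing": 0.7, "Shoes": 0.6, "Toys": 0.1},
}

print(late_fusion(scores_by_modality))
```

In this example the image classifier alone would miss the "Shoes" label, but the title and description scores pull the fused score over the threshold, which mirrors the idea that modalities can cover for each other's weaknesses.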

  • Published/Last Updated - June 30, 2019
  • Authors and Contributors - Pasawee Wirojwatanakul (New York University) and Artit Wangperawong (U.S. Bank)

Read Now


In the papers on product categorization above, the researchers trained their models on open datasets, some of which included millions of products. However, if you are building a product categorization model for commercial use, many open datasets may not be available to you.

Looking for training data for your product classification model? Check out this training data guide and these open datasets.
