Anuj Bolewar
Building a 75,000-Product Image Feature Dataset for the Amazon ML Challenge 2025

Hey everyone! Hope you’re all doing great today.

So I’ve got something pretty exciting to share with you all — I just created a massive image features dataset for the Amazon ML Challenge 2025!

Processing 75,000 product images and extracting deep learning features was absolutely wild. I never thought I’d be building ML-ready datasets at this scale — but here we are!


How It All Started

While reading through the Amazon ML Challenge problem statement about predicting product prices, I noticed something interesting.

The challenge provides product images, but most participants would probably struggle with efficient image processing.

That’s when it hit me —

“Wait, I could create a ready-to-use feature dataset that makes everyone’s life easier!”

Why make everyone extract image features from scratch when it can be done once and shared with the whole community?


The Challenge That Sparked This Idea

The competition focuses on predicting product prices using both text and images.

The connection between visuals and pricing is fascinating — a clean, professional product photo often correlates with higher perceived value.

But there’s a catch — working with raw images is:

  • Time-consuming
  • GPU-intensive
  • Technically complex for beginners
  • Repetitive work everyone must do

So I thought —

“What if I create a dataset that eliminates all these pain points?”


The Creative Process Was Actually Fun

Coming Up With The Architecture

I’ve always loved transfer learning, so I chose ResNet50, pretrained on ImageNet, to extract image embeddings.

The pipeline (there's a code sketch right after this list):

  1. Extract 2048-dimensional features using ResNet50
  2. Apply PCA compression to reduce them to 100 dimensions
  3. Save it all in a clean, ML-ready CSV file
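
This isn't my exact script, but here's a minimal sketch of the idea, assuming PyTorch + torchvision + scikit-learn (the file names, batch size, and weight version are illustrative):

```python
# A minimal sketch of the pipeline, not the exact production script.
# Assumes PyTorch + torchvision + scikit-learn; paths, batch size, and
# weight version are illustrative.
import numpy as np
import pandas as pd
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet50 pretrained on ImageNet, with the final FC layer dropped so the
# global-average-pooled 2048-dim vector comes out.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).to(device).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths, batch_size=64):
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in image_paths[i:i + batch_size]]).to(device)
        feats.append(backbone(batch).squeeze(-1).squeeze(-1).cpu().numpy())
    return np.vstack(feats)

# features = extract_features(paths)                          # (N, 2048)
# compressed = PCA(n_components=100).fit_transform(features)  # (N, 100)
# pd.DataFrame(compressed).to_csv("image_features.csv", index=False)
```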

The Technical Challenge

Processing 75,000 product images isn’t just “run a script and chill”.

I had to make sure the extracted features were:

  • Meaningful: capturing rich visual details
  • Fast to compute: I hit 502 images/sec on GPU
  • Robust to missing data: about 2% of products had no image
  • Easy to use: a simple CSV ready for pandas or sklearn

I also worked on:

  • Optimizing the batch pipeline
  • Fine-tuning PCA (retaining 83% variance)
  • Normalizing features (see the sketch after this list)
  • Writing documentation and examples
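
A simplified sketch of the normalize-then-compress step (StandardScaler is shown for illustration; it stands in for whatever normalization you prefer, and the random matrix stands in for the real features):

```python
# Simplified normalize-then-compress step. StandardScaler is shown for
# illustration; `features` is a stand-in for the real (N, 2048) matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = np.random.randn(1000, 2048)        # stand-in for real features

scaled = StandardScaler().fit_transform(features)  # zero mean, unit variance
pca = PCA(n_components=100)
compressed = pca.fit_transform(scaled)             # (1000, 100)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.0%}")
```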

What I Learned About Feature Engineering

Before this project, I understood deep learning theory — but building a real-world feature pipeline taught me so much more.

We’re talking about compressing thousands of pixels into just 100 numerical features — yet those vectors can:

  • Identify similar products
  • Enable price prediction
  • Drive recommendation engines

That’s the power of representation learning.


The Technical Breakdown

Image Processing Pipeline

ResNet50 → 2048 features → PCA → 100 features

Quality Checks

  • 98% image coverage (73,485 / 75,000)
  • Feature distributions validated
  • Visual similarity tested (see the sketch below)
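
To give an idea of what "visual similarity tested" can look like in practice, here's a hedged sketch using cosine similarity over the 100-dim vectors (the CSV name and `feat_*` columns are my illustrative assumptions, not the dataset's exact schema):

```python
# Sanity-checking features via nearest neighbours: visually similar
# products should land close together in the 100-dim space.
# The CSV name and feat_* columns are illustrative assumptions.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]].values

sims = cosine_similarity(X[:1], X)[0]   # similarity of product 0 to all
top5 = sims.argsort()[::-1][1:6]        # most similar, excluding itself
print("nearest neighbours of product 0:", top5)
```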

Optimization

  • GPU accelerated → 502 images/sec
  • Parallel processing
  • Memory-efficient pipeline (loading sketched below)
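
The parallelism and memory efficiency mostly come from batched, streamed loading. A sketch with a PyTorch DataLoader, reusing `preprocess` and `backbone` from the pipeline sketch above (the dataset class and worker count are illustrative):

```python
# Parallel, memory-efficient loading: worker processes decode and
# preprocess images on CPU while the GPU runs the backbone. Reuses
# `preprocess` and `backbone` from the pipeline sketch above.
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ProductImages(Dataset):            # hypothetical helper class
    def __init__(self, paths, transform):
        self.paths, self.transform = paths, transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        return self.transform(Image.open(self.paths[i]).convert("RGB"))

loader = DataLoader(ProductImages(paths, preprocess), batch_size=64,
                    num_workers=4, pin_memory=True)

with torch.no_grad():
    for batch in loader:                   # batches stream through;
        feats = backbone(batch.to("cuda"))  # nothing sits in RAM all at once
```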

Documentation

  • Detailed README
  • Example notebooks
  • Ready-to-run scripts

The Challenges I Faced

The hardest decision? Choosing PCA dimensions.

After several experiments:

  • 100 dimensions retained 83% variance
  • Balanced accuracy + efficiency
  • Great for visualization and clustering
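
For anyone who wants to reproduce that choice, the trade-off is essentially a cumulative explained-variance curve. A sketch, assuming `features` is the (N, 2048) matrix from the earlier pipeline sketch:

```python
# Choosing the PCA dimensionality from the cumulative explained-variance
# curve; `features` is the (N, 2048) ResNet50 feature matrix from above.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(features)
cum = np.cumsum(pca.explained_variance_ratio_)
for k in (50, 100, 200):
    print(f"{k:>4} dims -> {cum[k - 1]:.0%} variance retained")
# for this dataset, 100 dims landed at ~83%
```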

Handling missing images was another challenge — I used zero vectors with a missing flag to maintain data consistency.
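
A minimal sketch of that zero-vector-plus-flag idea (the column names are illustrative, not the dataset's exact schema):

```python
# Missing-image handling: a zero vector plus a boolean flag keeps every
# product row aligned. Column names are illustrative assumptions.
import numpy as np
import pandas as pd

def build_rows(product_ids, feature_lookup, dim=100):
    rows = []
    for pid in product_ids:
        feat = feature_lookup.get(pid)    # None when the image was missing
        missing = feat is None
        vec = np.zeros(dim) if missing else feat
        rows.append({"product_id": pid, "image_missing": missing,
                     **{f"feat_{i}": v for i, v in enumerate(vec)}})
    return pd.DataFrame(rows)
```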


💭 What This Experience Taught Me

  1. Data Infrastructure Matters — great models need great data.
  2. Usability Is Everything — make it easy for others to use.
  3. Optimization Pays Off — from 50 → 502 images/sec!
  4. Community Value — sharing saves others’ time and energy.

The Numbers That Made Me Proud

| Metric | Value |
| --- | --- |
| Total Products | 75,000 |
| Valid Images | 73,485 (98%) |
| Speed | 502 images/sec |
| Compression | 2048 → 100 dims |
| Variance Retained | 83% |
| File Size | 79 MB |

Use Cases I’m Excited About

  • Visual Product Search – find similar-looking items
  • Clustering – group visually related products (sketched below)
  • Price Prediction – combine with text features
  • Recommendations – suggest look-alike products
  • Research – benchmark multimodal ML pipelines
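
As a taste of the clustering use case, here's a hedged sketch with KMeans (the CSV name, column names, and cluster count are illustrative assumptions):

```python
# Grouping visually related products with KMeans on the 100-dim features.
# File name, column names, and cluster count are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]].values

df["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
print(df["cluster"].value_counts().head())
```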

Future Plans

  • Extract features from other layers
  • Try EfficientNet or Vision Transformers (ViT)
  • Create multimodal datasets (image + text)
  • Build a real-time API for on-the-fly feature extraction

My Honest Take on Feature Engineering

Sure, raw images are great for end-to-end learning, but pre-extracted features democratize access: no GPU required, and you can start modeling instantly.

It's about practicality: get to the ML part faster.
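
Here's what that looks like in practice: a hedged quick-start, assuming a price column has been merged alongside the features (file and column names are illustrative):

```python
# From CSV to a trained price model in a few lines; no GPU needed.
# File and column names are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]]
y = df["price"]                     # assumes prices are merged in

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out products:", model.score(X_te, y_te))
```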


Check Out the Dataset

Dataset: Kaggle – Product Image Features

Includes:

  • Feature extraction code
  • Usage tutorials
  • Analysis scripts
  • Documentation for quick start

Perfect for the Amazon ML Challenge or any e-commerce ML project.


Final Thoughts

Creating this dataset was both challenging and deeply rewarding.

It taught me how to think about scalability, optimization, and sharing ML resources with the community.

There's something magical about seeing those progress bars fly at 502 images/sec, knowing your work might save others days of processing time.


Technical Specifications

| Parameter | Value |
| --- | --- |
| Model | ResNet50 (ImageNet pretrained) |
| Original Features | 2048 |
| Compressed Features | 100 (via PCA) |
| Variance Retained | 83% |
| Processing Speed | 502 images/sec (GPU) |
| File Format | CSV (pandas-ready) |

💬 What Do You Think?

What’s your take on feature engineering for ML competitions?

Have you ever built your own dataset like this?

Share your thoughts below — let’s talk data, optimization, and creativity!

Until next time —

Keep Building. Keep Learning.
