Anuj Bolewar
Building a 75,000-Product Image Feature Dataset for the Amazon ML Challenge 2025

Hey everyone! Hope you’re all doing great today.

So I’ve got something pretty exciting to share with you all — I just created a massive image features dataset for the Amazon ML Challenge 2025!

Processing 75,000 product images and extracting deep learning features was absolutely wild. I never thought I’d be building ML-ready datasets at this scale — but here we are!


How It All Started

While reading through the Amazon ML Challenge problem statement about predicting product prices, I noticed something interesting.

The challenge provides product images, but most participants would probably struggle with efficient image processing.

That’s when it hit me —

“Wait, I could create a ready-to-use feature dataset that makes everyone’s life easier!”

Why make everyone extract image features from scratch when it can be done once and shared with the whole community?


The Challenge That Sparked This Idea

The competition focuses on predicting product prices using both text and images.

The connection between visuals and pricing is fascinating — a clean, professional product photo often correlates with higher perceived value.

But there’s a catch — working with raw images is:

  • Time-consuming
  • GPU-intensive
  • Technically complex for beginners
  • Repetitive work everyone must do

So I thought —

“What if I create a dataset that eliminates all these pain points?”


The Creative Process Was Actually Fun

Coming Up With The Architecture

I’ve always loved transfer learning, so I chose ResNet50, pretrained on ImageNet, to extract image embeddings.

The pipeline (there's a code sketch right after this list):

  1. Extract 2048-dimensional features using ResNet50
  2. Apply PCA compression to reduce them to 100 dimensions
  3. Save it all in a clean, ML-ready CSV file
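
This isn't my exact script, but here's a minimal sketch of the idea, assuming PyTorch + torchvision + scikit-learn (the file names, batch size, and weight version are illustrative):

```python
# A minimal sketch of the pipeline, not the exact production script.
# Assumes PyTorch + torchvision + scikit-learn; paths, batch size, and
# weight version are illustrative.
import numpy as np
import pandas as pd
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet50 pretrained on ImageNet, with the final FC layer dropped so the
# global-average-pooled 2048-dim vector comes out.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).to(device).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths, batch_size=64):
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in image_paths[i:i + batch_size]]).to(device)
        feats.append(backbone(batch).squeeze(-1).squeeze(-1).cpu().numpy())
    return np.vstack(feats)

# features = extract_features(paths)                          # (N, 2048)
# compressed = PCA(n_components=100).fit_transform(features)  # (N, 100)
# pd.DataFrame(compressed).to_csv("image_features.csv", index=False)
```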

The Technical Challenge

Processing 75,000 product images isn’t just “run a script and chill”.

I had to make sure the extracted features were:

  • Meaningful: capturing rich visual details
  • Fast to compute: I hit 502 images/sec on GPU
  • Robust to missing data: about 2% of products had no image
  • Easy to use: a simple CSV ready for pandas or sklearn

I also worked on:

  • Optimizing the batch pipeline
  • Fine-tuning PCA (retaining 83% variance)
  • Normalizing features (see the sketch after this list)
  • Writing documentation and examples
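
A simplified sketch of the normalize-then-compress step (StandardScaler is shown for illustration; it stands in for whatever normalization you prefer, and the random matrix stands in for the real features):

```python
# Simplified normalize-then-compress step. StandardScaler is shown for
# illustration; `features` is a stand-in for the real (N, 2048) matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = np.random.randn(1000, 2048)        # stand-in for real features

scaled = StandardScaler().fit_transform(features)  # zero mean, unit variance
pca = PCA(n_components=100)
compressed = pca.fit_transform(scaled)             # (1000, 100)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.0%}")
```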

What I Learned About Feature Engineering

Before this project, I understood deep learning theory — but building a real-world feature pipeline taught me so much more.

We’re talking about compressing thousands of pixels into just 100 numerical features — yet those vectors can:

  • Identify similar products
  • Enable price prediction
  • Drive recommendation engines

That’s the power of representation learning.


The Technical Breakdown

Image Processing Pipeline

ResNet50 → 2048 features → PCA → 100 features

Quality Checks

  • 98% image coverage (73,485 / 75,000)
  • Feature distributions validated
  • Visual similarity tested (see the sketch below)
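
To give an idea of what "visual similarity tested" can look like in practice, here's a hedged sketch using cosine similarity over the 100-dim vectors (the CSV name and `feat_*` columns are my illustrative assumptions, not the dataset's exact schema):

```python
# Sanity-checking features via nearest neighbours: visually similar
# products should land close together in the 100-dim space.
# The CSV name and feat_* columns are illustrative assumptions.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]].values

sims = cosine_similarity(X[:1], X)[0]   # similarity of product 0 to all
top5 = sims.argsort()[::-1][1:6]        # most similar, excluding itself
print("nearest neighbours of product 0:", top5)
```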

Optimization

  • GPU accelerated → 502 images/sec
  • Parallel processing
  • Memory-efficient pipeline (loading sketched below)
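
The parallelism and memory efficiency mostly come from batched, streamed loading. A sketch with a PyTorch DataLoader, reusing `preprocess` and `backbone` from the pipeline sketch above (the dataset class and worker count are illustrative):

```python
# Parallel, memory-efficient loading: worker processes decode and
# preprocess images on CPU while the GPU runs the backbone. Reuses
# `preprocess` and `backbone` from the pipeline sketch above.
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ProductImages(Dataset):            # hypothetical helper class
    def __init__(self, paths, transform):
        self.paths, self.transform = paths, transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        return self.transform(Image.open(self.paths[i]).convert("RGB"))

loader = DataLoader(ProductImages(paths, preprocess), batch_size=64,
                    num_workers=4, pin_memory=True)

with torch.no_grad():
    for batch in loader:                   # batches stream through;
        feats = backbone(batch.to("cuda"))  # nothing sits in RAM all at once
```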

Documentation

  • Detailed README
  • Example notebooks
  • Ready-to-run scripts

The Challenges I Faced

The hardest decision? Choosing PCA dimensions.

After several experiments:

  • 100 dimensions retained 83% variance
  • Balanced accuracy + efficiency
  • Great for visualization and clustering
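
For anyone who wants to reproduce that choice, the trade-off is essentially a cumulative explained-variance curve. A sketch, assuming `features` is the (N, 2048) matrix from the earlier pipeline sketch:

```python
# Choosing the PCA dimensionality from the cumulative explained-variance
# curve; `features` is the (N, 2048) ResNet50 feature matrix from above.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(features)
cum = np.cumsum(pca.explained_variance_ratio_)
for k in (50, 100, 200):
    print(f"{k:>4} dims -> {cum[k - 1]:.0%} variance retained")
# for this dataset, 100 dims landed at ~83%
```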

Handling missing images was another challenge — I used zero vectors with a missing flag to maintain data consistency.
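
A minimal sketch of that zero-vector-plus-flag idea (the column names are illustrative, not the dataset's exact schema):

```python
# Missing-image handling: a zero vector plus a boolean flag keeps every
# product row aligned. Column names are illustrative assumptions.
import numpy as np
import pandas as pd

def build_rows(product_ids, feature_lookup, dim=100):
    rows = []
    for pid in product_ids:
        feat = feature_lookup.get(pid)    # None when the image was missing
        missing = feat is None
        vec = np.zeros(dim) if missing else feat
        rows.append({"product_id": pid, "image_missing": missing,
                     **{f"feat_{i}": v for i, v in enumerate(vec)}})
    return pd.DataFrame(rows)
```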


💭 What This Experience Taught Me

  1. Data Infrastructure Matters — great models need great data.
  2. Usability Is Everything — make it easy for others to use.
  3. Optimization Pays Off — from 50 → 502 images/sec!
  4. Community Value — sharing saves others’ time and energy.

The Numbers That Made Me Proud

| Metric | Value |
| --- | --- |
| Total Products | 75,000 |
| Valid Images | 73,485 (98%) |
| Speed | 502 images/sec |
| Compression | 2048 → 100 dims |
| Variance Retained | 83% |
| File Size | 79 MB |

Use Cases I’m Excited About

  • Visual Product Search – find similar-looking items
  • Clustering – group visually related products (sketched below)
  • Price Prediction – combine with text features
  • Recommendations – suggest look-alike products
  • Research – benchmark multimodal ML pipelines
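
As a taste of the clustering use case, here's a hedged sketch with KMeans (the CSV name, column names, and cluster count are illustrative assumptions):

```python
# Grouping visually related products with KMeans on the 100-dim features.
# File name, column names, and cluster count are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]].values

df["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
print(df["cluster"].value_counts().head())
```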

Future Plans

  • Extract features from other layers
  • Try EfficientNet or Vision Transformers (ViT)
  • Create multimodal datasets (image + text)
  • Build a real-time API for on-the-fly feature extraction

My Honest Take on Feature Engineering

Sure, raw images are great for end-to-end learning, but pre-extracted features democratize access: no GPU required, and you can start modeling instantly.

It's about practicality: get to the ML part faster.
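
Here's what that looks like in practice: a hedged quick-start, assuming a price column has been merged alongside the features (file and column names are illustrative):

```python
# From CSV to a trained price model in a few lines; no GPU needed.
# File and column names are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("image_features.csv")
X = df[[c for c in df.columns if c.startswith("feat_")]]
y = df["price"]                     # assumes prices are merged in

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out products:", model.score(X_te, y_te))
```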


Check Out the Dataset

Dataset: Kaggle – Product Image Features

Includes:

  • Feature extraction code
  • Usage tutorials
  • Analysis scripts
  • Documentation for quick start

Perfect for the Amazon ML Challenge or any e-commerce ML project.


Final Thoughts

Creating this dataset was both challenging and deeply rewarding.

It taught me how to think about scalability, optimization, and sharing ML resources with the community.

There's something magical about seeing those progress bars fly at 502 images/sec, knowing your work might save others days of processing time.


Technical Specifications

| Parameter | Value |
| --- | --- |
| Model | ResNet50 (ImageNet pretrained) |
| Original Features | 2048 |
| Compressed Features | 100 (via PCA) |
| Variance Retained | 83% |
| Processing Speed | 502 images/sec (GPU) |
| File Format | CSV (pandas-ready) |

💬 What Do You Think?

What’s your take on feature engineering for ML competitions?

Have you ever built your own dataset like this?

Share your thoughts below — let’s talk data, optimization, and creativity!

Until next time —

Keep Building. Keep Learning.
