Hey everyone! Hope you’re all doing great today.
So I’ve got something pretty exciting to share with you all — I just created a massive image features dataset for the Amazon ML Challenge 2025!
Processing 75,000 product images and extracting deep learning features was absolutely wild. I never thought I’d be building ML-ready datasets at this scale — but here we are!
How It All Started
While reading through the Amazon ML Challenge problem statement about predicting product prices, I noticed something interesting.
The challenge provides product images, but most participants would probably struggle with efficient image processing.
That’s when it hit me —
“Wait, I could create a ready-to-use feature dataset that makes everyone’s life easier!”
Why make everyone extract image features from scratch when it can be done once and shared with the whole community?
The Challenge That Sparked This Idea
The competition focuses on predicting product prices using both text and images.
The connection between visuals and pricing is fascinating — a clean, professional product photo often correlates with higher perceived value.
But there’s a catch — working with raw images is:
- Time-consuming
- GPU-intensive
- Technically complex for beginners
- Repetitive work everyone must do
So I thought —
“What if I create a dataset that eliminates all these pain points?”
The Creative Process Was Actually Fun
Coming Up With The Architecture
I’ve always loved transfer learning, so I chose ResNet50, pretrained on ImageNet, to extract image embeddings.
The pipeline (a rough sketch follows the list):
- Extract 2048-dimensional features using ResNet50
- Apply PCA compression to reduce them to 100 dimensions
- Save it all in a clean, ML-ready CSV file
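For anyone curious what that looks like in code, here is a minimal sketch using torchvision and scikit-learn. The folder name, batch size, and column names are placeholders rather than my exact setup:

```python
# Minimal sketch: ResNet50 embeddings -> PCA -> CSV (paths and params are placeholders)
import torch
import torch.nn as nn
import pandas as pd
from pathlib import Path
from PIL import Image
from torchvision import models
from torchvision.models import ResNet50_Weights
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained ResNet50 with the classification head removed -> 2048-dim embeddings
weights = ResNet50_Weights.IMAGENET1K_V2
backbone = models.resnet50(weights=weights)
backbone.fc = nn.Identity()
backbone.eval().to(device)

preprocess = weights.transforms()  # standard ImageNet resize / crop / normalize

@torch.no_grad()
def embed(paths, batch_size=64):
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in paths[i:i + batch_size]]).to(device)
        feats.append(backbone(batch).cpu())
    return torch.cat(feats).numpy()  # shape: (N, 2048)

image_paths = sorted(Path("train_images").glob("*.jpg"))  # placeholder folder
features_2048 = embed(image_paths)

# Compress 2048 -> 100 dims with PCA and save an ML-ready CSV
pca = PCA(n_components=100, random_state=42)
features_100 = pca.fit_transform(features_2048)

df = pd.DataFrame(features_100, columns=[f"img_feat_{i}" for i in range(100)])
df.insert(0, "image_id", [p.stem for p in image_paths])
df.to_csv("image_features.csv", index=False)
```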
The Technical Challenge
Processing 75,000 product images isn’t just “run a script and chill”.
I had to make sure the pipeline and its output were:
- Meaningful: capturing rich visual details
- Efficient: I hit 502 images/sec on GPU
- Robust to missing data: about 2% of products had missing images
- Easy to use: a simple CSV ready for pandas or sklearn
I also worked on:
- Optimizing the batch pipeline
- Fine-tuning PCA (retaining 83% variance)
- Normalizing features
- Writing documentation and examples
What I Learned About Feature Engineering
Before this project, I understood deep learning theory — but building a real-world feature pipeline taught me so much more.
We’re talking about compressing thousands of pixels into just 100 numerical features — yet those vectors can:
- Identify similar products
- Enable price prediction
- Drive recommendation engines
That’s the power of representation learning.
The Technical Breakdown
Image Processing Pipeline
ResNet50 → 2048 features → PCA → 100 features
Quality Checks
- 98% image coverage (73,485 / 75,000)
- Feature distributions validated
- Visual similarity tested
Optimization
- GPU accelerated → 502 images/sec
- Parallel processing
- Memory-efficient pipeline (see the loading sketch below)
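Most of that throughput came from plain batching and parallel data loading rather than anything exotic. Roughly, the loading side looks like this (the dataset wrapper, batch size, and worker count here are illustrative, not my exact code):

```python
# Illustrative batching setup: parallel image decoding on CPU workers, inference on GPU
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class ProductImages(Dataset):  # hypothetical minimal dataset wrapper
    def __init__(self, paths, transform):
        self.paths, self.transform = paths, transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)

def extract(paths, model, transform, device="cuda"):
    loader = DataLoader(
        ProductImages(paths, transform),
        batch_size=128,      # larger batches keep the GPU busy
        num_workers=8,       # parallel JPEG decoding on the CPU
        pin_memory=True,     # faster host-to-GPU transfers
    )
    chunks = []
    with torch.no_grad():
        for batch in loader:
            chunks.append(model(batch.to(device, non_blocking=True)).cpu())
    return torch.cat(chunks)
```

In practice, batch_size and num_workers were the two knobs that mattered most; pin_memory just speeds up the host-to-GPU copies.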
Documentation
- Detailed README
- Example notebooks
- Ready-to-run scripts
The Challenges I Faced
The hardest decision? Choosing PCA dimensions.
After several experiments (a quick variance check is sketched below):
- 100 dimensions retained 83% variance
- Balanced accuracy + efficiency
- Great for visualization and clustering
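That trade-off comes down to cumulative explained variance. A quick way to check it looks roughly like this (features_2048 is the full 2048-dim matrix from the extraction step; exact numbers will vary by run):

```python
# Quick check of how much variance each PCA dimensionality keeps (sketch)
import numpy as np
from sklearn.decomposition import PCA

pca_check = PCA(n_components=500).fit(features_2048)   # features_2048: (N, 2048)
cumulative = np.cumsum(pca_check.explained_variance_ratio_)

for dims in (50, 100, 200, 500):
    print(f"{dims} dims -> {cumulative[dims - 1]:.1%} variance retained")
# In my run, 100 dims landed around 83%.
```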
Handling missing images was another challenge — I used zero vectors with a missing flag to maintain data consistency.
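The idea is simple: every product gets a vector of the same shape, plus a flag so downstream models can treat missing images differently. A minimal sketch with toy data (the column names are placeholders):

```python
# Sketch of the missing-image handling: zero vector plus an explicit "has_image" flag
import numpy as np
import pandas as pd

def features_for(product_id, feature_lookup, n_dims=100):
    """Return (feature_vector, has_image_flag); zeros + flag 0 when the image is missing."""
    if product_id in feature_lookup:
        return feature_lookup[product_id], 1
    return np.zeros(n_dims), 0

# Toy example: two products with extracted features, one without an image
feature_lookup = {"A001": np.random.rand(100), "A002": np.random.rand(100)}
all_product_ids = ["A001", "A002", "A003"]   # A003 has no image

rows = []
for pid in all_product_ids:
    vec, flag = features_for(pid, feature_lookup)
    rows.append([pid, flag, *vec])

cols = ["product_id", "has_image"] + [f"img_feat_{i}" for i in range(100)]
df = pd.DataFrame(rows, columns=cols)
print(df[["product_id", "has_image"]])
```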
💭 What This Experience Taught Me
- Data Infrastructure Matters — great models need great data.
- Usability Is Everything — make it easy for others to use.
- Optimization Pays Off — from 50 → 502 images/sec!
- Community Value — sharing saves others’ time and energy.
The Numbers That Made Me Proud
| Metric | Value |
| --- | --- |
| Total Products | 75,000 |
| Valid Images | 73,485 (98%) |
| Speed | 502 images/sec |
| Compression | 2048 → 100 dims |
| Variance Retained | 83% |
| File Size | 79 MB |
Use Cases I’m Excited About
- Visual Product Search – find similar-looking items (see the sketch after this list)
- Clustering – group visually related products
- Price Prediction – combine with text features
- Recommendations – suggest look-alike products
- Research – benchmark multimodal ML pipelines
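For the visual search and recommendation cases, the features drop straight into a nearest-neighbour lookup. A minimal sketch with scikit-learn, assuming the CSV layout from the pipeline sketch above:

```python
# Find visually similar products with cosine nearest neighbours on the 100-dim features
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("image_features.csv")                 # placeholder filename
feature_cols = [c for c in df.columns if c.startswith("img_feat_")]
X = df[feature_cols].values

nn = NearestNeighbors(metric="cosine").fit(X)

def similar_products(idx, k=5):
    """Return the k most visually similar products to row idx (excluding itself)."""
    _, neighbours = nn.kneighbors(X[idx:idx + 1], n_neighbors=k + 1)
    return df.iloc[neighbours[0][1:]][["image_id"]]

print(similar_products(0))
```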
Future Plans
- Extract features from other layers
- Try EfficientNet or Vision Transformers (ViT)
- Create multimodal datasets (image + text)
- Build a real-time API for on-the-fly feature extraction
My Honest Take on Feature Engineering
Sure, raw images are great for end-to-end learning,
but pre-extracted features democratize access: no GPU required, and you can start modeling right away.
It's about practicality: getting to the ML part faster.
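That practicality is easy to show: with the CSV, a first baseline is a few lines of pandas and scikit-learn on CPU. A sketch, where train_labels.csv is a hypothetical stand-in for however you pull product prices from the challenge data:

```python
# Sketch: pre-extracted features -> quick baseline price model, CPU only
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

features = pd.read_csv("image_features.csv")      # the shared feature file
labels = pd.read_csv("train_labels.csv")          # hypothetical file with image_id + price
df = features.merge(labels[["image_id", "price"]], on="image_id")

X = df[[c for c in df.columns if c.startswith("img_feat_")]]
y = df["price"]

scores = cross_val_score(GradientBoostingRegressor(), X, y,
                         cv=3, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```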
Check Out the Dataset
Dataset: Kaggle – Product Image Features
Includes:
- Feature extraction code
- Usage tutorials
- Analysis scripts
- Documentation for quick start
Perfect for the Amazon ML Challenge or any e-commerce ML project.
Final Thoughts
Creating this dataset was both challenging and deeply rewarding.
It taught me how to think about scalability, optimization, and sharing ML resources with the community.
There’s something magical about seeing those progress bars fly at 502 images/sec —
knowing your work might save others days of processing time.
Technical Specifications
| Parameter | Value |
| --- | --- |
| Model | ResNet50 (ImageNet pretrained) |
| Original Features | 2048 |
| Compressed Features | 100 (via PCA) |
| Variance Retained | 83% |
| Processing Speed | 502 images/sec (GPU) |
| File Format | CSV (pandas-ready) |
💬 What Do You Think?
What’s your take on feature engineering for ML competitions?
Have you ever built your own dataset like this?
Share your thoughts below — let’s talk data, optimization, and creativity!
Until next time —
Keep Building. Keep Learning.