Daniel Popoola

ML in Warehouse Operations - How I Built a Production ML System to Automate Fashion Return Classification

From a warehouse problem I read about to a working MLOps pipeline


There's a stat that stuck with me when I started this project: online fashion retailers see return rates of up to 30%. That's nearly 1 in 3 items coming back.

Behind that number is a real operational headache. Every returned item — a pair of casual shoes, a handbag, a watch — has to be physically inspected, categorized, and processed. Is it a shirt or a top? Does it go back on the shelf or get refurbished? That decision, made by a human staring at an item after a long shift, happens hundreds of times a day.

I wanted to solve that with machine learning. Not just train a model and call it a day — but build something that could actually run in the background of a warehouse operation: automated, reliable, and observable.

That project is RefundClassifier.


The Problem with "Just Training a Model"

When I started thinking about this, my first instinct was the same as any ML student's: find a dataset, train a classifier, hit 90%+ accuracy, done.

But accuracy on a test set doesn't keep a warehouse running. The real questions are harder:

  • What happens when the batch job crashes halfway through 400 images at 2 AM?
  • How do you update the model without taking the whole system down?
  • How do you know if predictions are quietly degrading weeks after deployment?
  • Who reviews the results in the morning — and in what format?

These are MLOps problems. And they're the gap between a notebook demo and a system someone can actually trust.

RefundClassifier is my attempt to close that gap.


What the System Does

In plain terms: every night at 2 AM, the system picks up all the return images uploaded during the business day, runs them through an ML model, and writes out a results file that warehouse staff can review in the morning.

The five categories it classifies are: Casual Shoes, Handbags, Shirts, Tops, and Watches — trained on 2,500 product images with 96.53% accuracy on the test set.

But the interesting parts aren't the model. They're the infrastructure around it.


How It's Built

The architecture has three main layers:

1. The Model Service (FastAPI)
A lightweight REST API that loads the EfficientNet-B0 model from an MLflow registry and exposes a /predict endpoint. It's stateless — it doesn't know or care about batches. It just classifies what it's given.

Separating the model into its own service was a deliberate choice. It means I can update, restart, or swap the model without touching the batch processing logic.

2. The Batch Orchestrator (Python)
This is the core of the system. It runs on a cron schedule, scans the input directory for unprocessed images, calls the Model Service in batches of 10, writes results to a CSV, and pushes metrics to Prometheus.

The most important feature here: checkpoint recovery. If the job crashes at image 287 of 400, it doesn't restart from zero. It reads the checkpoint, skips what's already done, and continues. In a production warehouse context, reprocessing already-classified items creates data integrity issues. This prevents that.
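The resume logic can be sketched in a few lines. This is my illustration of the idea, not the project's actual code — the checkpoint format, file names, and the `classify` callable are all assumptions:

```python
import json
from pathlib import Path

BATCH_SIZE = 10

def run_batch(images, classify, checkpoint_path="checkpoint.json"):
    """Classify `images`, skipping anything the checkpoint marks as done."""
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    results = {}
    pending = [img for img in images if img not in done]
    for i in range(0, len(pending), BATCH_SIZE):
        batch = pending[i:i + BATCH_SIZE]
        for img in batch:
            results[img] = classify(img)  # one call to the Model Service
        done.update(batch)
        # Persist progress after every batch so a crash resumes from here.
        ckpt.write_text(json.dumps(sorted(done)))
    return results
```

On a rerun after a crash, anything already recorded in the checkpoint is filtered out before the loop starts, so each image is classified exactly once.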

3. Monitoring (Prometheus + Grafana)
Every batch run pushes metrics — inference latency, batch success rate, class distribution — to a Prometheus Pushgateway. Grafana dashboards surface those metrics visually. If the model starts misclassifying at unusual rates, or a batch takes 3x longer than normal, it shows up.
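Pushing metrics from a short-lived batch job works differently from a long-running service: there's nothing for Prometheus to scrape, so the job pushes to a Pushgateway when it finishes. A minimal sketch using `prometheus_client` — the metric names and gateway address are my placeholders, not necessarily the project's:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def build_registry(processed: int, failed: int, latency_s: float) -> CollectorRegistry:
    registry = CollectorRegistry()
    Gauge("batch_images_processed", "Images classified in this run",
          registry=registry).set(processed)
    Gauge("batch_images_failed", "Images that errored during the run",
          registry=registry).set(failed)
    Gauge("batch_inference_latency_seconds", "Mean per-image latency",
          registry=registry).set(latency_s)
    return registry

def push_batch_metrics(processed, failed, latency_s, gateway="localhost:9091"):
    # One push per nightly run; Prometheus scrapes the Pushgateway,
    # and Grafana dashboards read from Prometheus.
    push_to_gateway(gateway, job="refund_classifier_batch",
                    registry=build_registry(processed, failed, latency_s))
```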

This was the part I underestimated the most. Monitoring isn't a "nice to have." It's how you find out something is wrong before a human has to tell you.


Model Versioning with MLflow

The model is registered in MLflow with a production alias — a pointer that says "this is the version the Model Service should load." When I retrain with new data, I register the new version and promote it to production. The service picks it up on restart, no code changes needed.

This is the simplest version of a deployment pipeline, but it enforces a useful discipline: the model is never just a file on disk. It has a version, experiment metadata, accuracy metrics attached to it, and a clear promotion path.
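The alias convention can be sketched like this. MLflow 2.x resolves `models:/<name>@<alias>` URIs against the registry; the model name below is a placeholder, not necessarily the one the project uses:

```python
def production_model_uri(name: str, alias: str = "production") -> str:
    # e.g. "models:/refund-classifier@production" resolves to whichever
    # registered version currently holds the alias.
    return f"models:/{name}@{alias}"

def load_production_model(name: str):
    import mlflow.pytorch  # lazy import: needs a reachable tracking server
    return mlflow.pytorch.load_model(production_model_uri(name))

def promote_to_production(name: str, version: int):
    from mlflow import MlflowClient
    # Repoint the alias at the new version; the Model Service
    # picks it up on its next restart — no code changes.
    MlflowClient().set_registered_model_alias(name, "production", str(version))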


The UI

There's also a Streamlit interface for manual use — useful for ad-hoc classification or demos. Staff can upload a batch of images, trigger classification, and see the results in a table without touching the command line.


What I Actually Learned

Building this taught me a few things that no ML course covered:

Batch processing is underrated. Most tutorials show real-time inference. But most real business operations don't need sub-second latency — they need reliable, scheduled, auditable processing. Batch is often the right answer.

The 10% that isn't model accuracy is 90% of the work. Getting to 96% accuracy took two days. Getting checkpoint recovery, metric pushing, model registry integration, and error handling right took the rest of the project.

Observability is the difference between a deployed model and a trusted system. A model running in the dark is not production. A model with dashboards, alerts, and traceable outputs is.


Links

  • GitHub: github.com/DanielPopoola/autorma
  • Dataset: Fashion Product Images (Kaggle) — 2,500 images across 5 categories
  • Stack: PyTorch · FastAPI · MLflow · Prometheus · Grafana · Streamlit · Docker

This was my final year CS project. I'm currently looking for roles in backend engineering and ML engineering — feel free to connect.
