Daniel Popoola

ML in Warehouse Operations - How I Built a Production ML System to Automate Fashion Return Classification

From a warehouse problem I read about to a working MLOps pipeline


There's a stat that stuck with me when I started this project: online fashion retailers see return rates of up to 30%. That's nearly 1 in 3 items coming back.

Behind that number is a real operational headache. Every returned item — a pair of casual shoes, a handbag, a watch — has to be physically inspected, categorized, and processed. Is it a shirt or a top? Does it go back on the shelf or get refurbished? That decision, made by a human staring at an item after a long shift, happens hundreds of times a day.

I wanted to solve that with machine learning. Not just train a model and call it a day — but build something that could actually run in the background of a warehouse operation: automated, reliable, and observable.

That project is RefundClassifier.


The Problem with "Just Training a Model"

When I started thinking about this, my first instinct was the same as any ML student's: find a dataset, train a classifier, hit 90%+ accuracy, done.

But accuracy on a test set doesn't keep a warehouse running. The real questions are harder:

  • What happens when the batch job crashes halfway through 400 images at 2 AM?
  • How do you update the model without taking the whole system down?
  • How do you know if predictions are quietly degrading weeks after deployment?
  • Who reviews the results in the morning — and in what format?

These are MLOps problems. And they're the gap between a notebook demo and a system someone can actually trust.

RefundClassifier is my attempt to close that gap.


What the System Does

In plain terms: every night at 2 AM, the system picks up all the return images uploaded during the business day, runs them through an ML model, and writes out a results file that warehouse staff can review in the morning.

The five categories it classifies are: Casual Shoes, Handbags, Shirts, Tops, and Watches — trained on 2,500 product images with 96.53% accuracy on the test set.

But the interesting parts aren't the model. They're the infrastructure around it.


How It's Built

The architecture has three main layers:

1. The Model Service (FastAPI)
A lightweight REST API that loads the EfficientNet-B0 model from an MLflow registry and exposes a /predict endpoint. It's stateless — it doesn't know or care about batches. It just classifies what it's given.

Separating the model into its own service was a deliberate choice. It means I can update, restart, or swap the model without touching the batch processing logic.

2. The Batch Orchestrator (Python)
This is the core of the system. It runs on a cron schedule, scans the input directory for unprocessed images, calls the Model Service in batches of 10, writes results to a CSV, and pushes metrics to Prometheus.

The most important feature here: checkpoint recovery. If the job crashes at image 287 of 400, it doesn't restart from zero. It reads the checkpoint, skips what's already done, and continues. In a production warehouse context, reprocessing already-classified items creates data integrity issues. This prevents that.
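The resume logic can be sketched in a few lines. This is my illustration of the idea, not the project's actual code — the checkpoint format, file names, and the `classify` callable are all assumptions:

```python
import json
from pathlib import Path

BATCH_SIZE = 10

def run_batch(images, classify, checkpoint_path="checkpoint.json"):
    """Classify `images`, skipping anything the checkpoint marks as done."""
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    results = {}
    pending = [img for img in images if img not in done]
    for i in range(0, len(pending), BATCH_SIZE):
        batch = pending[i:i + BATCH_SIZE]
        for img in batch:
            results[img] = classify(img)  # one call to the Model Service
        done.update(batch)
        # Persist progress after every batch so a crash resumes from here.
        ckpt.write_text(json.dumps(sorted(done)))
    return results
```

On a rerun after a crash, anything already recorded in the checkpoint is filtered out before the loop starts, so each image is classified exactly once.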

3. Monitoring (Prometheus + Grafana)
Every batch run pushes metrics — inference latency, batch success rate, class distribution — to a Prometheus Pushgateway. Grafana dashboards surface those metrics visually. If the model starts misclassifying at unusual rates, or a batch takes 3x longer than normal, it shows up.
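Pushing metrics from a short-lived batch job works differently from a long-running service: there's nothing for Prometheus to scrape, so the job pushes to a Pushgateway when it finishes. A minimal sketch using `prometheus_client` — the metric names and gateway address are my placeholders, not necessarily the project's:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def build_registry(processed: int, failed: int, latency_s: float) -> CollectorRegistry:
    registry = CollectorRegistry()
    Gauge("batch_images_processed", "Images classified in this run",
          registry=registry).set(processed)
    Gauge("batch_images_failed", "Images that errored during the run",
          registry=registry).set(failed)
    Gauge("batch_inference_latency_seconds", "Mean per-image latency",
          registry=registry).set(latency_s)
    return registry

def push_batch_metrics(processed, failed, latency_s, gateway="localhost:9091"):
    # One push per nightly run; Prometheus scrapes the Pushgateway,
    # and Grafana dashboards read from Prometheus.
    push_to_gateway(gateway, job="refund_classifier_batch",
                    registry=build_registry(processed, failed, latency_s))
```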

This was the part I underestimated the most. Monitoring isn't a "nice to have." It's how you find out something is wrong before a human has to tell you.


Model Versioning with MLflow

The model is registered in MLflow with a production alias — a pointer that says "this is the version the Model Service should load." When I retrain with new data, I register the new version and promote it to production. The service picks it up on restart, no code changes needed.

This is the simplest version of a deployment pipeline, but it enforces a useful discipline: the model is never just a file on disk. It has a version, experiment metadata, accuracy metrics attached to it, and a clear promotion path.
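The alias convention can be sketched like this. MLflow 2.x resolves `models:/<name>@<alias>` URIs against the registry; the model name below is a placeholder, not necessarily the one the project uses:

```python
def production_model_uri(name: str, alias: str = "production") -> str:
    # e.g. "models:/refund-classifier@production" resolves to whichever
    # registered version currently holds the alias.
    return f"models:/{name}@{alias}"

def load_production_model(name: str):
    import mlflow.pytorch  # lazy import: needs a reachable tracking server
    return mlflow.pytorch.load_model(production_model_uri(name))

def promote_to_production(name: str, version: int):
    from mlflow import MlflowClient
    # Repoint the alias at the new version; the Model Service
    # picks it up on its next restart — no code changes.
    MlflowClient().set_registered_model_alias(name, "production", str(version))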


The UI

There's also a Streamlit interface for manual use — useful for ad-hoc classification or demos. Staff can upload a batch of images, trigger classification, and see the results in a table without touching the command line.


What I Actually Learned

Building this taught me a few things that no ML course covered:

Batch processing is underrated. Most tutorials show real-time inference. But most real business operations don't need sub-second latency — they need reliable, scheduled, auditable processing. Batch is often the right answer.

The 10% that isn't model accuracy is 90% of the work. Getting to 96% accuracy took two days. Getting checkpoint recovery, metric pushing, model registry integration, and error handling right took the rest of the project.

Observability is the difference between a deployed model and a trusted system. A model running in the dark is not production. A model with dashboards, alerts, and traceable outputs is.


Links

  • GitHub: github.com/DanielPopoola/autorma
  • Dataset: Fashion Product Images (Kaggle) — 2,500 images across 5 categories
  • Stack: PyTorch · FastAPI · MLflow · Prometheus · Grafana · Streamlit · Docker

This was my final year CS project. I'm currently looking for roles in backend engineering and ML engineering — feel free to connect.
