From a warehouse problem I read about to a working MLOps pipeline
There's a stat that stuck with me when I started this project: online fashion retailers see return rates of up to 30%. That's nearly 1 in 3 items coming back.
Behind that number is a real operational headache. Every returned item — a pair of casual shoes, a handbag, a watch — has to be physically inspected, categorized, and processed. Is it a shirt or a top? Does it go back on the shelf or get refurbished? That decision, made by a human staring at an item after a long shift, happens hundreds of times a day.
I wanted to solve that with machine learning. Not just train a model and call it a day — but build something that could actually run in the background of a warehouse operation: automated, reliable, and observable.
That project is RefundClassifier.
The Problem with "Just Training a Model"
When I started thinking about this, my first instinct was the same as any ML student's: find a dataset, train a classifier, hit 90%+ accuracy, done.
But accuracy on a test set doesn't keep a warehouse running. The real questions are harder:
- What happens when the batch job crashes halfway through 400 images at 2 AM?
- How do you update the model without taking the whole system down?
- How do you know if predictions are quietly degrading weeks after deployment?
- Who reviews the results in the morning — and in what format?
These are MLOps problems. And they're the gap between a notebook demo and a system someone can actually trust.
RefundClassifier is my attempt to close that gap.
What the System Does
In plain terms: every night at 2 AM, the system picks up all the return images uploaded during the business day, runs them through an ML model, and writes out a results file that warehouse staff can review in the morning.
The five categories it classifies are: Casual Shoes, Handbags, Shirts, Tops, and Watches — trained on 2,500 product images with 96.53% accuracy on the test set.
But the interesting parts aren't the model. They're the infrastructure around it.
How It's Built
The architecture has three main layers:
1. The Model Service (FastAPI)
A lightweight REST API that loads the EfficientNet-B0 model from an MLflow registry and serves a /predict endpoint. It's stateless — it doesn't know or care about batches. It just classifies what it's given.
Separating the model into its own service was a deliberate choice. It means I can update, restart, or swap the model without touching the batch processing logic.
2. The Batch Orchestrator (Python)
This is the core of the system. It runs on a cron schedule, scans the input directory for unprocessed images, calls the Model Service in batches of 10, writes results to a CSV, and pushes metrics to Prometheus.
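The "cron schedule" part is a single crontab entry; the paths here are illustrative, not the project's actual layout:

```
# Run the batch orchestrator every night at 2 AM (paths are hypothetical)
0 2 * * * /usr/bin/python3 /opt/refundclassifier/run_batch.py >> /var/log/refund_batch.log 2>&1
```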
The most important feature here: checkpoint recovery. If the job crashes at image 287 of 400, it doesn't restart from zero. It reads the checkpoint, skips what's already done, and continues. In a production warehouse context, reprocessing already-classified items creates data integrity issues. This prevents that.
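The recovery logic itself is simple. Here's a sketch of the idea — the checkpoint file format (a JSON list of processed filenames) and function names are my illustration, not the project's exact implementation; only the batch size of 10 comes from the post:

```python
# Sketch of checkpoint recovery: skip already-classified images on restart.
# Checkpoint format (JSON list of done filenames) is an assumption.
import json
from pathlib import Path

BATCH_SIZE = 10

def load_checkpoint(path: Path) -> set:
    return set(json.loads(path.read_text())) if path.exists() else set()

def save_checkpoint(path: Path, done: set) -> None:
    path.write_text(json.dumps(sorted(done)))

def run_batch_job(images, classify_batch, checkpoint_path: Path) -> dict:
    """Classify `images` in batches, persisting progress after each batch."""
    done = load_checkpoint(checkpoint_path)
    pending = [img for img in images if img not in done]  # skip finished work
    results = {}
    for i in range(0, len(pending), BATCH_SIZE):
        batch = pending[i:i + BATCH_SIZE]
        results.update(classify_batch(batch))  # e.g. calls to the model service
        done.update(batch)
        save_checkpoint(checkpoint_path, done)  # crash-safe: progress survives
    return results
```

If the process dies at image 287 of 400, the next run reads the checkpoint, sees 280 images already recorded (the last complete batch), and resumes from there instead of reprocessing everything.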
3. Monitoring (Prometheus + Grafana)
Every batch run pushes metrics — inference latency, batch success rate, class distribution — to a Prometheus Pushgateway. Grafana dashboards surface those metrics visually. If the model starts misclassifying at unusual rates, or a batch takes 3x longer than normal, it shows up.
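Pushing metrics from a short-lived batch job is what the Pushgateway is for — the job exits before Prometheus could scrape it, so it pushes instead. A sketch with `prometheus_client`, where the metric names and gateway address are my assumptions, not the project's actual ones:

```python
# Sketch of pushing per-run batch metrics to a Prometheus Pushgateway.
# Metric names and the gateway address are illustrative assumptions.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def build_metrics_registry(latency_s: float, processed: int, failed: int) -> CollectorRegistry:
    registry = CollectorRegistry()
    Gauge("batch_inference_latency_seconds", "Mean inference latency per image",
          registry=registry).set(latency_s)
    Gauge("batch_images_processed", "Images classified in this run",
          registry=registry).set(processed)
    Gauge("batch_images_failed", "Images that errored in this run",
          registry=registry).set(failed)
    return registry

def push_batch_metrics(latency_s: float, processed: int, failed: int,
                       gateway: str = "localhost:9091") -> None:
    # One push per nightly run; Grafana reads these back out of Prometheus.
    push_to_gateway(gateway, job="refund_classifier_batch",
                    registry=build_metrics_registry(latency_s, processed, failed))
```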
This was the part I underestimated the most. Monitoring isn't a "nice to have." It's how you find out something is wrong before a human has to tell you.
Model Versioning with MLflow
The model is registered in MLflow with a production alias — a pointer that says "this is the version the Model Service should load." When I retrain with new data, I register the new version and promote it to production. The service picks it up on restart, no code changes needed.
This is the simplest version of a deployment pipeline, but it enforces a useful discipline: the model is never just a file on disk. It has a version, experiment metadata, accuracy metrics attached to it, and a clear promotion path.
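In code, alias-based promotion is only a few calls. This sketch assumes the model is registered under the name `RefundClassifier` (my guess at the registry name) and MLflow's alias support (2.3+); the `models:/<name>@<alias>` URI is how the service resolves "whatever currently holds the production alias":

```python
# Sketch of alias-based promotion and loading with MLflow.
# The registered model name "RefundClassifier" is an assumption.
MODEL_NAME = "RefundClassifier"

def model_uri(alias: str = "production") -> str:
    # "models:/<name>@<alias>" resolves to whichever version holds the alias.
    return f"models:/{MODEL_NAME}@{alias}"

def promote_to_production(version: int) -> None:
    # After retraining: point the 'production' alias at the new version.
    from mlflow.tracking import MlflowClient  # imported lazily
    MlflowClient().set_registered_model_alias(MODEL_NAME, "production", str(version))

def load_production_model():
    # Called at service startup, so a restart picks up the latest promotion.
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(model_uri())
```

The deployment step becomes a one-liner (`promote_to_production(7)`), and rollback is the same call with the previous version number.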
The UI
There's also a Streamlit interface for manual use — useful for ad-hoc classification or demos. Staff can upload a batch of images, trigger classification, and see the results in a table without touching the command line.
What I Actually Learned
Building this taught me a few things that no ML course covered:
Batch processing is underrated. Most tutorials show real-time inference. But most real business operations don't need sub-second latency — they need reliable, scheduled, auditable processing. Batch is often the right answer.
The 10% that isn't model accuracy is 90% of the work. Getting to 96% accuracy took two days. Getting checkpoint recovery, metric pushing, model registry integration, and error handling right took the rest of the project.
Observability is the difference between a deployed model and a trusted system. A model running in the dark is not production. A model with dashboards, alerts, and traceable outputs is.
Links
- GitHub: github.com/DanielPopoola/autorma
- Dataset: Fashion Product Images (Kaggle) — 2,500 images across 5 categories
- Stack: PyTorch · FastAPI · MLflow · Prometheus · Grafana · Streamlit · Docker
This was my final year CS project. I'm currently looking for roles in backend engineering and ML engineering — feel free to connect.