The Problem: “Will This Order Arrive Late?”
In any supply chain, late deliveries erode customer trust and drive up support costs. At order placement, logistics teams usually have no reliable way to know which orders will miss their delivery promise. Our client was flagging orders manually, based on gut feeling and experience: no formal metric, no repeatability, and a lot of missed late deliveries.
We were asked to build a system that predicts late delivery risk at the moment an order is placed, enabling proactive interventions: expedited shipping, customer notifications, priority routing. The model had to be production‑ready, reproducible, and monitorable.
Problem Formulation – It’s All About the Cost of Missing a Late Order
We framed it as binary classification:
Target: Late_delivery_risk (0 = on time, 1 = late)
Primary metric: F2‑score, because missing a late delivery (a false negative) is twice as costly as a false alarm (a false positive).
Guardrail: Recall ≥ 0.80 (catch at least 80% of true late orders)
The dataset contained 180,519 orders with 53 columns. Class distribution was near‑balanced (54.8% late, 45.2% on time).
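To make the F2 weighting concrete, here is a minimal sketch using scikit‑learn's fbeta_score on toy labels (the numbers are illustrative, not project data):

```python
from sklearn.metrics import fbeta_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]  # one missed late order (FN), two false alarms (FP)

f2 = fbeta_score(y_true, y_pred, beta=2)  # beta=2 weights recall beta^2 = 4x precision
recall = recall_score(y_true, y_pred)
print(round(f2, 3), recall)  # → 0.714 0.75
```

Note that with the same predictions, vanilla F1 would score 0.667 here; F2 rewards the high recall we actually care about.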
First Step: Remove Leakage
The raw data included columns that would be impossible to know at order placement:
Delivery Status – literally the target
Days for shipping (real) – actual transit days, known only after delivery
shipping date (DateOrders) – when the order actually left the warehouse
Order Status – may reflect post‑order events (we dropped it to be safe)
We also dropped PII (email, names, password), pure IDs, 100% null columns (Product Description), and columns with extreme cardinality (e.g., Order City with 3,597 unique values).
After cleaning, we kept ~30 features: scheduled shipping days, benefit per order, shipping mode, market, customer segment, order hour, day of week, etc.
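The leakage removal reduces to an explicit denylist applied before any split. A minimal sketch, assuming pandas and column names as they appear in the raw export (your copy of the data may differ; the helper function is ours):

```python
import pandas as pd

# Columns unavailable (or meaningless) at order placement.
LEAKY_COLS = ["Delivery Status", "Days for shipping (real)",
              "shipping date (DateOrders)", "Order Status"]

def drop_leakage(df: pd.DataFrame, extra: list = ()) -> pd.DataFrame:
    """Drop leaky columns plus any caller-supplied denylist, ignoring absent names."""
    to_drop = [c for c in [*LEAKY_COLS, *extra] if c in df.columns]
    return df.drop(columns=to_drop)

df = pd.DataFrame({"Delivery Status": ["Late"], "Market": ["LATAM"]})
clean = drop_leakage(df)
print(list(clean.columns))  # → ['Market']
```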
Feature Engineering – Preventing the #1 Silent Bug
Train‑serving skew happens when serving‑time preprocessing differs from what the model saw during training. We prevent it by fitting every transform inside a single sklearn Pipeline:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('numeric', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), NUMERIC_COLS),
        ('onehot', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='UNKNOWN')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), ONEHOT_COLS),
        ('target', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='UNKNOWN')),
            ('encoder', TargetEncoder())  # out-of-fold encoding during fit_transform
        ]), TARGET_ENC_COLS),
    ])),
    ('model', model)
])
```
At serving time, the same frozen artifact is loaded – zero chance of applying a different imputation median or a different one‑hot encoding.
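A minimal sketch of that freeze‑and‑reload round trip with joblib (the toy pipeline and file path are illustrative, not the project's actual artifact):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("model", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)

joblib.dump(pipeline, "model.joblib")   # training side: one frozen artifact
loaded = joblib.load("model.joblib")    # serving side: the exact same object

# Fitted state (scaler means, coefficients) travels with the artifact.
assert (loaded.predict(X) == pipeline.predict(X)).all()
```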
MLOps Architecture: Ten Stages, Five Pipelines
These stages are implemented as five pipelines in ZenML:
Training pipeline (stages 1‑6)
Inference pipeline (stage 7)
Monitoring pipeline (stage 8)
Drift detection pipeline (stage 9)
Retraining pipeline (stage 10)
All experiments are tracked in MLflow, integrated with ZenML. Every run logs hyperparameters, metrics, the exact dataset version (ZenML artifact ID), and the git commit hash.
Training & Evaluation – Why We Chose LightGBM
We trained a series of models, starting from a Dummy Classifier (majority class) as the absolute floor.
We used stratified 80/10/10 train/validation/test splits. The test set is touched exactly once at final evaluation only.
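An 80/10/10 stratified split can be built from two chained train_test_split calls; a sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0, 1] * 500)

# 80/20 first, then split the 20% in half -> 80/10/10, stratified throughout
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 800 100 100
```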
The success criteria for production:
F2‑score ≥ 0.75 on held‑out test set
Recall ≥ 0.80
Pipeline is reproducible – same data + same code = same result
(After first run, we can revise the F2 threshold based on actual business value.)
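The gates themselves reduce to a tiny pure function (thresholds from the criteria above; the function name is ours):

```python
def passes_gates(f2: float, recall: float,
                 f2_min: float = 0.75, recall_min: float = 0.80) -> bool:
    """Promotion gate: both thresholds must hold simultaneously."""
    return f2 >= f2_min and recall >= recall_min

assert passes_gates(0.78, 0.83)
assert not passes_gates(0.78, 0.79)   # recall guardrail fails
assert not passes_gates(0.74, 0.90)   # F2 floor fails
```

Keeping the gate as a standalone function makes the threshold revision mentioned above a one‑line config change.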
Deployment – Batch Inference with Human‑in‑the‑Loop Promotion
We chose batch inference because order volume is moderate (~500 orders/day). Predictions are refreshed hourly. This is enough to trigger expedited shipping or customer notifications.
Model Promotion Workflow
Training run completes.
Evaluation gates pass (F2 ≥ 0.75, recall ≥ 0.80).
Model is automatically registered in MLflow as “Staging”.
A human reviews metrics against the current production model in MLflow UI.
Human approves → model promoted to “Production”.
Previous production model is moved to “Archived” (retained for 90 days).
Rollback < 5 minutes. If a bad model slips through, we simply promote the archived version back to Production via MLflow UI. Because predictions are stored with a model_version column, the dashboard can immediately start using the restored model without data deletion.
System Design: How New Orders Get Scored
We needed a reliable way to identify unscored orders and write predictions without double‑writing. The solution uses an absence check and an idempotent insert.
Prediction Table Schema
```sql
CREATE TABLE predictions (
    prediction_id   BIGSERIAL PRIMARY KEY,
    order_id        BIGINT NOT NULL REFERENCES orders(order_id),
    late_risk_score FLOAT NOT NULL CHECK (late_risk_score BETWEEN 0.0 AND 1.0),
    predicted_late  SMALLINT NOT NULL CHECK (predicted_late IN (0, 1)),
    scored_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    model_version   TEXT NOT NULL,   -- MLflow model version tag
    pipeline_run_id TEXT NOT NULL,   -- ZenML run ID for audit
    UNIQUE (order_id, model_version)
);
```
Detecting Unscored Orders
```sql
-- Anti-join: orders with no prediction from the current model version.
-- NOT IN is safe here only because predictions.order_id is NOT NULL;
-- with nullable subquery columns, prefer NOT EXISTS.
SELECT * FROM orders
WHERE order_id NOT IN (
    SELECT order_id FROM predictions
    WHERE model_version = :current_version
);
```
Idempotent Write
```sql
INSERT INTO predictions (order_id, late_risk_score, predicted_late,
                         model_version, pipeline_run_id)
VALUES (...)
ON CONFLICT (order_id, model_version) DO NOTHING;
```
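The idempotency property can be demonstrated end‑to‑end with an in‑memory SQLite table, which shares Postgres's upsert syntax (a toy stand‑in for the production schema, not the real database code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE predictions (
    order_id        INTEGER NOT NULL,
    late_risk_score REAL NOT NULL,
    model_version   TEXT NOT NULL,
    UNIQUE (order_id, model_version))""")

row = (1001, 0.87, "v3")
for _ in range(3):  # replaying the same batch is harmless
    conn.execute("""INSERT INTO predictions VALUES (?, ?, ?)
                    ON CONFLICT (order_id, model_version) DO NOTHING""", row)

count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(count)  # → 1
```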
Monitoring & Drift Detection – Don’t Wait for Ground Truth
Ground truth (Late_delivery_risk) arrives only after delivery – days after the prediction. That’s too late to notice data distribution changes. We use Evidently to detect drift in input features as soon as new orders come in.
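Evidently does this for us inside the pipeline; to illustrate the underlying idea, here is a minimal population‑stability‑index (PSI) drift check in plain numpy (PSI is one common drift statistic, and the 0.1/0.2 thresholds are rules of thumb, not project settings):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of one numeric feature between two samples."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) on empty bins
    ref_pct, cur_pct = ref_pct + eps, cur_pct + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
ref = rng.normal(4.0, 1.0, 5000)        # e.g. scheduled shipping days at training time
same = rng.normal(4.0, 1.0, 5000)       # fresh data, no drift
shifted = rng.normal(5.0, 1.0, 5000)    # e.g. a carrier SLA change

print(psi(ref, same) < 0.1, psi(ref, shifted) > 0.2)  # → True True
```

The key point stands regardless of the statistic used: input drift is computable the moment new orders arrive, with no labels required.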
Operational Reality – What We Learned
1. Start with the simplest deployment that works.
At 500 orders/day, a single scheduled pipeline writing to one database table is correct. We did not need Kubernetes, real‑time APIs, or feature stores. Adding complexity too early kills velocity.
2. Train‑serving skew is the silent killer.
Packaging the fitted sklearn.Pipeline as a single artifact and reloading it at inference time is non‑negotiable. Every imputation median, one‑hot mapping, and target encoding must be frozen.
3. Idempotency saves weekends.
ON CONFLICT DO NOTHING and the absence‑based order detection meant we never had to worry about replaying or cleaning up duplicate predictions.
4. Human promotion gates build trust.
The first production model was promoted manually after reviewing slice metrics. After three stable cycles, we may automate promotion – but not before. Trust is earned.
5. Monitor input drift, not just output.
We caught a carrier SLA change because the Days for shipment (scheduled) distribution shifted. The Slack alert arrived two days before any ground‑truth label was available; we fixed the feature mapping before any performance degradation.
What’s Next?
We deferred a few items that were not needed for MVP:
Real‑time API serving (batch is fine for now)
Fully automated retraining (earn it after evaluation gates are proven)
Cloud infrastructure (local ZenML stack is sufficient for this scale)
When order volume grows beyond ~10,000/day, we will revisit. But for now, the system is stable, reproducible, and delivers an F2‑score above the target.
Conclusion
The complete codebase follows the structure described in our internal design docs, with a core/ module containing pure business logic (no framework imports) and steps/ as thin ZenML wrappers. This makes testing fast and framework migration cheap.
Happy building, and may your deliveries always be on time. 🚚

