DEV Community: Sugnik Mondal

How I Built a Late Delivery Risk Predictor for APL Logistics: What a 95% Delay Rate in First Class Shipping Taught Me About Supply Chain ML

Sugnik Mondal — Sun, 24 May 2026 11:54:32 +0000

Late deliveries are not just an inconvenience. For a global logistics operator like APL Logistics (KWE Group), a single delayed shipment can trigger SLA breaches, financial penalties, and long-term customer churn. Multiply that across hundreds of thousands of orders spanning five global markets, and the cost of reactive delay management becomes unsustainable.

The conventional approach has always been to handle delays after they happen — emergency rerouting, last-minute escalations, and reactive customer communication. This project takes a different approach entirely. Instead of reacting, it predicts.

This article walks through the end-to-end machine learning pipeline built to predict late delivery risk for APL Logistics — from raw data to a deployed Streamlit dashboard used by supply chain operations teams.

The Dataset

The project uses the DataCo Smart Supply Chain dataset — a comprehensive real-world transactional dataset from APL Logistics' global operations.

Raw dataset: 180,519 rows × 40 columns
After cleaning: 180,517 rows × 28 columns
Target variable: Late_delivery_risk (1 = Late, 0 = Not Late)

Target distribution:

Class	Count	Percentage
Not Late (0)	81,541	45.17%
Late (1)	98,976	54.83%

The near-balanced target distribution was an important early finding: SMOTE was not required. Class weighting in each model (class_weight='balanced', or scale_pos_weight for XGBoost) was sufficient.

Data Cleaning — The Leakage Problem

The most critical cleaning decisions were around data leakage — columns that would not be available at the time of prediction (before dispatch) but that reveal the outcome after the fact.

Leakage columns dropped:

Delivery Status: Cramér's V of 1.00 with the target, a perfect association. This column contains values like "Late delivery" and "Shipping on time", which are literally the answer. Using it would give near-perfect accuracy in training and a model that is unusable in production, where the value does not exist before dispatch.

Order Status: values like COMPLETE, CLOSED, CANCELED are assigned after the order is fulfilled. At prediction time (before dispatch), this information does not exist.

The simple test for leakage: "At the moment the prediction is needed, would this information be available?" If not — drop it.
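The association behind that check can be quantified with Cramér's V between a categorical column and the target. The following is a minimal sketch using only pandas and NumPy; the function name `cramers_v` and the toy rows are illustrative assumptions, not the project's code.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1  # scaling term: min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Hypothetical toy rows: the status column fully determines the target
toy = pd.DataFrame({
    "Delivery Status": ["Late delivery", "Shipping on time", "Late delivery",
                        "Advance shipping", "Shipping on time", "Late delivery"],
    "Shipping Mode": ["A", "A", "B", "B", "A", "B"],
})
toy["Late_delivery_risk"] = (toy["Delivery Status"] == "Late delivery").astype(int)

leak_score = cramers_v(toy["Delivery Status"], toy["Late_delivery_risk"])
mode_score = cramers_v(toy["Shipping Mode"], toy["Late_delivery_risk"])
```

A column whose score sits at or very near 1.0 deserves the "would this exist at prediction time?" question before it goes anywhere near a model.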

Other columns dropped:

PII columns: Customer Fname, Customer Lname, Customer Street, Customer Zipcode
ID columns: Category Id, Department Id, Customer Id, Order Customer Id
Redundant location: Latitude, Longitude (redundant with Market and Order Region)

Missing values: Only Customer Lname (8 rows) and Customer Zipcode (3 rows) had nulls — both dropped entirely as they were removal candidates anyway.

Duplicates: 2 duplicate rows removed.

Final cleaned dataset: 180,517 rows × 28 columns
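The cleaning steps above could look roughly like this in pandas. This is a sketch under assumptions: the column names are taken from the lists in this section, the order of the two steps (exact duplicates first, then column drops) is a guess, and only the columns named here are dropped.

```python
import pandas as pd

LEAKAGE_COLS = ["Delivery Status", "Order Status"]
PII_COLS = ["Customer Fname", "Customer Lname", "Customer Street", "Customer Zipcode"]
ID_COLS = ["Category Id", "Department Id", "Customer Id", "Order Customer Id"]
LOCATION_COLS = ["Latitude", "Longitude"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicate records first (assumed order of operations)
    out = df.drop_duplicates()
    drop_cols = LEAKAGE_COLS + PII_COLS + ID_COLS + LOCATION_COLS
    # errors="ignore" lets the sketch run on frames that lack some columns
    out = out.drop(columns=drop_cols, errors="ignore")
    return out.reset_index(drop=True)


# Hypothetical toy frame with one exact duplicate row
raw = pd.DataFrame({
    "Delivery Status": ["Late delivery", "Late delivery", "Shipping on time"],
    "Customer Fname": ["A", "A", "B"],
    "Market": ["Europe", "Europe", "LATAM"],
    "Late_delivery_risk": [1, 1, 0],
})
cleaned = clean(raw)
```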

Exploratory Data Analysis — The Most Surprising Finding

Before building any model, the data was explored thoroughly. The most counterintuitive finding came from shipping mode analysis.

Late delivery rate by shipping mode:

Shipping Mode	Late Delivery Rate
First Class	95.3%
Second Class	76.6%
Same Day	45.7%
Standard Class	38.1%

First Class shipping — which customers expect to be faster and more reliable — has a 95.3% late delivery rate. This is not a rounding error. Nearly every First Class order in the dataset arrived late. This suggests that First Class commitments are systematically over-promised relative to operational capacity.

Late delivery rate by market:

All five global markets showed rates between 54.4% and 55.2% — an extremely narrow band. This finding is operationally significant: the delay problem is not geographically concentrated. It is systemic across all markets, meaning market-level interventions alone will not solve it.

Shipping delay gap:

The gap between actual shipping days and scheduled shipping days averaged +0.57 days across all orders. 103,399 orders (57.3%) shipped later than scheduled. Most delays were by exactly one day — suggesting a consistent operational mismatch between scheduling and execution.

Correlation heatmap revealed multicollinearity:

Benefit per order and Order Profit Per Order → 1.00 correlation
Order Item Product Price and Product Price → 1.00 correlation
Sales per customer, Order Item Total, Sales → 0.99 correlation

These redundant columns were dropped in preprocessing to prevent multicollinearity — particularly harmful for Logistic Regression.
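Redundant pairs like these can be listed programmatically from the correlation matrix. The sketch below is one way to do it; the helper name `highly_correlated_pairs`, the 0.95 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, |corr|) for numeric column pairs at or above threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Hypothetical toy data: two identical price columns plus an unrelated quantity
rng = np.random.default_rng(0)
price = rng.uniform(10, 100, 200)
toy = pd.DataFrame({
    "Product Price": price,
    "Order Item Product Price": price,
    "Order Item Quantity": rng.integers(1, 6, 200).astype(float),
})
pairs = highly_correlated_pairs(toy)
```

From each reported pair, one column is kept and the other dropped; which one to keep is a judgment call the helper does not make.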

Feature Engineering — Where the Real Signal Was Created

Six new features were engineered from existing columns. These turned out to be some of the most important features in the final model.

1. Shipping Delay Gap

shipping_delay_gap = Days_for_shipping_real - Days_for_shipment_scheduled

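Measures how many days actual shipping exceeded the scheduled commitment. This single feature ended up with an importance score of 0.7938, roughly 79% of the tuned model's total feature importance. Because Days_for_shipping_real is only known once an order has shipped, the feature can be computed only when actual shipping days are available at scoring time.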

2. Shipping Pressure Index

shipping_pressure_index = Days_for_shipment_scheduled / (Order_Item_Quantity + 1)

Captures the relationship between delivery commitment and order complexity.

3. Is Express Flag

is_express = 1 if Shipping_Mode in ['First Class', 'Same Day'] else 0

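Binary flag marking the two expedited shipping modes. In the EDA table above, only First Class is clearly high-risk (95.3%); Same Day (45.7%) is below the overall late rate, and Second Class (76.6%) is not covered by this flag.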

4. High Discount Flag

high_discount_flag = 1 if Order_Item_Discount_Rate > 0.06 else 0

Flags orders with above-median discount rates.

5. Order Complexity Score

order_complexity_score = Order_Item_Quantity × Order_Item_Product_Price

Measures the financial complexity of the order.

6. Regional Congestion Score

region_congestion_score = average_late_delivery_rate_per_region

Encodes the historically observed delay rate per region as a continuous risk signal — ranging from 0.488 (Canada) to 0.580 (Central Africa).
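The six features can be expressed in a few lines of pandas. This is a sketch, not the project's code: the underscored column names follow the formulas above, `Order_Region` is an assumed column name, and the helper names are hypothetical. Because the regional score is derived from the target, this version computes it from training rows only and maps it onto other rows; whether the project did the same is not stated.

```python
import pandas as pd


def add_arithmetic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Features 1-5: row-wise arithmetic, nothing is fitted."""
    out = df.copy()
    out["shipping_delay_gap"] = out["Days_for_shipping_real"] - out["Days_for_shipment_scheduled"]
    out["shipping_pressure_index"] = out["Days_for_shipment_scheduled"] / (out["Order_Item_Quantity"] + 1)
    out["is_express"] = out["Shipping_Mode"].isin(["First Class", "Same Day"]).astype(int)
    out["high_discount_flag"] = (out["Order_Item_Discount_Rate"] > 0.06).astype(int)
    out["order_complexity_score"] = out["Order_Item_Quantity"] * out["Order_Item_Product_Price"]
    return out


def fit_region_rates(train: pd.DataFrame) -> pd.Series:
    """Feature 6, fit step: late rate per region, from training rows only."""
    return train.groupby("Order_Region")["Late_delivery_risk"].mean()


def add_region_congestion(df: pd.DataFrame, rates: pd.Series, default: float) -> pd.DataFrame:
    """Feature 6, transform step: unseen regions fall back to a default rate."""
    out = df.copy()
    out["region_congestion_score"] = out["Order_Region"].map(rates).fillna(default)
    return out


# Hypothetical toy rows
train = pd.DataFrame({
    "Days_for_shipping_real": [4, 2, 3, 5],
    "Days_for_shipment_scheduled": [2, 2, 4, 4],
    "Order_Item_Quantity": [1, 3, 1, 2],
    "Shipping_Mode": ["First Class", "Standard Class", "Same Day", "Second Class"],
    "Order_Item_Discount_Rate": [0.10, 0.02, 0.06, 0.15],
    "Order_Item_Product_Price": [50.0, 20.0, 10.0, 100.0],
    "Order_Region": ["Canada", "Canada", "Central Africa", "Central Africa"],
    "Late_delivery_risk": [0, 1, 1, 1],
})
rates = fit_region_rates(train)
train_fe = add_region_congestion(add_arithmetic_features(train), rates,
                                 default=train["Late_delivery_risk"].mean())
new_order = train.iloc[[0]].assign(Order_Region="Oceania")
new_fe = add_region_congestion(add_arithmetic_features(new_order), rates,
                               default=train["Late_delivery_risk"].mean())
```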

The Anti-Leakage Preprocessing Pipeline

Preventing data leakage was not just about dropping columns. The entire preprocessing pipeline was structured to ensure no information from the test set contaminated the training process.

Load cleaned_data.csv
→ Feature Engineering (pure arithmetic — no fitting required)
→ Separate X and y
→ Train/Test Split (80/20, stratified) ← split happens HERE
→ Fit StandardScaler on X_train only
→ Transform X_train and X_test separately
→ Fit LabelEncoders on X_train only
→ Transform X_train and X_test separately
→ Save scaler.pkl and encoders.pkl
→ Train models on X_train only
→ Evaluate on X_test only

Why this matters for production:

The scaler saved to scaler.pkl carries the exact mean and standard deviation computed on X_train. When the Streamlit app receives a new order, it applies this saved scaler — not a newly fitted one. This guarantees that scaling is identical between training and inference, preventing silent prediction errors.

Split results:

Training set: 144,413 rows
Test set: 36,104 rows
Stratification maintained the 55/45 class ratio in both sets
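The split-then-fit order described in this section can be sketched as follows. The toy frame, column lists, and temporary directory are assumptions for illustration; the file names scaler.pkl and encoders.pkl follow the pipeline above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy frame standing in for the engineered dataset
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "shipping_delay_gap": rng.integers(-2, 4, n).astype(float),
    "Days_for_shipment_scheduled": rng.integers(0, 5, n).astype(float),
    "Shipping_Mode": rng.choice(
        ["First Class", "Second Class", "Same Day", "Standard Class"], n),
})
df["Late_delivery_risk"] = (df["shipping_delay_gap"] > 0).astype(int)

X = df.drop(columns="Late_delivery_risk")
y = df["Late_delivery_risk"]

# Split first, so nothing fitted below ever sees test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

num_cols = ["shipping_delay_gap", "Days_for_shipment_scheduled"]
cat_cols = ["Shipping_Mode"]

# Scaler statistics come from the training rows only
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# One LabelEncoder per categorical column, also fitted on training rows only
encoders = {}
for col in cat_cols:
    enc = LabelEncoder().fit(X_train[col])
    X_train[col] = enc.transform(X_train[col])
    X_test[col] = enc.transform(X_test[col])  # raises on categories unseen in training
    encoders[col] = enc

# Persist the fitted objects so inference reuses the exact same statistics
artifact_dir = tempfile.mkdtemp()
joblib.dump(scaler, os.path.join(artifact_dir, "scaler.pkl"))
joblib.dump(encoders, os.path.join(artifact_dir, "encoders.pkl"))
loaded_scaler = joblib.load(os.path.join(artifact_dir, "scaler.pkl"))
```

One caveat of per-column LabelEncoder is that a category absent from the training split raises an error at transform time; scikit-learn's OrdinalEncoder with handle_unknown="use_encoded_value" is one alternative when that can happen.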

Model Development — Dictionary Loop Approach

Three models were defined in a dictionary and trained in a loop, a pattern that avoids repetitive code and makes comparison straightforward.

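from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# XGBoost has no class_weight option; scale_pos_weight is set here to
# negatives / positives (assumed definition of `ratio`)
ratio = (y_train == 0).sum() / (y_train == 1).sum()

models = {
    "Logistic Regression": LogisticRegression(
        class_weight='balanced', max_iter=1000, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        class_weight='balanced', n_estimators=100, random_state=42, n_jobs=-1
    ),
    "XGBoost": XGBClassifier(
        scale_pos_weight=ratio, n_estimators=100,
        random_state=42, eval_metric='logloss'
    )
}

results = {}
for name, model in models.items():
    # One cross_validate call scores both metrics on the same 5 folds (X_train only)
    cv = cross_validate(model, X_train, y_train, cv=5, scoring=['roc_auc', 'f1'])
    results[name] = {
        'cv_roc_auc': cv['test_roc_auc'].mean(),
        'cv_f1': cv['test_f1'].mean()
    }
    # Fit on the full training set so the model can be evaluated on X_test later
    model.fit(X_train, y_train)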

5-fold cross validation results (on X_train only):

Model	CV ROC-AUC	CV F1
Logistic Regression	0.9803 ± 0.0008	0.9749 ± 0.0019
Random Forest	0.9964 ± 0.0002	0.9786 ± 0.0004
XGBoost	0.9964 ± 0.0001	0.9792 ± 0.0005

Random Forest and XGBoost were essentially tied at baseline. XGBoost was selected for hyperparameter tuning due to its slightly higher CV F1, slightly lower ROC-AUC variance, and faster inference time.

Hyperparameter Tuning — RandomizedSearchCV

GridSearchCV was ruled out immediately. With 180,000+ rows and the grid below (3 × 4 × 4 × 4 × 4 × 3 = 2,304 combinations, or 11,520 fits under 5-fold CV), exhaustive search would have been computationally prohibitive. RandomizedSearchCV with 30 iterations and 5-fold CV (150 fits) was used instead, sampling the parameter space efficiently.

param_grid = {
    'n_estimators':     [100, 200, 300],
    'max_depth':        [3, 4, 5, 6],
    'learning_rate':    [0.01, 0.05, 0.1, 0.2],
    'subsample':        [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5]
}

Best parameters found:

n_estimators: 200
max_depth: 6
learning_rate: 0.2
subsample: 0.9
colsample_bytree: 1.0
min_child_weight: 1

Best CV ROC-AUC: 0.9967
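The cost difference between the two search strategies can be checked directly from the grid. The sketch below counts fits with scikit-learn's ParameterGrid and ParameterSampler; the random_state value is an assumption, and the same dictionary would be passed as param_distributions to RandomizedSearchCV.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    'n_estimators':     [100, 200, 300],
    'max_depth':        [3, 4, 5, 6],
    'learning_rate':    [0.01, 0.05, 0.1, 0.2],
    'subsample':        [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5],
}

cv_folds = 5

# Exhaustive search: every combination, each fitted once per fold
n_grid = len(ParameterGrid(param_grid))
grid_fits = n_grid * cv_folds

# Randomized search: 30 sampled combinations, each fitted once per fold
sampled = list(ParameterSampler(param_grid, n_iter=30, random_state=42))
random_fits = len(sampled) * cv_folds
```

Under these settings the randomized search evaluates about 1.3% of the grid (30 of 2,304 combinations), excluding the final refit on the best parameters.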

Model Evaluation — Final Results

Final model comparison (on X_test):

Model	Accuracy	Precision	Recall	F1 Score	ROC-AUC
Logistic Regression	0.9740	0.9589	0.9952	0.9767	0.9806
Random Forest	0.9773	0.9604	0.9998	0.9797	0.9970
XGBoost Baseline	0.9781	0.9633	0.9980	0.9804	0.9969
XGBoost Tuned	0.9787	0.9644	0.9980	0.9809	0.9972

Why XGBoost Tuned was selected as the best model:

Highest ROC-AUC: 0.9972
Highest Precision: 0.9644 — fewest false alarms
Lowest false positives: 730 (vs 845 for Logistic Regression)
Equal Recall to baseline XGBoost: 0.9980, catching 99.8% of all true late deliveries (Random Forest is slightly higher at 0.9998)
CV ROC-AUC (0.9967) and test ROC-AUC (0.9972) are consistent, so there is no sign of overfitting

Confusion matrix analysis:

Model	False Positives	False Negatives
Logistic Regression	845	95
Random Forest	817	4
XGBoost Baseline	752	39
XGBoost Tuned	730	39

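In an operations context, false positives (flagging an on-time order as high risk) waste intervention resources. False negatives (missing a truly late order) lead to unmitigated delays. XGBoost Tuned has the fewest false positives (730). Random Forest has the fewest false negatives (4 vs 39), so choosing XGBoost Tuned trades 35 more missed late orders for 87 fewer false alarms.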
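Which model is preferable depends on how costly a missed late order is relative to a false alarm. The sketch below uses the error counts from the table above with hypothetical per-error costs; the cost values are assumptions, not figures from the project.

```python
# False positive / false negative counts from the confusion matrix table
errors = {
    "Logistic Regression": {"fp": 845, "fn": 95},
    "Random Forest": {"fp": 817, "fn": 4},
    "XGBoost Baseline": {"fp": 752, "fn": 39},
    "XGBoost Tuned": {"fp": 730, "fn": 39},
}


def cheapest_model(cost_fp: float, cost_fn: float):
    """Return the model with the lowest total error cost, plus all totals."""
    totals = {
        name: e["fp"] * cost_fp + e["fn"] * cost_fn
        for name, e in errors.items()
    }
    return min(totals, key=totals.get), totals


# Hypothetical cost ratios: equal costs, then a miss costing 5x a false alarm
equal_cost_winner, equal_totals = cheapest_model(1.0, 1.0)
fn_heavy_winner, fn_heavy_totals = cheapest_model(1.0, 5.0)

# Cost ratio at which Random Forest and XGBoost Tuned break even
break_even = (817 - 730) / (39 - 4)
```

With equal costs XGBoost Tuned has the lowest total; once a missed late order costs more than about 2.5 times a false alarm, Random Forest comes out ahead on these counts.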

Feature Importance — What Actually Drives Late Deliveries

Top 10 global risk drivers (XGBoost Tuned):

Rank	Feature	Importance
1	Shipping Delay Gap	0.7938
2	Payment Type	0.1671
3	Scheduled Shipping Days	0.0023
4	Customer Country	0.0020
5	Order Country	0.0019
6	Market	0.0019
7	Regional Congestion Score	0.0019
8	Order State	0.0019
9	Customer City	0.0019
10	Order City	0.0018

shipping_delay_gap accounts for 79.38% of the tuned model's total feature importance. This engineered feature, the difference between actual and scheduled shipping days, is overwhelmingly the strongest signal the model uses.

The second most important feature is Payment Type at 16.71%. This was unexpected. Transfer payments show notably lower late delivery rates (48.5%) compared to other payment types (56.6%–57.5%). The mechanism behind this relationship warrants further investigation.

All other features combined account for less than 4% of importance, which is consistent with the delay gap being the dominant driver. Feature importance alone does not establish a causal root cause.

Risk Scoring

Each order received a Late Delivery Probability Score (0–1) and a Risk Category:

Low Risk: probability < 0.40
Medium Risk: 0.40 ≤ probability < 0.70
High Risk: probability ≥ 0.70
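These thresholds map onto a small vectorised helper. The function name `risk_category` is a hypothetical choice; the cut-offs are the ones listed above.

```python
import numpy as np


def risk_category(prob):
    """Map late-delivery probabilities to Low / Medium / High Risk labels."""
    prob = np.asarray(prob, dtype=float)
    # Conditions are checked in order, so >= 0.70 wins before >= 0.40
    return np.select(
        [prob >= 0.70, prob >= 0.40],
        ["High Risk", "Medium Risk"],
        default="Low Risk",
    )


labels = risk_category([0.05, 0.40, 0.69, 0.70, 0.99])
```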

Risk distribution across 36,104 test orders:

Risk Category	Count	Percentage
High Risk	19,977	55.33%
Medium Risk	589	1.63%
Low Risk	15,538	43.04%

The bimodal probability distribution — with most orders near 0.0 or 1.0 — reflects the model's high confidence. The dominant shipping_delay_gap feature provides such strong signal that the model is rarely uncertain about an order's risk classification.

The Streamlit Application

A four-module Streamlit dashboard was built for supply chain operations teams:

Home — Project overview, professional disclaimer, methodology summary, usage guide. The app explicitly states it is designed for supply chain managers, logistics analysts, and operations teams — not end consumers. The reason: the inputs required (scheduled shipping days, actual shipping days, profit ratios, financial metrics) are only available in internal order management systems.

Risk Predictor — Operations teams enter order details. The app automatically engineers all 6 derived features, applies the saved scaler and encoders, and outputs a probability score, risk category, top risk drivers, and recommended action.

Risk Dashboard — Portfolio-level view of risk distribution, probability histogram, and feature importance chart.

Operations Action Panel — Filterable table of high-risk orders with adjustable threshold slider and CSV export.

Key Takeaways

1. Leakage prevention is non-negotiable.
Delivery Status had a Cramér's V of 1.00 with the target. Including it would have given a perfect model on paper and a useless model in production. Always ask: would this feature exist at prediction time?

2. Feature engineering made the biggest difference.
shipping_delay_gap, a single engineered feature, accounts for 79% of the model's total feature importance. No raw feature came close. In this project, feature engineering mattered far more than tuning: hyperparameter search moved XGBoost's test ROC-AUC only from 0.9969 to 0.9972.

3. Class balance should be checked before reaching for SMOTE.
The target was 55/45, nearly balanced. SMOTE was unnecessary. Class weighting (class_weight='balanced', or scale_pos_weight for XGBoost) was simpler, faster, and sufficient here.

4. RandomizedSearchCV over GridSearchCV at scale.
With 180,000+ rows, GridSearch would have been impractical. RandomizedSearch with 30 iterations delivered strong results efficiently.

5. The most counterintuitive finding was the most actionable.
First Class shipping having a 95.3% late delivery rate is not a modeling artifact — it is a real operational failure that APL Logistics can act on directly, independent of any ML system.

Technical Stack

Python · Pandas · NumPy · Scikit-learn · XGBoost · Matplotlib · Seaborn · Plotly · Streamlit · Joblib · Jupyter Notebooks

This project was completed as part of the Data Science internship program at Unified Mentor Private Limited, in collaboration with APL Logistics (KWE Group).

Predictive Forecasting of Care Load & Placement Demand: What a 66% Structural Break Taught Me About Machine Learning

Sugnik Mondal — Sun, 08 Mar 2026 11:18:58 +0000

HHS Unaccompanied Alien Children Program — Data Science Internship Project

Sugnik Mondal · Unified Mentor Data Science Intern · March 2026

TL;DR: I built a forecasting system for the HHS UAC Program using 720 real records. Nine models were tested. Eight failed or underperformed. One won — but not because of model complexity. The reason every sophisticated model failed, and what fixed it, is the actual story here.

Abstract

The U.S. Department of Health & Human Services (HHS) Unaccompanied Alien Children (UAC) Program manages the care, custody, and sponsor placement of migrant children arriving at the U.S. border. Daily care load fluctuated between 1,972 and 11,516 children during the study period — a 5.8× range that makes capacity planning extremely difficult without reliable forecasts.

This paper presents a complete ML forecasting system built on 720 real operational records spanning January 2023 to December 2025. The central finding is that a January 2025 structural break — a permanent 66% drop in care load — caused every full-window model to fail catastrophically. The solution was simple in concept but required correctly diagnosing the problem first: recent-window retraining.

Final results:

Care Load Model: XGBoost MAE 5.48 · MAPE 0.23% · 9.6% better than naïve baseline
Discharge Model: XGBoost MAE 0.63 children/day
Dashboard: 6-page Streamlit app with zero-CSV-dependency prediction interface

1. The Problem

The HHS UAC Program needs to know, at minimum one day in advance:

How many children will be in HHS care tomorrow? (staffing, beds, resources)
How many children will be discharged tomorrow? (sponsor outreach, placement capacity)
Is a surge coming? (early warning, proactive capacity scaling)

Without forecasts, every decision is reactive. Surges cause acute crises. Troughs cause costly over-provisioning. The program needed a tool.

2. The Data

Source: HHS UAC Program public operational records

Property	Value
Raw records	720 observations
Date range	Jan 2023 – Dec 2025
After preprocessing	1,075 rows
Missing dates filled	355 (weekends, via linear interpolation)
Target 1	`hhs_care` — children in HHS care daily
Target 2	`hhs_discharged` — daily discharges
lag-1 autocorrelation	0.99

That lag-1 autocorrelation of 0.99 is important. It means yesterday's care load is an almost perfect predictor of today's. It immediately told me that the naïve baseline — predict tomorrow = today — was going to be very hard to beat.

Naïve Persistence MAE: 6.06. That's the bar everything had to clear.
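The gap filling, the lag-1 autocorrelation, and the naïve baseline can all be reproduced in a few lines of pandas. The sketch below runs on a made-up business-day series; the helper names and the data are assumptions for illustration.

```python
import pandas as pd


def fill_daily(series: pd.Series) -> pd.Series:
    """Reindex to a full daily calendar and linearly interpolate the gaps."""
    full_index = pd.date_range(series.index.min(), series.index.max(), freq="D")
    return series.reindex(full_index).interpolate(method="linear")


def naive_persistence_mae(series: pd.Series) -> float:
    """MAE of predicting each day with the previous day's value."""
    return float((series - series.shift(1)).abs().dropna().mean())


# Hypothetical toy series: business days only, so the weekend is missing
idx = pd.bdate_range("2025-01-06", periods=10)
care = pd.Series([100, 102, 104, 106, 108, 114, 116, 118, 120, 122],
                 index=idx, dtype=float)

daily = fill_daily(care)                 # weekend values are interpolated
lag1_autocorr = daily.autocorr(lag=1)    # correlation with the previous day
naive_mae = naive_persistence_mae(daily)
```

On a series this smooth the naïve error is just the typical day-to-day change, which is why a lag-1 autocorrelation near 1 makes the baseline so hard to beat.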

3. The Structural Break — The Most Important Discovery

Before touching a single model, the EDA revealed something critical.

In January 2025, HHS care load dropped from approximately 6,500 children to approximately 2,200 children in under two weeks. That's a 66% reduction. And it never recovered — the low level persisted through the end of the dataset.

A structural break is a permanent, abrupt change in the statistical properties of a time series. Unlike a trend or seasonal pattern, it cannot be modelled away — the series before and after the break are effectively two different processes.

Here's why this matters for every model you try to build:

Full dataset training mean: ~6,061 children
Test set mean (post-break): ~2,300 children
Gap: ~3,761 children

Any model trained on the full dataset learns patterns centred around 6,061. When it predicts on data centred around 2,300, it's off by thousands. That's not a model quality problem. That's a data regime problem.
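The regime mismatch can be measured before any model is trained by comparing the training-window mean with the test-window mean. The sketch below uses a synthetic level-shift series; the helper name, dates, and levels are assumptions, not the HHS data.

```python
import numpy as np
import pandas as pd


def train_test_mean_gap(series, test_start, train_start=None):
    """Training mean minus test mean for a given training window."""
    train = series[series.index < test_start]
    if train_start is not None:
        train = train[train.index >= train_start]
    test = series[series.index >= test_start]
    return float(train.mean() - test.mean())


# Synthetic series: level 6000 for 380 days, then a permanent drop to 2000
idx = pd.date_range("2024-01-01", periods=500, freq="D")
care = pd.Series(np.where(np.arange(500) < 380, 6000.0, 2000.0), index=idx)

test_start = idx[450]
full_gap = train_test_mean_gap(care, test_start)
recent_gap = train_test_mean_gap(care, test_start, train_start=idx[380])
```

A gap that is large relative to the test-period level is a signal to reconsider the training window before comparing model families.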

4. The Escalation Story — Nine Models, Eight Failures

Phase 1: Baseline (Floor Setting)

Model	MAE
Naïve Persistence	6.06 ← the bar
Moving Average (w=3)	9.76

Moving Average was actually worse than naïve because the slight upward trend in the test period caused systematic under-prediction — the rolling average always lags behind a rising series.
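The lag effect is easy to verify on a series that rises by exactly one unit per step: a 3-point moving average of the previous three values always sits two units below the next value, while the naïve forecast sits one unit below. This is a toy illustration, not the project's data.

```python
import numpy as np

y = np.arange(100.0, 120.0)  # rises by exactly 1 per step

actual = y[3:]
naive_pred = y[2:-1]                             # previous value
ma3_pred = (y[:-3] + y[1:-2] + y[2:-1]) / 3      # mean of the previous three values

naive_mae = float(np.mean(np.abs(actual - naive_pred)))
ma3_mae = float(np.mean(np.abs(actual - ma3_pred)))
ma3_bias = float(np.mean(actual - ma3_pred))     # positive means under-prediction
```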

Phase 2: Statistical Models — All Failed

Model	MAE	Why
Exponential Smoothing	86.69	Anchored to pre-break mean ~6,000
ARIMA(3,1,3)	144.35	Mean-reverting behaviour pulled forecasts too high
SARIMA	433.17	Seasonal components amplified the regime-change error

SARIMA was the worst performing model overall. Adding more structure made the problem worse. The seasonal terms were learning patterns from the pre-break period that had no relevance to the post-break test data.

This was not a failure of ARIMA or SARIMA as methods. It was a failure to check whether their core assumptions were met before applying them. Both assume that the series is stationary after differencing and that its parameters stay constant over the sample. A permanent 66% level shift violates that assumption completely.

Phase 3: Full-Window ML — Also Failed

Model	MAE
Linear Regression (full)	23.38
Random Forest (full)	25.41
XGBoost (full)	40.66

ML models were better than statistical models — but XGBoost performed 6.7× worse than naïve. The boosting process over-fitted to the high-variance pre-break period. The root cause was identical: wrong training distribution.

Phase 4: Recent-Window ML — The Solution ✅

The fix: retrain all models using only data from June 2024 onwards.

Model	MAE	vs Naïve
XGBoost (Recent)	5.48	✅ –9.6%
Random Forest (Recent)	6.54	❌ +7.9%
Linear Regression (Recent)	7.48	❌ +23.4%

By using only the recent window:

Training mean: ~2,800
Test mean: ~2,300
Gap: ~500 (vs 3,761 with full window)

XGBoost achieved MAE 5.48, RMSE 7.12, MAPE 0.23%. That's 99.77% forecast accuracy.
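Treating the training window as something to search over can be sketched as a loop over candidate start dates. Assumptions in this sketch: scikit-learn's GradientBoostingRegressor stands in for XGBoost, only two lag features are used, and the series is synthetic, so the resulting numbers say nothing about the HHS data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error


def make_supervised(series: pd.Series) -> pd.DataFrame:
    """Target plus two lag features; lags are built before any window filter."""
    df = pd.DataFrame({"y": series})
    df["lag_1"] = df["y"].shift(1)
    df["lag_2"] = df["y"].shift(2)
    return df.dropna()


def mae_for_window(series, test_start, train_start=None):
    """Train on [train_start, test_start) and report test MAE and train size."""
    data = make_supervised(series)
    train = data[data.index < test_start]
    if train_start is not None:
        train = train[train.index >= train_start]
    test = data[data.index >= test_start]
    model = GradientBoostingRegressor(random_state=42)
    model.fit(train[["lag_1", "lag_2"]], train["y"])
    preds = model.predict(test[["lag_1", "lag_2"]])
    return float(mean_absolute_error(test["y"], preds)), len(train)


# Synthetic series with a permanent level shift at day 380
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=500, freq="D")
level = np.where(np.arange(500) < 380, 6000.0, 2000.0)
series = pd.Series(level + rng.normal(0, 20, 500), index=idx)

test_start = idx[450]
windows = {"full": None, "recent": idx[380]}  # candidate training start dates
results = {name: mae_for_window(series, test_start, start)
           for name, start in windows.items()}
```

Each candidate start date is scored with the same metric on the same test rows, so the window can be compared like any other hyperparameter.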

5. The Complete Leaderboard

Model	MAE ↓	RMSE ↓	MAPE ↓
🏆 XGBoost (Recent)	5.48	7.12	0.23%
Naïve Persistence	6.06	7.24	0.27%
Random Forest (Recent)	6.54	8.44	0.28%
Linear Regression (Recent)	7.48	8.86	0.31%
Moving Average (w=3)	9.76	11.77	0.43%
Ridge Regression (Recent)	17.80	22.80	0.74%
Exponential Smoothing	86.69	97.40	3.74%
ARIMA(3,1,3)	144.35	161.63	6.20%
SARIMA	433.17	501.04	18.53%

Only one model out of nine beat the naïve baseline. That model was trained on roughly 30% of the available data.
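For reference, the three leaderboard metrics can be computed as below. This is a generic sketch with made-up numbers; the function name `forecast_metrics` is hypothetical.

```python
import numpy as np


def forecast_metrics(actual, predicted):
    """MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        # MAPE is undefined when an actual value is zero
        "mape_pct": float(np.mean(np.abs(err / actual)) * 100),
    }


m = forecast_metrics([2000, 2500, 2000], [2010, 2490, 1990])
```

As a rough sanity check on the leaderboard, an MAE of 5.48 against a test mean of about 2,300 is roughly 0.24%, in line with the reported MAPE of 0.23%.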

6. Feature Engineering & What the Model Actually Learned

30+ features were engineered from the five raw columns. The top features by XGBoost importance:

Feature	Importance	What it captures
`hhs_care_roll_min_30`	0.541	30-day rolling minimum — the post-break floor
`hhs_care_lag_2`	0.159	2-day autoregressive signal
`hhs_care_lag_1`	0.150	Yesterday's value
`cbp_transferred`	0.122	Today's pipeline transfers — leading indicator

The dominance of hhs_care_roll_min_30 (0.541 — over half the total importance) is revealing. The model's primary mechanism is recognising which regime it's in by checking the 30-day floor. The top four features account for 97.2% of total importance.
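The top features in the table can be built with shift and rolling operations. This sketch assumes that each row predicts that day's hhs_care, that lag_1 is the previous day's value as described above, and that the rolling minimum is computed over the 30 days before the target day; the exact alignment used in the project is not stated.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Lag and rolling features that only use values before the target day."""
    out = pd.DataFrame(index=df.index)
    out["hhs_care_lag_1"] = df["hhs_care"].shift(1)
    out["hhs_care_lag_2"] = df["hhs_care"].shift(2)
    # shift(1) keeps the target day's own value out of its rolling window
    out["hhs_care_roll_min_30"] = df["hhs_care"].shift(1).rolling(30).min()
    out["cbp_transferred"] = df["cbp_transferred"]
    out["target"] = df["hhs_care"]
    # Rows without a full 30-day history are dropped
    return out.dropna()


# Hypothetical toy data: 40 days of steadily rising care load
toy = pd.DataFrame(
    {"hhs_care": np.arange(100.0, 140.0), "cbp_transferred": np.full(40, 5.0)},
    index=pd.date_range("2025-06-01", periods=40, freq="D"),
)
features = build_features(toy)
```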

7. The Discharge Model

Discharge demand required separate treatment. The discharge structural break was even more severe:

Full-window training mean: 173 children/day
Post-break test mean: ~9 children/day
Reduction: 94.8%

A June 2024 cutoff still left a massive gap. A March 2025 cutoff reduced the training-test mean gap to 3.67. XGBoost achieved MAE 0.63 children/day — less than one child per day in prediction error.

(Ridge Regression achieved MAE 0.03 — excluded as overfitting. A result that perfect on a small training window is a red flag, not a win.)

8. The Streamlit Dashboard

A 6-page dashboard operationalises both models with a key design decision: zero CSV dependency for predictions.

Here's the reasoning: the training data ends December 2025. If a programme administrator uses this app in June 2026, lag values pulled from the historical CSV would be 6 months stale — completely wrong inputs for the model.

The solution: users enter only what they naturally know from their daily report:

Last 14 days of care load (from their records)
Today's CBP transfers, HHS discharges, CBP apprehensions

That's 17 numbers. The app computes all 30+ model features automatically — rolling means, standard deviations, min/max, net flow, calendar features — purely from those 17 inputs. Works for any future date, any year.
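Deriving features from the 17 typed-in numbers could look like the sketch below. Every feature name other than hhs_care_lag_1, hhs_care_lag_2, and cbp_transferred is hypothetical, net flow is assumed to mean transfers minus discharges, and the windows are capped at the 14 supplied days. How the app handles the 30-day rolling minimum with only 14 days of input is not described in the article, so this sketch does not attempt it.

```python
from datetime import date

import numpy as np


def features_from_inputs(last_14_days, cbp_transferred, hhs_discharged,
                         cbp_apprehended, today: date) -> dict:
    """Build a feature dict from 14 care-load values plus three daily counts."""
    if len(last_14_days) != 14:
        raise ValueError("expected exactly 14 daily care-load values, oldest first")
    care = np.asarray(last_14_days, dtype=float)
    return {
        "hhs_care_lag_1": float(care[-1]),
        "hhs_care_lag_2": float(care[-2]),
        "hhs_care_roll_mean_7": float(care[-7:].mean()),
        "hhs_care_roll_std_7": float(care[-7:].std(ddof=1)),
        "hhs_care_roll_min_14": float(care.min()),
        "hhs_care_roll_max_14": float(care.max()),
        "cbp_transferred": float(cbp_transferred),
        "cbp_apprehended": float(cbp_apprehended),
        "net_flow": float(cbp_transferred - hhs_discharged),  # assumed definition
        "day_of_week": today.weekday(),
        "month": today.month,
    }


# Hypothetical inputs a user might type in
feats = features_from_inputs(
    last_14_days=list(range(2000, 2014)),
    cbp_transferred=12, hhs_discharged=9, cbp_apprehended=30,
    today=date(2026, 6, 1),
)
```

Because nothing is read from a stored file, the same function works for any date the user supplies.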

Dashboard pages:

Overview — KPI cards, historical trend, intake/discharge balance, leaderboard
Care Load Forecast — 14-day input grid, next-day prediction, alert level, scenario comparison
Discharge Forecast — Same zero-CSV interface, weekly/monthly capacity estimates
Early Warning System — Alert zones, 90-day history, 5 project KPIs
Model Performance — Full escalation story, feature importance, all notebook figures
About & Dataset — Problem statement, dataset details, tech stack

9. Key Takeaways

For data scientists:

1. Diagnose before modelling. Before choosing a model, check stationarity, look for structural breaks, and verify that the training distribution matches the test distribution. This project would have ended at ARIMA if I hadn't investigated why it failed.

2. Training window is a hyperparameter. The right window selection here improved XGBoost MAE from 40.66 to 5.48 — a 7.4× improvement. No hyperparameter tuning of the model itself could have achieved that.

3. More data is not always better. The winning model used 30% of available data. The rest was actively harmful.

4. Validate feature importance. The dominance of hhs_care_roll_min_30 revealed that the model was primarily doing regime detection, not pattern forecasting. That insight validates the approach and suggests the right questions to ask if the regime changes again.

For the project evaluator:

"Training window selection is as important as model selection in the presence of structural breaks."

This is the central research finding. It is not a statement about this dataset specifically — it is a general principle applicable to any forecasting domain where abrupt regime changes are possible.

Tech Stack

Python 3.x     pandas  numpy  matplotlib  seaborn
XGBoost        scikit-learn  statsmodels  joblib
Streamlit      Jupyter Notebooks

Project structure:

uac-forecasting/
├── notebooks/   01_EDA → 07_Model_Evaluation
├── models/      best_model_recent.joblib + configs
├── data/        raw + processed
├── reports/     figures from all notebooks
└── src/         app1.py (Streamlit dashboard)

Built as part of the Unified Mentor Data Science Internship · March 2026

· GitHub: https://github.com/Sugnik27/uac-forecasting?tab=readme-ov-file
· Live App: https://uac-forecasting.streamlit.app/

· An executive summary prepared for non-technical HHS stakeholders is available here: https://drive.google.com/drive/folders/1di-SvV6YidjTOGIvU8sLPXdahH1qhgIa?usp=sharing