<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sugnik Mondal</title>
    <description>The latest articles on DEV Community by Sugnik Mondal (@sugnikm).</description>
    <link>https://dev.to/sugnikm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812804%2Fa5339335-6ade-4a65-8e6d-877cfde3cb35.png</url>
      <title>DEV Community: Sugnik Mondal</title>
      <link>https://dev.to/sugnikm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sugnikm"/>
    <language>en</language>
    <item>
      <title>How I Built a Late Delivery Risk Predictor for APL Logistics: What a 95% Delay Rate in First Class Shipping Taught Me About Supply Chain ML</title>
      <dc:creator>Sugnik Mondal</dc:creator>
      <pubDate>Sun, 24 May 2026 11:54:32 +0000</pubDate>
      <link>https://dev.to/sugnikm/how-i-built-a-late-delivery-risk-predictor-for-apl-logistics-what-a-95-delay-rate-in-first-class-1d11</link>
      <guid>https://dev.to/sugnikm/how-i-built-a-late-delivery-risk-predictor-for-apl-logistics-what-a-95-delay-rate-in-first-class-1d11</guid>
      <description>&lt;p&gt;Late deliveries are not just an inconvenience. For a global logistics operator like APL Logistics (KWE Group), a single delayed shipment can trigger SLA breaches, financial penalties, and long-term customer churn. Multiply that across hundreds of thousands of orders spanning five global markets, and the cost of reactive delay management becomes unsustainable.&lt;/p&gt;

&lt;p&gt;The conventional approach has always been to handle delays after they happen — emergency rerouting, last-minute escalations, and reactive customer communication. This project takes a different approach entirely. Instead of reacting, it predicts.&lt;/p&gt;

&lt;p&gt;This article walks through the end-to-end machine learning pipeline built to predict late delivery risk for APL Logistics — from raw data to a deployed Streamlit dashboard used by supply chain operations teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dataset
&lt;/h2&gt;

&lt;p&gt;The project uses the DataCo Smart Supply Chain dataset — a comprehensive real-world transactional dataset from APL Logistics' global operations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw dataset:&lt;/strong&gt; 180,519 rows × 40 columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After cleaning:&lt;/strong&gt; 180,517 rows × 28 columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target variable:&lt;/strong&gt; &lt;code&gt;Late_delivery_risk&lt;/code&gt; (1 = Late, 0 = Not Late)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Target distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Not Late (0)&lt;/td&gt;
&lt;td&gt;81,541&lt;/td&gt;
&lt;td&gt;45.17%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Late (1)&lt;/td&gt;
&lt;td&gt;98,976&lt;/td&gt;
&lt;td&gt;54.83%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The near-balanced target distribution was an important early finding: resampling with SMOTE was not required. Class weighting (&lt;code&gt;class_weight='balanced'&lt;/code&gt; in the scikit-learn models and &lt;code&gt;scale_pos_weight&lt;/code&gt; in XGBoost) was sufficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Cleaning — The Leakage Problem
&lt;/h2&gt;

&lt;p&gt;The most critical cleaning decisions were around &lt;strong&gt;data leakage&lt;/strong&gt; — columns that would not be available at the time of prediction (before dispatch) but that reveal the outcome after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leakage columns dropped:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Delivery Status&lt;/code&gt;: Cramér's V of 1.00 with the target, a perfect association. This column contains values like "Late delivery" and "Shipping on time", which is literally the answer. Using it would give near-perfect accuracy in training and a model that is unusable in production, where the value does not exist at prediction time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Order Status&lt;/code&gt; — Values like COMPLETE, CLOSED, CANCELED are assigned after the order is fulfilled. At prediction time (before dispatch), this information does not exist.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The simple test for leakage:&lt;/strong&gt; &lt;em&gt;"At the moment the prediction is needed, would this information be available?"&lt;/em&gt; If not — drop it.&lt;/p&gt;
&lt;/blockquote&gt;
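
&lt;p&gt;A leakage screen of this kind can be approximated in a few lines of plain Python. The sketch below computes Cramér's V (without bias correction) from a contingency table. The function name &lt;code&gt;cramers_v&lt;/code&gt; and the four toy rows are illustrative assumptions, not the project's actual code or data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of categorical labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row = Counter(xs)              # marginal counts of the first variable
    col = Counter(ys)              # marginal counts of the second variable
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0
    return sqrt(chi2 / (n * k))


# Toy data (hypothetical): the status column restates the label exactly.
status = ["Late delivery", "Shipping on time", "Late delivery", "Shipping on time"]
target = [1, 0, 1, 0]
leak_score = cramers_v(status, target)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A score at or near 1.00 against the target is a strong hint that a column restates the outcome and should be checked against the availability test above.&lt;/p&gt;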

&lt;p&gt;&lt;strong&gt;Other columns dropped:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII columns: Customer Fname, Customer Lname, Customer Street, Customer Zipcode&lt;/li&gt;
&lt;li&gt;ID columns: Category Id, Department Id, Customer Id, Order Customer Id&lt;/li&gt;
&lt;li&gt;Redundant location: Latitude, Longitude (redundant with Market and Order Region)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Missing values:&lt;/strong&gt; Only Customer Lname (8 rows) and Customer Zipcode (3 rows) had nulls. Both columns were dropped entirely, since they were already slated for removal as PII.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duplicates:&lt;/strong&gt; 2 duplicate rows removed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final cleaned dataset: 180,517 rows × 28 columns&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Exploratory Data Analysis — The Most Surprising Finding
&lt;/h2&gt;

&lt;p&gt;Before building any model, the data was explored thoroughly. The most counterintuitive finding came from shipping mode analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Late delivery rate by shipping mode:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Shipping Mode&lt;/th&gt;
&lt;th&gt;Late Delivery Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First Class&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second Class&lt;/td&gt;
&lt;td&gt;76.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same Day&lt;/td&gt;
&lt;td&gt;45.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard Class&lt;/td&gt;
&lt;td&gt;38.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;First Class shipping — which customers expect to be faster and more reliable — has a &lt;strong&gt;95.3% late delivery rate&lt;/strong&gt;. This is not a rounding error. Nearly every First Class order in the dataset arrived late. This suggests that First Class commitments are systematically over-promised relative to operational capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Late delivery rate by market:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All five global markets showed rates between 54.4% and 55.2% — an extremely narrow band. This finding is operationally significant: the delay problem is not geographically concentrated. It is systemic across all markets, meaning market-level interventions alone will not solve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shipping delay gap:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gap between actual shipping days and scheduled shipping days averaged &lt;strong&gt;+0.57 days&lt;/strong&gt; across all orders. 103,399 orders (57.3%) shipped later than scheduled. Most delays were by exactly one day — suggesting a consistent operational mismatch between scheduling and execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlation heatmap revealed multicollinearity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Benefit per order&lt;/code&gt; and &lt;code&gt;Order Profit Per Order&lt;/code&gt; → 1.00 correlation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Order Item Product Price&lt;/code&gt; and &lt;code&gt;Product Price&lt;/code&gt; → 1.00 correlation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Sales per customer&lt;/code&gt;, &lt;code&gt;Order Item Total&lt;/code&gt;, &lt;code&gt;Sales&lt;/code&gt; → 0.99 correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These redundant columns were dropped in preprocessing to prevent multicollinearity — particularly harmful for Logistic Regression.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Engineering — Where the Real Signal Was Created
&lt;/h2&gt;

&lt;p&gt;Six new features were engineered from existing columns. These turned out to be some of the most important features in the final model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Shipping Delay Gap&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;shipping_delay_gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Days_for_shipping_real&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Days_for_shipment_scheduled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Measures how many days actual shipping exceeded the scheduled commitment. This single feature ended up with an importance score of &lt;strong&gt;0.7938&lt;/strong&gt;, about 79% of the total feature importance in the final model. Because it is built from actual shipping days, it can only be computed once that value has been recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Shipping Pressure Index&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;shipping_pressure_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Days_for_shipment_scheduled&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Order_Item_Quantity&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Captures the relationship between delivery commitment and order complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Is Express Flag&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;is_express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;Shipping_Mode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;First Class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Same Day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binary flag marking the two expedited shipping modes. Of these, First Class was the highest-risk mode in the EDA (95.3% late), while Same Day sat at 45.7%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. High Discount Flag&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;high_discount_flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;Order_Item_Discount_Rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.06&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags orders with above-median discount rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Order Complexity Score&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;order_complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order_Item_Quantity&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;Order_Item_Product_Price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Approximates the order's monetary value (quantity times unit price) as a proxy for its financial complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Regional Congestion Score&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;region_congestion_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_late_delivery_rate_per_region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encodes the historically observed delay rate per region as a continuous risk signal — ranging from 0.488 (Canada) to 0.580 (Central Africa).&lt;/p&gt;
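
&lt;p&gt;A per-region rate is learned from the target, so if it is computed over the full dataset, test-set labels feed into a training feature. The sketch below shows one way to avoid that, assuming the rate is fitted on training rows only and unseen regions fall back to the overall training rate. The helper names and toy rows are hypothetical; the article does not show how the original score was computed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict


def fit_region_rates(regions, labels):
    """Late-delivery rate per region, learned from training rows only."""
    totals = defaultdict(int)
    lates = defaultdict(int)
    for region, label in zip(regions, labels):
        totals[region] += 1
        lates[region] += label
    rates = {region: lates[region] / totals[region] for region in totals}
    overall = sum(labels) / len(labels)   # fallback for unseen regions
    return rates, overall


def apply_region_rates(regions, rates, overall):
    """Map each region to its learned rate; unseen regions get the overall rate."""
    return [rates.get(region, overall) for region in regions]


# Toy rows (hypothetical values, not taken from the dataset).
train_regions = ["Canada", "Canada", "Central Africa", "Central Africa"]
train_late = [0, 1, 1, 1]
rates, overall = fit_region_rates(train_regions, train_late)

# The same mapping is reused for held-out rows without looking at their labels.
test_scores = apply_region_rates(["Canada", "Oceania"], rates, overall)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;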




&lt;h2&gt;
  
  
  The Anti-Leakage Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Preventing data leakage was not just about dropping columns. The entire preprocessing pipeline was structured to ensure no information from the test set contaminated the training process.&lt;/p&gt;

&lt;p&gt;Load cleaned_data.csv&lt;br&gt;
→ Feature Engineering (pure arithmetic — no fitting required)&lt;br&gt;
→ Separate X and y&lt;br&gt;
→ Train/Test Split (80/20, stratified) ← split happens HERE&lt;br&gt;
→ Fit StandardScaler on X_train only&lt;br&gt;
→ Transform X_train and X_test separately&lt;br&gt;
→ Fit LabelEncoders on X_train only&lt;br&gt;
→ Transform X_train and X_test separately&lt;br&gt;
→ Save scaler.pkl and encoders.pkl&lt;br&gt;
→ Train models on X_train only&lt;br&gt;
→ Evaluate on X_test only&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scaler saved to &lt;code&gt;scaler.pkl&lt;/code&gt; carries the exact mean and standard deviation computed on X_train. When the Streamlit app receives a new order, it applies this saved scaler — not a newly fitted one. This guarantees that scaling is identical between training and inference, preventing silent prediction errors.&lt;/p&gt;
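
&lt;p&gt;The fit-once, reuse-at-inference idea can be illustrated without scikit-learn. This sketch assumes a single numeric column and uses &lt;code&gt;pickle.dumps&lt;/code&gt; in memory as a stand-in for writing &lt;code&gt;scaler.pkl&lt;/code&gt;; the helper names and values are hypothetical. Like &lt;code&gt;StandardScaler&lt;/code&gt;, it uses the population standard deviation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pickle
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn mean and population standard deviation from training values only."""
    return {"mean": fmean(values), "std": pstdev(values) or 1.0}


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    return [(v - scaler["mean"]) / scaler["std"] for v in values]


train_days = [2.0, 4.0, 4.0, 6.0]      # hypothetical training column
scaler = fit_scaler(train_days)
blob = pickle.dumps(scaler)            # stands in for saving scaler.pkl

# At inference time, load the saved statistics instead of refitting.
loaded = pickle.loads(blob)
new_order_scaled = transform([5.0], loaded)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Refitting on incoming orders would shift the mean and standard deviation, so the same raw value would map to a different scaled input than the one the model was trained on.&lt;/p&gt;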

&lt;p&gt;&lt;strong&gt;Split results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training set: 144,413 rows&lt;/li&gt;
&lt;li&gt;Test set: 36,104 rows&lt;/li&gt;
&lt;li&gt;Stratification maintained the 55/45 class ratio in both sets&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Model Development — Dictionary Loop Approach
&lt;/h2&gt;

&lt;p&gt;Three models were defined in a dictionary and trained in a loop — a clean, professional pattern that avoids repetitive code and makes comparison straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logistic Regression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Random Forest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XGBoost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scale_pos_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logloss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cv_roc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;roc_auc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cv_f1&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cv_roc_auc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_roc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cv_f1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_f1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5-fold cross validation results (on X_train only):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;CV ROC-AUC&lt;/th&gt;
&lt;th&gt;CV F1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;0.9803 ± 0.0008&lt;/td&gt;
&lt;td&gt;0.9749 ± 0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;0.9964 ± 0.0002&lt;/td&gt;
&lt;td&gt;0.9786 ± 0.0004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;0.9964 ± 0.0001&lt;/td&gt;
&lt;td&gt;0.9792 ± 0.0005&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Random Forest and XGBoost were essentially tied at baseline. XGBoost was selected for hyperparameter tuning due to its slightly lower ROC-AUC variance across folds (±0.0001 vs ±0.0002) and faster inference time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hyperparameter Tuning — RandomizedSearchCV
&lt;/h2&gt;

&lt;p&gt;GridSearchCV was ruled out immediately. With 180,000+ rows and a large parameter space, exhaustive search would have been computationally prohibitive. RandomizedSearchCV with 30 iterations and 5-fold CV was used instead — sampling the parameter space efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subsample&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;colsample_bytree&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_child_weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
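
&lt;p&gt;The size of this search space can be checked directly from the grid. The sketch below counts the combinations an exhaustive search would have to fit and compares that with the 30 sampled candidates; the variable names are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from math import prod

# Same search space as the grid above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.7, 0.8, 0.9, 1.0],
    "min_child_weight": [1, 3, 5],
}

cv_folds = 5
n_combinations = prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * cv_folds     # exhaustive search
random_fits = 30 * cv_folds               # 30 sampled candidates
print(n_combinations, grid_fits, random_fits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 5-fold CV, the full grid works out to 2,304 combinations and 11,520 model fits, versus 150 fits for the randomized search.&lt;/p&gt;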



&lt;p&gt;&lt;strong&gt;Best parameters found:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n_estimators:     200&lt;br&gt;
max_depth:        6&lt;br&gt;
learning_rate:    0.2&lt;br&gt;
subsample:        0.9&lt;br&gt;
colsample_bytree: 1.0&lt;br&gt;
min_child_weight: 1&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best CV ROC-AUC: 0.9967&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Evaluation — Final Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Final model comparison (on X_test):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1 Score&lt;/th&gt;
&lt;th&gt;ROC-AUC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;0.9740&lt;/td&gt;
&lt;td&gt;0.9589&lt;/td&gt;
&lt;td&gt;0.9952&lt;/td&gt;
&lt;td&gt;0.9767&lt;/td&gt;
&lt;td&gt;0.9806&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;0.9773&lt;/td&gt;
&lt;td&gt;0.9604&lt;/td&gt;
&lt;td&gt;0.9998&lt;/td&gt;
&lt;td&gt;0.9797&lt;/td&gt;
&lt;td&gt;0.9970&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost Baseline&lt;/td&gt;
&lt;td&gt;0.9781&lt;/td&gt;
&lt;td&gt;0.9633&lt;/td&gt;
&lt;td&gt;0.9980&lt;/td&gt;
&lt;td&gt;0.9804&lt;/td&gt;
&lt;td&gt;0.9969&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XGBoost Tuned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9787&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9644&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9980&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9809&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9972&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why XGBoost Tuned was selected as the best model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highest ROC-AUC: 0.9972&lt;/li&gt;
&lt;li&gt;Highest Precision: 0.9644 — fewest false alarms&lt;/li&gt;
&lt;li&gt;Lowest false positives: 730 (vs 845 for Logistic Regression)&lt;/li&gt;
&lt;li&gt;Equal Recall to baseline XGBoost: 0.9980 — catches 99.8% of all true late deliveries&lt;/li&gt;
&lt;li&gt;CV ROC-AUC (0.9967) and test ROC-AUC (0.9972) are consistent, which suggests the model is not overfitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Confusion matrix analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;False Positives&lt;/th&gt;
&lt;th&gt;False Negatives&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;845&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;817&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost Baseline&lt;/td&gt;
&lt;td&gt;752&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost Tuned&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;730&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

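&lt;p&gt;In an operations context, false positives (flagging an on-time order as high risk) waste intervention resources. False negatives (missing a truly late order) lead to unmitigated delays. XGBoost Tuned produces the fewest false positives (730). Random Forest misses fewer late orders (4 false negatives vs 39) but raises more false alarms (817), so XGBoost Tuned offers the better balance between the two error types.&lt;/p&gt;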
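
&lt;p&gt;The headline metrics in the comparison table all derive from the four confusion-matrix cells. A minimal sketch of that arithmetic follows; the function name and the counts passed to it are made-up illustrations, not the project's results.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # share of flagged orders that were truly late
    recall = tp / (tp + fn)             # share of truly late orders that were flagged
    f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lowering false positives raises precision, while lowering false negatives raises recall, which is why the two columns of the confusion table map onto different metrics.&lt;/p&gt;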




&lt;h2&gt;
  
  
  Feature Importance — What Actually Drives Late Deliveries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Top 10 global risk drivers (XGBoost Tuned):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Importance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Shipping Delay Gap&lt;/td&gt;
&lt;td&gt;0.7938&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Payment Type&lt;/td&gt;
&lt;td&gt;0.1671&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Scheduled Shipping Days&lt;/td&gt;
&lt;td&gt;0.0023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Customer Country&lt;/td&gt;
&lt;td&gt;0.0020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Order Country&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Market&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Regional Congestion Score&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Order State&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Customer City&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Order City&lt;/td&gt;
&lt;td&gt;0.0018&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;shipping_delay_gap&lt;/code&gt; accounts for &lt;strong&gt;79.38%&lt;/strong&gt; of the model's total feature importance. This engineered feature, the difference between actual and scheduled shipping days, is overwhelmingly the primary driver of the model's late delivery predictions.&lt;/p&gt;

&lt;p&gt;The second most important feature is Payment Type at &lt;strong&gt;16.71%&lt;/strong&gt;. This was unexpected. Transfer payments show notably lower late delivery rates (48.5%) compared to other payment types (56.6%–57.5%). The mechanism behind this relationship warrants further investigation.&lt;/p&gt;

&lt;p&gt;All other features combined account for less than 4% of importance (1 - 0.7938 - 0.1671 = 0.0391), which marks the delay gap as the dominant signal in the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Risk Scoring
&lt;/h2&gt;

&lt;p&gt;Each order received a Late Delivery Probability Score (0–1) and a Risk Category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low Risk:&lt;/strong&gt; probability &amp;lt; 0.40&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium Risk:&lt;/strong&gt; 0.40 ≤ probability &amp;lt; 0.70&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Risk:&lt;/strong&gt; probability ≥ 0.70&lt;/li&gt;
&lt;/ul&gt;
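
&lt;p&gt;These thresholds translate into a small lookup. The sketch below is one possible implementation, using &lt;code&gt;bisect_right&lt;/code&gt; so that a probability exactly on a boundary falls into the higher band, matching the inequalities listed. The function and constant names are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bisect import bisect_right

THRESHOLDS = [0.40, 0.70]    # lower bounds of the Medium and High bands
CATEGORIES = ["Low Risk", "Medium Risk", "High Risk"]


def risk_category(probability):
    """Map a late-delivery probability in [0, 1] to its risk category."""
    return CATEGORIES[bisect_right(THRESHOLDS, probability)]


labels = [risk_category(p) for p in (0.05, 0.40, 0.69, 0.70, 0.99)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;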

&lt;p&gt;&lt;strong&gt;Risk distribution across 36,104 test orders:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High Risk&lt;/td&gt;
&lt;td&gt;19,977&lt;/td&gt;
&lt;td&gt;55.33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium Risk&lt;/td&gt;
&lt;td&gt;589&lt;/td&gt;
&lt;td&gt;1.63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low Risk&lt;/td&gt;
&lt;td&gt;15,538&lt;/td&gt;
&lt;td&gt;43.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bimodal probability distribution — with most orders near 0.0 or 1.0 — reflects the model's high confidence. The dominant &lt;code&gt;shipping_delay_gap&lt;/code&gt; feature provides such strong signal that the model is rarely uncertain about an order's risk classification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Streamlit Application
&lt;/h2&gt;

&lt;p&gt;A four-module Streamlit dashboard was built for supply chain operations teams:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Home&lt;/strong&gt; — Project overview, professional disclaimer, methodology summary, usage guide. The app explicitly states it is designed for supply chain managers, logistics analysts, and operations teams — not end consumers. The reason: the inputs required (scheduled shipping days, actual shipping days, profit ratios, financial metrics) are only available in internal order management systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Predictor&lt;/strong&gt; — Operations teams enter order details. The app automatically engineers all 6 derived features, applies the saved scaler and encoders, and outputs a probability score, risk category, top risk drivers, and recommended action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Dashboard&lt;/strong&gt; — Portfolio-level view of risk distribution, probability histogram, and feature importance chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations Action Panel&lt;/strong&gt; — Filterable table of high-risk orders with adjustable threshold slider and CSV export.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Leakage prevention is non-negotiable.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Delivery Status&lt;/code&gt; had a Cramér's V of 1.00 with the target. Including it would have given a perfect model on paper and a useless model in production. Always ask: would this feature exist at prediction time?&lt;/p&gt;
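&lt;p&gt;For readers who want to run the same check, here is a rough sketch of an uncorrected Cramér's V in plain Python. The category labels in the example are hypothetical stand-ins, and the function is an illustration rather than the code used in the project:&lt;/p&gt;

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical sequences.

    Assumes each variable takes at least two distinct values.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for r, row_count in row_totals.items():
        for c, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical labels: the status fully determines the target, so V is 1.
status = ["late", "late", "on time", "advance"]
is_late = [1, 1, 0, 0]
print(cramers_v(status, is_late))
```

&lt;p&gt;A value near 1 for any candidate feature is a strong hint that it encodes the target and should be checked for leakage.&lt;/p&gt;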

&lt;p&gt;&lt;strong&gt;2. Feature engineering made the biggest difference.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;shipping_delay_gap&lt;/code&gt;, a single engineered feature, accounts for 79% of the model's decisions. No raw feature came close. In this project, time spent on thoughtful feature engineering paid off far more than time spent on model tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Class balance should be checked before reaching for SMOTE.&lt;/strong&gt;&lt;br&gt;
The target was 55/45 — nearly balanced. SMOTE was unnecessary. &lt;code&gt;class_weight='balanced'&lt;/code&gt; was cleaner, faster, and equally effective.&lt;/p&gt;
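&lt;p&gt;To see what &lt;code&gt;class_weight='balanced'&lt;/code&gt; does with a 55/45 split, the sketch below reproduces the formula scikit-learn documents for that option, &lt;code&gt;n_samples / (n_classes * count)&lt;/code&gt;, in plain Python. The helper name and the 100-row toy label list are assumptions for illustration:&lt;/p&gt;

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy target with the 55/45 split described above (1 = late, 0 = on time).
labels = [1] * 55 + [0] * 45
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})  # {1: 0.909, 0: 1.111}
```

&lt;p&gt;With weights this close to 1, the reweighting is a gentle nudge, which is consistent with skipping synthetic oversampling on a nearly balanced target.&lt;/p&gt;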

&lt;p&gt;&lt;strong&gt;4. RandomizedSearchCV over GridSearchCV at scale.&lt;/strong&gt;&lt;br&gt;
With 180,000+ rows, GridSearch would have been impractical. RandomizedSearch with 30 iterations delivered strong results efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The most counterintuitive finding was the most actionable.&lt;/strong&gt;&lt;br&gt;
First Class shipping having a 95.3% late delivery rate is not a modeling artifact — it is a real operational failure that APL Logistics can act on directly, independent of any ML system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Python&lt;/code&gt; · &lt;code&gt;Pandas&lt;/code&gt; · &lt;code&gt;NumPy&lt;/code&gt; · &lt;code&gt;Scikit-learn&lt;/code&gt; · &lt;code&gt;XGBoost&lt;/code&gt; · &lt;code&gt;Matplotlib&lt;/code&gt; · &lt;code&gt;Seaborn&lt;/code&gt; · &lt;code&gt;Plotly&lt;/code&gt; · &lt;code&gt;Streamlit&lt;/code&gt; · &lt;code&gt;Joblib&lt;/code&gt; · &lt;code&gt;Jupyter Notebooks&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project was completed as part of the Data Science internship program at Unified Mentor Private Limited, in collaboration with APL Logistics (KWE Group).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>supplychain</category>
    </item>
    <item>
      <title>Predictive Forecasting of Care Load &amp; Placement Demand: What a 66% Structural Break Taught Me About Machine Learning</title>
      <dc:creator>Sugnik Mondal</dc:creator>
      <pubDate>Sun, 08 Mar 2026 11:18:58 +0000</pubDate>
      <link>https://dev.to/sugnikm/predictive-forecasting-of-care-load-placement-demand-what-a-66-structural-break-taught-me-about-4nb7</link>
      <guid>https://dev.to/sugnikm/predictive-forecasting-of-care-load-placement-demand-what-a-66-structural-break-taught-me-about-4nb7</guid>
      <description>&lt;p&gt;&lt;strong&gt;HHS Unaccompanied Alien Children Program — Data Science Internship Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sugnik Mondal · Unified Mentor Data Science Intern · March 2026&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built a forecasting system for the HHS UAC Program using 720 real records. Nine models were tested. Eight failed or underperformed. One won — but not because of model complexity. The reason every sophisticated model failed, and what fixed it, is the actual story here.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;The U.S. Department of Health &amp;amp; Human Services (HHS) Unaccompanied Alien Children (UAC) Program manages the care, custody, and sponsor placement of migrant children arriving at the U.S. border. Daily care load fluctuated between &lt;strong&gt;1,972 and 11,516 children&lt;/strong&gt; during the study period — a 5.8× range that makes capacity planning extremely difficult without reliable forecasts.&lt;/p&gt;

&lt;p&gt;This paper presents a complete ML forecasting system built on 720 real operational records spanning January 2023 to December 2025. The central finding is that a &lt;strong&gt;January 2025 structural break&lt;/strong&gt; — a permanent 66% drop in care load — caused every full-window model to fail catastrophically. The solution was simple in concept but required correctly diagnosing the problem first: &lt;strong&gt;recent-window retraining&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Care Load Model:&lt;/strong&gt; XGBoost MAE 5.48 · MAPE 0.23% · 9.6% better than naïve baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discharge Model:&lt;/strong&gt; XGBoost MAE 0.63 children/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard:&lt;/strong&gt; 6-page Streamlit app with zero-CSV-dependency prediction interface&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The Problem
&lt;/h2&gt;

&lt;p&gt;The HHS UAC Program needs to know, at least one day in advance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How many children will be in HHS care tomorrow?&lt;/strong&gt; (staffing, beds, resources)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How many children will be discharged tomorrow?&lt;/strong&gt; (sponsor outreach, placement capacity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is a surge coming?&lt;/strong&gt; (early warning, proactive capacity scaling)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without forecasts, every decision is reactive. Surges cause acute crises. Troughs cause costly over-provisioning. The program needed a tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; HHS UAC Program public operational records&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw records&lt;/td&gt;
&lt;td&gt;720 observations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date range&lt;/td&gt;
&lt;td&gt;Jan 2023 – Dec 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After preprocessing&lt;/td&gt;
&lt;td&gt;1,075 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing dates filled&lt;/td&gt;
&lt;td&gt;355 (weekends, via linear interpolation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target 1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hhs_care&lt;/code&gt; — children in HHS care daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target 2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hhs_discharged&lt;/code&gt; — daily discharges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lag-1 autocorrelation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.99&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
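&lt;p&gt;The gap-filling step in the table (missing dates filled via linear interpolation) can be sketched in plain Python as follows. The function name and the sample values are assumptions; the project itself may well have used a library routine instead:&lt;/p&gt;

```python
from datetime import date, timedelta


def fill_missing_days(observed: dict) -> dict:
    """Return a daily series, linearly interpolating between reported dates."""
    days = sorted(observed)
    filled = {}
    for start, end in zip(days, days[1:]):
        gap = (end - start).days
        change = observed[end] - observed[start]
        for step in range(gap):
            # step 0 reproduces the reported value on the start date
            filled[start + timedelta(days=step)] = observed[start] + step * change / gap
    filled[days[-1]] = observed[days[-1]]
    return filled


# Toy example: a Friday and the following Monday are reported, the weekend is not.
observed = {date(2025, 3, 7): 2200.0, date(2025, 3, 10): 2230.0}
filled = fill_missing_days(observed)
for day in sorted(filled):
    print(day.isoformat(), filled[day])
```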

&lt;p&gt;That lag-1 autocorrelation of 0.99 is important. It means yesterday's care load is an almost perfect predictor of today's. It immediately told me that the naïve baseline — predict tomorrow = today — was going to be very hard to beat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naïve Persistence MAE: 6.06.&lt;/strong&gt; That's the bar everything had to clear.&lt;/p&gt;
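&lt;p&gt;For reference, a naïve persistence baseline takes only a few lines. This is a minimal sketch on made-up values, not the evaluation code behind the 6.06 figure:&lt;/p&gt;

```python
def naive_persistence_mae(series):
    """MAE of the forecast 'tomorrow equals today' over a daily series."""
    if 2 > len(series):
        raise ValueError("need at least two observations")
    errors = [abs(today - yesterday) for yesterday, today in zip(series, series[1:])]
    return sum(errors) / len(errors)


care_load = [2300, 2306, 2301, 2309, 2315]  # toy values, not project data
print(naive_persistence_mae(care_load))  # 6.25
```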




&lt;h2&gt;
  
  
  3. The Structural Break — The Most Important Discovery
&lt;/h2&gt;

&lt;p&gt;Before touching a single model, the EDA revealed something critical.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;January 2025&lt;/strong&gt;, HHS care load dropped from approximately &lt;strong&gt;6,500 children to approximately 2,200 children in under two weeks.&lt;/strong&gt; That's a 66% reduction. And it never recovered — the low level persisted through the end of the dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A structural break is a permanent, abrupt change in the statistical properties of a time series. Unlike a trend or seasonal pattern, it cannot be modelled away — the series before and after the break are effectively two different processes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's why this matters for every model you try to build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full dataset training mean:&lt;/strong&gt; ~6,061 children&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test set mean (post-break):&lt;/strong&gt; ~2,300 children&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gap:&lt;/strong&gt; ~3,761 children&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any model trained on the full dataset learns patterns centred around 6,061. When it predicts on data centred around 2,300, its forecasts are pulled toward a level that no longer exists. That's not a model quality problem. That's a &lt;strong&gt;data regime problem.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Escalation Story — Nine Models, Eight Failures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Baseline (Floor Setting)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;MAE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naïve Persistence&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6.06&lt;/strong&gt; ← the bar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moving Average (w=3)&lt;/td&gt;
&lt;td&gt;9.76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Moving Average was actually &lt;em&gt;worse&lt;/em&gt; than naïve because the slight upward trend in the test period caused systematic under-prediction — the rolling average always lags behind a rising series.&lt;/p&gt;
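&lt;p&gt;The lag effect is easy to reproduce on a toy series. In the sketch below (values and helper names are illustrative assumptions), a 3-day moving average under-predicts a steadily rising series by twice as much as the naïve forecast:&lt;/p&gt;

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    return sum(history[-window:]) / window


def naive_forecast(history):
    """Forecast the next value as the most recent value."""
    return history[-1]


series = [2300, 2310, 2320, 2330, 2340, 2350]  # toy series rising by 10 per day

ma_errors, naive_errors = [], []
for t in range(3, len(series)):
    history, actual = series[:t], series[t]
    ma_errors.append(actual - moving_average_forecast(history))
    naive_errors.append(actual - naive_forecast(history))

print(ma_errors)     # [20.0, 20.0, 20.0]
print(naive_errors)  # [10, 10, 10]
```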

&lt;h3&gt;
  
  
  Phase 2: Statistical Models — All Failed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;MAE&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exponential Smoothing&lt;/td&gt;
&lt;td&gt;86.69&lt;/td&gt;
&lt;td&gt;Anchored to pre-break mean ~6,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARIMA(3,1,3)&lt;/td&gt;
&lt;td&gt;144.35&lt;/td&gt;
&lt;td&gt;Mean-reverting behaviour pulled forecasts too high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SARIMA&lt;/td&gt;
&lt;td&gt;433.17&lt;/td&gt;
&lt;td&gt;Seasonal components amplified the regime-change error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SARIMA was the &lt;em&gt;worst&lt;/em&gt; performing model overall. Adding more structure made the problem worse. The seasonal terms were learning patterns from the pre-break period that had no relevance to the post-break test data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This was not a failure of ARIMA or SARIMA as methods. It was a failure to check whether their core assumptions were met before applying them. Both assume that the series becomes stationary after differencing and follows one stable set of dynamics throughout. A permanent 66% level shift violates that assumption completely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Phase 3: Full-Window ML — Also Failed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;MAE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression (full)&lt;/td&gt;
&lt;td&gt;23.38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest (full)&lt;/td&gt;
&lt;td&gt;25.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost (full)&lt;/td&gt;
&lt;td&gt;40.66&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full-window ML models beat the statistical models, but none of them beat the naïve baseline, and XGBoost performed &lt;strong&gt;6.7× worse than naïve&lt;/strong&gt; (40.66 vs 6.06). The boosting process over-fitted to the high-variance pre-break period. The root cause was identical: wrong training distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Recent-Window ML — The Solution ✅
&lt;/h3&gt;

&lt;p&gt;The fix: retrain all models using &lt;strong&gt;only data from June 2024 onwards.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;MAE&lt;/th&gt;
&lt;th&gt;vs Naïve&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XGBoost (Recent)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.48&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅ –9.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest (Recent)&lt;/td&gt;
&lt;td&gt;6.54&lt;/td&gt;
&lt;td&gt;❌ +7.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression (Recent)&lt;/td&gt;
&lt;td&gt;7.48&lt;/td&gt;
&lt;td&gt;❌ +23.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By using only the recent window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training mean: ~2,800&lt;/li&gt;
&lt;li&gt;Test mean: ~2,300&lt;/li&gt;
&lt;li&gt;Gap: ~500 (vs 3,761 with full window)&lt;/li&gt;
&lt;/ul&gt;
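&lt;p&gt;The recent-window split and the mean-gap check above can be sketched as follows. The rows are toy values, and the 1 June 2024 cutoff date, the field names, and the helper name are assumptions for illustration:&lt;/p&gt;

```python
from datetime import date
from statistics import mean


def split_recent_window(rows, cutoff):
    """Keep only the rows dated on or after the cutoff."""
    return [row for row in rows if row["date"] >= cutoff]


# Toy rows, not the real HHS series.
rows = [
    {"date": date(2023, 3, 1), "hhs_care": 9000},
    {"date": date(2024, 2, 1), "hhs_care": 7000},
    {"date": date(2024, 9, 1), "hhs_care": 6400},
    {"date": date(2025, 3, 1), "hhs_care": 2200},
]
recent = split_recent_window(rows, cutoff=date(2024, 6, 1))

test_mean = 2300  # post-break level quoted above
full_gap = mean(row["hhs_care"] for row in rows) - test_mean
recent_gap = mean(row["hhs_care"] for row in recent) - test_mean
print(full_gap, recent_gap)  # 3850 2000
```

&lt;p&gt;Comparing the two gaps before fitting anything is a cheap way to see whether a candidate training window matches the regime being forecast.&lt;/p&gt;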

&lt;p&gt;XGBoost achieved &lt;strong&gt;MAE 5.48, RMSE 7.12, MAPE 0.23%.&lt;/strong&gt; In other words, the average forecast misses the actual care load by 0.23%.&lt;/p&gt;
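&lt;p&gt;For readers less familiar with these metrics, here is a small sketch of how they and the "vs Naïve" percentage are computed. The helper names are assumptions, and the MAPE version assumes no actual value is zero:&lt;/p&gt;

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def improvement_over_baseline(model_mae, baseline_mae):
    """Percentage reduction in MAE relative to a baseline."""
    return 100 * (baseline_mae - model_mae) / baseline_mae


# Using the MAE values reported above for XGBoost (Recent) and the naïve baseline.
print(round(improvement_over_baseline(5.48, 6.06), 1))  # 9.6
```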




&lt;h2&gt;
  
  
  5. The Complete Leaderboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;MAE ↓&lt;/th&gt;
&lt;th&gt;RMSE ↓&lt;/th&gt;
&lt;th&gt;MAPE ↓&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🏆 &lt;strong&gt;XGBoost (Recent)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.48&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.23%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naïve Persistence&lt;/td&gt;
&lt;td&gt;6.06&lt;/td&gt;
&lt;td&gt;7.24&lt;/td&gt;
&lt;td&gt;0.27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest (Recent)&lt;/td&gt;
&lt;td&gt;6.54&lt;/td&gt;
&lt;td&gt;8.44&lt;/td&gt;
&lt;td&gt;0.28%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression (Recent)&lt;/td&gt;
&lt;td&gt;7.48&lt;/td&gt;
&lt;td&gt;8.86&lt;/td&gt;
&lt;td&gt;0.31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moving Average (w=3)&lt;/td&gt;
&lt;td&gt;9.76&lt;/td&gt;
&lt;td&gt;11.77&lt;/td&gt;
&lt;td&gt;0.43%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ridge Regression (Recent)&lt;/td&gt;
&lt;td&gt;17.80&lt;/td&gt;
&lt;td&gt;22.80&lt;/td&gt;
&lt;td&gt;0.74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exponential Smoothing&lt;/td&gt;
&lt;td&gt;86.69&lt;/td&gt;
&lt;td&gt;97.40&lt;/td&gt;
&lt;td&gt;3.74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARIMA(3,1,3)&lt;/td&gt;
&lt;td&gt;144.35&lt;/td&gt;
&lt;td&gt;161.63&lt;/td&gt;
&lt;td&gt;6.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SARIMA&lt;/td&gt;
&lt;td&gt;433.17&lt;/td&gt;
&lt;td&gt;501.04&lt;/td&gt;
&lt;td&gt;18.53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

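&lt;p&gt;Only &lt;strong&gt;one&lt;/strong&gt; of the nine leaderboard entries beat the naïve baseline. The three full-window ML models from Phase 3 are not listed here, and none of them beat it either. The winning model was trained on roughly 30% of the available data.&lt;/p&gt;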




&lt;h2&gt;
  
  
  6. Feature Engineering &amp;amp; What the Model Actually Learned
&lt;/h2&gt;

&lt;p&gt;30+ features were engineered from the five raw columns. The top features by XGBoost importance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Importance&lt;/th&gt;
&lt;th&gt;What it captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hhs_care_roll_min_30&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.541&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-day rolling minimum — the post-break floor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hhs_care_lag_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.159&lt;/td&gt;
&lt;td&gt;2-day autoregressive signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hhs_care_lag_1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.150&lt;/td&gt;
&lt;td&gt;Yesterday's value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cbp_transferred&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.122&lt;/td&gt;
&lt;td&gt;Today's pipeline transfers — leading indicator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dominance of &lt;code&gt;hhs_care_roll_min_30&lt;/code&gt; (0.541 — over half the total importance) is revealing. The model's primary mechanism is recognising &lt;em&gt;which regime it's in&lt;/em&gt; by checking the 30-day floor. The top four features account for &lt;strong&gt;97.2% of total importance.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The Discharge Model
&lt;/h2&gt;

&lt;p&gt;Discharge demand required separate treatment. The discharge structural break was even more severe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full-window training mean:&lt;/strong&gt; 173 children/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-break test mean:&lt;/strong&gt; ~9 children/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduction:&lt;/strong&gt; 94.8%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A June 2024 cutoff still left a massive gap. A &lt;strong&gt;March 2025 cutoff&lt;/strong&gt; reduced the training-test mean gap to 3.67 children/day. XGBoost achieved &lt;strong&gt;MAE 0.63 children/day&lt;/strong&gt;, less than one child per day in prediction error.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Ridge Regression achieved MAE 0.03 and was excluded as implausible. A result that perfect on a small training window more often signals overfitting or target leakage than a genuine win.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The Streamlit Dashboard
&lt;/h2&gt;

&lt;p&gt;A 6-page dashboard operationalises both models with a key design decision: &lt;strong&gt;zero CSV dependency for predictions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the reasoning: the training data ends December 2025. If a program administrator uses this app in June 2026, lag values pulled from the historical CSV would be 6 months stale and therefore the wrong inputs for the model.&lt;/p&gt;

&lt;p&gt;The solution: users enter only what they naturally know from their daily report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Last 14 days of care load (from their records)&lt;/li&gt;
&lt;li&gt;Today's CBP transfers, HHS discharges, CBP apprehensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 17 numbers. The app computes all 30+ model features automatically — rolling means, standard deviations, min/max, net flow, calendar features — purely from those 17 inputs. Works for any future date, any year.&lt;/p&gt;
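&lt;p&gt;A rough sketch of that derivation step is shown below. Only &lt;code&gt;hhs_care_lag_1&lt;/code&gt;, &lt;code&gt;hhs_care_lag_2&lt;/code&gt; and &lt;code&gt;cbp_transferred&lt;/code&gt; are feature names taken from this article; every other name, the window sizes, and the net-flow definition are assumptions, and the real app computes many more features:&lt;/p&gt;

```python
from datetime import date
from statistics import mean, stdev


def build_features(last_14_days, cbp_transferred, hhs_discharged,
                   cbp_apprehended, forecast_date):
    """Derive a handful of model inputs from the 17 user-entered numbers."""
    if len(last_14_days) != 14:
        raise ValueError("expected exactly 14 daily care-load values, oldest first")
    last_week = last_14_days[-7:]
    return {
        "hhs_care_lag_1": last_14_days[-1],
        "hhs_care_lag_2": last_14_days[-2],
        "hhs_care_roll_mean_7": mean(last_week),         # hypothetical name
        "hhs_care_roll_std_7": stdev(last_week),         # hypothetical name
        "hhs_care_roll_min_14": min(last_14_days),       # hypothetical name
        "hhs_care_roll_max_14": max(last_14_days),       # hypothetical name
        "cbp_transferred": cbp_transferred,
        "cbp_apprehended": cbp_apprehended,              # hypothetical name
        "net_flow": cbp_transferred - hhs_discharged,    # assumed definition
        "day_of_week": forecast_date.weekday(),          # Monday is 0
        "month": forecast_date.month,
    }


features = build_features(
    last_14_days=list(range(2300, 2314)),  # toy values, oldest first
    cbp_transferred=40,
    hhs_discharged=9,
    cbp_apprehended=55,
    forecast_date=date(2026, 6, 15),
)
print(features["hhs_care_lag_1"], features["net_flow"], features["day_of_week"])
```

&lt;p&gt;Because every feature is computed from the values the user types in, nothing in this step depends on the date range of the training CSV.&lt;/p&gt;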

&lt;p&gt;&lt;strong&gt;Dashboard pages:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Overview&lt;/strong&gt; — KPI cards, historical trend, intake/discharge balance, leaderboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Care Load Forecast&lt;/strong&gt; — 14-day input grid, next-day prediction, alert level, scenario comparison&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discharge Forecast&lt;/strong&gt; — Same zero-CSV interface, weekly/monthly capacity estimates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early Warning System&lt;/strong&gt; — Alert zones, 90-day history, 5 project KPIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Performance&lt;/strong&gt; — Full escalation story, feature importance, all notebook figures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;About &amp;amp; Dataset&lt;/strong&gt; — Problem statement, dataset details, tech stack&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  9. Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For data scientists:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Diagnose before modelling.&lt;/strong&gt; Before choosing a model, check stationarity, look for structural breaks, and verify that the training distribution matches the test distribution. This project would have ended at ARIMA if I hadn't investigated why it failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Training window is a hyperparameter.&lt;/strong&gt; The right window selection here improved XGBoost MAE from 40.66 to 5.48 — a 7.4× improvement. No hyperparameter tuning of the model itself could have achieved that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. More data is not always better.&lt;/strong&gt; The winning model used 30% of available data. The rest was actively harmful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Validate feature importance.&lt;/strong&gt; The dominance of &lt;code&gt;hhs_care_roll_min_30&lt;/code&gt; revealed that the model was primarily doing regime detection, not pattern forecasting. That insight validates the approach and suggests the right questions to ask if the regime changes again.&lt;/p&gt;

&lt;h3&gt;
  
  
  For the project evaluator:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Training window selection is as important as model selection in the presence of structural breaks."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the central research finding. It is not a statement about this dataset specifically — it is a general principle applicable to any forecasting domain where abrupt regime changes are possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.x     pandas  numpy  matplotlib  seaborn
XGBoost        scikit-learn  statsmodels  joblib
Streamlit      Jupyter Notebooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Project structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uac-forecasting/
├── notebooks/   01_EDA → 07_Model_Evaluation
├── models/      best_model_recent.joblib + configs
├── data/        raw + processed
├── reports/     figures from all notebooks
└── src/         app1.py (Streamlit dashboard)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Chen, T., &amp;amp; Guestrin, C. (2016). XGBoost: A scalable tree boosting system. &lt;em&gt;KDD 2016.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Box, G. E. P., et al. (2015). &lt;em&gt;Time Series Analysis: Forecasting and Control&lt;/em&gt; (5th ed.). Wiley.&lt;/li&gt;
&lt;li&gt;Hyndman, R. J., &amp;amp; Athanasopoulos, G. (2021). &lt;em&gt;Forecasting: Principles and Practice&lt;/em&gt; (3rd ed.). OTexts.&lt;/li&gt;
&lt;li&gt;Zeileis, A., et al. (2003). Testing and dating of structural changes in practice. &lt;em&gt;Computational Statistics &amp;amp; Data Analysis, 44&lt;/em&gt;(1–2), 109–123.&lt;/li&gt;
&lt;li&gt;HHS Office of Refugee Resettlement. UAC Program Data. U.S. Department of Health &amp;amp; Human Services.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Built as part of the Unified Mentor Data Science Internship · March 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;· GitHub: &lt;a href="https://github.com/Sugnik27/uac-forecasting?tab=readme-ov-file" rel="noopener noreferrer"&gt;https://github.com/Sugnik27/uac-forecasting?tab=readme-ov-file&lt;/a&gt; &lt;br&gt;
· Live App: &lt;a href="https://uac-forecasting.streamlit.app/" rel="noopener noreferrer"&gt;https://uac-forecasting.streamlit.app/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;· An executive summary prepared for non-technical HHS stakeholders is available here: &lt;a href="https://drive.google.com/drive/folders/1di-SvV6YidjTOGIvU8sLPXdahH1qhgIa?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/drive/folders/1di-SvV6YidjTOGIvU8sLPXdahH1qhgIa?usp=sharing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>timeseries</category>
    </item>
  </channel>
</rss>
