HHS Unaccompanied Alien Children Program — Data Science Internship Project
Sugnik Mondal · Unified Mentor Data Science Intern · March 2026
TL;DR: I built a forecasting system for the HHS UAC Program using 720 real records. Nine models were tested. Eight failed or underperformed. One won — but not because of model complexity. The reason every sophisticated model failed, and what fixed it, is the actual story here.
Abstract
The U.S. Department of Health & Human Services (HHS) Unaccompanied Alien Children (UAC) Program manages the care, custody, and sponsor placement of migrant children arriving at the U.S. border. Daily care load fluctuated between 1,972 and 11,516 children during the study period — a 5.8× range that makes capacity planning extremely difficult without reliable forecasts.
This paper presents a complete ML forecasting system built on 720 real operational records spanning January 2023 to December 2025. The central finding is that a January 2025 structural break — a permanent 66% drop in care load — caused every full-window model to fail catastrophically. The solution was simple in concept but required correctly diagnosing the problem first: recent-window retraining.
Final results:
- Care Load Model: XGBoost MAE 5.48 · MAPE 0.23% · 9.6% better than naïve baseline
- Discharge Model: XGBoost MAE 0.63 children/day
- Dashboard: 6-page Streamlit app with zero-CSV-dependency prediction interface
1. The Problem
The HHS UAC Program needs to know, at minimum one day in advance:
- How many children will be in HHS care tomorrow? (staffing, beds, resources)
- How many children will be discharged tomorrow? (sponsor outreach, placement capacity)
- Is a surge coming? (early warning, proactive capacity scaling)
Without forecasts, every decision is reactive. Surges cause acute crises. Troughs cause costly over-provisioning. The program needed a tool.
2. The Data
Source: HHS UAC Program public operational records
| Property | Value |
|---|---|
| Raw records | 720 observations |
| Date range | Jan 2023 – Dec 2025 |
| After preprocessing | 1,075 rows |
| Missing dates filled | 355 (weekends, via linear interpolation) |
| Target 1 | `hhs_care` — children in HHS care, daily |
| Target 2 | `hhs_discharged` — daily discharges |
| Lag-1 autocorrelation | 0.99 |
That lag-1 autocorrelation of 0.99 is important. It means yesterday's care load is an almost perfect predictor of today's. It immediately told me that the naïve baseline — predict tomorrow = today — was going to be very hard to beat.
Naïve Persistence MAE: 6.06. That's the bar everything had to clear.
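To make the bar concrete, here is a minimal sketch of how a naïve-persistence MAE is computed (the values below are hypothetical, not the real HHS series):

```python
import numpy as np

# Hypothetical daily care-load values (not the real HHS series)
y = np.array([2310.0, 2295.0, 2301.0, 2288.0, 2279.0, 2284.0])

# Naïve persistence: the forecast for each day is simply the previous day's value
preds, actuals = y[:-1], y[1:]
mae = np.mean(np.abs(actuals - preds))
print(f"Naïve MAE: {mae:.2f}")  # Naïve MAE: 9.60
```

With a lag-1 autocorrelation of 0.99, this one-liner is a genuinely strong predictor, which is exactly why it makes an honest baseline.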
3. The Structural Break — The Most Important Discovery
Before touching a single model, the EDA revealed something critical.
In January 2025, HHS care load dropped from approximately 6,500 children to approximately 2,200 children in under two weeks. That's a 66% reduction. And it never recovered — the low level persisted through the end of the dataset.
A structural break is a permanent, abrupt change in the statistical properties of a time series. Unlike a trend or seasonal pattern, it cannot be modelled away — the series before and after the break are effectively two different processes.
Here's why this matters for every model you try to build:
- Full dataset training mean: ~6,061 children
- Test set mean (post-break): ~2,300 children
- Gap: ~3,761 children
Any model trained on the full dataset learns patterns centred around 6,061. When it predicts on data centred around 2,300, it's off by thousands. That's not a model quality problem. That's a data regime problem.
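The regime mismatch is easy to reproduce on synthetic data. This sketch uses made-up levels on the same scale as the numbers above to show how a full-window training mean ends up thousands of children away from the post-break test mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic series with a permanent level shift at index 500
pre = rng.normal(6500, 100, 500)    # pre-break regime
post = rng.normal(2200, 100, 250)   # post-break regime
series = np.concatenate([pre, post])

full_train_mean = series.mean()     # what a full-window model "sees"
test_mean = series[-100:].mean()    # what it must actually predict
print(f"gap: {full_train_mean - test_mean:.0f} children")
```

No amount of model tuning closes that gap; only changing what the model is trained on does.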
4. The Escalation Story — Nine Models, Eight Failures
Phase 1: Baseline (Floor Setting)
| Model | MAE |
|---|---|
| Naïve Persistence | 6.06 ← the bar |
| Moving Average (w=3) | 9.76 |
Moving Average was actually worse than naïve because the slight upward trend in the test period caused systematic under-prediction — the rolling average always lags behind a rising series.
Phase 2: Statistical Models — All Failed
| Model | MAE | Why |
|---|---|---|
| Exponential Smoothing | 86.69 | Anchored to pre-break mean ~6,000 |
| ARIMA(3,1,3) | 144.35 | Mean-reverting behaviour pulled forecasts too high |
| SARIMA | 433.17 | Seasonal components amplified the regime-change error |
SARIMA was the worst performing model overall. Adding more structure made the problem worse. The seasonal terms were learning patterns from the pre-break period that had no relevance to the post-break test data.
This was not a failure of ARIMA or SARIMA as methods. It was a failure to check whether their core assumptions were met before applying them. Both assume a stationary or trend-stationary series. A permanent 66% level shift violates that assumption completely.
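As a minimal illustration of checking that assumption first, here is a crude stationarity screen in plain NumPy. The helper name and threshold are illustrative only; a formal test such as the ADF test (`adfuller` in statsmodels) would be the proper tool:

```python
import numpy as np

def quick_stationarity_check(series, n_splits=3):
    # Crude screen: compare segment means against within-segment noise.
    chunks = np.array_split(np.asarray(series, dtype=float), n_splits)
    mean_spread = max(c.mean() for c in chunks) - min(c.mean() for c in chunks)
    max_noise = max(c.std() for c in chunks)
    return mean_spread, max_noise

# Idealised two-regime series: a permanent 6,500 -> 2,200 level shift
series = np.concatenate([np.full(500, 6500.0), np.full(250, 2200.0)])
spread, noise = quick_stationarity_check(series)
suspicious = spread > 10 * max(noise, 1.0)
print(suspicious)  # True: the level shift dwarfs within-segment variation
```

A check like this, run before fitting anything, would have flagged the series as unsuitable for ARIMA-family models in seconds.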
Phase 3: Full-Window ML — Also Failed
| Model | MAE |
|---|---|
| Linear Regression (full) | 23.38 |
| Random Forest (full) | 25.41 |
| XGBoost (full) | 40.66 |
ML models were better than statistical models — but XGBoost performed 6.7× worse than naïve. The boosting process over-fitted to the high-variance pre-break period. The root cause was identical: wrong training distribution.
Phase 4: Recent-Window ML — The Solution ✅
The fix: retrain all models using only data from June 2024 onwards.
| Model | MAE | vs Naïve |
|---|---|---|
| XGBoost (Recent) | 5.48 | ✅ –9.6% |
| Random Forest (Recent) | 6.54 | ❌ +7.9% |
| Linear Regression (Recent) | 7.48 | ❌ +23.4% |
By using only the recent window:
- Training mean: ~2,800
- Test mean: ~2,300
- Gap: ~500 (vs 3,761 with full window)
XGBoost achieved MAE 5.48, RMSE 7.12, MAPE 0.23%: an average error of roughly a quarter of one percent of the daily care load.
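The windowing fix itself is essentially a one-line slice. A sketch with a fabricated two-regime series (the cutoff date matches the one used above; the data is synthetic):

```python
import numpy as np
import pandas as pd

# Fabricated two-regime series on a daily DatetimeIndex
idx = pd.date_range("2023-01-01", "2025-12-31", freq="D")
df = pd.DataFrame({"hhs_care": np.where(idx < "2025-01-15", 6500.0, 2200.0)},
                  index=idx)

# The fix: train only on the recent regime (cutoff chosen from the EDA,
# treated as a hyperparameter rather than a modelling afterthought)
CUTOFF = "2024-06-01"
recent = df.loc[CUTOFF:]

print(f"full-window mean:   {df['hhs_care'].mean():,.0f}")
print(f"recent-window mean: {recent['hhs_care'].mean():,.0f}")
```

The same `recent` frame then feeds the feature pipeline and model fit; everything downstream is unchanged.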
5. The Complete Leaderboard
| Model | MAE ↓ | RMSE ↓ | MAPE ↓ |
|---|---|---|---|
| 🏆 XGBoost (Recent) | 5.48 | 7.12 | 0.23% |
| Naïve Persistence | 6.06 | 7.24 | 0.27% |
| Random Forest (Recent) | 6.54 | 8.44 | 0.28% |
| Linear Regression (Recent) | 7.48 | 8.86 | 0.31% |
| Moving Average (w=3) | 9.76 | 11.77 | 0.43% |
| Ridge Regression (Recent) | 17.80 | 22.80 | 0.74% |
| Exponential Smoothing | 86.69 | 97.40 | 3.74% |
| ARIMA(3,1,3) | 144.35 | 161.63 | 6.20% |
| SARIMA | 433.17 | 501.04 | 18.53% |
Only one model out of nine beat the naïve baseline. That model was trained on roughly 30% of the available data.
6. Feature Engineering & What the Model Actually Learned
30+ features were engineered from the five raw columns. The top features by XGBoost importance:
| Feature | Importance | What it captures |
|---|---|---|
| `hhs_care_roll_min_30` | 0.541 | 30-day rolling minimum — the post-break floor |
| `hhs_care_lag_2` | 0.159 | 2-day autoregressive signal |
| `hhs_care_lag_1` | 0.150 | Yesterday's value |
| `cbp_transferred` | 0.122 | Today's pipeline transfers — leading indicator |
The dominance of hhs_care_roll_min_30 (0.541 — over half the total importance) is revealing. The model's primary mechanism is recognising which regime it's in by checking the 30-day floor. The top four features account for 97.2% of total importance.
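The dominant feature is cheap to compute with pandas. This sketch uses a fabricated two-level series to show why a 30-day rolling minimum acts as a regime detector:

```python
import pandas as pd

# Fabricated care-load series: 40 pre-break days, then 40 post-break days
care = pd.Series([6500.0] * 40 + [2200.0] * 40)

# The model's dominant feature: the 30-day rolling minimum. After a
# downward break it converges to the new floor, flagging the regime.
roll_min_30 = care.rolling(window=30, min_periods=1).min()
print(roll_min_30.iloc[0], roll_min_30.iloc[-1])  # 6500.0 2200.0
```

The minimum drops the moment a single post-break day enters the window, so the feature reacts to a downward level shift far faster than a rolling mean would.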
7. The Discharge Model
Discharge demand required separate treatment. The discharge structural break was even more severe:
- Full-window training mean: 173 children/day
- Post-break test mean: ~9 children/day
- Reduction: 94.8%
A June 2024 cutoff still left a massive gap. A March 2025 cutoff reduced the training-test mean gap to 3.67. XGBoost achieved MAE 0.63 children/day — less than one child per day in prediction error.
(Ridge Regression achieved MAE 0.03 — excluded as overfitting. A result that perfect on a small training window is a red flag, not a win.)
8. The Streamlit Dashboard
A 6-page dashboard operationalises both models with a key design decision: zero CSV dependency for predictions.
Here's the reasoning: the training data ends December 2025. If a program administrator uses this app in June 2026, lag values pulled from the historical CSV would be six months stale — completely wrong inputs for the model.
The solution: users enter only what they naturally know from their daily report:
- Last 14 days of care load (from their records)
- Today's CBP transfers, HHS discharges, CBP apprehensions
That's 17 numbers. The app computes all 30+ model features automatically — rolling means, standard deviations, min/max, net flow, calendar features — purely from those 17 inputs. Works for any future date, any year.
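A minimal sketch of that idea, with hypothetical feature names rather than the app's exact schema:

```python
import datetime as dt
import numpy as np

def build_features(last_14_days, cbp_transferred, hhs_discharged,
                   cbp_apprehended, date):
    # Derive model inputs purely from the user's 17 numbers; the
    # feature names here are illustrative, not the app's exact schema.
    x = np.asarray(last_14_days, dtype=float)
    return {
        "lag_1": x[-1],
        "lag_2": x[-2],
        "roll_mean_7": x[-7:].mean(),
        "roll_std_7": x[-7:].std(),
        "roll_min_14": x.min(),
        "roll_max_14": x.max(),
        "net_flow": cbp_transferred - hhs_discharged,
        "cbp_apprehended": float(cbp_apprehended),
        "day_of_week": date.weekday(),
        "month": date.month,
    }

feats = build_features([2300] * 13 + [2310], 45, 38, 120, dt.date(2026, 6, 15))
print(feats["lag_1"], feats["net_flow"])  # 2310.0 7
```

Because every feature is derived from the 17 inputs and the date, nothing in the prediction path ever reads the historical CSV, so the interface cannot go stale.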
Dashboard pages:
- Overview — KPI cards, historical trend, intake/discharge balance, leaderboard
- Care Load Forecast — 14-day input grid, next-day prediction, alert level, scenario comparison
- Discharge Forecast — Same zero-CSV interface, weekly/monthly capacity estimates
- Early Warning System — Alert zones, 90-day history, 5 project KPIs
- Model Performance — Full escalation story, feature importance, all notebook figures
- About & Dataset — Problem statement, dataset details, tech stack
9. Key Takeaways
For data scientists:
1. Diagnose before modelling. Before choosing a model, check stationarity, look for structural breaks, and verify that the training distribution matches the test distribution. This project would have ended at ARIMA if I hadn't investigated why it failed.
2. Training window is a hyperparameter. The right window selection here improved XGBoost MAE from 40.66 to 5.48 — a 7.4× improvement. No hyperparameter tuning of the model itself could have achieved that.
3. More data is not always better. The winning model used 30% of available data. The rest was actively harmful.
4. Validate feature importance. The dominance of hhs_care_roll_min_30 revealed that the model was primarily doing regime detection, not pattern forecasting. That insight validates the approach and suggests the right questions to ask if the regime changes again.
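Takeaway 2 can be operationalised as a sweep. This sketch treats the training-window start as a hyperparameter, with a trivial predict-the-training-mean stand-in in place of XGBoost:

```python
import numpy as np

rng = np.random.default_rng(1)
# Fabricated series: 600 pre-break days, then 200 post-break days
series = np.concatenate([rng.normal(6500, 50, 600), rng.normal(2200, 50, 200)])
test = series[-30:]                 # hold-out: the last 30 days

def window_mae(train):
    # Stand-in "model": predict the training mean (the project used XGBoost)
    return np.mean(np.abs(test - train.mean()))

# Treat the training-window *start* as a hyperparameter and sweep it
results = {start: window_mae(series[start:-30]) for start in (0, 400, 600, 650)}
best = min(results, key=results.get)
print(best in (600, 650))  # True: only post-break windows score well
```

In practice each candidate window would get a full retrain-and-evaluate pass, but the selection logic is exactly this loop.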
For the project evaluator:
"Training window selection is as important as model selection in the presence of structural breaks."
This is the central research finding. It is not a statement about this dataset specifically — it is a general principle applicable to any forecasting domain where abrupt regime changes are possible.
Tech Stack
Python 3.x pandas numpy matplotlib seaborn
XGBoost scikit-learn statsmodels joblib
Streamlit Jupyter Notebooks
Project structure:
uac-forecasting/
├── notebooks/ 01_EDA → 07_Model_Evaluation
├── models/ best_model_recent.joblib + configs
├── data/ raw + processed
├── reports/ figures from all notebooks
└── src/ app1.py (Streamlit dashboard)
References
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD 2016.
- Box, G. E. P., et al. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
- Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.
- Zeileis, A., et al. (2003). Testing and dating of structural changes in practice. Computational Statistics & Data Analysis, 44(1–2), 109–123.
- HHS Office of Refugee Resettlement. UAC Program Data. U.S. Department of Health & Human Services.
Built as part of the Unified Mentor Data Science Internship · March 2026
· GitHub: https://github.com/Sugnik27/uac-forecasting?tab=readme-ov-file
· Live App: https://uac-forecasting.streamlit.app/
· An executive summary prepared for non-technical HHS stakeholders is available here: https://drive.google.com/drive/folders/1di-SvV6YidjTOGIvU8sLPXdahH1qhgIa?usp=sharing