HHS Unaccompanied Alien Children Program — Data Science Internship Project
Sugnik Mondal · Unified Mentor Data Science Intern · March 2026
TL;DR: I built a forecasting system for the HHS UAC Program using 720 real records. Nine models were tested. Eight failed or underperformed. One won — but not because of model complexity. The reason every sophisticated model failed, and what fixed it, is the actual story here.
Abstract
The U.S. Department of Health & Human Services (HHS) Unaccompanied Alien Children (UAC) Program manages the care, custody, and sponsor placement of migrant children arriving at the U.S. border. Daily care load fluctuated between 1,972 and 11,516 children during the study period — a 5.8× range that makes capacity planning extremely difficult without reliable forecasts.
This paper presents a complete ML forecasting system built on 720 real operational records spanning January 2023 to December 2025. The central finding is that a January 2025 structural break — a permanent 66% drop in care load — caused every full-window model to fail catastrophically. The solution was simple in concept but required correctly diagnosing the problem first: recent-window retraining.
Final results:
- Care Load Model: XGBoost MAE 5.48 · MAPE 0.23% · 9.6% better than naïve baseline
- Discharge Model: XGBoost MAE 0.63 children/day
- Dashboard: 6-page Streamlit app with zero-CSV-dependency prediction interface
1. The Problem
The HHS UAC Program needs to know, at minimum one day in advance:
- How many children will be in HHS care tomorrow? (staffing, beds, resources)
- How many children will be discharged tomorrow? (sponsor outreach, placement capacity)
- Is a surge coming? (early warning, proactive capacity scaling)
Without forecasts, every decision is reactive. Surges cause acute crises. Troughs cause costly over-provisioning. The program needed a tool.
2. The Data
Source: HHS UAC Program public operational records
| Property | Value |
|---|---|
| Raw records | 720 observations |
| Date range | Jan 2023 – Dec 2025 |
| After preprocessing | 1,075 rows |
| Missing dates filled | 355 (weekends, via linear interpolation) |
| Target 1 | `hhs_care` — children in HHS care, daily |
| Target 2 | `hhs_discharged` — daily discharges |
| Lag-1 autocorrelation | 0.99 |
That lag-1 autocorrelation of 0.99 is important. It means yesterday's care load is an almost perfect predictor of today's. It immediately told me that the naïve baseline — predict tomorrow = today — was going to be very hard to beat.
Naïve Persistence MAE: 6.06. That's the bar everything had to clear.
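To make the bar concrete, here is a minimal sketch of how a naïve-persistence MAE is computed (the values below are hypothetical, not the real HHS series):

```python
import numpy as np

# Hypothetical daily care-load values (not the real HHS series)
y = np.array([2310.0, 2295.0, 2301.0, 2288.0, 2279.0, 2284.0])

# Naïve persistence: the forecast for each day is simply the previous day's value
preds, actuals = y[:-1], y[1:]
mae = np.mean(np.abs(actuals - preds))
print(f"Naïve MAE: {mae:.2f}")  # Naïve MAE: 9.60
```

With a lag-1 autocorrelation of 0.99, this one-liner is a genuinely strong predictor, which is exactly why it makes an honest baseline.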
3. The Structural Break — The Most Important Discovery
Before touching a single model, the EDA revealed something critical.
In January 2025, HHS care load dropped from approximately 6,500 children to approximately 2,200 children in under two weeks. That's a 66% reduction. And it never recovered — the low level persisted through the end of the dataset.
A structural break is a permanent, abrupt change in the statistical properties of a time series. Unlike a trend or seasonal pattern, it cannot be modelled away — the series before and after the break are effectively two different processes.
Here's why this matters for every model you try to build:
- Full dataset training mean: ~6,061 children
- Test set mean (post-break): ~2,300 children
- Gap: ~3,761 children
Any model trained on the full dataset learns patterns centred around 6,061. When it predicts on data centred around 2,300, it's off by thousands. That's not a model quality problem. That's a data regime problem.
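The regime mismatch is easy to reproduce on synthetic data. This sketch uses made-up levels on the same scale as the numbers above to show how a full-window training mean ends up thousands of children away from the post-break test mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic series with a permanent level shift at index 500
pre = rng.normal(6500, 100, 500)    # pre-break regime
post = rng.normal(2200, 100, 250)   # post-break regime
series = np.concatenate([pre, post])

full_train_mean = series.mean()     # what a full-window model "sees"
test_mean = series[-100:].mean()    # what it must actually predict
print(f"gap: {full_train_mean - test_mean:.0f} children")
```

No amount of model tuning closes that gap; only changing what the model is trained on does.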
4. The Escalation Story — Nine Models, Eight Failures
Phase 1: Baseline (Floor Setting)
| Model | MAE |
|---|---|
| Naïve Persistence | 6.06 ← the bar |
| Moving Average (w=3) | 9.76 |
Moving Average was actually worse than naïve because the slight upward trend in the test period caused systematic under-prediction — the rolling average always lags behind a rising series.
Phase 2: Statistical Models — All Failed
| Model | MAE | Why |
|---|---|---|
| Exponential Smoothing | 86.69 | Anchored to pre-break mean ~6,000 |
| ARIMA(3,1,3) | 144.35 | Mean-reverting behaviour pulled forecasts too high |
| SARIMA | 433.17 | Seasonal components amplified the regime-change error |
SARIMA was the worst performing model overall. Adding more structure made the problem worse. The seasonal terms were learning patterns from the pre-break period that had no relevance to the post-break test data.
This was not a failure of ARIMA or SARIMA as methods. It was a failure to check whether their core assumptions were met before applying them. Both assume a stationary or trend-stationary series. A permanent 66% level shift violates that assumption completely.
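As a minimal illustration of checking that assumption first, here is a crude stationarity screen in plain NumPy. The helper name and threshold are illustrative only; a formal test such as the ADF test (`adfuller` in statsmodels) would be the proper tool:

```python
import numpy as np

def quick_stationarity_check(series, n_splits=3):
    # Crude screen: compare segment means against within-segment noise.
    chunks = np.array_split(np.asarray(series, dtype=float), n_splits)
    mean_spread = max(c.mean() for c in chunks) - min(c.mean() for c in chunks)
    max_noise = max(c.std() for c in chunks)
    return mean_spread, max_noise

# Idealised two-regime series: a permanent 6,500 -> 2,200 level shift
series = np.concatenate([np.full(500, 6500.0), np.full(250, 2200.0)])
spread, noise = quick_stationarity_check(series)
suspicious = spread > 10 * max(noise, 1.0)
print(suspicious)  # True: the level shift dwarfs within-segment variation
```

A check like this, run before fitting anything, would have flagged the series as unsuitable for ARIMA-family models in seconds.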
Phase 3: Full-Window ML — Also Failed
| Model | MAE |
|---|---|
| Linear Regression (full) | 23.38 |
| Random Forest (full) | 25.41 |
| XGBoost (full) | 40.66 |
ML models were better than statistical models — but XGBoost performed 6.7× worse than naïve. The boosting process over-fitted to the high-variance pre-break period. The root cause was identical: wrong training distribution.
Phase 4: Recent-Window ML — The Solution ✅
The fix: retrain all models using only data from June 2024 onwards.
| Model | MAE | vs Naïve |
|---|---|---|
| XGBoost (Recent) | 5.48 | ✅ –9.6% |
| Random Forest (Recent) | 6.54 | ❌ +7.9% |
| Linear Regression (Recent) | 7.48 | ❌ +23.4% |
By using only the recent window:
- Training mean: ~2,800
- Test mean: ~2,300
- Gap: ~500 (vs 3,761 with full window)
XGBoost achieved MAE 5.48, RMSE 7.12, MAPE 0.23%: an average error of roughly a quarter of one percent of the daily care load.
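The windowing fix itself is essentially a one-line slice. A sketch with a fabricated two-regime series (the cutoff date matches the one used above; the data is synthetic):

```python
import numpy as np
import pandas as pd

# Fabricated two-regime series on a daily DatetimeIndex
idx = pd.date_range("2023-01-01", "2025-12-31", freq="D")
df = pd.DataFrame({"hhs_care": np.where(idx < "2025-01-15", 6500.0, 2200.0)},
                  index=idx)

# The fix: train only on the recent regime (cutoff chosen from the EDA,
# treated as a hyperparameter rather than a modelling afterthought)
CUTOFF = "2024-06-01"
recent = df.loc[CUTOFF:]

print(f"full-window mean:   {df['hhs_care'].mean():,.0f}")
print(f"recent-window mean: {recent['hhs_care'].mean():,.0f}")
```

The same `recent` frame then feeds the feature pipeline and model fit; everything downstream is unchanged.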
5. The Complete Leaderboard
| Model | MAE ↓ | RMSE ↓ | MAPE ↓ |
|---|---|---|---|
| 🏆 XGBoost (Recent) | 5.48 | 7.12 | 0.23% |
| Naïve Persistence | 6.06 | 7.24 | 0.27% |
| Random Forest (Recent) | 6.54 | 8.44 | 0.28% |
| Linear Regression (Recent) | 7.48 | 8.86 | 0.31% |
| Moving Average (w=3) | 9.76 | 11.77 | 0.43% |
| Ridge Regression (Recent) | 17.80 | 22.80 | 0.74% |
| Exponential Smoothing | 86.69 | 97.40 | 3.74% |
| ARIMA(3,1,3) | 144.35 | 161.63 | 6.20% |
| SARIMA | 433.17 | 501.04 | 18.53% |
Only one model out of nine beat the naïve baseline. That model was trained on roughly 30% of the available data.
6. Feature Engineering & What the Model Actually Learned
30+ features were engineered from the five raw columns. The top features by XGBoost importance:
| Feature | Importance | What it captures |
|---|---|---|
| `hhs_care_roll_min_30` | 0.541 | 30-day rolling minimum — the post-break floor |
| `hhs_care_lag_2` | 0.159 | 2-day autoregressive signal |
| `hhs_care_lag_1` | 0.150 | Yesterday's value |
| `cbp_transferred` | 0.122 | Today's pipeline transfers — leading indicator |
The dominance of hhs_care_roll_min_30 (0.541 — over half the total importance) is revealing. The model's primary mechanism is recognising which regime it's in by checking the 30-day floor. The top four features account for 97.2% of total importance.
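The dominant feature is cheap to compute with pandas. This sketch uses a fabricated two-level series to show why a 30-day rolling minimum acts as a regime detector:

```python
import pandas as pd

# Fabricated care-load series: 40 pre-break days, then 40 post-break days
care = pd.Series([6500.0] * 40 + [2200.0] * 40)

# The model's dominant feature: the 30-day rolling minimum. After a
# downward break it converges to the new floor, flagging the regime.
roll_min_30 = care.rolling(window=30, min_periods=1).min()
print(roll_min_30.iloc[0], roll_min_30.iloc[-1])  # 6500.0 2200.0
```

The minimum drops the moment a single post-break day enters the window, so the feature reacts to a downward level shift far faster than a rolling mean would.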
7. The Discharge Model
Discharge demand required separate treatment. The discharge structural break was even more severe:
- Full-window training mean: 173 children/day
- Post-break test mean: ~9 children/day
- Reduction: 94.8%
A June 2024 cutoff still left a massive gap. A March 2025 cutoff reduced the training-test mean gap to 3.67. XGBoost achieved MAE 0.63 children/day — less than one child per day in prediction error.
(Ridge Regression achieved MAE 0.03 — excluded as overfitting. A result that perfect on a small training window is a red flag, not a win.)
8. The Streamlit Dashboard
A 6-page dashboard operationalises both models with a key design decision: zero CSV dependency for predictions.
Here's the reasoning: the training data ends December 2025. If a program administrator uses this app in June 2026, lag values pulled from the historical CSV would be six months stale — completely wrong inputs for the model.
The solution: users enter only what they naturally know from their daily report:
- Last 14 days of care load (from their records)
- Today's CBP transfers, HHS discharges, CBP apprehensions
That's 17 numbers. The app computes all 30+ model features automatically — rolling means, standard deviations, min/max, net flow, calendar features — purely from those 17 inputs. Works for any future date, any year.
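A minimal sketch of that idea, with hypothetical feature names rather than the app's exact schema:

```python
import datetime as dt
import numpy as np

def build_features(last_14_days, cbp_transferred, hhs_discharged,
                   cbp_apprehended, date):
    # Derive model inputs purely from the user's 17 numbers; the
    # feature names here are illustrative, not the app's exact schema.
    x = np.asarray(last_14_days, dtype=float)
    return {
        "lag_1": x[-1],
        "lag_2": x[-2],
        "roll_mean_7": x[-7:].mean(),
        "roll_std_7": x[-7:].std(),
        "roll_min_14": x.min(),
        "roll_max_14": x.max(),
        "net_flow": cbp_transferred - hhs_discharged,
        "cbp_apprehended": float(cbp_apprehended),
        "day_of_week": date.weekday(),
        "month": date.month,
    }

feats = build_features([2300] * 13 + [2310], 45, 38, 120, dt.date(2026, 6, 15))
print(feats["lag_1"], feats["net_flow"])  # 2310.0 7
```

Because every feature is derived from the 17 inputs and the date, nothing in the prediction path ever reads the historical CSV, so the interface cannot go stale.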
Dashboard pages:
- Overview — KPI cards, historical trend, intake/discharge balance, leaderboard
- Care Load Forecast — 14-day input grid, next-day prediction, alert level, scenario comparison
- Discharge Forecast — Same zero-CSV interface, weekly/monthly capacity estimates
- Early Warning System — Alert zones, 90-day history, 5 project KPIs
- Model Performance — Full escalation story, feature importance, all notebook figures
- About & Dataset — Problem statement, dataset details, tech stack
9. Key Takeaways
For data scientists:
1. Diagnose before modelling. Before choosing a model, check stationarity, look for structural breaks, and verify that the training distribution matches the test distribution. This project would have ended at ARIMA if I hadn't investigated why it failed.
2. Training window is a hyperparameter. The right window selection here improved XGBoost MAE from 40.66 to 5.48 — a 7.4× improvement. No hyperparameter tuning of the model itself could have achieved that.
3. More data is not always better. The winning model used 30% of available data. The rest was actively harmful.
4. Validate feature importance. The dominance of hhs_care_roll_min_30 revealed that the model was primarily doing regime detection, not pattern forecasting. That insight validates the approach and suggests the right questions to ask if the regime changes again.
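Takeaway 2 can be operationalised as a sweep. This sketch treats the training-window start as a hyperparameter, with a trivial predict-the-training-mean stand-in in place of XGBoost:

```python
import numpy as np

rng = np.random.default_rng(1)
# Fabricated series: 600 pre-break days, then 200 post-break days
series = np.concatenate([rng.normal(6500, 50, 600), rng.normal(2200, 50, 200)])
test = series[-30:]                 # hold-out: the last 30 days

def window_mae(train):
    # Stand-in "model": predict the training mean (the project used XGBoost)
    return np.mean(np.abs(test - train.mean()))

# Treat the training-window *start* as a hyperparameter and sweep it
results = {start: window_mae(series[start:-30]) for start in (0, 400, 600, 650)}
best = min(results, key=results.get)
print(best in (600, 650))  # True: only post-break windows score well
```

In practice each candidate window would get a full retrain-and-evaluate pass, but the selection logic is exactly this loop.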
For the project evaluator:
"Training window selection is as important as model selection in the presence of structural breaks."
This is the central research finding. It is not a statement about this dataset specifically — it is a general principle applicable to any forecasting domain where abrupt regime changes are possible.
Tech Stack
Python 3.x pandas numpy matplotlib seaborn
XGBoost scikit-learn statsmodels joblib
Streamlit Jupyter Notebooks
Project structure:
uac-forecasting/
├── notebooks/ 01_EDA → 07_Model_Evaluation
├── models/ best_model_recent.joblib + configs
├── data/ raw + processed
├── reports/ figures from all notebooks
└── src/ app1.py (Streamlit dashboard)
References
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD 2016.
- Box, G. E. P., et al. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
- Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.
- Zeileis, A., et al. (2003). Testing and dating of structural changes in practice. Computational Statistics & Data Analysis, 44(1–2), 109–123.
- HHS Office of Refugee Resettlement. UAC Program Data. U.S. Department of Health & Human Services.
Built as part of the Unified Mentor Data Science Internship · March 2026
· GitHub: https://github.com/Sugnik27/uac-forecasting?tab=readme-ov-file
· Live App: https://uac-forecasting.streamlit.app/
· An executive summary prepared for non-technical HHS stakeholders is available here: https://drive.google.com/drive/folders/1di-SvV6YidjTOGIvU8sLPXdahH1qhgIa?usp=sharing