The United States is in the grip of an opioid crisis. Between 2019 and 2023,
fentanyl-related overdose deaths skyrocketed — but the impact is not uniform
across the country. Are the hardest-hit states also the poorest? We built a
full Big Data pipeline to answer that question.
## The Data
We combined two official U.S. government sources:
- CDC VSRR (Vital Statistics Rapid Release) — state-level fentanyl overdose deaths per 12-month rolling period, from 2015 to 2023
- U.S. Census Bureau ACS 5-Year — median household income, poverty rate, and unemployment rate for all 50 states + D.C.
## The Architecture

```
CDC API ──┐
          ├── Apache Spark ── Elasticsearch ── Kibana Dashboard
Census ───┘        │
                   └── scikit-learn (ML)
```
### Step 1: Ingestion
Python scripts fetch both datasets via REST APIs and land them in a raw
datalake (data/raw/), with UTC timestamps for traceability.
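
The landing step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `land_raw`, the file-naming scheme, and the record layout are assumptions. Only the `data/raw/` directory and the UTC timestamp come from the description above. The HTTP fetch is left out so the sketch runs offline.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload, source, root="data/raw"):
    """Write one API payload to the raw zone, stamped with the UTC fetch time."""
    fetched_at = datetime.now(timezone.utc)
    # Hypothetical naming scheme: <root>/<source>/<source>_<YYYYMMDDTHHMMSSZ>.json
    stamp = fetched_at.strftime("%Y%m%dT%H%M%SZ")
    target_dir = Path(root) / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{source}_{stamp}.json"
    # Store the payload unchanged, with the fetch time recorded next to it.
    record = {
        "fetched_at_utc": fetched_at.isoformat(),
        "source": source,
        "data": payload,
    }
    target.write_text(json.dumps(record), encoding="utf-8")
    return target
```

Keeping the payload untouched and putting the timestamp in both the file name and the record makes it possible to replay any past run from the raw zone.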
### Step 2: Spark Processing
Apache Spark formats and combines both sources:
- Filters for fentanyl-specific death indicators
- Joins CDC deaths with Census socioeconomic data by state and year
- Computes a risk score: 40% poverty + 30% unemployment + 30% inverse income
- Outputs Parquet files via PyArrow (Windows-compatible, no winutils needed)
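
To make the risk score concrete, here is a small sketch in plain Python rather than Spark, so it runs without a cluster. The 40/30/30 weights come from the list above. Everything else is an assumption: the min-max scaling, the reading of "inverse income" as one minus normalized income, the function names, and the sample numbers, which are made up and not real ACS figures.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def risk_scores(poverty, unemployment, income):
    """Weighted score per state: 40% poverty, 30% unemployment, 30% inverse income."""
    pov = min_max(poverty)
    unemp = min_max(unemployment)
    # Assumption: "inverse income" means 1 - normalized income,
    # so a lower median income raises the score.
    inv_inc = [1.0 - x for x in min_max(income)]
    return [
        0.4 * p + 0.3 * u + 0.3 * i
        for p, u, i in zip(pov, unemp, inv_inc)
    ]


# Made-up illustrative inputs for three states (not real data).
scores = risk_scores(
    poverty=[10.0, 15.0, 20.0],
    unemployment=[3.0, 5.0, 7.0],
    income=[80000, 60000, 40000],
)
```

With this scaling every score falls between 0 and 1, which keeps the three components comparable before they are weighted.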
### Step 3: Machine Learning
Two scikit-learn models sit on top of the joined data:

Linear Regression: predicts deaths from socioeconomic features.
Result: R² = 0.066, meaning the linear model explains under 7% of the variance in deaths. Socioeconomic indicators alone do not explain deaths linearly.
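
For readers who want to see what R² measures, the sketch below computes it for a one-feature least-squares fit. It is a plain-Python stand-in for illustration, not the scikit-learn `LinearRegression` model the pipeline uses, and the pipeline fits several features rather than one. The function names and the sample numbers are made up.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y):
    """Share of the variance in y explained by the line: 1 - SS_res / SS_tot."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up numbers: a perfectly linear relation and a noisy one.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
noisy = r_squared([1, 2, 3, 4], [5, 1, 6, 2])
```

The perfectly linear series gives an R² of 1, while the noisy one lands close to 0, which is the regime an R² of 0.066 falls into.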
K-Means Clustering (k=3) groups states into risk profiles:

| Cluster | States | Avg Deaths/Year |
|---------|--------|----------------|
| LOW_RISK | 21 states | ~415 |
| MEDIUM_RISK | 17 states | ~1,200 |
| HIGH_RISK | 11 states | ~3,800+ |

Ohio, Pennsylvania, West Virginia, and California consistently appear in the
HIGH_RISK cluster, which we attribute to a combination of large populations
and deep socioeconomic distress.
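
The clustering idea can be sketched with a tiny version of Lloyd's algorithm in plain Python. This is a stand-in for scikit-learn's `KMeans`, not the pipeline's code: it starts from the first k points, whereas scikit-learn defaults to k-means++ initialization. The step that names clusters by ranking their average deaths is our assumption about how labels such as LOW_RISK could be attached, and the (deaths, risk score) pairs are made up.

```python
def kmeans(points, k, iterations=20):
    """Tiny Lloyd's algorithm: returns (centroids, labels)."""
    # Naive deterministic start: the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


def name_clusters(centroids, labels, names=("LOW_RISK", "MEDIUM_RISK", "HIGH_RISK")):
    """Rank clusters by their first coordinate (deaths here) and attach names."""
    order = sorted(range(len(centroids)), key=lambda c: centroids[c][0])
    rename = {cluster: names[rank] for rank, cluster in enumerate(order)}
    return [rename[lab] for lab in labels]


# Made-up (deaths, risk_score) pairs, not real state data.
points = [(400, 0.2), (3900, 0.8), (1200, 0.5), (450, 0.3), (1100, 0.4), (3700, 0.7)]
centroids, labels = kmeans(points, k=3)
risk_labels = name_clusters(centroids, labels)
```

The features are left unscaled here, so the death counts dominate the distance. Standardizing the features first would give the socioeconomic variables more weight.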
### Step 4: Elasticsearch + Kibana
All 637 documents (3 indices) are indexed on Elastic Cloud Serverless, with
`geo_point` coordinates for each state enabling map visualizations.

The Kibana dashboard includes:
- 🗺️ Map: deaths per state with bubble sizing
- 📈 Timeline: the 2019-2021 explosion visible at a glance
- 💰 Scatter plot: income vs. deaths (weak but visible inverse trend)
- 🏷️ Risk table: all 50 states plus D.C. ranked by risk score and cluster

## Key Finding
The correlation between poverty and fentanyl deaths is positive but weak: Pearson r ≈ 0.04 for poverty rate and r ≈ 0.36 for unemployment. This suggests the epidemic crosses socioeconomic lines, although unemployment is a stronger signal than raw poverty.

The K-Means clustering is more revealing: states with combined economic distress AND large populations form the HIGH_RISK cluster.

## The Stack
| Component | Tool |
|-----------|------|
| Ingestion | Python + Requests |
| Processing | Apache Spark 3.5 + PyArrow |
| ML | scikit-learn (LinearRegression, KMeans) |
| Storage | Parquet (datalake) + Elasticsearch 8.13 |
| Visualization | Kibana on Elastic Cloud |
| Orchestration | Apache Airflow (daily DAG) |
| Version control | GitHub |

## Try It Yourself
The full pipeline is open source: 👉 github.com/tristandaniel8/fentanyl-poverty-epidemic

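```bash
git clone https://github.com/tristandaniel8/fentanyl-poverty-epidemic
# Move into the cloned repository so requirements.txt and pipeline.py resolve
cd fentanyl-poverty-epidemic
pip install -r requirements.txt
python pipeline.py
```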