Title:
Fentanyl × Poverty: Building a Big Data Pipeline to Map America's Overdose Epidemic
Tags: bigdata, elasticsearch, spark, python, datascience
The United States is in the grip of an opioid crisis. Between 2019 and 2023,
fentanyl-related overdose deaths skyrocketed — but the impact is not uniform
across the country. Are the hardest-hit states also the poorest? We built a
full Big Data pipeline to answer that question.
## The Data
We combined two official U.S. government sources:
- CDC VSRR (Vital Statistics Rapid Release) — state-level fentanyl overdose deaths per 12-month rolling period, from 2015 to 2023 (83,160 rows)
- U.S. Census Bureau ACS 5-Year — median household income, poverty rate, and unemployment rate for all 50 states + D.C.
## The Architecture

    CDC API ──┐
              ├── Apache Spark ── Elasticsearch ── Kibana Dashboard
    Census ───┘        │
                       └── scikit-learn (ML)
### Step 1 — Ingestion
Python scripts fetch both datasets via REST APIs and land them in a raw
datalake (data/raw/), with UTC timestamps for traceability.
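
As a minimal sketch of the landing step, assuming JSON payloads and a `<source>_<UTC timestamp>.json` naming scheme (neither is confirmed by the repository, and `land_raw` is a hypothetical helper name), a raw extract could be written like this:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(records, source, root="data/raw", now=None):
    """Write one raw extract to <root>/<source>/<source>_<UTC stamp>.json.

    The directory layout and file naming are assumptions for illustration.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target_dir = Path(root) / source
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{source}_{stamp}.json"
    # Keep the fetch time next to the payload so each file is self-describing.
    payload = {
        "source": source,
        "fetched_at_utc": now.isoformat(),
        "records": records,
    }
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path
```

Because every run writes a new timestamped file instead of overwriting the previous one, an earlier extract can always be replayed through the rest of the pipeline.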
### Step 2 — Spark Processing
Apache Spark formats and combines both sources:
- Filters for fentanyl-specific death indicators
- Joins CDC deaths with Census socioeconomic data by state and year
- Computes a risk score: 40% poverty + 30% unemployment + 30% inverse income
- Outputs Parquet files via PyArrow (Snappy compression)
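
The risk score weights can be illustrated in plain Python. This is a sketch, not the Spark job itself: it assumes each indicator is min-max scaled to the 0-1 range across states before weighting, and that "inverse income" means one minus the scaled income. The article states only the weights, so both choices and the function names are assumptions.

```python
def min_max(values):
    """Scale values to the 0-1 range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def risk_scores(poverty, unemployment, income):
    """Composite score: 40% poverty + 30% unemployment + 30% inverse income."""
    p = min_max(poverty)
    u = min_max(unemployment)
    # Higher income should lower the risk, so the scaled income is inverted.
    inv_income = [1.0 - x for x in min_max(income)]
    return [0.4 * a + 0.3 * b + 0.3 * c for a, b, c in zip(p, u, inv_income)]
```

Under these assumptions the score always falls between 0 and 1, with 1 reserved for a state that has the highest poverty, the highest unemployment, and the lowest income at once.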
Pearson correlations computed on the combined dataset:
| Indicator | Pearson r | Interpretation |
|-----------|-----------|----------------|
| Unemployment rate | +0.36 | Moderate positive |
| Median income | +0.14 | Weak positive |
| Poverty rate | +0.04 | Very weak |
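
For readers who want to check figures like these by hand, Pearson's r is the covariance of two columns divided by the product of their standard deviations. The standard-library sketch below shows the computation; the pipeline itself may well rely on a Spark or pandas built-in instead, and `pearson_r` is a name chosen here for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (numerator) and of squared deviations (denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0, such as the poverty coefficient in the table, indicate almost none.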
### Step 3 — Machine Learning
Two scikit-learn models complement the correlation analysis:
Linear Regression — predicts deaths from socioeconomic features.
Result: R²=0.066, MAE=876 deaths. The low R² indicates that, in a linear
model, socioeconomic factors alone explain less than 7% of the variance in
deaths, which is consistent with a multifactorial epidemic.
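
To make the two metrics concrete, here is a standard-library sketch of how R² and MAE are defined. It is not the scikit-learn code used in the project, and `r_squared` and `mae` are illustrative names.

```python
def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target (here, deaths)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

A model that always predicts the average number of deaths scores R² = 0, so a value of 0.066 means the regression removes only about 6.6% of the squared error left by that naive baseline.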
K-Means Clustering (k=3) — groups states into risk profiles:
| Cluster | States | Avg Deaths/Year |
|---------|--------|----------------|
| LOW_RISK | 21 states | ~415 |
| MEDIUM_RISK | 17 states | ~1,245 |
| HIGH_RISK | 11 states | ~1,667+ |
Ohio, Pennsylvania, California, Tennessee and North Carolina consistently
appear in the HIGH_RISK cluster — combining economic distress AND high mortality.
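
K-Means returns arbitrary integer cluster ids, so turning them into LOW_RISK, MEDIUM_RISK, and HIGH_RISK requires an explicit ordering. One plausible approach, sketched below, ranks clusters by their mean deaths; the article does not say how the labels were actually assigned, and `name_clusters` is a hypothetical helper.

```python
from collections import defaultdict

RISK_NAMES = ["LOW_RISK", "MEDIUM_RISK", "HIGH_RISK"]


def name_clusters(cluster_ids, deaths):
    """Map raw cluster ids to risk names, ordered by mean deaths per cluster."""
    by_cluster = defaultdict(list)
    for cid, d in zip(cluster_ids, deaths):
        by_cluster[cid].append(d)
    if len(by_cluster) != len(RISK_NAMES):
        raise ValueError("expected exactly 3 clusters")
    means = {cid: sum(v) / len(v) for cid, v in by_cluster.items()}
    # Lowest mean deaths gets LOW_RISK, highest gets HIGH_RISK.
    ordered = sorted(means, key=means.get)
    mapping = dict(zip(ordered, RISK_NAMES))
    return [mapping[cid] for cid in cluster_ids]
```

With scikit-learn, `cluster_ids` would typically be the `labels_` array of a fitted `KMeans(n_clusters=3)` model.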
### Step 4 — Elasticsearch + Kibana
All 637 documents (3 indices) are indexed on Elastic Cloud Serverless
(GCP US-Central 1), with `geo_point` coordinates for each state enabling map visualizations.
Three indices:
- `fentanyl_latest`: 49 docs, one per state (latest year snapshot)
- `fentanyl_timeseries`: 539 docs, full 2015-2023 history
- `fentanyl_ml`: 49 docs, ML results (predicted_deaths + risk_cluster)

The Kibana dashboard includes:
- 🗺️ Geographic map: deaths per state with bubble sizing
- 📈 Timeline: the 2019-2021 explosion visible at a glance
- 💰 Scatter plots: income and poverty vs. deaths
- 🏷️ Risk table: all indexed states ranked by composite risk score
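
To show what a `geo_point` document might look like, here is a sketch of an index mapping and a bulk action built as plain dictionaries. The field names (`state`, `year`, `deaths`, `risk_score`, `location`) and the `_id` scheme are assumptions, not the project's actual schema.

```python
# Hypothetical mapping for the fentanyl_latest index; field names are assumed.
LATEST_MAPPING = {
    "mappings": {
        "properties": {
            "state": {"type": "keyword"},
            "year": {"type": "integer"},
            "deaths": {"type": "integer"},
            "risk_score": {"type": "float"},
            # geo_point is what lets Kibana Maps place each state.
            "location": {"type": "geo_point"},
        }
    }
}


def to_bulk_action(index, row, lat, lon):
    """Build one bulk action; the state-year id makes re-indexing idempotent."""
    return {
        "_index": index,
        "_id": f"{row['state']}-{row['year']}",
        "_source": {**row, "location": {"lat": lat, "lon": lon}},
    }
```

Actions in this shape can be passed to `elasticsearch.helpers.bulk` from the official Python client. Using a deterministic `_id` means a daily re-run overwrites existing documents instead of duplicating them.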
### Step 5 — Airflow Orchestration
An Apache Airflow DAG (`fentanyl_poverty_pipeline`) orchestrates the entire pipeline with a `@daily` schedule. The two ingestions run in parallel, followed by Spark processing, then Elasticsearch indexing.

    ingest_cdc_data -----+
                         +--> spark_format_and_combine --> index_to_elasticsearch
    ingest_census_data --+
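
The dependency structure can be checked without installing Airflow. The sketch below is not the DAG file itself: it encodes the four task names from the article in a plain dictionary and uses the standard library's `graphlib` to derive which tasks can run together. `execution_waves` is an illustrative helper.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "ingest_cdc_data": set(),
    "ingest_census_data": set(),
    "spark_format_and_combine": {"ingest_cdc_data", "ingest_census_data"},
    "index_to_elasticsearch": {"spark_format_and_combine"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks within a wave have no mutual dependency."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `execution_waves(DEPENDENCIES)` yields three waves: both ingestion tasks first, then the Spark step, then indexing. This matches the parallel-then-sequential shape described above.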
## Key Finding
The correlation between poverty and fentanyl deaths is positive but weak
(r=0.04). Unemployment is the strongest predictor (r=0.36). But even
combined, socioeconomic factors explain less than 7% of the variance —
the epidemic transcends economic lines.
California is the perfect example: one of the wealthiest states
($81,400 median income) yet number one in absolute deaths (5,649/year).
The K-Means clustering is more revealing: states with combined
economic distress AND large populations form the HIGH_RISK cluster.
The 2019 explosion remains the defining event — when illicit fentanyl
flooded the drug supply.
## The Stack
| Component | Tool |
|---|---|
| Ingestion | Python + Requests |
| Processing | Apache Spark 3.5 + PyArrow |
| ML | scikit-learn (LinearRegression, KMeans) |
| Storage | Parquet (datalake) + Elasticsearch 8.13 |
| Visualization | Kibana on Elastic Cloud (GCP) |
| Orchestration | Apache Airflow 2.9.2 (@daily DAG) |
| Version control | GitHub |
## Try It Yourself
The full pipeline is open source:
👉 github.com/tristandaniel8/fentanyl-poverty-epidemic
```bash
git clone https://github.com/tristandaniel8/fentanyl-poverty-epidemic
cd fentanyl-poverty-epidemic  # run the next commands from the repository root
pip install -r requirements.txt
python pipeline.py  # ingest → spark → ml → elasticsearch
```
Top comments (4)
Another great project you've completed! GG Tristan 🔥
Thank you very much! Appreciate it!
Impressive work that should be read at the highest levels; it could be useful.
You're going a bit far, haha. Thank you so much!!