Fentanyl × Poverty: Building a Big Data Pipeline to Map America's Overdose Epidemic

#bigdata #elasticsearch #spark #python #datascience

The United States is in the grip of an opioid crisis. Between 2019 and 2023,
fentanyl-related overdose deaths skyrocketed — but the impact is not uniform
across the country. Are the hardest-hit states also the poorest? We built a
full Big Data pipeline to answer that question.

The Data

We combined two official U.S. government sources:

CDC VSRR (Vital Statistics Rapid Release) — state-level fentanyl overdose deaths per 12-month rolling period, from 2015 to 2023 (83,160 rows)
U.S. Census Bureau ACS 5-Year — median household income, poverty rate, and unemployment rate for all 50 states + D.C.

The Architecture

```
CDC API ──┐
          ├── Apache Spark ── Elasticsearch ── Kibana Dashboard
Census ───┘        │
                   └── scikit-learn (ML)
```

Step 1 — Ingestion

Python scripts fetch both datasets via REST APIs and land them in a raw
datalake (data/raw/), with UTC timestamps for traceability.
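
As a rough illustration of the landing step, the sketch below writes one fetched payload into data/raw/ under a UTC-stamped file name. The function names (`fetch_json`, `land_raw`), the file-naming pattern, the wrapper fields, and the use of the standard library's `urllib` in place of Requests are all assumptions made to keep the example dependency-free; the repository's actual scripts may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 30.0):
    """Download and decode a JSON document from a REST endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def land_raw(payload, source: str, raw_dir: str = "data/raw", now=None) -> Path:
    """Write one payload to the raw zone under a UTC-timestamped file name."""
    # `now` is injectable so the naming can be checked without touching the clock.
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(raw_dir) / f"{source}_{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Keep the ingestion time inside the file as well as in its name (hypothetical layout).
    record = {"source": source, "ingested_at_utc": now.isoformat(), "data": payload}
    target.write_text(json.dumps(record), encoding="utf-8")
    return target
```

A run would then look like `land_raw(fetch_json(url), "cdc")`, with `url` set to the endpoint of the dataset being ingested. Writing a new file per run, rather than overwriting, is what makes each ingestion traceable afterwards.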

Step 2 — Spark Processing

Apache Spark formats and combines both sources:

Filters for fentanyl-specific death indicators
Joins CDC deaths with Census socioeconomic data by state and year
Computes a risk score: 40% poverty + 30% unemployment + 30% inverse income
Outputs Parquet files via PyArrow (Snappy compression)

Pearson correlations computed on the combined dataset:

| Indicator | Pearson r | Interpretation |
|-----------|-----------|----------------|
| Unemployment rate | +0.36 | Moderate positive |
| Median income | +0.14 | Weak positive |
| Poverty rate | +0.04 | Very weak |

Step 3 — Machine Learning

Two scikit-learn models add predictive power:

Linear Regression: predicts deaths from socioeconomic features. Result: R²=0.066, MAE=876 deaths. The low R² suggests the epidemic is multifactorial: socioeconomic factors alone explain less than 7% of the variance.
K-Means Clustering (k=3): groups states into risk profiles:

| Cluster | States | Avg Deaths/Year |
|---------|--------|-----------------|
| LOW_RISK | 21 states | ~415 |
| MEDIUM_RISK | 17 states | ~1,245 |
| HIGH_RISK | 11 states | ~1,667+ |

Ohio, Pennsylvania, California, Tennessee and North Carolina consistently appear in the HIGH_RISK cluster, which combines economic distress and high mortality.

Step 4 — Elasticsearch + Kibana

All 637 documents are indexed on Elastic Cloud Serverless (GCP US-Central 1), with geo_point coordinates for each state enabling map visualizations. They are spread across three indices:

fentanyl_latest: 49 docs, one per state (latest-year snapshot)
fentanyl_timeseries: 539 docs, full 2015-2023 history
fentanyl_ml: 49 docs, ML results (predicted_deaths + risk_cluster)

The Kibana dashboard includes:

🗺️ Geographic map: deaths per state with bubble sizing
📈 Timeline: the 2019-2021 explosion visible at a glance
💰 Scatter plots: income and poverty vs. deaths
🏷️ Risk table: all 51 jurisdictions (50 states + D.C.) ranked by composite risk score

Step 5 — Airflow Orchestration

An Apache Airflow DAG (fentanyl_poverty_pipeline) orchestrates the entire pipeline on a @daily schedule. The two ingestions run in parallel, followed by Spark processing, then Elasticsearch indexing.

```
ingest_cdc_data    --+
                     +--> spark_format_and_combine --> index_to_elasticsearch
ingest_census_data --+
```
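
To make the task ordering concrete without requiring an Airflow install, the sketch below models the same four tasks with Python's standard `graphlib` module and groups them into waves that can run concurrently. This is not the project's DAG file: only the task names come from the pipeline described here, while `DEPENDENCIES` and `execution_waves` are hypothetical names used for this illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "ingest_cdc_data": set(),
    "ingest_census_data": set(),
    "spark_format_and_combine": {"ingest_cdc_data", "ingest_census_data"},
    "index_to_elasticsearch": {"spark_format_and_combine"},
}


def execution_waves(dependencies):
    """Group tasks into successive waves; tasks in one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The first wave contains both ingestion tasks, which is the parallelism the DAG relies on. In Airflow itself the same ordering is usually declared with the bit-shift operator, for example `[ingest_cdc, ingest_census] >> spark >> index`, where those variable names are placeholders for the task objects.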

Key Finding

The correlation between poverty and fentanyl deaths is positive but very weak
(r=0.04). Unemployment is the strongest single predictor (r=0.36). Even
combined, socioeconomic factors explain less than 7% of the variance:
the epidemic transcends economic lines.

California is the clearest example: one of the wealthiest states
($81,400 median income), yet first in absolute deaths (5,649/year).

The K-Means clustering is more revealing: states that combine
economic distress with large populations form the HIGH_RISK cluster.

The 2019 explosion remains the defining event, when illicit fentanyl
flooded the drug supply.
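
For readers who want to see what the r values in this section measure, here is a minimal, dependency-free sketch of the Pearson coefficient. The function name `pearson_r` and the numbers in the usage lines are made up for illustration; they are not taken from the CDC or Census data, and the pipeline itself may compute the statistic differently (for example through Spark or pandas).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of xs and ys scaled by their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariance / spread


# Synthetic toy values, NOT real state data.
unemployment_pct = [3.1, 4.0, 4.4, 5.2, 6.0]
deaths = [400, 900, 650, 1500, 1100]
r = pearson_r(unemployment_pct, deaths)
print(f"r = {r:.2f}")
```

The result always lies between -1 and +1. A value near 0, such as the 0.04 reported for poverty, means the two variables show almost no linear relationship, which does not rule out a non-linear one.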

The Stack

| Component | Tool |
|-----------|------|
| Ingestion | Python + Requests |
| Processing | Apache Spark 3.5 + PyArrow |
| ML | scikit-learn (LinearRegression, KMeans) |
| Storage | Parquet (datalake) + Elasticsearch 8.13 |
| Visualization | Kibana on Elastic Cloud (GCP) |
| Orchestration | Apache Airflow 2.9.2 (@daily DAG) |
| Version control | GitHub |

Try It Yourself

The full pipeline is open source:
👉 github.com/tristandaniel8/fentanyl-poverty-epidemic


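```bash
git clone https://github.com/tristandaniel8/fentanyl-poverty-epidemic
# Move into the cloned repository so requirements.txt and pipeline.py resolve
cd fentanyl-poverty-epidemic
pip install -r requirements.txt
python pipeline.py  # ingest → spark → ml → elasticsearch
```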