DEV Community

StiiWann

Fentanyl Poverty: Building a Big Data Pipeline to Map America's Overdose Epidemic

The United States is in the grip of an opioid crisis. Between 2019 and 2023,
fentanyl-related overdose deaths skyrocketed — but the impact is not uniform
across the country. Are the hardest-hit states also the poorest? We built a
full Big Data pipeline to answer that question.

The Data

We combined two official U.S. government sources:

  • CDC VSRR (Vital Statistics Rapid Release) — state-level fentanyl overdose deaths per 12-month rolling period, from 2015 to 2023
  • U.S. Census Bureau ACS 5-Year — median household income, poverty rate, and unemployment rate for all 50 states + D.C.
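Combining the two sources comes down to a join on state and year. The pipeline does this in Spark (Step 2); the plain-Python sketch below only illustrates the join logic. The field names (`state`, `year`, `deaths`, `poverty_rate`) and the function name are assumptions for illustration, not the repo's actual schema.

```python
def join_by_state_year(deaths_rows, census_rows):
    """Inner-join two lists of dicts on the (state, year) key.

    Plain-Python stand-in for the Spark join; field names are hypothetical.
    """
    # Index the Census rows by their join key for O(1) lookups.
    census_index = {(row["state"], row["year"]): row for row in census_rows}

    joined = []
    for row in deaths_rows:
        match = census_index.get((row["state"], row["year"]))
        if match is not None:  # inner join: drop rows without a Census match
            joined.append({**match, **row})
    return joined
```

Rows with no socioeconomic match are dropped here; a left join would keep them with missing values instead, which matters if the two sources do not cover the same years.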

The Architecture

CDC API ──┐
          ├── Apache Spark ── Elasticsearch ── Kibana Dashboard
Census ───┘        │
                   └── scikit-learn (ML)

Step 1 — Ingestion

Python scripts fetch both datasets via REST APIs and land them in a raw
datalake (data/raw/), with UTC timestamps for traceability.
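A minimal sketch of what landing a payload in the raw zone could look like. The function names, the file naming scheme, and the JSON envelope are assumptions for illustration; the repo's ingestion scripts may be organized differently. The endpoint URL is deliberately left to the caller.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def fetch_json(url, params=None):
    """Fetch a JSON payload over REST (URL and params are supplied by the caller)."""
    import requests  # imported lazily so land_raw() works without it installed

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def land_raw(source, records, raw_dir="data/raw"):
    """Write one fetched payload to the raw zone, stamped with the UTC fetch time."""
    fetched_at = datetime.now(timezone.utc)
    # Hypothetical layout: data/raw/<source>/<source>_<UTC timestamp>.json
    target = Path(raw_dir) / source / f"{source}_{fetched_at:%Y%m%dT%H%M%SZ}.json"
    target.parent.mkdir(parents=True, exist_ok=True)

    envelope = {
        "source": source,
        "fetched_at_utc": fetched_at.isoformat(),
        "records": records,
    }
    target.write_text(json.dumps(envelope, indent=2), encoding="utf-8")
    return target
```

Keeping the timestamp in both the file name and the payload means a raw file stays traceable even if it is later moved or renamed.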

Step 2 — Spark Processing

Apache Spark formats and combines both sources:

  • Filters for fentanyl-specific death indicators
  • Joins CDC deaths with Census socioeconomic data by state and year
  • Computes a risk score: 40% poverty + 30% unemployment + 30% inverse income
  • Outputs Parquet files via PyArrow (Windows-compatible, no winutils needed)

Step 3 — Machine Learning

Two scikit-learn models add predictive power:

Linear Regression: predicts deaths from socioeconomic features. Result: R²=0.066, meaning these features explain under 7% of the variance in deaths in a linear model.

K-Means Clustering (k=3): groups states into risk profiles:

| Cluster | States | Avg Deaths/Year |
|---------|--------|----------------|
| LOW_RISK | 21 states | ~415 |
| MEDIUM_RISK | 17 states | ~1,200 |
| HIGH_RISK | 11 states | ~3,800+ |

Ohio, Pennsylvania, West Virginia and California consistently appear in the HIGH_RISK cluster, driven by both high absolute populations and deep socioeconomic distress.

Step 4 — Elasticsearch + Kibana

All 637 documents (3 indices) are indexed on Elastic Cloud Serverless, with geo_point coordinates for each state enabling map visualizations.

The Kibana dashboard includes:
  • 🗺️ Map — deaths per state with bubble sizing
  • 📈 Timeline — the 2019-2021 explosion visible at a glance
  • 💰 Scatter plot — income vs. deaths (weak but visible inverse trend)
  • 🏷️ Risk table — all 50 states plus D.C. ranked by risk score and cluster

Key Finding

The correlation between poverty rate and fentanyl deaths is close to zero (Pearson r ≈ 0.04), while unemployment shows a moderate positive correlation (r ≈ 0.36). This suggests the epidemic crosses socioeconomic lines, with unemployment a stronger signal than the poverty rate.

The K-Means clustering is more revealing: states with combined economic distress and large populations form the HIGH_RISK cluster.

The Stack

| Component | Tool |
|-----------|------|
| Ingestion | Python + Requests |
| Processing | Apache Spark 3.5 + PyArrow |
| ML | scikit-learn (LinearRegression, KMeans) |
| Storage | Parquet (datalake) + Elasticsearch 8.13 |
| Visualization | Kibana on Elastic Cloud |
| Orchestration | Apache Airflow (daily DAG) |
| Version control | GitHub |

Try It Yourself

The full pipeline is open source: 👉 github.com/tristandaniel8/fentanyl-poverty-epidemic

```bash
git clone https://github.com/tristandaniel8/fentanyl-poverty-epidemic
cd fentanyl-poverty-epidemic
pip install -r requirements.txt
python pipeline.py
```
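As a standalone illustration of the Step 2 risk score (40% poverty + 30% unemployment + 30% inverse income), here is a minimal sketch in plain Python. The post does not say how the three inputs are brought onto a common scale, so min-max normalization across states is an assumption here, as are the field and function names. The Spark job in the repo may compute it differently.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def risk_scores(rows, w_poverty=0.4, w_unemployment=0.3, w_income=0.3):
    """Weighted risk score per state; higher means more socioeconomic distress.

    Assumes min-max scaling across the given rows (not confirmed by the post).
    """
    poverty = min_max([r["poverty_rate"] for r in rows])
    unemployment = min_max([r["unemployment_rate"] for r in rows])
    income = min_max([r["median_income"] for r in rows])

    scores = {}
    for row, p, u, inc in zip(rows, poverty, unemployment, income):
        # "Inverse income": the lowest-income state contributes the most.
        scores[row["state"]] = (
            w_poverty * p + w_unemployment * u + w_income * (1.0 - inc)
        )
    return scores


# Toy rows with made-up values, only to show the call shape.
toy_rows = [
    {"state": "A", "poverty_rate": 10.0, "unemployment_rate": 3.0, "median_income": 80000},
    {"state": "B", "poverty_rate": 20.0, "unemployment_rate": 5.0, "median_income": 50000},
    {"state": "C", "poverty_rate": 15.0, "unemployment_rate": 4.0, "median_income": 65000},
]
toy_scores = risk_scores(toy_rows)
```

Because the weights sum to 1 and every input is scaled to [0, 1], the score itself stays in [0, 1], which makes it easy to rank states or bin them in a dashboard.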
