<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: StiiWann</title>
    <description>The latest articles on DEV Community by StiiWann (@stiiwann_35eb8bb2cf8dc53e).</description>
    <link>https://dev.to/stiiwann_35eb8bb2cf8dc53e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3941032%2F00c211bf-b3b5-4a26-8e2e-9eec25965287.jpg</url>
      <title>DEV Community: StiiWann</title>
      <link>https://dev.to/stiiwann_35eb8bb2cf8dc53e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stiiwann_35eb8bb2cf8dc53e"/>
    <language>en</language>
    <item>
      <title>Fentanyl × Poverty: Building a Big Data Pipeline to Map America's Overdose Epidemic</title>
      <dc:creator>StiiWann</dc:creator>
      <pubDate>Tue, 19 May 2026 20:53:49 +0000</pubDate>
      <link>https://dev.to/stiiwann_35eb8bb2cf8dc53e/fentanyl-x-poverty-building-a-big-data-pipeline-to-map-americas-overdose-epidemic-5dhm</link>
      <guid>https://dev.to/stiiwann_35eb8bb2cf8dc53e/fentanyl-x-poverty-building-a-big-data-pipeline-to-map-americas-overdose-epidemic-5dhm</guid>
      <description>&lt;p&gt;Title:&lt;br&gt;
Fentanyl × Poverty: Building a Big Data Pipeline to Map America's Overdose Epidemic&lt;/p&gt;

&lt;p&gt;Tags: bigdata, elasticsearch, spark, python, datascience&lt;/p&gt;

&lt;p&gt;The United States is in the grip of an opioid crisis. Between 2019 and 2023, &lt;br&gt;
fentanyl-related overdose deaths skyrocketed — but the impact is not uniform &lt;br&gt;
across the country. Are the hardest-hit states also the poorest? We built a &lt;br&gt;
full Big Data pipeline to answer that question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;We combined two official U.S. government sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CDC VSRR&lt;/strong&gt; (Vital Statistics Rapid Release) — state-level fentanyl overdose 
deaths per 12-month rolling period, from 2015 to 2023 (83,160 rows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;U.S. Census Bureau ACS 5-Year&lt;/strong&gt; — median household income, poverty rate, 
and unemployment rate for all 50 states + D.C.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CDC API ──┐
          ├── Apache Spark ── Elasticsearch ── Kibana Dashboard
Census ───┘        │
                   └── scikit-learn (ML)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 1 — Ingestion
&lt;/h3&gt;

&lt;p&gt;Python scripts fetch both datasets via REST APIs and land them in a raw &lt;br&gt;
datalake (&lt;code&gt;data/raw/&lt;/code&gt;), with UTC timestamps for traceability.&lt;/p&gt;
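
&lt;p&gt;The landing step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name &lt;code&gt;land_raw&lt;/code&gt;, the per-source subfolder, and the file naming scheme are hypothetical and not taken from the repository. The HTTP call is left out so the example stays runnable offline; only the UTC-stamped write is shown.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(records, source, raw_dir="data/raw", now=None):
    """Write one raw payload to the datalake with a UTC timestamp in its name.

    `source` (e.g. "cdc" or "census") and the file layout are illustrative
    assumptions, not the project's actual convention.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(raw_dir) / source / f"{source}_{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Keep the ingestion time inside the file too, for traceability.
    payload = {"ingested_at_utc": now.isoformat(), "records": records}
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target


# Example with a fixed clock and a temporary directory (toy record, not real data).
fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
demo_dir = tempfile.mkdtemp()
path = land_raw([{"state": "OH", "deaths": 1}], "cdc", raw_dir=demo_dir, now=fixed)
print(path.name)  # cdc_20240102T030405Z.json
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Putting the timestamp in both the file name and the payload means a later run does not overwrite an earlier snapshot, and each file records when it was fetched.&lt;/p&gt;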

&lt;h3&gt;
  
  
  Step 2 — Spark Processing
&lt;/h3&gt;

&lt;p&gt;Apache Spark formats and combines both sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filters for fentanyl-specific death indicators&lt;/li&gt;
&lt;li&gt;Joins CDC deaths with Census socioeconomic data by state and year&lt;/li&gt;
&lt;li&gt;Computes a &lt;strong&gt;risk score&lt;/strong&gt;: 40% poverty + 30% unemployment + 30% inverse income&lt;/li&gt;
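&lt;li&gt;Outputs Parquet files via PyArrow (Snappy compression)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pearson correlations with fentanyl deaths, computed on the combined dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Indicator&lt;/th&gt;
&lt;th&gt;Pearson r&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unemployment rate&lt;/td&gt;
&lt;td&gt;+0.36&lt;/td&gt;
&lt;td&gt;Moderate positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median income&lt;/td&gt;
&lt;td&gt;+0.14&lt;/td&gt;
&lt;td&gt;Weak positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poverty rate&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;td&gt;Very weak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3 — Machine Learning
&lt;/h3&gt;

&lt;p&gt;Two scikit-learn models add predictive power:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear Regression&lt;/strong&gt; — predicts deaths from socioeconomic features. &lt;br&gt;
Result: R²=0.066, MAE=876 deaths. The low R² suggests the epidemic &lt;br&gt;
is multifactorial: socioeconomic factors alone explain less than 7% &lt;br&gt;
of the variance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K-Means Clustering (k=3)&lt;/strong&gt; — groups states into risk profiles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;States&lt;/th&gt;
&lt;th&gt;Avg Deaths/Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LOW_RISK&lt;/td&gt;
&lt;td&gt;21 states&lt;/td&gt;
&lt;td&gt;~415&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEDIUM_RISK&lt;/td&gt;
&lt;td&gt;17 states&lt;/td&gt;
&lt;td&gt;~1,245&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIGH_RISK&lt;/td&gt;
&lt;td&gt;11 states&lt;/td&gt;
&lt;td&gt;~1,667+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ohio, Pennsylvania, California, Tennessee and North Carolina consistently &lt;br&gt;
appear in the HIGH_RISK cluster, which combines economic distress AND high mortality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Elasticsearch + Kibana
&lt;/h3&gt;

&lt;p&gt;All 637 documents are indexed on &lt;strong&gt;Elastic Cloud Serverless 
(GCP US-Central 1)&lt;/strong&gt;, with &lt;code&gt;geo_point&lt;/code&gt; coordinates for each state enabling 
map visualizations. They are split across three indices:&lt;/p&gt;

&lt;ul&gt;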
&lt;li&gt;
&lt;code&gt;fentanyl_latest&lt;/code&gt; — 49 docs, one per state (latest year snapshot)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fentanyl_timeseries&lt;/code&gt; — 539 docs, full 2015-2023 history&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fentanyl_ml&lt;/code&gt; — 49 docs, ML results (predicted_deaths + risk_cluster)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Kibana dashboard includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗺️ &lt;strong&gt;Geographic map&lt;/strong&gt; — deaths per state with bubble sizing&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Timeline&lt;/strong&gt; — the 2019-2021 explosion visible at a glance&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Scatter plots&lt;/strong&gt; — income and poverty vs. deaths&lt;/li&gt;
&lt;li&gt;🏷️ &lt;strong&gt;Risk table&lt;/strong&gt; — all 50 states + D.C. ranked by composite risk score&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5 — Airflow Orchestration
&lt;/h3&gt;

&lt;p&gt;An Apache Airflow DAG (&lt;code&gt;fentanyl_poverty_pipeline&lt;/code&gt;) orchestrates the 
entire pipeline with a &lt;code&gt;@daily&lt;/code&gt; schedule. The two ingestions run in 
parallel, followed by Spark processing, then Elasticsearch indexing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingest_cdc_data    --+
                     +--&amp;gt; spark_format_and_combine --&amp;gt; index_to_elasticsearch
ingest_census_data --+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
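
&lt;p&gt;The task ordering described in Step 5 can be illustrated without Airflow. The sketch below is not the project's DAG file: it swaps in the standard-library &lt;code&gt;graphlib&lt;/code&gt; module to show which tasks become runnable together. Only the four task names come from the description above; everything else is an assumption made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dependencies = {
    "ingest_cdc_data": set(),
    "ingest_census_data": set(),
    "spark_format_and_combine": {"ingest_cdc_data", "ingest_census_data"},
    "index_to_elasticsearch": {"spark_format_and_combine"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)  # mark the whole wave as finished

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first wave holds both ingestion tasks, which is what allows a scheduler to run them in parallel before the Spark and indexing steps.&lt;/p&gt;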

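&lt;p&gt;Returning to the composite risk score from Step 2 (40% poverty + 30% unemployment + 30% inverse income): the article does not say how the three indicators are put on a common scale, so the sketch below assumes min-max normalization across states and takes inverse income as one minus normalized income. The function names and the sample rows are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
WEIGHTS = {"poverty": 0.4, "unemployment": 0.3, "inverse_income": 0.3}


def min_max(values):
    """Rescale a list of numbers to the 0..1 range (all zeros if constant)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def risk_scores(rows):
    """Return {state: score}; higher means more socioeconomic distress."""
    poverty = min_max([r["poverty_rate"] for r in rows])
    unemployment = min_max([r["unemployment_rate"] for r in rows])
    income = min_max([r["median_income"] for r in rows])
    scores = {}
    for i, row in enumerate(rows):
        scores[row["state"]] = round(
            WEIGHTS["poverty"] * poverty[i]
            + WEIGHTS["unemployment"] * unemployment[i]
            + WEIGHTS["inverse_income"] * (1.0 - income[i]),  # low income raises risk
            3,
        )
    return scores


# Made-up rows for illustration only, not Census figures.
rows = [
    {"state": "A", "poverty_rate": 10.0, "unemployment_rate": 4.0, "median_income": 80000},
    {"state": "B", "poverty_rate": 20.0, "unemployment_rate": 8.0, "median_income": 50000},
    {"state": "C", "poverty_rate": 15.0, "unemployment_rate": 6.0, "median_income": 65000},
]
scores = risk_scores(rows)
print(scores)  # {'A': 0.0, 'B': 1.0, 'C': 0.5}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this scaling the score always falls between 0 and 1, which keeps it comparable across years, but a different normalization (z-scores, for instance) would rank states differently.&lt;/p&gt;
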
&lt;h2&gt;
  
  
  Key Finding
&lt;/h2&gt;

&lt;p&gt;The correlation between poverty and fentanyl deaths is &lt;strong&gt;positive but very weak&lt;/strong&gt; &lt;br&gt;
(r=0.04). Unemployment is the strongest single predictor (r=0.36). Even &lt;br&gt;
combined in the linear regression, socioeconomic factors explain less than 7% &lt;br&gt;
of the variance (R²=0.066), so the epidemic cuts across economic lines.&lt;/p&gt;

&lt;p&gt;California is the clearest example: one of the wealthiest states &lt;br&gt;
($81,400 median income) yet number one in absolute deaths (5,649/year), &lt;br&gt;
a ranking that also reflects its large population.&lt;/p&gt;

&lt;p&gt;The K-Means clustering is more revealing: states with &lt;em&gt;combined&lt;/em&gt; &lt;br&gt;
economic distress AND large populations form the HIGH_RISK cluster. &lt;br&gt;
The 2019 explosion remains the defining event, when illicit fentanyl &lt;br&gt;
flooded the drug supply.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion&lt;/td&gt;
&lt;td&gt;Python + Requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing&lt;/td&gt;
&lt;td&gt;Apache Spark 3.5 + PyArrow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML&lt;/td&gt;
&lt;td&gt;scikit-learn (LinearRegression, KMeans)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Parquet (datalake) + Elasticsearch 8.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Kibana on Elastic Cloud (GCP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Apache Airflow 2.9.2 (@daily DAG)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version control&lt;/td&gt;
&lt;td&gt;GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full pipeline is open source:&lt;br&gt;
👉 &lt;a href="https://github.com/tristandaniel8/fentanyl-poverty-epidemic" rel="noopener noreferrer"&gt;github.com/tristandaniel8/fentanyl-poverty-epidemic&lt;/a&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/tristandaniel8/fentanyl-poverty-epidemic
cd fentanyl-poverty-epidemic
pip install -r requirements.txt
python pipeline.py  # ingest → spark → ml → elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>bigdata</category>
      <category>elasticsearch</category>
      <category>spark</category>
      <category>python</category>
    </item>
  </channel>
</rss>
