From Messy Blood Tests to Actionable Insights: Building Your Personal Health Data Warehouse with dbt & DuckDB 🩸💻

We’ve all been there: a stack of PDF medical reports from the last five years, each with different units, varying reference ranges, and cryptic labels like "hS-CRP" or "HbA1c." Trying to track your health trends over time manually is a nightmare.

In this tutorial, we are going to fix that. We’ll dive into biomarker data cleaning, dbt health modeling, and DuckDB OLAP performance to turn those scattered data points into a high-performance personal health data warehouse. If you’ve been looking for a real-world project to master data engineering for health records, you’re in the right place.

Pro Tip: If you're looking for more production-ready patterns for health data engineering or advanced analytics architectures, definitely check out the deep dives over at WellAlly Blog.


The Architecture: From Raw Pixels to Analytics

To handle years of biomarkers, we need a robust ELT (Extract, Load, Transform) pipeline. We’ll use Python for the initial extraction, DuckDB as our lightning-fast analytical engine, and dbt (data build tool) to handle the heavy lifting of data modeling.

graph TD
    A[Raw Health Data: CSV/JSON/PDF] -->|Python Extraction| B[(DuckDB Raw Layer)]
    B --> C{dbt Transformation}
    C --> D[stg_biomarkers: Cleaned & Cast]
    C --> E[int_health_metrics: Standardized Units]
    C --> F[fct_biomarker_trends: Final Analytics View]
    F --> G[Apache Superset Visualization]

    style B fill:#fff,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px

Prerequisites

Before we start, ensure you have the following installed:

  • Python 3.9+
  • dbt-duckdb (pip install dbt-duckdb)
  • DuckDB

Step 1: Ingesting the "Mess" with Python

First, we need to get our raw data into DuckDB. For this example, let's assume you've converted your PDFs into a structured CSV format.

import duckdb
import pandas as pd

# Connect to (or create) our local health warehouse
con = duckdb.connect('health_warehouse.duckdb')

# Load our messy CSV
# Assume columns: test_date, marker_name, value, unit, lab_source
raw_data = pd.read_csv('my_blood_work_2018_2023.csv')

# Create a persistent table from the DataFrame
# (DuckDB's replacement scans let SQL reference the in-scope `raw_data` variable)
con.execute("CREATE TABLE IF NOT EXISTS raw_health_data AS SELECT * FROM raw_data")

print(f"✅ Ingested {len(raw_data)} biomarker records into DuckDB!")

Step 2: The dbt Setup

Now for the magic. dbt allows us to write modular SQL to clean this data. First, configure your profiles.yml to point to your DuckDB file:

# ~/.dbt/profiles.yml
health_dw:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: health_warehouse.duckdb
      extensions:
        - icu
        - parquet
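
dbt also needs a dbt_project.yml whose profile matches. Here's a minimal sketch for this setup (the project and folder names are my own choices, so adjust to taste):

# dbt_project.yml (at the root of your dbt project)
name: health_dw
version: '1.0.0'
profile: health_dw  # must match the profile name in profiles.yml

model-paths: ["models"]
seed-paths: ["seeds"]

models:
  health_dw:
    staging:
      +materialized: view
    intermediate:
      +materialized: table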

Step 3: Standardizing Biomarkers (The Meat)

Biomarkers are tricky because labs use different units (e.g., mg/dL vs. mmol/L). We need a "Standardization Layer."

In models/staging/stg_biomarkers.sql:

-- Clean up strings and dates
WITH raw_source AS (
    SELECT * FROM {{ source('raw', 'raw_health_data') }}
)

SELECT
    strptime(test_date, '%Y-%m-%d')::DATE as report_date,
    upper(trim(marker_name)) as marker_key,
    TRY_CAST(value AS DOUBLE) as original_value, -- NULL instead of an error on junk like 'N/A'
    trim(unit) as original_unit,
    lab_source
FROM raw_source
WHERE TRY_CAST(value AS DOUBLE) IS NOT NULL
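
One gotcha: {{ source('raw', 'raw_health_data') }} only resolves if the source is declared. A minimal models/staging/sources.yml for our setup (the file name is just a convention) could look like this:

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    schema: main  # DuckDB's default schema, where our Python script created the table
    tables:
      - name: raw_health_data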

Next, in models/intermediate/int_biomarkers_standardized.sql, we handle the unit conversions. This is where you'd typically look up a reference table, but we can use a CASE statement for simplicity:

-- Standardizing everything to common units (e.g., mg/dL for Glucose)
SELECT
    report_date,
    marker_key,
    CASE 
        -- Conversion logic: mmol/L to mg/dL for Glucose (x18)
        WHEN marker_key = 'GLUCOSE' AND original_unit = 'mmol/L' THEN original_value * 18
        ELSE original_value
    END as standardized_value,
    CASE 
        WHEN marker_key = 'GLUCOSE' THEN 'mg/dL'
        ELSE original_unit
    END as standardized_unit
FROM {{ ref('stg_biomarkers') }}
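
As your marker list grows, that CASE statement gets unwieldy. One option is a dbt seed as the reference table. Here's a sketch, assuming a hand-maintained seeds/unit_conversions.csv (the file and column names are my own):

marker_key,from_unit,to_unit,factor
GLUCOSE,mmol/L,mg/dL,18.0
TRIGLYCERIDES,mmol/L,mg/dL,88.57

Then the intermediate model becomes a join instead of a CASE:

-- Seed-based variant of int_biomarkers_standardized.sql
SELECT
    b.report_date,
    b.marker_key,
    b.original_value * coalesce(c.factor, 1) as standardized_value,
    coalesce(c.to_unit, b.original_unit) as standardized_unit
FROM {{ ref('stg_biomarkers') }} b
LEFT JOIN {{ ref('unit_conversions') }} c
    ON b.marker_key = c.marker_key
    AND b.original_unit = c.from_unit

Run dbt seed once to load the CSV before dbt run.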

Step 4: The "Official" Way to Scale

While this DIY setup is great for personal use, scaling this to clinical-grade data requires handling HL7 FHIR standards and complex HIPAA-compliant schemas.

If you're interested in how the pros handle large-scale biomedical data ingestion and advanced health-tech architectures, I highly recommend checking out the technical guides at WellAlly.tech/blog. They cover everything from AI-driven diagnostics pipelines to high-availability health data warehouses. 🚀


Step 5: OLAP Analysis with DuckDB

Once you run dbt run, your health_warehouse.duckdb is ready for high-speed analysis. Open it in the DuckDB CLI (or any SQL client) and use window functions to see your health velocity! One note: {{ ref() }} only resolves inside dbt models, so outside of dbt we query the materialized table by name:

-- Calculate a 3-report moving average for LDL cholesterol
SELECT 
    report_date,
    standardized_value as ldl_cholesterol,
    AVG(standardized_value) OVER (
        ORDER BY report_date 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as ldl_rolling_avg
FROM int_biomarkers_standardized
WHERE marker_key = 'LDL_CHOLESTEROL'
ORDER BY report_date DESC;
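
The same analysis works from Python if you'd rather pull results straight into pandas for plotting. A quick sketch, assuming the model landed in DuckDB's default main schema:

import duckdb

con = duckdb.connect('health_warehouse.duckdb', read_only=True)

# Fetch the LDL trend as a pandas DataFrame, ready for Streamlit or matplotlib
ldl_trend = con.execute("""
    SELECT report_date, standardized_value
    FROM int_biomarkers_standardized
    WHERE marker_key = 'LDL_CHOLESTEROL'
    ORDER BY report_date
""").df()

print(ldl_trend.tail())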

Conclusion: Data-Driven Longevity

By leveraging dbt and DuckDB, we've turned a pile of confusing medical reports into a structured, queryable data warehouse. You can now plug in Apache Superset or Streamlit to visualize your markers over time, spotting trends before they become health issues. 🥑

What's next?

  1. Automation: Set up a GitHub Action to run your dbt models whenever you drop a new CSV into your repo (see the sketch after this list).
  2. Correlation: Join your biomarker data with your Apple Health or Oura Ring exports (Parquet files) to see how sleep affects your fasting glucose.
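
For the automation idea, a minimal GitHub Actions workflow might look like this. It assumes your dbt project sits at the repo root, profiles.yml is checked in, and the Step 1 script is saved as ingest.py (all my own conventions, adjust to taste):

# .github/workflows/dbt_run.yml
name: Rebuild health warehouse
on:
  push:
    paths:
      - 'data/*.csv'  # hypothetical folder where new blood work lands

jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-duckdb
      - run: python ingest.py  # re-ingest CSVs into DuckDB
      - run: dbt seed && dbt run
        env:
          DBT_PROFILES_DIR: .  # read profiles.yml from the repo, not ~/.dbt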

Are you building something in the HealthTech space? Drop a comment below or share your repo—I'd love to see how you're modeling your data! 👇
