Beck_Moulton

Posted on Feb 28

Quantified Self 2.0: Build a Unified Health Data Warehouse with DuckDB and dbt

#python #dataengineering #quantifiedself #healthtech

Ever tried to compare your Oura sleep score with your Garmin body battery, only to realize you’re comparing apples to... well, very differently formatted oranges?

If you're a data nerd like me, you probably track everything. But the "Quantified Self" dream quickly turns into a nightmare when you're juggling JSON exports from Oura, messy CSVs from Fitbit, and timezone-conflicted data from Garmin.

In this tutorial, we are going to solve the fragmented data problem by building a local Modern Data Stack (MDS). We’ll use DuckDB as our powerhouse engine and dbt (data build tool) to transform raw, messy health metrics into a standardized Common Data Model (CDM).

By the end of this, you'll have a production-grade Data Engineering pipeline running right on your laptop, ready for advanced AI analysis or visualization in Apache Superset.

The Architecture

The goal is to move from "Raw Silos" to a "Unified Analytics Layer." We use Python to fetch/land the data, DuckDB as the storage and compute engine, and dbt to handle the heavy lifting of modeling.

graph TD
    subgraph Sources
        A[Oura API/JSON] --> D
        B[Fitbit CSVs] --> D
        C[Garmin Fit Files] --> D
    end

    subgraph "Data Lake (DuckDB)"
        D[raw_health_data.duckdb]
    end

    subgraph "Transformation (dbt)"
        D --> E[stg_oura]
        D --> F[stg_fitbit]
        D --> G[stg_garmin]
        E & F & G --> H[fact_daily_metrics]
        H --> I[dim_user_health]
    end

    subgraph "Visualization"
        I --> J[Apache Superset]
    end

Prerequisites

Before we dive in, ensure you have the following installed:

Python 3.9+
DuckDB: The "SQLite for Analytics."
dbt-duckdb: The adapter that allows dbt to talk to DuckDB.
Apache Superset: For the shiny dashboards.

Note: While this setup is perfect for local experimentation, if you are looking for more production-ready patterns, enterprise-grade AI integrations, or advanced data architecture insights, I highly recommend checking out the deep dives over at the WellAlly Blog.

Step 1: Ingesting Data with Python & DuckDB

First, we need to get our files into DuckDB. DuckDB is incredible because it can query JSON and CSV files directly.

import duckdb
import pandas as pd

# Initialize our DuckDB database
con = duckdb.connect('health_warehouse.duckdb')

# Load Oura JSON data
# DuckDB's read_json_auto is magic ✨
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_oura AS 
    SELECT * FROM read_json_auto('data/oura_export/*.json');
""")

# Load Fitbit CSV data
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_fitbit AS 
    SELECT * FROM read_csv_auto('data/fitbit_export/*.csv');
""")

print("✅ Data successfully ingested into raw layer!")

Step 2: Harmonizing Data with dbt

The real magic happens in dbt. We need to solve two main problems:

Schema Disparity: Oura calls it score_sleep, Fitbit calls it sleep_efficiency.
Timezone Hell: Garmin might be in UTC, while Oura is in your local "start of day" time.

The Staging Layer (`stg_oura.sql`)

We create a view to rename columns and cast types correctly.

-- models/staging/stg_oura.sql
SELECT
    CAST(summary_date AS DATE) as activity_date,
    score as sleep_score,
    'oura' as source_system,
    (rem_duration + deep_duration) / 3600 as restorative_sleep_hours
FROM {{ source('raw', 'raw_oura') }}

The Common Data Model (`fact_daily_health.sql`)

Now, we union everything into a single, clean table. This is the "Gold" layer of your warehouse.

-- models/marts/fact_daily_health.sql
{{ config(materialized='table') }}

WITH unified AS (
    SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_oura') }}
    UNION ALL
    SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_fitbit') }}
)

SELECT
    activity_date,
    AVG(sleep_score) as avg_sleep_score,
    -- Simple logic to handle multi-device conflicts
    MAX(sleep_score) FILTER (WHERE source_system = 'oura') as primary_sleep_score
FROM unified
GROUP BY 1

Step 3: Handling Timezones Like a Pro

One of the biggest hurdles in Quantified Self data is the offset. If you fly from NY to London, your sleep heart rate might appear "in the future."

In dbt, we use a macro or a cross-join with a dim_date table that includes UTC offsets.

-- logic to normalize to local time
SELECT 
    event_timestamp_utc,
    timezone_offset,
    event_timestamp_utc + INTERVAL (timezone_offset) SECOND as local_time
FROM {{ ref('stg_garmin_heartrate') }}

Step 4: The "Official" Way to Scale

Building this locally is an amazing way to learn Data Engineering. However, when you start dealing with real-time biometric streams or sensitive health data at scale, you need a more robust framework.

For those interested in how to take these "Learning in Public" projects and turn them into scalable, secure production systems, the team at WellAlly has published some incredible resources. They cover everything from Pydantic data validation to Vector Databases for health AI.

👉 Check out more production-ready patterns at: wellally.tech/blog

Step 5: Visualization in Apache Superset

Connect Superset to your health_warehouse.duckdb using the SQLAlchemy URI:
duckdb:////path/to/health_warehouse.duckdb

Now you can build a "Unified Health Dashboard" that finally shows you if that late-night pizza actually affects your Oura HRV and your Garmin recovery time simultaneously.

Conclusion

You've just built a modern data warehouse on your local machine!

DuckDB handled the storage with lightning speed.
dbt turned chaos into a structured Common Data Model.
Python acted as the glue.

What's next?

Try adding your Apple Health XML export (warning: it's a beast!).
Plug in an LLM to your DuckDB and ask: "Hey GPT-4, why was my recovery so low last Tuesday?"

Did you find this helpful? Drop a comment below with your favorite wearable, and don't forget to star the repo!

DEV Community

Quantified Self 2.0: Build a Unified Health Data Warehouse with DuckDB and dbt

The Architecture

Prerequisites

Step 1: Ingesting Data with Python & DuckDB

Step 2: Harmonizing Data with dbt

The Staging Layer (`stg_oura.sql`)

The Common Data Model (`fact_daily_health.sql`)

Step 3: Handling Timezones Like a Pro

Step 4: The "Official" Way to Scale

Step 5: Visualization in Apache Superset

Conclusion

Top comments (0)

The Architecture

Prerequisites

Step 1: Ingesting Data with Python & DuckDB

Step 2: Harmonizing Data with dbt

The Staging Layer (stg_oura.sql)

The Common Data Model (fact_daily_health.sql)

Step 3: Handling Timezones Like a Pro

Step 4: The "Official" Way to Scale

Step 5: Visualization in Apache Superset

Conclusion

The Staging Layer (`stg_oura.sql`)

The Common Data Model (`fact_daily_health.sql`)