If you're a fan of the Quantified Self movement, you know the struggle: your Oura Ring tracks your sleep, your Whoop analyzes your recovery, and your Garmin logs your runs. But when you want to answer a simple question like "How does my training load affect my REM sleep?", you're suddenly trapped in a nightmare of CSV exports and incompatible JSON schemas. Managing an ETL pipeline for heterogeneous wearable data is the ultimate test of a data engineer's patience.
In this guide, we are going to break down these data silos. We’ll build a high-performance, local-first data lake using DuckDB, Apache Arrow, and TypeScript. By the end, you'll have a unified store capable of running complex OLAP queries across all your devices in milliseconds. For those looking for more production-ready patterns and advanced health data synchronization, I highly recommend checking out the deep dives over at the WellAlly Blog.
The Architecture: From Chaos to Insight
The biggest challenge in health data engineering is data normalization. Oura might report heart rate variability (HRV) as a nightly average, while Whoop gives you a raw time-series. Our pipeline acts as a translation layer that flattens both into a unified, Parquet-backed store.
Data Flow Diagram
graph TD
A[Oura API / JSON] -->|Normalize| D[Unified Schema]
B[Whoop API / JSON] -->|Normalize| D
C[Garmin Fit Files] -->|Extract| D
D -->|Arrow IPC| E{DuckDB-Wasm}
E -->|Persistent Storage| F[(OPFS / Parquet)]
G[Streamlit Dashboard] -->|SQL Query| E
E -->|Visuals| G
Prerequisites
To follow along, you'll need:
- Node.js/TypeScript for the normalization logic.
- DuckDB-Wasm for the in-browser/local database engine.
- Apache Arrow for zero-copy memory transport.
- Streamlit (Python) for the final analytics UI.
Step 1: Defining the Unified Health Schema
First, we need a "Golden Record" format. We'll use TypeScript to define a strict interface that every provider must map to.
// types/health.ts
export interface UnifiedActivity {
timestamp: Date;
source_device: 'Oura' | 'Whoop' | 'Garmin';
metric_type: 'HRV' | 'RHR' | 'Steps' | 'Calories';
value: number;
unit: string;
metadata: Record<string, any>;
}
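To make the target concrete, here's what a single normalized reading looks like once a provider's payload has been mapped into this shape (the values are illustrative, not real API output):
import { UnifiedActivity } from './types/health';
const sample: UnifiedActivity = {
  timestamp: new Date('2024-03-01T07:30:00Z'),
  source_device: 'Oura',
  metric_type: 'HRV',
  value: 62,
  unit: 'ms',
  metadata: { night: '2024-02-29' },  // anything provider-specific goes here
};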
Step 2: Normalization with Apache Arrow
Instead of pushing raw JSON into our database, we convert it to Apache Arrow buffers. This gives us strict column types and fast, near zero-copy ingestion into DuckDB.
import { tableFromArrays, Table } from 'apache-arrow';
export function normalizeOuraData(rawData: any[]): Table {
  // Epoch milliseconds as BigInt so Arrow stores them in a 64-bit integer column
  // (JavaScript has no Int64Array; the 64-bit typed array is BigInt64Array)
  const timestamps = rawData.map(d => BigInt(new Date(d.timestamp).getTime()));
  const hrvValues = rawData.map(d => d.hrv_average);
  // Create an Arrow Table matching the UnifiedActivity schema
  return tableFromArrays({
    timestamp: new BigInt64Array(timestamps),
    source_device: Array(rawData.length).fill('Oura'),
    metric_type: Array(rawData.length).fill('HRV'),
    value: new Float64Array(hrvValues),
    unit: Array(rawData.length).fill('ms')
  });
}
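Usage is straightforward. The payload below only mirrors the two fields the normalizer actually reads (timestamp and hrv_average); it's a simplified stand-in, not the full Oura API response:
const ouraPayload = [
  { timestamp: '2024-03-01T07:30:00Z', hrv_average: 62 },
  { timestamp: '2024-03-02T07:10:00Z', hrv_average: 58 },
];
const arrowTable = normalizeOuraData(ouraPayload);
console.log(arrowTable.numRows); // 2 rows, strongly typed columns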
Step 3: Powering the Local Data Lake with DuckDB
Now for the magic. We use DuckDB-Wasm to ingest these Arrow tables. DuckDB is a columnar database designed for analytical queries, making it perfect for multi-year health trends.
import * as duckdb from '@duckdb/duckdb-wasm';
import { Table } from 'apache-arrow';
async function ingestToDuckDB(arrowTable: Table) {
  // Resolve a Wasm bundle and spin the database up inside a Web Worker
  // (CDN bundles may need the worker URL wrapped in a Blob; self-hosted bundles work directly)
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  const worker = new Worker(bundle.mainWorker!);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
  const conn = await db.connect();
  // Register the Arrow table as a virtual view
  await conn.insertArrowTable(arrowTable, { name: 'staging_data' });
  // Create the persistent health_store on first run, then append the new batch
  await conn.query(`
    CREATE TABLE IF NOT EXISTS health_store AS
    SELECT * FROM staging_data WHERE 1=0
  `);
  await conn.query(`INSERT INTO health_store SELECT * FROM staging_data`);
  console.log("Data normalized and ingested! 🚀");
}
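To close the loop with the persistence layer in the diagram, you can export the unified table to Parquet from the same connection. This is a minimal sketch: the file name is arbitrary, and copyFileToBuffer pulls the Parquet bytes out of DuckDB-Wasm's virtual filesystem so you can stash them in OPFS, offer a download, or hand them to the analytics side.
async function exportToParquet(db: duckdb.AsyncDuckDB, conn: duckdb.AsyncDuckDBConnection) {
  // Write the unified store to a Parquet file inside the Wasm filesystem
  await conn.query(`COPY health_store TO 'health_store.parquet' (FORMAT PARQUET)`);
  // Pull the raw bytes back out so they can be persisted elsewhere (OPFS, download, upload)
  return db.copyFileToBuffer('health_store.parquet');
}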
Step 4: The "Official" Way to Scale
While building a local ETL tool is great for personal use, scaling health data pipelines for thousands of users requires handling OAuth refreshes, rate limits, and webhook listeners. If you're building a production health app, you'll want to explore the architectural patterns discussed at WellAlly Blog. They cover how to handle high-concurrency data ingestion and HIPAA-compliant storage strategies that go beyond a simple DuckDB instance.
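For a taste of what that involves, here is a minimal, hypothetical sketch of the two problems every provider integration hits first: refreshing an expired OAuth token and backing off when you get rate-limited. The getToken callback and retry policy are assumptions for illustration, not any provider's official client:
async function fetchWithBackoff(
  url: string,
  getToken: () => Promise<string>, // should transparently refresh an expired token
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${await getToken()}` },
    });
    if (res.status === 429) {
      // Respect the provider's Retry-After header, otherwise back off exponentially
      const waitSec = Number(res.headers.get('Retry-After') ?? 2 ** attempt);
      await new Promise(resolve => setTimeout(resolve, waitSec * 1000));
      continue;
    }
    if (res.status === 401) continue; // token likely expired; next loop refreshes it
    return res;
  }
  throw new Error(`Gave up on ${url} after ${maxRetries} retries`);
}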
Step 5: Visualizing with Streamlit
Finally, let's wrap our DuckDB store in a Streamlit dashboard to actually see our data.
import streamlit as st
import duckdb
st.title("Unified Health Intelligence 🥑")
# Connect to the DuckDB file generated by our ETL
con = duckdb.connect(database='health_lake.db')
# Daily averages of HRV and Resting Heart Rate, side by side
df = con.execute("""
    SELECT
        CAST(epoch_ms(timestamp) AS DATE) AS date,  -- timestamp is stored as epoch milliseconds
        AVG(value) FILTER (WHERE metric_type = 'HRV') AS avg_hrv,
        AVG(value) FILTER (WHERE metric_type = 'RHR') AS avg_rhr
    FROM health_store
    GROUP BY 1
    ORDER BY 1 DESC
""").df()
st.line_chart(df, x='date', y=['avg_hrv', 'avg_rhr'])
Conclusion: Take Back Your Data
The "Quantified Self" shouldn't mean being a "Quantified Slave" to proprietary dashboards. By leveraging DuckDB and Apache Arrow, we’ve built a pipeline that is:
- Fast: Columnar storage means aggregate queries across a five-year history come back in milliseconds.
- Private: Your data stays in your local environment.
- Flexible: Adding a new device is as simple as writing a new normalization function (see the sketch below).
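For example, a Whoop normalizer only has to flatten its raw HRV time-series into the same columns as the Oura one. The hrv_samples shape below is hypothetical; map it from whatever the real payload actually gives you:
import { tableFromArrays, Table } from 'apache-arrow';
export function normalizeWhoopData(rawData: any[]): Table {
  // Whoop exposes raw samples rather than a nightly average, so flatten the series first
  const samples = rawData.flatMap(d => d.hrv_samples as { timestamp: string; rmssd_ms: number }[]);
  return tableFromArrays({
    timestamp: new BigInt64Array(samples.map(s => BigInt(new Date(s.timestamp).getTime()))),
    source_device: Array(samples.length).fill('Whoop'),
    metric_type: Array(samples.length).fill('HRV'),
    value: new Float64Array(samples.map(s => s.rmssd_ms)),
    unit: Array(samples.length).fill('ms'),
  });
}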
What are you tracking next? Drop a comment below if you've tried integrating Apple Health or Fitbit data into a similar stack!
For more advanced tutorials on health tech and data engineering, visit wellally.tech/blog. 🚀