If you're a fan of the Quantified Self movement, you know the struggle: your Oura Ring tracks your sleep, your Whoop analyzes your recovery, and your Garmin logs your runs. But when you want to answer a simple question like "How does my training load affect my REM sleep?", you're suddenly trapped in a nightmare of CSV exports and incompatible JSON schemas. Managing an ETL pipeline for heterogeneous wearable data is the ultimate test of a data engineer's patience.
In this guide, we are going to break down these data silos. We’ll build a high-performance, local-first data lake using DuckDB, Apache Arrow, and TypeScript. By the end, you'll have a unified store capable of running complex OLAP queries across all your devices in milliseconds. For those looking for more production-ready patterns and advanced health data synchronization, I highly recommend checking out the deep dives over at the WellAlly Blog.
The Architecture: From Chaos to Insight
The biggest challenge in health data engineering is data normalization. Oura might report heart rate variability (HRV) as a nightly average, while Whoop gives you a raw time-series. Our pipeline acts as a translation layer that flattens both into a unified, Parquet-backed store.
Data Flow Diagram
graph TD
A[Oura API / JSON] -->|Normalize| D[Unified Schema]
B[Whoop API / JSON] -->|Normalize| D
C[Garmin Fit Files] -->|Extract| D
D -->|Arrow IPC| E{DuckDB-Wasm}
E -->|Persistent Storage| F[(OPFS / Parquet)]
G[Streamlit Dashboard] -->|SQL Query| E
E -->|Visuals| G
Prerequisites
To follow along, you'll need:
- Node.js/TypeScript for the normalization logic.
- DuckDB-Wasm for the in-browser/local database engine.
- Apache Arrow for zero-copy memory transport.
- Streamlit (Python) for the final analytics UI.
Step 1: Defining the Unified Health Schema
First, we need a "Golden Record" format. We'll use TypeScript to define a strict interface that every provider must map to.
// types/health.ts
export interface UnifiedActivity {
timestamp: Date;
source_device: 'Oura' | 'Whoop' | 'Garmin';
metric_type: 'HRV' | 'RHR' | 'Steps' | 'Calories';
value: number;
unit: string;
metadata: Record<string, any>;
}
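To make the target concrete, here's what a single normalized reading looks like once a provider's payload has been mapped into this shape (the values are illustrative, not real API output):
import { UnifiedActivity } from './types/health';
const sample: UnifiedActivity = {
  timestamp: new Date('2024-03-01T07:30:00Z'),
  source_device: 'Oura',
  metric_type: 'HRV',
  value: 62,
  unit: 'ms',
  metadata: { night: '2024-02-29' },  // anything provider-specific goes here
};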
Step 2: Normalization with Apache Arrow
Instead of pushing raw JSON into our database, we convert it to Apache Arrow buffers. This gives us strict column types and fast, near zero-copy ingestion into DuckDB.
import { tableFromArrays, Table } from 'apache-arrow';
export function normalizeOuraData(rawData: any[]): Table {
  // Epoch milliseconds as BigInt so Arrow stores them in a 64-bit integer column
  // (JavaScript has no Int64Array; the 64-bit typed array is BigInt64Array)
  const timestamps = rawData.map(d => BigInt(new Date(d.timestamp).getTime()));
  const hrvValues = rawData.map(d => d.hrv_average);
  // Create an Arrow Table matching the UnifiedActivity schema
  return tableFromArrays({
    timestamp: new BigInt64Array(timestamps),
    source_device: Array(rawData.length).fill('Oura'),
    metric_type: Array(rawData.length).fill('HRV'),
    value: new Float64Array(hrvValues),
    unit: Array(rawData.length).fill('ms')
  });
}
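Usage is straightforward. The payload below only mirrors the two fields the normalizer actually reads (timestamp and hrv_average); it's a simplified stand-in, not the full Oura API response:
const ouraPayload = [
  { timestamp: '2024-03-01T07:30:00Z', hrv_average: 62 },
  { timestamp: '2024-03-02T07:10:00Z', hrv_average: 58 },
];
const arrowTable = normalizeOuraData(ouraPayload);
console.log(arrowTable.numRows); // 2 rows, strongly typed columns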
Step 3: Powering the Local Data Lake with DuckDB
Now for the magic. We use DuckDB-Wasm to ingest these Arrow tables. DuckDB is a columnar database designed for analytical queries, making it perfect for multi-year health trends.
import * as duckdb from '@duckdb/duckdb-wasm';
import { Table } from 'apache-arrow';
async function ingestToDuckDB(arrowTable: Table) {
  // Resolve a Wasm bundle and spin the database up inside a Web Worker
  // (CDN bundles may need the worker URL wrapped in a Blob; self-hosted bundles work directly)
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  const worker = new Worker(bundle.mainWorker!);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
  const conn = await db.connect();
  // Register the Arrow table as a virtual view
  await conn.insertArrowTable(arrowTable, { name: 'staging_data' });
  // Create the persistent health_store on first run, then append the new batch
  await conn.query(`
    CREATE TABLE IF NOT EXISTS health_store AS
    SELECT * FROM staging_data WHERE 1=0
  `);
  await conn.query(`INSERT INTO health_store SELECT * FROM staging_data`);
  console.log("Data normalized and ingested! 🚀");
}
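To close the loop with the persistence layer in the diagram, you can export the unified table to Parquet from the same connection. This is a minimal sketch: the file name is arbitrary, and copyFileToBuffer pulls the Parquet bytes out of DuckDB-Wasm's virtual filesystem so you can stash them in OPFS, offer a download, or hand them to the analytics side.
async function exportToParquet(db: duckdb.AsyncDuckDB, conn: duckdb.AsyncDuckDBConnection) {
  // Write the unified store to a Parquet file inside the Wasm filesystem
  await conn.query(`COPY health_store TO 'health_store.parquet' (FORMAT PARQUET)`);
  // Pull the raw bytes back out so they can be persisted elsewhere (OPFS, download, upload)
  return db.copyFileToBuffer('health_store.parquet');
}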
Step 4: The "Official" Way to Scale
While building a local ETL tool is great for personal use, scaling health data pipelines for thousands of users requires handling OAuth refreshes, rate limits, and webhook listeners. If you're building a production health app, you'll want to explore the architectural patterns discussed at WellAlly Blog. They cover how to handle high-concurrency data ingestion and HIPAA-compliant storage strategies that go beyond a simple DuckDB instance.
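For a taste of what that involves, here is a minimal, hypothetical sketch of the two problems every provider integration hits first: refreshing an expired OAuth token and backing off when you get rate-limited. The getToken callback and retry policy are assumptions for illustration, not any provider's official client:
async function fetchWithBackoff(
  url: string,
  getToken: () => Promise<string>, // should transparently refresh an expired token
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${await getToken()}` },
    });
    if (res.status === 429) {
      // Respect the provider's Retry-After header, otherwise back off exponentially
      const waitSec = Number(res.headers.get('Retry-After') ?? 2 ** attempt);
      await new Promise(resolve => setTimeout(resolve, waitSec * 1000));
      continue;
    }
    if (res.status === 401) continue; // token likely expired; next loop refreshes it
    return res;
  }
  throw new Error(`Gave up on ${url} after ${maxRetries} retries`);
}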
Step 5: Visualizing with Streamlit
Finally, let's wrap our DuckDB store in a Streamlit dashboard to actually see our data.
import streamlit as st
import duckdb
st.title("Unified Health Intelligence 🥑")
# Connect to the DuckDB file generated by our ETL
con = duckdb.connect(database='health_lake.db')
# Daily averages of HRV and Resting Heart Rate, side by side
df = con.execute("""
    SELECT
        CAST(epoch_ms(timestamp) AS DATE) AS date,  -- timestamp is stored as epoch milliseconds
        AVG(value) FILTER (WHERE metric_type = 'HRV') AS avg_hrv,
        AVG(value) FILTER (WHERE metric_type = 'RHR') AS avg_rhr
    FROM health_store
    GROUP BY 1
    ORDER BY 1 DESC
""").df()
st.line_chart(df, x='date', y=['avg_hrv', 'avg_rhr'])
Conclusion: Take Back Your Data
The "Quantified Self" shouldn't mean being a "Quantified Slave" to proprietary dashboards. By leveraging DuckDB and Apache Arrow, we’ve built a pipeline that is:
- Fast: Columnar storage means aggregate queries across a five-year history come back in milliseconds.
- Private: Your data stays in your local environment.
- Flexible: Adding a new device is as simple as writing a new normalization function (see the sketch below).
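For example, a Whoop normalizer only has to flatten its raw HRV time-series into the same columns as the Oura one. The hrv_samples shape below is hypothetical; map it from whatever the real payload actually gives you:
import { tableFromArrays, Table } from 'apache-arrow';
export function normalizeWhoopData(rawData: any[]): Table {
  // Whoop exposes raw samples rather than a nightly average, so flatten the series first
  const samples = rawData.flatMap(d => d.hrv_samples as { timestamp: string; rmssd_ms: number }[]);
  return tableFromArrays({
    timestamp: new BigInt64Array(samples.map(s => BigInt(new Date(s.timestamp).getTime()))),
    source_device: Array(samples.length).fill('Whoop'),
    metric_type: Array(samples.length).fill('HRV'),
    value: new Float64Array(samples.map(s => s.rmssd_ms)),
    unit: Array(samples.length).fill('ms'),
  });
}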
What are you tracking next? Drop a comment below if you've tried integrating Apple Health or Fitbit data into a similar stack!
For more advanced tutorials on health tech and data engineering, visit wellally.tech/blog. 🚀