Ever tried to compare your Oura sleep score with your Garmin body battery, only to realize you’re comparing apples to... well, very differently formatted oranges?
If you're a data nerd like me, you probably track everything. But the "Quantified Self" dream quickly turns into a nightmare when you're juggling JSON exports from Oura, messy CSVs from Fitbit, and timezone-conflicted data from Garmin.
In this tutorial, we are going to solve the fragmented data problem by building a local Modern Data Stack (MDS). We’ll use DuckDB as our powerhouse engine and dbt (data build tool) to transform raw, messy health metrics into a standardized Common Data Model (CDM).
By the end of this, you'll have a working, end-to-end data engineering pipeline running right on your laptop, ready for AI-assisted analysis or visualization in Apache Superset.
The Architecture
The goal is to move from "Raw Silos" to a "Unified Analytics Layer." We use Python to fetch/land the data, DuckDB as the storage and compute engine, and dbt to handle the heavy lifting of modeling.
graph TD
subgraph Sources
A[Oura API/JSON] --> D
B[Fitbit CSVs] --> D
C[Garmin Fit Files] --> D
end
subgraph "Data Lake (DuckDB)"
D[health_warehouse.duckdb]
end
subgraph "Transformation (dbt)"
D --> E[stg_oura]
D --> F[stg_fitbit]
D --> G[stg_garmin]
E & F & G --> H[fact_daily_health]
H --> I[dim_user_health]
end
subgraph "Visualization"
I --> J[Apache Superset]
end
Prerequisites
Before we dive in, ensure you have the following installed:
- Python 3.9+
- DuckDB: The "SQLite for Analytics."
- dbt-duckdb: The adapter that allows dbt to talk to DuckDB.
- Apache Superset: For the shiny dashboards.
Note: While this setup is perfect for local experimentation, if you are looking for more production-ready patterns, enterprise-grade AI integrations, or advanced data architecture insights, I highly recommend checking out the deep dives over at the WellAlly Blog.
Step 1: Ingesting Data with Python & DuckDB
First, we need to get our files into DuckDB. DuckDB is incredible because it can query JSON and CSV files directly.
import duckdb
# Initialize our DuckDB database
con = duckdb.connect('health_warehouse.duckdb')
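# Load Oura JSON data
# DuckDB's read_json_auto infers the schema from the files ✨
# Note: with IF NOT EXISTS, re-running this script will not pick up new exports;
# use CREATE OR REPLACE TABLE if you want a full reload.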
con.execute("""
CREATE TABLE IF NOT EXISTS raw_oura AS
SELECT * FROM read_json_auto('data/oura_export/*.json');
""")
# Load Fitbit CSV data
con.execute("""
CREATE TABLE IF NOT EXISTS raw_fitbit AS
SELECT * FROM read_csv_auto('data/fitbit_export/*.csv');
""")
print("✅ Data successfully ingested into raw layer!")
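Before running the ingest, it can help to confirm that each glob pattern actually matches some files, since an empty or misnamed export folder is easy to miss. Here is a minimal sketch using only the standard library. `find_exports` and `missing_sources` are hypothetical helper names, and the patterns simply mirror the paths used above:

```python
from pathlib import Path

# Glob patterns mirroring the ingest script above; adjust to your own layout.
EXPORT_PATTERNS = {
    "oura": "data/oura_export/*.json",
    "fitbit": "data/fitbit_export/*.csv",
}


def find_exports(root, patterns):
    """Return {source_name: sorted list of files matching its glob pattern}."""
    base = Path(root)
    return {name: sorted(base.glob(pattern)) for name, pattern in patterns.items()}


def missing_sources(found):
    """Return the names of sources for which no export file was found."""
    return [name for name, files in found.items() if not files]


if __name__ == "__main__":
    found = find_exports(".", EXPORT_PATTERNS)
    for name, files in found.items():
        print(f"{name}: {len(files)} file(s)")
    missing = missing_sources(found)
    if missing:
        print("No export files found for:", ", ".join(missing))
```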
Step 2: Harmonizing Data with dbt
The real magic happens in dbt. We need to solve two main problems:
- Schema Disparity: Oura calls it score_sleep, Fitbit calls it sleep_efficiency.
- Timezone Hell: Garmin might be in UTC, while Oura is in your local "start of day" time.
The Staging Layer (stg_oura.sql)
We create a view to rename columns and cast types correctly.
-- models/staging/stg_oura.sql
SELECT
CAST(summary_date AS DATE) as activity_date,
score as sleep_score,
'oura' as source_system,
(rem_duration + deep_duration) / 3600.0 as restorative_sleep_hours -- durations assumed to be in seconds
FROM {{ source('raw', 'raw_oura') }}
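To picture what the staging layer does across vendors, here is a minimal Python sketch of the same renaming idea. The Oura names follow the model above, and `sleep_efficiency` comes from the Fitbit example in the text. The Fitbit `date` column and the `to_common` helper are assumptions for illustration, not the real export schema:

```python
from datetime import date

# Per-source mapping from vendor column names to the common model's names.
COLUMN_MAP = {
    "oura": {"summary_date": "activity_date", "score": "sleep_score"},
    # The Fitbit date column name is an assumption for this sketch.
    "fitbit": {"date": "activity_date", "sleep_efficiency": "sleep_score"},
}


def to_common(record, source_system):
    """Rename vendor fields, cast the date, and tag the row with its source."""
    mapping = COLUMN_MAP[source_system]
    row = {common: record[vendor] for vendor, common in mapping.items()}
    row["activity_date"] = date.fromisoformat(row["activity_date"])
    row["source_system"] = source_system
    return row


oura_row = to_common({"summary_date": "2024-03-01", "score": 82}, "oura")
fitbit_row = to_common({"date": "2024-03-01", "sleep_efficiency": 91}, "fitbit")

# Both rows now share the same keys, so they can be unioned safely.
print(sorted(oura_row) == sorted(fitbit_row))
```

Keeping the mapping in one place per source is the point: each staging model owns its vendor's quirks, and everything downstream sees a single shape.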
The Common Data Model (fact_daily_health.sql)
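Now, we union everything into a single, clean table. This is the "Gold" layer of your warehouse. It assumes a stg_fitbit model built the same way as stg_oura, exposing the same activity_date, sleep_score, and source_system columns.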
-- models/marts/fact_daily_health.sql
{{ config(materialized='table') }}
WITH unified AS (
SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_oura') }}
UNION ALL
SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_fitbit') }}
)
SELECT
activity_date,
AVG(sleep_score) as avg_sleep_score,
-- Simple conflict rule: when several devices report the same day, treat Oura as primary
MAX(sleep_score) FILTER (WHERE source_system = 'oura') as primary_sleep_score
FROM unified
GROUP BY 1
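To see what this aggregation produces, here is a minimal Python sketch that mirrors the `AVG` and `MAX ... FILTER` logic on a few invented rows. The sample values and the `daily_metrics` helper are hypothetical; the real work stays in the dbt model:

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows shaped like the unified CTE.
unified = [
    {"activity_date": "2024-03-01", "sleep_score": 80, "source_system": "oura"},
    {"activity_date": "2024-03-01", "sleep_score": 90, "source_system": "fitbit"},
    {"activity_date": "2024-03-02", "sleep_score": 70, "source_system": "fitbit"},
]


def daily_metrics(rows):
    """Group by date; average all scores and keep Oura's score as primary."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["activity_date"]].append(row)

    result = {}
    for day, day_rows in sorted(by_date.items()):
        oura_scores = [r["sleep_score"] for r in day_rows if r["source_system"] == "oura"]
        result[day] = {
            "avg_sleep_score": mean(r["sleep_score"] for r in day_rows),
            # Like MAX(...) FILTER (...): None (NULL) when Oura has no row that day.
            "primary_sleep_score": max(oura_scores) if oura_scores else None,
        }
    return result


for day, metrics in daily_metrics(unified).items():
    print(day, metrics)
```

On the second day only Fitbit reported, so the primary score is empty while the average still exists. Falling back to the next-best source with a `COALESCE` is one way to fill that gap.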
Step 3: Handling Timezones Like a Pro
One of the biggest hurdles in Quantified Self data is the offset. If you fly from NY to London, your sleep heart rate might appear "in the future."
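In dbt, we can wrap the conversion in a macro, or join against a dim_date table that carries UTC offsets. The query below takes the simpler route and assumes each Garmin row already stores its offset in seconds.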
-- logic to normalize to local time
SELECT
event_timestamp_utc,
timezone_offset,
event_timestamp_utc + INTERVAL (timezone_offset) SECOND as local_time
FROM {{ ref('stg_garmin_heartrate') }}
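Here is the same arithmetic in plain Python, as a minimal sketch. It assumes `timezone_offset` is a signed number of seconds east of UTC, which is an assumption about the Garmin data rather than something the export guarantees. `to_local` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone


def to_local(event_timestamp_utc, timezone_offset_seconds):
    """Shift a UTC timestamp into the wearer's local time via a fixed offset."""
    local_tz = timezone(timedelta(seconds=timezone_offset_seconds))
    return event_timestamp_utc.astimezone(local_tz)


# A reading taken at 03:30 UTC on 2 March.
reading = datetime(2024, 3, 2, 3, 30, tzinfo=timezone.utc)

in_new_york = to_local(reading, -5 * 3600)  # UTC-5 (Eastern Standard Time)
in_london = to_local(reading, 0)            # UTC+0 (GMT)

# The same instant lands on different calendar days depending on the offset.
print(in_new_york.date(), in_london.date())
```

Deriving activity_date from the local timestamp rather than the UTC one keeps a night's sleep attached to the day the wearer actually experienced it.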
Step 4: The "Official" Way to Scale
Building this locally is an amazing way to learn Data Engineering. However, when you start dealing with real-time biometric streams or sensitive health data at scale, you need a more robust framework.
For those interested in how to take these "Learning in Public" projects and turn them into scalable, secure production systems, the team at WellAlly has published some incredible resources. They cover everything from Pydantic data validation to Vector Databases for health AI.
👉 Check out more production-ready patterns at: wellally.tech/blog
Step 5: Visualization in Apache Superset
Connect Superset to your health_warehouse.duckdb using the SQLAlchemy URI:
duckdb:////path/to/health_warehouse.duckdb
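The four slashes are easy to mistype: three belong to the URI separator and the fourth is the leading slash of an absolute path, the same convention SQLAlchemy uses for SQLite files. Here is a minimal sketch that builds the string from a POSIX path. `duckdb_uri` is a hypothetical helper, and it assumes the duckdb-engine SQLAlchemy driver is installed in Superset's environment:

```python
from pathlib import PurePosixPath


def duckdb_uri(db_path):
    """Build a SQLAlchemy-style URI for an absolute POSIX path to a DuckDB file."""
    path = PurePosixPath(db_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {db_path!r}")
    # "duckdb:///" + "/path/..." yields the four-slash form.
    return f"duckdb:///{path}"


print(duckdb_uri("/path/to/health_warehouse.duckdb"))
```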
Now you can build a "Unified Health Dashboard" that finally shows you if that late-night pizza actually affects your Oura HRV and your Garmin recovery time simultaneously.
Conclusion
You've just built a modern data warehouse on your local machine!
- DuckDB handled the storage with lightning speed.
- dbt turned chaos into a structured Common Data Model.
- Python acted as the glue.
What's next?
- Try adding your Apple Health XML export (warning: it's a beast!).
- Plug an LLM into your DuckDB warehouse and ask: "Hey GPT-4, why was my recovery so low last Tuesday?"
Did you find this helpful? Drop a comment below with your favorite wearable, and don't forget to star the repo!