In the era of the Quantified Self, we are drowning in data but starving for insights. Between your Oura Ring's sleep scores, Garmin's recovery metrics, and Apple Health's step counts, our personal health data is scattered across a dozen proprietary silos. If you've ever tried to answer a simple question like "How does my deep sleep correlate with my workout intensity across different apps?", you know the struggle of fragmented APIs and inconsistent schemas.
Building a robust ETL Pipeline and a centralized Personal Health Data Lake is the only way to reclaim ownership of your metrics. In this guide, we'll walk through a professional-grade Data Engineering architecture designed to handle rate limits, schema drift, and the dreaded "Timezone Hell" using industry-standard tools.
For more advanced data engineering patterns and production-ready infrastructure templates, check out the deep-dive articles at WellAlly Blog.
The Architecture: From API to Insight
Handling 10+ different wearable APIs requires a decoupled architecture. We don't want a failing Fitbit API to crash our entire ingestion engine.
```mermaid
graph TD
    subgraph "Data Sources (SaaS APIs)"
        A[Oura API] --> E
        B[Garmin Connect] --> E
        C[Apple Health / HealthKit] --> E
        D[Whoop / Strava] --> E
    end

    subgraph "Orchestration & Storage"
        E[Apache Airflow] -- "Extract & Load" --> F[(PostgreSQL - Bronze Layer)]
        G[Terraform] -- "Provisions" --> E
        G -- "Provisions" --> F
    end

    subgraph "Transformation & Modeling"
        F -- "Refine" --> H[dbt - Silver/Gold Layer]
        H -- "Unified Schema" --> I[(Final Health Data Lake)]
    end

    subgraph "Visualization"
        I --> J[Grafana / Metabase]
    end
```
The Tech Stack
- Infrastructure as Code: Terraform to manage our RDS (PostgreSQL) and compute instances.
- Orchestration: Apache Airflow to handle retries, backfills, and API rate limiting.
- Transformation: dbt (data build tool) to turn messy JSON blobs into clean, relational tables.
- Database: PostgreSQL (The "Lakehouse" for our personal needs).
Step 1: Provisioning the Foundation with Terraform
We don't do manual setups. We need our environment to be reproducible. Here's a snippet to spin up our PostgreSQL instance.
```hcl
# main.tf
resource "aws_db_instance" "health_db" {
  allocated_storage   = 20
  engine              = "postgres"
  engine_version      = "15.3"
  instance_class      = "db.t3.micro"
  db_name             = "quantified_self"
  username            = var.db_user
  password            = var.db_password
  skip_final_snapshot = true
  publicly_accessible = true # Restricted via Security Group
}
```
Step 2: The Ingestion Logic (Airflow)
The biggest challenges with wearable APIs are OAuth2 and rate limiting. We use Airflow's HttpOperator (or a custom PythonOperator) to fetch data incrementally.
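Before wiring this into a DAG, it helps to see the rate-limit handling on its own. Here is a minimal sketch of exponential backoff around any fetch callable; `RateLimited` and `fetch_with_backoff` are illustrative names, not Airflow APIs:

```python
import time

class RateLimited(Exception):
    """Raised when the upstream API returns HTTP 429."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff when rate-limited."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to Airflow
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

In a task, the `requests.get` call would be wrapped in this helper, raising `RateLimited` whenever the response status is 429.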
Handling the "Bronze" Layer (Raw Data)
We store the raw JSON response directly in a JSONB column. This prevents the pipeline from breaking if an API adds new fields.
```python
import os
from datetime import datetime

import requests
from airflow.decorators import dag, task

# Token from the environment (in practice, use an Airflow Variable or Connection)
OURA_TOKEN = os.environ["OURA_TOKEN"]

@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def wearable_ingestion():
    @task
    def fetch_oura_data():
        # Using the Oura API as an example
        headers = {"Authorization": f"Bearer {OURA_TOKEN}"}
        response = requests.get(
            "https://api.ouraring.com/v2/usercollection/sleep",
            headers=headers,
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["data"]

    @task
    def load_to_postgres(raw_data):
        # Logic to insert into the postgres raw_sleep_data table.
        # We use 'ON CONFLICT DO UPDATE' to handle data updates.
        pass

    load_to_postgres(fetch_oura_data())

wearable_ingestion_dag = wearable_ingestion()
```
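To make the elided load step concrete, here is a sketch of the `ON CONFLICT DO UPDATE` pattern, using SQLite as a stand-in for PostgreSQL so it runs anywhere. Table and column names are illustrative; in Postgres the `raw` column would be JSONB, but the upsert statement has the same shape:

```python
import json
import sqlite3

# SQLite stand-in for the Postgres bronze table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_sleep_data (source_id TEXT PRIMARY KEY, raw TEXT, loaded_at TEXT)"
)

def upsert_raw(records):
    # Re-running a backfill overwrites the old payload instead of duplicating rows
    conn.executemany(
        """
        INSERT INTO raw_sleep_data (source_id, raw, loaded_at)
        VALUES (?, ?, datetime('now'))
        ON CONFLICT(source_id) DO UPDATE SET
            raw = excluded.raw,
            loaded_at = excluded.loaded_at
        """,
        [(r["id"], json.dumps(r)) for r in records],
    )
    conn.commit()

upsert_raw([{"id": "sleep-1", "total_sleep_duration": 28800}])
upsert_raw([{"id": "sleep-1", "total_sleep_duration": 29000}])  # corrected payload
```

Running the same load twice leaves exactly one row per source record, which is what makes Airflow retries and backfills safe.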
Step 3: Normalization with dbt (The Magic Sauce)
Every wearable has its own idea of what "sleep" looks like. One returns seconds, another returns minutes, and timezones are usually a mess. dbt allows us to write modular SQL to clean this up.
The Unified Sleep Model (models/gold/fct_sleep.sql)
```sql
{{ config(materialized='table') }}

WITH unified_sleep AS (
    SELECT
        'oura' AS source,
        user_id,
        (data->>'total_sleep_duration')::int / 3600.0 AS duration_hours,
        -- Cast to timestamptz first so the ISO offset isn't silently dropped
        (data->>'bedtime_start')::timestamptz AT TIME ZONE 'UTC' AS bedtime_start
    FROM {{ ref('stg_oura_sleep') }}

    UNION ALL

    SELECT
        'whoop' AS source,
        user_id,
        -- Use -> (returns jsonb) for intermediate keys, ->> (returns text) for the leaf
        (data->'score'->'stage_summary'->>'total_in_bed_milli')::bigint / 3600000.0 AS duration_hours,
        (data->>'start')::timestamptz AT TIME ZONE 'UTC' AS bedtime_start
    FROM {{ ref('stg_whoop_sleep') }}
)

SELECT
    *,
    date_trunc('day', bedtime_start) AS report_date
FROM unified_sleep
```
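As a quick sanity check of the unit conversions in the model above (Oura reports seconds, Whoop milliseconds), the same logic in Python:

```python
# Divisors to convert each vendor's native sleep-duration unit to hours
DURATION_DIVISORS = {
    "oura": 3600.0,        # seconds per hour
    "whoop": 3_600_000.0,  # milliseconds per hour
}

def duration_hours(source: str, raw_value: int) -> float:
    """Normalize a vendor-native duration to hours."""
    return raw_value / DURATION_DIVISORS[source]
```

Eight hours of sleep should come out as 8.0 from both sources, whatever unit the vendor chose.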
Pro-Tip: Solving the Timezone Conflict
Health data is sensitive to "User Local Time." If you fly from NYC to London, your sleep record might double-count or skip hours.
- Always store raw data in UTC.
- Capture the `offset` from the API if available.
- Use a `dim_dates` table in dbt to map UTC timestamps back to the user's "Body Clock" day.
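The mapping that a `dim_dates` table would encode can be sketched in a few lines, assuming the API exposes the user's UTC offset at record time (in minutes here; `body_clock_day` is an illustrative helper, not a dbt feature):

```python
from datetime import datetime, timedelta, timezone

def body_clock_day(utc_timestamp: datetime, offset_minutes: int) -> str:
    """Map a UTC instant back to the user's local calendar day,
    using the offset the wearable reported when the record was created."""
    local = utc_timestamp + timedelta(minutes=offset_minutes)
    return local.date().isoformat()

# 23:30 UTC on Jan 1 is already Jan 2 for a user at UTC+05:30,
# but still Jan 1 for a user in New York (UTC-05:00)
ts = datetime(2023, 1, 1, 23, 30, tzinfo=timezone.utc)
```

Storing the offset alongside the UTC timestamp is what lets you replay this mapping correctly even after the user has flown across time zones.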
For a deeper dive into handling complex temporal data transformations, the team at WellAlly Tech has published an extensive guide on "Production Data Modeling for Time-Series Events" that is highly recommended.
Conclusion: Data Sovereignty
By building your own ETL pipeline, you aren't just making a dashboard; you are building an asset. You can now:
- Predict Burnout: Correlate heart rate variability (HRV) with calendar meetings.
- Cross-Pollinate: See how your Strava runs affect your Oura readiness.
- Future Proof: If you switch from Garmin to Apple Watch, your historical data remains intact.
What's next?
- [ ] Set up an Alerting system in Airflow if an API token expires.
- [ ] Connect Metabase to your Gold layer for beautiful mobile-friendly charts.
- [ ] Explore wellally.tech/blog for more insights on scaling your data infrastructure!
Are you tracking your life with code? Drop your favorite API in the comments!