Ever wondered why your Apple Watch says you're a marathon runner while Google Fit thinks you're a couch potato? Building a Quantified Self Data Lake is the ultimate dream for data nerds, but syncing Apple HealthKit and Google Health Connect involves more than just a simple API call. Between mismatched sampling frequencies and the eternal nightmare of timezone conversions, creating a reliable high-frequency ETL pipeline is a true test of your Data Engineering mettle.
In this tutorial, we are going to architect a robust system that pulls raw telemetry from mobile SDKs, processes it through Apache Hop, and stores it in a structured PostgreSQL data lake. We'll solve the "heterogeneous data" problem and ensure your heart rate variability (HRV) and step counts are synchronized with millisecond precision.
The Architecture: From Pulse to Postgres
To handle the high-frequency nature of health data (which can generate thousands of rows per hour), we need a decoupled architecture. We use a Python-based middleware to bridge the mobile SDKs and an orchestration layer to handle the heavy lifting.
```mermaid
graph TD
    A[Apple HealthKit SDK] -->|JSON Stream| B(Python FastAPI Middleware)
    C[Google Health Connect] -->|Batch Export| B
    B -->|Raw Storage| D[(PostgreSQL Staging)]
    subgraph ETL Orchestration
        E[Apache Hop] -->|Extract & Normalize| D
        E -->|Transform Timezones| F{Data Validator}
        F -->|Load| G[(Quantified Self Data Lake)]
    end
    G --> H[Grafana / Superset Visualization]
```
Prerequisites 🛠️
Before we dive into the code, ensure you have the following in your toolkit:
- Python 3.10+: For our middleware and data validation.
- PostgreSQL: Our destination data lake.
- Apache Hop: The successor to Kettle/PDI for visual ETL orchestration.
- HealthKit/Health Connect SDKs: Configured in your mobile project.
Step 1: Handling Heterogeneous Data with Pydantic
The biggest hurdle is that Apple and Google represent data differently. Apple uses "Samples," while Google often aggregates into "Records." We'll use Pydantic to enforce a unified schema before the data even hits our staging area.
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class HealthMetric(BaseModel):
    source_device: str              # e.g., 'apple_watch' or 'pixel_watch'
    metric_type: str                # 'step_count', 'heart_rate', 'active_calories'
    value: float
    unit: str
    start_time: datetime
    end_time: datetime              # equals start_time for instantaneous samples
    timezone: str = "UTC"           # IANA timezone name
    metadata: Optional[dict] = None


# Example: validating a high-frequency heart-rate sample
raw_data = {
    "source_device": "apple_watch_s8",
    "metric_type": "heart_rate",
    "value": 72.5,
    "unit": "count/min",
    "start_time": "2023-10-27T10:00:00Z",
    "end_time": "2023-10-27T10:00:00Z",
    "timezone": "America/New_York",
}
metric = HealthMetric(**raw_data)
print(f"Validated: {metric.metric_type} at {metric.start_time}")
```
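In practice, the two platforms hand you differently shaped payloads: Apple's "Samples" carry explicit start/end dates per reading, while Health Connect "Records" often aggregate over an interval. Here is a minimal normalization sketch that maps both into the unified schema as plain dicts. The field names on the input side (`sourceName`, `startMillis`, `zoneId`, etc.) are illustrative placeholders, not the real SDK export keys:

```python
from datetime import datetime, timezone


def from_apple_sample(sample: dict) -> dict:
    """Map a (hypothetical) HealthKit sample export to the unified schema."""
    return {
        "source_device": sample["sourceName"],
        "metric_type": sample["type"],
        "value": float(sample["value"]),
        "unit": sample["unit"],
        "start_time": datetime.fromisoformat(sample["startDate"]),
        "end_time": datetime.fromisoformat(sample["endDate"]),
        "timezone": sample.get("timeZone", "UTC"),
    }


def from_google_record(record: dict) -> dict:
    """Map a (hypothetical) Health Connect record to the same schema.

    Health Connect timestamps are commonly epoch milliseconds.
    """
    return {
        "source_device": record["origin"],
        "metric_type": record["recordType"],
        "value": float(record["count"]),
        "unit": record.get("unit", "count"),
        "start_time": datetime.fromtimestamp(record["startMillis"] / 1000, tz=timezone.utc),
        "end_time": datetime.fromtimestamp(record["endMillis"] / 1000, tz=timezone.utc),
        "timezone": record.get("zoneId", "UTC"),
    }
```

Once both adapters emit the same shape, a single `HealthMetric(**payload)` call validates everything downstream, regardless of which platform produced it.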
Step 2: Designing the Data Lake Schema
We need a schema that supports "Upsert" operations. Why? Because mobile devices often resync old data when they reconnect to Wi-Fi. We don't want duplicate steps (as much as we'd like the extra credit!).
```sql
CREATE TABLE health_metrics_raw (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    external_id TEXT UNIQUE,               -- hash of (source + metric + start_time)
    source_device VARCHAR(50),
    metric_type VARCHAR(50),
    value NUMERIC,
    unit VARCHAR(20),
    ts_start TIMESTAMP WITH TIME ZONE,
    ts_end TIMESTAMP WITH TIME ZONE,
    timezone TEXT NOT NULL DEFAULT 'UTC',  -- IANA name; we need this in Step 4
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_metric_time ON health_metrics_raw (metric_type, ts_start);
```
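The `external_id` hash and the upsert can be driven from the Python middleware. The sketch below is one way to do it: the hashing scheme is an assumption (any stable key over source, metric, and start time works), and the `%(name)s` placeholders follow the psycopg/psycopg2 convention, so the SQL string can be passed to `cursor.execute()` with the validated dict:

```python
import hashlib


def make_external_id(source_device: str, metric_type: str, start_iso: str) -> str:
    """Deterministic id: the same reading, resynced later, hashes to the same value."""
    key = f"{source_device}|{metric_type}|{start_iso}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# ON CONFLICT turns a replayed sync into an update instead of a duplicate row.
UPSERT_SQL = """
INSERT INTO health_metrics_raw
    (external_id, source_device, metric_type, value, unit, ts_start, ts_end, timezone)
VALUES
    (%(external_id)s, %(source_device)s, %(metric_type)s,
     %(value)s, %(unit)s, %(start_time)s, %(end_time)s, %(timezone)s)
ON CONFLICT (external_id) DO UPDATE
    SET value = EXCLUDED.value,
        ts_end = EXCLUDED.ts_end;
"""
```

Because the hash is deterministic, a device that resyncs a week of history simply rewrites the same rows instead of inflating your step count.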
Step 3: Orchestration with Apache Hop
While Python is great for ingestion, Apache Hop shines in complex ETL logic. We create a pipeline that:
- Reads from the PostgreSQL staging table.
- Standardizes units: converts everything to a canonical unit per metric (e.g., all energy values to kilocalories).
- Normalizes timezones: uses the `timezone` field to project each `ts_start` onto both a localized user timeline and a universal UTC timeline.
- Deduplicates: applies a "Unique Rows" transform keyed on the `external_id`.
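The unit-standardization step is easy to prototype in Python before porting it to a Hop transform. The conversion table below is a minimal sketch covering only a few illustrative units; a real pipeline would enumerate every unit the SDKs can emit:

```python
# Canonical unit per metric type, with factors from common source units.
CANONICAL = {
    "active_calories": ("kcal", {"kcal": 1.0, "kJ": 1 / 4.184, "cal": 0.001}),
    "step_count": ("count", {"count": 1.0}),
    "heart_rate": ("count/min", {"count/min": 1.0}),
}


def standardize(metric_type: str, value: float, unit: str) -> tuple[float, str]:
    """Convert a reading to the canonical unit for its metric type."""
    target_unit, factors = CANONICAL[metric_type]
    if unit not in factors:
        raise ValueError(f"No conversion from {unit!r} for {metric_type!r}")
    return value * factors[unit], target_unit
```

Failing loudly on an unknown unit is deliberate: silently passing through a mismatched unit is how "10,000 steps" becomes "10,000 kilojoules" in a dashboard.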
Pro Tip: For advanced data engineering patterns and more production-ready examples of high-throughput pipelines, check out the deep-dive articles at WellAlly Tech Blog. They have fantastic resources on scaling PostgreSQL for time-series data!
Step 4: The Timezone Trap 🕰️
When you fly from New York to Tokyo, your "daily steps" become a mess. If you store data in UTC, your 10k steps might appear split across two days.
Solution: store the IANA timezone name with every sample and derive the local calendar day from it.

In your ETL logic, derive a `local_day` column. Note that PostgreSQL rejects `AT TIME ZONE` inside a `GENERATED ALWAYS AS` column (the expression is only STABLE, not IMMUTABLE, since it depends on the tz database), so populate it at load time instead:

```sql
ALTER TABLE health_metrics_raw ADD COLUMN local_day DATE;

-- Run as part of each load; AT TIME ZONE can't live in a generated column.
UPDATE health_metrics_raw
SET local_day = (ts_start AT TIME ZONE timezone)::DATE
WHERE local_day IS NULL;
```
This allows you to query "Steps per day" based on where you physically were at that moment.
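You can see the trap directly with Python's standard-library `zoneinfo`: the same UTC instant lands on different calendar days depending on which timezone you stored alongside it.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A walk logged at 01:30 UTC on Oct 27...
ts = datetime(2023, 10, 27, 1, 30, tzinfo=timezone.utc)

utc_day = ts.date()                                          # 2023-10-27
ny_day = ts.astimezone(ZoneInfo("America/New_York")).date()  # 2023-10-26 (21:30 EDT)
tokyo_day = ts.astimezone(ZoneInfo("Asia/Tokyo")).date()     # 2023-10-27 (10:30 JST)

print(utc_day, ny_day, tokyo_day)
```

If that walker lives in New York, crediting the steps to October 27 is simply wrong, which is exactly why the `local_day` column keys off the stored timezone rather than UTC.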
Conclusion: Data-Driven Wellness
By building this pipeline, you've moved from "guessing" your health to "engineering" your wellness. You now have a unified, deduplicated, and timezone-aware data lake ready for analysis or even for training your own ML models.
What's next?
- Connect Grafana to your Postgres instance for real-time dashboards.
- Implement Anomaly Detection to alert you when your resting heart rate spikes.
Did you run into issues with the HealthKit background delivery? Or maybe Google's OAuth 2.0 flow is giving you a headache? Let's chat in the comments below!
If you enjoyed this build, don't forget to follow for more "Learning in Public" tutorials and visit wellally.tech/blog for the full source code of this pipeline!