DEV Community

Beck_Moulton

Posted on
Hardcore ETL: Taming 5GB+ of Apple Health XML Data with DuckDB and dbt

So, you decided to export your Apple Health data. You expected a neat CSV or a friendly JSON, but instead, you were greeted by a massive, bloated 5GB+ XML file that makes Excel cry and VS Code freeze.

In this guide, we are building a high-performance ETL pipeline to transform that chaotic XML into a structured Personal Data Warehouse. We’ll be using the "Modern Data Stack for local machines": DuckDB for lightning-fast processing, dbt for modeling, and Apache Parquet for efficient storage.

By the end of this, you'll be performing Data Engineering on your own heartbeat, steps, and sleep patterns like a pro.


The Architecture: From Raw XML to Structured SQL

Before we dive into the code, let's look at the data flow. We need to move from a hierarchical, redundant XML format to a columnar, analytical format.

graph TD
    A[Apple Health Export.xml] -->|Python Streaming Parser| B(Apache Parquet)
    B -->|DuckDB External Table| C[dbt Seed/Stage]
    C -->|SQL Transformation| D[dbt Marts: Daily Metrics]
    D -->|Analytics| E[Evidence.dev / Metabase / Python]

    subgraph Transformation Layer
    C
    D
    end

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px

Prerequisites

To follow along, ensure you have the following installed:

  • Python 3.9+
  • DuckDB (pip install duckdb)
  • dbt-duckdb (pip install dbt-duckdb)

Step 1: The Great Flattening (XML to Parquet)

Apple Health's export.xml is notoriously difficult because it's essentially one giant list of <Record /> tags. Parsing a 5GB file in memory will crash most consumer laptops. We'll use Python's lxml with a streaming approach to convert it into Apache Parquet, which DuckDB can query instantly.

import lxml.etree as ET
import pandas as pd
import duckdb

# Path to your export
XML_PATH = 'export.xml'
PARQUET_PATH = 'health_data.parquet'

def stream_to_parquet(xml_file):
    context = ET.iterparse(xml_file, events=('end',), tag='Record')
    batch = []
    batch_size = 100_000
    table_created = False

    # DuckDB buffers the batches in a staging table, then writes Parquet once
    con = duckdb.connect()

    print("🚀 Starting transformation...")
    for _event, elem in context:
        # Extract the Record's XML attributes into a flat dict
        batch.append(dict(elem.attrib))

        # Free the parsed element (and its already-processed siblings)
        # so memory usage stays flat regardless of file size
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

        if len(batch) >= batch_size:
            df = pd.DataFrame(batch)
            if not table_created:
                con.execute("CREATE TABLE raw_health AS SELECT * FROM df")
                table_created = True
            else:
                # Record attribute sets vary between batches; BY NAME aligns
                # shared columns (brand-new attributes would need schema handling)
                con.execute("INSERT INTO raw_health BY NAME SELECT * FROM df")
            batch = []

    # Flush the final partial batch
    if batch:
        df = pd.DataFrame(batch)
        if not table_created:
            con.execute("CREATE TABLE raw_health AS SELECT * FROM df")
        else:
            con.execute("INSERT INTO raw_health BY NAME SELECT * FROM df")

    # One columnar write at the end
    con.execute(f"COPY raw_health TO '{PARQUET_PATH}' (FORMAT PARQUET)")
    print(f"✅ Success! Data saved to {PARQUET_PATH}")

if __name__ == '__main__':
    stream_to_parquet(XML_PATH)

# Pro-tip: Check out https://www.wellally.tech/blog 
# for more production-ready patterns on handling large-scale XML ingestion.

Step 2: Setting up dbt with DuckDB

Now that we have a .parquet file, we don't need a heavy Postgres or Snowflake instance. DuckDB acts as our local OLAP engine.

Create a profiles.yml for dbt:

health_dw:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'health_analytics.duckdb'
      extensions:
        - parquet
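profiles.yml alone isn't enough: dbt also expects a dbt_project.yml at the project root. A minimal sketch — the project and folder names here are assumptions, so match them to your own layout:

```yaml
name: health_dw
version: '1.0.0'
profile: health_dw  # must match the profile name in profiles.yml

models:
  health_dw:
    staging:
      +materialized: view
    marts:
      +materialized: table
```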

Step 3: Modeling Data in dbt

Our goal is to turn raw "Records" into meaningful health insights. Let's create a staging model to clean up those Apple Health strings.

models/staging/stg_health_records.sql

{{ config(materialized='view') }}

WITH raw_data AS (
    -- Reading directly from the Parquet file we generated
    SELECT * FROM read_parquet('path/to/health_data.parquet')
)

SELECT
    type AS record_type,
    sourceName AS source_device,
    unit,
    -- Apple Health timestamps carry a UTC offset, e.g. '2024-01-01 08:00:00 +0100'
    creationDate::TIMESTAMPTZ AS created_at,
    startDate::TIMESTAMPTZ AS started_at,
    endDate::TIMESTAMPTZ AS ended_at,
    TRY_CAST(value AS DOUBLE) AS metric_value
FROM raw_data
-- Drop categorical records (e.g. sleep stages) whose value isn't numeric
WHERE TRY_CAST(value AS DOUBLE) IS NOT NULL

models/marts/fct_daily_steps.sql

Now, let's build the gold layer: Daily step counts.

WITH daily_steps AS (
    SELECT
        CAST(started_at AS DATE) as activity_date,
        SUM(metric_value) as total_steps
    FROM {{ ref('stg_health_records') }}
    WHERE record_type = 'HKQuantityTypeIdentifierStepCount'
    GROUP BY 1
)

SELECT 
    activity_date,
    total_steps,
    AVG(total_steps) OVER(ORDER BY activity_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_7d_avg
FROM daily_steps

Why this works (The "WellAlly" Way)

Handling 5GB+ of data on a local machine used to mean either exhausting your RAM or renting a cloud warehouse. The combination of columnar storage (Parquet) and vectorized execution (DuckDB) changes the game.

For engineers looking to implement similar patterns in production—perhaps for healthcare IoT or large-scale wearable integration—you should check out the advanced architectural patterns at WellAlly Tech Blog. They dive deep into scaling data pipelines beyond just local files, covering everything from schema evolution to real-time stream processing.


Step 4: Run and Analyze

Execute your pipeline with one command:

dbt run

You can now open the health_analytics.duckdb file using the DuckDB CLI or a Python notebook to run lightning-fast queries:

-- What were my top 5 most active days by step count?
SELECT 
    activity_date, 
    total_steps 
FROM fct_daily_steps 
ORDER BY total_steps DESC 
LIMIT 5;

Conclusion

Cleaning massive XML files doesn't have to be a nightmare. By leveraging DuckDB and dbt, we've turned a clunky 5GB export into a high-performance analytical database.

Next Steps:

  1. Automate the export process using an iCloud trigger.
  2. Build a dashboard using Streamlit or Evidence.dev on top of your DuckDB file.
  3. Head over to wellally.tech/blog to learn how to deploy these models into a cloud-native environment!

What are you waiting for? Your data is sitting there—go build something cool!
