So, you decided to export your Apple Health data. You expected a neat CSV or a friendly JSON, but instead, you were greeted by a massive, bloated 5GB+ XML file that makes Excel cry and VS Code freeze.
In this guide, we are building a high-performance ETL pipeline to transform that chaotic XML into a structured Personal Data Warehouse. We’ll be using the "Modern Data Stack for local machines": DuckDB for lightning-fast processing, dbt for modeling, and Apache Parquet for efficient storage.
By the end of this, you'll be performing Data Engineering on your own heartbeat, steps, and sleep patterns like a pro.
The Architecture: From Raw XML to Structured SQL
Before we dive into the code, let's look at the data flow. We need to move from a hierarchical, redundant XML format to a columnar, analytical format.
```mermaid
graph TD
    A[Apple Health Export.xml] -->|Python Streaming Parser| B(Apache Parquet)
    B -->|DuckDB External Table| C[dbt Seed/Stage]
    C -->|SQL Transformation| D[dbt Marts: Daily Metrics]
    D -->|Analytics| E[Evidence.dev / Metabase / Python]

    subgraph Transformation Layer
        C
        D
    end

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
```
Prerequisites
To follow along, ensure you have the following installed:
- Python 3.9+
- DuckDB (`pip install duckdb`)
- dbt-duckdb (`pip install dbt-duckdb`)
Step 1: The Great Flattening (XML to Parquet)
Apple Health's export.xml is notoriously difficult because it's essentially one giant list of <Record /> tags. Parsing a 5GB file in memory will crash most consumer laptops. We'll use Python's lxml with a streaming approach to convert it into Apache Parquet, which DuckDB can query instantly.
```python
import lxml.etree as ET
import pandas as pd
import duckdb

# Path to your export
XML_PATH = 'export.xml'
PARQUET_PATH = 'health_data.parquet'

def stream_to_parquet(xml_file):
    # iterparse fires an event per closed <Record/> tag, so we never
    # hold the whole 5GB document in memory
    context = ET.iterparse(xml_file, events=('end',), tag='Record')
    batch = []
    batch_size = 100_000

    # Initialize DuckDB for writing
    con = duckdb.connect()
    table_created = False
    print("🚀 Starting transformation...")

    def flush(rows):
        nonlocal table_created
        df = pd.DataFrame(rows)
        if not table_created:
            con.execute("CREATE TABLE raw_health AS SELECT * FROM df")
            table_created = True
        else:
            con.execute("INSERT INTO raw_health SELECT * FROM df")

    for event, elem in context:
        # Extract attributes into a dict
        batch.append(dict(elem.attrib))
        # Clear the element (and its already-parsed siblings) to keep memory flat
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        if len(batch) >= batch_size:
            flush(batch)
            batch = []

    # Don't forget the final partial batch
    if batch:
        flush(batch)

    # Materialize a single Parquet file for downstream tools
    con.execute(f"COPY raw_health TO '{PARQUET_PATH}' (FORMAT PARQUET)")
    print(f"✅ Success! Data saved to {PARQUET_PATH}")

if __name__ == "__main__":
    stream_to_parquet(XML_PATH)
```
Step 2: Setting up dbt with DuckDB
Now that we have a .parquet file, we don't need a heavy Postgres or Snowflake instance. DuckDB acts as our local OLAP engine.
Create a profiles.yml for dbt:
```yaml
health_dw:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'health_analytics.duckdb'
      extensions:
        - parquet
```
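The profile pairs with a `dbt_project.yml` at the project root. The one below is a minimal sketch; the project name and folder layout are assumptions, so adjust them to match your repo:

```yaml
name: health_dw
version: '1.0.0'
profile: health_dw  # must match the profile name in profiles.yml

model-paths: ["models"]

models:
  health_dw:
    staging:
      +materialized: view
    marts:
      +materialized: table
```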
Step 3: Modeling Data in dbt
Our goal is to turn raw "Records" into meaningful health insights. Let's create a staging model to clean up those Apple Health strings.
`models/staging/stg_health_records.sql`

```sql
{{ config(materialized='view') }}

WITH raw_data AS (
    -- Reading directly from the Parquet file we generated
    SELECT * FROM read_parquet('path/to/health_data.parquet')
)

SELECT
    type AS record_type,
    sourceName AS source_device,
    unit,
    creationDate::TIMESTAMP AS created_at,
    startDate::TIMESTAMP AS started_at,
    endDate::TIMESTAMP AS ended_at,
    value::FLOAT AS metric_value
FROM raw_data
-- Keep only numeric values: category records (e.g. sleep stages)
-- store strings like 'HKCategoryValueSleepAnalysisAsleep' in this column
WHERE value ~ '^[0-9.]+$'
```
`models/marts/fct_daily_steps.sql`

Now, let's build the gold layer: daily step counts. (One caveat: if both your iPhone and your Watch log steps, a plain `SUM` double-counts; for precise numbers you'd deduplicate by `source_device` first.)

```sql
WITH daily_steps AS (
    SELECT
        CAST(started_at AS DATE) AS activity_date,
        SUM(metric_value) AS total_steps
    FROM {{ ref('stg_health_records') }}
    WHERE record_type = 'HKQuantityTypeIdentifierStepCount'
    GROUP BY 1
)

SELECT
    activity_date,
    total_steps,
    AVG(total_steps) OVER (ORDER BY activity_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg
FROM daily_steps
```
Why this works (The "WellAlly" Way)
Handling 5GB+ of data on a local machine used to mean standing up a database server or watching pandas exhaust your RAM. The combination of columnar storage (Parquet) and vectorized execution (DuckDB) changes the game.
For engineers looking to implement similar patterns in production—perhaps for healthcare IoT or large-scale wearable integration—you should check out the advanced architectural patterns at WellAlly Tech Blog. They dive deep into scaling data pipelines beyond just local files, covering everything from schema evolution to real-time stream processing.
Step 4: Run and Analyze
Execute your pipeline with one command:
```shell
dbt run
```
You can now open the health_analytics.duckdb file using the DuckDB CLI or a Python notebook to run lightning-fast queries:
```sql
-- What were my most active days?
SELECT
    activity_date,
    total_steps
FROM fct_daily_steps
ORDER BY total_steps DESC
LIMIT 5;
```
Conclusion
Cleaning massive XML files doesn't have to be a nightmare. By leveraging DuckDB and dbt, we've turned a clunky 5GB export into a high-performance analytical database.
Next Steps:
- Automate the export process using an iCloud trigger.
- Build a dashboard using Streamlit or Evidence.dev on top of your DuckDB file.
- Head over to wellally.tech/blog to learn how to deploy these models into a cloud-native environment!
What are you waiting for? Your data is sitting there—go build something cool!