Have you ever tried to open your Apple Health export.xml file? If you've been wearing an Apple Watch for more than a year, that file is likely a multi-gigabyte monster that makes your standard text editor cry. We are living in the golden age of Quantified Self, yet our data remains trapped in bloated, hierarchical formats that are nearly impossible to analyze efficiently.
In this tutorial, we're going to build a high-performance Health Data Lake. We'll take millions of raw heart rate samples, process them using Python, and store them in Apache Parquet for sub-second OLAP queries using DuckDB. Whether you are into Data Engineering or just a fitness nerd, this stack will change how you handle time-series data.
The Architecture: From XML Chaos to SQL Speed
The goal is to move from a slow, memory-intensive XML structure to a columnar, compressed storage format optimized for analytical queries.
```mermaid
graph TD
    A[Apple Health Export XML] -->|Python Streaming Parser| B(Raw Data Extraction)
    B -->|Schema Mapping| C[Apache Parquet Files]
    C -->|Zero-Copy Load| D{DuckDB Engine}
    D -->|Sub-second SQL| E[Evidence.dev Dashboard]
    D -->|Advanced Analytics| F[Pandas/Polars]
    style D fill:#fff2cc,stroke:#d6b656,stroke-width:2px
    style C fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px
```
Prerequisites
To follow along, you'll need:
- Python 3.9+
- DuckDB: The "SQLite for Analytics."
- Pandas/PyArrow: For handling the Parquet conversion.
- Your `export.xml` file (or a sample one).
Step 1: Parsing the Beast (XML to Parquet)
The standard `xml.etree.ElementTree.parse` will exhaust your RAM if the file is >2GB, because it builds the entire tree in memory. Instead, we use a streaming parser (`iterparse`). We extract HeartRate records specifically, since they are by far the most voluminous record type.
```python
import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path, output_parquet):
    data = []
    # Use iterparse to stream the file instead of loading it into memory
    context = ET.iterparse(xml_path, events=('end',))
    for event, elem in context:
        if elem.tag == 'Record' and elem.get('type') == 'HKQuantityTypeIdentifierHeartRate':
            data.append({
                'timestamp': elem.get('startDate'),
                'value': float(elem.get('value')),
                'unit': elem.get('unit')
            })
        # Clear the element from memory to stay lean
        elem.clear()
        # Batch-write to keep peak memory low
        if len(data) >= 100_000:
            save_batch(data, output_parquet)
            data = []
    # Flush the final partial batch
    if data:
        save_batch(data, output_parquet)

def save_batch(data, path):
    df = pd.DataFrame(data)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    table = pa.Table.from_pandas(df)
    # Each call appends a new file under the dataset directory
    pq.write_to_dataset(table, root_path=path)

# Run the parser
# parse_health_data('export.xml', 'heart_rate_lake')
```
Step 2: The Magic of DuckDB
Once your data is in Apache Parquet, it's stored columnarly. This means if we only want to calculate the average heart rate, DuckDB doesn't even have to look at the timestamps or units.
Here is how you query 10 million rows in the blink of an eye:
```python
import duckdb

# Connect to an in-memory or persistent DuckDB instance
con = duckdb.connect(database='health_analytics.db')

# Create a view directly on top of the Parquet files (zero-copy!)
con.execute("""
    CREATE VIEW heart_rates AS
    SELECT * FROM read_parquet('heart_rate_lake/*.parquet')
""")

# Complex analytical query: hourly resting heart rate trend
res = con.execute("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        avg(value) AS avg_bpm,
        max(value) AS max_bpm
    FROM heart_rates
    WHERE value < 100  -- filtering out active workouts
    GROUP BY 1
    ORDER BY 1 DESC
    LIMIT 10
""").df()
print(res)
```
Why this stack?
- Storage Efficiency: XML is text-heavy. Parquet (Snappy-compressed by default in PyArrow) often reduces file size by 90%.
- Speed: DuckDB uses vectorized query execution. It can process millions of rows per core per second.
- Cost: Everything here is open-source and runs locally on your laptop. No expensive cloud warehouses are required.
Pro-Tip: Advanced Data Patterns
While this local setup is great for personal hacking, building production-grade health data pipelines involves handling schema evolution and streaming data drift. For more production-ready examples and advanced architectural patterns in health-tech, check out the deep dives over at WellAlly Tech Blog. They cover how to scale these concepts for HIPAA-compliant environments.
Step 3: Visualizing with Evidence.dev
Now that we have a fast SQL engine, we shouldn't go back to clunky BI tools. Evidence.dev allows you to write reports in Markdown that execute SQL directly against your DuckDB database: a named `sql` fence defines a query, and components reference it by name.

````markdown
# Heart Rate Insights

```sql heart_rate_query
select date_trunc('hour', timestamp) as hour, avg(value) as avg_bpm
from heart_rates group by 1
```

<LineChart
    data={heart_rate_query}
    x=hour
    y=avg_bpm
    title="Daily Average Heart Rate"
/>
````
Conclusion
By moving away from "bloatware" formats and embracing a modern data stack (Python + Parquet + DuckDB), you transform your health data from a static backup into a dynamic research tool. You can now detect trends in recovery, sleep quality, and cardiovascular health in milliseconds.
What's next?
- Try partitioning your Parquet files by `year` or `month` for even faster scans.
- Integrate Evidence.dev for a beautiful, code-based dashboard.
Are you analyzing your own health data? Let me know in the comments what insights you've found!