Beck_Moulton
Ditch the Bloat: Building a High-Performance Health Data Lake with DuckDB & Parquet

Have you ever tried to open your Apple Health export.xml file? If you've been wearing an Apple Watch for more than a year, that file is likely a multi-gigabyte monster that makes your standard text editor cry. We are living in the golden age of Quantified Self, yet our data remains trapped in bloated, hierarchical formats that are nearly impossible to analyze efficiently.

In this tutorial, we’re going to build a high-performance Health Data Lake. We’ll take millions of raw heart rate samples, process them using Python, and store them in Apache Parquet for sub-second OLAP queries using DuckDB. Whether you are into Data Engineering or just a fitness nerd, this stack will change how you handle time-series data.

The Architecture: From XML Chaos to SQL Speed

The goal is to move from a slow, memory-intensive XML structure to a columnar, compressed storage format optimized for analytical queries.

graph TD
    A[Apple Health Export XML] -->|Python Streaming Parser| B(Raw Data Extraction)
    B -->|Schema Mapping| C[Apache Parquet Files]
    C -->|Zero-Copy Load| D{DuckDB Engine}
    D -->|Sub-second SQL| E[Evidence.dev Dashboard]
    D -->|Advanced Analytics| F[Pandas/Polars]

    style D fill:#fff2cc,stroke:#d6b656,stroke-width:2px
    style C fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px

Prerequisites

To follow along, you'll need:

  • Python 3.9+
  • DuckDB: The "SQLite for Analytics."
  • Pandas/PyArrow: For handling the Parquet conversion.
  • Your export.xml (or a sample one).

Step 1: Parsing the Beast (XML to Parquet)

The standard xml.etree.ElementTree approach loads the entire tree into memory, which can exhaust your RAM once the file passes a couple of gigabytes. Instead, we use a streaming parser (iterparse). We extract HeartRate records specifically, since they are typically the most voluminous record type in the export.

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path, output_parquet):
    data = []
    # iterparse streams the file element by element instead of
    # loading the whole tree into memory
    context = ET.iterparse(xml_path, events=('end',))

    for event, elem in context:
        if elem.tag == 'Record' and elem.get('type') == 'HKQuantityTypeIdentifierHeartRate':
            data.append({
                'timestamp': elem.get('startDate'),
                'value': float(elem.get('value')),
                'unit': elem.get('unit')
            })

        # Clear the processed element so memory stays flat
        elem.clear()

        # Batch write to keep peak memory low
        if len(data) >= 100000:
            save_batch(data, output_parquet)
            data = []

    # Don't forget the final partial batch
    if data:
        save_batch(data, output_parquet)

def save_batch(data, path):
    df = pd.DataFrame(data)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    table = pa.Table.from_pandas(df)
    # write_to_dataset appends a new file per batch under the root path
    pq.write_to_dataset(table, root_path=path)

# Run the parser
# parse_health_data('export.xml', 'heart_rate_lake')

Step 2: The Magic of DuckDB

Once your data is in Apache Parquet, it's stored columnarly. This means if we only want to calculate the average heart rate, DuckDB doesn't even have to look at the timestamps or units.

Here is how you query 10 million rows in the blink of an eye:

import duckdb

# Connect to an in-memory or persistent DuckDB instance
con = duckdb.connect(database='health_analytics.db')

# Create a view directly on top of Parquet files (Zero-copy!)
con.execute("""
    CREATE VIEW heart_rates AS 
    SELECT * FROM read_parquet('heart_rate_lake/*.parquet')
""")

# Complex Analytical Query: Hourly Resting Heart Rate Trend
res = con.execute("""
    SELECT 
        date_trunc('hour', timestamp) as hour,
        avg(value) as avg_bpm,
        max(value) as max_bpm
    FROM heart_rates
    WHERE value < 100 -- Filtering out active workouts
    GROUP BY 1
    ORDER BY 1 DESC
    LIMIT 10
""").df()

print(res)

Why this stack?

  1. Storage Efficiency: XML is text-heavy. Parquet with Snappy compression (PyArrow's default) often cuts file size by 90% or more.
  2. Speed: DuckDB uses vectorized query execution. It can process millions of rows per core per second.
  3. Cost: Everything here is open-source and runs locally on your laptop. No expensive cloud warehouses are required.

Pro-Tip: Advanced Data Patterns

While this local setup is great for personal hacking, building production-grade health data pipelines involves handling schema evolution and streaming data drift. For more production-ready examples and advanced architectural patterns in health-tech, check out the deep dives over at WellAlly Tech Blog. They cover how to scale these concepts for HIPAA-compliant environments.


Step 3: Visualizing with Evidence.dev

Now that we have a fast SQL engine, we shouldn't go back to clunky BI tools. Evidence.dev allows you to write reports in Markdown that execute SQL directly against your DuckDB database.

# Heart Rate Insights

```sql heart_rate_query
select date_trunc('hour', timestamp) as hour, avg(value) as avg_bpm
from heart_rates
group by 1 order by 1
```

<LineChart 
    data={heart_rate_query} 
    x=hour 
    y=avg_bpm 
    title="Daily Average Heart Rate"
/>


Conclusion

By moving away from "bloatware" formats and embracing a modern data stack (Python + Parquet + DuckDB), you transform your health data from a static backup into a dynamic research tool. You can now detect trends in recovery, sleep quality, and cardiovascular health in milliseconds.

What's next?

  • Try partitioning your Parquet files by year or month for even faster scans.
  • Integrate Evidence.dev for a beautiful, code-based dashboard.

Are you analyzing your own health data? Let me know in the comments what insights you've found! πŸ‘‡
