wellallyTech

Posted on Jun 1

Your Apple Health Data is a Mess—Fix it 10x Faster with Polars 🚀

#dataengineering #opensource #python #polars

If you’ve ever tried to export your Apple Health data, you know the struggle. You click "Export," wait for ages, and finally receive a massive export.zip. Inside lies export.xml, a beast that can easily reach 1GB to 5GB in size.

For most data engineering tasks, the go-to is usually Pandas. But try loading a 3GB XML file into a Pandas DataFrame, and your RAM will cry for mercy. In this tutorial, we’re going to master large-scale data processing by replacing the "Pandas tax" with Polars—the lightning-fast DataFrame library—to clean and downsample Apple Health's messy XML records.

The Problem: The "XML Wall" 🧱

Apple Health's export isn't just large; it's deeply nested and repetitive. We are dealing with hundreds of thousands (or millions) of Record tags. Traditional DOM parsing loads the entire tree into memory, which is a recipe for a crash.

To build a high-performance ETL pipeline for this, we need a strategy that combines iterative parsing with the columnar power of Polars.

The Architecture 🏗️

Our pipeline follows a "Stream-Transform-Sink" pattern. We stream the XML to avoid OOM (Out of Memory) errors, batch the records into Polars, and sink the cleaned data into Parquet, which is typically 90-95% smaller than the raw XML.

graph TD
    A[export.xml GB-sized] --> B{Iterative Parser}
    B -->|Stream Records| C[Dictionary Buffer]
    C -->|Batch Load| D[Polars DataFrame]
    D --> E[Data Cleaning & Type Casting]
    E --> F[Time-Series Downsampling]
    F --> G[Parquet File Output]
    G --> H[Ready for Analysis/BI]

Prerequisites 🛠️

Before we dive in, make sure you have the necessary tools installed:

pip install polars lxml pyarrow

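Polars: Our high-performance engine. Its built-in Parquet writer is what write_parquet uses by default.
lxml/ElementTree: For iterative XML streaming. The code below uses xml.etree.ElementTree from the standard library; lxml is an optional, faster parser with a largely compatible iterparse.
PyArrow: Optional for this tutorial. It is only needed if you ask Polars to write Parquet through PyArrow (use_pyarrow=True).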

Step 1: The High-Performance Iterative Parser 🔍

Instead of ET.parse(), we use ET.iterparse(). This allows us to clear the memory as we process each node.

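import xml.etree.ElementTree as ET
import polars as pl

def stream_apple_health_xml(file_path, batch_size=100_000):
    """Yield lists of Record attribute dicts, batch_size records at a time."""
    # Listen for 'start' as well, so the first event hands us the root element
    context = ET.iterparse(file_path, events=('start', 'end'))
    _, root = next(context)

    records = []
    for event, elem in context:
        # iterparse visits every element; keep only fully parsed 'Record' tags
        if event != 'end' or elem.tag != 'Record':
            continue

        # Extract only the attributes we need
        records.append({
            'type': elem.get('type'),
            'value': elem.get('value'),
            'unit': elem.get('unit'),
            'startDate': elem.get('startDate'),
        })

        # elem.clear() alone leaves an empty shell attached to the root;
        # clearing the root drops processed children so the tree stays small
        root.clear()

        # Yield in batches to keep Polars happy
        if len(records) >= batch_size:
            yield records
            records = []

    # Emit the final partial batch, but never an empty one
    if records:
        yield records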
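To sanity-check the streaming pattern without a multi-gigabyte export, here is a minimal sketch that runs an iterparse loop over a tiny in-memory document. The sample XML and the helper name count_records are made up for illustration. The sketch also grabs the root element and clears it after each record, a common ElementTree idiom for making sure processed children are actually released rather than left attached to the tree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; real exports carry many more attributes per Record
SAMPLE = b"""<HealthData>
  <ExportDate value="2024-01-02 09:00:00 -0700"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" value="61" unit="count/min" startDate="2024-01-01 08:00:00 -0700"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" value="64" unit="count/min" startDate="2024-01-01 08:00:05 -0700"/>
  <Record type="HKQuantityTypeIdentifierStepCount" value="120" unit="count" startDate="2024-01-01 08:01:00 -0700"/>
</HealthData>"""

def count_records(source):
    """Stream `source` and return (records seen, children still attached to root)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    seen = 0
    for event, elem in context:
        if event == "end" and elem.tag == "Record":
            seen += 1
            root.clear()  # release everything parsed so far
    return seen, len(root)

result = count_records(io.BytesIO(SAMPLE))
print(result)  # (3, 0)
```

The second number staying at zero is the point: no matter how many records pass through, the tree does not accumulate them.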

Step 2: Polars Transformation Magic ✨

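Once we have our chunks, we use Polars to clean each batch. This is where we handle the "dirty" part: stripping the verbose type prefixes, parsing timezone-aware timestamps, and converting string values to floats. The transformations run eagerly on each batch; the expressions are still evaluated in parallel by Polars.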

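def process_to_polars(batch):
    # Explicit string schema so inference never trips on a column of Nones
    schema = {"type": pl.Utf8, "value": pl.Utf8, "unit": pl.Utf8, "startDate": pl.Utf8}
    df = pl.DataFrame(batch, schema=schema)

    processed_df = df.with_columns(
        # Clean type names (e.g., 'HKQuantityTypeIdentifierStepCount' -> 'StepCount')
        pl.col("type").str.replace("HKQuantityTypeIdentifier", "", literal=True),

        # Fast datetime casting (the %z offset makes the result timezone-aware)
        pl.col("startDate").str.to_datetime("%Y-%m-%d %H:%M:%S %z"),

        # Numeric conversion: non-numeric or empty values become null
        pl.col("value").cast(pl.Float64, strict=False),
    ).drop_nulls(subset=["value"])  # dropping nulls avoids fake 0.0 readings skewing averages

    return processed_df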
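If you want to confirm the format string before running it over millions of rows, the same directives can be tried with Python's standard library. This is only a sketch: the sample timestamp is an illustrative value in the shape the export uses, parse_apple_timestamp is a hypothetical helper, and Polars uses its own parser, so treat this as a check of the pattern rather than of Polars itself.

```python
from datetime import datetime, timezone

# Same directives as the Polars call: date, time, then a numeric UTC offset
APPLE_DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"

def parse_apple_timestamp(raw):
    """Parse an export-style timestamp and normalise it to UTC."""
    return datetime.strptime(raw, APPLE_DATE_FORMAT).astimezone(timezone.utc)

sample = "2024-01-15 08:30:00 -0700"  # illustrative value, not from a real export
parsed = parse_apple_timestamp(sample)
print(parsed.isoformat())  # 2024-01-15T15:30:00+00:00
```

Normalising to UTC matters for the downsampling step: records captured while travelling carry different offsets, and hourly buckets only line up once everything sits on one clock.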

Step 3: Downsampling for Performance 📉

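Heart rate data is recorded every few seconds. For a yearly view, you don't need that granularity. Let's downsample to 1-hour intervals using Polars' incredibly fast group_by_dynamic. Note that a mean suits rate-like metrics such as heart rate; cumulative metrics such as StepCount call for a sum instead.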

def downsample_data(df):
    return (
        df.sort("startDate")
        .group_by_dynamic(
            "startDate", 
            every="1h", 
            group_by="type"
        )
        .agg(pl.col("value").mean().alias("avg_value"))
    )

The "Official" Way to Scale 🥑

While the script above works for personal exports, production-grade health data pipelines require more robust handling of schema drift and multi-user concurrency.

Pro-Tip: If you're building a production-ready health app or need more advanced patterns for handling multimodal biometric data, check out the detailed guides at WellAlly Tech Blog. They cover how to scale these Python pipelines into distributed cloud architectures.

Step 4: Putting it All Together 🏎️

Here is the final execution block. On a typical 2GB XML file, this approach finishes in under 30 seconds, whereas Pandas would likely take minutes or swap to disk.

path = "export.xml"
all_frames = []

print("🚀 Starting the engine...")

for batch in stream_apple_health_xml(path):
    if not batch:
        continue  # skip empty batches defensively
    all_frames.append(process_to_polars(batch))

# Combine and Downsample
final_df = pl.concat(all_frames)
hourly_summary = downsample_data(final_df)

# Save to Parquet (The Gold Standard)
hourly_summary.write_parquet("health_data_cleaned.parquet")

print(f"✅ Success! Processed {len(final_df)} rows into a compact Parquet file.")
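If disk space is tight, you may not need to unzip the export at all. ET.iterparse accepts a file object as well as a path, so the XML can be streamed straight out of the archive. The sketch below is a minimal illustration: count_records_in_zip is a hypothetical helper, the demo archive is built in memory, and it assumes the record file is the archive member whose name is exactly export.xml, wherever it sits in the folder tree.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def count_records_in_zip(zip_source):
    """Stream export.xml directly from the archive and count Record elements."""
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: match on the exact base name so similarly named files are skipped
        members = [n for n in archive.namelist() if n.rsplit("/", 1)[-1] == "export.xml"]
        if not members:
            raise FileNotFoundError("no export.xml found in archive")

        with archive.open(members[0]) as handle:
            context = ET.iterparse(handle, events=("start", "end"))
            _, root = next(context)
            count = 0
            for event, elem in context:
                if event == "end" and elem.tag == "Record":
                    count += 1
                    root.clear()  # keep the in-memory tree small
            return count

# Build a tiny stand-in archive in memory for the demo
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "apple_health_export/export.xml",
        "<HealthData><Record type='a' value='1'/><Record type='b' value='2'/></HealthData>",
    )
buffer.seek(0)

n_records = count_records_in_zip(buffer)
print(n_records)  # 2
```

The same idea should carry over to the generator from Step 1 by passing the open handle in place of the path, as long as the batches are consumed while the archive is still open.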

Why this wins:

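Memory Safety: Streaming with iterparse and clearing processed elements means the parser never holds the whole XML tree in RAM. The cleaned batches are still collected in memory before the final concat, but as compact typed columns rather than XML nodes.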
Speed: Polars utilizes all CPU cores for the group_by and cast operations.
Storage: Parquet files are typically 90-95% smaller than the raw XML, making them perfect for mobile or web apps.

Conclusion 🎁

Cleaning Apple Health data doesn't have to be a nightmare. By swapping DOM parsing for streaming and Pandas for Polars, you turn a "coffee break" task into a "blink of an eye" task.

What's next?

Try visualizing your heart rate variability (HRV) using Plotly with the Parquet file we just created.
Head over to wellally.tech/blog to learn how to integrate this data into a vector database for AI-driven health insights!

Did this help you save some RAM? Let me know in the comments! 👇

DEV Community