Taming the Chaos: Cleaning 10M+ Apple Health Records into a Production-Ready Parquet Lakehouse

If you’ve ever tried to click that "Export Health Data" button on your iPhone, you know the feeling of pure dread that follows. You expect a clean CSV; you get a bloated, multi-gigabyte XML file that looks like it was designed by a chaotic deity.

When building high-performance AI models for health tech, Apple Health data is a goldmine—but only if you can navigate the minefield of data engineering challenges. We’re talking about massive data volumes, duplicate entries from overlapping devices (iPhone vs. Apple Watch), and inconsistent sampling frequencies that would make any data scientist cry.

In this tutorial, we are going to build a robust Data Pipeline using Polars, Apache Hop, and S3 to transform "dirty" XML exports into a standardized, high-performance Parquet Lakehouse.

Pro-Tip: If you are looking for advanced architectural patterns for health-tech scaling, I highly recommend checking out the production-ready examples over at WellAlly's Engineering Blog.


The Architecture: From XML Mess to Parquet Gold

Before we dive into the code, let's look at the flow. We need a system that can handle millions of rows without blowing up your RAM.

```mermaid
graph TD
    A[Apple Health Export XML] --> B[Apache Hop: Ingestion]
    B --> C[S3 Raw Landing Zone]
    C --> D[Polars: Data Cleaning & De-duplication]
    D --> E[Outlier Removal & Resampling]
    E --> F[Standardized Parquet Lakehouse]
    F --> G[Downstream AI/ML Models]

    subgraph "The Processing Core"
    D
    E
    end
```

Prerequisites

To follow along, you’ll need:

  • Python 3.9+
  • Polars: The lightning-fast DataFrame library.
  • Apache Hop: For workflow orchestration.
  • S3 Bucket: To act as our data lake storage.

Step 1: Orchestrating the Ingestion with Apache Hop

While we love Python, using Apache Hop for the initial ingestion lets us handle the XML-to-S3 transfer visually and reliably. Hop's metadata-driven workflows mean that if the upload fails halfway through, we can resume instead of starting over.

  1. Create a "Get XML" transform to parse the export's <Record> elements.
  2. Use the "S3 File Output" to land the records in s3://raw-zone/. Note that the Polars step below scans Parquet, so convert on the way in (or use the pure-Python sketch after this list).
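
No Hop environment handy? Here is a minimal pure-Python sketch of the same ingestion step, assuming the standard export.xml layout where each sample is a <Record> element carrying its fields as attributes. The function name and paths are placeholders:

```python
import xml.etree.ElementTree as ET

import polars as pl


def xml_to_parquet(xml_path: str, out_path: str) -> None:
    # Stream-parse with iterparse so the multi-gigabyte XML tree
    # never has to sit in memory all at once
    rows = []
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "Record":
            rows.append({
                "type": elem.get("type"),
                "sourceName": elem.get("sourceName"),
                "creationDate": elem.get("creationDate"),
                "value": elem.get("value"),
            })
            elem.clear()  # release the parsed element
    # The row dicts still accumulate in memory; for 10M+ records,
    # flush them to Parquet in chunks instead of one big write
    pl.DataFrame(rows).write_parquet(out_path)

# xml_to_parquet("export.xml", "export.parquet")
```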

Step 2: The Polars "Speed Demon" Cleaning Logic

Why Polars? Because parsing 10 million rows of heart rate data in Pandas is a great way to fry an egg on your CPU. Polars handles this in seconds thanks to its multi-threaded query engine.

Here is how we handle the most common issue: Multi-device conflicts. If your Watch and Phone both record "Steps" at the same time, you'll double-count unless you prioritize sources.

```python
import polars as pl


def clean_health_data(file_path: str) -> pl.DataFrame:
    # 1. Lazy loading for memory efficiency
    df = pl.scan_parquet(file_path)

    # 2. Type casting
    # Apple Health timestamps come as strings with timezone offsets
    df = df.with_columns([
        pl.col("creationDate").str.to_datetime(),
        pl.col("value").cast(pl.Float64, strict=False)
    ])

    # 3. Handling multi-device conflicts
    # Strategy: prioritize Apple Watch over iPhone. We build an explicit
    # priority column because sorting on sourceName alone would rank
    # "iPhone" above "Apple Watch" alphabetically. (Assumes the watch's
    # source name contains "Watch".)
    df = df.with_columns(
        pl.when(pl.col("sourceName").str.contains("Watch"))
        .then(pl.lit(0))
        .otherwise(pl.lit(1))
        .alias("source_priority")
    )
    df = (
        df.sort(["creationDate", "source_priority"])
        .unique(subset=["creationDate", "type"], keep="first", maintain_order=True)
        .drop("source_priority")
    )

    # 4. Outlier removal (e.g., physiologically impossible heart rates)
    df = df.filter(
        (
            (pl.col("type") == "HKQuantityTypeIdentifierHeartRate")
            & pl.col("value").is_between(30, 220, closed="none")
        )
        | (pl.col("type") != "HKQuantityTypeIdentifierHeartRate")
    )

    return df.collect()

# Example Usage
# df_clean = clean_health_data("s3://raw-zone/export.parquet")
```
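
One gotcha worth calling out: str.to_datetime() above has to infer the timestamp format. Apple Health exports use a fixed, offset-suffixed format (e.g., 2024-01-15 08:30:00 +0100), so if inference ever trips, spelling the format out is both faster and safer:

```python
# Explicit parsing of Apple Health's offset-suffixed timestamps
df = df.with_columns(
    pl.col("creationDate").str.to_datetime("%Y-%m-%d %H:%M:%S %z")
)
```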

Step 3: Normalizing Sampling Frequencies

Apple Health is "event-driven." Your heart rate might be recorded every 1 minute or every 10 minutes. For AI models, we need a consistent grid (e.g., 5-minute intervals).

```python
def resample_to_grid(df: pl.DataFrame, interval: str = "5m") -> pl.DataFrame:
    return (
        # upsample() requires sorting by the group keys, then by time
        df.sort(["type", "creationDate"])
        .upsample(time_column="creationDate", every=interval, group_by="type")
        # Interpolate and forward-fill per metric type, so a gap in one
        # series never borrows values from a neighboring series
        .with_columns(pl.col("value").interpolate().over("type"))
        .with_columns(pl.col("value").forward_fill().over("type"))
    )
```
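
One caveat: upsample() only inserts missing rows; it never merges rows that are denser than the grid. If a series is sampled faster than your interval (say, 1-minute heart rate onto a 5-minute grid), aggregate it down first. A sketch using group_by_dynamic:

```python
def downsample_to_grid(df: pl.DataFrame, interval: str = "5m") -> pl.DataFrame:
    # Collapse anything denser than the grid into per-interval means,
    # one series per metric type
    return (
        df.sort(["type", "creationDate"])
        .group_by_dynamic("creationDate", every=interval, group_by="type")
        .agg(pl.col("value").mean())
    )
```

Run this before resample_to_grid so dense and sparse series land on the same grid.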

Step 4: Building the Parquet Lakehouse

The final step is writing the data to S3 in a partitioned layout. Partitioning by metric type and year/month means downstream queries only touch the files they actually need.

```python
def save_to_lakehouse(df: pl.DataFrame, output_path: str) -> None:
    # Derive partition columns from the timestamp
    df = df.with_columns([
        pl.col("creationDate").dt.year().alias("year"),
        pl.col("creationDate").dt.month().alias("month")
    ])

    # Write to S3 as a Hive-partitioned Parquet dataset
    # (output_path should be a prefix/directory, not a single file)
    df.write_parquet(
        output_path,
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["type", "year", "month"]}
    )
    print(f"🚀 Successfully deployed data to {output_path}")
```
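
The payoff shows up at read time. With Hive-style partitioning, Polars can prune entire directories based on your filters. A quick sketch (the bucket name is hypothetical):

```python
# Only the files under type=.../year=2024/ partitions are scanned
hr_2024 = (
    pl.scan_parquet(
        "s3://lakehouse-zone/**/*.parquet",
        hive_partitioning=True,
    )
    .filter(
        (pl.col("type") == "HKQuantityTypeIdentifierHeartRate")
        & (pl.col("year") == 2024)
    )
    .collect()
)
```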

The "Official" Way to Scale

While this script works for personal projects, enterprise-grade health data engineering requires specialized handling for HIPAA compliance, data lineage, and schema evolution.

For more production-ready patterns—including how to integrate these pipelines with vector databases for RAG—check out the deep-dive articles at WellAlly.tech/blog. They cover the nuances of building resilient AI-driven health platforms that go far beyond basic scripts.


Conclusion

Cleaning Apple Health data doesn't have to be a nightmare. By leveraging Polars for its speed, Apache Hop for orchestration, and Parquet for efficient storage, you can turn millions of rows of "dirty" XML into a high-performance feature set for your next AI project.

What's next?

  1. Try running the Polars script on your own export.
  2. Experiment with different resampling strategies.
  3. Drop a comment below if you've found an even weirder data bug in Apple Health!

Happy coding! 💻🚀
