
wellallyTech

From Messy XML to Vector Insights: Building a High-Performance Apple Health ETL Pipeline with Rust

Have you ever tried to open your Apple Health export file? If you’ve been tracking your steps, heart rate, and sleep for more than a couple of years, that export.xml is likely a multi-gigabyte monster that makes VS Code cry and Python scripts crawl.

Cleaning five years of "dirty" health data requires a robust Rust ETL pipeline and efficient Data Engineering practices. In this tutorial, we will transform a chaotic XML swamp into a structured powerhouse using Polars for high-speed manipulation and Qdrant for semantic search capabilities. Whether you're building a personal health dashboard or a Quantified Self knowledge graph, this guide will show you how to handle massive datasets without melting your CPU. 🚀

The Architecture: From Raw XML to Vector Search

Before we dive into the borrow checker, let's look at the high-level flow. We are moving from a streaming XML parser to a columnar memory format, and finally to a vector representation.

graph TD
    A[Apple Health export.xml] -->|Streaming Parse| B(Rust + xml-rs)
    B -->|Batch Processing| C{Polars DataFrame}
    C -->|Data Cleaning & Normalization| C
    C -->|Parquet Export| D[Local Storage]
    C -->|Embedding Generation| E[Vector Database: Qdrant]
    E -->|Semantic Query| F[Personal Health Insights]

Prerequisites

To follow along, you’ll need:

  • Rust (latest stable)
  • Apple Health Export: In the Health app, tap your profile picture > Export All Health Data.
  • Tech Stack: xml-rs (streaming), polars (data wrangling), qdrant-client (vector storage).
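
A minimal Cargo.toml wiring these together might look like the following sketch. The version numbers are illustrative only; pin whatever is current when you build, and note that polars needs the lazy, strings, temporal, and parquet features for the code in this post:

```toml
[dependencies]
xml-rs = "0.8"
polars = { version = "0.36", features = ["lazy", "strings", "temporal", "parquet"] }
qdrant-client = "1.7"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
uuid = { version = "1", features = ["v4"] }
```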

Step 1: Streaming the XML Monster 🦖

The problem with Apple Health's XML is its sheer size. Loading the whole file into memory at once is a quick way to exhaust your RAM. Instead, we use xml-rs to pull events off the stream one at a time.

use xml::reader::{EventReader, XmlEvent};
use std::fs::File;
use std::io::BufReader;

struct HealthRecord {
    record_type: String,
    value: f64,
    start_date: String,
}

fn stream_records(path: &str) -> std::io::Result<Vec<HealthRecord>> {
    let file = BufReader::new(File::open(path)?);
    let parser = EventReader::new(file);
    let mut records = Vec::new();

    for event in parser {
        if let Ok(XmlEvent::StartElement { name, attributes, .. }) = event {
            if name.local_name != "Record" {
                continue;
            }
            // Pull out the attributes we care about: type, value, startDate
            let (mut record_type, mut value, mut start_date) = (None, None, None);
            for attr in attributes {
                match attr.name.local_name.as_str() {
                    "type" => record_type = Some(attr.value),
                    "value" => value = attr.value.parse::<f64>().ok(),
                    "startDate" => start_date = Some(attr.value),
                    _ => {}
                }
            }
            // Some records (e.g. sleep categories) carry no numeric value; skip them
            if let (Some(record_type), Some(value), Some(start_date)) =
                (record_type, value, start_date)
            {
                records.push(HealthRecord { record_type, value, start_date });
            }
        }
    }
    Ok(records)
}

Step 2: High-Speed Wrangling with Polars 🥑

Once we have our data in a flat format, Polars takes over. Think of Polars as "Pandas on steroids" written in Rust. It uses Apache Arrow under the hood, making it incredibly fast for cleaning "dirty" data (like missing heart rate samples or weird unit conversions).

use polars::prelude::*;

fn clean_data(df: DataFrame) -> PolarsResult<DataFrame> {
    df.lazy()
        // Drop rows where the parser produced no numeric value
        .filter(col("value").is_not_null())
        // Parse the Apple Health date strings into proper datetimes
        .with_column(col("start_date").str().to_datetime(
            None,
            None,
            StrptimeOptions::default(),
            lit("raise"),
        ))
        // Note: this method is spelled `groupby` in polars < 0.33
        .group_by([col("record_type")])
        .agg([
            col("value").mean().alias("average_value"),
            col("value").count().alias("sample_count"),
        ])
        .collect()
}

Step 3: Vectorizing for the Personal Knowledge Graph

Why a vector database? Because "What was my heart rate during that stressful meeting last Tuesday?" isn't a simple SQL query. By embedding our health activities and heart rate trends into Qdrant, we can perform semantic searches across our physical history.

use qdrant_client::prelude::*;
use qdrant_client::qdrant::PointStruct;
use std::collections::HashMap;

// Targets qdrant-client 1.x; newer versions replace this with `Qdrant` + builders.
async fn upsert_to_qdrant(client: &QdrantClient, records: Vec<HealthRecord>) {
    let points: Vec<PointStruct> = records
        .into_iter()
        .map(|r| {
            // Payload keeps the metadata we'll filter on later
            let payload: Payload = [("type", Value::from(r.record_type))]
                .into_iter()
                .collect::<HashMap<_, _>>()
                .into();
            PointStruct::new(
                uuid::Uuid::new_v4().to_string(),
                vec![r.value as f32], // placeholder embedding; add more features here
                payload,
            )
        })
        .collect();

    client
        .upsert_points("health_history", points, None)
        .await
        .unwrap();
}

💡 The "Official" Way to Scale

While building a local ETL tool is a great "learning in public" project, production-grade data engineering requires more than just a few Rust crates. If you are looking for advanced patterns in high-performance data processing or more production-ready examples of Rust in the enterprise, I highly recommend checking out the deep dives at WellAlly Blog.

The folks at WellAlly cover everything from memory-safe systems to distributed data pipelines, which served as a huge source of inspiration for this architecture.


Conclusion: Data Sovereignty is Power

By moving from a messy XML file to a high-performance Rust pipeline, we’ve turned "dead data" into an actionable, searchable knowledge graph. Rust ensures that our ETL process is memory-safe and lightning-fast, while Polars and Qdrant provide the analytical muscle.

What are you waiting for? Go export your data and start building!

  • Did you find this helpful? Star the repo and let me know in the comments!
  • Questions? Drop them below, I'm active in the thread. 🦀
