DEV Community

Beck_Moulton
Vectorizing Your Vitals: Converting 10GB of Apple Health Data into a Personal RAG Brain

If you've ever tried to open your Apple Health export file, you know it's where dreams of "quantified self" go to die. You're met with a monolithic export.xml file that can easily swell to 10GB+, filled with deeply nested tags and millions of rows of heart rate samples, sleep stages, and workout metrics.

In this tutorial, we’re going to perform some heavy-duty Data Engineering to transform that chaotic XML into a high-performance RAG (Retrieval-Augmented Generation) system. We will leverage DuckDB for lightning-fast time-series processing, Apache Arrow for memory-efficient data transport, and Qdrant with LlamaIndex to build an AI that actually knows your health history.

By the end, you’ll be able to ask your LLM: "How has my resting heart rate trended on days after I did a HIIT workout compared to yoga?"


The Architecture: From Raw XML to Vector Insights

Handling 10GB of XML requires a specialized pipeline. We can't just throw this into a pandas dataframe unless we want our RAM to spontaneously combust.

graph TD
    A[Apple Health export.xml] --> B[DuckDB XML Parser]
    B --> C{Feature Engineering}
    C --> D[Apache Arrow Stream]
    D --> E[LlamaIndex Document Ingestion]
    E --> F[Qdrant Vector Database]
    G[User Query] --> H[LlamaIndex Retriever]
    H --> F
    F --> I[GPT-4o / Claude 3.5]
    I --> J[Insightful Health Answer]

Prerequisites

To follow along, you’ll need:

  • Python 3.10+
  • DuckDB: Our analytical engine.
  • Qdrant: The vector store for high-speed retrieval.
  • LlamaIndex: The orchestration framework for RAG.
  • Apache Arrow: For zero-copy data handling.
pip install duckdb qdrant-client llama-index llama-index-vector-stores-qdrant adbc-driver-manager pyarrow

Step 1: Taming the XML Beast with DuckDB

Loading a 10GB file with a DOM parser like Python's ElementTree means materializing the entire tree in memory. DuckDB (via its community XML extension) can instead treat the export as a relational source, using XPath to project out only the attributes we need. We'll extract key metrics like HeartRate, StepCount, and SleepAnalysis.

import duckdb

# Connect to a local, persistent DuckDB instance
con = duckdb.connect(database='health_data.db')

# XML support is not in core DuckDB; it ships as a community extension,
# so install it from the community repository (the extension name may
# differ depending on your DuckDB version).
con.execute("INSTALL xml FROM community; LOAD xml;")

# Extract heart rate records into a structured table.
# Apple Health XML uses <Record type="HKQuantityTypeIdentifierHeartRate" ... />
# read_blob() exposes the raw file bytes in its `content` column.
# Apple's startDate strings carry a UTC offset, hence TIMESTAMPTZ.
query = """
CREATE OR REPLACE TABLE heart_rate AS
SELECT
    attr_type AS type,
    attr_value::DOUBLE AS value,
    attr_unit AS unit,
    attr_startDate::TIMESTAMPTZ AS start_date
FROM (
    SELECT
        unnest(xpath(content, '//Record[@type="HKQuantityTypeIdentifierHeartRate"]/@type'))::VARCHAR AS attr_type,
        unnest(xpath(content, '//Record[@type="HKQuantityTypeIdentifierHeartRate"]/@value'))::VARCHAR AS attr_value,
        unnest(xpath(content, '//Record[@type="HKQuantityTypeIdentifierHeartRate"]/@unit'))::VARCHAR AS attr_unit,
        unnest(xpath(content, '//Record[@type="HKQuantityTypeIdentifierHeartRate"]/@startDate'))::VARCHAR AS attr_startDate
    FROM read_blob('export.xml')
);
"""
con.execute(query)
print("✅ Heart rate data processed into DuckDB!")
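If your DuckDB build doesn't ship the XML community extension, the standard library still offers a streaming fallback: ElementTree's iterparse fires an event per element and lets you discard each subtree after reading its attributes, so memory stays flat even on a 10GB export. A minimal sketch (the function name is mine; the record type default matches the query above):

```python
import xml.etree.ElementTree as ET

def stream_records(path, record_type="HKQuantityTypeIdentifierHeartRate"):
    """Yield (value, unit, startDate) tuples without loading the whole file."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == record_type:
            yield (float(elem.get("value")), elem.get("unit"), elem.get("startDate"))
        # Drop the parsed subtree so memory stays flat on huge files
        elem.clear()
```

You could feed these tuples straight into DuckDB with executemany, keeping the rest of the pipeline unchanged.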

Step 2: Time-Series Feature Engineering

A RAG system is only as good as the context you provide. Instead of raw timestamps, we need to aggregate data so the LLM can understand patterns. We'll use Apache Arrow to stream this data out of DuckDB efficiently.

# Aggregate daily averages to create "Health Summaries".
# fetch_arrow_table() hands the result over as a pyarrow.Table
# (not a pandas DataFrame), so the data leaves DuckDB without a
# row-by-row copy into Python objects.
summary_df = con.execute("""
    SELECT
        CAST(start_date AS DATE) AS day,
        AVG(value) AS avg_heart_rate,
        MAX(value) AS max_heart_rate,
        COUNT(*) AS samples
    FROM heart_rate
    GROUP BY 1
    ORDER BY 1 DESC
""").fetch_arrow_table()

print(f"Aggregated {summary_df.num_rows} days of health metrics.")

Step 3: Building the RAG Knowledge Base

Now, we use LlamaIndex to turn these rows into "Documents." We’ll store them in Qdrant, which allows us to perform hybrid searches (semantic + metadata filtering).

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# 1. Initialize Qdrant Client
client = qdrant_client.QdrantClient(path="./qdrant_db")

# 2. Convert Arrow/Table rows into searchable Documents
documents = []
for row in summary_df.to_pylist():
    text_content = f"On {row['day']}, average heart rate was {row['avg_heart_rate']:.2f} bpm with a peak of {row['max_heart_rate']:.2f}."
    doc = Document(text=text_content, metadata={"date": str(row['day'])})
    documents.append(doc)

# 3. Setup Vector Store
vector_store = QdrantVectorStore(client=client, collection_name="apple_health")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 4. Indexing (this triggers embeddings; LlamaIndex defaults to OpenAI,
#    so set OPENAI_API_KEY or configure Settings.embed_model for a local model)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print("🚀 Knowledge base is ready!")
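One document per day keeps retrieval granular, but over a multi-year history each retrieved node then covers very little. A hedged sketch (plain Python, the function name is mine) that collapses the daily rows into monthly summaries before indexing; each string can be wrapped in a Document with a month metadata key, exactly like the daily loop above:

```python
from collections import defaultdict

def monthly_summaries(rows):
    """Collapse daily heart-rate rows into one text summary per month.

    Each row is a dict with 'day' (a datetime.date), 'avg_heart_rate',
    and 'max_heart_rate', matching summary_df.to_pylist() above.
    """
    by_month = defaultdict(list)
    for row in rows:
        by_month[(row["day"].year, row["day"].month)].append(row)

    summaries = {}
    for (year, month), days in sorted(by_month.items()):
        avg = sum(d["avg_heart_rate"] for d in days) / len(days)
        peak = max(d["max_heart_rate"] for d in days)
        summaries[f"{year}-{month:02d}"] = (
            f"In {year}-{month:02d}, average heart rate over {len(days)} days "
            f"was {avg:.1f} bpm with a monthly peak of {peak:.1f} bpm."
        )
    return summaries
```

A reasonable middle ground is indexing both granularities and letting metadata filters pick the right one per query.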

The "Official" Way: Advanced Patterns

While this setup works for a local experiment, production-grade health-tech applications require more robust data validation (Pydantic), HIPAA-compliant storage, and sophisticated window-based retrieval.

For a deep dive into advanced RAG patterns, production-ready data pipelines, and scalable AI architectures, I highly recommend checking out the technical deep-dives over at WellAlly Blog. They cover the intersection of health data and engineering at a much more granular level!


Step 4: Querying Your Vitals

Now comes the magic. We can ask natural language questions about our raw health data.

# Uses the default LLM (OpenAI unless you override Settings.llm)
query_engine = index.as_query_engine()

response = query_engine.query(
    "Looking at my heart rate data from the last month, "
    "was there any day where my peak heart rate was unusually high? "
    "Summarize the potential reasons based on the trends."
)

print(f"AI Health Assistant: {response}")

Why this works:

  1. DuckDB handled the heavy lifting of the 10GB XML, preventing memory overflows.
  2. Apache Arrow ensured that moving data from the DB to the Vector Store was "zero-copy" and fast.
  3. Qdrant lets us scale to millions of health data points while keeping retrieval latency low.

Conclusion

We've successfully moved from a messy export.xml to a structured, AI-ready knowledge base. This is the foundation for building Personal Health Coaches or Quantified Self Dashboards that actually understand the context of your life.

Key Takeaways:

  • Stop parsing XML with ElementTree; use DuckDB.
  • Aggregate your data before vectorizing it to provide the LLM with meaningful context.
  • Use LlamaIndex + Qdrant for the heavy lifting of semantic search.

What are you going to build with your health data? Let me know in the comments!

Top comments (1)

Hamza KONTE
This is a really compelling personal RAG setup — vectorizing Apple Health data for personal querying is both technically interesting and practically useful.

One thing worth optimizing once your retrieval is dialed in: the query prompts themselves. RAG output quality is partly retrieval quality, but also heavily influenced by how precisely you ask. A structured prompt with explicit context, constraints, and output format block gets dramatically better synthesis from the retrieved chunks than a freeform question.

I built flompt for exactly this kind of prompt crafting — it structures any prompt into 12 semantic blocks and compiles to optimized XML.

flompt.dev / github.com/Nyrok/flompt