DEV Community

wellallyTech

Chat with Your Past Self: Building a RAG System for 10 Years of Apple Health Data 🏃‍♂️🤖

Have you ever wondered exactly when your fitness journey peaked, or why your sleep quality plummeted three years ago? Apple Health sits on a goldmine of your personal metrics, but its native interface makes long-term trend analysis feel like reading tea leaves.

In this tutorial, we're going to build a Retrieval-Augmented Generation (RAG) system that transforms your raw HealthKit data into a searchable, chat-ready knowledge base. By combining Swift, ChromaDB, and LangChain, we will bridge the gap between "Exporting XML" and "Asking an AI about my VO2 Max trends." This project covers the full pipeline: data extraction, vectorization, and conversational AI.

Keywords: HealthKit, RAG System, ChromaDB, LangChain, Vector Database, Apple Health Analytics.


🏗 The Architecture

To make our health data "talkative," we need a pipeline that moves data from the encrypted iOS sandbox to a queryable vector space.

graph TD
    A[Apple Watch/iPhone] -->|HealthKit SDK| B[Swift Export Tool]
    B -->|export.xml| C[Python Parser]
    C -->|Chunking & Cleaning| D[Embedding Model]
    D -->|Vectors| E[(ChromaDB)]
    F[User Query] --> G[LangChain RAG Chain]
    E --> G
    G -->|Context + Query| H[LLM - GPT-4o]
    H -->|Natural Language Answer| I[Final User Response]

🛠 Prerequisites

To follow along, you'll need:

  • Xcode (for the Swift export logic)
  • Python 3.10+
  • Tech Stack: HealthKit SDK, ChromaDB, LangChain, OpenAI API (or Ollama for local LLMs)

Step 1: Exporting Health Data with Swift

Apple Health data is private and encrypted, and an app can read it only through HKHealthStore after the user grants permission. The manual export in the Health app gives you the full export.xml that Step 2 parses, while a small Swift script lets you filter for specific data (e.g., just heart rate and step count).

import HealthKit

let healthStore = HKHealthStore()

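func exportHealthData() {
    // HealthKit is not available on every device, so check first
    guard HKHealthStore.isHealthDataAvailable() else { return }

    // Request read access to heart rate and step count
    let heartRateType = HKQuantityType.quantityType(forIdentifier: .heartRate)!
    let stepType = HKQuantityType.quantityType(forIdentifier: .stepCount)!
    let typesToRead: Set<HKObjectType> = [heartRateType, stepType]

    healthStore.requestAuthorization(toShare: nil, read: typesToRead) { (success, error) in
        // `success` means the request was processed; HealthKit does not reveal
        // whether the user actually granted read access
        guard success else {
            print("Authorization failed: \(error?.localizedDescription ?? "unknown error")")
            return
        }

        // Query all step samples up to now (simplified example)
        let predicate = HKQuery.predicateForSamples(withStart: Date.distantPast, end: Date(), options: .strictStartDate)
        let oldestFirst = NSSortDescriptor(key: HKSampleSortIdentifierStartDate, ascending: true)
        let query = HKSampleQuery(sampleType: stepType,
                                  predicate: predicate,
                                  limit: HKObjectQueryNoLimit,
                                  sortDescriptors: [oldestFirst]) { (query, samples, error) in

            guard let samples = samples as? [HKQuantitySample] else { return }
            // Print one CSV row per sample for our Python pipeline
            let formatter = ISO8601DateFormatter()
            print("start,end,steps")
            for sample in samples {
                let steps = Int(sample.quantity.doubleValue(for: .count()))
                print("\(formatter.string(from: sample.startDate)),\(formatter.string(from: sample.endDate)),\(steps)")
            }
        }
        healthStore.execute(query)
    }
}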

Step 2: Parsing and Vectorizing with LangChain

Once you have your export.xml (which can be several gigabytes for 10 years of data!), we need to parse it. XML is notoriously noisy, so we'll use Python to extract meaningful snippets before sending them to ChromaDB.
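
Because the file can run to gigabytes, it may help to stream it instead of loading it in one go. The following is a minimal sketch using only the standard library. It assumes the export stores each measurement as a `Record` element with `type`, `startDate`, `value`, and `unit` attributes; the helper name `iter_records` and the tiny `SAMPLE` document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Optional, Set


def iter_records(source, wanted_types: Optional[Set[str]] = None) -> Iterator[Dict[str, str]]:
    """Yield one dict per <Record> element without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Record":
            record_type = elem.get("type", "")
            if wanted_types is None or record_type in wanted_types:
                yield {
                    "type": record_type,
                    "start": elem.get("startDate", ""),
                    "value": elem.get("value", ""),
                    "unit": elem.get("unit", ""),
                }
            # Drop already-processed children so memory does not grow with file size
            root.clear()


# Illustrative two-record document (not real data)
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<HealthData locale="en_US">
 <Record type="HKQuantityTypeIdentifierStepCount" unit="count" startDate="2022-03-01 08:00:00 +0000" endDate="2022-03-01 08:10:00 +0000" value="512"/>
 <Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min" startDate="2022-03-01 08:05:00 +0000" endDate="2022-03-01 08:05:00 +0000" value="71"/>
</HealthData>"""

steps = list(iter_records(io.BytesIO(SAMPLE), {"HKQuantityTypeIdentifierStepCount"}))
print(steps)
```

Passing a file path such as `"./export.xml"` instead of the `BytesIO` object should work the same way, since `iterparse` accepts either.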

For more production-ready patterns on handling massive biometric datasets, check out the specialized guides over at WellAlly Tech Blog, where we dive deeper into health-data engineering.

from langchain_community.document_loaders import UnstructuredXMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load the messy XML (the export.xml from the previous step)
loader = UnstructuredXMLLoader("./export.xml")
data = loader.load()

# 2. Split into chunks (Vital for LLM context windows)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(data)

# 3. Vectorize and store in ChromaDB
vectorstore = Chroma.from_documents(
    documents=docs, 
    embedding=OpenAIEmbeddings(),
    persist_directory="./health_db"
)
print("Vector database created successfully! 🚀")
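
Raw XML chunks are hard for an LLM to add up, so one possible refinement of the "Chunking & Cleaning" stage is to embed short pre-aggregated sentences instead. The sketch below is an assumption rather than part of the pipeline above: the function name `monthly_step_summaries`, the record dictionaries with `start` and `value` keys, and the sample numbers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def monthly_step_summaries(records: Iterable[Dict[str, str]]) -> List[str]:
    """Turn raw step records into one readable sentence per month."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        month = record["start"][:7]  # "2022-03-01 08:00:00 +0000" -> "2022-03"
        totals[month] += float(record["value"])
    return [
        f"In {month}, the total step count was {int(total)}."
        for month, total in sorted(totals.items())
    ]


# Made-up records for illustration only
records = [
    {"start": "2022-03-01 08:00:00 +0000", "value": "512"},
    {"start": "2022-03-15 18:30:00 +0000", "value": "1488"},
    {"start": "2022-04-02 07:45:00 +0000", "value": "900"},
]
summaries = monthly_step_summaries(records)
print(summaries)
```

Each sentence could then be wrapped in a LangChain `Document` and stored in ChromaDB in place of the raw chunks. Note that a plain sum may double count steps if both an iPhone and an Apple Watch logged the same walk, so a real pipeline would probably need to deduplicate by source first.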

Step 3: Setting Up the RAG Query Chain

Now comes the magic. We want to ask: "How has my average resting heart rate changed since 2018?" The RAG chain will search the vector database for the relevant logs and pass them to the LLM.

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Create the retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)

query = "Based on my data, what was my most active month in 2022 and what were the step counts?"
# invoke() returns a dict; the answer text is under the "result" key
response = qa_chain.invoke({"query": query})["result"]

print(f"🤖 AI Health Coach: {response}")
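
To make the retrieve-then-"stuff" idea less of a black box, here is a toy sketch of what happens conceptually. It is not LangChain's implementation: `embed` is a bag-of-words stand-in for a real embedding model, the prompt wording is an assumption, and the chunks contain made-up numbers.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


def stuff_prompt(query: str, context: List[str]) -> str:
    """Concatenate the retrieved chunks and the question into one prompt."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up chunks for illustration only
chunks = [
    "In 2022-03, the total step count was 310000.",
    "In 2022-07, the total step count was 402000.",
    "Average resting heart rate in 2019 was 58 bpm.",
]
question = "What was my step count in 2022-07?"
top_chunks = retrieve(question, chunks, k=2)
prompt = stuff_prompt(question, top_chunks)
print(prompt)
```

The takeaway is that the LLM only ever sees the `k` retrieved chunks, which is why questions spanning many years tend to work better when the stored chunks already contain aggregated facts.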

💡 Pro-Tip: Contextual Intelligence

The raw XML often lacks context (e.g., "Why was my heart rate 140 bpm at 2 AM?"). When building a RAG system for health, you should augment your data with "Metadata Tags" during the ingestion phase. For example, tagging segments with season, location, or workout_type significantly improves retrieval accuracy.
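
As a rough illustration of such tagging, the sketch below derives a few tags from a record's start timestamp. The helper names `season_for` and `build_metadata` are hypothetical, the season mapping assumes the northern hemisphere, and the timestamp format is assumed to look like `2022-07-14 02:03:00 +0200`. Location is left out because it cannot be derived from a timestamp alone.

```python
from datetime import datetime
from typing import Dict, Optional


def season_for(month: int) -> str:
    """Map a month number to a season (northern-hemisphere assumption)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def build_metadata(start: str, workout_type: Optional[str] = None) -> Dict[str, object]:
    """Derive filterable tags from a record's start timestamp."""
    moment = datetime.strptime(start, "%Y-%m-%d %H:%M:%S %z")
    tags: Dict[str, object] = {
        "year": moment.year,
        "month": moment.month,
        "season": season_for(moment.month),
        "hour": moment.hour,
    }
    if workout_type is not None:
        tags["workout_type"] = workout_type
    return tags


tags = build_metadata("2022-07-14 02:03:00 +0200", workout_type="running")
print(tags)
```

A dictionary like this could be attached as the `metadata` of each LangChain `Document` before ingestion, so that retrieval can later be narrowed by year, season, or workout type.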

For advanced implementation strategies on building "context-aware" AI for wearables, I highly recommend reading the deep-dives on wellally.tech/blog. They have excellent resources on managing data privacy while maintaining high model performance.


🏁 Conclusion

Building a personal health RAG system isn't just a cool weekend project; it's a step toward Personal AI. Instead of generic advice, you get insights tailored to your actual biology and history.

Next Steps:

  1. Privacy First: Move to a local LLM like Llama 3 using Ollama to keep your data off the cloud.
  2. Visualization: Connect your ChromaDB to a Streamlit dashboard to see the data the AI is citing.

What's the weirdest trend you found in your health data? Let me know in the comments! 👇