Have you ever looked at your Apple Health dashboard and felt like you were staring at a graveyard of isolated numbers? 📉 You have your steps in one corner, heart rate in another, and sleep data buried under three menus. The real problem isn't the lack of data—it's the lack of context.
In this tutorial, we are going to build a Personal Health Data Lake that breaks these silos. By leveraging a Neo4j Knowledge Graph, we will ingest raw data from Apple HealthKit and Google Health, then link these metrics to standardized medical ontologies like UMLS. This allows us to perform complex health tracing—like seeing how a week of poor sleep directly correlates with your resting heart rate (RHR) trends over time.
Keywords: Personal Health Data Lake, Neo4j Knowledge Graph, Apple HealthKit API, Medical Data Engineering, Health Informatics.
The Architecture: Linking Senses to Logic
To build a truly "smart" health lake, we need a pipeline that can handle messy XML/JSON exports and transform them into a graph of nodes and relationships.
graph TD
A[Apple Health Export / HealthKit] -->|XML/JSON| B(Python ETL Processor)
C[Google Health / Fit] -->|JSON| B
B --> D{Data Normalization}
D --> E[Neo4j Knowledge Graph]
F[UMLS / Medical Knowledge Base] -->|Ontology Mapping| E
E --> G[Cypher Query Interface]
G --> H[Health Insights & RAG]
By using a graph database instead of a traditional SQL table, we represent health as it exists in the real world: a web of interconnected biological events.
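The "Data Normalization" stage in the diagram can be sketched as a single record type that both sources are converted into. This is a minimal illustration, not the article's implementation: `HealthRecord`, `from_apple`, and `from_google` are hypothetical names, and the Google input keys (`metric`, `value`, `unit`, `start_ms`) are an assumed, already-flattened shape rather than the real Takeout layout.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

APPLE_PREFIX = "HKQuantityTypeIdentifier"

# Assumed mapping from a flattened Google metric name to our own metric names.
GOOGLE_METRIC_NAMES = {"heart_rate": "HeartRate", "step_count": "StepCount"}


@dataclass(frozen=True)
class HealthRecord:
    """One normalized measurement, independent of its source (hypothetical schema)."""
    source: str          # "apple" or "google"
    metric: str          # e.g. "HeartRate"
    value: float
    unit: str
    timestamp: datetime  # always converted to UTC


def from_apple(attrs: dict) -> HealthRecord:
    """Normalize the attributes of one <Record> element from export.xml."""
    # Apple writes dates like "2024-01-05 07:30:00 +0100".
    ts = datetime.strptime(attrs["startDate"], "%Y-%m-%d %H:%M:%S %z")
    return HealthRecord(
        source="apple",
        metric=attrs["type"].replace(APPLE_PREFIX, ""),
        value=float(attrs["value"]),
        unit=attrs["unit"],
        timestamp=ts.astimezone(timezone.utc),
    )


def from_google(point: dict) -> HealthRecord:
    """Normalize one data point; the input keys are an assumed flattened shape."""
    ts = datetime.fromtimestamp(point["start_ms"] / 1000, tz=timezone.utc)
    return HealthRecord(
        source="google",
        metric=GOOGLE_METRIC_NAMES.get(point["metric"], point["metric"]),
        value=float(point["value"]),
        unit=point["unit"],
        timestamp=ts,
    )
```

Storing every timestamp in UTC keeps records from both sources comparable before they are written to the graph.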
Prerequisites
To follow along, you'll need:
- Neo4j: Running via Docker or Neo4j Desktop.
- Python 3.9+: For the ETL pipeline.
- Health Data: An export from Apple Health (export.xml) or Google Takeout.
- Tech Stack: neo4j-driver, pandas, xml.etree.ElementTree.
Step 1: Setting Up the Neo4j Environment
First, let's spin up our graph database using Docker. We'll use the APOC (Awesome Procedures on Cypher) plugin to help with data transformations. On the Neo4j 5 image, the NEO4J_PLUGINS variable is what tells the container to install APOC at startup.
docker run \
    -p 7474:7474 -p 7687:7687 \
    --name my_health_graph \
    -e NEO4J_AUTH=neo4j/password123 \
    -e NEO4J_PLUGINS='["apoc"]' \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    neo4j:latest
Step 2: The Python ETL Pipeline
Apple Health exports its data in a massive XML file. We need to parse this and map it to a format Neo4j understands. Here’s a streamlined script to extract Heart Rate and Step Count data.
import xml.etree.ElementTree as ET
from neo4j import GraphDatabase

# Apple's type identifiers contain no dots, so map them explicitly
# to the short names stored in the graph.
METRIC_TYPES = {
    "HKQuantityTypeIdentifierHeartRate": "HeartRate",
    "HKQuantityTypeIdentifierStepCount": "StepCount",
}


class HealthDataImporter:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def import_record(self, record_type, value, unit, start_date):
        with self.driver.session() as session:
            # MERGE the day node, then attach a new Metric node to it
            query = """
            MERGE (d:Date {date: $date})
            CREATE (m:Metric {type: $type, value: toFloat($value), unit: $unit, timestamp: $timestamp})
            CREATE (m)-[:RECORDED_ON]->(d)
            """
            session.run(query, type=record_type, value=value,
                        unit=unit, timestamp=start_date, date=start_date[:10])


# Usage
importer = HealthDataImporter("bolt://localhost:7687", "neo4j", "password123")
try:
    tree = ET.parse('export.xml')
    root = tree.getroot()
    for record in root.findall('Record'):
        # Exact match, so related types such as HeartRateVariabilitySDNN
        # are not imported by accident
        metric = METRIC_TYPES.get(record.get('type'))
        if metric:
            importer.import_record(
                metric,
                record.get('value'),
                record.get('unit'),
                record.get('startDate')
            )
finally:
    # Release the driver's connections even if the import fails
    importer.close()
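Because the export is one massive XML file, loading it whole and sending one query per record can be slow. The sketch below shows an alternative under stated assumptions: it streams the file with `ET.iterparse` and writes rows in batches through a single `UNWIND` query. The names `stream_records`, `batched`, `load`, and `BATCH_QUERY` are hypothetical, and `session` is assumed to be an open session from the Neo4j Python driver.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Apple type identifiers we keep, mapped to the short names stored in the graph.
WANTED = {
    "HKQuantityTypeIdentifierHeartRate": "HeartRate",
    "HKQuantityTypeIdentifierStepCount": "StepCount",
}

# Same node and relationship shape as the per-record query, applied to a list.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (d:Date {date: row.date})
CREATE (m:Metric {type: row.type, value: toFloat(row.value),
                  unit: row.unit, timestamp: row.timestamp})
CREATE (m)-[:RECORDED_ON]->(d)
"""


def stream_records(source):
    """Yield one dict per wanted <Record> without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        start = elem.get("startDate")
        if elem.tag == "Record" and elem.get("type") in WANTED and start:
            yield {
                "type": WANTED[elem.get("type")],
                "value": elem.get("value"),
                "unit": elem.get("unit"),
                "timestamp": start,
                "date": start[:10],
            }
        elem.clear()  # drop the element's attributes and children once handled


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def load(session, source, batch_size=1000):
    """Send one UNWIND query per batch and return the number of rows sent."""
    total = 0
    for rows in batched(stream_records(source), batch_size):
        session.run(BATCH_QUERY, rows=rows)
        total += len(rows)
    return total
```

A call such as `load(session, "export.xml")` inside a `with driver.session() as session:` block would run the whole import over one session.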
Step 3: Integrating Medical Context (UMLS)
Raw numbers are boring. Is a heart rate of 100 bpm "High" or "Normal"? By linking our nodes to the UMLS (Unified Medical Language System), we can add clinical significance.
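As a rough sketch of what that linking could look like, the snippet below labels a resting heart rate reading and prepares the parameters for attaching `Metric` nodes to a `Concept` node. Everything here is an assumption made for illustration: the `Concept` label, the `INSTANCE_OF` relationship, and the function names are hypothetical, the CUIs are placeholders that must be looked up in the UMLS Metathesaurus, and the 60 to 100 bpm defaults are the commonly cited adult resting range rather than anything derived from UMLS itself.

```python
from typing import Optional

# Placeholder identifiers only: replace with real CUIs from the UMLS Metathesaurus.
CONCEPTS = {
    "HeartRate": {"cui": "CUI_PLACEHOLDER_HEART_RATE", "name": "Heart rate"},
    "StepCount": {"cui": "CUI_PLACEHOLDER_STEP_COUNT", "name": "Step count"},
}

# Hypothetical graph shape: (:Metric)-[:INSTANCE_OF]->(:Concept {cui, name})
LINK_QUERY = """
MERGE (c:Concept {cui: $cui})
  ON CREATE SET c.name = $name
WITH c
MATCH (m:Metric {type: $type})
MERGE (m)-[:INSTANCE_OF]->(c)
"""


def classify_resting_heart_rate(bpm: float, low: float = 60.0, high: float = 100.0) -> str:
    """Label a resting heart rate against an assumed adult reference range."""
    if bpm < low:
        return "Low"
    if bpm > high:
        return "High"
    return "Normal"


def concept_params(metric_type: str) -> Optional[dict]:
    """Build the parameters for LINK_QUERY, or None if the type has no mapping."""
    concept = CONCEPTS.get(metric_type)
    if concept is None:
        return None
    return {"type": metric_type, **concept}
```

With an open driver session, `session.run(LINK_QUERY, **concept_params("HeartRate"))` would attach every heart rate metric to its concept node. Under these default thresholds, the 100 bpm reading from the question above is labeled "Normal", and anything above it "High".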
For more production-ready examples of how to map custom telemetry to clinical standards, check out the architectural deep-dives at WellAlly Tech Blog. They cover advanced RAG (Retrieval-Augmented Generation) patterns that are essential for building reliable health AI agents.
Cypher Query: Correlating Sleep and Heart Rate
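Once the data is in, we can run queries that would take several self-joins in SQL. Let's find the average heart rate on days where total "Deep Sleep" was less than 6 hours. The query assumes sleep records were also imported as Metric nodes with type 'SleepAnalysis', a subType property, and a value in hours, which the Step 2 script does not do yet:
// Sum deep sleep per day first, then keep only the short days
MATCH (s:Metric {type: 'SleepAnalysis', subType: 'Deep'})-[:RECORDED_ON]->(d:Date)
WITH d, sum(s.value) AS deep_sleep_hours
WHERE deep_sleep_hours < 6.0
MATCH (hr:Metric {type: 'HeartRate'})-[:RECORDED_ON]->(d)
RETURN d.date AS date, avg(hr.value) AS avg_hr
ORDER BY date DESC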
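To put a number on how strongly sleep and heart rate move together, the per-day values can be fed into a Pearson correlation. This is a small sketch in plain Python, assuming each day has already been reduced to a pair of values (hours of deep sleep, average heart rate); the function name `pearson` and the input shape are illustrative choices, not part of the pipeline above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near -1 would mean heart rate tends to rise on days with less deep sleep, while a value near 0 would mean no linear relationship in the data. A handful of days is not enough to draw conclusions either way.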
Advanced: Creating the Knowledge Graph Logic
To make this a true "Knowledge Graph," we create a User node and link it to the Metric nodes, so every measurement has an owner that further context can hang off.
sequenceDiagram
participant U as User
participant HK as HealthKit API
participant P as Python Processor
participant N as Neo4j
U->>HK: Request Data Export
HK->>P: Send XML/JSON
P->>P: Normalize & Clean
P->>N: Cypher: MERGE (u:User)-[:HAS_METRIC]->(m:Metric)
N-->>U: Graph Insights (Visualized)
The Schema Definition
Defining your constraints is vital for graph integrity:
// Create uniqueness constraints
CREATE CONSTRAINT FOR (d:Date) REQUIRE d.date IS UNIQUE;
CREATE CONSTRAINT FOR (u:User) REQUIRE u.id IS UNIQUE;
// Link everything to a User (MERGE creates the User node if it is missing)
MERGE (u:User {id: 'me'})
WITH u
MATCH (m:Metric)
MERGE (u)-[:HAS_METRIC]->(m);
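If the schema is applied from the ETL script rather than typed into the browser, the statements can be made safe to re-run. The sketch below is one way to do that, assuming Neo4j 5 syntax and an open driver session passed in as `session`. The constraint names, the `ensure_schema` helper, and the extra index on `Metric.type` are additions for illustration, not part of the original schema.

```python
# Statements are idempotent thanks to IF NOT EXISTS, so they can run on every import.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT date_unique IF NOT EXISTS "
    "FOR (d:Date) REQUIRE d.date IS UNIQUE",
    "CREATE CONSTRAINT user_id_unique IF NOT EXISTS "
    "FOR (u:User) REQUIRE u.id IS UNIQUE",
    # Assumed extra: an index to speed up lookups that filter metrics by type.
    "CREATE INDEX metric_type IF NOT EXISTS "
    "FOR (m:Metric) ON (m.type)",
]


def ensure_schema(session):
    """Run every schema statement once and return how many were sent."""
    for statement in SCHEMA_STATEMENTS:
        session.run(statement)
    return len(SCHEMA_STATEMENTS)
```

Running this before the import means the `MERGE` on `Date` nodes can rely on the uniqueness constraint's backing index from the first record onward.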
Conclusion
Building a Personal Health Data Lake isn't just about storage; it's about discovery. By moving from a flat export file (XML/JSON) to a Neo4j Knowledge Graph, you've turned your phone's sensors into a structured medical history that can be queried, visualized, and even used as a backend for a personalized Health AI.
What's next?
- Add Weather Data: See how humidity affects your running pace.
- RAG Integration: Use this graph to provide context to an LLM for personalized health coaching.
If you enjoyed this advanced deep dive, I highly recommend checking out the engineering resources at wellally.tech/blog for more insights on medical data engineering and building secure health platforms.
Drop a comment below: What's the weirdest correlation you've found in your health data? 👇