DEV Community

Beck_Moulton

From Pixels to Pulse: Building a Personal Health Knowledge Graph with Neo4j and Python

Have you ever looked at your Apple Health dashboard and felt like you were staring at a graveyard of isolated numbers? 📉 You have your steps in one corner, heart rate in another, and sleep data buried under three menus. The real problem isn't the lack of data—it's the lack of context.

In this tutorial, we are going to build a Personal Health Data Lake that breaks these silos. By leveraging a Neo4j Knowledge Graph, we will ingest raw data from Apple HealthKit and Google Health, then link these metrics to standardized medical ontologies like UMLS. This allows us to perform complex health tracing—like seeing how a week of poor sleep directly correlates with your resting heart rate (RHR) trends over time.

Keywords: Personal Health Data Lake, Neo4j Knowledge Graph, Apple HealthKit API, Medical Data Engineering, Health Informatics.


The Architecture: Linking Senses to Logic

To build a truly "smart" health lake, we need a pipeline that can handle messy XML/JSON exports and transform them into a graph of nodes and relationships.

graph TD
    A[Apple Health Export / HealthKit] -->|XML/JSON| B(Python ETL Processor)
    C[Google Health / Fit] -->|JSON| B
    B --> D{Data Normalization}
    D --> E[Neo4j Knowledge Graph]
    F[UMLS / Medical Knowledge Base] -->|Ontology Mapping| E
    E --> G[Cypher Query Interface]
    G --> H[Health Insights & RAG]

By using a graph database instead of a traditional SQL table, we represent health as it exists in the real world: a web of interconnected biological events.
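
The "Data Normalization" box in the diagram can be sketched as a small common record type that both sources are converted into. This is only an illustration under stated assumptions: `HealthSample`, `from_apple_record`, and `from_generic_json` are hypothetical names, and the JSON keys in the second function are placeholders, not the real Google export schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HealthSample:
    """One normalized measurement, independent of the source platform."""
    source: str
    metric: str
    value: float
    unit: str
    start: datetime


def from_apple_record(attrib: dict) -> HealthSample:
    """Normalize the attributes of one <Record> element from export.xml."""
    return HealthSample(
        source="apple",
        # Apple prefixes quantity types, e.g. "HKQuantityTypeIdentifierHeartRate"
        metric=attrib["type"].removeprefix("HKQuantityTypeIdentifier"),
        value=float(attrib["value"]),
        unit=attrib["unit"],
        # Export timestamps look like "2024-03-01 07:15:00 -0700"
        start=datetime.strptime(attrib["startDate"], "%Y-%m-%d %H:%M:%S %z"),
    )


def from_generic_json(point: dict) -> HealthSample:
    """Normalize a JSON data point. The keys used here are assumed placeholders."""
    return HealthSample(
        source="google",
        metric=point["metric"],
        value=float(point["value"]),
        unit=point["unit"],
        start=datetime.fromisoformat(point["start"]),
    )


sample = from_apple_record({
    "type": "HKQuantityTypeIdentifierHeartRate",
    "value": "62",
    "unit": "count/min",
    "startDate": "2024-03-01 07:15:00 -0700",
})
print(sample.metric, sample.value, sample.start.date())
```

Keeping the timezone offset on `start` matters later, because the calendar day a sample belongs to depends on local time.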


Prerequisites

To follow along, you'll need:

  • Neo4j: Running via Docker or Neo4j Desktop.
  • Python 3.9+: For the ETL pipeline.
  • Health Data: An export from Apple Health (export.xml) or Google Takeout.
  • Tech Stack: neo4j (the official Python driver, previously published as neo4j-driver), pandas, xml.etree.ElementTree (standard library).

Step 1: Setting Up the Neo4j Environment

First, let's spin up our graph database using Docker. We'll use the APOC (Awesome Procedures on Cypher) plugin to help with data transformations.

docker run \
    -p 7474:7474 -p 7687:7687 \
    --name my_health_graph \
    -e NEO4J_AUTH=neo4j/password123 \
    -e NEO4J_PLUGINS='["apoc"]' \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    neo4j:latest
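
The container needs a few seconds before it accepts Bolt connections, so an import script started right after `docker run` can fail with a connection error. A minimal sketch of a readiness check follows; `wait_for_port` is a hypothetical helper that only tests whether the TCP port is open, not whether authentication succeeds.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet, retry shortly
    return False
```

Calling `wait_for_port("localhost", 7687)` before creating the driver gives the database time to come up.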

Step 2: The Python ETL Pipeline

Apple Health exports its data in a massive XML file. We need to parse this and map it to a format Neo4j understands. Here’s a streamlined script to extract Heart Rate and Step Count data.

import xml.etree.ElementTree as ET
from neo4j import GraphDatabase

# Apple's type identifiers contain no dots, e.g. "HKQuantityTypeIdentifierHeartRate"
TYPE_PREFIX = "HKQuantityTypeIdentifier"
WANTED_TYPES = {"HeartRate", "StepCount"}

class HealthDataImporter:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        # One session for the whole import instead of one per record
        self.session = self.driver.session()

    def close(self):
        self.session.close()
        self.driver.close()

    def import_record(self, record_type, value, unit, start_date):
        # MERGE reuses the Date node for a day; each sample gets its own Metric node
        query = """
        MERGE (d:Date {date: $date})
        CREATE (m:Metric {type: $type, value: toFloat($value), unit: $unit, timestamp: $timestamp})
        CREATE (m)-[:RECORDED_ON]->(d)
        """
        self.session.run(query, type=record_type, value=value,
                         unit=unit, timestamp=start_date, date=start_date[:10])

# Usage
importer = HealthDataImporter("bolt://localhost:7687", "neo4j", "password123")
try:
    # iterparse streams the export instead of loading the whole file into memory
    for _, elem in ET.iterparse('export.xml', events=('end',)):
        if elem.tag == 'Record':
            record_type = (elem.get('type') or '').removeprefix(TYPE_PREFIX)
            # Exact match, so RestingHeartRate or HeartRateVariabilitySDNN are skipped
            if record_type in WANTED_TYPES:
                importer.import_record(
                    record_type,
                    elem.get('value'),
                    elem.get('unit'),
                    elem.get('startDate')
                )
        elem.clear()  # free the element once it has been handled
finally:
    importer.close()
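
Running one query per record means one round trip to the database for every sample, which gets slow on exports with millions of rows. A common alternative is to send rows in batches and let Cypher's `UNWIND` expand them. The sketch below assumes the rows have already been parsed into dictionaries; `chunked`, `write_batches`, and the batch size of 1000 are illustrative choices, and `session` is expected to be a session from the Neo4j Python driver.

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_QUERY = """
UNWIND $rows AS row
MERGE (d:Date {date: row.date})
CREATE (m:Metric {type: row.type, value: toFloat(row.value),
                  unit: row.unit, timestamp: row.timestamp})
CREATE (m)-[:RECORDED_ON]->(d)
"""


def chunked(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def write_batches(session, rows: Iterable[dict], size: int = 1000) -> int:
    """Send rows to Neo4j one batch per query; return the number of batches."""
    count = 0
    for batch in chunked(rows, size):
        session.run(BATCH_QUERY, rows=batch)
        count += 1
    return count
```

Each row needs the keys `date`, `type`, `value`, `unit`, and `timestamp`, mirroring the parameters of the single-record query.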

Step 3: Integrating Medical Context (UMLS)

Raw numbers are boring. Is a heart rate of 100 bpm "High" or "Normal"? By linking our nodes to the UMLS (Unified Medical Language System), we can add clinical significance.
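
One way to sketch this linking step is a lookup from metric type to a concept node, plus a small classifier for the value itself. Everything here is an assumption for illustration: the CUI strings are placeholders that must be replaced with identifiers looked up in the UMLS Metathesaurus, the `Concept` label and `INSTANCE_OF` relationship are hypothetical names, and the 60 to 100 bpm band is the commonly cited adult resting range, used only as an example threshold.

```python
from typing import Optional

# Placeholder identifiers: look up the real CUIs in the UMLS Metathesaurus.
CONCEPTS = {
    "HeartRate": {"cui": "CUI_PLACEHOLDER_HEART_RATE", "name": "Heart rate"},
    "StepCount": {"cui": "CUI_PLACEHOLDER_STEP_COUNT", "name": "Step count"},
}

# Hypothetical label (Concept) and relationship (INSTANCE_OF) names.
LINK_QUERY = """
MATCH (m:Metric {type: $type})
MERGE (c:Concept {cui: $cui})
  ON CREATE SET c.name = $name
MERGE (m)-[:INSTANCE_OF]->(c)
"""


def classify_resting_hr(bpm: float) -> str:
    """Label an adult resting heart rate against the 60-100 bpm band."""
    if bpm < 60:
        return "Low"
    if bpm <= 100:
        return "Normal"
    return "High"


def link_params(metric_type: str) -> Optional[dict]:
    """Build the parameters for LINK_QUERY, or None for unmapped types."""
    concept = CONCEPTS.get(metric_type)
    if concept is None:
        return None
    return {"type": metric_type, **concept}


print(classify_resting_hr(100))  # Normal (upper edge of the band)
print(link_params("HeartRate"))
```

With a driver session in hand, `session.run(LINK_QUERY, **link_params("HeartRate"))` would attach every heart rate sample to the same concept node.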

For more production-ready examples of how to map custom telemetry to clinical standards, check out the architectural deep-dives at WellAlly Tech Blog. They cover advanced RAG (Retrieval-Augmented Generation) patterns that are essential for building reliable health AI agents.

Cypher Query: Correlating Sleep and Heart Rate

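Once the data is in, we can run queries that are awkward to express as SQL joins. Let’s find the average heart rate on days where "Deep Sleep" was less than 6 hours. Note that this query assumes sleep records have also been loaded as Metric nodes with type 'SleepAnalysis', a value in hours, and a subType property; the Step 2 script imports only heart rate and step count.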

MATCH (s:Metric {type: 'SleepAnalysis'})-[:RECORDED_ON]->(d:Date)
MATCH (hr:Metric {type: 'HeartRate'})-[:RECORDED_ON]->(d)
WHERE s.value < 6.0 AND s.subType = 'Deep'
RETURN d.date, avg(hr.value) AS avg_hr
ORDER BY d.date DESC
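
The query above needs one sleep value per day and stage, expressed in hours. Apple Health instead exports sleep as intervals with a categorical value and a start and end time. The sketch below shows one way to aggregate them, assuming stage values of the form `HKCategoryValueSleepAnalysisAsleepDeep` as found in recent exports; `hours_by_night_and_stage` is a hypothetical helper, and assigning each interval to the date on which it ends is a design choice, not something the export prescribes.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S %z"
STAGE_PREFIX = "HKCategoryValueSleepAnalysisAsleep"


def hours_by_night_and_stage(records):
    """Sum sleep interval durations into hours per (date, stage)."""
    totals = defaultdict(float)
    for rec in records:
        value = rec["value"]
        if not value.startswith(STAGE_PREFIX):
            continue  # skip InBed and Awake intervals
        stage = value[len(STAGE_PREFIX):]  # "Deep", "Core", "REM", ...
        start = datetime.strptime(rec["startDate"], FMT)
        end = datetime.strptime(rec["endDate"], FMT)
        # Attribute the interval to the date on which it ends
        key = (end.date().isoformat(), stage)
        totals[key] += (end - start).total_seconds() / 3600
    return dict(totals)


records = [
    {"value": "HKCategoryValueSleepAnalysisAsleepDeep",
     "startDate": "2024-03-01 23:30:00 -0700", "endDate": "2024-03-02 00:30:00 -0700"},
    {"value": "HKCategoryValueSleepAnalysisAsleepDeep",
     "startDate": "2024-03-02 03:00:00 -0700", "endDate": "2024-03-02 03:45:00 -0700"},
    {"value": "HKCategoryValueSleepAnalysisInBed",
     "startDate": "2024-03-01 23:00:00 -0700", "endDate": "2024-03-02 07:00:00 -0700"},
]
print(hours_by_night_and_stage(records))  # {('2024-03-02', 'Deep'): 1.75}
```

Each resulting pair could then be written as a Metric node with `type: 'SleepAnalysis'`, the stage as `subType`, and the hours as `value`.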

Advanced: Creating the Knowledge Graph Logic

To make this a true "Knowledge Graph," we create a User node and link it to various BiologicalContext nodes.

sequenceDiagram
    participant U as User
    participant HK as HealthKit API
    participant P as Python Processor
    participant N as Neo4j
    U->>HK: Request Data Export
    HK->>P: Send XML/JSON
    P->>P: Normalize & Clean
    P->>N: Cypher: MERGE (u:User)-[:HAS_METRIC]->(m:Metric)
    N-->>U: Graph Insights (Visualized)

The Schema Definition

Defining your constraints is vital for graph integrity:

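// Create uniqueness constraints (IF NOT EXISTS makes the statements safe to re-run)
CREATE CONSTRAINT date_unique IF NOT EXISTS FOR (d:Date) REQUIRE d.date IS UNIQUE;
CREATE CONSTRAINT user_unique IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE;

// Create the User node if it is missing, then link every Metric to it
MERGE (u:User {id: 'me'})
WITH u
MATCH (m:Metric)
MERGE (u)-[:HAS_METRIC]->(m);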
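
Constraints are best created before the import runs, since the uniqueness constraint on Date also gives `MERGE` an index to look nodes up with. A session's `run` call takes one statement at a time, so a setup script would loop over them. The sketch below assumes that approach; `apply_schema` and the constraint names are illustrative, and `IF NOT EXISTS` is used so the script can be re-run safely.

```python
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT date_unique IF NOT EXISTS "
    "FOR (d:Date) REQUIRE d.date IS UNIQUE",
    "CREATE CONSTRAINT user_unique IF NOT EXISTS "
    "FOR (u:User) REQUIRE u.id IS UNIQUE",
]


def apply_schema(session, statements=SCHEMA_STATEMENTS):
    """Run each schema statement in its own call; return how many ran."""
    for statement in statements:
        session.run(statement)
    return len(statements)
```

Called once inside `with driver.session() as session:`, this prepares the database before any records are loaded.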

Conclusion

Building a Personal Health Data Lake isn't just about storage; it's about discovery. By moving from a flat export file (XML/JSON) to a Neo4j Knowledge Graph, you've turned your phone's sensors into a structured medical history that can be queried, visualized, and even used as a backend for a personalized Health AI.

What's next?

  1. Add Weather Data: See how humidity affects your running pace.
  2. RAG Integration: Use this graph to provide context to an LLM for personalized health coaching.
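
For the RAG idea, the simplest starting point is to turn query results into short plain-text facts that can be placed in a prompt. The sketch below assumes each row is a dictionary with `date` and `avg_hr` keys, which is a hypothetical shape chosen for the example; `rows_to_context` is likewise an illustrative name, and no LLM call is shown.

```python
def rows_to_context(rows, max_rows: int = 14) -> str:
    """Render query rows as short plain-text facts for a prompt."""
    lines = [
        f"{row['date']}: average heart rate {row['avg_hr']:.1f} bpm"
        for row in rows[:max_rows]
    ]
    return "\n".join(lines)


rows = [
    {"date": "2024-03-02", "avg_hr": 64.5},
    {"date": "2024-03-01", "avg_hr": 71.0},
]
print(rows_to_context(rows))
```

Capping the number of rows keeps the context small enough to leave room for the user's question.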

If you enjoyed this advanced deep dive, I highly recommend checking out the engineering resources at wellally.tech/blog for more insights on medical data engineering and building secure health platforms.

Drop a comment below: What's the weirdest correlation you've found in your health data? 👇
