wellallyTech

Posted on Apr 4

Master Your Wellness: Building a Health Knowledge Graph with LLMs and Neo4j 🧬

#neo4j #ai #python #dataengineering

We are living in the golden age of personal telemetry. Our watches track our heart rates, our phones log our steps, and apps record every calorie. However, most of this data sits in "silos"—disconnected tables that tell us what happened, but never why.

If you've ever wondered if that late-night ramen is the reason your deep sleep plummeted, you're looking for causal relationships, not just raw numbers. In this guide, we will bridge the gap between fragmented HealthKit data and actionable insights by building a Health Knowledge Graph using Neo4j, LangChain, and LLMs. This advanced Data Engineering workflow transforms flat logs into a multidimensional map of your life.

The Architecture: From Raw Logs to Graph Intelligence

To turn "10:00 PM: Ate Ramen" into a node connected to "11:30 PM: Elevated Heart Rate," we need a pipeline that understands context. A relational database can express these multi-hop chains, but only through recursive joins that get harder to write and maintain as the chain grows; a graph database stores the connections directly, which makes it the natural choice.

graph TD
    A[HealthKit / CSV Data] --> B{LLM Processing}
    B -->|Entity Extraction| C[Nodes: Meal, Activity, Sleep]
    B -->|Relationship Mapping| D[Edges: INFLUENCES, PRECEDED, TRIGGERED]
    C --> E[Neo4j Graph Database]
    D --> E
    E --> F[LangChain Cypher Chain]
    F --> G[Natural Language Insights]
    G --> H[Causal Analysis: 'Why did I sleep poorly?']
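To make the PRECEDED edge from the diagram concrete, here is a minimal sketch of one way to derive it: link two events when the later one starts within a fixed window of the earlier one. The `Event` class, the `preceded_edges` helper, the three-hour window, and the sample dates are all assumptions for illustration, not part of the pipeline above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    name: str
    at: datetime


def preceded_edges(events, window=timedelta(hours=3)):
    """Return (earlier, later) name pairs for events at most `window` apart."""
    ordered = sorted(events, key=lambda e: e.at)
    edges = []
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later.at - earlier.at > window:
                break  # input is sorted, so the remaining events are further away
            edges.append((earlier.name, later.name))
    return edges


# Illustrative events only (the dates are made up).
events = [
    Event("Elevated Heart Rate", datetime(2024, 4, 3, 23, 30)),
    Event("Ate Ramen", datetime(2024, 4, 3, 22, 0)),
    Event("Morning Run", datetime(2024, 4, 4, 7, 0)),
]
edges = preceded_edges(events)
print(edges)
```

A time window only says that one event came shortly before another. Whether the first one influenced the second is a separate question that the graph can help explore but cannot settle on its own.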

Prerequisites

To follow this advanced tutorial, you’ll need:

Neo4j: AuraDB or a local Docker instance
Python 3.10+
LangChain: For the orchestration layer
OpenAI API Key: For the reasoning engine
Tech Stack: Neo4j, LangChain, Cypher Query, Python

Step 1: Defining the Ontology

Before we write to the database, we need to define our schema. In a health graph, we aren't just looking for entities; we are looking for temporal and causal links.

from langchain_community.graphs import Neo4jGraph

# Connect to Neo4j
graph = Neo4jGraph(
    url="bolt://localhost:7687", 
    username="neo4j", 
    password="your_password"
)

# Example schema (relationship types take a leading colon in Cypher):
# (Person)-[:LOGGED]->(Event)
# (Event)-[:AFFECTS]->(Biometric)
# (Diet)-[:HAS_INGREDIENT]->(Component)
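One way to pin the ontology down before loading data is to keep it in plain Python and generate uniqueness constraints from it. The sketch below assumes the `FOR ... REQUIRE` constraint syntax of Neo4j 4.4 and later; the key properties (`id`, `name`) and the helper names are hypothetical choices, not something the schema above prescribes.

```python
# Hypothetical ontology: node label -> property assumed to identify a node.
NODE_KEYS = {
    "Person": "id",
    "Event": "id",
    "Biometric": "name",
    "Diet": "id",
    "Component": "name",
}

# Relationship types from the example schema: (start label, type, end label).
RELATIONSHIPS = [
    ("Person", "LOGGED", "Event"),
    ("Event", "AFFECTS", "Biometric"),
    ("Diet", "HAS_INGREDIENT", "Component"),
]


def constraint_statements(node_keys):
    """Build one uniqueness constraint statement per node label."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in node_keys.items()
    ]


def undeclared_labels(node_keys, relationships):
    """Return labels that a relationship uses but the ontology does not declare."""
    used = {label for start, _, end in relationships for label in (start, end)}
    return sorted(used - set(node_keys))


statements = constraint_statements(NODE_KEYS)
for statement in statements:
    print(statement)
# With a live connection, each statement could be passed to graph.query(statement).
```

Keeping the labels in one dictionary also gives later steps a single place to check LLM output against.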

Step 2: Extracting Knowledge with LLMs

Raw data from HealthKit is often cryptic. We use an LLM to parse unstructured logs (like food diaries or mood notes) and convert them into Cypher statements.

from langchain_openai import ChatOpenAI
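# Imported from langchain_community, matching the Neo4jGraph import in Step 1.
from langchain_community.chains.graph_qa.cypher import GraphCypherQAChain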

llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

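def extract_relationships(user_input: str) -> str:
    """Ask the LLM to turn a free-text log entry into Cypher CREATE statements."""
    prompt = f"""
    Extract health entities and relationships from this text: "{user_input}"
    Nodes: Person, Food, Activity, Metric
    Relationships: CONSUMED, PERFORMED, IMPACTED
    Output only Cypher CREATE statements, with no explanation or code fences.
    """
    response = llm.invoke(prompt)
    # Treat the returned text as untrusted: review it before running it.
    return response.content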

# Example Input: "Had a heavy steak at 9 PM and my HRV dropped to 40ms during sleep."
# The LLM generates the Cypher to connect the Steak (Food) to the HRV (Metric) 
# with a temporal link.
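Because the LLM's output is executed against the database, it is worth checking it first. Below is a minimal sketch of an allow-list check against the labels and relationship types named in the prompt. The function name `check_generated_cypher` and the list of forbidden clauses are assumptions, and the regular expressions are a heuristic, not a Cypher parser.

```python
import re

ALLOWED_LABELS = {"Person", "Food", "Activity", "Metric"}
ALLOWED_RELATIONSHIPS = {"CONSUMED", "PERFORMED", "IMPACTED"}

# Clauses that extraction output should never need (assumed list).
FORBIDDEN = re.compile(r"\b(DELETE|DETACH|DROP|REMOVE|CALL|LOAD)\b", re.IGNORECASE)
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")  # matches (f:Food ...) or (:Food)
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")    # matches [:CONSUMED] or [r:CONSUMED]


def check_generated_cypher(cypher):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if FORBIDDEN.search(cypher):
        problems.append("forbidden clause")
    for label in NODE_LABEL.findall(cypher):
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label}")
    for rel in REL_TYPE.findall(cypher):
        if rel not in ALLOWED_RELATIONSHIPS:
            problems.append(f"unknown relationship: {rel}")
    return problems


good = 'CREATE (p:Person {name: "Me"})-[:CONSUMED {at: "21:00"}]->(f:Food {name: "Steak"})'
bad = "CREATE (d:Drug)-[:CURES]->(m:Metric)"
print(check_generated_cypher(good))
print(check_generated_cypher(bad))
```

A check like this can produce false positives, for example when a forbidden keyword appears inside a quoted food name, so a flagged statement is a cue for manual review.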

Step 3: Querying the "Hidden" Truth

Once the data is in Neo4j, we can go beyond simple dashboards. LangChain's GraphCypherQAChain takes a question in natural language, has the LLM translate it into Cypher, runs that query, and summarizes the result.

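chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    validate_cypher=True,
    # Recent langchain-community releases require this explicit opt-in,
    # because the generated Cypher runs against your database.
    allow_dangerous_requests=True,
)

# invoke() replaces the deprecated run(); the answer is under the "result" key.
result = chain.invoke({"query": "Is there a correlation between dinner time and my deep sleep duration over the last 30 days?"})
print(result["result"])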

Advanced Patterns & Production Readiness 🥑

Building a local prototype is easy, but scaling this to handle real-time biometric streams requires robust data validation and schema evolution strategies.

For more production-ready examples and advanced patterns on handling high-throughput health data pipelines, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover how to handle PII (Personally Identifiable Information) in AI workflows and optimize graph traversals for large-scale wellness datasets.

Step 4: Visualizing Causal Inference

With Neo4j’s Bloom or the Browser, we can finally see the clusters. You might notice that "High Caffeine" nodes are consistently two hops away from "Restless Sleep" nodes, linked through "Elevated Resting Heart Rate."

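// Pair late meals with low-quality sleep logged by the same person.
// Assumes f.time is a zero-padded "HH:MM" string, so string comparison orders correctly.
// Without a shared date property, every late meal is paired with every poor night.
MATCH (p:Person)-[:CONSUMED]->(f:Food)
MATCH (p)-[:LOGGED]->(s:Sleep)
WHERE f.time > "20:00" AND s.quality < 60
RETURN f, s, p LIMIT 10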
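If the "hops" terminology is new: a hop is one relationship traversed, so a path through a single intermediate node has two hops. Here is a small sketch that counts hops with breadth-first search over a toy adjacency dictionary. The `hop_count` helper and the toy graph are illustrative, and in Neo4j itself you would normally rely on a path query.

```python
from collections import deque


def hop_count(graph, start, goal):
    """Breadth-first search: fewest relationships from start to goal, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # goal is not reachable from start


# Toy graph built from the node names in the text above.
toy_graph = {
    "High Caffeine": ["Elevated Resting Heart Rate"],
    "Elevated Resting Heart Rate": ["Restless Sleep"],
}
hops = hop_count(toy_graph, "High Caffeine", "Restless Sleep")
print(hops)
```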

Conclusion: Stop Tracking, Start Understanding

Data engineering in the health space is moving away from flat "steps per day" metrics toward Knowledge Graphs. By utilizing Neo4j and LLMs, we can transform a pile of logs into a reasoning engine that understands our body's unique language.

Next Steps:

Export your HealthKit data as XML.
Use the Python snippets above to batch-process the entries.
Head over to wellally.tech/blog to learn how to deploy this as a private API.
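For the export step, here is a minimal sketch of streaming records out of a HealthKit export with the standard library. It assumes the export's `Record` elements carry `type`, `startDate`, `value`, and `unit` attributes, and that the type you ask for has numeric values; check both against your own export file. The sample XML is made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def iter_records(source, wanted_type):
    """Yield (start, value, unit) for each <Record> of the wanted type."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == wanted_type:
            # float() assumes a quantity type; category types hold text values.
            yield elem.get("startDate"), float(elem.get("value")), elem.get("unit")
        elem.clear()  # release parsed elements; real exports can be very large


# Made-up sample in the assumed export shape.
SAMPLE = b"""<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min"
          startDate="2024-04-03 23:30:00 +0000" value="88"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2024-04-03 18:00:00 +0000" value="1200"/>
</HealthData>"""

heart_rates = list(iter_records(BytesIO(SAMPLE), "HKQuantityTypeIdentifierHeartRate"))
print(heart_rates)
```

Passing a file path instead of `BytesIO(SAMPLE)` works the same way, and the streaming parse keeps memory use flat on multi-year exports.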

What's the most surprising correlation you've found in your data? Let's discuss in the comments! 👇

DEV Community