Have you ever tried to ask a standard RAG (Retrieval-Augmented Generation) system about your health trends? You upload five years of PDF medical reports, ask "How has my LDL cholesterol changed since 2019?", and the AI gives you a generic hallucination or just picks one random document.
The problem is knowledge fragmentation. Traditional vector databases are great at finding "similar" text, but they are terrible at connecting dots across time and disparate entities. To solve this, we need a Knowledge Graph. By combining the reasoning power of LLMs with the structural integrity of a graph database, we can build a GraphRAG system that actually understands your health journey.
In this guide, we'll leverage Knowledge Graph construction, Health Data Engineering, and Neo4j to transform messy PDFs and images into a queryable medical brain. For those looking for even more advanced production-ready patterns, I highly recommend checking out the architectural deep dives at WellAlly Tech Blog, which served as a huge inspiration for this implementation.
The Architecture: From Unstructured Data to Graph Insights
Traditional RAG treats documents like a pile of laundry. GraphRAG treats them like a structured database. Here is how the data flows from a blurry JPEG of a blood test to a temporal health insight:
```mermaid
graph TD
    A[Raw Health Reports: PDF/Images] --> B[Unstructured.io: Partitioning & OCR]
    B --> C[LlamaIndex/LangChain: Entity & Relation Extraction]
    C --> D[Neo4j: Knowledge Graph Storage]
    D --> E[GraphRAG Query Engine]
    F[User Question: 'What is my glucose trend?'] --> E
    E --> G[Cypher Query + Vector Search]
    G --> H[Final Insightful Answer]
```
Prerequisites
To follow along, you'll need the following stack:
- Unstructured.io: To handle the nightmare of parsing table data from PDFs.
- Neo4j: Our graph database (use AuraDB for a free cloud instance).
- LangChain / LlamaIndex: To orchestrate the LLM and Graph interactions.
- OpenAI (GPT-4o): For high-accuracy medical entity extraction.
Step 1: Parsing the Chaos with Unstructured.io
Medical reports are usually tables disguised as PDFs. A standard extractor like PyPDF2 won't work because it flattens the grid structure into a stream of text. We use Unstructured.io to preserve the semantic layout.
```python
from unstructured.partition.pdf import partition_pdf

# Extract elements while preserving table structures
elements = partition_pdf(
    filename="health_report_2023.pdf",
    infer_table_structure=True,
    strategy="hi_res",  # uses a layout model to detect tables
)

# Filter for tables specifically
tables = [el for el in elements if el.category == "Table"]
print(f"Detected {len(tables)} tables in the report.")
```
Step 2: Defining the Schema & Extracting Entities
We need to turn text into Nodes (User, Metric, Date, Value) and Relationships (HAS_RESULT, RECORDED_ON). Here's how we define a Pydantic schema for the LLM to follow:
```python
from pydantic import BaseModel, Field
from typing import List

class HealthMetric(BaseModel):
    name: str = Field(description="The name of the test, e.g., LDL Cholesterol")
    value: float = Field(description="The numeric result")
    unit: str = Field(description="The unit, e.g., mg/dL")
    status: str = Field(description="Normal, High, or Low")

class ReportExtraction(BaseModel):
    date: str = Field(description="ISO format date of the report")
    metrics: List[HealthMetric]
```
Using LangChain's structured output, we pass the parsed table text to GPT-4o to get clean JSON.
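Here's a minimal sketch of that extraction call. The schema classes are repeated so the snippet stands alone, the live GPT-4o call is shown commented out (it needs an API key), and the sample payload is hypothetical — it just has the shape the structured-output wrapper validates into `ReportExtraction`:

```python
from typing import List
from pydantic import BaseModel, Field

class HealthMetric(BaseModel):
    name: str = Field(description="The name of the test")
    value: float = Field(description="The numeric result")
    unit: str = Field(description="The unit, e.g., mg/dL")
    status: str = Field(description="Normal, High, or Low")

class ReportExtraction(BaseModel):
    date: str = Field(description="ISO format date of the report")
    metrics: List[HealthMetric]

# Live call (sketch):
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o", temperature=0)
# extractor = llm.with_structured_output(ReportExtraction)
# extraction = extractor.invoke(f"Extract all lab metrics from:\n{table_text}")

# The wrapper returns a validated ReportExtraction. Here we validate a
# hypothetical payload of the shape GPT-4o would produce:
sample = {
    "date": "2023-06-15",
    "metrics": [
        {"name": "LDL Cholesterol", "value": 131.0, "unit": "mg/dL", "status": "High"}
    ],
}
extraction = ReportExtraction(**sample)
print(extraction.date, extraction.metrics[0].name, extraction.metrics[0].value)
```

Because Pydantic rejects anything that doesn't match the schema, a malformed model response fails loudly here instead of silently corrupting the graph.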
Step 3: Populating Neo4j
Now, we connect the dots. In a graph, "Cholesterol" is a single node. Every time a new report comes in, we create a new "Result" node and link it to the "Cholesterol" node. This creates a Time-Series Graph.
```python
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

def add_report_to_graph(extraction: ReportExtraction):
    # MERGE deduplicates the Person and Metric nodes across reports,
    # while CREATE adds a fresh Result node per ingestion — that's what
    # builds the time series. (Re-ingesting the same report will
    # duplicate Results; guard against that upstream.)
    cypher = """
    MERGE (p:Person {name: 'User01'})
    MERGE (m:Metric {name: $metric_name})
    CREATE (r:Result {value: $value, unit: $unit, date: date($date), status: $status})
    CREATE (p)-[:HAS_RESULT]->(r)
    CREATE (r)-[:MEASURES]->(m)
    """
    for metric in extraction.metrics:
        graph.query(cypher, {
            "metric_name": metric.name,
            "value": metric.value,
            "unit": metric.unit,
            "date": extraction.date,
            "status": metric.status,
        })
```
Step 4: The GraphRAG Advantage
Why did we go through all this trouble? Because now we can perform Global Health Analytics.
When a user asks: "Compare my Liver Function tests over the last 3 years," a standard RAG would struggle to find all relevant chunks. GraphRAG simply traverses the path:
(Person)-[:HAS_RESULT]->(Result)-[:MEASURES]->(Metric {category: 'Liver'}).
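That traversal can be written as a Cypher query. This is a sketch: the labels and relationships match the Step 3 ingestion code, but the `category` property and the three-year window are illustrative — you'd need to tag Metric nodes with a category during ingestion for this to work:

```cypher
MATCH (p:Person {name: 'User01'})-[:HAS_RESULT]->(r:Result)
      -[:MEASURES]->(m:Metric {category: 'Liver'})
WHERE r.date >= date() - duration({years: 3})
RETURN m.name, r.date, r.value, r.unit
ORDER BY m.name, r.date
```

One indexed hop per relationship, no similarity search over chunks — the graph returns exactly the rows the question asks about.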
For a deeper dive into how to optimize these Cypher queries for speed and accuracy, the team at WellAlly Tech has published an excellent series on Advanced Graph Traversal for RAG, which covers index optimization that is crucial for large-scale medical datasets.
The Result: Real Intelligence
By turning pixels into a graph, you've built a system that understands:
- Trends: Automatic calculation of slopes for specific biomarkers.
- Correlations: Does your Vitamin D go up when your Calcium goes down?
- Completeness: Identifying missing tests based on age/gender profiles.
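The "Trends" bullet boils down to a least-squares slope over dated results. A minimal sketch with hypothetical glucose readings — once the (date, value) pairs are fetched from Neo4j, no graph connection is needed:

```python
from datetime import date

def trend_slope(readings: list[tuple[date, float]]) -> float:
    """Least-squares slope in units per year over (date, value) pairs."""
    xs = [(d - readings[0][0]).days / 365.25 for d, _ in readings]
    ys = [v for _, v in readings]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical fasting-glucose results, as they would come back from Neo4j
glucose = [
    (date(2021, 3, 1), 105.0),
    (date(2022, 3, 1), 101.0),
    (date(2023, 3, 1), 96.0),
    (date(2024, 3, 1), 92.0),
]
print(f"{trend_slope(glucose):+.1f} mg/dL per year")  # negative slope = improving
```

A negative slope here is the machine-readable version of "your glucose is trending down" — exactly the kind of fact you can hand the LLM alongside the raw rows.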
Sample Query Output:
User: How is my fasting blood sugar trending?
AI: Based on your reports from 2021 to 2024, your Glucose levels have shown a steady decrease from 105 mg/dL to 92 mg/dL. This move from 'Pre-diabetic range' to 'Normal' correlates with the weight loss noted in your 2023 physical.
Conclusion
The future of personal AI isn't just "chatting with PDFs"; it's about building a structured world model of your data. GraphRAG with Neo4j and LlamaIndex provides the bridge between messy unstructured files and actionable health insights.
What's next?
- Add a "Medical Knowledge Base" subgraph (e.g., PubMed data) to explain why a metric is high.
- Use Streamlit to visualize these graph trends.
If you enjoyed this build, don't forget to heart this post and check out wellally.tech/blog for more cutting-edge AI tutorials!
Happy coding, and stay healthy!