Beck_Moulton

Posted on Jun 9

Building Your "Longevity Knowledge Graph": Stop Ignoring 10 Years of Health Reports with GraphRAG and Neo4j

#rag #ai #opensource #discuss

We’ve all been there: every year, you get a physical, receive a thick PDF full of blood markers, glance at the "normal range" checkmarks, and toss it into a digital folder titled "Health Stuff" to be forgotten. But what if I told you that those isolated data points are actually a time-series story of your biological aging?

In this tutorial, we are going to build a Longevity Knowledge Graph. We will leverage GraphRAG (Graph-based Retrieval-Augmented Generation), Neo4j, and Unstructured.io to transform a decade of messy medical PDFs into a structured intelligence layer. By the end of this post, you'll be able to query your health history with context that standard vector search simply can't grasp—like "How has my fasting glucose trended relative to my BMI over the last five years?"

If you're interested in advanced data engineering patterns or looking for more production-ready AI health architectures, I highly recommend checking out the deep dives over at WellAlly Blog, which served as a major inspiration for this build.

Why GraphRAG? (The Problem with Vector Search)

Standard RAG (Retrieval-Augmented Generation) is great at finding a specific needle in a haystack. But if you ask, "What is the relationship between my Vitamin D levels and my bone density over time?", a vector database might just return three separate paragraphs that each mention one of those terms, with nothing linking the readings to each other or to the dates they were taken.

GraphRAG allows us to:

Connect Entities: Link a Blood_Metric (e.g., LDL) to a specific Time_Point.
Traverse Relationships: Follow the path from User -> Report -> Marker -> Trend.
Global Reasoning: Summarize high-level health trajectories across multiple years of data.
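To make the "connect and traverse" idea concrete before Neo4j enters the picture, here is a minimal in-memory sketch. The node names, relationship names, and LDL values are invented for illustration; a real graph database does the same kind of hop-by-hop walk, just with indexes and a query language.

```python
from collections import defaultdict

# Toy adjacency list: (source, relationship) -> list of targets.
# Every node name and value below is made up for this example.
edges = defaultdict(list)

def add_edge(source, relationship, target):
    edges[(source, relationship)].append(target)

add_edge("user:me", "HAS_REPORT", "report:2022")
add_edge("user:me", "HAS_REPORT", "report:2023")
add_edge("report:2022", "HAS_READING", ("LDL", 118))
add_edge("report:2023", "HAS_READING", ("LDL", 131))

def readings_for(user, metric):
    """Follow User -> Report -> Reading and keep the readings for one metric."""
    results = []
    for report in edges[(user, "HAS_REPORT")]:
        for name, value in edges[(report, "HAS_READING")]:
            if name == metric:
                results.append((report, value))
    return results

ldl_history = readings_for("user:me", "LDL")
print(ldl_history)
```

The point is that the answer comes from following explicit links, not from hoping two similar-looking paragraphs land next to each other in an embedding space.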

The Architecture 🏗️

Here is how the data flows from a messy PDF to a queryable graph:

graph TD
    A[Medical PDF Reports] -->|Unstructured.io| B(Clean JSON/Elements)
    B -->|Entity Extraction| C{LLM Processing}
    C -->|Nodes & Edges| D[Neo4j Graph Database]
    D -->|GraphRAG Query| E[Longevity Insights]
    F[User Query: 'Is my HbA1c rising?'] --> E
    subgraph Storage
    D
    end

Prerequisites

To follow along, you'll need:

Python 3.10+
Neo4j (AuraDB free tier or local Docker instance)
unstructured (the open-source Unstructured.io library with its PDF extras; an API key is only needed if you switch to the hosted service)
OpenAI API Key (to power the GraphRAG reasoning)

Step 1: Parsing the "Unstructured" PDF

Medical reports are notorious for tables, and plain text extraction tends to flatten their rows and columns into an unreadable stream. We'll use the unstructured library to extract tables and text chunks cleanly.

from unstructured.partition.pdf import partition_pdf

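# Extract elements from the medical report
elements = partition_pdf(
    filename="health_report_2023.pdf",
    strategy="hi_res",            # table structure inference needs the hi_res strategy
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=1500,          # hard cap on chunk size
    new_after_n_chars=1000,       # soft cap: prefer starting a new chunk after this many characters
)

# Keep only the table elements, which is where blood panels usually live
tables = [el for el in elements if el.category == "Table"]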
print(f"Detected {len(tables)} tables in the report.")
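Once you have the table text, you still need to turn rows into structured readings. Here is a minimal sketch, assuming each row looks like "metric name, numeric value, unit" separated by spaces. That layout, the `parse_rows` helper, and the sample values are all assumptions for illustration; real lab reports vary a lot, so expect to adapt the pattern (or lean on the LLM step later).

```python
import re

# Assumed row layout: "<metric name> <numeric value> <unit>".
# Header rows and anything without a number are skipped.
ROW_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9 ]*?)\s+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)$"
)

def parse_rows(table_text):
    """Turn plain-text table rows into a list of reading dicts."""
    readings = []
    for line in table_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            readings.append({
                "name": match["name"].strip(),
                "value": float(match["value"]),
                "unit": match["unit"],
            })
    return readings

# Invented sample text standing in for the text of one extracted table
sample = "Test Result Unit\nHbA1c 5.7 %\nVitamin D 32 ng/mL\nLDL Cholesterol 131 mg/dL"
parsed = parse_rows(sample)
for reading in parsed:
    print(reading)
```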

Step 2: Defining the Schema in Neo4j

We need a schema that represents time and metrics. In Cypher (Neo4j's query language), we want to create a structure like this:

// Create a Metric node
MERGE (m:Metric {name: "HbA1c"})
// Create (or reuse) the Report node so re-running the import does not duplicate it
MERGE (r:Report {year: 2023, date: "2023-05-12"})
// Create the Reading
CREATE (v:Reading {value: 5.7, unit: "%"})
// Link them
MERGE (r)-[:HAS_READING]->(v)
MERGE (v)-[:MEASURES]->(m)
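If you write readings from Python, it is safer to pass values as query parameters than to paste them into the Cypher string. The sketch below builds such a statement for the schema above; `build_reading_statement` is a hypothetical helper name, and deriving `year` from the first four characters of the date assumes ISO-formatted dates like the one in the example.

```python
def build_reading_statement(metric, value, unit, report_date):
    """Return a parameterized Cypher statement and its parameters for one reading."""
    query = (
        "MERGE (m:Metric {name: $metric}) "
        "MERGE (r:Report {date: $report_date}) "
        "ON CREATE SET r.year = $year "
        "CREATE (v:Reading {value: $value, unit: $unit}) "
        "MERGE (r)-[:HAS_READING]->(v) "
        "MERGE (v)-[:MEASURES]->(m)"
    )
    params = {
        "metric": metric,
        "value": value,
        "unit": unit,
        "report_date": report_date,
        "year": int(report_date[:4]),  # assumes an ISO date string such as "2023-05-12"
    }
    return query, params

query, params = build_reading_statement("HbA1c", 5.7, "%", "2023-05-12")
print(query)
print(params)
```

The resulting pair can be handed to whatever client you use, for example the official Neo4j Python driver's `session.run(query, params)`.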

Step 3: Integrating GraphRAG Logic

Now, let's use a Python wrapper to ingest this into Neo4j. We will use an LLM to extract the entities from the chunks we got in Step 1.

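import os
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USER"],
    password=os.environ["NEO4J_PASSWORD"],
)

# The model name is only an example; swap in whichever chat model you have access to
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Restrict extraction to the labels and relationships from Step 2
llm_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Report", "Reading", "Metric"],
    allowed_relationships=["HAS_READING", "MEASURES"],
)

# A simplified function to map extracted text to graph nodes
def ingest_health_data(text_chunk: str) -> None:
    """Let the LLM extract entities from one chunk, then write them to Neo4j."""
    docs = [Document(page_content=text_chunk)]
    graph_documents = llm_transformer.convert_to_graph_documents(docs)
    # Note: nodes written this way are keyed by an `id` property, not `name`,
    # so adjust your queries or post-process the nodes accordingly
    graph.add_graph_documents(graph_documents)

# `tables` comes from Step 1
for table in tables:
    ingest_health_data(table.text)

print("Graph populated! Ready for multi-hop queries. 🚀")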

Step 4: The "Official" Way to Scale

While this DIY approach is great for personal use, scaling health data analysis for production requires handling HIPAA compliance, complex data normalization (standardizing "Glucose" vs "Blood Sugar"), and advanced vector-graph hybrid searches.

For more production-ready examples and advanced architectural patterns, I highly recommend exploring the engineering guides at WellAlly Tech Blog. They cover how to handle high-concurrency LLM pipelines and the nuances of health-tech data engineering that go beyond basic tutorials.
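The normalization problem is worth a quick illustration, because without it "Glucose" and "Blood Sugar" end up as two unrelated Metric nodes. Here is a minimal sketch using a hand-written synonym table. The table contents and the `normalize_metric` helper are assumptions for this example; a production system would more likely map names to a curated vocabulary.

```python
# Hypothetical synonym table: lowercase variants -> one canonical metric name
CANONICAL_NAMES = {
    "glucose": "Glucose",
    "blood sugar": "Glucose",
    "hba1c": "HbA1c",
    "hemoglobin a1c": "HbA1c",
    "glycated hemoglobin": "HbA1c",
    "vitamin d": "Vitamin D",
    "25-hydroxyvitamin d": "Vitamin D",
}

def normalize_metric(raw_name):
    """Map a raw label from a report to its canonical name, if one is known."""
    key = " ".join(raw_name.lower().split())  # lowercase and collapse whitespace
    return CANONICAL_NAMES.get(key, raw_name.strip())

for label in ["Blood  Sugar", "Hemoglobin A1c", "Ferritin"]:
    print(label, "->", normalize_metric(label))
```

Running every extracted name through a step like this before the MERGE keeps ten years of readings attached to a single Metric node.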

Step 5: Querying Your Health Trends

Now for the magic. Instead of opening 10 PDFs, we run one Cypher query. Let's pull every Vitamin D reading in chronological order, so we can check whether the levels follow a seasonal pattern over the last few years.

MATCH (m:Metric {name: "Vitamin D"})<-[:MEASURES]-(v:Reading)<-[:HAS_READING]-(r:Report)
RETURN r.date AS Date, v.value AS Level
ORDER BY r.date ASC
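Before handing rows like these to an LLM, it can help to compute the basic trend numbers yourself so the model is interpreting facts rather than doing arithmetic. A minimal sketch, assuming the query result arrives as a list of dicts with `Date` (ISO string) and `Level` keys; the `summarize_trend` helper and the sample values are invented for illustration.

```python
from datetime import date

def summarize_trend(rows):
    """Compare the first and last reading of a date-ordered series."""
    if len(rows) < 2:
        raise ValueError("need at least two readings to describe a trend")
    ordered = sorted(rows, key=lambda row: row["Date"])  # ISO dates sort correctly as strings
    first, last = ordered[0], ordered[-1]
    span_days = (date.fromisoformat(last["Date"]) - date.fromisoformat(first["Date"])).days
    years = span_days / 365.25
    change = last["Level"] - first["Level"]
    return {
        "percent_change": round(100 * change / first["Level"], 1),
        "change_per_year": round(change / years, 2) if years else 0.0,
    }

# Invented readings standing in for the query result
rows = [
    {"Date": "2021-05-10", "Level": 40.0},
    {"Date": "2022-05-10", "Level": 36.0},
    {"Date": "2023-05-10", "Level": 30.0},
]
summary = summarize_trend(rows)
print(summary)
```

Only the endpoints are compared here, which is deliberately crude; a fitted slope would be more robust once you have more than a handful of readings.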

If we plug this into a GraphRAG agent, the LLM can interpret the results. An illustrative response might look like this:

"Your Vitamin D levels consistently drop 30% every November. You might want to discuss a winter supplementation strategy with your doctor."

Conclusion 🥑

By moving from static PDFs to a Neo4j-powered Knowledge Graph, we've turned dead data into a proactive health assistant. GraphRAG provides the bridge between raw data points and meaningful biological context.

Summary of what we built:

Used Unstructured.io to defeat the "PDF Table" boss.
Built a graph-based time-series schema in Neo4j.
Leveraged GraphRAG to query trends across a decade.

What's next?
Try adding a Lifestyle node (tracking sleep or exercise) to see how your habits correlate with your blood markers!
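As a starting point for that experiment, here is a small sketch of the correlation step itself, written in plain Python so it runs without extra packages. The `pearson` helper and the sleep and HDL numbers are invented for illustration, and with only a handful of yearly data points any correlation should be read as a hint, not a finding.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Invented numbers: average nightly sleep (hours) and HDL (mg/dL), one pair per yearly report
sleep_hours = [6.0, 6.5, 7.0, 7.5, 8.0]
hdl = [48, 50, 55, 54, 60]
r = pearson(sleep_hours, hdl)
print(f"sleep vs HDL correlation: {r:.2f}")
```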

Did you find this helpful?
Drop a comment below if you want the full source code for the GraphTransformer logic, and don't forget to subscribe for more "Learning in Public" deep dives! 🚀💻
