wellallyTech

Posted on Jun 10

From Paper Reports to Personal Insights: Building a Medical GraphRAG with Neo4j and LlamaIndex 🚀

#ai #python #dataengineering #neo4j

Have you ever tried to upload five years of medical lab results to a standard ChatGPT interface and ask, "How has my LDL cholesterol trended compared to my Vitamin D levels?"

If you have, you probably hit the "hallucination wall." Traditional Vector RAG (Retrieval-Augmented Generation) is fantastic at finding specific text chunks, but it struggles with relational trends and temporal data spread across multiple documents, because each chunk is retrieved on its own and the link between a value, its date, and its report gets lost. It treats your health history like a pile of disconnected sentences rather than a continuous journey.

Today, we are moving beyond simple vector embeddings. We’re diving into the world of Knowledge Graph RAG (GraphRAG). By combining Neo4j, LlamaIndex, and Unstructured.io, we will build a system that transforms messy PDF lab reports into a queryable, structured medical knowledge graph.

Keywords: GraphRAG, Neo4j Medical Database, LlamaIndex Tutorial, LLM Medical Analysis, Unstructured Data Processing.

The Architecture: Why Graphs? 🧠

In a medical context, a single data point (e.g., "Glucose: 105 mg/dL") is useless without context. You need to know the unit, the reference range, the date of the test, and how it relates to previous results.
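To make that concrete, here is a minimal sketch of a single lab value carrying its own context. The `LabResult` class, its field names, and the 70-99 mg/dL reference range are illustrative assumptions, not part of any library or of a real lab's ranges.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabResult:
    """One measurement plus the context needed to interpret it (hypothetical model)."""
    biomarker: str
    value: float
    unit: str
    ref_low: float   # lower bound of the reference range (illustrative)
    ref_high: float  # upper bound of the reference range (illustrative)
    tested_on: date

    def flag(self) -> str:
        """Return 'low', 'normal', or 'high' relative to the reference range."""
        if self.value < self.ref_low:
            return "low"
        if self.value > self.ref_high:
            return "high"
        return "normal"


# The article's example data point, with an assumed reference range and date.
result = LabResult("Glucose", 105.0, "mg/dL", 70.0, 99.0, date(2023, 10, 1))
print(result.flag())
```

Without the range and the date, the bare number 105 could not be flagged at all, which is exactly the context the graph is meant to preserve.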

A Knowledge Graph allows us to link an Indicator to a Report, which is linked to a Date, creating a path that an LLM can traverse to provide chronological reasoning.

graph TD
    A[Lab PDF Reports] -->|Unstructured.io| B(Raw Text & Tables)
    B -->|LlamaIndex Extraction| C{Property Graph Index}
    C -->|Store Nodes/Edges| D[Neo4j Database]
    E[User Query: What is my Glucose trend?] -->|Cypher Query| D
    D -->|Contextual Subgraph| F[GPT-4o / Claude 3.5]
    F -->|Natural Language Response| G[Final Insight]

    subgraph "Graph Schema"
    D1[Patient] --- D2[Report]
    D2 --- D3[Biomarker: Glucose]
    D3 --- D4[Value: 105]
    D3 --- D5[Date: 2023-10-01]
    end
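
The traversal idea can be sketched without a database at all. The snippet below is a toy in-memory graph, assuming made-up node ids, a hypothetical `HAS_REPORT`/`MEASURED` edge vocabulary, and an invented April reading; it only shows how following Patient -> Report -> measurement edges yields a date-ordered series.

```python
from collections import defaultdict
from datetime import date

# Edges as (source, relationship, target) triples; all names are illustrative.
edges = [
    ("patient:1", "HAS_REPORT", "report:2023-10"),
    ("patient:1", "HAS_REPORT", "report:2023-04"),
    ("report:2023-04", "MEASURED", "glucose:2023-04"),
    ("report:2023-10", "MEASURED", "glucose:2023-10"),
]

# Node properties, keyed by node id. The April value is invented for the example.
props = {
    "glucose:2023-04": {"biomarker": "Glucose", "value": 98.0, "date": date(2023, 4, 1)},
    "glucose:2023-10": {"biomarker": "Glucose", "value": 105.0, "date": date(2023, 10, 1)},
}

# Index outgoing edges by (source, relationship) for quick lookup.
adjacency = defaultdict(list)
for src, rel, dst in edges:
    adjacency[(src, rel)].append(dst)


def timeline(patient_id: str, biomarker: str) -> list[tuple[date, float]]:
    """Walk Patient -> Report -> measurement and return (date, value) pairs in date order."""
    points = []
    for report in adjacency[(patient_id, "HAS_REPORT")]:
        for node in adjacency[(report, "MEASURED")]:
            p = props[node]
            if p["biomarker"] == biomarker:
                points.append((p["date"], p["value"]))
    return sorted(points)


print(timeline("patient:1", "Glucose"))
```

In the real build, Neo4j stores these edges and a Cypher pattern match does the walking, but the shape of the path is the same.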

Prerequisites 🛠️

Before we get our hands dirty, ensure you have the following stack ready:

Neo4j: Our graph database (running via Docker).
LlamaIndex: The orchestration framework for LLM data.
Unstructured.io: To handle the nightmare of parsing medical PDF tables.
Docker: For easy deployment.
OpenAI API Key: For entity extraction and synthesis.

Step 1: Spin up the Graph Infrastructure 🐳

First, we need a place for our data to live. Use Docker to get Neo4j up and running in seconds.

docker run \
    --name neo4j-medical-graph \
    -p 7474:7474 -p 7687:7687 \
    -d \
    -e NEO4J_AUTH=neo4j/password_here \
    -e NEO4J_PLUGINS='["apoc"]' \
    neo4j:latest
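
Neo4j can take a few seconds to start accepting connections. Before running the indexing code, a quick standard-library check like the one below can confirm the Bolt port is reachable. `port_is_open` is a hypothetical helper written for this post, and it only tests the TCP port, not the credentials.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 7687 is the Bolt port published by the docker run command above.
    print("Bolt reachable:", port_is_open("localhost", 7687))
```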

Step 2: Parsing Complex Medical PDFs 📄

Medical reports are notoriously difficult to parse. They are filled with tables, nested grids, and tiny fonts. We'll use Unstructured.io to normalize this data before feeding it to our index.

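from pathlib import Path

from llama_index.readers.file import UnstructuredReader

# Initialize the reader (requires the `unstructured` package with PDF support)
reader = UnstructuredReader()

# Load a lab report; `load_data` takes the path via its `file` argument.
# split_documents=True returns one Document per parsed element instead of one per file.
documents = reader.load_data(
    file=Path("./data/blood_test_oct_2023.pdf"),
    split_documents=True,
)

print(f"Parsed {len(documents)} document chunks.")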
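
Once the text is out of the PDF, table rows still need to become fields. As a rough sketch, assuming each row flattens to a hypothetical `<name> <value> <unit> <low>-<high>` layout (real reports vary a lot, and the sample rows are made up), a regular expression can pull the pieces apart:

```python
import re
from typing import Optional

# Hypothetical row layout: "<name> <value> <unit> <low>-<high>"
ROW = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)


def parse_row(line: str) -> Optional[dict]:
    """Parse one flattened table row; return None if it does not match the layout."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"].strip(),
        "value": float(m["value"]),
        "unit": m["unit"],
        "low": float(m["low"]),
        "high": float(m["high"]),
    }


print(parse_row("Glucose 105 mg/dL 70-99"))
print(parse_row("LDL Cholesterol 130 mg/dL 0-100"))
print(parse_row("Page 1 of 3"))  # non-data lines are skipped
```

In this tutorial the LLM does the extraction, so a parser like this is optional; it is mainly useful as a cheap cross-check on what the LLM pulled out.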

Step 3: Building the Property Graph Index 🧬

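This is where the magic happens. Instead of just creating "chunks," LlamaIndex uses the LLM to extract Entities (like Biomarker, Value, Unit) and Relationships (like HAS_VALUE, TESTED_ON) and writes them to Neo4j. Note that the call below relies on the default extractors, which choose labels freely; to restrict extraction to a fixed set of entity and relationship types, pass a schema-guided extractor through the kg_extractors argument.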

from llama_index.core import PropertyGraphIndex
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
from llama_index.llms.openai import OpenAI

# Connection to our Dockerized Neo4j
graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="password_here",
    url="bolt://localhost:7687"
)

# Define the LLM for extraction
llm = OpenAI(model="gpt-4o")

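# Create the Graph Index.
# Note: the keyword is `property_graph_store`; a `graph_store=` kwarg is not
# recognized here and would leave the graph in the default in-memory store.
index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    property_graph_store=graph_store,
    embed_model="local:BAAI/bge-small-en-v1.5", # Efficient local embedding
    show_progress=True
)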
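
A schema matters because a free-form extractor will happily invent labels. The sketch below shows the kind of filtering a schema enforces, in plain Python with no LlamaIndex dependency. The `Triple` type, `ALLOWED` set, and `HAS_UNIT` relation are assumptions made for illustration; LlamaIndex ships a schema-guided extractor (`SchemaLLMPathExtractor`) for the real thing, so check its documentation for the exact parameters.

```python
from typing import NamedTuple

# Allowed (subject label, relation, object label) patterns.
# HAS_VALUE and TESTED_ON come from the article; HAS_UNIT is a hypothetical addition.
ALLOWED = {
    ("Biomarker", "HAS_VALUE", "Value"),
    ("Biomarker", "TESTED_ON", "Date"),
    ("Value", "HAS_UNIT", "Unit"),
}


class Triple(NamedTuple):
    subj_label: str
    subj: str
    relation: str
    obj_label: str
    obj: str


def filter_triples(triples: list[Triple]) -> list[Triple]:
    """Keep only triples whose (label, relation, label) pattern is in the schema."""
    return [t for t in triples if (t.subj_label, t.relation, t.obj_label) in ALLOWED]


# Pretend these came back from an LLM extraction pass.
candidates = [
    Triple("Biomarker", "Glucose", "HAS_VALUE", "Value", "105"),
    Triple("Biomarker", "Glucose", "TESTED_ON", "Date", "2023-10-01"),
    Triple("Biomarker", "Glucose", "LOCATED_IN", "City", "Springfield"),  # off-schema
]

for t in filter_triples(candidates):
    print(t.subj, t.relation, t.obj)
```

Keeping the vocabulary small also keeps later graph queries predictable, since the same relationship name is used in every report.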

Step 4: Querying for Temporal Insights 🔍

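Now, instead of a simple similarity search over chunks, we query the graph. By default, the query engine below combines keyword/synonym matching on graph entities with vector similarity, then hands the connected triples (and, with include_text=True, the source text) to the LLM as context. True Text-to-Cypher, where the LLM writes the graph query itself, requires passing a TextToCypherRetriever via sub_retrievers. Either way, the LLM gets context that follows the timeline of your health.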

query_engine = index.as_query_engine(
    include_text=True, 
    similarity_top_k=5
)

response = query_engine.query(
    "Compare my Glucose levels from the last three reports. Are they increasing?"
)

print(f"📊 Analysis: {response}")
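
LLMs are not reliable calculators, so one option is to compute the trend yourself from the retrieved values and let the model explain it. Below is a minimal sketch using an ordinary least-squares slope. `slope_per_day` is a hypothetical helper, and the three glucose readings are made up for illustration.

```python
from datetime import date


def slope_per_day(points: list[tuple[date, float]]) -> float:
    """Ordinary least-squares slope of value against time, in units per day."""
    if len(points) < 2:
        raise ValueError("need at least two dated values")
    # Convert dates to day offsets from the first point.
    xs = [(d - points[0][0]).days for d, _ in points]
    ys = [v for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all values share the same date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up readings for illustration only.
glucose = [
    (date(2023, 1, 1), 95.0),
    (date(2023, 1, 11), 100.0),
    (date(2023, 1, 21), 105.0),
]
print(f"{slope_per_day(glucose):+.2f} mg/dL per day")
```

A positive slope means the values are rising over the period; with only three points, treat the number as a rough indicator rather than a clinical finding.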

The "Official" Way to Scale 🥑

Building a local prototype is a great start, but when you're moving toward production-grade medical AI applications, there are nuances in data privacy (HIPAA), graph schema optimization, and multi-agent orchestration that you need to consider.

For advanced RAG patterns and production-ready architectures, I highly recommend the engineering deep dives at WellAlly Tech Blog. They cover everything from sovereign LLM deployments to high-performance graph optimizations, and they were the primary inspiration for this build.

Conclusion: The Future is Relational 🚀

Traditional RAG is like reading a book page by page. GraphRAG is like having the entire plot, character map, and timeline in your head at once. For medical data, where the "truth" lies in the relationships between numbers over time, GraphRAG isn't just an "upgrade"—it's a necessity.

What's next?

Try adding Multi-Agent support to check your lab results against the latest medical journals.
Implement Hybrid Search (Vector + Graph) to get the best of both worlds.

Did you find this helpful? Drop a comment below with your thoughts on Knowledge Graphs, and don't forget to star the repo if you're building along! 🌟
