We’ve all been there: you get your annual health checkup results as a messy, 20-page PDF filled with jargon, tables, and scanned images. By the next year, that file is buried in a folder, and any chance of tracking your cholesterol or blood sugar trends over time is lost.
In this tutorial, we are going to fix that. We are building a Personal Medical Brain using Graph-RAG (Retrieval-Augmented Generation). We’ll use Unstructured.io to parse messy PDFs, Neo4j to build a relationship-aware knowledge graph, and Pinecone for semantic vector search. This isn't just a chatbot; it’s a structured, time-aware intelligence system for your health.
Why Graph-RAG for Medical Data?
Traditional RAG (Vector Search) is great for finding similar documents, but it struggles with complex relationships—like comparing "Glucose levels" across three years or understanding how "Vitamin D" impacts "Bone Density." By combining Neo4j with LangChain, we can traverse relationships explicitly.
The Architecture
Here is how the data flows from a static PDF into a queryable intelligence system:
```mermaid
graph TD
    A[Scanned PDF Report] --> B(Unstructured.io Partitioning)
    B --> C{LLM Processing}
    C -->|Extract Entities| D[Neo4j Graph Database]
    C -->|Generate Embeddings| E[Pinecone Vector Store]
    D --> F[Hybrid Retrieval]
    E --> F
    F --> G[GPT-4o Reasoning]
    G --> H[Actionable Health Insights]
```
Prerequisites
To follow along, you’ll need:
- Python 3.10+
- Neo4j Instance (AuraDB free tier works great!)
- Pinecone Account (For vector indexing)
- API Keys: OpenAI, Unstructured.io
Step 1: Parsing the Unstructured Chaos
Medical PDFs are notoriously difficult—they have nested tables and multi-column layouts. We'll use unstructured to clean the noise.
```python
from unstructured.partition.pdf import partition_pdf

# Extract elements from the PDF, inferring table structure
# and chunking by section titles
elements = partition_pdf(
    filename="annual_report_2023.pdf",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=1000,
)

# With chunking enabled, text chunks come back as CompositeElement,
# so filter on the chunked categories rather than NarrativeText
content = [
    str(el) for el in elements
    if el.category in {"CompositeElement", "Table", "TableChunk"}
]
```
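Before handing the chunks downstream, it helps to stamp each one with the report's date so the graph layer can build time-aware relationships later. Here is a minimal sketch; `tag_with_report_date` is a hypothetical helper (not part of unstructured's API) that assumes the year is embedded in the filename, as in `annual_report_2023.pdf`:

```python
import re

def tag_with_report_date(filename: str, chunks: list[str]) -> list[dict]:
    """Attach the report year (parsed from the filename) to each chunk
    so downstream steps can link indicators across years."""
    match = re.search(r"(\d{4})", filename)
    year = match.group(1) if match else "unknown"
    return [{"text": chunk, "report_year": year} for chunk in chunks]

tagged = tag_with_report_date("annual_report_2023.pdf", ["Glucose: 5.4 mmol/L"])
```

If your filenames don't carry a date, the report date usually appears in the document header and can be extracted from the first chunk instead.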
Step 2: Defining the Schema & Extracting Relationships
We don't just want text; we want a Knowledge Graph. We'll define a schema where a Patient has Reports, and Reports contain Indicators (like LDL, HbA1c) with specific Values.
```python
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
graph_transformer = LLMGraphTransformer(llm=llm)

# The transformer expects Document objects, not raw strings
docs = [Document(page_content=text) for text in content]

# Convert our text chunks into graph documents.
# This identifies nodes like (:Indicator {name: 'Glucose'}) and
# edges like (:Report)-[:MEASURED]->(:Indicator)
graph_documents = graph_transformer.convert_to_graph_documents(docs)

# Neo4jGraph reads NEO4J_URI, NEO4J_USERNAME and NEO4J_PASSWORD
# from the environment
graph = Neo4jGraph()
graph.add_graph_documents(graph_documents)
```
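Lab values extract more reliably when you pre-parse each table row into a structured triple before the LLM sees it, so numbers land on the graph as properties rather than free text. This is a rough sketch; the regex, the row format, and the field names are assumptions for illustration, not anything unstructured or LangChain provides:

```python
import re

def parse_indicator(row: str):
    """Pull a (name, value, unit) triple out of a lab-table row like
    'HbA1c 5.6 %'. Returns None when the row isn't a measurement."""
    m = re.match(r"\s*([A-Za-z0-9 \-()]+?)\s+([\d.]+)\s*(\S+)?", row)
    if not m:
        return None
    name, value, unit = m.groups()
    return {"name": name.strip(), "value": float(value), "unit": unit or ""}

parse_indicator("HbA1c 5.6 %")
```

Real lab reports vary wildly in layout, so treat a regex like this as a fast path and fall back to the LLM transformer for rows it can't parse.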
Step 3: The Hybrid Search (Vector + Graph)
To answer questions like "How has my heart health changed over the last two years?", we need to search for the semantic meaning of "heart health" (Vector) and then traverse the time-stamped nodes in the Graph.
For more advanced patterns on optimizing this hybrid retrieval, I highly recommend checking out the deep dives at WellAlly Tech Blog. They cover production-ready strategies for handling sensitive data within RAG pipelines.
```python
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize Pinecone for the 'semantic' layer.
# We have raw strings, so from_texts (not from_documents)
# is the right constructor here.
vector_db = PineconeVectorStore.from_texts(
    texts=content,
    embedding=OpenAIEmbeddings(),
    index_name="medical-brain",
)
```
```python
def hybrid_query(user_question: str) -> str:
    # 1. Get semantic context from Pinecone
    context = vector_db.similarity_search(user_question, k=3)

    # 2. Get structural context from Neo4j: historical trend for an
    # indicator. The keyword is hardcoded here for clarity; in practice
    # you would extract it from user_question first.
    cypher_query = """
    MATCH (r:Report)-[:MEASURED]->(i:Indicator)
    WHERE i.name CONTAINS $keyword
    RETURN i.name, i.value, r.date ORDER BY r.date ASC
    """
    graph_context = graph.query(cypher_query, params={"keyword": "Glucose"})

    return f"Vector Context: {context}\nGraph Context: {graph_context}"
```
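To avoid the hardcoded `"Glucose"` above, you can pull the indicator name out of the question before hitting the graph. A naive lookup against a known-indicator list is often enough as a first pass; the `KNOWN_INDICATORS` list here is a made-up sample, not something derived from your data:

```python
# Hypothetical keyword extraction for the hybrid query:
# match the question against indicators we know exist in the graph.
KNOWN_INDICATORS = ["Glucose", "HbA1c", "LDL", "HDL", "Vitamin D"]

def extract_indicator(question: str):
    """Return the first known indicator mentioned in the question,
    or None if nothing matches."""
    q = question.lower()
    for name in KNOWN_INDICATORS:
        if name.lower() in q:
            return name
    return None
```

A production version would query Neo4j for the actual set of indicator names (`MATCH (i:Indicator) RETURN i.name`) instead of a static list, or ask the LLM to extract entities.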
Step 4: Putting it all together with a LangChain Agent
Now we wrap everything in an agent that can decide whether it needs to look at the graph, the vector store, or both.
```python
from langchain.agents import AgentType, Tool, initialize_agent

tools = [
    Tool(
        name="GraphSearch",
        # graph.query executes raw Cypher, so the agent must
        # generate a valid Cypher query as the tool input
        func=lambda q: graph.query(q),
        description=(
            "Useful for historical trends and structured health data. "
            "Input must be a valid Cypher query."
        ),
    ),
    Tool(
        name="VectorSearch",
        func=lambda q: vector_db.similarity_search(q),
        description="Useful for general medical terminology and context.",
    ),
]

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
response = agent.run(
    "Compare my Glucose levels between 2022 and 2023. "
    "Am I in the healthy range?"
)
print(response)
```
The Result: A Living Health Dashboard
By transforming a static PDF into a Neo4j Knowledge Graph, you gain:
- Temporal Tracking: Automatically link indicators across different years.
- Contextual Accuracy: The LLM knows that "Sugar" in a report refers to the "Glucose" node in your graph.
- Explainability: You can visualize the graph to see exactly where the AI got its data from.
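The "Sugar" → "Glucose" resolution above can be bootstrapped with a small alias map applied before entities are written to the graph, so synonyms from different labs collapse into one canonical Indicator node. The alias table below is a hypothetical sample to show the shape of the idea:

```python
# Hypothetical alias map: synonyms from different reports resolve
# to a single canonical Indicator node in the graph.
ALIASES = {
    "sugar": "Glucose",
    "blood sugar": "Glucose",
    "a1c": "HbA1c",
    "bad cholesterol": "LDL",
}

def canonical_indicator(name: str) -> str:
    """Map a raw indicator name to its canonical form,
    passing unknown names through unchanged."""
    return ALIASES.get(name.strip().lower(), name.strip())
```

For names outside the table, you could fall back to embedding similarity against existing node names before creating a new node.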
If you are interested in taking this further—perhaps by adding HIPAA-compliant encryption or multi-modal support for X-ray images—you should definitely explore the resources at wellally.tech/blog. They have a fantastic series on "Production-Grade AI" that helped me scale my local experiments into robust applications.
Conclusion
Building a "Medical Brain" is no longer the stuff of science fiction. With the combination of LangChain, Neo4j, and Unstructured.io, we can turn our personal data into actionable intelligence.
Next Steps:
- Try adding a `Person` node to handle reports for your entire family.
- Implement a notification system that alerts you if a trend line is moving in the wrong direction.
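As a starting point for that notification idea, here is a minimal trend check, assuming you've pulled `(date, value)` pairs for one indicator out of the graph; the rule (out of range, or rising across three or more reports) and the thresholds are illustrative, not medical advice:

```python
def trend_alert(readings, healthy_max):
    """Flag when the latest reading exceeds the healthy range, or when
    values rise monotonically across three or more reports.
    readings: list of (date, value) pairs for one indicator."""
    values = [v for _, v in sorted(readings)]
    rising = all(a < b for a, b in zip(values, values[1:]))
    return values[-1] > healthy_max or (rising and len(values) >= 3)
```

Wired to the Cypher query from Step 3, this could run on a schedule and email you when an indicator drifts out of range.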
What are you building with RAG today? Drop a comment below! 👇