wellallyTech

10 Years of Blood Reports into One Graph: Building a Personal Medical Knowledge Base with Unstructured.io, Neo4j, and LlamaIndex

We’ve all been there: a stack of printed blood reports gathering dust in a drawer, or a folder full of cryptic filenames like report_2014_final_v2.pdf. When your doctor asks, "How has your LDL cholesterol trended over the last decade?", you're stuck scrolling through hundreds of pages of unstructured data.

In this tutorial, we'll solve this with a Retrieval-Augmented Generation (RAG) architecture. We will transform fragmented PDF medical records into a structured Medical Knowledge Graph using Unstructured.io, Neo4j, and Milvus, all orchestrated by LlamaIndex. By the end of this guide, you'll have a system capable of cross-year biochemical trend analysis and anomaly detection. 🚀

The Architecture: From Pixels to Nodes

To handle 10 years of data, a simple vector search isn't enough. We need GraphRAG. While Milvus handles the semantic similarity of medical terms, Neo4j stores the temporal relationships between tests, biomarkers, and health states.

graph TD
    A[PDF Blood Reports] -->|Partitioning| B(Unstructured.io)
    B -->|Extract Tables & Text| C{LlamaIndex Orchestrator}
    C -->|Embeddings| D[Milvus Vector DB]
    C -->|Entities & Relations| E[Neo4j Graph DB]
    F[User Query: 'Show my Glucose trend'] --> C
    C -->|Hybrid Search| D
    C -->|Cypher Query| E
    E --> G[Structured Health Insights]
    D --> G

Prerequisites

Before we dive in, ensure you have the following stack ready:

  • Unstructured.io: For high-accuracy PDF table extraction.
  • Neo4j: Our Graph Database for longitudinal tracking.
  • Milvus: For high-performance vector retrieval.
  • LlamaIndex: The glue connecting our LLM with our data stores.
  • Python 3.10+

Step 1: Parsing Complex Medical Tables with Unstructured.io

Medical PDFs are notorious for complex nested tables. Standard PDF parsers usually fail here. We'll use unstructured to partition the document into logical elements.

from unstructured.partition.pdf import partition_pdf

# Extracting elements while preserving hierarchy
elements = partition_pdf(
    filename="blood_report_2023.pdf",
    strategy="hi_res",           # Use layout analysis for tables
    infer_table_structure=True,  # Extract cells from tables
    chunking_strategy="by_title"
)

# Filter for table elements specifically
tables = [el for el in elements if el.category == "Table"]
print(f"Detected {len(tables)} tables in the report.")

Step 2: Setting up the Knowledge Graph (Neo4j)

A Knowledge Graph allows us to link Biomarker nodes to Test nodes across different Time dimensions. This is the "secret sauce" for longitudinal analysis.

// Our Graph Schema
CREATE CONSTRAINT FOR (b:Biomarker) REQUIRE b.name IS UNIQUE;
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;

// Relationship Example
// (Patient)-[:HAD_TEST]->(Report)-[:CONTAINS]->(Biomarker {value: 95, unit: 'mg/dL'})
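To load parsed records into this schema, you can use the official `neo4j` Python driver. The sketch below builds a parameterized `MERGE` statement matching the constraints above; the `record_to_params` helper and the connection details are illustrative assumptions. One deliberate tweak versus the comment above: `value` and `unit` are stored on the `CONTAINS` relationship rather than the `Biomarker` node, so a single unique `Biomarker` node can hold readings from many reports.

```python
# Parameterized Cypher matching the schema above. MERGE keeps re-ingestion
# idempotent: re-running the same report never creates duplicate nodes.
# value/unit live on the relationship so one Biomarker node serves all years.
MERGE_BIOMARKER = """
MERGE (p:Patient {id: $patient_id})
MERGE (r:Report {date: $report_date})
MERGE (p)-[:HAD_TEST]->(r)
MERGE (b:Biomarker {name: $name})
MERGE (r)-[:CONTAINS {value: $value, unit: $unit}]->(b)
"""

def record_to_params(patient_id: str, report_date: str, record: dict) -> dict:
    """Map one parsed biomarker record to Cypher parameters (illustrative helper)."""
    return {
        "patient_id": patient_id,
        "report_date": report_date,
        "name": record["name"],
        "value": record["value"],
        "unit": record["unit"],
    }

# To run against a live instance (requires `pip install neo4j`):
#
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
#     with driver.session() as session:
#         for record in parsed_records:
#             session.run(MERGE_BIOMARKER, record_to_params("me", "2023-05-14", record))
```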

Step 3: Hybrid Retrieval with LlamaIndex

We use LlamaIndex to index our data into two places: Milvus for semantic queries (e.g., "What does high ALT mean?") and Neo4j for structured queries (e.g., "List all Glucose values from 2015 to 2023").

from llama_index.core import Document, StorageContext, KnowledgeGraphIndex
from llama_index.graph_stores.neo4j import Neo4jGraphStore
from llama_index.vector_stores.milvus import MilvusVectorStore

# Wrap the elements parsed in Step 1 as LlamaIndex Documents
documents = [Document(text=el.text) for el in elements]

# Setup Neo4j Graph Store
graph_store = Neo4jGraphStore(
    username="neo4j", password="password", url="bolt://localhost:7687"
)

# Setup Milvus Vector Store (dim must match your embedding model's output size)
vector_store = MilvusVectorStore(dim=1536, collection_name="medical_reports")

storage_context = StorageContext.from_defaults(
    graph_store=graph_store,
    vector_store=vector_store
)

# Building the index: extracts (subject, relation, object) triplets into Neo4j
# and stores chunk embeddings in Milvus
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=5,
    include_embeddings=True,
)

Step 4: Querying Your Health History

Now, instead of reading PDFs, you can ask natural language questions. Behind the scenes, the LLM generates a Cypher query for Neo4j and a vector search for Milvus.

query_engine = index.as_query_engine(
    include_text=True, 
    response_mode="tree_summarize"
)

response = query_engine.query(
    "Analyze my Vitamin D levels over the last 5 years. Is there a downward trend?"
)

print(f"Health Insights: {response}")

The "Official" Way: Advanced Patterns 🥑

Building a production-ready medical AI system requires more than just basic scripts. You need to handle data privacy (HIPAA), complex entity resolution, and sophisticated schema mapping.

For more production-ready examples and advanced patterns on structuring unstructured data for RAG, I highly recommend checking out the technical deep-dives at WellAlly Blog. They have excellent resources on optimizing GraphRAG performance and handling large-scale document ingestion that were instrumental in refining this architecture.
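To give a taste of the entity-resolution problem: different labs label the same biomarker differently, which would fragment your unique `Biomarker` nodes. A minimal, illustrative fix is an alias table that maps raw labels to canonical names; a production system would instead map to a curated ontology such as LOINC codes.

```python
# Different labs label the same biomarker differently; without resolution,
# "LDL-C" and "LDL Cholesterol" would become two separate graph nodes.
# This alias table is illustrative only.
CANONICAL_ALIASES = {
    "ldl-c": "LDL Cholesterol",
    "ldl cholesterol": "LDL Cholesterol",
    "cholesterol, ldl": "LDL Cholesterol",
    "vitamin d, 25-hydroxy": "Vitamin D",
    "25-oh vitamin d": "Vitamin D",
}

def canonical_name(raw_label: str) -> str:
    """Resolve a raw lab label to its canonical biomarker name.
    Unknown labels pass through unchanged (whitespace stripped)."""
    key = raw_label.strip().lower()
    return CANONICAL_ALIASES.get(key, raw_label.strip())
```

Run `canonical_name` over every parsed record before the `MERGE` step so each biomarker collapses onto one node.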

Conclusion

By combining the structural power of Neo4j with the semantic capabilities of Milvus and LlamaIndex, we've turned a pile of useless PDFs into a living, breathing medical history. This Personal Medical Knowledge Graph doesn't just store data; it provides context and insight.

What's next?

  1. Anomaly Detection: Use Graph algorithms to flag results that deviate from your personal baseline.
  2. Visualization: Use Streamlit to plot the Neo4j data into interactive charts.
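As a starting point for item 1, here is a minimal sketch of baseline deviation: compare the newest reading against the mean and standard deviation of your earlier readings from the graph. The `deviates_from_baseline` helper and the sample Vitamin D values are illustrative assumptions, not medical advice.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, latest, threshold=2.0):
    """Return True if `latest` lies more than `threshold` standard deviations
    from the personal baseline built from `history` (needs >= 2 readings)."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return False  # flat history: no meaningful spread to compare against
    return abs(latest - baseline) / spread > threshold

# Vitamin D readings (ng/mL) pulled from the graph, oldest first
history = [38.0, 36.5, 37.2, 35.8]
print(deviates_from_baseline(history, 12.0))  # a sharp drop gets flagged
print(deviates_from_baseline(history, 36.0))  # a normal reading does not
```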

Did you find this helpful? Drop a comment below or share your experience with GraphRAG! 💻🏥
