Beck_Moulton

Posted on Apr 14

Biohackers-RAG: Stop Guessing Your Supplements! Build an Explainable Health Graph with Neo4j & PubMed

#ai #rag #neo4j #biohacking

If you've ever looked at a blood test report and spent three hours scrolling through Reddit or PubMed trying to figure out if your Vitamin D levels are "optimal" or just "adequate," you've felt the pain of the information gap.

Traditional Retrieval-Augmented Generation (RAG) often fails in the medical domain because it treats documents as flat chunks of text. If you want to know how Magnesium L-Threonate affects BDNF levels specifically in the context of sleep deprivation, a simple vector search might give you three unrelated paragraphs. To bridge the gap, we need GraphRAG.

In this guide, we're building Biohackers-RAG: a precision medicine pipeline that uses Neo4j, LlamaIndex, and Ollama to turn PubMed papers into a structured knowledge graph. This allows us to perform "multi-hop" reasoning, connecting your personal biomarkers to biological mechanisms along a path you can inspect edge by edge.

The Architecture: Why Graphs Beat Vectors

While vector databases are great for "vibes" (semantic similarity), graphs are built for "facts" (explicit relationships). With a knowledge graph, we can traverse the path from a Blood Marker to a Nutrient to a Biological Mechanism.

graph TD
    A[User Blood Report] --> B[LlamaIndex Property Extractor]
    B --> C{Neo4j Knowledge Graph}
    D[PubMed API] --> E[Entity Linking]
    E --> C
    C --> F[GraphRAG Query Engine]
    G[Ollama: Llama3/Mistral] --> F
    F --> H[Explainable Health Advice]

    subgraph "Knowledge Layer"
    C --- I[Marker: Ferritin]
    I --- J[Related To: Iron Metabolism]
    J --- K[Inhibited By: Calcium]
    end
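
To make "multi-hop" concrete, here is a minimal sketch that walks the Ferritin example from the diagram in plain Python, with no database involved. The relation names and the `find_path` helper are assumptions for illustration; in the real pipeline Neo4j does this traversal for us.

```python
from collections import deque

# Toy knowledge graph as (subject, relation, object) triplets,
# mirroring the "Knowledge Layer" in the diagram. Relation names are illustrative.
TRIPLETS = [
    ("Ferritin", "RELATED_TO", "Iron Metabolism"),
    ("Iron Metabolism", "INHIBITED_BY", "Calcium"),
]


def find_path(triplets, start, goal):
    """Breadth-first search returning the chain of triplets from start to goal."""
    adjacency = {}
    for subject, relation, obj in triplets:
        adjacency.setdefault(subject, []).append((relation, obj))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # no connection found


path = find_path(TRIPLETS, "Ferritin", "Calcium")
for subject, relation, obj in path:
    print(f"{subject} -[{relation}]-> {obj}")
```

Each hop in the returned path is an edge you can point to, which is the property a flat vector search cannot give you.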

Prerequisites

To follow this advanced tutorial, you'll need:

Neo4j: A running instance (Desktop or AuraDB).
Ollama: For local LLM inference (we'll use llama3).
LlamaIndex: The framework for data orchestration.
PubMed API Key: (Optional, but recommended for high-volume fetching).

Step 1: Modeling the Health Domain

We aren't just storing text; we are storing entities. Our graph schema will focus on three primary node types: Biomarker, Compound (supplements), and Mechanism.
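
One way to keep extracted triplets honest is to check them against the schema before they reach the graph. The sketch below is an assumption about how you might do that: the three labels come from this post, but the relationship types (`AFFECTS`, `MODULATES`, `INDICATES`) and the `Entity` class are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Node labels named in this post; the relationship types are assumed for illustration.
NODE_LABELS = {"Biomarker", "Compound", "Mechanism"}
ALLOWED_RELATIONS = {
    ("Compound", "AFFECTS", "Biomarker"),
    ("Compound", "MODULATES", "Mechanism"),
    ("Biomarker", "INDICATES", "Mechanism"),
}


@dataclass(frozen=True)
class Entity:
    label: str
    name: str

    def __post_init__(self):
        # Reject labels that are not part of the schema
        if self.label not in NODE_LABELS:
            raise ValueError(f"Unknown node label: {self.label}")


def is_valid_triplet(subject: Entity, relation: str, obj: Entity) -> bool:
    """True when (subject label, relation, object label) is allowed by the schema."""
    return (subject.label, relation, obj.label) in ALLOWED_RELATIONS


magnesium = Entity("Compound", "Magnesium")
bdnf = Entity("Biomarker", "BDNF")
print(is_valid_triplet(magnesium, "AFFECTS", bdnf))
print(is_valid_triplet(bdnf, "AFFECTS", magnesium))
```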

Setting up the Neo4j Graph Store

from llama_index.graph_stores.neo4j import Neo4jGraphStore
from llama_index.core import StorageContext

# Initialize the Neo4j Store
graph_store = Neo4jGraphStore(
    username="neo4j",
    password="your_password",
    url="bolt://localhost:7687",
    database="neo4j",
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)
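
Since this graph will end up holding personal health data, you may prefer not to hardcode credentials. Here is a small sketch that reads them from environment variables instead. The variable names (`NEO4J_URL`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_DATABASE`) and the `neo4j_settings` helper are my own assumptions, not something Neo4j or LlamaIndex requires.

```python
import os


def neo4j_settings(env=os.environ):
    """Collect Neo4j connection settings from environment variables.

    The variable names are assumptions for this sketch.
    """
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return {
        "url": env.get("NEO4J_URL", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": password,
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }


# Demo with an explicit mapping so the example runs without real credentials
print(neo4j_settings({"NEO4J_PASSWORD": "example"})["url"])
```

The returned keys match the keyword arguments used above, so `Neo4jGraphStore(**neo4j_settings())` should slot straight in.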

Step 2: Fetching and Indexing PubMed Papers

Instead of manual PDF uploads, we use the PubMed API to fetch the latest research on specific biomarkers. We then use an LLM to extract "triplets" (Subject -> Predicate -> Object).

from llama_index.core import Document, KnowledgeGraphIndex
from llama_index.llms.ollama import Ollama

# Using Llama3 via Ollama for extraction
llm = Ollama(model="llama3", request_timeout=300.0)

# Example: Constructing the index from PubMed data
# In a real scenario, use a PubMed tool to fetch text snippets
# from_documents expects Document objects, not raw strings
documents = [
    Document(text="Magnesium increases BDNF expression in the hippocampus, improving synaptic plasticity.")
]

# include_embeddings=True uses the globally configured embedding model,
# so point Settings.embed_model at a local model if you want to stay offline
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=3,
    llm=llm,
    include_embeddings=True,
)
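
The snippet above uses a hardcoded sentence, so here is a rough sketch of the fetching half using NCBI's E-utilities endpoints (`esearch` to find PMIDs, `efetch` to download abstracts). The helper names are my own, and error handling and rate limiting are left out, so treat it as a starting point rather than a finished client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, max_results=5, api_key=None):
    """URL that searches PubMed and returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": max_results, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids, api_key=None):
    """URL that fetches plain-text abstracts for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_abstracts(term, max_results=5, api_key=None):
    """Network call: search PubMed, then download the matching abstracts as text."""
    with urlopen(build_esearch_url(term, max_results, api_key), timeout=30) as resp:
        pmids = json.load(resp)["esearchresult"]["idlist"]
    if not pmids:
        return ""
    with urlopen(build_efetch_url(pmids, api_key), timeout=30) as resp:
        return resp.read().decode("utf-8")


# Building the URL is safe to run offline; fetch_abstracts() needs network access
print(build_esearch_url("magnesium BDNF", max_results=3))
```

The text returned by `fetch_abstracts` can then be wrapped in LlamaIndex `Document` objects and passed to the index in place of the sample sentence.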

Step 3: Integrating Personal Biomarkers

The magic happens when you map your blood test results into the graph. We can represent a high or low lab value as a state in the graph.

def add_user_biomarker(marker_name, value, reference_range):
    # Compare the value against both ends of the reference range
    low, high = reference_range
    if value < low:
        status = "Low"
    elif value > high:
        status = "High"
    else:
        status = "Optimal"

    # Inject a personal node; query parameters avoid Cypher injection,
    # and SET updates the existing relationship instead of duplicating it
    cypher_query = """
    MERGE (u:User {id: 'DevUser'})
    MERGE (m:Biomarker {name: $name})
    MERGE (u)-[r:HAS_LEVEL]->(m)
    SET r.value = $value, r.status = $status
    """
    graph_store.query(cypher_query, {"name": marker_name, "value": value, "status": status})

add_user_biomarker("Vitamin D", 22, [30, 100])
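
A real blood panel has dozens of markers, so you will probably want to load them in one go. The sketch below turns a whole report into parameter rows for a single `UNWIND` query. The `report_to_rows` helper and the report layout are assumptions for illustration, and the Ferritin numbers are made up for the demo, not clinical reference ranges.

```python
def classify(value, low, high):
    """Label a lab value relative to its reference range (inclusive bounds)."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Optimal"


def report_to_rows(report):
    """Convert {marker: (value, low, high)} into parameter rows for Cypher."""
    return [
        {"name": name, "value": value, "status": classify(value, low, high)}
        for name, (value, low, high) in report.items()
    ]


# One round trip for the whole report; $rows and $user_id are bound parameters.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (u:User {id: $user_id})
MERGE (m:Biomarker {name: row.name})
MERGE (u)-[r:HAS_LEVEL]->(m)
SET r.value = row.value, r.status = row.status
"""

# Illustrative numbers only, not clinical reference ranges
report = {"Vitamin D": (22, 30, 100), "Ferritin": (350, 30, 300)}
rows = report_to_rows(report)
print(rows)
```

Assuming the graph store's `query` method accepts a parameter map as its second argument, the rows can be written with `graph_store.query(BATCH_QUERY, {"rows": rows, "user_id": "DevUser"})`.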

Step 4: The GraphRAG Query Engine

Standard RAG would just retrieve the chunks that look most similar to "Vitamin D." GraphRAG asks: "Find me all compounds that increase Vitamin D and show me the biological pathways they activate according to PubMed."

query_engine = index.as_query_engine(
    llm=llm,  # synthesize answers with the local Ollama model
    include_text=True,
    response_mode="tree_summarize",
    retriever_mode="hybrid",  # combine keyword and embedding lookups
    similarity_top_k=5,
)

response = query_engine.query(
    "My Vitamin D is low. Based on the graph, what supplements help, "
    "and what are the downstream effects on my immune system?"
)

print(f"🥑 Biohacker Insight: {response}")
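
If you want to see the path behind an answer rather than trust the LLM's summary, you can also ask the graph directly. Here is a sketch of what that "compounds that increase Vitamin D" question could look like in Cypher, plus a small formatter for the results. It assumes your graph uses the typed labels from Step 1 and that extraction produced `INCREASES` and `ACTIVATES` relationships; both are assumptions, so adjust them to whatever predicates actually exist in your database.

```python
# Relationship types INCREASES and ACTIVATES are assumptions; use whatever
# predicates your extraction step actually produced.
PATHWAY_QUERY = """
MATCH (c:Compound)-[:INCREASES]->(b:Biomarker {name: $marker})
OPTIONAL MATCH (c)-[:ACTIVATES]->(m:Mechanism)
RETURN c.name AS compound, b.name AS marker, collect(m.name) AS mechanisms
"""


def explain(rows):
    """Turn query result rows into one readable line per compound."""
    lines = []
    for row in rows:
        mechanisms = ", ".join(row["mechanisms"]) or "no mechanism recorded"
        lines.append(f"{row['compound']} -> increases {row['marker']} -> {mechanisms}")
    return lines


# Hand-written placeholder rows standing in for a database result
sample_rows = [
    {"compound": "Compound A", "marker": "Vitamin D", "mechanisms": ["Mechanism X", "Mechanism Y"]},
    {"compound": "Compound B", "marker": "Vitamin D", "mechanisms": []},
]
for line in explain(sample_rows):
    print(line)
```

Printing the raw path next to the generated answer makes it easy to spot when the model says something the graph does not support.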

The "Official" Way: Production Grade GraphRAG

Building a production-ready medical knowledge graph requires more than just extracting triplets. You need entity resolution (ensuring "Vitamin D3" and "Cholecalciferol" map to the same node) and rigorous evaluation.
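
As a minimal sketch of entity resolution, the snippet below normalizes names through a hand-written alias table before triplets are written to the graph. The `ALIASES` map and helper names are assumptions for the example; a production system would more likely lean on a curated vocabulary than a dictionary you maintain by hand.

```python
# Alias table mapping surface forms to one canonical node name.
# A tiny hand-written map for illustration only.
ALIASES = {
    "vitamin d3": "Cholecalciferol",
    "cholecalciferol": "Cholecalciferol",
}


def canonical_name(name, aliases=ALIASES):
    """Lowercase and collapse whitespace, then look the name up in the alias table."""
    key = " ".join(name.lower().split())
    return aliases.get(key, name.strip())


def resolve_triplet(triplet, aliases=ALIASES):
    """Apply canonical names to the subject and object of a triplet."""
    subject, relation, obj = triplet
    return (canonical_name(subject, aliases), relation, canonical_name(obj, aliases))


print(canonical_name("Vitamin D3"))
print(canonical_name("cholecalciferol"))
```

Both spellings now resolve to the same name, so a `MERGE` on that name produces one node instead of two.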

For more production-ready examples and advanced patterns on handling high-throughput medical data, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover everything from scaling Neo4j clusters to fine-tuning LLMs for clinical accuracy.

Conclusion: The Future of Personal Health

By combining Neo4j's graph traversal with LlamaIndex's RAG capabilities, we've moved from "searching for documents" to "querying biology." As long as each extracted triplet keeps a reference to its source, every supplement recommendation from this Biohackers-RAG setup can be traced along a graph path back to a peer-reviewed paper.

Next Steps for you:

Try adding a Gene node to link your 23andMe data to the graph.
Use Ollama's local inference to keep your health data private.
Check out wellally.tech/blog to learn how to deploy this into a secure cloud environment.

What biomarker are you tracking next? Let me know in the comments! 👇
