Ever looked at a shelf full of medicine bottles and wondered, "Wait, can I take this allergy pill with my prescription antibiotics?" Mismanaging medication interactions is a leading cause of preventable hospitalizations.
In this tutorial, we are building a Vision-to-Knowledge pipeline. We'll use multimodal AI, specifically GPT-4o Vision, to extract text from medicine packaging, and then map those entities to a Neo4j Knowledge Graph to detect potential drug-drug interactions (DDI). By combining GPT-4o Vision with Knowledge Graph technology, we move beyond simple OCR into the realm of intelligent, safety-critical reasoning.
The Problem: The "Black Box" of Medical Leaflets
Traditional OCR struggles with the curved surfaces of pill bottles and the complex typography of pharmaceutical branding. Even if you get the text, knowing that "Advil" and "Motrin" are both Ibuprofen requires a semantic layer.
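To make the "semantic layer" idea concrete, here is a minimal sketch of brand-to-ingredient normalization. The `BRAND_TO_INGREDIENT` table and both helper functions are hypothetical names introduced for illustration; a real system would load this mapping from a curated drug database rather than a hardcoded dict.

```python
# Hypothetical lookup table for illustration only; a real system would
# load brand-to-ingredient mappings from a curated drug database.
BRAND_TO_INGREDIENT = {
    "advil": "ibuprofen",
    "motrin": "ibuprofen",
}


def normalize_drug_name(name: str) -> str:
    """Map a brand name to its active ingredient.

    Unknown names fall back to the trimmed, lowercased input.
    """
    key = name.strip().lower()
    return BRAND_TO_INGREDIENT.get(key, key)


def same_active_ingredient(name_a: str, name_b: str) -> bool:
    """Return True when both names resolve to the same active ingredient."""
    return normalize_drug_name(name_a) == normalize_drug_name(name_b)
```

With this in place, `same_active_ingredient("Advil", "Motrin")` resolves both brands to the same ingredient before any interaction lookup happens.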
The Architecture
Our system follows a "Vision-Extraction-Query" flow. We use GPT-4o for high-fidelity extraction and Neo4j for the relational logic of contraindications.
    graph TD
        A[User Uploads Photo] --> B{GPT-4o Vision API}
        B -->|Extracts| C[Drug Name + Active Ingredients]
        C --> D[Pydantic Validation]
        D --> E[Neo4j Cypher Query]
        E --> F{Interaction Found?}
        F -->|Yes| G[⚠️ High Risk Warning]
        F -->|No| H[✅ Safe to Combine]
        G --> I[RAG: Fetch Official Leaflet Details]
        H --> I
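The flow in the diagram can be sketched as one orchestration function. This is a rough outline under stated assumptions: `run_pipeline`, `extract`, and `check` are hypothetical names, and the two callables stand in for the vision step and the graph query so the control flow can be read (and tested) without any external service.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional


def run_pipeline(
    image_urls: List[str],
    extract: Callable[[str], List[str]],
    check: Callable[[str, str], Optional[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Vision -> extraction -> query: one warning per flagged ingredient pair."""
    # Photo to ingredients: collect each distinct ingredient once, in order.
    ingredients: List[str] = []
    for url in image_urls:
        for name in extract(url):
            if name not in ingredients:
                ingredients.append(name)

    # Graph query: check every unordered pair exactly once.
    warnings: List[Dict[str, str]] = []
    for a, b in combinations(ingredients, 2):
        hit = check(a, b)
        if hit is not None:
            warnings.append({"pair": f"{a} + {b}", **hit})
    return warnings
```

Passing the steps in as callables keeps the orchestration independent of GPT-4o and Neo4j, so either side can be swapped for a stub in tests.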
Prerequisites
Before we dive in, ensure you have the following:
- Tech Stack: Python 3.10+, OpenAI API Key, Neo4j (Aura DB or Local).
- Libraries: openai, neo4j, pydantic, pytesseract (as a fallback).
Step 1: Extracting Structured Data from Images
We don't just want a raw string; we want a structured object. We’ll use GPT-4o’s vision capabilities paired with Pydantic for schema enforcement.
    import os
    from typing import List

    from openai import OpenAI
    from pydantic import BaseModel

    # Read the key from the environment instead of hardcoding it in source.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


    class MedicationInfo(BaseModel):
        brand_name: str
        active_ingredients: List[str]
        dosage_form: str  # e.g., Tablet, Syrup


    def analyze_medication_image(image_url: str) -> MedicationInfo | None:
        # parse() asks the model for output that matches the MedicationInfo schema.
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Extract the brand name, active ingredients, and dosage form from this medicine packaging."},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                }
            ],
            response_format=MedicationInfo,
        )
        # parsed is None if the model refused or the output could not be parsed.
        return response.choices[0].message.parsed


    # Example Usage
    # result = analyze_medication_image("https://example.com/pill_bottle.jpg")
    # print(f"Detected: {result.brand_name} with {result.active_ingredients}")
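The example above passes a hosted image URL, but a photo uploaded by a user usually starts life as a local file. One common approach is to send it as a base64 data URL in the same `image_url` field. The helper below is a sketch; `image_file_to_data_url` is a name introduced here, not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_file_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    # Guess the MIME type from the file extension (e.g. .jpg -> image/jpeg).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can then be passed wherever the earlier code expects `image_url`.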
Step 2: Mapping to the Knowledge Graph (Neo4j)
Once we have the active ingredients (e.g., Warfarin, Aspirin), we need to check our Neo4j database for a CONTRAINDICATED_WITH relationship.
    from neo4j import GraphDatabase


    class DrugSafetyChecker:
        def __init__(self, uri, user, password):
            self.driver = GraphDatabase.driver(uri, auth=(user, password))

        def close(self):
            # Release the driver's connection pool when done.
            self.driver.close()

        def check_interaction(self, ingredient_a, ingredient_b):
            # The relationship pattern has no arrow, so it matches either direction.
            query = """
            MATCH (d1:Ingredient {name: $name_a})
            MATCH (d2:Ingredient {name: $name_b})
            MATCH (d1)-[r:CONTRAINDICATED_WITH]-(d2)
            RETURN r.severity AS severity, r.reason AS reason
            """
            with self.driver.session() as session:
                result = session.run(query, name_a=ingredient_a, name_b=ingredient_b)
                # single() returns None when no matching relationship exists.
                return result.single()


    # Usage logic
    # checker = DrugSafetyChecker("bolt://localhost:7687", "neo4j", "password")
    # risk = checker.check_interaction("Aspirin", "Warfarin")
    # checker.close()
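For unit tests, it can help to have a stand-in that behaves like the undirected `CONTRAINDICATED_WITH` lookup without a running database. The class below is a hypothetical test double, not part of the neo4j driver. One deliberate difference, flagged as an assumption: it lowercases and trims names, whereas the Cypher query matches `name` exactly.

```python
from typing import Dict, FrozenSet, Optional


class InMemoryInteractionStore:
    """Test double for an undirected ingredient-pair lookup."""

    def __init__(self) -> None:
        self._edges: Dict[FrozenSet[str], Dict[str, str]] = {}

    @staticmethod
    def _key(a: str, b: str) -> FrozenSet[str]:
        # A frozenset ignores order, so (a, b) and (b, a) share one key.
        return frozenset((a.strip().lower(), b.strip().lower()))

    def add_contraindication(self, a: str, b: str, severity: str, reason: str) -> None:
        self._edges[self._key(a, b)] = {"severity": severity, "reason": reason}

    def check_interaction(self, a: str, b: str) -> Optional[Dict[str, str]]:
        # Returns None when the pair is not in the store.
        return self._edges.get(self._key(a, b))
```

Because the key is order-independent, a pair added as ("Aspirin", "Warfarin") is also found when queried as ("Warfarin", "Aspirin"), which is the same property the arrowless Cypher pattern provides.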
The "Official" Way to Build Health-Tech
While this tutorial covers the basics of multimodal extraction, production-grade systems require much more robust handling of medical terminology and RAG (Retrieval-Augmented Generation) architectures.
For advanced patterns on scaling Knowledge Graphs and fine-tuning Vision models for medical precision, I highly recommend checking out the technical deep-dives at WellAlly Blog. They offer incredible resources on building production-ready AI agents that bridge the gap between messy real-world data and structured medical knowledge.
Step 3: Closing the Loop with RAG
If an interaction is found, we don't just say "No." We use RAG to pull the specific paragraph from the official FDA or EMA leaflet to explain why.
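    def generate_safety_report(drug_a, drug_b, risk_details):
        # Context retrieved from our Vector DB (e.g., Pinecone or Neo4j Vector).
        # fetch_official_leaflet_text is a placeholder for your own retrieval function.
        context = fetch_official_leaflet_text(drug_a)
        prompt = f"""
        The user is taking {drug_a} and {drug_b}.
        Our database flags a risk: {risk_details}.
        Based on the official leaflet: {context},
        explain the danger in simple, empathetic terms.
        """
        # Call GPT-4o again for the final summary, reusing the client from Step 1.
        response = client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content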
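The retrieval step is left abstract above. As a rough illustration of what "pull the specific paragraph" means, here is a sketch that ranks leaflet paragraphs by word overlap with a query. This is a swapped-in stand-in: a real vector DB would rank by embedding similarity instead, and `pick_relevant_paragraph` is a hypothetical helper, not the `fetch_official_leaflet_text` function itself.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_relevant_paragraph(paragraphs: List[str], query: str) -> str:
    """Return the paragraph sharing the most words with the query.

    Returns an empty string when no paragraph shares any word.
    Ties keep the earliest paragraph.
    """
    query_tokens = _tokens(query)
    best, best_score = "", 0
    for paragraph in paragraphs:
        score = len(query_tokens & _tokens(paragraph))
        if score > best_score:
            best, best_score = paragraph, score
    return best
```

Returning an empty string on no match lets the caller decide whether to skip the explanation step rather than feed the model unrelated context.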
Conclusion: The Power of Multimodal Context
By combining GPT-4o Vision for perception and Neo4j for factual reasoning, we've built a prototype that checks interactions against a queryable graph instead of relying on an LLM's memory, which makes it more reliable than a simple chatbot.
Key Takeaways:
- Vision is the entry point: Use GPT-4o to turn "pixels" into "entities."
- Graphs are the source of truth: Don't rely on the LLM to remember drug interactions; query a database.
- Validate everything: Use Pydantic to ensure the LLM output matches your database schema.
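On the "validate everything" point: Pydantic checks the shape of the output, but not whether an extracted name actually exists in the graph. A complementary check might look like the sketch below. `split_known_unknown` and `known_names` are hypothetical; the assumption is that `known_names` has been loaded from the `Ingredient` nodes beforehand.

```python
from typing import Iterable, List, Set, Tuple


def split_known_unknown(
    ingredients: Iterable[str], known_names: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate extracted ingredients into those present in the graph and those missing."""
    # Compare case-insensitively so "Aspirin" and "aspirin" are treated alike.
    lookup = {name.lower() for name in known_names}
    known: List[str] = []
    unknown: List[str] = []
    for raw in ingredients:
        cleaned = raw.strip()
        if cleaned.lower() in lookup:
            known.append(cleaned)
        else:
            unknown.append(cleaned)
    return known, unknown
```

Surfacing the unknown list matters because a name missing from the graph produces no match in the interaction query, and that silence should not be read as "no interaction."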
Are you working on AI in healthcare or vision systems? Let’s chat in the comments! And don't forget to visit wellally.tech/blog for more advanced engineering guides.