PillScan: Building a Multimodal RAG with Florence-2 to Prevent Dangerous Drug Interactions 💊🚀

#ai #python #opensource #machinelearning

Taking the wrong combination of medications isn't just a mistake; it's a major health risk. Every year, thousands of people experience adverse drug-drug interactions (DDIs), often because they can't decipher complex medical labels.

In this tutorial, we are building PillScan, an intelligent healthcare assistant that leverages Vision Language Models (VLM) and Multimodal RAG to identify multiple drug packages from a single photo and cross-reference them against a clinical database. By using Florence-2 for high-precision OCR and Milvus as our vector database, we'll create a system that can flag contraindications in real-time. Whether you are interested in AI healthcare applications, vector similarity search, or the latest in computer vision, this project is the perfect "learning in public" deep dive.

🏗️ The System Architecture

Before we dive into the code, let’s look at how the data flows. We aren't just doing simple image recognition; we are building a pipeline that transforms pixels into structured medical knowledge.

graph TD
    A[User Uploads Photo] --> B[Florence-2: Object Detection]
    B --> C[Florence-2: OCR/Text Extraction]
    C --> D{Drug Names Found?}
    D -- Yes --> E[Milvus Vector DB: Similarity Search]
    D -- No --> F[Error: No Pills Detected]
    E --> G[Contextual Retrieval: Contraindications]
    G --> H[LLM Reasoning: Detailed Report]
    H --> I[Gradio UI: Safety Warning & Summary]

🛠️ The Tech Stack

Florence-2: Microsoft’s lightweight VLM that excels at zero-shot OCR and region-based detection.
Milvus: An open-source vector database that serves as the "memory" for our drug interaction data.
Python: The glue holding our AI logic together.
Gradio: For a sleek, user-friendly interface.

1. Extracting Text with Florence-2 👁️

While models like GPT-4o are great, Florence-2 is a powerhouse for specialized tasks like OCR and object detection because it's small enough to run locally while maintaining high accuracy.

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the model and processor
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_florence_ocr(image_path):
    image = Image.open(image_path).convert("RGB")

    # Task: OCR with Region (finds text and where it is)
    task_prompt = "<OCR_WITH_REGION>"
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )

    results = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(results, task=task_prompt, image_size=(image.width, image.height))

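    # The parsed result is a dict holding 'quad_boxes' and 'labels'; the later
    # steps only need the text, so return the cleaned label strings as a list.
    labels = parsed_answer[task_prompt]["labels"]
    return [label.replace("</s>", "").strip() for label in labels if label.strip()]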
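Since the goal is to handle several packages in one photo, the region data is worth keeping around. Here is a minimal sketch of one way to split OCR snippets into per-package groups, assuming the raw parsed payload (`parsed_answer[task_prompt]`) has the `quad_boxes` / `labels` layout and that the packages sit side by side. The helper names, the `max_gap` threshold, and the sample coordinates are all made up for illustration.

```python
def box_center_x(quad):
    """Mean x-coordinate of a quad given as [x1, y1, x2, y2, x3, y3, x4, y4]."""
    return sum(quad[0::2]) / 4


def group_by_package(ocr_result, max_gap=150):
    """Split snippets into left-to-right groups wherever the horizontal
    distance between neighbouring snippet centers exceeds max_gap pixels."""
    snippets = sorted(
        (box_center_x(quad), label.replace("</s>", "").strip())
        for quad, label in zip(ocr_result["quad_boxes"], ocr_result["labels"])
    )
    groups, previous_x = [], None
    for x, text in snippets:
        if previous_x is None or x - previous_x > max_gap:
            groups.append([])  # large horizontal jump: start a new package
        groups[-1].append(text)
        previous_x = x
    return groups


# Hypothetical payload: two snippets on a left package, one on a right package
sample = {
    "quad_boxes": [
        [40, 50, 200, 50, 200, 80, 40, 80],
        [45, 90, 120, 90, 120, 110, 45, 110],
        [520, 60, 700, 60, 700, 95, 520, 95],
    ],
    "labels": ["</s>Warfarin", "5 mg", "Aspirin"],
}
print(group_by_package(sample))
```

Snippets inside a group are ordered by horizontal position, not reading order, so this is only a starting point; stacked or rotated packages would need a proper 2D clustering step.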

2. Setting up the "Brain" with Milvus 🧠

Identifying the drug name is only half the battle. We need to check if "Drug A" and "Drug B" are safe to take together. To do this, we store thousands of drug interaction profiles in Milvus.

For a more production-ready approach to building these kinds of healthcare knowledge bases, I highly recommend checking out the advanced patterns discussed at WellAlly Blog. They have fantastic deep-dives on optimizing RAG pipelines for high-stakes environments.

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define Schema: Storing Drug Names and their Interaction Vectors
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="drug_name", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="interaction_notes", dtype=DataType.VARCHAR, max_length=2000),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768)
]

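schema = CollectionSchema(fields, "Drug Interaction Database")
collection = Collection("pill_db", schema)

# Milvus needs an index on the vector field, and the collection must be loaded
# into memory before it can be searched. IVF_FLAT matches the "nprobe" search
# parameter used below; nlist=128 is just a starting value to tune.
collection.create_index(
    "vector",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

def search_contraindications(detected_names):
    # Embed each detected name and fetch the closest drug profile from Milvus.
    # Returns one dict per match: {"name": ..., "conflicts": ...}
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = []
    for name in detected_names:
        # Assume get_embedding is a helper function using SentenceTransformers
        # that returns a 768-dimensional list of floats
        res = collection.search(
            [get_embedding(name)], "vector", search_params,
            limit=1, output_fields=["drug_name", "interaction_notes"],
        )
        for hit in res[0]:  # res[0] holds the hits for our single query vector
            results.append({"name": hit.entity.get("drug_name"), "conflicts": hit.entity.get("interaction_notes")})
    return results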
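One caveat with `limit=1`: a nearest-neighbor search always returns something, even when the query text is not a drug name at all. A common mitigation is to reject matches whose distance is too large. The sketch below shows the idea in plain Python with toy 3-dimensional vectors; the catalog, the vectors, and the `max_distance` value are invented for illustration. In pymilvus each hit exposes a `distance` attribute you could compare the same way, but pick the cutoff empirically and check whether your Milvus version reports squared distances for the L2 metric.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_within(query, catalog, max_distance):
    """Return (name, distance) for the closest catalog entry,
    or None when even the best match is farther than max_distance."""
    name, vector = min(catalog.items(), key=lambda item: l2_distance(query, item[1]))
    distance = l2_distance(query, vector)
    return (name, distance) if distance <= max_distance else None


# Toy stand-ins for real 768-dimensional embeddings
catalog = {
    "warfarin": [1.0, 0.0, 0.0],
    "aspirin": [0.0, 1.0, 0.0],
}

print(nearest_within([0.9, 0.1, 0.0], catalog, max_distance=0.5))  # close to "warfarin"
print(nearest_within([0.0, 0.0, 1.0], catalog, max_distance=0.5))  # far from everything
```

The second query is equally far from both entries, so without the cutoff it would still be mapped to one of them.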

3. The Warning System (The Logic)

Once we have the interaction notes from Milvus, we use an LLM (or a rule-based engine) to check for overlapping contraindications. The version below is the rule-based variant: a pairwise comparison over the retrieved metadata.

def check_compatibility(drug_list_metadata):
    """
    Compare every pair of detected drugs and flag a warning when either
    drug's name appears in the other's 'conflicts' entry.
    Each item is a dict with 'name' and 'conflicts' keys.
    """
    warnings = []
    for i in range(len(drug_list_metadata)):
        for j in range(i + 1, len(drug_list_metadata)):
            a, b = drug_list_metadata[i], drug_list_metadata[j]
            # The same drug can be matched twice (e.g. two labels on one box)
            if a['name'] == b['name']:
                continue
            # Check both directions: the conflict may be recorded on only one side
            if b['name'] in a['conflicts'] or a['name'] in b['conflicts']:
                warnings.append(f"⚠️ DANGER: {a['name']} and {b['name']} should not be taken together!")
    return warnings
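If the conflict data can be normalized ahead of time, a stricter variant is to store each drug's conflicts as a set of lowercase names rather than matching against free text. Here is a small sketch of that idea using `itertools.combinations`; the `find_conflicts` name and the sample profiles are illustrative only and are not a clinical reference.

```python
from itertools import combinations


def find_conflicts(profiles):
    """profiles maps a normalized drug name to the set of names it conflicts with.
    Returns the conflicting pairs, each in alphabetical order."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        # A conflict recorded on either side is enough to flag the pair
        if b in profiles[a] or a in profiles[b]:
            pairs.append((a, b))
    return pairs


# Illustrative data: the conflict is recorded on one side only
profiles = {
    "warfarin": {"aspirin"},
    "aspirin": set(),
    "paracetamol": set(),
}
print(find_conflicts(profiles))
```

Exact set membership avoids accidental substring hits, at the cost of needing a cleaning step that turns free-text interaction notes into normalized names.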

4. Launching the Interface with Gradio 🎨

Finally, we wrap everything in a user-friendly UI so anyone can upload an image of their pill bottles.

import gradio as gr

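def pill_scan_app(image):
    # Gradio passes None when the user submits without uploading an image
    if image is None:
        return "", "Please upload a photo first."

    # 1. Florence-2 OCR
    extracted_text = run_florence_ocr(image)
    if not extracted_text:
        # The "No" branch of the architecture diagram
        return "", "Error: No pills detected."

    # 2. RAG Retrieval via Milvus
    med_info = search_contraindications(extracted_text)

    # 3. Check for warnings
    warnings = check_compatibility(med_info)

    report = "\n".join(warnings) if warnings else "✅ No interactions detected."
    return str(extracted_text), report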

interface = gr.Interface(
    fn=pill_scan_app,
    inputs=gr.Image(type="filepath"),
    outputs=["text", "text"],
    title="PillScan AI 💊",
    description="Upload a photo of multiple drug packages to check for safety warnings."
)

interface.launch()

💡 Lessons Learned & Next Steps

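Building PillScan taught me that the hardest part of Multimodal RAG isn't the vision; it's the data cleaning. OCR often returns noisy text (like "500mg" or "Keep in a cool place"), which can confuse vector search: a nearest-neighbor query always returns its closest match, so even a storage instruction gets mapped to some drug.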

How to improve this:

Entity Linking: Use an LLM to filter OCR results before sending them to Milvus.
Edge Deployment: Florence-2 is small enough that we could potentially run this entire pipeline on an iPad for offline clinical use.
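Before reaching for an LLM, a cheap rule-based pre-filter can already remove the most obvious noise. The sketch below is a stand-in for the entity-linking idea, not an implementation of it: it drops dosage strings and lines that open with a few common label words. The regex, the stop-word set, and the function name are assumptions you would tune against real labels.

```python
import re

# Lines that look like a dosage, e.g. "500mg" or "2.5 ml"
DOSAGE = re.compile(r"^\d+(\.\d+)?\s*(mg|mcg|g|ml|iu)\b", re.IGNORECASE)

# First words that usually signal instructions or packaging text (illustrative)
STOP_WORDS = {"keep", "store", "tablets", "capsules"}


def candidate_drug_names(ocr_lines):
    """Return the OCR lines that survive the noise heuristics."""
    candidates = []
    for line in ocr_lines:
        text = line.strip()
        if len(text) < 3 or DOSAGE.match(text):
            continue
        first_word = re.split(r"\W+", text.lower(), maxsplit=1)[0]
        if first_word in STOP_WORDS:
            continue
        candidates.append(text)
    return candidates


print(candidate_drug_names(["Warfarin", "500mg", "Keep in a cool place", "Aspirin"]))
```

Anything that passes this filter could then go to the LLM (or straight to Milvus), which keeps the expensive step focused on plausible names.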

For more deep-dives into scaling AI models and deploying robust RAG systems, the WellAlly Engineering Blog is an incredible resource that helped me optimize the vector indexing part of this project.

🚀 Join the Conversation!

Have you worked with Florence-2 yet? Or are you using a different vector DB like Pinecone or Weaviate? Let me know in the comments below! Don't forget to ❤️ and 🦄 if you found this helpful!