Beck_Moulton
Pill-ID: Building a Visual RAG System for Medication Safety with CLIP and Milvus

Have you ever looked at a handful of white, round tablets and wondered, "Wait, is this my aspirin or my blood pressure medication?" Medication errors are a silent crisis, but thanks to the rise of Visual RAG and Multimodal AI, we can now build systems that "see" and verify medication in real-time.

In this tutorial, we are going to build Pill-ID, a cross-check system that uses Computer Vision and Vector Databases to identify pills from a photo and verify them against an electronic prescription. We'll be leveraging the power of CLIP for multimodal embeddings, Milvus for high-speed similarity search, and FastAPI to tie it all together.


The Architecture: How Visual RAG Works

Unlike traditional RAG (Retrieval-Augmented Generation) which focuses on text, Visual RAG allows us to query a database using image features. We don't just search for the word "Ibuprofen"; we search for a vector that represents the specific shape, color, and texture of an Ibuprofen pill.

graph TD
    A[User Takes Photo] --> B[OpenCV: Image Preprocessing]
    B --> C[CLIP: Generate Image Embedding]
    C --> D[Milvus: Vector Similarity Search]
    D --> E[Retrieve Metadata: Pill Name, Dosage]
    E --> F[FastAPI: Cross-check with Prescription]
    F --> G{Match Found?}
    G -- Yes --> H[🚀 Safety Verified]
    G -- No --> I[⚠️ Warning: Dosage Mismatch]

Prerequisites

To follow along, you'll need the following tech stack:

  • Python 3.9+
  • OpenCV: For image manipulation.
  • CLIP (OpenAI): To bridge the gap between text and images.
  • Milvus: Our high-performance vector database.
  • FastAPI: To build our lightning-fast API.

Step 1: Setting Up the Vector Brain (Milvus)

First, we need a place to store our "visual fingerprints." Milvus is a good fit because it performs similarity search over millions of vectors with millisecond-level latency.

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define Schema: ID, Image Vector (512 dims for CLIP), and Metadata
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="pill_vector", dtype=DataType.FLOAT_VECTOR, dim=512),
    FieldSchema(name="pill_name", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="dosage_mg", dtype=DataType.INT64)
]

schema = CollectionSchema(fields, "Pill identification collection")
pill_collection = Collection("pill_registry", schema)

# A collection must be indexed and loaded into memory before it can be searched
index_params = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}}
pill_collection.create_index("pill_vector", index_params)
pill_collection.load()
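To make the retrieval step concrete before we wire up Milvus, here's a dependency-free sketch of what an L2 nearest-neighbor search does conceptually. The registry and the tiny 3-dimensional vectors are made up for illustration; Milvus does the same thing at scale with ANN indexes instead of a brute-force scan.

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance -- the same metric we configure in Milvus
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, registry):
    # registry: list of (name, vector) pairs; return the closest entry
    return min(registry, key=lambda item: l2_distance(query, item[1]))

# Toy "visual fingerprints" (real CLIP vectors have 512 dimensions)
registry = [
    ("Aspirin 100mg", [0.9, 0.1, 0.0]),
    ("Ibuprofen 200mg", [0.1, 0.8, 0.2]),
]

print(nearest([0.85, 0.15, 0.05], registry)[0])  # → Aspirin 100mg
```

Milvus replaces the `min(...)` scan with an IVF_FLAT index so the lookup stays fast even with millions of registered pills.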

Step 2: Extracting Features with CLIP

CLIP (Contrastive Language-Image Pre-training) is the secret sauce: it converts an image of a pill into a 512-dimensional vector that captures its shape, color, and texture.

import cv2
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

def get_image_embedding(image):
    # Accept either a file path or an OpenCV BGR array (what our preprocessor returns)
    if isinstance(image, np.ndarray):
        img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    else:
        img = Image.open(image).convert("RGB")
    embedding = model.encode(img)
    return embedding.tolist()
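One practical tip: CLIP embeddings are usually compared by cosine similarity, while our Milvus collection uses the L2 metric. If we L2-normalize every vector before inserting and before searching, ranking by L2 distance becomes equivalent to ranking by cosine similarity. A minimal, dependency-free sketch of the normalization step:

```python
import math

def normalize(vec):
    # Scale the vector to unit length; on unit vectors, L2 distance
    # and cosine similarity produce the same nearest-neighbor ranking
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```

You could apply `normalize()` to the output of `get_image_embedding` both when seeding the registry and at query time, so both sides of the comparison live on the unit sphere.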

Step 3: Preprocessing with OpenCV

Raw photos can be messy. We use OpenCV to crop the pill and remove background noise to ensure our CLIP model focuses only on the medicine.

import cv2
import numpy as np

def preprocess_pill_image(image_bytes):
    nparr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    # Simple thresholding to find the pill contour (assumes a light/white background)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)

    # Find the largest contour (the pill) and crop
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        c = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(c)
        cropped_pill = img[y:y+h, x:x+w]
        return cropped_pill
    return img
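The contour-and-crop step above boils down to finding the tight bounding box of the non-background pixels. Here's a dependency-free sketch of that idea on a plain nested-list "grayscale image" (the `background` value and the toy image are illustrative), which can help when reasoning about edge cases like an empty frame:

```python
def crop_to_foreground(img, background=255):
    # img: 2D list of grayscale values; crop to the tight box of non-background pixels
    rows = [r for r, row in enumerate(img) if any(v != background for v in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] != background for row in img)]
    if not rows:
        return img  # nothing detected; return the frame unchanged, like the OpenCV version
    return [row[min(cols):max(cols) + 1] for row in img[min(rows):max(rows) + 1]]

# A 4x4 "photo" with a 2x2 dark pill in the middle
img = [
    [255, 255, 255, 255],
    [255,  40,  40, 255],
    [255,  40,  40, 255],
    [255, 255, 255, 255],
]
cropped = crop_to_foreground(img)
print(len(cropped), len(cropped[0]))  # 2 2
```

OpenCV's `findContours` + `boundingRect` does the same thing far more robustly (handling noise, holes, and multiple blobs), which is why the real pipeline uses it.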

Step 4: The FastAPI Orchestrator

Now, let's build the endpoint that receives a photo and a prescription, then does the cross-checking.

from fastapi import FastAPI, UploadFile, File, HTTPException

app = FastAPI()

@app.post("/verify-medication/")
async def verify_medication(prescribed_name: str, file: UploadFile = File(...)):
    # 1. Preprocess and Embed
    image_bytes = await file.read()
    processed_img = preprocess_pill_image(image_bytes)
    vector = get_image_embedding(processed_img)

    # 2. Search Milvus for the closest match (collection must be indexed and loaded)
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = pill_collection.search(
        data=[vector],
        anns_field="pill_vector",
        param=search_params,
        limit=1,
        output_fields=["pill_name"]
    )

    if not results or len(results[0]) == 0:
        raise HTTPException(status_code=404, detail="No matching pill found in the registry.")

    detected_name = results[0][0].entity.get("pill_name")

    # 3. Cross-check Logic
    if detected_name.lower() == prescribed_name.lower():
        return {"status": "MATCH", "message": f"Verified: {detected_name} identified."}
    else:
        return {
            "status": "WARNING",
            "message": f"Possible mismatch! Found {detected_name} but prescription says {prescribed_name}."
        }

Advanced Patterns & Production Safety

Building a proof-of-concept is easy, but making it production-ready for healthcare requires strict validation, confidence scoring, and robust data pipelines.
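Confidence scoring is the easiest of these to bolt on: the endpoint above trusts the top-1 hit unconditionally, but the nearest neighbor of a blurry photo can still be far away in embedding space. One common pattern is to reject matches whose distance exceeds a threshold tuned on a validation set. A sketch of that gate (the `max_distance` value and the `verdict` helper are illustrative, not from the original code):

```python
def verdict(distance, detected_name, prescribed_name, max_distance=0.35):
    # Reject low-confidence matches before doing the name cross-check.
    # max_distance is a hypothetical threshold; tune it on labeled validation photos.
    if distance > max_distance:
        return "UNRECOGNIZED"
    if detected_name.lower() == prescribed_name.lower():
        return "MATCH"
    return "WARNING"

print(verdict(0.12, "Aspirin", "aspirin"))  # MATCH
print(verdict(0.80, "Aspirin", "aspirin"))  # UNRECOGNIZED
```

In the FastAPI handler, the hit's distance is available as `results[0][0].distance`, so this check slots in right before the cross-check step. For a safety-critical system, "UNRECOGNIZED" should fail closed: ask the user to retake the photo rather than guess.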

If you are looking for more advanced patterns on deploying these multimodal models or want to see how to scale vector search in a cloud-native environment, I highly recommend checking out the technical deep dives at WellAlly Tech Blog. They cover great architectural insights on moving from "Localhost AI" to "Enterprise AI."


Conclusion

By combining CLIP and Milvus, we've turned a difficult computer vision problem into a simple vector search problem. Pill-ID demonstrates how Visual RAG can be applied to real-world safety scenarios, potentially saving lives by preventing medication errors.

What's next?

  • Add support for blister pack detection.
  • Integrate with FHIR (Fast Healthcare Interoperability Resources) APIs for real prescription data.
  • Deploy the model using ONNX for faster edge inference.

Have questions about the vector search or the CLIP implementation? Drop a comment below!
