Multimodal RAG in Action: Building a Skin Health Assistant with CLIP and Milvus

In the world of AI, we've moved far beyond simple text-based search. But when it comes to healthcare, "text-only" doesn't cut it. Imagine a patient describing a mole: "It's itchy and dark." That’s helpful, but a high-resolution photo is worth a thousand tokens.

Today, we are diving deep into Multimodal RAG (Retrieval-Augmented Generation). We’ll build a Decision Support System that fuses skin lesion images with family medical history (text) using a unified vector space. We are talking about leveraging Multimodal RAG, CLIP embeddings, and Milvus vector search to bridge the gap between pixels and pathology.

Ready to build the future of digital health? Let's get cooking! 🚀


The Architecture: Bridging Visuals and Verbiage

Traditional RAG systems usually handle text only, using embedding models such as OpenAI's text-embedding-3 family. For skin health, however, we need a "shared brain" that understands both images and text. This is where CLIP (Contrastive Language-Image Pre-training) comes in.

CLIP allows us to project both images and text into the same high-dimensional space. If a photo looks like "melanoma," its vector will land close to the vector for the text "melanoma" in our Milvus database.
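To make that concrete, here is a quick zero-shot sketch using the Hugging Face CLIP implementation we'll wire up properly in Step 2. The image path and candidate labels are placeholders, not real clinical data:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical lesion photo and candidate descriptions
image = Image.open("lesion_example.jpg").convert("RGB")
labels = ["a photo of melanoma", "a photo of a benign nevus", "a photo of eczema"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores in the shared space
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")

The point is simply that one model scores how well each text matches the image, because both live in the same vector space. That is exactly the property our retrieval pipeline will exploit.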

System Data Flow

graph TD
    A[User Input: Image + Medical History] --> B{CLIP Encoder}
    B -->|Image Vector| C[Vector Space]
    B -->|Text Vector| C
    C --> D[Milvus Vector DB]
    D -->|Similarity Search| E[Retrieved Medical Cases / Guidelines]
    E --> F[FastAPI Logic Layer]
    F --> G[Decision Support Output]

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#69f,stroke:#333,stroke-width:2px

Prerequisites

To follow this advanced guide, you'll need:

  • Python 3.9+
  • Docker (to run Milvus)
  • Tech Stack:
    • CLIP (OpenAI's implementation or HuggingFace transformers)
    • pymilvus (Vector storage)
    • FastAPI (The API backbone)
    • Pillow (Image processing)

Step 1: Setting Up the Multimodal Vector Store (Milvus)

We need a database that can handle high-dimensional vectors at scale. Milvus is the gold standard here. We will define a schema that holds our visual features and the associated clinical metadata.

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define Schema: ID, Image Vector, and Metadata
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=512), # CLIP-ViT-B-32 uses 512
    FieldSchema(name="patient_history_summary", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="diagnosis_label", dtype=DataType.VARCHAR, max_length=100)
]

schema = CollectionSchema(fields, "Skin health multimodal data store")
skin_collection = Collection("skin_health_rag", schema)

# Create Index for lightning-fast retrieval
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}
skin_collection.create_index("vector", index_params)
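One practical note: while you iterate on the schema during development, it helps to drop any existing collection before re-running the setup script. A minimal sketch using pymilvus's utility helpers (same collection name as above):

from pymilvus import connections, utility

connections.connect("default", host="localhost", port="19530")

# Development-only: recreate the collection from scratch when the schema changes.
# In production you would migrate data instead of dropping it.
if utility.has_collection("skin_health_rag"):
    utility.drop_collection("skin_health_rag")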

Step 2: The Encoder Logic (CLIP)

We use CLIP to transform both the skin lesion photo and the medical text into the same vector space. This is the "magic" that allows cross-modal retrieval.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_multimodal_embeddings(image_path=None, text=None):
    # Encode an image OR a text string into the shared 512-dim CLIP space
    if image_path:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
    elif text:
        inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            features = model.get_text_features(**inputs)
    else:
        raise ValueError("Provide either image_path or text")

    # L2-normalize so image and text vectors live on a comparable scale.
    # This matters later, when we average the two modalities for fusion retrieval.
    features = features / features.norm(dim=-1, keepdim=True)
    return features.squeeze(0).tolist()
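Before retrieval can work, the collection needs reference cases to search against. Here is a minimal ingestion sketch under the Step 1 schema; the case details below are made up purely for illustration:

# Hypothetical reference case: an annotated lesion photo plus a history summary
case_image = "reference_cases/melanoma_0042.jpg"
case_history = "58-year-old patient, family history of melanoma, lesion grew over 3 months"
case_label = "melanoma"

img_vec = get_multimodal_embeddings(image_path=case_image)

# Column-based insert: field order follows the schema (auto_id skips the id field)
skin_collection.insert([
    [img_vec],        # vector
    [case_history],   # patient_history_summary
    [case_label],     # diagnosis_label
])
skin_collection.flush()  # make the new entities searchable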

Step 3: Building the Fusion Retrieval Engine

When a doctor or user submits a new case, we calculate the vector for the new image and the text. We then perform a hybrid search.

For a truly production-ready implementation, you should check out the advanced architectural patterns over at WellAlly Tech Blog. They cover how to handle high-concurrency medical data pipelines, which is crucial for systems like this.

import os

from fastapi import FastAPI, UploadFile, File, Form

app = FastAPI(title="DermAssist AI")

@app.post("/analyze-skin")
async def analyze_skin(history: str = Form(...), file: UploadFile = File(...)):
    # 1. Save the uploaded image to a temporary file
    img_path = f"temp_{file.filename}"
    with open(img_path, "wb") as buffer:
        buffer.write(await file.read())

    try:
        # 2. Encode both modalities into the shared CLIP space
        img_vec = get_multimodal_embeddings(image_path=img_path)
        text_vec = get_multimodal_embeddings(text=history)
    finally:
        os.remove(img_path)  # don't leave patient images on disk

    # Fusion: combine visual symptoms with historical context.
    # In a real scenario, the weights would be tuned (e.g., 0.7 image, 0.3 text).
    combined_vec = [(x + y) / 2 for x, y in zip(img_vec, text_vec)]

    # 3. Search Milvus for the nearest reference cases
    skin_collection.load()
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = skin_collection.search(
        data=[combined_vec],
        anns_field="vector",
        param=search_params,
        limit=3,
        output_fields=["diagnosis_label", "patient_history_summary"]
    )

    return {
        "recommendations": [
            {
                "diagnosis_label": hit.entity.get("diagnosis_label"),
                "patient_history_summary": hit.entity.get("patient_history_summary"),
                "distance": hit.distance,
            }
            for hit in results[0]
        ]
    }
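To sanity-check the endpoint locally, you can drive it with FastAPI's TestClient. The image path and history string below are placeholders:

from fastapi.testclient import TestClient

client = TestClient(app)

with open("sample_lesion.jpg", "rb") as f:
    response = client.post(
        "/analyze-skin",
        data={"history": "Itchy, dark mole on forearm; father treated for melanoma"},
        files={"file": ("sample_lesion.jpg", f, "image/jpeg")},
    )

print(response.status_code)
print(response.json())  # top-3 similar cases with diagnosis labels and distances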

Going Beyond the Basics: The "Official" Way

While this setup gets you a working prototype, building a clinical-grade system requires much more:

  1. Vector re-ranking: Using a cross-encoder to refine the initial search results (see the sketch after this list).
  2. Privacy: Implementing HIPAA-compliant data handling.
  3. Data drift: Monitoring whether your CLIP model still generalizes to images from new types of imaging equipment.
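For point 1, a common pattern is to re-rank the retrieved case summaries against the incoming patient history with a text cross-encoder. A minimal sketch, assuming the sentence-transformers library and a general-purpose (not medical-specific) re-ranking model:

from sentence_transformers import CrossEncoder

# General-purpose re-ranker; a domain-tuned model would be preferable in practice
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_cases(history: str, retrieved_cases: list[dict]) -> list[dict]:
    # Score each (patient history, retrieved case summary) pair
    pairs = [(history, case["patient_history_summary"]) for case in retrieved_cases]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(retrieved_cases, scores), key=lambda x: x[1], reverse=True)
    return [case for case, _ in ranked]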

For a deeper dive into scaling vector databases and orchestrating complex RAG pipelines in the medical domain, I highly recommend reading the engineering deep-dives on the WellAlly Tech Blog. Their recent pieces on productionizing LLM apps provide the missing link between a "cool demo" and a "deployed product."


Conclusion

We’ve successfully built a Multimodal RAG foundation! By combining the visual power of CLIP with the industrial strength of Milvus, we created a system that doesn't just read words—it "sees" the patient's condition.

The next step? Integrating a Vision-Language Model (like GPT-4o or LLaVA) to generate a final conversational report based on these retrieved "similar cases."
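As a teaser, here is a rough sketch of that last step using the openai Python SDK. The model choice, prompt, and helper function are illustrative assumptions, not a prescribed design:

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_report(image_path: str, history: str, similar_cases: list[dict]) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    cases_text = "\n".join(
        f"- {c['diagnosis_label']}: {c['patient_history_summary']}" for c in similar_cases
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are a dermatology decision-support assistant. "
                    f"Patient history: {history}\n"
                    f"Similar retrieved cases:\n{cases_text}\n"
                    "Summarize possible concerns and suggest next steps. "
                    "Do not give a definitive diagnosis."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content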

What are you building with Multimodal RAG? Let me know in the comments below! 👇
