Beck_Moulton

Medicine Encyclopedia 2.0: Stop Guessing and Start Scanning with Multimodal RAG

We’ve all been there: staring at a tiny medicine box, squinting at chemical names like Acetaminophen or Guaifenesin, and wondering—"Can I take this with my allergy meds?" Traditionally, you'd have to manually Google every ingredient, which is slow and prone to error.

In this tutorial, we are building Medicine Encyclopedia 2.0, a project that leverages Multimodal RAG (Retrieval-Augmented Generation) and Optical Character Recognition (OCR) to detect drug-to-drug interactions in real time. By combining image processing with the official RxNav API and a vector database like ChromaDB, we can turn a simple smartphone photo into a personalized health advisor. Whether you're interested in AI-driven healthcare or just want to master Drug Interaction Detection, this guide covers the full pipeline from pixels to safety alerts.


The Architecture

The logic flow involves capturing an image, extracting the active ingredients, querying a specialized medical database, and using RAG to provide a human-readable summary.

graph TD
    A[User Uploads Photo] --> B[PaddleOCR: Text Extraction]
    B --> C{Entity Extraction}
    C -->|Drug Names| D[RxNav API: Interaction Check]
    C -->|Dosage Info| E[ChromaDB: Manuals/Guidelines]
    D --> F[LLM Reasoning Engine]
    E --> F
    F --> G[Final Response: Safety Advice]
    G --> H[Evaluation: RAGas]

Prerequisites

To follow along, you’ll need the following stack:

  • PaddleOCR: Ultra-fast and accurate OCR.
  • ChromaDB: Our lightweight vector store for local drug manuals.
  • RxNav API: The gold standard for drug interaction data (provided by the National Library of Medicine).
  • RAGas: To evaluate whether our RAG pipeline is hallucinating.

Step 1: Extracting Ingredients with PaddleOCR

First, we need to turn that blurry JPG into structured text. PaddleOCR is fantastic for this because it handles tilted text and various fonts found on medicine packaging.

from paddleocr import PaddleOCR

# Initialize the OCR engine (the angle classifier handles tilted text)
ocr = PaddleOCR(use_angle_cls=True, lang='en')

def get_drug_names(img_path):
    result = ocr.ocr(img_path, cls=True)
    # PaddleOCR returns one result list per image; an entry is None
    # when no text is detected, so guard against that.
    raw_text = [line[1][0] for res in result if res for line in res]
    print(f"Detected Text: {raw_text}")
    return " ".join(raw_text)

# Example usage
# extracted_text = get_drug_names("advil_box.jpg")

# Example usage
# extracted_text = get_drug_names("advil_box.jpg")
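The architecture diagram has an "Entity Extraction" step between OCR and the API calls: the raw OCR string is noisy ("200mg", "Pain Reliever", batch numbers), so we need to pick out just the drug names. Here's a minimal sketch using a hypothetical mini-lexicon of active ingredients; a production system would match against RxNorm's full ingredient list or use a medical NER model instead.

```python
import re

# Hypothetical mini-lexicon for illustration only; swap in RxNorm's
# ingredient list or an NER model for real coverage.
KNOWN_INGREDIENTS = {"ibuprofen", "acetaminophen", "guaifenesin", "diphenhydramine"}

def extract_entities(raw_text):
    """Pick out known active ingredients from noisy OCR output."""
    tokens = re.findall(r"[A-Za-z]+", raw_text)
    return sorted({t.lower() for t in tokens if t.lower() in KNOWN_INGREDIENTS})

# Example: OCR text from a painkiller box
print(extract_entities("Advil Ibuprofen 200mg Pain Reliever"))  # ['ibuprofen']
```

The set-then-sort keeps the output deduplicated and deterministic, which matters when the same ingredient appears on the box multiple times.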

Step 2: Querying the RxNav API for Conflicts

Extracting the name "Advil" isn't enough; we need to know its active ingredient (Ibuprofen) and what it reacts with. The RxNav API allows us to find interactions between multiple drugs.

import requests

def check_interactions(rxcuis):
    """
    rxcuis: A list of RxNorm Concept Unique Identifiers (RxCUIs)
    """
    ids = "+".join(rxcuis)
    url = f"https://rxnav.nlm.nih.gov/REST/interaction/list.json?rxcuis={ids}"
    response = requests.get(url, timeout=10).json()

    interactions = []
    # The key is absent when RxNav finds no interactions, so use .get()
    for group in response.get("fullInteractionTypeGroup", []):
        for item in group["fullInteractionType"]:
            # Collect every interaction pair, not just the first
            for pair in item["interactionPair"]:
                interactions.append(pair["description"])
    return interactions
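One missing link: check_interactions wants RxCUIs, but OCR gives us names like "ibuprofen". RxNav's /REST/rxcui.json endpoint maps a name to its RxCUI. Here's a sketch with the JSON parsing split into its own function so it can be tested without hitting the network; the payload shape follows RxNav's documented idGroup/rxnormId response.

```python
def parse_rxcui(payload):
    """Pull the first RxCUI out of an RxNav /rxcui.json response dict."""
    ids = payload.get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

def name_to_rxcui(drug_name):
    """Resolve a drug name (e.g. 'ibuprofen') to an RxCUI via RxNav."""
    import requests  # local import keeps the parser above dependency-free
    url = "https://rxnav.nlm.nih.gov/REST/rxcui.json"
    resp = requests.get(url, params={"name": drug_name}, timeout=10)
    return parse_rxcui(resp.json())
```

With this in place the pipeline becomes: OCR text → extracted names → `name_to_rxcui` for each name → `check_interactions` on the resulting list.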

Step 3: Augmenting with Local Context (ChromaDB)

Sometimes the API doesn't have the "human" touch—like specific hospital guidelines or your personal health history. We use ChromaDB to store and retrieve these nuances.

import chromadb

client = chromadb.Client()
# get_or_create avoids an error if the collection already exists
collection = client.get_or_create_collection(name="medical_guidelines")

# Add some local context
collection.add(
    documents=["Patient A has a history of stomach ulcers. Avoid NSAIDs like Ibuprofen."],
    metadatas=[{"source": "medical_record"}],
    ids=["id1"]
)

def get_local_context(query):
    results = collection.query(query_texts=[query], n_results=1)
    # 'documents' holds one list of matches per query text
    return results['documents'][0]

The "Official" Way to Build AI Agents

While this tutorial provides a great starting point for a "Learning in Public" project, building production-grade AI healthcare tools requires robust prompt engineering and rigorous data privacy handling.

For more advanced patterns, such as Agentic RAG or Production-ready Multimodal Pipelines, I highly recommend checking out the deep-dive articles at WellAlly Tech Blog. They cover the architectural nuances that take a prototype from "cool hobby project" to "scalable enterprise solution."


Step 4: Putting it All Together with LLM

Finally, we feed the OCR results, the RxNav interactions, and the ChromaDB context into an LLM (like GPT-4o) to generate a warning that a human can actually understand.

def generate_safety_report(ocr_text, interactions, context):
    prompt = f"""
    User scanned a medicine: {ocr_text}
    Known clinical interactions: {interactions}
    Personal context: {context}

    Provide a simple 'Safe' or 'Warning' report for the user.
    """
    # Call your LLM of choice with `prompt` here; placeholder response:
    return "WARNING: You are taking Advil, but your records show stomach ulcers. Consult a doctor!"
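To replace that placeholder with a real call, here's one way to wire the prompt into GPT-4o using the official openai client. The system message and message framing are our own choices, not part of the original pipeline, so treat this as a sketch; it assumes OPENAI_API_KEY is set in your environment.

```python
def build_messages(ocr_text, interactions, context):
    """Assemble the chat messages for the safety-report prompt."""
    user_prompt = (
        f"User scanned a medicine: {ocr_text}\n"
        f"Known clinical interactions: {interactions}\n"
        f"Personal context: {context}\n\n"
        "Provide a simple 'Safe' or 'Warning' report for the user."
    )
    return [
        # Hypothetical system prompt; tune the persona to your needs
        {"role": "system", "content": "You are a cautious medication-safety assistant."},
        {"role": "user", "content": user_prompt},
    ]

def generate_safety_report_llm(ocr_text, interactions, context):
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(ocr_text, interactions, context),
    )
    return resp.choices[0].message.content
```

Keeping build_messages separate from the network call makes the prompt assembly unit-testable and easy to swap between providers.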

Step 5: Evaluating with RAGas

How do we know our RAG isn't just making things up? We use RAGas to measure "Faithfulness" and "Answer Relevancy."

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

# Construct a small dataset from the outputs of the earlier steps
data_samples = {
    'question': ['Can I take Advil with my current meds?'],
    'answer': [generated_report],            # from generate_safety_report()
    'contexts': [[f"{interactions} {context}"]],
    'ground_truth': ['Warning: Ibuprofen conflicts with ulcer history.']
}

dataset = Dataset.from_dict(data_samples)
# score = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
# print(score)

Conclusion: The Future of Health-Tech

By combining PaddleOCR for vision, RxNav for medical truth, and ChromaDB for personalized context, we've built a tool with real potential to catch dangerous drug interactions before they happen. Multimodal RAG is moving fast, and this is just the tip of the iceberg!

What's next?

  1. Try adding a "pill identification" feature using a CNN.
  2. Integrate voice-to-text so users can ask questions hands-free.

If you enjoyed this build, drop a comment below or 🦄 heart this post! And don't forget to visit WellAlly Tech for more high-level AI tutorials.

Happy coding!
