From Panic to Paper: Building a Multilingual Medical RAG with BGE-M3 & Qdrant

Ever find yourself spiraling down a WebMD rabbit hole at 3 AM? For those of us without a medical degree, navigating clinical research can feel like deciphering an ancient dialect. But what if you could build a personal AI assistant that sifts through thousands of PubMed abstracts to find exactly what you need?

In this tutorial, we are diving deep into Semantic Search and Retrieval-Augmented Generation (RAG) to build a high-performance medical literature explorer. We’ll be leveraging the power of the BGE-M3 model, Hybrid Search, and LangGraph to create a system that understands medical nuances across multiple languages. Whether you're looking for "hypertension" or "高血压," our system will find the right research papers using state-of-the-art vector embeddings and keyword matching.

Why BGE-M3? The Multi-Vector Magic 🪄

Standard RAG often fails in specialized fields like medicine because it relies solely on dense vectors. BGE-M3 (Multi-Linguality, Multi-Functionality, Multi-Granularity) changes the game by supporting three retrieval modes in a single model (all three are demonstrated in the snippet below):

  1. Dense Retrieval: Captures semantic meaning.
  2. Sparse Retrieval: BM25-style learned lexical weights that capture exact medical terminology and acronyms.
  3. Multi-Vector Retrieval: ColBERT-style per-token vectors for fine-grained matching.
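
To see what that means in practice, here's a minimal sketch that requests all three representations at once (assuming FlagEmbedding is installed and a GPU is available for fp16; drop use_fp16 on CPU). We'll set this model up properly in Step 2.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

out = model.encode(
    ["Hypertension management in elderly patients"],
    return_dense=True,        # one 1024-dim semantic vector per input
    return_sparse=True,       # one {token_id: weight} lexical map per input
    return_colbert_vecs=True  # per-token vectors for fine-grained matching
)

print(out['dense_vecs'].shape)         # (1, 1024)
print(len(out['lexical_weights'][0]))  # number of weighted tokens
print(out['colbert_vecs'][0].shape)    # (num_tokens, 1024)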

The Architecture

Our system follows a pipeline from PDF parsing to final-answer orchestration via a state machine, sketched in the Mermaid diagram below.

graph TD
    A[Medical PDFs/PubMed XML] --> B[PyMuPDF Parsing]
    B --> C[Text Chunking]
    C --> D[BGE-M3 Encoder]
    D --> E{Hybrid Indexing}
    E -->|Dense Vector| F[Qdrant Collection]
    E -->|Sparse Vector| F
    G[User Query] --> H[LangGraph Orchestrator]
    H --> I[Query Rewriter]
    I --> J[Qdrant Hybrid Search]
    J --> K[Reranker]
    K --> L[LLM Synthesis]
    L --> M[Final Answer with Citations]

Prerequisites

Before we start, ensure you have the following in your tech stack:

  • BGE-M3: Via the FlagEmbedding library or Hugging Face.
  • Qdrant: Our vector database for hybrid search.
  • LangGraph: For managing the agentic workflow.
  • PyMuPDF: For high-performance PDF extraction.
pip install qdrant-client langgraph FlagEmbedding pymupdf langchain-openai

Step 1: Ingesting and Parsing Medical Literature

Medical papers are notoriously complex. We use PyMuPDF (fitz) because it handles multi-column layouts better than most libraries.

import fitz  # PyMuPDF

def extract_medical_text(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # "text" mode returns plain reading-order text; switch to
            # "blocks" if you need finer control over multi-column layouts
            text += page.get_text("text") + "\n"
    return text

# Example usage
raw_content = extract_medical_text("local_pubmed_abstract.pdf")
print(f"Extracted {len(raw_content)} characters.")
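
The architecture diagram includes a chunking step before embedding. A minimal character-based sliding window is enough for this demo; the size and overlap below are assumptions you should tune against your own corpus:

def chunk_text(text, chunk_size=1500, overlap=200):
    # Overlapping windows keep context (e.g., a dosage split
    # mid-sentence) from being lost at chunk boundaries
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = chunk_text(raw_content)
print(f"Produced {len(chunks)} chunks.")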

Step 2: The Multi-Vector Embedding Engine

BGE-M3 allows us to generate both dense and sparse vectors simultaneously. This "Hybrid Search" is the secret sauce for medical precision.

from FlagEmbedding import BGEM3FlagModel

# Load the model (fp16 roughly halves memory usage on GPU)
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

def generate_embeddings(text_chunks):
    # Returns a dict with 'dense_vecs' (1024-dim arrays) and
    # 'lexical_weights' (one {token_id: weight} map per chunk)
    output = model.encode(
        text_chunks, 
        return_dense=True, 
        return_sparse=True, 
        return_colbert_vecs=False
    )
    return output

# Sample chunking and embedding
chunks = ["Patient exhibits symptoms of acute idiopathic polyneuritis...", "Study on GBS outcomes..."]
embeddings = generate_embeddings(chunks)
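
Before storing these, the sparse side needs translating: BGE-M3 emits per-chunk {token_id: weight} maps, while Qdrant expects parallel index/value lists. A small helper bridges the two (to_sparse_vector is my naming, not part of either library):

from qdrant_client import models

def to_sparse_vector(lexical_weights):
    # Convert BGE-M3's {token_id: weight} map into Qdrant's
    # SparseVector of parallel integer indices and float values
    return models.SparseVector(
        indices=[int(token_id) for token_id in lexical_weights],
        values=[float(w) for w in lexical_weights.values()]
    )

sparse_vecs = [to_sparse_vector(lw) for lw in embeddings['lexical_weights']]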

Step 3: Setting Up Qdrant for Hybrid Retrieval

Qdrant supports the storage of multiple vector types in a single point. This is crucial for combining the "vibes" of a query (Dense) with the "specifics" (Sparse).

from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # In-process instance for the demo

client.create_collection(
    collection_name="medical_docs",
    # Default (unnamed) dense vector: BGE-M3 outputs 1024 dimensions
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    # Named sparse vector for the BGE-M3 lexical weights
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(index=models.SparseIndexParams())
    }
)

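
With the collection in place, we can upsert the Step 2 embeddings and run a fused query. This sketch assumes qdrant-client >= 1.10 (for the Query API) and reuses the to_sparse_vector helper from Step 2; note that the empty string is Qdrant's name for the default, unnamed dense vector:

# Upsert: each point carries both vector types plus the raw text as payload
client.upsert(
    collection_name="medical_docs",
    points=[
        models.PointStruct(
            id=i,
            vector={
                "": embeddings['dense_vecs'][i].tolist(),  # default dense vector
                "text-sparse": sparse_vecs[i],
            },
            payload={"text": chunks[i]},
        )
        for i in range(len(chunks))
    ],
)

# Hybrid query: prefetch dense and sparse candidates, then fuse with RRF
query = model.encode(
    ["Guillain-Barre syndrome treatment"],
    return_dense=True, return_sparse=True
)
results = client.query_points(
    collection_name="medical_docs",
    prefetch=[
        models.Prefetch(query=query['dense_vecs'][0].tolist(), using="", limit=20),
        models.Prefetch(query=to_sparse_vector(query['lexical_weights'][0]),
                        using="text-sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)
for point in results.points:
    print(point.score, point.payload["text"][:60])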

Step 4: Orchestrating with LangGraph

To avoid "hallucinations," we don't just throw the first search result at the LLM. We use LangGraph to build a reasoning loop: Retrieve -> Grade Documents -> Generate. The skeleton below wires up the retrieve and generate nodes; the grading step is sketched just after it.

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class AgentState(TypedDict):
    query: str
    documents: List[str]
    answer: str

def retrieve(state: AgentState):
    # Logic to search Qdrant using Hybrid search
    print("---RETRIEVING FROM QDRANT---")
    return {"documents": ["Research Paper X: Treatment for..."]}

def generate(state: AgentState):
    print("---GENERATING FINAL ANSWER---")
    # Call OpenAI or local LLM here
    return {"answer": "Based on the retrieved research, the recommended protocol is..."}

# Build the Graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
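
Here is one way to slot in the promised grading step. The keyword-overlap grader is a deliberately crude stand-in for an LLM relevance check, shown only to illustrate the wiring:

def grade_documents(state: AgentState):
    # Toy grader: keep documents that share vocabulary with the query.
    # In production, replace this with an LLM call that scores relevance.
    query_terms = set(state["query"].lower().split())
    kept = [doc for doc in state["documents"]
            if query_terms & set(doc.lower().split())]
    return {"documents": kept}

graded = StateGraph(AgentState)
graded.add_node("retrieve", retrieve)
graded.add_node("grade", grade_documents)
graded.add_node("generate", generate)
graded.set_entry_point("retrieve")
graded.add_edge("retrieve", "grade")
graded.add_edge("grade", "generate")
graded.add_edge("generate", END)
app = graded.compile()

# Run the workflow end to end
result = app.invoke({"query": "GBS treatment outcomes", "documents": [], "answer": ""})
print(result["answer"])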

Putting it All Together: The Result

When a user asks: "What is the latest research on Sjogren's syndrome and neurological complications?", the system:

  1. Generates a dense vector (capturing the concept of autoimmune diseases).
  2. Generates a sparse vector (tagging "Sjogren's" and "neurological").
  3. Queries Qdrant and fuses the two candidate lists (e.g., via Reciprocal Rank Fusion).
  4. Ranks the top 5 papers and synthesizes a response using GPT-4o via the LangGraph workflow.
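
You can sanity-check the cross-lingual claim from the intro directly. BGE-M3's dense vectors come back L2-normalized, so a plain dot product is cosine similarity (this reuses the model loaded in Step 2):

vecs = model.encode(["hypertension", "高血压"], return_dense=True)['dense_vecs']
print(f"Cross-lingual similarity: {vecs[0] @ vecs[1]:.3f}")  # higher = closer in meaning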

The "Official" Way (Pro-Tip)

While this DIY setup is great for local experimentation, building a production-grade medical RAG system requires rigorous evaluation (RAGAS), HIPAA considerations, and document reranking.

For more production-ready examples and advanced semantic search patterns, head over to the WellAlly Tech Blog. It's a goldmine for developers looking to move from prototype to enterprise deployment, especially in the health-tech space.

Conclusion

By combining BGE-M3's multilingual capabilities with Qdrant's hybrid search, we've built a tool that bridges the gap between complex medical literature and everyday understanding. No more 3 AM panic—just data-driven insights.

What are you building with RAG? Let me know in the comments below! 👇
