wellallyTech
Build Your Own Longevity Researcher: RAG with Qdrant and 5,000+ PubMed Papers 🧬📚

We are currently living in a golden age of longevity science. Every day, hundreds of new papers on rapamycin, NAD+ precursors, and cellular senescence are uploaded to PubMed. But let's be real: unless you have a PhD and 48 hours in a day, keeping up is impossible.

In this tutorial, we are going to build a high-performance Retrieval-Augmented Generation (RAG) system designed specifically for medical literature. We'll use a Hybrid Search approach (combining semantic density with keyword precision) to handle complex medical terminology. By leveraging Qdrant, LangChain, and the powerful BGE-m3 embeddings, we can transform 5,000+ messy PDFs into a structured personal knowledge base. If you are looking for more production-ready insights on bioinformatics and AI, dive into the advanced patterns shared at the WellAlly Tech Blog.

Why Hybrid Search for Medical Data?

Standard vector search (dense embeddings) is great for "vibes" and general meaning. However, in medicine, a single letter matters (e.g., APOE3 vs APOE4). If your search engine treats them as the same "concept," your RAG system will give you dangerous advice.

Hybrid Search solves this by combining:

  1. Dense Retrieval (BGE-m3): Captures the semantic context (e.g., "life extension" ≈ "longevity").
  2. Sparse Retrieval (BM25): Ensures exact matches for specific medical codes, drug names, and gene identifiers.
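To see why exact-token matching matters, here is a toy lexical scorer (plain term overlap, not real BM25 — no IDF or length normalization) showing that APOE3 and APOE4 are entirely different tokens to a sparse index, even though a dense model may embed them almost identically:

```python
def sparse_overlap(query, doc):
    """Toy lexical score: fraction of query tokens appearing verbatim in the doc."""
    query_tokens = query.lower().split()
    doc_tokens = set(doc.lower().split())
    return sum(t in doc_tokens for t in query_tokens) / len(query_tokens)

doc = "carriers of the apoe4 allele show elevated alzheimer's risk"
print(sparse_overlap("apoe4 risk", doc))  # 1.0 — exact gene variant matched
print(sparse_overlap("apoe3 risk", doc))  # 0.5 — apoe3 != apoe4, only 'risk' hits
```

Real BM25 adds term-frequency and document-length weighting on top of this, but the core property — one character changes the score — is exactly what dense embeddings alone cannot guarantee.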

The Architecture 🏗️

Here is how the data flows from a raw PubMed PDF to a verified answer:

graph TD
    A[PubMed PDFs / XML] --> B[Unstructured.io Partitioning]
    B --> C[Text Chunking & Cleaning]
    C --> D{BGE-m3 Encoder}
    D -->|Dense Vector| E[Qdrant Collection]
    D -->|Sparse Vector| E[Qdrant Collection]
    F[User Query] --> G{Hybrid Search}
    G -->|Retrieve Top K| E
    E --> H[Contextual Prompt]
    H --> I[LLM Answer Generation]
    I --> J[Source Attribution]

Prerequisites

Before we jump into the code, ensure you have the following ready:

  • Qdrant: Running via Docker (docker run -p 6333:6333 qdrant/qdrant).
  • Tech Stack: langchain, qdrant-client, unstructured, and sentence-transformers.
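Assuming a standard Python environment, the stack can be installed like this (package names as published on PyPI; `FlagEmbedding` is the package that ships the BGE-m3 model wrapper used below, and the `[pdf]` extra pulls in unstructured's PDF dependencies):

```shell
# Vector database (REST API on port 6333)
docker run -p 6333:6333 qdrant/qdrant

# Python dependencies
pip install langchain qdrant-client "unstructured[pdf]" sentence-transformers FlagEmbedding
```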

Step 1: Parsing Medical PDFs with Unstructured.io

Medical papers are notorious for multi-column layouts and complex tables. Unstructured.io is the gold standard for turning these into clean text.

from unstructured.partition.pdf import partition_pdf

def process_medical_paper(file_path):
    # Extracts text while preserving hierarchy
    elements = partition_pdf(
        filename=file_path,
        strategy="hi_res",
        extract_images_in_pdf=False,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=1500,
        combine_text_under_n_chars=250,
    )
    return elements

# Example usage
# chunks = process_medical_paper("longevity_study_2024.pdf")

Step 2: Setting up Qdrant for Hybrid Search

We need a collection that supports both dense (1024-dim for BGE-m3) and sparse vectors.

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("http://localhost:6333")

collection_name = "longevity_papers"

# recreate_collection is deprecated in recent qdrant-client versions,
# so drop and create the collection explicitly
if client.collection_exists(collection_name):
    client.delete_collection(collection_name)

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1024,  # BGE-m3 dense embedding dimension
        distance=models.Distance.COSINE,
    ),
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=True,  # keep the sparse index on disk to save RAM
            )
        )
    },
)

Step 3: Generating Hybrid Embeddings with BGE-m3

The BGE-m3 model is a beast. It performs dense, sparse, and multi-vector encoding simultaneously. This is crucial for aligning medical jargon across different languages and terminologies.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

def get_hybrid_embeddings(text_list):
    # Returns a dict with 'dense_vecs' (1024-dim arrays) and
    # 'lexical_weights' (one {token_id: weight} dict per input text)
    embeddings = model.encode(
        text_list,
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=False
    )
    return embeddings

# Let's say we have our chunks from step 1
# outputs = get_hybrid_embeddings([c.text for c in chunks])
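Before we can query anything, the chunks have to be upserted into Qdrant. BGE-m3's sparse output (`lexical_weights`) is a dict mapping token-id strings to weights, while Qdrant expects parallel `indices`/`values` lists. Here is a minimal sketch of the conversion and the shape of a resulting point — the payload fields (`text`, `source`) are my own choice, not a fixed schema, and with `qdrant-client` you would wrap each dict in `models.PointStruct(**point)` before calling `client.upsert`:

```python
def to_sparse(lexical_weights):
    """Convert BGE-m3 lexical_weights ({token_id_str: weight}) into
    Qdrant's parallel indices/values sparse-vector representation."""
    return {
        "indices": [int(token_id) for token_id in lexical_weights],
        "values": [float(w) for w in lexical_weights.values()],
    }

def build_point(point_id, dense_vec, lexical_weights, text, source):
    """One Qdrant point carrying both the (unnamed) dense vector
    and the named sparse vector from Step 2's collection config."""
    return {
        "id": point_id,
        "vector": {
            "": list(dense_vec),                      # default unnamed dense vector
            "text-sparse": to_sparse(lexical_weights),
        },
        "payload": {"text": text, "source": source},
    }

# Usage with the outputs from Step 3 and the chunks from Step 1:
# points = [
#     models.PointStruct(**build_point(
#         i, outputs["dense_vecs"][i], outputs["lexical_weights"][i],
#         chunks[i].text, "longevity_study_2024.pdf"))
#     for i in range(len(chunks))
# ]
# client.upsert(collection_name=collection_name, points=points)
```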

Step 4: The RAG Retrieval Logic 🧠

When a user asks, "What is the dosage of Rapamycin for autophagy induction?", we don't just want a vector search. We want to ensure "Rapamycin" is explicitly indexed.

def hybrid_query(query_text, limit=5):
    out = model.encode(
        [query_text],
        return_dense=True,
        return_sparse=True
    )
    dense = out['dense_vecs'][0].tolist()
    lex = out['lexical_weights'][0]  # {token_id_str: weight}
    sparse = models.SparseVector(
        indices=[int(token_id) for token_id in lex],
        values=[float(w) for w in lex.values()],
    )

    # Retrieve candidates from both indexes, then fuse them with
    # Reciprocal Rank Fusion so neither signal dominates
    result = client.query_points(
        collection_name=collection_name,
        prefetch=[
            models.Prefetch(query=dense, limit=limit * 4),
            models.Prefetch(query=sparse, using="text-sparse", limit=limit * 4),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        query_filter=None,  # add filters like 'year > 2020' here
        limit=limit,
        with_payload=True,
    )
    return result.points

# This returns the most relevant snippets from our 5000+ papers!

Scaling Your Knowledge Base

Building a "toy" RAG is easy, but managing 5,000+ papers with citations requires a more robust pipeline. You'll need to handle rate limits, metadata filtering (e.g., sorting by Impact Factor), and recursive summarization.
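As one concrete example, metadata filtering in Qdrant is expressed as a condition over payload fields. The sketch below builds the filter as a plain JSON-style dict (the REST API shape; `qdrant-client` has equivalent `models.Filter`/`models.FieldCondition` classes). The `year` and `impact_factor` payload fields are hypothetical — they only exist if you stored them at upsert time:

```python
def recent_high_impact_filter(min_year=2020, min_impact=5.0):
    """Qdrant payload filter (REST JSON shape):
    year >= min_year AND impact_factor >= min_impact."""
    return {
        "must": [
            {"key": "year", "range": {"gte": min_year}},
            {"key": "impact_factor", "range": {"gte": min_impact}},
        ]
    }

# Pass the result as the query_filter argument in the Step 4 search call.
print(recent_high_impact_filter(2021, 10.0))
```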

For more advanced strategies on building production-grade RAG systems for the healthcare sector, I highly recommend reading the deep-dives at WellAlly Tech. They cover everything from vector quantization to fine-tuning LLMs for medical accuracy.

Conclusion 🥑

By combining Qdrant's Hybrid Search with BGE-m3 embeddings, we've built a system that doesn't just "guess" answers but actually finds the needle in the medical haystack. This personal longevity library can now provide evidence-based answers backed by thousands of PubMed articles.

Next Steps:

  1. Add a Reranker (like BGE-Reranker) to further refine the top 5 results.
  2. Plug the context into GPT-4o with a system prompt that enforces "Cite your sources."
  3. Go out and enjoy your newly optimized, science-backed healthy lifestyle!
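For step 1 above, a reranker slots in as a second pass over the retrieved hits. The sketch below keeps the scoring function injectable: in production you would plug in BGE-Reranker via `FlagEmbedding.FlagReranker('BAAI/bge-reranker-v2-m3')`, whose `compute_score` takes `[query, passage]` pairs, while the toy token-overlap scorer here is only for illustration:

```python
def rerank(query, passages, score_fn, top_n=5):
    """Re-order retrieved passages by a cross-encoder-style relevance
    score, highest first, keeping only the top_n."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_n]]

# Toy scorer (shared-token count); swap in a real cross-encoder in production
def toy_score(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

hits = ["rapamycin induces autophagy", "nad+ levels decline with age"]
print(rerank("rapamycin autophagy dosage", hits, toy_score, top_n=1))
# ['rapamycin induces autophagy']
```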

What are you building with RAG? Drop a comment below! 🚀💻
