wellallyTech

Biohack Your Health: Building a Science-Backed Supplement Advisor with Qdrant & PubMed 🧪

If you've ever spent hours scrolling through Reddit or fitness forums trying to figure out if NMN or Creatine actually works, you know the struggle. There is a massive gap between "bro-science" and peer-reviewed clinical data. In the world of Biohacking, information is power, but only if it's accurate.

Today, we are building a production-grade RAG architecture (Retrieval-Augmented Generation) to bridge that gap. We will use a Vector Database to store high-fidelity embeddings from PubMed, allowing us to perform Semantic Search across thousands of medical abstracts. By the end of this guide, you’ll have a local knowledge base that answers your supplement questions with real scientific citations. 🚀

The Architecture 🏗️

To build a reliable biohacking tool, we need a pipeline that handles data ingestion, embedding, and retrieval. Here is how the data flows from a PubMed research paper to your terminal:

graph TD
    A[PubMed Search Query] --> B[BeautifulSoup Scraper]
    B --> C[Text Chunking - LangChain]
    C --> D[Sentence Transformers - Embeddings]
    D --> E[(Qdrant Vector DB)]
    F[User Question] --> G[Query Embedding]
    G --> H{Similarity Search}
    E --> H
    H --> I[Context + Prompt]
    I --> J[LLM Response with Citations]

Prerequisites 🛠️

Make sure you have the following in your tech_stack:

  • Python 3.9+
  • Qdrant: Our high-performance vector database.
  • Sentence Transformers: For generating local embeddings.
  • LangChain: The glue for our RAG pipeline.
  • BeautifulSoup: For parsing PubMed's HTML.
pip install qdrant-client sentence-transformers beautifulsoup4 langchain langchain-community

Step 1: Scraping PubMed Research 📄

PubMed is the gold standard for medical research. While it has an official API (Entrez/E-utilities), sometimes we need to scrape specific metadata or handle dynamic queries. Here’s a robust snippet to get us started.

import time

import requests
from bs4 import BeautifulSoup

# Identify ourselves; anonymous scripted requests are more likely to be blocked
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; BiohackRAG/1.0)"}

def fetch_pubmed_abstracts(query, max_results=10):
    """Fetch abstracts for a PubMed search query by scraping the result pages."""
    # Let requests handle URL-encoding of the search term
    response = requests.get(
        "https://pubmed.ncbi.nlm.nih.gov/",
        params={"term": query},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Each search result title is an <a class="docsum-title"> linking to the article page
    links = [
        f"https://pubmed.ncbi.nlm.nih.gov{a['href']}"
        for a in soup.select("a.docsum-title", limit=max_results)
    ]

    abstracts = []
    for link in links:
        page = requests.get(link, headers=HEADERS, timeout=30)
        page_soup = BeautifulSoup(page.text, "html.parser")
        abstract_div = page_soup.find("div", id="eng-abstract")
        if abstract_div:
            abstracts.append({
                "source": link,
                "content": abstract_div.get_text().strip(),
            })
        time.sleep(1)  # be polite to NCBI's servers
    return abstracts

# Example: Fetching data for NMN
data = fetch_pubmed_abstracts("NMN supplement longevity", max_results=5)
print(f"Fetched {len(data)} abstracts!")
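The architecture diagram also includes a chunking stage before embedding. In the full pipeline you would reach for LangChain's `RecursiveCharacterTextSplitter`; as a minimal stand-in to show the idea, here's a plain-Python fixed-size chunker with overlap (the `chunk_size` and `overlap` values are illustrative, not tuned):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows.

    A simplified stand-in for LangChain's RecursiveCharacterTextSplitter:
    the overlap preserves context that would otherwise be cut in half
    at chunk boundaries.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Toy example: a long abstract split into overlapping windows
abstract = "NMN supplementation was associated with improved NAD+ levels. " * 20
chunks = chunk_text(abstract, chunk_size=200, overlap=40)
print(f"Split into {len(chunks)} chunks")
```

Overlap matters for abstracts because a dosage figure and its outcome often sit in adjacent sentences; without overlap, a chunk border can separate them.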

Step 2: Vectorizing the Evidence with Qdrant 🧠

Storing raw text isn't enough; we need to store the meaning of the text. This is where Qdrant shines. We’ll use Sentence Transformers to turn our abstracts into 384-dimensional vectors.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.schema import Document

# Initialize local Qdrant (or use :memory: for testing)
client = QdrantClient(path="./qdrant_db")

# Create a collection for our supplements
# (recreate_collection drops any existing collection with this name)
client.recreate_collection(
    collection_name="biohacking_science",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Initialize embeddings (all-MiniLM-L6-v2 outputs 384-dim vectors)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Prepare documents for LangChain
docs = [
    Document(page_content=item['content'], metadata={"source": item['source']})
    for item in data
]

# Upload to Qdrant
vectorstore = Qdrant(
    client=client,
    collection_name="biohacking_science",
    embeddings=embeddings,
)
vectorstore.add_documents(docs)
print("Vector database is ready! 🥑")
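Under the hood, Qdrant ranks results by cosine similarity (that's the `Distance.COSINE` we configured). To make the metric concrete, here's a toy illustration in plain Python — the 3-D vectors are made up for demonstration, while real embeddings from all-MiniLM-L6-v2 have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-D "embeddings" (purely illustrative values)
query = [0.9, 0.1, 0.2]
doc_creatine = [0.8, 0.2, 0.1]   # semantically close to the query
doc_unrelated = [0.0, 0.1, 0.9]  # semantically far from the query

print(cosine_similarity(query, doc_creatine))
print(cosine_similarity(query, doc_unrelated))
```

Because cosine similarity compares direction rather than magnitude, two abstracts about the same topic score high even if one is much longer than the other.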

Step 3: The RAG Implementation 🤖

Now, we can query our database. Instead of a keyword search, we’re doing a semantic search. If you ask about "muscle recovery," the system will find papers on "Creatine monohydrate" even if the word "recovery" isn't in the title.

from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI  # or use local models like Llama3

# Setup the retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Basic search functionality
query = "What are the benefits of NMN for mitochondrial health?"
found_docs = retriever.get_relevant_documents(query)

for i, doc in enumerate(found_docs):
    print(f"Source {i+1}: {doc.metadata['source']}")
    print(f"Snippet: {doc.page_content[:200]}...\n")

# Wire the retriever into a QA chain (requires OPENAI_API_KEY in your env)
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": query})
print(result["result"])

Going Beyond the Basics 🚀

While this script is a great start, production-ready biohacking tools require more advanced patterns—like hybrid search (combining keyword and vector search) and reranking to ensure the most clinically relevant papers appear first.
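As a rough sketch of the hybrid-search idea: blend a keyword-overlap score with the vector-similarity score and re-rank by the combined value. The `0.3` keyword weight below is an arbitrary illustration, not a tuned parameter (Qdrant also supports hybrid search natively via sparse vectors, which is the better production route):

```python
def keyword_score(query, text):
    """Fraction of query terms that appear in the document text."""
    q_terms = set(query.lower().split())
    doc_terms = set(text.lower().split())
    return len(q_terms & doc_terms) / len(q_terms)

def hybrid_score(query, text, vector_sim, kw_weight=0.3):
    """Weighted blend of keyword overlap and vector similarity.

    kw_weight is illustrative; tune it on real queries.
    """
    return kw_weight * keyword_score(query, text) + (1 - kw_weight) * vector_sim

# (text, cosine similarity from the vector store) -- toy values
docs = [
    ("Creatine monohydrate improves muscle recovery", 0.82),
    ("NMN raises NAD+ levels in older adults", 0.55),
]
query = "creatine muscle recovery"
ranked = sorted(docs, key=lambda d: hybrid_score(query, d[0], d[1]), reverse=True)
print(ranked[0][0])
```

Keyword overlap catches exact terms like compound names ("NMN", "creatine") that dense embeddings sometimes blur together, which is exactly why hybrid search helps in a medical corpus.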

💡 Developer Pro-Tip: For more production-ready examples and advanced patterns in AI-driven healthcare data engineering, I highly recommend checking out the engineering deep-dives at WellAlly Blog. They cover how to scale these architectures for real-world medical applications.

Conclusion

By moving away from static bookmarks and toward a Qdrant-powered RAG system, you've turned a chaotic library of PDFs and URLs into a queryable, intelligent research assistant. Biohacking is fundamentally a data engineering challenge—the more clean, evidence-based data you can retrieve, the better your decisions will be.

What's next?

  • Try adding a "Confidence Score" based on the vector distance.
  • Integrate a Cron job to auto-update your PubMed database every week.
  • Deploy this as a FastAPI endpoint to your mobile health dashboard.
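For the first idea, here's a minimal sketch: map the cosine similarity that Qdrant returns (with `Distance.COSINE`) onto a human-readable confidence bucket. The thresholds are arbitrary placeholders, not validated cut-offs — calibrate them against your own corpus:

```python
def confidence_label(cosine_sim):
    """Map a cosine similarity score to a rough confidence bucket.

    Thresholds are illustrative placeholders -- tune them on real data.
    """
    if cosine_sim >= 0.75:
        return "high"
    if cosine_sim >= 0.5:
        return "medium"
    return "low"

for score in (0.91, 0.62, 0.31):
    print(f"similarity={score:.2f} -> confidence={confidence_label(score)}")
```

Surfacing the label next to each citation lets users see at a glance whether an answer rests on closely matching papers or loose associations.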

Happy hacking! Stay scientific. 🧬💻


Did you find this tutorial helpful? Drop a comment below with which supplement you're researching next! 👇
