I Built a Biomedical RAG System, and a 40-Year-Old Algorithm Beat My Vector Database

#ai #machinelearning #datascience #python

A hands-on walkthrough of a retrieval-augmented QA pipeline over PubMed abstracts, the evaluation that kept me grounded, and why BM25 out-retrieved a FAISS vector index.

Everyone reaches for a vector database the moment they hear "RAG". I did too. Then I measured it against a lexical baseline from the 1980s, and the baseline won on every metric.

This post walks through a small retrieval-augmented question-answering system I built over biomedical literature, the evaluation that produced that result, and the two lessons that mattered more than any model choice. The full code is on GitHub: gbadedata/biomedical-rag-qa.

What we are building

The task: given a clinical or biological question, retrieve the passages that bear on it and produce a grounded yes / no / maybe answer, with citations, rather than letting a language model answer from memory.

The pipeline is five small modules, each usable and testable on its own.

Ingest from public APIs, build a passage corpus, retrieve, generate a grounded answer, and evaluate every stage against a baseline.

I will focus on the two stages that produced the interesting results: retrieval and answering.

The data

I used PubMedQA (Jin et al., 2019): 1,000 expert-labelled biomedical questions, each paired with the abstract it was written from, already split into labelled sections, plus a yes / no / maybe decision.

To turn this into a retrieval benchmark, I treat each abstract section as a passage (3,358 in total, about 60 tokens each) and define a question's gold passages as the sections from its own abstract. A retriever's job is then to rank a question's gold passages against the whole pool. Clean, reproducible, and it lets several retrievers compete on identical ground.
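Here is a minimal sketch of that construction. It assumes the labelled PubMedQA file is a dict keyed by PubMed ID whose records carry `QUESTION` and `CONTEXTS` fields (one string per abstract section); check that against the file you download. The `passage_id` and `text` fields match the retriever code later in this post, while `build_corpus` and the `pmid_index` id scheme are illustrative names, not necessarily what the repo uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str


def build_corpus(records):
    """Split each abstract into section-level passages and record gold ids.

    `records` is assumed to map a PubMed id to a dict with "QUESTION" and
    "CONTEXTS" (one string per labelled abstract section).
    """
    passages = []
    gold = {}  # question id -> set of passage ids from its own abstract
    for pmid, rec in records.items():
        ids = set()
        for i, section in enumerate(rec["CONTEXTS"]):
            pid = f"{pmid}_{i}"  # hypothetical id scheme
            passages.append(Passage(pid, section))
            ids.add(pid)
        gold[pmid] = ids
    return passages, gold


# Toy records in the assumed layout (made-up text, not real PubMedQA rows).
toy = {
    "111": {"QUESTION": "Does drug A reduce symptom B?",
            "CONTEXTS": ["Background on drug A.", "Drug A reduced symptom B."]},
    "222": {"QUESTION": "Is gene C linked to disease D?",
            "CONTEXTS": ["We studied gene C.", "Methods.", "No link was found."]},
}
passages, gold = build_corpus(toy)
print(len(passages), sorted(gold["222"]))  # 5 ['222_0', '222_1', '222_2']
```

Every retriever then searches the same `passages` pool and is scored against the same `gold` mapping.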

Retrieval: three approaches, one interface

The key design choice is that the retriever is a swappable component, not a hard-wired call. Everything sits behind one interface, so I can benchmark a lexical method, a dense vector method, and a random floor without touching the rest of the pipeline.
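One way to write that interface down is a `typing.Protocol` with the same two methods the dense retriever uses, `index` and `search`. The sketch below is an assumption about the shape, not the repo's exact code: the `Retriever` protocol, the `RandomRetriever` class, and the `top_ids` helper are illustrative names.

```python
import random
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str


class Retriever(Protocol):
    name: str

    def index(self, passages: list[Passage]) -> "Retriever": ...

    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]: ...


class RandomRetriever:
    """Floor baseline: ignores the query and returns k passages at random."""
    name = "random"

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)  # seeded so evaluation stays deterministic
        self._ids: list[str] = []

    def index(self, passages):
        self._ids = [p.passage_id for p in passages]
        return self

    def search(self, query, k=10):
        picked = self._rng.sample(self._ids, min(k, len(self._ids)))
        return [(pid, 0.0) for pid in picked]


def top_ids(retriever: Retriever, passages, query, k=3):
    """Callers only touch the interface, never a concrete retriever class."""
    return [pid for pid, _ in retriever.index(passages).search(query, k)]


corpus = [Passage(f"p{i}", f"text {i}") for i in range(5)]
print(top_ids(RandomRetriever(seed=0), corpus, "any question"))
```

Anything with a `name`, an `index` that returns the retriever, and a `search` that returns `(passage_id, score)` pairs can be dropped into the benchmark. The dense retriever below has the same shape.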

import faiss  # pip install faiss-cpu


class DenseRetriever:
    """TF-IDF -> truncated SVD (LSA) -> L2-normalise -> FAISS inner-product index."""
    name = "dense_lsa_faiss"

    # _embed (TF-IDF, SVD and L2 normalisation) is omitted from this excerpt.

    def index(self, passages):
        self._ids = [p.passage_id for p in passages]
        mat = self._embed([p.text for p in passages], fit=True)  # (n, d) float32
        # Inner product on L2-normalised vectors is cosine similarity.
        self._index = faiss.IndexFlatIP(mat.shape[1])
        self._index.add(mat)
        return self

    def search(self, query, k=10):
        q = self._embed([query], fit=False)
        scores, idx = self._index.search(q, k)
        # FAISS pads the result with -1 when fewer than k passages are indexed.
        return [(self._ids[i], float(s)) for i, s in zip(idx[0], scores[0]) if i != -1]

The dense retriever builds TF-IDF vectors, compresses them to 256 dimensions with truncated SVD (latent semantic analysis), normalises, and indexes them in FAISS. The lexical retriever is plain BM25. And there is a random retriever, because you always want to know the floor.
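For readers who have not seen BM25 spelled out, here is a self-contained sketch of the Okapi scoring function behind the same `index` / `search` interface. It is not the repo's implementation. The tokenizer, the `k1=1.5` and `b=0.75` defaults, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs, no stemming or stopwords.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Retriever:
    name = "bm25"

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b

    def index(self, passages):
        self._ids = [p.passage_id for p in passages]
        self._tfs = [Counter(tokenize(p.text)) for p in passages]
        self._lens = [sum(tf.values()) for tf in self._tfs]
        self._avgdl = sum(self._lens) / len(self._lens)
        n = len(self._ids)
        df = Counter(term for tf in self._tfs for term in tf)  # document frequency
        self._idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
        return self

    def search(self, query, k=10):
        terms = tokenize(query)
        scored = []
        for pid, tf, dl in zip(self._ids, self._tfs, self._lens):
            # Length normalisation: long passages need more matches to score well.
            norm = self.k1 * (1 - self.b + self.b * dl / self._avgdl)
            score = sum(
                self._idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms if t in tf
            )
            if score > 0:
                scored.append((pid, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


corpus = [
    Passage("p1", "Aspirin reduces the risk of myocardial infarction."),
    Passage("p2", "Statins lower cholesterol levels in adults."),
    Passage("p3", "The study enrolled adults from three hospitals."),
]
hits = BM25Retriever().index(corpus).search("Does aspirin reduce myocardial infarction risk?")
print([pid for pid, _ in hits])  # ['p1']
```

Note that "reduce" does not match "reduces" here because there is no stemming; the passage is found through the other shared terms.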

Here is what happened.

BM25 places a gold passage first for 94.3% of questions (MRR 0.959) and leads the dense LSA + FAISS index across the board. Random is the floor.

BM25 wins everywhere. Its MRR of 0.959 means the first relevant passage is almost always ranked first or second.
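For reference, the two headline numbers are computed per question and then averaged. The sketch below uses standard definitions of reciprocal rank and rank-1 hit rate; the function names and the toy rankings are illustrative, not taken from the repo.

```python
def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold passage, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def hit_rate_at_1(runs):
    """Share of questions whose top-ranked passage is a gold passage."""
    return sum(1 for ranked, gold in runs if ranked and ranked[0] in gold) / len(runs)


# Toy runs: (ranked passage ids, gold passage ids) per question.
runs = [
    (["a1", "x", "y"], {"a1", "a2"}),  # gold at rank 1
    (["x", "b1", "y"], {"b1"}),        # gold at rank 2
    (["c2", "c1", "x"], {"c1", "c2"}), # gold at rank 1
]
print(round(mean_reciprocal_rank(runs), 3), round(hit_rate_at_1(runs), 3))  # 0.833 0.667
```

MRR can never be lower than the rank-1 hit rate, because every rank-1 hit contributes a full 1.0 to the average.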

Why did the "simple" method win? Because PubMedQA questions are written from their source abstracts, so they share a lot of vocabulary with the passages that answer them, and lexical overlap is a very strong signal here. Compressing that into 256 LSA dimensions trades away precision the benchmark actually rewards.

There is a subtlety in recall@1 worth knowing. Each question has about 3.4 gold passages, and only one passage can sit at rank 1, so a question with n gold passages can score at most 1/n. Averaged over all questions, that caps mean recall@1 at 0.319. BM25 scores 0.300, which is 94% of the mathematical ceiling. Its top rank is almost always correct.
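The ceiling is the mean of 1/n over questions, not 1 divided by the mean n, which is why 0.319 sits a little above 1/3.4 ≈ 0.29. A short sketch makes the difference concrete. The gold counts below are made up for illustration and are not the PubMedQA distribution.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of a question's gold passages found in the top k."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)


def recall_at_1_ceiling(gold_counts):
    """Best possible mean recall@1: each question can recover at most 1 of its n golds."""
    return sum(1.0 / n for n in gold_counts) / len(gold_counts)


# Illustrative numbers of gold passages per question (not real data).
counts = [2, 3, 4, 5]
ceiling = recall_at_1_ceiling(counts)
naive = 1 / (sum(counts) / len(counts))
print(round(ceiling, 3), round(naive, 3))  # 0.321 0.286
```

Questions with few gold passages pull the ceiling up, so the average of 1/n is always at least as large as 1 over the average n.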

The lesson is not "BM25 is better than vectors". It is "measure it". On a different corpus, with a biomedical transformer embedder instead of LSA, the result could flip. But you only know by benchmarking against a baseline, and the swap is a one-line change to DenseRetriever._embed.

Does retrieval actually help the answer?

Good retrieval is worth nothing if it does not improve the answer. So I ran a diagnostic with four conditions: a simple decision classifier trained on three different feature sets, compared against a majority-class baseline.

Question only (no retrieval)
Retrieved context (BM25 top-3)
Gold context (perfect retrieval, the ceiling)
The majority baseline itself
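A rough sketch of how those four conditions can be set up, assuming the context is joined to the question by plain string concatenation before vectorising. The helper names `build_conditions` and `majority_label` are hypothetical, and the repo may assemble its inputs differently.

```python
from collections import Counter


def build_conditions(question, retrieved_texts, gold_texts, top_k=3):
    """Text inputs for the three classifier conditions (naive concatenation)."""
    return {
        "question_only": question,
        "retrieved_context": " ".join([question] + retrieved_texts[:top_k]),
        "gold_context": " ".join([question] + gold_texts),
    }


def majority_label(train_labels):
    """Fourth condition: always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


conds = build_conditions(
    "Does drug A help?",
    retrieved_texts=["r1", "r2", "r3", "r4"],  # ranked retriever output
    gold_texts=["g1", "g2"],                   # sections of the source abstract
)
print(conds["retrieved_context"])
print(majority_label(["yes", "no", "yes", "maybe"]))
```

The same classifier is trained once per text condition, so any difference in score comes from the context it was given rather than from the model.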

A linear reader lifts macro-F1 well above the baseline by learning the minority classes, but nothing beats the baseline's accuracy of 0.553.

This is where it got uncomfortable, and interesting. The classifier lifts macro-F1 from the baseline's 0.237 to about 0.41, but no condition beats the baseline's accuracy, and feeding it retrieved passages by naive concatenation actually hurt it. Break it down by class and the reason is clear:

Even with perfect context, a bag-of-words reader manages F1 0.67 on "yes" but only 0.20 on the ambiguous "maybe" class.

PubMedQA is deliberately built to require reasoning over evidence, and a bag-of-words linear model cannot reason. It handles the easy majority class and falls apart on the ambiguous one.
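The baseline's macro-F1 of 0.237 follows directly from its accuracy of 0.553. A classifier that always says "yes" gets precision 0.553 and recall 1.0 on "yes", and F1 of zero on the other two classes. The sketch below checks that arithmetic with a hand-rolled macro-F1. The 553-of-1,000 "yes" count is chosen to match the reported accuracy, and the split of the rest between "no" and "maybe" is made up and does not affect the result.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class that is never predicted scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative label counts: only the 55.3% "yes" share comes from the post.
y_true = ["yes"] * 553 + ["no"] * 337 + ["maybe"] * 110
y_pred = ["yes"] * len(y_true)  # majority-class baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, ["yes", "no", "maybe"])
print(round(accuracy, 3), round(score, 3))  # 0.553 0.237
```

This is why accuracy alone flatters the baseline: it is right more than half the time while learning nothing about two of the three classes.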

This is the second, bigger lesson: retrieval quality is necessary but not sufficient. The value of RAG shows up only with a reader capable of reasoning over the retrieved evidence. Which is exactly why the answer step in the pipeline is an LLM, not a classifier.

Grounded generation

The generation step is thin on purpose, and strict about grounding. The model must answer only from the numbered passages, cite them, and return machine-checkable JSON.

SYSTEM_PROMPT = (
    "You are a careful biomedical research assistant. Answer the question using ONLY "
    "the numbered passages provided. Decide yes, no, or maybe. Use 'maybe' when the "
    "passages are mixed or insufficient. Do not use outside knowledge. Reply as strict "
    'JSON with keys "decision", "justification", and "supporting_passages".'
)

That contract is what makes the output auditable: you can check that the cited passages exist, and later that they actually support the justification. It also stops the model quietly answering from training memory, which is the whole point of RAG in a domain where a wrong answer matters.
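A minimal sketch of the "check that the cited passages exist" step. It assumes passages are shown to the model numbered from 1 and that `supporting_passages` comes back as a list of those integers; the prompt above fixes the key names but not these details. `format_passages` and `validate_reply` are hypothetical helpers.

```python
import json

ALLOWED_DECISIONS = {"yes", "no", "maybe"}
REQUIRED_KEYS = {"decision", "justification", "supporting_passages"}


def format_passages(passages):
    """Number passages from 1 so the model can cite them (assumed format)."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))


def validate_reply(raw, n_passages):
    """Parse the model's reply and reject anything that breaks the contract."""
    reply = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if reply["decision"] not in ALLOWED_DECISIONS:
        raise ValueError(f"invalid decision: {reply['decision']!r}")
    cited = reply["supporting_passages"]
    if not isinstance(cited, list) or not all(
        isinstance(i, int) and 1 <= i <= n_passages for i in cited
    ):
        raise ValueError(f"citations must be passage numbers in 1..{n_passages}")
    return reply


passages = ["Background on drug A.", "Drug A reduced symptom B.", "Methods."]
print(format_passages(passages))

good = '{"decision": "yes", "justification": "Passage 2 reports a reduction.", "supporting_passages": [2]}'
print(validate_reply(good, len(passages))["decision"])  # yes
```

This only proves the citations point at real passages. Checking that they support the justification is a separate and harder evaluation.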

Engineering choices that paid off

A few decisions that are easy to skip and worth keeping:

Baselines everywhere. A random-retrieval floor and a majority-class floor. The random floor is how you catch a silent indexing bug; the majority floor is how you avoid celebrating a model that only learned the class balance.
One retriever interface. Swapping BM25 for a dense model, or LSA for transformer embeddings, is a local change. The index and search loop do not move.
Two API styles behind one schema. Ingestion pulls from Europe PMC (REST, cursor pagination) and ClinicalTrials.gov v2 (REST, token pagination), plus Open Targets (GraphQL), all normalised to one passage schema with retry and backoff.
Metrics that do not need the LLM. The reported numbers are retrieval metrics and a linear diagnostic, so anyone can reproduce them without an API key. The generation step is real and runnable, but I did not claim an accuracy number I could not reproduce cheaply.
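As an illustration of the retry-and-backoff idea from the ingestion item above, here is a small standard-library sketch. It is an assumption about the general pattern, not the repo's code: `with_retry`, its parameters, and the `flaky_fetch` stand-in are all hypothetical, and no real API is called.

```python
import time


def with_retry(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(), retrying network-style errors with exponential backoff.

    Delays are base_delay * 2**attempt: 0.5s, 1s, 2s with the defaults.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:  # covers ConnectionError, timeouts, urllib's URLError
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice before succeeding.
state = {"calls": 0}


def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network failure")
    return {"results": ["record-1"]}


delays = []  # record the delays instead of actually sleeping
page = with_retry(flaky_fetch, sleep=delays.append)
print(page, delays)  # {'results': ['record-1']} [0.5, 1.0]
```

Passing `sleep` in as a parameter keeps the backoff testable without slowing down the test suite.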

Run it yourself

git clone https://github.com/gbadedata/biomedical-rag-qa
cd biomedical-rag-qa
pip install -r requirements.txt
python scripts/fetch_data.py
python -m biomedqa.cli eval --data data/ori_pqal.json     # reproduces the numbers above

The evaluation is deterministic, tests run in CI across Python 3.10 to 3.12, and the whole thing is MIT licensed.

Takeaways

Two things I will carry into the next RAG project. First, benchmark retrieval against a lexical baseline before you assume a vector database is buying you anything, because sometimes it is not. Second, retrieval and generation are separate problems: strong retrieval with a weak reader still fails, so measure them independently and put the reasoning where it belongs.

Code, tests and full results: github.com/gbadedata/biomedical-rag-qa. Questions and critique welcome.

If you want the classical-ML counterpart, I ran a similar teardown on 215,000 patient drug reviews, sentiment classification plus complaint mining, with the same emphasis on baselines and reporting the results that do not help: github.com/gbadedata/drug-review-nlp.