When I set out to evaluate retrieval configurations for Portuguese clinical text, I expected one method to dominate. Instead, I found something more interesting: BM25 and dense retrieval solve different questions. Neither is a substitute for the other.
This post summarizes the methodology and results from a 500-query empirical study of hybrid retrieval for clinical question answering. All code is open source: https://github.com/nomad-link-id/hybrid-rag-pipeline
The Setup
500 clinical queries across 6 medical specialties (cardiology, endocrinology, infectology, nephrology, neurology, oncology). Each query has a single reference answer grounded in a specific passage from clinical documentation.
Four retrieval configurations were evaluated:
| Config | Method |
|---|---|
| BM25-only | BM25 with Portuguese stopword removal |
| Dense-only | BioBERTpt embeddings, cosine similarity |
| Hybrid-RRF | BM25 + dense via Reciprocal Rank Fusion |
| Hybrid-Rerank | RRF candidates re-ranked with cross-encoder |
What Is Reciprocal Rank Fusion?
RRF combines ranked lists from multiple retrievers without requiring score normalization:
```python
def rrf_score(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) from every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))
```
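For intuition, here is a toy fusion of two rankings (the document IDs are made up):

```python
# Toy example: doc_7 appears near the top of both lists, so it wins after fusion.
bm25_ranking = ["doc_12", "doc_7", "doc_3"]
dense_ranking = ["doc_7", "doc_9", "doc_12"]
fused = rrf_score([bm25_ranking, dense_ranking])
print(list(fused)[:3])  # ['doc_7', 'doc_12', 'doc_9']
```

The constant k = 60 damps the influence of any single top rank, which is why RRF works without normalizing the raw BM25 and cosine scores against each other.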
Results
| Config | Recall@5 | MRR | Citation F1 |
|---|---|---|---|
| BM25-only | 0.71 | 0.64 | 0.82 |
| Dense-only | 0.68 | 0.61 | 0.78 |
| Hybrid-RRF | 0.84 | 0.77 | 0.91 |
| Hybrid-Rerank | 0.86 | 0.79 | 0.93 |
The Complementarity Finding
McNemar's test on BM25-only versus dense-only:
- BM25 correct, dense incorrect: 89 queries
- Dense correct, BM25 incorrect: 57 queries
- McNemar chi2 = 39.55, p < 0.001
The asymmetry is statistically significant. Dense-only missed 22.2% of queries that BM25 solved. You need both.
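If you want to run the same test on your own paired per-query outcomes, here is a minimal sketch using statsmodels (an assumed dependency, not necessarily the study's script). The off-diagonal counts are the discordant pairs from above; the diagonal values are placeholders, since only the discordant cells enter the statistic:

```python
# Sketch: McNemar's test on a 2x2 table of paired per-query outcomes.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: BM25 correct / incorrect; columns: dense correct / incorrect.
# 89 and 57 are from the post; the diagonal counts are placeholders.
table = [[266, 89],
         [57, 88]]
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)
```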
Citation Verification
Deterministic approach (BM25 score threshold + exact n-gram overlap): 461/500 citations verified.
Prompt-based LLM approach (same passages, asking the LLM "does this support the answer?"): 1/500 verified.
The difference is task design, not model quality. A deterministic check measures actual textual overlap; a prompt check measures the model's opinion of the overlap.
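As a sketch of the deterministic side, something like the following captures the idea; the threshold and n-gram size here are illustrative assumptions, not the study's exact parameters:

```python
# Illustrative deterministic citation check: a citation counts as verified when
# the cited passage clears a BM25 relevance threshold AND shares at least one
# exact n-gram with the generated answer. Parameter values are assumptions.
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def citation_verified(answer: str, passage: str, bm25_score: float,
                      min_bm25: float = 2.0, n: int = 5) -> bool:
    return bm25_score >= min_bm25 and bool(ngrams(answer, n) & ngrams(passage, n))
```

Every step is reproducible: two people running this check on the same answer and passage get the same verdict, which is not true of a prompted judgment.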
Inter-Annotator Agreement
Two reviewers independently annotated 100 query-response pairs. Cohen's kappa = 0.954, near-perfect agreement on what constitutes correct retrieval for clinical text.
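If you want to replicate the agreement analysis on your own annotations, Cohen's kappa is a one-liner with scikit-learn (the labels below are invented):

```python
# Toy agreement check: 1 = retrieval judged correct, 0 = incorrect.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 1, 0, 1, 1, 1, 1, 0]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 would be perfect agreement
```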
Practical Takeaway
- Run both BM25 and dense retrieval
- Use RRF to merge results (see the end-to-end sketch after this list)
- Implement deterministic citation verification
- Measure complementarity with McNemar's test on your domain
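Putting the first three items together, here is a minimal end-to-end sketch. It assumes the rank_bm25 and sentence-transformers packages and reuses the rrf_score function defined earlier; the embedding model and corpus are placeholders (the study used BioBERTpt for Portuguese clinical text):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["passage one ...", "passage two ..."]      # your clinical passages
doc_ids = [f"doc_{i}" for i in range(len(corpus))]

bm25 = BM25Okapi([doc.split() for doc in corpus])    # lexical index
model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder embedding model
doc_emb = model.encode(corpus, convert_to_tensor=True)

def hybrid_search(query: str, top_k: int = 5) -> list[str]:
    # Rank all passages lexically and densely, then fuse the two lists with RRF.
    lex = bm25.get_scores(query.split())
    sim = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
    bm25_rank = [doc_ids[i] for i in sorted(range(len(corpus)), key=lambda i: -lex[i])]
    dense_rank = [doc_ids[i] for i in sorted(range(len(corpus)), key=lambda i: -float(sim[i]))]
    return list(rrf_score([bm25_rank, dense_rank]))[:top_k]
```

An optional cross-encoder re-ranking pass over the fused candidates is what separates Hybrid-RRF from Hybrid-Rerank in the tables above.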
Code and Data
- https://github.com/nomad-link-id/hybrid-rag-pipeline
- https://github.com/nomad-link-id/citation-guard
Preprint: https://doi.org/10.5281/zenodo.19686739
Igor Eduardo | igoreduardo.com | ORCID: 0009-0005-6288-1135