Two Retrieval Methods Are Better Than One: Evidence from 500 Clinical Queries

Igor Eduardo — Wed, 13 May 2026 19:14:30 +0000

When I set out to evaluate retrieval configurations for Portuguese clinical text, I expected one method to dominate. Instead, I found something more interesting: BM25 and dense retrieval succeed on different queries, so neither is a substitute for the other.

This post summarizes the methodology and results from a 500-query empirical study of hybrid retrieval for clinical question answering. All code is open source: https://github.com/nomad-link-id/hybrid-rag-pipeline

The Setup

The evaluation set contains 500 clinical queries across six medical specialties (cardiology, endocrinology, infectious diseases, nephrology, neurology, oncology). Each query has a single reference answer grounded in a specific passage from clinical documentation.

Four retrieval configurations were evaluated:

Config	Method
BM25-only	BM25 with Portuguese stopword removal
Dense-only	BioBERTpt embeddings, cosine similarity
Hybrid-RRF	BM25 + dense via Reciprocal Rank Fusion
Hybrid-Rerank	RRF candidates re-ranked with cross-encoder
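
The last configuration adds a second stage on top of fusion. As a rough sketch of that shape, assume the cross-encoder is exposed as a function that scores a (query, passage) pair. The names `rerank`, `score_pair`, and `top_n` are hypothetical and not taken from the study's repository, and the toy scorer simply counts shared words in place of a real model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    passages: dict[str, str],
    score_pair: Callable[[str, str], float],
    top_n: int = 20,
) -> list[str]:
    """Re-score the first top_n fused candidates and return them best first.

    candidates: doc IDs in fused (e.g. RRF) order.
    passages:   doc ID -> passage text.
    score_pair: stand-in for a cross-encoder; higher means more relevant.
    """
    head, tail = candidates[:top_n], candidates[top_n:]
    scored = [(score_pair(query, passages[doc_id]), doc_id) for doc_id in head]
    # sorted() is stable, so ties keep their fused order.
    reranked = [doc_id for _, doc_id in sorted(scored, key=lambda s: s[0], reverse=True)]
    # Candidates beyond top_n are kept in their original order.
    return reranked + tail


def toy_overlap_score(query: str, passage: str) -> float:
    """Toy scorer (assumption): number of distinct query words found in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

Re-scoring only the head of the fused list keeps the expensive model call bounded, which is the usual reason for putting a cross-encoder after a cheaper first stage.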

What Is Reciprocal Rank Fusion?

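RRF combines ranked lists from multiple retrievers without requiring score normalization. Each document's fused score is the sum of 1 / (k + rank) over every list in which it appears, so only rank positions matter, not the raw BM25 or cosine scores: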

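def rrf_score(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists of doc IDs; return doc_id -> RRF score, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Dicts preserve insertion order, so iteration runs from highest to lowest score.
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))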
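
To make the arithmetic concrete, here is a small usage sketch. The document IDs and the two top-3 lists are invented for illustration and are not taken from the study.

```python
def rrf_score(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    # Same function as above, repeated so this snippet runs on its own.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))


# Hypothetical top-3 lists from the two retrievers.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d4", "d2", "d1"]

fused = rrf_score([bm25_ranking, dense_ranking])
fused_order = list(fused)

# Per-document sums with k = 60:
#   d1: 1/61 + 1/63    d2: 1/62 + 1/62    d4: 1/61    d3: 1/63
print(fused_order)
```

Documents retrieved by both methods (d1, d2) end up ahead of documents retrieved by only one (d4, d3), even though d4 was the dense retriever's top hit.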

Results

Config	Recall@5	MRR	Citation F1
BM25-only	0.71	0.64	0.82
Dense-only	0.68	0.61	0.78
Hybrid-RRF	0.84	0.77	0.91
Hybrid-Rerank	0.86	0.79	0.93
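
For readers less familiar with the first two metrics, the following is a minimal sketch of how Recall@5 and MRR are commonly computed when each query has exactly one reference passage, as in this study. The function names and input layout are assumptions; the post does not describe the evaluation code, and Citation F1 is not covered here.

```python
def recall_at_k(ranked: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold passage appears in the top k results."""
    hits = sum(1 for docs, g in zip(ranked, gold) if g in docs[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold passage; a query contributes 0 if it is missing."""
    total = 0.0
    for docs, g in zip(ranked, gold):
        if g in docs:
            total += 1.0 / (docs.index(g) + 1)
    return total / len(gold)
```

Recall@5 only asks whether the right passage made the cut, while MRR also rewards placing it near the top, which is why the two columns can move differently.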

The Complementarity Finding

McNemar's test on BM25-only versus dense-only:

BM25 correct, dense incorrect: 89 queries
Dense correct, BM25 incorrect: 57 queries
McNemar chi2 = 39.55, p < 0.001

The asymmetry is statistically significant. Dense-only missed 22.2% of queries that BM25 solved. You need both.
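
McNemar's test looks only at the discordant pairs, the queries that exactly one of the two systems got right. Below is a minimal sketch of the textbook statistic, with an optional continuity correction. The function name is illustrative, and the post does not say which variant the study used.

```python
import math


def mcnemar(b: int, c: int, correction: bool = False) -> tuple[float, float]:
    """McNemar chi-square (1 degree of freedom) and p-value from discordant counts.

    b: queries the first system got right and the second got wrong.
    c: queries the second system got right and the first got wrong.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs")
    diff = abs(b - c) - (1 if correction else 0)
    chi2 = max(diff, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Applied to the two discordant counts listed above (89 and 57), the uncorrected formula gives a statistic of about 7.0 rather than 39.55, so the reported value presumably comes from a different contingency table or test variant; the preprint is the place to confirm the exact computation.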

Citation Verification

Deterministic approach (BM25 score threshold + exact n-gram overlap): 461/500 citations verified.

Prompt-based LLM approach (same passages, with the LLM asked "does this support the answer?"): 1/500 citations verified.

The difference is task design, not model quality. A deterministic check measures actual textual overlap; a prompt check measures the model's opinion of the overlap.
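
As an illustration of the n-gram half of the deterministic check, the sketch below tests whether enough of a claim's word trigrams occur verbatim in the cited passage. The names `ngrams` and `citation_supported`, the trigram size, and the 0.5 threshold are all assumptions chosen for illustration, not the study's settings, and the BM25 score threshold is left out.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of text, as a set of tuples."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def citation_supported(claim: str, passage: str, n: int = 3, min_overlap: float = 0.5) -> bool:
    """True if at least min_overlap of the claim's n-grams occur verbatim in the passage."""
    claim_grams = ngrams(claim, n)
    if not claim_grams:
        # Claims shorter than n words cannot be checked this way.
        return False
    overlap = len(claim_grams & ngrams(passage, n)) / len(claim_grams)
    return overlap >= min_overlap
```

Because the result depends only on the two strings, the same inputs always produce the same verdict, which makes the check auditable in a way a prompted judgment is not.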

Inter-Annotator Agreement

Two reviewers independently annotated 100 query-response pairs. Cohen's kappa was 0.954, indicating near-perfect agreement on what constitutes correct retrieval for clinical text.
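
Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch of the standard two-annotator formula follows; the function name and label format are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

With heavily imbalanced labels, raw percent agreement can look high while kappa stays modest, so reporting kappa is the more conservative choice.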

Practical Takeaway

Run both BM25 and dense retrieval
Use RRF to merge results
Implement deterministic citation verification
Measure complementarity with McNemar's test on your domain

Code and Data

Code: https://github.com/nomad-link-id/hybrid-rag-pipeline

Preprint: https://doi.org/10.5281/zenodo.19686739

Igor Eduardo | igoreduardo.com | ORCID: 0009-0005-6288-1135

DEV Community: Igor Eduardo