Igor Eduardo

Two Retrieval Methods Are Better Than One: Evidence from 500 Clinical Queries

When I set out to evaluate retrieval configurations for Portuguese clinical text, I expected one method to dominate. Instead, I found something more interesting: BM25 and dense retrieval solve different questions. Neither is a substitute for the other.

This post summarizes the methodology and results from a 500-query empirical study of hybrid retrieval for clinical question answering. All code is open source: https://github.com/nomad-link-id/hybrid-rag-pipeline

The Setup

The dataset contains 500 clinical queries across six medical specialties (cardiology, endocrinology, infectology, nephrology, neurology, oncology). Each query has a single reference answer grounded in a specific passage from clinical documentation.

Four retrieval configurations were evaluated:

Config          Method
BM25-only       BM25 with Portuguese stopword removal
Dense-only      BioBERTpt embeddings, cosine similarity
Hybrid-RRF      BM25 + dense via Reciprocal Rank Fusion
Hybrid-Rerank   RRF candidates re-ranked with a cross-encoder
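For the dense-only configuration, passages are scored by cosine similarity between query and passage embeddings. A minimal sketch with toy vectors — the study uses BioBERTpt embeddings, which the stand-in arrays below do not reproduce:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k passages most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q
    return list(np.argsort(-sims)[:k])

# Toy 3-dimensional stand-in embeddings (real ones come from BioBERTpt).
passages = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(cosine_top_k(query, passages, k=2))  # passages 0 and 1 rank highest
```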

What Is Reciprocal Rank Fusion?

RRF combines ranked lists from multiple retrievers without requiring score normalization:

def rrf_score(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse multiple ranked lists of doc IDs into one RRF-scored mapping."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Each document earns 1/(k + rank) from every list it appears in.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))
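For intuition, a document sitting at rank 1 in one list and rank 3 in the other contributes two reciprocal-rank terms (with the default k = 60):

```python
k = 60
# 1/(k + 1) from the first list, 1/(k + 3) from the second.
score = 1.0 / (k + 1) + 1.0 / (k + 3)
print(round(score, 5))  # 0.03227
```

The constant k damps the influence of any single top rank, so a document that appears reasonably high in both lists can outscore one that is rank 1 in only one list.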

Results

Config          Recall@5   MRR    Citation F1
BM25-only       0.71       0.64   0.82
Dense-only      0.68       0.61   0.78
Hybrid-RRF      0.84       0.77   0.91
Hybrid-Rerank   0.86       0.79   0.93

The Complementarity Finding

McNemar's test on BM25-only versus dense-only:

  • BM25 correct, dense incorrect: 89 queries
  • Dense correct, BM25 incorrect: 57 queries
  • McNemar χ² = 39.55, p < 0.001

The asymmetry is statistically significant. Dense-only missed 22.2% of queries that BM25 solved. You need both.
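The discordant counts come straight from per-query correctness flags. A minimal sketch of the uncorrected McNemar statistic (b − c)² / (b + c) on toy flags — the real evaluation uses the study's 500 per-query outcomes, and variants with continuity or exact corrections exist:

```python
def mcnemar_chi2(bm25_correct: list[bool], dense_correct: list[bool]) -> tuple[int, int, float]:
    """Count discordant pairs and compute the uncorrected McNemar statistic."""
    b = sum(x and not y for x, y in zip(bm25_correct, dense_correct))  # BM25 right, dense wrong
    c = sum(y and not x for x, y in zip(bm25_correct, dense_correct))  # dense right, BM25 wrong
    chi2 = (b - c) ** 2 / (b + c) if (b + c) else 0.0
    return b, c, chi2

# Toy correctness flags for 6 queries.
bm25 = [True, True, True, False, True, False]
dense = [True, False, True, True, False, False]
print(mcnemar_chi2(bm25, dense))
```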

Citation Verification

Deterministic approach (BM25 score threshold plus exact n-gram overlap): 461/500 citations verified.

Prompt-based LLM approach (same passages, asking the model "does this support the answer?"): 1/500 verified.

The difference is task design, not model quality. A deterministic check measures actual textual overlap; a prompt check measures the model's opinion of the overlap.
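A minimal sketch of the n-gram-overlap half of the deterministic check. The tokenizer, n, and threshold below are my assumptions for illustration; the repo's actual check also applies a BM25 score threshold:

```python
import re

def ngram_overlap(answer: str, passage: str, n: int = 3) -> float:
    """Fraction of answer n-grams that appear verbatim in the passage."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        tokens = re.findall(r"\w+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    answer_grams = ngrams(answer)
    if not answer_grams:
        return 0.0
    return len(answer_grams & ngrams(passage)) / len(answer_grams)

def citation_verified(answer: str, passage: str, threshold: float = 0.5) -> bool:
    return ngram_overlap(answer, passage) >= threshold

passage = "metformin is first line therapy for type 2 diabetes in most patients"
answer = "metformin is first line therapy"
print(citation_verified(answer, passage))  # True: every answer trigram occurs in the passage
```

Because the check is a pure function of the two strings, it is reproducible and auditable — there is no model opinion to drift between runs.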

Inter-Annotator Agreement

100 query-response pairs independently annotated by two reviewers. Cohen's kappa = 0.954 — near-perfect agreement on what constitutes correct retrieval for clinical text.
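Cohen's kappa corrects raw agreement for the agreement expected by chance. A self-contained sketch for binary labels — the annotations below are toy data, not the study's 100 pairs:

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two annotators assigning binary labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n            # each annotator's P(label = 1)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # chance agreement
    return (po - pe) / (1 - pe)

rev_a = [1, 1, 1, 0, 0, 1, 0, 1]
rev_b = [1, 1, 0, 0, 0, 1, 0, 1]
print(cohens_kappa(rev_a, rev_b))  # 0.75
```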

Practical Takeaway

  1. Run both BM25 and dense retrieval
  2. Use RRF to merge results
  3. Implement deterministic citation verification
  4. Measure complementarity with McNemar's test on your domain
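Steps 1–2 can be sketched as a single function. `bm25_rank` and `dense_rank` are stand-ins for whatever retrievers you run; the stubs below just return fixed rankings so the snippet executes:

```python
from typing import Callable

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str,
                  bm25_rank: Callable[[str], list[str]],
                  dense_rank: Callable[[str], list[str]],
                  top_k: int = 5) -> list[str]:
    """Run both retrievers on the query and fuse their rankings."""
    return rrf_merge([bm25_rank(query), dense_rank(query)])[:top_k]

# Stub retrievers returning fixed rankings, for illustration only.
bm25_stub = lambda q: ["doc_a", "doc_b", "doc_c"]
dense_stub = lambda q: ["doc_b", "doc_d", "doc_a"]
print(hybrid_search("metformin dosing", bm25_stub, dense_stub, top_k=3))
# ['doc_b', 'doc_a', 'doc_d'] — doc_b ranks high in both lists, so it fuses to the top
```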

Code and Data

Code: https://github.com/nomad-link-id/hybrid-rag-pipeline

Preprint: https://doi.org/10.5281/zenodo.19686739

Igor Eduardo | igoreduardo.com | ORCID: 0009-0005-6288-1135
