When I set out to evaluate retrieval configurations for Portuguese clinical text, I expected one method to dominate. Instead, I found something more interesting: BM25 and dense retrieval solve different questions. Neither is a substitute for the other.
This post summarizes the methodology and results from a 500-query empirical study of hybrid retrieval for clinical question answering. All code is open source: https://github.com/nomad-link-id/hybrid-rag-pipeline
The Setup
500 clinical queries across 6 medical specialties (cardiology, endocrinology, infectology, nephrology, neurology, oncology). Each query has a single reference answer grounded in a specific passage from clinical documentation.
Four retrieval configurations were evaluated:
| Config | Method |
|---|---|
| BM25-only | BM25 with Portuguese stopword removal |
| Dense-only | BioBERTpt embeddings, cosine similarity |
| Hybrid-RRF | BM25 + dense via Reciprocal Rank Fusion |
| Hybrid-Rerank | RRF candidates re-ranked with cross-encoder |
What Is Reciprocal Rank Fusion?
RRF combines ranked lists from multiple retrievers without requiring score normalization:
```python
def rrf_score(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) from every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))
```
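For intuition, here is a toy fusion of two rankings (the document IDs are made up):

```python
# Toy example: doc_7 appears near the top of both lists, so it wins after fusion.
bm25_ranking = ["doc_12", "doc_7", "doc_3"]
dense_ranking = ["doc_7", "doc_9", "doc_12"]
fused = rrf_score([bm25_ranking, dense_ranking])
print(list(fused)[:3])  # ['doc_7', 'doc_12', 'doc_9']
```

The constant k = 60 damps the influence of any single top rank, which is why RRF works without normalizing the raw BM25 and cosine scores against each other.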
Results
| Config | Recall@5 | MRR | Citation F1 |
|---|---|---|---|
| BM25-only | 0.71 | 0.64 | 0.82 |
| Dense-only | 0.68 | 0.61 | 0.78 |
| Hybrid-RRF | 0.84 | 0.77 | 0.91 |
| Hybrid-Rerank | 0.86 | 0.79 | 0.93 |
The Complementarity Finding
McNemar's test on BM25-only versus dense-only:
- BM25 correct, dense incorrect: 89 queries
- Dense correct, BM25 incorrect: 57 queries
- McNemar chi2 = 39.55, p < 0.001
The asymmetry is statistically significant. Dense-only missed 22.2% of queries that BM25 solved. You need both.
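If you want to run the same test on your own paired per-query outcomes, here is a minimal sketch using statsmodels (an assumed dependency, not necessarily the study's script). The off-diagonal counts are the discordant pairs from above; the diagonal values are placeholders, since only the discordant cells enter the statistic:

```python
# Sketch: McNemar's test on a 2x2 table of paired per-query outcomes.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: BM25 correct / incorrect; columns: dense correct / incorrect.
# 89 and 57 are from the post; the diagonal counts are placeholders.
table = [[266, 89],
         [57, 88]]
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)
```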
Citation Verification
Deterministic approach (BM25 score threshold + exact n-gram overlap): 461/500 citations verified.
Prompt-based LLM approach (same passages, asking the LLM "does this support the answer?"): 1/500 verified.
The difference is task design, not model quality. A deterministic check measures actual textual overlap; a prompt check measures the model's opinion of the overlap.
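As a sketch of the deterministic side, something like the following captures the idea; the threshold and n-gram size here are illustrative assumptions, not the study's exact parameters:

```python
# Illustrative deterministic citation check: a citation counts as verified when
# the cited passage clears a BM25 relevance threshold AND shares at least one
# exact n-gram with the generated answer. Parameter values are assumptions.
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def citation_verified(answer: str, passage: str, bm25_score: float,
                      min_bm25: float = 2.0, n: int = 5) -> bool:
    return bm25_score >= min_bm25 and bool(ngrams(answer, n) & ngrams(passage, n))
```

Every step is reproducible: two people running this check on the same answer and passage get the same verdict, which is not true of a prompted judgment.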
Inter-Annotator Agreement
Two reviewers independently annotated 100 query-response pairs. Cohen's kappa = 0.954, near-perfect agreement on what constitutes correct retrieval for clinical text.
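If you want to replicate the agreement analysis on your own annotations, Cohen's kappa is a one-liner with scikit-learn (the labels below are invented):

```python
# Toy agreement check: 1 = retrieval judged correct, 0 = incorrect.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 1, 0, 1, 1, 1, 1, 0]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 would be perfect agreement
```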
Practical Takeaway
- Run both BM25 and dense retrieval
- Use RRF to merge results (see the end-to-end sketch after this list)
- Implement deterministic citation verification
- Measure complementarity with McNemar's test on your domain
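Putting the first three items together, here is a minimal end-to-end sketch. It assumes the rank_bm25 and sentence-transformers packages and reuses the rrf_score function defined earlier; the embedding model and corpus are placeholders (the study used BioBERTpt for Portuguese clinical text):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["passage one ...", "passage two ..."]      # your clinical passages
doc_ids = [f"doc_{i}" for i in range(len(corpus))]

bm25 = BM25Okapi([doc.split() for doc in corpus])    # lexical index
model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder embedding model
doc_emb = model.encode(corpus, convert_to_tensor=True)

def hybrid_search(query: str, top_k: int = 5) -> list[str]:
    # Rank all passages lexically and densely, then fuse the two lists with RRF.
    lex = bm25.get_scores(query.split())
    sim = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
    bm25_rank = [doc_ids[i] for i in sorted(range(len(corpus)), key=lambda i: -lex[i])]
    dense_rank = [doc_ids[i] for i in sorted(range(len(corpus)), key=lambda i: -float(sim[i]))]
    return list(rrf_score([bm25_rank, dense_rank]))[:top_k]
```

An optional cross-encoder re-ranking pass over the fused candidates is what separates Hybrid-RRF from Hybrid-Rerank in the tables above.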
Code and Data
- https://github.com/nomad-link-id/hybrid-rag-pipeline
- https://github.com/nomad-link-id/citation-guard
Preprint: https://doi.org/10.5281/zenodo.19686739
Igor Eduardo | igoreduardo.com | ORCID: 0009-0005-6288-1135