This post was originally published on mcpproxy.app/blog.
TL;DR
In our earlier post, we made the case for BM25 as the right default for MCP tool discovery -- and for small-to-medium tool sets, that case still holds. But new benchmarks from StackOne, Stacklok, and the RAG-MCP paper paint a more nuanced picture: BM25 alone delivers just 14% top-1 accuracy on StackOne's 270-tool benchmark, while Stacklok's hybrid of BM25 and semantic search reaches 94% selection accuracy across 2,792 tools. This post lays out what the data actually shows, why BM25 degrades at scale, and how MCPProxy is evolving toward hybrid search while keeping the zero-dependency simplicity that makes it useful.
The Benchmarks Are In
Three independent evaluations have landed in the last few months, and they tell a consistent story.
StackOne's benchmark tested 270 tools across 11 API categories with 2,700 natural-language queries:
| Method | Top-1 Accuracy | Top-5 Accuracy | Latency |
|---|---|---|---|
| BM25 only | 14% | 87% | <1ms |
| TF-IDF/BM25 hybrid | 21% | 90% | <1ms |
| Embedding search | 38% | 85% | 50-200ms |
| Reranker | 40%+ | 90%+ | 200-500ms |
Stacklok's MCP Optimizer ran a head-to-head comparison against Anthropic's built-in Tool Search across 2,792 tools. Their hybrid semantic+BM25 approach achieved 94% selection accuracy versus 34% for BM25-only.
The RAG-MCP paper confirmed that agents given every tool upfront achieve just 13.6% accuracy, while retrieval-first routing more than triples it to 43.1%.
Why BM25 Breaks Down at Scale
Common verbs saturate the index. When you have 2,000+ tools, verbs like "create," "list," "get" appear in hundreds of tool names. BM25's IDF component loses discriminating power.
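To make the saturation effect concrete, here is a rough sketch using the Lucene-style BM25 IDF formula. The corpus size and document frequencies are made-up numbers for illustration, not measurements from any of the benchmarks above, and Bleve's exact scoring may differ.

```python
import math

def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index of 2,000 tools; the counts are illustrative only.
NUM_TOOLS = 2000
doc_freqs = {"create": 400, "list": 350, "get": 500, "pagerduty": 6}

idf = {term: bm25_idf(NUM_TOOLS, df) for term, df in doc_freqs.items()}

# Print terms from least to most discriminating.
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} df={doc_freqs[term]:4d} idf={weight:.2f}")
```

Under these assumed counts, the rare service name carries several times the weight of any common verb, so a query made mostly of verbs gives BM25 very little to rank on.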
Short documents amplify the problem. Tool descriptions are uniformly short (10-50 words), collapsing a dimension BM25 normally uses for discrimination.
Semantic intent gets lost. "notify the team about a deployment" might need Slack, PagerDuty, or email. BM25 cannot bridge the gap between "notify" and "send_message."
None of this invalidates BM25 for smaller deployments. The 87% top-5 accuracy shows that BM25 places the right tool somewhere in its top five results for roughly seven out of eight queries; what it struggles with is ranking that tool first.
What Hybrid Search Actually Looks Like
Step 1: Parallel Retrieval
The query runs simultaneously through two paths:
- BM25 path: Keyword search against the Bleve index. Sub-millisecond, zero dependencies.
- Semantic path: Query embedded via lightweight model, compared against pre-computed tool embeddings.
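The two paths can be sketched as a pair of functions run concurrently. Everything here is a toy stand-in: `keyword_path` counts shared words instead of querying Bleve, `semantic_path` uses a hard-coded synonym table instead of an embedding model, and the tool names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool catalogue: name -> description.
TOOLS = {
    "slack_send_message": "send a message to a slack channel",
    "github_create_issue": "create an issue in a github repository",
    "pagerduty_trigger_incident": "trigger an incident and page the on-call engineer",
}
STOPWORDS = {"a", "an", "the", "to", "in", "and", "about"}

def keyword_path(query: str) -> list[str]:
    """Toy stand-in for the BM25 path: rank tools by shared-word count."""
    words = set(query.lower().split()) - STOPWORDS
    overlap = {name: len(words & set(desc.split())) for name, desc in TOOLS.items()}
    return sorted((n for n in TOOLS if overlap[n] > 0), key=lambda n: -overlap[n])

def semantic_path(query: str) -> list[str]:
    """Toy stand-in for the embedding path: a hard-coded synonym table."""
    synonyms = {"notify": ["slack_send_message", "pagerduty_trigger_incident"]}
    hits: list[str] = []
    for word in query.lower().split():
        hits.extend(t for t in synonyms.get(word, []) if t not in hits)
    return hits

def retrieve(query: str) -> tuple[list[str], list[str]]:
    """Run both paths concurrently and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        keyword_future = pool.submit(keyword_path, query)
        semantic_future = pool.submit(semantic_path, query)
        return keyword_future.result(), semantic_future.result()

keyword_hits, semantic_hits = retrieve("notify the team about a deployment")
```

For this query the keyword path returns nothing, because no content word appears in any description, while the stand-in semantic path still surfaces the messaging and paging tools.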
Step 2: Reciprocal Rank Fusion
The two ranked lists merge using RRF:
RRF_score(tool) = 1/(k + rank_bm25) + 1/(k + rank_semantic)
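Here k is a smoothing constant (60 is a common choice) and each rank is the tool's 1-based position in the corresponding list. RRF is score-agnostic -- it works on rank positions, not raw scores. This sidesteps the problem of normalizing BM25 scores and embedding similarities onto a common scale.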
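A minimal sketch of the fusion step follows. It assumes k = 60 and treats a tool that is missing from one list as contributing nothing for that list; the rankings and tool names are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every list in which a tool appears (rank is 1-based)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs of the two retrieval paths, best match first.
bm25_ranking = ["github_create_issue", "slack_send_message", "jira_create_ticket"]
semantic_ranking = ["slack_send_message", "pagerduty_trigger_incident", "github_create_issue"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
for tool, score in fused:
    print(f"{score:.4f}  {tool}")
```

In this example `slack_send_message` comes out on top: it is second in one list and first in the other, which beats a tool that is first in one list and third in the other.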
Why This Works So Well
BM25 excels at exact term matching. Embeddings excel at semantic bridging. RRF rewards tools that rank highly in both lists, so agreement between the two signals pushes a tool to the top. Stacklok's 94% versus 34% for BM25-only on 2,792 tools is strong evidence that the combination is substantially better at scale.
Where BM25 Still Wins
Small-to-medium tool sets (under 100 tools). BM25 held 87% top-5 accuracy even at 270 tools in StackOne's benchmark, with zero dependencies and sub-millisecond latency.
Air-gapped environments. No network calls required.
Determinism and debuggability. BM25 scoring is fully transparent and inspectable.
Cold start speed. Indexes built instantly from tool metadata.
MCPProxy's Roadmap: Hybrid Without Compromise
Phase 1: Smarter BM25 (Now)
- Field-weighted scoring (tool names > descriptions)
- Verb deweighting for common actions
- Query expansion for abbreviations
- Server-context boosting
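As a rough illustration of the first three ideas, the sketch below scores tools by weighted term overlap. It is not BM25 and not MCPProxy's implementation: the weights, the verb list, and the abbreviation table are placeholder assumptions, and server-context boosting is left out.

```python
NAME_WEIGHT = 3.0          # hypothetical boost for matches in the tool name
DESCRIPTION_WEIGHT = 1.0   # baseline weight for matches in the description
COMMON_VERB_WEIGHT = 0.2   # hypothetical down-weight for ubiquitous verbs
COMMON_VERBS = {"create", "list", "get", "update", "delete"}
ABBREVIATIONS = {"gh": "github", "pr": "pull request", "k8s": "kubernetes"}

def expand_query(query: str) -> list[str]:
    """Replace known abbreviations with their expanded terms."""
    terms: list[str] = []
    for word in query.lower().split():
        terms.extend(ABBREVIATIONS.get(word, word).split())
    return terms

def score(query: str, name: str, description: str) -> float:
    """Weighted term overlap: name matches count more, common verbs count less."""
    name_terms = set(name.lower().split("_"))
    description_terms = set(description.lower().split())
    total = 0.0
    for term in expand_query(query):
        term_weight = COMMON_VERB_WEIGHT if term in COMMON_VERBS else 1.0
        if term in name_terms:
            total += NAME_WEIGHT * term_weight
        if term in description_terms:
            total += DESCRIPTION_WEIGHT * term_weight
    return total

# Hypothetical tools that share the verb "create".
tools = {
    "github_create_issue": "create an issue in a github repository",
    "jira_create_ticket": "create a ticket in a jira project",
}
scores = {name: score("create gh issue", name, desc) for name, desc in tools.items()}
```

With the verb down-weighted, the shared "create" contributes little to either tool, and the expanded "github" plus "issue" decide the ranking.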
Phase 2: Optional Embedding Layer
- Local embedding models (~80MB, single-digit ms)
- Pre-computed embeddings stored alongside Bleve index
- RRF fusion
- Graceful degradation to BM25-only
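One way graceful degradation could look, as a sketch rather than MCPProxy's actual design: the semantic ranker is optional, and any failure in it falls back to the BM25 result. `make_search` and the fake rankers are hypothetical names introduced for this example.

```python
from typing import Callable, Optional

Ranker = Callable[[str], list[str]]

def make_search(bm25: Ranker, semantic: Optional[Ranker] = None, k: int = 60) -> Ranker:
    """Build a search function that uses RRF when a semantic ranker is available."""
    def search(query: str) -> list[str]:
        keyword_hits = bm25(query)
        if semantic is None:
            return keyword_hits              # embedding layer not configured
        try:
            semantic_hits = semantic(query)
        except Exception:
            return keyword_hits              # embedding layer failed: BM25 only
        scores: dict[str, float] = {}
        for ranking in (keyword_hits, semantic_hits):
            for rank, tool in enumerate(ranking, start=1):
                scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)
    return search

# Stand-in rankers with fixed outputs, for illustration only.
def fake_bm25(query: str) -> list[str]:
    return ["github_create_issue", "slack_send_message"]

def fake_semantic(query: str) -> list[str]:
    return ["slack_send_message"]

def broken_semantic(query: str) -> list[str]:
    raise RuntimeError("embedding model not loaded")

bm25_only = make_search(fake_bm25)("notify the team")
hybrid = make_search(fake_bm25, fake_semantic)("notify the team")
degraded = make_search(fake_bm25, broken_semantic)("notify the team")
```

The caller sees the same interface in all three cases; only the ranking quality changes when the embedding layer is missing or broken.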
Phase 3: Hierarchical Discovery
- Server-level grouping as first-level filter
- Progressive disclosure (mirrors Claude Code's pattern)
- Dynamic tool sets by annotation or usage
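The first bullet, server-level grouping as a first-level filter, can be sketched as a two-stage lookup. The server catalogue, the word-overlap scorer, and the `discover` function are all hypothetical; a real implementation would presumably use the BM25 or hybrid ranker at both stages.

```python
# Hypothetical catalogue: server -> description and tools.
SERVERS = {
    "github": {
        "description": "repositories issues pull requests code review",
        "tools": {
            "github_create_issue": "create an issue in a repository",
            "github_list_pull_requests": "list open pull requests",
        },
    },
    "slack": {
        "description": "chat channels messages notifications",
        "tools": {
            "slack_send_message": "send a message to a channel",
            "slack_list_channels": "list channels in the workspace",
        },
    },
}

def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def discover(query: str, max_servers: int = 1, max_tools: int = 3) -> list[str]:
    """Stage 1 picks the best-matching servers; stage 2 ranks only their tools."""
    ranked_servers = sorted(
        SERVERS, key=lambda s: overlap(query, SERVERS[s]["description"]), reverse=True
    )
    candidates: dict[str, str] = {}
    for server in ranked_servers[:max_servers]:
        candidates.update(SERVERS[server]["tools"])
    ranked_tools = sorted(candidates, key=lambda t: overlap(query, candidates[t]), reverse=True)
    return ranked_tools[:max_tools]

result = discover("list open pull requests")
```

Because the Slack server is filtered out in stage 1, `slack_list_channels` never competes on the shared verb "list".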
The Guiding Principle
Every phase maintains MCPProxy's core contract: it ships as a single binary with zero required external dependencies.
What This Means for You
| Your Scale | Recommended Approach | Expected Top-1 Accuracy |
|---|---|---|
| 10-50 tools | BM25 (MCPProxy default) | ~80-85% |
| 50-200 tools | BM25 with field weighting | ~60-70% |
| 200-500 tools | Hybrid BM25 + embedding | ~85-90% |
| 500+ tools | Hybrid + hierarchical discovery | ~90-94% |
The earlier BM25 post was not wrong -- it was incomplete. BM25 is the right starting point. But the data is clear that BM25 alone does not scale to the hundreds-of-tools future. MCPProxy is evolving toward hybrid search because the constraints are changing -- and we would rather share that data honestly than pretend a single algorithm solves everything forever.
MCPProxy is open source at github.com/smart-mcp-proxy/mcpproxy-go.