Algis

Posted on Mar 15 • Originally published at mcpproxy.app

Beyond BM25: The Future of MCP Tool Discovery

#mcp #ai #search #opensource

TL;DR

In our earlier post, we made the case for BM25 as the right default for MCP tool discovery -- and for small-to-medium tool sets, that case still holds. But new benchmarks from StackOne, Stacklok, and the RAG-MCP paper paint a more nuanced picture: in StackOne's 270-tool benchmark, BM25 alone delivered just 14% top-1 accuracy, while Stacklok's hybrid of BM25 and semantic search reached 94% selection accuracy across 2,792 tools. This post lays out what the data actually shows, why BM25 degrades at scale, and how MCPProxy is evolving toward hybrid search while keeping the zero-dependency simplicity that makes it useful.

The Benchmarks Are In

Three independent evaluations have landed in the last few months, and they tell a consistent story.

StackOne's benchmark tested 270 tools across 11 API categories with 2,700 natural-language queries:

Method	Top-1 Accuracy	Top-5 Accuracy	Latency
BM25 only	14%	87%	<1ms
TF-IDF/BM25 hybrid	21%	90%	<1ms
Embedding search	38%	85%	50-200ms
Reranker	40%+	90%+	200-500ms

Stacklok's MCP Optimizer ran a head-to-head comparison against Anthropic's built-in Tool Search across 2,792 tools. Their hybrid semantic+BM25 approach achieved 94% selection accuracy versus 34% for BM25-only.

The RAG-MCP paper found that agents given every tool upfront achieve just 13.6% tool-selection accuracy, while retrieval-first routing more than triples it to 43.1%.

Why BM25 Breaks Down at Scale

Common verbs saturate the index. When you have 2,000+ tools, verbs like "create," "list," and "get" appear in hundreds of tool names. BM25's IDF component assigns such frequent terms a low weight, so they stop helping to tell one tool from another.
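To make the IDF effect concrete, here is a minimal sketch using the Lucene-style BM25 IDF formula. The catalog size and document frequencies are hypothetical numbers chosen for illustration, not measurements from any of the benchmarks above.

```python
import math

def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

TOTAL_TOOLS = 2000  # hypothetical catalog size

# Hypothetical document frequencies: how many tool entries contain each term.
doc_freq = {"create": 400, "list": 350, "pagerduty": 3}

idf = {term: bm25_idf(TOTAL_TOOLS, n) for term, n in doc_freq.items()}
for term, weight in idf.items():
    print(f"{term:10s} idf={weight:.2f}")
```

Under these assumed counts, "create" gets an IDF of about 1.6 while "pagerduty" gets about 6.3, so a query that leans on the verb contributes little to ranking.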

Short documents amplify the problem. Tool descriptions are uniformly short (10-50 words), so BM25's document-length normalization, which normally helps separate focused documents from long, diffuse ones, has little variation to work with.

Semantic intent gets lost. "notify the team about a deployment" might need Slack, PagerDuty, or email. BM25 matches terms, not meaning, so it cannot bridge the gap between "notify" and "send_message" when the two share no tokens.

None of this invalidates BM25 for smaller deployments. The 87% top-5 accuracy in StackOne's benchmark shows that BM25 places the right tool somewhere in the top five results in nearly nine out of ten queries.

What Hybrid Search Actually Looks Like

Step 1: Parallel Retrieval

The query runs simultaneously through two paths:

BM25 path: Keyword search against the Bleve index. Sub-millisecond, zero dependencies.
Semantic path: The query is embedded with a lightweight model and compared against pre-computed tool embeddings.
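The two paths do not depend on each other, so they can run concurrently. The sketch below shows the shape of that step only: `keyword_path` and `semantic_path` are hypothetical stand-ins that return fixed rankings, not MCPProxy functions.

```python
from concurrent.futures import ThreadPoolExecutor

def keyword_path(query: str) -> list[str]:
    # Stand-in for the BM25 lookup; returns tool names ordered best-first.
    return ["github_create_issue", "pagerduty_create_incident"]

def semantic_path(query: str) -> list[str]:
    # Stand-in for embedding similarity search; returns tool names best-first.
    return ["slack_send_message", "pagerduty_create_incident"]

def retrieve_in_parallel(query: str) -> tuple[list[str], list[str]]:
    """Run both retrievers at once and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        keyword_future = pool.submit(keyword_path, query)
        semantic_future = pool.submit(semantic_path, query)
        return keyword_future.result(), semantic_future.result()

bm25_hits, semantic_hits = retrieve_in_parallel("notify the team about a deployment")
```

Total latency is then bounded by the slower path rather than the sum of both.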

Step 2: Reciprocal Rank Fusion

The two ranked lists merge using RRF:

RRF_score(tool) = 1/(k + rank_bm25) + 1/(k + rank_semantic)

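Here rank_bm25 and rank_semantic are the tool's 1-based positions in each list, and k is a smoothing constant (60 in the original RRF paper). RRF is score-agnostic -- it works on rank positions, not raw scores. This sidesteps the problem of normalizing BM25 scores and embedding similarities onto a common scale.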
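A minimal sketch of the fusion step follows. It assumes k = 60 and treats a tool that is missing from one list as contributing nothing from that list; the tool names and rankings are made up for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge best-first ranked lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):  # ranks are 1-based
            scores[tool] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical rankings from the two retrieval paths.
bm25_ranking = ["github_create_issue", "pagerduty_create_incident", "slack_send_message"]
semantic_ranking = ["slack_send_message", "pagerduty_create_incident", "email_send"]

fused = rrf_fuse([bm25_ranking, semantic_ranking])
for tool, score in fused:
    print(f"{score:.5f}  {tool}")
```

Tools that appear in both lists rise above tools that top only one list, which is the behavior the formula is designed to produce.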

Why This Works So Well

BM25 excels at exact term matching. Embeddings excel at semantic bridging. RRF ranks a tool highest when both signals agree on it. Stacklok's 94% versus 34% for BM25-only on 2,792 tools suggests the combination is substantially better at scale.

Where BM25 Still Wins

Small-to-medium tool sets (under 100 tools). BM25 reached 87% top-5 accuracy even in StackOne's 270-tool benchmark, with zero dependencies and sub-millisecond latency.

Air-gapped environments. No network calls required.

Determinism and debuggability. BM25 scoring is fully transparent and inspectable.

Cold start speed. Indexes built instantly from tool metadata.

MCPProxy's Roadmap: Hybrid Without Compromise

Phase 1: Smarter BM25 (Now)

Field-weighted scoring (tool names > descriptions)
Verb deweighting for common actions
Query expansion for abbreviations
Server-context boosting
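Two of the Phase 1 ideas, field weighting and verb deweighting, can be sketched with a simplified term-overlap scorer. This is not BM25 and not MCPProxy's implementation: the weights, the verb list, and the tool catalog are all hypothetical values picked to show the effect.

```python
NAME_WEIGHT = 3.0         # hypothetical boost for matches in the tool name
DESCRIPTION_WEIGHT = 1.0  # hypothetical weight for matches in the description
COMMON_VERBS = {"create", "list", "get", "update", "delete"}  # assumed verb list
VERB_PENALTY = 0.3        # hypothetical down-weight for common action verbs

def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace and underscores."""
    return text.lower().replace("_", " ").split()

def score(query: str, name: str, description: str) -> float:
    """Sum per-term weights, favoring name matches and discounting common verbs."""
    name_terms = set(tokenize(name))
    description_terms = set(tokenize(description))
    total = 0.0
    for term in set(tokenize(query)):
        term_weight = VERB_PENALTY if term in COMMON_VERBS else 1.0
        if term in name_terms:
            total += NAME_WEIGHT * term_weight
        if term in description_terms:
            total += DESCRIPTION_WEIGHT * term_weight
    return total

# Hypothetical two-tool catalog: name -> description.
tools = {
    "github_create_issue": "Create a new issue in a GitHub repository",
    "jira_create_ticket": "Create a ticket in a Jira project",
}
query = "create github issue"
ranked = sorted(tools, key=lambda name: score(query, name, tools[name]), reverse=True)
```

Both tools match "create", but that match is discounted, so the ranking is decided by the more distinctive terms "github" and "issue".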

Phase 2: Optional Embedding Layer

Local embedding models (~80MB, single-digit ms)
Pre-computed embeddings stored alongside Bleve index
RRF fusion
Graceful degradation to BM25-only
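Graceful degradation can be sketched as a search function where the semantic ranker is optional. Everything here is an assumption for illustration: the function names, the choice to treat any exception from the embedding layer as "unavailable", and the stub rankers.

```python
from typing import Callable, Optional

Ranker = Callable[[str], list[str]]  # takes a query, returns tool names best-first

def hybrid_search(query: str, keyword_ranker: Ranker,
                  semantic_ranker: Optional[Ranker] = None,
                  k: int = 60, limit: int = 5) -> list[str]:
    """Fuse keyword and semantic rankings with RRF; use keyword results alone if needed."""
    rankings = [keyword_ranker(query)]
    if semantic_ranker is not None:
        try:
            rankings.append(semantic_ranker(query))
        except Exception:
            pass  # embedding layer unavailable: continue with keyword results only
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda tool: (-scores[tool], tool))[:limit]

def keyword_stub(query: str) -> list[str]:
    return ["a_tool", "b_tool"]  # stand-in for the BM25 path

def semantic_stub(query: str) -> list[str]:
    return ["b_tool", "c_tool"]  # stand-in for the embedding path

def broken_semantic(query: str) -> list[str]:
    raise RuntimeError("embedding model not loaded")  # simulated failure

with_embeddings = hybrid_search("q", keyword_stub, semantic_stub)
without_embeddings = hybrid_search("q", keyword_stub)
after_failure = hybrid_search("q", keyword_stub, broken_semantic)
```

With a single ranking, RRF preserves the keyword order, so the fallback behaves exactly like BM25-only search.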

Phase 3: Hierarchical Discovery

Server-level grouping as first-level filter
Progressive disclosure (mirrors Claude Code's pattern)
Dynamic tool sets by annotation or usage
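Server-level grouping as a first-level filter might look like the two-stage lookup below. This is a sketch under stated assumptions: the server catalog, the descriptions, and the plain term-overlap scoring are hypothetical, and a real system would use BM25 or hybrid scoring at each stage.

```python
def overlap(query: str, text: str) -> int:
    """Count distinct query terms that also appear in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().replace("_", " ").split())
    return len(query_terms & text_terms)

# Hypothetical catalog: server -> description and tools (name -> description).
SERVERS = {
    "slack": {
        "description": "team chat messages channels",
        "tools": {"slack_send_message": "send a message to a channel"},
    },
    "github": {
        "description": "code repositories issues pull requests",
        "tools": {
            "github_create_issue": "create an issue in a repository",
            "github_list_repos": "list repositories for a user",
        },
    },
}

def discover(query: str, top_servers: int = 1, limit: int = 3) -> list[str]:
    """Stage 1: pick the best-matching servers. Stage 2: rank only their tools."""
    ranked_servers = sorted(
        SERVERS, key=lambda s: (-overlap(query, SERVERS[s]["description"]), s)
    )
    candidates: dict[str, str] = {}
    for server in ranked_servers[:top_servers]:
        candidates.update(SERVERS[server]["tools"])
    return sorted(
        candidates, key=lambda t: (-overlap(query, t + " " + candidates[t]), t)
    )[:limit]

results = discover("create an issue in the repositories")
```

The second stage only ever scores tools from the selected servers, which keeps the candidate set small as the total tool count grows.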

The Guiding Principle

Every phase maintains MCPProxy's core contract: it ships as a single binary with zero required external dependencies.

What This Means for You

Your Scale	Recommended Approach	Expected Top-1 Accuracy
10-50 tools	BM25 (MCPProxy default)	~80-85%
50-200 tools	BM25 with field weighting	~60-70%
200-500 tools	Hybrid BM25 + embedding	~85-90%
500+ tools	Hybrid + hierarchical discovery	~90-94%

The earlier BM25 post was not wrong -- it was incomplete. BM25 is the right starting point. But the data is clear that BM25 alone does not scale to the hundreds-of-tools future. MCPProxy is evolving toward hybrid search because the constraints are changing -- and we would rather share that data honestly than pretend a single algorithm solves everything forever.

MCPProxy is open source at github.com/smart-mcp-proxy/mcpproxy-go. Originally published at mcpproxy.app/blog.