Your agent can't find the tool it needs

#ai #machinelearning #bm25 #agents

I reproduced the BM25 mode of Anthropic's Tool Search Tool. At its documented limit of 10,000 tools, a correct tool missed the search's top 5 more than 40% of the time.

Tool overload has an official fix now. It works. You no longer begin every session, and every turn after, with your whole tool set loaded into the model's context. Anthropic's Tool Search Tool keeps the catalogue out of context and hands the model only the handful of tools a task needs, pulled by a search. Enable all the plugins and MCP servers you want; the search sorts it out. The token usage number drops.

There is a catch, though. The model can no longer reach a tool directly. When it needs one it writes a search query, the search returns the top 3 to 5 matches, and the model can only call those. It can search again with different words, but it can only call a tool that a search has actually returned. The high-level flow looks like this:
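The gate can be sketched as a small simulation. This is a toy, not Anthropic's implementation: `ToolGate`, its word-overlap ranking, and the three example tools are hypothetical stand-ins chosen only to show that a tool stays uncallable until some search returns it.

```python
import re

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class ToolGate:
    """Toy model of the gate: a tool is callable only after a search returned it."""

    def __init__(self, catalogue, k=5):
        self.catalogue = catalogue      # {tool_name: description}
        self.k = k
        self.unlocked = set()           # tools some search has returned so far

    def search(self, query):
        # Stand-in ranking by shared words; the real tool uses BM25 or regex.
        q = words(query)
        scored = [(len(q & words(name + " " + desc)), name)
                  for name, desc in self.catalogue.items()]
        ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
        top = [name for score, name in ranked if score > 0][: self.k]
        self.unlocked.update(top)
        return top

    def can_call(self, name):
        return name in self.unlocked

gate = ToolGate({
    "gmail_search_messages": "search email messages in gmail",
    "drive_search_files": "search files in google drive",
    "weather_get_current_conditions": "current conditions for a city",
}, k=2)

gate.search("search my email")
print(gate.can_call("gmail_search_messages"))            # True
print(gate.can_call("weather_get_current_conditions"))   # False
```

A second search with different words unlocks more tools, but nothing outside the returned sets ever becomes callable.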

Picking the right tool used to be the model's job, with the whole list in front of it. Now the list sits behind a search, and the model only sees what the search returns. I started noticing the model miss tool calls it used to get right.

Anthropic's numbers tell a different story: with the search on, accuracy rose from 79.5% to 88.1% on Opus 4.5 in its MCP tests. That measures a different gate than the one I was concerned about. Theirs is end-to-end accuracy, whether the agent picked the right tool and finished the task, at a catalogue size the post does not give. Mine is the step before it: whether the right tool even reaches the model, as the catalogue grows toward the 10,000 tools the docs allow. Neither number refutes the other, and that earlier step was the one I could not find measured.

So I went through the documentation and rebuilt the search: the same BM25 algorithm (one of the two modes the tool offers, next to a regex one), indexing the four fields the docs list (name, description, argument names, argument descriptions). The docs cap the tool at 10,000 tools, so that is where I set the catalogue, filled with real tools drawn from a public benchmark (ToolRet; I used its 37,292-tool web category, the largest of the benchmark's three). Then I ran 1,100 real queries through it and checked whether the right tool came back.

It made the top 5 57% of the time. Top 1, 36%.

A query counts as a hit if any of the tools ToolRet marks correct for it, about 2.39 on average, reaches the top 5. That makes 57% the generous reading: demand the one tool you meant and recall only falls.
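To make the two readings concrete, here is a minimal sketch of the generous rule next to the stricter one. The function names and the toy ranking are illustrative, not taken from ToolRet or from the original script.

```python
def hit_at_k(ranked, gold, k=5):
    """Generous reading: any tool the benchmark marks correct is in the first k."""
    return any(tool in gold for tool in ranked[:k])

def strict_hit_at_k(ranked, intended, k=5):
    """Stricter reading: the one tool you meant must be in the first k."""
    return intended in ranked[:k]

ranked = ["news_v2", "weather_now", "news_v1", "stocks", "maps", "news_v3"]
gold = {"news_v1", "news_v3"}          # the benchmark accepts either

print(hit_at_k(ranked, gold))                 # True: news_v1 sits at rank 3
print(strict_hit_at_k(ranked, "news_v3"))     # False: news_v3 sits at rank 6
```

The same ranking scores as a hit under one rule and a miss under the other, which is why the any-gold figure is the upper reading.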

That 57% is a ceiling, and everything the agent does sits under it. At the catalogue size Anthropic supports, the right tool is missing from what the model sees more than 40% of the time, no matter how good the model is. That is your retrieval ceiling, and almost no agent dashboard shows it. The figure is per search: an agent can try again with new words and recover some of it, but every retry runs against the same ceiling.

Adding tools lowers the ceiling

Connect more tools and the ceiling drops. I ran the same search at growing catalogue sizes, filling each with random, unrelated tools. At 5 tools the right one was in the top 5 every time. At 100, 93%. At 1,000, 80%. At 5,000, 64%. At 10,000, the most Anthropic's tool supports, 57%. I also pushed past that cap, loading the full 37,000-tool corpus I had, and recall sat at 44%. The decline never reverses or shows any sign of bouncing back; the black curve in the chart above falls from the first tool to the last. Every tool you connect, even one the agent never needs, is one more chance for the search to rank it over the tool you wanted. Under 100 tools the effect is small, and you would not bother with tool search at all. It starts to matter in the hundreds and turns serious past a few thousand, which is further along the curve than most teams realise they are.
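The shape of that experiment can be sketched as follows. Everything here is an assumption made for illustration: the synthetic pool, the `overlap_rank` stand-in retriever (plain word overlap, not BM25), and the function names. The toy output will not match the percentages above; the point is only the loop structure of gold tool plus random fillers at each size.

```python
import random
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_rank(query, catalogue):
    """Stand-in retriever: rank tool names by words shared with the query."""
    q = set(tokens(query))
    return sorted(catalogue,
                  key=lambda name: (-len(q & set(tokens(catalogue[name]))), name))

def recall_at_size(queries, pool, size, k=5, seed=0):
    """Hit@k when each query's gold tool sits among random fillers, `size` tools total."""
    rng = random.Random(seed)
    hits = 0
    for query, gold_name in queries:
        others = [name for name in pool if name != gold_name]
        chosen = rng.sample(others, min(size - 1, len(others)))
        catalogue = {name: pool[name] for name in chosen + [gold_name]}
        hits += gold_name in overlap_rank(query, catalogue)[:k]
    return hits / len(queries)

# Synthetic pool: 300 tools whose descriptions reuse a small vocabulary.
vocab = ["search", "send", "list", "create", "email",
         "file", "news", "weather", "invoice", "user"]
gen = random.Random(1)
pool = {f"tool_{i}": " ".join(gen.sample(vocab, 3)) for i in range(300)}
queries = [(pool[f"tool_{i}"], f"tool_{i}") for i in range(20)]

for size in (5, 50, 300):
    print(size, recall_at_size(queries, pool, size))
```

At a size no larger than k the gold tool cannot miss, which matches the 5-tool data point; past that, every filler is a competitor.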

The Model Context Protocol made connecting a tool almost free, so catalogues grow. Zapier's MCP server advertises 9,000-plus apps behind one connector, among them Gmail, Slack, and Salesforce. Wire a few hundred of those apps in, each exposing several tools, and your catalogue is already in the low thousands, deep enough for the decline to show.

Overlap does most of the damage

The size of the catalogue erodes recall slowly. It gets steep when tools resemble each other in the exact fields the search indexes. To see overlap on its own, I fixed the catalogue at 1,000 tools, the right tool always in it, and changed only the tools around it. When those tools were random and unrelated, the right tool made the top 5 79% of the time. When they were near-duplicates of it, the kind a real catalogue is full of, that fell to 52%. Same count. 27 points gone, purely because the other tools looked like the one I needed.
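A tiny, hand-checkable illustration of the mechanism, under loud assumptions: `TinyBM25` is a minimal pure-Python Okapi BM25 written for this sketch (it uses the always-positive `log(1 + ...)` idf variant, which may differ from the library used in the experiment), and the four-tool catalogues are invented. The count stays at four in both; only the neighbours change.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyBM25:
    """Minimal Okapi BM25 for illustration only."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(doc) for doc in docs]
        self.lens = [len(doc) for doc in docs]
        self.avgdl = sum(self.lens) / len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def scores(self, query):
        out = []
        for tf, dl in zip(self.tfs, self.lens):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            out.append(sum(self.idf.get(t, 0.0) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                           for t in query))
        return out

def gold_rank(catalogue, gold, query):
    """1-based rank of the gold tool: one plus the number of tools scoring higher."""
    names = list(catalogue)
    index = TinyBM25([tokens(name + " " + catalogue[name]) for name in names])
    scores = index.scores(tokens(query))
    gold_score = scores[names.index(gold)]
    return 1 + sum(s > gold_score for s in scores)

GOLD = "gmail_search_messages"
gold_text = "search messages in a gmail inbox by sender or subject"

random_fillers = {
    GOLD: gold_text,
    "weather_forecast": "daily forecast for a city",
    "stock_quote": "latest price for a ticker",
    "resize_image": "resize an image file",
}
look_alikes = {
    GOLD: gold_text,
    "slack_search_messages": "search messages",
    "outlook_search_messages": "search messages",
    "teams_search_messages": "search messages",
}

print(gold_rank(random_fillers, GOLD, "search messages"))  # 1
print(gold_rank(look_alikes, GOLD, "search messages"))     # 4
```

In this toy the look-alikes win on length normalisation: they contain the same query terms in shorter text. Real catalogues lose the gold tool for other reasons too, but the pattern is the same: the competitors score well because they share the gold tool's vocabulary.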

Real catalogues are the second kind. In that 37,000-tool set, 39% of tools have a near-twin by surface text alone, and the real number is higher, because matching on text misses the tools that mean the same thing in different words. The corpus carries three separate "get all climate change news" tools, from three different API versions. Every service ships a search, a create, a list. search_drive sits next to search_email sits next to search_web, and the retriever has to tell them apart on a name and one line of description. Anthropic names the same failure in the launch post: wrong tool selection happens most when tools have similar names, its example being notification-send-user against notification-send-channel.
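One way to estimate surface-text twins in your own catalogue is token-set Jaccard similarity. This is a sketch under assumptions: the post does not say how its 39% was computed, and the 0.7 threshold, the `near_twins` helper, and the sample tools are all illustrative choices.

```python
import re
from itertools import combinations

def token_set(tool):
    text = tool["name"] + " " + tool["description"]
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_twins(tools, threshold=0.7):
    """Names of tools with at least one other tool at or above the threshold."""
    sets = {tool["name"]: token_set(tool) for tool in tools}
    twins = set()
    for a, b in combinations(sets, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            twins.update((a, b))
    return twins

tools = [
    {"name": "climate_news_v1", "description": "get all climate change news"},
    {"name": "climate_news_v2", "description": "get all climate change news"},
    {"name": "search_drive", "description": "search files in drive"},
]

twins = near_twins(tools)
print(sorted(twins))                 # ['climate_news_v1', 'climate_news_v2']
print(len(twins) / len(tools))       # share of tools with a near-twin
```

The pairwise loop is quadratic, so a catalogue in the tens of thousands would likely need an approximate method instead. Like any surface-text measure, it misses tools that mean the same thing in different words.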

At a thousand tools, making them resemble each other cut the right tool's odds of being found from 79% to 52%. The count never changed.

This is also why searching again does not guarantee the right tool comes back. The look-alikes rank high because they resemble the tool you want, so any query that finds your tool tends to find them too. Reword the search and the same near-duplicates come back, just in a different order. The overlap penalty is a property of the catalogue, and rewording the query does not touch it.

Other studies have measured this from the other side. ToolScope, which merges redundant tools "with overlapping names and descriptions" and adds a context-aware retriever, reported gains of up to 38.6% in tool-selection accuracy across three models and three benchmarks. Another study found that editing only a tool's description, leaving the function untouched, changed how often a model picked it by more than a factor of 10. The choice rides on the wording, at the retrieval step I measured and at the selection step they did.

The fix is in the tool definitions

When the search hands the model the wrong tool, my first instinct is to blame the model and reach for a bigger one. The experiment is what changed my mind. A bigger model still sees only the 3 to 5 tools the search returned, and the search ran with no model in it at all. None of these numbers move because the model got smarter.

The next instinct is to blame the BM25 part and reach for embeddings, on the theory that a semantic search knows "find me the weather" and get_current_conditions are the same request. So I ran the whole thing again with real dense retrievers, a small one and the larger model in the same family (BGE-small and BGE-large, normalised cosine). They helped, and hit the same wall.

Retriever, at the 10,000-tool cap	hit@5	top-1
BM25 (lexical, what the tool ships)	57%	36%
BGE-small (dense)	68%	42%
BGE-large (dense)	72%	47%

Going from lexical to semantic buys about 11 points of hit@5; scaling the embedder from small to large buys 4 more. Neither removes the ceiling. Even BGE-large leaves the right tool out of the top 5 about 28% of the time at the cap, and on the full 37,000-tool corpus its best guess is right only 26% of the time. The overlap cliff survives every retriever: at 1,000 tools, BGE-large still falls from 91% on random fillers to 65% on look-alikes. A better retriever raises the ceiling, with diminishing returns, and never moves the work off the catalogue.
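For readers who want the dense ranking step spelled out, this is a minimal sketch of normalised-cosine top-k over vectors that are assumed to be precomputed. The three-dimensional vectors are made up; in practice they would come from an embedding model such as the BGE ones named above, and that step is left out here.

```python
import math

def normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_top_k(query_vec, tool_vecs, k=5):
    """Rank tool names by cosine similarity to the query vector."""
    q = normalise(query_vec)
    scored = []
    for name, vec in tool_vecs.items():
        d = normalise(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), name))
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [name for score, name in ranked[:k]]

# Hypothetical embeddings; real ones have hundreds of dimensions.
tool_vecs = {
    "get_current_conditions": [0.9, 0.1, 0.0],
    "get_forecast": [0.8, 0.3, 0.1],
    "send_email": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]          # stands in for "find me the weather"

print(cosine_top_k(query_vec, tool_vecs, k=2))
# ['get_current_conditions', 'get_forecast']
```

Note what the example also shows: the two weather tools sit close together in vector space, so a semantic retriever has its own version of the look-alike problem.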

The cleanup that helps is dull. Merge the duplicate tools. Put the service in the name, so gmail_search_messages instead of a bare search, and write descriptions in the words a real query would use. Anthropic's own troubleshooting guide for the Tool Search Tool says exactly this: when Claude cannot find a tool, "add common keywords to tool descriptions" and use "consistent namespacing in tool names: prefix by service or resource." The vendor that shipped the search also shipped the list of ways it misses. ToolScope measured that end-to-end gain with its full merge-and-retrieve system, and my own numbers size the cleanup piece from the retrieval side: the 27 points between look-alike and random tools at 1,000 is the overlap penalty, which is what deduping and renaming are fighting to win back. It only goes so far. That penalty comes from overlap. The slow decline from sheer count is a separate problem, and cleanup does not touch it, so past a few thousand tools the next move is to stop exposing tools a given task will never touch.
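A rough sketch of what that cleanup could look like as a lint pass over a catalogue. The helper names, the `GENERIC` verb list, and the sample tools are assumptions for illustration, and matching on normalised description text is only a first pass: two tools with identical descriptions should still be checked by hand before one is dropped.

```python
import re

# Assumed list of verbs that are too generic to stand alone as a tool name.
GENERIC = {"search", "create", "list", "get", "update", "delete"}

def namespaced(service, name):
    """Prefix a tool name with its service unless it already carries the prefix."""
    prefix = service.lower() + "_"
    return name if name.lower().startswith(prefix) else prefix + name

def bare_names(tools):
    """Tools whose whole name is a generic verb, such as a bare `search`."""
    return [tool["name"] for tool in tools if tool["name"].lower() in GENERIC]

def merge_exact_duplicates(tools):
    """Keep the first tool seen for each normalised description."""
    seen = {}
    for tool in tools:
        key = " ".join(re.findall(r"[a-z0-9]+", tool["description"].lower()))
        seen.setdefault(key, tool)
    return list(seen.values())

tools = [
    {"service": "gmail", "name": "search", "description": "Search messages."},
    {"service": "news", "name": "climate_news_v1", "description": "Get all climate change news"},
    {"service": "news", "name": "climate_news_v2", "description": "get all climate change news."},
]

print(bare_names(tools))                                   # ['search']
print(namespaced("gmail", "search_messages"))              # gmail_search_messages
print([t["name"] for t in merge_exact_duplicates(tools)])  # ['search', 'climate_news_v1']
```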

The whole method fits on one screen

There is no server and no model in the loop. A tool becomes searchable text from the four fields the docs index, BM25 ranks that text against the query, and I check whether the right tool landed in the top 5. ToolRet stores those fields in two different shapes depending on the source API, so the text builder reads both.

from rank_bm25 import BM25Okapi
import re, numpy as np

def text(tool): # the four fields the docs index, both schemas
    parts = [tool["name"], tool["description"]]
    props = (tool.get("doc_arguments") or {}).get("properties") or {} # schema A
    for argname, spec in props.items():
        parts += [argname, spec.get("description", "")]
    for key in ("required_parameters", "optional_parameters"): # schema B
        for p in (tool.get(key) or []):
            parts += [p.get("name", ""), p.get("description", "")]
    return re.findall(r"[a-z0-9]+", " ".join(parts).lower())

# catalogue: the query's gold tools plus fillers, up to size total, sampled from ToolRet
index = BM25Okapi([text(t) for t in catalogue])

hits = 0
for q in queries: # 1,100 real ToolRet queries
    scores = index.get_scores(re.findall(r"[a-z0-9]+", q.text.lower()))
    top5 = [catalogue[i] for i in np.argsort(scores)[::-1][:5]]
    hits += any(g in top5 for g in q.gold) # hit = any of the query's ~2.39 gold tools in the top 5

recall = hits / len(queries) # 0.57 at a 10,000-tool catalogue

That is the retriever in full, the two argument schemas included. The size curve runs this same loop at each catalogue size, averaged over a few random fills; the look-alike test swaps the random fillers for the target's nearest neighbors. Loading ToolRet and sampling a catalogue is the ordinary dataset plumbing around it. The corpus is public and the scoring is on this screen, so the 57% is yours to check.
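The filler swap can be sketched like this. The similarity measure is an assumption: the post does not say how nearest neighbours were chosen, so token-set Jaccard is used here as a plain stand-in, and `fillers` and the five-tool pool are hypothetical.

```python
import random
import re

def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def fillers(target, pool, n, mode, seed=0):
    """Pick n filler tools for the target: at random, or its closest look-alikes."""
    others = [name for name in pool if name != target]
    if mode == "random":
        return random.Random(seed).sample(others, n)
    # look-alike mode: the n tools whose text is closest to the target's
    ranked = sorted(others, key=lambda name: (-similarity(pool[target], pool[name]), name))
    return ranked[:n]

pool = {
    "gmail_search_messages": "search messages in gmail",
    "outlook_search_messages": "search messages in outlook",
    "slack_search_messages": "search messages in slack",
    "weather_forecast": "daily forecast for a city",
    "stock_quote": "latest price for a ticker",
}

print(fillers("gmail_search_messages", pool, 2, "lookalike"))
# ['outlook_search_messages', 'slack_search_messages']
print(fillers("gmail_search_messages", pool, 2, "random"))
```

Holding `n` fixed and changing only `mode` is what isolates overlap from catalogue size.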

Measure your own retrieval recall

My 57% is on a benchmark corpus, at the catalogue size Anthropic's tool supports. Yours will move with your tools, the descriptions attached to each of them, and your retriever. Stop focusing on how many tools you connected. For a sample of your real tasks, consider tracking whether the correct tool was in what the search returned, before the model chose. The loop below walks every search in a response and gathers all the tools they returned, so the number reflects what your agent actually retrieved across all its searches, however many it ran.

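# Retrieval recall: did the search return the right tool,
# before the model got to choose? Run this across your eval set.
import anthropic

client = anthropic.Anthropic()        # reads ANTHROPIC_API_KEY from the environment

def retrieved_tools(response):
    # response shape (Anthropic docs): each tool_search_tool_result block carries
    # .content.tool_references[].tool_name; verify the keys against your SDK version
    names = []
    for block in response.content:
        if block.type == "tool_search_tool_result":
            # each search returns the 3 to 5 tools it ranked most relevant;
            # an error result may carry no tool_references, so fall back to an empty list
            refs = getattr(block.content, "tool_references", None) or []
            names += [ref.tool_name for ref in refs]
    return names

hits = 0
for task in eval_set:                 # each task knows the tool it should use
    # TOOLS must include the tool search tool and your deferred tools; depending on
    # your SDK version the feature may also need a beta flag, so check the docs
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=1024,   # max_tokens is required by the API
        tools=TOOLS, messages=task.messages)
    hits += task.expected_tool in retrieved_tools(resp)

recall = hits / len(eval_set)         # this is your retrieval ceiling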

The response shape is in Anthropic's docs: every search returns a tool_search_tool_result block listing the tools it found.

	Context bloat	Wrong tool called
Symptom	55,000 tokens spent before the first message	the right tool is missing from the results, or lost among look-alikes
Fixed by	deferring tool loading	deduping and renaming the overlapping tools
Number to watch	tokens per request	retrieval recall, then selection accuracy on what comes back

When recall is low, at least for now, the work should be in the catalogue, and a model upgrade will not mend it. When recall is high and the model still picks the wrong one, then you have a selection problem, and a better model earns its cost. Reach for the bigger model first and you spend on tool selection, the step the model already handled, while retrieval, the step that was failing, stays broken.
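That split can be computed from the same eval logs. A minimal sketch, assuming each record stores the expected tool, the names the searches returned, and the tool the model finally called; the record layout and the `diagnose` helper are hypothetical.

```python
def diagnose(records):
    """Return (retrieval recall, selection accuracy on what came back)."""
    retrieved_ok = [r for r in records if r["expected"] in r["retrieved"]]
    selected_ok = [r for r in retrieved_ok if r["called"] == r["expected"]]
    recall = len(retrieved_ok) / len(records)
    # Selection accuracy is only defined on tasks where the right tool was retrieved.
    selection = len(selected_ok) / len(retrieved_ok) if retrieved_ok else 0.0
    return recall, selection

records = [
    {"expected": "a", "retrieved": ["a", "b"], "called": "a"},
    {"expected": "c", "retrieved": ["c", "d"], "called": "c"},
    {"expected": "e", "retrieved": ["e", "f"], "called": "f"},   # selection miss
    {"expected": "g", "retrieved": ["h", "i"], "called": "h"},   # retrieval miss
]

recall, selection = diagnose(records)
print(recall)                  # 0.75
print(round(selection, 3))     # 0.667
```

Low recall points at the catalogue; high recall with low selection points at the model.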

A few limits. I measured a BM25 reproduction and two embedding retrievers, not Anthropic's exact server, so the absolute numbers may differ from production. The tool ships in two modes, regex and BM25; I tested BM25, the one that takes natural-language queries and so is the likeliest to return a loosely matching tool. The regex mode matches name patterns instead, and I did not run it, including against the overlap test, so if you rely on regex, consider measuring that case yourself. The 57% is measured at the 10,000-tool limit the docs set; the 37,000-tool figures run past that cap and are here only to show where the curve heads. The exact numbers are specific to that corpus, which mixes real APIs with synthetic ones. The benchmark also scores only its labeled tools as correct, so some of what I counted as a miss returned a different tool that would have done the job, which means the real failure rate is lower than the raw number. These are also single-shot retrieval numbers, one search per query; a real agent can search again with different words, so its end-to-end miss rate is lower than the per-search figure here.

I tested two embedding models, BGE-small and BGE-large; the larger one added about 4 more points and hit the same wall, so a heavier retriever is unlikely to clear it. What I am sure of is simpler than the exact numbers: recall declines as the catalogue grows, and it declines more steeply when the tools resemble each other. Both effects showed up with every retriever I tried. Every number above comes from a deterministic script with no API in the loop, and none of it is private: the corpus is the public ToolRet set, the retrievers are a standard BM25 index and off-the-shelf embedding models, and the metric is whether a labeled tool lands in the top k. You can rebuild the curve from those without my code. Plot your own curve before you reach for a bigger model.