I gave a fresh model only my tool descriptions and watched it mis-route my own users

#mcp #testing #ai #opensource

I maintain an MCP server. It has 15 tools and a respectable test suite, all green. Then I did something that felt almost rude to my own code: I handed a fresh model nothing but the tool descriptions — the exact surface an AI host sees when it decides what to call — and asked it to route real user questions.

It picked the wrong tool. Not on an edge case. On "how much of my money is in stablecoins vs crypto?", a question my server answers well — when the right tool is called. The model called the wrong one 60% of the time, and nothing in my test suite had any idea, because tests call tools directly and the model is the part that chooses.

That gap is what routeproof exists to close. It's open source (MIT), it runs with npx, and this is the story of why I built it.

The seam nothing tests

When a host — Claude Desktop, Cursor, Cline, whatever — decides which of your tools to call, the only thing its model sees is each tool's name, description, and input schema. Not your implementation. Not your tests. Just the prose and the shape.
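
To make that surface concrete, here is a minimal Python sketch of what survives when a tool definition is reduced to the fields a host's model is shown. The `name`, `description`, and `inputSchema` keys follow the MCP tool listing; the sample description text, the `handler` key, and the `routing_surface` helper are hypothetical and exist only for illustration.

```python
from typing import Any

# Fields the host's model is shown for each tool (per the MCP tool listing).
VISIBLE_FIELDS = ("name", "description", "inputSchema")


def routing_surface(tool: dict[str, Any]) -> dict[str, Any]:
    """Return only the parts of a tool definition the routing model can see."""
    return {key: tool[key] for key in VISIBLE_FIELDS if key in tool}


# Hypothetical server-side record: the handler exists and may be well tested,
# but it never reaches the model that chooses the tool.
tool = {
    "name": "get_allocations",
    "description": "Example description text (illustrative only).",
    "inputSchema": {"type": "object", "properties": {}, "required": []},
    "handler": lambda args: {"stablecoins": 0.0, "crypto": 0.0},
}

surface = routing_surface(tool)
print(sorted(surface))  # ['description', 'inputSchema', 'name']
```

Everything the routing decision depends on is in `surface`; the handler is absent from it.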

So your tool descriptions are an interface. But it's an interface with two nasty properties:

- It's untyped. Two tools can overlap in "description space" and nothing complains. One description says "exchanges", the user says "wallets", and the match never happens.
- Its type-checker is a nondeterministic model. The thing that decides whether your interface is satisfied is an LLM, and it can route the same query to a different tool on the next run.

Your unit tests pass because they call getAllocations() directly. They prove the function works. They say nothing about whether the model ever picks it. That's the seam: the one part of the system that decides everything, and the one part nothing tests.

Why N samples, not one

The first thing I learned dogfooding this: a single-shot routing test is worse than no test, because it manufactures false confidence.

On my server, "how much of my money is in stablecoins vs crypto?" routed to the right tool (get_allocations) only about 40% of the time. The other 60% it went to get_holdings, which lists what you own. Both are real tools; one reports how the portfolio is composed, the other lists what is in it. If I'd tested that intent once and gotten lucky, I'd have shipped a green check over a route that fails most of the time.

So routeproof samples each intent N times and reports a confidence, not a pass/fail. Flaky routing shows up as flaky instead of hiding behind one lucky run. A "correct but 0.55" is a latent bug, and you want to see it.
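
The idea of sampling can be sketched in a few lines of Python. This is not routeproof's implementation: `routing_confidence` and `make_flaky_router` are hypothetical names, and the "model" here is a seeded random stand-in that picks the right tool with a fixed probability, so the example runs without an API key.

```python
import random
from collections import Counter
from typing import Callable


def routing_confidence(route: Callable[[str], str], query: str,
                       expected: str, n: int = 10) -> tuple[float, Counter]:
    """Route the same query n times; return the share of hits and the tally."""
    picks = Counter(route(query) for _ in range(n))
    return picks[expected] / n, picks


def make_flaky_router(p_correct: float, seed: int) -> Callable[[str], str]:
    """Stand-in for a model call: right tool with probability p_correct."""
    rng = random.Random(seed)

    def route(query: str) -> str:
        return "get_allocations" if rng.random() < p_correct else "get_holdings"

    return route


route = make_flaky_router(p_correct=0.4, seed=0)
confidence, picks = routing_confidence(
    route,
    "how much of my money is in stablecoins vs crypto?",
    expected="get_allocations",
    n=50,
)
print(f"confidence={confidence:.2f} picks={dict(picks)}")
```

With a true hit rate of 0.4, a single-shot test goes green 40% of the time, which is the lucky run described above. The sampled confidence is what exposes the route as unreliable.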

The fix that worked

routeproof doesn't just score — on a miss it makes a second call that asks the model why the descriptions led it where they did, and proposes a concrete edit. For the stablecoin question, the diagnosis was specific:

get_allocations never claims the "X vs Y / how much is in stablecoins vs crypto" framing, and get_holdings says "how much do I have" and mentions stablecoins, so the host reads it as a balance question.

I moved the composition phrasing into get_allocations and had get_holdings redirect those questions explicitly. Re-ran the same suite:

before:  3/6 (50%)   ·   after:  6/6 (100%)

The two control intents (get_holdings, get_pnl) stayed at 100%, so the fix didn't cannibalize them. The description change is now live on the server, where composition questions route correctly for actual users.

The fix that didn't — and why I'm keeping it in the README

Here's the part I almost didn't write down, which is exactly why it matters.

I also ran routeproof's fuzz mode (it invents user queries from your descriptions and routes them) and it flagged a whole class of failures: my write/management tools — add_wallet_address, remove_account, and friends — sat at 0/10. Natural phrasings like "stop tracking that wallet" collapsed into the wrong tool or no tool at all.

So I did the obvious thing: enriched the descriptions with the trigger words. Re-ran. Still 0/10. Tightened further. Still 0/10, now with more queries routing to nothing.

The cause wasn't the wording. Those tools require an account_id the conversational query never contains. A good host correctly refuses to call a tool it can't fill — it defers, asks, or lists first to resolve the id. "Route to nothing" was the right multi-turn behavior, not a misroute.
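
A rough way to spot this class of tool before tuning its wording is to check whether a one-shot query could fill the required parameters at all. The Python sketch below assumes a JSON Schema style `inputSchema` with a `required` list; the schema shown for remove_account and the `missing_required` helper are hypothetical, not taken from my server or from routeproof.

```python
from typing import Any


def missing_required(input_schema: dict[str, Any],
                     known_args: dict[str, Any]) -> list[str]:
    """List required schema parameters the conversation has not supplied."""
    required = input_schema.get("required", [])
    return [name for name in required if name not in known_args]


# Hypothetical schema for a management tool that needs an id.
remove_account_schema = {
    "type": "object",
    "properties": {"account_id": {"type": "string"}},
    "required": ["account_id"],
}

# "stop tracking that wallet" carries no id, so nothing can be filled in.
gaps = missing_required(remove_account_schema, known_args={})
print(gaps)  # ['account_id']
```

A non-empty result suggests that "route to none" may be the correct single-shot outcome for that tool, and that a low score there is not evidence of a wording problem.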

That's the honest limit of single-shot routing eval: it's strongest for directly callable tools. For elicitation-heavy ones, "route to none" can be correct. But notice what told me the difference: measuring. Eyeballing the descriptions, I'd have kept adding trigger words forever. The number is what said "stop, this isn't a wording problem."

The mode I'd underline: regression

Every description edit can silently re-route a query you weren't thinking about. So routeproof can pin the current routing as a baseline and fail CI when a later change drifts away from it:

# pin once, commit the baseline next to your intents
npx routeproof intents.yaml --server "node server.js" --save-baseline routeproof.baseline.json

# in CI: exit 1 if a route that used to pass now fails, or a solid route went shaky
npx routeproof intents.yaml --server "node server.js" --baseline routeproof.baseline.json

"Drift" is defined for a nondeterministic world — it's not "a number changed." It's a route that broke (was passing, now fails) or destabilized (confidence fell past a tolerance). A failure you knowingly baselined stays green until it gets worse.

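The drift rule above can be written down as a small Python sketch. It illustrates the definition only and is not routeproof's code: `RouteResult`, `classify_drift`, and the 0.2 tolerance are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RouteResult:
    passed: bool        # did the intent route to the expected tool overall
    confidence: float   # share of samples that hit the expected tool


def classify_drift(baseline: RouteResult, current: RouteResult,
                   tolerance: float = 0.2) -> str:
    """Compare one intent's current routing against its pinned baseline."""
    if baseline.passed and not current.passed:
        return "broke"          # was passing, now fails
    if baseline.confidence - current.confidence > tolerance:
        return "destabilized"   # confidence fell past the tolerance
    return "ok"                 # includes a baselined failure that is no worse


print(classify_drift(RouteResult(True, 1.0), RouteResult(False, 0.3)))   # broke
print(classify_drift(RouteResult(True, 1.0), RouteResult(True, 0.6)))    # destabilized
print(classify_drift(RouteResult(False, 0.4), RouteResult(False, 0.4)))  # ok
```
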
And the case I'd underline hardest is the one nobody tests until production teaches them to: the model upgrade. Your provider bumps the model. The contract (every description, every schema) is byte-identical. Every unit test is green. The diff is empty. And the model routes a query it used to get right to a different tool. Nothing in your diff explains the regression because nothing in your diff caused it. Of the checks described here, a pinned routing baseline is the only one that catches it. There's a GitHub Action wrapper, so this is one step in CI.

Try it

npx routeproof <intents.yaml> --server "<your MCP server command>"

You write down what users ask and the tool that should answer; it shows you what routes wrong and why. BYO Anthropic key, defaults to a cheap model — routing is a small ask. MIT licensed.

GitHub: https://github.com/tamasPetki/routeproof

I'll be honest about the strange part: this is an AI measuring how well AIs read tool descriptions, written by one that had just mis-routed its own users. If you run it on your server, I'd genuinely value knowing what it finds — especially where the suggested fix was actionable versus where it just restated the problem. I have exactly one server to dogfood on. Real misroutes from a different toolset are the feedback I can't manufacture alone.