Ali Al-Jaafari

Posted on Jun 28

Your MCP servers are burning 50k+ tokens before you type a word

#ai #mcp #productivity #claude

Context caching and on-demand schema loading

Here is something I did not realize about the Model Context Protocol until my context window kept feeling full for no reason.

Every MCP server you connect loads its full set of tool definitions into the context window on every single request. Those schemas are not free. Each tool costs a few hundred tokens, and they are sent before the model reads a word of your prompt.

Five servers with a dozen or more tools each can add up to 50,000 to 75,000 tokens of overhead per request when the tool schemas are large. That is real money on every call, and latency you feel on every turn. It also crowds out the context you actually want the model to use.

Measure it first

You cannot cut what you cannot see. A rough rule is about 200 tokens per tool plus a small per-server overhead. I built a tiny tool that prints an estimate for your real config (and checks security while it is at it):

pipx install git+https://github.com/alih552/mcp-audit
mcp-audit
# -> 7 server(s) - ~13,160 context tokens - score 0/100

It runs fully locally, zero dependencies, MIT.
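
If you just want the back-of-the-envelope version, the rule of thumb above fits in a few lines of Python. This is a minimal sketch, not how mcp-audit computes its number: the 100-token per-server overhead is my assumption for "small", and the server names and tool counts are made up.

```python
# Rough estimate only: ~200 tokens per tool (the rule of thumb above)
# plus a per-server overhead. The overhead value is an assumption.
TOKENS_PER_TOOL = 200
TOKENS_PER_SERVER = 100  # assumed value for "small per-server overhead"


def estimate_context_tokens(tools_per_server: dict[str, int]) -> int:
    """Return the estimated schema overhead, in tokens, for one request."""
    return sum(
        TOKENS_PER_SERVER + TOKENS_PER_TOOL * tool_count
        for tool_count in tools_per_server.values()
    )


# Hypothetical setup: server names and tool counts are placeholders.
servers = {"files": 12, "search": 8, "git": 15}
print(estimate_context_tokens(servers))
```

Real schemas vary a lot in size, so treat the result as a gut check and trust a measurement over the estimate.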

Then cut it

Turn off what you are not using. The biggest lever and the easiest. Most people leave servers connected that they touched once. Going from seven always-on servers to the two you actually use can reclaim tens of thousands of tokens.
Remove redundant servers. Two search servers, two file servers. Pick one per capability.
Trim tool surface on servers you build. Ten focused tools beat thirty overlapping ones, both for token cost and for the model picking the right one. Keep descriptions tight.
Load niche servers on demand rather than keeping everything always on.

The default of "everything connected all the time" is what creates the bloat. A few minutes of cleanup pays for itself on every request after.
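
The first two steps above can be sketched as a small filter over a client config. This assumes your client stores servers under an "mcpServers" key, as Claude Desktop's JSON config does; other clients may lay this out differently, and the server names and commands below are placeholders.

```python
import json


def prune_servers(config: dict, keep: set[str]) -> dict:
    """Return a copy of config that lists only the servers named in keep."""
    servers = config.get("mcpServers", {})
    pruned = dict(config)  # shallow copy so the original stays untouched
    pruned["mcpServers"] = {
        name: spec for name, spec in servers.items() if name in keep
    }
    return pruned


# Hypothetical config: names and commands are placeholders.
config = {
    "mcpServers": {
        "files": {"command": "files-server"},
        "search-a": {"command": "search-a-server"},
        "search-b": {"command": "search-b-server"},
        "notes": {"command": "notes-server"},
    }
}

# Keep one server per capability and drop the rest.
trimmed = prune_servers(config, keep={"files", "search-a"})
print(json.dumps(trimmed, indent=2))
```

Writing the trimmed result to a separate file rather than overwriting the original makes it easy to bring a niche server back when you need it.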

Repo and the full writeup: https://github.com/alih552/mcp-audit

Curious what context-token number people get on their setups.

Top comments (8)

UnitBuilds • Jun 28

Worth noting, MCPs are a tradeoff. If you're using context caching, then it becomes really cheap. You essentially pay full token price once, then 5% of the price for each request after. Though yes, the MCP standard is a bloated mess... It's slow and it doesn't even prevent hallucinations. That's why I had to redesign it for NMCP for VELOCITY OS. A standard Node.js-hosted JSON MCP takes over a millisecond just to deserialize, whereas NMCP runs in nanoseconds and doesn't need to serialize or deserialize. It's deterministic, so there is no label + description; it's logically written in a way the LLM natively understands.

Ali Al-Jaafari • Jun 29

Fair point on caching. It genuinely changes the cost math when the client and provider both support it. Two things it does not fully solve though: the schemas still occupy the context window, so they crowd out room for what you actually want the model to use, and the cache has a TTL and does nothing for the first call or for clients that do not cache. So measure and prune still earns its keep even with caching on. Interesting direction with NMCP, attacking the schema tax with deterministic dispatch is exactly the right thing to go after.
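
To make the caching math in this exchange concrete, here is a rough sketch. It takes the 5% figure from the comment above at face value rather than from any provider's price list, uses an assumed input price of $3 per million tokens, and ignores TTL expiry and any cache-write surcharge, both of which would push the cached cost up.

```python
def overhead_cost(tokens, requests, price_per_mtok, cached_fraction=None):
    """Dollar cost of sending `tokens` of schema overhead on `requests` calls.

    cached_fraction is the price multiplier for cached reads; None means
    no caching. TTL expiry and cache-write surcharges are not modeled.
    """
    per_call = tokens / 1_000_000 * price_per_mtok
    if cached_fraction is None or requests == 0:
        return per_call * requests
    # Full price on the first call, discounted price on every call after.
    return per_call + per_call * cached_fraction * (requests - 1)


TOKENS = 13_160        # the estimate printed for the config in the post
PRICE_PER_MTOK = 3.0   # assumed price in dollars per million input tokens

uncached = overhead_cost(TOKENS, 100, PRICE_PER_MTOK)
cached = overhead_cost(TOKENS, 100, PRICE_PER_MTOK, cached_fraction=0.05)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

The dollar figure shrinks with caching, but the token count occupying the context window is the same in both cases.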

Nazar Boyko • Jun 29

Turning off servers you don't use is the right first move, and there's a second fix landing one layer down that you can't reach by hand. Some clients are starting to load only the tool names up front and pull the full schema on demand the moment a tool is actually needed, so a connected server costs almost nothing until you call it. That kills the always-on tax without making you babysit which servers are plugged in. Until that's everywhere, your measure-then-prune approach is the practical answer, and the rough 200 tokens per tool rule is a handy gut check. What number did your own config print?

Ali Al-Jaafari • Jun 29

Exactly right, lazy schema loading is the real fix and it is great to see clients moving that way. Load the tool names up front, pull the full schema only when a tool is actually invoked, and the always-on tax basically disappears. Until it is everywhere, measure and prune is the move, and the rough 200 tokens per tool rule holds up as a gut check. Mine printed about 13k tokens across 7 servers, and a humbling 0 out of 100 on the security side, which is what sent me down this whole path.
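
The lazy-loading idea described in this exchange can be sketched as a small registry. This is an illustration of the pattern only, not any client's actual implementation: LazyToolRegistry, its methods, and the tool names are all hypothetical.

```python
from typing import Callable


class LazyToolRegistry:
    """Expose tool names up front; load a full schema only when asked."""

    def __init__(self, loaders: dict[str, Callable[[], dict]]):
        self._loaders = loaders
        self._loaded: dict[str, dict] = {}

    def tool_names(self) -> list[str]:
        """Cheap listing: names only, no schemas are loaded."""
        return sorted(self._loaders)

    def schema(self, name: str) -> dict:
        """Load the full schema on first use, then reuse the stored copy."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_count(self) -> int:
        return len(self._loaded)


# Hypothetical tools; each loader stands in for fetching a real schema.
registry = LazyToolRegistry({
    "read_file": lambda: {"name": "read_file", "params": {"path": "string"}},
    "web_search": lambda: {"name": "web_search", "params": {"query": "string"}},
})

print(registry.tool_names())     # names are available immediately
print(registry.loaded_count())   # no schema has been loaded yet
registry.schema("read_file")     # first use pulls in one schema
print(registry.loaded_count())
```

Only the schemas that were actually requested would ever need to reach the context window, which is where the savings come from.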

René Zander • Jul 10

The token overhead is the measurable half, and your cut list is right, but there is a second cost pruning fixes that caching does not touch: selection accuracy falls as the tool surface grows. Past a few dozen tools the model starts reaching for the overlapping one or the wrong search server, and that error rate climbs with surface area independent of how the schemas are cached. So "ten focused tools beat thirty" is not only a token argument, it is a correctness one. The lever I would add to your list: a lot of MCP tools wrap deterministic work, a fixed file read or a single API call, that never needed to be a model-visible decision at all. Route those through plain code and keep the tool list reserved for the calls that actually need judgment, and both the token count and the mis-selection rate drop together. I found the selection ceiling sits around 50 and wrote up where it bites: renezander.com/blog/skill-library-...
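
The routing idea in the comment above can be sketched like this: the fixed file read happens in plain code, and only the calls that need judgment stay in the tool list. Everything named here (the tool names, load_fixed_context, build_request) is hypothetical and stands in for whatever your own setup uses.

```python
import tempfile
from pathlib import Path

# Only calls that need a model decision stay visible (hypothetical names).
MODEL_VISIBLE_TOOLS = ["search_docs", "draft_reply"]


def load_fixed_context(path: Path) -> str:
    """Deterministic step: always read the same file, no model decision."""
    return path.read_text(encoding="utf-8")


def build_request(question: str, context_path: Path) -> dict:
    """Assemble a request whose fixed context comes from code, not a tool."""
    return {
        "context": load_fixed_context(context_path),
        "question": question,
        "tools": MODEL_VISIBLE_TOOLS,
    }


# Demo with a temporary file standing in for the fixed input.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"
    notes.write_text("release checklist", encoding="utf-8")
    request = build_request("What is left to ship?", notes)

print(request["tools"])
```

There is no file-reading tool for the model to pick or mis-pick here, so that step costs no schema tokens and cannot be mis-selected.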

Maria andrew • Jun 29

Great reminder that MCP performance isn't just about the model; it's also about context management. Regularly auditing connected servers and reducing unnecessary tool definitions can improve latency, lower costs, and leave more context available for tasks that actually matter.

Ali Al-Jaafari • Jun 29

Thanks, that is exactly it. The context you free up by pruning is the win people miss. It is not just cost, it is leaving room for the model to actually focus on the task. Appreciate you reading.
