I connected an MCP server last month and watched my token bill jump 37% on the first call. The actual work? A single git status. The schema for that one server consumed 42,000 tokens before the model typed a single character.
That's not a typo. Forty-two thousand.
If you ship AI agents in 2026 and you're not measuring MCP overhead, you're leaving real money on the table. Here's what I found when I actually instrumented the tax — and four patterns that brought my bill back under control.
What the "MCP tax" actually is
MCP (Model Context Protocol) defines a JSON-RPC handshake where every connected server pushes its full tool schema into the model's context window. The model needs those definitions to know what tools exist and how to call them. The protocol is clean. The economics are not.
When you connect N servers with an average of M tools each, you pay N × M × schema_size tokens on every single request — including requests that use zero of those tools. The schemas don't shrink when the model ignores them. They don't get paged out. They sit there, eating context, until the conversation ends.
In a 200,000-token window, four "modest" servers can burn 21% of your budget before the agent says hello.
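To make the N × M × schema_size formula concrete, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions (3 tools per server, ~3,500 tokens per tool), picked so the result lines up with the 21% figure above. `schema_overhead` and `share_of_window` are hypothetical helpers, not part of any MCP SDK.

```python
def schema_overhead(n_servers: int, tools_per_server: int, tokens_per_tool: int) -> int:
    """Tokens spent on tool definitions for every request (N x M x schema_size)."""
    return n_servers * tools_per_server * tokens_per_tool


def share_of_window(overhead_tokens: int, window_tokens: int = 200_000) -> float:
    """Fraction of the context window taken up by schemas alone."""
    return overhead_tokens / window_tokens


# Illustrative numbers only: 4 servers, 3 tools each, ~3,500 tokens per tool.
overhead = schema_overhead(4, 3, 3_500)
print(overhead, f"{share_of_window(overhead):.0%}")
```

Plug in your own server count and per-tool size. The point is that the product grows with every tool you connect, whether or not the request touches it.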
The standard MCP lifecycle looks like this:
1. Initialize → server returns protocol version, capabilities
2. ListTools → server returns full tool schema array
3. CallTool (per call) → model picks a tool, sends args, gets result
4. (repeat)
Steps 1 and 2 happen on every session. Step 2 is where the tax lives.
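On the wire, those steps are JSON-RPC 2.0 requests with the method names `initialize`, `tools/list`, and `tools/call`. The sketch below only builds the request bodies, to show where the schema payload enters. The client name, tool name, and arguments are made up, and the notification the client sends after the handshake is left out for brevity.

```python
import json

# Step 1: handshake. The protocolVersion string is one published spec revision;
# use whatever your client and server actually negotiate.
initialize = {
    "jsonrpc": "2.0", "id": 1, "method": "initialize",
    "params": {
        "protocolVersion": "2025-06-18",
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

# Step 2: ask for tool definitions. The response carries every tool's name,
# description, and inputSchema, which the client then places in the prompt.
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}

# Step 3: invoke one tool. The tool name and arguments here are hypothetical.
call_tool = {
    "jsonrpc": "2.0", "id": 3, "method": "tools/call",
    "params": {"name": "git_status", "arguments": {"repo_path": "."}},
}

for message in (initialize, list_tools, call_tool):
    print(json.dumps(message))
```

The requests themselves are tiny. The cost comes from the `tools/list` response, which the client copies into the model's context.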
My receipts: a real measurement
I instrumented a production agent over a week in June 2026. Same model (Claude Opus 4.5), same workloads, only the tool wiring changed. Here's what came out:
| Configuration | Avg tokens/request | Schema overhead | Cost per 1k req |
|---|---|---|---|
| No MCP, CLI tools only | 18,400 | 0 | $0.92 |
| 1 server (GitHub), 12 tools | 60,200 | 42,000 (70%) | $3.01 |
| 1 server + Playwright | 81,500 | 63,000 (77%) | $4.08 |
| 4 servers, 47 tools | 178,300 | 142,000 (80%) | $8.92 |
| After 4 fixes (below) | 56,100 | 21,000 (37%) | $2.81 |
The single-server case was the worst per-tool: GitHub MCP alone added 42,000 tokens of schema to a conversation that needed maybe 800 tokens of real GitHub work. Across 1,000 requests/day, that was an extra $2,000/month for nothing useful.
The 4-server case was even more dramatic: 80% of every request's tokens were definitions the model barely used.
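The overhead percentages can be re-derived from the two token columns in the table. This is a small check you could adapt to your own logs; the rows are copied from the table, and `overhead_share` is just a hypothetical helper name.

```python
# (configuration, avg tokens per request, schema tokens) copied from the table.
rows = [
    ("1 server (GitHub), 12 tools", 60_200, 42_000),
    ("1 server + Playwright", 81_500, 63_000),
    ("4 servers, 47 tools", 178_300, 142_000),
    ("After 4 fixes", 56_100, 21_000),
]


def overhead_share(total_tokens: int, schema_tokens: int) -> float:
    """Fraction of a request's tokens that are tool definitions."""
    return schema_tokens / total_tokens


shares = {name: round(100 * overhead_share(total, schema)) for name, total, schema in rows}
for name, pct in shares.items():
    print(f"{name}: {pct}% of request tokens are schema")
```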
The four fixes that actually worked
I tried eight different optimizations. Four moved the needle. The other four were theater.
1. Lazy-load server schemas
Most MCP clients eagerly call ListTools at session start. I patched the client to call it only when the model's first turn references the server's domain (filesystem paths, github.com, etc.).
# before: eager, every server's schema is fetched at session start
await mcp_client.initialize_all_servers()

# after: lazy, triggered by hint matching
async def get_schema(server):
    if not context_hints.match(server.domain_keywords):
        return None  # skip ListTools
    return await mcp_client.list_tools(server)
Result: server schemas load only when relevant. Idle requests now pay ~0 tokens of MCP overhead. Saved 38% of total tokens in my mixed workload.
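If you want to try the idea without patching a real client, here is a self-contained sketch. Everything in it is a stand-in: `FakeClient`, `Server`, and the keyword lists are hypothetical, and the substring match is the crudest possible hint matcher.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    domain_keywords: tuple[str, ...]


@dataclass
class FakeClient:
    """Stand-in for an MCP client; records which servers were asked for tools."""
    calls: list[str] = field(default_factory=list)

    async def list_tools(self, server: Server) -> list[dict]:
        self.calls.append(server.name)
        return [{"name": f"{server.name}_tool"}]


def mentions(text: str, keywords: tuple[str, ...]) -> bool:
    """True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


async def schemas_for_turn(client: FakeClient, servers: list[Server], user_turn: str) -> list[dict]:
    """Fetch tool schemas only for servers whose keywords appear in the turn."""
    tools: list[dict] = []
    for server in servers:
        if mentions(user_turn, server.domain_keywords):
            tools.extend(await client.list_tools(server))
    return tools


servers = [Server("github", ("github.com", "pull request")), Server("fs", ("file", "directory"))]
client = FakeClient()
tools = asyncio.run(schemas_for_turn(client, servers, "Summarize open issues on github.com/acme/app"))
print(client.calls)
```

The trade-off is recall: if the keyword list misses a phrasing, the model never sees the tool, so it is worth logging the turns where no server matched.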
2. Strip descriptions to the minimum
MCP tool schemas ship with rich descriptions written for humans browsing the tool list. The model doesn't need three paragraphs of "Use this tool when you want to…" prose. It needs name, parameter types, and one sentence of intent.
I wrote a post-processor that:
- Truncates descriptions to 80 chars
- Collapses optional parameter descriptions to type-only
- Removes examples (the model can invent them)
Schema size per tool: ~3,500 chars → ~800 chars, roughly 77% smaller for a typical tool. Across all 47 tools the total schema shrank 67%, with zero measurable drop in tool-selection accuracy.
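A minimal sketch of such a post-processor, assuming each tool is a dict with `name`, `description`, and a JSON Schema under `inputSchema`. `trim_tool` is a hypothetical helper and the sample tool is invented. It keeps descriptions on required parameters and reduces optional ones to type-only.

```python
import copy


def trim_tool(tool: dict, max_desc: int = 80) -> dict:
    """Return a slimmer copy of one tool definition; the input is not modified."""
    slim = copy.deepcopy(tool)
    slim["description"] = slim.get("description", "")[:max_desc]
    schema = slim.get("inputSchema", {})
    required = set(schema.get("required", []))
    for name, prop in schema.get("properties", {}).items():
        prop.pop("examples", None)         # drop examples everywhere
        if name not in required:
            prop.pop("description", None)  # optional params keep only their type
    return slim


tool = {
    "name": "create_issue",
    "description": "Use this tool when you want to open a new issue. " * 5,
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Issue title", "examples": ["Bug"]},
            "labels": {"type": "array", "description": "Optional labels to apply"},
        },
        "required": ["title"],
    },
}
slim = trim_tool(tool)
print(len(str(tool)), "->", len(str(slim)))
```

A hard cut at 80 characters can land mid-word. Cutting at the first sentence boundary is a reasonable variation if your descriptions lead with intent.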
3. Server-side tool filtering
MCP servers usually expose more tools than any one agent needs. The GitHub MCP ships 70+ tools — I use 8. I added a server-side allowlist:
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOOL_ALLOWLIST": "create_issue,list_issues,get_issue,add_issue_comment,search_code,get_file_contents,list_pull_requests,merge_pull_request" }
    }
  }
}
The server then returns only the allowlisted tools on ListTools. 8 tools × ~1,200 tokens each = 9,600 tokens, vs 70 tools × ~3,500 tokens = 245,000 tokens for the full list.
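The filtering itself is only a few lines. The sketch below shows one way a server could apply a comma-separated allowlist before answering ListTools. It is an assumption about the shape of such a patch, not the actual code of any published GitHub server, and the tool names in the sample list are invented.

```python
def parse_allowlist(raw: str) -> set[str]:
    """Split a comma-separated allowlist, ignoring blanks and stray whitespace."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def filter_tools(tools: list[dict], allowlist: set[str]) -> list[dict]:
    """Keep only allowlisted tools; an empty allowlist means 'no filtering'."""
    if not allowlist:
        return tools
    return [tool for tool in tools if tool["name"] in allowlist]


# Hypothetical tool list; a real server would build this from its registry.
all_tools = [{"name": n} for n in ("create_issue", "list_issues", "delete_repo", "fork_repo")]
allowed = parse_allowlist("create_issue, list_issues,")
print([t["name"] for t in filter_tools(all_tools, allowed)])
```

Treating an empty allowlist as "expose everything" keeps the default behavior unchanged for anyone who does not set the variable.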
4. Move high-frequency reads to CLI
This one hurt my pride. I built a beautiful MCP server for filesystem operations. It was elegant. It was also expensive.
For raw reads (cat, grep, ls, head), a CLI invocation through bash costs ~200 tokens of tool definition vs ~2,800 tokens for the MCP equivalent, a 14x gap in definition size alone. The model doesn't need structured output for grep -c "TODO" src/*.ts.
I kept the MCP server for stateful operations (multi-step transactions, authenticated API calls) and routed everything else through bash. Overall, that came to a 32x token reduction on the high-frequency path.
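The routing rule can be as simple as checking the first word of the command. This is a deliberately naive sketch: `READ_ONLY_COMMANDS` and `route` are hypothetical names, and a real router would also need to handle pipes, redirects, and sandboxing.

```python
READ_ONLY_COMMANDS = {"cat", "grep", "ls", "head"}


def route(command_line: str) -> str:
    """Return 'bash' for simple read commands, 'mcp' for everything else."""
    parts = command_line.split()
    if parts and parts[0] in READ_ONLY_COMMANDS:
        return "bash"
    return "mcp"


print(route('grep -c "TODO" src/*.ts'))
print(route("create_issue --title Bug"))
```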
What didn't work
Worth listing so you don't waste a weekend like I did:
- Caching schemas across sessions: the model still loads them into context per session. Cache hit rate was high, cost savings were zero.
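- Compressing schemas with gzip: the model only sees tokens of the decompressed text. Gzip shrinks bytes on the wire, but the schema has to be expanded before it enters the context, so the token count doesn't change. Tried it, no win.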
- Asking the model to ignore unused tools: the model has no way to "not see" tokens that are in its context. It can choose not to call them, but it can't un-read them.
- Shorter system prompts to compensate: this just means you have less context for actual work. Net negative.
The honest summary
MCP is the right protocol. The ecosystem is moving fast (14,000+ servers as of mid-2026, governance transferred to the Linux Foundation's Agentic AI Foundation, AAIF). The community is shipping fixes: MCP Gateway tools now do lazy schema loading server-side, and several large servers have shipped "minimal" modes.
But today, in production, the tax is real. Measure it before you optimize it. Run one of your agents with a tokenizer such as tiktoken counting the schema payload vs the request payload, and you'll know your number. For non-OpenAI models, treat tiktoken counts as an approximation.
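Here is a minimal measurement sketch that runs without any dependencies. It uses a crude 4-characters-per-token estimate by default, which is an assumption and not a real tokenizer. `schema_share` is a hypothetical helper, and the tool list is synthetic.

```python
import json
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def schema_share(tools: list[dict], request_text: str,
                 count: Callable[[str], int] = rough_token_count) -> float:
    """Fraction of prompt tokens spent on tool definitions."""
    schema_tokens = count(json.dumps(tools))
    request_tokens = count(request_text)
    return schema_tokens / (schema_tokens + request_tokens)


# Synthetic stand-in for a ListTools response with 12 verbose tools.
tools = [{"name": f"tool_{i}", "description": "x" * 400, "inputSchema": {"type": "object"}}
         for i in range(12)]
share = schema_share(tools, "Show me git status for this repo.")
print(f"schema share of prompt: {share:.0%}")
```

With tiktoken installed, you can pass `count=lambda s: len(enc.encode(s))` where `enc = tiktoken.get_encoding("cl100k_base")`. Feed it the tool list your client actually receives rather than the synthetic one here.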
Mine was 70% before. It's 37% now. The next 20 percentage points are going to come from server-side improvements I can't make alone — but until the ecosystem catches up, lazy-loading + description trimming + allowlists + selective CLI routing is the most defensible stack.
If you're shipping MCP today without measuring this, you're flying blind on cost. The protocol will improve. Your bill won't wait.