What an MCP server actually costs you in tokens

#mcp #tooling #devrel #ai

Connect five MCP servers to your agent and something quietly happens before you've typed a single word: you've already spent tens of thousands of tokens. Not on your problem. On the menu of tools the model might use.

Most people notice what MCP unlocks. Fewer notice the bill.

I've been wiring MCP into real workflows for a while — I wrote about using the GitHub and Jira servers to triage PR comments — and the thing that keeps coming up isn't capability. It's context. So this is a note on where that cost actually comes from, and the two very different techniques that attack it.

The two costs of an MCP

It helps to split the bill in two.

Fixed cost — the tool definitions. Every connected server injects its tools' names, descriptions and schemas into the context window at startup, whether you use them or not.
Variable cost — the results. Every tool call and its output passes back through the model, including the 4,000-row response you needed one number from.

The fixed cost is the one that surprises people. Anthropic's own figures put a five-server setup at around 58 tools and ~55K tokens before the conversation even starts. GitHub's official server alone runs around 26K — about 13% of a 200K window, and only approaches a quarter once you enable its full ~90-tool surface.

Connect ten MCP servers and you can burn 75,000+ tokens before anyone types a thing.

Two costs, two completely different fixes. Worth keeping straight, because a lot of advice online quietly solves one and claims to solve both.

Fixed cost: stop loading tools you don't need

The fix here is lazy loading, and it finally went mainstream.

Instead of injecting every tool definition upfront, the client exposes a small search interface and loads a tool's full schema only when the agent decides to use it. Anthropic shipped this as the Tool Search Tool on the API in late 2025, then turned it on by default in Claude Code in January 2026 — fittingly, it had been one of the most requested issues on the repo: "lazy loading for MCP servers".

Mechanically it's simple. You mark the tools you don't need upfront:

{
  "type": "tool_search_tool_bm25_20251119",
  "name": "tool_search_tool_bm25"
},
{
  "name": "github.createPullRequest",
  "description": "Create a pull request",
  "input_schema": { "...": "..." },
  "defer_loading": true
}

Deferred tools stay discoverable but cost nothing until the model searches for them. Keep your two or three most-used tools at defer_loading: false so they're always present. Anthropic reports this preserves around 85% of the tokens definitions used to eat — while improving accuracy, because fewer irrelevant options means less decision paralysis (Opus 4 went from 49% to 74% on their MCP evals).

Here's the part most people miss:

This is a client problem, not a server problem. Your server can't lazy-load itself. All it can do for the fixed cost is ship few tools with short, keyword-rich descriptions — because the search runs BM25 over names, descriptions and parameter names. Vague descriptions don't get found.

And a real tradeoff worth knowing: retrieval isn't perfect. Independent benchmarks put keyword (BM25) retrieval — the same algorithm Tool Search uses — at around a 61% top-10 hit rate on a 2,000-tool catalog, with end-to-end accuracy in the high-50s to mid-60s. "Send an email" failing to surface Gmail_SendEmail is a genuine failure mode. Lazy loading saves tokens; it also adds a step that can miss.

If you run Claude Code, /context shows you exactly where your tokens are going. Look before you optimize.

Which clients actually ship lazy loading today — and the proxy trick for the ones that don't — turned into its own note: which clients support it and how.

Variable cost: keep the data out of the model

Lazy loading does nothing for the second cost. If a tool returns a 50,000-token spreadsheet, you pay for the spreadsheet — twice, if the model has to copy it into the next call.

This is where code execution with MCP comes in, and it's the more interesting shift.

Instead of the model calling tools directly, the client exposes each server as code modules on a filesystem (servers/salesforce/createContact.ts, and so on). The agent writes TypeScript that imports only what it needs and runs it in a sandbox. Two things change:

Progressive discovery. The agent lists the directory and reads only the module it's about to use. The full catalog never enters context — the same win as lazy loading, by a different road.
The data stays in the sandbox. This is the big one. A huge intermediate result lives in the execution environment; only what you return or log comes back to the model.

So importing a spreadsheet into Salesforce stops looking like "load 10,000 rows into context, then call createContact 10,000 times". It looks like this:

import { getRows } from './servers/sheets/getRows.ts';
import { createContact } from './servers/salesforce/createContact.ts';

const rows = await getRows('contacts.xlsx');          // stays in the sandbox
const active = rows.filter(r => r.status === 'active');

for (const r of active) await createContact(r);

console.log(`Imported ${active.length} of ${rows.length} contacts`);

The model sees one line of output, not ten thousand rows. Anthropic reported a workflow dropping from ~150,000 tokens to ~2,000 on exactly this kind of pattern — a 98.7% cut.

A couple of things fall out of this almost for free:

Privacy. Sensitive fields can be tokenized inside the sandbox, so the model sees placeholders while the client keeps the real mapping. Data can move between servers without the model ever seeing a raw email.
Reusable skills. Because there's a filesystem, a helper you write once ("turn this sheet into a report") can be saved and imported in a later session. The agent accrues skills instead of re-deriving them.

The catch is honest and familiar: running model-generated code needs a real sandbox, with resource limits and monitoring. You're trading token cost for infrastructure.

If you build or maintain MCP servers

Most of the writing on this is about the client side. But the techniques change how you should build a server too.

Keep the tool count low and the surface clean. Even with lazy loading, every tool is something search has to disambiguate.
Write descriptions for retrieval, not just for humans. BM25 matches keywords. "Fetch conversion data from analytics" beats "Analytics endpoint v2".
Name tools so they don't collide. notification-send-user vs notification-send-channel is exactly the pair that produces wrong-tool errors.
Don't return everything. Paginate, summarize, or expose a "give me the count" variant. A tool that always dumps its full payload is a tax on every workflow that touches it.
Design for code mode. Small, typed modules that do one thing compose well when an agent is writing code against them. Sprawling do-everything tools don't.

None of this is exotic. It's the same hygiene we already apply elsewhere — small units, clear names, return what's asked for. It connects to something I keep coming back to about keeping AI context lean and intentional: the value isn't in giving the model more, it's in giving it the right slice at the right moment.

The shift underneath all this

The mental model that's emerging is simple. You used to keep every tool in the model's head at all times. Now the tools live in a dictionary — searched, or imported as code — and the model only pulls what the task needs.

Treat your tool catalog like a library, not a backpack.

Both techniques are young, and retrieval accuracy isn't where it needs to be yet. But the direction is clearly right. Context was the main reason a lot of us kept MCP setups small.

That reason is quietly going away.