MCP is having its moment. Every agent framework is wiring tools through it, and for good reason: standardized tool registration, composable servers, a protocol that doesn't force you to reinvent the plugin system every project. I get the appeal.
What I didn't see coming was the token bill.
Here's the number that stopped me cold: for a simple SerpApi search, an MCP agent used 6,047 tokens per call. A CLI script doing the exact same job used 351 tokens. That's 17x overhead, for the same search result.
I went looking for other data points after that. The range is wider than I thought.
The number that started this: 17x tokens per call
The benchmark comes from a post measuring SerpApi MCP vs a CLI agent with field projection. Both do the same job: search the web, return results. The difference:
| Approach | Tokens per call |
|---|---|
| MCP agent | 6,047 |
| CLI script (with field projection) | 351 |
| Ratio | ~17x |
At 10 searches a day in a conversational interface, 6,047 tokens per call is fine. At 1,000 searches a day in an automated pipeline, you're burning 6 million tokens where 351,000 would do.
That's not a rounding error. That's a billing line item.
Why it happens: schema injection on every message
The overhead isn't in the API call itself. It's in what your AI host has to inject into every single message before the call even happens.
When you register an MCP server, every tool definition (name, description, input schema, parameter types) gets serialized and sent to the model as part of the context on every request in your conversation. The LLM needs to "see" all available tools on every turn so it knows what it can call.
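To see what rides along on each turn, here's a minimal sketch of one tool definition in the shape MCP uses (`name`, `description`, `inputSchema`) plus a crude size estimate. The `search_web` tool and the 4-characters-per-token heuristic are my assumptions for illustration; real tokenizers vary, so treat the result as an order-of-magnitude guide only.

```javascript
// Hypothetical tool definition in the MCP shape: name, description, inputSchema.
const searchTool = {
  name: "search_web",
  description: "Search the web and return organic results.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search query" },
      num: { type: "integer", description: "Number of results", default: 5 },
    },
    required: ["query"],
  },
};

// Rough heuristic (assumption): about 4 characters per token for JSON text.
function estimateTokens(value) {
  return Math.ceil(JSON.stringify(value).length / 4);
}

// Every registered tool is serialized into the context on every turn,
// so the per-turn overhead is the sum over all registered definitions.
function schemaOverheadPerTurn(tools) {
  return tools.reduce((sum, tool) => sum + estimateTokens(tool), 0);
}

console.log(schemaOverheadPerTurn([searchTool]));
```

One small tool is cheap. The cost shows up when dozens of them are registered and most are irrelevant to the task at hand.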
A concrete example: a simple repo language check. The task is trivial. The token count is not:
| Approach | Tokens |
|---|---|
| CLI agent (direct API) | 1,365 |
| MCP agent (same task) | 44,026 |
The extra 42,661 tokens are 43 tool definitions injected into every message. Tools that had nothing to do with the repo language check. The agent happened to have 43 tools registered on that MCP server, and every one of them rode along for the trip.
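A quick back-of-the-envelope check on those figures, assuming (as stated above) that all of the extra tokens are tool definitions. The per-tool average is my own derivation from the table, not a number from the original benchmark.

```javascript
// Figures from the repo language check comparison.
const cliTokens = 1365;
const mcpTokens = 44026;
const registeredTools = 43;

const extraTokens = mcpTokens - cliTokens;      // 42661
const perTool = extraTokens / registeredTools;  // about 992 tokens per definition
const ratio = mcpTokens / cliTokens;            // about 32x

console.log({
  extraTokens,
  perTool: Math.round(perTool),
  ratio: ratio.toFixed(1),
});
```

So each definition averages just under a thousand tokens, and the MCP run costs roughly 32 times what the direct call does.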
The full benchmark breakdown (4x to 32x range)
"17x" is the headline, but the actual range is 4x to 32x depending on:
- How many tools are registered on the server
- How complex those tool schemas are
- How many turns the conversation takes
- Whether you're running batch jobs or interactive queries
The 4x end is a small server with simple schemas and a short session. The 32x end is a large server with dozens of complex tools running a multistep pipeline where schema injection compounds across every message.
What matters is that the floor is 4x. Even a minimal MCP setup carries overhead that a direct API call doesn't.
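The compounding is easy to sketch. The function below assumes the full set of definitions is resent on every turn and that nothing is served from a prompt cache; the server sizes are hypothetical, and the 1,000 tokens per tool is a rounded stand-in for 42,661 / 43 from the example above.

```javascript
// Assumption: every tool definition is resent on every turn, with no prompt caching.
function sessionSchemaTokens(tokensPerTool, toolCount, turns) {
  return tokensPerTool * toolCount * turns;
}

// Hypothetical small server: 3 tools, 2-turn session.
const small = sessionSchemaTokens(1000, 3, 2); // 6000

// Hypothetical large server: 43 tools, 10-turn pipeline.
const large = sessionSchemaTokens(1000, 43, 10); // 430000

console.log({ small, large });
```

Tool count and turn count multiply, which is why a large server inside a multistep pipeline lands at the expensive end of the range.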
The $270/day surprise: unused tool definitions at scale
Here's where it gets genuinely alarming for production pipelines.
At Claude Sonnet pricing (about $3 per million input tokens), 90,000 tokens of unused schema overhead costs roughly $0.27 per request. If your pipeline runs 1,000 times a day, that's $270/day on tool definitions your agent never touches.
Not on useful computation. On schemas riding along in the context window because you registered everything upfront.
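If you want to plug in your own numbers, here's a small sketch of that arithmetic. The $3 per million input tokens rate is an assumption that reproduces the $0.27 figure; check your provider's current pricing before relying on it.

```javascript
// Assumption: $3 per million input tokens, which reproduces the $0.27 figure.
const PRICE_PER_MILLION_INPUT_TOKENS = 3.0;

function dailySchemaCost(overheadTokensPerRequest, requestsPerDay) {
  const perRequest =
    (overheadTokensPerRequest / 1_000_000) * PRICE_PER_MILLION_INPUT_TOKENS;
  return { perRequest, perDay: perRequest * requestsPerDay };
}

const cost = dailySchemaCost(90_000, 1_000);
console.log(cost.perRequest.toFixed(2), cost.perDay.toFixed(2)); // 0.27 270.00
```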
Most MCP servers are built for flexibility. You register everything because you don't know what the agent will need. That makes sense in a conversational interface where the user might ask anything. In a batch pipeline where the task is deterministic, you're paying the full flexibility tax even when you only need two of the forty registered tools.
The math compounds fast. And it's invisible in your usage dashboard: it just shows up as "input tokens."
When the overhead is absolutely worth it
MCP has real strengths. The overhead is the cost of flexibility, and in these cases it's worth paying.
Conversational agents. The user is asking unpredictable questions. You need all tools available on every turn because you genuinely don't know what comes next. Schema injection is the price of that flexibility.
Governance and auth at the server level. MCP servers can enforce access control, rate limiting, and audit logging in one place for every client that talks to them. Doing that in bespoke scripts for each tool is a footgun waiting to fire. The overhead buys you correctness.
Composable tooling across teams. If five teams are building agents and they all need the same data sources, one MCP server beats five separate integrations. The overhead is shared; the maintenance savings are real.
Rapid prototyping. Getting from zero to a working agent tool takes minutes with MCP. The token overhead doesn't matter while you're still iterating and nothing is running in production yet.
When to cut MCP out of the loop
The cases where you should rethink it:
Deterministic batch pipelines. You know exactly what the agent needs to do. Schema injection has zero conversational payoff here. It's pure cost. A job with 50 items takes roughly 50 seconds via direct API vs about 25 minutes via MCP, with 10x to 20x token overhead per step. At that scale, the pipeline eats itself.
High-volume, single-purpose agents. If your agent does one thing 10,000 times a day, a direct API call with a hardcoded schema is faster, cheaper, and easier to debug. MCP's flexibility is a liability in this case.
Latency-sensitive paths. More tokens means slower time to first token from the LLM. For real-time applications where response latency matters, every unnecessary token hurts.
Fixed-scope, cost-optimized pipelines. If you know you'll never need more than three tools and the task is well defined, bespoke beats general purpose every time.
Practical reduction: field projection, selective loading, lazy registration
If you're committed to MCP but want to cut the overhead, three patterns help.
Field projection
Instead of passing the full API response, return only the fields the agent actually needs. The SerpApi CLI example gets most of its efficiency from this: it strips the raw response to title, snippet, and URL before handing anything to the LLM. The agent gets the same information with a fraction of the token cost.
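```javascript
// `serpapi` is the search client from this example; substitute your own.
async function searchProjected(query) {
  // Instead of passing the full response object to the LLM...
  const raw = await serpapi.search(query);
  // ...project down to what the agent actually needs:
  return {
    results: (raw.organic_results ?? []).slice(0, 5).map((r) => ({
      title: r.title,
      snippet: r.snippet,
      url: r.link,
    })),
  };
}
```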
Selective tool loading
Not every agent needs every tool every session. If you can determine upfront which tools a task requires, register only those. MCP servers support dynamic tool lists, so you don't have to expose everything on every request.
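A minimal sketch of the idea in plain JavaScript, independent of any particular MCP SDK. The tool names, the task types, and the `toolsForTask` helper are all hypothetical; how you wire the filtered list into your server depends on the SDK you use.

```javascript
// Hypothetical registry of every tool the server could expose.
const ALL_TOOLS = [
  { name: "search_web", description: "Search the web" },
  { name: "get_repo_languages", description: "List languages in a repository" },
  { name: "create_issue", description: "Open an issue" },
  { name: "send_email", description: "Send an email" },
];

// Hypothetical mapping from task type to the tools that task needs.
const TOOLS_BY_TASK = new Map([
  ["research", ["search_web"]],
  ["repo_audit", ["get_repo_languages", "create_issue"]],
]);

// Return only the definitions the task needs; unknown tasks get the full list.
function toolsForTask(taskType) {
  const allowed = TOOLS_BY_TASK.get(taskType);
  if (!allowed) return ALL_TOOLS;
  return ALL_TOOLS.filter((tool) => allowed.includes(tool.name));
}

console.log(toolsForTask("repo_audit").map((tool) => tool.name));
```

Falling back to the full list for unknown task types is a deliberate choice here: it keeps the conversational case working while the known batch tasks pay only for what they use.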
Lazy registration
Register the core tools that handle 90% of your traffic. Add specialized tools only when the conversation signals it needs them: via explicit user intent, a specific task type, or a flag in your system prompt. This keeps the base schema small and only pays for specialized tooling when it's actually relevant.
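Here's one way that could look, sketched with keyword triggers on the conversation so far. The tool names, the trigger patterns, and the `activeTools` helper are hypothetical; a real setup might key off a classifier, a task type, or a flag rather than regular expressions.

```javascript
// Hypothetical core tools that cover most traffic.
const CORE_TOOLS = ["search_web", "fetch_page"];

// Hypothetical specialized groups, each unlocked by a trigger pattern.
const SPECIALIZED = [
  { trigger: /\b(invoice|refund|billing)\b/i, tools: ["lookup_invoice", "issue_refund"] },
  { trigger: /\b(deploy|rollback)\b/i, tools: ["deploy_service", "rollback_release"] },
];

// Start from the core set and add a group only once a message signals it is needed.
function activeTools(messages) {
  const active = new Set(CORE_TOOLS);
  for (const message of messages) {
    for (const group of SPECIALIZED) {
      if (group.trigger.test(message)) {
        group.tools.forEach((name) => active.add(name));
      }
    }
  }
  return [...active];
}

console.log(activeTools(["Can you check my last invoice?"]));
```

Once a group is unlocked it stays active for the rest of the conversation in this sketch, which avoids tools disappearing mid-task at the cost of a slightly larger schema on later turns.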
FAQ
Does MCP use more tokens than direct API?
In every benchmark cited here, yes. Schema injection added at least 4x overhead and reached 32x for complex multi-tool servers. The question is whether the flexibility justifies the cost for your use case.
How many tokens does MCP add per request?
It depends on how many tools are registered and how complex their schemas are. A minimal server might add a few hundred tokens per message. A large server with dozens of tools can add tens of thousands per message, per turn.
When should I use MCP vs direct API calls?
MCP for conversational, open-ended, governance-heavy, or multi-team scenarios. Direct API for batch jobs, high-volume single-purpose agents, latency-sensitive paths, and any pipeline where the task scope is fixed and well defined.
How do I reduce MCP server token overhead?
Field projection on responses, selective tool registration, and lazy loading of specialized tools. For extreme cases, a hybrid approach works: MCP for the orchestration layer, direct API for the hot path inside each tool.
If you want to see how this fits into a production agent architecture, I cover the full decision framework in my post on production architecture.
If you want help building your own MCP setup without the token tax eating your budget, that's the kind of work I take on.
Drop a comment if you've run your own numbers. Curious what the 4x to 32x range looks like in your production setup.
