I Measured My MCP Token Overhead. The Numbers Are Worse Than I Expected.

#ai #llm #agents #llmtools

Someone on Reddit posted that their MCP setup consumed 67,000 tokens before they typed a single question. I didn't believe them — until I measured my own. Then I spent a week trying to figure out why, and what to do about it.

Here's what I found.

The Setup

I was building a simple code review agent. Read a PR description, check the diff, leave a comment. I wired it up with four MCP servers: GitHub, a vector store, a Slack notifier, and a database inspector. Clean, production-ish, nothing exotic.

Baseline (no MCP): 480 tokens per request.
With all four servers active: 11,200 tokens.

That's a 23x overhead. For a code review agent. Running hundreds of times a day.

I went digging.

Where the Tokens Go

MCP's token cost isn't one thing — it's four things stacking on top of each other.

1. Session initialization
Every MCP session starts with a handshake. Tool schemas get sent to the model at the start of every conversation context. The model needs to know what tools exist before it can call any of them.

The MCP spec repo has an issue where someone measured roughly 1,000 tokens of overhead per tool in a session. A 53-tool MCP server like agentmemory isn't adding 53 tools to your context — it's adding 53,000 tokens.

2. Tool schema inflation
This is the one that caught me off guard. A GitHub MCP server doesn't just expose get_pr() as a function call. It sends the full OpenAPI schema: descriptions for every parameter, type annotations, enum values, nested object structures.

Anthropic's documentation specifically calls this out: models perform better when tool schemas are tight and precise. But MCP servers are written for generality, not token efficiency. The result is schemas that are 5-10x larger than what you'd write by hand for a specific task.

3. Tool result formatting
When an MCP tool returns, the result goes back into your context as a structured message. If the tool returns a full API response (not just the relevant fields), you're stuffing a lot of noise into the context window. A database inspection tool that returns 47 columns when you only needed 3 — that's wasted tokens on every call.

4. Context window compounding
This is the one that wrecked my budget. When you're in a multi-turn conversation, the MCP schema and every tool result stays in context for the entire session. Each turn adds more. A 30-message conversation with 4 MCP servers active can easily accumulate 300k+ tokens of overhead — most of it from the tools the model never actually called.

The Benchmark

OnlyCLI published benchmarks in 2026 comparing MCP to direct CLI for the same operations. The results:

Simple repo metadata check: CLI 1,365 tokens vs MCP 47,000 tokens (~34x)
File search operation: CLI 890 tokens vs MCP 12,400 tokens (~14x)
Multi-tool workflow: CLI 3,200 tokens vs MCP 89,000 tokens (~28x)

The ratio varies by operation complexity, but it's consistently 4x to 35x. The simpler the operation, the worse the ratio — because the fixed MCP overhead dominates.

What I Actually Changed

First: measured before anything else.
I added token counting to my agent loop. Every request logs: input_tokens, output_tokens, mcp_tools_called, mcp_results_size_bytes. This sounds tedious but it's one function that runs in one place.

import anthropic
from anthropic import Anthropic

client = Anthropic()

def count_mcp_tokens(messages, tools):
    """Rough token estimate for MCP-augmented calls."""
    # Count raw message content
    text_tokens = client.count_tokens(
        "".join(m.get("content", "") for m in messages 
                if isinstance(m.get("content"), str))
    # Add tool schemas (rough: ~4 chars per token)
    tool_tokens = sum(len(str(t)) // 4 for t in tools)
    return {"text_tokens": text_tokens, "tool_schema_tokens": tool_tokens}

Second: registered tools lazily, not upfront.
Instead of loading all 4 MCP servers at session start, I load them when the model first expresses intent to use them. This shifts the initialization cost to only the servers you actually need — and often, you only need one.

Third: trimmed schemas aggressively.
Most MCP servers let you configure which tools to expose. I went from 23 tools to 6 per server. The model still picks the right tool — because it was picking the wrong one before, when it had too many options.

Fourth: paginated tool results.
Instead of asking the database inspector for "all recent transactions," I ask for "top 10 by amount, descending." The model learns this pattern. It's better at asking for exactly what it needs when what it needs is explicit.

The Numbers After

After these changes:

Baseline (no MCP): 480 tokens (unchanged)
With 2 of 4 servers, trimmed schemas: 3,100 tokens (was 11,200)
With lazy loading + pagination: 1,850 tokens
Ratio: ~4x instead of 23x

It's still overhead. But 4x is manageable. 23x was burning through my context budget before lunch.

What I'd Tell Someone Starting Fresh

MCP is worth it. The ecosystem is real — 13,000+ servers, AWS going GA in 2026, 97M monthly downloads. The tooling wins. But go in with your eyes open: