Your MCP Servers Are Burning Tokens Before You Type a Word

#mcp #ai #claudecode #productivity

I counted them last week: 47 MCP tools wired into one of my agent sessions. Not called — just loaded. Every tool's full JSON schema, sitting in the system prompt, before I'd typed a single character.

I ran the math. Each tool schema (name, description, parameter shapes, examples) averages 150-400 tokens depending on how chatty the description is. 47 tools landed around 11,000 tokens of pure overhead. That's context I paid for and the model read on every single turn, whether or not I ever touched half those tools.

This is the quiet cost nobody budgets for. Everyone worries about a tool call that dumps a 50KB log file into context. Fewer people notice that the menu of tools was already expensive before anyone ordered anything.

Why this happens

MCP servers advertise their full tool list upfront by design — the protocol wants the model to know what's available. But "available" and "loaded in full detail" don't have to be the same thing. A send_slack_message tool and a query_datadog_metrics tool both get their complete parameter schemas injected even in a session where you only ever use git_status.

Stack a few servers together — GitHub, Slack, a database, a design tool, your own internal ones — and you're not looking at 10 tools anymore, you're looking at 60-100. I've seen sessions where tool definitions alone were pushing 15-20% of the entire context budget, and that's before the actual conversation starts.

The fix: defer the schema, not the tool

The pattern that actually worked for me is one I lifted from how my own harness handles this: list tools by name and one-line description only, and fetch the full schema on demand.

Concretely, instead of this landing in context for every tool at session start:

{
  "name": "mcp__github__create_pull_request",
  "description": "Create a pull request on GitHub with title, body, base/head branches...",
  "parameters": {
    "type": "object",
    "properties": {
      "owner": { "type": "string", "description": "..." },
      "repo": { "type": "string", "description": "..." },
      "title": { "type": "string", "description": "..." },
      "body": { "type": "string", "description": "..." },
      "base": { "type": "string", "description": "..." },
      "head": { "type": "string", "description": "..." },
      "draft": { "type": "boolean", "description": "..." },
      "maintainer_can_modify": { "type": "boolean", "description": "..." }
    }
  }
}

you get a single line:

mcp__github__create_pull_request

and a lightweight search tool sits alongside the deferred list:

def tool_search(query: str, max_results: int = 5) -> list[dict]:
    """Match query against deferred tool names/descriptions.
    Returns full JSON schemas only for matches — not the whole registry."""
    candidates = index.search(query, limit=max_results)
    return [registry[name].full_schema() for name in candidates]

The model calls tool_search("select:mcp__github__create_pull_request") exactly once, gets the full schema back, and only that tool's definition enters context — for the rest of the session, not just that turn. Everything else stays as a name on a list.

In practice this took a session with ~80 deferred tool names (roughly 3-4 tokens each, call it 300 tokens for the whole index) down from an estimated 18,000 tokens of eagerly-loaded schemas to under 1,000 for the tools actually used. That's not a marginal win — that's the difference between "half my context window is tool definitions" and "tool definitions are a rounding error."

What this doesn't fix

Deferred loading doesn't help if you're going to use every tool in the session anyway — you'll pay the same total cost, just spread across turns instead of upfront. It's a win specifically because most sessions touch a small fraction of the tools that are technically "available." If your workflow genuinely needs 40 of your 47 tools in one conversation, eager loading and deferred loading converge.

It also doesn't fix chatty tool outputs — that's a separate problem (sandboxed execution, output truncation, summarize-before-return) and a separate fix. Schema bloat and output bloat are two different token sinks; solving one doesn't touch the other.

The actual takeaway

If you're building or configuring an MCP setup, ask the boring question before the exciting one: not "what tools does the model need to call," but "what does the model need to know exist by default, versus what can it look up." Most tool catalogs answer that question with "everything, always" because it's the path of least resistance for the server author. It's also the path of least resistance for burning your context budget on a menu nobody's reading.

The fix isn't clever. It's a name, a description, and a search function standing between the model and the full schema. But it's the difference between paying for 80 tools and paying for the 3 you actually called.

DEV Community

Your MCP Servers Are Burning Tokens Before You Type a Word

Why this happens

The fix: defer the schema, not the tool

What this doesn't fix

The actual takeaway

Top comments (0)