DEV Community

Cover image for Lazy loading MCP tools: which clients support it and how
Ismael Ramos 🚀
Ismael Ramos 🚀

Posted on • Originally published at ismaelramos.dev

Lazy loading MCP tools: which clients support it and how

I've been building an MCP server to expose an internal documentation repo to an AI agent. Nothing exotic — a handful of tools that let the model search docs, read a section, look up a glossary term. The kind of thing you write in an afternoon.

What surprised me wasn't the protocol. It was realizing that the biggest token cost — the tool definitions loaded into context every turn — isn't something my server controls at all. Whether you can avoid it comes down to one feature, lazy loading, and only some clients support it yet.

So this is a note on who does, what to do when yours doesn't, and why the line between server and client decides the whole thing.

What an MCP even is

Skip this if you've built one. If you haven't: MCP (Model Context Protocol) is a standard way to hand an AI agent a set of tools it can call — search this repo, query that API, read this file. Instead of every assistant inventing its own plugin format, they speak one protocol.

There are two pieces, and confusing them is the source of most "why is this slow" questions:

  • The server is your code. It exposes tools and answers when one is called. Your docs MCP is a server.
  • The client (also called the host) is whatever connects to that server, loads its tool definitions into the model's context, and decides how to present them. Claude Code, Cursor, Codex, Devin — those are hosts.

The line between them matters more than it looks. A lot of "can the MCP do X?" turns out to be "does the host do X?" — and your server has no say.

The SDK is the plumbing you don't write

Building a server by hand would mean implementing the whole protocol: how a client says hello, how it asks "what tools do you have?", how it invokes one, what format the messages travel in. That's a lot of fiddly, error-prone wiring.

The official @modelcontextprotocol/sdk does it for you. Think of it as a phrasebook for a very picky language: you don't learn the grammar, you just say "I have a tool that does X" and the SDK handles the conversation.

A minimal server is almost boring:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
import { z } from 'zod'

const server = new McpServer({ name: 'docs', version: '1.0.0' })

server.registerTool(
  'search_docs',
  {
    description: 'Search the documentation by keyword',
    inputSchema: { query: z.string(), limit: z.number().default(30) }
  },
  async ({ query, limit }) => {
    const hits = await search(query, limit)
    return { content: [{ type: 'text', text: hits }] }
  }
)

await server.connect(new StdioServerTransport())
Enter fullscreen mode Exit fullscreen mode

Four moving parts: create the server, register each tool with a schema, pick a transport (here stdio, the standard input/output pipe the host talks through), connect. Everything else — turning your Zod schemas into the JSON Schema the client sees, validating inputs and outputs — is the SDK's job. The lines that are yours are the handler bodies. The rest is scaffolding.

One detail worth knowing: the SDK version decides which features exist. Mine declares ^1.12.0 but resolves to 1.29.0 in practice, so newer capabilities like structured output (outputSchema) just work. If yours is old, a feature can simply not exist.

Two costs, and only one is yours to fix

The mental model that reorganized how I think about MCP design is that a server charges you context on two axes:

Fixed cost: the tool definitions, loaded into context on every single turn.
Variable cost: the size of each response a tool returns.

The variable cost is the easy one — it lives entirely inside your handlers. Cap the output, paginate, let the model read one section instead of a whole file, return an index before dumping a list. It's the discipline of treating context like a slow network response: don't send the whole table when the client asked for one row. I went through how to cut both costs in detail separately, so I won't repeat the tactics here. The same instinct shows up in keeping AI sessions lean instead of letting them bloat.

The fixed cost is the interesting one, because it's the part your server can't fully control.

Two context costs of an MCP server: fixed tool definitions and variable responses

The fixed cost is the host's decision

The fixed cost — the tool definitions sitting in context every turn — is paid whether or not the model ever calls a tool. Six small tools with short descriptions might be 600–800 tokens. Ten servers with dozens of tools between them can reach 40k+ tokens, gone before the model reads your first message.

And here's the part that catches people off guard: how those definitions get loaded is largely the host's decision, not yours.

The dream solution is lazy loading: don't load all the tool definitions up front, load a tiny search tool, and let the model discover the rest on demand. Catalog of 40 tools, but the model only pays for the two it actually uses.

Your server can't impose lazy loading; the host has to support it. So the honest question is who does, today (mid-2026) ?

Generated image: Comparativa de soporte de carga perezosa

Claude Code is the one that genuinely does it: tools get marked to defer loading, and a search tool surfaces them on demand. Everyone else either caps you or loads everything, the way they always have.

This whole area is new and moving month to month. Treat any specific row above as "check it in your version," not gospel.

If your host doesn't lazy-load: the proxy trick

Say you're on Devin or Cursor and the fixed cost is becoming real. Your server can't change the host's behavior — but you can slip a lazy proxy between them.

The idea: point the host at one proxy server instead of your real servers. The proxy exposes two meta-tools — search_tools and execute_tool — and loads the real definitions only when asked. Your actual servers don't change at all; they move into the proxy's config.

Before, the host loads every definition from every server:

{
  "mcpServers": {
    "docs":   { "command": "node", "args": ["/…/server.js"] },
    "github": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-github"] }
  }
}
Enter fullscreen mode Exit fullscreen mode

After, the host sees only the proxy, and context drops from "every definition, always" to "two tools plus whatever the agent asks for":

{
  "mcpServers": {
    "lazy": { "command": "npx", "args": ["-y", "lazy-mcp", "serve"] }
  }
}
Enter fullscreen mode Exit fullscreen mode

Tools like lazy-mcp (and similar projects) do exactly this. But be clear-eyed about the tradeoff: once every call goes through execute_tool, the host's per-tool permission matchers stop working — it only ever sees mcp__lazy__execute_tool, not mcp__docs__get_doc. You trade fine-grained permission control for a smaller context. Worth it when you're juggling several servers at once. Not worth it for one small server.

And the proxy is very new tooling. It solves a real problem, but it's the kind of thing you reach for deliberately, not by default.

What I'd actually do as it grows

The lazy proxy shines when many servers are connected at once. A single server that grows has a cleaner fix that doesn't need a proxy at all:

  • At ~6–15 tools: do nothing special. The fixed cost is small. Don't add machinery to solve a problem you don't have.
  • At many tools: consolidate into router tools inside your own server. Instead of 25 separate tools, one docs tool with an op parameter: "list" | "get" | "search" | "glossary". Fewer schemas, lower fixed cost, and you control it 100%. The tradeoff is denser descriptions and slightly worse discoverability for the model.
  • Bring in the proxy only when the pain is the aggregate of several servers, not one server in isolation.

The clean long-term path is a standardized tools/search capability the server declares and the host honors. When that lands, lazy loading finally becomes something your server can opt into instead of waiting on the client. Until hosts support it, it's not actionable — so I'm not building for it yet.

This is the part that keeps surprising me about MCP-driven workflows: the protocol is easy, but the most important optimization for your server's footprint isn't in your server at all.

So before you spend an afternoon shaving tokens off your tool definitions, check what your client does with them. On Claude Code, lazy loading already handles it. Everywhere else, the lever is the proxy — or, when your own server is the one bloating, folding tools into a router. Match the fix to where the cost actually lives.

Top comments (0)