Loading Tool Schemas on Demand Is How Agents Scale

#aiagents #platformengineering #mcp #contextengineering

An agent connected to five MCP servers can burn 55,000 tokens on tool schemas before it reads the first word of your request. GitHub's server ships 35 tools at roughly 26K tokens. Slack adds 11 more at about 21K. Sentry, Grafana, and Splunk tack on a few thousand each. Add Jira, about 17K on its own, and you are past 70K and heading for 100K with the next server or two. Anthropic measured 134K tokens of tool definitions in its own internal setup before it optimized anything. None of that is work. It is the bill for capabilities the agent might use, paid in full on every request.

That is importing your whole dependency tree to call one function. No engineer would defend that pattern in code. We have lazy loading, tree shaking, and dynamic imports precisely because paying for everything up front does not scale. Agents need the same move, and as of late 2025 the platform layer supports it: schemas load on demand, when the work actually calls for them.

The Upfront Bill Has Two Line Items

The token cost is the visible half. The hidden half is accuracy. The Claude API documentation is blunt about it: tool selection accuracy degrades significantly once an agent has more than 30 to 50 tools in front of it. Two tools named notification-send-user and notification-send-channel, sitting in a pile of 58 definitions, are a recipe for the model grabbing the wrong one. The context is not just expensive. It is noisy, and the noise makes the model worse at its job.

So the upfront-loading pattern costs you twice. You pay tokens for schemas you will not use this turn, and you pay accuracy because the schemas you will use are buried in the ones you will not.

Deferred Loading Flips the Default

Anthropic shipped the Tool Search Tool and defer_loading on the Claude Developer Platform in November 2025. The mechanics are simple. Tools marked defer_loading: true stay out of the context window entirely. The model gets one small search tool — about 500 tokens — and when the task needs a capability, it searches, the matching schemas load, and work proceeds. A typical request discovers three to five tools, around 3K tokens. The full library stays reachable. Almost none of it is resident.
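A tools list built this way might look like the sketch below. It only assembles the payload and never calls the API. The defer_loading flag is the one described above; the search tool's type string is an assumption to check against the current API reference, and every tool name and schema here is made up for illustration.

```python
# Sketch of a tools list with deferred loading. Tool names and schemas are
# hypothetical; the search tool's type string is an assumption to verify.

def make_tool(name, description, deferred):
    """Build one tool definition; deferred tools stay out of the initial context."""
    tool = {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
    }
    if deferred:
        tool["defer_loading"] = True
    return tool

# The one small tool the model always sees.
SEARCH_TOOL = {
    "type": "tool_search_tool_regex_20251119",  # assumed identifier
    "name": "tool_search_tool_regex",
}

tools = [
    SEARCH_TOOL,
    # A core tool used on nearly every request stays resident.
    make_tool("tickets_get", "Fetch a ticket by ID", deferred=False),
    # Everything else is discoverable but not resident.
    make_tool("github_create_pull_request", "Create a pull request", deferred=True),
    make_tool("slack_post_message", "Post a message to a channel", deferred=True),
]

resident = [t["name"] for t in tools if not t.get("defer_loading")]
deferred = [t["name"] for t in tools if t.get("defer_loading")]
print("resident:", resident)
print("deferred:", deferred)
```

Keeping one or two high-traffic tools resident and deferring the rest is a design choice, not a rule: the tools you call on almost every turn are not worth a search round-trip.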

The before-and-after numbers make the case. A setup with more than 50 MCP tools that consumed roughly 77K tokens of context before any work now consumes about 8.7K, which leaves the agent roughly 95% of its context window for the actual task. Deferred tools also stay out of the initial prompt, so prompt caching survives instead of getting invalidated every time the tool roster changes.
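The arithmetic is easy to check. The sketch below uses the two figures quoted above; the 200K context window is an assumption, so swap in the window size of the model you actually run.

```python
# Figures from the text above; the 200K window size is an assumption.
CONTEXT_WINDOW = 200_000
before = 77_000   # tokens consumed before any work, upfront loading
after = 8_700     # tokens consumed before any work, with tool search

reduction = 1 - after / before          # share of the overhead removed
window_free = 1 - after / CONTEXT_WINDOW  # share of the window left for the task

print(f"overhead reduction: {reduction:.1%}")
print(f"window left for the task: {window_free:.1%}")
```

Under that assumption the overhead shrinks by a little under 89%, and the resident footprint is under 5% of the window, which is where the "roughly 95%" figure comes from.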

The part worth dwelling on is what happened to accuracy. On MCP evaluations with large tool libraries, Opus 4 went from 49% to 74% with tool search enabled. Opus 4.5 went from 79.5% to 88.1%. Trimming the working set did not cost capability. It removed distraction. A model choosing between four relevant tools outperforms the same model choosing between 58 mostly irrelevant ones.

The Extreme Case Is Reading Tool Definitions Like Files

Anthropic's work on code execution with MCP pushes the principle to its limit. Instead of presenting tools as schemas at all, the agent gets a filesystem: each server is a directory, each tool a file, and the agent reads the definition only when it needs that tool. A workflow that cost 150,000 tokens dropped to 2,000, a 98.7% reduction.
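Here is a minimal sketch of that idea, assuming a hypothetical servers/<server>/<tool>.json layout. The directory names, tool names, and JSON shape are all invented for the example; the point is only that listing names is cheap and reading a definition is deferred until a tool is needed.

```python
import json
import tempfile
from pathlib import Path

def build_demo_tree(root: Path) -> None:
    """Create a hypothetical servers/<server>/<tool>.json layout."""
    demo = {
        "google-drive": {"get_document": {"description": "Fetch a document by ID",
                                          "params": {"document_id": "string"}}},
        "salesforce": {"update_record": {"description": "Update a CRM record",
                                         "params": {"record_id": "string", "data": "object"}}},
    }
    for server, tools in demo.items():
        server_dir = root / server
        server_dir.mkdir(parents=True)
        for tool, definition in tools.items():
            (server_dir / f"{tool}.json").write_text(json.dumps(definition))

def list_tools(root: Path) -> list[str]:
    """Cheap discovery: names only, no file contents are read."""
    return sorted(f"{p.parent.name}/{p.stem}" for p in root.glob("*/*.json"))

def read_tool(root: Path, qualified_name: str) -> dict:
    """Expensive step, done only for the tool the task actually needs."""
    server, tool = qualified_name.split("/")
    return json.loads((root / server / f"{tool}.json").read_text())

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "servers"
    build_demo_tree(root)
    names = list_tools(root)                                  # directory listing
    definition = read_tool(root, "salesforce/update_record")  # one file read

print(names)
print(definition["description"])
```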

The same post names the other half of the context tax: intermediate results. A two-hour meeting transcript fetched by one tool and passed to another is roughly 50K tokens moving through the model twice, even when the model never needed to see it. On-demand discovery for schemas and code-level handling for intermediate data attack the same problem from both ends. The model's context holds what the model needs to reason about, and nothing else rides along.
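To make the intermediate-results point concrete, here is a sketch with two stub functions standing in for tool calls. Both stubs, their names, and the summary shape are hypothetical. What matters is that the large payload moves between tools inside the execution environment and only a small summary is returned to the model.

```python
# Stubs standing in for two MCP tool calls; both are hypothetical.
def fetch_transcript(meeting_id: str) -> str:
    """Pretend to fetch a long transcript (here: repeated filler text)."""
    return "speaker: we agreed to ship on Friday. " * 5000

def attach_note(record_id: str, text: str) -> int:
    """Pretend to write the text to a record; returns characters stored."""
    return len(text)

def copy_transcript(meeting_id: str, record_id: str) -> dict:
    """Move the transcript tool-to-tool in code, outside the model's context."""
    transcript = fetch_transcript(meeting_id)
    stored = attach_note(record_id, transcript)
    # Only this small summary goes back to the model.
    return {"record_id": record_id, "chars_stored": stored, "preview": transcript[:40]}

summary = copy_transcript("meeting-123", "record-456")
print(summary)
```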

Skills Apply the Same Principle to Procedure

Tool schemas are not the only capability payload. Procedures are too — the how-to knowledge for filling out a brand template, processing a spreadsheet, following a deploy checklist. Agent Skills, shipped October 2025, package that knowledge as folders of instructions, scripts, and resources that the model loads when needed, in the same format across Claude apps, Claude Code, and the API.

The loading model mirrors deferred tools exactly. Anthropic's engineering write-up names the design principle: progressive disclosure. At startup, only each skill's name and description enter the system prompt — tens of tokens per skill. The full instruction body loads when the task matches. Bundled files load only if the instructions reference them. A library of fifty skills costs you a table of contents, not fifty manuals.
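A rough sketch of progressive disclosure, assuming a skill is a folder containing a SKILL.md file with a name and description header. The skill content and the hand-rolled header parser are illustrative only, not how any Claude product actually loads skills.

```python
import tempfile
from pathlib import Path

# Hypothetical skill file: a short header, then the full instructions.
SKILL_TEXT = """---
name: deploy-checklist
description: Steps to follow before and after a production deploy.
---
# Deploy checklist
1. Confirm the build is green.
2. Announce the deploy window.
"""

def read_frontmatter(skill_file: Path) -> dict:
    """Read only the name/description header, stopping at the closing '---'."""
    meta = {}
    with skill_file.open() as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def read_body(skill_file: Path) -> str:
    """Load the full instructions, only once a task matches the skill."""
    return skill_file.read_text().split("---", 2)[2].lstrip()

with tempfile.TemporaryDirectory() as tmp:
    skill_file = Path(tmp) / "deploy-checklist" / "SKILL.md"
    skill_file.parent.mkdir()
    skill_file.write_text(SKILL_TEXT)

    index = read_frontmatter(skill_file)  # startup: the table of contents entry
    body = read_body(skill_file)          # later: the matched skill's manual

print(index)
print(body.splitlines()[0])
```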

Schemas on demand cover what the agent can call. Skills on demand cover what the agent knows how to do. Together they mean an agent's capability surface can grow without its per-request context bill growing with it. That decoupling is the scaling property. Capability count and context cost used to be the same axis. Now they are two.

When Upfront Loading Is Still Fine

Honest caveat: deferred loading adds a discovery round-trip. The agent has to search before it can call, and that costs latency. Anthropic's own guidance is that tool search pays off when definitions exceed roughly 10K tokens or when you see the model picking wrong tools. Below that, load everything up front and skip the indirection. A five-tool agent does not need a search layer any more than a 200-line script needs a plugin architecture.
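That guidance reduces to a two-input check. The function below is a hypothetical helper, not part of any SDK; the 10K threshold is the one quoted above, and the wrong-tool signal is whatever your own evals or logs tell you.

```python
# Threshold taken from the text; the function and its inputs are hypothetical.
DEFINITION_TOKEN_THRESHOLD = 10_000

def should_use_tool_search(definition_tokens: int, wrong_tool_picks_observed: bool) -> bool:
    """Defer when definitions are large or the model is visibly picking wrong tools."""
    return definition_tokens > DEFINITION_TOKEN_THRESHOLD or wrong_tool_picks_observed

print(should_use_tool_search(2_500, False))   # small library, no symptoms
print(should_use_tool_search(55_000, False))  # large library
print(should_use_tool_search(4_000, True))    # small library, wrong picks seen
```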

The pattern matters when the library is large and the per-task working set is small. That describes most serious agent deployments — and the gap between library size and working-set size only widens as teams connect more systems.

This Is a Platform Contract, Not a Trick

The platform engineering field spent 2025 asking what changed and closed the year asking what agents need from infrastructure. The year-in-review discussion put agentic developer platforms at the top of the 2026 agenda, with the open question being which interfaces and what infrastructure agents actually require. One panelist's warning fits here: an agentic platform still needs a product mindset — AI is not going to save the day on its own.

On-demand capability loading is one concrete answer to that question. The contract between an agent and its platform should look like the contract between a program and its module system: declare what exists cheaply, resolve what is needed lazily, and never charge the running process for the parts of the library it did not touch. Teams wiring up agents today should treat the 30-to-50-tool threshold as a design constraint, mark everything past the core set as deferred, and ship procedures as skills with one-line descriptions. The agents that scale will be the ones that pay for context the way good code pays for dependencies: only when called.
