Add a 30-tool MCP server to your Claude agent and watch the first request balloon by 6,000 input tokens — before the user has typed a word. Now run a 12-turn agent loop. Those 6,000 tokens get reprocessed on every single turn, because tool definitions live in the system prompt and the system prompt is resent with every API call. That's the tool-use tax, and most agent codebases pay it at full price.
This post is about the token economics of tool use in LLM agents: where the cost actually lives, why it grows faster than you expect across a multi-turn loop, and the three levers — schema pruning, prompt caching, and parallel tool calls — that bring it back under control.
TL;DR
- Tool definitions are input tokens, charged every turn. Each tool's name, description, and JSON Schema is serialized into the request and re-tokenized on every call in the agent loop — not once per session.
- Cost grows roughly quadratically over a loop. Tool defs are fixed per turn, but conversation history (including verbose tool results) accumulates, so total tokens processed across N turns scale with N².
-
Cache the static prefix. Put a
cache_controlbreakpoint after your tool definitions so they're billed at the cheaper cache-read rate on turns 2…N instead of full input price. - Parallel tool calls cut round-trips, not tokens. They collapse independent calls into one assistant turn, removing whole reprocessing passes — the single biggest latency win.
- Prune tools per phase. A retrieval agent doesn't need write tools loaded. Fewer schemas in context means fewer tokens and better tool-selection accuracy.
Where does the token cost of tool use actually live?
It lives in the system prompt. When you pass a tools array to the Anthropic Messages API (or any function-calling API), the runtime serializes each tool's name, description, and input_schema into a structured block that is prepended to the model's context as input tokens. The model never sees your Python dict — it sees a tokenized rendering of the schema.
A single tool with a thorough description and a nested JSON Schema can run 150–300 tokens. A real MCP server exposing 20–40 tools — each with enum lists, parameter descriptions, and required-field arrays — easily clears 5,000–8,000 tokens of pure schema. That cost is paid before the conversation starts and is identical whether the model calls zero tools or all of them.
Here's a compact tool definition. Notice how much of the token weight is prose and schema, not logic:
tools = [
{
"name": "search_orders",
"description": (
"Search the orders database. Use this when the user "
"asks about order status, history, or refunds. Returns "
"up to 50 matching orders sorted by date descending."
),
"input_schema": {
"type": "object",
"properties": {
"customer_id": {"type": "string", "description": "UUID of the customer"},
"status": {
"type": "string",
"enum": ["pending", "shipped", "delivered", "refunded", "cancelled"],
"description": "Filter by order status",
},
"since": {"type": "string", "format": "date", "description": "ISO 8601 date lower bound"},
},
"required": ["customer_id"],
},
},
# ... 19 more tools
]
Every description string and every enum value is tokens the model reads on each turn. Multiply by 20 tools and a 12-turn loop and you're reprocessing the same schemas 12 times.
Why does tool-use cost grow quadratically across an agent loop?
Because the API is stateless: each turn resends the entire prior conversation. Tool definitions are a fixed cost per turn, but the conversation history is not — it grows with every assistant message and every tool_result you append.
Walk the loop. Turn 1 sends [system + tools + user]. The model returns a tool_use block. You execute it and append the result. Turn 2 sends [system + tools + user + assistant tool_use + tool_result]. The model calls another tool. Turn 3 sends all of that plus two more blocks. By turn N, you're sending a context that has grown linearly — and you've sent it N times. Sum the input tokens across the whole loop and the total scales with N², the classic cost of a stateless conversation that accumulates state.
Tool results are the accelerant. A search tool that dumps 40 rows of JSON, or a file-read tool that returns 800 lines, injects thousands of tokens into the history — and those tokens ride along on every subsequent turn. A loop where each tool returns 2,000 tokens of results doesn't just cost 2,000 tokens; it costs that 2,000 again on every turn that follows it.
This is why "just add more tools and let the model figure it out" is a budget trap. The schema cost is the floor; the accumulating-results cost is the slope.
How do you cache tool definitions so you stop paying for them every turn?
Put a cache_control breakpoint at the end of your tool definitions. Tool schemas are the most static part of your request — they don't change between turns — which makes them ideal cache material. Marking the boundary tells the API to store the tokenized prefix and bill turns 2…N at the much cheaper cache-read rate instead of full input price.
The breakpoint goes on the last tool in the array, since the cache covers everything up to and including the marked block:
tools = [
{"name": "search_orders", "description": "...", "input_schema": {...}},
# ... middle tools ...
{
"name": "issue_refund",
"description": "...",
"input_schema": {...},
"cache_control": {"type": "ephemeral"}, # caches all tool defs above
},
]
resp = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
tools=tools,
messages=conversation,
)
The order matters. The cache is a prefix match: everything before the breakpoint must be byte-stable across requests. So your tool array must be in a deterministic order — don't rebuild it from a dict that reorders, and don't interpolate a timestamp or request ID into any description. One reordered tool invalidates the whole cached prefix and you're back to full price.
The economics: cache writes cost a bit more than a normal input token, cache reads cost a fraction of one. For a loop of 3+ turns with a fat tool array, you come out ahead almost immediately. The first turn writes the cache; every turn after reads it.
You can extend the same breakpoint logic to a stable system prompt and even a large, unchanging document prefix — but tool definitions are the highest-leverage place to start because they're both large and perfectly static.
Do parallel tool calls reduce token cost?
Not directly — they reduce round-trips, which is often the bigger win. A model that can emit multiple tool_use blocks in a single assistant turn lets you fan out independent calls instead of serializing them across separate turns.
Consider an agent that needs the weather in three cities. Serialized, that's three full turns: three requests, each resending the growing context, three reprocessing passes. With parallel tool calls, the model emits all three tool_use blocks at once; you execute them concurrently and return all three tool_result blocks in a single user message. One round-trip instead of three. You've eliminated two entire passes over the accumulated context — that's where the latency and token savings come from.
The Anthropic API does this by default when the model judges the calls independent. You control it explicitly:
resp = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
tools=tools,
tool_choice={"type": "auto", "disable_parallel_tool_use": False},
messages=conversation,
)
Set disable_parallel_tool_use: True when calls have dependencies — when tool B needs tool A's output — and you want the model to call one, see the result, then decide. Leave it off for read-heavy fan-out like multi-source retrieval.
One caveat: parallel calls only help when the work is genuinely independent. If you force parallelism on dependent steps, the model guesses at inputs it doesn't have yet, and you get wrong calls you have to discard — which costs more than serializing would have.
When should you prune tools instead of loading them all?
Prune whenever the agent operates in distinct phases, which is almost always. A research phase needs search and read tools; an execution phase needs write and commit tools. Loading both sets at all times pays double schema cost and — more importantly — degrades tool-selection accuracy.
This is the part teams underweight. Beyond raw tokens, every extra tool is a distractor in the selection space. Give a model 40 tools when 6 are relevant and you measurably raise the rate of wrong-tool and malformed-argument calls, because the correct choice is buried among near-duplicates. Trimming the array to the 6 tools the current phase needs improves both cost and correctness at once.
Practical patterns:
-
Phase-scoped tool sets. Swap the
toolsarray as the agent transitions states. A planner exposesread_*andsearch_*; an executor exposeswrite_*andrun_*. -
tool_choiceto force or forbid. Use{"type": "any"}to require the model to call some tool (good for a routing step),{"type": "tool", "name": "..."}to force a specific one, and{"type": "none"}to forbid tools when you want a plain text turn. - Compress tool results before appending. The accumulating-results problem is yours to manage. Summarize or truncate a 2,000-token search dump down to the rows that matter before it joins the history, so it doesn't ride along on every future turn.
Putting the levers together
A production agent loop should: load only the phase-relevant tools, cache the tool-definition prefix with a cache_control breakpoint, allow parallel tool calls for independent fan-out, and compress tool results before they enter the history. Each lever attacks a different part of the cost curve — schema floor, per-turn reprocessing, round-trip count, and the accumulating slope. Skip any one and the others can't fully compensate.
The mental model to keep: tool definitions are not free metadata the model consults on demand. They are input tokens, charged on every turn, sitting in front of a conversation history that grows with each step. Treat them like the recurring cost they are.
So what is the hidden token tax of tool use in LLM agents?
Tool definitions in LLM agents are serialized into the system prompt as input tokens and reprocessed on every turn of the agent loop — not once per session — while conversation history and verbose tool results accumulate on top, so total token cost across an N-turn loop scales roughly with N². You cut it with four levers: prune tools to the phase-relevant set so fewer schemas sit in context, place a cache_control breakpoint after your tool array so the static prefix is billed at the cache-read rate on later turns, enable parallel tool calls to collapse independent calls into a single round-trip, and compress tool results before appending them so they don't ride along on every subsequent turn. The schema cost is the floor; the accumulating results are the slope — manage both.
Top comments (0)