Sumedh Bala

Posted on Jun 26 • Edited on Jun 27

Claude Code Costs, Act I — How the billing actually works

#ai #claude #llm #programming

📊 New here? Scroll through the slides above for a quick visual overview of this act — then dive into the details below.

💬 Spot something unclear or wrong? This guide aims to be precise. If any part could be explained better, or you think a claim is off, please leave a comment — I read every one and fix what's broken.

👋 If you want more on AI-native engineering, follow me on LinkedIn.

How billing actually works, the ecosystem of options to spend less, and the mistakes that quietly cost you money — grounded in measurement, not folklore.

This guide teaches the cost model of Claude Code from first principles. It is written for engineers who want to predict their bill, lower it deliberately, and recognize the anti-patterns before they hit production. Every non-obvious claim carries a provenance tag — [measured] (observed on the wire), [docs] (from Anthropic's published documentation), or [doc-confirmed] (measured and matching the docs) — and the table or experiment that supports it.

The best way to use this guide: keep a Claude session open on the side as you read. When something isn't clear, paste it in and ask Claude to explain it or to dig deeper. Better still, don't take the numbers on faith — ask Claude to reproduce the experiments on your own setup, walk you through what the results mean, and show you the raw request/response logs behind each table. The provenance tags exist so you can verify everything here yourself; treat this guide as a starting point for that conversation, not the last word.

What you'll be able to do after reading this

Explain, in one sentence, why a Claude Code conversation gets more expensive with every turn — and why that is not a bug.
Look at a proposed change (a new MCP server, a model switch, a "compression" proxy) and predict whether it helps or hurts your bill, before you ship it.
Read a usage block and tell the difference between a warm session and one that is silently rebuilding its cache every turn.
Choose the right cost-reduction tool for the cost line you actually want to attack — and avoid the popular tools that make things worse.
Recognize the dozen most expensive mistakes and apply the fix for each.

How to read this

The guide is built around four mental models. Learn them and the rest is derivable:

The amnesiac contractor — the server remembers nothing; the request is the only state.
The prefix-match cache — caching is a strict byte-prefix match; one early byte change invalidates everything after it.
The model-scoped key — a cache entry belongs to exactly one model; another model can't read it.
The encrypted envelope — the model's reasoning rides back to you sealed; you carry it but can't open it.

The guide is in four acts mapping to what you're trying to do: Act I — how billing works; Act II — where the big hidden costs are (model switching and thinking blocks); Act III — the ecosystem of options for spending less; Act IV — the consolidated mistakes catalogue and a one-page cheat sheet.

A note on how this was measured

None of this relies on internal access. Claude Code honors the ANTHROPIC_BASE_URL environment variable, so it can be pointed at a plain-HTTP local reverse proxy that logs each /v1/messages request body and forwards it to https://api.anthropic.com. There is no TLS interception: the OAuth Authorization: Bearer token and anthropic-beta headers pass through untouched; the proxy rewrites only the JSON model field when an experiment calls for it. The proxy parses the streamed (SSE) response for the usage block and counts cache_control markers. A replay harness re-sends real captured requests verbatim — genuine headers — under single-variable variations, so each finding isolates one cause.

The honest constraint throughout: the API exposes no field that announces "thinking was dropped" or "your cache was busted." Every conclusion here is inferred from HTTP status codes, token-count deltas between near-identical requests, and inspection of the response. Where a claim is inferred rather than directly reported, it says so.

Measurements were taken against Claude Code 2.1.150 (OAuth auth, mid-2026) on Sonnet 4.6, Opus 4.7/4.8, and Haiku 4.5. Fable 5 and the Mythos family were gated (HTTP 404) on the test host, so claims about them are documentation-only. Prices and exact behaviors are version-specific — re-verify them for your own models and client version.

Billing-rate caveat, stated once and used throughout. Claude Code's self-reported total_cost_usd (what /cost and the status line show) can under-report cache-write cost: it prices the 1-hour-TTL writes it sends at the 5-minute 1.25× rate. Anthropic's published rate for a 1-hour-TTL write is 2× base input — the rate this guide uses everywhere. So expect your displayed session cost to read lower than the figures here. Treat the Anthropic Console as authoritative for actual billing. This gap is itself one of the mistakes in Act IV.

The entire cost model falls out of one architectural fact. Get this fact and the three cost buckets, and you can already predict most of your bill.

The one fact: Claude Code is a stateless API client

Claude Code is a client of Anthropic's /v1/messages HTTP API. It holds no privileged channel to the model and no special server-side session. Every turn is one complete HTTP request that carries the entire context the model will see: the tool definitions, the system prompt, and every message in the conversation so far. The server generates a response and then forgets everything — no session, no memory, no "conversation" object. The next turn re-sends the whole thing again, one turn longer.

That single fact — the request is the only state — drives every cost lever in this guide:

Because the client re-sends the full conversation each turn, cost grows with the conversation: you pay to re-process the entire history on every turn.
Because the server is stateless, the only way to avoid re-processing that history is the prompt cache (the rest of Act I).
Because the request is co-designed for one target model, switching models mid-conversation throws the cache away and re-bills accumulated reasoning (Act II).

Mental model 1: the amnesiac contractor

Picture a brilliant contractor with no long-term memory. Every time you consult them, you must hand over a complete dossier: the tools they're allowed to use, their standing instructions, and the entire history of the project so far. They read it, give you excellent advice, and then forget the entire engagement the instant you leave. Next time, you bring the same dossier plus today's new page.

That is exactly the relationship between Claude Code and the API. Everything expensive about Claude Code follows from "the dossier gets re-read, in full, every single time."

The proof: the server recalls nothing [measured]

Two requests, same session design, isolating whether the server remembers a fact across turns:

request sent	model's answer
"my secret is 42" + acknowledgement + "what's my secret?" (the fact is in the payload)	"42"
only "what's my secret?" (the prior turn omitted from the payload)	"I don't have a secret number stored in memory…"

The model knows only what the current request carries. When the prior turn is omitted from the request body, the secret is gone — there is no session to recall it from. The "memory" you experience in a chat is an illusion maintained entirely by the client re-sending history.

Thinking is physically resent, too [measured]

It isn't just visible messages. A captured Opus continuation shows message [17] as ['thinking(sig=yes, len=0)', 'text', 'tool_use(Grep)'], followed by [18] as a tool_result. The thinking block is in the request body — the client is sending the model's own prior reasoning back to it. (What those blocks contain, how they're billed, and what happens to them on a model switch is the whole of Act II's second half. For now: they are part of the dossier, and parts of the dossier cost money.)

A subtlety worth internalizing early: resending thinking is not strictly mandatory in this configuration. Replaying the captured continuation with thinking blocks (a) kept, (b) stripped from the last assistant turn, and (c) stripped everywhere all returned HTTP 200 [measured]. The server doesn't error and doesn't substitute a remembered copy — because it has none. Resending thinking is how the client gives the model reasoning continuity; it is not how the server maintains state. The server maintains no state.

The render order: stable content first, volatile content last

Every request is assembled in this order:

tools  →  system  →  messages

Tool definitions sit at byte 0, the system prompt next, and the conversation last. This isn't cosmetic. It is the single most important layout decision for cost, and it follows one rule:

Stable content first, volatile content last.

You'll see why this ordering is load-bearing the moment we get to caching: the cache is a prefix match, so the things that change least must come first, or they'll keep getting invalidated by the things that change most.

One more piece of anatomy, because it decides cache behavior later. The messages tier is an ordered list of messages — conversation turns, each with a role and some content — and each message's content is itself an ordered list of typed content blocks: a text block, a tool_use block (the model calling a tool), a tool_result block (your answer to it), a thinking block, an image. One message can hold many blocks — an assistant turn that fires ten parallel tools is a single message but twenty-plus blocks. Hold onto the message-vs-block distinction: the cache's backward re-link counts blocks, not messages, which is exactly what a big tool burst trips (see Agentic tool bursts overflow the 20-block lookback).

The three things you pay for

Every request's usage block breaks input into three buckets, plus output. Learn what each one costs relative to base input, because that ratio is the whole game:

Bucket (`usage.*`)	Meaning	Price vs. base input
`cache_read_input_tokens`	served from an existing cache entry	0.1×
`cache_creation_input_tokens`	written to the cache this request	2× (1-hour TTL)
`input_tokens`	processed uncached, not cached	1×
`output_tokens`	generated tokens	the model's output rate

Two facts to burn in:

A cache hit is ~10× cheaper than processing the same tokens cold (0.1× vs 1×). This is the prize.
Output is the most expensive class — it's billed at 5× the input price on every model, and it is never cached. Tokens the model writes dominate generation-heavy turns.

The rate card (Anthropic published list prices, per 1M tokens, as of 2026-06-15)

Re-verify at claude.com/pricing. Cache read = 0.1× input; 1-hour cache write = 2× input; output = 5× input.

Model	Input (1×)	Cache read (0.1×)	Cache write, 1h (2×)	Output (5×)
Fable 5	$10.00	$1.00	$20.00	$50.00
Opus 4.8	$5.00	$0.50	$10.00	$25.00
Sonnet 4.6	$3.00	$0.30	$6.00	$15.00
Haiku 4.5	$1.00	$0.10	$2.00	$5.00

The ratios are clean and worth memorizing: Sonnet = 0.6× Opus, Haiku = 0.2× Opus, Sonnet = 3× Haiku. They'll matter when we compare routing options.

Break-even: when does writing to the cache pay off?

Caching is a trade: you pay a one-time expensive write (2× the input price) so that later requests get cheap reads (0.1×). Whether that trade wins depends on how many times you'll re-read the cached prefix.

Process the same prefix over N requests, two ways:

No cache: every request pays full input price → total N × 1×.
With cache: the first request writes (2×); each later request reads (0.1×) → total 2× + (N−1) × 0.1×.

Caching comes out ahead once 2 + 0.1(N−1) < N — i.e. from the 3rd request onward. Worked out:

Over 3 requests: cache costs 2 + 0.1 + 0.1 = 2.2× vs 3× uncached. ✅ cache wins.
Over 10 requests: 2.9× vs 10×. ✅ cache wins comfortably.

Why "3rd request" and not "2nd"? The TTL sets the write price, and this is exactly where Claude Code differs from the raw API. By default Anthropic uses a 5-minute cache TTL, where a write costs only 1.25× — so caching breaks even one request sooner, on the 2nd. But the Claude Code client overrides that default and requests the 1-hour TTL on every breakpoint (measured later in this act), where a write costs 2× — pushing break-even to the 3rd request. This guide uses 2× / 1-hour throughout because that's what Claude Code actually sends.

⚠️ Mistake — paying for a cache you never read back. A wasted write (content cached but never read again) costs 2× — double what you'd have paid had you never cached it. Caching is not free insurance; it's a bet that you'll re-read the prefix at least three times. A short, one-shot interaction that ends after one or two turns can be cheaper without caching.

✅ Fix — Let Claude Code's defaults stand for interactive sessions (they will be re-read many times). Only worry about this in custom harnesses that cache aggressively but terminate early.

One more corollary you'll lean on constantly: every read also refreshes the TTL. A continuously-reused prefix never expires.

What to do (Act I so far): Internalize that cost = (re-processing history) + (output you generate). The cache is the only tool against the first term, and it's a strict prefix match — which is the next mental model.

Mental model 2: the prefix-match cache

This is the model that, once you have it, makes cache behavior obvious instead of mysterious — and it starts one level down, in how the model reads your prompt at all.

What attention actually computes

The model itself — a transformer (the neural-network architecture every modern LLM, Claude included, is built on) — reads your prompt token by token, left to right. A token is roughly a word or a word-piece. For every token the model computes three vectors — a query, a key, and a value — and each token attends to all the tokens before it: its result is a blend of those earlier tokens' values, weighted by how well its query matches their keys.

In plain terms: to handle each word, the model glances back over everything before it and weighs which earlier words matter — the way you look back to figure out what "it" refers to in "I poured water into the glass until it was full." It does that for every token, against every earlier token at once.

Two facts fall straight out of this, and together they explain the entire cache:

The keys and values for the prefix are the expensive artifact. "Processing the input" is mostly computing, for every token, its key/value vectors and its attention over all earlier tokens. The longer the prefix, the more of that work — and because the client re-sends the whole conversation each turn (Mental model 1), a naive server would redo all of it on every turn.
Attention is causal: token N depends only on tokens 0…N. A token's key/value never depend on anything that comes after it, so the computed state for a stable prefix is identical no matter what you later append.

What the cache stores, and what it saves

The prompt cache stores exactly those computed key/value vectors — the "KV cache" — for the prefix tokens, keyed to the exact tokens that produced them. On a hit, the server loads that saved attention state instead of recomputing it and starts real work only at the first uncached token. Every pricing rule in this guide follows from that one move:

A cache read is ~10× cheaper than a cold pass (0.1× vs 1×): you pay to load precomputed attention state, not to recompute it.
Only a contiguous prefix from token 0 can be cached — token N's state depends on every token before it, so cacheable spans grow from the start, never from the middle.
Output is never cached at generation: each output token's KV is computed once while decoding and discarded — billed purely as output_tokens, absent from that turn's cache_creation. The output text, though, doesn't vanish: next turn it's re-sent in the history and cache-written as input exactly once (its token count shows up in that turn's cache_creation), then read warm thereafter. So a generated span is paid once as output, once more as a single cache write next turn, then cheap reads — there's no cached output-KV to reuse, only the re-encoded text. (Measured: a 440-token turn-1 output had cache_creation 0 that turn, then appeared as ~450 of turn-2's cache_creation, then rode along in turn-3's warm read. That's why output is its own, uncacheable-at-generation cost line back in The three things you pay for.) [measured]
And the headline invariant is just causality restated:

Prompt caching is a strict prefix match. A cache entry is keyed on the exact tokens from position 0 up to a cut point. Change one token at position N and every cached state at position ≥ N is invalid — because each of those later key/value vectors was computed attending to the token you changed.

The cache is therefore content-addressed, not position-addressed: the API re-hashes the leading tokens each request and looks the hash up. Identical leading bytes → hit. This is why the render order (tools → system → messages) is load-bearing — put the bytes that never change at the front, and the model reloads the prefix's attention state instead of recomputing it.

The cut point itself is a cache breakpoint — a cache_control marker on a block, meaning "cache everything from the start up to here." In Claude Code the client places these for you; the rest of this section is what they do.

The sliding window and delta-only writes

Here's the elegant part. The last breakpoint slides forward to the newest turn on each request (Claude Code does this automatically; on the raw API you do it yourself). Then three things happen together:

No invalidation. A new turn reads the unchanged earlier entries and writes a longer one. Cache entries are immutable; stale ones simply age out, they aren't purged.
You pay the 2× write premium only on the delta — the new tokens since the last boundary. The big system prompt is written exactly once.
The TTL refreshes on every read, so a continuously-used prefix never expires.

In one line:

write cost per turn ≈ (new tokens since last boundary) × 2× rate

The exception is the whole story of cache-busting: a byte change inside the prefix misses at that point, and the next write must span from the change forward — now 2× as costly to rebuild as the read it replaced.

The slide also has a reach limit: a breakpoint only re-links to a cache entry within ~20 content blocks of it. So a single turn that appends more than ~20 blocks — a large agentic tool burst — breaks the chain and forces a cold rewrite even when nothing else changed. That failure mode, and the proxy-side fix, are in Agentic tool bursts overflow the 20-block lookback (under What busts the cache).

The invalidation hierarchy

Memorize which changes survive (✅) and which invalidate (❌) each tier:

Change	Tools	System	Messages
Tool definitions (add/remove/reorder)	❌	❌	❌
Model switch	❌	❌	❌
`speed` / web-search / citations toggle	✅	❌	❌
System prompt content	✅	❌	❌
`tool_choice` / images / thinking toggle	✅	✅	❌
Message content	✅	✅	❌

Source: Anthropic — Prompt caching (cache-tiers / what-invalidates section).

The two rows that force a full rebuild — touching every tier — are the expensive ones: tool-definition changes and model switches. Because both sit at or before byte 0 of what's cached, they re-key everything, and at the 2× write rate that rebuild is twice as painful as a read.

Gotchas — in Claude Code

Claude Code places the cache breakpoints for you — three of them, 1-hour TTL (proven in Claude Code's actual caching) — so you never write cache_control yourself. Your cache lever isn't placing breakpoints; it's not disturbing the prefix the client already cached. The ways a real session loses it:

Tool-set churn at byte 0. Adding or removing an MCP server — or a server that mutates its tools at runtime (dynamic toolsets), or async connectors that register after startup — changes the tool definitions at byte 0 and re-keys the entire prefix. Stabilize the tool surface first; full treatment in What busts the cache.
Switching models. The cache is model-scoped, so any model change is a full cold rebuild — the whole of Act II.
Long agentic tool bursts. A single turn that emits more than ~20 tool-call/result blocks overflows the API's 20-block lookback window, so the next turn can't re-link to the prior entry and cold-writes the gap. You can't insert your own breakpoints to repair this from inside Claude Code, so the mitigation is to keep individual tool bursts bounded — full treatment, with the exact re-link rule and a proxy-side fix, in What busts the cache.
Editing CLAUDE.md (or a memory file it imports) — free while a session runs; you pay a one-time re-key only on the next start or resume, and only if the file actually changed. "This" means the project-context layer Claude Code assembles for you: your CLAUDE.md files (enterprise, user, project), the rules and memory files they @import, and the auto-memory store. (It is not your system prompt — --append-system-prompt, output styles, and the like sit in the system tier at the front of the prefix, a separate and costlier case under Tool-set churn / What busts the cache — and not the message you type, which is the live turn.) Claude Code reads this layer from disk once, at session start, into a single <system-reminder> in messages[0] (the first user turn), placed after the system breakpoints — so editing it can never disturb the expensive tools+system prefix (it is never the worst class). The process then reuses that start-of-session snapshot for its whole life, so the only question is when the snapshot is rebuilt from disk:
- While running → free. The live process never rewrites its messages[0] snapshot, so the warm prefix is safe no matter who edits the file. Headless -p ignores the edit until restart. Interactive surfaces it cheaply at the tail: an out-of-band edit (you, a linter, another process) appends a one-shot "CLAUDE.md was modified…" reminder (~155 tokens — the same append-don't-mutate trick as the date); an edit Claude makes itself rides in as its Edit tool result, with no extra reminder. Either way messages[0] stays byte-identical.
- On --continue or a fresh start → the bill lands, but only if the file changed. A new process rebuilds messages[0] by re-reading CLAUDE.md (and its @imports) from disk. If the content changed (whoever changed it), cache_read collapses to the system tier and the whole message tier cold-writes — measured cache_read 30,216→27,975, cache_creation 24→2,265 vs an unedited control; an @imported file re-keys identically (45→2,301). If it didn't change, the resume is fully warm — messages[0] comes back byte-identical and only the sliding tail re-keys (~17–21 tokens). So the re-key is the edit's cost, not resume's. (A fresh session pays the same cold write but has no warm tier to lose.)
- Don't confuse this with system-prompt drift. Any cross-process resume can separately eat a bigger, occasional miss: the regenerated system prompt differs by ~6–17 chars and cold-writes the entire prefix. That's process nondeterminism, independent of any CLAUDE.md edit — tell them apart by system[2]'s hash (a CLAUDE.md re-key leaves system[2] identical and flips only messages[0]). [measured]
Adding a skill vs. a plugin/MCP server — opposite ends of the cost scale. The split comes from where each lands in the prefix: a skill's name+description goes into messages[0] (same tier as CLAUDE.md, after the system breakpoints); a plugin/MCP server's tools go to byte 0, ahead of everything cached. That placement is the whole story:
- Markdown skill → cheap, and --continue-invisible. Added mid-session, the interactive client shows it immediately as a one-shot tail append (the full skills list is prepended to the current turn, ~1,016 tokens once); headless -p ignores it. It's baked into the canonical messages[0] only at the next fresh start. Notably, --continue does not re-scan skills — a resumed session never sees a newly-added skill and never re-keys for it. (This is the resume asymmetry from the bullet above: --continue re-reads CLAUDE.md but replays the stale skills list.)
- Plugin/MCP server → the expensive one. Its tools sit at byte 0, so changing the connected tool set re-keys the entire prefix — system and message tiers (measured: one added tool collapsed cache_read 29,424→0 and cold-wrote all 29,505 tokens). A connected server's own tool change (e.g. a dynamic tools/list_changed) applies live; adding a brand-new server via config is restart-gated. [measured]

The full catalogue of what busts the prefix — silent invalidators, dynamic MCP tools, injected date/git/system-reminders — is the next section.

Note — extra gotchas if you use the Claude API directly. Everything above assumes the Claude Code client, which manages caching for you. If you assemble /v1/messages requests yourself, you also own the things Claude Code quietly handles:

You must add cache_control yourself. With no breakpoint, nothing is cached. There's a maximum of 4 per request; place them at stability boundaries — [tools + system] (frozen) / [project context / CLAUDE.md] (per task) / [conversation] (the sliding tail) — and the API reads the longest matching prefix, reprocessing only what follows.

Minimum cacheable prefix ~1,024–4,096 tokens (model-dependent). Below it the marker silently no-ops: cache_creation_input_tokens comes back 0 and nothing is cached, even though it looks enabled.

The default TTL is 5 minutes (write ≈ 1.25× input); you must explicitly request the 1-hour TTL (write ≈ 2×) that Claude Code always sends. Choose by how long you'll keep reusing the prefix.

Ordering is on you: emit tools → system → messages, stable content first, volatile last.

Serialization drift busts the cache silently: re-serializing JSON with different key order, separators, or escaping — or interpolating a timestamp, UUID, or request-ID early in the prompt — changes the bytes even when the meaning doesn't, and the KV states no longer match. Keep the cached prefix byte-identical and push volatile tokens after the last breakpoint.

You slide the tail breakpoint yourself to the newest turn each request to get delta-only writes.

Verify any of it with the usage block: a non-zero cache_creation_input_tokens on the first call, then a non-zero cache_read_input_tokens on the next. If both stay near zero across requests that share a prefix, caching isn't engaging.

What busts the cache

This is where understanding turns directly into money. Everything below changes the prefix bytes, so the next request re-keys and rewrites at 2× instead of reading at 0.1×. These are the ways it happens in a real Claude Code session; a closing FYI covers hazards that only apply if you drive the raw API or bolt your own content onto Claude Code.

Dynamic MCP tools

MCP servers can change their advertised tools at runtime by sending notifications/tools/list_changed. When they do, the new tool definition lands at byte 0 (tools are first in the render order) → the entire prefix re-keys → a full cold rebuild, now at 2× the write rate.

It helps to keep three distinct things straight, because only the first one lives at byte 0:

Thing	Lives in
Tool definitions	`tools` param (byte 0) — changing these is catastrophic
Tool calls (`tool_use`)	assistant turn content (messages) — late, cheap
Tool results (`tool_result`)	user turn content (messages) — late, cheap

The async-connector blow-up, caught live [measured]

In one real session the tool count grew 30 → 55 → 85 across consecutive turns with no model switch, busting the cache on every turn. The cause: four claude.ai MCP connectors (Zoom, Atlassian Rovo, Microsoft 365 [unauthenticated], Slack) — remote servers connecting asynchronously at startup. Each request snapshots whatever tools have registered so far, so the byte-0 tool list kept growing as connectors came online mid-session. Pinning the config with --strict-mcp-config froze it at 30.

⚠️ Mistake — optimizing anything before the tool surface is stable. If tools are still registering across your first few turns, every "optimization" you measure is noise on top of a cold rebuild.

✅ Fix — Stabilize the tool/MCP surface first. Use --strict-mcp-config (or a frozen, pinned server set) so byte 0 is constant from turn 1. At a 2× write rate, every cold turn is doubly expensive — this is the highest-leverage fix in the guide.

Injected context Claude Code handles for you: date, git, system-reminders

These are the things people expect to bust the cache — but Claude Code places them so they mostly don't:

Date — lives in a system-reminder after the system breakpoints, and a mid-session midnight rollover is effectively free: the harness doesn't rewrite the date in messages[0], it appends a "The date has changed" reminder to the newest turn (measured: cache_read held ~30K, only ~58 tokens written). It re-keys nothing but the sliding tail. [measured]
Git status — safe: Claude Code snapshots it once at session start. (Re-running git status every turn and injecting it early would bust constantly — a cautionary design lesson.)
<system-reminder> blocks — cheap when appended to the newest turn; only expensive if per-turn-varying content is placed early, re-stamped onto an already-cached turn, or stripped on replay (each changes a cached turn's bytes).

Case study: the GitHub MCP server's dynamic toolsets [measured against source]

This case is worth studying because the most popular official platform MCP server made exactly this mistake — and then deleted the feature.

Verified against github/github-mcp-server (Go, MIT) at tag v1.0.5 (1d17d33): the README's "Dynamic Tool Discovery," pkg/github/dynamic_tools.go, and the bundled MCP Go SDK (modelcontextprotocol/go-sdk v1.6.1).

Mode	Tools	Mutates byte 0 at runtime?	Cache-safe?
default (no flags)	`default` toolset, fixed at startup (~50 tools)	No	✅
`--toolsets repos,issues,…`	explicit groups, fixed at startup	No	✅
`--dynamic-toolsets`	3 meta-tools; starts ~empty, grows on demand	Yes (`enable_toolset` → `AddTool` → `tools/list_changed`)	❌

The default and explicit --toolsets paths resolve the tool set once at startup (ResolvedEnabledToolsets) and never touch it again — a frozen prefix, cache-safe. The --dynamic-toolsets mode (off by default; env GITHUB_DYNAMIC_TOOLSETS) instead starts nearly empty and exposes three meta-tools — list_available_toolsets, get_toolset_tools, enable_toolset — so the model turns groups on as it needs them.

That convenience is a cache-buster. When the model calls enable_toolset, the handler registers the group's tools against the live, already-connected server (EnableToolset → RegisterFunc → s.AddTool), and AddTool fires notifications/tools/list_changed. New tool definitions land at byte 0 → the whole prompt prefix re-keys → a full cold rebuild (2× write) on the next turn, paid again on every enable_toolset call.

An epilogue, read honestly: GitHub removed the dynamic-toolsets feature entirely in v1.1.0 (PR #2512, commit 0f0506d, 2026-05-20) — absent from every release since, including v1.3.0. But the maintainers do not cite caching or token cost. Their stated reasons are tech-debt cleanup — the dynamic path "carried real complexity: a separate config flag plumbed through stdio + http configs, a parallel registration path, four inventory methods … three meta-tools" — and that it was "a path no longer in active use," superseded by client-side progressive discovery (the removal author explicitly names "Anthropic's Tool Search Tool, OpenAI's equivalent, 'Code Mode' patterns"). So this is consistent with the frozen-prefix principle, not proof that cache cost drove the deletion. The cache-busting cost itself is documented first-party — by Anthropic, not GitHub: Prompt caching states "Modifying tool definitions … invalidates the entire cache." And tellingly, the replacement paradigm — tool search — is cache-preserving by design: it appends tool schemas inline rather than swapping the prefix, so "the prefix is untouched, so prompt caching is preserved" (tool search docs). [removal + stated reasons verified via PR #2512; the cache motive is our analysis, not GitHub's]

Live companion — Docker MCP Gateway, the same pattern shipped default-on. [measured against source] Where GitHub removed its dynamic path, Docker MCP Gateway (docker/mcp-gateway, Go, MIT, ~1.5k★) ships it on by default. Verified at tag v0.43.0 (4833d8c): the dynamic-tools feature is default-enabled (cmd/docker-mcp/commands/feature.go → defaultEnabledFeatures{"dynamic-tools": true}; the launch blog confirms the meta-tools are "available to your agent by default"), exposing mcp-find / mcp-add / mcp-remove / mcp-config-set / code-mode that add and remove tools on the live, already-connected server mid-session. The mutation hits byte 0 through the identical sink as the GitHub case: pkg/gateway/reload.go (reloadConfiguration) calls mcpServer.RemoveTools(…) + AddTool(…), which in the MCP Go SDK (modelcontextprotocol/go-sdk v1.4.1) bottom out in changeAndNotify(notificationToolListChanged, …) → notifications/tools/list_changed — and Docker's own TestIntegrationToolListChangeNotifications asserts it fires on a live add. So every mcp-add / mcp-remove cold-rebuilds the whole prefix at the 2× write rate, on each call.

Motive, stated honestly: Docker argues the feature on raw token volume, never caching — its slide deck cites "335 tools => 209K tokens of tool description / request … \$1.25/1M tokens" (examples/tool_registrations/embeddings.md) and the launch blog warns the context window can "accumulate hundreds of thousands of tokens of nothing but tool definition." The word "cache" appears nowhere in its blog or docs; the token-volume cost is Docker's claim, the prompt-cache consequence (byte-0 definitions cold-rewrite the cached prefix) is ours. [runtime mutation + list_changed verified at v0.43.0/4833d8c; cache framing is ours]

⚠️ Mistake — enabling dynamic/runtime tool discovery for the convenience. Every "enable this toolset on demand" call cold-rebuilds your entire prompt prefix.

✅ Fix — Use a frozen startup tool set (default or explicit --toolsets). If you genuinely need many tools, prefer a single fixed dispatcher tool (select the operation via arguments) or tool-search that appends schemas to the tail rather than mutating byte 0.

Agentic tool bursts overflow the 20-block lookback [measured]

This buster is different from the ones above: it fires from a decision the model makes — a big parallel tool fan-out — not from anything you configured.

The mechanism, in plain terms. Each turn, the API tries to reuse the cache by matching the prefix at a breakpoint. If that exact match misses, it walks backward looking for a still-cached chunk to extend — but only about 20 content blocks. Find a cached chunk inside that window → the whole prefix is served warm. Find nothing → the API gives up and re-charges the entire prefix at the 2× cold-write rate. So when the previous turn appended more than ~20 blocks, last turn's cached chunk is now sitting too far back to reach, and the next turn pays full freight.

Why tool bursts trip it so easily. Recall a message holds many blocks, and this window counts blocks, not messages. Each parallel tool call is two blocks — the tool_use and its tool_result — so a turn with just ~10 parallel tools is already ~20 blocks, right at the edge. The threshold is razor-sharp: 19 added blocks still re-links, 20 misses (and 20 blocks crammed into a single message still misses — it genuinely counts blocks). Measured on claude-opus-4-8:

one turn's fan-out	blocks added	next turn
28 parallel tools	57	MISS — entire prefix re-charged cold (28,149 written, 0 read)
5 parallel tools	11	HIT — prefix served warm (25,672 read, 452 written)

Same setup; only the burst size differs.

Two facts that make a fix possible. (1) Re-linking needs just one breakpoint within ~20 blocks of the last cached chunk — how far the end of the conversation is doesn't matter, and you don't need an unbroken chain of markers. (2) The cache entry isn't deleted when a turn overflows; it lives for the full hour. You only need a marker close enough to point back at it on the very next request — so the repair is needed only on the oversized turn itself, not forever after.

You can't do this inside Claude Code. The client places the breakpoints for you — and spends only 3 of the 4 the API allows (see The 3-breakpoint finding below; the 4th slot sits unused) — with no way to add or move one. A feature request to expose breakpoint placement in settings.json was filed and closed as not planned (anthropics/claude-code#58103). So from inside the client the only lever is to keep bursts bounded — fewer parallel tools per turn.

Behind a proxy you could potentially fix it. A proxy sitting between Claude Code and the API sees the entire conversation on every request — the API is stateless, so the full history is resent each turn — which means the proxy can rewrite the cache_control markers before forwarding, with no memory of prior calls. And because Claude Code uses only 3 markers, a proxy has the whole budget of 4 to work with. Place them as a sliding grid — at the tail and every ~18 blocks back (tail, −18, −36, −54) — and whatever the burst size, one marker still lands within ~20 blocks of the last cached chunk, re-links the full prefix, and re-charges only the genuinely new blocks:

turn	naive (Claude Code's 1 tail marker)	proxy grid (4 markers, stride 18)
+57-block burst	read 0 · write 16,879 ✗	read 14,794 · write 2,085 ✓
+31-block burst	read 0 · write 18,011 ✗	read 16,888 · write 1,123 ✓

Two limits on the proxy fix: four markers at stride 18 rescue a single turn of up to ~74 new blocks — past that, even the grid can't reach the prior chunk. And the proxy must forward the prefix byte-for-byte; re-sorting the tools or re-serializing the JSON busts the cache on its own, markers or not.

⚠️ Mistake — letting an agent fan out 10+ parallel tool calls in one turn on a cached prefix. ~20 blocks overflows the lookback, and the next turn cold-rewrites the whole prefix at 2×.

✅ Fix — In Claude Code you can't place your own markers, so bound the burst (fewer parallel calls per turn) — that fixes the common case for free. The proxy grid (4 markers, stride ~18, rescues up to ~74 new blocks) is a last resort, not a default: a burst big enough to overflow the window is rare, and a request-rewriting proxy now owns your cache correctness — it must forward the prefix byte-for-byte and is still capped at the ~74-block ceiling. Don't stand one up — or buy a "cache-optimization" layer that does — without understanding every constraint above; for a problem this infrequent, the added complexity and failure surface rarely pay for themselves.

FYI — not Claude Code: silent invalidators in raw-API or custom setups

The classic prompt-cache killers below don't come from Claude Code — it keeps volatile content out of the cached prefix (the date sits after the system breakpoints, the per-request billing-header token is excluded from the cache key, git is snapshotted once). They bite only when you build the prefix yourself: calling /v1/messages directly, or bolting content onto Claude Code via --append-system-prompt / --system-prompt, a SessionStart hook or proxy, or a custom MCP server whose tool descriptions embed per-request data (those land at byte 0). They're silent because the meaning looks unchanged — only the bytes moved:

datetime.now(), UUIDs, or request-IDs interpolated early in the prompt.
Unsorted json.dumps (key order drifts between requests).
Per-user data or conditional sections in the system prompt; per-user tool sets.

✅ Fix — Keep per-request tokens after the last breakpoint (the message tier), exactly where Claude Code puts them. Verify with cache_read_input_tokens ≈ 0 across requests that should share a prefix; if reads are zero, diff the rendered bytes of two consecutive requests.

Claude Code's actual caching, empirically

Now we ground the abstract rules in what Claude Code 2.1.150 actually sends on the wire. (This layout is an implementation detail and is version-specific — re-verify per release.)

The 3-breakpoint finding [measured]

Captured via raw request logging (OTEL_LOG_RAW_API_BODIES=1): 3 cache_control breakpoints, all with ttl:"1h" — the 4th available budget slot is unused.

TOOLS: 30 definitions             ← byte 0, no marker of their own
SYSTEM:
  system[0] len=85   billing/version header (per-request cch token)
  system[1] len=62   "You are a Claude agent…"        ◀ BREAKPOINT 1 (1h)  — covers tools + identity
  system[2] len=26887 "You are an interactive agent…" ◀ BREAKPOINT 2 (1h)  — full system prompt
MESSAGES:
  messages[0].content[0]  <system-reminder> skills list
  messages[0].content[1]  <system-reminder> context + DATE
  messages[0].content[2]  user prompt (turn 1)
  …turns…
  last user message                                    ◀ BREAKPOINT 3 (1h) — slides each turn

Note that tools fold into Breakpoint 1 — there is no dedicated tool breakpoint. A multi-turn (-c) capture shows the tail breakpoint (BP3) sliding forward each turn while the two system breakpoints stay put. This is the textbook sliding-window pattern from earlier: frozen system tier written once, sliding tail writing only the delta.

The billing-header gotcha [measured]

system[0] contains a cch= token that changes on every request — yet warm reads still hit. How? The whole of system[0] is special-cased out of the cache key, so it never busts Breakpoint 1. The block is the billing/version header — x-anthropic-billing-header: cc_version=…; cc_entrypoint=…; cch=…; — and the exclusion covers the entire block, not just the volatile cch= token: changing the version string or the entrypoint is ignored for caching too. (Measured: mutating a non-cch byte of system[0] across a warm turn still reads the full prefix; the same one-byte change in system[1] instead cold-rewrites — so the cache key is live, system[0] is simply outside it.) It's a deliberate exception — a per-request-varying span sitting inside the cached prefix that, by the prefix-match rule, should invalidate everything after it, but doesn't. Don't chase it as a phantom cache-buster.

Date placement [measured]

The date lives in a <system-reminder> in the first user message — after both system breakpoints. A mid-session midnight rollover is cheaper than you'd expect: it does not rewrite messages[0] at all. Verified by holding one session alive across Pacific midnight and capturing the wire — turn 1 (23:48, 06-16) and turn 2 (00:01, 06-17, same session):

turn	`messages[0]` date	rollover notice	usage
1 (before)	`2026-06-16`	—	read 21,812 · write 8,305
2 (after)	still `2026-06-16`	`"The date has changed. Today's date is now 2026-06-17"` appended to the new user turn (msg tail)	read 30,117 · write 58

So the harness leaves the original date in messages[0] untouched and appends a "date has changed" reminder to the sliding tail — re-keying only the cheapest tier (≈58 tokens written, the whole prefix still read warm). It doesn't touch the system tier or messages[0]. Git status is snapshotted once at session start (frozen). Transcripts don't persist the cache_control markers — capture the wire request, not the transcript. [measured]

⚠️ Mistake — trusting the displayed session cost as your true bill. Claude Code's total_cost_usd prices its 1-hour writes at the 5-minute 1.25× rate, so it reads lower than your actual invoice.

✅ Fix — Use the displayed cost for relative comparisons within a session, but treat the Anthropic Console as authoritative for absolute billing. (Act IV quantifies the gap on a real 16-turn session: $1.77 displayed vs $1.97 actual.)

What to do (end of Act I): Keep the tool/MCP surface frozen from turn 1; keep volatile tokens (dates, IDs) after the last breakpoint; verify caching with the usage block, not faith; and don't trust the displayed cost as your invoice. With that, the steady-state cost of a single-model session is mostly cheap reads plus the output you generate. The expensive surprises come from the two things in Act II.

DEV Community