This act consolidates everything into two reference artifacts: a catalogue of the mistakes (each with its symptom, cost, and fix) for revision, and a one-page cheat sheet.
The mistakes catalogue
Ordered roughly by how much money they cost and how often they happen.
1. Routing per turn instead of per conversation. Symptom: a "cost-saving" router that bounces models mid-conversation. Cost: a cold prefix write on the first switch + a catch-up write on every re-entry; for Sonnet, a net loss (+2.1% measured). Fix: route sticky — per conversation or per sub-agent. Where you switch matters more than whether.
2. Inserting a proxy that re-serializes JSON. Symptom: a logging/compression/routing proxy that parses and re-emits the request. Cost: even a logically-identical re-encode changes the bytes (separators, escaping, key order, number formatting) and drops cache hit rate to ~0 for all traffic. Fix: forward original bytes verbatim; mutate only by surgical byte-fragment replacement on the messages tail.
3. Leaving the tool/MCP surface unstable. Symptom: tool count grows across early turns (async MCP connectors), or dynamic/runtime tool discovery is enabled. Cost: byte-0 churn → full cold rebuild (2× write) every affected turn. Fix: --strict-mcp-config and a frozen startup tool set; a fixed dispatcher tool instead of runtime tool mutation.
4. Volatile tokens early in the prompt. Symptom: a timestamp, UUID, or request-ID near the top of the system prompt; unsorted json.dumps. Cost: every request has a unique prefix → ~0% hit rate, silently. Fix: move all volatile content after the last breakpoint; sort keys deterministically.
5. Stripping or editing thinking blocks in a proxy. Symptom: a proxy that "cleans up" or removes thinking blocks. Cost: editing → 400 (integrity); removing → 200 but a position-dependent re-key (removing an early block lost 8,104 cache tokens vs 440 for a late one). Fix: never mutate thinking blocks; carry the request body as Claude Code assembled it.
6. Trusting the displayed session cost as your invoice. Symptom: using /cost or the status line as the source of truth. Cost: it prices 1-hour writes at the 5-minute 1.25× rate and reads ~10% low ($1.77 vs $1.97 on a real session). Fix: use the Anthropic Console for absolute billing; use the displayed figure only for in-session relative comparison.
7. Caching a prefix you'll read fewer than 3 times. Symptom: aggressive caching in a short-lived custom harness. Cost: a wasted write is 2× — double what no caching would have cost. Fix: cache only when you'll re-read the prefix ≥3 times (the break-even); let interactive sessions cache by default.
8. Running an aggressive input compressor over a cached prefix. Symptom: LLMLingua/LLM-based compression applied to the conversational prefix. Cost: rewriting cached bytes flips reads (0.1×) to writes (2×); accuracy degrades past ~20×. Fix: use input compressors only where there's no cache to lose (one-shot, RAG assembly) or only on the fresh tail; prefer a cache-aware tool.
9. Agentic bursts overflowing the 20-block lookback. Symptom: a single turn emits ~10+ parallel tool calls (≈21+ tool-call/result blocks). Cost: the next turn can't re-link the cache and cold-writes the gap. Fix: in Claude Code, bound the burst; behind a proxy, a stateless breakpoint grid (stride ~18, 4 markers) rescues up to ~74 blocks/turn.
10. A cache breakpoint below the minimum prefix. Symptom: marking a sub-1,024-token prefix for caching. Cost: the marker silently does nothing (cache_creation: 0); you think caching is on. Fix: confirm non-zero cache_creation then non-zero cache_read on the following request.
11. Assuming HTTP 200 means the cache survived. Symptom: judging a proxy change by "it still works." Cost: 200 means "valid request," not "cache preserved" — you can silently convert cheap reads into expensive writes. Fix: validate every prefix-touching change by watching cache_read on the next request, not just the status code.
12. Switching Haiku → Sonnet mid-conversation. Symptom: escalating an in-flight Haiku conversation to Sonnet. Cost: stacks three things — full cache miss (model-scoped), Sonnet now keeps-and-bills thinking Haiku had been ignoring, and any prefix-leanness benefit evaporates. Fix: decide the tier at conversation start; if you must escalate, accept it as a fresh cold start, not a cheap continuation.
One-page cheat sheet
The four mental models
- Amnesiac contractor — the server stores nothing; the request is the only state.
- Prefix-match cache — caching is a strict byte-prefix; a change at byte N invalidates everything ≥ N.
- Model-scoped key — a cache entry belongs to one model; another model can't read it.
-
Encrypted envelope — thinking rides back sealed in the
signature; you carry it, can't open it, can't edit it.
The numbers
- Cache read = 0.1× input; cache write = 2× (1-hour TTL, what Claude Code sends); output = 5× input, never cached.
- Caching breaks even from the 3rd read (at 2×). A wasted write costs 2× (double no-cache).
- A cache hit is ~10× cheaper than cold processing.
- Tokenizer factor: Opus 4.8 = 1.0; Haiku 4.5 = 0.775 (measured); Sonnet 4.6 ≈ 0.775 (assumed).
- Claude Code: 3 breakpoints, all 1-hour TTL (tools+identity, full system, sliding tail). Displayed
total_cost_usdunder-reports (prices writes at 1.25×).
The render order — tools → system → messages. Stable first, volatile last.
The two safest cost levers (never touch the prefix):
- Output compression (Caveman) — attacks the priciest line, which is never cached.
- Sticky routing — per conversation/sub-agent: all-Sonnet ≈ −53%, all-Haiku ≈ −85% vs Opus.
The reflex to build: before any change that touches tools, system, or message history, ask "does this change a byte inside the cached prefix?" If yes, you're paying a 2× rewrite — only do it on purpose.
Synthesis — the whole guide in twelve lines
-
Claude Code is a stateless API client. Every turn is one complete
/v1/messagesrequest carrying tools + system + the whole conversation; the server stores nothing and the client re-sends everything (including thinking) each turn. The request is the only state. - Caching is a prefix match. A byte change at N kills caches at ≥ N; order tools → system → messages; stable first, volatile last.
- Breakpoints are content-addressed cut points (max 4); read the longest matching prefix, reprocess only the delta; backward re-link is bounded by 20 blocks.
- A new turn adds and reads — it doesn't invalidate. Reads refresh the TTL; the 2× write premium is paid only on the delta; the big prefix is written once.
- Claude Code uses 3 breakpoints at 1-hour TTL — tools+identity, full system, sliding tail. The date is a system-reminder after the system breakpoints; the billing header's token is excluded from the cache key; its displayed cost under-reports by pricing 1-hour writes at 1.25×.
- Switching models is not "always cold": first call cold, later same-model calls warm-read + catch-up-write. Caches are model-scoped, the system prompt differs by two lines, and across the Opus/Sonnet/Haiku family thinking blocks replay across models — rendered and billed, not dropped, even when encrypted; only Fable/Mythos drop them.
-
Thinking is an encrypted envelope. The
signaturecarries the full reasoning, decrypted server-side; you can't read it, can't edit it (400), but must carry it for continuity. Billed as output at generation, input on replay;displayis visibility-only. -
Thinking-strip is a model-class trait, neutralized by Claude Code. Default Haiku/older strip thinking (free, out of the cache key); Opus 4.5+/Sonnet 4.6+ keep it. Claude Code forces
keep:"all"everywhere, so the strip never bites in real usage — and a proxy that removes blocks pays a position-dependent re-key. -
Dynamic MCP tools (runtime
list_changedor async connector load) churn byte 0 → full rebuild. Keep the tool set stable; use a fixed dispatcher. Stabilize the tool surface first. - Output dominates generation-heavy turns — the only regime where a cold cross-model bounce can still win (output > ~2,500 tokens).
- Route per conversation, not per turn. At 2× writes, per-request 20%-Sonnet loses 2.1% while sticky all-Sonnet saves 53% and all-Haiku 85%. Where you switch matters more than whether.
- Cost-reduction tooling clusters by cost line. Routers attack the model tier (pay off only sticky); compressors attack input/cache (must leave the cached prefix byte-stable); output compressors attack the priciest, cache-neutral line. The safest wins — output compression and sticky routing — are the two that never touch the cached prefix.
References & caveats
Costs are computed from measured token counts at Anthropic list prices (per 1M, as of 2026-06-15): Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 in/out; cache read 0.1×, cache write 2× (1-hour TTL). The tokenizer factor 0.775 is measured for Haiku and assumed for Sonnet. Sonnet routing figures are modeled, not run. Claude Code's displayed total_cost_usd uses a 1.25× write multiplier and reads lower than these documented-rate figures. Prices and behaviors are version-specific (Claude Code 2.1.150, mid-2026) — re-verify for your own setup.
Anthropic documentation
- Extended thinking (signatures, encryption, display, retention, multi-turn/tool-use):
platform.claude.com/docs/en/build-with-claude/extended-thinking - Adaptive thinking & effort:
platform.claude.com/docs/en/build-with-claude/adaptive-thinking - Prompt caching:
platform.claude.com/docs/en/build-with-claude/prompt-caching - Pricing:
claude.com/pricing
Cost-reduction tooling
- Routers: RouteLLM
lm-sys/RouteLLM· vLLM Semantic Routervllm-project/semantic-router - Input/cache compressors: Headroom
chopratejas/headroom· LLMLinguamicrosoft/LLMLingua· RTKrtk-ai/rtk· lean-ctxyvgude/lean-ctx - Output compressor: Caveman
JuliusBrussee/caveman - MCP servers referenced:
github/github-mcp-server(v1.0.5; dynamic-toolsets removed in v1.1.0, PR #2512 — for tech-debt, not cache) ·docker/mcp-gateway(v0.43.0 /4833d8c;dynamic-toolsdefault-on)
Top comments (0)