Claude Code caching is eating your budget. Here's what's happening.
Claude Code's prompt caching has two TTLs (5 minutes and 1 hour), two price tiers (25% and 100% premium on cache writes), and most people have no idea which one they're paying for. The Register reported this week that users are hitting unexpected bills and rate limits. If you run Claude Code in production, you need to read this.
Short version of the piece: users are paying more than they expected, performance is inconsistent, and nobody has a clear mental model for when cache writes get billed at 25% versus 100% over base input tokens.
I've been watching my own Claude Code bill closely. Here's what's happening and what to do about it.
The two TTLs
Claude's prompt caching has two expiration windows:
- 5-minute (ephemeral): the default. Cache writes cost 1.25x base input tokens. Reads cost 0.1x.
- 1-hour: opt-in for longer sessions. Cache writes cost 2x base input tokens. Reads are still 0.1x.
The minimum cacheable prompt length is 1024 tokens on most models (2048 on Haiku); shorter blocks don't cache. You get up to 4 cache breakpoints per request.
Tight back-and-forth inside a 5-minute window? The 5-min cache is fine. You write once, read many, amortize the cost.
Coding for an hour, stepping away for coffee, coming back? The 5-min cache expired while you were gone. Next request pays the full write again.
Nobody tells you which mode your session is in at any given moment.
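You can sketch the trade-off directly from those rates. A toy comparison of the two TTLs over one session (multipliers are from the pricing above; the turn and idle counts are illustrative):

```python
# Cache pricing multipliers relative to base input-token price (from the text above).
WRITE_5MIN = 1.25   # 5-minute cache write
WRITE_1HR = 2.00    # 1-hour cache write
READ = 0.10         # cache read (same for both TTLs)

def session_multiplier(write_mult: float, writes: int, reads: int) -> float:
    """Total cost in units of (cached-prefix tokens x base input price)."""
    return write_mult * writes + READ * reads

# A 20-turn session with 4 idle gaps longer than 5 minutes:
# the 5-min cache rewrites after every gap (5 writes total); the
# 1-hour cache writes once and refreshes once past the 60-minute mark.
five_min = session_multiplier(WRITE_5MIN, writes=5, reads=15)
one_hour = session_multiplier(WRITE_1HR, writes=2, reads=18)
print(round(five_min, 2), round(one_hour, 2))  # 7.75 5.8
```

With frequent idles, the pricier 1-hour write comes out ahead. With tight back-and-forth and no gaps, the 5-minute cache wins.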
What a day of coding actually costs
Rough math. Your Claude Code session has:
- 40-50k tokens of tools, CLAUDE.md, and scaffolding that gets cached
- Some currently-open files (another 5-20k)
- 1-3k tokens per turn of conversation
With 5-min caching over a typical 2-hour session where you idle 4 times (coffee, Slack, lunch, thinking):
- First turn: 1.25x on ~50k cached tokens. That's 12.5k tokens of write premium on top of the base cost.
- Reads inside the 5-min window: about $0.02 per turn at Sonnet rates.
- Each idle longer than 5 minutes triggers another rewrite. Another 12.5k in premium.
Four idle rewrites in a day = around 50k tokens of pure cache-write premium, on top of the first write. At current Sonnet input rates (~$3 per million), that's about $0.15/day per active session in overhead you weren't tracking.
Sounds small. Scale it. Five sessions a day, 20 workdays a month. That's $15/month in cache-write overhead alone. More if your CLAUDE.md is large or you have a big tool set.
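That arithmetic, as a script you can rerun with your own numbers (the token counts and the ~$3/MTok Sonnet rate are the estimates above, not measured values):

```python
# Back-of-envelope cache-write overhead, using the estimates from the text.
CACHED_PREFIX_TOKENS = 50_000    # tools + CLAUDE.md + scaffolding
WRITE_PREMIUM = 0.25             # 5-min writes cost 1.25x, i.e. 0.25x extra
SONNET_INPUT_PER_MTOK = 3.00     # ~$3 per million input tokens (approximate)

def overhead_usd(idle_rewrites: int, sessions: int) -> float:
    """Dollar cost of the write premium alone, excluding base input tokens."""
    premium_tokens = CACHED_PREFIX_TOKENS * WRITE_PREMIUM * idle_rewrites * sessions
    return premium_tokens * SONNET_INPUT_PER_MTOK / 1_000_000

print(round(overhead_usd(idle_rewrites=4, sessions=1), 2))       # 0.15 -- per day, one session
print(round(overhead_usd(idle_rewrites=4, sessions=5) * 20, 2))  # 15.0 -- per month, five sessions
```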
The shift to per-token pricing this month makes this worse: there's no longer a flat subscription absorbing the variance.
Why this is worse than it sounds
Three things compound:
Cache writes on every fresh session. Every claude cold-start rewrites the full system prompt. 1.25x on ~40k tokens. At least $0.15 just to start each session.
The 1024-token floor. Blocks under 1024 tokens don't cache at all. You pay full input price every time. Small files and short tool definitions silently fall off the cache path.
The 1-hour option is not automatic. You need to pass cache_control: { type: "ephemeral", ttl: "1h" } in the API call. Claude Code the CLI handles this in most paths, but if you're driving the SDK directly, you're on 5-min caching until you opt in.
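In the raw API, cache_control goes on the individual content block you want cached. A sketch of the request shape (shape only, no network call; the prompt text and model id are placeholders, and the 1-hour TTL has required a beta header on some API versions, so check the current docs before relying on it):

```python
# Request-body sketch: mark the large, stable system prefix for 1-hour caching.
system_blocks = [
    {
        "type": "text",
        "text": "You are a coding assistant...",  # placeholder: your real system prompt
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }
]
request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": system_blocks,
    "messages": [{"role": "user", "content": "Refactor the auth module"}],
}
print(request["system"][0]["cache_control"]["ttl"])  # 1h
```

Omit the ttl field and you silently get the 5-minute default.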
The Register flagged the bigger issue: the billing surface is opaque. You see the total on your invoice. You don't see cache-write vs cache-read vs base-input split out the way you'd split out the code that caused them.
Per-token pricing makes it a real problem
Anthropic shifted to per-token billing on several tiers this month. Max-plan-style flat absorption is gone on those paths. Every cache miss hits your card.
Per-token plus opaque cache costs = your bill is variable AND unpredictable.
Variable is fine if it's predictable. Unpredictable is fine if it's bounded. Both at once is the exact pattern that sinks AI-dev-tool budgets.
How to control it
Three things, in order:
1. Set a daily token budget, not a monthly one. You want the alert Monday afternoon, not Friday morning when the damage is done.
2. Close idle sessions. A 2-hour tmux-backgrounded Claude Code window is burning cache rewrites on every follow-up. Open a fresh one instead of reviving a stale one.
3. Put a hard cap on every agent process you spawn. If you script Claude Code via claude --headless or the SDK, don't trust the process to self-regulate. Cap the runtime at the guardrail layer, below the model.
Number three is the one most people skip and most often regret.
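A minimal sketch of that guardrail layer, tool-agnostic (all names here are illustrative, not AgentGuard's actual API):

```python
import time

class CapExceeded(Exception):
    pass

class RunCap:
    """Hard limits enforced outside the agent: dollars, turns, wall-clock."""

    def __init__(self, max_usd: float, max_turns: int, timeout_s: float):
        self.max_usd, self.max_turns, self.timeout_s = max_usd, max_turns, timeout_s
        self.spent_usd, self.turns = 0.0, 0
        self.started = time.monotonic()

    def charge(self, usd: float) -> None:
        """Call once per agent turn with that turn's estimated cost."""
        self.spent_usd += usd
        self.turns += 1
        if self.spent_usd > self.max_usd:
            raise CapExceeded(f"budget: ${self.spent_usd:.2f} > ${self.max_usd:.2f}")
        if self.turns > self.max_turns:
            raise CapExceeded(f"turns: {self.turns} > {self.max_turns}")
        if time.monotonic() - self.started > self.timeout_s:
            raise CapExceeded(f"timeout: ran past {self.timeout_s}s")

# In your agent loop: call cap.charge(estimated_turn_cost) before each model
# call, and treat CapExceeded as a hard stop, not a warning.
cap = RunCap(max_usd=2.00, max_turns=50, timeout_s=900)
```

The point is where the check lives: outside the agent's own loop, so a runaway process can't talk itself past it.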
What I use
I built AgentGuard for exactly this. Runtime cost and safety guardrails for any agent you spawn, including Claude Code via SDK. Set a dollar budget, a max number of turns, a timeout, and a tool-call rate limit. The agent stops when it crosses a line. You find out at the cap, not on the next invoice.
pip install agentguard47. MIT license, OSS. Pro dashboard is $39/mo if you want alerting and historical spend charts.
The Register article is a warning shot. Prompt caching is a sharp tool. Use it wrong and it compounds. Use it right with a hard runtime cap and it's fine.
Don't wait for your next invoice to find out which one you're doing.
Top comments (1)
Solid breakdown of the two-TTL system — the "four rewrites in a working day = ~50k premium tokens" math is exactly the kind of framing that changes behavior.
One angle worth adding to your three controls: CLAUDE.md size discipline.
A bloated CLAUDE.md (500+ lines of general advice) is 500+ lines sitting in the cached system prompt on every cold start. The 1024-token floor means over-padded instruction files pay full input price on sections that don't cross a cache breakpoint — and they inflate the 1.25x write target on the sections that do.
The fix: trim CLAUDE.md to only what Claude genuinely can't derive from reading the code — hidden constraints, workaround reasons, specific tool preferences. Anything Claude already knows from training (general code quality, defensive error handling patterns) just adds tokens without adding signal. A CLAUDE.md under 150 lines of genuine context vs. 600 lines of advice isn't just cleaner to maintain — it's meaningfully cheaper per session restart.
Combined with your dollar-budget cap and the "close idle sessions" advice, you get both sides: leaner context (supply) + hard runtime limit (demand). AgentGuard handles the cap layer well from what you describe.