Three places your Claude Code bill leaks (and how to plug them)

#ai #llm #productivity #devtools

A few months into running Claude Code across a small team, the API bill stopped being a rounding error. Nobody had done anything reckless. The spend crept up the boring way: a hundred small calls a day, each a little more expensive than it needed to be.

When I sat down with the usage export, the waste wasn't where I expected. It wasn't one runaway agent. It was three patterns repeating across thousands of calls. Each one is dull on its own and adds up to real money at the bottom of the invoice.

Here they are, with how to find each one in your own usage and the fix that moves the number.

1. You're paying full price to re-send the same prompt

This is the big one, and most teams don't know it's happening.

Every call you make ships a system prompt plus whatever context you've stacked in front of the real question: coding standards, file trees, a long instruction block. If that prefix is identical across calls and you're not caching it, you pay full input price to transmit the same tokens over and over.

Anthropic's prompt caching exists for exactly this. You mark a stable prefix with a cache breakpoint, and on the next call within the window that prefix is read from cache instead of reprocessed. A cache read bills at roughly a tenth of the normal input price. The catch: the write costs a bit more than a normal call (about 1.25x), so caching only pays off when the same prefix gets reused. For a system prompt that fronts every request, that's a near-certain win.

How to spot it: pull a sample of your requests and look at the first few thousand tokens of each. If a large, identical block repeats and your cache_control markers aren't set, that block is being re-billed at full rate every single time.

The fix is one config change, not a rewrite. Add a cache breakpoint after your stable prefix. On a 9k-token system prompt hit a few hundred times a day, the difference between cached and uncached is the kind of line item that pays for itself the same afternoon.

2. Opus work that Sonnet would do the same

Model choice tends to get set once, early, when someone picked the strongest option to be safe, and then never revisited.

The strongest model is the right call for genuinely hard reasoning. It is overkill for renaming variables, summarizing a diff, classifying a log line, or formatting output. A cheaper model returns the same answer for those, and the price gap between tiers is large.

How to spot it: group your calls by the kind of task, not by volume. The wins hide in the high-frequency, low-difficulty bucket, the thousand-times-a-day calls doing mechanical work on your most expensive model. Those are the ones to test on a cheaper tier.

The honest way to do this is side by side. Take a real sample, run it on both models, and compare the outputs yourself before you switch anything. If the cheaper model scores the same on the tasks you care about, route that task class down. If it doesn't, you've learned where the expensive model earns its price. Either way you're deciding with evidence instead of a hunch you set six months ago.

3. Retries you're paying for and can't see

Timeouts and 5xx errors get retried. Often that retry logic lives in a wrapper somebody added once and forgot, and it quietly doubles the cost of the same work whenever the API has a rough minute.

The reason this one is dangerous is that it's invisible by default. A retried call succeeds, the user gets their answer, and nothing in your app surfaces that you paid twice. It only shows up as a slow upward drift in spend that nobody can explain.

How to spot it: log every attempt, not only the final success, with a request id you can group on. Then count attempts per logical job. A healthy system sits near one. If you see jobs averaging well above that during error windows, you've found a leak.

The fix is exponential backoff with a sane cap, plus an idempotency key so a retry of a job that already half-finished doesn't redo paid work. None of it is exotic. The hard part was seeing it at all.

The pattern behind the patterns

None of these is a clever trick. They're all the same failure: a default that was reasonable when it was set, left in place while the volume grew underneath it. The expensive part isn't the fix. It's noticing.

That noticing is what I got tired of doing by hand. Eyeballing a usage export once a month catches the obvious stuff and misses the slow drift. So I'm building a tool that reads your team's own Claude Code usage and hands back a short list of dollar-quantified cuts: the prefixes you're paying to re-send, the call patterns safe to downgrade, the jobs burning tokens on retries. Evidence attached, nothing applied automatically. It tells you where the money is, you decide.

It's early and I'm validating before I build the whole thing, so right now it's a waitlist. If watching your AI bill go up without a clear reason sounds familiar, you can put your name down here: Tokenwise Savings. I'd genuinely like to know whether this is a real problem for your team or only mine.

Either way, the three checks above cost you nothing but an hour with your usage export. Start with caching. That's where the money usually is.