My daily token burn was eating me alive until I learned what a cache hit rate actually is

#ai #python #automation #discuss

April 24, 2026. I sat down this morning with a spreadsheet and a coffee that was supposed to be the cheap kind but somehow cost six bucks, and I started charting my DeepSeek bill day by day for the last three weeks.

It was bad. Like, mad-at-myself bad.

Every call to the model was hitting the API fresh. Every prompt template. Every system message. Every bot in my swarm running the same boilerplate over and over, paying full freight each time. I'd been so focused on shipping the next bot, the next route, the next little thing somebody might actually pay for, that I never looked at where my own money was bleeding out.

When I started this whole thing I had a laptop, an email account, and social media. I didn't know what an API key was. The first time I saw the word "token" I thought it meant the little coin thing you put in arcade machines. Six weeks ago if you'd asked me about cache hit rate I would have guessed it was something Starbucks measured during the morning rush.

So today I got angry. Then I got to work.

DeepSeek has prompt caching. It's been right there in the docs the whole time. You structure your prompts so the static parts come first (system prompt, tool definitions, the long boring context), and the variable parts come last (the actual user input that changes per call). The provider hashes the prefix. If it matches a recent call, you pay a fraction of the price for those tokens. Sometimes a tenth.

I'd been writing my prompts in the opposite order. Variable junk up front, static block at the bottom. Cache hit rate of basically zero across every bot in my collection. I was paying full price on every single call for content that literally never changed.

I rewrote the prompt builder. One function, used by all 27 bots. Static system block first, tool schemas next, then the rolling context window with a hard cutoff, then the user turn at the very end. Took me about three hours and a lot of swearing at myself.

Then I wrote pytests. Twelve of them. Each one fires a representative call from a different bot and asserts the cache hit metric in the response metadata is above a threshold. Eleven of twelve passed clean at 86 percent or higher. The twelfth is one of the intel bots where the context genuinely does change every call, so I'll restructure that one differently this weekend.

I ran the new version against today's traffic for four hours and watched the burn rate on the dashboard. It dropped. Not a little. A lot. The kind of drop where you refresh the page three times because you don't believe it.

I'm still learning. A senior dev reading this will spot ten amateur moves in how I structured the cache keys alone. I know I should be batching some of these calls. I know I should be using a smaller model for the classification step before the expensive reasoning step. I know there's probably a whole tier of optimization I haven't even heard the name of yet.

But today I went from angry to relieved in about six hours of focused work, and that's a trade I'll take every single time.

The honest version of this story is that I shipped 27 bots before I optimized the thing that was costing me the most money. That's the order I did it in. Ship first, then look at the bill, then panic, then fix it. I'm not going to pretend I planned it the smart way.

Somebody sent payment for a Safety Pack while I was writing this. They add up little by little like pennies. Pennies I get to keep more of now.

For anyone else running a bunch of LLM calls in parallel, what's the optimization you wish you'd found in week one instead of week six?