The afternoon I learned what my AI subscription was actually doing, and the 200 lines that took my next bill down 41 percent.
I had been using Cur...
The spreadsheet detail makes this post: measuring instead of assuming is the whole point.
One thing the 4k–14k variance hints at: a lot of that context is the routing layer compensating for the fact that the model has no durable state to retrieve, so it re-ships ambient buffer state every call just in case.
When the load-bearing context lives in a record the agent can pull deliberately, the “play it safe, send everything” default has less to do.
Doesn’t fix Cursor’s economics for you, but it’s a strong argument for not letting the chat window be the only memory.
Saving the token-audit approach.
Holy wow
I laughed because I’ve felt that same silent budget hemorrhage. Thank you for ripping open the token details—it’s a warning everyone using AI coding tools needs. Most devs don’t realize the context re‑sent includes huge chunks of unrelated code. Can we teach such tools to send only the AST of what actually changed? If not, maybe we need a local proxy that slims the payload before it ever hits the network.
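A rough illustration of the "send only what changed" idea, assuming the source is Python and the proxy already knows which line numbers were edited. The function name `functions_touching` is hypothetical, and this is only a sketch of the trimming step, not something Cursor or any other tool is known to do.

```python
import ast


def functions_touching(source: str, changed_lines: set[int]) -> list[str]:
    """Return the source of every function whose body overlaps a changed line."""
    tree = ast.parse(source)
    snippets = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            span = range(node.lineno, node.end_lineno + 1)
            if any(line in span for line in changed_lines):
                snippets.append(ast.get_source_segment(source, node))
    return snippets
```

Nested functions would be returned alongside their enclosing function, so a real proxy would want to deduplicate. It would also need a fallback for edits outside any function.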
Imagine a “token budget” mode hard‑coded into the editor. If the projected context exceeds your cap, it simply refuses the request. That would force more modular coding habits from the start.
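A minimal sketch of such a budget check, assuming a crude estimate of four characters per token; a real tokenizer would count differently. The names `check_budget` and `BudgetExceeded` are made up for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a request's projected context is over the cap."""


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. This is a heuristic,
    # not the count a provider would actually bill.
    return max(1, len(text) // 4)


def check_budget(prompt: str, context_chunks: list[str], cap: int) -> int:
    """Return the projected token count, or raise if it exceeds the cap."""
    projected = estimate_tokens(prompt) + sum(
        estimate_tokens(chunk) for chunk in context_chunks
    )
    if projected > cap:
        raise BudgetExceeded(f"projected {projected} tokens exceeds cap of {cap}")
    return projected
```

Raising an exception rather than silently truncating keeps the refusal visible, which is the point of the proposal.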
Feels like we need a tiny AI sitting on the laptop or PC that decides what context to send and what to leave out, and manages that itself, so the overall process improves instead of the context bloating.
Because if you keep trimming the context, the AI no longer knows the whole picture and starts hallucinating as well...
Check the post "Engineering Agent Memory": dev.to/kenwalger/engineering-agent...
See my comment there; we are discussing exactly the problem you describe in this post.
8,400 tokens to rename a function is the perfect tiny example of the whole problem. The model doesn't "rename" - it re-reads a pile of context to be safe, then regenerates, and you pay frontier-model rates for what is mechanically a find-and-replace. The mismatch between task complexity and tokens spent is enormous on exactly this kind of trivial edit.
This is the strongest case for routing: a rename, an import fix, a formatting pass - none of it needs a reasoning model. A cheap model (or honestly an LSP/refactor tool) handles it for a fraction of the tokens. You'd reserve the expensive model for the 5% of asks that genuinely require thinking. Checking the actual token count like you did is how people wake up to this - most never look. Great little investigation.
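A toy version of that routing decision, assuming a keyword match is enough to spot mechanical edits. The patterns and the tier names `cheap-model` and `reasoning-model` are placeholders, not real model identifiers, and this is not the post's 200-line classifier.

```python
import re

# Hypothetical patterns for requests that are mechanically simple.
MECHANICAL_PATTERNS = [
    r"\brename\b",
    r"\bfix (the )?imports?\b",
    r"\bformat(ting)?\b",
]


def route(request: str) -> str:
    """Pick a model tier for a request based on simple keyword matching."""
    text = request.lower()
    if any(re.search(pattern, text) for pattern in MECHANICAL_PATTERNS):
        return "cheap-model"
    # Default to the expensive tier when unsure, so misroutes cost money
    # rather than quality.
    return "reasoning-model"
```

For example, `route("Rename getUser to fetchUser")` picks the cheap tier, while a question about a deadlock falls through to the reasoning tier.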
Love that you actually measured this. The 4x-7x overhead over the bare API call lines up with what we see when we sniff Cursor/Cline-style clients — the system prompt + workspace indexing + tool definitions roughly dominate the prompt for any short request, and the marginal cost of your actual edit is in the noise.
One nuance worth surfacing for anyone building their own router: the cost-routing math is the easy part. The hard part is the intent classifier itself. A 200-line classifier that routes ~80% correctly is fantastic when it's right, and corrosive when it's wrong, because the failure mode is silent — a slightly worse answer that the user accepts because they don't know Opus would have caught the bug. We started instrumenting "router regret" by re-running a sampled 5% of routed requests on the next tier up and diffing the outputs. It costs a bit but it's the only way to know whether your savings are real or whether you're just downgrading quality you can't see.
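The sampling-and-diffing idea might look roughly like this. It assumes plain text similarity is an acceptable first proxy for "the outputs differ"; a differing output is not necessarily a worse one, so high scores would still need a human or test-suite check. The function names are illustrative.

```python
import difflib
import random


def should_audit(rng: random.Random, sample_rate: float = 0.05) -> bool:
    """Decide whether this routed request is re-run on the next tier up."""
    return rng.random() < sample_rate


def regret_score(routed_output: str, higher_tier_output: str) -> float:
    """Return 0.0 for identical outputs and 1.0 for nothing in common."""
    matcher = difflib.SequenceMatcher(None, routed_output, higher_tier_output)
    return 1.0 - matcher.ratio()
```

Passing in a seeded `random.Random` keeps the sampling reproducible when replaying logs.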
Also: the 41% cost win is a great result, but the unit you actually want to optimize is cost per accepted edit, not cost per call. Cheap calls that get rejected and re-prompted are a worse deal than expensive calls that land first try.
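To make the arithmetic concrete, here is a small sketch that assumes each attempt is accepted independently with the same probability, so the expected number of calls per accepted edit is one over the acceptance rate. The dollar figures in the comment are invented for illustration and do not come from the post.

```python
def cost_per_accepted_edit(cost_per_call: float, acceptance_rate: float) -> float:
    """Expected spend to get one accepted edit, retrying until it lands."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    # Expected attempts until the first acceptance is 1 / acceptance_rate.
    return cost_per_call / acceptance_rate


# Illustrative only: a cheap call at $0.01 accepted 25% of the time ends up
# costing more per accepted edit than a $0.03 call that always lands.
cheap = cost_per_accepted_edit(0.01, 0.25)
expensive = cost_per_accepted_edit(0.03, 1.0)
```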
This is exactly why I still don’t buy the “full AI engineer replacement” narrative. At some point a human notices something’s off.
And honestly, the lack of observability in a lot of commercial AI tooling is weird. You start a process, wait forever, your laptop fans enter takeoff mode, and 20 minutes later you get either nonsense or a beautifully formatted disaster.
Feels like these tools desperately need basic monitoring primitives:
token spikes, loop detection, some typical alerts, intermediate reasoning snapshots, kill switches mid-run.
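Two of those primitives, token spikes and loop detection, can be sketched in a few lines. This assumes the agent reports each step as an action string plus a token count; the `RunMonitor` class and its thresholds are hypothetical and not part of any existing tool.

```python
from collections import deque


class RunMonitor:
    """Tracks per-step token use and repeated actions during an agent run."""

    def __init__(self, spike_factor: float = 3.0, loop_window: int = 3):
        self.spike_factor = spike_factor
        self.token_history: list[int] = []
        self.recent_actions: deque[str] = deque(maxlen=loop_window)

    def record_step(self, action: str, tokens: int) -> list[str]:
        """Record one step and return any alerts it triggers."""
        alerts = []
        if self.token_history:
            average = sum(self.token_history) / len(self.token_history)
            if tokens > self.spike_factor * average:
                alerts.append("token_spike")
        self.token_history.append(tokens)
        self.recent_actions.append(action)
        # A full window of identical actions suggests the agent is looping.
        window_full = len(self.recent_actions) == self.recent_actions.maxlen
        if window_full and len(set(self.recent_actions)) == 1:
            alerts.append("possible_loop")
        return alerts
```

A kill switch could then be as simple as aborting the run once `record_step` returns a non-empty list.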
At the same time, I genuinely believe AI agents are the next real platform shift. The value is obvious already. But right now they’re still more like copilots with a weirdly broad skill matrix — not Terminators with independent judgment and a stable personality.
Right now a lot of AI workflows feel less like pair programming and more like sending a junior dev into the basement and hoping they come back with the right file 🥴
Ran four agents in parallel last week and one of them ate roughly 60% of the budget on a single rename refactor. Same surprise. The instrumentation gap is the actual problem - you don't know which agent burnt the tokens until 3 hours after the fact, by which point the diff is already merged. Did you find a way to flag token-disproportionate edits in real time, or only post-hoc?
The “cost per accepted edit” point in the comments feels like the missing metric here. Raw token spend is useful, but the real alarm should be “this tiny rename used 8k tokens and produced one accepted diff.” That would make waste obvious without forcing devs to inspect every request manually.
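One way such an alarm could work is a tokens-per-changed-line ratio checked as soon as the diff comes back. The 500-token threshold below is an arbitrary assumption, and `is_disproportionate` is a made-up name.

```python
def is_disproportionate(
    tokens_used: int, lines_changed: int, max_tokens_per_line: int = 500
) -> bool:
    """Flag edits whose token cost is large relative to the size of the diff."""
    # Treat an empty diff as one line so the ratio is always defined.
    lines = max(lines_changed, 1)
    return tokens_used / lines > max_tokens_per_line
```

With that threshold, an 8,400-token request that changes one line is flagged, while the same spend on a 40-line diff is not.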
This matches what I keep seeing too: the expensive part is not the visible prompt, it is the safe default context bundle. The practical fix is less “use a cheaper model” and more “make context selection observable,” even if it is just logging files included, estimated tokens, and the reason they were pulled in.
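A sketch of what that log line could contain, assuming the tool can say which files it attached and why. The `ContextEntry` fields and the reason strings are illustrative, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ContextEntry:
    path: str
    estimated_tokens: int
    reason: str  # e.g. "open in editor", "imported by target file"


def summarize_context(entries: list[ContextEntry]) -> str:
    """Return one JSON log line describing the context bundle for a request."""
    largest_first = sorted(entries, key=lambda e: e.estimated_tokens, reverse=True)
    record = {
        "total_estimated_tokens": sum(e.estimated_tokens for e in entries),
        "files": [asdict(e) for e in largest_first],
    }
    return json.dumps(record)
```

Sorting by size puts the most expensive inclusions first, which is usually where the surprise is.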
If you are really concerned about saving tokens and optimising context, you can check out Ogcode. In my experience it saves me a lot of tokens per session. It is on GitHub.
AI is burning more money than a developer. 🤫
It depends a lot on your developers 😂