GDS K S

Posted on May 13

I asked Cursor to rename a function. It sent 8,400 tokens. I checked.

#ai #cursor #claude #productivity

4x to 7x overhead versus direct API calls

The afternoon I learned what my AI subscription was actually doing, and the 200 lines that took my next bill down 41 percent.

I had been using Cursor for six months when I noticed the discrepancy. I was renaming a function. A short one. Three lines of body. One call site. The kind of refactor that takes, at most, six seconds of human attention.

I had two windows open. The Cursor chat panel where I had typed "rename getUser to fetchUser". And the Anthropic console in another tab, because I had been debugging a different project earlier and forgot to close it.

The Anthropic console refreshed while the Cursor request was in flight. I watched the token counter tick up live. The number it landed on for that single rename request was 8,400 input tokens. The actual prompt I had typed was eleven words.

I sat there for a moment. Then I opened a fresh terminal and made the same call directly through the Anthropic API with my own minimal prompt. Same model. Same intent. Same outcome.

The direct call used 1,900 input tokens. Cursor had sent 6,500 extra tokens of context to perform the same rename.

That observation was the start of the spreadsheet that ate the next four hours of my evening.

What was in those 6,500 tokens

I do not have inside knowledge of Cursor's internals. I have my own experiments and the public documentation, neither of which gives a complete picture. Here is what I have: when I ran the same prompt 50 times across different parts of my codebase, the input token count varied between roughly 4,000 and 14,000 with a median around 8,000. The variation correlated loosely with how many open buffers I had and how recently I had viewed files in the same directory.

The reasonable inference is that Cursor was sending me a system prompt, plus indexing context derived from my recent activity, plus tool definitions for the agent framework, plus the actual prompt I had typed. The first three are the routing layer doing its job. The fourth was the only one I could see in the chat panel.

Some of that context helps. When I ask for a refactor that touches several files, Cursor knowing about those files is the entire point. When I ask to rename a function whose call sites fit in three lines of context, sending 6,500 tokens of unrelated buffer state is the routing layer playing it safe on my behalf in a way that benefits Cursor more than it benefits me.

Cursor charges a fixed seat fee. Anthropic charges Cursor by the token. The math runs the wrong direction for me as a heavy user, because the marginal token cost ends up in my own direct API calls (which Cursor's seat does not cover) plus the fixed seat itself. Conservative context is cheap to ship and expensive to consume. The incentive lands on the wrong side of the table.

The bill that made me pay attention

My March bill arrived on April 2. The Anthropic line item had grown 50 percent month over month for three months running. Cursor at $20 was flat. Copilot at $10 was flat. The variable line was the API I was hitting from my own CLI for things Cursor was not the right tool for.

The growth was not from doing more work. I checked. My weekly logged hours were stable. The growth was from an increasing fraction of those hours involving AI calls that I was making more casually because the AI was getting more useful.

The trend was straight. If I extrapolated, by August I was going to be paying more in direct Anthropic API spend than in Cursor seats, and my Cursor seat would still be running the same conservative context overhead on every chat panel turn. The bill was going to keep growing in two places at once.

I cancelled Cursor that weekend.

The 200 lines

The thing I built to replace the routing piece of Cursor was small enough to embarrass me for not having built it months earlier. A regex based intent classifier with five rules. Trivial prompts route to Haiku. Code prompts route to Sonnet. Planning prompts route to Opus. Embedding-style classification prompts route to a cheap OpenAI model. Default to Sonnet if nothing matches.

That is the entire routing logic. Two hundred lines of TypeScript including imports, error handling, a pricing table, and a cost calculator that logs every call. The full file fits on one screen if you have a tall display.

I tested it on a hundred prompts I had logged from the previous week. The breakdown shifted hard. Sonnet went from 70 percent of calls to 25 percent. Haiku went from zero to 60 percent. Opus stayed at 5 percent. The estimated cost reduction was 47 percent on the test set.

I did not believe the number. I assumed I had a bug. I instrumented the router to log the actual model picked per prompt and the cost in real time, and I ran my normal workflow for two weeks. The actual reduction came in at 41 percent on the May 2 bill, with 30 percent more total calls because the cheaper per call cost made me reach for AI more often.

What I did not understand before

The routing layer is the most valuable part of the AI tool stack right now, and the wrappers want to own that part most of all.

Every coding tool I have looked at in the last 90 days has shipped a model dropdown. Cursor added one. GitHub Copilot added one. Windsurf added one. The story they tell is customer choice. The story underneath, I think, is that they have all noticed the same thing I noticed in April. The user can route their own calls. The user is starting to. If the wrappers do not own the routing layer, they own a chat panel and an autocomplete and not much else.

The chat panel is real value. The autocomplete is real value. They are not $20 a month of value for a heavy user who would prefer to route directly. They are maybe $5 to $10 a month of value, sized to the actual work they save.

I think we are two quarters away from a wave of users doing this exercise. The wrappers know that. The pricing pages are starting to reflect it.

What I would tell my March self

Three things, in the order they matter.

The first: open the Anthropic dashboard. Look at the input token count on three of your normal Cursor turns. If the number runs more than 3x your direct call baseline, the routing layer is not earning its cost on your usage pattern. That does not mean cancel. That means notice.

The second: log every AI call you make for one week. Cost per call, model picked, prompt length, output length. The log takes twenty lines of code per provider. The data will surprise you. No honest way exists to optimize a bill you cannot see.

The third: write the router. Two hundred lines. The first version does not have to be smart. Five regex rules for intent capture 70 percent of the savings. Iterate on the rules later.

The reason I would tell my March self these things: I would have done the same exercise three months earlier and saved roughly $300 in subscription overlap. The cost of doing the exercise: one Saturday afternoon. The cost of not doing it: whatever your bill grows to in the next quarter, which for most people building with AI right now lands at a number bigger than they want to admit.

What this does not solve

The router does not give me a multi file editing agent. It does not index my codebase. It does not know about my open buffers. It does not autocomplete inline. None of that is its job.

I kept Copilot for the inline ghost text in VS Code, because that is a different product solving a different problem and the $10 is not the line that hurts. For the multi file agent work I would have used Cursor for, I now use Claude Code from the terminal, which I pay for separately through my Max plan. The total stack is cheaper than Cursor plus my old direct API spend.

If your usage pattern is different, your math will be different. If you live in the chat panel and rarely go outside it, Cursor is probably still a fair trade. If your AI work spans chat, agent loops, embedding pipelines, and one off CLI calls, the routing layer is the one piece worth owning yourself.

The closing

The bill came in on April 2. The new bill came in on May 2. The difference between them was 41 percent and 200 lines of code and one weekend afternoon I was going to spend half asleep in front of a movie.

The lesson should have been obvious. The wrappers have an incentive to send more tokens than necessary. The user has an incentive to send fewer. The routing layer is where the two incentives meet. Whoever owns the routing layer wins.

The routing layer can be 200 lines of yours.

What is the line on your AI bill that grew the fastest last month? Drop it in a reply. I read everything.

Written by **GDS K S* (thegdsks.com), building Glincker.*
If this was useful, follow me on X / @thegdsks. I write about the parts of the AI stack vendors keep off the pricing page.

Top comments (15)

Mary Olowu • May 17

The spreadsheet detail makes this measuring it instead of assuming is the whole post.

One thing the 4k–14k variance hints at: a lot of that context is the routing layer compensating for the fact that the model has no durable state to retrieve, so it re-ships ambient buffer state every call just in case.

When the load-bearing context lives in a record the agent can pull deliberately, the “play it safe, send everything” default has less to do.

Doesn’t fix Cursor’s economics for you, but it’s a strong argument for not letting the chat window be the only memory.

Saving the token-audit approach.

Ben Halpern • May 15

Holy wow

HARD IN SOFT OUT • May 13

I laughed because I’ve felt that same silent budget hemorrhage. Thank you for ripping open the token details—it’s a warning everyone using AI coding tools needs. Most devs don’t realize the context re‑sent includes huge chunks of unrelated code. Can we teach such tools to send only the AST of what actually changed? If not, maybe we need a local proxy that slims the payload before it ever hits the network.

Imagine a “token budget” mode hard‑coded into the editor. If the projected context exceeds your cap, it simply refuses the request. That would force more modular coding habits from the start.

GDS K S • May 15

Feels like we need a tiny AI sitting on laptop or pc deciding what context to send and not and managing here itself to improve overall process then bloating context.. .
Cause if you keep cleaning the context then the AI doesn't know whole context and starts hallucinating as well...

HARD IN SOFT OUT • May 16

Engineering Agent Memory check the post. dev.to/kenwalger/engineering-agent...

see I'm comment in there, we are talking exactly the problem you are facing in this blog.

Harjot Singh • May 30

8,400 tokens to rename a function is the perfect tiny example of the whole problem. The model doesn't "rename" - it re-reads a pile of context to be safe, then regenerates, and you pay frontier-model rates for what is mechanically a find-and-replace. The mismatch between task complexity and tokens spent is enormous on exactly this kind of trivial edit.

This is the strongest case for routing: a rename, an import fix, a formatting pass - none of it needs a reasoning model. A cheap model (or honestly an LSP/refactor tool) handles it for a fraction of the tokens. You'd reserve the expensive model for the 5% of asks that genuinely require thinking. Checking the actual token count like you did is how people wake up to this - most never look. Great little investigation.

Max Quimby • May 13

Love that you actually measured this. The 4x-7x overhead over the bare API call lines up with what we see when we sniff Cursor/Cline-style clients — the system prompt + workspace indexing + tool definitions roughly dominate the prompt for any short request, and the marginal cost of your actual edit is in the noise.

One nuance worth surfacing for anyone building their own router: the cost-routing math is the easy part. The hard part is the intent classifier itself. A 200-line classifier that routes ~80% correctly is fantastic when it's right, and corrosive when it's wrong, because the failure mode is silent — a slightly worse answer that the user accepts because they don't know Opus would have caught the bug. We started instrumenting "router regret" by re-running a sampled 5% of routed requests on the next tier up and diffing the outputs. It costs a bit but it's the only way to know whether your savings are real or whether you're just downgrading quality you can't see.

Also: the 41% cost win is a great result, but the unit you actually want to optimize is cost per accepted edit, not cost per call. Cheap calls that get rejected and re-prompted are a worse deal than expensive calls that land first try.

HR Pulsar • May 15

This is exactly why I still don’t buy the “full AI engineer replacement” narrative. At some point a human notices something’s off.

And honestly, the lack of observability in a lot of commercial AI tooling is weird. You start a process, wait forever, your laptop fans enter takeoff mode, and 20 minutes later you get either nonsense or a beautifully formatted disaster.

Feels like these tools desperately need basic monitoring primitives:
token spikes, loop detection, some tipical alerts, intermediate reasoning snapshots, kill switches mid-run.

At the same time, I genuinely believe AI agents are the next real platform shift. The value is obvious already. But right now they’re still more like copilots with a weirdly broad skill matrix — not Terminators with independent judgment and a stable personality.

Right now a lot of AI workflows feel less like pair programming and more like sending a junior dev into the basement and hoping they come back with the right file 🥴

Andrii Krugliak • May 16

Ran four agents in parallel last week and one of them ate roughly 60% of the budget on a single rename refactor. Same surprise. The instrumentation gap is the actual problem - you don't know which agent burnt the tokens until 3 hours after the fact, by which point the diff is already merged. Did you find a way to flag token-disproportionate edits in real time, or only post-hoc?

John • May 16

The “cost per accepted edit” point in the comments feels like the missing metric here. Raw token spend is useful, but the real alarm should be “this tiny rename used 8k tokens and produced one accepted diff.” That would make waste obvious without forcing devs to inspect every request manually.

John • May 16

This matches what I keep seeing too: the expensive part is not the visible prompt, it is the safe default context bundle. The practical fix is less “use a cheaper model” and more “make context selection observable,” even if it is just logging files included, estimated tokens, and the reason they were pulled in.

Prasenjeet Kumar • May 16

If you are really concern about saving tokens and optimising context, you can check out Ogcode - In my experience it is saving me lots of tokens per session. It is on github

View full discussion (15 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.