Can I get some folks in the comments talking about how closely they monitor their token usage?
Or if you don't, do you work at a company that provides you unlimited tokens? To specific tools?
I'm curious to see where people fall on this spectrum.
Top comments (9)
I use the free tier whenever I can. For example, I use Google Gemini without needing an account, so I don't worry about cost. The other option is running models locally with Ollama. If I run it locally, I don't need to worry at all.
In other words, no credit card no problem lmao xd
Local LLM is powerful, but my ollama crashes after messages more complex than "Hello" because it's limited by my hardware. It's a good solution, but not everyone can use it to its full potential.
I’m on Gemini Ultra for my day-to-day and it’s been a breath of fresh air to tap into as much token use as I need.
Interesting! Do you feel like you are getting your money's worth? Or is the subscription worth it for not having to think about it?
We do a company budget per engineer, and I have to say: Absolutely.
I can't say I'm a fan of this development "tax" in general these days, but moving from concerned-about-tokens to feeling effectively unlimited has been a huge quality-of-life change (not technically unlimited, but I'm operating completely unconstrained).
I think most companies should do this.
Most of my effective token spend is company stuff. I just use the one account for personal stuff too, but that's kind of a rounding error. Maybe different if you do high-volume personal agent stuff.
Great question. I track token usage religiously — mostly because I'm building a fintech product where we run inference pipelines on institutional filing data, and costs compound fast when you're processing thousands of 13F documents per quarter.
What I've found is that the real cost driver isn't the model choice, it's context window management. Stuffing 128k tokens of context into every call when you could get away with a smarter retrieval strategy saves way more than switching from GPT-4 to a cheaper model.
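The arithmetic behind that claim is easy to sketch. Here's a back-of-envelope comparison in Python, with entirely hypothetical per-token prices (check your provider's current rates) and an assumed 1,000 calls per quarter:

```python
CALLS = 1_000
PRICE_FRONTIER = 10.0  # $/1M input tokens (hypothetical frontier-model rate)
PRICE_CHEAP = 1.0      # $/1M input tokens (hypothetical cheaper-model rate)

def cost(calls: int, tokens_per_call: int, price_per_mtok: float) -> float:
    """Input-token cost in dollars for a batch of calls."""
    return calls * tokens_per_call / 1_000_000 * price_per_mtok

full_ctx = cost(CALLS, 128_000, PRICE_FRONTIER)  # stuff the whole window
retrieval = cost(CALLS, 8_000, PRICE_FRONTIER)   # retrieve ~8k relevant tokens
cheaper = cost(CALLS, 128_000, PRICE_CHEAP)      # same bloat, cheaper model

print(f"128k context, frontier model:   ${full_ctx:,.2f}")
print(f"8k retrieved, same model:       ${retrieval:,.2f}")
print(f"128k context, cheaper model:    ${cheaper:,.2f}")
```

With these made-up numbers, trimming context to 8k cuts the bill 16x, while only switching models cuts it 10x and costs you reasoning quality; the exact ratio depends on real prices, but the shape of the argument holds whenever input tokens dominate.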
For personal dev work, I use a mix of local models (Ollama for quick iterations) and API calls only when I need frontier-level reasoning. The hybrid approach keeps my monthly spend under $50 while still having access to the best models when it matters.
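That hybrid setup boils down to a routing decision. A minimal sketch, assuming a crude chars-to-tokens heuristic and a made-up local-context limit (names and thresholds here are illustrative, not any particular tool's API):

```python
def route(prompt: str, needs_frontier_reasoning: bool,
          max_local_tokens: int = 4_000) -> str:
    """Pick a backend: local Ollama for quick iterations, a paid API
    only when frontier-level reasoning is needed or the prompt is too
    large for local hardware to handle comfortably."""
    approx_tokens = len(prompt) // 4  # rough heuristic: ~4 chars per token
    if needs_frontier_reasoning or approx_tokens > max_local_tokens:
        return "frontier-api"
    return "local-ollama"

print(route("refactor this small function", needs_frontier_reasoning=False))
# quick iteration stays local; hard reasoning or huge prompts go to the API
```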
Thanks for splitting your response between work and personal dev work! I think that's a good distinction too.
I have several accounts, so tokens... Don't tell anyone!
token costs hit different when you're running multi-agent setups. single model calls are manageable but once you have 3-4 agents passing context back and forth the bill compounds fast. biggest lever we found wasn't model choice — it was compressing context before it enters the pipeline. a lot of what gets stuffed into prompts is redundant or low-signal, and stripping that out before inference saved us way more than switching to cheaper models. been open-sourcing some of our compression tooling at github.com/jidonglab/contextzip if anyone's dealing with similar token budget headaches