I used to think I had a handle on my AI spending. I had a rough mental model: Claude is cheap, GPT-4 is expensive, Gemini is somewhere in the middle. Good enough, right?
Then I started actually logging what I was burning through. The gap between my mental model and reality was embarrassing.
The problem with just watching your bill
Every major AI provider gives you a monthly bill. That's fine for accounting. It's useless for actually understanding your costs.
By the time the invoice shows up, the context is gone. You don't remember which project, which feature, which dumb experiment ate half your budget. You just see a number and try to feel bad about it.
What you actually need is visibility at the call level. How many tokens did that chat completion use? How expensive was that context window? Is the cost per feature trending up as my codebase grows?
None of the dashboards the providers give you answer these questions in real time.
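Call-level visibility is mostly arithmetic once you have token counts. A minimal sketch, with placeholder per-million-token prices (these are illustrative, not current rates; check your provider's pricing page):

```python
# Estimate per-call cost from token counts.
# (input $/1M tokens, output $/1M tokens) -- hypothetical numbers,
# not real pricing.
PRICES = {
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku": (0.25, 1.25),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single call."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A call with a 2,000-token prompt and a 500-token reply:
cost = estimate_cost("claude-sonnet", 2000, 500)
print(f"${cost:.4f}")  # → $0.0135
```

Most provider SDKs return the token counts in a usage field on the response, so this is a few lines per call, not a project.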
What I tried first
Spreadsheets. Obviously. I had a tab for each provider, manually entered rough token counts after each session, tried to estimate costs.
This lasted about a week before I stopped maintaining it. The friction was too high. I'd forget to log things. I'd ballpark numbers. The data became meaningless noise.
I also tried building a lightweight proxy that logged every API call. That actually worked technically, but then I had to maintain a piece of infrastructure just to track my own costs. As a solo dev building two apps simultaneously, I don't have the bandwidth for that.
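For context, the logging part of that proxy doesn't need to be infrastructure at all. Here's a sketch of the same idea collapsed into a decorator; it assumes the wrapped function returns an object with `usage.input_tokens` and `usage.output_tokens`, as several provider SDKs do (adjust the field names to yours):

```python
import functools
import time

LOG = []  # in a real setup, a file or a sqlite table

def track_usage(fn):
    """Record token counts and latency for every call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        resp = fn(*args, **kwargs)
        LOG.append({
            "fn": fn.__name__,
            "seconds": round(time.time() - start, 3),
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
        })
        return resp
    return wrapper
```

Wrap your call sites once and you get a per-call record without running a separate service, though it still only covers calls that go through your own code.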
The habit that actually worked
I started paying attention to token counts in real time, at the point of use, not after the fact.
This sounds obvious but there's a specific reason it works: when you see the number immediately, you can actually connect cause and effect. Oh, that system prompt is 2,000 tokens every single call. Oh, I'm re-sending the entire conversation history when I only needed the last three messages.
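The history fix above is a one-liner once you see the number. A sketch, assuming the common chat-API message shape of `{"role": ..., "content": ...}` dicts (adapt to your SDK):

```python
# Stop re-sending the whole conversation: keep system messages,
# drop everything but the most recent turns.

def trim_history(messages, keep_last=3):
    """Return system messages plus the last `keep_last` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

The trade-off is real: the model loses older context, so this only works for tasks where the last few turns carry everything that matters.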
For my Mac menu bar workflows, I ended up using TokenBar — it shows live token counts and estimated cost right in the menu bar as I work. The thing about having it persistent and always visible is that it changes how you think. You start making micro-decisions constantly: is this context worth the extra tokens? Is this feature request worth spinning up a full Claude session or can I handle it with a lighter model?
The three questions I ask now
After a few months of actually paying attention, I've settled into asking three things about every AI interaction:
1. What's the token density?
Not just how many tokens, but how much useful work per token. A 5,000-token call that produces a complete working feature is cheap. A 1,000-token call that produces a vague response I have to iterate on three more times is expensive.
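The arithmetic behind this is worth spelling out, because the iterated calls are worse than they look: each retry usually re-sends the growing conversation history. A sketch with a placeholder per-token rate (not a real price):

```python
RATE = 3.00 / 1_000_000  # hypothetical $/token, for illustration only

# One careful 5,000-token call that ships the feature:
one_shot = 5000 * RATE

# A vague 1,000-token call plus three retries, where each retry
# re-sends the growing history: 1000 + 2000 + 3000 + 4000 tokens.
iterated = sum(1000 * n for n in range(1, 5)) * RATE

print(f"{one_shot:.4f} vs {iterated:.4f}")
```

Under these assumptions the "expensive-looking" single call costs half as much as the four "cheap" ones, before you even count your own time spent iterating.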
2. Is this the right model for this job?
For a long time I defaulted to Claude Sonnet for everything. Then I realized: for quick validation tasks, formatting, or simple transformations, Haiku costs a fraction as much and is fast enough that the difference doesn't matter. I probably cut my costs by 40% just by routing tasks properly.
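The routing itself can be embarrassingly simple. A sketch of the habit as code, with hypothetical task labels and model names (this is a heuristic, not a real API):

```python
CHEAP, BIG = "claude-haiku", "claude-sonnet"

def pick_model(task: str) -> str:
    """Default to the cheap tier; escalate only when the task needs it."""
    light_tasks = {"format", "validate", "extract", "classify"}
    return CHEAP if task in light_tasks else BIG

print(pick_model("format"))    # cheap tier
print(pick_model("refactor"))  # escalates to the bigger model
```

The point isn't this particular lookup table; it's that the decision happens explicitly at the call site instead of one model being the silent default.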
3. Am I paying for laziness?
This one stings. A lot of my token burn came from not thinking carefully about prompts before sending them. I'd throw a messy, half-formed request at the API, get a mediocre response, and iterate three more times. A little upfront clarity would have made it one call instead of four.
What actual monitoring looks like
I check my token usage the same way I check my git diffs — regularly, not obsessively, but with intention.
The key insight is that monitoring shouldn't require effort. If you have to go somewhere to check it, you won't. The visibility needs to be ambient — always there, not intrusive.
Menu bar for live tracking. Provider dashboards for weekly reviews. That's the whole stack.
The spreadsheet phase was necessary because it forced me to pay attention, even if the data was garbage. What replaced it is better because it's automatic — the numbers are just there, and over time you develop intuitions about what's normal and what's a red flag.
The meta-lesson
AI costs are weird because they feel like they should be predictable (it's just API calls!) but they're actually highly variable based on how you're working, what you're building, and how carefully you're thinking.
You can't optimize what you're not measuring. And you can't sustain measurement that requires manual effort.
Find the lowest-friction way to make costs visible in your actual workflow, not in a separate dashboard you have to remember to check. That's the whole game.