Vikrant Shukla
Every LLM response tells you what the call cost. Almost nobody keeps it.

Every call you make to Claude, GPT, or Gemini comes back with a usage block. Input tokens, output tokens, cache reads and writes. The exact shape of what you just paid for. Then the response gets parsed for its content, the usage gets dropped on the floor, and the only record left is a number on a dashboard at the end of the month.

So when the bill moves, you can't say why. Was it the RAG service or the agent loop? The prompt you changed on Tuesday, or just more traffic? The data to answer that existed once, per request, and nobody kept it.

I think that's the actual problem. Not "LLMs are expensive." Cost attribution is a data-retention problem, and it's been left unsolved because the obvious places to solve it are all wrong.

You can wrap the SDK. But that only covers the code paths you own: not Claude Code, not a CLI, not some dependency three layers down that calls the API itself. You can scrape the provider dashboard. But it's already aggregated; the per-call detail is gone. You can ask developers to log usage by hand. They won't, and they shouldn't have to.

The only layer that sees every call, without anyone changing their code, is the network. Almost every LLM client honours the HTTP_PROXY and HTTPS_PROXY environment variables. Put something there that keeps the usage block instead of discarding it, attributes each call to a project, prices it, and stores it, and the attribution problem just goes away. No SDK wrapper, no workflow change.
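To make the proxy convention concrete, here is a minimal sketch of how a client library discovers a proxy from the environment, using Python's standard `urllib.request.getproxies()`. The address `http://127.0.0.1:8080` is a made-up placeholder, not necessarily where any particular tool listens, and `proxies_seen_by_client` is a hypothetical helper written for this example.

```python
import os
import urllib.request

# Hypothetical local proxy address; the real host and port depend on the tool.
PROXY_URL = "http://127.0.0.1:8080"


def proxies_seen_by_client(proxy_url: str) -> dict:
    """Set the conventional proxy variables for this process and
    return the scheme -> proxy mapping that urllib resolves from them."""
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url
    return urllib.request.getproxies()


resolved = proxies_seen_by_client(PROXY_URL)
print(resolved.get("http"), resolved.get("https"))
```

Nothing in the calling code names the proxy. The environment does, which is why the approach needs no SDK wrapper.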

That's the whole idea behind the thing I've been building: a local proxy that observes outbound LLM traffic and keeps the usage data that would otherwise be thrown away.

A few things I learned keeping the data that everyone else drops:

The usage block is harder to capture than it looks. With streaming on, the numbers only arrive in the final event, so you have to reassemble the stream to count. Cache tokens are priced differently from normal input tokens; treat them the same and your total drifts. Each provider reports usage in a different shape, so the provider-specific logic has to be isolated or the core rots.
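Those three points can be sketched together. This is an illustration under stated assumptions, not the tool's actual code. The `Usage` type and every function name are invented for the example. The field names follow my understanding of the Anthropic-style and OpenAI-style usage blocks. The rates are placeholders, not real prices. Some providers also spread counts across more than one stream event, which a real implementation would need to merge and this sketch does not.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Usage:
    """Provider-neutral token counts (field names are this sketch's own)."""
    uncached_input: int
    cache_read: int
    cache_write: int
    output: int


def usage_from_stream(sse_lines):
    """Return the last non-empty top-level `usage` object in an SSE stream."""
    last = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        if isinstance(event, dict) and event.get("usage"):
            last = event["usage"]
    return last


def from_anthropic(u):
    # Assumed shape: input_tokens already excludes cached tokens.
    return Usage(
        uncached_input=u.get("input_tokens") or 0,
        cache_read=u.get("cache_read_input_tokens") or 0,
        cache_write=u.get("cache_creation_input_tokens") or 0,
        output=u.get("output_tokens") or 0,
    )


def from_openai(u):
    # Assumed shape: prompt_tokens includes cached tokens, so subtract them.
    cached = (u.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
    return Usage(
        uncached_input=(u.get("prompt_tokens") or 0) - cached,
        cache_read=cached,
        cache_write=0,
        output=u.get("completion_tokens") or 0,
    )


# Placeholder rates in USD per million tokens. NOT real provider prices.
RATES = {
    "uncached_input": Decimal("3"),
    "cache_read": Decimal("0.30"),
    "cache_write": Decimal("3.75"),
    "output": Decimal("15"),
}


def price(usage, rates=RATES):
    """Price each token class at its own rate, using exact decimal math."""
    total = (
        usage.uncached_input * rates["uncached_input"]
        + usage.cache_read * rates["cache_read"]
        + usage.cache_write * rates["cache_write"]
        + usage.output * rates["output"]
    )
    return total / Decimal(1_000_000)


stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}], "usage": null}',
    'data: {"choices": [], "usage": {"prompt_tokens": 1000, '
    '"completion_tokens": 200, "prompt_tokens_details": {"cached_tokens": 800}}}',
    "data: [DONE]",
]
usage = from_openai(usage_from_stream(stream))
cost = price(usage)
# The naive version: every prompt token billed at the normal input rate.
naive = price(Usage(uncached_input=1000, cache_read=0, cache_write=0, output=200))
print(cost, naive)
```

Keeping `from_anthropic` and `from_openai` as small adapters into one `Usage` type is one way to isolate the provider-specific logic. The gap between `cost` and `naive` is the drift you get from treating cached tokens as normal input.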

And reconciliation is the whole game. If the number I keep doesn't match the provider's invoice, keeping it was pointless. That's the bar: not "roughly," but reconciles. It's the part I'm still hardening, and the part I'd most like other people's eyes on.
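As a rough sketch of what a reconciliation check might look like: sum the stored per-call costs by model, compare them with the invoiced amounts, and report anything outside a tolerance. The input shapes, the `reconcile` name, and the one-cent tolerance are all assumptions made for illustration. Real invoices may group or round charges differently.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(calls, invoice_by_model, tolerance=Decimal("0.01")):
    """Compare summed per-call cost with invoiced totals, per model.

    `calls` is an iterable of (model, cost) pairs and `invoice_by_model`
    maps model -> invoiced amount (both shapes are hypothetical).
    Returns {model: (logged, invoiced, difference)} for mismatches only.
    """
    logged = defaultdict(Decimal)  # Decimal() is zero
    for model, cost in calls:
        logged[model] += cost

    mismatches = {}
    for model in sorted(set(logged) | set(invoice_by_model)):
        ours = logged.get(model, Decimal("0"))
        theirs = invoice_by_model.get(model, Decimal("0"))
        diff = ours - theirs
        if abs(diff) > tolerance:
            mismatches[model] = (ours, theirs, diff)
    return mismatches


calls = [
    ("model-a", Decimal("1.20")),
    ("model-a", Decimal("0.30")),
    ("model-b", Decimal("2.00")),
]
invoice = {"model-a": Decimal("1.50"), "model-b": Decimal("2.25")}
report = reconcile(calls, invoice)
print(report)
```

Using `Decimal` rather than floats keeps the sums exact, so any difference that shows up comes from the data and not from rounding in the check itself.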

It runs locally, installs in one line:

`uvx halton-meter`

It's free and stays free for local use. Source and docs are on my profile. If you ship LLM apps, I'd like to know how you'd model attribution for agentic or multi-tenant workloads, and where you've watched logged cost drift from the real bill.

(There's a hosted version in beta, with reconciliation and history across machines, free for a few people while I find what breaks: https://haltonmeter.com/beta)