I kept hitting Claude Code's usage limits. No idea why.
So I parsed the local session files and counted tokens. 38 sessions. 42.9 million tokens.
Only 0.6% were Claude actually writing code.
The other 99.4%? Re-reading my conversation history before every single response.
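The counting itself is simple once you find the session files. Here's a minimal sketch of the approach; the directory location and the field names (`message.usage.input_tokens` and friends) are assumptions about the local JSONL layout, not a documented schema:

```python
import json
from pathlib import Path

def summarize(jsonl_lines):
    """Sum input vs. output tokens from JSONL session records.
    Field names here are assumed, not a documented schema."""
    totals = {"input": 0, "output": 0}
    for line in jsonl_lines:
        try:
            usage = json.loads(line).get("message", {}).get("usage", {})
        except json.JSONDecodeError:
            continue  # skip non-record lines
        totals["input"] += usage.get("input_tokens", 0) + usage.get("cache_read_input_tokens", 0)
        totals["output"] += usage.get("output_tokens", 0)
    return totals

if __name__ == "__main__":
    sessions_dir = Path.home() / ".claude" / "projects"  # assumed location
    grand = {"input": 0, "output": 0}
    if sessions_dir.exists():
        for f in sessions_dir.rglob("*.jsonl"):
            t = summarize(f.read_text().splitlines())
            grand["input"] += t["input"]
            grand["output"] += t["output"]
    total = grand["input"] + grand["output"]
    if total:
        print(f"{100 * grand['output'] / total:.1f}% of tokens were Claude writing")
```

The "Claude Wrote" percentage is just output tokens over the grand total.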
Not as scary as it sounds
Input tokens (Claude reading) cost $3 per million on Sonnet.
Output tokens (Claude writing) cost $15 per million.
So that tiny 0.6% of writing carries 5x the per-token cost. The re-reading is cheap on its own.
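As a quick sanity check on those rates, here's the cost arithmetic in code (a simplification: real accounting would also price cache reads separately, which are cheaper):

```python
# Sonnet API pricing from above: $3/M input, $15/M output.
PRICE_IN = 3.00 / 1_000_000
PRICE_OUT = 15.00 / 1_000_000

def session_cost(input_tokens, output_tokens):
    """Equivalent API cost in dollars for one session."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# A session that re-reads 1M tokens while writing only 10k:
# writing is 5x pricier per token, but reading dominates on volume.
print(f"${session_cost(1_000_000, 10_000):.2f}")  # $3.15
```

In that example, $3.00 of the $3.15 is re-reading.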
The problem is compounding.
Every message you send, Claude re-reads your entire history. Message 1 reads nothing. Message 50 re-reads messages 1 through 49. By message 100, it's re-reading everything.
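That growth can be sketched with a toy model (the per-message size is a made-up average, not measured):

```python
AVG_MSG = 500  # hypothetical average tokens per message

def total_reread(n_messages, msg_tokens=AVG_MSG):
    # Before reply k, the model re-reads messages 1..k-1,
    # so cumulative input grows quadratically with session length.
    return sum((k - 1) * msg_tokens for k in range(1, n_messages + 1))

for n in (10, 50, 100):
    print(n, total_reread(n))
# A 100-message session re-reads ~2.5M tokens in total,
# even though the transcript itself is only 50k.
```

Doubling a session's length roughly quadruples its total input tokens.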
My worst session hit $6.30 in equivalent API cost. The median was $0.41. The difference? I let it run 5+ hours without /clear.
Lazy prompts are secretly expensive
A prompt like "do it" costs nearly the same as a detailed paragraph. Your message is tiny compared to the history being re-read alongside it.
But detailed prompts get results in fewer rounds. Fewer rounds = less compounding. "Add input validation to the login function in auth.ts" beats "fix the auth stuff" because it finishes in one shot instead of three.
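A toy comparison makes the round-count effect concrete (all token sizes below are hypothetical, chosen only to illustrate the shape):

```python
HISTORY = 20_000  # tokens already in context, mid-session (hypothetical)
PROMPT = 100      # rough prompt size
REPLY = 1_000     # rough reply size

def conversation_input(n_rounds, history=HISTORY):
    """Total input tokens when a task takes n_rounds back-and-forths:
    each round re-reads the full history, which itself keeps growing."""
    total = 0
    for _ in range(n_rounds):
        total += history + PROMPT
        history += PROMPT + REPLY  # this round joins the history
    return total

print(conversation_input(1))  # one detailed prompt: 20,100 input tokens
print(conversation_input(3))  # three vague rounds: 63,600 input tokens
```

The vague prompt costs over 3x as much input, and the gap widens the longer the session already is.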
What actually helped
Use /clear between unrelated tasks. Your test-writing agent doesn't need your debugging context.
Keep sessions under 60 minutes. After that, context compaction kicks in and you lose earlier decisions.
Be specific. Fewer rounds = less compounding = lower cost.
I built a tool for this
Wanted to keep tracking over time, so I packaged it up.
uvx tokburn serve
One command. Local dashboard. Nothing installed permanently. Nothing leaves your machine.
Or, for a permanent install: pip install tokburn && tokburn serve
Shows equivalent API cost per session, daily trends, waste detection, and the "Claude Wrote" percentage.
Someone on LinkedIn ran it on 1,765 sessions: $5,209 equivalent API cost. Max plan paying for itself many times over.
GitHub: github.com/lsvishaal/tokburn
If you try it, drop your numbers in the comments. Genuinely curious about your stats.
First open-source project. First DEV.to post. Python + FastAPI. MIT licensed.

Top comments (5)
Thanks for the transparency — 77% repetitive context explains a lot.
We ran a similar measurement on 116 conversations (4.7M tokens) as part of the LogMind project. After prompt optimization, cached tokens accounted for 70–80%, and the total cost through the DeepSeek API came out to $0.51 (less than half a cent per conversation). Provider choice turned out to matter more than optimization: DeepSeek is 20–30× cheaper than OpenAI/Claude.
The key takeaway for us: yes, AI coding is “expensive” in terms of tokens, but with the right approach, that “expensive” deserves scare quotes. What matters most is transparency — so you can consciously improve the process. Thanks for contributing to that transparency.
The reason people stick with Claude/Sonnet despite the premium is that 'magic' factor. That's what people rely on to reduce their burden in commercial settings, and it's hard to replicate on cheaper models right now without explicit instructions and a carefully tuned setup.
Personally, I'd love to see open-source alternatives like OpenCode close that gap. When they do, tools like TokBurn become even more useful for comparing cost-vs-quality across providers.
What's the LogMind project? Curious about your measurement setup.
Thank you for the detailed reply — and you're absolutely right. Claude's ability to hold an entire codebase in context and reason about architecture without hand-holding is its real value proposition. In commercial settings, the trade-off isn't just token cost, but cognitive load and iteration speed. That "magic factor" is hard to quantify, but it's very real.
LogMind is a project where we tried to understand how to replicate some of that magic with cheaper models — not by improving the model, but by improving the interaction protocol. We analyzed 116 real conversations (1.7M tokens) to see what separates shallow Q&A from genuine co-engineering.
Here's how it worked:
1. Exported all conversations from DeepSeek (the free web interface)
2. Built a parser to extract a linear dialog from the JSON tree
3. Sent each conversation through the DeepSeek API with a structured prompt asking for: intention type (asking/doing/expressing), interaction protocols, psychological stance, key themes, and a uniqueness score
4. Aggregated everything into a cognitive profile (profile.json): dominant modes (analyst/engineer/philosopher), trust levels, topic networks, and cost metrics
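[Editor's note: the tree-flattening step above might look like this sketch. The field names (role/content/children) and the branch-selection rule are pure assumptions, not DeepSeek's actual export schema.]

```python
def linearize(node):
    """Flatten a branching conversation tree into an ordered list of
    (role, text) turns by following the first child at each branch.
    Field names are assumed, not the real export format."""
    dialog = []
    while node is not None:
        dialog.append((node["role"], node["content"]))
        children = node.get("children", [])
        node = children[0] if children else None
    return dialog

tree = {
    "role": "user", "content": "hi",
    "children": [{"role": "assistant", "content": "hello", "children": []}],
}
print(linearize(tree))  # [('user', 'hi'), ('assistant', 'hello')]
```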
What we found is that with the right protocol (iterative refinement, role assignment, explicit evaluation requests), cheaper models can get surprisingly close to the "magic" — not by matching context length, but by making the interaction itself smarter. Total cost for processing 116 conversations was $0.51, which gave us 44 high-value unique dialogs, an integrated cognitive profile, and a clear picture of how interaction style evolves over time.
I'm curious about your measurement approach: in the article, you focused on token economics (0.6% code, 77% repetitive context), but in your reply you mentioned the "magic factor" — holding codebases in context, architectural reasoning. Did you try to measure those qualitative aspects in any way? For example, did you track iterations per solution, success rates on complex refactors, or developer subjective feedback? I'd love to hear more about the qualitative side of your measurements.