You're paying for 200,000 tokens of context. But how many of those tokens are actually doing useful work?
We built ClaudeTUI — a set of monitoring tools for Claude Code — and dug into the raw JSONL transcript data to trace every token. What we found surprised us: there are four distinct categories of token usage, and only one of them is your actual work.
What Happens When You Press Enter
Here's something most Claude Code users don't realize: every time you press Enter, the entire conversation is sent from scratch.
The Claude API is stateless. It doesn't remember your previous messages. So every message you send triggers an API call that includes:
- System prompt (~14k tokens) — Claude Code's instructions, tool definitions, your CLAUDE.md
- Full conversation history — every message, every tool call, every tool result since the last compaction
- Your new message
On turn 1, that's maybe 15k tokens. By turn 15, it's 100k. By turn 30, it's 167k — and then compaction fires.
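That growth curve is easy to sketch with a toy model. The tokens-per-turn average here is an illustrative assumption, not a measured value:

```python
# Toy model of per-call token growth. TOKENS_PER_TURN is an illustrative
# assumption; the system prompt size and compaction threshold come from
# the measurements discussed in this article.
SYSTEM_PROMPT = 14_328           # constant prefix on every call
TOKENS_PER_TURN = 5_300          # assumed average (prompt + reply + tool output)
COMPACTION_THRESHOLD = 167_000   # ~83% of the 200k window

def tokens_sent(turn: int) -> int:
    """Tokens included in the API call for a given 1-indexed turn."""
    return SYSTEM_PROMPT + (turn - 1) * TOKENS_PER_TURN

for turn in (1, 15, 30):
    total = tokens_sent(turn)
    note = "  <- compaction fires" if total >= COMPACTION_THRESHOLD else ""
    print(f"turn {turn:>2}: {total:>7,} tokens{note}")
```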
This is why Claude gets slower and more expensive as your session goes on. Each Enter keystroke processes more tokens than the last. And it's why compaction exists: without it, you'd hit the 200k wall and the session would simply stop.
The good news: Anthropic's prompt caching makes this less painful than it sounds. But it's worth understanding how it works.
The Cache Lives on Anthropic's Servers
Your machine sends the full conversation on every request — the same bytes go over the network every time. The optimization happens server-side: Anthropic checks "have I seen this exact prefix of tokens recently?" If yes, it skips re-processing them and charges the cheaper cache read rate ($1.50/M instead of $15/M for Opus — a 10x discount).
In a 157-turn session, we measured 98% of all tokens as cache reads. That makes sense: by turn 100, you're re-sending 99 turns of history that are already cached. Only the newest content goes through the expensive cache_creation path.
The cache has a TTL — likely ~5 minutes for conversation content. If you pause too long between turns, the cache expires and the next call pays full input price for everything. This is also why compaction is expensive: it blows away the entire cached conversation and replaces it with a brand new summary that goes through cache_creation from scratch.
One more thing: the tokens still count toward your 200k context window, even when cached. Caching saves money, not space.
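Here's what that discount means for a single late-session call, sketched with the Opus rates quoted above (the token split is illustrative):

```python
# Cost of one late-session Opus call, with and without prompt caching,
# using the rates quoted above. Token counts are illustrative.
UNCACHED_PER_M = 15.00    # $/M tokens, uncached input
CACHE_READ_PER_M = 1.50   # $/M tokens, cache read

def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Dollar cost of a single call's input tokens."""
    return (cached_tokens * CACHE_READ_PER_M
            + fresh_tokens * UNCACHED_PER_M) / 1_000_000

# Around turn 100: ~160k tokens already cached, ~1k genuinely new.
print(f"with cache:    ${call_cost(160_000, 1_000):.3f}")
print(f"without cache: ${call_cost(0, 161_000):.3f}")
```

The same 161k-token request costs roughly a tenth as much when almost all of it hits the cache — but it still occupies 161k of the context window either way.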
Now let's look at what those tokens actually are.
The Four Types of Tokens
Every API call Claude Code makes has a token usage breakdown in its transcript. By parsing thousands of these calls across real sessions, we identified four categories:
1. System Prompt (~14k tokens) — The Constant Tax
Every single API call includes a system prompt: Claude Code's internal instructions, tool definitions, safety guidelines, and your CLAUDE.md file. In our sessions, this was consistently ~14,328 tokens.
This isn't something you can avoid. It's infrastructure. But it means that out of your 200k window, only ~186k is ever available for actual conversation.
We discovered this by looking at cache_read_input_tokens after compaction events. The value resets to exactly 14,328 every time — that's the system prompt floor. During normal operation, cache_read grows from 14k to 167k as your conversation accumulates in the cache.
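Extracting those usage fields is straightforward. A minimal sketch, assuming the transcript layout described above — one JSON object per line, with the usage block nested under `message`:

```python
# Minimal sketch of reading token-usage records from a Claude Code JSONL
# transcript. Field names follow the Anthropic API usage block; the exact
# message layout is an assumption -- adjust for your transcript files.
import json
from pathlib import Path

def usage_records(transcript: Path):
    """Yield one usage dict per API call found in the transcript."""
    for line in transcript.read_text(encoding="utf-8").splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines defensively
        usage = entry.get("message", {}).get("usage")
        if usage:
            yield {
                "cache_read": usage.get("cache_read_input_tokens", 0),
                "cache_creation": usage.get("cache_creation_input_tokens", 0),
                "input": usage.get("input_tokens", 0),
                "output": usage.get("output_tokens", 0),
            }
```

Watching `cache_read` across these records is what exposed the 14,328-token floor.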
2. Compaction Summary (~11-19k tokens) — The Rebuild Cost
When compaction fires, Claude Code compresses your entire conversation into a summary. The next API call has to read that summary to reconstruct context. This is the real overhead of compaction.
From a real 3-compaction session:
| Compaction | Summary Size | What It Costs |
|---|---|---|
| #1 | 18.8k tokens | $0.47 (Opus) |
| #2 | 10.6k tokens | $0.22 (Opus) |
| #3 | 17.8k tokens | $0.37 (Opus) |
These summaries are lossy. Your 167k of rich context — exact error messages, file contents, code snippets — gets compressed into 11-19k tokens. Details are lost.
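As a rough cross-check on the table, the summaries can be priced at a cache-write rate. The 1.25× cache-write multiplier over the $15/M input rate is an assumption about standard Anthropic pricing, and the table's figures likely fold in additional output-token costs, so treat this as a lower bound:

```python
# Back-of-envelope pricing of each compaction summary as a cache write.
# The 1.25x multiplier over the $15/M Opus input rate is an assumed
# standard rate; treat the results as a lower bound on the table above.
INPUT_PER_M = 15.00
CACHE_WRITE_PER_M = INPUT_PER_M * 1.25  # $18.75/M, assumed

for label, summary_tokens in [("#1", 18_800), ("#2", 10_600), ("#3", 17_800)]:
    cost = summary_tokens * CACHE_WRITE_PER_M / 1_000_000
    print(f"compaction {label}: {summary_tokens/1000:.1f}k summary -> ${cost:.2f} cache write")
```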
3. Useful Work — What You Actually Paid For
This is everything else: your prompts, Claude's responses, tool calls, file reads, code edits, test output. The actual productive content of your session.
4. Headroom (~33k tokens) — The Unused Buffer
Claude Code doesn't wait until 200k to compact. It triggers at roughly 83% capacity (~167k tokens), reserving ~33k tokens as a buffer for the compaction process itself.
That means ~16.5% of your context window is never available for useful work. You're paying for 200k but only getting ~167k.
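A quick sanity check on that arithmetic (the ~83% trigger is a measured observation from our sessions, not a documented constant):

```python
# Headroom math. The 83.5% trigger point is an observed value from our
# sessions, not a documented Claude Code constant.
WINDOW = 200_000
SYSTEM_PROMPT = 14_328

trigger = int(WINDOW * 0.835)      # ~167k: where auto-compaction fires
headroom = WINDOW - trigger        # reserved for the compaction process
usable = trigger - SYSTEM_PROMPT   # what's actually left for conversation

print(f"trigger={trigger:,} headroom={headroom:,} usable={usable:,}")
```

Note that `usable` lands at 152,672 — which lines up with the 152.7k of useful work measured in the first segment of the real session dissected below.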
A Real Session, Dissected
Here's an actual 4-segment session from our monitoring data:
```
Seg 1 ▒▒▓▓████████████████████████████████████████████████░░░░░ 200.0k
      14.3k system │ 152.7k useful │ 33.0k headroom │ → compacted
Seg 2 ▒▒▓▓▓████████████████████████████████████░░░░░░░░░░░░░░░ 200.0k
      14.3k system │ 18.8k summary │ 114.4k useful │ 52.5k headroom │ → compacted
Seg 3 ▒▒▓▓▓████████████████████████████████████████████████░░░ 200.0k
      14.3k system │ 17.8k summary │ 141.2k useful │ 33.9k headroom │ → compacted
→ Seg 4 ▒▒▓▓██████ 44.8k
      14.3k system │ 10.6k summary │ 12.7k useful

Efficiency: 76% │ Wasted: 166.5k/644.8k
```
76% efficiency means 76% of the total tokens went to useful work. The other 24% went to compaction summaries and headroom.
Notice how Seg 1 has no summary — it's the first segment, nothing to rebuild from. But starting from Seg 2, every segment pays the summary tax.
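The "Wasted" figure can be recomputed directly from the per-segment numbers, counting summaries and headroom as waste (the system prompt is tracked separately, since you'd pay it regardless):

```python
# Recomputing the "Wasted" line from the per-segment breakdown above.
# Waste = compaction summaries + unused headroom; the system prompt is
# excluded because it's a constant you pay in any segmentation.
summaries = [0, 18_800, 17_800, 10_600]       # per segment, tokens
headroom  = [33_000, 52_500, 33_900, 0]       # per segment, tokens
total_budget = 644_800                        # all four segments combined

wasted = sum(summaries) + sum(headroom)
print(f"wasted: {wasted/1000:.1f}k / {total_budget/1000:.1f}k")
```

That lands within rounding of the dashboard's 166.5k figure.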
The Hidden API Call
One thing we couldn't find in the transcript: the compaction summary generation itself. Claude Code makes a hidden API call that reads your ~167k context and produces the summary, but this call is not logged in the JSONL transcript.
Based on the preTokens metadata we found in compaction events, this hidden call reads the full pre-compaction context (~167k tokens). At Opus pricing ($1.50/M for cached reads), that's roughly $0.25 per compaction just for the summary generation — on top of the rebuild cost.
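Sketched out, that estimate is a single multiplication under the article's assumptions:

```python
# Pricing the hidden summary-generation call, assuming the full
# pre-compaction context is read at the Opus cached-read rate.
pre_tokens = 167_000        # from the preTokens metadata
cache_read_per_m = 1.50     # $/M tokens, Opus cache read
hidden_cost = pre_tokens * cache_read_per_m / 1_000_000
print(f"~${hidden_cost:.2f} per compaction")
```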
What This Means for Your Wallet
Let's do the math for a long Opus session with 3 compactions:
Token budget: 644.8k total
| Category | Tokens | Cost (Opus) | % of Total |
|---|---|---|---|
| Useful work | 490k | ~$8.50 | 76% |
| Compaction summaries | 47k | ~$0.85 | 7% |
| Headroom (unused) | 108k | $0 (not billed) | 17% |
| System prompt (constant) | ~43k | ~$0.06 (cached) | — |
| Hidden summary generation | ~500k reads | ~$0.75 | — |
The headroom tokens aren't billed directly — they represent capacity you couldn't use. But the summaries and hidden calls are real costs.
With Sonnet 4.6 the same session would be dramatically cheaper. Sonnet supports up to 1M context, so with 644k tokens you'd hit zero compactions:
- All tokens are useful work
- Efficiency: 100%
- Cost: ~$5.50 (vs ~$10+ on Opus)
The System Prompt Discovery
Perhaps the most interesting finding: the system prompt is a constant ~14k tax on every segment.
Before our investigation, we were counting the full post-compaction context as "rebuild waste." A segment showing 33.1k rebuild looked like 33.1k of compaction overhead. But 14.3k of that is system prompt — you'd pay it regardless.
The actual compaction overhead (the summary) is only 33.1k - 14.3k = 18.8k. That's a 43% difference in how you measure waste.
How we detected it:
```
After compaction #1: cache_read = 14,328  ← system prompt
After compaction #2: cache_read = 14,328  ← same
After compaction #3: cache_read = 14,328  ← same
During normal operation: cache_read grows from 14k → 167k
```
The cache_read value tells you exactly what's already cached. After compaction, only the system prompt survives in cache — everything else (the compaction summary) goes through cache_creation.
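That detection heuristic is simple to sketch: a compaction is any call where `cache_read` collapses back to the system-prompt floor. The helper and sample values here are illustrative:

```python
# Detecting compaction boundaries from the sequence of per-call
# cache_read_input_tokens values. The floor value is our measured
# system-prompt size; the sample sequence is illustrative.
SYSTEM_PROMPT_FLOOR = 14_328

def find_compactions(cache_reads: list[int]) -> list[int]:
    """Return indices of calls where the cache dropped back to the floor."""
    hits = []
    for i in range(1, len(cache_reads)):
        if cache_reads[i] == SYSTEM_PROMPT_FLOOR and cache_reads[i - 1] > SYSTEM_PROMPT_FLOOR:
            hits.append(i)
    return hits

reads = [14_328, 60_000, 120_000, 166_575, 14_328, 40_000]
print(find_compactions(reads))  # [4]
```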
The Compaction Cache Structure
Here's how token caching works across a compaction boundary:
Before compaction (normal operation):

```
cache_read:     166,575  ← almost everything is cached
cache_creation:     312  ← tiny new content
input_tokens:         3  ← negligible
output_tokens:      126
```

First call after compaction:

```
cache_read:      14,328  ← only system prompt survives
cache_creation:  18,793  ← compaction summary, written fresh
input_tokens:         3
output_tokens:    1,249
```
The cache gets blown away by compaction. Everything that was cached (your conversation, tool results, file contents) is gone. Only the system prompt persists because it's on a separate, longer-lived cache (likely a 1-hour TTL vs the 5-minute conversation cache).
7 Things You Can Do Right Now
1. Use /compact manually at logical breakpoints
Don't wait for auto-compaction at 167k. After finishing a feature or fixing a bug, compact yourself. You can guide what gets preserved:
```
/compact Preserve all file paths, error messages, and the list of modified files
```
2. Use /clear between distinct tasks
Switching from implementation to debugging? Starting a new feature? A fresh 186k of clean context beats 80k of stale context with irrelevant history.
3. Delegate verbose work to subagents
Each subagent gets its own isolated 200k context window. Running tests, searching large codebases, or fetching documentation in subagents keeps verbose output from bloating your main session.
4. Read files with line ranges
Instead of reading entire files, specify what you need: "Read lines 40-90 of handler.ts." Especially critical in debugging loops where you might read the same file repeatedly.
5. Disable unused MCP servers
Each MCP server loads its full tool schema into context on every request. A 20-tool server can consume 5-10k tokens just by existing. That's on top of the 14k system prompt.
6. Keep CLAUDE.md under 200 lines
CLAUDE.md is part of that ~14k system prompt. It loads on every single API call and survives all compaction cycles. If it's bloated, you're paying on every call.
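A quick way to gauge it — a rough sketch using the common ~4 characters-per-token rule of thumb rather than a real tokenizer:

```python
# Rough size check for a CLAUDE.md file. The ~4 chars/token ratio is a
# rule of thumb, not an exact tokenizer; pass in your project's path.
from pathlib import Path

def claude_md_stats(path: Path) -> tuple[int, int]:
    """Return (line count, approximate token count) for a CLAUDE.md file."""
    text = path.read_text(encoding="utf-8")
    lines = len(text.splitlines())
    approx_tokens = len(text) // 4
    return lines, approx_tokens
```

If the approximate token count creeps into the thousands, remember you're paying for it on every single call.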
7. Monitor your efficiency
Install ClaudeTUI and watch the numbers in real-time. Seeing "Efficiency: 76%" drop to "Efficiency: 68%" after a compaction changes how you think about context management.
How to See This Yourself
Install ClaudeTUI:
```
# Via Homebrew
brew install slima4/claude-tui/claude-tui && claudetui setup

# Or directly
curl -sSL https://raw.githubusercontent.com/slima4/claude-tui/main/install.sh | bash
```
Open a second terminal and run:
```
claudetui monitor   # live dashboard
claudetui chart     # efficiency chart (standalone)
```
The efficiency chart shows the 4-component breakdown for every segment in your session — updated live as you work. Press w in the monitor to open it, or v to toggle between horizontal and vertical views.
The Bottom Line
Every Claude Code session has four types of token usage:
- System prompt (~14k) — constant tax, can't avoid it, but it's cheap (cached)
- Compaction summaries (~11-19k each) — the real cost of compaction, lossy compression of your work
- Useful work — what you actually paid for
- Headroom (~33k) — buffer that's never available for work
In a typical 3-compaction Opus session, about 76% of tokens are useful work. The rest is overhead. Making this visible — and understanding what each component actually is — is the first step to spending tokens more intentionally.
ClaudeTUI is open source and MIT licensed. Stdlib-only Python, zero external dependencies.
GitHub: github.com/slima4/claude-tui