You're paying for 200,000 tokens of context. But how many of those tokens are actually doing useful work?
We built ClaudeTUI — a set of monitoring tools for Claude Code — and dug into the raw JSONL transcript data to trace every token. What we found surprised us: there are four distinct categories of token usage, and only one of them is your actual work.
What Happens When You Press Enter
Here's something most Claude Code users don't realize: every time you press Enter, the entire conversation is sent from scratch.
The Claude API is stateless. It doesn't remember your previous messages. So every message you send triggers an API call that includes:
- System prompt (~14k tokens) — Claude Code's instructions, tool definitions, your CLAUDE.md
- Full conversation history — every message, every tool call, every tool result since the last compaction
- Your new message
On turn 1, that's maybe 15k tokens. By turn 15, it's 100k. By turn 30, it's 167k — and then compaction fires.
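That growth curve is easy to sketch with a toy model. The tokens-per-turn average here is an illustrative assumption, not a measured value:

```python
# Toy model of per-call token growth. TOKENS_PER_TURN is an illustrative
# assumption; the system prompt size and compaction threshold come from
# the measurements discussed in this article.
SYSTEM_PROMPT = 14_328           # constant prefix on every call
TOKENS_PER_TURN = 5_300          # assumed average (prompt + reply + tool output)
COMPACTION_THRESHOLD = 167_000   # ~83% of the 200k window

def tokens_sent(turn: int) -> int:
    """Tokens included in the API call for a given 1-indexed turn."""
    return SYSTEM_PROMPT + (turn - 1) * TOKENS_PER_TURN

for turn in (1, 15, 30):
    total = tokens_sent(turn)
    note = "  <- compaction fires" if total >= COMPACTION_THRESHOLD else ""
    print(f"turn {turn:>2}: {total:>7,} tokens{note}")
```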
This is why Claude gets slower and more expensive as your session goes on. Each Enter keystroke processes more tokens than the last. And it's why compaction exists: without it, you'd hit the 200k wall and the session would simply stop.
The good news: Anthropic's prompt caching makes this less painful than it sounds. But it's worth understanding how it works.
The Cache Lives on Anthropic's Servers
Your machine sends the full conversation on every request — the same bytes go over the network every time. The optimization happens server-side: Anthropic checks "have I seen this exact prefix of tokens recently?" If yes, it skips re-processing them and charges the cheaper cache read rate ($1.50/M instead of $15/M for Opus — a 10x discount).
In a 157-turn session, we measured 98% of all tokens as cache reads. That makes sense: by turn 100, you're re-sending 99 turns of history that are already cached. Only the newest content goes through the expensive cache_creation path.
The cache has a TTL — likely ~5 minutes for conversation content. If you pause too long between turns, the cache expires and the next call pays full input price for everything. This is also why compaction is expensive: it blows away the entire cached conversation and replaces it with a brand new summary that goes through cache_creation from scratch.
One more thing: the tokens still count toward your 200k context window, even when cached. Caching saves money, not space.
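Here's what that discount means for a single late-session call, sketched with the Opus rates quoted above (the token split is illustrative):

```python
# Cost of one late-session Opus call, with and without prompt caching,
# using the rates quoted above. Token counts are illustrative.
UNCACHED_PER_M = 15.00    # $/M tokens, uncached input
CACHE_READ_PER_M = 1.50   # $/M tokens, cache read

def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Dollar cost of a single call's input tokens."""
    return (cached_tokens * CACHE_READ_PER_M
            + fresh_tokens * UNCACHED_PER_M) / 1_000_000

# Around turn 100: ~160k tokens already cached, ~1k genuinely new.
print(f"with cache:    ${call_cost(160_000, 1_000):.3f}")
print(f"without cache: ${call_cost(0, 161_000):.3f}")
```

The same 161k-token request costs roughly a tenth as much when almost all of it hits the cache — but it still occupies 161k of the context window either way.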
Now let's look at what those tokens actually are.
The Four Types of Tokens
Every API call Claude Code makes has a token usage breakdown in its transcript. By parsing thousands of these calls across real sessions, we identified four categories:
1. System Prompt (~14k tokens) — The Constant Tax
Every single API call includes a system prompt: Claude Code's internal instructions, tool definitions, safety guidelines, and your CLAUDE.md file. In our sessions, this was consistently ~14,328 tokens.
This isn't something you can avoid. It's infrastructure. But it means that out of your 200k window, only ~186k is ever available for actual conversation.
We discovered this by looking at cache_read_input_tokens after compaction events. The value resets to exactly 14,328 every time — that's the system prompt floor. During normal operation, cache_read grows from 14k to 167k as your conversation accumulates in the cache.
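Extracting those usage fields is straightforward. A minimal sketch, assuming the transcript layout described above — one JSON object per line, with the usage block nested under `message`:

```python
# Minimal sketch of reading token-usage records from a Claude Code JSONL
# transcript. Field names follow the Anthropic API usage block; the exact
# message layout is an assumption -- adjust for your transcript files.
import json
from pathlib import Path

def usage_records(transcript: Path):
    """Yield one usage dict per API call found in the transcript."""
    for line in transcript.read_text(encoding="utf-8").splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines defensively
        usage = entry.get("message", {}).get("usage")
        if usage:
            yield {
                "cache_read": usage.get("cache_read_input_tokens", 0),
                "cache_creation": usage.get("cache_creation_input_tokens", 0),
                "input": usage.get("input_tokens", 0),
                "output": usage.get("output_tokens", 0),
            }
```

Watching `cache_read` across these records is what exposed the 14,328-token floor.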
2. Compaction Summary (~11-19k tokens) — The Rebuild Cost
When compaction fires, Claude Code compresses your entire conversation into a summary. The next API call has to read that summary to reconstruct context. This is the real overhead of compaction.
From a real 3-compaction session:
| Compaction | Summary Size | What It Costs |
|---|---|---|
| #1 | 18.8k tokens | $0.47 (Opus) |
| #2 | 10.6k tokens | $0.22 (Opus) |
| #3 | 17.8k tokens | $0.37 (Opus) |
These summaries are lossy. Your 167k of rich context — exact error messages, file contents, code snippets — gets compressed into 11-19k tokens. Details are lost.
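As a rough cross-check on the table, the summaries can be priced at a cache-write rate. The 1.25× cache-write multiplier over the $15/M input rate is an assumption about standard Anthropic pricing, and the table's figures likely fold in additional output-token costs, so treat this as a lower bound:

```python
# Back-of-envelope pricing of each compaction summary as a cache write.
# The 1.25x multiplier over the $15/M Opus input rate is an assumed
# standard rate; treat the results as a lower bound on the table above.
INPUT_PER_M = 15.00
CACHE_WRITE_PER_M = INPUT_PER_M * 1.25  # $18.75/M, assumed

for label, summary_tokens in [("#1", 18_800), ("#2", 10_600), ("#3", 17_800)]:
    cost = summary_tokens * CACHE_WRITE_PER_M / 1_000_000
    print(f"compaction {label}: {summary_tokens/1000:.1f}k summary -> ${cost:.2f} cache write")
```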
3. Useful Work — What You Actually Paid For
This is everything else: your prompts, Claude's responses, tool calls, file reads, code edits, test output. The actual productive content of your session.
4. Headroom (~33k tokens) — The Unused Buffer
Claude Code doesn't wait until 200k to compact. It triggers at roughly 83% capacity (~167k tokens), reserving ~33k tokens as a buffer for the compaction process itself.
That means ~16.5% of your context window is never available for useful work. You're paying for 200k but only getting ~167k.
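A quick sanity check on that arithmetic (the ~83% trigger is a measured observation from our sessions, not a documented constant):

```python
# Headroom math. The 83.5% trigger point is an observed value from our
# sessions, not a documented Claude Code constant.
WINDOW = 200_000
SYSTEM_PROMPT = 14_328

trigger = int(WINDOW * 0.835)      # ~167k: where auto-compaction fires
headroom = WINDOW - trigger        # reserved for the compaction process
usable = trigger - SYSTEM_PROMPT   # what's actually left for conversation

print(f"trigger={trigger:,} headroom={headroom:,} usable={usable:,}")
```

Note that `usable` lands at 152,672 — which lines up with the 152.7k of useful work measured in the first segment of the real session dissected below.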
A Real Session, Dissected
Here's an actual 4-segment session from our monitoring data:
```
Seg 1 ▒▒▓▓████████████████████████████████████████████████░░░░░ 200.0k
      14.3k system │ 152.7k useful │ 33.0k headroom │ → compacted
Seg 2 ▒▒▓▓▓████████████████████████████████████░░░░░░░░░░░░░░░ 200.0k
      14.3k system │ 18.8k summary │ 114.4k useful │ 52.5k headroom │ → compacted
Seg 3 ▒▒▓▓▓████████████████████████████████████████████████░░░ 200.0k
      14.3k system │ 17.8k summary │ 141.2k useful │ 33.9k headroom │ → compacted
→ Seg 4 ▒▒▓▓██████ 44.8k
      14.3k system │ 10.6k summary │ 12.7k useful

Efficiency: 76% │ Wasted: 166.5k/644.8k
```
76% efficiency means 76% of the total tokens went to useful work. The other 24% went to compaction summaries and headroom.
Notice how Seg 1 has no summary — it's the first segment, nothing to rebuild from. But starting from Seg 2, every segment pays the summary tax.
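The "Wasted" figure can be recomputed directly from the per-segment numbers, counting summaries and headroom as waste (the system prompt is tracked separately, since you'd pay it regardless):

```python
# Recomputing the "Wasted" line from the per-segment breakdown above.
# Waste = compaction summaries + unused headroom; the system prompt is
# excluded because it's a constant you pay in any segmentation.
summaries = [0, 18_800, 17_800, 10_600]       # per segment, tokens
headroom  = [33_000, 52_500, 33_900, 0]       # per segment, tokens
total_budget = 644_800                        # all four segments combined

wasted = sum(summaries) + sum(headroom)
print(f"wasted: {wasted/1000:.1f}k / {total_budget/1000:.1f}k")
```

That lands within rounding of the dashboard's 166.5k figure.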
The Hidden API Call
One thing we couldn't find in the transcript: the compaction summary generation itself. Claude Code makes a hidden API call that reads your ~167k context and produces the summary, but this call is not logged in the JSONL transcript.
Based on the preTokens metadata we found in compaction events, this hidden call reads the full pre-compaction context (~167k tokens). At Opus pricing ($1.50/M for cached reads), that's roughly $0.25 per compaction just for the summary generation — on top of the rebuild cost.
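Sketched out, that estimate is a single multiplication under the article's assumptions:

```python
# Pricing the hidden summary-generation call, assuming the full
# pre-compaction context is read at the Opus cached-read rate.
pre_tokens = 167_000        # from the preTokens metadata
cache_read_per_m = 1.50     # $/M tokens, Opus cache read
hidden_cost = pre_tokens * cache_read_per_m / 1_000_000
print(f"~${hidden_cost:.2f} per compaction")
```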
What This Means for Your Wallet
Let's do the math for a long Opus session with 3 compactions:
Token budget: 644.8k total
| Category | Tokens | Cost (Opus) | % of Total |
|---|---|---|---|
| Useful work | 490k | ~$8.50 | 76% |
| Compaction summaries | 47k | ~$0.85 | 7% |
| Headroom (unused) | 108k | $0 (not billed) | 17% |
| System prompt (constant) | ~43k | ~$0.06 (cached) | — |
| Hidden summary generation | ~500k reads | ~$0.75 | — |
The headroom tokens aren't billed directly — they represent capacity you couldn't use. But the summaries and hidden calls are real costs.
With Sonnet 4.6 the same session would be dramatically cheaper. Sonnet supports up to 1M context, so with 644k tokens you'd hit zero compactions:
- All tokens are useful work
- Efficiency: 100%
- Cost: ~$5.50 (vs ~$10+ on Opus)
The System Prompt Discovery
Perhaps the most interesting finding: the system prompt is a constant ~14k tax on every segment.
Before our investigation, we were counting the full post-compaction context as "rebuild waste." A segment showing 33.1k rebuild looked like 33.1k of compaction overhead. But 14.3k of that is system prompt — you'd pay it regardless.
The actual compaction overhead (the summary) is only 33.1k - 14.3k = 18.8k. That's a 43% difference in how you measure waste.
How we detected it:
```
After compaction #1: cache_read = 14,328  ← system prompt
After compaction #2: cache_read = 14,328  ← same
After compaction #3: cache_read = 14,328  ← same
During normal operation: cache_read grows from 14k → 167k
```
The cache_read value tells you exactly what's already cached. After compaction, only the system prompt survives in cache — everything else (the compaction summary) goes through cache_creation.
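That detection heuristic is simple to sketch: a compaction is any call where `cache_read` collapses back to the system-prompt floor. The helper and sample values here are illustrative:

```python
# Detecting compaction boundaries from the sequence of per-call
# cache_read_input_tokens values. The floor value is our measured
# system-prompt size; the sample sequence is illustrative.
SYSTEM_PROMPT_FLOOR = 14_328

def find_compactions(cache_reads: list[int]) -> list[int]:
    """Return indices of calls where the cache dropped back to the floor."""
    hits = []
    for i in range(1, len(cache_reads)):
        if cache_reads[i] == SYSTEM_PROMPT_FLOOR and cache_reads[i - 1] > SYSTEM_PROMPT_FLOOR:
            hits.append(i)
    return hits

reads = [14_328, 60_000, 120_000, 166_575, 14_328, 40_000]
print(find_compactions(reads))  # [4]
```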
The Compaction Cache Structure
Here's how token caching works across a compaction boundary:
Before compaction (normal operation):

```
cache_read:     166,575  ← almost everything is cached
cache_creation:     312  ← tiny new content
input_tokens:         3  ← negligible
output_tokens:      126
```

First call after compaction:

```
cache_read:      14,328  ← only system prompt survives
cache_creation:  18,793  ← compaction summary, written fresh
input_tokens:         3
output_tokens:    1,249
```
The cache gets blown away by compaction. Everything that was cached (your conversation, tool results, file contents) is gone. Only the system prompt persists because it's on a separate, longer-lived cache (likely a 1-hour TTL vs the 5-minute conversation cache).
7 Things You Can Do Right Now
1. Use /compact manually at logical breakpoints
Don't wait for auto-compaction at 167k. After finishing a feature or fixing a bug, compact yourself. You can guide what gets preserved:
```
/compact Preserve all file paths, error messages, and the list of modified files
```
2. Use /clear between distinct tasks
Switching from implementation to debugging? Starting a new feature? A fresh 186k of clean context beats 80k of stale context with irrelevant history.
3. Delegate verbose work to subagents
Each subagent gets its own isolated 200k context window. Running tests, searching large codebases, or fetching documentation in subagents keeps verbose output from bloating your main session.
4. Read files with line ranges
Instead of reading entire files, specify what you need: "Read lines 40-90 of handler.ts." Especially critical in debugging loops where you might read the same file repeatedly.
5. Disable unused MCP servers
Each MCP server loads its full tool schema into context on every request. A 20-tool server can consume 5-10k tokens just by existing. That's on top of the 14k system prompt.
6. Keep CLAUDE.md under 200 lines
CLAUDE.md is part of that ~14k system prompt. It loads on every single API call and survives all compaction cycles. If it's bloated, you're paying on every call.
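A quick way to gauge it — a rough sketch using the common ~4 characters-per-token rule of thumb rather than a real tokenizer:

```python
# Rough size check for a CLAUDE.md file. The ~4 chars/token ratio is a
# rule of thumb, not an exact tokenizer; pass in your project's path.
from pathlib import Path

def claude_md_stats(path: Path) -> tuple[int, int]:
    """Return (line count, approximate token count) for a CLAUDE.md file."""
    text = path.read_text(encoding="utf-8")
    lines = len(text.splitlines())
    approx_tokens = len(text) // 4
    return lines, approx_tokens
```

If the approximate token count creeps into the thousands, remember you're paying for it on every single call.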
7. Monitor your efficiency
Install ClaudeTUI and watch the numbers in real-time. Seeing "Efficiency: 76%" drop to "Efficiency: 68%" after a compaction changes how you think about context management.
How to See This Yourself
Install ClaudeTUI:
```
# Via Homebrew
brew install slima4/claude-tui/claude-tui && claudetui setup

# Or directly
curl -sSL https://raw.githubusercontent.com/slima4/claude-tui/main/install.sh | bash
```
Open a second terminal and run:
```
claudetui monitor   # live dashboard
claudetui chart     # efficiency chart (standalone)
```
The efficiency chart shows the 4-component breakdown for every segment in your session — updated live as you work. Press w in the monitor to open it, or v to toggle between horizontal and vertical views.
The Bottom Line
Every Claude Code session has four types of token usage:
- System prompt (~14k) — constant tax, can't avoid it, but it's cheap (cached)
- Compaction summaries (~11-19k each) — the real cost of compaction, lossy compression of your work
- Useful work — what you actually paid for
- Headroom (~33k) — buffer that's never available for work
In a typical 3-compaction Opus session, about 76% of tokens are useful work. The rest is overhead. Making this visible — and understanding what each component actually is — is the first step to spending tokens more intentionally.
ClaudeTUI is open source and MIT licensed. Stdlib-only Python, zero external dependencies.
GitHub: github.com/slima4/claude-tui