The problem nobody warned me about
Same prompts. Same workflows. Same model. Higher bill — every single week.
I checked my logs expecting to find a spike in usage. Instead I found something stranger: the same prompt returning different token counts on different runs. Some runs were 30–47% higher than identical runs a few weeks earlier.
No input change. No structural change. Just more tokens, and a bill I could not explain.
After going through Reddit threads, developer forums, and an OpenRouter analysis of over 1M real API requests, here is what is actually happening and the six changes that brought my token usage back under control.
What changed under the hood
Three things shifted simultaneously, and most cost-increase explanations only mention one of them.
1. The tokenizer itself changed
Claude's tokenizer now splits text differently. The same input produces more token segments. The expected multiplier from this alone is roughly 1.0x–1.35x. But real-world data from high-volume API users shows 1.4x–1.47x increases, and sometimes higher on specific content types.
Code blocks, markdown formatting, and JSON structures are hit disproportionately hard because of how subword tokenisation handles special characters and whitespace patterns in technical content. If you are building coding assistants, agent workflows, or automation pipelines, your content type is in the high-impact zone.
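You do not have to take the multipliers on faith; the Messages API exposes a token-counting endpoint, so you can measure your own content. A minimal sketch, assuming the current Anthropic Python SDK (the model ID and file path are placeholders, use whatever you are actually running):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_input_tokens(text, model="claude-sonnet-4-20250514"):
    """Ask the API how many input tokens this content costs on a given model."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

sample = open("some_module.py").read()  # code-heavy content is where the delta shows up most
print(count_input_tokens(sample))
```

Run the same content against the model versions you use and you have your own multiplier instead of an average from someone else's workload.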
2. Newer model versions reason deeper by default
More recent Claude versions spend more internal token budget on reasoning before generating output. This is the "thinking more" behaviour — better response quality, more tokens burned per request. If you upgraded model versions without adjusting your prompting approach, you absorbed a reasoning cost increase on top of the tokenizer change.
3. Conversations compound exponentially
Every message in a multi-turn conversation appends to the context window. With each individual message now slightly larger than before (due to tokenizer changes), long sessions accumulate token weight faster than the same sessions would have previously. A ten-turn conversation that used to cost X tokens now costs 1.4X or more — and that multiplier applies to every turn, not just the last one.
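The arithmetic is easy to underestimate. A back-of-the-envelope sketch, where the per-message size and the 1.4x multiplier are illustrative rather than measured values:

```python
def conversation_input_tokens(turns, tokens_per_message, multiplier=1.0):
    """Total input tokens when every turn resends the full history so far."""
    total = 0
    for turn in range(1, turns + 1):
        total += turn * tokens_per_message * multiplier  # history grows by one message per turn
    return total

old = conversation_input_tokens(10, 500)        # pre-change behaviour
new = conversation_input_tokens(10, 500, 1.4)   # same session, ~1.4x heavier messages
print(old, new, round(new / old, 2))            # 27500.0 38500.0 1.4
```

The multiplier does not just hit the latest message; it scales the entire accumulated history on every single call.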
The real-world impact on different use cases
Use case | Token impact | Why
--------------------------|--------------|------------------------------------
Coding assistant | Very high | Code blocks + formatting = heavy tokenisation
Long agent workflows | Very high | Multi-turn accumulation compounds
Single short prompts | Low | Minimal compounding, tokenizer delta only
Automation with full logs | Very high | Raw log data is extremely token-inefficient
Repeat identical prompts | Medium-low | Caching mitigates tokenizer delta
If your workflow sends full files, complete conversation histories, or raw log output to the API — you are in the highest impact category.
The six changes that actually reduced my costs
1. Stop sending unnecessary context
Before:
```python
# Sending everything
messages = [
    {"role": "user", "content": full_file_contents + full_history + current_question}
]
```
After:
```python
# Send only what the model needs for this specific task
relevant_context = extract_relevant_section(file_contents, current_question)
messages = [
    {"role": "user", "content": relevant_context + current_question}
]
```
Result for my workflow: 40% token reduction on input alone. The model does not need your entire codebase to answer a question about one function.
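`extract_relevant_section` is whatever selection logic fits your project; it is not a library call. A naive keyword-overlap version, just to make the idea concrete:

```python
def extract_relevant_section(file_contents, question, max_chars=4000):
    """Pick the chunks that share the most words with the question (very rough heuristic)."""
    question_words = set(question.lower().split())
    chunks = file_contents.split("\n\n")  # blank-line-separated blocks, e.g. functions
    scored = sorted(
        chunks,
        key=lambda c: len(question_words & set(c.lower().split())),
        reverse=True,
    )
    selected, used = [], 0
    for chunk in scored:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return "\n\n".join(selected)
```

Embedding-based retrieval works better on large codebases, but even this crude filter beats sending the whole file.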
2. Break large prompts into focused smaller ones
Before:
```text
One prompt that asks: analyse this code, suggest improvements,
rewrite the function, add error handling, and write unit tests.
```
After:
```text
Prompt 1: Analyse this function for issues (code only, no file context)
Prompt 2: Rewrite with suggested fixes applied (function only)
Prompt 3: Add error handling to this specific function
Prompt 4: Write unit tests for this function signature
```
Each smaller prompt is cheaper than one large prompt, and the outputs are easier to validate and iterate on.
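Here is roughly what the chained version looks like in code, assuming the standard Anthropic Python SDK (the model ID, token limit, and file name are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

def ask(prompt, model="claude-sonnet-4-20250514", max_tokens=800):
    """One focused request, no accumulated conversation history."""
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

function_source = open("payments.py").read()  # just the one function under review, not the repo

issues = ask(f"Analyse this function for issues:\n\n{function_source}")
fixed = ask(f"Rewrite this function applying these fixes:\n\n{function_source}\n\nFixes:\n{issues}")
tests = ask(f"Write unit tests for this function:\n\n{fixed}")
```

Each call carries only what that step needs, and a bad intermediate answer is cheap to retry.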
3. Match reasoning effort to task complexity
Claude's newer versions apply heavier reasoning by default. For tasks that do not require deep reasoning — format conversion, simple extraction, classification — you can explicitly instruct the model to keep responses concise:
```python
system_prompt = """
Answer directly and concisely.
Do not show your reasoning unless asked.
Do not pad the response with caveats or summaries.
"""
```
Saved approximately 15–20% on output tokens for my classification tasks where verbose responses were not needed.
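For classification-style calls I also pair that system prompt with a hard output cap. A sketch using the standard SDK, where the model ID and the example ticket are placeholders and `system_prompt` is the block above:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model ID
    max_tokens=50,                      # hard ceiling: a label never needs more than this
    system=system_prompt,               # the concise-answer instructions above
    messages=[{"role": "user", "content": "Classify this support ticket: printer is on fire"}],
)
print(response.content[0].text.strip())
```

The `max_tokens` cap is the safety net; the system prompt is what stops the model from wanting more in the first place.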
4. Reset sessions instead of accumulating context
Before: One long continuous session per agent workflow.
After: Fresh session when the conversation exceeds a threshold.
```python
MAX_CONTEXT_TURNS = 8

if len(conversation_history) > MAX_CONTEXT_TURNS:
    # Summarise previous turns into a single context block
    summary = summarise_history(conversation_history)
    conversation_history = [{"role": "user", "content": f"Context: {summary}"}]
```
This keeps the compounding problem in check. Each accumulated turn is now slightly heavier than before, so capping accumulation and replacing old turns with a summary keeps costs predictable.
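`summarise_history` is not a built-in; one cheap option is to delegate the summary to a smaller model. A sketch under that assumption (the model ID is a placeholder, any fast, cheap model works):

```python
import anthropic

client = anthropic.Anthropic()

def summarise_history(history, max_tokens=300):
    """Collapse earlier turns into a short context block using a cheaper model."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder: pick the cheapest model that summarises well
        max_tokens=max_tokens,
        messages=[{
            "role": "user",
            "content": f"Summarise the key facts and decisions from this conversation:\n\n{transcript}",
        }],
    )
    return response.content[0].text
```

The summary call costs tokens too, but a one-off 300-token summary is far cheaper than dragging eight full turns through every subsequent request.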
5. Clean and compress inputs before sending
Before: Raw log output, unformatted JSON, full HTML responses pasted directly.
After:
```python
import re

def compress_for_api(text):
    # Remove blank lines
    text = '\n'.join(line for line in text.splitlines() if line.strip())
    # Strip HTML tags if only the text is needed
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse runs of spaces
    text = re.sub(r' +', ' ', text)
    return text.strip()
```
Cleaning inputs before sending reduced my token count by 20–30% on workflows that process scraped web content.
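A quick sanity check on one scraped page (the file name is a placeholder) shows how much of the payload is markup rather than content:

```python
raw = open("scraped_page.html").read()
cleaned = compress_for_api(raw)
print(len(raw), len(cleaned))  # character counts; token savings track this roughly
```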
6. Optimise stable prompts for prompt caching
Claude's prompt caching charges a lower rate for repeated identical prompt prefixes. Structure your system prompts so the stable portion comes first and the dynamic content comes at the end:
```python
# Structure that caches well:
system = """[Your long, stable system prompt here: this part gets cached]"""

user = f"""
Context for this specific request: {dynamic_context}
Task: {specific_task}
"""
```
Caching only applies to the prefix that stays identical between calls. If your system prompt varies per request, caching does not help — standardise it.
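Anthropic's API also lets you mark the cacheable prefix explicitly with a `cache_control` breakpoint on the system block. This is my understanding of the current SDK shape; verify the field names and minimum-prefix rules against the prompt caching docs before relying on it, and treat the model ID and variables as placeholders:

```python
import anthropic

client = anthropic.Anthropic()

stable_system_prompt = "..."  # long, identical across calls, so cacheable
dynamic_context = "..."       # changes per request
specific_task = "..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",         # placeholder model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": stable_system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache everything up to this breakpoint
    }],
    messages=[{"role": "user", "content": f"Context: {dynamic_context}\nTask: {specific_task}"}],
)
```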
The mental model shift that matters
The problem is not that Claude got worse or more expensive for no reason. The tokenizer change reflects better text processing. The deeper reasoning produces better outputs.
What changed is the default efficiency assumption. Prompts and workflows designed for the old tokenizer behaviour were implicitly relying on cheaper processing. That assumption no longer holds.
The fix is not "use less AI." The fix is designing your prompts the way you would design efficient database queries — send only what is needed, structure it cleanly, cache what repeats, and reset state before it compounds.
The full breakdown, with more context on the OpenRouter analysis, is in my Medium article.
Have you hit this in your own workflows? Drop your use case and the token multiplier you observed in the comments — trying to build a picture of which workflows are hit hardest.