Jayanth

My Claude API Bill Jumped 47% and I Didn't Change a Single Prompt — Here's Why

The problem nobody warned me about

Same prompts. Same workflows. Same model. Higher bill — every single week.
I checked my logs expecting to find a spike in usage. Instead I found something stranger: the same prompt returning different token counts on different runs. Some runs were 30–47% higher than identical runs a few weeks earlier.

No input change. No structural change. Just more tokens, and a bill I could not explain.
After going through Reddit threads, developer forums, and an OpenRouter analysis of over 1M real API requests, here is what is actually happening, along with the six changes that brought my token usage back under control.

What changed under the hood

Three things shifted simultaneously, and most cost-increase explanations only mention one of them.

1. The tokenizer itself changed

Claude's tokenizer now splits text differently. Same input produces more token segments. The expected increase from this alone is roughly 1.0x–1.35x. But real-world data from high-volume API users shows 1.4x–1.47x increases — and sometimes higher on specific content types.
Code blocks, markdown formatting, and JSON structures are hit disproportionately hard because of how subword tokenisation handles special characters and whitespace patterns in technical content. If you are building coding assistants, agent workflows, or automation pipelines, your content type is in the high-impact zone.
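As a rough illustration of why technical content sits in the high-impact zone, here is a crude proxy metric. This is not the real tokenizer, just a sketch showing that JSON and code carry far more structural characters per character of content, which subword tokenizers tend to split into extra segments:

```python
def special_char_ratio(text: str) -> float:
    """Rough proxy for tokenizer-hostile content: the share of
    characters that are punctuation, braces, or other non-alphanumerics."""
    special = sum(1 for c in text if not c.isalnum() and c != " ")
    return special / max(len(text), 1)

prose = "The quick brown fox jumps over the lazy dog near the river bank."
json_blob = '{"user": {"id": 42, "tags": ["a", "b"], "active": true}}'

# JSON packs several times the structural density of plain prose
prose_ratio = special_char_ratio(prose)
json_ratio = special_char_ratio(json_blob)
```

The exact multiplier depends on the tokenizer's vocabulary, but the density gap is why the same byte count of JSON bills noticeably more than the same byte count of prose.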

2. Newer model versions reason deeper by default

More recent Claude versions spend more internal token budget on reasoning before generating output. This is the "thinking more" behaviour — better response quality, more tokens burned per request. If you upgraded model versions without adjusting your prompting approach, you absorbed a reasoning cost increase on top of the tokenizer change.

3. Conversations compound with every turn

Every message in a multi-turn conversation is appended to the context window and resent with each request, so the total session cost grows roughly with the square of the turn count. With each individual message now slightly larger than before (due to the tokenizer change), long sessions accumulate token weight faster than the same sessions would have previously. A ten-turn conversation that used to cost X tokens now costs 1.4X or more, and that multiplier applies to every turn, not just the last one.
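The compounding is easy to see in a back-of-the-envelope model. This sketch assumes a flat per-turn message size and a uniform tokenizer multiplier (both simplifications), and shows that a session where every request resends the full history bills the sum of all prefix lengths:

```python
def cumulative_tokens(per_turn_tokens: int, turns: int, multiplier: float = 1.0) -> int:
    """Total input tokens billed across a session where every request
    resends the full history, i.e. turn i pays for turns 1..i again."""
    total = 0
    history = 0
    for _ in range(turns):
        history += per_turn_tokens  # this turn's message joins the context
        total += history            # and the whole context is billed again
    return round(total * multiplier)

# 10 turns of ~200-token messages: 200 * (1 + 2 + ... + 10) = 11,000 tokens
before = cumulative_tokens(200, 10)        # old tokenizer baseline
after = cumulative_tokens(200, 10, 1.4)    # ~1.4x heavier per message
```

Doubling the turn count roughly quadruples the total, which is why long agent sessions feel the tokenizer change hardest.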

The real-world impact on different use cases

```
Use case                  | Token impact | Why
--------------------------|--------------|------------------------------------
Coding assistant          | Very high    | Code blocks + formatting = heavy tokenisation
Long agent workflows      | Very high    | Multi-turn accumulation compounds
Single short prompts      | Low          | Minimal compounding, tokenizer delta only
Automation with full logs | Very high    | Raw log data is extremely token-inefficient
Repeat identical prompts  | Medium-low   | Caching mitigates tokenizer delta
```

If your workflow sends full files, complete conversation histories, or raw log output to the API — you are in the highest impact category.

The six changes that actually reduced my costs
1. Stop sending unnecessary context
Before:

```python
# Sending everything
messages = [
    {"role": "user", "content": full_file_contents + full_history + current_question}
]
```

After:

```python
# Send only what the model needs for this specific task
relevant_context = extract_relevant_section(file_contents, current_question)
messages = [
    {"role": "user", "content": relevant_context + current_question}
]
```

Result for my workflow: 40% token reduction on input alone. The model does not need your entire codebase to answer a question about one function.
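The `extract_relevant_section` helper above is left undefined; what it does depends entirely on your data. As one minimal sketch, a keyword-overlap filter over paragraphs (or functions, or log sections) already cuts most of the dead weight. The function name matches the snippet above; the implementation is an assumption:

```python
import re

def extract_relevant_section(file_contents: str, question: str) -> str:
    """Naive sketch: keep only the paragraphs that share words with the
    question, instead of shipping the whole file to the API."""
    keywords = {w.lower() for w in re.findall(r"\w+", question) if len(w) > 3}
    paragraphs = file_contents.split("\n\n")
    relevant = [
        p for p in paragraphs
        if keywords & {w.lower() for w in re.findall(r"\w+", p)}
    ]
    # Fall back to a capped slice rather than sending everything
    return "\n\n".join(relevant) if relevant else file_contents[:1000]

source = "def add(a, b):\n    return a + b\n\ndef multiply(a, b):\n    return a * b"
section = extract_relevant_section(source, "Where is multiply defined?")
```

In production you would swap the keyword filter for embeddings or an AST-aware chunker, but even this crude version keeps irrelevant functions out of the request.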

2. Break large prompts into focused smaller ones

Before:

```
One prompt that asks: analyse this code, suggest improvements,
rewrite the function, add error handling, and write unit tests.
```

After:

```
Prompt 1: Analyse this function for issues (code only, no file context)
Prompt 2: Rewrite with suggested fixes applied (function only)
Prompt 3: Add error handling to this specific function
Prompt 4: Write unit tests for this function signature
```

Each smaller prompt is cheaper than one large prompt, and the outputs are easier to validate and iterate on.
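Chained, the four prompts look like the sketch below. `call_model` is a stub standing in for your real API call; the point is that each step sends only the artifact it needs, never the accumulated conversation:

```python
def call_model(prompt: str) -> str:
    """Stub for the real API call; returns a placeholder string here."""
    return f"<response to: {prompt[:40]}>"

def run_focused_pipeline(function_source: str) -> dict:
    """Chain small single-purpose prompts instead of one mega-prompt."""
    analysis = call_model(f"Analyse this function for issues:\n{function_source}")
    rewrite = call_model(f"Rewrite with these fixes applied:\n{analysis}\n{function_source}")
    hardened = call_model(f"Add error handling to this function:\n{rewrite}")
    tests = call_model(f"Write unit tests for this function:\n{hardened}")
    return {"analysis": analysis, "rewrite": rewrite, "hardened": hardened, "tests": tests}
```

A failed step can also be retried alone, instead of re-running (and re-paying for) the entire mega-prompt.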

3. Match reasoning effort to task complexity

Claude's newer versions apply heavier reasoning by default. For tasks that do not require deep reasoning — format conversion, simple extraction, classification — you can explicitly instruct the model to keep responses concise:

```python
system_prompt = """
Answer directly and concisely.
Do not show your reasoning unless asked.
Do not pad the response with caveats or summaries.
"""
```

Saved approximately 15–20% on output tokens for my classification tasks where verbose responses were not needed.
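One way to make this systematic is to route requests by task type, so only the tasks that benefit from deep reasoning pay for it. The task buckets below are illustrative, not an official taxonomy:

```python
CONCISE_SYSTEM = (
    "Answer directly and concisely. Do not show your reasoning unless asked. "
    "Do not pad the response with caveats or summaries."
)
DEFAULT_SYSTEM = "Think through the problem carefully before answering."

# Hypothetical buckets of tasks that never need verbose reasoning
LIGHTWEIGHT_TASKS = {"classification", "extraction", "format_conversion"}

def system_prompt_for(task_type: str) -> str:
    """Route cheap mechanical tasks to the terse system prompt."""
    return CONCISE_SYSTEM if task_type in LIGHTWEIGHT_TASKS else DEFAULT_SYSTEM
```

This keeps the savings from slipping away the next time someone adds a new call site and forgets the concise instruction.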

4. Reset sessions instead of accumulating context

Before: One long continuous session per agent workflow.
After: Fresh session when the conversation exceeds a threshold.

```python
MAX_CONTEXT_TURNS = 8

if len(conversation_history) > MAX_CONTEXT_TURNS:
    # Summarise previous turns into a single context block
    summary = summarise_history(conversation_history)
    conversation_history = [{"role": "user", "content": f"Context: {summary}"}]
```

This prevents context cost from compounding turn after turn. Each accumulated turn is now slightly heavier than before, so capping accumulation and replacing it with a summary keeps costs predictable.
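`summarise_history` above is left as a black box. In production you would likely ask the model itself for a cheap summary; as a zero-cost stand-in, even keeping just the first sentence of each turn preserves enough thread for many workflows. This implementation is an assumption, not the author's:

```python
def summarise_history(history: list[dict]) -> str:
    """Naive summary: keep each turn's first sentence, tagged by role.
    A real pipeline would use a cheap model call instead."""
    lines = []
    for turn in history:
        first_sentence = turn["content"].split(". ")[0].strip()
        lines.append(f"{turn['role']}: {first_sentence}")
    return " | ".join(lines)

history = [
    {"role": "user", "content": "Fix the bug. It crashes on None input."},
    {"role": "assistant", "content": "Done. I added a None guard."},
]
summary = summarise_history(history)
```

The trade-off to watch: an over-aggressive summary drops details the model later needs, forcing you to re-send them anyway. Tune `MAX_CONTEXT_TURNS` against your task's actual memory requirements.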

5. Clean and compress inputs before sending

Before: Raw log output, unformatted JSON, full HTML responses pasted directly.
After:

```python
import re

def compress_for_api(text):
    # Remove blank lines
    text = '\n'.join(line for line in text.splitlines() if line.strip())
    # Remove HTML tags if only the text content is needed
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse runs of spaces
    text = re.sub(r' +', ' ', text)
    return text.strip()
```

Cleaning inputs before sending reduced my token count by 20–30% on workflows that process scraped web content.

6. Optimise stable prompts for prompt caching

Claude's prompt caching charges a lower rate for repeated identical prompt prefixes. Structure your system prompts so the stable portion comes first and the dynamic content comes at the end:

```python
# Structure that caches well:
system = """[Your long, stable system prompt here — this gets cached]"""

user = f"""
Context for this specific request: {dynamic_context}
Task: {specific_task}
"""
```

Caching only applies to the prefix that stays identical between calls. If your system prompt varies per request, caching does not help — standardise it.
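With the Anthropic Messages API, opting a prefix into caching is done by attaching `cache_control` to the stable system block. The sketch below builds the request payload only (no API call); the model name is a placeholder, and the exact minimum cacheable prefix length varies by model, so treat this as a shape reference rather than a drop-in:

```python
def build_cached_request(stable_system: str, dynamic_context: str, task: str) -> dict:
    """Request shape for prompt caching: the stable system prefix is
    marked cacheable, the per-request content goes last."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": stable_system,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": f"Context for this request: {dynamic_context}\nTask: {task}",
            }
        ],
    }
```

Because caching matches on the exact prefix, any per-request value interpolated into the system prompt (timestamps, user IDs) silently defeats it; keep those in the user message.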

The mental model shift that matters

The problem is not that Claude got worse or more expensive for no reason. The tokenizer change reflects better text processing. The deeper reasoning produces better outputs.

What changed is the default efficiency assumption. Prompts and workflows designed for the old tokenizer behaviour were implicitly relying on cheaper processing. That assumption no longer holds.

The fix is not "use less AI." The fix is designing your prompts the way you would design efficient database queries — send only what is needed, structure it cleanly, cache what repeats, and reset state before it compounds.

A full breakdown, with more context on the OpenRouter analysis, is in my Medium article.

Have you hit this in your own workflows? Drop your use case and the token multiplier you observed in the comments — trying to build a picture of which workflows are hit hardest.
