The pattern that made no sense
Some days I barely used Claude and hit the limit early. Other days I pushed it hard and lasted much longer.
If the platform was the problem, the behaviour should be consistent. It was not — which meant the variable was me, not the system.
I started tracking everything: how I was prompting, how long my sessions ran, what types of tasks I was running, what differed between good days and bad days. The pattern that emerged was clear once I could see it.
I was not running out of messages. I was burning tokens — silently, in ways that felt completely normal.
Why token limits feel invisible
Most developers think of API cost in terms of individual requests. The problem with Claude (and any context-window model) is that cost is not per-request in any intuitive sense. It is per-token-processed, and that includes:
```
Every message you send now
+ Your entire conversation history
+ Any files currently in context
+ Any active tools or integrations
+ All previous responses in the thread
```
So every single message you send causes Claude to reprocess the entire accumulated weight of the conversation. A short chat is cheap. A long chat becomes disproportionately expensive: total processing grows roughly quadratically with conversation length, not linearly.
This is why the limit can feel random. It is not random at all. It is the silent compounding of context weight across every turn.
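A toy calculation makes the compounding concrete. The per-turn figure below is an assumption for illustration, not a measurement:

```python
# Toy model: total input tokens processed across a conversation where
# every turn re-sends all prior turns. The 500 is an assumed average.
TOKENS_PER_TURN = 500  # average user message + model reply (assumed)

def total_input_tokens(turns: int) -> int:
    # Turn n reprocesses all n-1 earlier turns plus the new message,
    # so the total is 1 + 2 + ... + n turn-sized context loads.
    return sum(n * TOKENS_PER_TURN for n in range(1, turns + 1))

print(total_input_tokens(5))    # 7500
print(total_input_tokens(25))   # 162500
```

Five turns cost 7,500 tokens of processing. Twenty-five turns cost 162,500. Five times the messages, roughly twenty-one times the burn.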
The 10 changes that fixed it
1. Edit the original prompt instead of stacking follow-ups
This was the highest-impact change I made.
What I was doing:
```
Message 1: "Write me an intro for this post"
Message 2: "Make it shorter"
Message 3: "Add more energy to the opening line"
Message 4: "Actually, change the angle entirely"
```
Each of those follow-ups looks like a small correction. What it actually is: a full context reload multiplied four times.
What I changed to:
Edit the original prompt directly instead of appending corrections. Claude processes one refined version with one context load instead of four cumulative loads.
Immediate result: sharper outputs, fewer iterations, lower token burn.
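Rough arithmetic shows the gap. All token counts below are illustrative assumptions, not measurements:

```python
# Illustrative comparison: stacking four follow-ups vs. editing the
# original prompt once. All sizes are assumed token counts.
prompt, reply, followup = 200, 400, 20

# Stacking: every turn re-sends the entire history so far.
history, stacked_cost = prompt, 0
for _ in range(4):
    stacked_cost += history + followup  # input tokens for this turn
    history += followup + reply         # history grows each turn

# Editing: one refined prompt carrying all the corrections, one load.
edited_cost = prompt + 4 * followup

print(stacked_cost, edited_cost)  # 3400 vs 280
```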
2. Reset sessions before context gets heavy
At message 5, Claude might process a few hundred tokens of history. At message 25, it could be processing tens of thousands — for every single new message, including simple questions.
The pattern I use now:
```
After every 15–20 messages:
1. Ask Claude: "Summarise everything we have covered so far in this session."
2. Copy the summary.
3. Open a new chat.
4. Paste the summary as the opening context.
```
You keep continuity. You eliminate history weight. Same outcome at a fraction of the token cost.
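The same pattern is easy to script if you work through the API. A minimal sketch, assuming the Anthropic Python SDK; the model id is a placeholder:

```python
# Sketch: summarise-and-reset via the Messages API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
MODEL = "claude-sonnet-4-5"     # placeholder: use your model id

def reset_with_summary(history: list[dict]) -> list[dict]:
    """Collapse a long message history into a one-message summary."""
    summary = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=history + [{
            "role": "user",
            "content": "Summarise everything we have covered so far "
                       "in this session.",
        }],
    )
    # Seed a fresh history with the summary alone: continuity kept,
    # accumulated turn weight dropped.
    return [{
        "role": "user",
        "content": "Context from my previous session:\n"
                   + summary.content[0].text,
    }]
```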
3. Combine multi-step tasks into single execution prompts
Before:
```
Prompt 1: "Summarise this document"
Prompt 2: "Now extract the key points"
Prompt 3: "Rewrite the intro in under 100 words"
```
Three context loads. Three rounds of history reprocessing.
After:
`"Summarise this document, extract the key points as bullets,
and rewrite the intro in under 100 words. Return all three
in separate labelled sections."`
One context load. One processing cycle. Better output because Claude sees the full scope of what you need before it starts.
4. Stop re-uploading the same files
Every time you upload a file, Claude reprocesses it from scratch — even if it is identical to one you uploaded two messages ago. This is a hidden cost that does not feel like a cost because the UX treats uploads casually.
Fix: Use Projects to store reference files once. Files in a Project are cached in context rather than reprocessed per upload. For workflows that reference the same documentation, codebase, or data repeatedly, this alone produces significant savings.
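If you work through the API rather than the app, the closest lever is prompt caching: mark the large, stable content with a cache checkpoint so repeated calls reuse it instead of reprocessing it. A minimal sketch, assuming the Anthropic Python SDK's `cache_control` option; the file path and model id are placeholders:

```python
# Sketch: cache a large reference document across repeated API calls.
import anthropic

client = anthropic.Anthropic()
reference_doc = open("docs/reference.md").read()  # placeholder path

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer using the reference below."},
        {
            "type": "text",
            "text": reference_doc,
            # Cache checkpoint: later calls sharing this prefix read
            # the document from cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does section 3 cover?"}],
)
print(response.content[0].text)
```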
5. Store persistent instructions once
If you start every session with:
`"I am a developer building automation workflows.
Write in a concise technical tone.
Use code blocks for all examples.
Avoid unnecessary explanation."`
You are burning those tokens every single session. Stored in a Project's system instructions or a reusable prompt template, they load once — and you stop paying for them repeatedly.
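Through the API, the equivalent is defining the instructions once as a constant and sending them via the `system` parameter rather than retyping them per session; combined with a cache checkpoint as in the earlier sketch, repeat calls stop reprocessing them too. A sketch with a placeholder model id:

```python
# Sketch: persistent instructions stored once in code, sent via the
# system parameter of the Messages API.
PERSISTENT_INSTRUCTIONS = (
    "I am a developer building automation workflows. "
    "Write in a concise technical tone. "
    "Use code blocks for all examples. "
    "Avoid unnecessary explanation."
)

def ask(client, question: str):
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=PERSISTENT_INSTRUCTIONS,
        messages=[{"role": "user", "content": question}],
    )
```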
6. Disable tools you are not actively using
Active tools — web search, integrations, connected services — add processing overhead even when they are not being called in a given turn. The model still has to account for their presence in every response.
Default posture: everything off. Enable specific tools only for sessions where they are required. Disable them when done. The per-turn cost reduction is small but it compounds across long sessions.
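Scripted, the default-off posture looks like this. The tool here is a hypothetical custom tool in the Messages API's tool format:

```python
# Sketch: tools disabled by default, passed only when a call needs them.
WEB_SEARCH_TOOL = {
    "name": "web_search",  # hypothetical custom tool definition
    "description": "Search the web for current information.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def ask(client, question: str, needs_search: bool = False):
    # Only attach the tool schema when this call actually needs it;
    # otherwise its definition never enters the context at all.
    kwargs = {"tools": [WEB_SEARCH_TOOL]} if needs_search else {}
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
        **kwargs,
    )
```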
7. Match model to task complexity
| Task type | Model tier |
|---|---|
| Simple Q&A, lookup | Lightest available |
| Writing, summarisation | Mid-tier |
| Complex reasoning | Heavy model |
| Code generation | Depends on complexity |
Using Claude's most powerful model for every task is the equivalent of running a GPU render for a text file. The lighter models handle most everyday tasks at a lower cost per token. Reserve the heavy models for tasks that genuinely need them.
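A small routing helper captures the same idea in code. The tier names and model ids are placeholders to swap for whatever you actually run:

```python
# Sketch: route each task to the cheapest adequate model tier.
MODEL_FOR = {
    "lookup":    "light-model-id",  # placeholder: lightest tier
    "writing":   "mid-model-id",    # placeholder: mid tier
    "reasoning": "heavy-model-id",  # placeholder: heaviest tier
}

def pick_model(task_type: str) -> str:
    # When unsure, default to the mid tier rather than the heaviest.
    return MODEL_FOR.get(task_type, MODEL_FOR["writing"])
```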
8. Write specific, scoped prompts
Vague prompts produce long outputs. Long outputs consume output tokens. Output tokens cost more per unit than input tokens.
Before:
`"What do you think about this approach?"`
After:
`"List the 3 main risks with this approach.
Each risk in one sentence. No intro or conclusion."`
The second prompt produces a shorter, more useful response — and costs less to generate. Specificity is free. Vagueness is expensive.
9. Timing matters more than expected
Heavy usage periods produce tighter limits and slower responses regardless of your plan tier. If you are running intensive sessions — large context windows, complex reasoning tasks, multi-file workflows — scheduling them outside peak hours produces longer effective sessions and more consistent availability.
10. Track what you are actually burning
None of these changes matter without visibility. Start logging:
```python
# Minimal usage tracking.
# get_token_count and the *_context variables are placeholders:
# wire them to your tokenizer or the API's token-counting endpoint
# (a runnable sketch follows below).
session_start_tokens = get_token_count(initial_context)
after_message_tokens = get_token_count(full_context)
delta = after_message_tokens - session_start_tokens
print(f"This message cost: {delta} tokens")
print(f"Session total: {after_message_tokens} tokens")
```
When you can see the delta on each turn, the expensive patterns become obvious within a few sessions. The invisible becomes trackable.
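For real numbers rather than pseudocode, the API exposes a token-counting endpoint. A sketch, assuming the Anthropic Python SDK; the model id is a placeholder:

```python
# Sketch: measure per-turn context growth with the token-counting
# endpoint of the Messages API.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder model id

def context_tokens(messages: list[dict]) -> int:
    return client.messages.count_tokens(
        model=MODEL, messages=messages
    ).input_tokens

history = [{"role": "user", "content": "Write me an intro for this post"}]
before = context_tokens(history)

history.append({"role": "assistant", "content": "...model reply..."})
history.append({"role": "user", "content": "Make it shorter"})
after = context_tokens(history)

print(f"This message cost: {after - before} tokens")
print(f"Session total: {after} tokens")
```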
The four patterns that account for most waste
Having implemented all of the above, I found that four root causes produce the majority of unnecessary token consumption:
Long unmanaged conversations — context weight compounds with every turn. Reset and summarise before turn 20.
Repeated file uploads — every upload is a full reprocess. Use Projects for persistent files.
Active unused tools — overhead exists whether you call them or not. Disable by default.
Heavy models on simple tasks — every output token from the heaviest model costs more than the same token from a lighter one. Match the tool to the job.
Fix these four and the limit stops feeling like a restriction. It starts feeling like something you can comfortably stay inside with normal usage.
The mental model that actually changes behaviour
I used to frame it as: Claude is limiting me.
The accurate frame is: my workflow was consuming more than it needed to.
Token limits are signals, not ceilings. They show you exactly how efficiently you are using the system. When you design prompts the way you would design efficient queries — send only what is needed, scope the output, reset before state compounds — the limit stops being a problem you bump into and starts being something you never think about.
Full breakdown of these patterns with usage examples in the original Medium article.
What is the pattern that hits you hardest? Long sessions, re-uploads, or something else entirely — drop it in the comments.