5 Claude API mistakes that cost me money (and how I fixed them)
I spent three months building with the Claude API directly before switching to a flat-rate wrapper. Here are the five mistakes that cost me real money — and the code fixes that solved each one.
Mistake 1: Not trimming conversation history
Each API call sends your entire conversation history. A 20-turn conversation can easily hit 40,000 tokens per request.
# ❌ EXPENSIVE: sends full history every time
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=conversation_history  # grows unbounded
)
# ✅ CHEAP: trim to last N turns
def trim_history(history, max_turns=10):
    if len(history) <= max_turns * 2:
        return history
    # Keep only the most recent turns. The system prompt is passed
    # separately via the `system` parameter, so it survives trimming.
    return history[-(max_turns * 2):]
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=trim_history(conversation_history)
)
This alone cut my token usage by 60% in long sessions.
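To see why unbounded history gets expensive, here's a back-of-envelope model of cumulative input tokens across a conversation. The 200-tokens-per-message figure is an assumption for illustration; real messages vary widely.

```python
# Rough model of cumulative input tokens over a multi-turn conversation.
# Assumes ~200 tokens per message (an illustrative guess, not a measurement).
TOKENS_PER_MESSAGE = 200

def cumulative_input_tokens(turns, max_turns=None):
    """Total input tokens sent across every API call in a conversation.

    Each turn adds a user message and an assistant reply. If max_turns
    is set, history is trimmed to the last max_turns turns before each
    call, like trim_history above.
    """
    total = 0
    for turn in range(1, turns + 1):
        messages_sent = turn * 2 - 1  # prior history + the new user message
        if max_turns is not None:
            messages_sent = min(messages_sent, max_turns * 2)
        total += messages_sent * TOKENS_PER_MESSAGE
    return total

print(cumulative_input_tokens(20))                # unbounded: grows quadratically
print(cumulative_input_tokens(20, max_turns=10))  # trimmed: grows linearly
```

The gap widens the longer the session runs, which is why the savings showed up most in my long debugging conversations.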
Mistake 2: Using max_tokens=4096 everywhere
I copy-pasted max_tokens=4096 into every call without thinking. You're only billed for tokens actually generated, but a high ceiling lets verbose responses run long, and Anthropic's rate limiter counts your requested max_tokens against output-token limits. Most of my responses were under 500 tokens.
# ❌ WASTEFUL: a 4096-token ceiling invites long, rambling outputs
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,  # far more headroom than most answers need
    messages=messages
)
# ✅ CHEAPER: set a reasonable ceiling based on your use case
# For a chatbot: 1024 is usually enough
# For code generation: 2048
# For long-form content: 4096
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # right-sized
    messages=messages
)
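One way to stop hard-coding the ceiling is a small lookup by task type. This is a hypothetical helper of my own, not part of the SDK; the numbers just mirror the guidelines above and are starting points, not rules.

```python
# Hypothetical helper: pick an output ceiling per task type instead of
# pasting max_tokens=4096 everywhere. Tune these to your own workload.
MAX_TOKENS_BY_TASK = {
    "chat": 1024,       # short conversational answers
    "code": 2048,       # code generation
    "long_form": 4096,  # articles, reports
}

def max_tokens_for(task, default=1024):
    """Return the output-token ceiling for a task type, with a safe default."""
    return MAX_TOKENS_BY_TASK.get(task, default)
```

If you do lower the ceiling, check `response.stop_reason` afterwards: a value of "max_tokens" means the reply was cut off and the cap is too tight for that task.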
Mistake 3: No caching on system prompts
If you have a long system prompt (instructions, context, persona), you're paying to re-send it on every single call. Anthropic's prompt caching can reduce this cost by 90%.
# ❌ EXPENSIVE: re-sends full system prompt every call
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a helpful assistant with expertise in...[500 words]",
    messages=messages
)
# ✅ WITH CACHING: cache hits are billed at ~10% of the base input price
# (the ephemeral cache has a 5-minute TTL, refreshed on each use)
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with expertise in...[500 words]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=messages
)
When I wrote this, prompt caching required the anthropic-beta: prompt-caching-2024-07-31 header; it has since become generally available, so current SDK versions don't need it. Worth it if your system prompt is over the minimum cacheable length (1,024 tokens on most models).
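Here's a rough estimate of what caching saves. The multipliers reflect Anthropic's published pricing at the time of writing (cache writes ~1.25x the base input price, cache reads ~0.1x); check the current pricing page before relying on them.

```python
# Back-of-envelope estimate of prompt-caching savings, in units of the
# base input-token price. Multipliers are assumptions from Anthropic's
# published pricing at the time of writing: writes ~1.25x, reads ~0.1x.
def cached_vs_uncached(system_tokens, calls, write_mult=1.25, read_mult=0.10):
    """Return (uncached, cached) input-token cost for `calls` requests
    that all share the same cached system prompt."""
    uncached = system_tokens * calls
    # First call writes the cache; subsequent calls within the TTL read it.
    cached = system_tokens * (write_mult + read_mult * (calls - 1))
    return uncached, cached

uncached, cached = cached_vs_uncached(system_tokens=2000, calls=50)
print(uncached, cached)  # the cached path is ~88% cheaper here
```

That ~88% figure for a 2,000-token prompt reused 50 times is where the "up to 90%" claim comes from: the more calls share the cache within the TTL, the closer you get.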
Mistake 4: Retrying on rate limits without backoff
I had a simple retry loop that hammered the API on rate limit errors. This caused cascading failures and I got temporarily blocked.
// ❌ BROKEN: immediate retry hammers the API on rate-limit errors
async function callClaude(messages) {
  try {
    return await anthropic.messages.create({ ... })
  } catch (err) {
    if (err.status === 429) {
      return callClaude(messages) // unbounded retry loop
    }
    throw err
  }
}
// ✅ CORRECT: exponential backoff with jitter
async function callClaudeWithBackoff(messages, attempt = 0) {
  try {
    return await anthropic.messages.create({
      model: 'claude-opus-4-5',
      max_tokens: 1024,
      messages
    })
  } catch (err) {
    if (err.status === 429 && attempt < 4) {
      const delay = Math.pow(2, attempt) * 1000 + Math.random() * 1000
      await new Promise(resolve => setTimeout(resolve, delay))
      return callClaudeWithBackoff(messages, attempt + 1)
    }
    throw err
  }
}
Mistake 5: Counting tokens manually instead of using the API
I wrote my own token counter that was always wrong. Claude's tokenizer is not GPT's tokenizer. My estimates were 15-30% off.
# ❌ WRONG: tiktoken is for GPT, not Claude
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
token_count = len(enc.encode(text)) # wrong for Claude
# ✅ CORRECT: use Anthropic's count_tokens endpoint
response = anthropic.messages.count_tokens(
    model="claude-opus-4-5",
    messages=[{"role": "user", "content": text}]
)
print(f"Actual tokens: {response.input_tokens}")
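Once you have accurate counts, turning them into a dollar figure is simple arithmetic. The per-million-token prices below are placeholders of mine, not quoted rates; look up the current price for your model before trusting the output.

```python
# Hypothetical cost estimator: converts token counts into dollars.
# The per-million-token prices are placeholder assumptions, NOT current
# Anthropic rates. Substitute the published prices for your model.
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m=15.0, output_price_per_m=75.0):
    """Return the request cost in dollars given token counts and
    per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

Feed it the `input_tokens` from count_tokens plus the `usage.output_tokens` reported on each response, and you can log a running per-conversation cost instead of guessing.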
The real fix: stop counting tokens entirely
After fixing all five mistakes, my API bill dropped significantly. But I was still spending mental energy on token optimization instead of building features.
The actual solution was switching to a flat-rate service. I use SimplyLouie at $2/month — same Claude model, no token counting, no surprise bills.
My monthly spend went from unpredictable ($8-45 depending on usage) to exactly $2.00.
The code got simpler too. Here's my entire API call now:
curl -X POST https://simplylouie.com/api/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "explain binary search trees"}'
No token math. No retry logic I have to maintain. No billing anxiety.
Which mistakes hit you hardest?
I'm curious which of these caught you out. Mistake #1 (unbounded history) was by far the most expensive for me — a long debugging session could cost $3-5 in a single conversation.
If you're still building directly on the Claude API, all five fixes above are worth implementing. And if you're tired of the token math entirely, SimplyLouie is $2/month flat.