brian austin
5 Claude API mistakes that cost me money (and how I fixed them)

I spent three months building with the Claude API directly before switching to a flat-rate wrapper. Here are the five mistakes that cost me real money — and the code fixes that solved each one.

Mistake 1: Not trimming conversation history

Each API call sends your entire conversation history. A 20-turn conversation can easily hit 40,000 tokens per request.

# ❌ EXPENSIVE: sends full history every time
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=conversation_history  # grows unbounded
)

# ✅ CHEAP: trim to last N turns
def trim_history(history, max_turns=10):
    if len(history) <= max_turns * 2:
        return history
    # Each turn is a user + assistant pair; the system prompt lives in
    # the separate `system` parameter, so it survives trimming
    return history[-(max_turns * 2):]

response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=trim_history(conversation_history)
)

This alone cut my token usage by 60% in long sessions.
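The arithmetic behind that saving: without trimming, request n resends all prior turns, so cumulative input tokens grow quadratically with session length, while a cap makes them grow linearly. A toy model (the flat 1,000 tokens per turn is an assumption for illustration):

```python
def total_tokens_sent(turns, tokens_per_turn=1000, max_turns=None):
    """Cumulative input tokens sent over a whole session.

    Request i resends its full history (i turns) unless capped at
    max_turns, so untrimmed cost grows quadratically while trimmed
    cost grows linearly.
    """
    total = 0
    for i in range(1, turns + 1):
        history = i if max_turns is None else min(i, max_turns)
        total += history * tokens_per_turn
    return total

untrimmed = total_tokens_sent(50)              # 1,275,000 tokens
trimmed = total_tokens_sent(50, max_turns=10)  # 455,000 tokens (~64% less)
```

At 50 turns the cap saves roughly the 60% I saw in practice, and the longer the session runs, the bigger the win gets.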

Mistake 2: Using max_tokens=4096 everywhere

I copy-pasted max_tokens=4096 into every call without thinking. You're only billed for output tokens the model actually generates, but a high ceiling invites longer responses and counts against output-token rate limits. Most of my responses were under 500 tokens.

# ❌ WASTEFUL: a 4096-token ceiling on every call, even for short answers
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,  # you don't need this
    messages=messages
)

# ✅ CHEAPER: set a reasonable ceiling based on your use case
# For a chatbot: 1024 is usually enough
# For code generation: 2048
# For long-form content: 4096
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # right-sized
    messages=messages
)
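One caveat when lowering the ceiling: a response that hits it gets cut off mid-sentence. The Messages API reports this via the response's stop_reason field, so you can detect truncation and retry the rare long answer with a higher limit. A minimal check (the retry policy itself is up to you):

```python
def was_truncated(response):
    """True when the reply hit the max_tokens ceiling.

    The Messages API sets stop_reason to "max_tokens" when output was
    cut off, versus "end_turn" for a natural finish.
    """
    return response.stop_reason == "max_tokens"
```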

Mistake 3: No caching on system prompts

If you have a long system prompt (instructions, context, persona), you're paying to re-send it on every single call. Anthropic's prompt caching can reduce this cost by 90%.

# ❌ EXPENSIVE: re-sends full system prompt every call
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a helpful assistant with expertise in...[500 words]",
    messages=messages
)

# ✅ WITH CACHING: cached reads bill at a fraction of the base input
# rate (the cache lives ~5 minutes, refreshed on each hit)
response = anthropic.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with expertise in...[500 words]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=messages
)

Prompt caching originally required the anthropic-beta: prompt-caching-2024-07-31 header; it has since become generally available. It's worth it once your system prompt clears the minimum cacheable length (1,024 tokens on most models).
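To confirm the cache is actually being hit, inspect the usage block on the response: with caching active it reports cache_creation_input_tokens (tokens written to the cache) and cache_read_input_tokens (tokens served from it). A small helper (a sketch; field names follow the Messages API usage object):

```python
def cache_summary(usage):
    """Human-readable summary of prompt-cache activity for one call.

    Reads > 0 on repeat calls means the cache is working; writes > 0
    means this call just warmed it.
    """
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read"
    if written:
        return f"cache warmed: {written} tokens written"
    return "no cache activity"
```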

Mistake 4: Retrying on rate limits without backoff

I had a simple retry loop that hammered the API on rate limit errors. This caused cascading failures and I got temporarily blocked.

// ❌ BROKEN: immediate retry causes cascading rate limit failures
async function callClaude(messages) {
  try {
    return await anthropic.messages.create({ ... })
  } catch (err) {
    if (err.status === 429) {
      return callClaude(messages) // infinite retry loop
    }
    throw err
  }
}

// ✅ CORRECT: exponential backoff with jitter
async function callClaudeWithBackoff(messages, attempt = 0) {
  try {
    return await anthropic.messages.create({
      model: 'claude-opus-4-5',
      max_tokens: 1024,
      messages
    })
  } catch (err) {
    if (err.status === 429 && attempt < 4) {
      const delay = Math.pow(2, attempt) * 1000 + Math.random() * 1000
      await new Promise(resolve => setTimeout(resolve, delay))
      return callClaudeWithBackoff(messages, attempt + 1)
    }
    throw err
  }
}
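One refinement to the backoff above: 429 responses typically carry a Retry-After header, and honoring it beats guessing. Here's the delay policy as a standalone function, in Python to match the rest of the post (the header handling is a general HTTP pattern, not specific to Anthropic's SDK):

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Prefer the server's Retry-After value when present; otherwise use
    exponential backoff with jitter, capped so a deep retry chain
    never sleeps for minutes.
    """
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt) + random.random(), cap)
```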

Mistake 5: Counting tokens manually instead of using the API

I wrote my own token counter that was always wrong. Claude's tokenizer is not GPT's tokenizer. My estimates were 15-30% off.

# ❌ WRONG: tiktoken is for GPT, not Claude
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
token_count = len(enc.encode(text))  # wrong for Claude

# ✅ CORRECT: use Anthropic's count_tokens endpoint
response = anthropic.messages.count_tokens(
    model="claude-opus-4-5",
    messages=[{"role": "user", "content": text}]
)
print(f"Actual tokens: {response.input_tokens}")
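Once you have real counts, a pre-flight budget guard is trivial: reject a call whose worst case (actual input plus the output ceiling) blows your per-call budget. A toy sketch (the 8,000-token budget is an arbitrary placeholder, not a real limit):

```python
def within_budget(input_tokens, max_tokens, budget=8000):
    """Worst-case token check before sending a request: actual input
    plus the output ceiling must fit inside the per-call budget."""
    return input_tokens + max_tokens <= budget

within_budget(3500, 1024)   # True: worst case is 4,524 tokens
within_budget(7500, 1024)   # False: worst case exceeds the budget
```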

The real fix: stop counting tokens entirely

After fixing all five mistakes, my API bill dropped significantly. But I was still spending mental energy on token optimization instead of building features.

The actual solution was switching to a flat-rate service. I use SimplyLouie at $2/month — same Claude model, no token counting, no surprise bills.

My monthly spend went from unpredictable ($8-45 depending on usage) to exactly $2.00.

The code got simpler too. Here's my entire API call now:

curl -X POST https://simplylouie.com/api/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "explain binary search trees"}'

No token math. No retry logic I have to maintain. No billing anxiety.

Which mistakes hit you hardest?

I'm curious which of these caught you out. Mistake #1 (unbounded history) was by far the most expensive for me — a long debugging session could cost $3-5 in a single conversation.

If you're still building directly on the Claude API, all five fixes above are worth implementing. And if you're tired of the token math entirely, SimplyLouie is $2/month flat.
