
kanta13jp1


Claude API Cost Optimization: Haiku vs Sonnet and Prompt Caching Cut My Bill 70%


My Claude API costs went from $80/month to $24/month. Here's exactly what I changed.

Model Selection Principle

claude-haiku-4-5:    lightweight tasks / high-frequency calls (cheapest)
claude-sonnet-4-6:   complex reasoning / design work (balanced)
claude-opus-4-7:     maximum accuracy only (most expensive)

Rough cost ratio:

haiku : sonnet : opus = 1 : 5 : 15
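To make the ratio concrete, here's a tiny helper that scores a workload by relative cost. The multipliers mirror the rough ratio above and are illustrative, not official pricing; check Anthropic's pricing page for actual per-token rates.

```typescript
// Relative-cost helper based on the rough 1 : 5 : 15 ratio above.
// Multipliers are illustrative, not real per-token prices.
const COST_MULTIPLIER: Record<string, number> = {
  'claude-haiku-4-5': 1,
  'claude-sonnet-4-6': 5,
  'claude-opus-4-7': 15,
};

function relativeCost(model: string, tokens: number): number {
  const mult = COST_MULTIPLIER[model];
  if (mult === undefined) throw new Error(`unknown model: ${model}`);
  return tokens * mult;
}
```

Running the same 100-token task through opus instead of haiku costs roughly 15× as much, which is why routing matters more than any single-call optimization.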

Routing Tasks to the Right Model

function selectModel(taskType: string): string {
  switch (taskType) {
    case 'cs_reply':          return 'claude-haiku-4-5-20251001';
    case 'competitor_check':  return 'claude-haiku-4-5-20251001';
    case 'daily_judgment':    return 'claude-haiku-4-5-20251001';
    case 'code_review':       return 'claude-sonnet-4-6';
    case 'architecture':      return 'claude-sonnet-4-6';
    default:                  return 'claude-haiku-4-5-20251001';  // default to cheapest
  }
}
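A related pattern worth sketching: call the cheap model first and escalate only when its answer fails a sanity check. Here `callModel` and `looksIncomplete` are hypothetical stand-ins for your own API wrapper and validation logic, not part of the Anthropic SDK.

```typescript
// Haiku-first with one-shot escalation: re-run on sonnet only when the
// cheap answer fails validation. callModel and looksIncomplete are
// hypothetical stand-ins for your own wrapper and checks.
async function answerWithFallback(
  prompt: string,
  callModel: (model: string, prompt: string) => Promise<string>,
  looksIncomplete: (answer: string) => boolean,
): Promise<string> {
  const first = await callModel('claude-haiku-4-5-20251001', prompt);
  if (!looksIncomplete(first)) return first;      // cheap answer is fine
  return callModel('claude-sonnet-4-6', prompt);  // escalate once
}
```

In the common case you pay haiku prices; you only pay sonnet prices for the answers that actually need it.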

Prompt Caching: 90% Off Repeated Tokens

Sending a long system prompt on every call means paying full price for it every time. With Prompt Caching, cache reads are billed at roughly 10% of the base input rate. Two caveats: writing to the cache costs slightly more than a normal input token, and the ephemeral cache expires after a few minutes of inactivity, so this pays off for prompts that are reused frequently.

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_PROMPT,  // thousands of tokens
      cache_control: { type: 'ephemeral' },  // enable caching
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
});

Cost example:

System prompt: 2,000 tokens
Daily calls:   100

Without cache: 2,000 × 100 = 200,000 tokens/day
With cache:    2,000 (first) + 2,000 × 0.1 × 99 = 21,800 tokens/day
Reduction:     89%
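The arithmetic above, spelled out. This ignores the one-time cache-write surcharge and assumes every call lands within the cache TTL:

```typescript
// Effective billed input tokens/day for a cached 2,000-token system prompt,
// with cache reads billed at 10% of the base input rate.
const promptTokens = 2000;
const callsPerDay = 100;
const cacheReadRate = 0.1;

const withoutCache = promptTokens * callsPerDay;
const withCache = promptTokens + promptTokens * cacheReadRate * (callsPerDay - 1);
const reduction = 1 - withCache / withoutCache;

console.log(withoutCache, withCache, (reduction * 100).toFixed(1) + '%');
// 200000 21800 "89.1%"
```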

Minimize Input Tokens

// ❌ BAD: passing the entire history on every call
const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 1024,
  messages: [{
    role: 'user',
    content: `${entireUserHistory}\n\n${question}`,
  }],
});

// ✅ GOOD: only recent, relevant context
const recentHistory = userHistory.slice(-5);  // last 5 turns
const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 1024,
  messages: [
    ...recentHistory,
    { role: 'user', content: question },
  ],
});
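Slicing the last N messages is a blunt instrument; a slightly smarter sketch trims by an estimated token budget instead. The ~4-characters-per-token estimate is a rough heuristic, so use a real tokenizer when accuracy matters:

```typescript
interface Msg { role: 'user' | 'assistant'; content: string; }

// Keep the most recent messages that fit an estimated token budget,
// walking backwards from the newest. ~4 chars/token is a rough heuristic.
function trimToBudget(history: Msg[], maxTokens: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const estTokens = Math.ceil(history[i].content.length / 4);
    if (used + estTokens > maxTokens) break;  // budget exhausted
    kept.unshift(history[i]);                 // preserve chronological order
    used += estTokens;
  }
  return kept;
}
```

This keeps short recent exchanges even when one old message is huge, which a fixed `slice(-5)` can't do.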

Batch Processing for Throughput

// ❌ BAD: sequential, 100 items × 1s ≈ 100s
const results = [];
for (const item of items) {
  results.push(await callClaude(item));
}

// ✅ GOOD: parallel batches that stay within rate limits
const BATCH_SIZE = 5;
const results = [];
for (let i = 0; i < items.length; i += BATCH_SIZE) {
  const batch = items.slice(i, i + BATCH_SIZE);
  results.push(...await Promise.all(batch.map(callClaude)));
  if (i + BATCH_SIZE < items.length) {
    await new Promise(resolve => setTimeout(resolve, 500));  // brief pause between batches
  }
}
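Parallel batches raise the odds of hitting rate limits, so it's worth pairing them with retries. A sketch of exponential backoff on HTTP 429, assuming your wrapper throws errors carrying a numeric `status` field (an assumption, not a guaranteed SDK error shape):

```typescript
// Retry a call with exponential backoff when the API rate-limits us.
// Assumes thrown errors expose a numeric `status` (429 = rate limited).
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let delayMs = 1000;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= maxRetries || err?.status !== 429) throw err;
      await new Promise(resolve => setTimeout(resolve, delayMs));
      delayMs *= 2;  // 1s, 2s, 4s, ...
    }
  }
}
```

Wrap each call in the batch loop with it, e.g. `batch.map(item => withBackoff(() => callClaude(item)))`.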

Cost Monitoring via GHA

# .github/workflows/cost-monitor.yml
name: cost-monitor
on:
  schedule:
    - cron: '0 9 * * *'  # 09:00 UTC daily

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - name: Check API usage
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # Endpoint and response shape are illustrative -- check the
          # Anthropic Admin/Usage API docs for the current interface.
          USAGE=$(curl -s "https://api.anthropic.com/v1/usage" \
            -H "x-api-key: $ANTHROPIC_API_KEY")
          TOTAL=$(echo "$USAGE" | jq '.total_tokens')
          if [ "$TOTAL" -gt 10000000 ]; then
            echo "⚠️ Monthly token usage exceeded threshold: $TOTAL"
            # trigger a Slack alert or open a GitHub Issue here
          fi
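For the alert step, the simplest path is a Slack incoming webhook. A sketch in TypeScript (`SLACK_WEBHOOK_URL` is an assumption: create an incoming webhook in your Slack workspace and store its URL as a repo secret):

```typescript
// Build the Slack message and post it to an incoming webhook.
// SLACK_WEBHOOK_URL is hypothetical -- set it up in your workspace first.
function buildAlertPayload(totalTokens: number, threshold: number) {
  return {
    text: `⚠️ Token usage ${totalTokens} exceeded threshold ${threshold}`,
  };
}

async function alertSlack(totalTokens: number, threshold: number): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) throw new Error('SLACK_WEBHOOK_URL not set');
  await fetch(url, {  // global fetch, Node 18+
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildAlertPayload(totalTokens, threshold)),
  });
}
```

Keeping the payload builder separate makes the message easy to test without touching the network.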

Actual Results

Before (January 2026):
  claude-opus:   60% → $62
  claude-sonnet: 30% → $18
  Total: $80/month

After (April 2026):
  claude-haiku:  80% → $12
  claude-sonnet: 20% → $12 (with Prompt Caching)
  Total: $24/month
  Reduction: 70%

Summary

Cost reduction priority:
  1. Model routing (haiku for everything it can handle)
  2. Prompt Caching (repeated system prompts)
  3. Minimize input tokens (trim context windows)
  4. Batch processing (parallelize within rate limits)
  5. Usage monitoring (auto-alert via GHA)

With smart model routing and Prompt Caching, you can build production-quality AI features at a fraction of the naive cost. Start with "haiku first" — the savings are immediate.
