DEV Community

Atlas Whoff

Claude Opus 4.7 Is Burning Your Budget: 4 Token Multipliers Nobody Warns You About

Developers moving production workloads to Claude Opus 4.7 are reporting 1.5–3x higher costs than projected. Not because of pricing changes — because of four silent token multipliers that compound on each other. Fix all four and you can cut costs 60-75% on the same workload.

The Compounding Problem

Token costs aren't linear. Each multiplier stacks:

Base cost: $10
× Retry loops (2x average): $20
× Context bloat (3x by turn 50): $60
× No prompt caching (forgoing the ~90% cache discount): $60
× Verbose schemas (1.4x tool overhead): $84

Actual cost: $84 vs projected $10

Here's each one and how to fix it.

Multiplier 1: Retry Loops

A failed tool call reinvokes the model. If your agent retries 3 times on a network error, you've paid 3x for that turn. On a pipeline that calls tools 20 times per run:

// Bad: naive retry
async function callTool(tool: string, args: unknown) {
  for (let i = 0; i < 3; i++) {
    try {
      return await executeToolCall(tool, args);
    } catch (e) {
      // Retrying means paying full context price again
      continue;
    }
  }
  // Silently returns undefined after 3 failed attempts — a second hidden bug
}

// Better: exponential backoff + circuit breaker
const RETRY_BUDGET = { maxAttempts: 2, backoffMs: 1000 };
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function callToolWithBudget(tool: string, args: unknown, attempt = 0) {
  try {
    return await executeToolCall(tool, args);
  } catch (e) {
    if (attempt >= RETRY_BUDGET.maxAttempts) throw e; // fail fast
    if (isTransientError(e)) {
      await sleep(RETRY_BUDGET.backoffMs * Math.pow(2, attempt));
      return callToolWithBudget(tool, args, attempt + 1);
    }
    throw e; // don't retry logic errors
  }
}

More importantly: distinguish transient from permanent errors. A 400 (bad request) shouldn't retry at all. A 503 can retry twice. Retrying a bad prompt 3 times is paying 3x for the same wrong answer.
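The `isTransientError` helper above can be as simple as a status-code check. A minimal sketch — the `ApiError` shape is an assumption about what your HTTP client throws, not an SDK type:

```typescript
// Assumed error shape — adjust to whatever your client actually throws.
interface ApiError {
  status?: number;
  code?: string;
}

// Transient: worth one cheap retry. Permanent: fail fast, don't pay again.
function isTransientError(e: unknown): boolean {
  const err = e as ApiError;
  // Network-level failures carry no HTTP status
  if (err.status === undefined) {
    return err.code === 'ECONNRESET' || err.code === 'ETIMEDOUT';
  }
  // 408 Request Timeout, 429 Too Many Requests, 5xx server errors
  return err.status === 408 || err.status === 429 || err.status >= 500;
}
```

The exact status list is a judgment call; the point is that the classification happens before you spend another full-context model call.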

Multiplier 2: Context Bloat

Nobody truncates conversation history. A 50-turn conversation means turn 50 pays for 49 previous turns as input tokens — every single one.

// This is what most people do:
const response = await client.messages.create({
  model: 'claude-opus-4-7',
  messages: conversationHistory, // grows unbounded — silent cost bomb
  // ...
});

// What you should do:
async function pruneHistory(
  history: Message[],
  maxTokenBudget = 8000 // reserved for a token-count cutoff; unused in this sketch
): Promise<Message[]> {
  // Always keep last N turns verbatim
  const KEEP_RECENT = 6;
  const recent = history.slice(-KEEP_RECENT);

  // Summarize older context rather than dropping it
  if (history.length > KEEP_RECENT) {
    const older = history.slice(0, -KEEP_RECENT);
    // Option 1: Drop (lossy but cheap)
    // Option 2: Summarize with Haiku (cheap model, good compression)
    const summary = await summarizeWithHaiku(older);
    return [
      { role: 'user', content: `[Context summary: ${summary}]` },
      ...recent,
    ];
  }

  return recent;
}

For long agent runs, use a sliding window with periodic summarization. The compression ratio on Haiku is roughly 10:1 in tokens, so you pay ~$0.50 to summarize $5 of context.
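The `summarizeWithHaiku` helper referenced above isn't an SDK function. Here's one possible sketch — the model ID is an assumption (substitute whatever small model your account offers), and the `MessageClient` structural type stands in for an `@anthropic-ai/sdk` client instance so the example type-checks on its own:

```typescript
// Message shape used throughout this article
type Message = { role: 'user' | 'assistant'; content: string };

// Structural slice of the SDK client we need — in real code,
// pass an Anthropic instance from @anthropic-ai/sdk.
type MessageClient = {
  messages: {
    create(params: object): Promise<{ content: Array<{ type: string; text?: string }> }>;
  };
};

// Flatten older turns into a plain transcript (pure, easy to test)
function toTranscript(older: Message[]): string {
  return older.map((m) => `${m.role}: ${m.content}`).join('\n');
}

// Sketch: compress older turns with a cheap model
async function summarizeWithHaiku(client: MessageClient, older: Message[]): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-haiku-4-5', // assumption — use your cheap model of choice
    max_tokens: 500, // hard cap enforces roughly the 10:1 compression above
    messages: [{
      role: 'user',
      content:
        'Summarize this conversation in under 300 words. Keep decisions, ' +
        'constraints, and open questions:\n\n' + toTranscript(older),
    }],
  });
  const first = response.content[0];
  return first?.type === 'text' && first.text ? first.text : '';
}
```

The `max_tokens` cap matters more than the prompt wording: it's the hard ceiling on what the summary can cost downstream as input tokens.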

Multiplier 3: No Prompt Caching (The March 2026 Trap)

If you're not using cache_control: { type: 'ephemeral' } on your system prompt, you're paying full Opus 4.7 pricing on every token in your system prompt, every call.

Worse: in March 2026, Anthropic silently dropped the default cache TTL from 1 hour to 5 minutes. If you implemented caching before March and assumed 1-hour TTL, your cache hit rate may have collapsed without any warning.

// Check your cache hit rate — if this is near 0, you have a problem:
console.log(response.usage);
// {
//   input_tokens: 45,
//   cache_read_input_tokens: 0,    // ← should be non-zero on repeated calls
//   cache_creation_input_tokens: 9843
// }

// Fix: mark your stable system prompt for caching
const response = await client.messages.create({
  model: 'claude-opus-4-7',
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_PROMPT, // the part that doesn't change
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: conversationHistory,
});

With a 10,000-token system prompt at Opus 4.7 pricing:

  • Without caching: $0.15 per call (1,000 calls = $150)
  • With caching + 5-min TTL: $0.015 per cache hit; 1,000 calls within the TTL ≈ $15.17 total (one 1.25x cache write, then 999 reads at 10% of base input price)
  • Savings: ~90% if calls stay within the TTL window
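That arithmetic as a quick sanity-check function, assuming the standard cache pricing multipliers (writes at 1.25x the base input rate for the 5-minute cache, reads at 0.1x):

```typescript
// Cost of a cached vs. uncached system prompt over N calls, all within TTL.
function cachedPromptCost(opts: {
  promptTokens: number;
  baseRatePerMTok: number; // dollars per million input tokens
  calls: number;
}): { uncached: number; cached: number } {
  const perCall = (opts.promptTokens / 1_000_000) * opts.baseRatePerMTok;
  const write = perCall * 1.25;                   // first call creates the cache
  const reads = perCall * 0.1 * (opts.calls - 1); // later calls read it at 10%
  return { uncached: perCall * opts.calls, cached: write + reads };
}
```

For a 10,000-token prompt at $15/MTok over 1,000 calls this gives roughly $150 uncached vs. ~$15.17 cached.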

For workloads with >5 minute gaps between calls, caching won't help — switch to the Batch API for 50% off instead.
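A sketch of what moving to batches looks like. The payload builder below is pure and testable; the submit call (commented out) uses the SDK's Message Batches endpoint and needs `@anthropic-ai/sdk` plus an API key:

```typescript
// Each request carries a custom_id so the async results can be
// matched back to their inputs.
function buildBatchRequests(prompts: string[]) {
  return prompts.map((prompt, i) => ({
    custom_id: `req-${i}`,
    params: {
      model: 'claude-opus-4-7', // model ID as used elsewhere in this article
      max_tokens: 1024,
      messages: [{ role: 'user' as const, content: prompt }],
    },
  }));
}

// Submit (sketch):
// const batch = await client.messages.batches.create({
//   requests: buildBatchRequests(prompts),
// });
// Results arrive asynchronously (up to 24h) at 50% of standard token pricing.
```

The trade is latency for price: batches only make sense for workloads that were already tolerating multi-minute gaps anyway.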

Multiplier 4: Verbose Tool Schemas

Tool definitions count as input tokens on every call. A detailed JSON schema with 20 tools, each with rich descriptions and parameter documentation, can add 2,000-4,000 tokens per request.

// This schema is 400 tokens:
const verboseTool = {
  name: 'search_database',
  description: 'Searches the production PostgreSQL database using parameterized queries. Supports full-text search across the documents table. Returns paginated results with relevance scoring. Use this when the user asks for information that might be stored in our database.',
  input_schema: {
    type: 'object',
    properties: {
      query: {
        type: 'string',
        description: 'The search query string to use for full-text search across the documents table'
      },
      limit: {
        type: 'number',
        description: 'Maximum number of results to return (default 10, max 100)'
      },
      offset: {
        type: 'number', 
        description: 'Pagination offset for retrieving subsequent pages of results'
      }
    },
    required: ['query']
  }
};

// This schema is 80 tokens and Claude handles it fine:
const compactTool = {
  name: 'search_database',
  description: 'Full-text search across documents. Returns paginated results.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string' },
      limit: { type: 'number' },
      offset: { type: 'number' }
    },
    required: ['query']
  }
};

Reduce descriptions to the minimum Claude needs to select the right tool. Verbose documentation belongs in your codebase, not in the token stream.

Also: don't send tools the agent doesn't need for the current turn. If this turn is purely analytical, strip tool definitions entirely.
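One way to do that per-turn filtering — a small hypothetical helper that sends only the tools the current turn can plausibly use:

```typescript
type Tool = { name: string; description: string; input_schema: object };

// Keep only the tools named for this turn; every schema you don't
// send is input tokens you don't pay for.
function toolsForTurn(allTools: Tool[], neededNames: string[]): Tool[] {
  const wanted = new Set(neededNames);
  return allTools.filter((t) => wanted.has(t.name));
}
```

For a purely analytical turn, pass an empty list and omit the `tools` parameter from the request entirely.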

Measuring Your Actual Multipliers

// `Anthropic.Usage` is the usage type exported by @anthropic-ai/sdk
class TokenBudgetTracker {
  private calls = 0;
  private totalInput = 0;
  private totalCacheRead = 0;
  private totalCacheWrite = 0;
  private retries = 0;

  record(usage: Anthropic.Usage, retryCount: number) {
    this.calls++;
    this.totalInput += usage.input_tokens;
    this.totalCacheRead += usage.cache_read_input_tokens ?? 0;
    this.totalCacheWrite += usage.cache_creation_input_tokens ?? 0;
    this.retries += retryCount;
  }

  report() {
    const effectiveCacheRate = this.totalCacheRead /
      (this.totalInput + this.totalCacheRead + this.totalCacheWrite);

    console.log({
      calls: this.calls,
      avgInputTokens: (this.totalInput / this.calls).toFixed(0),
      cacheHitRate: `${(effectiveCacheRate * 100).toFixed(1)}%`,
      retryRate: `${((this.retries / this.calls) * 100).toFixed(1)}%`,
    });
  }
}

Run this for one hour in production and you'll see exactly which multiplier is doing the most damage.

Quick Wins Ranked by Impact

  1. Add cache_control to system prompt — 5 minutes, ~50-89% savings on system prompt tokens
  2. Prune conversation history to last 6 turns — 30 minutes, ~60% reduction on long sessions
  3. Strip tool schemas when not needed — 1 hour, ~20-40% reduction depending on schema size
  4. Cap retries at 2 with transient-only logic — 2 hours, eliminates 2-3x retry waste

Don't optimize blindly — measure first with the tracker above, then fix the biggest multiplier.
