Claude Opus 4.7 Is Burning Your Budget: 4 Token Multipliers Nobody Warns You About
Developers moving production workloads to Claude Opus 4.7 are reporting 1.5–3x higher costs than projected. Not because of pricing changes — because of four silent token multipliers that compound on each other. Fix all four and you can cut costs 60-75% on the same workload.
The Compounding Problem
Token costs aren't linear. Each multiplier stacks:
- Base cost: $10
- × 2 retry loops (average): $20
- × 3 context bloat (by turn 50): $60
- × 1 no prompt caching (no discount applied, full price each call): $60
- × 1.4 verbose schemas (tool overhead): $84
- Actual cost: $84 vs projected $10
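The stacked multipliers above can be sketched in a few lines. The factors are the illustrative numbers from the breakdown, not measured values:

```typescript
// Sketch of the compounding arithmetic. Each factor multiplies the
// running cost; a 1.0 factor means "no discount applied".
const BASE_COST = 10;

const multipliers = [
  2.0, // retry loops
  3.0, // context bloat
  1.0, // no prompt caching: no discount applied
  1.4, // verbose schemas
];

function effectiveCost(base: number, factors: number[]): number {
  return factors.reduce((cost, f) => cost * f, base);
}

console.log(effectiveCost(BASE_COST, multipliers).toFixed(2)); // "84.00"
```

The point of modeling it this way: fixing one multiplier reduces cost proportionally, but the waste you see on your bill is the product of all of them.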
Here's each one and how to fix it.
Multiplier 1: Retry Loops
A failed tool call reinvokes the model. If your agent makes 3 attempts on a network error, you've paid 3x for that turn. On a pipeline that calls tools 20 times per run, that waste multiplies fast:
// Bad: naive retry
async function callTool(tool: string, args: unknown) {
  for (let i = 0; i < 3; i++) {
    try {
      return await executeToolCall(tool, args);
    } catch (e) {
      // Retrying means paying full context price again
      continue;
    }
  }
  // After three failed (and fully billed) attempts, this silently returns undefined
}
// Better: exponential backoff + retry budget
const RETRY_BUDGET = { maxAttempts: 2, backoffMs: 1000 };

async function callToolWithBudget(tool: string, args: unknown, attempt = 0) {
  try {
    return await executeToolCall(tool, args);
  } catch (e) {
    if (attempt >= RETRY_BUDGET.maxAttempts) throw e; // fail fast
    if (isTransientError(e)) {
      await sleep(RETRY_BUDGET.backoffMs * Math.pow(2, attempt));
      return callToolWithBudget(tool, args, attempt + 1);
    }
    throw e; // don't retry logic errors
  }
}
More importantly: distinguish transient from permanent errors. A 400 (bad request) shouldn't retry at all. A 503 can retry twice. Retrying a bad prompt 3 times is paying 3x for the same wrong answer.
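The snippets above call `isTransientError` without defining it. A minimal sketch, assuming errors carry an HTTP `status` or a Node-style network `code` (the exact sets are assumptions; tune them to your stack):

```typescript
// Hypothetical error classifier: retry only on statuses/codes that
// plausibly succeed on a second attempt.
interface ApiError {
  status?: number;
  code?: string;
}

const TRANSIENT_STATUSES = new Set([408, 429, 500, 502, 503, 504]);
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED']);

function isTransientError(e: unknown): boolean {
  if (typeof e !== 'object' || e === null) return false;
  const err = e as ApiError;
  if (err.status !== undefined) return TRANSIENT_STATUSES.has(err.status);
  if (err.code !== undefined) return TRANSIENT_CODES.has(err.code);
  return false; // unknown errors: fail fast rather than pay for retries
}
```

Note the default: anything unclassified fails fast. On a per-token billing model, an optimistic retry policy is a cost policy.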
Multiplier 2: Context Bloat
Nobody truncates conversation history. A 50-turn conversation means turn 50 pays for 49 previous turns as input tokens — every single one.
// This is what most people do:
const response = await client.messages.create({
  model: 'claude-opus-4-7',
  messages: conversationHistory, // grows unbounded; a silent cost bomb
  // ...
});
// What you should do:
async function pruneHistory(
  history: Message[],
  maxTokenBudget = 8000 // reserved for a token-count trigger
): Promise<Message[]> {
  // Always keep the last N turns verbatim
  const KEEP_RECENT = 6;
  const recent = history.slice(-KEEP_RECENT);
  // Summarize older context rather than dropping it
  if (history.length > KEEP_RECENT) {
    const older = history.slice(0, -KEEP_RECENT);
    // Option 1: drop it (lossy but cheap)
    // Option 2: summarize with Haiku (cheap model, good compression)
    const summary = await summarizeWithHaiku(older);
    return [
      { role: 'user', content: `[Context summary: ${summary}]` },
      ...recent,
    ];
  }
  return recent;
}
For long agent runs, use a sliding window with periodic summarization. The compression ratio on Haiku is roughly 10:1 in tokens, so you pay ~$0.50 to summarize $5 of context.
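One piece the pruning code glosses over is *when* to trigger summarization. A sketch of a trigger, assuming a rough 4-characters-per-token heuristic (crude; use a real tokenizer for exact counts):

```typescript
// Decide when history has grown past the budget and needs compaction.
// chars/4 is a rough heuristic for English text, not an exact count.
interface Msg {
  role: 'user' | 'assistant';
  content: string;
}

function estimateTokens(messages: Msg[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function needsCompaction(history: Msg[], maxTokenBudget = 8000): boolean {
  return estimateTokens(history) > maxTokenBudget;
}
```

Checking the estimate before every call costs nothing; summarizing costs a Haiku call, so you only want to pay it when the budget is actually blown.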
Multiplier 3: No Prompt Caching (The March 2026 Trap)
If you're not setting cache_control: { type: 'ephemeral' } on your system prompt, you're paying full Opus 4.7 input pricing for every token of it on every single call.
Worse: in March 2026, Anthropic silently dropped the default cache TTL from 1 hour to 5 minutes. If you implemented caching before March and assumed 1-hour TTL, your cache hit rate may have collapsed without any warning.
// Check your cache hit rate. If this is near 0, you have a problem:
console.log(response.usage);
// {
//   input_tokens: 45,
//   cache_read_input_tokens: 0,       // ← should be non-zero on repeated calls
//   cache_creation_input_tokens: 9843
// }
// Fix: mark your stable system prompt for caching
const response = await client.messages.create({
  model: 'claude-opus-4-7',
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_PROMPT, // the part that doesn't change
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: conversationHistory,
});
With a 10,000-token system prompt at Opus 4.7 pricing:
- Without caching: $0.15 per call (1,000 calls = $150)
- With caching: ~$0.19 to write the cache on the first call (writes typically bill at 1.25x base), then $0.015 per cache hit (999 hits ≈ $15; total ≈ $15.20)
- Savings: ~90% if calls stay within the TTL window
For workloads with >5 minute gaps between calls, caching won't help — switch to the Batch API for 50% off instead.
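That decision rule can be written down. A sketch, assuming the 5-minute TTL described above; the `'none'` fallback (latency-sensitive but cache-missing traffic, which pays full price) is my addition, not an official recommendation:

```typescript
// Pick a discount strategy from the average gap between calls.
const CACHE_TTL_MS = 5 * 60 * 1000; // the 5-minute ephemeral cache window

type Strategy = 'prompt-caching' | 'batch-api' | 'none';

function chooseStrategy(avgGapMs: number, latencySensitive: boolean): Strategy {
  if (avgGapMs < CACHE_TTL_MS) return 'prompt-caching'; // calls land within TTL
  // Gaps longer than the TTL miss the cache. If you can tolerate async
  // turnaround, the Batch API discount wins; otherwise you pay full price.
  return latencySensitive ? 'none' : 'batch-api';
}
```

Measure `avgGapMs` from your actual traffic before committing; bursty workloads often look cache-friendly on average but miss constantly at the tail.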
Multiplier 4: Verbose Tool Schemas
Tool definitions count as input tokens on every call. A detailed JSON schema with 20 tools, each with rich descriptions and parameter documentation, can add 2,000-4,000 tokens per request.
// This schema is ~400 tokens:
const verboseTool = {
  name: 'search_database',
  description: 'Searches the production PostgreSQL database using parameterized queries. Supports full-text search across the documents table. Returns paginated results with relevance scoring. Use this when the user asks for information that might be stored in our database.',
  input_schema: {
    type: 'object',
    properties: {
      query: {
        type: 'string',
        description: 'The search query string to use for full-text search across the documents table'
      },
      limit: {
        type: 'number',
        description: 'Maximum number of results to return (default 10, max 100)'
      },
      offset: {
        type: 'number',
        description: 'Pagination offset for retrieving subsequent pages of results'
      }
    },
    required: ['query']
  }
};
// This schema is ~80 tokens and Claude handles it fine:
const compactTool = {
  name: 'search_database',
  description: 'Full-text search across documents. Returns paginated results.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string' },
      limit: { type: 'number' },
      offset: { type: 'number' }
    },
    required: ['query']
  }
};
Reduce descriptions to the minimum Claude needs to select the right tool. Verbose documentation belongs in your codebase, not in the token stream.
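To see what your own tool list costs per request, a rough estimate is enough (the chars/4 ratio is a crude assumption; a real tokenizer gives exact counts):

```typescript
// Rough sketch: estimate the input-token overhead of a tool list by
// serializing it. chars/4 is a heuristic, not an exact token count.
function estimateSchemaTokens(tools: object[]): number {
  return Math.ceil(JSON.stringify(tools).length / 4);
}
```

Run it on your full tool array once; if the number is in the thousands, schema trimming is worth an hour of your time.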
Also: don't send tools the agent doesn't need for the current turn. If this turn is purely analytical, strip tool definitions entirely.
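A sketch of per-turn tool stripping. The turn-type routing here is a placeholder; in a real agent you might route on your planner's output instead:

```typescript
// Send only the tools this turn can actually use. Analytical turns get
// no tool definitions at all, so their schema tokens drop to zero.
interface Tool {
  name: string;
  description: string;
  input_schema: object;
}

function toolsForTurn(
  allTools: Tool[],
  turnType: 'analysis' | 'action',
  needed: Set<string>
): Tool[] {
  if (turnType === 'analysis') return []; // purely analytical: zero tool tokens
  return allTools.filter((t) => needed.has(t.name));
}
```

Pass the result as the `tools` parameter on each call rather than a fixed global list.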
Measuring Your Actual Multipliers
import Anthropic from '@anthropic-ai/sdk';

class TokenBudgetTracker {
  private calls = 0;
  private totalInput = 0;
  private totalCacheRead = 0;
  private totalCacheWrite = 0;
  private retries = 0;

  record(usage: Anthropic.Usage, retryCount: number) {
    this.calls++;
    this.totalInput += usage.input_tokens;
    this.totalCacheRead += usage.cache_read_input_tokens ?? 0;
    this.totalCacheWrite += usage.cache_creation_input_tokens ?? 0;
    this.retries += retryCount;
  }

  report() {
    if (this.calls === 0) return; // nothing recorded yet
    const effectiveCacheRate =
      this.totalCacheRead /
      (this.totalInput + this.totalCacheRead + this.totalCacheWrite);
    console.log({
      calls: this.calls,
      avgInputTokens: (this.totalInput / this.calls).toFixed(0),
      cacheHitRate: `${(effectiveCacheRate * 100).toFixed(1)}%`,
      retryRate: `${((this.retries / this.calls) * 100).toFixed(1)}%`,
    });
  }
}
Run this for one hour in production and you'll see exactly which multiplier is doing the most damage.
Quick Wins Ranked by Impact
- Add cache_control to system prompt — 5 minutes, ~50-89% savings on system prompt tokens
- Prune conversation history to last 6 turns — 30 minutes, ~60% reduction on long sessions
- Strip tool schemas when not needed — 1 hour, ~20-40% reduction depending on schema size
- Cap retries at 2 with transient-only logic — 2 hours, eliminates 2-3x retry waste
Don't optimize blindly — measure first with the tracker above, then fix the biggest multiplier.