Token Budgeting: The Engineering Skill Nobody Talks About

Sanjay Singh — Sat, 20 Jun 2026 17:38:41 +0000

1. The Misconception That's Costing You Money

Ask a developer how to reduce their LLM bill and they'll say: "write shorter prompts." Remove adjectives. Trim examples. Cut the system prompt.

This isn't wrong. It's just the lowest-leverage version of the right idea. It optimizes the roughly 7% of your tokens that is the actual user message while ignoring the bulk: conversation history, system prompt, idle tool schemas, and over-retrieved documents.

Token optimization is a context engineering problem. The real questions are:

What is in your context that doesn't need to be there?
Is your context structured so the cache can work?
Is the model you're paying for the right one for this specific task?

Answer those and you'll reduce your bill by 60–80%. Shorten prompts and you'll reduce it by 5%.

2. Measure First: You Can't Optimize What You Can't See

Before touching anything, instrument what you have. Every provider returns token usage in the API response — read it.

// Wrap your API calls to log token usage from day one
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function loggedCompletion(params: Anthropic.MessageCreateParamsNonStreaming) {
  const response = await client.messages.create(params);
  const { input_tokens, output_tokens,
          cache_read_input_tokens, cache_creation_input_tokens } = response.usage;

  console.log({
    inputTokens:    input_tokens,
    outputTokens:   output_tokens,
    cacheHits:      cache_read_input_tokens     ?? 0,  // paid 10% of normal
    cacheWrites:    cache_creation_input_tokens ?? 0,  // paid 125% (first call)
    // Sonnet 4.6 rates per 1M tokens: $3 input, $15 output, $0.30 cache read, $3.75 cache write
    estimatedCost: (
      (input_tokens  * 0.000003) +
      (output_tokens * 0.000015) +
      ((cache_read_input_tokens     ?? 0) * 0.0000003) +
      ((cache_creation_input_tokens ?? 0) * 0.00000375)
    ).toFixed(6),
  });

  return response;
}

Count tokens before you send, so you know what a request will cost before you pay for it:

const count = await client.messages.countTokens({
  model:    'claude-sonnet-4-6',
  system:   SYSTEM_PROMPT,
  tools:    TOOLS,
  messages: conversationHistory,
});
console.log(`This call: ${count.input_tokens} input tokens`);

Run this for 48 hours before optimizing anything. The distribution will tell you exactly which lever to pull first.
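One way to read that log is to roll it up into a couple of ratios. The sketch below assumes each logged record carries the four fields printed by the wrapper above, and that uncached input, cache reads, and cache writes are reported as separate counts. `UsageRecord` and `summarizeUsage` are illustrative names, not part of any SDK.

```typescript
// Hypothetical shape of one logged call; fields mirror the wrapper's log output.
interface UsageRecord {
  inputTokens: number;  // uncached input tokens
  outputTokens: number; // generated tokens
  cacheHits: number;    // input tokens read from cache
  cacheWrites: number;  // input tokens written to cache
}

interface UsageSummary {
  totalInput: number;    // all input-side tokens, cached or not
  totalOutput: number;
  cacheHitRatio: number; // share of input-side tokens served from cache
  outputShare: number;   // share of all tokens that were generated output
}

function summarizeUsage(records: UsageRecord[]): UsageSummary {
  let uncached = 0;
  let hits = 0;
  let writes = 0;
  let output = 0;
  for (const r of records) {
    uncached += r.inputTokens;
    hits += r.cacheHits;
    writes += r.cacheWrites;
    output += r.outputTokens;
  }
  const totalInput = uncached + hits + writes;
  const total = totalInput + output;
  return {
    totalInput,
    totalOutput: output,
    cacheHitRatio: totalInput === 0 ? 0 : hits / totalInput,
    outputShare: total === 0 ? 0 : output / total,
  };
}
```

A cache hit ratio near zero on an app with a large static system prompt points at caching first; a high output share points at max_tokens and response format.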

3. The 2026 Pricing Landscape

Before optimizing, know what you're paying. Two facts dominate the table:

Model	Input /1M	Output /1M	Cached Input	Context
Claude Opus 4.8	$5.00	$25.00	$0.50	1M
Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M
Claude Haiku 4.5	$0.80	$4.00	$0.08	200K
GPT-5.5	$5.00	$30.00	$0.50	1M
GPT-4.1	$2.00	$8.00	$1.00	1M
GPT-4.1 Nano	$0.10	$0.40	$0.05	1M
DeepSeek V4 Flash	$0.14	$0.28	$0.003	1M

Fact 1: Output costs 4–6× more than input for every model in the table except DeepSeek V4 Flash (2×). A verbose response is far more expensive than a verbose prompt. Controlling output length matters more than controlling input length.

Fact 2: The model spread is 50× on input and 89× on output. GPT-4.1 Nano costs $0.10/1M input vs $5.00/1M for Claude Opus 4.8; DeepSeek V4 Flash costs $0.28/1M output vs $25.00/1M for Opus 4.8. Every routine task sent to a frontier model burns 10–50× its necessary cost.
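To see what these prices mean for a single request, a small calculator helps. This is a sketch using a subset of the table above; `PRICE_TABLE` and `requestCost` are illustrative names, and caching and batch discounts are deliberately ignored.

```typescript
// Prices in USD per 1M tokens, copied from the table above.
const PRICE_TABLE: Record<string, { input: number; output: number }> = {
  'Claude Opus 4.8':   { input: 5.00, output: 25.00 },
  'Claude Sonnet 4.6': { input: 3.00, output: 15.00 },
  'Claude Haiku 4.5':  { input: 0.80, output: 4.00 },
  'GPT-4.1 Nano':      { input: 0.10, output: 0.40 },
};

// Cost in USD of one request, ignoring caching and batch discounts.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICE_TABLE[model];
  if (!price) throw new Error(`No price listed for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}
```

For a request with 2,000 input tokens and 500 output tokens, `requestCost('Claude Opus 4.8', 2000, 500)` comes to $0.0225 and `requestCost('Claude Haiku 4.5', 2000, 500)` to $0.0036, a 6.25× difference for the same traffic.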

xychart-beta
    title "Output Token Price per 1M (USD) — June 2026"
    x-axis ["DeepSeek V4 Flash", "GPT-4.1 Nano", "Haiku 4.5", "GPT-4.1", "Sonnet 4.6", "Opus 4.8", "GPT-5.5"]
    y-axis "$/1M output tokens" 0 --> 32
    bar [0.28, 0.40, 4.00, 8.00, 15.00, 25.00, 30.00]

Discounts that stack with any of the above: Anthropic 90% off cached input, OpenAI 50% off cached input (applied automatically), Batch API 50% off async requests at both providers.

4. Where Your Tokens Actually Go

Most teams optimize the wrong things because they haven't diagnosed where their tokens land.

pie title "Typical Token Distribution — Multi-Turn Agent Session"
    "Conversation history (replayed each turn)" : 42
    "System prompt (replayed each turn)" : 18
    "Tool schemas (loaded but often unused)" : 15
    "RAG retrieved context" : 14
    "Actual user message" : 7
    "Model output" : 4

The user message, the thing most developers try to shorten, is 7% of the total. The big cost drivers are conversation history and system prompt, both of which replay in full on every single turn.

The quadratic growth problem: Every new message replays the entire prior conversation from scratch. A session with n equal-length turns doesn't cost n turns — it costs n(n+1)/2 turn-equivalents.

xychart-beta
    title "Cumulative Input Tokens — 200-Token Average Turn"
    x-axis ["Turn 5", "Turn 10", "Turn 20", "Turn 30", "Turn 50"]
    y-axis "Cumulative tokens (thousands)" 0 --> 260
    bar [3, 11, 42, 93, 255]

A 50-turn conversation pays for 255,000 input tokens from something that generated roughly 10,000 tokens of actual content. This compounds silently and is the dominant cost driver for any application with extended conversations.
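The n(n+1)/2 growth is easy to check numerically. The sketch below assumes every turn adds the same number of tokens and that the full history is resent each time; both function names are illustrative.

```typescript
// Closed form: cumulative input tokens after `turns` equal-length turns.
function cumulativeInputTokens(turns: number, tokensPerTurn: number): number {
  return (tokensPerTurn * turns * (turns + 1)) / 2;
}

// Same figure by brute force, to make the replay explicit.
function cumulativeByReplay(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += turn * tokensPerTurn; // turn k resends k turns' worth of tokens
  }
  return total;
}
```

With 200-token turns, `cumulativeInputTokens(50, 200)` returns 255000, matching the last bar of the chart, while the conversation itself only contains 50 × 200 = 10,000 tokens.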

5. Lever 1: Prompt Caching — 90% Off Static Content

Prompt caching is the highest single-impact optimization available if your app has a consistent system prompt, tool definitions, or reference documents. At Anthropic, cached tokens cost 10% of the normal rate. At OpenAI, 50% off automatically.

The mechanism: Claude computes a KV cache of your prompt prefix. On the next request, if the prefix is identical, it reuses the cached state instead of reprocessing it.

const response = await client.messages.create({
  model:      'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type:          'text',
      text:          SYSTEM_PROMPT,          // 3,000+ tokens of static context
      cache_control: { type: 'ephemeral' },  // ← mark for caching
    },
  ],
  messages: conversationHistory,
});

// Inspect the result — is caching actually working?
console.log({
  cacheHits:   response.usage.cache_read_input_tokens,    // paid 10%
  cacheWrites: response.usage.cache_creation_input_tokens, // paid 125% (once)
});

The rules you must know before caching anything:

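The prompt prefix up to a breakpoint must reach a minimum length to qualify: 1,024 tokens on most models, higher on some, so check the documentation for the model you use. Shorter prefixes are processed without caching and without an error.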
Default TTL is 5 minutes. Requests spaced further apart miss the cache — use the 1-hour extension for slower workflows.
Up to 4 cache breakpoints per request. Use them on your four largest static blocks.
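Cache writes cost 25% more on the first call. The premium is repaid on the first cache hit: one write plus one read costs 1.35× the base price, against 2× for two uncached requests. Every TTL expiry triggers a new write, so sparse traffic breaks even later.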
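The write premium and read discount can be turned into a quick breakeven check. This is a sketch under the 1.25× and 0.10× multipliers quoted above, and it assumes every request after the first hits the cache (no TTL expiry); `prefixCost` is an illustrative name.

```typescript
// Multipliers relative to the base input price, as quoted above.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.10;

// Relative cost of sending the same static prefix `requests` times,
// measured in units of one uncached pass over that prefix.
function prefixCost(requests: number, cached: boolean): number {
  if (requests <= 0) return 0;
  if (!cached) return requests;
  // The first request writes the cache; the rest are assumed to hit it.
  return CACHE_WRITE_MULTIPLIER + (requests - 1) * CACHE_READ_MULTIPLIER;
}
```

Under these assumptions a single request costs 1.25 units cached against 1.00 uncached, and two requests cost 1.35 against 2.00. Each expired TTL adds another 1.25-unit write, which is what delays the payoff for traffic spaced more than a few minutes apart.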

Cache structure: the ordering that makes or breaks hits

Caching is prefix-based. A single dynamic element placed before your static content breaks every cache hit. The correct order — every time:

// ✅ CORRECT — static content first, dynamic last
await client.messages.create({
  model:      'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    { type: 'text', text: SYSTEM_PROMPT,   cache_control: { type: 'ephemeral' } },
    { type: 'text', text: TOOL_DOCS,       cache_control: { type: 'ephemeral' } },
    { type: 'text', text: REFERENCE_DOCS,  cache_control: { type: 'ephemeral' } },
  ],
  messages: [
    ...conversationHistory,                // dynamic — grows per turn
    { role: 'user', content: userMessage }, // always last
  ],
});

// ❌ WRONG — timestamp in the system block breaks every cache hit
system: [{
  type: 'text',
  text: `Current time: ${new Date().toISOString()}\n${SYSTEM_PROMPT}`,
  //     ↑ changes every second — cache never activates
}]

// ✅ Put dynamic values in the user message instead
messages: [{ role: 'user', content: `[${new Date().toISOString()}] ${userMessage}` }]

For OpenAI, caching is automatic — keep the prompt prefix identical across calls and the 50% discount applies with no code changes.
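A cheap guard against accidental prefix drift is to fingerprint the static blocks on each call and log when the fingerprint changes. The sketch below is one way to do that; `fingerprint` and `PrefixWatcher` are hypothetical helpers, not part of any SDK, and the hash is a simple FNV-1a-style function chosen only to avoid dependencies.

```typescript
// Small non-cryptographic hash (FNV-1a style), used only to spot changes.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Remembers the last fingerprint and reports whether the static prefix changed.
class PrefixWatcher {
  private last: string | null = null;

  // Returns true when the static blocks differ from the previous call.
  changed(staticBlocks: string[]): boolean {
    const current = fingerprint(staticBlocks.join('\u0000'));
    const didChange = this.last !== null && this.last !== current;
    this.last = current;
    return didChange;
  }
}
```

Calling `watcher.changed([SYSTEM_PROMPT, TOOL_DOCS, REFERENCE_DOCS])` before each request and logging a warning on `true` surfaces a stray timestamp or session ID long before it shows up as a billing surprise.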

6. Lever 2: Context Pruning and Compaction

Caching addresses static content. Pruning addresses the quadratic growth of conversation history.

Summarize old turns into a cached block:

async function prepareContext(messages: Message[], keepRecentTurns = 6) {
  if (messages.length <= keepRecentTurns * 2) {
    return { summary: null, messages };
  }

  const older  = messages.slice(0, -(keepRecentTurns * 2));
  const recent = messages.slice(-(keepRecentTurns * 2));

  // Use a cheap model for summarization — this is simple work
  const res = await client.messages.create({
    model:      'claude-haiku-4-5',
    max_tokens: 300,
    messages: [{
      role:    'user',
      content: `Summarize this conversation in under 200 words.
Include: user's goal, decisions made, current status. Exclude: detailed reasoning.
${older.map(m => `${m.role}: ${m.content}`).join('\n')}`,
    }],
  });

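  // Content blocks are a union type; only text blocks carry `.text`
  const first = res.content[0];
  return { summary: first?.type === 'text' ? first.text : null, messages: recent };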
}

// Use the summary as a second cached block
const { summary, messages } = await prepareContext(history);

await client.messages.create({
  model: 'claude-sonnet-4-6',
  system: [
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    ...(summary ? [{ type: 'text' as const, text: `## Prior context\n${summary}`,
                     cache_control: { type: 'ephemeral' as const } }] : []),
  ],
  messages,
  max_tokens: 1024,
});

The Haiku summarization call costs a fraction of a cent. The savings from not replaying 30 turns of history on every subsequent call are substantial.

Warning: Over-pruning context degrades answer quality and triggers retries that cost more than the context you saved. The goal is the smallest sufficient context, not the smallest possible one.
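One detail worth handling when cutting history is where the cut lands. Slicing a fixed number of messages off the end can leave the kept portion starting on an assistant reply, which some chat APIs reject and which reads oddly without the question that prompted it. The sketch below moves the cut back to the nearest user message; `ChatMessage` and `splitHistory` are illustrative names and assume plain string content.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Splits history into an older part (to summarize) and a recent part (to send verbatim).
// The cut moves earlier when needed so that `recent` starts with a user message.
function splitHistory(
  messages: ChatMessage[],
  keepRecentTurns = 6,
): { older: ChatMessage[]; recent: ChatMessage[] } {
  const keep = keepRecentTurns * 2;
  if (messages.length <= keep) {
    return { older: [], recent: messages };
  }
  let cut = messages.length - keep;
  while (cut > 0 && messages[cut]?.role !== 'user') {
    cut--;
  }
  return { older: messages.slice(0, cut), recent: messages.slice(cut) };
}
```

The `recent` slice can be slightly longer than `keepRecentTurns * 2` as a result, which errs on the side of keeping context rather than dropping it.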

7. Lever 3: Model Routing — Pay for Reasoning Only When You Need It

This is the highest-leverage structural change most teams never make. The 89× price spread between cheap and frontier models means every routine task sent to Opus burns unnecessary budget.

// A cheap classifier routes tasks to the right model tier
async function classifyTask(message: string): Promise<'simple' | 'medium' | 'complex'> {
  const res = await client.messages.create({
    model:      'claude-haiku-4-5',  // $0.80/1M — the classifier itself is cheap
    max_tokens: 5,
    messages: [{
      role:    'user',
      content: `Reply with ONE word: simple, medium, or complex.
simple = extraction, classification, yes/no, short lookups
medium = summarization, structured analysis, multi-step predictable tasks
complex = deep reasoning, architecture decisions, research synthesis
Task: "${message}"`,
    }],
  });

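  // Content blocks are a union type; only text blocks carry `.text`
  const first  = res.content[0];
  const result = first?.type === 'text' ? first.text.trim().toLowerCase() : '';
  // Fall back to 'medium' when the classifier returns anything unexpected
  if (result === 'simple' || result === 'medium' || result === 'complex') return result;
  return 'medium';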
}

const MODEL_MAP = {
  simple:  'claude-haiku-4-5',   // $4.00/1M output
  medium:  'claude-sonnet-4-6',  // $15.00/1M output
  complex: 'claude-opus-4-8',    // $25.00/1M output
};

const MAX_TOKENS_MAP = { simple: 256, medium: 1024, complex: 4096 };

export async function routedCompletion(message: string, context: ConversationContext) {
  const complexity = await classifyTask(message);
  return client.messages.create({
    model:      MODEL_MAP[complexity],
    max_tokens: MAX_TOKENS_MAP[complexity],
    system:     context.system,
    messages:   [...context.history, { role: 'user', content: message }],
  });
}

Real-world distribution for a typical support application: ~60% simple, ~30% medium, ~10% complex. Routing that 60% to Haiku instead of Sonnet cuts the cost of that portion by about 73% ($0.80 vs $3.00 per 1M input, $4.00 vs $15.00 per 1M output), which is roughly 44% off an all-Sonnet bill if requests are similar in size across tiers, with no quality degradation on routine tasks.
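The effect of a traffic split on the average price can be estimated before building the router. The sketch below uses the input prices from the pricing table and the 60/30/10 split above; it assumes requests are the same size in every tier and ignores the cost of the classifier call. `blendedPrice` is an illustrative name.

```typescript
type Tier = 'simple' | 'medium' | 'complex';

// Input prices in USD per 1M tokens for the model each tier routes to.
const TIER_INPUT_PRICE: Record<Tier, number> = {
  simple: 0.80,  // Haiku 4.5
  medium: 3.00,  // Sonnet 4.6
  complex: 5.00, // Opus 4.8
};

// Share of traffic per tier, from the support-application example.
const TIER_SHARE: Record<Tier, number> = { simple: 0.60, medium: 0.30, complex: 0.10 };

// Traffic-weighted average price per 1M input tokens.
function blendedPrice(share: Record<Tier, number>, price: Record<Tier, number>): number {
  const tiers: Tier[] = ['simple', 'medium', 'complex'];
  return tiers.reduce((sum, tier) => sum + share[tier] * price[tier], 0);
}

const routed = blendedPrice(TIER_SHARE, TIER_INPUT_PRICE);
const savingVsAllSonnet = 1 - routed / TIER_INPUT_PRICE.medium;
```

Under these assumptions the blended price is $1.88 per 1M input tokens against $3.00 for all-Sonnet, about 37% lower. The Opus tier gives back part of what the Haiku tier saves, so the net figure depends heavily on how much traffic really needs the top model.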

8. Lever 4: Batch API — 50% Off Everything Async

Not every LLM request needs to be real-time. Both Anthropic and OpenAI offer batch APIs at 50% off with identical quality. If your users aren't waiting for the response, use it.

Good candidates: content classification, bulk summarization, document extraction, data enrichment, A/B prompt testing, nightly report generation.

// Anthropic Message Batches — 50% off, processed within 24 hours
const batch = await client.beta.messages.batches.create({
  requests: items.map(item => ({
    custom_id: item.id,
    params: {
      model:      'claude-sonnet-4-6',
      max_tokens: 512,
      system: [{ type: 'text', text: SYSTEM_PROMPT,
                 cache_control: { type: 'ephemeral' } }],
      messages: [{ role: 'user', content: item.content }],
    },
  })),
});

// Results exist only once processing has ended: poll the batch until
// processing_status === 'ended', then stream the results
for await (const result of await client.beta.messages.batches.results(batch.id)) {
  if (result.result.type === 'succeeded') {
    await saveResult(result.custom_id, result.result.message.content[0]);
  }
}

Stack caching with batching: a cached batch request on Sonnet 4.6 drops from $3.00/1M to $0.30/1M (cache) × 0.50 (batch) = $0.15/1M — a 95% reduction on that input.
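The $0.15/1M figure assumes every batched request hits the cache. A sketch of the same arithmetic with the hit rate as a parameter is below; the hit rate is an assumption you would measure from `cache_read_input_tokens`, the cache-write premium is ignored for simplicity, and `expectedBatchInputPrice` is an illustrative name.

```typescript
const CACHED_READ_MULTIPLIER = 0.10; // cached input, per the text above
const BATCH_MULTIPLIER = 0.50;       // batch discount, per the text above

// Expected input price per 1M tokens for batched requests when only a
// fraction `cacheHitRate` (0 to 1) of input tokens is served from cache.
function expectedBatchInputPrice(basePerMillion: number, cacheHitRate: number): number {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError('cacheHitRate must be between 0 and 1');
  }
  const inputMultiplier = cacheHitRate * CACHED_READ_MULTIPLIER + (1 - cacheHitRate);
  return basePerMillion * BATCH_MULTIPLIER * inputMultiplier;
}
```

For Sonnet 4.6 at $3.00/1M, a 100% hit rate gives $0.15, a 50% hit rate gives about $0.83, and no hits at all still gives $1.50 from the batch discount alone.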

9. Lever 5: Surgical RAG — Stop Dumping Documents

RAG pipelines are the most common source of unnecessary context bloat. Teams retrieve 10,000+ tokens by default because the context window can hold it — not because it improves results.

The research is clear: context window size does not drive quality. Placement and precision do. More context, placed poorly, actively hurts performance via the "lost in the middle" effect — models perform worse when critical information is buried mid-context.

async function surgicalRag(query: string, tokenBudget = 2000) {
  // Retrieve more candidates than you'll use
  const candidates = await vectorDB.search(query, { limit: 20 });

  // Filter by relevance score — weak matches add noise
  const relevant = candidates.filter(c => c.score >= 0.75);

  // Fill up to budget, best matches first
  let used = 0;
  const selected = [];
  for (const chunk of relevant) {
    if (used + chunk.tokenCount > tokenBudget) break;
    selected.push(chunk);
    used += chunk.tokenCount;
  }

  return selected
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.content}`)
    .join('\n\n---\n\n');
}
// Result: ~1,500–2,000 tokens vs 10,000+ from naïve retrieval
// Same or better quality. Significantly lower cost.

The savings from surgical RAG are substantial — limiting retrieval to 2–3 focused chunks instead of 8–10 full documents can cut input tokens by more than 50%. Cache stable reference content (product docs, FAQs, policies) using prompt caching instead of re-retrieving it on every turn.
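Since placement matters as well as volume, one common mitigation for the "lost in the middle" effect is to reorder the selected chunks so the strongest matches sit at the start and end of the context and the weakest sit in the middle. The sketch below shows that reordering on a best-first list. It is an optional extra step, not something the pipeline above is stated to use, and `reorderForEdges` is an illustrative name.

```typescript
// Takes items ranked best-first and returns them with the strongest at the
// two ends and the weakest in the middle.
function reorderForEdges<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((item, index) => {
    // Even ranks fill the front in order; odd ranks fill the back.
    if (index % 2 === 0) front.push(item);
    else back.push(item);
  });
  // Reversing the back half puts the second-best item in the last position.
  return front.concat(back.reverse());
}
```

Applied to ranks `[1, 2, 3, 4, 5]` this yields `[1, 3, 5, 4, 2]`: ranks 1 and 2 land at the two edges and ranks 4 and 5 in the middle. Whether it helps is an empirical question for your own evals.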

10. Anti-Patterns

1. Optimizing prompt length before diagnosing cost drivers

Spending hours trimming 200 tokens from a system prompt when 20,000 tokens of conversation history are growing quadratically. Instrument first, optimize second.

2. max_tokens: 4096 as a default everywhere

Output tokens cost 4–6× more than input on most models. Set max_tokens deliberately:

Task	Appropriate max_tokens
Classification, yes/no	10
Factual lookup	256
Summary or analysis	512–1024
Document generation	2048–4096

3. Dynamic content in static blocks

A timestamp, session ID, or A/B flag placed inside the system prompt block resets the cache prefix on every call. Every cached token becomes uncached. Move dynamic values to the user message.

4. Routing everything to the frontier model "to be safe"

Safe is not the same as correct. Haiku handles 60% of real-world support tasks correctly. Routing those to Opus or GPT-5.5 burns roughly 6–8× the necessary budget (compare their prices with Haiku 4.5 in the pricing table) with no quality gain on routine work.

11. Case Study: $2,847 to $849

A team running a customer support AI agent across 1,200 daily conversations hit a $2,847 monthly bill after three months of growth. After applying the fixes in this guide:

pie title "Source of 70% Cost Reduction"
    "Model routing (Haiku for simple tasks)" : 41
    "Prompt caching (fixing the cache miss)" : 33
    "Context pruning (conversation history)" : 18
    "Output token budgeting (max_tokens)" : 8

Month 3: $2,847 → Month 6: $849. A 70% reduction while handling 18% more conversations per day. The largest single saving was model routing — not prompt caching — because the team had been sending every request including simple intent classification to Sonnet 4.6.

The full story with month-by-month numbers is in How I Cut My AI API Bill by 70%.

12. FAQ

Q: Should I optimize input tokens or output tokens first? Output tokens. They cost 4–6× more on most models. Set max_tokens deliberately per task type, use structured output (JSON schema) to prevent verbose prose when you need data, and ask the model for concise responses where appropriate. This is the fastest single change to make.

Q: How do I know if prompt caching is actually working? Check cache_read_input_tokens in the API response. If it's 0 on every call, caching is not activating. Common reasons: content below the 1,024-token minimum, cache_control markers missing or misplaced, or dynamic content in the static block breaking the prefix.

Q: Is model routing worth the engineering effort for a small app? Yes, if you're running more than ~1,000 requests per day. The classifier call costs a fraction of a cent using Haiku and pays for itself immediately. Breakeven is typically within the first day of deployment.

Q: How does the Batch API affect latency — and when is it acceptable? Batch requests complete within 24 hours, typically much faster for small batches. It rules out any real-time user-facing flow. It's ideal for anything that runs in the background: classification pipelines, nightly report generation, bulk data extraction, A/B prompt testing. If your users aren't waiting for the response, use the Batch API.

13. Further Reading

Tools

🛠️ CostGoat — Real-time LLM pricing comparison
🛠️ OpenRouter — Multi-provider routing + spend dashboard
🛠️ LangSmith — Token usage tracing per node in agent graphs

Measure before you optimize. Run client.messages.countTokens() on your most frequent request type before changing anything. The distribution will tell you exactly which lever to pull first.

Originally published on ZyVOP

DEV Community: Sanjay Singh