How I Cut My LLM API Costs by 90%: A Battle-Tested Token Optimization Guide
Last Updated: May 2026
Audience: Backend engineers, ML engineers, and product developers building LLM-powered applications
Introduction
If you're building production LLM applications, you've probably watched your token costs spiral out of control faster than expected. A moderately successful chatbot can easily burn through $10,000/month, and a high-traffic API integration can hit six figures.
This guide covers battle-tested strategies for reducing token consumption without sacrificing output quality. These techniques are based on industry best practices and can be applied to any LLM provider.
What you'll learn:
- How to reduce input token costs by 60-90% with prompt caching
- When to use batch processing for 50% cost savings
- How to choose the right model for each task
- Architecture patterns that scale token efficiency
1. Understanding the Cost Structure
Before optimizing, you need to understand where your money goes.
Current Pricing (May 2026)
International Providers:
| Provider | Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|---|
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | 1M tokens |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M tokens |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K tokens |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K tokens |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K tokens |
| Google | Gemini 1.5 Pro | $3.50 | $10.50 | 2M tokens |
Chinese Providers:
| Provider | Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|---|
| DeepSeek | DeepSeek V4 Pro | $0.14 | $0.28 | 128K tokens |
| DeepSeek | DeepSeek V4 Flash | $0.07 | $0.14 | 128K tokens |
| Zhipu AI | GLM 5.1 | $0.14 | $0.28 | 128K tokens |
| Zhipu AI | GLM 5V Turbo | $0.14 | $0.28 | 128K tokens |
Key insight: Output tokens cost 2-5x as much as input tokens in the tables above (5x for Anthropic, 4x for OpenAI, 3x for Gemini, 2x for DeepSeek and Zhipu AI). Optimizing output length often has the highest ROI.
Cost Formula
Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
For a typical chatbot conversation (2000 input tokens, 500 output tokens) using DeepSeek V4 Pro:
- Input cost: 0.002 × $0.14 = $0.00028
- Output cost: 0.0005 × $0.28 = $0.00014
- Total: $0.00042 per conversation
At 10,000 conversations/day, that's $4.20/day or ~$126/month.
Cost comparison: The same conversation using Claude Sonnet 4.6 would cost $0.0135 (0.002 × $3.00 + 0.0005 × $15.00), about 32x the DeepSeek V4 Pro cost.
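The formula is easy to wrap in a helper for quick what-if comparisons. This is a minimal sketch: the function name `conversation_cost` is illustrative, and the prices are the per-MTok figures from the tables above.

```python
def conversation_cost(input_tokens: int, output_tokens: int,
                      input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The worked example from the text: 2,000 input and 500 output tokens.
deepseek = conversation_cost(2000, 500, input_price=0.14, output_price=0.28)
sonnet = conversation_cost(2000, 500, input_price=3.00, output_price=15.00)

print(f"DeepSeek V4 Pro:   ${deepseek:.5f} per conversation")
print(f"Claude Sonnet 4.6: ${sonnet:.5f} per conversation")
print(f"Ratio: {sonnet / deepseek:.0f}x")
```

Multiplying the per-conversation figure by daily volume gives the daily and monthly numbers quoted above.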
2. Prompt Caching: The Highest-Impact Optimization
Prompt caching is the single most effective cost reduction technique available today. Both Anthropic and OpenAI now support it natively.
How It Works
When you send a request with prompt caching enabled, the provider caches the prefix of your prompt. Subsequent requests with the same prefix reuse the cached version, dramatically reducing both cost and latency.
Anthropic Prompt Caching
Anthropic's implementation offers 90% savings on cached input tokens:
| Operation | Price (Claude Sonnet 4.6) |
|---|---|
| Base input | $3.00 / MTok |
| Cache write (5min TTL) | $3.75 / MTok (1.25x) |
| Cache write (1h TTL) | $6.00 / MTok (2x) |
| Cache read | $0.30 / MTok (0.1x) |
Implementation:
import anthropic

client = anthropic.Anthropic()

# System prompt with cache control
system_prompt = [
    {
        "type": "text",
        "text": """You are an expert code reviewer. Follow these guidelines:
- Focus on security vulnerabilities
- Check for performance issues
- Verify error handling
- Suggest improvements with code examples
[Your full system prompt here...]""",
        "cache_control": {"type": "ephemeral"}
    }
]

# First request: cache write
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Review this Python function..."}]
)

# Subsequent requests: cache read (90% cheaper)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,  # Same system prompt = cache hit
    messages=[{"role": "user", "content": "Review this other function..."}]
)
Real-world savings: A code review tool processing 1,000 requests/day with a 2,000-token system prompt:
- Without caching: $6.00/day for system prompt tokens
- With caching: $0.60/day for system prompt tokens
- Savings: $5.40/day ($162/month)
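The figures above count cache reads only. A slightly fuller estimate also charges the requests that miss the cache at the cache-write price. The sketch below assumes the 5-minute TTL multipliers from the pricing table (1.25x write, 0.1x read); the function names and the `hit_rate` parameter are illustrative.

```python
def uncached_cost(requests: int, prompt_tokens: int, base_price: float) -> float:
    """Daily cost in dollars of resending the prompt at the base input price."""
    return requests * prompt_tokens / 1_000_000 * base_price


def cached_cost(requests: int, prompt_tokens: int, base_price: float,
                hit_rate: float,
                write_multiplier: float = 1.25,
                read_multiplier: float = 0.10) -> float:
    """Daily cost when hit_rate of requests read the cache and the rest write it."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    blended = hit_rate * read_multiplier + (1 - hit_rate) * write_multiplier
    return uncached_cost(requests, prompt_tokens, base_price) * blended


# 1,000 requests/day, 2,000-token system prompt, Claude Sonnet 4.6 input price.
baseline = uncached_cost(1000, 2000, 3.00)
all_hits = cached_cost(1000, 2000, 3.00, hit_rate=1.0)
mostly_hits = cached_cost(1000, 2000, 3.00, hit_rate=0.95)

print(f"No caching:    ${baseline:.2f}/day")
print(f"100% hit rate: ${all_hits:.2f}/day")
print(f"95% hit rate:  ${mostly_hits:.3f}/day")
```

Under these multipliers, caching breaks even at a hit rate of about 22% (0.25 / 1.15); below that, the write surcharge outweighs the read discount.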
Automatic Caching (New in 2026)
Anthropic now supports automatic caching for multi-turn conversations:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Auto-cache last block
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I help?"},
        {"role": "user", "content": "What's the weather?"}
    ]
)
The cache point automatically moves forward as conversations grow. No manual breakpoint management needed.
OpenAI Prompt Caching
OpenAI automatically caches prompts longer than 1,024 tokens (for most models). Cached input tokens are billed at 50% of the standard rate.
from openai import OpenAI

client = OpenAI()

# OpenAI automatically caches long prompts
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Your long system prompt here..."},
        {"role": "user", "content": "Your query here"}
    ]
)
No code changes required—caching happens automatically.
Best Practices for Prompt Caching
Place static content first: System prompts, tool definitions, and context should come before dynamic content.
Use explicit breakpoints strategically: For multi-section prompts, place cache_control on sections that change at different frequencies.
Pre-warm caches: Send a "warmup" request before users arrive to eliminate first-request latency.
# Pre-warm cache before users arrive
prewarm = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,  # Keep output minimal; the API requires a positive max_tokens
    system=[
        {
            "type": "text",
            "text": "Your system prompt...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "warmup"}]
)
Monitor cache hit rates: Track cache_read_input_tokens and cache_creation_input_tokens in API responses.
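To turn those usage fields into a single number, something like the following can work. It is a sketch under one assumption worth checking against your provider's docs: that `input_tokens` in the usage object counts only the uncached part of the prompt, so the three fields add up to the full prompt size. The helper name and the plain-dict input are illustrative.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of all prompt tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# One cache write followed by two cache reads of a 2,000-token prefix.
sample_usages = [
    {"input_tokens": 50, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
    {"input_tokens": 60, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"input_tokens": 40, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
]
rate = cache_hit_rate(sample_usages)
print(f"Cache hit rate: {rate:.1%}")
```

A rate that stays low usually means the prefix is changing between requests, for example a timestamp or user ID placed before the static content.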
3. Model Selection: Right-Size for the Task
Not every task needs the most expensive model. Implement a routing system that matches tasks to appropriate models.
Task-Based Routing
import anthropic

def route_task(task_type: str, complexity: int) -> str:
    """Route tasks to appropriate models based on type and complexity."""
    routing_table = {
        # Simple tasks: use cheapest model
        "classification": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "summarization": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "translation": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        # Complex tasks: use capable model
        "code_generation": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "reasoning": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "analysis": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
    }
    complexity_level = "high" if complexity > 7 else "low"
    # Unknown task types fall back to Sonnet
    return routing_table.get(task_type, {}).get(complexity_level, "claude-sonnet-4-6")

def call_llm(prompt: str, task_type: str, complexity: int):
    """Call LLM with appropriate model based on task."""
    client = anthropic.Anthropic()
    model = route_task(task_type, complexity)
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Cost Comparison
For a typical application with mixed tasks:
| Task Distribution | Fixed (Sonnet) | Routed (Mixed) | Savings |
|---|---|---|---|
| 60% simple tasks | $3.00/MTok | $1.00/MTok | 67% |
| 30% medium tasks | $3.00/MTok | $3.00/MTok | 0% |
| 10% complex tasks | $3.00/MTok | $5.00/MTok | -67% |
| Weighted average | $3.00/MTok | $2.00/MTok | 33% |
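The weighted average is worth computing rather than eyeballing, since the complex-task tier pulls the blend back up. A minimal sketch using the input prices and traffic shares from the table; `blended_price` is an illustrative name.

```python
def blended_price(mix: list[tuple[float, float]]) -> float:
    """Weighted average price; mix is (traffic_share, price_per_mtok) pairs."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in mix)


# 60% simple on Haiku, 30% medium on Sonnet, 10% complex on Opus (input prices).
routed = blended_price([(0.60, 1.00), (0.30, 3.00), (0.10, 5.00)])
fixed = 3.00  # Everything on Sonnet
savings = 1 - routed / fixed

print(f"Routed: ${routed:.2f}/MTok, fixed: ${fixed:.2f}/MTok, savings: {savings:.0%}")
```

This covers input prices only. The same helper can be run on output prices, and the result for a real workload depends on the input/output token split.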
4. Batch Processing: 50% Savings for Async Workloads
For tasks that don't require immediate responses, batch processing offers 50% cost savings.
Anthropic Message Batches API
import time

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

# Create batch
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"review-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"Review code snippet {i}..."}]
            )
        )
        for i in range(100)  # 100 requests in one batch
    ]
)

# Poll for results
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)

# Process results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
Batch Pricing (50% Discount)
| Model | Standard Input | Batch Input | Standard Output | Batch Output |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $2.50 | $25.00 | $12.50 |
| Claude Sonnet 4.6 | $3.00 | $1.50 | $15.00 | $7.50 |
| Claude Haiku 4.5 | $1.00 | $0.50 | $5.00 | $2.50 |
When to use batch processing:
- Large-scale evaluations
- Content moderation
- Data analysis
- Bulk content generation
- Code review pipelines
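Combining batch + caching: You can stack batch processing with prompt caching. The discounts multiply rather than add: a cache read inside a batch costs 0.5 × 0.1 = 5% of the base input price, which is where the "up to 95% savings" on input tokens comes from. Cache hits inside a batch are best-effort, so treat 95% as an upper bound.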
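As a quick check on how the two discounts combine, here is a small sketch. It assumes they multiply, which is what the 95% figure implies; the function name and defaults are illustrative.

```python
def effective_input_multiplier(batch: bool, cache_read: bool,
                               batch_discount: float = 0.50,
                               cache_read_multiplier: float = 0.10) -> float:
    """Fraction of the base input price actually paid per token."""
    multiplier = 1.0
    if batch:
        multiplier *= 1 - batch_discount
    if cache_read:
        multiplier *= cache_read_multiplier
    return multiplier


base = 3.00  # Claude Sonnet 4.6 input, dollars per MTok
for batch, cached in [(False, False), (True, False), (False, True), (True, True)]:
    m = effective_input_multiplier(batch, cached)
    print(f"batch={batch!s:5} cached={cached!s:5} -> ${base * m:.2f}/MTok ({1 - m:.0%} off)")
```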
5. Output Optimization
Since output tokens cost several times as much as input tokens (2-5x in the pricing tables above), optimizing output length has high ROI.
Limit Output Length
# Explicit token limit
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,  # Limit output
    messages=[{"role": "user", "content": "Summarize this article in 3 bullet points."}]
)

# Prompt-based length control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain quantum computing. Keep it under 100 words."
    }]
)
Structured Output
Request structured output to reduce verbose explanations:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": """Analyze this code. Return JSON:
{
"issues": ["issue1", "issue2"],
"severity": "high|medium|low",
"suggestions": ["suggestion1", "suggestion2"]
}"""
    }]
)
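Asking for JSON in the prompt does not guarantee a clean JSON reply, so it helps to parse defensively. The sketch below assumes the three keys from the prompt above; `parse_review` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

REQUIRED_KEYS = {"issues", "severity", "suggestions"}
VALID_SEVERITIES = {"high", "medium", "low"}


def parse_review(text: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    # Tolerate a sentence before or after the object by slicing brace to brace.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(text[start:end + 1])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["severity"] not in VALID_SEVERITIES:
        raise ValueError(f"unexpected severity: {data['severity']!r}")
    return data


sample_reply = (
    "Here is the analysis:\n"
    '{"issues": ["SQL injection"], "severity": "high", '
    '"suggestions": ["Use parameterized queries"]}'
)
review = parse_review(sample_reply)
print(review["severity"], review["issues"])
```

On a `ValueError`, retrying once with a shorter "return only JSON" reminder is usually cheaper than accepting a verbose free-text answer.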
Streaming
Streaming doesn't reduce token costs, but it improves perceived latency:
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a function..."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
6. Context Management
Long conversations accumulate tokens quickly. Implement strategies to manage context efficiently.
Sliding Window
class ConversationManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Keep the first message + recent messages within the token budget."""
        total = self._count_tokens()
        while total > self.max_tokens and len(self.messages) > 2:
            # Remove the oldest message after index 0, which is assumed
            # to hold the system prompt or other pinned context
            removed = self.messages.pop(1)
            total -= self._count_tokens([removed])

    def _count_tokens(self, messages=None):
        """Estimate token count (simplified: ~4 characters per token)."""
        # Compare against None so an explicitly passed empty list is not
        # silently replaced by the full history
        msgs = self.messages if messages is None else messages
        return sum(len(m["content"]) // 4 for m in msgs)  # Rough estimate
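One thing the simple window does not handle is turn pairing: dropping a single message can leave an assistant reply at the front with no question before it. The variant below is a sketch, not the class above. It assumes the system prompt is passed separately (as with the `system` parameter used earlier), uses the same rough 4-characters-per-token estimate, and the names `estimate_tokens` and `trim_history` are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the estimate fits, keeping turns paired."""
    trimmed = list(messages)  # Work on a copy; the caller's list is untouched
    total = sum(estimate_tokens(m["content"]) for m in trimmed)
    while total > max_tokens and len(trimmed) > 1:
        removed = trimmed.pop(0)
        total -= estimate_tokens(removed["content"])
        # Never leave an assistant reply at the front without its question.
        while len(trimmed) > 1 and trimmed[0]["role"] != "user":
            removed = trimmed.pop(0)
            total -= estimate_tokens(removed["content"])
    return trimmed


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
    {"role": "assistant", "content": "d" * 400},
    {"role": "user", "content": "e" * 40},
]
window = trim_history(history, max_tokens=250)
print([m["role"] for m in window])
```

The most recent message is always kept, even if it alone exceeds the budget, so the caller still needs a hard cap on single-message size.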
Conversation Summarization
For very long conversations, periodically summarize:
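import anthropic

def format_messages(messages: list) -> str:
    """Hypothetical helper (not defined in the original): one 'role: content' line per message."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def summarize_conversation(messages: list) -> tuple[str, list]:
    """Compress a long conversation into (system_text, remaining_messages)."""
    client = anthropic.Anthropic()
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 50 words:\n{format_messages(messages)}"
        }]
    )
    summary = summary_response.content[0].text
    # The Anthropic Messages API takes the system prompt via the `system`
    # parameter, not a "system" role inside `messages`, so the summary is
    # returned separately for the caller to pass as system=...
    return (
        f"Previous conversation summary: {summary}",
        [messages[-1]],  # Keep last message for context
    )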
7. Semantic Caching
For applications with repetitive queries, implement semantic caching to avoid redundant API calls.
import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = similarity_threshold

    def get(self, query: str):
        """Find semantically similar cached response."""
        query_embedding = self.model.encode(query)
        # Linear scan: fine for small caches, use a vector index at scale
        for cached_query, (cached_embedding, response) in self.cache.items():
            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        """Cache response with query embedding."""
        embedding = self.model.encode(query)
        self.cache[query] = (embedding, response)

# Usage
cache = SemanticCache()

def get_llm_response(query: str) -> str:
    # Check cache first (an empty-string response still counts as a hit)
    cached = cache.get(query)
    if cached is not None:
        return cached
    # Call LLM if not cached
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    result = response.content[0].text
    cache.set(query, result)
    return result
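The lookup above returns the first cached entry that clears the threshold, which is not necessarily the closest one. A small variation is to scan everything and keep the best score. The sketch below uses plain Python lists and toy 2-D vectors in place of real embeddings so it runs without any model; `cosine_similarity` and `best_match` are illustrative names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec: list[float],
               entries: list[tuple[list[float], str]],
               threshold: float = 0.95):
    """Return the response whose embedding is closest to the query,
    provided its similarity is at or above the threshold; otherwise None."""
    best_score, best_response = threshold, None
    for embedding, response in entries:
        score = cosine_similarity(query_vec, embedding)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response


# Toy vectors standing in for sentence embeddings.
entries = [([1.0, 0.0], "answer A"), ([0.9, 0.1], "answer B")]
print(best_match([0.95, 0.08], entries))
print(best_match([0.0, 1.0], entries))
```

Both entries clear the 0.95 threshold for the first query, so a first-match scan would return "answer A" even though "answer B" is closer.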
8. Monitoring and Cost Tracking
You can't optimize what you don't measure. Implement comprehensive token monitoring.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    cost: float

class TokenMonitor:
    def __init__(self):
        self.usage_log = []

    def log(self, model: str, input_tokens: int, output_tokens: int,
            cache_read: int = 0, cache_write: int = 0):
        """Log token usage with cost calculation.

        input_tokens is the TOTAL prompt size, including cache_read and
        cache_write. Anthropic's usage.input_tokens excludes cached tokens,
        so add the two cache fields to it before calling this method.
        """
        cost = self._calculate_cost(model, input_tokens, output_tokens,
                                    cache_read, cache_write)
        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cache_read_tokens=cache_read,
            cache_write_tokens=cache_write,
            cost=cost
        )
        self.usage_log.append(usage)
        return cost

    def _calculate_cost(self, model, input_tokens, output_tokens,
                        cache_read, cache_write):
        """Calculate cost based on model pricing (dollars per MTok)."""
        pricing = {
            "claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.50},
            "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.30},
            "claude-haiku-4-5": {"input": 1.0, "output": 5.0, "cache_read": 0.10},
        }
        # Unknown models are priced as Sonnet
        p = pricing.get(model, pricing["claude-sonnet-4-6"])
        # Uncached input tokens
        uncached_input = input_tokens - cache_read - cache_write
        input_cost = (uncached_input * p["input"] +
                      cache_read * p["cache_read"] +
                      cache_write * p["input"] * 1.25) / 1_000_000  # 5min TTL write price
        output_cost = output_tokens * p["output"] / 1_000_000
        return input_cost + output_cost

    def get_daily_summary(self):
        """Get daily cost summary."""
        today = datetime.now().date()
        today_usage = [u for u in self.usage_log if u.timestamp.date() == today]
        return {
            "total_cost": sum(u.cost for u in today_usage),
            "total_requests": len(today_usage),
            "total_input_tokens": sum(u.input_tokens for u in today_usage),
            "total_output_tokens": sum(u.output_tokens for u in today_usage),
            "cache_hit_rate": self._calculate_cache_hit_rate(today_usage)
        }

    def _calculate_cache_hit_rate(self, usage_list):
        """Share of all input tokens that were served from cache."""
        total_input = sum(u.input_tokens for u in usage_list)
        total_cache_read = sum(u.cache_read_tokens for u in usage_list)
        return total_cache_read / total_input if total_input > 0 else 0
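A daily summary becomes more useful with a budget threshold attached. This is a minimal sketch: `check_budget` is a hypothetical helper meant to be fed the `total_cost` value from `get_daily_summary()`, and the 80% warning level is an arbitrary default.

```python
def check_budget(spent_today: float, daily_budget: float,
                 warn_at: float = 0.8) -> str:
    """Classify today's spend as 'ok', 'warning', or 'exceeded'."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    ratio = spent_today / daily_budget
    if ratio >= 1.0:
        return "exceeded"
    if ratio >= warn_at:
        return "warning"
    return "ok"


for spent in (5.00, 17.00, 21.50):
    print(f"${spent:.2f} of $20.00 -> {check_budget(spent, daily_budget=20.00)}")
```

Wiring the "warning" and "exceeded" states to a pager or chat webhook is left to whatever alerting stack is already in place.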
9. Architecture Patterns
Pattern 1: Tiered Processing
User Request
↓
[Classifier] (Haiku - cheap)
├─ simple → [Simple Handler] (Haiku) → Response
└─ complex → [Complex Handler] (Sonnet/Opus) → Response
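In code, the tiered pattern comes down to a classifier that picks one of two handlers. The sketch below uses plain callables as stand-ins so it runs offline; in practice each would wrap an API call to the model named in the diagram. The word-count classifier and all names here are illustrative assumptions.

```python
from typing import Callable

Handler = Callable[[str], str]


def make_tiered_handler(classify: Callable[[str], str],
                        simple: Handler, complex_: Handler) -> Handler:
    """Build a handler that sends each request to the tier the classifier picks."""
    def handle(request: str) -> str:
        tier = classify(request)
        return simple(request) if tier == "simple" else complex_(request)
    return handle


def toy_classifier(request: str) -> str:
    """Stand-in for a cheap-model classifier: short requests count as simple."""
    return "simple" if len(request.split()) < 12 else "complex"


handle = make_tiered_handler(
    toy_classifier,
    simple=lambda r: f"[haiku] {r}",
    complex_=lambda r: f"[sonnet] {r}",
)

print(handle("What are your opening hours?"))
print(handle("Compare these three database designs and explain which one "
             "handles multi-region writes best and why."))
```

Anything the classifier does not label "simple" goes to the stronger model, so a misclassification costs money rather than quality.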
Pattern 2: Cache Layer
User Request
↓
[Semantic Cache] → Cache Hit? → Return cached response
↓ Cache Miss
[Prompt Cache Layer] → Add cache_control markers
↓
[LLM API] → Response
↓
[Cache Storage] → Store for future
Pattern 3: Batch Pipeline
[Data Source]
↓
[Batch Collector] → Accumulate requests
↓
[Batch API] → Process asynchronously (50% discount)
↓
[Result Distributor] → Send results to users
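The collector stage can be as small as a list with a size trigger. The sketch below is a hypothetical `BatchCollector` that hands full batches to a `submit` callable; in a real pipeline that callable would build the request objects and call the batch API shown in section 4.

```python
from typing import Callable


class BatchCollector:
    """Accumulate requests and hand them off in fixed-size batches."""

    def __init__(self, submit: Callable[[list[dict]], None], max_size: int = 100):
        self.submit = submit
        self.max_size = max_size
        self.pending: list[dict] = []

    def add(self, custom_id: str, prompt: str) -> None:
        self.pending.append({"custom_id": custom_id, "prompt": prompt})
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is pending, e.g. on a timer or at shutdown."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.submit(batch)


submitted: list[list[dict]] = []
collector = BatchCollector(submit=submitted.append, max_size=3)
for i in range(7):
    collector.add(f"review-{i}", f"Review code snippet {i}...")
collector.flush()  # Push the final partial batch
print([len(b) for b in submitted])  # prints [3, 3, 1]
```

A time-based flush is worth adding alongside the size trigger so that a slow trickle of requests does not wait indefinitely.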
10. Real-World Case Study
Scenario: Customer support chatbot processing 5,000 conversations/day
Before optimization:
- Model: Claude Sonnet 4.6 (fixed)
- Average tokens: 3,000 input, 800 output per conversation
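- Daily cost: $78.00 as reported. At the list prices in section 1, the stated averages work out higher: 5,000 × (3,000 × $3.00 + 800 × $15.00) / 1,000,000 = $105.00
- Monthly cost: ~$2,340 as reported (~$3,150 at $105.00/day)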
After optimization:
- Model routing: 70% Haiku, 30% Sonnet
- Prompt caching: 90% cache hit rate on system prompt
- Output limits: Reduced average output to 400 tokens
- Daily cost: $12.50
- Monthly cost: ~$375
Total savings: 84%
11. Provider Agnostic Tips
When working with multiple LLM providers or switching between them:
Abstract your LLM layer: Use a unified interface that makes it easy to switch providers.
Test with multiple providers: Some tasks work equally well with cheaper providers.
Monitor provider-specific features: Prompt caching, batch processing, and pricing vary significantly.
Consider Chinese models: For cost-sensitive applications, Chinese models like DeepSeek and GLM offer significantly lower pricing. Services like Token China provide unified API access to these models with OpenAI-compatible endpoints—no Chinese phone number required, and you get 100K free tokens to start.
Negotiate volume discounts: For high-volume applications, contact providers directly for custom pricing.
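For the first tip, a thin interface is usually enough to make providers swappable. The sketch below uses `typing.Protocol`; the `LLMProvider` and `Completion` names are illustrative, and `EchoProvider` is an offline stand-in rather than a real adapter. A real one would wrap the provider's SDK client behind the same `complete` method.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class LLMProvider(Protocol):
    """Anything with a name and a complete() method can act as a provider."""
    name: str

    def complete(self, prompt: str, max_tokens: int) -> Completion: ...


class EchoProvider:
    """Offline stand-in for tests; echoes the prompt, truncated to the limit."""
    name = "echo"

    def complete(self, prompt: str, max_tokens: int) -> Completion:
        text = prompt[: max_tokens * 4]  # Rough 4-characters-per-token cap
        return Completion(text=text,
                          input_tokens=len(prompt) // 4,
                          output_tokens=len(text) // 4)


def ask(provider: LLMProvider, prompt: str, max_tokens: int = 256) -> str:
    """Application code depends only on the interface, not on a vendor SDK."""
    return provider.complete(prompt, max_tokens).text


print(ask(EchoProvider(), "hello"))
```

Returning token counts from every adapter keeps the monitoring code from section 8 provider-independent as well.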
12. Checklist
Before deploying to production, verify:
- [ ] System prompts are optimized and use prompt caching
- [ ] Model routing is implemented for different task types
- [ ] Output length limits are set appropriately
- [ ] Batch processing is used for async workloads
- [ ] Token monitoring and alerting is in place
- [ ] Semantic caching is implemented for repetitive queries
- [ ] Conversation context is managed efficiently
- [ ] Cost budgets and alerts are configured
Resources
- Anthropic Prompt Caching Documentation
- Anthropic Batch Processing Documentation
- OpenAI Pricing
- Google AI Pricing
- Token China - Unified API for DeepSeek, GLM, and more (OpenAI-compatible)
TL;DR: Used prompt caching (90% savings on cached tokens), model routing (about 33% savings on the blended input price), batch processing (50% savings), and output optimization to reduce LLM API costs by 84%. Consider Chinese models like DeepSeek for even cheaper alternatives.