I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook
Running LLMs in production burns cash. Fast. When your app goes from "prototype" to "actually used by people," that API bill can go from "whatever" to "wait, that's a mortgage payment" in about two weeks.
I learned this the hard way. My knowledge base platform went from a few hundred requests to thousands per day, and my LLM bill jumped to $4,200/month. After spending three weeks optimizing, I brought it down to $1,130/month — a 73% reduction — without anyone noticing a drop in quality.
Here's the exact playbook.
1. The Routing Layer: Right Model for the Right Job
Most developers send everything to the biggest model. That's like using a sledgehammer to crack a nut.
The strategy: Classify requests by complexity and route accordingly.
// Simple classification layer
function routeByComplexity(userInput: string): LLMModel {
const tokens = userInput.split(/\s+/).length;
if (tokens < 15 && !containsTechnicalTerms(userInput)) {
return 'cheap-fast-model'; // $0.15/M tokens
}
if (tokens < 100 && isStructuredQuery(userInput)) {
return 'mid-tier-model'; // $0.50/M tokens
}
return 'premium-model'; // $3.00/M tokens — only when needed
}
The impact: ~40% of our requests are simple (formatting, classification, short answers). Routing those to cheaper models saved ~$900/month alone.
How to classify: Start with heuristics (token count, keyword matching). Once you have data, train a tiny classifier that costs pennies to run.
2. Response Caching: The $600/Month Win
If a user asks "What is RAG?" and another user asks "What is RAG?" three hours later — that's the same answer. Don't pay twice.
import hashlib
import redis
class LLMCache:
def __init__(self):
self.redis = redis.Redis()
def get_cache_key(self, prompt: str, model: str) -> str:
raw = f"{model}:{prompt.strip().lower()}"
return f"llm:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"
def get(self, prompt: str, model: str):
key = self.get_cache_key(prompt, model)
return self.redis.get(key)
def set(self, prompt: str, model: str, response: str, ttl: int = 86400):
key = self.get_cache_key(prompt, model)
self.redis.setex(key, ttl, response)
Key decisions:
- TTL: 24 hours for general knowledge, 1 hour for time-sensitive queries
- Cache scope: Cache at the prompt level, not the response — normalize whitespace, lowercase, strip trailing punctuation
- Hit rate: We achieved 35% cache hit rate on FAQ-style content
The catch: Don't cache creative tasks (writing, brainstorming). Those need fresh outputs every time.
3. Token Budgeting: The Silent Killer Is Output Length
Most LLM pricing charges per output token. A model that outputs 2,000 tokens when 300 would do is burning your money.
Before:
User: "Summarize this article"
Model: *generates 1,800 token essay with examples, caveats, and a conclusion*
Cost: $0.054 per request
After:
User: "Summarize this article in 3 bullet points, max 50 words each."
Model: *generates 120 tokens*
Cost: $0.0036 per request
Tactics that work:
- Explicit token budgets in prompts: "Answer in under 100 words"
-
max_tokensparameter: Set hard limits (but beware of truncated responses) - Output format constraints: JSON schemas force conciseness
- Temperature tuning: Lower temperature (0.1-0.3) reduces rambling
This alone cut our output token count by 60%.
4. Prompt Compression: Shrink the Input, Shrink the Cost
Your prompt tokens cost money too. If you're sending a 5,000-token system prompt with every request, you're paying $0.015 per call just for setup.
What I compressed:
- System prompts: 3,200 → 800 tokens (removed redundant instructions)
- Few-shot examples: 6 examples → 2 carefully chosen ones
- Context windows: Only include relevant sections, not entire documents
The compression technique:
- Run your prompt through a cheap model first: "Condense these instructions to the minimum needed for correct execution"
- Test output quality — if it drops, you compressed too far
- A/B test the compressed vs. original for a week
We cut input tokens by 55% with zero quality loss.
5. Batch Processing: The Async Advantage
If your app doesn't need real-time responses, batching is your best friend. Most providers offer significant discounts for batch API calls.
Our use case: Article processing pipeline — we need LLMs to tag, summarize, and extract entities from crawled content. None of this is user-facing in real-time.
# Batch API pattern (OpenAI example)
import openai
batch_input = [
{"custom_id": "article-001", "method": "POST", "url": "/v1/chat/completions",
"body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 500}},
{"custom_id": "article-002", "method": "POST", "url": "/v1/chat/completions",
"body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 500}},
# ... up to 50,000 requests per batch
]
# 50% discount on batch API + higher rate limits
batch_file = upload_batch_file(batch_input)
batch_job = openai.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions")
Batch processing costs 50% less than real-time API calls. For our content pipeline, this saved $400/month.
6. Model Distillation: The Advanced Play
This is the most effort but the biggest payoff. For tasks you run thousands of times per day (classification, tagging, entity extraction), fine-tune a smaller model on outputs from the big model.
The process:
- Run 1,000 examples through GPT-4/Claude to get "gold standard" outputs
- Fine-tune GPT-4o-mini (or Claude Haiku) on those examples
- The small model now produces ~90% of the quality at ~10% of the cost
Our results:
- Classification task: 94% accuracy with premium model → 89% with fine-tuned small model
- Cost per request: $0.03 → $0.003
- Break-even: After ~5,000 requests, the fine-tuning cost pays for itself
The Numbers, Laid Bare
| Tactic | Monthly Savings | Effort |
|---|---|---|
| Smart routing | ~$900 | 2 days |
| Response caching | ~$600 | 1 day |
| Token budgeting | ~$800 | 3 hours |
| Prompt compression | ~$400 | 1 day |
| Batch processing | ~$400 | 2 hours |
| Model distillation | ~$170 | 1 week |
| Total | ~$3,070 |
From $4,200 to $1,130. And my users didn't notice a thing.
The One Rule I Follow Now
Every LLM call should answer: "Does this need to be an LLM call at all?"
Some things are better solved with:
- Regex (formatting, validation)
- Database queries (lookup, filtering)
- Rule engines (classification with clear rules)
- Embedding similarity (search without generation)
The cheapest API call is the one you don't make.
What I'd Do Differently
If I started over, I'd instrument LLM costs from day one. Track per-endpoint spend, set budgets with alerts, and review the bill weekly. Cost creep is real — it's not one big mistake, it's a hundred small inefficiencies that compound.
What's your LLM bill looking like? Drop your optimization war stories below — I'm always looking for the next 10% to cut.
Top comments (0)