kol kol

Posted on May 18

I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

#ai #llm #devops #costoptimization

I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

Running LLMs in production burns cash. Fast. When your app goes from "prototype" to "actually used by people," that API bill can go from "whatever" to "wait, that's a mortgage payment" in about two weeks.

I learned this the hard way. My knowledge base platform went from a few hundred requests to thousands per day, and my LLM bill jumped to $4,200/month. After spending three weeks optimizing, I brought it down to $1,130/month — a 73% reduction — without anyone noticing a drop in quality.

Here's the exact playbook.

1. The Routing Layer: Right Model for the Right Job

Most developers send everything to the biggest model. That's like using a sledgehammer to crack a nut.

The strategy: Classify requests by complexity and route accordingly.

// Simple classification layer
function routeByComplexity(userInput: string): LLMModel {
  const tokens = userInput.split(/\s+/).length;

  if (tokens < 15 && !containsTechnicalTerms(userInput)) {
    return 'cheap-fast-model';     // $0.15/M tokens
  }
  if (tokens < 100 && isStructuredQuery(userInput)) {
    return 'mid-tier-model';       // $0.50/M tokens
  }
  return 'premium-model';          // $3.00/M tokens — only when needed
}

The impact: ~40% of our requests are simple (formatting, classification, short answers). Routing those to cheaper models saved ~$900/month alone.

How to classify: Start with heuristics (token count, keyword matching). Once you have data, train a tiny classifier that costs pennies to run.

2. Response Caching: The $600/Month Win

If a user asks "What is RAG?" and another user asks "What is RAG?" three hours later — that's the same answer. Don't pay twice.

import hashlib
import redis

class LLMCache:
    def __init__(self):
        self.redis = redis.Redis()

    def get_cache_key(self, prompt: str, model: str) -> str:
        raw = f"{model}:{prompt.strip().lower()}"
        return f"llm:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"

    def get(self, prompt: str, model: str):
        key = self.get_cache_key(prompt, model)
        return self.redis.get(key)

    def set(self, prompt: str, model: str, response: str, ttl: int = 86400):
        key = self.get_cache_key(prompt, model)
        self.redis.setex(key, ttl, response)

Key decisions:

TTL: 24 hours for general knowledge, 1 hour for time-sensitive queries
Cache scope: Cache at the prompt level, not the response — normalize whitespace, lowercase, strip trailing punctuation
Hit rate: We achieved 35% cache hit rate on FAQ-style content

The catch: Don't cache creative tasks (writing, brainstorming). Those need fresh outputs every time.

3. Token Budgeting: The Silent Killer Is Output Length

Most LLM pricing charges per output token. A model that outputs 2,000 tokens when 300 would do is burning your money.

Before:

User: "Summarize this article"
Model: *generates 1,800 token essay with examples, caveats, and a conclusion*
Cost: $0.054 per request

After:

User: "Summarize this article in 3 bullet points, max 50 words each."
Model: *generates 120 tokens*
Cost: $0.0036 per request

Tactics that work:

Explicit token budgets in prompts: "Answer in under 100 words"
max_tokens parameter: Set hard limits (but beware of truncated responses)
Output format constraints: JSON schemas force conciseness
Temperature tuning: Lower temperature (0.1-0.3) reduces rambling

This alone cut our output token count by 60%.

4. Prompt Compression: Shrink the Input, Shrink the Cost

Your prompt tokens cost money too. If you're sending a 5,000-token system prompt with every request, you're paying $0.015 per call just for setup.

What I compressed:

System prompts: 3,200 → 800 tokens (removed redundant instructions)
Few-shot examples: 6 examples → 2 carefully chosen ones
Context windows: Only include relevant sections, not entire documents

The compression technique:

Run your prompt through a cheap model first: "Condense these instructions to the minimum needed for correct execution"
Test output quality — if it drops, you compressed too far
A/B test the compressed vs. original for a week

We cut input tokens by 55% with zero quality loss.

5. Batch Processing: The Async Advantage

If your app doesn't need real-time responses, batching is your best friend. Most providers offer significant discounts for batch API calls.

Our use case: Article processing pipeline — we need LLMs to tag, summarize, and extract entities from crawled content. None of this is user-facing in real-time.

# Batch API pattern (OpenAI example)
import openai

batch_input = [
    {"custom_id": "article-001", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 500}},
    {"custom_id": "article-002", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 500}},
    # ... up to 50,000 requests per batch
]

# 50% discount on batch API + higher rate limits
batch_file = upload_batch_file(batch_input)
batch_job = openai.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions")

Batch processing costs 50% less than real-time API calls. For our content pipeline, this saved $400/month.

6. Model Distillation: The Advanced Play

This is the most effort but the biggest payoff. For tasks you run thousands of times per day (classification, tagging, entity extraction), fine-tune a smaller model on outputs from the big model.

The process:

Run 1,000 examples through GPT-4/Claude to get "gold standard" outputs
Fine-tune GPT-4o-mini (or Claude Haiku) on those examples
The small model now produces ~90% of the quality at ~10% of the cost

Our results:

Classification task: 94% accuracy with premium model → 89% with fine-tuned small model
Cost per request: $0.03 → $0.003
Break-even: After ~5,000 requests, the fine-tuning cost pays for itself

The Numbers, Laid Bare

Tactic	Monthly Savings	Effort
Smart routing	~$900	2 days
Response caching	~$600	1 day
Token budgeting	~$800	3 hours
Prompt compression	~$400	1 day
Batch processing	~$400	2 hours
Model distillation	~$170	1 week
Total	~$3,070

From $4,200 to $1,130. And my users didn't notice a thing.

The One Rule I Follow Now

Every LLM call should answer: "Does this need to be an LLM call at all?"

Some things are better solved with:

Regex (formatting, validation)
Database queries (lookup, filtering)
Rule engines (classification with clear rules)
Embedding similarity (search without generation)

The cheapest API call is the one you don't make.

What I'd Do Differently

If I started over, I'd instrument LLM costs from day one. Track per-endpoint spend, set budgets with alerts, and review the bill weekly. Cost creep is real — it's not one big mistake, it's a hundred small inefficiencies that compound.

What's your LLM bill looking like? Drop your optimization war stories below — I'm always looking for the next 10% to cut.

DEV Community

I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

1. The Routing Layer: Right Model for the Right Job

2. Response Caching: The $600/Month Win

3. Token Budgeting: The Silent Killer Is Output Length

4. Prompt Compression: Shrink the Input, Shrink the Cost

5. Batch Processing: The Async Advantage

6. Model Distillation: The Advanced Play

The Numbers, Laid Bare

The One Rule I Follow Now

What I'd Do Differently

Top comments (0)