How I Cut My AI API Bill From Scratch: What Nobody Tells You
I still remember the day I opened our team's monthly invoice and nearly spilled coffee on my keyboard. We'd been "playing around" with LLMs for a few months, and the bill had quietly ballooned to something absurd. After digging in, I realised something that genuinely embarrassed me: we were burning cash because we were lazy. Every prompt, every request, every tiny classification task — all routed through the most expensive models because, well, they were the defaults.
Here's how I dug out of that hole. These aren't theoretical tricks from some whitepaper. They're the exact things I wired into our system over a long weekend, and the savings have stuck for months. Let me walk you through what worked.
The Embarrassing Truth About My Stack
Before we dive in, I want to give you the same panic-inducing math that motivated me. The cost gap between models isn't a small difference — it's an order of magnitude. Sometimes two orders of magnitude. Once I built out a proper comparison, I couldn't unsee it.
Here's the table that changed how I think about every API call (prices are in dollars per million tokens):
| Task | Expensive Choice | Smart Choice | Savings |
|---|---|---|---|
| Simple chat | GPT-4o ($10/M) | DeepSeek V4 Flash ($0.25/M) | 97.5% |
| Classification | GPT-4o-mini ($0.60/M) | Qwen3-8B ($0.01/M) | 98.3% |
| Code generation | GPT-4o ($10/M) | DeepSeek Coder ($0.25/M) | 97.5% |
| Summarization | GPT-4o ($10/M) | Qwen3-32B ($0.28/M) | 97.2% |
| Translation | GPT-4o ($10/M) | Qwen-MT-Turbo ($0.30/M) | 97% |
Read that last column again. Ninety-seven percent. On line items I'd been treating as "cheap." That's the kind of number where you stop, laugh, and start refactoring.
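If you want to sanity-check the Savings column against your own price list, here's a small sketch that recomputes it. The helper name `savings_pct` is my own invention; the prices are copied straight from the table.

```python
def savings_pct(expensive_per_m: float, cheap_per_m: float) -> float:
    """Percentage saved by moving from the expensive price to the cheap one."""
    return round((1 - cheap_per_m / expensive_per_m) * 100, 1)

# (task, expensive $/M tokens, cheap $/M tokens), copied from the table
rows = [
    ("Simple chat", 10.00, 0.25),
    ("Classification", 0.60, 0.01),
    ("Code generation", 10.00, 0.25),
    ("Summarization", 10.00, 0.28),
    ("Translation", 10.00, 0.30),
]

for task, expensive, cheap in rows:
    print(f"{task}: {savings_pct(expensive, cheap)}% cheaper")
```

Swap in whatever your provider actually charges; list prices move often, so treat the table as a snapshot.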
Strategy 1: Stop Asking the Ferrari to Pick Up Groceries
The first lesson was the easiest, and also the one I should have learned months earlier. Stop sending every task to your priciest model. Most of what we send through an LLM is not rocket science. Classifying a support ticket, summarizing a paragraph, answering a FAQ — none of that needs the brainpower of a frontier reasoning model.
Here's how I built a tiny router in our codebase. It's not fancy, and that's the point. Let me show you the bones of it:
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key="YOUR_GLOBAL_API_KEY",
)

# Task type -> model name (prices are per million tokens)
MODEL_MAP = {
    "chat": "deepseek-v4-flash",       # $0.25/M
    "code": "deepseek-coder",          # $0.25/M
    "simple": "Qwen/Qwen3-8B",         # $0.01/M
    "reasoning": "deepseek-reasoner",  # $2.50/M
}

def classify_complexity(user_input: str) -> str:
    # Keyword heuristic. Swap in your own rules or
    # another cheap model call, whatever floats your boat.
    lowered = user_input.lower()
    if "code" in lowered or "function" in lowered or "implement" in lowered:
        return "code"
    if "prove" in lowered or "step by step" in lowered or "why" in lowered:
        return "reasoning"
    # Short inputs with no keywords go to the cheapest model
    if len(user_input.split()) < 20:
        return "simple"
    return "chat"

def route_and_answer(user_input: str) -> str:
    task = classify_complexity(user_input)
    model = MODEL_MAP[task]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content
That single change — picking the right engine for the right job — cut roughly 90% off our bill. Nothing else. Just routing logic.
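One gotcha with plain substring checks like the ones in `classify_complexity`: "code" also matches "decode" and "barcode", so innocent questions can get routed to the code model. Here's a hedged variant that matches whole words only. The function name `classify_complexity_strict` and the keyword lists are my assumptions, not something you have to copy.

```python
import re

# Hypothetical keyword lists; tune these against your own traffic.
CODE_PATTERN = re.compile(r"\b(?:code|function|implement)\b")
REASONING_PATTERN = re.compile(r"\b(?:prove|why)\b|step by step")

def classify_complexity_strict(user_input: str) -> str:
    """Same routing labels as the router above, but keywords must be whole words."""
    lowered = user_input.lower()
    if CODE_PATTERN.search(lowered):
        return "code"
    if REASONING_PATTERN.search(lowered):
        return "reasoning"
    if len(lowered.split()) < 20:
        return "simple"
    return "chat"

print(classify_complexity_strict("How do I decode this barcode?"))
print(classify_complexity_strict("Implement a binary search function"))
```

The trade-off is that whole-word matching misses variants like "functions" or "implementation" unless you add them to the pattern, so log a sample of routing decisions before trusting either version.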
Strategy 2: The Cascade Pattern (Where I Saved Another 5%)
Once I had basic routing working, I got greedy. Here's how the cascade works: try the cheapest model first, and only escalate if the answer isn't good enough. It's the same idea as a junior dev reviewing before a senior jumps in.
Here's how I implemented it:
def call_model(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def quality_check(response: str) -> float:
    # Your heuristic: length, keyword presence, another cheap model
    # to grade it, logprobs, whatever. Keep this fast and cheap.
    return min(1.0, len(response) / 200)

def smart_generate(prompt: str) -> str:
    # Tier 1: Ultra-budget ($0.01/M)
    resp = call_model("Qwen/Qwen3-8B", prompt)
    if quality_check(resp) >= 0.8:
        return resp  # ~80% of requests handled here

    # Tier 2: Standard ($0.25/M)
    resp = call_model("deepseek-v4-flash", prompt)
    if quality_check(resp) >= 0.9:
        return resp  # ~15% of requests

    # Tier 3: Premium ($0.78-$2.50/M)
    return call_model("deepseek-reasoner", prompt)  # ~5% of requests
The real win is in that 80% figure. Most requests never need the heavy artillery. We saw a customer support chatbot go from $420/month down to $28/month just by routing 85% of queries through Qwen3-8B. That's not a typo. Twenty-eight dollars. From four hundred and twenty.
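One thing to keep in mind with a cascade: a request that escalates pays for every tier it touched, not just the last one. Here's a rough sketch of the blended price using the 80/15/5 split from the code comments. It assumes every tier sees about the same number of tokens and uses the $2.50/M end of the premium range; `blended_cost_per_m` is a name I made up for illustration.

```python
def blended_cost_per_m(tiers: list[tuple[float, float]]) -> float:
    """tiers: (price per million tokens, fraction of requests that reach the tier)."""
    return sum(price * reach for price, reach in tiers)

tiers = [
    (0.01, 1.00),  # every request hits the cheapest model first
    (0.25, 0.20),  # the 20% that fail the first check also pay for tier 2
    (2.50, 0.05),  # the 5% that fail both checks pay for all three tiers
]

print(f"${blended_cost_per_m(tiers):.3f} per million tokens")
```

Under those assumptions the blend works out to roughly $0.185/M, and most of it comes from the 5% of traffic that reaches the premium tier. That's why a sloppy quality check that escalates too eagerly can quietly eat your savings.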
Strategy 3: Cache Like Your Wallet Depends On It (Because It Does)
Okay, this one I should have implemented on day one. So many requests are identical or nearly identical. The same FAQ, the same documentation lookup, the same "summarize this article" prompt run twice by two teammates. Every duplicate you answer from a cache is money you'd otherwise hand to a GPU cluster somewhere.
Here's a simple, working cache you can drop in:
import hashlib
import json
import time

# In-memory cache: key -> {"response": ..., "time": ...}
cache = {}

def cached_chat(model: str, messages: list, ttl: int = 3600):
    # sort_keys keeps the key stable regardless of dict ordering
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    if key in cache:
        entry = cache[key]
        if time.time() - entry["time"] < ttl:
            return entry["response"]  # Cache hit: $0 cost

    # Cache miss or expired entry: make the real call and store it
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    cache[key] = {"response": response, "time": time.time()}
    return response
For real workloads — the kind where users ask variations of the same handful of questions — I've seen cache hit rates of 50% to 80%. That alone stacks another 20% to 50% on top of whatever savings you've already eked out. It's almost unfair.
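An exact-match key only catches byte-for-byte duplicates. For the "nearly identical" cases, one cheap trick is to normalize the text before hashing. This is a minimal sketch, assuming that case and whitespace don't change the meaning of your prompts (they do for code, so don't use it there). The helper names `normalize` and `cache_key` are mine.

```python
import hashlib
import json

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations share a key."""
    return " ".join(text.lower().split())

def cache_key(model: str, messages: list[dict]) -> str:
    """Build a stable key from the model name and the normalized messages."""
    normalized = [
        {"role": m["role"], "content": normalize(m["content"])}
        for m in messages
    ]
    payload = json.dumps({"model": model, "messages": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = cache_key("deepseek-v4-flash", [{"role": "user", "content": "What is your refund policy?"}])
b = cache_key("deepseek-v4-flash", [{"role": "user", "content": "  what is your REFUND   policy? "}])
print(a == b)
```

The original, un-normalized messages should still be what you send to the model; only the lookup key is normalized.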
Strategy 4: Compress Your Prompts Before They Leave Your Server
This one surprised me with how effective it was. Long prompts mean more input tokens. More input tokens means more cost. We were sending multi-thousand-token system prompts to handle relatively simple queries. After I started compressing context before sending, I watched the meter slow down dramatically.
Here's the pattern. If your context is short, just send it. If it's long, summarize it first with a cheap model, then send the summary plus the actual question:
def compress_prompt(text: str, target_ratio: float = 0.5) -> str:
    if len(text) < 500:
        return text  # Already short, don't waste a call

    # Ask the cheap model for a summary of roughly the target length
    target_chars = int(len(text) * target_ratio)
    summary = call_model(
        "Qwen/Qwen3-8B",
        f"Summarize this in {target_chars} chars: {text}",
    )
    return summary
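Let me share the concrete math, because it keeps you honest. Compressing a 2,000-token system prompt down to 400 tokens saves 1,600 input tokens per request. On DeepSeek V4 Flash at $0.25/M, that's only $0.0004 per call, or about $4 a day at 10,000 requests a day (roughly $1,460 a year). On a premium model priced around $15/M, the same compression is worth $0.024 per call. That's not a lot per call either. But if you're processing 10,000 requests a day? That's $240/day, or about $87,600 a year. From one prompt compression. Wild.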
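The arithmetic is easy to rerun for your own prices. Here's a small sketch, with a hypothetical helper name, assuming only the input token count changes and ignoring the (tiny) cost of the summarization call itself:

```python
def compression_savings(
    tokens_before: int,
    tokens_after: int,
    price_per_m: float,
    requests_per_day: int,
) -> tuple[float, float, float]:
    """Return dollars saved per request, per day, and per 365-day year."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * price_per_m / 1_000_000
    per_day = per_request * requests_per_day
    return per_request, per_day, per_day * 365

# 2,000 -> 400 tokens at 10,000 requests a day, at two price points
for price in (0.25, 15.00):
    per_request, per_day, per_year = compression_savings(2000, 400, price, 10_000)
    print(f"${price}/M: ${per_request:.4f}/request, ${per_day:.2f}/day, ${per_year:,.0f}/year")
```

The takeaway: compression pays off in proportion to the model's input price, so it matters most on whatever traffic still reaches your expensive tier.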
Strategy 5: Batch Until It Hurts (Then Back Off Slightly)
Here's another "duh, why wasn't I doing this" moment. When you have a list of independent questions, don't loop through them and fire one call each. Bundle them into a single prompt and let the model chew through them in one pass. The per-request overhead drops: any shared instructions or system prompt get paid for once instead of three times, and you make one network round trip instead of three.
Here's the before-and-after that I think captures it best:
questions = [
    "What is the capital of France?",
    "What is the capital of Japan?",
    "What is the capital of Brazil?",
]

# BEFORE: 3 separate API calls
for question in questions:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": question}],
    )
    print(response.choices[0].message.content)

# AFTER: 1 batched call
batch_prompt = (
    "Answer each question on its own line. "
    "Questions:\n" + "\n".join(f"- {q}" for q in questions)
)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": batch_prompt}],
)
print(response.choices[0].message.content)
You can stack another 10% to 20% on top of everything else with this. The trick is to respect context window limits — don't try to batch 5,000 questions into one prompt — but for the realistic workloads where this matters, it pays off.
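The part that bites in practice is getting the batched reply back into one answer per question. Here's a minimal sketch, assuming the model really does answer one per line (it won't always, which is why the count check is there). `split_batched_answers` is a hypothetical helper, not part of any SDK.

```python
import re

# Leading list markers the model might add: "- ", "* ", "1. ", "2) "
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")

def split_batched_answers(raw: str, expected: int) -> list[str]:
    """Split a one-answer-per-line reply; raise if the count doesn't match."""
    answers = [
        LIST_MARKER.sub("", line).strip()
        for line in raw.splitlines()
        if line.strip()
    ]
    if len(answers) != expected:
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    return answers

print(split_batched_answers("- Paris\n- Tokyo\n\n- Brasília", expected=3))
```

When the count is off, the safe fallback is to re-run just that batch one question at a time rather than guess which answer belongs where.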
Putting It All Together: My Real Numbers
Here are the receipts. I won't lie about this: the headline number from the strategy table felt exaggerated when I first heard it. So let me share what I actually saw:
- Smart model selection alone: ~90% savings on the routed traffic.
- Adding tiered routing: pushed us toward ~95%.
- Adding response caching: another 20-50% on top of that.
- Adding prompt compression: another 15-30% on remaining requests.
- Adding batching: another 10-20% where it applied.
Layered together, we comfortably cleared 95% savings. And honestly? The output quality got better in some places because I was finally thinking about which model was right for each task, instead of letting the default do everything.
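A note on how those percentages combine: they don't add up, they compound, because each layer only saves a slice of whatever spend is left after the previous one. Here's a rough sketch using the low end of each range from the list. It assumes every layer applies to all remaining traffic, which overstates things since caching and batching only helped on part of ours, so read the result as an upper bound. The function name is mine.

```python
from math import prod

def combined_savings(layers: list[float]) -> float:
    """Each layer saves a fraction of the spend left over by the earlier layers."""
    remaining = prod(1 - saving for saving in layers)
    return 1 - remaining

# routing + tiering (~95%), then caching (20%), compression (15%), batching (10%)
layers = [0.95, 0.20, 0.15, 0.10]
print(f"{combined_savings(layers):.1%}")
```

Notice how little the later layers move the total once routing has done its work: after a 95% cut there just isn't much bill left to shave.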
A Few Things I Wish I'd Known Sooner
Let me give you the soft advice, the stuff that doesn't fit in a code snippet:
Instrument everything. The first thing I did was log which model handled each request and how much it cost. Once you see where the money goes, the optimization opportunities practically announce themselves.
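For what it's worth, the instrumentation doesn't need to be fancy. Here's a minimal sketch of a per-model cost log. The `CostLog` class is hypothetical, the prices are the ones quoted earlier in this post, and it assumes a single blended price per token (real price lists usually split input and output).

```python
from collections import defaultdict

# Prices per million tokens, as quoted earlier in the post
PRICE_PER_M = {
    "Qwen/Qwen3-8B": 0.01,
    "deepseek-v4-flash": 0.25,
    "deepseek-reasoner": 2.50,
}

class CostLog:
    """Accumulates token counts per model and converts them to dollars."""

    def __init__(self) -> None:
        self.tokens_by_model: dict[str, int] = defaultdict(int)

    def record(self, model: str, total_tokens: int) -> None:
        self.tokens_by_model[model] += total_tokens

    def cost_by_model(self) -> dict[str, float]:
        return {
            model: tokens * PRICE_PER_M[model] / 1_000_000
            for model, tokens in self.tokens_by_model.items()
        }

log = CostLog()
log.record("deepseek-v4-flash", 1_200)
log.record("Qwen/Qwen3-8B", 300)
print(log.cost_by_model())
```

With the OpenAI Python client, the token count for a call is typically available as `response.usage.total_tokens`, so a `log.record(model, response.usage.total_tokens)` after each request is usually enough. Check that your provider actually populates `usage` before relying on it.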
Don't optimize the easy stuff and call it done. The big wins are usually boring — the second-most-expensive model handling 80% of traffic quietly, the cache hits you never knew about, the bloated system prompt you've been shipping since launch.
Quality checks aren't optional. With cascading tiers, you