Cutting AI API Costs: A Backend Engineer's Field Notes
I still remember the morning I opened the dashboard and saw a $14,200 line item for "AI services." I'd been a backend engineer for a decade at that point, and I'd seen plenty of infrastructure bills. But this one made me laugh out loud, because it was almost entirely avoidable.
Here's the thing nobody tells you when you're shipping your first LLM-powered feature: the defaults are expensive on purpose. The vendors want you to think of these models as interchangeable, so you reach for the brand-name one for everything from "summarize this paragraph" to "translate this greeting card." fwiw, I made that exact mistake for six months before I bothered to measure anything.
What follows is the playbook I ended up with after auditing a handful of production systems. Nothing exotic. No vendor lock-in tricks, no "secret" pricing tiers. Just backend engineering hygiene applied to a new class of resource. imo, that's the whole game.
The 90% Lever: Stop Using GPT-4o For Everything
The single largest line item in any AI bill is model selection, and most teams treat it as a non-decision. They pick the model they're most familiar with and use it for every prompt. I've watched four companies do this. Every single one of them had a workload where 70%+ of calls were trivial — greetings, FAQ lookups, simple classification — and they were paying premium rates for it.
The fix isn't subtle. It's literally a routing table. Here's the one I ship in basically every project now:
| Workload | If You're Lazy | What You Should Use | Cost Delta |
|---|---|---|---|
| Casual chat | GPT-4o ($10/M out) | DeepSeek V4 Flash ($0.25/M) | 97.5% |
| Classification | GPT-4o-mini ($0.60/M) | Qwen3-8B ($0.01/M) | 98.3% |
| Code generation | GPT-4o ($10/M out) | DeepSeek Coder ($0.25/M) | 97.5% |
| Summarization | GPT-4o ($10/M out) | Qwen3-32B ($0.28/M) | 97.2% |
| Translation | GPT-4o ($10/M out) | Qwen-MT-Turbo ($0.30/M) | 97% |
Those percentages aren't theoretical. They're just arithmetic on the published rates. At $10/M versus $0.25/M, model selection is a 40x lever on your output costs, and most people never touch it.
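As a rough sketch, the table above can live in code as a plain lookup. The workload keys, the `Route` type, and the fallback to the chat route are my own illustrative choices, and the model strings are the display names from the table; the exact model IDs depend on your provider or gateway.

```python
from typing import NamedTuple


class Route(NamedTuple):
    model: str                 # display name from the table, not necessarily the API id
    usd_per_m_out: float       # output price per million tokens, from the table
    lazy_usd_per_m_out: float  # price of the "lazy" default in the same row


# Rows copied from the routing table above.
ROUTES = {
    "chat": Route("DeepSeek V4 Flash", 0.25, 10.00),
    "classification": Route("Qwen3-8B", 0.01, 0.60),
    "code": Route("DeepSeek Coder", 0.25, 10.00),
    "summarization": Route("Qwen3-32B", 0.28, 10.00),
    "translation": Route("Qwen-MT-Turbo", 0.30, 10.00),
}


def pick_route(workload: str) -> Route:
    # Unknown workloads fall back to the chat route (an assumption; pick your own default).
    return ROUTES.get(workload, ROUTES["chat"])


def cost_delta_pct(route: Route) -> float:
    """Percentage saved on output tokens versus the lazy default."""
    return round((1 - route.usd_per_m_out / route.lazy_usd_per_m_out) * 100, 1)
```

Calling `cost_delta_pct` on each row reproduces the Cost Delta column, which is a quick way to keep the table honest when prices change.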
If you do nothing else from this article, do this. I mean it.
Tiered Routing: The "Cheap First" Pattern
Once you've accepted that not every prompt deserves the flagship model, the natural next question is: how do I decide which one? The answer is the one backend engineers already reach for everywhere else: try the cheap path, escalate only if it fails.
This is the same idea as CDN fallbacks, database read replicas, or circuit breakers. Backend engineers have been doing tiered routing forever. Applying it to LLM calls is almost embarrassingly obvious in hindsight.
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key",
    base_url="https://global-apis.com/v1",
)

# call_model(model, prompt) -> str and score_quality(text) -> float in 0..1
# are app-specific helpers that are not shown here.
def smart_generate(prompt, quality_bar=0.8):
    # Tier 1: ultra-budget at $0.01/M
    tier1 = call_model("Qwen/Qwen3-8B", prompt)
    if score_quality(tier1) >= quality_bar:
        return tier1  # ~80% of traffic dies here

    # Tier 2: standard at $0.25/M
    tier2 = call_model("deepseek-v4-flash", prompt)
    if score_quality(tier2) >= 0.9:
        return tier2  # ~15% of traffic

    # Tier 3: only the hard stuff, $0.78-$2.50/M
    return call_model("deepseek-reasoner", prompt)  # ~5%
```
In production I've seen this pattern push a customer-support chatbot from $420/month down to $28/month, just by letting 85% of queries die at the cheapest tier. The "quality check" doesn't have to be fancy either — for most apps, a regex or a tiny classifier does the job. You're not trying to be perfect. You're trying to not be dumb.
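For a sense of how small that quality check can be, here is a minimal regex-based sketch of `score_quality`. The refusal phrases, the length threshold, and the 0.5 penalties are all assumptions picked for illustration, not values from a real system; tune them against your own traffic.

```python
import re

# Phrases that usually signal a non-answer (illustrative list, not exhaustive).
REFUSAL = re.compile(r"\b(i'm not sure|i cannot|i can't|as an ai)\b", re.IGNORECASE)


def score_quality(text: str, min_chars: int = 20) -> float:
    """Crude 0..1 score: penalise empty, very short, or refusal-like answers."""
    text = (text or "").strip()
    if not text:
        return 0.0
    score = 1.0
    if len(text) < min_chars:
        score -= 0.5  # too short to be a real answer for most prompts
    if REFUSAL.search(text):
        score -= 0.5  # the cheap model punted, so escalate
    return max(score, 0.0)
```

With the default `quality_bar=0.8`, any answer that trips either check falls through to the next tier.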
Caching: The Free Lunch You've Been Ignoring
Caching is the optimization technique that backend engineers already know how to do. The weird part is that a lot of teams don't bother with it for LLM calls, because they assume "every request is unique." In practice, that's almost never true.
FAQ queries, documentation lookups, "what's the weather in Berlin," template completions — these all repeat. A lot. Like, 50-80% cache-hit-rate a lot, if you measure honestly. I once audited a "personalized assistant" that turned out to be answering the same 200 questions on rotation.
Here's the minimal viable version, lifted and tweaked from a real codebase:
```python
import hashlib
import json
import time

cache = {}  # in-process only; entries are lost on restart

def cached_chat(model, messages, ttl=3600):
    # Key on the exact model + messages payload (exact-match caching).
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    entry = cache.get(key)
    if entry and (time.time() - entry["ts"]) < ttl:
        return entry["response"]  # $0.00 cost, baby

    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    cache[key] = {"response": response, "ts": time.time()}
    return response
```
That's about 20 lines. It pays for itself in a day if your traffic has any duplication at all. The dict lives in a single process and never evicts anything, so treat it as a starting point rather than a finished cache. For prefix-based or semantic caching (caching similar-but-not-identical queries), you'd reach for Redis or a vector store, but start with exact-match. Seriously. Most teams never even get to step one.
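If you stay in-process for a while, two things are worth adding before the dict grows without bound: a size cap and a hit-rate counter, so you can check the "50-80%" claim against your own traffic. The sketch below is one way to do it with the standard library; the class name, the default limits, and the injectable `clock` are my own choices. A cached value of `None` is indistinguishable from a miss here, which is fine for API responses.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache with per-entry expiry and hit/miss counters."""

    def __init__(self, max_entries=10_000, ttl=3600, clock=time.time):
        self._data = OrderedDict()  # key -> (value, stored_at)
        self.max_entries = max_entries
        self.ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and self._clock() - entry[1] < self.ttl:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[0]
        self._data.pop(key, None)  # drop the expired entry, if there was one
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = (value, self._clock())
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Swapping this in means replacing `cache.get(key)` and the `cache[key] = ...` assignment with `get` and `put`, and logging `hit_rate` somewhere you will actually look at it.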
Prompt Compression: Pay For Less
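Here's the math I ran when I first looked at this: trimming a 2,000-token system prompt to 400 tokens removes 1,600 billed input tokens from every request. At the $0.25/M rate quoted above for DeepSeek V4 Flash (assuming it applies to input tokens too), that's about $0.0004 per request. Multiply that by 10,000 requests per day and you're looking at about $4/day, or roughly $1,460 a year. The saving scales linearly with the per-token rate, so the same trim on a model that costs 40x more is worth about $58,400 a year. For a single prompt trim.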
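The arithmetic is easy to rerun with your own numbers. This is a small sketch, not anything provider-specific; the function name is mine, and the rate in the example call is the $0.25/M figure from the table treated as an input-token rate, which is an assumption you should replace with your provider's published input price.

```python
def trim_savings_usd(tokens_before, tokens_after, usd_per_m_tokens, requests_per_day):
    """Return (per_request, per_day, per_year) savings from a shorter prompt."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * usd_per_m_tokens / 1_000_000
    per_day = per_request * requests_per_day
    return per_request, per_day, per_day * 365


# Example: 2,000 -> 400 tokens, $0.25 per million tokens, 10,000 requests/day.
per_request, per_day, per_year = trim_savings_usd(2000, 400, 0.25, 10_000)
print(f"${per_request:.6f}/request, ${per_day:.2f}/day, ${per_year:,.0f}/year")
```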
The reason this works is embarrassingly simple: input tokens are billed, and most system prompts are bloated. They contain boilerplate, fallback instructions, personality traits, examples that nobody reads, and three paragraphs of "you are a helpful assistant." Trim that down.
The lazy way is to literally just delete text. The slightly less lazy way is to use a cheap model to summarize your own context before you send it:
```python
def compress_prompt(text, target_ratio=0.5):
    if len(text) < 500:
        return text  # not worth the round-trip

    # The budget is in characters, a rough proxy for tokens.
    budget = int(len(text) * target_ratio)
    # call_model is the same helper used in the routing example.
    summary = call_model(
        "Qwen/Qwen3-8B",
        f"Summarize the following in about {budget} characters, "
        f"preserving all factual details:\n\n{text}"
    )
    return summary
```
This pattern stacks with everything else. Compressed prompts mean fewer billed input tokens, often shorter responses (because the model has less to react to), and a smaller cache footprint. One caveat: the summarization call is itself billed, so compression pays off when the compressed text is reused across many requests or sent on to a pricier model. It's the optimization that keeps on optimizing. 15-30% savings per request is the realistic range I've observed, and I'd lean toward the higher end for anything with a heavy RAG context.
Batching: One of the Oldest Tricks in the Book
Backend engineers have been batching database inserts since before SQL was a thing. The same principle applies to LLM calls: if you're going to make ten requests, see if you can make one.
The cost saving here is mostly about input tokens. Ten separate calls with overlapping context waste a ton of tokens re-sending the same system prompt ten times. One batched call shares that overhead exactly once.
```python
# SYSTEM_PROMPT (str) and questions (list of str) are assumed to be defined elsewhere.
system = {"role": "system", "content": SYSTEM_PROMPT}

# before: one call per question, so the system prompt is billed every time
for q in questions:
    client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[system, {"role": "user", "content": q}],
    )

# after: 1 call, 1x system prompt, 1x round trip
batch_prompt = (
    "Answer each question on its own line, prefixed with the number.\n"
    + "\n".join(f"{i+1}. {q}" for i, q in enumerate(questions))
)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[system, {"role": "user", "content": batch_prompt}],
)
# parse_numbered_responses is a helper that is not shown here
parsed = parse_numbered_responses(response.choices[0].message.content)
```
You give up a little per-request latency, you save 10-20% on aggregate cost, and you stop hammering the rate limiter. Batching is standard advice for any metered, rate-limited API for exactly these reasons.
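The batched example calls `parse_numbered_responses` without defining it. Here is a minimal sketch of what that helper might look like, assuming the model follows the "number, then answer" instruction; the `expected` parameter and the choice to raise on missing answers are my additions. A continuation line that itself starts with something like `3. ` would be misread as a new answer, so validate the count before trusting the result.

```python
import re

# Matches "1. answer" or "1) answer" at the start of a line.
_NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def parse_numbered_responses(text, expected=None):
    """Split a numbered reply into {question_number: answer}."""
    answers = {}
    current = None
    for line in text.splitlines():
        match = _NUMBERED.match(line)
        if match:
            current = int(match.group(1))
            answers[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Unnumbered, non-empty line: treat it as a continuation of the previous answer.
            answers[current] += " " + line.strip()
    if expected is not None:
        missing = [i for i in range(1, expected + 1) if i not in answers]
        if missing:
            raise ValueError(f"missing answers for questions: {missing}")
    return answers
```

Passing `expected=len(questions)` turns a silently dropped answer into an error you can retry on, which matters more in a batch because one bad reply now affects several questions.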
Streaming + Early Termination
This one isn't in the original guide I cribbed most of this from, but it's earned its place in mine. If you're generating long outputs, stream them and stop on the first sign that downstream consumers have what they need.
Most chat UIs render incrementally. If the user has the answer in the first 80 tokens of a 600-token response, you're paying for 520 tokens they'll never read. For batch jobs, you can cut off on a sentinel token.
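```python
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    stream=True,
    max_tokens=800,
)

collected = []
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. a trailing usage chunk) carry no choices
    delta = chunk.choices[0].delta.content or ""
    collected.append(delta)
    full_so_far = "".join(collected)
    # has_conclusive_answer is an app-specific check that is not shown here
    if has_conclusive_answer(full_so_far):
        break

# Close the connection so the server can stop generating; whether billing
# stops at this point depends on the provider.
stream.close()
```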
It's not a huge line-item saver on its own — maybe 5-10% — but it stacks with everything else, and it improves perceived latency, which is the metric your users actually care about.
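The stop condition is the only app-specific part, so it helps to keep it separate from the SDK loop where it can be tested without a network call. Below is a sketch that works on any iterable of text deltas; `collect_until` and the `<END>` sentinel are hypothetical names for illustration, and for the sentinel to appear you would have to instruct the model to emit it.

```python
from typing import Callable, Iterable


def has_conclusive_answer(text: str, sentinel: str = "<END>") -> bool:
    """Sentinel-based check: done as soon as the marker shows up in the output."""
    return sentinel in text


def collect_until(deltas: Iterable[str], is_done: Callable[[str], bool]) -> str:
    """Consume text deltas until is_done(accumulated_text) is true, then stop."""
    parts = []
    for delta in deltas:
        parts.append(delta)
        # Check the joined text so a sentinel split across two chunks is still found.
        if is_done("".join(parts)):
            break
    return "".join(parts)
```

Because the check runs on the accumulated text, nothing after the chunk that completes the sentinel is pulled from the stream.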
Observability: The Thing You Skip Until It's Too Late
I saved this for last because it's the least fun. But it's also the only one that prevents you from regressing. Every other technique in this article is useless if you can't measure whether it's working.
You need three things recorded per request:
- What model you called
- How many input/output tokens you used
- Whether you served it from cache, compressed it, batched it, etc.
Log those. Tag them. Put them in a dashboard. Without this, you're flying blind, and the next time your bill spikes you'll have no idea whether it's because someone changed a prompt or because someone re-routed all traffic to GPT-4o for a week.
```python
import logging

logger = logging.getLogger("llm-cost")

def tracked_call(model, messages, **kwargs):
    resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
    usage = resp.usage
    # estimate_cost(model, usage) is a pricing helper that is not shown here.
    logger.info("llm.call", extra={
        "model": model,
        "in_tokens": usage.prompt_tokens,
        "out_tokens": usage.completion_tokens,
        "est_cost_usd": estimate_cost(model, usage),
    })
    return resp
```
I won't pretend this is glamorous. It isn't. But you can't optimize a system you can't see.
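The logging snippet leans on an `estimate_cost` helper. A minimal sketch follows, assuming a hand-maintained price table. The output rates are the ones quoted earlier in this article; the input rates are placeholders set equal to the output rates, which is an assumption, so replace them with your provider's published numbers. Returning `None` for an unknown model is also my choice, made so a missing price never breaks a request.

```python
# USD per million tokens as (input_rate, output_rate).
# Output rates come from the tables in this article; input rates are PLACEHOLDERS.
PRICES_PER_M = {
    "Qwen/Qwen3-8B": (0.01, 0.01),
    "deepseek-v4-flash": (0.25, 0.25),
}


def estimate_cost(model, usage, prices=PRICES_PER_M):
    """Estimate the USD cost of one call from its token usage.

    `usage` is any object with prompt_tokens and completion_tokens attributes.
    """
    rates = prices.get(model)
    if rates is None:
        return None  # unknown model: log the call anyway, just without a cost
    in_rate, out_rate = rates
    cost = (usage.prompt_tokens * in_rate + usage.completion_tokens * out_rate) / 1_000_000
    return round(cost, 6)
```

A `None` in the cost column is itself a useful signal: it means someone started calling a model that nobody added to the price table.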
Putting It All Together
Let me give you a back-of-the-envelope for a "medium" workload. Say you're doing 1M LLM calls per month, average 500 input tokens and 300 output tokens.
| Strategy | Baseline Cost | Optimized Cost | Cumulative Savings vs. Baseline |
|---|---|---|---|
| Default: GPT-4o everywhere | $3,300 | — | — |
| Smart model selection | — | $330 | 90% |
| + Tiered routing | — | $165 | 95% |
| + Caching (40% hit rate) | — | $99 | 97% |
| + Prompt compression | — | $74 | 97.8% |
| + Batching where possible | — | $60 | 98.2% |
These numbers are conservative. I've seen teams go further.
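One thing the table hides is that the steps multiply rather than add: each technique removes a fraction of whatever bill is left after the previous one. The sketch below reproduces the table under that reading. The per-step fractions are back-derived from the table's own dollar figures (for example, $330 to $165 is a 50% cut), so they are my interpretation rather than independent measurements.

```python
BASELINE_USD = 3300.0  # "GPT-4o everywhere" row from the table

# (label, fraction of the REMAINING bill removed by this step)
STEPS = [
    ("Smart model selection", 0.90),
    ("Tiered routing", 0.50),
    ("Caching (40% hit rate)", 0.40),
    ("Prompt compression", 0.25),
    ("Batching where possible", 0.19),
]


def stack(baseline, steps):
    """Apply each reduction to the remaining cost; return (label, cost, cumulative % saved)."""
    cost = baseline
    rows = []
    for label, reduction in steps:
        cost *= 1 - reduction
        saved_pct = (1 - cost / baseline) * 100
        rows.append((label, round(cost), round(saved_pct, 1)))
    return rows


for label, cost, saved in stack(BASELINE_USD, STEPS):
    print(f"{label:<26} ${cost:>5}  {saved}%")
```

The practical consequence is diminishing returns: model selection moves the bill by thousands of dollars, while the last step moves it by about fourteen.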
A Note On Where I'm Routing These Days
I've been pushing most of my traffic through Global API lately, mostly because I'm tired of maintaining five different SDK credentials and watching three of them rotate at different cadences. They expose a single OpenAI-compatible endpoint at https://global-apis.com/v1, so the code samples above literally work as-is — just swap the base URL and you're done. If you're neck-deep in a multi-vendor setup and want one bill to look at, it's worth checking out. Not sponsored, just a thing that made my life easier.
The whole thesis of this article, though, doesn't depend on which gateway you pick. Pick whichever one you want, but pick a cheap model for the easy stuff. That's the 40x lever. Everything else is fine-tuning on top.