The demo was cheap. Then you shipped, traffic grew, and the monthly model bill quietly became one of your largest infrastructure line items. LLM spend scales linearly with usage, and most teams leave 50–90% of it on the table because the easy wins are invisible until you go looking.
Here are nine tactics that actually move the number, ordered roughly from highest to lowest leverage. None of them require switching providers, and most are a few hours of work.
1. Stop paying twice for the same answer (exact-match caching)
A surprising share of production traffic is duplicate prompts: the same FAQ, the same summarization of the same document, the same system-prompted classification. Hash the full request (model + messages + params) and cache the response.
import hashlib, json, redis
r = redis.Redis()
def cached_completion(model, messages, **kw):
key = "llm:" + hashlib.sha256(
json.dumps({"m": model, "msgs": messages, **kw}, sort_keys=True).encode()
).hexdigest()
if (hit := r.get(key)):
return json.loads(hit)
resp = call_model(model, messages, **kw)
r.setex(key, 86400, json.dumps(resp)) # 24h TTL
return resp
For deterministic calls (temperature=0) this is free money. Cache hit = zero tokens.
2. Catch near-duplicates too (semantic caching)
Exact-match misses "What's your refund policy?" vs "How do refunds work?". Embed the query, search a vector store, and if the nearest cached question is above a similarity threshold (~0.95), return its answer. Embeddings cost a fraction of a completion, so the math works strongly in your favor at scale. Tune the threshold carefully — too loose and you'll serve wrong answers.
3. Route by difficulty (model cascades)
You do not need your most expensive model for "is this sentiment positive or negative?". Send everything to a small/cheap model first; escalate to the frontier model only when the cheap one signals low confidence or the task is genuinely hard.
def route(task, prompt):
if task in CHEAP_TASKS: # classification, extraction, routing
return call_model("small-cheap-model", prompt)
return call_model("frontier-model", prompt) # reasoning, long-form, code
A well-tuned cascade routinely cuts blended cost-per-request by 60–80% because the long tail of simple requests stops hitting the premium tier.
4. Compress the prompt, not the quality
You pay for every input token, and most prompts are bloated. Three high-ROI trims:
- Shrink the system prompt. A 1,500-token system prompt sent on every request is a tax on every call. Move static instructions into a fine-tune or a shorter canonical version.
- Prune RAG context. Retrieving 20 chunks "to be safe" when 4 answer the question multiplies input cost. Re-rank and keep the top few.
- Summarize history. In long chats, replace old turns with a running summary instead of resending the entire transcript every time.
5. Cap and control output tokens
Output tokens usually cost more than input tokens, and an unbounded max_tokens invites rambling. Set a sensible ceiling, and ask for structured/terse output when you don't need prose:
call_model(model, messages,
max_tokens=256, # bound the worst case
response_format={"type": "json_object"}) # no filler, easy to parse
"Answer in one sentence" or "return only JSON" is a real cost lever, not just a UX choice.
6. Batch when latency allows
Many workloads — nightly enrichment, backfills, evals, bulk classification — don't need real-time responses. Most providers offer an asynchronous batch API at a steep discount (commonly ~50%) for jobs you can wait hours on. Split your traffic: interactive requests go to the real-time endpoint, everything deferrable goes to the batch lane.
7. Use provider-side prompt caching
Several providers now cache a static prompt prefix server-side and bill the cached portion at a large discount on subsequent calls. If you send the same long system prompt or document context repeatedly, order your messages so the stable part comes first and opt into prompt caching. This stacks with tactic #4.
8. Fine-tune to delete the prompt
When a task is narrow and high-volume, a small fine-tuned model can match a big model's quality on that task — with a fraction of the prompt tokens, because the instructions and few-shot examples are baked into the weights. You trade a one-time training cost for a permanently smaller per-request bill. Run the break-even math: above some daily volume, fine-tuning a cheaper base model wins decisively.
9. Measure cost per request, or you're flying blind
You can't cut what you don't see. Log tokens and dollar cost on every call, tagged by feature, model, and user tier. The first time you do this you'll find one endpoint quietly burning a third of the budget.
def log_cost(feature, model, usage):
cost = usage.prompt_tokens * PRICE[model]["in"] + \
usage.completion_tokens * PRICE[model]["out"]
metrics.increment("llm.cost_usd", cost, tags=[f"feature:{feature}", f"model:{model}"])
Watch cost-per-successful-request as your north-star metric — it normalizes for traffic and exposes regressions a raw total hides.
Put it together
Stack these and the savings compound: caching removes duplicate work, routing moves the bulk of traffic to cheap models, prompt compression and output caps shrink what's left, batching discounts the deferrable tail, and observability keeps it all honest. Teams that apply the top four typically see their bill drop by more than half without any user-visible quality loss.
If you'd rather not build the caching layer, router, and cost-tracking dashboard from scratch, the LLM Cost Optimizer bundles these patterns — semantic cache, model-routing logic, token accounting, and ready-to-wire dashboards — so you can start saving this week instead of next quarter.
The mindset shift
LLM cost optimization isn't a one-time cleanup; it's a habit. Treat tokens like you treat database queries — something you profile, budget, and watch. The cheapest token is the one you never send, and the second cheapest is the one a small model handles.
Top comments (0)