Ravi Patel

Posted on Jun 12 • Originally published at ssimplifi.com

LLM cost reduction techniques ranked by ROI: the 5 that matter, the 9 that don't (much)

#llm #costreduction #aicostoptimization #rankedtechniques

There are 14 documented ways to reduce an LLM API bill. Five of them deliver ~80% of the savings; the rest are decimal-point optimisations or scale-specific bets that don't pay back for most teams. The five, in deploy order: provider-native prompt caching, exact-match response caching, model-tier routing, max_tokens discipline, semantic caching. Deploy these in this order and you'll capture most of the cost-reduction wedge in roughly a week of engineering. This post is the opinionated ranking — not the encyclopedia, which lives in the parent LLM cost reduction playbook. Use this post to decide what to do first; use the playbook to deep-dive any technique that's relevant to your specific workload.

The ranking

Rank	Technique	Typical savings	Effort	ROI
1	Provider-native prompt caching	30-60% on input cost	Zero (OpenAI) or trivial (Anthropic)	Highest
2	Exact-match response caching	5-15% on full bill	Half a day with a gateway	Very high
3	Model-tier routing (mini vs flagship)	30-50% on full bill	2-3 days	Very high
4	`max_tokens` discipline	10-20% on output cost	30 minutes	High
5	Semantic response caching	15-30% on full bill	1-2 weeks	High
6-14	The rest	Each <10% on full bill	Varies	Diminishing

The top 5 are roughly Pareto — they cover the 80% of cost reduction that almost every production workload can capture. The remaining 9 techniques (prompt compression, batch APIs, structured outputs, deterministic skip rules, streaming cancellation, etc.) are either small wins or workload-specific bets that don't generalise.

Why this order

Three principles drive the ranking:

1. Highest leverage per engineering hour. Every saving has to be earned with engineering time. Techniques that capture meaningful savings in trivial effort rank above ones that require restructuring application code.

2. Lowest risk to quality. Cost reduction that degrades output quality isn't really cost reduction — it's a downgrade. Techniques that are quality-neutral (or quality-positive, like routing the right model per task) rank above ones that require careful quality validation (semantic caching false-positive risk).

3. Independence of stack. Techniques that work on any provider and any application stack rank above ones that require specific infrastructure (e.g. self-hosted models, batch-API integration).

The five below satisfy all three principles. The 9 below the cut don't — they're either niche, risky, or low-leverage.

#1 — Provider-native prompt caching (the free 30-60%)

What it is: the LLM provider's own server-side prefix-attention caching. Anthropic discounts cache-read tokens to 10% of normal input price (a 90% discount, with a 25% write premium on first writes). OpenAI discounts cached tokens to 50% automatically on prompts ≥1,024 tokens with no caller-side configuration.

Why it's #1: the work-to-savings ratio is unmatched. On OpenAI, you do nothing — the discount appears automatically. On Anthropic, you add a one-line cache_control: {"type": "ephemeral"} marker to the stable portion of your prompt. Most production system prompts cross the threshold where caching engages. Verifying it works is a one-line check: response.usage.cache_read_input_tokens > 0 (Anthropic) or response.usage.cached_tokens > 0 (OpenAI).

Typical impact: 30-60% reduction in input-token cost on workloads with stable system prompts (which is almost every production workload). Input tokens are usually 70-80% of total cost on long-context applications; the bottom-line bill cuts roughly proportionally.

What it doesn't do: anything for output cost or for prompts under ~1,024 tokens. Workloads with novel prompts on every request (e.g. one-shot transformation of fresh content) don't benefit.

Deploy in: 30 minutes for Anthropic marker attachment; verify on OpenAI within 5 minutes by checking the usage block.

Deep dive: Anthropic prompt caching explained, prompt caching glossary.

#2 — Exact-match response caching (the safe 5-15%)

What it is: fingerprint each request with a SHA-256 hash, store responses in a key-value store, return the cached response on byte-identical repeats. Sub-10ms p95 lookup. Provably correct because the fingerprint guarantees byte-equivalence.

Why it's #2: the cheapest safe lift after provider-native passthrough. Hit rates of 5-15% are conservative on real production traffic; the savings stack with provider-native (cached request avoided the model call entirely; provider-native discounts the calls that go through). Implementation effort is half a day with a managed gateway, 2-3 days self-built. False-positive rate is zero by construction.

Typical impact: 5-15% bill reduction. Lower than #1 because hit rates are smaller, but the savings are pure (no provider call at all on hits).

Where it shines: workloads with deterministic patterns (cron jobs, evaluation runs, regression tests) where hit rates can exceed 50%. Where exact match works it works really well.

What it doesn't catch: anything where the user's request varies even slightly in wording. The 5-15% production hit rate is the long-tail residual after personalised context and timestamp drift defeat most matches.

Deploy in: half a day with Prism / Portkey / Helicone / Cloudflare AI Gateway (which all ship Layer 1 caching). 2-3 days if building on Redis directly. Use a managed gateway unless you have specific reason not to — the build complexity is in the fingerprinting discipline, not the storage, and the gateway has that worked out.

Deep dive: exact vs semantic caching for LLMs, prompt cache fingerprinting pitfalls.

#3 — Model-tier routing (the structural 30-50%)

What it is: classify each request by task type, route simple tasks to a small fast model (GPT-4o-mini, Claude Haiku, Gemini Flash, Groq Llama 8B), route complex reasoning to a frontier model (Claude Opus, GPT-5, Gemini Pro). The router decides per request based on a classifier or a heuristic rule set.

Why it's #3: the largest structural lever once provider-native and exact caching are in place. The price gap between tiers is meaningful — GPT-5.4 Mini costs ~1/3.3rd of GPT-5.4 (and previously GPT-4o-mini was ~1/16th of GPT-4o; the per-tier gap has narrowed in the GPT-5 generation but is still substantial); Claude Haiku 4.5 is ~1/5th of Claude Sonnet 4.6 and ~1/5th of Claude Opus 4.7. Routing simple tasks to the smaller model captures 70%+ of that gap on the simple-task slice, often with no measurable quality regression.

Typical impact: 30-50% bill reduction on workloads with mixed task complexity. The dependency is "mixed" — if your traffic is uniformly reasoning-heavy, routing has nothing to optimise. If it's uniformly simple, you should be using mini-class models for everything.

Where the risk lives: quality regression on tasks that look simple but actually need reasoning depth. The mitigation is feedback-signal monitoring: capture thumbs-down rates per feature; if regressions show up, route the affected task type back to the higher-tier model. A/B test routing changes before universal deployment.

Deploy in: 2-3 days for a basic mode-driven router (caller declares intent via header; gateway picks the model). Longer for a full classifier-driven router that infers task type from the prompt. Most teams start with mode-driven and add classifier later.

Deep dive: task-type routing glossary, LLM routing glossary.

#4 — `max_tokens` discipline (the 30-minute 10-20%)

What it is: set the max_tokens request parameter aggressively per task type. The OpenAI SDK defaults to 4096 if unset; if your response only needs 200 tokens, the model has a 4096-token budget to fill. Output tokens cost 4-5x input tokens; constraining output is the highest-leverage zero-engineering-cost change available.

Why it's #4: ROI is dramatic for effort. 30 minutes of code to set per-task-type defaults in your application config; 10-20% output-cost reduction across the board. No infrastructure, no quality regression (the model generates appropriate-length responses naturally), no edge cases.

Typical impact: 10-20% on output cost on workloads where defaults are loose. Less on workloads where output is naturally bounded.

Trade: occasionally truncated responses if max_tokens is set too tight. Mitigation: per-task defaults with 30% margin above expected output length. Easy to monitor (truncated responses have finish_reason == "length") and adjust.

Deploy in: 30 minutes. Define max_tokens defaults per task type in your application config; apply at request construction time.

#5 — Semantic response caching (the powerful but careful 15-30%)

What it is: embed the user's prompt with a sentence-embedding model, look up the nearest stored embedding in a vector index, return the cached response if cosine similarity exceeds a threshold (default 0.95). Catches paraphrased versions of repeat questions — "How do I reset my password?" and "I forgot my password, what do I do?" embed close enough to share a cached answer.

Why it's #5: the wedge is large (15-30% on workloads with paraphrasable intent — customer support, FAQ, documentation Q&A) but the implementation effort + ongoing discipline are real. Threshold tuning + false-positive monitoring + per-workload calibration take a week or two of careful work to land cleanly.

Where the risk lives: false positives. Two semantically-distinct prompts can embed close enough to cross the threshold; the cache returns the wrong response and the user gets bad information they can't trace back to a cache. Mitigation is sampled validation: pull 100 random hits weekly, judge each for appropriateness, tune threshold against the false-positive rate.

Typical impact: 15-30% bill reduction on chatbot/FAQ workloads; <5% on code generation; near-zero on workloads with truly novel requests every time.

Deploy in: 1-2 weeks including threshold-tuning discipline. Use a managed gateway with semantic-cache built in (Prism, LiteLLM, Cloudflare); building Layer 2 from scratch is 4-6 weeks of careful engineering.

Deep dive: exact vs semantic caching for LLMs, semantic cache glossary.

The cumulative effect

If you deploy all 5 in order on a workload where they apply (chatbot, FAQ, support tool — the typical production shape):

After deploying	Bill reduction (cumulative)
#1 Provider-native	30-45%
#1+#2 Provider-native + exact cache	35-55%
#1+#2+#3 ... + routing	50-70%
#1+#2+#3+#4 ... + max_tokens	55-75%
#1+#2+#3+#4+#5 ... + semantic	60-80%

The diminishing returns are real — each layer captures less than the one before because the upstream layers already removed the easiest savings. But the cumulative bill cut of 60-80% on workloads where all five apply is the production norm for well-instrumented LLM systems. Skipping the top 2 (provider-native + exact cache) is the most common mistake; teams spend engineering time on niche optimisations while ignoring the techniques with the best ROI.

The 9 techniques below the cut

The encyclopedic LLM cost reduction playbook covers the full 14-technique list. The 9 that didn't make this cluster's top 5, with notes on when each does pay off:

Technique	When to deploy
Async batching with Batch API	Offline workloads tolerant of 24h latency — analytics, evaluation runs, content moderation passes. 50% discount; big win when applicable. Doesn't apply to real-time.
Prompt compression (LLMLingua, etc.)	Long-context workloads (>10K tokens regularly) where you can afford the extra inference. 20-40% input-token cut when carefully implemented; quality risk if not.
Structured output / JSON mode	Extraction + classification workloads with bounded response shape. 30-50% output reduction on the applicable slice.
Deterministic skip rules	Application-layer optimisation: don't send the LLM things rule-based logic can answer. High-leverage when applicable; obvious in hindsight.
Streaming cancellation discipline	High-volume streaming workloads with frequent abort patterns. 5-10% on the streaming slice.
Memoisation at the function layer	Above-the-gateway optimisation. Catches repeated identical calls in the same session. Trivial to add; small wins.
Provider arbitrage (cheapest healthy host per request)	High-volume workloads using open-weights models hosted across multiple providers. Operational complexity is real; savings are 10-30% on the affected slice.
Model downsizing on cache miss	Retry cache misses with a cheaper model. Quality-risky; useful only on workloads where slightly worse responses are acceptable.
Self-hosting open-weights	Only justifies above ~$30-50K/month spend with dedicated SRE capacity. Strategic decision, not a quick win.

None of these belong in the "first week" project. They're for the post-top-5 optimisation pass when you've captured the headline wins and are looking for the last 10-20% on top.

What this priority order rules out

Common mistakes that this ranking rules out:

Starting with self-hosted models. Tempting because the unit economics look great on paper. In practice, self-hosting only makes sense at significant scale + with dedicated infrastructure capacity. Below ~$30K/month spend, the engineering + ops cost dominates the savings. The top 5 above all apply at any scale.

Starting with prompt compression. Heavy engineering investment for moderate savings. Belongs in the second-pass optimisation, not the first.

Skipping caching to "just route to a cheaper model." Routing is #3; caching #1 and #2 stack with routing for compound savings. Routing alone leaves money on the table.

Routing to mini for everything. Aggressive but quality-naive. Routing should be task-driven, not blanket.

Caching with semantic threshold below 0.93. False-positive rates climb non-linearly below 0.93 on broad-domain workloads. Stay conservative until sampled validation justifies tuning down.

How Prism handles the top 5

Prism deploys all five techniques as default-on product features:

Provider-native passthrough (#1): Anthropic 90% + OpenAI 50% discounts on cached tokens are read from the upstream usage block and passed through to customer billing. The X-Prism-Native-Cache-Saved-Cents response header surfaces the per-request saving.
Exact-match caching (#2): Layer 1 of Prism's 3-layer cache. Default-on for every paid request; sub-8ms p95 lookup; per-project scope (Pro+).
Model-tier routing (#3): customer sets X-Prism-Mode: eco / balanced / sport per request; the router picks the right model per task type via a calibrated routing table. See /models for the live catalog.
max_tokens defaults (#4): sensible per-mode defaults applied when not specified by the caller. Customer overrides via the max_tokens request param.
Semantic caching (#5): Layer 2 of Prism's 3-layer cache. Default-on at threshold 0.95; per-project tuning on Pro+ via X-Prism-Cache-Threshold header.

The cumulative effect is what powers the savings counter on the landing page — real customer savings across all 5 techniques, computed per request, aggregated live.

Decision framework

If you're cutting an LLM bill on a real workload:

Verify provider-native is engaging. Check cache_read_input_tokens / cached_tokens in usage blocks. If zero, fix the stable-prefix structure or attach Anthropic markers. 30-min change for the largest single saving.
Add exact-match caching via a gateway. Half-day deployment; 5-15% lift immediately.
Set max_tokens per task type. 30-minute change; 10-20% output-cost cut.
Add model-tier routing. 2-3 day project; 30-50% lift on mixed-complexity workloads.
Layer semantic caching with threshold-tuning discipline. 1-2 week project; 15-30% additional lift on paraphrasable-intent workloads.

Stop here for the first pass. By the end you've cut the bill 60-80% on workloads where the techniques apply, in roughly a week of focused engineering. The remaining 9 techniques are for the second optimisation pass, months later, when the headline wins are captured and the law of diminishing returns sets the agenda.

Where to go next

The full 14-technique encyclopedia: LLM cost reduction playbook. The OpenAI-specific deep dive: OpenAI cost optimization. The caching layer (#1, #2, #5 in the ranking above): AI API caching. The routing wedge (#3): task-type routing.

For modelling savings on your specific workload: savings calculator. For estimating cache hit rates: cache hit rate estimator. For comparing per-model costs across the catalog: cost comparison by model.

FAQ

Why isn't fine-tuning on the list?

Fine-tuning is a quality + control optimisation, not a cost-reduction technique by itself. It can incidentally reduce cost (a smaller fine-tuned model that matches a frontier model's quality for your specific task), but the engineering and dataset-curation work involved make it a strategic move, not a "do this in week one" win.

Why isn't switching to GPT-4o-mini for everything on the list?

That's #3 (model-tier routing) done wrong. Routing is task-driven — simple tasks to mini, complex to frontier. Blanket-switching everything to mini means quality regression on the workload slice that actually needs frontier capability. The right move is the conditional route, not the blanket switch.

Is "use a cheaper provider" a cost-reduction technique?

Sort of, but a noisy one. Provider pricing changes monthly; the "cheaper" provider today may not be cheaper tomorrow. More fundamentally, per-token cost differences across providers at the same quality tier are usually under 20% — meaningful but not the structural wedge that caching + routing represent. Best treated as a tactical optimisation within the model-tier-routing umbrella (#3), not as its own technique.

How much can I save in total?

60-80% on workloads where all 5 techniques apply, with no quality regression. Less on workloads where caching is structurally difficult (novel-prompt traffic) or routing has nothing to optimise (uniformly complex tasks). The first pass through the top 5 reliably captures 40%+ on almost every production workload.

Do I need a gateway to do all this?

Most of it, yes. Provider-native passthrough (#1) works whether you use a gateway or call providers directly. The rest — exact + semantic caching, routing, max_tokens discipline — can be built from scratch but the engineering effort is substantial. A managed gateway (Prism, Portkey, LiteLLM, Cloudflare AI Gateway) ships these as default features; the rest of your time is integration not infrastructure.

What if I'm only spending $200/month on LLMs?

Skip everything except #1 (provider-native, free) and #4 (max_tokens, 30-min change). Below ~$1K/month spend, the engineering time to deploy the rest exceeds the savings. Revisit when the bill crosses $1-2K/month.

What's the order if I already have caching deployed?

Verify provider-native is engaging (#1), set max_tokens (#4), add routing (#3). You've already captured #2 + likely #5 by having caching. The remaining gaps are usually provider-native (often missed because OpenAI's is automatic and Anthropic's needs markers) and routing (often missed because teams default to one model for everything).

For the deep technical detail on any of the top 5: AI API caching covers #1+#2+#5; LLM cost reduction covers all 14; task-type routing covers #3. For modelling your specific workload: savings calculator.

DEV Community

LLM cost reduction techniques ranked by ROI: the 5 that matter, the 9 that don't (much)

The ranking

Why this order

#1 — Provider-native prompt caching (the free 30-60%)

#2 — Exact-match response caching (the safe 5-15%)

#3 — Model-tier routing (the structural 30-50%)

#4 — `max_tokens` discipline (the 30-minute 10-20%)

#5 — Semantic response caching (the powerful but careful 15-30%)

The cumulative effect

The 9 techniques below the cut

What this priority order rules out

How Prism handles the top 5

Decision framework

Where to go next

FAQ

Top comments (0)

The ranking

Why this order

#1 — Provider-native prompt caching (the free 30-60%)

#2 — Exact-match response caching (the safe 5-15%)

#3 — Model-tier routing (the structural 30-50%)

#4 — max_tokens discipline (the 30-minute 10-20%)

#5 — Semantic response caching (the powerful but careful 15-30%)

The cumulative effect

The 9 techniques below the cut

What this priority order rules out

How Prism handles the top 5

Decision framework

Where to go next

FAQ

#4 — `max_tokens` discipline (the 30-minute 10-20%)