Sunil Kumar

Token Cost Optimization in Production LLMs: 3 Approaches With Real Numbers

We were burning $4,100/month on inference for one fintech client. Here's the three-part stack that cut it to $1,560, without touching the model.

LLM inference costs are the silent budget killer of production AI. You see a demo that costs pennies to run. You ship it, users arrive, the corpus grows, query complexity rises — and suddenly you're looking at a cloud bill that nobody planned for.

We hit this on a fintech client's internal compliance Q&A system. At launch: ~2,000 queries/day, average prompt length 1,800 tokens, GPT-4 for everything. Monthly inference bill: $4,100. Three months post-launch: 6,000 queries/day, with the average prompt ballooning to 2,400 tokens from accumulated context. Projected bill: $13,000/month. Nobody had modelled usage growth.

Here's the three-layer optimization stack we implemented, with exact numbers from that engagement.

01 Prompt compression — trim the fat before it hits the model
The most direct lever: reduce the token count of every prompt before it reaches the inference endpoint. This sounds obvious. Most teams don't do it because the naive approach (just truncate) destroys quality. The right approach uses semantic compression.

We used LLMLingua from Microsoft Research, a small model that compresses prompts by removing tokens that are statistically low-information relative to the query, while preserving semantic content. On our fintech client's prompts, we achieved 38% compression with less than 3% degradation in answer quality on the golden dataset.
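In practice LLMLingua's own compressor does the heavy lifting. As a toy illustration of the underlying idea only — score tokens, drop the lowest-information ones until a target ratio is hit — here is a self-contained sketch (this frequency-based scoring is a stand-in, not LLMLingua itself, which uses a small LM's perplexity scores):

```python
from collections import Counter

def compress_prompt(text: str, keep_ratio: float = 0.62) -> str:
    """Toy compression: drop the most frequent (lowest-information) tokens
    until only keep_ratio of the original tokens remain, preserving order.
    A real system scores tokens with a small language model instead."""
    tokens = text.split()
    counts = Counter(t.lower() for t in tokens)
    # Rank token positions: rarer tokens carry more information, so keep them.
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i].lower()])
    keep = set(ranked[: int(len(tokens) * keep_ratio)])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

prompt = ("the client asked about the KYC rules and the KYC thresholds "
          "for the new onboarding flow and the audit trail")
short = compress_prompt(prompt, keep_ratio=0.6)
print(len(short.split()), "of", len(prompt.split()), "tokens kept")
```

Note how high-frequency filler ("the") is dropped first while domain terms ("KYC", "onboarding", "audit") survive — the same intuition LLMLingua applies with a proper statistical model.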


Compression adds ~120ms of CPU latency. For our use case (internal tool, not real-time), this was acceptable. If you're building a customer-facing product where P95 latency matters, benchmark this carefully — it may not always be worth it.

✓ On 2,400-token average prompts, 38% compression saves ~912 tokens per query. At $0.03/1K tokens (GPT-4), that's $0.027/query. At 6,000 queries/day: ~$162/day, ~$4,860/month, from compression alone.
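The arithmetic in that callout, as a quick sanity check (figures are from the engagement above; $0.03/1K is the assumed GPT-4 input price):

```python
avg_prompt_tokens = 2400
compression_ratio = 0.38        # fraction of tokens removed
price_per_1k = 0.03             # GPT-4 input price, $ per 1K tokens
queries_per_day = 6000

tokens_saved = int(avg_prompt_tokens * compression_ratio)          # 912
saved_per_query = round(tokens_saved / 1000 * price_per_1k, 3)     # $0.027
daily_savings = saved_per_query * queries_per_day                  # $162/day
monthly_savings = daily_savings * 30                               # $4,860/month
print(f"${daily_savings:.0f}/day, ${monthly_savings:.0f}/month")
```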

02 Intelligent model routing — not everything needs GPT-4
The second insight sounds simple, but gets skipped: most queries in a production system don't require your most expensive model. Simple factual lookups, short-answer questions, classification tasks — these can be handled by a cheaper model with no perceptible quality difference to the user.

We built a lightweight router that classifies incoming queries by complexity before they hit the inference endpoint. Simple queries go to GPT-3.5-turbo (or equivalent). Complex, multi-hop, or reasoning-heavy queries go to GPT-4. The classification itself is done with a fine-tuned small model (300M parameters) that adds ~15ms of latency.
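A minimal sketch of that routing layer. The classifier here is a stub (in the real system it was a fine-tuned 300M-parameter model), and the model names are illustrative:

```python
from dataclasses import dataclass

CHEAP_MODEL = "gpt-3.5-turbo"
EXPENSIVE_MODEL = "gpt-4"
SIMPLE_THRESHOLD = 0.80   # below this confidence, always route up

@dataclass
class Classification:
    label: str            # "simple" or "complex"
    confidence: float

def classify(query: str) -> Classification:
    """Stub for the fine-tuned complexity classifier."""
    simple_markers = ("what is", "when", "define")
    if query.lower().startswith(simple_markers):
        return Classification("simple", 0.92)
    return Classification("complex", 0.88)

def route(query: str) -> str:
    c = classify(query)
    # Take the cheap path only when the classifier is confident AND says simple;
    # anything ambiguous goes to the expensive model.
    if c.label == "simple" and c.confidence >= SIMPLE_THRESHOLD:
        return CHEAP_MODEL
    return EXPENSIVE_MODEL

print(route("What is the KYC filing deadline?"))
print(route("Compare our AML duties across EU and US regimes"))
```

The asymmetry is deliberate: a false "simple" call degrades answer quality, while a false "complex" call only costs money, so the threshold biases toward routing up.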


In our fintech client's query distribution, 61% of queries were classifiable as "simple" (lookup, boolean, date-retrieval). Routing those to GPT-3.5-turbo cut cost per query by ~93% on that segment. Blended across all traffic: a ~57% cost reduction (0.61 × 0.93 ≈ 0.57).

⚠ Do not use a threshold below 0.80 for your complexity classifier. At 0.70, we saw too many complex queries slipping through to the cheaper model, which produced noticeably lower quality answers on multi-part compliance questions. Trust the uncertainty — if it's not clearly simple, route up.

03 Semantic caching — stop paying for identical questions
In any production deployment with hundreds or thousands of users, a meaningful percentage of queries are semantically identical even if lexically different. "What's the KYC requirement?" and "Can you explain the know-your-customer process?" are the same query. Without a cache, you pay full inference cost for both.

Semantic caching embeds incoming queries and compares them against a cache index. If a semantically similar query exists (above a cosine similarity threshold), you return the cached response. No model call required.


On the fintech compliance system, the cache hit rate stabilised at 34% after two weeks: 34% of queries returned a cached answer at zero inference cost. With all three layers in place — compression, routing, and caching — the monthly bill dropped from $4,100 to $1,560.

→ Implementation order matters
If you're implementing these on an existing system, do them in this order:

  1. Semantic caching first — zero infrastructure complexity, immediate payoff on any system with repeated query patterns. Measurable in 72 hours.
  2. Model routing second — requires building or fine-tuning a classifier, but the ROI is significant if your query distribution is mixed-complexity (most are).
  3. Prompt compression third — most engineering effort, requires calibration against your golden dataset. Worth it at scale, but don't start here.

✓ Before you implement any of these: instrument everything. If you don't have per-query token counts, model selection, and cache hit rate logged today, you're flying blind. Add logging first. Optimize second.
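A per-query log record along those lines needs only a handful of fields. This is a minimal sketch — field names are suggestions, not a standard, and `print` stands in for whatever ships records to your log pipeline:

```python
import json
import time
import uuid

def log_query(query: str, model: str, prompt_tokens: int,
              completion_tokens: int, cache_hit: bool) -> dict:
    """Emit one structured record per query; aggregating these gives you
    token counts, model mix, and cache hit rate."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cache_hit": cache_hit,
        "query_chars": len(query),   # log length, not raw text, for privacy
    }
    print(json.dumps(record))        # replace with your log shipper
    return record

r = log_query("What is the KYC requirement?", "gpt-3.5-turbo", 1480, 210, False)
```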

If your team is burning more than $1,000/month on inference and you haven't implemented semantic caching yet, that's your fastest win. The model routing classifier takes longer to build but pays back disproportionately if your query mix is heterogeneous.

What optimization approaches is your team using? Drop a comment — I'm specifically curious whether anyone's had success with speculative decoding or prefix caching at the infrastructure level.

We run production AI delivery engagements at Ailoitte. If you're wrestling with runaway inference costs, the architecture decisions are usually fixable without changing the model.
