Orvi Das

Posted on Jun 23

From Code to Governance: The Complete Guide to LLM Token Optimization

#ai #llm #rag

Your token costs are growing faster than your usage. You've already optimized model selection on non-critical paths. Now you need real wins on your main feature without tanking quality.

Most token optimization advice is too generic. "Use shorter prompts" or "cache your context" is true but useless—it doesn't tell you where the actual bloat is, what the real tradeoffs look like, or when to stop optimizing because you're just hurting yourself.

This guide covers the full stack: code-level techniques (structured output, trimming, compression, caching, batching), infrastructure wins, and the cost governance layer that actually makes this stick in production. Each has real numbers. By the end you'll know what works, what doesn't, and when you're optimizing the wrong thing.

Part 1: Code-Level Wins

Technique 1: Structured output

The fastest token reduction doesn't come from asking the model to be brief. It comes from making brevity the only option.

Real example: Comment moderation

Ask Claude to moderate a comment in free-form, and you get 70 tokens of explanation: "This comment appears to be spam advertising a crypto scheme. I'm recommending removal because it violates the community guidelines against financial solicitation. The user account is new with no post history, which is a common spam pattern."

Now ask it to return JSON with just {decision, reason}:

{"decision": "remove", "reason": "Spam: crypto solicitation"}

That's 12 tokens. Same information. 83% reduction.

On 10,000 requests/day, that's $730/year. Not earth-shattering, but it's the easiest win on this list.

When this helps: Classification, extraction, structured data. When it hurts: Creative writing, exploratory analysis, anything where the model needs room to think.

Technique 2: Trim context before you send it

You retrieve 5 documents for RAG. The model needs maybe 2.

The catch: similarity score (what retrieval gives you) isn't the same as relevance (what you actually need). Document #1 matches semantically but doesn't answer the question. Document #4 does.

The fix: run a cheap model (Haiku) to score the top 5 by relevance, keep the top 2, send those to Sonnet.

Cost math:

Scoring pass: 50 tokens at Haiku rates = $0.00004
Context savings: 9,300 tokens at Sonnet rates = $0.028
Net: save $0.028 per query, or $280/day at 10k queries

The scoring step pays for itself in the first two queries.

When this works: Multi-document RAG, conversation trimming, example selection. When it breaks: When the answer is in document #5 and your relevance scorer just doesn't see it. Always test on real queries.

Technique 3: Compress prompts

Prompts grow. Redundant instructions pile up. You end up with 380 tokens of system message when 120 would do.

A verbose support prompt might ramble: "You are a helpful AI assistant that helps users with customer support tasks. Your role is to help users respond to customer inquiries in a way that is professional, concise, and empathetic. When a user provides an inquiry, you should: 1. Understand what they're asking..." Plus two full examples.

Tighten it: "You are a customer support assistant. For each inquiry: 1. Acknowledge concern 2. Provide a solution 3. Keep responses under 3 sentences. Example: 'I'm sorry you're experiencing login issues. Try resetting your password at example.com/reset.'"

Result: 260 tokens saved (68% reduction).

The risk is real though. Test on 50 real queries before you roll this out. A 10% quality drop kills any token savings.

Part 2: Infrastructure (the bigger wins)

These aren't "techniques"—they're features the APIs already have. Most teams miss them.

Technique 4: Prompt caching

Claude and GPT-4 both cache prompts. Send the same large context once, reuse it for 10% of the cost on subsequent calls.

Example: Support triage

Your support team processes tickets. Every ticket checks against:

Your knowledge base
Previous ticket patterns
Product docs
SLA policies

That context is static. It's identical on ticket #1 and ticket #100.

With caching: ticket #1 sends all that context and pays full price. Tickets #2–#100 (within 5 minutes) reuse the cache at 90% off.

Real math:

Without caching: 1000 tickets/day × $0.15 = $150/day
With caching: ~$20/day
Annual savings: $47,500

The catch: cache TTL is short (5 minutes for ephemeral). Works great for high-volume, repeat-context work. Single-shot queries get zero benefit.

Technique 5: Batch APIs

If you don't need the answer in 30 seconds, batch APIs are 50% cheaper. Dump requests, wait a few minutes, get results back.

Works for:

Daily reports and summaries
Bulk content moderation
Analytics and backlog classification

Doesn't work for:

Real-time user-facing features
Anything where latency matters

Cost: $1.50/M tokens (vs $3.00 for standard).

Part 3: The Real Cost of Optimization

This is where most guides stop lying.

Scenario 1: Compression breaks quality

You compress your system prompt by 60%, save 200 tokens per call. Quality drops 8%. You switch to a more expensive model to compensate. You saved $50/month and lost $500/month in refunds.

Scenario 2: Aggressive trimming misses answers

You trim to top-1 document to save tokens. Now 15% of questions the model should answer, it says "I don't have enough information." Users follow up or escalate. You halved your token spend and tripled your support load.

Scenario 3: Structured output forces bad answers

You constrain output to 3 fields. The model can't express edge cases or uncertainty. Your support team now spends 10 hours/month rewriting outputs. You saved $50/month in tokens and burned 10 hours of labor.

The rule: Measure quality alongside cost. A 10% token reduction that drops accuracy 5% is a bad trade. The cheapest optimization is the one you don't have to do—like caching, which doesn't hurt quality.

Part 4: A Real Pipeline

Say you're building a RAG tool for a 50-person company.

Before optimization:

100 queries/day
200 input tokens + 400 output
60,000 tokens/day
$180/month on Sonnet

After optimization:

Cache system prompt (90% off cached context)
Score 5 docs, keep 2 (saves 150 tokens/query)
Structured output (saves 250 tokens/response)
Compress prompt (saves 60 tokens/query)

Result: $54/month. 70% reduction.

The bonus: quality went up because you removed noise, not signal.

Part 5: The Thing That Actually Matters

Here's where teams hit the wall.

You've optimized everything. Tokens are down. Life is good. Then a customer's usage explodes. They go from 10 queries/day to 10,000. Or a bug sends malformed queries. Or a feature you built for 100 users goes viral.

Costs spike. You have no idea which users are driving it. You don't know which features are expensive. You can't enforce limits without hard-coding them and disappointing people.

This is cost governance. It's bigger than any code optimization.

The problem

Token-level optimization is local: trim this, compress that. But it doesn't answer:

Which users are driving 80% of my costs?
Which feature is eating the budget?
How do I enforce a per-customer budget without arbitrary caps?
How do I route a request to a cheaper model if the user's budget is low?
How do I know before I send a query whether it'll blow the budget?

These questions live outside the LLM call itself.

The governance layer

Before you make an LLM call, check if the user has budget left. Route to a cheaper model if they don't, or block it. Track spending by user, feature, model.

Building this yourself means:

Token counting at every LLM call (you have to remember)
User budget storage and syncing with billing
Routing logic in every LLM call across your codebase
Billing reconciliation (estimated vs actual, disputes)
Dashboards and alerts

For a SaaS, this is usually bigger than any code optimization. A 50% token reduction might save $200/month. Smart routing that sends 30% of queries to Haiku saves $5,000/month.

Most teams either skip it (and lose visibility) or build it, then rebuild it twice as requirements change. Some use platforms like noburn that handle metering, routing, and billing together—so you focus on code optimization instead of infrastructure.

Measuring

To make this work, you need:

Token counting before every call
Cost tracking per request (which user, feature, model)
Budget management (per-user limits)
Alerts (costs spike, budgets run low)

Then query your data daily to see which features and users are expensive.

The Full Picture

At scale, your system looks like:

User Query
    ↓
[Estimator] — Is this expensive? Over budget?
    ↓
[Router] — Which model? Cache hit? Trim?
    ↓
[LLM Call] (Sonnet, Haiku, or cached)
    ↓
[Meter] — Log tokens and cost
    ↓
[Governance] — Update budget, trigger alerts
    ↓
Response

Each layer does one thing. The estimator tells you if it's worth making the call. The router picks the model and strategy. The meter logs the cost. The governance layer enforces the limits.

What Actually Works (Priority Order)

Prompt caching (if you have repeat context) — 50-90% savings, zero quality loss, minimal work
Structured output — 30-80% output savings, usually improves quality, trivial to implement
Cost governance — No token savings, but prevents blowouts and enables billing. Usually worth more than code optimization.
Context trimming — 20-50% savings, but test to avoid quality regression
Batch APIs — 50% savings for non-realtime work, easy if you have async patterns
Prompt compression — 30-60% savings, very easy to break quality

Common mistakes:

Optimizing before measuring
Compressing without testing
Trimming too aggressively
Building governance without metrics
Switching models and hoping quality holds

Timeline

Week 1: Add token counting to your most expensive feature. Get a baseline.

Week 2: Implement structured output. Measure the improvement. Is quality still good?

Week 3: Add caching if you have repeat context. Measure the cache hit rate.

Week 4: Build cost governance—per-user metering and budget limits. This is where you'll see the biggest wins. You can build it yourself, or use something like noburn. Either way, at scale you need it.

Month 2+: Look at your data. Trim expensive features. Route smartly. Compress bloated prompts.

Most teams see: 50-70% token reduction in a month, 2-3% quality improvement, $5-50k/month in savings (on SaaS).

The Next Level

Once code is optimized and governance is in place:

Model selection frameworks (route to Haiku/Sonnet/Opus by task complexity, not uniformly)
Semantic routing (different query types to different models)
Multi-turn compression (compress conversation history across turns)
Fine-tuning (if you process >1M tokens/month on the same task, it pays for itself)

These are rare before you've exhausted the basics.

Summary

Token optimization is a system, not a trick. It has three layers:

Code: Structured output, compression, trimming, caching (50-80% savings)
Infrastructure: Batch APIs, smart routing (another 30-50% savings, zero quality loss)
Governance: Metering, budgets, routing by budget (prevents blowouts, enables billing)

Together, you go from "costs are out of control" to "we have precise control and can bill customers fairly."

Start with caching and structured output. They're easy and they work. Then build governance—it's the real win. Then compress and trim where the data shows problems.

On governance: This is where most teams struggle. Building metering and routing in-house is doable, but it competes with product work. If you're a SaaS running LLMs for end-users, something like noburn handles metering and per-user billing and pays for itself in engineering time alone. You get to focus on code-level optimization instead of building infrastructure.

Measure everything. Test everything. Don't optimize what you can't measure.

DEV Community