DEV Community: Orvi Das

From Code to Governance: The Complete Guide to LLM Token Optimization

Orvi Das — Tue, 23 Jun 2026 04:44:39 +0000

Your token costs are growing faster than your usage. You've already optimized model selection on non-critical paths. Now you need real wins on your main feature without tanking quality.

Most token optimization advice is too generic. "Use shorter prompts" or "cache your context" is true but useless—it doesn't tell you where the actual bloat is, what the real tradeoffs look like, or when to stop optimizing because you're just hurting yourself.

This guide covers the full stack: code-level techniques (structured output, trimming, compression), infrastructure wins (caching, batching), and the cost governance layer that actually makes this stick in production. Each has real numbers. By the end you'll know what works, what doesn't, and when you're optimizing the wrong thing.

Part 1: Code-Level Wins

Technique 1: Structured output

The fastest token reduction doesn't come from asking the model to be brief. It comes from making brevity the only option.

Real example: Comment moderation

Ask Claude to moderate a comment in free-form, and you get 70 tokens of explanation: "This comment appears to be spam advertising a crypto scheme. I'm recommending removal because it violates the community guidelines against financial solicitation. The user account is new with no post history, which is a common spam pattern."

Now ask it to return JSON with just {decision, reason}:

{"decision": "remove", "reason": "Spam: crypto solicitation"}

That's 12 tokens. Same information. 83% reduction.

On 10,000 requests/day, that's $730/year. Not earth-shattering, but it's the easiest win on this list.

When this helps: Classification, extraction, structured data. When it hurts: Creative writing, exploratory analysis, anything where the model needs room to think.
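Constraining the output only pays off if the application rejects replies that drift from the shape. A minimal sketch of that check, assuming the model's reply arrives as a string; the `parse_moderation` helper and the set of allowed decisions are illustrative, not part of any SDK.

```python
import json

# Assumed label set for illustration; the example above only shows "remove".
ALLOWED_DECISIONS = {"remove", "keep", "escalate"}


def parse_moderation(raw: str) -> dict:
    """Parse a {decision, reason} reply and reject anything off-schema."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"decision", "reason"}:
        raise ValueError("unexpected fields in reply")
    if data["decision"] not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {data['decision']!r}")
    if not isinstance(data["reason"], str):
        raise ValueError("reason must be a string")
    return data


reply = '{"decision": "remove", "reason": "Spam: crypto solicitation"}'
result = parse_moderation(reply)
print(result["decision"], "-", result["reason"])
```

A reply that fails validation can be retried or routed to a human instead of silently accepted.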

Technique 2: Trim context before you send it

You retrieve 5 documents for RAG. The model needs maybe 2.

The catch: similarity score (what retrieval gives you) isn't the same as relevance (what you actually need). Document #1 matches semantically but doesn't answer the question. Document #4 does.

The fix: run a cheap model (Haiku) to score the top 5 by relevance, keep the top 2, send those to Sonnet.

Cost math:

Scoring pass: 50 tokens at Haiku rates = $0.00004
Context savings: 9,300 tokens at Sonnet rates = $0.028
Net: save $0.028 per query, or $280/day at 10k queries

The scoring step pays for itself on every query: it costs well under 1% of what it saves.
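The cost math can be reproduced in a few lines. This sketch uses the token counts from the example; the per-million-token rates are assumptions inferred from the dollar figures quoted there, so substitute your provider's current prices.

```python
# Per-million-token input rates implied by the figures above (assumptions).
HAIKU_INPUT_PER_M = 0.80
SONNET_INPUT_PER_M = 3.00


def token_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of `tokens` at a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


scoring_cost = token_cost(50, HAIKU_INPUT_PER_M)        # extra spend per query
context_saving = token_cost(9_300, SONNET_INPUT_PER_M)  # spend avoided per query
net_per_query = context_saving - scoring_cost
net_per_day = net_per_query * 10_000

print(f"scoring ${scoring_cost:.5f}, saving ${context_saving:.4f}")
print(f"net ${net_per_query:.4f}/query, ${net_per_day:.0f}/day")
```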

When this works: Multi-document RAG, conversation trimming, example selection. When it breaks: When the answer is in document #5 and your relevance scorer just doesn't see it. Always test on real queries.
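The score-then-keep step can be sketched as a small function with a pluggable scorer. In production the scorer would be a call to the cheap model; here a keyword-overlap function stands in so the example runs offline. `trim_context` and `keyword_overlap` are hypothetical names for illustration.

```python
from typing import Callable, List


def trim_context(
    question: str,
    docs: List[str],
    score: Callable[[str, str], float],
    keep: int = 2,
) -> List[str]:
    """Return the `keep` highest-scoring docs, preserving retrieval order."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: score(question, docs[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:keep])
    return [docs[i] for i in chosen]


def keyword_overlap(question: str, doc: str) -> float:
    """Stand-in scorer for the demo; in practice this would call the cheap model."""
    q_words = set(question.lower().split())
    return float(len(q_words & set(doc.lower().split())))


docs = [
    "Our pricing page lists three plans.",
    "Password resets are handled from the login screen.",
    "To reset your password, open Settings and choose Security.",
    "The office is closed on public holidays.",
    "Billing runs on the first of each month.",
]
kept = trim_context("how do I reset my password", docs, keyword_overlap)
print(kept)
```

Keeping the scorer as a parameter makes it easy to replay real queries through different scorers and check whether the answering document survives the cut.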

Technique 3: Compress prompts

Prompts grow. Redundant instructions pile up. You end up with 380 tokens of system message when 120 would do.

A verbose support prompt might ramble: "You are a helpful AI assistant that helps users with customer support tasks. Your role is to help users respond to customer inquiries in a way that is professional, concise, and empathetic. When a user provides an inquiry, you should: 1. Understand what they're asking..." Plus two full examples.

Tighten it: "You are a customer support assistant. For each inquiry: 1. Acknowledge concern 2. Provide a solution 3. Keep responses under 3 sentences. Example: 'I'm sorry you're experiencing login issues. Try resetting your password at example.com/reset.'"

Result: 260 tokens saved (68% reduction).

The risk is real though. Test on 50 real queries before you roll this out. A 10% quality drop kills any token savings.
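One way to make that test a gate rather than a gut call: grade the same query set under both prompts and refuse to ship if the mean score drops by more than a set tolerance. A minimal sketch, assuming you already have per-query scores from human review or an automated grader; the 0.02 tolerance and the sample scores are illustrative.

```python
from statistics import mean
from typing import Sequence


def safe_to_ship(
    baseline: Sequence[float],
    candidate: Sequence[float],
    max_drop: float = 0.02,
) -> bool:
    """True if the compressed prompt's mean score stays within max_drop of baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("score the same non-empty query set with both prompts")
    return mean(baseline) - mean(candidate) <= max_drop


# 1.0 = acceptable answer, 0.0 = not, graded on the same 50 queries.
baseline_scores = [1.0] * 46 + [0.0] * 4    # 92% acceptable
candidate_scores = [1.0] * 41 + [0.0] * 9   # 82% acceptable

print(safe_to_ship(baseline_scores, candidate_scores))
```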

Part 2: Infrastructure (the bigger wins)

These aren't "techniques"—they're features the APIs already have. Most teams miss them.

Technique 4: Prompt caching

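Anthropic's and OpenAI's APIs both support prompt caching. Send the same large context once, then reuse it at a steep discount on subsequent calls: Anthropic bills cached reads at about 10% of the normal input price, while OpenAI's discount varies by model.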

Example: Support triage

Your support team processes tickets. Every ticket checks against:

Your knowledge base
Previous ticket patterns
Product docs
SLA policies

That context is static. It's identical on ticket #1 and ticket #100.

With caching: ticket #1 sends all that context and pays full price. Tickets #2–#100 (within 5 minutes) reuse the cache at 90% off.

Real math:

Without caching: 1000 tickets/day × $0.15 = $150/day
With caching: ~$20/day
Annual savings: $47,500

The catch: cache TTL is short (5 minutes for ephemeral). Works great for high-volume, repeat-context work. Single-shot queries get zero benefit.
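How much caching saves depends on how often a request lands inside the TTL window. The sketch below blends cache hits and misses into a daily cost. The 1.25x write multiplier and the 97% hit rate are assumptions chosen for illustration; the hit rate is picked so the result lands near the ~$20/day figure above.

```python
def daily_context_cost(
    tickets: int,
    full_price: float,
    hit_rate: float,
    read_multiplier: float = 0.10,   # cached read at 90% off, as described above
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
) -> float:
    """Blend cache misses (writes) and hits (reads) into one daily cost."""
    misses = tickets * (1 - hit_rate)
    hits = tickets * hit_rate
    return full_price * (misses * write_multiplier + hits * read_multiplier)


uncached = 1000 * 0.15
cached = daily_context_cost(tickets=1000, full_price=0.15, hit_rate=0.97)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
print(f"annual savings ${(uncached - cached) * 365:,.0f}")
```

Setting `hit_rate=0` shows the downside: with a write premium and no reuse, caching costs more than not caching at all.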

Technique 5: Batch APIs

If you don't need the answer right away, batch APIs are 50% cheaper. Submit the requests, wait, and collect the results. Turnaround is often much faster, but providers only promise completion within 24 hours.

Works for:

Daily reports and summaries
Bulk content moderation
Analytics and backlog classification

Doesn't work for:

Real-time user-facing features
Anything where latency matters

Cost: $1.50/M input tokens on Sonnet (vs $3.00 for standard).

Part 3: The Real Cost of Optimization

This is where most guides stop lying.

Scenario 1: Compression breaks quality

You compress your system prompt by 60% and save 200 tokens per call. Quality drops 8%. Refunds climb, and you switch to a more expensive model to compensate. You saved $50/month in tokens and lost $500/month in refunds, before counting the pricier model.

Scenario 2: Aggressive trimming misses answers

You trim to the top-1 document to save tokens. Now, for 15% of the questions it should be able to answer, the model says "I don't have enough information." Users follow up or escalate. You halved your token spend and tripled your support load.

Scenario 3: Structured output forces bad answers

You constrain output to 3 fields. The model can't express edge cases or uncertainty. Your support team now spends 10 hours/month rewriting outputs. You saved $50/month in tokens and burned 10 hours of labor.

The rule: Measure quality alongside cost. A 10% token reduction that drops accuracy 5% is a bad trade. The cheapest optimization is the one you don't have to do—like caching, which doesn't hurt quality.

Part 4: A Real Pipeline

Say you're building a RAG tool for a 50-person company.

Before optimization:

100 queries/day
200 input tokens + 400 output tokens per query
60,000 tokens/day
$180/month on Sonnet

After optimization:

Cache system prompt (90% off cached context)
Score 5 docs, keep 2 (saves 150 tokens/query)
Structured output (saves 250 tokens/response)
Compress prompt (saves 60 tokens/query)

Result: $54/month. 70% reduction.

The bonus: quality went up because you removed noise, not signal.

Part 5: The Thing That Actually Matters

Here's where teams hit the wall.

You've optimized everything. Tokens are down. Life is good. Then a customer's usage explodes. They go from 10 queries/day to 10,000. Or a bug sends malformed queries. Or a feature you built for 100 users goes viral.

Costs spike. You have no idea which users are driving it. You don't know which features are expensive. You can't enforce limits without hard-coding them and disappointing people.

This is cost governance. It's bigger than any code optimization.

The problem

Token-level optimization is local: trim this, compress that. But it doesn't answer:

Which users are driving 80% of my costs?
Which feature is eating the budget?
How do I enforce a per-customer budget without arbitrary caps?
How do I route a request to a cheaper model if the user's budget is low?
How do I know before I send a query whether it'll blow the budget?

These questions live outside the LLM call itself.

The governance layer

Before you make an LLM call, check if the user has budget left. Route to a cheaper model if they don't, or block it. Track spending by user, feature, model.
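That pre-call check can be sketched as a pure function that runs before any request is sent. The `Budget` type, the `route` function, and the cost estimates are hypothetical; a real system would load the budget from storage and estimate cost from token counts.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit: float   # dollars allowed in the current period
    spent: float   # dollars already used


def route(budget: Budget, est_primary: float, est_cheap: float) -> str:
    """Pick 'primary', 'cheap', or 'block' before any LLM call is made."""
    remaining = budget.limit - budget.spent
    if est_primary <= remaining:
        return "primary"
    if est_cheap <= remaining:
        return "cheap"
    return "block"


print(route(Budget(limit=5.00, spent=1.00), est_primary=0.03, est_cheap=0.004))
print(route(Budget(limit=5.00, spent=4.99), est_primary=0.03, est_cheap=0.004))
print(route(Budget(limit=5.00, spent=5.00), est_primary=0.03, est_cheap=0.004))
```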

Building this yourself means:

Token counting at every LLM call (you have to remember)
User budget storage and syncing with billing
Routing logic in every LLM call across your codebase
Billing reconciliation (estimated vs actual, disputes)
Dashboards and alerts

For a SaaS, this is usually bigger than any code optimization. A 50% token reduction might save $200/month. Smart routing that sends 30% of queries to Haiku saves $5,000/month.

Most teams either skip it (and lose visibility) or build it, then rebuild it twice as requirements change. Some use platforms like noburn that handle metering, routing, and billing together—so you focus on code optimization instead of infrastructure.

Measuring

To make this work, you need:

Token counting before every call
Cost tracking per request (which user, feature, model)
Budget management (per-user limits)
Alerts (costs spike, budgets run low)

Then query your data daily to see which features and users are expensive.
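If every call writes one record with user, feature, model, and cost, the daily query is a simple group-and-sum. A minimal sketch over in-memory records; the field names and sample data are made up for illustration, and a real setup would read from a database or log store.

```python
from collections import defaultdict

# Hypothetical per-request records, one per LLM call.
records = [
    {"user": "acme", "feature": "search", "model": "sonnet", "cost": 0.031},
    {"user": "acme", "feature": "summaries", "model": "haiku", "cost": 0.002},
    {"user": "globex", "feature": "search", "model": "sonnet", "cost": 0.027},
    {"user": "acme", "feature": "search", "model": "sonnet", "cost": 0.029},
]


def spend_by(rows: list, key: str) -> list:
    """Total cost per distinct value of `key`, most expensive first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


for dimension in ("user", "feature"):
    print(dimension, spend_by(records, dimension))
```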

The Full Picture

At scale, your system looks like:

User Query
    ↓
[Estimator] — Is this expensive? Over budget?
    ↓
[Router] — Which model? Cache hit? Trim?
    ↓
[LLM Call] (Sonnet, Haiku, or cached)
    ↓
[Meter] — Log tokens and cost
    ↓
[Governance] — Update budget, trigger alerts
    ↓
Response

Each layer does one thing. The estimator tells you if it's worth making the call. The router picks the model and strategy. The meter logs the cost. The governance layer enforces the limits.

What Actually Works (Priority Order)

1. Prompt caching (if you have repeat context): 50-90% savings, zero quality loss, minimal work
2. Structured output: 30-80% output savings, usually improves quality, trivial to implement
3. Cost governance: no token savings, but prevents blowouts and enables billing. Usually worth more than code optimization.
4. Context trimming: 20-50% savings, but test to avoid quality regression
5. Batch APIs: 50% savings for non-realtime work, easy if you have async patterns
6. Prompt compression: 30-60% savings, very easy to break quality

Common mistakes:

Optimizing before measuring
Compressing without testing
Trimming too aggressively
Building governance without metrics
Switching models and hoping quality holds

Timeline

Week 1: Add token counting to your most expensive feature. Get a baseline.

Week 2: Implement structured output. Measure the improvement. Is quality still good?

Week 3: Add caching if you have repeat context. Measure the cache hit rate.

Week 4: Build cost governance—per-user metering and budget limits. This is where you'll see the biggest wins. You can build it yourself, or use something like noburn. Either way, at scale you need it.

Month 2+: Look at your data. Trim expensive features. Route smartly. Compress bloated prompts.

Most teams see: 50-70% token reduction in a month, 2-3% quality improvement, $5-50k/month in savings (on SaaS).

The Next Level

Once code is optimized and governance is in place:

Model selection frameworks (route to Haiku/Sonnet/Opus by task complexity, not uniformly)
Semantic routing (different query types to different models)
Multi-turn compression (compress conversation history across turns)
Fine-tuning (if you process >1M tokens/month on the same task, it pays for itself)

These are rare before you've exhausted the basics.

Summary

Token optimization is a system, not a trick. It has three layers:

Code: Structured output, compression, trimming (50-80% savings)
Infrastructure: Prompt caching, batch APIs (another 30-50% savings, zero quality loss)
Governance: Metering, budgets, routing by budget (prevents blowouts, enables billing)

Together, you go from "costs are out of control" to "we have precise control and can bill customers fairly."

Start with caching and structured output. They're easy and they work. Then build governance—it's the real win. Then compress and trim where the data shows problems.

On governance: This is where most teams struggle. Building metering and routing in-house is doable, but it competes with product work. If you're a SaaS running LLMs for end-users, something like noburn handles metering and per-user billing and pays for itself in engineering time alone. You get to focus on code-level optimization instead of building infrastructure.

Measure everything. Test everything. Don't optimize what you can't measure.

Hermes Agent ran overnight and I woke up to a $47 bill — so I built a kill-switch

Orvi Das — Tue, 26 May 2026 21:05:23 +0000

This is a submission for the Hermes Agent Challenge

What I built

It was a Tuesday. I gave Hermes Agent a research task before bed: "analyse the top open-source agent frameworks and write a comparison report." Reasonable task. Maybe 10 minutes of work. I'd check the output in the morning.

I woke up to a $47 bill and a 34-page report that no one asked for.

Hermes had hit a tricky subtask around 2am, retried with different approaches, gone deeper on each one, and kept going, because that's what it's supposed to do. Autonomous agent. Autonomy is the feature. The problem is that autonomy doesn't come with a receipt until after you've already paid.

I spent that morning looking for a way to give Hermes a hard spending limit. Not a dashboard alert at $40 that I'd miss while sleeping. A hard stop that fires before the API call, not after. I didn't find one, so I built it.

baar-core is a budget-aware proxy that sits between Hermes and the real LLM providers. Every call Hermes makes goes through a kill-switch first. When a call would push spend past the cap, it gets 402 Payment Required. The provider is never contacted. Cost: $0.00.

from baar.integrations.hermes import BaarHermesSession

with BaarHermesSession(budget=1.00) as session:
    reply = session.run_task("Research the top 5 open-source agent frameworks")
    print(reply)
    print(f"Spent ${session.spent:.4f} of $1.00")

# It cannot spend $1.01. Not $1.005. $1.00 is the ceiling.

Demo

Code

GitHub: github.com/orvi2014/Baar-Core

pip install baar-core[vercel] hermes-agent

Tech stack

Component	Role
Python 3.10+	Core library
Hermes Agent	The agentic runtime being budget-capped
LiteLLM	Unified provider interface + live pricing data
FastAPI + uvicorn	Local OpenAI-compatible proxy server
SQLite (WAL mode)	Persistent spend store, concurrent-safe
pytest	606 tests, all passing

How I used Hermes Agent

Hermes doesn't stop. It plans, tool-calls, reflects, retries until the task is done or you kill the process. That's the whole point of it, and also what caused the $47 bill.

I couldn't change how Hermes works internally, but it lets you point its provider config at any OpenAI-compatible endpoint. So I built one: a local proxy that speaks the OpenAI API and intercepts every LLM call before it leaves the machine.

BaarHermesSession(budget=1.00)
  ├── BaarHermesProxy.start()    ← uvicorn on 127.0.0.1:8080, daemon thread
  └── hermes subprocess          ← HERMES_HOME → temp config pointing to proxy

Each Hermes LLM turn:
  POST /v1/chat/completions → baar proxy (local, no network)
    └── BAARRouter
          ├── pre-flight budget check    → over limit? 402. Zero API calls made.
          ├── complexity routing         → simple task → cheap model, hard task → big model
          └── real provider call via LiteLLM

Hermes thinks it's talking to a provider. Every tool-call, retry, and reflection step goes through the check first.

Getting the timing right took a few attempts. Most cost tracking records spend when the response arrives — by then you've already paid. baar estimates the cost of each call, atomically reserves that amount, makes the call, then reconciles the real cost. Two concurrent Hermes turns can't both pass the check and jointly overshoot the cap, because the reservation step is atomic.
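The reserve-then-reconcile idea can be illustrated with an in-process lock. This is a simplified stand-in and not baar's code: the `BudgetLedger` class is hypothetical, and a `threading.Lock` replaces whatever the persistent store does to make the step atomic.

```python
import threading


class BudgetLedger:
    """Reserve an estimate before the call, then settle to the real cost."""

    def __init__(self, cap: float) -> None:
        self.cap = cap
        self.committed = 0.0  # settled spend plus outstanding reservations
        self._lock = threading.Lock()

    def reserve(self, estimate: float) -> bool:
        # Check and increment under one lock so two callers cannot both pass.
        with self._lock:
            if self.committed + estimate > self.cap:
                return False
            self.committed += estimate
            return True

    def reconcile(self, estimate: float, actual: float) -> None:
        # Swap the reserved estimate for what the provider actually charged.
        with self._lock:
            self.committed += actual - estimate


ledger = BudgetLedger(cap=1.00)
granted = []


def turn() -> None:
    granted.append(ledger.reserve(0.60))


threads = [threading.Thread(target=turn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(granted))
```

Two concurrent turns each ask for $0.60 against a $1.00 cap; exactly one reservation is granted, whichever thread gets the lock first.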

It also routes to cheaper models as budget runs down

The routing layer scores each request for complexity and picks the model accordingly. Low-complexity turns go to the cheap model, high-complexity turns go to the big one. As the budget runs low, the threshold shifts:

Budget used 0–30%:   complexity > 0.50 → big model
Budget used 60–80%:  complexity > 0.75 → big model
Budget used 95%+:    almost everything → small model

A $1.00 session doesn't just cut off at $1.00. It gets cheaper per turn as the budget depletes. The agent keeps working, it just costs less toward the end.
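A shifting threshold like this can be written as a step function of utilization. The sketch below is an illustration rather than baar's routing code: the 0.50 and 0.75 values come from the table above, while the values for the bands the table does not list and for the 95%+ band are assumptions.

```python
def big_model_threshold(utilization: float) -> float:
    """Complexity score a turn must exceed to get the big model."""
    if utilization < 0.30:
        return 0.50   # from the table above
    if utilization < 0.60:
        return 0.60   # assumed: this band is not listed
    if utilization < 0.80:
        return 0.75   # from the table above
    if utilization < 0.95:
        return 0.90   # assumed: this band is not listed
    return 0.98       # assumed value for "almost everything -> small"


def pick_model(complexity: float, utilization: float) -> str:
    return "big" if complexity > big_model_threshold(utilization) else "small"


for used in (0.10, 0.70, 0.97):
    print(f"{used:.0%} used, complexity 0.70 ->", pick_model(0.70, used))
```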

Alerts, because a silently dead session is also bad

Waking up to a session stuck at $0.999 since 2am, waiting on a cap that already fired, is almost as annoying as the $47 bill. So I added thresholds:

from baar import BAARRouter, BudgetWindow, Alert
# send_slack, used below, is assumed to be your own notification helper defined elsewhere.

def warn_at_80(info):
    print(f"⚠️  {info['utilization']*100:.0f}% of daily budget used — "
          f"${info['remaining']:.4f} remaining")

router = BAARRouter(
    budget=5.00,
    window=BudgetWindow.DAILY,   # resets at midnight UTC, no cron needed
    alerts=[
        Alert(threshold=0.8, callback=warn_at_80),
        Alert(threshold=0.95, callback=lambda _: send_slack("Hermes at 95% — check it")),
    ],
)

BudgetWindow.DAILY resets at midnight UTC. Each day gets its own bucket. Historical spend is preserved so you can audit any past session, and the alert re-arms automatically when the new window opens.
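A reset that needs no cron job usually falls out of how spend is keyed: if each record is filed under its UTC date, a new day simply starts with an empty bucket. A minimal sketch of that idea, not baar's implementation; the `DailySpend` class is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


class DailySpend:
    """Keep one spend bucket per UTC calendar day."""

    def __init__(self) -> None:
        self.buckets = defaultdict(float)

    @staticmethod
    def bucket_key(now: datetime) -> str:
        return now.astimezone(timezone.utc).date().isoformat()

    def add(self, cost: float, now: datetime) -> None:
        self.buckets[self.bucket_key(now)] += cost

    def spent_today(self, now: datetime) -> float:
        return self.buckets.get(self.bucket_key(now), 0.0)


spend = DailySpend()
late = datetime(2026, 5, 26, 23, 59, tzinfo=timezone.utc)
after_midnight = datetime(2026, 5, 27, 0, 1, tzinfo=timezone.utc)

spend.add(4.20, late)
print(spend.spent_today(late))            # the old day's bucket
print(spend.spent_today(after_midnight))  # a fresh bucket, no reset job ran
print(dict(spend.buckets))                # earlier days stay available for audit
```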

A policy engine, for when a single number isn't enough

If you're running Hermes on behalf of multiple users, you need rules. A free tier user hitting gpt-4o at 60% budget utilization is a problem waiting to happen. An enterprise user getting downgraded to gpt-4o-mini is a different kind of problem.

from baar.core.policy import Policy, Rule

policy = Policy(rules=[
    # Enterprise users always get the big model. Listed first, because
    # rules are first-match-wins and the 70% rule below would otherwise win.
    Rule(when={"plan": "enterprise"}, then="force_big"),

    # Free tier users: force cheap model past 50% spend
    Rule(when={"plan": "free", "utilization": ">= 0.5"}, then="force_small"),

    # Everyone else: never use big model past 70% budget
    Rule(when={"utilization": ">= 0.7"}, then="force_small"),
])

router = BAARRouter(budget=5.00, policy=policy)

Rules are first-match-wins. You thread user metadata per call from your application layer. System facts like real utilization always override caller context, so users can't spoof their own budget status.

When a block rule fires, baar raises PolicyViolation, distinct from BudgetExhausted. Both carry a facts dict with exactly which rule matched and why.
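First-match-wins evaluation and the "system facts override caller context" guarantee are both easy to get subtly wrong, so here is a small illustration. The rule format below is a simplified stand-in and not baar's `Policy` or `Rule` classes; `decide` and `matches` are hypothetical names.

```python
import operator
from typing import Optional

OPS = {">=": operator.ge, "<": operator.lt}

# (conditions, action) pairs; order matters because the first match wins.
RULES = [
    ({"plan": "enterprise"}, "force_big"),
    ({"plan": "free", "utilization": (">=", 0.5)}, "force_small"),
    ({"utilization": (">=", 0.7)}, "force_small"),
]


def matches(conditions: dict, facts: dict) -> bool:
    for field, expected in conditions.items():
        value = facts.get(field)
        if isinstance(expected, tuple):
            op, bound = expected
            if value is None or not OPS[op](value, bound):
                return False
        elif value != expected:
            return False
    return True


def decide(caller_context: dict, system_facts: dict) -> Optional[str]:
    # System facts are merged last so a caller cannot spoof utilization.
    facts = {**caller_context, **system_facts}
    for conditions, action in RULES:
        if matches(conditions, facts):
            return action
    return None


print(decide({"plan": "free", "utilization": 0.0}, {"utilization": 0.55}))
print(decide({"plan": "enterprise"}, {"utilization": 0.80}))
print(decide({"plan": "pro"}, {"utilization": 0.80}))
```

Order is part of the policy: if the enterprise rule came after the 70% rule, an enterprise user at 80% utilization would match the 70% rule first and be forced onto the small model.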

The audit log

Every Hermes turn is logged:

for step in session.log.steps:
    print(
        f"Step {step.step_num:2d} | {step.decision.model:<20} | "
        f"${step.cost:.6f} | {step.latency_ms:6.0f}ms | "
        f"{step.decision.reason}"
    )

Step  1 | gpt-4o-mini          | $0.000023 |   412ms | complexity=0.31 → small
Step  2 | gpt-4o               | $0.000891 |  1823ms | complexity=0.78 → big
Step  3 | gpt-4o-mini          | $0.000019 |   388ms | complexity=0.28 → small
Step  4 | gpt-4o-mini          | $0.000021 |   401ms | [POLICY FORCE_SMALL] complexity=0.71
...
Total: $0.003847 of $1.00 (0.38% used)

forced_by_budget on each step tells you whether the model downgrade was a budget constraint or a policy decision.

One more thing: a supply chain issue we caught mid-build

While shipping v0.7.0 we found CVE-2026-33634, a supply chain compromise in litellm==1.82.7 and 1.82.8. Since baar-core depends on LiteLLM, any user installing without this fix would pull in the compromised version.

Two defences: an install-time constraint (!=1.82.7,!=1.82.8) so pip never resolves to those versions, and a runtime check that raises at BAARRouter construction if the bad version is already installed. If you have it, baar won't start.
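The runtime half of that defence can be sketched with the standard library's `importlib.metadata`. This is an illustration of the approach, not baar's actual check; the function names are hypothetical, and the blocked versions are the two named above.

```python
from importlib import metadata

BLOCKED_LITELLM = frozenset({"1.82.7", "1.82.8"})


def check_version(installed: str, blocked: frozenset = BLOCKED_LITELLM) -> None:
    """Raise if the given version string is on the blocklist."""
    if installed in blocked:
        raise RuntimeError(f"litellm {installed} is blocked; upgrade before continuing")


def check_installed_litellm() -> None:
    """Look up the installed litellm version, if any, and check it."""
    try:
        installed = metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return  # nothing installed, nothing to block
    check_version(installed)


check_version("1.82.6")  # passes silently
try:
    check_version("1.82.7")
except RuntimeError as exc:
    print(exc)
```

Calling `check_installed_litellm()` at startup covers environments where the bad version was installed before the install-time constraint existed.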

The $47 bill was the useful part of that Tuesday. Turns out "iterate until done" is not a plan when you're paying per iteration and you're asleep.

GitHub: github.com/orvi2014/Baar-Core — pip install baar-core[vercel] hermes-agent

I Use AI to Build. I Don't Let It Think for Me.

Orvi Das — Sun, 17 May 2026 00:45:01 +0000

I have been building software with AI tools for about two years now. I ship with Claude Code every day. And I still do not vibe code. Here is why that distinction matters more than people think.

The first time I watched someone vibe code, I felt the same thing I feel watching someone drive with their knees. Technically possible. Impressive for about thirty seconds. And then just a matter of time.

I use AI every single day. I am not writing this from some purist position where I compile my own tools and distrust anything generated by a machine. I use Claude to write boilerplate, draft logic, suggest patterns I would have spent an hour looking up. It has made me faster in the ways that were boring to be slow in.

But I do not let it think for me.

There is a difference and it matters more than almost anything else I have learned in the last two years of building with these tools.

What Vibe Coding Actually Is

Vibe coding is not just "using AI to help write code." That is a category error that people make to either defend or attack it. Using AI to help write code is just programming now. That is what the tools are for.

Vibe coding is something specific: it is prompting without understanding, accepting without reading, and shipping without testing. It is the workflow where the developer's job becomes describing what they want and clicking approve.

The output looks like software. It passes the smell test. It runs. And then, three weeks later, something quietly breaks in production and you spend an afternoon staring at code you do not actually understand, written by a model that does not remember writing it.

I have seen this happen to smart people. I have started to do it myself on late nights when I was tired and the model was confident. It is seductive because the short loop feels productive. You say a thing, the code appears, it works. The feedback is immediate and positive.

The cost is invisible until it is not.

The Line I Draw

My workflow goes in one direction: I understand first, then I use AI to execute faster.

That means I write the unit test before I ask the model for the implementation. Not because I am rigorous by nature — I am not — but because writing the test forces me to know what I actually want. It forces me to think about edge cases before I have code that creates attachment to a specific approach. It forces me to have a definition of done that exists outside my head.

When I hand that context to the model, the output is different. Not because the model is smarter (it is the same model), but because I am asking a specific, bounded question instead of a vague, open-ended one. The difference between "write me a function that handles payments" and "write me a function that takes a Stripe webhook payload, validates the signature, extracts the event type, and returns a typed result with this shape" is the difference between code that kind of works and code that actually does the thing.

Then I read what comes back. All of it. Even when it is long. Especially when it is long.

This sounds obvious. It is not practiced as much as it sounds.

What AI Is Actually Good At

The honest list of where AI makes me dramatically faster:

Boilerplate that I know the shape of. If I know I need a repository pattern with these five methods, I can describe it precisely and get it in thirty seconds instead of fifteen minutes. I understand what it should look like before I ask. The AI just types faster than I do.

Surfacing options I had not thought of. I will describe a problem and ask what patterns exist for solving it, not for the model to pick one, but so I have a more complete menu. Then I decide. The model does not know my codebase, my constraints, or my risk tolerance.

Catching things I missed. After I write something, I ask the model to review it — specifically to look for edge cases, error paths I did not handle, security issues I glossed over. It finds real things. Not always, but often enough that I have made it a habit.

Writing tests for logic I just wrote. Once the implementation is done and I understand it, I will have the model write additional test cases. It is good at thinking of inputs I did not try.

What it is not good at: deciding what to build, deciding how to architect something that will need to scale, or writing code I am not equipped to review. When I catch myself in that last situation, I stop and learn the thing first.

The Senior Developer Problem

There is a version of this conversation that gets framed as: AI will replace junior developers but senior developers are safe because they can guide it.

I do not think this is quite right, and I think believing it creates a complacency that is more dangerous than the replacement question.

The thing that makes a senior developer valuable is not primarily the ability to generate correct code. It is the ability to know which code should not exist, which abstractions will turn into debt, which requirements are wrong before you build them. That judgment comes from having been wrong enough times to develop taste.

AI does not have taste. It has pattern completion. It will write you a technically correct solution to the wrong problem with the same confidence it writes a technically correct solution to the right one. It cannot tell the difference.

If you are not developing the judgment — because you are outsourcing the thinking to the model — you are not building toward senior. You are extending the period where you do not yet know what you do not know.

Why This Is Not About Being Anti-AI

I am not arguing for slowing down or using fewer tools. I am arguing for staying in the driver's seat of your own work.

The people I have watched get the most out of these tools are the ones who get more done, not the ones who get more generated. The difference is that they know what done means before they start, and they verify they reached it before they ship.

I write unit tests first. Then integration tests against real systems, not mocks — I learned the hard way that mocked tests can pass while the actual integration is broken. Then end-to-end tests with Playwright for the paths users actually take. And I read everything the model gives me before I commit it.

That workflow is slower than vibe coding for the first hour. It is faster than vibe coding over any meaningful timescale.

The AI handles the typing. I stay responsible for the thinking.

That is the only arrangement I trust.