Your team shipped the agent. The demo was magic. Then finance asked why one engineer's coding assistant burned $40k in a month while the feature backlog looked the same.
Welcome to the hangover after tokenmaxxing — the habit of maximizing raw LLM usage, often celebrated on internal leaderboards, without tying token spend to shipped outcomes.
The counter-move has a name: tokenminning. Journalists call it token minimizing. Engineering teams call it tokenminning when they formalize it — metering, model routing, context hygiene, hard budgets, and CI enforcement.
This is the developer's cut. Philosophy lives in the Manifesto. Law lives in the Constitution. This post is what you ship before Friday.
TL;DR
- A token is a billable unit (~word fragment). Providers charge for input and output separately.
- In agent loops, input usually dominates — full history and tool dumps are re-sent every turn.
- Tokenminning = same output quality, minimum viable token spend.
- Order matters: meter → prompt hygiene → model routing → caching → context hygiene → caps.
- Soft guidance produces soft budgets. Hard ceilings belong in code.
Tokenmaxxing vs tokenminning (for people who write openai.chat.completions.create)
| Tokenmaxxing | Tokenminning | |
|---|---|---|
| Goal | Maximize token volume | Maximize value per token |
| Model choice | Frontier default | Cheapest model that passes your quality bar |
| Prompts | Grow by accretion | Versioned, schema-enforced, audited |
| Agents | Run until done | Session budget + graceful degradation |
| Measurement | "Who used the most AI?" | USD per feature, user, session |
| Enforcement | Slack reminders | CI gates, runtime caps, immutable ledgers |
Meta, Uber, Walmart, and Amazon all reversed course in 2026 — caps, limits, leaderboard removals. Uber burned through its projected annual AI budget in four months. The NYT documented the shift. Your invoice is not a personal failure. It is a missing constraint layer.
Why your bill exploded (three bugs in production, not one)
1. Hardcoded frontier models
// tokenmaxxing
const response = await openai.chat.completions.create({
model: 'gpt-4', // every request, regardless of task
messages,
});
Classification, extraction, and JSON transforms rarely need frontier tiers. Teams report 60–90% savings from model routing alone — after they benchmark on real traffic.
// tokenminning
const response = await router.complete({
capability: 'classify-intent',
messages,
});
// router maps capability → cheapest model passing SLA
2. Agent loops with unbounded context
A chat turn might cost hundreds of tokens. An agent that retains full tool payloads, never summarizes state, and re-sends everything each step can cost tens of thousands per session. Per-token prices look flat; context inflation makes total tokens per request grow 10×. See what is context inflation.
3. Prompts treated as Google Docs
Every token in a system prompt is billed on every request. A prompt that grows 200 tokens sounds free. At 10M requests/month, it is a line item.
# scaffolding waste (delete these)
"You are a helpful assistant."
"Take a deep breath and think step by step."
// TODO: update this when we switch models
Replace prose formatting rules with schema enforcement:
{
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "summary_response",
"schema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"summary": { "type": "string", "maxLength": 500 }
},
"required": ["title", "summary"]
}
}
}
}
Schema outputs validate in CI. Vibes do not. Full guide: prompt hygiene.
The optimization sequence (do not skip steps)
From Where to start on tokenminning.ai:
| Step | Technique | Typical impact |
|---|---|---|
| 0 | Metering + attribution | Prerequisite |
| 1 | Prompt hygiene | ~20–25% on high-volume templates |
| 2 | Model routing | 60–95% depending on task mix |
| 3 | Prompt caching | 41–80% on stable prefixes |
| 4 | Context hygiene | 40–60% in agentic workloads |
| 5 | Output control + RAG discipline | 15–40% |
| 6 | Semantic caching | Eliminates inference on cache hits |
Anti-pattern: optimizing prompts before metering. You cannot prove savings or find the real bottleneck.
Anti-pattern: trimming max_tokens on output when input is 80% of spend.
Start this week: five PRs that actually move the needle
PR 1 — Log input and output tokens separately
Before changing anything, answer: "What did this user action cost, in USD, when it completed?"
Minimum viable instrumentation:
const result = await model.complete({ messages, max_tokens: 1024 });
await ledger.record({
feature: 'document-analysis',
userId: session.userId,
sessionId: session.id,
inputTokens: result.usage.prompt_tokens,
outputTokens: result.usage.completion_tokens,
model: result.model,
usd: priceResolver.toUsd(result.usage, result.model),
});
Implement Article I: immutable metering before you optimize prompts.
PR 2 — Audit top 10 templates for scaffolding
Pull highest-volume or highest-cost prompts. Remove signal-free tokens. Diff token counts before/after. Target 20–25% on bloated templates.
PR 3 — Add max_tokens to every bounded task
If the answer shape is known, cap it in the API call:
await model.complete({
messages,
max_tokens: 256, // not a paragraph essay for a boolean classification
});
PR 4 — Stable prefix for caching
Static system prompts and tool definitions go first. Dynamic user content goes after.
const messages = [
{ role: 'system', content: STABLE_SYSTEM_PROMPT }, // cacheable prefix
{ role: 'user', content: dynamicUserInput },
];
Enable provider prefix caching. Guide: prompt caching.
PR 5 — Session budgets with graceful degradation
Hard ceilings are architecture, not suggestions. From Article IV:
const session = createSession({
userId: 'usr_123',
feature: 'document-analysis',
tokenBudget: 50_000,
usdCeiling: 0.5,
});
// at 90% budget: inject wrap-up directive
// at 100%: return best partial with status `completed_degraded`
// never: abort mid-sentence with a raw 429
No budget, no session. No "internal testing" exceptions in prod.
Model routing: cascade, don't default upward
Request → small model
↓ quality check passes → return
↓ fails → mid-tier model
↓ passes → return (log escalation)
↓ fails → frontier (log justification)
Every frontier call needs an auditable reason:
- Complexity classifier score above threshold
- Cheaper model failed quality checks (logged)
- User explicitly chose a billed "high quality" tier
Not acceptable: developer preference, "the prompt is long," no routing layer at all.
Benchmark on your data — 1,000 real inputs, parallel runs, compare quality × cost × latency. Blog post savings are not your savings.
What tokenminning is not
- Stripping required instructions to shave counts (noise removal ≠ signal removal)
- Using the smallest model for everything without benchmarks
- Fragile micro-edits that save 12 tokens and break edge cases
- Asking engineers to "write shorter prompts" without infrastructure
Tokenminning is building systems where token bloat physically cannot deploy and every inference call maps to a ledger.
What is the most expensive inference call pattern in your stack right now — and do you actually know its USD cost per session?
Top comments (0)