Oskar

Posted on Jun 19

Tokenminning (for people who write openai.chat.completions.create)

#ai #programming #productivity #llm

Your team shipped the agent. The demo was magic. Then finance asked why one engineer's coding assistant burned $40k in a month while the feature backlog looked the same.

Welcome to the hangover after tokenmaxxing — the habit of maximizing raw LLM usage, often celebrated on internal leaderboards, without tying token spend to shipped outcomes.

The counter-move has a name: tokenminning. Journalists call it token minimizing. Engineering teams call it tokenminning when they formalize it — metering, model routing, context hygiene, hard budgets, and CI enforcement.

This is the developer's cut. Philosophy lives in the Manifesto. Law lives in the Constitution. This post is what you ship before Friday.

TL;DR

A token is a billable unit (~word fragment). Providers charge for input and output separately.
In agent loops, input usually dominates — full history and tool dumps are re-sent every turn.
Tokenminning = same output quality, minimum viable token spend.
Order matters: meter → prompt hygiene → model routing → caching → context hygiene → caps.
Soft guidance produces soft budgets. Hard ceilings belong in code.

Tokenmaxxing vs tokenminning (for people who write `openai.chat.completions.create`)

	Tokenmaxxing	Tokenminning
Goal	Maximize token volume	Maximize value per token
Model choice	Frontier default	Cheapest model that passes your quality bar
Prompts	Grow by accretion	Versioned, schema-enforced, audited
Agents	Run until done	Session budget + graceful degradation
Measurement	"Who used the most AI?"	USD per feature, user, session
Enforcement	Slack reminders	CI gates, runtime caps, immutable ledgers

Meta, Uber, Walmart, and Amazon all reversed course in 2026 — caps, limits, leaderboard removals. Uber burned through its projected annual AI budget in four months. The NYT documented the shift. Your invoice is not a personal failure. It is a missing constraint layer.

Why your bill exploded (three bugs in production, not one)

1. Hardcoded frontier models

// tokenmaxxing
const response = await openai.chat.completions.create({
  model: 'gpt-4', // every request, regardless of task
  messages,
});

Classification, extraction, and JSON transforms rarely need frontier tiers. Teams report 60–90% savings from model routing alone — after they benchmark on real traffic.

// tokenminning
const response = await router.complete({
  capability: 'classify-intent',
  messages,
});
// router maps capability → cheapest model passing SLA

2. Agent loops with unbounded context

A chat turn might cost hundreds of tokens. An agent that retains full tool payloads, never summarizes state, and re-sends everything each step can cost tens of thousands per session. Per-token prices look flat; context inflation makes total tokens per request grow 10×. See what is context inflation.

3. Prompts treated as Google Docs

Every token in a system prompt is billed on every request. A prompt that grows 200 tokens sounds free. At 10M requests/month, it is a line item.

# scaffolding waste (delete these)
"You are a helpful assistant."
"Take a deep breath and think step by step."
// TODO: update this when we switch models

Replace prose formatting rules with schema enforcement:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "summary_response",
      "schema": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "summary": { "type": "string", "maxLength": 500 }
        },
        "required": ["title", "summary"]
      }
    }
  }
}

Schema outputs validate in CI. Vibes do not. Full guide: prompt hygiene.

The optimization sequence (do not skip steps)

From Where to start on tokenminning.ai:

Step	Technique	Typical impact
0	Metering + attribution	Prerequisite
1	Prompt hygiene	~20–25% on high-volume templates
2	Model routing	60–95% depending on task mix
3	Prompt caching	41–80% on stable prefixes
4	Context hygiene	40–60% in agentic workloads
5	Output control + RAG discipline	15–40%
6	Semantic caching	Eliminates inference on cache hits

Anti-pattern: optimizing prompts before metering. You cannot prove savings or find the real bottleneck.

Anti-pattern: trimming max_tokens on output when input is 80% of spend.

Start this week: five PRs that actually move the needle

PR 1 — Log input and output tokens separately

Before changing anything, answer: "What did this user action cost, in USD, when it completed?"

Minimum viable instrumentation:

const result = await model.complete({ messages, max_tokens: 1024 });

await ledger.record({
  feature: 'document-analysis',
  userId: session.userId,
  sessionId: session.id,
  inputTokens: result.usage.prompt_tokens,
  outputTokens: result.usage.completion_tokens,
  model: result.model,
  usd: priceResolver.toUsd(result.usage, result.model),
});

Implement Article I: immutable metering before you optimize prompts.

PR 2 — Audit top 10 templates for scaffolding

Pull highest-volume or highest-cost prompts. Remove signal-free tokens. Diff token counts before/after. Target 20–25% on bloated templates.

PR 3 — Add `max_tokens` to every bounded task

If the answer shape is known, cap it in the API call:

await model.complete({
  messages,
  max_tokens: 256, // not a paragraph essay for a boolean classification
});

PR 4 — Stable prefix for caching

Static system prompts and tool definitions go first. Dynamic user content goes after.

const messages = [
  { role: 'system', content: STABLE_SYSTEM_PROMPT }, // cacheable prefix
  { role: 'user', content: dynamicUserInput },
];

Enable provider prefix caching. Guide: prompt caching.

PR 5 — Session budgets with graceful degradation

Hard ceilings are architecture, not suggestions. From Article IV:

const session = createSession({
  userId: 'usr_123',
  feature: 'document-analysis',
  tokenBudget: 50_000,
  usdCeiling: 0.5,
});

// at 90% budget: inject wrap-up directive
// at 100%: return best partial with status `completed_degraded`
// never: abort mid-sentence with a raw 429

No budget, no session. No "internal testing" exceptions in prod.

Model routing: cascade, don't default upward

Request → small model
            ↓ quality check passes → return
            ↓ fails → mid-tier model
                        ↓ passes → return (log escalation)
                        ↓ fails → frontier (log justification)

Every frontier call needs an auditable reason:

Complexity classifier score above threshold
Cheaper model failed quality checks (logged)
User explicitly chose a billed "high quality" tier

Not acceptable: developer preference, "the prompt is long," no routing layer at all.

Benchmark on your data — 1,000 real inputs, parallel runs, compare quality × cost × latency. Blog post savings are not your savings.

What tokenminning is not

Stripping required instructions to shave counts (noise removal ≠ signal removal)
Using the smallest model for everything without benchmarks
Fragile micro-edits that save 12 tokens and break edge cases
Asking engineers to "write shorter prompts" without infrastructure

Tokenminning is building systems where token bloat physically cannot deploy and every inference call maps to a ledger.

What is the most expensive inference call pattern in your stack right now — and do you actually know its USD cost per session?

DEV Community

Tokenminning (for people who write openai.chat.completions.create)

TL;DR

Tokenmaxxing vs tokenminning (for people who write `openai.chat.completions.create`)

Why your bill exploded (three bugs in production, not one)

1. Hardcoded frontier models

2. Agent loops with unbounded context

3. Prompts treated as Google Docs

The optimization sequence (do not skip steps)

Start this week: five PRs that actually move the needle

PR 1 — Log input and output tokens separately

PR 2 — Audit top 10 templates for scaffolding

PR 3 — Add `max_tokens` to every bounded task

PR 4 — Stable prefix for caching

PR 5 — Session budgets with graceful degradation

Model routing: cascade, don't default upward

What tokenminning is not

Top comments (0)

TL;DR

Tokenmaxxing vs tokenminning (for people who write openai.chat.completions.create)

Why your bill exploded (three bugs in production, not one)

1. Hardcoded frontier models

2. Agent loops with unbounded context

3. Prompts treated as Google Docs

The optimization sequence (do not skip steps)

Start this week: five PRs that actually move the needle

PR 1 — Log input and output tokens separately

PR 2 — Audit top 10 templates for scaffolding

PR 3 — Add max_tokens to every bounded task

PR 4 — Stable prefix for caching

PR 5 — Session budgets with graceful degradation

Model routing: cascade, don't default upward

What tokenminning is not

Tokenmaxxing vs tokenminning (for people who write `openai.chat.completions.create`)

PR 3 — Add `max_tokens` to every bounded task