F3d1

How I built budget enforcement that actually works for AI APIs

I've been using Claude Code daily for 1 year across 30+ projects. When I checked what all those sessions would cost at API rates, the number was over $10,000. Claude Max subscribers have zero visibility into this. No dashboard, no breakdown, no way to know which project or session is burning the most tokens.

So I built two things. An MCP server that shows Claude Code users their costs in real time, no API key needed, reads local session data directly. And an open-source API gateway called LLMKit with actual budget enforcement for teams routing traffic through AI providers.

The budget layer took longer than everything else combined. Database locks, Redis counters, optimistic concurrency: nothing held up under concurrent agent traffic. The gap between "check balance" and "record cost" is where money disappears.

Cloudflare Durable Objects turned out to be the answer.

Why every other approach leaks money

Standard flow in most AI proxies:

Request comes in
  -> Read balance from DB           (sees $12 used of $50)
  -> Allow request                  (plenty of room)
  -> Forward to provider            (takes 5 to 30 seconds)
  -> Get response                   (cost: $3.20)
  -> Write new balance to DB        ($15.20)

The read and write aren't atomic. During those 5 to 30 seconds of streaming, every other concurrent request reads the same stale $12 balance and passes through. Four parallel requests? Four times $3.20 gets approved against the same snapshot.

The pattern is the same everywhere: check, then act, with a gap in between. Race conditions on concurrent reads, zero-cost estimates that skip the check entirely, requests that crash before cost gets recorded. If the enforcement isn't atomic, it's a suggestion.
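To see the leak concretely, here is a toy simulation of the check-then-act gap (illustrative code, not any real proxy): four concurrent requests each read the same stale balance before any of them writes back, so all four are approved, and the lost updates mean even the recorded spend is wrong.

```typescript
// Toy check-then-act proxy: a non-atomic read, a simulated provider
// call, then a write-back based on the stale read.
interface Ledger { limitCents: number; usedCents: number; }

async function handleRequest(db: Ledger, costCents: number): Promise<boolean> {
  const used = db.usedCents;                       // non-atomic read
  if (used + costCents > db.limitCents) return false;
  await new Promise((r) => setTimeout(r, 10));     // provider call in flight
  db.usedCents = used + costCents;                 // stale write-back
  return true;
}

async function main() {
  const db: Ledger = { limitCents: 5000, usedCents: 1200 };  // $50 limit, $12 used
  const results = await Promise.all(
    Array.from({ length: 4 }, () => handleRequest(db, 320))  // 4 x $3.20
  );
  console.log(results);       // all four approved against the same snapshot
  console.log(db.usedCents);  // 1520, not 2480: three charges were lost entirely
}
main();
```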

What Durable Objects give you

Cloudflare Durable Objects are globally unique, single-threaded actors with persistent storage. One object per budget ID; all requests to the same budget serialize through a single thread.

const stub = env.BUDGET_DO.get(env.BUDGET_DO.idFromName(budgetId));
const result = await stub.check({ estimatedCents, budgetConfig });

idFromName(budgetId) resolves to the same instance worldwide. Two requests from Frankfurt and Virginia, same API key: the second waits for the first. No races, by construction.

The flow becomes:

Request -> [Auth] -> [Budget DO: reserve] -> [Provider API] -> [Budget DO: settle]
                          |                                          |
                     single-threaded                          release estimate,
                     globally unique                          record actual cost
                          |
                     rejects if budget
                     can't cover estimate
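The serialization guarantee is the whole trick. Here is a toy sketch of the same concurrency model, with a promise chain standing in for the Durable Object's single thread (illustrative, not Cloudflare code):

```typescript
// Toy single-threaded "actor": every check on one budget runs strictly
// after the previous one finishes, the way requests to a single Durable
// Object instance serialize.
class BudgetActor {
  private usedCents = 0;
  private queue: Promise<unknown> = Promise.resolve();
  constructor(private limitCents: number) {}

  check(costCents: number): Promise<boolean> {
    // Chain onto the queue so reads and writes never interleave.
    const result = this.queue.then(() => {
      if (this.usedCents + costCents > this.limitCents) return false;
      this.usedCents += costCents;
      return true;
    });
    this.queue = result;
    return result;
  }
}

async function demo() {
  const actor = new BudgetActor(1000);                  // $10 limit
  const results = await Promise.all(
    Array.from({ length: 5 }, () => actor.check(320))   // 5 x $3.20
  );
  console.log(results); // [true, true, true, false, false]: only three fit
}
demo();
```

The same five concurrent requests that all slipped past the stale-read proxy now get decided one at a time, so the fourth and fifth see the updated balance and are rejected.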

Reserve first, settle after

The core idea is simple: reserve the estimated cost before calling the provider, then reconcile with the actual cost after.

Three numbers per budget:

interface BudgetState {
  limitCents: number;
  usedCents: number;      // settled charges
  reservedCents: number;  // in-flight estimates
}

The check counts both spent and reserved:

// inside the Durable Object's check() handler
const root = await this.ctx.storage.get('root');
const committed = root.usedCents + root.reservedCents;
const remaining = root.limitCents - committed;

if (remaining < estimatedCents) {
  return { allowed: false };
}

const amount = Math.max(estimatedCents, 1);  // never reserve zero
const reservationId = crypto.randomUUID();
root.reservedCents += amount;
await this.ctx.storage.put(`r:${reservationId}`, { amount, createdAt: Date.now() });
await this.ctx.storage.put('root', root);
return { allowed: true, reservationId };

Math.max(estimatedCents, 1) closes the zero-cost bypass. Empty request body? Still reserves 1 cent. You can't sneak past with a cost of zero.

Estimating cost before the call

No token counts exist before the provider responds, so I estimate from the request body:

const inputTokens = Math.ceil(inputChars / 4);  // ~4 chars per token
const maxOutput = body.max_tokens || 1024;
const estimated = (inputTokens * inputPrice + maxOutput * outputPrice) / 1_000_000;
return Math.ceil(estimated * 100);  // always in integer cents

Images get a flat 12,800 character estimate, roughly 3,200 tokens. Conservative on purpose; over-estimating is safe because settlement refunds the difference.

Integer cents everywhere. Math.ceil rounds up, so a $0.001 request reserves 1 cent, not 0.
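Put together, the estimator can be sketched as a standalone function; the price arguments here are illustrative per-million rates, not real provider pricing:

```typescript
// Sketch of the pre-call estimator: chars -> tokens -> dollars -> integer
// cents, rounding up at every step so estimates never undershoot to zero.
function estimateCostCents(
  inputChars: number,
  maxTokens: number | undefined,
  inputPricePerM: number,   // $ per 1M input tokens (illustrative)
  outputPricePerM: number   // $ per 1M output tokens (illustrative)
): number {
  const inputTokens = Math.ceil(inputChars / 4);   // ~4 chars per token
  const maxOutput = maxTokens ?? 1024;             // default output cap
  const dollars =
    (inputTokens * inputPricePerM + maxOutput * outputPricePerM) / 1_000_000;
  return Math.max(Math.ceil(dollars * 100), 1);    // integer cents, floor of 1
}

// A 4,000-char prompt with the default 1,024-token cap at $3/$15 per
// million: ~$0.018 estimated, so 2 cents get reserved.
console.log(estimateCostCents(4000, undefined, 3, 15)); // → 2
```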

Settlement

When the provider responds with real token counts, the DO swaps the estimate for the actual:

async record({ reservationId, costCents }) {
  const reservation = await this.ctx.storage.get(`r:${reservationId}`);
  if (!reservation) return;                  // already settled or swept by the alarm
  const root = await this.ctx.storage.get('root');
  root.reservedCents -= reservation.amount;  // release estimate
  root.usedCents += costCents;               // record actual
  await this.ctx.storage.put('root', root);
  await this.ctx.storage.delete(`r:${reservationId}`);
}

Estimated $0.12, actual $0.08? The $0.04 flows back into available budget. The budget never leaks because estimates are always settled against actuals.

Handling failures

Provider returns 500. Network times out. Worker runs out of memory. The reservation can't sit there forever blocking future requests.

Two defenses. First, the error handler releases immediately when it catches a failure:

app.onError(async (err, c) => {
  if (budgetId && reservationId) {
    c.executionCtx.waitUntil(
      releaseReservation(env.BUDGET_DO, budgetId, reservationId)
    );
  }
  return c.json({ error: err.message }, 502);
});

Second, for requests that die without triggering the error handler, a DO alarm sweeps stale reservations. Anything unreleased after 5 minutes gets reclaimed:

async alarm() {
  const cutoff = Date.now() - 300_000;  // 5-minute TTL
  const reservations = await this.ctx.storage.list({ prefix: 'r:' });
  let stale = 0;
  const toDelete = [];
  for (const [key, val] of reservations) {
    if (val.createdAt < cutoff) {
      stale += val.amount;              // reclaim the orphaned estimate
      toDelete.push(key);
    }
  }
  if (toDelete.length > 0) {
    const root = await this.ctx.storage.get('root');
    root.reservedCents = Math.max(0, root.reservedCents - stale);
    await this.ctx.storage.put('root', root);
    await this.ctx.storage.delete(toDelete);
  }
}

The alarm reschedules itself daily as long as active sessions or reservations exist. No fixed cron. It activates when needed and stops when idle.

Dual-tier enforcement: keys and sessions

AI agents run multiple sessions under one API key. Sometimes you want $100/day total for the key, sometimes $5 per conversation. I enforce both at once:

const sessionRemaining = session.limitCents - session.usedCents - session.reservedCents;
const rootRemaining = root.limitCents - root.usedCents - root.reservedCents;
const remaining = Math.min(sessionRemaining, rootRemaining);

Tag requests with x-llmkit-session-id and each conversation gets independent tracking. A single session can't blow its limit, and the key can't exceed its total across all sessions.
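As a runnable sketch of the dual-tier check (the tier shape and numbers here are illustrative), whichever cap has less headroom decides:

```typescript
// A request must fit under BOTH the per-session cap and the key-wide cap,
// so the effective remaining budget is the minimum of the two.
interface Tier { limitCents: number; usedCents: number; reservedCents: number; }

function remainingCents(session: Tier, root: Tier): number {
  const sessionRemaining = session.limitCents - session.usedCents - session.reservedCents;
  const rootRemaining = root.limitCents - root.usedCents - root.reservedCents;
  return Math.min(sessionRemaining, rootRemaining);
}

const root: Tier = { limitCents: 10000, usedCents: 9800, reservedCents: 100 }; // $100/day key
const session: Tier = { limitCents: 500, usedCents: 50, reservedCents: 0 };    // $5 session

// The session has $4.50 left, but the key only has $1.00, so the key wins.
console.log(remainingCents(session, root)); // → 100
```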

Graceful degradation near the limit

Say you have $0.50 left and the request asks for 4,096 output tokens that would cost $0.80. Instead of rejecting outright, I clamp max_tokens to what's affordable:

// outputPrice is in dollars per million tokens, same units as the estimator
const affordable = Math.floor(
  (remainingCents / 100 / outputPrice) * 1_000_000
);
if (affordable < 10) {
  throw new BudgetExceededError();
}
body.max_tokens = Math.min(body.max_tokens, affordable);

The request goes through with a shorter response. Hard rejection only when you can't afford 10 tokens. Better than cutting off mid-conversation when there's still room for a useful reply.
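The same clamp as a standalone function with concrete numbers (`clampMaxTokens` is a hypothetical helper and the per-million rate is illustrative, not a real provider's price):

```typescript
// Graceful degradation: shrink max_tokens to what the remaining budget
// can cover, and reject only when fewer than 10 tokens are affordable.
function clampMaxTokens(
  requested: number,
  remainingCents: number,
  outputPricePerM: number   // $ per 1M output tokens (illustrative)
): number {
  const affordable = Math.floor(
    (remainingCents / 100 / outputPricePerM) * 1_000_000
  );
  if (affordable < 10) {
    throw new Error('BudgetExceeded: cannot afford a useful reply');
  }
  return Math.min(requested, affordable);
}

// $0.50 left at $15 per million output tokens covers ~33,333 tokens,
// so a 4,096-token request passes through untouched...
console.log(clampMaxTokens(4096, 50, 15)); // → 4096
// ...while $0.02 left clamps the same request down to 1,333 tokens.
console.log(clampMaxTokens(4096, 2, 15)); // → 1333
```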

The 9 bypass vectors this prevents

Every row is a real attack pattern that budget enforcement needs to handle:

| Attack | Why it works elsewhere | Defense |
| --- | --- | --- |
| Race two requests | Concurrent DB reads | Single-threaded DO |
| Zero-cost estimate | Skips budget check | Math.max(estimated, 1) |
| Exceed then record | Post-hoc accounting | Pre-reservation |
| Crash before record | Budget leaks | 5-min TTL alarm |
| Spoof cost client-side | Trusts response.usage | Server-side pricing |
| Change limit mid-period | Stale cached config | Config sync on every check |
| Cross-session bleed | Single shared pool | Dual-tier key + session |
| Reservation buildup | No cleanup | Alarm GC |
| Period rollover carry | Old reservations persist | Clear all on reset |

Performance

After the first request warms the DO:

  • Reserve: ~10ms
  • Settle: ~10ms
  • Cold start: ~50ms (loads config from Supabase)

On API calls that take 2 to 30 seconds, 20ms round-trip is invisible. The DO runs at whichever Cloudflare edge is closest to the caller.

When you need this

If you're running agents that loop, serving multiple users with different budgets, or building anything where a runaway API bill would be a problem: you need enforcement without a gap between check and record. The reservation pattern on Durable Objects is the simplest way I've found to get real consistency without running your own infrastructure.

The 20ms of latency is the price of knowing your budget means what it says.


LLMKit is open source (MIT) and runs on Cloudflare Workers. Budget enforcement is one piece. It also handles auth, provider routing with fallback chains, and per-user cost tracking across 11 AI providers.

GitHub / MCP Server / Dashboard
