I've been using Claude Code daily for a year across 30+ projects. When I checked what all those sessions would cost at API rates, the number was over $10,000. Claude Max subscribers have zero visibility into this. No dashboard, no breakdown, no way to know which project or session is burning the most tokens.
So I built two things: an MCP server that shows Claude Code users their costs in real time (it reads local session data directly, no API key needed), and an open-source API gateway called LLMKit with actual budget enforcement for teams routing traffic through AI providers.
The budget layer took longer than everything else combined. Database locks, Redis counters, optimistic concurrency: nothing held up under concurrent agent traffic. The gap between "check balance" and "record cost" is where money disappears.
Cloudflare Durable Objects turned out to be the answer.
## Why every other approach leaks money
Standard flow in most AI proxies:

```
Request comes in
  -> Read balance from DB (sees $12 used of $50)
  -> Allow request (plenty of room)
  -> Forward to provider (takes 5 to 30 seconds)
  -> Get response (cost: $3.20)
  -> Write new balance to DB ($15.20)
```
The read and write aren't atomic. During those 5 to 30 seconds of streaming, every other concurrent request reads the same stale $12 balance and passes through. Four parallel requests? Four times $3.20 gets approved against the same snapshot.
The pattern is the same everywhere: check, then act, with a gap in between. Race conditions on concurrent reads, zero-cost estimates that skip the check entirely, requests that crash before cost gets recorded. If the enforcement isn't atomic, it's a suggestion.
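The check-then-act leak is easy to reproduce. Here is a toy in-memory version (all names and numbers are illustrative, not LLMKit's code): four concurrent requests each read the same stale balance before any of them writes, so all four pass a check that only one should survive.

```typescript
// Toy non-atomic budget check: read, await the provider, then write.
// All names and numbers here are illustrative, not LLMKit's code.
let usedCents = 1200;    // $12 already settled
const limitCents = 1550; // $15.50 budget: room for exactly one more $3.20 call

async function leakyRequest(costCents: number): Promise<boolean> {
  if (limitCents - usedCents < costCents) return false; // check (stale read)
  await Promise.resolve(); // stands in for 5-30s of provider streaming
  usedCents += costCents;  // act: by now other requests already passed the check
  return true;
}

// Four parallel $3.20 requests all read the same $12 snapshot, all pass
// the check, and the budget lands at $24.80 against a $15.50 limit.
```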
## What Durable Objects give you
Cloudflare Durable Objects are globally unique, single-threaded actors with persistent storage. One object per budget ID; all requests to the same budget serialize through a single thread.
```typescript
const stub = env.BUDGET_DO.get(env.BUDGET_DO.idFromName(budgetId));
const result = await stub.check({ estimatedCents, budgetConfig });
```
`idFromName(budgetId)` resolves to the same instance worldwide. Two requests, one from Frankfurt and one from Virginia, hitting the same API key: the second waits for the first. No races, by construction.
The flow becomes:
```
Request -> [Auth] -> [Budget DO: reserve] -> [Provider API] -> [Budget DO: settle]
                          |                                         |
                    single-threaded,                         release estimate,
                    globally unique;                         record actual cost
                    rejects if budget
                    can't cover estimate
```
## Reserve first, settle after
The core idea is simple: reserve the estimated cost before calling the provider, then reconcile with the actual cost after.
Three numbers per budget:
```typescript
interface BudgetState {
  limitCents: number;
  usedCents: number;     // settled charges
  reservedCents: number; // in-flight estimates
}
```
The check counts both spent and reserved:
```typescript
const committed = root.usedCents + root.reservedCents;
const remaining = root.limitCents - committed;
if (remaining < estimatedCents) {
  return { allowed: false };
}
root.reservedCents += Math.max(estimatedCents, 1);
await this.ctx.storage.put('root', root);
return { allowed: true, reservationId };
```
`Math.max(estimatedCents, 1)` closes the zero-cost bypass. Empty request body? It still reserves 1 cent. You can't sneak past with a cost of zero.
## Estimating cost before the call
No token counts exist before the provider responds, so I estimate from the request body:
```typescript
const inputTokens = Math.ceil(inputChars / 4); // ~4 chars per token
const maxOutput = body.max_tokens || 1024;
const estimated = (inputTokens * inputPrice + maxOutput * outputPrice) / 1_000_000;
return Math.ceil(estimated * 100); // always in integer cents
```
Images get a flat 12,800-character estimate, roughly 3,200 tokens. Conservative on purpose: over-estimating is safe because settlement refunds the difference.
Integer cents everywhere. `Math.ceil` rounds up, so a $0.001 request reserves 1 cent, not 0.
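Putting those pieces together, a self-contained version of the estimator looks roughly like this. The per-million prices are placeholder values I've picked for illustration, not LLMKit's actual pricing table:

```typescript
// Sketch of the pre-call estimator. Prices are illustrative placeholders
// (dollars per million tokens), not LLMKit's actual pricing data.
const INPUT_PRICE = 3;    // $/M input tokens (assumed)
const OUTPUT_PRICE = 15;  // $/M output tokens (assumed)

function estimateCents(inputChars: number, maxTokens = 1024): number {
  const inputTokens = Math.ceil(inputChars / 4); // ~4 chars per token
  const dollars =
    (inputTokens * INPUT_PRICE + maxTokens * OUTPUT_PRICE) / 1_000_000;
  return Math.max(Math.ceil(dollars * 100), 1); // integer cents, never 0
}
```

Note the `Math.max(..., 1)` floor: even an empty request reserves at least 1 cent, which is the same zero-cost defense the reservation check applies.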
## Settlement
When the provider responds with real token counts, the DO swaps the estimate for the actual:
```typescript
async record({ reservationId, costCents }) {
  const root = await this.ctx.storage.get('root');
  const reservation = await this.ctx.storage.get(`r:${reservationId}`);
  root.reservedCents -= reservation.amount; // release the estimate
  root.usedCents += costCents;              // record the actual cost
  await this.ctx.storage.put('root', root);
  await this.ctx.storage.delete(`r:${reservationId}`);
}
```
Estimated $0.12, actual $0.08? The $0.04 flows back into available budget. The budget never leaks because estimates are always settled against actuals.
## Handling failures
Provider returns 500. Network times out. Worker runs out of memory. The reservation can't sit there forever blocking future requests.
Two defenses. First, the error handler releases immediately when it catches a failure:
```typescript
app.onError(async (err, c) => {
  if (budgetId && reservationId) {
    // Release without blocking the error response.
    c.executionCtx.waitUntil(
      releaseReservation(env.BUDGET_DO, budgetId, reservationId)
    );
  }
  return c.json({ error: err.message }, 500);
});
```
Second, for requests that die without triggering the error handler, a DO alarm sweeps stale reservations. Anything unreleased after 5 minutes gets reclaimed:
```typescript
async alarm() {
  const root = await this.ctx.storage.get('root');
  const reservations = await this.ctx.storage.list({ prefix: 'r:' });
  const cutoff = Date.now() - 300_000; // 5-minute TTL
  let stale = 0;
  const toDelete: string[] = [];
  for (const [key, val] of reservations) {
    if (val.createdAt < cutoff) {
      stale += val.amount;
      toDelete.push(key);
    }
  }
  root.reservedCents = Math.max(0, root.reservedCents - stale);
  await this.ctx.storage.put('root', root);
  await this.ctx.storage.delete(toDelete);
}
```
The alarm reschedules itself daily as long as active sessions or reservations exist. No fixed cron. It activates when needed and stops when idle.
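The sweep logic itself is easy to pull out as a pure function and test against a fixed clock. A sketch, with an assumed reservation record shape (LLMKit's actual storage keys may differ):

```typescript
// Stale-reservation sweep as a pure function over a reservation map.
// Record shape is assumed for illustration.
interface Reservation { amount: number; createdAt: number; }

const TTL_MS = 300_000; // 5 minutes

function sweepStale(
  reservedCents: number,
  reservations: Map<string, Reservation>,
  now: number
): { reservedCents: number; deleted: string[] } {
  const cutoff = now - TTL_MS;
  let stale = 0;
  const deleted: string[] = [];
  for (const [key, r] of reservations) {
    if (r.createdAt < cutoff) {
      stale += r.amount;
      deleted.push(key);
    }
  }
  // Clamp at zero so a double release can never drive the counter negative.
  return { reservedCents: Math.max(0, reservedCents - stale), deleted };
}
```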
## Dual-tier enforcement: keys and sessions
AI agents run multiple sessions under one API key. Sometimes you want $100/day total for the key, sometimes $5 per conversation. I enforce both at once:
```typescript
const sessionRemaining = session.limitCents - session.usedCents - session.reservedCents;
const rootRemaining = root.limitCents - root.usedCents - root.reservedCents;
const remaining = Math.min(sessionRemaining, rootRemaining);
```
Tag requests with `x-llmkit-session-id` and each conversation gets independent tracking. A single session can't blow its limit, and the key can't exceed its total across all sessions.
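As a self-contained sketch (with an assumed `Tier` shape for both levels), the dual-tier check is just a `min` over the two remaining balances, so whichever tier is tighter binds:

```typescript
// Dual-tier check: a request must fit under BOTH the per-session limit
// and the key-level limit. Shape is illustrative, not LLMKit's exact types.
interface Tier { limitCents: number; usedCents: number; reservedCents: number; }

function remainingCents(session: Tier, root: Tier): number {
  const sessionRemaining =
    session.limitCents - session.usedCents - session.reservedCents;
  const rootRemaining =
    root.limitCents - root.usedCents - root.reservedCents;
  return Math.min(sessionRemaining, rootRemaining); // tighter tier wins
}
```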
## Graceful degradation near the limit
Say you have $0.50 left and the request asks for 4,096 output tokens that would cost $0.80. Instead of rejecting outright, I clamp `max_tokens` to what's affordable:

```typescript
const affordable = Math.floor(
  (remainingCents / 100 / outputPrice) * 1_000_000 // outputPrice: $/M tokens
);
if (affordable < 10) {
  throw new BudgetExceededError();
}
body.max_tokens = Math.min(body.max_tokens, affordable);
```
The request goes through with a shorter response. Hard rejection only when you can't afford 10 tokens. Better than cutting off mid-conversation when there's still room for a useful reply.
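A runnable version of the clamp, with the per-million price passed in and a plain `Error` standing in for `BudgetExceededError`:

```typescript
// Clamp the requested max_tokens to what the remaining budget can afford.
// outputPrice is dollars per million output tokens (illustrative units).
function clampMaxTokens(
  requested: number,
  remainingCents: number,
  outputPrice: number
): number {
  const affordable = Math.floor(
    (remainingCents / 100 / outputPrice) * 1_000_000
  );
  if (affordable < 10) throw new Error("budget exceeded"); // hard reject below 10 tokens
  return Math.min(requested, affordable);
}
```

At $200 per million output tokens, $0.50 of remaining budget affords 2,500 tokens, so a 4,096-token request gets clamped rather than rejected.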
## The 9 bypass vectors this prevents
Every row is a real attack pattern that budget enforcement needs to handle:
| Attack | Why it works elsewhere | Defense |
|---|---|---|
| Race two requests | Concurrent DB reads | Single-threaded DO |
| Zero-cost estimate | Skips budget check | Math.max(estimated, 1) |
| Exceed then record | Post-hoc accounting | Pre-reservation |
| Crash before record | Budget leaks | 5-min TTL alarm |
| Spoof cost client-side | Trusts response.usage | Server-side pricing |
| Change limit mid-period | Stale cached config | Config sync on every check |
| Cross-session bleed | Single shared pool | Dual-tier key + session |
| Reservation buildup | No cleanup | Alarm GC |
| Period rollover carry | Old reservations persist | Clear all on reset |
## Performance
After the first request warms the DO:
- Reserve: ~10ms
- Settle: ~10ms
- Cold start: ~50ms (loads config from Supabase)
On API calls that take 2 to 30 seconds, 20ms round-trip is invisible. The DO runs at whichever Cloudflare edge is closest to the caller.
## When you need this
If you're running agents that loop, serving multiple users with different budgets, or building anything where a runaway API bill would be a problem, you need enforcement with no gap between check and record. The reservation pattern on Durable Objects is the simplest way I've found to get real consistency without running your own infrastructure.
The 20ms of latency is the price of knowing your budget means what it says.
LLMKit is open source (MIT) and runs on Cloudflare Workers. Budget enforcement is one piece. It also handles auth, provider routing with fallback chains, and per-user cost tracking across 11 AI providers.