Siddhant Jain

Posted on • Originally published at keelstack.me

The “Token Bleed”: How to Operate LLMs Without Bankrupting Yourself

Experts across infra, SRE, and product‑engineering circles don’t have one single “rulebook,” but the consensus from real‑world write‑ups and discussions is clear: if you’re building an “AI wrapper” or LLM‑based product, the way you succeed (and avoid backlash) is by focusing on the hard infrastructure and reliability problems, not just the UI or “vibe.”

We learned this the hard way. In one project we ran, we watched a single runaway agent hit six figures in tokens before the dashboard even refreshed. Another time, we tried in‑memory counters for budgets – after a restart, everyone’s limit was reset and we started overbilling users. Oops.

A single bug or malicious user can still drain $1,000 of OpenAI credits in an hour. But the fix isn’t a “better wrapper” – it’s LLM operations: treating the model like any other expensive, unreliable external service (Stripe, S3, Kafka). Let’s walk through the patterns that protect your wallet, then see one concrete implementation.


1. The principles (code‑agnostic, works for any LLM)

Before we touch code, internalise these four guardrails. They apply whether you’re using OpenAI, Anthropic, Llama, or a mix.

① Per‑user / per‑org token budgets (with rolling windows)

Every token‑consuming request should be associated with a budget context. We found it safer to enforce hourly or daily limits that persist across restarts – an in‑memory counter that resets when your process dies is useless. (Yes, we learned that one the expensive way.)

② Per‑job circuit breakers

Long‑running AI tasks (summaries, batch inference) can loop or stall. You need a way to kill a job mid‑stream when it exceeds a cost or time threshold. That requires persistent job state: the worker must periodically check if it’s still allowed to continue.

③ Idempotency for every mutating request

Retries, webhooks, and double‑clicks are silent budget killers. Every request that calls an LLM should carry an idempotency key. The first request processes; duplicates receive a cached response – no extra tokens.

④ Crash‑recoverable job queues

If a worker dies while an LLM call is streaming, you risk orphaned billing. Jobs must be stored in a durable queue (Redis, Postgres) with atomic claiming and a recovery mechanism for “processing” jobs that exceed a timeout.

What this costs in real life

A single runaway loop generating 10M output tokens at GPT‑4o rates (~$2.50/1M input, $10/1M output) burns $100 in minutes. Without these four patterns, you’re exposed. (And that’s a relatively cheap model – the same loop on a pricier one like Claude Opus costs several times more.)
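To make the arithmetic concrete, here is a back-of-envelope cost estimator using the GPT‑4o list prices quoted above (function and constant names are ours, purely for illustration):

```typescript
// USD per 1M tokens, as quoted in the text.
const INPUT_RATE_PER_M = 2.5;
const OUTPUT_RATE_PER_M = 10;

function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_RATE_PER_M +
    (outputTokens / 1_000_000) * OUTPUT_RATE_PER_M
  );
}

// A runaway loop emitting 10M output tokens:
console.log(estimateCostUsd(0, 10_000_000)); // 100
```

Ten minutes of that loop and you’ve paid for a month of infrastructure.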


2. Turning patterns into code (using KeelStack as one example)

Wrappers are easy; guardrails are hard.

The code below comes from KeelStack – an open‑source framework that ships with a budget‑aware LLM gateway out of the box. But the patterns are universal; you could implement them yourself (if you enjoy debugging distributed state at 2am).

Pattern ① → TokenBudgetTracker

// Per-user hourly token budget (simplified in-memory sketch).
// The production version persists usage in Redis/Postgres, so limits
// survive restarts and are enforced across replicas.
// Configurable per user tier or plan.

export class TokenBudgetTracker {
  private readonly usage = new Map<string, { tokens: number; windowStart: number }>();
  private readonly windowMs = 3_600_000; // 1 hour

  constructor(private readonly budgetPerWindow: number) {}

  canSpend(userId: string, estimatedTokens: number): boolean {
    const now = Date.now();
    const record = this.usage.get(userId);
    if (!record || now - record.windowStart > this.windowMs) return true;
    return record.tokens + estimatedTokens <= this.budgetPerWindow;
  }

  record(userId: string, tokensUsed: number): void {
    const now = Date.now();
    const existing = this.usage.get(userId);
    if (!existing || now - existing.windowStart > this.windowMs) {
      this.usage.set(userId, { tokens: tokensUsed, windowStart: now });
    } else {
      existing.tokens += tokensUsed;
    }
  }
}

Before calling the LLM, we call canSpend(). If it returns false, we reject the request immediately – no API call, no bill. We initially tried just logging a warning and letting it through. Bad idea.
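Here is a minimal sketch of that check-then-record flow (`makeGuard` and `BudgetTracker` are our names, not KeelStack’s; the interface just mirrors the tracker above):

```typescript
// Guard sketch: check the budget before the call, record actual usage after.
interface BudgetTracker {
  canSpend(userId: string, estimatedTokens: number): boolean;
  record(userId: string, tokensUsed: number): void;
}

function makeGuard(tracker: BudgetTracker) {
  return async function guarded<T extends { totalTokens: number }>(
    userId: string,
    estimatedTokens: number,
    call: () => Promise<T>,
  ): Promise<T> {
    if (!tracker.canSpend(userId, estimatedTokens)) {
      // Reject before the provider is ever contacted: no API call, no bill.
      throw new Error('Hourly token budget exceeded');
    }
    const result = await call();
    tracker.record(userId, result.totalTokens); // charge actual usage, not the estimate
    return result;
  };
}
```

Note that we record the *actual* token count from the response, not the pre-call estimate – estimates exist only to decide whether the call is allowed to start.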

Pattern ② + ④ → Persistent job store with atomic claiming

// Simplified job shape (the real PersistedJob also carries payload, priority, etc.).
export interface PersistedJob {
  id: string;
  state: 'pending' | 'processing' | 'completed' | 'failed';
  attempts: number;
  createdAt: string;
  claimedAt?: string;
}

export interface PersistentJobStore {
  enqueue(job: Omit<PersistedJob, 'state' | 'attempts' | 'createdAt'>): Promise<void>;
  claim(jobId: string): Promise<PersistedJob | null>;
  complete(jobId: string): Promise<void>;
  fail(jobId: string, error: string): Promise<void>;
  recoverOrphans(timeoutMs: number): Promise<PersistedJob[]>;
}

export class RedisPersistentJobStore implements PersistentJobStore {
  async claim(jobId: string): Promise<PersistedJob | null> {
    const luaScript = `
      local data = redis.call('GET', KEYS[1])
      if not data then return nil end
      local job = cjson.decode(data)
      if job.state ~= 'pending' then return nil end
      job.state = 'processing'
      job.claimedAt = ARGV[1]
      redis.call('SET', KEYS[1], cjson.encode(job), 'EX', ARGV[2])
      return cjson.encode(job)
    `;
    // ... (full implementation in KeelStack)
  }
}

The Lua script ensures only one worker claims a given job – no double‑processing. The worker also periodically checks a circuit breaker; if the token budget for that job is exceeded, it calls fail() and cancels the LLM stream. We learned to add that after a stuck job ran for four hours.
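What that mid-stream check can look like, sketched with illustrative shapes (these are not KeelStack’s real types): the worker accumulates usage per chunk and trips the breaker the instant the job’s budget is exhausted, instead of discovering the damage when the stream ends.

```typescript
// Circuit-breaker sketch for a streaming job: the budget is re-checked on
// every chunk, so a runaway stream is cancelled mid-flight.
interface JobBudget { tokensUsed: number; tokenLimit: number; }

async function runStreamingJob(
  jobId: string,
  budget: JobBudget,
  stream: AsyncIterable<{ text: string; tokens: number }>,
  fail: (jobId: string, error: string) => Promise<void>,
  complete: (jobId: string) => Promise<void>,
): Promise<string> {
  let output = '';
  for await (const chunk of stream) {
    budget.tokensUsed += chunk.tokens;
    if (budget.tokensUsed > budget.tokenLimit) {
      await fail(jobId, 'token budget exceeded'); // breaker trips: persist the failure
      throw new Error(`job ${jobId} cancelled: token budget exceeded`);
    }
    output += chunk.text;
  }
  await complete(jobId);
  return output;
}
```

In a real worker you would also abort the underlying HTTP stream (e.g. via an `AbortController`) so the provider stops generating – throwing alone only stops *your* side of the pipe.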

Pattern ③ → Idempotency middleware

const MUTATING_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);
const IDEMPOTENCY_HEADER = 'idempotency-key'; // Node lower-cases incoming header names
const IDEMPOTENCY_TTL_SECONDS = 86_400; // generous relative to the slowest request

export function idempotencyMiddleware({ store, namespace }: IdempotencyMiddlewareOptions) {
  return async function (req, res, next) {
    if (!MUTATING_METHODS.has(req.method)) return next();

    const rawKey = req.headers[IDEMPOTENCY_HEADER];
    if (!rawKey || typeof rawKey !== 'string') return next();

    const storeKey = `${namespace}:${rawKey}`;
    const claimed = await store.tryClaimKey(storeKey, IDEMPOTENCY_TTL_SECONDS, {
      processedAt: new Date().toISOString(),
      requestId: req.headers['x-request-id'],
      source: namespace,
    });

    if (!claimed) {
      const record = await store.getRecord(storeKey);
      res.status(200).json({
        idempotent: true,
        processedAt: record?.processedAt,
        message: 'Request already processed. This is a replayed response.',
      });
      return;
    }

    try { await next(); }
    catch (err) { await store.releaseKey(storeKey); throw err; }
  };
}

Now retried webhooks or duplicate UI clicks won’t trigger a second LLM call – they receive the cached response. Honestly, we should have added this day one. It’s embarrassing how many duplicate charges we ate before we wised up.
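The client side matters just as much: the key must be generated once per *logical action* and reused across retries. A sketch of that discipline (the `idempotency-key` header name and `FetchLike` stand-in are assumptions for illustration – match the header to whatever your gateway’s `IDEMPOTENCY_HEADER` is configured as):

```typescript
import { randomUUID } from 'node:crypto';

// Narrow stand-in for fetch so the sketch is easy to test.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function submitSummaryJob(documentId: string, fetchFn: FetchLike) {
  const key = randomUUID(); // one key per user action, NOT per attempt
  const attempt = () =>
    fetchFn('/api/summarize', {
      method: 'POST',
      headers: { 'content-type': 'application/json', 'idempotency-key': key },
      body: JSON.stringify({ documentId }),
    });

  let res = await attempt();
  if (!res.ok) res = await attempt(); // retry reuses the key → replayed response
  return res.json();
}
```

Generate the key where the user’s intent originates (the button click, the webhook event ID) – if each retry mints a fresh key, the middleware can’t help you.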


3. Acknowledge the “wrapper” skepticism (and why guardrails matter)

Let’s be honest: the market is flooded with “AI wrappers.” Many are thin UI layers over an OpenAI key. That’s why experts roll their eyes.

This post is not about the wrapper. It’s about the guardrails.

Yes, this is more complex than a toy wrapper. That’s the point.

The complexity lives in the infrastructure:

  • Distributed job claiming via Lua scripts (because a race condition on pending jobs = double billing)
  • Persistence across restarts (lose your in‑memory budget counter? Congratulations, you just reset everyone’s limit)
  • Idempotency handling across retries, webhooks, and partial failures

These are hard problems. KeelStack solves them so you don’t have to – but the patterns themselves are what protect your bottom line.


4. DIY proxy vs. dedicated gateway – a risk‑appetite discussion

You could build all this yourself. Grab a Redis client, write a few middleware functions, glue them together. But consider the edge cases we ran into:

| DIY challenge | Why it’s painful (we found out the hard way) |
| --- | --- |
| Atomic job claiming across 10 replicas | You’ll end up writing Lua anyway – or introducing race conditions. We had two workers process the same job once. Fun times. |
| Budget tracker surviving restarts | You need a persistent store (Redis/Postgres) and atomic increments. Our in‑memory version lost state on every deploy. |
| Circuit breakers for streaming responses | Handling token counting mid‑stream while a job may crash. We gave up and used a gateway. |
| Idempotency with variable TTLs | What if a request takes longer than your key TTL? Now duplicates leak through. We learned to set TTLs generously. |
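On that last row, our rule of thumb is simple: the key must outlive the slowest plausible request by a wide margin. A sketch, with purely illustrative numbers:

```typescript
// TTL sizing: if a key can expire while its request is still in flight,
// a retry will slip past the idempotency check and bill you twice.
const MAX_REQUEST_MS = 120_000; // slowest LLM call you tolerate (2 min)
const SAFETY_FACTOR = 10;       // margin for retries and queue delays

const idempotencyTtlSeconds = Math.ceil((MAX_REQUEST_MS * SAFETY_FACTOR) / 1000);
console.log(idempotencyTtlSeconds); // 1200
```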

A dedicated gateway (like KeelStack’s LLMClient) bakes in these solutions. It’s not about avoiding work – it’s about avoiding the $1,000 mistake while you focus on your actual product.


Ready to stop bleeding tokens?

Explore the patterns, steal the code, or just grab the framework. But whatever you do, don’t ship another AI wrapper without circuit breakers.

👉 KeelStack – Budget‑aware LLM gateway
