Phasu Yeneng
Stop Your OpenAI Bill from Exploding: Per-User LLM Budget Caps in Node.js

The cost incident that started this

Three weeks after we put our chatbot into production, I opened the OpenAI billing dashboard on a Monday morning and stopped breathing for a second. One session — not one user, one session — had burned through roughly four times the daily budget for the entire app. Over a single afternoon.

The session wasn't malicious. It was a test account someone forgot to log out of, hammering the chat endpoint in the background while reloading a broken page. No rate limit was breached. No alarm fired. No infrastructure metric looked unusual. The only place it showed up was the bill at the end of the month.

That was the day I learned that rate limits and budget limits are not the same thing, and that running an LLM-powered app without a per-user cost cap is roughly the same as putting a credit card behind a public form and hoping nobody fills it in 800 times.

This post walks through the pattern I now use in every Node.js + Express app that talks to OpenAI: track first, then cap, then degrade gracefully, then cache aggressively. It's framework-agnostic, Postgres-backed, and a fresh team can ship it in a single afternoon.

Why "rate limit" is the wrong abstraction

Classic rate limiting nudges you toward thinking in requests per minute. That model works fine for a REST API where every request costs roughly the same. It falls apart for LLM APIs because request count is decoupled from cost.

Compare two requests to the same endpoint:

  • A 50-token "what's your refund policy?" question → roughly $0.0005
  • A 50-page pasted document with "summarize this" prompt → roughly $0.30

That's a 600× cost spread on two requests that both count as "1" to a token bucket. If your rate limit is 60 requests/minute, an attacker — or a buggy client, or a curious power user — can drive your bill into triple digits per hour while staying perfectly within rate-limit bounds.

You need to cap the dollar value, not the request count.
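To make the spread concrete, here's a back-of-envelope estimator. The rates are assumed gpt-4o list prices in USD per 1M tokens; treat them as placeholders and check the current pricing page before relying on them:

```javascript
// Back-of-envelope cost estimate. Rates are assumed gpt-4o list prices
// (USD per 1M tokens) -- placeholders, not gospel.
function estimateCostUsd({ promptTokens, completionTokens }) {
  const PROMPT_PER_M = 2.50;
  const COMPLETION_PER_M = 10.00;
  return (promptTokens / 1e6) * PROMPT_PER_M +
         (completionTokens / 1e6) * COMPLETION_PER_M;
}

const faq  = estimateCostUsd({ promptTokens: 50, completionTokens: 40 });       // tiny question
const dump = estimateCostUsd({ promptTokens: 100_000, completionTokens: 2_000 }); // pasted document
console.log({ faq, dump, spread: Math.round(dump / faq) });
```

Both calls count as "1 request" to a token bucket; only dollar accounting tells them apart.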

Step 1 — Measure before you cap

You cannot cap what you do not measure. The first thing to ship is a single funnel that every LLM call passes through, with structured logging into a real database (not a JSON file, not an analytics tool — something you can JOIN and WHERE against in real time).

Here's the schema I use, simplified:

CREATE TABLE llm_usage_logs (
  id                     BIGSERIAL PRIMARY KEY,
  session_id             TEXT NOT NULL,
  model                  TEXT NOT NULL,
  prompt_tokens          INT  NOT NULL DEFAULT 0,
  cached_prompt_tokens   INT  NOT NULL DEFAULT 0,    -- subset of prompt_tokens that hit OpenAI's cache
  completion_tokens      INT  NOT NULL DEFAULT 0,
  total_tokens           INT  NOT NULL DEFAULT 0,
  prompt_cost_usd        NUMERIC(10,6) NOT NULL DEFAULT 0,
  cached_prompt_cost_usd NUMERIC(10,6) NOT NULL DEFAULT 0,  -- billed at ~50% of full prompt rate
  completion_cost_usd    NUMERIC(10,6) NOT NULL DEFAULT 0,
  total_cost_usd         NUMERIC(10,6) NOT NULL DEFAULT 0,
  finish_reason          TEXT,
  response_time_ms       INT,
  created_at             TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX llm_usage_logs_session_time_idx
  ON llm_usage_logs (session_id, created_at DESC);

Three non-obvious choices:

  1. Store cost in USD as NUMERIC, not as cents in an integer. Token-priced cost has 4–6 significant decimal digits. If you store cents, you'll round most short calls to zero and the arithmetic gets useless.
  2. Index on (session_id, created_at DESC). Every "is this user over budget?" query scans recent rows for a session. Without this index it's a sequential scan, and you'll regret it the day usage spikes.
  3. Provision the cached-token columns from day one. Even if you're not using prompt caching yet, adding cached_prompt_tokens and cached_prompt_cost_usd up front saves you a migration later — they default to 0 and Step 2 wires them up.

Then a single logging function:

async function logLLMCall({
  session_id, model,
  prompt_tokens, completion_tokens, total_tokens,
  finish_reason, response_time_ms,
}) {
  // All prices below are USD per 1M tokens (NOT per 1K).
  const promptPricePerM     = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)     || 2.50;
  const completionPricePerM = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M) || 10.00;

  const promptCost     = (prompt_tokens     / 1_000_000) * promptPricePerM;
  const completionCost = (completion_tokens / 1_000_000) * completionPricePerM;
  const totalCost      = promptCost + completionCost;

  await pg.query(
    `INSERT INTO llm_usage_logs
       (session_id, model, prompt_tokens, completion_tokens, total_tokens,
        prompt_cost_usd, completion_cost_usd, total_cost_usd,
        finish_reason, response_time_ms)
     VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)`,
    [session_id, model, prompt_tokens, completion_tokens, total_tokens,
     promptCost, completionCost, totalCost, finish_reason, response_time_ms]
  );
}

Step 2 — Get the cost math right

Three things that bite people in cost calculation:

Different pricing per model. GPT-4o costs on the order of 16× what GPT-4o-mini costs per million tokens ($2.50 vs $0.15 for input at the time of writing). Don't hardcode prices; pull them from env vars (or a small model_pricing table) keyed by model name. When OpenAI announces new pricing — and they will — you change config, not code.

Trust the API's token count, not your local one. Tokenizers like tiktoken are close to what the API actually charges, but not identical. The number that matters is response.usage.{prompt,completion}_tokens returned in the API response. Log that, not your local pre-call estimate.
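A small guard I use for this — a sketch that assumes the Chat Completions `usage` shape and defaults missing fields to 0, so a malformed response degrades to a zero-cost log row instead of a crash:

```javascript
// Pull the billable numbers off an OpenAI-style response instead of
// re-tokenizing locally. Missing fields default to 0 -- consider also
// logging a warning when `usage` is absent, so zeroes don't hide silently.
function extractUsage(response) {
  const u = response?.usage ?? {};
  return {
    prompt_tokens:     u.prompt_tokens     ?? 0,
    completion_tokens: u.completion_tokens ?? 0,
    total_tokens:      u.total_tokens      ?? 0,
  };
}
```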

Streaming responses still report usage — if you ask for it. With stream: true, you must pass stream_options: { include_usage: true } to get a final usage chunk. Many people miss this and end up logging 0 tokens for every streamed call, which silently zeroes their cost dashboard.

const stream = await openai.chat.completions.create({
  model, messages,
  stream: true,
  stream_options: { include_usage: true },
});

let usage;
for await (const chunk of stream) {
  if (chunk.usage) usage = chunk.usage;        // arrives in the final chunk
  // ... yield chunk.choices[0].delta to client
}

await logLLMCall({ session_id, model, ...usage, ...timing });  // timing = { response_time_ms }, measured around the call

Don't forget prompt caching — it changes the math

If you've enabled OpenAI prompt caching (it kicks in automatically once your prompt prefix is long and reused), part of your prompt tokens come back at roughly half price. They show up under usage.prompt_tokens_details.cached_tokens. If you ignore the field, your dashboard will overstate spend by 20–30% — and worse, you'll under-credit the optimizations you're actually doing.

Three-rate calculation:

function calculateCost(usage) {
  const cached    = usage.prompt_tokens_details?.cached_tokens || 0;
  const promptRaw = (usage.prompt_tokens || 0) - cached;          // billed at full rate
  const completion = usage.completion_tokens || 0;

  // All prices below are USD per 1M tokens.
  const fullPromptPerM   = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)        || 2.50;
  const cachedPromptPerM = parseFloat(process.env.LLM_CACHED_PROMPT_PRICE_PER_M) || 1.25;  // ~50% off full
  const completionPerM   = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M)    || 10.00;

  return {
    prompt_cost_usd:        (promptRaw  / 1_000_000) * fullPromptPerM,
    cached_prompt_cost_usd: (cached     / 1_000_000) * cachedPromptPerM,
    completion_cost_usd:    (completion / 1_000_000) * completionPerM,
    total_cost_usd:
      (promptRaw  / 1_000_000) * fullPromptPerM +
      (cached     / 1_000_000) * cachedPromptPerM +
      (completion / 1_000_000) * completionPerM,
  };
}

Concrete example: a 4,000-token system prompt with a 2,000-token cache hit, plus 200 completion tokens on gpt-4o:

| Component         | Tokens | Rate (per 1M) | Cost          |
|-------------------|-------:|--------------:|--------------:|
| Prompt (uncached) |  2,000 |         $2.50 |     $0.005000 |
| Prompt (cached)   |  2,000 |         $1.25 |     $0.002500 |
| Completion        |    200 |        $10.00 |     $0.002000 |
| **Total**         |        |               | **$0.009500** |

Without caching that same call would be $0.0120 — caching saves about 21%. The cached_prompt_cost_usd column you provisioned in Step 1 is what lets you track that caching ROI directly.

A note on currency

If you operate in a non-USD currency (we report in THB internally), store the canonical cost as USD and convert at query time. Exchange rates drift; locking yesterday's rate into the row makes month-over-month comparisons quietly lie to you.

Step 3 — Per-session budget middleware

Now that costs are visible, add an Express middleware that runs before the LLM call:

async function budgetGuard(req, res, next) {
  const sessionId = req.session?.id || req.body.session_id;
  const tier      = req.user?.tier || 'anonymous';

  const dailyCapUsd = {
    anonymous: 0.50,
    free:      2.00,
    paid:     20.00,
    internal: 100.00,
  }[tier] ?? 0.50;  // unknown tier falls back to the smallest cap, never undefined

  const { rows } = await pg.query(
    `SELECT COALESCE(SUM(total_cost_usd), 0)::float AS spent
       FROM llm_usage_logs
      WHERE session_id = $1
        AND created_at > NOW() - INTERVAL '24 hours'`,
    [sessionId]
  );

  const spent = rows[0].spent;
  req.budget = { spent, cap: dailyCapUsd, remaining: dailyCapUsd - spent };

  if (spent >= dailyCapUsd) {
    return res.status(429).json({
      error: 'daily_budget_exceeded',
      message: 'You have hit your daily usage limit. Try again in 24 hours.',
      spent_usd: spent,
      cap_usd: dailyCapUsd,
    });
  }

  next();
}

app.post('/chat', budgetGuard, chatHandler);

⚠️ Security note on the identity you cap against. The example above falls back to req.body.session_id for clarity, but in production never trust an identifier the client can rotate. A hostile (or just curious) client can change session_id on every request and dodge the cap entirely. Pull the identity from a verified source: a signed cookie session, a JWT subject claim, or the authenticated user object set by your auth middleware. Treat the body fallback as prototype-only — replace it with the real authenticated principal before anything ships.
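One way to encode that rule — a sketch assuming your auth middleware populates req.user and that sessions are server-issued via a signed cookie store:

```javascript
// Resolve the identity the budget is keyed on, from trusted sources only.
// Assumes auth middleware sets req.user and req.session is server-issued.
// Deliberately NO req.body fallback: the client must not pick its own key.
function resolveBudgetIdentity(req) {
  if (req.user?.id)    return `user:${req.user.id}`;       // authenticated principal
  if (req.session?.id) return `session:${req.session.id}`; // signed cookie session
  return null;                                             // no trusted identity
}
```

When it returns null, respond 401 and stop — don't fall back to anything client-supplied.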

A few real-world refinements:

  • Tier the cap. Anonymous traffic gets the smallest budget, paid users get more. Don't give every visitor a $20/day allowance by default.
  • 24h sliding window, not calendar day. A user who maxes out at 23:59 shouldn't get a fresh budget at 00:00.
  • Attach the budget object to req. Downstream handlers can read req.budget.remaining to make smarter decisions — see the next section.

Step 4 — Soft cap → hard cap → fallback (the 3-tier pattern)

A single threshold is too binary. Instead, treat the cap as three zones:

| Zone   | Trigger      | Action |
|--------|--------------|--------|
| Green  | < 80% of cap | Use the premium model. Business as usual. |
| Yellow | 80–100%      | Log a warning, ping #alerts on Slack, degrade to a cheaper model (e.g. gpt-4o-mini). User keeps getting answers. |
| Red    | > 100%       | Hard-refuse with a friendly message. |

The decision flow looks like this:

flowchart TD
    A[Incoming chat request] --> B[Look up 24h spend for user]
    B --> C{spent < 80% of cap?}
    C -- yes --> D["🟢 GREEN<br/>use premium model"]
    C -- no --> E{spent < 100% of cap?}
    E -- yes --> F["🟡 YELLOW<br/>fall back to cheap model<br/>+ Slack alert"]
    E -- no --> G["🔴 RED<br/>return 429<br/>budget_exceeded"]
    D --> H[Call LLM, log usage, respond]
    F --> H
function pickModel(req) {
  const usage = req.budget.spent / req.budget.cap;
  if (usage >= 1.0) return null;            // upstream middleware already 429'd
  if (usage >= 0.8) return 'gpt-4o-mini';   // graceful degradation
  return 'gpt-4o';
}

The Yellow zone is the trick that makes this user-friendly instead of abusive. The product still works at 90% of cap; it's just running on a cheaper engine. Most users won't notice. Engineers who wake up to the Slack alert can investigate before the wall is hit.
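One practical wrinkle: a user hovering at 85% of cap crosses into Yellow on every request, so naive alerting spams the channel. A per-session cooldown keeps it to one ping — a sketch (the actual Slack webhook call is left out):

```javascript
// At most one yellow-zone alert per session per cooldown window.
const lastAlertAt = new Map();            // sessionId -> timestamp (ms)
const ALERT_COOLDOWN_MS = 60 * 60 * 1000; // 1 hour

function shouldAlert(sessionId, now = Date.now()) {
  const last = lastAlertAt.get(sessionId) ?? -Infinity; // first alert always fires
  if (now - last < ALERT_COOLDOWN_MS) return false;
  lastAlertAt.set(sessionId, now);
  return true;
}
```

For multi-instance deployments you'd move the map into Redis or Postgres, but the in-memory version is fine for a single node.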

Step 5 — Caching is the largest cost cut you can make

Logging and capping reduce runaway spend. Caching reduces baseline spend, often by 20–50%. Two layers, in order of effort:

Exact-match cache (cheap). Normalize the question (lowercase, trim, collapse whitespace). If you've answered the exact same thing in the last 24h, return the stored answer.
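The normalization step can be this small — extend it with punctuation stripping only if your traffic warrants it:

```javascript
// Canonical form for exact-match lookups: case, edge whitespace, and
// internal whitespace are the three things users vary without meaning to.
function normalizeQuestion(q) {
  return q.toLowerCase().trim().replace(/\s+/g, ' ');
}
```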

// Assumes `question` and `answer` TEXT columns exist alongside the usage
// columns (they're not in the Step 1 schema — add them, or keep a separate
// answer_cache table). Pass in the already-normalized question.
async function findExactMatch(question) {
  const { rows } = await pg.query(
    `SELECT answer FROM llm_usage_logs
      WHERE question = $1
        AND answer IS NOT NULL
        AND created_at > NOW() - INTERVAL '24 hours'  -- the 24h window from above
      ORDER BY created_at DESC
      LIMIT 1`,
    [question]
  );
  return rows[0]?.answer || null;
}

Semantic cache (high ROI). Embed the incoming question, look for past questions with cosine similarity ≥ 0.95, return their stored answers. This catches "what is your refund policy?" vs "how do refunds work?" — textually different, semantically identical. If your hit rate is 30%, you've cut 30% off your bill the day you ship it.
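The core of the lookup is just a cosine similarity against a threshold. Getting the embeddings (e.g. OpenAI's embeddings API, stored in pgvector) is assumed here; this sketch shows only the comparison:

```javascript
// Cosine similarity between two embedding vectors. The 0.95 threshold is a
// starting point -- tune it against real paraphrase pairs from your traffic.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SEMANTIC_THRESHOLD = 0.95;
const isSemanticHit = (qEmb, cachedEmb) =>
  cosineSimilarity(qEmb, cachedEmb) >= SEMANTIC_THRESHOLD;
```

In production you'd push this comparison into the database (pgvector can do the nearest-neighbor search) rather than scanning embeddings in Node.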

A practical tip: don't cache personalized or stateful answers. Cache the FAQ-style stuff. A simple cacheable: true flag on the prompt template handles this cleanly without leaking one user's data to another.

Step 6 — Observability you'll actually look at

Build one endpoint that returns this morning's numbers:

app.get('/api/llm/budget', async (req, res) => {
  const { rows } = await pg.query(`
    SELECT
      DATE(created_at)            AS date,
      model,
      COUNT(*)                    AS calls,
      SUM(total_tokens)           AS tokens,
      SUM(total_cost_usd)::float  AS cost_usd
    FROM llm_usage_logs
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY DATE(created_at), model
    ORDER BY date DESC
  `);
  res.json({ daily: rows });
});

Pipe it into a small HTML dashboard or a Grafana board. Set one alert worth its salt: >30% above the 7-day rolling average triggers a Slack ping. That single alert has caught every cost incident I've had since I shipped it.
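A sketch of that alert query — today's spend vs. the trailing 7-day average, with today excluded from the baseline. Run it from a cron job and fire the Slack ping when should_alert comes back true:

```sql
-- Alert when today's spend runs >30% above the trailing 7-day average.
WITH daily AS (
  SELECT DATE(created_at) AS day, SUM(total_cost_usd)::float AS cost
    FROM llm_usage_logs
   WHERE created_at > NOW() - INTERVAL '8 days'
   GROUP BY DATE(created_at)
)
SELECT t.cost                    AS today_usd,
       h.avg_cost                AS baseline_usd,
       t.cost > h.avg_cost * 1.3 AS should_alert
  FROM (SELECT cost FROM daily WHERE day = CURRENT_DATE) t,
       (SELECT AVG(cost) AS avg_cost FROM daily WHERE day < CURRENT_DATE) h;
```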

A simple version of the dashboard looks roughly like this — replace with a real screenshot once you have one:

┌──────────────────────────────────────────────────────────────────┐
│  LLM SPEND — last 7 days                              ↻ refresh  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   $12 ┤                                            ▇             │
│   $10 ┤             ▇                              ▇             │
│    $8 ┤      ▇      ▇      ▇             ▇        ▇             │
│    $6 ┤      ▇      ▇      ▇      ▇      ▇        ▇             │
│    $4 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $2 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $0 ┴┴─────┴──────┴──────┴──────┴──────┴────────┴───           │
│       Mon   Tue    Wed    Thu    Fri    Sat      Sun ⚠ +47%      │
│                                                                  │
│  Today          $11.83   ████████████████  (cap $20)             │
│  Top session    sess_8f2 $4.12  • 312 calls  • gpt-4o            │
│  Cache hit      31.4%    (≈ $5.20 saved today)                   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

The three numbers I look at every morning: today's total, the highest-spend session, and cache hit rate. Anything weird shows up in one of those three.

Pitfalls I learned the hard way

Tool calls aren't always counted the way you expect. Older SDK versions occasionally return 0 for completion_tokens when the response is a tool call instead of plain text. Always log the full usage object and verify against your dashboard once a week.

Retries double-count. If your HTTP client retries a 5xx, you can log the same call twice. Pass an idempotency key — the request ID works — and dedupe on it.
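The dedupe can live in the database, which also covers two app instances retrying the same call — a sketch, assuming your HTTP layer exposes a stable request ID:

```sql
-- Migration: at most one log row per request, enforced by the database.
-- (Unique index on a nullable column still permits NULLs for legacy rows.)
ALTER TABLE llm_usage_logs ADD COLUMN request_id TEXT;
CREATE UNIQUE INDEX llm_usage_logs_request_id_idx
  ON llm_usage_logs (request_id);

-- Then make the insert in logLLMCall idempotent:
-- INSERT INTO llm_usage_logs (..., request_id) VALUES (..., $11)
-- ON CONFLICT (request_id) DO NOTHING;
```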

Logging on the hot path. Don't await the log insert before responding to the user; the user is waiting on you. Fire-and-forget with a try/catch, or push to a small in-memory queue and drain in the background. But bound the queue size: an unbounded queue is just a memory leak waiting for traffic.
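A minimal version of that bounded queue — the real database insert is injected as `flush` (a hypothetical callback), so the queue itself stays dependency-free and testable:

```javascript
// Fire-and-forget log queue with a hard size bound. When full, the oldest
// entry is dropped: a lost log row beats unbounded memory growth.
class BoundedLogQueue {
  constructor(flush, maxSize = 10_000) {
    this.flush = flush;   // async (entry) => insert into llm_usage_logs
    this.maxSize = maxSize;
    this.items = [];
    this.dropped = 0;     // worth exporting as a metric
  }
  push(entry) {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // drop oldest
      this.dropped++;
    }
    this.items.push(entry);
  }
  async drain() {         // call from a setInterval, never from the hot path
    while (this.items.length) {
      const entry = this.items.shift();
      try { await this.flush(entry); } catch { /* never crash the drainer */ }
    }
  }
}
```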

Long system prompts compound. A 4,000-token system prompt × 10 messages per session × 1,000 sessions/day is 40M prompt tokens a day (about $100 at $2.50 per 1M), even if every user message is short. Use OpenAI's prompt caching where supported, and trim the system prompt aggressively. Most "best practice" system prompts on the internet are 3× longer than they need to be.

Cached input is cheaper — and the field is easy to miss. Prompt caching lives in usage.prompt_tokens_details.cached_tokens, not at the top level. We covered the math in Step 2; the practical pitfall is that older code paths often only read prompt_tokens and silently double-bill anything cached. Audit every place that reads usage.

What this gets you

After shipping the full pattern in our app:

  • No more cost surprises. The worst possible day is now bounded by cap_per_user × active_users. We can math it out before launching a feature.
  • Abuse is a non-event. The same loop that burned through a daily budget in three hours now hits the 429 in twelve minutes and stays quiet.
  • Spend is steerable. When we want to spend less, we tighten the Yellow threshold. When we want to spend more on quality, we raise it. It's a knob, not a panic button.
  • Engineers stopped being scared of LLM features. "What if it gets expensive?" finally has an answer.

Where to go next

The version above is the minimum viable pattern. Once it's running, the natural extensions are:

  • Per-feature budgets — chat, summarization, and embedding refresh each get their own cap.
  • Anomaly detection — alert on cost-per-session 3σ above the mean instead of a fixed threshold.
  • Auto-throttle — when the daily org-wide cap is approaching, slow down lower-tier users first.
  • Pre-flight estimation — refuse a 200KB pasted blob before it hits the API, not after.

Each is a follow-up post. The foundation — log, cap, degrade, cache — is what stops the bleeding tonight.


If you ship this pattern and it saves you money, I'd love to hear what your hit rate on the semantic cache ended up being. That number was the most surprising thing for me — much higher than I expected.
