Akash Melavanki

Rate limiting your LLM API is useless. Here's what actually protects you.

Last month, the LiteLLM supply chain attack exposed API keys across thousands of developer projects.

The standard advice: rotate your keys immediately.

Here's what nobody tells you after that: a rotated key doesn't protect you from the next attack. Rate limiting doesn't either. I'll show you why — and what actually works.


The problem with rate limiting LLMs

Rate limiting assumes all requests cost roughly the same. For traditional APIs, that's true.

For LLMs, it's completely wrong.

Request 1: "Hi"
→ ~10 tokens → cost: $0.0001

Request 2: "Summarize this 50-page PDF"
→ ~30,000 tokens → cost: $0.45

An attacker doesn't need high volume. They just need expensive requests. 10 requests per minute means nothing when each request costs $0.45.

What you actually need is budget limiting — enforcing a maximum dollar spend per user, per day, in real time.
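Budget limiting starts by turning tokens into dollars. Here is a minimal TypeScript sketch of that conversion; the prices below are illustrative placeholders, not real provider pricing, and `estimateCost` is a hypothetical helper, not part of any SDK:

```typescript
// Illustrative per-million-token prices (placeholders; check your provider's real pricing)
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10.0 },
};

// Dollar cost of a call: tokens in each direction times their per-token price
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

console.log(estimateCost("gpt-4o", 10, 0));       // a "Hi" costs a fraction of a cent
console.log(estimateCost("gpt-4o", 30_000, 500)); // a 50-page summary costs real money
```

The same function gives wildly different answers for a "Hi" and a 50-page summary, and that difference is exactly the signal a request-count rate limiter throws away.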


The race condition nobody talks about

OK, so you decide to track spend in Redis. Simple, right?

Wrong. Here's what happens at scale.

Your app receives 10 concurrent requests from the same user.

Instance A reads budget: $0.05 remaining. Proceeds.
Instance B reads budget: $0.05 remaining. Proceeds.
Instance C reads budget: $0.05 remaining. Proceeds.
...all 10 instances read $0.05 and proceed.

All 10 fire $1.00 LLM requests. Your user's budget was $1.00. You just spent $10.00.

This is the race condition. Standard Redis GET + SET cannot solve it — there's always a gap between reading and writing where another instance sneaks through.
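You can watch that gap happen without a Redis server at all. This in-memory model (a stand-in for Redis, not real client code) puts a small delay between the read and the write, the way a network round-trip does:

```typescript
let spent = 0;     // stands in for the Redis key's value
const LIMIT = 1.0; // the user's budget in dollars

// The naive GET-then-SET pattern: read, decide, write, with a gap in between
async function naiveCheckAndSpend(cost: number): Promise<boolean> {
  const current = spent;                       // GET
  await new Promise((r) => setTimeout(r, 10)); // network gap between read and write
  if (current + cost > LIMIT) return false;
  spent = current + cost;                      // SET (clobbers concurrent writers)
  return true;
}

async function main() {
  // 10 concurrent requests from the same user, $1.00 each
  const results = await Promise.all(
    Array.from({ length: 10 }, () => naiveCheckAndSpend(1.0))
  );
  console.log(results.filter(Boolean).length); // prints 10: $10.00 fired against a $1.00 budget
}
main();
```

All ten reads run before any write lands, so every instance sees $0.00 spent and proceeds.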


The fix: atomic Lua scripts

The solution is to move the entire check-and-update logic into a single atomic operation inside Redis. Lua scripts on Redis run as one uninterruptible step — no interleaving, no race condition possible.

-- runs atomically inside Redis — no race condition possible
local key = KEYS[1]              -- e.g. "budget:{userId}:{YYYY-MM-DD}"
local limit = tonumber(ARGV[1])  -- the user's daily budget in dollars
local cost = tonumber(ARGV[2])   -- estimated cost of this request
local current = tonumber(redis.call('GET', key) or "0")

if current + cost > limit then
  return 0 -- BLOCK
end

redis.call('INCRBYFLOAT', key, cost)
redis.call('EXPIRE', key, 86400) -- the per-day key expires itself
return 1 -- ALLOW

This runs in ~10ms on edge infrastructure. Instance A and Instance B hitting this at the exact same millisecond? Redis queues them. One passes, one fails. Budget enforced. Mathematically consistent.
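The same in-memory model from earlier shows why atomicity fixes it. Because Redis executes the Lua script as one uninterruptible step, the check and the increment can be modeled as a single synchronous function with no gap for another request to sneak through (this only mirrors the script's logic for illustration):

```typescript
let spent = 0;
const LIMIT = 1.0;

// Mirrors the Lua script: check and increment happen in one uninterruptible step
function atomicCheckAndSpend(cost: number): boolean {
  if (spent + cost > LIMIT) return false; // BLOCK
  spent += cost;                          // ALLOW, and no one else can read in between
  return true;
}

// The same 10 concurrent $1.00 requests now serialize through the atomic step
const results = Array.from({ length: 10 }, () => atomicCheckAndSpend(1.0));
console.log(results.filter(Boolean).length); // prints 1: budget enforced
```

In production the script itself ships to Redis via EVAL or EVALSHA; the in-memory version exists only to make the serialization visible.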


The two-phase protocol

There's one more problem: you don't know the exact cost of an LLM call until it finishes.

The solution is a two-phase commit:

Phase 1 — pre-flight (before the LLM call)
Estimate the cost from the maximum possible tokens and reserve that amount atomically. If the reservation would exceed the budget, return 429 immediately — the LLM never even gets called.

Phase 2 — reconciliation (after the LLM call)
OpenAI returns actual token usage. Reconcile: release the estimate, apply the real cost. If the estimate was too high, refund the difference back to the user's budget.

This means your budget enforcement is tight even under worst-case conditions.
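In code, the two phases reduce to a reserve and a reconcile. This is an in-memory sketch of the idea (the function names are mine, and the real thing does both steps as atomic Redis operations):

```typescript
let spent = 0;
const LIMIT = 1.0;

// Phase 1 (pre-flight): reserve the worst-case cost before calling the LLM
function reserve(estimate: number): boolean {
  if (spent + estimate > LIMIT) return false; // would exceed budget: 429, no LLM call
  spent += estimate;
  return true;
}

// Phase 2 (reconciliation): swap the estimate for the actual cost
function reconcile(estimate: number, actual: number): void {
  spent += actual - estimate; // refunds the difference when the estimate was high
}

// Reserve $0.45 worst case; the call actually cost $0.12
if (reserve(0.45)) {
  // ...LLM call happens here...
  reconcile(0.45, 0.12);
}
console.log(spent.toFixed(2)); // prints "0.12": only the real cost counts against the budget
```

Because the reservation happens before the call, the budget holds even if the process crashes mid-request; the worst case is an over-reservation that never gets refunded, never an over-spend.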


Benchmark: controlled Denial of Wallet attack

I ran a simulated DoW attack against a standard GPT-4o endpoint.

Setup: a looping attack script firing concurrent requests, each with an 800+ token payload.

                Unprotected     With atomic governance
Duration        47 seconds      Stopped at request 3
Total spend     $847.00         $0.08
Intervention    Manual          Automatic

The governance layer fired a 429 at the third request. The attacker's loop never got traction.


Why this matters after supply chain attacks

Here's the thing about the LiteLLM and Axios breaches: rotating your API key is the right move, but it's reactive. The damage happens in the window between the breach and you waking up.

Budget governance is your last line of defense. Even with a stolen key, the attacker can only drain up to the limit you set. No $1,000 surprise at sunrise.


The implementation

I built this into an open-source SDK called Thskyshield. Two calls wrap any LLM request: a check before it and a log after it.

const { allowed, requestId } = await shield.check({
  externalUserId: userId,
  model: 'gpt-4o',
  estimatedTokens: { input: 500, output: 200 }
})

if (!allowed) return Response.json({ error: 'Budget exceeded' }, { status: 429 })

// ...your LLM call here; the response's usage object supplies the token counts below...

await shield.log({ requestId, externalUserId: userId, model: 'gpt-4o',
  tokens: { input: usage.prompt_tokens, output: usage.completion_tokens }
})

The SDK handles the atomic reservation, the two-phase reconciliation, and the 429 response automatically. Works on Vercel Edge, Cloudflare Workers, or any Node.js environment.

→ SDK: npm install @thsky-21/thskyshield
→ Live attack simulation: thskyshield.com/simulator


What are you actually using to cap LLM spend right now? Are you relying on OpenAI's hard limits, or have you built something custom? Would genuinely like to know what's working.
