Jace · Rivetz

Posted on May 19

Rate Limiting for Lovable Apps: How to Stop Surprise OpenAI Bills

#ai #webdev #vercel #security

AI-built apps typically ship with no rate limiting on their AI endpoints. One user with a loop script can burn through your entire OpenAI budget overnight. Here's how to check, fix, and verify rate limiting on every endpoint that costs money to call.

The short version
Every AI endpoint in your app (chat, generate, summarize, draft) makes a paid API call when invoked. Without rate limiting, anyone who finds the endpoint can call it in a loop. The bill scales linearly with the call volume, and modern providers (OpenAI, Anthropic) charge in real time against your API key. Stories of $1,000-$5,000 surprise bills from a single abusive user are common. The fix is putting a rate limiter in front of every endpoint that costs money.

Why AI endpoints need rate limiting specifically
Rate limiting is the practice of restricting how many requests a given client can make to your API within a time window. For most web endpoints, the consequence of skipping rate limiting is performance (your server gets overloaded). For AI endpoints, the consequence is direct financial loss.

Each call to your AI endpoint triggers a downstream API call to OpenAI, Anthropic, or whichever provider you use. That provider charges you per token consumed. If your app exposes an endpoint that calls GPT-4 with a 4,000-token prompt and gets a 1,000-token response, every invocation costs you real money. At GPT-4 pricing, a single call might be $0.05-$0.20. Sounds tiny. Multiplied by 50,000 abusive calls in an hour, you're looking at a four-figure bill before you wake up.
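To make that arithmetic concrete, here is a minimal sketch of the estimate. The per-1,000-token prices are assumptions based on GPT-4's original list pricing ($0.03 input, $0.06 output), so check your provider's current rates. `estimateCallCostUsd` is a hypothetical helper, not part of any SDK.

```javascript
// Assumed prices in USD per 1,000 tokens (illustrative; verify current pricing).
const ASSUMED_PRICING = { inputPer1k: 0.03, outputPer1k: 0.06 };

// Hypothetical helper: estimated cost of a single call in USD.
function estimateCallCostUsd(promptTokens, completionTokens, pricing = ASSUMED_PRICING) {
  return (promptTokens / 1000) * pricing.inputPer1k
       + (completionTokens / 1000) * pricing.outputPer1k;
}

const perCall = estimateCallCostUsd(4000, 1000);
const abusiveCalls = 50000;

console.log(`Per call: $${perCall.toFixed(2)}`);
console.log(`After ${abusiveCalls} calls: $${(perCall * abusiveCalls).toFixed(2)}`);
```

Under these assumed prices, the 4,000-token prompt and 1,000-token response come to roughly $0.18 per call, or about $9,000 across 50,000 calls.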

The financial damage is also usually irreversible. By the time you notice the spike in your provider dashboard, the charges have already cleared. Some providers offer credit refunds for clear abuse cases, but it's not guaranteed, and the process is slow. The defensive posture has to be: assume any unprotected endpoint will eventually be abused.

The specific failure pattern in Lovable apps
The typical AI endpoint generated by Lovable, Bolt, or V0 looks something like this:

// What AI builders typically generate
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export default async function handler(req, res) {
  const { prompt } = req.body;

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
  });

  res.status(200).json({ result: response.choices[0].message.content });
}
Three problems:

No authentication check. Anyone, including non-logged-in users, can hit this endpoint.
No rate limiting. A single client can hit this in a loop. There's no per-IP, per-session, or per-user throttle.
No input validation on token cost. The user can send a 100,000-token prompt and you pay for it.
The attack is trivial:

while true; do
  curl -X POST https://yourapp.com/api/chat \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a 1000-word essay on rate limiting"}'
done
Run that overnight from a single VM and the bill the next morning is bad. Run it from a few hundred residential proxies and the bill is catastrophic.

How to check if your endpoints are rate limited
Three checks.

Check 1: Read the code
Open your AI endpoint files. Search for any of these patterns:

ratelimit
upstash
rateLimit
throttle
429 (the HTTP status code for rate limit exceeded)
If none of those appear in your AI endpoint code, and no shared middleware or platform firewall applies a limit for you, the endpoint is not rate-limited.
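If you have more than a couple of endpoint files, the search can be scripted. This is a rough sketch: `findRateLimitMarkers` and the two sample strings are hypothetical, and the check is a plain case-insensitive substring match on the patterns listed above.

```javascript
// Markers from the checklist above (lowercased, so 'rateLimit' is covered too).
const RATE_LIMIT_MARKERS = ['ratelimit', 'upstash', 'throttle', '429'];

// Hypothetical helper: returns the markers found in a file's source text.
function findRateLimitMarkers(source) {
  const haystack = source.toLowerCase();
  return RATE_LIMIT_MARKERS.filter((marker) => haystack.includes(marker));
}

const unprotected = "const { prompt } = req.body; await openai.chat.completions.create({});";
const guarded = "import { Ratelimit } from '@upstash/ratelimit'; res.status(429);";

console.log(findRateLimitMarkers(unprotected)); // []
console.log(findRateLimitMarkers(guarded));     // [ 'ratelimit', 'upstash', '429' ]
```

In a real project you would feed it each endpoint file's contents. A match does not prove the limiter is wired up correctly, and an empty result is only a signal to look closer.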

Check 2: Hammer your own endpoint
Hit your endpoint 50-100 times in quick succession from a script. If every request returns 200 OK, you have no rate limiting. A correctly rate-limited endpoint should start returning 429 after some threshold (often 5-10 requests per minute for AI endpoints).

for i in {1..50}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST https://yourapp.com/api/chat \
    -H "Content-Type: application/json" \
    -d '{"prompt": "test"}'
done
If you see 50 lines of 200, rate limiting is missing. If you see some 200s followed by 429s, rate limiting is working.

Check 3: Look at your provider dashboard
Open your OpenAI/Anthropic usage dashboard. Look at the request count over the last 7 days. Does the volume match your actual user activity? If you have 50 users but you're seeing 5,000 requests per day, something or someone is hitting your endpoint outside of normal user behavior.

How to add rate limiting (Upstash + Vercel)
The standard solution for Vercel-hosted apps is Upstash Ratelimit. It's a serverless rate limiter backed by Upstash Redis. The free tier allowed 10,000 commands per day at the time of writing (check Upstash's current pricing), which is plenty for most apps that are just starting out.

Step 1: Create an Upstash Redis database
Go to upstash.com, sign up free, create a Redis database in the same region as your Vercel deployment. Copy the REST URL and REST token.

Step 2: Add the dependencies
npm install @upstash/ratelimit @upstash/redis
Step 3: Add environment variables to Vercel
UPSTASH_REDIS_REST_URL=https://your-db.upstash.io
UPSTASH_REDIS_REST_TOKEN=your-token-here
Step 4: Wrap your AI endpoint with rate limiting
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import OpenAI from 'openai';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(5, '1 m'), // 5 requests per minute
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export default async function handler(req, res) {
  // Identify the requester (IP for unauthenticated, user_id for logged-in)
  const identifier = req.headers['x-forwarded-for']?.split(',')[0]?.trim()
    ?? req.socket.remoteAddress
    ?? 'anonymous';

  // Check (and count) the request before doing any paid work
  const { success, limit, remaining, reset } = await ratelimit.limit(identifier);

  if (!success) {
    res.setHeader('X-RateLimit-Limit', limit);
    res.setHeader('X-RateLimit-Remaining', remaining);
    res.setHeader('X-RateLimit-Reset', reset);
    // reset is a Unix timestamp in milliseconds; Retry-After expects seconds
    res.setHeader('Retry-After', Math.max(1, Math.ceil((reset - Date.now()) / 1000)));
    return res.status(429).json({ error: 'Rate limit exceeded' });
  }

  // Validate the prompt: it must be a non-empty string within the size cap
  const { prompt } = req.body ?? {};
  if (typeof prompt !== 'string' || prompt.length === 0 || prompt.length > 2000) {
    return res.status(400).json({ error: 'Invalid prompt' });
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 1000, // cap the response size
  });

  res.status(200).json({ result: response.choices[0].message.content });
}
What this does: each incoming request is keyed by the requester's IP. Upstash counts the requests in a sliding 1-minute window. Once a single IP has used its 5 requests in that window, further requests get a 429 instead of reaching OpenAI. Because the window slides, older requests age out continuously, so legitimate users are slowed down rather than permanently blocked.

The added prompt.length check and max_tokens cap also prevent the "send a 100,000-token prompt" attack. Even if a user passes the rate limit, the maximum cost per call is bounded.
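For intuition about the sliding window, here is a small in-memory sketch of one common way to approximate it: count requests in fixed windows and weight the previous window by how much of it still overlaps the last minute. This illustrates the idea only, is not Upstash's code, and `createSlidingWindowLimiter` is a hypothetical name. In-memory state like this is not suitable for serverless, and the map is never pruned.

```javascript
// Illustrative weighted two-window approximation of a sliding window.
function createSlidingWindowLimiter(limit, windowMs) {
  const counts = new Map(); // `${identifier}:${windowIndex}` -> requests counted

  return function check(identifier, now = Date.now()) {
    const windowIndex = Math.floor(now / windowMs);
    const currentKey = `${identifier}:${windowIndex}`;
    const previousKey = `${identifier}:${windowIndex - 1}`;
    const current = counts.get(currentKey) ?? 0;
    const previous = counts.get(previousKey) ?? 0;

    // Share of the previous fixed window that still falls inside the sliding window.
    const overlap = 1 - (now % windowMs) / windowMs;
    const estimated = previous * overlap + current;

    if (estimated >= limit) return { success: false };

    counts.set(currentKey, current + 1); // count the request only when it is allowed
    return { success: true };
  };
}

const check = createSlidingWindowLimiter(5, 60_000);
const t0 = 60_000; // start of a window, in ms
const firstSix = Array.from({ length: 6 }, () => check('203.0.113.7', t0).success);

console.log(firstSix); // [ true, true, true, true, true, false ]
console.log(check('203.0.113.7', t0 + 60_000).success); // false: the last window still counts in full
console.log(check('203.0.113.7', t0 + 90_000).success); // true: half of the old requests have aged out
```

The last two lines show the difference from a fixed window: the client is not let back in the instant a new minute starts, only as its earlier requests age out.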

Layered limits: per-IP, per-user, per-cost
One rate limit usually isn't enough. The real-world approach is to layer multiple limits.

Layer 1: Per-IP
The cheapest defense. Limits anonymous abuse. Easy to bypass with proxies, but raises the cost of an attack significantly.

Layer 2: Per-authenticated-user
If your endpoint requires login, also rate-limit by user ID. This prevents a single legitimate user from accidentally (or intentionally) burning your budget.

const userRatelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.fixedWindow(100, '1 d'), // 100 per day per user
});

// In your handler, after auth check:
const userResult = await userRatelimit.limit(`user:${session.userId}`);
if (!userResult.success) {
  return res.status(429).json({ error: 'Daily limit reached' });
}
Layer 3: Global cost cap
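The most defensive layer: a daily total budget across all users and IPs. If your app should cost no more than $50/day in OpenAI calls, keep a global counter that increments on every call and cuts everyone off once the day's cap is reached. The example below counts calls rather than dollars, so pick the call cap by dividing your daily budget by your worst-case cost per call.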

const globalRatelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.fixedWindow(2000, '1 d'), // 2000 total calls/day
});

const globalResult = await globalRatelimit.limit('global');
if (!globalResult.success) {
  return res.status(503).json({ error: 'Service temporarily at capacity' });
}
Returning 503 instead of 429 here is intentional: the user isn't being rate-limited, the service is at capacity. The behavior is the same (request rejected) but the framing makes it easier to explain to a confused user.
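A call-count cap only bounds spend if you know what one call can cost. Here is a rough sketch for turning a dollar budget into a call cap. It assumes GPT-4's original list prices and the rule of thumb of about 4 characters per token for English text. Both are assumptions to replace with your own numbers, and both helper names are hypothetical.

```javascript
// Assumptions (illustrative): USD per 1,000 tokens and ~4 characters per token.
const INPUT_USD_PER_1K = 0.03;
const OUTPUT_USD_PER_1K = 0.06;
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: the most a single call can cost given the handler's caps.
function worstCaseCallCostUsd(maxPromptChars, maxOutputTokens) {
  const maxPromptTokens = Math.ceil(maxPromptChars / CHARS_PER_TOKEN);
  return (maxPromptTokens / 1000) * INPUT_USD_PER_1K
       + (maxOutputTokens / 1000) * OUTPUT_USD_PER_1K;
}

// Hypothetical helper: how many worst-case calls fit in a daily budget.
function dailyCallCap(dailyBudgetUsd, perCallUsd) {
  return Math.floor(dailyBudgetUsd / perCallUsd);
}

const worstCase = worstCaseCallCostUsd(2000, 1000); // 2,000-char prompt, max_tokens 1000
console.log(worstCase.toFixed(3));                  // 0.075
console.log(dailyCallCap(50, worstCase));           // 666
```

Under these assumptions a $50 daily budget covers 666 worst-case calls. A higher cap such as 2,000 calls per day only stays within it if typical calls are much cheaper than the worst case. Text that tokenizes less efficiently than plain English (code, other languages) makes the character-based estimate optimistic.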

Common pitfalls when adding rate limiting
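Pitfall 1: Identifying users by an unreliable IP source
Some apps use req.connection.remoteAddress for IP-based limiting. On Vercel and most hosts that sit behind a proxy, that address belongs to the platform, not the user, so every client ends up sharing one counter. Use the x-forwarded-for header instead and take the first IP in the comma-separated list, which is the original client. This is only trustworthy when the platform sets or overwrites the header itself; behind a proxy that merely appends to it, a client can spoof the first entry.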
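A small helper keeps that lookup in one place. This is a sketch; `getClientIp` is a hypothetical name, and it assumes your host sets x-forwarded-for itself so the first entry can be trusted.

```javascript
// Hypothetical helper: pick the rate-limit identifier for a request.
function getClientIp(headers, socketAddress) {
  const forwarded = headers['x-forwarded-for'];
  // Defensive: some frameworks expose repeated headers as an array.
  const raw = Array.isArray(forwarded) ? forwarded.join(',') : forwarded;
  const first = raw?.split(',')[0]?.trim();
  // Fall back to the socket address, then to a shared bucket.
  return first || socketAddress || 'anonymous';
}

console.log(getClientIp({ 'x-forwarded-for': '203.0.113.7, 10.0.0.1' }, '10.0.0.2')); // 203.0.113.7
console.log(getClientIp({}, '10.0.0.2'));                                             // 10.0.0.2
console.log(getClientIp({}, undefined));                                              // anonymous
```

In a handler you would call it as getClientIp(req.headers, req.socket.remoteAddress). Note that the 'anonymous' fallback puts every unidentified client in one shared bucket, which is strict but safe.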

Pitfall 2: In-memory rate limiting on serverless
If you use an in-memory counter (a regular JavaScript object) in a Vercel serverless function, the counter resets on every cold start. Different function instances also have different counters. Effective rate limiting on serverless requires external state (Redis, KV, Postgres). Upstash is the standard choice because it's serverless-native.
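The instance problem is easy to reproduce without deploying anything. In this sketch each limiter stands in for one warm function instance with its own private counter; `createInMemoryLimiter` is a hypothetical name, and the alternating routing is a simplification of how a platform spreads requests.

```javascript
// Illustrative in-memory limiter: its state lives inside one function instance.
function createInMemoryLimiter(limit) {
  const counts = new Map();
  return function allow(identifier) {
    const next = (counts.get(identifier) ?? 0) + 1;
    counts.set(identifier, next);
    return next <= limit;
  };
}

// Two warm serverless instances, each with its own counter.
const instances = [createInMemoryLimiter(5), createInMemoryLimiter(5)];

let allowed = 0;
for (let i = 0; i < 20; i++) {
  const allow = instances[i % instances.length]; // requests alternate between instances
  if (allow('203.0.113.7')) allowed += 1;
}

console.log(allowed); // 10, double the intended limit of 5
```

With more instances, or a cold start that wipes the map, the effective limit drifts even further from the one you configured.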

Pitfall 3: Returning 200 with an error in the body
Some AI-generated handlers return 200 with { error: 'rate limited' } in the JSON. Don't do this. Rate limit violations should return HTTP 429. Monitoring tools, browser caches, and other clients all respect the 429 status code.

Pitfall 4: No retry-after guidance
When you return 429, include a Retry-After header (or X-RateLimit-Reset) so clients know when they can try again. Without this, retry logic often defaults to "hammer the endpoint until something works."
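On the client side, honoring the header takes only a few lines. This sketch assumes the server sends Retry-After either as a number of seconds or as an HTTP date, the two forms the HTTP specification allows; `retryDelayMs` and the 5-second fallback are hypothetical choices.

```javascript
// Hypothetical client-side helper: how long to wait before retrying, in ms.
function retryDelayMs(retryAfterHeader, nowMs = Date.now(), fallbackMs = 5000) {
  if (!retryAfterHeader) return fallbackMs;

  // Form 1: delay in seconds, e.g. "30".
  const seconds = Number(retryAfterHeader);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(retryAfterHeader);
  return Number.isNaN(dateMs) ? fallbackMs : Math.max(0, dateMs - nowMs);
}

console.log(retryDelayMs('30')); // 30000
console.log(retryDelayMs(null)); // 5000
```

A caller would read the header from the 429 response (for example response.headers.get('Retry-After') with fetch) and wait that long before the next attempt instead of retrying immediately.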

Pitfall 5: Rate limiting after the expensive call
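The rate limit check must happen before the OpenAI/Anthropic call, and the check itself must count the request. Some incorrect implementations check the limit, call the AI, and only update the counter after the response comes back. Every request that arrives while earlier calls are still in flight sees the old count and passes, so a burst of concurrent requests goes straight through the limit.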
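A toy simulation shows how much the ordering matters. It models the worst case, where every request in a burst arrives before any AI call has finished; `simulateBurst` is a hypothetical function, not a model of any particular library.

```javascript
// Simulates a burst in which all requests arrive before any AI call finishes.
// Returns how many requests reach the paid upstream call.
function simulateBurst({ limit, burst, countBeforeCall }) {
  let counter = 0;
  let upstreamCalls = 0;
  const deferredIncrements = [];

  for (let i = 0; i < burst; i++) {
    if (counter >= limit) continue; // this request would get a 429

    if (countBeforeCall) {
      counter += 1; // slot is taken before the call starts
    } else {
      deferredIncrements.push(() => { counter += 1; }); // counted only after the response
    }
    upstreamCalls += 1; // the paid call starts
  }

  deferredIncrements.forEach((increment) => increment());
  return upstreamCalls;
}

console.log(simulateBurst({ limit: 5, burst: 20, countBeforeCall: false })); // 20
console.log(simulateBurst({ limit: 5, burst: 20, countBeforeCall: true }));  // 5
```

Counting after the response lets all 20 requests through a limit of 5. Counting as part of the check, before any paid work starts, stops the burst at the limit.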

Related cost-control issues to check
Rate limiting fixes the worst-case scenario, but other cost-control issues compound it.

No prompt size cap. Even with rate limiting, a single call with a 50,000-token prompt is expensive. Always cap prompt.length at something reasonable (2,000-4,000 characters for most chat apps) and reject anything larger with a 400.

No max_tokens on the response. Without max_tokens in your OpenAI call, the response can run as long as the model wants. Setting it caps the response cost.

No alerting on usage spikes. Set up an alert in your OpenAI/Anthropic dashboard to email you if daily usage exceeds a threshold.

Streaming endpoints with no cancellation. If you use streaming responses, make sure your handler cancels the upstream OpenAI request when the client disconnects.

The TL;DR
AI endpoints without rate limiting = direct path to a four-figure surprise bill
Default AI-builder-generated handlers have no rate limit, no auth check, no prompt size cap
Use Upstash Ratelimit + Vercel for serverless-friendly rate limiting
Layer three limits: per-IP (anonymous abuse), per-user (legitimate user runaway), global cost cap (total budget)
Always rate-limit BEFORE the expensive API call, not after
Cap prompt size and response tokens to bound per-call cost
Set up alerts on provider dashboards as a backstop
This is one of five technical guides I've written on production-readiness issues in AI-built apps. The original lives at rivetzco.com/guides/rate-limiting-for-lovable, along with guides on Supabase RLS, Stripe webhook security, secrets management, and the pillar piece on why these patterns keep showing up.

Top comments (1)

Harjot Singh • May 31

The "surprise OpenAI bill" is the single most common way a Lovable/Bolt app turns from a fun weekend win into a panic - because the generated app almost always calls the model straight from the client or an ungated endpoint with no per-user ceiling. One scraper or one abusive user and your key is the one paying for it. Rate limiting + per-user quotas + a server-side proxy that never exposes the key is exactly the boring-but-load-bearing layer the prototype skips.

This is a big part of why I lean on orchestration over one-shot generation in Moonshift - a multi-agent pipeline that ships a prompt to a real SaaS on your own GitHub + Vercel with these guardrails (key handling, gated API routes, sane defaults) baked in rather than bolted on after the first scary invoice. The routing also keeps Moonshift's own build cost ~$3 flat. First run's free, no card. Good practical post - do you find a fixed per-user token cap or a sliding cost budget works better in practice? The cost-budget version seems more honest but way harder to get right.