The day my LLM app hit the quota wall (and the boring fix that saved it)
I had one of those "this is fine" moments.
My little app was working. People were using it. Then one evening it started failing in the dumbest way:
random errors + slow replies + a bill that didn’t match my mental math.
Nothing “mystical” happened. I just hit the quota wall.
Here’s what I did next (it’s not fancy, but it’s the first thing I’d ship again).
What the quota wall looks like (in a real product)
It usually shows up as one of these:
1) Hard stops — requests fail, users bounce.
2) Soft stops — latency spikes, retries pile up, UX feels laggy.
3) Silent drift — outputs get worse or inconsistent, and you spend time “re-doing” work.
If you’re trying to make money, the killer isn’t “the model isn’t smart enough.”
It’s that you can’t predict delivery or cost.
The boring fix: cheap-by-default + upgrade only when it matters
I stopped calling the expensive model for everything.
Instead I used a simple rule:
- Default: cheap model
- Upgrade: only for user‑facing, high‑stakes, or irreversible steps
- Fallback: if a call errors, times out, or hits a quota cap, try another provider/model
That’s it.
It’s basically how you’d run any service with reliability requirements.
The “if/then” rules I actually use
If I had to explain it to a friend, I’d write it like this:
- If the task is drafting → cheap model
- If the task is final copy users will see → stronger model
- If the output must follow a strict format → cheap model first, upgrade only if it fails validation
- If it’s taking longer than X seconds → switch to the fast fallback
- If the request fails with quota/rate-limit → switch provider immediately
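Those if/then rules are small enough to be a function. Here’s a minimal sketch, assuming a generic `call(model)` client — the model names (`cheap-model`, `strong-model`, etc.) are placeholders, not real API identifiers:

```python
CHEAP = "cheap-model"    # placeholder: your low-cost default
STRONG = "strong-model"  # placeholder: the expensive upgrade

def route(task_kind: str) -> str:
    """Pick a model by task: strong only for user-facing/final output."""
    if task_kind in ("final_copy", "user_facing"):
        return STRONG
    return CHEAP  # drafting, internal steps, strict-format first pass

def call_with_fallback(call, models):
    """Try each model in order; on quota/rate-limit/timeout errors from
    your client, fall through to the next provider."""
    last_err = None
    for model in models:
        try:
            return call(model)
        except Exception as err:  # narrow this to your client's error types
            last_err = err
    raise last_err
```

The point isn’t this exact code — it’s that routing is a ten-line decision, not an architecture project.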
Validation can be dumb:
- did it include required fields?
- did it follow the JSON/schema?
- is it empty/too short?
You don’t need a PhD. You need a gate.
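For what it’s worth, my “dumb” gate is basically this — a sketch, with made-up field names (`title`, `body`) and an arbitrary length cutoff:

```python
import json

def passes_gate(raw: str, required=("title", "body"), min_len=20) -> bool:
    """Dumb validation gate: reject near-empty output, bad JSON,
    or JSON missing required fields."""
    if len(raw.strip()) < min_len:
        return False
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required)
```

If the cheap model’s output fails this, that’s the upgrade trigger — rerun the same prompt on the stronger model.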
The only 3 numbers worth tracking (or you’re guessing)
I log these, otherwise I’m just vibes:
- cost per request (and per user/day)
- latency (p50/p95)
- redo rate (how often I have to rerun, fix by hand, or field a complaint)
If you want a money metric:
cost per useful outcome (total cost / successful completions).
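That metric is one division, but it’s worth writing down so you handle the zero-successes day honestly:

```python
def cost_per_useful_outcome(total_cost: float, successes: int) -> float:
    """Total spend divided by completions that actually passed
    (no rerun, no manual fix). Infinite if nothing succeeded."""
    if successes == 0:
        return float("inf")
    return total_cost / successes
```

So $10 of API spend for 4 usable outputs is $2.50 per useful outcome — that’s the number to compare across models, not the per-token price.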
Quick question (I want replies)
If you’ve shipped anything with LLMs:
- Where did you hit the wall first — cost, rate limits, or quality drift?
- What kind of app is it (writing tool / support / automation / something else)?
Drop your constraints (budget + latency). I’ll reply with the exact routing rules I’d start with.
Context that pushed me to write this:
https://dev.to/dor_amir_dbb52baafff7ca5b/i-kept-hitting-the-quota-wall-with-ai-coding-tools-so-i-built-a-router-5lj