zy j


The day my LLM app hit the quota wall (and the boring fix that saved it)


I had one of those "this is fine" moments.

My little app was working. People were using it. Then one evening it started failing in the dumbest way:
random errors + slow replies + a bill that didn’t match my mental math.

Nothing “mystical” happened. I just hit the quota wall.

Here’s what I did next (it’s not fancy, but it’s the first thing I’d ship again).


What the quota wall looks like (in a real product)

It usually shows up as one of these:

1) Hard stops — requests fail, users bounce.
2) Soft stops — latency spikes, retries pile up, UX feels laggy.
3) Silent drift — outputs get worse or inconsistent, and you spend time “re-doing” work.

If you’re trying to make money, the killer isn’t “the model isn’t smart.”
It’s “you can’t predict delivery or cost.”


The boring fix: cheap-by-default + upgrade only when it matters

I stopped calling the expensive model for everything.

Instead I used a simple rule:

  • Default: cheap model
  • Upgrade: only for user-facing, high-stakes, or irreversible steps
  • Fallback: if it errors / times out / quota caps, try another provider/model

That’s it.

It’s basically how you’d run any service with reliability requirements.
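In code, the whole rule fits in one small function. Here’s a minimal sketch — the model callables are fakes standing in for your real SDK calls, and `QuotaError` is a placeholder for whatever exception your provider actually raises:

```python
class QuotaError(Exception):
    """Stand-in for a provider's quota/rate-limit error."""

# Fake providers for illustration -- replace with real SDK calls.
def call_cheap(prompt):
    raise QuotaError("cheap provider out of quota")

def call_strong(prompt):
    return f"strong: {prompt}"

def call_fallback(prompt):
    return f"fallback: {prompt}"

def run(prompt, high_stakes=False):
    # Default: cheap. Upgrade: only when it matters. Fallback: on quota/timeout.
    chain = [call_strong, call_fallback] if high_stakes else [call_cheap, call_fallback]
    last_err = None
    for call in chain:
        try:
            return call(prompt)
        except (QuotaError, TimeoutError) as err:
            last_err = err  # this provider is out -- try the next one
    raise last_err  # everything failed; surface the last error
```

The only design decision is the order of the chain; everything else is a for loop.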


The “if/then” rules I actually use

If I had to explain it to a friend, I’d write it like this:

  • If the task is drafting → cheap model
  • If the task is final copy users will see → stronger model
  • If the output must follow a strict format → cheap model first, upgrade only if it fails validation
  • If it’s taking longer than X seconds → switch to the fast fallback
  • If the request fails with quota/rate-limit → switch provider immediately
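The “taking longer than X seconds” rule is the only one that needs actual machinery. A sketch of one way to do it with a thread and a deadline (the function names are mine, not any library’s):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def with_deadline(primary, fallback, prompt, deadline_s):
    """If `primary` takes longer than deadline_s, answer with `fallback` instead."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # The slow call keeps running in the background; we just stop waiting.
        return fallback(prompt)
    finally:
        pool.shutdown(wait=False)
```

One gotcha this sketch dodges: if you use `ThreadPoolExecutor` as a context manager, the `with` block waits for the slow call to finish before exiting, which defeats the whole point — hence the explicit `shutdown(wait=False)`.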

Validation can be dumb:

  • did it include required fields?
  • did it follow the JSON/schema?
  • is it empty/too short?

You don’t need a PhD. You need a gate.
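The gate really can be a few lines. A sketch, assuming JSON output with made-up required fields — cheap model first, and only escalate when this returns False:

```python
import json

REQUIRED_FIELDS = {"title", "summary"}  # whatever your schema actually needs

def passes_gate(raw, min_len=20):
    # Empty / too short?
    if len(raw.strip()) < min_len:
        return False
    # Valid JSON at all?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Required fields present?
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()
```

No grading model, no rubric — just a boolean you can route on.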


The only 3 numbers worth tracking (or you’re guessing)

I log these, otherwise I’m just vibes:

  • cost per request (and per user/day)
  • latency (p50/p95)
  • redo rate (how often I have to rerun / fix / user complains)

If you want a money metric:
cost per useful outcome (total cost / successful completions).
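Both numbers are trivial to compute — the point is that you compute them at all. A toy sketch with made-up latencies (nearest-rank percentile is good enough for a dashboard):

```python
def percentile(samples, pct):
    """Nearest-rank percentile -- fine for dashboards, not for papers."""
    ordered = sorted(samples)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[rank]

def cost_per_useful_outcome(total_cost, successes):
    # Total spend divided by completions that passed the gate.
    return total_cost / successes if successes else float("inf")

latencies_s = [0.4, 0.5, 0.6, 0.7, 2.1]  # made-up request latencies
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
```

Notice how one slow outlier barely moves p50 but dominates p95 — that’s exactly why you track both.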


Quick question (I want replies)

If you’ve shipped anything with LLMs:

  • Where did you hit the wall first — cost, rate limits, or quality drift?
  • What kind of app is it (writing tool / support / automation / something else)?

Drop your constraints (budget + latency). I’ll reply with the exact routing rules I’d start with.


Context that pushed me to write this:
https://dev.to/dor_amir_dbb52baafff7ca5b/i-kept-hitting-the-quota-wall-with-ai-coding-tools-so-i-built-a-router-5lj
