How I built a production-grade LLM router for a chatbot running on Groq's free tier — surviving rate limits without dropping users.
I was building a chatbot for Smatal Academy — an institutional admissions assistant — and I had a constraint most LLM tutorials don't talk about:
The free tier of my LLM provider.
Groq's free tier is generous, but it's rate-limited. When you hit the limit, your request fails with a RateLimitError and you're locked out for a window. For a personal demo, that's fine. For a chatbot that real users were typing into, it was a problem.
The naive answer is "upgrade to paid." Sometimes that's right. But this was a project deployed to Zoho Catalyst Cloud with modest traffic, and paying for an LLM subscription felt premature. So I asked a different question: can I survive rate limits architecturally instead of financially?
The answer turned out to be yes. Here's what I built.
What "naive fallback" looks like (and why it breaks)
The first instinct everyone has is this:
try:
    response = llm.invoke(question)
except RateLimitError:
    time.sleep(60)
    response = llm.invoke(question)
Two problems with it:
- The user waits 60 seconds. That's a dead chat session. They'll close the tab.
- You're retrying the same model. If you're rate-limited now, you're probably still rate-limited 60 seconds from now.
The better instinct is to switch to a different model. Groq hosts many models — LLaMA-3.3-70B, LLaMA-4-Scout-17B, Kimi-K2, and others. Each has its own rate-limit bucket. If one is exhausted, another probably isn't.
But you don't want to fall back to a smaller model every time — only when you have to. And you don't want to keep hammering a rate-limited model when you know it's still cooling down.
That's the 3-tier router.
The 3-tier router
The core idea:
MODEL_CONFIG = [
    ("fast", "llama-3.3-70b-versatile"),
    ("backup", "meta-llama/llama-4-scout-17b-16e-instruct"),
    ("last_resort", "moonshotai/kimi-k2-instruct-0905"),
]
Three models in descending order of preference. The fast model is what you want most of the time. The backup model kicks in when the fast one is rate-limited. The last_resort exists so the chatbot never goes fully dark.
When a request comes in:
- Try the fast model.
- If it's currently cooling down, skip it and try the backup.
- If the backup is also cooling down, try the last resort.
- If all three are cooling down — only then return an error to the user.
This converts a hard outage into a soft degradation. Responses might be slightly less rich temporarily, but the chatbot doesn't go down.
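The selection steps above can be sketched as a small pure function. This is a minimal sketch, not the code from the repo: `pick_tier` and its `cooldowns` and `now` parameters are hypothetical names chosen for illustration, and the clock is passed in so the behaviour is easy to check.

```python
import time

# Tier order mirrors MODEL_CONFIG from the article.
MODEL_CONFIG = [
    ("fast", "llama-3.3-70b-versatile"),
    ("backup", "meta-llama/llama-4-scout-17b-16e-instruct"),
    ("last_resort", "moonshotai/kimi-k2-instruct-0905"),
]


def pick_tier(cooldowns, now=None):
    """Return the first tier whose cooldown has expired, or None if all are cooling down.

    `cooldowns` maps tier name -> Unix timestamp until which the tier is skipped.
    (Hypothetical helper for illustration.)
    """
    if now is None:
        now = time.time()
    for tier, _model_id in MODEL_CONFIG:
        if now >= cooldowns.get(tier, 0.0):
            return tier
    return None
```

With `fast` cooling down until t=100, `pick_tier({"fast": 100.0}, now=50.0)` returns `"backup"`. Only when every tier is cooling down does it return `None`, and that is the single case where the user sees an error.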
The cooldown mechanism
This is the part most "fallback" tutorials skip.
When you catch a RateLimitError, the obvious thing to do is try the next model immediately. That works for the current request. But what about the next request that comes in 5 seconds later? You'll try the rate-limited model first, get the same error, then fall back.
That's wasted latency on every request until the rate limit clears. Bad UX, bad cost.
The fix: when a model gets rate-limited, mark it as unavailable for a cooldown window (I used 1 hour — close to Groq's typical reset behaviour). The router skips any model whose cooldown hasn't expired:
import time

# Unix timestamp until which each model should be skipped (0.0 = available now).
_llm_cooldowns = {name: 0.0 for name, _ in MODEL_CONFIG}

def _get_available_models():
    now = time.time()
    for name, _ in MODEL_CONFIG:
        if now >= _llm_cooldowns[name]:
            yield name
Subsequent requests immediately go to the backup model. Zero wasted calls on a model we know is dead.
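To make the effect concrete, here is a small simulation of the cooldown bookkeeping. It is a sketch under assumptions: the `clock` parameter and the `mark_rate_limited` helper are hypothetical additions so the example can run without waiting an hour, and the 1-hour window is the value mentioned above.

```python
import time

COOLDOWN_SECONDS = 60 * 60  # 1 hour, the window used in the article
TIERS = ["fast", "backup", "last_resort"]


def available_models(cooldowns, clock=time.time):
    """Yield tiers whose cooldown has expired, in order of preference."""
    now = clock()
    for name in TIERS:
        if now >= cooldowns[name]:
            yield name


def mark_rate_limited(cooldowns, name, clock=time.time):
    """Record that `name` should be skipped for the next COOLDOWN_SECONDS."""
    cooldowns[name] = clock() + COOLDOWN_SECONDS


# Simulated timeline using a fake clock (seconds since an arbitrary start).
cooldowns = {name: 0.0 for name in TIERS}
mark_rate_limited(cooldowns, "fast", clock=lambda: 1000.0)

five_seconds_later = list(available_models(cooldowns, clock=lambda: 1005.0))
after_the_window = list(available_models(cooldowns, clock=lambda: 1000.0 + COOLDOWN_SECONDS))
```

Five seconds after the rate limit, `five_seconds_later` is `["backup", "last_resort"]`, so the fast model is never probed. Once the window has passed, `after_the_window` lists all three tiers again.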
The concurrency problem
This is where it got interesting.
I deployed the chatbot with Gunicorn running multiple workers, so requests are handled in parallel. One caveat: Gunicorn workers are separate processes, and each process keeps its own copy of module-level state. The _llm_cooldowns dictionary is therefore shared only among the threads inside a single worker (for example, when a worker runs several threads), and that is where the race lives.
Imagine two requests arrive at the same instant in the same worker. Both try the fast model. Both get rate-limited. Both want to write to _llm_cooldowns["fast"] simultaneously.
In Python, dictionary writes are mostly atomic thanks to the GIL, so you won't see a corrupted dict. But you can absolutely see a race: one thread reads the cooldown value just before another thread updates it, so the read returns a stale value. Result: the second thread thinks the model is still available and tries it again, wasting a call.
The fix is a simple threading.Lock:
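import time
from threading import Lock

# RateLimitError is assumed to come from the Groq SDK (from groq import RateLimitError).
COOLDOWN_SECONDS = 60 * 60  # 1 hour, as described above
_cooldown_lock = Lock()

def run_with_fallback(chain_builder, retriever, memory, question):
    # Try each model that is not cooling down, in order of preference.
    for model_name in _get_available_models():
        llm = LLMS[model_name]
        try:
            chain = chain_builder(retriever, memory, llm)
            return chain.invoke({"question": question})
        except RateLimitError:
            # Mark this model as unavailable, then move on to the next one.
            with _cooldown_lock:
                _llm_cooldowns[model_name] = time.time() + COOLDOWN_SECONDS
            continue
    raise RuntimeError("All LLMs are currently rate-limited.")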
Worth noting: I only lock the write, not the read. The read can tolerate a stale value — worst case, a worker tries a rate-limited model and gets the error, which we already handle. The lock just prevents a lost update.
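One way to package the "lock the write, not the read" choice is a small registry class. This is a sketch, not the repo's code: `CooldownRegistry` and its method names are hypothetical. It also keeps the later expiry when two threads report the same model, which is a case where the lock does real work, because that update is a read followed by a write.

```python
import threading
import time


class CooldownRegistry:
    """Tracks per-model cooldown expiry times (hypothetical helper)."""

    def __init__(self, names, cooldown_seconds=3600.0):
        self._until = {name: 0.0 for name in names}
        self._cooldown_seconds = cooldown_seconds
        self._lock = threading.Lock()

    def is_available(self, name, now=None):
        # Unlocked read: a stale value costs at most one extra failed call.
        now = time.time() if now is None else now
        return now >= self._until[name]

    def mark_rate_limited(self, name, now=None):
        now = time.time() if now is None else now
        new_until = now + self._cooldown_seconds
        # Locked read-modify-write: never shorten an existing cooldown.
        with self._lock:
            self._until[name] = max(self._until[name], new_until)

    def until(self, name):
        return self._until[name]


registry = CooldownRegistry(["fast", "backup", "last_resort"])

# Eight threads report the same rate limit at slightly different times.
threads = [
    threading.Thread(target=registry.mark_rate_limited, args=("fast", 1000.0 + i))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the threads run in, the registry ends up with the latest expiry (1007 + 3600 seconds). Because the lock lives in process memory, this coordinates threads inside one worker only; sharing cooldowns across separate worker processes would need an external store, which is outside this sketch.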
What I learned
Three things I didn't expect to learn going in:
- Fallback isn't retry. Retrying the same model on rate-limit is almost always wrong. Switching to a different model is almost always right.
- Cooldowns matter as much as fallbacks. Without cooldowns, you waste latency probing dead models on every request. With cooldowns, you skip them instantly.
- Shared state in multi-worker apps needs synchronization, even in Python. The GIL gives you a lot, but not everything. A lock is one line. Use it.
What I'd do differently next time
A few things I'd add if I rebuilt this:
- Observability. Right now I print() when a fallback fires. I should be emitting metrics: fallback count per model, cooldown duration, total failure rate. Without metrics, you don't know when your "graceful degradation" has actually started degrading badly.
- Circuit breaker pattern. Right now any RateLimitError triggers the cooldown. But other errors (network blip, 5xx) should be handled differently. A proper circuit breaker with separate counters for transient vs persistent failures would be more robust.
- Half-open probes. When the cooldown expires, I just allow traffic through. A safer design would send one "probe" request first, and only fully reopen if that probe succeeds.
- Per-user routing. Right now all users share the same cooldown state. In a paid product, premium users could be routed to the fast model preferentially while free users fall back sooner.
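As a rough illustration of the observability and half-open probe ideas, here is a minimal breaker state machine with simple counters. Everything in it is hypothetical (the `ModelBreaker` class, its state names, and the `metrics` counter standing in for a real metrics client). It sketches the general pattern and is not something the current router does.

```python
import time
from collections import Counter

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class ModelBreaker:
    """Per-model circuit breaker with a single half-open probe (illustrative only)."""

    def __init__(self, cooldown_seconds=3600.0, clock=time.time):
        self.state = CLOSED
        self.open_until = 0.0
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.metrics = Counter()  # stand-in for a real metrics client

    def allow_request(self):
        """Return True if a call to the model may be attempted right now."""
        if self.state == CLOSED:
            return True
        if self.state == OPEN and self.clock() >= self.open_until:
            # Cooldown expired: let exactly one probe through.
            self.state = HALF_OPEN
            self.metrics["probes"] += 1
            return True
        # Still cooling down, or a probe is already in flight.
        return False

    def record_success(self):
        self.state = CLOSED
        self.metrics["successes"] += 1

    def record_rate_limit(self):
        # Covers both a fresh rate limit and a failed probe.
        self.state = OPEN
        self.open_until = self.clock() + self.cooldown_seconds
        self.metrics["rate_limits"] += 1


# Walk through one cycle with a fake clock.
fake_now = [0.0]
breaker = ModelBreaker(cooldown_seconds=60.0, clock=lambda: fake_now[0])

breaker.record_rate_limit()        # model trips at t=0
blocked = breaker.allow_request()  # t=0: still open
fake_now[0] = 61.0
probe = breaker.allow_request()    # t=61: one probe allowed
second = breaker.allow_request()   # probe in flight, other requests wait
breaker.record_success()           # probe succeeded, breaker closes
reopened = breaker.allow_request()
```

In this walk-through the request at t=0 is blocked, the first request after the cooldown becomes the probe, a second concurrent request is held back, and traffic reopens only after the probe succeeds. The sketch is not thread-safe as written; in a threaded worker the state changes would need the same lock treatment as the cooldown dictionary.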
Closing thought
A lot of LLM apps in 2026 are still built like demos — one model, one provider, no fallback. That's fine for prototypes. But the moment real users show up, "what happens when the model is unavailable" stops being a hypothetical question.
The 3-tier router with cooldown locking turned a chatbot that would die under rate limits into one that quietly degrades. Total code: about 60 lines. Total time: an afternoon. Total payoff: real users never saw an outage.
If you're building on top of an LLM API and you don't have a fallback strategy yet, this is the cheapest reliability win you'll find.
Code: github.com/muhammadhamthan/Smatal-Institude — see llm_router.py
Tags: python llm langchain groq reliability systemdesign
Muhammad Hamthan is a backend engineer leading the backend of an AI-powered operations platform. He writes about backend architecture, AI integration, and the production reliability lessons most tutorials skip.