DEV Community

Muhammad hamthan

Designing a 3-Tier LLM Fallback Router with Cooldown Locking

How I built a production-grade LLM router for a chatbot running on Groq's free tier — surviving rate limits without dropping users.


I was building an admissions assistant chatbot for Smatal Academy, and I had a constraint most LLM tutorials don't talk about:

The free tier of my LLM provider.

Groq's free tier is generous, but it's rate-limited. When you hit the limit, your request fails with a RateLimitError and you're locked out for a window. For a personal demo, that's fine. For a chatbot that real users were typing into, it was a problem.

The naive answer is "upgrade to paid." Sometimes that's right. But this was a project deployed to Zoho Catalyst Cloud with modest traffic, and paying for an LLM subscription felt premature. So I asked a different question: can I survive rate limits architecturally instead of financially?

The answer turned out to be yes. Here's what I built.


What "naive fallback" looks like (and why it breaks)

The first instinct everyone has is this:

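import time

try:
    response = llm.invoke(question)
except RateLimitError:
    time.sleep(60)  # the user stares at a frozen chat for a full minute
    response = llm.invoke(question)  # same model, same rate-limit bucket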

Two problems with it:

  1. The user waits 60 seconds. That's a dead chat session. They'll close the tab.
  2. You're retrying the same model. If you're rate-limited now, you're probably still rate-limited 60 seconds from now.

The better instinct is to switch to a different model. Groq hosts many models — LLaMA-3.3-70B, LLaMA-4-Scout-17B, Kimi-K2, and others. Each has its own rate-limit bucket. If one is exhausted, another probably isn't.

But you don't want to fall back to a smaller model every time — only when you have to. And you don't want to keep hammering a rate-limited model when you know it's still cooling down.

That's the 3-tier router.


The 3-tier router

The core idea:

# (tier name, Groq model ID), in descending order of preference
MODEL_CONFIG = [
    ("fast",        "llama-3.3-70b-versatile"),
    ("backup",      "meta-llama/llama-4-scout-17b-16e-instruct"),
    ("last_resort", "moonshotai/kimi-k2-instruct-0905"),
]

Three models in descending order of preference. The fast model is what you want most of the time. The backup model kicks in when the fast one is rate-limited. The last_resort exists so the chatbot never goes fully dark.

When a request comes in:

  1. Try the fast model.
  2. If it's currently cooling down, skip it and try the backup.
  3. If the backup is also cooling down, try the last resort.
  4. If all three are cooling down — only then return an error to the user.

This converts a hard outage into a soft degradation. Responses might be slightly less rich temporarily, but the chatbot doesn't go down.
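
The selection order above can be sketched as a small pure function. This is a minimal illustration rather than the router's actual code: `pick_model`, `TIERS`, and the `cooling_down` set are hypothetical names introduced here, and how a tier ends up in that set is covered in the next section.

```python
from typing import Optional, Set

# Tier names in descending order of preference (mirrors MODEL_CONFIG).
TIERS = ["fast", "backup", "last_resort"]

def pick_model(cooling_down: Set[str]) -> Optional[str]:
    """Return the most preferred tier that is not cooling down, or None."""
    for name in TIERS:
        if name not in cooling_down:
            return name
    return None  # every tier is cooling down: surface an error to the user

print(pick_model(set()))               # fast
print(pick_model({"fast"}))            # backup
print(pick_model({"fast", "backup"}))  # last_resort
print(pick_model(set(TIERS)))          # None
```

Only the last case is a user-visible failure; every other case still produces an answer, just from a less preferred model.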


The cooldown mechanism

This is the part most "fallback" tutorials skip.

When you catch a RateLimitError, the obvious thing to do is try the next model immediately. That works for the current request. But what about the next request that comes in 5 seconds later? You'll try the rate-limited model first, get the same error, then fall back.

That's wasted latency on every request until the rate limit clears. Bad UX, bad cost.

The fix: when a model gets rate-limited, mark it as unavailable for a cooldown window (I used 1 hour — close to Groq's typical reset behaviour). The router skips any model whose cooldown hasn't expired:

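import time

# Unix timestamp until which each model is cooling down (0.0 = available now).
_llm_cooldowns = {name: 0.0 for name, _ in MODEL_CONFIG}

def _get_available_models():
    """Yield model names, in preference order, whose cooldown has expired."""
    now = time.time()
    for name, _ in MODEL_CONFIG:
        if now >= _llm_cooldowns[name]:
            yield name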

Subsequent requests immediately go to the backup model. Zero wasted calls on a model we know is dead.
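
To make the effect concrete, here is a small self-contained walk-through of the cooldown window. It is a sketch under assumptions: `mark_rate_limited` and `available_models` are hypothetical helpers that take the current time as an argument so the timeline can be simulated without sleeping, and the 1-hour value is the one mentioned above.

```python
import time

MODEL_CONFIG = [
    ("fast",        "llama-3.3-70b-versatile"),
    ("backup",      "meta-llama/llama-4-scout-17b-16e-instruct"),
    ("last_resort", "moonshotai/kimi-k2-instruct-0905"),
]
COOLDOWN_SECONDS = 60 * 60  # the 1-hour window mentioned above

cooldowns = {name: 0.0 for name, _ in MODEL_CONFIG}

def mark_rate_limited(name, now):
    """Hypothetical helper: hide `name` from the router until the window ends."""
    cooldowns[name] = now + COOLDOWN_SECONDS

def available_models(now):
    """Names whose cooldown has expired, in preference order."""
    return [name for name, _ in MODEL_CONFIG if now >= cooldowns[name]]

t0 = time.time()
before = available_models(t0)
mark_rate_limited("fast", t0)
during = available_models(t0 + 5)                # 5 seconds later
after = available_models(t0 + COOLDOWN_SECONDS)  # window has elapsed

print(before)  # ['fast', 'backup', 'last_resort']
print(during)  # ['backup', 'last_resort']
print(after)   # ['fast', 'backup', 'last_resort']
```

The request that arrives 5 seconds after the rate limit never touches the fast model, and the model rejoins the rotation on its own once the window passes.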


The concurrency problem

This is where it got interesting.

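I deployed the chatbot with Gunicorn, handling requests in parallel. Within a worker process, the _llm_cooldowns dictionary is module-level state shared by every thread that handles a request. (Separate worker processes each hold their own copy of the dictionary, so a threading.Lock only coordinates threads inside one process.)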

Imagine two requests arrive at the same instant. Both try the fast model. Both get rate-limited. Both want to write to _llm_cooldowns["fast"] simultaneously.

In CPython, a single dictionary write is atomic thanks to the GIL, so you won't see a corrupted dict. But you can still see a race: one thread checks the cooldown a moment before another thread records it, sees the model as available, and tries it again, wasting a call.

I guard the write with a simple threading.Lock:

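import time
from threading import Lock

# Assumed to be defined elsewhere in the module:
#   LLMS            - dict mapping tier name -> LLM client
#   RateLimitError  - the rate-limit exception raised by the provider's client
COOLDOWN_SECONDS = 60 * 60  # 1 hour

_cooldown_lock = Lock()

def run_with_fallback(chain_builder, retriever, memory, question):
    """Try each available model in preference order; cool down any that is rate-limited."""
    for model_name in _get_available_models():
        llm = LLMS[model_name]
        try:
            chain = chain_builder(retriever, memory, llm)
            return chain.invoke({"question": question})
        except RateLimitError:
            # Record the cooldown so later requests skip this model.
            with _cooldown_lock:
                _llm_cooldowns[model_name] = time.time() + COOLDOWN_SECONDS
            continue
    raise RuntimeError("All LLMs are currently rate-limited.")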

Worth noting: I only lock the write, not the read. The read can tolerate a stale value: worst case, a thread tries a rate-limited model and gets the error, which we already handle. The lock keeps concurrent writers from interleaving, which matters most if the update ever grows beyond a single assignment.
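
The behaviour under concurrency can be checked with a self-contained simulation. This is a sketch, not the production router: `fake_llm` and the local `RateLimitError` class are stand-ins for the real client and its exception, the chain-building step is left out, and a thread pool plays the role of parallel requests inside one process.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception."""

COOLDOWN_SECONDS = 60 * 60
TIERS = ["fast", "backup", "last_resort"]

cooldowns = {name: 0.0 for name in TIERS}
cooldown_lock = Lock()
calls = {name: 0 for name in TIERS}  # how often each tier was actually called
calls_lock = Lock()

def fake_llm(name, question):
    """Pretend the 'fast' tier is exhausted; the others answer normally."""
    with calls_lock:
        calls[name] += 1
    if name == "fast":
        raise RateLimitError(name)
    return f"[{name}] answer to: {question}"

def available_models():
    now = time.time()
    for name in TIERS:
        if now >= cooldowns[name]:  # unlocked read: a stale value is tolerated
            yield name

def run_with_fallback(question):
    for name in available_models():
        try:
            return fake_llm(name, question)
        except RateLimitError:
            with cooldown_lock:  # serialize writes to the shared dict
                cooldowns[name] = time.time() + COOLDOWN_SECONDS
    raise RuntimeError("All LLMs are currently rate-limited.")

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(run_with_fallback, [f"q{i}" for i in range(50)]))

print(all(a.startswith("[backup]") for a in answers))  # True
print(calls["fast"], "call(s) to the rate-limited tier out of 50 requests")
```

Every request is answered by the backup tier. The number of calls that reach the fast tier depends on thread timing, but it is bounded by the number of threads in flight before the first cooldown is recorded (at most 8 here), not by the number of requests.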


What I learned

Three things I didn't expect to learn going in:

  1. Fallback isn't retry. Retrying the same model on rate-limit is almost always wrong. Switching to a different model is almost always right.
  2. Cooldowns matter as much as fallbacks. Without cooldowns, you waste latency probing dead models on every request. With cooldowns, you skip them instantly.
  3. Shared state in multi-threaded apps needs synchronization, even in Python. The GIL gives you a lot, but not everything. A lock is one line. Use it.

What I'd do differently next time

A few things I'd add if I rebuilt this:

  • Observability. Right now I print() when a fallback fires. I should be emitting metrics — fallback count per model, cooldown duration, total failure rate. Without metrics, you don't know when your "graceful degradation" has actually started degrading badly.
  • Circuit breaker pattern. Right now any RateLimitError triggers the cooldown. But other errors (network blip, 5xx) should be handled differently. A proper circuit breaker with separate counters for transient vs persistent failures would be more robust.
  • Half-open probes. When the cooldown expires, I just allow traffic through. A safer design would send one "probe" request first, and only fully reopen if that probe succeeds.
  • Per-user routing. Right now all users share the same cooldown state. In a paid product, premium users could be routed to the fast model preferentially while free users fall back sooner.
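
The half-open probe idea from the list above could look roughly like this. It is a sketch of one possible design, not code from the project: `HalfOpenCooldown` and its method names are hypothetical, and the `now` parameter exists only so the timeline can be simulated.

```python
import time
from threading import Lock

class HalfOpenCooldown:
    """Cooldown for one model that reopens only after a single successful probe."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self._lock = Lock()
        self._open_until = 0.0  # 0.0 means "not rate-limited" (closed)
        self._probe_in_flight = False

    def allow_request(self, now=None):
        """Return True if the caller may send a request to this model."""
        now = time.time() if now is None else now
        with self._lock:
            if self._open_until == 0.0:
                return True               # closed: normal traffic
            if now < self._open_until:
                return False              # open: still cooling down
            if self._probe_in_flight:
                return False              # half-open: someone else is probing
            self._probe_in_flight = True  # half-open: this caller is the probe
            return True

    def record_success(self):
        """The probe (or a normal call) succeeded: reopen fully."""
        with self._lock:
            self._open_until = 0.0
            self._probe_in_flight = False

    def record_rate_limit(self, now=None):
        """The call was rate-limited: start (or restart) the cooldown."""
        now = time.time() if now is None else now
        with self._lock:
            self._open_until = now + self.cooldown_seconds
            self._probe_in_flight = False

gate = HalfOpenCooldown(cooldown_seconds=60)
gate.record_rate_limit(now=1000.0)
print(gate.allow_request(now=1030.0))  # False: still cooling down
print(gate.allow_request(now=1060.0))  # True: first caller becomes the probe
print(gate.allow_request(now=1061.0))  # False: probe still in flight
gate.record_success()
print(gate.allow_request(now=1062.0))  # True: fully reopened
```

Unlike the plain cooldown dict, this check-and-set does need the lock on the read path: deciding who becomes the probe and marking the probe as in flight have to happen as one step.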

Closing thought

A lot of LLM apps in 2026 are still built like demos — one model, one provider, no fallback. That's fine for prototypes. But the moment real users show up, "what happens when the model is unavailable" stops being a hypothetical question.

The 3-tier router with cooldown locking turned a chatbot that would die under rate limits into one that quietly degrades. Total code: about 60 lines. Total time: an afternoon. Total payoff: real users never saw an outage.

If you're building on top of an LLM API and you don't have a fallback strategy yet, this is the cheapest reliability win you'll find.


Code: github.com/muhammadhamthan/Smatal-Institude — see llm_router.py

Tags: python llm langchain groq reliability systemdesign


Muhammad Hamthan is a backend engineer leading the backend of an AI-powered operations platform. He writes about backend architecture, AI integration, and the production reliability lessons most tutorials skip.
