Ankit Ambasta

The Fallback Pattern: How I Handle 15+ RPM (30,000 Tokens/Min) on Free AI Models

When I built VerdictAI X — a high-end decision support system where five specialized AI agents debate your life choices — I ran into a massive architectural problem.

Multi-agent systems do not just eat tokens; they completely destroy your rate limits.

Most tutorials show you how to build a simple chatbot that makes one API call per user message. But what happens when you have a multi-agent orchestration pipeline that triggers 21 simultaneous LLM calls for a single button click?

If you are using the free tier of Google AI Studio, you can hit 429 RESOURCE_EXHAUSTED errors almost immediately.

The bottleneck is not the tokens. It is the RPM (Requests Per Minute).


The Math: Why RPM Kills Multi-Agent Systems

VerdictAI X is not a standard chatbot; it is a multi-layered reasoning pipeline.

When a user submits a dilemma, the system spins up five specialized agents:

  • The Strategist
  • The Guardian
  • The Visionary
  • The Humanist
  • The Contrarian

A single user query requires the following behind the scenes:

Initial Analysis: 5 requests
Debate Round 1 (Challenge): 5 requests
Debate Round 2 (Defend & Challenge): 5 requests
Debate Round 3 (Defend): 5 requests
Final Verdict Synthesis: 1 request

Total = 21 LLM requests per user click

That creates a real problem for free-tier usage, because the primary model may allow only around 15 RPM. One user query can already exceed that ceiling, even when token usage is still well under the TPM limit.
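
A quick back-of-the-envelope check makes the asymmetry concrete (the ~1,000 tokens per call is my rough estimate, not a measured figure from the app):

  • Requests: 21 calls against a 15 RPM ceiling → the 16th call inside that minute is rejected with a 429
  • Tokens: 21 × ~1,000 ≈ 21,000 against a 30,000 TPM ceiling → still comfortably under the limit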


The Solution: Dynamic Fallback Queue

Instead of hardcoding a single model, I built a fallback queue.

The idea was simple:

  • Try the primary model first
  • If it hits a rate limit, move to the next model
  • Keep retrying until one succeeds
  • Show a small system notice in the UI when switching models

This way, the app can keep streaming responses instead of crashing on a 429 error.


Core Failover Logic

Here is the architecture powering the automatic model switching inside gemini_client.py:

import os
from google import genai
from google.genai import types

# Each model in the queue carries its own free-tier RPM quota, so the
# fallbacks act as a pool of independent rate limits.
FALLBACK_MODELS = [
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash",
    "gemma-3-27b-it",
    "gemma-3n-e4b-it",
]

def _get_model_queue(use_pro: bool) -> list:
    """Returns the list of models to try in order: primary first, no duplicates."""
    primary = "gemini-2.5-pro" if use_pro else "gemini-2.5-flash"
    return [primary] + [m for m in FALLBACK_MODELS if m != primary]

def generate_stream(prompt: str, system_prompt: str = "", use_pro: bool = False):
    """
    Streams a response with automatic failover to fallback models.
    """
    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    models_to_try = _get_model_queue(use_pro)

    for i, model in enumerate(models_to_try):
        config, final_prompt = _build_config_and_prompt(model, prompt, system_prompt)

        try:
            if i > 0:
                # Tell the UI, inline in the stream, that we have fallen back to another model.
                yield f"<br><span style='color:#fbbf24; font-size:10px;'>[System: Primary RPM limit reached. Switching to {model}...]</span><br>"

            for chunk in client.models.generate_content_stream(
                model=model,
                contents=final_prompt,
                config=config,
            ):
                if chunk.text:
                    yield chunk.text

            # The stream finished cleanly; skip the remaining models.
            return

        except Exception as e:
            error_msg = str(e)

            # Rate limited: move on to the next model in the queue,
            # unless this was already the last one.
            if "429" in error_msg or "RESOURCE_EXHAUSTED" in error_msg:
                if i < len(models_to_try) - 1:
                    continue
                yield "<span style='color:#f43f5e; font-weight:600;'>System overloaded. All backup models are currently busy. Please try again in a few minutes.</span>"

            # Server-side failure: switching models rarely helps here, so stop the stream.
            elif "500" in error_msg or "internal" in error_msg.lower():
                break

            # Any other error falls through and the loop simply tries the next model.
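
One piece not shown above is _build_config_and_prompt, a small helper that lives elsewhere in gemini_client.py and prepares the request for whichever model is up next. The post does not include it, so here is a minimal sketch of what such a helper could look like, assuming the Gemma fallbacks need the system prompt folded into the user prompt because they do not accept a separate system instruction:

def _build_config_and_prompt(model: str, prompt: str, system_prompt: str = ""):
    """Hypothetical helper: choose a config and final prompt for the given model."""
    if model.startswith("gemma"):
        # Assumption: the Gemma models reject a separate system instruction,
        # so prepend it to the user prompt instead.
        final_prompt = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
        return types.GenerateContentConfig(), final_prompt

    # Gemini models accept the system prompt as a proper system_instruction.
    config = types.GenerateContentConfig(system_instruction=system_prompt or None)
    return config, prompt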

What This Actually Bought Me

When the primary model hits its RPM limit, generate_stream() catches the 429 error, skips to the next model, and retries the same prompt.

Because the fallback happens inside the streaming loop, the UI can show a tiny notice like this:

[System: Primary RPM limit reached. Switching to gemma-3-27b-it...]

The user does not get an ugly error screen. They just keep seeing the response stream normally.
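
To make the consumption side concrete, here is a hedged sketch of how a caller could drive the generator; the prompt and persona strings below are made up for illustration, and the real app wires this into its UI layer rather than printing to stdout:

from gemini_client import generate_stream

# Hypothetical caller: iterate the generator and render chunks as they arrive.
# A fallback notice is just another chunk, so it shows up inline in the stream.
for chunk in generate_stream(
    prompt="Should I leave my stable job to start a company?",
    system_prompt="You are The Strategist. Argue from long-term leverage.",
    use_pro=False,
):
    print(chunk, end="", flush=True)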


Why I Am Writing About This

Most tutorials end at the point where one LLM call works.

But if you want to build complex, multi-agent AI applications, Requests Per Minute limits are one of the first real architectural hurdles you will face.

You do not always need to upgrade to a paid tier immediately. Sometimes the better solution is to design your system to fail gracefully and take advantage of the available model ecosystem.


Project Links
