When I built VerdictAI X — a high-end decision support system where five specialized AI agents debate your life choices — I ran into a massive architectural problem.
Multi-agent systems do not just eat tokens; they slam straight into your rate limits.
Most tutorials show you how to build a simple chatbot that makes one API call per user message. But what happens when you have a multi-agent orchestration pipeline that triggers 21 simultaneous LLM calls for a single button click?
If you are using the free tier of Google AI Studio, you can hit 429 RESOURCE_EXHAUSTED errors almost immediately.
The bottleneck is not the tokens. It is the RPM (Requests Per Minute).
## The Math: Why RPM Kills Multi-Agent Systems
VerdictAI X is not a standard chatbot; it is a multi-layered reasoning pipeline.
When a user submits a dilemma, the system spins up five specialized agents:
- The Strategist
- The Guardian
- The Visionary
- The Humanist
- The Contrarian
A single user query requires the following behind the scenes:
- Initial Analysis: 5 requests
- Debate Round 1 (Challenge): 5 requests
- Debate Round 2 (Defend & Challenge): 5 requests
- Debate Round 3 (Defend): 5 requests
- Final Verdict Synthesis: 1 request

**Total = 21 LLM requests per user click**
That creates a real problem for free-tier usage, because the primary model may allow only around 15 RPM. One user query can already exceed that ceiling, even when token usage is still well under the TPM limit.
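The arithmetic fits in a few lines. A quick sanity check (the 15 RPM figure is an assumed free-tier ceiling; check the current quota for your model):

```python
# Back-of-the-envelope RPM math for one VerdictAI X query.
AGENTS = 5
DEBATE_ROUNDS = 3       # challenge, defend & challenge, defend
FREE_TIER_RPM = 15      # assumed free-tier ceiling; verify against your quota page

total = AGENTS                    # initial analysis
total += AGENTS * DEBATE_ROUNDS   # three debate rounds, all five agents each
total += 1                        # final verdict synthesis

print(f"Requests per click: {total}")                      # Requests per click: 21
print(f"Over the RPM budget: {total > FREE_TIER_RPM}")     # Over the RPM budget: True
```

One click already burns 140% of the per-minute request budget, regardless of how small the prompts are.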
## The Solution: Dynamic Fallback Queue
Instead of hardcoding a single model, I built a fallback queue.
The idea was simple:
- Try the primary model first
- If it hits a rate limit, move to the next model
- Keep retrying until one succeeds
- Show a small system notice in the UI when switching models
This way, the app can keep streaming responses instead of crashing on a 429 error.
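Stripped of SDK specifics, the pattern is just a loop over a model queue. Here is a minimal, self-contained sketch; `RateLimitError`, `call_model`, and the model names are placeholders for illustration, not real SDK names:

```python
class RateLimitError(Exception):
    """Stand-in for a 429 / RESOURCE_EXHAUSTED error from the API."""

def generate_with_fallback(prompt, models, call_model):
    """Try each model in order; return (model, reply) from the first that succeeds."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except RateLimitError as e:
            last_error = e  # rate-limited: fall through to the next model
    raise RuntimeError("All models are rate-limited") from last_error

# Simulate the primary model being out of quota.
def fake_call(model, prompt):
    if model == "primary":
        raise RateLimitError("429 RESOURCE_EXHAUSTED")
    return f"{model} says: ok"

model, reply = generate_with_fallback(
    "Should I move abroad?", ["primary", "backup"], fake_call
)
print(model, reply)  # backup backup says: ok
```

The real implementation below adds streaming and UI notices on top of this same loop.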
## Core Failover Logic
Here is the architecture powering the automatic model switching inside `gemini_client.py`:
```python
import os

from google import genai
from google.genai import types

FALLBACK_MODELS = [
    "gemini-3.1-flash-lite-preview",
    "gemini-2.5-flash",
    "gemma-4-31b-it",
    "gemma-4-26b-a4b-it",
]


def _get_model_queue(use_pro: bool) -> list:
    """Returns a list of models to try, in order."""
    primary = "gemini-2.5-pro" if use_pro else "gemini-2.5-flash"
    return [primary] + FALLBACK_MODELS


def generate_stream(prompt: str, system_prompt: str = "", use_pro: bool = False):
    """Streams a response with automatic failover to fallback models."""
    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    models_to_try = _get_model_queue(use_pro)

    for i, model in enumerate(models_to_try):
        # _build_config_and_prompt (defined elsewhere in gemini_client.py)
        # adapts the config and prompt to the model being tried.
        config, final_prompt = _build_config_and_prompt(model, prompt, system_prompt)
        try:
            if i > 0:
                # Surface the model switch to the UI as an inline notice.
                yield f"<br><span style='color:#fbbf24; font-size:10px;'>[System: Primary RPM limit reached. Switching to {model}...]</span><br>"
            for chunk in client.models.generate_content_stream(
                model=model,
                contents=final_prompt,
                config=config,
            ):
                if chunk.text:
                    yield chunk.text
            return  # Stream finished successfully; stop trying models.
        except Exception as e:
            error_msg = str(e)
            if "429" in error_msg or "RESOURCE_EXHAUSTED" in error_msg:
                if i < len(models_to_try) - 1:
                    continue  # Rate-limited: fall through to the next model.
                yield "<span style='color:#f43f5e; font-weight:600;'>System overloaded. All backup models are currently busy. Please try again in a few minutes.</span>"
            elif "500" in error_msg or "internal" in error_msg.lower():
                break  # Server-side failure: retrying other models will not help.
```
## What This Actually Bought Me
When the primary model hits its RPM limit, generate_stream() catches the 429 error, skips to the next model, and retries the same prompt.
Because the fallback happens inside the streaming loop, the UI can show a tiny notice like this:
```
[System: Primary RPM limit reached. Switching to gemma-4-31b-it...]
```
The user does not get an ugly error screen. They just keep seeing the response stream normally.
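Failover handles 429s after they happen; a complementary tactic is to stop all the calls in a round from firing at once. A sketch using `asyncio.Semaphore` to cap in-flight requests (the `run_agent` coroutine is a stand-in for a real streaming call, and the cap of 5 is an assumption you would tune against your quota):

```python
import asyncio

MAX_CONCURRENT = 5  # assumption: keep in-flight requests under the RPM ceiling

async def run_agent(name: str, sem: asyncio.Semaphore) -> str:
    async with sem:
        # Stand-in for a real LLM call; the sleep simulates network latency.
        await asyncio.sleep(0.01)
        return f"{name}: done"

async def run_round(agents: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # gather preserves input order, so results line up with the agent list.
    return await asyncio.gather(*(run_agent(a, sem) for a in agents))

results = asyncio.run(
    run_round(["Strategist", "Guardian", "Visionary", "Humanist", "Contrarian"])
)
print(results)
```

Throttling spreads the 21 requests out instead of dumping them on the API in one burst, which makes the fallback queue a last resort rather than the default path.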
## Why I Am Writing About This
Most tutorials end at the point where one LLM call works.
But if you want to build complex, multi-agent AI applications, Requests Per Minute limits are one of the first real architectural hurdles you will face.
You do not always need to upgrade to a paid tier immediately. Sometimes the better solution is to design your system to fail gracefully and take advantage of the available model ecosystem.
## Project Links
- GitHub: [VerdictAI X repository](https://github.com/A-Square8/VerdictAI-X)
- LinkedIn: [Ankit Ambasta](https://www.linkedin.com/in/ankit-ambasta-4a58002b9/)