TL;DR
I run a small dev-shop. Every product I ship needs an LLM call somewhere — content generation, security analysis, classification, summaries. The economics only work if those calls average out to near-zero cost.
The trick: never pay for what a free tier can do. Stack 5+ providers in a deterministic fallback chain so that when one rate-limits, account-bans, or hikes prices, the next one takes over within the same request — invisible to the user.
This post walks through the actual production code from audit_routes.py powering askoracle.site/audit — a 12-question crypto security scan that costs me $0 per scan in practice, while still producing real AI reports anchored to 2024-25 incident cases.
The fallback chain
User request
│
▼
┌──────────────────────────────────────────────────────────┐
│ Groq llama-3.3-70b (5 keys, sequential, free tier) │ Tier 1
└──────────────┬───────────────────────────────────────────┘
│ all 5 rate-limited / Cloudflare-blocked?
▼
┌──────────────────────────────────────────────────────────┐
│ DeepSeek deepseek-chat ($0.27/1M in, $1.10/1M out)      │ Tier 2
└──────────────┬───────────────────────────────────────────┘
│ rate-limited / credits depleted?
▼
┌──────────────────────────────────────────────────────────┐
│ Vertex AI Gemini 2.5 Pro (Service Account, $200 trial) │ Tier 3
└──────────────┬───────────────────────────────────────────┘
│ all LLMs down?
▼
┌──────────────────────────────────────────────────────────┐
│ Deterministic Python template (cannot fail) │ Tier 4
└──────────────────────────────────────────────────────────┘
Implementation
Tier 1+2 — Groq × 5 keys → DeepSeek
import logging
import openai as _openai
from groq import Groq, RateLimitError as GroqRateLimitError

log = logging.getLogger(__name__)

GROQ_KEYS = [open(f"/root/groq_key_{i}.txt").read().strip() for i in range(1, 6)]
groq_clients = [Groq(api_key=k) for k in GROQ_KEYS]
deepseek_client = _openai.OpenAI(api_key=DEEPSEEK_KEY,  # key loaded elsewhere
                                 base_url="https://api.deepseek.com/v1")

def _chat_complete(system_msg, user_msg, max_tokens=900, temperature=0.3):
    """Groq×5 → DeepSeek. Returns (text, provider_name)."""
    # Tier 1: walk the five Groq keys in order
    for i, gc in enumerate(groq_clients, start=1):
        name = f"groq{i}"
        try:
            resp = gc.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=[{"role": "system", "content": system_msg},
                          {"role": "user", "content": user_msg}],
                max_tokens=max_tokens, temperature=temperature,
            )
            text = resp.choices[0].message.content.strip()
            if text:
                return text, name
        except GroqRateLimitError:
            log.warning(f"[llm] {name} rate limited → next")
        except Exception as e:
            log.warning(f"[llm] {name} error: {e} → next")
    # Tier 2: DeepSeek fallback (OpenAI-compatible endpoint)
    try:
        resp = deepseek_client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "system", "content": system_msg},
                      {"role": "user", "content": user_msg}],
            max_tokens=max_tokens, temperature=temperature,
        )
        text = resp.choices[0].message.content.strip()
        if text:
            return text, "deepseek"
    except Exception as e:
        log.warning(f"[llm] deepseek error: {e}")
    return None, None
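For context, a call looks like this; the prompts here are made-up stand-ins, not the real audit prompts:

text, provider = _chat_complete(
    "You are a crypto security auditor. Cite real incidents where relevant.",
    "Assess the risk of keeping a seed phrase in a phone screenshot.",
)
print(provider)  # "groq1"…"groq5", "deepseek", or None if every tier failed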
Tier 3 — Vertex AI Pro via subprocess CLI
The trick: I made ask_pro a standalone CLI in /usr/local/bin/. Any process that has the right service account key can invoke it.
import re
import subprocess
import tempfile

def _generate_report(answers, score, breakdown, vulns, lang):
    """Returns (report_md, provider). Groq → DS → Pro CLI → deterministic."""
    # system_msg / user_msg are built from the audit inputs (elided here)
    text, provider = _chat_complete(system_msg, user_msg, max_tokens=2200)
    if text and len(text) > 400:
        return text, provider
    # Tier 3: Pro subprocess fallback
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
            f.write(system_msg + "\n\n---\n\n" + user_msg)
            prompt_path = f.name
        r = subprocess.run(
            ["/usr/local/bin/ask_pro", "-f", prompt_path, "--max", "3000"],
            capture_output=True, text=True, timeout=90,
        )
        out = r.stdout.strip()
        out = re.sub(r"\n\[tokens:.*?\]\s*$", "", out)  # strip trailing cost line
        if out and len(out) > 400:
            return out, "vertex-pro-fallback"
    except Exception as e:
        log.warning(f"pro fallback failed: {e}")
    # Tier 4: deterministic template
    return _fallback_report(score, breakdown, vulns, lang), "fallback-template"
The ask_pro CLI is ~120 lines that handle auto-fallback across three GCP regions (us-central1 → europe-west1 → europe-west4) when one region rate-limits. That's an inner fallback chain inside the outer one.
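I'm not pasting the CLI here, but the inner loop is roughly this shape. This is a hypothetical reconstruction assuming the vertexai SDK and service-account credentials in the environment, not the real ask_pro source:

# Hypothetical sketch of the region fallback inside ask_pro
import vertexai
from vertexai.generative_models import GenerativeModel
from google.api_core.exceptions import ResourceExhausted

REGIONS = ["us-central1", "europe-west1", "europe-west4"]

def ask_pro(prompt: str, project: str, max_tokens: int = 3000) -> str | None:
    for region in REGIONS:
        try:
            vertexai.init(project=project, location=region)
            model = GenerativeModel("gemini-2.5-pro")
            resp = model.generate_content(
                prompt, generation_config={"max_output_tokens": max_tokens},
            )
            if resp.text:
                return resp.text
        except ResourceExhausted:
            continue  # 429 in this region; try the next one
    return None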
Tier 4 — the chain cannot fail
def _fallback_report(score, breakdown, vulns, lang):
    """Deterministic report when all LLMs are down."""
    fortress_title = {
        "ru": f"## Твоя цифровая крепость: {score}/100",
        "en": f"## Your digital fortress: {score}/100",
        "es": f"## Tu fortaleza digital: {score}/100",
    }[lang]
    intro = {
        "ru": ("У тебя серьёзные дыры..." if score < 40 else "..."),
        "en": ("You have serious gaps..." if score < 40 else "..."),
        # ...
    }[lang]
    # ~30 lines of f-strings (elided) build `lines` into a usable, if bland, report
    return "\n".join(lines)
If the chain reaches Tier 4, the user still gets a usable report. It's not great — it doesn't reference Inferno Drainer or address poisoning — but it answers the question. The pipeline cannot return 500.
The interesting bit: why 5 Groq keys?
Three reasons most posts don't mention (only the first is really about Cloudflare):
1. Cloudflare 1010 (IP-level block)
Groq's API sits behind Cloudflare. If your outbound IP looks "bot-y" (many requests, no browser fingerprint), Cloudflare will return HTTP 403 with error code: 1010 to all your API keys simultaneously. It's not a rate limit — it's a block. Yesterday during testing, all 5 of my keys went down at once for ~4 hours.
The fallback to DeepSeek (different infra, no Cloudflare) saved the day.
Lesson: Cloudflare 1010 is a class of failure that key rotation alone doesn't solve.
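Key rotation can't route around a 1010, but you can at least detect it and stop burning the remaining keys. A sketch of what that check could look like, assuming the groq SDK's APIStatusError carries the 403 body; the short-circuit itself is hypothetical, not in the code above:

from groq import APIStatusError

def _is_cloudflare_block(e: APIStatusError) -> bool:
    # A 1010 arrives as HTTP 403 with "error code: 1010" in an HTML body,
    # not as a normal API error payload.
    return e.status_code == 403 and "1010" in str(e)

# Inside the key loop of _chat_complete you could then add:
#     except APIStatusError as e:
#         if _is_cloudflare_block(e):
#             break  # every key shares the IP block; go straight to DeepSeek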
2. Tokens-per-minute vs requests-per-minute
Groq's free-tier limits are per-key. Llama 3.3 70B gets 6,000 input tokens/min plus 6,000 output. Long prompts (~5K input) trip the token limit before the request limit. With 5 keys you get 30K input tokens/min — enough for ~5 long-form generation calls per minute, plenty for a B2C audit product.
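The arithmetic, spelled out:

# Back-of-envelope throughput from the free-tier numbers above
TPM_PER_KEY = 6_000      # input tokens/min per key on llama-3.3-70b
KEYS = 5
PROMPT_TOKENS = 5_000    # a typical long-form audit prompt

print((TPM_PER_KEY * KEYS) // PROMPT_TOKENS)  # 6 in theory; ~5 once output counts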
3. Provider parity ≠ output parity
DeepSeek-chat is excellent at following structured prompts. Llama 3.3 70B is better at incident-case anchoring (mentions specific names like "Inferno Drainer 2023" more reliably). The fallback chain runs the same prompt across both. For ~95% of inputs, output quality is interchangeable. For the 5% where DeepSeek is wordier, I use it as Tier 2 specifically because by then we already know Groq is gone.
How this saves real money
The audit_routes.py pipeline has done ~40 E2E test runs during build + production. Cost breakdown:
| Tier | Times hit | Cost |
|---|---|---|
| Groq | 38 | $0 |
| DeepSeek | 2 (when Groq Cloudflare-blocked) | $0.0030 |
| Vertex Pro | 0 | $0 |
| Template | 0 | $0 |
| Total | 40 runs | $0.003 |
At $0.003 across 40 scans, the average is $0.000075 per scan. Even at scale (1,000 scans/day, ~30,000 scans/month), monthly LLM cost comes to ~$2.
This is a security product that charges $49 for the manual audit. Unit economics: insane.
What I'd do differently next time
- Add OpenRouter as Tier 1.5 — they aggregate 200+ models behind one key, and providers like Together are routed through them. If a Groq Cloudflare 1010 block lasts >1 day, OpenRouter becomes Tier 1 automatically.
- HuggingFace Inference Router — works in our setup as a Tier 1.5 backup (free serverless 70B via the Novita provider). I tested it earlier today; /usr/local/bin/ask_hf wraps it as a one-liner CLI. Worth adding to your stack: free, no Cloudflare, no prepayment.
- Cache by (prompt_hash, lang) — for high-volume use cases, ~30% of prompts are repeats. A simple SQL table llm_cache(hash, response, created_at) is enough (see the sketch after this list). Don't add this until you actually have >10K requests/day.
- Sign the request URLs with HMAC to prevent enumeration attacks. I didn't, because the report already takes user input through enum validation.
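A minimal sketch of that cache idea, assuming SQLite; cached_chat is a hypothetical wrapper around the _chat_complete above, and the schema mirrors llm_cache(hash, response, created_at):

import hashlib
import sqlite3
import time

db = sqlite3.connect("llm_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS llm_cache "
           "(hash TEXT PRIMARY KEY, response TEXT, created_at REAL)")

def cached_chat(system_msg, user_msg, lang):
    key = hashlib.sha256(f"{lang}|{system_msg}|{user_msg}".encode()).hexdigest()
    row = db.execute("SELECT response FROM llm_cache WHERE hash = ?",
                     (key,)).fetchone()
    if row:
        return row[0], "cache"       # repeat prompt: no LLM call at all
    text, provider = _chat_complete(system_msg, user_msg)
    if text:
        db.execute("INSERT OR REPLACE INTO llm_cache VALUES (?, ?, ?)",
                   (key, text, time.time()))
        db.commit()
    return text, provider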
What this powers (side effect)
A 12-question crypto security audit. Free tier scan: askoracle.site/audit (RU/EN/ES). Manual audit by an engineer: $49.
But honestly — the audit is a side effect of the pipeline. The pipeline is reusable for:
- Content generation (blog posts, video scripts)
- Code review at scale (PR analysis)
- Customer support response drafting
- Real-time chat assistants
- Classification tasks
- Anything where you want LLM cost = $0 in practice
If you ship anything with LLMs and your bill is more than $1/day, copy this chain.
Source
The full audit_routes.py is ~1200 lines. The fallback section is ~80 lines (shown above). If you want the rest — the question schema, scoring math, XSS escape on LLM output, rate-limit decorators — the engineering write-up of the audit itself is in my earlier post.
If you want to see the chain working live, run the free scan and check the provider field in the response. Most of the time it'll say groq3 or groq4 (because groq1/groq2 are usually rate-limited from earlier traffic that minute). That's the chain doing its job.
— @sspoisk / GuardLabs