How I Cut My LLM API Bill by 90%: A Practical Guide to Multi-Provider Routing
Last month I was spending $120/month on LLM API calls for a small SaaS. Not a fortune, but for a solo developer running on a $6 VPS, it was 20x my infrastructure cost. The worst part? 80% of those calls were simple tasks — text extraction, summarization, formatting — that didn't need GPT-4o.
This month: $15. Same workload. Here's exactly how I did it.
The Problem: One Provider for Everything
Most developers (including me, until recently) pick one LLM provider and use it for everything. GPT-4o for summarizing a tweet? Sure. GPT-4o for classifying a support ticket? Why not. GPT-4o for extracting a date from a string? Of course.
That's like hiring a senior engineer to photocopy documents. Technically they can do it, but you're massively overpaying.
The Solution: Task-Based Routing
The key insight: not every request needs the same model quality. I categorize every LLM call into three tiers:
| Tier | Task Examples | Best Model | Cost (per 1M input tokens) |
|---|---|---|---|
| Low | Text extraction, formatting, classification, simple Q&A | Gemini 2.0 Flash | $0.075 |
| Medium | Summarization, code generation, translation, data analysis | DeepSeek V4 Flash | $0.14 |
| High | Complex reasoning, multi-step planning, creative writing | GPT-4o | $2.50 |
By routing 80% of requests to Gemini/DeepSeek and only 20% to OpenAI, my average cost per token dropped from $2.50 to $0.27 — a 90% reduction.
The Implementation: 40 Lines of Python
Here's the routing logic I use (simplified from my production proxy):
import os
import requests
from functools import lru_cache
PROVIDERS = {
"gemini": {
"url": "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent",
"cost_per_1m_input": 0.075,
"api_key": os.environ["GEMINI_API_KEY"],
},
"deepseek": {
"url": "https://api.deepseek.com/v1/chat/completions",
"cost_per_1m_input": 0.14,
"api_key": os.environ["DEEPSEEK_API_KEY"],
},
"openai": {
"url": "https://api.openai.com/v1/chat/completions",
"cost_per_1m_input": 2.50,
"api_key": os.environ["OPENAI_API_KEY"],
},
}
def classify_task(prompt: str) -> str:
"""Simple heuristic — in production, use a small model to classify."""
prompt_lower = prompt.lower()
high_keywords = ["analyze", "reason", "plan", "strategy", "compare and contrast", "write a story"]
medium_keywords = ["summarize", "translate", "generate code", "explain", "rewrite"]
if any(kw in prompt_lower for kw in high_keywords):
return "high"
elif any(kw in prompt_lower for kw in medium_keywords):
return "medium"
return "low"
TIER_TO_PROVIDER = {"low": "gemini", "medium": "deepseek", "high": "openai"}
def route_request(prompt: str, **kwargs) -> dict:
tier = classify_task(prompt)
provider_name = TIER_TO_PROVIDER[tier]
provider = PROVIDERS[provider_name]
# Try primary, fallback to next cheapest
for fallback in [provider_name, "deepseek", "openai"]:
try:
return call_provider(fallback, prompt, **kwargs)
except Exception:
continue
raise Exception("All providers failed")
The classify_task function is deliberately simple. In production, I use a tiny Gemini Flash call to classify — it costs $0.000075 per classification, which is essentially free.
The 3 Patterns That Save the Most Money
Pattern 1: Cache Everything
The single biggest win: cache identical requests. If two users ask "summarize this article" with the same article, you only pay once.
from python_revenue_engine import CacheManager
cache = CacheManager(max_size=5000, default_ttl=3600) # 1 hour
@cache.cached(ttl=300)
def llm_call(prompt, model="gemini"):
return route_request(prompt)
In my production system, 35% of LLM requests are cache hits. That alone saved $42/month.
Pattern 2: Batch Similar Requests
Instead of calling the LLM 10 times for 10 short texts, batch them into one call:
def batch_summarize(texts: list[str]) -> list[str]:
"""Summarize 10 texts in 1 API call instead of 10."""
combined = "\n---\n".join(f"Text {i+1}: {text}" for i, text in enumerate(texts))
prompt = f"Summarize each text below in one sentence. Separate with |.\n\n{combined}"
result = route_request(prompt)
return [s.strip() for s in result.split("|")]
This cuts API calls by 80% for batch operations and uses the cheapest model.
Pattern 3: Progressive Enhancement
Start with the cheapest model. Only escalate if the response quality is insufficient:
def progressive_call(prompt: str, quality_threshold: float = 0.7) -> str:
"""Try cheap first, escalate only if needed."""
# Try Gemini first
response = call_provider("gemini", prompt)
if score_quality(response) >= quality_threshold:
return response
# Escalate to DeepSeek
response = call_provider("deepseek", prompt)
if score_quality(response) >= quality_threshold:
return response
# Nuclear option: GPT-4o
return call_provider("openai", prompt)
In practice, 85% of requests pass quality checks on the first (cheapest) attempt.
The Real Numbers: My June 2026 Cost Breakdown
| Metric | Before (single provider) | After (multi-provider) |
|---|---|---|
| Monthly tokens | 48M | 48M |
| Avg cost per 1M tokens | $2.50 | $0.27 |
| Monthly bill | $120 | $12.96 |
| Cache hit rate | 0% | 35% |
| Failed requests | 0.2% | 0.1% (fallback improves reliability) |
The proxy itself costs $0.06/month to run on my $6 VPS. Total cost: $13.02. Savings: $107/month.
What About Quality?
This is the #1 question I get. Here's my honest take after 3 months of production use:
- Gemini Flash is surprisingly good for extraction, classification, and simple Q&A. It fails on complex reasoning.
- DeepSeek V4 Flash is competitive with GPT-4o on most coding tasks. Slightly worse on creative writing.
- OpenAI GPT-4o is still king for complex multi-step reasoning and nuanced tasks.
The trick is: most SaaS workloads are 70-80% simple tasks. You don't need the best model for "extract the date from this text" or "classify this support ticket."
Get Started
If you want to skip the implementation, I packaged my production proxy into a ready-to-use product: AI API Proxy ($29 one-time). It includes:
- OpenAI-compatible API endpoint (drop-in replacement)
- Automatic routing to cheapest provider
- Built-in caching with configurable TTL
- Multi-tenant billing support
- Provider health checks and automatic failover
- Works on any VPS with Python 3.10+
Or build it yourself using the patterns above. Either way, stop paying OpenAI prices for tasks that Gemini can handle for 1/30th the cost.
Kai Thorne builds solo SaaS infrastructure and writes about cost-effective AI development. Follow for more posts on running profitable indie tools without enterprise budgets.
Related Posts
- LLM API Pricing in 2026: DeepSeek vs OpenAI vs Gemini — Real Numbers
- I Put 20 Python Scripts in a Single File and It Runs My Entire Backend
- How I Deploy Python APIs on a $6 VPS
🚀 Save 90% on your LLM bill: AI API Proxy ($29) — drop-in OpenAI replacement with automatic multi-provider routing.
📦 Also available: Python Revenue Engine ($19) — 20 production scripts in one file. Telegram Bot Starter Kit ($25).
Top comments (0)