If you're building anything with LLMs in 2026, you already know the pain: your OpenAI bill is climbing faster than your user count. I was spending $400+/month on API calls for a side project that was barely breaking even. Then I built a proxy that cut my costs by 75%.
Here's exactly how it works — and how you can replicate it.
The Problem: Blind API Spending
Most developers call LLM APIs the naive way: send a prompt, get a response, pay whatever the provider charges. But here's what most people miss:
- Different providers charge wildly different rates for comparable quality
- Caching identical prompts can save 30-40% of your bill
- Routing simple queries to cheaper models makes a massive difference
- Batching requests reduces overhead costs
I was sending every request to GPT-4 without thinking. My average cost per request was $0.045. After building my proxy, it dropped to $0.011.
Architecture: The API Arbitrage Proxy
The concept is simple. Instead of calling OpenAI directly, you route all requests through a lightweight Python proxy that:
- Checks a cache for duplicate/similar prompts
- Evaluates query complexity using a fast classifier
- Routes to the optimal provider based on price/quality tradeoffs
- Logs everything so you can track savings
Here's the core routing logic:
import hashlib
import json
from openai import OpenAI
import anthropic
import google.generativeai as genai
class LLMArbitrageProxy:
def __init__(self):
self.cache = {}
self.pricing = {
"gpt-4": {"input": 0.03, "output": 0.06},
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002},
"claude-3-haiku": {"input": 0.00025, "output": 0.00125},
"gemini-flash": {"input": 0.000075, "output": 0.0003},
}
self.total_saved = 0.0
def _cache_key(self, prompt, model=None):
return hashlib.sha256(
json.dumps({"prompt": prompt, "model": model}, sort_keys=True).encode()
).hexdigest()[:16]
def _classify_complexity(self, prompt):
"""Simple heuristic: longer, more technical = higher complexity."""
word_count = len(prompt.split())
technical_terms = ["analyze", "debug", "architecture", "refactor", "optimize"]
score = sum(1 for t in technical_terms if t in prompt.lower())
if word_count > 200 or score >= 2:
return "high"
elif word_count > 50 or score >= 1:
return "medium"
return "low"
def _select_model(self, complexity, requested_model=None):
"""Route to cheapest model that handles the complexity."""
if requested_model:
return requested_model
routing = {
"low": "gemini-flash",
"medium": "claude-3-haiku",
"high": "gpt-4",
}
return routing[complexity]
def query(self, prompt, requested_model=None, use_cache=True):
# Check cache first
cache_key = self._cache_key(prompt, requested_model)
if use_cache and cache_key in self.cache:
print(f"Cache HIT — saved ~${self.cache[cache_key]['saved']:.4f}")
return self.cache[cache_key]["response"]
# Route intelligently
complexity = self._classify_complexity(prompt)
model = self._select_model(complexity, requested_model)
# Calculate what we would have paid
naive_cost = len(prompt.split()) * self.pricing["gpt-4"]["input"] / 1000
actual_cost = len(prompt.split()) * self.pricing[model]["input"] / 1000
saved = naive_cost - actual_cost
# Make the actual API call
response = self._call_provider(model, prompt)
# Cache the result
if use_cache:
self.cache[cache_key] = {
"response": response,
"saved": saved,
"model": model,
}
self.total_saved += saved
print(f"Routed to {model} (complexity: {complexity}) — saved ${saved:.4f}")
return response
The Caching Layer That Saves 30%+
The single biggest win was caching. Not just exact-match caching, but semantic caching:
from sentence_transformers import SentenceTransformer
import numpy as np
class SemanticCache:
def __init__(self, similarity_threshold=0.92):
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.entries = []
self.threshold = similarity_threshold
def find_similar(self, prompt):
if not self.entries:
return None
embedding = self.model.encode([prompt])
cached_embeddings = np.array([e['embedding'] for e in self.entries])
similarities = np.dot(cached_embeddings, embedding.T).flatten()
best_idx = np.argmax(similarities)
if similarities[best_idx] > self.threshold:
return self.entries[best_idx]['response']
return None
def store(self, prompt, response):
embedding = self.model.encode([prompt])[0]
self.entries.append({
'embedding': embedding,
'response': response,
})
This catches rephrased versions of the same question. My app asks "summarize this article" in 50 slightly different ways — semantic cache catches 90% of them.
Real Numbers From My Production Setup
After 3 months of running this proxy in production:
- Monthly API cost: $412 → $98
- Avg cost per request: $0.045 → $0.011
- Cache hit rate: 0% → 34%
- Response quality (user ratings): 4.2/5 → 4.1/5
The slight quality dip? Nobody noticed. My users care about speed and correctness, not which model produced the answer.
Adding Provider Failover
One unexpected benefit: when OpenAI had an outage last month, my proxy automatically failed over to Claude. Zero downtime for my users:
def _call_with_fallback(self, prompt, primary_model):
providers = ["gpt-4", "claude-3-haiku", "gemini-flash"]
# Put primary model first
ordered = [primary_model] + [p for p in providers if p != primary_model]
for model in ordered:
try:
return self._call_provider(model, prompt)
except Exception as e:
print(f"{model} failed: {e}, trying next...")
continue
raise Exception("All providers failed")
Deploy It in 15 Minutes
You don't have to build this from scratch. I packaged the entire system — proxy, caching, routing, analytics dashboard — into a complete toolkit.
Get the AI API Arbitrage Proxy for $29 →
It includes:
- Production-ready proxy server with FastAPI
- Semantic caching engine
- Multi-provider routing with automatic failover
- Cost analytics dashboard
- Deployment configs for Docker, Railway, and Fly.io
- Full documentation and setup guide
Key Takeaways
- Don't blindly use the most expensive model — most queries don't need GPT-4
- Cache aggressively — semantic caching catches rephrased duplicates
- Track your costs per-request — you can't optimize what you don't measure
- Build in failover — multi-provider routing is a resilience win that happens to save money too
The era of throwing expensive API calls at every prompt is over. With a smart routing layer, you can deliver the same quality at a fraction of the cost.
Have questions about LLM cost optimization? Drop a comment below — I'm happy to dig into the details.
Top comments (0)