I Built an AI Gateway That Cuts LLM Costs by 60% - Here's How
If you're using OpenAI, Claude, or other LLM APIs in production, you've probably experienced two painful realities:
- 💸 Costs spiral out of control as usage scales
- 🔄 You're paying for the same requests multiple times
After seeing my company's AI bill hit $12,000/month (with 30-40% being duplicate requests), I built an open-source solution that sits between your app and LLM providers to automatically optimize costs.
🎯 The Problem
Let's say your customer service chatbot gets asked "What are your business hours?" 500 times a day. Without optimization, you're paying OpenAI for the same answer 500 times:
Assuming roughly 1K tokens per request at GPT-4 pricing: 500 requests × $0.03 per 1K tokens = $15/day ≈ $450/month
For just ONE common question. Multiply that across hundreds of similar queries, and you see how quickly costs explode.
💡 The Solution: An Intelligent Proxy
I built AI Cost Optimizer Proxy - a Go-based API gateway that:
- ✅ Caches responses using Redis (30-60% cost savings)
- ✅ Routes intelligently to cheaper models when appropriate
- ✅ Handles failures with automatic provider fallbacks
- ✅ Tracks everything - costs, usage, cache hit rates
- ✅ Works with existing code - OpenAI-compatible API
🏗️ Architecture Overview
```
┌─────────────┐
│  Your App   │
└──────┬──────┘
       │  (No code changes needed)
       ▼
┌─────────────────────────────────┐
│     AI Cost Optimizer Proxy     │
├─────────────────────────────────┤
│  ┌─────────┐     ┌──────────┐   │
│  │  Cache  │     │  Smart   │   │
│  │  Layer  │     │  Router  │   │
│  └─────────┘     └──────────┘   │
└────────┬────────────────────────┘
         │
    ┌────┼────┬──────┬───────┐
    ▼    ▼    ▼      ▼       ▼
 OpenAI Claude Gemini DeepSeek Grok
```
🚀 Key Features
1. Smart Caching
The proxy generates a SHA-256 hash from the request (model + messages + temperature) and checks Redis first:
```go
// Pseudo-code; Redis client details omitted
cacheKey := sha256Hex(model + messages + temperature)

if cached, found := redis.Get(cacheKey); found {
    return cached // Cost: $0
}

// Cache miss: call the provider
response := callProvider(request)
redis.Set(cacheKey, response, TTL)
return response
```
Real-world impact: Our customer support bot went from 10,000 API calls/day to 3,500 (65% cache hit rate).
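To make the hashing step above concrete, here is a minimal, self-contained sketch of deterministic cache-key generation. The struct and function names are mine, not necessarily the proxy's; JSON marshaling is used only to get a stable byte representation to hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// cacheKeyInput holds the request fields that determine the cached answer.
// Field names are illustrative, not the proxy's actual schema.
type cacheKeyInput struct {
	Model       string  `json:"model"`
	Messages    string  `json:"messages"`
	Temperature float64 `json:"temperature"`
}

// cacheKey returns a deterministic SHA-256 hex digest of the request.
func cacheKey(in cacheKeyInput) string {
	b, _ := json.Marshal(in) // marshaling this struct cannot fail
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := cacheKeyInput{"gpt-4o-mini", "What are your business hours?", 0.0}
	b := cacheKeyInput{"gpt-4o-mini", "What are your business hours?", 0.0}
	c := cacheKeyInput{"gpt-4o-mini", "What are your business hours?", 0.7}

	fmt.Println(cacheKey(a) == cacheKey(b)) // identical requests share a key: true
	fmt.Println(cacheKey(a) == cacheKey(c)) // temperature change busts the cache: false
}
```

Note that including temperature in the key matters: a cached deterministic answer should not be served for a request that explicitly asked for more randomness.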
2. Intelligent Model Routing
Not every request needs GPT-4. The proxy automatically routes based on complexity:
```go
if tokenCount < 500 && !containsComplexKeywords(prompt) {
    model = "gpt-4o-mini" // $0.00015 per 1K tokens
} else {
    model = "gpt-4" // $0.03 per 1K tokens
}
```
Savings: 200x cost reduction on simple queries.
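As a hedged sketch of what such a routing rule could look like end to end, here is one possible implementation. The keyword list and the 500-token threshold are invented for illustration; the real proxy's heuristics may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// complexKeywords is an illustrative list; tune it for your own traffic.
var complexKeywords = []string{"analyze", "prove", "refactor", "step by step"}

// pickModel routes short, simple prompts to the cheap model and
// everything else to the expensive one.
func pickModel(prompt string, tokenCount int) string {
	lower := strings.ToLower(prompt)
	for _, kw := range complexKeywords {
		if strings.Contains(lower, kw) {
			return "gpt-4"
		}
	}
	if tokenCount < 500 {
		return "gpt-4o-mini"
	}
	return "gpt-4"
}

func main() {
	fmt.Println(pickModel("What are your business hours?", 12)) // gpt-4o-mini
	fmt.Println(pickModel("Analyze this contract clause", 1200)) // gpt-4
}
```

A keyword heuristic is crude but cheap to evaluate; a more ambitious router could classify prompts with the cheap model itself and still come out ahead on cost.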
3. Automatic Fallback
Primary provider down? The proxy automatically tries backups:
```go
providers := []string{"openai", "anthropic", "gemini"}
for _, provider := range providers {
    response, err := tryProvider(provider, request)
    if err == nil {
        return response, nil
    }
}
return nil, errors.New("all providers failed")
```
Result: 99.9% uptime even when individual providers fail.
4. Streaming Support
Full SSE (Server-Sent Events) streaming for all providers:
```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","stream":true,"messages":[...]}'
```
Real-time responses with cost tracking.
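For readers consuming the stream without an SDK, here is a minimal sketch of parsing the SSE lines that OpenAI-compatible endpoints emit: `data: `-prefixed JSON chunks, terminated by a `data: [DONE]` sentinel. The function name is my own.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSSELine extracts the JSON payload from one SSE line.
// It reports done=true when the [DONE] sentinel arrives, and
// ok=false for lines that carry no data (comments, blank lines).
func parseSSELine(line string) (payload string, done bool, ok bool) {
	if !strings.HasPrefix(line, "data: ") {
		return "", false, false
	}
	body := strings.TrimPrefix(line, "data: ")
	if body == "[DONE]" {
		return "", true, true
	}
	return body, false, true
}

func main() {
	p, done, ok := parseSSELine(`data: {"choices":[{"delta":{"content":"Hi"}}]}`)
	fmt.Println(p, done, ok)
	_, done, _ = parseSSELine("data: [DONE]")
	fmt.Println(done) // true
}
```

In a real client you would feed lines from a `bufio.Scanner` over the HTTP response body into this function and unmarshal each payload as a chat-completion chunk.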
5. Cost Analytics
Built-in admin API for tracking:
`GET /admin/analytics/costs`

```json
{
  "total_cost": 245.67,
  "total_requests": 45230,
  "cache_hits": 28640,
  "cache_misses": 16590,
  "estimated_saved": 156.23,
  "by_model": [
    {
      "model": "gpt-4o-mini",
      "requests": 32000,
      "cost": 4.80
    }
  ]
}
```
💻 Show Me The Code
Drop-In Replacement
Before:
```python
import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
```
After:
```python
import openai

client = openai.OpenAI(
    api_key="YOUR_PROXY_KEY",
    base_url="http://localhost:8080/v1"  # Just change this!
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
```
That's it: swap the base URL and API key, and every other line stays the same.
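The same drop-in idea works from Go without any SDK: point a plain `net/http` request at the proxy's OpenAI-compatible endpoint. The URL and key below are placeholders, and the helper name is mine.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// buildChatRequest prepares a POST to the proxy's chat completions route.
func buildChatRequest(baseURL, apiKey, model, userMsg string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: userMsg}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildChatRequest("http://localhost:8080", "YOUR_PROXY_KEY", "gpt-4", "Hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // POST http://localhost:8080/v1/chat/completions
	// http.DefaultClient.Do(req) would send it; omitted so the sketch runs offline.
}
```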
📊 Real-World Results
After 30 days in production:
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly Cost | $12,000 | $4,800 | 60% |
| Avg Response Time | 1,800ms | 350ms | 80% faster |
| API Failures | 12/month | 0 | 100% |
| Cache Hit Rate | 0% | 63% | ✨ |
🛠️ Tech Stack
- Go 1.22 - Fast, concurrent, production-ready
- Gin - HTTP web framework
- Redis - Cache layer
- PostgreSQL - Usage tracking & analytics
- Docker - Easy deployment
🚀 Getting Started
```bash
# Clone the repo
git clone https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy.git
cd ai_cost_optimizer_proxy

# Set up the environment
cp .env.example .env
# Add your OpenAI/Claude/Gemini API keys

# Start infrastructure
docker-compose -f docker/docker-compose.yml up -d

# Run the proxy
go run cmd/server/main.go
```
Server runs on http://localhost:8080 - just point your app there!
🔐 Security Features
- ✅ API key authentication
- ✅ Rate limiting per key
- ✅ Admin token protection
- ✅ Request validation
- ✅ Secure key generation (crypto/rand)
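The "secure key generation (crypto/rand)" bullet can be sketched in a few lines. The base64url encoding and 32-byte entropy size here are my assumptions, not necessarily what the proxy ships with:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newAPIKey returns a URL-safe random key with nBytes of entropy.
func newAPIKey(nBytes int) (string, error) {
	buf := make([]byte, nBytes)
	if _, err := rand.Read(buf); err != nil {
		return "", err // the OS entropy source failed
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	key, err := newAPIKey(32) // 256 bits of entropy
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key)) // 43 characters for 32 bytes
}
```

The important part is `crypto/rand` rather than `math/rand`: API keys derived from a predictable PRNG can be brute-forced.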
📈 What's Next?
Planned features:
- [ ] Request queueing for rate limit management
- [ ] A/B testing between models
- [ ] Custom routing rules via config
- [ ] Prometheus metrics export
- [ ] Web dashboard for analytics
🤝 Contributing
This is open source (MIT license)! Here's how you can help:
- Add new providers - Just 30 lines of code to add Mistral, Cohere, etc.
- Improve caching - Semantic similarity caching?
- Build the dashboard - React/Vue frontend for analytics
- Write tests - Always need more coverage
Check out the contributing guide.
🎯 Key Takeaways
- Caching is your best friend - in our traffic, 60%+ of requests could be served from cache
- Not all requests need GPT-4 - Smart routing saves 10-100x
- Fallbacks prevent downtime - Multi-provider = reliability
- Measure everything - You can't optimize what you don't track
🔗 Links
- GitHub: https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy
- Documentation: Full API docs in the repo
- License: MIT (use it anywhere!)
💬 Discussion
Have you dealt with LLM cost explosion in your projects? What strategies worked for you? Drop a comment below!
Found this useful? ⭐ Star the repo and follow me for more production AI tools!