I Built an AI Gateway That Cuts LLM Costs by 60% - Here's How
If you're using OpenAI, Claude, or other LLM APIs in production, you've probably experienced two painful realities:
- Costs spiral out of control as usage scales
- You're paying for the same requests multiple times
After seeing my company's AI bill hit $12,000/month (with 30-40% being duplicate requests), I built an open-source solution that sits between your app and LLM providers to automatically optimize costs.
The Problem
Let's say your customer service chatbot gets asked "What are your business hours?" 500 times a day. Without optimization, you're paying OpenAI for the same answer 500 times:
500 requests × ~1K tokens each × $0.03 per 1K tokens = $15/day = $450/month
For just ONE common question. Multiply that across hundreds of similar queries, and you see how quickly costs explode.
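The back-of-the-envelope math above can be sketched in a few lines of Go. The ~1K tokens per request is an assumption for illustration:

```go
package main

import "fmt"

// monthlyCost reproduces the estimate above: daily token spend times 30 days.
// tokensPerRequest (~1K per answer) is an assumed average, not a measured value.
func monthlyCost(requestsPerDay, tokensPerRequest int, pricePer1K float64) float64 {
	daily := float64(requestsPerDay) * float64(tokensPerRequest) / 1000 * pricePer1K
	return daily * 30
}

func main() {
	fmt.Printf("$%.0f/month\n", monthlyCost(500, 1000, 0.03)) // $450/month
}
```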
The Solution: An Intelligent Proxy
I built AI Cost Optimizer Proxy - a Go-based API gateway that:
- Caches responses using Redis (30-60% cost savings)
- Routes intelligently to cheaper models when appropriate
- Handles failures with automatic provider fallbacks
- Tracks everything - costs, usage, cache hit rates
- Works with existing code - OpenAI-compatible API
Architecture Overview
```
┌──────────────┐
│   Your App   │
└──────┬───────┘
       │  (No code changes needed)
       ▼
┌─────────────────────────────────┐
│     AI Cost Optimizer Proxy     │
├─────────────────────────────────┤
│   ┌─────────┐     ┌─────────┐   │
│   │  Cache  │     │  Smart  │   │
│   │  Layer  │     │ Router  │   │
│   └─────────┘     └─────────┘   │
└────────┬────────────────────────┘
         │
    ┌────┼─────┬───────┬────────┐
    ▼    ▼     ▼       ▼        ▼
 OpenAI Claude Gemini DeepSeek Grok
```
Key Features
1. Smart Caching
The proxy generates a SHA-256 hash from the request (model + messages + temperature) and checks Redis first:
```go
// Simplified pseudocode
cacheKey := sha256Hex(model + messages + temperature)
if cached, ok := redis.Get(cacheKey); ok {
    return cached // cost: $0
}
// Cache miss: call the provider, then store the response with a TTL
response := callProvider(request)
redis.Set(cacheKey, response, ttl)
return response
```
Real-world impact: Our customer support bot went from 10,000 API calls/day to 3,500 (65% cache hit rate).
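A minimal, runnable sketch of the cache-key idea. The function names and the `llm:` key prefix are illustrative, not the proxy's actual API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a deterministic Redis key from the request fields
// that affect the response: model, messages, and temperature.
func cacheKey(model, messages string, temperature float64) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%.2f", model, messages, temperature)))
	return "llm:" + hex.EncodeToString(h[:])
}

func main() {
	q := `[{"role":"user","content":"What are your business hours?"}]`
	k1 := cacheKey("gpt-4o-mini", q, 0.0)
	k2 := cacheKey("gpt-4o-mini", q, 0.0)
	fmt.Println(k1 == k2) // identical requests map to the same key
}
```

Because the hash covers model, messages, and temperature, a change to any of them produces a different key and bypasses the cached entry.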
2. Intelligent Model Routing
Not every request needs GPT-4. The proxy automatically routes based on complexity:
```go
if tokenCount < 500 && !containsComplexKeywords(prompt) {
    model = "gpt-4o-mini" // $0.00015 per 1K tokens
} else {
    model = "gpt-4" // $0.03 per 1K tokens
}
```
Savings: 200x cost reduction on simple queries.
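Here is one way such a heuristic could look as a self-contained function. The threshold and keyword list are assumptions for illustration, not the proxy's exact rules:

```go
package main

import (
	"fmt"
	"strings"
)

// chooseModel routes short prompts without "complex" keywords to the
// cheap model, and everything else to GPT-4. Thresholds are illustrative.
func chooseModel(prompt string, tokenCount int) string {
	complexKeywords := []string{"analyze", "prove", "refactor", "step by step"}
	lower := strings.ToLower(prompt)
	for _, kw := range complexKeywords {
		if strings.Contains(lower, kw) {
			return "gpt-4"
		}
	}
	if tokenCount < 500 {
		return "gpt-4o-mini"
	}
	return "gpt-4"
}

func main() {
	fmt.Println(chooseModel("What are your business hours?", 12)) // gpt-4o-mini
	fmt.Println(chooseModel("Analyze this contract clause", 12))  // gpt-4
}
```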
3. Automatic Fallback
Primary provider down? The proxy automatically tries backups:
```go
providers := []string{"openai", "anthropic", "gemini"}
for _, provider := range providers {
    response, err := tryProvider(provider, request)
    if err == nil {
        return response
    }
}
```
Result: 99.9% uptime even when individual providers fail.
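A runnable sketch of the fallback loop, with a stub provider in place of real API calls (`tryProvider` and `withFallback` are illustrative names, not the project's exports):

```go
package main

import (
	"errors"
	"fmt"
)

// tryProvider stands in for a real API call; here "openai" always fails
// so the fallback path is exercised.
func tryProvider(name, request string) (string, error) {
	if name == "openai" {
		return "", errors.New("openai: 503 service unavailable")
	}
	return "response from " + name, nil
}

// withFallback walks the provider list and returns the first success,
// or the last error if every provider fails.
func withFallback(providers []string, request string) (string, error) {
	var lastErr error
	for _, p := range providers {
		resp, err := tryProvider(p, request)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("all providers failed: %w", lastErr)
}

func main() {
	resp, _ := withFallback([]string{"openai", "anthropic", "gemini"}, "hello")
	fmt.Println(resp) // response from anthropic
}
```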
4. Streaming Support
Full SSE (Server-Sent Events) streaming for all providers:
```shell
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","stream":true,"messages":[...]}'
```
Real-time responses with cost tracking.
5. Cost Analytics
Built-in admin API for tracking:
GET /admin/analytics/costs
```json
{
  "total_cost": 245.67,
  "total_requests": 45230,
  "cache_hits": 28640,
  "cache_misses": 16590,
  "estimated_saved": 156.23,
  "by_model": [
    {
      "model": "gpt-4o-mini",
      "requests": 32000,
      "cost": 4.80
    }
  ]
}
```
Show Me The Code
Drop-In Replacement
Before:
```python
import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
```
After:
```python
import openai

client = openai.OpenAI(
    api_key="YOUR_PROXY_KEY",
    base_url="http://localhost:8080/v1"  # Just change this!
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
```
That's it. One line changed, nothing else.
Real-World Results
After 30 days in production:
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly Cost | $12,000 | $4,800 | 60% |
| Avg Response Time | 1,800ms | 350ms | 80% faster |
| API Failures | 12/month | 0 | 100% |
| Cache Hit Rate | 0% | 63% | n/a |
Tech Stack
- Go 1.22 - Fast, concurrent, production-ready
- Gin - HTTP web framework
- Redis - Cache layer
- PostgreSQL - Usage tracking & analytics
- Docker - Easy deployment
Getting Started
```shell
# Clone the repo
git clone https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy.git
cd ai_cost_optimizer_proxy

# Set up the environment
cp .env.example .env
# Add your OpenAI/Claude/Gemini API keys

# Start infrastructure
docker-compose -f docker/docker-compose.yml up -d

# Run the proxy
go run cmd/server/main.go
```
Server runs on http://localhost:8080 - just point your app there!
Security Features
- API key authentication
- Rate limiting per key
- Admin token protection
- Request validation
- Secure key generation (crypto/rand)
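For the last point, here is what key generation with `crypto/rand` typically looks like. The function name and the `sk-proxy-` prefix are illustrative, not the project's actual format:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newAPIKey generates 32 bytes from a cryptographically secure source
// and encodes them as URL-safe base64.
func newAPIKey() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return "sk-proxy-" + base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	key, err := newAPIKey()
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```

Unlike `math/rand`, `crypto/rand` is safe for secrets: its output cannot be predicted from previous values.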
What's Next?
Planned features:
- [ ] Request queueing for rate limit management
- [ ] A/B testing between models
- [ ] Custom routing rules via config
- [ ] Prometheus metrics export
- [ ] Web dashboard for analytics
Contributing
This is open source (MIT license)! Here's how you can help:
- Add new providers - Just 30 lines of code to add Mistral, Cohere, etc.
- Improve caching - Semantic similarity caching?
- Build the dashboard - React/Vue frontend for analytics
- Write tests - Always need more coverage
Check out the contributing guide.
Key Takeaways
- Caching is your best friend - a large share of LLM traffic is duplicated (30-40% in our case)
- Not all requests need GPT-4 - Smart routing saves 10-100x
- Fallbacks prevent downtime - Multi-provider = reliability
- Measure everything - You can't optimize what you don't track
Links
- GitHub: https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy
- Documentation: Full API docs in the repo
- License: MIT (use it anywhere!)
Discussion
Have you dealt with LLM cost explosion in your projects? What strategies worked for you? Drop a comment below!
Found this useful? Star the repo and follow me for more production AI tools!