DEV Community

Mazyar Yousefiniyae shad

Posted on • Originally published at github.com

I Built an AI Gateway That Cuts LLM Costs by 60% - Here's How

If you're using OpenAI, Claude, or other LLM APIs in production, you've probably experienced two painful realities:

  1. 💸 Costs spiral out of control as usage scales
  2. 🔄 You're paying for the same requests multiple times

After seeing my company's AI bill hit $12,000/month (with 30-40% being duplicate requests), I built an open-source solution that sits between your app and LLM providers to automatically optimize costs.

🎯 The Problem

Let's say your customer service chatbot gets asked "What are your business hours?" 500 times a day. Without optimization, you're paying OpenAI for the same answer 500 times:

500 requests × ~1K tokens each × $0.03 per 1K tokens = $15/day ≈ $450/month

For just ONE common question. Multiply that across hundreds of similar queries, and you see how quickly costs explode.

💡 The Solution: An Intelligent Proxy

I built AI Cost Optimizer Proxy - a Go-based API gateway that:

  • Caches responses using Redis (30-60% cost savings)
  • Routes intelligently to cheaper models when appropriate
  • Handles failures with automatic provider fallbacks
  • Tracks everything - costs, usage, cache hit rates
  • Works with existing code - OpenAI-compatible API

🏗️ Architecture Overview

┌─────────────┐
│ Your App    │
└──────┬──────┘
       │ (No code changes needed)
       ▼
┌─────────────────────────────────┐
│  AI Cost Optimizer Proxy        │
├─────────────────────────────────┤
│  ┌─────────┐   ┌──────────┐    │
│  │ Cache   │   │  Smart   │    │
│  │ Layer   │   │ Router   │    │
│  └─────────┘   └──────────┘    │
└────────┬────────────────────────┘
         │
    ┌────┼────┬──────┬───────┐
    ▼    ▼    ▼      ▼       ▼
  OpenAI Claude Gemini DeepSeek Grok

🚀 Key Features

1. Smart Caching

The proxy generates a SHA-256 hash from the request (model + messages + temperature) and checks Redis first:

// Simplified request path (pseudo-code)
cacheKey := sha256Hex(model + messagesJSON + fmt.Sprint(temperature))
if cached, found := redis.Get(cacheKey); found {
    return cached // served from cache; upstream cost: $0
}

// Cache miss: call the provider, then store the response with a TTL
response := callProvider(request)
redis.Set(cacheKey, response, ttl)
return response

Real-world impact: Our customer support bot went from 10,000 API calls/day to 3,500 (65% cache hit rate).
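To make the key-generation step concrete, here is a minimal, runnable sketch. The helper name, the `llmcache:` prefix, and the exact serialization are assumptions for illustration, not the project's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a deterministic Redis key from the request fields
// that affect the response: model, serialized messages, and temperature.
// Identical requests always hash to the same key, so repeat questions
// become cache hits.
func cacheKey(model, messagesJSON string, temperature float64) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%s|%.2f", model, messagesJSON, temperature)
	return "llmcache:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey("gpt-4o-mini",
		`[{"role":"user","content":"What are your business hours?"}]`, 0.0)
	fmt.Println(key) // same request -> same key -> $0 on repeat
}
```

Note that temperature is part of the key on purpose: the same prompt at a different temperature is a different request and should not share a cache entry.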

2. Intelligent Model Routing

Not every request needs GPT-4. The proxy automatically routes based on complexity:

if tokenCount < 500 && !containsComplexKeywords(prompt) {
    model = "gpt-4o-mini"  // $0.00015 per 1K tokens
} else {
    model = "gpt-4"         // $0.03 per 1K tokens
}

Savings: 200x cost reduction on simple queries.
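Fleshed out, the routing heuristic is a small pure function. This sketch keeps the 500-token threshold from the snippet above; the keyword list is an invented placeholder, since the real list lives in the project's config:

```go
package main

import (
	"fmt"
	"strings"
)

// complexKeywords flags prompts that likely need a stronger model.
// (Hypothetical examples; tune this list for your own traffic.)
var complexKeywords = []string{"analyze", "prove", "refactor", "step-by-step"}

func containsComplexKeywords(prompt string) bool {
	lower := strings.ToLower(prompt)
	for _, kw := range complexKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

// routeModel picks the cheap model for short, simple prompts and
// falls back to the expensive one otherwise.
func routeModel(prompt string, tokenCount int) string {
	if tokenCount < 500 && !containsComplexKeywords(prompt) {
		return "gpt-4o-mini" // $0.00015 per 1K tokens
	}
	return "gpt-4" // $0.03 per 1K tokens
}

func main() {
	fmt.Println(routeModel("What are your business hours?", 12)) // gpt-4o-mini
	fmt.Println(routeModel("Analyze this contract clause", 1200)) // gpt-4
}
```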

3. Automatic Fallback

Primary provider down? The proxy automatically tries backups:

providers := []string{"openai", "anthropic", "gemini"}
for _, provider := range providers {
    if response, err := tryProvider(provider, request); err == nil {
        return response, nil
    }
}
return nil, errors.New("all providers failed")

Result: 99.9% uptime even when individual providers fail.

4. Streaming Support

Full SSE (Server-Sent Events) streaming for all providers:

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"model":"gpt-4o-mini","stream":true,"messages":[...]}'

Real-time responses with cost tracking.

5. Cost Analytics

Built-in admin API for tracking:

GET /admin/analytics/costs
{
  "total_cost": 245.67,
  "total_requests": 45230,
  "cache_hits": 28640,
  "cache_misses": 16590,
  "estimated_saved": 156.23,
  "by_model": [
    {
      "model": "gpt-4o-mini",
      "requests": 32000,
      "cost": 4.80
    }
  ]
}
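The derived numbers in that payload fall out of the raw counters. A sketch of one plausible way to compute them; the `estimated_saved` formula here (each hit avoids one call at the average cost of a miss) is an assumption, and the project's own accounting may differ:

```go
package main

import "fmt"

// hitRate returns the fraction of requests served from cache.
func hitRate(hits, misses int) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

// estimatedSaved approximates money saved by caching, assuming every
// cache hit would otherwise have cost the average price of a miss.
func estimatedSaved(hits, misses int, totalCost float64) float64 {
	if misses == 0 {
		return 0
	}
	avgCostPerCall := totalCost / float64(misses)
	return float64(hits) * avgCostPerCall
}

func main() {
	// Counters from the analytics payload above.
	fmt.Printf("hit rate: %.1f%%\n", 100*hitRate(28640, 16590)) // hit rate: 63.3%
	fmt.Printf("saved: $%.2f\n", estimatedSaved(28640, 16590, 245.67))
}
```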

💻 Show Me The Code

Drop-In Replacement

Before:

import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

After:

import openai

client = openai.OpenAI(
    api_key="YOUR_PROXY_KEY",
    base_url="http://localhost:8080/v1"  # Just change this!
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

That's it. One configuration change, no other code changes.

📊 Real-World Results

After 30 days in production:

| Metric            | Before   | After  | Savings    |
| ----------------- | -------- | ------ | ---------- |
| Monthly Cost      | $12,000  | $4,800 | 60%        |
| Avg Response Time | 1,800ms  | 350ms  | 80% faster |
| API Failures      | 12/month | 0      | 100%       |
| Cache Hit Rate    | 0%       | 63%    | n/a        |

🛠️ Tech Stack

  • Go 1.22 - Fast, concurrent, production-ready
  • Gin - HTTP web framework
  • Redis - Cache layer
  • PostgreSQL - Usage tracking & analytics
  • Docker - Easy deployment

🚀 Getting Started

# Clone the repo
git clone https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy.git
cd ai_cost_optimizer_proxy

# Setup environment
cp .env.example .env
# Add your OpenAI/Claude/Gemini API keys

# Start infrastructure
docker-compose -f docker/docker-compose.yml up -d

# Run the proxy
go run cmd/server/main.go

Server runs on http://localhost:8080 - just point your app there!

🔐 Security Features

  • ✅ API key authentication
  • ✅ Rate limiting per key
  • ✅ Admin token protection
  • ✅ Request validation
  • ✅ Secure key generation (crypto/rand)

📈 What's Next?

Planned features:

  • [ ] Request queueing for rate limit management
  • [ ] A/B testing between models
  • [ ] Custom routing rules via config
  • [ ] Prometheus metrics export
  • [ ] Web dashboard for analytics

🤝 Contributing

This is open source (MIT license)! Here's how you can help:

  1. Add new providers - Just 30 lines of code to add Mistral, Cohere, etc.
  2. Improve caching - Semantic similarity caching?
  3. Build the dashboard - React/Vue frontend for analytics
  4. Write tests - Always need more coverage

Check out the contributing guide.

🎯 Key Takeaways

  1. Caching is your best friend - 60%+ of LLM requests are duplicates
  2. Not all requests need GPT-4 - Smart routing saves 10-100x
  3. Fallbacks prevent downtime - Multi-provider = reliability
  4. Measure everything - You can't optimize what you don't track

💬 Discussion

Have you dealt with LLM cost explosion in your projects? What strategies worked for you? Drop a comment below!


Found this useful? ⭐ Star the repo and follow me for more production AI tools!

#AI #OpenAI #Golang #DevOps #CloudCosts #MachineLearning #ChatGPT
