DEV Community

Mazyar Yousefiniyae shad

Posted on • Originally published at github.com

I Built an AI Gateway That Cuts LLM Costs by 60% - Here's How

If you're using OpenAI, Claude, or other LLM APIs in production, you've probably experienced two painful realities:

  1. 💸 Costs spiral out of control as usage scales
  2. 🔄 You're paying for the same requests multiple times

After seeing my company's AI bill hit $12,000/month (with 30-40% being duplicate requests), I built an open-source solution that sits between your app and LLM providers to automatically optimize costs.

🎯 The Problem

Let's say your customer service chatbot gets asked "What are your business hours?" 500 times a day. Without optimization, you're paying OpenAI for the same answer 500 times:

500 requests × ~1K tokens each × $0.03 per 1K tokens = $15/day ≈ $450/month

For just ONE common question. Multiply that across hundreds of similar queries, and you see how quickly costs explode.

💡 The Solution: An Intelligent Proxy

I built AI Cost Optimizer Proxy - a Go-based API gateway that:

  • Caches responses using Redis (30-60% cost savings)
  • Routes intelligently to cheaper models when appropriate
  • Handles failures with automatic provider fallbacks
  • Tracks everything - costs, usage, cache hit rates
  • Works with existing code - OpenAI-compatible API

🏗️ Architecture Overview

┌─────────────┐
│ Your App    │
└──────┬──────┘
       │ (No code changes needed)
       ▼
┌─────────────────────────────────┐
│  AI Cost Optimizer Proxy        │
├─────────────────────────────────┤
│  ┌─────────┐   ┌──────────┐    │
│  │ Cache   │   │  Smart   │    │
│  │ Layer   │   │ Router   │    │
│  └─────────┘   └──────────┘    │
└────────┬────────────────────────┘
         │
    ┌────┼────┬──────┬───────┐
    ▼    ▼    ▼      ▼       ▼
  OpenAI Claude Gemini DeepSeek Grok

🚀 Key Features

1. Smart Caching

The proxy generates a SHA-256 hash from the request (model + messages + temperature) and checks Redis first:

// Simplified request path (pseudo-code)
cacheKey := sha256Hex(model + messagesJSON + fmt.Sprint(temperature))
if cached, found := redis.Get(cacheKey); found {
    return cached // served from cache; upstream cost: $0
}

// Cache miss: call the provider, then store the response with a TTL
response := callProvider(request)
redis.Set(cacheKey, response, ttl)
return response

Real-world impact: Our customer support bot went from 10,000 API calls/day to 3,500 (65% cache hit rate).
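To make the key-generation step concrete, here is a minimal, runnable sketch. The helper name, the `llmcache:` prefix, and the exact serialization are assumptions for illustration, not the project's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a deterministic Redis key from the request fields
// that affect the response: model, serialized messages, and temperature.
// Identical requests always hash to the same key, so repeat questions
// become cache hits.
func cacheKey(model, messagesJSON string, temperature float64) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%s|%.2f", model, messagesJSON, temperature)
	return "llmcache:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey("gpt-4o-mini",
		`[{"role":"user","content":"What are your business hours?"}]`, 0.0)
	fmt.Println(key) // same request -> same key -> $0 on repeat
}
```

Note that temperature is part of the key on purpose: the same prompt at a different temperature is a different request and should not share a cache entry.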

2. Intelligent Model Routing

Not every request needs GPT-4. The proxy automatically routes based on complexity:

if tokenCount < 500 && !containsComplexKeywords(prompt) {
    model = "gpt-4o-mini"  // $0.00015 per 1K tokens
} else {
    model = "gpt-4"         // $0.03 per 1K tokens
}

Savings: 200x cost reduction on simple queries.
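Fleshed out, the routing heuristic is a small pure function. This sketch keeps the 500-token threshold from the snippet above; the keyword list is an invented placeholder, since the real list lives in the project's config:

```go
package main

import (
	"fmt"
	"strings"
)

// complexKeywords flags prompts that likely need a stronger model.
// (Hypothetical examples; tune this list for your own traffic.)
var complexKeywords = []string{"analyze", "prove", "refactor", "step-by-step"}

func containsComplexKeywords(prompt string) bool {
	lower := strings.ToLower(prompt)
	for _, kw := range complexKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

// routeModel picks the cheap model for short, simple prompts and
// falls back to the expensive one otherwise.
func routeModel(prompt string, tokenCount int) string {
	if tokenCount < 500 && !containsComplexKeywords(prompt) {
		return "gpt-4o-mini" // $0.00015 per 1K tokens
	}
	return "gpt-4" // $0.03 per 1K tokens
}

func main() {
	fmt.Println(routeModel("What are your business hours?", 12)) // gpt-4o-mini
	fmt.Println(routeModel("Analyze this contract clause", 1200)) // gpt-4
}
```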

3. Automatic Fallback

Primary provider down? The proxy automatically tries backups:

providers := []string{"openai", "anthropic", "gemini"}
for _, provider := range providers {
    if response, err := tryProvider(provider, request); err == nil {
        return response, nil
    }
}
return nil, errors.New("all providers failed")

Result: 99.9% uptime even when individual providers fail.

4. Streaming Support

Full SSE (Server-Sent Events) streaming for all providers:

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"model":"gpt-4o-mini","stream":true,"messages":[...]}'

Real-time responses with cost tracking.

5. Cost Analytics

Built-in admin API for tracking:

GET /admin/analytics/costs
{
  "total_cost": 245.67,
  "total_requests": 45230,
  "cache_hits": 28640,
  "cache_misses": 16590,
  "estimated_saved": 156.23,
  "by_model": [
    {
      "model": "gpt-4o-mini",
      "requests": 32000,
      "cost": 4.80
    }
  ]
}
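The derived numbers in that payload fall out of the raw counters. A sketch of one plausible way to compute them; the `estimated_saved` formula here (each hit avoids one call at the average cost of a miss) is an assumption, and the project's own accounting may differ:

```go
package main

import "fmt"

// hitRate returns the fraction of requests served from cache.
func hitRate(hits, misses int) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

// estimatedSaved approximates money saved by caching, assuming every
// cache hit would otherwise have cost the average price of a miss.
func estimatedSaved(hits, misses int, totalCost float64) float64 {
	if misses == 0 {
		return 0
	}
	avgCostPerCall := totalCost / float64(misses)
	return float64(hits) * avgCostPerCall
}

func main() {
	// Counters from the analytics payload above.
	fmt.Printf("hit rate: %.1f%%\n", 100*hitRate(28640, 16590)) // hit rate: 63.3%
	fmt.Printf("saved: $%.2f\n", estimatedSaved(28640, 16590, 245.67))
}
```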

💻 Show Me The Code

Drop-In Replacement

Before:

import openai

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

After:

import openai

client = openai.OpenAI(
    api_key="YOUR_PROXY_KEY",
    base_url="http://localhost:8080/v1"  # Just change this!
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

That's it. One configuration change, no other code changes.

📊 Real-World Results

After 30 days in production:

| Metric            | Before   | After  | Savings    |
| ----------------- | -------- | ------ | ---------- |
| Monthly Cost      | $12,000  | $4,800 | 60%        |
| Avg Response Time | 1,800ms  | 350ms  | 80% faster |
| API Failures      | 12/month | 0      | 100%       |
| Cache Hit Rate    | 0%       | 63%    | n/a        |

🛠️ Tech Stack

  • Go 1.22 - Fast, concurrent, production-ready
  • Gin - HTTP web framework
  • Redis - Cache layer
  • PostgreSQL - Usage tracking & analytics
  • Docker - Easy deployment

🚀 Getting Started

# Clone the repo
git clone https://github.com/mazyaryousefiniyaeshad/ai_cost_optimizer_proxy.git
cd ai_cost_optimizer_proxy

# Setup environment
cp .env.example .env
# Add your OpenAI/Claude/Gemini API keys

# Start infrastructure
docker-compose -f docker/docker-compose.yml up -d

# Run the proxy
go run cmd/server/main.go

Server runs on http://localhost:8080 - just point your app there!

🔐 Security Features

  • ✅ API key authentication
  • ✅ Rate limiting per key
  • ✅ Admin token protection
  • ✅ Request validation
  • ✅ Secure key generation (crypto/rand)

📈 What's Next?

Planned features:

  • [ ] Request queueing for rate limit management
  • [ ] A/B testing between models
  • [ ] Custom routing rules via config
  • [ ] Prometheus metrics export
  • [ ] Web dashboard for analytics

🤝 Contributing

This is open source (MIT license)! Here's how you can help:

  1. Add new providers - Just 30 lines of code to add Mistral, Cohere, etc.
  2. Improve caching - Semantic similarity caching?
  3. Build the dashboard - React/Vue frontend for analytics
  4. Write tests - Always need more coverage

Check out the contributing guide.

🎯 Key Takeaways

  1. Caching is your best friend - 60%+ of LLM requests are duplicates
  2. Not all requests need GPT-4 - Smart routing saves 10-100x
  3. Fallbacks prevent downtime - Multi-provider = reliability
  4. Measure everything - You can't optimize what you don't track

💬 Discussion

Have you dealt with LLM cost explosion in your projects? What strategies worked for you? Drop a comment below!


Found this useful? ⭐ Star the repo and follow me for more production AI tools!

#AI #OpenAI #Golang #DevOps #CloudCosts #MachineLearning #ChatGPT
