DEV Community

Kai Thorne
Kai Thorne

Posted on

How I Cut My LLM API Costs by 75% with a Simple Python Proxy

If you're building anything with LLMs in 2026, you already know the pain: your OpenAI bill is climbing faster than your user count. I was spending $400+/month on API calls for a side project that was barely breaking even. Then I built a proxy that cut my costs by 75%.

Here's exactly how it works — and how you can replicate it.

The Problem: Blind API Spending

Most developers call LLM APIs the naive way: send a prompt, get a response, pay whatever the provider charges. But here's what most people miss:

  • Different providers charge wildly different rates for comparable quality
  • Caching identical prompts can save 30-40% of your bill
  • Routing simple queries to cheaper models makes a massive difference
  • Batching requests reduces overhead costs

I was sending every request to GPT-4 without thinking. My average cost per request was $0.045. After building my proxy, it dropped to $0.011.

Architecture: The API Arbitrage Proxy

The concept is simple. Instead of calling OpenAI directly, you route all requests through a lightweight Python proxy that:

  1. Checks a cache for duplicate/similar prompts
  2. Evaluates query complexity using a fast classifier
  3. Routes to the optimal provider based on price/quality tradeoffs
  4. Logs everything so you can track savings

Here's the core routing logic:

import hashlib
import json
from openai import OpenAI
import anthropic
import google.generativeai as genai

class LLMArbitrageProxy:
    def __init__(self):
        self.cache = {}
        self.pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},
            "gpt-3.5-turbo": {"input": 0.001, "output": 0.002},
            "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
            "gemini-flash": {"input": 0.000075, "output": 0.0003},
        }
        self.total_saved = 0.0

    def _cache_key(self, prompt, model=None):
        return hashlib.sha256(
            json.dumps({"prompt": prompt, "model": model}, sort_keys=True).encode()
        ).hexdigest()[:16]

    def _classify_complexity(self, prompt):
        """Simple heuristic: longer, more technical = higher complexity."""
        word_count = len(prompt.split())
        technical_terms = ["analyze", "debug", "architecture", "refactor", "optimize"]
        score = sum(1 for t in technical_terms if t in prompt.lower())

        if word_count > 200 or score >= 2:
            return "high"
        elif word_count > 50 or score >= 1:
            return "medium"
        return "low"

    def _select_model(self, complexity, requested_model=None):
        """Route to cheapest model that handles the complexity."""
        if requested_model:
            return requested_model

        routing = {
            "low": "gemini-flash",
            "medium": "claude-3-haiku",
            "high": "gpt-4",
        }
        return routing[complexity]

    def query(self, prompt, requested_model=None, use_cache=True):
        # Check cache first
        cache_key = self._cache_key(prompt, requested_model)
        if use_cache and cache_key in self.cache:
            print(f"Cache HIT — saved ~${self.cache[cache_key]['saved']:.4f}")
            return self.cache[cache_key]["response"]

        # Route intelligently
        complexity = self._classify_complexity(prompt)
        model = self._select_model(complexity, requested_model)

        # Calculate what we would have paid
        naive_cost = len(prompt.split()) * self.pricing["gpt-4"]["input"] / 1000
        actual_cost = len(prompt.split()) * self.pricing[model]["input"] / 1000
        saved = naive_cost - actual_cost

        # Make the actual API call
        response = self._call_provider(model, prompt)

        # Cache the result
        if use_cache:
            self.cache[cache_key] = {
                "response": response,
                "saved": saved,
                "model": model,
            }

        self.total_saved += saved
        print(f"Routed to {model} (complexity: {complexity}) — saved ${saved:.4f}")
        return response
Enter fullscreen mode Exit fullscreen mode

The Caching Layer That Saves 30%+

The single biggest win was caching. Not just exact-match caching, but semantic caching:

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.entries = []
        self.threshold = similarity_threshold

    def find_similar(self, prompt):
        if not self.entries:
            return None

        embedding = self.model.encode([prompt])
        cached_embeddings = np.array([e['embedding'] for e in self.entries])
        similarities = np.dot(cached_embeddings, embedding.T).flatten()

        best_idx = np.argmax(similarities)
        if similarities[best_idx] > self.threshold:
            return self.entries[best_idx]['response']
        return None

    def store(self, prompt, response):
        embedding = self.model.encode([prompt])[0]
        self.entries.append({
            'embedding': embedding,
            'response': response,
        })
Enter fullscreen mode Exit fullscreen mode

This catches rephrased versions of the same question. My app asks "summarize this article" in 50 slightly different ways — semantic cache catches 90% of them.

Real Numbers From My Production Setup

After 3 months of running this proxy in production:

  • Monthly API cost: $412 → $98
  • Avg cost per request: $0.045 → $0.011
  • Cache hit rate: 0% → 34%
  • Response quality (user ratings): 4.2/5 → 4.1/5

The slight quality dip? Nobody noticed. My users care about speed and correctness, not which model produced the answer.

Adding Provider Failover

One unexpected benefit: when OpenAI had an outage last month, my proxy automatically failed over to Claude. Zero downtime for my users:

def _call_with_fallback(self, prompt, primary_model):
    providers = ["gpt-4", "claude-3-haiku", "gemini-flash"]

    # Put primary model first
    ordered = [primary_model] + [p for p in providers if p != primary_model]

    for model in ordered:
        try:
            return self._call_provider(model, prompt)
        except Exception as e:
            print(f"{model} failed: {e}, trying next...")
            continue

    raise Exception("All providers failed")
Enter fullscreen mode Exit fullscreen mode

Deploy It in 15 Minutes

You don't have to build this from scratch. I packaged the entire system — proxy, caching, routing, analytics dashboard — into a complete toolkit.

Get the AI API Arbitrage Proxy for $29 →

It includes:

  • Production-ready proxy server with FastAPI
  • Semantic caching engine
  • Multi-provider routing with automatic failover
  • Cost analytics dashboard
  • Deployment configs for Docker, Railway, and Fly.io
  • Full documentation and setup guide

Key Takeaways

  1. Don't blindly use the most expensive model — most queries don't need GPT-4
  2. Cache aggressively — semantic caching catches rephrased duplicates
  3. Track your costs per-request — you can't optimize what you don't measure
  4. Build in failover — multi-provider routing is a resilience win that happens to save money too

The era of throwing expensive API calls at every prompt is over. With a smart routing layer, you can deliver the same quality at a fraction of the cost.


Have questions about LLM cost optimization? Drop a comment below — I'm happy to dig into the details.

Top comments (0)