zhongqiyue

Posted on Jun 4

I built a simple AI proxy to cut API costs — here's what I learned

#ai #python #webdev #proxy

A few months ago, my OpenAI API bill suddenly jumped from a modest $30 to over $150 in one month. I wasn't even doing anything crazy — just running a small Slack bot that answered questions about our internal docs. But between repeated prompts, failed retries, and my own debugging queries, the tokens added up fast.

I tried the obvious fixes first: adding client-side caching, switching to gpt-3.5-turbo from gpt-4, and even imposing manual rate limits on myself. None of it stuck. Caching exact prompts doesn’t work when users ask the same question but rephrase it slightly. And rate limits just made the bot feel sluggish.

So I built a lightweight AI proxy — a thin middleware layer between my app and the LLM provider. It wasn't flashy, but it immediately stopped the bleeding. Here’s the honest story of what I did, what I broke along the way, and what I’d do differently next time.

What I tried (and what didn’t work)

Client-side caching

I started with a simple dictionary in memory: store (prompt, model) as key, response as value. For a static FAQ bot, it works great. But for conversation threads with context, the prompts are never identical. Users ask “What’s the refund policy?” and then “What about returns?” — same intent, different wording. Cache miss every time.
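
To make that failure concrete, here is a minimal sketch of an exact-match cache. The `fake_llm` and `ask` names are hypothetical stand-ins for illustration, not the bot's actual code.

```python
# Exact-match cache keyed on (prompt, model). All names here are illustrative.
cache = {}
calls = 0  # counts how often we fall through to the (fake) model


def fake_llm(prompt: str, model: str) -> str:
    """Hypothetical stand-in for a paid API call."""
    global calls
    calls += 1
    return f"[{model}] answer to: {prompt}"


def ask(prompt: str, model: str = "gpt-4") -> str:
    key = (prompt, model)
    if key not in cache:  # only a character-identical prompt hits
        cache[key] = fake_llm(prompt, model)
    return cache[key]


ask("What's the refund policy?")
ask("What's the refund policy?")  # hit: identical string
ask("What about returns?")        # miss: same intent, different wording
print(calls)
```

The second call is free, but the rephrased third one pays for a fresh completion even though the answer would be the same.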

Rate limiting on the app side

I added a time.sleep(0.5) between requests. That works until you have concurrent users: the sleep blocks the worker, so every other request queues up behind it. I even tried semaphore-based throttling, but it made the Slack bot feel unresponsive. Users hate waiting 5 seconds for a reply.
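
The blocking behaviour can be reproduced with a small sketch. It assumes the app runs on an asyncio event loop, which the post doesn't actually state; the handler names are made up for the demo.

```python
import asyncio
import time

events = []


async def blocking_handler(name: str) -> None:
    events.append(f"{name}-start")
    time.sleep(0.05)           # blocks the whole event loop
    events.append(f"{name}-end")


async def cooperative_handler(name: str) -> None:
    events.append(f"{name}-start")
    await asyncio.sleep(0.05)  # yields so other requests can run
    events.append(f"{name}-end")


async def run(handler) -> list:
    events.clear()
    await asyncio.gather(handler("a"), handler("b"))
    return list(events)


blocking_order = asyncio.run(run(blocking_handler))
cooperative_order = asyncio.run(run(cooperative_handler))
print(blocking_order)     # "a" finishes before "b" even starts
print(cooperative_order)  # both start before either finishes
```

With `time.sleep`, user "b" cannot even begin until user "a" is done. Multiply that by a handful of concurrent users and the delays stack up.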

Switching models

Dropping to gpt-3.5-turbo cut cost per token, but the quality dipped noticeably. For internal docs that’s tolerable, but for customer-facing replies it wasn’t acceptable. I needed a smarter approach.

What eventually worked: a proxy with caching, rate limiting, and a token buffer

I wrote a small Python server using FastAPI (Flask would work too). It sits between my app and OpenAI’s API. Every request goes through the proxy, which does three things:

Caches responses — but not just by exact match. I compute a quick embedding (using text-embedding-3-small) and compare against recent queries using cosine similarity. If the similarity is above 0.95, I serve the cached response.
Rate-limits per user — using a token bucket algorithm backed by Redis, so bursts are smoothed out without blocking everyone.
Buffers retries — when OpenAI returns a 429 or 503, the proxy retries with exponential backoff instead of failing immediately.
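
For readers who haven't met the token bucket before, here is a rough in-memory sketch of the idea. It is not the Redis-backed version described above: the state lives in a plain dict, and the capacity and refill numbers are made up for illustration.

```python
import time


class TokenBucket:
    """In-memory token bucket. The proxy described here keeps this state in Redis."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.now = now              # injectable clock, handy for tests
        self.tokens = capacity      # start full so short bursts pass
        self.updated = now()

    def allow(self, cost: float = 1.0) -> bool:
        current = self.now()
        elapsed = current - self.updated
        self.updated = current
        # Refill for the time that passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}  # one bucket per user_id


def check_rate_limit(user_id: str) -> bool:
    # capacity=5 and 1 token/sec are placeholder values, not the author's settings
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(capacity=5, refill_per_sec=1.0)
    return buckets[user_id].allow()
```

Each user can burst up to the bucket's capacity and is then held to the refill rate, while other users' buckets are unaffected.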

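Here’s the core of the proxy (I’ve stripped error handling for brevity, and the real implementations of the helpers check_rate_limit, get_embedding, call_openai_with_retry, and store_cache are omitted):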

# app.py — simplified proxy
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import redis, openai, numpy as np
from sklearn.metrics.pairwise import cosine_similarity

app = FastAPI()
r = redis.Redis()
openai.api_key = "sk-..."  # placeholder; load the real key from an environment variable

class Request(BaseModel):
    prompt: str
    user_id: str

CACHE_THRESHOLD = 0.95
CACHE_TTL = 3600  # 1 hour

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    # 1. Check rate limit
    if not check_rate_limit(request.user_id):
        raise HTTPException(429, "Rate limit exceeded")

    # 2. Compute embedding for semantic cache lookup
    prompt_vec = get_embedding(request.prompt)
    # SCAN iterates incrementally; KEYS would block Redis while it walks the whole keyspace
    for key in r.scan_iter(match="cache:*"):
        data = r.hgetall(key)
        # Assumes store_cache wrote the embedding as raw float32 bytes
        cached_vec = np.frombuffer(data[b"embedding"], dtype=np.float32)
        sim = cosine_similarity([prompt_vec], [cached_vec])[0][0]
        if sim > CACHE_THRESHOLD:
            return {"cached": True, "response": data[b"response"].decode()}

    # 3. Call OpenAI with retry buffer
    response = await call_openai_with_retry(request.prompt)

    # 4. Store in cache
    store_cache(request.prompt, prompt_vec, response)

    return {"cached": False, "response": response}

Yes, that’s a lot to absorb. But the key insight is that the proxy doesn’t just cache exact strings: it matches prompts by semantic similarity. That’s what finally cut my costs by 70%.
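
To show the lookup step in isolation, here is a dependency-free sketch of a similarity-threshold cache. It uses toy 3-dimensional vectors in place of real embeddings, and unlike the handler above it returns the best match over the threshold rather than the first one. The `lookup` helper and the sample entries are illustrative assumptions.

```python
import math

CACHE_THRESHOLD = 0.95


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(prompt_vec, entries, threshold=CACHE_THRESHOLD):
    """Return the cached response most similar to prompt_vec, or None."""
    best_score, best_response = threshold, None
    for cached_vec, response in entries:
        score = cosine(prompt_vec, cached_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response


# Toy 3-dim vectors standing in for real 1536-dim embeddings.
entries = [
    ([1.0, 0.0, 0.0], "cached answer A"),
    ([0.0, 1.0, 0.0], "cached answer B"),
]
print(lookup([0.99, 0.05, 0.0], entries))  # near the first vector: cache hit
print(lookup([0.6, 0.6, 0.5], entries))    # close to neither: falls through to the LLM
```

A linear scan like this is fine for a few thousand entries. Beyond that, a vector index would likely be the better fit.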

Trade-offs and limitations

I’ll be honest: this isn’t a free lunch.

Embedding calls add latency. Each request now does an extra API call to get the embedding before hitting the LLM. On average, that adds 200ms. For my Slack bot, that’s fine. For a real-time chat, it might be too slow. You could use a local embedding model, but that’s more complexity.
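Redis memory grows. Each cached response includes a 1536-dim float vector (about 6 KB as float32). After a few thousand entries, you’ll want to evict old ones. I used LRU eviction, which is configured on the Redis server itself through its maxmemory and maxmemory-policy settings (for example allkeys-lru) rather than through redis-py.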
Cosine similarity isn’t perfect. Sometimes two prompts sound similar but need different answers (e.g., “What’s the price?” vs “What’s the price after discount?”). I set the threshold high enough to avoid false positives, but it still happens.
Retry buffering can mask real failures. If OpenAI is down for 10 minutes, the proxy will keep retrying and exhausting your request quota. I added a circuit breaker after 5 consecutive failures.
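
The retry and circuit-breaker behaviour can be sketched roughly as follows. This is a synchronous illustration rather than the proxy's actual async code; `TransientError`, `CircuitBreaker`, `call_with_retry`, and the 60-second cooldown are assumptions made for the example. Only the threshold of 5 consecutive failures comes from the list above.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the upstream API."""


class TransientError(Exception):
    """Hypothetical stand-in for a 429/503 response."""


class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown=60.0, now=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.now = now
        self.failures = 0
        self.opened_at = None

    def before_call(self):
        if self.opened_at is None:
            return
        if self.now() - self.opened_at < self.cooldown:
            raise CircuitOpenError("upstream marked as down")
        # Cooldown elapsed: allow one trial call (half-open).
        self.opened_at = None
        self.failures = self.max_failures - 1

    def record(self, ok):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.now()


def call_with_retry(fn, breaker, retries=3, base_delay=0.5, sleep=time.sleep):
    for attempt in range(retries + 1):
        breaker.before_call()
        try:
            result = fn()
        except TransientError:
            breaker.record(ok=False)
            if attempt == retries:
                raise
            # Exponential backoff with a little jitter: ~0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        else:
            breaker.record(ok=True)
            return result
```

Once the breaker opens, callers fail fast with `CircuitOpenError` instead of queueing more retries against an API that is already down.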

Alternatives I considered

Before building this, I looked at managed solutions — things like Portkey, Helicone, and even some AI-specific proxies (e.g., InterWest’s offering at https://ai.interwestinfo.com/). For a team with a big budget, those are great. But for a solo dev experiment, rolling my own taught me far more about the failure modes of LLM APIs.

If I were doing this again for production, I’d probably start with an open-source proxy like LiteLLM or AI Gateway and only customize if needed. Building from scratch gave me full control but also a maintenance burden.

What I’d do differently

Instrument everything from day one. I added logging and metrics only after the proxy was running for a week. Without it, I had no idea which prompts were being cached or dropped.
Use a proper message queue. Currently, every OpenAI call runs inline in the request handler, so a slow upstream ties up the request until it finishes. For high throughput, I’d offload OpenAI calls to a background worker (Celery, Redis Queue).
Separate caching from rate-limiting code. It’s all mixed in one handler now, making it hard to test independently.
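
As a starting point for the instrumentation item, even a few counters go a long way. The sketch below is a hypothetical minimal version; the outcome names are invented for the example, and a real deployment would more likely export these to a metrics system.

```python
from collections import Counter

stats = Counter()


def record(outcome: str) -> None:
    """Count one request outcome, e.g. 'cache_hit', 'cache_miss', 'rate_limited'."""
    stats[outcome] += 1


def hit_rate() -> float:
    """Fraction of cache lookups that were served from the cache."""
    looked_up = stats["cache_hit"] + stats["cache_miss"]
    return stats["cache_hit"] / looked_up if looked_up else 0.0


for outcome in ["cache_hit", "cache_miss", "cache_hit", "rate_limited"]:
    record(outcome)
print(f"hit rate: {hit_rate():.0%}")
```

Calling `record` at each branch of the handler would answer the question of which prompts are being cached or dropped from the first day.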

Final thoughts

Building this proxy felt like learning to walk again — lots of stumbles, some face-plants, but eventually a steady stride. The biggest lesson? Don’t treat AI APIs as magical black boxes. They’re just HTTP endpoints with strange billing metrics. With a little middleware, you can make them behave the way your app needs.

What’s your setup looking like? Are you using a managed service or rolling your own? I’d love to hear what’s working — and what’s burning money.

DEV Community