DEV Community

zhongqiyue
How I Finally Tamed My AI API Bills (Without Losing Sanity)

Last month, I got a bill from OpenAI that made me choke on my coffee. $238. For a side project that barely had 50 active users. My first thought was: "I must have a memory leak somewhere." So I dug in.

Turns out, no memory leak. Just a naive integration. Every user action that needed a suggestion was hitting the API fresh. No caching. No deduplication. No batching. Just fire-and-forget HTTP calls. The AI was answering the exact same questions over and over again, and I was paying for each one.

What I Tried First (That Failed)

I started simple: add a rate limiter. One request per user per 10 seconds. That cut the bill by maybe 15%, but users complained about the delay. Then I tried batching prompts manually: collecting multiple requests and sending them together. That worked for a while, but my code got ugly fast. Merging responses back to the right user was a nightmare. And the worst part: I still had tons of duplicate prompts. Different users asking the same question (say, "What's a good vegan dinner for a party?") would each trigger a fresh API call.
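
A minimal sketch of that kind of limiter, assuming a single process and in-memory state. The class name `PerUserRateLimiter` and the injectable `clock` parameter are illustrative, not part of any library:

```python
import time


class PerUserRateLimiter:
    """Allow at most one request per user per `interval` seconds (in-memory sketch)."""

    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.last_seen = {}     # user_id -> time of the last allowed request

    def allow(self, user_id):
        now = self.clock()
        last = self.last_seen.get(user_id)
        if last is not None and now - last < self.interval:
            return False        # too soon: reject (or queue) this request
        self.last_seen[user_id] = now
        return True
```

It caps how often each user can trigger a call, but it does nothing about different users sending the same prompt, which is why the savings were limited.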

I considered switching to a cheaper model. GPT-3.5 Turbo vs GPT-4? The quality drop was noticeable for creative tasks. My users would revolt. So I needed a smarter approach.

The Technique That Saved Me

Here's what I eventually built: a caching middleware layer for AI API calls. The idea is dead simple:

  1. Hash the prompt (plus system message and temperature) to get a unique key.
  2. Check Redis (or any fast key-value store) for that key.
  3. If cache hit, return immediately.
  4. If cache miss, make the API call, store the response, then return.

But the real win came when I added intelligent time‑to‑live (TTL). Not all prompts are equally cacheable. A prompt like "Summarize this article" might be valid for days. A prompt like "What's the weather in Tokyo today?" goes stale within hours. I used a simple heuristic: if the prompt contains time‑sensitive keywords ("today", "this week", "latest"), use a short TTL (1 hour). Otherwise, use a long TTL (7 days).

And for prompts that are identical across users? I added request deduplication with a mutex. When two users hit the same prompt within milliseconds, the first one makes the API call; the second one waits and reuses the result. No double charges.
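
The same idea can be sketched inside a single process, without Redis. This is only an illustration of the pattern (often called "single flight"). The `SingleFlight` class and its `do` method are hypothetical names, and it covers threads in one worker only; deduplicating across several workers needs a shared lock, such as the Redis one in the example further down.

```python
import threading


class SingleFlight:
    """Collapse concurrent calls that share a key into one call (in-process sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry

        if not leader:
            done.wait()                  # someone else is already running fn for this key
            if "error" in holder:
                raise holder["error"]
            return holder["value"]

        try:
            holder["value"] = fn()       # only the leader pays for the call
            return holder["value"]
        except Exception as exc:
            holder["error"] = exc        # followers see the same failure
            raise
        finally:
            with self._lock:
                del self._inflight[key]
            done.set()                   # wake up everyone who was waiting
```

Calling `flight.do(cache_key, call_api)` from several threads at once runs `call_api` once and hands the same result to every caller.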

Code Example (Flask + Redis)

import hashlib
import json
import re
import time

import redis
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = redis.Redis(host='localhost', port=6379, db=0)
# Proxy settings. Not used by `client` above; if the proxy speaks the
# OpenAI API, they can be passed as OpenAI(**PROXY_CONFIG).
PROXY_CONFIG = {
    "base_url": "https://ai.interwestinfo.com",  # example proxy URL
    "api_key": "sk-..."
}

# Whole-word match, so "now" does not fire on "know" or "snow"
TIME_SENSITIVE = re.compile(r"\b(today|now|current|latest|this week)\b", re.IGNORECASE)

def get_cache_key(prompt, system_msg, model, temp):
    # Serialize as a JSON list so a "|" inside a field cannot make two requests collide
    raw = json.dumps([prompt, system_msg, model, temp])
    return "ai_cache:" + hashlib.sha256(raw.encode()).hexdigest()

def is_time_sensitive(prompt):
    return TIME_SENSITIVE.search(prompt) is not None

@app.route('/ask', methods=['POST'])
def ask():
    data = request.json
    prompt = data['prompt']
    system_msg = data.get('system_message', 'You are a helpful assistant.')
    model = data.get('model', 'gpt-3.5-turbo')
    temp = data.get('temperature', 0.7)

    cache_key = get_cache_key(prompt, system_msg, model, temp)

    # Check cache
    cached = cache.get(cache_key)
    if cached:
        return jsonify({"response": json.loads(cached), "source": "cache"})

    # Use a lock for deduplication (simplified here).
    # SET with nx=True and ex=30 is atomic, and the expiry frees the lock
    # if this worker dies before releasing it.
    lock_key = cache_key + ":lock"
    lock_acquired = cache.set(lock_key, "1", nx=True, ex=30)
    if not lock_acquired:
        # Another request is already calling the API: wait, then re-check the cache
        time.sleep(0.5)
        cached = cache.get(cache_key)
        if cached:
            return jsonify({"response": json.loads(cached), "source": "cache"})
        # fallback: still nothing cached, so make the request anyway

    try:
        # Make the API call
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_msg},
                {"role": "user", "content": prompt}
            ],
            temperature=temp
        )
        result = response.choices[0].message.content

        # Determine TTL: 1 hour for time-sensitive prompts, 7 days otherwise
        ttl = 3600 if is_time_sensitive(prompt) else 604800

        # Store in cache
        cache.setex(cache_key, ttl, json.dumps(result))
    finally:
        # Release the lock even if the API call fails, but only if this request took it
        if lock_acquired:
            cache.delete(lock_key)

    return jsonify({"response": result, "source": "api"})

This cut my API costs by 65% in the first week. Latency for cached responses dropped from ~2 seconds to ~10 ms. Users were happy. I was happy.

When This Approach Breaks

Caching is not a silver bullet:

  • If your prompts are almost always unique (e.g., summarizing different user documents), you'll get very few cache hits. In that case, focus on batching or streaming instead.
  • Dynamic responses (like creative writing, where you want slight randomness each time) suffer too: with temperature > 0 the same prompt can produce different outputs, and a cache freezes whichever one came first. I fixed this by caching only when temperature = 0 (as close to deterministic as the API gets) and bypassing the cache for creative tasks. Note that the simplified example above caches at every temperature.
  • Legal/Compliance – If your users' prompts contain PII or sensitive data, caching in Redis might violate privacy. Hashing the prompt into the key keeps the prompt text itself out of Redis, but the cached response is still stored in plain text, so encrypt it or avoid caching altogether for such requests.
  • Stale data – Models get updated, or external knowledge changes. I added a manual cache clear endpoint and also set a max TTL of 14 days, no matter what.
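
Two of these guards (cache only at temperature 0, cap the TTL at 14 days) fit into one small policy function. This is a sketch under assumptions: `cache_ttl` and its `requested_ttl` parameter are hypothetical names, and the keyword list is the short one from the heuristic described earlier.

```python
import re

SHORT_TTL = 3600             # 1 hour, for time-sensitive prompts
LONG_TTL = 7 * 24 * 3600     # 7 days, for everything else
MAX_TTL = 14 * 24 * 3600     # hard cap of 14 days, no matter what

TIME_WORDS = re.compile(r"\b(today|this week|latest)\b", re.IGNORECASE)


def cache_ttl(prompt, temperature, requested_ttl=None):
    """Return the TTL in seconds, or None if the request should bypass the cache."""
    if temperature != 0:
        return None          # randomized/creative request: always call the API
    ttl = requested_ttl
    if ttl is None:
        ttl = SHORT_TTL if TIME_WORDS.search(prompt) else LONG_TTL
    return min(ttl, MAX_TTL)
```

A handler would call `cache_ttl(prompt, temp)` before touching Redis and skip both the lookup and the write when it returns `None`.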

Alternatives Worth Considering

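If you can't cache, another approach I experimented with was prompt batching. Instead of one API call per request, you collect multiple prompts over a short window (say 200ms) and send them together. One caveat: the /v1/chat/completions endpoint doesn't accept a batch of independent prompts, because its messages array is a single conversation. You either pack several questions into one prompt and split the answer yourself, or use OpenAI's separate Batch API, which is asynchronous and priced at a discount, so it suits background jobs rather than live requests. Either way, you need to map responses back to the original requests. It’s a bit trickier but can slash costs if you have high traffic.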
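
The collecting-and-mapping half of this can be sketched with asyncio. `MicroBatcher` is a hypothetical class, and `send_batch` is an assumed callable that takes a list of prompts and returns one result per prompt in the same order; how it actually talks to the provider is deliberately left open here.

```python
import asyncio


class MicroBatcher:
    """Collect prompts for `window` seconds, then hand them to `send_batch` in one go."""

    def __init__(self, send_batch, window=0.2):
        self.send_batch = send_batch   # assumed: async fn(list[str]) -> list[str]
        self.window = window
        self.pending = []              # (prompt, future) pairs waiting for the next flush
        self.timer_running = False
        self.tasks = set()             # strong references to in-flight flush tasks

    async def submit(self, prompt):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, future))
        if not self.timer_running:     # first prompt in a window starts the timer
            self.timer_running = True
            task = asyncio.create_task(self._flush_after_window())
            self.tasks.add(task)
            task.add_done_callback(self.tasks.discard)
        return await future

    async def _flush_after_window(self):
        await asyncio.sleep(self.window)
        batch, self.pending = self.pending, []
        self.timer_running = False
        try:
            results = await self.send_batch([prompt for prompt, _ in batch])
            if len(results) != len(batch):
                raise ValueError("send_batch must return one result per prompt")
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)   # position i answers prompt i
        except Exception as exc:
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)   # every caller in the batch sees the failure
```

Each request handler just does `answer = await batcher.submit(prompt)`; the future attached to each prompt is what routes the right response back to the right caller.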

And then there are dedicated AI proxy services (like the one I linked in the code). They handle caching, batching, rate limiting, and even fallback between models out of the box. I eventually moved to one because managing Redis + locks + TTL heuristics was becoming a second job. But for a small project, building your own is a great learning experience.

What I'd Do Differently Next Time

  • Profile first – I should have logged prompt uniqueness rates before building a caching strategy. I assumed many duplicates, but actual data would have guided me.
  • Use a dedicated proxy from day one – If I were doing this again for a production app, I'd start with a service that handles these concerns. The time I spent building custom caching could have been spent on product features.
  • Add observability – Without metrics on cache hit ratio and cost savings, I was flying blind. Now I log cache stats to a dashboard.
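
The first and last points both come down to counting. A minimal in-memory sketch, assuming each request is identified by its cache key (the `CacheStats` class is a hypothetical helper, not the dashboard code):

```python
from collections import Counter


class CacheStats:
    """Track cache hit ratio and prompt uniqueness (in-memory sketch)."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.prompt_counts = Counter()   # cache_key -> number of requests seen

    def record(self, cache_key, hit):
        self.prompt_counts[cache_key] += 1
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def uniqueness_rate(self):
        """Distinct prompts divided by total requests; 1.0 means nothing repeats."""
        total = sum(self.prompt_counts.values())
        return len(self.prompt_counts) / total if total else 0.0
```

Logging `uniqueness_rate()` for a week before writing any caching code would show whether duplicates are common enough to be worth caching at all.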

The Real Lesson

AI APIs are powerful, but they’re also expensive if you treat them like a database. Every request should be justified. Ask yourself: Do I really need a fresh answer for every user? Often, you don't.

I went from $238/month to ~$83. My app feels snappier because cached responses are instant. And I sleep better knowing I’m not burning money on repeated work.

What's your approach to managing AI costs? I'd love to hear what tricks you've discovered.
