Last month, I got a bill from OpenAI that made me choke on my coffee. $238. For a side project that barely had 50 active users. My first thought was: "I must have a memory leak somewhere." So I dug in.
Turns out, no memory leak. Just a naive integration. Every user action that needed a suggestion was hitting the API fresh. No caching. No deduplication. No batching. Just fire-and-forget HTTP calls. The AI was answering the exact same questions over and over again, and I was paying for each one.
What I Tried First (That Failed)
I started simple: add a rate limiter. One request per user per 10 seconds. That cut the bill by maybe 15%, but users complained about the delay. Then I tried batching prompts manually: collecting multiple requests and sending them together. That worked for a while, but my code got ugly fast. Merging responses back to the right user was a nightmare. And the worst part: I still had tons of duplicate prompts. Different users asking the same question ("What's a good vegan dinner for a party?") would each trigger a fresh API call.
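For reference, a per-user rate limiter of that kind can be sketched in a few lines. This is a minimal in-memory version that assumes a single process; the names `UserRateLimiter` and `allow` are made up for illustration, and the clock is injectable only so the sketch can be tested.

```python
import time


class UserRateLimiter:
    """Allow at most one request per user per `interval` seconds (in-memory sketch)."""

    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock      # injectable so the example can be tested
        self.last_seen = {}     # user_id -> time of the last allowed request

    def allow(self, user_id):
        now = self.clock()
        last = self.last_seen.get(user_id)
        if last is not None and now - last < self.interval:
            return False        # too soon: the caller should reject or delay
        self.last_seen[user_id] = now
        return True
```

Note what it does not do: it throttles each user separately, so two different users asking the same question still cost two API calls.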
I considered switching to a cheaper model. GPT-3.5 Turbo vs GPT-4? The quality drop was noticeable for creative tasks. My users would revolt. So I needed a smarter approach.
The Technique That Saved Me
Here's what I eventually built: a caching middleware layer for AI API calls. The idea is dead simple:
- Hash the prompt (plus system message, model, and temperature) to get a unique key.
- Check Redis (or any fast key-value store) for that key.
- If cache hit, return immediately.
- If cache miss, make the API call, store the response, then return.
But the real win came when I added intelligent time-to-live (TTL). Not all prompts are equally cacheable. A prompt like "Summarize this article" might be valid for days. A prompt like "What's the weather in Tokyo today?" goes stale within hours. I used a simple heuristic: if the prompt contains time-sensitive keywords ("today", "this week", "latest"), use a short TTL (1 hour). Otherwise, use a long TTL (7 days).
And for prompts that are identical across users? I added request deduplication with a mutex. When two users hit the same prompt within milliseconds, the first one makes the API call; the second one waits and reuses the result. No double charges.
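To make the deduplication idea concrete, here is a minimal in-process sketch of it, sometimes called "single flight". It assumes all requests run as threads in one process, so it is not a substitute for a Redis lock shared across workers; `SingleFlight`, `_Call`, and `do` are names invented for this example.

```python
import threading


class _Call:
    """Shared state for one in-flight request."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapse concurrent calls with the same key into a single call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}    # key -> _Call shared by the leader and its followers

    def do(self, key, fn):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            try:
                call.result = fn()          # only the leader pays for the call
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()             # wake up everyone who was waiting
        else:
            call.done.wait()                # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

A handler would wrap the expensive part as `flight.do(cache_key, lambda: call_api(prompt))`, where `call_api` stands in for whatever function makes the real request.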
Code Example (Flask + Redis)
import hashlib
import json
import re
import time

import redis
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = redis.Redis(host='localhost', port=6379, db=0)

# Optional: pass these to OpenAI(**PROXY_CONFIG) to route calls through a proxy
PROXY_CONFIG = {
    "base_url": "https://ai.interwestinfo.com",  # example proxy URL
    "api_key": "sk-..."
}

# Whole-word match, so "know" or "snow" does not count as "now"
TIME_SENSITIVE = re.compile(r"\b(today|now|current|latest|this week)\b", re.IGNORECASE)


def get_cache_key(prompt, system_msg, model, temp):
    # JSON-encode the parts so a "|" inside a prompt cannot cause key collisions
    raw = json.dumps([prompt, system_msg, model, temp])
    return "ai_cache:" + hashlib.sha256(raw.encode()).hexdigest()


def is_time_sensitive(prompt):
    return TIME_SENSITIVE.search(prompt) is not None


@app.route('/ask', methods=['POST'])
def ask():
    data = request.json
    prompt = data['prompt']
    system_msg = data.get('system_message', 'You are a helpful assistant.')
    model = data.get('model', 'gpt-3.5-turbo')
    temp = data.get('temperature', 0.7)
    cache_key = get_cache_key(prompt, system_msg, model, temp)

    # Check cache
    cached = cache.get(cache_key)
    if cached:
        return jsonify({"response": json.loads(cached), "source": "cache"})

    # Use a lock for deduplication (simplified here).
    # SET with nx=True is an atomic SETNX; ex=30 expires the lock if this worker dies.
    lock_key = cache_key + ":lock"
    lock_acquired = cache.set(lock_key, "1", nx=True, ex=30)
    if not lock_acquired:
        # Another request is already fetching this prompt: wait briefly, then re-check
        time.sleep(0.5)
        cached = cache.get(cache_key)
        if cached:
            return jsonify({"response": json.loads(cached), "source": "cache"})
        # Fallback: still nothing cached, so make the request anyway

    try:
        # Make the API call
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_msg},
                {"role": "user", "content": prompt}
            ],
            temperature=temp
        )
        result = response.choices[0].message.content

        # Determine TTL
        if is_time_sensitive(prompt):
            ttl = 3600  # 1 hour
        else:
            ttl = 604800  # 7 days

        # Store in cache
        cache.setex(cache_key, ttl, json.dumps(result))
    finally:
        # Release the lock even if the API call failed, but only if this request holds it
        if lock_acquired:
            cache.delete(lock_key)

    return jsonify({"response": result, "source": "api"})
This cut my API costs by 65% in the first week. Latency for cached responses dropped from ~2 seconds to ~10 ms. Users were happy. I was happy.
When This Approach Breaks
Caching is not a silver bullet:
- If your prompts are almost always unique (e.g., summarizing different user documents), you'll get very few cache hits. In that case, focus on batching or streaming instead.
- Dynamic responses (like creative writing, where you want slight randomness each time) suffer, because with temperature > 0 the same prompt is supposed to produce different outputs and a cache returns the same one every time. I fixed this by caching only for temperature = 0 (as close to deterministic as the API gets) and bypassing the cache for creative tasks.
- Legal/Compliance – If your users' prompts contain PII or sensitive data, caching in Redis might violate privacy. Hashing the prompt only protects the cache key; the response is still stored in plain text, so encrypt cached values or avoid caching altogether for such requests.
- Stale data – Models get updated, or external knowledge changes. I added a manual cache clear endpoint and also set a max TTL of 14 days, no matter what.
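Two of the rules above (no caching when temperature > 0, and a hard 14-day ceiling) fit into one small policy function. This is a sketch under assumptions: `cache_ttl` and its `requested_ttl` parameter are hypothetical names, and the keyword list mirrors the heuristic described earlier.

```python
import re

SHORT_TTL = 60 * 60             # 1 hour for time-sensitive prompts
LONG_TTL = 7 * 24 * 60 * 60     # 7 days otherwise
MAX_TTL = 14 * 24 * 60 * 60     # hard ceiling: 14 days, no matter what

TIME_WORDS = re.compile(r"\b(today|now|current|latest|this week)\b", re.IGNORECASE)


def cache_ttl(prompt, temperature, requested_ttl=None):
    """Return a TTL in seconds, or None when the request should bypass the cache."""
    if temperature > 0:
        return None             # sampled/creative output: never cache
    if requested_ttl is not None:
        ttl = requested_ttl     # hypothetical per-request override
    elif TIME_WORDS.search(prompt):
        ttl = SHORT_TTL
    else:
        ttl = LONG_TTL
    return min(ttl, MAX_TTL)    # the ceiling applies even to overrides
```

A handler would skip both the cache lookup and the cache write whenever this returns `None`.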
Alternatives Worth Considering
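If you can't cache, another approach I experimented with was prompt batching. Instead of one API call per request, you collect multiple prompts over a short window (say 200ms) and send them as a single batch. One caveat: the messages array on the /v1/chat/completions endpoint is a single conversation, not a list of independent prompts, so for interactive traffic you have to pack several questions into one request and split the answer yourself. (OpenAI also has a separate Batch API, but it is asynchronous and aimed at offline jobs.) Either way, you need to map responses back to the original requests. It’s a bit trickier but can slash costs if you have high traffic.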
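The collect-and-flush part of prompt batching might look roughly like this. It is only a sketch: `MicroBatcher` is an invented name, and `send_batch` is a hypothetical function that takes a list of prompts and returns the answers in the same order. How it packs them into an actual API request is deliberately left out.

```python
import threading
from concurrent.futures import Future


class MicroBatcher:
    """Collect prompts for `window` seconds, then send them in one call."""

    def __init__(self, send_batch, window=0.2):
        self.send_batch = send_batch    # hypothetical: list[str] -> list[str], same order
        self.window = window
        self._lock = threading.Lock()
        self._pending = []              # (prompt, Future) pairs awaiting the next flush

    def submit(self, prompt):
        future = Future()
        with self._lock:
            self._pending.append((prompt, future))
            if len(self._pending) == 1:
                # The first prompt of a new batch starts the flush timer
                timer = threading.Timer(self.window, self._flush)
                timer.daemon = True
                timer.start()
        return future

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return
        try:
            answers = self.send_batch([prompt for prompt, _ in batch])
            if len(answers) != len(batch):
                raise ValueError("send_batch returned the wrong number of answers")
        except Exception as exc:
            for _, future in batch:
                future.set_exception(exc)   # every waiting caller sees the failure
            return
        # Answer i belongs to prompt i, which is how responses reach the right caller
        for (_, future), answer in zip(batch, answers):
            future.set_result(answer)
```

Each request handler calls `batcher.submit(prompt).result(timeout=...)` and gets back only its own answer; the price is up to one window of extra latency per request.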
And then there are dedicated AI proxy services (like the one I linked in the code). They handle caching, batching, rate limiting, and even fallback between models out of the box. I eventually moved to one because managing Redis + locks + TTL heuristics was becoming a second job. But for a small project, building your own is a great learning experience.
What I'd Do Differently Next Time
- Profile first – I should have logged prompt uniqueness rates before building a caching strategy. I assumed many duplicates, but actual data would have guided me.
- Use a dedicated proxy from day one – If I were doing this again for a production app, I'd start with a service that handles these concerns. The time I spent building custom caching could have been spent on product features.
- Add observability – Without metrics on cache hit ratio and cost savings, I was flying blind. Now I log cache stats to a dashboard.
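Both the profiling and the observability points come down to a few counters. Here is a minimal sketch, assuming prompts are logged as plain strings and that a flat average cost per API call is a good enough estimate; `duplicate_rate`, `CacheStats`, and `cost_per_call` are illustrative names.

```python
from collections import Counter


def duplicate_rate(prompts):
    """Fraction of logged prompts that exactly repeat an earlier one (0.0 to 1.0)."""
    if not prompts:
        return 0.0
    unique = len(Counter(prompts))
    return 1 - unique / len(prompts)


class CacheStats:
    """Running cache hit/miss counters with a rough savings estimate."""

    def __init__(self, cost_per_call):
        self.cost_per_call = cost_per_call  # assumed average cost of one API call
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def estimated_savings(self):
        # Every hit is one API call that did not have to be paid for
        return self.hits * self.cost_per_call
```

Running `duplicate_rate` over a day of logged prompts before writing any caching code gives an upper bound on the hit ratio an exact-match cache can reach.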
The Real Lesson
AI APIs are powerful, but they’re also expensive if you treat them like a database. Every request should be justified. Ask yourself: Do I really need a fresh answer for every user? Often, you don't.
I went from $238/month to ~$83. My app feels snappier because cached responses are instant. And I sleep better knowing I’m not burning money on repeated work.
What's your approach to managing AI costs? I'd love to hear what tricks you've discovered.