Two months ago, I was staring at my OpenAI bill and feeling that familiar pit in my stomach. Our startup's customer support chatbot was working great—until it wasn't. The responses were good, but the cost per conversation had ballooned to nearly $0.12, and our monthly spend was on track to hit five figures. Something had to give.
This is the story of how I optimized our AI pipeline, what I tried that failed, and the approach that finally worked. Spoiler: it wasn't about switching models or sacrificing quality. It was about being smart about when and how we called the API.
The Real Problem
We were building a tool that automatically drafts personalized email responses for support agents. The idea was simple: given the customer's email and some context, the AI writes a first draft. The agent reviews and sends it. Simple, right?
The first version used raw OpenAI calls with a long system prompt stuffed with company guidelines. Every email resulted in a fresh API call—no caching, no deduplication. And because we wanted the responses to be consistent, we kept adding more instructions to the prompt until it was a 3,000-token monster.
Outputs were decent, but slow. Latency averaged 4 seconds per response, and the cost? Let's just say I could hear my CTO's teeth grinding during our budget review.
What I Tried That Didn't Work
1. Switching to a cheaper model
First instinct: swap gpt-4 for gpt-3.5-turbo. The latency dropped, but the quality fell off a cliff. The responses became templated and robotic. Customers noticed, and our support team started ignoring the drafts.
2. Running a local model
I spun up LLaMA 2 on a GPU instance. Fine-tuning it on our email dataset was a nightmare. The output was barely coherent, and managing the infrastructure (updates, scaling, GPU costs) ate up my weekends. Not viable.
3. Aggressive prompt caching
I implemented a simple dictionary cache: same input → same output. Problem was, most email queries were unique. Cache hit rate was under 5%. Useless.
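To make the failure mode concrete, here is a minimal sketch of an exact-match cache. The class name `ExactMatchCache`, the hit/miss counters, and the whitespace and case normalization are my assumptions for illustration, not the original implementation. Even with normalization, a paraphrase of the same complaint misses:

```python
import hashlib


class ExactMatchCache:
    """Dictionary cache keyed on a hash of the normalized prompt text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and case so trivial differences still match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        response = self._store.get(self._key(prompt))
        if response is None:
            self.misses += 1
        else:
            self.hits += 1
        return response

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response


cache = ExactMatchCache()
cache.put("My order is late", "Sorry about the delay...")
same_text = cache.get("my  order is LATE")           # hit: only case and spacing differ
paraphrase = cache.get("My package hasn't arrived")  # miss: same intent, new wording
```

Two customers almost never type the same sentence, which is why a cache like this rarely fires on free-form email.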
What Eventually Worked: A Three-Layer Approach
Instead of treating every API call as a one-off, I built a small abstraction layer that does three things:
- Similarity-based caching – Before hitting the API, we check if we've seen a semantically similar request before.
- Prompt template manager – Instead of one monolithic system prompt, we use modular templates that are pre-computed and cached.
- Adaptive token control – We dynamically set max_tokens based on the complexity of the response needed.
Here's the core of the solution in Python (using embedding-based caching):
import hashlib

import numpy as np
import openai  # the ChatCompletion call below is the legacy interface (openai<1.0)
from sentence_transformers import SentenceTransformer


class SmartAICache:
    def __init__(self, model_name='all-MiniLM-L6-v2', threshold=0.92):
        self.embedder = SentenceTransformer(model_name)
        self.threshold = threshold
        self.cache = {}  # prompt_hash -> (embedding, response)

    def get_embedding(self, text):
        # Normalized embeddings let a plain dot product act as cosine similarity
        return self.embedder.encode(text, normalize_embeddings=True)

    def find_similar(self, prompt):
        prompt_embedding = self.get_embedding(prompt)
        best_sim = 0.0
        best_response = None
        # Linear scan over every cached entry, keeping the closest match
        for cached_emb, response in self.cache.values():
            sim = np.dot(prompt_embedding, cached_emb)
            if sim > best_sim:
                best_sim = sim
                best_response = response
        if best_sim >= self.threshold:
            return best_response
        return None

    def call_api(self, prompt, **kwargs):
        cached = self.find_similar(prompt)
        if cached is not None:  # an empty-string response is still a valid hit
            return cached
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        ).choices[0].message.content
        prompt_hash = hashlib.md5(prompt.encode()).hexdigest()  # not used for lookup, just storage
        embedding = self.get_embedding(prompt)
        self.cache[prompt_hash] = (embedding, response)
        return response
This gave us a cache hit rate of around 30–40% because many customer queries share the same underlying intent (e.g., "my order is late" → similar responses).
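The lookup above compares the query against cached entries one at a time. As the cache grows, one option is to keep the embeddings in a single NumPy matrix and score them all with one matrix-vector product. The sketch below is an illustration under my own assumptions: `VectorCache` and `toy_embed` are hypothetical names, and `toy_embed` is a keyword-count stand-in so the example runs without downloading a model. In practice you would pass in the real embedder.

```python
import numpy as np


class VectorCache:
    """Similarity cache that stores embeddings as rows of one matrix."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # callable: str -> 1-D unit-length np.ndarray
        self.threshold = threshold
        self.embeddings = None    # shape (n_cached, dim) once populated
        self.responses = []

    def lookup(self, prompt):
        if self.embeddings is None:
            return None
        query = self.embed(prompt)
        # For unit vectors, the dot product equals cosine similarity.
        sims = self.embeddings @ query
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def store(self, prompt, response):
        row = self.embed(prompt)[np.newaxis, :]
        if self.embeddings is None:
            self.embeddings = row
        else:
            self.embeddings = np.vstack([self.embeddings, row])
        self.responses.append(response)


def toy_embed(text):
    """Hypothetical stand-in for a real embedder: normalized keyword counts."""
    vocab = ["order", "late", "refund", "password"]
    words = text.lower().split()
    counts = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts


cache = VectorCache(toy_embed)
cache.store("my order is late", "Sorry for the delay, here is your tracking link.")
hit = cache.lookup("order still late")     # same keywords, so similarity clears the threshold
miss = cache.lookup("reset my password")   # unrelated intent, so no cached answer
```

The threshold is the knob that matters: set it too low and customers get answers to questions they did not ask, set it too high and the cache behaves like exact matching again.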
Prompt Template Manager
I stopped shoving everything into one prompt. Instead, I broke it into reusable pieces:
class PromptTemplate:
    templates = {
        "greeting": "Write a friendly greeting from our support team.",
        "apology": "Apologize for the inconvenience and acknowledge the issue.",
        "solution_order": "Provide steps to resolve the order issue: {details}",
        "closing": "End with a polite closing and next steps."
    }

    @classmethod
    def compose(cls, sections, **fields):
        # Fill placeholders such as {details}; templates without any pass through unchanged
        return "\n\n".join(cls.templates[s].format(**fields) for s in sections)
This let me cache partial templates. The greeting template never changes, so it's only sent to the API once.
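For a sense of how the pieces fit together, here is a small usage sketch. It is a standalone variant, not the class above: `compose_prompt` and the `fields` keyword arguments are hypothetical names, and the check for unknown sections is my addition. The point is that the `{details}` placeholder has to be filled before the prompt goes out.

```python
TEMPLATES = {
    "greeting": "Write a friendly greeting from our support team.",
    "apology": "Apologize for the inconvenience and acknowledge the issue.",
    "solution_order": "Provide steps to resolve the order issue: {details}",
    "closing": "End with a polite closing and next steps.",
}


def compose_prompt(sections, **fields):
    """Join the named templates, filling placeholders such as {details}."""
    unknown = [name for name in sections if name not in TEMPLATES]
    if unknown:
        raise KeyError(f"Unknown template section(s): {unknown}")
    # str.format raises KeyError if a placeholder has no matching field.
    return "\n\n".join(TEMPLATES[name].format(**fields) for name in sections)


prompt = compose_prompt(
    ["apology", "solution_order", "closing"],
    details="package marked delivered but not received",
)
print(prompt)
```

Failing loudly on a missing field or a misspelled section name is deliberate here, since a prompt that still contains a literal `{details}` tends to produce confusing drafts.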
Adaptive Token Control
We analyzed past responses and found that simple issues needed fewer tokens. We added a simple rule-based classifier that estimates response length from the length of the customer's email and a few tone keywords:
def estimate_max_tokens(customer_email: str) -> int:
    words = len(customer_email.split())
    # Simple logic: long angry emails need more explanation
    if words > 100:
        return 400
    elif "urgent" in customer_email.lower() or "frustrated" in customer_email.lower():
        return 300
    else:
        return 200
This cut token usage by an average of 35%.
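Here is a rough sketch of how that estimate could be wired into the request. `build_request` and `was_truncated` are hypothetical helpers I made up for illustration, and the heuristic is repeated so the snippet runs on its own. One thing to keep in mind: `max_tokens` is a hard cap, not a hint to be concise, so a reply that hits it comes back cut off with a `finish_reason` of `"length"`.

```python
def estimate_max_tokens(customer_email: str) -> int:
    """Same heuristic as above, repeated so this sketch is self-contained."""
    text = customer_email.lower()
    if len(text.split()) > 100:
        return 400
    if "urgent" in text or "frustrated" in text:
        return 300
    return 200


def build_request(customer_email: str, system_prompt: str) -> dict:
    """Assemble chat-completion parameters with an adaptive output cap."""
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": customer_email},
        ],
        "max_tokens": estimate_max_tokens(customer_email),
    }


def was_truncated(finish_reason: str) -> bool:
    # The API reports finish_reason == "length" when the cap cut the reply short.
    return finish_reason == "length"


request = build_request(
    "This is urgent, my order never arrived!",
    "You draft support emails.",
)
```

Checking for truncation gives you a signal for when the cap is too tight, for example to retry once with a higher limit or to flag the draft for the agent.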
Results
After deploying this three-layer approach:
- Cost: Dropped from ~$0.12 to ~$0.035 per response (70% reduction)
- Latency: From 4s to 1.2s (cache hits were instant)
- Quality: Slightly improved because we had more consistent templates
Admittedly, the similarity cache uses a small embedding model (50MB download) and adds ~50ms per request. Totally worth it.
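If you want to sanity-check the percentages, the arithmetic is short. The per-response figures come from the numbers above; the monthly volume of 100,000 responses is a hypothetical value I picked only to show the scale, not our actual traffic.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


cost_drop = percent_reduction(0.12, 0.035)   # cost per response, USD
latency_drop = percent_reduction(4.0, 1.2)   # average seconds per response

# Hypothetical volume, chosen only to illustrate the scale of the savings.
responses_per_month = 100_000
monthly_savings = (0.12 - 0.035) * responses_per_month

print(f"Cost reduction: {cost_drop:.1f}%")
print(f"Latency reduction: {latency_drop:.1f}%")
print(f"Monthly savings at {responses_per_month:,} responses: ${monthly_savings:,.2f}")
```

The cost figure works out to just under 71%, which is where the rounded 70% comes from.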
Trade-offs and When NOT to Use This
- Similarity caching works poorly in creative or highly varied use cases. For example, when generating poetry or code, nearly every request is unique, so the cache hit rate will be near zero.
- The embedding model adds complexity – if you're already paying for OpenAI embeddings, you could use those instead, but that adds latency and cost.
- Prompt templates require maintenance – as your business rules change, you need to update templates. We had a versioning issue in the first week.
- This approach is overkill for low-volume APIs – if you make <100 calls a day, just use the raw API.
What I'd Do Differently Next Time
If I were starting fresh, I'd first look for an existing managed service that does this caching and prompt management out of the box. There are several now—for example, I recently discovered a service that offers exactly this kind of smart caching and template management as a drop-in proxy. I'd probably start with that and only roll my own if I needed the customization.
But honestly, building it myself taught me a ton about prompt engineering, embeddings, and cost optimization. The code above is production-ready for small to medium loads.
Let's Talk
What's your setup for managing AI API costs? Have you found a clever caching trick or are you still making raw calls and praying the bill stays low? I'd love to hear what works (or didn't) for you.