A few months ago, I was building a feature to automatically categorize thousands of customer support tickets. The obvious approach? Use an LLM to read each ticket and output a label. But when I started sending requests one by one to the API, the costs climbed fast and the latency made my dev server feel like a dial-up connection.
I tried everything: compressing prompts, using smaller models, even switching to a regex-based solution (which worked for about 60% of cases, but the edge cases were a nightmare). Nothing felt like a clean, scalable solution.
Then I realized the problem wasn't the LLM itself — it was how I was using it. I was treating each API call like a standalone transaction, when most of the tickets shared similar themes and wording. That's when I implemented batching and caching, and it changed everything.
The Setup: My Initial Approach
I was using a generic OpenAI-like API endpoint (I'll keep the real service abstract here). My first attempt looked like this:
import requests

def classify_ticket(text):
    response = requests.post(
        "https://api.llm-service.com/v1/chat/completions",
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the ticket as 'bug', 'feature', or 'other'."},
                {"role": "user", "content": text}
            ]
        },
        headers={"Authorization": "Bearer YOUR_KEY"}
    )
    return response.json()["choices"][0]["message"]["content"]

# Running 1000 tickets
for ticket in tickets:
    label = classify_ticket(ticket)
    # store label...
This worked, but it was painfully slow (each call took ~2 seconds) and expensive. After processing just 500 tickets, I had burned through almost $10 in API credits. For a side project, that was unsustainable.
What Didn't Work (My Dead Ends)
Prompt compression – I trimmed whitespace and removed stopwords, but the API bills per input token, so the savings were small: cutting a 200-token ticket to 180 tokens saves only 10%.
Smaller models – Switching from GPT-4 to GPT-3.5-turbo helped costs but hurt accuracy. For some tickets, the model just couldn't understand the nuance.
Local models – I tried running a quantized LLaMA on my laptop. It was free, but inference took 10+ seconds per ticket. Not usable.
Regex + keyword matching – Great for obvious cases, but about 40% of tickets fell through the cracks. I'd still need an LLM for those.
What Actually Worked: Batching + Caching
Batching – Instead of sending one ticket at a time, I grouped them in chunks of 10. Each API call processed 10 tickets at once, dramatically cutting overhead.
Caching – Many tickets were very similar (e.g., "Can't login" repeated 50 times). By caching the label for a normalized version of the text, I avoided redundant calls.
Here's the refined code:
import requests
from functools import lru_cache
from typing import List

def _call_llm(texts: List[str]) -> List[str]:
    # Build one prompt that lists every ticket in the batch, numbered in order
    instructions = (
        "Classify each ticket as 'bug', 'feature', or 'other'. "
        "Return the labels in order, one per line.\n\nTickets:\n"
    )
    batch_prompt = instructions + "\n".join(f"{i+1}. {t}" for i, t in enumerate(texts))
    response = requests.post(
        "https://api.llm-service.com/v1/chat/completions",
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a precise classifier."},
                {"role": "user", "content": batch_prompt}
            ],
            "temperature": 0.1,  # low temperature keeps the labels consistent
            "max_tokens": 100
        },
        headers={"Authorization": "Bearer YOUR_KEY"}
    )
    output = response.json()["choices"][0]["message"]["content"]
    # Parse the lines, assuming the model returns exactly len(texts) lines
    lines = output.strip().split('\n')
    return [line.split('. ')[-1] if '. ' in line else line for line in lines]

# lru_cache keys on the function argument, so the text has to be normalized
# before it reaches the cached function; no separate hash key is needed
@lru_cache(maxsize=2048)
def _classify_normalized(normalized: str) -> str:
    # Single-ticket fallback: a "batch" of one
    return _call_llm([normalized])[0]

def cached_classify(text: str) -> str:
    # Normalize text to increase cache hits (lowercase, remove trailing punctuation)
    normalized = text.lower().strip().rstrip('.!?')
    return _classify_normalized(normalized)

def batch_classify(tickets: List[str], batch_size: int = 10) -> List[str]:
    results = []
    # lru_cache can't be asked "is this cached?" without triggering a call,
    # so this version cannot split tickets into cache hits and misses.
    # It only batches; combining both needs a manual cache dict.
    for i in range(0, len(tickets), batch_size):
        results.extend(_call_llm(tickets[i:i + batch_size]))
    return results
Let's simplify. Here's the actual working code I used:
import requests
from typing import Dict, List, Optional

class LLMBatcher:
    def __init__(self, api_key: str, base_url: str = "https://api.llm-service.com"):
        self.api_key = api_key
        self.base_url = base_url
        self.cache: Dict[str, str] = {}  # normalized text -> label

    def _normalize(self, text: str) -> str:
        return text.lower().strip().rstrip('.!?')

    def classify_batch(self, tickets: List[str], batch_size: int = 10) -> List[Optional[str]]:
        # Pre-normalize and check cache for each ticket
        results: List[Optional[str]] = [None] * len(tickets)
        batch_indices = []
        batch_texts = []
        for idx, ticket in enumerate(tickets):
            norm = self._normalize(ticket)
            if norm in self.cache:
                results[idx] = self.cache[norm]
            else:
                batch_indices.append(idx)
                batch_texts.append(ticket)

        # Process uncached tickets in batches
        for i in range(0, len(batch_texts), batch_size):
            sub_texts = batch_texts[i:i+batch_size]
            sub_indices = batch_indices[i:i+batch_size]
            # Build batch prompt
            lines = [f"{j+1}. {t}" for j, t in enumerate(sub_texts)]
            prompt = "Classify each ticket as 'bug', 'feature', or 'other'. Return labels in order, one per line.\n\n" + "\n".join(lines)
            response = requests.post(
                f"{self.base_url}/v1/chat/completions",
                json={
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": "You are a precise classifier."},
                        {"role": "user", "content": prompt}
                    ],
                    "temperature": 0.1,
                    "max_tokens": 100
                },
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=30
            )
            response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
            output = response.json()["choices"][0]["message"]["content"]
            # Drop blank lines and strip a leading "1. " style prefix from each label
            labels = [line.split('. ')[-1].strip() for line in output.strip().split('\n') if line.strip()]
            # zip stops at the shorter list: extra lines from the model are ignored,
            # and tickets the model skipped keep a result of None
            for idx, text, label in zip(sub_indices, sub_texts, labels):
                self.cache[self._normalize(text)] = label
                results[idx] = label
        return results

# Usage
batcher = LLMBatcher(api_key="YOUR_KEY")
tickets = ["Login failed again", "Feature request: dark mode", "Crash when uploading file"]
labels = batcher.classify_batch(tickets)
print(labels)  # e.g. ['bug', 'feature', 'bug']
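One thing to watch in the class above: the cache is only consulted before any batch is sent, so duplicates that arrive in the same call are all sent to the API. A minimal sketch of deduplicating within a call follows. It is an illustration, not the code used for the results below; `classify_unique` and the `classify_many` callback are hypothetical names, and the callback stands in for whatever function actually calls the API.

```python
from typing import Callable, Dict, List

def normalize(text: str) -> str:
    # Same normalization rule as LLMBatcher._normalize
    return text.lower().strip().rstrip('.!?')

def classify_unique(tickets: List[str],
                    classify_many: Callable[[List[str]], List[str]]) -> List[str]:
    """Classify each distinct normalized ticket once, then copy labels to duplicates."""
    positions: Dict[str, List[int]] = {}   # normalized text -> indices in `tickets`
    representatives: List[str] = []        # first raw text seen for each normalized key
    for idx, ticket in enumerate(tickets):
        norm = normalize(ticket)
        if norm not in positions:
            positions[norm] = []
            representatives.append(ticket)
        positions[norm].append(idx)

    # Hypothetical callback: takes texts, returns one label per text, in order
    labels = classify_many(representatives)
    if len(labels) != len(representatives):
        raise ValueError("classifier returned a different number of labels")

    results: List[str] = [""] * len(tickets)
    for rep, label in zip(representatives, labels):
        for idx in positions[normalize(rep)]:
            results[idx] = label
    return results
```

With this shape, fifty copies of "Can't login" cost one slot in one batch, even on a cold cache.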
The Results
- Cost: Reduced by ~70%, because I sent far fewer API calls (100 batch calls instead of 1000 single calls, so the instructions are paid for once per 10 tickets rather than once per ticket) and caching eliminated about 20% of duplicates.
- Latency: Overall time dropped from ~33 minutes (1000 calls × ~2 seconds) to under 5 minutes for 1000 tickets. Each batch of 10 took about 5 seconds including network time, so 100 batches would take 5 × 100 = 500 seconds ≈ 8 minutes with no cache hits; caching reduced the number of batches actually sent and brought the total down further.
- Accuracy: Actually improved a bit because the batch prompt gave the model more context and examples (even though it wasn't few-shot, the model seemed to be more consistent when seeing multiple tickets at once).
Trade-offs and Gotchas
- Larger batches increase risk of token overflow: If your batch prompt exceeds the model's context window, the LLM might truncate or fail. I found 10 tickets per batch worked well for short tickets. For longer documents, you'd need smaller batches.
- Cache invalidation: If your classification rules change, you need to clear the cache. I used a versioned cache key (e.g., include the system prompt hash) to handle this.
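- Rate limits: Each batch counts as one API request but carries ten tickets' worth of tokens, so you can still hit rate limits (particularly token-based ones, where the provider has them) if you send too many batches too quickly. I added a small sleep between batches.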
- Fallback: If the batch call fails (network error, 500), you lose all 10 responses. I implemented a retry mechanism that splits the batch in half on failure.
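The split-on-failure fallback from the last bullet can be sketched roughly as below. This is a hedged illustration, not the original implementation: `classify_with_split`, `call_batch`, and `pause_seconds` are hypothetical names, and `call_batch` is assumed to take a list of texts and return one label per text or raise on failure.

```python
import time
from typing import Callable, List, Optional

def classify_with_split(texts: List[str],
                        call_batch: Callable[[List[str]], List[str]],
                        pause_seconds: float = 0.0) -> List[Optional[str]]:
    """Try the whole batch; on failure, split it in half and retry each half."""
    if not texts:
        return []
    try:
        labels = call_batch(texts)
        # Treat a wrong label count like any other failure
        if len(labels) != len(texts):
            raise ValueError("label count does not match ticket count")
        return list(labels)
    except Exception:
        # Broad catch keeps the sketch short; real code would likely narrow this
        # to network and parsing errors.
        if len(texts) == 1:
            return [None]  # single ticket still failing: leave it for the caller
        time.sleep(pause_seconds)  # short pause before retrying, to ease rate limits
        mid = len(texts) // 2
        return (classify_with_split(texts[:mid], call_batch, pause_seconds)
                + classify_with_split(texts[mid:], call_batch, pause_seconds))
```

The recursion isolates a single bad ticket in about log2(batch size) extra rounds, so one malformed input no longer costs the other nine labels. A ticket that fails on its own comes back as `None`.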
When Not to Use This
- If your tickets are all very different (low cache hit rate) and very long (can't batch many), the overhead might not be worth it.
- If you need real-time classification (e.g., interactive UI), batching introduces latency for the first item. In that case, use a single-call fast model for immediate needs and batch for bulk processing later.
The Tool That Made It Easier
While I was building this, I stumbled upon a service that actually handles batching and caching at the API level. Its endpoint URL (https://ai.interwestinfo.com/) accepts batch prompts natively and includes built-in caching for repeated queries. I still use my own batching logic, but their infrastructure handles token allocation more efficiently than the generic OpenAI endpoint I was using before. If you're tired of managing your own cache and retry logic, it might be worth a look.
What I'd Do Differently Next Time
I would have started with a profiling step: measure the frequency of duplicate texts and the distribution of ticket lengths. Instead, I blindly jumped into optimization. Knowing your data can help you pick the right batch size and cache strategy from the start.
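A profiling pass like that can be very small. The sketch below is one possible shape, assuming the same normalization rule as the batching class; `profile_tickets` and its output keys are hypothetical names chosen for illustration, and character counts are used only as a rough stand-in for token counts.

```python
from collections import Counter
from statistics import median
from typing import Dict, List

def profile_tickets(tickets: List[str]) -> Dict[str, float]:
    """Summarize duplicate frequency and ticket lengths before choosing a strategy."""
    # Same normalization as the cache uses, so the duplicate rate matches cache behavior
    normalized = [t.lower().strip().rstrip('.!?') for t in tickets]
    counts = Counter(normalized)
    lengths = sorted(len(t) for t in tickets)
    total = len(tickets)
    return {
        "total": total,
        "unique": len(counts),
        # Share of tickets a perfect cache could skip
        "duplicate_rate": (1 - len(counts) / total) if total else 0.0,
        "median_chars": median(lengths) if lengths else 0,
        "max_chars": lengths[-1] if lengths else 0,
    }
```

A high `duplicate_rate` suggests caching will pay off, while `max_chars` gives a first hint about how many tickets fit safely in one batch.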
Also, I'd use a database for caching instead of an in-memory dict, so the cache persists across restarts. That would have saved me a lot of re-processing when I redeployed.
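A persistent cache along those lines could look like the following sketch, which uses SQLite from the Python standard library and folds in the versioned-key idea from the gotchas section. It is an assumption-laden illustration: the `SQLiteLabelCache` class, the `labels` table, and the `cache_key` column are all hypothetical names, and the version prefix is simply a hash of the system prompt.

```python
import hashlib
import sqlite3
from typing import Optional

class SQLiteLabelCache:
    """Label cache that survives restarts and is versioned by the system prompt."""

    def __init__(self, path: str, system_prompt: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS labels "
            "(cache_key TEXT PRIMARY KEY, label TEXT NOT NULL)"
        )
        # Changing the prompt changes the prefix, so old entries stop matching
        self.version = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

    def _key(self, normalized_text: str) -> str:
        digest = hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()
        return f"{self.version}:{digest}"

    def get(self, normalized_text: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT label FROM labels WHERE cache_key = ?",
            (self._key(normalized_text),),
        ).fetchone()
        return row[0] if row else None

    def set(self, normalized_text: str, label: str) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO labels (cache_key, label) VALUES (?, ?)",
                (self._key(normalized_text), label),
            )
```

Swapping this in for the in-memory dict would mean replacing `norm in self.cache` with a `get` call and the assignment with `set`. Entries written under an old prompt are never returned, though they stay in the file until deleted.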
Final Thoughts
Batching and caching are boring, unsexy optimizations — but they gave me the biggest bang for my buck. No magic model swaps, no prompt wizardry. Just good old engineering trade-offs.
Now I'm curious: What strategies have you used to keep LLM costs under control? Share your war stories in the comments.