zhongqiyue

Posted on Jun 21

How I Slashed LLM API Costs by 70% with Batching (No Magic)

#ai #api #python #tutorial

A few months ago, I was building a feature to automatically categorize thousands of customer support tickets. The obvious approach? Use an LLM to read each ticket and output a label. But when I started sending requests one by one to the API, the costs climbed fast and the latency made my dev server feel like a dial-up connection.

I tried everything: compressing prompts, using smaller models, even switching to a regex-based solution (which worked for about 60% of cases, but the edge cases were a nightmare). Nothing felt like a clean, scalable solution.

Then I realized the problem wasn't the LLM itself — it was how I was using it. I was treating each API call like a standalone transaction, when most of the tickets shared similar themes and wording. That's when I implemented batching and caching, and it changed everything.

The Setup: My Initial Approach

I was using a generic OpenAI-like API endpoint (I'll keep the real service abstract here). My first attempt looked like this:

import requests

def classify_ticket(text):
    response = requests.post(
        "https://api.llm-service.com/v1/chat/completions",
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the ticket as 'bug', 'feature', or 'other'."},
                {"role": "user", "content": text}
            ]
        },
        headers={"Authorization": "Bearer YOUR_KEY"}
    )
    return response.json()["choices"][0]["message"]["content"]

# Running 1000 tickets
for ticket in tickets:
    label = classify_ticket(ticket)
    # store label...

This worked, but it was painfully slow (each call took ~2 seconds) and expensive. After processing just 500 tickets, I had burned through almost $10 in API credits. For a side project, that was unsustainable.

What Didn't Work (My Dead Ends)

Prompt compression – I trimmed whitespace, removed stopwords, but the API still charged per token based on input length. Reducing a 200-token ticket to 180 tokens saved maybe 10%.
Smaller models – Switching from GPT-4 to GPT-3.5-turbo helped costs but hurt accuracy. For some tickets, the model just couldn't understand the nuance.
Local models – I tried running a quantized LLaMA on my laptop. It was free, but inference took 10+ seconds per ticket. Not usable.
Regex + keyword matching – Great for obvious cases, but about 40% of tickets fell through the cracks. I'd still need an LLM for those.

What Actually Worked: Batching + Caching

Batching – Instead of sending one ticket at a time, I grouped them in chunks of 10. Each API call processed 10 tickets at once, dramatically cutting overhead.

Caching – Many tickets were very similar (e.g., "Can't login" repeated 50 times). By caching the label for a normalized version of the text, I avoided redundant calls.

Here's the refined code:

import requests
from functools import lru_cache
from typing import List

def _normalize(text: str) -> str:
    # Normalize text to increase cache hits (lowercase, remove trailing punctuation)
    return text.lower().strip().rstrip('.!?')

# lru_cache keys on the function argument, so the cached function
# must receive the already-normalized text for near-duplicates to hit
@lru_cache(maxsize=2048)
def _classify_normalized(normalized: str) -> str:
    # Single-request fallback: one API call per distinct normalized text
    return _call_llm([normalized])[0]

def cached_classify(text: str) -> str:
    return _classify_normalized(_normalize(text))

def _call_llm(texts: List[str]) -> List[str]:
    # Build a prompt that processes a batch of texts: instructions first,
    # then the numbered tickets, one per line
    instructions = (
        "Classify each ticket as 'bug', 'feature', or 'other'. "
        "Return the labels in order, one per line.\n\nTickets:\n"
    )
    batch_prompt = instructions + "\n".join(f"{i+1}. {t}" for i, t in enumerate(texts))

    response = requests.post(
        "https://api.llm-service.com/v1/chat/completions",
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "You are a precise classifier."},
                {"role": "user", "content": batch_prompt}
            ],
            "temperature": 0.1,  # low temperature keeps the output nearly deterministic
            "max_tokens": 100
        },
        headers={"Authorization": "Bearer YOUR_KEY"},
        timeout=60
    )
    response.raise_for_status()
    output = response.json()["choices"][0]["message"]["content"]
    # Parse the lines; the model is expected to return exactly len(texts) labels
    lines = [line.strip() for line in output.strip().split('\n') if line.strip()]
    if len(lines) != len(texts):
        raise ValueError(f"Expected {len(texts)} labels, got {len(lines)}")
    # Strip any "1. " numbering the model echoes back
    return [line.split('. ')[-1] for line in lines]

def batch_classify(tickets: List[str], batch_size: int = 10) -> List[str]:
    # lru_cache cannot be asked whether a key is present without calling the
    # function, so cache hits can't be separated from misses before batching.
    # This version only batches; the class below adds a manual cache dict.
    results = []
    for i in range(0, len(tickets), batch_size):
        results.extend(_call_llm(tickets[i:i + batch_size]))
    return results

Let's simplify. Here's the actual working code I used:

import requests
from typing import List, Dict

class LLMBatcher:
    def __init__(self, api_key: str, base_url: str = "https://api.llm-service.com"):
        self.api_key = api_key
        self.base_url = base_url
        self.cache: Dict[str, str] = {}

    def _normalize(self, text: str) -> str:
        return text.lower().strip().rstrip('.!?')

    def classify_batch(self, tickets: List[str], batch_size: int = 10) -> List[str]:
        # Normalize once and collect each distinct uncached ticket a single time,
        # so duplicates within the same call are sent to the API only once
        norms = [self._normalize(t) for t in tickets]
        pending: Dict[str, str] = {}  # normalized text -> first original ticket
        for norm, ticket in zip(norms, tickets):
            if norm not in self.cache and norm not in pending:
                pending[norm] = ticket
        pending_norms = list(pending)

        # Process uncached tickets in batches
        for i in range(0, len(pending_norms), batch_size):
            sub_norms = pending_norms[i:i+batch_size]
            sub_texts = [pending[n] for n in sub_norms]

            # Build batch prompt
            lines = [f"{j+1}. {t}" for j, t in enumerate(sub_texts)]
            prompt = "Classify each ticket as 'bug', 'feature', or 'other'. Return labels in order, one per line.\n\n" + "\n".join(lines)

            response = requests.post(
                f"{self.base_url}/v1/chat/completions",
                json={
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": "You are a precise classifier."},
                        {"role": "user", "content": prompt}
                    ],
                    "temperature": 0.1,
                    "max_tokens": 100
                },
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=60
            )
            response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
            output = response.json()["choices"][0]["message"]["content"]
            # Skip blank lines and strip any "1. " numbering the model echoes back
            labels = [line.split('. ')[-1].strip() for line in output.strip().split('\n') if line.strip()]
            if len(labels) != len(sub_texts):
                raise ValueError(f"Expected {len(sub_texts)} labels, got {len(labels)}")

            for norm, label in zip(sub_norms, labels):
                self.cache[norm] = label

        # Every ticket, duplicate or not, now resolves through the cache
        return [self.cache[norm] for norm in norms]

# Usage
batcher = LLMBatcher(api_key="YOUR_KEY")
tickets = ["Login failed again", "Feature request: dark mode", "Crash when uploading file"]
labels = batcher.classify_batch(tickets)
print(labels)  # e.g. ['bug', 'feature', 'bug'] (actual labels depend on the model)
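Parsing the reply is the most fragile step in this class, because it trusts the model to return exactly one clean label per line. Here is a minimal sketch of a more defensive parser. The `parse_labels` helper and the `ALLOWED_LABELS` set are names I made up for illustration, not part of the code I originally ran.

```python
import re
from typing import List

# Assumption: these are the only three labels the prompt asks for
ALLOWED_LABELS = {"bug", "feature", "other"}

def parse_labels(output: str, expected: int) -> List[str]:
    """Turn the model's raw reply into exactly `expected` validated labels."""
    labels = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines the model may add
        # Drop a leading "3." or "3)" if the model echoes the numbering
        cleaned = re.sub(r"^\d+\s*[.)]\s*", "", line)
        # Drop surrounding quotes, backticks, spaces and periods, then lowercase
        cleaned = cleaned.strip("'\"` .").lower()
        if cleaned not in ALLOWED_LABELS:
            raise ValueError(f"Unexpected label: {line!r}")
        labels.append(cleaned)
    if len(labels) != expected:
        raise ValueError(f"Expected {expected} labels, got {len(labels)}")
    return labels
```

Raising on a bad reply is a deliberate choice here: a loud failure can be retried, while a silently misaligned label list would poison the cache.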

The Results

Cost: Reduced by ~70%. With 100 batch calls instead of 1000 single calls, the instruction tokens are sent once per 10 tickets instead of once per ticket, and caching removed the roughly 20% of tickets that were duplicates.
Latency: Overall time dropped from ~33 minutes to under 5 minutes for 1000 tickets. The batches of 10 took about 5 seconds each (including network), so 5 seconds * 100 batches = 500 seconds ≈ 8 minutes total, but caching cut that further.
Accuracy: Actually improved a bit because the batch prompt gave the model more context and examples (even though it wasn't few-shot, the model seemed to be more consistent when seeing multiple tickets at once).
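The latency figures can be sanity-checked with a few lines of arithmetic. This sketch uses only the round numbers quoted in this post (2 seconds per single call, 5 seconds per batch of 10, roughly 20% of tickets served from cache) and assumes calls run one after another. The `estimate_seconds` function is a hypothetical helper for illustration.

```python
import math

def estimate_seconds(n_tickets: int, secs_per_call: float,
                     batch_size: int = 1, cache_hit_rate: float = 0.0) -> float:
    """Estimate sequential wall-clock time for classifying n_tickets."""
    # Tickets that still need an API call after cache hits are removed
    uncached = round(n_tickets * (1 - cache_hit_rate))
    calls = math.ceil(uncached / batch_size)
    return calls * secs_per_call

single = estimate_seconds(1000, 2.0)                   # one ticket per call
batched = estimate_seconds(1000, 5.0, batch_size=10)   # 100 batch calls
with_cache = estimate_seconds(1000, 5.0, batch_size=10, cache_hit_rate=0.2)

for name, secs in [("single", single), ("batched", batched), ("batched+cache", with_cache)]:
    print(f"{name}: {secs:.0f} s = {secs / 60:.1f} min")
```

With these rounded inputs the estimate comes to about 33.3, 8.3, and 6.7 minutes. A measured total under 5 minutes would therefore imply faster batches or a higher cache hit rate than the round figures suggest.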

Trade-offs and Gotchas

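Larger batches increase risk of token overflow: If your batch prompt exceeds the model's context window, the API may reject the request or the reply may come back cut off. The output side matters too: max_tokens is set to 100 in the code above, which is plenty for 10 short labels but would truncate a much larger batch. I found 10 tickets per batch worked well for short tickets. For longer documents, you'd need smaller batches.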
Cache invalidation: If your classification rules change, you need to clear the cache. I used a versioned cache key (e.g., include the system prompt hash) to handle this.
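Rate limits: Each batch call counts as one API request but carries ten tickets' worth of tokens, so you can still hit rate limits (particularly token-based ones, where the provider enforces them) if you send too many batches too quickly. I added a small sleep between batches.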
Fallback: If the batch call fails (network error, 500), you lose all 10 responses. I implemented a retry mechanism that splits the batch in half on failure.
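The split-on-failure retry can be sketched as a small recursive helper. This is an illustration under assumptions, not my exact implementation: `classify_with_split`, `call_batch`, and `pause` are hypothetical names, and `call_batch` stands in for whatever function sends one batch and returns one label per ticket.

```python
import time
from typing import Callable, List

def classify_with_split(
    texts: List[str],
    call_batch: Callable[[List[str]], List[str]],
    pause: float = 0.5,
) -> List[str]:
    """Classify texts in one call; on failure, split the batch in half and retry each half."""
    if not texts:
        return []
    try:
        labels = call_batch(texts)
        # Treat a misaligned reply like any other failure
        if len(labels) != len(texts):
            raise ValueError("label count does not match batch size")
        return labels
    except Exception:  # broad on purpose for the sketch; narrow this in real code
        if len(texts) == 1:
            raise  # a single ticket that still fails is a real error
        time.sleep(pause)  # brief pause before retrying, to go easy on rate limits
        mid = len(texts) // 2
        left = classify_with_split(texts[:mid], call_batch, pause)
        right = classify_with_split(texts[mid:], call_batch, pause)
        return left + right
```

Halving isolates a single problematic ticket in a logarithmic number of steps, so one bad input does not cost the other nine results.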

When Not to Use This

If your tickets are all very different (low cache hit rate) and very long (can't batch many), the overhead might not be worth it.
If you need real-time classification (e.g., interactive UI), batching introduces latency for the first item. In that case, use a single-call fast model for immediate needs and batch for bulk processing later.

The Tool That Made It Easier

While I was building this, I stumbled upon a service that actually handles batching and caching at the API level. Its endpoint URL (https://ai.interwestinfo.com/) accepts batch prompts natively and includes built-in caching for repeated queries. I still use my own batching logic, but their infrastructure handles token allocation more efficiently than the generic OpenAI endpoint I was using before. If you're tired of managing your own cache and retry logic, it might be worth a look.

What I'd Do Differently Next Time

I would have started with a profiling step: measure the frequency of duplicate texts and the distribution of ticket lengths. Instead, I blindly jumped into optimization. Knowing your data can help you pick the right batch size and cache strategy from the start.
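A profiling step like that fits in a few lines. In this rough sketch, `profile_tickets` is a hypothetical helper that reuses the same normalization as the cache, and character counts stand in as a crude proxy for token counts.

```python
from collections import Counter
from statistics import median
from typing import Dict, List

def profile_tickets(tickets: List[str]) -> Dict[str, float]:
    """Summarize duplicate rate and length distribution before picking a batch size."""
    if not tickets:
        return {"total": 0, "unique": 0, "duplicate_rate": 0.0,
                "median_chars": 0, "max_chars": 0}
    # Same normalization as the cache: lowercase, trim, drop trailing punctuation
    normalized = [t.lower().strip().rstrip(".!?") for t in tickets]
    counts = Counter(normalized)
    lengths = [len(t) for t in tickets]
    return {
        "total": len(tickets),
        "unique": len(counts),
        # Share of tickets that a perfect cache would never need to send
        "duplicate_rate": 1 - len(counts) / len(tickets),
        "median_chars": median(lengths),
        "max_chars": max(lengths),
    }

stats = profile_tickets(["Can't login", "can't login!", "Dark mode please", "Crash"])
print(stats)
```

A high duplicate rate argues for investing in the cache first, while a long tail in `max_chars` argues for smaller batches.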

Also, I'd use a database for caching instead of an in-memory dict, so the cache persists across restarts. That would have saved me a lot of re-processing when I redeployed.
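A minimal sketch of what that could look like with SQLite from the standard library. It folds in the versioned-key idea from the gotchas section by hashing the system prompt into the key. The `SQLiteLabelCache` class, the table layout, and the 16-character hash prefix are all illustrative assumptions rather than code I ran in production.

```python
import hashlib
import sqlite3
from typing import Optional

class SQLiteLabelCache:
    """Persistent label cache keyed by prompt version plus normalized ticket text."""

    def __init__(self, path: str, system_prompt: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS labels ("
            "version TEXT NOT NULL, text TEXT NOT NULL, label TEXT NOT NULL, "
            "PRIMARY KEY (version, text))"
        )
        # A changed prompt yields a new version, so stale labels are never matched
        self.version = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]

    def get(self, normalized_text: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT label FROM labels WHERE version = ? AND text = ?",
            (self.version, normalized_text),
        ).fetchone()
        return row[0] if row else None

    def set(self, normalized_text: str, label: str) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO labels (version, text, label) VALUES (?, ?, ?)",
                (self.version, normalized_text, label),
            )
```

Swapping this in for the in-memory dict means replacing the `in self.cache` checks with `get` and the assignments with `set`. Old versions stay in the table until you delete them, which also makes rolling back a prompt change cheap.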

Final Thoughts

Batching and caching are boring, unsexy optimizations — but they gave me the biggest bang for my buck. No magic model swaps, no prompt wizardry. Just good old engineering trade-offs.

Now I'm curious: What strategies have you used to keep LLM costs under control? Share your war stories in the comments.

DEV Community