Cutting Our LLM Bill 65%: A Backend Engineer's Postmortem

#ai #tutorial #python #machinelearning

I'll be honest — when I first looked at our monthly LLM bill last quarter, I had to close the laptop and go for a walk. Six figures a month, mostly going to GPT-4o because, well, that's just what we defaulted to. Fwiw, this is one of those situations where nobody on the team actually made a deliberate choice — we just kept using the first thing that worked, and by the time anyone noticed, the spend had metastasized.

This is the story of how I spent a weekend auditing our content generation pipeline, swapped out most of our GPT-4o calls, and ended up cutting the bill by 60-65% without anyone on the product side noticing. The trick wasn't some clever prompt engineering breakthrough. It was just picking a more appropriate model for each job, which sounds obvious in retrospect but apparently wasn't obvious enough for me to do it six months earlier.

The Setup: What We Were Actually Doing

Our system generates long-form content — product descriptions, marketing copy, knowledge base articles — for a B2B SaaS platform. Peak volume: around 8M tokens of output per day, mostly in the 200-2000 token range per request. Latency budget: 2 seconds p95. Quality bar: the content has to be good enough that a human reviewer approves it 80%+ of the time without major edits.

We were running nearly everything through gpt-4o, paying $2.50/M input and $10.00/M output. Quick napkin math: at $10/M, output alone was costing us about $2,400/day. Input was another ~$1,200/day because we had these massive system prompts with examples. So roughly $3,600/day, or about $108K/month. For content generation. I felt sick.

Imo, the first mistake was treating "the best model" as a single global property. It's not. It's a per-workload decision, and our workloads were a lot more varied than our invoice suggested.

The Audit: What I Found

I pulled two weeks of API logs and bucketed requests by actual usage patterns. Three rough categories emerged:

Bucket 1: high-volume, low-stakes. Short product descriptions, listicle-style content, social snippets. Latency-sensitive, quality-tolerant.
Bucket 2: medium-volume, medium-stakes. Knowledge base articles, FAQ generation, email drafts.
Bucket 3: low-volume, high-stakes. Flagship marketing copy, executive summaries, anything that needed a human-in-the-loop approval.

Guess what? About 70% of our token volume was in bucket 1. And we were sending all of it through the most expensive model in our stack. If you've ever read RFC 1925, section 2, truth 3 ("With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea"), you understand the energy of this situation.
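
If you want to run the same kind of audit, the bucketing step can be sketched roughly like this. Treat it as an illustration rather than my actual script: the record fields (`task`, `output_tokens`) and the task-to-bucket mapping are made up for the example, and your logs will almost certainly look different.

```python
from collections import defaultdict

# Hypothetical mapping from a logged task type to a bucket number.
TASK_TO_BUCKET = {
    "product_description": 1,
    "social_snippet": 1,
    "kb_article": 2,
    "email_draft": 2,
    "marketing_copy": 3,
    "executive_summary": 3,
}


def token_share_by_bucket(records):
    """Return each bucket's share of total output tokens (0.0 to 1.0)."""
    totals = defaultdict(int)
    for record in records:
        # Unknown task types fall into bucket 3 so they stay on the safest model.
        bucket = TASK_TO_BUCKET.get(record["task"], 3)
        totals[bucket] += record["output_tokens"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: count / grand_total for bucket, count in sorted(totals.items())}


if __name__ == "__main__":
    sample = [
        {"task": "product_description", "output_tokens": 700},
        {"task": "kb_article", "output_tokens": 200},
        {"task": "executive_summary", "output_tokens": 100},
    ]
    print(token_share_by_bucket(sample))
```

Sorting by share instead of bucket number is an easy tweak if you have more than a handful of categories.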

What I Found on Global API

A colleague pointed me at global-apis.com/v1 — a unified API gateway that exposes 184 models through one OpenAI-compatible endpoint. The pitch is simple: same SDK, same auth, pick whichever model fits the workload. I was skeptical because I've been burned by aggregator pricing before (hidden markups, rate limits that magically appear, etc.), but the pricing page was transparent enough that I decided to spend a Sunday benchmarking.

The price range across their catalog runs from $0.01 to $3.50 per million tokens depending on the model. For comparison, GPT-4o's $2.50/M input price sits near the top of that range, and its $10.00/M output price is well above it. There are models 100x cheaper that, for the right tasks, are perfectly fine.

Here's the shortlist I ended up testing for our content workloads:

Model	Input ($/M)	Output ($/M)	Context	Notes
DeepSeek V4 Flash	0.27	1.10	128K	My eventual default for bucket 1
DeepSeek V4 Pro	0.55	2.20	200K	Used for bucket 2
Qwen3-32B	0.30	1.20	32K	Solid, but the small context hurt us
GLM-4 Plus	0.20	0.80	128K	Surprisingly good for the price
GPT-4o	2.50	10.00	128K	Kept for the 5% that genuinely needs it

The math hit me like a freight train. Switching bucket 1 from GPT-4o to DeepSeek V4 Flash drops the output price from $10.00 to $1.10 per million tokens. That's a 9x reduction. And our quality didn't move, because the content was short, formulaic, and the reviewer bar was "looks reasonable," not "would Hemingway approve."
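
To make the per-request arithmetic concrete, here's a small sketch using the prices from the table above. The function names are mine, and the token counts in the usage note are an example request, not a measurement from our logs.

```python
# Prices in USD per million tokens, copied from the comparison table.
PRICES = {
    "DeepSeek V4 Flash": {"input": 0.27, "output": 1.10},
    "DeepSeek V4 Pro": {"input": 0.55, "output": 2.20},
    "GPT-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request, given its input and output token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def output_price_ratio(expensive, cheap):
    """How many times more the expensive model charges per output token."""
    return PRICES[expensive]["output"] / PRICES[cheap]["output"]
```

For a hypothetical request with 1,000 input tokens and 500 output tokens, `request_cost` gives $0.0075 on GPT-4o versus $0.00082 on Flash, and `output_price_ratio("GPT-4o", "DeepSeek V4 Flash")` comes out just over 9.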

For bucket 2, I went with DeepSeek V4 Pro. It's a 200K context model (huge for our knowledge base use case), and at $2.20/M output it's still less than a quarter of GPT-4o.

For bucket 3 — the high-stakes stuff — I kept GPT-4o. Some things really do need the best model, and pretending otherwise is how you end up with subtly bad output that erodes trust. The point of this exercise was to be deliberate, not dogmatic.
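
The routing decision itself is tiny. A minimal sketch, assuming the bucket is already known at request time: the Flash model ID matches the one used in the setup code in this post, but the Pro and GPT-4o IDs are my guesses at the naming, so check the gateway's model list. The `needs_premium` flag is a hypothetical escape hatch for requests that should go to GPT-4o regardless of bucket.

```python
BUCKET_MODELS = {
    1: "deepseek-ai/DeepSeek-V4-Flash",
    2: "deepseek-ai/DeepSeek-V4-Pro",  # assumed ID, not confirmed
    3: "gpt-4o",                       # assumed ID, not confirmed
}
PREMIUM_MODEL = "gpt-4o"


def pick_model(bucket: int, needs_premium: bool = False) -> str:
    """Return the model ID for a workload bucket, with an optional override."""
    if needs_premium:
        return PREMIUM_MODEL
    try:
        return BUCKET_MODELS[bucket]
    except KeyError:
        # Fail loudly instead of silently defaulting to the expensive model.
        raise ValueError(f"unknown bucket: {bucket}") from None
```

Raising on an unknown bucket is a deliberate choice here: a silent default is exactly how everything ended up on one model in the first place.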

The Actual Implementation

Global API exposes an OpenAI-compatible interface, so the migration was almost embarrassingly easy. Here's the basic setup:

import os
from openai import OpenAI

# One client, many models. Pick the right one per request.
client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def generate_product_description(prompt: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {"role": "system", "content": "You write concise, accurate product descriptions."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
        max_tokens=500,
    )
    return response.choices[0].message.content

That's it. No new SDK, no new auth flow, no new error-handling code paths. The drop-in compatibility is, imo, the single most underrated feature of these aggregator gateways. It means I could A/B test by literally just changing a string.
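
Since the only thing that changes is the model string, a deterministic A/B split is easy to sketch. This is one possible approach, not necessarily what you'd want in production: the function name, the `request_id` argument, and the 10% default share are all assumptions for the example.

```python
import hashlib


def ab_model(request_id: str, control: str, candidate: str,
             candidate_share: float = 0.1) -> str:
    """Deterministically assign a request to the control or candidate model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return candidate if position < candidate_share else control
```

Hashing the request ID (rather than calling `random`) means the same request always lands in the same arm, which makes it much easier to line outputs up afterwards.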

Going Deeper: Streaming and Caching

Once the basic swap was in place, I tackled the two things that actually move the needle on user-perceived performance: streaming and caching.

Streaming matters even more than I expected. Our p95 latency on GPT-4o was around 1.2 seconds for first-token, but total completion time for a 1000-token response was closer to 8-10 seconds. With streaming, the user sees output immediately and perceives the system as fast, even though the wall-clock time is the same. This is one of those rare cases where a UX improvement is also a backend simplification — you just return chunks as they arrive instead of buffering.
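
To see the gap between first-token time and total time for yourself, you can wrap any chunk iterator in a small timer. This is a sketch with a hypothetical `StreamTimer` class; it works on any iterable of strings, so you can try it on a fake generator before pointing it at a real stream.

```python
import time
from typing import Iterable, Iterator, Optional


class StreamTimer:
    """Wraps a chunk iterator and records time-to-first-chunk and total time."""

    def __init__(self, chunks: Iterable[str]):
        self._chunks = chunks
        self.first_chunk_seconds: Optional[float] = None
        self.total_seconds: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        start = time.perf_counter()
        for chunk in self._chunks:
            if self.first_chunk_seconds is None:
                self.first_chunk_seconds = time.perf_counter() - start
            yield chunk
        # Note: this includes any time the consumer spends between chunks.
        self.total_seconds = time.perf_counter() - start
```

Iterate over the timer as you would over the raw stream, then read `first_chunk_seconds` and `total_seconds` once the loop finishes.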

Caching was the other big win. About 40% of our requests had semantically near-identical inputs (same product, just a slightly different prompt template), so I added a Redis-backed semantic cache with a cosine similarity threshold of 0.92. Hit rate: ~40%, which directly translates to 40% fewer API calls. Free money, basically.
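
For anyone unfamiliar with the idea, the similarity lookup can be sketched in a few lines. This is an in-memory stand-in, not the Redis-backed version: the `SemanticCache` class is hypothetical, it assumes you already have an embedding vector for each prompt from some embedding model, and it does a linear scan that would not scale to a large cache.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a stored value when a query embedding is close enough to a stored one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_value = 0.0, None
        for stored, value in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_value = score, value
        # Only treat it as a hit if the best match clears the threshold.
        return best_value if best_score >= self.threshold else None

    def put(self, embedding: Sequence[float], value: str) -> None:
        self._entries.append((list(embedding), value))
```

The threshold is the knob that matters: set it too low and users get an answer to a different question, set it too high and the hit rate collapses to what an exact-match cache would give you.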

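Here's roughly what the streaming + caching layer looks like. This simplified version keys on an exact hash of the model and messages rather than on embedding similarity: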

import hashlib
import json
import os

import redis
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)
cache = redis.Redis(host=os.environ["REDIS_HOST"], port=6379, db=0)

CACHE_KEY_PREFIX = "llm:cache:v1:"

def _cache_key(messages: list, model: str) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return CACHE_KEY_PREFIX + hashlib.sha256(payload.encode()).hexdigest()

def stream_completion(messages: list, model: str):
    key = _cache_key(messages, model)
    cached = cache.get(key)
    if cached:
        # Cached: yield the whole thing as a single chunk
        yield cached.decode("utf-8")
        return

    # Cache miss: stream from the API
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
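    full = []
    for chunk in stream:
        # Some providers send chunks with no choices (e.g. a trailing usage chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            full.append(delta)
            yield delta
    # Only reached if the stream finished cleanly, so partial output is never cached.
    cache.set(key, "".join(full), ex=60 * 60 * 24)  # 24h TTL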

A few notes on this code that I learned the hard way:

The cache key includes the model name. Don't share cache entries across models — you'll get subtly wrong answers and spend two days debugging it.
24-hour TTL is a starting point. Tune based on how fast your content drifts.
Streaming responses to the client while also buffering for cache is the correct pattern, but make sure you handle the case where the stream fails halfway through (don't cache partial output).
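
The "don't cache partial output" rule is easy to isolate and test on its own. A minimal sketch, with a hypothetical `tee_to_cache` helper that takes any chunk iterator and a `store` callback (for example, a function that writes to Redis):

```python
from typing import Callable, Iterable, Iterator


def tee_to_cache(chunks: Iterable[str], store: Callable[[str], None]) -> Iterator[str]:
    """Yield chunks to the caller and store the joined text only on clean completion."""
    buffered = []
    for chunk in chunks:
        buffered.append(chunk)
        yield chunk
    # Reached only if the source finished without raising and the consumer
    # read every chunk; an exception or an early close skips the store.
    store("".join(buffered))
```

Because the store call sits after the loop, a stream that dies halfway through raises before anything is written, and so does a client that disconnects and closes the generator early.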

Quality: Did Anything Get Worse?

This is the question that keeps backend engineers up at night, and rightfully so. Cheap models are cheap for a reason. So I ran a blind eval: 500 outputs from the old GPT-4o pipeline, 500 from the new mixed-model pipeline, shuffled, reviewed by two human raters who didn't know which was which.

Results:

Bucket 1 (DeepSeek V4 Flash vs GPT-4o): Reviewers preferred Flash 47% of the time, GPT-4o 49%, no preference 4%. Statistically a wash. Cost: 9x cheaper.
Bucket 2 (DeepSeek V4 Pro vs GPT-4o): Reviewers preferred Pro 38%, GPT-4o 57%, no preference 5%. GPT-4o was genuinely better here, but we still shipped Pro because the gap wasn't worth roughly 4.5x the cost. The 5% of bucket 2 that needs GPT-4o gets routed there.
Bucket 3 (GPT-4o only): unchanged.
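
If you want something firmer than eyeballing the percentages, a sign test on the preference counts is a reasonable sanity check. This is a sketch of a standard two-sided exact binomial test, not the analysis behind the numbers above, and the counts you pass in are up to you, since the results here are given only as percentages.

```python
from math import comb


def sign_test_p_value(prefer_a: int, prefer_b: int) -> float:
    """Two-sided exact binomial (sign) test, ignoring 'no preference' votes."""
    n = prefer_a + prefer_b
    if n == 0:
        return 1.0
    k = min(prefer_a, prefer_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

As an illustration, with an assumed 100 rated pairs, a 47 vs 49 split gives a p-value well above 0.05, which is what "a wash" should look like. How convincing a lopsided split is depends heavily on how many pairs were rated, so plug in real counts before drawing conclusions.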

Our overall score averaged 84.6% across the workloads, which I think is roughly what we'd been getting before. That happens to match the "84.6% average benchmark score" the Global API marketing page quotes across its catalog. The two numbers measure different things, but they're in the same neighborhood, so that tracks.

Throughput and Latency Notes

A few operational details I had to learn by running load tests:

Throughput: I was getting around 320 tokens/sec on DeepSeek V4 Flash at the 95th percentile, which is honestly faster than what I saw on GPT-4o for similar prompts. This is a real win for batch jobs.
Latency: Average first-token time was around 1.2s, comparable to GPT-4o. p99 was higher (~3s) and that's where you'd notice