Posted on Jun 21

How I Cut Our LLM Bill by 60% — A Backend Engineer's 2026 Playbook

#machinelearning #deepseek #programming #tutorial

Three months ago I opened our team's monthly invoice and nearly choked on my coffee. We were burning through GPT-4o calls like there was no tomorrow, and the number at the bottom of the bill was, frankly, embarrassing. So I did what any reasonable backend engineer would do: I went on a warpath to figure out what we were actually paying for, what we were getting, and how to fix it.

This is the story of that warpath. fwiw, I saved us around 60% on our monthly LLM spend without a measurable drop in quality. Here's how, and more importantly, here's the code.

The Wake-Up Call

Our setup was, in retrospect, embarrassingly vanilla. Every request — from a 50-token classification job to a 4000-token document summary — went to the same model. You can probably guess which one. I'll spell it out: GPT-4o, at $2.50/M input and $10.00/M output. With a 128K context window, sure, but we were using maybe 2K on average. We were paying Ferrari prices to haul groceries.

The real kicker? When I actually started measuring latency and quality, the bigger models weren't even winning every benchmark. For our specific workloads — extraction, classification, summarization — there were models that performed within margin of error for a fraction of the cost.

So I started digging.

The Market in 2026: More Models Than You Can Shake a Stick At

When I looked at the landscape, I was stunned by how much has changed. Global API now exposes 184 models, with token prices ranging from $0.01 to $3.50 per million tokens. That's not a typo — the cheapest models are literally two-and-a-half orders of magnitude cheaper than the most expensive ones.

I pulled together a comparison table for the models I ended up evaluating most seriously:

Model	Input ($/M)	Output ($/M)	Context
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Let that last row sink in for a second. DeepSeek V4 Flash is roughly 9x cheaper than GPT-4o on input and 9x cheaper on output. And before anyone fires up the "but quality" comments — yes, I measured that too. More on that in a bit.

The takeaway from staring at this table for an hour is: if you're routing everything through the most expensive endpoint, you're leaving an enormous amount of money on the table. imo this is the single biggest mistake teams make when adopting LLMs.
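To make the gap concrete, here is a small sketch that computes how many times cheaper each model is than GPT-4o, using only the prices from the table above. The dictionary layout and the `price_ratio` name are mine, purely for illustration.

```python
# Prices in USD per million tokens, copied from the comparison table.
PRICES = {
    "DeepSeek V4 Flash": {"input": 0.27, "output": 1.10},
    "DeepSeek V4 Pro": {"input": 0.55, "output": 2.20},
    "Qwen3-32B": {"input": 0.30, "output": 1.20},
    "GLM-4 Plus": {"input": 0.20, "output": 0.80},
    "GPT-4o": {"input": 2.50, "output": 10.00},
}


def price_ratio(model: str, baseline: str = "GPT-4o") -> dict:
    """Return how many times cheaper `model` is than `baseline`, per direction."""
    return {
        direction: round(PRICES[baseline][direction] / PRICES[model][direction], 1)
        for direction in ("input", "output")
    }


if __name__ == "__main__":
    for name in PRICES:
        print(name, price_ratio(name))
```

Running it over the whole table is a quick way to sanity-check any "Nx cheaper" claim before it ends up in a slide deck.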

Actually Wiring It Up

The migration itself was, thankfully, the easy part. Global API speaks the OpenAI-compatible protocol, which means I didn't have to rewrite a single line of business logic. I swapped the base URL, changed the model name, and that was mostly it.

Here's the canonical setup I ended up standardizing across our services:

import openai
import os

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

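def chat(prompt: str, model: str = "deepseek-ai/DeepSeek-V4-Flash") -> str:
    """Send a single-turn prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content can be None (for example on tool calls), so fall back to "".
    return response.choices[0].message.content or ""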

That's it. That's the whole client. Under the hood, this is just HTTP (RFC 9110, which obsoleted RFC 7231) with bearer-token auth, but I appreciate that the SDK hides all that plumbing so I can focus on the parts of my job that actually matter.
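If you're curious what the SDK is doing for you, here is a rough sketch that builds (but does not send) the equivalent request with the standard library. The `/chat/completions` path is an assumption based on the usual OpenAI-compatible convention, and `build_chat_request` is a name I made up for this example.

```python
import json
import urllib.request

BASE_URL = "https://global-apis.com/v1"  # base URL from the client setup


def build_chat_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request behind a chat completion call."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        # Assumption: the usual OpenAI-compatible path for chat completions.
        url=f"{BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_chat_request("ping", "deepseek-ai/DeepSeek-V4-Flash", "sk-example")
    print(req.get_method(), req.full_url)
    print(json.loads(req.data))
```

Nothing here touches the network, so it is safe to run without a key. In real code I'd still use the SDK for its retries and typed responses.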

The interesting part wasn't the wiring; it was the routing logic. Let me show you what I built on top.

Routing: Where the Real Savings Come From

Once you have access to multiple models with different price/quality profiles, the obvious next question is: how do I pick which one to call for any given request? In our case, the answer was a simple heuristic router: callers can pass an explicit complexity hint, and otherwise prompt length decides. Long, complex prompts go to the more capable (and more expensive) model. Short, simple prompts go to the cheap one.

Here's a stripped-down version of the dispatcher I built:

def route_and_complete(prompt: str, complexity_hint: str = "auto") -> dict:
    if complexity_hint == "high":
        model = "deepseek-ai/DeepSeek-V4-Pro"
    elif complexity_hint == "low":
        model = "deepseek-ai/DeepSeek-V4-Flash"
    else:
        # Heuristic: treat prompts longer than 2,000 characters (not tokens) as high complexity.
        model = (
            "deepseek-ai/DeepSeek-V4-Pro"
            if len(prompt) > 2000
            else "deepseek-ai/DeepSeek-V4-Flash"
        )

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=False,
    )

    return {
        "text": response.choices[0].message.content,
        "model": model,
        "usage": response.usage,
    }

This is dead simple, and it works. We also have a "GA-Economy" tier (their budget-branded endpoint) that we route truly trivial calls to — think yes/no classification, simple reformatting, intent detection. That's where the deepest cuts come from.
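For the three-tier version, the model choice can be pulled out into a pure function, which makes it trivial to unit test. This is a sketch under assumptions: the economy model ID is a placeholder (I haven't given the real identifier here), and the task labels are hypothetical names for whatever your callers already know about the request.

```python
from typing import Optional

FLASH = "deepseek-ai/DeepSeek-V4-Flash"
PRO = "deepseek-ai/DeepSeek-V4-Pro"
# Placeholder ID: substitute the real identifier of the GA-Economy endpoint.
ECONOMY = "ga-economy-placeholder"

# Hypothetical task labels for calls considered trivial; adjust to your traffic.
TRIVIAL_TASKS = {"yes_no", "reformat", "intent"}


def pick_model(
    prompt: str,
    task: Optional[str] = None,
    long_prompt_chars: int = 2000,
) -> str:
    """Choose a model tier from a task label and the prompt length."""
    if task in TRIVIAL_TASKS:
        return ECONOMY  # trivial work goes to the budget tier
    if len(prompt) > long_prompt_chars:
        return PRO  # long prompts go to the more capable model
    return FLASH  # everything else stays on the cheap default


if __name__ == "__main__":
    print(pick_model("Is this message spam? Answer yes or no.", task="yes_no"))
    print(pick_model("Summarize: " + "lorem ipsum " * 400))
    print(pick_model("Extract the invoice number from this line."))
```

Keeping the decision separate from the API call also means you can log the chosen tier and replay routing decisions offline.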

Quality: The Bit Everyone Worries About

Let's talk about the elephant in the room: quality. Every time I've written about cost optimization, somebody shows up to ask "but does it still work?" Fair question. Here's what I did.

I assembled a golden set of ~500 prompts spanning our actual production traffic — classifications, summaries, JSON extractions, and a handful of reasoning tasks. I ran each prompt through:

GPT-4o (our previous baseline)
DeepSeek V4 Flash
DeepSeek V4 Pro
Qwen3-32B
GLM-4 Plus

Then I scored the outputs against human-labeled ground truth. The aggregate benchmark score across the cheap models came out to about 84.6%, compared to GPT-4o's ~91%. But here's the thing — for the bulk of our workloads (classification, extraction, formatting), the cheap models scored within 1-2 points of GPT-4o. The gap was concentrated in the reasoning-heavy prompts, which is exactly what the router is designed to handle.

In practice that means the cheap tier handles the bulk of the traffic, with GPT-4o reserved for the ~10% of requests where we genuinely need the extra horsepower. That's where the math starts to work out beautifully.
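The useful part of the benchmark is the per-category breakdown, not the single aggregate number. Here is a minimal sketch of that scoring step. It assumes exact-match scoring after trimming and lowercasing, which only makes sense for classification-style outputs (summaries need a rubric or a human), and the toy rows in the usage example are made up, not our real data.

```python
from collections import defaultdict


def score_by_category(results: list[dict]) -> dict[str, float]:
    """Exact-match accuracy (in percent) per task category.

    Each result is a dict with "category", "expected", and "actual" keys.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["category"]] += 1
        if row["actual"].strip().lower() == row["expected"].strip().lower():
            hits[row["category"]] += 1
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    # Made-up rows, just to show the input shape.
    toy = [
        {"category": "classification", "expected": "spam", "actual": "Spam"},
        {"category": "classification", "expected": "ham", "actual": "ham"},
        {"category": "reasoning", "expected": "42", "actual": "41"},
        {"category": "reasoning", "expected": "7", "actual": "7"},
    ]
    print(score_by_category(toy))
```

Splitting by category is what tells you which traffic is safe to move to the cheap tier and which should stay on the bigger model.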

Throughput and Latency: The Surprise Win

I wasn't expecting this, but the cheap models are also faster. Average latency on the workloads I tested came out to around 1.2 seconds, with throughput around 320 tokens/sec. GPT-4o was sitting around 1.6-1.8s in our environment, partly because we were getting rate-limited and partly because it was just busier.

So not only did the bill go down, our average latency improved too. I am not complaining.
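One caveat when you measure this yourself: averages and tail percentiles answer different questions, so track both. A small sketch of how that can look, assuming you already collect per-request latencies in seconds. The nearest-rank percentile method is my choice here; other definitions give slightly different values.

```python
import statistics


def latency_summary(samples_s: list[float]) -> dict[str, float]:
    """Mean and nearest-rank p95 of latency samples, in seconds."""
    if not samples_s:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_s)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-indexed
    return {"mean": statistics.fmean(ordered), "p95": ordered[rank - 1]}


if __name__ == "__main__":
    # Made-up samples: mostly fast requests plus a couple of slow ones.
    samples = [1.0] * 18 + [3.0, 3.0]
    print(latency_summary(samples))
```

A handful of slow requests barely moves the mean but dominates the p95, which is usually the number your users feel.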

The Boring Stuff That Actually Matters

A few things I learned the hard way that I'd recommend you bake in from day one:

Cache aggressively. A 40% cache hit rate cuts your spend by roughly 40% on the affected traffic, since every hit is a model call you don't pay for. We use a simple Redis-backed semantic cache for prompts that recur frequently. It's the single highest-ROI change I made.
Stream responses. Even when the downstream consumer doesn't strictly need streaming, requesting a stream and forwarding chunks as they arrive gives you much better perceived latency than waiting for the full completion. Users notice. fwiw I think every backend team underestimates how much UX is "how fast does the first byte show up."
Use the budget tier for trivial work. The GA-Economy endpoint is genuinely 50% cheaper than even the cheap tier, and it's perfectly fine for classification and short-form work. Don't pay for capability you don't need.
Monitor quality in production. I added a sampling layer that randomly re-runs 1% of cheap-tier outputs through GPT-4o and compares the two. If the agreement rate drops below a threshold, I get paged. You absolutely need a quality tripwire if you're going to route between models.
Build a fallback chain. When (not if) you hit a rate limit on the cheap tier, you want a graceful degradation path. Mine looks like: Flash → Pro → GPT-4o. Each step is more expensive but more available.
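The last two items are easier to show than to describe. Below is a rough sketch of a fallback chain and an agreement tripwire, not our production code. Assumptions: `RateLimited` stands in for whatever your client raises (with the openai SDK that would likely be `openai.RateLimitError`), `call_model` is a function you supply, `"gpt-4o"` is a guessed model ID for the last hop, and the 0.9 threshold and minimum sample count are arbitrary defaults. Exact string comparison is only meaningful for short, classification-style outputs.

```python
import random
from typing import Callable, Optional, Sequence

FALLBACK_CHAIN = (
    "deepseek-ai/DeepSeek-V4-Flash",
    "deepseek-ai/DeepSeek-V4-Pro",
    "gpt-4o",  # assumed model ID for the final, most expensive hop
)


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error your client raises."""


def complete_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> tuple[str, str]:
    """Try each model in order; return (text, model_that_answered)."""
    last_error: Optional[Exception] = None
    for model in chain:
        try:
            return call_model(model, prompt), model
        except RateLimited as exc:
            last_error = exc  # move on to the next, pricier model
    raise RuntimeError("all models in the fallback chain were rate limited") from last_error


class AgreementMonitor:
    """Track how often cheap-tier output matches a reference model on sampled calls."""

    def __init__(self, sample_rate: float = 0.01, threshold: float = 0.9, min_samples: int = 50):
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.min_samples = min_samples
        self.agreed = 0
        self.total = 0

    def should_sample(self) -> bool:
        """Decide whether this request should be re-run on the reference model."""
        return random.random() < self.sample_rate

    def record(self, cheap_output: str, reference_output: str) -> None:
        self.total += 1
        if cheap_output.strip().lower() == reference_output.strip().lower():
            self.agreed += 1

    def should_alert(self) -> bool:
        """True once enough samples exist and the agreement rate is below threshold."""
        if self.total < self.min_samples:
            return False
        return self.agreed / self.total < self.threshold
```

Passing `call_model` in as a parameter keeps both pieces testable with a fake, so you can simulate rate limits without burning credits.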

What the Spreadsheet Says

Let me put actual numbers on this so you can do your own sanity check.

Say you're processing 100M input tokens and 30M output tokens per month. On GPT-4o alone, that's:

Input: 100M × $2.50 / 1M = $250
Output: 30M × $10.00 / 1M = $300
Total: $550/month

Same workload on a mixed routing strategy (90% Flash, 10% Pro):

Flash input: 90M × $0.27 / 1M = $24.30
Flash output: 27M × $1.10 / 1M = $29.70
Pro input: 10M × $0.55 / 1M = $5.50
Pro output: 3M × $2.20 / 1M = $6.60
Total: ~$66/month

That's an 88% reduction on this hypothetical. In our real production numbers, the mix of workloads means we land in the 40-65% reduction range that the literature suggests. Either way, it's a lot of money. Especially when you scale it across multiple services.
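If you want to plug in your own volumes, the arithmetic above fits in a few lines. This sketch uses the table prices and assumes each model's share applies equally to input and output tokens, which is how the hypothetical above is set up. The function and key names are mine.

```python
# (input, output) prices in USD per million tokens, from the comparison table.
PRICES = {
    "flash": (0.27, 1.10),
    "pro": (0.55, 2.20),
    "gpt-4o": (2.50, 10.00),
}


def monthly_cost(input_m: float, output_m: float, mix: dict[str, float]) -> float:
    """Monthly cost in USD for token volumes given in millions.

    `mix` maps a model key to its share of traffic; shares should sum to 1.
    """
    total = 0.0
    for model, share in mix.items():
        in_price, out_price = PRICES[model]
        total += share * (input_m * in_price + output_m * out_price)
    return total


if __name__ == "__main__":
    baseline = monthly_cost(100, 30, {"gpt-4o": 1.0})
    routed = monthly_cost(100, 30, {"flash": 0.9, "pro": 0.1})
    print(f"baseline: ${baseline:.2f}")
    print(f"routed:   ${routed:.2f}")
    print(f"reduction: {1 - routed / baseline:.0%}")
```

Change the mix to include a slice of GPT-4o traffic and you'll see why real-world savings come in lower than the clean hypothetical.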

Things I'd Do Differently

A few honest confessions:

I should have done this six months earlier. The signal was there the whole time in the invoices.
My first version of the router had way too many tiers. Three is the sweet spot for us. More than that and the operational overhead starts to eat into the savings.
I underestimated how much my team would resist the change. "We always used GPT-4o" is a real psychological barrier. The benchmark numbers helped. Showing people the dashboard with the cost counter helped more.

The Bottom Line

If you're reading this and your LLM bill looks suspiciously like ours did, here's the short version: the cheap models in 2026 are genuinely good. Not "good enough for non-critical stuff" good — actually good, with benchmark scores in the mid-80s on most tasks. And they're 5-10x cheaper than the frontier models that everyone defaults to.

The setup took me about a weekend, including the benchmarking harness. The actual code change was maybe two hours, most of which was arguing about the router design.

If you want to poke around the catalog yourself, Global API gives you 100 free credits to start with, which is enough to run a meaningful benchmark on their platform without pulling out a credit card. Check out global-apis.com/v1 if you want to see the full list of 184 models — they have everything from the deep-cut open-source stuff to the usual suspects, all behind a single OpenAI-compatible endpoint.

That's the play. Same code, same prompts, dramatically smaller invoice. Your CFO will thank you, and your engineers will have a slightly less stressful quarterly review.
