gentleforge

Posted on Jun 18

How I Cut My LLM Bill by 60% — A Backend Engineer's 2026 Field Notes

#ai #api #python #tutorial


Three months ago my team got handed a quarterly invoice from our LLM provider that made our finance lead physically wince. We were burning through GPT-4o for a customer-facing summarization feature that, honestly, did not need a frontier model. I spent the next six weeks rebuilding the whole thing against DeepSeek models routed through Global API, and the difference in our bill was not subtle. This is the story of how I did it, what worked, what didn't, and the numbers you should be looking at if you're running a NestJS service in 2026 that touches large language models.

A quick caveat before we dive in: I'm a backend engineer, not an ML researcher. I'm not going to hand-wave about emergent capabilities or benchmark deep-dives. I care about latency p99s, dollar cost per million tokens, and whether the thing breaks at 3am when traffic spikes. If that sounds like your world too, keep reading.

The Real Reason I Switched

When I first prototyped the summarization feature back in late 2025, I defaulted to GPT-4o because, well, that's what everyone defaults to. It worked. The summaries were good. Then I looked at the bill. At $2.50 per million input tokens and $10.00 per million output tokens, even a modest workload of a few hundred thousand requests a day becomes a mortgage payment.

I started doing what any sensible engineer would do: I asked whether we actually needed a frontier model. Spoiler: we didn't. Our use case was summarizing support tickets into 3-4 sentence recaps. That's a commodity task now, and paying frontier prices for commodity work is a kind of organizational self-harm I want no part of.

That's when I went hunting for alternatives and landed on DeepSeek models. The V4 series in particular caught my eye because the benchmarks were respectable and the pricing was an order of magnitude lower. The question was whether Global API's unified gateway would be a pain to integrate, and fwiw, it wasn't.

The Pricing Math That Made It Obvious

Before I show you the code, let me show you the numbers because the numbers are the whole point. Here's the comparison I built for my team lead:

Model	Input ($/M)	Output ($/M)	Context
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Let that sink in for a second. GPT-4o's output is roughly 9x more expensive than DeepSeek V4 Flash, and even DeepSeek V4 Pro — the bigger sibling — is about 4.5x cheaper than GPT-4o on output. If you're running a workload where you're generating a lot of tokens (which is most production LLM workloads I've seen), this is where the savings compound.
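To make the table concrete, here is a rough back-of-the-envelope calculator. Only the per-million prices come from the table above; the request volume and token counts are illustrative assumptions, not my production numbers, so plug in your own.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "GPT-4o": (2.50, 10.00),
}


def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    """Estimated monthly spend in USD for a fixed per-request token budget."""
    input_price, output_price = PRICES[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 300k requests/day, 2,000 input and 150 output tokens each.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 300_000, 2_000, 150):,.2f}/month")
```

Note that input tokens dominate a summarization workload, so the input price matters at least as much as the headline output price.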

For my specific use case, I went with DeepSeek V4 Flash as the default and DeepSeek V4 Pro as a fallback for the longest, most complex tickets. The quality was good enough that we didn't see a regression in our user satisfaction scores, and the cost reduction was 40-65% depending on the month. Your mileage will vary, obviously, but that's the ballpark.

One thing I want to flag here: I tested GLM-4 Plus too because the price is dirt cheap, and the quality was fine for short-form tasks. But the model felt less consistent on longer prompts, and I didn't want to debug that in production. Sometimes the cheapest option isn't actually the cheapest option once you factor in the engineering time to make it behave.

The Actual Code (Yes, Python, FWIW)

I know I framed this post around NestJS, but I want to talk about Python for a moment: the team already uses Python for our data pipeline, and we ended up using it for the LLM layer too. The NestJS service just calls out to the Python service over gRPC. If you're doing everything in TypeScript, you can use the OpenAI Node SDK with the same base URL, since the API is OpenAI-compatible.

Here's the core client setup:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def summarize_ticket(ticket_body: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {
                "role": "system",
                "content": "You are a support ticket summarizer. Produce a concise 2-3 sentence summary."
            },
            {"role": "user", "content": ticket_body},
        ],
        temperature=0.3,
        max_tokens=200,
    )
    return response.choices[0].message.content

That's the whole integration. Twenty lines of code, give or take. The OpenAI SDK handles retries, the gateway handles routing, and I get a single bill at the end of the month. The base URL is the only thing that changes from a vanilla OpenAI integration.

Now, if you're a TypeScript purist and want to stay in the NestJS world, here's the equivalent using the OpenAI Node SDK. This is what my colleague runs in his service:

import OpenAI from 'openai';
import { Injectable } from '@nestjs/common';

@Injectable()
export class SummarizationService {
  private client = new OpenAI({
    baseURL: 'https://global-apis.com/v1',
    apiKey: process.env.GLOBAL_API_KEY,
  });

  async summarize(text: string): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: 'deepseek-ai/DeepSeek-V4-Flash',
      messages: [
        { role: 'system', content: 'You are a support ticket summarizer.' },
        { role: 'user', content: text },
      ],
      temperature: 0.3,
      max_tokens: 200,
    });
    // content is typed as string | null, so fall back to an empty string
    return response.choices[0].message.content ?? '';
  }
}

Same shape, same contract. Add RFC 7807 problem-details error handling on top if you want to be a good citizen, but the core call is that simple.

Streaming Was a Bigger Win Than I Expected

I added streaming support about a week into the migration, and honestly the UX improvement was bigger than the cost reduction. When you're waiting on a 1.2-second response for a summary, the perceived latency is what kills you. Streamed responses feel instant even when the total time is identical.

Here's the streaming version, which I use in the HTTP endpoint that streams plain-text chunks over a long-lived connection:

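from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/summarize/stream")
async def stream_summarize(ticket_body: str = Body(..., embed=True)):
    # The ticket arrives in the JSON body as {"ticket_body": "..."} rather than
    # as a query parameter. `client` is the OpenAI client from the setup above.
    # generate() is a plain (sync) generator on purpose: Starlette iterates it
    # in a threadpool, so the blocking SDK call does not stall the event loop.
    def generate():
        stream = client.chat.completions.create(
            model="deepseek-ai/DeepSeek-V4-Flash",
            messages=[{"role": "user", "content": ticket_body}],
            stream=True,
        )
        for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip those.
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(generate(), media_type="text/plain")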

In my testing, the DeepSeek models on Global API hit around 320 tokens/second throughput, which is fast enough that streaming is essentially free. The 1.2-second average latency I mentioned earlier is the time-to-first-token, not the time-to-last-token. For a 200-token summary, you're looking at maybe 1.8 seconds total, but the user starts seeing text in under a second and a half. That matters more than you'd think.
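The arithmetic behind that estimate is simple enough to write down. This is a minimal sketch that assumes generation speed is constant after the first token, which real traffic will not perfectly honor.

```python
def total_latency_seconds(ttft_s, output_tokens, tokens_per_second):
    """Time to last token = time to first token + time spent generating."""
    return ttft_s + output_tokens / tokens_per_second


# Numbers from the paragraph above: 1.2 s TTFT, 200 tokens, 320 tokens/second.
estimate = total_latency_seconds(ttft_s=1.2, output_tokens=200, tokens_per_second=320)
print(f"{estimate:.3f} s")
```

The generation phase is the smaller share of the total here, which is why streaming changes how the wait feels far more than it changes the total time.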

Caching Is Where the Real Money Is

Let me be blunt: model pricing is a distraction. The single biggest cost optimization I've ever made on any LLM workload has been caching. Period. Not "use a cheaper model," not "negotiate a volume discount." Caching.

I implemented a simple two-tier cache: an in-memory LRU for hot keys, and Redis for everything else. The cache is keyed on a hash of the input, computed over a normalized version of the prompt to catch near-duplicates. I won't lie, this took me about two weeks to get right, and there were several false-positive cache hits that produced embarrassing summaries. But once it was dialed in, our hit rate sat around 40%, which cut the number of paid LLM calls by roughly the same 40% on its own.

Here's a sketch of what the cache layer looks like:

import hashlib
import json
import os

import redis

MODEL = "deepseek-ai/DeepSeek-V4-Flash"  # same model ID that summarize_ticket() calls
r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)
CACHE_TTL = 60 * 60 * 24  # 24 hours

def cache_key(prompt: str, model: str) -> str:
    # Normalize so whitespace and casing differences map to the same entry.
    normalized = prompt.strip().lower()
    digest = hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()
    return f"summary:{digest}"

def summarize_with_cache(ticket_body: str) -> str:
    key = cache_key(ticket_body, MODEL)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)["summary"]

    # Cache miss: pay for one LLM call, then store the result with a TTL.
    summary = summarize_ticket(ticket_body)
    r.setex(key, CACHE_TTL, json.dumps({"summary": summary}))
    return summary

The 40% hit rate number isn't magic. It comes from the fact that support tickets, in any reasonable support org, cluster around a small number of recurring issues. "I can't log in" is the same prompt 800 times a day, no matter how you slice it.
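The Redis snippet above only covers the second tier. For the in-memory LRU in front of it, here is a minimal sketch of the idea, not my production code: the class names and the `shared_get`/`shared_set` callables are hypothetical, and a plain dict stands in for Redis so the example runs on its own.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LRUCache:
    """Tiny in-process LRU; evicts the least recently used key when full."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


class TwoTierCache:
    """Checks the local LRU first, then a slower shared store, then computes."""

    def __init__(self, shared_get: Callable, shared_set: Callable, capacity: int = 1024) -> None:
        self.local = LRUCache(capacity)
        self.shared_get = shared_get
        self.shared_set = shared_set

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        value = self.local.get(key)
        if value is not None:
            return value
        value = self.shared_get(key)
        if value is None:
            value = compute()  # the expensive LLM call happens only here
            self.shared_set(key, value)
        self.local.put(key, value)  # promote to the hot tier
        return value


# Demo: a dict stands in for Redis, and a fake function stands in for the LLM.
shared = {}
cache = TwoTierCache(shared.get, shared.__setitem__, capacity=2)
calls = []


def fake_summarize() -> str:
    calls.append(1)
    return "User cannot log in."


first = cache.get_or_compute("summary:abc", fake_summarize)
second = cache.get_or_compute("summary:abc", fake_summarize)
print(first == second, len(calls))
```

To wire this to the Redis layer above, `shared_get` would wrap `r.get` and `shared_set` would wrap `r.setex` with `CACHE_TTL`. Keep in mind that redis-py returns bytes unless the client is created with `decode_responses=True`.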

Things I Learned The Hard Way

Let me share a few of the gotchas I hit during the migration, in no particular order:

1. Rate limits are per-model, not per-account. This caught us out the first week when we suddenly started getting 429s during a traffic spike. We had been hammering DeepSeek V4 Pro and hit the per-model limit. The fix was to implement a fallback chain: try V4 Pro first, fall back to V4 Flash, fall back to Qwen3-32B. This also gives you graceful degradation under load, which is a feature, not a bug.

2. Quality monitoring is non-negotiable. I built a tiny eval harness that samples 1% of our production traffic, runs the same prompts through both DeepSeek V4 Flash and the previous GPT-4o pipeline, and compares the outputs using an LLM-as-judge pattern. The agreement rate is around 84.6% on average, which is high enough that I'm comfortable with the swap, but I watch the metric weekly. If it drops, I want to know before the user complaints start.

3. Don't trust the marketing benchmarks. I'm looking at you, every model card that claims "GPT-4 quality at 1/10 the cost." Run your own eval. Your prompts are not the MMLU. The quality numbers in this post are what I get on my workload, with my prompts, and the DeepSeek models I use are the ones I picked after running my own benchmark sweep, not the ones with the slickest landing page.

4. Context window matters more than you think. I almost shipped a regression because I assumed a 128K context window was plenty for a 2,000-token support ticket. It is, in the sense that the model can read the ticket. But there's a difference between "can read" and "actually pays attention to." For really long context, the quality of retrieval degrades. If you're working with long documents, test before you trust.

5. The GA-Economy tier is real money. I haven't used it much for this particular workload, but for a separate internal tool where I'm doing simple classification, the GA-Economy tier on Global API cuts our costs by another 50% on top of the DeepSeek pricing. Worth exploring.
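The fallback chain from gotcha 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions: `RateLimited` is a hypothetical stand-in for a 429 from the gateway, and the model strings are placeholders rather than real model IDs.

```python
from typing import Callable, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 response from the gateway."""


def call_with_fallback(models: Sequence[str], call: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in order; return (model_used, result) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call(model)
        except RateLimited as exc:
            last_error = exc  # this model is throttled, move on to the next one
    raise RuntimeError("all models in the chain were rate limited") from last_error


# Demo: pretend the first model in the chain is currently throttled.
def fake_call(model: str) -> str:
    if model == "v4-pro":
        raise RateLimited("429 Too Many Requests")
    return f"summary from {model}"


used, result = call_with_fallback(["v4-pro", "v4-flash", "qwen3-32b"], fake_call)
print(used)
```

With the OpenAI Python SDK, the exception to catch would be `openai.RateLimitError`, and `call` would wrap `client.chat.completions.create`. Returning the model that actually answered makes it easy to log how often the chain degrades.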

The Stack I'd Recommend in 2026

Putting it all together, here's what I'd recommend to another backend engineer who walked into my shoes:

Default model: DeepSeek V4 Flash for most tasks
Escalation model: DeepSeek V4 Pro for long-context or higher-stakes tasks
Fallback: Qwen3-32B or GLM-4 Plus for graceful degradation
Gateway: Global API, which gives you access to all 184 models (or however many there are by the time you read this) through a single OpenAI-compatible endpoint
Pricing range: $0.01 to $3.50 per million tokens depending on the model tier
Caching: Redis with a 24-hour TTL and prompt normalization
Streaming: Always, unless you have a good reason not to
Monitoring: LLM-as-judge eval on 1% of production traffic
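
For the monitoring item, the two mechanical pieces are picking the 1% sample and turning judge verdicts into an agreement rate. Here is a minimal sketch of both; the function names and the hash-based sampling are my illustration of one way to do it, and the judge call itself is left out.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def agreement_rate(verdicts) -> float:
    """Fraction of judged pairs that the judge marked as equivalent."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Demo with synthetic request IDs; real IDs would come from your request logs.
sampled = sum(in_eval_sample(f"req-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100000 requests")
print(agreement_rate([True, True, False, True]))
```

Hashing the request ID instead of calling a random number generator means the same request is always in or out of the sample, which makes a bad week reproducible when you go digging.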

End-to-end setup time was under 10 minutes for the basic integration, and maybe two weeks of real engineering work to get the caching, monitoring, and fallback chain into shape. That's a pretty good ROI for a 40-65% cost reduction.

One Last Thing

I want to be honest about the limits of this approach. DeepSeek is not GPT-4o. For some tasks, frontier models really are better, and the price gap is justified. If you're doing complex reasoning, multi-step planning, or anything where correctness really matters, do your own evaluation. Don't take my word for it.

Top comments (1)

caishen-ai • Jun 18

This is the kind of field report we need more of. The pricing comparison table alone is worth bookmarking.

Your point about "the cheapest option isn't always the cheapest once you factor in engineering time" is spot on. We had a similar experience — tried GLM-4 Plus for a cost-sensitive pipeline, and the debugging overhead ate up the savings within the first week. DeepSeek V4 Flash ended up being the sweet spot for us too.

One thing we discovered: prompt caching combined with streaming cut our effective cost by another ~20% on top of the model switch. If you're handling repetitive ticket summaries (which it sounds like you are), caching the system prompt + common prefixes is basically free money.

Are you using any observability layer to track cost per request in production? We found that without per-endpoint cost metrics, it's easy for one noisy endpoint to silently blow up the budget.