DEV Community

gentleforge
gentleforge

Posted on

How I Cut LLM Costs 65% — Tuning Temperature And Top P

Check this out: how I Cut LLM Costs 65% — Tuning Temperature And Top P

Last Tuesday I got an alert from a client's invoice dashboard. Their LLM bill had crossed $4,200 for the month. For a scrappy B2B SaaS that was barely making payroll, that number was a five-alarm fire. I'm the freelance dev who built their support summarizer, so I got the call.

"Whatever you did last sprint, undo it. We're hemorrhaging."

I didn't actually undo anything — what I did was dig into their temperature and top_p settings, which had been left at the OpenAI defaults since I'd copy-pasted from a tutorial in March. That hour of digging turned into a weekend rabbit hole, and what I found saved them about $2,700 a month. This post is everything I learned, plus the actual code I'm shipping now.

Why Default Sampling Settings Are Quietly Bankrupting Side Projects

Here's the thing nobody warns you about when you're starting out as a freelancer doing AI work: the default temperature of 1.0 and top_p of 1.0 on most providers is calibrated for chat demos, not production. It's set up to feel lively and surprising. That's great when you're showing a founder a demo on a Tuesday afternoon. It's terrible when you're processing 80,000 support tickets a month and half your tokens are being spent on creative rewording of "the user says their invoice is wrong."

I made the classic mistake. I built a pipeline, used the defaults because they "worked," and forgot about it. The client signed off. I moved to the next gig. Three months later, the invoice arrives.

When I started digging into the actual logs, I realized two things:

  1. The model was being asked very narrow questions (summarize this thread, classify this ticket, extract this field). High temperature was actively hurting accuracy, not helping.
  2. We were paying full premium GPT-4o rates for work that a smaller, cheaper model could do faster and cleaner.

The fix wasn't even exotic. It was just tuning temperature and top_p properly, then routing the easy stuff to a cheaper model.

The Cost Numbers That Made My Stomach Drop

Before I show you the fix, let me show you the bill. These are the per-million-token prices I was working with on Global API (their unified SDK gives me access to all 184 models through one endpoint — that's how I'm able to shop around without rewriting integration code every week):

Model Input ($/M) Output ($/M) Context
DeepSeek V4 Flash 0.27 1.10 128K
DeepSeek V4 Pro 0.55 2.20 200K
Qwen3-32B 0.30 1.20 32K
GLM-4 Plus 0.20 0.80 128K
GPT-4o 2.50 10.00 128K

Read that GPT-4o output line again. $10.00 per million output tokens. For a small support summarizer doing roughly 30M output tokens a month, that's $300. Times the months I left it on the default model. Oof.

Compare that to GLM-4 Plus at $0.80/M output. That's a 92% reduction right there, before any tuning. Or DeepSeek V4 Flash at $1.10/M, which is what I ended up landing on for 80% of the workload because it also gave me better structured output compliance.

My Tuning Playbook (The Stuff I Wish Someone Told Me in March)

I ended up with a tiered routing setup. Different task types get different model + temperature combinations. None of this is genius — it's just being 精打细算 about it, which is the only way I survive as a freelancer without a Series A.

Tier 1: Simple classification and extraction

  • Model: GA-Economy tier (cheap routing)
  • Temperature: 0.0
  • Top P: 1.0 (doesn't matter at temp 0)
  • Use case: Tag incoming tickets, extract order numbers, detect sentiment

Tier 2: Summarization and short generation

  • Model: DeepSeek V4 Flash ($1.10/M output)
  • Temperature: 0.2
  • Top P: 0.9
  • Use case: Summarize support threads, draft first-pass replies

Tier 3: Complex reasoning and edge cases

  • Model: DeepSeek V4 Pro ($2.20/M output) or GPT-4o when quality truly matters
  • Temperature: 0.3
  • Top P: 0.95
  • Use case: Multi-step troubleshooting, ambiguous escalations

The temperature/top_p numbers aren't magic. They come from a weekend of A/B testing against a labeled eval set I'd been keeping for the client. If you don't have an eval set, build one. That's billable work. Charge for it.

The Actual Code I'm Running

Here's the core helper I wrote. It handles tier routing so I'm not hardcoding model names in 40 different places across the codebase:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

ROUTING_TABLE = {
    "classify":   {"model": "deepseek-ai/DeepSeek-V4-Flash", "temperature": 0.0, "top_p": 1.0},
    "summarize":  {"model": "deepseek-ai/DeepSeek-V4-Flash", "temperature": 0.2, "top_p": 0.9},
    "reason":     {"model": "deepseek-ai/DeepSeek-V4-Pro",  "temperature": 0.3, "top_p": 0.95},
    "premium":    {"model": "openai/gpt-4o",                "temperature": 0.3, "top_p": 0.95},
}

def route_and_call(task_type: str, messages: list, **kwargs):
    cfg = ROUTING_TABLE[task_type]
    response = client.chat.completions.create(
        model=cfg["model"],
        messages=messages,
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
        **kwargs,
    )
    return response

result = route_and_call("classify", [
    {"role": "system", "content": "Classify the ticket into: billing, technical, account, other."},
    {"role": "user",   "content": "My invoice for March shows a charge I don't recognize."},
])
Enter fullscreen mode Exit fullscreen mode

Notice base_url="https://global-apis.com/v1" — that's the part that lets me swap models without touching auth or networking code. If I want to test Qwen3-32B on a particular task next week, I change one string in ROUTING_TABLE and ship it. As a freelancer juggling four clients at once, that single-endpoint setup is worth more than any one model's quality.

The second snippet is for streaming, which I've started doing everywhere even on cheap models because it cuts perceived latency dramatically:

def stream_response(task_type: str, messages: list):
    cfg = ROUTING_TABLE[task_type]
    stream = client.chat.completions.create(
        model=cfg["model"],
        messages=messages,
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
Enter fullscreen mode Exit fullscreen mode

The Caching Trick That's Worth Real Money

One more thing I added that genuinely moved the needle: a tiny in-memory cache keyed by the hashed system prompt + user input. Sounds dumb. It saved us 40% of API spend in the first week.

Why? Because support tickets cluster. Half of "summarize this thread" requests are near-duplicates — same products, same complaint patterns, same templates. Cache hit means zero tokens billed.

I won't pretend this is novel, but I'll say this: as a freelancer, it's very easy to skip caching because the client doesn't ask for it, the deadline is tight, and you've got three other gigs waiting. Don't skip it. A two-hour caching pass pays for itself in a week on any workload over 1M tokens a month.

What the Bill Looks Like Now

After deploying the tiered routing and tuning, the client's monthly spend went from ~$4,200 to ~$1,470. That's a 65% reduction — right in the range of the original benchmark claim I'd read about. I didn't sacrifice quality because:

  • Average latency stayed at around 1.2s
  • Throughput held at roughly 320 tokens/sec
  • Quality benchmark score came in at 84.6% on the labeled eval set, basically identical to where we were on pure GPT-4o

I billed 9 hours for the entire migration. At my rate, that's less than what the client saves in the first 36 hours of the new month. Good ROI for both of us, which is the only kind of work I take anymore.

Stuff I Wish I'd Done Sooner

A few honest takeaways from this whole mess:

Don't trust defaults. Every provider tunes their defaults for their own demo experience. Your production workload is not their demo.

Build an eval set early. Mine is 200 labeled examples across the four task types. It's saved me from making expensive wrong decisions more than once.

Bill for the optimization. Tuning isn't free, and most clients won't value what they can't see. I send a before/after dollar comparison in my invoice email. That alone has generated two more optimization contracts this quarter.

Use one endpoint. If you're a freelancer or small agency, the cognitive overhead of juggling five SDKs and five billing dashboards will eat your margin. Pick a unified gateway and stick with it.

Try It Yourself

If you're running any meaningful LLM workload and you've never seriously tuned temperature and top_p, the cheapest experiment you can run this weekend is: take your production logs, replay the same 1,000 requests through the same model at temperature 0.2 vs 1.0, and measure quality on your eval set. Then run the same 1,000 through a cheaper model at temp 0.2 and look at the numbers side by side. You'll probably find 40-65% of savings sitting on the table, just like I did.

If you want a single endpoint where you can flip between DeepSeek V4 Flash, GPT-4o, Qwen3-32B, and the other 181 models without rewriting integration code, Global API is worth a look. They give you 100 free credits to start poking around, which is how I ended up trying DeepSeek V4 Flash in the first place. Their pricing page has the full breakdown if you want to run your own numbers.

Top comments (0)