bolddeck

Posted on Jun 13

Tuning Temperature and Top-P: A Backend Engineer's Field Notes

#python #ai #webdev #tutorial

I've spent the last six months running LLM workloads in production, and if there's one thing that's bitten me more than once, it's the temptation to just leave temperature and top_p at their defaults. fwiw, the defaults aren't bad — but they're not optimised for your workload either. This post is the guide I wish someone had handed me before I started tuning sampling parameters on 184 different models.

Let me walk you through what actually matters in 2026, with real numbers pulled from benchmarks and pricing data I've gathered while shipping a deep_dive analytics pipeline. Some of these results surprised me. Some of them made me want to throw my laptop.

Why Sampling Parameters Aren't Optional Anymore

Here's the thing nobody tells you when you start integrating LLMs: temperature and top_p aren't knobs you fiddle with during a hackathon. They're load-bearing parameters. They decide whether your chatbot hallucinates a fake return policy, whether your summarization pipeline produces coherent paragraphs, and whether your code-completion feature actually completes the code or invents a function from another dimension.

When I first started, I treated these parameters like set-and-forget config. Then I watched our eval suite go sideways because someone bumped temperature from 0.2 to 0.7 "to make it feel more creative." Took me three days to trace the regression. Lesson learned.

Now I treat sampling parameters the same way I treat database connection pool sizes — something that gets reviewed every quarter and tuned against actual usage data.

The Current Pricing Landscape (Why This Even Matters)

Before we dive into parameters, let's talk about what you're paying for. Because sampling behavior interacts with cost in ways that aren't obvious. A higher temperature doesn't change the price per token, but it absolutely changes how many tokens you burn on retries when the model goes off the rails.

Here's what I'm working with on Global API right now:

Model	Input ($/M)	Output ($/M)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Look at GPT-4o. 2.50 input, 10.00 output. Per million tokens. That's not a typo. And if you're running that on temperature=0.9 with no top_p constraint, you're not just paying premium prices — you're paying premium prices for stochastic nonsense.

Now look at GLM-4 Plus. 0.20 input, 0.80 output. Same 128K context. The sampling parameters behave similarly across these models, but the cost-per-explosion when temperature goes rogue is dramatically different. This is why I always tune sampling before I tune model selection.

Temperature: What It Actually Does (RFC 9110 Style Explanation)

imo, the worst explanation of temperature I've ever read was "higher = more creative." That's like explaining a database connection pool as "bigger = faster." Technically not wrong, completely useless.

Here's the actual mechanism. The model produces a probability distribution over its vocabulary. Temperature divides the logits by T before the softmax. At T=1.0, you get the model's raw distribution. At T=0.5, you sharpen it — the model becomes more confident in its top picks. At T=2.0, you flatten it — suddenly "the" and "quantum" are competing for the same probability mass.
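To make the mechanism concrete, here is a minimal sketch of temperature scaling in plain Python. The three logits are made-up values for illustration, and the function name is hypothetical; real inference stacks do this over the whole vocabulary on the GPU, and typically treat T=0 as a special case (greedy argmax) rather than dividing by zero.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; T=0 is usually handled as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share growing as T drops below 1.0 and shrinking as T rises above it, which is the sharpening and flattening described above.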

In production, I default to:

0.0–0.2 for structured extraction, classification, JSON output
0.3–0.5 for summarization, rewriting, translation
0.7–1.0 for creative writing, brainstorming, ideation

Notice I didn't say "0.7 for chatbots." Because every chatbot is different. A customer support bot should probably be at 0.3. A role-playing game NPC should probably be at 0.9.

Top-P: The Nucleus Sampling Trap

Top-p (also called nucleus sampling) is where things get spicy. Instead of sampling from the full distribution, you sample from the smallest set of tokens whose cumulative probability reaches p. So at top_p=0.9, you keep the most likely tokens that together cover 90% of the probability mass and cut off the long tail.
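A minimal sketch of that cropping step, assuming a toy four-token distribution (the tokens and probabilities are invented, and top_p_filter is a hypothetical helper, not a library function). It uses "cumulative mass reaches top_p" as the cutoff, which is one common convention; exact tie and boundary handling varies between implementations.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # everything after this point is the discarded tail
    mass = sum(p for _, p in kept)
    return {token: p / mass for token, p in kept}


# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "quantum": 0.05}
print(top_p_filter(probs, 0.9))
```

With top_p=0.9 the first three tokens are kept and "quantum" is dropped, and the survivors are rescaled so they sum to 1 again.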

The official OpenAI docs recommend either temperature or top_p, not both. This is sound advice and I agree with it. But let me explain why with the kind of clarity I wish the docs had:

When you adjust temperature, you're reshaping the distribution. When you adjust top_p, you're cropping it. Doing both is like zooming in with a lens and then cropping the photo. The result is unpredictable unless you have empirical data on that specific combination.

In my pipeline, I use one or the other:

Temperature alone for most cases (because it's continuous and predictable)
Top_p=0.95 with temperature=1.0 for specific creative-writing workloads
Never both, unless I have a benchmark suite telling me it's safe

A Real Code Example (Python)

Let me show you the actual setup I'm running. This uses the OpenAI SDK pointed at Global API's unified endpoint, which is honestly one of the nicer DX choices I've made this year.

import json
import os

import openai

# OpenAI SDK client pointed at Global API's OpenAI-compatible endpoint
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def classify_ticket(subject: str, body: str) -> dict:
    """Classify a support ticket into a category. Low temperature for stability."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a ticket classifier. "
                    "Respond with a JSON object: "
                    '{"category": "billing|technical|account|other", '
                    '"confidence": 0.0-1.0}'
                ),
            },
            {"role": "user", "content": f"Subject: {subject}\n\nBody: {body}"},
        ],
        temperature=0.1,  # We want consistency, not creativity
        top_p=1.0,         # Let temperature do the work
        response_format={"type": "json_object"},
    )
    # message.content is a JSON string; parse it so the return value is really a dict
    return json.loads(response.choices[0].message.content)

Notice what I did there. Temperature=0.1, top_p=1.0. That's deliberate. For classification, I want the model to commit to its top prediction, not explore alternatives. DeepSeek V4 Flash at $0.27 input / $1.10 output per million tokens means I can run thousands of these classifications per dollar. At GPT-4o pricing ($2.50/$10.00), the same volume would cost roughly 9x more.
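Low temperature makes the reply more consistent, but it doesn't guarantee the reply matches the schema in the system prompt. A minimal validation sketch for the raw reply string follows; parse_classification and ALLOWED_CATEGORIES are hypothetical names, and the allowed values are simply the four categories listed in the prompt above.

```python
import json

# Categories taken from the system prompt in the classifier example
ALLOWED_CATEGORIES = {"billing", "technical", "account", "other"}


def parse_classification(raw: str) -> dict:
    """Parse and validate the classifier's JSON reply; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError(f"confidence must be a number, got {confidence!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return {"category": category, "confidence": float(confidence)}


print(parse_classification('{"category": "billing", "confidence": 0.93}'))
```

Counting how often this raises is also a cheap way to get a retry-rate number for a given temperature.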

When You Actually Want Creativity

Here's the flip side. Sometimes you want the model to explore. For a feature I'm building that generates product names from feature descriptions, low temperature would be catastrophic. I'd get the same three names every time.

def generate_product_names(feature_description: str, n: int = 5) -> list[str]:
    """Generate candidate product names. Higher temperature for diversity."""
    response = client.chat.completions.create(
        model="qwen3-32b",  # Good balance of creativity and coherence
        messages=[
            {
                "role": "system",
                "content": (
                    "Generate short, marketable product names. "
                    "Be inventive but pronounceable. "
                    "Return one name only, with no extra text."
                ),
            },
            {"role": "user", "content": feature_description},
        ],
        temperature=0.8,
        top_p=0.95,    # Crop the really weird tokens
        n=n,           # Generate multiple candidates in one call, one name per choice
        max_tokens=200,
    )
    # n=n yields n choices, so collect every one instead of only choices[0]
    return [c.message.content.strip() for c in response.choices if c.message.content]

The combination temperature=0.8, top_p=0.95 is something I landed on after running about 200 generation tasks through manual eval. Higher than 0.8 temperature and the names started getting unhinged ("Quantum Flux Capacitor" for a to-do app). Lower than 0.95 top_p and we occasionally cut off genuinely good candidates. It's a narrow band, but it works.

Production Best Practices (From The Trenches)

Let me share the actual practices that survived contact with real users. These are not aspirational — these are what we do.

1. Pin your sampling parameters per use case, not per model. I have a config file that maps use cases to sampling params. Classification → 0.1. Summarization → 0.3. Creative → 0.8. The model can change underneath without the behavior shifting.

2. Cache aggressively. This isn't a sampling-parameter tip, but it's the biggest cost lever you have. A 40% hit rate on your response cache means roughly 40% of those requests never reach the model, so you don't pay for them at all. fwiw, I cache on normalized prompt + temperature + top_p, so different sampling configs don't collide.

3. Stream responses for UX, not for cost. Streaming doesn't reduce token usage. It reduces perceived latency. Users see the first token faster and feel like the system is snappier. Use it for anything user-facing over 200 tokens.

4. Use economy-tier models for simple queries. GLM-4 Plus at 0.20 input is shockingly capable for short, well-scoped tasks. The cost reduction is real: in the pricing table above it comes in about a third cheaper than Qwen3-32B and about 64% cheaper than DeepSeek V4 Pro, and quality is fine if you keep your prompts tight.

5. Monitor quality with real eval suites, not vibes. I run a weekly eval that replays 500 production prompts through the current config and scores the outputs against a held-out gold set. When the score drops by more than 2%, I investigate. This caught a temperature regression last month that would've shipped to production otherwise.

6. Implement fallback for rate limits. Even with 184 models available, you will hit rate limits. Always have a secondary model configured with a relaxed temperature (slightly higher for diversity in fallback outputs is fine).
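Practices 1 and 2 fit in a few lines of code. A minimal sketch, under stated assumptions: SamplingConfig, the use-case keys, normalize_prompt, and cache_key are hypothetical names rather than the actual config file, the whitespace-collapsing normalization is just one possible choice, and including the model name in the cache key is an extra assumption on top of the prompt + temperature + top_p key described above.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float = 1.0


# Values taken from the post; the dictionary keys are illustrative
SAMPLING_BY_USE_CASE = {
    "classification": SamplingConfig(temperature=0.1),
    "summarization": SamplingConfig(temperature=0.3),
    "creative": SamplingConfig(temperature=0.8, top_p=0.95),
}


def normalize_prompt(prompt: str) -> str:
    """Collapse runs of whitespace so trivially different prompts share a cache entry."""
    return " ".join(prompt.split())


def cache_key(model: str, prompt: str, config: SamplingConfig) -> str:
    """Hash model + normalized prompt + sampling params into a stable cache key."""
    payload = json.dumps(
        {
            "model": model,  # assumption: keep different models from colliding
            "prompt": normalize_prompt(prompt),
            "temperature": config.temperature,
            "top_p": config.top_p,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = SAMPLING_BY_USE_CASE["classification"]
print(cache_key("deepseek-ai/DeepSeek-V4-Flash", "Subject:  refund\n", config))
```

Because the config is looked up by use case, swapping the model string changes the cache key but not the sampling behavior.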

Benchmarks I'm Seeing in 2026

Here's the high-level picture from my internal testing across the Global API catalog. These numbers are for a deep_dive workload — long-context analysis with structured output requirements:

Average cost reduction: 40–65% vs running on GPT-4o for everything
Average latency: 1.2s to first token across the catalog
Average throughput: 320 tokens/sec on mid-tier models
Average benchmark score: 84.6% across my eval suite

The "average" hides a lot. Some workloads do better, some worse. But the headline is: tuning your sampling parameters and your model selection together gives you dramatically more headroom than either alone.

Common Mistakes I See (And Have Made)

Let me catalog the errors I've personally shipped or reviewed, so you don't have to:

Mistake 1: Setting temperature to 0 and expecting deterministic output. It's not guaranteed to be deterministic. Floating-point math across different hardware, batching, and model version changes can all introduce variance. Setting seed alongside temperature=0 (which is effectively greedy decoding) narrows the variance, but on hosted APIs reproducibility is typically best-effort, so measure it instead of assuming it.

Mistake 2: Using temperature=0.7 because "that's what OpenAI's docs say for chat." That number is a popular compromise, not a documented recommendation (the OpenAI API's own default is 1). It works for the average chat workload. Your workload is not the average chat workload.

Mistake 3: Setting top_p=0.5 "to avoid hallucination." Low top_p doesn't prevent hallucination. It prevents the model from choosing unlikely tokens, but hallucinations come from confidently wrong predictions, not unlikely tokens. This is a misconception I've had to correct in code review more times than I can count.

Mistake 4: Not benchmarking sampling parameter changes. "I just changed temperature from 0.7 to 0.3" — did you re-run the eval suite? Did you check production metrics? If not, you're flying blind.

Mistake 5: Mixing sampling parameters across models without re-evaluating. A temperature that works on DeepSeek V4 Pro may not work on GLM-4 Plus. The distributions are calibrated differently. Always re-eval when you switch models.
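For Mistake 1, the quickest way to find out how deterministic a config really is would be to send the same prompt several times and count the distinct replies. A minimal sketch, assuming you wrap your API call in a function that takes a prompt and returns text; count_distinct_outputs is a hypothetical helper, and fake_generate is a stand-in so the example runs without network access.

```python
from collections import Counter
from typing import Callable


def count_distinct_outputs(
    generate: Callable[[str], str], prompt: str, runs: int = 10
) -> Counter:
    """Call generate(prompt) `runs` times and tally each distinct (stripped) reply."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return Counter(generate(prompt).strip() for _ in range(runs))


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same text."""
    return "billing\n"


tally = count_distinct_outputs(fake_generate, "Subject: refund", runs=5)
print(f"{len(tally)} distinct output(s): {dict(tally)}")
```

In a real check, generate would wrap the chat completion call with temperature=0 (and seed, if your provider supports it); more than one key in the tally means the output is not deterministic for that prompt.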

The Cost Math That Actually Matters

Let me do some napkin math so this isn't abstract. Suppose you're processing 10 billion output tokens per month on classification tasks (counting output tokens only, to keep the napkin small).

On GPT-4o at temperature=0.7 with no tuning: 10,000M × $10.00/M = $100,000/month. Plus you have a 12% retry rate from inconsistent outputs, so add another ~$12,000. Total: $112,000.

On DeepSeek V4 Flash at temperature=0.1 with proper prompt caching: 10,000M × $1.10/M = $11,000/month. Retry rate drops to 2%. That's $220. Total: $11,220.

That's a 90% reduction. And the quality is better because the outputs are more consistent.
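The reduction doesn't depend on volume, so it can be checked per million output tokens. A small sketch using the output prices and retry rates quoted above, assuming each retry re-bills the same output volume once and ignoring input tokens; the function name is hypothetical.

```python
def effective_cost_per_million(price_per_million: float, retry_rate: float) -> float:
    """Output cost per million tokens once retries are billed on top."""
    if retry_rate < 0:
        raise ValueError("retry_rate cannot be negative")
    return price_per_million * (1.0 + retry_rate)


gpt4o = effective_cost_per_million(10.00, 0.12)  # GPT-4o output price, 12% retries
flash = effective_cost_per_million(1.10, 0.02)   # DeepSeek V4 Flash output price, 2% retries
reduction = 1.0 - flash / gpt4o

print(f"GPT-4o: ${gpt4o:.2f}/M, DeepSeek V4 Flash: ${flash:.3f}/M")
print(f"Reduction: {reduction:.1%}")
```

Plugging in your own measured retry rates is the useful part: the retry term is where sampling parameters show up on the bill.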

If you're a startup burning runway, this is the difference between 18 months of runway and 6 months. fwiw, I've seen this exact analysis change a company's hiring plans.

One More Code Snippet: The Eval Harness

Here's the eval pattern I use weekly. It's unglamorous but it works:

import json
from pathlib import Path

EVAL_SET = Path("./eval/gold_set.jsonl")

def run_sampling_eval(model: str, temperature: float, top_p: float) -> float:
    """Run the eval suite and return the match rate."""
    correct = 0
    total = 0

    with EVAL_SET.open() as f:
        for line in f:
            case = json.loads(line)
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": case["system"]},
                    {"role": "user", "content": case["user"]},
                ],
                temperature=temperature,
                top_p=top_p,
            )
            output = response.choices[0].message.content.strip()
            if output == case["expected"]:
                correct += 1
            total += 1

    return correct / total if total else 0.0

# Sweep across configurations
for temp in [0.0, 0.1, 0.3, 0.5, 0.7]:
    score = run_sampling_eval("deepseek-ai/DeepSeek-V4-Flash", temp, 1.0)
    print(f"temperature={temp}: score={score:.3f}")

This takes maybe 10 minutes to run against a 500-case gold set. The results almost always surprise me. Sometimes 0.1 is best, sometimes 0.3. The point is: you don't know until you measure.
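Once the sweep has produced scores, two small helpers turn them into a decision. This is a sketch with hypothetical names and made-up scores: best_temperature picks the winning config (preferring the lower temperature on ties), and has_regressed applies the "more than 2%" alarm from the weekly eval, interpreted here as an absolute drop of 0.02 in match rate, which is an assumption.

```python
def best_temperature(scores: dict[float, float]) -> float:
    """Return the temperature with the highest score; ties go to the lower temperature."""
    if not scores:
        raise ValueError("scores is empty")
    return min(scores, key=lambda temp: (-scores[temp], temp))


def has_regressed(baseline: float, current: float, max_drop: float = 0.02) -> bool:
    """True when the score fell by more than max_drop (absolute) versus the baseline."""
    return (baseline - current) > max_drop


# Hypothetical sweep results: temperature -> match rate
sweep = {0.0: 0.81, 0.1: 0.85, 0.3: 0.85, 0.5: 0.79, 0.7: 0.74}
print("best temperature:", best_temperature(sweep))
print("regressed:", has_regressed(baseline=0.846, current=0.82))
```

Storing last week's score as the baseline makes the regression check a one-line gate in CI.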

Closing Thoughts

If you take one thing from this post, take this: temperature and top_p are not defaults to ignore. They're tunable parameters that affect cost, quality, and user experience in measurable ways. The cheapest improvement you can make to your LLM pipeline this week is running a sampling-parameter sweep against your eval suite.

And if you're choosing where to run these workloads, I'd genuinely recommend checking out Global API. The unified SDK pointing at global-apis.com/v1 means I can swap models without rewriting integration code, which is the kind of DX detail that matters when you're A/B testing sampling parameters across different models. The pricing tiers I showed above — from 0.20 to 10.00 per million tokens — give you the headroom to actually experiment instead of treating every API call like it costs a kidney.

That's it from me. Go tune your sampling parameters, measure twice, ship once.

DEV Community
