Temperature vs top-p: Which Knob to Turn and When

#ai #llm #prompt #tutorial

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You're tuning an extraction prompt that keeps returning slightly wrong JSON. Someone on the team drops temperature=0.2 into the call. The output gets a bit steadier. So the next person, chasing the last few percent, also sets top_p=0.5. Now the output is steadier still, but nobody can explain why. When a regression ships three weeks later, the two settings interact in a way no one can reason about.

That's the trap. Both knobs control the same thing from two different angles, and turning both at once means you've stopped controlling anything you can name.

What the two knobs actually do

A language model doesn't pick the next token. It produces a probability for every token in its vocabulary. Sampling is how you turn that distribution into one choice. Temperature and top-p are two different ways to reshape the distribution before the draw.

Temperature rescales the whole distribution. Below 1.0 it sharpens the curve, pushing probability toward the already-likely tokens and starving the long tail. Above 1.0 it flattens the curve, handing more weight to unlikely tokens. At 0.0 the model becomes greedy: it takes the single highest-probability token every time. Temperature touches every token's odds at once.
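To make the rescaling concrete, here is a minimal sketch of temperature applied to raw token scores. The logits are made up for illustration, and treating temperature 0 as a plain argmax is an assumption about how providers handle that edge case, not a documented guarantee.

```python
import math

def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # Assumed behavior: treat 0 as greedy, all mass on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate tokens.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.0, 0.3, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Every probability moves when the temperature changes, but above zero none of them reaches exactly zero. That is the sense in which temperature reshapes rather than removes.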

Top-p (nucleus sampling) doesn't rescale anything. It truncates. It sorts tokens by probability, walks down the list adding them up, and stops once the cumulative mass reaches p. Everything past that cutoff is discarded; the model samples only from the surviving set. With top_p=0.9, you keep the smallest group of tokens whose combined probability is 90%, and drop the rest entirely.
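The truncation step can be sketched in a few lines. This is an illustration of the idea, not any provider's implementation; the function name and the toy probabilities are assumptions, and real implementations may differ in how they treat a token sitting exactly on the cutoff.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p.

    Returns {token_index: renormalized probability} for the survivors.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token probabilities.
probs = [0.5, 0.3, 0.15, 0.05]

print(top_p_filter(probs, 0.9))  # the least likely token is dropped
print(top_p_filter(probs, 0.5))  # only the top token survives
```

The survivors keep their relative odds; the dropped tokens have no chance of being drawn.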

Here is the distinction that matters. Temperature changes how the weight is spread. Top-p changes which tokens are allowed to be drawn at all. One reshapes, the other clips.

Why turning both is a mistake

When you set temperature and top-p together, they compose in an order you don't control, and you can't predict the combined effect by reading the two numbers.

Most APIs apply temperature first, then top-p on the already-rescaled distribution. So temperature=0.3 sharpens the curve, and then top_p=0.5 clips the sharpened curve. The clip threshold now sits on a distribution that temperature already moved. Change one value and the other one's effect shifts under it. You have two dials wired to the same outcome through an interaction you can't see in the numbers.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hard to reason about: both knobs active.
# top_p=0.5 clips a distribution temperature
# has already sharpened. The cutoff is 50% of
# the sharpened mass, not 50% of the mass the
# model originally assigned.
resp = client.responses.create(
    model="gpt-5.1",
    input=prompt,
    temperature=0.3,
    top_p=0.5,
)

The practical failure looks like this. You tune temperature down to fix a problem, it half-works, so you also pull top-p down. Now the output is tight. Six weeks later you bump temperature back up to add some variety and the output barely changes, because top-p is still clipping the tail you were trying to reopen. You spend an afternoon confused before someone remembers the second knob exists.
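That failure is easy to reproduce on a toy distribution. The sketch below assumes temperature is applied before top-p, as described above, and uses invented logits; it reports which tokens can still be drawn at each setting.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def surviving_tokens(logits, temperature, top_p):
    """Temperature first, then top-p: indices that can still be drawn."""
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# Hypothetical scores for six candidate tokens.
logits = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]

for t in (0.3, 0.7, 1.0):
    print(
        f"temperature={t}:",
        "top_p=0.5 keeps", surviving_tokens(logits, t, 0.5),
        "| top_p=1.0 keeps", surviving_tokens(logits, t, 1.0),
    )
```

With top_p=0.5 in place, raising temperature from 0.3 to 0.7 on these numbers leaves a single token in play, so the output behaves as if it were still greedy. The same change with top_p at 1.0 shifts probability across all six tokens.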

Pick one. Leave the other at its provider default (typically temperature 1.0 and top-p 1.0, each a no-op in its own dimension). OpenAI's and Anthropic's documentation give the same guidance: alter one, not both.

Which one to pick

The two knobs are not interchangeable even though they overlap. They fail differently at the edges, and that difference is the whole basis for choosing.

Temperature gives you a smooth, continuous dial. Sliding from 0.0 to 1.0 to 1.5 moves output from rigid to varied to unhinged in a steady way. Above zero it never hard-bans a token; even at low temperature the long tail keeps a sliver of probability. That makes it predictable to tune and the right default for most work.

Top-p gives you a hard floor on quality. It removes the genuinely improbable tokens entirely, no matter what. At high temperature, top-p is a guardrail: you let the model be creative but forbid it from ever drawing from the absolute garbage at the bottom of the distribution. That's its real job, and it's a narrow one.

So the rule:

Default to temperature. It's the dial you can reason about.
Reach for top-p only when you want creativity with a safety rail — high temperature for variety, top-p around 0.9–0.95 to clip the nonsense tail.
Never turn both as tuning knobs. If you're adjusting both to chase a metric, you've lost the plot.
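One way to keep the rule from eroding over time is to route sampling settings through a small helper that complains when both knobs move. This is a hypothetical sketch: sampling_kwargs, allow_both, and the default constants are names invented here, and the defaults of 1.0 are the assumed provider defaults mentioned earlier.

```python
import warnings

# Assumed provider defaults; check your provider's docs.
DEFAULT_TEMPERATURE = 1.0
DEFAULT_TOP_P = 1.0

def sampling_kwargs(temperature=None, top_p=None, allow_both=False):
    """Build sampling kwargs, warning when both knobs leave their defaults."""
    kwargs = {}
    if temperature is not None and temperature != DEFAULT_TEMPERATURE:
        kwargs["temperature"] = temperature
    if top_p is not None and top_p != DEFAULT_TOP_P:
        kwargs["top_p"] = top_p
    if len(kwargs) == 2 and not allow_both:
        warnings.warn(
            "temperature and top_p are both set; pick one, or pass "
            "allow_both=True for a deliberate guardrail",
            stacklevel=2,
        )
    return kwargs

print(sampling_kwargs(temperature=0))
print(sampling_kwargs(temperature=1.1, top_p=0.95, allow_both=True))
```

The returned dict can be splatted into the API call, and the allow_both flag leaves a visible trace in the code that the guardrail case was chosen on purpose.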

Settings per task

The right setting follows from what failure costs you. When being wrong is expensive and being boring is free, push toward determinism. When being repetitive is the failure and variety is the point, open it up.

Extraction, classification, structured output. You want the same input to produce the same output. Set temperature=0 and leave top-p alone. Pulling fields from a document, routing a ticket, returning a typed JSON object: there's one right answer, and sampling variety is pure downside. At temperature 0 the model is greedy and as close to reproducible as you'll get.

# Extraction: one right answer, want it every time.
resp = client.responses.create(
    model="gpt-5.1",
    input=extract_prompt,
    temperature=0,
)

Code generation, SQL, transforms. Stay low: temperature=0 to 0.2. Slightly above zero buys a little flexibility for the model to recover from an awkward token without inviting drift. Past 0.3 you start seeing the same function written four different ways across runs, which makes diffs and caching worse.

Summaries, rewrites, explanation. Mid-range, temperature=0.5 to 0.7. You want fluent prose with some freedom of phrasing, but not invention. This is the band where output reads natural without wandering off the source.

Ideation, brainstorming, naming, marketing copy. Open it up: temperature=0.9 to 1.1. Here repetition is the failure mode. If you run "give me ten product names" at temperature 0, you'll get ten variations on one idea. This is also the one place top-p earns a turn: if you go that route, leave temperature at its default of 1.0 for range and add top_p=0.95 to fence off the truly broken tokens.

# Ideation: variety is the goal. ONE knob.
# Either crank temperature OR use top_p, not both.
resp = client.responses.create(
    model="gpt-5.1",
    input=brainstorm_prompt,
    temperature=1.1,
)
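If several call sites share these settings, the bands above can live in one table. The sketch below is an assumption-heavy starting point: the task names, the TASK_PRESETS table, and the specific values (picked from inside the ranges in this section) are illustrative, not measured optima.

```python
# Starting points picked from the ranges in this section.
# Tune against your own evals before trusting them.
TASK_PRESETS = {
    "extraction": {"temperature": 0.0},
    "code": {"temperature": 0.2},
    "summary": {"temperature": 0.6},
    "ideation": {"temperature": 1.0},
    # Guardrailed variant: temperature stays at its default,
    # top_p does the clipping.
    "ideation_guardrailed": {"top_p": 0.95},
}

def settings_for(task):
    """Return a copy of the preset so callers can't mutate the shared table."""
    try:
        return dict(TASK_PRESETS[task])
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None

print(settings_for("extraction"))
print(settings_for("ideation_guardrailed"))
```

Each preset sets exactly one knob, so passing **settings_for("extraction") into the call keeps the one-knob rule intact by construction.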

What the outputs look like

Take a single prompt ("Name a function that retries a failed network call with backoff") and run it ten times at each setting. The shape of the results, not exact strings, is what to watch.

At temperature=0, all ten runs should return the same name, usually retryWithBackoff. Reproducible, and dull if you wanted options.

At temperature=0.7, you get maybe four distinct names across ten runs: retryWithBackoff, withRetry, fetchWithBackoff, resilientFetch. Real variation, all still sensible.

At temperature=1.2, you get eight or nine distinct names, and one or two start drifting — tenaciousFetch, networkPerseverator. Variety with a rising junk rate.

That junk rate is exactly what top-p is for. Run temperature=1.2 with top_p=0.9 and the broken outliers thin out while the spread stays wide, because the cumulative-mass cutoff drops the lowest-probability tokens that produced networkPerseverator in the first place. That's the legitimate both-knobs case, and notice it's a deliberate creativity-with-a-rail decision, not blind tuning.
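You can watch the guardrail work without calling a model. The simulation below is a toy under stated assumptions: the six names come from the examples above, the scores are invented so that the last two names form the junk tail, and temperature is applied before top-p. It draws 1000 samples rather than ten so the rates are stable.

```python
import math
import random

# Hypothetical candidates and scores; the last two stand in for the junk tail.
NAMES = ["retryWithBackoff", "withRetry", "fetchWithBackoff",
         "resilientFetch", "tenaciousFetch", "networkPerseverator"]
LOGITS = [3.0, 2.2, 2.0, 1.8, 0.3, 0.0]
JUNK = {"tenaciousFetch", "networkPerseverator"}

def sampling_weights(logits, temperature, top_p=1.0):
    """Softmax with temperature, then zero out everything past the top-p cutoff."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    weights, cumulative = [0.0] * len(probs), 0.0
    for i in ranked:
        weights[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return weights

def draw(temperature, top_p=1.0, runs=1000, seed=0):
    """Sample names from the reshaped distribution with a fixed seed."""
    rng = random.Random(seed)
    weights = sampling_weights(LOGITS, temperature, top_p)
    return rng.choices(NAMES, weights=weights, k=runs)

for label, kwargs in [("temperature=1.2", {"temperature": 1.2}),
                      ("temperature=1.2, top_p=0.9", {"temperature": 1.2, "top_p": 0.9})]:
    names = draw(**kwargs)
    junk_rate = sum(n in JUNK for n in names) / len(names)
    print(label, "| distinct:", len(set(names)), "| junk rate:", junk_rate)
```

On these made-up scores the cutoff at 0.9 falls after the fourth name, so the two junk names get zero weight while the four sensible ones stay in play.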

For the determinism end, run the extraction prompt at temperature=0 across a 100-item eval set, twice. The two runs should agree on nearly every item. Any disagreement is a signal worth chasing — usually a genuinely ambiguous input rather than sampling noise, because at temperature 0 there's almost no sampling noise to blame.
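The comparison itself needs very little code. Here is a minimal sketch that assumes each run is stored as a dict from item id to raw output string; compare_runs and the sample data are hypothetical.

```python
def compare_runs(run_a, run_b):
    """Return (agreement rate, ids that disagree) for two {item_id: output} runs."""
    if run_a.keys() != run_b.keys():
        raise ValueError("runs cover different eval items")
    disagreements = sorted(k for k in run_a if run_a[k] != run_b[k])
    rate = 1 - len(disagreements) / len(run_a) if run_a else 1.0
    return rate, disagreements

# Hypothetical outputs from two temperature=0 passes over the same items.
first = {"doc-1": '{"total": 42}', "doc-2": '{"total": 7}', "doc-3": '{"total": 19}'}
second = {"doc-1": '{"total": 42}', "doc-2": '{"total": 70}', "doc-3": '{"total": 19}'}

rate, diffs = compare_runs(first, second)
print(f"agreement: {rate:.0%}, recheck: {diffs}")
```

Comparing raw strings is deliberately strict. If harmless differences such as key order show up as disagreements, parse the JSON before comparing.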

The short version

Temperature reshapes the probability curve; top-p clips its tail. They overlap enough that turning both makes the system impossible to reason about, and they differ enough that the choice between them is real.

Default to temperature. Use 0 for extraction and structured output, low for code, mid for prose, high for ideas. Touch top-p only when you deliberately want wide-but-guardrailed creativity, and when you do, leave temperature at its default. One knob at a time, and write down which one you turned and why — future-you will not remember.

If this was useful

Sampling is one of those settings people copy from a Stack Overflow answer and never revisit, which is how a top_p=0.5 ends up fighting a temperature=0.3 in production for a year. The Prompt Engineering Pocket Guide walks through the sampling parameters with worked examples per task type, plus the eval setup that tells you whether a setting change actually moved your numbers or just your nerves.