DEV Community

jidonglab

Min-p Sampling: Why Top-p Breaks at High Temperature

Crank the temperature on a top-p (nucleus) sampler to 1.5 and you don't get "more creative" output — you get garbage that stops parsing halfway through a sentence. Drop temperature to 0.3 and top-p might as well not exist, because the model is already confident enough that the nucleus is one or two tokens. Top-p only behaves well in a narrow band, and the reason is mechanical: the sampling chain applies temperature and truncation in an order that lets temperature smuggle low-probability tokens back into a set that was supposed to exclude them.

Min-p sampling fixes this by making the cutoff relative to the model's confidence at each step. This post is about why the ordering breaks top-p, what min-p sampling actually computes, and how to wire it correctly in a real decoding pipeline.

TL;DR

  • Top-p keeps the smallest set of tokens whose cumulative probability ≥ p. Its failure mode is that a high temperature flattens the distribution before truncation, so the nucleus balloons to include incoherent tokens.
  • Min-p keeps tokens whose probability ≥ p_base × p_max, where p_max is the top token's probability. The threshold is dynamic: strict when the model is confident, permissive when it's uncertain.
  • Min-p decouples truncation from temperature. You can run temperature 1.5–3.0 for genuine diversity without the tail collapsing into noise, because the floor scales with peak confidence.
  • Order of operations matters. Compute the min-p mask before temperature scaling (on the raw softmax), or you reintroduce the exact bug you were avoiding.
  • Min-p is supported in vLLM, llama.cpp, and Hugging Face transformers; typical p_base is 0.05–0.1.

How does top-p (nucleus) sampling actually work?

Top-p sorts the vocabulary by probability descending, walks down the list accumulating mass, and cuts off once the cumulative sum crosses p. Everything below the cut is set to zero probability; the survivors are renormalized and sampled from.

The set size is data-dependent, which is the whole selling point over top-k. When the model is sure the next token is ), the nucleus at p=0.9 might be a single token. When it's genuinely open — the word after "She felt" — the nucleus might be forty tokens. Adaptive, in principle.
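The walk-and-cut procedure is short enough to sketch directly. This is a minimal pure-Python illustration, not any library's implementation; `top_p_keep` and the two toy distributions are made up for this example.

```python
def top_p_keep(probs, p=0.9):
    """Indices of the smallest top-ranked set whose cumulative mass reaches p."""
    # Rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # stop as soon as the nucleus holds enough mass
            break
    return kept

# Toy distributions (assumed values, chosen only to show the adaptive size).
confident = [0.95, 0.03, 0.02]   # the model is sure of token 0
open_ended = [0.04] * 25         # twenty-five equally plausible tokens

confident_nucleus = top_p_keep(confident)
open_nucleus = top_p_keep(open_ended)
print(len(confident_nucleus), len(open_nucleus))
```

The same p=0.9 yields a one-token nucleus in the first case and a nucleus covering most of the candidates in the second, which is the adaptivity the method is sold on.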

The catch is where temperature sits in the pipeline. The standard chain is:

logits → (÷ temperature) → softmax → top-k → top-p → sample

Temperature scales the logits first. Dividing by a temperature above 1 shrinks the gaps between logits, which after softmax flattens the whole distribution toward uniform. Now run top-p on that flattened distribution: because no single token dominates anymore, you need many more tokens to accumulate p=0.9 worth of mass. The nucleus grows precisely when each token in it is least trustworthy.

Why does top-p break at high temperature?

Because flattening the distribution inflates the nucleus, and the extra tokens are exactly the low-quality tail you wanted to exclude. Temperature and top-p fight each other.

Concretely: suppose the raw model puts 0.6 on the correct token and spreads 0.4 across a long tail of near-nonsense. At temperature 1.0, top-p=0.9 grabs the good token plus a few plausible alternates. At temperature 2.0, that 0.6 might soften to 0.15, and the tail rises to match. Now hitting 0.9 cumulative requires dozens of tokens, most of them incoherent. You asked for a slightly wilder sample and got a token drawn from a bag of typos and topic-jumps.
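To put rough numbers on that inflation, here is a small sketch that measures nucleus size at two temperatures. The distribution is an assumption invented for the demo (one strong token, three plausible alternates, a 1000-token tail), so the exact figures will differ from the ones quoted above; only the direction of the effect is the point.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p=0.9):
    """Number of top-ranked tokens needed to reach cumulative mass p."""
    mass = 0.0
    for n, q in enumerate(sorted(probs, reverse=True), start=1):
        mass += q
        if mass >= p:
            return n
    return len(probs)

# Assumed toy distribution: 0.6 on the best token, 0.11 on each of three
# alternates, and the remaining 0.07 spread over a 1000-token tail.
raw_probs = [0.6] + [0.11] * 3 + [0.00007] * 1000
logits = [math.log(q) for q in raw_probs]

size_t1 = nucleus_size(softmax(logits, temperature=1.0))
size_t2 = nucleus_size(softmax(logits, temperature=2.0))
top_prob_t2 = max(softmax(logits, temperature=2.0))
print(size_t1, size_t2, round(top_prob_t2, 3))
```

On this toy input the nucleus at temperature 1.0 is just the best token and its three alternates, while at temperature 2.0 it has to reach deep into the tail to collect the same 0.9 of mass.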

The practical symptom: people find a "safe" temperature (often ~0.7–1.0 with top-p ~0.9) and never move off it, because anything higher degrades fast. That safe zone is the range where top-p's inflation hasn't yet swamped the signal. You've effectively lost temperature as a usable knob.

What does min-p sampling compute?

Min-p sets a probability floor relative to the peak. Let p_max be the probability of the most likely token. Keep every token whose probability satisfies:

p_token ≥ p_base × p_max

p_base is the one hyperparameter (commonly 0.05–0.1). The absolute threshold p_base × p_max moves with the distribution:

  • Confident step (p_max = 0.9, p_base = 0.1): floor = 0.09. Only tokens above 9% survive. The set is tiny — usually just the obvious continuation. The model's certainty is respected.
  • Uncertain step (p_max = 0.1, p_base = 0.1): floor = 0.01. Any token above 1% survives, so a broad set passes. The model's openness is respected.
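The two cases above can be checked with a few lines of Python. This is a sketch under assumed inputs: `min_p_keep` is a hypothetical helper, and the two distributions are invented so that p_max comes out at 0.9 and 0.1 respectively.

```python
def min_p_keep(probs, p_base=0.1):
    """Indices of tokens whose probability is at least p_base * max(probs)."""
    floor = p_base * max(probs)
    return [i for i, q in enumerate(probs) if q >= floor]

# Confident step: p_max = 0.9, so the floor is about 0.09.
confident = [0.9, 0.05, 0.03, 0.02]

# Uncertain step: p_max = 0.1, so the floor is about 0.01.
# Five leading candidates, twenty tokens at 2%, forty tokens at 0.5%.
uncertain = [0.10, 0.09, 0.08, 0.07, 0.06] + [0.02] * 20 + [0.005] * 40

confident_kept = min_p_keep(confident)
uncertain_kept = min_p_keep(uncertain)
print(len(confident_kept), len(uncertain_kept))
```

With the same p_base, the confident step keeps a single token and the uncertain step keeps every candidate at 2% or above while still dropping the 0.5% tail.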

That's the key difference from both top-k (fixed count, ignores shape) and top-p (cumulative mass, inflates under temperature). Min-p keys off the single most informative signal about how sure the model is — the height of its peak — and gates everything else against it.

Why doesn't min-p collapse at high temperature?

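Because the floor is a ratio against the peak rather than a fixed amount of cumulative mass. If the mask is computed on the raw softmax, the surviving set does not depend on temperature at all; temperature only reshapes the probabilities among tokens that already passed. Even when the floor is measured after temperature, p_max drops as the distribution flattens, so the floor p_base × p_max drops in step, and the test p_token / p_max ≥ p_base is equivalent to keeping every token whose logit lies within temperature × ln(1 / p_base) of the top logit. That window does widen with temperature, but it grows only linearly and stays anchored to the best option instead of sweeping in whatever it takes to reach a mass target.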

This is what buys you a real temperature range. With min-p you can push temperature to 1.5, 2.0, even 3.0 and get output that's genuinely more diverse without dissolving into noise, because the truncation never lets a token through unless it's within a constant factor of the best option. Diversity comes from sampling among reasonable alternatives, not from admitting the whole tail.
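A quick way to see this is to hold the mask fixed and watch what temperature does to the survivors. The sketch below assumes the mask is taken on the raw softmax and reuses an invented toy distribution; `min_p_then_temperature` is a hypothetical helper name, not a library function.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_then_temperature(logits, p_base=0.1, temperature=1.0):
    """Mask on the raw softmax, then apply temperature to the survivors only."""
    raw = softmax(logits)
    floor = p_base * max(raw)
    kept = [i for i, q in enumerate(raw) if q >= floor]
    final = softmax([logits[i] for i in kept], temperature)
    return kept, final

# Assumed toy distribution: one strong token, three alternates, a long tail.
raw_probs = [0.6] + [0.11] * 3 + [0.00007] * 1000
logits = [math.log(q) for q in raw_probs]

results = {}
for t in (1.0, 2.0, 3.0):
    kept, final = min_p_then_temperature(logits, p_base=0.1, temperature=t)
    results[t] = (len(kept), max(final))
    print(t, len(kept), round(max(final), 3))
```

The candidate set stays at the same four tokens for every temperature, while the top token's share among them falls as temperature rises. The extra diversity comes from flattening the choice among plausible options, not from admitting the tail.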

How do you implement min-p correctly?

The one thing to get right: compute the min-p mask on the pre-temperature softmax, then apply temperature to the survivors. If you scale by temperature first and then compute p_max and the floor, you've flattened the distribution before measuring it, so the surviving set widens as temperature rises: a milder version of the top-p bug. Order of operations is the entire game.

Here's a self-contained reference implementation:

import torch
import torch.nn.functional as F

def min_p_sample(logits: torch.Tensor,
                 p_base: float = 0.1,
                 temperature: float = 1.0) -> int:
    """
    logits: 1D tensor of raw model logits (vocab_size,)
    Returns a sampled token id.
    """
    # 1. Mask on the RAW distribution — measure confidence before temperature.
    probs = F.softmax(logits, dim=-1)
    p_max = probs.max()
    threshold = p_base * p_max
    keep = probs >= threshold                      # dynamic, peak-relative floor

    # 2. Now apply temperature to the survivors only.
    filtered = logits.clone()
    filtered[~keep] = float("-inf")
    filtered = filtered / max(temperature, 1e-6)

    # 3. Renormalize and sample.
    final_probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(final_probs, num_samples=1).item()

Two details worth flagging:

  • p_max is computed on probs, not on the temperature-scaled logits. That's deliberate. The mask reflects the model's actual confidence; temperature then reshapes only what passed.
  • Setting masked logits to -inf before dividing by temperature keeps them zero after softmax regardless of temperature. Order still matters here — mask, then scale.

Some production implementations fold min-p into a fused kernel and compute the threshold on already-temperature-scaled probabilities for speed, treating min-p as "the last filter in the chain." That works and is still far more robust than top-p, but it's a slightly different operator. If you're getting mushier-than-expected output at high temperature, check which order your inference server uses.
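To see how the two orderings differ, the sketch below counts how many tokens pass the min-p floor when it is measured before versus after temperature. It does not reproduce any particular inference server; the evenly spaced logits and the helper `min_p_mask_size` are assumptions made for the demo.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_mask_size(logits, p_base, temperature, mask_before_temperature):
    """Count tokens passing the min-p floor under either ordering."""
    # Mask-first measures confidence at temperature 1.0 (the raw softmax);
    # mask-last measures it on the already temperature-scaled distribution.
    t_for_mask = 1.0 if mask_before_temperature else temperature
    probs = softmax(logits, t_for_mask)
    floor = p_base * max(probs)
    return sum(1 for q in probs if q >= floor)

# Assumed toy logits: each token is 0.5 below the previous one.
logits = [-0.5 * i for i in range(100)]

pre = {t: min_p_mask_size(logits, 0.1, t, True) for t in (1.0, 2.0, 3.0)}
post = {t: min_p_mask_size(logits, 0.1, t, False) for t in (1.0, 2.0, 3.0)}
print("mask before temperature:", pre)
print("mask after temperature: ", post)
```

With the mask taken first, the candidate count is the same at every temperature. With the mask taken last, it grows roughly in proportion to temperature, which is a plausible source of mushier output at high settings even though the growth is far gentler than top-p's.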

When should you use min-p vs top-p vs top-k?

Use min-p when you want temperature to be a real control — creative writing, brainstorming, synthetic-data generation, anywhere you'll actually turn the dial past 1.0. It's the most robust single truncation method across temperatures.

Use top-k when you want a hard, predictable cap on the candidate set (e.g., latency-bounded setups where per-step work must stay fixed, or debugging where you want a fixed shortlist). Use top-p if you're at low, fixed temperature and matching an existing baseline that expects nucleus sampling.

A reasonable default for open-ended generation: min-p p_base = 0.05–0.1, temperature 1.0–1.5, top-k and top-p disabled. Stack them only if you have a specific reason — combining min-p with a loose top-k (say k=50) as a safety cap is harmless; combining it with top-p usually just reintroduces top-p's temperature sensitivity.
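As a starting point, that default might be written down as sampling settings like the following. The parameter names are what I understand vLLM's `SamplingParams` and Hugging Face transformers' `generate()` to accept; treat them as assumptions and check them against the versions you have installed. The specific values are just one pick from the ranges above.

```python
# Assumed parameter names for vLLM's SamplingParams; verify for your version.
vllm_sampling_kwargs = {
    "temperature": 1.2,
    "min_p": 0.05,
    "top_p": 1.0,   # 1.0 leaves nucleus truncation disabled
    "top_k": -1,    # -1 leaves top-k disabled in vLLM
}

# Assumed keyword arguments for transformers' generate(); verify for your version.
hf_generate_kwargs = {
    "do_sample": True,
    "temperature": 1.2,
    "min_p": 0.05,
    "top_p": 1.0,   # 1.0 leaves nucleus truncation disabled
    "top_k": 0,     # 0 leaves top-k disabled in transformers
}

# Intended use (not executed here, since it needs the libraries and a model):
#   from vllm import SamplingParams; SamplingParams(**vllm_sampling_kwargs)
#   model.generate(**inputs, **hf_generate_kwargs)
print(vllm_sampling_kwargs)
print(hf_generate_kwargs)
```

Explicitly neutralizing top-p and top-k matters because some stacks ship non-trivial defaults for them, which would otherwise stack on top of min-p.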

For deterministic tasks (extraction, classification, structured output) none of this matters — use temperature 0 / greedy and skip sampling entirely. Note that the frontier hosted models (Claude Opus 4.x / Sonnet 4.x, GPT-5.x) expose temperature and top-p through their APIs but generally do not expose min-p; min-p is a knob you get on open-weight inference stacks like vLLM and llama.cpp, or when you run the model yourself. If you're on a hosted API and high temperature gives you garbage, that's top-p's inflation biting — lower the temperature rather than fighting it.

Common failure modes

  • Setting p_base too high (>0.2). The floor gets so strict that even uncertain steps admit only one or two tokens, and you're back to near-greedy. Output goes repetitive.
  • Computing the mask after temperature. The single most common bug. You lose the whole benefit and wonder why min-p "doesn't help."
  • Stacking min-p with an aggressive top-p. The top-p pass re-inflates under temperature and dominates. Pick one truncation method.
  • Expecting min-p to fix a bad model. Truncation shapes the tail; it can't invent a good token if the model never ranked one highly. If p_max itself is low across many steps, the model is genuinely lost — that's a prompt or model problem, not a sampling one.

Direct answer: why does top-p break at high temperature, and how does min-p fix it?

Top-p breaks at high temperature because temperature is applied to the logits before truncation. Flattening the distribution inflates the nucleus needed to reach cumulative probability p, so the extra tokens admitted are exactly the low-probability tail you wanted excluded, and raising temperature past ~1.0 degrades coherence fast. Min-p sampling fixes this by setting a peak-relative floor: it keeps only tokens whose probability is at least p_base × p_max, where p_max is the top token's probability. Because the threshold is a ratio against the peak rather than a cumulative-mass target, the cutoff stays anchored to the model's best option, and when it is measured on the raw softmax the surviving set does not change with temperature at all. The result is that temperature becomes a usable diversity control across a much wider range (1.0–3.0), and the one implementation rule that makes it work is to compute the min-p mask on the raw pre-temperature softmax, then apply temperature to the survivors.
