Thousand Miles AI
How LLMs Actually Generate Text — Temperature, Top-K, Top-P, and the Dice Rolls You Never See

You set temperature to 0.7 because a tutorial told you to. But do you know what that actually does? Under the hood of every LLM response is a probability game — here's how the dice are loaded.



Every token an LLM outputs is a gamble. Understanding how that gamble works changes how you use these models forever.


The Same Prompt, Three Different Answers

Try this experiment. Open any LLM — ChatGPT, Claude, Gemini, whatever you have access to. Ask it: "Write a one-sentence product description for a coffee mug." Hit send. Copy the result. Now ask the exact same question again. And again.

Three attempts. Three different sentences. Maybe slightly different, maybe wildly different. But almost certainly not identical.

Why? You gave it the exact same input. The model's weights didn't change between requests. The system prompt is the same. So where does the randomness come from?

It comes from the sampling step — the moment after the model calculates probabilities for every possible next word, and before it actually picks one. That choice — how the model selects from thousands of candidates — is controlled by parameters you've probably seen but maybe never understood: temperature, top-K, top-P.

These aren't minor settings. They fundamentally change the model's behavior. Get them wrong, and your creative writing tool sounds robotic. Or your code assistant hallucinates syntax that doesn't exist. Or your customer support bot gives a different answer to the same question every time.

Why Should You Care?

If you're building anything with an LLM — even just making API calls — you're setting these parameters, whether you know it or not. Every API has defaults. Every playground has sliders. And most developers just leave them alone or copy values from tutorials without understanding what they do.

Understanding sampling isn't academic — it's one of the highest-leverage ways to improve LLM output quality without changing a single word of your prompt. It also shows up in interviews constantly. "Explain how temperature works" is practically a warmup question at any AI-focused company.

Let Me Back Up — How an LLM Picks the Next Word

Here's what happens every time an LLM generates a single token:

  1. The model processes your input and produces a set of logits — raw scores for every token in its vocabulary (typically 30,000–100,000+ tokens).
  2. Those logits go through a softmax function, which converts them into probabilities that sum to 1.
  3. A sampling strategy picks one token from that probability distribution.
  4. That token gets appended to the output, and the whole process repeats for the next token.

The model generates text one token at a time, left to right. It doesn't plan ahead. It doesn't have a draft that it edits. Every single token is a fresh probabilistic choice based on everything that came before it.
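The four-step loop can be sketched in a few lines of Python. Everything here is a toy stand-in: a made-up five-token vocabulary and hypothetical logits, where a real model scores tens of thousands of tokens per step.

```python
import math
import random

# Toy sketch of the generation loop with a made-up 5-token vocabulary and
# hypothetical logits; a real model scores 30,000-100,000+ tokens per step.
vocab = ["the", "a", "coffee", "mug", "<eos>"]

def softmax(logits):
    """Step 2: convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Step 3: pick one token index according to the distribution."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

rng = random.Random(42)               # fixed seed so the "dice" are reproducible here
logits = [2.0, 1.0, 0.5, 0.2, -1.0]   # step 1: pretend the model produced these scores
probs = softmax(logits)
token = vocab[sample(probs, rng)]     # step 4 would append this and loop
print(token)
```

Run it a few times without the fixed seed and you'll see different tokens come out — that per-step dice roll is the entire source of the randomness described above.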

[Diagram: the generation loop — predict probabilities for all tokens, sample one, append, repeat. The sampling strategy is where the magic (and danger) happens.]

The Sampling Strategies — One by One

Greedy Decoding: Always Pick the Winner

The simplest strategy. At every step, pick the token with the highest probability. No randomness, no dice rolling. If "the" has probability 0.35 and "a" has 0.20, you always pick "the."

Sounds sensible, right? But greedy decoding has a nasty problem: it's boring. It tends to produce repetitive, predictable text. It gets stuck in loops. It picks the "safe" word every time, and the result reads like it was written by someone who's afraid to take any creative risk.

Greedy decoding is fine for tasks where you want the single most likely answer — like classification or extraction. For anything generative, it's almost never what you want.
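As a minimal sketch, greedy decoding is nothing more than an argmax. The probabilities below are invented for illustration, matching the "the" vs. "a" example above:

```python
# Greedy decoding is just an argmax over the distribution; these
# probabilities are invented for illustration.
probs = {"the": 0.35, "a": 0.20, "this": 0.15, "their": 0.20, "my": 0.10}

def greedy_pick(probs):
    """Always return the single most probable token; no randomness involved."""
    return max(probs, key=probs.get)

print(greedy_pick(probs))  # "the", on every single call
```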

Temperature: Turning Up the Creativity Dial

Temperature is the parameter everyone knows and almost nobody understands precisely. Here's what it actually does.

Before the softmax function converts logits to probabilities, every logit gets divided by the temperature value. That's it. That's the whole mechanism.

But the effect is dramatic:

Temperature = 1.0 — No change. The probabilities are whatever the model naturally produces.

Temperature < 1.0 (say, 0.3) — The logits get divided by a small number, which amplifies the differences between them. High-probability tokens become even more probable. Low-probability tokens become nearly impossible. The distribution gets "peaky" — the model becomes more confident, more predictable, more conservative.

Temperature > 1.0 (say, 1.5) — The logits get divided by a large number, which flattens the differences. Every token becomes more equally likely. The distribution spreads out — the model becomes more random, more creative, more surprising. Also more likely to say something unhinged.

Think of temperature like a volume knob for randomness. Turn it down for math homework. Turn it up for poetry. Turn it all the way down (temperature = 0, which implementations treat as a special case rather than dividing by zero) and you get greedy decoding — pure determinism.
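Here's a small sketch of the mechanism, using hypothetical logits for three candidate tokens. Watch how the same scores produce a peaky distribution at low temperature and a flat one at high temperature:

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide every logit by the temperature, then softmax. That's the whole trick."""
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.3, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
# Low t: probability mass piles onto the top token (peaky).
# High t: mass spreads toward the low-scoring tokens (flat).
```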

Top-K Sampling: Only Consider the Top Candidates

Top-K is a filter. Before sampling, it looks at all 50,000+ tokens in the vocabulary, keeps only the K most probable ones, and throws the rest away. The probability mass gets redistributed among the survivors.

Set K = 50, and the model can only choose from its top 50 candidates. Set K = 5, and it's stuck with the top 5. Set K = 1, and you're back to greedy decoding.

The problem with top-K? The number K is fixed, regardless of context. Sometimes the model is very confident — 3 tokens account for 95% of the probability, and everything else is noise. A K of 50 would include 47 tokens that have almost zero chance of being right. Other times the model is uncertain — 200 tokens each have a small but meaningful probability. A K of 50 would cut off potentially good options.

Top-K doesn't adapt to the shape of the distribution. It's blunt.
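A sketch of the filter over invented probabilities — note that k stays fixed even though this particular distribution is already quite confident:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    survivors = dict(ranked[:k])
    total = sum(survivors.values())
    return {tok: p / total for tok, p in survivors.items()}

# Invented probabilities: the model is fairly confident here, yet k is fixed.
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "my": 0.04, "their": 0.01}
print(top_k_filter(probs, 2))  # only "the" and "a" survive, rescaled to sum to 1
```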

Top-P (Nucleus Sampling): The Smart Filter

Top-P, also called nucleus sampling, is the clever answer to top-K's rigidity. Instead of keeping a fixed number of tokens, it keeps the smallest set of tokens whose combined probability exceeds a threshold P.

Set P = 0.9, and the model keeps adding tokens (from most to least probable) until their combined probability reaches 0.9. If the model is confident, that might be only 3 tokens. If the model is uncertain, it might be 200.

The beauty is that top-P adapts to context. When the next word is obvious ("The Eiffel Tower is in __"), it narrows down to very few candidates. When the next word could genuinely go many ways ("She felt __"), it keeps a wider pool.

This is why top-P has become the default sampling strategy in most production systems. It's more robust across different situations than top-K.
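The adaptiveness is easy to see in code. Both distributions below are invented for illustration: one "confident" (like the Eiffel Tower prompt), one "uncertain" (like the open-ended one):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    survivors = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        survivors[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(survivors.values())
    return {tok: q / total for tok, q in survivors.items()}

# Two invented distributions: one confident, one uncertain.
confident = {"Paris": 0.92, "France": 0.05, "the": 0.02, "a": 0.01}
uncertain = {"happy": 0.35, "sad": 0.25, "tired": 0.2, "angry": 0.12, "free": 0.08}
print(len(top_p_filter(confident, 0.9)))  # a single token already covers 90%
print(len(top_p_filter(uncertain, 0.9)))  # takes 4 tokens to reach 90%
```

Same P = 0.9 in both calls, but the survivor pool shrinks or grows with the model's confidence — exactly what a fixed K can't do.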

[Diagram: Top-K keeps a fixed number regardless of confidence. Top-P adapts — tight when confident, wide when uncertain.]

Min-P: The 2026 Newcomer

There's a newer approach that's gaining traction, especially in open-source communities. Min-P sets a threshold relative to the most probable token. If the top token has probability 0.8 and min-P is 0.1, any token with probability below 0.08 (10% of 0.8) gets cut.

The elegance is that it scales with the model's own confidence. When the model is very sure (top token at 0.95), the threshold is high and very few alternatives survive. When the model is less sure (top token at 0.2), the threshold drops and more tokens stay in the pool.
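A sketch of the cutoff rule, using the numbers from the example above (top token at 0.8, min-P of 0.1; the other probabilities are invented):

```python
def min_p_filter(probs, min_p):
    """Cut every token whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    survivors = {tok: q for tok, q in probs.items() if q >= threshold}
    total = sum(survivors.values())
    return {tok: q / total for tok, q in survivors.items()}

# Top token at 0.8, min-P of 0.1 -> cutoff at 0.08.
probs = {"the": 0.8, "this": 0.1, "a": 0.07, "my": 0.03}
print(sorted(min_p_filter(probs, 0.1)))  # "a" (0.07) and "my" (0.03) fall below the cutoff
```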

As of early 2026, the combination of temperature + min-P is what many open-source LLM users have converged on as the most practical setup.

Practical Guide: What Settings for What Task

Here's a cheat sheet based on how these strategies interact:

Code generation, factual Q&A, data extraction: Temperature 0–0.3, top-P 0.9. You want determinism and accuracy. The model should pick the most likely token almost every time.

General chatbot, customer support: Temperature 0.5–0.7, top-P 0.9. A balance of reliability and natural-sounding language. Not robotic, not chaotic.

Creative writing, brainstorming, poetry: Temperature 0.8–1.2, top-P 0.95. Give the model room to explore. Higher temperature means more surprising word choices.

Never go above 1.5 for temperature unless you're doing it for fun. At that point, the probability distribution is so flat that the model starts producing incoherent output — like a writer who's had too much coffee and is just free-associating.

Mistakes That Bite — Common Misunderstandings

"Temperature controls how smart the model is." No. It controls the randomness of token selection. A low temperature doesn't make the model think harder — it makes it pick the highest-probability token more consistently. If the model's probabilities are wrong, low temperature just makes it confidently wrong.

"I should always use top-K AND top-P together." You can, but be careful. If you set K=50 and P=0.9, the effective filter is whichever is more restrictive. Often one overrides the other, and the second parameter does nothing. Pick one or understand how they interact in your specific framework.
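A toy illustration of that interaction, with invented probabilities — here top-P is the binding filter and top-K does nothing:

```python
def top_k_keep(probs, k):
    """The set of tokens surviving a top-K filter."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return {tok for tok, _ in ranked[:k]}

def top_p_keep(probs, p):
    """The set of tokens surviving a top-P filter."""
    kept, cumulative = set(), 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.add(tok)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"the": 0.6, "a": 0.25, "this": 0.08, "my": 0.04, "their": 0.03}
k_set = top_k_keep(probs, 4)    # 4 survivors
p_set = top_p_keep(probs, 0.9)  # 3 survivors: 0.6 + 0.25 + 0.08 = 0.93 >= 0.9
print(sorted(k_set & p_set))    # the intersection equals the top-P set; K=4 was redundant
```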

"Temperature 0 means the same output every time." Almost. It means greedy decoding — always picking the highest-probability token. But some implementations have floating-point tie-breaking that can occasionally vary. For true determinism, also set a fixed random seed if the API supports it.

Now Go Break Something — Where to Go from Here

The best way to internalize these concepts is to play with them:

  • Use the OpenAI or Anthropic playground — they have real-time sliders for temperature, top-P, and top-K. Ask the same question at different settings and watch how the output changes.
  • Try the Hugging Face text generation playground — it shows the token probabilities alongside the generated text, so you can literally see the dice being rolled.
  • Search for "LLM sampling parameters interactive demo" — several blog posts have visual explainers that let you see how temperature reshapes the probability distribution.
  • Read the Hugging Face blog post "Decoding Strategies in Large Language Models" — it covers everything from greedy search to min-P with code examples.
  • For open-source users: Experiment with llama.cpp's sampler chain — it lets you compose multiple sampling strategies in sequence and see how each one transforms the distribution.

Next time you set temperature to 0.7 and top-P to 0.95, you'll know exactly what's happening: the model calculates probabilities for 50,000 tokens, temperature sharpens the distribution slightly, top-P keeps only the tokens that matter, and one gets picked. Every word you read from an LLM went through this gauntlet. The same prompt, the same model, but different dice rolls — and that's why you get a different coffee mug description every time.


Author: Shibin
