Language models like GPT don’t “think” in full sentences — they predict one token at a time, where a token is a chunk of text (often a word or part of a word) created through a process called tokenization. At each step, the model chooses the next token based on probabilities — and decoding parameters like temperature, top-k, and top-p control how predictable, random, or creative those token choices are.
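To make that concrete, here is a minimal sketch of "picking the next token based on probabilities." The four-word vocabulary and the probabilities are made up for illustration, not taken from any real model:

```python
import numpy as np

# Toy vocabulary and made-up next-token probabilities (illustrative only,
# not taken from any real model).
vocab = ["cat", "dog", "car", "banana"]
probs = np.array([0.55, 0.25, 0.15, 0.05])

# At each step the model samples one token according to these probabilities.
next_token = np.random.choice(vocab, p=probs)
print(next_token)  # most often "cat", occasionally a less likely option
```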
Temperature
Temperature controls how random or focused the model’s word choices are.
A low temperature (e.g., 0.2) makes the output more predictable — the model sticks to the most likely words.
A high temperature (e.g., 1.0 or more) makes the output more creative, possibly even risky or unusual.
| Temperature | Behavior |
| --- | --- |
| 0.0 | Deterministic, safest |
| 0.3 – 0.7 | Predictable, less risky |
| 1.0 | Balanced randomness (default for GPT) |
| >1.0 | Creative, more surprising, possibly noisy |
| >1.5 | Often too chaotic or nonsensical |
Temperature is usually in the range of 0.0 to 2.0.
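Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch, using made-up logits for a toy four-token vocabulary, shows how a low temperature sharpens the distribution and a high temperature flattens it:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up logits (raw scores) for a four-token toy vocabulary.
logits = np.array([2.0, 1.0, 0.5, -1.0])

for temperature in (0.2, 1.0, 1.5):
    # Dividing logits by the temperature sharpens (<1.0) or flattens (>1.0)
    # the resulting probability distribution before sampling.
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))
```

At 0.2 almost all of the probability mass piles onto the top token; at 1.5 the lower-ranked tokens get a real chance of being picked.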
Top-k Sampling
Top-k limits the model to the top k most likely tokens, then picks one randomly from that group.
Top-k = 1 → Always picks the most probable word (like greedy decoding).
Top-k = 40 → Picks from the 40 best guesses, adding variety without going off-topic.
| Top-k Value | Behavior |
| --- | --- |
| 1 | Safe but repetitive |
| 10–50 | Good diversity, still smart |
| 100+ | More variety, more risk |
The top-k value ranges from a minimum of 1 up to the total vocabulary size (roughly 50,000 tokens for GPT-style models).
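A minimal sketch of the filtering step, using a made-up five-token distribution: keep the k most likely tokens, renormalize, and sample only from that group.

```python
import numpy as np

def top_k_sample(probs, k):
    # Keep only the k most likely tokens, zero out everything else,
    # renormalize, then sample from that reduced set.
    top_indices = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    filtered /= filtered.sum()
    return np.random.choice(len(probs), p=filtered)

# Made-up five-token distribution (illustrative only).
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
print(top_k_sample(probs, k=2))  # only token 0 or token 1 can be picked
```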
Top-p Sampling (Nucleus Sampling)
Top-p chooses from the smallest set of tokens whose total probability adds up to at least p.
If Top-p = 0.9, the model picks from the most likely words that together make up 90% of the probability mass.
Unlike top-k, this list can grow or shrink dynamically depending on the situation.
| Top-p Value | Behavior |
| --- | --- |
| 0.7 | Very conservative |
| 0.9 | Balanced, avoids outliers |
| 1.0 | No filter; all options allowed |
The top-p value ranges from 0 to 1.
Top-p sampling, where p is the cumulative probability threshold, is also known as nucleus sampling. It works in the following way:
- Sorts the tokens from most to least likely
- Selects the smallest group of tokens whose cumulative probabilities add up to at least p (like 0.9)
- Randomly picks one token from that set

This selected group is the nucleus: the tight cluster of highest-probability tokens, the model's most confident guesses.
Instead of sampling from all possible tokens (many of which are low-probability and often nonsensical), we focus on the most meaningful subset — the nucleus of the probability mass.
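Here is a minimal sketch of that procedure, again with a made-up five-token distribution; the logic follows the three steps listed above:

```python
import numpy as np

def top_p_sample(probs, p):
    # 1. Sort tokens from most to least likely.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # 2. Smallest prefix whose cumulative probability reaches p: the nucleus.
    cutoff = np.searchsorted(cumulative, p) + 1
    nucleus = order[:cutoff]
    # 3. Renormalize over the nucleus and sample one token from it.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return np.random.choice(nucleus, p=nucleus_probs)

# Made-up five-token distribution (illustrative only).
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
print(top_p_sample(probs, p=0.9))  # nucleus is tokens 0-2 (they cover ~90%)
```

In practice you rarely implement this yourself: libraries such as Hugging Face Transformers expose these knobs as `temperature`, `top_k`, and `top_p` parameters on their generation APIs when sampling is enabled.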