A Transformer Decoder does not generate a sentence all at once.
It predicts one token.
Then it feeds that token back and predicts the next one.
That simple loop is the core of modern LLM generation.
Core Idea
A Transformer Decoder is built for autoregressive generation.
That means:
previous tokens → next token prediction → repeat
The Decoder creates hidden representations.
The LM Head converts those representations into vocabulary scores.
A decoding strategy chooses the actual next token.
This matters because generation quality is not only about the model.
It also depends on how tokens are selected.
The Key Structure
A simplified generation pipeline looks like this:
Input Context
→ Decoder Layers
→ Hidden State
→ LM Head
→ Logits
→ Softmax
→ Decoding Strategy
→ Next Token
More compactly:
Text Generation = decoder representation + vocabulary scoring + token selection
The Decoder answers:
What representation summarizes the context so far?
The LM Head answers:
Which vocabulary tokens are likely?
The decoding strategy answers:
Which token should we actually output?
Pseudo-code View
Autoregressive decoding looks like this:
context = prompt_tokens
while not stop:
    hidden = decoder(context)
    logits = lm_head(hidden[-1])
    probs = softmax(logits / temperature)
    next_token = decode(probs)
    context.append(next_token)
The key loop is:
predict → append → repeat
This is why LLM inference is sequential.
Even if training can be parallelized, generation still produces tokens one step at a time.
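The loop can be made runnable with a toy stand-in for the model. This is a minimal sketch, not real inference code: `toy_decoder`, `VOCAB`, and the stopping rule (`eos_id`, `max_new_tokens`) are all assumptions made up for illustration, and the "model" simply prefers whichever token comes next in the vocabulary list.

```python
import math

VOCAB = ["<eos>", "I", "love", "you"]

def toy_decoder(context):
    """Hypothetical stand-in for decoder + LM head: one logit per vocab entry."""
    # Prefer the token that follows the last context token in VOCAB order,
    # wrapping around to <eos> at the end.
    next_id = (context[-1] + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, eos_id=0):
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_decoder(context)            # predict
        probs = softmax(logits)
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        context.append(next_token)               # append
        if next_token == eos_id:                 # stop condition
            break
    return context                               # otherwise: repeat

generated = generate([1])
print([VOCAB[i] for i in generated])
```

Each pass through the loop needs the token produced by the previous pass, which is the sequential dependency described above.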
Transformer Decoder Structure
In the original encoder-decoder Transformer, each Decoder layer contains three sublayers:
- Masked Self-Attention
- Cross-Attention
- Feed-Forward Network
Masked Self-Attention lets the Decoder look only at previous tokens.
Cross-Attention lets it look at Encoder outputs when an input sequence exists.
The Feed-Forward Network transforms each token representation.
Decoder-only LLMs have no Encoder, so the Cross-Attention sublayer is dropped.
The model only continues from the current context.
Causal Masking
The Decoder must not cheat.
When predicting token 5, it cannot look at token 6.
That is the role of the causal mask.
The generation probability can be written as:
P(y₁, y₂, ..., yₙ | x) = Πₜ P(yₜ | y₁, ..., yₜ₋₁, x),  for t = 1, ..., n
Each token depends only on previous output tokens and the input.
This is important.
Without causal masking, the model could see future answers during training.
Then it would fail during real generation.
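A causal mask is easy to picture as a lower-triangular table. The sketch below is illustrative only: `causal_mask` and `masked_softmax` are hypothetical helper names, and real implementations apply the same idea to attention score matrices inside each layer.

```python
import math

def causal_mask(n):
    """Row i is the query position; column j is True if position j is visible."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, visible):
    """Softmax over attention scores, with hidden positions forced to weight 0."""
    masked = [s if v else float("-inf") for s, v in zip(scores, visible)]
    m = max(masked)  # position 0 is always visible, so m is finite
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print("".join("1" if v else "." for v in row))

# The second position (row 1) can attend to positions 0 and 1 only.
weights = masked_softmax([1.0, 1.0, 1.0, 1.0], mask[1])
print(weights)
```

Setting a score to negative infinity before the softmax gives that position exactly zero weight, so future tokens contribute nothing.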
Concrete Example
Target sentence:
I love you
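During training, the Decoder input is the target shifted right by one position, with a start token (written <BOS> here) in front:
Input:
<BOS> I love
Target:
I love you
So the model learns:
<BOS> → I
<BOS> I → love
<BOS> I love → you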
At inference time, there is no target sentence.
The model must use its own previous output.
That is why errors can accumulate during generation.
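The shift can be written in a few lines. This sketch assumes a start-of-sequence marker named `<BOS>`; the marker name and the `shift_right` helper are assumptions for illustration, and real tokenizers use their own special tokens.

```python
BOS = "<BOS>"  # assumed start-of-sequence marker

def shift_right(target_tokens):
    """Build the decoder input by prepending BOS and dropping the last target token."""
    decoder_input = [BOS] + target_tokens[:-1]
    return decoder_input, target_tokens

decoder_input, target = shift_right(["I", "love", "you"])

# At step t the model sees decoder_input[:t + 1] and must predict target[t].
for t in range(len(target)):
    print(decoder_input[: t + 1], "->", target[t])
```

Input and target have the same length, so every position yields one training example in a single forward pass.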
Teacher Forcing
Teacher forcing is used during training.
Instead of feeding the model’s wrong prediction back into the next step, we feed the correct previous token.
This makes training more stable.
Training:
input = correct previous tokens
Inference:
input = model-generated previous tokens
This difference matters.
A model can behave well during training but drift during generation.
That is why decoding strategy and evaluation matter in real systems.
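The gap between the two modes shows up even in a tiny lookup-table "model". Everything here is made up for illustration: `TOY_MODEL` maps the previous token to a predicted next token and contains one deliberate mistake.

```python
# Hypothetical model: previous token -> predicted next token.
# Deliberate mistake: after "I" it predicts "like" instead of "love".
TOY_MODEL = {
    "<BOS>": "I",
    "I": "like",
    "love": "you",
    "like": "cats",
    "cats": "<EOS>",
    "you": "<EOS>",
}

def teacher_forced(gold):
    """Training-style: every step is conditioned on the correct previous token."""
    previous = ["<BOS>"] + gold[:-1]
    return [TOY_MODEL[p] for p in previous]

def free_running(steps):
    """Inference-style: every step is conditioned on the model's own last output."""
    out, prev = [], "<BOS>"
    for _ in range(steps):
        prev = TOY_MODEL[prev]
        out.append(prev)
    return out

gold = ["I", "love", "you"]
print(teacher_forced(gold))  # one wrong step, then back on track
print(free_running(3))       # the wrong step changes everything after it
```

With teacher forcing, the mistake stays local because the next step still sees the correct token. In free-running mode, the same mistake becomes part of the context and the output drifts.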
LM Head and Logits
The Decoder outputs hidden vectors.
But hidden vectors are not tokens.
The LM Head maps a hidden vector to vocabulary-sized scores.
These scores are called logits.
If the vocabulary size is 50,000, the LM Head outputs 50,000 scores.
Each score corresponds to one possible next token.
Logits are not probabilities yet.
Softmax converts them into probabilities.
The pipeline is:
hidden state → logits → probabilities → selected token
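In its simplest form the LM Head is a single matrix multiplication. The sketch below uses a three-token vocabulary and a two-dimensional hidden state; the weights `W`, the vocabulary, and the hidden vector are invented numbers, and bias terms or weight tying are left out.

```python
import math

VOCAB = ["cat", "dog", "fish"]

# Hypothetical LM head weights: one row per vocabulary token,
# one column per hidden dimension.
W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def lm_head(hidden):
    """Dot product of the hidden vector with each vocabulary row -> logits."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [2.0, 1.0]
logits = lm_head(hidden)   # one score per vocabulary token
probs = softmax(logits)    # non-negative and sums to 1

for token, z, p in zip(VOCAB, logits, probs):
    print(f"{token}: logit={z:.1f} prob={p:.3f}")
```

The output length always equals the vocabulary size, regardless of the hidden dimension.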
Temperature Scaling
Temperature controls how sharp or flat the probability distribution becomes.
The formula is:
pᵢ(τ) = exp(zᵢ / τ) / Σ exp(zⱼ / τ)
Lower temperature:
- sharper distribution
- more deterministic output
- less randomness
Higher temperature:
- flatter distribution
- more diverse output
- more randomness
Example:
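With logits [2, 1, 0], plain softmax (temperature = 1) gives roughly [0.67, 0.24, 0.09].
temperature = 0.5 gives roughly [0.87, 0.12, 0.02], so the top token gets much stronger.
temperature = 2 gives roughly [0.51, 0.31, 0.19], so lower-ranked tokens become more likely.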
This matters in practice.
Temperature is one of the simplest ways to control creativity.
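The formula translates directly into code. A minimal sketch, assuming plain Python lists for logits; `softmax_with_temperature` is an illustrative name, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Temperature never changes the ranking of tokens. It only changes how much probability mass the top token holds compared with the rest.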
What Decoding Means
Decoding means selecting the next token from probabilities.
The model gives a distribution.
The decoding algorithm makes a choice.
That choice affects:
- correctness
- creativity
- repetition
- diversity
- determinism
- latency
So decoding is not a small detail.
It is part of the generation behavior.
Greedy Decoding
Greedy decoding always chooses the most likely token.
If probabilities are:
A = 0.70
B = 0.20
C = 0.10
Greedy always picks A.
It is simple and fast.
But it can be repetitive.
It can also choose a locally good token that leads to a worse full sentence.
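A two-step toy example makes the last point concrete. The `TABLE` below is a made-up next-token distribution keyed by prefix; it stands in for a real model.

```python
# Hypothetical next-token table: prefix (tuple) -> {token: probability}.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"dog": 0.55, "end": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def greedy_decode(steps):
    """Pick the most likely token at every step and track the sequence probability."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = TABLE[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

greedy_prefix, greedy_prob = greedy_decode(2)
alternative_prob = TABLE[()]["a"] * TABLE[("a",)]["cat"]

print(greedy_prefix, round(greedy_prob, 2))
print(("a", "cat"), round(alternative_prob, 2))
```

Greedy commits to "the" because it wins the first step, but the sequence starting with "a" ends up more probable overall (about 0.36 versus 0.33).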
Beam Search
Beam search keeps multiple candidate sequences.
Instead of only keeping the best next token, it keeps the best k paths.
If beam size = 3, the model tracks three candidate continuations.
This can help on tasks with a fairly well-defined target output, such as translation.
But it can also reduce diversity.
When k = 1, beam search becomes greedy decoding.
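Here is a minimal beam search over a made-up next-token table. It is a sketch under simplifying assumptions: fixed length, no end-of-sequence handling, and no length normalization. `TABLE` and `beam_search` are illustrative names.

```python
import math

# Hypothetical next-token table: prefix (tuple) -> {token: probability}.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"dog": 0.55, "end": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def beam_search(steps, beam_size):
    """Keep the beam_size best prefixes, scored by summed log-probability."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            for token, p in TABLE[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the best k paths
    return beams

for k in (1, 2):
    prefix, logp = beam_search(2, beam_size=k)[0]
    print(k, prefix, round(math.exp(logp), 2))
```

With k = 1 the search keeps only "the" after the first step and behaves like greedy decoding. With k = 2 it also keeps "a" alive and finds the higher-probability sequence.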
Top-k Sampling
Top-k sampling keeps only the k most likely tokens.
Then it samples from that smaller set.
Example:
k = 3
Only the top 3 tokens can be selected.
This prevents the model from choosing extremely unlikely tokens.
But it still allows some randomness.
Top-k is useful when you want controlled diversity.
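A minimal sketch of Top-k filtering followed by sampling. The probabilities are invented, and `top_k_filter` and `sample` are illustrative helper names.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

def sample(dist, rng):
    """Draw one token according to its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"A": 0.5, "B": 0.2, "C": 0.15, "D": 0.1, "E": 0.05}
kept = top_k_filter(probs, k=3)
rng = random.Random(0)  # fixed seed so the run is repeatable

print(kept)
print([sample(kept, rng) for _ in range(10)])
```

D and E can never be selected, while A, B, and C keep their relative proportions.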
Top-p Sampling
Top-p sampling is also called nucleus sampling.
Instead of keeping a fixed number of tokens, it keeps the smallest set of top-ranked tokens whose cumulative probability reaches p.
Example:
Token probabilities (top of the list; the rest of the vocabulary is omitted):
honeycomb = 0.45
gingerbread = 0.20
donut = 0.12
cupcake = 0.04
If p = 0.6:
honeycomb + gingerbread = 0.65
So only those two tokens enter the sampling set.
Top-p adapts to the confidence of the model.
That makes it more flexible than fixed Top-k.
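The same example as code. This is a sketch, with `top_p_filter` as an illustrative name; it only sees the four listed tokens, and the kept set is renormalized before sampling.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"honeycomb": 0.45, "gingerbread": 0.20, "donut": 0.12, "cupcake": 0.04}

nucleus = top_p_filter(probs, p=0.6)
print(nucleus)

# A confident distribution needs fewer tokens to reach the same p.
confident = top_p_filter({"honeycomb": 0.9, "gingerbread": 0.06, "donut": 0.04}, p=0.6)
print(confident)
```

The second call shows the adaptive behavior: when one token already holds 0.9 of the mass, the nucleus shrinks to that single token.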
Deterministic vs Stochastic Decoding
Deterministic decoding:
- greedy decoding
- beam search
- same input usually gives same output
- useful for predictable tasks
Stochastic decoding:
- Top-k sampling
- Top-p sampling
- can generate different outputs
- useful for creative tasks
The difference is simple:
Deterministic = choose the best-looking path
Stochastic = sample from likely paths
For coding tasks, deterministic settings are often useful.
For brainstorming, stochastic settings are often better.
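The contrast is visible when the same distribution is queried many times. A small sketch with made-up probabilities; `pick_greedy` and `pick_sampled` are illustrative names.

```python
import random

probs = {"A": 0.70, "B": 0.20, "C": 0.10}

def pick_greedy(dist):
    """Deterministic: always the highest-probability token."""
    return max(dist, key=dist.get)

def pick_sampled(dist, rng):
    """Stochastic: draw a token in proportion to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

greedy_runs = {pick_greedy(probs) for _ in range(1000)}

rng = random.Random(42)  # a fixed seed makes the sampled run repeatable
sampled_runs = {pick_sampled(probs, rng) for _ in range(1000)}

print(greedy_runs)
print(sorted(sampled_runs))
```

Sampling with a fixed seed is repeatable too, which is handy for debugging, but the choice still comes from a random draw rather than a ranking.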
Encoder-Decoder vs Decoder-Only Models
Encoder-Decoder models use both input understanding and output generation.
They are useful for tasks like translation.
The Encoder reads the source sequence.
The Decoder generates the target sequence.
Decoder-only models use only the generation stack.
They predict the next token from the previous context.
Most GPT-style LLMs are decoder-only.
The architecture is simpler for open-ended text generation.
Implementation Perspective
In real inference code, generation is not just:
model(prompt)
It is closer to:
tokenize prompt
run decoder
get logits from LM Head
apply temperature
filter with top-k or top-p
sample or choose token
append token
repeat
This matters because small decoding changes can produce very different outputs.
A model can feel precise, boring, creative, unstable, or repetitive depending on decoding settings.
The model gives probabilities.
Your decoding pipeline turns those probabilities into behavior.
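The selection half of that pipeline fits in one function. This is a hedged sketch, not how any particular library does it: `next_token` and its parameters are illustrative, logits are a plain dict, and the order (temperature, then Top-k, then Top-p) is one common choice among several.

```python
import math
import random

def next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token from a {token: logit} dict."""
    rng = rng or random.Random()

    # 1. Temperature-scaled softmax.
    m = max(logits.values())
    exps = {t: math.exp((z - m) / temperature) for t, z in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    # 2. Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # 3. Top-p: keep the smallest prefix whose mass reaches top_p.
    #    Assumption: mass is measured on the pre-filter probabilities;
    #    some implementations renormalize after Top-k first.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # 4. Sample; choices() normalizes the weights itself.
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(next_token(logits, top_k=1))                                   # behaves like greedy
print(next_token(logits, temperature=1.5, top_p=0.9, rng=random.Random(0)))
```

Setting `top_k=1` collapses the pipeline to greedy decoding, which is an easy sanity check when wiring this into a generation loop.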
Naive vs Practical View
Naive view:
LLM = text in, text out
Practical view:
LLM = token loop + logits + decoding policy
Naive mindset:
ask model
receive answer
Practical mindset:
manage context
control temperature
choose decoding strategy
stop generation correctly
handle repetition
optimize inference cost
This is why developers need to understand the Decoder.
Generation is a system, not a single function call.
Important Conditions and Limits
Decoder generation is sequential.
Each new token depends on previous tokens.
That can make inference slow.
Causal masking is required to prevent future-token leakage.
Teacher forcing helps training, but inference uses the model’s own predictions.
Decoding strategy changes output behavior.
Temperature, Top-k, and Top-p are not cosmetic options.
They directly shape the generated text.
Takeaway
The Transformer Decoder generates text by predicting one token at a time.
Masked Self-Attention prevents future-token access.
The LM Head converts hidden states into vocabulary logits.
Softmax turns logits into probabilities.
Decoding chooses the actual next token.
The shortest version is:
Decoder generation = causal attention + LM Head + decoding loop
If you understand that loop, you understand how LLMs actually produce text.
Discussion
When tuning LLM output, which setting do you usually adjust first?
Temperature, Top-k, Top-p, or the prompt itself?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/transformer-decoder-lm-head-decoding-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai