DEV Community

zeromathai

Posted on • Originally published at zeromathai.com

How Transformer Decoders Generate Text — From Causal Masking to Decoding

A Transformer Decoder does not generate a sentence all at once.

It predicts one token.

Then it feeds that token back and predicts the next one.

That simple loop is the core of modern LLM generation.

Core Idea

A Transformer Decoder is built for autoregressive generation.

That means:

previous tokens → next token prediction → repeat

The Decoder creates hidden representations.

The LM Head converts those representations into vocabulary scores.

A decoding strategy chooses the actual next token.

This matters because generation quality is not only about the model.

It also depends on how tokens are selected.

The Key Structure

A simplified generation pipeline looks like this:

Input Context

→ Decoder Layers

→ Hidden State

→ LM Head

→ Logits

→ Softmax

→ Decoding Strategy

→ Next Token

More compactly:

Text Generation = decoder representation + vocabulary scoring + token selection

The Decoder answers:

What should the next representation be?

The LM Head answers:

Which vocabulary tokens are likely?

The decoding strategy answers:

Which token should we actually output?

Pseudo-code View

Autoregressive decoding looks like this:

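context = list(prompt_tokens)

while True:
    hidden = decoder(context)                # one hidden vector per position
    logits = lm_head(hidden[-1])             # scores for the next token only
    probs = softmax(logits / temperature)
    next_token = decode(probs)               # greedy, top-k, top-p, ...
    context.append(next_token)

    # stop on an end token or a length limit (both names are illustrative)
    if next_token == eos_token or len(context) >= max_length:
        break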

The key loop is:

predict → append → repeat

This is why LLM inference is sequential.

Even if training can be parallelized, generation still produces tokens one step at a time.
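To make the loop concrete, here is a minimal runnable sketch in plain Python. Everything in it is an assumption made up for illustration: `toy_logits` stands in for the decoder plus LM Head, and the five-word vocabulary, the `NEXT` table, `generate`, and `max_new_tokens` are hypothetical names, not part of any real model.

```python
import math

VOCAB = ["<eos>", "I", "love", "you", "cats"]  # hypothetical toy vocabulary

# Hypothetical stand-in for decoder + LM Head: scores depend only on the last token.
NEXT = {"<bos>": "I", "I": "love", "love": "you", "you": "<eos>", "cats": "<eos>"}

def toy_logits(context):
    """Return one score per vocabulary entry for the next position."""
    favored = NEXT[context[-1]]
    return [3.0 if tok == favored else 0.0 for tok in VOCAB]

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):          # one model call per new token
        probs = softmax(toy_logits(context))
        next_token = VOCAB[probs.index(max(probs))]  # greedy choice
        if next_token == "<eos>":
            break
        context.append(next_token)
    return context

print(generate(["<bos>"]))  # ['<bos>', 'I', 'love', 'you']
```

Each pass through the `for` loop needs the token chosen in the previous pass, which is why the steps cannot run in parallel.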

Transformer Decoder Structure

A Transformer Decoder layer usually contains:

  • Masked Self-Attention
  • Cross-Attention
  • Feed-Forward Network

Masked Self-Attention lets the Decoder look only at previous tokens.

Cross-Attention lets it look at Encoder outputs when an input sequence exists.

The Feed-Forward Network transforms each token representation.

For decoder-only LLMs, Cross-Attention is usually removed.

The model only continues from the current context.

Causal Masking

The Decoder must not cheat.

When predicting token 5, it cannot look at token 6.

That is the role of the causal mask.

The generation probability can be written as:

P(y₁, y₂, ..., y_T | x) = Πₜ P(yₜ | y₁, ..., yₜ₋₁, x), for t = 1, ..., T

Each token depends only on previous output tokens and the input.

This is important.

Without causal masking, the model could see future answers during training.

Then it would fail during real generation.
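A causal mask is easy to picture as a lower-triangular table. The sketch below is a minimal illustration in plain Python, assuming uniform attention scores; `causal_mask` and `masked_softmax` are hypothetical helper names, and real implementations apply the same idea to tensors.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out positions."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
scores = [[1.0] * n for _ in range(n)]  # hypothetical uniform attention scores
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Every weight above the diagonal is exactly zero, so no position receives information from a later one.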

Concrete Example

Target sentence:

I love you

During training, the Decoder input is the target shifted right by one position, with a start token in front:

Input:

<bos> I love

Target:

I love you

So the model learns:

<bos> → I

<bos> I → love

<bos> I love → you

At inference time, there is no target sentence.

The model must use its own previous output.

That is why errors can accumulate during generation.
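The shift can be sketched in a few lines of Python. This assumes a start token named `<bos>` is placed in front of the decoder input; the token name and the `shift_right` helper are hypothetical, and real pipelines usually also append an end token to the target.

```python
BOS = "<bos>"  # hypothetical start-of-sequence token

def shift_right(target_tokens):
    """Build the decoder input and the per-position labels for one training example."""
    decoder_input = [BOS] + target_tokens[:-1]
    labels = list(target_tokens)
    return decoder_input, labels

decoder_input, labels = shift_right(["I", "love", "you"])

# With a causal mask, position t sees decoder_input[:t + 1] and must predict labels[t].
for t, label in enumerate(labels):
    print(decoder_input[: t + 1], "->", label)
# ['<bos>'] -> I
# ['<bos>', 'I'] -> love
# ['<bos>', 'I', 'love'] -> you
```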

Teacher Forcing

Teacher forcing is used during training.

Instead of feeding the model’s own, possibly wrong, prediction back into the next step, we feed the correct previous token.

This makes training more stable and lets every position be trained in parallel.

Training:

input = correct previous tokens

Inference:

input = model-generated previous tokens

This difference matters.

A model can behave well during training but drift during generation.

That is why decoding strategy and evaluation matter in real systems.
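The gap between the two modes can be shown with a toy model that makes one deliberate mistake. This is only an illustrative sketch: `TOY_MODEL`, `predict_next`, `teacher_forced`, and `free_running` are hypothetical names, and the "model" is a lookup table on the last token.

```python
# Hypothetical toy "model": predicts the next token from the last token only.
# It has one deliberate mistake: after "I" it predicts "like" instead of "love".
TOY_MODEL = {"<bos>": "I", "I": "like", "like": "cats", "love": "you"}

def predict_next(context):
    return TOY_MODEL[context[-1]]

gold = ["I", "love", "you"]

def teacher_forced(gold):
    """Each step sees the correct previous tokens, whatever the model predicted."""
    preds = []
    for t in range(len(gold)):
        context = ["<bos>"] + gold[:t]
        preds.append(predict_next(context))
    return preds

def free_running(steps):
    """Each step sees the model's own previous outputs."""
    context = ["<bos>"]
    for _ in range(steps):
        context.append(predict_next(context))
    return context[1:]

print(teacher_forced(gold))     # ['I', 'like', 'you']
print(free_running(len(gold)))  # ['I', 'like', 'cats']
```

With teacher forcing, the mistake at step 2 stays local because step 3 still sees the correct token "love". In free-running mode the same mistake is fed back in, and the third token goes wrong as well.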

LM Head and Logits

The Decoder outputs hidden vectors.

But hidden vectors are not tokens.

The LM Head maps a hidden vector to vocabulary-sized scores.

These scores are called logits.

If the vocabulary size is 50,000, the LM Head outputs 50,000 scores.

Each score corresponds to one possible next token.

Logits are not probabilities yet.

Softmax converts them into probabilities.

The pipeline is:

hidden state → logits → probabilities → selected token
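As a rough sketch, the LM Head can be treated as a single linear projection from the hidden size to the vocabulary size. The weights, the 4-token vocabulary, and the 3-dimensional hidden vector below are invented for illustration, and the bias term is left out.

```python
import math

VOCAB = ["I", "love", "you", "cats"]  # hypothetical 4-token vocabulary

# Hypothetical LM Head weights: one row of size hidden_dim per vocabulary token.
W = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, 0.1],
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.0],
]

def lm_head(hidden):
    """Linear projection: one logit per vocabulary token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, 2.0, 3.0]   # last-position hidden vector (hidden_dim = 3)
logits = lm_head(hidden)   # one score per vocabulary entry
probs = softmax(logits)    # non-negative and sums to 1

best = VOCAB[probs.index(max(probs))]
print(best)  # you
```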

Temperature Scaling

Temperature controls how sharp or flat the probability distribution becomes.

The formula is:

pᵢ(τ) = exp(zᵢ / τ) / Σ exp(zⱼ / τ)

Lower temperature:

  • sharper distribution
  • more deterministic output
  • less randomness

Higher temperature:

  • flatter distribution
  • more diverse output
  • more randomness

Example:

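With logits [2, 1, 0], the top token gets a probability of about 0.67 at temperature = 1.

temperature = 0.5 makes the top token much stronger: about 0.87.

temperature = 2 makes lower-ranked tokens more likely: the top token drops to about 0.51.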

This matters in practice.

Temperature is one of the simplest ways to control creativity.
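The effect is easy to check numerically. The sketch below applies the formula above to the logits [2, 1, 0]; `softmax_t` is a hypothetical helper name.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_t(logits, t)])
# 0.5 [0.867, 0.117, 0.016]
# 1.0 [0.665, 0.245, 0.09]
# 2.0 [0.506, 0.307, 0.186]
```

The ranking of the tokens never changes. Temperature only changes how much probability mass the lower-ranked tokens receive.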

What Decoding Means

Decoding means selecting the next token from probabilities.

The model gives a distribution.

The decoding algorithm makes a choice.

That choice affects:

  • correctness
  • creativity
  • repetition
  • diversity
  • determinism
  • latency

So decoding is not a small detail.

It is part of the generation behavior.

Greedy Decoding

Greedy decoding always chooses the most likely token.

If probabilities are:

A = 0.70

B = 0.20

C = 0.10

Greedy always picks A.

It is simple and fast.

But it can be repetitive.

It can also choose a locally good token that leads to a worse full sentence.
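In code, greedy decoding is just an argmax over the distribution. A minimal sketch using the probabilities from the example above, with `greedy_pick` as a hypothetical helper name:

```python
def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)

probs = {"A": 0.70, "B": 0.20, "C": 0.10}
print(greedy_pick(probs))  # A
```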

Beam Search

Beam search keeps multiple candidate sequences.

Instead of only keeping the best next token, it keeps the best k paths.

If beam size = 3, the model tracks three candidate continuations.

This can improve structured generation, such as translation, where there is a narrow set of good outputs.

But it can also reduce diversity.

When k = 1, beam search becomes greedy decoding.
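The sketch below runs beam search on a tiny hand-written model, assuming sequences of exactly two tokens. The `MODEL` table and its probabilities are invented so that the greedy path is not the best full sequence; `beam_search` is a hypothetical helper, not a library function.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(k, steps=2):
    """Keep the k highest-scoring prefixes (by summed log-probability) at each step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in MODEL[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

for k in (1, 2):
    prefix, score = beam_search(k)[0]
    print(k, prefix, round(math.exp(score), 2))
# 1 ('A', 'x') 0.33
# 2 ('B', 'x') 0.36
```

With k = 1 the search commits to "A" because it looks best locally, and ends with probability 0.33. With k = 2 it keeps "B" alive for one more step and finds the more probable sequence.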

Top-k Sampling

Top-k sampling keeps only the k most likely tokens.

Then it samples from that smaller set.

Example:

k = 3

Only the top 3 tokens can be selected.

This prevents the model from choosing extremely unlikely tokens.

But it still allows some randomness.

Top-k is useful when you want controlled diversity.
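A minimal sketch of Top-k in plain Python follows. The token distribution is invented, `top_k_filter` and `sample` are hypothetical helpers, and the random generator is seeded only to make the run reproducible.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution
probs = {"the": 0.40, "a": 0.25, "this": 0.15, "my": 0.12, "zebra": 0.08}

filtered = top_k_filter(probs, k=3)
rng = random.Random(0)
print(sorted(filtered))       # ['a', 'the', 'this']
print(sample(filtered, rng))  # always one of the three kept tokens
```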

Top-p Sampling

Top-p sampling is also called nucleus sampling.

Instead of keeping a fixed number of tokens, it keeps the smallest set of top-ranked tokens whose cumulative probability reaches p.

Example:

Token probabilities:

honeycomb = 0.45

gingerbread = 0.20

donut = 0.12

cupcake = 0.04

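If p = 0.6:

honeycomb alone = 0.45, which is below p

honeycomb + gingerbread = 0.65, which passes p

So only those two tokens enter the sampling set, and their probabilities are renormalized before sampling.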

Top-p adapts to the confidence of the model.

That makes it more flexible than fixed Top-k.
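The same example can be written as a short filter. This is a sketch under the assumption that the four listed tokens are the only candidates of interest (their probabilities sum to 0.81, so the rest of the vocabulary is omitted); `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {tok: prob / cumulative for tok, prob in kept}

probs = {"honeycomb": 0.45, "gingerbread": 0.20, "donut": 0.12, "cupcake": 0.04}

nucleus = top_p_filter(probs, p=0.6)
print({tok: round(q, 3) for tok, q in nucleus.items()})
# {'honeycomb': 0.692, 'gingerbread': 0.308}
```

If the model were more confident, for example 0.9 on one token, the same p would keep a single token. If it were less confident, the set would grow.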

Deterministic vs Stochastic Decoding

Deterministic decoding:

  • greedy decoding
  • beam search
  • same input usually gives same output
  • useful for predictable tasks

Stochastic decoding:

  • Top-k sampling
  • Top-p sampling
  • can generate different outputs
  • useful for creative tasks

The difference is simple:

Deterministic = choose the best-looking path

Stochastic = sample from likely paths

For coding tasks, deterministic settings are often useful.

For brainstorming, stochastic settings are often better.

Encoder-Decoder vs Decoder-Only Models

Encoder-Decoder models use both input understanding and output generation.

They are useful for tasks like translation.

The Encoder reads the source sequence.

The Decoder generates the target sequence.

Decoder-only models use only the generation stack.

They predict the next token from the previous context.

Most GPT-style LLMs are decoder-only.

The architecture is simpler for open-ended text generation.

Implementation Perspective

In real inference code, generation is not just:

model(prompt)

It is closer to:

tokenize prompt

run decoder

get logits from LM Head

apply temperature

filter with top-k or top-p

sample or choose token

append token

repeat

This matters because small decoding changes can produce very different outputs.

A model can feel precise, boring, creative, unstable, or repetitive depending on decoding settings.

The model gives probabilities.

Your decoding pipeline turns those probabilities into behavior.
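Putting the selection steps together, one possible shape for the token-selection part of the pipeline is sketched below. The function name `next_token`, its parameters, and the convention that temperature 0 means greedy are assumptions for illustration. Real libraries differ in details, such as whether Top-p is computed before or after renormalizing the Top-k survivors.

```python
import math
import random

def next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw logits. temperature == 0 is treated as greedy here."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, i in ranked:
            kept.append((prob, i))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    ids = [i for _, i in ranked]
    weights = [prob for prob, _ in ranked]  # choices() normalizes the weights
    return rng.choices(ids, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
print(next_token(logits, temperature=0))  # 0
print(next_token(logits, temperature=0.8, top_k=3, top_p=0.9,
                 rng=random.Random(0)))
```

Calling this once per step inside the generation loop, then appending the result to the context, gives the full "repeat" part of the pipeline.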

Naive vs Practical View

Naive view:

LLM = text in, text out

Practical view:

LLM = token loop + logits + decoding policy

Naive mindset:

ask model
receive answer

Practical mindset:

manage context
control temperature
choose decoding strategy
stop generation correctly
handle repetition
optimize inference cost

This is why developers need to understand the Decoder.

Generation is a system, not a single function call.

Important Conditions and Limits

Decoder generation is sequential.

Each new token depends on previous tokens.

That can make inference slow.

Causal masking is required to prevent future-token leakage.

Teacher forcing helps training, but inference uses the model’s own predictions.

Decoding strategy changes output behavior.

Temperature, Top-k, and Top-p are not cosmetic options.

They directly shape the generated text.

Takeaway

The Transformer Decoder generates text by predicting one token at a time.

Masked Self-Attention prevents future-token access.

The LM Head converts hidden states into vocabulary logits.

Softmax turns logits into probabilities.

Decoding chooses the actual next token.

The shortest version is:

Decoder generation = causal attention + LM Head + decoding loop

If you understand that loop, you understand how LLMs actually produce text.

Discussion

When tuning LLM output, which setting do you usually adjust first?

Temperature, Top-k, Top-p, or the prompt itself?

Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/transformer-decoder-lm-head-decoding-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
