Rijul Rajesh

Understanding Transformers Part 18: Completing the Decoding Process

In the previous article, we generated the first output word from the transformer.

So far, the translation is correct, but the decoder is not finished yet: it keeps generating words until it produces an <EOS> (end-of-sentence) token.


Feeding the Output Back into the Decoder

Now, we take the translated word “vamos” and feed it back into a copy of the decoder’s embedding layer to continue the process.

Just like before, we repeat the same steps (see the sketch after this list):

  • Get the word embeddings for vamos
  • Add positional encoding
  • Calculate self-attention values using the same weights used for the <EOS> token
  • Add residual connections
  • Compute encoder–decoder attention using the same set of weights
  • Add another set of residual connections
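
To make these steps concrete, here is a minimal, purely illustrative NumPy sketch of one decoder step. The tiny vocabulary, dimensions, and random weights are all invented for readability, and the encoder–decoder attention and feed-forward sublayers are omitted for brevity; the point is only the flow: embed the generated word, add positional encoding, reuse the same attention weights, and add a residual connection.

```python
import numpy as np

# Toy sketch of one autoregressive decoder step. Dimensions, vocabulary, and
# random weights are made up for readability; a real transformer uses trained
# weights, multiple heads, and many stacked layers.
np.random.seed(0)
d_model, vocab_size = 4, 6
vocab = ["<EOS>", "vamos", "ir", "a", "la", "playa"]

embedding = np.random.randn(vocab_size, d_model)   # word embedding table
W_q = np.random.randn(d_model, d_model)            # the SAME attention weights are
W_k = np.random.randn(d_model, d_model)            # reused for every token the
W_v = np.random.randn(d_model, d_model)            # decoder processes

def positional_encoding(pos, d):
    # standard sine/cosine positional encoding for a single position
    pe = np.zeros(d)
    for i in range(0, d, 2):
        pe[i] = np.sin(pos / 10000 ** (i / d))
        pe[i + 1] = np.cos(pos / 10000 ** (i / d))
    return pe

def decoder_self_attention(token_ids):
    # 1. embed each generated token and add its positional encoding
    x = np.stack([embedding[t] + positional_encoding(p, d_model)
                  for p, t in enumerate(token_ids)])
    # 2. self-attention over everything generated so far
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    attended = weights @ v
    # 3. residual connection: add the input back onto the attention output
    return x[-1] + attended[-1]

# Feed the previously generated word "vamos" back in (after the <EOS> start token).
state = decoder_self_attention([vocab.index("<EOS>"), vocab.index("vamos")])
print(state.shape)   # (4,) -- this vector continues on to encoder-decoder attention
```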


Generating the Next Word

Next, we pass the values representing “vamos” through the same fully connected layer and softmax function that we used earlier.

This time, the decoder outputs the <EOS> token, which signals the end of the sentence.
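
Continuing the toy sketch above, the fully connected layer and softmax are just a matrix multiplication followed by normalization. The `W_out` matrix below is another made-up weight; with trained weights the most probable word at this step would indeed be <EOS>.

```python
# Project the decoder's output vector back onto the vocabulary with a fully
# connected layer, then softmax it into a probability distribution.
W_out = np.random.randn(d_model, vocab_size)       # illustrative output projection
logits = state @ W_out
probs = np.exp(logits) / np.exp(logits).sum()

next_word = vocab[int(np.argmax(probs))]
print("predicted next word:", next_word)           # with trained weights: <EOS>
```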


Final Output

At this point, the decoding process is complete.

We have successfully translated the input phrase using the transformer.

So, just to recap, the transformer works as follows (tied together in the sketch after the list):

  1. Word embeddings convert words into numerical representations
  2. Positional encoding keeps track of word order
  3. Self-attention captures relationships within the input and output
  4. Encoder–decoder attention connects input and output, ensuring important information is preserved
  5. Residual connections help different components focus on specific tasks and improve training
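
Tying the recap together, the whole decoding loop is simply: run a decoder step, pick the most probable word, append it, and repeat until <EOS> appears. The sketch below reuses the toy functions defined earlier and caps the loop length, since random weights would otherwise never stop.

```python
# Greedy decoding loop with the toy pieces above: feed each output word back in
# until the <EOS> token is produced (capped so random weights can't run forever).
generated = [vocab.index("<EOS>")]                 # decoding starts from <EOS>
for _ in range(10):
    vec = decoder_self_attention(generated)
    next_id = int(np.argmax(vec @ W_out))
    generated.append(next_id)
    if vocab[next_id] == "<EOS>":
        break
print("decoded words:", [vocab[i] for i in generated[1:]])
```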

In the next article, we will start exploring decoder-only transformers.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here

Top comments (1)

PEACEBINFLOW

The step where the decoder feeds its own output back into itself — that autoregressive loop — is one of those design choices that feels almost reckless when you first encounter it. The entire sequence depends on each token being right, and if one is wrong, that error becomes part of the context for the next prediction. There's no recovery mechanism built into the architecture itself. It just trusts that the training was good enough.

What I find interesting is how that design reflects a deeper assumption about the problem: that generating a sequence is fundamentally different from recognizing one. Recognition can be parallelized — you can look at all the words at once, which is what the encoder does. But generation is treated as inherently sequential, as if the act of choosing the next word requires living with the consequences of the previous choice. Whether that's actually true or just a constraint we've inherited from the left-to-right nature of language is something I still wonder about. Some of the newer decoding strategies try to loosen that assumption — speculative decoding, parallel decoding — but the core architecture still builds on this idea that generation is a step-by-step commitment.

Curious whether the next article in the series gets into why decoder-only models dropped the encoder entirely and just leaned harder into that autoregressive loop. That's the part that still feels like magic to me.