In the previous article, we gained an understanding of how much each input word contributes. In this article, we will compute the value vectors for each input word and combine them accordingly.
We scale each value vector by its Softmax percentage and add the scaled vectors together to obtain the encoder–decoder attention values.
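Here is a minimal sketch of that step in NumPy. The Softmax percentages and value vectors below are made-up numbers for illustration, not values from the article:

```python
import numpy as np

# Softmax percentages from the previous step: one per input word (made-up numbers).
softmax_weights = np.array([0.70, 0.20, 0.10])

# One value vector per input word (made-up 2-dimensional values).
value_vectors = np.array([
    [1.0, 0.5],
    [0.2, 0.8],
    [0.6, 0.1],
])

# Scale each value vector by its percentage, then add the scaled vectors together.
attention_value = softmax_weights @ value_vectors
print(attention_value)   # prints [0.8  0.52], the encoder–decoder attention value for one output word
```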
The sets of weights used to calculate the queries, keys, and values for encoder–decoder attention are different from the sets of weights used in self-attention.
Just like in self-attention, these sets of weights are copied and reused for each word, which allows the model to be flexible with different input and output lengths.
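To make that concrete, here is a minimal sketch with made-up dimensions and randomly initialised weights: encoder–decoder attention keeps its own W_q, W_k, and W_v (separate from the self-attention weights), the queries come from the decoder-side words, the keys and values come from the encoder outputs, and the same three matrices are reused for every word, so the input and output lengths can vary freely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                                      # made-up embedding size

# Encoder–decoder attention has its own weight matrices,
# separate from the ones used in self-attention.
W_q = rng.normal(size=(d_model, d_model))        # applied to the decoder-side words
W_k = rng.normal(size=(d_model, d_model))        # applied to the encoder outputs
W_v = rng.normal(size=(d_model, d_model))        # applied to the encoder outputs

encoder_outputs = rng.normal(size=(5, d_model))  # 5 input words
decoder_states  = rng.normal(size=(3, d_model))  # 3 output words generated so far

# The same weight matrices are reused for every word,
# so any input length and output length works.
Q = decoder_states  @ W_q
K = encoder_outputs @ W_k
V = encoder_outputs @ W_v

# Compare each output word to each input word (with the usual
# scaled dot-product scaling), take the Softmax, and combine the values.
scores  = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
output  = weights @ V
print(output.shape)   # (3, 4): one attention value vector per output word
```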
We can also stack encoder–decoder attention layers, just like we do with self-attention, to better handle more complex phrases.
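As a rough picture of stacking, again with made-up shapes: each stacked layer gets its own encoder–decoder attention weights, and the output of one layer becomes the decoder-side input to the next. Real decoder layers also contain self-attention and a feed-forward block, which are omitted here to keep the sketch short.

```python
import numpy as np

def cross_attention(decoder_side, encoder_outputs, W_q, W_k, W_v):
    # One encoder–decoder attention layer: project, compare, Softmax, combine values.
    Q, K, V = decoder_side @ W_q, encoder_outputs @ W_k, encoder_outputs @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
d = 4
encoder_outputs = rng.normal(size=(5, d))   # 5 input words
x = rng.normal(size=(3, d))                 # 3 output words so far

# Two stacked layers, each with its own randomly initialised weights.
for _ in range(2):
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    x = cross_attention(x, encoder_outputs, W_q, W_k, W_v)

print(x.shape)   # still (3, 4): the stack keeps one vector per output word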
We will continue with more details in the next article.
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you’re done! 🚀