Rijul Rajesh

Understanding Transformers Part 14: Calculating Encoder–Decoder Attention

In the previous article, we introduced the concept of encoder–decoder attention.

Now let's dig into the details.

Encoder–Decoder Attention in Action

Just like in self-attention, we start by creating query values.

In this case, we create two values to represent the query for the <EOS> token in the decoder.

Next, we create key values for each word in the encoder output.
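The two steps above can be sketched in Python. This is a minimal toy illustration, not a real trained model: the weight matrices, embedding values, and dimensions (a 2-value query, as in the article's example) are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 2  # toy sizes: queries/keys have 2 values, as in the article

# Hypothetical projection matrices; in a real Transformer these are learned.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))

# Made-up decoder <EOS> embedding and two encoder output vectors (one per input word).
eos_embedding = rng.normal(size=(d_model,))
encoder_outputs = rng.normal(size=(2, d_model))

q = eos_embedding @ W_q      # query for the decoder's <EOS> token: two values
K = encoder_outputs @ W_k    # one key per word in the encoder output

print(q.shape, K.shape)      # (2,) (2, 2)
```

The same projection pattern applies regardless of sentence length: one query per decoder position, one key per encoder position.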


Calculating Similarity

Now, we calculate the similarity between the <EOS> token in the decoder and each word in the encoder.

This is done using the dot product.
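As a sketch of this step, with illustrative numbers (not values from a trained model), the dot product between the query and each key gives one similarity score per input word:

```python
import numpy as np

# Query for the decoder's <EOS> token and keys for two encoder words
# (hypothetical example values).
q = np.array([1.0, 0.5])
K = np.array([[0.9, 0.4],    # key for input word 1
              [-0.6, 0.2]])  # key for input word 2

scores = K @ q  # dot product of the query with each key
print(scores)   # [ 1.1 -0.5]
```

A larger dot product means the query and key point in more similar directions, so that input word is a stronger match for the decoder's current position.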


Applying Softmax

We then pass these similarity scores through a softmax function.

This gives us weights that determine how much attention the decoder should pay to each input word.

In this example:

  • The first input word gets 100% attention
  • The second word gets 0% attention

This means the decoder will focus entirely on the first input word when deciding the first translated word.
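A minimal sketch of this step: the scores below are invented so that, after softmax, the first word receives essentially all of the attention, mirroring the 100%/0% split described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Similarity scores between <EOS> and the two input words; the large gap
# is chosen so the weights come out near 1 and 0, as in the article.
scores = np.array([10.0, -10.0])
weights = softmax(scores)
print(weights.round(4))  # [1. 0.]
```

The weights always sum to 1, so they can be read directly as the fraction of attention the decoder pays to each input word.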

What’s Next?

Now that we know how much each input word contributes, the next step is to compute the value vectors for each input word and combine them accordingly.

We will explore this in the next article.


Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
