Rijul Rajesh
Understanding Transformers Part 8: Shared Weights in Self-Attention

In the previous article, we started calculating the self-attention values.

Let’s now calculate the self-attention values for the word “go”.

We do not need to recalculate the keys and values.

Instead, we only need to create the query that represents the word “go”, and then perform the same calculations as before.

After completing the calculations, we get the self-attention values for “go” as:

2.5 and -2.1
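To make the reuse concrete, here is a minimal sketch of that step. The embeddings and weight matrices below are made-up illustrative numbers (the article does not list them all), so the outputs will not match 2.5 and -2.1 exactly; the point is that the keys and values are computed once, and only a new query is needed for "go":

```python
import numpy as np

# Hypothetical 2-D embeddings for "Let's" and "go" (illustrative values only)
embeddings = np.array([[1.16, 0.23],
                       [0.57, 1.36]])

# One shared set of weights each for queries, keys, and values
W_q = np.array([[0.54, -0.17], [0.36, 1.50]])
W_k = np.array([[0.63, 0.35], [-0.62, 1.65]])
W_v = np.array([[0.89, -1.98], [0.17, 2.44]])

# Keys and values are computed once for the whole sentence
K = embeddings @ W_k
V = embeddings @ W_v

def attention_for(x):
    """Self-attention output for one word's embedding x, reusing K and V."""
    q = x @ W_q                                       # only the query is new
    scores = q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all words
    return weights @ V                                # weighted sum of values

# Self-attention values for "go": no recomputation of K or V needed
out_go = attention_for(embeddings[1])
print(out_go.shape)
```

Calling `attention_for` on any word's embedding reuses the same `K` and `V`, which is exactly why they do not need to be recalculated.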

Key Observations About Self-Attention

  • The weights used to calculate queries are the same for both “Let’s” and “go”.

    • This means that regardless of how many words are in the input, one shared set of query weights is used.
    • Similarly, one shared set of weights is reused to calculate keys, and another to calculate values, for every input word.
  • We do not need to compute queries, keys, and values sequentially.

    • All of them can be computed at the same time.
    • This allows transformers to take advantage of parallel computation, making them very efficient.
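The parallelism described above comes down to stacking the per-word computations into a few matrix multiplications. A minimal sketch, with made-up sizes (`seq_len`, `d_model` are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # illustrative sizes

X = rng.normal(size=(seq_len, d_model))      # one embedding per input word
W_q = rng.normal(size=(d_model, d_model))    # shared across all words
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# All queries, keys, and values fall out of three matrix multiplications --
# no per-word loop, so the work can run in parallel on matrix hardware.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # (seq_len, seq_len)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
out = weights @ V                                # (seq_len, d_model)
print(out.shape)
```

Note that `W_q`, `W_k`, and `W_v` appear only once each: the same three weight matrices serve every word, no matter how long the input is.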

We will continue building our transformer step by step in the next article.


Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
