Rijul Rajesh
Understanding Transformers Part 8: Shared Weights in Self-Attention

In the previous article, we started calculating the self-attention values.

Let’s now calculate the self-attention values for the word “go”.

We do not need to recalculate the keys and values.

Instead, we only need to create the query that represents the word “go”, and then perform the same calculations as before.

After completing the calculations, we get the self-attention values for “go” as:

2.5 and -2.1
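To make the reuse concrete, here is a minimal sketch of that step. The embeddings and weight matrices below are made-up illustrative numbers (the article does not list them all), so the outputs will not match 2.5 and -2.1 exactly; the point is that the keys and values are computed once, and only a new query is needed for "go":

```python
import numpy as np

# Hypothetical 2-D embeddings for "Let's" and "go" (illustrative values only)
embeddings = np.array([[1.16, 0.23],
                       [0.57, 1.36]])

# One shared set of weights each for queries, keys, and values
W_q = np.array([[0.54, -0.17], [0.36, 1.50]])
W_k = np.array([[0.63, 0.35], [-0.62, 1.65]])
W_v = np.array([[0.89, -1.98], [0.17, 2.44]])

# Keys and values are computed once for the whole sentence
K = embeddings @ W_k
V = embeddings @ W_v

def attention_for(x):
    """Self-attention output for one word's embedding x, reusing K and V."""
    q = x @ W_q                                       # only the query is new
    scores = q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all words
    return weights @ V                                # weighted sum of values

# Self-attention values for "go": no recomputation of K or V needed
out_go = attention_for(embeddings[1])
print(out_go.shape)
```

Calling `attention_for` on any word's embedding reuses the same `K` and `V`, which is exactly why they do not need to be recalculated.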

Key Observations About Self-Attention

  • The weights used to calculate queries are the same for both “Let’s” and “go”.

    • This means that regardless of how many words are in the input, one shared set of query weights is used.
    • Similarly, one shared set of weights is reused to calculate keys, and another to calculate values, for every input word.
  • We do not need to compute queries, keys, and values sequentially.

    • All of them can be computed at the same time.
    • This allows transformers to take advantage of parallel computation, making them very efficient.
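The parallelism described above comes down to stacking the per-word computations into a few matrix multiplications. A minimal sketch, with made-up sizes (`seq_len`, `d_model` are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # illustrative sizes

X = rng.normal(size=(seq_len, d_model))      # one embedding per input word
W_q = rng.normal(size=(d_model, d_model))    # shared across all words
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# All queries, keys, and values fall out of three matrix multiplications --
# no per-word loop, so the work can run in parallel on matrix hardware.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # (seq_len, seq_len)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
out = weights @ V                                # (seq_len, d_model)
print(out.shape)
```

Note that `W_q`, `W_k`, and `W_v` appear only once each: the same three weight matrices serve every word, no matter how long the input is.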

We will continue building our transformer step by step in the next article.


Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
