Attention is the idea that turned NLP into LLMs. Strip away the hype and it's simple: each word looks at every other word and decides how much to care. Here's self-attention, computed for real and visualized.
Click any word and watch: https://dev48v.infy.uk/dl/day12-attention.html
Query, Key, Value
Every token gets three vectors: a Query, a Key, and a Value. To decide how much word A attends to word B, take the dot product of A's Query with B's Key. That's a similarity score.
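Here is a minimal NumPy sketch of that scoring step. The sentence, the sizes `d_model` and `d_k`, and the random matrices `W_q`, `W_k`, `W_v` are all made up for illustration; in a trained model the projections are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["the", "cat", "sat"]   # toy sentence
d_model, d_k = 4, 3              # hypothetical sizes, chosen only for illustration

# Stand-in embeddings: one row per token.
X = rng.normal(size=(len(tokens), d_model))

# Random stand-ins for the learned Query/Key/Value projections.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Every token gets its three vectors.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# scores[i, j] = dot(Q[i], K[j]): how strongly token i matches token j.
scores = Q @ K.T
print(scores.shape)  # one score for every (word A, word B) pair
```

One matrix product gives the whole table of scores at once, so nobody has to loop over word pairs.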
Scale + softmax
Divide the scores by √d_k, where d_k is the dimension of the Key vectors, to keep them stable, then apply softmax so each word's scores become weights that sum to 1. High weight = "pay attention here."
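A small sketch of scale + softmax, assuming a hand-written score matrix and `d_k = 4` (both invented for the example). The `softmax` helper is my own; subtracting the row max is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability; the output is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4  # assumed Key dimension for this toy example

# Made-up raw scores: row i holds word i's scores against every word.
scores = np.array([[4.0, 2.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 6.0, 2.0]])

weights = softmax(scores / np.sqrt(d_k), axis=-1)
print(weights.sum(axis=-1))  # each row sums to 1
```

The row of equal scores turns into equal weights, and the rows with one large score put most of their weight on that word.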
The context vector
Each word's output is the weighted sum of all the Value vectors. So "it" in "the cat sat... because it was tired" ends up mostly made of "cat": the model resolves the reference by attention alone.
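To make the weighted sum concrete, here is a sketch with invented numbers. The Value vectors and the attention weights for "it" are hand-picked, not taken from any trained model; they only show what "mostly made of cat" means arithmetically.

```python
import numpy as np

tokens = ["the", "cat", "sat", "it"]

# Made-up Value vectors, one row per token.
V = np.array([[1.0, 0.0, 0.0],   # the
              [0.0, 1.0, 0.0],   # cat
              [0.0, 0.0, 1.0],   # sat
              [1.0, 1.0, 1.0]])  # it

# Hypothetical attention weights for "it": 80% of the weight lands on "cat".
weights_it = np.array([0.05, 0.80, 0.05, 0.10])

# Context vector = weighted sum of all Value vectors.
context_it = weights_it @ V
print(context_it)  # [0.15 0.9  0.15]
```

The result sits close to the Value vector for "cat", which is the sense in which attention resolves the reference.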
Run several heads in parallel (each learns a different view), add positional encoding (so order matters), and you have the core of the Transformer block behind GPT and BERT.
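A rough sketch of how those two pieces fit together. Some assumptions to flag: the sinusoidal encoding is one common choice (the post doesn't say which variant it uses), all weights are random stand-ins, and the function names and the output projection `W_o` are mine. A full Transformer block also adds residual connections, layer normalization, and a feed-forward layer, which are left out here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding (assumed variant); d_model must be even.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # one attention map per head
    heads = weights @ V                  # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2      # toy sizes

# Random stand-in embeddings plus position information.
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head works on its own slice of the vectors, so the two attention maps in `weights` can focus on different words for the same input.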
Full build (Q·K → scale → softmax → weighted sum → multi-head) with live numbers: https://dev48v.infy.uk/dl/day12-attention.html
Part of DeepLearningFromZero. https://dev48v.infy.uk