DEV Community

Devanshu Biswas

Attention From Scratch: How Transformers Read Everything at Once

Attention is the idea that turned NLP into LLMs. Strip away the hype and it's simple: each word looks at every word in the sentence, itself included, and decides how much to care. Here's self-attention, computed for real and visualized.

🔭 Click any word and watch: https://dev48v.infy.uk/dl/day12-attention.html

Query, Key, Value

Every token gets three vectors: a Query, a Key, and a Value. To decide how much word A attends to word B, take the dot product of A's Query with B's Key. That's a similarity score.
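
To make the scoring step concrete, here is a minimal sketch in plain Python. The helper names (`dot`, `attention_scores`) and the tiny 2-dimensional vectors are made up for illustration; in a real model the Query and Key vectors come from learned projections of the token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_scores(queries, keys):
    """scores[i][j] = dot product of token i's Query with token j's Key."""
    return [[dot(q, k) for k in keys] for q in queries]


# Hypothetical Query/Key vectors for three tokens (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

scores = attention_scores(queries, keys)
for row in scores:
    print(row)
```

Row i of `scores` holds token i's raw similarity to every token, itself included. A larger number means that Query and Key point in a more similar direction.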

Scale + softmax

Divide the scores by √(dimension), where the dimension is the length of the Key vectors, to keep them stable. Then apply softmax to each word's row of scores so they become weights that sum to 1. High weight = "pay attention here."
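
A rough sketch of the scale-and-softmax step, assuming the raw scores are already available as a list of rows. The function names and the sample scores are illustrative, and `d_k` is simply the name chosen here for the Key dimension.

```python
import math


def softmax(xs):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(xs)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, d_k):
    """Scale every score by sqrt(d_k), then softmax each row separately."""
    scale = math.sqrt(d_k)
    return [softmax([s / scale for s in row]) for row in scores]


# Hypothetical raw scores for two tokens attending over three tokens.
scores = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
weights = attention_weights(scores, d_k=2)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Softmax keeps the ordering of the scores, so the largest score in a row always receives the largest weight.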

The context vector

Each word's output is the weighted sum of all the Value vectors. So "it" in "the cat sat... because it was tired" ends up mostly made of "cat": the model resolves the reference by attention alone.
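
The weighted sum can be sketched as follows. The attention weights and Value vectors below are invented for a three-token toy sentence ("the", "cat", "it"), not taken from a trained model, and `context_vectors` is a hypothetical helper name.

```python
def context_vectors(weights, values):
    """Each output row is the weighted sum of all Value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


# Hypothetical Value vectors for the tokens "the", "cat", "it".
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical attention weights for "it": most of the weight lands on "cat".
weights = [[0.1, 0.8, 0.1]]

context = context_vectors(weights, values)
print(context[0])  # lands much closer to the "cat" Value than to the others
```

Because the weights sum to 1, the output is a blend of the Value vectors, and the token with the largest weight dominates the blend.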

Run several heads in parallel (different views), add positional encoding (order), and you have the core of the Transformer block behind GPT and BERT.
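
Here is a simplified sketch of how those two pieces fit together. It makes two assumptions that go beyond the text: the positional encoding is the sinusoidal scheme from the original Transformer paper (one common choice among several), and the learned projection matrices are left out, so each head just attends over its own slice of the input vectors. All function names and numbers are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def positional_encoding(num_positions, dim):
    """Sinusoidal encoding (assumed scheme): sin on even indices, cos on odd."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate.

    Simplification: x is used directly as Query, Key, and Value. Real models
    first apply learned projections, which are omitted here.
    """
    head_dim = len(x[0]) // num_heads
    outputs = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        part = [vec[lo:hi] for vec in x]
        for out_row, head_row in zip(outputs, attention(part, part, part)):
            out_row.extend(head_row)
    return outputs


# Hypothetical 4-dimensional embeddings for three tokens.
embeddings = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(len(embeddings), dim=4)
x = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
output = multi_head_attention(x, num_heads=2)
for row in output:
    print([round(v, 3) for v in row])
```

Adding the positional encoding before attention is what lets the model tell "cat sat" from "sat cat"; without it, attention treats the sentence as an unordered set of tokens.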

🔨 Full build (Q·K → scale → softmax → weighted sum → multi-head) with live numbers: https://dev48v.infy.uk/dl/day12-attention.html

Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk
