Akash

Posted on Jun 16

How Transformers Actually Work: Self-Attention, Step by Step

#deeplearning #llm #machinelearning #nlp

Building Context: Query, Key, Value, and the Transformer Block

The last two posts ended on the same cliffhanger: RNNs carry context through a single thread of hidden state that frays on long sequences and forces everything through one bottleneck vector. The fix, I kept saying, is attention. This is the post that delivers it. By the end, you'll know why a static word embedding can't represent a word's meaning in context, how self-attention computes a fresh, context-aware embedding for every token by letting it "attend to" the words around it, exactly what the query, key, and value vectors are and where they come from, why transformers use many attention heads instead of one, and how a full transformer block wraps attention in layer norm, a feedforward layer, and residual connections. For this post, we'll stop right before the matrix tricks that make all of this run in parallel.

One sentence captures the whole shift from RNNs to transformers. An RNN processes tokens one at a time, each step waiting on the last. A transformer processes every token at once, in parallel, and recovers the dependencies between words through attention instead of recurrence. That trade (parallel computation + attention, in place of sequential recurrence) is the engine behind every modern LLM, and it traces back to one 2017 paper: Attention Is All You Need.

The Problem: Static Embeddings Don't Know Context

Go back to word2vec. It gives every word a single fixed vector. The word chicken is always the same vector, no matter the sentence. We already flagged this as a limitation, and transformers are where we finally fix it.

Here's the example the textbook (and the lecture) builds everything on:

The chicken didn't cross the road because it was too tired.

The chicken didn't cross the road because it was too wide.

In the first sentence, it means the chicken. In the second, it means the road. The word is identical; the meaning flips entirely based on context. A static embedding for it (one frozen vector) cannot capture both. It would have to encode "pronoun, used for animals and inanimate things" and stop there, blind to which one is meant here.

Now read it the way a left-to-right language model does, stopping at it:

The chicken didn't cross the road because it _____

At this exact point, you don't yet know whether it will turn out to be the chicken or the road. So a good representation of it right here should carry information about both candidate words, chicken and road, ready to be resolved once the next word arrives.

Contextual embeddings. The meaning of a word should be a different vector in different contexts. Instead of one frozen vector per word, we want a representation that's computed on the fly from the surrounding words. Building those context-aware vectors is what attention is for.

Why Not Just Use RNNs? (A Near Miss)

Fair question, and worth treating as a near miss, because the RNN almost solves this. LSTMs already carry information forward: the sixth word's hidden state holds some trace of the first word. So context does reach across the sentence. Couldn't that give us the contextual embeddings we want?

It can, which is exactly why it's a near miss and not a dead end. But it's clunky, and it has one fatal inefficiency:

RNNs are iterative. Transformers are parallel.

An RNN has to compute the hidden state for token 1 before token 2, token 2 before token 3, and so on down the sequence. Every step waits on the one before it. That's inherently serial, and it's slow.

A transformer computes the representation for every token at the same time. Token 5 doesn't wait for token 4. This massively parallel computation is the single biggest computational difference between RNNs and transformers, and it's a big part of why transformers scale to the enormous models we have now.

But there's a catch. If every token is processed independently and in parallel, how do dependencies between words survive? If token 5 doesn't wait for token 4, how does it know about token 4 at all?

That's the job of attention. It's the one component that lets tokens see each other, and it's built so that all those cross-token comparisons can still happen in parallel.

Attention, Intuitively

We build the contextual embedding for a word by selectively integrating information from its neighbors. The key word is selectively; some neighbors matter more than others.

The vocabulary here: a word attends to the neighbors it draws meaning from. In our example, it attends strongly to chicken and road, and only weakly to the, didn't, cross. Those two nouns are where the meaning of it lives, so that's where it pays attention.

Picture the network in layers. At layer $k$, you have a representation for every word. To compute the representation of it at layer $k+1$, you look back across all the layer-$k$ representations of the prior words and pull in information from them, weighted so that chicken and road contribute the most. Stack enough of these layers and the embedding for it ends up carrying an enormous amount of resolved, contextual meaning.

So, formally: attention is a method for computing a weighted sum of vectors. The whole game is figuring out the weights.

Attention Is Left-to-Right (for now)

In the causal language models we're building, a word can only attend to itself and the words before it, never ahead. When we compute the attention output for token 5, it draws on tokens 1 through 5. When we compute it for token 2, it draws on tokens 1 and 2. No peeking at the future, because at generation time, the future doesn't exist yet.

(There's a version of attention that does look both ways. That's what BERT uses, and it's a post for another day.)

What attention is NOT. It isn't recurrence in disguise; there's no hidden state handed from one step to the next. It isn't a lookup in a fixed table; nothing is retrieved from stored memory. And the attention weights themselves aren't constants: the projection matrices are learned once during training, but the actual weights (the $\alpha$'s) are recomputed for every token from the content of the words in front of it. The same word draws different attention in different sentences. That's the point of going contextual in the first place.

The Simplified Version

Before the real machinery, the textbook gives a stripped-down version that captures the idea. The attention output $a_i$ for token $i$ is just a weighted sum of the token vectors up to and including position $i$:

a_i = \sum_{j \le i} \alpha_{ij} \, x_j

Each $\alpha_{ij}$ is a scalar that says how much token $j$ should contribute to token $i$'s new representation. How do we get those weights? By similarity. A word should draw most from the prior words most similar to it.

The simplest similarity measure between two vectors is the dot product. It maps two vectors to a single number, which is larger when they point in the same direction:

\text{score}(x_i, x_j) = x_i \cdot x_j

Then we push the scores through a softmax to turn them into weights between 0 and 1 that sum to 1:

\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)) \quad \forall j \le i

That's the idea, start to finish. Compare the current word to each prior word using dot product, softmax the scores to obtain weights, and take the weighted sum. The most similar words contribute the most.
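Those three steps can be sketched in plain Python, with no libraries, so every operation is visible. The 2-dimensional toy vectors and the function names are made up for illustration; real embeddings have hundreds of dimensions, and real implementations use matrix libraries rather than loops.

```python
import math

def softmax(scores):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simple_attention(xs):
    """For each position i, return a_i and its weights, using only tokens j <= i."""
    outputs, weights = [], []
    for i, x_i in enumerate(xs):
        prior = xs[: i + 1]                          # the token itself and everything before it
        scores = [dot(x_i, x_j) for x_j in prior]    # score by similarity
        alphas = softmax(scores)                     # normalize with softmax
        a_i = [sum(a * x_j[c] for a, x_j in zip(alphas, prior))
               for c in range(len(x_i))]             # weighted sum
        outputs.append(a_i)
        weights.append(alphas)
    return outputs, weights

# Toy "embeddings" (made-up numbers, not from a trained model).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = simple_attention(xs)
```

The first token can only attend to itself, so its weight list is `[1.0]`. The third token puts its largest weight on itself, because its dot product with itself is the biggest of its three scores.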

Hold onto this skeleton. Everything that follows (queries, keys, values, multiple heads) is decoration on this one move: score by similarity, normalize with softmax, sum the values by weight. If you lose the thread in the equations below, come back to these three lines.

Query, Key, Value: The Real Attention Head

The simplified version uses each word vector $x_i$ directly for everything. The real version notices that each vector actually plays three different roles in the attention computation, and gives each role its own representation:

Query: the current word, doing the looking. ("I'm it; what should I pay attention to?")
Key: a prior word, being looked at, used to compute how similar it is to the query. ("I'm chicken; how relevant am I to you?")
Value: the actual information a prior word contributes once it's been judged relevant. ("Here's what I, chicken, add to your meaning.")

To produce these three roles, transformers learn three weight matrices, $W^Q$, $W^K$, and $W^V$, that project each input vector into a query, a key, and a value:

q_i = x_i W^Q; \quad k_j = x_j W^K; \quad v_j = x_j W^V

Where do those matrices come from? Training. They're learned, like every other weight in the network.

Now the similarity is computed between the current word's query and each prior word's key, and we scale the dot product by $\sqrt{d_k}$ (the square root of the key/query dimension) to keep the numbers from blowing up and wrecking the softmax:

\text{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}

\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)) \quad \forall j \le i
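To see what the $\sqrt{d_k}$ scaling buys, here is a small sketch comparing softmax weights with and without it. The query and key values are invented for illustration (three keys that are all fairly similar to the query), and $d_k = 64$ is borrowed from the original paper's setting.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
q = [1.0] * d_k
# Three keys, each a reasonably close match to the query (illustrative values).
keys = [[1.0] * d_k, [0.9] * d_k, [0.8] * d_k]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]   # about 64, 57.6, 51.2
scaled = [s / math.sqrt(d_k) for s in raw]               # about 8, 7.2, 6.4

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)
```

Without scaling, the gaps between raw scores grow with the dimension, and the softmax puts more than 99% of the weight on the first key even though the other two are near matches. With scaling, the weights come out at roughly 0.61, 0.27, and 0.12, so the other keys still contribute.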

The output sums the value vectors (not the raw inputs), weighted by those attention weights:

\text{head}_i = \sum_{j \le i} \alpha_{ij} \, v_j

And one last matrix, $W^O$, projects that result back to the model dimension:

a_i = \text{head}_i \, W^O
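Putting the equations above together, a single causal attention head might look like the sketch below. It is written with plain Python lists so each step maps onto one equation. The weight matrices are random stand-ins for what training would learn, and the tiny dimensions ($d = 4$, $d_k = d_v = 2$) and helper names are assumptions chosen for readability.

```python
import math
import random

def vec_mat(v, W):
    """Row vector [n] times matrix [n x m] gives a row vector [m]."""
    return [sum(v[r] * W[r][c] for r in range(len(v))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(xs, W_Q, W_K, W_V, W_O):
    """Causal single-head attention; returns one d-dimensional a_i per token."""
    d_k = len(W_Q[0])
    qs = [vec_mat(x, W_Q) for x in xs]      # q_i = x_i W^Q
    ks = [vec_mat(x, W_K) for x in xs]      # k_j = x_j W^K
    vs = [vec_mat(x, W_V) for x in xs]      # v_j = x_j W^V
    outputs = []
    for i in range(len(xs)):
        # score(x_i, x_j) = q_i . k_j / sqrt(d_k), only for j <= i
        scores = [sum(a * b for a, b in zip(qs[i], ks[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        alphas = softmax(scores)
        # head_i = sum over j <= i of alpha_ij * v_j
        head_i = [sum(alphas[j] * vs[j][c] for j in range(i + 1))
                  for c in range(len(vs[0]))]
        outputs.append(vec_mat(head_i, W_O))    # a_i = head_i W^O
    return outputs

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)                  # untrained, random stand-ins for learned weights
d, d_k, d_v = 4, 2, 2
W_Q, W_K = rand_matrix(d, d_k, rng), rand_matrix(d, d_k, rng)
W_V, W_O = rand_matrix(d, d_v, rng), rand_matrix(d_v, d, rng)
xs = rand_matrix(3, d, rng)             # three toy token vectors
a = attention_head(xs, W_Q, W_K, W_V, W_O)
```

Each output `a[i]` has the same length as its input `xs[i]`. For the first token there is nothing earlier to attend to, so its output reduces to its own value vector passed through $W^O$.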

Mind the Shapes

One piece of advice when reading attention diagrams and equations: watch the dimensions. They're the key to understanding what's moving where.

Quick reference: the shapes

xᵢ (input token): shape [1 × d], where d is the model dimension.

W^Q and W^K: shape [d × dₖ]. So queries and keys are dₖ-dimensional, and their dot product is a single scalar.

W^V: shape [d × dᵥ]. So the value vectors and the head output are dᵥ-dimensional.

W^O: shape [dᵥ × d]. Projects the head output back to d dimensions.

Original transformer paper: d = 512, dₖ = dᵥ = 64.

So you start with a $d$-dimensional token and end with a $d$-dimensional token, same shape in, same shape out. In many architectures, the key/query and value dimensions are set equal, but they don't have to be.

That "same shape in, same shape out" is not an accident; it's what lets us stack these layers, which we'll need in a minute.
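One way to check the bookkeeping is to push shapes, rather than numbers, through a single head. The sketch below uses the original paper's sizes from the quick reference; the `matmul_shape` helper is a hypothetical name introduced here, and the parameter count ignores any bias terms.

```python
def matmul_shape(a, b):
    """Shape of a @ b, or an error if the inner dimensions don't match."""
    if a[1] != b[0]:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (a[0], b[1])

d, d_k, d_v = 512, 64, 64               # sizes from the original transformer paper
x = (1, d)                              # one input token
W_Q, W_K, W_V, W_O = (d, d_k), (d, d_k), (d, d_v), (d_v, d)

q = matmul_shape(x, W_Q)                # (1, 64)
k = matmul_shape(x, W_K)                # (1, 64)
v = matmul_shape(x, W_V)                # (1, 64)
score = matmul_shape(q, (k[1], k[0]))   # q times k transposed: (1, 1), a scalar
head = v                                # a weighted sum of value vectors keeps their shape
a = matmul_shape(head, W_O)             # (1, 512), the same shape as x

params_per_head = sum(rows * cols for rows, cols in (W_Q, W_K, W_V, W_O))
```

With these sizes, one head and its output projection hold 131,072 weights, and `a` has exactly the shape of `x`.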

Multi-Head Attention

One attention head computes one kind of similarity. But words relate to each other in many ways at once: syntactic agreement, synonymy, coreference, topical relatedness, and patterns we don't even have names for. Cramming all of that into a single head is asking too much.

So transformers run many attention heads in parallel, each with its own $W^Q$, $W^K$, $W^V$ matrices, each free to specialize in a different aspect of how words relate:

q_i^c = x_i W^{Q c}; \quad k_j^c = x_j W^{K c}; \quad v_j^c = x_j W^{V c}; \quad 1 \le c \le A

Each head produces its own output. You concatenate all $A$ of them and project back down to the model dimension with a final matrix $W^O$:

a_i = (\text{head}_1 \oplus \text{head}_2 \oplus \cdots \oplus \text{head}_A) \, W^O

How many heads? Real models use a lot. The original paper used 8; large models run 128 or more. You can't cleanly point at head #37 and say "this one does subject-verb agreement," but collectively the heads cover the many latent kinds of relationships between words.

The single-head drawback: With only one attention head, one set of weights has to encode every type of linguistic relationship at once: syntax, semantics, coreference, all of it. That's inefficient and underpowered. Multiple heads let each one focus on a different slice of the relationship space, and the model is far richer for it.

Notice the bookkeeping: you start each token as a $d$-dimensional vector and, after all the heads and the projection, you end with a $d$-dimensional vector. A much richer one: it now encodes the token plus weighted information from its whole left context, across many relationship types. But the same shape, which means you can feed it into another attention layer and do it all again.
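Here is a minimal sketch of the concatenate-then-project step, again in plain Python. The weights are random stand-ins, the sizes ($d = 8$, $A = 4$ heads, $d_k = d_v = 2$) are toy values, and the function names are invented for this example. Note that with several heads $W^O$ has shape $[A \cdot d_v \times d]$; with the original paper's numbers, 8 heads of 64 dimensions concatenate to 512, which is exactly $d$.

```python
import math
import random

def vec_mat(v, W):
    """Row vector [n] times matrix [n x m] gives a row vector [m]."""
    return [sum(v[r] * W[r][c] for r in range(len(v))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head(xs, W_Q, W_K, W_V):
    """One causal head; returns a d_v-dimensional vector per token (no W^O yet)."""
    d_k = len(W_Q[0])
    qs = [vec_mat(x, W_Q) for x in xs]
    ks = [vec_mat(x, W_K) for x in xs]
    vs = [vec_mat(x, W_V) for x in xs]
    out = []
    for i in range(len(xs)):
        alphas = softmax([sum(a * b for a, b in zip(qs[i], ks[j])) / math.sqrt(d_k)
                          for j in range(i + 1)])
        out.append([sum(alphas[j] * vs[j][c] for j in range(i + 1))
                    for c in range(len(vs[0]))])
    return out

def multi_head_attention(xs, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    per_head = [head(xs, *weights) for weights in heads]
    outputs = []
    for i in range(len(xs)):
        concat = [value for h in per_head for value in h[i]]   # join all heads for token i
        outputs.append(vec_mat(concat, W_O))                   # project back to d dimensions
    return outputs

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)                  # untrained, random stand-ins for learned weights
d, d_k, d_v, A = 8, 2, 2, 4
heads = [(rand_matrix(d, d_k, rng), rand_matrix(d, d_k, rng), rand_matrix(d, d_v, rng))
         for _ in range(A)]
W_O = rand_matrix(A * d_v, d, rng)
xs = rand_matrix(3, d, rng)
out = multi_head_attention(xs, heads, W_O)
```

Every vector in `out` is $d$-dimensional again, so the result could be fed straight into another attention layer.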

The Transformer Block

Self-attention is the heart of the thing, but it isn't the whole thing. Attention gets wrapped inside a transformer block, along with three other pieces: a feedforward layer, residual connections, and layer normalization.

The Residual Stream

Think of each token as a stream flowing up through the block. The token's embedding enters at the bottom, and each component reads from the stream, computes something, and adds its result back in. Nothing replaces the stream; everything augments it. This is the residual-stream picture, and it's also exactly what makes parallel computation natural; each token's stream is mostly independent.

The one exception: attention is the only component that reads from other tokens' streams. Everything else in the block operates on a single token's stream in isolation. That's why people call attention the token-mixing component; it's literally the step that moves information between tokens. Everything else just refines a token in place.
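The token-mixing claim is easy to test on a toy example: change one token and see which outputs move. The sketch below uses the simplified dot-product attention from earlier in the post and a ReLU as a stand-in for any per-token component; both stand-ins and all the numbers are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(xs):
    """Simplified attention: each token mixes in the tokens j <= i."""
    out = []
    for i, x_i in enumerate(xs):
        prior = xs[: i + 1]
        alphas = softmax([sum(a * b for a, b in zip(x_i, x_j)) for x_j in prior])
        out.append([sum(w * x_j[c] for w, x_j in zip(alphas, prior))
                    for c in range(len(x_i))])
    return out

def per_token_op(x):
    """Stand-in for layer norm or the feedforward layer: it sees one token only."""
    return [max(0.0, v) for v in x]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [row[:] for row in xs]
changed[0] = [5.0, -2.0]                # perturb only the first token

attn_before, attn_after = causal_attention(xs), causal_attention(changed)
ffn_before = [per_token_op(x) for x in xs]
ffn_after = [per_token_op(x) for x in changed]

attention_moved_info = attn_before[2] != attn_after[2]    # True
per_token_untouched = ffn_before[1:] == ffn_after[1:]     # True
```

Changing the first token changes the attention output for the third token, while the per-token operation's outputs for the second and third tokens stay exactly as they were.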

Why a Feedforward Layer?

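A subtle point, easy to miss. Attention, for all its machinery, mostly mixes: its output is a weighted average of value vectors, and the values are just linear projections of the inputs. The softmax that sets the weights is the only non-linearity, and it decides how much of each token to blend in, not how to transform what gets blended. Stack linear maps and you still have a linear map. To actually learn rich representations, you need a non-linear transformation applied to each token's vector.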

That's what the feedforward layer is for. It's a plain two-layer network with a ReLU, applied to each token independently:

\text{FFN}(x_i) = \text{ReLU}(x_i W_1 + b_1) W_2 + b_2

One detail to flag: the hidden layer of this network is usually bigger than the model dimension. In the original transformer, $d = 512$ but the feedforward hidden size was $d_{ff} = 2048$, four times larger.
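As a small illustration of the formula, here is the feedforward layer on one token with $d = 2$ and $d_{ff} = 4$. The weights are hand-picked, not learned, so the arithmetic can be followed by eye; returning the hidden vector alongside the output is a choice made for this sketch only.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def vec_mat(v, W):
    """Row vector [n] times matrix [n x m] gives a row vector [m]."""
    return [sum(v[r] * W[r][c] for r in range(len(v))) for c in range(len(W[0]))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2 for a single token vector."""
    hidden = relu([h + b for h, b in zip(vec_mat(x, W1), b1)])
    out = [o + b for o, b in zip(vec_mat(hidden, W2), b2)]
    return out, hidden

# Hand-picked weights: W1 is [d x d_ff], W2 is [d_ff x d].
W1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [-1.0, 0.0],
      [0.0, 1.0],
      [0.0, -1.0]]
b2 = [0.0, 0.0]

out, hidden = ffn([2.0, -3.0], W1, b1, W2, b2)
# hidden is [2.0, 0.0, 0.0, 3.0]: wider than the input, with negatives zeroed by ReLU
# out is [2.0, -3.0]: back to d dimensions
```

These particular weights split each coordinate into its positive and negative parts and then recombine them, so the output happens to equal the input. The point is only to make the wider hidden layer and the ReLU's zeroing visible; trained weights would compute something far less tidy.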

Layer Norm

At two points in the block, the token vector gets normalized. Layer norm is essentially the z-score from statistics, applied to a single token's vector: compute the mean $\mu$ and standard deviation $\sigma$ across the vector's components, subtract the mean, divide by the standard deviation, then scale and shift by two learned parameters $\gamma$ (gain) and $\beta$ (offset):

\text{LayerNorm}(x) = \gamma \, \frac{(x - \mu)}{\sigma} + \beta

It keeps the values in a range that makes gradient-based training behave. Despite the name, it normalizes a single token's embedding, not a whole layer.
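A minimal sketch of that formula for one token vector follows. Two things here are assumptions beyond the equation above: the small `eps` added under the square root to avoid dividing by zero, and treating $\gamma$ and $\beta$ as per-component vectors, which is how most implementations store them.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector across its components, then scale and shift.

    eps is an assumption added for numerical safety; it is not in the formula above.
    """
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

mean_y = sum(y) / len(y)
std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / len(y))
```

With $\gamma = 1$ and $\beta = 0$, the output has mean approximately 0 and standard deviation approximately 1, whatever the scale of the input. During training, $\gamma$ and $\beta$ let the model move away from that default if it helps.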

Putting the Block Together

Stack those pieces, and a single transformer block computes the following. (This is the prenorm arrangement, where layer norm comes before attention and feedforward, the version used in most modern transformers.)

t_i^1 = \text{LayerNorm}(x_i)

t_i^2 = \text{MultiHeadAttention}(t_i^1, [t_1^1, \ldots, t_N^1])

t_i^3 = t_i^2 + x_i

t_i^4 = \text{LayerNorm}(t_i^3)

t_i^5 = \text{FFN}(t_i^4)

h_i = t_i^5 + t_i^3

Read the $+ x_i$ and $+ t_i^3$ terms as the residual connections: the stream carries the original vector forward, and each sublayer's result gets added to it. The output $h_i$ is the block's representation of token $i$.
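The six equations translate almost line for line into code. The sketch below focuses on the prenorm wiring only: the attention and feedforward sublayers are passed in as functions, and the ones used here (the simplified dot-product attention from earlier, and a bare ReLU) are stand-ins rather than the real multi-head attention and two-layer FFN. Layer norm is shown with $\gamma = 1$, $\beta = 0$ and an assumed `eps`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm with gamma = 1 and beta = 0; eps is an assumed safety term."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)
    return [(v - mu) / sigma for v in x]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(ts):
    """Stand-in for MultiHeadAttention: simplified causal dot-product attention."""
    out = []
    for i, t_i in enumerate(ts):
        prior = ts[: i + 1]
        alphas = softmax([sum(a * b for a, b in zip(t_i, t_j)) for t_j in prior])
        out.append([sum(w * t_j[c] for w, t_j in zip(alphas, prior))
                    for c in range(len(t_i))])
    return out

def toy_ffn(t):
    """Stand-in for the feedforward layer."""
    return [max(0.0, v) for v in t]

def transformer_block(xs, attention, ffn):
    t1 = [layer_norm(x) for x in xs]            # t^1 = LayerNorm(x)
    t2 = attention(t1)                          # t^2: the only step that mixes tokens
    t3 = [add(a, x) for a, x in zip(t2, xs)]    # t^3 = t^2 + x   (residual)
    t4 = [layer_norm(t) for t in t3]            # t^4 = LayerNorm(t^3)
    t5 = [ffn(t) for t in t4]                   # t^5 = FFN(t^4)
    return [add(f, t) for f, t in zip(t5, t3)]  # h = t^5 + t^3   (residual)

xs = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [2.0, 1.0, 1.0]]
hs = transformer_block(xs, toy_attention, toy_ffn)
```

`hs` has one vector per input token, each the same length as its input. In a real block the two layer norms would each have their own learned $\gamma$ and $\beta$.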

Stacking

Two facts make stacking work:

Same dimensionality in and out. Every vector (input, output, and the intermediate $t$ vectors) is $d$-dimensional. So a block's output is exactly the right shape to be another block's input.
Weights are shared across token positions but differ across layers. Within one block, every token position uses the same weight matrices. But block 1 and block 2 have their own separate weights.

So you stack these blocks, 12 in a small model, 96 or more in a large one, and each one builds a richer contextual representation on top of the last. At the very bottom, a token's vector mostly represents that token. Near the top, it's increasingly representing the next token, since the whole stack is trained to predict what comes next.
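The stacking loop itself is short. The sketch below uses a deliberately simplified per-token layer (a residual feedforward update with random weights, and no attention) as a stand-in for a full transformer block, just to show the two facts above: each layer owns its weights, and the same layer is applied at every position. The sizes and the `make_layer` name are assumptions for illustration.

```python
import random

def make_layer(d, d_ff, rng):
    """A toy per-token layer, x + ReLU(x W1) W2, with its own random weights."""
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_ff)] for _ in range(d)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d_ff)]

    def layer(x):
        hidden = [max(0.0, sum(x[r] * W1[r][c] for r in range(d)))
                  for c in range(d_ff)]
        update = [sum(hidden[r] * W2[r][c] for r in range(d_ff))
                  for c in range(d)]
        return [a + b for a, b in zip(x, update)]    # residual add keeps the shape at d

    return layer

rng = random.Random(0)
d, d_ff, n_layers, n_tokens = 4, 8, 3, 5
layers = [make_layer(d, d_ff, rng) for _ in range(n_layers)]   # separate weights per layer
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]

hs = xs
for layer in layers:
    # Within one layer, the same weights are applied at every token position.
    hs = [layer(h) for h in hs]
```

Because every layer maps $d$-dimensional vectors to $d$-dimensional vectors, the loop works for any number of layers without reshaping anything in between.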

This is the magic, and it's worth saying plainly. You feed in a token as a plain embedding. It flows up through dozens of blocks. At each one, attention mixes in weighted context from every prior token across many relationship types, and the feedforward layer adds non-linearity. What comes out the top is a far richer, context-resolved representation, and yet it's the same shape as what went in. That modularity is the whole trick.

New terms in this post, at a glance

Contextual embedding — a token's vector computed from its surrounding words, so it differs by context (unlike a static word2vec vector).

Self-attention — computing a token's new representation as a weighted sum of the tokens it attends to.

Query / Key / Value — the three roles each token plays: the query does the looking, keys are matched against it, values are what gets summed.

Attention head — one full query-key-value attention computation.

Multi-head attention — many heads in parallel, each specializing in a different kind of relationship.

Model dimension (d) — the size of the vector flowing through the network; the same in and out of every block.

Residual stream — the per-token vector that flows up through a block, with each component adding into it.

Layer norm — z-score normalization applied to a single token's vector.

Token-mixing — the property that only attention moves information between tokens.

What You Now Have

Seven things from this lecture:

Static embeddings can't represent context. It in "the chicken didn't cross the road because it was too tired/wide" means two different things; one frozen vector can't capture both. Transformers compute a fresh contextual embedding per token instead.
The RNN-vs-transformer difference is parallelism. RNNs compute token by token, each waiting on the last. Transformers compute every token at once — far faster — and recover cross-token dependencies through attention rather than recurrence.
Attention is a weighted sum of vectors. Score each prior word's similarity to the current word, softmax the scores into weights, take the weighted sum. The most relevant words contribute the most. In causal models, a word attends only to itself and the words before it.
Query, key, value. Each token is projected by learned matrices $W^Q$, $W^K$, $W^V$ into three roles: the query (current word looking), the keys (prior words being matched), and the values (what prior words contribute). Score = scaled dot product of query and key; output = weighted sum of values, projected back to $d$ dimensions by $W^O$.
Multi-head attention. Many heads run in parallel, each with its own weights, each specializing in a different kind of word relationship. Concatenate them and project back down to the model dimension $d$. One head would have to encode every relationship type at once; many heads share the load.
The transformer block wraps attention in layer norm (z-score normalization), a feedforward layer (ReLU, the per-token non-linearity attention lacks, with a hidden size larger than $d$), and residual connections (the stream that carries each token forward). Only attention mixes information across tokens.
Stacking works because the shapes match. Same $d$-dimensional vector in and out, weights shared across positions but distinct per layer. Stack 12 to 96+ blocks, each building a richer representation than the last.

Next up: making it run, and making it generate. We have the mechanics of attention and the block, but three pieces are still missing. First, the matrix formulation that lets us compute all of this in genuine parallel, packing the whole sequence into one matrix $X$ and doing attention as a few big matrix multiplies (with one consequence: attention cost grows quadratically with sequence length). Second, causal masking, the trick that stops a token from attending to the future when everything is computed at once. Third, position embeddings and the language modeling head that turns the top-of-stack vector into a next-word prediction. That's where the transformer stops being an idea and starts being a language model.
