Building Context: Query, Key, Value, and the Transformer Block
The last two posts ended on the same cliffhanger: RNNs carry context through a single thread of hidden state that frays on long sequences and forces everything through one bottleneck vector. The fix, I kept saying, is attention. This is the post that delivers it. By the end, you'll know why a static word embedding can't represent a word's meaning in context, how self-attention computes a fresh, context-aware embedding for every token by letting it "attend to" the words around it, exactly what the query, key, and value vectors are and where they come from, why transformers use many attention heads instead of one, and how a full transformer block wraps attention in layer norm, a feedforward layer, and residual connections. For this post, we'll stop right before the matrix tricks that make all of this run in parallel.
The Problem: Static Embeddings Don't Know Context
Go back to word2vec. It gives every word a single fixed vector. The word chicken is always the same vector, no matter the sentence. We already flagged this as a limitation, and transformers are where we finally fix it.
Here's the example the textbook (and the lecture) builds everything on:
The chicken didn't cross the road because it was too tired.
The chicken didn't cross the road because it was too wide.
In the first sentence, it means the chicken. In the second, it means the road. The word is identical; the meaning flips entirely based on context. A static embedding for it (one frozen vector) cannot capture both. It would have to encode "pronoun, used for animals and inanimate things" and stop there, blind to which one is meant here.
Now read it the way a left-to-right language model does, stopping at it:
The chicken didn't cross the road because it _____
At this exact point, you don't yet know whether it will turn out to be the chicken or the road. So a good representation of it right here should carry information about both candidate words, chicken and road, ready to be resolved once the next word arrives.
Why Not Just Use RNNs? (A Near Miss)
Fair question, and worth treating as a near miss, because the RNN almost solves this. LSTMs already carry information forward: the sixth word's hidden state holds some trace of the first word. So context does reach across the sentence. Couldn't that give us the contextual embeddings we want?
It can, which is exactly why it's a near miss and not a dead end. But it's clunky, and it has one fatal inefficiency:
RNNs are iterative. Transformers are parallel.
An RNN has to compute the hidden state for token 1 before token 2, token 2 before token 3, and so on down the sequence. Every step waits on the one before it. That's inherently serial, and it's slow.
A transformer computes the representation for every token at the same time. Token 5 doesn't wait for token 4. This massively parallel computation is the single biggest computational difference between RNNs and transformers, and that's the reason why transformers scale to the enormous models we have now.
But there's a catch. If every token is processed independently and in parallel, how do dependencies between words survive? If token 5 doesn't wait for token 4, how does it know about token 4 at all?
That's the job of attention. It's the one component that lets tokens see each other, and it's built so that all those cross-token comparisons can still happen in parallel.
Attention, Intuitively
We build the contextual embedding for a word by selectively integrating information from its neighbors. The key word is selectively; some neighbors matter more than others.
The vocabulary here: a word attends to the neighbors it draws meaning from. In our example, it attends strongly to chicken and road, and only weakly to the, didn't, cross. Those two nouns are where the meaning of it lives, so that's where it pays attention.
Picture the network in layers. At layer $k$, you have a representation for every word. To compute the representation of it at layer $k+1$, you look back across all the layer-$k$ representations of the prior words and pull in information from them, weighted so that chicken and road contribute the most. Stack enough of these layers and the embedding for it ends up carrying an enormous amount of resolved, contextual meaning.
So, formally: attention is a method for computing a weighted sum of vectors. The whole game is figuring out the weights.
Attention Is Left-to-Right (for now)
In the causal language models we're building, a word can only attend to itself and the words before it, never ahead. When we compute the attention output for token 5, it draws on tokens 1 through 5. When we compute it for token 2, it draws on tokens 1 and 2. No peeking at the future, because at generation time, the future doesn't exist yet.
(There's a version of attention that does look both ways. That's BERT, and it's a post for another day.)
The Simplified Version
Before the real machinery, the textbook gives a stripped-down version that captures the idea. The attention output $a_i$ for token $i$ is just a weighted sum of all the prior token vectors:
$$a_i = \sum_{j \le i} \alpha_{ij} x_j$$
Each $\alpha_{ij}$ is a scalar that says how much token $j$ should contribute to token $i$'s new representation. How do we get those weights? By similarity. A word should draw most from the prior words most similar to it.
The simplest similarity measure between two vectors is the dot product. It maps two vectors to a single number, larger when they point in the same direction:
$$\text{score}(x_i, x_j) = x_i \cdot x_j$$
Then we push the scores through a softmax to turn them into weights between 0 and 1 that sum to 1:
$$\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)) \quad \forall j \le i$$
That's the idea, start to finish. Compare the current word to each prior word using dot product, softmax the scores to obtain weights, and take the weighted sum. The most similar words contribute the most.
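The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the textbook's code: the function names (`dot`, `softmax`, `simple_attention`) and the tiny 2-dimensional "embeddings" are made up for the example, and plain lists are used only to keep it dependency-free.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_attention(xs, i):
    """Simplified causal attention output for token i (0-indexed)."""
    prior = xs[: i + 1]                        # token i and everything before it
    scores = [dot(xs[i], xj) for xj in prior]  # similarity by dot product
    weights = softmax(scores)                  # the alpha_ij values
    dim = len(xs[i])
    output = [sum(w * xj[c] for w, xj in zip(weights, prior)) for c in range(dim)]
    return output, weights

# Hypothetical 2-d vectors for three tokens, invented for illustration.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = simple_attention(xs, 2)
```

Here the third token scores highest against itself (dot product 2 versus 1 for each of the others), so it gets the largest weight, and the two earlier tokens split the remainder equally.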
Query, Key, Value: The Real Attention Head
The simplified version uses each word vector directly for everything. The real version notices that each vector actually plays three different roles in the attention computation, and gives each role its own representation:
- Query: the current word, doing the looking. ("I'm it; what should I pay attention to?")
- Key: a prior word, being looked at, used to compute how similar it is to the query. ("I'm chicken; how relevant am I to you?")
- Value: the actual information a prior word contributes once it's been judged relevant. ("Here's what I, chicken, add to your meaning.")
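To produce these three roles, transformers learn three weight matrices, $W^Q$, $W^K$, and $W^V$, that project each input vector $x_i$ into a query, a key, and a value:
$$q_i = x_i W^Q; \quad k_i = x_i W^K; \quad v_i = x_i W^V$$
Where do those matrices come from? Training. They're learned, like every other weight in the network.
Now the similarity is computed between the current word's query and each prior word's key, and we scale the dot product down by $\sqrt{d_k}$ (the square root of the key/query dimension) to keep the numbers from blowing up and wrecking the softmax:
$$\text{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
$$\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)) \quad \forall j \le i$$
The output sums the value vectors (not the raw inputs), weighted by those attention scores:
$$\text{head}_i = \sum_{j \le i} \alpha_{ij} v_j$$
And one last matrix, $W^O$, reshapes that result back to the model dimension:
$$a_i = \text{head}_i W^O$$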
Mind the Shapes
One piece of advice for this diagram: watch the dimensions. They're the key to understanding what's moving where.
Quick reference: the shapes
- $x_i$ (input token): $[1 \times d]$, where $d$ is the model dimension.
- $W^Q$ and $W^K$: shape $[d \times d_k]$. So queries and keys are $d_k$-dimensional, and their dot product is a single scalar.
- $W^V$: shape $[d \times d_v]$. So the value vectors and the head output are $d_v$-dimensional.
- $W^O$: shape $[d_v \times d]$. Reshapes the head output back to $d$ dimensions.
- Original transformer paper: $d = 512$, $d_k = d_v = 64$.
So you start with a $d$-dimensional token and end with a $d$-dimensional token, same shape in, same shape out. In many architectures, the key/query and value dimensions are set equal, but they don't have to be.
That "same shape in, same shape out" is not an accident; it's what lets us stack these layers, which we'll need in a minute.
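To make the shapes concrete, here is a rough sketch of one causal attention head in plain Python. The helper names (`matvec`, `random_matrix`, `attention_head`) are hypothetical, the weight matrices are random stand-ins for what training would learn, and the dimensions are toy values rather than the paper's 512 and 64.

```python
import math
import random

def matvec(x, W):
    """Row vector x [1 x n] times matrix W [n x m] gives a vector [1 x m]."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention_head(xs, i, WQ, WK, WV, WO):
    """Single-head causal attention output for token i (0-indexed)."""
    d_k = len(WQ[0])
    d_v = len(WV[0])
    q = matvec(xs[i], WQ)                             # query for the current token
    keys = [matvec(xj, WK) for xj in xs[: i + 1]]     # keys for tokens 0..i
    values = [matvec(xj, WV) for xj in xs[: i + 1]]   # values for tokens 0..i
    # Scaled dot product between the query and each key.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors: a d_v-dimensional head output.
    head = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)]
    return matvec(head, WO), weights                  # project d_v back to d

rng = random.Random(0)
d, d_k, d_v = 8, 4, 4  # toy sizes chosen for the example
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]
WQ, WK = random_matrix(d, d_k, rng), random_matrix(d, d_k, rng)
WV, WO = random_matrix(d, d_v, rng), random_matrix(d_v, d, rng)
a_i, weights = attention_head(xs, 3, WQ, WK, WV, WO)
```

The input token has `d` components and so does `a_i`, while the intermediate query, key, and value vectors live in the smaller `d_k` and `d_v` spaces.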
Multi-Head Attention
One attention head computes one kind of similarity. But words relate to each other in many ways at once (syntactic agreement, synonymy, coreference, topical relatedness, etc.) and patterns we don't even have names for. Cramming all of that into a single head is asking too much.
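So transformers run many attention heads in parallel, each with its own $W^Q_c$, $W^K_c$, $W^V_c$ matrices (for head $c$, with $1 \le c \le h$), each free to specialize in a different aspect of how words relate:
$$q_i^c = x_i W^Q_c; \quad k_j^c = x_j W^K_c; \quad v_j^c = x_j W^V_c$$
$$\text{score}^c(x_i, x_j) = \frac{q_i^c \cdot k_j^c}{\sqrt{d_k}}$$
$$\alpha_{ij}^c = \text{softmax}(\text{score}^c(x_i, x_j)) \quad \forall j \le i$$
$$\text{head}_i^c = \sum_{j \le i} \alpha_{ij}^c v_j^c$$
Each head produces its own output. You concatenate all of them and project back down to the model dimension with a final matrix $W^O$, which now has shape $[h d_v \times d]$ because it takes all $h$ head outputs at once:
$$a_i = (\text{head}_i^1 \oplus \text{head}_i^2 \oplus \dots \oplus \text{head}_i^h) W^O$$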
How many heads? Real models use a lot. The original paper used 8; large models run 128 or more. You can't cleanly point at head #37 and say "this one does subject-verb agreement," but collectively the heads cover the many latent kinds of relationships between words.
Notice the bookkeeping: you start each token as a $d$-dimensional vector and, after all the heads and the projection, you end with a $d$-dimensional vector. A much richer one: it now encodes the token plus weighted information from its whole left context, across many relationship types. But the same shape, which means you can feed it into another attention layer and do it all again.
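A minimal sketch of the concatenate-and-project step, again in plain Python with made-up names (`head_output`, `multi_head_attention`) and random stand-in weights. It assumes every head uses the same `d_k` and `d_v`, set to `d // h` here as a common convention rather than a requirement.

```python
import math
import random

def matvec(x, W):
    """Row vector x [1 x n] times matrix W [n x m]."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def head_output(xs, i, WQ, WK, WV):
    """One head's d_v-dimensional output for token i (causal)."""
    q = matvec(xs[i], WQ)
    keys = [matvec(xj, WK) for xj in xs[: i + 1]]
    values = [matvec(xj, WV) for xj in xs[: i + 1]]
    scale = math.sqrt(len(q))  # sqrt(d_k)
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

def multi_head_attention(xs, i, heads, WO):
    """Concatenate every head's output, then project back to d dimensions."""
    concat = []
    for WQ, WK, WV in heads:       # each head has its own three matrices
        concat.extend(head_output(xs, i, WQ, WK, WV))
    return matvec(concat, WO)      # [1 x h*d_v] times [h*d_v x d]

rng = random.Random(0)
d, h = 8, 2                        # toy sizes for the example
d_k = d_v = d // h
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]
heads = [
    (random_matrix(d, d_k, rng), random_matrix(d, d_k, rng), random_matrix(d, d_v, rng))
    for _ in range(h)
]
WO = random_matrix(h * d_v, d, rng)
a_i = multi_head_attention(xs, 2, heads, WO)
```

Because each head only reads tokens up to position `i`, changing a later token leaves `a_i` untouched, which is the causal property in action.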
The Transformer Block
Self-attention is the heart of the thing, but it isn't the whole thing. Attention gets wrapped inside a transformer block, along with three other pieces: a feedforward layer, residual connections, and layer normalization.
The Residual Stream
Think of each token as a stream flowing up through the block. The token's embedding enters at the bottom, and each component reads from the stream, computes something, and adds its result back in. Nothing replaces the stream; everything augments it. This is the residual-stream picture, and it's also exactly what makes parallel computation natural; each token's stream is mostly independent.
The one exception: attention is the only component that reads from other tokens' streams. Everything else in the block operates on a single token's stream in isolation. That's why people call attention the token-mixing component; it's literally the step that moves information between tokens. Everything else just refines a token in place.
Why a Feedforward Layer?
A subtle point, easy to miss. Attention, for all its machinery, is essentially a bunch of dot products and weighted averages. The softmax only shapes the weights; once those are fixed, the output is a linear combination of the value vectors, so stacking attention alone adds very little expressive power. To actually learn rich representations, you need a non-linearity applied to each token's vector.
That's what the feedforward layer is for. It's a plain two-layer network with a ReLU, applied to each token independently:
$$\text{FFN}(x_i) = \text{ReLU}(x_i W_1 + b_1) W_2 + b_2$$
One detail to flag: the hidden layer of this network is usually bigger than the model dimension. In the original transformer, $d = 512$ but the feedforward hidden size was $d_{ff} = 2048$.
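As a rough sketch, the feedforward layer fits in a few lines of plain Python. The names (`ffn`, `W1`, `b1`, `W2`, `b2`) are illustrative, and the hand-picked weights below exist only to make the arithmetic easy to follow; the original transformer's sizes were 512 and 2048.

```python
def matvec(x, W):
    """Row vector x [1 x n] times matrix W [n x m]."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Zero out negative components; this is the non-linearity."""
    return [max(0.0, a) for a in v]

def ffn(x, W1, b1, W2, b2):
    """Two-layer feedforward network applied to a single token vector."""
    hidden = relu(add(matvec(x, W1), b1))  # expand to the hidden size
    return add(matvec(hidden, W2), b2)     # project back to the output size

# Tiny hand-picked example: the negative component is zeroed by the ReLU.
x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0], [1.0]]
b2 = [0.5]
result = ffn(x, W1, b1, W2, b2)
```

Tracing it by hand: the first layer leaves `[1.0, -2.0]` unchanged, the ReLU turns that into `[1.0, 0.0]`, and the second layer sums the components and adds the bias. Note that `ffn` sees one token at a time; nothing here looks at neighboring tokens.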
Layer Norm
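At two points in the block, the token vector gets normalized. Layer norm is essentially the z-score from statistics, applied to a single token's vector: compute the mean $\mu$ and standard deviation $\sigma$ across the vector's $d$ components, subtract the mean, divide by the standard deviation, then scale and shift by two learned parameters $\gamma$ (gain) and $\beta$ (offset):
$$\mu = \frac{1}{d} \sum_{c=1}^{d} x_c, \qquad \sigma = \sqrt{\frac{1}{d} \sum_{c=1}^{d} (x_c - \mu)^2}$$
$$\text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta$$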
It keeps the values in a range that makes gradient-based training behave. Despite the name, it normalizes a single token's embedding, not a whole layer.
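Here is a small sketch of that z-score computation in plain Python. Two things are assumptions on my part rather than something the post specifies: the small `eps` added under the square root (a common guard against dividing by zero), and treating the gain and offset as per-component vectors.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((a - mean) ** 2 for a in x) / d
    std = math.sqrt(var + eps)  # eps is an assumed numerical-stability term
    return [g * (a - mean) / std + b for a, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * len(x)  # gain: 1 leaves the z-scores unscaled
beta = [0.0] * len(x)   # offset: 0 leaves them unshifted
normed = layer_norm(x, gamma, beta)
```

With gain 1 and offset 0, `normed` has a mean of roughly zero and a variance of roughly one, and the computation never looks at any other token's vector.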
Putting the Block Together
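Stack those pieces, and a single transformer block computes the following. (This is the prenorm arrangement, where layer norm comes before attention and feedforward, the version used in most modern transformers.)
$$t_i^1 = \text{LayerNorm}(x_i)$$
$$t_i^2 = \text{MultiHeadAttention}(t_i^1, [t_1^1, \dots, t_i^1])$$
$$t_i^3 = t_i^2 + x_i$$
$$t_i^4 = \text{LayerNorm}(t_i^3)$$
$$t_i^5 = \text{FFN}(t_i^4)$$
$$h_i = t_i^5 + t_i^3$$
Read the $+ x_i$ and $+ t_i^3$ lines as the residual connections, the stream carrying the original vector forward and getting added back at each stage. The output $h_i$ is the block's representation of token $i$.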
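Putting the pieces in one place, a prenorm block might be sketched like this in plain Python. Everything here is illustrative: the parameter dictionary and its key names are invented for the example, the weights are random stand-ins for trained ones, the `eps` in layer norm is an assumed stability term, and the loop over tokens is written serially for readability even though a real implementation would compute all positions at once.

```python
import math
import random

def matvec(x, W):
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [g * (a - mean) / math.sqrt(var + eps) + b for a, g, b in zip(x, gamma, beta)]

def multi_head_attention(ts, i, heads, WO):
    """Causal multi-head attention for position i over normalized vectors ts."""
    concat = []
    for WQ, WK, WV in heads:
        q = matvec(ts[i], WQ)
        keys = [matvec(t, WK) for t in ts[: i + 1]]
        values = [matvec(t, WV) for t in ts[: i + 1]]
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        concat += [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return matvec(concat, WO)

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, a) for a in add(matvec(x, W1), b1)]
    return add(matvec(hidden, W2), b2)

def transformer_block(xs, p):
    """Prenorm block: norm, attention, residual, norm, feedforward, residual."""
    t1 = [layer_norm(x, p["g1"], p["b1"]) for x in xs]
    out = []
    for i, x in enumerate(xs):
        t2 = multi_head_attention(t1, i, p["heads"], p["WO"])  # the only token-mixing step
        t3 = add(t2, x)                                        # first residual connection
        t4 = layer_norm(t3, p["g2"], p["b2"])
        t5 = ffn(t4, p["W1"], p["c1"], p["W2"], p["c2"])
        out.append(add(t5, t3))                                # second residual connection
    return out

def init_params(d, h, d_ff, rng):
    """Random stand-ins for learned weights; key names are illustrative."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    d_head = d // h
    return {
        "heads": [(mat(d, d_head), mat(d, d_head), mat(d, d_head)) for _ in range(h)],
        "WO": mat(h * d_head, d),
        "W1": mat(d, d_ff), "c1": [0.0] * d_ff,
        "W2": mat(d_ff, d), "c2": [0.0] * d,
        "g1": [1.0] * d, "b1": [0.0] * d,
        "g2": [1.0] * d, "b2": [0.0] * d,
    }

rng = random.Random(0)
d, h, d_ff = 8, 2, 32  # toy sizes for the example
params = init_params(d, h, d_ff, rng)
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]
hs = transformer_block(xs, params)
```

Each entry of `hs` has the same length as the corresponding entry of `xs`, and only the `multi_head_attention` call ever reads a vector belonging to a different position.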
Stacking
Two facts make stacking work:
- Same dimensionality in and out. Every vector (input, output, and the intermediate vectors) is $d$-dimensional. So a block's output is exactly the right shape to be another block's input.
- Weights are shared across token positions but differ across layers. Within one block, every token position uses the same weight matrices. But block 1 and block 2 have their own separate weights.
So you stack these blocks, 12 in a small model, 96 or more in a large one, and each one builds a richer contextual representation on top of the last. At the very bottom, a token's vector mostly represents that token. Near the top, it's increasingly representing the next token, since the whole stack is trained to predict what comes next.
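The stacking loop itself is short. The sketch below uses a deliberately trivial stand-in, `toy_block` (a residual connection around one linear map), in place of a real transformer block, purely to show the two stacking facts: shapes are preserved, and weights are shared across positions but separate per layer. All names and sizes are hypothetical.

```python
import random

def matvec(x, W):
    """Row vector x [1 x n] times matrix W [n x m]."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def toy_block(xs, W):
    """Stand-in for a transformer block: every position uses the same W."""
    return [[a + b for a, b in zip(x, matvec(x, W))] for x in xs]

rng = random.Random(0)
d, n_layers, n_tokens = 4, 3, 5  # toy sizes for the example

# One weight matrix per layer: layers do not share weights with each other.
layers = [
    [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)]
    for _ in range(n_layers)
]

xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tokens)]

hs = xs
for W in layers:
    hs = toy_block(hs, W)  # output shape matches input shape, so this composes
```

Swapping `toy_block` for a full block changes what each layer computes, but not the loop: the output of one layer is always a valid input to the next.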
New terms in this post, at a glance
- Contextual embedding: a token's vector computed from its surrounding words, so it differs by context (unlike a static word2vec vector).
- Self-attention: computing a token's new representation as a weighted sum of the tokens it attends to.
- Query / Key / Value: the three roles each token plays: the query does the looking, keys are matched against it, values are what gets summed.
- Attention head: one full query-key-value attention computation.
- Multi-head attention: many heads in parallel, each specializing in a different kind of relationship.
- Model dimension ($d$): the size of the vector flowing through the network; the same in and out of every block.
- Residual stream: the per-token vector that flows up through a block, with each component adding into it.
- Layer norm: z-score normalization applied to a single token's vector.
- Token-mixing: the property that only attention moves information between tokens.
What You Now Have
Seven things from this lecture:
Static embeddings can't represent context. It in "the chicken didn't cross the road because it was too tired/wide" means two different things; one frozen vector can't capture both. Transformers compute a fresh contextual embedding per token instead.
The RNN-vs-transformer difference is parallelism. RNNs compute token by token, each waiting on the last. Transformers compute every token at once — far faster — and recover cross-token dependencies through attention rather than recurrence.
Attention is a weighted sum of vectors. Score each prior word's similarity to the current word, softmax the scores into weights, take the weighted sum. The most relevant words contribute the most. In causal models, a word attends only to itself and the words before it.
Query, key, value. Each token is projected by learned matrices into three roles: the query (current word looking), the keys (prior words being matched), and the values (what prior words contribute). Score = scaled dot product of query and key; output = weighted sum of values, reshaped by $W^O$.
Multi-head attention. Many heads run in parallel, each with its own weights, each specializing in a different kind of word relationship. Concatenate them and project back down to the model dimension $d$. One head would have to encode every relationship type at once; many heads share the load.
The transformer block wraps attention in layer norm (z-score normalization), a feedforward layer (ReLU, the non-linearity attention lacks, with a hidden size larger than $d$), and residual connections (the stream that carries each token forward). Only attention mixes information across tokens.
Stacking works because the shapes match. Same $d$-dimensional vector in and out, weights shared across positions but distinct per layer. Stack 12 to 96+ blocks, each building a richer representation than the last.





