Pavan Kumar Varanasi

GPT Has No Idea What Words Mean. That's the Whole Point.

And the attention mechanism is exactly how it figures things out anyway, with nothing but numbers.

Most explanations of attention stop at the cartoon: arrows between words, some glowing connections, a vague idea that tokens "look at each other."

I traced every single number by hand. Not pseudocode. Actual matrix multiplications, step by step.

Here's what actually happens.

The Setup

The walkthrough uses 6 sentences. That's it. Not the internet. Not Wikipedia. Just:

"I love pizza. You love burgers. She loves pizza. The cat chased the mouse. The dog chased the cat because it was fast. He hated broccoli but loved cake."

From these, the model gets 22 unique tokens and 23 training pairs, each one a (input sequence → next token) task. The first training example: given [I], predict love.
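
Those two counts are easy to check. Here is a minimal sketch that rebuilds the vocabulary and the training pairs. It assumes lowercased whitespace tokenization with the period stripped, and that pairs never cross a sentence boundary; the walkthrough doesn't spell out either choice, and the names `SENTENCES`, `tokenize`, and `build_pairs` are made up for illustration.

```python
# Hypothetical reconstruction of the toy dataset; the tokenization rules are assumptions.
SENTENCES = [
    "I love pizza.",
    "You love burgers.",
    "She loves pizza.",
    "The cat chased the mouse.",
    "The dog chased the cat because it was fast.",
    "He hated broccoli but loved cake.",
]


def tokenize(sentence):
    """Lowercase, drop the trailing period, split on whitespace."""
    return sentence.lower().rstrip(".").split()


def build_pairs(sentences):
    """Every prefix of a sentence is an input; the token right after it is the target."""
    pairs = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(1, len(tokens)):
            pairs.append((tokens[:i], tokens[i]))
    return pairs


vocab = sorted({tok for s in SENTENCES for tok in tokenize(s)})
pairs = build_pairs(SENTENCES)

print(len(vocab))  # 22 unique tokens
print(len(pairs))  # 23 training pairs
print(pairs[0])    # (['i'], 'love')
```

Under these assumptions "The" and "the" collapse into one token, which is what makes the vocabulary come out at 22 rather than 23.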

Before any training, the model assigns each word a random 2D coordinate vector called an embedding. "love" starts at [0.10, 0.30]. Completely meaningless right now. Just two numbers, initialized at random.

Then comes attention.

Before the Math: A Simple Example That Makes It Click

Take sentence 5 from the corpus: "The dog chased the cat because it was fast."

When the model reaches the word "it", it needs to answer: does "it" refer to the dog or the cat?

There's no rule it can look up. It has to decide by looking back at the earlier words and figuring out which one "it" is most compatible with. That's literally what attention does. For every word, it looks at that word and every word before it, and assigns each one a score: how much should I pay attention to you right now?

Now here's what that looks like with concrete numbers for our simpler sequence [I, love, pizza].

After all the Q·K dot products and the softmax have run, you get this table. Each row is a token asking questions; each column is a token being looked at:

Token asking    looks at "I"    looks at "love"    looks at "pizza"
I               100%            (masked)           (masked)
love            51%             49%                (masked)
pizza           34%             27%                39%

Read this like: "love" splits its attention. 51% goes to "I", 49% stays on itself. "pizza" can look at everyone and spreads across all three.

The "masked" entries don't start out as zeros. Their raw scores are set to -∞ before the softmax, and the softmax turns -∞ into a weight of exactly 0. This is called causal masking. When the model is predicting what comes after "I", it is literally blocked from seeing that "love" is already there. It can only use what came before.
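
To see the mask on its own, here is a small sketch with placeholder scores. Every raw score is set to 0.0, which is not what the real model produces; it just means the mask is the only thing shaping the result. The helper names `softmax` and `causal_mask` are mine.

```python
import math


def softmax(row):
    """Softmax over one row; exp(-inf) is exactly 0.0, so masked slots vanish."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(scores):
    """Replace every score above the diagonal (a future token) with -inf."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]


# Placeholder scores: all equal, so only the mask decides who gets weight.
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = [softmax(row) for row in causal_mask(raw_scores)]

for token, row in zip(["I", "love", "pizza"], weights):
    print(token, [round(w, 2) for w in row])
# I [1.0, 0.0, 0.0]
# love [0.5, 0.5, 0.0]
# pizza [0.33, 0.33, 0.33]
```

With equal scores each token splits its attention evenly over whatever it is allowed to see. The learned Q and K vectors are what tilt those even splits into 51/49 and 34/27/39.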

Three things to notice before any equations:

  1. "I" sees only itself because nothing has come before it. It has no context to borrow from yet.
  2. "love" already has context to lean on. It pays 51% attention to the subject "I" that came before it.
  3. "pizza" can see the full picture. It knows both "I" and "love" came before it and spreads attention across all three.

This table is the output of the entire Q/K/V machinery. Let's trace how it gets built.

What Q, K, V Actually Are

Every article explains Q, K, V conceptually: Query is what you're looking for, Key is what you're advertising, Value is what you actually send. Fine.

But what does that look like in numbers?

For the sequence [I, love, pizza], each token passes through three separate learned weight matrices (Wq, Wk, Wv) to produce its Q, K, and V vectors. For "love":

Q_love = [0.10, 0.25]   ← what "love" is asking for
K_love = [0.24, 0.08]   ← what "love" is advertising
V_love = [0.14, 0.27]   ← what "love" will send if attended to
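
The projection itself is one small matrix product. In the sketch below, only the "love" embedding [0.10, 0.30], the target Q_love, and the single entry Wq[0][0] = 0.400 come from the walkthrough. The other three entries of Wq are hypothetical, picked so the product lands on Q_love, and the row-vector-times-matrix convention is my assumption.

```python
def project(x, W):
    """Row vector times matrix: out[j] = sum over i of x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


love = [0.10, 0.30]  # embedding for "love" from the walkthrough

# HYPOTHETICAL Wq: only Wq[0][0] = 0.400 is given in the walkthrough.
# The remaining entries are chosen so that love @ Wq equals Q_love.
Wq = [
    [0.4, 0.1],
    [0.2, 0.8],
]

q_love = project(love, Wq)
print([round(v, 2) for v in q_love])  # [0.1, 0.25]
```

K and V come from the same operation with Wk and Wv in place of Wq.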

Then attention scores get computed: every token's Q vector is dot-producted against every token's K vector, its own included. Each score is divided by √d_k (here √2, because Q and K are 2-dimensional), masked so future tokens can't be seen, and softmaxed into weights that sum to 1.

For "love" in this sequence, the weights come out as:

  • Attend to "I": 51%
  • Attend to itself: 49%
  • Attend to "pizza": masked (it comes later in the sequence, so "love" is not allowed to see it)
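
Here is a sketch of that row. Q_love and K_love are the vectors given above. The walkthrough does not list K for "I", so `k_i` is a hypothetical stand-in that I picked because it reproduces the 51/49 split; treat it as an illustration, not the model's actual value.

```python
import math

D_K = 2  # dimension of the Q and K vectors in this toy model

q_love = [0.10, 0.25]  # from the walkthrough
k_love = [0.24, 0.08]  # from the walkthrough
k_i = [0.20, 0.32]     # HYPOTHETICAL: K for "I" is not given in the walkthrough


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Scores for the row "love": dot product divided by sqrt(d_k), future token masked.
scores = [
    dot(q_love, k_i) / math.sqrt(D_K),     # love -> I
    dot(q_love, k_love) / math.sqrt(D_K),  # love -> love
    float("-inf"),                         # love -> pizza (masked)
]

weights = softmax(scores)
print([round(w, 2) for w in weights])  # [0.51, 0.49, 0.0]
```

Notice how close the two raw scores are (0.10 versus 0.044 before scaling). With vectors this small, attention starts out nearly uniform; training is what sharpens it.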

Then the output for "love" is computed as a weighted sum of the V vectors:

out(love) = 0.51 × V_I + 0.49 × V_love
           = 0.51 × [0.32, 0.60] + 0.49 × [0.14, 0.27]
           = [0.163, 0.306] + [0.069, 0.132]
           = [0.232, 0.438]

Stop there. Look at what just happened.

"love" started as [0.10, 0.30]. After attention, it's [0.232, 0.438]. That is not the same vector. It is a different point in space. It's now carrying 51% of what "I" had to say.

This is not metaphor. The token "love" in the context of "I" is literally a different vector than the token "love" standing alone. That's how a Transformer handles polysemy. Not by building a lookup table for word senses, but by blending vectors in proportion to learned attention weights. Context flows through math.
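
The blend for "love" is short enough to check in code. The weights and V vectors are the ones from the walkthrough; the function name `weighted_sum` is mine.

```python
weights = {"I": 0.51, "love": 0.49}  # attention weights for the row "love"
values = {
    "I": [0.32, 0.60],     # V_I
    "love": [0.14, 0.27],  # V_love
}


def weighted_sum(weights, values):
    """Blend the value vectors, each scaled by its attention weight."""
    dims = len(next(iter(values.values())))
    return [
        sum(weights[tok] * values[tok][d] for tok in weights)
        for d in range(dims)
    ]


out_love = weighted_sum(weights, values)
print([round(x, 3) for x in out_love])  # [0.232, 0.438]
```

Swap in a different neighbor with a different V vector and the output for "love" moves with it. That is the whole mechanism behind context-dependent meaning.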

The Training Part Nobody Visualizes

After the forward pass, the model makes its first prediction. Given just [I], it should predict love. Initial result:

  • pizza: 22%
  • broccoli: 21%
  • love ✓: 17%

Loss = 1.772, which is -ln(0.17). Pizza wins by accident: its Wo column happens to start with larger random weights. The model is wrong, and at 22% it isn't even confident about it.

Then backpropagation runs. Every weight matrix gets a gradient and a tiny nudge:

  • Wq[0][0] drops from 0.400 to 0.397
  • Wo_love[0] bumps from 1.300 to 1.305
  • The love embedding shifts from [0.100, 0.300] to [0.099, 0.289]

These changes are tiny. The largest single update is 0.020. That's it.

But here's the thing: they compound. After 4 iterations on the same example:

Iteration    P(love)    Loss
1            17%        1.772
2            24%        1.427
3            38%        0.968
4            56%        0.580
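
The loss column follows directly from the probability column. A minimal check, assuming the loss is plain cross-entropy on the single correct token (the function name `cross_entropy` is mine):

```python
import math


def cross_entropy(p_correct):
    """Loss for one example: negative log of the probability given to the right token."""
    return -math.log(p_correct)


for step, p_love in enumerate([0.17, 0.24, 0.38, 0.56], start=1):
    print(step, f"{cross_entropy(p_love):.3f}")
# 1 1.772
# 2 1.427
# 3 0.968
# 4 0.580
```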

Wq[0][0] ended up at 0.391. Total shift: 0.009. Nudges of that size, applied to every weight and accumulated over four steps, flipped the prediction from pizza winning at 22% to love winning at 56%.
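
Here is a sketch of how repeated small nudges compound. It is heavily simplified and mostly hypothetical: a 3-token vocabulary, made-up starting weights, a made-up learning rate, and updates to the output matrix Wo only. The one borrowed number is `h`, which reuses V_I on the assumption that a one-token input attends 100% to itself. The probabilities it prints will not match the table above.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# HYPOTHETICAL toy setup, not the walkthrough's actual weights.
VOCAB = ["pizza", "broccoli", "love"]
TARGET = VOCAB.index("love")
h = [0.32, 0.60]        # attention output for [I], assumed equal to V_I
Wo = [
    [1.0, 0.9, 0.5],    # one column per token: pizza, broccoli, love
    [0.9, 0.8, 0.4],
]
LEARNING_RATE = 1.0

history = []
for step in range(4):
    # Forward pass: logit for token k is h dotted with column k of Wo.
    logits = [sum(h[d] * Wo[d][k] for d in range(len(h))) for k in range(len(VOCAB))]
    probs = softmax(logits)
    history.append(probs[TARGET])
    # Backward pass: d(loss)/d(Wo[d][k]) = (p_k - y_k) * h[d] for cross-entropy.
    for k in range(len(VOCAB)):
        error = probs[k] - (1.0 if k == TARGET else 0.0)
        for d in range(len(h)):
            Wo[d][k] -= LEARNING_RATE * error * h[d]

print([round(p, 3) for p in history])  # P(love) before each of the four updates
```

Each step raises the "love" column and lowers the other two, so P(love) climbs every iteration even though no single weight moves far.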

GPT-3 was trained on roughly 300 billion tokens, and every one of them was a next-token prediction nudging the weights like this.

The Generalization Moment

After those 4 iterations, the model was only ever shown [I] → love. Never [You] → love. Never anything with "You" at all.

When [You] is passed through the frozen weights:

  • P(love): 18%
  • Random baseline: 4.5% (1 in 22 tokens)

The model has never seen this example. But "You" and "I" live nearby in embedding space. Here that is because their initial coordinates happened to be similar; in a fully trained model, appearing in the same positions (subjects before verbs) is what pulls such tokens together. Multiplied through the same Wq and Wv, nearby embeddings produce similar Q vectors and similar V vectors. So "You" gets 18% on love, not because it was trained on that pair, but because geometric proximity in embedding space transferred the signal.
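
The proximity argument can be sketched with placeholder numbers. Nothing below is taken from the walkthrough: the embeddings, Wv, Wo, and the 3-token vocabulary are all hypothetical. The sketch also assumes a one-token input, where attention is 100% on itself, so the attention output is just the V vector.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(x, W):
    """Row vector times matrix: out[j] = sum over i of x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def predict(embedding, Wv, Wo):
    # One-token sequence: the attention output is simply V.
    return softmax(project(project(embedding, Wv), Wo))


# HYPOTHETICAL placeholders, chosen only to illustrate the geometry.
Wv = [[0.6, 0.1], [0.2, 0.9]]
Wo = [[1.3, 0.4, 0.7], [0.2, 0.9, 0.5]]  # columns: love, pizza, burgers
emb_i = [0.50, 0.20]
emb_you = [0.48, 0.23]   # close to "I"
emb_far = [-0.60, 0.90]  # a token far away in embedding space

p_i = predict(emb_i, Wv, Wo)
p_you = predict(emb_you, Wv, Wo)
p_far = predict(emb_far, Wv, Wo)

print([round(p, 3) for p in p_i])
print([round(p, 3) for p in p_you])
print([round(p, 3) for p in p_far])
```

The first two distributions are nearly identical and both favor "love"; the far-away token favors something else. Same weights, different starting point.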

That's zero-shot generalization. Not from a rule. Not from grammar. From the shape of a 2D vector space.

What Changed in My Head

What actually clicked was watching a single vector ([0.10, 0.30]) turn into [0.232, 0.438] mid-sequence, simply because the token next to it contributed 51% of its payload. And then watching a weight change of 0.009 compound across four steps until the prediction flipped entirely.

The math is the same at GPT-3 scale. The heads are 128-dimensional instead of 2. There are 96 of them per layer instead of 1. There are 96 layers instead of 1. But every Q·K dot product, every softmax row, every weighted V sum is the same operation, just bigger.

The model never learned a rule. It learned a direction.

Every weight is a tiny compass needle. Training points all of them so that together, when a subject token sits before a verb, the attention weights tilt the verb's vector toward the subject's value. No grammar book. No verb tables. Just geometry, and enough examples to make the geometry useful.

That's the thing I wish someone had shown me with actual numbers three years ago.
