Upayan Ghosh

From Tokens to Attention: My First Real Mental Model of LLMs

Note: I have intentionally simplified the vector math here to keep it accessible to a wider audience.

I wanted to learn LLMs properly.

Not just use an API. Not just call generate() from a library and pretend I understood what happened underneath. I wanted to know how a model takes plain text, turns it into numbers, reads context, and predicts the next word.

The end goal is ambitious: build a mini LLM from scratch.

But before touching PyTorch, I realized I needed the mental model first. Code without understanding becomes copy-paste gymnastics. So I started with the real basics.

Text Is Not Text to a Model

A language model does not read words the way we do.

If I write:

hello

The model cannot directly understand those letters. First, the text has to be converted into numbers. That conversion is called tokenization.

The simplest version is character-level tokenization.

If our tiny training text is:

hello world

We collect every unique character and assign each one an ID.

For example:

d -> 0
e -> 1
h -> 2
l -> 3
o -> 4
r -> 5
w -> 6
space -> 7

Now the word:

led

becomes:

[3, 1, 0]

That was the first click for me. Tokenization is basically a dictionary lookup. Text goes in, integer IDs come out.

And decoding is just the reverse. If encoding turns "he" into [2, 1], decoding turns [2, 1] back into "he".
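Here is a minimal sketch of that dictionary lookup in Python. It assumes the vocabulary order from the table above (letters sorted alphabetically, space placed last). The names `stoi`, `itos`, `encode`, and `decode` are illustrative choices, not part of any library.

```python
# Character-level tokenizer sketch built from the tiny training text.
text = "hello world"

# Assumption: letters are sorted and the space goes last, to match the table above.
chars = sorted(set(text) - {" "}) + [" "]
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> ID
itos = {i: ch for ch, i in stoi.items()}       # ID -> character

def encode(s):
    """Turn a string into a list of token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)

print(encode("led"))   # [3, 1, 0]
print(decode([2, 1]))  # he
```

Any character that never appeared in the training text would raise a `KeyError` here, which is one reason real tokenizers need a plan for unknown input.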

So the first pipeline looks like this:

Raw text -> tokenizer -> token IDs

But token IDs alone are not enough.

Why Token IDs Are Not Meaning

If h = 2 and l = 3, that does not mean l is somehow "greater" than h.

Those numbers are just labels. The model needs a richer representation.

That is where embeddings come in.

An embedding turns each token ID into a vector, which is just a list of numbers.

Instead of:

cat -> 12

We get something more like:

cat -> [0.9, 0.2, 0.5, ...]

A useful way to imagine this is a hidden map of meaning.

Words used in similar ways slowly move closer together during training. So after enough learning, words like:

king and queen
apple and orange
cat and dog

end up close in vector space.

The model does not start with this knowledge. The embeddings begin as random numbers. Training adjusts them until useful patterns emerge.

That was the second click: embeddings are not manually written meanings. They are learned coordinates.
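A rough sketch of an embedding table in plain Python, assuming a vocabulary of 8 tokens and a made-up vector size of 4. The table starts as random numbers, as described above; nothing here trains it. `embedding_table`, `embed`, and `cosine_similarity` are hypothetical names for illustration.

```python
import math
import random

random.seed(0)   # fixed seed so the sketch is repeatable

vocab_size = 8   # the 8 characters in the "hello world" vocabulary
embed_dim = 4    # tiny on purpose; real models use far larger vectors

# One row of random numbers per token ID. Training would adjust these values.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

def cosine_similarity(a, b):
    """Closeness of two vectors by direction: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vectors = embed([3, 1, 0])  # the IDs for "led"
print(cosine_similarity(vectors[0], vectors[1]))
```

With random values the similarity between two tokens is arbitrary. The claim in the text is that training would push tokens used in similar ways toward a similarity closer to 1.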

The Order Problem

Once tokens become embeddings, there is still a huge problem.

Transformers process all tokens in parallel. That is powerful, but it means the model has no built-in sense of word order.

These two sentences contain the same words:

The cat ate the mouse.
The mouse ate the cat.

But they mean completely different things.

So we need to inject position information.

The obvious idea is to use indexes:

The -> position 0
cat -> position 1
ate -> position 2

At first, this sounds perfect. Arrays already have indexes, so why not just use them?

The issue is scale.

If we directly add raw indexes, later positions can become huge. A word at position 1999 gets a massive position number compared to the small values inside its embedding. The position can overpower the meaning.

A normalized index also causes trouble.

For a 3-word sentence:

index 2 / length 3 = 0.667

For a 100-word sentence:

index 2 / length 100 = 0.02

Same index. Completely different value.

That means the model has to deal with position values that shift depending on sentence length.
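Both problems are easy to see with a few lines of Python. This is only an illustration of the arithmetic; `normalized_position` is a hypothetical helper and the embedding value of 0.5 is made up.

```python
def normalized_position(index, length):
    """Position as a fraction of the sentence length."""
    return index / length

# Same index, different sentence lengths, different position value
print(round(normalized_position(2, 3), 3))    # 0.667
print(round(normalized_position(2, 100), 3))  # 0.02

# Raw index added to a small embedding value: the position dominates the sum
embedding_value = 0.5   # hypothetical value inside an embedding
raw_index = 1999
print(embedding_value + raw_index)  # 1999.5
```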

Sine and cosine positional encodings solve this in a neat way.

A sine wave always stays between -1 and 1, so the values never explode. Also, the position value depends on the index and a fixed rhythm, not on the total sentence length.

For example:

sin(index * frequency)

If index = 2 and frequency = 0.5:

sin(2 * 0.5) = sin(1) = 0.841

That value stays the same whether the sentence has 3 words or 100 words.

Real transformers use many sine and cosine waves with different frequencies. Fast waves capture small position changes. Slow waves help distinguish positions farther apart.
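A sketch of the multi-frequency version, following the sinusoidal scheme from the original Transformer paper: even slots use sine, odd slots use cosine, and each pair of slots shares a frequency of 1 / 10000^(2k/dim). The function name and the tiny dimension of 4 are assumptions for illustration.

```python
import math

def positional_encoding(position, dim, base=10000.0):
    """Sinusoidal position vector: sine in even slots, cosine in odd slots."""
    encoding = []
    for i in range(dim):
        # Slots 2k and 2k+1 share one frequency; later pairs oscillate more slowly.
        frequency = 1.0 / (base ** ((i - i % 2) / dim))
        angle = position * frequency
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# The single-wave example from the text
print(round(math.sin(2 * 0.5), 3))  # 0.841

# Position 2 gets the same vector no matter how long the sentence is
pe2 = positional_encoding(2, 4)
print([round(x, 3) for x in pe2])
```

Sentence length never appears as an input, so the vector for a given position cannot shift with it, and every value stays between -1 and 1.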

That was the third click: positional encoding gives each token a position signature without depending on sentence length.

Attention Is Context Mixing

Now comes the heart of the transformer: attention.

An easy sentence for understanding attention is:

The bank of the river was muddy.

The word bank is ambiguous. It could mean a financial institution, or it could mean the edge of a river.

Attention lets bank look at surrounding words and decide which ones matter.

In this sentence, river is important. So the representation of bank gets pulled toward the water-and-geography meaning.

The mechanism uses three vectors for every token:

Query: What am I looking for?
Key: What information do I contain?
Value: What information do I pass along?

A simple analogy:

Query = search question
Key = article title
Value = article content

If the query for bank matches the key for river, then bank receives a strong contribution from the value of river.

Mathematically, attention does this:

scores = Query dot Keys (one score per token in the context)
weights = softmax(scores)
output = sum of weights * Values

The softmax step turns the scores into positive weights that add up to 1, which you can read as percentages.

So if bank gives 80 percent attention to river, it absorbs a large part of river's value vector.
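The three steps can be sketched for a single query in plain Python. All vectors below are hypothetical, hand-picked so that the query for bank lines up with the key for river. The sketch also divides the scores by the square root of the vector size, a detail from the standard scaled dot-product formulation that the simplified formula above leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, output) for one query against every key and value."""
    d_k = len(query)
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output slot at a time
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hypothetical 2-dimensional vectors, for illustration only
tokens = ["the", "bank", "river"]
keys = [[0.1, 0.0], [0.5, 0.5], [2.0, 2.0]]
values = [[0.0, 0.1], [0.3, 0.3], [1.0, 0.0]]
query_bank = [1.5, 1.5]

weights, output = attend(query_bank, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
print(output)
```

Because river receives the largest weight, the output vector for bank ends up closest to river's value vector.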

That was the fourth click: attention is not magic. It is weighted context blending.

Why Masked Attention Exists

For GPT-style models, there is one more important rule.

They predict the next token.

If the training sentence is:

The cat sat on the mat

and the model is learning from:

The cat sat

it should predict:

on

But during training, the full sentence already exists in memory.

So without a mask, the model could cheat. The token sat could look ahead and see on.

That would be like taking an exam with the answer sheet open.

Masked attention prevents this. Each token can only look at itself and previous tokens.

So in:

The cat sat on the mat

when processing sat, the model can attend to:

The, cat, sat

It cannot attend to:

on, the, mat

During real conversation, future tokens do not exist yet. The model generates one token at a time. But during training, future tokens do exist, so we hide them.
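One common way to hide future tokens is to set their scores to negative infinity before the softmax, so their weights come out as exactly 0. Here is a small sketch under that assumption; `visible_tokens` and `masked_softmax` are hypothetical helper names, and the equal scores are placeholders.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

# mask[i][j] is True when position i may attend to position j (itself or earlier)
mask = [[j <= i for j in range(n)] for i in range(n)]

def visible_tokens(i):
    """Tokens that position i is allowed to look at."""
    return [tokens[j] for j in range(n) if mask[i][j]]

def masked_softmax(scores, allowed):
    """Softmax where disallowed positions get -inf, so their weight is 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

print(visible_tokens(2))  # ['The', 'cat', 'sat']

# Placeholder scores: every token scores the same before masking
weights = masked_softmax([1.0] * n, mask[2])
print([round(w, 3) for w in weights])  # [0.333, 0.333, 0.333, 0.0, 0.0, 0.0]
```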

That was the fifth click: masking makes training behave like real generation.

Prediction Is a Probability Game

After all the transformer layers finish processing the context, the model predicts the next token.

If the prompt is:

The cat sat

The model does not think there is only one possible answer.

It produces probabilities:

on        35%
down      30%
suddenly  15%
furiously 5%
magically 1%

Then decoding settings decide how to choose.

At low temperature, the distribution sharpens, so the model almost always picks the most likely token.

At higher temperature, the distribution flattens, so less likely tokens get sampled more often and the output feels more creative.
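A sketch of what temperature does to the example probabilities above. Dividing the logits by a temperature T before the softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what this code does. The listed probabilities sum to 86%, so the sketch renormalizes over these five tokens only, and `apply_temperature` and `sample` are hypothetical helper names.

```python
import math
import random

# Probabilities copied from the example list; other tokens are ignored here.
probs = {"on": 0.35, "down": 0.30, "suddenly": 0.15,
         "furiously": 0.05, "magically": 0.01}

def apply_temperature(probs, temperature):
    """Rescale a distribution as p ** (1 / T), renormalized. Requires T > 0."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

def sample(probs, temperature, rng):
    """Draw one token from the temperature-adjusted distribution."""
    dist = apply_temperature(probs, temperature)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # seeded generator so runs are repeatable
for t in (0.1, 1.0, 2.0):
    dist = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in dist.items()})
print(sample(probs, 0.1, rng))
```

A temperature of exactly 0 would divide by zero in this sketch; a common convention is to treat that case as always picking the most likely token.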

So if the prompt changes to:

The wizard waved his wand and the cat sat

then magically becomes much more likely because the earlier context changed the probability distribution.

That was the sixth click: generation is not fixed autocomplete. It is probability shaped by context.

Top comments (1)

Rasmus Ros

Where do you draw the line on understanding before building, and what do you think someone needs to learn first to be productive here, like tokenization, embeddings, positional encoding, and attention?