NOTE: I intentionally simplified the vector mathematics here to keep things accessible to a wider audience.
I wanted to learn LLMs properly.
Not just use an API. Not just call generate() from a library and pretend I understood what happened underneath. I wanted to know how a model takes plain text, turns it into numbers, reads context, and predicts the next word.
The end goal is ambitious: build a mini LLM from scratch.
But before touching PyTorch, I realized I needed the mental model first. Code without understanding becomes copy-paste gymnastics. So I started with the real basics.
Text Is Not Text to a Model
A language model does not read words the way we do.
If I write:
hello
The model cannot directly understand those letters. First, the text has to be converted into numbers. That conversion is called tokenization.
The simplest version is character-level tokenization.
If our tiny training text is:
hello world
We collect every unique character and assign each one an ID.
For example:
d -> 0
e -> 1
h -> 2
l -> 3
o -> 4
r -> 5
w -> 6
space -> 7
Now the word:
led
becomes:
[3, 1, 0]
That was the first click for me. Tokenization is basically a dictionary lookup. Text goes in, integer IDs come out.
And decoding is just the reverse. If encoding turns "he" into [2, 1], decoding turns [2, 1] back into "he".
So the first pipeline looks like this:
Raw text -> tokenizer -> token IDs
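Here is a minimal sketch of that pipeline in Python, using the "hello world" example. The names stoi, itos, encode, and decode are just illustrative choices, and the sort key exists only to put the space last so the IDs match the table above.

```python
# Character-level tokenizer for the "hello world" example.
text = "hello world"

# Sort letters alphabetically and put the space last, to match the table above.
vocab = sorted(set(text), key=lambda ch: (ch == " ", ch))

# stoi: character -> ID, itos: ID -> character (illustrative names).
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Turn a string into a list of token IDs."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)


print(encode("led"))   # [3, 1, 0]
print(decode([2, 1]))  # he
```

Any character that never appeared in the training text has no ID here, so encode would raise a KeyError for it. Real tokenizers need a strategy for that case.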
But token IDs alone are not enough.
Why Token IDs Are Not Meaning
If h = 2 and l = 3, that does not mean l is somehow "greater" than h.
Those numbers are just labels. The model needs a richer representation.
That is where embeddings come in.
An embedding turns each token ID into a vector, which is just a list of numbers.
Instead of:
cat -> 12
We get something more like:
cat -> [0.9, 0.2, 0.5, ...]
A useful way to imagine this is a hidden map of meaning.
Words used in similar ways slowly move closer together during training. So after enough learning, words like:
king and queen
apple and orange
cat and dog
end up close in vector space.
The model does not start with this knowledge. The embeddings begin as random numbers. Training adjusts them until useful patterns emerge.
That was the second click: embeddings are not manually written meanings. They are learned coordinates.
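As a rough sketch, an embedding layer can be pictured as a table with one row of numbers per token ID. The sizes below (8 tokens, 4 numbers per vector) and the names embedding_table, embed, and cosine_similarity are assumptions made for illustration. Nothing is trained here, so the vectors stay random.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

vocab_size = 8     # the 8 characters from the "hello world" example
embedding_dim = 4  # tiny on purpose; real models use far larger vectors

# One row of random numbers per token ID. Training would adjust these.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


def cosine_similarity(a, b):
    """Closeness of two vectors by direction: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = embed([3, 1, 0])  # the IDs for "led"
print(len(vectors), len(vectors[0]))  # 3 4
print(cosine_similarity(vectors[0], vectors[1]))  # arbitrary before training
```

Cosine similarity is one common way to measure "close in vector space". Before training the number is meaningless, which is the point: the closeness has to be learned.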
The Order Problem
Once tokens become embeddings, there is still a huge problem.
Transformers process tokens in parallel. That is powerful, but it means the model does not automatically know word order.
These two sentences contain the same words:
The cat ate the mouse.
The mouse ate the cat.
But they mean completely different things.
So we need to inject position information.
The obvious idea is to use indexes:
The -> position 0
cat -> position 1
ate -> position 2
At first, this sounds perfect. Arrays already have indexes, so why not just use them?
The issue is scale.
If we directly add raw indexes, later positions can become huge. A word at position 1999 gets a massive position number compared to the small values inside its embedding. The position can overpower the meaning.
A normalized index also causes trouble.
For a 3-word sentence:
index 2 / length 3 ≈ 0.667
For a 100-word sentence:
index 2 / length 100 = 0.02
Same index. Completely different value.
That means the model has to deal with position values that shift depending on sentence length.
Sine and cosine positional encodings solve this in a neat way.
A sine wave always stays between -1 and 1, so the values never explode. Also, the position value depends on the index and a fixed rhythm, not on the total sentence length.
For example:
sin(index * frequency)
If index = 2 and frequency = 0.5:
sin(2 * 0.5) = sin(1) ≈ 0.841
That value stays the same whether the sentence has 3 words or 100 words.
The original Transformer uses many sine and cosine waves with different frequencies. Fast waves capture small position changes. Slow waves help distinguish positions farther apart.
That was the third click: positional encoding gives each token a position signature without depending on sentence length.
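The idea can be sketched in a few lines of Python. The function name positional_encoding is illustrative. The base of 10000 follows the original Transformer formulation; other values would work for a toy model.

```python
import math


def positional_encoding(position, dim):
    """Sine/cosine position signature of length `dim` (dim must be even)."""
    signature = []
    for i in range(0, dim, 2):
        # Frequency shrinks as i grows: early pairs are fast waves, later ones slow.
        frequency = 1.0 / (10000 ** (i / dim))
        signature.append(math.sin(position * frequency))
        signature.append(math.cos(position * frequency))
    return signature


# The single-wave example from the text: index 2, frequency 0.5.
print(round(math.sin(2 * 0.5), 3))  # 0.841

# Position 2 gets the same signature no matter how long the sentence is,
# because sentence length is never an input to the function.
print([round(v, 3) for v in positional_encoding(2, 4)])

# Even a far-away position stays inside [-1, 1].
print(all(-1.0 <= v <= 1.0 for v in positional_encoding(1999, 8)))  # True
```

Note that math.sin works in radians, which is also what the 0.841 example assumes.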
Attention Is Context Mixing
Now comes the heart of the transformer: attention.
An easy sentence for understanding attention is:
The bank of the river was muddy.
The word bank is ambiguous. It could mean a financial institution, or it could mean the edge of a river.
Attention lets bank look at surrounding words and decide which ones matter.
In this sentence, river is important. So the representation of bank gets pulled toward the water and geography meaning.
The mechanism uses three vectors for every token:
Query: What am I looking for?
Key: What information do I contain?
Value: What information do I pass along?
A simple analogy:
Query = search question
Key = article title
Value = article content
If the query for bank matches the key for river, then bank receives a strong contribution from the value of river.
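Mathematically, attention does this:
scores = Query dot each Key, divided by sqrt(vector size)
weights = softmax(scores)
output = sum of (weight * Value) over all tokens
The softmax step turns the scores into positive weights that add up to 1, so they can be read as percentages.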
So if bank gives 80 percent attention to river, it absorbs a large part of river's value vector.
That was the fourth click: attention is not magic. It is weighted context blending.
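Here is a small sketch of that blending for a single query, in plain Python. All the vectors are made-up numbers chosen so that the query for bank lines up with the key for river. In a real model they come from learned projections. The sketch also divides the scores by the square root of the vector size, as the standard scaled dot-product formulation does.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Blend `values` according to how well `query` matches each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Made-up 2-number vectors for three tokens: "bank", "river", "muddy".
keys = [[1.0, 0.0], [0.0, 4.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bank_query = [0.0, 2.0]  # hypothetical: points the same way as river's key

weights, output = attend(bank_query, keys, values)
print([round(w, 2) for w in weights])  # river gets the largest weight
print([round(x, 2) for x in output])   # mostly river's value vector
```

A full attention layer runs this for every token at once using matrix multiplication, but the per-token arithmetic is the same.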
Why Masked Attention Exists
For GPT-style models, there is one more important rule.
They predict the next token.
If the training sentence is:
The cat sat on the mat
and the model is learning from:
The cat sat
it should predict:
on
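One sentence actually yields several training examples like this. The sketch below builds them, assuming for simplicity that tokens are whole words split on spaces. Real models typically use subword tokens.

```python
tokens = "The cat sat on the mat".split()

# Each training example: everything seen so far -> the token that comes next.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(" ".join(context), "->", target)
```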
But during training, the full sentence already exists in memory.
So without a mask, the model could cheat. The token sat could look ahead and see on.
That would be like taking an exam with the answer sheet open.
Masked attention prevents this. Each token can only look at itself and previous tokens.
So in:
The cat sat on the mat
when processing sat, the model can attend to:
The, cat, sat
It cannot attend to:
on, the, mat
During real conversation, future tokens do not exist yet. The model generates one token at a time. But during training, future tokens do exist, so we hide them.
That was the fifth click: masking makes training behave like real generation.
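One common way to implement the mask is to set the scores of future tokens to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below does that in plain Python. The raw scores are placeholders (all zeros), and masked_softmax is an illustrative name.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

# Placeholder raw scores: every token likes every token equally (all 0.0).
scores = [[0.0] * n for _ in range(n)]


def masked_softmax(row, position):
    """Softmax over one row, with every token after `position` hidden."""
    masked = [s if j <= position else float("-inf") for j, s in enumerate(row)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


weights = [masked_softmax(scores[i], i) for i in range(n)]

sat = tokens.index("sat")
visible = [tokens[j] for j in range(n) if weights[sat][j] > 0]
print(visible)  # ['The', 'cat', 'sat']
```

Because the mask is applied row by row, all positions can still be trained in parallel: each row simply sees a different amount of the sentence.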
Prediction Is a Probability Game
After all the transformer layers finish processing the context, the model predicts the next token.
If the prompt is:
The cat sat
The model does not think there is only one possible answer.
It produces a probability for every token in its vocabulary. The top few might look like this:
on 35%
down 30%
suddenly 15%
furiously 5%
magically 1%
Then decoding settings decide how to choose.
At low temperature, the model strongly favors the most likely token. As the temperature approaches zero, it effectively always picks it.
At higher temperature, less likely tokens get picked more often, so the output becomes more varied.
So if the prompt changes to:
The wizard waved his wand and the cat sat
then magically becomes much more likely because the earlier context changed the probability distribution.
That was the sixth click: generation is not fixed autocomplete. It is probability shaped by context.
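The effect of temperature can be sketched with the example probabilities for "The cat sat". This is a simplification: it rescales only the five tokens listed and renormalizes them, while a real model applies temperature to the scores for its whole vocabulary. The function names are illustrative.

```python
import math
import random

# Probabilities from the example above; they cover only part of the vocabulary.
probs = {"on": 0.35, "down": 0.30, "suddenly": 0.15, "furiously": 0.05, "magically": 0.01}


def apply_temperature(probs, temperature):
    """Rescale a distribution: low temperature sharpens it, high flattens it."""
    # Equivalent to dividing the logits by the temperature before softmax.
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


def sample(probs, temperature, rng):
    """Pick one token at random according to the rescaled distribution."""
    adjusted = apply_temperature(probs, temperature)
    tokens = list(adjusted)
    return rng.choices(tokens, weights=[adjusted[t] for t in tokens], k=1)[0]


rng = random.Random(0)
for t in (0.1, 1.0, 2.0):
    adjusted = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in adjusted.items()})
print(sample(probs, 1.0, rng))
```

At temperature 0.1 most of the probability piles onto "on", while at 2.0 the gap between "on" and "magically" shrinks noticeably.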