Upayan Ghosh

Posted on May 23

From Tokens to Attention: My First Real Mental Model of LLMs

#llm #ai #machinelearning #beginners

NOTE - I intentionally simplified the vector mathematics here to keep things accessible for a wider audience.

I wanted to learn LLMs properly.

Not just use an API. Not just call generate() from a library and pretend I understood what happened underneath. I wanted to know how a model takes plain text, turns it into numbers, reads context, and predicts the next word.

The end goal is ambitious: build a mini LLM from scratch.

But before touching PyTorch, I realized I needed the mental model first. Code without understanding becomes copy-paste gymnastics. So I started with the real basics.

Text Is Not Text to a Model

A language model does not read words the way we do.

If I write:

hello

The model cannot directly understand those letters. First, the text has to be converted into numbers. That conversion is called tokenization.

The simplest version is character-level tokenization.

If our tiny training text is:

hello world

We collect every unique character and assign each one an ID.

For example:

d -> 0
e -> 1
h -> 2
l -> 3
o -> 4
r -> 5
w -> 6
space -> 7

Now the word:

led

becomes:

[3, 1, 0]

That was the first click for me. At the character level, tokenization is basically a dictionary lookup. Text goes in, integer IDs come out.

And decoding is just the reverse. If encoding turns "he" into [2, 1], decoding turns [2, 1] back into "he".

So the first pipeline looks like this:

Raw text -> tokenizer -> token IDs
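Here is a minimal sketch of that pipeline in plain Python, assuming the same tiny training text. The names stoi, itos, encode, and decode are my own picks for illustration. Python's default sort would place the space first, so the sort key below pushes it to the end to reproduce the ID table above.

```python
text = "hello world"

# Unique characters, sorted alphabetically, with the space pushed to the end
# so the IDs match the table above (d -> 0 ... space -> 7).
chars = sorted(set(text), key=lambda c: (c == " ", c))

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> character


def encode(s):
    """Turn a string into a list of token IDs (raises KeyError on unseen characters)."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)


print(encode("led"))   # [3, 1, 0]
print(decode([2, 1]))  # he
```

Any character that never appeared in the training text has no ID, which is one reason real tokenizers work with larger, smarter vocabularies.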

But token IDs alone are not enough.

Why Token IDs Are Not Meaning

If h = 2 and l = 3, that does not mean l is somehow "greater" than h.

Those numbers are just labels. The model needs a richer representation.

That is where embeddings come in.

An embedding turns each token ID into a vector, which is just a list of numbers.

Instead of:

cat -> 12

We get something more like:

cat -> [0.9, 0.2, 0.5, ...]

A useful way to imagine this is a hidden map of meaning.

Words used in similar ways slowly move closer together during training. So after enough learning, words like:

king and queen
apple and orange
cat and dog

end up close in vector space.

The model does not start with this knowledge. The embeddings begin as random numbers. Training adjusts them until useful patterns emerge.

That was the second click: embeddings are not manually written meanings. They are learned coordinates.
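To make "learned coordinates" concrete, here is a small sketch of an embedding table before any training has happened. The sizes (8 tokens, 4 dimensions) and the names embedding_table and embed are assumptions chosen for illustration, not values from a real model.

```python
import random

random.seed(0)  # fixed seed so the random starting point is repeatable

vocab_size = 8      # the eight characters from the "hello world" example
embedding_dim = 4   # tiny on purpose; real models use far more dimensions

# One row of random numbers per token ID. Training would later adjust these.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


token_ids = [3, 1, 0]  # the IDs for "led" from the tokenization example
vectors = embed(token_ids)
for token_id, vector in zip(token_ids, vectors):
    print(token_id, [round(x, 2) for x in vector])
```

The lookup itself is just indexing into a table. All the interesting behavior comes from how training nudges the rows.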

The Order Problem

Once tokens become embeddings, there is still a huge problem.

Transformers process tokens in parallel. That is powerful, but it means the model does not automatically know word order.

These two sentences contain the same words:

The cat ate the mouse.
The mouse ate the cat.

But they mean completely different things.

So we need to inject position information.

The obvious idea is to use indexes:

The -> position 0
cat -> position 1
ate -> position 2

At first, this sounds perfect. Arrays already have indexes, so why not just use them?

The issue is scale.

If we directly add raw indexes, later positions can become huge. A word at position 1999 gets a massive position number compared to the small values inside its embedding. The position can overpower the meaning.

A normalized index also causes trouble.

For a 3-word sentence:

index 2 / length 3 = 0.667

For a 100-word sentence:

index 2 / length 100 = 0.02

Same index. Completely different value.

That means the model has to deal with position values that shift depending on sentence length.
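Both problems are easy to see with a few lines of Python. This is only a sketch: the three embedding values are borrowed from the "cat" example earlier, and the helper names add_raw_index and normalized_index are hypothetical.

```python
embedding = [0.9, 0.2, 0.5]  # illustrative values from the "cat" example


def add_raw_index(vector, index):
    """Naive idea 1: add the raw position to every embedding value."""
    return [x + index for x in vector]


def normalized_index(index, length):
    """Naive idea 2: divide the position by the sentence length."""
    return index / length


# At position 1999 the original values are drowned out by the position.
print(add_raw_index(embedding, 1999))

# The same index gives different values in sentences of different length.
print(round(normalized_index(2, 3), 3))    # 0.667
print(round(normalized_index(2, 100), 3))  # 0.02
```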

Sine and cosine positional encodings solve this in a neat way.

A sine wave always stays between -1 and 1, so the values never explode. Also, the position value depends on the index and a fixed rhythm, not on the total sentence length.

For example:

sin(index * frequency)

If index = 2 and frequency = 0.5:

sin(2 * 0.5) = sin(1) = 0.841

That value stays the same whether the sentence has 3 words or 100 words.

Real transformers use many sine and cosine waves with different frequencies. Fast waves capture small position changes. Slow waves help distinguish positions farther apart.

That was the third click: positional encoding gives each token a position signature without depending on sentence length.
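A rough sketch of such a position signature, using only the math module. The frequency schedule (base 10000, sine and cosine in pairs) follows the scheme from the original Transformer paper. The function name positional_encoding and the tiny d_model of 8 are assumptions for illustration, and d_model is assumed to be even.

```python
import math


def positional_encoding(index, d_model=8, base=10000.0):
    """Return a d_model-sized position signature for one token index."""
    signature = []
    for i in range(0, d_model, 2):
        # Frequencies shrink as i grows: fast waves first, slow waves last.
        frequency = 1.0 / (base ** (i / d_model))
        signature.append(math.sin(index * frequency))
        signature.append(math.cos(index * frequency))
    return signature


# The single-wave example from the text: index 2, frequency 0.5
print(round(math.sin(2 * 0.5), 3))  # 0.841

# The signature depends only on the index, never on the sentence length.
print([round(x, 3) for x in positional_encoding(2)])

# Even a far-away position stays inside the range -1 to 1.
print([round(x, 3) for x in positional_encoding(1999)])
```

In a real model this signature is added to the token's embedding, so each vector carries both meaning and position.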

Attention Is Context Mixing

Now comes the heart of the transformer: attention.

An easy sentence for understanding attention is:

The bank of the river was muddy.

The word bank is ambiguous. It could mean a financial institution, or it could mean the edge of a river.

Attention lets bank look at surrounding words and decide which ones matter.

In this sentence, river is important. So the representation of bank gets pulled toward the water and geography meaning.

The mechanism uses three vectors for every token:

Query: What am I looking for?
Key: What information do I contain?
Value: What information do I pass along?

A simple analogy:

Query = search question
Key = article title
Value = article content

If the query for bank matches the key for river, then bank receives a strong contribution from the value of river.

Mathematically, attention does this:

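scores = (Query dot Keys) / sqrt(key dimension)
weights = softmax(scores)
output = weighted sum of Values

The division keeps the scores from growing too large as the vectors get longer. The softmax step then turns the scores into positive weights that add up to 1, so they can be read as percentages.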

So if bank gives 80 percent attention to river, it absorbs a large part of river's value vector.

That was the fourth click: attention is not magic. It is weighted context blending.
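The blending can be sketched in plain Python for a single query. The 2-dimensional vectors below are made up so that the query for bank lines up with the key for river. In a real model they come from learned projections of the embeddings. The division by the square root of the vector size is the scaling used in standard scaled dot-product attention.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Blend the value vectors, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, output


# Made-up vectors for three tokens (assumptions, not real model values)
tokens = ["bank", "river", "muddy"]
keys = [[1.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_bank = [0.0, 2.0]

weights, output = attend(query_for_bank, keys, values)
for token, weight in zip(tokens, weights):
    print(token, round(weight, 3))
print([round(x, 3) for x in output])
```

With these numbers river receives by far the largest weight, so the output for bank ends up close to river's value vector.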

Why Masked Attention Exists

For GPT-style models, there is one more important rule.

They predict the next token.

If the training sentence is:

The cat sat on the mat

and the model is learning from:

The cat sat

it should predict:

on

But during training, the full sentence already exists in memory.

So without a mask, the model could cheat. The token sat could look ahead and see on.

That would be like taking an exam with the answer sheet open.

Masked attention prevents this. Each token can only look at itself and previous tokens.

So in:

The cat sat on the mat

when processing sat, the model can attend to:

The, cat, sat

It cannot attend to:

on, the, mat

During real conversation, future tokens do not exist yet. The model generates one token at a time. But during training, future tokens do exist, so we hide them.

That was the fifth click: masking makes training behave like real generation.
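One common way to hide the future, sketched here for a single token, is to set the scores of later positions to negative infinity before the softmax so their weights become exactly zero. The equal starting scores and the function name causal_softmax are assumptions for illustration.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "mat"]


def causal_softmax(scores, position):
    """Softmax over one row of scores, hiding every token after `position`."""
    masked = [s if j <= position else float("-inf") for j, s in enumerate(scores)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Pretend "sat" (position 2) scored every token equally before masking.
scores_for_sat = [1.0] * len(tokens)
weights = causal_softmax(scores_for_sat, position=2)

visible = [token for token, w in zip(tokens, weights) if w > 0]
hidden = [token for token, w in zip(tokens, weights) if w == 0]
print(visible)  # ['The', 'cat', 'sat']
print(hidden)   # ['on', 'the', 'mat']
```

Real implementations apply the same idea to the whole score matrix at once with a triangular mask, but the effect per token is the same.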

Prediction Is a Probability Game

After all the transformer layers finish processing the context, the model predicts the next token.

If the prompt is:

The cat sat

The model does not think there is only one possible answer.

It produces probabilities:

on        35%
down      30%
suddenly  15%
furiously 5%
magically 1%

Then decoding settings decide how to choose.

At low temperature, the distribution sharpens, so the model almost always picks the most likely token.

At higher temperature, it samples more creatively from the probability distribution.

So if the prompt changes to:

The wizard waved his wand and the cat sat

then magically becomes much more likely because the earlier context changed the probability distribution.

That was the sixth click: generation is not fixed autocomplete. It is probability shaped by context.
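A small sketch of how temperature reshapes the example probabilities for "The cat sat". Raising each probability to the power 1 / temperature and renormalizing is equivalent to dividing the logits by the temperature before the softmax. Renormalizing over only the five listed tokens is a simplification of mine, since the real distribution covers the whole vocabulary.

```python
import random

# Probabilities from the "The cat sat" example (the rest of the vocabulary is ignored here).
probs = {"on": 0.35, "down": 0.30, "suddenly": 0.15, "furiously": 0.05, "magically": 0.01}


def apply_temperature(probs, temperature):
    """Rescale probabilities; same effect as dividing logits by the temperature."""
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: s / total for token, s in scaled.items()}


def sample(probs, temperature, rng):
    """Draw one token from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(list(adjusted), weights=list(adjusted.values()), k=1)[0]


base = apply_temperature(probs, 1.0)   # unchanged shape, just renormalized
sharp = apply_temperature(probs, 0.1)  # low temperature: top token dominates
flat = apply_temperature(probs, 2.0)   # high temperature: unlikely tokens gain ground

for name, dist in [("T=0.1", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, {token: round(p, 3) for token, p in dist.items()})

rng = random.Random(0)
print([sample(probs, 2.0, rng) for _ in range(5)])
```

At temperature 0.1 the top token takes most of the probability mass, while at 2.0 the rare options get a noticeably bigger share.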

Top comments (2)

Rasmus Ros • May 24

Where do you draw the line on understanding before building, and what do you think someone needs to learn first to be productive here, like tokenization, embeddings, positional encoding, and attention?

Harjot Singh • May 31

The tokens-to-attention path is exactly the right first mental model, because it demystifies the two things that confuse newcomers most: why the model "thinks" in tokens (not words or characters), and how it decides what in the context actually matters (attention weighting relationships between tokens rather than reading left-to-right like we do). Once those click, a lot of practical behavior stops feeling magic - why tokenization quirks cause weird failures, why context order and position matter, why "just add more context" isn't free. Writing it up while it's fresh is the best way to cement it, and it's a great post for the next person at the same stage.

The thing worth carrying forward from this foundation: attention is also why these models are confidently wrong sometimes - they attend to and complete plausible patterns, with no built-in notion of "is this true." Which is exactly why everything I build sits a verification layer on top. That's the core of Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where the model proposes and a verify step checks, because understanding attention also means understanding why you can't just trust the output. Multi-model routing keeps a build ~$3 flat, first run free no card. Really clear writeup for an early mental model. What's the next concept you're tackling - positional encoding, or jumping to how attention scales (KV cache / context limits)? The cost/limits side is where the theory meets your wallet.