We use LLMs every day, but most explanations stop at
“it’s a transformer” and move on.
What actually happens between a prompt and the next generated word?
How does the model decide what matters and what doesn’t?
This article breaks down that flow — step by step — without math,
and without hand-waving.
🧠 How Transformers Differ from Traditional Models
Older language models processed text sequentially, focusing mostly on neighboring words.
That meant:
- Limited long-range understanding
- Difficulty connecting distant words in a sentence
Transformers changed this by doing something radical:
They consider the relationship between every word and every other word — all at once.
Instead of asking only:
“What word comes next based on the previous one?”
They ask:
“How does every word relate to every other word in this sentence?”
This is what allows LLMs to understand context at scale.
🧩 Breakdown of the Transformer's Core Components
Below are the key components that transform raw text into predictions.
1. Tokenization - Turning Text Into Numbers
Before anything else, the prompt is converted into tokens.
Example:
Prompt: "Write a story about dragon"
Tokens: [9566, 261, 4869, 1078, 103944]
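As a toy sketch, tokenization can be modeled as a simple lookup table. Real LLMs use subword schemes such as BPE, so the vocabulary and IDs below are purely illustrative and won't match the numbers above:

```python
# Toy word-level tokenizer. Real LLMs use subword schemes such as BPE,
# so this vocabulary and these IDs are purely illustrative.
vocab = {"Write": 0, "a": 1, "story": 2, "about": 3, "dragon": 4}

def tokenize(text):
    return [vocab[word] for word in text.split()]

print(tokenize("Write a story about dragon"))  # [0, 1, 2, 3, 4]
```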
Why does this step exist?
Models don’t understand raw text.
They operate on numbers.
At this stage:
- Tokens are just identifiers
- They carry no meaning or context
- “dragon” is just a number, not a concept
That limitation is solved in the next step.
2. Vector Embeddings - Adding Meaning Beyond Words
Vector embeddings capture semantic meaning — words with similar meanings end up closer together in vector space.
Consider these two sentences:
- “He deposited money in the bank”
- “They sat near the river bank”
Tokenization assigns “bank” the same token ID in both cases, even though the meanings differ.
Why are embeddings needed?
Vector embeddings represent words in a multi-dimensional space where meaning depends on context.
Example:
bank (finance) → [0.82, -0.14, 0.56, 0.09]
bank (river) → [-0.21, 0.77, -0.63, 0.48]
The numbers themselves don’t matter.
What matters is distance and direction between vectors.
This is how the model distinguishes meaning.
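One common way to measure that “distance and direction” is cosine similarity. A minimal sketch using the illustrative vectors above (not real embeddings):

```python
import math

bank_finance = [0.82, -0.14, 0.56, 0.09]  # illustrative values, not real embeddings
bank_river = [-0.21, 0.77, -0.63, 0.48]

def cosine_similarity(u, v):
    # 1.0 means same direction, 0 means unrelated, negative means opposed.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(bank_finance, bank_river))  # negative: different directions
```

The two senses of “bank” point in clearly different directions, which is exactly the separation tokenization alone cannot provide.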
3. Positional Encoding - Preserving Word Order
Embeddings capture meaning — but not order.
Without positional information, these two sentences look identical to the model:
- “The dog chased the cat”
- “The cat chased the dog”
Positional encoding injects order information into each word embedding.
So now we have:
Embedding + Position
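The original Transformer paper used a sinusoidal scheme for this; a minimal sketch, with a toy embedding standing in for a real one:

```python
import math

def positional_encoding(pos, d_model=8):
    # Sinusoidal scheme from the original Transformer paper:
    # each pair of dimensions gets a sine and cosine at a different frequency.
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

embedding = [0.5, -0.2, 0.1, 0.9, -0.4, 0.3, 0.0, 0.7]  # toy word embedding
position_aware = [e + p for e, p in zip(embedding, positional_encoding(0))]
print(position_aware)
```

Because each position produces a distinct pattern, “dog” at position 1 and “dog” at position 4 end up with different final vectors.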
4. Self-Attention (The Core Idea)
Once embeddings + positional data are ready, they pass through the self-attention layer.
Self-attention assigns a weight to every word relative to every other word.
This allows the model to:
- Focus on relevant relationships
- Ignore irrelevant ones
Why does self-attention exist?
Not all words matter equally.
In the sentence:
“The fisherman caught the fish with a net”
The model needs to figure out:
- Does “with a net” describe the fisherman or the fish?
Attention weights between these words are what let the model resolve that ambiguity.
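The weighting itself is a scaled dot product followed by a softmax. A minimal sketch with made-up 2-d vectors (real models use hundreds of dimensions and learned projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product: score the query against every key,
    # then normalise the scores into weights that sum to 1.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy vectors for "net", "fisherman" and "fish" (illustrative values only).
net = [0.9, 0.1]
keys = [[0.8, 0.3],   # fisherman
        [0.1, 0.9]]   # fish
print(attention_weights(net, keys))  # puts more weight on "fisherman"
```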
5. Multi-Head Self-Attention - Looking at Multiple Meanings at Once
A single attention pattern isn’t enough.
Different relationships exist at the same time:
- grammatical
- semantic
- long-range dependencies
Multi-head attention solves this by running multiple attention layers in parallel.
Each head learns a different aspect of language:
- one may focus on subject–verb relationships
- another on modifiers
- another on overall context
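The mechanism behind this: each head first projects the token vectors through its own learned matrix, so the same attention computation produces different weight patterns. A toy sketch with hand-picked projections standing in for learned ones:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_weights(vectors, projection):
    # Project every vector with this head's matrix, then compute
    # scaled dot-product attention weights for the first token.
    proj = [[sum(v[i] * projection[i][j] for i in range(len(v)))
             for j in range(len(projection[0]))] for v in vectors]
    d = len(proj[0])
    query = proj[0]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in proj]
    return softmax(scores)

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
head_a = [[1.0, 0.0], [0.0, 1.0]]  # identity projection
head_b = [[1.0, 1.0], [0.0, 1.0]]  # a different (hand-picked) projection

print(head_weights(vectors, head_a))
print(head_weights(vectors, head_b))  # a different attention pattern
```

Two heads, two different views of the same tokens; a real model concatenates every head's output and mixes them with one more learned projection.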
6. Feed-Forward Network
After attention, the representation goes into a feed-forward network, which further refines what each token's vector encodes.
What happens here?
- After the transformer layers, a final linear projection (the language-modeling head) assigns a score to every token in the model's vocabulary.
- If the vocabulary contains 50,000 tokens, the output is a list of 50,000 scores.
- These scores are called logits.
Example:
For sentence: "The cat is ..."
Logits →
[2.3, 4.97, 84.21, -5.65, ...]
where:
- “sleeping” → very high score
- “running” → medium score
- “apple” → very low score
At this stage:
- These are raw scores
- They are not probabilities
- Higher score = more likely next word
7. Softmax Output
The logits are passed through a softmax function.
Softmax:
- converts scores into probabilities (0 → 1)
- ensures they add up to 1
Now the model has a probability distribution over all possible next words.
In the simplest case (greedy decoding), the word with the highest probability is selected; in practice, models often sample from this distribution instead.
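Softmax itself is only a few lines. A sketch using the example logits above (the max-subtraction trick keeps large scores like 84.21 from overflowing):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.3, 4.97, 84.21, -5.65]  # the example scores from above
probs = softmax(logits)
print(probs)       # the huge logit dominates the distribution
print(sum(probs))  # 1.0 (up to floating-point error)
```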
🔄 Putting It All Together: Encoder → Decoder Flow
Transformers are split into two major parts:
- Encoder
- Decoder
Let’s walk through them using an example.
Example Prompt:
"Write a short story about dragon"
🔐 Encoder Flow
- Prompt → Tokens
- Tokens → Vector Embeddings
- Embeddings + Positional Encoding
- Multi-Head Self-Attention
The encoder produces a rich contextual representation.
It learns things like:
- “story” relates to “dragon”
- “short” modifies “story”
- overall intent of the prompt
This output is not text — it’s meaning.
🎯 Decoder Flow (Word by Word Generation)
The decoder generates text one word at a time.
Step 1: Start Token
Initially, the decoder receives:
<START>
During training, the model learned patterns like:
- “Write a story about…”
- “Tell a story about…”
Many stories statistically start with:
"Once upon a time"
So the model predicts:
Once
The same process repeats for the next word, producing:
Once upon
Step 2: Masked Self-Attention
Masked self-attention ensures the model cannot see future words.
It works like this:
- “Once” can see <START>
- “upon” can look at both <START> and “Once”
- but “Once” cannot attend to later tokens like “upon”, even though they are already part of the input
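In practice this is just a triangular mask over the attention scores. A minimal sketch:

```python
def causal_mask(seq_len):
    # allowed[i][j] is True when token i may attend to token j (only j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

tokens = ["<START>", "Once", "upon"]
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(token, row)
```

Each row can only “see” itself and the tokens to its left, which is exactly the constraint described above.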
Step 3: Cross-Attention
Masked self-attention only looks at generated words.
But the model also needs to remember:
- what the user asked for
- what the prompt means
Why does cross-attention exist?
Cross-attention allows the decoder to:
- look at the encoder’s output
- align generated words with the prompt’s meaning
For example, the encoder representation contains:
- “story”
- “dragon”
So when generating words, the decoder is reminded:
- this is a story
- it must involve a dragon
- tone should match the prompt
Without cross-attention:
- the model could drift off-topic
- or generate generic text unrelated to the prompt
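Cross-attention reuses the same scaled dot-product machinery as self-attention; the only difference is where the keys come from. A toy sketch with illustrative 2-d vectors:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(decoder_query, encoder_keys):
    # Same computation as self-attention, but the keys come from the
    # ENCODER's output, keeping every generated word tied to the prompt.
    d = len(decoder_query)
    scores = [sum(q * k for q, k in zip(decoder_query, key)) / math.sqrt(d)
              for key in encoder_keys]
    return softmax(scores)

# Toy encoder outputs for the prompt words "story" and "dragon".
encoder_keys = [[0.9, 0.2],   # "story"
                [0.1, 0.8]]   # "dragon"
decoder_query = [0.2, 0.9]    # current decoder state (illustrative)
print(cross_attention_weights(decoder_query, encoder_keys))  # leans toward "dragon"
```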
Step 4: Predict Next Word
At this stage, the decoder predicts the next word in three clear steps:
1. Feed-Forward Network (Logits Generation)
Based on the prompt and the previously generated words, the feed-forward output is passed through a final linear projection that assigns a score to every word in the vocabulary.
2. Softmax (Probability Distribution)
The logits are passed through a softmax function, converting them into probabilities between 0 and 1, where all values sum to 1.
3. Token Selection
The word with the highest probability is chosen as the next token.
Example:
<START> Once upon
→ next token: "a"
The decoder input now becomes:
<START> Once upon a
This loop repeats token by token until the output is complete.
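The whole loop can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for the full decoder stack (masked self-attention + cross-attention + feed-forward + softmax), which in reality returns a probability for every vocabulary token:

```python
# Greedy decoding loop. `toy_model` stands in for the full decoder stack;
# a real model would return a probability distribution at every step.
toy_model = {
    ("<START>",): "Once",
    ("<START>", "Once"): "upon",
    ("<START>", "Once", "upon"): "a",
    ("<START>", "Once", "upon", "a"): "time",
}

def generate(max_tokens=10):
    sequence = ["<START>"]
    for _ in range(max_tokens):
        next_token = toy_model.get(tuple(sequence))  # highest-probability token
        if next_token is None:  # stand-in for an end-of-sequence token
            break
        sequence.append(next_token)
    return " ".join(sequence[1:])

print(generate())  # Once upon a time
```

Note how every prediction feeds back in as input for the next step: that feedback loop is the “word by word” generation described above.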
📝 Note on Modern LLMs
The original Transformer architecture includes both an encoder and a decoder.
However, many modern large language models (like GPT models) use a decoder-only architecture.
In these models:
- The prompt is treated as part of the input sequence
- The model uses masked self-attention
- There is no separate encoder block
Despite this difference, the core idea — self-attention — remains the foundation.
🌱 Final Takeaway
LLMs don’t “understand” language like humans.
They:
- learn patterns
- assign probabilities
- repeat this process thousands of times per response
But the Transformer architecture makes this process powerful by allowing:
- global context
- parallel processing
- deep relationships between words
Given how fast LLM apps like ChatGPT respond,
I never imagined such a large, iterative process was running underneath.
Once you understand this flow, LLMs stop feeling magical — and start feeling engineered.


