Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.
If you've used ChatGPT, Claude, Gemini, or any modern LLM, you've already benefited from one of the most influential architectures in modern AI: the Transformer.
In 2017, a team of researchers at Google published a paper titled "Attention Is All You Need." Hidden inside it was the now-famous architecture diagram shown below—the blueprint for the Transformer.
At first glance, it looks intimidating: boxes, arrows, loops, encoder stacks, decoder stacks, attention blocks...
But underneath the complexity lies a surprisingly elegant idea.
By the end of this article, you'll understand every single component in this diagram, why it exists, how information flows through it, and how all these pieces collaborate to generate human-like language.
Let's build our understanding layer by layer.
The Big Picture: Two Factories Working Together
Before diving into the individual blocks, ignore the details and look at the overall shape.
There are two major halves:
- Encoder (left)
- Decoder (right)
Think of them as two specialized factories.
Input sentence
        │
        ▼
┌───────────────┐
│    Encoder    │
└───────────────┘
        │
Learned representation
        │
        ▼
┌───────────────┐
│    Decoder    │
└───────────────┘
        │
        ▼
Generated text
Originally, this architecture was designed for machine translation.
Example:
English:
"The cat sat on the mat."
Encoder understands it.
↓
Decoder generates:
"Le chat était assis sur le tapis."
The encoder's only responsibility is understanding.
The decoder's responsibility is generation.
Today's GPT models actually use only the decoder, while models like BERT use only the encoder. The original Transformer paper contained both because translation requires understanding one language before producing another.
Step 1 — Input Embeddings: Converting Words into Numbers
Computers don't understand words.
They understand vectors.
When the sentence
The cat sat
enters the Transformer, each word is converted into a dense numerical vector.
"The"
↓
[0.17, -0.42, 1.33, ...]
These vectors are called embeddings.
Words with similar meanings naturally end up close together in this high-dimensional space.
For example,
king
queen
prince
princess
occupy nearby regions.
Instead of manually designing these vectors, the Transformer learns them during training.
In the architecture diagram, this is the very first pink box:
Inputs
│
▼
Input Embedding
At this stage, every token is simply represented as a learned numerical vector.
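To make the lookup concrete, here is a minimal sketch of an embedding table in plain Python. The three-word vocabulary, the width of 4, and the random values are all assumptions made for illustration; in a real model the table is learned during training and is far larger (the paper used a width of 512).

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4  # embedding width, kept tiny for readability

# The embedding table: one row of d_model numbers per token.
# Random here; in a real model these values are learned during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)]
    for _ in vocab
]

def embed(tokens):
    """Look up the vector for each token in the sentence."""
    return [embedding_table[vocab[tok]] for tok in tokens]

vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 4-dimensional vector
```

The embedding step is nothing more than this row lookup. Everything interesting comes from how training shapes the numbers in the table.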
Step 2 — Positional Encoding: Giving Words an Order
Here's an interesting problem.
Attention doesn't inherently know word order.
Consider:
Dog bites man
Man bites dog
Same words.
Completely different meaning.
Since attention processes every word simultaneously, we must explicitly tell the model where each word appears.
That's the purpose of Positional Encoding.
The positional vector is added directly to the embedding.
Embedding
+
Position
=
Final input vector
This explains the small ⊕ symbol in the diagram.
Embedding
│
▼
⊕
/ \
Embedding Position
Rather than having to infer word order indirectly, the model receives positional information directly with every token.
You can think of it as giving every token both:
- what it is
- where it occurs
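The addition can be sketched in a few lines. The original paper used fixed sine and cosine waves of different frequencies for the positional vector, and that is what this example follows. The function names and the toy embeddings are assumptions for illustration.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embeddings):
    """Element-wise sum of each token embedding and its positional vector."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]

# Two identical toy embeddings become distinguishable once position is added.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
first, second = add_position(same_word_twice)
print(first)
print(second)
```

The same word at two different positions now produces two different input vectors, which is exactly what attention needs to tell "Dog bites man" from "Man bites dog".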
Step 3 — The Encoder Stack: Understanding the Entire Sentence
Now we reach the large box labelled N×.
┌─────────────────────┐
│ Multi-Head Attention│
│ Add & Norm          │
│ Feed Forward        │
│ Add & Norm          │
└─────────────────────┘
Repeated N times
The paper used N = 6.
Modern LLMs often use dozens of layers, and the largest stack more than a hundred.
Each encoder layer gradually refines the representation.
Think of reading a paragraph.
Your first reading identifies words.
The second discovers phrases.
The third understands relationships.
The fourth extracts meaning.
Each encoder layer performs another refinement pass.
Step 4 — Multi-Head Attention: The Heart of the Transformer
This is the innovation that changed deep learning.
Suppose the sentence is:
The animal didn't cross the street because it was tired.
What does it refer to?
Attention allows every word to examine every other word before updating its representation.
animal ←───────────┐
                   │
cross ─────────────┤
                   │
street ────────────┤
                   │
it ◄───────────────┘
Unlike RNNs, which pass information along the sentence one step at a time, every token has direct access to the entire sentence.
This allows long-distance dependencies to be captured naturally.
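Here is a rough sketch of how "every word examines every other word" turns into arithmetic. It uses the scaled dot-product form from the paper: compare a query against every key, turn the scores into weights with softmax, and average the values with those weights. The names `queries`, `keys`, and `values` come from the paper rather than from the text above, and the sketch skips the learned projection matrices a real layer would apply first.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # How relevant is every token (key) to this token (query)?
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy vectors for three tokens. In self-attention, the queries, keys and
# values all come from the same tokens.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
print(out)
```

Each output row mixes information from all three tokens at once, which is why distance in the sentence does not matter.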
Why Multiple Heads?
One attention mechanism isn't enough.
Different relationships matter.
One head may learn:
- grammatical structure
Another:
- pronoun resolution
Another:
- verb-object relationships
Another:
- semantic similarity
Imagine several experts reading the same sentence simultaneously.
Head 1:
Grammar
Head 2:
Meaning
Head 3:
Syntax
Head 4:
Long-range context
Their outputs are combined into a richer representation.
This is why it's called Multi-Head Attention.
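A simplified sketch of the "several experts" idea: split every token vector into equal slices, run attention separately on each slice, then concatenate the results. This is an assumption-heavy toy. A real layer gives each head its own learned projections and mixes the concatenated result with one more learned matrix, and the paper used 8 heads rather than the 2 shown here.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        )
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: a real head applies learned projections, not slicing.
        piece = [vec[h * d_head:(h + 1) * d_head] for vec in x]
        head_outputs.append(attention(piece, piece, piece))
    # Concatenate the heads back into one vector per token.
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(x))
    ]

x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head_attention(x, num_heads=2)
print(len(combined), len(combined[0]))  # 3 tokens, still 4 dimensions each
```

Because each head sees a different slice, each one computes its own attention weights, so different heads are free to focus on different relationships.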
Step 5 — Add & Norm: Keeping Training Stable
Notice that after every major block we see
Add & Norm
This performs two operations.
Residual Connection (Add)
Instead of replacing information, we preserve the original.
Output
=
Attention(x)
+
x
This shortcut helps gradients flow through deep networks.
Without it, training hundreds of layers becomes extremely difficult.
Layer Normalization (Norm)
Different layers naturally produce values on different scales.
Layer normalization keeps activations well-behaved.
Think of it as recalibrating measurements after every processing stage.
Without normalization:
Layer 1
0.3
Layer 2
500
Layer 3
0.0004
Training quickly becomes unstable.
Normalization keeps everything numerically manageable.
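Both operations fit in a short sketch. The paper applies them in the order shown: add the block's input back to its output, then normalize the sum. The learned scale and shift parameters of a real layer-norm are left out, and `sublayer_output` is a made-up stand-in for Attention(x).

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one token's vector to zero mean and (roughly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(summed)

x = [0.3, 500.0, 0.0004, -2.0]            # wildly different scales
sublayer_output = [0.1, -20.0, 0.5, 1.0]  # stand-in for Attention(x)
normalized = add_and_norm(x, sublayer_output)
print(normalized)
```

However extreme the incoming values are, the output always has a mean near zero and a spread near one, so the next layer sees numbers on a predictable scale.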
Step 6 — Feed Forward Networks: Thinking Independently
Attention allows tokens to exchange information.
The Feed Forward layer allows each token to process what it has learned.
For every token independently:
Vector
↓
Linear
↓
Activation
↓
Linear
No interaction happens here.
Instead, this stage performs deeper feature extraction.
An analogy:
Attention is a group discussion.
Feed Forward is everyone quietly thinking afterward.
This alternating pattern—
Discuss
↓
Think
↓
Discuss
↓
Think
is repeated across every Transformer layer.
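The Linear, Activation, Linear pipeline can be sketched as follows. The activation is ReLU, which is what the paper used. The weights are hand-picked, hypothetical numbers chosen so the arithmetic is easy to follow; a real layer learns them and is much wider (512 to 2048 and back to 512 in the paper).

```python
def linear(vector, weights, bias):
    """One output per weight row: dot(row, vector) + bias."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]

def relu(vector):
    """Keep positive values, replace negatives with zero."""
    return [max(0.0, v) for v in vector]

def feed_forward(vector, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear, applied to a single token vector."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# Hypothetical weights: expand 2 dimensions to 3, then project back to 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-3.0, 1.0]]
# Each token is processed on its own; no token sees any other token here.
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
print(outputs)  # [[3.0, 0.0], [1.0, 2.0]]
```

Note the list comprehension at the end: the same function is applied to every token separately, which is the "quietly thinking" half of the analogy.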
Step 7 — The Decoder: Generating One Token at a Time
Now we move to the right half of the diagram.
The decoder is responsible for producing text.
Its input isn't the original sentence.
Instead it receives:
<START>
↓
The
↓
The cat
↓
The cat sat
↓
...
Notice the label:
Outputs
(shifted right)
This means that during training, the decoder receives the correct previous token as input while learning to predict the next one.
If the target sentence is:
The cat sat
the decoder sees:
<START>
The
The cat
and learns to predict:
The
cat
sat
This "teacher forcing" strategy makes training much more efficient because the model always conditions on the correct history rather than its own mistakes.
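A small sketch of what "shifted right" means in practice. The `<END>` marker and the helper name `make_training_pair` are assumptions for illustration; the idea is only that the decoder input is the target sentence pushed one slot to the right.

```python
START, END = "<START>", "<END>"

def make_training_pair(target_tokens):
    """Build the decoder input (shifted right) and the tokens it must predict."""
    decoder_input = [START] + target_tokens
    expected_output = target_tokens + [END]
    return decoder_input, expected_output

decoder_input, expected_output = make_training_pair(["The", "cat", "sat"])

# At every step the decoder sees the correct history and predicts one token.
for step in range(len(decoder_input)):
    print(decoder_input[:step + 1], "->", expected_output[step])
```

Every position becomes its own training example, so a single sentence teaches the model several next-token predictions at once.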
Step 8 — Masked Multi-Head Attention: Preventing Cheating
Imagine predicting the next word in:
The cat sat on the __
If the model could already see
mat
there would be nothing to learn.
So the decoder applies Masked Multi-Head Attention.
Current token
↓
Can see:
Previous words ✔
Future words ✘
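During training, the full target sentence is already available, so the mask has to do the hiding:
I love
cannot attend to
pizza
even though pizza is sitting right there in the training example.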
The mask preserves causality.
This left-to-right dependency is also why GPT models generate text one token at a time: each new token can depend only on the tokens that came before it.
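One common way to implement the mask, sketched here under simplifying assumptions: build a lower-triangular table of "allowed" flags, then set the score of every blocked position to negative infinity before the softmax so its weight becomes exactly zero. The function names are made up for this example.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # always finite: a token may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["I", "love", "pizza"]
mask = causal_mask(len(tokens))

# Equal raw scores everywhere, so the mask alone decides the weights.
weights_for_love = masked_softmax([1.0, 1.0, 1.0], mask[1])
print(weights_for_love)  # [0.5, 0.5, 0.0]
```

The token "love" splits its attention between "I" and itself, and gives "pizza" a weight of exactly zero.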
Step 9 — Cross-Attention: Looking Back at the Encoder
The second attention block inside the decoder is different.
Here, the decoder attends to the encoder output.
Encoder
↓
Sentence meaning
↓
Decoder consults it
Suppose we're translating:
"The red car."
While generating
voiture
the decoder continually asks:
Which parts of the original sentence are relevant right now?
This interaction between encoder and decoder is called cross-attention (shown simply as "Multi-Head Attention" in the original diagram, with arrows coming from the encoder stack).
It lets the decoder ground each generated token in the encoded meaning of the source sentence instead of relying only on previously generated words.
Decoder-only models like GPT omit this block because there is no separate encoder to consult.
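The only thing that separates cross-attention from ordinary attention is where the inputs come from: the query comes from the decoder, while the keys and values come from the encoder output. The sketch below uses invented vectors for "The red car" and an invented decoder state, purely to show the mechanics.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder."""
    d_k = len(encoder_outputs[0])
    results = []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in encoder_outputs
        ]
        weights = softmax(scores)
        # Context vector: weighted average of the encoder outputs.
        context = [
            sum(w * enc[j] for w, enc in zip(weights, encoder_outputs))
            for j in range(d_k)
        ]
        results.append((weights, context))
    return results

source_tokens = ["The", "red", "car"]
# Hypothetical encoder outputs, one vector per source token.
encoder_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical decoder state while generating "voiture".
decoder_states = [[0.0, 0.5, 4.0]]

weights, context = cross_attention(decoder_states, encoder_outputs)[0]
most_relevant = source_tokens[weights.index(max(weights))]
print(most_relevant)  # car
```

The weights are the decoder's answer to "which parts of the original sentence are relevant right now?", and here most of the weight lands on "car".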
Step 10 — Linear Layer + Softmax: Choosing the Next Word
After the decoder finishes processing, we finally reach the top of the diagram.
Decoder Output
↓
Linear
↓
Softmax
↓
Output Probabilities
The Linear layer converts the decoder's hidden representation into one score for every token in the vocabulary.
Imagine a vocabulary containing 50,000 words.
The output might look like:
cat 5.8
dog 2.1
apple -0.4
car 1.7
...
These raw scores (often called logits) aren't probabilities yet.
The Softmax layer transforms them into a probability distribution:
cat 0.81
dog 0.09
car 0.04
apple 0.01
...
The model can then choose the next token—either the most probable one or a sampled alternative depending on the decoding strategy.
That chosen token is fed back into the decoder, and the entire process repeats until an end-of-sequence token is produced.
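The last two steps and the feedback loop can be sketched together. The logits are the four example scores from above, treated as if they were the whole vocabulary, so the resulting probabilities will differ from the illustrative ones. `toy_model` is a hypothetical stand-in for the decoder that simply follows a script, and the decoding strategy shown is greedy (always take the most probable token).

```python
import math

def softmax(logits):
    """Convert a {token: score} dict into a {token: probability} dict."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"cat": 5.8, "dog": 2.1, "apple": -0.4, "car": 1.7}
probabilities = softmax(logits)
next_token = max(probabilities, key=probabilities.get)  # greedy choice
print(next_token)  # cat

SCRIPT = ["The", "cat", "sat", "<END>"]

def toy_model(tokens):
    """Hypothetical stand-in for the decoder: favours the next scripted word."""
    target = SCRIPT[len(tokens) - 1]
    return {word: (5.0 if word == target else 0.0) for word in SCRIPT}

def generate(next_logits, max_steps=10):
    """Pick a token, feed it back in, and stop at the end-of-sequence token."""
    tokens = ["<START>"]
    for _ in range(max_steps):
        probs = softmax(next_logits(tokens))
        choice = max(probs, key=probs.get)
        if choice == "<END>":
            break
        tokens.append(choice)
    return tokens[1:]

print(generate(toy_model))  # ['The', 'cat', 'sat']
```

Swapping the `max` call for a random draw weighted by `probs` would turn this greedy loop into sampling, which is the other strategy mentioned above.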
Putting It All Together: Following the Data Through the Diagram
Now the entire figure becomes much easier to read.
Input text
↓
Input Embedding
↓
Positional Encoding
↓
Encoder Stack (N layers)
├─ Multi-Head Attention
├─ Add & Norm
├─ Feed Forward
└─ Add & Norm
↓
Context-rich representation
↓
Decoder receives previous outputs
(shifted right)
↓
Output Embedding
↓
Positional Encoding
↓
Masked Multi-Head Attention
(looks only at earlier generated tokens)
↓
Cross-Attention
(consults the encoder output)
↓
Feed Forward
↓
Repeat N layers
↓
Linear
↓
Softmax
↓
Next token probability
↓
Repeat until complete sentence
Every block has a specific role:
| Component | Purpose |
|---|---|
| Input Embedding | Convert tokens into dense vectors |
| Positional Encoding | Encode word order |
| Multi-Head Attention | Let tokens exchange information globally |
| Masked Multi-Head Attention | Prevent access to future tokens during generation |
| Cross-Attention | Allow the decoder to consult the encoder's understanding |
| Feed Forward | Transform each token's representation independently |
| Add & Norm | Stabilize optimization and preserve information via residual connections |
| Encoder Stack (N×) | Build increasingly rich contextual representations |
| Decoder Stack (N×) | Generate the output sequence one token at a time |
| Linear | Produce a score for every vocabulary token |
| Softmax | Convert scores into probabilities for selecting the next token |
A Crisp Summary To Help You Remember This
The Transformer succeeded because it replaced sequential processing with parallel attention, enabling models to reason over entire sequences at once while still generating coherent text token by token.
Almost every major language model today—from GPT and Claude to Llama, Mistral, and Gemini—can trace its lineage back to this deceptively simple diagram. While modern architectures introduce refinements such as rotary positional embeddings, grouped-query attention, mixture-of-experts layers, and optimized decoding strategies, the core ideas remain strikingly similar to those introduced in 2017.
The next time someone says an LLM is "just predicting the next token," remember what's happening under the hood: embeddings capture meaning, positional encodings preserve order, attention weaves relationships across the sequence, feed-forward networks refine those representations, residual connections keep deep networks trainable, and the decoder repeatedly transforms all of that into one probability distribution after another until a coherent response emerges.
What part of the Transformer architecture surprised you the most—the fact that every token can attend to every other token, the masking that enables autoregressive generation, or how such a simple stack of repeated blocks scales to models with hundreds of billions of parameters?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Feedback and contributions are welcome! It's online, source-available, and ready for anyone to use.