Shrijith Venkatramana

Posted on Jun 29

The Transformer Architecture Behind Modern LLMs: A Developer's Guide to the Diagram That Changed AI

#ai #webdev #productivity #programming

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

If you've used ChatGPT, Claude, Gemini, or any modern LLM, you've already benefited from one of the most influential architectures in modern AI: the Transformer.

In 2017, a team of researchers at Google published a paper titled "Attention Is All You Need." Hidden inside it was the now-famous architecture diagram shown below—the blueprint for the Transformer.

At first glance, it looks intimidating: boxes, arrows, loops, encoder stacks, decoder stacks, attention blocks...

But underneath the complexity lies a surprisingly elegant idea.

By the end of this article, you'll understand every single component in this diagram, why it exists, how information flows through it, and how all these pieces collaborate to generate human-like language.

Let's build our understanding layer by layer.

The Big Picture: Two Factories Working Together

Before diving into the individual blocks, ignore the details and look at the overall shape.

There are two major halves:

Encoder (left)
Decoder (right)

Think of them as two specialized factories.

Input sentence
      │
      ▼
 ┌───────────────┐
 │    Encoder    │
 └───────────────┘
      │
 Learned representation
      │
      ▼
 ┌───────────────┐
 │    Decoder    │
 └───────────────┘
      │
      ▼
 Generated text

Originally, this architecture was designed for machine translation.

Example:

English:
"The cat sat on the mat."

Encoder understands it.

↓

Decoder generates:

"Le chat était assis sur le tapis."

The encoder's only responsibility is understanding.

The decoder's responsibility is generation.

Today's GPT models actually use only the decoder, while models like BERT use only the encoder. The original Transformer paper contained both because translation requires understanding one language before producing another.

Step 1 — Input Embeddings: Converting Words into Numbers

Computers don't understand words.

They understand vectors.

When the sentence

The cat sat

enters the Transformer, each token (here, roughly one per word) is converted into a dense numerical vector.

"The"
↓

[0.17, -0.42, 1.33, ...]

These vectors are called embeddings.

After training, words with similar meanings tend to end up close together in this high-dimensional space.

For example,

king
queen
prince
princess

occupy nearby regions.

Instead of manually designing these vectors, the Transformer learns them during training.

In the architecture diagram, this is the very first pink box:

Inputs
   │
   ▼
Input Embedding

At this stage, every token is simply represented as a learned numerical vector.
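To make the lookup concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the width of 4, and the random starting values are all assumptions for illustration; a real model uses a much larger table and adjusts the numbers during training.

```python
import random

random.seed(0)  # reproducible toy values

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4  # embedding width (the paper used 512)

# The embedding table is a matrix with one row per token.
# It starts out random here; training would adjust these numbers.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)]
    for _ in vocab
]

def embed(tokens):
    """Look up one vector per token."""
    return [embedding_table[vocab[t]] for t in tokens]

vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))  # 3 4
```

The key point is that an embedding layer is nothing more exotic than a table lookup whose rows happen to be trainable.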

Step 2 — Positional Encoding: Giving Words an Order

Here's an interesting problem.

Attention doesn't inherently know word order.

Consider:

Dog bites man

Man bites dog

Same words.

Completely different meaning.

Since attention processes every word simultaneously, we must explicitly tell the model where each word appears.

That's the purpose of Positional Encoding.

The positional vector is added directly to the embedding.

Embedding

+

Position

=

Final input vector

This explains the small ⊕ symbol in the diagram.

Embedding   Position
       \     /
         ⊕
         │
         ▼
 Final input vector

Rather than having to infer word order on its own, the model receives positional information directly in its input.

You can think of it as giving every token both:

what it is
where it occurs
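The original paper used fixed sine and cosine waves of different frequencies for the position vectors. The sketch below follows that formula in plain Python; the function names and the toy embeddings are assumptions made for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_position(embeddings):
    """Element-wise sum of each embedding with its position vector (the ⊕ step)."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# The same toy embedding at two positions becomes distinguishable after the sum.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
out = add_position(same_word_twice)
print(out[0])            # [0.5, 1.5, 0.5, 1.5]  (position 0 adds [0, 1, 0, 1])
print(out[0] != out[1])  # True
```

Many later models swap this scheme for learned or rotary position information, but the idea of mixing "what" with "where" stays the same.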

Step 3 — The Encoder Stack: Understanding the Entire Sentence

Now we reach the large box labelled N×.

┌─────────────────────┐
│ Multi-Head Attention│
│ Add & Norm          │
│ Feed Forward        │
│ Add & Norm          │
└─────────────────────┘

Repeated N times

The paper used N = 6.

Modern LLMs often use dozens of layers, and the largest stack more than a hundred.

Each encoder layer gradually refines the representation.

Think of reading a paragraph.

Your first reading identifies words.

The second discovers phrases.

The third understands relationships.

The fourth extracts meaning.

Each encoder layer performs another refinement pass.

Step 4 — Multi-Head Attention: The Heart of the Transformer

This is the innovation that changed deep learning.

Suppose the sentence is:

The animal didn't cross the street because it was tired.

What does it refer to?

Attention allows every word to examine every other word before updating its representation.

animal  ←──────────┐
                   │
cross ─────────────┤
                   │
street ────────────┤
                   │
it  ◄──────────────┘

Unlike RNNs, which pass information along one step at a time so that distant words tend to fade, every token has direct access to the entire sentence.

This allows long-distance dependencies to be captured naturally.
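Under the hood, the paper computes attention as softmax(QKᵀ / √d_k) · V. Here is a rough sketch of that formula in plain Python. To keep it short, it assumes the queries, keys, and values are the token vectors themselves; a real layer first passes them through learned projection matrices. The toy vectors are made up.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention. Returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How strongly does this query match each key?
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using those weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention over three toy token vectors: every token looks at every token.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention over the whole sentence, and each row sums to 1.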

Why Multiple Heads?

One attention mechanism isn't enough.

Different relationships matter.

One head may learn:

grammatical structure

Another:

pronoun resolution

Another:

verb-object relationships

Another:

semantic similarity

Imagine several experts reading the same sentence simultaneously.

Head 1:
Grammar

Head 2:
Meaning

Head 3:
Syntax

Head 4:
Long-range context

Their outputs are combined into a richer representation.

This is why it's called Multi-Head Attention.
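One way to picture the "several experts" idea in code: slice each token vector into chunks, run attention on each chunk separately, and glue the results back together. The sketch below does exactly that in plain Python. It is a simplification: the learned per-head projections and the final output projection from the paper are left out, and the helper names are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(x, num_heads):
    """Slice every d_model-wide vector into num_heads equal chunks."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]

def multi_head_self_attention(x, num_heads):
    # Each head attends independently over its own slice of the vectors.
    per_head = [attention(h, h, h) for h in split_heads(x, num_heads)]
    # Concatenate the head outputs back into one vector per token.
    return [[value for head in per_head for value in head[t]]
            for t in range(len(x))]

x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Because each head sees a different slice, each can settle on a different attention pattern, and the output keeps the same shape as the input.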

Step 5 — Add & Norm: Keeping Training Stable

Notice that after every major block we see

Add & Norm

This performs two operations.

Residual Connection (Add)

Instead of replacing information, we preserve the original.

Output

=

Attention(x)

+

x

This shortcut helps gradients flow through deep networks.

Without it, training hundreds of layers becomes extremely difficult.

Layer Normalization (Norm)

Different layers naturally produce values on different scales.

Layer normalization keeps activations well-behaved.

Think of it as recalibrating measurements after every processing stage.

Without normalization:

Layer 1

0.3

Layer 2

500

Layer 3

0.0004

Training quickly becomes unstable.

Normalization keeps everything numerically manageable.
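The paper applies both operations together as LayerNorm(x + Sublayer(x)). A minimal sketch in plain Python, assuming no learned gain or bias (real layer normalization adds both) and using a made-up input with wildly different scales:

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale one token's vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Values on very different scales come out in a well-behaved range.
x = [0.3, 500.0, 0.0004, -2.0]
sublayer_out = [0.1, -20.0, 0.0, 1.0]  # stand-in for an attention output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

Whatever the sublayer produces, the next layer always receives inputs with a predictable scale.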

Step 6 — Feed Forward Networks: Thinking Independently

Attention allows tokens to exchange information.

The Feed Forward layer allows each token to process what it has learned.

For every token independently:

Vector

↓

Linear

↓

Activation

↓

Linear

No interaction happens here.

Instead, this stage performs deeper feature extraction.
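In the paper, this block is two linear layers with a ReLU in between, with a wider middle layer (2048 versus 512). Here is a tiny sketch in plain Python; the sizes and the hand-picked weights are assumptions chosen so the arithmetic is easy to follow.

```python
def linear(vec, weights, bias):
    """One linear layer; weights has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def feed_forward(vec, w1, b1, w2, b2):
    """Linear -> activation -> Linear, applied to a single token."""
    return linear(relu(linear(vec, w1, b1)), w2, b2)

# Hypothetical weights: d_model = 2, hidden width = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5]]
# The same weights are applied to every token, one token at a time.
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
print(outputs)  # [[3.0, -1.0], [1.0, -0.5]]
```

Notice that the list comprehension never lets one token see another, which is the "no interaction" property in code form.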

An analogy:

Attention is a group discussion.

Feed Forward is everyone quietly thinking afterward.

This alternating pattern—

Discuss

↓

Think

↓

Discuss

↓

Think

is repeated across every Transformer layer.

Step 7 — The Decoder: Generating One Token at a Time

Now we move to the right half of the diagram.

The decoder is responsible for producing text.

Its input isn't the original sentence.

Instead it receives:

<START>

↓

The

↓

The cat

↓

The cat sat

↓

...

Notice the label:

Outputs
(shifted right)

This means that during training, the decoder receives the correct previous token as input while learning to predict the next one.

If the target sentence is:

The cat sat

the decoder sees:

<START>

The

The cat

and learns to predict:

The

cat

sat

This "teacher forcing" strategy makes training much more efficient because the model always conditions on the correct history rather than its own mistakes.
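The "shifted right" trick is easy to see in code. This sketch builds the decoder input and the labels from one target sentence; the `<END>` marker name and the helper function are assumptions for illustration.

```python
START, END = "<START>", "<END>"  # marker names are assumptions

def make_training_pair(target_tokens):
    """Decoder input is the target shifted right by one position."""
    decoder_input = [START] + target_tokens
    labels = target_tokens + [END]
    return decoder_input, labels

decoder_input, labels = make_training_pair(["The", "cat", "sat"])

# At step i the decoder sees the first i + 1 input tokens and must predict labels[i].
for i, label in enumerate(labels):
    print(decoder_input[: i + 1], "->", label)
```

Every position gets its own prediction target, so a single pass over the sentence trains all of them.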

Step 8 — Masked Multi-Head Attention: Preventing Cheating

Imagine predicting the next word in:

The cat sat on the __

If the model could already see

mat

there would be nothing to learn.

So the decoder applies Masked Multi-Head Attention.

Current token

↓

Can see:

Previous words ✔

Future words ✘

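During training, the full target sentence is already available, so the mask hides what comes next:

I love

cannot attend to

pizza

because, at generation time, pizza would not exist yet.

The mask preserves causality.

It makes training match the way GPT models generate text: one token at a time, each conditioned only on the tokens before it.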
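A common way to implement the mask is to set the scores of future positions to negative infinity before the softmax, so their weights become exactly zero. A minimal sketch in plain Python, with made-up uniform scores and assumed helper names:

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax that gives zero weight to disallowed positions."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a token can always see itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["I", "love", "pizza"]
mask = causal_mask(len(tokens))
scores = [[1.0, 1.0, 1.0] for _ in tokens]  # pretend every pair matches equally

weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

The rows for "I" and "love" put zero weight on "pizza", while "pizza" itself can look back at everything before it.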

Step 9 — Cross-Attention: Looking Back at the Encoder

The second attention block inside the decoder is different.

Here, the decoder attends to the encoder output.

Encoder

↓

Sentence meaning

↓

Decoder consults it

Suppose we're translating:

"The red car."

While generating

voiture

the decoder continually asks:

Which parts of the original sentence are relevant right now?

This interaction between encoder and decoder is called cross-attention (shown simply as "Multi-Head Attention" in the original diagram, with arrows coming from the encoder stack).

It lets the decoder ground each generated token in the encoded meaning of the source sentence instead of relying only on previously generated words.

Decoder-only models like GPT omit this block because there is no separate encoder to consult.
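The only mechanical difference from self-attention is where the inputs come from: queries come from the decoder, while keys and values come from the encoder output. The sketch below shows this in plain Python. The vectors are hand-made for illustration, and the learned projections are again left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder; keys and values from the encoder."""
    scale = math.sqrt(len(encoder_states[0]))
    contexts, all_weights = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in encoder_states]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, encoder_states))
                   for j in range(len(encoder_states[0]))]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights

source = ["The", "red", "car"]
# Hypothetical encoder outputs, one vector per source word.
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical decoder state at the moment "voiture" is being generated.
decoder_states = [[0.0, 0.0, 4.0]]

contexts, weights = cross_attention(decoder_states, encoder_states)
best = max(range(len(source)), key=lambda j: weights[0][j])
print(source[best], [round(w, 2) for w in weights[0]])
```

With these toy vectors the decoder's query lines up with "car", which is the kind of alignment a trained translation model learns on its own.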

Step 10 — Linear Layer + Softmax: Choosing the Next Word

After the decoder finishes processing, we finally reach the top of the diagram.

Decoder Output

↓

Linear

↓

Softmax

↓

Output Probabilities

The Linear layer converts the decoder's hidden representation into one score for every token in the vocabulary.

Imagine a vocabulary containing 50,000 words.

The output might look like:

cat      5.8
dog      2.1
apple   -0.4
car      1.7
...

These raw scores (often called logits) aren't probabilities yet.

The Softmax layer transforms them into a probability distribution:

cat      0.81
dog      0.02
car      0.013
apple    0.002
...

The model can then choose the next token—either the most probable one or a sampled alternative depending on the decoding strategy.

That chosen token is fed back into the decoder, and the entire process repeats until an end-of-sequence token is produced.
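Here is a small sketch of the last step in plain Python: turn logits into probabilities, then pick a token either greedily or by sampling. It assumes a toy four-token vocabulary, so the probabilities come out larger than they would when spread across 50,000 tokens.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy four-token vocabulary using the example scores above.
logits = {"cat": 5.8, "dog": 2.1, "apple": -0.4, "car": 1.7}
probs = dict(zip(logits, softmax(list(logits.values()))))

# Greedy decoding: always take the most probable token.
greedy = max(probs, key=probs.get)

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)
sampled = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print({token: round(p, 3) for token, p in probs.items()})
print("greedy:", greedy)
print("sampled:", sampled)
```

Greedy decoding is deterministic, while sampling trades a little predictability for more varied output.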

Putting It All Together: Following the Data Through the Diagram

Now the entire figure becomes much easier to read.

Input text

↓

Input Embedding

↓

Positional Encoding

↓

Encoder Stack (N layers)
    ├─ Multi-Head Attention
    ├─ Add & Norm
    ├─ Feed Forward
    └─ Add & Norm

↓

Context-rich representation

↓

Decoder receives previous outputs
(shifted right)

↓

Output Embedding

↓

Positional Encoding

↓

Masked Multi-Head Attention
(looks only at earlier generated tokens)

↓

Cross-Attention
(consults the encoder output)

↓

Feed Forward

↓

Repeat N layers

↓

Linear

↓

Softmax

↓

Next token probability

↓

Repeat until complete sentence

Every block has a specific role:

Component	Purpose
Input Embedding	Convert tokens into dense vectors
Positional Encoding	Encode word order
Multi-Head Attention	Let tokens exchange information globally
Masked Multi-Head Attention	Prevent each position from attending to later tokens
Cross-Attention	Allow the decoder to consult the encoder's understanding
Feed Forward	Transform each token's representation independently
Add & Norm	Stabilize optimization and preserve information via residual connections
Encoder Stack (N×)	Build increasingly rich contextual representations
Decoder Stack (N×)	Generate the output sequence one token at a time
Linear	Produce a score for every vocabulary token
Softmax	Convert scores into probabilities for selecting the next token

A Crisp Summary To Help You Remember This

The Transformer succeeded because it replaced sequential processing with parallel attention, enabling models to reason over entire sequences at once while still generating coherent text token by token.

Almost every major language model today—from GPT and Claude to Llama, Mistral, and Gemini—can trace its lineage back to this deceptively simple diagram. While modern architectures introduce refinements such as rotary positional embeddings, grouped-query attention, mixture-of-experts layers, and optimized decoding strategies, the core ideas remain strikingly similar to those introduced in 2017.

The next time someone says an LLM is "just predicting the next token," remember what's happening under the hood: embeddings capture meaning, positional encodings preserve order, attention weaves relationships across the sequence, feed-forward networks refine those representations, residual connections keep deep networks trainable, and the decoder repeatedly transforms all of that into one probability distribution after another until a coherent response emerges.

What part of the Transformer architecture surprised you the most—the fact that every token can attend to every other token, the masking that enables autoregressive generation, or how such a simple stack of repeated blocks scales to models with hundreds of billions of parameters?

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

