DEV Community

Shrijith Venkatramana


The Transformer Architecture Behind Modern LLMs: A Developer's Guide to the Diagram That Changed AI

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


If you've used ChatGPT, Claude, Gemini, or any modern LLM, you've already benefited from one of the most influential architectures in modern AI.

In 2017, a team of researchers at Google published a paper titled "Attention Is All You Need." Hidden inside it was the now-famous architecture diagram shown below—the blueprint for the Transformer.

[Figure: the Transformer architecture diagram from "Attention Is All You Need"]

At first glance, it looks intimidating: boxes, arrows, loops, encoder stacks, decoder stacks, attention blocks...

But underneath the complexity lies a surprisingly elegant idea.

By the end of this article, you'll understand every single component in this diagram, why it exists, how information flows through it, and how all these pieces collaborate to generate human-like language.

Let's build our understanding layer by layer.

The Big Picture: Two Factories Working Together

Before diving into the individual blocks, ignore the details and look at the overall shape.

There are two major halves:

  • Encoder (left)
  • Decoder (right)

Think of them as two specialized factories.

Input sentence
      │
      ▼
 ┌───────────────┐
 │    Encoder    │
 └───────────────┘
      │
 Learned representation
      │
      ▼
 ┌───────────────┐
 │    Decoder    │
 └───────────────┘
      │
      ▼
 Generated text

Originally, this architecture was designed for machine translation.

Example:

English:
"The cat sat on the mat."

Encoder understands it.

↓

Decoder generates:

"Le chat était assis sur le tapis."

The encoder's only responsibility is understanding.

The decoder's responsibility is generation.

Today's GPT models actually use only the decoder, while models like BERT use only the encoder. The original Transformer paper contained both because translation requires understanding one language before producing another.

Step 1 — Input Embeddings: Converting Words into Numbers

Computers don't understand words.

They understand vectors.

When the sentence

The cat sat

enters the Transformer, each word is converted into a dense numerical vector.

"The"
↓

[0.17, -0.42, 1.33, ...]

These vectors are called embeddings.

Words with similar meanings naturally end up close together in this high-dimensional space.

For example,

king
queen
prince
princess

occupy nearby regions.

Instead of manually designing these vectors, the Transformer learns them during training.

In the architecture diagram, this is the very first pink box:

Inputs
   │
   ▼
Input Embedding

At this stage, every token is simply represented as a learned numerical vector.
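To make the lookup concrete, here is a minimal Python sketch. The three-word vocabulary, the width of 4, and the random starting values are assumptions for illustration; a real model works with a far larger vocabulary and wider vectors (the paper used 512 dimensions), and training is what turns the random numbers into meaningful ones.

```python
import random

# Hypothetical toy vocabulary mapping each token to a row index.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4  # embedding width, kept tiny for readability

# The embedding table is a matrix with one row per token.
# It starts out random; training would adjust these numbers.
rng = random.Random(0)
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    """Return one vector per token by looking up its row in the table."""
    return [embedding_table[vocab[token]] for token in tokens]

vectors = embed(["the", "cat", "sat"])
print(len(vectors), "tokens, each a vector of length", len(vectors[0]))
```

The key point is that an embedding layer is a table lookup, not a computation: the "learning" happens in the rows of the table.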

Step 2 — Positional Encoding: Giving Words an Order

Here's an interesting problem.

Attention doesn't inherently know word order.

Consider:

Dog bites man

Man bites dog

Same words.

Completely different meaning.

Since attention processes every word simultaneously, we must explicitly tell the model where each word appears.

That's the purpose of Positional Encoding.

The positional vector is added directly to the embedding.

Embedding

+

Position

=

Final input vector

This explains the small ⊕ symbol in the diagram.

Embedding   Position
       \     /
          ⊕
          │
          ▼
  Final input vector

Rather than leaving the model to infer word order on its own, positional encoding supplies it explicitly from the very first layer.

You can think of it as giving every token both:

  • what it is
  • where it occurs
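The original paper used fixed sine and cosine waves of different frequencies for the positional vector. Here is a small sketch of that scheme in plain Python; the function names and the toy embedding values are my own, chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(embeddings):
    """Element-wise sum of each token embedding and its positional vector."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# The same (made-up) embedding placed at three different positions.
same_word = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
for position, row in enumerate(add_position(same_word)):
    print(position, [round(value, 3) for value in row])
```

Even though the three embeddings are identical, the three resulting vectors differ, which is exactly what lets attention tell "dog bites man" from "man bites dog".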

Step 3 — The Encoder Stack: Understanding the Entire Sentence

Now we reach the large box on the left: the encoder stack.

┌─────────────────────┐
│ Multi-Head Attention│
│ Add & Norm          │
│ Feed Forward        │
│ Add & Norm          │
└─────────────────────┘

Repeated N times

The paper used N = 6.

Modern LLMs often use dozens of layers, and the largest stack more than a hundred.

Each encoder layer gradually refines the representation.

Think of reading a paragraph.

Your first reading identifies words.

The second discovers phrases.

The third understands relationships.

The fourth extracts meaning.

Each encoder layer performs another refinement pass.

Step 4 — Multi-Head Attention: The Heart of the Transformer

This is the innovation that changed deep learning.

Suppose the sentence is:

The animal didn't cross the street because it was tired.

What does it refer to?

Attention allows every word to examine every other word before updating its representation.

animal  ←──────────┐
                   │
cross ─────────────┤
                   │
street ────────────┤
                   │
it  ◄──────────────┘

Unlike RNNs, which pass information along one step at a time so that distant words must survive many updates, every token here has direct access to the entire sentence.

This allows long-distance dependencies to be captured naturally.
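Under the hood this is scaled dot-product attention: softmax(QKᵀ / √d_k) · V. Below is a minimal sketch in plain Python. The two-dimensional token vectors are made up, and I use them directly as queries, keys, and values; a real layer would first pass them through learned projection matrices.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How strongly does this token match every token in the sentence?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["animal", "street", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # hypothetical vectors, "it" close to "animal"
outputs, weights = attention(x, x, x)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

In this toy setup "it" puts more weight on "animal" than on "street" simply because their vectors point in similar directions. In a trained model, the learned projections decide what counts as a match.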

Why Multiple Heads?

One attention mechanism isn't enough.

Different relationships matter.

One head may learn:

  • grammatical structure

Another:

  • pronoun resolution

Another:

  • verb-object relationships

Another:

  • semantic similarity

Imagine several experts reading the same sentence simultaneously.

Head 1:
Grammar

Head 2:
Meaning

Head 3:
Syntax

Head 4:
Long-range context

Their outputs are combined into a richer representation.

This is why it's called Multi-Head Attention.
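Mechanically, "multiple heads" means each head runs attention in its own smaller slice of the vector, and the results are concatenated back to full width. The sketch below shows only that split-attend-concatenate shape. It is a simplification: it omits the learned query, key, value, and output projections that a real layer applies, and all names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(x, num_heads):
    """Slice every vector into num_heads chunks of equal width."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]

def multi_head_self_attention(x, num_heads):
    # Each head attends independently within its own slice.
    head_outputs = [attention(h, h, h) for h in split_heads(x, num_heads)]
    # Concatenate the per-head results back to the original width.
    combined = []
    for t in range(len(x)):
        row = []
        for head_output in head_outputs:
            row.extend(head_output[t])
        combined.append(row)
    return combined

x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]  # 2 tokens, width 4 (made up)
result = multi_head_self_attention(x, num_heads=2)
print(len(result), len(result[0]))
```

The output has the same shape as the input, which is what allows these layers to be stacked.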

Step 5 — Add & Norm: Keeping Training Stable

Notice that after every major block we see

Add & Norm

This performs two operations.

Residual Connection (Add)

Instead of replacing information, we preserve the original.

Output

=

Attention(x)

+

x

This shortcut helps gradients flow through deep networks.

Without it, training hundreds of layers becomes extremely difficult.

Layer Normalization (Norm)

Different layers naturally produce values on different scales.

Layer normalization keeps activations well-behaved.

Think of it as recalibrating measurements after every processing stage.

Without normalization:

Layer 1

0.3

Layer 2

500

Layer 3

0.0004

Training quickly becomes unstable.

Normalization keeps everything numerically manageable.
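In the original paper, each sub-layer's output is LayerNorm(x + Sublayer(x)). Here is a minimal sketch of that step for a single token vector. The input numbers are invented to show wildly different scales, and the learned scale and bias parameters of a real layer norm are left out.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one token's vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(summed)

x = [0.3, 500.0, 0.0004, -2.0]          # hypothetical input to the sub-layer
sublayer_output = [0.1, -20.0, 0.0, 1.0]  # hypothetical attention output
y = add_and_norm(x, sublayer_output)
print([round(v, 3) for v in y])
```

Note that the normalization runs across the features of each token separately, so every token leaves the block on a comparable scale.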

Step 6 — Feed Forward Networks: Thinking Independently

Attention allows tokens to exchange information.

The Feed Forward layer allows each token to process what it has learned.

For every token independently:

Vector

↓

Linear

↓

Activation

↓

Linear

No interaction happens here.

Instead, this stage performs deeper feature extraction.

An analogy:

Attention is a group discussion.

Feed Forward is everyone quietly thinking afterward.

This alternating pattern—

Discuss

↓

Think

↓

Discuss

↓

Think

is repeated across every Transformer layer.
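The "thinking" step is small enough to write out in full. This sketch uses ReLU as the activation, as the paper did; the weight matrices are tiny hand-picked values for illustration, not anything a trained model would contain.

```python
def linear(x, weights, bias):
    """y = W x + b for a single token vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Linear -> activation -> Linear, applied to one token at a time."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Hypothetical weights: expand width 2 -> 4, then project back 4 -> 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, -2.0], [0.5, 0.5]]
# The same weights are used for every token, and no token sees another.
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
print(outputs)
```

Because each token is processed on its own, changing one token's vector here cannot affect any other token's output. Only attention mixes information across positions.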

Step 7 — The Decoder: Generating One Token at a Time

Now we move to the right half of the diagram.

The decoder is responsible for producing text.

Its input isn't the original sentence.

Instead it receives:

<START>

↓

The

↓

The cat

↓

The cat sat

↓

...

Notice the label:

Outputs
(shifted right)

This means that during training, the decoder receives the correct previous token as input while learning to predict the next one.

If the target sentence is:

The cat sat

the decoder sees:

<START>

The

The cat

and learns to predict:

The

cat

sat

This "teacher forcing" strategy makes training much more efficient because the model always conditions on the correct history rather than its own mistakes.
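The "shifted right" bookkeeping is easy to get wrong, so here it is as a few lines of Python. The helper name `shift_right` is my own; the tokens follow the example above.

```python
def shift_right(target_tokens, start_token="<START>"):
    """Decoder input: the target with a start token in front and the last token dropped."""
    return [start_token] + target_tokens[:-1]

target = ["The", "cat", "sat"]
decoder_input = shift_right(target)

# At step i the decoder may use decoder_input[: i + 1] and must predict target[i].
training_pairs = [(decoder_input[: i + 1], target[i]) for i in range(len(target))]
for history, next_token in training_pairs:
    print(history, "->", next_token)
```

Every history is built from the correct target tokens, never from the model's own guesses, which is the essence of teacher forcing.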

Step 8 — Masked Multi-Head Attention: Preventing Cheating

Imagine predicting the next word in:

The cat sat on the __

If the model could already see

mat

there would be nothing to learn.

So the decoder applies Masked Multi-Head Attention.

Current token

↓

Can see:

Previous words ✔

Future words ✘

During generation:

I love

cannot attend to

pizza

until pizza has actually been generated.

The mask preserves causality.

This matches how GPT models generate text: one token at a time, each conditioned only on the tokens before it.
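In code, the mask is a lower-triangular grid of "allowed" flags. A common way to apply it, sketched here, is to set the score of every forbidden position to negative infinity before the softmax, so its weight becomes exactly zero. The uniform scores are placeholders so the effect of the mask is easy to see.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax in which disallowed positions receive zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)  # position 0 is always allowed, so m is finite
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["I", "love", "pizza"]
scores = [1.0, 1.0, 1.0]  # placeholder attention scores
weights = [masked_softmax(scores, row) for row in causal_mask(len(tokens))]
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

"I" can only attend to itself, "love" splits its attention over "I" and "love", and only "pizza" sees all three tokens.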

Step 9 — Cross-Attention: Looking Back at the Encoder

The second attention block inside the decoder is different.

Here, the decoder attends to the encoder output.

Encoder

↓

Sentence meaning

↓

Decoder consults it

Suppose we're translating:

"The red car."

While generating

voiture

the decoder continually asks:

Which parts of the original sentence are relevant right now?

This interaction between encoder and decoder is called cross-attention (shown simply as "Multi-Head Attention" in the original diagram, with arrows coming from the encoder stack).

It lets the decoder ground each generated token in the encoded meaning of the source sentence instead of relying only on previously generated words.

Decoder-only models like GPT omit this block because there is no separate encoder to consult.
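The only mechanical difference from self-attention is where the inputs come from: queries come from the decoder, while keys and values come from the encoder output. A minimal sketch, with made-up vectors and without the learned projections a real layer would apply:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Queries from the decoder; keys and values from the encoder."""
    d_k = len(encoder_outputs[0])
    contexts, all_weights = [], []
    for query in decoder_states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in encoder_outputs]
        weights = softmax(scores)
        contexts.append([sum(w * value[j] for w, value in zip(weights, encoder_outputs))
                         for j in range(d_k)])
        all_weights.append(weights)
    return contexts, all_weights

source_tokens = ["The", "red", "car"]
encoder_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical
decoder_states = [[0.1, 0.2, 3.0]]  # hypothetical state while producing "voiture"

contexts, weights = cross_attention(decoder_states, encoder_outputs)
print(dict(zip(source_tokens, [round(w, 2) for w in weights[0]])))
```

With these toy numbers the decoder state lines up best with "car", so most of the attention weight lands there. Note that the number of decoder tokens and the number of source tokens do not need to match.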

Step 10 — Linear Layer + Softmax: Choosing the Next Word

After the decoder finishes processing, we finally reach the top of the diagram.

Decoder Output

↓

Linear

↓

Softmax

↓

Output Probabilities

The Linear layer converts the decoder's hidden representation into one score for every token in the vocabulary.

Imagine a vocabulary containing 50,000 words.

The output might look like:

cat      5.8
dog      2.1
apple   -0.4
car      1.7
...

These raw scores (often called logits) aren't probabilities yet.

The Softmax layer transforms them into a probability distribution:

cat      0.81
dog      0.02
car      0.01
apple    0.002
...

The model can then choose the next token—either the most probable one or a sampled alternative depending on the decoding strategy.

That chosen token is fed back into the decoder, and the entire process repeats until an end-of-sequence token is produced.
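Here is a rough sketch of that last stage: softmax over the logits, two ways of picking a token, and the feedback loop. Because only four of the roughly 50,000 logits are included, the probabilities it prints are larger than they would be over a full vocabulary. `TOY_MODEL` is a hypothetical stand-in that looks only at the last token; a real decoder conditions on everything generated so far.

```python
import math
import random

def softmax(scores):
    """Convert a dict of token -> logit into token -> probability."""
    m = max(scores.values())
    exps = {token: math.exp(s - m) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

logits = {"cat": 5.8, "dog": 2.1, "apple": -0.4, "car": 1.7}
probs = softmax(logits)

# Greedy decoding: always take the most probable token.
greedy = max(probs, key=probs.get)
# Sampling: draw a token in proportion to its probability.
sampled = random.Random(0).choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical stand-in for a trained decoder: last token -> next-token logits.
TOY_MODEL = {
    "<START>": {"The": 3.0, "cat": 0.5, "<END>": -1.0},
    "The": {"cat": 5.8, "dog": 2.1, "<END>": -0.4},
    "cat": {"sat": 4.0, "<END>": 1.0},
    "sat": {"<END>": 3.0, "cat": 0.1},
}

def generate(max_tokens=10):
    """Feed each chosen token back in until the end token appears."""
    tokens = ["<START>"]
    while len(tokens) <= max_tokens:
        step_probs = softmax(TOY_MODEL[tokens[-1]])
        next_token = max(step_probs, key=step_probs.get)  # greedy choice
        if next_token == "<END>":
            break
        tokens.append(next_token)
    return tokens[1:]

print(greedy, generate())
```

Swapping the greedy choice inside `generate` for a sampled one is what makes the same prompt produce different completions from run to run.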

Putting It All Together: Following the Data Through the Diagram

Now the entire figure becomes much easier to read.

Input text

↓

Input Embedding

↓

Positional Encoding

↓

Encoder Stack (N layers)
    ├─ Multi-Head Attention
    ├─ Add & Norm
    ├─ Feed Forward
    └─ Add & Norm

↓

Context-rich representation

↓

Decoder receives previous outputs
(shifted right)

↓

Output Embedding

↓

Positional Encoding

↓

Masked Multi-Head Attention
(looks only at earlier generated tokens)

↓

Cross-Attention
(consults the encoder output)

↓

Feed Forward

↓

Repeat N layers

↓

Linear

↓

Softmax

↓

Next token probability

↓

Repeat until complete sentence

Every block has a specific role:

Component                     Purpose
Input Embedding               Convert tokens into dense vectors
Positional Encoding           Encode word order
Multi-Head Attention          Let tokens exchange information globally
Masked Multi-Head Attention   Prevent access to future tokens during generation
Cross-Attention               Allow the decoder to consult the encoder's understanding
Feed Forward                  Transform each token's representation independently
Add & Norm                    Stabilize optimization and preserve information via residual connections
Encoder Stack (N×)            Build increasingly rich contextual representations
Decoder Stack (N×)            Generate the output sequence one token at a time
Linear                        Produce a score for every vocabulary token
Softmax                       Convert scores into probabilities for selecting the next token

A Crisp Summary To Help You Remember This

The Transformer succeeded because it replaced sequential processing with parallel attention, enabling models to reason over entire sequences at once while still generating coherent text token by token.

Almost every major language model today—from GPT and Claude to Llama, Mistral, and Gemini—can trace its lineage back to this deceptively simple diagram. While modern architectures introduce refinements such as rotary positional embeddings, grouped-query attention, mixture-of-experts layers, and optimized decoding strategies, the core ideas remain strikingly similar to those introduced in 2017.

The next time someone says an LLM is "just predicting the next token," remember what's happening under the hood: embeddings capture meaning, positional encodings preserve order, attention weaves relationships across the sequence, feed-forward networks refine those representations, residual connections keep deep networks trainable, and the decoder repeatedly transforms all of that into one probability distribution after another until a coherent response emerges.

What part of the Transformer architecture surprised you the most—the fact that every token can attend to every other token, the masking that enables autoregressive generation, or how such a simple stack of repeated blocks scales to models with hundreds of billions of parameters?


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

