DEV Community

Jack Pritom Soren

How Transformers Work Inside an LLM (Step by Step)

1️⃣ Big Picture: Where Do Transformers Fit in an LLM?

The full LLM pipeline looks like this:

Input Text
   ↓
Tokenizer
   ↓
Embedding + Positional Encoding
   ↓
🔥 Transformer Blocks (Core Brain)
   ↓
Softmax (Probability)
   ↓
Next Token

👉 The Transformer is the brain of the LLM
👉 All context understanding and relationship modeling happens here


2️⃣ What Happens First When Input Enters the Model?

Example Input

"Today the server is down"

🔹 Step 1: Tokenization

["Today", "the", "server", "is", "down"]

LLMs don’t process raw words; they process tokens. In real models these are usually subword pieces, so a single word can map to several tokens.
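A minimal sketch of the idea in Python. This is a toy word-level tokenizer with a made-up vocabulary, not a real BPE/subword tokenizer — the point is just that text becomes a list of integer IDs:

```python
# Toy word-level tokenizer. Real LLMs use learned subword schemes (BPE,
# WordPiece, etc.), but the core mapping text -> integer IDs is the same.
vocab = {"Today": 0, "the": 1, "server": 2, "is": 3, "down": 4, "<unk>": 5}

def tokenize(text):
    """Split on whitespace and map each word to its vocabulary ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

ids = tokenize("Today the server is down")
print(ids)  # [0, 1, 2, 3, 4]
```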


🔹 Step 2: Embedding

Each token is converted into a numerical vector:

"server" → [0.32, -1.10, 0.87, ...]

This is a mathematical representation of meaning.
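A sketch of the lookup, assuming a tiny 4-dimensional embedding table with random values (real models learn these weights during training, and use hundreds or thousands of dimensions):

```python
import random

random.seed(0)
d_model = 4       # tiny embedding size for illustration
vocab_size = 6    # matches the toy vocabulary above

# The embedding table is a learned matrix: one d_model-dim row per token ID.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up each token ID's vector in the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([0, 1, 2])  # "Today the server"
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 4-dim vector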


🔹 Step 3: Positional Encoding (Very Important)

Transformers do not understand word order by default.

So position information is added to embeddings.

Example:

Today the server is down
The server is down today

Without positional encoding, these would look identical ❌
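One classic scheme, sketched below, is the sinusoidal encoding from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies, so every position gets a distinct pattern that is added to its token's embedding:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=5, d_model=4)
# Position 0 encodes as [0, 1, 0, 1]; every later position differs, so the
# two sentences above no longer look identical after the encoding is added.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```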


3️⃣ What’s Inside a Transformer Block?

A single Transformer block has two main components:

[ Multi-Head Self-Attention ]
            ↓
[ Feed Forward Neural Network ]

LLMs contain many such blocks:

  • Small models → 12–24 blocks
  • Large models → 48–96+ blocks

Each block refines the representation further.
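Structurally, the stacking looks like the sketch below. The two sublayers are placeholders here (identity functions) just to show how blocks compose; their real internals are covered in the next sections:

```python
def self_attention(x):
    """Placeholder: a real block mixes information across tokens here."""
    return x

def feed_forward(x):
    """Placeholder: a real block transforms each token independently here."""
    return x

def transformer_block(x):
    # Attention first, then the feed-forward network.
    # (Residual connections and normalization are covered later.)
    x = self_attention(x)
    x = feed_forward(x)
    return x

def run_blocks(x, n_blocks=12):
    """Stack many identical blocks; each pass refines the representation."""
    for _ in range(n_blocks):
        x = transformer_block(x)
    return x

out = run_blocks([[0.1, 0.2], [0.3, 0.4]])
```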


4️⃣ Self-Attention: The Core Power of Transformers 🧠

Self-attention means:

To understand one token, the model determines which other tokens are relevant.

Example

Rahim fixed the server because he understands debugging.

The token “he” attends to “Rahim”.

This is how Transformers learn context and relationships.


5️⃣ How Attention Works Internally (Q, K, V)

Each token is transformed into three vectors:

  • Query (Q): What am I looking for?
  • Key (K): What information do I contain?
  • Value (V): What content should be passed forward?

The core calculation (scaled dot-product attention):

Attention Score = (Q · Kᵀ) / √d_k
Attention Weights = softmax(Attention Score)

The scores are divided by √d_k (the key dimension) to keep them numerically stable, then softmax turns them into weights. Tokens with higher scores contribute more strongly.

Final output:

Output = weighted sum of V
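The whole computation, sketched for a single head in plain Python (toy 2-dimensional Q/K/V vectors for two tokens):

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K, V: lists of vectors, one per token."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against this query, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output = weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two toy tokens: each query matches its own key most strongly.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
# out[0] leans toward the first value vector, out[1] toward the second.
```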

6️⃣ Why Multi-Head Attention?

One attention mechanism isn’t enough.

Different heads focus on different aspects:

  • Subject relationships
  • Time / tense
  • Cause–effect
  • Intent

Multi-head attention = multiple perspectives.

This makes Transformers extremely powerful.
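A self-contained sketch of the splitting idea: each vector is sliced into per-head pieces, attention runs independently on each slice, and the results are concatenated. (Real models also apply learned per-head projection matrices, omitted here for clarity.)

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(Q, K, V, n_heads):
    """Run attention on n_heads slices of each vector, then concatenate.
    Each head can learn to track a different kind of relationship."""
    d = len(Q[0])
    size = d // n_heads
    heads = [attention([q[h * size:(h + 1) * size] for q in Q],
                       [k[h * size:(h + 1) * size] for k in K],
                       [v[h * size:(h + 1) * size] for v in V])
             for h in range(n_heads)]
    # Concatenate head outputs token by token.
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]

Q = K = V = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(Q, K, V, n_heads=2)
print(len(out), len(out[0]))  # 2 tokens, each still 4-dim
```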


7️⃣ Masked Self-Attention (Critical for GPT Models)

GPT-style models cannot see future tokens.

Example:

Today the server ___

The token “Today” cannot attend to future tokens.

👉 This is enforced using masked self-attention
👉 The model only looks at past tokens

That’s why LLMs generate text step by step.
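The mask itself is simple: before softmax, scores for future positions are set to negative infinity, so they receive exactly zero weight. A sketch with made-up raw scores:

```python
import math

def causal_mask(scores):
    """Set scores for future positions to -inf so softmax assigns them
    zero weight: token i may only attend to tokens 0..i."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical raw attention scores for 3 tokens.
raw = [[0.5, 0.9, 0.1],
       [0.2, 0.7, 0.4],
       [0.3, 0.1, 0.8]]

weights = [softmax(row) for row in causal_mask(raw)]
print(weights[0])  # [1.0, 0.0, 0.0] -- the first token sees only itself
```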


8️⃣ Feed Forward Network: Pattern Builder

Attention finds relationships.
The feed forward network:

  • Learns abstractions
  • Builds patterns
  • Extracts deeper meaning

Structure:

Linear → Activation → Linear

But at massive scale.
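A toy version of that structure, assuming random weights and the GELU activation used by GPT-style models (real models expand the hidden layer roughly 4x, e.g. 4096 → 16384, and learn the weights during training):

```python
import math
import random

random.seed(0)
d_model, d_ff = 4, 16  # tiny sizes for illustration

W1 = [[random.uniform(-0.5, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_ff)]

def gelu(x):
    """GELU activation (tanh approximation), common in GPT-style models."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def feed_forward(x):
    """Linear -> activation -> linear, applied to one token's vector."""
    hidden = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)))
              for j in range(d_ff)]
    return [sum(hi * W2[i][j] for i, hi in enumerate(hidden))
            for j in range(d_model)]

out = feed_forward([1.0, -0.5, 0.3, 0.8])
print(len(out))  # 4: same dimension in and out
```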


9️⃣ Residual Connections & Layer Normalization

Deep Transformers can become unstable.

To fix this:

  • Residual connections preserve information
  • Layer normalization stabilizes training

Without these, modern LLMs wouldn’t work.
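A sketch of both mechanisms together, in the pre-norm arrangement used by GPT-style models (the learned scale/shift parameters of real layer norm are omitted):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_with_residual(x, sublayer):
    """Pre-norm residual pattern: x + sublayer(norm(x)).
    The input always has a direct path through, which keeps deep
    stacks of blocks trainable."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Identity sublayer just to show the wiring.
out = sublayer_with_residual([1.0, 2.0, 3.0, 4.0], lambda v: v)
print(len(out))  # 4
```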


🔟 What Happens After Multiple Transformer Blocks?

After passing through many blocks, tokens become:

  • More context-aware
  • More meaningful
  • More informed

At the final layer:

"server" understands the full sentence context

1️⃣1️⃣ How Is the Next Token Chosen?

Final output flows through:

Linear Layer
   ↓
Softmax

Softmax produces probabilities:

down → 55%
slow → 30%
offline → 10%

The selected token becomes the next output, and the process repeats 🔁
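The last step, sketched with hypothetical logits for the three candidates above. This uses greedy decoding (always pick the most likely token); real samplers also apply temperature, top-k, or top-p to add variety:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits from the final linear layer for three candidate tokens.
candidates = ["down", "slow", "offline"]
logits = [2.1, 1.5, 0.4]

probs = softmax(logits)
next_token = candidates[probs.index(max(probs))]  # greedy decoding
print(next_token)  # "down"
```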


1️⃣2️⃣ One-Line Summary

Transformers process all tokens together,
use attention to understand relationships,
and leverage that context to predict the next token.


1️⃣3️⃣ Why Were Transformers Necessary for LLMs?

Because Transformers provide:

✅ Long-range context understanding
✅ Parallel computation
✅ Strong attention-based modeling
✅ Massive scalability

No previous architecture offered all of these together.

