zeromathai

Posted on Jun 15 • Originally published at zeromathai.com

How Transformers Work — From Self-Attention to Modern LLM Architecture

#ai #machinelearning #llm #deeplearning

Transformers changed AI because they stopped reading sequences one token at a time.

Instead of moving step by step like an RNN, a Transformer compares tokens directly.

That one design shift made modern LLMs possible.

Core Idea

A Transformer is a neural network architecture built around attention.

It looks at a sequence of tokens and learns how those tokens relate to each other.

This matters because language is contextual.

A word is not understood alone.

It is understood through its relationship with surrounding words.

That is why Self-Attention became the core mechanism.

The Key Structure

A simplified Transformer flow looks like this:

Tokens → Embeddings → Positional Information → Self-Attention → Feed-Forward Network → Output

More compactly:

Transformer = token representations + attention + position + stacked blocks

The model first converts text into token vectors.

Then it injects position information.

Then each Transformer block updates the token representations using attention and feed-forward layers.

Implementation View

At a high level, a Transformer processes text like this:

split text into tokens

convert tokens into embeddings

add positional information

for each Transformer block:
    compute Self-Attention

    mix token information

    apply feed-forward transformation

    keep training stable with residual connections and normalization

produce contextual token representations

For decoder-based LLMs, generation continues like this:

predict next token

append generated token

reuse cached keys and values

repeat until stopping condition

This is why Transformers are practical for large-scale generation.

They can learn relationships across many tokens.

And with caching, they can generate efficiently.
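The generation loop above can be sketched in a few lines. This is a minimal illustration, not a real model: `generate`, `toy_model`, and the `BIGRAMS` table are hypothetical names invented here, and the "model" is only a lookup on the last token. Key and value caching is left out to keep the loop itself visible.

```python
def generate(next_token_fn, prompt, stop_token, max_new_tokens):
    """Greedy autoregressive loop: predict, append, repeat until a stop condition."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)  # predict the next token from everything so far
        tokens.append(token)           # append the generated token
        if token == stop_token:        # stopping condition
            break
    return tokens


# Hypothetical stand-in for a model: a lookup table keyed on the last token.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "<eos>"}


def toy_model(tokens):
    return BIGRAMS.get(tokens[-1], "<eos>")


result = generate(toy_model, ["the"], stop_token="<eos>", max_new_tokens=10)
print(result)
```

A real decoder would replace `toy_model` with a forward pass that returns logits, followed by a decoding rule that picks the token.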

Concrete Example

Take this sentence:

The animal did not cross the street because it was tired.

What does “it” refer to?

A model that reads strictly step by step has to carry “animal” through its hidden state until it reaches “it,” and that signal can fade as the gap grows.

Self-Attention lets the token “it” compare itself with other tokens like “animal” and “street.”

The model can assign stronger attention to the token that best explains the meaning.

That is the intuition.

Attention lets tokens ask:

Which other tokens matter for understanding me?

RNN vs Transformer

This comparison explains why Transformers became so important.

RNN:

processes tokens step by step
carries information through hidden state
naturally captures order
is harder to parallelize
can struggle with long-range dependencies

Transformer:

processes the tokens of an input sequence in parallel
compares tokens directly through attention
needs positional information for order
scales well on GPUs
handles long-range relationships more flexibly

So the Transformer was not just faster.

It changed how sequence relationships are represented.

RNNs remember through recurrence.

Transformers relate through attention.

Self-Attention

Self-Attention computes relationships between tokens in the same sequence.

Each token's representation is projected by learned weight matrices into three vectors:

Query
Key
Value

The intuition is simple:

Query = what this token is looking for

Key = what each token offers for matching

Value = information to retrieve if the match is strong

The core formula is:

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k))V

This means:

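compare queries and keys with dot products
divide the scores by sqrt(d_k), where d_k is the key dimension, to keep them in a stable range
turn the scores into weights with softmax
use those weights to combine values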

That is how each token becomes context-aware.
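Here is one way the formula could look in plain Python. It is a sketch for small inputs, assuming Q, K, and V are already given as lists of row vectors (the learned projections that produce them are omitted). The function names and the toy numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compare this query with every key, then scale by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # weights for this token sum to 1
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (made-up numbers): 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` says how much one token draws from every other token. Production code does the same thing with batched matrix multiplications on a GPU.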

Multi-Head Attention

One attention calculation is useful.

But one view is not enough.

Multi-Head Attention runs several attention heads in parallel.

Each head can focus on a different type of relationship.

One head may track syntax.

Another may track semantic similarity.

Another may track long-distance references.

Then the outputs are combined into one representation.

This makes attention richer than a single similarity calculation.
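The split, attend, and concatenate steps can be sketched as follows. This deliberately leaves out the learned projection matrices that a real layer applies before the split and after the concatenation, so it shows only the head mechanics. All function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def split_heads(X, num_heads):
    """Slice each vector into num_heads equal chunks; returns one list of rows per head."""
    head_dim = len(X[0]) // num_heads
    return [[x[h * head_dim:(h + 1) * head_dim] for x in X] for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    if len(Q[0]) % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    heads = zip(split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads))
    per_head = [single_head(q, k, v) for q, k, v in heads]  # each head attends independently
    # Concatenate the head outputs back together, token by token.
    return [sum((head[t] for head in per_head), []) for t in range(len(Q))]


# Toy input (made-up numbers): 3 tokens, vector size 4, used as Q, K, and V.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.5, 0.5]]
out = multi_head_attention(X, X, X, num_heads=2)
```

Because each head only sees its own slice of the vector, two heads can produce different attention patterns over the same tokens.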

Why Positional Encoding Is Needed

Self-Attention does not automatically know token order.

If you only give it a bag of token embeddings, the model needs another signal to know which token came first.

That is why positional information is added.

Common positional methods include:

Absolute Positional Embedding (APE)
Relative Positional Embedding (RPE)
Rotary Positional Embedding (RoPE)

APE gives each position its own vector.

RPE focuses on relative distance between tokens.

RoPE rotates query and key vectors by an angle that depends on position, so the positional part of their attention score depends only on how far apart the two tokens are.

This is why RoPE became common in modern LLMs.
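A small sketch can make the rotation idea concrete. It assumes the interleaved pairing (elements 0 and 1 form a pair, then 2 and 3, and so on) and a base of 10000; real implementations may pair dimensions differently. The `rope` and `dot` helpers and the vectors are made up for illustration.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles that grow with position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same offset of 3 between query and key, at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.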

Encoder, Decoder, and LLMs

The original Transformer used an Encoder-Decoder structure.

Encoder:

reads the input
builds contextual representations
works well for understanding tasks

Decoder:

generates output tokens
uses causal masking
works well for autoregressive generation

Encoder-Decoder:

connects input understanding with output generation
useful for translation-style tasks

Modern GPT-style LLMs are mostly decoder-only.

They generate text one token at a time.

The decoder predicts the next token, appends it, and repeats.
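Causal masking is what keeps a decoder from looking at tokens it has not generated yet. The sketch below applies the mask by only running softmax over the visible prefix of each row. Real implementations usually add negative infinity to the masked scores instead, which gives the same weights. `causal_softmax` is a hypothetical helper name.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # drop every future position
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Uniform raw scores for 4 tokens, so the mask is the only thing shaping the weights.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

The first token can only attend to itself, while the last token spreads its weight over the whole prefix.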

Decoding Strategies

Once the model produces logits, it needs to choose the next token.

Different decoding strategies create different behavior.

Greedy decoding:

chooses the most likely token
simple and deterministic
can be repetitive

Beam search:

keeps multiple candidate sequences
useful for structured generation
can still feel less diverse

Top-k sampling:

samples from the top k likely tokens
adds diversity

Top-p sampling:

samples from the smallest set of tokens whose cumulative probability reaches a threshold p
adapts the candidate set dynamically

So generation quality is not only about the model.

It also depends on decoding.
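The token-level strategies can be sketched over a toy distribution. The probabilities below are invented, and the helper names are hypothetical. Beam search is not shown because it tracks whole sequences rather than a single next-token choice.

```python
import random


def greedy(probs):
    """Pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_candidates(probs, k):
    """Keep the k most likely tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]


def top_p_candidates(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = [], 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept


def sample(probs, candidates, rng):
    """Sample one candidate in proportion to its probability."""
    weights = [probs[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "fish": 0.05}
rng = random.Random(0)

print(greedy(probs))
print(top_k_candidates(probs, 2))
print(top_p_candidates(probs, 0.9))
print(sample(probs, top_p_candidates(probs, 0.9), rng))
```

Note how top-k always keeps a fixed number of tokens, while top-p keeps more tokens when the distribution is flat and fewer when it is peaked.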

The Efficiency Problem

Full Attention is powerful but expensive.

If the sequence length is n, every token is compared with every other token, so the compute and memory for the score matrix grow roughly as O(n^2).

That means longer context becomes expensive quickly.

This is why efficient attention matters.

Local Attention reduces the view to nearby tokens.

Sparse Attention computes only selected attention links.

FlashAttention computes the same exact attention but works through it in tiles, so the full n × n score matrix never has to be written to slow GPU memory.

The key idea:

Do less unnecessary work, or move data more efficiently.

Both make longer context more practical.
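A quick count shows how fast the gap opens. The sketch assumes causal attention and a sliding window of 512 tokens; both choices are illustrative, and the function names are hypothetical. It only counts query-key pairs and does not measure real runtime.

```python
def full_causal_links(n):
    """Pairs scored when each token attends to itself and all earlier tokens."""
    return n * (n + 1) // 2


def local_causal_links(n, window):
    """Pairs scored when each token attends to itself and at most window - 1 earlier tokens."""
    return sum(min(i + 1, window) for i in range(n))


for n in (1_000, 10_000, 100_000):
    full = full_causal_links(n)
    local = local_causal_links(n, window=512)
    print(f"n={n}: full={full}, local={local}, ratio={full / local:.1f}")
```

Full attention grows quadratically with n, while the windowed count grows linearly once n is larger than the window.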

KV Cache

Autoregressive generation has another problem.

When generating one token at a time, the model repeatedly needs past key and value tensors.

KV Cache stores those tensors.

So the model does not recompute them from scratch at every step.

The flow looks like this:

Generated tokens → cached keys and values → new query attends to cache → next token

This makes inference faster.
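The cache itself can be sketched as an append-only store for one attention head. This is an illustration under simplifying assumptions: the per-token query, key, and value vectors are given directly instead of being projected from hidden states, and `KVCache` and `attend` are hypothetical names.

```python
import math


def attend(query, keys, values):
    """One query attending over every stored key and value."""
    d_k = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]


class KVCache:
    """Append-only store of key and value vectors for one head."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the new token's key and value are added; earlier ones are reused as stored.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Made-up (query, key, value) vectors for three decoding steps.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]
cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]
```

Each step does work proportional to the current length instead of rebuilding every key and value from the start.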

But it creates a memory problem.

Longer context means a larger KV Cache.

That is why modern LLMs use techniques like:

Multi-Query Attention
Grouped-Query Attention
Multi-Head Latent Attention

These methods reduce the memory cost of storing key-value information.
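Some rough arithmetic shows why sharing key-value heads helps. The model shape below is hypothetical and chosen only to give round numbers, with 2 bytes per stored value (16-bit precision). Multi-Head Latent Attention compresses the cache in a different way and is not modeled here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for one sequence: keys and values (factor 2) for every layer, KV head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # groups of 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

The cache scales linearly with the number of KV heads, so cutting 32 heads to 8 cuts the cache to a quarter under these assumptions.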

Modern Transformer Blocks

Modern LLMs still use the Transformer idea.

But the block has evolved.

A typical modern block looks like this:

Input
→ Pre-Normalization (RMSNorm or LayerNorm)
→ Self-Attention with GQA and RoPE
→ Residual Connection
→ Pre-Normalization (RMSNorm or LayerNorm)
→ Feed-Forward Network with SwiGLU or Mixture of Experts
→ Residual Connection

Important upgrades include:

RMSNorm for simpler normalization
RoPE for positional representation
GQA for efficient inference
SwiGLU for stronger feed-forward layers
MoE for sparse expert-based scaling

So today’s Transformer is not a direct copy of the 2017 original.

It is an evolved architecture family.
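The block layout can be sketched with RMSNorm, a SwiGLU feed-forward layer, and two residual connections. This is a minimal illustration, not a faithful model: attention is passed in as a function (`attention_fn`, a hypothetical placeholder), GQA, RoPE, and MoE are omitted, and all weights are made-up numbers.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(z):
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])


def pre_norm_block(X, attention_fn, p):
    # Sub-layer 1: normalize, mix tokens, add the residual.
    normed = [rms_norm(x, p["gain_attn"]) for x in X]
    X = [[a + b for a, b in zip(x, m)] for x, m in zip(X, attention_fn(normed))]
    # Sub-layer 2: normalize, apply the feed-forward layer per token, add the residual.
    out = []
    for x in X:
        h = swiglu_ffn(rms_norm(x, p["gain_ffn"]), p["W_gate"], p["W_up"], p["W_down"])
        out.append([a + b for a, b in zip(x, h)])
    return out


def mean_mix(rows):
    """Hypothetical stand-in for attention: every token receives the average vector."""
    avg = [sum(col) / len(rows) for col in zip(*rows)]
    return [list(avg) for _ in rows]


# Made-up weights: model size 2, hidden size 3.
params = {
    "gain_attn": [1.0, 1.0],
    "gain_ffn": [1.0, 1.0],
    "W_gate": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]],
    "W_up": [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]],
    "W_down": [[0.3, -0.5, 0.2], [0.1, 0.4, -0.6]],
}
X = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
Y = pre_norm_block(X, mean_mix, params)
```

Normalizing before each sub-layer leaves the residual path untouched, so the input can pass straight through the block when the sub-layers contribute nothing.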

Transformer vs Modern LLM Architecture

Original Transformer:

encoder-decoder structure
standard multi-head attention
sinusoidal positional encoding
layer normalization after each sub-layer (post-norm)
dense feed-forward layers

Modern LLM architecture:

often decoder-only
causal self-attention
RoPE
RMSNorm
GQA or related KV-sharing methods
SwiGLU
sometimes Mixture of Experts
KV Cache for inference

The core idea stayed the same.

The engineering changed dramatically.

Recommended Learning Order

If Transformer architecture feels too large, learn it in this order:

Attention Mechanism
Self-Attention
QKV Computation
Multi-Head Attention
Positional Encoding
Encoder-Decoder Architecture
Transformer Decoder
KV Cache
Efficient Attention
Modern Transformer Block

This order works because you first understand the relationship mechanism.

Then you understand generation.

Then you understand why modern LLMs needed efficiency upgrades.

Takeaway

The Transformer is the common architectural foundation of modern LLMs.

The shortest version is:

Transformer = attention + position + stacked blocks + efficient generation

Self-Attention computes token relationships.

Positional encoding injects order.

The decoder generates tokens.

KV Cache makes autoregressive inference practical.

Modern upgrades like RoPE, RMSNorm, GQA, SwiGLU, and MoE make the architecture scalable.

If you remember one idea, remember this:

Transformers work by turning a sequence into a set of contextual relationships, then refining those relationships through stacked attention-based blocks.

Discussion

When learning Transformers, do you find it easier to start from the attention formula, the decoder generation loop, or the modern LLM block structure?

Original article: https://zeromathai.com/en/transformer-architecture-overview-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
