zeromathai

Posted on • Originally published at zeromathai.com

How Transformers Work — From Self-Attention to Modern LLM Architecture

Transformers changed AI because they stopped reading sequences one token at a time.

Instead of moving step by step like an RNN, a Transformer compares tokens directly.

That one design shift made modern LLMs possible.

Core Idea

A Transformer is a neural network architecture built around attention.

It looks at a sequence of tokens and learns how those tokens relate to each other.

This matters because language is contextual.

A word is not understood alone.

It is understood through its relationship with surrounding words.

That is why Self-Attention became the core mechanism.

The Key Structure

A simplified Transformer flow looks like this:

Tokens → Embeddings → Positional Information → Self-Attention → Feed-Forward Network → Output

More compactly:

Transformer = token representations + attention + position + stacked blocks

The model first converts text into token vectors.

Then it injects position information.

Then each Transformer block updates the token representations using attention and feed-forward layers.

Implementation View

At a high level, a Transformer processes text like this:

    split text into tokens
    convert tokens into embeddings
    add positional information
    for each Transformer block:
        compute Self-Attention
        mix token information
        apply feed-forward transformation
        keep stable flow with residual connections and normalization
    produce contextual token representations

For decoder-based LLMs, generation continues like this:

    predict next token
    append generated token
    reuse cached keys and values
    repeat until stopping condition

This is why Transformers are practical for large-scale generation.

They can learn relationships across many tokens.

And with caching, they can generate efficiently.

Concrete Example

Take this sentence:

The animal did not cross the street because it was tired.

What does “it” refer to?

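A model that reads strictly step by step has to carry “animal” forward through every intermediate state before it reaches “it,” and that signal can fade as the distance grows.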

Self-Attention lets the token “it” compare itself with other tokens like “animal” and “street.”

The model can assign stronger attention to the token that best explains the meaning.

That is the intuition.

Attention lets tokens ask:

Which other tokens matter for understanding me?

RNN vs Transformer

This comparison explains why Transformers became so important.

RNN:

  • processes tokens step by step
  • carries information through hidden state
  • naturally captures order
  • is harder to parallelize
  • can struggle with long-range dependencies

Transformer:

  • processes tokens in parallel
  • compares tokens directly through attention
  • needs positional information for order
  • scales well on GPUs
  • handles long-range relationships more flexibly

So the Transformer was not just faster.

It changed how sequence relationships are represented.

RNNs remember through recurrence.

Transformers relate through attention.

Self-Attention

Self-Attention computes relationships between tokens in the same sequence.

Each token creates three vectors:

  • Query
  • Key
  • Value

The intuition is simple:

Query = what this token is looking for

Key = what each token offers for matching

Value = information to retrieve if the match is strong

The core formula is:

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k))V

This means:

  1. compare queries and keys
  2. turn scores into weights
  3. use those weights to combine values

That is how each token becomes context-aware.
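
To make the three steps concrete, here is a minimal sketch of the formula in plain Python. The Q, K, and V matrices are made-up toy numbers, and the function names are illustrative. A real model would produce Q, K, and V from learned projections and run this on a GPU with a tensor library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # 1. compare this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # 2. turn scores into weights that sum to 1
        w = softmax(scores)
        # 3. use those weights to combine the value vectors
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy inputs (assumed values): three tokens, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, w = attention(Q, K, V)
for row in w:
    print([round(x, 3) for x in row])
```

Each row of `w` is one token's attention distribution over the whole sequence. Dividing by sqrt(d_k) keeps the scores from growing with the vector size, which would otherwise push the softmax toward a near one-hot output.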

Multi-Head Attention

One attention calculation is useful.

But one view is not enough.

Multi-Head Attention runs several attention heads in parallel.

Each head can focus on a different type of relationship.

One head may track syntax.

Another may track semantic similarity.

Another may track long-distance references.

Then the outputs are combined into one representation.

This makes attention richer than a single similarity calculation.
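
A rough sketch of the split-and-combine step, in plain Python. It is simplified on purpose: real implementations apply learned projection matrices before splitting and another one after concatenating, and those are left out here. The helper names and the toy matrix are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, returning one output row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice every row into num_heads equal chunks: one smaller matrix per head."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    # Each head runs attention on its own slice of the dimensions.
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(Q, num_heads),
                           split_heads(K, num_heads),
                           split_heads(V, num_heads))
    ]
    # Concatenate the head outputs back into one vector per token.
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]

# Toy input (assumed values): three tokens, model dimension 4, two heads of size 2
X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 0.0]]

mha_out = multi_head_attention(X, X, X, num_heads=2)
```

Because each head sees a different slice, each one computes its own attention weights. That is what lets different heads specialize.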

Why Positional Encoding Is Needed

Self-Attention does not automatically know token order.

Given only the token embeddings, attention treats the input as an unordered set: shuffle the tokens and the outputs shuffle with them. The model needs another signal to know which token came first.

That is why positional information is added.
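
As one concrete way to add that signal, here is a sketch of the sinusoidal encoding used by the original Transformer, added to token embeddings. The embedding table and its values are made up for illustration. In a trained model the embeddings are learned.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed position vector: sin on even dimensions, cos on odd dimensions."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

# Hypothetical 4-dimensional embeddings for a three-word vocabulary
embedding = {
    "dog":   [0.2, -0.1, 0.4, 0.3],
    "bites": [0.0, 0.5, -0.3, 0.1],
    "man":   [-0.4, 0.2, 0.1, 0.6],
}

def encode(tokens):
    """Token embedding plus the position vector for its index in the sequence."""
    return [
        [e + p for e, p in zip(embedding[tok], sinusoidal_position(pos, 4))]
        for pos, tok in enumerate(tokens)
    ]

a = encode(["dog", "bites", "man"])
b = encode(["man", "bites", "dog"])
```

The two sentences contain the same tokens, but "dog" at position 0 and "dog" at position 2 now enter the first block as different vectors, so attention can tell the sentences apart.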

Common positional methods include:

  • Absolute Positional Embedding
  • Relative Positional Embedding
  • Rotary Positional Embedding

APE gives each position its own vector.

RPE focuses on relative distance between tokens.

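RoPE rotates query and key vectors by an angle that depends on position, so their dot product depends only on the relative offset between the two tokens. Relative position then works naturally inside the attention score itself.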

This is why RoPE became common in modern LLMs.
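
The relative-position property of RoPE can be checked with a small sketch. This version rotates consecutive pairs of dimensions. Some libraries pair dimensions differently, so treat the layout, the function names, and the toy vectors as assumptions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102: offset 3
score_c = dot(rope(q, 5), rope(k, 4))       # positions 5 and 4: offset 1

print(score_a, score_b, score_c)
```

`score_a` and `score_b` match up to floating-point error because the offset is the same, even though the absolute positions differ by 100. `score_c` differs because the offset changed.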

Encoder, Decoder, and LLMs

The original Transformer used an Encoder-Decoder structure.

Encoder:

  • reads the input
  • builds contextual representations
  • works well for understanding tasks

Decoder:

  • generates output tokens
  • uses causal masking
  • works well for autoregressive generation

Encoder-Decoder:

  • connects input understanding with output generation
  • useful for translation-style tasks
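
The causal masking used by the decoder can be sketched in a few lines. The score matrix below is made up. The point is only that positions to the right of the current token are set to negative infinity before the softmax, so they receive zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row i may attend to columns 0..i only; future positions get weight 0."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)                           # finite: column i is always kept
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up raw attention scores for a three-token sequence
scores = [[0.5, 2.0, 1.0],
          [1.0, 0.2, 3.0],
          [0.1, 0.4, 0.3]]

causal_w = causal_attention_weights(scores)
for row in causal_w:
    print([round(x, 3) for x in row])
```

The first token can only attend to itself, the second to the first two, and so on. This is what makes training on next-token prediction consistent with generating one token at a time.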

Modern GPT-style LLMs are mostly decoder-only.

They generate text one token at a time.

The decoder predicts the next token, appends it, and repeats.
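
That loop can be sketched with a stand-in for the model. `toy_logits` below is a hypothetical lookup table, not a real decoder: it only looks at the last token, where a real decoder attends to the whole prefix. The loop around it has the same shape either way.

```python
def toy_logits(tokens):
    """Hypothetical stand-in for a decoder forward pass (depends on the last token only)."""
    table = {
        "<bos>": {"the": 2.0, "cat": 0.5, "sat": 0.1, "<eos>": 0.0},
        "the":   {"the": 0.0, "cat": 2.5, "sat": 0.3, "<eos>": 0.1},
        "cat":   {"the": 0.1, "cat": 0.0, "sat": 2.2, "<eos>": 0.4},
        "sat":   {"the": 0.2, "cat": 0.1, "sat": 0.0, "<eos>": 3.0},
    }
    return table[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)                # predict scores for the next token
        next_token = max(logits, key=logits.get)   # pick the highest-scoring token
        if next_token == "<eos>":                  # stopping condition
            break
        tokens.append(next_token)                  # append and repeat
    return tokens

result = generate(["<bos>"])
print(result)  # ['<bos>', 'the', 'cat', 'sat']
```

The `max_new_tokens` limit is a second stopping condition, so the loop ends even if the end-of-sequence token never wins.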

Decoding Strategies

Once the model produces logits, it needs to choose the next token.

Different decoding strategies create different behavior.

Greedy decoding:

  • chooses the most likely token
  • simple and deterministic
  • can be repetitive

Beam search:

  • keeps multiple candidate sequences
  • useful for structured generation
  • can still produce less diverse text

Top-k sampling:

  • samples only from the k most likely tokens
  • adds diversity

Top-p sampling:

  • samples from the smallest set of tokens whose cumulative probability reaches a threshold p
  • adapts the candidate set dynamically

So generation quality is not only about the model.

It also depends on decoding.
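
Here is a small sketch of how greedy, top-k, and top-p differ once the probabilities are known. The distribution is made up, and the function names are illustrative. Beam search is left out because it tracks whole sequences rather than a single step.

```python
import random

def greedy(probs):
    """Index of the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_candidates(probs, k):
    """Indices of the k most likely tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p):
    """Smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def sample_from(probs, candidates, rng):
    """Sample one candidate; random.choices handles the renormalization."""
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Made-up next-token distribution over a four-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)

print(greedy(probs))                    # 0
print(top_k_candidates(probs, 2))       # [0, 1]
print(top_p_candidates(probs, 0.75))    # [0, 1]
print(top_p_candidates(probs, 0.9))     # [0, 1, 2]
token = sample_from(probs, top_p_candidates(probs, 0.9), rng)
```

Top-k always keeps the same number of candidates. Top-p keeps fewer when the model is confident and more when the distribution is flat.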

The Efficiency Problem

Full Attention is powerful but expensive.

If the sequence length is n, every token is scored against every other token, so compute and memory for the attention scores grow as roughly O(n^2).

That means longer context becomes expensive quickly.

This is why efficient attention matters.

Local Attention reduces the view to nearby tokens.

Sparse Attention computes only selected attention links.

FlashAttention keeps the exact formula but computes it in tiles, so the full n × n score matrix never has to be written to GPU main memory.

The key idea:

Do less unnecessary work, or move data more efficiently.

Both make longer context more practical.
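
A quick count shows how fast the quadratic term grows and how much a local window removes. The window size of 512 and the sequence lengths are arbitrary choices for illustration, and the local variant here is assumed to be a causal sliding window.

```python
def full_attention_pairs(n):
    """Every token scores every token: n * n query-key pairs."""
    return n * n

def sliding_window_pairs(n, window):
    """Causal local attention: token i scores itself and up to window - 1 earlier tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 8_000, 32_000):
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, window=512)
    print(f"n={n:>6}  full={full:>13,}  local(512)={local:>11,}")
```

Doubling n quadruples the full count, while the windowed count grows roughly linearly once n is larger than the window.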

KV Cache

Autoregressive generation has another problem.

When generating one token at a time, the model repeatedly needs past key and value tensors.

KV Cache stores those tensors.

So the model does not recompute them from scratch at every step.

The flow looks like this:

Previous tokens (prompt and generated) → cached keys and values → new query attends to cache → next token

This makes inference faster.
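
A toy sketch can show that caching changes the cost, not the result. `project` is a hypothetical stand-in for the learned Q/K/V projections, and the token vectors are made up. With the cache, each step projects only the newest token. Without it, each step re-projects the whole prefix.

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over all stored positions."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

def project(x):
    """Hypothetical stand-in for the learned query, key, and value projections."""
    q = [x[0] + x[1], x[0] - x[1]]
    k = [x[1], x[0]]
    v = [2.0 * x[0], 0.5 * x[1]]
    return q, k, v

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # made-up token vectors

# With a cache: one projection per step, keys and values are appended and reused.
cache_keys, cache_values = [], []
cached_outputs = []
for x in tokens:
    q, k, v = project(x)
    cache_keys.append(k)
    cache_values.append(v)
    cached_outputs.append(attend(q, cache_keys, cache_values))

# Without a cache: every step recomputes keys and values for the whole prefix.
recomputed_outputs = []
for t in range(len(tokens)):
    qkv = [project(x) for x in tokens[: t + 1]]
    q = qkv[-1][0]
    recomputed_outputs.append(
        attend(q, [k for _, k, _ in qkv], [v for _, _, v in qkv])
    )

print(cached_outputs == recomputed_outputs)  # True
```

The uncached loop performs 1 + 2 + 3 projections for three tokens, while the cached loop performs 3. That gap widens with every generated token.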

But it creates a memory problem.

Longer context means a larger KV Cache.

That is why modern LLMs use techniques like:

  • Multi-Query Attention
  • Grouped-Query Attention
  • Multi-Head Latent Attention

These methods reduce the memory cost of storing key-value information.
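
The saving is easy to estimate with back-of-the-envelope arithmetic. The configuration below (32 layers, 32 query heads, head size 128, 8192 tokens, 2 bytes per value) is a hypothetical example, not any specific model. Multi-Head Latent Attention is not modeled here because it compresses keys and values instead of sharing heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV cache size. The factor 2 covers keys plus values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model configuration
layers, query_heads, head_dim, seq_len = 32, 32, 128, 8192

mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)            # 8 KV heads shared by groups
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)            # 1 KV head shared by all

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 1024**2:,.0f} MiB")
```

Under these assumptions, standard multi-head attention needs 4 GiB per sequence, grouped-query attention with 8 KV heads needs 1 GiB, and multi-query attention needs 128 MiB.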

Modern Transformer Blocks

Modern LLMs still use the Transformer idea.

But the block has evolved.

A typical modern block looks like this:

Input
→ Pre-normalization (RMSNorm)
→ Causal Self-Attention with GQA and RoPE
→ Residual Connection
→ Pre-normalization (RMSNorm)
→ Feed-Forward Network with SwiGLU or Mixture of Experts
→ Residual Connection

Important upgrades include:

  • RMSNorm for simpler normalization
  • RoPE for positional representation
  • GQA for efficient inference
  • SwiGLU for stronger feed-forward layers
  • MoE for sparse expert-based scaling

So today’s Transformer is not a direct copy of the 2017 original.

It is an evolved architecture family.
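
Two of these upgrades are small enough to sketch directly: RMSNorm and a SwiGLU feed-forward layer, wrapped in the normalize-then-add-residual pattern. The weights are made-up numbers, the function names are illustrative, and the attention half of the block is omitted to keep the example short.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

def pre_norm_residual(x, gain, sublayer):
    """Normalize first, apply the sublayer, then add the untouched input back."""
    y = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, y)]

# Made-up weights: model dimension 2, hidden dimension 3
gain = [1.0, 1.0]
W_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W_up = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
W_down = [[0.3, -0.5, 0.2], [0.1, 0.4, -0.6]]

x = [3.0, 4.0]
h = pre_norm_residual(x, gain, lambda v: swiglu_ffn(v, W_gate, W_up, W_down))
print(h)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which is why it is described as the simpler normalization. The residual path carries `x` through unchanged, so each sublayer only has to learn an update.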

Transformer vs Modern LLM Architecture

Original Transformer:

  • encoder-decoder structure
  • standard multi-head attention
  • sinusoidal positional encoding
  • layer normalization
  • dense feed-forward layers

Modern LLM architecture:

  • often decoder-only
  • causal self-attention
  • RoPE
  • RMSNorm
  • GQA or related KV-sharing methods
  • SwiGLU
  • sometimes Mixture of Experts
  • KV Cache for inference

The core idea stayed the same.

The engineering changed dramatically.

Recommended Learning Order

If Transformer architecture feels too large, learn it in this order:

  1. Attention Mechanism
  2. Self-Attention
  3. QKV Computation
  4. Multi-Head Attention
  5. Positional Encoding
  6. Encoder-Decoder Architecture
  7. Transformer Decoder
  8. KV Cache
  9. Efficient Attention
  10. Modern Transformer Block

This order works because you first understand the relationship mechanism.

Then you understand generation.

Then you understand why modern LLMs needed efficiency upgrades.

Takeaway

The Transformer is the shared architectural vocabulary of modern LLMs.

The shortest version is:

Transformer = attention + position + stacked blocks + efficient generation

Self-Attention computes token relationships.

Positional encoding injects order.

The decoder generates tokens.

KV Cache makes autoregressive inference practical.

Modern upgrades like RoPE, RMSNorm, GQA, SwiGLU, and MoE make the architecture scalable.

If you remember one idea, remember this:

Transformers work by turning a sequence into a set of contextual relationships, then refining those relationships through stacked attention-based blocks.

Discussion

When learning Transformers, do you find it easier to start from the attention formula, the decoder generation loop, or the modern LLM block structure?

Original article: https://zeromathai.com/en/transformer-architecture-overview-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai