DEV Community

zeromathai

Posted on • Originally published at zeromathai.com

How Modern Transformer Blocks Work — From RMSNorm to MoE

The original Transformer idea is still alive.

But modern LLM blocks are not just the 2017 Transformer copied and scaled.

They are engineered for deeper training, longer context, cheaper inference, and larger capacity.

That is why components like RMSNorm, GQA, RoPE, SwiGLU, and MoE matter.

Core Idea

A modern Transformer block still follows the same basic pattern:

Attention mixes information across tokens.

The Feed-Forward Network transforms each token representation.

Residual connections keep information flowing.

But modern LLMs changed the details.

Those details are not cosmetic.

They make large-scale training and inference practical.

The Key Structure

A typical modern Transformer block looks like this:

Input

→ Normalization before the sublayer (Pre-LN), typically RMSNorm

→ Self-Attention with GQA and RoPE

→ Residual Connection

→ Normalization before the sublayer (Pre-LN), typically RMSNorm

→ Feed-Forward Network with SwiGLU or MoE

→ Residual Connection

More compactly:

Modern Transformer Block = stable normalization + efficient attention + stronger FFN + residual flow

Each component solves a real scaling problem.

Pre-LN improves deep training stability.

GQA reduces KV Cache memory.

RoPE injects position into attention.

SwiGLU improves FFN expressiveness.

MoE increases capacity without activating all parameters.

Pseudo-code View

A simplified modern block looks like this:

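def transformer_block(x):
    # Attention sublayer: normalize first (Pre-LN), then add the result back.
    h = rms_norm(x)
    attn = grouped_query_attention(
        q=apply_rope(query(h)),  # RoPE rotates queries and keys only
        k=apply_rope(key(h)),
        v=value(h),              # values get no positional rotation
    )
    x = x + attn  # residual connection

    # Feed-forward sublayer: same normalize, transform, add-back pattern.
    h = rms_norm(x)
    ffn = swiglu_ffn(h)
    x = x + ffn  # residual connection
    return x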

With MoE, the FFN part can become:

h = rms_norm(x)

selected_experts = router(h)

ffn = run_top_k_experts(h, selected_experts)

x = x + ffn

The pattern stays simple.

Normalize.

Transform.

Add back.

Repeat.

Concrete Example

Imagine the model processes this token:

"bank"

The attention block helps decide whether “bank” means:

a financial institution

or the side of a river

RoPE helps the model understand token order and distance.

GQA helps attention run with a smaller KV Cache.

The FFN then transforms the contextual representation.

If the model uses MoE, the router sends this token to a small number of experts. It is tempting to picture experts for finance, geography, or general language, but learned routing is rarely that cleanly interpretable.

That is the intuition.

Modern Transformer blocks are not just bigger.

They are more selective, stable, and hardware-aware.

Pre-LN vs Post-LN

The original Transformer used Post-LN.

Post-LN:

x = LayerNorm(x + Sublayer(x))

Modern LLMs often use Pre-LN.

Pre-LN:

x = x + Sublayer(LayerNorm(x))

The difference looks small.

But it matters.

Pre-LN normalizes before the sublayer.

That helps gradients flow through deep Transformer stacks.

When a model has dozens or hundreds of layers, this becomes critical.

Pre-LN is not just a formatting choice.

It is a training stability choice.
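
The two orderings can be compared side by side. This is a minimal sketch, not a real Transformer layer: `toy_sublayer` is a hypothetical stand-in for attention or an FFN, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Recenter and rescale a vector (no learned gain or bias in this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an FFN: halves the input."""
    return [0.5 * v for v in x]

def post_ln_step(x):
    # Post-LN: add the residual first, then normalize the sum.
    summed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return layer_norm(summed)

def pre_ln_step(x):
    # Pre-LN: normalize only the sublayer input; the residual path is untouched.
    update = toy_sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

x = [10.0, 20.0, 30.0]
post_out = post_ln_step(x)  # re-normalized: the original scale of x is gone
pre_out = pre_ln_step(x)    # x plus a small update: still close to x
```

In the Pre-LN version, `x` passes straight through the addition. That untouched identity path is the usual explanation for why gradients flow more easily through deep stacks.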

RMSNorm

RMSNorm is a simpler normalization method.

LayerNorm recenters and rescales.

RMSNorm only rescales, dividing by the root mean square of the activations.

The RMS is:

RMS(h) = sqrt((1 / n) * Σ hᵢ²)

Then the normalized vector is:

h_norm = h / (RMS(h) + ε) * g

Here g is a learned per-dimension gain and ε is a small constant that prevents division by zero.

Why use it?

It keeps activation scale stable.

It skips the mean subtraction, so it does slightly less computation than LayerNorm.

It works well in large LLMs.

Example:

h = [3, 4]

RMS(h) = sqrt((9 + 16) / 2) ≈ 3.54

Normalized h ≈ [0.85, 1.13] (taking g = 1 and ignoring ε)

The key idea:

RMSNorm stabilizes scale without doing more than necessary.
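
The formula above can be written in a few lines of plain Python. This is a sketch under the article's formula, with ε added to the RMS. Many real implementations instead place ε inside the square root, so treat the exact placement as an assumption. The `gain` and `eps` parameter names are illustrative.

```python
import math

def rms_norm(h, gain=None, eps=1e-6):
    """Rescale h by its root mean square, then apply a per-dimension gain."""
    if gain is None:
        gain = [1.0] * len(h)  # g = 1 reproduces the worked example
    rms = math.sqrt(sum(v * v for v in h) / len(h))
    return [v / (rms + eps) * g for v, g in zip(h, gain)]

normalized = rms_norm([3.0, 4.0])  # approximately [0.85, 1.13]
```

Note that no mean is subtracted anywhere. That is the whole difference from LayerNorm in this sketch.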

Attention Block: GQA + RoPE

Modern attention is often not plain Multi-Head Attention.

It usually combines memory-aware attention with positional encoding.

Grouped-Query Attention reduces KV Cache size.

Rotary Positional Embedding injects position into Query and Key.

The attention flow becomes:

Input

→ Q, K, V projection

→ Apply RoPE to Q and K

→ Share K/V by groups using GQA

→ Compute attention

→ Output projection

This matters for inference.

Long-context generation is often limited by KV Cache memory.

GQA reduces that pressure.
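
A rough back-of-the-envelope sketch shows where the saving comes from. The layer count, head counts, head size, and sequence length below are hypothetical numbers chosen for illustration, and the estimate covers a single sequence stored in 16-bit values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV Cache size for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the key-value head its group shares."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 8192 tokens.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

shared_head = kv_head_for_query_head(5, num_q_heads=32, num_kv_heads=8)
```

With 8 key-value heads instead of 32, the cache is 4 times smaller in this setup. The number of query heads does not appear in the cache formula at all.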

RoPE keeps position information inside attention without adding a large position table.
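
The rotation idea can be sketched without any tensor library. This minimal version rotates adjacent pairs of dimensions. Other implementations pair dimensions differently, so the pairing and the `base` value of 10000 are assumptions here.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle.

    Assumes len(vec) is even.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (3 positions apart), very different absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))
```

The two scores match up to floating-point error. The attention score depends on the relative distance between the tokens, not on where they sit in the sequence.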

SwiGLU

The Feed-Forward Network is not just a simple MLP anymore.

Many modern LLMs use SwiGLU.

SwiGLU is a gated activation.

One path carries information.

Another path controls how much passes through.

A simplified formula:

SwiGLU(x) = (W₁x) * Swish(W₂x)

Here Swish(z) = z · sigmoid(z), and * is elementwise multiplication.

Example:

W₁x = 4

Swish(W₂x) = 0.5

Output = 2

The gate decides how much information moves forward.

That gives the FFN more control than a plain activation.
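
The simplified formula can be sketched directly. The weights below are made-up toy values. A full FFN would normally apply one more projection after this product, which is left out here.

```python
import math

def swish(z):
    """Swish (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu(x, W1, W2):
    """Elementwise product of a value path (W1 x) and a gate path Swish(W2 x)."""
    value = matvec(W1, x)
    gate = [swish(z) for z in matvec(W2, x)]
    return [v * g for v, g in zip(value, gate)]

x = [1.0, 2.0]
W1 = [[2.0, 1.0], [0.0, 1.0]]  # value path: W1 x = [4.0, 2.0]
W2 = [[0.0, 0.0], [1.0, 1.0]]  # gate pre-activations: W2 x = [0.0, 3.0]
out = swiglu(x, W1, W2)
```

The first output is fully closed off because its gate is Swish(0) = 0, even though the value path carries 4.0. The second gate is wide open, so most of that value passes through.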

Mixture of Experts

Mixture of Experts increases model capacity without activating every parameter for every token.

Instead of one FFN, the model has multiple expert networks.

A router chooses which experts handle each token.

Example router output:

Expert 1 = 0.45

Expert 2 = 0.19

Expert 3 = 0.05

Expert 4 = 0.31

With Top-2 routing:

Expert 1 and Expert 4 are selected.

Only those experts run.

This is why MoE is called sparse.

The model may have many parameters.

But each token uses only a small subset.
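
The Top-2 selection above can be sketched as follows. Renormalizing the selected scores so they sum to 1 is one common choice, assumed here for illustration. Indices are zero-based, so Expert 1 is index 0 and Expert 4 is index 3.

```python
def top_k_route(router_probs, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_probs)), key=lambda i: router_probs[i], reverse=True)
    chosen = ranked[:k]
    total = sum(router_probs[i] for i in chosen)
    return {i: router_probs[i] / total for i in chosen}

# Router output from the example: Experts 1 to 4.
probs = [0.45, 0.19, 0.05, 0.31]
weights = top_k_route(probs, k=2)  # keys 0 and 3, weights roughly 0.59 and 0.41
```

The token's FFN output would then be the weighted sum of the two selected experts' outputs. The other two experts are never evaluated for this token.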

Dense FFN vs MoE

Dense FFN:

  • every token uses the same FFN
  • all FFN parameters are active
  • simpler to train and serve
  • compute grows directly with FFN size

MoE:

  • each token is routed to selected experts
  • only part of the model activates
  • increases total capacity efficiently
  • adds routing and load-balancing complexity

The key difference:

Dense FFN = same compute path for every token

MoE = conditional compute path per token

MoE is powerful.

But it is not free.

It introduces routing instability, expert imbalance, and distributed communication overhead.
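
The gap between total and active parameters is simple arithmetic. This sketch assumes each expert is a SwiGLU-style FFN with three weight matrices and no biases. All sizes are hypothetical.

```python
def ffn_params(d_model, d_hidden):
    """Parameters in one FFN with three projection matrices (biases ignored)."""
    return 3 * d_model * d_hidden

def moe_ffn_params(d_model, d_hidden, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts  # one linear layer producing router logits
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

# Hypothetical layer: 8 experts, 2 active per token.
total, active = moe_ffn_params(d_model=4096, d_hidden=14336, num_experts=8, top_k=2)
active_fraction = active / total  # about 0.25
```

All 8 experts still have to be stored and served, which is where the memory and communication costs come from. Only the per-token compute shrinks.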

Multi-Token Prediction

Standard language modeling predicts one next token.

At position t:

predict token t + 1

Multi-Token Prediction trains the model to predict multiple future tokens.

At position t:

predict token t + 1, t + 2, t + 3 ...

This gives more learning signals from the same representation.

Standard training:

one position → one supervision signal

MTP training:

one position → multiple supervision signals

This can improve sample efficiency.

In some systems, the extra predictions can also be used as draft tokens for speculative decoding, which can speed up generation.
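
The difference in supervision can be made concrete by building the targets for a short sequence. This is a data-preparation sketch only. How a model produces and weights the extra predictions varies by system and is not shown.

```python
def mtp_targets(tokens, num_future=3):
    """For each position t, collect up to num_future tokens that follow it."""
    return [tokens[t + 1 : t + 1 + num_future] for t in range(len(tokens))]

tokens = ["the", "river", "bank", "was", "muddy"]

standard = mtp_targets(tokens, num_future=1)  # next-token prediction
multi = mtp_targets(tokens, num_future=3)     # multi-token prediction

standard_signals = sum(len(t) for t in standard)
multi_signals = sum(len(t) for t in multi)
```

Position 0 gets one target in the standard setup and three with MTP. Positions near the end of the sequence get fewer targets because there are fewer tokens left to predict.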

Naive vs Modern View

Naive view:

Transformer block = attention + FFN

Modern view:

Transformer block = stable normalization + efficient attention + gated FFN + sparse scaling

Naive block:

attention
ffn

Modern block:

rmsnorm
rope
gqa
residual
rmsnorm
swiglu or moe
residual

This matters because modern LLM performance is not just about parameter count.

It is about architecture details that make those parameters trainable and deployable.

Implementation Perspective

When reading modern LLM code, look for these patterns:

self.input_layernorm = RMSNorm(...)
self.self_attn = Attention(..., rope=True, num_key_value_heads=...)
self.post_attention_layernorm = RMSNorm(...)
self.mlp = SwiGLU(...)  # or MoE(...) in sparse models

The key clue for GQA is:

number of query heads > number of key-value heads

The key clue for RoPE is:

position is applied to Q and K before attention

The key clue for MoE is:

router logits decide which experts run

These details tell you what kind of Transformer block you are actually looking at.
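
The clues above can be turned into a small checker over a model's hyperparameters. The dictionary keys are assumptions modeled on commonly seen config field names. Real codebases name these fields differently, so adapt them before use.

```python
def describe_block(config):
    """Guess the attention and FFN style from a plain dict of hyperparameters."""
    q_heads = config["num_attention_heads"]
    # If no separate key-value head count is given, assume plain multi-head attention.
    kv_heads = config.get("num_key_value_heads", q_heads)

    if kv_heads == q_heads:
        attention = "MHA"  # every query head has its own K/V head
    elif kv_heads == 1:
        attention = "MQA"  # multi-query: all query heads share one K/V head
    else:
        attention = "GQA"  # more query heads than key-value heads

    num_experts = config.get("num_local_experts", 0)  # hypothetical key name
    ffn = "MoE" if num_experts > 1 else "dense"
    return {"attention": attention, "ffn": ffn}

dense_gqa = describe_block({"num_attention_heads": 32, "num_key_value_heads": 8})
sparse_gqa = describe_block(
    {"num_attention_heads": 32, "num_key_value_heads": 8, "num_local_experts": 8}
)
```

RoPE is harder to detect from hyperparameters alone. The clearest clue is still in the attention code, where a rotation is applied to Q and K before the scores are computed.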

Important Conditions and Limits

Pre-LN improves stability, but the whole optimization setup still matters.

RMSNorm is efficient, but it does not replace good initialization or training design.

GQA reduces KV Cache memory, but sharing key-value heads across query heads can cost some quality compared with full Multi-Head Attention.

RoPE works well for long contexts, but very long extrapolation may still need scaling techniques.

SwiGLU improves FFN behavior, but it adds a third weight matrix, so the hidden size is usually reduced to keep the parameter count comparable.

MoE increases capacity, but adds routing and system complexity.

Modern Transformer design is a trade-off system.

Every upgrade solves one bottleneck and introduces another design choice.

Why This Matters Again

Modern LLMs are not just large neural networks.

They are carefully engineered stacks.

If you understand the block, you can better understand:

  • why inference needs KV Cache optimization
  • why RoPE appears in attention code
  • why RMSNorm often replaces LayerNorm
  • why GQA changes memory usage
  • why MoE models can be huge but still sparse

This is the difference between using LLMs and understanding how they scale.

Takeaway

Modern Transformer blocks preserve the original Transformer idea.

But they upgrade almost every practical detail.

The shortest version:

Modern Transformer Block = Pre-LN/RMSNorm + GQA/RoPE Attention + SwiGLU/MoE FFN + Residual Connections

If Self-Attention is the core idea, the modern block is the production-grade version of that idea.

It is built for depth, context length, inference memory, and scalable capacity.

Discussion

When reading modern LLM architecture, which component feels most important to understand first?

RMSNorm, RoPE, GQA, SwiGLU, or MoE?

Original article: https://zeromathai.com/en/modern-transformer-blocks-llm-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
