The original Transformer idea is still alive.
But modern LLM blocks are not just the 2017 Transformer copied and scaled.
They are engineered for deeper training, longer context, cheaper inference, and larger capacity.
That is why components like RMSNorm, GQA, RoPE, SwiGLU, and MoE matter.
Core Idea
A modern Transformer block still follows the same basic pattern:
Attention updates token relationships.
The Feed-Forward Network transforms each token representation.
Residual connections keep information flowing.
But modern LLMs changed the details.
Those details are not cosmetic.
They make large-scale training and inference practical.
The Key Structure
A typical modern Transformer block looks like this:
Input
→ RMSNorm or LayerNorm, applied before the sublayer (Pre-LN)
→ Self-Attention with GQA and RoPE
→ Residual Connection
→ RMSNorm or LayerNorm, applied before the sublayer (Pre-LN)
→ Feed-Forward Network with SwiGLU or MoE
→ Residual Connection
More compactly:
Modern Transformer Block = stable normalization + efficient attention + stronger FFN + residual flow
Each component solves a real scaling problem.
Pre-LN improves deep training stability.
GQA reduces KV Cache memory.
RoPE injects position into attention.
SwiGLU improves FFN expressiveness.
MoE increases capacity without activating all parameters.
Pseudo-code View
A simplified modern block looks like this:
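def transformer_block(x):
    # Attention sublayer: normalize, attend, add back.
    h = rms_norm(x)
    attn = grouped_query_attention(
        q=apply_rope(query(h)),
        k=apply_rope(key(h)),
        v=value(h),
    )
    x = x + attn
    # Feed-forward sublayer: normalize, transform, add back.
    h = rms_norm(x)
    ffn = swiglu_ffn(h)
    x = x + ffn
    return x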
With MoE, the FFN part can become:
h = rms_norm(x)
selected_experts = router(h)
ffn = run_top_k_experts(h, selected_experts)
x = x + ffn
The pattern stays simple.
Normalize.
Transform.
Add back.
Repeat.
Concrete Example
Imagine the model processes this token:
"bank"
The attention block helps decide whether "bank" means a financial institution or the side of a river.
RoPE helps the model understand token order and distance.
GQA helps attention run with a smaller KV Cache.
The FFN then transforms the contextual representation.
If the model uses MoE, the router may send this token to experts specialized for finance, geography, or general language.
That is the intuition.
Modern Transformer blocks are not just bigger.
They are more selective, stable, and hardware-aware.
Pre-LN vs Post-LN
The original Transformer used Post-LN.
Post-LN:
x = LayerNorm(x + Sublayer(x))
Modern LLMs often use Pre-LN.
Pre-LN:
x = x + Sublayer(LayerNorm(x))
The difference looks small.
But it matters.
Pre-LN normalizes before the sublayer.
That helps gradients flow through deep Transformer stacks.
When a model has dozens or hundreds of layers, this becomes critical.
Pre-LN is not just a formatting choice.
It is a training stability choice.
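One way to see the difference is to plug in a sublayer that contributes nothing. The sketch below is a toy under stated assumptions: `layer_norm` has no learned scale or bias, vectors are plain Python lists, and `zero_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm without learned scale/bias: recenter, then rescale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_step(x, sublayer):
    # Post-LN: normalize after the residual add, so x itself gets rescaled.
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize only the sublayer input; x passes through untouched.
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def zero_sublayer(h):
    # Hypothetical sublayer that adds nothing, to expose the residual path.
    return [0.0 for _ in h]

x = [2.0, 4.0, 6.0]
pre_out = pre_ln_step(x, zero_sublayer)    # the residual path returns x unchanged
post_out = post_ln_step(x, zero_sublayer)  # x has been recentered and rescaled
```

With Pre-LN the input survives the step exactly, so a deep stack always has an unmodified path from the first layer to the last. With Post-LN every layer rewrites the residual stream, even when the sublayer does nothing.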
RMSNorm
RMSNorm is a simpler normalization method.
LayerNorm recenters and rescales.
RMSNorm mainly rescales using the root mean square.
The RMS is:
RMS(h) = sqrt((1 / n) * Σ hᵢ²)
Then the normalized vector is:
h_norm = h / (RMS(h) + ε) * g
Here g is a learned per-dimension gain and ε is a small constant for numerical stability.
Why use it?
It keeps activation scale stable.
It skips LayerNorm's mean subtraction, which removes some computation.
It works well in large LLMs.
Example:
h = [3, 4]
RMS(h) = sqrt((9 + 16) / 2) ≈ 3.54
Normalized h ≈ [0.85, 1.13]
The key idea:
RMSNorm stabilizes scale without doing more than necessary.
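The worked example above can be reproduced in a few lines. This is a minimal sketch that follows the formula exactly as written in this section; the function name `rms_norm`, the unit gain, and the `eps` value are assumptions made for illustration.

```python
import math

def rms_norm(h, gain=None, eps=1e-6):
    """RMSNorm as written above: h / (RMS(h) + eps) * g."""
    if gain is None:
        gain = [1.0] * len(h)  # assumption: unit gain g, as at initialization
    rms = math.sqrt(sum(v * v for v in h) / len(h))
    return [v / (rms + eps) * g for v, g in zip(h, gain)]

normalized = rms_norm([3.0, 4.0])
rounded = [round(v, 2) for v in normalized]  # [0.85, 1.13]
```

Many real implementations place ε inside the square root instead, as in `sqrt(mean(h²) + ε)`. For typical activations the two variants give nearly identical results.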
Attention Block: GQA + RoPE
Modern attention is often not plain Multi-Head Attention.
It usually combines memory-aware attention with positional encoding.
Grouped-Query Attention reduces KV Cache size.
Rotary Positional Embedding injects position into Query and Key.
The attention flow becomes:
Input
→ Q, K, V projection
→ Apply RoPE to Q and K
→ Share K/V by groups using GQA
→ Compute attention
→ Output projection
This matters for inference.
Long-context generation is often limited by KV Cache memory.
GQA reduces that pressure.
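A back-of-the-envelope calculation shows how much. The model shape below is hypothetical (32 layers, 32 query heads, head dimension 128, 16-bit cache entries), and `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer and position."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Multi-Head Attention: one K/V head per query head (32).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped-Query Attention: 8 K/V heads shared by the same 32 query heads.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

mha_gib = mha / 2**30  # 4.0 GiB per sequence
gqa_gib = gqa / 2**30  # 1.0 GiB per sequence
```

The cache shrinks in direct proportion to the number of key-value heads, while the number of query heads stays the same.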
RoPE keeps position information inside attention without adding a large position table.
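The rotation itself is small enough to sketch directly. This version assumes the interleaved-pair convention and a base of 10000; some codebases pair the first and second halves of the vector instead. The names `apply_rope` and `dot` and the example vectors are illustrative.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * theta."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # one frequency per pair
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * cos_a - x1 * sin_a, x0 * sin_a + x1 * cos_a])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (3 positions apart) at two different absolute locations.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))
```

The two scores agree up to floating-point error. The attention logit depends on how far apart the tokens are, not on where they sit in the sequence, and no position table is stored.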
SwiGLU
The Feed-Forward Network is not just a simple MLP anymore.
Many modern LLMs use SwiGLU.
SwiGLU is a gated activation.
One path carries information.
Another path controls how much passes through.
A simplified formula:
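SwiGLU(x) = (W₁x) * Swish(W₂x)
Here * is elementwise multiplication and Swish(z) = z · sigmoid(z).
In a full FFN, a third projection then maps the gated result back to the model dimension.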
Example:
W₁x = 4
Swish(W₂x) = 0.5
Output = 2
The gate decides how much information moves forward.
That gives the FFN more control than a plain activation.
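The gating step can be sketched elementwise. The vectors `w1x` and `w2x` are hypothetical stand-ins for the two projected inputs, and the down projection of a full FFN is left out.

```python
import math

def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(value_path, gate_path):
    """Elementwise gate: value * Swish(gate), matching (W1 x) * Swish(W2 x)."""
    return [v * swish(g) for v, g in zip(value_path, gate_path)]

# Stand-ins for the projected vectors W1 x and W2 x (hypothetical values).
w1x = [4.0, 4.0, 4.0]
w2x = [-6.0, 0.0, 6.0]
out = swiglu(w1x, w2x)
```

The same value of 4 comes out very differently depending on the gate. A strongly negative gate input nearly closes the path, a zero gate input blocks it entirely, and a large positive gate input lets it through amplified.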
Mixture of Experts
Mixture of Experts increases model capacity without activating every parameter for every token.
Instead of one FFN, the model has multiple expert networks.
A router chooses which experts handle each token.
Example router output:
Expert 1 = 0.45
Expert 2 = 0.19
Expert 3 = 0.05
Expert 4 = 0.31
With Top-2 routing:
Expert 1 and Expert 4 are selected.
Only those experts run.
This is why MoE is called sparse.
The model may have many parameters.
But each token uses only a small subset.
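The Top-2 example above can be written as a short routing function. This is a sketch, not any specific model's router: renormalizing the selected weights so they sum to 1 is a common choice but an assumption here, and `top_k_route` and `moe_output` are illustrative names.

```python
def top_k_route(router_probs, k=2):
    """Return (expert_index, renormalized_weight) for the k largest probabilities."""
    ranked = sorted(range(len(router_probs)),
                    key=lambda i: router_probs[i], reverse=True)
    chosen = ranked[:k]
    total = sum(router_probs[i] for i in chosen)
    # Assumption: weights of the chosen experts are renormalized to sum to 1.
    return [(i, router_probs[i] / total) for i in chosen]

def moe_output(x, experts, router_probs, k=2):
    """Weighted sum of only the selected experts' outputs."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_probs, k):
        y = experts[idx](x)  # unselected experts are never called
        out = [o + weight * v for o, v in zip(out, y)]
    return out

probs = [0.45, 0.19, 0.05, 0.31]       # Expert 1..4 from the example (0-indexed here)
routes = top_k_route(probs)
selected = [i + 1 for i, _ in routes]  # back to 1-based numbering: [1, 4]
```

Experts 2 and 3 are skipped for this token, so their parameters cost memory but no compute.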
Dense FFN vs MoE
Dense FFN:
- every token uses the same FFN
- all FFN parameters are active
- simpler to train and serve
- compute grows directly with FFN size
MoE:
- each token is routed to selected experts
- only part of the model activates
- increases total capacity efficiently
- adds routing and load-balancing complexity
The key difference:
Dense FFN = same compute path for every token
MoE = conditional compute path per token
MoE is powerful.
But it is not free.
It introduces routing instability, expert imbalance, and distributed communication overhead.
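Expert imbalance is easy to measure once routing decisions are logged. The sketch below uses made-up Top-2 assignments for six tokens, and `imbalance_ratio` is one simple metric among several possible ones.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count how many token slots each expert received."""
    counts = Counter(e for token_experts in assignments for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_ratio(load):
    """Busiest expert's load divided by the mean load (1.0 = perfectly even)."""
    mean = sum(load) / len(load)
    return max(load) / mean

# Hypothetical top-2 assignments for six tokens over four experts.
assignments = [(0, 3), (0, 1), (0, 3), (0, 2), (0, 3), (1, 0)]
load = expert_load(assignments, num_experts=4)  # [6, 2, 1, 3]
ratio = imbalance_ratio(load)                   # 2.0
```

Here expert 0 handles twice its fair share while expert 2 sits almost idle. Training setups commonly add a load-balancing term to push the router away from this pattern.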
Multi-Token Prediction
Standard language modeling predicts one next token.
At position t:
predict token t + 1
Multi-Token Prediction trains the model to predict multiple future tokens.
At position t:
predict token t + 1, t + 2, t + 3 ...
This gives more learning signals from the same representation.
Standard training:
one position → one supervision signal
MTP training:
one position → multiple supervision signals
This can improve sample efficiency.
In some systems, the extra predictions are also used at inference to draft several tokens at once, for example in speculative decoding.
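The difference in supervision can be counted directly. This sketch only builds the training targets; the prediction heads and the loss are out of scope, and `mtp_targets` and the example sentence are illustrative.

```python
def mtp_targets(tokens, horizon):
    """For each position t, list the next `horizon` tokens that exist."""
    targets = []
    for t in range(len(tokens) - 1):
        targets.append(tokens[t + 1 : t + 1 + horizon])
    return targets

tokens = ["the", "bank", "raised", "rates", "today"]

standard = mtp_targets(tokens, horizon=1)  # one target per position
multi = mtp_targets(tokens, horizon=3)     # up to three targets per position

standard_signals = sum(len(t) for t in standard)  # 4
multi_signals = sum(len(t) for t in multi)        # 9
```

The same five tokens yield four supervision signals in the standard setup and nine with a horizon of three. Positions near the end of the sequence simply have fewer future tokens to predict.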
Naive vs Modern View
Naive view:
Transformer block = attention + FFN
Modern view:
Transformer block = stable normalization + efficient attention + gated FFN + sparse scaling
Naive block:
attention
ffn
Modern block:
rmsnorm
rope
gqa
residual
rmsnorm
swiglu or moe
residual
This matters because modern LLM performance is not just about parameter count.
It is about architecture details that make those parameters trainable and deployable.
Implementation Perspective
When reading modern LLM code, look for these patterns:
self.input_layernorm = RMSNorm(...)
self.self_attn = Attention(..., rope=True, num_key_value_heads=...)
self.post_attention_layernorm = RMSNorm(...)
self.mlp = SwiGLU(...)  # or MoE(...) in sparse models
The key clue for GQA is:
number of query heads > number of key-value heads
The key clue for RoPE is:
position is applied to Q and K before attention
The key clue for MoE is:
router logits decide which experts run
These details tell you what kind of Transformer block you are actually looking at.
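The GQA clue can be turned into a small check. The key `num_key_value_heads` follows the snippet above; `num_attention_heads`, the config values, and both helper functions are assumptions for illustration. The mapping also assumes that consecutive query heads share a key-value head.

```python
def attention_kind(num_attention_heads, num_key_value_heads):
    """Classify the attention variant from the two head counts."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own K/V head
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # query heads share K/V in groups

def kv_head_for(query_head, num_attention_heads, num_key_value_heads):
    """Index of the K/V head that a given query head reads from."""
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size

# Hypothetical config values.
config = {"num_attention_heads": 32, "num_key_value_heads": 8}
kind = attention_kind(**config)                         # "GQA"
mapping = [kv_head_for(q, **config) for q in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With 32 query heads and 8 key-value heads, each group of four query heads reads the same cached keys and values.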
Important Conditions and Limits
Pre-LN improves stability, but the whole optimization setup still matters.
RMSNorm is efficient, but it does not replace good initialization or training design.
GQA reduces KV Cache memory, but query heads in a group share the same keys and values, which can cost some quality compared with full Multi-Head Attention.
RoPE encodes relative position well, but running far beyond the trained context length usually still needs scaling techniques such as position interpolation.
SwiGLU improves FFN expressiveness, but the gate adds a third weight matrix, so the hidden width is often reduced to keep the parameter count comparable.
MoE increases capacity, but adds routing and system complexity.
Modern Transformer design is a trade-off system.
Every upgrade solves one bottleneck and introduces another design choice.
Why This Matters Again
Modern LLMs are not just large neural networks.
They are carefully engineered stacks.
If you understand the block, you can better understand:
- why inference needs KV Cache optimization
- why RoPE appears in attention code
- why RMSNorm replaces LayerNorm
- why GQA changes memory usage
- why MoE models can be huge but still sparse
This is the difference between using LLMs and understanding how they scale.
Takeaway
Modern Transformer blocks preserve the original Transformer idea.
But they upgrade almost every practical detail.
The shortest version:
Modern Transformer Block = Pre-LN/RMSNorm + GQA/RoPE Attention + SwiGLU/MoE FFN + Residual Connections
If Self-Attention is the core idea, the modern block is the production-grade version of that idea.
It is built for depth, context length, inference memory, and scalable capacity.
Discussion
When reading modern LLM architecture, which component feels most important to understand first?
RMSNorm, RoPE, GQA, SwiGLU, or MoE?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/modern-transformer-blocks-llm-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai