The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer — the architecture behind GPT, Claude, Gemini, and every major LLM today. It replaced recurrent models entirely with attention mechanisms, and the field has never looked back.
This post walks through the key ideas.
The problem with RNNs
Before Transformers, sequence modeling meant RNNs and LSTMs. These process tokens one at a time, left to right. That sequential dependency creates two problems:
- No parallelization — each step depends on the previous hidden state, so you can't process tokens simultaneously during training
- Long-range dependencies decay — by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states
Attention mechanisms existed before this paper (Bahdanau attention, 2014), but they were bolted onto RNNs. The radical idea here: what if attention is all you need? Drop the recurrence entirely.
The Encoder-Decoder architecture
The Transformer follows the classic encoder-decoder structure used in machine translation:
- Encoder (left side): Takes the input sequence and produces a rich representation. 6 identical layers stacked.
- Decoder (right side): Takes the encoder's output + previously generated tokens to produce the next token. Also 6 layers.
Each layer in both stacks has the same building blocks: multi-head attention, feed-forward networks, residual connections, and layer normalization.
Self-attention: the core mechanism
Self-attention lets every token in a sequence look at every other token and decide how much to "pay attention" to it.
For each token, the model computes three vectors:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I contain?"
- Value (V) — "what information do I provide?"
These are produced by multiplying the input embeddings by learned weight matrices: Q = XW^Q, K = XW^K, V = XW^V.
The attention score between two tokens is the dot product of one token's query with the other's key. High dot product = high relevance. The formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling factor √d_k prevents the dot products from growing too large as the dimensionality d_k increases — without it, the softmax would produce extremely peaked distributions, effectively killing gradients.
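The formula above fits in a few lines of NumPy. This is a minimal sketch with toy dimensions (3 tokens, d_k = 4, not the paper's sizes) chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4) — one output vector per token
print(w.sum(axis=-1))  # attention weights per token sum to 1
```

Each output row is a weighted mixture of all value vectors, with the weights determined by query-key similarity.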
Multi-head attention
Instead of computing attention once with the full dimensionality, the model splits Q, K, and V into multiple heads (8 in the original paper). Each head operates on a smaller subspace (d_model / h = 512 / 8 = 64 dimensions per head).
Why? Different heads can learn different types of relationships:
- One head might focus on syntactic structure (subject-verb agreement)
- Another might capture positional proximity
- Another might track semantic similarity
The outputs of all heads are concatenated and projected back to the full dimension.
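The split-and-concatenate bookkeeping is just reshapes. A minimal sketch (function names are my own, not from the paper):

```python
import numpy as np

def split_heads(x, num_heads):
    # (seq, d_model) -> (num_heads, seq, d_head): each head sees a slice of the features.
    seq, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(seq, num_heads, d_head).transpose(1, 0, 2)

def merge_heads(x):
    # (num_heads, seq, d_head) -> (seq, d_model): concatenate the head outputs.
    num_heads, seq, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq, num_heads * d_head)

x = np.zeros((10, 512))          # 10 tokens, d_model = 512
heads = split_heads(x, 8)
print(heads.shape)               # (8, 10, 64) — 8 heads of 64 dims each
print(merge_heads(heads).shape)  # (10, 512) — back to the full dimension
```

In the real model, attention runs independently inside each head before the merge, and a final learned projection W^O follows the concatenation.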
Three types of attention in the Transformer
The paper uses multi-head attention in three distinct ways:
- Encoder self-attention — every input token attends to every other input token
- Masked decoder self-attention — each output token attends only to previous output tokens (the mask prevents looking ahead, preserving autoregressive generation)
- Cross-attention — decoder tokens attend to encoder outputs, connecting the input representation to the output generation
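The decoder mask in the second variant is typically built by setting future positions to −∞ before the softmax, so they receive zero attention weight. A sketch of that idea:

```python
import numpy as np

def causal_mask(seq_len):
    # Position i may attend to positions j <= i.
    # Future positions (j > i) get -inf, which softmax turns into weight 0.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

# Added to the raw attention scores before softmax.
print(causal_mask(4))
```

Row i of the mask zeroes out everything to the right of position i, which is what preserves autoregressive generation: token 3 can never peek at token 4.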
Positional encoding
Self-attention has no inherent notion of order — it's a set operation. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns without positional information.
The paper adds positional encodings using sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
These are added (not concatenated) to the input embeddings. The sinusoidal approach was chosen because it allows the model to generalize to sequence lengths longer than those seen during training — for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos).
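The encoding table can be computed directly from the formulas above. A minimal sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]  # (1, d_model/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512) — one encoding vector per position
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternate
```

Each dimension oscillates at a different frequency, so every position gets a unique fingerprint that nearby positions share most of.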
Position-wise feed-forward networks
Each attention sub-layer is followed by a feed-forward network applied independently to each position:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
This is two linear transformations with a ReLU in between. The inner dimension expands to 2048 (4× the model dimension of 512), then projects back down. Think of it as each token individually "processing" the information it gathered from attention.
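In code, the expand-then-project pattern looks like this (random weights here are placeholders for the learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied to every position independently.
    hidden = np.maximum(0, x @ W1 + b1)  # (seq, 2048) — expand to the inner dimension
    return hidden @ W2 + b2              # (seq, 512)  — project back down

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))  # 10 tokens
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```

Because the same weights are applied at every position, the FFN is equivalent to two 1×1 convolutions over the sequence.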
Residual connections and layer norm
Every sub-layer (attention or FFN) is wrapped with:

LayerNorm(x + Sublayer(x))
The residual connection ensures gradients flow easily through deep stacks — without them, training a 6-layer stack would be much harder. Layer normalization stabilizes the activations.
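A simplified sketch of the wrapper (omitting layer norm's learned gain and bias for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # Post-norm residual wrapping from the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + fn(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer(x, lambda t: t * 0.5)  # stand-in for attention or the FFN
print(out.shape)  # (10, 512)
```

The `x +` term is the residual path: even if the sub-layer learns nothing useful early in training, the identity signal still flows through.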
Training details
- Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹
- Learning rate schedule: Warmup + decay. LR increases linearly for 4000 steps, then decays proportionally to the inverse square root of the step number
- Regularization: Dropout (0.1) on attention weights and after each sub-layer, plus label smoothing (0.1)
- Training data: WMT English-German (4.5M sentence pairs) and English-French (36M pairs)
- Hardware: 8 NVIDIA P100 GPUs, 3.5 days for the big model
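The warmup-then-decay schedule above is a single formula from the paper, easy to sketch:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    # Linear rise for the first `warmup` steps, then inverse-sqrt decay.
    step = max(step, 1)  # avoid step=0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The two branches of min() cross exactly at step 4000, the peak LR.
for s in (1, 1000, 4000, 10000, 100000):
    print(s, transformer_lr(s))
```

Note there is no fixed base learning rate: the peak is determined entirely by d_model and the warmup length.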
The results
The Transformer achieved state-of-the-art on English-to-German and English-to-French translation, beating all previous models including deep ensembles — while training significantly faster due to full parallelization.
But translation was just the beginning. The architecture turned out to be the foundation for:
- BERT (encoder-only) — bidirectional pretraining
- GPT (decoder-only) — autoregressive language modeling
- Vision Transformers — applying the same architecture to images
- Basically everything in modern AI
Key takeaway
The paper's core insight is elegant: you don't need recurrence or convolutions for sequence modeling. Attention alone — properly scaled, split into multiple heads, and stacked with residual connections — is sufficient. And because attention computes all pairwise relationships in parallel, it's dramatically faster to train.
That's why nine years later, every frontier model is still a Transformer at its core.