The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer — the architecture behind GPT, Claude, Gemini, and every major LLM today. It replaced recurrent models entirely with attention mechanisms, and the field has never looked back.
This post walks through the key ideas.
The problem with RNNs
Before Transformers, sequence modeling meant RNNs and LSTMs. These process tokens one at a time, left to right. That sequential dependency creates two problems:
- No parallelization — each step depends on the previous hidden state, so you can't process tokens simultaneously during training
- Long-range dependencies decay — by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states
Attention mechanisms existed before this paper (Bahdanau attention, 2014), but they were bolted onto RNNs. The radical idea here: what if attention is all you need? Drop the recurrence entirely.
The Encoder-Decoder architecture
The Transformer follows the classic encoder-decoder structure used in machine translation:
- Encoder (left side): Takes the input sequence and produces a rich representation. 6 identical layers stacked.
- Decoder (right side): Takes the encoder's output + previously generated tokens to produce the next token. Also 6 layers.
Each layer in both stacks has the same building blocks: multi-head attention, feed-forward networks, residual connections, and layer normalization.
Self-attention: the core mechanism
Self-attention lets every token in a sequence look at every other token and decide how much to "pay attention" to it.
For each token, the model computes three vectors:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I contain?"
- Value (V) — "what information do I provide?"
These are produced by multiplying the input embeddings by learned weight matrices: Q = XW^Q, K = XW^K, V = XW^V.
The attention score between two tokens is the dot product of one token's query with the other's key. High dot product = high relevance. The formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling factor √d_k prevents the dot products from growing too large as the dimensionality d_k increases — without it, the softmax would produce extremely peaked distributions, effectively killing gradients.
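The formula above fits in a few lines of NumPy. This is a minimal sketch with toy dimensions (3 tokens, d_k = 4, not the paper's sizes) chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4) — one output vector per token
print(w.sum(axis=-1))  # attention weights per token sum to 1
```

Each output row is a weighted mixture of all value vectors, with the weights determined by query-key similarity.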
Multi-head attention
Instead of computing attention once with the full dimensionality, the model splits Q, K, and V into multiple heads (8 in the original paper). Each head operates on a smaller subspace (d_model / h = 512 / 8 = 64 dimensions per head).
Why? Different heads can learn different types of relationships:
- One head might focus on syntactic structure (subject-verb agreement)
- Another might capture positional proximity
- Another might track semantic similarity
The outputs of all heads are concatenated and projected back to the full dimension.
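The split-and-concatenate bookkeeping is just reshapes. A minimal sketch (function names are my own, not from the paper):

```python
import numpy as np

def split_heads(x, num_heads):
    # (seq, d_model) -> (num_heads, seq, d_head): each head sees a slice of the features.
    seq, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(seq, num_heads, d_head).transpose(1, 0, 2)

def merge_heads(x):
    # (num_heads, seq, d_head) -> (seq, d_model): concatenate the head outputs.
    num_heads, seq, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq, num_heads * d_head)

x = np.zeros((10, 512))          # 10 tokens, d_model = 512
heads = split_heads(x, 8)
print(heads.shape)               # (8, 10, 64) — 8 heads of 64 dims each
print(merge_heads(heads).shape)  # (10, 512) — back to the full dimension
```

In the real model, attention runs independently inside each head before the merge, and a final learned projection W^O follows the concatenation.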
Three types of attention in the Transformer
The paper uses multi-head attention in three distinct ways:
- Encoder self-attention — every input token attends to every other input token
- Masked decoder self-attention — each output token attends only to previous output tokens (the mask prevents looking ahead, preserving autoregressive generation)
- Cross-attention — decoder tokens attend to encoder outputs, connecting the input representation to the output generation
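The decoder mask in the second variant is typically built by setting future positions to −∞ before the softmax, so they receive zero attention weight. A sketch of that idea:

```python
import numpy as np

def causal_mask(seq_len):
    # Position i may attend to positions j <= i.
    # Future positions (j > i) get -inf, which softmax turns into weight 0.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

# Added to the raw attention scores before softmax.
print(causal_mask(4))
```

Row i of the mask zeroes out everything to the right of position i, which is what preserves autoregressive generation: token 3 can never peek at token 4.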
Positional encoding
Self-attention has no inherent notion of order — it's a set operation. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns without positional information.
The paper adds positional encodings using sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
These are added (not concatenated) to the input embeddings. The sinusoidal approach was chosen because it allows the model to generalize to sequence lengths longer than those seen during training — for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos).
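The encoding table can be computed directly from the formulas above. A minimal sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]  # (1, d_model/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512) — one encoding vector per position
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternate
```

Each dimension oscillates at a different frequency, so every position gets a unique fingerprint that nearby positions share most of.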
Position-wise feed-forward networks
Each attention sub-layer is followed by a feed-forward network applied independently to each position:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
This is two linear transformations with a ReLU in between. The inner dimension expands to 2048 (4× the model dimension of 512), then projects back down. Think of it as each token individually "processing" the information it gathered from attention.
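In code, the expand-then-project pattern looks like this (random weights here are placeholders for the learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied to every position independently.
    hidden = np.maximum(0, x @ W1 + b1)  # (seq, 2048) — expand to the inner dimension
    return hidden @ W2 + b2              # (seq, 512)  — project back down

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))  # 10 tokens
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```

Because the same weights are applied at every position, the FFN is equivalent to two 1×1 convolutions over the sequence.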
Residual connections and layer norm
Every sub-layer (attention or FFN) is wrapped with:

LayerNorm(x + Sublayer(x))
The residual connection ensures gradients flow easily through deep stacks — without them, training a 6-layer stack would be much harder. Layer normalization stabilizes the activations.
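A simplified sketch of the wrapper (omitting layer norm's learned gain and bias for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # Post-norm residual wrapping from the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + fn(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer(x, lambda t: t * 0.5)  # stand-in for attention or the FFN
print(out.shape)  # (10, 512)
```

The `x +` term is the residual path: even if the sub-layer learns nothing useful early in training, the identity signal still flows through.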
Training details
- Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹
- Learning rate schedule: Warmup + decay. LR increases linearly for 4000 steps, then decays proportionally to the inverse square root of the step number
- Regularization: Dropout (0.1) on attention weights and after each sub-layer, plus label smoothing (0.1)
- Training data: WMT English-German (4.5M sentence pairs) and English-French (36M pairs)
- Hardware: 8 NVIDIA P100 GPUs, 3.5 days for the big model
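The warmup-then-decay schedule above is a single formula from the paper, easy to sketch:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    # Linear rise for the first `warmup` steps, then inverse-sqrt decay.
    step = max(step, 1)  # avoid step=0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The two branches of min() cross exactly at step 4000, the peak LR.
for s in (1, 1000, 4000, 10000, 100000):
    print(s, transformer_lr(s))
```

Note there is no fixed base learning rate: the peak is determined entirely by d_model and the warmup length.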
The results
The Transformer achieved state-of-the-art on English-to-German and English-to-French translation, beating all previous models including deep ensembles — while training significantly faster due to full parallelization.
But translation was just the beginning. The architecture turned out to be the foundation for:
- BERT (encoder-only) — bidirectional pretraining
- GPT (decoder-only) — autoregressive language modeling
- Vision Transformers — applying the same architecture to images
- Basically everything in modern AI
Key takeaway
The paper's core insight is elegant: you don't need recurrence or convolutions for sequence modeling. Attention alone — properly scaled, split into multiple heads, and stacked with residual connections — is sufficient. And because attention computes all pairwise relationships in parallel, it's dramatically faster to train.
That's why nine years later, every frontier model is still a Transformer at its core.