The "Attention Is All You Need" paper introduced the Transformer model, which shifted AI from sequential processing to a parallel approach built on attention. This architecture powers today's large language models. We'll break down its key ideas, focusing on why it replaced RNNs and how its mechanisms work.
Why RNNs Struggled with Language Tasks
Before Transformers, RNNs dominated tasks like translation. They process sequences one step at a time, updating a hidden state with each word.
Key issues:
Long-range dependencies: In a long sentence, early words fade from memory by the end. This creates recency bias, where recent words dominate.
Sequential processing: Each step depends on the previous one, blocking parallel computation.
For example, in "The animal didn't cross the street because it was too tired," an RNN might link "it" to "street" instead of "animal" if the sentence is long.
RNNs train slowly on large datasets because you can't split work across GPUs efficiently.
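To see the sequential bottleneck concretely, here's a minimal sketch of a vanilla RNN loop in NumPy (the sizes and weights below are made up for illustration): every hidden state depends on the one before it, so the time steps cannot run in parallel.
import numpy as np

np.random.seed(0)
seq_len, d_in, d_hidden = 6, 8, 16        # made-up sizes for illustration
x = np.random.rand(seq_len, d_in)         # one input vector per word
W_x = np.random.rand(d_in, d_hidden)      # input-to-hidden weights
W_h = np.random.rand(d_hidden, d_hidden)  # hidden-to-hidden weights

h = np.zeros(d_hidden)                    # hidden state starts empty
for t in range(seq_len):                  # strictly one word at a time
    h = np.tanh(x[t] @ W_x + h @ W_h)     # step t needs the result of step t-1
print(h)                                  # a single vector must summarize the whole sentence
The print shows only the final hidden vector, which is all that survives of the earlier words by the time the sequence ends.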
Read the original paper for more on RNN limitations.
How Transformers Fix RNN Problems
Transformers drop recurrence entirely, using self-attention to process all words simultaneously.
This solves two core RNN flaws:
Attention lets any word reference any other, handling long dependencies without fading memory.
Parallel processing speeds up training, even on massive data.
In practice, Transformers train faster despite higher per-step computation, thanks to GPU optimization.
| Aspect | RNNs | Transformers |
| --- | --- | --- |
| Dependency Handling | Sequential, fades over distance | Global, all words at once |
| Training Speed | Slow, no parallelism | Fast, fully parallel |
| Complexity | O(n) per layer | O(n²) but parallelizable |
The Core of the Self-Attention Mechanism
Self-attention computes how much each word should focus on others when building its representation.
For a word like "it" in "The animal didn't cross the street because it was too tired," attention scores high for "animal" and low for "street."
This happens by transforming each word into three vectors: query, key, and value.
Query: Represents what the current word is "asking" about.
Key: Advertises what other words offer.
Value: Holds the actual content to blend in.
All derive from the word's embedding via learned matrices.
Queries, Keys, and Values in Action
Word embeddings start as learned vectors capturing meaning—similar words get similar vectors.
For a sentence like "my name is john doe":
Embed each word into a vector (e.g., dimension 512).
Multiply by W_Q for query, W_K for key, W_V for value.
These matrices are parameters updated during training.
Here's a simple Python example using NumPy that computes Q, K, and V for a tiny embedding space, with random initial embeddings and matrices standing in for learned values.
import numpy as np
# Sample embeddings for words: "my", "name", "is", "john", "doe" (dim=4 for simplicity)
embeddings = np.array([
[0.1, 0.2, 0.3, 0.4], # my
[0.5, 0.6, 0.7, 0.8], # name
[0.9, 1.0, 1.1, 1.2], # is
[1.3, 1.4, 1.5, 1.6], # john
[1.7, 1.8, 1.9, 2.0] # doe
])
# Random learned matrices (dim=4x4)
W_Q = np.random.rand(4, 4)
W_K = np.random.rand(4, 4)
W_V = np.random.rand(4, 4)
# Compute Q, K, V
Q = np.dot(embeddings, W_Q)
K = np.dot(embeddings, W_K)
V = np.dot(embeddings, W_V)
print("Queries:\n", Q)
print("Keys:\n", K)
print("Values:\n", V)
# Output: Random floats based on seed; run to see matrices like:
# Queries:
# [[0.123 0.456 ...]]
# (Actual output varies with random seed)
This code runs standalone—copy and execute to see vector transformations.
Computing Attention Scores Step by Step
Attention scores measure similarity via dot products.
For each query, take the dot product with every key, scale by sqrt(d_k), and apply a softmax.
Dot product: Query_i • Key_j gives raw score.
Scale: Divide by sqrt(d_k) to stabilize gradients.
Softmax: Turns scores into probabilities summing to 1.
In code, extending the previous example:
import numpy as np
from scipy.special import softmax # For stable softmax
# From previous: Q, K, V (assume dim=4, seq_len=5)
dk = 4 # Key dimension
# Attention scores: Q @ K.T / sqrt(dk)
raw_scores = np.dot(Q, K.T) / np.sqrt(dk)
# Softmax along rows
attention_weights = softmax(raw_scores, axis=1)
print("Attention Weights:\n", attention_weights)
# Output example (varies):
# [[0.2 0.15 0.25 0.2 0.2]
# ...]
# Each row sums to 1, showing focus distribution.
This snippet computes weights—add to prior code for a full runnable script.
High scores link related words, like pronouns to nouns.
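One more step completes the mechanism: the weights blend the value vectors, so each word's new representation is a weighted sum of every word's values, the "content to blend in" described earlier. Appended to the same script:
# Weighted sum of values: row i mixes all value vectors using word i's attention weights
attention_output = np.dot(attention_weights, V)
print("Attention Output:\n", attention_output)
# Shape (5, 4): one updated vector per word of "my name is john doe"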
The Parallelism Trade-Off in Attention
Self-attention is O(n²)—every word attends to every other.
But unlike RNN's O(n) sequential steps, attention parallelizes fully.
On GPUs, this means faster training overall, even for long sequences.
For short n (e.g., 512), it's efficient; for very long (e.g., 100k), optimizations like sparse attention help, but the paper's base idea enabled scaling.
This shift made training on trillions of tokens feasible.
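To get a feel for the quadratic cost, the quick sketch below just counts score-matrix entries for a few sequence lengths (the lengths are illustrative): all n² scores come from one matrix multiply that parallelizes well, while an RNN's n steps must run one after another.
# Pairwise attention scores vs. sequential RNN steps for a few illustrative lengths
for n in [512, 2048, 8192, 100_000]:
    print(f"seq_len={n:>7}: {n * n:>14,} attention scores, {n:>7,} sequential steps")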
Why Positional Encoding Is Essential
Attention treats inputs as sets, ignoring order.
Without order, "Rama killed Ravana" equals "Ravana killed Rama."
Positional encoding adds position info to embeddings using sine/cosine functions.
For position p and dimension pair i: PE(p, 2i) = sin(p / 10000^(2i/d)) and PE(p, 2i+1) = cos(p / 10000^(2i/d)), so sines and cosines alternate across dimensions.
Added to embeddings before attention.
This lets the model distinguish order without recurrence.
In code, a basic implementation:
import numpy as np
def positional_encoding(seq_len, d_model):
    # Even dimensions get a sine, odd dimensions the matching cosine (the paper's formula)
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            if i % 2 == 0:
                PE[pos, i] = np.sin(pos / np.power(10000, i / d_model))
            else:
                PE[pos, i] = np.cos(pos / np.power(10000, (i - 1) / d_model))
    return PE
# Example: seq_len=5, d_model=4
pe = positional_encoding(5, 4)
print(pe)
# Output: Sine/cosine values like:
# [[0.      1.      0.      1.     ]
#  [0.8415  0.5403  0.01    0.99995]
#  ...]
# Add this to your embeddings matrix.
Run this to generate encodings—fully standalone.
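To wire it in, add the encoding to the embedding matrix before computing Q, K, and V; reusing the 5x4 toy embeddings from the earlier snippet:
# Position information now rides along with word meaning
embeddings_with_position = embeddings + positional_encoding(5, 4)
print(embeddings_with_position)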
How Training Shapes the Transformer
Training uses backpropagation on large datasets to minimize prediction errors.
Updated parameters include:
Word embeddings: Refine meanings.
Q, K, V matrices: Tune attention judgments.
Other layers: Feed-forward nets, etc.
Over millions of examples, the model learns these similarities, like the query for "it" aligning with the key for "animal."
This process encodes language patterns into parameters, enabling accurate attention.
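Here's a minimal sketch of that idea, under heavy simplifications: a made-up target and squared-error loss stand in for real next-token prediction, and a finite-difference gradient stands in for backpropagation. It only shows the shape of the process, one gradient step nudging W_Q so the attention output fits the toy target a bit better.
import numpy as np

np.random.seed(0)
# Toy setup mirroring the earlier snippets (5 words, dim 4); the target and loss are invented for the demo
embeddings = np.random.rand(5, 4)
W_Q = np.random.rand(4, 4)
W_K = np.random.rand(4, 4)
W_V = np.random.rand(4, 4)
target = np.random.rand(5, 4)  # hypothetical "desired" attention output

def toy_attention(W_Q):
    Q, K, V = embeddings @ W_Q, embeddings @ W_K, embeddings @ W_V
    scores = Q @ K.T / np.sqrt(4)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax per row
    return weights @ V

def loss(W_Q):
    return np.mean((toy_attention(W_Q) - target) ** 2)

# Finite differences stand in for backprop: nudge each entry of W_Q and see how the loss moves
grad = np.zeros_like(W_Q)
eps = 1e-5
for r in range(4):
    for c in range(4):
        bumped = W_Q.copy()
        bumped[r, c] += eps
        grad[r, c] = (loss(bumped) - loss(W_Q)) / eps

W_Q_updated = W_Q - 0.5 * grad  # one gradient-descent step on the query matrix
print("loss before:", loss(W_Q))
print("loss after: ", loss(W_Q_updated))  # typically a little lower after the update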
The Transformer's design—parallel attention with positional info—unlocked efficient scaling. It handles context windows beyond sentences, powering modern AI. Experiment with the code snippets to build intuition, and adapt for your projects.