Part 1: The Core Idea
Attention is like a spotlight - it helps models focus on what's important.
import torch
import torch.nn.functional as F
# Simple example: Which word is most important?
sentence = ["I", "love", "pizza"]
importance = torch.tensor([0.1, 0.3, 0.6]) # pizza is most important
Intuition: Instead of treating all words equally, attention assigns different weights to focus on what matters most.
Part 2: Basic Attention Weights
# Raw attention scores (how much to focus on each word)
scores = torch.tensor([2.0, 1.0, 3.0]) # [I, love, pizza]
# Convert to probabilities (softmax)
weights = F.softmax(scores, dim=0)
print(weights) # [0.24, 0.09, 0.67] - pizza gets most attention
What happened: Softmax converts raw scores to probabilities that sum to 1.
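To see what softmax does under the hood, the same weights can be computed by hand (a quick sanity check, not library code):
# Manual softmax: exponentiate, then normalize so the results sum to 1
exp_scores = torch.exp(scores)                 # ~[7.39, 2.72, 20.09]
manual_weights = exp_scores / exp_scores.sum()
print(manual_weights)                          # matches F.softmax(scores, dim=0)
print(manual_weights.sum())                    # tensor(1.)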
Part 3: Weighted Combination
# Word representations (simplified vectors)
words = torch.tensor([[1.0, 0.0], # "I"
[0.0, 1.0], # "love"
[1.0, 1.0]]) # "pizza"
# Apply attention weights
attended = torch.sum(weights.unsqueeze(1) * words, dim=0)
print(attended) # Mostly "pizza" representation
Intuition: We combine all word vectors, but "pizza" contributes most because it has the highest attention weight.
Part 4: Computing Attention Scores
# How similar are words? (dot product)
query = torch.tensor([1.0, 1.0]) # What we're looking for
key1 = torch.tensor([1.0, 0.0]) # "I"
key2 = torch.tensor([0.0, 1.0]) # "love"
score1 = torch.dot(query, key1) # 1.0
score2 = torch.dot(query, key2) # 1.0
Theory: Attention scores measure how well a query matches each key.
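Both keys above match the query equally (both scores are 1.0). For contrast, a hypothetical third key that points in the same direction as the query scores higher, which is exactly what earns it more attention:
key3 = torch.tensor([1.0, 1.0])   # hypothetical "pizza" key, aligned with the query
score3 = torch.dot(query, key3)   # 2.0 - the best match, so the most attention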
Part 5: The Q, K, V Concept
# Three roles for each word:
# Q (Query): "What am I looking for?"
# K (Key): "What do I represent?"
# V (Value): "What information do I carry?"
query = torch.tensor([1.0, 0.0]) # Looking for subject
keys = torch.tensor([[1.0, 0.0], # "I" - matches query well
[0.0, 1.0]]) # "love" - doesn't match
values = torch.tensor([[2.0, 3.0], # "I" carries this info
[1.0, 4.0]]) # "love" carries this info
Intuition: Query asks "what do I need?", Keys answer "what do I offer?", Values provide the actual information.
Part 6: One-Line Attention
# Complete attention in one line
attention_output = torch.sum(F.softmax(torch.mv(keys, query), dim=0).unsqueeze(1) * values, dim=0)
What it does: Computes scores (query·keys), applies softmax, weights the values.
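The same one-liner, unpacked into named steps (equivalent code, just easier to follow):
scores = torch.mv(keys, query)      # dot product of the query with each key -> [1.0, 0.0]
weights = F.softmax(scores, dim=0)  # probabilities over the two keys -> ~[0.73, 0.27]
attention_output = torch.sum(weights.unsqueeze(1) * values, dim=0)  # weighted mix of the values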
Part 7: Self-Attention Intuition
# In self-attention, each word can attend to every other word
sentence = ["The", "cat", "sat"]
# "cat" might attend to "sat" (what did the cat do?)
# "sat" might attend to "cat" (who sat?)
Key insight: Words can look at each other to understand relationships and context.
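A toy sketch of that idea, using made-up 2-d vectors for the three words: every word scores every other word, so each row of the score matrix is one word's view of the sentence.
# Hypothetical 2-d vectors for ["The", "cat", "sat"]
X = torch.tensor([[1.0, 0.0],   # "The"
                  [0.0, 1.0],   # "cat"
                  [1.0, 1.0]])  # "sat"
pairwise_scores = torch.mm(X, X.t())                   # [3, 3]: word i attending to word j
pairwise_weights = F.softmax(pairwise_scores, dim=1)   # each row sums to 1
print(pairwise_weights[1])  # "cat" row: ~[0.16, 0.42, 0.42] - split between itself and "sat"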
Part 8: Multi-Head Attention (Simple)
# Multiple "attention heads" look for different things
head1_query = torch.tensor([1.0, 0.0]) # Looking for subjects
head2_query = torch.tensor([0.0, 1.0]) # Looking for actions
# Each head focuses on different aspects
Why multiple heads: Different heads can specialize in different types of relationships (subject-verb, adjective-noun, etc.).
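A minimal sketch of that idea, reusing the keys from Part 5: the two head queries above produce different attention patterns over the same words.
# Each head computes its own scores over the same keys
head1_weights = F.softmax(torch.mv(keys, head1_query), dim=0)  # leans toward "I" (the subject)
head2_weights = F.softmax(torch.mv(keys, head2_query), dim=0)  # leans toward "love" (the action)
print(head1_weights)  # ~[0.73, 0.27]
print(head2_weights)  # ~[0.27, 0.73]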
Part 9: Scaling Up
# Real sentences have many words
seq_len = 10 # 10 words in sentence
d_model = 64 # Each word is 64-dimensional vector
# Q, K, V matrices transform word vectors
Q = torch.randn(seq_len, d_model) # Queries for each word
K = torch.randn(seq_len, d_model) # Keys for each word
V = torch.randn(seq_len, d_model) # Values for each word
Scale: Real models use hundreds or thousands of dimensions and process sequences of thousands of tokens.
Part 10: Attention Matrix
# Attention scores between all word pairs
attention_scores = torch.mm(Q, K.transpose(0, 1)) # [10, 10] matrix (unscaled here for simplicity)
attention_weights = F.softmax(attention_scores, dim=1) # Each row sums to 1
# Row i, column j = how much word i attends to word j
Visualization: Each row shows where one word "looks" in the sentence.
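A quick check of that structure, just inspecting the tensors built above:
print(attention_weights.sum(dim=1))  # each of the 10 rows sums to 1
print(attention_weights[0])          # where word 0 "looks": one weight per word in the sentence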
Part 11: Why Attention Works
# Traditional RNN: Information flows sequentially
# Word 1 → Word 2 → Word 3 → Word 4
# Attention: All words can interact directly
# Word 1 ↔ Word 2 ↔ Word 3 ↔ Word 4
Advantage: Information does not have to survive a long chain of sequential steps, so long-range dependencies are easier to capture, and all positions can be processed in parallel.
Part 12: Putting It All Together
# Complete self-attention step by step
def simple_attention(X):
    Q = X  # Queries (simplified)
    K = X  # Keys
    V = X  # Values
    scores = torch.mm(Q, K.transpose(0, 1))  # Compute similarities
    weights = F.softmax(scores, dim=1)  # Convert to probabilities
    output = torch.mm(weights, V)  # Weighted combination
    return output
# Usage
word_vectors = torch.randn(5, 8) # 5 words, 8 dimensions each
attended_vectors = simple_attention(word_vectors)
Result: Each word vector is now updated with information from all other words, weighted by attention.
Key Takeaways
- Attention = Weighted Average: Focus more on important parts
- Q·K = Similarity: How well query matches key
- Softmax = Probability: Convert scores to weights that sum to 1
- Weighted V = Output: Combine values using attention weights
- Self-Attention = Words talking to each other: Every word can attend to every other word
This foundation prepares you for transformer models, which use attention as their core building block!
Understanding Attention: From Words to Vectors
1. Word Embeddings - The Foundation
import torch
import torch.nn as nn
import torch.nn.functional as F
# Sample sentence: "The cat sat on the mat"
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
sentence = [1, 2, 3, 4, 1, 5] # token IDs
# Create embeddings
vocab_size = len(vocab)
embed_dim = 64
embedding = nn.Embedding(vocab_size, embed_dim)
# Convert tokens to vectors
tokens = torch.tensor(sentence)
embeddings = embedding(tokens)
print(f"Shape: {embeddings.shape}") # [6, 64]
print(f"'cat' vector: {embeddings[1][:8]}...") # First 8 dimensions
Each token becomes a 64-dimensional vector; after training, these vectors capture semantic meaning (here they are freshly initialized and still random).
2. The Q, K, V Matrices - Core of Attention
# Attention dimensions
d_model = 64
num_heads = 8
d_k = d_model // num_heads # 8
# Linear transformations to create Q, K, V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
# Transform embeddings
Q = W_q(embeddings) # Queries: "What am I looking for?"
K = W_k(embeddings) # Keys: "What do I represent?"
V = W_v(embeddings) # Values: "What information do I carry?"
print(f"Q shape: {Q.shape}") # [6, 64]
print(f"K shape: {K.shape}") # [6, 64]
print(f"V shape: {V.shape}") # [6, 64]
Intuition:
- Q (Query): "What information does this word need?"
- K (Key): "What kind of information does this word offer?"
- V (Value): "What actual information does this word contain?"
3. Computing Attention Scores
# Reshape for multi-head attention
batch_size, seq_len = 1, 6
Q = Q.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2) # [1, 8, 6, 8]
K = K.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2) # [1, 8, 6, 8]
V = V.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2) # [1, 8, 6, 8]
# Attention scores: How much should each word pay attention to others?
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
print(f"Attention scores shape: {scores.shape}") # [1, 8, 6, 6]
# Example: How much does "cat" attend to each word?
cat_attention = scores[0, 0, 1, :] # First head, "cat" position
words = ["the", "cat", "sat", "on", "the", "mat"]
for i, word in enumerate(words):
    print(f"cat -> {word}: {cat_attention[i]:.3f}")
4. Softmax and Weighted Values
# Convert scores to probabilities
attention_weights = F.softmax(scores, dim=-1)
print(f"Attention weights shape: {attention_weights.shape}") # [1, 8, 6, 6]
# Apply attention to values
attended_values = torch.matmul(attention_weights, V) # [1, 8, 6, 8]
# Concatenate heads and project back
attended_values = attended_values.transpose(1, 2).contiguous().view(
    batch_size, seq_len, d_model)  # [1, 6, 64]
print(f"Final attended values shape: {attended_values.shape}")
# Show attention pattern for "cat"
print("\nAttention pattern for 'cat':")
cat_weights = attention_weights[0, 0, 1, :] # First head
for i, word in enumerate(words):
    print(f" {word}: {cat_weights[i]:.3f}")
5. Complete Self-Attention Implementation
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        # Linear transformations
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention_weights = F.softmax(scores, dim=-1)
        attended_values = torch.matmul(attention_weights, V)
        # Concatenate heads
        attended_values = attended_values.transpose(1, 2).contiguous().view(
            batch_size, seq_len, d_model)
        # Final projection
        output = self.W_o(attended_values)
        return output, attention_weights
# Usage
attention = MultiHeadAttention(d_model=64, num_heads=8)
output, weights = attention(embeddings.unsqueeze(0))
print(f"Output shape: {output.shape}") # [1, 6, 64]
6. Visualizing Attention Patterns
# Extract attention weights for visualization
attention_matrix = weights[0, 0].detach().numpy() # First head
words = ["the", "cat", "sat", "on", "the", "mat"]
print("Attention Matrix (first head):")
print("From -> To:")
for i, from_word in enumerate(words):
    print(f"{from_word:>4}: ", end="")
    for j, to_word in enumerate(words):
        print(f"{attention_matrix[i,j]:.2f} ", end="")
    print()
7. Key Insights
What happens in attention?
- Each word creates a query (what it's looking for)
- Each word creates a key (what it represents)
- We compute similarity between queries and keys
- Higher similarity = more attention
- We use attention weights to combine values (actual information), as the formula below puts together
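Those five steps are exactly the standard scaled dot-product attention formula from "Attention Is All You Need":
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V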
Example: When processing "cat", the model might:
- Query: "I need information about animals"
- Look at all keys: "the" (determiner), "sat" (action), "mat" (object)
- Pay most attention to "sat" because it's the relevant action
- Combine information weighted by attention scores
8. Practical Example with Real Meaning
# Sentence: "The cat chased the mouse"
sentence = "The cat chased the mouse"
words = sentence.lower().split()
# Simulate what attention might learn
print("Attention patterns the model might learn:")
print("- 'cat' attends to 'chased' (subject-verb relationship)")
print("- 'chased' attends to 'cat' and 'mouse' (verb-subject-object)")
print("- 'mouse' attends to 'chased' (object-verb relationship)")
print("- 'the' attends to following nouns ('cat', 'mouse')")
# This allows the model to understand:
# - Who did what to whom
# - Grammatical relationships
# - Semantic dependencies
Summary
Attention mechanism allows models to:
- Focus on relevant parts of the input
- Relate different words to each other
- Combine information based on relevance
- Understand long-range dependencies
The magic is in the learned Q, K, V matrices that transform word embeddings into queries, keys, and values that can interact meaningfully.