Henri Wang

What is the mathematical realization of attention maps from multiple heads?

The mathematical realization of attention maps from multiple heads in a Vision Transformer (ViT) like ViT-S/8 trained with DINO involves computing the self-attention scores for each head and then visualizing them, often for a specific query (e.g., the [CLS] token). Here's a step-by-step breakdown:

1. Self-Attention in Multi-Head Attention (MHA)

In a transformer, the input embeddings are projected and split across $H$ heads (e.g., $H = 6$ for ViT-S). For each head $h$, the self-attention is computed as (a short code sketch follows the definitions below):

$$
\text{Attention}_h(Q_h, K_h, V_h) = \text{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right) V_h
$$

where:

  • $Q_h = X W_h^Q$ (queries for head $h$),
  • $K_h = X W_h^K$ (keys for head $h$),
  • $V_h = X W_h^V$ (values for head $h$),
  • $X$ is the input embedding (including positional encoding),
  • $W_h^Q, W_h^K, W_h^V$ are learned projection matrices for head $h$,
  • $d_k$ is the dimension of the key vectors (typically $d_k = d_{\text{model}} / H$).
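
As a minimal sketch of these equations (not the actual DINO/timm implementation), the per-head projections and attention weights can be computed with plain tensor operations; the shapes, random weights, and variable names below are illustrative assumptions:

import torch

# Illustrative shapes (assumed): batch B, N+1 tokens ([CLS] + N patches),
# model dimension D split across H heads.
B, N_plus_1, D, H = 1, 785, 384, 6
d_k = D // H

X = torch.randn(B, N_plus_1, D)   # token embeddings (with positional encoding)
W_Q = torch.randn(D, D)           # stacked per-head query projections
W_K = torch.randn(D, D)           # stacked per-head key projections
W_V = torch.randn(D, D)           # stacked per-head value projections

# Project, then split the channel dimension into H heads of size d_k
Q = (X @ W_Q).view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]
K = (X @ W_K).view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]
V = (X @ W_V).view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]

# Per-head attention: softmax(Q K^T / sqrt(d_k)) V
attn = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # [B, H, N+1, N+1]
out = attn @ V                                                      # [B, H, N+1, d_k]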

2. Attention Scores for [CLS] Query

When visualizing attention for the [CLS] token (often used as a global image representation), we focus on the attention scores where the query is the [CLS] token. For head $h$, the attention scores are:

$$
A_h = \text{softmax}\left(\frac{q_{\text{[CLS]},h} K_h^T}{\sqrt{d_k}}\right)
$$

where $q_{\text{[CLS]},h}$ is the query vector for the [CLS] token in head $h$.
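
Continuing the sketch above (same assumed tensors), the [CLS] attention per head is simply row 0 of each head's attention matrix:

# [CLS] is token 0, so A_h is row 0 of attn for head h: attn[:, h, 0, :]
q_cls = Q[:, :, 0, :]                                               # [B, H, d_k]
cls_scores = (q_cls.unsqueeze(2) @ K.transpose(-2, -1)).squeeze(2)  # [B, H, N+1]
A = torch.softmax(cls_scores / d_k ** 0.5, dim=-1)                  # [B, H, N+1], each row sums to 1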

3. Reshaping Attention Maps for Visualization

The attention scores $A_h$ have shape $(1, N+1)$, where $N$ is the number of patches and the extra entry is the [CLS] token itself. To visualize how the [CLS] token attends to image patches:

  • Discard the attention score for [CLS] attending to itself (it does not correspond to an image patch).
  • Reshape the remaining $N$ scores into a 2D grid matching the spatial layout of patches (e.g., $\sqrt{N} \times \sqrt{N}$ for a square image), as in the sketch below.
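
As a concrete example, ViT-S/8 uses 8×8 patches, so with an assumed 224×224 input the patch grid is 28×28 and $N = 784$. Continuing the sketch above:

# Assumed input resolution; DINO visualizations are often run at other sizes too.
img_size, patch_size = 224, 8
h_patches = w_patches = img_size // patch_size   # 28 x 28 patch grid
N = h_patches * w_patches                        # 784 patch tokens (plus 1 [CLS] token)

cls_patch_attn = A[:, :, 1:]                                   # [B, H, N] -- drop [CLS] -> [CLS]
cls_maps = cls_patch_attn.reshape(B, H, h_patches, w_patches)  # [B, H, 28, 28]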

4. Multi-Head Attention Visualization

Each head $h$ produces a distinct attention map $A_h$, highlighting different regions of the image. These maps are often:

  • Color-coded: Each head is assigned a unique color (e.g., red, blue, green).
  • Overlaid: The maps are combined (e.g., averaged or max-pooled) to show collective attention, as sketched below.
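
A tiny sketch of combining the per-head maps (using cls_maps from the sketch above; whether to average or max-pool is purely a visualization choice):

mean_map = cls_maps.mean(dim=1)        # [B, h_patches, w_patches] -- average over heads
max_map = cls_maps.max(dim=1).values   # [B, h_patches, w_patches] -- max over heads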

5. Example in ViT-S/8 Trained with DINO

In DINO (self-distillation with no labels), the [CLS] token’s attention maps often correspond to semantically meaningful regions (e.g., object boundaries). The last layer’s heads may capture:

  • Head 1: Focus on object centers.
  • Head 2: Focus on edges.
  • Head 3: Attend to background context.

6. Mathematical Summary

For $H$ heads, the attention maps for [CLS] are:

$$
\{A_h\}_{h=1}^{H}, \quad A_h \in \mathbb{R}^{1 \times N}
$$

These are reshaped into $H$ 2D maps, each showing where the [CLS] token "looks" in head $h$.

Pseudocode (PyTorch-like)

import torch

# X: input token embeddings (shape: [B, N+1, D]); token 0 is [CLS]
# W_Q, W_K: query/key projection matrices (shape: [D, D])
# H: number of attention heads

B, N_plus_1, D = X.shape
d_k = D // H

# Project to queries/keys, then split the channel dimension into heads
q = X @ W_Q  # [B, N+1, D]
k = X @ W_K  # [B, N+1, D]

q = q.view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]
k = k.view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]

# Attention scores for the [CLS] query (token index 0), one row per head
q_cls = q[:, :, 0:1, :]                                       # [B, H, 1, d_k]
attn_scores = (q_cls @ k.transpose(-2, -1)) / (d_k ** 0.5)    # [B, H, 1, N+1]
attn_weights = torch.softmax(attn_scores, dim=-1).squeeze(2)  # [B, H, N+1]

# Keep only patch tokens (drop [CLS] -> [CLS]) and reshape to the patch grid
cls_attn_maps = attn_weights[:, :, 1:]                             # [B, H, N]
h_patches = w_patches = int((N_plus_1 - 1) ** 0.5)                 # assumes a square patch grid
cls_attn_maps = cls_attn_maps.reshape(B, H, h_patches, w_patches)  # [B, H, h_patches, w_patches]

Visualization

  • Each head’s map is upsampled to the image size and overlaid (e.g., as a heatmap).
  • Colors represent different heads, showing diverse focus regions (see the sketch below).
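
A minimal plotting sketch, assuming the cls_attn_maps tensor from the pseudocode above, a 224×224 input, and matplotlib; assigning one colormap per head is just one way to realize "different colors":

import torch.nn.functional as F
import matplotlib.pyplot as plt

# Upsample each head's patch-grid map to the (assumed) input resolution
img_size = 224
maps = F.interpolate(cls_attn_maps, size=(img_size, img_size), mode="nearest")  # [B, H, 224, 224]
maps = maps[0].detach().cpu().numpy()  # first image in the batch: [H, 224, 224]

# One panel per head, each with its own colormap
cmaps = ["Reds", "Blues", "Greens", "Purples", "Oranges", "Greys"]
fig, axes = plt.subplots(1, maps.shape[0], figsize=(3 * maps.shape[0], 3))
for h, ax in enumerate(axes):
    ax.imshow(maps[h], cmap=cmaps[h % len(cmaps)])
    ax.set_title(f"head {h}")
    ax.axis("off")
plt.tight_layout()
plt.show()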

This is how "different heads, materialized by different colors" are realized mathematically and visually.
