Henri Wang

How is the [CLS] token assumed to carry global information during training?

During training, the DINO loss forces the [CLS] output to be view-invariant, i.e., different augmentations must map to similar [CLS] outputs.
This encourages the network to focus on high-level, global semantics rather than local patterns (which change under crops/augmentations).

Why does requiring different view augmentations to map to similar [CLS] outputs force the [CLS] token to gather global information about the image?

Excellent — you're now asking why the DINO loss mathematically forces the [CLS] token to gather global information, and how this arises from the attention mechanism in Vision Transformers.

Let’s prove this through the structure of self-attention, and show why matching [CLS] outputs across different augmentations enforces global feature learning.


🧠 Key Question Restated:

Why does requiring the [CLS] outputs of different views (augmentations) of the same image to be similar force the [CLS] token to encode global information?

We'll explain this from two perspectives:

  1. Architecture: how self-attention distributes information into [CLS]
  2. Loss signal: how the DINO loss pushes [CLS] to represent augmentation-invariant (i.e., global) content

1️⃣ Self-Attention Architecture: [CLS] as an Information Aggregator

🧮 Transformer Input

For an image split into $N$ patches, the input is:

$$
X = [x_{\text{cls}}, x_1, x_2, \dots, x_N] \in \mathbb{R}^{(N+1) \times d}
$$

  • $x_{\text{cls}}$ is a learned [CLS] token.
  • $x_i$ are the patch tokens from the image.
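
For concreteness, here is a minimal PyTorch sketch of how this input sequence is typically assembled. The patch size, embedding dimension, and names like `patch_embed` are illustrative assumptions, not DINO's actual code, and positional embeddings are omitted.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 224x224 image with 16x16 patches gives N = 196 patch tokens of dim d = 384.
d, patch = 384, 16
patch_embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # patchify + linear projection
cls_token = nn.Parameter(torch.zeros(1, 1, d))                  # learned [CLS] embedding

img = torch.randn(1, 3, 224, 224)                       # a single input image
x = patch_embed(img).flatten(2).transpose(1, 2)         # (1, N, d) patch tokens
x = torch.cat([cls_token.expand(1, -1, -1), x], dim=1)  # (1, N+1, d): [CLS] prepended
# Positional embeddings are omitted here for brevity.
```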

⚙️ Self-Attention Mechanism

Each token (including [CLS]) is updated by attending to all other tokens:

Let:

  • $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $W^{Q,K,V} \in \mathbb{R}^{d \times d_h}$
  • Attention output for token $t$:

$$
\text{Attn}(x_t) = \sum_{j=1}^{N+1} \text{softmax}_j \left( \frac{Q_t K_j^\top}{\sqrt{d_h}} \right) V_j
$$

So the updated [CLS] token is:

$$
x_{\text{cls}}^{(l+1)} = \text{Attn}(x_{\text{cls}}^{(l)})
$$

It is a weighted sum over all token embeddings — meaning, the [CLS] token aggregates all patch information in each layer.
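
A toy single-head sketch (residuals, multi-head splitting, and the MLP omitted) makes this concrete: row 0 of the attention output, the [CLS] update, is literally a softmax-weighted sum over the value vectors of every token. The sizes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Single-head self-attention sketch, showing that the updated [CLS]
# is a weighted sum over the values of ALL tokens.
d, d_h = 384, 64
W_q, W_k, W_v = (torch.randn(d, d_h) for _ in range(3))

x = torch.randn(197, d)                          # [CLS] + 196 patch tokens
Q, K, V = x @ W_q, x @ W_k, x @ W_v              # (197, d_h) each
attn = F.softmax(Q @ K.T / d_h ** 0.5, dim=-1)   # (197, 197); each row sums to 1
out = attn @ V                                   # every row is a weighted sum of all V_j
cls_update = out[0]                              # row 0: the [CLS] update, mixing all patches
```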

This is architectural: the [CLS] token is **designed** to gather global information.


2️⃣ DINO Loss Pushes for View-Invariance → Forces Global Semantics

Now add the DINO loss, applied to [CLS] tokens from different augmented views of the same image:

🖼️ View 1 (e.g. crop A):

$$
X_A = [x_{\text{cls}}^A, x_1^A, \dots, x_N^A] \to f_A(x_{\text{cls}}^A)
$$

🖼️ View 2 (e.g. crop B):

$$
X_B = [x_{\text{cls}}^B, x_1^B, \dots, x_M^B] \to f_B(x_{\text{cls}}^B)
$$

Then:

$$
\mathcal{L}_{\text{DINO}} = \text{CrossEntropy}\left(\text{softmax}\left(\frac{f_{\text{teacher}}(x_{\text{cls}}^A)}{\tau_t}\right), \text{softmax}\left(\frac{f_{\text{student}}(x_{\text{cls}}^B)}{\tau_s}\right)\right)
$$
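
A minimal sketch of this loss in PyTorch, assuming the inputs are the projection-head outputs for the two views; DINO's centering of the teacher output and its multi-crop averaging are omitted here.

```python
import torch
import torch.nn.functional as F

def dino_loss(teacher_out, student_out, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between teacher and student [CLS] distributions.
    Sketch only: teacher centering is omitted; temperatures are roughly
    in the range DINO uses (sharper teacher, softer student)."""
    p_t = F.softmax(teacher_out / tau_t, dim=-1).detach()   # teacher targets, no gradient
    log_p_s = F.log_softmax(student_out / tau_s, dim=-1)    # student log-probabilities
    return -(p_t * log_p_s).sum(dim=-1).mean()              # H(p_teacher, p_student)
```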

🚨 Critical Insight:

  • The augmentations (views A and B) change local pixel-level content (e.g., different crops, color distortions).
  • But the underlying object or concept in the image is still the same.

So:

To minimize this loss, the [CLS] token must be invariant to local differences, and hence focus on global, persistent features that remain after augmentation.


🔄 Mathematical Pressure from the Loss

If:

$$
x_{\text{cls}}^A = \sum_{j=1}^{N+1} \alpha_j^A V_j^A \quad \text{and} \quad x_{\text{cls}}^B = \sum_{k=1}^{M+1} \alpha_k^B V_k^B
$$

Then the only way to make $f(x_{\text{cls}}^A) \approx f(x_{\text{cls}}^B)$ is if:

  • The attention weights $\alpha_j^A$ and $\alpha_k^B$ focus on patches representing the same global content.
  • The value vectors $V_j^A$ and $V_k^B$ carry semantic representations rather than raw pixel statistics.

So: the network is forced to learn representations where the [CLS] token becomes robust and global, because only that minimizes the loss over different views.
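
To see this pressure in one place, here is a schematic two-view step. The `augment`, `student`, and `teacher` objects below are toy stand-ins (the real pipeline uses multi-crop augmentation, a ViT plus projection head, and an EMA-updated teacher), and `dino_loss` is the sketch from the previous section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(img):
    # Toy stand-in for DINO's crop/color augmentations: a random 160x160 crop
    # resized back to 224x224, plus a mild random brightness change.
    top, left = torch.randint(0, 64, (2,)).tolist()
    crop = img[..., top:top + 160, left:left + 160]
    crop = F.interpolate(crop, size=224, mode="bilinear", align_corners=False)
    return crop * (0.8 + 0.4 * torch.rand(1))

# Toy stand-ins for the student/teacher networks (in DINO the teacher is an
# EMA copy of the student; that update is not shown here).
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))

img = torch.rand(1, 3, 224, 224)
view_a, view_b = augment(img), augment(img)   # two different views of the same image

with torch.no_grad():
    t_a = teacher(view_a)                     # teacher output for view A
s_b = student(view_b)                         # student output for view B

loss = dino_loss(t_a, s_b)                    # small only if both outputs capture the
loss.backward()                               # shared, global content of the image
```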


📌 Final Takeaway

| Component | Role |
| --- | --- |
| [CLS] token | Structurally gathers all patch information via self-attention |
| Self-attention | Allows [CLS] to attend to the full image; nothing stops it from becoming global |
| DINO loss | Forces [CLS] to be invariant across views, which only happens if it encodes global, high-level semantics |
| Patch tokens | Still encode local details, but the DINO loss places no constraint on them (unless a masked patch loss is added) |

Would you like a visual diagram of the attention + loss flow, or a dive into how a masked patch loss works in a similar way for local representations?
