Great! Now that you understand DINO, let’s dive deeply into iBOT — Image BERT with Online Tokenizer — and how it extends DINO with masked image modeling (MIM).
You’ll see that iBOT is a hybrid of DINO and BERT-style pretraining, designed for Vision Transformers (ViT). The key innovation is that iBOT adds a patch-level loss in addition to the global [CLS] alignment, which helps the model learn both global and local representations.
🧠 Big Picture: What is iBOT?
Component | Description |
---|---|
Architecture | ViT + student-teacher model (like DINO) |
Global supervision | Match [CLS] token across views (same as DINO) |
Local supervision | Match masked patch tokens using teacher-student alignment (BERT-like) |
Goal | Learn both global semantics and fine-grained local features without labels |
So:
iBOT = DINO + masked patch prediction, without needing an external tokenizer (as BEiT does) or pixel reconstruction (as MAE does).
⚙️ iBOT Architecture Overview
Two networks:
- Student ViT: takes in masked patches
- Teacher ViT: takes in full (unmasked) image
The models share the same architecture but have different parameters:
- Teacher: EMA (momentum updated)
- Student: directly optimized
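To make the momentum update concrete, here is a minimal PyTorch-style sketch of the EMA step (the function name and the momentum value 0.996 are illustrative assumptions; in practice the papers anneal this value over training):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher is never updated by backpropagation; its parameters
    # drift slowly toward the student's after each optimizer step.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)
```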
📐 Input Representation
For each image, generate two augmented views:
- View A (e.g., full crop)
- View B (e.g., random crop)
In each view, some patch tokens are masked randomly (e.g., 40%).
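As a rough sketch, uniformly random masking could look like the snippet below (illustrative only: the iBOT paper actually uses blockwise masking with a varying ratio, and the helper name is made up):

```python
import torch

def random_patch_mask(batch_size, num_patches, mask_ratio=0.4):
    # Boolean mask per image: True = this patch is hidden from the student.
    num_masked = int(num_patches * mask_ratio)
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    for b in range(batch_size):
        idx = torch.randperm(num_patches)[:num_masked]
        mask[b, idx] = True
    return mask
```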
🧩 1. Global Loss: [CLS] Token Alignment (same as DINO)
Just like in DINO:
$$
\mathcal{L}_{\text{global}} = \text{CrossEntropy}\left(\text{softmax}\left(\frac{f_{\text{teacher}}([CLS]_A)}{\tau_t}\right), \text{softmax}\left(\frac{f_{\text{student}}([CLS]_B)}{\tau_s}\right)\right)
$$
This enforces view-invariant global features.
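A sketch of this [CLS] alignment in PyTorch (the temperatures are typical-but-assumed values, and the teacher-output centering that DINO/iBOT also apply is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def cls_alignment_loss(teacher_cls, student_cls, tau_t=0.04, tau_s=0.1):
    # teacher_cls, student_cls: (batch, out_dim) projection-head outputs
    targets = F.softmax(teacher_cls.detach() / tau_t, dim=-1)   # sharpened teacher target, no gradient
    log_probs = F.log_softmax(student_cls / tau_s, dim=-1)      # student log-probabilities
    return -(targets * log_probs).sum(dim=-1).mean()            # cross-entropy, averaged over the batch
```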
🧩 2. Local Loss: Patch Token Alignment (iBOT’s core)
This is iBOT’s main addition over DINO.
🔒 Student:
- Input has masked patches.
- Student can only see unmasked ones.
- Still produces output tokens for all patch positions (including masked ones).
👁 Teacher:
- Sees full image (no masked patches).
- Generates patch token outputs at all positions.
Now, for each masked patch, iBOT aligns the student’s predicted patch embedding to the teacher’s embedding at the same position.
🧮 Local loss:
For each masked patch index $i$:
$$
\mathcal{L}_{\text{local}} = \sum_{i \in \text{masked}} \text{CrossEntropy}\left(\text{softmax}\left(\frac{z_i^{\text{teacher}}}{\tau_t}\right), \text{softmax}\left(\frac{z_i^{\text{student}}}{\tau_s}\right)\right)
$$
Where $z_i$ is the patch token output at position $i$.
Alternatively, some variants use:
$$
\text{MSE}\left(z_i^{\text{student}}, \text{stopgrad}(z_i^{\text{teacher}})\right)
$$
but the paper emphasizes the distributional matching (softmax + cross entropy) as in DINO.
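Restricting the cross-entropy to masked positions might look like the following sketch (`mask` is a boolean tensor marking which patches were hidden from the student; names and temperature values are assumptions, not the paper’s code):

```python
import torch
import torch.nn.functional as F

def masked_patch_loss(teacher_patches, student_patches, mask, tau_t=0.04, tau_s=0.1):
    # teacher_patches, student_patches: (batch, num_patches, out_dim)
    # mask: (batch, num_patches) boolean, True where the student's input was masked
    targets = F.softmax(teacher_patches.detach() / tau_t, dim=-1)
    log_probs = F.log_softmax(student_patches / tau_s, dim=-1)
    per_patch = -(targets * log_probs).sum(dim=-1)   # (batch, num_patches) cross-entropy
    return per_patch[mask].mean()                    # average over masked positions only
```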
💡 Why is this powerful?
This loss teaches the student to reconstruct the semantic embedding of masked patches, not raw pixels.
Compared to MAE or BEiT:
Method | Predicts |
---|---|
MAE | Raw pixel values (in decoder) |
BEiT | Discrete codebook token |
iBOT | Teacher’s semantic patch embeddings |
So, iBOT avoids the need for any external tokenizer, making it fully self-supervised end-to-end.
🧪 Total Loss in iBOT:
$$
\mathcal{L}_{\text{iBOT}} = \lambda_{\text{cls}} \mathcal{L}_{\text{global}} + \lambda_{\text{patch}} \mathcal{L}_{\text{local}}
$$
With typical values:
- $\lambda_{\text{cls}} = 1.0$
- $\lambda_{\text{patch}} = 1.0$ or a decaying schedule
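Wiring the two terms together is then just a weighted sum (a sketch reusing the hypothetical helpers from the earlier snippets):

```python
def ibot_loss(teacher_cls, student_cls, teacher_patches, student_patches, mask,
              lambda_cls=1.0, lambda_patch=1.0):
    # Global [CLS] alignment plus patch-level alignment on masked positions.
    loss_global = cls_alignment_loss(teacher_cls, student_cls)
    loss_local = masked_patch_loss(teacher_patches, student_patches, mask)
    return lambda_cls * loss_global + lambda_patch * loss_local
```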
🧠 Summary: How iBOT Works
Part | Role |
---|---|
[CLS] token | Same as DINO: global feature learning via view-invariant loss |
Patch tokens | Trained to predict teacher’s semantic embeddings at masked positions |
No tokenizer | Unlike BEiT, uses teacher’s ViT as an online target generator |
Benefit | Learns both global and local representations, robust across scales and details |
👀 Bonus: Comparison with Related Models
Method | Global Loss | Local Loss | Tokenizer |
---|---|---|---|
DINO | ✅ [CLS] only | ❌ | No |
iBOT | ✅ [CLS] | ✅ masked patches | No |
MAE | ❌ | ✅ pixels | No |
BEiT | ❌ | ✅ discrete tokens | ✅ yes |
For the iBOT part, is the output compared between teacher and student no longer just the [CLS] token, but all the masked patches?
Exactly! In iBOT, the key innovation is that the output compared between teacher and student is the embeddings of masked patches, not just the [CLS] token.
Here's how it differs from DINO:
Key Differences between iBOT and DINO:
DINO:
- Global loss is calculated only between the [CLS] tokens from the teacher and student.
- The [CLS] token represents the global feature of the image and captures the overall semantics of the image.
iBOT:
- Global loss still uses the [CLS] token for view-invariant learning between teacher and student, just like DINO.
- Local loss is added in iBOT, and it focuses on matching the student’s prediction for masked patches to the teacher’s corresponding embeddings.
- The student only sees the unmasked patches and has to predict the teacher’s patch embeddings for the masked positions.
Detailed Breakdown of iBOT’s Masked Patch Alignment:
- The teacher network sees the full image and computes embeddings for all patches (including the masked ones).
- The student network sees the image with some patches masked out and computes embeddings for all patches (including the masked ones), but only the unmasked patches are visible to the student during training.
The student is trained to predict the teacher’s embeddings for the masked patches. This is done using a local loss function, which can be something like cross-entropy or mean squared error between the teacher and student outputs for each masked patch.
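One common way to implement this, and roughly what iBOT does, is to replace the patch embeddings at masked positions with a learnable [MASK] token before the student’s transformer blocks, so the student still emits an output at every position. A minimal sketch (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class MaskTokenSubstitution(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # Learnable embedding that stands in for masked patches.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patch_embeddings, mask):
        # patch_embeddings: (batch, num_patches, embed_dim)
        # mask: (batch, num_patches) boolean, True = masked for the student
        m = mask.unsqueeze(-1).type_as(patch_embeddings)
        return patch_embeddings * (1.0 - m) + self.mask_token * m
```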
💡 Example:
Let’s say we have an image with 16 patches (4x4 grid). During training, the teacher sees all 16 patches, while the student sees only the 9 unmasked patches ($P_1$ through $P_9$); the remaining 7 ($P_{10}$ through $P_{16}$) are masked.
Teacher (full) | Student (masked) |
---|---|
$P_1, P_2, \dots, P_{16}$ | $P_1, P_2, P_3, P_4, \dots, P_9$ + masked $P_{10} \dots P_{16}$ |
- The teacher computes embeddings for all patches: $z_{\text{teacher}}^1, z_{\text{teacher}}^2, \dots, z_{\text{teacher}}^{16}$.
- The student computes embeddings for all patches, but is forced to predict embeddings for the masked patches: $z_{\text{student}}^{10}, z_{\text{student}}^{11}, \dots, z_{\text{student}}^{16}$.
During training, the local loss ensures that the student’s predictions for the masked patches are close to the teacher’s embeddings for those same patches.
How the Local Loss Works:
The local loss pushes the student to learn semantic consistency between its predictions and the teacher's embeddings for masked patch tokens.
For example, using softmax and cross-entropy:
$$
\mathcal{L}_{\text{local}} = \sum_{i \in \text{masked}} \text{CrossEntropy}\left( \text{softmax}\left( \frac{z_i^{\text{teacher}}}{\tau_t} \right), \text{softmax}\left( \frac{z_i^{\text{student}}}{\tau_s} \right) \right)
$$
This makes the student learn to predict contextual patch representations and forces it to understand local spatial structure.
Final Thoughts
- In iBOT, the comparison between teacher and student is not limited to the [CLS] tokens, as it is in DINO.
- The masked patches in iBOT introduce a local loss function, which makes the model learn local details (spatial structure, texture, etc.) by predicting masked patch embeddings from the teacher.
- The [CLS] token is still used for global supervision, but iBOT adds local supervision through the masked patch embeddings, allowing it to capture both global and local image features.