<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Henri Wang</title>
    <description>The latest articles on DEV Community by Henri Wang (@henri_wang_d48b1e9bc1ea79).</description>
    <link>https://dev.to/henri_wang_d48b1e9bc1ea79</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1685439%2Fd47c862d-dccd-49cd-a49b-40da9c2c71fc.png</url>
      <title>DEV Community: Henri Wang</title>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/henri_wang_d48b1e9bc1ea79"/>
    <language>en</language>
    <item>
      <title>RuRussian is a smart Russian learning platform, but first of all, it is a public dictionary.</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:03:40 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-is-a-smart-russian-learning-platform-but-first-of-all-it-is-a-public-dictionary-2gb9</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-is-a-smart-russian-learning-platform-but-first-of-all-it-is-a-public-dictionary-2gb9</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2" class="crayons-story__hidden-navigation-link"&gt;RuRussian as a Public Dictionary: A Systems-Level Perspective&lt;/a&gt;
    &lt;div class="crayons-article__cover crayons-article__cover__image__feed"&gt;
      &lt;iframe src="https://www.youtube.com/embed/YsDKGzmFVs4" title="RuRussian as a Public Dictionary: A Systems-Level Perspective"&gt;&lt;/iframe&gt;
    &lt;/div&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/henri_wang_d48b1e9bc1ea79" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1685439%2Fd47c862d-dccd-49cd-a49b-40da9c2c71fc.png" alt="henri_wang_d48b1e9bc1ea79 profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/henri_wang_d48b1e9bc1ea79" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Henri Wang
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Henri Wang
                
              
              &lt;div id="story-author-preview-content-3438478" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/henri_wang_d48b1e9bc1ea79" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1685439%2Fd47c862d-dccd-49cd-a49b-40da9c2c71fc.png" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Henri Wang&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 1&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2" id="article-link-3438478"&gt;
          RuRussian as a Public Dictionary: A Systems-Level Perspective
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/saas"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;saas&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ux"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ux&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/community"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;community&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>RuRussian as a Public Dictionary: A Systems-Level Perspective</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:01:11 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/rurussian-as-a-public-dictionary-a-systems-level-perspective-56b2</guid>
      <description>&lt;h2&gt;
  
  
  0. Framing the Problem
&lt;/h2&gt;

&lt;p&gt;If you model a “dictionary” as a function:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;f(word) → meaning&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;then most traditional dictionaries are just key–value stores with light annotations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt; breaks this abstraction. Instead, it behaves more like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;f(word_form) → structured linguistic state space&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;where the output is not a scalar (translation), but a rich object graph encoding morphology, syntax, semantics, and usage. &lt;br&gt;
This is the key mental shift: &lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt; is not a lookup table—it is a runtime over a linguistic knowledge graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. System Overview
&lt;/h2&gt;

&lt;p&gt;At a high level, rurussian.com is a hybrid system combining:&lt;br&gt;
a lexical database&lt;br&gt;
a curated corpus&lt;br&gt;
a grammar engine&lt;br&gt;
a human + AI annotation layer&lt;br&gt;
You can think of it as a read-optimized OLAP system for language, where queries are exploratory rather than transactional.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Core Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Morphology as the Primary Index
&lt;/h3&gt;

&lt;p&gt;In most systems, the primary key is the lemma.&lt;br&gt;
In &lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt;, the effective key is closer to:&lt;br&gt;
(word_form, stress_pattern, aspect)&lt;br&gt;
The system accepts arbitrary surface forms and resolves them via an implicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;reverse morphological parser&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So instead of:&lt;br&gt;
&lt;code&gt;lookup("учиться")&lt;/code&gt;&lt;br&gt;
you can do:&lt;br&gt;
&lt;code&gt;lookup("учился") → canonical_entry("учиться")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This implies a normalization pipeline roughly like:&lt;br&gt;
input_token&lt;br&gt;
  → morphological analysis&lt;br&gt;
  → lemma resolution&lt;br&gt;
  → graph node retrieval&lt;/p&gt;
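
&lt;p&gt;To make that pipeline concrete, here is a minimal sketch of what a normalize-then-retrieve lookup could look like. The form table and entry store below are toy stand-ins invented for illustration; RuRussian's actual implementation is not public.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Toy illustration of: input_token → morphological analysis → lemma → entry.
# FORM_TO_LEMMA and ENTRIES are hypothetical stand-ins, not RuRussian's real data.

FORM_TO_LEMMA = {
    "учился": "учиться",
    "учусь": "учиться",
    "учиться": "учиться",
}

ENTRIES = {
    "учиться": {
        "lemma": "учиться",
        "aspect_pair": ["научиться", "выучиться"],
        "examples": ["Он учился в Москве."],
    },
}

def lookup(surface_form):
    """Resolve an arbitrary surface form to its canonical entry."""
    lemma = FORM_TO_LEMMA.get(surface_form)   # morphological analysis (toy table)
    if lemma is None:
        raise KeyError(f"unknown form: {surface_form}")
    return ENTRIES[lemma]                      # graph node retrieval

print(lookup("учился")["lemma"])  # учиться
&lt;/code&gt;&lt;/pre&gt;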

&lt;h3&gt;
  
  
  2.2 Word Entry = Structured Object
&lt;/h3&gt;

&lt;p&gt;Each entry is not a flat record—it’s closer to a serialized object:&lt;br&gt;
&lt;code&gt;WORD_ENTRY = {&lt;br&gt;
  "lemma": "учиться",&lt;br&gt;
  "aspect_pair": ["научиться", "выучиться"],&lt;br&gt;
  "inflections": [...],&lt;br&gt;
  "stress_map": {...},&lt;br&gt;
  "government_rules": [...],&lt;br&gt;
  "examples": [...]&lt;br&gt;
}&lt;/code&gt;&lt;br&gt;
This is already beyond dictionary territory—it resembles a typed schema for linguistic computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Graph Topology
&lt;/h3&gt;

&lt;p&gt;The entire system can be modeled as a graph:&lt;br&gt;
Nodes:&lt;br&gt;
&lt;code&gt;lemmas&lt;br&gt;
inflected forms&lt;br&gt;
sentences&lt;/code&gt;&lt;br&gt;
Edges:&lt;br&gt;
&lt;code&gt;aspect_pair (bidirectional)&lt;br&gt;
derivation (prefix transforms)&lt;br&gt;
usage (word → sentence)&lt;br&gt;
grammar constraints&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gives you something like:&lt;br&gt;
&lt;code&gt;учиться&lt;br&gt;
  ├── aspect → научиться&lt;br&gt;
  ├── aspect → выучиться&lt;br&gt;
  ├── form → учился&lt;br&gt;
  ├── form → учусь&lt;br&gt;
  └── used_in → sentence_42&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
In other words, RuRussian is effectively a domain-specific knowledge graph for Russian.&lt;/p&gt;
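
&lt;p&gt;As a rough illustration of that topology, the graph can be sketched as a plain edge list; the edge labels mirror the ones above, and the data is invented for illustration only.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Hypothetical mini-graph mirroring the edge types described above.
EDGES = [
    ("учиться", "aspect", "научиться"),
    ("учиться", "aspect", "выучиться"),
    ("учиться", "form", "учился"),
    ("учиться", "form", "учусь"),
    ("учиться", "used_in", "sentence_42"),
]

def neighbors(node, edge_type):
    """All targets reachable from `node` via edges labeled `edge_type`."""
    return [dst for src, label, dst in EDGES if src == node and label == edge_type]

print(neighbors("учиться", "aspect"))  # ['научиться', 'выучиться']
&lt;/code&gt;&lt;/pre&gt;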

&lt;h2&gt;
  
  
  3. Verb System = First-Class Citizen
&lt;/h2&gt;

&lt;p&gt;Russian verbs are where most learners (and models) fail. RuRussian treats them correctly—as a system, not a list.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Aspect as a Relation, Not a Field
&lt;/h3&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;code&gt;verb.aspect = "perfective"&lt;/code&gt;&lt;br&gt;
you get:&lt;br&gt;
&lt;code&gt;edge(учиться ↔ научиться)&lt;br&gt;
edge(учиться ↔ выучиться)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This matters because aspect is relational, multiple perfectives can exist, and meaning shifts are non-linear.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Prefixes = Semantic Operators
&lt;/h3&gt;

&lt;p&gt;Prefixes are modeled implicitly as transformations:&lt;br&gt;
&lt;code&gt;учить + на- → научить   (acquire skill)&lt;br&gt;
учить + вы- → выучить   (learn completely)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So you can think of them as:&lt;br&gt;
&lt;code&gt;prefix: function(lemma) → new_semantic_state&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is much closer to functional composition than to static vocabulary.&lt;/p&gt;
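
&lt;p&gt;Read functionally, a prefix can be sketched as a function over lemmas. The mapping below is a toy, hand-written table for the two prefixes mentioned above, not an actual derivation engine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Toy sketch: prefix as an operator that maps a lemma to a new semantic state.
PREFIX_RULES = {
    "на-": lambda lemma: ("на" + lemma, "acquire skill"),
    "вы-": lambda lemma: ("вы" + lemma, "learn completely"),
}

def apply_prefix(prefix, lemma):
    new_lemma, gloss = PREFIX_RULES[prefix](lemma)
    return {"lemma": new_lemma, "gloss": gloss}

print(apply_prefix("на-", "учить"))  # {'lemma': 'научить', 'gloss': 'acquire skill'}
&lt;/code&gt;&lt;/pre&gt;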

&lt;h2&gt;
  
  
  4. Sentence-Centric Design (Corpus Mode)
&lt;/h2&gt;

&lt;p&gt;Most dictionaries do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;definition → examples&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt; inverts this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;examples → inferred meaning&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each entry is backed by a curated mini-corpus:&lt;br&gt;
&lt;code&gt;low-noise&lt;br&gt;
grammar-controlled&lt;br&gt;
pedagogically staged&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So the system doubles as a:&lt;br&gt;
&lt;code&gt;queryable, labeled dataset for human learning&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Grammar as Embedded Metadata
&lt;/h2&gt;

&lt;p&gt;Instead of separating grammar into another subsystem, &lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt; inlines it. Each entry encodes:&lt;br&gt;
&lt;code&gt;case requirements&lt;br&gt;
verb government&lt;br&gt;
prepositional constraints&lt;br&gt;
aspect compatibility&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So effectively:&lt;br&gt;
&lt;code&gt;word = lexical_unit + grammar_rules&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This collapses the boundary between dictionary &amp;amp; grammar book.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.  UX as Query Interface
&lt;/h2&gt;

&lt;p&gt;The UI is not just design—it reflects the underlying data model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Progressive Disclosure
&lt;/h3&gt;

&lt;p&gt;level 0 → basic meaning&lt;br&gt;
level 1 → examples&lt;br&gt;
level 2 → full morphology&lt;br&gt;
level 3 → grammar constraints&lt;br&gt;
This is essentially a multi-resolution view over the same graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search = Fuzzy + Structural
&lt;/h3&gt;

&lt;p&gt;Search accepts:&lt;br&gt;
&lt;code&gt;inflected forms&lt;br&gt;
partial inputs&lt;/code&gt;&lt;br&gt;
and resolves them structurally.&lt;br&gt;
So it behaves less like:&lt;br&gt;
&lt;code&gt;string match&lt;/code&gt;&lt;br&gt;
and more like:&lt;br&gt;
&lt;code&gt;parse → normalize → retrieve&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. AI Layer (Dynamic Augmentation)
&lt;/h2&gt;

&lt;p&gt;The GPT-5 integration adds a generative component:&lt;br&gt;
entry → prompt → generated_examples&lt;br&gt;
So the system becomes:&lt;br&gt;
static knowledge base + dynamic generator&lt;br&gt;
This is analogous to:&lt;br&gt;
retrieval-augmented generation (RAG), but for language learning&lt;/p&gt;
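
&lt;p&gt;A minimal sketch of that entry-to-prompt step might look like the following; &lt;code&gt;generate&lt;/code&gt; is only a placeholder for whatever model call the site actually makes, and the prompt wording is invented.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Sketch of retrieval-augmented example generation. `generate` is a stand-in
# for an LLM call (the article mentions GPT-5); nothing here is a real API.
def build_prompt(entry):
    partners = ", ".join(entry["aspect_pair"])
    return (
        f"Write two short example sentences in Russian using the verb "
        f"{entry['lemma']} (perfective partners: {partners}). "
        f"Keep the grammar at a beginner level."
    )

def generate(prompt):
    raise NotImplementedError("stand-in for the model call")

entry = {"lemma": "учиться", "aspect_pair": ["научиться", "выучиться"]}
prompt = build_prompt(entry)
# examples = generate(prompt)   # static knowledge base + dynamic generator
&lt;/code&gt;&lt;/pre&gt;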

&lt;h2&gt;
  
  
  8. Comparison: Flat vs Graph Systems
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Traditional Dictionary&lt;/th&gt;
&lt;th&gt;&lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Data model&lt;/td&gt;&lt;td&gt;Key–value&lt;/td&gt;&lt;td&gt;Graph&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Unit&lt;/td&gt;&lt;td&gt;Lemma&lt;/td&gt;&lt;td&gt;Morphological system&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Verbs&lt;/td&gt;&lt;td&gt;Flat entries&lt;/td&gt;&lt;td&gt;Networked&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Examples&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Core&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Grammar&lt;/td&gt;&lt;td&gt;External&lt;/td&gt;&lt;td&gt;Embedded&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Learning signal&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  9. Strengths (Why This Design Works)
&lt;/h2&gt;

&lt;p&gt;Morphology-native → aligned with Russian’s complexity&lt;br&gt;
Graph structure → captures relationships explicitly&lt;br&gt;
Example-first → better for acquisition&lt;br&gt;
Schema-rich → ML-friendly (high signal density)&lt;br&gt;
In ML terms, this is a highly structured, low-noise supervised dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Limitations (Trade-offs)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Not O(1) Lookup Friendly
&lt;/h3&gt;

&lt;p&gt;If your goal is:&lt;br&gt;
word → quick translation&lt;br&gt;
this system is overkill.&lt;br&gt;
Latency (cognitive + UI) is higher because:&lt;br&gt;
output size is large&lt;br&gt;
structure must be interpreted&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage vs Quality
&lt;/h3&gt;

&lt;p&gt;Because data is curated:&lt;br&gt;
precision ↑&lt;br&gt;
recall ↓&lt;br&gt;
i.e., better data, smaller surface area&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The cleanest abstraction is:&lt;br&gt;
&lt;code&gt;RuRussian = Linguistic Knowledge Graph + Query Interface + Generative Layer&lt;/code&gt;&lt;br&gt;
Or more formally:&lt;br&gt;
&lt;code&gt;System = (Graph, Parser, UI, Generator)&lt;/code&gt;&lt;br&gt;
Where:&lt;br&gt;
&lt;code&gt;Graph = structured linguistic data&lt;br&gt;
Parser = morphology resolver&lt;br&gt;
UI = multi-level query interface&lt;br&gt;
Generator = GPT-based augmentation&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rurussian.com/" rel="noopener noreferrer"&gt;RuRussian&lt;/a&gt; is not “a better dictionary.” It is a different class of system. Instead of answering: “What does this word mean?” it answers:&lt;br&gt;
“What is the full state space of this word in the language system?”&lt;/p&gt;

&lt;p&gt;That shift—from lookup to structure—is what makes it powerful, and also what makes it fundamentally non-traditional as a public dictionary.&lt;/p&gt;

</description>
      <category>saas</category>
      <category>ux</category>
      <category>webdev</category>
      <category>community</category>
    </item>
    <item>
      <title>What is the KoLeo regularizer doing in DINOv2?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 09:24:24 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/what-is-koleo-regularizer-doing-in-dinov2-307j</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/what-is-koleo-regularizer-doing-in-dinov2-307j</guid>
<description>&lt;p&gt;The &lt;strong&gt;Kozachenko-Leonenko (KoLeo) regularizer&lt;/strong&gt; in &lt;strong&gt;DINOv2&lt;/strong&gt; is a technique used to encourage a &lt;strong&gt;uniform distribution&lt;/strong&gt; of the learned features in the embedding space. It helps prevent feature collapse (where all features become identical) and promotes better representation learning by maximizing the information content of the embeddings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How KoLeo Regularizer Works in DINOv2&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Motivation&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-supervised learning methods like DINOv2 rely on contrastive learning or distillation to learn useful representations without labels.&lt;/li&gt;
&lt;li&gt;A common failure mode is &lt;strong&gt;feature collapse&lt;/strong&gt;, where embeddings become too similar, reducing discriminative power.&lt;/li&gt;
&lt;li&gt;The KoLeo regularizer encourages embeddings to be &lt;strong&gt;well-spread&lt;/strong&gt; in the feature space, improving generalization.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mathematical Formulation&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
The KoLeo regularizer is based on &lt;strong&gt;entropy estimation&lt;/strong&gt; using nearest-neighbor distances. For a batch of embeddings $\{x_i\}_{i=1}^N$, the regularizer computes:&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
\mathcal{L}_{\text{koleo}} = -\frac{1}{N} \sum_{i=1}^N \log \rho_i&lt;br&gt;
$$&lt;br&gt;
where $\rho_i$ is the distance of $x_i$ to its nearest neighbor in the batch (a minimal code sketch follows this list).  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimizing this loss pushes embeddings &lt;strong&gt;away from each other&lt;/strong&gt;, promoting uniformity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Role in DINOv2&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DINOv2 uses a &lt;strong&gt;student-teacher distillation framework&lt;/strong&gt; where the student network is trained to match the teacher's outputs.&lt;/li&gt;
&lt;li&gt;The KoLeo regularizer is applied to the &lt;strong&gt;student's embeddings&lt;/strong&gt; to prevent them from collapsing into a single point or small cluster.&lt;/li&gt;
&lt;li&gt;It complements other techniques like &lt;strong&gt;centering &amp;amp; sharpening&lt;/strong&gt; in the teacher network.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoids collapse&lt;/strong&gt;: Ensures diverse and informative features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No explicit negative samples&lt;/strong&gt;: Unlike contrastive learning (e.g., SimCLR), it doesn’t require large batches for negative pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computationally efficient&lt;/strong&gt;: Only requires nearest-neighbor distances within a batch.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
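
&lt;p&gt;Below is a minimal PyTorch sketch of the batch-level KoLeo term as written above (nearest-neighbor distance per sample, then the negative mean log). It is not DINOv2's exact implementation; the unit-normalization step and the epsilon are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch

def koleo_loss(embeddings, eps=1e-8):
    """Sketch of the KoLeo regularizer: -mean(log(nearest-neighbor distance)).

    embeddings: (N, D) batch of feature vectors.
    """
    x = torch.nn.functional.normalize(embeddings, dim=-1)  # assumption: unit-norm features
    dists = torch.cdist(x, x)                               # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))                      # ignore self-distance
    rho = dists.min(dim=1).values                           # distance to nearest neighbor
    return -torch.log(rho + eps).mean()

# Example: well-spread embeddings give a lower loss than nearly collapsed ones.
spread = torch.randn(128, 256)
collapsed = torch.randn(1, 256).repeat(128, 1) + 1e-3 * torch.randn(128, 256)
print(koleo_loss(spread).item(), koleo_loss(collapsed).item())
&lt;/code&gt;&lt;/pre&gt;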

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparison to Other Regularization Techniques&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KoLeo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent feature collapse&lt;/td&gt;
&lt;td&gt;Maximizes nearest-neighbor distances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uniformity Loss (e.g., in SimCLR)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spread out embeddings&lt;/td&gt;
&lt;td&gt;Contrastive learning with negative pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sharpening (DINO)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid trivial solutions&lt;/td&gt;
&lt;td&gt;Temperature scaling in softmax&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
    </item>
    <item>
      <title>Since I know how the DINO loss works, I want to know how iBOT does its job.</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 09:00:36 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/ok-good-now-since-i-know-how-dino-loss-works-i-want-to-know-how-ibot-does-its-job-16h9</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/ok-good-now-since-i-know-how-dino-loss-works-i-want-to-know-how-ibot-does-its-job-16h9</guid>
      <description>&lt;p&gt;Great! Now that you understand &lt;strong&gt;DINO&lt;/strong&gt;, let’s dive deeply into &lt;strong&gt;iBOT&lt;/strong&gt; — &lt;strong&gt;Image BERT with Online Tokenizer&lt;/strong&gt; — and how it extends DINO with &lt;strong&gt;masked image modeling (MIM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You’ll see that iBOT is a &lt;strong&gt;hybrid&lt;/strong&gt; of DINO and BERT-style pretraining, designed for Vision Transformers (ViT). The key innovation is that &lt;strong&gt;iBOT adds a patch-level loss in addition to the global &lt;code&gt;[CLS]&lt;/code&gt; alignment&lt;/strong&gt;, which helps the model learn both global and local representations.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Big Picture: What is iBOT?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;ViT + student-teacher model (like DINO)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global supervision&lt;/td&gt;
&lt;td&gt;Match &lt;code&gt;[CLS]&lt;/code&gt; token across views (same as DINO)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local supervision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match &lt;strong&gt;masked patch tokens&lt;/strong&gt; using teacher-student alignment (BERT-like)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal&lt;/td&gt;
&lt;td&gt;Learn both &lt;strong&gt;global semantics&lt;/strong&gt; and &lt;strong&gt;fine-grained local features&lt;/strong&gt; without labels&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;iBOT = DINO + masked patch prediction, &lt;strong&gt;without needing a tokenizer like MAE or BEiT&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚙️ iBOT Architecture Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two networks:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Student ViT&lt;/strong&gt;: takes in &lt;em&gt;masked&lt;/em&gt; patches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teacher ViT&lt;/strong&gt;: takes in full (unmasked) image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The models share the same architecture but have different parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher: EMA (momentum updated)&lt;/li&gt;
&lt;li&gt;Student: directly optimized&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📐 Input Representation
&lt;/h2&gt;

&lt;p&gt;For each image, generate &lt;strong&gt;two augmented views&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;View A (e.g., full crop)&lt;/li&gt;
&lt;li&gt;View B (e.g., random crop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each view, some patch tokens are &lt;strong&gt;masked&lt;/strong&gt; randomly (e.g., 40%).&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 1. Global Loss: &lt;code&gt;[CLS]&lt;/code&gt; Token Alignment (same as DINO)
&lt;/h2&gt;

&lt;p&gt;Just like in DINO:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\mathcal{L}_{\text{global}} = \text{CrossEntropy}\left(\text{softmax}\left(\frac{f_{\text{teacher}}([CLS]_A)}{\tau_t}\right), \text{softmax}\left(\frac{f_{\text{student}}([CLS]_B)}{\tau_s}\right)\right)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;This enforces &lt;strong&gt;view-invariant global features&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 2. Local Loss: Patch Token Alignment (iBOT’s core)
&lt;/h2&gt;

&lt;p&gt;This is iBOT’s &lt;strong&gt;main addition&lt;/strong&gt; over DINO.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔒 Student:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Input has masked patches.&lt;/li&gt;
&lt;li&gt;Student can &lt;strong&gt;only see&lt;/strong&gt; unmasked ones.&lt;/li&gt;
&lt;li&gt;Still produces &lt;em&gt;output tokens&lt;/em&gt; for all patch positions (including masked ones).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  👁 Teacher:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sees full image (no masked patches).&lt;/li&gt;
&lt;li&gt;Generates patch token outputs at all positions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, for each &lt;strong&gt;masked patch&lt;/strong&gt;, iBOT aligns the &lt;strong&gt;student’s predicted patch embedding&lt;/strong&gt; to the &lt;strong&gt;teacher’s embedding&lt;/strong&gt; at the same position.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧮 Local loss:
&lt;/h3&gt;

&lt;p&gt;For each masked patch index $i$:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\mathcal{L}_{\text{local}} = \sum_{i \in \text{masked}} \text{CrossEntropy}\left(\text{softmax}\left(\frac{z_i^{\text{teacher}}}{\tau_t}\right), \text{softmax}\left(\frac{z_i^{\text{student}}}{\tau_s}\right)\right)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;Where $z_i$ is the patch token output at position $i$.&lt;/p&gt;

&lt;p&gt;Alternatively, some variants use:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\text{MSE}\left(z_i^{\text{student}}, \text{stopgrad}(z_i^{\text{teacher}})\right)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;but the paper emphasizes the &lt;strong&gt;distributional matching (softmax + cross entropy)&lt;/strong&gt; as in DINO.&lt;/p&gt;
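
&lt;p&gt;As a concrete illustration, here is a minimal PyTorch sketch of the masked-patch distillation term in the cross-entropy form above. The projection heads, centering, and masking strategy of the real iBOT implementation are omitted; tensor shapes and temperatures here are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn.functional as F

def ibot_patch_loss(student_tokens, teacher_tokens, mask, tau_s=0.1, tau_t=0.04):
    """Sketch of iBOT's local loss on masked patch tokens.

    student_tokens, teacher_tokens: (B, N, K) patch-token logits from each network.
    mask: (B, N) boolean tensor, True where the patch was masked for the student.
    """
    teacher_probs = F.softmax(teacher_tokens / tau_t, dim=-1)      # target distribution
    student_logp = F.log_softmax(student_tokens / tau_s, dim=-1)   # student prediction
    ce = -(teacher_probs * student_logp).sum(dim=-1)               # (B, N) cross-entropy
    return ce[mask].mean()                                         # average over masked patches

# Toy usage: batch of 2 images, 16 patches, 64-dim prototype scores, ~40% masked.
student = torch.randn(2, 16, 64, requires_grad=True)
teacher = torch.randn(2, 16, 64)
mask = torch.rand(2, 16).lt(0.4)
print(ibot_patch_loss(student, teacher, mask).item())
&lt;/code&gt;&lt;/pre&gt;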




&lt;h2&gt;
  
  
  💡 Why is this powerful?
&lt;/h2&gt;

&lt;p&gt;This loss teaches the student to &lt;strong&gt;reconstruct the semantic embedding&lt;/strong&gt; of masked patches, &lt;em&gt;not&lt;/em&gt; raw pixels.&lt;/p&gt;

&lt;p&gt;Compared to MAE or BEiT:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Predicts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MAE&lt;/td&gt;
&lt;td&gt;Raw pixel values (in decoder)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BEiT&lt;/td&gt;
&lt;td&gt;Discrete codebook token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iBOT&lt;/td&gt;
&lt;td&gt;Teacher’s &lt;strong&gt;semantic patch embeddings&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So, iBOT &lt;strong&gt;avoids the need for any external tokenizer&lt;/strong&gt;, making it fully self-supervised end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧪 Total Loss in iBOT:
&lt;/h2&gt;

&lt;p&gt;$$&lt;br&gt;
\mathcal{L}_{\text{iBOT}} = \lambda_{\text{cls}} \mathcal{L}_{\text{global}} + \lambda_{\text{patch}} \mathcal{L}_{\text{local}}&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;With typical values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$\lambda_{\text{cls}} = 1.0$&lt;/li&gt;
&lt;li&gt;$\lambda_{\text{patch}} = 1.0$ or a decaying schedule&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 Summary: How iBOT Works
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Part&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;[CLS]&lt;/code&gt; token&lt;/td&gt;
&lt;td&gt;Same as DINO: global feature learning via view-invariant loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Patch tokens&lt;/td&gt;
&lt;td&gt;Trained to predict teacher’s semantic embeddings at &lt;strong&gt;masked positions&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No tokenizer&lt;/td&gt;
&lt;td&gt;Unlike BEiT, uses teacher’s ViT as an online target generator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benefit&lt;/td&gt;
&lt;td&gt;Learns both global and local representations, robust across scales and details&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  👀 Bonus: Comparison with Related Models
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Global Loss&lt;/th&gt;
&lt;th&gt;Local Loss&lt;/th&gt;
&lt;th&gt;Tokenizer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DINO&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;[CLS]&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iBOT&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;[CLS]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅ masked patches&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAE&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ pixels&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BEiT&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ discrete tokens&lt;/td&gt;
&lt;td&gt;✅ yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;For the iBOT part, is the output compared between teacher and student no longer just the &lt;code&gt;[CLS]&lt;/code&gt; token, but all of the masked patch tokens?&lt;/p&gt;

&lt;p&gt;Exactly! In &lt;strong&gt;iBOT&lt;/strong&gt;, the key innovation is that &lt;strong&gt;the output compared between teacher and student is the embeddings of masked patches&lt;/strong&gt;, not just the &lt;code&gt;[CLS]&lt;/code&gt; token.&lt;/p&gt;

&lt;p&gt;Here's how it differs from DINO:&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Differences between iBOT and DINO:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;DINO:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global loss&lt;/strong&gt; is calculated only between the &lt;code&gt;[CLS]&lt;/code&gt; tokens from the teacher and student.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;[CLS]&lt;/code&gt; token represents the &lt;strong&gt;global feature&lt;/strong&gt; of the image and captures the &lt;strong&gt;overall semantics&lt;/strong&gt; of the image.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;iBOT:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global loss&lt;/strong&gt; still uses the &lt;code&gt;[CLS]&lt;/code&gt; token for view-invariant learning between teacher and student, just like DINO.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local loss&lt;/strong&gt; is added in iBOT, and it focuses on &lt;strong&gt;matching the student’s prediction for masked patches&lt;/strong&gt; to the teacher’s corresponding embeddings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;student only sees the unmasked patches&lt;/strong&gt; and has to &lt;strong&gt;predict&lt;/strong&gt; the teacher’s patch embeddings for the masked positions.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  Detailed Breakdown of iBOT’s Masked Patch Alignment:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;teacher network&lt;/strong&gt; sees the full image and computes embeddings for all patches (including the masked ones).&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;student network&lt;/strong&gt; sees the image with some patches masked out and computes embeddings for all patches (including the masked ones), but &lt;strong&gt;only the unmasked patches are visible to the student during training&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;student&lt;/strong&gt; is trained to predict the &lt;strong&gt;teacher’s embeddings&lt;/strong&gt; for the &lt;strong&gt;masked patches&lt;/strong&gt;. This is done using a &lt;strong&gt;local loss&lt;/strong&gt; function, which can be something like &lt;strong&gt;cross-entropy&lt;/strong&gt; or &lt;strong&gt;mean squared error&lt;/strong&gt; between the teacher and student outputs for each masked patch.&lt;/p&gt;




&lt;h3&gt;
  
  
  💡 Example:
&lt;/h3&gt;

&lt;p&gt;Let’s say we have an image with 16 patches (4x4 grid). During training, the student sees only the 9 unmasked patches ($P_1$ through $P_9$), and the teacher sees all 16.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;
&lt;strong&gt;Teacher&lt;/strong&gt; (full)&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Student&lt;/strong&gt; (masked)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;$P_1, P_2, \dots, P_{16}$&lt;/td&gt;
&lt;td&gt;$P_1, P_2, P_3, P_4, \dots, P_9$ + masked $P_{10} \dots P_{16}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;teacher&lt;/strong&gt; computes embeddings for all patches: $z_{\text{teacher}}^1, z_{\text{teacher}}^2, \dots, z_{\text{teacher}}^{16}$.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;student&lt;/strong&gt; computes embeddings for all patches, but is forced to predict embeddings for the masked patches: $z_{\text{student}}^{10}, z_{\text{student}}^{11}, \dots, z_{\text{student}}^{16}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During training, the &lt;strong&gt;local loss&lt;/strong&gt; ensures that the &lt;strong&gt;student’s predictions for the masked patches&lt;/strong&gt; are close to the &lt;strong&gt;teacher’s embeddings for those same patches&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  How the Local Loss Works:
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;local loss&lt;/strong&gt; pushes the student to learn &lt;strong&gt;semantic consistency&lt;/strong&gt; between its predictions and the teacher's embeddings for &lt;strong&gt;masked patch tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, using &lt;strong&gt;softmax and cross-entropy&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\mathcal{L}_{\text{local}} = \sum_{i \in \text{masked}} \text{CrossEntropy}\left( \text{softmax}\left( \frac{z_i^{\text{teacher}}}{\tau_t} \right), \text{softmax}\left( \frac{z_i^{\text{student}}}{\tau_s} \right) \right)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;This makes the student learn to predict &lt;strong&gt;contextual patch representations&lt;/strong&gt; and forces it to understand &lt;strong&gt;local spatial structure&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In iBOT&lt;/strong&gt;, the student model is &lt;strong&gt;not limited to comparing just the &lt;code&gt;[CLS]&lt;/code&gt; tokens&lt;/strong&gt;, like in DINO.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;masked patches&lt;/strong&gt; in iBOT introduce a &lt;strong&gt;local loss&lt;/strong&gt; function, which &lt;strong&gt;makes the model learn local details&lt;/strong&gt; (spatial, texture, etc.) by &lt;strong&gt;predicting masked patch embeddings&lt;/strong&gt; from the teacher.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;[CLS]&lt;/code&gt; token is still used for global supervision&lt;/strong&gt;, but iBOT adds local supervision through the masked patch embeddings, allowing it to capture both &lt;strong&gt;global and local image features&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>How is [CLS] token assumed to carry global information during training?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 08:42:53 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/how-is-cls-token-assumed-to-carry-global-information-during-training-3bhj</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/how-is-cls-token-assumed-to-carry-global-information-during-training-3bhj</guid>
      <description>&lt;p&gt;During training, the DINO loss forces the [CLS] output to be view-invariant, i.e., different augmentations must map to similar [CLS] outputs.&lt;br&gt;
This encourages the network to focus on high-level, global semantics rather than local patterns (which change under crops/augmentations).&lt;/p&gt;

&lt;p&gt;Why does requiring different view augmentations to map to similar [CLS] outputs force the [CLS] token to gather global information about the image?&lt;/p&gt;

&lt;p&gt;Excellent — you're now asking &lt;strong&gt;why the DINO loss mathematically forces the &lt;code&gt;[CLS]&lt;/code&gt; token to gather global information&lt;/strong&gt;, and how this arises &lt;strong&gt;from the attention mechanism&lt;/strong&gt; in Vision Transformers.&lt;/p&gt;

&lt;p&gt;Let’s &lt;strong&gt;prove this through the structure of self-attention&lt;/strong&gt;, and show why &lt;strong&gt;matching &lt;code&gt;[CLS]&lt;/code&gt; outputs across different augmentations&lt;/strong&gt; enforces global feature learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Key Question Restated:
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Why does requiring the &lt;code&gt;[CLS]&lt;/code&gt; outputs of different views (augmentations) of the same image to be similar &lt;strong&gt;force the &lt;code&gt;[CLS]&lt;/code&gt; token to encode global information&lt;/strong&gt;?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We'll explain this from two perspectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Architecture: how self-attention distributes information into &lt;code&gt;[CLS]&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loss signal: how the DINO loss pushes &lt;code&gt;[CLS]&lt;/code&gt; to represent augmentation-invariant (i.e., global) content&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1️⃣ Self-Attention Architecture: &lt;code&gt;[CLS]&lt;/code&gt; as an Information Aggregator
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🧮 Transformer Input
&lt;/h3&gt;

&lt;p&gt;For an image split into $N$ patches, the input is:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
X = [x_{\text{cls}}, x_1, x_2, \dots, x_N] \in \mathbb{R}^{(N+1) \times d}&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$x_{\text{cls}}$ is a learned &lt;code&gt;[CLS]&lt;/code&gt; token.&lt;/li&gt;
&lt;li&gt;$x_i$ are patch tokens from image.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚙️ Self-Attention Mechanism
&lt;/h3&gt;

&lt;p&gt;Each token (including &lt;code&gt;[CLS]&lt;/code&gt;) is updated by attending to &lt;strong&gt;all other tokens&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Let:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $W^{Q,K,V} \in \mathbb{R}^{d \times d_h}$&lt;/li&gt;
&lt;li&gt;Attention output for token $t$:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;$$&lt;br&gt;
\text{Attn}(x_t) = \sum_{j=1}^{N+1} \text{softmax}_j \left( \frac{Q_t K_j^\top}{\sqrt{d_h}} \right) V_j&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;So the updated &lt;code&gt;[CLS]&lt;/code&gt; token is:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
x_{\text{cls}}^{(l+1)} = \text{Attn}(x_{\text{cls}}^{(l)})&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;weighted sum over all token embeddings&lt;/strong&gt; — meaning, the &lt;code&gt;[CLS]&lt;/code&gt; token &lt;strong&gt;aggregates all patch information&lt;/strong&gt; in each layer.&lt;/p&gt;

&lt;p&gt;✅ &lt;em&gt;This is architectural: the &lt;code&gt;[CLS]&lt;/code&gt; token is &lt;strong&gt;designed to gather global information&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
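
&lt;p&gt;To make the aggregation explicit, here is a small single-head self-attention sketch in PyTorch showing that the updated [CLS] vector is literally a softmax-weighted sum over every token, patches included. The dimensions and the single-head simplification are assumptions made only for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d = 16, 32                      # 16 patch tokens, toy embedding size
X = torch.randn(N + 1, d)          # row 0 is the [CLS] token, rows 1..N are patches

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # toy projection weights
Q, K, V = X @ Wq, X @ Wk, X @ Wv

attn = F.softmax(Q @ K.T / d**0.5, dim=-1)   # (N+1, N+1) attention weights
cls_updated = attn[0] @ V                     # weighted sum over ALL tokens

# The [CLS] row of `attn` spreads its weight across every patch token,
# so cls_updated mixes information from the whole image.
print(attn[0].shape, attn[0].sum().item())    # torch.Size([17]) 1.0
&lt;/code&gt;&lt;/pre&gt;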




&lt;h2&gt;
  
  
  2️⃣ DINO Loss Pushes for View-Invariance → Forces Global Semantics
&lt;/h2&gt;

&lt;p&gt;Now add the &lt;strong&gt;DINO loss&lt;/strong&gt;, applied to &lt;code&gt;[CLS]&lt;/code&gt; tokens from different augmented views of the same image:&lt;/p&gt;

&lt;h3&gt;
  
  
  🖼️ View 1 (e.g. crop A):
&lt;/h3&gt;

&lt;p&gt;$$&lt;br&gt;
X_A = [x_{\text{cls}}^A, x_1^A, \dots, x_N^A] \to f_A(x_{\text{cls}}^A)&lt;br&gt;
$$&lt;/p&gt;

&lt;h3&gt;
  
  
  🖼️ View 2 (e.g. crop B):
&lt;/h3&gt;

&lt;p&gt;$$&lt;br&gt;
X_B = [x_{\text{cls}}^B, x_1^B, \dots, x_M^B] \to f_B(x_{\text{cls}}^B)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\mathcal{L}_{\text{DINO}} = \text{CrossEntropy}\left(\text{softmax}\left(\frac{f_{\text{teacher}}(x_{\text{cls}}^A)}{\tau_t}\right), \text{softmax}\left(\frac{f_{\text{student}}(x_{\text{cls}}^B)}{\tau_s}\right)\right)&lt;br&gt;
$$&lt;/p&gt;

&lt;h3&gt;
  
  
  🚨 Critical Insight:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The augmentations (views A and B) &lt;strong&gt;change local pixel-level content&lt;/strong&gt; (e.g., different crops, color distortions).&lt;/li&gt;
&lt;li&gt;But the &lt;strong&gt;underlying object or concept&lt;/strong&gt; in the image is still the same.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To minimize this loss, the &lt;code&gt;[CLS]&lt;/code&gt; token &lt;strong&gt;must be invariant to local differences&lt;/strong&gt;, and hence &lt;strong&gt;focus on global, persistent features&lt;/strong&gt; that remain after augmentation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔄 Mathematical Pressure from the Loss
&lt;/h2&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
x_{\text{cls}}^A = \sum_{j=1}^{N+1} \alpha_j^A V_j^A \quad \text{and} \quad x_{\text{cls}}^B = \sum_{k=1}^{M+1} \alpha_k^B V_k^B&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;Then the only way to make $f(x_{\text{cls}}^A) \approx f(x_{\text{cls}}^B)$ is if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The attention weights $\alpha_j^A$ and $\alpha_k^B$ focus on &lt;strong&gt;patches representing the same global content&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The values $V_j, V_k$ contain representations that are &lt;strong&gt;semantic&lt;/strong&gt;, not raw pixel statistics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So: the network is &lt;strong&gt;forced&lt;/strong&gt; to learn representations where the &lt;code&gt;[CLS]&lt;/code&gt; token becomes &lt;strong&gt;robust and global&lt;/strong&gt;, because only that minimizes the loss over different views.&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 Final Takeaway
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;[CLS]&lt;/code&gt; token&lt;/td&gt;
&lt;td&gt;Structurally gathers all patch information via self-attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-attention&lt;/td&gt;
&lt;td&gt;Allows &lt;code&gt;[CLS]&lt;/code&gt; to attend to full image — nothing stops it from becoming global&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DINO loss&lt;/td&gt;
&lt;td&gt;Forces &lt;code&gt;[CLS]&lt;/code&gt; to be &lt;strong&gt;invariant across views&lt;/strong&gt;, which only happens if it encodes &lt;strong&gt;global, high-level semantics&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Patch tokens&lt;/td&gt;
&lt;td&gt;Still encode local details, but no constraint is placed on them by the DINO loss (unless masked patch loss is added)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;Would you like a visual diagram of the attention + loss flow, or dive into how masked patch loss mathematically works in a similar way for local representation?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Does DINO loss compare the [CLS] tokens from both teacher and student?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 08:31:42 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/does-dino-loss-compare-the-cls-tokens-from-both-teacher-and-student-2900</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/does-dino-loss-compare-the-cls-tokens-from-both-teacher-and-student-2900</guid>
      <description>&lt;p&gt;Yes, exactly.&lt;/p&gt;

&lt;p&gt;In DINO and DINOv2, the DINO loss is applied between the [CLS] tokens of the teacher and student models.&lt;/p&gt;

&lt;p&gt;The [CLS] token output from the teacher is softmaxed with temperature and centered.&lt;br&gt;
The student is trained to match this distribution using cross-entropy loss.&lt;br&gt;
Each view of the same image produces one [CLS] embedding, and the goal is to make the student’s [CLS] output match the teacher’s.&lt;br&gt;
So, the comparison is always between the [CLS] tokens, across different augmentations of the same image.&lt;/p&gt;
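
&lt;p&gt;A minimal PyTorch sketch of that [CLS]-level loss is below; the temperatures and the way the center is maintained (an EMA of teacher outputs) are typical choices for illustration, not the exact values used in the papers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn.functional as F

def dino_cls_loss(student_cls, teacher_cls, center, tau_s=0.1, tau_t=0.04):
    """Sketch of the DINO loss between student and teacher [CLS] outputs.

    student_cls, teacher_cls: (B, K) projection-head outputs for one view pair.
    center: (K,) running center subtracted from the teacher (anti-collapse).
    """
    teacher_probs = F.softmax((teacher_cls - center) / tau_t, dim=-1).detach()  # centered + sharpened
    student_logp = F.log_softmax(student_cls / tau_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()                   # cross-entropy

# Toy usage with two augmented views of the same batch.
B, K = 8, 256
center = torch.zeros(K)
student_view_b = torch.randn(B, K, requires_grad=True)
teacher_view_a = torch.randn(B, K)
loss = dino_cls_loss(student_view_b, teacher_view_a, center)
center = 0.9 * center + 0.1 * teacher_view_a.mean(dim=0)   # EMA update of the center
print(loss.item())
&lt;/code&gt;&lt;/pre&gt;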

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>why similar smooth gradient in all images indicate strong covariance? and how strong covariance is related to high variance?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 07:39:15 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/why-similar-smooth-gradient-in-all-images-indicate-strong-covariance-and-how-strong-covariance-is-3kei</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/why-similar-smooth-gradient-in-all-images-indicate-strong-covariance-and-how-strong-covariance-is-3kei</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Why Smooth Gradients → Strong Covariance → High Variance in PCA&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To understand why smooth, consistent gradients across images lead to strong covariance and high variance in PCA, let’s break it down step-by-step with intuition, math, and examples.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1. Definitions Recap&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Covariance&lt;/strong&gt;: Measures how two pixels (or features) vary &lt;em&gt;together&lt;/em&gt; across images.

&lt;ul&gt;
&lt;li&gt;High covariance: Pixels increase/decrease &lt;em&gt;in sync&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;Low covariance: Pixels change independently.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Variance&lt;/strong&gt;: A special case of covariance (how a single pixel varies across images).
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;For a centered dataset $X$ (size $N \times D$), the covariance matrix $C$ is:&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
C_{jk} = \frac{1}{N} \sum_{i=1}^N x_{ij} x_{ik},&lt;br&gt;
$$&lt;br&gt;
where $x_{ij}$ is the value of pixel $j$ in image $i$.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Smooth Gradients → Strong Covariance&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Intuition&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;smooth gradient&lt;/strong&gt; (e.g., left-to-right lighting in faces) means:

&lt;ul&gt;
&lt;li&gt;Pixel values change &lt;em&gt;slowly and predictably&lt;/em&gt; across the image.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;All images&lt;/em&gt; share this pattern (e.g., left cheeks are always brighter than right cheeks).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Consider two pixels, $p_1$ (left cheek) and $p_2$ (right cheek), across 3 face images:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;th&gt;$p_1$&lt;/th&gt;
&lt;th&gt;$p_2$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Face1&lt;/td&gt;&lt;td&gt;+10&lt;/td&gt;&lt;td&gt;+5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Face2&lt;/td&gt;&lt;td&gt;+8&lt;/td&gt;&lt;td&gt;+4&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Face3&lt;/td&gt;&lt;td&gt;+12&lt;/td&gt;&lt;td&gt;+6&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Covariance calculation&lt;/strong&gt;:
$$
C_{12} = \frac{(10 \cdot 5) + (8 \cdot 4) + (12 \cdot 6)}{3} = \frac{50 + 32 + 72}{3} \approx 51.3
$$

&lt;ul&gt;
&lt;li&gt;High positive value because $p_1$ and $p_2$ scale together across images.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why?&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Smooth gradients create &lt;strong&gt;consistent pixel relationships&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;If ( p_1 ) increases, ( p_2 ) &lt;em&gt;also&lt;/em&gt; increases (but slightly less, due to gradient).
&lt;/li&gt;
&lt;li&gt;This consistency across images → large ( C_{jk} ).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Strong Covariance → High Variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Link to Eigenvalues&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;PCA’s eigenvalues $\lambda$ (variances) come from the covariance matrix $C$:&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
C v = \lambda v.&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eigenvectors ( v )&lt;/strong&gt;: Directions where pixel values co-vary strongly.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eigenvalues ( \lambda )&lt;/strong&gt;: Variance along those directions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why Smooth Gradients Maximize Variance&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shared Structure&lt;/strong&gt;: If all images have a left-to-right lighting gradient, PCA finds a direction ( v ) where:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Projecting images onto ( v ) yields &lt;em&gt;large, consistent values&lt;/em&gt; (high variance).
&lt;/li&gt;
&lt;li&gt;Example: ( v ) might assign positive weights to left pixels and negative to right pixels.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Variance Calculation&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
For eigenvector $v$ aligned with the gradient:&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
\text{Var}(v) = \lambda = v^T C v.&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since ( C ) has large values for gradient-related pixels, ( \lambda ) is large.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Numerical Example&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Suppose $v = [1, -1]$ (left minus right cheek):&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
\text{Var}(v) = v^\top C v = \begin{bmatrix} 1 &amp;amp; -1 \end{bmatrix} \begin{bmatrix} C_{11} &amp;amp; C_{12} \\ C_{21} &amp;amp; C_{22} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = C_{11} + C_{22} - 2C_{12}.&lt;br&gt;
$$&lt;br&gt;
If $C_{12}$ is large and positive (strong covariance), this difference direction carries little variance; the variance instead concentrates along the complementary direction $[1, 1]$, where $\text{Var} = C_{11} + C_{22} + 2C_{12}$ dominates, and that is the direction PCA picks up.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Contrast with High-Frequency Noise&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Noise/Edges&lt;/strong&gt;: Pixel values change &lt;em&gt;randomly&lt;/em&gt; across images.

&lt;ul&gt;
&lt;li&gt;Example: Freckles appear at different positions → ( C_{jk} \approx 0 ).
&lt;/li&gt;
&lt;li&gt;No consistent direction to maximize → small ( \lambda ).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Smooth Gradients&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create &lt;strong&gt;predictable pixel relationships&lt;/strong&gt; → high covariance ( C_{jk} ).
&lt;/li&gt;
&lt;li&gt;Allow PCA to find a direction ( v ) where projections vary strongly → high variance ( \lambda ).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PCA’s Priority&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top PCs align with &lt;strong&gt;globally consistent patterns&lt;/strong&gt; (gradients, lighting).
&lt;/li&gt;
&lt;li&gt;Discards &lt;strong&gt;inconsistent patterns&lt;/strong&gt; (noise, high-frequency details).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Meaning&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smooth gradients often correspond to &lt;strong&gt;lighting, pose, or shape&lt;/strong&gt; → top PCs look "meaningful".
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Visualization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine stretching a rubber band over the data:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PCA’s first eigenvector ( v_1 ) is the direction where the band is &lt;strong&gt;most stretched&lt;/strong&gt; (max variance).
&lt;/li&gt;
&lt;li&gt;Smooth gradients stretch it far; noise barely moves it.
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Final Answer&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Smooth gradients indicate strong covariance because they make pixels co-vary &lt;em&gt;consistently&lt;/em&gt; across images. PCA’s variance-maximizing objective then assigns large eigenvalues to these directions, prioritizing them as top principal components. This is why low-frequency patterns dominate in PCA, while high-frequency noise vanishes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deep Dive: How PCA Discovers Lighting Gradients&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s break down exactly why a left-to-right lighting gradient across images leads PCA to find a direction ( v ) that maximizes variance. We’ll use a concrete example with numbers to illustrate the math.&lt;/p&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;1. The Dataset Setup&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Suppose we have &lt;strong&gt;3 grayscale face images&lt;/strong&gt;, each with &lt;strong&gt;2 pixels&lt;/strong&gt; (simplified for clarity):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pixel 1 (Left Cheek)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pixel 2 (Right Cheek)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All images share a &lt;strong&gt;left-to-right lighting gradient&lt;/strong&gt;: the left pixel is always brighter than the right.&lt;br&gt;&lt;br&gt;
Here’s the centered data matrix ( X ) (each row is an image):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;th&gt;Pixel 1 (Left)&lt;/th&gt;
&lt;th&gt;Pixel 2 (Right)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Face1&lt;/td&gt;
&lt;td&gt;+10&lt;/td&gt;
&lt;td&gt;+5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face2&lt;/td&gt;
&lt;td&gt;+8&lt;/td&gt;
&lt;td&gt;+4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face3&lt;/td&gt;
&lt;td&gt;+12&lt;/td&gt;
&lt;td&gt;+6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(Note: These values are already centered by subtracting the mean.)&lt;/em&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;2. Covariance Matrix Calculation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The covariance matrix $C = \frac{1}{N} X^T X$ quantifies how pixels co-vary:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
C = \frac{1}{3} \begin{bmatrix} 10 &amp;amp; 8 &amp;amp; 12 \\ 5 &amp;amp; 4 &amp;amp; 6 \end{bmatrix} \begin{bmatrix} 10 &amp;amp; 5 \\ 8 &amp;amp; 4 \\ 12 &amp;amp; 6 \end{bmatrix} = \frac{1}{3} \begin{bmatrix} 308 &amp;amp; 154 \\ 154 &amp;amp; 77 \end{bmatrix} \approx \begin{bmatrix} 102.67 &amp;amp; 51.33 \\ 51.33 &amp;amp; 25.67 \end{bmatrix}&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Observation&lt;/strong&gt;:
( C_{12} = C_{21} \approx 51.33 ) is large and positive → Pixels 1 and 2 are &lt;strong&gt;strongly correlated&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;3. Eigenvectors and Eigenvalues&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;PCA solves $C v = \lambda v$. Let’s compute them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Eigenvalues ($\lambda$)&lt;/strong&gt;:&lt;br&gt;
$$&lt;br&gt;
\text{det}(C - \lambda I) = 0 \implies \lambda_1 \approx 128.34, \quad \lambda_2 \approx 0.&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$\lambda_1$ is large (dominant), $\lambda_2 \approx 0$ (negligible).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eigenvector $v_1$ (First PC)&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
$$&lt;br&gt;
C v_1 = \lambda_1 v_1 \implies v_1 \approx \begin{bmatrix} 0.89 \\ 0.45 \end{bmatrix}.&lt;br&gt;
$$&lt;br&gt;
&lt;em&gt;(This direction roughly aligns with the gradient [2, 1], since 10/5 = 8/4 = 12/6 = 2.)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;4. Projecting Data onto ( v_1 )&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now, project all images onto $v_1$:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\text{Scores} = X v_1 = \begin{bmatrix} 10 &amp;amp; 5 \\ 8 &amp;amp; 4 \\ 12 &amp;amp; 6 \end{bmatrix} \begin{bmatrix} 0.89 \\ 0.45 \end{bmatrix} \approx \begin{bmatrix} 11.15 \\ 8.92 \\ 13.38 \end{bmatrix}&lt;br&gt;
$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance of Scores&lt;/strong&gt;:
[
\text{Var}(scores) = \frac{11.15^2 + 8.92^2 + 13.38^2}{3} \approx 128.34 = \lambda_1.
]

&lt;ul&gt;
&lt;li&gt;This matches the eigenvalue, confirming ( v_1 ) captures maximal variance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
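
&lt;p&gt;The projection step and the variance check can be reproduced the same way; as in the text, the "variance" here is the mean of the squared scores, matching the ( \frac{1}{N} ) convention of this example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

X = np.array([[10.0, 5.0], [8.0, 4.0], [12.0, 6.0]])
C = (X.T @ X) / X.shape[0]
v1 = np.linalg.eigh(C)[1][:, -1]     # top eigenvector, approx +/-[0.894, 0.447]

scores = X @ v1                      # projection of each image onto v1
print(np.abs(scores))                # approx [11.18, 8.94, 13.42]
print((scores ** 2).mean())          # approx 128.33, i.e. lambda_1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;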




&lt;h4&gt;
  
  
  &lt;strong&gt;5. Why Does This Direction Work?&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Geometric Intuition&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
The eigenvector ( v_1 ) points along the "axis of variation" in the data.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In our 2D pixel space, the data points lie exactly on a line with slope ( 0.5 ) (since Pixel 1 = 2 × Pixel 2 for every image).
&lt;/li&gt;
&lt;li&gt;( v_1 ) aligns with this line, so projecting onto it stretches the data maximally.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Algebraic Intuition&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
The scores ( X v_1 ) are large because:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;( v_1 ) assigns &lt;strong&gt;positive weights to both pixels&lt;/strong&gt;, but more to Pixel 1 (left cheek).
&lt;/li&gt;
&lt;li&gt;Since Pixel 1 is consistently brighter, the weighted sum ( X v_1 ) amplifies this pattern → high variance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;6. Contrast with Noise (Low Variance)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Imagine adding a &lt;strong&gt;high-frequency noise pixel&lt;/strong&gt; (e.g., a freckle at random positions):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its covariance with other pixels would be near-zero (no consistent pattern).
&lt;/li&gt;
&lt;li&gt;The corresponding eigenvalue would be tiny → PCA ignores it.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistent Gradients&lt;/strong&gt; → &lt;strong&gt;Strong Covariance&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When pixels co-vary predictably (e.g., left cheek always brighter), ( C ) has large off-diagonal values.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PCA’s Magic&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The top eigenvector ( v_1 ) points where the data is "stretched" most (lighting gradient direction).
&lt;/li&gt;
&lt;li&gt;Projections onto ( v_1 ) amplify this shared structure → high variance (( \lambda_1 )).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Meaning&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;( v_1 ) isn’t arbitrary; it reflects a &lt;strong&gt;real-world pattern&lt;/strong&gt; (lighting) because that’s what dominates the data’s covariance.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Visualization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Think of the data as points in a 2D pixel space:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The points cluster along a line (slope = 0.5).
&lt;/li&gt;
&lt;li&gt;( v_1 ) is the direction of that line.
&lt;/li&gt;
&lt;li&gt;Projecting onto ( v_1 ) preserves the gradient; projecting onto ( v_2 ) (orthogonal) loses it.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why PCA’s top components often "make sense" for images!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Top PCA Components in Images Carry Semantic Meaning</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 07:38:40 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/why-top-pca-components-in-images-carry-semantic-meaning-1bil</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/why-top-pca-components-in-images-carry-semantic-meaning-1bil</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Why Top PCA Components in Images Carry Semantic Meaning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The observation that top PCA components (like Eigenfaces) often correspond to &lt;strong&gt;semantically meaningful patterns&lt;/strong&gt; in images (e.g., lighting, edges, facial features) arises from the interplay of &lt;strong&gt;variance maximization&lt;/strong&gt; and the &lt;strong&gt;statistical structure of natural images&lt;/strong&gt;. Here’s a detailed breakdown:&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1. PCA Recap: Variance Maximization&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PCA finds directions (principal components) that maximize variance in the data.&lt;/li&gt;
&lt;li&gt;For images, each pixel is a dimension, and an image is a point in this high-dimensional space.&lt;/li&gt;
&lt;li&gt;The top PCs are the directions where pixel intensities &lt;strong&gt;co-vary most strongly&lt;/strong&gt; across the dataset.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Why Variance ≈ Semantic Meaning in Images?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Natural images (e.g., faces, objects) have &lt;strong&gt;structured pixel correlations&lt;/strong&gt;, not random noise. Key reasons why top PCs capture semantics:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;(A) Dominant Global Patterns&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First PC (λ₁)&lt;/strong&gt;: Often captures the &lt;strong&gt;mean image&lt;/strong&gt; or &lt;strong&gt;average illumination&lt;/strong&gt; (since lighting variations dominate pixel-wise variance).

&lt;ul&gt;
&lt;li&gt;Example: In Eigenfaces, PC1 is a blurry face (average face + lighting direction).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Subsequent PCs (λ₂, λ₃, ...)&lt;/strong&gt;: Encode &lt;strong&gt;gradients&lt;/strong&gt; (e.g., left vs. right lighting, edges, facial parts).&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;(B) Hierarchical Structure of Natural Images&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Natural images obey &lt;strong&gt;spatial smoothness&lt;/strong&gt; (nearby pixels are correlated).&lt;/li&gt;
&lt;li&gt;PCA implicitly exploits this:

&lt;ul&gt;
&lt;li&gt;Low-frequency patterns (e.g., cheek contours) have higher variance → appear in top PCs.&lt;/li&gt;
&lt;li&gt;High-frequency noise (e.g., pixel jitter) has low variance → relegated to later PCs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;(C) Shared Semantic Features&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In datasets like faces, &lt;strong&gt;similar features&lt;/strong&gt; (eyes, nose) appear repeatedly.&lt;/li&gt;
&lt;li&gt;PCA’s variance maximization aligns PCs with these &lt;strong&gt;common modes of variation&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;PC2 might encode "smiling vs. neutral" (if mouth shapes vary a lot).&lt;/li&gt;
&lt;li&gt;PC3 might capture "eyeglasses vs. no eyeglasses."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Mathematical Insight: Link to Eigenvectors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The top eigenvector ( v_1 ) satisfies:&lt;br&gt;
[&lt;br&gt;
v_1 = \arg\max_{|v|=1} \text{Var}(Xv) = \arg\max \sum_{i=1}^N (x_i \cdot v)^2.&lt;br&gt;
]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For images, ( x_i \cdot v ) is high when ( v ) aligns with &lt;strong&gt;recurring pixel patterns&lt;/strong&gt; (e.g., horizontal edges).&lt;/li&gt;
&lt;li&gt;Thus, ( v_1 ) "looks like" a typical feature (e.g., an edge filter).&lt;/li&gt;
&lt;/ul&gt;
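
&lt;p&gt;To make the ( \arg\max ) concrete, here is a small synthetic sketch (the data and sizes are invented for illustration): the variance of the projection ( Xv ) is largest along the top eigenvector and noticeably smaller along random unit directions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": 200 samples, 50 pixels, one strong shared pattern plus noise
pattern = rng.normal(size=50)
X = rng.normal(size=(200, 1)) * pattern + 0.1 * rng.normal(size=(200, 50))
X -= X.mean(axis=0)

C = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
v1 = eigvecs[:, -1]

def var_along(v):
    return np.var(X @ v)

print(var_along(v1))        # close to eigvals[-1], the largest achievable variance
for _ in range(3):
    v = rng.normal(size=50)
    v /= np.linalg.norm(v)
    print(var_along(v))     # random unit directions capture far less variance
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;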




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Example: Eigenfaces (PCA on Faces)&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PC Rank&lt;/th&gt;
&lt;th&gt;Semantic Meaning&lt;/th&gt;
&lt;th&gt;Variance Explained (illustrative)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PC1&lt;/td&gt;
&lt;td&gt;Average face + lighting direction&lt;/td&gt;
&lt;td&gt;λ₁ = 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PC2&lt;/td&gt;
&lt;td&gt;Left vs. right lighting&lt;/td&gt;
&lt;td&gt;λ₂ = 3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PC3&lt;/td&gt;
&lt;td&gt;Eyebrows/nose shape&lt;/td&gt;
&lt;td&gt;λ₃ = 1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;PC1-PC3 &lt;strong&gt;look like ghostly faces&lt;/strong&gt; because they encode &lt;strong&gt;global face structures&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Later PCs (λ ≈ 0) encode noise or idiosyncratic details.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Why Not All PCs Are Semantic?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top PCs&lt;/strong&gt;: High variance → capture &lt;strong&gt;shared structure&lt;/strong&gt; (semantics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom PCs&lt;/strong&gt;: Low variance → capture &lt;strong&gt;noise&lt;/strong&gt; or &lt;strong&gt;rare artifacts&lt;/strong&gt; (e.g., sensor dust).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Connection to Biology and Perception&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PCA-like processes appear in models of &lt;strong&gt;biological vision&lt;/strong&gt; (e.g., retinal ganglion cells are often modeled as decorrelating natural-scene statistics, much like PCA).&lt;/li&gt;
&lt;li&gt;The brain prioritizes &lt;strong&gt;high-variance features&lt;/strong&gt; (edges, textures) for efficient coding.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Limitations&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linear Assumption&lt;/strong&gt;: PCA only captures linear correlations. Nonlinear features (e.g., curved edges) may require kernel PCA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset Bias&lt;/strong&gt;: If images are badly aligned, PCs may reflect misalignment, not semantics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Variance = Structure&lt;/strong&gt;: In natural images, large pixel covariances arise from &lt;strong&gt;shared semantic features&lt;/strong&gt; (not noise).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top PCs&lt;/strong&gt; align with &lt;strong&gt;dominant statistical patterns&lt;/strong&gt;, which often coincide with human-interpretable features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimensionality Reduction&lt;/strong&gt;: Keeping top PCs preserves semantics while discarding noise.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Try It Yourself&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Compute PCA on MNIST digits. You’ll find:

&lt;ul&gt;
&lt;li&gt;PC1: Average digit blob.&lt;/li&gt;
&lt;li&gt;PC2: Thin vs. thick strokes.&lt;/li&gt;
&lt;li&gt;PC3: Slant direction (e.g., left vs. right).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
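
&lt;p&gt;A minimal sketch of this experiment, assuming scikit-learn and matplotlib are available and using the library's bundled 8×8 digits dataset as a small stand-in for MNIST:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                      # 1797 samples of 8x8 digit images
pca = PCA(n_components=3).fit(digits.data)  # PCA centers the data internally

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for i, ax in enumerate(axes):
    ax.imshow(pca.components_[i].reshape(8, 8), cmap="gray")
    ax.set_title(f"PC{i + 1}")
    ax.axis("off")
plt.show()

print(pca.explained_variance_ratio_)        # share of variance captured by each PC
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;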

</description>
    </item>
    <item>
      <title>why larger eigenval indicates more variance?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 02:56:54 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/why-larger-eigenval-indicates-more-variance-25l9</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/why-larger-eigenval-indicates-more-variance-25l9</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Detailed Explanation: Why Larger Eigenvalues Indicate More Variance in PCA&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To understand why larger eigenvalues correspond to directions of greater variance in Principal Component Analysis (PCA), we need to dive into the mathematical foundations of eigenvalues, eigenvectors, and projections. Here’s a step-by-step breakdown:&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1. Recap: Covariance Matrix and Eigen-Decomposition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Given a centered data matrix ( X ) (size ( N \times D )), the covariance matrix is:&lt;br&gt;
[&lt;br&gt;
C = \frac{1}{N-1} X^T X.&lt;br&gt;
]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;( C ) is symmetric (( C = C^T )) and positive semi-definite.&lt;/li&gt;
&lt;li&gt;Its eigenvectors ( v_1, v_2, \dots, v_D ) (principal components) are orthogonal, and eigenvalues ( \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D \geq 0 ) represent the variance along each eigenvector.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Projection of Data onto an Eigenvector&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When we project the centered data ( X ) onto an eigenvector ( v_j ), the transformed data ( s_j ) (scores) is:&lt;br&gt;
[&lt;br&gt;
s_j = X v_j.&lt;br&gt;
]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;( s_j ) is a vector of length ( N ) (one value per data point).&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;variance of ( s_j )&lt;/strong&gt; is calculated as:
[
\text{Var}(s_j) = \frac{1}{N-1} s_j^T s_j = \frac{1}{N-1} (X v_j)^T (X v_j) = v_j^T \underbrace{\left( \frac{1}{N-1} X^T X \right)}_{C} v_j = v_j^T C v_j.
]&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Eigenvalue Equation and Variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From the eigen-decomposition ( C v_j = \lambda_j v_j ), multiply both sides by ( v_j^T ):&lt;br&gt;
[&lt;br&gt;
v_j^T C v_j = v_j^T (\lambda_j v_j) = \lambda_j \underbrace{v_j^T v_j}_{=1} = \lambda_j.&lt;br&gt;
]&lt;br&gt;
Thus:&lt;br&gt;
[&lt;br&gt;
\text{Var}(s_j) = \lambda_j.&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The eigenvalue ( \lambda_j ) &lt;strong&gt;is exactly the variance&lt;/strong&gt; of the data projected onto its corresponding eigenvector ( v_j ).&lt;/p&gt;
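
&lt;p&gt;This identity is easy to confirm numerically. A minimal sketch with random centered data (the sizes are arbitrary): the variance of each score vector matches the corresponding eigenvalue to floating-point precision:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)                        # center the data

C = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)

for j in range(4):
    s_j = X @ eigvecs[:, j]                # scores along eigenvector v_j
    var_sj = s_j @ s_j / (len(X) - 1)      # same 1/(N-1) convention as C
    print(var_sj, eigvals[j])              # the two columns agree
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;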




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Why Larger ( \lambda_j ) = More Important Direction?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;first principal component (PC1)&lt;/strong&gt; ( v_1 ) is the direction maximizing ( \text{Var}(Xv) ). By definition, this is the eigenvector with the largest eigenvalue (( \lambda_1 )).&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;second PC&lt;/strong&gt; ( v_2 ) is the next best direction (orthogonal to ( v_1 )) with variance ( \lambda_2 ), and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Geometric Interpretation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Eigenvalues ( \lambda_j ) quantify how "stretched" the data is along each PC. Larger ( \lambda_j ) means the data spreads out more in that direction, making it a dominant feature.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Formal Proof: Variance Maximization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PCA solves the constrained optimization problem:&lt;br&gt;
[&lt;br&gt;
\max_{v} v^T C v \quad \text{subject to} \quad |v| = 1.&lt;br&gt;
]&lt;br&gt;
The Lagrangian is:&lt;br&gt;
[&lt;br&gt;
\mathcal{L}(v, \lambda) = v^T C v - \lambda (v^T v - 1).&lt;br&gt;
]&lt;br&gt;
Taking the gradient w.r.t. ( v ) and setting it to zero:&lt;br&gt;
[&lt;br&gt;
\nabla_v \mathcal{L} = 2 C v - 2 \lambda v = 0 \implies C v = \lambda v.&lt;br&gt;
]&lt;br&gt;
This shows that the optimal directions ( v ) are eigenvectors of ( C ), and the variances ( \lambda ) are the eigenvalues.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Total Variance and Explained Variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;total variance&lt;/strong&gt; in the data is the sum of all eigenvalues:
[
\text{Total Variance} = \sum_{j=1}^D \lambda_j = \text{trace}(C).
]&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;proportion of variance explained&lt;/strong&gt; by the ( j )-th PC is:
[
\frac{\lambda_j}{\sum_{k=1}^D \lambda_k}.
]
Larger ( \lambda_j ) means the PC explains more of the total variance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Intuitive Example: 2D Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Consider a 2D dataset with:&lt;br&gt;
[&lt;br&gt;
C = \begin{bmatrix}&lt;br&gt;
5 &amp;amp; 2 \&lt;br&gt;
2 &amp;amp; 3&lt;br&gt;
\end{bmatrix}.&lt;br&gt;
]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eigenvalues: ( \lambda_1 = 7 ), ( \lambda_2 = 1 ).&lt;/li&gt;
&lt;li&gt;PC1 (( \lambda_1 = 7 )): Direction where data varies most (accounts for ( 7/8 = 87.5\% ) of variance).&lt;/li&gt;
&lt;li&gt;PC2 (( \lambda_2 = 1 )): Orthogonal direction with less variance (( 12.5\% )).&lt;/li&gt;
&lt;/ul&gt;
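
&lt;p&gt;A quick check of this 2D example (not in the original text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

C = np.array([[4.0, 3.0],
              [3.0, 4.0]])
eigvals = np.linalg.eigvalsh(C)
print(eigvals)                    # [1. 7.]
print(eigvals / eigvals.sum())    # [0.125 0.875], i.e. 12.5% and 87.5%
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;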




&lt;h3&gt;
  
  
  &lt;strong&gt;8. Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Eigenvalue = Variance&lt;/strong&gt;: ( \lambda_j ) is the variance of the data projected onto ( v_j ).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dominant PCs&lt;/strong&gt;: Larger ( \lambda_j ) means more variance is captured by ( v_j ), making it a "major feature."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimality&lt;/strong&gt;: PCs are the &lt;strong&gt;best&lt;/strong&gt; orthogonal directions for maximizing variance (or minimizing reconstruction error).&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>expand the cov matrix multiplication pixelwise to have a better intuition</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 02:56:17 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/expand-the-cov-matrix-multiplication-pixelwise-to-have-a-better-intuition-44gh</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/expand-the-cov-matrix-multiplication-pixelwise-to-have-a-better-intuition-44gh</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;1. Expansion of the Covariance Matrix (Pixel-wise Intuition)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For a dataset of ( N ) images, each flattened into a vector of ( D ) pixels, the centered data matrix ( X ) (size ( N \times D )) is:&lt;br&gt;
[&lt;br&gt;
X = \begin{bmatrix}&lt;br&gt;
x_{11} - \mu_1 &amp;amp; x_{12} - \mu_2 &amp;amp; \cdots &amp;amp; x_{1D} - \mu_D \&lt;br&gt;
x_{21} - \mu_1 &amp;amp; x_{22} - \mu_2 &amp;amp; \cdots &amp;amp; x_{2D} - \mu_D \&lt;br&gt;
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \&lt;br&gt;
x_{N1} - \mu_1 &amp;amp; x_{N2} - \mu_2 &amp;amp; \cdots &amp;amp; x_{ND} - \mu_D&lt;br&gt;
\end{bmatrix},&lt;br&gt;
]&lt;br&gt;
where ( \mu_j = \frac{1}{N} \sum_{i=1}^N x_{ij} ) is the mean of pixel ( j ).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;covariance matrix&lt;/strong&gt; ( C ) (size ( D \times D )) is computed as:&lt;br&gt;
[&lt;br&gt;
C = \frac{1}{N-1} X^T X.&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expanding ( X^T X ) pixel-wise:&lt;/strong&gt;&lt;br&gt;
[&lt;br&gt;
X^T X = \begin{bmatrix}&lt;br&gt;
\sum_{i=1}^N (x_{i1} - \mu_1)^2 &amp;amp; \sum_{i=1}^N (x_{i1} - \mu_1)(x_{i2} - \mu_2) &amp;amp; \cdots &amp;amp; \sum_{i=1}^N (x_{i1} - \mu_1)(x_{iD} - \mu_D) \&lt;br&gt;
\sum_{i=1}^N (x_{i2} - \mu_2)(x_{i1} - \mu_1) &amp;amp; \sum_{i=1}^N (x_{i2} - \mu_2)^2 &amp;amp; \cdots &amp;amp; \sum_{i=1}^N (x_{i2} - \mu_2)(x_{iD} - \mu_D) \&lt;br&gt;
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \&lt;br&gt;
\sum_{i=1}^N (x_{iD} - \mu_D)(x_{i1} - \mu_1) &amp;amp; \sum_{i=1}^N (x_{iD} - \mu_D)(x_{i2} - \mu_2) &amp;amp; \cdots &amp;amp; \sum_{i=1}^N (x_{iD} - \mu_D)^2&lt;br&gt;
\end{bmatrix}.&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diagonal entries&lt;/strong&gt; ( C_{jj} ): Variance of pixel ( j ) across all images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Off-diagonal entries&lt;/strong&gt; ( C_{jk} ): Covariance between pixels ( j ) and ( k ). High values indicate pixels ( j ) and ( k ) vary together (e.g., edges or textures).&lt;/li&gt;
&lt;/ul&gt;
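
&lt;p&gt;The entry-wise expansion and the matrix form ( \frac{1}{N-1} X^T X ) are the same computation; a small sketch with made-up data (and a comparison against &lt;code&gt;np.cov&lt;/code&gt;) makes this concrete:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(2)
N, D = 6, 4                               # 6 tiny "images", 4 pixels each
X_raw = rng.normal(size=(N, D))
X = X_raw - X_raw.mean(axis=0)            # center each pixel (column)

# Entry-by-entry, exactly as in the expansion above
C_loop = np.zeros((D, D))
for j in range(D):
    for k in range(D):
        C_loop[j, k] = np.sum(X[:, j] * X[:, k]) / (N - 1)

C_matrix = X.T @ X / (N - 1)              # matrix form
print(np.allclose(C_loop, C_matrix))                      # True
print(np.allclose(C_loop, np.cov(X_raw, rowvar=False)))   # True: np.cov agrees
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;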

</description>
    </item>
    <item>
      <title>how can I compute PCA between different images, tell me its mathematics.</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 02:36:50 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/how-can-i-compute-pca-between-different-images-tell-me-its-mathematics-46g2</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/how-can-i-compute-pca-between-different-images-tell-me-its-mathematics-46g2</guid>
      <description>&lt;p&gt;Computing &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt; between different images involves treating each image (or image patch) as a high-dimensional vector and then performing PCA to find the dominant patterns (principal components) across these vectors. Below, I’ll explain the &lt;strong&gt;mathematics step-by-step&lt;/strong&gt; and provide a clear method to implement it.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Key Idea of PCA for Images&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flatten each image&lt;/strong&gt; into a 1D vector (e.g., a &lt;code&gt;64x64&lt;/code&gt; image becomes a &lt;code&gt;4096-dim&lt;/code&gt; vector).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack all images&lt;/strong&gt; into a data matrix &lt;code&gt;X&lt;/code&gt; (each row is an image, columns are pixel values).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute PCA&lt;/strong&gt; on &lt;code&gt;X&lt;/code&gt; to find the principal components (PCs) that explain the most variance across images.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step-by-Step Mathematics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s say you have &lt;code&gt;N&lt;/code&gt; images, each of size &lt;code&gt;H × W&lt;/code&gt; (height × width). &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Flatten and Create Data Matrix&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reshape each image into a column vector of size &lt;code&gt;D = H × W&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Stack all images into a &lt;code&gt;N × D&lt;/code&gt; data matrix &lt;code&gt;X&lt;/code&gt;:
[
X = \begin{bmatrix}
\text{Image}_1^{(1)} &amp;amp; \text{Image}_1^{(2)} &amp;amp; \cdots &amp;amp; \text{Image}_1^{(D)} \
\text{Image}_2^{(1)} &amp;amp; \text{Image}_2^{(2)} &amp;amp; \cdots &amp;amp; \text{Image}_2^{(D)} \
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \
\text{Image}_N^{(1)} &amp;amp; \text{Image}_N^{(2)} &amp;amp; \cdots &amp;amp; \text{Image}_N^{(D)}
\end{bmatrix}
]
(Each row is an image, columns are pixel intensities.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Center the Data (Subtract Mean)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Compute the mean image:
[
\mu = \frac{1}{N} \sum_{i=1}^N \text{Image}_i
]&lt;/li&gt;
&lt;li&gt;Subtract the mean from each image:
[
X_{\text{centered}} = X - \mu
]&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Compute Covariance Matrix&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The covariance matrix &lt;code&gt;C&lt;/code&gt; (size &lt;code&gt;D × D&lt;/code&gt;) measures how pixel intensities vary together:
[
C = \frac{1}{N-1} X_{\text{centered}}^T X_{\text{centered}}
]
(For &lt;code&gt;N &amp;lt; D&lt;/code&gt;, computing &lt;code&gt;C&lt;/code&gt; directly is inefficient—see &lt;strong&gt;Efficient Computation&lt;/strong&gt; below.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Eigenvalue Decomposition&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Solve for the eigenvectors (&lt;code&gt;V&lt;/code&gt;) and eigenvalues (&lt;code&gt;\Lambda&lt;/code&gt;) of &lt;code&gt;C&lt;/code&gt;:
[
C V = V \Lambda
]&lt;/li&gt;
&lt;li&gt;The eigenvectors (columns of &lt;code&gt;V&lt;/code&gt;) are the &lt;strong&gt;principal components (PCs)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The eigenvalues (&lt;code&gt;\Lambda&lt;/code&gt;) indicate the variance explained by each PC.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Project Data onto Principal Components&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;To reduce dimensionality, select the top &lt;code&gt;k&lt;/code&gt; PCs (e.g., &lt;code&gt;k=3&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Project the centered data onto these PCs:
[
\text{Scores} = X_{\text{centered}} V_k
]
(&lt;code&gt;Scores&lt;/code&gt; is an &lt;code&gt;N × k&lt;/code&gt; matrix representing each image in the PCA space.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Efficient Computation (for High-Dimensional Images)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;D&lt;/code&gt; (number of pixels) is large (e.g., &lt;code&gt;D = 4096&lt;/code&gt; for &lt;code&gt;64x64&lt;/code&gt; images), computing the &lt;code&gt;D × D&lt;/code&gt; covariance matrix is expensive. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute the Gram matrix &lt;code&gt;G = X_{\text{centered}} X_{\text{centered}}^T&lt;/code&gt; (size &lt;code&gt;N × N&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Find its eigenvectors &lt;code&gt;U&lt;/code&gt; and eigenvalues &lt;code&gt;\Lambda&lt;/code&gt;:
[
G U = U \Lambda
]&lt;/li&gt;
&lt;li&gt;The PCs are then:
[
V = X_{\text{centered}}^T U \Lambda^{-1/2}
]
(This avoids computing the large &lt;code&gt;D × D&lt;/code&gt; covariance matrix.)&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Example in Python (using NumPy)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Example: 10 images of size 64x64 (flattened to 4096-dim vectors)
&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with actual images
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 1: Center data
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_centered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Compute covariance matrix (use Gram matrix trick for D &amp;gt;&amp;gt; N)
&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_centered&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X_centered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;  &lt;span class="c1"&gt;# Gram matrix (N x N)
&lt;/span&gt;&lt;span class="n"&gt;eigenvals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eigenvecs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eigh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Eigen decomposition
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 3: Get top k=3 principal components
&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;top_k_eigenvecs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eigenvecs&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# Last k eigenvectors (largest eigenvalues)
&lt;/span&gt;&lt;span class="n"&gt;top_k_pcs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_centered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;top_k_eigenvecs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eigenvals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Project data onto PCs
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_centered&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;top_k_pcs&lt;/span&gt;  &lt;span class="c1"&gt;# N x k matrix (PCA scores for each image)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCA Scores (first 3 components):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Interpretation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each row of &lt;code&gt;scores&lt;/code&gt; gives the coordinates of an image in the PCA space.&lt;/li&gt;
&lt;li&gt;The first PC (&lt;code&gt;scores[:, 0]&lt;/code&gt;) captures the direction of maximum variance across images.&lt;/li&gt;
&lt;li&gt;Plotting &lt;code&gt;scores[:, 0]&lt;/code&gt; vs &lt;code&gt;scores[:, 1]&lt;/code&gt; visualizes image similarities.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Key Notes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt;: If pixel scales vary, standardize &lt;code&gt;X&lt;/code&gt; (subtract mean, divide by std).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitening&lt;/strong&gt;: To make all PCs equally important, divide by &lt;code&gt;sqrt(eigenvals)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCA vs. SVD&lt;/strong&gt;: In practice, PCA is often computed via SVD (more numerically stable).&lt;/li&gt;
&lt;/ul&gt;
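
&lt;p&gt;On the last note: a minimal sketch of the SVD route, assuming the same ( N \times D ) centered image matrix as in the example above. The right singular vectors are the principal components, and the squared singular values divided by ( N-1 ) equal the covariance eigenvalues:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(3)
X = rng.random((10, 64 * 64))                # stand-in for 10 flattened 64x64 images
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                              # rows are the principal components
explained_variance = S**2 / (len(X) - 1)     # equals the covariance eigenvalues
scores = X_centered @ Vt.T                   # equivalently U * S

print(components.shape, explained_variance[:3])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;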

</description>
    </item>
    <item>
      <title>in DINO, how does [CLS] token get to gather global information, unlike other patches, though under same attention mechanism?</title>
      <dc:creator>Henri Wang</dc:creator>
      <pubDate>Mon, 30 Jun 2025 00:54:01 +0000</pubDate>
      <link>https://dev.to/henri_wang_d48b1e9bc1ea79/in-dino-how-does-cls-token-get-to-gather-global-information-unlike-other-patches-though-under-39h6</link>
      <guid>https://dev.to/henri_wang_d48b1e9bc1ea79/in-dino-how-does-cls-token-get-to-gather-global-information-unlike-other-patches-though-under-39h6</guid>
      <description>&lt;p&gt;In the DINO (self-distillation with no labels) framework, the &lt;strong&gt;&lt;a href="https://dev.toclassification"&gt;CLS&lt;/a&gt; token&lt;/strong&gt; gathers global information despite using the same attention mechanism as other patch tokens due to its unique role in the attention dynamics and training objective. Here's why:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Special Position and Role of [CLS]&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;[CLS] token&lt;/strong&gt; is prepended to the sequence of patch tokens and is designed to aggregate global information for tasks like classification or distillation. Unlike patch tokens (which primarily attend to local regions of the image), the &lt;strong&gt;[CLS] token has no spatial bias&lt;/strong&gt;—it can attend to all patches equally.&lt;/li&gt;
&lt;li&gt;During self-attention, the &lt;strong&gt;[CLS] token's queries&lt;/strong&gt; interact with keys from all patches (and itself), allowing it to integrate information across the entire image.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Attention Mechanism Flexibility&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;While all tokens (including patches and [CLS]) use the same attention mechanism, the &lt;strong&gt;[CLS] token’s attention patterns are learned to be more global&lt;/strong&gt; because:

&lt;ul&gt;
&lt;li&gt;It has no positional encoding bias toward any specific region (unlike patch tokens, which tend to focus locally due to the locality biases in vision tasks).&lt;/li&gt;
&lt;li&gt;The training objective (self-distillation) encourages the [CLS] token to capture semantically meaningful global features since it’s the output used for distillation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Training Objective (Self-Distillation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In DINO, the &lt;strong&gt;[CLS] token’s output is the primary target for self-distillation&lt;/strong&gt;, meaning it must encode rich, discriminative information to match the teacher network’s predictions.&lt;/li&gt;
&lt;li&gt;Patch tokens may focus on local features (useful for reconstruction or local tasks), but the &lt;strong&gt;[CLS] token is explicitly trained to be a global descriptor&lt;/strong&gt;, forcing it to attend broadly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Emergent Property of Self-Supervised Learning&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DINO’s self-supervised loss (cross-entropy between student and teacher [CLS] outputs) incentivizes the &lt;strong&gt;[CLS] token to become a "summary" of the image&lt;/strong&gt; to avoid collapse and capture invariant features.&lt;/li&gt;
&lt;li&gt;Patch tokens can afford to be more local because their role isn’t directly constrained by the distillation loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Contrast with Patch Tokens&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Patch tokens&lt;/strong&gt; naturally attend to nearby patches (due to spatial coherence), but the &lt;strong&gt;[CLS] token’s attention is unbounded&lt;/strong&gt;—it can learn long-range dependencies more easily.&lt;/li&gt;
&lt;li&gt;In practice, attention maps for [CLS] often show broad, image-wide coverage, while patch tokens focus on local regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Insight:
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;[CLS] token isn’t fundamentally different in architecture&lt;/strong&gt;, but its &lt;strong&gt;positional freedom + training objective&lt;/strong&gt; biases it toward global aggregation. The same attention mechanism yields different behaviors because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query role&lt;/strong&gt;: [CLS] queries are optimized to aggregate globally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No spatial priors&lt;/strong&gt;: Unlike patches, it isn’t tied to a specific image region.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is analogous to how [CLS] works in ViTs for supervised learning, but in DINO, the self-distillation objective further reinforces its global role.&lt;/p&gt;
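
&lt;p&gt;To underline the "not architecturally special" point, here is a toy single-head self-attention sketch in NumPy (all sizes and weights are invented for illustration): the [CLS] row of the attention matrix is computed exactly like every patch row; only training makes it behave globally.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
num_patches, dim = 196, 64                        # e.g. 14x14 patches (illustrative)
tokens = rng.normal(size=(1 + num_patches, dim))  # row 0 is [CLS], the rest are patches

Wq = rng.normal(size=(dim, dim))
Wk = rng.normal(size=(dim, dim))
Q, K = tokens @ Wq, tokens @ Wk

attn = softmax(Q @ K.T / np.sqrt(dim))            # (1+P) x (1+P) attention weights

# Row 0: how the [CLS] query distributes attention over itself and all patches.
print(attn[0].shape, attn[0].sum())               # (197,), approx 1.0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;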

</description>
    </item>
  </channel>
</rss>
