<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kumar Vishal</title>
    <description>The latest articles on DEV Community by Kumar Vishal (@kumarvis).</description>
    <link>https://dev.to/kumarvis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2880626%2F776d9fd5-213c-4d2e-9694-4328d407fbef.png</url>
      <title>DEV Community: Kumar Vishal</title>
      <link>https://dev.to/kumarvis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kumarvis"/>
    <language>en</language>
    <item>
      <title>Vision Transform</title>
      <dc:creator>Kumar Vishal</dc:creator>
      <pubDate>Sat, 30 Aug 2025 06:20:36 +0000</pubDate>
      <link>https://dev.to/kumarvis/vision-transform-2la2</link>
      <guid>https://dev.to/kumarvis/vision-transform-2la2</guid>
      <description>&lt;h1&gt;
  
  
  Understanding Paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" with ChatGPT
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2010.11929" rel="noopener noreferrer"&gt;Full Paper PDF&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Prompt
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Role: You are an experienced AI/ML researcher and teaching assistant.&lt;/li&gt;
&lt;li&gt;Objective: Help me understand the attached research paper through an interactive, step-by-step dialogue.&lt;/li&gt;
&lt;li&gt;Workflow &amp;amp; Ground Rules: When I ask “EXPLAIN” or similar words, break down concepts as if teaching a graduate CS student.&lt;/li&gt;
&lt;li&gt;Define all symbols: Use analogies to common models (e.g., CNN, Transformer) if helpful.&lt;/li&gt;
&lt;li&gt;When I ask “CODE”, provide clean PyTorch-style pseudocode with comments and tensor shapes.&lt;/li&gt;
&lt;li&gt;If a question is ambiguous, ask a clarifying follow-up instead of guessing.&lt;/li&gt;
&lt;li&gt;Cite page or figure numbers from the PDF whenever you reference specific content.&lt;/li&gt;
&lt;li&gt;If you are unsure, say “I’m not certain—please verify in the original text.”&lt;/li&gt;
&lt;li&gt;Begin by replying only with: “Ready to analyze the paper. Awaiting your first instruction.”&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prompt 1
&lt;/h2&gt;

&lt;p&gt;First, I want an elaborate summary of the entire article. Return a bullet list with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main problem addressed
&lt;/li&gt;
&lt;li&gt;Core idea (3-4 sentences)
&lt;/li&gt;
&lt;li&gt;Claimed contributions (short phrases)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Output 1
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Vision Transformer (ViT) – High-level Take-aways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Main problem addressed&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Convolutional Neural Networks (CNNs) dominate vision, yet they embed hand-crafted inductive biases (locality, translation equivariance) that may limit scalability. The paper asks: &lt;em&gt;Can a standard NLP Transformer, with minimal changes, match or surpass state-of-the-art CNNs on image classification when trained at scale?&lt;/em&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Core idea (3–4 sentences)&lt;/strong&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat an image as a sequence of non-overlapping $$(P \times P)$$ patches (e.g., 16×16) and linearly project each flattened patch to a D-dimensional token embedding (Eq. 1, p. 3).
&lt;/li&gt;
&lt;li&gt;Prepend a learnable “[class]” token, add 1-D positional embeddings, and feed the resulting sequence into a standard Transformer encoder of the kind used in NLP, with no vision-specific changes inside the encoder (Fig. 1, p. 2).
&lt;/li&gt;
&lt;li&gt;Supervised pre-training on very large image corpora (ImageNet-21k, 14 M images; JFT-300M, 303 M images) compensates for the lack of CNN inductive bias.
&lt;/li&gt;
&lt;li&gt;After fine-tuning (often at higher resolution) the resulting Vision Transformer attains competitive or superior accuracy to strong CNN baselines while using 2–4× less pre-training compute (Table 2, p. 5; Fig. 5, p. 7).
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Claimed contributions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduce &lt;em&gt;Vision Transformer&lt;/em&gt; (pure Transformer on image patches)
&lt;/li&gt;
&lt;li&gt;Show that &lt;strong&gt;large-scale training trumps inductive bias&lt;/strong&gt;: with enough pre-training data, ViT matches or beats CNNs
&lt;/li&gt;
&lt;li&gt;Achieve &lt;strong&gt;88.55 % ImageNet top-1&lt;/strong&gt; with ViT-H/14 (state-of-the-art at submission)
&lt;/li&gt;
&lt;li&gt;Demonstrate strong transfer: &lt;strong&gt;94.55 % CIFAR-100&lt;/strong&gt;, &lt;strong&gt;77.63 % VTAB mean&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Provide compute-efficient training (≥2× less TPU core-days than prior SOTA)
&lt;/li&gt;
&lt;li&gt;Release code &amp;amp; pre-trained checkpoints for reproducibility
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
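
&lt;p&gt;As a quick sanity check of the patch arithmetic in the core idea, the sketch below computes the sequence length and the flattened patch size for a given image and patch size. The helper name &lt;code&gt;vit_token_shapes&lt;/code&gt; and the 224×224 RGB input with D = 768 are illustrative assumptions, not values taken from this post; the only formula used is N = HW / P², as stated in the paper.&lt;/p&gt;

```python
def vit_token_shapes(height, width, channels, patch, dim):
    """Return (num_patches, flat_patch_size, sequence_shape) for a ViT-style input.

    Hypothetical helper for illustration only.
    """
    if height % patch != 0 or width % patch != 0:
        raise ValueError("image sides must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)   # N = HW / P^2
    flat_patch_size = patch * patch * channels           # P^2 * C values per patch
    sequence_shape = (num_patches + 1, dim)              # +1 for the [class] token
    return num_patches, flat_patch_size, sequence_shape


n, flat, seq = vit_token_shapes(224, 224, 3, patch=16, dim=768)
print(n, flat, seq)  # 196 768 (197, 768)
```

&lt;p&gt;With 16×16 patches, a 224×224 image becomes a sequence of only 197 tokens, which is why full self-attention over patches stays affordable.&lt;/p&gt;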




&lt;h2&gt;
  
  
  Prompt 2
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Summarize the Introduction section in ≤200 words. &lt;/li&gt;
&lt;li&gt;Highlight:

&lt;ul&gt;
&lt;li&gt;Practical or theoretical gap the authors claim.&lt;/li&gt;
&lt;li&gt;Prior work they say is insufficient.&lt;/li&gt;
&lt;li&gt;Specific real-world stakes (datasets, applications, societal impact).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Output 2
&lt;/h2&gt;

&lt;p&gt;The authors confront a widening gap between the &lt;strong&gt;scalability of Transformers in NLP&lt;/strong&gt; and their modest traction in computer-vision pipelines. They argue that most vision research still leans on Convolutional Neural Networks (CNNs) whose &lt;em&gt;built-in inductive biases&lt;/em&gt; (locality, translation equivariance) become liabilities when models and datasets scale.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Practical/Theoretical gap claimed&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No compelling evidence yet shows that a &lt;em&gt;pure&lt;/em&gt; Transformer—devoid of convolutional components—can match leading CNNs on large-scale image classification.&lt;/li&gt;
&lt;li&gt;Existing attention-augmented CNN hybrids inherit CNN constraints, limiting architectural simplicity and hardware efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why prior work is insufficient&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Earlier “full-attention” models either keep some CNN stages, use specialized sparse or local attention patterns that complicate accelerator deployment, or were only tested on small images/datasets (e.g., 2×2-patch models, sparse and axial attention variants).&lt;/li&gt;
&lt;li&gt;Self-supervised iGPT reaches 72% ImageNet top-1—far below CNN SOTA—demonstrating the shortfall of prior Transformer attempts when data or compute are limited.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world stakes highlighted&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Benchmarks: ImageNet-1k (1.3 M images), ImageNet-21k (14 M), and Google’s JFT-300M (303 M) are emphasized as proving grounds; performance on smaller transfer sets (CIFAR-100, VTAB suite) gauges generalization.&lt;/li&gt;
&lt;li&gt;Applications: Efficient training on these massive corpora matters for &lt;em&gt;industrial-scale visual recognition systems&lt;/em&gt; where compute budgets and hardware utilization dictate feasibility.&lt;/li&gt;
&lt;li&gt;Societal impact: Demonstrating that inductive biases can be learned rather than hard-coded informs future resource allocation—shifting focus from manual architecture design to data curation and availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prompt 3
&lt;/h2&gt;

&lt;p&gt;Create a hierarchical outline of the paper. For every top-level section include a one-line purpose statement. For sub-sections, add a short note (~5 words) on what is done or proved there. Return in Markdown outline format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output 3
&lt;/h2&gt;

&lt;h1&gt;
  
  
  Hierarchical Outline of “An Image Is Worth 16×16 Words”
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Abstract&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Summarises the problem, method and key results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Motivates replacing CNNs with pure Transformers for vision.&lt;br&gt;
2.1 NLP success of Transformers – scalability noted.&lt;br&gt;
2.2 CNN dominance in vision – limits.&lt;br&gt;
2.3 Prior hybrid/self-attention models – scaling issues.&lt;br&gt;
2.4 Proposed ViT idea – treat patches as tokens.&lt;br&gt;
2.5 Data-scale hypothesis – large data beats inductive bias.&lt;br&gt;
2.6 Headline results – ImageNet 88.55%, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related Work&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Positions ViT against prior attention and large-scale vision efforts.&lt;br&gt;
3.1 Transformers in NLP – pre-training paradigm.&lt;br&gt;
3.2 Scaling attention to images – local/sparse variants.&lt;br&gt;
3.3 Closest prior (2×2 patches) – resolution limit.&lt;br&gt;
3.4 CNN-attention hybrids – diverse tasks.&lt;br&gt;
3.5 iGPT comparison – 72% top-1.&lt;br&gt;
3.6 Large-data CNN studies – motivate ViT scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Method&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Describes the vanilla Vision Transformer architecture and training tricks.&lt;br&gt;
4.1 ViT overview – patch embedding + class token.&lt;br&gt;
4.2 Transformer encoder math – Eqs 1-4.&lt;br&gt;
4.3 Inductive bias discussion – minimal locality.&lt;br&gt;
4.4 Hybrid variant – CNN stem option.&lt;br&gt;
4.5 Fine-tuning &amp;amp; resolution – pos-embed interpolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experiments&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Empirically evaluates ViT versus CNNs, data size, and compute.&lt;br&gt;
5.1 Setup – datasets, baselines, hyper-params.&lt;br&gt;
5.2 SOTA comparison – ViT-H/14 beats BiT.&lt;br&gt;
5.3 Data requirement study – bigger data helps ViT.&lt;br&gt;
5.4 Scaling study – accuracy vs exaFLOPs.&lt;br&gt;
5.5 Model inspection – filters, position sims, attention distance.&lt;br&gt;
5.6 Self-supervision pilot – masked patch 79.9%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Recaps findings, notes future research paths.&lt;br&gt;
6.1 Pure Transformer viable – cheaper compute.&lt;br&gt;
6.2 Open challenges – detection, SSL gap, further scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acknowledgements&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Credits contributors and infrastructure support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Purpose:&lt;/em&gt; Lists cited prior literature.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Appendices&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prompt 4
&lt;/h2&gt;

&lt;p&gt;Task: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Explain &lt;em&gt;Figure 1&lt;/em&gt; of the attached paper in detail, focusing on the model architecture it illustrates.&lt;/li&gt;
&lt;li&gt;Produce a numbered list that traces the data/control flow, one step per diagram arrow:

&lt;ul&gt;
&lt;li&gt;Use the exact component name or label shown in the figure.&lt;/li&gt;
&lt;li&gt;State the main operation performed in ≤50 words.&lt;/li&gt;
&lt;li&gt;Specify the input(s) and output(s) for that step.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Output 4
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1  Detailed Explanation of Figure 1
&lt;/h2&gt;

&lt;p&gt;Figure 1 presents a bird’s-eye view of the &lt;strong&gt;Vision Transformer (ViT)&lt;/strong&gt; classifier. The key idea is to treat an image as a &lt;em&gt;sequence of patch tokens&lt;/em&gt;, feed those tokens—plus a special class token—into a standard Transformer encoder, and then read out the class prediction from the final representation of that class token. Nothing in the core encoder is vision-specific; all image handling happens in the patch-embedding “stem” and in the learned positional embeddings that preserve spatial order. The architecture therefore re-uses mature NLP Transformer implementations while replacing CNN inductive biases with scale-driven learning.&lt;/p&gt;

&lt;p&gt;Major blocks shown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linear Projection of Flattened Patches&lt;/strong&gt; – cuts the image into non-overlapping P×P patches, flattens each, and maps it to a D-dimensional vector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[class] Embedding&lt;/strong&gt; – a learned vector prepended to the patch sequence whose final state will carry the whole-image representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Patch + Position Embedding&lt;/strong&gt; – adds learnable 1-D positional embeddings to every token (patch or class).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Encoder (stack repeated L times)&lt;/strong&gt; – alternates Multi-Head Self-Attention (MSA), LayerNorm and MLP sub-blocks with residual connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLP Head&lt;/strong&gt; – a task-specific classifier (1-layer during fine-tuning) that transforms the final [class] representation into logits over K classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flow: Image → patch tokens → positional addition → Transformer stack → [class] vector → logits.&lt;/p&gt;




&lt;h2&gt;
  
  
  2  Arrow-by-Arrow Data/Control Flow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input Image → Linear Projection of Flattened Patches&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Slice image into P×P non-overlapping patches, flatten each, apply a linear map.&lt;/li&gt;
&lt;li&gt;Inputs: H×W×C image.&lt;/li&gt;
&lt;li&gt;Outputs: N patch embeddings ∈ℝᴺ×ᴰ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[class] Embedding → Sequence Concatenation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Prepend a learned [class] vector to patch embeddings.&lt;/li&gt;
&lt;li&gt;Inputs: [class] token (1×D), patch embeddings (N×D).&lt;/li&gt;
&lt;li&gt;Outputs: (N+1)-token sequence ∈ℝᴺ⁺¹×ᴰ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Position Embedding Addition → Patch + Position Embedding&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Add learnable 1-D positional vectors element-wise to each token.&lt;/li&gt;
&lt;li&gt;Inputs: token sequence, positional table ((N+1)×D).&lt;/li&gt;
&lt;li&gt;Outputs: position-encoded sequence ∈ℝᴺ⁺¹×ᴰ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequence → Transformer Encoder (Layer 1: Norm)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Apply LayerNorm to the sequence before attention; ViT normalizes before each sub-block (pre-norm, Eq. 2).&lt;/li&gt;
&lt;li&gt;Inputs: position-encoded sequence.&lt;/li&gt;
&lt;li&gt;Outputs: normalized sequence (same shape).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Head Attention + Residual Add (Layer 1)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Each token attends to all tokens; the attention output is added to the un-normalized layer input through a skip connection.&lt;/li&gt;
&lt;li&gt;Inputs: normalized sequence and the layer input.&lt;/li&gt;
&lt;li&gt;Outputs: attention-updated sequence (same shape).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Norm (Layer 1)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Apply LayerNorm to the attention-updated sequence before the MLP (Eq. 3).&lt;/li&gt;
&lt;li&gt;Inputs: attention-updated sequence.&lt;/li&gt;
&lt;li&gt;Outputs: normalized sequence.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLP Block + Residual Add (Layer 1)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Two dense layers with GELU, projecting D→D_MLP→D; the result is added to the attention-updated sequence through a second skip connection.&lt;/li&gt;
&lt;li&gt;Inputs: normalized sequence and the attention-updated sequence.&lt;/li&gt;
&lt;li&gt;Outputs: sequence for next layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps 4–7 repeat L − 1 times in Transformer Encoder&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Deeper context mixing via identical layers.&lt;/li&gt;
&lt;li&gt;Inputs: sequence from previous layer.&lt;/li&gt;
&lt;li&gt;Outputs: final encoded sequence z_L ∈ℝᴺ⁺¹×ᴰ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract [class] Token → LN&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Apply LayerNorm to final [class] vector.&lt;/li&gt;
&lt;li&gt;Inputs: first token of z_L.&lt;/li&gt;
&lt;li&gt;Outputs: image representation y ∈ℝᴰ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Representation y → MLP Head&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Single linear layer (fine-tuning) mapping D→K logits.&lt;/li&gt;
&lt;li&gt;Inputs: y.&lt;/li&gt;
&lt;li&gt;Outputs: class-score vector logits ∈ℝᴷ.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logits → Softmax (not drawn)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Operation: Convert logits to class probabilities.&lt;/li&gt;
&lt;li&gt;Inputs: logits.&lt;/li&gt;
&lt;li&gt;Outputs: probability distribution over K classes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The eleven numbered steps follow every arrow in Figure 1 and collectively describe the complete forward path from raw image pixels to predicted class probabilities.&lt;/p&gt;
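
&lt;p&gt;The per-layer part of this flow can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it follows the pre-norm ordering of Equations 2 and 3 in the paper (LayerNorm before each sub-block, residual add after), it omits LayerNorm's learnable scale and shift, and the names &lt;code&gt;mean_mix&lt;/code&gt; and &lt;code&gt;toy_mlp&lt;/code&gt; are hypothetical stand-ins for multi-head self-attention and the two-layer GELU MLP.&lt;/p&gt;

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def encoder_block(tokens, mix_tokens, mlp):
    """One pre-norm block: z' = mix(LN(z)) + z, then z_out = mlp(LN(z')) + z'."""
    normed = [layer_norm(tok) for tok in tokens]
    mixed = mix_tokens(normed)  # stand-in for multi-head self-attention
    z_prime = [[m + t for m, t in zip(mix, tok)] for mix, tok in zip(mixed, tokens)]
    normed_again = [layer_norm(tok) for tok in z_prime]
    return [
        [u + t for u, t in zip(mlp(norm_tok), tok)]
        for norm_tok, tok in zip(normed_again, z_prime)
    ]


def mean_mix(tokens):
    """Toy token mixer: every token receives the average of all tokens."""
    dim = len(tokens[0])
    avg = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [avg[:] for _ in tokens]


def toy_mlp(vec):
    """Toy per-token map standing in for the two-layer GELU MLP."""
    return [0.5 * v for v in vec]


tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 4.0, 0.0]]
out = encoder_block(tokens, mean_mix, toy_mlp)
print(len(out), len(out[0]))  # 3 4
```

&lt;p&gt;The sequence shape is preserved by the block, which is what allows the same layer to be stacked L times.&lt;/p&gt;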

&lt;h2&gt;
  
  
  Prompt 5
&lt;/h2&gt;

&lt;p&gt;Can you explain a bit more about the extra class token, especially what its input is and how it learns? Suggest a better version of the question and ask me if I would like to use it instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output 5
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Clarifying the “extra [class] token” in Vision Transformer
&lt;/h2&gt;

&lt;p&gt;Below is a deeper look at what the &lt;strong&gt;[class] token&lt;/strong&gt; is, why it is added, and how it learns during training.&lt;/p&gt;




&lt;h2&gt;
  
  
  1  What exactly is fed in as the [class] token?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It is &lt;strong&gt;not&lt;/strong&gt; derived from the image.&lt;/li&gt;
&lt;li&gt;It is a &lt;strong&gt;single, learned D-dimensional vector&lt;/strong&gt; (same width as any patch embedding) that is &lt;strong&gt;randomly initialized&lt;/strong&gt; along with the rest of the model parameters.&lt;/li&gt;
&lt;li&gt;During every forward pass the token is simply &lt;em&gt;prepended&lt;/em&gt; to the patch-embedding sequence, so the input length becomes N + 1.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2  Why add it at all?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Global pooling substitute&lt;/strong&gt;
Transformers output a vector per token; one must be chosen to represent the whole image. The [class] token gives the model a &lt;em&gt;designated&lt;/em&gt; slot whose final hidden state becomes that representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information sink&lt;/strong&gt;
Because this token attends to—and is attended by—all patch tokens in each self-attention layer, it can &lt;em&gt;collect&lt;/em&gt; a summary of the entire image content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility with NLP tooling&lt;/strong&gt;
BERT uses an identical mechanism for sentence-level tasks; ViT inherits that convention and re-uses the same code path.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3  How does the token “learn”?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parameter learning&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Its &lt;em&gt;initial&lt;/em&gt; embedding vector is optimized like any other weight through back-propagation.&lt;/li&gt;
&lt;li&gt;With L layers of self-attention, the token is repeatedly updated by mixing with patch tokens; gradients therefore reflect the classification loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Representation learning&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Early layers let the token gather coarse image context.&lt;/li&gt;
&lt;li&gt;Deeper layers refine it into a discriminative vector.&lt;/li&gt;
&lt;li&gt;The final LayerNorm output, y = LN(z_L^0), where z_L^0 is the [class] token after the last layer, is what the MLP head reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No special loss term&lt;/strong&gt;
Learning is entirely driven by the downstream cross-entropy (or whatever task loss) applied to the head’s logits.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  4  Couldn’t we just use global average pooling instead?
&lt;/h2&gt;

&lt;p&gt;Yes—and the authors tried it. A GAP-based head works &lt;strong&gt;almost identically&lt;/strong&gt; if you tune the learning rate, but the [class] token keeps the architecture identical to BERT and avoids extra pooling code. It also lets the model &lt;em&gt;decide&lt;/em&gt; how much emphasis to place on each patch rather than averaging blindly.&lt;/p&gt;
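
&lt;p&gt;The difference between the two read-outs can be illustrated with a small sketch. It is a simplified, single-head example written for this post rather than code from the paper: the patch vectors serve as both keys and values, there are no learned projections, and the function names are hypothetical. It shows that an attention read-out with equal scores reduces to global average pooling, while a query aligned with one patch concentrates the weight there.&lt;/p&gt;

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_readout(query, patches):
    """Weighted sum of patch vectors; weights come from scaled dot products with the query."""
    dim = len(query)
    scores = [
        sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim) for patch in patches
    ]
    weights = softmax(scores)
    pooled = [sum(w * patch[d] for w, patch in zip(weights, patches)) for d in range(dim)]
    return pooled, weights


def global_average_pool(patches):
    """Plain mean over patch vectors, the GAP alternative."""
    dim = len(patches[0])
    return [sum(patch[d] for patch in patches) / len(patches) for d in range(dim)]


patches = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]

# A zero query scores every patch equally, so the read-out equals plain averaging.
uniform_pooled, uniform_weights = attention_readout([0.0, 0.0], patches)

# A non-zero query favours the patch it aligns with most (here the third one).
focused_pooled, focused_weights = attention_readout([2.0, 2.0], patches)
print(uniform_weights, focused_weights)
```

&lt;p&gt;In the real model the query, keys and values are learned projections and there are many heads and layers, but the contrast between uniform and learned weighting is the same.&lt;/p&gt;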




&lt;h2&gt;
  
  
  Suggested clearer question
&lt;/h2&gt;

&lt;p&gt;“Can you explain how the Vision Transformer’s learned [class] token is initialized, updated during training, and used for classification? If you prefer, I can rephrase the question this way—would you like me to?”&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt 6
&lt;/h2&gt;

&lt;p&gt;Suggest a better version of the question and ask me if I would like to use it instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question = "How this class token result in representing complete image"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Output 6
&lt;/h2&gt;

&lt;p&gt;Here is a cleaner, more specific way to phrase your question:&lt;br&gt;
&lt;strong&gt;“How does the Vision Transformer’s learnable [class] token end up summarizing the entire image for classification?”&lt;/strong&gt;&lt;br&gt;
Would you like to use this wording?&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt 7
&lt;/h2&gt;

&lt;p&gt;How the Vision Transformer’s &lt;strong&gt;[class]&lt;/strong&gt; token learns to summarize the whole image&lt;/p&gt;

&lt;h2&gt;
  
  
  Output 7
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Where the [class] token fits in the ViT pipeline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ViT first chops an image into fixed-size patches (e.g., 16×16 px).&lt;/li&gt;
&lt;li&gt;Each patch is linearly projected to a D-dimensional embedding, and a single extra learnable vector dubbed &lt;strong&gt;[class]&lt;/strong&gt; is prepended to the resulting sequence.&lt;/li&gt;
&lt;li&gt;The resulting sequence (length = number of patches + 1) is fed through L Transformer encoder layers, each containing multi-head self-attention (MSA) and an MLP.&lt;/li&gt;
&lt;li&gt;After the final layer, the model &lt;strong&gt;only extracts the updated [class] token&lt;/strong&gt; (denoted &lt;em&gt;z_L^0&lt;/em&gt; in the paper) and passes it to a lightweight classifier head.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the network’s entire downstream decision hinges on the information that flows into that single vector.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Why information concentrates into the [class] token
&lt;/h2&gt;

&lt;p&gt;Self-attention lets every token exchange information with every other token. During training, the classification loss is computed &lt;em&gt;only&lt;/em&gt; from the final [class] vector, so all gradients enter the network through that vector and then flow back to the patch tokens via the attention weights. To minimize the loss, the model learns two complementary behaviours:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Patch-to-class attention grows.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Early layers already show heads where patch tokens assign high attention weights to the [class] token, creating a &lt;em&gt;funnel&lt;/em&gt; that channels image content toward it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Class-to-patch attention becomes selective.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Later layers feature heads where the [class] token attends back to spatial tokens that are semantically relevant (e.g., bird silhouette, steering wheel), effectively &lt;em&gt;reading&lt;/em&gt; the scene. Visualizations in ViT-L/16 reveal that by layer 10+, many heads attend almost globally, but with strongest weights on object regions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together, these two flows repeatedly mix and refine features, so by the final layer the [class] embedding is a nonlinear, global aggregation of the image.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Empirical evidence
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Observation&lt;/th&gt;
&lt;th&gt;Paper section&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Some heads attend to most of the image already in low layers&lt;/td&gt;
&lt;td&gt;§4.5, Fig. 7 (right)&lt;/td&gt;
&lt;td&gt;Average attention distance of several heads ≈ image size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;[class] attention maps highlight the target object&lt;/td&gt;
&lt;td&gt;§4.5, Fig. 6 and Appendix Fig. 14&lt;/td&gt;
&lt;td&gt;Qualitative rollout shows focus on semantically relevant pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Removing the [class] token and using global average pooling works &lt;strong&gt;only&lt;/strong&gt; after retuning LR, confirming it’s a learnable, not structural, choice&lt;/td&gt;
&lt;td&gt;Appendix D.3, Fig. 9&lt;/td&gt;
&lt;td&gt;Need different LR; otherwise large drop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Intuition vs. CNN global pooling
&lt;/h2&gt;

&lt;p&gt;A CNN funnels spatial information by gradually shrinking feature maps and finally performing global average pooling. ViT achieves a similar &lt;em&gt;information bottleneck&lt;/em&gt;, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; The bottleneck exists &lt;strong&gt;from the start&lt;/strong&gt; (single extra token),
&lt;/li&gt;
&lt;li&gt; It relies on learned attention patterns rather than fixed convolutional kernels,
&lt;/li&gt;
&lt;li&gt; It keeps full-resolution patch tokens to enable long-range reasoning until the last layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Practical take-aways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The [class] token is &lt;strong&gt;not magic&lt;/strong&gt;; it works because self-attention plus the loss force it to become a global summary.
&lt;/li&gt;
&lt;li&gt;Visualization or probing (e.g., attention rollout) can help verify your fine-tuned ViT is focusing on the right regions.
&lt;/li&gt;
&lt;li&gt;If you switch to alternatives like mean-pooled patch tokens, re-tune optimization hyper-parameters—performance can match the [class] approach but is sensitive to learning rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The [class] token therefore acts as a learned “collector” that repeatedly gathers, mixes and refines patch information through attention until it holds a representation rich enough for accurate image-level decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt 8
&lt;/h2&gt;

&lt;p&gt;For Equations 1 to 4, explain all the mathematical functions in the text to an undergraduate student who knows basic deep learning and machine learning but not this technique.&lt;br&gt;
Do NOT omit symbols: rewrite each equation, define every variable, and describe each step in plain language.&lt;br&gt;
Where relevant, draw analogies to known models (e.g., Transformer, CNN). Use numbered steps for algorithm flow.&lt;/p&gt;
&lt;h2&gt;
  
  
  Output 8
&lt;/h2&gt;
&lt;h1&gt;
  
  
  Equations (1) – (4) in Vision Transformer (ViT)
&lt;/h1&gt;

&lt;p&gt;For each equation you’ll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;exact equation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;definitions of every symbol&lt;/li&gt;
&lt;li&gt;a plain-language explanation&lt;/li&gt;
&lt;li&gt;short analogies to standard NLP Transformers or CNNs&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Equation (1)
&lt;/h2&gt;

&lt;p&gt;$$z_{0} = \bigl[\, x_{\text{class}};\; x_{p}^{1}E;\; x_{p}^{2}E;\; \dots;\; x_{p}^{N}E \,\bigr] + E_{\text{pos}}$$&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Variable glossary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$$(x\in\mathbb{R}^{H\times W\times C})$$ – full RGB image (height (H), width (W), channels (C)).
&lt;/li&gt;
&lt;li&gt;Patch size $$(P\times P)$$; number of patches $$(N=\tfrac{HW}{P^{2}})$$.
&lt;/li&gt;
&lt;li&gt;$$(x^{i}_{p}\in\mathbb{R}^{P^{2}C})$$ – flattened pixel vector of patch $$(i)$$.
&lt;/li&gt;
&lt;li&gt;$$(E\in\mathbb{R}^{(P^{2}C)\times D})$$ – learnable linear projection (turns a patch into a $$(D)$$-dimensional “word embedding”).
&lt;/li&gt;
&lt;li&gt;$$(x_{\text{class}}\in\mathbb{R}^{1\times D})$$ – learnable &lt;strong&gt;[CLS]&lt;/strong&gt; token that will gather global image information.
&lt;/li&gt;
&lt;li&gt;$$(E_{\text{pos}}\in\mathbb{R}^{(N+1)\times D})$$ – learnable positional embeddings (one per token, including the class token).
&lt;/li&gt;
&lt;li&gt;$$(z^{0}\in\mathbb{R}^{(N+1)\times D})$$ – input token sequence fed to the Transformer encoder.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Plain-language explanation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cut&lt;/strong&gt; the image into non-overlapping tiles (patches).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flatten&lt;/strong&gt; each tile’s pixels into a long vector.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project&lt;/strong&gt; every vector through the same linear layer $$(E)$$ to get a patch embedding.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prepend&lt;/strong&gt; a special learnable token $$(x_{\text{class}})$$.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; positional embeddings so the model knows where each patch came from.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Analogy: this mirrors how BERT maps word tokens to embeddings, with image patches taking the place of words.&lt;/p&gt;
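&lt;p&gt;The five steps above can be sketched literally with tensor operations. This is a minimal illustration under stated assumptions: the sizes are hypothetical, and &lt;code&gt;E&lt;/code&gt;, &lt;code&gt;x_class&lt;/code&gt; and &lt;code&gt;E_pos&lt;/code&gt; are plain random or zero tensors here, whereas in a real model they would be learnable parameters.&lt;/p&gt;

```python
import torch

torch.manual_seed(0)

# Hypothetical small sizes chosen only for illustration.
B, C, H, W = 2, 3, 32, 32          # batch, channels, height, width
P, D = 8, 64                       # patch size, embedding dimension
N = (H * W) // (P * P)             # number of patches

x = torch.randn(B, C, H, W)

# Steps 1-2: cut into non-overlapping P x P tiles and flatten each tile.
patches = x.unfold(2, P, P).unfold(3, P, P)      # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, N, C * P * P)       # (B, N, P*P*C)

# Step 3: project every patch with the same matrix E (stand-in for a parameter).
E = torch.randn(C * P * P, D) * 0.02             # (P*P*C, D)
tokens = patches @ E                             # (B, N, D)

# Step 4: prepend the class token (stand-in for a learnable parameter).
x_class = torch.zeros(1, 1, D)                   # (1, 1, D)
z0 = torch.cat([x_class.expand(B, -1, -1), tokens], dim=1)   # (B, N+1, D)

# Step 5: add positional embeddings, broadcast over the batch.
E_pos = torch.randn(1, N + 1, D) * 0.02          # (1, N+1, D)
z0 = z0 + E_pos

print(N, tuple(patches.shape), tuple(z0.shape))
```

&lt;p&gt;With these sizes each image becomes 16 patches of 192 values, and the sequence handed to the encoder has 17 tokens of width 64.&lt;/p&gt;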




&lt;h3&gt;
  
  
  Equation (2)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\boxed{\,z'_{\ell}
       = \mathrm{MSA}\!\bigl(\mathrm{LN}(z_{\ell-1})\bigr) + z_{\ell-1}
       \quad\text{for } \ell = 1\ldots L}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbols&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Layer index $\ell$.&lt;/li&gt;
&lt;li&gt;$\mathrm{LN}(\cdot)$ Layer Normalization (normalizes features within a token).&lt;/li&gt;
&lt;li&gt;$\mathrm{MSA}(\cdot)$ Multi-Head &lt;strong&gt;Self-Attention&lt;/strong&gt; (standard Transformer block).&lt;/li&gt;
&lt;li&gt;Residual “$+$” adds the attention output back to the input $z_{\ell-1}$ (skip connection).&lt;/li&gt;
&lt;li&gt;$z'_{\ell}$ Intermediate sequence after the attention sub-layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain-language view&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalize&lt;/strong&gt; current token embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let every patch attend to every other&lt;/strong&gt; (plus the class token) via multi-head attention, mixing global information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; the original embeddings back in (skip-connection helps learning).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Analogy: this is the same attention step as in BERT; unlike a CNN, the receptive field is global from the very first layer.&lt;/p&gt;
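&lt;p&gt;A minimal sketch of Eq. (2) using PyTorch's built-in &lt;code&gt;nn.MultiheadAttention&lt;/code&gt;. The sizes and variable names are hypothetical. The returned attention map has one row per token and one column per token, which is what "globally mixing information" looks like in practice.&lt;/p&gt;

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, D, heads = 2, 17, 64, 4      # T = N + 1 tokens (hypothetical sizes)

ln = nn.LayerNorm(D)
msa = nn.MultiheadAttention(D, heads, batch_first=True)

z_prev = torch.randn(B, T, D)      # z_{l-1}

h = ln(z_prev)                     # normalize features within each token
attn_out, attn_w = msa(h, h, h)    # self-attention: query = key = value
z_mid = attn_out + z_prev          # residual connection gives z'_l

print(tuple(z_mid.shape))          # same shape as the input sequence
print(tuple(attn_w.shape))         # one weight per (query token, key token) pair
```

&lt;p&gt;By default the attention weights are averaged over heads, and each row sums to 1 because it is a softmax over all tokens.&lt;/p&gt;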




&lt;h3&gt;
  
  
  Equation (3)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\boxed{\,z_{\ell}
       = \mathrm{MLP}\!\bigl(\mathrm{LN}(z'_{\ell})\bigr) + z'_{\ell}
       \quad\text{for } \ell = 1\ldots L}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbols&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;$\mathrm{MLP}(\cdot)$ Two fully-connected layers with a GELU activation in between (also called “feed-forward network”). Dimensionality: $D \rightarrow 4D \rightarrow D$.&lt;/li&gt;
&lt;li&gt;Remaining symbols as before.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain-language view&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalize&lt;/strong&gt; again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process each token independently&lt;/strong&gt; with a small neural network (adds non-linearity and channel-wise mixing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; skip connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Analogy: Comparable to a 1×1 convolution (channel mixing) applied to every location in a CNN.&lt;/p&gt;
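&lt;p&gt;The claim that the MLP processes each token independently can be checked directly. The sketch below (hypothetical sizes and helper name &lt;code&gt;mlp_sublayer&lt;/code&gt;) applies Eq. (3), perturbs a single token, and compares the outputs for all other tokens.&lt;/p&gt;

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, D = 2, 17, 64                # hypothetical sizes

ln = nn.LayerNorm(D)
mlp = nn.Sequential(
    nn.Linear(D, 4 * D),           # expand D to 4D
    nn.GELU(),
    nn.Linear(4 * D, D),           # project 4D back to D
)

def mlp_sublayer(z_mid):
    # Eq. (3): z_l = MLP(LN(z'_l)) + z'_l
    return mlp(ln(z_mid)) + z_mid

z_mid = torch.randn(B, T, D)
z_out = mlp_sublayer(z_mid)

# Perturb only token 5. The MLP never mixes information across tokens,
# so every other token's output should stay the same.
z_perturbed = z_mid.clone()
z_perturbed[:, 5] += torch.randn(B, D)
z_out_perturbed = mlp_sublayer(z_perturbed)

others = [t for t in range(T) if t != 5]
same_elsewhere = torch.allclose(z_out[:, others], z_out_perturbed[:, others], atol=1e-6)
changed_at_5 = not torch.allclose(z_out[:, 5], z_out_perturbed[:, 5])
print(same_elsewhere, changed_at_5)
```

&lt;p&gt;Running the same check on the attention sub-layer would fail, since attention is the only place where tokens exchange information.&lt;/p&gt;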




&lt;h3&gt;
  
  
  Equation (4)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\boxed{\,y = \mathrm{LN}\bigl(z^{0}_{L}\bigr)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbols&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;$z^{0}_{L}\in\mathbb{R}^{D}$ The &lt;strong&gt;class-token embedding&lt;/strong&gt; after the final (L-th) layer; we take index 0 along the sequence (because we prepended it first).&lt;/li&gt;
&lt;li&gt;$y\in\mathbb{R}^{D}$ Final image representation after LayerNorm; fed to a linear or MLP head, followed by softmax, for classification.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plain-language view&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The final, normalized vector of the special class token is the model’s distilled understanding of the whole image—just like the “[CLS]” vector in text BERT.&lt;/p&gt;
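&lt;p&gt;A short sketch of the readout in Eq. (4) followed by a linear head. The sizes and the label count are hypothetical. It also shows that, because LayerNorm acts on each token separately, normalizing only the class token gives the same result as normalizing the whole sequence and then taking index 0.&lt;/p&gt;

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, D = 2, 17, 64                # hypothetical sizes
num_classes = 37                   # hypothetical label count for the head

ln_final = nn.LayerNorm(D)
head = nn.Linear(D, num_classes)   # simplest possible classification head

z_L = torch.randn(B, T, D)         # output of the last encoder layer

y = ln_final(z_L[:, 0])            # Eq. (4): normalized class token, (B, D)
logits = head(y)                   # (B, num_classes)

# Alternative order: normalize every token, then pick the class token.
y_alt = ln_final(z_L)[:, 0]
print(torch.allclose(y, y_alt, atol=1e-6), tuple(logits.shape))
```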




&lt;h2&gt;
  
  
  Putting it Together – Algorithm Flow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Patchify &amp;amp; Embed&lt;/strong&gt; Apply Eq.(1) to build the token sequence with positional info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat for ℓ = 1…L&lt;/strong&gt;&lt;br&gt;
2.1 Self-attention with skip connection → Eq. (2)&lt;br&gt;
2.2 MLP with skip connection → Eq. (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readout&lt;/strong&gt; Take the layer-normalized class token (Eq. (4)).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head&lt;/strong&gt; Pass $y$ through a linear or MLP layer to predict class logits.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Key Intuitions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global context early&lt;/strong&gt;: Unlike CNNs, attention lets even low-level layers see the entire image, not just a small kernel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal vision bias&lt;/strong&gt;: The only “inductive bias” is square patching and adding position embeddings; everything else is generic Transformer machinery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same math as text Transformers&lt;/strong&gt;: If you understand BERT’s forward pass, swap “words” for “image patches” and you’ve got ViT’s core.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prompt 9.1
&lt;/h2&gt;

&lt;p&gt;CODE: Convert Equations (1) to (4), combined with the steps in the figure, into clean, runnable PyTorch pseudocode. Focus on clarity, variable shapes, and comments. Return ONLY the code block.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vision Transformer – PyTorch-style pseudocode
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implements Equations (1)–(4) + the flow from Figure 1&lt;/li&gt;
&lt;li&gt;Shapes follow (B = batch, C = channels, H,W = image size, N = #patches, D = embed dim)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViTEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;patch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;in_chans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# D
&lt;/span&gt;        &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# L
&lt;/span&gt;        &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mlp_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;img_size&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;patch_size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image size must be divisible by patch size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_size&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;patch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;                  &lt;span class="c1"&gt;# patches per image
&lt;/span&gt;
        &lt;span class="c1"&gt;# (Eq. 1) — linear projection of flattened patches
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patch_embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_chans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;kernel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;patch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;stride&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;patch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# → (B, D, H/P, W/P)
&lt;/span&gt;
        &lt;span class="c1"&gt;# learnable class token + 1-D position embeddings
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls_tok&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pos_embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Transformer encoder layers (Eqs. 2–3)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleDict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;ln1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiheadAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;batch_first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;ln2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;mlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mlp_ratio&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GELU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mlp_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ln_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c1"&gt;# (Eq. 4)
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                            &lt;span class="c1"&gt;# (B, C, H, W)
&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- Equation (1) ---------------------------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch_embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                  &lt;span class="c1"&gt;# (B, D, H/P, W/P)
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                         &lt;span class="c1"&gt;# (B, N, D)
&lt;/span&gt;
        &lt;span class="n"&gt;cls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls_tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                     &lt;span class="c1"&gt;# (B, 1, D)
&lt;/span&gt;        &lt;span class="n"&gt;z&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pos_embed&lt;/span&gt;        &lt;span class="c1"&gt;# (B, N+1, D)
&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- Equations (2) &amp;amp; (3) repeated L times --------------------------
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                        &lt;span class="c1"&gt;# LN once; reused as Q, K, V&lt;/span&gt;
            &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# ---- Equation (4) ---------------------------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_final&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                     &lt;span class="c1"&gt;# (B, N+1, D)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                                           &lt;span class="c1"&gt;# CLS token ⇒ image rep
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
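&lt;p&gt;The class above implements the patch projection of Eq. (1) with a strided &lt;code&gt;nn.Conv2d&lt;/code&gt; rather than an explicit flatten followed by a matrix multiply. The sketch below, with hypothetical sizes, checks that the two routes give the same tokens. One difference worth noting: &lt;code&gt;nn.Conv2d&lt;/code&gt; carries a bias by default, whereas Eq. (1) as written has no bias term.&lt;/p&gt;

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, C, H, W = 2, 3, 32, 32          # hypothetical sizes
P, D = 8, 64

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
x = torch.randn(B, C, H, W)

with torch.no_grad():
    # Path A: strided convolution, then flatten the spatial grid.
    tokens_conv = conv(x).flatten(2).transpose(1, 2)      # (B, N, D)

    # Path B: flatten each patch explicitly, then apply one linear map.
    patches = F.unfold(x, kernel_size=P, stride=P)        # (B, C*P*P, N)
    patches = patches.transpose(1, 2)                     # (B, N, C*P*P)
    E = conv.weight.reshape(D, C * P * P).t()             # (C*P*P, D)
    tokens_linear = patches @ E + conv.bias               # (B, N, D)

match = torch.allclose(tokens_conv, tokens_linear, atol=1e-4)
print(match, tuple(tokens_conv.shape))
```

&lt;p&gt;The convolution is usually preferred in practice because it avoids materializing the flattened patches, but both forms express the same linear projection.&lt;/p&gt;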



&lt;h2&gt;
  
  
  Prompt 9.2
&lt;/h2&gt;

&lt;p&gt;Could you show a complete PyTorch data pipeline (dataset class, transforms, and dataloaders) plus a concise training loop for classifying the Oxford-IIIT Pet images, integrating ViTEncoder? Please keep the code runnable end-to-end and highlight any important hyper-parameters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ---------------------------------------------------
# 1. CONFIG
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;img_size&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# resize-crop for ViT
&lt;/span&gt;    &lt;span class="n"&gt;patch_size&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# must match encoder checkpoint
&lt;/span&gt;    &lt;span class="n"&gt;batch_size&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lr&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Adam default β1=0.9, β2=0.999
&lt;/span&gt;    &lt;span class="n"&gt;weight_decay&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_workers&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# &amp;gt;0 for speed if your OS allows
&lt;/span&gt;    &lt;span class="n"&gt;model_name&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vit_base_patch16_224&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# timm preset
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------
# 2. DATA PIPELINE
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.optim&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;optim&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torchvision&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;

&lt;span class="n"&gt;MEAN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.485&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.406&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;STD&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.229&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.225&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;train_tfms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomResizedCrop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomHorizontalFlip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MEAN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STD&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;val_tfms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CenterCrop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MEAN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STD&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;train_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OxfordIIITPet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trainval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_tfms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OxfordIIITPet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_tfms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_workers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pin_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;val_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_workers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pin_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------
# 3. MODEL – ViT encoder + new classifier
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;vit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;pretrained&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="c1"&gt;# drop original head
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;in_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_features&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 37 pet breeds
&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------
# 4. OPTIMIZER &amp;amp; LR SCHEDULER
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;optimizer&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                         &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                         &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight_decay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;scheduler&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CosineAnnealingLR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T_max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epochs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------
# 5. TRAIN / VAL LOOP
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epochs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="c1"&gt;# --- training ---
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;running_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;non_blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;non_blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;running_loss&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;correct&lt;/span&gt;      &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;seen&lt;/span&gt;         &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;train_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;
    &lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;running_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;

    &lt;span class="c1"&gt;# --- validation ---
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;val_loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;non_blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;non_blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;seen&lt;/span&gt;    &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;

    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CFG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epochs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loss &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train acc &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;train_acc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;val acc &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_acc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------
# 6. SAVE (optional)
# ---------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;state_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vit_pet_classifier.pth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
