For years, Convolutional Neural Networks (CNNs) ruled computer vision. But since the paper “An Image is Worth 16x16 Words”, the Vision Transformer (ViT) has challenged CNNs by treating an image as a sequence of patches—similar to how words form a sentence.
In this post, we’ll walk through a PyTorch implementation of ViT, trained on a small food classification dataset (pizza, steak, sushi).
Core Idea
- Split an image into fixed-size patches (e.g., 16×16).
- Flatten patches into vectors → feed them as tokens.
- Add:
  - [CLS] token → represents the entire image for classification.
  - Positional embeddings → retain spatial info.
- Process the sequence with a Transformer Encoder (a minimal sketch follows this list).
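Here is a minimal sketch of the patch-embedding step (class and argument names are illustrative, not the exact code from the post). A strided convolution splits and projects the patches in one shot; then we prepend the [CLS] token and add positional embeddings.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens (minimal sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A conv with kernel=stride=patch_size is equivalent to
        # "split into patches, then linearly project each one".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim) * 0.02)

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        x = self.proj(x)                                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                     # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                       # (B, 197, 768)
        return x + self.pos_embed                            # add positional info
```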
ViT-Base Config
- Image size: 224×224
- Patch size: 16×16 → 196 patch tokens (197 with the [CLS] token)
- Embedding dim: 768
- Layers: 12
- Attention heads: 12
- Params: ~85.8M
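Putting those numbers together, one way to assemble a ViT-Base-style model is to stack PyTorch’s built-in encoder layers on top of the `PatchEmbedding` sketch above. This is an approximation of the architecture (the real ViT differs in details like initialization), but the parameter count lands in the same ~86M ballpark.

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    """ViT-Base-style model (illustrative sketch, reuses PatchEmbedding from above)."""
    def __init__(self, num_classes=3, embed_dim=768, depth=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)   # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)        # (B, 197, 768)
        x = self.encoder(x)            # (B, 197, 768)
        cls = self.norm(x[:, 0])       # classify from the [CLS] token only
        return self.head(cls)          # (B, num_classes)

# Rough sanity check on size:
# sum(p.numel() for p in ViT().parameters())  -> on the order of 86M
```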
Dataset
We used a 3-class dataset:
- 🍕 Pizza
- 🥩 Steak
- 🍣 Sushi
All images resized to 224×224.
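Loading the data is standard `ImageFolder` territory. The directory paths below are placeholders; point them at wherever your pizza/steak/sushi folders live.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # all images resized to 224x224
    transforms.ToTensor(),
])

# Hypothetical paths; adjust to your dataset location.
train_data = datasets.ImageFolder("data/pizza_steak_sushi/train", transform=transform)
test_data = datasets.ImageFolder("data/pizza_steak_sushi/test", transform=transform)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)

print(train_data.classes)  # ['pizza', 'steak', 'sushi']
```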
Training Setup
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss | CrossEntropyLoss |
| Learning rate | 0.001 |
| Batch size | 32 |
| Epochs | 10 |
| Device | GPU (CUDA) |
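A bare-bones training loop matching the table might look like this (it assumes the `ViT` model and `train_loader` from the sketches above):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViT(num_classes=3).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: train loss {running_loss / len(train_loader):.4f}")
```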
Results
- Training Loss → drops quickly (the model has more than enough capacity to fit a small training set).
- Validation Loss → may plateau or even rise (overfitting risk).
- Accuracy → training accuracy approaches 100%; validation accuracy is the honest measure of generalization (see the evaluation sketch below).
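To see the train/validation gap, a simple evaluation helper like the one below can be run after each epoch (names are illustrative; it reuses `model`, `test_loader`, and `device` from the earlier sketches):

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device):
    """Compute classification accuracy on a held-out loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

print(f"val accuracy: {evaluate(model, test_loader, device):.2%}")
```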
ViTs are large models, and on small datasets they overfit quickly. For real use, start from a pretrained ViT and fine-tune it, as sketched below.
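For example, with torchvision you can load ImageNet-pretrained ViT-Base/16 weights, freeze the backbone, and swap in a 3-class head, roughly like this:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ImageNet-pretrained ViT-Base/16 and freeze the backbone.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for our 3 food classes.
model.heads.head = nn.Linear(in_features=768, out_features=3)

# Use the preprocessing that matches the pretrained weights.
preprocess = weights.transforms()
```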
Takeaways
- ViT proves attention works for vision, not just text.
- Even a from-scratch implementation highlights the shift from pixels → patches → tokens.
Next steps:
- Try on larger datasets (CIFAR-100, ImageNet subset).
- Use pretrained weights (HuggingFace, timm).
- Experiment with augmentations (Mixup, CutMix).
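For the augmentation experiments, recent torchvision versions (roughly 0.16 and later) ship batch-level MixUp and CutMix transforms. A hedged sketch of how they could plug into the existing loop:

```python
from torchvision.transforms import v2

# CutMix / MixUp operate on whole batches of (images, integer labels).
cutmix_or_mixup = v2.RandomChoice([v2.CutMix(num_classes=3), v2.MixUp(num_classes=3)])

for images, labels in train_loader:
    # Labels become soft (B, 3) targets; CrossEntropyLoss accepts them directly.
    images, labels = cutmix_or_mixup(images, labels)
    # ...forward pass, loss, and optimizer step as in the training loop above...
```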