For years, Convolutional Neural Networks (CNNs) ruled computer vision. But since the paper “An Image is Worth 16x16 Words”, the Vision Transformer (ViT) has challenged CNNs by treating an image as a sequence of patches—similar to how words form a sentence.
In this post, we’ll walk through a PyTorch implementation of ViT, trained on a small food classification dataset (pizza, steak, sushi).
Core Idea
- Split an image into fixed-size patches (e.g., 16×16).
- Flatten patches into vectors → feed them as tokens.
- Add:
  - [CLS] token → represents the entire image for classification.
  - Positional embeddings → retain spatial info.
- Process the sequence with a Transformer Encoder (a minimal sketch follows this list).
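Here is a minimal sketch of the patch-embedding step (class and argument names are illustrative, not the exact code from the post). A strided convolution splits and projects the patches in one shot; then we prepend the [CLS] token and add positional embeddings.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens (minimal sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A conv with kernel=stride=patch_size is equivalent to
        # "split into patches, then linearly project each one".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim) * 0.02)

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        x = self.proj(x)                                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                     # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                       # (B, 197, 768)
        return x + self.pos_embed                            # add positional info
```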
ViT-Base Config
- Image size: 224×224
- Patch size: 16×16 → 196 patch tokens (197 with the [CLS] token)
- Embedding dim: 768
- Layers: 12
- Attention heads: 12
- Params: ~85.8M
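Putting those numbers together, one way to assemble a ViT-Base-style model is to stack PyTorch’s built-in encoder layers on top of the `PatchEmbedding` sketch above. This is an approximation of the architecture (the real ViT differs in details like initialization), but the parameter count lands in the same ~86M ballpark.

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    """ViT-Base-style model (illustrative sketch, reuses PatchEmbedding from above)."""
    def __init__(self, num_classes=3, embed_dim=768, depth=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)   # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)        # (B, 197, 768)
        x = self.encoder(x)            # (B, 197, 768)
        cls = self.norm(x[:, 0])       # classify from the [CLS] token only
        return self.head(cls)          # (B, num_classes)

# Rough sanity check on size:
# sum(p.numel() for p in ViT().parameters())  -> on the order of 86M
```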
Dataset
We used a 3-class dataset:
- 🍕 Pizza
- 🥩 Steak
- 🍣 Sushi
All images resized to 224×224.
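Loading the data is standard `ImageFolder` territory. The directory paths below are placeholders; point them at wherever your pizza/steak/sushi folders live.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # all images resized to 224x224
    transforms.ToTensor(),
])

# Hypothetical paths; adjust to your dataset location.
train_data = datasets.ImageFolder("data/pizza_steak_sushi/train", transform=transform)
test_data = datasets.ImageFolder("data/pizza_steak_sushi/test", transform=transform)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)

print(train_data.classes)  # ['pizza', 'steak', 'sushi']
```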
Training Setup
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss | CrossEntropyLoss |
| Learning rate | 0.001 |
| Batch size | 32 |
| Epochs | 10 |
| Device | GPU (CUDA) |
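A bare-bones training loop matching the table might look like this (it assumes the `ViT` model and `train_loader` from the sketches above):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViT(num_classes=3).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: train loss {running_loss / len(train_loader):.4f}")
```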
Results
- Training Loss → drops quickly (the model has more than enough capacity to fit a small training set).
- Validation Loss → may plateau or even rise (overfitting risk).
- Accuracy → training accuracy approaches 100%; validation accuracy is the honest measure of generalization (see the evaluation sketch below).
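To see the train/validation gap, a simple evaluation helper like the one below can be run after each epoch (names are illustrative; it reuses `model`, `test_loader`, and `device` from the earlier sketches):

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device):
    """Compute classification accuracy on a held-out loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

print(f"val accuracy: {evaluate(model, test_loader, device):.2%}")
```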
ViTs are large models, and on small datasets they overfit quickly. For real use, start from a pretrained ViT and fine-tune it, as sketched below.
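For example, with torchvision you can load ImageNet-pretrained ViT-Base/16 weights, freeze the backbone, and swap in a 3-class head, roughly like this:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ImageNet-pretrained ViT-Base/16 and freeze the backbone.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for our 3 food classes.
model.heads.head = nn.Linear(in_features=768, out_features=3)

# Use the preprocessing that matches the pretrained weights.
preprocess = weights.transforms()
```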
Takeaways
- ViT proves attention works for vision, not just text.
- Even a from-scratch implementation highlights the shift from pixels → patches → tokens.
Next steps:
- Try on larger datasets (CIFAR-100, ImageNet subset).
- Use pretrained weights (HuggingFace, timm).
- Experiment with augmentations (Mixup, CutMix).
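For the augmentation experiments, recent torchvision versions (roughly 0.16 and later) ship batch-level MixUp and CutMix transforms. A hedged sketch of how they could plug into the existing loop:

```python
from torchvision.transforms import v2

# CutMix / MixUp operate on whole batches of (images, integer labels).
cutmix_or_mixup = v2.RandomChoice([v2.CutMix(num_classes=3), v2.MixUp(num_classes=3)])

for images, labels in train_loader:
    # Labels become soft (B, 3) targets; CrossEntropyLoss accepts them directly.
    images, labels = cutmix_or_mixup(images, labels)
    # ...forward pass, loss, and optimizer step as in the training loop above...
```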