Vision Transformer (ViT) from Scratch in PyTorch

For years, Convolutional Neural Networks (CNNs) ruled computer vision. But since the paper “An Image is Worth 16x16 Words”, the Vision Transformer (ViT) has challenged CNNs by treating an image as a sequence of patches—similar to how words form a sentence.

In this post, we’ll walk through a PyTorch implementation of ViT, trained on a small food classification dataset (pizza, steak, sushi).


Core Idea

Architecture of the ViT

  • Split the image into fixed-size patches (e.g., 16×16).
  • Flatten each patch and linearly project it into an embedding vector → feed these as tokens.
  • Add:

    • [CLS] Token → a learnable token whose output represents the entire image for classification.
    • Positional Embeddings → retain spatial information.
  • Process the sequence with a Transformer Encoder (a minimal sketch of these steps follows below).
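
Here is a minimal sketch of the patch-embedding step, assuming ViT-Base shapes (224×224 input, 16×16 patches, 768-dim embeddings). The class and variable names are illustrative, not necessarily the ones from my implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each patch to an embedding vector."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # Conv2d with kernel_size == stride == patch_size is equivalent to
        # slicing non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                          # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, 768): one token per patch
        return x


# Prepend the [CLS] token and add positional embeddings before the encoder.
images = torch.randn(2, 3, 224, 224)
patch_embed = PatchEmbedding()
tokens = patch_embed(images)                                    # (2, 196, 768)

cls_token = nn.Parameter(torch.zeros(1, 1, 768))
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))              # 196 patches + [CLS]

cls = cls_token.expand(tokens.shape[0], -1, -1)                 # (2, 1, 768)
tokens = torch.cat([cls, tokens], dim=1) + pos_embed            # (2, 197, 768)
```

The Conv2d trick (kernel size = stride = patch size) is a common shortcut: it is mathematically the same as cutting the image into non-overlapping patches and running one shared linear projection over each flattened patch.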


ViT-Base Config

  • Image size: 224×224
  • Patch size: 16×16 → 196 patches (197 tokens with [CLS])
  • Embedding dim: 768
  • Layers: 12
  • Attention heads: 12
  • Params: ~85.8M
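
As a sketch of how these numbers could map onto PyTorch primitives (the constant names are mine, and using nn.TransformerEncoder is just one way to build a pre-norm encoder like the paper's):

```python
import torch.nn as nn

# ViT-Base hyperparameters from the list above (constant names are mine).
IMG_SIZE, PATCH_SIZE = 224, 16
NUM_PATCHES = (IMG_SIZE // PATCH_SIZE) ** 2    # 14 * 14 = 196
EMBED_DIM, DEPTH, NUM_HEADS = 768, 12, 12
MLP_DIM = 4 * EMBED_DIM                        # 3072, as in the paper

# One way to get a pre-norm Transformer encoder from PyTorch primitives.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=EMBED_DIM,
    nhead=NUM_HEADS,
    dim_feedforward=MLP_DIM,
    activation="gelu",
    batch_first=True,     # inputs are (batch, tokens, dim)
    norm_first=True,      # pre-norm, as in the ViT paper
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=DEPTH)

# Classification head applied to the [CLS] token's output.
head = nn.Linear(EMBED_DIM, 3)   # 3 classes: pizza, steak, sushi
```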

Dataset

We used a 3-class dataset:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

All images are resized to 224×224, as in the data-loading sketch below.
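
Assuming the images live in class-named folders, torchvision's ImageFolder handles the labels for us; the paths below are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize everything to 224x224 and convert to tensors.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Placeholder paths: one subfolder per class (pizza/, steak/, sushi/).
train_data = datasets.ImageFolder("data/train", transform=transform)
test_data  = datasets.ImageFolder("data/test",  transform=transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=32, shuffle=False)

print(train_data.classes)   # ['pizza', 'steak', 'sushi']
```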


Training Setup

  • Optimizer: Adam
  • Loss: CrossEntropyLoss
  • Learning rate: 0.001
  • Batch size: 32
  • Epochs: 10
  • Device: GPU (CUDA)
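
A minimal training loop matching this setup, assuming `vit` is the model assembled in the earlier sketches and `train_loader` comes from the dataset sketch above:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = vit.to(device)            # assumed: the ViT assembled in the earlier sketches

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    model.train()
    for images, labels in train_loader:          # assumed: loader from the dataset sketch
        images, labels = images.to(device), labels.to(device)

        logits = model(images)                   # (B, 3) class scores
        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```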

Results

Plots of the training and testing loss and accuracy:

  • Training Loss → drops quickly (ViT has more than enough capacity to fit a small training set).
  • Validation Loss → may plateau or rise (overfitting risk).
  • Accuracy → training accuracy approaches 100%, while validation accuracy reflects true generalization.

ViTs are large models and overfit quickly on small datasets. For real use, start from a pretrained ViT and fine-tune it, as in the sketch below.
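
For example, a fine-tuning sketch using torchvision's pretrained ViT-Base/16 (requires torchvision ≥ 0.13; freezing the backbone is my choice here, not a requirement):

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ImageNet-pretrained ViT-Base/16 and freeze the backbone.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
for param in model.parameters():
    param.requires_grad = False

# Swap in a fresh 3-class head for pizza/steak/sushi.
model.heads = nn.Sequential(nn.Linear(768, 3))

# Use the preprocessing that matches the pretrained weights.
preprocess = weights.transforms()
```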


Takeaways

  • ViT proves attention works for vision, not just text.
  • Even a from-scratch implementation highlights the shift from pixels → patches → tokens.
  • Next steps:

    • Try on larger datasets (CIFAR-100, ImageNet subset).
    • Use pretrained weights (HuggingFace, timm).
    • Experiment with augmentations (Mixup, CutMix); a batch-level sketch follows below.
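
For the augmentation idea, torchvision ≥ 0.16 ships batch-level MixUp/CutMix transforms; here is a sketch of wiring them into the DataLoader via collate_fn, reusing `train_data` from the dataset sketch above:

```python
from torch.utils.data import DataLoader, default_collate
from torchvision.transforms import v2

# MixUp/CutMix operate on whole batches, so they go in the collate_fn.
mix = v2.RandomChoice([v2.MixUp(num_classes=3), v2.CutMix(num_classes=3)])

def collate_fn(batch):
    images, labels = default_collate(batch)
    return mix(images, labels)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True,
                          collate_fn=collate_fn)
```

Note that the mixed labels become soft targets, which nn.CrossEntropyLoss accepts directly in recent PyTorch versions.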
