DEV Community

TildAlice

Posted on • Originally published at tildalice.io

DeiT III vs DINOv2: ViT ImageNet Accuracy Without Labels

Two Papers That Changed How We Train Vision Transformers

ViT-Large hitting 87.7% top-1 accuracy on ImageNet without seeing a single label during pretraining. That's the headline from Meta AI's DINOv2, and it finally closes a gap that's been bugging me since the original ViT paper dropped.
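Worth spelling out what "without labels" means here: the backbone is pretrained with self-supervision, and labels enter only when a lightweight linear probe is fit on top of the frozen features. Here's a minimal sketch of that evaluation protocol, with synthetic Gaussian clusters standing in for real DINOv2 embeddings (all data and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen backbone features:
# 3 classes as Gaussian clusters in a 16-dim embedding space.
n_per_class, dim, n_classes = 200, 16, 3
means = rng.normal(0, 3, size=(n_classes, dim))
X = np.concatenate([rng.normal(means[c], 1.0, size=(n_per_class, dim))
                    for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shuffle and split into probe-train / probe-test sets.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
split = int(0.8 * len(X))
Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]

# Linear probe: softmax regression on frozen features,
# trained with full-batch gradient descent.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
for _ in range(200):
    logits = Xtr @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(ytr)), ytr] -= 1.0            # softmax gradient
    W -= 0.1 * Xtr.T @ p / len(ytr)
    b -= 0.1 * p.mean(axis=0)

acc = ((Xte @ W + b).argmax(axis=1) == yte).mean()
print(f"linear-probe accuracy: {acc:.2%}")
```

The key point is that the backbone's weights never see a gradient from the labels; only the tiny `W`, `b` pair does. That's why a strong linear-probe number is such a meaningful signal about feature quality.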

But here's what's interesting: DeiT III (Touvron et al., 2022) came out a year earlier and achieved 87.2% with a supervised recipe. Same architecture family, nearly identical numbers, completely different training philosophies. Which approach actually wins, and more importantly — which would you deploy?

You can find both papers on arXiv: DeiT III (Touvron et al., 2022) and DINOv2 (Oquab et al., 2023).

Photo by Polina Zimmerman on Pexels

The DeiT III Training Recipe: Supervised, But Make It Simple


Continue reading the full article on TildAlice
