Two Papers That Changed How We Train Vision Transformers
ViT-Large hitting 87.7% top-1 accuracy on ImageNet without seeing a single label during pretraining. That's the headline from Meta AI's DINOv2, and it finally closes a gap that's been bugging me since the original ViT paper dropped.
But here's what's interesting: DeiT III (Touvron et al., 2022) came out a year earlier and achieved 87.2% with a supervised recipe. Same architecture family, nearly identical numbers, completely different training philosophies. Which approach actually wins, and more importantly — which would you deploy?
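To make the deployment question concrete, here is a minimal sketch of how you might pull both backbones as frozen feature extractors and compare their embeddings. The hub entry point and the timm model name are assumptions based on the public DINOv2 and timm releases (they can differ across versions), and the tensor shapes in the comments are what I'd expect for ViT-L, not numbers from either paper.

```python
# Sketch: loading both ViT-L backbones for feature extraction (assumed names, verify
# against your installed versions of torch hub's facebookresearch/dinov2 and timm).
import torch
import timm

device = "cuda" if torch.cuda.is_available() else "cpu"

# DINOv2 ViT-L/14: self-supervised pretraining, no ImageNet labels seen.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()

# DeiT III ViT-L/16: fully supervised recipe; num_classes=0 drops the classifier head
# so the model returns pooled features instead of logits.
deit3 = timm.create_model(
    "deit3_large_patch16_224", pretrained=True, num_classes=0
).to(device).eval()

# Dummy batch at 224x224, which divides evenly by both patch sizes (16*14 = 224).
x = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    feats_dino = dinov2(x)  # expected shape (1, 1024): CLS embedding for ViT-L
    feats_deit = deit3(x)   # expected shape (1, 1024): pooled features, head removed

print(feats_dino.shape, feats_deit.shape)
```

In both cases you end up with a 1024-dimensional embedding you can stick a linear head on; the practical difference is how those features were learned, which is exactly the trade-off the two papers stake out.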
You can read the DeiT III paper here and the DINOv2 paper here.
The DeiT III Training Recipe: Supervised, But Make It Simple
Continue reading the full article on TildAlice
