DEV Community

TildAlice

Posted on • Originally published at tildalice.io

DeiT III vs DINOv2: ViT ImageNet Accuracy Without Labels

Two Papers That Changed How We Train Vision Transformers

ViT-Large hitting 87.7% top-1 accuracy on ImageNet without seeing a single label during pretraining. That's the headline from Meta AI's DINOv2, and it finally closes a gap that's been bugging me since the original ViT paper dropped.
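Worth spelling out what "without labels" means here: the backbone is pretrained with self-supervision, and labels enter only when a lightweight linear probe is fit on top of the frozen features. Here's a minimal sketch of that evaluation protocol, with synthetic Gaussian clusters standing in for real DINOv2 embeddings (all data and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen backbone features:
# 3 classes as Gaussian clusters in a 16-dim embedding space.
n_per_class, dim, n_classes = 200, 16, 3
means = rng.normal(0, 3, size=(n_classes, dim))
X = np.concatenate([rng.normal(means[c], 1.0, size=(n_per_class, dim))
                    for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shuffle and split into probe-train / probe-test sets.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
split = int(0.8 * len(X))
Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]

# Linear probe: softmax regression on frozen features,
# trained with full-batch gradient descent.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
for _ in range(200):
    logits = Xtr @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(ytr)), ytr] -= 1.0            # softmax gradient
    W -= 0.1 * Xtr.T @ p / len(ytr)
    b -= 0.1 * p.mean(axis=0)

acc = ((Xte @ W + b).argmax(axis=1) == yte).mean()
print(f"linear-probe accuracy: {acc:.2%}")
```

The key point is that the backbone's weights never see a gradient from the labels; only the tiny `W`, `b` pair does. That's why a strong linear-probe number is such a meaningful signal about feature quality.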

But here's what's interesting: DeiT III (Touvron et al., 2022) came out a year earlier and achieved 87.2% with a supervised recipe. Same architecture family, nearly identical numbers, completely different training philosophies. Which approach actually wins, and more importantly — which would you deploy?

You can find both papers on arXiv: DeiT III (Touvron et al., 2022) and DINOv2 (Oquab et al., 2023).

Photo by Polina Zimmerman on Pexels

The DeiT III Training Recipe: Supervised, But Make It Simple


Continue reading the full article on TildAlice
