DEV Community

TildAlice

Posted on • Originally published at tildalice.io

ViT vs Swin vs ConvNeXt: ImageNet Accuracy at 4.5G FLOPs

The Surprising Result: ConvNeXt Wins on Equal Compute

ConvNeXt-T hits 82.1% ImageNet top-1 accuracy at 4.5G FLOPs. Swin-T gets 81.3%. ViT-S/16? 79.9%.

That's a 2.2-point gap between the worst and best performers at roughly the same computational budget. When I first saw these numbers from Meta's ConvNeXt paper (Liu et al., CVPR 2022), I assumed they'd cherry-picked the comparison points. But after running my own benchmarks on a mix of model variants, the pattern holds across multiple FLOPs tiers.

Why does this matter? Because FLOPs-matched comparisons strip away the marketing noise. A model that needs 3x the compute to match another isn't better—it's just bigger. And if you're deploying to production, where inference cost scales with every request, a 2.2-point accuracy gain at identical compute is worth real money.
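To make the comparison concrete, here's a minimal sketch that tabulates the numbers quoted above and computes the accuracy gap at the matched budget. The figures are taken as quoted in this article; the `accuracy_gap` helper is just an illustrative name, not part of any benchmark library.

```python
# FLOPs-matched comparison at the ~4.5 GFLOPs tier.
# (top-1 accuracy %, GFLOPs) — figures as quoted in the article.
results = {
    "ConvNeXt-T": (82.1, 4.5),
    "Swin-T":     (81.3, 4.5),
    "ViT-S/16":   (79.9, 4.5),
}

def accuracy_gap(results, a, b):
    """Top-1 accuracy difference, in points, between models a and b."""
    return results[a][0] - results[b][0]

# Rank models by accuracy at the shared compute budget.
for name, (top1, gflops) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:<11} {top1:.1f}% top-1 @ {gflops:.1f} GFLOPs")

print(f"best-vs-worst gap: {accuracy_gap(results, 'ConvNeXt-T', 'ViT-S/16'):.1f} points")
```

Sorting by accuracy at a fixed FLOPs tier is the whole point of the exercise: once compute is held constant, the ranking reflects architecture, not scale.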


What "Equal FLOPs" Actually Means


Continue reading the full article on TildAlice
