
Takara Taniguchi

[memo] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Notes on the Vision Transformer paper (Lucas Beyer is among the authors).

Large-scale training outweighs the inductive biases built into CNNs.

Related works
Cordonnier et al.
Convolutional neural networks -> object detection, image classification
ImageGPT: Transformers applied to image pixels in an unsupervised fashion

Method
Vision transformer
Transformer input: 1D sequences
Reshape images into a sequence of flattened 2D patches.
Position embeddings are added to each patch embedding, and a learnable [class] token is prepended to the sequence.
The Transformer encoder output at the [class] token position serves as the image representation.
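
As a rough sketch of this input pipeline (not the authors' reference code): with a 224x224 image and 16x16 patches, you get 14x14 = 196 patch tokens, each projected to 768 dimensions (ViT-Base sizes), plus one [class] token. The module name and defaults below are my own assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Minimal ViT-style input: patch projection + [class] token + position embeddings.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to flattening each 2D patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend [class] token -> (B, 197, 768)
        return x + self.pos_embed             # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is what the standard Transformer encoder consumes; in the paper the encoder output at the [class] position is then fed to the classification head.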

Vision transformer
It is impressive that such a simple architecture achieves such high accuracy.
