Contrastive Captioners (CoCa)
By Vijay Vasudevan of Google Research
Introduction
Language models
BERT
T5
GPT-3
Downstream tasks
Zero-shot image classification
3 types of architectures
Single encoder
Dual encoder
Encoder-decoder
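As a rough sketch of how these three paradigms differ in their training objective (PyTorch-style pseudocode; the module names are hypothetical stand-ins, not any paper's actual API):

```python
import torch
import torch.nn.functional as F

# Hypothetical modules for illustration (assumptions, not a released API):
#   image_encoder(images)        -> (batch, dim) pooled image embeddings
#   text_encoder(texts)          -> (batch, dim) pooled text embeddings
#   text_decoder(images, tokens) -> (batch, seq, vocab) next-token logits

def single_encoder_loss(image_encoder, classifier, images, labels):
    # Single encoder (e.g. ViT on ImageNet/JFT): plain cross-entropy classification.
    return F.cross_entropy(classifier(image_encoder(images)), labels)

def dual_encoder_loss(image_encoder, text_encoder, images, texts, temp=0.07):
    # Dual encoder (e.g. CLIP/ALIGN): symmetric contrastive loss over matched pairs.
    img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)
    logits = img @ txt.t() / temp                  # (batch, batch) similarities
    target = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

def encoder_decoder_loss(text_decoder, images, tokens):
    # Encoder-decoder (e.g. SimVLM): autoregressive captioning with teacher forcing.
    logits = text_decoder(images, tokens[:, :-1])  # predict each next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```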
Related works
Vision Pretraining
Imagenet
Instagram
JFT
Vision-language pretraining
R-CNN
VinVL
VLMO
Image-Text Foundation Models
CLIP
ALIGN
LiT
Strengths of CoCa
Needs only one forward and one backward propagation per batch
Trained on two objectives simultaneously
Approach
CoCa has two training objectives.
The unimodal text decoder outputs a [CLS] token embedding for the text.
The image encoder likewise outputs a corresponding image embedding.
CoCa computes a contrastive loss between these two embeddings.
The multimodal text decoder then takes the image encoder's outputs and computes the captioning loss.
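A minimal sketch of how both objectives could be computed in one pass (PyTorch-style; the mean pooling, [CLS] handling, and default loss weights are simplifications and assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F

def coca_loss(image_encoder, unimodal_decoder, multimodal_decoder,
              tokens, images, lambda_con=1.0, lambda_cap=2.0, temp=0.07):
    # Image encoder and unimodal text decoder each run ONCE per batch;
    # both losses reuse their outputs.
    img_tokens = image_encoder(images)     # (B, n_patches, dim)
    txt_tokens = unimodal_decoder(tokens)  # (B, seq+1, dim), [CLS] appended last

    # Contrastive loss between the text [CLS] embedding and a pooled image
    # embedding (mean pooling here stands in for the paper's attentional pooling).
    img_cls = F.normalize(img_tokens.mean(dim=1), dim=-1)
    txt_cls = F.normalize(txt_tokens[:, -1], dim=-1)
    logits = img_cls @ txt_cls.t() / temp
    target = torch.arange(logits.size(0), device=logits.device)
    con_loss = (F.cross_entropy(logits, target) +
                F.cross_entropy(logits.t(), target)) / 2

    # Captioning loss: the multimodal decoder cross-attends to the image tokens
    # and predicts each next caption token ([CLS] position excluded).
    cap_logits = multimodal_decoder(txt_tokens[:, :-1], img_tokens)  # (B, seq, vocab)
    cap_loss = F.cross_entropy(cap_logits[:, :-1].reshape(-1, cap_logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    # Paper's combined objective: L = lambda_con * L_con + lambda_cap * L_cap
    # (the default weight values above are illustrative, not guaranteed to match).
    return lambda_con * con_loss + lambda_cap * cap_loss
```

Note how `img_tokens` and `txt_tokens` are each computed once and reused by both losses, which is the computation sharing described under Pretraining Efficiency below.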
Pretraining Efficiency
The majority of the computation is shared between the two losses.
Zero-shot transfer
Frozen feature evaluation
CoCa for video action recognition
CoCa outperforms many specialized models -> evidence that it works as a single foundation model
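For reference, zero-shot classification with a trained dual-encoder or CoCa-style model typically works by scoring images against one text prompt per class (a sketch assuming CLIP-style prompt templates and hypothetical module names, not the paper's evaluation code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenize, images, class_names):
    # Score each image against one text prompt per class; no fine-tuning needed.
    # tokenize and the encoders are hypothetical stand-ins for a trained model.
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = F.normalize(text_encoder(tokenize(prompts)), dim=-1)  # (n_classes, dim)
    img = F.normalize(image_encoder(images), dim=-1)            # (batch, dim)
    return (img @ txt.t()).argmax(dim=-1)                       # predicted class per image
```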