
Takara Taniguchi

CoCa: Contrastive Captioners are Image-Text Foundation Models

Contrastive captioning
A Google Research paper; Vijay Vasudevan is among the authors.

Introduction

Large pretrained language models such as BERT, T5, and GPT-3 show that one pretrained model can transfer broadly.

These foundation models handle many downstream tasks, such as zero-shot image classification.

Existing approaches use 3 types of architectures:
Single encoder (supervised classification pretraining)
Dual encoder (contrastive models such as CLIP and ALIGN)
Encoder-decoder (generative captioning models)
CoCa combines the dual-encoder and encoder-decoder approaches in a single model.

Related work
Vision pretraining
Supervised and weakly supervised pretraining on large image datasets: ImageNet, Instagram hashtags, JFT.

Vision-language pretraining
Early approaches relied on object-detection features (Faster R-CNN regions); more recent models include VinVL and VLMo.

Image-text foundation models
Dual-encoder models trained contrastively on web-scale image-text pairs: CLIP, ALIGN, LiT.

Strength of CoCa
Both losses are computed with one forward and one backward propagation per batch.
CoCa is trained on two objectives simultaneously: a contrastive loss and a captioning loss.

Approach
CoCa has two objectives.
The unimodal text decoder outputs a [CLS] token embedding that summarizes the text.
The image encoder's outputs are pooled (attentional pooling) into an image embedding.
CoCa calculates a contrastive loss between these two embeddings.
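As a minimal sketch of that contrastive objective (assuming CLIP-style symmetric InfoNCE; the function and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embed, text_embed, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    # [B, B] similarity matrix; matched pairs sit on the diagonal.
    logits = image_embed @ text_embed.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```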

After that, the multimodal text decoder cross-attends to the image encoder's outputs to compute the captioning loss (autoregressive next-token prediction).
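A hedged sketch of the captioning term, assuming teacher forcing over a shifted token sequence; `multimodal_decoder` and its `cross_attend_to` argument are hypothetical names, not CoCa's published API:

```python
import torch.nn.functional as F

def captioning_loss(multimodal_decoder, image_features, token_ids, pad_id=0):
    # Teacher forcing: predict token t from tokens < t plus the image.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    # The decoder cross-attends to the image features (hypothetical API).
    logits = multimodal_decoder(inputs, cross_attend_to=image_features)
    # Token-level cross-entropy over the vocabulary, ignoring padding.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
```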

Pretraining Efficiency
The majority of computation is shared between the two losses: the image encoder and the unimodal half of the text decoder serve both objectives, so each batch needs only a single forward and backward pass.
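Reusing the two loss sketches above, one training step might look like the following; every module passed in is hypothetical, and the loss weights follow the paper's weighted sum (lambda_con = 1, lambda_cap = 2):

```python
def coca_training_step(image_encoder, attentional_pooler, unimodal_decoder,
                       multimodal_decoder, images, token_ids):
    # A single shared forward pass feeds both objectives.
    img_features = image_encoder(images)           # used by both losses
    img_embed = attentional_pooler(img_features)   # pooled image embedding
    cls_embed = unimodal_decoder(token_ids)        # text [CLS] embedding
    l_con = contrastive_loss(img_embed, cls_embed)
    l_cap = captioning_loss(multimodal_decoder, img_features, token_ids)
    return 1.0 * l_con + 2.0 * l_cap
```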

Zero-shot transfer
The aligned image and text embeddings enable zero-shot classification and image-text retrieval with no task-specific training.

Frozen feature evaluation
The pretrained encoder is kept frozen and only a lightweight head (e.g., an attentional pooler) is trained per task.
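For instance, zero-shot classification can be run CLIP-style by embedding class-name prompts and picking the nearest one; `encode_image` and `encode_text` are assumed helper functions, not CoCa's actual interface:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, encode_image, encode_text):
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeds = F.normalize(encode_text(prompts), dim=-1)  # [C, D]
    img_embed = F.normalize(encode_image(image), dim=-1)     # [1, D]
    scores = img_embed @ text_embeds.t()                     # [1, C]
    return class_names[scores.argmax(dim=-1).item()]
```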

CoCa for video action recognition
Each frame is passed through the shared image encoder individually, and the frame embeddings are pooled (averaged for zero-shot; a learned attentional pooler when finetuning).
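A small sketch of the zero-shot pooling variant, again with an assumed `encode_image` helper:

```python
import torch

@torch.no_grad()
def video_embedding(frames, encode_image):
    # frames: [T, C, H, W]; encode each frame independently -> [T, D].
    frame_embeds = torch.stack(
        [encode_image(f.unsqueeze(0)).squeeze(0) for f in frames])
    # Mean-pool over time; finetuning would use a learned attentional pooler.
    return frame_embeds.mean(dim=0)
```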

Across these benchmarks, CoCa outperforms many specialized models with a single pretrained network, which is what makes it a foundation model.
