Takara Taniguchi

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Notes on the famous BLIP paper.

Image-text contrastive (ITC) loss: aligns the image and text representations.
Image-grounded text encoder: it distinguishes positive from negative image-text pairs (ITM loss).
Image-grounded text decoder: it is trained with the LM loss.

Related work
Web-crawled image-text pairs are noisy.
The challenge is to handle both understanding-based tasks and generation-based tasks; prior models tend to excel at only one of the two.

Knowledge distillation
It is effective for VLP and image classification.
Data augmentation: prior work uses synthetic captions for training.

Method
Unimodal encoder: a [CLS] token is appended to the beginning of the text input to summarize the sentence.

Image-grounded text encoder
It injects visual information by inserting one additional cross-attention layer between the self-attention layer and the feed-forward network of each transformer block. A task-specific [Encode] token is appended to the text, and its output embedding serves as the multimodal representation of the image-text pair.

Image-grounded text decoder
It replaces the bidirectional self-attention layers with causal self-attention. A [Decode] token signals the beginning of a sequence, and an [EOS] token signals its end.
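
To make the three text-model modes concrete, here is a minimal PyTorch sketch of one transformer block. The module name, dimensions, and the omission of layer norms are my own simplifications, not the paper's code: with no image features and no causal mask it acts like the unimodal text encoder, adding cross-attention gives the image-grounded encoder, and the causal mask turns it into the decoder.

```python
import torch
import torch.nn as nn

class MEDBlock(nn.Module):
    """One transformer block of the multimodal mixture of encoder-decoder (sketch).

    Order per the paper: self-attention -> (optional) cross-attention to
    image features -> feed-forward. Residuals kept, layer norms omitted.
    """

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text, image=None, causal=False):
        # causal=True masks future tokens, turning the block into a decoder layer.
        mask = None
        if causal:
            n = text.size(1)
            mask = torch.triu(
                torch.ones(n, n, dtype=torch.bool, device=text.device), diagonal=1
            )
        h, _ = self.self_attn(text, text, text, attn_mask=mask)
        text = text + h
        if image is not None:
            # Image-grounded modes inject visual features via cross-attention.
            h, _ = self.cross_attn(text, image, image)
            text = text + h
        return text + self.ffn(text)
```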

Pre-training objectives

ITC loss
The same contrastive objective as CLIP: matched image-text pairs are pulled together and unmatched ones pushed apart.
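
A minimal sketch of that CLIP-style contrastive loss, assuming projected and pooled features as inputs; BLIP additionally uses a momentum encoder and soft labels, which are omitted here.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """CLIP-style image-text contrastive loss.

    image_feats, text_feats: (batch, dim) projected pooled features.
    Matching pairs share the same row index.
    """
    # Normalize so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix scaled by temperature.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```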

ITM loss
Image-text matching loss: a binary classification task that predicts whether an image-text pair is matched or unmatched. Hard negative mining is used, so negative pairs that look similar but are wrong are sampled more often.
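
A rough sketch of that hard-negative sampling and the binary head, under assumptions: `itm_head` is a hypothetical classifier (e.g. a linear layer to 2 classes) over the fused [Encode] features, and only in-batch negatives are considered; the paper's exact sampling details are simplified.

```python
import torch
import torch.nn.functional as F

def sample_hard_negative_texts(sim_i2t):
    """Pick one in-batch hard negative text per image.

    sim_i2t: (batch, batch) image-to-text similarities from the ITC step.
    More similar (but wrong) texts get a higher sampling probability.
    """
    weights = F.softmax(sim_i2t, dim=1).clone()
    weights.fill_diagonal_(0)  # never sample the true (positive) pair
    return torch.multinomial(weights, num_samples=1).squeeze(1)  # (batch,)

def itm_loss(pos_feats, neg_feats, itm_head):
    """Binary matched/unmatched classification over fused [Encode] features."""
    feats = torch.cat([pos_feats, neg_feats], dim=0)
    labels = torch.cat([
        torch.ones(pos_feats.size(0), dtype=torch.long),   # matched
        torch.zeros(neg_feats.size(0), dtype=torch.long),  # unmatched
    ]).to(feats.device)
    return F.cross_entropy(itm_head(feats), labels)
```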

LM loss
Language modeling loss: the decoder is trained to generate the textual description of the image autoregressively.
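
A minimal sketch of that autoregressive objective as next-token cross-entropy; `pad_id=0` is an assumption, and the label smoothing the paper applies is omitted for brevity.

```python
import torch.nn.functional as F

def lm_loss(logits, token_ids, pad_id=0):
    """Autoregressive captioning loss (next-token cross-entropy).

    logits: (batch, seq_len, vocab) image-conditioned decoder outputs.
    token_ids: (batch, seq_len) caption starting with [Decode], ending with [EOS].
    """
    shift_logits = logits[:, :-1, :]   # position t predicts token t + 1
    shift_labels = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=pad_id,  # skip padding positions
    )
```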

CapFilt
A captioner (the image-grounded text decoder), fine-tuned on the human-annotated pairs (I_h, T_h), generates synthetic captions for web images, and a filter (the image-grounded text encoder) removes noisy texts. The bootstrapped dataset combines the human-annotated texts, the filtered synthetic captions, and the filtered web-crawled texts.
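
A high-level sketch of that bootstrapping loop, assuming hypothetical `caption` and `keeps` callables standing in for the fine-tuned captioner and filter.

```python
def capfilt(web_pairs, human_pairs, caption, keeps):
    """Bootstrap a cleaner pre-training corpus (high-level sketch).

    web_pairs:   iterable of (image, web_text) crawled pairs, noisy.
    human_pairs: list of (image, text) human-annotated pairs (e.g. COCO).
    caption:     hypothetical captioner call, image -> synthetic caption.
    keeps:       hypothetical filter call, (image, text) -> bool (matched?).
    """
    bootstrapped = list(human_pairs)          # human annotations are always kept
    for image, web_text in web_pairs:
        synthetic = caption(image)            # synthetic caption T_s
        if keeps(image, web_text):            # filtered web text T_w
            bootstrapped.append((image, web_text))
        if keeps(image, synthetic):           # filtered synthetic text T_s
            bootstrapped.append((image, synthetic))
    return bootstrapped                       # used to pre-train a new model
```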

Experiment
Performance improves on VQA, NLVR², VisDial, and other downstream tasks.

Conclusion
What is impressive here is that BLIP makes accurate captioning possible starting from a large, noisy dataset.
That said, since the dataset is web-crawled, I wonder whether contamination with the downstream benchmarks might be happening.
