The famous BLIP
Image-text contrastive (ITC) loss
Image-grounded text encoder: it distinguishes positive from negative image-text pairs.
The image-grounded text decoder is trained with a language modeling (LM) loss.
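As a rough sketch of how these three objectives fit together: BLIP jointly optimizes them on each pre-training batch. Equal weighting below is my assumption; the paper states only that the three objectives are jointly optimized.

```python
import torch

# Minimal sketch: the three BLIP objectives combined into one training
# loss. Equal weighting is an assumption, not confirmed by the paper.
def total_loss(itc_loss: torch.Tensor,
               itm_loss: torch.Tensor,
               lm_loss: torch.Tensor) -> torch.Tensor:
    # Each term is computed on the same pre-training batch.
    return itc_loss + itm_loss + lm_loss
```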
Related work
Web-crawled image-text pairs are noisy.
The challenge is to do well on both understanding-based tasks and generation-based tasks with a single model.
Knowledge distillation
It is effective for vision-language pre-training (VLP) and image classification.
Data augmentation: This method utilizes synthetic captions for training.
Method
Unimodal encoder: a [CLS] token is prepended to the text input.
Image-grounded text encoder
It injects visual information by inserting one additional cross-attention layer between the self-attention layer and the feed-forward network in each transformer block (see the sketch after this subsection).
A task-specific [Encode] token is appended to the text, and its output embedding serves as the multimodal representation of the image-text pair.
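Below is a minimal sketch (not BLIP's actual code) of one such transformer block: the cross-attention layer sits between self-attention and the feed-forward network, so text queries can attend to the image patch embeddings. The pre-norm layout and dimensions are illustrative assumptions; BLIP's text encoder is initialized from BERT, which uses post-norm.

```python
import torch
import torch.nn as nn

class ImageGroundedBlock(nn.Module):
    """Illustrative block: self-attention -> cross-attention -> FFN."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim))

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # 1) bidirectional self-attention over the text tokens
        h = self.norm1(text)
        text = text + self.self_attn(h, h, h)[0]
        # 2) cross-attention injects visual information:
        #    text queries attend to image patch embeddings
        h = self.norm2(text)
        text = text + self.cross_attn(h, image, image)[0]
        # 3) feed-forward network
        return text + self.ffn(self.norm3(text))
```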
Image-grounded text decoder
A [Decode] token signals the beginning of a sequence, and an [EOS] token signals its end.
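A small sketch of how a caption is framed for the decoder under teacher forcing; the token strings and toy caption are hypothetical, but the [Decode]-prefix / [EOS]-suffix framing follows the paper.

```python
# Hypothetical tokens; in BLIP these come from the BERT tokenizer.
caption = ["a", "dog", "on", "the", "grass"]
decoder_input = ["[Decode]"] + caption   # model sees this
decoder_target = caption + ["[EOS]"]     # model predicts this, shifted by one

print(decoder_input)   # ['[Decode]', 'a', 'dog', 'on', 'the', 'grass']
print(decoder_target)  # ['a', 'dog', 'on', 'the', 'grass', '[EOS]']
```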
Pre-training objectives
ITC loss
Same as CLIP: positive image-text pairs are pulled together, negatives pushed apart.
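A minimal sketch of a CLIP-style ITC loss, assuming L2-normalized [CLS] features from the unimodal encoders. BLIP additionally uses a momentum encoder and soft labels (following ALBEF), which are omitted here.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # Normalize features, then compute the batch similarity matrix.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```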
ITM loss
Image-text matching loss: a binary classification task (matched vs. unmatched); negative pairs that look similar but are wrong are considered more often (hard negative mining).
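A sketch of the hard-negative idea (following the ALBEF-style strategy BLIP adopts): in-batch negatives with higher contrastive similarity are sampled more often, so the ITM head trains on pairs that look alike but are wrong.

```python
import torch

def sample_hard_negative_texts(sim_i2t: torch.Tensor) -> torch.Tensor:
    """sim_i2t: (B, B) image-to-text similarity matrix from the ITC head."""
    weights = sim_i2t.softmax(dim=1).clone()
    weights.fill_diagonal_(0)  # never sample the positive pair
    # For each image, draw one negative text, biased toward similar ones.
    return torch.multinomial(weights, num_samples=1).squeeze(1)
```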
LM loss
A cross-entropy loss that trains the decoder to generate textual descriptions of the image autoregressively.
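A sketch of the LM loss as standard next-token cross-entropy under teacher forcing, using the [Decode]/[EOS] framing from the decoder section; if I recall the paper correctly, it applies label smoothing of 0.1.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V) decoder outputs; target_ids: (B, T) tokens shifted
    # so that position t predicts token t+1.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        target_ids.reshape(-1),               # (B*T,)
        label_smoothing=0.1,                  # per the paper, as I recall
    )
```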
CapFilt
A captioner, fine-tuned on the human-annotated pairs (I_h, T_h), generates synthetic captions for the web-crawled images; a filter, fine-tuned on the same data, removes noisy captions. The bootstrapped training set combines the human-annotated texts, the filtered synthetic texts, and the filtered web-crawled texts.
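A high-level sketch of the CapFilt bootstrapping loop; `captioner` and `filter_keeps` are hypothetical stand-ins for the fine-tuned image-grounded text decoder and encoder, respectively.

```python
def capfilt(web_pairs, human_pairs, captioner, filter_keeps):
    """web_pairs: iterable of (image, web_text); human_pairs: (I_h, T_h)."""
    dataset = list(human_pairs)  # human-annotated pairs are always kept
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)  # generate a synthetic caption
        # The filter removes noisy texts, both web-crawled and synthetic.
        if filter_keeps(image, web_text):
            dataset.append((image, web_text))
        if filter_keeps(image, synthetic_text):
            dataset.append((image, synthetic_text))
    return dataset
```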
Experiment
Performance improves on benchmarks such as VQA, NLVR, and VisDial.
Conclusion
What is impressive about this work is that it makes accurate captioning possible from a large, noisy dataset.
That said, since the dataset is web-crawled, I suspect that contamination (overlap with downstream evaluation data) may be occurring.