Linli Yao, Renmin University of China
Image difference captioning paper
Image difference captioning demands fine-grained semantic comprehension, yet its annotations are costly to collect.
To address both issues, this paper proposes fine-grained pre-training followed by fine-tuning.
Method
Three self-supervised tasks and pre-training
Masked language modeling
15% of all tokens are randomly selected for masking.
As in BERT, most of the selected tokens are replaced with the [MASK] token (the remainder are swapped for random tokens or left unchanged), and the model is trained to recover the originals.
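The masking recipe can be sketched as below. This follows the standard BERT 80/10/10 scheme; the paper's exact ratios may differ, and `MASK_ID` is a placeholder token id, not a value from the paper:

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id (depends on the tokenizer)

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
    Returns the corrupted sequence and per-position labels
    (-100 marks positions ignored by the loss)."""
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token (but it is still predicted)
    return masked, labels
```

Only the selected positions contribute to the masked-language-modeling loss; all other positions carry the ignore label.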
Masked visual contrastive learning
A masking-and-recovering strategy applied on the visual side.
On the image side, the recovered feature of a masked sample is pulled toward its original (unmasked) feature as the positive, while the features of the other samples in the batch serve as negatives.
These pairs are trained contrastively.
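The contrastive step above can be sketched as an InfoNCE-style loss with in-batch negatives. This is a minimal NumPy illustration, not the paper's implementation; feature shapes and the temperature value are assumptions:

```python
import numpy as np

def info_nce_loss(recovered, originals, temperature=0.07):
    """InfoNCE with in-batch negatives: for each masked sample i,
    originals[i] is the positive and originals[j != i] are negatives.
    recovered, originals: (B, D) feature arrays."""
    r = recovered / np.linalg.norm(recovered, axis=1, keepdims=True)
    o = originals / np.linalg.norm(originals, axis=1, keepdims=True)
    logits = r @ o.T / temperature                # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives sit on the diagonal
```

When the recovered features match their originals, the diagonal dominates each row and the loss is small; mismatched pairs raise it.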
Fine-grained difference aligning
Contrastive learning on the text side: difference captions are aligned with visual difference features, similar to CLIP's image-text contrastive objective.
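A CLIP-style symmetric contrastive loss for this aligning step could look like the sketch below. The function name and the idea of feeding it visual-difference and caption features are illustrative assumptions, not the paper's code:

```python
import numpy as np

def clip_style_loss(diff_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss as in CLIP: matched (visual-difference,
    caption) pairs on the diagonal are positives; every other in-batch
    pairing is a negative. Cross-entropy is applied in both directions
    and averaged."""
    v = diff_feats / np.linalg.norm(diff_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))                 # diagonal = positives

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Averaging the two directions makes the alignment symmetric: captions retrieve differences and differences retrieve captions equally well.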
Finetuning
The pre-trained model is fine-tuned to generate a difference caption.
Conclusion
The approach leverages large-scale data to offset the high annotation cost of difference captioning.