Linli Yao, Renmin University of China
Image difference captioning paper
Image difference captioning demands fine-grained semantic comprehension, yet its annotations are costly to collect.
To address both issues, this paper proposes fine-grained pre-training followed by fine-tuning.
Method
Three self-supervised tasks and pre-training
Masked language modeling
15% of all tokens are randomly selected for masking.
As in BERT, most of the selected tokens are replaced with the [MASK] token (the remainder are swapped for random tokens or left unchanged), and the model is trained to recover the originals.
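The masking recipe can be sketched as below. This follows the standard BERT 80/10/10 scheme; the paper's exact ratios may differ, and `MASK_ID` is a placeholder token id, not a value from the paper:

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id (depends on the tokenizer)

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
    Returns the corrupted sequence and per-position labels
    (-100 marks positions ignored by the loss)."""
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token (but it is still predicted)
    return masked, labels
```

Only the selected positions contribute to the masked-language-modeling loss; all other positions carry the ignore label.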
Masked visual contrastive learning
A masking-and-recovering strategy applied on the visual side.
On the image side, the recovered feature of a masked sample is pulled toward its original (unmasked) feature as the positive, while the features of the other samples in the batch serve as negatives.
These pairs are trained contrastively.
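The contrastive step above can be sketched as an InfoNCE-style loss with in-batch negatives. This is a minimal NumPy illustration, not the paper's implementation; feature shapes and the temperature value are assumptions:

```python
import numpy as np

def info_nce_loss(recovered, originals, temperature=0.07):
    """InfoNCE with in-batch negatives: for each masked sample i,
    originals[i] is the positive and originals[j != i] are negatives.
    recovered, originals: (B, D) feature arrays."""
    r = recovered / np.linalg.norm(recovered, axis=1, keepdims=True)
    o = originals / np.linalg.norm(originals, axis=1, keepdims=True)
    logits = r @ o.T / temperature                # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives sit on the diagonal
```

When the recovered features match their originals, the diagonal dominates each row and the loss is small; mismatched pairs raise it.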
Fine-grained difference aligning
Contrastive learning on the text side: difference captions are aligned with visual difference features, similar to CLIP's image-text contrastive objective.
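A CLIP-style symmetric contrastive loss for this aligning step could look like the sketch below. The function name and the idea of feeding it visual-difference and caption features are illustrative assumptions, not the paper's code:

```python
import numpy as np

def clip_style_loss(diff_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss as in CLIP: matched (visual-difference,
    caption) pairs on the diagonal are positives; every other in-batch
    pairing is a negative. Cross-entropy is applied in both directions
    and averaged."""
    v = diff_feats / np.linalg.norm(diff_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))                 # diagonal = positives

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Averaging the two directions makes the alignment symmetric: captions retrieve differences and differences retrieve captions equally well.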
Finetuning
The pre-trained model is fine-tuned to generate a difference caption.
Conclusion
The approach leverages large-scale data to offset the high annotation cost of difference captioning.