The famous BLIP-2 from Salesforce Research
Related work
The authors discuss vision-language pre-training (VLP).
Most VLP methods pre-train the vision and language models end-to-end, which incurs a high computational cost.
LiT keeps a pre-trained image encoder frozen and trains only the text encoder with CLIP-style contrastive learning.
Frozen fine-tunes an image encoder whose outputs are used directly as soft prompts for a frozen language model.
Flamingo inserts new cross-attention layers into a frozen LLM and pre-trains them on billions of image-text pairs.
Method
Pre-training proceeds in two stages:
Vision-language representation learning
Vision-to-language generative learning
First stage
The Q-Former is initialized with the pre-trained weights of BERT_base.
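A minimal PyTorch sketch of the two pieces this point mentions: learnable query embeddings plus a transformer initialized from BERT_base. The shapes (32 queries, hidden size 768) and the `bert-base-uncased` checkpoint name are assumptions for illustration; the real Q-Former also adds cross-attention layers to the frozen image encoder, which are omitted here.

```python
import torch
import torch.nn as nn
from transformers import BertModel

NUM_QUERIES, HIDDEN = 32, 768  # assumed: 32 learned queries at BERT_base width

# Learnable query embeddings, trained from scratch.
query_tokens = nn.Parameter(torch.zeros(1, NUM_QUERIES, HIDDEN))
nn.init.normal_(query_tokens, std=0.02)

# Transformer weights initialized from pre-trained BERT_base.
qformer_backbone = BertModel.from_pretrained("bert-base-uncased")

print(query_tokens.shape, qformer_backbone.config.hidden_size)
```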
Image-text contrastive learning
This objective aligns the query output representations Z with the text representation t: similarity is computed between each query output and t, and the highest one is used as the image-text similarity.
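A small sketch of that similarity computation, assuming Z and t have already been projected into a shared space; all tensor values and dimensions here are stand-ins, and the loss is a generic in-batch InfoNCE rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

batch, num_queries, dim = 4, 32, 256
Z = F.normalize(torch.randn(batch, num_queries, dim), dim=-1)  # query outputs (image side)
t = F.normalize(torch.randn(batch, dim), dim=-1)               # text [CLS] embedding

# Similarity between every query output and every text in the batch,
# then take the max over queries to get the image-text similarity.
sim = torch.einsum("iqd,jd->ijq", Z, t)   # (image i, text j, query q)
sim, _ = sim.max(dim=-1)                  # (image i, text j)

# InfoNCE-style contrastive loss over in-batch negatives.
temperature = 0.07
labels = torch.arange(batch)
itc_loss = (F.cross_entropy(sim / temperature, labels)
            + F.cross_entropy(sim.t() / temperature, labels)) / 2
print(itc_loss)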
Image-grounded text generation
The Q-Former is trained to generate the text conditioned on the query outputs extracted from the image.
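The training signal for this objective is an ordinary teacher-forced next-token loss. The sketch below assumes the decoder (conditioned on the visual queries) has already produced logits; the shapes and random tensors are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 12, 30522

# Stand-ins: next-token logits from the image-conditioned text decoder,
# and the target caption token ids.
logits = torch.randn(batch, seq_len, vocab)
target_ids = torch.randint(0, vocab, (batch, seq_len))

# Image-grounded text generation loss: predict token t+1 from tokens <= t
# (plus the visual queries) with standard cross-entropy.
itg_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),   # predictions for positions 0..L-2
    target_ids[:, 1:].reshape(-1),       # targets are the next tokens 1..L-1
)
print(itg_loss)
```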
Second stage
A fully-connected layer projects the query outputs to the dimension of the frozen LLM's text embeddings, and the projected queries are prepended to the input text as soft visual prompts (see the sketch after the next two points).
With decoder-based LLMs, a language modeling loss is used: the frozen LLM generates the text conditioned on the visual prompts, enabling zero-shot image-to-text generation.
With encoder-decoder LLMs, a prefix language modeling loss is used: the prefix of the input text is concatenated with the visual prompts for the encoder, and the decoder generates the remaining text.
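A minimal sketch of the projection-and-prepend step, assuming hypothetical dimensions for the Q-Former outputs and the frozen LLM's embedding space; the text embeddings here are random stand-ins rather than real LLM embeddings.

```python
import torch
import torch.nn as nn

QFORMER_DIM, LLM_DIM, NUM_QUERIES, BATCH = 768, 2560, 32, 2  # assumed sizes

# The new module trained in stage 2: a fully-connected projection.
proj = nn.Linear(QFORMER_DIM, LLM_DIM)

# Query outputs Z from the Q-Former (stand-in values).
Z = torch.randn(BATCH, NUM_QUERIES, QFORMER_DIM)

# Project into the LLM's embedding space and prepend to the embedded input
# text, so the frozen LLM sees the queries as soft visual prompts.
visual_prompts = proj(Z)                                  # (B, 32, LLM_DIM)
text_embeds = torch.randn(BATCH, 10, LLM_DIM)             # stand-in text embeddings
llm_inputs = torch.cat([visual_prompts, text_embeds], 1)  # (B, 42, LLM_DIM)
print(llm_inputs.shape)
```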
Conclusion
Because the image encoder and the LLM stay frozen and only the lightweight Q-Former and a projection layer are trained, the pre-training cost is much lower than end-to-end VLP.