The famous BLIP-2 from Salesforce Research
Related work
The authors discuss vision-language pre-training (VLP).
Most VLP methods pre-train the vision and language models end-to-end, which incurs a high computational cost.
LiT keeps a pre-trained image encoder frozen and trains only the text encoder with CLIP-style contrastive learning.
Frozen fine-tunes an image encoder whose outputs are used directly as soft prompts for a frozen language model.
Flamingo inserts new cross-attention layers into a frozen LLM and pre-trains them on billions of image-text pairs.
Method
Pre-training proceeds in two stages:
Vision-language representation learning
Vision-to-language generative learning
First stage
The Q-Former is initialized with the pre-trained weights of BERT_base.
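A minimal PyTorch sketch of the two pieces this point mentions: learnable query embeddings plus a transformer initialized from BERT_base. The shapes (32 queries, hidden size 768) and the `bert-base-uncased` checkpoint name are assumptions for illustration; the real Q-Former also adds cross-attention layers to the frozen image encoder, which are omitted here.

```python
import torch
import torch.nn as nn
from transformers import BertModel

NUM_QUERIES, HIDDEN = 32, 768  # assumed: 32 learned queries at BERT_base width

# Learnable query embeddings, trained from scratch.
query_tokens = nn.Parameter(torch.zeros(1, NUM_QUERIES, HIDDEN))
nn.init.normal_(query_tokens, std=0.02)

# Transformer weights initialized from pre-trained BERT_base.
qformer_backbone = BertModel.from_pretrained("bert-base-uncased")

print(query_tokens.shape, qformer_backbone.config.hidden_size)
```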
Image-text contrastive learning
This objective aligns the query output representations Z with the text representation t: similarity is computed between each query output and t, and the highest one is used as the image-text similarity.
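A small sketch of that similarity computation, assuming Z and t have already been projected into a shared space; all tensor values and dimensions here are stand-ins, and the loss is a generic in-batch InfoNCE rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

batch, num_queries, dim = 4, 32, 256
Z = F.normalize(torch.randn(batch, num_queries, dim), dim=-1)  # query outputs (image side)
t = F.normalize(torch.randn(batch, dim), dim=-1)               # text [CLS] embedding

# Similarity between every query output and every text in the batch,
# then take the max over queries to get the image-text similarity.
sim = torch.einsum("iqd,jd->ijq", Z, t)   # (image i, text j, query q)
sim, _ = sim.max(dim=-1)                  # (image i, text j)

# InfoNCE-style contrastive loss over in-batch negatives.
temperature = 0.07
labels = torch.arange(batch)
itc_loss = (F.cross_entropy(sim / temperature, labels)
            + F.cross_entropy(sim.t() / temperature, labels)) / 2
print(itc_loss)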
Image-grounded text generation
The Q-Former is trained to generate the text conditioned on the query outputs extracted from the image.
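The training signal for this objective is an ordinary teacher-forced next-token loss. The sketch below assumes the decoder (conditioned on the visual queries) has already produced logits; the shapes and random tensors are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 12, 30522

# Stand-ins: next-token logits from the image-conditioned text decoder,
# and the target caption token ids.
logits = torch.randn(batch, seq_len, vocab)
target_ids = torch.randint(0, vocab, (batch, seq_len))

# Image-grounded text generation loss: predict token t+1 from tokens <= t
# (plus the visual queries) with standard cross-entropy.
itg_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),   # predictions for positions 0..L-2
    target_ids[:, 1:].reshape(-1),       # targets are the next tokens 1..L-1
)
print(itg_loss)
```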
Second stage
A fully-connected layer projects the query outputs to the dimension of the frozen LLM's text embeddings, and the projected queries are prepended to the input text as soft visual prompts (see the sketch after the next two points).
With decoder-based LLMs, a language modeling loss is used: the frozen LLM generates the text conditioned on the visual prompts, enabling zero-shot image-to-text generation.
With encoder-decoder LLMs, a prefix language modeling loss is used: the prefix of the input text is concatenated with the visual prompts for the encoder, and the decoder generates the remaining text.
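A minimal sketch of the projection-and-prepend step, assuming hypothetical dimensions for the Q-Former outputs and the frozen LLM's embedding space; the text embeddings here are random stand-ins rather than real LLM embeddings.

```python
import torch
import torch.nn as nn

QFORMER_DIM, LLM_DIM, NUM_QUERIES, BATCH = 768, 2560, 32, 2  # assumed sizes

# The new module trained in stage 2: a fully-connected projection.
proj = nn.Linear(QFORMER_DIM, LLM_DIM)

# Query outputs Z from the Q-Former (stand-in values).
Z = torch.randn(BATCH, NUM_QUERIES, QFORMER_DIM)

# Project into the LLM's embedding space and prepend to the embedded input
# text, so the frozen LLM sees the queries as soft visual prompts.
visual_prompts = proj(Z)                                  # (B, 32, LLM_DIM)
text_embeds = torch.randn(BATCH, 10, LLM_DIM)             # stand-in text embeddings
llm_inputs = torch.cat([visual_prompts, text_embeds], 1)  # (B, 42, LLM_DIM)
print(llm_inputs.shape)
```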
Conclusion
Because the image encoder and the LLM stay frozen and only the lightweight Q-Former and a projection layer are trained, the pre-training cost is much lower than end-to-end VLP.