The paper is from DeepMind.
Method
Visual language model
Bridges powerful pretrained vision-only and language-only models
Handles sequences of arbitrarily interleaved visual and textual data
Seamlessly ingests images or videos as inputs.
Flamingo models can be trained on large-scale multimodal web data
Few-shot learning on a wide array of 16 multimodal language tasks
The vision encoder is pretrained and kept frozen
Gated cross-attention (XATTN) layers are inserted into the frozen language model
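A minimal sketch of the idea, assuming PyTorch (the class name and layer sizes below are my own illustration, not the paper's code): the newly added cross-attention and feed-forward sublayers are gated by the tanh of a zero-initialised scalar, so at the start of training the frozen language model's behaviour is unchanged.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative gated cross-attention block (simplified sketch)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # tanh gates initialised at zero: the block is a no-op at the start of training
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text tokens (queries) attend to visual tokens (keys/values)
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # then passed to the next frozen language-model layer
```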
Time embeddings are added to learn spatio-temporal features.
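As a rough sketch of that time-embedding step (the class name and tensor shapes are assumptions for illustration): each video frame's features receive a learned temporal embedding before all frames are flattened into one sequence of visual tokens.

```python
import torch
import torch.nn as nn

class FrameTimeEmbedding(nn.Module):
    """Illustrative learned time embedding for video frames (sketch)."""
    def __init__(self, max_frames: int, dim: int):
        super().__init__()
        self.time_emb = nn.Parameter(torch.zeros(max_frames, dim))

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = frame_features.shape
        x = frame_features + self.time_emb[:t].view(1, t, 1, d)
        # flatten frames into one visual-token sequence: (batch, frames * tokens_per_frame, dim)
        return x.flatten(1, 2)
```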
Training on a mixture of language and vision datasets
Images are interleaved with the text in the training sequences.
During pre-training:
Input: images/videos interleaved with their text descriptions
Output: the description text, predicted token by token conditioned on the visual input
This conditional text generation is the pre-training task.
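A minimal sketch of that objective, assuming PyTorch (`model` and its keyword arguments are hypothetical placeholders, not the paper's API): the description text is predicted autoregressively, and the images enter only through the conditioning.

```python
import torch.nn.functional as F

def pretraining_loss(model, images, text_tokens):
    """Next-token prediction on the text, conditioned on interleaved images (sketch)."""
    logits = model(images=images, text=text_tokens[:, :-1])  # (batch, len - 1, vocab)
    targets = text_tokens[:, 1:]                              # text shifted by one token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```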
Conclusion
Few-shot learning is achieved across multimodal tasks.
Flamingo is also scalable.
Impressions
Indeed, with data collected this way, there must be many noisy image-text pairs, I thought.
BLIP is great.