The paper is from DeepMind.
Method
Visual language model
Bridges powerful pretrained vision-only and language-only models
Handles sequences of arbitrarily interleaved visual and textual data
Seamlessly ingests images or videos as inputs.
Flamingo models can be trained on large-scale multimodal web data
Few-shot learning on a wide array of 16 multimodal language tasks
The vision encoder is pretrained and kept frozen
Gated cross-attention (XATTN) layers are inserted into the frozen language model
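A minimal sketch of the idea, assuming PyTorch (the class name and layer sizes below are my own illustration, not the paper's code): the newly added cross-attention and feed-forward sublayers are gated by the tanh of a zero-initialised scalar, so at the start of training the frozen language model's behaviour is unchanged.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative gated cross-attention block (simplified sketch)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # tanh gates initialised at zero: the block is a no-op at the start of training
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text tokens (queries) attend to visual tokens (keys/values)
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # then passed to the next frozen language-model layer
```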
Time embeddings are added to learn spatio-temporal features.
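As a rough sketch of that time-embedding step (the class name and tensor shapes are assumptions for illustration): each video frame's features receive a learned temporal embedding before all frames are flattened into one sequence of visual tokens.

```python
import torch
import torch.nn as nn

class FrameTimeEmbedding(nn.Module):
    """Illustrative learned time embedding for video frames (sketch)."""
    def __init__(self, max_frames: int, dim: int):
        super().__init__()
        self.time_emb = nn.Parameter(torch.zeros(max_frames, dim))

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = frame_features.shape
        x = frame_features + self.time_emb[:t].view(1, t, 1, d)
        # flatten frames into one visual-token sequence: (batch, frames * tokens_per_frame, dim)
        return x.flatten(1, 2)
```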
Training on a mixture of language and vision datasets
Images are interleaved with the text in the training sequences.
During pre-training:
Input: images/videos interleaved with their text descriptions
Output: the description text, predicted token by token conditioned on the visual input
This conditional text generation is the pre-training task.
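A minimal sketch of that objective, assuming PyTorch (`model` and its keyword arguments are hypothetical placeholders, not the paper's API): the description text is predicted autoregressively, and the images enter only through the conditioning.

```python
import torch.nn.functional as F

def pretraining_loss(model, images, text_tokens):
    """Next-token prediction on the text, conditioned on interleaved images (sketch)."""
    logits = model(images=images, text=text_tokens[:, :-1])  # (batch, len - 1, vocab)
    targets = text_tokens[:, 1:]                              # text shifted by one token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```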
Conclusion
Few-shot learning is achieved across multimodal tasks.
Flamingo is also scalable.
Impressions
Indeed, with data collected this way, there must be many noisy image-text pairs, I thought.
BLIP is great.