Takara Taniguchi

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Notes on the Video-LLaMA paper

Architecture
Video-LLaMA is composed of two branches: the Vision-Language branch and the Audio-Language branch.

Pre-training of the Vision-Language branch is conducted on WebVid-2M, a large-scale dataset of short videos with textual descriptions.
After this pre-training, the model can generate content about a video, but its ability to follow instructions is weak.
Therefore, fine-tuning on instruction-following data is conducted as a second stage.

The visual encoder used in the first stage is kept frozen.
Frame features are fed into the trainable Video Q-Former, as sketched below.
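To make the Vision-Language branch more concrete, here is a minimal PyTorch sketch of the idea: per-frame features from a frozen image encoder get temporal position embeddings, a small set of learnable queries cross-attends over all frames (the Video Q-Former role), and a linear layer projects the result into the LLM's embedding space. All dimensions, layer counts, and the generic `TransformerDecoder` used as a stand-in for the Q-Former are my assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class VideoQFormerSketch(nn.Module):
    """Rough sketch of frame aggregation in the Vision-Language branch (assumed sizes)."""

    def __init__(self, frame_dim=1408, hidden_dim=768, num_queries=32,
                 num_layers=2, num_heads=8, llm_dim=4096, max_frames=32):
        super().__init__()
        # Learnable temporal position embeddings, one per sampled frame.
        self.frame_pos = nn.Embedding(max_frames, frame_dim)
        # Learnable query tokens that summarize the whole clip.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
        self.proj_in = nn.Linear(frame_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        # Cross-attention stack standing in for the Video Q-Former.
        self.qformer = nn.TransformerDecoder(layer, num_layers)
        # Linear projection into the frozen LLM's embedding space.
        self.proj_out = nn.Linear(hidden_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, tokens_per_frame, frame_dim),
        # produced by the frozen image encoder from the first stage.
        b, t, n, d = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        frame_feats = frame_feats + pos[None, :, None, :]      # add temporal position
        frame_feats = self.proj_in(frame_feats.reshape(b, t * n, d))
        queries = self.queries.expand(b, -1, -1)
        video_tokens = self.qformer(queries, frame_feats)       # queries attend over frames
        return self.proj_out(video_tokens)                      # (batch, num_queries, llm_dim)


if __name__ == "__main__":
    feats = torch.randn(1, 8, 32, 1408)          # 8 frames, 32 tokens each (assumed shapes)
    print(VideoQFormerSketch()(feats).shape)     # torch.Size([1, 32, 4096])
```

In both training stages, only modules like this one would be updated while the image encoder and the LLM stay frozen; the output tokens are consumed by the LLM as a soft video prompt.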

Thoughts
I need to learn more about the Video Q-Former.
