The Video-LLaMA paper
Architecture
Video-LLaMA is composed of two branches: a Vision-Language branch and an Audio-Language branch.
Pre-training of the Vision-Language branch is conducted on WebVid-2M, a large-scale dataset of short videos.
After pre-training, the model can generate content, but its ability to follow instructions decreases. Therefore, instruction fine-tuning is performed.
The visual encoder in the first stage is kept frozen, and the frames are fed into the trainable video Q-Former.
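As a rough mental model of this branch, here is a minimal sketch, not the authors' implementation: a frozen per-frame visual encoder, a trainable video Q-Former that compresses the frame sequence into a fixed set of query tokens, and a linear projection into the LLM's embedding space. All module names, dimensions, and the use of a plain Transformer decoder as a stand-in for the real Q-Former are illustrative assumptions.

```python
# Illustrative sketch only (assumed shapes and modules, not the paper's code).
# A frozen visual encoder embeds each frame, a trainable "video Q-Former"
# (approximated by a Transformer decoder with learnable queries) aggregates
# the frame features, and a linear layer maps the result into the LLM space.
import torch
import torch.nn as nn


class VisionLanguageBranch(nn.Module):
    def __init__(self, frame_dim=1408, qformer_dim=768, num_queries=32,
                 llm_dim=4096, num_layers=2):
        super().__init__()
        # Stand-in for the frozen visual encoder (in the real model this is a
        # pre-trained image encoder; here a Linear layer keeps the sketch small).
        self.visual_encoder = nn.Linear(3 * 224 * 224, frame_dim)
        self.visual_encoder.requires_grad_(False)  # frozen, no gradient updates

        # Learnable query tokens plus a Transformer decoder acting as a
        # simplified, trainable video Q-Former.
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.frame_proj = nn.Linear(frame_dim, qformer_dim)
        layer = nn.TransformerDecoderLayer(d_model=qformer_dim, nhead=8,
                                           batch_first=True)
        self.video_qformer = nn.TransformerDecoder(layer, num_layers=num_layers)

        # Projection into the LLM's token embedding space.
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, 224, 224)
        b = frames.shape[0]
        with torch.no_grad():  # the visual encoder stays frozen
            feats = self.visual_encoder(frames.flatten(2))  # (b, t, frame_dim)
        feats = self.frame_proj(feats)                      # (b, t, qformer_dim)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.video_qformer(tgt=queries, memory=feats)
        return self.llm_proj(video_tokens)  # (b, num_queries, llm_dim)


if __name__ == "__main__":
    branch = VisionLanguageBranch()
    dummy = torch.randn(1, 8, 3, 224, 224)  # one clip of 8 frames
    print(branch(dummy).shape)              # torch.Size([1, 32, 4096])
```

The appeal of this kind of design is that only the small query and projection modules need gradients, while the heavy visual encoder stays frozen.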
Thoughts
I need to learn more about the Video Q-Former.