DEV Community

Takara Taniguchi

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz is the first author, from a group at MBZUAI, a graduate university in Abu Dhabi.

100,000 video-instruction pairs
Training captures both temporal and spatial features

Related works
LLaMA, OPT, MOSS

Pre-trained LLMs
Flamingo and BLIP-2 explored the power of web-scale image-text pairs.
LLaVA
Alpaca
MiniGPT4
VideoChat

LLaVA is an LMM that integrates the visual encoder of CLIP.
The decoder is Vicuna.

Method
Video features are extracted with a pre-trained CLIP visual encoder.
Mean-pooling is applied along both the temporal and the spatial dimensions.
Only a linear projection layer is trained, by maximizing the likelihood of the answers.
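As a rough sketch of that pooling step (shapes and variable names are my assumptions, using NumPy instead of the paper's PyTorch code): per-frame CLIP patch features of shape (T, N, D) are averaged over frames to get spatial tokens, averaged over patches to get temporal tokens, concatenated, and passed through the single trainable linear layer.

```python
import numpy as np

def video_tokens(frame_feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """frame_feats: (T, N, D) CLIP features for T frames of N patches each.
    proj: (D, D_llm) linear projection -- the only trained weights here."""
    spatial = frame_feats.mean(axis=0)    # (N, D): average over frames
    temporal = frame_feats.mean(axis=1)   # (T, D): average over patches
    tokens = np.concatenate([temporal, spatial], axis=0)  # (T + N, D)
    return tokens @ proj                  # (T + N, D_llm) video tokens for the LLM

# Toy example with made-up sizes
T, N, D, D_llm = 100, 256, 1024, 4096
feats = np.random.randn(T, N, D).astype(np.float32)
W = np.random.randn(D, D_llm).astype(np.float32) * 0.01
out = video_tokens(feats, W)
print(out.shape)  # (356, 4096)
```

The video is thus compressed to T + N tokens regardless of resolution, which is what makes feeding long clips into the LLM tractable.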

Video-ChatGPT is, in essence, a method built around collecting good instruction data.

Thoughts
I suspect some information is inevitably lost by the mean pooling.
