
Takara Taniguchi

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao et al., a group at Google
Existing models
Motion-centric models
36M high-quality video-caption pairs (plus hundreds of millions of clips with noisy parallel text such as ASR transcripts)
First step: train a video encoder with video-text contrastive learning
Second step: continued pretraining on video-only data

  1. Predict the video-level global embedding and token-wise local embeddings
  2. A random shuffle is applied to the encoder's output tokens
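To make the second step concrete, here is a minimal, hypothetical sketch of how I read it, assuming PyTorch and placeholder modules `teacher` (the frozen stage-1 encoder), `student_encoder`, `decoder`, and `pooler`; the names, shapes, and MSE losses are illustrative guesses, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stage2_loss(masked_video, full_video, teacher, student_encoder, decoder, pooler):
    # Targets come from the frozen stage-1 encoder (the "teacher").
    with torch.no_grad():
        target_tokens = teacher(full_video)        # (B, N, D) token-wise local embeddings
        target_global = pooler(target_tokens)      # (B, D) video-level global embedding

    # The student sees a masked version of the video.
    student_tokens = student_encoder(masked_video)  # (B, N, D)

    # Randomly shuffle the encoder's output tokens before decoding, so the
    # decoder cannot use token position as a shortcut; targets are permuted
    # identically so the loss stays aligned per token.
    perm = torch.randperm(student_tokens.shape[1], device=student_tokens.device)
    pred_tokens = decoder(student_tokens[:, perm, :])
    pred_global = pooler(student_tokens)

    local_loss = F.mse_loss(pred_tokens, target_tokens[:, perm, :])
    global_loss = F.mse_loss(pred_global, target_global)
    return local_loss + global_loss
```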

Method

  1. Video-text contrastive training
    CLIP-like
    The video encoder and the pooler are trained in this stage (see the loss sketch after this list).

  2. Masked video modeling
    In this stage VideoPrism trains on video-only data, because the paired text can be noisy and disruptive.
    Using the frozen stage-1 video encoder as the teacher, the student model (VideoPrism) predicts the pooled global embedding and the shuffled token-wise local embeddings.
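For the first stage, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, assuming PyTorch; `video_emb` and `text_emb` stand for the pooled outputs of the video encoder plus pooler and a text encoder over a batch of paired clips and captions. These are my own placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)

    logits = v @ t.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(v.shape[0], device=v.device)

    # Matched video-text pairs lie on the diagonal; contrast them against
    # all other pairs in the batch, in both directions.
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.t(), labels)
    return (loss_v2t + loss_t2v) / 2
```

Each clip is pulled toward its own caption and pushed away from every other caption in the batch (and vice versa), which is what trains the video encoder and pooler in this stage.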

Impressions
In short, VideoPrism amounts to knowledge distillation.
