
Takara Taniguchi

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao et al., a group at Google
Existing models
Motion-centric models
36M high-quality video-caption pairs (plus hundreds of millions of clips with noisy parallel text such as ASR transcripts)
First step: train a video encoder with video-text contrastive learning
Second step: continued pretraining on video-only data

  1. Predict the video-level global embedding and token-wise local embeddings
  2. A random shuffle is applied to the encoder's output tokens
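To make the second step concrete, here is a minimal, hypothetical sketch of how I read it, assuming PyTorch and placeholder modules `teacher` (the frozen stage-1 encoder), `student_encoder`, `decoder`, and `pooler`; the names, shapes, and MSE losses are illustrative guesses, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stage2_loss(masked_video, full_video, teacher, student_encoder, decoder, pooler):
    # Targets come from the frozen stage-1 encoder (the "teacher").
    with torch.no_grad():
        target_tokens = teacher(full_video)        # (B, N, D) token-wise local embeddings
        target_global = pooler(target_tokens)      # (B, D) video-level global embedding

    # The student sees a masked version of the video.
    student_tokens = student_encoder(masked_video)  # (B, N, D)

    # Randomly shuffle the encoder's output tokens before decoding, so the
    # decoder cannot use token position as a shortcut; targets are permuted
    # identically so the loss stays aligned per token.
    perm = torch.randperm(student_tokens.shape[1], device=student_tokens.device)
    pred_tokens = decoder(student_tokens[:, perm, :])
    pred_global = pooler(student_tokens)

    local_loss = F.mse_loss(pred_tokens, target_tokens[:, perm, :])
    global_loss = F.mse_loss(pred_global, target_global)
    return local_loss + global_loss
```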

Method

  1. Video-text contrastive training
    CLIP-like
    The video encoder and the pooler are trained in this stage (see the loss sketch after this list).

  2. Masked video modeling
    In this stage VideoPrism trains on video-only data, because the paired text can be noisy and disruptive.
    Using the frozen stage-1 video encoder as the teacher, the student model (VideoPrism) predicts the pooled global embedding and the shuffled token-wise local embeddings.
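For the first stage, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, assuming PyTorch; `video_emb` and `text_emb` stand for the pooled outputs of the video encoder plus pooler and a text encoder over a batch of paired clips and captions. These are my own placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)

    logits = v @ t.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(v.shape[0], device=v.device)

    # Matched video-text pairs lie on the diagonal; contrast them against
    # all other pairs in the batch, in both directions.
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.t(), labels)
    return (loss_v2t + loss_t2v) / 2
```

Each clip is pulled toward its own caption and pushed away from every other caption in the batch (and vice versa), which is what trains the video encoder and pooler in this stage.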

Impressions
In short, VideoPrism amounts to knowledge distillation.
