[memo]VITED: Video Temporal Evidence Distillation

From Lu's group at UCSB, with FAIR

Intro
A novel framework that generates and searches for evidence chains of thought.
Evidence distillation

Related works
Video understanding with LLMs.
Existing video LLMs uniformly sample frames.
The authors instead focus on generating and localizing the evidence relevant to answering the question.
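
For reference, uniform frame sampling can be sketched as below. This is a generic illustration, not code from the paper; the function name `uniform_frame_indices` and the choice of taking the centre of each bin are assumptions made for this memo.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video has.
    num_samples = min(num_samples, num_frames)
    # Split the video into equal-width bins and take the centre of each bin.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 4))
```

The indices depend only on the video length, not on the question, which is the limitation the evidence-based approach targets.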

Chain-of-thought reasoning in videos
CoT for video understanding
deliberate search
majority voting
VIP
VSOR-CoT
MotionEpic

Visual evidence

Generating an evidence pool
Standard flow: Q->A
This method's flow: Q->Evidence->A

Method
Divide the video into segments and extract the relevant information.

The generated chains of thought are selected by a search algorithm.
It searches for the evidence that most increases the likelihood of the chain.
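
A minimal sketch of this kind of search, assuming a greedy strategy: at each step, append the piece of evidence from the pool that most increases a score, and stop when nothing helps. The memo does not record the paper's actual algorithm, so greedy search is a stand-in here. `greedy_evidence_search` and `toy_score` are hypothetical names, and `toy_score` is only a placeholder for a model-based likelihood.

```python
from typing import Callable, Sequence


def greedy_evidence_search(
    pool: Sequence[str],
    score: Callable[[list[str]], float],
    max_len: int = 3,
) -> list[str]:
    """Greedily grow an evidence chain while the score keeps improving."""
    chain: list[str] = []
    best = score(chain)
    remaining = list(pool)
    while remaining and len(chain) < max_len:
        # Try appending each unused piece of evidence and keep the best one.
        cand_score, cand = max(
            ((score(chain + [e]), e) for e in remaining), key=lambda t: t[0]
        )
        if cand_score <= best:
            break  # no candidate increases the score, so stop
        chain.append(cand)
        remaining.remove(cand)
        best = cand_score
    return chain


def toy_score(chain: list[str]) -> float:
    """Placeholder for a likelihood: reward keyword hits, penalise length."""
    keywords = {"cup", "table"}
    hits = sum(len(keywords & set(e.split())) for e in chain)
    return hits - 0.1 * len(chain)


if __name__ == "__main__":
    pool = ["a man picks up a cup", "a dog barks", "he puts it on the table"]
    print(greedy_evidence_search(pool, toy_score))
```

In a real setting the score would come from the model (for example, the log-likelihood of the answer given the chain), and a wider search such as beam search could replace the greedy step.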

Distilling evidence chains into a single model

Stage 1: instruction tuning
Stage 2: predict answers and evidence chains

Trained using next-token prediction with a cross-entropy loss.
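
As a reminder of what this loss computes, here is a small standard-library sketch. It is not the paper's training code: `next_token_cross_entropy` is a hypothetical helper, and the inputs are toy logits. In the setup described above, the targets would be the tokens of the answer and the evidence chain.

```python
import math


def next_token_cross_entropy(
    logits: list[list[float]], targets: list[int]
) -> float:
    """Mean negative log-likelihood of each target token under softmax(logits).

    logits[t] holds the model's scores over the vocabulary at step t,
    and targets[t] is the index of the token that should come next.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # log-sum-exp with max subtraction for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # -log softmax(row)[target] = log_z - row[target]
        total += log_z - row[target]
    return total / len(targets)


if __name__ == "__main__":
    # Uniform logits over a 2-token vocabulary give a loss of log(2).
    print(next_token_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```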

Even though this is distillation, there seems to be no table discussing time or similar costs?
