[memo]VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

VideoVista
The temporal location task handles the duration of the selected scene
Video processing
Include diverse video categories
Ensure varying video durations
Incorporate comprehensive video understanding and reasoning tasks

Video processing
Video splitting algorithm
Duration distribution is long-tailed.
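
The memo does not record how the splitting algorithm works, so the following is only a placeholder sketch of what "splitting a video into short clips" can look like. It assumes fixed-length windows, which is not necessarily what the paper does. The function name `split_into_clips` and the 30-second default are hypothetical.

```python
def split_into_clips(duration_s: float, clip_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Cut a video of length duration_s into consecutive (start, end) windows.

    Assumption: fixed-length windows. The last clip is shorter when the
    duration is not a multiple of clip_len_s.
    """
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    duration_s = float(duration_s)
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


clips = split_into_clips(70, clip_len_s=30)
```

With a long-tailed duration distribution, a few very long videos would produce most of the clips under this scheme.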

Automatic QA
Action
GPT-4o annotates actions for each short clip
Event: Segment Anything and Recognize Anything are used to obtain times and regions
Objects: GPT-4 is used to generate QA pairs about objects in a video clip
Multiple-choice
Open-ended
GPT-4o tended to focus on individuals in the center of the frame
Open-ended QA pairs are converted to multiple-choice QA pairs
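
A minimal sketch of the open-ended to multiple-choice conversion, under assumptions: the distractors are taken as already available (for example, produced by an LLM), and the function name `to_multiple_choice` and the output layout are hypothetical, not taken from the paper.

```python
import random
import string


def to_multiple_choice(question: str, answer: str, distractors: list[str], seed: int = 0) -> dict:
    """Turn an open-ended QA pair into a multiple-choice item.

    Assumption: distractors are supplied by the caller. The options are
    shuffled so the correct answer does not always sit at the same letter.
    """
    options = [answer] + list(distractors)
    if len(set(options)) != len(options):
        raise ValueError("answer and distractors must all be distinct")
    rng = random.Random(seed)  # seeded so the conversion is reproducible
    rng.shuffle(options)
    letters = string.ascii_uppercase[: len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }


item = to_multiple_choice(
    "What is the person doing?",
    "Chopping vegetables",
    ["Washing dishes", "Reading a book", "Painting a wall"],
)
```

The returned `answer` is a letter key into `options`, so scoring reduces to comparing letters.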

Evaluation
Comparison
Performance on medium and long videos does not differ much.
Very long durations can distort the accuracy.
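
One way to look at the duration effect is to report accuracy per duration bucket instead of a single overall number. This is a sketch, not the paper's evaluation code: the bucket names and the thresholds in `BUCKETS` are hypothetical.

```python
from collections import defaultdict

# Hypothetical upper bounds in seconds; anything above the last bound is "long".
BUCKETS = [("short", 60.0), ("medium", 600.0)]


def bucket_of(duration_s: float) -> str:
    """Map a video duration to a bucket name."""
    for name, upper in BUCKETS:
        if duration_s < upper:
            return name
    return "long"


def accuracy_by_bucket(results: list[tuple[float, bool]]) -> dict[str, float]:
    """results holds (video duration in seconds, whether the answer was correct)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for duration_s, correct in results:
        name = bucket_of(duration_s)
        totals[name] += 1
        hits[name] += int(correct)
    return {name: hits[name] / totals[name] for name in totals}


scores = accuracy_by_bucket(
    [(30, True), (45, False), (300, True), (1200, True), (2000, False)]
)
```

Reporting per bucket keeps a handful of very long videos from dominating or hiding the result on shorter ones.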
