Video LLM Benchmark
Diversity of video types
Duration in temporal dimension
Breadth in data modalities
Quality in annotations
Video duration increases -> accuracy decreases
Manually collected data
3 question-answer pairs per video
Video LLM considerable
Integrating subtitles and audio significantly improves the performance of Video LLMs.
Related works
Vicuna
LLaMA
Fuyu-8B
Sequential data is underexplored.
Method
The dataset is constructed in three steps:
video collection
raw videos from YouTube
short, medium, and long videos
Recruit annotators whose English skills are proficient.
question-answer annotation
quality review
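The three construction steps can be pictured as one record per video. This is only a minimal sketch under assumptions: the class names, field names, and the `validate` helper are hypothetical, the URL is a placeholder, and the count of 3 question-answer pairs per video is taken from the summary above. It is not the benchmark's actual data format.

```python
from dataclasses import dataclass, field

# Duration categories named in the notes; no thresholds are assumed here.
DURATION_CATEGORIES = ("short", "medium", "long")


@dataclass
class QAPair:
    """One annotated question-answer pair (hypothetical structure)."""
    question: str
    answer: str
    reviewed: bool = False  # set to True once it passes quality review


@dataclass
class VideoEntry:
    """One collected video plus its annotations (hypothetical structure)."""
    url: str
    duration_category: str
    qa_pairs: list = field(default_factory=list)

    def validate(self, expected_pairs: int = 3) -> list:
        """Return a list of problems; an empty list means the entry looks complete."""
        problems = []
        if self.duration_category not in DURATION_CATEGORIES:
            problems.append(f"unknown duration category: {self.duration_category}")
        if len(self.qa_pairs) != expected_pairs:
            problems.append(f"expected {expected_pairs} QA pairs, found {len(self.qa_pairs)}")
        if any(not qa.reviewed for qa in self.qa_pairs):
            problems.append("some QA pairs have not passed quality review")
        return problems


# Placeholder entry for illustration only; not real benchmark data.
example = VideoEntry(
    url="https://www.youtube.com/watch?v=EXAMPLE",
    duration_category="short",
    qa_pairs=[QAPair("q1", "a1", True), QAPair("q2", "a2", True), QAPair("q3", "a3", True)],
)
```

Calling `example.validate()` returns an empty list, while an entry with a missing or unreviewed pair returns a description of each problem.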
Experiment
Analysis
Could additional modalities benefit the performance?
Subtitles and audio can benefit.
Among long videos, these modalities can benefit.
For MLLMs, using subtitles is more effective than audio.
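A comparison like this comes down to grouping per-question results by input setting and computing accuracy for each group. The following is a rough sketch, assuming each result is a dict with a `correct` flag; the helper name `accuracy_by_group`, the keys, and the toy records are all hypothetical and do not reproduce any numbers from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute accuracy separately for each distinct value of records[i][key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy placeholder records, made up purely to show the call; not benchmark results.
toy_records = [
    {"inputs": "frames", "duration": "long", "correct": False},
    {"inputs": "frames", "duration": "short", "correct": True},
    {"inputs": "frames+subtitles", "duration": "long", "correct": True},
    {"inputs": "frames+subtitles", "duration": "short", "correct": True},
]

per_input = accuracy_by_group(toy_records, "inputs")
per_duration = accuracy_by_group(toy_records, "duration")
```

The same helper can be keyed on `duration` to look at how accuracy changes between short, medium, and long videos.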
How robust are MLLMs to varied video duration?
Sparsity due to frame sampling can degrade the ability of MLLMs.
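To see why sampling becomes sparse, consider a model that takes a fixed number of frames no matter how long the video is. The sketch below assumes uniform sampling (one frame at the midpoint of each equal-length segment) and a hypothetical budget of 8 frames. The function names are illustrative, and the actual sampling strategy differs between models.

```python
def uniform_frame_timestamps(duration_s: float, num_frames: int) -> list:
    """Return one timestamp (in seconds) at the midpoint of each equal-length segment."""
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    segment = duration_s / num_frames
    return [segment * (i + 0.5) for i in range(num_frames)]


def sampling_gap(duration_s: float, num_frames: int) -> float:
    """Seconds of video between consecutive sampled frames."""
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    return duration_s / num_frames


# With the same (assumed) budget of 8 frames, the gap grows linearly with duration.
short_gap = sampling_gap(80, 8)    # 80-second clip
long_gap = sampling_gap(3600, 8)   # one-hour video
```

With this budget, an 80-second clip is sampled every 10 seconds, while a one-hour video is sampled every 450 seconds, so most of the long video is never seen by the model.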