[memo]Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Video LLM Benchmark

Diversity of video types
Duration in temporal dimension
Breadth in data modalities
Quality in annotations

Video duration increases -> accuracy decreases
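
A minimal sketch of how the duration trend could be checked: group the model's answers by video duration and compute accuracy per group. The record fields (`duration`, `prediction`, `answer`) and the toy records are assumptions made for illustration. They are not the benchmark's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_duration(records):
    """Return accuracy per duration bucket (e.g. short / medium / long).

    Each record is a dict with hypothetical keys:
    "duration", "prediction", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        bucket = record["duration"]
        total[bucket] += 1
        if record["prediction"] == record["answer"]:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Toy records, made up only to exercise the function.
toy_records = [
    {"duration": "short", "prediction": "A", "answer": "A"},
    {"duration": "short", "prediction": "B", "answer": "B"},
    {"duration": "long", "prediction": "C", "answer": "C"},
    {"duration": "long", "prediction": "A", "answer": "D"},
]
per_bucket = accuracy_by_duration(toy_records)
```

Comparing the per-bucket values for short, medium, and long videos is one simple way to see whether accuracy drops as duration grows.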

Manually collected data
3 question-answer pairs per video

Video LLM considerable
Integrating subtitles and audio significantly improves the performance of Video LLMs.

Related works
Vicuna
LLaMA
Fuyu-8B
Sequential data is underexplored.

Method
Data construction consists of 3 steps.
video collection
raw videos from YouTube
short, medium, and long videos
Annotators proficient in English are recruited.

question-answer annotation
quality review

Experiment
Analysis
Could additional modalities benefit the performance?
Subtitles and audio can benefit.
These modalities are especially beneficial for long videos.
For MLLMs, using subtitles is more effective than audio.
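
One way to picture the "with subtitles" setting is as extra text placed in front of the question. The sketch below is only an illustration under that assumption: `build_prompt`, the prompt wording, and the lettered multiple-choice layout are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def build_prompt(question: str, options: List[str],
                 subtitles: Optional[List[str]] = None) -> str:
    """Build a text prompt, optionally prefixed with subtitle lines."""
    parts = []
    if subtitles:
        # Subtitles are an optional extra modality, supplied as plain text.
        parts.append("Subtitles:")
        parts.extend(subtitles)
        parts.append("")
    parts.append("Question: " + question)
    # Label the options A, B, C, ... (hypothetical layout).
    for label, option in zip("ABCD", options):
        parts.append(f"{label}. {option}")
    parts.append("Answer with the letter of the correct option.")
    return "\n".join(parts)


frames_only = build_prompt("What is the speaker holding?", ["A cup", "A book"])
with_subtitles = build_prompt(
    "What is the speaker holding?",
    ["A cup", "A book"],
    subtitles=["Let me read you a passage from this book."],
)
```

Running the same model on both prompts, with the same sampled frames, would isolate the contribution of the subtitles.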

How robust are MLLMs to varied video duration?
Sparsity due to frame sampling can degrade the performance of MLLMs.
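
A rough sketch of why sampling becomes sparse: if a model takes a fixed number of frames, the gap between sampled frames grows linearly with video length. Uniform sampling and the frame budget of 8 are assumptions for illustration. The actual sampling strategy and frame count differ by model.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list:
    """Pick frame indices spread evenly over the video.

    The video is split into num_samples equal segments and the
    frame at the centre of each segment is taken.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def seconds_between_samples(duration_s: float, num_samples: int) -> float:
    """Average time gap between consecutive sampled frames."""
    return duration_s / num_samples


indices = uniform_frame_indices(total_frames=100, num_samples=4)
gap_short = seconds_between_samples(duration_s=60, num_samples=8)    # 1-minute video
gap_long = seconds_between_samples(duration_s=3600, num_samples=8)   # 1-hour video
```

With the same budget of 8 frames, the gap goes from 7.5 seconds for a 1-minute video to 450 seconds for a 1-hour video, so most of a long video is never seen by the model.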
