Introduction
Atomic video understanding tasks
Action recognition
Language grounding
AoTD distills multi-step reasoning and spatio-temporal understanding into a single video LLM.
Enhancing video LLMs by distilling high-quality CoTs.
The CoT dataset is constructed automatically by a multi-agent system (sketch below).
Distilled models outperform existing methods.
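A minimal sketch of how such a multi-agent CoT construction could look. Everything here is an assumption for illustration: `plan` and `run_agent` are hypothetical placeholders for the planner LLM and the specialized vision agents the paper would actually call.

```python
# Hypothetical sketch of multi-agent CoT construction (not the paper's API).
from dataclasses import dataclass, field


@dataclass
class Step:
    sub_task: str   # natural-language sub-question produced by the planner
    agent: str      # which specialized agent handled it
    result: str     # the agent's intermediate answer


@dataclass
class CoTTrace:
    question: str
    steps: list[Step] = field(default_factory=list)
    answer: str = ""


def plan(question: str) -> list[tuple[str, str]]:
    """Placeholder planner: an LLM would decompose the question into sub-tasks."""
    return [
        ("ground the relevant time span", "temporal_grounding"),
        ("identify objects and actions in that span", "recognition"),
        ("combine the findings into a final answer", "reasoning"),
    ]


def run_agent(agent: str, sub_task: str, video: str) -> str:
    """Placeholder agent call: each agent would be a separate vision/LLM model."""
    return f"[{agent} output for '{sub_task}' on {video}]"


def build_cot(video: str, question: str) -> CoTTrace:
    """Run the planned sub-tasks in order and record them as a CoT trace."""
    trace = CoTTrace(question=question)
    for sub_task, agent in plan(question):
        trace.steps.append(Step(sub_task, agent, run_agent(agent, sub_task, video)))
    trace.answer = trace.steps[-1].result  # the last step produces the answer
    return trace


if __name__ == "__main__":
    print(build_cot("demo.mp4", "What does the person do after opening the fridge?"))
```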
Related works
Video-language models
Encoders: CLIP, SigLIP
LLMs: VideoLLaMA2, LLaVA-NeXT-Video, VideoChat2
Concurrent work
Video-Star: Construct CoTs using videos and existing labels
MotionEpic: Video-of-Thought, video spatial-temporal scene graphs.
VideoQA
CoT verification
- Execution trace: the final output of the trace must match the ground-truth answer
- Evaluate the logical coherence and usefulness of the reasoning chains
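A hedged sketch of these two verification filters, assuming a hypothetical `llm_judge` callable that scores reasoning quality; the score threshold is also an assumption.

```python
# Sketch of two-stage CoT verification (assumed interface, not the paper's code).

def verify_execution(trace_answer: str, ground_truth: str) -> bool:
    """Filter 1: keep the trace only if its final output matches the label."""
    return trace_answer.strip().lower() == ground_truth.strip().lower()


def verify_reasoning(trace_text: str, llm_judge) -> bool:
    """Filter 2: an LLM judge rates coherence/usefulness; keep high scores."""
    score = llm_judge(
        f"Rate this reasoning chain from 1-5 for coherence and usefulness:\n{trace_text}"
    )
    return score >= 4  # threshold is an assumption


def keep_cot(trace_answer: str, trace_text: str, ground_truth: str, llm_judge) -> bool:
    """A CoT is kept for distillation only if it passes both filters."""
    return verify_execution(trace_answer, ground_truth) and verify_reasoning(trace_text, llm_judge)


# usage with a dummy judge that always returns 5
print(keep_cot("opens the door", "step 1 ... step 3", "Opens the door", lambda prompt: 5))
```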
Base-model
LLaVA-NeXT-Video
Instruction-tuning
MCQ and open-ended QA data (sample sketch below)
In short, this says that accuracy improved compared to plain instruction tuning.
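For illustration only, a sketch of what the distilled instruction-tuning samples might look like; the field names, file names, and answers are assumptions, not the paper's exact schema. The point is that the agent-produced CoT becomes part of the target response.

```python
# Hypothetical training samples for CoT-augmented instruction tuning.

mcq_sample = {
    "video": "clip_0001.mp4",
    "instruction": "What does the person do after opening the fridge? "
                   "(A) drinks milk (B) closes it (C) takes out eggs (D) walks away",
    "response": "Step 1: locate the moment the fridge is opened. "
                "Step 2: the person reaches for the egg carton. "
                "Answer: (C) takes out eggs.",
}

open_ended_sample = {
    "video": "clip_0002.mp4",
    "instruction": "Describe what happens after the dog jumps off the couch.",
    "response": "Step 1: the dog lands near the table. "
                "Step 2: it picks up a toy. "
                "Answer: the dog grabs a toy and runs to the door.",
}
```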
Analysis of latency and computation
The agent-based system requires a great amount of time per question, so its latency and computation cost are high.
LNV-AoTD significantly reduces these costs because it is a distilled, single-model system.
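If one wanted to reproduce this kind of latency comparison, a simple wall-clock harness could look like the sketch below; `agent_pipeline` and `distilled_model` are hypothetical stand-ins for the multi-agent system and LNV-AoTD, with `time.sleep` standing in for real inference.

```python
# Minimal latency-measurement sketch (stand-in functions, not real models).
import time


def time_it(fn, *args, repeats: int = 3) -> float:
    """Average wall-clock seconds per call over a few repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def agent_pipeline(video: str, question: str) -> None:
    time.sleep(0.05)  # placeholder: many sequential model calls


def distilled_model(video: str, question: str) -> None:
    time.sleep(0.01)  # placeholder: a single forward pass


print("agent-based:", time_it(agent_pipeline, "v.mp4", "q"))
print("distilled  :", time_it(distilled_model, "v.mp4", "q"))
```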