Introduction
Atomic video understanding tasks
Action recognition
Language grounding
AoTD distills multi-step reasoning and spatio-temporal understanding into a single video LLM.
Enhancing video LLMs by distilling high-quality CoTs.
The CoT dataset is constructed automatically by a multi-agent system (sketch below).
Distilled models outperform existing methods.
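A minimal sketch of how such a multi-agent CoT construction could look. Everything here is an assumption for illustration: `plan` and `run_agent` are hypothetical placeholders for the planner LLM and the specialized vision agents the paper would actually call.

```python
# Hypothetical sketch of multi-agent CoT construction (not the paper's API).
from dataclasses import dataclass, field


@dataclass
class Step:
    sub_task: str   # natural-language sub-question produced by the planner
    agent: str      # which specialized agent handled it
    result: str     # the agent's intermediate answer


@dataclass
class CoTTrace:
    question: str
    steps: list[Step] = field(default_factory=list)
    answer: str = ""


def plan(question: str) -> list[tuple[str, str]]:
    """Placeholder planner: an LLM would decompose the question into sub-tasks."""
    return [
        ("ground the relevant time span", "temporal_grounding"),
        ("identify objects and actions in that span", "recognition"),
        ("combine the findings into a final answer", "reasoning"),
    ]


def run_agent(agent: str, sub_task: str, video: str) -> str:
    """Placeholder agent call: each agent would be a separate vision/LLM model."""
    return f"[{agent} output for '{sub_task}' on {video}]"


def build_cot(video: str, question: str) -> CoTTrace:
    """Run the planned sub-tasks in order and record them as a CoT trace."""
    trace = CoTTrace(question=question)
    for sub_task, agent in plan(question):
        trace.steps.append(Step(sub_task, agent, run_agent(agent, sub_task, video)))
    trace.answer = trace.steps[-1].result  # the last step produces the answer
    return trace


if __name__ == "__main__":
    print(build_cot("demo.mp4", "What does the person do after opening the fridge?"))
```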
Related works
Video-language models
Encoders: CLIP, SigLIP
LLMs: VideoLLaMA2, LLaVA-NeXT-Video, VideoChat2
Concurrent work
Video-Star: Construct CoTs using videos and existing labels
MotionEpic: Video-of-Thought, video spatial-temporal scene graphs.
VideoQA
CoT verification
- Execution trace: the final output of the trace must match the ground-truth answer
- Evaluate the logical coherence and usefulness of the reasoning chains
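A hedged sketch of these two verification filters, assuming a hypothetical `llm_judge` callable that scores reasoning quality; the score threshold is also an assumption.

```python
# Sketch of two-stage CoT verification (assumed interface, not the paper's code).

def verify_execution(trace_answer: str, ground_truth: str) -> bool:
    """Filter 1: keep the trace only if its final output matches the label."""
    return trace_answer.strip().lower() == ground_truth.strip().lower()


def verify_reasoning(trace_text: str, llm_judge) -> bool:
    """Filter 2: an LLM judge rates coherence/usefulness; keep high scores."""
    score = llm_judge(
        f"Rate this reasoning chain from 1-5 for coherence and usefulness:\n{trace_text}"
    )
    return score >= 4  # threshold is an assumption


def keep_cot(trace_answer: str, trace_text: str, ground_truth: str, llm_judge) -> bool:
    """A CoT is kept for distillation only if it passes both filters."""
    return verify_execution(trace_answer, ground_truth) and verify_reasoning(trace_text, llm_judge)


# usage with a dummy judge that always returns 5
print(keep_cot("opens the door", "step 1 ... step 3", "Opens the door", lambda prompt: 5))
```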
Base-model
LLaVA-NeXT-Video
Instruction-tuning
MCQ and open-ended QA data (sample sketch below)
In short, this says that accuracy improved compared to plain instruction tuning.
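For illustration only, a sketch of what the distilled instruction-tuning samples might look like; the field names, file names, and answers are assumptions, not the paper's exact schema. The point is that the agent-produced CoT becomes part of the target response.

```python
# Hypothetical training samples for CoT-augmented instruction tuning.

mcq_sample = {
    "video": "clip_0001.mp4",
    "instruction": "What does the person do after opening the fridge? "
                   "(A) drinks milk (B) closes it (C) takes out eggs (D) walks away",
    "response": "Step 1: locate the moment the fridge is opened. "
                "Step 2: the person reaches for the egg carton. "
                "Answer: (C) takes out eggs.",
}

open_ended_sample = {
    "video": "clip_0002.mp4",
    "instruction": "Describe what happens after the dog jumps off the couch.",
    "response": "Step 1: the dog lands near the table. "
                "Step 2: it picks up a toy. "
                "Answer: the dog grabs a toy and runs to the door.",
}
```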
Analysis of latency and computation
The agent-based system requires a great amount of time per question, so its latency and computation cost are high.
LNV-AoTD significantly reduces these costs because it is a distilled, single-model system.
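If one wanted to reproduce this kind of latency comparison, a simple wall-clock harness could look like the sketch below; `agent_pipeline` and `distilled_model` are hypothetical stand-ins for the multi-agent system and LNV-AoTD, with `time.sleep` standing in for real inference.

```python
# Minimal latency-measurement sketch (stand-in functions, not real models).
import time


def time_it(fn, *args, repeats: int = 3) -> float:
    """Average wall-clock seconds per call over a few repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def agent_pipeline(video: str, question: str) -> None:
    time.sleep(0.05)  # placeholder: many sequential model calls


def distilled_model(video: str, question: str) -> None:
    time.sleep(0.01)  # placeholder: a single forward pass


print("agent-based:", time_it(agent_pipeline, "v.mp4", "q"))
print("distilled  :", time_it(distilled_model, "v.mp4", "q"))
```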