
Takara Taniguchi

Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

Yudi Shi is the first author; the work comes from Weidi Xie's group at Shanghai Jiao Tong University.

Intro

The core idea is knowledge distillation from an agent-based video question answering pipeline into a single video-LLM.

Related works

Video-language models
Most build on image-text encoders such as CLIP or SigLIP; a minimal frame-text scoring sketch follows after this list.
MoreVQA proposes a multi-stage modular system.
VURF proposes a self-refinement method.
Inference speed is a weakness of these multi-stage, agentic approaches.
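As a side note, here is a minimal sketch of the CLIP-style frame-text scoring such encoders provide, using the Hugging Face transformers CLIP classes. The checkpoint name, the dummy frames, and the candidate captions are placeholders for illustration, not anything from the paper.

```python
# Score stand-in video frames against candidate captions with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(4)]   # stand-in video frames
texts = ["a person opening a fridge", "a person pouring milk"]

inputs = processor(text=texts, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image               # (num_frames, num_texts)
probs = logits.softmax(dim=-1)
print(probs)                                                 # per-frame text similarity
```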

Visual chain of thought
Visual program knowledge distillation and Fact are prior works in this direction.
The paper attempts to address the issues these approaches leave open.

Method
Both answer labels and rationales are compared as supervision for distillation.
The agent's sub-task decompositions for video understanding are distilled into a single video-LLM; a minimal training sketch is below.
Inference time and memory usage are also discussed.
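A minimal sketch of what training on labels plus rationales can look like, assuming a Hugging Face causal LM stands in for the student video-LLM. The checkpoint, prompt format, and example record are illustrative assumptions, not the paper's actual setup.

```python
# Fine-tune a placeholder student on rationale + answer text (label-masked LM loss).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder student model
model = AutoModelForCausalLM.from_pretrained("gpt2")

example = {                                              # hypothetical training record
    "question": "What does the person do after opening the fridge?",
    "rationale": "Frames 40-60 show the fridge opened; frames 65-80 show the "
                 "person pouring milk into a glass.",
    "answer": "They pour milk into a glass.",
}

# Supervise on the rationale + answer only; mask the question tokens with -100
# so the loss ignores them (standard causal-LM label masking).
prompt = f"Question: {example['question']}\n"
target = f"Rationale: {example['rationale']}\nAnswer: {example['answer']}"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100                  # no loss on the prompt

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()                                          # one distillation step
print(float(loss))
```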

Sub-tasks are handled by specialist modules for object detection, temporal grounding, and action recognition.
In effect, the method is a distilled version of agentic frameworks such as ViperGPT; a rough sketch of that agent side follows.
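This is a rough sketch of the agent side that produces the traces to be distilled, in the spirit of ViperGPT-style execution. Every module below is a stub standing in for a real object detector, temporal grounder, and action recognizer; names and outputs are made up for illustration.

```python
# Decompose a video question into sub-tasks, execute stub modules, keep the trace.
from dataclasses import dataclass, field

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def log(self, tool, args, result):
        self.steps.append(f"{tool}({args}) -> {result}")

def detect_objects(video, query, trace):
    result = ["person", "fridge"]            # stub: run an object detector here
    trace.log("detect_objects", query, result)
    return result

def ground_interval(video, event, trace):
    result = (40, 80)                        # stub: run a temporal grounding model here
    trace.log("ground_interval", event, result)
    return result

def recognize_action(video, interval, trace):
    result = "pouring milk"                  # stub: run an action recognition model here
    trace.log("recognize_action", interval, result)
    return result

def answer_question(video, question):
    """Decompose the question into sub-tasks, execute them, and keep the trace."""
    trace = Trace()
    detect_objects(video, question, trace)
    interval = ground_interval(video, "after opening the fridge", trace)
    action = recognize_action(video, interval, trace)
    rationale = " ; ".join(trace.steps)      # this trace is what gets distilled
    return action, rationale

if __name__ == "__main__":
    answer, rationale = answer_question(
        "video.mp4", "What does the person do after opening the fridge?")
    print(answer)
    print(rationale)
```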

Conclusion
