The first author is Yudi Shi, from Weidi Xie's group at Shanghai Jiao Tong University.
Intro
Knowledge distillation from an agent-based (multi-model) system into a single model.
Related works
Video-language models, which are typically built on image-text encoders such as CLIP or SigLIP.
MoreVQA proposes a multi-stage agent system, and VURF proposes a self-refinement method; however, the inference speed of such agent pipelines is inferior to that of a single model.
Other related lines are visual chain-of-thought reasoning and visual program knowledge distillation (e.g., Fact).
This paper attempts to address the issues above through distillation.
Method
Both labels and rationales are compared as distillation targets.
They distill the agent's sub-task decompositions for video understanding into the student model.
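To make the comparison concrete, here is a minimal sketch (not the authors' code; the `AgentTrace` structure, prompt format, and helper names are my own assumptions) of how one agent execution trace could be turned into an answer-only target versus an answer-plus-rationale target for supervised fine-tuning:

```python
from dataclasses import dataclass

@dataclass
class AgentTrace:
    question: str
    sub_steps: list   # ordered sub-task descriptions recorded from the agent
    answer: str

def build_target(trace, use_rationale):
    """Turn one agent execution trace into a (prompt, target) pair for fine-tuning."""
    prompt = f"Question: {trace.question}\nAnswer:"
    if use_rationale:
        rationale = " ".join(f"Step {i + 1}: {s}." for i, s in enumerate(trace.sub_steps))
        target = f" {rationale} So the answer is {trace.answer}."
    else:
        target = f" {trace.answer}."
    return {"prompt": prompt, "target": target}

trace = AgentTrace(
    question="What does the person do after picking up the cup?",
    sub_steps=[
        "localize the cup (object detection)",
        "find when it is picked up (temporal grounding)",
        "classify the following action (action recognition)",
    ],
    answer="drinks from it",
)
print(build_target(trace, use_rationale=True)["target"])
print(build_target(trace, use_rationale=False)["target"])
```

Both variants would then be used for ordinary next-token fine-tuning of the student, which is how labels-only and labels-plus-rationales supervision can be compared.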
Inference time and memory usage are also discussed.
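For reference, a hedged sketch of how such a comparison could be measured; this is generic PyTorch profiling, not the paper's benchmark code, and `agent_pipeline` / `distilled_model` are hypothetical handles:

```python
import time
import torch

def profile(fn, *args, **kwargs):
    """Return (output, wall-clock seconds, peak GPU memory in MB) for one call."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    return out, elapsed, peak_mb

# Hypothetical usage: compare the multi-model agent against the single distilled model.
# _, t_agent, m_agent = profile(agent_pipeline.answer, video, question)
# _, t_student, m_student = profile(distilled_model.answer, video, question)
```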
Distillation is also conducted over these sub-tasks: they are processed by specialized models for object detection, temporal grounding, and action recognition (a toy sketch follows at the end of this section).
In effect, the method is a distilled version of agent systems such as ViperGPT.
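To illustrate the teacher side of the pipeline, here is a toy sketch of sub-task dispatch; the planner, module interfaces, and outputs are all hypothetical placeholders, not the paper's implementation:

```python
# Hypothetical teacher-agent sketch: decompose a question into sub-tasks, dispatch
# each to a specialist module, and record the execution trace. The modules below
# only return canned strings for illustration.
def detect_objects(video, query):      # placeholder for an object detector
    return f"found '{query}' in frames 12-40"

def ground_temporally(video, query):   # placeholder for a temporal grounding model
    return f"'{query}' occurs around 3.1s-4.8s"

def recognize_action(video, query):    # placeholder for an action recognition model
    return f"action after '{query}': drinking"

MODULES = {
    "object_detection": detect_objects,
    "temporal_grounding": ground_temporally,
    "action_recognition": recognize_action,
}

def run_agent(video, sub_tasks):
    """sub_tasks: list of (module_name, query) pairs from an LLM planner (not shown)."""
    trace = []
    for module_name, query in sub_tasks:
        result = MODULES[module_name](video, query)
        trace.append(f"[{module_name}] {query} -> {result}")
    return trace

print("\n".join(run_agent(video=None, sub_tasks=[
    ("object_detection", "cup"),
    ("temporal_grounding", "person picks up the cup"),
    ("action_recognition", "picking up the cup"),
])))
```

In the actual system each placeholder would be a dedicated model, and the recorded trace corresponds to the sub-task decomposition that is distilled into the student.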
Conclusion