Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

First author: Hongcheng Gao, Shanghai Jiao Tong University

This is a paper on hallucination in video understanding.
It categorizes hallucinations in the video understanding task into three types.
Conflict with prior
The content of the video differs from prior knowledge.
The paper's example is a video in which a cat and a mouse get along. Because this is a strange situation that contradicts what the model expects, it causes hallucinations.

In-context conflict
There are discrepancies between questions and options.
Valid answers cannot be obtained from the given materials.
These are unanswerable questions.

Capability deficiency
The model lacks the ability the task requires, as in numerical tasks.
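
The three categories above could be represented as a small data structure when tagging benchmark questions. This is only a sketch: the `BenchmarkItem` fields and the `count_by_cause` helper are hypothetical names chosen for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationCause(Enum):
    # Video content contradicts prior knowledge (e.g. a cat and a mouse getting along).
    CONFLICT_WITH_PRIOR = "conflict_with_prior"
    # The question or options are inconsistent with the given material.
    IN_CONTEXT_CONFLICT = "in_context_conflict"
    # The model lacks the required skill (e.g. numerical tasks).
    CAPABILITY_DEFICIENCY = "capability_deficiency"


@dataclass
class BenchmarkItem:
    """Hypothetical record layout for one benchmark question."""
    video_id: str
    question: str
    options: list[str]
    cause: HallucinationCause
    answerable: bool = True  # False for unanswerable in-context-conflict questions


def count_by_cause(items):
    """Count how many items fall under each hallucination cause."""
    counts = {cause: 0 for cause in HallucinationCause}
    for item in items:
        counts[item.cause] += 1
    return counts
```

Tagging each question this way makes it straightforward to report accuracy per cause rather than a single aggregate number.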

Experiments
Supervised reasoning fine-tuning
Chain-of-thought (CoT) reasoning is used when generating the video pairs and their answers, which yields a good fine-tuning dataset.
In short, they simply used long CoT responses to build the fine-tuning dataset.

SRFT stands for supervised reasoning fine-tuning.
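
As a rough illustration of what a fine-tuning record with a long CoT response might look like, here is a minimal sketch. The `build_srft_example` function, the `<think>` tags, and the `prompt`/`response` keys are assumptions made for this example; the paper's actual template is not given in these notes.

```python
def build_srft_example(question, reasoning_steps, answer):
    """Format one training record whose target holds the reasoning before the answer.

    The <think> delimiters and the dict keys are hypothetical choices.
    """
    thinking = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(reasoning_steps, start=1)
    )
    return {
        "prompt": question,
        "response": f"<think>\n{thinking}\n</think>\nAnswer: {answer}",
    }


example = build_srft_example(
    question="How many people enter the room in the video?",
    reasoning_steps=[
        "One person opens the door and walks in.",
        "A second person follows a few seconds later.",
    ],
    answer="2",
)
```

The point is that the supervised target contains the whole reasoning trace, so the model is trained to produce the steps and not only the final answer.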
Thinking-based DPO

The LMM-SRFT model's responses are compared against versions directly corrected by humans, and the two are used as pairs.
This method assigns a greater weight to the corrected reasoning steps.
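
To make the weighting idea concrete, here is a minimal sketch of a DPO-style loss in which tokens belonging to corrected reasoning steps receive a larger weight. It starts from the standard DPO objective and adds per-token weights. The weighting scheme, the `boost` value, and all function names are assumptions for illustration, and the paper's exact objective may differ.

```python
import math


def step_weights(corrected_mask, boost=2.0):
    """Give tokens in corrected steps a larger weight (boost is a hypothetical value)."""
    return [boost if corrected else 1.0 for corrected in corrected_mask]


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def weighted_dpo_loss(chosen_policy, chosen_ref, chosen_w,
                      rejected_policy, rejected_ref, rejected_w, beta=0.1):
    """-log sigmoid(beta * (chosen ratio - rejected ratio)) with token weights."""
    margin = beta * (
        weighted_log_ratio(chosen_policy, chosen_ref, chosen_w)
        - weighted_log_ratio(rejected_policy, rejected_ref, rejected_w)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With all weights set to 1.0 this reduces to the usual DPO loss over summed token log-probabilities, so the weights only change how strongly the corrected steps pull on the preference margin.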

Conclusion
So CoT has already been done for this...
