Prior work has produced multiple hallucination benchmarks
This paper creates a benchmark for evaluating the hallucination benchmarks themselves
Introduction
LVLMs tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs
Hallucination benchmark quality measurement framework
Contribution
- Propose a hallucination benchmark quality measurement framework for VLMs
- Construct a new high-quality hallucination benchmark
Related works
POPE constructs yes-or-no questions and multiple-choice questions that contain non-existent objects.
AMBER extended yes-no questions to other types of hallucinations.
HallusionBench uses pairs of yes-or-no questions.
Evaluation metrics
CHAIR
OpenCHAIR
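As a reminder of what CHAIR measures: it scores object hallucination in generated captions, with CHAIR_i being the fraction of mentioned objects that are not in the image and CHAIR_s the fraction of captions containing at least one such object. The sketch below is a simplified illustration, not the official implementation. It assumes object mentions have already been extracted and normalized into sets, and the function name `chair_scores` and the sample data are hypothetical.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Compute simplified CHAIR_i and CHAIR_s.

    mentioned_per_caption: list of sets, objects mentioned in each caption
    truth_per_image: list of sets, objects actually present in each image
    Assumption: object extraction and synonym mapping were done beforehand.
    """
    if len(mentioned_per_caption) != len(truth_per_image):
        raise ValueError("captions and ground truths must align one to one")

    total_mentioned = 0
    total_hallucinated = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        hallucinated = mentioned - truth  # mentioned but not in the image
        total_mentioned += len(mentioned)
        total_hallucinated += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    # Guard against empty inputs to avoid division by zero.
    chair_i = total_hallucinated / total_mentioned if total_mentioned else 0.0
    n_captions = len(mentioned_per_caption)
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Made-up example: "bench" is mentioned in the first caption but is not in the image.
mentioned = [{"dog", "frisbee", "bench"}, {"cat", "sofa"}]
truth = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
chair_i, chair_s = chair_scores(mentioned, truth)
# 1 of 5 mentioned objects is hallucinated; 1 of 2 captions contains a hallucination.
```

Lower is better for both values. Using sets means repeated mentions of the same object within a caption count once, which is a simplification made for this sketch.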
Hallucination benchmark quality measurement framework
We select 6 representative publicly available hallucination benchmarks, including MMHal and GAVIE
The framework follows the approach of psychological testing, assessing benchmarks in terms of reliability and validity.
Evaluation scores differ from one benchmark to another.
From the perspective of test-retest reliability, closed-ended benchmarks reveal obvious shortcomings.
Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.
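Test-retest reliability, as mentioned above, asks whether a benchmark gives consistent results when the same test is repeated. One common way to quantify it is the Pearson correlation between scores from the first run and the retest. The sketch below illustrates that idea only; these notes do not record the paper's exact procedure, so the choice of Pearson correlation here is an assumption, and the function name and scores are made up.

```python
import math


def pearson_correlation(first_run, second_run):
    """Pearson correlation between two equally long lists of scores."""
    if len(first_run) != len(second_run) or len(first_run) < 2:
        raise ValueError("need two score lists of equal length >= 2")

    mean_a = sum(first_run) / len(first_run)
    mean_b = sum(second_run) / len(second_run)

    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(first_run, second_run))
    var_a = sum((a - mean_a) ** 2 for a in first_run)
    var_b = sum((b - mean_b) ** 2 for b in second_run)

    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_a * var_b)


# Hypothetical scores of four models on the same benchmark, evaluated twice.
test_scores = [0.62, 0.71, 0.55, 0.80]
retest_scores = [0.60, 0.74, 0.58, 0.79]
reliability = pearson_correlation(test_scores, retest_scores)
```

A value close to 1 means the benchmark ranks and spaces models consistently across repetitions, while a low value suggests the scores depend heavily on run-to-run variation.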
Conclusion
Introduced a quality measurement framework for hallucination benchmarks
Thoughts
It seems they brought insights from psychological reliability testing into AI.