
Takara Taniguchi

[memo]Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Prior work has constructed multiple hallucination benchmarks.

This paper builds a benchmark (a measurement framework) for evaluating hallucination benchmarks themselves.

Introduction

LVLMs tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs.

Hallucination benchmark quality measurement framework

Contributions

  • Propose a hallucination benchmark quality measurement framework for VLMs
  • Construct a new high-quality hallucination benchmark

Related work

POPE constructs yes/no and multiple-choice questions that ask about non-existent objects.

AMBER extends yes/no questions to other types of hallucinations.

HallusionBench uses paired yes/no questions.
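To make the scoring of these closed-ended benchmarks concrete, here is a minimal sketch of POPE-style yes/no evaluation. The record format and example data are assumptions for illustration, not the official benchmark format.

```python
# Minimal sketch of POPE-style yes/no scoring (record format is assumed,
# not the official data layout). "yes" = object present, "no" = absent.
records = [
    {"question": "Is there a dog in the image?", "gold": "no",  "pred": "yes"},
    {"question": "Is there a car in the image?", "gold": "yes", "pred": "yes"},
]

tp = sum(r["gold"] == "yes" and r["pred"] == "yes" for r in records)
fp = sum(r["gold"] == "no"  and r["pred"] == "yes" for r in records)
fn = sum(r["gold"] == "yes" and r["pred"] == "no"  for r in records)
acc = sum(r["gold"] == r["pred"] for r in records) / len(records)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"acc={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```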

Evaluation metrics

CHAIR: measures object hallucination in generated captions as the fraction of mentioned objects that do not appear in the image's annotations.

OpenCHAIR: an open-vocabulary extension of CHAIR.
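As a rough illustration of the CHAIR idea, here is a simplified sketch. It assumes object mentions have already been extracted from the caption, whereas the real metric maps caption words to COCO object categories with synonym lists.

```python
# Simplified CHAIR-style object hallucination rate.
# Assumes mentioned objects were already extracted from the caption;
# the original metric matches caption words to COCO categories via synonyms.
def chair_instance(mentioned_objects: set[str], ground_truth_objects: set[str]) -> float:
    """Fraction of mentioned objects that are not in the image annotations."""
    if not mentioned_objects:
        return 0.0
    hallucinated = mentioned_objects - ground_truth_objects
    return len(hallucinated) / len(mentioned_objects)

# Example: the caption mentions a dog that is not annotated in the image.
print(chair_instance({"person", "bicycle", "dog"}, {"person", "bicycle"}))  # 0.33...
```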

Hallucination benchmark quality measurement framework

The authors select six representative, publicly available hallucination benchmarks, including MMHal and GAVIE.

The framework is modeled on psychological testing: benchmarks are assessed for reliability and validity, the way psychometric tests are.

The same models receive noticeably different scores across different benchmarks.

From the perspective of test-retest reliability, closed-ended benchmarks reveal obvious shortcomings.

Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.
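A minimal sketch of what a test-retest reliability check could look like: correlate per-model scores from two independent runs of the same benchmark. The choice of Pearson correlation and the scores below are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative test-retest reliability check (not the paper's exact protocol):
# correlate per-model benchmark scores from two independent evaluation runs.
from statistics import correlation  # Pearson correlation, Python 3.10+

run_1 = {"model_a": 71.2, "model_b": 64.5, "model_c": 80.3}  # made-up scores
run_2 = {"model_a": 69.8, "model_b": 66.0, "model_c": 79.1}

models = sorted(run_1)
r = correlation([run_1[m] for m in models], [run_2[m] for m in models])
print(f"test-retest reliability (Pearson r) = {r:.3f}")  # close to 1.0 = stable scores
```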

Conclusion

Introduced a quality measurement framework for hallucination benchmarks

Impressions
It seems like they brought the insights of psychological reliability testing into AI evaluation.
