
Takara Taniguchi

[memo]Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Prior work has constructed multiple hallucination benchmarks.

This paper builds a benchmark (a measurement framework) for evaluating hallucination benchmarks themselves.

Introduction

LVLMs tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs.

Hallucination benchmark quality measurement framework

Contributions

  • Propose a hallucination benchmark quality measurement framework for VLMs
  • Construct a new high-quality hallucination benchmark

Related work

POPE constructs yes/no and multiple-choice questions that ask about non-existent objects.

AMBER extends yes/no questions to other types of hallucinations.

HallusionBench uses paired yes/no questions.
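To make the scoring of these closed-ended benchmarks concrete, here is a minimal sketch of POPE-style yes/no evaluation. The record format and example data are assumptions for illustration, not the official benchmark format.

```python
# Minimal sketch of POPE-style yes/no scoring (record format is assumed,
# not the official data layout). "yes" = object present, "no" = absent.
records = [
    {"question": "Is there a dog in the image?", "gold": "no",  "pred": "yes"},
    {"question": "Is there a car in the image?", "gold": "yes", "pred": "yes"},
]

tp = sum(r["gold"] == "yes" and r["pred"] == "yes" for r in records)
fp = sum(r["gold"] == "no"  and r["pred"] == "yes" for r in records)
fn = sum(r["gold"] == "yes" and r["pred"] == "no"  for r in records)
acc = sum(r["gold"] == r["pred"] for r in records) / len(records)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"acc={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```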

Evaluation metrics

CHAIR: measures object hallucination in generated captions as the fraction of mentioned objects that do not appear in the image's annotations.

OpenCHAIR: an open-vocabulary extension of CHAIR.
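As a rough illustration of the CHAIR idea, here is a simplified sketch. It assumes object mentions have already been extracted from the caption, whereas the real metric maps caption words to COCO object categories with synonym lists.

```python
# Simplified CHAIR-style object hallucination rate.
# Assumes mentioned objects were already extracted from the caption;
# the original metric matches caption words to COCO categories via synonyms.
def chair_instance(mentioned_objects: set[str], ground_truth_objects: set[str]) -> float:
    """Fraction of mentioned objects that are not in the image annotations."""
    if not mentioned_objects:
        return 0.0
    hallucinated = mentioned_objects - ground_truth_objects
    return len(hallucinated) / len(mentioned_objects)

# Example: the caption mentions a dog that is not annotated in the image.
print(chair_instance({"person", "bicycle", "dog"}, {"person", "bicycle"}))  # 0.33...
```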

Hallucination benchmark quality measurement framework

The authors select six representative, publicly available hallucination benchmarks, including MMHal and GAVIE.

The framework is modeled on psychological testing: benchmarks are assessed for reliability and validity, the way psychometric tests are.

The same models receive noticeably different scores across different benchmarks.

From the perspective of test-retest reliability, closed-ended benchmarks reveal obvious shortcomings.

Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.
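A minimal sketch of what a test-retest reliability check could look like: correlate per-model scores from two independent runs of the same benchmark. The choice of Pearson correlation and the scores below are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative test-retest reliability check (not the paper's exact protocol):
# correlate per-model benchmark scores from two independent evaluation runs.
from statistics import correlation  # Pearson correlation, Python 3.10+

run_1 = {"model_a": 71.2, "model_b": 64.5, "model_c": 80.3}  # made-up scores
run_2 = {"model_a": 69.8, "model_b": 66.0, "model_c": 79.1}

models = sorted(run_1)
r = correlation([run_1[m] for m in models], [run_2[m] for m in models])
print(f"test-retest reliability (Pearson r) = {r:.3f}")  # close to 1.0 = stable scores
```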

Conclusion

Introduced a quality measurement framework for hallucination benchmarks

Impressions
It seems like they brought the insights of psychological reliability testing into AI evaluation.
