[memo]AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

GPT-4 exhibited a range of remarkable abilities

Introduction

They collect images which have not been used for training

They provide comprehensive annotations to facilitate the evaluation of both generative and discriminative tasks.

LLM-free evaluation pipeline.

Contributions

Related works

Causes of hallucinations

Dataset construction

Images too challenging for accurate annotation are discarded

Annotated by humans

Generative and discriminative tasks

Metrics

CHAIR is a commonly used metric for evaluating hallucinations.

It measures the frequency of hallucinatory objects.

Hal represents the proportion of responses with hallucinations
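
To make the two metrics above concrete, here is a minimal sketch in Python. It assumes the objects mentioned in each response have already been extracted into a list and that each image has a list of annotated objects. The function names `chair_score` and `hal_rate` are hypothetical. The exact formulas are my reading of the notes (CHAIR as the fraction of mentioned objects that are not annotated, Hal as the fraction of responses with at least one such object), not necessarily the paper's precise definitions.

```python
from typing import Iterable, Sequence


def chair_score(mentioned: Iterable[str], annotated: Iterable[str]) -> float:
    """Fraction of mentioned objects that are absent from the annotation.

    Assumption: object names are compared case-insensitively as exact strings.
    """
    mentioned_set = {obj.lower() for obj in mentioned}
    annotated_set = {obj.lower() for obj in annotated}
    if not mentioned_set:
        # No objects mentioned, so nothing can be hallucinated.
        return 0.0
    hallucinated = mentioned_set - annotated_set
    return len(hallucinated) / len(mentioned_set)


def hal_rate(
    responses: Sequence[Iterable[str]],
    annotations: Sequence[Iterable[str]],
) -> float:
    """Proportion of responses that contain at least one hallucinated object."""
    if len(responses) != len(annotations):
        raise ValueError("responses and annotations must have the same length")
    if not responses:
        return 0.0
    flagged = sum(
        1
        for mentioned, annotated in zip(responses, annotations)
        if chair_score(mentioned, annotated) > 0
    )
    return flagged / len(responses)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    responses = [["dog", "frisbee"], ["cat"]]
    annotations = [["dog"], ["cat", "sofa"]]
    print(chair_score(responses[0], annotations[0]))  # 0.5
    print(hal_rate(responses, annotations))  # 0.5
```

In the toy data, the first response mentions "frisbee", which is not annotated, so half of its objects are hallucinated and it counts toward Hal. The second response mentions only annotated objects, so it does not.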

Thoughts

The annotation is done by humans, so it seems like a lot of work.

It sounds exhausting.
