Takara Taniguchi

[memo] AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

GPT-4 has exhibited a range of remarkable abilities.

Introduction

They collect images that have not been used to train existing MLLMs.

They provide comprehensive annotations to facilitate the evaluation of both generative and discriminative tasks.

They propose an LLM-free evaluation pipeline.

Contributions

  • Create the AMBER benchmark
  • Propose an LLM-free evaluation pipeline (see the sketch after this list)
  • Analyze the most advanced GPT-4V
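
The memo only names the LLM-free pipeline, so here is a minimal sketch of the idea: flag hallucinated mentions without calling another LLM as a judge, by matching the response against a fixed object vocabulary and comparing with the image's annotated objects. The vocabulary and function names are my own assumptions for illustration, not AMBER's actual implementation.

```python
# Minimal sketch of LLM-free hallucination checking (assumed, not AMBER's code):
# detect which objects from a fixed vocabulary appear in a response, then
# flag mentions that are absent from the image's human annotations.
OBJECT_VOCAB = {"dog", "cat", "frisbee", "park", "car"}  # hypothetical vocabulary


def mentioned_objects(response: str) -> set[str]:
    """Objects from the vocabulary that the response mentions."""
    words = set(response.lower().replace(".", " ").split())
    return OBJECT_VOCAB & words


def hallucinated_objects(response: str, annotated: set[str]) -> set[str]:
    """Objects the model mentions that annotators did not find in the image."""
    return mentioned_objects(response) - annotated


print(hallucinated_objects("A dog chasing a frisbee in the park.", {"dog", "park"}))
# {'frisbee'}
```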

Related work

Reasons for hallucinations

  • Diverse training data that contains some errors
  • Loss of attention to the image
  • Information loss after the visual encoder

Dataset construction

Images that are too challenging to annotate accurately are discarded.

Annotated by humans.

The benchmark covers both generative and discriminative tasks.
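
As a rough picture of what the human annotations might contain for the two task types, here is a hypothetical record; the field names are my own guess for illustration, not the dataset's actual schema.

```python
# Hypothetical annotation record (field names are assumptions, not AMBER's schema).
annotation = {
    "image": "example_0001.jpg",
    # Generative task: objects annotators confirmed are present in the image.
    "objects": ["dog", "grass", "frisbee"],
    # Discriminative tasks: yes/no probes about existence, attributes, relations.
    "questions": [
        {"query": "Is there a cat in the image?", "answer": "no"},
        {"query": "Is the dog running?", "answer": "yes"},
    ],
}
```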

Metrics

CHAIR is a commonly used metric for evaluating hallucinations.

It measures the frequency of hallucinated objects in a response.

Hal represents the proportion of responses that contain hallucinations.
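
Below is a minimal sketch of how these two metrics could be computed from extracted response objects and the human annotations, following the common definitions (per-response hallucinated-object rate, and the share of responses with at least one hallucination); it is not necessarily the paper's exact formulation.

```python
# Sketch of CHAIR / Hal over a set of responses (common definitions, not
# necessarily the paper's exact formulas).
def chair(response_objects: set[str], annotated: set[str]) -> float:
    """Fraction of mentioned objects that are not in the annotations."""
    if not response_objects:
        return 0.0
    return len(response_objects - annotated) / len(response_objects)


def hal(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Proportion of responses that contain at least one hallucinated object."""
    flagged = sum(1 for resp, ann in pairs if resp - ann)
    return flagged / len(pairs)


pairs = [({"dog", "frisbee"}, {"dog"}), ({"cat"}, {"cat"})]
print(chair({"dog", "frisbee"}, {"dog"}))  # 0.5
print(hal(pairs))                          # 0.5
```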

Impressions

Since the annotation is done by humans, it seems like a lot of work.

It seems tiring.
