Abstract
VLMs have recently achieved remarkable progress
OwlEval
MMBench: a bilingual benchmark for VLMs
CircularEval
Introduction
Existing benchmarks: VQAv2, COCO Caption, GQA, OK-VQA
Contribution
- Systematically constructed dataset
- Robust evaluation → CircularEval
The construction of MMBench
Different from existing benchmarks, MMBench:
- Adopts images covering the defined ability dimensions
- Performs rigorous quality control
- Is a bilingual (English/Chinese) multi-modal benchmark
Hierarchical ability taxonomy
- L1: two top-level abilities, perception and reasoning
- L2: six sub-abilities
- L3: twenty fine-grained ability dimensions (a representation sketch follows this list)
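To make the three-level structure concrete, here is a minimal sketch of the taxonomy as a nested mapping. The L1/L2 names follow the paper; the L3 leaves shown are only a few illustrative examples of the twenty dimensions, and the dict layout itself is my own assumption, not MMBench's code.

```python
# Sketch of the hierarchical ability taxonomy as a nested dict.
# L1/L2 names follow the paper; the L3 leaves listed are a partial,
# illustrative subset of the twenty fine-grained dimensions.
ABILITY_TAXONOMY = {
    "Perception": {                                                    # L1
        "Coarse Perception": ["image style", "image scene"],           # L2 -> L3 (partial)
        "Fine-grained Perception (single-instance)": ["object localization", "OCR"],
        "Fine-grained Perception (cross-instance)": ["spatial relationship"],
    },
    "Reasoning": {                                                     # L1
        "Attribute Reasoning": ["function reasoning"],
        "Relation Reasoning": ["social relation"],
        "Logic Reasoning": ["future prediction"],
    },
}

def leaf_dimensions(taxonomy=ABILITY_TAXONOMY):
    """Flatten the taxonomy into the list of L3 leaf dimensions."""
    return [leaf for l2 in taxonomy.values() for leaves in l2.values() for leaf in leaves]
```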
Data collection and quality control
Did they use data that is claimed not to be used for model training??
MMBench statistics
- 3217 data samples
Evaluation Strategy
LLM-involved choice extraction
CircularEval: circular evaluation over a single question; its options are rotated N times and the model must answer every rotated copy correctly (see the pass-construction sketch after the list below)
- Matching the prediction: first try heuristic matching of the model's free-form output to an option ← seems unreliable
- If heuristic matching fails, match the output to an option with an LLM
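A minimal sketch of how the circular passes could be built; the function name and prompt layout are my assumptions, not MMBench's actual API, but the mechanism (N rotated copies of the options, correct letter tracked per copy) matches the description above. A 4-option question therefore has to be answered correctly four times to count as correct under CircularEval.

```python
import string

def make_circular_passes(question, options, answer_idx):
    """Build the N rotated copies used by CircularEval (sketch; function
    name and prompt format are assumptions, not MMBench's code)."""
    n = len(options)
    passes = []
    for shift in range(n):
        shifted = options[shift:] + options[:shift]
        # After shifting by `shift`, the originally correct option sits at
        # index (answer_idx - shift) mod n.
        correct_letter = string.ascii_uppercase[(answer_idx - shift) % n]
        prompt = question + "\n" + "\n".join(
            f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(shifted)
        )
        passes.append((prompt, correct_letter))
    return passes
```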
LLM as the choice extractor
Extracts the correct choice about 99% of the time
GPT-4 is used as the choice extractor
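A hedged sketch of the two-step extraction described above; the regexes and the `llm_match` callable are my assumptions, not the paper's exact implementation. Heuristics run first, and the LLM fallback is used only when they fail.

```python
import re

def extract_choice(model_output, options, llm_match=None):
    """Map a VLM's free-form answer to an option letter (sketch).
    Step 1: heuristic matching (standalone letter, or verbatim option text).
    Step 2: if that fails, delegate to an LLM-based matcher (e.g. GPT-4),
    passed in here as the `llm_match` callable (assumed interface)."""
    letters = "ABCD"[: len(options)]
    # Heuristic 1: an explicit option letter such as "A", "(B)" or "C."
    m = re.search(r"\b([A-D])\b", model_output)
    if m and m.group(1) in letters:
        return m.group(1)
    # Heuristic 2: the output contains one of the option texts verbatim.
    for letter, option in zip(letters, options):
        if option.lower() in model_output.lower():
            return letter
    # Fallback: ask the LLM to pick the matching letter (or None if unsure).
    if llm_match is not None:
        return llm_match(model_output, options)
    return None
```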
Experiment
Switching from VanillaEval to CircularEval significantly drops performance for most VLMs (see the scoring sketch below)
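To see why the drop happens, here is a hedged scoring sketch under an assumed data layout (a per-question list of booleans, one per rotated pass, with index 0 the original option order): VanillaEval credits only the original-order pass, while CircularEval credits a question only when every pass is correct.

```python
def vanilla_and_circular_accuracy(results):
    """`results`: {question_id: [bool, ...]}, one bool per circular pass,
    index 0 = original option order (layout assumed for illustration).
    VanillaEval: correct if the original-order pass is correct.
    CircularEval: correct only if all rotated passes are correct."""
    n = len(results)
    vanilla = sum(passes[0] for passes in results.values()) / n
    circular = sum(all(passes) for passes in results.values()) / n
    return vanilla, circular
```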
The underlying LLM matters: the core language model largely determines a VLM's performance
Conclusion
- About 3,000 multiple-choice questions
- Evaluates 20 mainstream VLMs