
Takara Taniguchi


[memo] MMBench: Is Your Multi-modal Model an All-around Player?

Abstract

VLMs have recently achieved remarkable progress.

OwlEval: a prior subjective evaluation benchmark.

MMBench: a bilingual benchmark for VLMs.

CircularEval: a more robust evaluation strategy.

Introduction

Existing objective benchmarks: VQAv2, COCO Caption, GQA, OK-VQA

Contribution

  • Systematically constructed dataset
  • Robust evaluation strategy → CircularEval

The construction of MMBench

Different from existing benchmarks

  • Adopts images
  • Performs rigorous quality control
  • Bilingual multi-modal benchmark

Hierarchical ability taxonomy

  • L1: perception and reasoning
  • L2: six sub-ability categories
  • L3: twenty fine-grained ability dimensions (see the sketch after this list)
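
As a quick reference, here is the taxonomy as a nested Python dict. This is a minimal sketch, not an official artifact of the paper: the L2 category names follow the paper, but each L3 list shows only a few of the twenty dimensions.

```python
# Illustrative sketch of MMBench's ability taxonomy (not an official data
# structure). L2 names follow the paper; the L3 lists are a partial subset.
TAXONOMY = {
    "perception": {                                   # L1
        "coarse perception": [                        # L2
            "image style", "image scene", "image emotion",           # L3 (subset)
        ],
        "fine-grained perception (single-instance)": [
            "object localization", "attribute recognition", "OCR",
        ],
        "fine-grained perception (cross-instance)": [
            "spatial relationship", "attribute comparison", "action recognition",
        ],
    },
    "reasoning": {                                    # L1
        "attribute reasoning": ["physical property reasoning", "function reasoning"],
        "relation reasoning": ["social relation", "physical relation"],
        "logic reasoning": ["future prediction", "structuralized image-text understanding"],
    },
}

# Leaf dimensions covered by this partial sketch (the full taxonomy has 20).
n_l3 = sum(len(dims) for l2 in TAXONOMY.values() for dims in l2.values())
print(n_l3)  # 15
```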

Data collection and quality control

Did they use data that is supposed not to be used for training??

MMBench statistics

  • 3217 data samples

Evaluation Strategy

LLM-involved choice extraction

CircularEval: circularly shift the choices of each question and evaluate the model on every rotation; all passes must be correct.

  1. Matching the prediction against the choices via heuristic matching (← this looks unreliable)
  2. If heuristic matching fails, matching via an LLM's output (see the sketch below)
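
A minimal sketch of this two-step extraction. The regex heuristic here is a simplification, not the paper's exact matching rules, and `llm_extract` is a hypothetical placeholder for the ChatGPT/GPT-4 fallback call.

```python
import re

def heuristic_match(prediction: str, choices: dict[str, str]) -> str | None:
    """Step 1: try to read the choice label directly from the raw output."""
    # Catches outputs like "B", "B.", "(B)", "Answer: B" — a simplified
    # pattern, not the paper's full heuristics.
    m = re.search(r"\b([A-D])\b", prediction.strip())
    if m and m.group(1) in choices:
        return m.group(1)
    # Otherwise, look for the full choice text inside the output.
    for label, text in choices.items():
        if text.lower() in prediction.lower():
            return label
    return None

def extract_choice(prediction: str, choices: dict[str, str]) -> str:
    """Step 2: if heuristics fail, ask an LLM to map the free-form
    answer to a label. `llm_extract` is a hypothetical placeholder."""
    label = heuristic_match(prediction, choices)
    if label is not None:
        return label
    return llm_extract(prediction, choices)  # e.g. a ChatGPT/GPT-4 call

choices = {"A": "a cat", "B": "a dog", "C": "a bird", "D": "a fish"}
print(extract_choice("The answer is B, a dog.", choices))  # -> "B"
```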

LLM as the choice extractor

It extracts the correct choice about 99% of the time.

GPT-4 as the choice extractor

Experiment

Moving from VanillaEval to CircularEval drops performance significantly (sketch below).
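
To see why scores drop, here is a minimal sketch of both protocols, assuming a `predict(question, choices)` callable that returns the chosen answer text (after choice extraction). VanillaEval asks each question once; CircularEval rotates the N choices N times and counts a question correct only if every pass is correct.

```python
from collections import deque

def circular_shifts(choices: list[str]):
    """Yield all N circular rotations of the choice list."""
    d = deque(choices)
    for _ in range(len(choices)):
        yield list(d)
        d.rotate(1)

def vanilla_eval(samples, predict) -> float:
    """One pass per question: correct if the single answer matches."""
    hits = sum(predict(q, ch) == ans for q, ch, ans in samples)
    return hits / len(samples)

def circular_eval(samples, predict) -> float:
    """N passes per question with rotated choices; all must be correct."""
    hits = 0
    for q, choices, answer in samples:
        if all(predict(q, shifted) == answer
               for shifted in circular_shifts(choices)):
            hits += 1
    return hits / len(samples)

# Toy usage with a position-biased "model" that always picks the first choice:
samples = [("Q1", ["cat", "dog"], "cat"), ("Q2", ["dog", "cat"], "cat")]
first_choice = lambda q, ch: ch[0]
print(vanilla_eval(samples, first_choice))   # 0.5 — bias rewarded
print(circular_eval(samples, first_choice))  # 0.0 — bias exposed
```

A position-biased model gets partial credit under VanillaEval but is exposed under CircularEval, which matches the observed drop.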

The LLM is important for a VLM → the capability of the VLM's core language model largely determines its performance.

Conclusion

  • About 3000 multiple-choice questions
  • Evaluates 20 mainstream VLMs
