Abstract
VLMs have recently achieved remarkable progress
OwlEval
MMBench: a bilingual benchmark for VLMs
CircularEval
Introduction
Existing benchmarks: VQAv2, COCO Caption, GQA, OK-VQA
Contribution
- Systematically constructed dataset
- Robust evaluation → CircularEval
The construction of MMBench
Different from existing benchmarks, MMBench:
- Adopts images covering the defined ability dimensions
- Performs rigorous quality control
- Is a bilingual (English/Chinese) multi-modal benchmark
Hierarchical ability taxonomy
- L1: two top-level abilities, perception and reasoning
- L2: six sub-abilities
- L3: twenty fine-grained ability dimensions (a representation sketch follows this list)
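To make the three-level structure concrete, here is a minimal sketch of the taxonomy as a nested mapping. The L1/L2 names follow the paper; the L3 leaves shown are only a few illustrative examples of the twenty dimensions, and the dict layout itself is my own assumption, not MMBench's code.

```python
# Sketch of the hierarchical ability taxonomy as a nested dict.
# L1/L2 names follow the paper; the L3 leaves listed are a partial,
# illustrative subset of the twenty fine-grained dimensions.
ABILITY_TAXONOMY = {
    "Perception": {                                                    # L1
        "Coarse Perception": ["image style", "image scene"],           # L2 -> L3 (partial)
        "Fine-grained Perception (single-instance)": ["object localization", "OCR"],
        "Fine-grained Perception (cross-instance)": ["spatial relationship"],
    },
    "Reasoning": {                                                     # L1
        "Attribute Reasoning": ["function reasoning"],
        "Relation Reasoning": ["social relation"],
        "Logic Reasoning": ["future prediction"],
    },
}

def leaf_dimensions(taxonomy=ABILITY_TAXONOMY):
    """Flatten the taxonomy into the list of L3 leaf dimensions."""
    return [leaf for l2 in taxonomy.values() for leaves in l2.values() for leaf in leaves]
```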
Data collection and quality control
Did they use data that is claimed not to be used for model training??
MMBench statistics
- 3217 data samples
Evaluation Strategy
LLM-involved choice extraction
CircularEval: circular evaluation over a single question; its options are rotated N times and the model must answer every rotated copy correctly (see the pass-construction sketch after the list below)
- Matching the prediction: first try heuristic matching of the model's free-form output to an option ← seems unreliable
- If heuristic matching fails, match the output to an option with an LLM
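A minimal sketch of how the circular passes could be built; the function name and prompt layout are my assumptions, not MMBench's actual API, but the mechanism (N rotated copies of the options, correct letter tracked per copy) matches the description above. A 4-option question therefore has to be answered correctly four times to count as correct under CircularEval.

```python
import string

def make_circular_passes(question, options, answer_idx):
    """Build the N rotated copies used by CircularEval (sketch; function
    name and prompt format are assumptions, not MMBench's code)."""
    n = len(options)
    passes = []
    for shift in range(n):
        shifted = options[shift:] + options[:shift]
        # After shifting by `shift`, the originally correct option sits at
        # index (answer_idx - shift) mod n.
        correct_letter = string.ascii_uppercase[(answer_idx - shift) % n]
        prompt = question + "\n" + "\n".join(
            f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(shifted)
        )
        passes.append((prompt, correct_letter))
    return passes
```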
LLM as the choice extractor
Extracts the correct choice about 99% of the time
GPT-4 is used as the choice extractor
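A hedged sketch of the two-step extraction described above; the regexes and the `llm_match` callable are my assumptions, not the paper's exact implementation. Heuristics run first, and the LLM fallback is used only when they fail.

```python
import re

def extract_choice(model_output, options, llm_match=None):
    """Map a VLM's free-form answer to an option letter (sketch).
    Step 1: heuristic matching (standalone letter, or verbatim option text).
    Step 2: if that fails, delegate to an LLM-based matcher (e.g. GPT-4),
    passed in here as the `llm_match` callable (assumed interface)."""
    letters = "ABCD"[: len(options)]
    # Heuristic 1: an explicit option letter such as "A", "(B)" or "C."
    m = re.search(r"\b([A-D])\b", model_output)
    if m and m.group(1) in letters:
        return m.group(1)
    # Heuristic 2: the output contains one of the option texts verbatim.
    for letter, option in zip(letters, options):
        if option.lower() in model_output.lower():
            return letter
    # Fallback: ask the LLM to pick the matching letter (or None if unsure).
    if llm_match is not None:
        return llm_match(model_output, options)
    return None
```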
Experiment
Switching from VanillaEval to CircularEval significantly drops performance for most VLMs (see the scoring sketch below)
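To see why the drop happens, here is a hedged scoring sketch under an assumed data layout (a per-question list of booleans, one per rotated pass, with index 0 the original option order): VanillaEval credits only the original-order pass, while CircularEval credits a question only when every pass is correct.

```python
def vanilla_and_circular_accuracy(results):
    """`results`: {question_id: [bool, ...]}, one bool per circular pass,
    index 0 = original option order (layout assumed for illustration).
    VanillaEval: correct if the original-order pass is correct.
    CircularEval: correct only if all rotated passes are correct."""
    n = len(results)
    vanilla = sum(passes[0] for passes in results.values()) / n
    circular = sum(all(passes) for passes in results.values()) / n
    return vanilla, circular
```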
The underlying LLM matters: the core language model largely determines a VLM's performance
Conclusion
- About 3,000 multiple-choice questions
- Evaluates 20 mainstream VLMs