Takara Taniguchi

[memo] Visual Instruction Tuning

Haotian Liu (first author of LLaVA), Microsoft

Abstract

Instruction tuning of large language models using machine-generated instruction-following data

First attempt to use language-only GPT-4 to generate multimodal instruction-following data

Two diverse evaluation benchmarks

Introduction

Contribution

  • Multimodal instruction-following data → a data reformation pipeline that converts image-text pairs into instruction format
  • Large multimodal model → LLaVA: a CLIP vision encoder connected to an LLM, instruction-tuned on the generated data
  • Benchmark: LLaVA-Bench
  • Open-source release (data, code, model)

Related work

Multimodal instruction following agents

  • Vision-language navigation tasks, Habitat (require embodied AI)
  • LangChain, LLMs, ChatGPT, X-GPT, MM-REACT, VisProg, ViperGPT

Instruction tuning

Existing multimodal models are not explicitly tuned on vision-language instruction data.

Method

GPT-assisted visual instruction data generation

LLaVA → use ChatGPT/GPT-4 to create instruction-following annotations from existing image-text pairs (captions and bounding boxes)
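A rough sketch of the idea, assuming the OpenAI Python client; the function name, prompt wording, and example inputs are illustrative placeholders, not the paper's actual prompts. Captions and bounding boxes act as a symbolic, text-only view of the image, and the language-only model is asked to write a conversation as if it could see it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_conversation(captions: list[str], boxes: list[str]) -> str:
    """Hypothetical helper: ask a text-only GPT model to write a visual Q&A
    as if it could see the image, given only its textual descriptions."""
    context = "Captions:\n" + "\n".join(captions) + "\nBoxes:\n" + "\n".join(boxes)
    response = client.chat.completions.create(
        model="gpt-4",  # the paper used language-only GPT-4
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given captions and object bounding boxes describing an image. "
                    "Write a multi-turn conversation between a user asking about the image "
                    "and an assistant answering using only the given descriptions."
                ),
            },
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

print(generate_conversation(
    ["A group of people standing outside a black vehicle with various luggage."],
    ["person: [0.68, 0.24, 0.77, 0.69]", "backpack: [0.38, 0.69, 0.48, 0.91]"],
))
```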

Two specific use case scenarios

  • multimodal chatbot
  • Science QA

Benchmarks

LLaVA-Bench (COCO)

Generates three types of questions (conversation, detailed description, complex reasoning)

30 images

Conversation-type questions → improvement in performance

LLaVA-Bench (In-the-Wild)

24 images with 60 questions in total

LLaVA-Bench (In-the-Wild) is designed to be challenging

Visual instruction tuning

Architecture

Image → CLIP visual encoder → linear projection → LLM (Vicuna)
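A minimal PyTorch sketch of that wiring (the dimensions and class name are illustrative; LLaVA v1 uses a single trainable linear projection into the LLM's word embedding space):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative module: map CLIP patch features into the LLM embedding space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA v1 uses a single trainable linear projection W
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, clip_dim) from the frozen CLIP encoder
        # the projected tokens are concatenated with text token embeddings fed to the LLM
        return self.proj(patch_features)

# dummy features standing in for CLIP ViT-L/14 grid features
features = torch.randn(1, 256, 1024)
print(VisionProjector()(features).shape)  # torch.Size([1, 256, 4096])
```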

Training

Multi-turn conversation data; the loss is computed only on the assistant's answer tokens

Stage 1

  • Pre-training for feature alignment: image-text pairs are filtered to strike a balance between concept coverage and training efficiency; only the projection layer is trained

Stage 2

  • Fine-tuning end-to-end on the instruction data (visual encoder frozen; projection and LLM updated), see the sketch below
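A hedged sketch of the two-stage freezing schedule in PyTorch, with stand-in modules in place of the real CLIP encoder, Vicuna LLM, and projection layer:

```python
import torch.nn as nn

# stand-in modules; in practice these are the CLIP ViT encoder,
# the Vicuna LLM, and the linear projection from the architecture sketch above
clip_encoder = nn.Identity()
llm = nn.Linear(4096, 4096)
projector = nn.Linear(1024, 4096)

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: pre-training for feature alignment.
# The visual encoder and the LLM stay frozen; only the projection is trained.
set_trainable(clip_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: end-to-end fine-tuning on the instruction-following data.
# The visual encoder stays frozen; projection and LLM weights are both updated.
set_trainable(clip_encoder, False)
set_trainable(llm, True)
set_trainable(projector, True)
```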

Conclusion

  • Constructs LLaVA, a multimodal model that follows human intent
  • New SOTA on Science QA (when combined with GPT-4)
  • New benchmark: LLaVA-Bench
