Takara Taniguchi

[memo] Visual Instruction Tuning

Haotian Liu (first author of LLaVA), Microsoft

Abstract

Instruction tuning of large language models using machine-generated instruction-following data

First attempt to use language-only GPT-4 to generate multimodal instruction-following data

Two diverse evaluation benchmarks

Introduction

Contribution

  • Multimodal instruction-following data → a data reformation pipeline that converts image-text pairs into instruction format
  • Large multimodal model → LLaVA: a CLIP vision encoder connected to an LLM, instruction-tuned on the generated data
  • Benchmark: LLaVA-Bench
  • Open-source release (data, code, model)

Related work

Multimodal instruction following agents

  • Vision-language navigation tasks, Habitat (require embodied AI)
  • LangChain, LLMs, ChatGPT, X-GPT, MM-REACT, VisProg, ViperGPT

Instruction tuning

Existing multimodal models are not explicitly tuned on vision-language instruction data.

Method

GPT-assisted visual instruction data generation

LLaVA → use ChatGPT/GPT-4 to create instruction-following annotations from existing image-text pairs (captions and bounding boxes)
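A rough sketch of the idea, assuming the OpenAI Python client; the function name, prompt wording, and example inputs are illustrative placeholders, not the paper's actual prompts. Captions and bounding boxes act as a symbolic, text-only view of the image, and the language-only model is asked to write a conversation as if it could see it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_conversation(captions: list[str], boxes: list[str]) -> str:
    """Hypothetical helper: ask a text-only GPT model to write a visual Q&A
    as if it could see the image, given only its textual descriptions."""
    context = "Captions:\n" + "\n".join(captions) + "\nBoxes:\n" + "\n".join(boxes)
    response = client.chat.completions.create(
        model="gpt-4",  # the paper used language-only GPT-4
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given captions and object bounding boxes describing an image. "
                    "Write a multi-turn conversation between a user asking about the image "
                    "and an assistant answering using only the given descriptions."
                ),
            },
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

print(generate_conversation(
    ["A group of people standing outside a black vehicle with various luggage."],
    ["person: [0.68, 0.24, 0.77, 0.69]", "backpack: [0.38, 0.69, 0.48, 0.91]"],
))
```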

Two specific use case scenarios

  • multimodal chatbot
  • Science QA

Benchmarks

LLaVA-Bench (COCO)

Generates three types of questions (conversation, detailed description, complex reasoning)

30 images

Conversation-type questions → improvement in performance

LLaVA-Bench (In-the-Wild)

24 images with 60 questions in total

LLaVA-Bench (In-the-Wild) is designed to be challenging

Visual instruction tuning

Architecture

Image → CLIP visual encoder → linear projection → LLM (Vicuna)
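A minimal PyTorch sketch of that wiring (the dimensions and class name are illustrative; LLaVA v1 uses a single trainable linear projection into the LLM's word embedding space):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative module: map CLIP patch features into the LLM embedding space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA v1 uses a single trainable linear projection W
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, clip_dim) from the frozen CLIP encoder
        # the projected tokens are concatenated with text token embeddings fed to the LLM
        return self.proj(patch_features)

# dummy features standing in for CLIP ViT-L/14 grid features
features = torch.randn(1, 256, 1024)
print(VisionProjector()(features).shape)  # torch.Size([1, 256, 4096])
```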

Training

Multi-turn conversation data; the loss is computed only on the assistant's answer tokens

Stage 1

  • Pre-training for feature alignment: image-text pairs are filtered to strike a balance between concept coverage and training efficiency; only the projection layer is trained

Stage 2

  • Fine-tuning end-to-end on the instruction data (visual encoder frozen; projection and LLM updated), see the sketch below
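A hedged sketch of the two-stage freezing schedule in PyTorch, with stand-in modules in place of the real CLIP encoder, Vicuna LLM, and projection layer:

```python
import torch.nn as nn

# stand-in modules; in practice these are the CLIP ViT encoder,
# the Vicuna LLM, and the linear projection from the architecture sketch above
clip_encoder = nn.Identity()
llm = nn.Linear(4096, 4096)
projector = nn.Linear(1024, 4096)

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: pre-training for feature alignment.
# The visual encoder and the LLM stay frozen; only the projection is trained.
set_trainable(clip_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: end-to-end fine-tuning on the instruction-following data.
# The visual encoder stays frozen; projection and LLM weights are both updated.
set_trainable(clip_encoder, False)
set_trainable(llm, True)
set_trainable(projector, True)
```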

Conclusion

  • Constructs LLaVA, a multimodal model that follows human intent
  • New SOTA on Science QA (when combined with GPT-4)
  • New benchmark: LLaVA-Bench
