Haotian Liu (the LLaVA author), Microsoft
Abstract
Instruction tuning of large language models with machine-generated instruction-following data
First attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data
Two diverse benchmarks are introduced for evaluation
Introduction
Contribution
- Multimodal instruction-following data → a pipeline that reforms image-text pairs into instruction-following format using GPT-4
- MLLM → LLaVA: connects a CLIP vision encoder with the Vicuna LLM and is fine-tuned end-to-end on the generated data
- Benchmark: LLaVA-Bench
- Open-source: generated data, model checkpoints, and code are released
Related work
Multimodal instruction following agents
- Vision-language navigation tasks, Habitat (require embodied AI)
- LangChain / LLMs, ChatGPT, X-GPT, MM-REACT, VisProg, ViperGPT
Instruction tuning
Existing multimodal models are not explicitly tuned on vision-language instruction data.
Method
GPT-assisted visual instruction data generation
LLaVA → uses ChatGPT/GPT-4 on image captions and bounding boxes to create instruction-following annotations (conversations, detailed descriptions, complex reasoning)
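As a rough illustration of this pipeline, the sketch below feeds captions and bounding boxes (a purely textual view of the image) to a language-only chat model and asks it for a conversation about the image. The prompt wording, model name, and helper functions are assumptions for illustration, not the paper's exact prompts.

```python
from openai import OpenAI

client = OpenAI()

def build_symbolic_context(captions, boxes):
    """Describe the image purely in text: captions plus object bounding boxes."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(f"{name}: {coords}" for name, coords in boxes)
    return f"Captions:\n{caption_block}\nObjects (x1, y1, x2, y2):\n{box_block}"

def generate_conversation(captions, boxes) -> str:
    """Ask a language-only chat model to write a Q&A conversation about the image."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any capable text-only chat model
        messages=[
            {"role": "system",
             "content": "You can only 'see' the image through the text below. "
                        "Write a multi-turn question-and-answer conversation about it, "
                        "as if between a user and a visual AI assistant."},
            {"role": "user", "content": build_symbolic_context(captions, boxes)},
        ],
    )
    return response.choices[0].message.content

# Example with made-up COCO-style annotations
captions = ["A man rides a horse along the beach.", "Waves roll in behind the rider."]
boxes = [("person", (0.21, 0.30, 0.45, 0.80)), ("horse", (0.18, 0.40, 0.60, 0.95))]
print(generate_conversation(captions, boxes))
```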
Two specific use case scenarios
- Multimodal chatbot
- ScienceQA
Benchmarks
LLaVA-Bench (COCO)
Generates three types of questions (conversation, detailed description, complex reasoning)
30 images
Conversational questions → clear improvement from instruction tuning
LLaVA-Bench (In-the-Wild)
24 images with 60 questions in total
LLaVA-Bench (In-the-Wild) is designed to be challenging
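For context, the paper scores LLaVA-Bench answers with a text-only judge model: it sees the question, a reference answer, and the candidate answer, and the candidate is reported relative to the reference. A minimal sketch of that idea, with an assumed rubric and model name (not the paper's exact prompt):

```python
from openai import OpenAI

client = OpenAI()

def judge_relative(question: str, reference: str, candidate: str) -> float:
    """Ask a text-only judge to score both answers; return candidate/reference as a percentage."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant 1 (reference): {reference}\n\n"
        f"Assistant 2 (candidate): {candidate}\n\n"
        "Rate each assistant from 1 to 10 for helpfulness, relevance, and accuracy. "
        "Put the two scores on the first line separated by a space, then a brief explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any strong text-only judge model
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the judge follows the requested format; a sketch, not production parsing
    first_line = response.choices[0].message.content.splitlines()[0]
    ref_score, cand_score = (float(x) for x in first_line.split()[:2])
    return 100.0 * cand_score / ref_score  # relative score of the candidate answer
```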
Visual instruction tuning
Architecture
Image → CLIP vision encoder → linear projection → visual tokens fed to the LLM (Vicuna)
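A minimal PyTorch sketch of this connector, with illustrative dimensions (CLIP ViT-L/14 features of width 1024, an LLM embedding width of 4096): the image's patch features are projected by a single trainable linear layer into "visual tokens" that are concatenated with the text embeddings.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects frozen CLIP patch features into the LLM embedding space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Trainable projection W (original LLaVA uses a single linear layer)
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_patch_features: torch.Tensor) -> torch.Tensor:
        # clip_patch_features: (batch, num_patches, clip_dim)
        # returns visual tokens of shape (batch, num_patches, llm_dim), which are
        # concatenated with the text token embeddings before being fed to the LLM
        return self.proj(clip_patch_features)

# Example with illustrative sizes: 256 patches of width 1024 -> 256 visual tokens of width 4096
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```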
Training
Multi-turn conversation data (the loss is computed only on the assistant's responses)
Stage 1: pre-training for feature alignment
- To strike a balance between concept coverage and training efficiency, a filtered subset of image-text pairs is used; only the projection layer is trained while the vision encoder and LLM stay frozen
Stage 2
- Fine-tuning end-to-end on the GPT-generated instruction data: the projection layer and the LLM are both updated while the vision encoder stays frozen (see the parameter-freezing sketch below)
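A sketch of how the two stages differ in which parameters are trainable, assuming a model object with vision_encoder, projection, and llm submodules (the attribute names are illustrative, not LLaVA's actual code):

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: int) -> None:
    # The CLIP vision encoder stays frozen in both stages.
    set_requires_grad(model.vision_encoder, False)
    set_requires_grad(model.projection, True)
    if stage == 1:
        # Stage 1 (feature alignment): only the projection layer is trained.
        set_requires_grad(model.llm, False)
    else:
        # Stage 2 (end-to-end fine-tuning): projection and LLM are both updated.
        set_requires_grad(model.llm, True)
```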
Conclusion
- Constructs LLaVA, a multimodal model that follows human intent
- New SoTA on ScienceQA (when ensembled with GPT-4)
- New benchmark: LLaVA-Bench