Abstract
- Trained on a diverse collection of 970k real-world robot demonstrations
- Built on Llama 2
- Fuses visual features from DINOv2 and SigLIP encoders (sketched under "VLM backbone" below)
- Strong generalization results
Introduction
CLIP, SigLIP, and Llama
Visually conditioned language models such as PaLI
OpenVLA outperforms the 55B-parameter RT-2-X model, the prior SoTA
Related work
Visually-conditioned language models
- VQA, building on advances in both computer vision and NLP modeling
- Generalist robot policies: Octo trains a generalist policy that can control multiple robots
- VLA models: VLMs already handle perception tasks such as object detection; fine-tuning them to output robot actions directly makes the approach scalable
- RT-2-X trains a 55B-parameter VLA on the Open X-Embodiment dataset
The OpenVLA model
Training procedure
- The Llama tokenizer reserves only 100 special tokens, which is too few for the 256 discretized action bins (see the sketch after this list)
- 64 A100 GPUs... that's impressive...
- Not too different from training standard VLMs
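The tokenizer note boils down to action discretization: each continuous action dimension is mapped to one of 256 bins, and the bin ids are reused as token ids at the tail of the Llama vocabulary. A minimal NumPy sketch under those assumptions (function names, `VOCAB_SIZE`, and the bound handling are illustrative, not the released code):

```python
import numpy as np

N_BINS = 256          # bins per action dimension
VOCAB_SIZE = 32000    # Llama-2 tokenizer vocabulary size

def action_to_token_ids(action, low, high):
    """Map a continuous action vector to discrete token ids.

    action, low, high: float arrays of shape (action_dim,).
    low/high are per-dimension normalization bounds (the paper derives
    them from quantiles of the training data; exact bounds assumed here).
    """
    clipped = np.clip(action, low, high)
    norm = (clipped - low) / (high - low + 1e-8)            # -> [0, 1]
    bins = np.minimum((norm * N_BINS).astype(int), N_BINS - 1)
    # reuse the 256 token ids at the tail of the vocabulary
    # (stand-in for "overwrite the least-used tokens")
    return VOCAB_SIZE - N_BINS + bins

def token_ids_to_action(token_ids, low, high):
    """Invert the mapping: token ids -> bin centers in action space."""
    bins = token_ids - (VOCAB_SIZE - N_BINS)
    centers = (bins + 0.5) / N_BINS
    return low + centers * (high - low)
```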
VLM backbone
IDEFICS, LLaVA
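The fused DINOv2 + SigLIP encoder mentioned in the abstract notes can be pictured as channel-wise concatenation of the two patch-feature streams followed by a small MLP projector into the language model's embedding space. A minimal PyTorch sketch, assuming stand-in encoders and illustrative dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Sketch: fuse two ViT patch-feature streams (DINOv2 + SigLIP) by
    channel-wise concatenation, then project into the LLM embedding space.
    The encoders are passed in as stand-ins; dimensions are illustrative."""

    def __init__(self, dino_encoder, siglip_encoder,
                 dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        self.dino = dino_encoder      # image -> (B, P, dino_dim)
        self.siglip = siglip_encoder  # image -> (B, P, siglip_dim)
        # small MLP projector from fused patch features to LLM token width
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image):
        dino_feats = self.dino(image)        # (B, P, dino_dim)
        siglip_feats = self.siglip(image)    # (B, P, siglip_dim)
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.projector(fused)         # (B, P, llm_dim) visual tokens
```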
Experiments
BridgeData V2, OpenX tasks
- Evaluations on multiple robots
- Fine-tuning performance
- PEFT and its tradeoffs (see the LoRA sketch below)
Comparison: RT-1-X, RT-2-X, Octo; RT-1-X and Octo are trained from scratch on OpenX data
PEFT condition: Franka-Tabletop, Franka-DROID
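A minimal sketch of PEFT via LoRA with Hugging Face `peft`, assuming the public `openvla/openvla-7b` checkpoint and illustrative hyperparameters (the official fine-tuning script may differ):

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

MODEL_ID = "openvla/openvla-7b"  # assumed public Hugging Face release

# Load the base VLA in bf16 (the checkpoint ships custom code).
base = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach low-rank adapters; r and alpha are illustrative choices.
lora_cfg = LoraConfig(
    r=32,                         # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # adapters on all linear layers
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Only the low-rank adapter weights are updated, which is what keeps fine-tuning feasible on a modest number of GPUs; that is the tradeoff the notes point at.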
Contribution
- Existing foundation models
- VLMs such as CLIP
- 7B-parameter open-source VLA
- Open X-Embodiment dataset
Conclusion
- Presented OpenVLA
- Limitation: supports only a single image observation
- Improving the inference throughput of OpenVLA is critical
- Task success rates are still below 90%