Abstract
- Trained on a diverse collection of 970k real-world robot demonstrations
- Built on Llama 2
- Fuses visual features from DINOv2 and SigLIP encoders (sketched under "VLM backbone" below)
- Strong generalization results
Introduction
CLIP, SigLIP, and Llama
Visually conditioned language models such as PaLI
OpenVLA outperforms the 55B-parameter RT-2-X model, the prior SoTA
Related work
Visually-conditioned language models
- VQA, building on advances in both computer vision and NLP modeling
- Generalist robot policies: Octo trains a generalist policy that can control multiple robots
- VLA models: VLMs already handle perception tasks such as object detection; fine-tuning them to output robot actions directly makes the approach scalable
- RT-2-X trains a 55B-parameter VLA on the Open X-Embodiment dataset
The OpenVLA model
Training procedure
- The Llama tokenizer reserves only 100 special tokens, which is too few for the 256 discretized action bins (see the sketch after this list)
- 64 A100 GPUs... that's impressive...
- Not too different from training standard VLMs
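The tokenizer note boils down to action discretization: each continuous action dimension is mapped to one of 256 bins, and the bin ids are reused as token ids at the tail of the Llama vocabulary. A minimal NumPy sketch under those assumptions (function names, `VOCAB_SIZE`, and the bound handling are illustrative, not the released code):

```python
import numpy as np

N_BINS = 256          # bins per action dimension
VOCAB_SIZE = 32000    # Llama-2 tokenizer vocabulary size

def action_to_token_ids(action, low, high):
    """Map a continuous action vector to discrete token ids.

    action, low, high: float arrays of shape (action_dim,).
    low/high are per-dimension normalization bounds (the paper derives
    them from quantiles of the training data; exact bounds assumed here).
    """
    clipped = np.clip(action, low, high)
    norm = (clipped - low) / (high - low + 1e-8)            # -> [0, 1]
    bins = np.minimum((norm * N_BINS).astype(int), N_BINS - 1)
    # reuse the 256 token ids at the tail of the vocabulary
    # (stand-in for "overwrite the least-used tokens")
    return VOCAB_SIZE - N_BINS + bins

def token_ids_to_action(token_ids, low, high):
    """Invert the mapping: token ids -> bin centers in action space."""
    bins = token_ids - (VOCAB_SIZE - N_BINS)
    centers = (bins + 0.5) / N_BINS
    return low + centers * (high - low)
```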
VLM backbone
IDEFICS, LLaVA
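The fused DINOv2 + SigLIP encoder mentioned in the abstract notes can be pictured as channel-wise concatenation of the two patch-feature streams followed by a small MLP projector into the language model's embedding space. A minimal PyTorch sketch, assuming stand-in encoders and illustrative dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Sketch: fuse two ViT patch-feature streams (DINOv2 + SigLIP) by
    channel-wise concatenation, then project into the LLM embedding space.
    The encoders are passed in as stand-ins; dimensions are illustrative."""

    def __init__(self, dino_encoder, siglip_encoder,
                 dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        self.dino = dino_encoder      # image -> (B, P, dino_dim)
        self.siglip = siglip_encoder  # image -> (B, P, siglip_dim)
        # small MLP projector from fused patch features to LLM token width
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image):
        dino_feats = self.dino(image)        # (B, P, dino_dim)
        siglip_feats = self.siglip(image)    # (B, P, siglip_dim)
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.projector(fused)         # (B, P, llm_dim) visual tokens
```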
Experiments
BridgeData V2, OpenX tasks
- Evaluations on multiple robots
- Fine-tuning performance
- PEFT and its tradeoffs (see the LoRA sketch below)
Comparison: RT-1-X, RT-2-X, Octo; RT-1-X and Octo are trained from scratch on OpenX data
PEFT condition: Franka-Tabletop, Franka-DROID
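A minimal sketch of PEFT via LoRA with Hugging Face `peft`, assuming the public `openvla/openvla-7b` checkpoint and illustrative hyperparameters (the official fine-tuning script may differ):

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

MODEL_ID = "openvla/openvla-7b"  # assumed public Hugging Face release

# Load the base VLA in bf16 (the checkpoint ships custom code).
base = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach low-rank adapters; r and alpha are illustrative choices.
lora_cfg = LoraConfig(
    r=32,                         # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # adapters on all linear layers
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Only the low-rank adapter weights are updated, which is what keeps fine-tuning feasible on a modest number of GPUs; that is the tradeoff the notes point at.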
Contribution
- Existing foundation models
- VLMs such as CLIP
- 7B-parameter open-source VLA
- Open X-Embodiment dataset
Conclusion
- Presented OpenVLA
- Limitation: supports only a single image observation
- Improving the inference throughput of OpenVLA is critical
- Task success rates are still below 90%