
Takara Taniguchi


[memo] OpenVLA: An Open-Source Vision-Language-Action Model

Abstract

  • Trained on a diverse collection of 970k robot demonstrations
  • Built on Llama 2
  • Fuses DINOv2 and SigLIP visual features
  • Strong generalization results

Introduction

CLIP, SigLIP, and Llama

Visually conditioned language models such as PaLI

OpenVLA (7B) outperforms the 55B-parameter RT-2-X model, the previous SoTA

Related work

Visually-conditioned language models

  • VQA, building on advances in both computer vision and NLP modeling
  • Generalist robot policies: Octo trains a generalist policy that can control multiple robots
  • VLA models: fine-tuning VLMs to directly output robot actions makes the recipe scalable
  • RT-2-X trains a 55B-parameter VLA on the Open X-Embodiment dataset

The OpenVLA model

Training procedure

  • The Llama tokenizer reserves only 100 special tokens, which is too few for the 256 action bins, so the least-used tokens are overwritten instead (see the sketch after this list)
  • Trained on 64 A100 GPUs... impressive!
  • Otherwise not too different from standard VLM training

VLM backbone

IDEFICS, LLaVA; OpenVLA builds on the Prismatic-7B VLM
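The fused vision backbone concatenates DINOv2 and SigLIP patch features channel-wise and projects them into the LLM's embedding space. Here is a rough sketch of that fusion with stand-in encoder modules; the dimensions and module names are illustrative (the real model uses pretrained DINOv2 ViT-L and SigLIP ViT-So400M encoders).

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Channel-wise fusion of two ViT patch-feature streams (DINOv2 + SigLIP),
    followed by a small MLP projector into the LLM embedding space."""

    def __init__(self, dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        # stand-ins for pretrained DINOv2 / SigLIP encoders returning patch tokens
        self.dino = nn.Identity()
        self.siglip = nn.Identity()
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches, siglip_patches):
        # dino_patches: (B, N, dino_dim), siglip_patches: (B, N, siglip_dim)
        fused = torch.cat([self.dino(dino_patches), self.siglip(siglip_patches)], dim=-1)
        return self.projector(fused)  # (B, N, llm_dim) visual tokens for the LLM

# Example with 256 patch tokens per image
backbone = FusedVisionBackbone()
vis_tokens = backbone(torch.randn(1, 256, 1024), torch.randn(1, 256, 1152))
print(vis_tokens.shape)  # torch.Size([1, 256, 4096])
```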

Experiments

BridgeData V2 (WidowX) and Open X-Embodiment evaluations
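For reference, running the released checkpoint looks roughly like the snippet below, following the usage published with the openvla/openvla-7b model card; the prompt format, unnorm_key value, and file path are assumptions to verify against the official repo.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the released 7B checkpoint (bfloat16 on a single GPU)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# A single RGB camera frame and a language instruction
image = Image.open("frame.png")  # placeholder path
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-DoF continuous action, un-normalized with BridgeData V2 statistics
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```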

Tasks

  • Evaluation on multiple robots
  • Fine-tuning performance
  • PEFT and its tradeoffs

Comparison: RT-1-X, RT-2-X, and Octo; RT-1-X and Octo are trained from scratch on OpenX data

PEFT setting: Franka-Tabletop and Franka-DROID setups
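The PEFT variant in the paper applies LoRA with rank 32 to the model's linear layers. A minimal sketch of that setup with the Hugging Face peft library follows; the target-module list and alpha are illustrative assumptions, and the official fine-tuning script lives in the OpenVLA repo.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the base VLA, then wrap it with low-rank adapters
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# The paper reports LoRA rank 32 applied to all linear layers;
# the subset of projection layers named here is an illustrative choice.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of the 7B weights train
```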

Contribution

  • Builds on existing foundation models, i.e. vision-language models such as CLIP
  • A 7B-parameter open-source VLA
  • Trained on the Open X-Embodiment dataset

Conclusion

  • Presented OpenVLA
  • Limited capacity: only a single image observation is supported
  • Improving the inference throughput of OpenVLA is critical
  • Task success rates are still below 90%
