Robotic manipulation policies are currently limited by their inability to generalize beyond their training data. While these policies can adapt to modest variations, such as different object positions or lighting conditions, they struggle with unfamiliar objects, scenes, or tasks, and cannot reliably follow instructions they have not seen during training.
Promisingly, vision and language foundation models such as CLIP, SigLIP, and Llama 2 display exceptional generalization capabilities thanks to pretraining on internet-scale datasets. By contrast, even the largest robotic manipulation datasets contain only 100K to 1M examples, orders of magnitude less data than is used for vision and language pretraining.
Existing strategies include Visually-Conditioned Language Models (VLMs), Generalist Robot Policies, and Vision-Language-Action Models (VLAs). Among these, VLMs stand out, being used in robotics for tasks such as visual state representation, object detection, and high-level planning.
A new standard has been set for robotic manipulation policies with OpenVLA, a 7B-parameter open-source Vision-Language-Action model. Developed by researchers from multiple institutions, OpenVLA builds on a pre-trained visually-conditioned language model backbone whose fused visual encoders provide fine-grained visual grounding. The model was produced by fine-tuning this backbone on 970k robot manipulation trajectories from the Open X-Embodiment dataset. Impressively, OpenVLA outperforms the previous leading model, the 55B-parameter RT-2-X, by 16.5% in absolute success rate across 29 tasks on the WidowX and Google Robot platforms.
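As a usage sketch, the snippet below shows how a released checkpoint of this kind can be queried through the HuggingFace Transformers interface. The model ID `openvla/openvla-7b`, the `predict_action` helper, the prompt format, and the `bridge_orig` un-normalization key follow the public OpenVLA repository, but treat them as assumptions to verify against that repository rather than a definitive API reference.

```python
# Hedged sketch of querying an OpenVLA checkpoint via HuggingFace Transformers.
# Model ID, prompt format, predict_action(), and unnorm_key mirror the public
# OpenVLA repository and should be double-checked there before use.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")  # current camera observation
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-DoF end-effector action, un-normalized with BridgeData V2 statistics.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```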
OpenVLA also performs better than fine-tuned pre-trained generalist policies like Octo across seven diverse manipulation tasks. During pre-training, the model learns to map an input observation image and a natural language task instruction to a sequence of predicted robot actions, with each continuous action dimension discretized into tokens so that action prediction becomes ordinary next-token prediction for the language model.
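To make the action-as-token idea concrete, the sketch below shows one way continuous actions can be quantized into discrete bins and recovered afterward. The bin count of 256 follows the paper, but the helper names, the uniform binning, and the per-dimension bounds are illustrative assumptions rather than OpenVLA's exact implementation.

```python
import numpy as np

N_BINS = 256  # per the paper, each action dimension is discretized into 256 bins

def discretize_action(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to a bin index (illustrative uniform binning)."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)          # rescale to [0, 1]
    return np.minimum((normalized * n_bins).astype(int), n_bins - 1)

def undiscretize_action(bins, low, high, n_bins=N_BINS):
    """Recover a continuous action from bin indices by taking each bin's center."""
    centers = (bins + 0.5) / n_bins
    return low + centers * (high - low)

# Example: a 7-DoF action (x, y, z, roll, pitch, yaw, gripper) with hypothetical bounds.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
action = np.array([0.01, -0.02, 0.03, 0.0, 0.0, 0.01, 1.0])

tokens = discretize_action(action, low, high)       # integer bins the LM predicts as tokens
recovered = undiscretize_action(tokens, low, high)  # executed on the robot after de-tokenizing
```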
Notably, Diffusion Policy was found to match or surpass Octo and OpenVLA on narrower single-instruction tasks like “Put Carrot in Bowl” and “Pour Corn into Pot”. However, on more complex tasks that require following language instructions, the pre-trained generalist policies were superior. Thanks to OpenX pre-training, the Octo and OpenVLA models were better equipped to handle diverse tasks, with OpenVLA being the only model to consistently achieve at least a 50% success rate across all tested tasks.
In summary, OpenVLA, a new open-source vision-language-action model, sets a new standard in the robotics field. Capable of controlling a broad spectrum of robots, it performs strongly across a diverse range of tasks. The model can also be adapted to new robotic setups with modest effort and is especially well suited to imitation learning problems involving a variety of language instructions; a sketch of such an adaptation follows below. Its main drawback is that it currently supports only single-image observations, leaving support for multiple camera views, proprioceptive inputs, and observation history as directions for future work.
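For adapting the model to a new robot setup, the paper reports that parameter-efficient fine-tuning with low-rank adaptation (LoRA) performs close to full fine-tuning at a fraction of the compute. The sketch below outlines how such an adapter could be attached using the HuggingFace PEFT library; the rank, target-module choice, and the wrapping of the OpenVLA checkpoint are illustrative assumptions, not the repository's exact fine-tuning script.

```python
# Illustrative LoRA setup with HuggingFace PEFT; rank and target modules are assumptions.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=32,                         # the paper reports rank-32 LoRA works well
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # adapt all linear layers; a narrower set also works
    init_lora_weights="gaussian",
)

vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only a small fraction of the 7B parameters are trainable
# From here, train with standard next-token prediction on the new robot's demonstrations.
```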