Robotic learning typically relies on training datasets tailored to specific robots and tasks, necessitating extensive data collection for each new operation. The goal is to create a “general-purpose robot model” that could control a range of robots by leveraging data from previous machines and tasks, ultimately improving performance and generalization. However, such universal models face challenges unique to robotics, including varied robot embodiments, sensor configurations, action spaces, task specifications, environments, and compute budgets. While some progress has been made on “generalist robot policies” (GRPs) that map robot observations to actions across domains and robots, these models are often not publicly accessible, do not support effective fine-tuning, and accept only a narrow range of input observations.
Researchers from UC Berkeley, Stanford, Carnegie Mellon University, and Google DeepMind have introduced Octo, a transformer-based model pre-trained on 800,000 robot demonstrations from the Open X-Embodiment dataset. Octo is the first open-source GRP that can be effectively fine-tuned to new observation and action spaces. Trained on diverse robot and task datasets, the transformer model can map flexible combinations of observations and task specifications to actions for multiple robots, camera setups, and input modalities.
Octo is notable for its combination of features: a transformer backbone, support for goal-image task specification, and a diffusion head for modeling expressive action distributions. Experiments on nine robots across four universities showed that the model achieves state-of-the-art results in multi-robot control, and that Octo serves as an effective initialization for fine-tuning to new observation and action spaces. The researchers also analyzed how various design choices affect the quality of the pre-trained GRP, highlighting the importance of scale and flexibility.
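To make the architecture concrete, below is a minimal, illustrative Python sketch of the general idea: observation and task tokens are encoded by a backbone, and an action is then sampled with a DDPM-style diffusion head. This is not the authors' implementation; the stubbed backbone, layer sizes, action dimension, and number of diffusion steps are all placeholder assumptions.

```python
# Illustrative sketch only (not the official Octo code): a policy that encodes
# observation/task tokens and decodes actions with a DDPM-style diffusion head.
# Shapes, layer sizes, and the number of diffusion steps are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)

NUM_STEPS = 20    # diffusion steps (placeholder)
ACTION_DIM = 7    # e.g. 6-DoF end-effector delta + gripper (placeholder)
EMB_DIM = 32      # conditioning embedding size (placeholder)
W = rng.normal(scale=0.1, size=(ACTION_DIM, ACTION_DIM + EMB_DIM + 1))

def encode_tokens(obs_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer backbone: pools observation/task tokens
    into a single conditioning embedding (the real model uses self-attention)."""
    return obs_tokens.mean(axis=0)

def noise_mlp(noisy_action: np.ndarray, embedding: np.ndarray, t: int) -> np.ndarray:
    """Placeholder noise-prediction network; a trained head would regress the
    noise added at diffusion step t, conditioned on the embedding."""
    x = np.concatenate([noisy_action, embedding, [t / NUM_STEPS]])
    return np.tanh(W @ x)  # toy linear layer standing in for an MLP

# DDPM-style variance schedule
betas = np.linspace(1e-4, 0.02, NUM_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample_action(obs_tokens: np.ndarray) -> np.ndarray:
    """Reverse diffusion: start from Gaussian noise and iteratively denoise
    into an action, conditioned on the encoded observation/task tokens."""
    emb = encode_tokens(obs_tokens)
    action = rng.normal(size=ACTION_DIM)
    for t in reversed(range(NUM_STEPS)):
        eps = noise_mlp(action, emb, t)
        action = (action - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            action += np.sqrt(betas[t]) * rng.normal(size=ACTION_DIM)
    return action

# Example: 10 observation/task tokens of dimension EMB_DIM
print(sample_action(rng.normal(size=(10, EMB_DIM))))
```

In the actual model, the backbone is a full transformer over tokenized camera images and task specifications, and the noise-prediction network is trained on the pre-training corpus; the sketch only shows the structure of the diffusion sampling loop.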
The team has released resources for training, using, reproducing, and refining the Octo model. Pre-trained Octo checkpoints support flexible task specification and multiple RGB camera inputs. The full pre-training pipeline, including optimized data loaders and transformer implementations for multimodal inputs, is available, along with scripts for fine-tuning the models.
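For a feel of how such checkpoints are typically used, here is a hypothetical usage sketch. The class, method, and checkpoint names below are illustrative assumptions, not the released Octo API; the project repository documents the actual interfaces and checkpoint names.

```python
# Hypothetical usage sketch only: the names below are illustrative assumptions,
# not the released Octo API. See the project repository for the real interfaces.
import numpy as np

class PretrainedPolicy:
    """Stand-in for a pre-trained generalist robot policy checkpoint."""

    @classmethod
    def load(cls, checkpoint_name: str) -> "PretrainedPolicy":
        # A real loader would download and restore weights; here we only record the name.
        print(f"loading checkpoint: {checkpoint_name}")
        return cls()

    def predict_action(self, images: dict, instruction: str) -> np.ndarray:
        # A real policy would tokenize the camera images and task specification,
        # run the transformer backbone, and sample from the diffusion head.
        return np.zeros(7)  # placeholder 7-DoF action

# Example: two RGB camera views plus a language task specification.
policy = PretrainedPolicy.load("octo-base")  # hypothetical checkpoint name
obs = {
    "wrist_cam": np.zeros((128, 128, 3), dtype=np.uint8),
    "third_person_cam": np.zeros((256, 256, 3), dtype=np.uint8),
}
action = policy.predict_action(obs, "pick up the red block")
print(action)
```

The pattern of loading a checkpoint, passing camera images plus a task specification, and receiving an action reflects the released resources described above; the fine-tuning scripts follow the same interface with a small, robot-specific dataset.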
The Octo model represents an important step toward generalist robot policies that can work across an array of robot settings, and it aims to provide a practical platform for tapping into larger robotics datasets. The researchers believe their work will facilitate the use of pre-trained models for rapid task learning and generalization, pushing forward the fields of robotics and machine learning.