We are excited to share a major step forward in robotics technology: Google DeepMind has released a suite of new tools to help robots learn faster and more efficiently in novel environments! Training a robot to perform specific tasks in a single environment is only the beginning; if robots are to become truly useful to us, they must be able to perform a range of general tasks in environments they have never encountered before.
Last year, DeepMind introduced its RT-2 robotics control model and RT-X robotic datasets. RT-2 is designed to translate natural-language commands, together with what the robot sees, into robotic actions, and the new tools build on this to move us closer to autonomous robots that can explore different environments and learn new skills.
Two years ago, foundation models demonstrated their ability to perceive and reason about the world around us, bringing a scalable robotics system within reach. DeepMind's AutoRT framework is designed to orchestrate robotic agents in the wild, using a combination of a foundation Large Language Model (LLM) and a Visual Language Model (VLM). The VLM allows the robot to assess the scene in front of it and pass a description to the LLM, which evaluates the objects and scene and generates a list of potential tasks the robot could perform. These candidate tasks are then filtered for safety, for the robot's capabilities, and for whether attempting the task would add new skills or diversity to the AutoRT knowledge base.
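The describe-propose-filter loop can be sketched in a few lines. Everything below is a stand-in invented for illustration: the real VLM and LLM are large learned models, and the filtering criteria shown here are simplified placeholders, not DeepMind's actual rules.

```python
# Minimal sketch of an AutoRT-style orchestration loop.
# All components are hypothetical stand-ins for illustration only.

def vlm_describe(image):
    """Stand-in VLM: return a text description of the scene."""
    return "a desk with a sponge, a cup, and a closed laptop"

def llm_propose_tasks(description):
    """Stand-in LLM: propose candidate tasks for the described scene."""
    return [
        "wipe the desk with the sponge",
        "pick up the cup",
        "open the laptop",
        "hand the cup to a person",
    ]

def is_safe(task):
    # Constitution-style guardrail: reject tasks involving people.
    banned = ("person", "human", "knife")
    return not any(word in task for word in banned)

def is_feasible(task, capabilities):
    # Stand-in capability check: this robot has no way to open things.
    return "open" not in task

def select_tasks(image, capabilities, seen_tasks):
    description = vlm_describe(image)
    candidates = llm_propose_tasks(description)
    # Keep tasks that are safe, feasible, and add diversity to the dataset.
    return [t for t in candidates
            if is_safe(t) and is_feasible(t, capabilities) and t not in seen_tasks]

tasks = select_tasks(image=None, capabilities={"single_arm"}, seen_tasks=set())
```

In this toy run, the unsafe task (handing an object to a person) and the infeasible one (opening the laptop) are filtered out, leaving only tasks the robot can safely attempt.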
With AutoRT, DeepMind has successfully “safely orchestrated as many as 20 robots simultaneously, and up to 52 unique robots in total, in a variety of office buildings, gathering a diverse dataset comprising 77,000 robotic trials across 6,650 unique tasks.”
Sending a robot out into new environments introduces the potential for dangerous situations that cannot all be anticipated in advance. DeepMind's robot constitution, inspired by Isaac Asimov's Three Laws of Robotics, provides generalized safety guardrails to protect both the robot and its environment.
Additionally, DeepMind has created Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT), which makes models like RT-2 more efficient. RT-2's neural network architecture relies on attention modules of quadratic complexity: double the input, and you need four times the computational resources. SARA-RT fine-tunes the robotic model to use linear attention instead, yielding a 14% improvement in speed and a 10% gain in accuracy.
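The contrast between the two attention forms can be shown directly. In the sketch below, softmax attention materializes the full n × n weight matrix, while linear attention applies a positive feature map and reassociates the matrix product so the cost grows linearly with sequence length. The elu(x) + 1 feature map is a common choice from the linear-attention literature; SARA-RT's actual fine-tuning ("up-training") recipe is not shown here.

```python
# Quadratic softmax attention vs. linear attention (illustrative sketch).
import numpy as np

def softmax_attention(Q, K, V):
    # Forms the full n x n attention matrix: cost grows quadratically in n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Positive feature map (elu(x) + 1), then reassociate (Qf Kf^T) V
    # as Qf (Kf^T V): no n x n matrix, cost grows linearly in n.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                 # (d x d), independent of sequence length
    z = Qf @ Kf.sum(axis=0)       # per-query normalizer
    return (Qf @ kv) / z[:, None]

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out_quadratic = softmax_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

The two functions produce outputs of the same shape, but only the quadratic version ever allocates a matrix whose size depends on the sequence length squared.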
RT-Trajectory overlays a 2D sketch of the required motion on each training video, so the robot can learn intuitively what kind of motion accomplishes the task. This lets the robot convert a natural-language instruction into a concrete sequence of motor motions and rotations for its moving parts, giving it the instructions it needs to complete the task. When tested on 41 tasks unseen in the training data, an arm controlled by RT-Trajectory achieved a 63% success rate.
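The core idea, conditioning a policy on a rasterized 2D motion sketch rather than on language alone, can be illustrated by drawing a gripper trajectory into an image. The drawing routine below is invented for illustration; the real system overlays trajectories on training videos.

```python
# Sketch of the RT-Trajectory idea: rasterize a 2D waypoint trajectory
# onto a camera frame as an extra conditioning channel for the policy.
import numpy as np

def overlay_trajectory(image, waypoints, value=255):
    """Draw straight segments between consecutive waypoints into a copy."""
    out = image.copy()
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        steps = max(abs(r1 - r0), abs(c1 - c0)) + 1
        for t in np.linspace(0.0, 1.0, steps):
            r = round(r0 + t * (r1 - r0))
            c = round(c0 + t * (c1 - c0))
            out[r, c] = value
    return out

frame = np.zeros((64, 64), dtype=np.uint8)
# Hypothetical gripper path: slide right, then down (row, col coordinates).
sketch = overlay_trajectory(frame, [(10, 10), (10, 40), (40, 40)])
```

The policy then sees the original frame plus this sketch, so a single model can be steered toward many different motions without retraining.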
We are thrilled to see the progress Google DeepMind has made in its pursuit of autonomous robots, and are excited to see how these new tools will speed up the integration of AI-powered robots into our daily lives!