
Researchers from Google DeepMind present “Mobility VLA”, a navigation method that combines long-context VLMs and topological graphs to follow multimodal instructions.

Advancements in sensors, artificial intelligence (AI), and processing power have opened new possibilities in robot navigation. Many studies propose extending the natural-language settings of Object Goal Navigation (ObjNav) and Vision-and-Language Navigation (VLN) to a multimodal space in which robots follow text and image-based instructions simultaneously. This approach is called Multimodal Instruction Navigation (MIN).

MIN encompasses both exploring the environment and following navigation instructions. However, a demonstration tour that covers the entire region can often remove the need for exploration altogether.

Researchers from Google DeepMind have introduced a new task called Multimodal Instruction Navigation with Tours (MINT), which uses a demonstration tour to execute multimodal user instructions. Vision-Language Models (VLMs), known for their language and image understanding and common-sense reasoning, show significant promise for MINT. However, VLMs face challenges such as limits on the number of input images imposed by context length and the fact that they cannot directly output robot actions.

To overcome these challenges, the researchers proposed a hierarchical Vision-Language-Action (VLA) navigation policy called Mobility VLA. It combines a high-level long-context VLM, which uses the demonstration tour and the multimodal user instruction to identify a goal frame, with a low-level navigation policy based on a topological graph. The graph is built offline from the tour frames and is used to generate robot actions in the form of waypoints.
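To make the division of labor concrete, here is a minimal Python sketch of how such a two-level hierarchy could be wired together. The interface names (`select_goal_frame`, `TopologicalNavigator`, `vlm.generate`) are illustrative assumptions, not the paper's actual implementation; a generic graph library stands in for the low-level planner.

```python
# Hypothetical sketch of a hierarchical VLM + topological-graph policy.
import networkx as nx


def select_goal_frame(vlm, tour_frames, user_text, user_image=None):
    """High level: ask a long-context VLM which tour frame best matches
    the multimodal instruction. Returns the index of that frame."""
    prompt = (
        "You are given a demonstration tour as a sequence of numbered frames "
        "and a user instruction. Answer with the index of the frame closest "
        "to where the instruction should be carried out."
    )
    inputs = [prompt, *tour_frames, user_text]
    if user_image is not None:
        inputs.append(user_image)
    # Assumed VLM interface: returns plain text containing a frame index.
    return int(vlm.generate(inputs).strip())


class TopologicalNavigator:
    """Low level: a graph built offline from the tour; nodes are tour frames,
    edges connect frames that are spatially adjacent."""

    def __init__(self, frame_poses, adjacency):
        self.graph = nx.Graph()
        for i, pose in enumerate(frame_poses):
            self.graph.add_node(i, pose=pose)
        self.graph.add_edges_from(adjacency)

    def next_waypoint(self, current_node, goal_node):
        """Return the pose of the next node on the shortest path to the goal."""
        path = nx.shortest_path(self.graph, current_node, goal_node)
        nxt = path[1] if len(path) > 1 else goal_node
        return self.graph.nodes[nxt]["pose"]
```

In this reading, the VLM is queried once per user instruction to pick a goal frame, while the graph-based waypoint selection runs at every timestep, which keeps the expensive model off the control loop.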

Testing Mobility VLA in various settings yielded promising results, with success rates of up to 90%, significantly outperforming baseline techniques. The current version relies on a demonstration tour, but it could also incorporate pre-existing exploration methods to build that tour.

The researchers also noted that long VLM inference times can make interactions feel unnatural, with users waiting 10 to 30 seconds for a robot response. Caching the demonstration tour can significantly improve inference speed.
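Conceptually, the speedup is possible because the tour is identical across queries, so its expensive encoding can be computed once and reused. A hedged sketch of that idea follows; `encode_tour` and `answer_with_cache` are hypothetical stand-ins for whatever prefix- or KV-caching mechanism the deployed VLM actually exposes.

```python
# Assumed caching pattern: pay the tour-encoding cost once at startup.
class CachedTourVLM:
    def __init__(self, vlm, tour_frames):
        self.vlm = vlm
        # One-time cost, incurred before any user query arrives.
        self.tour_cache = vlm.encode_tour(tour_frames)

    def query(self, user_text, user_image=None):
        # Per-query work now covers only the short instruction, not the
        # hundreds of tour frames, which is where the latency saving comes from.
        return self.vlm.answer_with_cache(self.tour_cache, user_text, user_image)
```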

Because it demands little onboard compute and requires only RGB camera observations, Mobility VLA can be implemented on a wide range of robot platforms, paving the way for broad deployment. It represents a promising step forward for robotics and AI.
