Recent technological advancements have greatly enhanced robot navigation, particularly through the integration of AI, sensors, and improved processing power. Several studies advocate extending the natural-language task space of Object Navigation (ObjNav) and Vision-and-Language Navigation (VLN) to a multimodal space, enabling robots to follow instructions given in both text and image form. Researchers refer to this type of task as Multimodal Instruction Navigation (MIN).
MIN covers a broad range of activities, such as exploring the environment and following navigation instructions. However, a demonstration tour video that covers the entire environment can often remove the need for exploration altogether.
A study by Google DeepMind examines a family of tasks called Multimodal Instruction Navigation with Tours (MINT), which uses demonstration tours to carry out multimodal user instructions. Recent achievements of large Vision-Language Models (VLMs) in interpreting language and imagery, coupled with their common-sense reasoning, show a lot of promise for tackling MINT. However, a standalone VLM is not sufficient to solve MINT, for two reasons. First, VLMs can accept only a limited number of input images because of context-length limits, which prevents an accurate understanding of large environments. Second, solving MINT requires computing robot actions, and such queries lie outside the distribution that VLMs are (pre)trained on, so their zero-shot navigation performance is poor.
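To make the context-length constraint concrete, the back-of-the-envelope sketch below estimates how many tour frames fit in a given context window. The per-frame token cost and the window sizes are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch: how many tour frames fit in a VLM context window.
# tokens_per_frame and the window sizes below are assumed values for illustration.

def max_frames(context_window_tokens: int,
               tokens_per_frame: int = 256,
               reserved_for_text: int = 2_000) -> int:
    """Return roughly how many tour frames fit alongside the text prompt."""
    return max(0, (context_window_tokens - reserved_for_text) // tokens_per_frame)

# A short-context VLM sees only a tiny slice of a long demonstration tour,
# while a long-context VLM can ingest most or all of it.
for window in (8_000, 128_000, 1_000_000):
    print(f"{window:>9} tokens -> ~{max_frames(window)} frames")
```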
The study introduces Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environmental understanding and common-sense reasoning of long-context VLMs with a robust low-level navigation policy based on topological graphs. The high-level VLM takes the demonstration tour video and the multimodal user instruction and identifies the goal frame in the tour video. A topological graph is constructed offline from the tour frames, and at each time step a conventional low-level policy uses this graph to generate the robot actions that lead toward the goal frame.
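The sketch below illustrates this two-level split under stated assumptions: the VLM interface (`vlm.query`), the frame-connectivity test (`edge_fn`), and the node names are hypothetical placeholders, not the paper's implementation, and serve only to show the division of labour between the high-level and low-level policies.

```python
# Minimal sketch of a hierarchical tour-based navigation policy.
# High level: a long-context VLM picks the tour frame matching the instruction.
# Low level: a topological graph over tour frames yields the next waypoint.
import networkx as nx

def high_level_goal_frame(vlm, tour_frames, instruction_text, instruction_image=None):
    """Ask a long-context VLM (hypothetical .query API) for the goal frame index."""
    images = tour_frames + ([instruction_image] if instruction_image is not None else [])
    prompt = ("Given the demonstration tour frames and the user instruction, "
              "return the index of the frame closest to where the robot should go.")
    return int(vlm.query(prompt, images=images, text=instruction_text))

def build_topological_graph(tour_frames, edge_fn):
    """Offline: one node per tour frame; edges where edge_fn deems frames traversable
    (e.g. based on visual overlap or odometry proximity)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(tour_frames)))
    for i in range(len(tour_frames)):
        for j in range(i + 1, len(tour_frames)):
            if edge_fn(tour_frames[i], tour_frames[j]):
                g.add_edge(i, j)
    return g

def low_level_step(graph, current_node, goal_node):
    """Online, each time step: return the next waypoint node on the shortest path."""
    path = nx.shortest_path(graph, current_node, goal_node)
    return path[1] if len(path) > 1 else current_node
```

The key design choice this sketch captures is that the VLM is queried only for a goal frame, while path following is delegated to a cheap graph search that runs at every control step.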
The team evaluated Mobility VLA in both an office and a residential setting with successful results. On complex MINT tasks demanding intricate reasoning, Mobility VLA achieved success rates of 86% and 90%, respectively. This is noticeably better than the baseline techniques and demonstrates the real-world applicability of Mobility VLA.
The current version of Mobility VLA relies on a demonstration tour rather than independent exploration of its surroundings. However, the demonstration tour itself could be produced by existing exploration methods, such as frontier-based or diffusion-based exploration.
Long VLM inference times can hamper natural user interaction, forcing users to wait a considerable time for the robot's response. However, inference can be sped up substantially by caching the demonstration tour, which accounts for about 99.9% of the input tokens.
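A rough illustration of that caching idea follows: the tour tokens are processed once and the resulting prefix is reused for every subsequent instruction. The interface here (`encode_prefix`, `generate_with_prefix`) is hypothetical, standing in for whatever prefix or KV caching the serving stack provides.

```python
# Sketch: pay the cost of encoding the demonstration tour once, then reuse it.

class CachedTourVLM:
    def __init__(self, vlm, tour_frames):
        self.vlm = vlm
        # Expensive step, paid once: the tour is ~99.9% of the input tokens.
        self.tour_prefix = vlm.encode_prefix(images=tour_frames)

    def answer(self, instruction_text, instruction_image=None):
        # Cheap step, paid per query: only the short multimodal instruction
        # is processed from scratch.
        return self.vlm.generate_with_prefix(
            self.tour_prefix, text=instruction_text, image=instruction_image
        )
```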
Because Mobility VLA demands little onboard compute and requires only RGB camera observations, it can run on many robot embodiments, suggesting its potential for widespread deployment. This is a promising advance in the realm of artificial intelligence and robotics.