Large Language Models (LLMs) excel at processing text, whereas Vision-and-Language Navigation (VLN) tasks hinge on visual information. Combining the two requires careful alignment of textual and visual representations. Even so, a performance gap remains when LLMs are applied to VLN tasks compared with models designed specifically for navigation, largely because of difficulties in understanding spatial relationships and resolving ambiguous references from visual context.
A joint team from Adobe Research, the University of Adelaide, the Shanghai AI Laboratory, and the University of California has introduced a model called NavGPT-2 to tackle these challenges. The study observes that the linguistic interpretation capabilities of LLMs are often underused, even though they are crucial for generating navigational reasoning and for effective interaction during robotic navigation.
Existing approaches to using LLMs for VLN fall into two categories: zero-shot methods, which prompt the LLM with textual descriptions of the navigation environment, and fine-tuning methods, in which LLMs are trained on instruction-trajectory pairs. The former suffers from brittle prompt engineering and noise introduced by image captioning and summarization, while the latter struggles with limited training data and a mismatch between the objectives of LLM pretraining and those of VLN. NavGPT-2 aims to bridge LLM-based navigation and specialized VLN models by incorporating both an LLM and a navigation policy network.
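To make the zero-shot paradigm concrete, the following is a minimal sketch of how such a method might assemble a prompt: each candidate view is summarized as text and the LLM is asked to pick the next viewpoint. The helpers `caption_image` and `query_llm` are hypothetical stand-ins for a captioning model and an LLM API; they are not part of NavGPT-2.

```python
def build_navigation_prompt(instruction, candidate_views, caption_image):
    """Summarize each candidate view as text and assemble a single LLM prompt."""
    lines = [f"Instruction: {instruction}", "Candidate viewpoints:"]
    for view_id, image in candidate_views.items():
        # Textual description of the view, produced by a captioning model.
        lines.append(f"- {view_id}: {caption_image(image)}")
    lines.append("Which viewpoint should the agent move to next? Answer with its id.")
    return "\n".join(lines)


def choose_next_viewpoint(instruction, candidate_views, caption_image, query_llm):
    """Ask the LLM to select the next viewpoint purely from textual descriptions."""
    prompt = build_navigation_prompt(instruction, candidate_views, caption_image)
    return query_llm(prompt).strip()  # e.g. "view_3"
```

The weaknesses noted above are visible here: the agent's visual observations reach the LLM only through lossy captions, and the final decision depends heavily on how the prompt is worded.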
NavGPT-2 couples a large Vision-Language Model (VLM) with a navigation policy network to improve VLN capability. The VLM processes visual observations with a Q-Former, which extracts image tokens and feeds them to a frozen LLM for navigational reasoning; freezing the LLM preserves its language interpretation abilities while the rest of the system compensates for its weaknesses in spatial understanding. A topological graph-based navigation policy records the agent's trajectory and enables effective backtracking.
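The sketch below illustrates this kind of design in PyTorch: a Q-Former-style module compresses visual features into a fixed set of image tokens, a frozen LLM consumes them alongside the instruction embedding, and a graph-based policy head scores candidate viewpoints. All module names, sizes, and the pooling and scoring choices are illustrative assumptions rather than the paper's actual code, and the LLM is assumed to accept a Hugging Face-style `inputs_embeds` argument.

```python
import torch
import torch.nn as nn


class QFormerLite(nn.Module):
    """Learned queries cross-attend to visual features to produce image tokens."""
    def __init__(self, vis_dim=768, hidden_dim=768, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)

    def forward(self, vis_feats):                       # vis_feats: (B, N, vis_dim)
        kv = self.vis_proj(vis_feats)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, kv, kv)          # (B, num_queries, hidden_dim)
        return tokens


class TopoGraphPolicy(nn.Module):
    """Scores candidate nodes of the topological map against the fused agent state."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.node_encoder = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, state, node_feats):               # state: (B, H), node_feats: (B, K, H)
        fused = self.node_encoder(node_feats + state.unsqueeze(1))
        return self.score(fused).squeeze(-1)            # (B, K) logits over candidate nodes


class NavAgentSketch(nn.Module):
    """Illustrative composition: Q-Former tokens -> frozen LLM -> graph policy."""
    def __init__(self, llm, hidden_dim=768):
        super().__init__()
        self.qformer = QFormerLite(hidden_dim=hidden_dim)
        self.llm = llm
        for p in self.llm.parameters():
            p.requires_grad_(False)                      # keep the LLM frozen
        self.policy = TopoGraphPolicy(hidden_dim)

    def forward(self, vis_feats, instruction_emb, node_feats):
        image_tokens = self.qformer(vis_feats)                      # (B, Q, H)
        llm_in = torch.cat([image_tokens, instruction_emb], dim=1)  # prepend image tokens
        llm_hidden = self.llm(inputs_embeds=llm_in).last_hidden_state
        state = llm_hidden.mean(dim=1)                              # pooled latent state
        return self.policy(state, node_feats)                       # logits over graph nodes
```

Only the Q-Former and the policy head carry trainable parameters here, which reflects the stated goal of reusing the LLM's language understanding without retraining it for navigation.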
Evaluation on the R2R dataset shows that NavGPT-2 significantly outperforms prior LLM-based and zero-shot approaches, with higher success rates and better data efficiency. For instance, it surpasses NaviLLM and NavGPT and performs competitively against state-of-the-art VLN specialist models such as DUET.
In conclusion, NavGPT-2 effectively addresses the integration of LLMs into VLN tasks by combining language interpretation capabilities with specialized navigation policies. It excels at understanding and following complex language instructions, processing visual information, and planning efficient navigation routes. By tackling challenges such as grounding language in vision, handling ambiguous instructions, and adapting to dynamic environments, NavGPT-2 lays the groundwork for more robust and intelligent autonomous systems. For more information, check out the research paper on GitHub.