Researchers from MIT and the MIT-IBM Watson AI Lab have developed a language-based navigation method for robots. Instead of visual information, which traditionally demands substantial computational resources and carefully hand-crafted machine-learning models, the method relies on textual descriptions, simplifying robotic navigation. The approach converts a robot’s field of view into language by generating text captions that describe the environment from the robot’s perspective.
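The article does not name the captioning model the team used; the sketch below shows one way a camera frame could be turned into a caption, assuming an off-the-shelf captioner such as BLIP from the Hugging Face transformers library.

```python
# Minimal sketch: turn a robot's camera frame into a text caption.
# Assumes an off-the-shelf captioner (BLIP via Hugging Face transformers);
# the article does not specify which captioning model the researchers used.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_observation(frame_path: str) -> str:
    """Describe a single camera frame in natural language."""
    image = Image.open(frame_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# e.g. "a hallway with a wooden door on the left and a window at the end"
print(caption_observation("frame.png"))
```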
These captions are fed to a large language model, which predicts the actions the robot should take to carry out the user’s instructions. Because the pipeline operates entirely on text, the researchers can also synthesize enormous volumes of artificial training data quickly. The method likewise narrows the gap between agents trained in simulated environments and those operating in real-world settings, a gap typically caused by computer-generated imagery differing markedly from real scenes.
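Because every part of the pipeline is text, training examples can be produced without rendering any images. The snippet below is a hypothetical illustration of that idea: the rooms, objects, instructions, and action labels are invented placeholders, not the researchers’ actual data.

```python
import random

# Hypothetical vocabulary for synthesizing text-only training examples.
# None of these templates come from the research; they only illustrate how
# purely textual data can be generated at scale without a renderer.
ROOMS = ["kitchen", "hallway", "office", "living room"]
OBJECTS = ["a red chair", "a wooden table", "an open door", "a bookshelf"]
ACTIONS = ["move forward", "turn left", "turn right", "stop"]

def synthesize_example(rng: random.Random) -> dict:
    room = rng.choice(ROOMS)
    obj = rng.choice(OBJECTS)
    # In practice the action label would come from an expert trajectory or a
    # planner, not a random draw; random choice is only a stand-in here.
    action = rng.choice(ACTIONS)
    return {
        "observation": f"You are in a {room}. You see {obj} ahead.",
        "instruction": f"Walk toward {obj} and stop next to it.",
        "action": action,
    }

rng = random.Random(0)
dataset = [synthesize_example(rng) for _ in range(10_000)]
print(dataset[0])
```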
In testing, the technique did not outperform vision-based models, but it worked better in certain situations where visual data was sparse. Combining the language-based approach with vision-based strategies improved navigation performance overall.
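The article does not describe how the two approaches were combined; one simple possibility, sketched below, is to average the action probabilities produced by a language-based policy and a vision-based policy. This blending scheme is an assumption for illustration only.

```python
import numpy as np

ACTIONS = ["move forward", "turn left", "turn right", "stop"]

def combine_policies(p_language: np.ndarray, p_vision: np.ndarray,
                     weight: float = 0.5) -> str:
    """Blend two action distributions and pick the most likely action.

    An assumed combination scheme; the article only states that mixing
    the language-based and vision-based strategies improved navigation.
    """
    blended = weight * p_language + (1.0 - weight) * p_vision
    return ACTIONS[int(np.argmax(blended))]

# Example: the two policies disagree, and the blend settles the choice.
p_lang = np.array([0.10, 0.60, 0.20, 0.10])  # language policy favors "turn left"
p_vis = np.array([0.55, 0.25, 0.10, 0.10])   # vision policy favors "move forward"
print(combine_policies(p_lang, p_vis))        # -> "turn left"
```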
One notable drawback is that language naturally loses information that visual representations capture, such as depth. However, the researchers were intrigued to find that language can capture higher-level details that pure visual features cannot.
Continuing research will explore whether navigational information can be integrated with a large language model to improve navigation, and whether these models can exhibit spatial awareness. The team also plans to develop a navigation-specific captioner that could boost the method’s overall effectiveness. The MIT-IBM Watson AI Lab primarily supported the research.
The research team framed robotic navigation as a language task. Their language model takes in simple captions of the robot’s observations, along with language instructions, and determines the next navigational step. Once a decision is made, the model generates a caption of the scene the robot should see after completing that step, which is used to update the robot’s trajectory history so it can keep track of its progress.
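A rough sketch of that loop follows. The `query_llm` helper is hypothetical, standing in for whatever language model API is used, and the prompt wording is illustrative rather than the exact format from the research.

```python
from typing import List

def query_llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to a language model and
    returns its text completion; any LLM backend could be plugged in here."""
    raise NotImplementedError

def navigation_step(instruction: str, history: List[str],
                    observation_caption: str) -> str:
    """One step of navigation treated purely as a language task."""
    prompt = (
        f"Instruction: {instruction}\n"
        "Trajectory so far:\n" + "\n".join(f"- {h}" for h in history) + "\n"
        f"Current view: {observation_caption}\n"
        "What is the next navigational step?"
    )
    action = query_llm(prompt)

    # The model also predicts what the robot should see after taking the step;
    # that predicted caption is appended to the trajectory history.
    predicted_view = query_llm(
        prompt + f"\nChosen step: {action}\nDescribe the expected next view."
    )
    history.append(predicted_view)
    return action
```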
As part of their templated approach, the team ensured the model receives data in a uniform format, presented as a series of choices the robot can make based on its surroundings. This format also makes the information easier for a human to understand and speeds up troubleshooting when the robot fails to reach its goal. Moreover, the method can be adapted to different tasks and environments because it relies on a single input type: language.
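The exact template is not given in the article. The sketch below assumes a uniform format in which the model is shown a fixed list of candidate actions and asked to answer with a single letter, which keeps both the input and the model’s answer easy for a person to read and debug.

```python
from typing import List

CANDIDATE_ACTIONS = ["move forward", "turn left", "turn right", "stop"]

def build_prompt(instruction: str, observation_caption: str,
                 choices: List[str] = CANDIDATE_ACTIONS) -> str:
    """Render the robot's situation in one uniform, human-readable format.

    The wording is an assumed example; the article only says the input is
    standardized and presented as a set of choices.
    """
    lettered = "\n".join(f"({chr(ord('a') + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Instruction: {instruction}\n"
        f"Current view: {observation_caption}\n"
        f"Possible next steps:\n{lettered}\n"
        "Answer with the letter of the best step."
    )

def parse_choice(answer: str, choices: List[str] = CANDIDATE_ACTIONS) -> str:
    """Map the model's one-letter answer back to an action string."""
    index = ord(answer.strip().lower()[0]) - ord("a")
    return choices[index]

print(build_prompt("Go to the door at the end of the hallway",
                   "a long hallway with a door at the far end"))
```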