Large language models (LLMs) have made substantial progress in understanding language by absorbing vast amounts of text. However, while they excel at recalling knowledge and producing insightful responses, they struggle to comprehend the physical world in real time. Embodied AI, integrated into devices like smart glasses or home robots, aims to interact with humans in everyday language while understanding its surroundings as they change. This is the goal Meta AI is pursuing, and it presents a significant research challenge.
Embodied Question Answering (EQA), a method for testing an AI’s comprehension of its environment, has practical implications beyond research. In its most basic form, EQA can simplify daily life, for instance by helping find misplaced items. Still, consistent with Moravec’s paradox, even advanced models fall well short of human performance on EQA: tasks that feel effortless to people, such as perceiving and reasoning about a physical space, remain hard for machines.
As an initiative in this direction, Meta introduced the Open-Vocabulary Embodied Question Answering (OpenEQA) framework. It evaluates an AI’s understanding of its environment through open-vocabulary questions, much as one might gauge a person’s comprehension by the questions they can answer.
The OpenEQA framework has two parts. The first, Episodic Memory EQA (EM-EQA), requires the AI to recall prior observations to answer questions. The second, Active EQA (A-EQA), requires the AI to actively explore its surroundings to gather the information needed to answer.
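To make the distinction concrete, here is a minimal sketch, in Python, of how the two settings differ from an agent's point of view. The class and field names are hypothetical illustrations, not the official OpenEQA API.

```python
# A minimal sketch (hypothetical names, not the official OpenEQA API) of how
# the two task settings differ from an agent's point of view.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Frame:
    """One time-stamped observation from an episode (placeholder)."""
    image_path: str
    timestamp: float


@dataclass
class Question:
    text: str     # e.g. "Where did I leave my keys?"
    answer: str   # human-annotated ground-truth answer


class EpisodicMemoryAgent(Protocol):
    """EM-EQA: the agent only sees a pre-recorded history of observations."""
    def answer(self, history: List[Frame], question: Question) -> str: ...


class ActiveAgent(Protocol):
    """A-EQA: the agent can act in the environment to gather new evidence."""
    def act(self, observation: Frame) -> str: ...      # returns the next action
    def answer(self, question: Question) -> str: ...   # answers once it has explored enough
```

In EM-EQA the evidence is fixed in advance, so the challenge is retrieval and reasoning over memory; in A-EQA the agent must also decide where to look, which adds exploration to the problem.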
OpenEQA includes more than 180 videos and scans of physical environments, along with more than 1,600 authentic question-and-answer pairs provided by human annotators. It is also paired with LLM-Match, an automatic criterion for evaluating open-vocabulary answers, shown to agree with human judgments about as closely as two human annotators agree with each other.
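Because open-vocabulary answers cannot be checked by exact string matching, LLM-Match scores them with a language-model judge. The sketch below illustrates the general idea under stated assumptions: a 1-to-5 judge scale normalized to a 0-100 score, with `call_llm` standing in for whatever LLM client is available. The exact prompt and aggregation are defined in the OpenEQA paper.

```python
# Illustrative sketch of an LLM-as-judge metric in the spirit of LLM-Match.
# The prompt wording, 1-5 scale, and aggregation below are assumptions for
# illustration; see the OpenEQA paper for the actual protocol.
from typing import Callable, List


def judge_answer(call_llm: Callable[[str], str],
                 question: str, reference: str, candidate: str) -> int:
    """Ask an LLM judge to rate the candidate answer on a 1-5 scale."""
    prompt = (
        "Rate how well the candidate answer matches the reference answer, "
        "on a scale from 1 (wrong) to 5 (equivalent). Reply with one digit.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}"
    )
    return int(call_llm(prompt).strip())


def aggregate_score(judge_scores: List[int]) -> float:
    """Map 1-5 judge scores to a 0-100 benchmark score (assumed normalization)."""
    return 100.0 * sum((s - 1) / 4 for s in judge_scores) / len(judge_scores)
```

Using an LLM judge lets paraphrased but correct answers (e.g. "on the kitchen counter" vs. "next to the sink") receive credit, which a rigid string-match metric would miss.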
Evaluating various state-of-the-art vision+language foundation models (VLMs) with OpenEQA revealed a substantial performance gap: humans achieved 85.9%, whereas the most effective model, GPT-4V, achieved 48.5%. The disparity, especially on questions requiring spatial understanding, suggests that models need to make better use of visual information rather than relying on prior textual knowledge. It also indicates that embodied AI agents powered by these models need stronger perception and reasoning capabilities before they can be widely deployed.
OpenEQA combines the ability to respond in natural language with the ability to handle complex open-vocabulary queries, yielding a straightforward metric for evaluating environmental comprehension. The researchers hope that OpenEQA, the first open-vocabulary benchmark for EQA, will let others in the academic community track progress in scene understanding and multimodal learning.
The research was conducted by a team at Meta AI, and the findings are described in their paper, project page, and blog post. The researchers encourage following their work on social media to stay updated on future advancements.