Theory of Mind (ToM), the ability to comprehend the thoughts and intentions of others, is essential for building machines with human-like social intelligence. Recent advances in machine learning, particularly in large language models, have shown some capability in ToM understanding. However, existing ToM benchmarks rely mostly on video or text datasets alone. These overlook the comprehensive nature of human ToM, which involves flexible reasoning over conceptual representations drawn from diverse sources of information.
To address this limitation, researchers from MIT and Harvard introduced a Multimodal Theory of Mind Question Answering (MMToM-QA) benchmark. The benchmark evaluates machine ToM on both unimodal and multimodal data describing a person's activities in a household environment.
The researchers also proposed a new method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models), to improve multimodal ToM capability. BIP-ALM extracts unified representations from the multimodal data and uses language models to perform scalable Bayesian inverse planning. Their experiments showed that current large language models and multimodal models lack robust ToM capabilities, while BIP-ALM achieved promising results.
BIP-ALM was compared against several leading models for text-based or multimodal question answering, including GPT-4 and Video-LLaMA. Although these models perform well on other QA benchmarks, they made significant errors on MMToM-QA and fell well short of human performance. BIP-ALM instead fine-tunes a language model on synthetic human-activity data, such as household activities, and uses that model to estimate the likelihood of hypotheses about a person's beliefs and goals.
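The inverse-planning idea behind this hypothesis scoring can be illustrated with a toy example. The sketch below is not BIP-ALM itself: the goal names, actions, and probabilities are invented for illustration, and the hand-coded policy table stands in for the fine-tuned language model that BIP-ALM would use to score how likely each observed action is under a given goal hypothesis.

```python
# Minimal sketch of Bayesian inverse planning for goal inference.
# Hypothetical goals, actions, and probabilities; in BIP-ALM a fine-tuned
# language model would supply P(action | state, hypothesis) instead of
# this hand-coded table.

def policy_likelihood(action, hypothesis):
    """Toy stand-in for an LM-scored policy: probability of taking
    `action` if the person's goal is `hypothesis`."""
    table = {
        ("walk_to_kitchen", "get milk"): 0.7,
        ("walk_to_kitchen", "get book"): 0.3,
        ("open_fridge", "get milk"): 0.8,
        ("open_fridge", "get book"): 0.1,
    }
    return table.get((action, hypothesis), 0.05)

def infer_goal(observed_actions, hypotheses):
    """Bayes' rule with a uniform prior over goal hypotheses:
    P(goal | actions) ∝ P(goal) * Π_t P(action_t | goal)."""
    prior = 1.0 / len(hypotheses)
    scores = {}
    for h in hypotheses:
        likelihood = prior
        for a in observed_actions:
            likelihood *= policy_likelihood(a, h)
        scores[h] = likelihood
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}  # normalized posterior

posterior = infer_goal(["walk_to_kitchen", "open_fridge"],
                       ["get milk", "get book"])
print(max(posterior, key=posterior.get))  # the most probable goal
```

The key design point is that the likelihood model is modular: any scorer of P(action | state, goal) can be plugged in, which is what lets BIP-ALM swap a language model into an otherwise classical Bayesian inverse-planning loop.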
BIP-ALM outperformed the other models, exposing the limitations of current state-of-the-art systems and demonstrating its effectiveness as a step toward human-level ToM reasoning. In conclusion, the research team introduced the first multimodal ToM benchmark and the BIP-ALM method, while systematically comparing a range of machine learning models against human ToM capabilities.