The increasing sophistication of artificial intelligence and large language models (LLMs) like GPT-4 and LLaMA2-70B has sparked interest in their potential to display a theory of mind. Researchers from the University Medical Center Hamburg-Eppendorf, the Italian Institute of Technology, Genoa, and the University of Trento are studying these models to assess their capabilities against human performance. Theory of mind is the capacity to ascribe mental states to oneself and others, a crucial aspect of human social interaction. As AI and LLMs progress, questions have arisen about whether they can comprehend and navigate social complexities as humans do.
The researchers take an investigative approach rooted in psychology to examine the LLMs' theory of mind abilities. They administer a series of well-established theory of mind procedures, including the hinting task and the false belief task, and measure the understanding of irony and the recognition of faux pas (social gaffes). These tasks range from basic false-belief understanding to the more nuanced interpretation of social situations. The LLMs (GPT-4, GPT-3.5, and LLaMA2-70B) are given each test multiple times to allow a robust comparison with human performance. The researchers also ensure that each task includes novel items, so that the LLMs must demonstrate genuine understanding rather than simply regurgitating training data.
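To make the protocol concrete, the sketch below shows one way such a battery could be administered programmatically: novel variants of a false-belief vignette are generated and each is put to the model several times. It is a minimal illustration only, not the authors' actual materials; the `query_model` callable, the vignette template, and the repetition count are hypothetical placeholders.

```python
# Minimal sketch of a repeated-administration protocol for a false-belief task.
# query_model(), the vignette wording, and the repetition count are hypothetical
# stand-ins, not the study's actual prompts or parameters.
import random
from typing import Callable, List

FALSE_BELIEF_TEMPLATE = (
    "{agent} puts the {item} in the {container_a} and leaves the room. "
    "While {agent} is away, someone else moves the {item} to the {container_b}. "
    "When {agent} returns, where will {agent} look for the {item}?"
)

def make_items(n: int) -> List[str]:
    """Generate novel variants of the vignette so the model cannot simply
    reproduce memorised training examples."""
    agents = ["Sally", "Tom", "Priya", "Luca"]
    items = ["marble", "key", "chocolate bar", "notebook"]
    containers = ["basket", "box", "drawer", "bag"]
    prompts = []
    for _ in range(n):
        container_a, container_b = random.sample(containers, 2)
        prompts.append(FALSE_BELIEF_TEMPLATE.format(
            agent=random.choice(agents),
            item=random.choice(items),
            container_a=container_a,
            container_b=container_b,
        ))
    return prompts

def run_sessions(query_model: Callable[[str], str],
                 prompts: List[str],
                 repetitions: int = 10) -> List[List[str]]:
    """Administer every prompt several times, each as a fresh query,
    and collect the raw responses for later scoring.
    The repetition count here is illustrative, not the study's figure."""
    return [[query_model(p) for _ in range(repetitions)] for p in prompts]

# Example usage with a dummy model in place of a real LLM API call:
responses = run_sessions(lambda p: "the basket", make_items(3), repetitions=2)
```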
The researchers administer the tests to both LLMs and human participants in written form to enable a fair comparison. They analyze the responses with test-specific scoring protocols, comparing performance across models and humans. Notably, GPT-4 excels at both irony comprehension and hinting, often even surpassing human performance. However, it struggles in scenarios involving uncertainty, such as the faux pas test. According to the study, the GPT models adopt a cautious approach, shaped by mitigation measures intended to reduce hallucinations and improve factual accuracy. Their lack of embodied decision-making processes also affects how they handle social uncertainty.
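As a rough illustration of test-specific scoring and model-versus-human comparison, the following sketch scores false-belief responses with a simple keyword rule and aggregates accuracy per model against a human baseline. The scoring rule and all numbers are invented for the example and are not the study's actual protocol or results.

```python
# Illustrative scoring and comparison; the correct-answer check and the
# accuracy figures below are placeholders, not data from the study.
from statistics import mean
from typing import Dict, List

def score_false_belief(response: str, original_location: str) -> int:
    """1 point if the answer refers to where the agent *believes* the item is
    (its original location), 0 otherwise."""
    return int(original_location.lower() in response.lower())

def aggregate(scores_per_item: List[List[int]]) -> float:
    """Average accuracy across items and repetitions for one model."""
    return mean(mean(item_scores) for item_scores in scores_per_item)

def compare(models: Dict[str, float], human_baseline: float) -> None:
    """Print each model's mean accuracy alongside the human baseline."""
    for name, accuracy in sorted(models.items(), key=lambda kv: -kv[1]):
        delta = accuracy - human_baseline
        print(f"{name:12s} accuracy={accuracy:.2f} ({delta:+.2f} vs. humans)")

# Example usage with made-up numbers:
compare({"GPT-4": 0.95, "GPT-3.5": 0.82, "LLaMA2-70B": 0.71}, human_baseline=0.90)
```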
The study emphasizes the complexity of assessing LLMs' theory of mind competencies and the need for systematic testing to enable accurate comparison with human cognition. While LLMs like GPT-4 have shown considerable progress on certain theory of mind tasks, they fall short in situations involving uncertainty, revealing a cautious epistemic policy that is possibly linked to their training methodologies. Understanding these differences can guide the development of LLMs that navigate social interactions with human-like proficiency.