Large language models (LLMs), such as GPT-3, are powerful tools because of their versatility: they can handle tasks ranging from helping draft emails to assisting in cancer diagnosis. However, that same breadth makes them difficult to evaluate systematically, since no benchmark dataset could test a model on every type of question it might be asked.
Researchers from MIT have tackled this issue in a new study. They argue that because humans decide when and how to use these models, evaluating them requires understanding how people form beliefs about their capabilities. This led the team to develop a new framework for assessing LLMs based on how well they align with a human's beliefs about how they will perform on a given task.
The researchers introduced a concept called the human generalization function: a model of how people update their beliefs about an LLM's capabilities after interacting with it. They then assessed how well LLMs align with this function and found that, when a model is misaligned with it, users may overestimate or underestimate the model's capabilities and deploy it in situations where it unexpectedly fails. Because of this misalignment, a more capable model can end up performing worse than a smaller one in high-stakes settings.
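One way to make the idea concrete (a rough sketch under invented assumptions, not the authors' actual formulation): treat the human generalization function as a mapping from the answers a person has seen an LLM give to a predicted probability that it answers a related question correctly, and measure misalignment as the gap between that prediction and the model's true accuracy on the related question.

```python
# Toy sketch of a "human generalization function": after watching an LLM
# answer some questions, a person predicts how it will do on a related one.
# The names, numbers, and update rule here are illustrative assumptions,
# not the paper's actual model.

def human_belief(observations, prior=0.5, weight=0.3):
    """Predicted probability the LLM answers a related question correctly:
    a prior nudged toward the success rate observed so far."""
    if not observations:
        return prior
    observed_rate = sum(observations) / len(observations)
    return (1 - weight) * prior + weight * observed_rate

def misalignment(observations, true_accuracy, **kwargs):
    """Gap between what the person expects and how the model actually performs."""
    return abs(human_belief(observations, **kwargs) - true_accuracy)

# Example: the LLM got 3 of 4 questions right in one topic, but its true
# accuracy on the related topic is only 0.55.
seen = [1, 1, 0, 1]                              # 1 = correct, 0 = incorrect
print(misalignment(seen, true_accuracy=0.55))    # gap between belief and reality
```

In this toy version, a large gap means the person's decision to rely on the model rests on a belief the model will not live up to (or will exceed).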
To obtain these results, the researchers ran a survey of how people generalize when interacting with LLMs. Participants were shown questions that a person or an LLM had answered correctly or incorrectly, and were then asked whether they thought that same person or LLM would answer a related question correctly.
The results showed that people are much worse at generalizing about the performance of LLMs than about the performance of other humans. They also tend to update their beliefs more when an LLM answers a question incorrectly than when it answers correctly. In settings where users put more weight on incorrect responses, simpler models outperformed very large models such as GPT-4.
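As a toy illustration of that asymmetry (the step sizes and answer sequence below are invented, and this is not the study's model), the sketch updates a person's belief after each answer, penalizing mistakes more heavily than it rewards correct answers:

```python
# Toy illustration of asymmetric belief updating: the same sequence of
# answers leaves a person with a lower estimate of the model when mistakes
# are weighted more heavily than successes. The step sizes and answer
# sequence are invented for illustration; this is not the study's model.

def updated_belief(answers, prior=0.5, up=0.05, down=0.05):
    """Return the belief after a sequence of answers (1 = correct, 0 = wrong)."""
    belief = prior
    for correct in answers:
        belief += up if correct else -down
        belief = min(max(belief, 0.0), 1.0)   # keep the belief in [0, 1]
    return belief

answers = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]      # 80% correct overall

print(updated_belief(answers, up=0.05, down=0.05))  # symmetric update  -> ~0.80
print(updated_belief(answers, up=0.05, down=0.15))  # errors weigh more -> ~0.60
```

Under the error-heavy update, the final belief sits well below the model's actual 80 percent accuracy; in a setting where people's beliefs drive which questions a model gets used for, that kind of underestimation is what can let a smaller, more predictable model come out ahead.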
The researchers suggest that people's difficulty in generalizing about LLMs may stem from the models' novelty, and that with more exposure to these systems, people may get better at it.
The MIT team hopes the findings can serve as a benchmark for comparing how well LLMs align with the human generalization function, which could improve how these models perform when deployed in the real world.
Prof. Alex Imas of the University of Chicago's Booth School of Business comments that the paper reveals a crucial issue with deploying LLMs for general consumer use: if people do not understand when LLMs will be accurate and when they will not, they are likely to see mistakes and be discouraged from further use. This highlights the importance of aligning models with people's understanding of generalization. He adds that the paper's second, more fundamental contribution is that it provides a test of whether LLMs 'understand' the problem they are solving. The research was funded in part by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.
The research will be presented at the International Conference on Machine Learning. The team now plans further studies on how human generalization could be incorporated into the development of LLMs, and aims to examine how people's beliefs about LLMs evolve as they interact with these models over time.