MIT and Harvard researchers have highlighted the divergence between human expectations of AI system capabilities and the systems' actual performance, particularly for large language models (LLMs). When AI fails to match human expectations, public trust can erode, hindering broad adoption of the technology. The researchers emphasized that this mismatch is especially risky in high-stakes areas such as self-driving cars and healthcare diagnoses.
Evaluating LLMs remains challenging because they are applied to such a wide range of tasks, from drafting emails to assisting with medical diagnoses. Building a comprehensive benchmark dataset that tests every question a model might be asked is infeasible. The key challenge is understanding how humans develop beliefs about an LLM's capabilities, and how those beliefs shape decisions about whether to deploy the model for a particular task.
Contemporary methods of assessing LLMs rely on benchmarking their performance across a wide variety of tasks. These methods, however, fail to capture the human element that influences deployment decisions. As a solution, the researchers proposed a new framework that evaluates LLMs based on how well they align with human beliefs about their performance capabilities. The framework centers on a human generalization function, which describes how people update their beliefs about an LLM's capabilities after interacting with it; any misalignment between those beliefs and the model's actual performance can lead to overconfidence or underconfidence in deploying the model.
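A minimal sketch can make this misalignment concrete. The code below is purely illustrative and not the researchers' implementation: it assumes a person deploys a model on a question whenever their belief that it will answer correctly exceeds a threshold, and then tallies where that belief-driven decision was overconfident (the model was trusted but failed) or underconfident (the model was distrusted but would have succeeded). All names and numbers are hypothetical.

```python
# Illustrative sketch of belief-driven deployment (hypothetical, not the paper's code).

def deployment_outcomes(human_belief, model_correct, threshold=0.5):
    """Classify each question by how a belief-driven deployment decision plays out.

    human_belief  -- probabilities a person assigns to the model answering each
                     question correctly (formed by generalizing from past interactions)
    model_correct -- booleans: whether the model actually answers correctly
    threshold     -- belief level above which the person relies on the model
    """
    outcomes = {"justified_deploy": 0, "overconfident": 0,
                "underconfident": 0, "justified_skip": 0}
    for belief, correct in zip(human_belief, model_correct):
        deployed = belief >= threshold
        if deployed and correct:
            outcomes["justified_deploy"] += 1
        elif deployed and not correct:       # trusted the model, but it failed
            outcomes["overconfident"] += 1
        elif not deployed and correct:       # distrusted a model that would have succeeded
            outcomes["underconfident"] += 1
        else:
            outcomes["justified_skip"] += 1
    return outcomes


# Hypothetical beliefs and outcomes for five questions.
beliefs = [0.9, 0.8, 0.7, 0.3, 0.2]
actual  = [True, False, True, True, False]
print(deployment_outcomes(beliefs, actual))
# {'justified_deploy': 2, 'overconfident': 1, 'underconfident': 1, 'justified_skip': 1}
```

Under the framework's logic, the overconfident and underconfident cases are exactly where misalignment between the human generalization function and the model's behavior does damage.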
The human generalization function captures how people form beliefs about an individual's or an LLM's capabilities from its responses to particular questions. To measure this generalization, the researchers designed a survey in which participants were shown questions that a person or an LLM had answered correctly or incorrectly, and were then asked to predict whether that person or LLM would answer a related question correctly. The survey yielded nearly 19,000 examples across 79 tasks, revealing how humans generalize about LLM performance. Notably, simpler models occasionally outperformed more capable ones such as GPT-4 in scenarios where incorrect responses were given greater weight.
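The following sketch illustrates how such a weighting can flip the ranking. It is a hypothetical scoring rule, not the survey's actual methodology: it scores a model only on questions people would trust it with, rewards correct answers, and penalizes wrong answers more heavily. If the larger model's errors fall on questions where people confidently rely on it, while the smaller model's errors fall where people already distrust it, the smaller model can come out ahead.

```python
# Hypothetical scoring rule showing why a more capable model can score worse
# once errors on trusted questions are penalized more heavily (illustrative only).

def belief_weighted_score(human_belief, model_correct, error_weight=3.0):
    """Score a model only on questions people would deploy it on (belief >= 0.5),
    adding 1 for a correct answer and subtracting error_weight for a wrong one."""
    score = 0.0
    for belief, correct in zip(human_belief, model_correct):
        if belief >= 0.5:                    # the human chooses to rely on the model
            score += 1.0 if correct else -error_weight
    return score


# Hypothetical scenario: people generalize more confidently about the larger
# model, so its mistakes land on questions where it is actually trusted.
beliefs_large = [0.9, 0.9, 0.8, 0.8, 0.7]
correct_large = [True, True, True, False, False]   # 3/5 correct, errors where trusted

beliefs_small = [0.9, 0.8, 0.4, 0.3, 0.2]
correct_small = [True, True, False, False, False]  # 2/5 correct, errors where distrusted

print(belief_weighted_score(beliefs_large, correct_large))  # 3*1 - 2*3 = -3.0
print(belief_weighted_score(beliefs_small, correct_small))  # 2*1       =  2.0
```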
In conclusion, the research focused on the disparity between human expectations and LLM capabilities, a gap that can lead to failures in high-stakes situations. The human generalization function offers a fresh framework for measuring this alignment and underscores the importance of understanding human generalization and integrating it into how LLMs are developed and evaluated. By incorporating human factors into the deployment of general-purpose LLMs, the proposed framework aims to improve real-world performance and user trust.
This research underscores the influence of human beliefs on AI model performance, establishing a link between the two that can guide how an LLM is deployed. It offers valuable insight into how accounting for human beliefs can support more effective use, and better performance, of AI systems.