Research teams from the University of Cambridge, the University of Oxford, and the Massachusetts Institute of Technology have developed a dynamic evaluation method called CheckMate. The aim is to improve the evaluation of Large Language Models (LLMs) like GPT-4 and ChatGPT, especially when they are used as problem-solving tools. These models generate fluent text, but standard evaluation methods, which rely on static pairs of inputs and outputs, do not reflect how the models perform in real-world human-machine interactions.
CheckMate aims to build a more comprehensive picture of LLM capabilities by having humans interact with the models directly while solving problems. This matters particularly in areas such as mathematics, where correctness is pivotal. To this end, CheckMate combines two main evaluation approaches: structured multistep interactive ratings and free-form instance-based evaluation.
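To make the two approaches concrete, the sketch below shows one way such rating data might be structured. It is a minimal illustration only: the class names, fields, and 1-to-5 scales are assumptions for this article, not the schema CheckMate actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RatedStep:
    """One turn of a structured multistep interaction: the user's prompt,
    the model's reply, and the evaluator's ratings of that reply."""
    user_prompt: str
    model_response: str
    correctness: int           # assumed ordinal scale, e.g. 1 (wrong) to 5 (fully correct)
    perceived_usefulness: int  # assumed ordinal scale, e.g. 1 (not useful) to 5 (very useful)


@dataclass
class InteractionTrace:
    """A full problem-solving session with one model, plus the free-form
    instance-based evaluation given after the interaction ends."""
    problem_id: str
    model_name: str                             # e.g. "gpt-4" or "chatgpt"
    steps: List[RatedStep] = field(default_factory=list)
    free_form_evaluation: Optional[str] = None  # open-ended comments on the whole exchange
```

Under this sketch, an evaluator would append one `RatedStep` per model reply during the interaction and attach their open-ended comments once the session ends.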
The platform captures instances of interaction between humans and LLMs, recording the correctness of the generated responses and their perceived usefulness. This interaction data is collected through a mixed-cohort study whose participants range from undergraduate students to experienced mathematics professors, giving insight into how users at different levels of expertise behave when using LLMs for problem-solving. CheckMate also includes case studies, carried out by domain experts, that highlight the strengths and weaknesses of LLMs in mathematical reasoning.
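As a rough illustration of how such interaction data could be sliced across a mixed cohort, the sketch below averages correctness and usefulness ratings per participant group. The cohort labels and the rating values are made-up placeholders to show the shape of the analysis, not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-step records as a platform like this might export them:
# each entry pairs a participant cohort with the two ratings given to one model response.
records = [
    {"cohort": "undergraduate", "correctness": 4, "usefulness": 3},
    {"cohort": "undergraduate", "correctness": 2, "usefulness": 4},
    {"cohort": "professor",     "correctness": 5, "usefulness": 2},
    {"cohort": "professor",     "correctness": 3, "usefulness": 3},
]


def summarise_by_cohort(rows):
    """Average correctness and perceived-usefulness ratings per cohort,
    the kind of cut used to compare how different expertise levels judge a model."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cohort"]].append((row["correctness"], row["usefulness"]))
    return {
        cohort: {
            "mean_correctness": mean(c for c, _ in pairs),
            "mean_usefulness": mean(u for _, u in pairs),
            "n_ratings": len(pairs),
        }
        for cohort, pairs in grouped.items()
    }


print(summarise_by_cohort(records))
```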
The data and insights from CheckMate’s evaluations and case studies feed into a taxonomy of user behavior patterns with LLMs that could be useful to machine learning practitioners and mathematicians alike. Together, the case studies and the rating data yield actionable guidance for those working in machine learning.
The researchers conclude that dynamic evaluation methods such as CheckMate can help develop more effective LLMs for problem-solving tasks. Direct interaction and feedback from users give a more holistic picture of an LLM's performance, especially in correctness-critical domains like mathematics. The study also underscores the importance of machine learning practitioners collaborating with domain experts to better calibrate uncertainty communication, reasoning, and conciseness in model responses. Ultimately, this could inform the further development and deployment of these models as problem-solving assistants.