The complexity of mathematical reasoning in large language models (LLMs) often exceeds the capabilities of existing evaluation methods. Mathematical reasoning is central to problem-solving and decision-making, and therefore to progress in artificial intelligence (AI). Yet the primary method of evaluation, comparing the model's final answer to a ground truth and computing overall accuracy, often overlooks important elements such as logical errors and inefficient or redundant steps. Relying on a single reference answer is also problematic, since diverse reasoning paths can lead to the same result.
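To make the limitation concrete, the standard pipeline reduces evaluation to an exact-match comparison of final answers. The snippet below is a minimal sketch of that baseline (the function and variable names are illustrative, not taken from any particular benchmark harness); it gives full credit to a solution whose intermediate steps are flawed, as long as the final answer happens to match.

```python
def final_answer_accuracy(predictions, references):
    """Standard final-answer evaluation: exact match against a single ground truth.

    `predictions` and `references` are parallel lists of answer strings.
    The reasoning steps are ignored entirely, so a solution with logical
    errors or redundant detours still counts as correct whenever its
    final answer matches the reference.
    """
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# A solution with a flawed derivation but a lucky final answer scores 1.0,
# exactly the same as a fully valid solution.
print(final_answer_accuracy(["42"], ["42"]))  # -> 1.0
```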
To address these challenges, a team of researchers from institutions including Shanghai Jiao Tong University, Yale University, and Carnegie Mellon University has introduced REASONEVAL. Rather than judging reasoning quality by the accuracy of the final answer alone, this approach uses validity and redundancy metrics to characterize the quality of each reasoning step.
Unlike most existing methods, REASONEVAL does not focus solely on the end result. Instead, it assesses multi-step reasoning tasks by analyzing each step of the problem-solving process. The method categorizes steps with positive, neutral, or negative labels and computes step-level validity and redundancy scores, which are then aggregated into solution-level scores. REASONEVAL is instantiated with a variety of LLMs of different sizes and training strategies, using training data from the PRM800K dataset, a collection of labeled, step-by-step solutions compiled by human annotators.
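As an illustration of how such step-level judgments might be turned into solution-level scores, the sketch below assumes each step receives a probability distribution over the positive/neutral/negative labels from a trained classifier. The specific score definitions and the min/max aggregation are plausible choices shown for clarity, not necessarily the exact formulas used by REASONEVAL.

```python
from dataclasses import dataclass


@dataclass
class StepJudgment:
    """Label probabilities for one reasoning step (assumed classifier output)."""
    p_positive: float  # step is correct and moves the solution forward
    p_neutral: float   # step is correct but redundant / unnecessary
    p_negative: float  # step contains an error


def step_scores(step: StepJudgment) -> tuple[float, float]:
    """Illustrative step-level scores.

    Validity: probability that the step is not erroneous.
    Redundancy: probability that the step is unnecessary.
    """
    validity = step.p_positive + step.p_neutral
    redundancy = step.p_neutral
    return validity, redundancy


def solution_scores(steps: list[StepJudgment]) -> tuple[float, float]:
    """Aggregate step-level scores into solution-level scores.

    One invalid step breaks the chain, so validity is aggregated with min;
    a single redundant step is enough to flag inefficiency, so redundancy
    is aggregated with max. (Other aggregations, e.g. averaging, are possible.)
    """
    validities, redundancies = zip(*(step_scores(s) for s in steps))
    return min(validities), max(redundancies)


# Example: three steps, the second of which is likely redundant.
solution = [
    StepJudgment(0.90, 0.05, 0.05),
    StepJudgment(0.20, 0.75, 0.05),
    StepJudgment(0.85, 0.10, 0.05),
]
print(solution_scores(solution))  # -> approximately (0.95, 0.75)
```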
REASONEVAL has demonstrated state-of-the-art performance in assessing the quality of reasoning steps in terms of correctness and efficiency. Unlike existing methods, it exposes discrepancies between final-answer accuracy and the quality of the reasoning used to reach that answer. It also proves highly effective for selecting training data.
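The summary does not spell out the selection criterion, but a natural reading is to keep candidate solutions that score high on validity and low on redundancy before using them for fine-tuning. The snippet below is a minimal sketch of that idea; the thresholds and the tuple layout are made-up placeholders.

```python
def select_training_solutions(scored_solutions, min_validity=0.9, max_redundancy=0.3):
    """Keep only candidate solutions whose reasoning is judged valid and concise.

    `scored_solutions` is a list of (solution_text, validity, redundancy) tuples,
    e.g. produced by scoring model-generated solutions with a step-level
    evaluator. The thresholds are illustrative, not values from the paper.
    """
    return [
        text
        for text, validity, redundancy in scored_solutions
        if validity >= min_validity and redundancy <= max_redundancy
    ]


candidates = [
    ("solution A ...", 0.97, 0.10),  # valid and concise -> kept
    ("solution B ...", 0.55, 0.15),  # likely contains an error -> dropped
    ("solution C ...", 0.95, 0.80),  # valid but rambling -> dropped
]
print(select_training_solutions(candidates))  # -> ['solution A ...']
```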
In operation, REASONEVAL accurately detects different types of errors introduced by controlled perturbations of solutions, and it clearly distinguishes errors that harm validity from those that merely add redundancy. Notably, the authors found that improvements in final-answer accuracy do not always correspond to improvements in the quality of the reasoning steps on complex mathematical problems.
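One way to picture this kind of check is to perturb a correct solution in controlled ways, for example corrupting one step to introduce a logical error or duplicating a step to introduce redundancy, and then confirm that the two scores move in the expected directions. The sketch below reuses the hypothetical `StepJudgment`, `solution_scores`, and `solution` definitions from the earlier example; because no real classifier is involved here, the perturbation is simulated directly on the label probabilities such a classifier would plausibly produce, so this illustrates the idea rather than the paper's actual perturbation protocol.

```python
import copy


def corrupt_step(steps, index):
    """Simulate a logical error: shift the step's probability mass to the negative label."""
    perturbed = copy.deepcopy(steps)
    perturbed[index] = StepJudgment(p_positive=0.05, p_neutral=0.05, p_negative=0.90)
    return perturbed


def duplicate_step(steps, index):
    """Simulate a repeated step, which should read as redundant (neutral)."""
    perturbed = copy.deepcopy(steps)
    perturbed.insert(index + 1, StepJudgment(p_positive=0.10, p_neutral=0.85, p_negative=0.05))
    return perturbed


# A logical error in the first step should lower the validity score while
# leaving redundancy unchanged; a duplicated step should raise redundancy
# while leaving validity unchanged.
print(solution_scores(corrupt_step(solution, 0)))    # validity drops sharply
print(solution_scores(duplicate_step(solution, 0)))  # redundancy rises
```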
In conclusion, the introduction of REASONEVAL marks a significant step forward in the assessment of mathematical reasoning in LLMs. Not only does it provide a more thorough analysis of each step of the reasoning process, but its metrics also expose distinct types of errors and prove effective for selecting data for subsequent model training. Together, these capabilities offer significant benefits for the further development of LLMs and AI.