Large language models (LLMs) such as GPT-3 have proven to be powerful tools for a wide range of problems, but their capacity for complex mathematical reasoning remains limited. This limitation stems in part from the scarcity of large math-related problem sets in the training data. As a result, instruction tuning, a technique designed to enhance LLM capabilities, is itself hindered: many models rely on instruction-tuning data generated with ChatGPT, which does improve mathematical performance, yet their capabilities remain constrained by the lack of large-scale math datasets.
In response to this challenge, researchers from The Chinese University of Hong Kong, Microsoft Research, and the Shenzhen Research Institute of Big Data have devised a novel approach, MathScale, to improve the scalability and quality of mathematical reasoning datasets. Central to MathScale’s approach is the extraction of high-level concepts from existing math questions, the construction of concept graphs to map relationships, and the generation of diverse new questions based on randomly sampled concepts. The team also introduced a comprehensive benchmark, MWPBENCH, to provide a fair evaluation of models’ mathematical reasoning capabilities.
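To make the first of those steps concrete, here is a minimal sketch of how high-level concepts could be extracted from a seed question with GPT-3.5 through the OpenAI chat API. The prompt wording and the `extract_concepts` helper are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of concept extraction with GPT-3.5 (not MathScale's exact prompt).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_concepts(question: str) -> dict:
    """Ask GPT-3.5 to name the topics and knowledge points behind a math question."""
    prompt = (
        "Extract the high-level topics and the fine-grained knowledge points "
        "needed to solve the following math question. "
        "Answer as JSON with keys 'topics' and 'knowledge_points'.\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # A production pipeline would validate the output; this sketch assumes well-formed JSON.
    return json.loads(response.choices[0].message.content)

# Example call:
# extract_concepts("A train travels 120 km in 2 hours. What is its average speed?")
# might return {"topics": ["rates"], "knowledge_points": ["speed = distance / time"]}
```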
MathScale generates its dataset in four systematic steps. It begins by using GPT-3.5 to extract high-level concepts from existing math questions, which lets new question sets be created without copying the original problems. It then builds a concept graph that encodes the relationships among these concepts. Using a random walk over the graph, the system samples diverse combinations of topics and knowledge points. Finally, MathScale generates new math questions conditioned on the sampled topics and knowledge points.
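The graph-building and sampling steps can be sketched as follows, under the simplifying assumption that two concepts are linked whenever they co-occur in the same seed question; the edge weighting and walk details of the actual MathScale pipeline may differ.

```python
# Minimal sketch: build a concept co-occurrence graph and sample concepts via random walk.
import random
import networkx as nx

def build_concept_graph(questions_concepts: list[list[str]]) -> nx.Graph:
    """Link every pair of concepts that appear together in one seed question."""
    graph = nx.Graph()
    for concepts in questions_concepts:
        for i, a in enumerate(concepts):
            for b in concepts[i + 1:]:
                graph.add_edge(a, b)  # co-occurrence edge between two concepts
    return graph

def random_walk(graph: nx.Graph, length: int = 4) -> list[str]:
    """Sample a short walk; the visited concepts condition the new question."""
    node = random.choice(list(graph.nodes))
    walk = [node]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:
            break
        node = random.choice(neighbors)
        walk.append(node)
    return walk

# Toy concept lists standing in for GPT-3.5 extraction output.
concept_lists = [
    ["speed = distance / time", "unit conversion"],
    ["unit conversion", "ratios"],
    ["ratios", "percentages"],
]
g = build_concept_graph(concept_lists)
print(random_walk(g))  # e.g. ['ratios', 'percentages'] -> fed back to GPT-3.5 as a prompt
```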
MathScale's effectiveness has been demonstrated in performance tests, where it was used to fine-tune open-source models such as LLaMA-2 7B, LLaMA-2 13B, and Mistral 7B. On MWPBENCH, MathScale-7B surpassed its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy. Furthermore, on out-of-domain test sets such as GaokaoBench-Math and AGIEval-SAT-MATH, MathScale-7B showed a clear advantage over other open-source models.
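For readers unfamiliar with the two aggregates: micro average accuracy pools every question across the benchmark's test sets into one ratio, while macro average accuracy averages the per-dataset accuracies. The small illustration below uses placeholder counts, not MWPBENCH results.

```python
# Illustration of micro vs. macro average accuracy over multiple test sets
# (the counts below are placeholders, not reported MWPBENCH numbers).
def micro_macro(results: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """results maps dataset name -> (correct, total)."""
    total_correct = sum(c for c, _ in results.values())
    total_questions = sum(t for _, t in results.values())
    micro = total_correct / total_questions                          # pool all questions
    macro = sum(c / t for c, t in results.values()) / len(results)   # average per-dataset accuracy
    return micro, macro

example = {"GSM8K": (80, 100), "MATH": (30, 100), "GaokaoBench-Math": (20, 50)}
print(micro_macro(example))  # -> (0.52, 0.5)
```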
In summary, the research from The Chinese University of Hong Kong, Microsoft Research, and the Shenzhen Research Institute of Big Data marks a promising advance in the mathematical reasoning capabilities of LLMs. MathScale, paired with the comprehensive MWPBENCH benchmark, addresses the limitations posed by the scarcity of mathematical problem sets in training data. By significantly improving LLM performance on math, this work is poised to push mathematical problem-solving forward within the field of AI.