Large Language Models (LLMs) have shown impressive competence across many tasks, from generating original content and answering questions to summarizing long texts, completing code, and translating languages. They are considered one of the most significant advancements in Artificial Intelligence (AI). It is generally assumed that for LLMs to have strong mathematical abilities, they need to be very large or undergo extensive pre-training on mathematical material. However, recent research with the LLaMA-2 7B model challenges this belief, demonstrating strong mathematical capabilities even with standard pre-training.
When the best response is chosen from 256 random generations, the LLaMA-2 7B model reaches an accuracy of 97.7% on the GSM8K benchmark and 72.0% on the MATH benchmark. However, while the base model can produce correct answers, it struggles to invoke its mathematical abilities consistently: accuracy drops to 49.5% and 7.9% on the same benchmarks when only the first response is considered.
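To make the gap between the two metrics concrete, here is a minimal Python sketch of how first-response accuracy and best-of-N accuracy can be computed from per-sample correctness flags. The function names and the toy data are illustrative assumptions, not the study's evaluation code.

```python
from typing import List


def first_response_accuracy(results: List[List[bool]]) -> float:
    """Fraction of problems whose first sampled answer is correct."""
    return sum(samples[0] for samples in results) / len(results)


def best_of_n_accuracy(results: List[List[bool]], n: int) -> float:
    """Fraction of problems with at least one correct answer in the first n samples."""
    return sum(any(samples[:n]) for samples in results) / len(results)


# Toy data (made up for illustration): 4 problems, 4 sampled answers each.
toy_results = [
    [True, False, True, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, False, False],
]

print(first_response_accuracy(toy_results))  # 0.25
print(best_of_n_accuracy(toy_results, n=4))  # 0.75
```

In this toy case the model can solve three of the four problems in some sample, but gets only one right on the first try, which mirrors the kind of instability described above.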
The research team proposed scaling up supervised fine-tuning (SFT) data to make correct responses more reliable, on the premise that more fine-tuning data significantly improves the accuracy of generated responses. However, the scarcity of publicly available math problems limits how far this approach can scale. To work around this, the team turned to synthetic data, which proved almost as effective as real data.
The team used the GPT-4 Turbo model to create synthetic math problems. They found that a basic generation approach followed by verification with GPT-4 Turbo was highly effective. The synthetic problems made large-scale supervised fine-tuning data possible, with accuracy nearly matching that obtained from real data.
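The generate-then-verify idea can be sketched as a simple filter loop. This is a minimal illustration under stated assumptions, not the team's pipeline: `build_synthetic_set`, `toy_generate`, and `toy_verify` are hypothetical names, and the two toy functions are offline stand-ins for what would be model calls in a real setup.

```python
from typing import Callable, List


def build_synthetic_set(
    seeds: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
) -> List[str]:
    """Generate one candidate problem per seed and keep those that pass verification."""
    kept = []
    for seed in seeds:
        candidate = generate(seed)
        if verify(candidate):
            kept.append(candidate)
    return kept


# Offline stand-ins (hypothetical): a real pipeline would query a model in both places.
def toy_generate(seed: str) -> str:
    return f"If Sam has {seed} apples and buys 3 more, how many apples does Sam have?"


def toy_verify(problem: str) -> bool:
    # Placeholder check: reject problems that contain a negative quantity.
    return "-" not in problem


problems = build_synthetic_set(["2", "-5", "7"], toy_generate, toy_verify)
print(len(problems))  # 2
```

Passing the generator and verifier in as callables keeps the loop independent of any particular model or API, so the same structure works whether the two steps share one model or use different ones.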
This straightforward methodology produced a significant improvement in accuracy. Using LLaMA-2 7B models, the team achieved 82.6% accuracy on GSM8K and 40.6% on MATH, exceeding previous models by 14.2 and 20.8 percentage points respectively.
The research also offers insights into scaling behavior across different error types and levels of reasoning difficulty. These insights clarify how errors can be reduced during scaling and how model performance changes as the volume of data grows.
Overall, the study indicates that language models can demonstrate excellent mathematical capabilities without very large models or math-intensive pre-training. Considerable progress in mathematical problem-solving can be achieved by using synthetic data and scaling up supervised fine-tuning. The study is an encouraging sign for future advances in AI and language models.