A study by researchers from Carnegie Mellon University, Google DeepMind, and MultiOn examines the role of synthetic data in enhancing the mathematical reasoning capabilities of large language models (LLMs). Predictions indicate that the high-quality internet data needed to train such models could be exhausted by 2026, which makes model-generated, or synthetic, data an appealing alternative. However, training on synthetic data can also degrade model performance, potentially amplifying biases and factual inaccuracies.
While some methods for generating synthetic data have shown promise, problems persist, particularly for mathematical reasoning models. To counter these issues, the study explores the use of negative (incorrect) model-generated responses to reveal and rectify problematic patterns in the training data.
The researchers found that positive synthetic data improved performance, but at slower scaling rates than pre-training. Additionally, self-generated positive data was often as effective as data produced by larger models. The most promising approach, however, used negative synthetic data alongside positive data at critical junctures: compared to positive data alone, this technique improved data efficiency by up to eight times.
The methodology involves a synthetic data pipeline that prompts models to generate new problems resembling real ones. Solutions to these problems are then obtained with step-wise reasoning, and a binary reward function validates the correctness of each solution trace. From these model-generated solutions, the researchers construct both positive and negative synthetic datasets. Finally, the study uses Supervised Finetuning and Rejection Finetuning, along with Direct Preference Optimization, to learn from both the positive and negative data.
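To make the pipeline concrete, here is a minimal Python sketch of how such a data-generation loop might be structured. The functions `generate_problem`, `generate_solution`, and `binary_reward` are hypothetical stand-ins for model calls and an answer verifier; they are assumptions for illustration, not the authors' actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Example:
    problem: str
    steps: list          # step-wise reasoning trace
    reward: int          # 1 if the final answer checks out, 0 otherwise

def generate_problem(seed_problem: str) -> str:
    """Hypothetical LLM call that writes a new problem resembling the seed."""
    return f"Variant of: {seed_problem}"

def generate_solution(problem: str) -> list:
    """Hypothetical LLM call that returns a step-wise solution trace."""
    return [f"Restate: {problem}", "Work through the algebra.", "Final answer: 4"]

def binary_reward(problem: str, steps: list) -> int:
    """Binary reward: 1 if the trace ends in the known correct answer, else 0.
    Simulated with a coin flip here purely for illustration."""
    return int(random.random() < 0.5)

def build_synthetic_dataset(seed_problems, samples_per_problem=4):
    positives, negatives = [], []
    for seed in seed_problems:
        new_problem = generate_problem(seed)
        for _ in range(samples_per_problem):
            steps = generate_solution(new_problem)
            r = binary_reward(new_problem, steps)
            example = Example(new_problem, steps, r)
            # Correct traces become positive data (used for SFT/RFT);
            # incorrect traces become negative data (used for preference training).
            (positives if r == 1 else negatives).append(example)
    return positives, negatives

if __name__ == "__main__":
    pos, neg = build_synthetic_dataset(["If x + 3 = 7, what is x?"])
    print(f"{len(pos)} positive traces, {len(neg)} negative traces")
```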
Although positive synthetic data alone brought improvements, self-generated positive data doubled the data efficiency. The most impactful result came from combining negative data with per-step Direct Preference Optimization (DPO), which increased data efficiency eightfold compared to using positive data alone.
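For reference, the standard DPO objective that such preference training builds on contrasts a preferred (correct) trace $y^{+}$ with a dispreferred (incorrect) trace $y^{-}$ for the same problem $x$; the per-step variant described in the study applies this idea at the granularity of individual reasoning steps, and its exact weighting is not reproduced here.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} - \beta \log \frac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\right)\right]
$$

Here $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\beta$ is a temperature hyperparameter, and $\sigma$ is the sigmoid function.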
Overall, the study highlights the importance of strategically employing both positive and negative synthetic data when training LLMs for mathematical reasoning. Integrating reinforcement learning techniques, estimating step-wise advantages from negative data, and applying preference optimization objectives substantially improved synthetic data efficiency and, in turn, the models' mathematical reasoning capabilities.
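As a rough illustration of what a step-wise advantage means in this setting (stated here in the generic reinforcement-learning form, not necessarily the paper's exact estimator), the advantage of reasoning step $y_i$ is the change in expected final correctness from committing to that step:

$$
A(x, y_{1:i}) \;=\; Q(x, y_{1:i}) \;-\; V(x, y_{1:i-1}),
$$

where $V(x, y_{1:i-1})$ is the expected probability of eventually reaching a correct final answer from the partial trace $y_{1:i-1}$, and $Q(x, y_{1:i})$ is the same expectation after additionally taking step $y_i$. Steps with negative advantage, surfaced with the help of incorrect traces, can then be treated as dispreferred during preference optimization.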