Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. However, improving language models' ability to solve complex reasoning tasks that demand logical steps and coherent thought remains challenging, particularly because most current approaches rely on generating explicit intermediate steps, which is computationally expensive.
Several existing methods attempt to address these challenges. Explicit chain-of-thought (CoT) reasoning improves accuracy by generating intermediate reasoning steps, but requires large amounts of computational resources. Implicit CoT via knowledge distillation (ICoT-KD) trains models using hidden states for reasoning, removing the need for explicit steps. Similarly, MathGLM eliminates intermediate steps to solve multi-digit arithmetic tasks, and Searchformer trains transformers to perform search tasks with fewer steps.
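The difference between the two paradigms can be illustrated with a toy example (the arithmetic below is made up for illustration and is not taken from the paper): explicit CoT must generate every intermediate step as output tokens, while implicit CoT emits only the final answer and carries the reasoning in the model's hidden states.

```python
# Toy illustration of the token cost of explicit vs. implicit CoT.
# The question and steps are hypothetical examples, not from any dataset.
question = "What is 12 * 34?"

# Explicit CoT: every intermediate step is generated as text.
explicit_cot = ["12 * 4 = 48", "12 * 30 = 360", "48 + 360 = 408", "Answer: 408"]

# Implicit CoT: reasoning happens internally; only the answer is emitted.
implicit_cot = ["Answer: 408"]

# The extra generated steps are what make explicit CoT slow at inference time.
assert len(explicit_cot) > len(implicit_cot)
```

The speedups reported later in this article come directly from this gap: fewer generated tokens means fewer forward passes through the model at inference time.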
Research teams from the Allen Institute, University of Waterloo, University of Washington, and Harvard University have proposed an innovative solution, known as Stepwise Internalization, to address these challenges. This method starts with a model trained for explicit CoT reasoning, then gradually removes the intermediate steps during fine-tuning. This internalizes the reasoning steps within the language model, preserving performance while reducing computational overhead, and enabling implicit CoT reasoning without generating intermediate steps.
The Stepwise Internalization method involves a systematic training process. It starts with a language model trained for explicit CoT reasoning. As training progresses, the intermediate reasoning tokens are gradually removed. At each stage, the model is fine-tuned to adapt to the absence of those tokens, encouraging it to internalize the corresponding reasoning within its hidden states. A linear schedule governs how many CoT tokens are removed, helping the model adapt gradually to the changes.
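The schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the removal of tokens from the front of the chain, and the step-to-fraction mapping are assumptions made for clarity.

```python
# Hypothetical sketch of a linear CoT-token-removal schedule
# for Stepwise Internalization (names and details are illustrative).

def cot_tokens_to_remove(step: int, total_steps: int, num_cot_tokens: int) -> int:
    """Linearly increase the number of dropped CoT tokens over training."""
    frac = min(step / total_steps, 1.0)
    return int(frac * num_cot_tokens)

def build_training_example(question, cot_tokens, answer, step, total_steps):
    """Build the fine-tuning target with the first k CoT tokens removed.

    Early in training the model still sees the full chain; by the end it
    must predict the answer directly, internalizing the removed steps.
    """
    k = cot_tokens_to_remove(step, total_steps, len(cot_tokens))
    return question + cot_tokens[k:] + answer
```

For example, with a three-step chain and a ten-step schedule, the model sees the full chain at step 0, two of the three steps at step 5, and only question plus answer at step 10.

```python
q = ["Q:", "12*34=?"]
cot = ["12*4=48", "12*30=360", "48+360=408"]
a = ["A:", "408"]

assert build_training_example(q, cot, a, 0, 10) == q + cot + a        # full chain
assert build_training_example(q, cot, a, 5, 10) == q + cot[1:] + a    # one step removed
assert build_training_example(q, cot, a, 10, 10) == q + a             # no chain left
```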
This method has demonstrated significant improvements in performance across various tasks. For example, a GPT-2 Small model trained using Stepwise Internalization achieved 99% accuracy on 9-by-9 multiplication problems, a task that models trained using standard methods struggled to solve. The Mistral 7B model, trained in the same way, achieved over 50% accuracy on the GSM8K dataset of grade-school math problems without producing any explicit intermediate steps, outperforming significantly larger models. In addition, Stepwise Internalization proved up to 11 times faster than explicit CoT reasoning while maintaining high accuracy.
To sum up, Stepwise Internalization offers a promising approach to enhancing the reasoning capabilities of language models. By internalizing CoT steps, it strikes a balance between accuracy and computational efficiency. This method could transform the way complex reasoning tasks are handled in NLP, creating more efficient and capable language models that are practical for a wider range of applications. Further development and scaling of this approach could lead to even stronger results in the future.