Language models (LMs) are a crucial part of artificial intelligence and can play a key role in complex decision-making, planning, and reasoning. However, despite their capacity to learn and improve, LMs are rarely trained on examples of making and then correcting mistakes. Many models also have difficulty planning ahead and anticipating the consequences of their actions, which undermines their efficiency.
In response to these challenges, researchers from Stanford University, MIT, and Harvey Mudd College have developed a method, which they call Stream of Search (SoS), to enhance LMs' ability to devise problem-solving strategies. With this training method, they improved the error-correction capacity of LMs by 25% and enabled the models to solve 36% of problems that had previously gone unsolved. This was achieved by representing the search process itself as a serialized string, demonstrated on the arithmetic puzzle game Countdown.
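To make the idea concrete, the snippet below is a minimal sketch, assuming a toy Countdown setup, of how a depth-first search over game states can be flattened into a single string, failed branches and backtracking included. The trace format and function names are illustrative assumptions, not the authors' exact serialization.

```python
# Minimal sketch: serialize a depth-first search over Countdown states into a
# single string. The trace format here is an illustrative assumption, not the
# paper's exact format.
from fractions import Fraction

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def search(numbers, target, trace):
    """Depth-first search that logs every state, move, and backtrack."""
    trace.append("state: " + " ".join(str(n) for n in numbers))
    if target in numbers:
        trace.append("goal reached")
        return True
    for i in range(len(numbers)):
        for j in range(len(numbers)):
            if i == j:
                continue
            rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
            for op, fn in OPS.items():
                result = fn(numbers[i], numbers[j])
                if result is None or result < 0:
                    continue
                trace.append(f"try: {numbers[i]} {op} {numbers[j]} = {result}")
                if search(rest + [result], target, trace):
                    return True
                trace.append("backtrack")
    return False

trace = []
search([Fraction(n) for n in (4, 6, 10)], Fraction(40), trace)
print("\n".join(trace))  # the whole search, mistakes included, as one string
```

Training on strings like this, rather than only on clean final answers, is what exposes the model to exploration and recovery from dead ends.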
Existing approaches often integrate LMs into search and planning systems, using them to generate and evaluate candidate actions or states. However, these methods tend to rely heavily on symbolic search algorithms and intricate, hand-designed search procedures, which limits their flexibility. Alternatively, outcomes can be supervised with an external verifier model, but that approach requires extensive labeled data.
To overcome these limitations, the researchers frame search as a Markov Decision Process (MDP): a set of states, actions, and rewards that defines the search problem and can accommodate many different search algorithms. Training a GPT-Neo model on a synthetic dataset of serialized search trajectories, they found that LMs trained this way solved Countdown problems more effectively than models trained only on optimal solutions.
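As a rough illustration of that framing, the sketch below shows one plausible way to express Countdown search as states, actions, and rewards. The class and function names are assumptions made for illustration, not the paper's code.

```python
# Hedged sketch of the search-as-MDP framing: states are the numbers still in
# play plus the target, actions combine two numbers with an operator, and the
# reward fires only when the target is produced. All names are illustrative.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class State:
    numbers: Tuple[int, ...]  # numbers still available to combine
    target: int               # Countdown target value

@dataclass(frozen=True)
class Action:
    i: int    # index of the left operand in state.numbers
    j: int    # index of the right operand
    op: str   # one of "+", "-", "*", "/"

def apply_op(a: int, b: int, op: str) -> Optional[int]:
    if op == "+":
        return a + b
    if op == "-":
        return a - b if a >= b else None                  # keep values non-negative
    if op == "*":
        return a * b
    if op == "/":
        return a // b if b != 0 and a % b == 0 else None  # exact division only
    raise ValueError(op)

def step(state: State, action: Action) -> Tuple[State, float]:
    """One MDP transition: combine two numbers; reward 1.0 on hitting the target."""
    a, b = state.numbers[action.i], state.numbers[action.j]
    result = apply_op(a, b, action.op)
    if result is None:
        return state, 0.0                                 # illegal move: no-op, no reward
    rest = tuple(n for k, n in enumerate(state.numbers)
                 if k not in (action.i, action.j))
    next_state = State(rest + (result,), state.target)
    return next_state, 1.0 if result == state.target else 0.0

s = State(numbers=(4, 6, 10), target=40)
s, r = step(s, Action(i=0, j=2, op="*"))  # 4 * 10 = 40
print(s, r)                               # reward 1.0: target reached
```

Because the MDP only fixes the states, actions, and rewards, trajectories from many different search strategies can be generated in the same format and mixed into one training set.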
The research also offers insights into self-improvement strategies that could further enhance LMs' problem-solving capabilities, such as reinforcement learning (RL), expert iteration, and Advantage-Induced Policy Alignment (APA). In the long run, these strategies may let models discover new heuristics or problem-solving techniques.
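The loop below is a schematic, runnable toy of the expert-iteration idea: sample search traces from the current model, keep only those that reach the target, and fine-tune on the survivors. The sample_trace, is_correct, and finetune stubs are hypothetical placeholders, not the paper's implementation.

```python
# Schematic toy of expert iteration: filter self-generated search traces by
# correctness, then train on the survivors. All stubs below are hypothetical
# placeholders, not the paper's implementation.
import random

def sample_trace(model, problem):
    # Placeholder: a real system would decode a search string from the LM.
    return {"answer": random.choice([problem["target"], None])}

def is_correct(problem, trace):
    return trace["answer"] == problem["target"]

def finetune(model, winners):
    # Placeholder: a real system would run supervised fine-tuning on `winners`.
    return model

def expert_iteration(model, problems, rounds=3, samples_per_problem=8):
    """Each round keeps only traces that solved their problem, then retrains."""
    for _ in range(rounds):
        winners = [(p, t)
                   for p in problems
                   for t in (sample_trace(model, p) for _ in range(samples_per_problem))
                   if is_correct(p, t)]
        model = finetune(model, winners)
    return model

expert_iteration(model=None, problems=[{"target": 24}])
```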
In conclusion, SoS offers an effective way for LMs to learn problem-solving by simulating the search process in language. Unlike symbolic search methods, which are constrained by hand-designed procedures, SoS lets LMs learn internal "world models" for search, which promotes generalization. While the study focused primarily on the game of Countdown, the SoS method shows promise for more complex real-world tasks. Future research could extend SoS by incorporating a wider range of search procedures and investigating transfer across domains. The results of this study showcase the potential for LMs to excel at problem-solving through varied search strategies and iterative refinement.