Large Language Models (LLMs) have made significant strides in addressing various reasoning tasks, such as math problems, code generation, and planning. However, as these tasks become more complex, LLMs struggle with inconsistencies, hallucinations, and errors. This is especially true for tasks requiring multiple reasoning steps, where models tend to operate at a "System 1" level of thinking: fast and instinctive but often inaccurate. A more deliberative, "System 2" style of thinking is needed for better accuracy and consistency.
Several approaches have been used to tackle these challenges, such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Aligner methods to align LLM outputs with human expectations. Tree-of-Thoughts (ToT), A* search, and Monte Carlo Tree Search (MCTS) methodologies have been employed to improve LLMs' planning capabilities. While these techniques have shown promise, they often require high computational power, expert tuning, or task-specific modifications, thereby restricting their generalizability and practicality.
To address these issues, researchers from Skywork AI and Nanyang Technological University have developed Q*, a framework designed to improve the multi-step reasoning capabilities of LLMs through deliberate planning. Q* formalizes LLM reasoning as a Markov Decision Process (MDP), where the state comprises the initial input and the reasoning steps generated so far, the action is the next reasoning step, and the reward measures task success. Q* introduces methods for estimating optimal Q-values for state-action pairs, including offline reinforcement learning, learning from the best sequences sampled in rollouts, and completing rollouts with a stronger LLM. Framing multi-step reasoning as a heuristic search problem, Q* employs the Q-value model within an A* search framework, guiding LLMs to select the most promising next step effectively.
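To make the search procedure concrete, the sketch below shows how an A*-style best-first search over partial reasoning traces could be guided by a learned Q-value model. It is a minimal illustration, not the paper's released implementation: the helper functions `propose_steps`, `q_value`, and `is_terminal`, and the exact scoring formula, are assumptions introduced here for clarity.

```python
import heapq
import itertools

# Minimal sketch of Q*-style deliberative planning. The helper names below are
# illustrative assumptions, not the paper's actual interfaces:
#   propose_steps(question, steps) -> candidate next reasoning steps from the LLM
#   q_value(question, steps, step) -> learned estimate of eventual task success
#   is_terminal(steps)             -> True once a complete answer has been produced

def q_star_search(question, propose_steps, q_value, is_terminal,
                  max_expansions=100, lambda_weight=1.0):
    """Best-first (A*-style) search over partial reasoning traces.

    A state is the sequence of reasoning steps generated so far. States are
    expanded in order of f = g + lambda * h, where g aggregates the utility of
    the path (here, the mean Q-value of its steps) and h is the Q-value of the
    step being appended, which serves as the heuristic.
    """
    counter = itertools.count()            # tie-breaker for the priority queue
    frontier = [(0.0, next(counter), [])]  # (-f, tie, steps); heapq is a min-heap
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, steps = heapq.heappop(frontier)
        if steps and is_terminal(steps):
            return steps                   # complete reasoning trace found
        for step in propose_steps(question, steps):
            new_steps = steps + [step]
            h = q_value(question, steps, step)              # heuristic term
            step_values = [q_value(question, new_steps[:i], s)
                           for i, s in enumerate(new_steps)]
            g = sum(step_values) / len(step_values)         # aggregated path utility
            f = g + lambda_weight * h
            heapq.heappush(frontier, (-f, next(counter), new_steps))
    return None  # search budget exhausted without a complete trace
```

In this sketch, the partial trace with the highest combined path utility and Q-value estimate is expanded first, which is how a learned Q-value model can steer the LLM toward promising next steps without exhaustively enumerating full reasoning chains.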
Q* proved its efficacy across various reasoning tasks. It improved Llama-2-7b to 80.8% accuracy on the GSM8K dataset, outpacing ChatGPT-turbo. On the MATH dataset, Q* lifted Llama-2-7b and DeepSeekMath-7b to 55.4% accuracy, outshining models like Gemini Ultra (4-shot). Moreover, Q* raised CodeQwen1.5-7b-Chat's performance to 77.0% accuracy on the MBPP code generation benchmark. These results consistently demonstrate Q*'s ability to improve LLM performance across math reasoning and code generation tasks, outperforming typical methods and even some closed-source models.
Q* demonstrates the potential to substantially enhance LLMs' problem-solving abilities and outperform established techniques. It represents a notable advance in the field by introducing a robust deliberative planning framework that improves LLMs' ability to solve complex problems without task-specific fine-tuning. Extensive experiments on math reasoning and code generation underscore Q*'s strong performance, indicating its potential to sharpen LLMs' complex reasoning skills while saving computational resources and maintaining high performance across a wide array of tasks.