LM-Guided CoT: A Novel Machine Learning Framework Using a Streamlined Language Model (10B) for Reasoning Tasks

Chain-of-thought (CoT) prompting, an instruction method for language models (LMs), seeks to improve a model's performance on arithmetic, commonsense, and symbolic reasoning tasks. However, it falls short in smaller models (those under roughly 100 billion parameters), which tend to produce repetitive rationales and rationales that are misaligned with their final answers.

Researchers from Penn State University and Amazon AGI have introduced a solution called LM-guided CoT, which uses two separate LMs: a small one for generating reasoning rationales and a larger one for predicting answers. The small LM is trained via vanilla knowledge distillation on rationales generated by the larger LM, narrowing the reasoning gap between the two. The framework also employs fine-grained measurements such as relevance, consistency, logicality, and coherence to enhance rationale quality through reinforcement learning (RL), thereby improving CoT reasoning performance.
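The distillation step described above can be sketched as follows: the large LM's rationales become fine-tuning targets for the small LM. This is a minimal, illustrative sketch; the function names, prompt format, and the toy rationale generator are assumptions, not details from the paper.

```python
# Sketch of the knowledge-distillation data construction: the small LM is
# trained to imitate rationales produced by the large LM. All names here
# are illustrative, not the paper's actual implementation.

def build_distillation_pairs(examples, generate_rationale):
    """Turn (question, answer) examples plus a large-LM rationale generator
    into (input, target) fine-tuning pairs for the small LM."""
    pairs = []
    for ex in examples:
        rationale = generate_rationale(ex["question"])
        pairs.append({
            "input": f"Question: {ex['question']}\nExplain step by step:",
            "target": rationale,  # the small LM learns to reproduce this
        })
    return pairs

# Toy stand-in for the large LM's rationale generation.
def toy_rationale(question):
    return f"To answer '{question}', first identify the key facts, then combine them."

data = [{"question": "Who wrote Hamlet?", "answer": "Shakespeare"}]
pairs = build_distillation_pairs(data, toy_rationale)
print(pairs[0]["input"])
print(pairs[0]["target"])
```

In practice the `target` rationales would come from prompting the large LM, and the resulting pairs would be fed to a standard sequence-to-sequence fine-tuning loop for the small LM.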

The LM-guided CoT framework uses a lightweight model (MS) for quality rationale generation and a larger one (ML) for output prediction. The process involves MS learning from ML-generated rationales, followed by refinement using eight linguistic measurements. The refinement process is first done manually and then automated for RL training. The researchers use proximal policy optimization (PPO) to update MS with rewards based on task-specific accuracy and evaluation metrics covering the aforementioned linguistic aspects.
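A minimal sketch of the reward signal described above, combining answer accuracy with averaged rationale-quality scores. The aspect names follow the article, but the weighting scheme and exact combination are illustrative assumptions, not the paper's actual reward design.

```python
# Sketch of a PPO-style reward mixing task accuracy with linguistic-aspect
# scores for the generated rationale. The 50/50 weighting is an assumption.

def rationale_reward(predicted, gold, aspect_scores, accuracy_weight=0.5):
    """Combine answer correctness with averaged rationale-quality scores.

    aspect_scores: dict mapping aspect name (e.g. 'relevance') -> score in [0, 1].
    Returns a scalar reward in [0, 1] usable by a PPO trainer.
    """
    accuracy = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    quality = sum(aspect_scores.values()) / len(aspect_scores)
    return accuracy_weight * accuracy + (1 - accuracy_weight) * quality

scores = {"relevance": 0.9, "consistency": 0.8, "logicality": 0.7, "coherence": 0.6}
print(rationale_reward("Paris", "paris", scores))  # 0.5*1.0 + 0.5*0.75 = 0.875
```

In a full pipeline, this scalar reward would be passed to a PPO update step for MS after each generated rationale-answer pair.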

The study compared the performance of ML with and without CoT prompting and found reduced accuracy with CoT, attributed to the model's limited reasoning capacity over long contexts. LM-guided CoT, particularly with knowledge distillation, outperformed original CoT prompting by 2% on HotpotQA and 10% on 2WikiMultiHopQA. The framework was also found to significantly enhance answer predictions and rationale quality, especially for queries with long contexts, rivalling standard prompting in accuracy.

In summary, the LM-guided CoT framework introduced by this research advances CoT prompting by decomposing it into rationale-generation and answer-prediction stages, which are optimized using RL. Although the framework outperformed all baselines, selecting top-quality rationales did not consistently improve task performance, indicating the need to balance rationale quality against overall task efficiency. This research provides a resource-efficient solution to the challenges of CoT prompting, but further work should focus on balancing rationale quality and task performance for optimal results.
