The field of Natural Language Processing (NLP) has seen significant advances thanks to Large Language Models (LLMs) capable of understanding and generating human-like text. This progress has transformed applications such as machine translation and complex reasoning, and opened up new research and development opportunities.
However, a notable challenge has been the gap between the reasoning capabilities of LLMs and human-level expertise. This gap is most apparent in complex reasoning tasks, where conventional ensemble approaches based on majority voting tend to produce inaccurate answers.
Several methods have been developed to bolster LLMs’ reasoning capabilities. Chain-of-Thought (CoT) prompting generates intermediate reasoning steps; self-consistency samples multiple reasoning chains and selects the most frequent answer; and complexity-based prompting filters reasoning chains according to their complexity. DiVeRSe and Progressive-Hint Prompting likewise aim to improve the accuracy and consistency of generated answers.
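To make the baseline concrete, the sketch below shows the majority-vote logic behind self-consistency. The `sample_chain` helper is a hypothetical stand-in for an LLM call that returns a (reasoning, answer) pair; only the final answer is used, which is exactly the limitation AoR targets.

```python
from collections import Counter

def self_consistency(prompt, sample_chain, n_samples=20):
    """Majority-vote aggregation: sample several CoT chains and keep only
    the most frequent final answer; the reasoning text itself is discarded.

    `sample_chain(prompt)` is a hypothetical helper returning (reasoning, answer).
    """
    answers = [sample_chain(prompt)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```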
A team of researchers from Fudan University, the National University of Singapore, and the Midea AI Research Center has developed a new framework named AoR (Aggregation of Reasoning) that shifts the focus from answer frequency to the evaluation of reasoning chains. The framework also includes a dynamic sampling component that improves the accuracy and reliability of LLM reasoning.
The AoR framework operates in two phases: local scoring and global evaluation. The local scoring phase evaluates and ranks reasoning chains that lead to the same answer, while the global evaluation phase assesses the surviving chains for logical coherence and consistency with their corresponding answers. The final answer is the one derived from the most logically sound reasoning chain.
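A minimal sketch of this two-phase idea is shown below, assuming hypothetical scoring functions `local_score` (compares chains that share an answer) and `global_score` (judges a chain's coherence and consistency with its answer). The names and the `top_k` parameter are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def aggregate_reasoning(chains, local_score, global_score, top_k=3):
    """Two-phase aggregation over (reasoning, answer) pairs.

    Phase 1 (local scoring): within each group of chains that reach the
    same answer, keep the top_k highest-scoring chains.
    Phase 2 (global evaluation): score the surviving chains across groups
    and return the answer of the most logically sound chain.
    """
    groups = defaultdict(list)
    for reasoning, answer in chains:
        groups[answer].append(reasoning)

    # Phase 1: rank chains within each answer group and keep the best few.
    candidates = []
    for answer, reasonings in groups.items():
        ranked = sorted(reasonings, key=lambda r: local_score(r, answer), reverse=True)
        candidates.extend((r, answer) for r in ranked[:top_k])

    # Phase 2: pick the globally most coherent chain and return its answer.
    best_reasoning, best_answer = max(
        candidates, key=lambda pair: global_score(pair[0], pair[1])
    )
    return best_answer, best_reasoning
```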
AoR significantly outperforms traditional ensemble methods in complex reasoning tasks. Experimental results showed an accuracy improvement of up to 7.2% on the AQuA dataset over the Self-Consistency method. The framework also adapts well to various LLM systems and has a high performance ceiling.
Most notably, AoR’s dynamic sampling capability balances performance against computational cost, lowering overhead by as much as 20% compared to existing methods while maintaining high accuracy. Across six mathematical reasoning datasets, AoR outperformed all baseline approaches.
The crux of AoR’s success lies in dynamic sampling, which not only enhances accuracy but also optimizes computational efficiency. On the AQuA dataset, for example, dynamic sampling reduced the number of samples needed by concentrating computational effort on the more complex queries.
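One way to picture dynamic sampling is sampling chains in small batches and stopping once the leading answer's aggregated score clearly beats the runner-up. The sketch below follows that intuition; `sample_chain`, `aggregate`, the score-gap threshold, and the batch sizes are all illustrative assumptions rather than the paper's exact criterion.

```python
def dynamic_sampling(prompt, sample_chain, aggregate,
                     score_gap=0.2, batch_size=5, max_samples=40):
    """Sample reasoning chains in batches; stop early once the top answer's
    aggregated score exceeds the runner-up's by `score_gap`.

    `sample_chain(prompt)` is a hypothetical helper returning (reasoning, answer).
    `aggregate(chains)` is a hypothetical scorer returning (answer, score)
    tuples sorted by descending score, e.g. built on the two-phase sketch above.
    """
    chains = []
    while len(chains) < max_samples:
        # Draw a small batch of additional reasoning chains.
        chains.extend(sample_chain(prompt) for _ in range(batch_size))
        ranked = aggregate(chains)
        if len(ranked) == 1:
            return ranked[0][0]  # every chain agrees; no need to keep sampling
        (best, best_score), (_, second_score) = ranked[0], ranked[1]
        if best_score - second_score >= score_gap:
            return best  # confident enough; save the remaining sampling budget
    return ranked[0][0]  # budget exhausted; return the current leader
```

In effect, easy queries terminate after a handful of samples, while harder queries keep drawing chains up to the budget, which is where the claimed overhead savings come from.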
In conclusion, AoR introduces an effective method for evaluating and aggregating reasoning processes, addressing a key shortcoming in the reasoning capabilities of LLMs and improving their reliability and performance on complex reasoning tasks. The approach developed by the team from Fudan University, the National University of Singapore, and the Midea AI Research Center could set new standards in natural language processing.