Recent research by scientists at Ohio State University and Carnegie Mellon University has examined the limitations of large language models (LLMs), such as GPT-4, in implicit reasoning. Implicit reasoning refers to a model’s ability to combine and compare facts it has internalized in its parameters, even when it clearly knows the individual entities and facts in question.
The study focused on two fundamental types of reasoning – comparison, which contrasts the properties of entities, and composition, which chains several pieces of knowledge together. The models studied were transformers, a type of deep learning model that can learn to reason implicitly over parametric knowledge. However, the research found that these models acquired implicit reasoning only through a process called ‘grokking’: extended training that continues well past the point of overfitting, so that the model learns deeper patterns beyond merely memorizing the training data.
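To make the training regime concrete, the following is a minimal sketch of a grokking-style run in PyTorch: a small transformer is optimized far past the point where it fits the training data, while held-out accuracy is monitored for a delayed jump in generalization. The toy task, model size, and hyperparameters are illustrative assumptions, not the paper’s exact setup.

```python
# Hedged sketch: keep training a small transformer long after it fits the
# training set, watching whether held-out accuracy eventually improves
# (the "grokking" phenomenon). All sizes and the toy task are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, SEQ_LEN, DIM = 64, 3, 128

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4,
                                           dim_feedforward=256, dropout=0.0,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h[:, -1])  # predict the answer from the last position

# Toy stand-in for composed facts: the answer is a hidden deterministic
# function of the query tokens that the model must infer.
pool_x = torch.randint(0, VOCAB, (2500, SEQ_LEN))
pool_y = (pool_x.sum(dim=1) * 7) % VOCAB
train_x, train_y = pool_x[:2000], pool_y[:2000]
test_x, test_y = pool_x[2000:], pool_y[2000:]

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=-1) == y).float().mean().item()

# Extended training schedule: the step count is illustrative; the point is to
# continue optimizing well after training accuracy saturates.
for step in range(1, 20001):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        print(f"step {step}: train acc {accuracy(train_x, train_y):.2f}, "
              f"held-out acc {accuracy(test_x, test_y):.2f}")
```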
The usefulness of grokking and the ability of transformers to generalize what they learn were found to vary with the type of reasoning required. For composition tasks, transformers struggled to generalize to out-of-distribution examples, meaning data that differs significantly from the training distribution; for comparison tasks, however, they generalized well.
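As a rough illustration of the two task types and of an out-of-distribution split, the plain-Python sketch below builds composition queries (chaining two stored facts) and comparison queries (contrasting a stored attribute of two entities). The entity names, relations, and split rule are hypothetical choices for illustration, not the paper’s benchmark.

```python
# Hedged sketch of the two synthetic reasoning tasks and an OOD split.
import random

random.seed(0)
entities = [f"e{i}" for i in range(20)]

# Atomic facts the model is expected to memorize.
relation = {e: random.choice(entities) for e in entities}   # e.g. "parent of"
attribute = {e: random.randint(0, 100) for e in entities}   # e.g. "age"

# Composition: infer relation(relation(x)) by chaining two atomic facts.
composition = [(x, relation[relation[x]]) for x in entities]

# Comparison: infer which of two entities has the larger attribute value.
comparison = [((a, b), a if attribute[a] >= attribute[b] else b)
              for a in entities for b in entities if a != b]

# Out-of-distribution split: hold out queries built around entities whose
# facts appear in training only as atomic facts, never in inferred examples.
ood_entities = set(entities[:5])
comp_id  = [q for q in composition if q[0] not in ood_entities]
comp_ood = [q for q in composition if q[0] in ood_entities]

print(len(comp_id), "in-distribution and", len(comp_ood), "OOD composition queries")
```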
The research produced several key findings. First, it traced how the ‘generalizing circuit’, which applies learned rules to novel situations, emerges and evolves during training. The success of this circuit at generalizing, rather than merely memorizing, is critical for implicit reasoning. Second, the researchers identified a close link between the configuration of the generalizing circuit and the model’s capacity for systematic generalization: how atomic knowledge and rules are arranged greatly influences the model’s reasoning ability.
The research also found that implicit reasoning in transformers depends heavily on how the training process and data are organized. The researchers proposed that transformer architectures could be improved with mechanisms that promote cross-layer knowledge sharing, which would strengthen the model’s reasoning abilities.
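One way such cross-layer sharing might look, offered purely as a hedged sketch, is to tie a single transformer block’s weights across every layer (in the spirit of Universal Transformer or ALBERT-style parameter sharing), so knowledge stored in one block’s parameters is available at every depth. This is an assumption about a possible mechanism, not the specific modification the authors prescribe.

```python
# Hedged sketch: one transformer block reused at every depth, an illustrative
# (assumed) form of cross-layer knowledge sharing.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, dim=128, heads=4, depth=6):
        super().__init__()
        # A single block whose parameters are shared across all `depth` applications.
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                dim_feedforward=4 * dim,
                                                batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)  # same weights applied at every layer
        return x

# Example: a batch of 2 sequences, length 5, model width 128.
encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 5, 128))
print(out.shape)  # torch.Size([2, 5, 128])
```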
Furthermore, the study indicated that parametric memory, the model’s ability to store and apply knowledge within its own parameters, works well for complex reasoning tasks. Models relying on non-parametric memory, such as GPT-4-Turbo and Gemini-1.5-Pro, performed poorly on a hard reasoning task with a large search space, regardless of how their retrieval processes were improved or guided. Yet a transformer with parametric memory reached almost perfect accuracy after extensive grokking, showing the promise of parametric memory for enabling complex reasoning in language models.