Large language models (LLMs) have demonstrated impressive performance across a wide range of tasks, with reasoning playing a central role in their progress. However, the specific factors driving these improvements are not yet fully understood. Current strategies to enhance reasoning focus on enlarging model size and lengthening the input context through techniques such as chain-of-thought prompting, retrieval-augmented generation, and few-shot (example-based) prompting. Although effective, these tactics typically increase computational cost and inference latency in real-world deployments.
Various attempts have been made to understand LLMs. Some researchers have concentrated on mechanistic frameworks and on identifying patterns in empirical results. Other studies have probed input-output relationships with domain-specific setups: graph problems to gauge LLM expressiveness, algorithmic reasoning to expose limitations, and arithmetic learning to assess the impact of input formatting. Transformer networks have also been studied to characterize initialization, training dynamics, and the geometry of embeddings in intermediate and final layers.
A study carried out by Tenyx researchers explored the geometry of transformer layers in LLMs, highlighting properties that correlate with their expressive power. They identified two significant observations: the density of token interactions captured by the multi-head attention (MHA) module, and the fact that increasing model size and context length raises this attention density and, in turn, improves reasoning. The researchers proposed that these factors strongly influence reasoning abilities.
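As a rough illustration of what "density of token interactions" can mean in practice, the sketch below (my own toy example, not the paper's code) converts each row of a softmax attention map into an effective number of attended tokens via its entropy; denser attention corresponds to a larger effective count, and longer contexts give the attention more tokens to spread over. The helper name `attention_density` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_density(scores):
    """Effective number of tokens each query attends to.

    scores: (num_queries, num_keys) raw attention logits.
    Returns exp(entropy) per query: 1 means the query attends to a single
    token, num_keys means it attends uniformly to every token.
    """
    probs = softmax(scores, axis=-1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return np.exp(entropy)

rng = np.random.default_rng(0)
short_ctx = rng.normal(size=(8, 8))    # toy logits, short context: 8 keys
long_ctx = rng.normal(size=(8, 64))    # toy logits, longer context: 64 keys
print(attention_density(short_ctx).mean())  # smaller effective count
print(attention_density(long_ctx).mean())   # larger effective count
```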
The research also examined how the number of input-space regions induced by the multi-layer perceptron (MLP) blocks affects reasoning. Using the GSM8K-Zero dataset, the researchers found that improved reasoning corresponded with a higher intrinsic dimension at the final layer, revealing a correlation between expressive power and reasoning capability. They suggested that increasing the complexity of the inputs to the MLP blocks could effectively boost LLMs’ reasoning performance.
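The intrinsic dimension of last-layer token embeddings can be estimated with standard tools. The sketch below applies the TwoNN estimator (Facco et al., 2017) to a synthetic point cloud, purely to illustrate the kind of quantity being tracked; it is not the authors' pipeline, and the helper name `twonn_intrinsic_dimension` is my own.

```python
import numpy as np

def twonn_intrinsic_dimension(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017).

    X: (n_points, n_features) array, e.g. last-layer token embeddings.
    Uses the ratio of second- to first-nearest-neighbor distances.
    """
    # Pairwise Euclidean distances; ignore the zero self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]                 # r2 / r1 for every point
    return len(X) / np.sum(np.log(mu))     # maximum-likelihood estimate

rng = np.random.default_rng(0)
# A 3-D latent cloud embedded linearly in a 128-D ambient space.
latent = rng.normal(size=(500, 3))
embeddings = latent @ rng.normal(size=(3, 128))
print(twonn_intrinsic_dimension(embeddings))   # close to 3, despite 128 features
```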
The experimental findings show that adding context to prompts can raise the intrinsic dimension of the last layers, particularly when the context is relevant to the question. Such contexts cause the MLP to apply more distinct piecewise-affine maps, resulting in a more adaptive transformation for each token. However, the connection between these geometric insights and the generalization capabilities of LLMs remains unexplored.
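One way to make the "more piecewise-affine maps" idea concrete: a ReLU MLP is affine on every region defined by a fixed on/off pattern of its hidden units, so counting the distinct activation patterns produced by a batch of token representations lower-bounds how many affine pieces those tokens touch. The sketch below is a toy illustration with assumed shapes, not the study's methodology.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP block: d_model -> d_hidden with a ReLU nonlinearity.
d_model, d_hidden = 32, 128
W1 = rng.normal(size=(d_model, d_hidden))
b1 = rng.normal(size=d_hidden)

def activation_patterns(tokens):
    """Binary on/off pattern of the ReLU units for each token.

    Tokens sharing a pattern are transformed by the same affine map;
    distinct patterns correspond to distinct affine pieces of the MLP.
    """
    pre_act = tokens @ W1 + b1
    return (pre_act > 0).astype(np.int8)

def count_regions(tokens):
    patterns = activation_patterns(tokens)
    return len({p.tobytes() for p in patterns})

short_prompt = rng.normal(size=(16, d_model))    # 16 token representations
long_prompt = rng.normal(size=(256, d_model))    # 256 token representations
print(count_regions(short_prompt))   # at most 16 affine pieces touched
print(count_regions(long_prompt))    # typically far more pieces touched
```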
An interesting aspect of the research is its emphasis on the input-space partitioning induced by MLPs in deep neural networks (DNNs) and LLMs. As the study shows, adaptive partitioning is integral to a DNN's approximation capability, and the number of regions strongly influences the function-approximation abilities of LLMs. Although approximation power is not the same as generalization, it is closely associated with the reasoning capabilities of LLMs.
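To see why the number of regions matters for approximation power at all, consider fitting a smooth function with a continuous piecewise-linear interpolant: the error shrinks as the number of linear pieces grows. The short sketch below demonstrates this generic numerical fact with `numpy.interp`; it is an illustration of the principle, not a result from the study.

```python
import numpy as np

def piecewise_linear_error(num_pieces, f=np.sin, lo=0.0, hi=2 * np.pi):
    """Max error of a piecewise-linear fit of f using num_pieces segments."""
    knots = np.linspace(lo, hi, num_pieces + 1)
    xs = np.linspace(lo, hi, 10_000)
    approx = np.interp(xs, knots, f(knots))   # linear between the knots
    return np.abs(approx - f(xs)).max()

for pieces in (4, 16, 64, 256):
    print(pieces, piecewise_linear_error(pieces))
# The error falls roughly as 1 / num_pieces**2: more affine regions,
# better approximation of the target function.
```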
The paper offers a broad view of the underlying theory alongside a limited set of experiments, suggesting that further exploration of these phenomena could be critical to refining the reasoning abilities of LLMs. In the longer term, this line of research could help smaller LLMs narrow the performance gap with larger models.