Large language models (LLMs) have made remarkable strides across many tasks, and their capacity to reason has become a vital aspect of their development. However, the main drivers behind these advances remain unclear. Current approaches to boosting reasoning primarily involve increasing model size and extending context length through methods such as chain-of-thought prompting, retrieval-augmented generation, and example-based (few-shot) prompting. These techniques, though effective, represent only a fraction of the possible avenues for improvement, and they inevitably increase computational cost and inference latency in real-world applications.
To understand LLMs in depth, numerous studies have examined them from multiple perspectives, including mechanistic analyses, empirical pattern analysis, domain-specific approaches, and algorithmic reasoning. However, these approaches tend to lack a complete end-to-end geometric perspective: they rarely account for the sequence dimension or offer a context-dependent analysis of LLMs, particularly with respect to model size, context length, and their roles in reasoning capability.
A recent study by Tenyx probes the geometry of transformer layers in LLMs, concentrating on key properties associated with their expressive power. The research identifies two significant factors: the density of token interactions in the multi-head attention (MHA) module, and the link between larger model size and longer context length on one hand and higher attention density and improved reasoning on the other. By examining how an LLM's geometry correlates with its reasoning capability, and how it changes with longer input sequences and more attention heads, the study seeks to capture the expressive power of LLMs and deepen the understanding of their behavior.
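One informal way to make "attention density" concrete is to measure, for each query token, the effective number of keys it attends to, for example via the entropy of its attention row. The snippet below is a minimal sketch of such a probe under simplifying assumptions (random queries and keys, a single head); the function effective_attended_tokens and the entropy-based density proxy are illustrative choices, not the exact metric used in the Tenyx study.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def effective_attended_tokens(Q, K):
    """Proxy for attention density: exp(entropy) of each query's attention
    row, i.e. the effective number of keys that query attends to."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))           # (n_queries, n_keys) attention weights
    H = -(A * np.log(A + 1e-12)).sum(axis=-1)   # Shannon entropy per query
    return np.exp(H)                            # perplexity-style effective count

rng = np.random.default_rng(0)
for seq_len in (16, 64, 256):
    Q = rng.normal(size=(seq_len, 64))
    K = rng.normal(size=(seq_len, 64))
    density = effective_attended_tokens(Q, K).mean()
    print(f"seq_len={seq_len:4d}  mean effective attended tokens ~ {density:.1f}")
```

In this toy setting the effective count grows with sequence length, mirroring the intuition that longer contexts allow denser token interactions.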
The researchers used the GSM8K-Zero dataset to run experiments analyzing LLMs' reasoning capabilities through geometric analysis. The results showed a correlation between expressive power and reasoning capacity, suggesting that increasing the complexity of the MLP block's input can improve reasoning performance.
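An analysis of this kind can be approximated by pairing a per-question geometric score with a binary correctness label and measuring their correlation. The sketch below uses synthetic stand-in numbers rather than real GSM8K-Zero measurements; geometric_score and correct are hypothetical arrays, and the point-biserial correlation is just one reasonable way to quantify the relationship.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-question measurements (synthetic stand-ins, NOT real GSM8K-Zero data):
# a geometric score per prompt (e.g. an ID estimate of the final-layer representations)
# and a binary flag for whether the model answered correctly.
n_questions = 500
correct = rng.integers(0, 2, size=n_questions)
geometric_score = 10 + 2.0 * correct + rng.normal(scale=1.5, size=n_questions)

# Point-biserial correlation: Pearson correlation between a binary and a continuous variable.
r = np.corrcoef(geometric_score, correct)[0, 1]
print(f"correlation between geometric score and correctness: r = {r:.3f}")

# Complementary check: compare mean scores of correct vs. incorrect answers.
print("mean score | correct:  ", geometric_score[correct == 1].mean())
print("mean score | incorrect:", geometric_score[correct == 0].mean())
```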
Notably, the study reveals a strong relationship between the intrinsic dimension (ID) of the final layers' representations and the correctness of the response, regardless of model size. Adding context to prompts can raise the ID, especially when that context is relevant to the question. Larger increases in ID correlate with a higher likelihood of correct responses, indicating that a finer partitioning of the space around tokens reduces the overall prediction error. However, the link between these geometric insights and the generalization capabilities of LLMs has yet to be explored thoroughly.
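To illustrate what an intrinsic-dimension measurement might look like in practice, the sketch below applies the TwoNN estimator (Facco et al., 2017) to points lying on a low-dimensional manifold embedded in a high-dimensional space, much as token representations occupy far fewer effective dimensions than the hidden size. This is a generic ID estimator applied to toy data, not the study's exact procedure.

```python
import numpy as np

def two_nn_id(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017), based on the
    ratio of each point's 2nd- to 1st-nearest-neighbor distance."""
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)  # squared distances
    D = np.sqrt(D2)
    np.fill_diagonal(D, np.inf)
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]                    # ratio of 2nd to 1st NN distance
    mu = mu[np.isfinite(mu) & (mu > 1.0)]
    return len(mu) / np.log(mu).sum()         # maximum-likelihood estimate of the ID

rng = np.random.default_rng(2)
# Points on a 5-dimensional manifold linearly embedded in a 256-dimensional space,
# mimicking representations whose effective dimension is far below the hidden size.
latent = rng.normal(size=(1000, 5))
embedded = latent @ rng.normal(size=(5, 256))
print(f"estimated intrinsic dimension: {two_nn_id(embedded):.2f} (true manifold dimension: 5)")
```

On representations extracted from an actual model, X would be the matrix of final-layer token embeddings for a given prompt.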
The research also highlights the vital role that MLP blocks in DNNs and LLMs play in partitioning the input space. The regions of this partition are data-dependent and determined during training, demonstrating that the interplay between approximation error and the number of regions shapes LLMs' function-approximation abilities.
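The idea of input-space partitioning can be made concrete with a toy ReLU network: each distinct pattern of active neurons corresponds to one linear region of the input space. The sketch below counts how many such regions a small, randomly initialized two-layer MLP carves out over sampled inputs; it illustrates piecewise-linear partitioning in general, not the paper's measurement procedure.

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """Return the ReLU on/off pattern of a single input as a hashable tuple;
    each distinct pattern corresponds to one linear region of the input space."""
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        pattern.append(pre > 0)        # which neurons fire in this layer
        h = np.maximum(pre, 0.0)       # ReLU
    return tuple(np.concatenate(pattern))

rng = np.random.default_rng(3)
in_dim, hidden = 2, 16
weights = [rng.normal(size=(in_dim, hidden)), rng.normal(size=(hidden, hidden))]
biases = [rng.normal(size=hidden), rng.normal(size=hidden)]

# Sample inputs and count how many distinct linear regions they land in.
samples = rng.uniform(-2.0, 2.0, size=(10000, in_dim))
regions = {activation_pattern(x, weights, biases) for x in samples}
print(f"distinct linear regions hit by {len(samples)} samples: {len(regions)}")
```

Widening or deepening this toy network tends to increase the number of regions the samples uncover, which is the sense in which more regions corresponds to a more expressive partitioning.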
This research could eventually help smaller LLMs close the performance gap with larger models. However, more extensive investigation is needed to enhance LLMs' reasoning abilities, since approximation power, despite its strong correlation with reasoning capability, is not equivalent to generalization. The researchers advise further exploration to understand the models' robustness and adaptability across different contexts.