Neural language models (LMs), particularly those based on the transformer architecture, have gained prominence both for their impact on a wide range of Natural Language Processing (NLP) tasks and for the growing interest in their theoretical underpinnings. These models are often analyzed through the lens of binary language recognition, but this framing creates a mismatch between a language model, which is a probability distribution over strings, and its theoretical abstraction as a set of strings. Researchers have therefore argued that a more faithful account requires characterizing the classes of probability distributions over strings that a given transformer architecture can represent.
A team at ETH Zurich has examined the representational capacity of transformer LMs through the lens of n-gram LMs, asking how well transformers capture the inherently parallel way in which n-gram LMs define the next-symbol distribution. Their findings show that the transformer architecture can indeed be used to derive several concrete lower bounds on the representational capacity of transformer LMs. The constructions rely on hard and sparse attention, two mechanisms through which transformer LMs can simulate n-gram LMs, using either multiple heads or multiple layers to do so. A key ingredient is the design of the input representations, namely the queries, keys, and values, which are augmented so that attention can select the relevant preceding positions.
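As an informal illustration (not taken from the paper), the sketch below contrasts ordinary softmax attention with the hard attention these constructions rely on: a hard-attention head places all of its weight on a single highest-scoring position, so it copies exactly one past value rather than a weighted mixture.

```python
# Minimal sketch contrasting soft and hard attention weights over positions.
import numpy as np

def soft_attention(scores: np.ndarray) -> np.ndarray:
    """Standard softmax attention weights over positions."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def hard_attention(scores: np.ndarray) -> np.ndarray:
    """Hard attention: a one-hot distribution on the highest-scoring position."""
    weights = np.zeros_like(scores)
    weights[np.argmax(scores)] = 1.0
    return weights

scores = np.array([0.1, 2.3, 0.7, 1.9])
print(soft_attention(scores))  # roughly [0.06, 0.50, 0.10, 0.34]
print(hard_attention(scores))  # [0., 1., 0., 0.] -- all weight on one position
```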
To make these findings precise, the researchers prove two theorems. The first states that for any n-gram LM there exists a weakly equivalent single-layer hard-attention transformer LM with n-1 heads: the transformer can attend to the preceding n-1 positions, one head per position, and thereby recover the full n-gram context in a single layer. The second states that for any n-gram LM there exists a weakly equivalent n-1-layer hard-attention transformer LM with a single head: each layer attends to the immediately preceding position and copies its symbol forward, so that after n-1 layers the current position carries the preceding n-1 symbols.
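To illustrate the intuition behind the first theorem, the following sketch (the function names and the table-based n-gram LM are illustrative assumptions, not the paper's construction) shows how n-1 hard-attention heads, each retrieving the symbol at a fixed offset into the past, recover the length-(n-1) context, after which the output can simply emit the n-gram LM's conditional distribution for that context.

```python
# Hypothetical sketch of the single-layer, (n-1)-head idea: head k copies the
# symbol at position t - k, and the recovered context indexes an n-gram table.
import numpy as np

def ngram_conditional(ngram_lm: dict, context: tuple, vocab: list) -> np.ndarray:
    """Look up p( . | context) in a table-based n-gram LM (uniform if unseen)."""
    probs = ngram_lm.get(context)
    if probs is None:
        return np.full(len(vocab), 1.0 / len(vocab))
    return np.array([probs.get(w, 0.0) for w in vocab])

def simulate_step(tokens: list, t: int, n: int, ngram_lm: dict, vocab: list) -> np.ndarray:
    """One 'transformer step': n-1 hard-attention heads, head k copying token t-k."""
    heads = []
    for k in range(1, n):                      # one head per offset 1 .. n-1
        pos = t - k
        heads.append(tokens[pos] if pos >= 0 else "<bos>")  # pad before the string start
    context = tuple(reversed(heads))           # the preceding n-1 symbols in order
    return ngram_conditional(ngram_lm, context, vocab)

vocab = ["a", "b"]
trigram_lm = {("a", "b"): {"a": 0.9, "b": 0.1}}   # toy 3-gram table
print(simulate_step(["a", "b"], t=2, n=3, ngram_lm=trigram_lm, vocab=vocab))  # [0.9 0.1]
```

In an actual transformer, the fixed offsets would be realized through position-dependent attention scores rather than direct indexing, but the flow of information is the same: each head deterministically retrieves one past symbol.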
Through their research, they have shown that the connection between transformer LMs and classical LMs can be made concrete: hard- and sparse-attention transformer LMs can represent any n-gram LM, which in turn yields a solid lower bound on their probabilistic representational capacity. Their work also reveals a trade-off between the number of heads, the number of layers, and the complexity of the non-linear transformations needed to simulate n-gram LMs. Moreover, these findings contribute to an understanding of the possible mechanisms by which transformer LMs may implement formal models of computation.
However, the researchers note that n-gram LMs are a very simple class of LMs, so the resulting lower bounds are somewhat loose; transformer LMs can most likely represent far more complex distributions than n-gram LMs. This points to open questions whose study could further sharpen our understanding and guide the development of transformer LMs.