Language models (LMs) such as BERT or GPT-2 face challenges in self-supervised learning due to a phenomenon known as representation degeneration. These models train neural networks on token sequences to produce contextual representations, with a language modeling head, typically a linear layer, mapping those representations to next-token probability distributions. Current trends favor scaling up generative pretraining, as seen with GPT-2, but hardware and energy constraints remain a prominent concern.
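As a concrete illustration of the setup described above, here is a minimal PyTorch sketch of a linear language modeling head turning hidden states into next-token distributions; the hidden width, vocabulary size, and batch shapes are arbitrary choices for the example, not values from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of a linear language modeling head (illustrative sizes only).
d_model, vocab_size = 512, 50257          # hidden width and vocabulary size, chosen arbitrarily

lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight W has shape (vocab_size, d_model)

hidden_states = torch.randn(4, 16, d_model)            # (batch, sequence, d_model) from the transformer
logits = lm_head(hidden_states)                        # (batch, sequence, vocab_size)
next_token_probs = torch.softmax(logits, dim=-1)       # contextual next-token probability distributions
```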
Performance saturation, where a model's performance plateaus or even declines over the course of training, has been observed in the late pretraining phases of smaller models trained on extensive corpora, as shown by the evaluation of the Pythia suite of LMs. In particular, the smaller Pythia models exhibit performance drops on the LAMBADA benchmark late in training.
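One way to observe this kind of late-training behavior is to probe the publicly released intermediate Pythia checkpoints. The rough sketch below loads a few checkpoint revisions and tracks cross-entropy on a toy text sample; the chosen model size, step names, and text are illustrative, and the paper's actual evaluation relies on the LAMBADA benchmark rather than this ad hoc sample.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: compare the loss of a few Pythia intermediate checkpoints.
# EleutherAI publishes checkpoints as git revisions named "step<N>"; these steps are examples.
model_name = "EleutherAI/pythia-160m"
revisions = ["step50000", "step100000", "step143000"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "The quick brown fox jumps over the lazy dog. " * 20   # toy sample, not LAMBADA
inputs = tokenizer(text, return_tensors="pt")

for rev in revisions:
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=rev)
    model.eval()
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{rev}: cross-entropy = {loss.item():.3f}")
```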
This research, by teams at Inria Paris and Sorbonne Université, explores the connection between saturation and representation degeneration in these smaller models. The authors argue that with small hidden dimensions, the linear language modeling head can become a performance bottleneck: a mismatch arises between the model's small hidden dimension and the high rank of the target contextual probability distribution, which degrades performance through the softmax bottleneck phenomenon.
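The core of the rank argument can be made concrete with a small numerical sketch: logits produced by a linear head are of the form W h, so the matrix of logits over many contexts has rank at most the hidden dimension d, regardless of vocabulary size. The sizes below are arbitrary illustrations, not the paper's settings.

```python
import torch

# Softmax-bottleneck intuition: a linear head of hidden size d can only produce
# logit matrices of rank <= d, however large the vocabulary V is.
d, V, n_contexts = 64, 1000, 5000     # illustrative sizes

W = torch.randn(V, d)                 # linear LM head weights
H = torch.randn(d, n_contexts)        # hidden states for n_contexts contexts
logits = W @ H                        # (V, n_contexts) matrix of logits

print(torch.linalg.matrix_rank(logits).item())   # at most d = 64

# If the ideal (target) log-probability matrix over the same contexts has rank
# much larger than d, no choice of W and H can match it exactly: that gap is
# the performance bottleneck discussed above.
```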
The research's main contributions include: characterizing the performance saturation of small LMs through evaluation and extrapolation of scaling laws; identifying the degeneration of last-layer representations in these smaller models; empirically establishing that the target contextual probability distribution has high rank and that a low-rank linear head hurts performance; and theoretically quantifying the performance limitation that such an LM head induces.
Anisotropy, a form of representation degeneration in which representations lose angular variability, was found to be prevalent in small language models, and a connection was observed between anisotropy and performance saturation in the Pythia models. Performance saturation also co-occurs with a characteristic spectral saturation pattern in the singular value distributions of language modeling heads. Finally, a theoretical connection is drawn between the performance bottleneck induced by a low-rank head and the dimensionality of the contextual probability distribution.
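The two diagnostics mentioned above can be sketched as follows: anisotropy measured as the average pairwise cosine similarity of last-layer representations, and spectral saturation read off the singular values of the LM head weight matrix. The tensors below are random stand-ins for real activations and weights, chosen only to make the snippet self-contained.

```python
import torch

# Stand-ins for real data: last-layer representations and an LM head weight matrix.
hidden = torch.randn(2048, 512)        # (n_tokens, d_model) representations
W_head = torch.randn(50257, 512)       # (vocab_size, d_model) head weights

# (1) Anisotropy: average cosine similarity between distinct representation pairs
#     (values close to 1 indicate representations clustered in a narrow cone).
normed = torch.nn.functional.normalize(hidden, dim=-1)
cos = normed @ normed.T
n = cos.shape[0]
anisotropy = (cos.sum() - n) / (n * (n - 1))
print(f"mean cosine similarity: {anisotropy.item():.4f}")

# (2) Spectral saturation: a sharp drop-off in the singular value spectrum of the
#     head indicates an effectively low-rank mapping.
singular_values = torch.linalg.svdvals(W_head)
print(singular_values[:10])
```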
In conclusion, the study highlights the performance saturation that affects small LMs. The difficulty stems from mapping high-rank contextual probability distributions onto the low-dimensional output representations produced ahead of a linear language modeling head. A theoretical link was established between this performance gap and the spectral properties of the contextual probability distributions. Finally, the study found significant performance drops when the hidden dimension of the language modeling head was reduced below 1,000, and showed that last-layer anisotropy and spectral saturation of LM heads correlate strongly with the occurrence of saturation.