Language models (LMs) such as BERT or GPT-2 face a challenge in self-supervised learning known as representation degeneration. These models train a neural network on token sequences to produce contextual representations; a language modeling head, often a linear layer with learnable parameters, then maps each representation to a probability distribution over the next token.…
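As a rough illustration of this setup, the sketch below wires a toy Transformer body to a linear language-modeling head; the encoder, the layer counts, and all sizes are placeholder assumptions for illustration, not any particular model's configuration.

```python
# Minimal sketch (assumes PyTorch); sizes and the toy encoder are
# illustrative assumptions, not a specific model's architecture.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 50257, 768  # GPT-2-small-like sizes, for illustration

# Stand-in for a Transformer body: maps token ids to contextual representations.
embedding = nn.Embedding(vocab_size, hidden_dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
    num_layers=2,
)

# Language modeling head: a linear layer projecting each contextual
# representation onto the vocabulary.
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

token_ids = torch.randint(0, vocab_size, (1, 16))    # (batch, seq_len)
hidden = encoder(embedding(token_ids))               # (batch, seq_len, hidden_dim)
logits = lm_head(hidden)                             # (batch, seq_len, vocab_size)
next_token_probs = logits[:, -1].softmax(dim=-1)     # distribution over the next token
```

In practice the head's weight matrix is often tied to the input embedding, but the untied linear layer above suffices to show where the next-token distribution comes from.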
