Transformer-based neural networks have demonstrated remarkable capabilities in tasks such as text generation, editing, and question answering. These networks typically improve as their parameter counts grow. Notably, some relatively small models punch above their weight: the 2B-parameter MiniCPM, for instance, performs comparably to much larger models. Yet while the computational resources devoted to training keep expanding, the supply of high-quality training data struggles to keep pace.
Several lines of work bear on these constraints: scaling laws, energy-based models, and Hopfield networks. Scaling laws describe how a model's performance improves predictably as its size and the volume of its training data increase. Energy-based models represent a probability distribution through a learnable energy function, assigning higher probability to configurations with lower energy. Hopfield networks, meanwhile, were developed as models of associative memory, retrieving stored patterns from partial or noisy cues. Turning these principles into practice, researchers from the Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd. introduced a theoretical framework focused on the interplay between the memorization process and performance dynamics in transformer-based language models.
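For concreteness, the sketch below writes out common textbook forms of these three ingredients. The specific parameterizations shown (a Chinchilla-style scaling law, a Gibbs-style energy-based density, and the modern continuous Hopfield energy) are illustrative assumptions, not necessarily the exact expressions used by the authors.

```latex
% Chinchilla-style scaling law: expected loss as a function of
% parameter count N and training tokens D (E, A, B, alpha, beta are fitted).
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Energy-based model: a density defined by a learnable energy E_\theta.
p_{\theta}(x) = \frac{\exp\!\left(-E_{\theta}(x)\right)}
                     {\int \exp\!\left(-E_{\theta}(x')\right)\,\mathrm{d}x'}

% Modern (continuous) Hopfield energy for stored patterns x_1,\dots,x_N
% and query \xi; its minima lie near the stored patterns.
E(\xi) = -\frac{1}{\beta}\log\sum_{i=1}^{N}\exp\!\left(\beta\, x_i^{\top}\xi\right)
         + \frac{1}{2}\,\xi^{\top}\xi
```

The retrieval update of such a modern Hopfield network coincides with the attention operation of a transformer layer, which is what makes the associative-memory reading of transformers natural.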
The researchers conducted a series of experiments with GPT-2 across a range of data sizes to probe signs of saturation. They also trained conventional Transformer models on a dataset of two million tokens. These experiments yielded theoretical insights that can guide and improve model training. One experiment involved training a 12-layer transformer language model, using the GPT-2 small tokenizer and architecture, on the OpenWebText dataset; three models were trained on subsets of this dataset and then compared under different amounts of training data, roughly along the lines sketched below.
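A minimal sketch of such a setup is shown below, using the Hugging Face transformers and datasets libraries. Everything here (the 1% subset slice, batch size, sequence length, and output path) is an illustrative assumption, not the authors' actual training configuration.

```python
# Sketch: train a 12-layer GPT-2-small-style model on an OpenWebText subset.
# Hyperparameters and the subset fraction are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Take a small slice of OpenWebText to mimic the subset experiments.
raw = load_dataset("openwebtext", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# GPT-2 small architecture: 12 layers, 12 heads, 768-dim embeddings.
config = GPT2Config(n_layer=12, n_head=12, n_embd=768)
model = GPT2LMHeadModel(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(
    output_dir="gpt2-small-owt-subset",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Repeating the same script with different split fractions (e.g. 0.1%, 1%, 100%) reproduces the kind of data-size comparison described here, with the causal language-modeling (cross-entropy) loss tracked for each run.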
The researchers found that training with just 0.1% (9M tokens) of the OpenWebText data led to overfitting: the training loss kept shrinking over iterations while validation performance deteriorated. Because this small set of training samples was well separated, the model's energy collapsed toward a sum of delta functions centered on the samples, i.e., rote memorization. By contrast, when the model's size was on the order of O(D²) and it was trained on 90M tokens, it achieved training and validation losses similar to those of the model trained on 9B tokens. Meanwhile, two conventional Transformers, with six and ten layers respectively and trained with a batch size of eight, saw their training losses stabilize at a value around one.
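One schematic way to read the delta-function claim, under the simplifying assumption that the energy attains its deepest minima exactly at the N stored training samples, is through the associated Gibbs density: as the inverse temperature grows, all probability mass concentrates on those samples.

```latex
% Gibbs density induced by the energy E; with global minima at the stored
% samples x_1,\dots,x_N, the mass concentrates on them as \beta \to \infty.
p_{\beta}(\xi) = \frac{e^{-\beta E(\xi)}}{\int e^{-\beta E(\xi')}\,\mathrm{d}\xi'}
\;\xrightarrow[\ \beta\to\infty\ ]{}\;
\sum_{i=1}^{N} w_i\,\delta(\xi - x_i),
\qquad \sum_{i=1}^{N} w_i = 1
```

In this reading, overfitting corresponds to a model whose effective distribution places essentially all of its mass on the training samples themselves, consistent with a training loss that keeps falling while validation loss worsens.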
Ultimately, the researchers produced a framework centered on the memorization process and performance dynamics of transformer-based language models. By modelling these networks as associative memories and focusing on the cross-entropy loss, they constructed a global energy function for the layered structure of transformer models using the majorization-minimization technique. They ran experiments with GPT-2 across different data sizes and trained conventional Transformer models on a dataset of 2M tokens. The researchers concluded that this approach helps optimize how such models are trained and provides critical insight into the mechanics of neural networks.
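Majorization-minimization itself follows a generic recipe, sketched below; the particular surrogate the authors build for the layered transformer energy is not reproduced here.

```latex
% Generic majorization-minimization (MM) step for an objective f:
% construct a surrogate g(\cdot \mid \theta_t) that majorizes f and
% touches it at the current iterate \theta_t, then minimize the surrogate.
g(\theta \mid \theta_t) \ge f(\theta) \quad \forall \theta,
\qquad g(\theta_t \mid \theta_t) = f(\theta_t),
\qquad \theta_{t+1} = \arg\min_{\theta}\, g(\theta \mid \theta_t)

% These two properties guarantee monotone descent:
f(\theta_{t+1}) \le g(\theta_{t+1} \mid \theta_t) \le g(\theta_t \mid \theta_t) = f(\theta_t)
```

Applied across layers, such a construction is one way to stitch per-layer energies into a single global energy for the full stack, which is the sense in which the framework assigns an energy to the transformer as a whole.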