Transformer-based neural networks have demonstrated proficiency in a variety of tasks, such as text generation, editing, and question answering. Measurements of perplexity and end-task accuracy consistently show that models with more parameters perform better, which has pushed the industry toward ever larger models. In some cases, however, larger models do not guarantee superior performance. The 2-billion-parameter MiniCPM, for instance, shows capabilities comparable to much larger language models such as Llama2-7B, Mistral-7B, Gemma-7B, and Llama-13B. This suggests that the correlation between model size and effectiveness can break down, and in particular that larger models cannot realize their advantage without enough high-quality training data.
To make sense of these performance inconsistencies, researchers commonly turn to scaling laws, energy-based models, and Hopfield models, which focus on size scaling, energy functions, and associative memory, respectively. Against this backdrop, researchers at Huawei Technologies' Central Research Institute and 2012 Laboratories propose a theoretical framework that describes model performance dynamics and the memorization process.
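As background, one way this associative-memory view is usually formalized (the particular formula below is the modern continuous Hopfield energy from the broader literature, not an equation quoted from the Huawei paper) is through an energy function over a query state $\xi$ and stored patterns $x_1, \dots, x_N$:

$$E(\xi) = -\frac{1}{\beta}\log\sum_{i=1}^{N}\exp\!\left(\beta\, x_i^{\top}\xi\right) + \frac{1}{2}\lVert\xi\rVert^{2} + \text{const.}$$

Its local minima lie near the stored patterns, so retrieving a memory amounts to descending the energy, which is the sense in which transformer layers can be read as associative memories.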
To evaluate the proposed framework, the researchers conducted a series of experiments with the transformer-based language model GPT-2 across a spectrum of data sizes. For these tests, they also trained a classic Transformer model on a dataset of two million tokens. The experiment series validated their theoretical results and offered valuable insight into the optimal cross-entropy loss, which could guide and enhance decision-making in future model training.
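For clarity, the cross-entropy loss referred to throughout is the standard autoregressive language-modeling objective (a textbook definition rather than a quantity introduced by this work): for a token sequence $x_1, \dots, x_T$ and model parameters $\theta$,

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_{\theta}\!\left(x_t \mid x_{<t}\right),$$

and the theoretical analysis concerns how low this loss can go as the model size and the amount of training data vary.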
In their study, the researchers used a twelve-layer transformer language model with the GPT-2 small tokenizer and architecture, trained on the OpenWebText dataset, which contains nine billion tokens from over eight million documents. They trained three models on different amounts of data: the full dataset and subsets consisting of the first 1% and the first 0.1% of OpenWebText, respectively.
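A minimal sketch of that setup, assuming the Hugging Face transformers and datasets libraries (the dataset identifier and the document-level subsetting below are illustrative placeholders, not the authors' code):

```python
# Sketch: a 12-layer GPT-2-small-style model and 1% / 0.1% slices of OpenWebText.
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # GPT-2 small tokenizer
config = GPT2Config(n_layer=12, n_head=12, n_embd=768)  # GPT-2 small architecture
model = GPT2LMHeadModel(config)                         # randomly initialized, trained from scratch

# Public OpenWebText mirror on the Hugging Face Hub (assumed identifier).
full = load_dataset("Skylion007/openwebtext", split="train")
subset_1pct = full.select(range(int(0.01 * len(full))))    # first 1% of documents
subset_01pct = full.select(range(int(0.001 * len(full))))  # first 0.1% of documents
```

Taking the leading slices rather than random samples mirrors the description above of using the first 1% and 0.1% of the data.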
Their results showed that training on 0.1% of the OpenWebText data led to overfitting. As iterations progressed, the training loss kept shrinking toward zero, indicating that the model energy had collapsed into a sum of a few delta functions centered on the limited set of training examples. On the other hand, if the model size was approximately O(D²) and the model was trained on 90 million tokens, it could achieve training and validation losses comparable to the setting with nine billion tokens.
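In schematic terms (notation chosen here for illustration, echoing the description above rather than reproducing the paper's equations), an energy landscape that has collapsed onto $n$ memorized training examples $x_1, \dots, x_n$ looks like

$$E(x) \approx -\sum_{i=1}^{n} a_i\, \delta(x - x_i), \qquad a_i > 0,$$

a handful of isolated, infinitely sharp wells rather than a smooth surface that interpolates between examples, which is the signature of memorization without generalization.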
In conclusion, the researchers developed a theoretical framework that links memorization and performance in transformer-based language models. They modeled the transformer-based networks as associative memories and defined the cross-entropy loss as a function of model and data sizes. They then validated the theory experimentally, running GPT-2 with different data sizes and training several Transformer models on a two-million-token dataset. Their findings lay the groundwork for constructing a global energy function for the layered structure of transformer models using the majorization-minimization technique.
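For readers unfamiliar with the technique, majorization-minimization in its generic, textbook form (sketched here independently of the paper's specific construction) minimizes an objective $f(\theta)$ by building, at each step, a surrogate $g(\theta \mid \theta_t)$ that satisfies

$$g(\theta \mid \theta_t) \ge f(\theta)\ \text{for all}\ \theta, \qquad g(\theta_t \mid \theta_t) = f(\theta_t),$$

and then updating $\theta_{t+1} = \arg\min_{\theta} g(\theta \mid \theta_t)$. This guarantees monotone descent, since $f(\theta_{t+1}) \le g(\theta_{t+1} \mid \theta_t) \le g(\theta_t \mid \theta_t) = f(\theta_t)$, which is the kind of property that makes it natural for composing per-layer energies into a single global energy for a stacked architecture.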