Researchers from Meta/FAIR Labs and Mohamed bin Zayed University of AI have carried out a detailed exploration of scaling laws for large language models (LLMs). These laws describe the relationship between factors such as a model’s size, the time it takes to train, and its overall performance. While it is commonly held that larger models can store more knowledge, this investigation set out to verify whether the total knowledge a model can store scales linearly with its size, and to pin down the constant that governs this scaling, a figure pivotal to evaluating how efficiently transformer models store knowledge.
Language models typically store factual information as tuples of three strings: (name, attribute, value). The study found that these models can store roughly two bits of knowledge per parameter, a figure that varies with training duration, model architecture, quantization, sparsity constraints, and the signal-to-noise ratio of the data. Interestingly, the researchers also found that models retain more knowledge when training data is prefixed with domain names such as wikipedia.org, which lets them identify and prioritize knowledge-rich domains.
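To make the tuple format and the domain-prefix idea concrete, here is a minimal sketch in Python. The tuple contents, the sentence template, and the `to_training_text` helper are illustrative assumptions, not the paper's actual data pipeline.

```python
# Illustrative only: a (name, attribute, value) knowledge tuple rendered as a
# training sentence, optionally prefixed with its source domain so the model
# can learn which domains are knowledge-rich. Tuple and template are made up.
from dataclasses import dataclass

@dataclass
class KnowledgeTuple:
    name: str       # entity, e.g. a person or place
    attribute: str  # property being described
    value: str      # the stored fact

def to_training_text(t: KnowledgeTuple, domain: str | None = None) -> str:
    sentence = f"{t.name}'s {t.attribute} is {t.value}."
    # Prefixing with a domain name (e.g. wikipedia.org) was reported to raise
    # effective capacity by signalling a knowledge-rich source.
    return f"{domain} {sentence}" if domain else sentence

example = KnowledgeTuple("Anya Forger", "birth city", "Berlin")
print(to_training_text(example))                   # plain training sentence
print(to_training_text(example, "wikipedia.org"))  # domain-prefixed variant
```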
The study yielded several key findings:
– Models such as GPT2 consistently achieve a capacity ratio of 2 bits per parameter across diverse data settings, implying that a 7B model could hold more knowledge than English Wikipedia contains (a rough estimate is sketched after this list).
– Reaching this ratio requires sufficient training, roughly 1,000 exposures per piece of knowledge.
– Model architecture significantly influences capacity, with GPT2 outperforming architectures like LLaMA/Mistral, a gap attributed to the gated MLP those models use.
– Capacity is preserved when the model is quantized to int8 but reduced when quantized to int4.
– Mixture-of-experts models reduce capacity slightly but remain efficient.
– The presence of ‘junk data’ significantly reduces a model’s capacity, although the effect can be mitigated by prefixing useful data with its source domain (as in the wikipedia.org example above).
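The following back-of-the-envelope calculation is a sketch under the reported 2-bits-per-parameter figure; the byte conversions and the quantization-headroom framing are illustrative arithmetic, not numbers taken from the study.

```python
# Rough arithmetic assuming ~2 bits of knowledge per parameter (as reported).
PARAMS = 7e9                      # a 7B-parameter model
BITS_PER_PARAM = 2                # reported capacity ratio

knowledge_bits = PARAMS * BITS_PER_PARAM      # 1.4e10 bits of knowledge
knowledge_gb = knowledge_bits / 8 / 1e9       # ~1.75 GB of distilled facts
print(f"~{knowledge_gb:.2f} GB of distilled knowledge")

# Quantization headroom: knowledge bits per bit of stored weight precision.
for weight_bits, label in [(16, "fp16/bf16"), (8, "int8"), (4, "int4")]:
    utilization = BITS_PER_PARAM / weight_bits
    print(f"{label:9s}: {utilization:.0%} of stored bits carry knowledge")
```

Read this way, int8 still leaves fourfold headroom (about 25% of stored bits carrying knowledge), which is consistent with capacity being preserved, whereas int4 would require half of every stored bit to carry factual content, which may help explain the observed drop.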
Overall, the research indicates that a fully trained transformer model can store about 2 bits of knowledge per parameter, regardless of its size and even after quantization to int8. The researchers also found that key hyperparameters, including training duration, model architecture, precision, and data quality, significantly shape these scaling laws. This methodology and its findings offer a rigorous framework for comparing model capabilities, supporting better decisions about model selection and training. They also provide a foundation for tackling the fundamental question of how large a language model needs to be, potentially guiding future efforts toward Artificial General Intelligence (AGI).