A recent study in Nature has highlighted that artificial intelligence (AI) models, specifically large language models (LLMs), suffer a marked drop in quality when trained on data created by earlier AI models. This degradation across successive generations, known as “model collapse”, could undermine the quality of future AI models, especially as AI-generated content becomes increasingly prevalent online and, with it, increasingly likely to end up in training datasets.
The research, carried out by experts from the University of Cambridge, University of Oxford, and other institutions, showed that when AI models are continually trained on data produced by previous iterations, they start producing nonsense. This effect was observed in various AI models, including language models, variational autoencoders, and Gaussian mixture models.
In one crucial experiment with language models, the team fine-tuned the OPT-125m model on the WikiText-2 dataset and then used it to generate new text. This AI-created text was then used to train the next iteration of the model, and this process was repeated multiple times. By the ninth generation, the model started outputting total gibberish.
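To make the setup concrete, here is a minimal sketch of that recursive training loop using the Hugging Face transformers and datasets libraries. The hyperparameters, sampling settings, corpus sizes, and the choice to fine-tune a fresh copy of the base OPT-125m checkpoint each round are illustrative assumptions rather than the paper’s exact configuration.

```python
from datasets import Dataset, load_dataset
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "facebook/opt-125m"
NUM_GENERATIONS = 9  # the article reports collapse by the ninth generation

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


def fine_tune(texts, output_dir):
    """Fine-tune a fresh copy of the base model on a list of raw text strings."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    ds = Dataset.from_dict({"text": texts}).map(
        tokenize, batched=True, remove_columns=["text"]
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=4,
        report_to="none",
    )
    Trainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()
    return model


def generate_corpus(model, num_samples=200, max_new_tokens=128):
    """Sample fresh text from the current generation's model."""
    model.eval()
    bos = torch.tensor([[tokenizer.bos_token_id]], device=model.device)
    texts = []
    with torch.no_grad():
        for _ in range(num_samples):
            out = model.generate(
                bos,
                do_sample=True,
                top_p=0.95,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.pad_token_id,
            )
            texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts


# Generation 0 is trained on real, human-written WikiText-2 text.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
human_texts = [t for t in wikitext["text"] if t.strip()]
model = fine_tune(human_texts, "gen_0")

# Every later generation is trained only on text written by its predecessor.
for gen in range(1, NUM_GENERATIONS + 1):
    synthetic_texts = generate_corpus(model)
    model = fine_tune(synthetic_texts, f"gen_{gen}")
```

Evaluating each generation’s perplexity on held-out human-written text is one simple way to watch the degradation accumulate from round to round.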
The researchers also observed that, before collapsing completely, models first lose information about rare or infrequent events, which is troubling because such events often concern marginalized groups or outliers. As this long tail of the data disappears, models risk confining their responses to a narrow spectrum of ideas and beliefs, thereby entrenching biases.
To combat this, AI firms are partnering with news outlets and publishers to secure a continuous supply of high-quality, human-written data. Access to original, human-generated sources is key to ensuring the quality of future AI models. However, a recent study by Dr. Richard Fletcher, Director of Research at the Reuters Institute for the Study of Journalism, found that 48% of the most popular news sites worldwide now block OpenAI’s crawlers, and 24% block Google’s AI crawlers. These restrictions shrink the pool of fresh, high-quality data available for training, raising the chance that models are trained on subpar or outdated data.
The researchers argue that AI-generated content must be tracked carefully to avoid its accidental inclusion in training datasets, although this is becoming exceedingly difficult given the rising volume of such content. The study suggests four possible solutions: watermarking AI-created content to differentiate it from human-created data; incentivising humans to keep generating high-quality content; improving methods for filtering and curating training data; and finding ways to maintain access to original data not generated by AI.
Stanford researchers who examined model collapse independently reached a more nuanced conclusion. They found that when each new generation’s training data completely replaced the previous data, performance fell rapidly across all tested models. However, when synthetic data was added to the existing dataset rather than replacing it, model collapse was largely avoided. Model collapse, then, is not inevitable: it depends on how much AI-generated data enters the training set and on the balance of synthetic to authentic data.
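A toy numerical example makes the replace-versus-accumulate distinction concrete. The sketch below stands in for the language-model experiments with something far simpler, repeatedly fitting a Gaussian to data and sampling from the fit; the sample sizes and generation count are arbitrary illustrative choices, not figures from either study.

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=1_000)  # the original "human" data
GENERATIONS = 200
SAMPLES_PER_GEN = 100


def fit_and_sample(data, n):
    """'Train' a model (fit a Gaussian) and 'generate' n synthetic points from it."""
    mu, sigma = data.mean(), data.std()
    return rng.normal(mu, sigma, size=n), sigma


# Strategy 1 (replace): each generation trains only on its predecessor's output.
data = real_data
replace_sigma = None
for _ in range(GENERATIONS):
    data, replace_sigma = fit_and_sample(data, SAMPLES_PER_GEN)

# Strategy 2 (accumulate): synthetic data is added to everything seen so far.
pool = real_data.copy()
accumulate_sigma = None
for _ in range(GENERATIONS):
    synthetic, accumulate_sigma = fit_and_sample(pool, SAMPLES_PER_GEN)
    pool = np.concatenate([pool, synthetic])

# The true spread is 1.0. Under replacement, estimation error compounds from one
# generation to the next and the fitted spread typically wanders far from 1.0
# (losing the tails of the distribution); under accumulation, the original data
# keeps anchoring the fit and the estimate stays close to 1.0.
print(f"fitted std after {GENERATIONS} generations, replace:    {replace_sigma:.3f}")
print(f"fitted std after {GENERATIONS} generations, accumulate: {accumulate_sigma:.3f}")
```

The qualitative pattern mirrors the reported results: when the original data stays in the mix, it anchors each new fit, whereas pure replacement lets estimation errors compound from one generation to the next.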
Given these concerns, AI companies will undoubtedly be seeking long-term solutions should model collapse begin to appear in advanced models. The findings underscore how carefully training data must be managed to prevent the degradation of future AI models.