The phenomenon of “model collapse” represents a significant challenge in artificial intelligence (AI) research, particularly impacting large language models (LLMs). When these models are continually trained on data created by earlier versions of similar models, they lose their ability to accurately represent the underlying data distribution, deteriorating in effectiveness over successive generations.
Current AI models are trained primarily on large datasets created by humans, with techniques such as data augmentation, regularization, and transfer learning used to improve robustness. These techniques have limitations, however: models remain vulnerable to data poisoning, and generative models such as variational autoencoders (VAEs) and Gaussian mixture models (GMMs) are prone to catastrophic forgetting, both of which degrade performance.
Researchers have taken a fresh approach to this problem by examining the “model collapse” phenomenon in detail. They provided empirical evidence and a theoretical framework explaining how models trained on recursively generated data gradually lose the ability to accurately represent the underlying data distribution. A key contribution is identifying the sources of error: statistical approximation error (from learning a distribution from a finite sample), functional expressivity error, and functional approximation error. These errors compound over generations, leading to model collapse.
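To make the compounding of statistical approximation error concrete, the following toy sketch (not taken from the paper; the 1-D Gaussian setup, sample size, and generation count are illustrative assumptions) repeatedly fits a Gaussian to samples drawn from the previous generation’s fitted model and tracks the fitted parameters across generations:

```python
import numpy as np

# Minimal toy analogue of recursive training (not the paper's experiment):
# fit a 1-D Gaussian to a finite sample, then repeatedly sample a new
# "training set" from the fitted model and refit. Each finite-sample fit
# slightly mis-estimates the true distribution (statistical approximation
# error); because later generations only ever see earlier generations'
# output, the error compounds and the fitted variance shrinks on average
# until the model collapses onto a near-degenerate distribution.

rng = np.random.default_rng(0)
n_samples = 50                             # small sample -> large approximation error
data = rng.normal(0.0, 1.0, n_samples)     # generation-0 "human" data

for generation in range(1, 201):
    mu_hat, sigma_hat = data.mean(), data.std()   # fit this generation's model
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean={mu_hat:+.3f}  std={sigma_hat:.3f}")
    # The next generation trains only on data sampled from the fitted model.
    data = rng.normal(mu_hat, sigma_hat, n_samples)
```

Because each generation sees only a finite sample of the previous model’s output, small estimation errors, especially in the tails, are never corrected and instead accumulate.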
In the study, the researchers trained language models on datasets such as wikitext2 and demonstrated the effects of “model collapse” through controlled experiments. Detailed perplexity analyses showed a significant increase in perplexity over successive generations, indicating a clear degradation in model performance.
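Perplexity is the exponential of the average per-token negative log-likelihood on held-out text. The snippet below is a minimal sketch of how such an evaluation is typically run on wikitext-2 with a Hugging Face causal language model; the model name ("gpt2"), the non-overlapping chunking, and other details are illustrative assumptions rather than the paper’s exact setup.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative wikitext-2 perplexity evaluation; model and chunking are assumptions.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
input_ids = encodings.input_ids[0]

max_len = 1024                     # GPT-2 context window
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, input_ids.size(0), max_len):
        chunk = input_ids[start:start + max_len].unsqueeze(0)
        if chunk.size(1) < 2:      # need at least one label after the causal shift
            break
        loss = model(chunk, labels=chunk).loss    # mean NLL over chunk_len - 1 targets
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(total_nll / total_tokens):.2f}")
```

Rerunning the same evaluation on each successive generation’s model would surface the trend described above: later generations assign lower probability to held-out human text, so their perplexity rises.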
Findings showed that models trained on recursively generated data become less accurate over successive generations. However, preserving a portion of the original human-generated data during training significantly mitigates model collapse: when 10% of the original data was retained, accuracy reached 87.5% on a benchmark dataset, surpassing previous state-of-the-art results by 5%. This underscores the importance of retaining access to genuine human-generated data to maintain model performance.
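In practice, this mitigation amounts to never letting human-written data vanish from the training mix. The helper below is a hypothetical sketch (the function name, 10% ratio, and toy corpora are assumptions, not the authors’ code) of how each generation’s training set could reserve a fixed share of genuine human examples:

```python
import random

# Hypothetical helper: build the next generation's training set so that a
# fixed fraction of genuine human-written examples is always retained
# alongside model-generated text, anchoring later generations to the true
# data distribution.
def build_next_training_set(human_texts, generated_texts, keep_fraction=0.10, seed=0):
    rng = random.Random(seed)
    n_total = len(generated_texts)
    n_human = min(int(keep_fraction * n_total), len(human_texts))
    mixed = rng.sample(human_texts, n_human) + rng.sample(generated_texts, n_total - n_human)
    rng.shuffle(mixed)
    return mixed

# Example: 10% of the next training set comes from the original human corpus.
human_corpus = [f"human sentence {i}" for i in range(1000)]
synthetic_corpus = [f"model-generated sentence {i}" for i in range(1000)]
next_train = build_next_training_set(human_corpus, synthetic_corpus)
print(sum(t.startswith("human") for t in next_train), "of", len(next_train), "examples are human-written")
```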
The research provides a comprehensive investigation into the phenomenon of model collapse in generative models, both theoretically and empirically. The proposed solution involves understanding and mitigating the sources of errors leading to collapse. These findings suggest the need for maintaining access to genuine human-generated data to prevent the degradation of AI models. This seminal work advances the field of AI by addressing a critical challenge affecting the reliability of AI systems in the long term.