AI model collapse, a phenomenon in which models degrade because they are trained on data containing the outputs of earlier models, has become a major concern for researchers. As large-scale models are trained on ever-expanding web-scale datasets that increasingly include AI-generated content, there is worry that model performance will degrade generation after generation, eventually rendering newer models ineffective and compromising the quality of the data available to train them.
Researchers have explored model collapse through several experimental designs: replacing real data entirely with generated data, augmenting a fixed real dataset with synthetic data, and mixing real and synthetic data in various proportions. However, these studies mostly held the amount of training data fixed at each iteration and did not adequately examine what happens when data accumulates over time, which more closely resembles how Internet-scale datasets actually evolve.
A research team from Stanford University has conducted a new study examining the impact of data accumulation on model collapse. The team ran experiments on transformers, diffusion models, and variational autoencoders across multiple data modalities. They found that accumulating synthetic data alongside real data prevents model collapse, in sharp contrast to replacing real data with synthetic data, the setting in which earlier studies observed test error growing linearly with each iteration. The team's accompanying proofs show that data accumulation yields a finite, well-controlled upper bound on test error, no matter how many model-fitting iterations are performed. Intuitively, each new round of synthetic data makes up a shrinking fraction of the ever-growing training pool, so the extra error it introduces forms a convergent series rather than a linearly growing sum.
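To make the two strategies concrete, here is a minimal toy sketch, not the paper's experiments: a one-dimensional Gaussian is repeatedly fit to data and sampled from, and the synthetic sample either replaces the training set or accumulates alongside everything seen so far. The `iterate` helper, batch size, and iteration count are all illustrative choices.

```python
import numpy as np

# Toy illustration of replace vs. accumulate (not the paper's setup):
# repeatedly fit a 1-D Gaussian, sample a fresh synthetic batch from the
# fit, and either discard the old data or keep it.

rng = np.random.default_rng(0)

def iterate(strategy, n_iters=300, batch=50):
    data = rng.normal(0.0, 1.0, size=batch)        # initial "real" data ~ N(0, 1)
    for _ in range(n_iters):
        mu, sigma = data.mean(), data.std()        # "fit" the model
        synth = rng.normal(mu, sigma, size=batch)  # sample synthetic data from it
        if strategy == "replace":
            data = synth                           # discard all earlier data
        else:                                      # "accumulate"
            data = np.concatenate([data, synth])   # keep real + all synthetic data
    return data.std()

print("replace:    final std ~", round(float(iterate("replace")), 3))
print("accumulate: final std ~", round(float(iterate("accumulate")), 3))
```

Under replacement, the fitted standard deviation typically drifts toward zero as sampling noise compounds with nothing to correct it; under accumulation, the original real data keeps anchoring the estimate near the true value of 1, mirroring the bounded-error behavior described above.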
The research team tested the theory on transformer-based causal language models such as GPT-2 and Llama2. Comparing the replacement and accumulation strategies over multiple model-fitting iterations, they found that accumulation maintained or even improved performance from one iteration to the next, whereas replacement led to consistent increases in test cross-entropy, indicating steadily worsening performance.
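For readers who want the flavor of this iterative-retraining protocol, below is a hedged sketch rather than the authors' code: it assumes the Hugging Face transformers library and PyTorch, and the model choice (GPT-2), sequence lengths, and hyperparameters are illustrative stand-ins.

```python
# Sketch of the replace-vs-accumulate retraining loop (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def train_generation(texts, epochs=1, lr=5e-5):
    """Fine-tune a fresh GPT-2 on the current corpus, one text per step."""
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tok(text, return_tensors="pt", truncation=True, max_length=256)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

def generate_corpus(model, n_texts, length=128):
    """Sample a synthetic corpus from the current generation's model."""
    model.eval()
    start = torch.tensor([[tok.bos_token_id]])
    with torch.no_grad():
        return [
            tok.decode(
                model.generate(start, max_new_tokens=length, do_sample=True,
                               pad_token_id=tok.eos_token_id)[0],
                skip_special_tokens=True,
            )
            for _ in range(n_texts)
        ]

def heldout_cross_entropy(model, texts):
    """Average language-modeling loss on held-out real text."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for t in texts:
            batch = tok(t, return_tensors="pt", truncation=True, max_length=256)
            total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / len(texts)

def run(real_train, real_heldout, strategy, n_generations=5):
    data = list(real_train)
    for gen in range(n_generations):
        model = train_generation(data)
        synth = generate_corpus(model, n_texts=len(real_train))
        # The only difference between the two strategies is this branch:
        data = synth if strategy == "replace" else data + synth
        print(gen, strategy, heldout_cross_entropy(model, real_heldout))
```

In the replacement branch, each generation trains only on the previous generation's samples; in the accumulation branch, the real corpus never leaves the training pool, which is the stabilizing ingredient the study identifies.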
Experiments on diffusion models and variational autoencoders told the same story: replacing data caused significant degradation, while accumulating it did not. These results strengthen the claim that data accumulation can effectively prevent model collapse across a range of AI domains and data modalities.
Overall, the research suggests that the so-called ‘curse of recursion’ may be less threatening than once assumed, provided synthetic data accumulates alongside real data rather than replacing it. This finding offers new insight that may shape future strategies for preventing model collapse and help sustain progress in the field as AI-generated content continues to spread across the web.