From Nature
AI models fed AI-generated data quickly spew nonsense
By Elizabeth Gibney
Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.
“The message is we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong,” he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI...
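The dynamic the researchers describe can be illustrated with a toy experiment (a minimal sketch, not from the study itself): stand in for a generative model with a one-dimensional Gaussian that is fitted to its training data, then used to generate the next generation's training set. All names and sample sizes below are illustrative assumptions.

```python
import numpy as np

# Toy illustration of model collapse: a "generative model" that simply
# fits a Gaussian (mean, std) to its training data, then produces the
# next generation's training data by sampling from itself.
# Each finite sample under-represents the distribution's tails, so the
# fitted spread drifts toward zero and diversity is gradually lost.
# (Illustrative sketch only; real LLM collapse is far more complex.)
rng = np.random.default_rng(42)

n_samples = 50  # a small sample makes the effect visible quickly
data = rng.normal(0.0, 1.0, n_samples)  # generation 0: "human" data

for gen in range(201):
    mu, sigma = data.mean(), data.std()      # "train" on current data
    if gen % 25 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, std={sigma:.3f}")
    data = rng.normal(mu, sigma, n_samples)  # next generation trains
                                             # only on model output
```

Run for a few hundred generations, the fitted standard deviation shrinks toward zero: the "model" forgets the rare, tail-end examples first and ends up producing nearly identical output, a simplified version of the loss of diversity the study warns about.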
The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality...