The training data for large language models (LLMs) and the other generative models underpinning products such as ChatGPT, Stable Diffusion and Midjourney comes initially from human sources -- books, articles, and photographs that were created without the help of artificial intelligence. But as more people use AI to produce and publish content, that content will gradually pollute the internet, and AI models will begin to train on it.
Writing in a paper posted to the preprint server arXiv, a team of boffins from Cambridge University and the University of Edinburgh found that using model-generated content in training causes irreversible defects in the resulting models.
Looking specifically at probability distributions for text-to-text and image-to-image generative AI models, the researchers concluded that "learning from data produced by other models causes model collapse -- a degenerative process whereby, over time, models forget the true underlying data distribution... this process is inevitable, even for cases with almost ideal conditions for long-term learning."
In fact, the report suggests that model collapse will happen quickly, because models can rapidly forget most of the original data from which they initially learned.
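For readers who want a feel for the mechanism, here is a toy sketch in Python. It is not the researchers' code or experiment: it assumes the "model" is nothing more than a Gaussian fitted to a small sample, and that each generation trains only on data drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Toy assumption: generation 0 is the "human" data distribution N(0, 1);
    every later generation sees only samples from the previous fit.
    Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # Generate synthetic "content" from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Train the next model on that content alone (maximum-likelihood fit).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spread = simulate_collapse()
    for generation in (0, 10, 50, 100, 300):
        print(f"generation {generation:3d}: fitted std dev = {spread[generation]:.3g}")
```

In this sketch each refit slightly underestimates the spread on average, and the sampling errors compound, so the fitted distribution narrows and the tails of the original data are the first thing to go. Real generative models are vastly more complicated, so treat this as an intuition pump rather than a reproduction of the paper's results.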
One of the paper's authors, Ross Anderson, professor of security engineering at Cambridge University and the University of Edinburgh, warned that humanity was about to fill the internet with blah in the same way it put plastic into the oceans.
"This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data," he said.