Until about now, most of the text online was written by humans. But this text has been used to train GPT-3(.5) and GPT-4, and these have popped up as writing assistants in our editing tools. So more and more of the text will be written by large language models (LLMs). Where does it all lead? What will happen to GPT-{n} once LLMs contribute most of the language found online?
And it’s not just text. If you train a music model on Mozart, you can expect output that’s a bit like Mozart but without the sparkle – let’s call it ‘Salieri’. And if Salieri now trains the next generation, and so on, what will the fifth or sixth generation sound like?
In our latest paper, we show that using model-generated content in training causes irreversible defects. The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.
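For intuition, here is a toy simulation of the single-Gaussian case (an illustrative sketch only, not the experiments from the paper; the sample size and seed are arbitrary): each generation fits a Gaussian to a finite sample drawn from the previous generation's fitted model, and the estimated spread drifts towards zero.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10                 # finite sample per generation (small, so the drift is visible)
mu, sigma = 0.0, 1.0   # generation 0: the 'human' distribution

for gen in range(1, 51):
    sample = rng.normal(mu, sigma, n)        # data produced by the current model
    mu, sigma = sample.mean(), sample.std()  # the next model is fitted only on that data
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mu = {mu:+.3f}, sigma = {sigma:.4f}")
```

Because each fitted sigma is a noisy, downward-biased estimate, the errors compound across generations: the tails go first, and in the limit the distribution collapses towards a delta function.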
Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.
After we published this paper, we noticed that Ted Chiang had already commented on the effect in February, noting that ChatGPT is like a blurry jpeg of all the text on the Internet, and that copies of copies get worse. In our paper we work through the math, explain the effect in detail, and show that it is universal.
This does not mean that LLMs have no uses. As one example, we originally called the effect model dementia, but decided to rename it after objections from a colleague whose father had suffered dementia. We couldn’t think of a replacement until we asked Bard, which suggested five titles, of which we went for The Curse of Recursion.
So there we have it. LLMs are like fire – a useful tool, but one that pollutes the environment. How will we cope with it?
Sounds as if the AI industry has its own version of entropy.
Sounds a bit more like inbreeding in DNA.
I enjoyed this paper. I have been pointing out this problem with GPT-2 and now GPT-4 for some time. In all the hype there has been no mention of this very obvious problem, which was well understood within cybernetics and information theory many decades ago. It’s also quite a big topic within the epistemology of the philosophy of science. Indeed, I was by no means alone in discussing this issue when I started researching so-called AI in the 1980s.
I hope you can produce a version of your paper that can be used by the media because I think it is essential to show to as wide an audience as possible the *limits* of these systems and, so to speak, the pollution they could generate. AI hype is a perennial problem and is at the moment at a fever pitch.
A link to the New Yorker essay: https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Thanks for this. I’m tracking down articles and papers on this issue and spreading the message.
Thanks for that link, John. I don’t have a subscription to the New Yorker – I’ve had them in the past, but had time to read so little of the flow that I stopped subscribing – so without your help I wouldn’t have seen the full article. Much appreciated!
I am struck by your comment posted here. Thanks for your contribution.
Hello Keith. Your work from the ’80s sounds invaluable.
Many of us must also agree with the pressing need for urgent dissemination of the realities of training AI (using LLMs, etc.) where the source itself is AI-generated content. …Not happening yet, but it will only be a matter of months, not years, before we see such AI-generated content being scraped. Probably already happening now.
It seems like we need two Internets: one for human-generated content (perhaps available for scraping), and one for AI-generated content (invaluable for human learning, and potentially for societal improvement).
James
Perhaps in the light of LLMs we need a new moniker – A.M. – “Artificial Moronicity”?
Thanks for the information provided! We will use this information in our GPT/ChatGPT dataset. See our site too at Islam Berkemajuan.
There’s a nice writeup of our work at VentureBeat.
The BBC World Service covered our work too; see 19:30 into the programme here.
Excellent. If it has been on the World Service it should be picked up far and wide. It will be interesting to see what the reaction is from OpenAI and others.
Why not train models only on text that is ‘policed’ to ensure it retains its meaning? For example, the whole body of English law – open access, electronically available, effectively copyright-free (I believe). And since from it you can – arguably – infer most of human behaviour, there’s a chance the model might end up with something akin to a moral code.
And lest the AI-hypesters forget, the LLM Ur-source, namely human-generated online content, is ALREADY contaminated with distortions, prejudice, and both accidental and deliberate falsehoods. The blind rush to infect our workspaces and overall culture with content that is unreliable yet persuasively and “authoritatively” presented is ultimately a recipe for tragedy. I hope this research will, among other things, remind people of the fundamental axiom: Garbage In Equals Garbage Out!
A thought-provoking thread on YC
Could call it “data dilution” – you keep adding junk output to the input and you end up with mostly output.
Is coprophagy too on-the-nose?
Reading this reminded me of a term from a ST:TNG episode – Up the Long Ladder. The term they used in the episode was ‘replicative fading’. The society was made up of clones who were clones of clones – going back several generations – and as a result they were starting to have genetic issues because of the ‘copying’ process.
So if we need to come up with a new term, science fiction has already given us one: Replicative Fading.
This could be further evidence of the problem.
https://www.theregister.com/2023/06/16/crowd_workers_bots_ai_training/?utm_source=daily&utm_medium=newsletter&utm_content=top-article
More in Business Insider
Just found this but have not run the video because I’m going out.
https://quansight.com/post/openai-pseudocode
Here is another article reporting on the paper.
https://techxplore.com/news/2023-06-ai-death-spiral.html?fbclid=IwAR3Sj_pKJUZR69FJA7UV4BOHeyvg8LxgQnLKlQ-WTNe5JQWbqMQ9QalwbI4
I’m disappointed but not surprised that this “little problem” has not been given much coverage and discussion by the main global TV broadcasters. Nor is it being discussed much in the plethora of academic, business and government groups that have suddenly emerged to urgently discuss the (hyped) hypothetical possibility that AI will exterminate humankind by the end of the decade, or at best will make most of the world’s population redundant in half that time.
Anyway, it appears to be getting reasonable coverage, even among those who seek to fix it. AI’s little problems are seldom discussed, which is one of the reasons they keep happening.
Ross,
I’m surprised you did not mention photocopies of photocopies.
And how the signal sinks and the noise rises.
Doug just wrote a good companion to this paper:
Gödel, Escher, Bach, and AI
A dazzlingly fast chatbot cannot replace the authentic and reflective voice of a thinking, living human being.
By Douglas Hofstadter
https://www.theatlantic.com/ideas/archive/2023/07/godel-escher-bach-geb-ai/674589/
I think this is a valid concern, and it is something that GPT model developers need to be aware of. However, I also think that there are ways to mitigate this problem.
Additionally, developers could use techniques such as regularization and dropout to prevent the models from overfitting to the training data. This would help ensure that the models generalize to new data and avoid producing garbage output.
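For instance, in a PyTorch-style setup that might look like the following (a generic sketch with made-up layer sizes and rates, not anything specific to GPT training):

```python
import torch
import torch.nn as nn

# Toy model with dropout between layers (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # randomly zeroes 10% of activations during training
    nn.Linear(256, 10),
)

# Weight decay in the optimizer is the usual L2-style regularization knob.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```

That said, as the post argues, these tools address overfitting; they can’t restore distribution tails that were never in the training data to begin with.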
I think it is important to be aware of these limitations so that we can develop GPT models that are both powerful and reliable.