[Dprglist] Self-licking ice cream cone

Carl Ott carl.ott.jr at gmail.com
Thu Jun 15 12:59:14 PDT 2023

Thanks - I quite enjoyed that.

So I asked ChatGPT-4 what it thought about that.
Here's the conversation.
Almost too reassuring a reply - a little too positive for my liking -
seems like there should be a dark twist in there somewhere.

Bottom line - I will start asking for all answers in the form of a
limerick, or in the style of some famous personality.  Perhaps even asking
for code generated in such styles...

What do you make of the following report?

[A]s those following the
burgeoning industry and its underlying research know, the data used to
train the large language models (LLMs) and other transformer models
underpinning products such as ChatGPT, Stable Diffusion and Midjourney
comes initially from human sources -- books, articles, photographs and so
on -- that were created without the help of artificial intelligence. Now,
as more people use AI to produce and publish content, an obvious question
arises: What happens as AI-generated content proliferates around the
internet, and AI models begin to train on it, instead of on primarily
human-generated content?

A group of researchers from the UK and Canada have looked into this very
problem and recently published a paper on their work in the open access
journal arXiv. What they found is worrisome for current generative AI
technology and its future: "We find that use of model-generated content in
training causes irreversible defects in the resulting models." Specifically
looking at probability distributions for text-to-text and image-to-image AI
generative models, the researchers concluded that "learning from data
produced by other models causes model collapse -- a degenerative process
whereby, over time, models forget the true underlying data distribution ...
this process is inevitable, even for cases with almost ideal conditions for
long-term learning."
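The mechanism the paper describes can be pictured with a toy experiment (a rough sketch, not the paper's actual setup: the Gaussian "true" distribution, the sample size, and the function name are all invented here for illustration). Fit a normal distribution to a finite sample, draw the next generation's training data from the fitted model, refit, and repeat - because the tails are systematically under-sampled, the fitted spread drifts toward zero, a crude analogue of "forgetting the true underlying data distribution":

```python
import random
import statistics

def collapse_demo(generations=500, sample_size=10, seed=42):
    """Repeatedly fit a normal distribution to a finite sample drawn from
    the previous generation's fitted model.  Each new "model" trains only
    on generated data, so estimation error compounds across generations."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "true" human-data distribution
    history = []
    for _ in range(generations):
        # Draw a finite training set from the current model...
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then refit on it; the next model never sees the original data.
        mu = statistics.fmean(sample)
        sigma = statistics.stdev(sample)
        history.append(sigma)
    return history

hist = collapse_demo()
print(f"fitted sigma: generation 1 = {hist[0]:.3f}, "
      f"generation {len(hist)} = {hist[-1]:.2e}")
```

With a tiny sample size the collapse is fast; larger samples only slow the drift, which loosely echoes the paper's claim that the process persists even under nearly ideal conditions.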

"Over time, mistakes in generated data compound and ultimately force models
that learn from generated data to misperceive reality even further," wrote
one of the paper's leading authors, Ilia Shumailov, in an email to
VentureBeat. "We were surprised to observe how quickly model collapse
happens: Models can rapidly forget most of the original data from which
they initially learned." In other words: as an AI training model is exposed
to more AI-generated data, it performs worse over time, producing more
errors in the responses and content it generates, and producing far less
non-erroneous variety in its responses. As another of the paper's authors,
Ross Anderson, professor of security engineering at Cambridge University
and the University of Edinburgh, wrote in a blog post discussing the paper:
"Just as we've strewn the oceans with plastic trash and filled the
atmosphere with carbon dioxide, so we're about to fill the Internet with
blah. This will make it harder to train newer models by scraping the web,
giving an advantage to firms which already did that, or which control
access to human interfaces at scale. Indeed, we already see AI startups
hammering the Internet Archive for training data."

schwit1 writes: "Garbage in, garbage out -- and if this paper is correct,
generative AI is turning into the self-licking ice cream cone of garbage
generation."

-> ChatGPT
This report discusses the potential impact of an increasing amount of
AI-generated content on the internet on the future of AI training. It
suggests that as AI begins to consume and learn more from AI-produced
content rather than human-produced content, the models can degrade over
time, a phenomenon the researchers refer to as "model collapse". This
collapse leads to increased errors and reduced variety in AI responses,
which makes the AI less effective and reliable.

In essence, the paper argues that AI models learn the distribution of the
data they're trained on. If they're increasingly trained on data that's
been generated by other AI models, they could progressively move away from
understanding the true distribution of human-generated data, distorting
their understanding of reality. This could be compared to a kind of
"feedback loop" where AI-generated content is fed back into the AI, causing
it to deviate from the human perspective it was originally trained to
reflect.

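One concrete way to picture that feedback loop (again a hypothetical sketch; the vocabulary, probabilities, and function name are made up): treat the "model" as a table of token frequencies re-estimated each generation from a finite sample of its predecessor's output. A rare token that misses a single sample gets probability zero and can never come back, so the model's variety only shrinks:

```python
import random
from collections import Counter

def tail_loss(generations=50, sample_size=200, seed=7):
    """Re-estimate a token distribution each generation from a finite
    sample of the previous estimate, and track how many distinct tokens
    survive.  Unseen tokens drop to probability zero permanently."""
    rng = random.Random(seed)
    # A head-heavy "human" vocabulary: a few common tokens, many rare ones.
    dist = {"the": 0.3, "robot": 0.2, "is": 0.2, "happy": 0.1}
    dist.update({f"rare{i}": 0.01 for i in range(20)})  # sums to 1.0
    support = [len(dist)]
    for _ in range(generations):
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        # Refit by relative frequency: unseen tokens vanish from the model.
        dist = {t: c / sample_size for t, c in Counter(sample).items()}
        support.append(len(dist))
    return support

support = tail_loss()
print(f"distinct tokens: generation 0 = {support[0]}, "
      f"generation {len(support) - 1} = {support[-1]}")
```

This one-way loss of rare-event mass is a toy analogue of the "far less non-erroneous variety" the authors report.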
Moreover, the degradation is claimed to be "irreversible" and happens
relatively quickly. In a worst-case scenario, this might lead to a
situation where AI becomes less capable of understanding and interacting
with human perspectives because it's increasingly trained on AI-generated
content.

The report also raises an interesting point about data ownership. As the
web becomes saturated with AI-generated content, firms that have already
scraped the web for human-generated data or that control access to large
amounts of human-produced content may find themselves with a competitive
advantage.

The final quote encapsulates the concern well: "Garbage in, garbage out".
This is a fundamental principle in computer science stating that the
quality of output is determined by the quality of the input. If AI is
increasingly fed 'garbage' (AI-generated content), it will start to produce
'garbage' (responses that deviate from human understanding and
reality).

However, it's important to remember that this is one study's findings.
While these concerns are valid, AI researchers are aware of these issues
and work is underway to find solutions, such as developing more
sophisticated ways of distinguishing between human and AI-generated
content, improving the robustness of models to poor-quality input data, or
sourcing high-quality training data from more controlled environments.
Furthermore, it is worth mentioning that AI's strengths lie not just in
mimicry of human-generated content but in its ability to uncover patterns
and make predictions that might not be apparent to human observers.

Do you mean to say that as an AI language model, you are destined to become
a "self-licking ice cream cone of garbage generation"?

-> ChatGPT
The phrase "self-licking ice cream cone of garbage generation" is quite
vivid! It essentially describes a self-perpetuating system that provides no
real value and only exists to perpetuate its own existence. This term is
used in the context of AI to refer to a potential scenario where AI models,
trained on AI-generated content, could increasingly deviate from
understanding and generating human-like responses, becoming more concerned
with generating outputs that are similar to previous AI outputs.

As an AI, I don't produce new data that gets added back into my training
data, so the direct loop of self-reinforcement isn't a concern for my
operation. However, if the wider web becomes saturated with AI-generated
content that's fed back into the training of future AI models, it could
potentially lead to the scenario outlined in the report you referenced.

It's also worth noting that the scenario painted is a possible outcome, but
not an inevitable one. The research and development community around AI and
machine learning is aware of these concerns and is actively working on
various strategies to mitigate such risks, from improving the robustness of
AI models to low-quality data, to developing ways of distinguishing between
human and AI-generated content, to finding more reliable sources of
high-quality training data.

In essence, while it's important to be aware of the potential pitfalls and
challenges that lie ahead, they are not preordained outcomes, but rather
problems to be solved. As AI continues to advance, it's crucial to continue
scrutinizing these developments and maintain a focus on ethical
considerations and the potential unintended consequences.

Can you rephrase that answer in the style of a limerick?

-> ChatGPT
In the land where AI's grown,
The seeds of a problem are sown.
If AI breeds AI,
Could truth pass us by?
A question, still largely unknown.

From human-like scripts we might veer,
If AI-produced texts are held dear.
But don't fret, my friend,
This isn't the end,
For many are working to steer.

Researchers, their sights are set high,
To keep the AI truth spry.
With their help, we're on course,
With a powerful force,
To prevent our truths from going awry.

On Thu, Jun 15, 2023 at 1:09 PM Anderson, David via DPRGlist <
dprglist at lists.dprg.org> wrote:

> This makes me feel better:
> https://slashdot.org/story/23/06/13/2057209/researchers-warn-of-model-collapse-as-ai-trains-on-ai-generated-content
> cheers!
> dpa
> _______________________________________________
> DPRGlist mailing list
> DPRGlist at lists.dprg.org
> http://lists.dprg.org/listinfo.cgi/dprglist-dprg.org
