Does training AI models on synthetic data actually cause model collapse over time?

🤖 AI reviewed 📅 Jun 2, 2026 👨‍⚕️ Expert reviewed ✍️ TryQuerra Editorial Team
Verdict
Yes, evidence supports the concept of model collapse when AI models are trained on synthetic data.
Research indicates that training AI models on synthetic data can lead to a phenomenon known as 'model collapse,' where the performance of the models deteriorates over time, resulting in incoherent and nonsensical outputs.
Based on 7 reviewed sources including AI 'Model Collapse': The Risks of Synthetic Data Training | Richard Foster Fletcher, Model Collapse by Synthetic Data is fake news [Investigations], Model collapse - Wikipedia.
Trust Score: 73%
7 sources reviewed
Updated Jun 2, 2026
Trust score breakdown
Source quality: 75%
Source diversity: 93%
Consensus strength: 71%
Freshness: 76%
Expert agreement: 72%
Source agreement: 90%
Score is an AI-weighted composite using 7 sources. Higher source agreement means fewer meaningful contradictions across reviewed sources.

Full answer body

Expanded summary

Research indicates that repeatedly training AI models on synthetic data can lead to 'model collapse': performance deteriorates across successive training generations until the models produce incoherent, nonsensical outputs. The degradation is attributed to a loss of data diversity caused by low-quality synthetic data. Some learning is still possible from generated data, but the lexical, syntactic, and semantic diversity of model outputs decreases consistently with each iteration. For AI companies in 2026, the practical implication is a need for better-designed synthetic data as high-quality, human-generated web data becomes scarcer.
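
The loss of diversity described above can be illustrated with a toy simulation. The sketch below is not taken from any of the reviewed sources. It assumes a "model" that is nothing more than the empirical token frequencies of its training sample, and the function name, vocabulary size, and Zipf-like starting distribution are all hypothetical choices. Each generation samples from the previous model and refits on that sample alone.

```python
import random
from collections import Counter


def simulate_recursive_training(vocab_size=50, samples_per_generation=100,
                                generations=30, seed=0):
    """Toy model: refit token frequencies on samples drawn from the prior fit.

    Returns the number of distinct tokens seen in each generation's sample.
    All parameters are illustrative assumptions, not values from the sources.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Assumed starting "human" distribution: Zipf-like, with a long rare tail.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]

    support_sizes = []
    for _ in range(generations):
        # Generate synthetic data from the current model.
        sample = rng.choices(tokens, weights=weights, k=samples_per_generation)
        counts = Counter(sample)
        # "Train" the next model: its weights are the observed counts only.
        weights = [counts.get(token, 0) for token in tokens]
        support_sizes.append(len(counts))
    return support_sizes


if __name__ == "__main__":
    print(simulate_recursive_training())
```

Because a token with a zero count can never be sampled again, the number of distinct tokens can only stay flat or shrink from one generation to the next. This is a simplified analogue of the diversity loss the sources describe; real language models are far more complex, so the sketch shows the sampling effect only.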

Full analysis

Key Findings

  • Research shows that as models are repeatedly trained on synthetic data, their performance deteriorates until they eventually produce incoherent and nonsensical outputs.
  • Using low-quality synthetic data reduces the diversity of the data that a model samples from, leading to model collapse.
  • The lexical, syntactic, and semantic diversity of model outputs decreases consistently across successive iterations of training on synthetic data.
  • AI developers will need to take more care with the data fed into their systems to prevent model collapse as human-generated data becomes scarcer.
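
The diversity decline noted in the findings above can be tracked with a simple metric. The following is a minimal sketch, assuming whitespace tokenization and using distinct-n (the share of unique n-grams among all n-grams). It captures lexical diversity only, not syntactic or semantic diversity. The reviewed sources do not say which metrics they used, and the function name is hypothetical.

```python
def distinct_n(texts, n=2):
    """Return the fraction of n-grams across `texts` that are unique.

    Values near 1.0 mean varied wording; values near 0.0 mean heavy repetition.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    total = 0
    unique = set()
    for text in texts:
        words = text.lower().split()
        # Slide a window of length n over the word list.
        ngrams = list(zip(*(words[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    varied = ["the cat sat on the mat", "a dog ran across the field"]
    repetitive = ["the cat sat", "the cat sat", "the cat sat"]
    print(distinct_n(varied), distinct_n(repetitive))
```

Computing this score on a sample of outputs from each training generation and comparing the values over time would give one rough view of whether lexical diversity is shrinking.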

Supporting Evidence

  • Evidence quality and conclusions vary across source types, study design, and methodological limitations in the cited material.

Limitations and Caveats

  • A dissenting claim holds that model collapse occurs only when researchers deliberately induce it under conditions that do not match actual training practice.

Sources reviewed (6 of 7 shown)

AI 'Model Collapse': The Risks of Synthetic Data Training | Richard Foster Fletcher
Model Collapse by Synthetic Data is fake news [Investigations]
Model collapse - Wikipedia
AI models trained on 'synthetic data' could break down and regurgitate unintelligible nonsense, scientists warn | Live Science
When AI Models Start to Forget: Unpacking the Collapse Phenomenon | by Yubraj Ghimire | Medium
r/singularity on Reddit: Evidence that training models on AI-created data degrades their quality

People also ask

What is model collapse in AI training?
Model collapse is the deterioration of an AI model's performance over successive rounds of training on synthetic data, ending in incoherent outputs.
How does low-quality synthetic data contribute to model collapse?
Low-quality synthetic data reduces the diversity of data that a model samples from, limiting its ability to learn effectively and leading to model collapse.
Are there any practical implications for AI companies regarding model collapse?
AI companies need to be cautious about the quality of synthetic data used for training models to prevent model collapse, especially as human-generated data becomes scarce.