Does training AI models on synthetic data actually cause model collapse over time?

🤖 AI reviewed 📅 Jun 2, 2026 👨‍⚕️ Expert reviewed ✍️ TryQuerra Editorial Team

Verdict

Yes. The reviewed evidence supports model collapse when AI models are repeatedly trained on synthetic data, particularly low-quality synthetic data.

Based on 7 reviewed sources, including "AI 'Model Collapse': The Risks of Synthetic Data Training" (Richard Foster Fletcher), "Model Collapse by Synthetic Data is fake news [Investigations]", and "Model collapse" (Wikipedia).

Trust Score: 73%

Updated Jun 2, 2026

Trust score breakdown

Source quality: 75%
Source diversity: 93%
Consensus strength: 71%
Freshness: 76%
Expert agreement: 72%
Source agreement: 90%

The score is an AI-weighted composite across 7 sources. Higher source agreement means fewer meaningful contradictions across the reviewed sources.

Full answer body

Expanded summary

Research indicates that repeatedly training AI models on synthetic data can lead to a phenomenon known as 'model collapse,' in which model performance deteriorates over successive training iterations until outputs become incoherent and nonsensical. The degradation is attributed to a loss of data diversity caused by low-quality synthetic data. Some learning is still possible from generated data, but the lexical, syntactic, and semantic diversity of model outputs decreases consistently across iterations. For AI companies in 2026, the practical implication is a need for better-designed synthetic data to prevent model collapse as high-quality, human-generated web data becomes scarcer.
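To make "lexical diversity" concrete, the sketch below computes a distinct-n ratio: the number of unique n-grams divided by the total number of n-grams in a set of texts. This is one common proxy, not necessarily the metric used in the reviewed sources. The function name `distinct_n`, the whitespace tokenization, and the sample sentences are all illustrative assumptions.

```python
from collections import Counter


def distinct_n(texts, n=1):
    """Return unique n-grams / total n-grams across a list of texts.

    Values near 1.0 mean almost every n-gram is new; values near 0.0
    mean the texts keep repeating the same phrases.
    """
    ngrams = Counter()
    for text in texts:
        # Assumption: naive lowercase whitespace tokenization.
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Hypothetical sample outputs, invented for illustration only.
varied = ["the cat sat on a mat", "rain fell over quiet hills"]
repetitive = ["the cat sat on the mat", "the cat sat on the mat"]

print("varied distinct-1:    ", distinct_n(varied, 1))
print("repetitive distinct-1:", distinct_n(repetitive, 1))
```

Tracking a ratio like this across training generations is one way to check whether outputs are becoming more repetitive. It captures only surface-level lexical variety and says nothing about syntactic or semantic diversity.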

Full analysis

Introduction

Key Findings

Research shows that when models are repeatedly trained on synthetic data, their performance deteriorates until they eventually produce incoherent and nonsensical outputs.
Low-quality synthetic data reduces the diversity of the data a model samples from, which leads to model collapse.
The lexical, syntactic, and semantic diversity of model outputs decreases consistently across successive iterations of training on synthetic data.
As human-generated data becomes scarcer, AI developers will need to take more care over the data fed into their systems to prevent model collapse.
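The diversity loss described in these findings can be illustrated with a toy simulation. It is a sketch under strong assumptions, not a reproduction of any cited experiment: the "model" is just a table of token frequencies, each generation is "trained" only on samples drawn from the previous generation, and the starting distribution is an invented long-tailed one. The function name `resample_generations` and all parameter values are hypothetical.

```python
import random
from collections import Counter


def resample_generations(vocab_size=50, sample_size=100, generations=30, seed=0):
    """Track how many distinct tokens survive repeated self-training."""
    rng = random.Random(seed)
    # Generation 0: a long-tailed (Zipf-like) stand-in for human data.
    tokens = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in tokens]
    support_sizes = []
    for _ in range(generations):
        # "Generate" a synthetic corpus from the current model.
        corpus = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(corpus)
        # "Train" the next model: it only knows tokens it actually saw.
        tokens = list(counts.keys())
        weights = [counts[t] for t in tokens]
        support_sizes.append(len(tokens))
    return support_sizes


sizes = resample_generations()
print("distinct tokens per generation:", sizes)
```

In this setup a token that is never sampled in one generation cannot reappear later, so the count of distinct tokens can only stay flat or shrink, and rare tokens tend to disappear first. Real language models are far more complex, so this shows the mechanism in miniature rather than predicting how quickly any production system would degrade.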

Supporting Evidence

The cited sources differ in type, study design, and methodological rigor, so the strength of their evidence and the conclusions they reach vary.

Practical Implications

As high-quality, human-generated web data becomes scarcer, AI companies will need better-designed synthetic data and closer control over what is fed into training to avoid model collapse.

Limitations and Caveats

One counterclaim holds that model collapse occurs only when researchers deliberately induce it under conditions that do not match real-world training practice.
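Why the experimental setup matters can be sketched with a toy comparison. The code below repeatedly fits a Gaussian to its own samples, once using only synthetic samples and once mixing in fresh samples from the original distribution at every generation. The assumption that "actual practice" resembles the mixed case is ours and is not established by the reviewed sources. The function `recursive_gaussian_fit` and its parameters are hypothetical.

```python
import random
import statistics


def recursive_gaussian_fit(generations=500, sample_size=10,
                           real_fraction=0.0, seed=0):
    """Refit a Gaussian each generation and record its standard deviation.

    real_fraction is the share of each generation's training data drawn
    fresh from the original N(0, 1) distribution (an illustrative knob).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the "real" data
    n_real = round(sample_size * real_fraction)
    history = [sigma]
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # Maximum-likelihood refit on this generation's data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


pure = recursive_gaussian_fit(real_fraction=0.0)
mixed = recursive_gaussian_fit(real_fraction=0.5)
print("final std, synthetic only:", pure[-1])
print("final std, half real data:", mixed[-1])
```

With synthetic data only, the fitted standard deviation drifts toward zero, the toy analogue of collapse. With half of each generation drawn from the original distribution, the spread is continually replenished. A one-dimensional Gaussian is a deliberately simple stand-in, so this illustrates why the counterclaim is plausible in principle without settling whether it holds for large models.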

Conclusion

Disagreements and caveats