From Model Collapse to Model Improvement: Synthetic Data Verification to the Rescue


Speaker

Haifeng Xu

The University of Chicago

Time

Monday, August 18, 2025

2:00–3:00 PM

Venue

Room 1113, Scientific Research Building


Abstract

We are running out of Internet data with which to continuously train and improve today's large models. To overcome this barrier, an approach widely employed by industrial LLM platforms is to use model-generated synthetic data as a new source for continued training. However, several recent works have shown that training generative models on synthetic data can lead to model collapse in the long run, raising significant concerns about current industrial practice and prompting discussion of how synthetic data should be properly integrated into the training process.
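To make the collapse phenomenon concrete, here is a minimal toy simulation (my illustration, not material from the talk): a Gaussian model is repeatedly refit to samples drawn from its own previous fit, and the estimated spread of the data decays across generations.

```python
# Toy model-collapse simulation: each "generation" fits a Gaussian to
# samples drawn from the previous generation's fit. Finite-sample
# estimation error compounds, and the fitted standard deviation drifts
# toward zero, so the model gradually forgets the data's true spread.
import numpy as np

rng = np.random.default_rng(0)
n = 100                            # samples per generation
mu, sigma = 0.0, 1.0               # true data distribution N(0, 1)

data = rng.normal(mu, sigma, n)    # "real" data, generation 0
for gen in range(1, 31):
    mu_hat, sigma_hat = data.mean(), data.std()
    # The next generation trains only on synthetic samples from the fit.
    data = rng.normal(mu_hat, sigma_hat, n)
    if gen % 10 == 0:
        print(f"gen {gen:2d}: mu_hat={mu_hat:+.3f}, sigma_hat={sigma_hat:.3f}")
```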

In this talk, I will share our recent study of how synthetic data can be properly integrated into generative model training in a way that not only prevents model collapse but also strictly improves model performance. We situate our principled analysis in a fundamental statistical problem, from which we derive practically useful insights. The talk will conclude with a number of open directions.
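The abstract does not spell out the verification mechanism, so the following is only a hedged sketch under my own assumptions: real data is never discarded, synthetic samples are drawn from the current estimate, and a hypothetical oracle `verifier` keeps only samples near the truth, so the pooled estimate can improve on the real-data-only baseline in a simple mean-estimation problem.

```python
# Hedged sketch (my illustration, not the speaker's algorithm) of how
# verification can turn synthetic data from a liability into a gain.
# Only the mean is estimated, for simplicity; the variance is known.
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, n = 0.0, 1.0, 100
real = rng.normal(mu_true, sigma, n)   # real data, kept throughout

def verifier(x, tol=1.0):
    # Hypothetical oracle verifier: accepts samples within `tol`
    # of the true mean; a real verifier would be noisy and learned.
    return np.abs(x - mu_true) < tol

mu_hat = real.mean()
for _ in range(10):
    synth = rng.normal(mu_hat, sigma, n)   # model-generated data
    kept = synth[verifier(synth)]          # verification step
    pooled = np.concatenate([real, kept])  # never discard real data
    mu_hat = pooled.mean()

print(f"real data only              : {real.mean():+.4f}")
print(f"with verified synthetic data: {mu_hat:+.4f}")
```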


Biography


Caixing Wang is a postdoctoral researcher in the Department of Statistics at The Chinese University of Hong Kong. He received his Ph.D. from the School of Statistics and Data Science at Shanghai University of Finance and Economics. His research interests include statistical machine learning and large-scale data analysis. He has published papers in leading journals and conferences in machine learning and statistics, including JMLR, JCGS, NeurIPS, and ICML.
