
More recently, they train on a mix of synthetic and organic text, as with the Phi-4 and o1 / o3 models. Original copyrighted text can be safely replaced with synthetic stand-ins.


I think this works only to a certain degree; they will still use as much data as they can to train the models.

Synthetic data will not replace original data like books. Synthetic data works very well for math.



