

visarga on Jan 21, 2025 | on: Authors seek Meta's torrent client logs and seedin...

More recently, they train on a mix of synthetic and organic text, as with the Phi-4 and o1 / o3 models. Original copyrighted text can be safely replaced with synthetic stand-ins.

BonoboIO on Jan 21, 2025

I think this works only to a certain degree; they will still use as much data as they can to train the models.

Synthetic data will not replace original data like books. Synthetic data works very well for math.
