Actually, the sources we had (everything scraped from the internet) turn out to be pretty bad.

Imagine not going to school and instead learning everything from random blog posts or Reddit comments. You could do it if you read a lot, but it's clearly suboptimal.

That's why OpenAI, and probably every other serious AI company, is investing huge amounts in generating (proprietary) datasets.