
Actually, the sources we had (everything scraped from the internet) turn out to be pretty bad.

Imagine not going to school and instead learning everything from random blog posts or Reddit comments. You could do it if you read a lot, but it's clearly suboptimal.

That's why OpenAI, and probably every other serious AI company, is investing huge amounts in generating (proprietary) datasets.



GitHub, especially when filtered by starred repos, is a pretty high-quality dataset.



