Hacker News

You can download all of the (public) posts and comments on Reddit. It's a ~2TB torrent.


hmm... that makes it larger than RedPajama, a dataset of 1.2 trillion tokens

how much of reddit is being used for AI? it looks like there's plenty of text in there; maybe we just need to run reddit through GPT to keep only the good parts, and we'd get a great dataset



