hmm... that makes it larger than RedPajama, a dataset of 1.2 trillion tokens
how much of reddit is being used for AI? it looks like there's plenty of text in there; maybe we just need to run reddit through GPT to pick out the good parts, and we'd have a great dataset