
The quote from the direct message made me respect Eleuther much more. Largely because I had no idea such ethical considerations were even being made.

Understanding the biases of these datasets is clearly more nuanced than I realized and I'm glad Stella had a nuanced understanding here.



Exactly. This was the type of mistake that OpenAI could easily have made. I could see myself including this historical dataset without giving it a second thought. After all, the more data, the better, right?

One of The Pile's goals was to point out how tricky that can be. We've all seen how effortlessly Copilot spits out GPL code by rote; one wrong prompt is all it takes to start spewing a lot of things that no one wants to hear, if you have the wrong sort of data.

When you train with The Pile, you know exactly what you're getting, because you can take whatever parts you want and ignore the rest. It's a modular dataset. But defaults still matter -- by default, everyone will train on everything. Maybe OpenAI trained on the wrong thing, and maybe that's why they're forcing everyone to use their filters now. Whereas people can "just go train on everything in The Pile" and not have to worry.
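If you want a sense of what "pick and choose" looks like in practice: as I recall, the released jsonl shards tag each record with the component it came from, so filtering is only a few lines. A minimal sketch -- the shard filename and the exact set names below are assumptions, so check them against the actual distribution:

    import json

    KEEP = {"Wikipedia (en)", "StackExchange", "Pile-CC"}  # components you want to train on

    def filtered_texts(shard_path):
        # Each line of a shard is a json record like:
        # {"text": ..., "meta": {"pile_set_name": ...}}
        with open(shard_path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("meta", {}).get("pile_set_name") in KEEP:
                    yield record["text"]

    # for text in filtered_texts("00.jsonl"):  # hypothetical shard name
    #     feed_to_tokenizer(text)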

Once upon a time, the plan was to include a dump of Literotica in The Pile, which you can still find here: https://the-eye.eu/public/AI/pile_preliminary_components/ -- I argued heavily in favor of this, and thought it was totally lame when they decided to drop it.

In hindsight, that was a close call. AI Dungeon proves that it's easy to carelessly include things that can bite you later: https://gitgud.io/AuroraPurgatio/aurorapurgatio#aurorapurgat...

Maybe some people want their models to include that sort of thing, but it shouldn't be the default. People shouldn't have to worry that the defaults will be "Whoa, I only wanted to make a Q&A system for my business; why is it reciting love poems?"

Stella saw that, I think. I didn't.


So what’s the rationale for including so much “romance” literature in The Pile? My innocent “walk in the park” prompt turned extremely graphic for no apparent reason.


Unfortunately, that's probably my fault.

I foolishly had a big head, and felt like it was so clear what needed to happen: we needed a dataset of "every book ever."

books3, one of the largest components of The Pile, is 196,640 books. https://twitter.com/theshawwn/status/1320282149329784833?lan...

I'm proud I did that. And I'm also horrified that my perspective was so incredibly off-base. I get it now. I was blinded by my own thick skull.

The sheer quantity of knowledge in books3 is almost unfathomable. I find it hard to think too much about it, because you end up concluding that AIs are the only entities on earth that stand a chance of absorbing this much knowledge.

I just pulled up the books3 index of "2" -- i.e. all books starting with the number 2: https://gist.github.com/shawwn/85cbaf53cb6bb57c49f1688e70532...

That's the truncated file. If you go to the full file and command-F for "sex", there are 93 hits.

93 sex books. In just the "2" section.

All the sections are here: http://the-eye.eu/public/Books/Bibliotik/
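(If you'd rather not trust a browser count, that command-F amounts to a naive, case-insensitive substring match over a one-title-per-line index file. A rough sketch -- the filename is made up, and matching like this will also catch false positives such as "Sussex" or "Essex":

    def count_keyword(index_path, keyword="sex"):
        hits = 0
        with open(index_path, "r", encoding="utf-8", errors="ignore") as f:
            for title in f:
                if keyword.lower() in title.lower():
                    hits += 1
        return hits

    # count_keyword("books3_index_2.txt")  # hypothetical filename
)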

Like Hattori Hanzo, I feel I can say with no ego that books3 is my finest work. https://www.youtube.com/watch?v=az2dSNXRKOc&ab_channel=kurts...

You would not believe how hard it is to get 196 thousand books converted into perfectly readable markdown. Even the software books have perfect formatting -- every table, every code snippet. I annihilated every corner case I could find, because it needed to be perfect for humans to have any chance of being perfect for AI.

But I was a fool. My ego blinded me to the problem with what I truly believed was in everyone's best interest: the idea that because any human could read any of those books, an AI should know all of those books.

It's not a human. It's a Markov chain. Having it autocomplete sex books is a bad idea for business purposes. I wanted The Pile to be business-grade. My work here has endangered that goal.

And I don't know how it could have ended up any differently, because I don't know how to sort 196 thousand books into reasonable selections that you may or may not want to exclude. Our goal with The Pile was to let you decide. Who among us would dare feel that they could judge 196 thousand books from their titles alone?

It's a job for filtering and heuristics and analysis and hard work -- none of which I did. I spent around three days turning Aaron Swartz' html2text library into the best damn "epub to training data converter" ever made. Yet my accomplishments feel so hollow, for the reasons you observed here.
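For the curious, the core of that conversion is simpler than it sounds: walk the epub's document items, run each one through html2text, and join the chapters. This is only a sketch of the shape of it, not the actual books3 converter (which handled far more corner cases), and it assumes the ebooklib package for reading epubs:

    import ebooklib
    from ebooklib import epub
    import html2text

    def epub_to_markdown(path):
        book = epub.read_epub(path)
        h = html2text.HTML2Text()
        h.body_width = 0        # don't hard-wrap lines
        h.ignore_images = True  # training data is text-only
        chapters = []
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            html = item.get_content().decode("utf-8", errors="ignore")
            chapters.append(h.handle(html))
        return "\n\n".join(chapters)

    # epub_to_markdown("some_book.epub")  # one markdown string per book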

Stella and Leo put so much more thought and care into their contributions. I try to take solace in the fact that The Pile lets you pick and choose which portions of training data you want to use: https://github.com/EleutherAI/the-pile

But of course, the irony is, even though The Pile is so flexible and modular, most people will just use the defaults. And by default, The Pile includes... most of humanity's knowledge. A gargantuan pile of books. So many books that you could fill an entire neighborhood with nothing but books, and you'd still have a hundred thousand books left over.

I don't know how to feel about all that. I wanted to make an impact. I guess I did. Time will tell whether it's a net gain.

Luckily, OpenAI made these same mistakes. That's the grain of truth I cling to. They almost certainly made these exact same mistakes, because their goal was to make a million dollars a year (which they achieved), and to do so as quickly as possible.

Now they have to be super paranoid with their filters, and GPT-J is at least slightly less shocking than GPT-3 thanks to everyone not-me who worked on The Pile.




