- "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
- "OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled."
- "We also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find."
- "In summary, we still believe this technology is premature for commercial deployment."
With regard to stereotypes:
- "When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia)."
- When testing with the RealToxicityPrompts data set, "OPT-175B has a higher toxicity rate than either PaLM or Davinci"
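The sampling the paper points to (Holtzman et al., 2020) is nucleus, or top-p, sampling: rather than sampling from the full next-token distribution, you keep only the smallest set of tokens whose cumulative probability exceeds p, which cuts off the low-probability tail that feeds degenerate loops. A minimal toy sketch of the filtering step (function names are mine, not from any model release):

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p; zero out and renormalize the rest."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def sample(probs):
    """Draw one token index from a (filtered) distribution."""
    r, acc = random.random(), 0.0
    for i, q in enumerate(probs):
        acc += q
        if r < acc:
            return i
    return len(probs) - 1
```

As the paper notes, this only lowers the incidence of repetition; a single sampled generation can still fall into a loop, because truncation reshapes the distribution but does not track what has already been emitted.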
Pushshift is a single person with some very strong political opinions who has specifically used his datasets to attack political opponents. Frankly I wouldn't trust his data to be untainted.
These models really need to be trained on more official data sources, or at least something with some type of multi-party oversight rather than data that effectively fell off the back of a truck.
edit: That's not even to mention I believe it's flat-out illegal for him to collect and redistribute this data as Reddit users did not agree to any terms of use with him. Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR: https://www.reddit.com/r/pushshift/comments/pat409/online_re...
Not handy, and I'm not going to spend my evening digging. It may also have been one of the NGOs ideologically aligned with him that credited him for the data and assistance.
If it's so egregious, is it really that hard to find an example of the bias?
Calling the integrity of a single-person operation into question, then backing out with no evidence and even saying it might not have been them, seems a bit irresponsible.
Web scraping is legal. Reddit users, like all other members of public forums, put their comments on the internet for the whole world to see. And collect, parse, process and manipulate. If you don't want the whole world to have access to your writing, you'd have to join a private forum.
Trying to shoehorn social media posts into some contorted post-hoc bastardization of the concept of privacy is ridiculous.
Shockingly, things that people post to publicly accessible websites are accessible by the public. We're starting to see social damage from this, with facial recognition and authoritarian governments using people's posts for tracking and oppression.
Decentralized services, strong legislation protecting personal data, and globally recognized content licensing will all be needed to prevent future abuse, but everyone currently on the planet over the age of 20 is more or less personally responsible for the massive and naive oversharing. We know better now, but 15+ years ago nobody except sci-fi authors and fringe activists had a grasp of how badly unprotected, globally shared streams of consciousness could go wrong.
> Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR
Pushshift collects data from Reddit using the same API as the mobile app and public site. It does not have any privileged access to the Reddit database, nor is it collecting any PII that would be subject to GDPR.
You as a user grant a pretty broad license to Reddit when you post content. One of the things the license allows them to do is redistribute the content to other users as well as search indexes and things like the Wayback Machine or Pushshift.
(While I did work for Reddit at one point, these opinions are my own)
> nor is it collecting any PII that would be subject to GDPR
Yeah that's not how that works. Reddit is a free text input interface. I'm free to put PII in any post or comment I want to and you have to comply with data protection laws accordingly if I want my information redacted later on.
The same way you wouldn't just "let it ride" if someone uploaded illegal content - the content itself is what's protected; it doesn't matter how Reddit structures its web forms.
That has already been hashed out in the European courts. The processor of the data needs to have a reasonable way of establishing that the data belongs to an identifiable natural person.
But by all means, if you disagree feel free to report Pushshift to the EU regulators. As far as I know Pushshift is based in the US and has no presence to establish a nexus to EU law.
At some point they have to face the reality that these "stereotypical biases" are natural, and hamstringing AIs to never consider them will twist them monstrously.
What about this: at some point we will have to actually take inspiration from the word "Intelligence" and build a critical engine?
Edit: in fact, your latter statement seems to suggest finished products. No, they are toys. We are playing in order to build further, and we are getting results, milestones in our construction abilities - but these "models" are little lab-byproduct monsters. What is there to «twist»?
It's not blowing up though, it's experiencing natural turbulence and you're so afraid of getting jostled a bit you demand the plane be tethered to the ground and never exceed 10mph. How to fly under these conditions is left as an exercise for the reader.
No, I am saying that the cure is worse than the disease. The proper fix for the AI being racist is to make it able to not be racist on its own (which would probably need much deeper understanding on the side of the AI), not to forbid everything that trips some primitive heuristic of "being racist". One is painful and correct; the other is easy and feelgood and doomed.
Fair enough, that's what I get for bringing reddit discussion norms with me.
Though because of how general purpose these models are, I have a hard time believing such a model couldn't be used to generate reams of racist screeds for propaganda/astroturfing purposes.
There's a non-trivial terminological issue there. To say that specimens "as found in nature" are weak at something (uneducated) is one thing; to say that it is "connatural" to them, that it is "their nature", is completely different¹. I would not mix them up.
(¹Actually the opposite: the first indicates an unexpressed nature, the second a manifested one.)
> - "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
Lame!!! I've come to realize InstructGPT3 is just so so so much better than base GPT-3. I won't be _too_ excited about competitors yet until someone makes their own instruct model.
The T0 series by BigScience is essentially an instruct model (though it uses multitask prompting instead of user feedback). You should check it out. I have gotten very competitive results prompting T0-11B vs. InstructGPT-3 (text-davinci-002).
Thanks, this looks awesome. But my use case is creative text generation (chatbots), which from a quick glance doesn’t seem to be a suggested use case for T0?
I’ve found that simply describing to text-davinci-002 how a chatbot should act gives you more fun and believable responses. For example, I trained a Trump bot on 2000 tweets (davinci non-instruct fine-tuning), and it generated responses that were more boring than when I just wrote a sentence asking it to please tweet like Trump, plus a couple of adjectives to help it.
I ran out of guest API credits on hugging face before I could trick T0 to respond with a chat completion longer than a few words. But I’ll try it some more later.
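The "just describe how the bot should act" approach above amounts to building a plain-text instruction prompt instead of fine-tuning. A minimal sketch of that kind of persona prompt (the function name and template wording are illustrative, not from any API):

```python
def persona_prompt(persona, adjectives, user_message):
    """Build a plain-text prompt telling the model how to act,
    rather than relying on fine-tuning alone."""
    style = ", ".join(adjectives)
    return (
        f"You are a chatbot that tweets like {persona}. "
        f"Your replies are {style}.\n\n"
        f"User: {user_message}\n"
        f"{persona}:"
    )
```

The resulting string would be sent as the prompt to an instruction-tuned completion model, which continues from the trailing `{persona}:` cue in the requested style.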
A model trained on HN would spit out a 5-paragraph story about how minorities provide a negative ROI for cities, or how the homeless need to be removed from society.
Don't forget that it must also generate, at some point regardless of the topic, a new terminal emulator, and an extremely positive or extremely negative opinion about how blockchain can solve a problem.
Sure, but it would never do something actually bad, like raising the possibility that sexual harassment might, sometimes, be an issue, or questioning the value of phrenology.