Not sure which of the two papers you’re referring to. The Anthropic paper [1] clearly shows an example where not deduping leads to serious quality degradation.
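Just to make "deduping" concrete, here is a minimal sketch of exact-match dedup by hashing documents; real pipelines typically also do near-duplicate matching (e.g. MinHash), which this doesn't cover, and the example corpus is made up:

    import hashlib

    def dedupe_exact(documents):
        """Drop exact duplicate documents, keeping the first occurrence."""
        seen, unique = set(), []
        for doc in documents:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    corpus = ["the cat sat", "a dog barked", "the cat sat"]
    print(dedupe_exact(corpus))  # ['the cat sat', 'a dog barked']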
However, it's not clear whether training for more than one epoch on deduped and well-balanced data would help. I personally think it should, and the reason people don't do it might simply be that it's too expensive.
Yes, I was referring to keeping the training set as-is and just running more than one epoch. I don't think we should expect this to necessarily have the same effect as duplicating data within the training set. I understand it's expensive, but if the goal of the paper was to tease out the leverage from each of these possible variables, it's strange that they just ignore this one, which could be significant.
I can't find it now, but I've seen a claim somewhere that it's better to train a 10x model for 1 epoch than a 1x model for 10 epochs. This is most certainly not true for computer vision models (e.g. EfficientNet B0 vs B7), but perhaps it's true for NLP? I remember that the original BERT was trained for 40 epochs (but only on 3.3B tokens), so I wonder how it would compare to GPT-3 trained for one epoch on the same dataset.
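For what it's worth, under the common approximation that training FLOPs scale with roughly 6 * parameters * tokens, those two setups cost about the same compute, so the claim is really about which way to spend a fixed budget. A rough sketch with made-up sizes purely for illustration:

    # Back-of-the-envelope: training FLOPs ~= 6 * params * tokens (common approximation).
    # The "1x" model size and token count are made-up numbers for illustration only.
    def train_flops(params, tokens, epochs=1):
        return 6 * params * tokens * epochs

    base_params = 10**9          # hypothetical "1x" model
    tokens = 300 * 10**9         # hypothetical dataset, one pass per epoch
    print(train_flops(base_params, tokens, epochs=10) ==
          train_flops(10 * base_params, tokens, epochs=1))  # True: same compute, spent differently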
Any intuition for why that would be the case for NLP?
Certainly when you have datasets as large as those used to train large transformer models, SOME of the data is already repeated. Why would one more epoch make it worse?
They make an argument that there might exist an unfortunate duplicated-to-unique data ratio in a dataset, where the model decides to memorize a frequently repeated chunk of data that is big enough for memorizing it to be worth the accuracy hit on the rest of the data, but not so big that memorization becomes difficult (section 5.1). The degradation they show is substantial: almost like going from an 800M to a 400M parameter model.
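To make that ratio intuition concrete with made-up numbers (not the paper's), a small fraction of unique documents repeated many times can end up dominating the token stream:

    # Toy numbers, assuming equal-sized documents: if 0.1% of the unique documents
    # are each repeated 100x, what share of all training tokens is that chunk?
    dup_frac, repeats = 0.001, 100
    token_share = dup_frac * repeats / ((1 - dup_frac) + dup_frac * repeats)
    print(f"{token_share:.1%}")  # ~9.1% of training tokens come from 0.1% of the documents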
https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...