Not sure which of the two papers you’re referring to. The Anthropic paper [1] clearly shows an example where not deduping leads to serious quality degradation.
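Just to make "deduping" concrete, here is a minimal sketch of exact-match dedup by hashing documents; real pipelines typically also do near-duplicate matching (e.g. MinHash), which this doesn't cover, and the example corpus is made up:

    import hashlib

    def dedupe_exact(documents):
        """Drop exact duplicate documents, keeping the first occurrence."""
        seen, unique = set(), []
        for doc in documents:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    corpus = ["the cat sat", "a dog barked", "the cat sat"]
    print(dedupe_exact(corpus))  # ['the cat sat', 'a dog barked']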
However, it's not clear whether training for more than one epoch on deduped and well-balanced data would help. I personally think it should, and the reason people don't do it might simply be that it's too expensive.
Yes, I was referring to keeping the training set as-is and just running more than one epoch. I don't think we should expect this to necessarily have the same effect as duplicating data within the training set. I understand it's expensive, but if the goal of the paper was to tease out the leverage from each of these possible variables, it's strange that they just ignore this one, which could be significant.
I can't find it now, but I've seen a claim somewhere that it's better to train a 10x model for 1 epoch than a 1x model for 10 epochs. This is most certainly not true for computer vision models (e.g. EfficientNet B0 vs B7), but perhaps it's true for NLP? I remember that the original BERT was trained for 40 epochs (but only on 3.3B tokens), so I wonder how it would compare to GPT-3 trained for one epoch on the same dataset.
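For what it's worth, under the common approximation that training FLOPs scale with roughly 6 * parameters * tokens, those two setups cost about the same compute, so the claim is really about which way to spend a fixed budget. A rough sketch with made-up sizes purely for illustration:

    # Back-of-the-envelope: training FLOPs ~= 6 * params * tokens (common approximation).
    # The "1x" model size and token count are made-up numbers for illustration only.
    def train_flops(params, tokens, epochs=1):
        return 6 * params * tokens * epochs

    base_params = 10**9          # hypothetical "1x" model
    tokens = 300 * 10**9         # hypothetical dataset, one pass per epoch
    print(train_flops(base_params, tokens, epochs=10) ==
          train_flops(10 * base_params, tokens, epochs=1))  # True: same compute, spent differently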
Any intuition for why that would be the case for NLP?
Certainly when you have datasets as large as those used to train large transformer models, SOME of the data is already repeated. Why would one more epoch make it worse?
They make an argument that there might exist an unfortunate duplicated-to-unique data ratio in a dataset, where the model decides to memorize a frequently repeated chunk of data that is big enough for memorizing it to be worth the accuracy hit on the rest of the data, but not so big that memorization becomes difficult (section 5.1). The degradation they show is substantial: almost like going from an 800M to a 400M parameter model.
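To make that ratio intuition concrete with made-up numbers (not the paper's), a small fraction of unique documents repeated many times can end up dominating the token stream:

    # Toy numbers, assuming equal-sized documents: if 0.1% of the unique documents
    # are each repeated 100x, what share of all training tokens is that chunk?
    dup_frac, repeats = 0.001, 100
    token_share = dup_frac * repeats / ((1 - dup_frac) + dup_frac * repeats)
    print(f"{token_share:.1%}")  # ~9.1% of training tokens come from 0.1% of the documents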
https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...