
Yes, I was referring to keeping the training set as-is and just running more than one epoch. I don't think we should expect this to have exactly the same effect as duplicating data within the training set. I understand it's expensive, but if the goal of the paper was to tease out the leverage from each of these possible variables, it's strange that they simply ignore this one, which could be significant.
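To make the distinction concrete, here's a rough sketch (PyTorch-style; the toy model, data, and train_step below are placeholders I made up for illustration, not anything from the paper) of the two setups being compared. In practice you'd run only one of the two loops:

    # Minimal sketch contrasting "duplicate the data, 1 epoch" with
    # "keep the data as-is, N epochs". Model and data are toy placeholders.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset, ConcatDataset

    N = 2                                   # duplication factor / epoch count
    data = TensorDataset(torch.randn(1000, 16), torch.randn(1000, 1))
    model = nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    def train_step(x, y):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

    # Option A: one epoch over a dataset with every example duplicated N times.
    for x, y in DataLoader(ConcatDataset([data] * N), batch_size=32, shuffle=True):
        train_step(x, y)

    # Option B: N epochs over the original dataset.
    for epoch in range(N):
        for x, y in DataLoader(data, batch_size=32, shuffle=True):
            train_step(x, y)

    # Both schemes show each example N times, but the ordering differs:
    # in A a duplicate can land in the same batch or back to back, while
    # in B the copies are separated by a full pass over the rest of the
    # data, and the epoch boundary interacts with shuffling and LR schedules.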


I can't find it now, but I've seen a claim somewhere that it's better to train a 10x model for 1 epoch than a 1x model for 10 epochs. This is most certainly not true for computer vision models (e.g. EfficientNet B0 vs B7), but perhaps it's true for NLP? I remember that the original BERT was trained for 40 epochs (but on only 3.3B tokens), so I wonder how it would compare to GPT-3 trained for one epoch on the same dataset.
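For what it's worth, under the usual back-of-the-envelope estimate of ~6 * params * tokens FLOPs for a transformer forward+backward pass, the two setups in that claim cost about the same compute. The parameter counts below are made up purely to show the arithmetic:

    # Rough compute comparison using the common ~6 * params * tokens FLOPs
    # estimate for transformer training; parameter counts are illustrative.
    params_1x  = 1e9    # hypothetical "1x" model: 1B parameters
    params_10x = 10e9   # hypothetical "10x" model: 10B parameters
    tokens     = 3.3e9  # a BERT-sized corpus, ~3.3B tokens

    flops_10x_1_epoch  = 6 * params_10x * tokens * 1    # 10x model, 1 epoch
    flops_1x_10_epochs = 6 * params_1x  * tokens * 10   # 1x model, 10 epochs

    print(f"{flops_10x_1_epoch:.2e} vs {flops_1x_10_epochs:.2e}")  # both ~1.98e20 FLOPs

So any quality difference would come from where the compute is spent (more parameters vs. repeated passes over the data), not from one side getting a bigger budget.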


Any intuition for why that would be the case for NLP?

Certainly with datasets as large as those used to train large transformer models, SOME of the data is already repeated. Why would one more epoch make it worse?



