The reason language models require large amounts of data is that they lack grounding. When humans write a sentence about, let's say, "fire", we can relate that word to visual, auditory and kinesthetic experiences built from a coherent world model. Without this world model the LM needs a lot of examples: essentially it has to remember all the different contexts in which the word "fire" appears and figure out when it's appropriate to use it in a sentence. A perfect language model is literally impossible, because you can always contrive a novel context that the LM has never seen before.
I suspect that the more data modalities we add, the less data would be required, but that's not the whole picture either. For example, text-to-image generators often make weird mistakes that look "unphysical", such as objects that look like they're flowing into each other. The reason is that these models (including DALL-E) use a simple U-Net, which basically only sees textures. What they lack is the human inductive bias that 2D images are typically representations of a 3D world, a world which contains largely discrete objects and physics. They make these mistakes because they don't know what objects are, and need to brute-force this idea from a ton of observations. Even simple cognitive abilities like object persistence require time perception, which these models lack.
I think the fact that these models can make up for this deficit with a ton of data is very telling. There is a lot of low-hanging fruit in integrating more data modalities.
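To make the "only sees textures" point concrete, here is a deliberately minimal U-Net-style encoder/decoder in PyTorch. This is a toy sketch, not any particular model's actual architecture: the thing to notice is that every operation is a small local convolution, so nothing in the network itself encodes objects or 3D structure; any such notion has to be learned from data.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style encoder/decoder: purely local convolutions plus a skip connection."""
    def __init__(self, ch=3, width=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(ch, width, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)      # half resolution
        self.enc2 = nn.Sequential(nn.Conv2d(width * 2, width * 2, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)
        self.dec1 = nn.Sequential(nn.Conv2d(width * 2, width, 3, padding=1), nn.ReLU())
        self.out  = nn.Conv2d(width, ch, 3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)                           # full-resolution features (skip connection)
        h  = self.enc2(self.down(s1))               # local features at half resolution
        h  = self.up(h)
        h  = self.dec1(torch.cat([h, s1], dim=1))   # concatenate skip; still only local convs
        return self.out(h)

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 64, 64])
```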
> Even simple cognitive abilities like object persistence require time perception, which these models lack.
What do you mean? If we send a robot to explore its environment, and train it by having it constantly predict the next video frame, wouldn't it eventually learn the physics and therefore gain "time perception"?
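For what it's worth, the "predict the next video frame" objective is easy to write down; whether it yields anything like time perception is exactly what's being debated. A minimal sketch of the training signal in PyTorch, with a toy convolutional predictor standing in for a real video model:

```python
import torch
import torch.nn as nn

predictor = nn.Sequential(                  # stand-in for a real video model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

video = torch.randn(16, 3, 64, 64)          # fake clip: 16 frames of 64x64 RGB
for t in range(video.shape[0] - 1):
    pred = predictor(video[t:t + 1])                      # predict frame t+1 from frame t
    loss = nn.functional.mse_loss(pred, video[t + 1:t + 2])
    opt.zero_grad()
    loss.backward()
    opt.step()
```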
* The "bigger models" line of research is probably exhausted for now: "insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got"
* We don't really know how much written data is available. Even Google - which has access to data repositories like scanned books that they cannot share - seems to have trouble getting consistently sized datasets for unknown reasons.
* There seems to be as much (or more?) written English available in books as on the entire web: MassiveText "scrape" = 506B tokens, MassiveText "books" = 560B tokens
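The equation referred to above is the parametric loss fit from the Chinchilla paper (Hoffmann et al. 2022), L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. A rough sketch in Python using the fitted constants reported in that paper; treat the numbers as illustrative, not authoritative:

```python
# Fitted constants from the Chinchilla paper's parametric loss model.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted training loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Gopher-scale model (~280B params, ~300B tokens) vs. Chinchilla (~70B params,
# ~1.4T tokens) -- roughly the same compute budget under C ~ 6*N*D.
print(loss(280e9, 300e9))   # bigger model, less data
print(loss(70e9, 1.4e12))   # smaller model, more data: lower predicted loss
```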
My understanding is that all of these models were trained for one epoch. Not knowing the change in loss in these LMs from doing a second pass seems like a huge blind spot. What's the best example to point to that could help us understand how likely it is that this would move the needle comparably to adding more unique tokens?
Not sure which of the two papers you’re referring to. The Anthropic paper [1] clearly shows an example where not deduping leads to serious quality degradation.
However, it's not clear whether training for more than one epoch on deduped and well-balanced data would help. I personally think it should, and the reason people don't do it might be that it's too expensive.
Yes, I was referring to maintaining the training set as-is and just running more than one epoch. I don't think we should expect this to necessarily have the same effect as duplicating data in the training set. I understand it's expensive, but if the goal of the paper was to tease out the leverage from each of these possible variables, it's strange that they just ignore this one, which could be significant.
I can't find it now, but I've seen somewhere a claim that it's better to train a 10x model for 1 epoch than a 1x model for 10 epochs. This is most certainly not true for computer vision models (e.g. EfficientNet B0 vs B7), but perhaps it's true for NLP? I remember that the original BERT was trained for 40 epochs (but only on 3.3B tokens), so I wonder how it would compare to GPT-3 trained for one epoch on the same dataset.
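One way to frame that comparison is compute: under the common C ~ 6*N*D approximation for training FLOPs, a 10x model for 1 epoch and a 1x model for 10 epochs cost the same, so the claim is really about how to spend a fixed budget. A toy sketch (the model and dataset sizes below are made up for illustration):

```python
def train_flops(n_params, dataset_tokens, epochs):
    """Approximate training cost: C ~ 6 * parameters * tokens seen (repeats included)."""
    return 6 * n_params * dataset_tokens * epochs

base_params, dataset = 1e9, 3.3e9                  # e.g. a 1B model on a BERT-sized 3.3B-token corpus
print(train_flops(10 * base_params, dataset, 1))   # 10x model, 1 epoch
print(train_flops(base_params, dataset, 10))       # 1x model, 10 epochs -- same FLOP budget
# Same cost either way; the open question is whether repeated tokens are worth
# as much as extra parameters.
```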
Any intuition as to why that would be the case for NLP?
Certainly when you have something as large as the datasets being used to train large transformer models, SOME of the data is already repeated. Why would one more epoch make it worse?
They make an argument that there might exist an unfortunate duplicate/unique data ratio in a dataset, where the model decides to memorize a frequently repeated chunk of data that is big enough that memorizing it is worth the accuracy degradation on the rest of the data, but not so big that memorization becomes difficult (section 5.1). The degradation they show is substantial - almost as if going from an 800M to a 400M model.
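For anyone curious what "deduping" amounts to mechanically, here is a very crude sketch of chunk-hash deduplication in Python. The real pipelines (e.g. the suffix-array and MinHash methods used in the dedup papers) are much more sophisticated; this only shows the basic idea of dropping documents that share large repeated chunks with ones already kept.

```python
import hashlib

def chunk_hashes(tokens, window=50):
    """Hash fixed-size token windows so repeated chunks can be detected cheaply."""
    return {
        hashlib.sha1(" ".join(tokens[i:i + window]).encode()).hexdigest()
        for i in range(0, max(1, len(tokens) - window + 1), window)
    }

def dedup(documents, window=50):
    seen, kept = set(), []
    for doc in documents:
        hashes = chunk_hashes(doc.split(), window)
        if hashes & seen:       # shares a repeated chunk with an earlier document
            continue            # drop it instead of letting the model memorize it
        seen |= hashes
        kept.append(doc)
    return kept
```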
Seems like once the easy scaling is over (you can't get more training data or train more on the data you have), the next step is using training data more efficiently with algorithmic improvements?
Or maybe multimodal models. If Google used YouTube as training data, they wouldn't run out of data for a while. Just feed it video once you run out of text.
Or maybe hallucinating additional training data: my impression is that AlphaFold was able to get around a lack of sufficient data by training an initial model and having it make a bunch of predictions; the high-confidence predictions from that set were then used to train the final model.
Given that language models seem to mostly know when they're making correct predictions [1], this method might be useful for stretching the available datasets into something larger without falling into the same pitfalls that repeating data would give you. And if you squint, it kind of looks like daydreaming? When I'm learning a new skill I often find myself playing back scenes and internally running simulations of what I would have done differently.
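A sketch of that loop, just to make the idea concrete. The `train`, `generate`, and `confidence` callables are placeholders for whatever model and scoring method you would actually use; nothing here is a real API.

```python
def self_distill(real_data, train, generate, confidence,
                 n_generated=1000, threshold=0.9, rounds=2):
    """Pseudo-labelling loop: keep only generated examples the model is confident about."""
    model = train(real_data)                      # initial model, real data only
    data = list(real_data)
    for _ in range(rounds):
        candidates = [generate(model) for _ in range(n_generated)]
        # keep high-confidence outputs, in the spirit of AlphaFold's distillation set
        data += [c for c in candidates if confidence(model, c) >= threshold]
        model = train(data)                       # retrain on real + self-generated data
    return model
```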
Interesting idea. If a person sits down and writes a bunch of essays, that usually leads to improved language skills. Maybe hallucinating additional training data could be something somewhat similar for transformer networks.