Chinchilla's Wild Implications (lesswrong.com)
88 points by ctoth on Aug 2, 2022 | 32 comments


Language models require large amounts of data because they lack grounding. When humans write a sentence about, let's say, "fire", we can relate that word to visual, auditory, and kinesthetic experiences built from a coherent world model. Without this world model the LM needs a lot of examples: essentially it has to remember all the different contexts in which the word "fire" appears and figure out when it's appropriate to use this word in a sentence. A perfect language model is literally impossible because you can always contrive a novel context that the LM has never seen before.

I suspect that the more data modalities we add, the less data would be required, but that's not the whole picture either. For example, text-to-image generators often make weird mistakes that look "unphysical", with objects that look like they're flowing into each other. The reason is that these models (including DALL-E) use a simple U-Net, which basically only sees textures. What they lack is the human inductive bias that 2D images are typically representations of a 3D world, a world which consists largely of discrete objects and physics. They make these mistakes because they don't know what objects are, and need to brute-force this idea from a ton of observations. Even simple cognitive abilities like object persistence require time perception, which these models lack.

I think the fact that these models can make up for this deficit with a ton of data is very telling. There is a lot of low hanging fruit in integrating more data modalities.


> Even simple cognitive abilities like object persistence require time perception, which these models lack.

What do you mean? If we send a robot to explore its environment, and train it by having it constantly predict the next video frame, wouldn't it eventually learn the physics and therefore gain "time perception"?
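
Concretely, something like this (a rough sketch assuming PyTorch; the tiny ConvNet and the random "video" tensor are placeholders for a real robot camera feed):

    import torch
    import torch.nn as nn

    # Predict frame t+1 from frame t; minimizing this prediction error forces the
    # model to pick up on how scenes evolve over time.
    predictor = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

    frames = torch.rand(16, 3, 64, 64)  # stand-in for a recorded video stream
    for t in range(frames.shape[0] - 1):
        pred = predictor(frames[t].unsqueeze(0))
        loss = nn.functional.mse_loss(pred, frames[t + 1].unsqueeze(0))
        opt.zero_grad()
        loss.backward()
        opt.step()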


Yeah that would work. I'm talking about LLMs, DALL-E and diffusion image generators specifically.


This is exactly what CLIP already does, and there will be massive improvements in this area in coming years, I promise.


So many interesting take-aways here.

* The "bigger models" line of research is probably exhausted for now: "insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got"

* We don't really know how much written data is available. Even Google - which has access to data repositories like scanned books that they cannot share - seems to have trouble getting consistently sized datasets for unknown reasons.

* There seems to be as much (or more?) written English available in books as on the entire web: MassiveText "scrape" = 506B tokens, MassiveText "books" = 560B tokens
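
To make the quoted "equation" concrete, here is a rough sketch of the fitted Chinchilla loss law, L(N, D) = E + A/N^alpha + B/D^beta, with constants as I remember them from the Hoffmann et al. fit (treat the exact numbers as illustrative):

    def chinchilla_loss(n_params, n_tokens,
                        E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        # Predicted training loss as a function of parameter count N and token count D.
        return E + A / n_params**alpha + B / n_tokens**beta

    # Gopher-style (280B params, ~300B tokens) vs. Chinchilla-style (70B params, ~1.4T tokens):
    print(chinchilla_loss(280e9, 300e9))   # higher predicted loss
    print(chinchilla_loss(70e9, 1.4e12))   # lower predicted loss at roughly the same compute

The B/D^beta term is why running out of tokens, rather than parameters, becomes the binding constraint.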


Seems there is an order-of-magnitude error for LaMDA though...


My understanding is that all of these models were trained for one epoch. Not knowing the change in loss in these LMs from doing a second pass seems like a huge blind spot. What's the best example to point to that could help us understand how likely it is that this would move the needle comparably to adding more unique tokens?


Footnote 11 says that repeating data is considered harmful

https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...


For efficiency reasons, no? I don’t think it is saying it makes the model perform worse.


No, it's saying that the model performs worse. If you read the papers linked in footnote 11, you'll see the research in question.


I read one of the papers; it seemed to say that not deduping is a waste of resources, not that it makes things worse.


Not sure which of the two papers you’re referring to. The Anthropic paper [1] clearly shows an example where not deduping leads to serious quality degradation.

However, it's not clear whether training for more than one epoch on deduped and well-balanced data would help. I personally think it should, and the reason people don't do it might be that it's too expensive.

[1] https://arxiv.org/pdf/2107.06499.pdf
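
For intuition, document-level exact dedup is just hashing; the cited papers go much further (exact substring matching with suffix arrays, near-duplicate detection with MinHash), so this is only a toy version of the idea:

    import hashlib

    def dedupe(docs):
        # Drop documents whose normalized content hash has already been seen.
        seen, unique = set(), []
        for doc in docs:
            h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(doc)
        return unique

    corpus = ["The cat sat.", "the cat sat.", "A different document."]
    print(dedupe(corpus))  # the trivially duplicated second document is dropped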


Yes, I was referring to keeping the training set as-is and just running more than one epoch. I don't think we should expect this to necessarily have the same effect as duplicating data within the training set. I understand it's expensive, but if the goal of the paper was to tease out the leverage from each of these possible variables, it's strange that they just ignore this one, which could be significant.


I can't find it now, but I've seen a claim somewhere that it's better to train a 10x model for 1 epoch than a 1x model for 10 epochs. This is most certainly not true for computer vision models (e.g. EfficientNet B0 vs. B7), but perhaps it's true for NLP? I remember that the original BERT was trained for 40 epochs (but only on 3.3B tokens), so I wonder how it would compare to GPT-3 trained for one epoch on the same dataset.
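
Back-of-the-envelope, with the common training-compute approximation C ~ 6*N*D, the two options cost the same FLOPs, so the question is really which way of spending that compute generalizes better:

    # Rough compute comparison using the approximation: training FLOPs ~ 6 * params * tokens.
    def train_flops(n_params, n_tokens, epochs=1):
        return 6 * n_params * n_tokens * epochs

    N, D = 1e9, 50e9                           # hypothetical 1B-param model, 50B-token dataset
    print(train_flops(10 * N, D, epochs=1))    # 10x model, one pass
    print(train_flops(N, D, epochs=10))        # 1x model, ten passes -- identical FLOPs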


Any intuition for why that would be the case for NLP?

Certainly when you have something as large as the datasets being used to train large transformer models, SOME of the data is already repeated. Why would one more epoch make it worse?


The effect of deduplication is mostly related to reducing regurgitation of training data, the decrease in perplexity is small (fig. 2).


Oh, I actually meant to link to this paper: https://arxiv.org/pdf/2205.10487.pdf

They make an argument that there can be an unfortunate duplicate/unique data ratio in a dataset, where the model decides to memorize a frequently repeated chunk of data that is big enough to cause accuracy degradation on the rest of the data, but not so big that memorization becomes difficult (section 5.1). The degradation they show is substantial - almost as if going from an 800M to a 400M parameter model.


Seems like once the easy scaling is over (you can't get more training data or train more on the data you have), the next step is using training data more efficiently with algorithmic improvements?


Or maybe multimodal models. If Google used YouTube as training data, they wouldn't run out of data for a while. Just feed it video once you run out of text.


Or maybe hallucinating additional training data: my impression is that AlphaFold was able to get around a lack of sufficient data by training an initial model and having it make a bunch of predictions; the high-confidence predictions from that set were then used to train the final model.

Given that language models seem to mostly know when they're making correct predictions [1], this method might be useful for stretching the available datasets into something larger without falling into the same pitfalls that repeating data would give you. And if you squint, it kind of looks like daydreaming? When I'm learning a new skill I often find myself playing back scenes and internally running simulations of what I would have done differently.

[1] https://arxiv.org/abs/2207.05221
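
A sketch of that loop (the generate/confidence functions here are made-up stand-ins for a real LM and its calibration signal, not an API from the paper):

    import random

    def generate(prompt):
        # Placeholder: a real LM would sample a continuation here.
        return prompt + " ... some sampled continuation"

    def confidence(text):
        # Placeholder: a real model would score something like P(True) for its own output.
        return random.random()

    def hallucinate_dataset(prompts, threshold=0.9):
        """Keep only generations the model itself is highly confident in."""
        kept = []
        for p in prompts:
            sample = generate(p)
            if confidence(sample) >= threshold:
                kept.append(sample)
        return kept

    print(hallucinate_dataset(["The capital of France is", "2 + 2 ="]))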


Interesting idea. If a person sits down and writes a bunch of essays, that usually leads to improved language skills. Maybe hallucinating additional training data could be something somewhat similar for transformer networks.


It seems obvious that the next barrier to overcome is learning from videos

All of a sudden, the shortage of training data will no longer be a problem, since the amount of video data available dwarfs all other forms of data


I'm not sure you get more words that way, though. The transcript of a half-hour video isn't very long, even if it's mostly talking.


The long-term goal of these systems is to understand and reason about how the world works; videos contain tons of information about this, regardless of words.


But there's a huge information load that could be mined to gain more awareness of words both in and not in the transcript, ideally.


Coincidentally humans also learn from combined audio+visual feeds!


Naive question about units. Are all three terms on the same scale to make direct magnitude comparison meaningful?


Yes. The scale is “units of loss”, which is a strange term, but well founded.


TL;DR - In large language models what matters most is the data, not the model size. And data is finite.


[meta] another useless rant about the web.

the page loads perfectly for me on Chromium 77, but after the bloated JavaScript finally loads, the entire content gets replaced with

> Error: TypeError: Object.fromEntries is not a function

/:


Chrome 77 came out three years ago?


I never updated my setup; I'd need to manually compile the browser, since AFAIK Kiwi Browser doesn't ship stable binaries with telemetry disabled. I'll get around to doing it someday, but I found it quite a struggle, if not impossible, to install the Android SDKs without Android Studio.

Kiwi source branch is Chromium 77.0 + Kiwi backported fixes, will show 88.0.4324.152 to websites for compatibility reasons (64-bit)



