
>You could wonder whether one possible explanation for human sample efficiency is evolution: evolution has given us a small amount of the most useful information possible.

It's definitely not small. Evolution performed a humongous amount of learning, with modern Homo sapiens, an insanely complex molecular machine, as a result. We are able to learn quickly by leveraging this "pretrained" evolutionary knowledge/architecture. It's the same reason ICL has such great sample efficiency.

Moreover, the community of humans created a mountain of knowledge as well, communicating, passing it down over the generations, and iteratively compressing it. Everything you can do beyond your very basic functions, from counting to quantum physics, is learned from 100% synthetic data optimized for faster learning by that collective, massively parallel process.

It's pretty obvious that artificially created models don't have synthetic datasets of the quality even remotely comparable to what we're able to use.



I think it's a bit different. Evolution did not give us the dataset. It helped us establish the most efficient training path, and the data, in enormous volume, starts coming immediately after birth. Humans learn continuously through our senses and use sleep to compress the context. The amount of data that LLMs receive only appears big. In our first 20 years of life we take in at least an order of magnitude more information than the training datasets contain; if we count raw data, maybe 4-5 orders of magnitude more. It's also a different kind of information, processed by a probably much more complex pipeline (our brain consciously processes only a tiny fraction of the input bandwidth, with compression happening along the delivery channels), which is probably the key to understanding why LLMs do not perform better.
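
For a sense of where such numbers come from, here is a rough back-of-envelope sketch; every figure in it is an assumed ballpark, not a measurement, and the answer swings by orders of magnitude depending on what counts as "data":

    # All numbers below are loose assumptions for illustration only.
    seconds_in_20_years = 20 * 365 * 24 * 3600       # ~6.3e8 s
    retinal_stream_bytes_per_s = 1e6                 # ~1 MB/s, a commonly cited ballpark for the optic nerve
    raw_sensory_total = retinal_stream_bytes_per_s * seconds_in_20_years  # ~6e14 bytes
    llm_training_tokens = 15e12                      # ~15T tokens, rough figure for recent large runs
    llm_training_bytes = llm_training_tokens * 4     # ~4 bytes of text per token -> ~6e13 bytes
    print(raw_sensory_total / llm_training_bytes)    # ~10x on these assumptions

Counting uncompressed video-grade input instead of the optic-nerve estimate pushes the ratio several orders of magnitude higher, which is presumably where the wider claims come from.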

Sorry, but this is patently rubbish: we do not consume orders of magnitude more data than the training datasets, nor do we "process" it in anything like the same way.

Firstly, most of what we see, hear, and experience is extremely repetitive: for the first several years of our lives we see the same people, live in the same house, repeatedly read the same few very basic books, etc. So the only way to make this argument is purely in terms of "bytes" of data: humans are getting this super-HD video feed, which means more data than an LLM. Well, we are getting a "video feed", but mostly of the same walls in the same room, which doesn't really mean much of anything at all.

Meanwhile, LLMs are getting LITERALLY all of humanity's recorded textual knowledge, more recorded audio than 10,000 humans could listen to in their lifetimes, more (and more varied) images than a single person could view in their entire life, reinforcement learning on the hardest maths, science, and programming questions, etc.

The idea that humans absorbing "video" somehow amounts to more "data" than frontier LLMs are trained on is, honestly, laughable.


I like your confidence, but I think you missed a few things here and there.

Training datasets are repetitive too. Let's say you feed some pretty large code bases to an LLM: how many times will there be a for loop? How many times are Newton's laws (or any other important ideas) mentioned? Not once, not twice, but many more times. How many times will you encounter a description of Paris, London or St. Petersburg? If you eliminate repetition, how much data will actually be left? And what's the point anyway: this repetition is a required part of the training, because it places that data in context, linking it to everything else.

Is the repetition in our sensory inputs really different? If you have children, or have had the opportunity to observe how they learn, you know they are never confined to the same static repetition cycle. They experience things again and again in a dynamic environment that evolves over time. When they draw a line, they get instant feedback and learn from it, so the next line is different. When they watch something on TV for the fifth time, they do not sit still, they interact - and learn - through dancing, repeating phrases and singing songs. In a familiar environment that they have seen so many times, they notice subtle changes and ask about them. What was that sound? What was that blinking light outside? Who just came in and what's in that box? Our ability to analyze and generalize probably comes from those small observations that happen again and again.

Even more importantly, when nothing is changing, they learn through getting bored. Show me an LLM that can get bored when digging through another pointless conversation on Reddit. When sensory inputs do not bring anything valuable, children learn to compensate through imagination and games, finding ways to utilize those inputs better.

You measure the quality of data using the wrong metrics. Intelligence is not defined by the number of known facts, but by the ability to adapt and deal with the unknown. The inputs that humans receive prepare us for that better than all the written knowledge of the world available to an LLM.


I think the important part of that statement is the "most useful information"; the size itself is pretty subjective, because it's such an abstract notion.

Evolution gave us very good spatial understanding/prediction capabilities, good value functions, dexterity (both mental and physical), memory, communication, etc.

> It's pretty obvious that artificially created models don't have synthetic datasets of the quality even remotely comparable to what we're able to use.

This might be controversial, but I don't think the quality or amount of data matters as much as people think, if we had systems capable of learning in a way similar enough to how humans and other animals do. Much of our human knowledge has accumulated in a short time span, and independent discovery of knowledge is quite common. It's obvious that the corpus of human knowledge is not a prerequisite for general intelligence, yet this corpus is what's chosen to train on.


Please stop comparing these things to biological systems. They have very little in common.

I'm talking about any process that can be vaguely described as learning/function fitting and shares the same general properties as any other learning - not just biological processes; e.g. humanity's distributed knowledge-distillation process is purely social.

Structurally? Yes.

On the other hand, outputs of these systems are remarkably close to outputs of certain biological systems in at least some cases, so comparisons in some projections are still valid.


That's like saying that a modern calculator and a mechanical arithmometer have very little in common.

Sure, the parts are all different, and the construction isn't even remotely similar. They just happen to be doing the same thing.


But they just don't happen to be doing the same thing. People claiming otherwise have to first prove that we are comparing the same thing.

This whole strand of "intelligence is just compression" may be possible, but it's just as likely (if not massively more likely) that compression is just a small piece of how biological intelligence works, or not part of it at all.

In your analogy it's more like comparing a modern calculator to a book. They might have the same answers, but the calculator gets to them through a completely different process. The process is the key part. I think more people would be excited by a calculator that only counts to 99 than by a super-massive book that has all the math results ever produced by humankind.


Well put and captures my feelings on this

They are doing "the same thing" only from the point of view of function, which only makes sense from the point of view of the thing utilizing this function (e.g. a clerical worker that needs to add numbers quickly).

Otherwise, if "the parts are all different, and the construction isn't even remotely similar", how can the thing they're doing be "the same"? More importantly, how is it possible to make useful inferences about one based on the other if that's the case?


The more you try to look into the LLM internals, the more similarities you find. Humanlike concepts, language-invariant circuits, abstract thinking, world models.

Mechanistic interpretability is struggling, of course. But what it found in the last 5 years is still enough to dispel a lot of the "LLMs are merely X" and "LLMs can't Y" myths - if you are up to date on the relevant research.

It's not just the outputs. The process is somewhat similar too. LLMs and humans both implement abstract thinking of some kind - much like calculators and arithmometers both implement addition.


Without a direct comparison to human internals (grounded in neurobiology, rather than intuition), it's hard to say how similar these similarities are, and if they're not simply a result of the transparency illusion (as Sydney Lamb defines it).

However, if you can point us to some specific reading on mechanistic interpretability that you think is relevant here, I would definitely appreciate it.


That's what I'm saying: there is no "direct comparison grounded in neurobiology" for most things, and for many things, there simply can't be one. For the same reason you can't compare gears and springs to silicon circuits 1:1. The low level components diverge too much.

Despite all that, the calculator and the arithmometer do the same things. If you can't go up an abstraction level and look past low level implementation details, then you'll remain blind to that fact forever.

Which papers depends on what you're interested in. There's a lot of research, ranging from weird LLM capabilities to the exact operation of reverse-engineered circuits.


There is no level of abstraction to go up to sans context. Again, let me repeat myself as well: the calculator and the arithmometer do the same things - from the point of view of the clerk that needs to add and subtract quickly. Otherwise they are simply two completely different objects, and we will have a hard time making correct inferences about how one works based only on how we know the other works, or, e.g., how calculating machines in general work.

What I'm interested in is evidence that supports that "The more you try to look into the LLM internals, the more similarities you find". Some pointers to specific books and papers will be very helpful.


> Otherwise they are simply two completely different objects.

That's where you're wrong. Both objects reflect the same mathematical operations in their structure.

Even if those were inscrutable alien artifacts to you, even if you knew nothing about who constructed them, how or why? If you studied them, you would be able to see the similarities laid bare.

Their inputs align, their outputs align. And if you dug deep enough? You would find that there are components in them that correspond to the same mathematical operations - even if the two are nothing alike in how exactly they implement them.

LLMs and human brains are "inscrutable alien artifacts" to us. Both are created by inhuman optimization pressures. Both you need to study to find out how they function. It's obvious, though, that their inputs align, and their outputs align. And the more you dig into internals?

I recommend taking a look at Anthropic's papers on SAEs - sparse autoencoders. It's a method that essentially takes the population-coding hypothesis and runs with it: it attempts to crack the neural coding the LLM uses internally in order to pry interpretable features out of it. There are no "grandmother neurons" there, so you need elaborate methods to examine what kind of representations an LLM can learn to recognize and use in its functioning.

Anthropic's work is notable because they have not only managed to extract features that map to some amazingly high-level concepts, but also proved causality: interfering with the neuron populations mapped out by the SAE changes the LLM's behavior in predictable ways.
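
For anyone unfamiliar with the method: an SAE learns an overcomplete dictionary of features from a model's internal activations, with a sparsity penalty so that each activation is explained by only a few features. A minimal sketch of the core idea (for illustration only, not Anthropic's actual training setup; the layer sizes and the L1 coefficient are placeholders):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            # Overcomplete dictionary: many more features than activation dimensions.
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, activations: torch.Tensor):
            # ReLU keeps feature activations non-negative; the loss makes them sparse.
            features = torch.relu(self.encoder(activations))
            reconstruction = self.decoder(features)
            return reconstruction, features

    def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
        # Reconstruction term keeps the dictionary faithful to the model's activations;
        # the L1 penalty pushes each activation to be explained by few active features.
        mse = torch.mean((reconstruction - activations) ** 2)
        sparsity = torch.mean(torch.abs(features))
        return mse + l1_coeff * sparsity

The interpretable "features" the papers talk about are directions in the learned dictionary whose activations line up with human-recognizable concepts.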


You are making the false assumption that if output can be inferred from structure, the converse is true as well. Similarity in behaviour does not in any way, shape or form imply structural similarity. The boy scout, the migrating swallow, the foraging bee, and the mobile robot are good at orienteering. Do they achieve this goal in a similar manner? Not really.

Re: "I'm baffled that someone in CS, a field ruled by applied abstraction, has to have it explained to them over and over again that abstraction is a thing that exists". Computer science deals with models of computation. You are making the classic mistake of confusing models for the real things they are capable of modelling.


> That's where you're wrong. Both objects reflect the same mathematical operations in their structure.

This is missing the point by a country mile, I think.

All navel-gazing aside, understanding every bit of how an arithmometer works - hell, even being able to build one yourself - tells you absolutely nothing about how the Z80 chip in a TI-83 calculator actually works. Even if you take it down to individual components, there is zero real similarity between how a Leibniz wheel works and how a (full) adder circuit works. They are in fact fundamentally different machines that operate via fundamentally different principles.

The idea that similar functions must mean that they share significant similarities under the hood is senseless; you might as well argue that there are similarities to be found between a nuclear chain reaction and the flow of a river because they are both harnessed to spin turbines to generate electricity. It is a profoundly and quite frankly disturbingly incurious way for anyone who considers themself an "engineer" to approach the world.


You don't get it at all, do you?

"Implements the same math" IS the similarity.

I'm baffled that someone in CS, a field ruled by applied abstraction, has to have it explained to them over and over again that abstraction is a thing that exists.


In case you have missed it in the middle of the navel-gazing about abstraction, this all started with the comment "Please stop comparing these things to biological systems. They have very little in common."[0]

If you insist on continuing to miss the point even when told explicitly that the comment is referring to what's inside the box, not its interface, then be my guest. There isn't much of a sensible discussion about engineering to be had with someone who thinks that e.g. the sentence "Please stop comparing [nuclear reactors] to [coal power plants]. They have very little in common" can be countered with "but abstraction! they both produce electricity!".

For the record, I am not the one you have been replying to.

[0] https://news.ycombinator.com/item?id=46053563


You are missing the point once again.

They have "very little in common", except for the fact that they perform the same kind of operations.


If we think of every generation as a compression step of some form of information into our DNA, and early humans existed for ~1,000,000 years with a generation happening roughly every ~20 years on average, then we have only ~50,000 compression steps to today. Of course, we get genes from both parents, so there is some overlap with others, but especially in the early days the pool of other humans was small. So that still does not look anywhere close to the order of magnitude of modern machine learning. Sure, early humans already had a lot of information in their DNA, but still.

It only ends up in the DNA if it helps reproductive success in aggregate (at the population level) and is something that can be encoded in DNA.

Your comparison is nonsensical and simultaneously manages to ignore the billion or so years of evolution starting from the first proto-cell with the first proto-DNA or RNA.
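
To put very rough numbers on that (every figure below is a loose assumption, used only for the order-of-magnitude contrast), most of the generational turnover happened long before anything human existed:

    # Loose, assumed figures; the point is only the order-of-magnitude gap.
    human_generations = 1_000_000 / 20    # ~5e4 generations, as in the comment above
    # ~1e9 years of (mostly) single-celled ancestry at, say, one generation per day:
    microbial_generations = 1e9 * 365     # ~3.7e11 generations
    print(human_generations, microbial_generations)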


Aren't you agreeing with his point?

The process of evolution distilled all that "humongous" amount down to what is most useful. He's basically saying our current ML methods for compressing data into intelligence can't compare to billions of years of evolution. Nature is better at compression than ML researchers, by a long shot.


Sample efficiency isn't the ability to distill a lot of data into good insights. It's the ability to get good insights from less data. Evolution didn't do that; it had a lot of samples to get to where it did.

> Sample efficiency isn't the ability to distill a lot of data into good insights

Are you claiming that I said this? Because I didn't....

There are two things going on.

One is compressing lots of data into generalizable intelligence. The other is using generalized intelligence to learn from a small amount of data.

Billions of years and all the data that goes along with it -> compressed into efficient generalized intelligence -> able to learn quickly with little data


"Are you talking past me?"

on this site, more than likely, and with intent


>Aren't you agreeing with his point? ... Nature is better at compression than ML researchers, by a long shot.

What I mean is basically the opposite. Nature isn't better in the sense of being more efficient; it just had a lot more time and scale to do it in an inefficient way. The reason we learn quickly is that we can leverage that accumulated knowledge, in a manner similar to in-context learning or other multi-step learning (the bulk of the training forms abstractions which are then used by the next stage). It's really unlikely that we have some magical architecture that is fundamentally better at sample efficiency than e.g. transformers or any other architecture while having bad underlying data; my intuition is there might even be a hard limit to that. The multi-stage bootstrap might be the key, not the architecture.

Same for the social process of knowledge transfer/compression.
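
As a concrete (and deliberately toy) illustration of what "leveraging accumulated knowledge" means here: a pretrained model can pick up a brand-new mapping from a handful of in-context examples, because the bulk training already built the abstractions needed to use them. The `complete` callable below is a stand-in for whatever LLM completion call you have, not a real API:

    # Toy few-shot prompt: the new task is defined only by three in-context
    # examples; the sample efficiency comes from the enormous pretraining stage.
    def classify_review(review: str, complete) -> str:
        # `complete` is a placeholder for an LLM text-completion function.
        prompt = (
            "Classify the review as POS or NEG.\n\n"
            'Review: "The battery died after a week." Label: NEG\n'
            'Review: "Setup took thirty seconds, flawless." Label: POS\n'
            'Review: "Screen cracked on day one." Label: NEG\n'
            f'Review: "{review}" Label:'
        )
        return complete(prompt).strip()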



