I'd say LLM inference requires both memory capacity and bandwidth. Cerebras provides bandwidth with on-chip SRAM, but not capacity (an entire wafer has only 44 GB of SRAM).
Indeed, and even if the cost per wafer were 300K, with roughly 20-50 wafers needed it's still 6MM to 15MM for the system. So it would appear this is VC-subsidized.
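A rough back-of-envelope on the wafer count, with assumed numbers (none of these figures come from Cerebras; a ~400B-parameter model in 16-bit weights plus some KV-cache headroom is just for illustration):

```python
# Back-of-envelope check on the wafer count, using assumed numbers.
params_billion = 400          # assumed parameter count, in billions
bytes_per_param = 2           # fp16 / bf16 weights
kv_headroom_gb = 200          # assumed KV-cache / activation headroom

weights_gb = params_billion * bytes_per_param      # ~800 GB of weights
total_gb = weights_gb + kv_headroom_gb             # ~1000 GB to keep on-chip
wafers = total_gb / 44                             # 44 GB of SRAM per wafer
print(f"~{total_gb} GB needed -> ~{wafers:.0f} wafers")   # roughly 23 wafers
```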
I recently came across a critique of the Turing test that seems relevant here. Given the test's limited duration (five minutes in this study) and the constrained rate of human communication, it’s theoretically possible to anticipate every possible human response and prepare prewritten replies in advance. If such a giant lookup table successfully deceives the interrogator most of the time, would we then consider it intelligent?
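To get a sense of the scale such a table would need (the conversation length, word count, and vocabulary size below are assumptions, not anything from the study):

```python
# Rough upper bound on the size of the lookup table: a five-minute conversation,
# ~200 words typed by the interrogator, drawn from a 10,000-word vocabulary.
vocab_size = 10_000
interrogator_words = 200
table_entries = vocab_size ** interrogator_words   # every possible interrogator input
print(len(str(table_entries)))                     # 801 -- a number with ~800 digits
```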
Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.
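To be concrete about what I mean by search with backtracking, here's a minimal sketch (the vocabulary and the constraint are invented; the point is only the retry step, which a single autoregressive pass doesn't have, since emitted tokens are never revisited):

```python
# Depth-first search over word sequences, where a failed choice is undone
# and another candidate is tried.
VOCAB = ["the", "quick", "brown", "fox", "rabbit", "habit"]

def plan_line(length, must_end_with, line=()):
    """Find a `length`-word sequence whose last word is in `must_end_with`."""
    if len(line) == length:
        return list(line) if line[-1] in must_end_with else None  # reject -> backtrack
    for word in VOCAB:
        result = plan_line(length, must_end_with, line + (word,))
        if result is not None:
            return result            # commit to this branch
        # otherwise: backtrack, i.e. drop `word` and try the next candidate
    return None

print(plan_line(3, {"rabbit", "habit"}))   # ['the', 'the', 'rabbit']
```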
I used the astronomer example earlier as the simplest, most minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper (there's a toy sketch after the list below):
- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)
- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)
- Holding many competing/alternative candidates in parallel.
- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".
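To make those bullets concrete, here is a purely procedural caricature of the same steps. To be clear, this is not a claim about the model's internals (the evidence for those is the attribution graphs in the paper), and the word lists and scores are invented:

```python
# Forward search: candidate end-words given the rhyme; abstract/semantic scoring;
# several candidates held in parallel; then "backward chaining" toward the winner.
RHYME_CANDIDATES = {"-abbit": ["rabbit", "habit", "grab it"]}

def semantic_fit(word, topic):
    # Abstract representation: candidates are scored by topic, not surface form.
    related = {"animals": {"rabbit"}, "routines": {"habit"}}
    return 1.0 if word in related.get(topic, set()) else 0.1

def write_line(topic, rhyme="-abbit"):
    # Hold several competing candidates in parallel, ranked by semantic fit.
    candidates = sorted(RHYME_CANDIDATES[rhyme],
                        key=lambda w: semantic_fit(w, topic), reverse=True)
    target = candidates[0]
    # "Backward chaining": write the rest of the line so it lands on the target,
    # which also fixes earlier choices such as the article.
    article = "an" if target[0] in "aeiou" else "a"
    return f"He never could resist {article} {target}"

print(write_line("animals"))    # ...a rabbit
print(write_line("routines"))   # ...a habit
```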
With that said, I think it's easy for these discussions to devolve into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!
Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)
I'm curious where the state for this "planning" is stored. In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit), those tokens already encode a high probability for what's coming after them, right?
So each token shapes the probabilities for its successors: "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very non-specific tokens, it's likely that the "semantic" state really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line too, to avoid strange non-aesthetic repetitions but attract cool/funky (aesthetic) semantic echoes (like "hare" or "bunny"), and so on, right?)
All of this is baked in during training; at inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot), and even though there's a "loop", there's no algorithm that generates the top N lines and picks the best (no working-memory shuffling).
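To spell out what I mean by that loop having no working-memory shuffling, here's a toy version of it (the next-token table is obviously made up and stands in for the frozen weights):

```python
# Greedy autoregressive decoding over a fixed next-token table. Nothing here
# ever drafts several complete lines and compares them, and no emitted token
# is ever revised.
NEXT = {
    "<start>": {"like": 0.6, "a": 0.4},
    "like":    {"a": 0.9, "the": 0.1},
    "a":       {"rabbit": 0.7, "habit": 0.3},
    "the":     {"rabbit": 1.0},
    "rabbit":  {"<end>": 1.0},
    "habit":   {"<end>": 1.0},
}

def decode():
    tok, out = "<start>", []
    while True:
        tok = max(NEXT[tok], key=NEXT[tok].get)   # single most likely successor
        if tok == "<end>":
            return " ".join(out)
        out.append(tok)                           # committed -- never revisited

print(decode())   # "like a rabbit": no top-N lines generated, nothing backtracked
```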
The planning is certainly performed by circuits which were learned during training.
I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing stitches together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than as a reflection of something similar in the training data.
This is all very speculative, but:
- At the forward planning step, generating the candidate words seems like an intersection of the semantics and the rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily be pieced together from examples that independently built the pathway for the semantics and the pathway for the rhyming scheme (see the sketch after this list).
- At the backward chaining step, many of the features for constructing sentence fragments seem to target something quite general (perhaps animals in one case; others might even just be nouns).
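As a sketch of the intersection point (the word lists are invented): two lists built completely independently can be combined at generation time into candidates that never needed to co-occur anywhere.

```python
# "Semantics pathway" and "rhyme pathway", constructed with no knowledge of
# each other; their intersection is computed only at generation time.
night_sky_words  = {"night", "star", "moon", "light", "cloud", "flight"}
rhymes_with_kite = {"night", "light", "bright", "sight", "flight", "white"}

end_word_candidates = night_sky_words & rhymes_with_kite
print(end_word_candidates)   # {'night', 'light', 'flight'}
```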
Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.
Agreed, but PAC-Bayes and other descendants of VC theory are probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling explanation. See [1] (particularly Sections 11 and 12).
I'm a huge fan of HN just for replies such as this that smash the OP's post/product with something better. It's like at least half the reason I stick around here.
Hard disagree. Your link relies on gradient descent as an explanation, whereas OP explains why optimization is not needed to understand DL generalization. PAC-Bayes, and the other countable-hypothesis bounds in OP, are also quite divergent from VC dimension. The whole point of OP seems to be that these other frameworks, unlike VC dimension, can explain generalization with an arbitrarily flexible hypothesis space.
Yes, and that's the problem. What Zhang et al. [2] showed convincingly in the Rethinking paper is that focusing on the hypothesis space alone cannot be enough, since the same hypothesis space fits both real and random data, so it is already too large. Therefore, these methods that focus on the hypothesis space have to invoke a bias in practice towards a better subspace, and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypotheses over others in the space.
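A tiny version of the random-label observation, using an overparameterized linear model (all numbers here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                   # 20 examples, 100 parameters
X = rng.normal(size=(n, d))
y_real = np.sign(X[:, 0])                        # labels with real structure
y_rand = rng.choice([-1.0, 1.0], size=n)         # pure noise labels

for name, y in [("real labels", y_real), ("random labels", y_rand)]:
    w = np.linalg.pinv(X) @ y                    # least-squares interpolator
    print(name, "train MSE:", float(np.mean((X @ w - y) ** 2)))
# Both are ~0: the same hypothesis space fits structure and noise equally well,
# so its size alone can't be what explains generalization.
```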
But once you are ready to do that, algorithmic stability is enough. You don't then need to think about Bayesian ensembles or other proxies/simplifications etc., but can focus on just the specific learning setup you have. BTW, algorithmic stability is not a new idea. An early version showed up within a few years of VC theory in the 80s in order to understand why nearest neighbors generalizes (it wasn't called algorithmic stability then, though).
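For anyone who hasn't seen it, the uniform-stability notion I'm referring to is roughly the standard Bousquet–Elisseeff formulation (nothing specific to [1]):

```latex
% A learning algorithm A is \beta-uniformly stable if replacing any single
% training example changes its loss on any point z by at most \beta:
\[
\bigl|\,\ell(A_S, z) - \ell(A_{S^{i}}, z)\,\bigr| \;\le\; \beta
\qquad \text{for all datasets } S,\ \text{indices } i,\ \text{points } z,
\]
% where S^{i} is S with example i replaced. For such an algorithm the expected
% generalization gap is bounded by \beta, with no reference to the size of the
% hypothesis space:
\[
\mathbb{E}\bigl[R(A_S) - \hat{R}_S(A_S)\bigr] \;\le\; \beta .
\]
```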
If you are interested in this, I also recommend [3].
But it's not a problem; it's actually a good thing that OP's explanation is more general. One of the main points of the OP paper is that you do not in fact need proxies or simplifications. You can derive generalization bounds that do explain this behavior without needing to rely on optimization dynamics, which exactly responds to the tests set forth in Zhang et al. OP does not "rely on Bayesian ensembles, or other proxies/simplifications"; that seems to be a misunderstanding of the paper. It's analyzing the solutions that neural networks actually reach, which differentiates it from a lot of other work. It also shows how other simple model classes can reproduce the same behavior, and these reproductions do not depend on optimization.
"and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypothesis over others in the space." But the OP paper explains how even "guess and check" can generalize similarly to SGD. It's becoming more well understood that the role of the optimizer may have been historically overstated for understanding DL generalization. It seems to be more about loss landscapes.
Don't get me wrong, these references you're linking are super interesting. But they don't take away from the OP paper which is adding something quite valuable to the discussion.
Thank you for the great discussion. You've put your finger on the right thing, I think. We can now dispense with the old VC-type thinking (i.e., that we get generalization because the hypothesis space is not too complex). Instead, the real question is this: is it the loss landscape itself, or the particular way in which the landscape is searched, that leads to good generalization in deep learning?
One can imagine an "exhaustive" search of the loss landscape with, say, God's computer, picking an arbitrary point among all the points that minimize the training loss (or come close to the minimum). Or, with our computers, we can merely sample. But in both cases, it's hard to see how one would avoid picking "memorization" solutions in the loss landscape. Recall that in an over-parameterized setting, there will be many solutions that have the same low training loss but very different test losses. The reference in my original post [1] shows a nice example with a toy overparameterized linear model (Section 3) where multiple linear models fit the training data but have very different generalization. (It also shows why GD ends up picking the better-generalizing solution.)
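Here is a toy numerical version of that point (the data-generating process is made up and this is not the exact example from Section 3 of [1]): many weight vectors fit the training data exactly, but their test errors differ wildly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 50, 10, 1000               # d >> n_train: overparameterized
w_true = np.zeros(d); w_true[:3] = 1.0          # simple ground truth

X_tr = rng.normal(size=(n_train, d)); y_tr = X_tr @ w_true
X_te = rng.normal(size=(n_test, d));  y_te = X_te @ w_true

# Interpolator 1: the minimum-norm solution (what GD initialized at zero finds).
w_min = np.linalg.pinv(X_tr) @ y_tr

# Interpolator 2: the same solution plus a large component in the null space of
# X_tr -- it still fits the training data exactly.
_, _, Vt = np.linalg.svd(X_tr)
w_bad = w_min + 50.0 * Vt[-1]                   # Vt[-1] is a direction X_tr can't see

for name, w in [("min-norm", w_min), ("min-norm + null-space junk", w_bad)]:
    print(name,
          "train MSE:", float(np.mean((X_tr @ w - y_tr) ** 2)),
          "test MSE:",  float(np.mean((X_te @ w - y_te) ** 2)))
# Same (zero) training loss, wildly different test loss -- picking an arbitrary
# interpolating point would usually land on something like the second one.
```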
Now, people have argued that the curvature around the solution distinguishes well-generalizing solutions from poorly-generalizing ones. Granted, we are already moving into the territory of how the space is sampled, i.e., the specifics of the search algorithm (a direction you may not like). But even if we press ahead, it's not a satisfactory explanation, since in a linear model with L2 loss the curvature is the same everywhere, as Zhang et al. pointed out. So curvature theories already fail for the simplest case, unless one believes that linear models are somehow fundamentally different from deeper, non-linear models.
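That linear-model fact is easy to check directly (the data below is random; the only point is that the Hessian of the squared loss for a linear model is 2·XᵀX, which doesn't depend on the weights at all):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
loss = lambda w: np.sum((X @ w - y) ** 2)       # linear model, L2 loss

def hessian(w, eps=1e-3):
    """Finite-difference Hessian of `loss` at w (exact here: loss is quadratic)."""
    d = len(w)
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (loss(w + eps*I[i] + eps*I[j]) - loss(w + eps*I[i] - eps*I[j])
                       - loss(w - eps*I[i] + eps*I[j]) + loss(w - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

H1 = hessian(rng.normal(size=5))                # curvature near one point...
H2 = hessian(100.0 * rng.normal(size=5))        # ...and at a wildly different one
print(np.allclose(H1, H2, atol=1e-2))           # True: same curvature everywhere
print(np.allclose(H1, 2 * X.T @ X, atol=1e-2))  # True: it's just 2 * X^T X
```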
[1] points out other troubling facts about the curvature explanation (Section 12), but the one I like most is the following: per curvature theories, the reason for good generalization at the start of the training process is fundamentally different from the reason for good generalization at the end of the training process. (As always, generalization is just the difference between test and training loss, so good generalization means that difference is small, not necessarily that the test loss is small.) At the start of GD training, curvature theories would not be applicable (we just picked a random point, after all), so they would hold that we get good (in fact, perfect) generalization because we didn't look at the training data. At the end of training, however, they say we have good generalization because we found a shallow minimum. This lack of continuity is disconcerting. In contrast, stability-based arguments provide a continuous explanation: the longer you run SGD, the less stable it is (so don't run it too long and you'll be fine, since you'll achieve an acceptable tradeoff between lowering the loss and overfitting).