* Inner layers of transformers share a representation space
* Some middle layers can be dropped without total failure (though it results in reduced performance)
* Middle layers are not interchangeable; they perform different functions
* The order of layers only matters somewhat
* Layers can, to some extent, be executed in parallel
Each layer performs a different function but speaks the same language as other layers. A stack of transformers isn’t performing a sequence of fundamental transformations as much as it is performing a sequence of additions, each layer adding new paint to a shared canvas.
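To make the "shared canvas" picture concrete, here is a minimal sketch of a standard pre-norm transformer block (illustrative PyTorch, not the paper's code): every layer reads the residual stream and writes back into it by addition, which is what lets all layers operate in one shared representation space.

```python
# Illustrative pre-norm transformer block (dimensions are arbitrary): each
# layer's contribution is literally added into a shared residual stream.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                        # "add new paint" from attention
        x = x + self.mlp(self.ln2(x))    # and from the MLP
        return x

# A stack is then a sequence of additions into the same stream ("canvas").
blocks = nn.ModuleList(Block() for _ in range(12))
x = torch.randn(1, 16, 512)              # (batch, sequence, d_model)
for block in blocks:
    x = block(x)
```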
Since the layers speak the same language, it makes me wonder how we could modify and extend a transformer. Can you train other models to share the same representational space and have them “plug in” to the transformer? Does this shared representational space make it easier to perform RL and unlock agentic behavior?
DropPath is a regularisation technique that uses one of those insights. I wonder if the other insights could also be turned into regularisation methods, such as randomly shuffling layers or executing random subsets of layers in parallel.
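As a rough sketch of what those could look like (my own extrapolation; only the DropPath part is an established technique), a forward pass could skip whole residual blocks at random and permute the middle blocks during training:

```python
# Hedged sketch: DropPath-style skipping of whole residual blocks, plus a
# hypothetical "shuffle the middle layers" regularizer. Assumes each block is
# residual (x -> x + f(x)), so skipping one is a no-op rather than a failure.
import random
import torch
import torch.nn as nn

def forward_with_layer_noise(blocks: nn.ModuleList, x: torch.Tensor,
                             drop_prob: float = 0.1,
                             shuffle_middle: bool = True,
                             training: bool = True) -> torch.Tensor:
    order = list(range(len(blocks)))
    if training and shuffle_middle and len(blocks) > 6:
        middle = order[3:-3]
        random.shuffle(middle)            # keep the first/last ~3 layers in place
        order = order[:3] + middle + order[-3:]

    for i in order:
        if training and random.random() < drop_prob:
            continue                      # DropPath: skip this whole block
        x = blocks[i](x)
    return x
```

A proper stochastic-depth implementation would also rescale the surviving branches (or use per-sample masks rather than per-batch skips); this is just the shape of the idea.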
> Can you train other models to share the same representational space and have them “plug in” to the transformer?
English is kind of a shared representational space slowly transforming itself. Embeddings are kind of a finite, floating-point approximation of that infinite representational space.
> A stack of transformers isn’t performing a sequence of fundamental transformations as much as it is performing a sequence of additions, each layer adding new paint to a shared canvas.
I don't know if "additions" is a good mental model.
If you have layers 1..N that you're training via backprop, then layer N has no reason to "push" some of its computation "back" to layer N-1 if that computation could be done fully independently using the information already available at layer N-1. Instead, you'd just get a wider parameter space + post-pruning embedding vector at layer N, to do more parallel work and produce more parallel outputs at layer N.
The only reason you end up with more than a single hidden layer doing anything other than pure passthrough is that a given layer K is constrained in the operations it can perform. If layer K requires inputs that are more than a single linear AddMM+softmax away from its own input, then layer K can't do those operations "on its own", and needs some other layer to supply a pre-transformed input for layer K to do "the rest" of the work on. In practice, layer K thus acts as a loss function to train layer K-1 to compute those nonlinear inputs that layer K needs; and so on, pushing "dependent responsibilities" for computing the outputs all the way back to the input layer.
(You might have the intuition that a layer might just "run out of parameter space" and so need to "slide" some of the computation backward in time to a previous layer — but no, there'd be no reason to do this, as the abstract NN passthrough nodes required to propagate that already-complete computation from a previous layer, take up just as much space in a Transformer's Q/K embedding vectors as abstract NN nodes that are actually doing nontrivial computation do.)
So fundamentally, each layer is doing something to the previous layer's output that's not just "addition" (an operation which is only one AddMM+softmax away).
...but that being said, it can be something conceptually equally-trivial to addition. For example, in an image-generation model, "alpha-blending by weighted averaging in an oppositional-color-space projection" is conceptually trivial, but isn't a simple AddMM+softmax, and so requires another NN layer each time any object must be composited on top of any other existing alpha-blended object.
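A toy way to see the constraint being described here (my own illustration, not from the thread): a target like the product of two inputs is more than one affine map away from the input, so a single linear layer can't represent it, while a two-layer network can, because the first layer supplies the nonlinear intermediates the second layer needs.

```python
# Toy demo: y = x1 * x2 is not one affine map away from (x1, x2).
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(4096, 2) * 2 - 1            # inputs in [-1, 1]
y = (x[:, 0] * x[:, 1]).unsqueeze(1)       # target: elementwise product

def fit(model: nn.Module, steps: int = 2000) -> float:
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

one_affine = nn.Linear(2, 1)                                             # a single "layer"
two_layers = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

print("single affine layer MSE:", fit(one_affine))   # stuck near Var(y) ~= 0.11
print("two-layer network MSE:  ", fit(two_layers))   # close to zero
```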
---
The interesting intuition that this "dependency graph" mental model of a Transformer gives you is that, despite the analogy used in the paper:
> The canvas (input) is passed along a series of painters. Some painters specialize in birds, while others are better at painting wheels. Each painter receives the canvas from the painter below her, then she decides whether to add a few strokes to the painting or just pass it along to the painter above her
...it's actually still possible for several layers to all know how to paint a bird and a wheel and other things too.
The constraint is that those things all need to be able to be done in parallel and independently of each other, for them to be "scheduled" together as parallel operations within a single layer. A given layer can't do two things at the same time (or at least can't finish two things at the same time) if those two things are interdependent in a way that requires a nontrivial (not just AddMM+softmax) amount of math to merge.
Whereas, if any object depends on another object, then the work for one or the other object has to be "pushed backward in time" at training time to some previous layer, so that the outputs of that operation can be used as inputs. (Thus "painting" in a very literal sense — paint has to already be on the canvas by the time you want to blend on top of it!)
When that "pushing computation backward in time" happens often during training, in a biased way (i.e. with a causal correlation in the computational-dependency order that occlusion/lighting/reflection/etc. effects require the parts of the scene to be composed in), then due to the "scheduling constraint", some particular layers might end up trained more often to do a particular thing; and so end up better at doing that thing; and so end up being "the place" where that thing happens.
But just the same, if the "pushing computation backward in time" happens often during training in an unbiased way (and/or if you paint birds-on-birds-on-birds in your image, such that there's no way to get one layer to be "the" bird expert), then many layers will end up being trained to do the task, slowly, as at one time or another the "responsibility" for learning that sub-task falls on one or more different layers for each training example.
Nice~ Glad to see this published / confirmed by others. Next I hope to see some of this symmetry used to improve MoE / dynamic compute / adaptive style models!
Context: I found the same structure (early, middle, and end layers serving different purposes, including the permutability of the middle layers) a year or so ago, but never got around to testing more models rigorously or publishing it.
> One interesting finding though (now that I'm rambling and just typing a lot) is that in a static model, you can "shuffle" the layers (eg. swap layer 4's weights with layer 7's weights) and the resulting tokens roughly seem similar (likely caused by the ResNet style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end. (rather than staying in token space the whole way through).
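For anyone who wants to poke at that themselves, here's a minimal sketch of the swap experiment described in the quote, assuming a Hugging Face GPT-2 checkpoint (layers 4 and 7 are just the indices mentioned above):

```python
# Hedged sketch: swap two middle GPT-2 blocks and compare next-token
# predictions. `model.transformer.h` is GPT-2's list of transformer blocks.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The quick brown fox jumps over the", return_tensors="pt")
with torch.no_grad():
    baseline = model(**inputs).logits[0, -1].topk(5).indices

# Swap two middle blocks (indices are illustrative).
h = model.transformer.h
h[4], h[7] = h[7], h[4]

with torch.no_grad():
    shuffled = model(**inputs).logits[0, -1].topk(5).indices

print("baseline top-5:", tok.convert_ids_to_tokens(baseline.tolist()))
print("shuffled top-5:", tok.convert_ids_to_tokens(shuffled.tolist()))
```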
No need to postulate platonic forms. All we need is the idea that there are real patterns to be mapped. The idea that distinct nets can share a representational space has been around at least since Laakso and Cottrell published their "Content and cluster analysis: assessing representational similarity in neural systems" in 2000. If you look for "representational similarity analysis" you'll find more research about it.
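For readers unfamiliar with the term, a minimal sketch of the basic RSA idea (illustrative, not the Laakso and Cottrell code): two representations are "similar" to the extent that their pairwise-dissimilarity structures over the same stimuli agree, which sidesteps the fact that the two spaces may have different dimensions.

```python
# Minimal RSA-style comparison: correlate the two representations'
# dissimilarity structures over the same set of stimuli. Purely illustrative;
# acts_a / acts_b stand in for activations you have already collected.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Spearman correlation of the two pairwise-dissimilarity structures;
    the feature dimensions of the two representations may differ."""
    rdm_a = pdist(acts_a, metric="correlation")    # condensed dissimilarity matrix
    rdm_b = pdist(acts_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return float(rho)

# Toy usage: 5 "concepts" x 10 stimuli each, seen through two different random
# linear read-outs. The cluster structure survives both views, so the score is
# high even though the two spaces have different dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 10)
base = 3.0 * rng.normal(size=(5, 64))[labels] + rng.normal(size=(50, 64))
acts_a = base @ rng.normal(size=(64, 768))
acts_b = base @ rng.normal(size=(64, 512))
print(rsa_score(acts_a, acts_b))
```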