
I wrote a guide [1] that collects this information in one place.

[1] https://akr.am/blog/posts/a-guide-to-compiling-programs-on-w...


Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?

Unfortunately a lot of the theory does require some heavy mathematics, the kind you won't see in a typical undergraduate degree, even in math-heavy subjects like physics: topics such as differential geometry, measure theory, set theory, abstract algebra, and high-dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important to have a deep understanding of what these mathematical operations are doing. This exercise book does look like it is trying to build that intuition, though I haven't read it in depth. I can say it is a good start, but only the very beginning of the theory journey. There is a long road ahead beyond this.

  > how it makes me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While it doesn't give a precise answer, it'll help you gain some intuition about the minimal number of parameters you need (and the VGG paper will help you understand width vs depth). In a transformer, the MLP block after attention scales the dimension up 4x before projecting back down, which allows any knots in the data to be untangled. While 2x is the minimum, 4x creates a smoother landscape, so the problem can be solved more easily. Some of this is discussed in a paper by Schaeffer, Miranda, and Koyejo that counters the famous Emergent Abilities paper by Wei et al. This should be discussed early on in ML courses when covering problems like XOR or the concentric circles. These problems are difficult because in their natural dimension you cannot draw a hyperplane discriminating the classes, but by increasing the dimensionality of the problem you can (see the sketch below). This fact is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as a discussion of the Whitney embedding theorem, which would let you generalize the concept.
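
To make the dimensionality-lifting point concrete, here is a minimal numpy sketch of XOR. The product feature is just one hypothetical choice of embedding (many nonlinear maps work), not the canonical one:

  import numpy as np

  # XOR in its native 2D space: no single hyperplane separates the classes.
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([0, 1, 1, 0])

  # Lift to 3D by appending the product feature x1*x2.
  # In the lifted space the classes become linearly separable.
  X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

  # Normal vector of one separating plane in the lifted space:
  w = np.array([1.0, 1.0, -2.0])
  scores = X_lifted @ w                  # -> [0, 1, 1, 0]
  print((scores > 0.5).astype(int))      # matches y exactly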

  > the activation functions
There's a very short video I like that visualizes GELU[0], even using the concentric circles! The channel has a lot of other visualizations that will really benefit your intuition. You may see where the differential geometry background provides benefits: understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations stop helping once you scale beyond 3D, as weird things happen in high dimensions, even as low as 10[1]. A lot of visual intuition goes out the window, and this often leads people to either completely abandon it or make erroneous assumptions (no, your friend cannot visualize 4D objects[2,3], and that image you see of a tesseract is quite misleading).
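
If you'd rather poke at GELU numerically than visually, here is a small sketch of the exact form and the common tanh approximation (the test points are arbitrary):

  import numpy as np
  from scipy.stats import norm

  def gelu(x):
      # Exact GELU: the input weighted by the standard-normal CDF,
      # i.e. GELU(x) = x * P(Z <= x) with Z ~ N(0, 1).
      return x * norm.cdf(x)

  def gelu_tanh(x):
      # The tanh approximation used by many frameworks.
      return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

  xs = np.linspace(-4, 4, 9)
  print(np.max(np.abs(gelu(xs) - gelu_tanh(xs))))  # gap on the order of 1e-3 or less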

The activation functions provide the non-linearity in these networks, a key ingredient missing from the perceptron model. Remember that by the universal approximation theorem you can approximate any continuous function over a compact domain. In simple cases you can relate this to Riemann summation, but using smooth "bump functions" instead of rectangles. I'm being fairly hand-wavy here on purpose, because this is not precise, but there are relationships to be found here; this is an HN comment, so I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication plus a bias is capable of (another oversimplification).
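
Both claims are easy to check numerically. A sketch below: part 1 verifies the affine collapse, and part 2 is my own illustration of the Riemann-summation analogy (the sin target, the 50-bump grid, and the sharpness 40 are arbitrary choices), not a statement of the theorem itself:

  import numpy as np
  from scipy.special import expit  # numerically stable sigmoid

  rng = np.random.default_rng(0)

  # 1) No activation: stacked linear layers collapse into a single
  #    affine map W x + b, so depth alone adds no expressive power.
  W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
  W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
  x = rng.normal(size=4)
  assert np.allclose(W2 @ (W1 @ x + b1) + b2,
                     (W2 @ W1) @ x + (W2 @ b1 + b2))

  # 2) With a nonlinearity, one hidden layer approximates a target on a
  #    compact interval, Riemann-sum style: pairs of shifted sigmoids
  #    form "bumps" on a grid, each weighted by the target's height
  #    at the bump's center.
  xs = np.linspace(0, 2 * np.pi, 500)
  centers = np.linspace(0, 2 * np.pi, 50)
  width, sharp = centers[1] - centers[0], 40.0
  bumps = (expit(sharp * (xs[:, None] - (centers - width / 2)))
           - expit(sharp * (xs[:, None] - (centers + width / 2))))
  approx = bumps @ np.sin(centers)

  print(np.max(np.abs(approx - np.sin(xs))))  # shrinks as the grid refines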

The learning curve is quite steep, and there's a big jump from the common "it's just GMMs" or "it's just linear algebra" claims[4]. There is a lot of depth here, and unfortunately, due to the hype, there is a lot of material labeled "deep" or "advanced" mathematics; it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if your background doesn't go beyond calculus, you are going to struggle, and I am extremely empathetic to that. Again, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy or that there isn't a lot of noise surrounding the topic, because that'd be a lie. If it were easy, ML systems wouldn't be "black boxes"![5]

I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper[6]. There is a common misunderstanding of the saying "with 4 parameters I can fit an elephant, and with 5 I can make it wiggle its trunk"; [6] can help provide a better understanding of it, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical studies and theory; neither is any good without the other. Science is about building causal models, so one must be careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.

[0] https://www.youtube.com/watch?v=uiB97cPEVxM

[1] https://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

[2] https://www.youtube.com/shorts/_n7TMDnYdVY

[3] https://www.youtube.com/watch?v=FfiQBvcdFG0

[4] https://news.ycombinator.com/item?id=43418334

[5] I actually dislike this term. It is better to say that these models are opaque. "Black box" would imply that we have zero insight, but in reality we can see everything going on inside; it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.

[6] https://www.youtube.com/watch?v=hV41QEKiMlM


