I was uncertain, but your other statements made me think that sentiment was unintentional. I just want to push back against it because it's too common and gets misused even with good intentions. I hope you don't see this as me saying anything about your character; honestly, my impression is that you do care.
> It sounds like the generalisation of projected gradient decent to "Muon"
I'm not a niche expert here, but I do have knowledge in adjacent/overlapping domains. It sounds like you're in a similar boat? I ask because this ties back to what I was trying to say about sometimes needing an expert eye.
If it helps, here's the "paper" for the Muon optimizer[0] and here's a follow-up[1]. Muon is definitely a gradient descent technique, but so are Adam, SGD, Adagrad, and many more[2].
The big thing in Muon is the NewtonSchulz5 step. You update parameters with θ_t = θ_{t-1} - η[NS_5(μB_{t-1} + ∇L(θ_{t-1}))] (I bracketed the NS_5 term so you can see that this is just a specific instance of θ_t = θ_{t-1} - ηF(∇L(θ_{t-1}), ...), and standard gradient descent -- θ_t = θ_{t-1} - η∇L(θ_{t-1}) -- is also in that class of functions, right?). So we should be careful not to overgeneralize and say this is "just gradient descent." By that logic you could say [1] is "just [0] but with weight decay" (or go compare the Adam and AdamW algos ;)
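If it helps to see that structure concretely, here's a minimal sketch of that update in PyTorch. The function names are mine, the Newton-Schulz coefficients are from my reading of [0] (treat them as illustrative), and real implementations add details I've skipped (bfloat16 casting, Nesterov momentum, weight decay, and the fact that Muon is only applied to 2D weight matrices):

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D gradient/momentum matrix G via a
    quintic Newton-Schulz iteration (coefficients per my reading of [0])."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)          # normalize so the iteration converges
    transpose = G.size(0) > G.size(1)
    if transpose:                     # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(theta, grad, buf, lr=0.02, momentum=0.95):
    """One Muon-style update: momentum buffer, orthogonalize, then descend.
    This is the theta_{t-1} - eta * NS_5(mu * B_{t-1} + grad) form above."""
    buf.mul_(momentum).add_(grad)              # B_t = mu * B_{t-1} + grad
    theta.sub_(lr * newton_schulz5(buf))       # theta_t = theta_{t-1} - eta * NS_5(B_t)
    return theta, buf
```

The point of the sketch is just that the NS_5 post-processing of the momentum buffer is where Muon stops being vanilla gradient descent, in the same way Adam's per-coordinate scaling is where Adam stops being vanilla SGD.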
But one thing I should add is that gradient descent algorithms aren't aware of the geometry of the loss surface on their own. I was able to find this post[3], which asks a related question: under what conditions does a gradient descent trajectory follow the surface's geodesic (note that Newton's method differs from GD here too). I don't think this paper produces a GD formulation that actually follows a geodesic to the minimum, but my take is that it's working in that direction. To clarify why we'd want that: the geodesic gives the shortest, or most energy-efficient, path (whichever perspective you prefer). In optimization we're trying to accomplish (at least) these two things: 1) take the "best" path to the optimum, and 2) find the best optimum. Unfortunately both are ill-defined and there aren't always objective answers. But in an ideal gradient descent algorithm we'd want it to reach the global minimum along the fastest path, right? That's where being aware of the geometry helps (it's part of why people look at the Hessian, though that comes at the cost of extra computation, even if the additional information gets us there in fewer steps -- so it's not (always) "the best").
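To make that "geometry helps, but it costs more" trade-off concrete, here's a tiny toy comparison of my own (not from the paper): on an ill-conditioned quadratic, plain GD crawls along the flat direction, while a Newton step uses the Hessian to jump straight to the minimum, at the price of a linear solve per step.

```python
import numpy as np

# Toy loss L(theta) = 0.5 * theta^T H theta with an ill-conditioned Hessian.
H = np.diag([100.0, 1.0])        # very different curvature in each direction
theta_gd = np.array([1.0, 1.0])
theta_newton = np.array([1.0, 1.0])
eta = 0.009                      # must stay below 2/100 or GD diverges on the steep axis

for _ in range(50):
    theta_gd -= eta * (H @ theta_gd)                          # GD: theta - eta * grad
    theta_newton -= np.linalg.solve(H, H @ theta_newton)      # Newton: theta - H^{-1} grad

print(theta_gd)      # steep axis is basically solved, flat axis is still far from 0
print(theta_newton)  # at the minimum after the first step (exact for a quadratic)
```

That's the sense in which curvature information buys you fewer steps but each step is more expensive, and why there's no single objective notion of "the best" optimizer.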
I know this isn't a full answer, and maybe with more reading I'll have a better one for you. But I'm hoping my answer can at least help you see some of the underlying nuanced problems that (_I think_) the authors are trying to get at. Hopefully I'm not too far off base lol. I'm hoping someone with more expertise can jump in and provide corrections/clarifications in the meantime.
[0] https://kellerjordan.github.io/posts/muon/
[1] https://arxiv.org/abs/2502.16982
[2] (far from a complete list) https://docs.pytorch.org/docs/stable/optim.html#algorithms
[3] (I think similar types of questions may also be fruitful) https://mathoverflow.net/questions/42617/functions-whose-gra...