I have this feeling that the lower bound for matrix multiplication is heading towards some transcendental number that we don’t know about, or a function of one we do.
I wonder if finding the answer to that will clue someone in to what the optimal solution is, by pointing to some hidden structure to the problem that we have all missed.
Maybe instead they'll be like Feigenbaum constants: apparently universal constants that we don't really understand. For instance, we don't even know for sure that Feigenbaum constants are irrational, though mathematicians who study them strongly suspect they're transcendental.
I love these types of improvements - I figure they don't really have a practical application (except for shaving a few milliseconds off ultra large matrices), but they're a testament to human ingenuity and the desire to strive for perfection no matter how far out of reach.
Strassen's algorithm is rarely used: its primary use, to my understanding, is in matrix algorithms over more exotic fields than the reals or complex numbers, where minimizing multiplies is extremely useful. Even then, Strassen only starts to beat out naive matrix multiply when you're looking at n in the thousands, and at that size you're probably starting to think about sparse matrices, where your implementation strategy is completely different. But for a regular dgemm or sgemm, it's not commonly used in BLAS implementations.
The algorithms more advanced than Strassen's are even worse in terms of the cutover point, and are never seriously considered in practice.
The cross-over can be around 500 (https://doi.org/10.1109/SC.2016.58) for 2-level Strassen. It's not used by regular BLAS because it is less numerically stable (a concern that becomes more severe for the fancier fast MM algorithms). Whether or not the matrix can be compressed (as sparse, fast transforms, or data-sparse such as the various hierarchical low-rank representations) is more a statement about the problem domain, though it's true a sizable portion of applications that produce large matrices are producing matrices that are amenable to data-sparse representations.
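To make the cutover and stability points concrete, here's a rough numpy sketch (the cutoff of 512 is just a placeholder I picked, not a tuned value): one level-by-level Strassen recursion that falls back to plain np.dot below the cutoff, plus a crude float32 round-off check against a float64 reference.

```python
import numpy as np

def strassen(A, B, cutoff=512):
    """Strassen recursion that falls back to the vendor GEMM (np.dot)
    below `cutoff`. Toy sketch for square matrices, not tuned code."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 half-size products instead of the naive 8
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Crude stability check: float32 inputs, float64 reference.
rng = np.random.default_rng(0)
A = rng.standard_normal((2048, 2048)).astype(np.float32)
B = rng.standard_normal((2048, 2048)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
for name, C in (("np.dot", A @ B), ("strassen", strassen(A, B))):
    print(name, "max error:", np.abs(C - ref).max())
```

The extra additions and subtractions are where the numerical-stability concern comes from; measure it yourself rather than trusting my cutoff or my claim about the error.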
No, these aren't practical algorithms. No one uses them in reality.
The hope is that these small improvements will lead to new insights that then lead to big improvements. I doubt the way matrix multiplication is actually done will change because of these theoretical results, unless something really groundbreaking and shocking is discovered.
It's all relatively minor in the grand scheme of things.
I think the last thing I saw on matrices was looking toward optimizing slightly larger sub-problems.
That smells a little bit like a loop unrolling and/or locality improvement to me. You can often beat a theoretical algorithmic improvement with a practical one in such situations. And if you do a little bit of both, you hopefully end up with something that's faster than the current most practical implementation.
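As a sketch of what a locality improvement looks like here, this is the classic blocking/tiling loop structure (the tile size of 64 is an arbitrary guess; real BLAS kernels choose it to match registers and cache levels, and write the inner tile product as a hand-tuned kernel rather than calling np.dot):

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Tile the i/j/k loops so each tile of A and B is reused many times
    while it is still hot in cache. Illustrates the loop structure only;
    the inner product is delegated to np.dot."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

# Sanity check against numpy on a small case.
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Same O(n^3) arithmetic, same answer; the only thing that changes is how often data has to be pulled back into cache.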
There's a hell of a lot of matrix multiplications going on in AI/ML. Shaving off a few milliseconds for ultra large matrices could save power and money for the big players like Google, Microsoft/OpenAI and Meta.
Tangential question: How can an algorithm end up with a fractional exponent? It's clear to me how nested loops can give O(n^2) or O(n^3), and how a binary search can give O(log n), but how on earth can you get something like O(n^2.371552), or even O(n^2.7)?
The classic example here is Strassen's algorithm. You divide each input matrix into quarters, and it turns out you only need 7 matrix multiplies of the smaller (n/2 x n/2) matrices instead of the 8 you'd naively expect. When you add up the total work across all the levels of recursion, it comes out to n^(log base 2 of 7), or about n^2.807.
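If you want to see the exponent fall out, you can just count the scalar multiplications the recursion performs; the extra additions are O(n^2) per level and don't affect the exponent. A tiny sketch, assuming n is a power of two and the recursion goes all the way down:

```python
import math

def strassen_mults(n):
    """Scalar multiplications used by pure Strassen on an n x n matrix
    (n a power of two): each level does 7 half-size products."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)

for n in (64, 256, 1024):
    m = strassen_mults(n)
    print(n, m, math.log(m, n))   # the exponent: log2(7) ~= 2.807
```

Doubling n multiplies the count by 7 rather than 8, and solving 2^x = 7 gives x = log2(7) ≈ 2.807; that's the whole trick behind the fractional exponents.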
Pretty much all progress in theoretical matmul comes from approximate decomposition tools, so the results only kick in when the approximation error is small enough to be fixed up cheaply, which in practice means very large multiplications.
On the small side, even the optimal (3x3)(3x3) matmul isn't known more precisely than "between 19 and 23" multiplications, with only tiny progress lately in ruling out some symmetric solutions (https://arxiv.org/html/2402.01011v1).
Tensor cores massively accelerate matrix multiplication with no algorithmic breakthrough, just by being smart about how you move/cache/reuse values: all operations that the complexity analysis treats as O(1).
There should be an O notation which takes into account everything - various kinds of memory latencies, speed of logic operations, and so on. Obviously we have wall-clock time or total energy used.
> There should be an O notation which takes into account everything - various kinds of memory latencies, speed of logic operations
O notation is useful as-is because it simplifies an algorithm down to the bare bones. If such simplification doesn't capture what is important to you, you need a more complex model.
The folks designing GPUs use multiple different such models, ranging from the simple, inaccurate, understandable and fast; to the obnoxiously detailed, precise, abstruse and slow.
Not one model will ever have all the desirable properties, because they are in direct contradiction.
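For a feel of the "simple, inaccurate, understandable and fast" end of that spectrum, here's a toy back-of-envelope I/O model for matmul. Every number below is an assumption I picked for illustration, not a measurement of real hardware:

```python
# Toy I/O model: count words moved to/from main memory for an n x n matmul.
n = 4096                 # assumed matrix dimension
M = 256 * 1024           # assumed cache capacity in words (~2 MB of float64)
b = int((M / 3) ** 0.5)  # tile size so three b x b tiles fit in cache

flops = 2 * n**3         # multiply-adds; identical for both schedules

# Naive i-j-k loops: assume a row of A stays cached but B does not,
# so every element of B is re-read for every output row.
naive_words = n**3 + 2 * n**2

# Blocked schedule: each pair of A/B tiles is read once per tile-product,
# (n/b)^3 products * 2*b^2 words each, plus reading/writing C once.
blocked_words = 2 * n**3 // b + 2 * n**2

print("tile size:", b)
print("flops per word moved, naive:   %.0f" % (flops / naive_words))
print("flops per word moved, blocked: %.0f" % (flops / blocked_words))
```

Crude as it is, it captures the point the tensor-core comment was making: the flop count never changes, but the reuse per word moved differs by roughly a factor of the tile size.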
Isn’t O notation asymptotic and a truncation? Adding an additional O(n) of time to an O(n^2) algorithm would still result in an O(n^2) algorithm, as far as I understand anyway. But you could make a more descriptive expression for the running time based on what you know about the operations, for sure.
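(For example, adding O(n) to O(n^2): 3n^2 + 5n <= 4n^2 for all n >= 5, so the linear term just gets absorbed into the constant that big-O hides; that absorption is exactly the truncation.)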