I have this feeling that the lower bound for matrix multiplication is heading towards some transcendental number that we don’t know about, or a function of one we do.
I wonder if finding the answer to that will clue someone in to what the optimal solution is, by pointing to some hidden structure to the problem that we have all missed.
Maybe instead they'll be like Feigenbaum constants: apparently universal constants that we don't really understand. For instance, we don't even know for sure that Feigenbaum constants are irrational, though mathematicians who study them strongly suspect they're transcendental.
I love these types of improvements - I figure they don't really have a practical application (except for shaving a few milliseconds off ultra large matrices), but they're a testament to human ingenuity and the desire to strive for perfection no matter how far out of reach.
Strassen's algorithm is rarely used: its primary use, to my understanding, is in matrix algorithms over more exotic fields than the reals or complex numbers, where minimizing multiplies is extremely useful. Even then, Strassen only starts to beat out naive matrix multiply when you're looking at n in the thousands, and at that size you're probably starting to think about sparse matrices, where your implementation strategy is completely different. But for a regular dgemm or sgemm, it's not commonly used in BLAS implementations.
The algorithms more advanced than Strassen's are even worse in terms of the cutover point, and are never seriously considered in practice.
The cross-over can be around 500 (https://doi.org/10.1109/SC.2016.58) for 2-level Strassen. It's not used by regular BLAS because it is less numerically stable (a concern that becomes more severe for the fancier fast MM algorithms). Whether or not the matrix can be compressed (as sparse, fast transforms, or data-sparse such as the various hierarchical low-rank representations) is more a statement about the problem domain, though it's true a sizable portion of applications that produce large matrices are producing matrices that are amenable to data-sparse representations.
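To make the cutover and stability points concrete, here's a rough numpy sketch (the cutoff of 512 is just a placeholder I picked, not a tuned value): one level-by-level Strassen recursion that falls back to plain np.dot below the cutoff, plus a crude float32 round-off check against a float64 reference.

```python
import numpy as np

def strassen(A, B, cutoff=512):
    """Strassen recursion that falls back to the vendor GEMM (np.dot)
    below `cutoff`. Toy sketch for square matrices, not tuned code."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 half-size products instead of the naive 8
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Crude stability check: float32 inputs, float64 reference.
rng = np.random.default_rng(0)
A = rng.standard_normal((2048, 2048)).astype(np.float32)
B = rng.standard_normal((2048, 2048)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
for name, C in (("np.dot", A @ B), ("strassen", strassen(A, B))):
    print(name, "max error:", np.abs(C - ref).max())
```

The extra additions and subtractions are where the numerical-stability concern comes from; measure it yourself rather than trusting my cutoff or my claim about the error.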
No, these aren't practical algorithms. No one uses them in reality.
The hope is that these small improvements will lead to new insights that then lead to big improvements. I doubt the way matrix multiplication is actually done will change because of these theoretical results, unless something really groundbreaking and shocking is discovered.
It's all relatively minor in the grand scheme of things.
I think the last thing I saw on matrices was looking toward optimizing slightly larger sub-problems.
That smells a little bit like a loop unrolling and/or locality improvement to me. You can often beat a theoretical algorithmic improvement with a practical one in such situations. And if you do a little bit of both, you hopefully end up with something that's faster than the current most practical implementation.
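As a sketch of what a locality improvement looks like here, this is the classic blocking/tiling loop structure (the tile size of 64 is an arbitrary guess; real BLAS kernels choose it to match registers and cache levels, and write the inner tile product as a hand-tuned kernel rather than calling np.dot):

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Tile the i/j/k loops so each tile of A and B is reused many times
    while it is still hot in cache. Illustrates the loop structure only;
    the inner product is delegated to np.dot."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

# Sanity check against numpy on a small case.
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Same O(n^3) arithmetic, same answer; the only thing that changes is how often data has to be pulled back into cache.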
There's a hell of a lot of matrix multiplications going on in AI/ML. Shaving off a few milliseconds for ultra large matrices could save power and money for the big players like Google, Microsoft/OpenAI and Meta.
Tangential question: How can an algorithm end up with a fractional exponent? It's clear to me how nested loops can give O(n^2) or O(n^3), and how a binary search can give O(log n), but how on earth can you get something like O(n^2.371552), or even O(n^2.7)?
The classic example here is Strassen's algorithm. You divide each input matrix into quarters, and it turns out you only need 7 matrix multiplies of the smaller (n/2 x n/2) matrices instead of the 8 you'd naively expect. When you add up the total work across all the levels of recursion, it comes out to n^(log base 2 of 7), or about n^2.807.
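If you want to see the exponent fall out, you can just count the scalar multiplications the recursion performs; the extra additions are O(n^2) per level and don't affect the exponent. A tiny sketch, assuming n is a power of two and the recursion goes all the way down:

```python
import math

def strassen_mults(n):
    """Scalar multiplications used by pure Strassen on an n x n matrix
    (n a power of two): each level does 7 half-size products."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)

for n in (64, 256, 1024):
    m = strassen_mults(n)
    print(n, m, math.log(m, n))   # the exponent: log2(7) ~= 2.807
```

Doubling n multiplies the count by 7 rather than 8, and solving 2^x = 7 gives x = log2(7) ≈ 2.807; that's the whole trick behind the fractional exponents.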
Pretty much all progress in theoretical matmul comes from approximate decomposition tools, so the results only kick in when the approximation error is small enough to be fixed up cheaply, which in practice means very large multiplications.
On the small side, even the optimal (3x3)(3x3) matmul isn't known more precisely than "between 19 and 23" multiplications, with only tiny progress lately in ruling out some symmetric solutions (https://arxiv.org/html/2402.01011v1).
Tensor cores massively accelerate matrix multiplication with no algorithmic breakthrough, just by being smart about how you move/cache/reuse values: all operations that the complexity analysis treats as O(1).
There should be an O notation which takes into account everything - various kinds of memory latencies, speed of logic operations, and so on. Obviously we have wall-clock time or total energy used.
> There should be an O notation which takes into account everything - various kinds of memory latencies, speed of logic operations
O notation is useful as-is because it simplifies an algorithm down to the bare bones. If such simplification doesn't capture what is important to you, you need a more complex model.
The folks designing GPUs use multiple different such models, ranging from the simple, inaccurate, understandable and fast; to the obnoxiously detailed, precise, abstruse and slow.
Not one model will ever have all the desirable properties, because they are in direct contradiction.
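For a feel of the "simple, inaccurate, understandable and fast" end of that spectrum, here's a toy back-of-envelope I/O model for matmul. Every number below is an assumption I picked for illustration, not a measurement of real hardware:

```python
# Toy I/O model: count words moved to/from main memory for an n x n matmul.
n = 4096                 # assumed matrix dimension
M = 256 * 1024           # assumed cache capacity in words (~2 MB of float64)
b = int((M / 3) ** 0.5)  # tile size so three b x b tiles fit in cache

flops = 2 * n**3         # multiply-adds; identical for both schedules

# Naive i-j-k loops: assume a row of A stays cached but B does not,
# so every element of B is re-read for every output row.
naive_words = n**3 + 2 * n**2

# Blocked schedule: each pair of A/B tiles is read once per tile-product,
# (n/b)^3 products * 2*b^2 words each, plus reading/writing C once.
blocked_words = 2 * n**3 // b + 2 * n**2

print("tile size:", b)
print("flops per word moved, naive:   %.0f" % (flops / naive_words))
print("flops per word moved, blocked: %.0f" % (flops / blocked_words))
```

Crude as it is, it captures the point the tensor-core comment was making: the flop count never changes, but the reuse per word moved differs by roughly a factor of the tile size.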
Isn’t O notation asymptotic and a truncation? Adding an additional O(n) of time to an O(n^2) algorithm would still result in an O(n^2) algorithm, as far as I understand anyway. But you could make a more descriptive expression for the running time based on what you know about the operations, for sure.
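(For example, adding O(n) to O(n^2): 3n^2 + 5n <= 4n^2 for all n >= 5, so the linear term just gets absorbed into the constant that big-O hides; that absorption is exactly the truncation.)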