
A good mental model is to compare the number of floats being processed -vs- the number of primitive computations. Matrix multiplication has n^3 computation with n^2 data. Multiplication of large matrices is therefore special in the potential for "data re-use" (each float is used against all the columns or rows of the other matrix) -- so systems are designed to have a much higher flops throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).
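
A quick back-of-the-envelope version of that comparison (a sketch, assuming float32 and square n x n matrices; the byte counts ignore caching entirely):

    # Arithmetic intensity = flops per byte moved (float32 = 4 bytes).
    def matmul_intensity(n):
        flops = 2 * n**3            # n^2 outputs, each a length-n dot product (mul + add)
        bytes_moved = 3 * n**2 * 4  # read A and B, write C
        return flops / bytes_moved

    def dot_intensity(n):
        flops = 2 * n               # n multiplies + n adds
        bytes_moved = 2 * n * 4     # read both vectors once
        return flops / bytes_moved

    print(matmul_intensity(4096))   # ~683 flops/byte -- grows with n, so compute-bound
    print(dot_intensity(4096))      # 0.25 flops/byte -- constant, so memory-bound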

Roofline plots [1] are a framework for visualizing system design from this perspective.

[1] https://en.wikipedia.org/wiki/Roofline_model



That property is the same reason you don't incur substantial overhead doing large matrix multiplications sharded over disks or many machines. You apply the same chunking strategies used to optimally use L1/L2/L3 caches, just instead at the level of numa nodes, physical disks, machines, and clusters. So long as each "cache" is big enough for the N^3/N^2 term to dominate communication overhead (especially if that communication can happen concurrently), the networked result is about as fast as the individual machines running at their max FLOPs for some smaller problem.
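
The same chunking idea written out as a cache-blocked matmul sketch (NumPy here just for brevity; the block size stands in for whichever "cache" you're targeting -- L2, a NUMA node, a disk, or a whole machine):

    import numpy as np

    def blocked_matmul(A, B, block=256):
        """Tiled C = A @ B: each block of A and B is reused across a whole
        tile of C before it has to be fetched again."""
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m), dtype=A.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    C[i:i+block, j:j+block] += (
                        A[i:i+block, p:p+block] @ B[p:p+block, j:j+block])
        return C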


It looks like a dozen people found this helpful. As a related idea, it's one of the main reasons batch inference is so much more efficient in ML. You transmute the problem from memory-bound to compute-bound.
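
A toy version of that transmutation (NumPy, illustrative sizes only): a single request is a matrix-vector product where each weight is touched once, while a batch turns it into a matrix-matrix product where each weight is reused across the whole batch.

    import numpy as np

    W = np.random.rand(4096, 4096).astype(np.float32)  # "weights"
    x = np.random.rand(4096).astype(np.float32)        # one request
    X = np.random.rand(4096, 64).astype(np.float32)    # batch of 64 requests

    y = W @ x  # matvec: each weight is read once per use -> memory-bound
    Y = W @ X  # matmul: each weight is reused 64 times   -> compute-bound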


This is indeed a good model. In the article's context, though, dGPUs have more of both bandwidth and flops, so the compute-intensity balance isn't necessarily the deciding factor in whether there's a speedup on the GPU.

In this article the deciding factor seems to be the startup cost: the application has placed the data in CPU memory and is considering shipping it over to GPU memory just for this computation.
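
A crude way to sanity-check that trade-off (all the bandwidth and flops numbers below are placeholders, not figures from the article):

    # Is it worth shipping the operands over PCIe just for one matmul?
    n = 8192
    bytes_over_pcie = 3 * n * n * 4   # send A and B, receive C (float32)
    pcie_bw = 32e9                    # assumed ~32 GB/s (PCIe 4.0 x16)
    gpu_flops = 20e12                 # assumed ~20 TFLOP/s fp32

    transfer_s = bytes_over_pcie / pcie_bw
    compute_s = 2 * n**3 / gpu_flops
    print(transfer_s, compute_s)      # if transfer dominates, offloading may not pay off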


You can modify the roofline model to include the PCIe bandwidth. It is sometimes called a hierarchical roofline model.
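
Roughly, each extra level just adds another bandwidth ceiling under the min (a sketch with placeholder numbers; a real hierarchical roofline uses a separate intensity per level, measured against the bytes moved over that level):

    def attainable_flops(intensity,          # flops per byte (simplified: one value for all levels)
                         peak_flops=20e12,   # assumed GPU peak
                         hbm_bw=900e9,       # assumed device memory bandwidth
                         pcie_bw=32e9):      # assumed host<->device link
        return min(peak_flops,
                   intensity * hbm_bw,
                   intensity * pcie_bw)

    print(attainable_flops(0.25))  # dot-product-like kernel: link/memory bound
    print(attainable_flops(700))   # large matmul: compute bound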


This is amplified even more by the fact that only the trivial implementation of matmul is O(n^3), whereas efficient ones (e.g. BLAS) use things like the Strassen algorithm. You can also speed it up significantly with cache-aware approaches when retrieving rows and columns. In practice there is a huge amount of theory behind this, far beyond the scope of anyone who is not an actual researcher.
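
For reference, one level of Strassen's recursion looks like this (a NumPy sketch for even-dimension square matrices; whether production BLAS libraries actually use it is debated below):

    import numpy as np

    def strassen_step(A, B):
        """One Strassen level: 7 half-size multiplies instead of 8."""
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])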


Is there actually a BLAS implementation that uses Strassen?

I don’t think it’s accurate that only trivial implementations use the direct O(n^3) algorithm. AFAIK high-performance BLAS implementations just use highly optimized versions of it.


BLAS is just the library definition, not the implementation, so BLAS implementations could implement GEMM any way they want. But in practice the triple-loop (n^3) method is the most common, despite Strassen's and the more numerically stable Winograd methods being well known and available for decades. As with most things involving real computing hardware, memory access patterns and locality tend to matter more for performance than operation counts.
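
For concreteness, the "triple loop method" is just this (plain Python, purely illustrative; real GEMMs block and vectorize these loops so tiles stay resident in cache):

    def gemm_naive(A, B, C):
        """Textbook C += A @ B over lists of lists."""
        n, k, m = len(A), len(B), len(B[0])
        for i in range(n):
            for p in range(k):
                a = A[i][p]
                for j in range(m):   # innermost loop walks B and C row-wise (cache-friendly)
                    C[i][j] += a * B[p][j]
        return C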


I remember reading that it’s too hard to get good memory bandwidth / L2 utilization with the fancier algorithms: you need to read contiguous blocks and be able to reuse them repeatedly. But I also haven’t looked at the GPU BLAS implementations directly.


AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"


The big issue with Strassen isn't performance - it's numerical stability.


Not that long ago, I tried using the FFT to do matrix multiplication, since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.

I just wish I understood the tricks done to make it so fast, so I could implement my own versions for variants that have no pre-existing BLAS implementation. The best BLAS implementations are all closed source, sadly.
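
If you want a feel for how big that constant-factor gap is, timing the library call and dividing by the nominal flop count is a decent first check (rough sketch; the numbers depend entirely on your machine and BLAS backend):

    import time
    import numpy as np

    n = 2048
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    t0 = time.perf_counter()
    C = A @ B                   # dispatches to whatever BLAS NumPy was built against
    dt = time.perf_counter() - t0
    print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s sustained")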


> The best BLAS implementations are all closed source sadly.

NVidia open-sourced CUTLASS [0] some years ago and it achieves pretty competitive performance compared to e.g. the closed-source cuBLAS.

Keen observers will notice that Strassen is not used in CUTLASS.

[0] https://github.com/NVIDIA/cutlass


Using the FFT to do matmul is much more memory-intensive, IIRC.

cuDNN supports FFT-based algorithms for convolution/correlation and can also be configured to automatically pick the best algorithm.

In some cases the FFT method has the incidental side-benefit of data reuse, like in the case of FIR filters, where the data allows for partitioned convolution.
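
For the convolution side, the FFT route is just the convolution theorem; a small NumPy sanity check of the equivalence (not how cuDNN implements it):

    import numpy as np

    x = np.random.rand(256)   # signal
    h = np.random.rand(32)    # FIR filter taps

    direct = np.convolve(x, h)                 # O(N*M) direct convolution
    size = len(x) + len(h) - 1
    via_fft = np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(h, size), size)

    print(np.allclose(direct, via_fft))        # True, up to fp rounding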


Really? I thought that no practical linear algebra library used the Strassen algorithm. Can you provide a source?


The BLAS GEMM routines I have seen use normal blocked algorithms.


I don’t know what the real convention is, but IMO BLAS GEMM “is” the O(n^3) algorithm (blocked is fine, of course), in the sense that something like Strassen has stability implications and isn’t appropriate for lots of sizes. Just swapping it in would be nuts, haha.


Strassen works best on power of 2 sized matrices. That's not a restriction you'd usually see in a general purpose library.


OK, Strassen and the others are asymptotically better, but they are still O(n^w) where 2 < w < 3.



