
A good mental model is to compare the number of floats being processed -vs- the number of primitive computations. Matrix multiplication has n^3 computation with n^2 data. Multiplication of large matrices is therefore special in the potential for "data re-use" (each float is used against all the columns or rows of the other matrix) -- so systems are designed to have a much higher flops throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).
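
A quick back-of-the-envelope version of that comparison (a sketch, assuming float32 and square n x n matrices; the byte counts ignore caching entirely):

    # Arithmetic intensity = flops per byte moved (float32 = 4 bytes).
    def matmul_intensity(n):
        flops = 2 * n**3            # n^2 outputs, each a length-n dot product (mul + add)
        bytes_moved = 3 * n**2 * 4  # read A and B, write C
        return flops / bytes_moved

    def dot_intensity(n):
        flops = 2 * n               # n multiplies + n adds
        bytes_moved = 2 * n * 4     # read both vectors once
        return flops / bytes_moved

    print(matmul_intensity(4096))   # ~683 flops/byte -- grows with n, so compute-bound
    print(dot_intensity(4096))      # 0.25 flops/byte -- constant, so memory-bound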

Roofline plots [1] are a framework for visualizing system design from this perspective.

[1] https://en.wikipedia.org/wiki/Roofline_model



That property is the same reason you don't incur substantial overhead doing large matrix multiplications sharded over disks or many machines. You apply the same chunking strategies used to optimally use L1/L2/L3 caches, just instead at the level of numa nodes, physical disks, machines, and clusters. So long as each "cache" is big enough for the N^3/N^2 term to dominate communication overhead (especially if that communication can happen concurrently), the networked result is about as fast as the individual machines running at their max FLOPs for some smaller problem.
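
The same chunking idea written out as a cache-blocked matmul sketch (NumPy here just for brevity; the block size stands in for whichever "cache" you're targeting -- L2, a NUMA node, a disk, or a whole machine):

    import numpy as np

    def blocked_matmul(A, B, block=256):
        """Tiled C = A @ B: each block of A and B is reused across a whole
        tile of C before it has to be fetched again."""
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m), dtype=A.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    C[i:i+block, j:j+block] += (
                        A[i:i+block, p:p+block] @ B[p:p+block, j:j+block])
        return C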


It looks like a dozen people found this helpful. As a related idea, it's one of the main reasons batch inference is so much more efficient in ML. You transmute the problem from memory-bound to compute-bound.
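
A toy version of that transmutation (NumPy, illustrative sizes only): a single request is a matrix-vector product where each weight is touched once, while a batch turns it into a matrix-matrix product where each weight is reused across the whole batch.

    import numpy as np

    W = np.random.rand(4096, 4096).astype(np.float32)  # "weights"
    x = np.random.rand(4096).astype(np.float32)        # one request
    X = np.random.rand(4096, 64).astype(np.float32)    # batch of 64 requests

    y = W @ x  # matvec: each weight is read once per use -> memory-bound
    Y = W @ X  # matmul: each weight is reused 64 times   -> compute-bound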


This is indeed a good model. In the article's context, though, dGPUs have more of both bandwidth and flops, so the compute-intensity balance isn't necessarily the deciding factor in whether there's a speedup on the GPU.

In this article the deciding factor seems to be the startup cost: the application has placed the data in CPU memory and is considering shipping it over to GPU memory just for this computation.
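
A crude way to sanity-check that trade-off (all the bandwidth and flops numbers below are placeholders, not figures from the article):

    # Is it worth shipping the operands over PCIe just for one matmul?
    n = 8192
    bytes_over_pcie = 3 * n * n * 4   # send A and B, receive C (float32)
    pcie_bw = 32e9                    # assumed ~32 GB/s (PCIe 4.0 x16)
    gpu_flops = 20e12                 # assumed ~20 TFLOP/s fp32

    transfer_s = bytes_over_pcie / pcie_bw
    compute_s = 2 * n**3 / gpu_flops
    print(transfer_s, compute_s)      # if transfer dominates, offloading may not pay off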


You can modify the roofline model to include the PCIe bandwidth. It is sometimes called a hierarchical roofline model.
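
Roughly, each extra level just adds another bandwidth ceiling under the min (a sketch with placeholder numbers; a real hierarchical roofline uses a separate intensity per level, measured against the bytes moved over that level):

    def attainable_flops(intensity,          # flops per byte (simplified: one value for all levels)
                         peak_flops=20e12,   # assumed GPU peak
                         hbm_bw=900e9,       # assumed device memory bandwidth
                         pcie_bw=32e9):      # assumed host<->device link
        return min(peak_flops,
                   intensity * hbm_bw,
                   intensity * pcie_bw)

    print(attainable_flops(0.25))  # dot-product-like kernel: link/memory bound
    print(attainable_flops(700))   # large matmul: compute bound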


This is amplified even more by the fact that only the trivial implementation of matmul is O(n^3), whereas efficient ones (e.g. BLAS) use things like the Strassen algorithm. You can also speed it up significantly with cache-aware approaches when retrieving rows and columns. In practice there is a huge amount of theory behind this, far beyond the scope of anyone who is not an actual researcher.
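
For reference, one level of Strassen's recursion looks like this (a NumPy sketch for even-dimension square matrices; whether production BLAS libraries actually use it is debated below):

    import numpy as np

    def strassen_step(A, B):
        """One Strassen level: 7 half-size multiplies instead of 8."""
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])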


Is there actually a BLAS implementation that uses Strassen?

I don’t think it’s accurate that only trivial implementations use the direct O(n^3) algorithm. AFAIK high-performance BLAS implementations just use highly optimized versions of it.


BLAS is just the library definition, not the implementation, so BLAS implementations could implement GEMM any way they want. But in practice the triple-loop (n^3) method is the most common, despite Strassen's and the more numerically stable Winograd methods being well known and available for decades. As with most things involving real computing hardware, memory access patterns and locality tend to matter more for performance than operation counts.
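
For concreteness, the "triple loop method" is just this (plain Python, purely illustrative; real GEMMs block and vectorize these loops so tiles stay resident in cache):

    def gemm_naive(A, B, C):
        """Textbook C += A @ B over lists of lists."""
        n, k, m = len(A), len(B), len(B[0])
        for i in range(n):
            for p in range(k):
                a = A[i][p]
                for j in range(m):   # innermost loop walks B and C row-wise (cache-friendly)
                    C[i][j] += a * B[p][j]
        return C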


I remember reading that it’s too hard to get good memory bandwidth / L2 utilization with the fancier algorithms: you need to read contiguous blocks and be able to reuse them repeatedly. But I also haven’t looked at the GPU BLAS implementations directly.


AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"


The big issue with Strassen isn't performance - it's numerical stability.


Not that long ago, I tried using the FFT to do matrix multiplication, since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.

I just wish I understood the tricks done to make it so fast, so I could implement my own versions for variants that have no pre-existing BLAS implementation. The best BLAS implementations are all closed source, sadly.
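
If you want a feel for how big that constant-factor gap is, timing the library call and dividing by the nominal flop count is a decent first check (rough sketch; the numbers depend entirely on your machine and BLAS backend):

    import time
    import numpy as np

    n = 2048
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    t0 = time.perf_counter()
    C = A @ B                   # dispatches to whatever BLAS NumPy was built against
    dt = time.perf_counter() - t0
    print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s sustained")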


> The best BLAS implementations are all closed source sadly.

NVidia open-sourced CUTLASS [0] some years ago and it achieves pretty competitive performance compared to e.g. the closed-source cuBLAS.

Keen observers will notice that Strassen is not used in CUTLASS.

[0] https://github.com/NVIDIA/cutlass


Using the FFT to do matmul is much more memory-intensive, IIRC.

cuDNN supports FFT-based algorithms for convolution/correlation and can also be configured to automatically pick the best algorithm.

In some cases the FFT method has the incidental side-benefit of data reuse, like in the case of FIR filters, where the data allows for partitioned convolution.
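
For the convolution side, the FFT route is just the convolution theorem; a small NumPy sanity check of the equivalence (not how cuDNN implements it):

    import numpy as np

    x = np.random.rand(256)   # signal
    h = np.random.rand(32)    # FIR filter taps

    direct = np.convolve(x, h)                 # O(N*M) direct convolution
    size = len(x) + len(h) - 1
    via_fft = np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(h, size), size)

    print(np.allclose(direct, via_fft))        # True, up to fp rounding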


Really? I thought that no practical linear algebra library used the Strassen algorithm. Can you provide a source?


The BLAS GEMM routines I have seen use normal blocked algorithms.


I don’t know what the real convention is, but IMO BLAS GEMM “is” the O(n^3) algorithm (blocked is fine, of course), in the sense that something like Strassen has stability implications and isn’t appropriate for lots of sizes. Just swapping it in would be nuts, haha.


Strassen works best on power of 2 sized matrices. That's not a restriction you'd usually see in a general purpose library.


OK, Strassen and the others are asymptotically better, but they are still O(n^w) where 2 < w < 3.



