Not that long ago, I tried using the FFT to do matrix multiplication since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm, which BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.
I just wish I understood the tricks used to make it so fast, so I could implement my own versions for variants that have no pre-existing BLAS implementation. Sadly, the best BLAS implementations are all closed source.
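As a starting point, here is a minimal C sketch of the cache-blocking ("tiling") idea those tricks build on. It is only the loop structure; real BLAS kernels add packing, register blocking, SIMD, and careful tile-size tuning on top of this:

```c
#include <stddef.h>

#define BLOCK 64  /* tile edge; real libraries tune this to the cache hierarchy */

/* C += A * B for n x n row-major matrices, processed tile by tile. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
                size_t k_end = kk + BLOCK < n ? kk + BLOCK : n;
                size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
                /* The inner three loops touch only three small tiles, so
                 * entries of A and B are reused while they are still hot
                 * in cache instead of being streamed from memory n times. */
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Same O(n^3) arithmetic, just reordered so the working set fits in cache; that reordering alone usually buys a large constant-factor speedup before any SIMD is involved.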
Using FFT to do matmul is much more memory intensive, IIRC.
cuDNN supports using the FFT to do matmul as well as convolution/correlation, and it can also be configured to automatically use the best algorithm.
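A hedged C sketch of that auto-selection: cudnnFindConvolutionForwardAlgorithm benchmarks every forward convolution algorithm (including the FFT-based CUDNN_CONVOLUTION_FWD_ALGO_FFT and CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING) and returns them fastest-first. The problem dimensions below are arbitrary examples and error checking is omitted:

```c
#include <cudnn.h>
#include <stdio.h>

int main(void)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    /* Example problem: batch 1, 64 channels, 56x56 image, 64 filters of
     * 5x5, stride 1, padding 2 (stride 1 keeps the FFT algorithms eligible). */
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 64, 56, 56);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               64, 64, 5, 5);
    cudnnSetConvolution2dDescriptor(convDesc, 2, 2, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    /* Time all forward algorithms (implicit GEMM, Winograd, FFT, ...) and
     * report them sorted by measured runtime. */
    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf);

    for (int i = 0; i < returned; ++i)
        printf("algo %d: %.3f ms, workspace %zu bytes (status %d)\n",
               (int)perf[i].algo, perf[i].time, perf[i].memory,
               (int)perf[i].status);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```

Whether the FFT variants actually win depends on filter size, stride, and available workspace memory, which is exactly why letting the library benchmark them is convenient.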
In some cases the FFT method has the incidental side benefit of data reuse, as with FIR filters, where the structure of the problem allows for partitioned convolution.
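A rough offline sketch of uniformly partitioned overlap-add convolution, assuming FFTW is available, just to show where the reuse comes from: each filter partition and each input block is transformed once, and those spectra are then reused for every output block. A real-time engine would keep a ring buffer of the most recent input spectra instead of storing them all:

```c
#include <complex.h>   /* include before fftw3.h so fftw_complex is C99 complex */
#include <fftw3.h>
#include <stdlib.h>
#include <string.h>

/* y must be zeroed by the caller and hold x_len + h_len - 1 samples. */
void partitioned_fft_conv(const double *x, size_t x_len,
                          const double *h, size_t h_len,
                          double *y, size_t block)
{
    size_t N  = 2 * block;                     /* FFT size: no circular aliasing */
    size_t nb = (x_len + block - 1) / block;   /* number of input blocks         */
    size_t np = (h_len + block - 1) / block;   /* number of filter partitions    */
    size_t nf = N / 2 + 1;                     /* complex bins of a real FFT     */

    double *buf = fftw_alloc_real(N);
    fftw_complex *spec = fftw_alloc_complex(nf);
    fftw_plan fwd = fftw_plan_dft_r2c_1d((int)N, buf, spec, FFTW_ESTIMATE);
    fftw_plan inv = fftw_plan_dft_c2r_1d((int)N, spec, buf, FFTW_ESTIMATE);

    /* Transform every filter partition exactly once. */
    fftw_complex *H = fftw_alloc_complex(np * nf);
    for (size_t p = 0; p < np; ++p) {
        memset(buf, 0, N * sizeof *buf);
        size_t len = (p + 1) * block <= h_len ? block : h_len - p * block;
        memcpy(buf, h + p * block, len * sizeof *buf);
        fftw_execute(fwd);
        memcpy(H + p * nf, spec, nf * sizeof *spec);
    }

    /* Transform every input block exactly once. */
    fftw_complex *X = fftw_alloc_complex(nb * nf);
    for (size_t t = 0; t < nb; ++t) {
        memset(buf, 0, N * sizeof *buf);
        size_t len = (t + 1) * block <= x_len ? block : x_len - t * block;
        memcpy(buf, x + t * block, len * sizeof *buf);
        fftw_execute(fwd);
        memcpy(X + t * nf, spec, nf * sizeof *spec);
    }

    /* Each output block accumulates X[m-p] * H[p] in the frequency domain,
     * then one inverse FFT per block is overlap-added into y. */
    for (size_t m = 0; m < nb + np - 1; ++m) {
        memset(spec, 0, nf * sizeof *spec);
        for (size_t p = (m >= nb) ? m - nb + 1 : 0; p < np && p <= m; ++p)
            for (size_t k = 0; k < nf; ++k)
                spec[k] += X[(m - p) * nf + k] * H[p * nf + k];
        fftw_execute(inv);
        size_t out = m * block;
        for (size_t i = 0; i < N && out + i < x_len + h_len - 1; ++i)
            y[out + i] += buf[i] / N;          /* FFTW's c2r is unnormalized */
    }

    fftw_destroy_plan(fwd); fftw_destroy_plan(inv);
    fftw_free(H); fftw_free(X); fftw_free(buf); fftw_free(spec);
}
```

The reuse is the point: with nb input blocks and np partitions, you pay for nb + np forward FFTs and nb + np - 1 inverse FFTs instead of re-transforming anything per partition.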