
Not that long ago, I tried using the FFT to do matrix multiplication since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast in practice.

I just wish I understood the tricks used to make it so fast, so I could implement my own versions for variations that have no pre-existing BLAS implementation. The best BLAS implementations are all closed source sadly.
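For intuition, here is a minimal sketch of just the cache-blocking (tiling) idea, not what any particular BLAS actually does: production kernels add packing, SIMD microkernels, register blocking, and threading on top of this, and the tile size below is an arbitrary placeholder.

    // Minimal sketch of cache-blocked (tiled) matmul; real BLAS kernels add
    // packing, SIMD microkernels, register blocking, and threading on top.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t BLK = 64;  // tile size; placeholder, tune per cache level

    // C (n x n) += A (n x n) * B (n x n), all row-major, densely stored.
    void matmul_blocked(const std::vector<double>& A,
                        const std::vector<double>& B,
                        std::vector<double>& C, std::size_t n) {
      for (std::size_t ii = 0; ii < n; ii += BLK)
        for (std::size_t kk = 0; kk < n; kk += BLK)
          for (std::size_t jj = 0; jj < n; jj += BLK)
            // One tile of C is updated from one tile of A and one tile of B,
            // so the working set stays small enough to live in cache.
            for (std::size_t i = ii; i < std::min(ii + BLK, n); ++i)
              for (std::size_t k = kk; k < std::min(kk + BLK, n); ++k) {
                const double a_ik = A[i * n + k];
                for (std::size_t j = jj; j < std::min(jj + BLK, n); ++j)
                  C[i * n + j] += a_ik * B[k * n + j];
              }
    }

The hoisted A[i*n+k] and the loop ordering keep the innermost loop a unit-stride update over contiguous rows of C and B, which is the main thing the naive "grade school" loop ordering gets wrong cache-wise.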



> The best BLAS implementations are all closed source sadly.

NVIDIA open-sourced CUTLASS [0] some years ago, and it achieves performance that is pretty competitive with, e.g., the closed-source cuBLAS.

Keen observers will notice that Strassen is not used in CUTLASS.

[0] https://github.com/NVIDIA/cutlass
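For reference, a rough sketch of what calling CUTLASS's device-level GEMM looks like, modeled on the basic GEMM example in the repo; the exact template defaults and argument layout vary between CUTLASS versions, and the pointers here are assumed to be device pointers already allocated and filled.

    // Single-precision GEMM via CUTLASS's device-level Gemm template
    // (column-major operands, default tile shapes). Built with nvcc and
    // the CUTLASS headers on the include path.
    #include "cutlass/gemm/device/gemm.h"

    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::ColumnMajor,   // A
        float, cutlass::layout::ColumnMajor,   // B
        float, cutlass::layout::ColumnMajor>;  // C / D

    // Computes C = alpha * A * B + beta * C on device pointers.
    cutlass::Status run_sgemm(int M, int N, int K, float alpha,
                              float const* A, int lda,
                              float const* B, int ldb,
                              float beta, float* C, int ldc) {
      Gemm gemm_op;
      return gemm_op({{M, N, K},
                      {A, lda},    // TensorRef to A
                      {B, ldb},    // TensorRef to B
                      {C, ldc},    // source C
                      {C, ldc},    // destination D (written in place)
                      {alpha, beta}});
    }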


Using FFT to do matmul is much more memory intensive, IIRC.

cuDNN supports FFT-based algorithms for matmul as well as convolution/correlation, and it can also be configured to automatically select the best algorithm.

In some cases the FFT method has the incidental side benefit of data reuse, as with FIR filters, where the structure of the data allows for partitioned convolution.
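On the algorithm-selection point, here is a hedged sketch of asking cuDNN to benchmark its forward-convolution algorithms (GEMM-, FFT-, and Winograd-based) and report the fastest; the tensor shapes are placeholders and error checking is omitted.

    // Sketch: let cuDNN time its candidate forward-convolution algorithms
    // and rank them; some of the candidates are FFT-based.
    #include <cudnn.h>
    #include <cstdio>

    int main() {
      cudnnHandle_t handle;
      cudnnCreate(&handle);

      cudnnTensorDescriptor_t x_desc, y_desc;
      cudnnFilterDescriptor_t w_desc;
      cudnnConvolutionDescriptor_t conv_desc;
      cudnnCreateTensorDescriptor(&x_desc);
      cudnnCreateTensorDescriptor(&y_desc);
      cudnnCreateFilterDescriptor(&w_desc);
      cudnnCreateConvolutionDescriptor(&conv_desc);

      // Placeholder problem: NCHW float input 1x64x56x56, 128 filters of
      // 64x3x3, padding 1, stride 1, dilation 1.
      cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                 1, 64, 56, 56);
      cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                 128, 64, 3, 3);
      cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                      CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

      int n, c, h, w;
      cudnnGetConvolution2dForwardOutputDim(conv_desc, x_desc, w_desc,
                                            &n, &c, &h, &w);
      cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                 n, c, h, w);

      // cuDNN runs each algorithm and returns them sorted by measured time.
      cudnnConvolutionFwdAlgoPerf_t perf[8];
      int returned = 0;
      cudnnFindConvolutionForwardAlgorithm(handle, x_desc, w_desc, conv_desc,
                                           y_desc, 8, &returned, perf);
      for (int i = 0; i < returned; ++i)
        std::printf("algo %d: %.3f ms, FFT-based: %s\n",
                    (int)perf[i].algo, perf[i].time,
                    (perf[i].algo == CUDNN_CONVOLUTION_FWD_ALGO_FFT ||
                     perf[i].algo == CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING)
                        ? "yes" : "no");

      cudnnDestroyConvolutionDescriptor(conv_desc);
      cudnnDestroyFilterDescriptor(w_desc);
      cudnnDestroyTensorDescriptor(y_desc);
      cudnnDestroyTensorDescriptor(x_desc);
      cudnnDestroy(handle);
      return 0;
    }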



