
Any idea why it would be slower? I liked the idea of avoiding half the multiplications.


Disclaimer: I have very, very, very little experience using BLAS. The reasons I post this are:

- the original poster gave an unqualified speed difference, which cannot reasonably be the full story. They likely left out qualifying information, such as a "for my use case" clause.

- I was curious, too, but couldn't Google benchmarks.

Having said that, my guess would be that it is slower for small matrices (where algorithm overhead plays a role), but faster for larger ones (where runtime is dominated by memory traffic, i.e., roughly proportional to the amount of data accessed divided by memory bandwidth). There's a similarity here with searching a sorted array: a linear search is faster than a binary search up to a surprisingly large N, because sequential access is cache-friendly and branch-predictable.
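To make the analogy concrete, here's a minimal Python sketch of the two search strategies (the function names are mine, not from any library). For small arrays the sequential scan often wins in practice despite doing more comparisons:

```python
import bisect

def linear_search(arr, target):
    # Sequential scan: more comparisons, but cache-friendly
    # and branch-predictable. Often wins for small arrays.
    for i, v in enumerate(arr):
        if v == target:
            return i
        if v > target:  # sorted input: we can stop early
            return -1
    return -1

def binary_search(arr, target):
    # O(log n) comparisons, but each probe can jump far in
    # memory, so every step may be a cache miss.
    i = bisect.bisect_left(arr, target)
    return i if i < len(arr) and arr[i] == target else -1

arr = list(range(0, 100, 2))  # sorted even numbers 0..98
assert linear_search(arr, 42) == binary_search(arr, 42) == 21
assert linear_search(arr, 43) == binary_search(arr, 43) == -1
```

The cut-off N where binary search starts winning depends on the element size and the cache hierarchy, which is why only a benchmark on the actual workload settles it.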

I wouldn't dare guess where the cut-off point lies, but it likely lies above the point where a matrix row fills a cache line (below that, reading even a few entries of a row brings in the entire row anyway). For a level-1 cache line of 64 bytes and 4-byte floats, that would be a 16x16 matrix.
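The arithmetic behind that 16x16 figure, spelled out (the 64-byte line size is a typical value for current x86 CPUs, not a universal constant):

```python
CACHE_LINE_BYTES = 64   # typical L1 data cache line size (assumption)
FLOAT_BYTES = 4         # sizeof(float) in C / IEEE 754 single precision

# How many single-precision floats fit in one cache line:
floats_per_line = CACHE_LINE_BYTES // FLOAT_BYTES
assert floats_per_line == 16

# So a row of a 16x16 float matrix occupies exactly one cache
# line (assuming contiguous, line-aligned storage): touching any
# element of a row pulls in the whole row for free.
```

For 8-byte doubles the same reasoning gives an 8x8 matrix per-row-per-line, so the cut-off would shift accordingly.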



