Attractivechaos's stuff blows my mind. Shameless plug - I've started dissecting his header-only hashmap library (khash.h) bit by bit, and I've been documenting my adventure here: https://medium.com/@makmanalp/dissecting-khash-h-part-1-orig...

edit: and part 2: https://medium.com/@makmanalp/dissecting-khash-h-part-2-scou...
No clue! Your comment history seems to say you're more qualified to answer that than me.
At first glance, dense_hash_map uses STL machinery and templates, while khash is pure C and uses macro hacks to achieve the same functionality, supporting only a few key types (str, int, int64).
Other than that, dense_hash_map and khash both use quadratic probing (https://en.wikipedia.org/wiki/Quadratic_probing) to resolve collisions. I couldn't really dig up the hash functions used by dense_hash_map, but khash uses a weird one called X31 (search in https://download.samba.org/pub/unpacked/ntdb/lib/ccan/hash/h...) for strings and... nothing at all (the key is its own hash) for ints, with Wang's integer hash function as an alternative (https://gist.github.com/badboy/6267743). The world of hash functions seems to be a crazy underworld of arcane incantations, old tomes, copy-pasting, emailing, and trying random shit. It's really wonderful; reminds me of the olden days.
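For the curious, here is a hedged sketch of what those hash functions look like (simplified from memory, not khash's exact code; check the linked sources for the real thing):

```cpp
#include <cstdint>

// X31: the classic "h = h*31 + c" string hash, as used by khash for strings.
static inline uint32_t x31_hash(const char *s) {
    uint32_t h = 0;
    for (; *s; ++s) h = (h << 5) - h + (uint32_t)*s;  // (h << 5) - h == h * 31
    return h;
}

// khash's default integer hash really is "nothing": the key is its own hash.
static inline uint32_t int_hash(uint32_t key) { return key; }

// Wang's integer hash, the optional alternative mentioned in the gist above.
static inline uint32_t wang_hash(uint32_t key) {
    key += ~(key << 15);
    key ^=  (key >> 10);
    key +=  (key << 3);
    key ^=  (key >> 6);
    key += ~(key << 11);
    key ^=  (key >> 16);
    return key;
}
```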
Khash.h is a measly 624 lines of code, everything included, so there's that.
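To see the macro approach in action, here is a minimal usage sketch based on the API documented in the header itself:

```cpp
#include "khash.h"  // assumes khash.h is on the include path

KHASH_MAP_INIT_INT(i2i, int)  // instantiate an int -> int map type named "i2i"

int main() {
    int absent;
    khash_t(i2i) *h = kh_init(i2i);            // allocate the table
    khiter_t k = kh_put(i2i, h, 42, &absent);  // insert key 42
    kh_value(h, k) = 7;                        // set its value
    k = kh_get(i2i, h, 42);                    // look the key up
    if (k != kh_end(h))                        // kh_end(h) means "not found"
        kh_del(i2i, h, k);                     // remove it
    kh_destroy(i2i, h);                        // free everything
    return 0;
}
```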
Good write-up. It is very rare to outperform decades of numerical analysts (and also avoid nasty machine precision issues) by shooting from the hip, and Eigen is amazingly easy to use (plus it is a header-only implementation: no DLL).
MKL is a BLAS substitute (with some additional features like the Vector Statistical Library). Eigen is a higher-level library that can call BLAS functions. For what it's worth, OpenBLAS is now on par with MKL on most benchmarks.
Eigen can use MKL as a backend, but not by default. I don't have MKL installed on my machine (no license on that Linux server), so Eigen is entirely using its own matrix multiplication code.
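For reference, opting in looks roughly like this (a sketch; it assumes MKL's headers and libraries are available at build time):

```cpp
// Without this opt-in define (and an MKL install), Eigen uses its own kernels.
#define EIGEN_USE_MKL_ALL  // must come before including any Eigen header
#include <Eigen/Dense>

Eigen::MatrixXf multiply(const Eigen::MatrixXf &A, const Eigen::MatrixXf &B) {
    return A * B;  // dispatched to MKL's sgemm when the define is present
}
```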
Any programmer with a competent faculty for science can see that your experiment is flawed:
1. too many independent variables: comparing two different compilers on two different platforms on two different architectures?!!
2. and then you compare performances without equal consideration of the problem size for n!
your conclusions are premature, sir, your analysis is conjecture, and yet you dare use the word "primitive". for instance, are you aware that BLAS uses Fortran, whose lack of pointer aliasing effects in memory could be causing the disparity here? have you considered that certain interprocedural optimizations have been affected by not having separate translation units for your tests?! you have also added debugging flags to the compilers further slowing the BLAS routines with more debugging cruft!?! gah!
for these reasons above and the flaunted use of architecture-specific intrinsics, your post stands accused! make revisions, and then also see how BLAS performs using a better algorithm with possibly fewer aliasing effects
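To make the aliasing point concrete, here is a small illustrative sketch (not code from anyone's benchmark):

```cpp
// Fortran may assume array arguments never overlap. In C/C++ the compiler
// must prove that or guard against it, unless the programmer promises it
// with __restrict (a common compiler extension; "restrict" is C99-only).
void axpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];  // x and y may alias; the vectorizer must be careful
}

void axpy_noalias(float a, const float *__restrict x, float *__restrict y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];  // no-overlap promise: the compiler may vectorize freely
}
```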
> 2. and then you compare performances without equal consideration of the problem size for n!
All the different implementations are compared under the same conditions: same n, same compiler, and same machine.
> are you aware that BLAS uses Fortran
Wrong. uBLAS and Eigen are written purely in C++, and OpenBLAS is a mixture of C and assembly. They have nothing to do with Fortran. (Only the reference Netlib BLAS is written in Fortran.)
> you have also added debugging flags to the compilers further slowing the BLAS routines with more debugging cruft!?
By debugging flags, you mean "-g"? No, -g should not affect the performance (maybe a tiny bit on loading the executable, but that is negligible). Get rid of -g, and you will get essentially the same result.
> you dare use the word "primitive"
Yes, I dare, because uBLAS is just that bad and should be trashed. Prove me wrong with your benchmarks. Let the numbers speak for themselves.
Some time ago I compared different BLAS implementations (OpenBLAS, MKL, ACML, etc.) on different Intel CPU architectures, in case somebody is interested in the differences between them.
You have a faster CPU. "SSE sdot" and "SSE+tiling" are both faster on your machine. However, "Eigen" is slower. This suggests gpderetta might be right: MSVC is not as good as gcc at optimizing Eigen, or conversely, Eigen has not been optimized for MSVC. Nonetheless, MSVC did fully vectorize the inner loop of the non-SSE sdot, better than both gcc and clang (EDIT: of the versions I was using; it would be good to try the latest gcc/clang). It seems that explicit SSE vectorization is the most stable; the other "sdot" implementations depend too much on compiler behavior.
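For context, an explicitly vectorized sdot typically looks something like this sketch (not the OP's exact code; assumes SSE is available):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Dot product of two float arrays, four lanes at a time.
float sdot_sse(int n, const float *x, const float *y) {
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
    float t[4];
    _mm_storeu_ps(t, acc);
    float s = t[0] + t[1] + t[2] + t[3];  // horizontal sum of the four lanes
    for (; i < n; ++i) s += x[i] * y[i];  // scalar tail
    return s;
}
```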
Anyway, thanks a lot for this experiment. I rarely use MSVC these days. It is good to know where it stands.
Both CPUs are the same microarchitecture, Haswell. The Xeon has much more cache. The i5 has a higher base frequency (3.2 vs 2.6 GHz); the Xeon, however, has a higher turbo frequency (3.6 vs 3.4 GHz).
OK, I’ve installed Cygwin and GCC, then compiled and benchmarked the original code. I made the following changes in the makefile: (1) replaced -O2 with -O3; (2) added -msse -msse2 -msse3 -mssse3 -msse4 -msse4.1 to the options, for both C and C++.
The results on GCC/i5/Windows 10 are very consistent with the OP's results on GCC/Xeon/Linux.
> Apparently, the main reason for your results — GCC optimizer ain’t good.
While that could be the case, it does not follow from your results; you would need to compare GCC and MSVC directly. It might rather be that MSVC is not able to optimize Eigen well enough to give it a larger advantage over SSE+tiling.
From experience, MSVC is less capable than GCC at the aggressive abstraction elimination that is Eigen's bread and butter.
OK, did that. Results for GCC are very close to the OP’s results for Xeon.
GCC indeed optimized Eigen better than MSVC, by about 25%. I think that might be because the authors of Eigen have very intimate knowledge of GCC's optimizer and tuned their code specifically for GCC.
However, for the rest of the test cases, MSVC performed much better. For example, for the Transposed algorithm, the MSVC result is 3.5 times faster, a huge difference.
Just wanted to share this video, given at ANL recently, which goes into the cost of communication in such methods. It has an HPC focus, but the presentation is quite digestible.
I don't like how the author has labeled the tables "Linux" and "Mac", when really most of the differences between those columns are the result of the compiler used and, to a lesser extent, the fact that the Mac was a local machine while the "Linux" tests were done on a remote server.
A much more useful comparison would keep everything constant except the single variable under test. This could have been done by using the same hardware and only changing the compiler. Since both GCC & Clang work on both Linux & Mac, there is no excuse.
I'd like to see support for optimized multiplication of symmetric matrices. I don't think BLAS can take advantage of that; certainly scipy/numpy doesn't.
There is the BLAS level-2 routine (s|d)symv for matrix-vector multiplies and the level-3 routine (s|d)symm for matrix-matrix multiplies. Last time I benchmarked symv, it was slower than the general implementation by around 25%...
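For example, a symmetric matrix-matrix multiply through the CBLAS interface might look like this sketch (assumes a CBLAS header, e.g. from OpenBLAS):

```cpp
#include <cblas.h>

// C = A * B, where A is symmetric and only its upper triangle is read.
// All matrices are n-by-n, row-major; alpha = 1, beta = 0.
void symm_multiply(int n, const float *A, const float *B, float *C) {
    cblas_ssymm(CblasRowMajor, CblasLeft, CblasUpper,
                n, n,        // dimensions of C
                1.0f, A, n,  // symmetric A with leading dimension n
                B, n,        // general B
                0.0f, C, n); // C = 1*A*B + 0*C
}
```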
Disclaimer: I have very, very, very little experience using BLAS. The reasons I post this are:
- the original poster gave an unqualified speed difference, which cannot reasonably be the full story. They likely left out information such as a 'for my use case' clause.
- I was curious, too, but couldn't Google benchmarks.
Having said that, my guess would be that it is slower for small matrices (where algorithm overhead plays a role) but faster for larger ones (where speed is probably proportional to memory access speed times the amount of data accessed). There's a similarity here with searching a sorted array: there, a linear search is faster than a binary search up to a surprisingly large N.
I wouldn't dare guess where the cut-off point lies, but it likely lies above the point where a matrix row fills a cache line (below that, reading only a few entries of a row brings in the entire row anyway). For a level-1 cache line of 64 bytes and 4-byte floats, that is 16 floats per line, so a 16x16 matrix.
The transpose "trick" should work the same way and have similar beneficial cache effects. Manually unrolling the sdot loop should probably help, too. Vectorized intrinsics are also available from .NET, so that should work as well: https://msdn.microsoft.com/en-us/library/hh977022.aspx
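In C++ terms, the transpose trick amounts to something like this sketch (hypothetical helper; assumes square, row-major matrices):

```cpp
#include <vector>

// C = A * B, but B is copied into its transpose first so the innermost loop
// reads both operands sequentially (one cache-friendly sdot per output cell).
void matmul_transposed(int n, const float *A, const float *B, float *C) {
    std::vector<float> Bt((size_t)n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Bt[(size_t)j * n + i] = B[(size_t)i * n + j];
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k)
                s += A[(size_t)i * n + k] * Bt[(size_t)j * n + k];
            C[(size_t)i * n + j] = s;
        }
    }
}
```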