One of the pitfalls when analyzing HPC requirements is starting with a model that's too simplistic, and matrix multiplication is the typical example. What you usually want to run is a solver or a simulation. These have timesteps and numerical approximation schemes (e.g. Runge-Kutta) where you want intermediate values to live exactly as long as they're needed and no longer. The reason is that when you distribute your main memory across your threads, especially for GPGPU, you only have a few hundred kilobytes per thread if you want to achieve saturation.

So what do you do? In C you typically see the inner timestep function called with input and output pointers, which are then swapped for the next step: no allocation, no copying, no overhead, very simple code, and nothing any compiler could screw up. That's just one example of a trick that makes an HPC programmer's life easier, not only because it performs optimally every time it's used, but because it doesn't complicate performance analysis.

To analyze code performance properly, one must be able to understand the device code that comes out of the compiler and how it will interact with the pipeline, the caches, and so on. If there's too much of a mismatch between source and device code, it becomes near impossible to understand what's going on. In theory, compilers could always achieve the optimum for you, and a programmer wouldn't have to care about hardware at all, living entirely in a logical bubble. Experience shows that this ideal is still pretty far off.