For problems like matrix multiplication, you communicate O(n²) data (the input and output matrices) but perform O(n³) operations to calculate.
For problems like the dot product, you communicate O(n) data but perform only O(n) operations.
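A quick back-of-envelope sketch of that ratio (the element counts below are illustrative assumptions: n×n input matrices for matmul, length-n vectors for the dot product, classical non-blocked algorithms):

```python
# Arithmetic intensity: useful operations per element transferred.
# All constants here are illustrative, not measured.

def matmul_intensity(n: int) -> float:
    """Ops per transferred element for classical n x n matrix multiplication."""
    ops = 2 * n**3          # n^3 multiply-adds, counted as 2 ops each
    transferred = 3 * n**2  # two input matrices + one output matrix
    return ops / transferred

def dot_intensity(n: int) -> float:
    """Ops per transferred element for a length-n dot product."""
    ops = 2 * n             # roughly n multiplies + n adds
    transferred = 2 * n + 1 # two input vectors + one scalar result
    return ops / transferred

# Matmul intensity grows with n; the dot product stays stuck near 1
# no matter how large n gets.
print(matmul_intensity(1024))  # ~683 ops per element moved
print(dot_intensity(1024))     # ~1 op per element moved
```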
Compute must be substantially larger than communication cost if you hope to see any benefit. An asymptotic gap obviously helps, but even a constant-factor advantage can be enough.
You'd never transfer N elements to perform an O(log N) binary search, for example. At that point communication completely dominates.
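A toy cost model makes the binary-search case concrete. The bandwidth and throughput figures below are made-up round numbers for illustration, not measurements of any real hardware:

```python
# Toy transfer-plus-compute cost model. The rates are assumed round
# figures, chosen only to show which term dominates.
import math

TRANSFER_BPS = 16e9   # assumed host->device bandwidth, bytes/s
DEVICE_OPS = 1e12     # assumed device throughput, ops/s

def total_time(elements: int, ops: float, bytes_per_elem: int = 8) -> float:
    """Seconds to move `elements` to the device and run `ops` operations."""
    transfer = elements * bytes_per_elem / TRANSFER_BPS
    compute = ops / DEVICE_OPS
    return transfer + compute

n = 1 << 30  # ~1e9 sorted elements

# Binary search: move everything, do ~30 operations. Pure transfer cost.
search = total_time(n, math.log2(n))

# 4096 x 4096 matmul: compute term actually outweighs the transfer term.
m = 4096
matmul = total_time(3 * m**2, 2 * m**3)

print(f"binary search: {search:.3f} s (virtually all transfer)")
print(f"matmul:        {matmul:.3f} s (compute-dominated)")
```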
For those skimming, and to add to the above: the article is using the GPU to work with system memory, since that's where the initial data starts and where the result needs to end up in this case, and comparing that to a CPU doing the same. The entire bottleneck is GPU-to-system-memory transfer.
If you're willing to work entirely out of GPU memory, the GPU will of course be faster even in this scenario.
Sure, but once you've justified moving data onto the GPU you don't want to incur the cost of moving the operation output back to the CPU unless you have to. So, for example, you might justify moving data to the GPU for a neural net convolution, but then also execute the following activation function (& subsequent operators) there because that's now where the data is.
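A toy model of that "keep the data where it is" idea. Everything here (`DeviceArray`, `to_device`, `to_host`, the transfer counter) is invented for illustration and just stands in for a real GPU API:

```python
# Pure-Python stand-in for device-resident arrays: count host<->device
# copies to show that chaining ops on-device keeps transfers constant.

transfers = {"count": 0}

class DeviceArray:
    """Pretend this data lives in GPU memory."""
    def __init__(self, data):
        self.data = data

    def map(self, fn):
        # Runs "on device": no host transfer, result stays resident.
        return DeviceArray([fn(x) for x in self.data])

def to_device(host_data):
    transfers["count"] += 1
    return DeviceArray(list(host_data))

def to_host(dev):
    transfers["count"] += 1
    return list(dev.data)

# Upload once, run the conv stand-in and then the activation on-device,
# download once at the very end.
x = to_device([-2.0, -1.0, 0.5, 3.0])
y = x.map(lambda v: 2 * v)            # stand-in for the convolution
z = y.map(lambda v: max(v, 0.0))      # ReLU activation, still "on device"
result = to_host(z)

print(result)              # [0.0, 0.0, 1.0, 6.0]
print(transfers["count"])  # 2 transfers total, no matter how many ops chained
```

The point is just the counter: adding more `.map` stages never adds transfers, whereas round-tripping to the host after each op would add two per stage.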
So a while back I was working on a chaotic renderer. This led me to a really weird set of situations:
* If the GPU is a non-dedicated older-style Intel GPU, use the CPU.
* If the GPU is non-dedicated anything else, do anything super parallel on the GPU, but keep anything the CPU can BRRRRRT through on the CPU, because the memory is shared.
* If the GPU is dedicated, move everything to GPU memory, keep it there, and only pull back small statistics, if at all possible.
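The heuristic above, written out as a function. `GpuKind` and the `highly_parallel` flag are invented names for illustration; a real version would query the driver for the device type:

```python
# Sketch of the three-way backend heuristic from the list above.
from enum import Enum, auto

class GpuKind(Enum):
    NONE = auto()
    INTEGRATED_OLD_INTEL = auto()   # non-dedicated, older-style Intel
    INTEGRATED_OTHER = auto()       # non-dedicated, anything else
    DEDICATED = auto()

def pick_backend(gpu: GpuKind, highly_parallel: bool) -> str:
    if gpu in (GpuKind.NONE, GpuKind.INTEGRATED_OLD_INTEL):
        return "cpu"
    if gpu is GpuKind.INTEGRATED_OTHER:
        # Shared memory: no transfer penalty either way, so split by
        # workload shape instead of by device.
        return "gpu" if highly_parallel else "cpu"
    # Dedicated: move everything over, keep it resident, and pull back
    # only small statistics.
    return "gpu-resident"

print(pick_backend(GpuKind.INTEGRATED_OLD_INTEL, True))  # cpu
print(pick_backend(GpuKind.INTEGRATED_OTHER, False))     # cpu
print(pick_backend(GpuKind.DEDICATED, True))             # gpu-resident
```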