For problems like matrix multiplication, you communicate O(n²) data (the input and output matrices) but perform O(n³) operations to calculate.
For problems like the dot product, you communicate O(n) data but perform only O(n) operations.
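A quick back-of-envelope sketch of that ratio (the element counts below are illustrative assumptions: n×n input matrices for matmul, length-n vectors for the dot product, classical non-blocked algorithms):

```python
# Arithmetic intensity: useful operations per element transferred.
# All constants here are illustrative, not measured.

def matmul_intensity(n: int) -> float:
    """Ops per transferred element for classical n x n matrix multiplication."""
    ops = 2 * n**3          # n^3 multiply-adds, counted as 2 ops each
    transferred = 3 * n**2  # two input matrices + one output matrix
    return ops / transferred

def dot_intensity(n: int) -> float:
    """Ops per transferred element for a length-n dot product."""
    ops = 2 * n             # roughly n multiplies + n adds
    transferred = 2 * n + 1 # two input vectors + one scalar result
    return ops / transferred

# Matmul intensity grows with n; the dot product stays stuck near 1
# no matter how large n gets.
print(matmul_intensity(1024))  # ~683 ops per element moved
print(dot_intensity(1024))     # ~1 op per element moved
```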
Compute must be substantially larger than communication cost if you hope to see any benefit. An asymptotic gap obviously helps, but even a constant-factor advantage can be enough.
You'd never transfer N elements to perform an O(log N) binary search, for example. At that point communication completely dominates.
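A toy cost model makes the binary-search case concrete. The bandwidth and throughput figures below are made-up round numbers for illustration, not measurements of any real hardware:

```python
# Toy transfer-plus-compute cost model. The rates are assumed round
# figures, chosen only to show which term dominates.
import math

TRANSFER_BPS = 16e9   # assumed host->device bandwidth, bytes/s
DEVICE_OPS = 1e12     # assumed device throughput, ops/s

def total_time(elements: int, ops: float, bytes_per_elem: int = 8) -> float:
    """Seconds to move `elements` to the device and run `ops` operations."""
    transfer = elements * bytes_per_elem / TRANSFER_BPS
    compute = ops / DEVICE_OPS
    return transfer + compute

n = 1 << 30  # ~1e9 sorted elements

# Binary search: move everything, do ~30 operations. Pure transfer cost.
search = total_time(n, math.log2(n))

# 4096 x 4096 matmul: compute term actually outweighs the transfer term.
m = 4096
matmul = total_time(3 * m**2, 2 * m**3)

print(f"binary search: {search:.3f} s (virtually all transfer)")
print(f"matmul:        {matmul:.3f} s (compute-dominated)")
```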
For those skimming, and to add to the above: the article is using the GPU to work with system memory, since that's where the initial data starts and where the result needs to end up in this case, and comparing that to a CPU doing the same. The entire bottleneck is GPU-to-system-memory transfer.
If you're willing to work entirely out of GPU memory, the GPU will of course be faster even in this scenario.
Sure, but once you've justified moving data onto the GPU you don't want to incur the cost of moving the operation output back to the CPU unless you have to. So, for example, you might justify moving data to the GPU for a neural net convolution, but then also execute the following activation function (& subsequent operators) there because that's now where the data is.
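A toy model of that "keep the data where it is" idea. Everything here (`DeviceArray`, `to_device`, `to_host`, the transfer counter) is invented for illustration and just stands in for a real GPU API:

```python
# Pure-Python stand-in for device-resident arrays: count host<->device
# copies to show that chaining ops on-device keeps transfers constant.

transfers = {"count": 0}

class DeviceArray:
    """Pretend this data lives in GPU memory."""
    def __init__(self, data):
        self.data = data

    def map(self, fn):
        # Runs "on device": no host transfer, result stays resident.
        return DeviceArray([fn(x) for x in self.data])

def to_device(host_data):
    transfers["count"] += 1
    return DeviceArray(list(host_data))

def to_host(dev):
    transfers["count"] += 1
    return list(dev.data)

# Upload once, run the conv stand-in and then the activation on-device,
# download once at the very end.
x = to_device([-2.0, -1.0, 0.5, 3.0])
y = x.map(lambda v: 2 * v)            # stand-in for the convolution
z = y.map(lambda v: max(v, 0.0))      # ReLU activation, still "on device"
result = to_host(z)

print(result)              # [0.0, 0.0, 1.0, 6.0]
print(transfers["count"])  # 2 transfers total, no matter how many ops chained
```

The point is just the counter: adding more `.map` stages never adds transfers, whereas round-tripping to the host after each op would add two per stage.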
So a while back I was working on a chaotic renderer. This led me to a really weird set of situations:
* If the GPU is a non-dedicated older-style Intel GPU, use the CPU.
* If the GPU is non-dedicated anything else, do anything super parallel on the GPU, but keep anything the CPU can BRRRRRT through on the CPU, because the memory is shared.
* If the GPU is dedicated, move everything to GPU memory, keep it there, and only pull back small statistics, if at all possible.
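The heuristic above, written out as a function. `GpuKind` and the `highly_parallel` flag are invented names for illustration; a real version would query the driver for the device type:

```python
# Sketch of the three-way backend heuristic from the list above.
from enum import Enum, auto

class GpuKind(Enum):
    NONE = auto()
    INTEGRATED_OLD_INTEL = auto()   # non-dedicated, older-style Intel
    INTEGRATED_OTHER = auto()       # non-dedicated, anything else
    DEDICATED = auto()

def pick_backend(gpu: GpuKind, highly_parallel: bool) -> str:
    if gpu in (GpuKind.NONE, GpuKind.INTEGRATED_OLD_INTEL):
        return "cpu"
    if gpu is GpuKind.INTEGRATED_OTHER:
        # Shared memory: no transfer penalty either way, so split by
        # workload shape instead of by device.
        return "gpu" if highly_parallel else "cpu"
    # Dedicated: move everything over, keep it resident, and pull back
    # only small statistics.
    return "gpu-resident"

print(pick_backend(GpuKind.INTEGRATED_OLD_INTEL, True))  # cpu
print(pick_backend(GpuKind.INTEGRATED_OTHER, False))     # cpu
print(pick_backend(GpuKind.DEDICATED, True))             # gpu-resident
```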