Even if you don't know the number of iterations, it can be helpful to "partially unroll" the loop: for example, going 8 or 16 elements at a time and doing a single check per group to make sure you're not doing more work than you were asked to do. Not only does this amortize the check across the loop body, it can also enable optimizations like automatic vectorization.
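A minimal sketch of the shape in plain host-side code (the function and names are made up for illustration):

```
// Process n floats, unrolled by 8: one bounds check covers a whole
// group, and a scalar tail loop handles the leftover elements.
void scale(float *a, float s, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {        // one check per 8 elements
        for (int j = 0; j < 8; ++j)     // fixed trip count, so the
            a[i + j] *= s;              // compiler can fully unroll it
    }
    for (; i < n; ++i)                  // tail: at most 7 elements
        a[i] *= s;
}
```

The fixed-trip-count inner loop is also exactly the shape an auto-vectorizer looks for.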
Automatic vectorization is not relevant to CUDA device code because the threading model is implicitly vector based. Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).
To be fair, there are actually 14 SIMD instructions intended for video operations, but I would be surprised if any compiler implemented an optimization pass to use them, since most code can't take advantage of them.
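For reference, these are exposed to device code as intrinsics such as __vadd2/__vadd4, which operate on packed 8-bit or 16-bit lanes within a 32-bit register. A minimal sketch, with a hypothetical kernel name and parameters:

```
// Each thread adds four packed 8-bit lanes in one SIMD instruction.
__global__ void add_packed_bytes(const unsigned int *a,
                                 const unsigned int *b,
                                 unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);  // 4 byte-wise adds, wrapping
}
```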
Reducing loop overhead does make sense, although I suspect this would be best informed by profiling. That said, I found the following explanation of how unrolling helps on a GPU:
It addresses the point that had me wondering whether loop unrolling made sense on GPUs at all:
> Branch prediction does not even exist on GPUs. The GPU thread scheduler will just switch execution to a different warp until the outcome of the branch has been resolved.
It then goes on to mention a few other benefits of loop unrolling on a GPU, including reducing loop overhead.
As saagarjha mentions, vectorization of loads and stores is important for memory bandwidth and can be applied automatically after unrolling a loop. Another important compiler optimization, one that both requires unrolling and is applied after it, is pre-fetching: for a loop whose iterations are independent, where each iteration performs loads and then some computation depending on the loaded values, the compiler can rearrange the loads so they are grouped ahead of the computations. The thread can then use ILP to keep issuing loads while previous ones are still in flight, as long as it has registers left to hold the results. Without unrolling, the computation stalls waiting for each load to return data; with unrolling, the loads overlap and make much better use of memory bandwidth.
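Here is a hedged sketch of the shape the compiler (or a manual unroll) is aiming for; the kernel is made up and assumes n is a multiple of the total thread count:

```
__global__ void sum4(const float * __restrict__ x, float *out, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;  // keeps warp loads coalesced
    // All four loads are issued back-to-back, so the thread has four
    // memory requests in flight at once (ILP) ...
    float a = x[tid + 0 * stride];
    float b = x[tid + 1 * stride];
    float c = x[tid + 2 * stride];
    float d = x[tid + 3 * stride];
    // ... whereas a rolled loop accumulating into one register would
    // serialize on an in-order GPU core: the add stalls on its load
    // before the next iteration's load can even issue.
    out[tid] = a + b + c + d;
}
```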
I describe a situation in my blog post where automatic unrolling and pre-fetching were no longer applied after changing a kernel to use FP16, and how I re-applied the optimizations manually to regain the lost performance: https://andrewkchan.dev/posts/yalm.html#section-3.5
BTW, this part is not entirely true:
> Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).
It is true that at one point instructions were executed in SIMT lockstep across warps: equal-size groups of threads (32 per warp) that subdivide blocks and are the fundamental unit of scheduling on the hardware.
However, since Volta (2017), the execution model allows threads in a warp to make forward progress independently, in any order, even in the absence of conditional code. From what I have seen, in practice threads still move forward in SIMT lockstep and only diverge into active/inactive subsets at branches. That said, there is no guarantee about when the subsets re-converge, and the lockstep behavior is something the hardware does for efficiency (https://stackoverflow.com/a/58122848/4151721) rather than to comply with any published programming model; i.e., it's implementation-specific behavior that could change at any time.
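The practical consequence is that post-Volta code should not rely on implicit lockstep, and the warp-level primitives take an explicit participation mask. A minimal sketch of the idiom (kernel name is made up; assumes a single block whose size is a multiple of 32):

```
__global__ void warp_sums(const float *x, float *out) {
    float v = x[threadIdx.x];
    // Explicit mask: don't assume the warp is converged; state which
    // lanes participate. Here, all 32 lanes of each warp do.
    const unsigned mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(mask, v, offset);
    if (threadIdx.x % 32 == 0)          // lane 0 holds its warp's sum
        out[threadIdx.x / 32] = v;
}
```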
Ok, where do you nerds hang out and why am I not there?? I'm loving this discussion, y'all seem to be a rather rare breed of dev though. Where is the community for whatever this sort of dev is called? E.g., Clojure has the Clojurians Slack, the Clojurians Zulip, we have an annual Clojure conference and a few spin-offs. Where do you guys hang out??
This stuff is really awesome and I would love to dig in more!
That is on a CPU. A GPU works differently: threads implicitly vectorize loads and stores as part of their warp/block. My question concerned GPUs, where you cannot vectorize instructions by loop unrolling since the instructions are already vector instructions.
I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing, but what I am talking about is literally a vectorized load/store, the same as you would see on a CPU. Like, you can do an ld/ld.64/ld.128 to specify the width of the memory operation. If your loop loads individual elements and it is possible to load them together, then the compiler can fuse them into a single wider access.
That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:
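A minimal sketch of an explicit 128-bit access via the built-in float4 type (kernel name made up; assumes 16-byte alignment, which cudaMalloc guarantees for the base pointer):

```
__global__ void copy4(const float4 * __restrict__ in, float4 *out,
                      int n4 /* length in float4 units */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];  // compiles to one 128-bit load + store
}
```

The same wide accesses can also fall out of the compiler fusing scalar loads after unrolling, when it can prove alignment and contiguity.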