Even if you don't know the number of iterations, it can be helpful to "partially unroll" the loop: for example, going 8 or 16 elements at a time and doing a single check per group to make sure you're not doing more work than you were asked to do. Not only does this amortize the check across the loop body, it can also enable optimizations like automatic vectorization.
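A minimal sketch of the shape in plain host-side code (the function and names are made up for illustration):

```
// Process n floats, unrolled by 8: one bounds check covers a whole
// group, and a scalar tail loop handles the leftover elements.
void scale(float *a, float s, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {        // one check per 8 elements
        for (int j = 0; j < 8; ++j)     // fixed trip count, so the
            a[i + j] *= s;              // compiler can fully unroll it
    }
    for (; i < n; ++i)                  // tail: at most 7 elements
        a[i] *= s;
}
```

The fixed-trip-count inner loop is also exactly the shape an auto-vectorizer looks for.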
Automatic vectorization is not relevant to CUDA device code because the threading model is implicitly vector based. Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).
To be fair, there are actually 14 SIMD instructions intended for video operations, but I would be surprised if any compiler implemented an optimization pass to use them, since most code can't take advantage of them.
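For reference, these are exposed to device code as intrinsics such as __vadd2/__vadd4, which operate on packed 8-bit or 16-bit lanes within a 32-bit register. A minimal sketch, with a hypothetical kernel name and parameters:

```
// Each thread adds four packed 8-bit lanes in one SIMD instruction.
__global__ void add_packed_bytes(const unsigned int *a,
                                 const unsigned int *b,
                                 unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);  // 4 byte-wise adds, wrapping
}
```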
Reducing loop overhead does make sense, although I suspect this would be best informed by profiling. That said, I found the following explanation of how unrolling helps on a GPU:
It addresses the point that had me wondering whether loop unrolling made sense on GPUs at all:
> Branch prediction does not even exist on GPUs. The GPU thread scheduler will just switch execution to a different warp until the outcome of the branch has been resolved.
It then goes on to mention a few other benefits of loop unrolling on a GPU, including reducing loop overhead.
As saagarjha mentions, vectorization of loads and stores is important for memory bandwidth and can be applied automatically after unrolling a loop. Another important compiler optimization, one that both requires unrolling and is applied after it, is pre-fetching: for a loop whose iterations are independent, where each iteration performs loads and then some computation depending on the loaded values, the compiler can rearrange the loads so they are grouped ahead of the computations. The thread can then use ILP to keep issuing loads while previous ones are still in flight, as long as it has registers left to hold the results. Without unrolling, the computation stalls waiting for each load to return data; with unrolling, the loads overlap and make much better use of memory bandwidth.
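Here is a hedged sketch of the shape the compiler (or a manual unroll) is aiming for; the kernel is made up and assumes n is a multiple of the total thread count:

```
__global__ void sum4(const float * __restrict__ x, float *out, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;  // keeps warp loads coalesced
    // All four loads are issued back-to-back, so the thread has four
    // memory requests in flight at once (ILP) ...
    float a = x[tid + 0 * stride];
    float b = x[tid + 1 * stride];
    float c = x[tid + 2 * stride];
    float d = x[tid + 3 * stride];
    // ... whereas a rolled loop accumulating into one register would
    // serialize on an in-order GPU core: the add stalls on its load
    // before the next iteration's load can even issue.
    out[tid] = a + b + c + d;
}
```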
I describe a situation in my blog post where automatic unrolling and pre-fetching were no longer applied after changing a kernel to use FP16, and how I re-applied the optimizations manually to regain the lost performance: https://andrewkchan.dev/posts/yalm.html#section-3.5
BTW, this part is not entirely true:
> Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).
It is true that at one point instructions were executed in SIMT lockstep across warps: equal-size groups of threads (32 per warp) that subdivide blocks and are the fundamental unit of scheduling on the hardware.
However, since Volta (2017), the execution model allows threads in a warp to make forward progress independently, in any order, even in the absence of conditional code. From what I have seen, in practice threads still move forward in SIMT lockstep and only diverge into active/inactive subsets at branches. That said, there is no guarantee about when the subsets re-converge, and the lockstep behavior is something the hardware does for efficiency (https://stackoverflow.com/a/58122848/4151721) rather than to comply with any published programming model; i.e., it's implementation-specific behavior that could change at any time.
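The practical consequence is that post-Volta code should not rely on implicit lockstep, and the warp-level primitives take an explicit participation mask. A minimal sketch of the idiom (kernel name is made up; assumes a single block whose size is a multiple of 32):

```
__global__ void warp_sums(const float *x, float *out) {
    float v = x[threadIdx.x];
    // Explicit mask: don't assume the warp is converged; state which
    // lanes participate. Here, all 32 lanes of each warp do.
    const unsigned mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(mask, v, offset);
    if (threadIdx.x % 32 == 0)          // lane 0 holds its warp's sum
        out[threadIdx.x / 32] = v;
}
```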
Ok, where do you nerds hang out and why am I not there?? I'm loving this discussion, y'all seem to be a rather rare breed of dev though. Where is the community for whatever this sort of dev is called? E.g., Clojure has the Clojurians Slack, the Clojurians Zulip, we have an annual Clojure conference and a few spin-offs. Where do you guys hang out??
This stuff is really awesome and I would love to dig in more!
That is on a CPU. A GPU works differently: threads implicitly vectorize loads and stores as part of their warp/block. My question concerned GPUs, where you cannot vectorize instructions by loop unrolling since the instructions are already vector instructions.
I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing, but what I am talking about is literally a vectorized load/store, the same as you would see on a CPU. Like, you can do an ld/ld.64/ld.128 to specify the width of the memory operation. If your loop loads individual elements and it is possible to load them together, then the compiler can fuse them into a single wider access.
That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:
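A minimal sketch of an explicit 128-bit access via the built-in float4 type (kernel name made up; assumes 16-byte alignment, which cudaMalloc guarantees for the base pointer):

```
__global__ void copy4(const float4 * __restrict__ in, float4 *out,
                      int n4 /* length in float4 units */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];  // compiles to one 128-bit load + store
}
```

The same wide accesses can also fall out of the compiler fusing scalar loads after unrolling, when it can prove alignment and contiguity.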