Wow, great answer. That deserves more upvotes than it's probably going to get. S...

ryao · on Dec 17, 2024

You don’t need to repeatedly call the kernels. You just need to loop inside the kernels to have each process every nth multiplication according to their thread id, adding it to the thread’s partial sum and have a branch in the loop skip the addition (or add 0) when you exceed the range. Afterward, you would do a horizontal sum of the partial sums to get the value for the corresponding field in the output vector, and you would have a “grid” do this for all members of the output vector. The original poster’s mathmul kernel does this. See the example under “we can perform a warp sum reduction to combine results from all threads”:

https://andrewkchan.dev/posts/yalm.html#section-3.2

It does “if (i >= d) return;“ to handle non-multiples in one dimension and skips the addition in its dot product, matmul_row(), to handle non-multiples in the other dimension.

In CUDA, you have threads that are organized into 32 thread warps, that are organized into blocks of warps, that are organized into grids. Threads within a block can share memory, but blocks within a grid cannot. The hardware will effectively loop over blocks in the grid until it finishes. The dimension for the output vector does not require sharing memory, so we can set that to the grid size. The hardware might actually round the grid up, so “if (i >= d) return;” is a check for the case that we are executing past the end of the output vector in case the hardware schedules pass the end of the vector.

As for the other dimension, we only have a limited number of threads (in this example 32, which is a single warp), so we need each to do partial sums of the multiplications in the dot product. We essentially break the dot product into thread count (32 here) chunks. Each thread will know which multiplication in the thread count chunk to compute based on its thread id, and we keep adding the thread count (32 here) to that to get the next one until we have finished. The for loop’s loop condition protects us from going past the end of the dot product dimension.

Then we have thread count number of partial sums, so we need to do a horizontal sum across the threads in a block to get the final sum. This part is a bit magical and is done via a warp sum reduction. Since the threads share memory, you basically have them start summing their neighbors’ values into their own. Basically, all threads find pairs and one of the pair does the sum, cutting the threads doing the sum in half. Then the ones that did the sums pair up again. This again cuts the number in half. This continues until only 1 thread remains. Then only the remaining thread is allowed to write to “out[i] = rowSum” due to “if (threadIdx.x == 0)”. The warp itself is somewhat special since it can do a really fast horizontal sum via shuffles that cannot be done when using thread blocks ghat utilize multiple warps. That is why the author chose to have the block size be exactly 1 warp rather than a multiple of warps.

It helps to think about the threads in the block as a spreadsheet. The columns are the threads and the rows represent instructions. Every thread executes the same instruction, and when a branch says not to do something for only some threads, the ones excluded do noops until execution reaches a point where they all do the same thing. An if else similarly has noops in threads not executing true, followed by noops in the threads not executing false. This is called branch divergence. There are diagrams showing a spreadsheet like view of this, such as the following, which shows a cutdown example:

https://image1.slideserve.com/2427671/branch-divergence-l.jp...

I have mentioned a horizontal sum multiple times. That just means summing across the columns inside rows of the “spreadsheet”. It is called horizontal since rows are horizontal. Coincidentally, the partial sum calculation is called a vertical sum since the columns are vertical.

Now we need to consider the grid, which is yet another dimension and is the dimension along which we populate the output vector. Basically, you can view the grid as doing multiple spreadsheets in parallel, but not all at once. One comical way to imagine it is to imagine having printed the same spreadsheet hundreds of times an and then feeding them into a paper shredder head first. A normal paper shredder can only do a few at a time, and similarly, the GPU can only do so many at a time too. You can imagine the processing to proceed along the rows in all of the spreadsheets being shredded at the same time.

upghost · on Dec 18, 2024

Damn, wow. Good stuff. First of all, I assumed looping within kernels was "always bad" from a performance standpoint (not that I'd done any benchmarks on looping vs repeated kernel invocation). I also assumed it was "always bad" to include conditionals.

That certainly opens up a lot of possibilities on algorithms. It also explains why people think I'm crazy when I say I don't know how to multiply matrices that are not in power of 2 dimensions haha.

Thank you, not sure why I was limiting myself like that. Not too many people to chat to about low level GPGPU stuff!

ryao · 2024-12-19T15:25:19 1734621919

It is an area where I am still learning, but as far as I know, the badness of things is relative. CPU-GPU communication is the major thing to avoid. In my CUDA llama3.c fork, making changes to reduce CPU-GPU communication generally increases performance. That means combining kernels and reducing cudaMemcpy and cudaMalloc calls. CUDA graphs likely could give further reductions, but I have yet to run out of ideas for more normal methods for getting them.

To give a concrete example of CPU-GPU communication in a tight loop being bad (much like what you were planning to do), when I initially moved softmax to the GPU, I had been invoking the kernel in a host side loop (since attention calls softmax multiple times per pass), and it slowed down inference by a few tokens per second versus the CPU softmax. I eliminated the host side loop by using the grid dimension so that 1 CUDA call did all of the loop iterations implicitly and inference became a few tokens per second faster than the CPU softmax.

As for branches, avoiding them on the GPU where possible can be beneficial, as long as you are not doing something more expensive like CPU-GPU communication in their place. For example, in my softmax kernel, rather than do a branch to calculate the maximum value inside my loop, I use fmaxf(). This is better since there is no branch, and fmaxf is incredibly cheap since it is a single PTX instruction (which presumably maps to a single SASS intruction).

GPGPU grew out of work for having programmable shaders on GPUs and a big part of that was branching. As long as you are minimizing branch divergence and only using necessary branches, you are fine as far as branching on a GPU are concerned.

saagarjha · 2024-12-19T09:38:28 1734601108

Loops and conditions within kernels are much better than the alternative, since synchronizing with the CPU is quite expensive. If you unroll appropriately and hide latencies they are very effective.

ryao · 2024-12-19T19:50:56 1734637856

Sometimes, the compiler will unroll for you. I have noticed that nvcc will unroll this:

    #define warpSize 32
    float warpMax = partial_max;
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        float otherValue = __shfl_down_sync(0xFFFFFFFF, warpMax, offset);
        warpMax = fmaxf(warpMax, otherValue);
    }

Into this:

        bar.sync        0;
        shl.b32         %r40, %r2, 2;
        add.s32         %r8, %r39, %r40;
        ld.shared.f32   %f24, [%r8];
        mov.u32         %r43, 0;
        mov.b32         %r44, %f24;
        shfl.sync.down.b32      %r48|%p13, %r44, %r23, %r22, %r24;
        mov.b32         %f25, %r48;
        max.ftz.f32     %f26, %f24, %f25;
        mov.b32         %r49, %f26;
        shfl.sync.down.b32      %r51|%p14, %r49, %r27, %r22, %r24;
        mov.b32         %f27, %r51;
        max.ftz.f32     %f28, %f26, %f27;
        mov.b32         %r52, %f28;
        shfl.sync.down.b32      %r54|%p15, %r52, %r30, %r22, %r24;
        mov.b32         %f29, %r54;
        max.ftz.f32     %f30, %f28, %f29;
        mov.b32         %r55, %f30;
        shfl.sync.down.b32      %r56|%p16, %r55, %r21, %r22, %r24;
        mov.b32         %f31, %r56;
        max.ftz.f32     %f32, %f30, %f31;
        mov.b32         %r57, %f32;
        shfl.sync.down.b32      %r59|%p17, %r57, %r35, %r22, %r24;
        mov.b32         %f33, %r59;
        max.ftz.f32     %f34, %f32, %f33;
        mov.b32         %r60, %f34;

This avoids loop branching, which is unnecessary in this case because we know in advance when the loop terminates. The use of ftz instructions is because I compiled using `nvcc -ftz true -ptx -arch=sm_86 rung.cu -o rung.ptx`.

I am not sure if unrolling loops makes sense on a GPU when we don't know when the loop will terminate in advance. If you can suggest any references explaining when that is helpful, I would be happy to read them. :)

saagarjha · 2024-12-20T04:27:44 1734668864

Even if you don't know the number of iterations it can be helpful to "partially unroll" the loop, for example going by 8 or 16 elements at a time and doing a single check for all elements to make sure you're not doing more work than you were asked to do. Not only does this amortize checks for the loop body it can also enable optimizations like automatic vectorization.

ryao · 2024-12-20T14:31:42 1734705102

Automatic vectorization is not relevant to CUDA device code because the threading model is implicitly vector based. Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).

To be fair, there are actually 14 SIMD instructions intended for use by video operations, but I would be surprised if any compiler implemented optimization passes to use them since most code cannot use them:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

Reducing loop overhead does make sense, although I suspect this would be best informed by profiling the performance. That said, I found the following explanation for unrolling on a GPU helping:

https://www.researchgate.net/post/CUDA-Is_it_worth_it_to_unw...

It mentions what had me wondering if loop unrolling made sense on GPUs, which is:

> Branch prediction does not even exists on GPUs. The GPU thread scheduler will just switch execution to a different warp until the outcome of the branch has been resolved.

It then goes on to mention a few other benefits to loop unrolling on a GPU, including reducing loop overhead.

reasonableklout · 2024-12-21T05:12:44 1734757964

As saagarjha mentions, vectorization of loads and stores is important for memory bandwidth and can be done automatically after unrolling a loop. Another important compiler optimization which requires and is applied after loop unrolling is pre-fetching: that is, for a loop whose iterations are independent and each perform loads and then some computation depending on the loaded value, we can re-arrange the loads to be grouped before the computations. The thread can use ILP to continue issuing loads while previous ones are still in-flight as long as it still has registers to store the results. Without unrolling, the computations are stalled waiting for the load instructions to return data, while with unrolling, we are able to overlap loads and make much better use of memory bandwidth.

I describe a situation in my blog post where automatic unrolling and pre-fetching was no longer being applied after changing a kernel to use FP16, and how I re-applied the optimizations manually to regain performance: https://andrewkchan.dev/posts/yalm.html#section-3.5

Here's an NVIDIA blog post which discusses pre-fetching and unrolling more generally: https://developer.nvidia.com/blog/boosting-application-perfo...

ryao · 2024-12-21T15:59:31 1734796771

Thanks!

reasonableklout · 2024-12-22T03:34:30 1734838470

BTW, this part is not entirely true: > Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).

It is true that at one point instructions were executed in SIMT lockstep in warps, which are equal-size groups of CUDA cores (with each core mapping to one thread) that subdivide blocks and are the fundamental unit of execution on the hardware.

However, since Volta (2017), the execution model is allowed to move threads in a warp forward in any order, even in the absence of conditional code. Although from what I have seen, for now it appears that threads still move forward in SIMT lockstep and only diverge into active-inactive subsets at branches. That said, there is no guarantee on when the subsets may re-converge, and this is also only behavior which is done for efficiency by the hardware (https://stackoverflow.com/a/58122848/4151721) rather than in order to comply with any published programming model, e.g. it's implementation-specific behavior that could change at any time.

This is why the NVIDIA documentation (https://developer.nvidia.com/blog/using-cuda-warp-level-prim...) says to use __syncwarp() for operations with intra-warp dependencies and to not assume lockstep execution.

saagarjha · 2024-12-22T03:41:34 1734838894

I've seen divergence across high-latency instructions like memory loads.

upghost · 2024-12-23T22:24:05 1734992645

Ok, where do you nerds hang out and why am I not there?? I'm loving this discussion, y'all seem to be a rather rare breed of dev though. Where is the community for whatever this sort of dev is called? i.e., Clojure has the Clojurians Slack, the Clojurians Zulip, we have an annual Clojure conference and a few spin-offs. Where do you guys hang out??

This stuff is really awesome and I would love to dig in more!

ryao · 2024-12-24T01:59:45 1735005585

I have not found a specific hangout to have such discussions. They just organically happened here.

saagarjha · 2024-12-20T18:14:18 1734718458

Sure it is. Both loads and stores can be vectorized (and should, otherwise you’re leaving memory bandwidth on the table).

ryao · 2024-12-21T15:58:39 1734796719

That is on a CPU. A GPU works differently, such that threads on a GPU implicitly vectorize loads and stores as part of their warp/block. My question had concerned GPUs, where you cannot vectorize instructions by loop unrolling since the instructions are already vector instructions.

saagarjha · 2024-12-22T03:46:14 1734839174

I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing but what I am talking about is literally a vectorized load/store, the same you would see on a CPU. Like, you can do a ld/ld.64/ld.128 to specify the width of the memory operation. If your loops load individual elements and it is possible to load them together then the compiler can fuse them together.

ryao · 2024-12-23T00:35:54 1734914154

That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...