Your Triton code is great, nice work. I wouldn’t feel too bad about spending your time that way!
As it happens, I was also thinking it might be worthwhile to dive into the Triton sources, but for another reason: half2 arithmetic. That’s one thing the Triton branch lost that the (faster) CUDA kernels had, and I think it made a difference. In theory, on compatible hardware, half2 packs two float16 values into one 32-bit register so a single instruction operates on both, which means you can retire twice as many ops per second on float16 data, and that is what we’re processing here.
I can’t find anyone who has tried to get half2 working with Triton, though.
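
For anyone unfamiliar with half2, here is a rough plain-Python emulation of the idea, just to show the data layout: two float16 lanes share one 32-bit word, and one "instruction" updates both lanes. This is neither Triton nor CUDA code and it demonstrates no speedup. The helper names (`pack_half2`, `unpack_half2`, `hadd2`) are made up for illustration, with `hadd2` loosely named after CUDA's `__hadd2` intrinsic.

```python
import struct

def pack_half2(lo, hi):
    """Pack two float16 values into one 32-bit word (lo goes in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_half2(word):
    """Split a 32-bit word back into its two float16 lanes as (lo, hi)."""
    return struct.unpack("<2e", struct.pack("<I", word))

def hadd2(a, b):
    """Emulate a packed add: one call, two independent float16 additions."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

x = pack_half2(1.5, -2.0)
y = pack_half2(0.25, 4.0)
result = unpack_half2(hadd2(x, y))
print(hex(x))   # 0xc0003e00
print(result)   # (1.75, 2.0)
```

On real hardware the lane-wise add is a single instruction rather than an unpack, add, repack sequence, which is where the theoretical 2x comes from.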