Using the naming from one of the existing APIs would put too much bias towards that API. It started as a WebGPU project early on, but some features aren't present there, so mixing terms wasn't ideal. We're also working on extending CubeCL to CPU, so we want terms that aren't tied only to the GPU world.
There you go: you've hit basically two of three completely (AMD and Vulkan) and are close enough to CUDA that people would get it.
I have no idea what a plane connotes, and a cube paints a picture distinct enough from a block that I'll be continually reminding myself of the mapping.
What you did was pointless - you assigned new words to objects you don't own, and now your conceptual framework is askew from the true underlying one.
> CubeCL to CPU
There is zero affinity between GPU programming models and multicore CPU programming models. If you don't believe me, go ask the OpenMP people how they're doing at supporting GPUs.
Well, we can agree to disagree. CubeCL also has the concept of instruction parallelism, which would be used to target SIMD instructions on CPU. Our algorithms are normally flexible on both the plane size and the line size, adapting to the hardware with comptime logic. You are free to dislike the naming, but IMO a mix of multiple APIs is worse than something new.
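To make the line concept concrete, here's a minimal sketch of an element-wise kernel written over `Line<F>`. It's based on CubeCL's documented examples, but the exact `Line`/prelude API may differ between versions, so treat it as an approximation:

```rust
use cubecl::prelude::*;

// Sketch: each unit processes one `Line<F>`, i.e. a small vector of values.
// The line size is chosen on the host side when building the arguments, so
// the same kernel can map to SIMD lanes on CPU or vectorized loads on GPU.
#[cube(launch)]
fn add_lines<F: Float>(lhs: &Array<Line<F>>, rhs: &Array<Line<F>>, out: &mut Array<Line<F>>) {
    if ABSOLUTE_POS < out.len() {
        // Operators on `Line<F>` apply element-wise across the whole line.
        out[ABSOLUTE_POS] = lhs[ABSOLUTE_POS] + rhs[ABSOLUTE_POS];
    }
}
```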
For people who are interested, Kokkos (a C++ library for writing portable kernels) also has a naming scheme for hierarchical parallelism. They use ThreadTeam, Thread (for individual threads within a group), and ThreadVector (for per-thread SIMD).
Just commenting to share; personally I have no naming preference, but the hierarchical abstractions in general are incredibly useful.
It will make more sense once you start using CubeCL. There's now a CubeCL book available: https://burn.dev/books/cubecl/.
It does come with some mental overhead, but let’s be honest, there’s no objectively “good” choice here without introducing bias toward a specific vendor API.
Learning the core concepts takes effort, but if CubeCL is useful for your work, it’s definitely worth it.
Reminds me of ye olden days when kernel transforms were merely weighted multiplicative and/or additive matrices applied to every point in the source, arriving at pixel data in the target. Blur, sharpen, color channel filter, color swap, invert, etc. An embarrassingly parallel problem suitable for massive parallelism and concurrent calculation, because there is little/no dependency on prior calculations.
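For the record, that classic per-pixel kernel transform looks something like this in plain Rust (grayscale, row-major layout; the helper name and kernel values are made up for illustration):

```rust
/// Apply a 3x3 convolution kernel to a grayscale image stored row-major.
/// Each output pixel depends only on a fixed neighborhood of the input,
/// so every pixel can be computed independently (and thus in parallel).
/// Borders are left at zero for brevity.
fn convolve_3x3(src: &[f32], width: usize, height: usize, kernel: [[f32; 3]; 3]) -> Vec<f32> {
    let mut dst = vec![0.0f32; width * height];
    for y in 1..height - 1 {
        for x in 1..width - 1 {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    acc += src[(y + ky - 1) * width + (x + kx - 1)] * kernel[ky][kx];
                }
            }
            dst[y * width + x] = acc;
        }
    }
    dst
}

// Example: a sharpen kernel.
// let sharpened = convolve_3x3(&image, w, h,
//     [[0.0, -1.0, 0.0], [-1.0, 5.0, -1.0], [0.0, -1.0, 0.0]]);
```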
We have safe and unsafe versions for launching kernels, where we can ensure that a kernel won't corrupt data elsewhere (and therefore won't create memory errors or segfaults). But within a kernel, resources are mutable and shared between GPU cores, since that's how GPUs work.
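As a rough sketch of what the unchecked launch path looks like (names follow CubeCL's published examples, but exact signatures vary between versions, so treat this as an approximation rather than the definitive API):

```rust
use cubecl::prelude::*;

// A trivial element-wise kernel. Declaring it with `launch_unchecked`
// generates an unsafe host-side launcher; `launch` would generate the
// checked/safe one.
#[cube(launch_unchecked)]
fn double<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    // Manual bounds check, since the unchecked launch doesn't guard it for us.
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
    }
}

fn run<R: Runtime>(device: &R::Device) {
    let client = R::client(device);
    let input = [1.0f32, 2.0, 3.0, 4.0];
    let input_handle = client.create(f32::as_bytes(&input));
    let output_handle = client.empty(input.len() * core::mem::size_of::<f32>());

    // The caller takes responsibility for the launch configuration and
    // argument shapes being valid, hence the `unsafe` block.
    unsafe {
        double::launch_unchecked::<f32, R>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new(input.len() as u32, 1, 1),
            ArrayArg::from_raw_parts::<f32>(&input_handle, input.len(), 1),
            ArrayArg::from_raw_parts::<f32>(&output_handle, input.len(), 1),
        );
    }
}
```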
The need to build CubeCL came from the Burn deep learning framework (https://github.com/tracel-ai/burn), where we want to easily build algorithms like in CUDA with a real programming language, while also being able to integrate those algorithms inside a compiler at runtime to fuse dynamic graphs.
Since we don't want to rewrite everything multiple times, it also has to be multi-platform and optimal, so the feature set must be per-device, not per-language. I'm not aware of a tool that does that, especially in Rust (which Burn is written in).
We support warp operations, barriers for CUDA, atomics for most backends, and tensor core instructions as well. It's just not well documented in the README!
Amazing! Would love to try them! If possible, would also ask for a table translating between CubeCL and CUDA terminology. It seems like CUDA Warps are called Planes in CubeCL, and it’s probably not the only difference.
One of the main authors here. The README isn't fully up to date. We have our own GEMM implementation based on CubeCL. It's still moving a lot, but we support tensor cores, use warp operations (Plane Operations in CubeCL), and we even added TMA instructions for CUDA.
A lot happens at compile time: you can run arbitrary code in your kernel that executes at compile time, similar to generics but with more flexibility. It's very natural to branch on a comptime config to select an algorithm.
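A minimal sketch of that kind of comptime branching, assuming CubeCL's documented `#[comptime]` parameter attribute (names and exact syntax may differ between versions):

```rust
use cubecl::prelude::*;

// Sketch: `bounds_check` is a comptime value, so the outer `if` is resolved
// while the kernel is being expanded and only the selected branch ends up in
// the generated GPU code.
#[cube(launch)]
fn scale_by_two<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] bounds_check: bool) {
    if bounds_check {
        // Variant compiled when the input size may not divide the launch size.
        if ABSOLUTE_POS < input.len() {
            output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
        }
    } else {
        // Variant compiled when the launch configuration guarantees in-bounds access.
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
    }
}
```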
During the last iteration of CubeCL, we refactored the matrix multiplication GPU kernel to work with many different configurations and element types.
The goal was to improve performance and flexibility: using Tensor Cores when available, performing bounds checks only when necessary, supporting any tensor layout without any new allocation to transpose the matrices beforehand, along with many other improvements.
The performance is greatly improved, and now it works better with many different matrix shapes. However, I think we created an atrocity in terms of compilation speed. Simply compiling a few matmul kernels, using incremental compilation, took close to 2 minutes.
So we fixed it! I took the time to write a blog post with our solutions, since I believe this can be useful to Rust developers in general, even if the techniques might not be applicable to your projects.
Feel free to ask any questions here, about the techniques, the process, the algorithms, CubeCL, whatever you want!
Burn is now the first fully Rust-native deep learning framework: do everything in Rust, from GPU kernels to model definition. No CUDA, C++, or WGSL needed, thanks to CubeCL, which we released last month.
We've introduced a new tensor data format that offers faster serialization/deserialization and supports quantization (currently in beta). Loading and saving can be up to 4x as fast.
As always, we've added numerous bug fixes, new tensor operations, and improved documentation. Thanks to all contributors, over 50 for this release.