Using the naming from one of the existing APIs would put too much bias towards that API. It started as a WebGPU project early on, but some features aren't present there, so mixing terms wasn't ideal. We're also working on extending CubeCL to CPU, so we want terms that aren't tied only to the GPU world.
There you go: you've hit basically two of three completely (AMD and Vulkan) and are close enough to CUDA that people would get it.
I have no idea what a plane connotes, and a cube paints a picture distinct enough from a block that I'll be continually reminding myself of the mapping.
What you did was pointless - you assigned new words to objects you don't own, and now your conceptual framework is askew from the true underlying one.
> CubeCL to CPU
There is zero affinity between GPU programming models and multicore CPU programming models. If you don't believe me, go ask the OpenMP people how they're doing at supporting GPUs.
Well, we can agree to disagree. CubeCL also has the concept of instruction parallelism, which would be used to target SIMD instructions on CPU. Our algorithms are normally flexible on both the plane size and the line size, adapting to the hardware with comptime logic. You are free to dislike the naming, but IMO a mix of multiple APIs is worse than something new.
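To make the line concept concrete, here's a minimal sketch of an element-wise kernel written over `Line<F>`. It's based on CubeCL's documented examples, but the exact `Line`/prelude API may differ between versions, so treat it as an approximation:

```rust
use cubecl::prelude::*;

// Sketch: each unit processes one `Line<F>`, i.e. a small vector of values.
// The line size is chosen on the host side when building the arguments, so
// the same kernel can map to SIMD lanes on CPU or vectorized loads on GPU.
#[cube(launch)]
fn add_lines<F: Float>(lhs: &Array<Line<F>>, rhs: &Array<Line<F>>, out: &mut Array<Line<F>>) {
    if ABSOLUTE_POS < out.len() {
        // Operators on `Line<F>` apply element-wise across the whole line.
        out[ABSOLUTE_POS] = lhs[ABSOLUTE_POS] + rhs[ABSOLUTE_POS];
    }
}
```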
For people who are interested, Kokkos (a C++ library for writing portable kernels) also has a naming scheme for hierarchical parallelism. They use ThreadTeam, Thread (for individual threads within a group), and ThreadVector (for per-thread SIMD).
Just commenting to share; personally I have no naming preference, but the hierarchical abstractions in general are incredibly useful.
It will make more sense once you start using CubeCL. There's now a CubeCL book available: https://burn.dev/books/cubecl/.
It does come with some mental overhead, but let’s be honest, there’s no objectively “good” choice here without introducing bias toward a specific vendor API.
Learning the core concepts takes effort, but if CubeCL is useful for your work, it’s definitely worth it.
Reminds me of ye olden days when kernel transforms were merely weighted multiplicative and/or additive matrices applied to every point in the source, arriving at pixel data in the target. Blur, sharpen, color channel filter, color swap, invert, etc. An embarrassingly parallel problem suitable for massive parallelism and concurrent calculation, because there is little/no dependency on prior calculations.
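For the record, that classic per-pixel kernel transform looks something like this in plain Rust (grayscale, row-major layout; the helper name and kernel values are made up for illustration):

```rust
/// Apply a 3x3 convolution kernel to a grayscale image stored row-major.
/// Each output pixel depends only on a fixed neighborhood of the input,
/// so every pixel can be computed independently (and thus in parallel).
/// Borders are left at zero for brevity.
fn convolve_3x3(src: &[f32], width: usize, height: usize, kernel: [[f32; 3]; 3]) -> Vec<f32> {
    let mut dst = vec![0.0f32; width * height];
    for y in 1..height - 1 {
        for x in 1..width - 1 {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    acc += src[(y + ky - 1) * width + (x + kx - 1)] * kernel[ky][kx];
                }
            }
            dst[y * width + x] = acc;
        }
    }
    dst
}

// Example: a sharpen kernel.
// let sharpened = convolve_3x3(&image, w, h,
//     [[0.0, -1.0, 0.0], [-1.0, 5.0, -1.0], [0.0, -1.0, 0.0]]);
```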
We have safe and unsafe versions for launching kernels, where we can ensure that a kernel won't corrupt data elsewhere (and therefore won't create memory errors or segfaults). But within a kernel, resources are mutable and shared between GPU cores, since that's how GPUs work.
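As a rough sketch of what the unchecked launch path looks like (names follow CubeCL's published examples, but exact signatures vary between versions, so treat this as an approximation rather than the definitive API):

```rust
use cubecl::prelude::*;

// A trivial element-wise kernel. Declaring it with `launch_unchecked`
// generates an unsafe host-side launcher; `launch` would generate the
// checked/safe one.
#[cube(launch_unchecked)]
fn double<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    // Manual bounds check, since the unchecked launch doesn't guard it for us.
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
    }
}

fn run<R: Runtime>(device: &R::Device) {
    let client = R::client(device);
    let input = [1.0f32, 2.0, 3.0, 4.0];
    let input_handle = client.create(f32::as_bytes(&input));
    let output_handle = client.empty(input.len() * core::mem::size_of::<f32>());

    // The caller takes responsibility for the launch configuration and
    // argument shapes being valid, hence the `unsafe` block.
    unsafe {
        double::launch_unchecked::<f32, R>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new(input.len() as u32, 1, 1),
            ArrayArg::from_raw_parts::<f32>(&input_handle, input.len(), 1),
            ArrayArg::from_raw_parts::<f32>(&output_handle, input.len(), 1),
        );
    }
}
```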
The need to build CubeCL came from the Burn deep learning framework (https://github.com/tracel-ai/burn), where we want to easily build algorithms like in CUDA with a real programming language, while also being able to integrate those algorithms inside a compiler at runtime to fuse dynamic graphs.
Since we don't want to rewrite everything multiple times, it also has to be multi-platform and optimal, so the feature set must be per-device, not per-language. I'm not aware of a tool that does that, especially in Rust (which Burn is written in).
We support warp operations, barriers for CUDA, atomics for most backends, and tensor core instructions as well. It's just not well documented in the README!
Amazing! Would love to try them! If possible, would also ask for a table translating between CubeCL and CUDA terminology. It seems like CUDA Warps are called Planes in CubeCL, and it’s probably not the only difference.
One of the main authors here. The README isn't fully up to date. We have our own GEMM implementation based on CubeCL. It's still moving a lot, but we support tensor cores, use warp operations (Plane Operations in CubeCL), and we even added TMA instructions for CUDA.
A lot happens at compile time: you can run arbitrary code in your kernel that executes at compile time, similar to generics but with more flexibility. It's very natural to branch on a comptime config to select an algorithm.
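A minimal sketch of that kind of comptime branching, assuming CubeCL's documented `#[comptime]` parameter attribute (names and exact syntax may differ between versions):

```rust
use cubecl::prelude::*;

// Sketch: `bounds_check` is a comptime value, so the outer `if` is resolved
// while the kernel is being expanded and only the selected branch ends up in
// the generated GPU code.
#[cube(launch)]
fn scale_by_two<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] bounds_check: bool) {
    if bounds_check {
        // Variant compiled when the input size may not divide the launch size.
        if ABSOLUTE_POS < input.len() {
            output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
        }
    } else {
        // Variant compiled when the launch configuration guarantees in-bounds access.
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] + input[ABSOLUTE_POS];
    }
}
```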
During the last iteration of CubeCL, we refactored the matrix multiplication GPU kernel to work with many different configurations and element types.
The goal was to improve performance and flexibility: using Tensor Cores when available, performing bounds checks only when necessary, supporting any tensor layout without any new allocation to transpose the matrices beforehand, along with many other improvements.
The performance is greatly improved, and now it works better with many different matrix shapes. However, I think we created an atrocity in terms of compilation speed. Simply compiling a few matmul kernels, using incremental compilation, took close to 2 minutes.
So we fixed it! I took the time to write a blog post with our solutions, since I believe this can be useful to Rust developers in general, even if the techniques might not be applicable to your projects.
Feel free to ask any questions here, about the techniques, the process, the algorithms, CubeCL, whatever you want!
Burn is now the first fully Rust-native deep learning framework: do everything in Rust, from GPU kernels to model definition. No CUDA, C++, or WGSL needed, thanks to CubeCL, which we released last month.
We've introduced a new tensor data format that offers faster serialization/deserialization and supports quantization (currently in beta). Loading and saving can be up to 4x as fast.
As always, we've added numerous bug fixes, new tensor operations, and improved documentation. Thanks to all contributors, over 50 for this release.