boywitharupee's comments | Hacker News

shouldn't the title be "CUDA Tile IR Open Sourced"?


It's more or less the same thing. CUDA Tile is the name of the IR, cuTile is the name of the high-level DSLs.


is there a document or reference implementation that describes the full algorithm? tiling, sorting, merging, and strip conversion.


> In C++, it's an rvalue reference, which can be effectively thought of as an lvalue

hmm...this doesn't sound quite right? the comma operator's result in C++ is not an rvalue reference - it takes on exactly the value category of its right operand (which in this case is an lvalue)
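A quick way to check this (a minimal sketch, not from the thread): decltype on an expression reports its value category, yielding T& for an lvalue and plain T for a prvalue.

    #include <type_traits>

    int main() {
        int a = 0, b = 0;
        // The comma operator's result takes the value category of its right
        // operand, so (a, b) is an lvalue here and decltype((a, b)) is int&.
        static_assert(std::is_same_v<decltype((a, b)), int&>,   // needs C++17
                      "comma result is an lvalue");
        // By contrast, a + b is a prvalue, so decltype gives plain int.
        static_assert(std::is_same_v<decltype(a + b), int>,
                      "prvalue, not a reference");
    }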


so, these are hand-optimized primitives for specific models of nvidia gpus? do you still have to make launch/scheduling decisions to maximize occupancy? how does this approach scale to other target devices with specialized instruction sets and different architectures?


can someone explain how profiling tools like this are written for GPU applications? wouldn't you need access to an internal runtime API?

for example, Apple wraps Metal buffers as "Debug" buffers to record allocations/deallocations.


Some graphics APIs support commands that tell the GPU to record a timestamp when it gets to processing the command. This is oversimplified, but it's essentially what you ask the GPU to do. There are lots of gotchas in hardware that make this more difficult in practice, as a GPU won't always execute and complete work exactly as you specify at the API level if it's safe not to. And the timestamp domain isn't always the same as the CPU's.

But in principle it’s not that different to how you just grab timestamps on the CPU. On Vulkan the API used is called “timestamp queries”

It’s quite tricky on tiled renderers like Arm/Qualcomm/Apple as they can’t provide meaningful timestamps at much tighter granularity than a whole renderpass. I believe Metal only allows you to query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level anyway)
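For concreteness, here's roughly what a Vulkan timestamp query looks like. This is a fragment, not a complete program: it assumes the device, command buffer (cmd), and device properties (props) come from the usual Vulkan setup, and it omits error handling and query-pool reuse.

    #include <vulkan/vulkan.h>

    // Create a query pool with two timestamp slots.
    VkQueryPoolCreateInfo info{VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO};
    info.queryType  = VK_QUERY_TYPE_TIMESTAMP;
    info.queryCount = 2;
    VkQueryPool pool;
    vkCreateQueryPool(device, &info, nullptr, &pool);

    // While recording the command buffer: reset the pool, then ask the GPU
    // to write a timestamp before and after the work being measured.
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    /* ... draws / dispatches being profiled ... */
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);

    // Once the work has completed, read back the two tick counts and convert
    // to nanoseconds using the device's timestampPeriod.
    uint64_t ticks[2];
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                          sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
    double ns = double(ticks[1] - ticks[0]) * props.limits.timestampPeriod;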


I don't know about Tracy, but I've seen a couple WebGPU JS debugging tools simply intercepting calls to the various WebGPU functions like writeBuffer, draw, etc, by modifying the prototypes of Device, Queue and so on[0].

- 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...


what kind of model architecture was used for this? is it safe to assume they used a transformer model or a variant of it?


what's the purpose of this? is it one of those 'fun' problems to solve?


This quote might help - https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant#Histo...

yes, a fun problem, but also a criticism of using too many parameters.


how different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.
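For contrast, basic Faiss usage is just dense vector search. A minimal sketch using its C++ API (an exact L2 index; dimensions and sizes are arbitrary for illustration):

    #include <faiss/IndexFlatL2.h>
    #include <cstdint>
    #include <vector>

    int main() {
        const int d  = 64;        // embedding dimension (arbitrary)
        const int nb = 1000;      // number of database vectors
        std::vector<float> xb(size_t(nb) * d, 0.0f);   // fill with real embeddings

        faiss::IndexFlatL2 index(d);   // exact (brute-force) L2 index
        index.add(nb, xb.data());

        const int k = 5;               // nearest neighbours per query
        std::vector<float> q(d, 0.0f);
        std::vector<int64_t> ids(k);   // Faiss ids are 64-bit ints
        std::vector<float> dist(k);
        index.search(1, q.data(), k, dist.data(), ids.data());
        // ids now holds the k most similar stored vectors by L2 distance --
        // pure similarity search, no entities or typed relationships involved.
    }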

Looking at an example prompt from GraphRAG will make things clear: https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

Format each relationship as a JSON entry with the following format:

{{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


In a similar fashion, JAX's frontend code is open source, while the device-related code is distributed as binaries. For example, if you're on Google's TPU you'll see libtpu.so, and on macOS you'll see pjrt_plugin_metal_1.x.dylib.

The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.


> At runtime, C&P generates executable code by copying the object code and patching the holes with runtime known values.

how would this work on OSs under hardened runtime rules?


The same as with any other JIT runtime: you do your transformations first, and then you do the `mprotect` call that turns write permissions off and execution permissions on. The only caveats I can think of (`pledge`d not to use `mprotect`, marked most of the address space with `mimmutable`) apply to all other JITs too. The gist is that you operate on a copy of code, and that copy is in a writable page until it's ready to run, so you never violate the W^X rule.
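A bare-bones sketch of that flow (function and parameter names are made up for illustration; real JITs also flush the instruction cache on some architectures, and hardened macOS additionally wants MAP_JIT plus pthread_jit_write_protect_np):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <sys/mman.h>

    using JitFn = int (*)(int);

    JitFn emit(const unsigned char* templ, size_t len,
               size_t hole_off, int32_t imm /* runtime-known value */) {
        // 1. Map a fresh page as read/write (never writable+executable at once).
        void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return nullptr;

        // 2. Copy the pre-compiled object code and patch the hole.
        std::memcpy(page, templ, len);
        std::memcpy(static_cast<char*>(page) + hole_off, &imm, sizeof(imm));

        // 3. Drop write permission and add execute permission in one step,
        //    so the W^X rule is never violated.
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) return nullptr;

        return reinterpret_cast<JitFn>(page);
    }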


Or you do what V8 does with WebAssembly and just use WX pages because doing it correctly is "too hard" to do without losing performance.


Does that even work on W^X platforms? The context for my response has that assumption; we can't simply throw it out the window, right? I think I read somewhere about making two mappings to the same physical page (one W, one X) - are you referring to that? (I'd still need to know how that works, as it kinda defeats the protection; the OS should prohibit that, right?)
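For reference, the dual-mapping trick being described looks roughly like this on Linux. This is only a sketch (memfd_create is Linux/glibc >= 2.27 specific), and nothing here is a claim about what V8 actually does:

    #include <cstddef>
    #include <sys/mman.h>   // mmap, memfd_create (with _GNU_SOURCE)
    #include <unistd.h>     // ftruncate, close

    // Two views of the same physical pages: one writable, one executable.
    // No single mapping is ever writable and executable at the same time,
    // which satisfies a per-mapping W^X check (a stricter OS policy could
    // still refuse to map the same object both ways).
    int make_dual_mapping(size_t len, void** write_view, void** exec_view) {
        int fd = memfd_create("jit-code", 0);
        if (fd < 0) return -1;
        if (ftruncate(fd, len) != 0) { close(fd); return -1; }

        *write_view = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        *exec_view  = mmap(nullptr, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
        close(fd);  // the mappings keep the memory alive
        return (*write_view == MAP_FAILED || *exec_view == MAP_FAILED) ? -1 : 0;
    }
    // Writes through write_view become visible at exec_view, so a JIT can
    // patch code through the first pointer and run it through the second.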


Oh, for sure what I said wouldn't work on a W^X system. I was just pointing out that one of the most widely used JIT software uses WX pages.

What OSes prohibit that? Linux doesn't (well, I think it can with SELinux maybe?). OpenBSD might?


The question was about OSes with hardened runtime protections. The most basic of them all is W^X. All BSDs use it, and IIRC Linux is able to enforce it as well. I'd be surprised if it isn't the default in most distros, but I guess it's not impossible. I need to go for lunch so I won't check right now.


