boywitharupee's comments | Hacker News

shouldn't the title be "CUDA Tile IR Open Sourced"?


It's more or less the same thing. CUDA Tile is the name of the IR, cuTile is the name of the high-level DSLs.


is there a document or reference implementation that describes the full algorithm? tiling, sorting, merging, and strip conversion.


> In C++, it's an rvalue reference, which can be effectively thought of as an lvalue

hmm...this doesn't sound quite right? the comma operator's result in C++ is not an rvalue reference - it takes on exactly the value category of its right operand (which in this case is an lvalue)
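A quick way to check this (a minimal sketch, not from the thread): decltype on an expression reports its value category, yielding T& for an lvalue and plain T for a prvalue.

    #include <type_traits>

    int main() {
        int a = 0, b = 0;
        // The comma operator's result takes the value category of its right
        // operand, so (a, b) is an lvalue here and decltype((a, b)) is int&.
        static_assert(std::is_same_v<decltype((a, b)), int&>,   // needs C++17
                      "comma result is an lvalue");
        // By contrast, a + b is a prvalue, so decltype gives plain int.
        static_assert(std::is_same_v<decltype(a + b), int>,
                      "prvalue, not a reference");
    }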


so, these are hand-optimized primitives for specific models of nvidia gpus? do you still have to make launch/scheduling decisions to maximize occupancy? how does this approach scale to other target devices with specialized instruction sets and different architectures?


can someone explain how profiling tools like this are written for GPU applications? wouldn't you need access to an internal runtime API?

for example, Apple wraps Metal buffers as "Debug" buffers to record allocations/deallocations.


Some graphics APIs support commands that tell the GPU to record a timestamp when it gets to processing the command. This is oversimplified, but it's essentially what you ask the GPU to do. There are lots of gotchas in hardware that make this more difficult in practice, as a GPU won't always execute and complete work exactly as you specify at the API level if it's safe not to. And the timestamp domain isn't always the same as the CPU's.

But in principle it’s not that different to how you just grab timestamps on the CPU. On Vulkan the API used is called “timestamp queries”

It’s quite tricky on tiled renderers like Arm/Qualcomm/Apple as they can’t provide meaningful timestamps at much tighter granularity than a whole renderpass. I believe Metal only allows you to query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level anyway)
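For concreteness, here's roughly what a Vulkan timestamp query looks like. This is a fragment, not a complete program: it assumes the device, command buffer (cmd), and device properties (props) come from the usual Vulkan setup, and it omits error handling and query-pool reuse.

    #include <vulkan/vulkan.h>

    // Create a query pool with two timestamp slots.
    VkQueryPoolCreateInfo info{VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO};
    info.queryType  = VK_QUERY_TYPE_TIMESTAMP;
    info.queryCount = 2;
    VkQueryPool pool;
    vkCreateQueryPool(device, &info, nullptr, &pool);

    // While recording the command buffer: reset the pool, then ask the GPU
    // to write a timestamp before and after the work being measured.
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    /* ... draws / dispatches being profiled ... */
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);

    // Once the work has completed, read back the two tick counts and convert
    // to nanoseconds using the device's timestampPeriod.
    uint64_t ticks[2];
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                          sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
    double ns = double(ticks[1] - ticks[0]) * props.limits.timestampPeriod;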


I don't know about Tracy, but I've seen a couple WebGPU JS debugging tools simply intercepting calls to the various WebGPU functions like writeBuffer, draw, etc, by modifying the prototypes of Device, Queue and so on[0].

- 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...


what kind of model architecture was used for this? is it safe to assume they used a transformer model or a variant of it?


what's the purpose of this? is it one of those 'fun' problems to solve?


This quote might help - https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant#Histo...

yes, a fun problem, but also a criticism of using too many parameters.


how different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.
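For contrast, basic Faiss usage is just dense vector search. A minimal sketch using its C++ API (an exact L2 index; dimensions and sizes are arbitrary for illustration):

    #include <faiss/IndexFlatL2.h>
    #include <cstdint>
    #include <vector>

    int main() {
        const int d  = 64;        // embedding dimension (arbitrary)
        const int nb = 1000;      // number of database vectors
        std::vector<float> xb(size_t(nb) * d, 0.0f);   // fill with real embeddings

        faiss::IndexFlatL2 index(d);   // exact (brute-force) L2 index
        index.add(nb, xb.data());

        const int k = 5;               // nearest neighbours per query
        std::vector<float> q(d, 0.0f);
        std::vector<int64_t> ids(k);   // Faiss ids are 64-bit ints
        std::vector<float> dist(k);
        index.search(1, q.data(), k, dist.data(), ids.data());
        // ids now holds the k most similar stored vectors by L2 distance --
        // pure similarity search, no entities or typed relationships involved.
    }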

Looking at an example prompt from GraphRAG will make things clear: https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

Format each relationship as a JSON entry with the following format:

{{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


In a similar fashion, JAX's frontend code is open source, while the device-related code is distributed as binaries. For example, if you're on Google's TPU you'll see libtpu.so, and on macOS you'll see pjrt_plugin_metal_1.x.dylib.

The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.


> At runtime, C&P generates executable code by copying the object code and patching the holes with runtime known values.

how would this work on OSs under hardened runtime rules?


The same as with any other JIT runtime: you do your transformations first, and then you do the `mprotect` call that turns write permissions off and execution permissions on. The only caveats I can think of (`pledge`d not to use `mprotect`, marked most of the address space with `mimmutable`) apply to all other JITs too. The gist is that you operate on a copy of code, and that copy is in a writable page until it's ready to run, so you never violate the W^X rule.
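A bare-bones sketch of that flow (function and parameter names are made up for illustration; real JITs also flush the instruction cache on some architectures, and hardened macOS additionally wants MAP_JIT plus pthread_jit_write_protect_np):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <sys/mman.h>

    using JitFn = int (*)(int);

    JitFn emit(const unsigned char* templ, size_t len,
               size_t hole_off, int32_t imm /* runtime-known value */) {
        // 1. Map a fresh page as read/write (never writable+executable at once).
        void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return nullptr;

        // 2. Copy the pre-compiled object code and patch the hole.
        std::memcpy(page, templ, len);
        std::memcpy(static_cast<char*>(page) + hole_off, &imm, sizeof(imm));

        // 3. Drop write permission and add execute permission in one step,
        //    so the W^X rule is never violated.
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) return nullptr;

        return reinterpret_cast<JitFn>(page);
    }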


Or you do what V8 does with WebAssembly and just use WX pages because doing it correctly is "too hard" to do without losing performance.


Does that even work on W^X platforms? The context for my response has that assumption; we can't simply throw it out the window, right? I think I read somewhere about making two mappings to the same physical page (one W, one X) - are you referring to that? (I'd still need to know how that works, as it kinda defeats the protection; the OS should prohibit that, right?)
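For reference, the dual-mapping trick being described looks roughly like this on Linux. This is only a sketch (memfd_create is Linux/glibc >= 2.27 specific), and nothing here is a claim about what V8 actually does:

    #include <cstddef>
    #include <sys/mman.h>   // mmap, memfd_create (with _GNU_SOURCE)
    #include <unistd.h>     // ftruncate, close

    // Two views of the same physical pages: one writable, one executable.
    // No single mapping is ever writable and executable at the same time,
    // which satisfies a per-mapping W^X check (a stricter OS policy could
    // still refuse to map the same object both ways).
    int make_dual_mapping(size_t len, void** write_view, void** exec_view) {
        int fd = memfd_create("jit-code", 0);
        if (fd < 0) return -1;
        if (ftruncate(fd, len) != 0) { close(fd); return -1; }

        *write_view = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        *exec_view  = mmap(nullptr, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
        close(fd);  // the mappings keep the memory alive
        return (*write_view == MAP_FAILED || *exec_view == MAP_FAILED) ? -1 : 0;
    }
    // Writes through write_view become visible at exec_view, so a JIT can
    // patch code through the first pointer and run it through the second.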


Oh, for sure what I said wouldn't work on a W^X system. I was just pointing out that one of the most widely used JIT software uses WX pages.

What OSes prohibit that? Linux doesn't (well, I think it can with SELinux maybe?). OpenBSD might?


The question was about OSes with hardened runtime protections. The most basic of them all is W^X. All BSDs use it, and IIRC Linux is able to enforce it as well. I'd be surprised if it isn't the default in most distros, but I guess it's not impossible. I need to go for lunch so I won't check right now.


