> In C++, it's an rvalue reference, which can be effectively thought of as an lvalue
Hmm... this doesn't sound quite right? The comma operator's result in C++ is not an rvalue reference; it takes on exactly the value category of its right operand (which in this case is an lvalue).
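Here is a small sketch that checks this claim at compile time, assuming a C++17 compiler. `decltype` applied to a parenthesized expression reports its value category (`T&` for an lvalue, `T&&` for an xvalue, plain `T` for a prvalue), so the `static_assert`s show the comma expression inheriting whatever category its right operand has. `demo` and `assignThroughComma` are illustrative names, not from the original discussion.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace demo {
int a = 1, b = 2;

// Right operand is an lvalue, so the whole comma expression is an lvalue (int&).
static_assert(std::is_same_v<decltype((a, b)), int&>);
// Right operand is an xvalue, so the result is an xvalue (int&&).
static_assert(std::is_same_v<decltype((a, std::move(b))), int&&>);
// Right operand is a prvalue, so the result is a prvalue (int).
static_assert(std::is_same_v<decltype((a, 3)), int>);
}  // namespace demo

// Since (x += 1, y) is an lvalue referring to y, it can be assigned to.
// The left operand is still evaluated first (x is incremented).
int assignThroughComma(int x, int y) {
    (x += 1, y) = 42;
    return x + y;  // (original x + 1) + 42
}
```

In C, by contrast, the comma operator never yields an lvalue, so the same assignment would not compile there.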
So, are these hand-optimized primitives for a specific model of NVIDIA GPU? Do you still have to make launch/scheduling decisions to maximize occupancy? How does this approach scale to other target devices with specialized instruction sets and different architectures?
Some graphics APIs support commands that tell the GPU to record a timestamp when it reaches that command. This is oversimplified, but it’s essentially what you ask the GPU to do. There are lots of hardware gotchas that make this more difficult in practice, as a GPU won’t always execute and complete work exactly as you specify at the API level when it’s safe not to. And the GPU’s timestamp domain isn’t always the same as the CPU’s.
But in principle it’s not that different from grabbing timestamps on the CPU. In Vulkan, the API for this is called “timestamp queries”.
It’s quite tricky on tile-based renderers like Arm/Qualcomm/Apple, as they can’t provide meaningful timestamps at much finer granularity than a whole render pass. I believe Metal only allows you to query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level, anyway).
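For the Vulkan case, here is a minimal sketch of the arithmetic that follows a timestamp query. It assumes you have already recorded `vkCmdWriteTimestamp` at two points and read the raw 64-bit values back with `vkGetQueryPoolResults`. The tick count is masked to `VkQueueFamilyProperties::timestampValidBits` and converted to nanoseconds with `VkPhysicalDeviceLimits::timestampPeriod`. The helpers `validMask` and `elapsedNs` are hypothetical and make no Vulkan calls. The result is a duration in the GPU’s own time domain; relating it to CPU time needs something like the `VK_EXT_calibrated_timestamps` extension.

```cpp
#include <cassert>
#include <cstdint>

// Mask covering the low `validBits` bits of a timestamp.
// validBits comes from VkQueueFamilyProperties::timestampValidBits (1..64);
// a value of 0 means the queue family does not support timestamps.
constexpr std::uint64_t validMask(std::uint32_t validBits) {
    return validBits >= 64 ? ~std::uint64_t{0}
                           : (std::uint64_t{1} << validBits) - 1;
}

// Elapsed GPU time in nanoseconds between two raw timestamp values.
// timestampPeriod is VkPhysicalDeviceLimits::timestampPeriod (ns per tick).
double elapsedNs(std::uint64_t startTicks, std::uint64_t endTicks,
                 std::uint32_t validBits, float timestampPeriod) {
    // Unsigned subtraction followed by masking tolerates one wraparound
    // of a counter narrower than 64 bits.
    const std::uint64_t ticks = (endTicks - startTicks) & validMask(validBits);
    return static_cast<double>(ticks) * static_cast<double>(timestampPeriod);
}
```

For example, with an 8-bit counter that wrapped from 250 to 4, `elapsedNs(250, 4, 8, 1.0f)` gives 10 ticks' worth of time rather than a huge bogus value.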
I don't know about Tracy, but I've seen a couple of WebGPU JS debugging tools that simply intercept calls to the various WebGPU functions (writeBuffer, draw, etc.) by modifying the prototypes of GPUDevice, GPUQueue and so on[0].
Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old-school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the Semantic Web, it's sort of along those lines. Faiss can be used to build a k-NN graph, but the edges between nodes in that graph are based on (similarity) distance.
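To make the contrast concrete, here is a minimal brute-force sketch of an exact k-NN search (the computation Faiss's `IndexFlatL2` performs, without its optimizations) and of a k-NN graph built from it. The names `knn` and `knnGraph` are hypothetical and are not Faiss's API. An edge here only means "these vectors are close", whereas a GraphRAG-style knowledge graph edge carries an extracted relationship between entities.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance; ranking by squared L2 gives the same order as L2.
float sqL2(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Exact k-NN: indices of the k vectors in `data` closest to `q`, nearest first.
// `skip` excludes one index (used to leave a point out of its own neighbor list).
std::vector<std::size_t> knn(const Vec& q, const std::vector<Vec>& data,
                             std::size_t k, std::size_t skip = kNone) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (i != skip) idx.push_back(i);
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) {
                          return sqL2(q, data[a]) < sqL2(q, data[b]);
                      });
    idx.resize(k);
    return idx;
}

// k-NN graph: each point gets edges to its k nearest *other* points.
// The edges encode nothing but vector distance.
std::vector<std::vector<std::size_t>> knnGraph(const std::vector<Vec>& data,
                                               std::size_t k) {
    std::vector<std::vector<std::size_t>> adj(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        adj[i] = knn(data[i], data, k, i);
    return adj;
}
```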
In a similar fashion, JAX's frontend code is open source, while device-related code is distributed as binaries. For example, on Google's TPUs you'll see libtpu.so, and on macOS you'll see pjrt_plugin_metal_1.x.dylib.
The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If they were open-sourced, they might reveal hints about proprietary algorithms and provide clues about various hardware components that could potentially be exploited.
The same as with any other JIT runtime: you do your transformations first, and then you make the `mprotect` call that turns write permission off and execute permission on. The only caveats I can think of (having `pledge`d not to use `mprotect`, or having marked most of the address space with `mimmutable`) apply to all other JITs too. The gist is that you operate on a copy of the code, and that copy stays in a writable, non-executable page until it's ready to run, so you never violate the W^X rule.
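Here is a minimal POSIX sketch of that sequence, assuming a system whose `mmap` supports `MAP_ANONYMOUS`, such as Linux. The hypothetical helper `emitWxorX` maps a fresh page read/write (never executable), copies the generated bytes in, then uses `mprotect` to switch it to read/execute, so the page is never W and X at once. `__builtin___clear_cache` is a GCC/Clang builtin that syncs the instruction cache on architectures that need it. On macOS with the hardened runtime, JITs instead use `MAP_JIT` and `pthread_jit_write_protect_np`, which this sketch does not cover.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: place `len` bytes of generated code in a page that is
// writable while being filled and executable only afterwards (W^X).
// Returns nullptr on failure.
void* emitWxorX(const std::uint8_t* code, std::size_t len) {
    const long pageSize = sysconf(_SC_PAGESIZE);
    if (pageSize <= 0 || len == 0 || len > static_cast<std::size_t>(pageSize))
        return nullptr;
    const std::size_t page = static_cast<std::size_t>(pageSize);

    // Step 1: a read/write, non-executable page holds the copy of the code.
    void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    std::memcpy(mem, code, len);

    // Step 2: drop write permission and add execute permission in one call.
    if (mprotect(mem, page, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, page);
        return nullptr;
    }

    // Make the new instructions visible to instruction fetch (no-op on x86).
    char* begin = static_cast<char*>(mem);
    __builtin___clear_cache(begin, begin + len);
    return mem;
}
```

On x86-64, passing the bytes `B8 2A 00 00 00 C3` (`mov eax, 42; ret`) and casting the returned pointer to `int (*)()` yields a function that returns 42.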
Does that even work on W^X platforms? The context for my response assumes W^X; we can't simply throw that out the window, right?
I think I read somewhere about making two mappings to the same physical page (one W, one X). Are you referring to that?
(I'd still want to know how that works, as it kinda defeats the protection; the OS should prohibit that, right?)
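The two-mapping technique can be sketched on Linux with `memfd_create` (glibc 2.27 or later assumed): the same memory is mapped once read/write and once read/execute. No single mapping is ever W+X, so a check that only rejects W+X on an individual mapping does not object. However, anyone who learns the address of the RW view can still modify code that will run, which is why it does weaken the protection. Whether a given OS or security policy allows it is system-dependent. `DualMapping`, `makeDualMapping`, `publishCode` and `releaseDualMapping` are hypothetical names.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Two views of the same memory: one writable, one executable.
struct DualMapping {
    void* rw = nullptr;   // PROT_READ | PROT_WRITE, never executable
    void* rx = nullptr;   // PROT_READ | PROT_EXEC, never writable
    std::size_t size = 0;
};

// Linux-only: back both views with an anonymous memfd and map it twice.
bool makeDualMapping(DualMapping& out, std::size_t size) {
    const int fd = memfd_create("jit-code", MFD_CLOEXEC);
    if (fd < 0) return false;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return false;
    }
    void* rw = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void* rx = mmap(nullptr, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);  // the mappings keep the memory alive after the fd is closed
    if (rw == MAP_FAILED || rx == MAP_FAILED) {
        if (rw != MAP_FAILED) munmap(rw, size);
        if (rx != MAP_FAILED) munmap(rx, size);
        return false;
    }
    out = DualMapping{rw, rx, size};
    return true;
}

// Write code through the RW view; it becomes visible through the RX view.
bool publishCode(DualMapping& m, const void* code, std::size_t len) {
    if (len > m.size) return false;
    std::memcpy(m.rw, code, len);
    char* begin = static_cast<char*>(m.rx);
    __builtin___clear_cache(begin, begin + len);  // no-op on x86
    return true;
}

void releaseDualMapping(DualMapping& m) {
    if (m.rw) munmap(m.rw, m.size);
    if (m.rx) munmap(m.rx, m.size);
    m = DualMapping{};
}
```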
The question was about OSes with hardened runtime protections. The most basic of them all is W^X. All BSDs use it, and IIRC Linux is able to enforce it as well. I'd be surprised if it isn't the default in most distros, but I guess it's not impossible. I need to go for lunch, so I won't check right now.
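To check a particular system, a minimal probe is to ask for a single anonymous mapping that is writable and executable at once and see whether the kernel grants it. This is a sketch with a hypothetical `wxMappingAllowed` helper. It only exercises `mmap`, not `mprotect` upgrades, and the result depends on the OS and on any policy applied to the process (for example SELinux's `execmem` permission or systemd's `MemoryDenyWriteExecute=`).

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical probe: returns true if this process may create a single
// mapping that is both writable and executable. On refusal, the errno from
// mmap is stored in *errOut (if non-null).
bool wxMappingAllowed(int* errOut = nullptr) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, page, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        if (errOut) *errOut = errno;
        return false;
    }
    munmap(p, page);  // only probing; release the mapping immediately
    return true;
}
```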