Using JAX, you should be able to get good performance out of the box.

sakex, 11 months ago, on: Fast LLM Inference From Scratch (using CUDA)
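To make the claim concrete, here is a minimal sketch of what "out of the box" could look like in JAX: a plain single-head attention function wrapped in `jax.jit`, so XLA compiles and fuses the operations without any hand-written kernel. The function name, shapes, and random inputs are illustrative assumptions, not something taken from the comment or the linked article, and no particular speedup is implied.

```python
import jax
import jax.numpy as jnp


def attention(q, k, v):
    """Single-head scaled dot-product attention (illustrative, unbatched)."""
    # q: (n_queries, d), k: (n_keys, d), v: (n_keys, d)
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v


# jax.jit traces the function once per input shape/dtype and hands the
# result to XLA, which compiles it for the available backend (CPU/GPU/TPU).
attention_jit = jax.jit(attention)

# Hypothetical small inputs, just to exercise the function.
key = jax.random.PRNGKey(0)
kq, kk, kv = jax.random.split(key, 3)
q = jax.random.normal(kq, (4, 8))
k = jax.random.normal(kk, (16, 8))
v = jax.random.normal(kv, (16, 8))

out = attention_jit(q, k, v)
```

The first call pays the compilation cost; later calls with the same shapes reuse the compiled executable, which is where any benefit over eager execution would show up.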