
imtringued on March 8, 2024, on: New Bounds for Matrix Multiplication: From Alpha t...

The bottleneck in those is not the arithmetic operation but the memory bandwidth once you have to spill your matrix out of SRAM. As it stands right now, it is actually better to have a slower algorithm that uses the local memory more efficiently.
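
To put rough numbers on the local-memory point, here is a minimal sketch of a blocked (tiled) matrix multiply that counts how many elements it would have to fetch from slow memory. It is a counting model, not a measurement: the function name `blocked_matmul`, the `tile` parameter standing in for what fits in SRAM, and the rule that each tile-level multiply fetches one tile of each input are all assumptions made for illustration.

```python
def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    Returns (c, loads), where loads is the number of input elements
    fetched from "slow" memory under a simple model: each tile-level
    multiply loads one tile x tile block of a and one of b.
    """
    assert n % tile == 0, "sketch assumes tile divides n"
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Modelled fetch of one block of a and one block of b.
                loads += 2 * tile * tile
                # Everything below reuses those blocks from local memory.
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c, loads


if __name__ == "__main__":
    n = 8
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    for tile in (1, 4, 8):
        _, loads = blocked_matmul(a, b, n, tile)
        print(tile, loads)
```

Under this model the multiply-add count is the same n^3 for every tile size, while the modelled loads fall as 2 * n^3 / tile: for n = 8 they are 1024, 256, and 128 at tile sizes 1, 4, and 8. The sketch only shows the data-reuse side of the argument; it says nothing about how any particular fast matrix multiplication algorithm behaves.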