Hacker News

> All the neural accelerators in the world won't make it competitive in speed with discrete GPUs that all have way more bandwidth.

That’s true for the on-GPU memory, but I think there is some subtlety here. MoE models have narrowed the gap considerably in my opinion: not all experts need to fit into GPU memory, because with a fast enough bus you can stream them into place when necessary.
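A rough sketch of that streaming argument, with illustrative numbers I'm assuming (a ~2 GB expert, a PCIe 5.0 x16 link at ~64 GB/s, and a ~273 GB/s LPDDR5X unified-memory path; none of these figures come from the thread):

```python
# Back-of-envelope: can a cold MoE expert be streamed into GPU memory
# fast enough to hide behind token generation? All numbers are
# illustrative assumptions, not measured values.

def stream_time_s(expert_bytes: float, bus_gbps: float) -> float:
    """Seconds to copy one expert over a link of the given bandwidth."""
    return expert_bytes / (bus_gbps * 1e9)

EXPERT_BYTES = 2e9   # assumed: one ~2 GB expert (e.g. 8-bit weights)
PCIE5_X16 = 64.0     # assumed practical ceiling for PCIe 5.0 x16, GB/s
LPDDR5X = 273.0      # assumed unified-memory bandwidth, GB/s

print(f"over PCIe 5.0 x16: {stream_time_s(EXPERT_BYTES, PCIE5_X16) * 1e3:.1f} ms")
print(f"over LPDDR5X path: {stream_time_s(EXPERT_BYTES, LPDDR5X) * 1e3:.1f} ms")
```

The point is only that the copy cost scales inversely with bus bandwidth, so a fast unified-memory bus makes expert streaming plausible where a narrow PCIe link does not.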

But the key difference is the type of memory. While NVIDIA datacenter GPUs have shipped with HBM for a while now (gaming GPUs use GDDR), the DGX Spark and the M4 use LPDDR5X, which is the main source of their memory bottleneck. Unified-memory chips with HBM are definitely possible (GH200, GB200); they are just less power efficient at low/idle load.
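To see why the memory type dominates: token-by-token decode is roughly bound by how fast the full set of active weights can be read from memory once per token. A sketch with assumed round numbers (a 70B-parameter model at 8-bit, ~273 GB/s for LPDDR5X, ~4,800 GB/s for an HBM3e part; all figures are my assumptions for illustration):

```python
# Illustrative ceiling on decode speed if every weight byte must be
# read from memory once per generated token. Numbers are assumed.

def tokens_per_s(weight_bytes: float, mem_gbps: float) -> float:
    """Bandwidth-bound upper limit on tokens per second."""
    return (mem_gbps * 1e9) / weight_bytes

WEIGHT_BYTES = 70e9  # assumed: 70B params, 1 byte each (8-bit quant)

for name, bw in [("LPDDR5X (~273 GB/s)", 273.0),
                 ("HBM3e (~4800 GB/s)", 4800.0)]:
    print(f"{name}: ~{tokens_per_s(WEIGHT_BYTES, bw):.0f} tok/s ceiling")
```

The order-of-magnitude gap between the two ceilings is the bottleneck being described, independent of any compute accelerators on the chip.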

A Grace aside: the GH200 actually pairs both HBM3e (on the Hopper GPU) and LPDDR5X (on the Grace CPU) for exactly that reason (different load characteristics).

The moat of the memory makers is just so underrated…



