I’m also not quite an expert, but have benchmarked an M1 and various GPUs.
The M* chips have unified memory and (especially the Pro/Max/Ultra variants) very high memory bandwidth even compared to discrete GPUs like a 1080 (an M1 Ultra's memory bandwidth sits between a 2080's and a 3090's).
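For reference, here are the rough peak-bandwidth figures from public spec sheets that I had in mind (treat the exact numbers as approximate; the ordering is the point):

    # Rough published peak memory bandwidth (GB/s); approximate,
    # but the relative ordering is what matters here.
    bandwidth_gb_s = {
        "GTX 1080": 320,   # GDDR5X, 256-bit bus
        "RTX 2080": 448,   # GDDR6, 256-bit bus
        "M1 Ultra": 800,   # unified LPDDR5
        "RTX 3090": 936,   # GDDR6X, 384-bit bus
    }
    for chip, bw in sorted(bandwidth_gb_s.items(), key=lambda kv: kv[1]):
        print(f"{chip}: {bw} GB/s")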
At small batch sizes (including 1, as in most local use), inference is bottlenecked by memory bandwidth, not compute. This is why people say the M* chips are good for ML.
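To make that concrete, here's the usual back-of-envelope (my own sketch, with an illustrative 7B fp16 model): generating one token has to stream essentially every weight through memory once, so bandwidth divided by model size caps tokens/second no matter how much compute you have.

    def max_tokens_per_sec(params_billion, bytes_per_param, bw_gb_s):
        """Upper bound on single-stream decode speed: each token reads
        all weights from memory once, so bandwidth / model size bounds
        tokens/second (ignoring KV cache and activation traffic)."""
        model_gb = params_billion * bytes_per_param
        return bw_gb_s / model_gb

    # Illustrative: 7B parameters in fp16 (~14 GB of weights)
    # against an M1 Ultra's ~800 GB/s of unified memory.
    print(max_tokens_per_sec(7, 2, 800))   # ~57 tokens/s ceiling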
However, H100s are used primarily for training (at enormous batch sizes) and need lots of interconnect to train large models. At that scale, arithmetic intensity is very high, and the M* chips aren't competitive (even if they could be networked); they sit at a different point on the power/performance Pareto curve than H100s, which guzzle power.
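For intuition on why batch size flips the bottleneck, here's a sketch of the standard roofline arithmetic (the H100 numbers are rough public specs, and the helper is just illustrative): each fp16 weight read from memory is 2 bytes and enables 2 FLOPs per batch row, so intensity grows linearly with batch, and the chip only becomes compute-bound past the "ridge point" of peak FLOPs over peak bandwidth.

    def arithmetic_intensity(batch, bytes_per_weight=2):
        """FLOPs per byte of weights streamed in a batched matmul:
        each weight is read once and used in 2*batch FLOPs
        (one multiply-add per batch row)."""
        return 2 * batch / bytes_per_weight

    # Ridge point where an H100 flips from bandwidth-bound to
    # compute-bound (rough specs: ~989 TFLOPs dense bf16, ~3.35 TB/s HBM3).
    peak_flops, peak_bw = 989e12, 3.35e12
    print(peak_flops / peak_bw)        # ~295 FLOPs/byte needed
    print(arithmetic_intensity(1))     # batch 1: ~1 FLOP/byte, badly memory-bound
    print(arithmetic_intensity(512))   # large batch: ~512 FLOPs/byte, compute-bound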