At scale, algorithms are commonly limited by memory bandwidth, not by concurrency. Most code can be engineered with enough cheap concurrency to saturate memory bandwidth efficiently.
This explains why massively parallel HPC codes are mostly mutable-state designs despite seemingly poor theoretical properties for parallelism. Real-world performance and scalability are dictated by minimizing memory copies and maximizing cache disjointness: mutating data in place avoids the copies that immutable designs require, and partitioning work so that each core touches its own cache lines avoids coherence traffic.