Getting maximum performance out of SIMD requires rolling your own code with intrinsics. At a fairly fundamental level, it's something a compiler can't do for you.
Most interesting performance optimizations from vector ISAs can't be done by the compiler.
Interesting, how so? I've had really good success with the autovectorization in gcc and the Intel C compiler. Often it's faster than my own intrinsics, though not always. One notable exception is that the compiler seems to struggle with reductions: when I'm updating large arrays, e.g. `A[i] += a`, it fails to use SIMD and I need to do it myself.
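For concreteness, here's roughly what "doing it myself" looks like for a sum reduction. This is a minimal sketch using SSE intrinsics, assuming an x86 target and, for brevity, that `n` is a multiple of 4; the function name is mine, not from the thread:

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86) */
#include <stddef.h>

/* Hand-rolled SIMD sum reduction: keep four independent partial sums
 * in one register, then combine the lanes at the end. Assumes n is a
 * multiple of 4 to keep the sketch short. */
float simd_sum(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    /* horizontal add of the four lanes */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note the four partial sums change the order of floating-point additions relative to a scalar loop, which is exactly why compilers won't do this without a flag like `-ffast-math` or an explicit `#pragma omp simd reduction(+:sum)`.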
To clarify, there are many things SIMD is used for that look nothing like the loop parallelism or numerics commonly discussed. For example, heterogeneous concurrency is likely to be beyond compilers for the foreseeable future, and it is a great SIMD optimization.
A common example is executing the equivalent of a runtime SQL WHERE clause on arbitrary data structures of mixed types. Clever idioms allow surprisingly complex unrelated constraint operators to be evaluated in parallel with SIMD. It would be cool if a compiler could take a large pile of fussy, branchy scalar code that evaluates ad hoc constraints on data structures and convert it to an equivalent SIMD constraint engine, but that doesn't seem likely anytime soon. So we roll them by hand.
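To illustrate the flavor of this (a toy sketch, not the engine described above, and assuming x86 with SSE2): a predicate like `lo < x AND x < hi` can be evaluated branch-free over four keys at once, with each comparison producing a lane mask and the AND of the masks selecting matching rows:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86) */
#include <stdint.h>

/* Branch-free evaluation of "lo < x AND x < hi" over four int32 keys
 * at once. Returns a 4-bit mask: bit i is set iff keys[i] matches.
 * A constraint engine would combine many such masks for arbitrary
 * conjunctions/disjunctions without ever branching per row. */
int where_mask(const int32_t *keys, int32_t lo, int32_t hi) {
    __m128i x   = _mm_loadu_si128((const __m128i *)keys);
    __m128i gt  = _mm_cmpgt_epi32(x, _mm_set1_epi32(lo));  /* x > lo */
    __m128i lt  = _mm_cmplt_epi32(x, _mm_set1_epi32(hi));  /* x < hi */
    __m128i hit = _mm_and_si128(gt, lt);                   /* AND of predicates */
    return _mm_movemask_ps(_mm_castsi128_ps(hit));         /* sign bits -> bitmask */
}
```

The branchy scalar version of this is an if/else per row per operator; the SIMD version pays a fixed cost per batch regardless of how the data is distributed, which is where the win comes from.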