Getting maximum performance out of SIMD requires rolling your own code with intrinsics. At a fairly fundamental level, it's something a compiler can't do for you.
Most interesting performance optimizations from vector ISAs can't be done by the compiler.
Interesting, how so? I've had really good success with the autovectorization in gcc and the Intel C compiler. Often it's faster than my own intrinsics, though not always. One notable exception is that the compiler seems to struggle with reductions: when I'm updating large arrays, e.g. `A[i] += a`, it fails to use SIMD and I need to do it myself.
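For concreteness, here's roughly what "doing it myself" looks like for a sum reduction. This is a minimal sketch using SSE intrinsics, assuming an x86 target and, for brevity, that `n` is a multiple of 4; the function name is mine, not from the thread:

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86) */
#include <stddef.h>

/* Hand-rolled SIMD sum reduction: keep four independent partial sums
 * in one register, then combine the lanes at the end. Assumes n is a
 * multiple of 4 to keep the sketch short. */
float simd_sum(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    /* horizontal add of the four lanes */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note the four partial sums change the order of floating-point additions relative to a scalar loop, which is exactly why compilers won't do this without a flag like `-ffast-math` or an explicit `#pragma omp simd reduction(+:sum)`.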
To clarify, there are many things SIMD is used for that look nothing like the loop parallelism or numerics commonly discussed. For example, heterogeneous concurrency is likely to be beyond compilers for the foreseeable future, and it is a great SIMD optimization.
A common example is executing the equivalent of a runtime SQL WHERE clause on arbitrary data structures of mixed types. Clever idioms allow surprisingly complex unrelated constraint operators to be evaluated in parallel with SIMD. It would be cool if a compiler could take a large pile of fussy, branchy scalar code that evaluates ad hoc constraints on data structures and convert it to an equivalent SIMD constraint engine, but that doesn't seem likely anytime soon. So we roll them by hand.
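To illustrate the flavor of this (a toy sketch, not the engine described above, and assuming x86 with SSE2): a predicate like `lo < x AND x < hi` can be evaluated branch-free over four keys at once, with each comparison producing a lane mask and the AND of the masks selecting matching rows:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86) */
#include <stdint.h>

/* Branch-free evaluation of "lo < x AND x < hi" over four int32 keys
 * at once. Returns a 4-bit mask: bit i is set iff keys[i] matches.
 * A constraint engine would combine many such masks for arbitrary
 * conjunctions/disjunctions without ever branching per row. */
int where_mask(const int32_t *keys, int32_t lo, int32_t hi) {
    __m128i x   = _mm_loadu_si128((const __m128i *)keys);
    __m128i gt  = _mm_cmpgt_epi32(x, _mm_set1_epi32(lo));  /* x > lo */
    __m128i lt  = _mm_cmplt_epi32(x, _mm_set1_epi32(hi));  /* x < hi */
    __m128i hit = _mm_and_si128(gt, lt);                   /* AND of predicates */
    return _mm_movemask_ps(_mm_castsi128_ps(hit));         /* sign bits -> bitmask */
}
```

The branchy scalar version of this is an if/else per row per operator; the SIMD version pays a fixed cost per batch regardless of how the data is distributed, which is where the win comes from.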