TBB and CUDA seem like odd choices to me for an example case. They are built much more heavily around a vectorized / SIMD style applied to regular, more general-purpose operations. Calling them functional because of the in-place vs. not-in-place distinction is a stretch. They are very much bulk-parallel and procedural.