I once built a quick-and-dirty load testing tool for a public-facing service we built. The tool was pretty simple: something like https://github.com/bojand/ghz, but with traffic and data patterns closer to what we expected to see in the real world. We used argo-workflows to generate load at scale.

One thing we noticed was that the performance characteristics differed considerably depending on how we parallelized the load testing tool (multiple threads, multiple processes, multiple Kubernetes pods, or pods forced to be distributed across nodes).

I think that when you run non-distributed load tests, you benefit from a bunch of cool things that happen with HTTP/2 and Linux (multiplexing, resource sharing, etc.), which might make applications seem much faster than they would be in the real world.
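To make the connection-sharing point a bit more concrete, here is a rough sketch, not the tool described above. The Python standard library has no HTTP/2 client, so this uses HTTP/1.1 keep-alive as a stand-in for the same kind of effect: a single load generator that reuses its connection hits the server with far fewer TCP connections than independent clients would. The names `CountingHandler`, `send_requests`, and `measure` are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Toy server handler that counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # allow keep-alive so connections can be reused
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def send_requests(port, n, reuse):
    """Send n GET requests and return how many connections the server saw."""
    CountingHandler.connections = 0
    conn = None
    for _ in range(n):
        if conn is None or not reuse:
            if conn is not None:
                conn.close()
            # A fresh connection stands in for an independent, distributed client.
            conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("GET", "/")
        conn.getresponse().read()
    if conn is not None:
        conn.close()
    return CountingHandler.connections


def measure(n):
    """Return (connections with reuse, connections without reuse) for n requests."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return send_requests(port, n, reuse=True), send_requests(port, n, reuse=False)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    reused, fresh = measure(20)
    print(f"shared connection: {reused} TCP connection(s); separate clients: {fresh}")
```

On localhost the latency difference is tiny, so the sketch counts connections rather than timing anything. Across real networks, every extra connection also pays for its own TCP and TLS handshakes, which is part of what a single-machine test hides.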