I love this article! But I think this insight shouldn't be surprising. Distribution always has overheads, so if you can do things on a single machine it will almost always be faster.
I think a lot of engineers expect 100 computers to be faster than 1 simply because there's more hardware. But what we're really looking at is a process, and a process that has to serialize data and shuffle it between machines over the network will almost always do more work, and therefore be slower.
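A rough back-of-envelope sketch of why the shuffle hurts (the bandwidth figures below are illustrative assumptions, not measurements from the article):

```python
# Compare the time to move 100 GB over a typical datacenter link
# with the time to read the same data from local NVMe or RAM.
# All bandwidth numbers are assumed, ballpark values.
data_gb = 100
network_gb_per_s = 10 / 8   # 10 Gbit/s link ~= 1.25 GB/s
nvme_gb_per_s = 3           # assumed local NVMe read throughput
ram_gb_per_s = 20           # assumed in-memory scan throughput

print(f"network shuffle: {data_gb / network_gb_per_s:.0f} s")
print(f"local NVMe read: {data_gb / nvme_gb_per_s:.0f} s")
print(f"in-memory scan:  {data_gb / ram_gb_per_s:.0f} s")
```

The network move alone costs more than an entire local pass over the data, before you even count serialization and coordination.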
Where Spark/Daft are needed is when you have 1 TB of data or something crazy where a single machine isn't viable. If I'm honest though, I've seen a lot of occasions where someone thinks they're in that situation, and none so far where they actually are.
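And the single-machine ceiling is higher than people assume: an engine like DuckDB can stream a query over a large partitioned Parquet dataset without loading it all into memory. A minimal sketch, assuming a hypothetical `events/*.parquet` layout:

```python
# Aggregate over a large on-disk Parquet dataset on one machine.
# The file path and schema are hypothetical, for illustration only.
import duckdb

con = duckdb.connect()
rows = con.execute("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('events/*.parquet')  -- hypothetical dataset
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(rows)
```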