> Spark is significantly more efficient than Hadoop.
I don’t know about your specific workload, but I’ve seen quite a few Hadoop setups that ran at 100% load most of the time and were replaced by relatively simple non-Hadoop-based code that used 2% to 10% of the hardware and ran about as fast.
I didn’t spend much time evaluating the “before” setups, but at least one of those workloads spent 90% of that 100% load on [de]serialization.
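To give a concrete sense of what “relatively simple non-Hadoop-based code” can look like, here is a minimal single-machine word-count sketch in Rust. It is purely illustrative: the input path and the word-count task are assumptions, not the actual workloads I’m describing.

```rust
// Hypothetical sketch: a single-process word count over a local file,
// standing in for the kind of simple non-Hadoop code mentioned above.
// The input path "input.txt" is an assumption; any line-oriented text works.
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("input.txt")?);
    let mut counts: HashMap<String, u64> = HashMap::new();

    // The whole "map" and "reduce" fits in one loop: no cluster scheduling,
    // no shuffle phase, and no per-record [de]serialization between stages.
    for line in reader.lines() {
        for word in line?.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }

    // Print the ten most frequent words.
    let mut top: Vec<_> = counts.into_iter().collect();
    top.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, n) in top.into_iter().take(10) {
        println!("{word}\t{n}");
    }
    Ok(())
}
```

The point of the sketch is the design, not the task: everything stays in one process and in memory, so the [de]serialization and coordination costs that dominated the Hadoop setups simply don’t exist.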
It’s not my link; it’s by Frank McSherry, who is commenting in this thread. I hope he can chime in on why he chose this specific example, but it matches my experience very well.