It's actually an interesting question: you can get 192 cores and 12 TB of RAM in a single x86 box. At what point does it actually make sense to go for Hadoop?
It makes sense if you are handling millions of requests from all over the world per second and need failovers if machines go down.
But...if you just want to run your own personal search engine say...
Then for Wikipedia/Stackoverflow/Quora-sized datasets (roughly 50 GB, plus about 10 GB worth of every kind of index: regex, geo, full text, etc.), you can run real-time indexing on live updates, with all the advanced search options you see under their "advanced search" pages, on any random Dell or HP desktop with about 6-8 GB of RAM.
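To give a rough sense of what that takes (a minimal sketch only, not what Wikipedia or Stackoverflow actually run; it assumes Python's bundled sqlite3 was built with FTS5, and the table/column names are made up), live full-text indexing on a desktop is about this much code:

    # Sketch: live, incrementally updated full-text search on one desktop,
    # using SQLite's FTS5. Table/column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect("search.db")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)")

    def index_document(title, body):
        # "Live update": the document is searchable as soon as the insert commits.
        conn.execute("INSERT INTO docs(title, body) VALUES (?, ?)", (title, body))
        conn.commit()

    def search(query, limit=10):
        # BM25-ranked full-text search, entirely on local disk.
        return conn.execute(
            "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
            (query, limit),
        ).fetchall()

    index_document("Example", "an article about running a local search engine")
    print(search("local search"))

The regex/geo indexes are a separate exercise, but none of this needs more than a desktop's worth of RAM and disk.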
Lots of people do this on Wall Street. People don't get what is possible on a desktop because so much of it has moved to the cloud. It will come back to the desktop, imho.
It won't come back to desktops, because as the machines get cheaper, the cost of the people to maintain them increases.
There will always be people who need them and use them but that proportion is going to keep decreasing (I'm somewhat sad about this, but the math is hard to argue with).
Just temporary. Nature did not need to invent cloud computing to perform massive computation. The speed and data involved in the computation going on in a cell or an ant's brain show us how far we can still go on the desktop.
The breakthroughs will come.
In the meantime (unless you are dealing with video), most text and image datasets out there that the avg Joe needs can easily be stored and processed entirely locally, thanks to cheap terabyte drives and multicore chips these days. People just haven't realized that there isn't that much usable textual data, OR that local computing doesn't require all the overhead of handling millions of requests a second. This is a Google problem, not an avg-Joe problem, that is being solved with cloud compute.
No idea what cost that comment refers to. If you are Facebook or YouTube or Twitter or Amazon, sure, you need your five-thousand-strong PAID army of content reviewers to keep the data clean, because the data is mostly useless. But if you are running Wikipedia's search or Stackoverflow's search or the team at the Library of Congress, please go and take a look at the size of those teams, their budgets, and their growth rates.
Are you conflating "single server" and "desktop computer" here? It sounds a lot like you might be?
Because almost everyone already uses the model of co-locating the search index and query code on a single computer (both Wikipedia and Stackoverflow use Elasticsearch, which does this).
They use multiple physical servers because of the number of simultaneous requests they serve.
This has never been the use-case for Hadoop.
I've built Hadoop based infrastructure for redundantly storing multiple PB of unstructured data with Spark on top for analysis. This is completely different to search.
That's very different to the Wall St analyst running desktop analysis in Matlab, or the oil/gas exploration team doing the same thing.
That's a good point. It's really hard to give a clear cutoff because there is quite a bit that you can potentially do on a single machine now. But there are disadvantages to doing your processing on a single really big machine even when you can find a way to handle it. If you have inconsistent workloads, a single machine may be harder to schedule to be close to fully allocated, and if it goes down, then your workflow is dead, possibly until you take some sort of manual action, whereas a cluster of machines is already built to be resilient to individual node failures. It all depends on the use case, which makes it really tough to give hard and fast rules.
> a single machine may be harder to schedule to be close to fully allocated
That seems remarkably unlikely if there truly is only the one. Maybe I don't understand what you mean, though. How is that any different from having a single distributed cluster, instead?
> if it goes down
I hear (read) this quite a bit, especially recently, and I'm a bit mystified that it's even brought up in this day and age.
Firstly, having worked with (what is now) commodity hardware for over a decade, I believe that people who haven't done so grossly over-estimate how often "it goes down", especially today. This overestimation is trotted out as a reason that operating one's own hardware is such a "nightmare" and therefore one must always use cloud or a VPS.
With a "large" enough server, the risk goes up, of course. More DIMMs means more memory can fail, but we're still talking about low single digit percent for all errors (including correctable). IIRC, CPUs have even lower failure rates. Everything else tends to be redundant.
Even then, your workflow isn't dead. It's just missing a DIMM or a CPU, possibly after a reboot (which won't even be necessary, if you configured it right).
In many cases, downtime isn't actually caused by hardware but by software (or humans). That's not going to be unique to centralized processing versus distributed.
Also, if the single machine is on a cloud provider, many hardware and utilization issues can be abstracted away anyway, for a huge price premium.
Just replying to the top bit, since we're exchanging multiple comment chains. Hadoop, Spark, etc. are built on top of a job scheduler, so if you have multiple competing jobs, it will allocate tasks to resources in a fairly reasonable way. If there is too much contention at some point in time, your jobs will all still run, and portions that can't be scheduled immediately will be delayed while they wait for free nodes. The processing model also generally leads to independence between jobs. You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution. And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.
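To make that concrete, here's roughly what handing scheduling over to YARN looks like from the Spark side (a sketch under assumptions: pyspark is installed, a YARN cluster is reachable, and the queue name "analytics" is made up; the dynamic-allocation settings depend on your Spark version):

    # Sketch: let YARN arbitrate resources between competing Spark jobs.
    # If the "analytics" queue is saturated, YARN queues this job's containers
    # and runs them as capacity frees up, rather than failing the job.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("nightly-aggregation")           # made-up job name
        .master("yarn")
        .config("spark.yarn.queue", "analytics")  # hypothetical queue
        # Optionally hand idle executors back so other jobs can use them.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )

    spark.range(10**8).selectExpr("sum(id) AS total").show()
    spark.stop()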
edit: if you're running on a cloud, you also may be able to autoscale to deal with spiky usage patterns.
> You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution.
I'd venture to say that "for free" is a bit of a stretch, other than under the "already paid for" definition of "free". Being built on top of a job scheduler means one always has to exert the effort/overhead of dealing with a scheduler, even for the simplest case.
On a single machine, you can decide if you want to add the complexity of something like LSF or not. (I would call a custom single-node solution a strawman in the face of pre-Hadoop schedulers. Batch processing isn't exactly new).
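For the simplest case, that's barely more than a bounded worker pool. A sketch of the level of machinery I have in mind (the job function is a placeholder):

    # Sketch: minimal single-node "batch scheduler". One worker per core keeps
    # the machine busy while capping resource use, so a burst of submitted jobs
    # queues up instead of exhausting memory/CPU and breaking random jobs.
    from concurrent.futures import ProcessPoolExecutor, as_completed
    import os

    def crunch(job_id):
        # Placeholder for a real batch job (parse logs, rebuild an index, ...).
        return job_id, sum(i * i for i in range(1_000_000))

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            futures = [pool.submit(crunch, j) for j in range(20)]
            for fut in as_completed(futures):
                job_id, result = fut.result()
                print(f"job {job_id} finished: {result}")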
> And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.
I'm pretty sure I'm still missing something. Since my goal is to gain a better understanding, I'll put forward where I believe the disconnect is (but definitely, anyone, correct me if my assumptions are off):
You're coming from a distributed model, where the assumption is that "resources" are divisible and allocatable in such a way that, for any given job, there's an effective maximum number of nodes it can use (e.g. network-i/o-bound job that would waste any additional CPUs). Any leftover nodes need other jobs allocated to them to reach full utilization.
I, however, am considering that the single-server model assumes there won't be any obvious bottlenecks or limiting resources, instead preferring to run a single job at a time, as fast as possible. This would completely obviate the need for any complex scheduling and avoid accidental resource exhaustion (which, I suppose, is really only disk or memory).
hah, you're right, I definitely am looking at it with distributed-system goggles on. With just one machine, yes, it will probably make sense to fully process one job before moving on to another (although there are some cases where this may not be true, such as if you're limited by having to wait for external APIs. But this isn't a strong argument, because calling external APIs probably isn't something you want to be doing in Hadoop anyway).
If your workload scales, though, you will eventually have to spread out onto multiple machines anyway. At that point, even if you process each job on a single machine, you either need a system to coordinate job allocation across machines, or you have to accept some slack if you only schedule jobs on individual machines. In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them, but my point is just that the more steps we take towards the complexity of a distributed computing model, the more it starts to seem reasonable to just say "okay, let's just use hadoop/spark/etc. and get the extra benefits they offer as well (such as smooth scaling without having to think about hardware), even if it's more expensive than a setup with individual nodes optimized for our purpose." It's now really easy to spin up a cluster, and with small scale, costs are not that big a deal. With large scale, your system is distributed in some way anyway.
I don't think it's an obvious decision, and you're absolutely right that people are usually too quick to jump to distributed systems when they really don't need them. But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary. I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, that's like a pain."
Sorry.. I forgot to respond to the last part of your message:
> I don't think it's an obvious decision
Certainly, which is why I even bother with discussions like this, in the hope of making the decision clearer (if not obvious) to me in the future.
A response to the comment I footnoted in my other comment pointed out how oversimplified my understanding of distributed databases was. Well, I knew it was an oversimplification, just not in which way.
There's plenty of computer science research from the 70s and 80s covering these topics, but it's both tough to translate into practical considerations and woefully out of date (e.g. it doesn't account for SSDs or cheap commodity hardware).
> But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary.
Well, philosophically, I would disagree with such an assertion on the grounds of premature optimization, absent the "strictly".
I would advocate for switching from scaling "up" (aka "vertically", larger single machines) to scaling "out" (aka "horizontally" or distributed) around the point of cost parity, not at the point it is no longer possible to scale up a single machine (unless that point can reasonably be expected to occur first, I suppose).
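A toy illustration of what I mean by cost parity (every number below is made up; substitute real quotes and measured throughput):

    # Toy cost-parity check (all numbers hypothetical). The question: does a
    # cluster matching the throughput of the next scale-up step cost more or
    # less than that bigger box, once distributed overhead is counted?
    single_box_cost = 38_000        # price of the bigger single server
    single_box_throughput = 1.0     # normalize its throughput to 1x

    node_cost = 4_500               # price of one mid-range cluster node
    node_throughput = 0.12          # per-node throughput after coordination overhead

    nodes_needed = single_box_throughput / node_throughput
    cluster_cost = nodes_needed * node_cost

    print(f"cluster: ~{nodes_needed:.1f} nodes, ~${cluster_cost:,.0f} "
          f"vs. single box ${single_box_cost:,}")
    print("scale out" if cluster_cost <= single_box_cost else "keep scaling up")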
> I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, that's like a pain."
That would account for any overestimation of how difficult it is to work with hardware or how complex Hadoop is to set up, administer, or use. Those are just initial conditions, and they may well unduly influence decision making that has far longer-lasting consequences.
However, I'd like to think I'm not often guilty of the latter overestimation when discussing solutions (and even when advocating single-server), as I tend to assume that it can at least become easy enough for anyone out there, so long as the technology is popular enough (like Hadoop) or traditional/mature enough (like the tools in the original comment, or PBS) that plenty of documentation and/or experts exist.
My background also includes having seen, first-hand, over decades, various attempts at distributed processing and databases in practice, with varying degrees of success. This has included early "universal" filesystems like AFS, "sharding" MySQL to give it "web scale" performance [1], Gluster and its ilk, some NoSQLs, and of course Hadoop.
If anything, I'd say that, with most popular, new technologies, especially ones predicated on performance or scale, "it's a pain" is not the knee-jerk skeptical reaction my experience has ingrained in me. Rather, it's more like "sure, it's easy now, but you'll pay." TANSTAAFL.
[1] This worked well enough but did have a high up-front engineering cost and a high on-going operating cost for the large number of medium-small servers plus larger than otherwise needed app servers to do DB operations that could no longer be done inside the database because each one had incomplete data. Due to effort overestimation, it was unthinkable to move from a VPS to a colo so as to get a medium-large single DB server with enough attached storage to break the "web scale" I/O bottleneck for years to come.
> calling external APIs probably isn't something you want to be doing in Hadoop anyway
And yet I've seen it done. At least they weren't truly external, just external to Hadoop.
> In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them
There's those goggles again :) Batch schedulers have existed for decades (e.g. PBS started in '91).
> (such as smooth scaling without having to think about hardware),
This neatly embodies what I believe is the primary fallacy in most of the decision making, including the fact that it's often parenthetical (i.e. a throwaway assumption).
Does anyone really value the "smoothness" of scaling? I'd expect the important feature to be, instead, that the slope of the curve doesn't decrease too fast.
The notion that Hadoop somehow frees one from having to think about hardware flies in the face of hardware sizing advice from Cloudera, Hortonworks, and others that discusses sizing node resources based on workload expectations (mostly I/O and RAM) and heterogeneous clusters. It does, however, explain my observation, in the wild, of clusters built out of nodes that seem undersized in all respects for any workload.
>It's now really easy to spin up a cluster,
It's really easy if it's already there? That borders on the tautological. Or are you talking about an additional cluster in an organization where there already is one?
>and with small scale, costs are not that big a deal.
That's just too broad a generalization, just as "cost is a big deal" would be. Cost is always a factor, just not always the biggest one.
Small scale is often (though not always) associated with limited funds. Doubling, or even tacking on 50% to, the cost could be catastrophic to a seed startup or a scientific researcher.
>With large scale, your system is distributed in some way anyway.
This strikes me as little more than a version of the slippery slope fallacy. Even some web or app servers behind a load balancer could be considered distributed [1], but that doesn't make them a good candidate for anything that's actually considered a distributed framework.
It also hand-waves away the problem that, even if costs weren't a big deal at small scale, they don't somehow magically become less of an issue at large scale. Paying a 50% "tax" on $40k is one thing. At $400k, you could have hired someone to think about hardware for a year, instead.
[1] I just recently pointed out, slightly tongue-in-cheek, an architectural similarity, one layer down, between app server and database: https://news.ycombinator.com/item?id=17521817
Your comment is actually pretty funny, as the entire point of Hadoop was to be able to use commodity, cheap, off-the-shelf PC hardware as opposed to the exotic specifications you mention there.
Except that of course nowadays such hardware is just a couple of clicks away in AWS.
Today's "exotic" (which is actually just high-end commodity) is tomorrow's middling.
I'm not sure it's fair to summarize any one thing as "the entire point" of Hadoop, but, as I recall, it was, originally, an open source implementation of Google's Map-Reduce paper. Put another way, it was a way to bring Google's compute strategy to the masses.
That said, the notion that there is "commodity, cheap, off-the-shelf PC hardware" on one side and "exotic specifications" on the other is a false dichotomy, especially in the face of what, for example, Google actually does.
Google goes cheap. Very cheap. It's custom and exotic, just optimized for cost: not absolute cost per node, but the ratio of cost to performance.
That last part is what's missing from every single Hadoop installation I've personally seen (or that anyone I know has personally seen): the maximization of performance for cost. Instead, there's an inexplicable desire to increase node count by using cheaper nodes, no matter the performance.
> Except that of course nowadays such hardware is just a couple of clicks away in AWS.
I'm a bit unclear what the "except" means here. I don't believe AWS has the truly high-end specs available (and never has, historically, so we can reasonably assume it never will). It's also very not-cheap.
The point of Hadoop might have been that, but it never actually delivered any real value to most users. It's an abysmal failure from a computing-efficiency point of view; here's an example: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
I've been using Spark for many years going back to 1.0.
It is the foundation technology of almost every data science team around the world. And your misguided post (which for some weird reason only focuses on graph algorithms) doesn't change that. And not sure why you think it's inefficient. We run 30 node, autoscaling clusters which stay close to 100% for most of the time.
> We run 30 node, autoscaling clusters which stay close to 100% for most of the time.
I have exactly zero knowledge on Spark's efficiency as well as zero on how representative graph algorithms are, but I can confidently say that the above statement fails to refute the referenced article's thesis (which, arguably, criticizes assertions just like that).
Just because your implementation scales (even autoscales) to use more compute resources says nothing about its efficiency (overall or even marginal when adding more nodes, i.e. the shape of the curve).
Computer science has struggled to achieve even near-linear scalability ever since the advent of SMP.
Spark is significantly more efficient than Hadoop.
I don't know about your specific workload, but I've seen quite a few Hadoop setups that were at 100% load most of the time and were replaced by relatively simple non-Hadoop-based code that used 2% to 10% of the hardware and ran about as fast.
I didn't spend much time evaluating the "pre" (Hadoop) side in detail, but at least one workload spent 90% of that 100% on [de]serialization.
It's not my link; it's Frank McSherry's, and he is commenting in this thread - I hope he can chime in on why he chose this specific example - but it correlates very well with my experience.
Joking snark aside, I'm actually doubtful this is true. Specifically, I don't recall the impetus for Hadoop (or Google's original Map-Reduce, as described in the '04 paper) being an all-in-memory workload.
Despite it being repeatedly brought up in this sub-thread, I maintain that it's a niche use case and that disk-based data processing workloads are far more common.
ETA: Does anyone know of a canonical or early/initial document outlining the purpose, or at least design goals, of Hadoop?