Compare with SoftLayer which has decided to charge through the nose for their RA...

Smerity · on April 27, 2012

I haven't looked at SoftLayer's offerings re: RAM, but the most important thing to note with Hetzner is that only some of their dedicated machines have error-correcting code (ECC) memory. For example, Hetzner's 16GB machine[1] (89 Euros/month) has ECC memory.

Whilst this might not sound like an enormous issue, DRAM errors are surprisingly common in real deployments[2] and can result in scary things happening to your data. If you're using a machine without ECC memory, make sure you're prepared to deal with any issues that might arise, especially if they would impact your core business.

[1] http://www.hetzner.de/en/hosting/produkte_rootserver/ex8

[2] DRAM Errors in the Wild: A Large-Scale Field Study: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

jules · on April 27, 2012

How common DRAM errors are is very unclear. The numbers in that paper are astronomically high, but other sources have published numbers that are 30x or 100,000x or literally 10,000,000x lower.

See this Stanford ee380 talk: http://stanford-online.stanford.edu/courses/ee380/100922-ee3...

The part about DRAM errors is about 57 minutes in.

Abstract here: http://www.stanford.edu/class/ee380/Abstracts/100922.html

Also, AFAIK most hosting providers don't even have ECC RAM as an option for servers, e.g. Amazon.

justincormack · on April 27, 2012

Do you have a source on Amazon not using ECC?

EwanToo · on April 27, 2012

I don't think there's a confirmed source, but there are certainly indications. For example, an AWS post says that they use ECC memory for the GPUs in their cluster GPU instance:

http://aws.typepad.com/aws/2010/11/new-ec2-instance-type-the...

Given that they list both the standard memory and GPU memory next to each other, but only put ECC next to the GPU memory, it seems relatively likely to me that the standard server memory is not ECC.

justincormack · on April 27, 2012

But Nvidia only sell Teslas with ECC. I don't really read that implication into this.

On the other hand, the fact that https://forums.aws.amazon.com/message.jspa?messageID=203167 has never been answered suggests they do not, as surely they would answer affirmatively if it were true. Of course they may use a mixture.

EDIT: Interesting that James Hamilton is on their team and thinks they should use it http://perspectives.mvdirona.com/2009/10/07/YouReallyDONeedE...

EwanToo · on April 27, 2012

I can well imagine that newer machines have it, and older ones don't, and slowly over time they'll end up with everyone using ECC.

The costs involved were a lot higher 3-5 years ago than they are today.

jules · on April 27, 2012

I was not able to find Amazon declaring that they don't, but neither do they say anywhere that they do. For example they describe the hardware specs here: http://aws.amazon.com/ec2/instance-types/ If they did have ECC RAM it is unlikely that they would keep it secret, especially given that ECC RAM can be twice as expensive.

wmf · on April 27, 2012

Amazon uses Xeons and Opterons where ECC is pretty much mandatory. They don't have an option for ECC because all their servers have it.

nirvana · on April 27, 2012

I think it is a good idea to architect your system under the assumption that once a year, some thief is going to sneak into the data center and make off with an entire server. Just one, but the whole server will then die horribly in a fire after a shoot-out with the police.

Lots of things can happen; you should have a higher level of replication such that you can handle a whole server going poof, not just a single bit going poof.

The cost of ECC at Hetzner (the cheapest provider out there) is about half an additional server. So, buy three servers without ECC for the price of two servers with ECC, and replicate your data three times (and triple your bandwidth, horsepower, etc.)

This is not hard with platforms like Riak, which are distributed homogeneous clusters of nodes.

And if your service isn't built like that, then really it should be. (IMNSHO, of course.)

AdamGibbins · on April 27, 2012

Replication isn't going to help if your data is being silently corrupted; you'll just replicate the corruption.

nirvana · on April 29, 2012

ECC does nothing to prevent your server from dying in a fire after that police chase, either. However, replication is an effective solution to the problem you describe:

Whenever data is read, you read from more than one replica, and then compare them. If one of them has been corrupted, its hash won't match and you'll know it. You can then write out the correct data to the node with the error. This is very easy in systems like Riak.
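
A rough sketch of that read-and-compare idea, in Python. This is not Riak's API: the replicas are modelled as plain dicts, `read_with_repair` is a made-up name, and treating the majority hash as the "correct" value is an assumption of this sketch.

```python
import hashlib
from collections import Counter


def digest(value: bytes) -> str:
    """Return a SHA-256 hex digest used to compare replica contents."""
    return hashlib.sha256(value).hexdigest()


def read_with_repair(replicas, key):
    """Read `key` from every replica, pick the majority value, fix the rest.

    `replicas` is a list of dicts standing in for storage nodes
    (hypothetical; a real cluster would do network reads here).
    Returns the agreed value and the indexes of the replicas repaired.
    """
    values = [node[key] for node in replicas]
    hashes = [digest(v) for v in values]

    # Assumption: the value held by a strict majority is the good one.
    winner_hash, votes = Counter(hashes).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas; cannot repair")

    good = values[hashes.index(winner_hash)]

    # Write the agreed value back to any replica whose hash differs.
    repaired = []
    for i, node in enumerate(replicas):
        if hashes[i] != winner_hash:
            node[key] = good
            repaired.append(i)
    return good, repaired


# Three replicas; the second one has a flipped byte.
nodes = [{"k": b"hello"}, {"k": b"hellp"}, {"k": b"hello"}]
value, repaired = read_with_repair(nodes, "k")
```

Note that this only helps when the replicas disagree. Data that was already corrupted before it was replicated would match on every node and pass the comparison unnoticed.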

stevencorona · on April 27, 2012

SoftLayer charges an insane amount for memory