How we spent Friday night coming back online before Instagram and others (fitocracy.com)
82 points by jc4p on July 1, 2012 | 39 comments


Nice job staying on top of things, but looking in from the outside it does seem a bit wasteful to spend what, 100 man-hours of effort and expense, canceling a handful of otherwise happy Friday nights, just to gain a couple of hours of uptime during a period when roughly nobody is using your thing.

I had a site that was affected by another one of Amazon's outages a while back, and here was my disaster recovery plan in its entirety:

  a. go to sleep.
There's a reason you farm things like this out to Amazon in the first place. They have a big team of smart people whose only job in life is to keep your stuff alive, or scramble like mad to bring it back up if it goes down.

So long as your site knows how to start up automatically when the box turns on, there's really not a lot you need to do in a situation like this.

If forty percent of the internet is down, and you're part of it, your users will probably understand. They'll expect you to come back up when the rest of the internet does. If you do manage to come up a bit earlier, you might get a shrug and a "cool", but it's probably not enough of a win to cancel Christmas.


An ALIAS record is just an internal-to-Route-53 mapping: "when people ask for X, pretend they asked for Y instead." This is conceptually similar to a server-side CNAME. The reason these don't have a TTL is that users never see them.

The returned record, of course, has a TTL, and the ALIAS mechanism will not alter the TTL of the aliased data, so in the case of an ELB you are talking a TTL of one hour. There are no magic bullets to the distributed cache expiration problem.

(The other comments in this article about TTL seem quite confused, though, so this explanation might not actually have helped. Even if you have a 20-year TTL, you are going to see changes immediately from clients that do not have the data cached anywhere on their path to the origin.)

(In particular, there is no difference at all in switching the ALIAS record out for an A record as far as your TTL is concerned: if the user has the target of the old mapping cached they will use it; otherwise they will get the new one. It isn't really "getting away with" that behavior, and it isn't due to Amazon's DNS being special.)
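
If you want to watch this behavior directly, here is a minimal sketch, assuming the dnspython library and a placeholder hostname; the TTL it prints is whatever is left in the nearest cache, not necessarily what the authoritative server published:

  # Minimal sketch (dnspython); "elb.example.com" is a placeholder name.
  import dns.resolver

  answer = dns.resolver.query("elb.example.com", "A")
  # This TTL is the remaining time in the nearest cache, which counts
  # down between repeated queries rather than resetting.
  print("remaining TTL:", answer.rrset.ttl)
  for record in answer:
      print("A record:", record.address)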


Agreed. At the time we were quite confused about the intricacies of TTL, and we still are. It sucks that there's not really a good way to address it. Maybe just keep the TTLs at 300 always?


Keep in mind that not all DNS caches, particularly those at large ISPs, respect sub-3600s TTLs.

(they do this in an effort to reduce load; whether it is truly effective on today's hardware is arguable)


Some ISPs (in particular, those in countries far from the US, such as the Middle East, although occasionally even Europe) do not even honor the hour-long TTL used by ELB (for reasons of latency, not load). So if you care about your traffic not being routed to someone else's server, you should not expose ELB to end-user requests. (In my case, I use it to balance my backend servers, but the only incoming connections it handles are from CDNetworks, which I know has a to-specification implementation of DNS caching.)


Won't that lead to a slower page load time for a lot of people due to unnecessary DNS look-ups?


A nicer thing to do would be to have the maintenance IP as your last A record. If the client can't reach your regular servers, it will automatically fall back to the maintenance IP.


That's not actually how A records work. Clients are supposed to randomly select one of the A records, although this doesn't always distribute the load as randomly as it should. (My company's site used to have 4 A records for 4 load balancers, and we found that one of them received 25-30% more traffic than the rest.) Clients certainly don't have a mechanism to retry if one of them fails -- they'll be stuck with the failing IP for the length of the TTL.
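
For what it's worth, you can see the full set of A records a stub resolver hands back with a quick sketch like this (standard library only, placeholder hostname); note that nothing in it retries another address if the first connection fails:

  # Sketch: list every address the OS resolver returns for a name.
  # "www.example.com" is a placeholder; a load-balanced name returns several.
  import socket

  infos = socket.getaddrinfo("www.example.com", 80, proto=socket.IPPROTO_TCP)
  addresses = sorted({info[4][0] for info in infos})
  print("addresses returned:", addresses)
  # A naive client simply connects to one of these; if it happens to pick the
  # dead maintenance IP, it stays stuck there for as long as the cache allows.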


Low TTLs are a necessary evil; otherwise you turn your load balancer into a SPOF.


I can't praise Chef by Opscode enough. We had several webservers running with a fresh deployment of our web stack in about ten minutes. We've built several recipes to install the prerequisites (nginx, php5-fpm, or Rails, depending on the system role), and then it uses a GitHub deployment key to check out the stable revision of our applications.

Even before you can afford a part-time DevOps engineer, I highly recommend automating your system administration as much as possible. Anywhere you would script the installation of an application server, I recommend doing it with Chef!

This will allow you to quickly bring instances online on just about any cloud: HP, AWS, Linode, or your own home-brewed OpenStack cloud. You will still have challenges with your persistent data, but it's a lot easier to breathe and act quickly knowing you'll have application servers ready to access that data.
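
Since the recipes themselves are Ruby, here is only a rough Python sketch of the kind of one-command bring-up this gives you; it just shells out to knife bootstrap, with an assumed Chef server and a hypothetical role[webserver] run list:

  # Rough sketch: wrap "knife bootstrap" so a fresh instance configures itself.
  # The Chef server, ssh key path, and role[webserver] run list are assumptions.
  import subprocess

  def bootstrap(host, ssh_user="ubuntu", key_file="~/.ssh/deploy.pem"):
      subprocess.check_call([
          "knife", "bootstrap", host,
          "-x", ssh_user,
          "-i", key_file,
          "--sudo",
          "--run-list", "role[webserver]",
      ])

  bootstrap("203.0.113.10")  # placeholder address of the new instance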

edit: I have no affiliation with Opscode. We aren't even a paying customer; we use their free server. I am sure Puppet or any other system administration tool will get you similar mileage.


I use a similar system. With OVH you can set installation templates that partition the OS, set ssh keys, and then run a script, which for me is a Puppet script that sets everything up and deploys the latest version of the site.

OVH has an Android app, so I can scale to a new server at the push of a button on my phone. :D

Plus they cost peanuts compared to Amazon.

I have failover in other data centres too, but OVH makes me the happiest.


Daniel Roesler here. I feel really sorry for those who paid for Multi-AZ support on their RDS instances, only to have all the availability zones go down. That would make me rage quite a bit.


The only bump in service we had was ~30s of downtime when our multi-AZ RDS cut over to the failover instance. :) I think we lucked out a little bit...


My multi-AZ RDS also failed over without a problem. The site was down for 5 minutes though, perhaps due to ELB problems.


Were you in us-east-1? That sounds amazingly lucky. If only Amazon would tell us how many AZs actually went down...


Does 1 AZ == 1 datacenter? It seems like it's possible 5 distinct "AZs" could have gone down while someone else escaped with all 4 of their AZs unscathed, since Amazon takes care to talk about availability zones rather than datacenters, and my AZs by definition aren't the same as your AZs.


AZs are different data centers, but in the same region. I think that means the buildings are close enough to have super low latency connections to each other. However, when a huge storm rolls through, it can wipe out all the AZs (which is basically what happened). AZs are mostly for guarding against floods, fires, and other localized disasters, not regional disasters that leave 2 million people without power.


As I understand it, the separate AZs can be the same data center, but in theory they are completely separated from each other in terms of network, power, etc., so an outage in one shouldn't affect the others.


The other catch that not many folks seem to realize is that 'your' us-east-1a is NOT the same as 'my' us-east-1a, although my 1a and 1b are guaranteed to be different AZs.

And I think you hit the nail on the head - 'regions' are for regional fault tolerance, while AZs are for within-a-region fault tolerance.


Nice write-up. How did your users perceive the downtime? Mostly understanding, or was there a lot of anger?


Our entire team was making sure we'd give an immediate response to any user who asked what was going on while we didn't have the down page up, so most people were pretty happy:

http://i.imgur.com/hCP44.png


This webpage is not available


tl;dr

We waited for stuff to come back up. Our relatively simple application required far fewer servers than more complex services with millions of users.

We're awesome!


Hey Daniel, thanks for the educational writeup. I have to wonder about ways around the AMI issue. We use puppet to set up new instances (and keep existing instances in sync... although we tend to just recycle EC2 instances anyway). This is pretty nice to work with given its declarative nature, but we have to put up with long, long startup/initialization times for new instances. Which sucks during downtime, of course.

Do you think there's some middle ground of using AMIs but also using puppet somehow, so you make new AMIs as a perf optimization but keep puppet config up to date? TBH it's something I've only casually wondered about. But maybe it's what we both need. Having a puppet config would mean you can launch on basically any provider.


Yes, exactly. Throwing up pre-built AMIs is tons faster than building from scratch each time you launch.

However, we're probably going to make a deployment script (we use fabric) that builds the AMI from scratch. Then, when we need to update the AMI, we just update the fabric script and run it to make the new AMI. That way, if we ever need to make AMIs in a different region, we can just run the fabric script in that region.
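
A rough sketch of what that could look like, assuming fabric 1.x plus boto (the package list, repo, region, and ids are all placeholders): one task configures the freshly launched base instance over ssh, and another snapshots it into an AMI in whatever region you run it against.

  # Rough sketch (fabric 1.x + boto); packages, repo, region and ids are placeholders.
  import boto.ec2
  from fabric.api import run, sudo, task

  @task
  def configure():
      # Runs over ssh on the freshly launched base instance.
      sudo("apt-get update && apt-get -y install nginx")
      run("git clone git@github.com:example/app.git ~/app")

  @task
  def bake_ami(instance_id, version, region="us-east-1"):
      # Snapshot the configured instance into a reusable image.
      conn = boto.ec2.connect_to_region(region)
      image_id = conn.create_image(instance_id, "app-%s" % version)
      print("created %s in %s" % (image_id, region))

  # usage (hypothetical): fab -H ubuntu@<instance> configure
  #                       fab bake_ami:i-0123abcd,2012.07.01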


I use a similar process to this in my infrastructure - I use Puppet to configure instances, then a Fabric command to create new AMIs based on these Puppet configurations.

This gives the best of both worlds: version-controlled configuration files and an automated process for making new AMIs. Instances also boot quickly, as nearly all of the configuration is baked into the image.

Using EC2 tags gives you even more room for automation here.
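
(As one illustration of the sort of thing I mean, and only a sketch with an assumed "role" tag and region: you can select running instances by tag and hand the list to whatever automation runs next.)

  # Sketch using boto; the "role" tag value and region are assumptions.
  import boto.ec2

  conn = boto.ec2.connect_to_region("us-east-1")
  reservations = conn.get_all_instances(
      filters={"tag:role": "webserver", "instance-state-name": "running"})
  web_hosts = [i.public_dns_name for r in reservations for i in r.instances]
  print(web_hosts)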


Can you elaborate on how you use EC2 tags? The only thing we use them for is to mark a new instance as bootstrapped once puppet finishes.

It also sounds like you could pretty easily make your AMI creation step a job in your CI software.


If you used puppet (or chef etc) instead of fabric for config, you could have the best of both worlds - it should be much faster to run puppet on existing instances to keep them up to date, rather than failover all your instances to new AMIs every time. But by still creating those AMIs, you'll have them ready when you need to launch additional instances.


The correct way to do it is as follows.

1) Set up a test environment, using puppet or whatever, that boots up an image instance and deploys your whole stack to your cloud provider.

2) Deploy your release candidate to this environment.

3) Run some smoke tests, acceptance tests, and maybe even a few performance tests against this environment.

4) If all the tests pass, store your image somewhere (S3, GitHub) and make sure to tag the image with your release candidate version.

5) You can now deploy this change-set to your production environment.

When things go wrong you will have fully configured servers available almost immediately.

Now, granted, this is potentially a very long-running process, but you could simply set it up as a nightly build.
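
Step 4 is the piece people tend to skip; as a small hedged sketch with boto (the image id, region, and tag value are placeholders), tagging the baked image makes it trivial for the production deploy to find the exact artifact that passed the tests:

  # Sketch with boto; ami id, region and tag values are placeholders.
  import boto.ec2

  conn = boto.ec2.connect_to_region("us-east-1")
  # Tag the freshly baked image with the release candidate it was built from.
  conn.create_tags(["ami-0123abcd"], {"release-candidate": "rc-2012.07.01"})
  # Later, the production deploy looks the image up by that same tag.
  images = conn.get_all_images(filters={"tag:release-candidate": "rc-2012.07.01"})
  print([img.id for img in images])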


That's interesting, but would not work for us since we ship code up to 20-30 times a day at Canvas (using continuous deployment). It's already bad enough that it takes about 10 minutes to ship.


If you really want to make it funked out, you could set up a CI process to check out your puppet scripts and build an AMI whenever a change is made to your config. Then the deploy task would be a rolling upgrade of the new AMI to whichever environment you want the config to affect. With some cloud formation and puppet trickery you could also one-click-bootstrap new environments (e.g. "production in region x", or "staging in region y") and just run the deploy task to that new environment.


Since you're already using puppet, which is great at keeping existing instances up to date with config changes (it's almost the whole point), wouldn't it be better for your CI to create those AMIs asynchronously when you deploy, but not actually use them unless you (or autoscale) are launching new instances? Just use puppet to deploy config changes as usual, then have the AMIs at the ready, rather than immediately recycling your instances with the new AMIs. This should be much faster (and more decoupled from AWS).


Did you guys first attempt to boot up in another us-east availability zone? Not all of them were affected, despite what this post seems to imply. I had a slave DB go down, but the rest of our deployment was unaffected (the rest of the deployment is in a different AZ than the slave DB instance, but all are in us-east).


I highly recommend signing up for eCompuCloud: http://www.ecompucloud.com/

It relies on several cloud computing providers to prevent such outages from happening.

The pricing is even lower than Amazon or any other computing provider! (We buy larger clusters, which cost us less.)


The reason it took you 12 hours to get back online? Because you really never gave much consideration to availability.

You should THINK about your TTL values long in advance of a problem. You should also THINK about having a backup instance running (or at least ready to boot, if an hour of downtime is perfectly OK for your users).

More discussion here: http://news.ycombinator.com/item?id=4181918


"But you didnt, and you should feel really embarrassed about that."

Sounds to me like they learned a lot, and are sharing what they've learned to help others. What's really to gain from shaming them over it?


My summary of their post is "we basically have no idea how to offer a reliable service and we know almost nothing about basic Internet protocols, and we feel like heroes anyway".

I would be a lot more impressed if they were asking questions about how DNS really works and what they should do to avoid problems in the future. That would be cool.


Perhaps you could blog about this topic in more depth. I'd be interested in reading it.


Surprised I am the only one who found the title amusing.

The reason you got your site online before Instagram and others is because they have a lot of infrastructure and moving pieces as a result of being extremely popular. Obviously Fitocracy doesn't share those characteristics.

That said, it is unacceptable for ANY site to go down simply because you lose power in a data center.



