How our small startup survived the Amazon EC2 cloud-pocalypse (stanford.edu)
43 points by esilverberg2 on April 23, 2011 | 10 comments


This "cloud-pocalypse" simultaneously affected multiple availability zones in US-East. Had you had the misfortune to have your multi-AZ pair in two affected zones you would have had significant downtime. The only architectures that were truly safe from this outage were those with a completely multi-region strategy, and I suspect those are very far and few between.


I should clarify: if your title were "How we minimized the impact of the cloud-pocalypse", then the content of the article would be fine. I think most SaaS operators would define "surviving" as no downtime, e.g. Netflix.


I am still waiting for Amazon's post-mortem, which I hope is honest. All the other services (4sq, Quora, etc.) were seriously all in the same AZ and made the mistake of not spreading their infrastructure across multiple AZs?

Amazon has seemed rather dishonest about the true breadth of the outage.


I second this. It simply seems way too convenient that every single service I actually know about that is using EC2 (strictly based on prior knowledge) happened to be deployed into the same availability zone. I mean, some were also in other zones (like Netflix), but seriously, from a Bayesian analysis perspective, "something seems rotten" about the conclusions I can draw from "reddit, Quora, Heroku, foursquare, Netflix, and Cydia were all relying on the same zone" + "I only know of one other service using EC2, I haven't heard back from them, and I do not know if they used EBS anyway".

Either their availability zones are ludicrously skewed (supposedly they are random per customer), or they in fact pushed some kind of update that took almost everything down at once, and are relying on the fact that no one could determine what zone they were in.

Actually, on that note, who else here was affected? If your company had stuck EBS volumes during this extended outage period, could you look up your one-year m1.small reserved instance offering ID for the zone you are in? Mine (Cydia was affected by the outage) is: 438012d3-80c7-42c6-9396-a209c58607f9.

To do this, run this script (slightly modified from the one at the site linked below, to update it for a couple of years of drift: Amazon VPC instances were confusing it into showing two identifiers), and make certain to change the last grep to match the availability zone in which your outage occurred:

  # List every EC2 region, then dump each region's reserved-instance offerings;
  # filter down to 1-year m1.small Linux/UNIX offerings in the zone you care about.
  ec2-describe-regions | cut -f2 | while read -r region; do
    ec2-describe-reserved-instances-offerings --region "$region"
  done | grep 'm1\.small.*1y.*UNIX$' | grep us-east-1a # <- change to your affected zone
For information on doing this correlation, check this out: http://alestic.com/2009/07/ec2-availability-zones
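For example (just a sketch reusing the same pipeline, nothing official), to check whether one of your zone labels maps to the offering ID I posted above, you can grep for its prefix directly:

  # Shows which of *your* AZ labels (if any) carries the 438012d3... offering,
  # i.e. whether your account shares that underlying physical zone.
  ec2-describe-reserved-instances-offerings --region us-east-1 \
    | grep 'm1\.small.*1y.*UNIX$' | grep 438012d3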


FathomDB was also hit - 438012d3... was also my majorly disrupted AZ, 60dcfab3... was not significantly affected. Not sure about other us-east AZs.

Thanks for pointing out that trick, by the way. Given AWS is exposing this anyway, they might as well give the underlying zones friendly aliases so that their status updates don't come across as quite so evasive. Probably they don't even know they're exposing this, though!

Perhaps the mapping isn't strictly random but actually load-based. If there are 6 underlying AZs, and Zynga hits 4 of them, then perhaps most other customers would be in the remaining two zones, so it's a 50/50 chance.

I do totally agree that something isn't quite right - AWS status updates do imply they deliberately disabled API calls; perhaps they did that across otherwise unaffected AZs. We'll have to see what the PR department comes up with for the post-mortem.


It really shouldn't be surprising at all. There are only two regions in the US, east and west, and east was the first to open and is cheaper to use. ELB failed in multiple availability zones within US-East. That's going to affect more AWS customers than if it had happened anywhere else in their infrastructure.


Even if you were in multiple AZs (we were, including a multi-AZ RDS deployment), there was a chance you were going to go down (which we did). Being in a different region entirely was the only "guarantee" that you wouldn't have gone down. While we were waiting to be able to actually restore snapshots, we started building a cold-swap copy of our environment on Rackspace so we'd have a quicker way of getting back online.
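(If it helps anyone else still stuck: once snapshot operations recover, a rough sketch of the within-region workaround, with placeholder IDs since I'm not pasting ours, is to restore the EBS snapshot into an unaffected zone and attach it to a fresh instance there; snapshots live at the region level, not in a single AZ.)

  # Placeholder IDs throughout; rough sketch of restoring into another AZ.
  ec2-create-snapshot vol-12345678                           # if the stuck volume will still snapshot
  ec2-create-volume --snapshot snap-87654321 -z us-east-1d   # restore into an unaffected zone
  ec2-attach-volume vol-abcdef01 -i i-1a2b3c4d -d /dev/sdf   # attach to a replacement instance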


Hey: would you mind running the script in my other reply to this in order to see whether you were actually in the same availability zone as me?


FYI: us-east-1a for you might be us-east-1c for me. The AZ labels don't mean anything outside of your account.


that jargon has legs: "cloud-pocalypse"



