PipelineDeals has been hosted on Amazon's AWS cloud platform since 2007. In that time we have been exposed to two separate mass outages, both of which affected the US-EAST region.
Compared to other AWS regions, US-EAST is Amazon's busiest, has the cheapest hourly server rates, and happens to be the most prone to massive outages.
Given that the majority of our customer base is closer to the east coast, we keep our servers hosted in US-EAST. What this means, however, is that we must be prepared to jump ship to another region with as little downtime as possible.
Setting up for geographic redundancy
PipelineDeals relies heavily on chef for our server configuration management. Over the past couple of months we have made deep investments in our chef cookbooks to ensure that we can bring any type of server in our stack up to 100% simply by running a rake command. Because we can do this, we can bring up any type of server in any AWS region with very little effort.
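As a sketch of what such a launcher might look like, here is a minimal script (which a rake task could wrap) that shells out to knife-ec2. The role name, AMI id, and instance flavor below are placeholders, not our actual configuration:

```ruby
#!/usr/bin/env ruby
# Hypothetical launcher sketch -- role names, AMI id, and flavor are
# placeholders, not PipelineDeals' real values.

# Build the knife-ec2 command that bootstraps a server of the given
# chef role in the given availability zone.
def knife_launch_command(role, zone)
  [
    'knife ec2 server create',
    "--run-list 'role[#{role}]'",
    "--availability-zone #{zone}",
    '--image ami-00000000',   # placeholder AMI (instance store-backed)
    '--flavor m1.large'       # placeholder instance size
  ].join(' ')
end

# e.g. `ruby launch.rb app us-west-1b`
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  system(knife_launch_command(ARGV[0], ARGV[1] || 'us-east-1a'))
end
```

Because the zone is just a parameter, the same command that builds a server in the east builds its twin in the west.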
An ounce of prevention
We noticed that the past two massive outages were related. Both involved Amazon's EBS system, which provides external block storage and also acts as the root device for certain instance types. During the outages, it was our EBS-backed instances that were experiencing problems.
We decided to get proactive about it. None of our production servers rely on EBS-backed instances any longer. Because we use chef to bring up instances, we do not need the ability to stop and reboot servers, which is one of the major benefits EBS-backed instances provide. In fact, removing our dependence on EBS forced our hand to utilize chef to its fullest potential.
We switched our chef recipes to run against instance store-backed instances, which do not rely on the EBS subsystem. Since the outage we have replaced every running EBS-backed instance with an instance store-backed one.
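A simple way to keep ourselves honest about this is to audit the fleet for EBS root devices. A minimal sketch, assuming instance records shaped loosely like EC2's DescribeInstances output (the field names here are illustrative):

```ruby
# Hypothetical audit sketch: given a list of instance hashes shaped
# loosely like EC2's DescribeInstances output, return the ids of any
# instances still booting from an EBS root device.
def ebs_backed_instance_ids(instances)
  instances
    .select { |i| i['rootDeviceType'] == 'ebs' }
    .map    { |i| i['instanceId'] }
end
```

Run periodically against the live fleet, a check like this catches any EBS-backed instance that sneaks back in.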
Guarding against the 'mad rush'
Typically when an outage occurs in US-EAST, there is a massive rush to fire up instances in other regions. So much so that it can take tens of minutes, possibly even hours, to fire up servers, all in the midst of an emergency.
To guard against that, we keep a "skeleton crew" of servers that are always on in US-WEST-1: a database slave, an application server, and a background process server. If we needed to fail over to our skeleton crew in the west, we could be up and running while assessing the situation in the east and/or firing up more powerful hardware in the west.
Essentially, we would promote our slave in the west to master, fire up a new west coast slave, and start firing up the ancillary servers that make up the full PipelineDeals experience.
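The promotion step can be sketched as follows. The host name and the mysql invocation are illustrative, though the SQL statements themselves are standard MySQL replication commands:

```ruby
# Hypothetical failover sketch: promote the west-coast MySQL slave to
# master. The host name is a placeholder.
PROMOTE_SQL = [
  'STOP SLAVE;',                  # stop applying replication events
  'RESET SLAVE;',                 # discard the old master's coordinates
  'SET GLOBAL read_only = OFF;'   # accept writes as the new master
].freeze

# Build the shell command an operator (or a rake task) would run.
def promote_command(slave_host)
  %(mysql -h #{slave_host} -e "#{PROMOTE_SQL.join(' ')}")
end
```

After the promotion, the application servers are pointed at the new master and a fresh slave is bootstrapped behind it.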
Practice makes perfect
The multi-day threat of Hurricane Sandy gave us multiple practice runs at failing over to the west. This included exercising every single one of the chef recipes that make up the entire infrastructure of PipelineDeals.
During that time we made many commits to our chef repository and ensured that we could fire up servers to 100%, time after time.
In addition, we were able to combine server roles, reducing the number of servers needed to make up the entire PipelineDeals experience.