Filed under servers

Fixing holes in EC2 reliability

We are in the era of clouds, and at the moment AWS is the Zeus among public clouds. With its scalable and flexible architecture, cheap rates, secure PCI compliant environment, wide array of loosely coupled services and boasting of 99.95% availability, they may deserve the crown. However they are not without holes and few days ago I got the chance to taste it firsthand. This post is about few measures that you should (and I mean this with capital SHOULD) take before moving your production servers to AWS.

To start with, I had been using Slicehost and Linode as VPS providers for couple of years while tinkering with AWS. After a trial run of few months I was satisfied that everything is working as it should be and moved to AWS for real. But the mistake I’ve done and AWS didn’t bother to mention anywhere easily findable is to couple Elastic Block Storage (ESB) with all instance stores. And this is something easy to overlook when you are coming from a regular VPS provider because ephemeral Instance store is the most counterpart similar device to a slice and you may expect the same behaviour throughout.

So back to the story, everything was running fine until AWS had scheduled a maintenance rebooting of the instance two weeks ago. Nothing much to worry right ? But it turns out that the instance didn’t reboot and there was very little possible to do from the AWS web console. Unlike in regular VPS slices, AWS doesn’t come with a back-door SSH console and it turns out even the staff can do pretty much little regarding an instance store. The only solution they could give me was to reboot the instance few times and if it doesn’t work out…well, they are sorry and it’s a lost cause.

I earlier mentioned the mistake I’ve made. But what I got right was to have several layers of backups including database replication slaves. So backups were running pretty much as expected and there wasn’t any lasting damage done.  And only when you are in trouble that you are glad of the time well spent on emergency procedures.

So rest of the story is very little. I removed the crashed instance, restarted a new one from the custom AMI we had and copied data over from DB slaves. But this scenario could have gone vastly wrong if there wasn’t a redundancy setup and for some unfortunate bootstrapping startup it could have reduced all their hard work to crisp.

I know servers should be up running and having them down is not heroic. But there are few points you should have in place before moving your production servers to AWS.

  1. Have a proper backup procedure in place. Better if replication slaves are in some other server vendor or in another AWS region and have a monitor setup to make sure replication process is working properly. Also it’s better to have several layers of backups running so you will have point-in-time recoverable database copy as well as one day old, week old, month old.etc data copies in worst to worst case scenarios.
  2. Use Elastic Block Storage (EBS) – They are the external USB drives of AWS. Couple one or more EBS with your instance store  and use them to store any data you think is valuable. If your instance die, you can just decouple the block and reattach to another fresh instance and run without a hitch.
  3. Have a custom bare-bone AMI with just the OS and may be couple of basic services. Also have an AMI with fully ready-to-launch setup. This way you can make another production ready instance in minimal time as well as have an option in a worst case scenario where the full ready made AMI doesn’t work. Finally, test all your AMIs to make sure that they are working properly.
  4. Have snapshots from your EBS devices in scheduled intervals.
  5. Use these not so easy to find AWS architectural guidelines in designing your platform.

So as I mentioned it’s not about heroics, but making sure your service not getting reduced to ashes because of some stupid server glitch. As someone wise had noted, better be ready than sorry!

Update:

There is another set of sound suggestions made in comment #4 by kordless for any cloud deployment. If you are into heavy scaling they may be particularly useful.