Fixing holes in EC2 reliability

We are in the era of clouds, and at the moment AWS is the Zeus among public clouds. With its scalable and flexible architecture, cheap rates, secure PCI-compliant environment, wide array of loosely coupled services and a boasted 99.95% availability, it may well deserve the crown. However, it is not without holes, and a few days ago I got the chance to taste that firsthand. This post is about a few measures that you should (and I mean this with a capital SHOULD) take before moving your production servers to AWS.

To start with, I had been using Slicehost and Linode as VPS providers for a couple of years while tinkering with AWS. After a trial run of a few months I was satisfied that everything was working as it should and moved to AWS for real. But the mistake I made, and something AWS didn’t bother to mention anywhere easily findable, was not coupling Elastic Block Storage (EBS) with every instance store. This is easy to overlook when you are coming from a regular VPS provider, because an ephemeral instance store is the closest counterpart to a slice and you may expect the same behaviour throughout.

So back to the story: everything was running fine until AWS scheduled a maintenance reboot of the instance two weeks ago. Nothing much to worry about, right? But it turned out that the instance didn’t reboot, and there was very little I could do from the AWS web console. Unlike regular VPS slices, AWS doesn’t come with a back-door SSH console, and it turns out even the staff can do very little about an instance store. The only solution they could give me was to reboot the instance a few times, and if that didn’t work out… well, they were sorry and it was a lost cause.

I mentioned earlier the mistake I made. But what I got right was having several layers of backups, including database replication slaves. The backups were running as expected, so there wasn’t any lasting damage. It is only when you are in trouble that you are glad of the time spent on emergency procedures.

So the rest of the story is short. I removed the crashed instance, launched a new one from the custom AMI we had and copied the data over from the DB slaves. But this scenario could have gone vastly wrong without a redundancy setup, and for some unfortunate bootstrapped startup it could have burnt all their hard work to a crisp.

I know servers should be up and running, and that having them go down is not heroic. But there are a few things you should have in place before moving your production servers to AWS.

  1. Have a proper backup procedure in place. It is better if the replication slaves are with another server vendor or in another AWS region, and you should have monitoring set up to make sure the replication process is working properly. It is also better to have several layers of backups running, so that you have a point-in-time recoverable database copy as well as day-old, week-old, month-old, etc. copies of your data for the worst-case scenarios.
  2. Use Elastic Block Storage (EBS). EBS volumes are the external USB drives of AWS. Couple one or more EBS volumes with your instance store and use them to store any data you think is valuable. If your instance dies, you can just detach the volume, reattach it to a fresh instance and carry on without a hitch.
  3. Have a custom bare-bones AMI with just the OS and maybe a couple of basic services. Also have an AMI with a fully ready-to-launch setup. This way you can bring up another production-ready instance in minimal time, and you still have a fallback in the worst case where the fully prepared AMI doesn’t work. Finally, test all your AMIs to make sure that they work properly.
  4. Take snapshots of your EBS volumes at scheduled intervals.
  5. Use these not-so-easy-to-find AWS architectural guidelines in designing your platform.
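
The layered backups in points 1 and 4 can be sketched as a retention rule that decides which dated backups or snapshots to keep. This is only a minimal illustration in plain Python. The function name `backups_to_keep` and the tier sizes (7 daily, 4 weekly, 6 monthly) are assumptions made for this example, not anything prescribed by AWS, and the code that actually creates or deletes snapshots is deliberately left out.

```python
from datetime import date, timedelta


def _newest_per_bucket(dates_newest_first, bucket_of, limit):
    """Pick the newest date in each of the `limit` most recent buckets."""
    picked = {}
    for d in dates_newest_first:
        bucket = bucket_of(d)
        if bucket in picked:
            continue  # a newer backup already represents this bucket
        if len(picked) == limit:
            break  # every remaining date belongs to an older bucket
        picked[bucket] = d
    return set(picked.values())


def backups_to_keep(backup_dates, today, daily=7, weekly=4, monthly=6):
    """Return the backup dates to retain; everything else may be pruned.

    Tier sizes are illustrative defaults (assumptions), not a recommendation.
    """
    dates = sorted({d for d in backup_dates if d <= today}, reverse=True)
    # Daily tier: every backup taken within the last `daily` days.
    keep = {d for d in dates if (today - d).days < daily}
    # Weekly tier: newest backup in each recent ISO (year, week).
    keep |= _newest_per_bucket(dates, lambda d: tuple(d.isocalendar())[:2], weekly)
    # Monthly tier: newest backup in each recent (year, month).
    keep |= _newest_per_bucket(dates, lambda d: (d.year, d.month), monthly)
    return keep
```

Feeding it the dates of your existing nightly backups returns the set worth keeping, so a scheduled job could prune the rest. Keeping the rule separate from the storage calls makes it easy to test before it is allowed to delete anything.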

So, as I mentioned, it’s not about heroics but about making sure your service doesn’t get reduced to ashes because of some stupid server glitch. As someone wise once noted, better ready than sorry!

Update:

There is another set of sound suggestions made in comment #4 by kordless for any cloud deployment. If you are into heavy scaling they may be particularly useful.

8 thoughts on “Fixing holes in EC2 reliability”

  1. Randy says:

    Yeah it’s too bad that we can’t trust the “experts” that are supposed to know what they are doing. I don’t know a lot about the technical stuff but am learning, and it is surprising how often the “experts” screw up.

  2. BraveNewCurrency says:

    I like to think of it this way: In the past, we focused on “MTBF” (Mean Time Between Failure.) We thought it would be a good idea if each computer had as much “uptime” as possible. We spent extra money on Dual Power Supplies, Dual NICs, RAID, dual UPS, yada yada. We paid $20K for a server we could have bought for $2K.

    But the server still failed sometimes. One computer can *never* be 100% reliable. The UPS isn’t reliable. The datacenter isn’t reliable. The network isn’t reliable. People aren’t reliable.

    Focus on “MTTR” (Mean Time To Recovery) instead. What are you going to do _when_ your server fails? I’ve seen a doctor’s office down for 3 days because it took 8 hours to restore a backup (and they restored the wrong backup twice.)

    Here’s a better plan: Buy several $2K servers, and use software to “RAID” them together. When failure happens, recovery should be seamless and automatic. There’s no reason you should be paged in the middle of the night just because some hardware died. Advanced users should be prepared for the whole datacenter/region going down.

    Instead of avoiding failure, embrace failure. That’s the cloud way.
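
As a rough illustration of the MTBF versus MTTR point in the comment above, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The sketch below plugs in made-up numbers. They are assumptions chosen only to show the trade-off, not measurements from the comment or from any real system.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to illustrate the trade-off.
# Hardened single server: rare failures, but a 3-day manual restore.
hardened = availability(mtbf_hours=10_000, mttr_hours=72)
# Cheap clustered nodes: more frequent failures, automatic failover in minutes.
clustered = availability(mtbf_hours=2_000, mttr_hours=0.1)

print(f"hardened single server: {hardened:.4%}")
print(f"cheap clustered nodes:  {clustered:.4%}")
```

With these assumed inputs, the setup that fails five times as often still comes out ahead, because recovery time dominates the result.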

  3. Rob Harrigan says:

    I recommend using Opscode Chef to bootstrap and bring up instances. This was an absolute lifesaver when we ran into similar reboot issues. New servers can be brought up and loaded with all the necessary packages in minutes, freeing you to jockey backup data around and get the machine(s) back into a ready state.

    • Laknath says:

      Yes, I’m planning on using Chef or Puppet when scaling our app architecture. Btw, any particular reason for choosing Chef over other configuration management systems such as CFEngine, Bcfg2, Puppet, etc.?

  4. kordless says:

    Sorry to hear of your troubles, thanks for sharing.

    Loggly got hit far worse than you did. We’ve rebooted servers before and they usually come back up. This time over 95% didn’t, and it took us all the way down and out for the count. We were down for over 24 hours trying to get our search cluster back online and taking data again. Our system is neither simple nor easy to bootstrap, and even though we have scripts that start and stop instances of our stack at will, for development or testing, bringing a large, live production cluster back up from zero took us WAY longer than we expected. We should have planned for it. We didn’t. We didn’t expect all our machines to go away. What we expected was SOME of them to go away, or SOME interruption of service (because it’s the cloud!), but we never expected all of the boxes to be kicked by humans. All at once.

    We should have planned better.

    Backing up our database and all customers’ logs (up until we went down) ‘saved’ us, but we still suffered data loss during the outage, dropping data that customers were sending in, and we were DOWN and unusable, which is the greatest sin of all.

    Given our experiences, I would add a few more points to your list:

    6. Test a full deployment of your current architecture, including size/scale, while still running another instance of it. This ensures you have the resources to start it if you need another one (we did not, and found out post-disaster that we could only launch 50 instances in total). If required and/or possible, test taking data from the production system and teeing it into the new deployment to see if it works properly.
    7. Make sure your deployment management scripts (we use Puppet) work anywhere. We can launch instances of Loggly on VMs, bare metal, Rackspace, etc. Test alternate deployments on other AWS regions AND other providers. You don’t know all the bad things that could happen. AWS could go completely tits up, and you’d need to fail to … somewhere.
    8. Regarding point 7, make sure you aren’t architecturally dependent on AWS services. Where possible, adopt alternate technologies that work across multiple infrastructures. For example, OpenStack supports an S3-like storage system. Make sure your stack works with it.
    9. Don’t create technologies in your stack that are hard to scale and/or replicate. We’ve done that at Loggly because we thought we needed a single search cluster. We should have sharded customers/inputs/whatever across zones and regions. That way if part of it goes down, only a few customers are affected.
    10. If you run in the ‘cloud’ realize you are offloading the responsibility for running infrastructure to someone else. We expect AWS to be reliable, but yet are limited in our expectations because of limits in the technology and costs that they must manage. Running your own boxes may place more responsibility on you for managing them, but will also allow you to better manage the expectations of what can go wrong.
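
Point 9 above (shard customers across zones and regions so that a partial outage only hits some of them) could look roughly like the sketch below. The shard list and the `shard_for` and `affected_customers` helpers are hypothetical and are not Loggly's actual design. The sketch uses a plain hash-modulo assignment, which is simple but would move customers around if the shard list changed.

```python
import hashlib

# Hypothetical shards, each assumed to live in a different zone or region.
SHARDS = ["us-east-1a", "us-east-1b", "us-west-1a", "eu-west-1a"]


def shard_for(customer_id, shards=SHARDS):
    """Map a customer to a shard deterministically via a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def affected_customers(customer_ids, failed_shard, shards=SHARDS):
    """Return the customers whose data lives on the failed shard."""
    return [c for c in customer_ids if shard_for(c, shards) == failed_shard]
```

Because the mapping depends only on the customer ID, every part of the stack can compute it independently, and losing one shard leaves customers on the other shards untouched.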

    • Laknath says:

      Thanks for sharing your experience. Though it’s hard to build a 100% reliable system, being aware of what has gone wrong or right in other cases gives a good grasp of what can go wrong and how to be ready for it.

      The application I was speaking of isn’t scaled to the magnitude of your case, since it’s not yet open to the public, but all your suggestions are sound and useful, so I updated my post to mention your comment.

  5. Nathan McCourtney says:

    I think your faith in EBS is unwarranted. Using EBS just means that it’ll persist if an instance terminates unexpectedly.

    In the two years I’ve been using AWS in large production environments, random instance termination was the least of our problems. Weird EBS issues can cause horrific outages. So solve both the redundancy and durability problem at the same time: replicate your data among hosts in different Availability Zones and Regions from the get-go.

    • Laknath says:

      “Replicate your data among hosts in different Availability Zones and Regions from the get-go”

      I was trying to make the same point throughout the post. EBS is just another tool that helps to achieve the purpose, and you should by no means limit yourself to it. However, rather than just having an instance store without any EBS coupled, having EBS with snapshots could give you more options in a failure.
