Fixing holes in EC2 reliability

We are in the era of clouds, and at the moment AWS is the Zeus among public clouds. With its scalable and flexible architecture, cheap rates, secure PCI-compliant environment, wide array of loosely coupled services and a boasted 99.95% availability, it may deserve the crown. However, it is not without holes, and a few days ago I got the chance to taste that firsthand. This post is about a few measures that you should (and I mean this with a capital SHOULD) take before moving your production servers to AWS.

To start with, I had been using Slicehost and Linode as VPS providers for a couple of years while tinkering with AWS. After a trial run of a few months I was satisfied that everything was working as it should and moved to AWS for real. But the mistake I made, and one AWS didn’t bother to mention anywhere easily findable, was not coupling Elastic Block Store (EBS) volumes with my instance stores. This is easy to overlook when you are coming from a regular VPS provider, because the ephemeral instance store is the closest counterpart to a slice and you may expect the same behaviour throughout.

So, back to the story: everything was running fine until AWS scheduled a maintenance reboot of the instance two weeks ago. Nothing much to worry about, right? But it turned out that the instance failed to reboot, and there was very little I could do from the AWS web console. Unlike regular VPS slices, AWS doesn’t come with a back-door SSH console, and it turns out even the staff can do very little with an instance store. The only solution they could give me was to reboot the instance a few times, and if that didn’t work out… well, they were sorry and it was a lost cause.

I mentioned earlier the mistake I made. But what I got right was having several layers of backups, including database replication slaves. The backups were running as expected, so there wasn’t any lasting damage. It is only when you are in trouble that you are glad of the time spent on emergency procedures.

So the rest of the story is short. I removed the crashed instance, launched a new one from the custom AMI we had and copied the data over from the DB slaves. But this scenario could have gone vastly wrong without a redundancy setup, and for some unfortunate bootstrapped startup it could have burnt all their hard work to a crisp.

I know servers should be up and running, and having them down is not heroic. But there are a few things you should have in place before moving your production servers to AWS.

  1. Have a proper backup procedure in place. It is better if the replication slaves are with another server vendor or in another AWS region, with a monitor set up to make sure the replication process is working properly. It’s also better to have several layers of backups running, so you have a point-in-time recoverable database copy as well as day-old, week-old, month-old, etc. copies for worst-case scenarios.
  2. Use Elastic Block Store (EBS). EBS volumes are the external USB drives of AWS. Attach one or more EBS volumes to your instance and use them to store any data you think is valuable. If your instance dies, you can just detach the volume, reattach it to a fresh instance and run without a hitch.
  3. Have a custom bare-bones AMI with just the OS and maybe a couple of basic services. Also have an AMI with a fully ready-to-launch setup. This way you can bring up another production-ready instance in minimal time, and you have a fallback in the worst case where the full ready-made AMI doesn’t work. Finally, test all your AMIs to make sure that they work properly.
  4. Take snapshots of your EBS volumes at scheduled intervals.
  5. Use these not-so-easy-to-find AWS architectural guidelines when designing your platform.
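
The replication monitor from point 1 can be as small as a check on the output of MySQL's SHOW SLAVE STATUS. The following is only a rough sketch under stated assumptions: the helper name replication_problems and the 300-second lag threshold are made up for illustration, and the status row is assumed to arrive as a plain Hash of column name to value (how you fetch it depends on your MySQL client). The column names Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master are the ones MySQL reports.

```ruby
# Hypothetical helper: lists what looks wrong with a MySQL replica, given
# one row of SHOW SLAVE STATUS as a Hash (column name => value).
# MAX_LAG_SECONDS is an assumed threshold, not a recommendation.
MAX_LAG_SECONDS = 300

def replication_problems(status, max_lag: MAX_LAG_SECONDS)
  problems = []
  problems << "IO thread is not running"  unless status["Slave_IO_Running"] == "Yes"
  problems << "SQL thread is not running" unless status["Slave_SQL_Running"] == "Yes"

  # MySQL reports NULL here when the lag cannot be determined.
  lag = status["Seconds_Behind_Master"]
  if lag.nil?
    problems << "replication lag is unknown (NULL)"
  elsif lag.to_i > max_lag
    problems << "replica is #{lag.to_i}s behind the master"
  end

  problems
end

# Example rows (invented values, for illustration only).
healthy = { "Slave_IO_Running" => "Yes", "Slave_SQL_Running" => "Yes", "Seconds_Behind_Master" => 4 }
broken  = { "Slave_IO_Running" => "No",  "Slave_SQL_Running" => "Yes", "Seconds_Behind_Master" => nil }

puts replication_problems(healthy).inspect
puts replication_problems(broken).inspect
```

An empty list means the replica looks fine; anything else is worth an alert. Returning the list of problems rather than a boolean keeps the alert message useful when you are woken up at 3 am.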

So, as I mentioned, it’s not about heroics but about making sure your service doesn’t get reduced to ashes because of some stupid server glitch. As someone wise once noted, better ready than sorry!

Update:

There is another set of sound suggestions made in comment #4 by kordless for any cloud deployment. If you are into heavy scaling they may be particularly useful.

Replication & backups with Ruby

If there’s one thing certain in life, it is uncertainty. As you go higher up the ladder of life, the fall grows steeper and the risk becomes greater. The same rules apply in the digital world.

In the process of building and maintaining software, there are plenty of accidents, ways to screw things up, foolish mistakes and enemies out to sabotage your work. So most would agree it’s sensible to have a solid backup strategy as your insurance policy in case disaster strikes your budding app. But backups and redundancy sound really expensive and time consuming, don’t they? Well, not anymore: with all the cloud services floating around, it’s possible to have a good data backup plan for a few additional bucks. So in this post I’m sharing a general backup approach, easily implementable using the wonderful Ruby Backup gem, that you can use or adapt according to your application’s needs and risk.

When it comes to data replication and recovery, there are various aspects you might need to look into depending on the nature, scale and risk of your data. The method suggested here assumes there are two servers, a master and a slave, between which you can set up data replication with MySQL. This approach was largely inspired by the strategy adopted by Marco Arment for Instapaper.

Here is an overview of the strategy suggested in this post.

Backup Overview

1) To achieve point-in-time recovery I’m using a simple master-slave database setup with MySQL. This is very straightforward; here and here are some examples of how to set it up.

2) Install Backup Gem

Run gem install backup on both the master and slave servers.

3) Set up MySQL binary log syncing from master to slave

– Set up the Backup config on the master (this will create a ~/Backup/config.rb file)

  sudo backup generate --databases='mysql' --storages='s3' --compressors='gzip'

– Additionally, create a default config file, defaults.rb (put mail alert and Twitter alert configs here)

– To sync binlogs every 5 minutes, put this code in ~/Backup/config.rb. Do the usual SSH key copy procedure to avoid a password prompt when rsyncing.

– Update crontab to sync MySQL logs every 5 minutes

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /bin/bash -l -c 'cd /home/username/Backup && /usr/bin/backup perform --trigger sync_logs --config-file /home/username/Backup/config.rb'

4) On the slave, follow the same process as on the master and set up the backup directory and config.rb. Additionally, create a defaults.rb to store common configurations.

– Create config.rb

  backup generate --databases='mysql' --storages='s3' --compressors='gzip'

– Add email, twitter notification settings in defaults.rb

5) MySQL binlogs will be synced with S3 every half an hour. For this, add half_an_hour.rb to your ‘Backup’ directory. (All of this can be put in config.rb as well, but for the sake of clarity I’m separating the scripts by frequency.)

6) Daily, back up a full copy of the database to local disk. For this, use the daily.rb script.

7) Weekly, store a full copy of the database in S3. Use the weekly.rb script for this purpose.

8) Use Dropbox to store a copy every month. As a bonus, if one of your workplace machines is synced with the Dropbox account, the copy will land on that local machine automatically and you can burn it to disk. Use monthly.rb for this task.

9) Finally, update the crontab on your slave to run your backup scripts at the frequencies you intend.

#sync every 30 minutes
0,30 * * * * cd /home/username/Backup && /usr/bin/backup perform --trigger sync_backup --config-file /home/username/Backup/half_an_hour.rb >> /var/log/cron/cron.log 2>> /var/log/cron/error.log

#backup daily
0 6 * * * cd /home/username/Backup && /usr/bin/backup perform --trigger daily_backup --config-file /home/username/Backup/daily.rb >> /var/log/cron/cron.log 2>> /var/log/cron/error.log

#backup every week
0 6 1,8,15,22 * * cd /home/username/Backup && /usr/bin/backup perform --trigger weekly_backup --config-file /home/username/Backup/weekly.rb >> /var/log/cron/cron.log 2>> /var/log/cron/error.log

#backup monthly
0 6 26 * * cd /home/username/Backup && /usr/bin/backup perform --trigger monthly_backup --config-file /home/username/Backup/monthly.rb >> /var/log/cron/cron.log 2>> /var/log/cron/error.log

So at the end of this process you will have several redundant copies of your database as well as MySQL transactions up to the last 5 minutes. In case of an emergency (e.g. a disastrous SQL query) you can pick the nearest full backup, apply the MySQL binary logs up to the point just before the disaster occurred, and you are good to go.
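
To make that recovery step a bit more concrete, here is a rough Ruby sketch of how one might choose what to restore. It is not part of the Backup gem: the FullBackup and Binlog structs, the restore_plan helper, the file names and the timestamps are all invented for illustration, and it assumes you know the time range each binlog file covers. Actually replaying the logs would be done with something like mysqlbinlog and its --start-datetime / --stop-datetime options.

```ruby
# Hypothetical record types; names and fields are illustrative only.
FullBackup = Struct.new(:path, :taken_at)
Binlog     = Struct.new(:path, :first_event_at, :last_event_at)

# Picks the newest full backup taken before the disaster, plus the binlogs
# that contain events between that backup and the disaster, oldest first.
# Returns nil when no full backup predates the disaster.
def restore_plan(full_backups, binlogs, disaster_at)
  base = full_backups.select { |b| b.taken_at < disaster_at }.max_by(&:taken_at)
  return nil if base.nil?

  logs = binlogs
    .select { |l| l.last_event_at > base.taken_at && l.first_event_at < disaster_at }
    .sort_by(&:first_event_at)

  # start_at/stop_at bound the replay window inside the selected binlogs.
  { base: base, binlogs: logs, start_at: base.taken_at, stop_at: disaster_at }
end

# Example data (invented): two daily dumps and three binlog files.
backups = [
  FullBackup.new("daily/2011-05-01.sql.gz", Time.utc(2011, 5, 1, 6)),
  FullBackup.new("daily/2011-05-02.sql.gz", Time.utc(2011, 5, 2, 6))
]
logs = [
  Binlog.new("mysql-bin.000041", Time.utc(2011, 5, 1, 20), Time.utc(2011, 5, 2, 5)),
  Binlog.new("mysql-bin.000042", Time.utc(2011, 5, 2, 5),  Time.utc(2011, 5, 2, 9)),
  Binlog.new("mysql-bin.000043", Time.utc(2011, 5, 2, 9),  Time.utc(2011, 5, 2, 12))
]

plan = restore_plan(backups, logs, Time.utc(2011, 5, 2, 10, 30))
puts plan[:base].path
puts plan[:binlogs].map(&:path).join(", ")
```

The first binlog selected usually starts before the full backup was taken, which is why the plan carries start_at as well as stop_at: replaying from the beginning of that file would re-apply changes the dump already contains.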

Last but not least, kudos to the team behind the Backup gem. You guys made my life a lot easier!