<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments for Tech Gossips</title>
	<atom:link href="http://mytechgossips.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://mytechgossips.com</link>
	<description>Between two worlds</description>
	<lastBuildDate>Sun, 01 Jan 2012 19:28:22 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14019</link>
		<dc:creator><![CDATA[Laknath]]></dc:creator>
		<pubDate>Sun, 01 Jan 2012 19:28:22 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14019</guid>
		<description><![CDATA[&quot;Replicate your data among hosts in different Availability Zones and Regions from the get-go&quot;

I was trying to make the same point throughout the post. EBS is just another tool that helps achieve that purpose, and you should by no means limit yourself to it. However, compared with relying on an instance store alone, with no EBS volume attached, having EBS with snapshots could give you more options in a failure.]]></description>
		<content:encoded><![CDATA[<p>&#8220;Replicate your data among hosts in different Availability Zones and Regions from the get-go&#8221;</p>
<p>I was trying to make the same point throughout the post. EBS is just another tool that helps achieve that purpose, and you should by no means limit yourself to it. However, compared with relying on an instance store alone, with no EBS volume attached, having EBS with snapshots could give you more options in a failure.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Nathan McCourtney</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14016</link>
		<dc:creator><![CDATA[Nathan McCourtney]]></dc:creator>
		<pubDate>Sat, 31 Dec 2011 19:53:48 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14016</guid>
		<description><![CDATA[I think your faith in EBS is unwarranted. Using EBS just means that your data will persist if an instance terminates unexpectedly.

In the two years I&#039;ve been using AWS in large production environments, random instance termination was the least of our problems. Weird EBS issues can cause horrific outages. So solve both the redundancy and durability problems at the same time: replicate your data among hosts in different Availability Zones and Regions from the get-go.]]></description>
		<content:encoded><![CDATA[<p>I think your faith in EBS is unwarranted. Using EBS just means that your data will persist if an instance terminates unexpectedly.</p>
<p>In the two years I&#8217;ve been using AWS in large production environments, random instance termination was the least of our problems. Weird EBS issues can cause horrific outages. So solve both the redundancy and durability problems at the same time: replicate your data among hosts in different Availability Zones and Regions from the get-go.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14015</link>
		<dc:creator><![CDATA[Laknath]]></dc:creator>
		<pubDate>Sat, 31 Dec 2011 09:24:02 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14015</guid>
		<description><![CDATA[Yes, I&#039;m planning on using Chef or Puppet when scaling our app architecture. Btw, any particular reason for choosing Chef over other configuration management systems such as CFEngine, Bcfg2, Puppet, etc.?]]></description>
		<content:encoded><![CDATA[<p>Yes, I&#8217;m planning on using Chef or Puppet when scaling our app architecture. Btw, any particular reason for choosing Chef over other configuration management systems such as CFEngine, Bcfg2, Puppet, etc.?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14014</link>
		<dc:creator><![CDATA[Laknath]]></dc:creator>
		<pubDate>Sat, 31 Dec 2011 09:13:25 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14014</guid>
		<description><![CDATA[Thanks for sharing your experience. Though it&#039;s hard to build a 100% reliable system, being aware of what has gone wrong or right in other cases gives you a good grasp of what can go wrong, so you can be ready.

The application I was speaking of isn&#039;t at the scale of yours, since it&#039;s not yet open to the public, but all your suggestions are sound and useful, so I updated my post to mention your comment.]]></description>
		<content:encoded><![CDATA[<p>Thanks for sharing your experience. Though it&#8217;s hard to build a 100% reliable system, being aware of what has gone wrong or right in other cases gives you a good grasp of what can go wrong, so you can be ready.</p>
<p>The application I was speaking of isn&#8217;t at the scale of yours, since it&#8217;s not yet open to the public, but all your suggestions are sound and useful, so I updated my post to mention your comment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by kordless</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14012</link>
		<dc:creator><![CDATA[kordless]]></dc:creator>
		<pubDate>Fri, 30 Dec 2011 17:35:33 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14012</guid>
		<description><![CDATA[Sorry to hear of your troubles, thanks for sharing.

Loggly got hit far worse than you did.  We&#039;ve rebooted servers before and they usually come back up.  This time over 95% didn&#039;t, and it took us all the way down and out for the count.  We were down for over 24 hours trying to get our search cluster back online and taking data again.  Our system is neither simple nor easy to bootstrap, and even though we have scripts that start and stop instances of our stack at will, for development or testing, bringing a large, live production cluster back up from zero took us WAY longer than we expected.  We should have planned for it.  We didn&#039;t.  We didn&#039;t expect all our machines to go away.  What we expected was SOME of them to go away, or SOME interruption of service (because it&#039;s the cloud!), but we never expected all of the boxes to be kicked by humans.  All at once.

We should have planned better.

Backing up our database and all customers&#039; logs (up until we went down) &#039;saved&#039; us, but we still suffered data loss during the outage, dropping data that customers were sending in, and we were DOWN and unusable, which is the greatest sin of all.

Given our experiences, I would add a few more points to your list:

6. Test a full deployment of your current architecture, including size/scale, while still running another instance of it - this ensures you have the resources to start it if you need another one (we did not, and found out post-disaster that we could only launch 50 instances in total).  If required and/or possible, test taking data from the production system and teeing it into the new deployment to see if it works properly.
7.  Make sure your deployment management scripts (we use Puppet) work anywhere.  We can launch instances of Loggly on VMs, bare metal, Rackspace, etc.  Test alternate deployments on other AWS regions AND other providers.  You don&#039;t know all the bad things that could happen.  AWS could go completely tits up, and you&#039;d need to fail to ... somewhere.
8.  Regarding point 7, make sure you aren&#039;t architecturally dependent on AWS services.  Where possible, adopt alternate technologies that work across multiple infrastructures.  For example, OpenStack supports an S3-like storage system.  Make sure your stack works with it.
9.  Don&#039;t create technologies in your stack that are hard to scale and/or replicate.  We&#039;ve done that at Loggly because we thought we needed a single search cluster.  We should have sharded customers/inputs/whatever across zones and regions.  That way if part of it goes down, only a few customers are affected.
10.  If you run in the &#039;cloud&#039;, realize you are offloading the responsibility for running infrastructure to someone else.  We expect AWS to be reliable, yet we must limit our expectations because of the limits of the technology and the costs that they must manage.  Running your own boxes may place more responsibility on you for managing them, but will also allow you to better manage the expectations of what can go wrong.]]></description>
		<content:encoded><![CDATA[<p>Sorry to hear of your troubles, thanks for sharing.</p>
<p>Loggly got hit far worse than you did.  We&#8217;ve rebooted servers before and they usually come back up.  This time over 95% didn&#8217;t, and it took us all the way down and out for the count.  We were down for over 24 hours trying to get our search cluster back online and taking data again.  Our system is neither simple nor easy to bootstrap, and even though we have scripts that start and stop instances of our stack at will, for development or testing, bringing a large, live production cluster back up from zero took us WAY longer than we expected.  We should have planned for it.  We didn&#8217;t.  We didn&#8217;t expect all our machines to go away.  What we expected was SOME of them to go away, or SOME interruption of service (because it&#8217;s the cloud!), but we never expected all of the boxes to be kicked by humans.  All at once.</p>
<p>We should have planned better.</p>
<p>Backing up our database and all customers&#8217; logs (up until we went down) &#8216;saved&#8217; us, but we still suffered data loss during the outage, dropping data that customers were sending in, and we were DOWN and unusable, which is the greatest sin of all.</p>
<p>Given our experiences, I would add a few more points to your list:</p>
<p>6. Test a full deployment of your current architecture, including size/scale, while still running another instance of it &#8211; this ensures you have the resources to start it if you need another one (we did not, and found out post-disaster that we could only launch 50 instances in total).  If required and/or possible, test taking data from the production system and teeing it into the new deployment to see if it works properly.<br />
7.  Make sure your deployment management scripts (we use Puppet) work anywhere.  We can launch instances of Loggly on VMs, bare metal, Rackspace, etc.  Test alternate deployments on other AWS regions AND other providers.  You don&#8217;t know all the bad things that could happen.  AWS could go completely tits up, and you&#8217;d need to fail to &#8230; somewhere.<br />
8.  Regarding point 7, make sure you aren&#8217;t architecturally dependent on AWS services.  Where possible, adopt alternate technologies that work across multiple infrastructures.  For example, OpenStack supports an S3-like storage system.  Make sure your stack works with it.<br />
9.  Don&#8217;t create technologies in your stack that are hard to scale and/or replicate.  We&#8217;ve done that at Loggly because we thought we needed a single search cluster.  We should have sharded customers/inputs/whatever across zones and regions.  That way if part of it goes down, only a few customers are affected.<br />
10.  If you run in the &#8216;cloud&#8217;, realize you are offloading the responsibility for running infrastructure to someone else.  We expect AWS to be reliable, yet we must limit our expectations because of the limits of the technology and the costs that they must manage.  Running your own boxes may place more responsibility on you for managing them, but will also allow you to better manage the expectations of what can go wrong.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Rob Harrigan</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14011</link>
		<dc:creator><![CDATA[Rob Harrigan]]></dc:creator>
		<pubDate>Fri, 30 Dec 2011 17:23:20 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14011</guid>
		<description><![CDATA[I recommend using Opscode Chef to bootstrap and bring up instances. This was an absolute lifesaver when we ran into similar reboot issues. New servers can be brought up and loaded with all the necessary packages in minutes, freeing you to jockey backup data around and get the machine(s) back into a ready state.]]></description>
		<content:encoded><![CDATA[<p>I recommend using Opscode Chef to bootstrap and bring up instances. This was an absolute lifesaver when we ran into similar reboot issues. New servers can be brought up and loaded with all the necessary packages in minutes, freeing you to jockey backup data around and get the machine(s) back into a ready state.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by BraveNewCurrency</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14010</link>
		<dc:creator><![CDATA[BraveNewCurrency]]></dc:creator>
		<pubDate>Fri, 30 Dec 2011 16:55:14 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14010</guid>
		<description><![CDATA[I like to think of it this way: In the past, we focused on &quot;MTBF&quot; (Mean Time Between Failures.) We thought it would be a good idea if each computer had as much &quot;uptime&quot; as possible. We spent extra money on Dual Power Supplies, Dual NICs, RAID, dual UPS, yada yada. We paid $20K for a server we could have bought for $2K.

But the server still failed sometimes. One computer can *never* be 100% reliable. The UPS isn&#039;t reliable. The datacenter isn&#039;t reliable. The network isn&#039;t reliable. People aren&#039;t reliable.

Focus on &quot;MTTR&quot; (Mean Time To Recovery) instead. What are you going to do _when_ your server fails? I&#039;ve seen a doctor&#039;s office down for 3 days because it took 8 hours to restore a backup (and they restored the wrong backup twice.)

Here&#039;s a better plan: Buy several $2K servers, and use software to &quot;RAID&quot; them together. When failure happens, recovery should be seamless and automatic. There&#039;s no reason you should be paged in the middle of the night just because some hardware died. Advanced users should be prepared for the whole datacenter/region going down.

Instead of avoiding failure, embrace failure. That&#039;s the cloud way.]]></description>
		<content:encoded><![CDATA[<p>I like to think of it this way: In the past, we focused on &#8220;MTBF&#8221; (Mean Time Between Failures.) We thought it would be a good idea if each computer had as much &#8220;uptime&#8221; as possible. We spent extra money on Dual Power Supplies, Dual NICs, RAID, dual UPS, yada yada. We paid $20K for a server we could have bought for $2K.</p>
<p>But the server still failed sometimes. One computer can *never* be 100% reliable. The UPS isn&#8217;t reliable. The datacenter isn&#8217;t reliable. The network isn&#8217;t reliable. People aren&#8217;t reliable.</p>
<p>Focus on &#8220;MTTR&#8221; (Mean Time To Recovery) instead. What are you going to do _when_ your server fails? I&#8217;ve seen a doctor&#8217;s office down for 3 days because it took 8 hours to restore a backup (and they restored the wrong backup twice.)</p>
<p>Here&#8217;s a better plan: Buy several $2K servers, and use software to &#8220;RAID&#8221; them together. When failure happens, recovery should be seamless and automatic. There&#8217;s no reason you should be paged in the middle of the night just because some hardware died. Advanced users should be prepared for the whole datacenter/region going down.</p>
<p>Instead of avoiding failure, embrace failure. That&#8217;s the cloud way.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Randy</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-14006</link>
		<dc:creator><![CDATA[Randy]]></dc:creator>
		<pubDate>Wed, 28 Dec 2011 00:31:45 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-14006</guid>
		<description><![CDATA[Yeah, it&#039;s too bad that we can&#039;t trust the &quot;experts&quot; who are supposed to know what they are doing. I don&#039;t know a lot about the technical stuff but am learning, and it is surprising how often the &quot;experts&quot; screw up.]]></description>
		<content:encoded><![CDATA[<p>Yeah, it&#8217;s too bad that we can&#8217;t trust the &#8220;experts&#8221; who are supposed to know what they are doing. I don&#8217;t know a lot about the technical stuff but am learning, and it is surprising how often the &#8220;experts&#8221; screw up.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Replication &amp; backups with Ruby by Fixing holes in EC2 reliability &#171; Tech Gossips</title>
		<link>http://mytechgossips.com/2011/09/18/replication-backups-with-ruby/#comment-14003</link>
		<dc:creator><![CDATA[Fixing holes in EC2 reliability &#171; Tech Gossips]]></dc:creator>
		<pubDate>Sat, 24 Dec 2011 02:43:28 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=314#comment-14003</guid>
		<description><![CDATA[[...] monitor setup to make sure replication process is working properly. Also it&#8217;s better to have several layers of backups running so you will have point-in-time recoverable database copy as well as one day old, week old, [...]]]></description>
		<content:encoded><![CDATA[<p>[...] monitor setup to make sure replication process is working properly. Also it&#8217;s better to have several layers of backups running so you will have point-in-time recoverable database copy as well as one day old, week old, [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Steve Jobs effect by AnJay</title>
		<link>http://mytechgossips.com/2011/10/07/steve-jobs-effect/#comment-13998</link>
		<dc:creator><![CDATA[AnJay]]></dc:creator>
		<pubDate>Thu, 08 Dec 2011 12:36:18 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=364#comment-13998</guid>
		<description><![CDATA[Even his long time rival Bill Gates had decent things to say.

Read more: The Steve Jobs Effect &#124;]]></description>
		<content:encoded><![CDATA[<p>Even his long time rival Bill Gates had decent things to say.</p>
<p>Read more: The Steve Jobs Effect |</p>
]]></content:encoded>
	</item>
</channel>
</rss>

