<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for Tech Gossips</title>
	<atom:link href="http://mytechgossips.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://mytechgossips.com</link>
	<description>Between two worlds</description>
	<lastBuildDate>Fri, 20 Apr 2012 05:42:36 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>Comment on Machine Learning in SaaS paradigm by laknath</title>
		<link>http://mytechgossips.com/2012/03/12/machine-learning-in-saas-paradigm/#comment-1859</link>
		<dc:creator>laknath</dc:creator>
		<pubDate>Fri, 20 Apr 2012 05:42:36 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=388#comment-1859</guid>
		<description>Cool. I like how your solution has a custom model creation option, which differs from Google Prediction API&#039;s generalized approach.</description>
		<content:encoded><![CDATA[<p>Cool. I like how your solution has a custom model creation option, which differs from Google Prediction API&#8217;s generalized approach.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Machine Learning in SaaS paradigm by Francisco J Martin</title>
		<link>http://mytechgossips.com/2012/03/12/machine-learning-in-saas-paradigm/#comment-1857</link>
		<dc:creator>Francisco J Martin</dc:creator>
		<pubDate>Fri, 20 Apr 2012 05:02:05 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=388#comment-1857</guid>
		<description>Great post!  As you mention: &quot;large players are using numerous machine learning techniques to enhance various aspects of their workflows&quot; but a number of startups will soon democratize Machine Learning and offer extremely easy ways for small and midsize companies to learn from their data. We are executing on that vision at BigML!</description>
		<content:encoded><![CDATA[<p>Great post!  As you mention: &#8220;large players are using numerous machine learning techniques to enhance various aspects of their workflows&#8221; but a number of startups will soon democratize Machine Learning and offer extremely easy ways for small and midsize companies to learn from their data. We are executing on that vision at BigML!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Machine Learning in SaaS paradigm by Greg</title>
		<link>http://mytechgossips.com/2012/03/12/machine-learning-in-saas-paradigm/#comment-476</link>
		<dc:creator>Greg</dc:creator>
		<pubDate>Tue, 13 Mar 2012 12:29:31 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=388#comment-476</guid>
		<description>Love the post. I have a similar background and completed Andrew Ng&#039;s ML course last year. I&#039;m still trying to figure out how to turn this knowledge into a startup opportunity. Everything I&#039;ve come up with so far is a solution looking for a problem. It&#039;s frustrating, as I have a feeling that something has been happening in the ML area in the last 5-7 years, after 20 or so years of dead ends.
Greg</description>
		<content:encoded><![CDATA[<p>Love the post. I have a similar background and completed Andrew Ng&#8217;s ML course last year. I&#8217;m still trying to figure out how to turn this knowledge into a startup opportunity. Everything I&#8217;ve come up with so far is a solution looking for a problem. It&#8217;s frustrating, as I have a feeling that something has been happening in the ML area in the last 5-7 years, after 20 or so years of dead ends.<br />
Greg</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-300</link>
		<dc:creator>Laknath</dc:creator>
		<pubDate>Sun, 01 Jan 2012 19:28:22 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-300</guid>
		<description>&quot;Replicate your data among hosts in different Availability Zones and Regions from the get-go&quot;

I was trying to make the same point throughout the post. EBS is just another tool that helps achieve that purpose, and you should by no means be limited to it. However, rather than having only an instance store without any EBS attached, having EBS with snapshots could give you more options in a failure.</description>
		<content:encoded><![CDATA[<p>&#8220;Replicate your data among hosts in different Availability Zones and Regions from the get-go&#8221;</p>
<p>I was trying to make the same point throughout the post. EBS is just another tool that helps achieve that purpose, and you should by no means be limited to it. However, rather than having only an instance store without any EBS attached, having EBS with snapshots could give you more options in a failure.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Nathan McCourtney</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-299</link>
		<dc:creator>Nathan McCourtney</dc:creator>
		<pubDate>Sat, 31 Dec 2011 19:53:48 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-299</guid>
		<description>I think your faith in EBS is unwarranted.  Using EBS just means that it&#039;ll persist if an instance terminates unexpectedly.  

In the two years I&#039;ve been using AWS in large production environments, random instance termination was the least of our problems.  Weird EBS issues can cause horrific outages.  So solve both the redundancy and durability problems at the same time:   replicate your data among hosts in different Availability Zones and Regions from the get-go.</description>
		<content:encoded><![CDATA[<p>I think your faith in EBS is unwarranted.  Using EBS just means that it&#8217;ll persist if an instance terminates unexpectedly.  </p>
<p>In the two years I&#8217;ve been using AWS in large production environments, random instance termination was the least of our problems.  Weird EBS issues can cause horrific outages.  So solve both the redundancy and durability problems at the same time:   replicate your data among hosts in different Availability Zones and Regions from the get-go.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-298</link>
		<dc:creator>Laknath</dc:creator>
		<pubDate>Sat, 31 Dec 2011 09:24:02 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-298</guid>
		<description>Yes, I&#039;m planning on using Chef or Puppet when scaling our app architecture. Btw, any particular reason for choosing Chef over other configuration management systems such as CFEngine, Bcfg2, Puppet, etc.?</description>
		<content:encoded><![CDATA[<p>Yes, I&#8217;m planning on using Chef or Puppet when scaling our app architecture. Btw, any particular reason for choosing Chef over other configuration management systems such as CFEngine, Bcfg2, Puppet, etc.?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Laknath</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-297</link>
		<dc:creator>Laknath</dc:creator>
		<pubDate>Sat, 31 Dec 2011 09:13:25 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-297</guid>
		<description>Thanks for sharing your experience. Though it&#039;s hard to build a 100% reliable system, being aware of what has gone wrong or right in other cases gives a good grasp of what can go wrong and how to be ready.

The application I was speaking of hasn&#039;t scaled to the magnitude of your case since it&#039;s not yet open to the public, but all your suggestions are sound and useful, so I updated my post to mention your comment.</description>
		<content:encoded><![CDATA[<p>Thanks for sharing your experience. Though it&#8217;s hard to build a 100% reliable system, being aware of what has gone wrong or right in other cases gives a good grasp of what can go wrong and how to be ready.</p>
<p>The application I was speaking of hasn&#8217;t scaled to the magnitude of your case since it&#8217;s not yet open to the public, but all your suggestions are sound and useful, so I updated my post to mention your comment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by kordless</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-296</link>
		<dc:creator>kordless</dc:creator>
		<pubDate>Fri, 30 Dec 2011 17:35:33 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-296</guid>
		<description>Sorry to hear of your troubles, thanks for sharing.

Loggly got hit far worse than you did.  We&#039;ve rebooted servers before and they usually come back up.  This time over 95% didn&#039;t, and it took us all the way down and out for the count.  We were down for over 24 hours trying to get our search cluster back online and taking data again.  Our system is neither simple nor easy to bootstrap, and even though we have scripts that start and stop instances of our stack at will, for development or testing, bringing a large, live production cluster back up from zero took us WAY longer than we expected.  We should have planned for it.  We didn&#039;t.  We didn&#039;t expect all our machines to go away.  What we expected was SOME of them to go away, or SOME interruption of service (because it&#039;s the cloud!), but we never expected all of the boxes to be kicked by humans.  All at once.

We should have planned better.

Backing up our database and all customers&#039; logs (up until we went down) &#039;saved&#039; us, but we still suffered data loss during the outage, dropping data that customers were sending in, and we were DOWN and unusable, which is the greatest sin of all.

Given our experiences, I would add a few more points to your list:

6. Test a full deployment of your current architecture, including size/scale, while still running another instance of it - this ensures you have the resources to start it if you need another one (we did not, and found out post-disaster we could only launch 50 total instances).  If required and/or possible, test taking data from the production system and teeing it into the new deployment to see if it works properly.
7.  Make sure your deployment management scripts (we use Puppet) work anywhere.  We can launch instances of Loggly on VMs, bare metal, Rackspace, etc.  Test alternate deployments on other AWS regions AND other providers.  You don&#039;t know all the bad things that could be.  AWS could go completely tits up, and you&#039;d need to fail to ... somewhere.
8.  Regarding point 7, make sure you aren&#039;t architecturally dependent on AWS services.  Where possible, adopt alternate technologies that work across multiple infrastructures.  For example, OpenStack supports an S3-like storage system.  Make sure your stack works with it.
9.  Don&#039;t create technologies in your stack that are hard to scale and/or replicate.  We&#039;ve done that at Loggly because we thought we needed a single search cluster.  We should have sharded customers/inputs/whatever across zones and regions.  That way if part of it goes down, only a few customers are affected.
10.  If you run in the &#039;cloud&#039;, realize you are offloading the responsibility for running infrastructure to someone else.  We expect AWS to be reliable, yet we are limited in our expectations because of limits in the technology and costs that they must manage.  Running your own boxes may place more responsibility on you for managing them, but will also allow you to better manage the expectations of what can go wrong.</description>
		<content:encoded><![CDATA[<p>Sorry to hear of your troubles, thanks for sharing.</p>
<p>Loggly got hit far worse than you did.  We&#8217;ve rebooted servers before and they usually come back up.  This time over 95% didn&#8217;t, and it took us all the way down and out for the count.  We were down for over 24 hours trying to get our search cluster back online and taking data again.  Our system is neither simple nor easy to bootstrap, and even though we have scripts that start and stop instances of our stack at will, for development or testing, bringing a large, live production cluster back up from zero took us WAY longer than we expected.  We should have planned for it.  We didn&#8217;t.  We didn&#8217;t expect all our machines to go away.  What we expected was SOME of them to go away, or SOME interruption of service (because it&#8217;s the cloud!), but we never expected all of the boxes to be kicked by humans.  All at once.</p>
<p>We should have planned better.</p>
<p>Backing up our database and all customers&#8217; logs (up until we went down) &#8216;saved&#8217; us, but we still suffered data loss during the outage, dropping data that customers were sending in, and we were DOWN and unusable, which is the greatest sin of all.</p>
<p>Given our experiences, I would add a few more points to your list:</p>
<p>6. Test a full deployment of your current architecture, including size/scale, while still running another instance of it &#8211; this ensures you have the resources to start it if you need another one (we did not, and found out post-disaster we could only launch 50 total instances).  If required and/or possible, test taking data from the production system and teeing it into the new deployment to see if it works properly.<br />
7.  Make sure your deployment management scripts (we use Puppet) work anywhere.  We can launch instances of Loggly on VMs, bare metal, Rackspace, etc.  Test alternate deployments on other AWS regions AND other providers.  You don&#8217;t know all the bad things that could be.  AWS could go completely tits up, and you&#8217;d need to fail to &#8230; somewhere.<br />
8.  Regarding point 7, make sure you aren&#8217;t architecturally dependent on AWS services.  Where possible, adopt alternate technologies that work across multiple infrastructures.  For example, OpenStack supports an S3-like storage system.  Make sure your stack works with it.<br />
9.  Don&#8217;t create technologies in your stack that are hard to scale and/or replicate.  We&#8217;ve done that at Loggly because we thought we needed a single search cluster.  We should have sharded customers/inputs/whatever across zones and regions.  That way if part of it goes down, only a few customers are affected.<br />
10.  If you run in the &#8216;cloud&#8217;, realize you are offloading the responsibility for running infrastructure to someone else.  We expect AWS to be reliable, yet we are limited in our expectations because of limits in the technology and costs that they must manage.  Running your own boxes may place more responsibility on you for managing them, but will also allow you to better manage the expectations of what can go wrong.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by Rob Harrigan</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-295</link>
		<dc:creator>Rob Harrigan</dc:creator>
		<pubDate>Fri, 30 Dec 2011 17:23:20 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-295</guid>
		<description>I recommend using Opscode Chef to bootstrap and bring up instances. This was an absolute lifesaver when we ran into similar reboot issues. New servers can be brought up and loaded with all the necessary packages in minutes, freeing you to jockey backup data around and get the machine(s) back into a ready state.</description>
		<content:encoded><![CDATA[<p>I recommend using Opscode Chef to bootstrap and bring up instances. This was an absolute lifesaver when we ran into similar reboot issues. New servers can be brought up and loaded with all the necessary packages in minutes, freeing you to jockey backup data around and get the machine(s) back into a ready state.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Fixing holes in EC2 reliability by BraveNewCurrency</title>
		<link>http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/#comment-294</link>
		<dc:creator>BraveNewCurrency</dc:creator>
		<pubDate>Fri, 30 Dec 2011 16:55:14 +0000</pubDate>
		<guid isPermaLink="false">http://mytechgossips.com/?p=394#comment-294</guid>
		<description>I like to think of it this way: In the past, we focused on &quot;MTBF&quot; (Mean Time Between Failures.) We thought it would be a good idea if each computer had as much &quot;uptime&quot; as possible. We spent extra money on Dual Power Supplies, Dual NICs, RAID, dual UPS, yada yada. We paid $20K for a server we could have bought for $2K.

But the server still failed sometimes. One computer can *never* be 100% reliable. The UPS isn&#039;t reliable. The datacenter isn&#039;t reliable. The network isn&#039;t reliable. People aren&#039;t reliable.

Focus on &quot;MTTR&quot; (Mean Time To Recovery) instead. What are you going to do _when_ your server fails? I&#039;ve seen a doctor&#039;s office down for 3 days because it took 8 hours to restore a backup (and they restored the wrong backup twice.)

Here&#039;s a better plan: Buy several $2K servers, and use software to &quot;RAID&quot; them together. When failure happens, recovery should be seamless and automatic. There&#039;s no reason you should be paged in the middle of the night just because some hardware died. Advanced users should be prepared for the whole datacenter/region going down.

Instead of avoiding failure, embrace failure. That&#039;s the cloud way.</description>
		<content:encoded><![CDATA[<p>I like to think of it this way: In the past, we focused on &#8220;MTBF&#8221; (Mean Time Between Failures.) We thought it would be a good idea if each computer had as much &#8220;uptime&#8221; as possible. We spent extra money on Dual Power Supplies, Dual NICs, RAID, dual UPS, yada yada. We paid $20K for a server we could have bought for $2K.</p>
<p>But the server still failed sometimes. One computer can *never* be 100% reliable. The UPS isn&#8217;t reliable. The datacenter isn&#8217;t reliable. The network isn&#8217;t reliable. People aren&#8217;t reliable.</p>
<p>Focus on &#8220;MTTR&#8221; (Mean Time To Recovery) instead. What are you going to do _when_ your server fails? I&#8217;ve seen a doctor&#8217;s office down for 3 days because it took 8 hours to restore a backup (and they restored the wrong backup twice.)</p>
<p>Here&#8217;s a better plan: Buy several $2K servers, and use software to &#8220;RAID&#8221; them together. When failure happens, recovery should be seamless and automatic. There&#8217;s no reason you should be paged in the middle of the night just because some hardware died. Advanced users should be prepared for the whole datacenter/region going down.</p>
<p>Instead of avoiding failure, embrace failure. That&#8217;s the cloud way.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

