<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Kitchen Soap</title>
	
	<link>http://www.kitchensoap.com</link>
	<description>Thoughts on capacity planning and web operations.</description>
	<lastBuildDate>Sat, 27 Feb 2010 20:23:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/KitchenSoap" /><feedburner:info uri="kitchensoap" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:browserFriendly></feedburner:browserFriendly><item>
		<title>Agile Executive Podcast</title>
		<link>http://www.kitchensoap.com/2010/02/12/agile-executive-podcast/</link>
		<comments>http://www.kitchensoap.com/2010/02/12/agile-executive-podcast/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 14:47:58 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=408</guid>
		<description><![CDATA[Yesterday I was on a podcast with Andrew Shafer and Michael Coté, and we talked about development and operations cooperation. I rambled a bit, like I tend to do.
Andrew brought up something that&#8217;s disturbing, and I&#8217;ve seen elsewhere, which is that after seeing our presentation last year at Velocity, some folks decided that we somehow [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I was on a <a href="http://www.redmonk.com/cote/2010/02/11/agileexec008/" target="_blank">podcast</a> with <a href="http://stochasticresonance.wordpress.com/" target="_blank">Andrew Shafer</a> and <a href="http://www.redmonk.com/cote/" target="_blank">Michael Coté</a>, and we talked about development and operations cooperation. I rambled a bit, like I tend to do.</p>
<p>Andrew brought up something disturbing that I&#8217;ve also seen <a href="http://news.ycombinator.com/item?id=1068098" target="_blank">elsewhere</a>: after seeing our presentation last year at Velocity, some folks decided that we had somehow endorsed the idea of pushing your code whenever you want and letting the &#8216;ops guys&#8217; deal with whatever comes as a result. That isn&#8217;t at all what we suggested, and it runs pretty much against the ideas of cooperation and communication between the dev and ops teams. I talk a bit about this in the podcast.</p>
<p>You have to <em>prove</em> that pushing whenever you want is an ok (safe, secure, etc.) thing to do. And the minute you can&#8217;t prove it, and you decide to continue that way&#8230;.IMHO: you&#8217;re doing it wrong. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/02/12/agile-executive-podcast/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Need some FUDforum consulting done</title>
		<link>http://www.kitchensoap.com/2010/02/09/need-some-fudforum-consulting-done/</link>
		<comments>http://www.kitchensoap.com/2010/02/09/need-some-fudforum-consulting-done/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 13:41:40 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=406</guid>
		<description><![CDATA[I&#8217;ve been helping out a friend for some years with running a decent-size discussion forum. It&#8217;s running on a little (512mb of RAM) dedicated server and it&#8217;s outgrown the box it&#8217;s on. It needs to move to a new machine, which is all ready to take it.
Problem is, it&#8217;s in a twisty-maze of dependencies. It&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been helping out a friend for some years with running a decent-size discussion forum. It&#8217;s running on a little (512mb of RAM) dedicated server and it&#8217;s outgrown the box it&#8217;s on. It needs to move to a new machine, which is all ready to take it.</p>
<p>Problem is, it&#8217;s in a twisty-maze of dependencies. It&#8217;s running FUDforum <span>2.6.4RC1, on MySQL 3.23, on RedHat 9 (!). It needs to somehow get backed up, moved, and upgraded to latest FUDforum (3.0.0) and MySQL 5, on the new machine.</span></p>
<p><span>It&#8217;s not 100% straightforward, needs someone who&#8217;s done this before, and someone who isn&#8217;t me, because of the new job and all. </span></p>
<p><span>If you know someone who can help out, please email me where my email address is jallspaw which is located on a server whose domain name is yahoo.com.</span></p>
<p><span>Thanks!</span></p>
<p><span><strong>UPDATE: I found a guy, and he&#8217;s great with FUDforum. Excellent! Thanks to all those who emailed!</strong><br />
</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/02/09/need-some-fudforum-consulting-done/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Deployment is just a part of dev/ops cooperation, not the whole thing</title>
		<link>http://www.kitchensoap.com/2009/12/12/devops-cooperation-doesnt-just-happen-with-deployment/</link>
		<comments>http://www.kitchensoap.com/2009/12/12/devops-cooperation-doesnt-just-happen-with-deployment/#comments</comments>
		<pubDate>Sun, 13 Dec 2009 03:14:41 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=372</guid>
		<description><![CDATA[Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a conference in Europe dedicated to it. While I do think that there&#8217;s still a lot more that is to be discussed around this idea of cooperation and mixing of [...]]]></description>
			<content:encoded><![CDATA[<p>Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a <a href="http://devopsdays.org" target="_blank">conference in Europe dedicated to it</a>. While I do think that there&#8217;s still a lot more to be discussed around this idea of cooperation and mixing of approaches, this is a Very Good Thing™.</p>
<p>In what <a href="http://stochasticresonance.wordpress.com/" target="_blank">Andrew</a> has called &#8216;<a href="http://www.slideshare.net/littleidea/agile-infra-agileroots-2009" target="_blank">boundary objects</a>&#8216;, deployment of new code has been a rallying point for the devops crowd, and I think that&#8217;s great. Deployment is definitely one of the places where the rubber meets the road. In some organizations, deployment of new code can be one of the most stressful and divisive parts of their work. People get fired or quit because of the emotional baggage that can come with an event that, in the worst case, is nothing more than a planned outage disguised as progress, followed by a finger-pointing session. Some groups have such dysfunction that they might as well just not even deploy the code.  Just skip that part, head into a conference room, and fight bareknuckle. Toxic would be the nice way of describing those environments.</p>
<p>So it&#8217;s no wonder that a lot of the emphasis in this growing &#8220;devops&#8221; community is on deployment. Whether it&#8217;s providing confidence in changes with rigorous testing, deploying small changes often, dark launching, feature flags, or building a one-button deploy system &#8211; any effort to reduce the risk of change should be considered mandatory, IMHO.</p>
<p>But at the same time, deployment is only just a  <em>part</em> of what really makes a great environment for development and operations to collaborate. Really. It&#8217;s not just about developers collaborating on deployment and releases. It&#8217;s about both teams understanding each other&#8217;s responsibilities <strong>after </strong>code is deployed to production, and collaborating along the areas of their expertise in a way that&#8217;s constructive.</p>
<p>Good Operations teams already write code, just not usually user-facing code. They spend a good deal of their time writing code to gather information from the infrastructure and act on it with short, medium, or long-term goals, usually aimed at performance and availability.</p>
<p>I&#8217;ll say that things like:</p>
<ul>
<li>metrics collection</li>
<li>monitoring and associated thresholds</li>
<li>load-feedback behavior</li>
<li>instrumentation</li>
<li>fault tolerance</li>
</ul>
<p>should also be considered boundary objects between development and ops.</p>
<p>This is some of what I mean by that:</p>
<p><strong>Metrics collection</strong></p>
<p>I&#8217;ve said this before, but <a href="http://www.kitchensoap.com/2009/05/10/context-and-operational-metrics/" target="_blank">context is absolutely everything</a>. Application-level or feature-level metrics are what give the missing context to in-the-box resource usage like CPU, disk, memory, or network. At Flickr, the ops group maintains a number of different platforms for gathering metrics, like ganglia. To make it easy to add metrics, some of our backend applications will just write a temp file with key-value pairs that we want to have squirted into ganglia.  Like:</p>
<blockquote><p>image_processed=30</p>
<p>image_processing_time=5</p></blockquote>
<p>and ganglia&#8217;s gmetric cron job will pick that up every minute with the key as the metric name, and the value as, well, the value.</p>
<p>This means that all developers have to do is drop that file into an expected location and it will do the right thing. No tickets for making a new metric, no need for writing yet another script to gather a single metric, no need to understand the intricacies of whatever metrics collection system you have.</p>
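<p>To make the drop-file idea concrete, here is a minimal sketch in Python. It is not Flickr&#8217;s actual collector: the file name <code>app_metrics.txt</code>, the helper names, and the choice of <code>uint32</code> as the metric type are all assumptions for illustration. It parses the key=value lines and builds the <code>gmetric</code> command lines that a cron job could run, without executing them.</p>

```python
import tempfile
from pathlib import Path


def parse_metrics(text):
    """Parse key=value lines into a dict of metric name to string value."""
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        metrics[key.strip()] = value.strip()
    return metrics


def gmetric_commands(metrics, metric_type="uint32"):
    """Build one gmetric argv list per metric; nothing is executed here."""
    return [
        ["gmetric", "--name", name, "--value", value, "--type", metric_type]
        for name, value in sorted(metrics.items())
    ]


if __name__ == "__main__":
    # Hypothetical drop location; a real setup would agree on a fixed path.
    with tempfile.TemporaryDirectory() as tmp:
        drop = Path(tmp) / "app_metrics.txt"
        drop.write_text("image_processed=30\nimage_processing_time=5\n")
        for argv in gmetric_commands(parse_metrics(drop.read_text())):
            print(" ".join(argv))
```

<p>A cron wrapper could pass each list to <code>subprocess.run</code>; building the commands separately from running them keeps the parsing easy to test on a box without ganglia installed.</p>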
<p>That&#8217;s an example of technical collaboration between the two groups. The missing piece is the cultural part: the developer communicating their motivation behind getting these in-app metrics gathered and put on a graph. This gives the metric context, and might give ops some ideas on how they could use the metric for monitoring, capacity, or other purposes.</p>
<p><strong>Monitoring</strong></p>
<p>Involving development in designing your monitoring system can help provide a great perspective on failure modes. Peer code reviews are common in software development, so why shouldn&#8217;t monitors be reviewed? It&#8217;s still code, and it&#8217;s going to provide your humans (and maybe machines) with the data needed to fail gracefully, heal itself, or inform developers on what their constraints are when building new things. Your monitoring system is just like your code in that it should always be evolving, alongside your growth.</p>
<p>Remember all the <a href="http://www.watchingwebsites.com/archives/google-analytics-alerts-the-start-of-a-complete-view" target="_blank">raves</a> about Google Analytics adding &#8220;intelligence&#8221; and alerts? Having some notion of thresholds isn&#8217;t just for people answering pages from nagios, it&#8217;s for everyone. How else can you gauge your expectations and guide future modifications to your code with respect to resource usage?</p>
<p><strong>Load feedback behavior</strong></p>
<p>Like a lot of smart web infrastructures, we&#8217;ve built an <a href="http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/" target="_blank">offline tasks system</a>, which will asynchronously run jobs on our data that don&#8217;t have to be real-time.  If you haven&#8217;t read <a href="http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/" target="_blank">Myles&#8217; post</a> on it, you really should. It&#8217;s a huge part of our strategy to avoid pretty common scalability pitfalls.</p>
<p>Anyway, these tasks, which can be relatively hard on the databases (which is one of the reasons why we do them asynchronously in the first place), have some built-in feedback mechanisms: they&#8217;ll check if there&#8217;s an unreasonably high number of concurrent MySQL connections, or if the database shard master-master pair doesn&#8217;t have both servers in production, or can otherwise detect that what they&#8217;re trying to do on the database is too harsh at the moment. Whether it&#8217;s because of current live traffic being high, or a loss of redundancy, the offline task system will stop what it&#8217;s doing and re-queue it for later. This is a great (and safe) way of schmearing out heavy loads over a longer time period, reducing their risk.</p>
<p>Throw in some metrics collection about the size of those queues, and monitor alerts to do something for low or high-water mark thresholds, and then you&#8217;re cookin&#8217; with gas.</p>
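<p>The feedback check described above can be sketched roughly like this. It is an illustration only, not the real offline task system: the <code>ShardHealth</code> type, the <code>MAX_CONNECTIONS</code> threshold of 200, and the function names are hypothetical, and a real version would read connection counts and shard state from the database tier.</p>

```python
from collections import deque
from dataclasses import dataclass

MAX_CONNECTIONS = 200  # hypothetical threshold, tune per shard


@dataclass
class ShardHealth:
    connections: int            # concurrent MySQL connections on the shard
    masters_in_production: int  # 2 when the master-master pair is whole


def safe_to_run(health, max_connections=MAX_CONNECTIONS):
    """True only when the shard is fully redundant and not too busy."""
    if health.connections > max_connections:
        return False
    if health.masters_in_production != 2:
        return False
    return True


def run_or_requeue(task, health, queue):
    """Run the task now if the shard looks healthy, else put it back for later."""
    if safe_to_run(health):
        return task()
    queue.append(task)
    return None


if __name__ == "__main__":
    pending = deque()
    busy = ShardHealth(connections=500, masters_in_production=2)
    run_or_requeue(lambda: "resized", busy, pending)
    # The queue depth is the number worth graphing and alerting on.
    print("queued tasks:", len(pending))
```

<p>The length of <code>pending</code> is the kind of value that could be written out as a metric and compared against low and high-water marks.</p>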
<p><strong>Instrumentation</strong></p>
<p>Through the magic of <a href="http://php.net/manual/en/function.apache-note.php" target="_blank">apache notes</a>, developers can send extremely useful bits from within php code to the access and error logs. At Flickr, we&#8217;ve got some pretty simple notes set to help track things down when there are issues. For example, when I load the page for my photostream, the log line looks something like:</p>
<blockquote><p>www394 123.456.789.012 <span style="color: #ff0000;">5555</span> 173663 [14/Dec/2009:04:08:21 +0000] &#8220;GET /photos/allspaw HTTP/1.1&#8243; &#8211; 200 18233 &#8220;-&#8221; &#8220;Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3&#8243; &#8211; -</p></blockquote>
<p>where <span style="color: #ff0000;">5555</span> is my user id. Since php knows you&#8217;re logged in when you view a certain page, there&#8217;s no reason why we shouldn&#8217;t just log that in the request, so if there are any user-specific issues, it&#8217;s not a needle in a haystack.</p>
<p>Another example is API requests. We&#8217;ll log the api key making the call along with the authenticated user id, even in POST requests. Being able to trace a bullet through the entire request and response via logs is obviously handy. Putting user ids, API methods, and API key specific info into log lines is hugely helpful when troubleshooting issues, especially if you&#8217;re running one of the <a href="http://www.programmableweb.com/apis/directory/1?sort=mashups" target="_blank">most popular APIs on the web</a>.</p>
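<p>Once the user id is in the log line, pulling out one user&#8217;s requests is a few lines of scripting. The sketch below is in Python and assumes a whitespace-separated layout like the sample line above, with the user id as the third field; the field names and helper functions are made up for illustration, and the real log format may carry more than this.</p>

```python
SAMPLE = (
    'www394 123.456.789.012 5555 173663 [14/Dec/2009:04:08:21 +0000] '
    '"GET /photos/allspaw HTTP/1.1" - 200 18233'
)


def parse_line(line):
    """Split a log line into the leading fields this sketch cares about."""
    fields = line.split()
    # Assumed layout: web host, client address, user id, then everything else.
    if len(fields) >= 3:
        return {
            "host": fields[0],
            "client": fields[1],
            "user_id": fields[2],
            "rest": " ".join(fields[3:]),
        }
    return None


def lines_for_user(lines, user_id):
    """Return only the log lines whose user id field matches."""
    matches = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed["user_id"] == user_id:
            matches.append(line)
    return matches


if __name__ == "__main__":
    other = SAMPLE.replace(" 5555 ", " 1234 ")
    for hit in lines_for_user([SAMPLE, other], "5555"):
        print(hit)
```

<p>The same approach extends to API keys or method names if they are logged as their own fields.</p>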
<p><strong>Fault Tolerance</strong></p>
<p>Ross blogged about how we do <a href="http://code.flickr.com/blog/2009/12/02/flipping-out/" target="_blank">feature flipping</a> last week. He goes over how important (and awesome) this is to our development process, but another one of the advantages of this approach is how it affects operations.</p>
<p>This is an example of development taking an active role not only in deployment, but also in spending the time and effort to <em>operationalize</em> features and pieces of code so that in cases of degradation or failure, these individual pieces can be forced to fail gracefully. Our talk at Velocity last year went over some of this, but it&#8217;s still one of the reasons why we can push code thousands of times a year and still have an extremely low MTTR whenever there&#8217;s an issue.</p>
<blockquote><p>New code causing degradation? There&#8217;s an app for that! (it&#8217;s called a feature flag)</p></blockquote>
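<p>As a rough illustration of what forcing a piece to fail gracefully can look like, here is a small Python sketch. It is not the mechanism from Ross&#8217;s post: the flag names, the <code>FLAGS</code> dict, and the page sections are hypothetical. The idea is that an unknown or disabled flag simply leaves its feature out while the rest of the page still renders.</p>

```python
# Hypothetical flag names; in practice these would live in shared config.
FLAGS = {"photo_search": True, "related_tags": False}


def is_enabled(flag, flags=FLAGS):
    """Unknown flags count as off, so a missing entry degrades safely."""
    return bool(flags.get(flag, False))


def render_page(flags=FLAGS):
    """Build the list of page sections, skipping anything flagged off."""
    sections = ["photostream"]  # the core page always renders
    if is_enabled("related_tags", flags):
        sections.append("related_tags")
    if is_enabled("photo_search", flags):
        sections.append("photo_search")
    return sections


if __name__ == "__main__":
    print(render_page())
    # Ops flips a flag off during degradation; the page still renders.
    print(render_page({"photo_search": False, "related_tags": False}))
```

<p>Because the check is a plain lookup, turning a feature off is a config change rather than a code deploy.</p>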
<p>Anyway, my point is that deployment is only a small part of how development and operations should collaborate and communicate. In fact, dev and ops is only the most obvious starting point for getting along and working together on problems.</p>
<p>Product and community management also have important boundary objects with operations as well, but that&#8217;s for another blog post. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/12/12/devops-cooperation-doesnt-just-happen-with-deployment/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The epicenter of the web, and NYC</title>
		<link>http://www.kitchensoap.com/2009/12/03/360/</link>
		<comments>http://www.kitchensoap.com/2009/12/03/360/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 23:47:42 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Random]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=360</guid>
		<description><![CDATA[One of my apprehensions in moving to New York from San Francisco was a common concern: why would I move from the &#8216;epicenter&#8217; of the web to a place where it&#8217;s not? There&#8217;s been lots written about startup hub cities, and innovative web metro areas, but the fact of the matter is that New York [...]]]></description>
			<content:encoded><![CDATA[<p>One of my apprehensions in moving to New York from San Francisco was a common concern: why would I move from the &#8216;epicenter&#8217; of the web to a place where it&#8217;s not? There&#8217;s been lots <a href="http://www.avc.com/a_vc/2006/05/replicating_sil.html" target="_blank">written</a> about startup hub cities, and innovative web metro areas, but the fact of the matter is that New York hasn&#8217;t historically been a hotbed of web growth and innovation. Not compared to the Bay Area or Seattle, anyway.</p>
<p>I do, of course, think this has been changing recently. The punch line is that I obviously did <a href="http://www.kitchensoap.com/2009/11/18/from-one-door-to-another/" target="_blank">take the job</a>, despite my misgivings about not being surrounded by people who are constantly thinking about my industry. One of the reasons I got over not being in the &#8216;epicenter&#8217; is that <a href="http://www.avc.com" target="_blank">Fred Wilson</a> and <a href="http://continuations.com/" target="_blank">Albert Wenger</a> did an insanely good job of convincing me it was a good idea. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Another reason is that I think Etsy is basically a Bay Area company that just happens to be in Brooklyn. I mean that as a compliment.</p>
<p>So while I always had some inkling of what &#8216;epicenter of the web&#8217; means, I was never really sure how that could be measured. Indeed.com has indirectly measured it by the <a href="http://www.indeed.com/jobtrends/information-technology-industry" target="_blank"># of job listings</a>.  O&#8217;Reilly did something similar for the <a href="http://radar.oreilly.com/2006/06/startup-centers.html" target="_blank"># of startup jobs in 2006.</a></p>
<p>Number of jobs is interesting, but I thought it might be fun to measure it by locations of headquarters as seen through the lens of monthly unique users. So, I took the <a href="http://www.quantcast.com/top-sites-1" target="_blank">Quantcast &#8220;Top 100&#8243;</a> sites, found the latitude and longitude of the headquarters of each site via <a href="http://www.crunchbase.com/help/api" target="_blank">Crunchbase&#8217;s API</a>, as well as other bits around the web, and <a href="http://www.aaronland.info/weblog/" target="_blank">Aaron</a> helped out with the excellent <a href="http://modestmaps.com/" target="_blank">Modest Maps</a> to make this:</p>
<div class="wp-caption alignnone" style="width: 500px">
	<a href="http://www.flickr.com/photos/straup/4155793319/in/set-72157622926803950/"><img title="North America" src="http://farm3.static.flickr.com/2568/4155793319_e5e2c6bb7b.jpg" alt="Quantcast Top 100 plotted on U.S. Map, radius = monthly uniques" width="500" height="313" /></a>
	<p class="wp-caption-text">Quantcast Top 100 plotted on U.S. Map, radius = monthly uniques</p>
</div>
<p>Like I said, this doesn&#8217;t change my thoughts about the new job, or what I think &#8216;epicenter of the web&#8217; means. But, still interesting, dontcha think?</p>
<p><strong>UPDATE</strong>: Here&#8217;s a link to the raw data: <a href="http://spreadsheets.google.com/pub?key=tLwD1C5mghn9U3XJj_yqyjw&amp;output=html" target="_blank">http://spreadsheets.google.com/pub?key=tLwD1C5mghn9U3XJj_yqyjw&amp;output=html</a></p>
<p>If there&#8217;s anything wrong, lemme know. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/12/03/360/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>From one door to another</title>
		<link>http://www.kitchensoap.com/2009/11/18/from-one-door-to-another/</link>
		<comments>http://www.kitchensoap.com/2009/11/18/from-one-door-to-another/#comments</comments>
		<pubDate>Thu, 19 Nov 2009 05:21:32 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=336</guid>
		<description><![CDATA[Last week I gave 2 months&#8217; notice &#8211; I&#8217;ll be leaving Flickr in January.
When Stew and Cat asked me to join Flickr in January of 2005, I felt like it was time to go and do something different, so I said yes.
Five years (and four billion photos) later, it&#8217;s again time to go and do [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I gave 2 months&#8217; notice &#8211; I&#8217;ll be leaving Flickr in January.</p>
<p>When Stew and Cat asked me to join Flickr in January of 2005, I felt like it was time to go and do something different, so I said yes.</p>
<p>Five years (and four billion photos) later, it&#8217;s again time to go and do something different. It&#8217;s hard for me to describe what a blast this has been. Our <a href="http://ludicorp.com/about.php" target="_blank">goal</a> was to kick ass, and I think we did. Flickr has served as the  backdrop of some of the largest changes in my life, and the work I&#8217;ve done there is essentially tied to those events in my memory.</p>
<p>During my time here at Flickr, I:</p>
<ul>
<li>moved house</li>
<li>saw the company get <a href="http://blog.flickr.net/en/2005/03/20/yahoo-actually-does-acquire-flickr/" target="_blank">bought</a> by Yahoo!, and worked out that whole transition thing</li>
<li>got <a href="http://www.flickr.com/photos/eekaroo/14650744/in/set-358128/" target="_blank">married</a></li>
<li>had a <a href="http://www.flickr.com/photos/allspaw/sets/72157594173557758/" target="_blank">daughter</a></li>
<li>co-invented a pretty <a href="http://faceball.org/press/" target="_blank">well-received</a> <a href="http://faceball.org/" target="_blank">office sport</a></li>
<li>wrote a <a href="http://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/0596518579" target="_blank">book</a></li>
<li>had a <a href="http://www.flickr.com/photos/allspaw/sets/72157607504797325/" target="_blank">son</a></li>
</ul>
<p>In addition to building, scaling, evolving, and generally being as loud and fast as we could possibly be with the original <a href="http://ludicorp.com/" target="_blank">Ludicorp</a> team, I had the absolute privilege to hire and work in the trenches with some of the greatest people on the web. I also had the chance to work with some of the smartest people at Yahoo, who I&#8217;ll continue to have relationships with even after I leave. Yahoo has treated me well, and I&#8217;ve learned more here than I have at any other company.</p>
<p>The reason I stayed here for five years wasn&#8217;t for the accolades (or the vesting). It was because I worked with people who <strong><em>care</em></strong> about building something that people <em><strong>care</strong></em> about.</p>
<p>This also happens to be the same reason why I chose my next step: <a href="http://www.etsy.com" target="_blank">Etsy</a>. They care, and it shows.</p>
<p style="text-align: left;">I still have a little more time here at Flickr to rock a bit more, but I&#8217;m excited to work with my friend <a href="http://www.chaddickerson.com/about.html" target="_blank">Chad</a> again on <a href="http://radar.oreilly.com/2009/01/work-on-stuff-that-matters-fir.html" target="_blank">something that matters</a>. I&#8217;ll be running the Ops group there, where they&#8217;ve already got superstars.</p>
<p style="text-align: left;">Chad wrote some more about it <a href="http://www.etsy.com/storque/etsy-news/john-allspaw-joins-the-etsy-team-6183/" target="_blank">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/11/18/from-one-door-to-another/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>How Complex Systems Fail: A WebOps Perspective</title>
		<link>http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/</link>
		<comments>http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 22:39:05 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=326</guid>
		<description><![CDATA[I guess I&#8217;m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent.
Let me start with this: I don&#8217;t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via [...]]]></description>
			<content:encoded><![CDATA[<p>I guess I&#8217;m late on getting to this, but <a href="http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a> by <a href="http://www.ctlab.org/Cook.cfm" target="_blank">Richard Cook</a> is excellent.</p>
<p>Let me start with this: I don&#8217;t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @<a href="http://twitter.com/benjaminblack" target="_blank">benjaminblack</a>, and I agree with him 100%: this should be considered <em><strong>required reading</strong></em> for anyone in our industry. I&#8217;m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>There are a number of salient points in the paper that I&#8217;d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:</p>
<blockquote><p><strong>7) Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.</strong></p></blockquote>
<p>I&#8217;m going to guess that this portion may be viewed as controversial in the prevailing webops wisdom, where post-mortems are for sure necessary, but whose content may or may not be effective in preventing similar types of failure. I <em>do</em> value the process of a post-mortem, because I think the human element of understanding complex failures is important and doing whatever you can to put in place safety is good, modulo what is said in section #16 of the paper. I believe that even a rudimentary process of &#8220;<a href="http://www.startuplessonslearned.com/2009/07/how-to-conduct-five-whys-root-cause.html" target="_blank">5 Whys</a>&#8221; has value. But at the same time, I also think that there is something in the spirit of this paragraph, which is that there is a danger in standing behind a single underlying cause when there are systemic failures involved. Doing this can lead to the false belief that you&#8217;ve got this mode covered, you&#8217;ve found the silver bullet that made the whole mountain crumble, and jeez what a relief because <em><strong>that</strong></em> will never bite us again.</p>
<blockquote><p><strong>14) Change introduces new forms of failure.</strong></p></blockquote>
<p>I totally agree with this point. However, I often see this as a rallying point for operations teams to say &#8220;No!&#8221; to change, when instead they should be working alongside development (and product owners) with a goal of <em>reducing</em> the risk of failure associated with each change. I do not believe that &#8216;release early, release often&#8217; in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical <em>and</em> cultural. But I&#8217;ve <a href="http://velocityconference.blip.tv/file/2284377/" target="_blank">spoken about this before</a>.</p>
<blockquote><p><strong>16) Safety is a characteristic of systems and not of their components</strong></p></blockquote>
<p>Emphasis on <em>&#8220;Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.&#8221; </em>Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.</p>
<p>and I think the point I love the most, with all of my heart:</p>
<blockquote><p><strong>18) Failure free operations require experience with failure.</strong></p></blockquote>
<p>Fear is a strong emotion. I believe it can be used as a strong motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation. Use it to guide your architecture, your code, your infrastructure. So <em>lean into it.</em></p>
<p>There are a lot of great points in the paper, and I could go on, but you get the idea.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>When you deploy: your internal monologue</title>
		<link>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/</link>
		<comments>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 22:22:33 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=318</guid>
		<description><![CDATA[The minimum cycle of questions you should be asking yourself. As brought up by @debuggist and @benjaminblack.

]]></description>
			<content:encoded><![CDATA[<p>The minimum cycle of questions you should be asking yourself. As brought up by <a href="http://twitter.com/debuggist" target="_blank">@debuggist</a> and <a href="http://twitter.com/benjaminblack" target="_blank">@benjaminblack</a>.</p>
<p><a href="http://www.kitchensoap.com/wp-content/uploads/2009/10/InternalMonologue.png"><img class="alignnone size-full wp-image-319" style="border: 1px solid black;" title="What you might want to ask yourself before you deploy changes to production?" src="http://www.kitchensoap.com/wp-content/uploads/2009/10/InternalMonologue.png" alt="What you might want to ask yourself before you deploy changes to production?" width="724" height="547" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Meanwhile: More Meta-Metrics</title>
		<link>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/</link>
		<comments>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 17:50:26 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Tools]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=292</guid>
		<description><![CDATA[Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as&#8230;
What:
&#8230;did we do before (historical trending, etc)
&#8230;is going [...]]]></description>
			<content:encoded><![CDATA[<p>Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as&#8230;</p>
<p>What:</p>
<p style="padding-left: 30px;">&#8230;did we do before (historical trending, etc)<br />
&#8230;is going on right now? (troubleshooting, health, etc.)<br />
&#8230;is coming down the road (capacity planning, new feature adoption, etc.)<br />
&#8230;can we do to make things better (business intelligence, user-behavior, etc.)</p>
<p>All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!</p>
<p>Some time ago, Matthias wrote a great <a title="Agile Web Operations" href="http://www.agileweboperations.com/visible-ops-continuous-improvement/" target="_blank">blog post</a> about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the <a title="VisibleOps" href="http://www.itpi.org/home/visibleops.php" target="_blank">ITIL primer, VisibleOps</a>.</p>
<p>In my opinion, there&#8217;s nothing on that list of things that isn&#8217;t valuable, as long as gathering those metrics isn&#8217;t too expensive behaviorally, technically, or organizationally. The topics included in that list of metrics, and the context they live in, are fodder for many, many blog posts.</p>
<p>But in the category of historical trending, I&#8217;m more and more fascinated by gathering what I&#8217;ll call &#8220;meta-metrics&#8221;, which is data about how you respond to the changes your system is experiencing.</p>
<p>One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We&#8217;ve been tracking the volume of alerts a lot closer recently, and even with the level of automation we&#8217;ve got at Flickr, it&#8217;s still something you have to keep on top of, especially if you&#8217;re always finding new things to measure and alert on.</p>
<p>Now ideally, you have an alerting system that only communicates conditions that require action by a human. Which means every alert is critically important, and you&#8217;re not ignoring or dismissing any pages for any reasons that sound like <em>&#8220;oh, that&#8217;s ok, that cluster always does that&#8230;it&#8217;ll clear up, I&#8217;ll just acknowledge the page so I can shut up nagios.&#8221;</em> In other words, our goal is to have a zero-noise alerting system. Which means that <em>all</em> alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn&#8217;t easily captured by robots.</p>
<p>Why is this important to us? I may be stating the obvious, but it&#8217;s because interrupting humans with alerts that don&#8217;t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.</p>
<p>Of course in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it&#8217;s impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing to do for us. In fact, I think it&#8217;s so important that it&#8217;s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.</p>
<p>Something like this: (made-up numbers)</p>
<div id="attachment_295" class="wp-caption alignnone" style="width: 300px">
	<a href="http://www.kitchensoap.com/wp-content/uploads/2009/10/Alerts-Mockup.png"><img class="size-medium wp-image-295" title="Tracking Critical Alerts" src="http://www.kitchensoap.com/wp-content/uploads/2009/10/Alerts-Mockup-300x206.png" alt="Tracking Critical Alerts " width="300" height="206" /></a>
	<p class="wp-caption-text">Tracking Critical Alerts </p>
</div>
<p>Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:</p>
<ul>
<li> How many critical alerts are sent on a daily/hourly/weekly basis?</li>
<li> What does a time histogram of the alerts look like? Do you get more or fewer alerts during nighttime or non-peak hours?</li>
<li>How much (if any) correlation is there between critical alerts and:</li>
</ul>
<blockquote style="padding-left: 30px;"><p>- code deploys?<br />
- software upgrades?<br />
- feature launches?<br />
- open API abuse?</p></blockquote>
<ul>
<li> What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?</li>
</ul>
<p>and maybe the most important ones:</p>
<ul>
<li> How many of those alerts aren&#8217;t actually critical or don&#8217;t demand human attention?</li>
<li> How many of them always self-recover?</li>
<li> How many (and which) don&#8217;t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?</li>
</ul>
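<p>As a rough illustration of the questions above (a sketch only, not the tooling mentioned in this post): the Python below assumes alerts arrive as Nagios-style <code>SERVICE ALERT</code> log lines. The sample lines, the function names, and the five-minute &#8220;quick recovery&#8221; window are all made up for the example, and a fast CRITICAL-to-OK flip is only a hint that a check self-recovers, not proof that nobody touched it.</p>

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample in the style of Nagios "SERVICE ALERT" log lines:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
SAMPLE_LOG = """\
[1254708000] SERVICE ALERT: www12;HTTP;CRITICAL;HARD;3;Connection refused
[1254708120] SERVICE ALERT: www12;HTTP;OK;HARD;3;HTTP OK
[1254711600] SERVICE ALERT: db3;MySQL;CRITICAL;HARD;3;Too many connections
[1254715200] SERVICE ALERT: www12;HTTP;CRITICAL;HARD;3;Connection refused
[1254715260] SERVICE ALERT: www12;HTTP;OK;HARD;3;HTTP OK
"""


def parse_alerts(text):
    """Turn SERVICE ALERT lines into dicts; other lines are skipped."""
    alerts = []
    for line in text.splitlines():
        stamp, sep, rest = line.partition("] SERVICE ALERT: ")
        if not sep:
            continue
        host, service, state, state_type, _attempt, _output = rest.split(";", 5)
        when = datetime.fromtimestamp(int(stamp.lstrip("[")), tz=timezone.utc)
        alerts.append({"when": when, "host": host, "service": service,
                       "state": state, "state_type": state_type})
    return alerts


def is_hard_critical(alert):
    return alert["state"] == "CRITICAL" and alert["state_type"] == "HARD"


def criticals_by_hour(alerts):
    """Time histogram: hour of day (UTC) mapped to number of hard criticals."""
    return Counter(a["when"].hour for a in alerts if is_hard_critical(a))


def criticals_by_service(alerts):
    """Breakdown: (host, service) mapped to number of hard criticals."""
    return Counter((a["host"], a["service"]) for a in alerts if is_hard_critical(a))


def quick_recoveries(alerts, window_seconds=300):
    """Count criticals that went back to OK within window_seconds.

    Heuristic only: a quick recovery suggests, but does not prove,
    that the alert cleared without human action.
    """
    open_since = {}
    recovered = Counter()
    for a in sorted(alerts, key=lambda item: item["when"]):
        key = (a["host"], a["service"])
        if is_hard_critical(a):
            open_since.setdefault(key, a["when"])
        elif a["state"] == "OK" and key in open_since:
            elapsed = (a["when"] - open_since.pop(key)).total_seconds()
            if window_seconds >= elapsed:
                recovered[key] += 1
    return recovered


if __name__ == "__main__":
    parsed = parse_alerts(SAMPLE_LOG)
    print("by hour:", dict(criticals_by_hour(parsed)))
    print("by service:", dict(criticals_by_service(parsed)))
    print("quick recoveries:", dict(quick_recoveries(parsed)))
```

<p>Dividing the quick-recovery count for a check by its total critical count gives a crude per-check noise ratio; checks near 100% are candidates for a longer retry interval or an aggregate check.</p>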
<p>We&#8217;ve built our own stuff to track and analyze these things. I&#8217;m not aware of any open-source tool that is dedicated to analyzing these metrics, so my question to the community is: does one exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I&#8217;m not aware of any comprehensive crunching. Of course, until I find one, we&#8217;re just building our own.</p>
<p>Thoughts, lazyweb?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>WebOps: Good prep for becoming a new parent?</title>
		<link>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/</link>
		<comments>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 04:23:36 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=281</guid>
		<description><![CDATA[I think I&#8217;ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it&#8217;s pretty often that I&#8217;m reminded of similarities:
High availability
Having redundant infrastructure is WebOps 101. For my kids&#8217; most prized [...]]]></description>
			<content:encoded><![CDATA[<p>I think I&#8217;ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it&#8217;s pretty often that I&#8217;m reminded of similarities:</p>
<p><em><strong>High availability</strong></em></p>
<p>Having redundant infrastructure is WebOps 101. For my kids&#8217; most prized possessions, their sleeping <a title="Dollies" href="http://www.flickr.com/photos/eekaroo/3361150569/" target="_blank">&#8216;loveys&#8217;</a>, there is no reason to have a <a title="Single Point of Failure" href="http://en.wikipedia.org/wiki/Single_Point_of_Failure" target="_blank">SPOF</a>, under any circumstances. We have at least 4 backups for each on any trip that we go on, as well as a couple of trusted stuffed animals who might meet unfortunate fates.</p>
<p><em><strong>Capacity planning</strong></em></p>
<p>This applies to both disposable diapers (a.k.a.<em> consumable capacity</em>) and episodes of the few TV shows we allow them to watch, on the TiVo. My daughter, at 3 and a half, knows every detail from every one of the 49 episodes of <a title="The Backyardigans" href="http://www.google.com/url?sa=t&amp;source=web&amp;ct=res&amp;cd=1&amp;url=http%3A%2F%2Fwww.nickjr.com%2Fshows%2Fbackyardigans%2Findex.jhtml&amp;ei=LNjCSoypKZOCsgOQqPTuAg&amp;usg=AFQjCNFUuMBPdoxeunE6pvhpJtEtG1WSSw&amp;sig2=x2bYLViPeXoS70pEK6shww" target="_blank">The Backyardigans.</a> Having some of them on iPods and iPhones can make a 6 hour drive to L.A. feel like 4, not 12.</p>
<p><em><strong>Documentation</strong></em></p>
<p>Since I&#8217;m already used to writing down observations and techniques learned &#8216;in the field&#8217;, I was totally prepared:</p>
<div class="wp-caption alignnone" style="width: 500px">
	<a href="http://www.flickr.com/photos/allspaw/2592579909/"><img title="Allspaw Baby Soothing Method, v1" src="http://farm4.static.flickr.com/3205/2592579909_a5d8b25bb9.jpg" alt="Allspaw Baby Soothing Method, v1" width="500" height="327" /></a>
	<p class="wp-caption-text">Allspaw Baby Soothing Method, v1</p>
</div>
<p>and in case I ever forgot what my most successful swaddling method was:</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="300" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="flashvars" value="intl_lang=en-us&amp;photo_secret=de8c6a5027&amp;photo_id=2554081561&amp;flickr_show_info_box=true" /><param name="bgcolor" value="#000000" /><param name="allowFullScreen" value="true" /><param name="src" value="http://www.flickr.com/apps/video/stewart.swf?v=71377" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="400" height="300" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" allowfullscreen="true" bgcolor="#000000" flashvars="intl_lang=en-us&amp;photo_secret=de8c6a5027&amp;photo_id=2554081561&amp;flickr_show_info_box=true"></embed></object><br />
</p>
<p><em><strong>Architecture and design</strong></em></p>
<p>It&#8217;s unfortunate that I was so sleep-deprived that I never got a photo of the RadioShack remote-control truck that I turned into a cam-driven <a title="Moses basket" href="http://www.flickr.com/photos/nathanleland/2596474846/" target="_blank">Moses basket</a> automatic rocker mechanism. But you <a href="http://boingboing.net/2009/08/26/scripting-a-pc-cd-tr.html">understand what I&#8217;m talking about</a>.</p>
<p>There is one other thing that I learned from working at Flickr which turned out to be useful new parent advice: expect the unexpected, and never rely on past behaviors as an indication of what can happen in the future. They&#8217;re kids, not applications. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Automated Control paper by the RAD Lab folks</title>
		<link>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/</link>
		<comments>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/#comments</comments>
		<pubDate>Sat, 01 Aug 2009 22:32:11 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=271</guid>
		<description><![CDATA[Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the First Workshop on Automated Control for Datacenters and Clouds (ACDC09) and jeez it looked like it was a good time, from a glance at the program.
One of the cooler papers is &#8220;Automatic exploration of datacenter performance regimes&#8221; in [...]]]></description>
			<content:encoded><![CDATA[<p>Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the <a href="http://www.cs.duke.edu/nicl/acdc09/" target="_blank">First Workshop on Automated Control for Datacenters and Clouds (ACDC09)</a> and jeez it looked like it was a good time, from a glance at the <a href="http://www.cs.duke.edu/nicl/acdc09/program.html" target="_blank">program</a>.</p>
<p>One of the cooler papers is <a href="http://portal.acm.org/citation.cfm?id=1555271.1555273" target="_blank">&#8220;Automatic exploration of datacenter performance regimes&#8221;</a> in which the smart folks over at the <a href="http://radlab.cs.berkeley.edu/" target="_blank">RAD Lab</a> at UCB tackle the idea of:</p>
<ol>
<li>Gathering up real usage metrics in production</li>
<li>Taking that data to feed a resource allocation (&#8220;auto-scaling&#8221;) controller</li>
</ol>
<p>The bit about coming up with an <em>exploration policy</em> is where the juicy stuff comes in, building in safety factors driven by external SLAs. You should read the whole thing to see how thoughtful their method was, which includes taking into account effects such as cold ramping, which you almost never see accounted for in simulated situations.  Rock on, RAD Lab: this is the stuff that brings academic smarts to the real world. Kudos.</p>
<p><em>FYI: I&#8217;m not just saying the paper is cool because they cite my book as a resource in it. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>
