<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>tomkleinpeter.com</title>
	
	<link>http://www.tomkleinpeter.com</link>
	<description />
	<lastBuildDate>Sat, 23 Jan 2010 18:32:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/Spiteful" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="spiteful" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Where Are the AB Testing Frameworks?</title>
		<link>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/</link>
		<comments>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 23:18:49 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=69</guid>
		<description><![CDATA[I read news.yc and reddit/programming pretty regularly to keep up with what is going on in the biz.  Based on that reading, I can probably name a dozen different systems for building high scale applications (distributed storage, message queues, caching layers, search engines, etc), but I can&#8217;t name a single AB testing framework other [...]]]></description>
			<content:encoded><![CDATA[<p>I read <a href="http://news.ycombinator.com">news.yc</a> and <a href="http://www.reddit.com/r/programming">reddit/programming</a> pretty regularly to keep up with what is going on in the biz.  Based on that reading, I can probably name a dozen different systems for building high scale applications (distributed storage, message queues, caching layers, search engines, etc), but I can&#8217;t name a single AB testing framework other than <a href="https://www.google.com/analytics/siteopt">Google Website Optimizer</a>.  That seems like a serious inversion of priorities for most startups.  Everyone with a sign up page should use AB testing.  Not everyone needs a message queue.</p>
<p>Is this because:</p>
<ul>
<li>Nobody needs anything other than Google Website Optimizer?</li>
<li>Startups don&#8217;t actually do AB testing, possibly because they don&#8217;t get enough traffic to get meaningful results, or maybe because they don&#8217;t have time?</li>
<li>AB testing (including the statistical analysis to determine if results are valid) is so simple that everyone just bangs out their own?</li>
<li>As a largely theoretical issue for most startups, scalability is more fun to talk about on the Internet?</li>
<li>Everyone that is using AB testing is so happy that they are trying to suppress information about it so their competitors don&#8217;t start doing it too?</li>
</ul>
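<p>On the "so simple that everyone just bangs out their own" possibility: the core significance check really can be small. As a rough sketch (not taken from any particular framework), a two-proportion z-test in Python might look like the following. The function name and the sample counts are made up for illustration.</p>

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that A and B convert equally well.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 100 of 1000 sign-ups on page A, 150 of 1000 on page B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

<p>With those made-up counts, z comes out a little above 3 and the p-value lands well under 0.01. The hard parts of AB testing (assigning users consistently, getting enough traffic, not peeking early) are not captured in a sketch like this.</p>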
<p>If everyone is secretly using some great framework please shoot me an email and let me know.</p>
<p>If you haven&#8217;t thought much about it before, here is <a href="http://exp-platform.com/Documents/GuideControlledExperiments.pdf">a short paper on AB testing</a> from some folks that made Amazon a ton of money.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Two and a Half Months of Twitter</title>
		<link>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/</link>
		<comments>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/#comments</comments>
		<pubDate>Sat, 20 Sep 2008 20:10:44 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=68</guid>
		<description><![CDATA[After a few months of playing around with Twitter, the service is really growing on me.  The ability to have casual IM-ish conversations without any immediacy is nice.  Also, having a place to record short thoughts and interesting links that other people might like scratches some sort of itch for me.  I [...]]]></description>
			<content:encoded><![CDATA[<p>After a few months of playing around with <a href="http://www.twitter.com/tklein">Twitter</a>, the service is really growing on me.  The ability to have casual IM-ish conversations without any immediacy is nice.  Also, having a place to record short thoughts and interesting links that other people might like scratches some sort of itch for me.  I wouldn&#8217;t want to write up a whole blog post for any of these, but they were all interesting enough to post on twitter:</p>
<ul>
<li>A clever proposal from Google: <a href="http://groups.google.com/group/SDCH">Shared Dictionary Compression over HTTP</a></li>
<li><a href="http://technet.microsoft.com/en-us/sysinternals/bb897561.aspx">Cacheset</a> &#8211; a tool for clearing the Windows disk cache (useful for testing cold starts).</li>
<li>Fun fact: the Tesla Roadster carries <a href="http://www.teslamotors.com/blog4/?p=68">3 milligrams of electrons</a> when fully charged.</li>
<li>The ultimate Airplane on a Treadmill debate resource: <a href="http://www.airplaneonatreadmill.com/">www.airplaneonatreadmill.com</a></li>
<li>A 728-ton <a href="http://blog.longnow.org/2008/06/25/728-ton-pendulum/">tuned mass damper</a> in a skyscraper</li>
</ul>
<p>But I don&#8217;t think I&#8217;ve reached the critical mass of followers necessary to really unlock the Q&#038;A potential of the site.  Having a few hundred technical folks all following each other would be a tremendously useful resource for everyone involved.  For example, I&#8217;m considering upgrading my desktop to 8 or 16 GB of RAM.  I&#8217;m going to need a new motherboard, processor, and RAM.  My normal approach for this would be to spend a few hours on Newegg and the hardware review sites trying to figure out where the price/performance curve is and making sure I&#8217;m not getting ripped off.  If someone else has already done this research, it would be nice to use their information as a starting point, and Twitter provides the kind of free-form conversation necessary for that kind of sharing.  </p>
<p>To really make this work, you need to run one of the desktop apps so you don&#8217;t have to constantly reload the website (I use <a href="http://www.twhirl.org/">Twhirl</a>). </p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Next Gen Productivity Monitoring Software</title>
		<link>http://www.tomkleinpeter.com/2008/08/25/next-gen-productivity-monitoring-software/</link>
		<comments>http://www.tomkleinpeter.com/2008/08/25/next-gen-productivity-monitoring-software/#comments</comments>
		<pubDate>Mon, 25 Aug 2008 16:58:45 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Neat Ideas]]></category>
		<category><![CDATA[intolerance]]></category>
		<category><![CDATA[productivity]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=67</guid>
		<description><![CDATA[Now that I have a new baby, it is even more important to me that the time I spend in front of the computer is spent efficiently and productively. I’ve played around with productivity-monitoring software like RescueTime and TimeSnapper, and they provide a convenient way to record how I wasted my day.  It’s a [...]]]></description>
			<content:encoded><![CDATA[<p>Now that I have a new baby, it is even more important to me that the time I spend in front of the computer is spent efficiently and productively. I’ve played around with productivity-monitoring software like <a href="http://www.rescuetime.com/">RescueTime</a> and <a href="http://www.timesnapper.com/">TimeSnapper</a>, and they provide a convenient way to record how I wasted my day.  It’s a nice first step, but I’d like to see this class of application expand into 3 new areas: positive feedback, targeted recommendations, and an attention API.</p>
<p><strong>Positive Feedback</strong><br />
Being told that I only spent 10% of my day doing work is good to know, but getting a low number might depress me rather than motivate me.  I suggest a system that actually rewards me when I have a killer day or a great week.   For example, I give the service $25 or $50 up front, and after I meet some sort of goal it buys me something off my Amazon wish list.  </p>
<p>Wouldn’t that be neat?  You’re having a good week, and suddenly a book you want shows up at your door.  The key to this is making the rewards <a href="http://serendip.brynmawr.edu/bb/neuro/neuro05/web1/isiddiqui.html">somewhat random</a>: </p>
<blockquote><p>Several studies have been conducted which targeted neural response to rewards.  The results were unanimous in the fact that when one performed an action over and over again, and was given a reward randomly, dopamine levels rose.  If the reward was given consistently, i.e. every four times the action was performed, the dopamine levels remained constant.</p></blockquote>
<p>A slight variation that might work better: for each contiguous block of productivity over a certain length, you have a chance of earning a credit toward a purchase.  After N credits, the service automatically buys and sends you the item.  Structuring it like this would make the feedback more rapid and allow for a little burst of dopamine each time you get an email saying you earned a credit.  Isn&#8217;t this why MMORPGs are so much fun?</p>
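<p>To make the credit idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the 45-minute threshold, the 30% chance, the five credits per reward, and the function name are all made up, and no real service works this way as far as I know.</p>

```python
import random

def count_rewards(block_minutes, min_block=45, credit_chance=0.3,
                  credits_per_reward=5, rng=None):
    """Count how many rewards a list of productive blocks would earn."""
    rng = rng or random.Random()
    credits = 0
    rewards = 0
    for minutes in block_minutes:
        # Only sufficiently long blocks qualify, and even then the credit
        # is awarded at random rather than on a fixed schedule.
        if minutes >= min_block and credit_chance > rng.random():
            credits += 1
            if credits == credits_per_reward:
                rewards += 1   # this is where the service would buy the item
                credits = 0
    return rewards

# A hypothetical week of uninterrupted work blocks, in minutes.
week = [50, 20, 90, 45, 60, 30, 120, 75, 48, 55]
print(count_rewards(week, rng=random.Random(2008)))
```

<p>The random draw is the important part: a fixed "every fourth block" schedule would be the consistent reward that the quoted studies say leaves dopamine levels flat.</p>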
<p>A program to help you get addicted to work is either terrifying or a big win.  Either way, it would be really neat to try.</p>
<p><strong>Targeted Recommendations</strong><br />
Some software is just better than the default stuff that ships with Windows.  For example, I like TextPad and Paint.NET a lot more than Notepad and MS Paint.  I’ve also been pleased with my switch from Bloglines to Google Reader, and from web-based Twitter to Twhirl.  If a program spends all day monitoring my activity, it would be a cinch for it to recommend the tools and websites that are considered “best in class.”  </p>
<p>There is obvious potential here in terms of sponsored recommendations, but it would be nice to see those separated out from community or editor controlled listings.  Recommendations could be driven by some sort of wiki, which would make for all kinds of interesting fights over things like whether Google Reader is better than Bloglines.  Any recommendation could also come with an estimate of the number of people that are currently using it, which would help the cream rise to the top.</p>
<p>Ultimately, it’s not just about how much time you spend slogging away – making good use of computer time is an important dimension of productivity as well.</p>
<p>A slight variation on this idea is to recommend something like <a href="https://addons.mozilla.org/en-US/firefox/addon/4476">LeechBlock</a> if the user is spending too much time on the web.    </p>
<p><strong>Attention API</strong><br />
I know I’ve seen this idea somewhere before, but because things like RescueTime are actually in a position to make it happen, I’m going to mention it here.  Interruptions are usually bad, but there are some times that they are worse than others.  If I’ve been focused on Visual Studio and WinDbg for 30 minutes with no breaks, I’m almost certainly in that fascinating “<a href="http://en.wikipedia.org/wiki/Flow_(psychology)">flow</a>” state, and I’m going to be angry if I get an IM or (even worse) if some random app asks me to download a new version.  </p>
<p>To deal with this kind of thing, it would be great to have a standard for publishing my current tolerance for interruptions, just like IM apps publish my presence.  Both desktop apps and remote users could use this to determine if what they want to tell me is important enough to interrupt me.  Of course, this only works if apps pay attention to it,  but first we would need some apps that can accurately measure it.  I&#8217;m not terribly good with naming things, but unless someone has something better, I’m going to suggest calling this value your “inTolerance”.</p>
<p>So there you go.  One idea to help you spend more time being productive, another to help you make better use of the time you are actually working, and a third to keep you from getting interrupted.  Anyone want to go and implement this stuff?  I’ll be happy to beta test it for you.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/08/25/next-gen-productivity-monitoring-software/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Netflix Prize Concept + Google 411 Data</title>
		<link>http://www.tomkleinpeter.com/2008/08/13/netflix-prize-concept-google-411-data/</link>
		<comments>http://www.tomkleinpeter.com/2008/08/13/netflix-prize-concept-google-411-data/#comments</comments>
		<pubDate>Wed, 13 Aug 2008 19:15:14 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Neat Ideas]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[ideas]]></category>
		<category><![CDATA[prizes]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=66</guid>
		<description><![CDATA[I’ve really enjoyed watching the Netflix Prize develop.  Amazingly, over 3600 teams have submitted a prediction, which makes Netflix the big winner in this contest.  The company will undoubtedly end up with a better product due to the amount of interest and research in collaborative filtering they have generated.  
But ultimately, better [...]]]></description>
			<content:encoded><![CDATA[<p>I’ve really enjoyed watching the <a href="http://www.netflixprize.com/">Netflix Prize</a> develop.  Amazingly, over 3600 teams have submitted a prediction, which makes Netflix the big winner in this contest.  The company will undoubtedly end up with a better product due to the amount of interest and research in collaborative filtering they have generated.  </p>
<p>But ultimately, better movie recommendations don’t matter a whole lot to me.  I’m more interested in the fact that by providing a unique set of data and a prize, they’ve been able to stimulate so much interest.  The other day I was thinking about which companies are in a position to sponsor contests in other fields that might have a bigger impact on my life, and one thought jumped into my head – Google’s 411 phoneme collection service.  <a href="http://www.infoworld.com/archives/emailPrint.jsp?R=printThis&#038;A=/article/07/10/23/Google-wants-your-phonemes_1.html">Marissa Mayer says</a>:</p>
<blockquote><p>
You may have heard about our [directory assistance] 1-800-GOOG-411 service. Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model &#8230; that we can use for all kinds of different things, including video search.</p>
<p>The speech recognition experts that we have say: If you want us to build a really robust speech model, we need a lot of phonemes, which is a syllable as spoken by a particular voice with a particular intonation. So we need a lot of people talking, saying things so that we can ultimately train off of that</p></blockquote>
<p>Presumably, Google has already done the heavy lifting to manually transcribe a large number of these samples so that they can train their own algorithms.  Why not create a contest that lets teams submit an algorithm that gets trained on a subset of the data and then tested against the rest?  Speech recognition is more complicated than movie recommendations, but making it easy to train and test an algorithm against an interesting number of samples would certainly lower the barrier to entry.  </p>
<p>Google would benefit from this in hiring, if nothing else.  It would give them a chance to realistically evaluate the work of all kinds of grad students and researchers, and demonstrate to the candidates the advantages of working for the company with the biggest databases.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/08/13/netflix-prize-concept-google-411-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Handling Human Error In the Datacenter</title>
		<link>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/</link>
		<comments>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 19:21:19 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[startups]]></category>
		<category><![CDATA[uptime]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=64</guid>
		<description><![CDATA[When I was working on Live Mesh at Microsoft, I had the good fortune to meet James Hamilton.  James is full of good ideas, many of which are captured in his paper “On Designing and Deploying Internet-Scale Services.”  There is a lot of wisdom in those pages (Greg Linden had some thoughts on [...]]]></description>
			<content:encoded><![CDATA[<p>When I was working on Live Mesh at Microsoft, I had the good fortune to meet <a href="http://perspectives.mvdirona.com/">James Hamilton</a>.  James is full of good ideas, many of which are captured in his paper <a href="http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf">“On Designing and Deploying Internet-Scale Services.”</a>  There is a lot of wisdom in those pages (Greg Linden had <a href="http://glinden.blogspot.com/2008/03/designing-for-internet-scale.html">some thoughts on it</a>), but I’d like to focus in on this snippet in particular:</p>
<blockquote><p>Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction. </p></blockquote>
<p>Yes, designing the system to never need human interaction is a <a href="http://www.25hoursaday.com/weblog/2008/08/11/ManagingLargeWebServerFarmsMicrosoftsAutoPilot.aspx">great ideal to shoot for</a>, but when you are working for a startup with three guys and a dozen servers, you don’t have the resources or the justification to do it from Day 1.  It is entirely likely that your business model will <a href="http://teddziuba.com/2008/04/im-going-to-scale-my-foot-up-y.html">fail</a> before you lose a single disk.  And since backend refinements don’t pay the bills at a small scale, something with a pair of hands is going to be interacting with your system until you get enough people and servers to justify more automation.  </p>
<blockquote><p>These events will happen and operator error under these circumstances is a common source of catastrophic data loss.</p></blockquote>
<p>That is a wonderfully simple and accurate summary of How Bad Things Happen in your datacenter.  It starts when you lose a hard drive or MySQL crashes, and you have to promote your slave until you can check the master tables, or anything painful but routine happens.  But then, as you are trying to fix things, you notice, for example, that you are almost out of disk space.  When you start trying to fix more than one problem under pressure, you are entering a world of pain.  </p>
<p>The big issue here, as James points out, is that you are going to do something wrong.  You’ll probably use a much stronger word than “wrong” once it is all over, but let’s settle on “stupid” for right now.  </p>
<p>It won&#8217;t feel stupid until after you hit “enter,” but when you are making unfamiliar decisions quickly under pressure, you are extremely likely to overlook something.  Maybe you won&#8217;t shut down MySQL before you start a myisamchk from the shell, or maybe you&#8217;ll reverse the arguments to &#8220;tar -cvzf&#8221; and wipe out something important.  Or perhaps you&#8217;ll screw up a firewall rule and block ssh access to the machine you are frantically trying to fix.  Accidentally killing the ssh daemon is another favorite.  The point is that during a stressful situation in the datacenter, the human operator is the biggest potential source of more downtime or “catastrophic data loss.”  </p>
<p>Assuming you can’t automate everything, what can you do?  Well, the absolute best thing you can do is practice.  Corrupt some data on your dev master db, and see how long it takes you to get it restored from backups or a slave copy.  Practice what would happen if you lost a database slave and had to activate a spare machine to take its place (I hope you have at least one spare machine).  But of course, no one at a small startup has time to practice.  Maybe once you hire a full time ops guy, it would be good to make sure he is practicing this sort of thing occasionally.  But when practicing is going to take away from writing code, you aren&#8217;t going to practice.  </p>
<p>Since you aren’t going to practice, what else can you do?  The next best thing is to cultivate the attitude that you are the most likely source of problems.  Don’t worry about hard drives, worry about bad decisions.  Develop some humility about how you expect to behave when you get woken up at 4am to fix a database the morning of your launch or when a switch fails an hour before your big demo.  From that mindset, here are a few things to do:</p>
<p><strong>Script what you can</strong><br />
Off the top of my head, a good place to start would be writing scripts for some of the steps in setting up master/slave replication and manipulating firewall traffic (allowing or blocking external traffic, for instance).  </p>
<p><strong>Use the buddy system</strong><br />
It is not a bad idea to have somebody else there looking at what you are typing, or at least on the phone confirming things verbally.</p>
<p><strong>Take your fingers off the keyboard before you hit enter</strong><br />
Are you in the right directory?  Are you on the right machine?  Are those arguments in the right order?  Can you just rename this old stuff instead of deleting it?  All of these are excellent questions to ask yourself or your coworker while you have your hands in your lap. This is also a good idea when you are doing something scary with SQL, like running any query that doesn&#8217;t have a where clause.  </p>
<p><strong>Slow things down </strong><br />
As soon as you make one mistake, no matter how minor, it is time to slow things down.  Beyond the fact that making a mistake will fluster you, making one mistake demonstrates that right now, you are likely to make mistakes.  That is a huge red flag.  At this point, the safest thing may be to accept a slightly longer downtime just so you can slow things down, get some water, and relax.  Trying to compensate for a little mistake by doing things faster can result in a much, much worse mistake.  Unless you’ve just rolled a server cage down the stairs, there is always a worse mistake you can make.</p>
<p><strong>Make it hard for people you work with to make mistakes</strong><br />
A quality server naming scheme is the easiest thing you can do here.  No colors, deities, countries, snack foods, snakes, etc.  I like $machineType-$number myself, but with distinct number ranges, even between different machine types.  So, don&#8217;t have SQL-001 and Web-001.  One day, some very sleepy datacenter employee may get things mixed up when you call and ask him to reboot Web-001.  I’m sure you’ll get an apology, but you won’t get your uptime back.  So make it harder for him to screw up: if your web machines start at Web-201, he&#8217;ll have to make 2 mistakes before he accidentally reboots your primary database.  </p>
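<p>A naming rule like this is easy to check mechanically. The following Python sketch is only an illustration: it assumes hostnames of the form type-number, and the function name is made up. It reports any two machines that share a number across types.</p>

```python
def find_number_collisions(hostnames):
    """Return pairs of hostnames that reuse the same number."""
    seen = {}          # number -> first hostname that used it
    collisions = []
    for name in hostnames:
        # "Web-201" splits into type "Web" and number "201".
        machine_type, _, number_text = name.rpartition("-")
        number = int(number_text)
        if number in seen:
            collisions.append((seen[number], name))
        else:
            seen[number] = name
    return collisions

# Hypothetical inventory: the first list repeats 001, the second does not.
print(find_number_collisions(["SQL-001", "Web-001", "Web-002"]))
print(find_number_collisions(["SQL-001", "Web-201", "Web-202"]))
```

<p>Running a check like this whenever a machine is added keeps the number ranges from drifting back together.</p>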
<p><strong>Talk about this stuff ahead of time</strong><br />
You probably have plenty of stuff to talk about at lunch with your coworkers, but here are a few conversation starters if you want to sharpen your disaster recovery skills:</p>
<ul>
<li>&#8220;What happens if we lose power to one of our racks?&#8221;</li>
<li>&#8220;How many of our switches could we lose and get the site back up?&#8221;</li>
<li>&#8220;What is the smallest amount of hardware we could lose that would knock us 100% offline?&#8221;</li>
</ul>
<p>This stuff isn’t theoretical.  I woke up at 2am one weekend during FolderShare with a ton of text messages from our cluster.  The kindly folks at the datacenter had been doing power supply maintenance.  At some point, they powered down 2 of our racks.  Then, they powered them right back up.  It wasn’t tough to fix, but it was so unexpected that it took me a few minutes to even realize what had happened.  </p>
<p><strong>Use tricks to deal with the general class of &#8220;running a command on the wrong machine&#8221; problem.</strong><br />
Typing the right command on the wrong machine is obviously something to avoid.  But when you have a sea of ssh windows open, what can you do?  </p>
<ul>
<li>Use a different terminal background color for machines hosting master databases than for machines hosting slaves</li>
<li>Make sure the machine name shows up in the command prompt </li>
</ul>
<p>Does anyone else have any good ideas or horror stories to tell?  Post a comment and share your wisdom and/or pain.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>tklein on twitter</title>
		<link>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/</link>
		<comments>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/#comments</comments>
		<pubDate>Mon, 30 Jun 2008 21:03:54 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=62</guid>
		<description><![CDATA[I&#8217;m on twitter now.  Follow me at http://twitter.com/tklein
]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m on twitter now.  Follow me at <a href="http://twitter.com/tklein">http://twitter.com/tklein</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dev Diligence: Don’t Invest in the Wrong Code</title>
		<link>http://www.tomkleinpeter.com/2008/04/25/dev-diligence-dont-invest-in-the-wrong-code/</link>
		<comments>http://www.tomkleinpeter.com/2008/04/25/dev-diligence-dont-invest-in-the-wrong-code/#comments</comments>
		<pubDate>Fri, 25 Apr 2008 20:32:54 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Dev Diligence]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=61</guid>
		<description><![CDATA[When I&#8217;m starting a project or thinking about adding functionality to an existing code base, I always consider using any existing code.   Sometimes this is obvious &#8211; I&#8217;m not going to write my own RDBMS &#8212; but frequently, it is a more difficult decision than it should be.  In making a decision, [...]]]></description>
			<content:encoded><![CDATA[<p>When I&#8217;m starting a project or thinking about adding functionality to an existing code base, I always consider using any existing code.   Sometimes this is obvious &#8211; I&#8217;m not going to write my own RDBMS &#8212; but frequently, it is a more difficult decision than it should be.  In making a decision, I look first at the questions that I can actually get answers to:</p>
<ul>
<li>Am I getting more than I need?  It pains me to add a multi-megabyte DLL to a client download for a small amount of functionality.</li>
<li>Will I spend more time learning the interface than I would writing the functionality I need myself?</li>
<li>Is this an active project, and is there any documentation?</li>
<li>If scheduling isn&#8217;t an issue, how much fun would it be to write my own version?</li>
</ul>
<p>Next comes a set of questions that are oftentimes harder to answer:</p>
<ul>
<li>Who else is using it?</li>
<li>Will I be using it the same way as other people who are successfully using it?</li>
<li>What am I going to find out when I put more stress on it than anyone else?</li>
</ul>
<p>One library that passed my gauntlet of questions is <a href="http://libredblack.sourceforge.net/">libredblack</a>. It ended up on a bunch of production servers at FolderShare, and it worked out great.  But there was a catch: I wanted to use it to store large numbers of items, but for every item I put in the tree, the library would allocate an object that held 4 pointers and an enum.  That took 40 bytes on my dev box.  Throw in malloc&#8217;s overhead, and I was up to 48 bytes.  The objects I was storing pointers to would also have some heap overhead, which may be as much as 24 bytes.  So to store 10M items in memory, I&#8217;d need an extra half gigabyte of memory just for overhead. </p>
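<p>The arithmetic behind that estimate is short enough to write down. This Python sketch just multiplies out the figures quoted above. The split of 48 bytes into 40 for the node plus 8 for malloc is taken from the paragraph, and the 24-byte object overhead is the "as much as" upper bound, so treat the second result as a worst case rather than a measurement.</p>

```python
NODE_BYTES = 40            # 4 pointers and an enum, as measured on the dev box
MALLOC_OVERHEAD = 8        # brings each tree node up to 48 bytes
OBJECT_HEAP_OVERHEAD = 24  # upper bound for the heap overhead of each stored object
ITEMS = 10_000_000

def overhead_megabytes(bytes_per_item, items=ITEMS):
    """Total overhead in megabytes (10**6 bytes) for the given per-item cost."""
    return bytes_per_item * items / 10**6

tree_only = overhead_megabytes(NODE_BYTES + MALLOC_OVERHEAD)
worst_case = overhead_megabytes(NODE_BYTES + MALLOC_OVERHEAD + OBJECT_HEAP_OVERHEAD)
print(tree_only, worst_case)   # 480.0 720.0
```

<p>So the tree nodes alone account for roughly the half gigabyte, and the stored objects' own heap overhead can push the total higher.</p>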
<p>A second example from personal experience is <a href="http://librsync.sourceforge.net/">librsync</a>.  Again, the library works exactly as advertised.  But if you want to transfer deltas for large (gigabyte+) files on machines that have hard memory limits (like embedded devices), you need to know that the memory usage is proportional to the file size.  For my situation, I ended up having to adjust the window size as file sizes grew just to keep the memory usage reasonable for large files.  </p>
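<p>One way to picture that adjustment: if memory grows with the number of blocks in a file, then growing the block length with the file size keeps the block count, and so the memory, under a cap. The Python sketch below is a back-of-the-envelope model only. It does not call librsync, and the cap of 65536 blocks, the 2048-byte minimum, and the function name are assumptions picked for illustration.</p>

```python
def pick_block_length(file_size, max_blocks=65536, min_block_length=2048):
    """Smallest block length (in bytes) that keeps the block count within max_blocks."""
    # Ceiling division without floating point.
    needed = -(-file_size // max_blocks)
    return max(min_block_length, needed)

for size in (1 * 2**20, 1 * 2**30, 8 * 2**30):   # 1 MiB, 1 GiB, 8 GiB
    length = pick_block_length(size)
    blocks = -(-size // length)
    print(size, length, blocks)
```

<p>The tradeoff is that larger blocks make the deltas coarser, so a small change in a huge file costs more to transfer.</p>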
<p>I don&#8217;t want anyone to think I&#8217;m complaining about this stuff &#8211; I&#8217;m a fan of both libraries.  But both of these examples illustrate a class of problem that is particularly frustrating: the one you might not find until you are heavily invested in a solution.  These gotchas won&#8217;t affect most people, and thus aren&#8217;t likely to show up when you are researching possible solutions.  They aren&#8217;t bugs, either, but they might be something you have to deal with.  So the sooner you can find out about them, the better.</p>
<p>Fortunately, the internet has plenty of software built for solving problems like this.  <a href="http://www.devdiligence.com/wishlist">Dev Diligence</a><a href="#footer_1"><sup>[1]</sup></a> is a new wiki I&#8217;ve started to collect details like these.  My goal is to have a reference page for any library or service developers might consider using in their solution.  For sufficiently large libraries, pages for classes or functions might be necessary, but let’s not get ahead of ourselves.  Ultimately, I’d like to have 5 headings for everything in the wiki:</p>
<ul>
<li>Overview: Brief description of the software and a link to the homepage</li>
<li>Short case studies or war stories:  These would include a brief description of how you are using the software, the version you used, and ideally some metrics.  If you used it for a while and then switched to something else, an explanation of that decision is very valuable information.  For libredblack, the relevant metrics would be things like average number of elements in your trees or insertions/deletions per second.</li>
<li>“Gotchas” (like the ones I&#8217;ve mentioned above): Subtle problems (hello, <a href="http://blogs.msdn.com/ricom/archive/2006/02/02/523626.aspx">heap fragmentation</a>) and things that aren&#8217;t necessarily bugs, but issues that may affect your design or help you choose one solution over another.</li>
<li>Alternatives: The name pretty much says it all.  With links, please.</li>
<li>Other Resources:  Links to blog posts, email threads, or reference pages would be great.</li>
</ul>
<p>I’ve gone ahead and created entries for <a href="http://www.spiteful.com/dd/libredblack">libredblack</a>, <a href="http://www.spiteful.com/dd/librsync">librsync</a>, and <a href="http://www.spiteful.com/dd/zlib">zlib</a> based on my experiences. I’d love to see some entries for the following and things like them:</p>
<ul>
<li>libev, libevent, boost.asio, and Twisted</li>
<li>openssl</li>
<li>sqlite and berkeleydb</li>
<li>memcached, spread, the reliable queue solutions (Starling, TheSchwartz, etc), and anything that uses “pubsub” in its description</li>
<li>libcurl and wininet (stuff like <a href="http://nick.typepad.com/blog/2006/06/microsoft_pleas.html">Nick Bradbury&#8217;s description of a CPU spike in WinInet that can be triggered by chunked-encoding</a> is gold)</li>
</ul>
<p>All of these and more are linked to from the <a href="http://www.spiteful.com/dd/wishlist">WishList</a> page.  </p>
<p>Can you guys help me out?  I’ve got enough people subscribed to this feed that I’m certain at least one of you has used everything on my list.  If you take 10 minutes to write down your experiences, you can make the software world a better place.  To justify doing it on your company’s time, keep this in mind: if you document the fact that you are successfully using a solution, you increase the chance that other people will use it as well.  The more users a solution has, the better it will become.  </p>
<p><a name="footer_1">[1]</a> &#8220;Dev Diligence&#8221; is of course a play on the term <a href="http://en.wikipedia.org/wiki/Due_diligence">&#8220;Due Diligence&#8221;</a>.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/04/25/dev-diligence-dont-invest-in-the-wrong-code/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Crashing When Something Feels Wrong</title>
		<link>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/</link>
		<comments>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 19:47:37 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Assertions]]></category>
		<category><![CDATA[FolderShare]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=56</guid>
		<description><![CDATA[I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions.  I even like running production services with assertions turned on.  To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail.  Still, [...]]]></description>
			<content:encoded><![CDATA[<p>I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions.  I even like running production services with assertions turned on.  To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail.  Crashing production servers is a contentious issue, but sometimes (hopefully rarely) it is the best thing to do.  For something like FolderShare, crashing a server as soon as there is any hint of an error is vastly safer than possibly deleting someone’s files due to a bug.  Of course, this introduces the risk that you could have multiple servers fail in a short amount of time, but you need to design for that case anyway.  </p>
<p>I originally fell in love with assertions after reading Steve Maguire’s <em>Writing Solid Code</em> many years ago.  After I saw how helpful they could be, I started to structure my code to make it more “assertable.”  For example, I like state machines that use a table of valid transitions combined with assertions.  This prevents anything I don’t explicitly anticipate from happening and is really helpful in a networked, asynchronous world.  </p>
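<p>A table-driven state machine of the kind described above might look roughly like this.  It is a minimal Python sketch, not code from FolderShare: the states, the transition table, and the <code>Connection</code> class are all hypothetical and exist only to show the pattern.</p>

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical connection states, for illustration only.
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    CLOSING = auto()


# Every transition that is allowed to happen; anything else is a bug.
VALID_TRANSITIONS = {
    State.DISCONNECTED: {State.CONNECTING},
    State.CONNECTING: {State.CONNECTED, State.DISCONNECTED},
    State.CONNECTED: {State.CLOSING},
    State.CLOSING: {State.DISCONNECTED},
}


class Connection:
    def __init__(self):
        self.state = State.DISCONNECTED

    def transition(self, new_state):
        # Fail loudly on any transition that was not explicitly anticipated.
        assert new_state in VALID_TRANSITIONS[self.state], (
            f"invalid transition {self.state.name} -> {new_state.name}"
        )
        self.state = new_state
```

<p>Note that Python strips <code>assert</code> statements when run with <code>-O</code>, so a service that wants these checks in production would have to run without that flag or raise an exception explicitly.</p>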
<p>But after developing a few long-running services, I’ve started to perceive the need for a new type of assertion that is a bit higher level than a single conditional.  I want something that will let me crash a service (and save a memory dump) when something “feels” wrong.  For the moment, I call these “probabilistic assertions,” and I would have slept better while running FolderShare if I’d implemented them then.</p>
<p>Like all synchronization software, FolderShare had a few nightmare scenarios that I worried about all the time.  If, due to a bug, the service told every client that its files were all deleted, I’d probably have to read a lot of nasty blog posts, and I would feel like crap for a few years.  I would much rather crash all my servers and debug the problem with the system offline than take a chance on pissing off so many people.</p>
<p>Anyway, a normal assertion that looks at a single conditional wouldn’t help in this case.  Asserting that no files have been deleted when a client connects doesn’t make any sense.  But for a large enough sample size, I could assert that 80% of clients that connect shouldn’t have any files deleted.  And I could probably assert that 95% of clients that connect shouldn’t have all their files deleted.  </p>
<p>Functions like these cover most of what I’m thinking about:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/04/prob_assertions.png" alt="" title="prob_assertions" width="497" height="132" class="aligncenter size-full wp-image-58" /></p>
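<p>The functions in the screenshot are not available as text, so here is one possible shape for a probabilistic assertion, written as a Python sketch.  The class name, its parameters, and the cumulative counting are assumptions made for illustration; a long-running service would more likely want a sliding window than counters that grow forever.</p>

```python
class ProbabilisticAssertion:
    """Assert that at least min_fraction of observed events satisfy a condition.

    Nothing is checked until min_samples events have been seen, so a few
    unlucky early events cannot trip the assertion.
    """

    def __init__(self, name, min_fraction, min_samples=1000):
        self.name = name
        self.min_fraction = min_fraction
        self.min_samples = min_samples
        self.total = 0
        self.passed = 0

    def observe(self, condition):
        self.total += 1
        if condition:
            self.passed += 1
        if self.total >= self.min_samples:
            fraction = self.passed / self.total
            assert fraction >= self.min_fraction, (
                f"{self.name}: only {fraction:.1%} of {self.total} events "
                f"passed, expected at least {self.min_fraction:.0%}"
            )


# The two checks from the text, using hypothetical names.
no_deletes = ProbabilisticAssertion("client has no deleted files", 0.80)
not_all_deleted = ProbabilisticAssertion("client did not lose every file", 0.95)
```

<p>On each client connection the server would call something like <code>no_deletes.observe(deleted_count == 0)</code>.  The same class could send an alert instead of asserting, depending on how drastic a reaction the application calls for.</p>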
<p>Depending on your application, these functions could be implemented either as assertions or as triggers to alert an administrator.  For alerts, you could really take this to another level by checking, for example, that incoming events follow a certain distribution.  Granted, it’s a bit over the top, but I can’t help but imagine checking whether a stream of incoming events obeys a Poisson distribution or whether the sizes of certain data follow a Gaussian distribution.  </p>
<p>Has anyone seen anything like this in the wild?  I’d love to hear about it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Housekeeping</title>
		<link>http://www.tomkleinpeter.com/2008/04/14/housekeeping/</link>
		<comments>http://www.tomkleinpeter.com/2008/04/14/housekeeping/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 19:43:33 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=55</guid>
		<description><![CDATA[I don&#8217;t plan on having many non-technical posts here, but I&#8217;m breaking my rule today for a good reason.  I&#8217;ve got a kid now!  My first child, Margot Lee Kleinpeter, was born about 10 days ago.  Between a long, drawn out labor, a few nights on a hospital couch, and fatherhood in [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t plan on having many non-technical posts here, but I&#8217;m breaking my rule today for a good reason.  I&#8217;ve got a kid now!  My first child, Margot Lee Kleinpeter, was born about 10 days ago.  Between a long, drawn-out labor, a few nights on a hospital couch, and fatherhood in general, I&#8217;ve fallen a bit behind on publishing.  Much to my surprise, Margot prefers clean diapers and songs to essays on startups and programming.  But I&#8217;ve got a new post for today and I&#8217;ll hopefully be back on a more normal schedule soon.  In the meantime, enjoy this picture of her sleeping:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/04/margot.jpg" alt="" title="margot" width="500" height="333" class="aligncenter size-full wp-image-60" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/04/14/housekeeping/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Things That Are Important: Where Clauses</title>
		<link>http://www.tomkleinpeter.com/2008/03/24/things-that-are-important-where-clauses/</link>
		<comments>http://www.tomkleinpeter.com/2008/03/24/things-that-are-important-where-clauses/#comments</comments>
		<pubDate>Mon, 24 Mar 2008 20:03:45 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Audiogalaxy]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Startup Lessons]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/2008/03/24/things-that-are-important-where-clauses/</guid>
		<description><![CDATA[When you are running a distributed service in a datacenter, you encounter a lot of interesting problems.  At Audiogalaxy, I ran into all the standard application level bugs, crashes, and race conditions.  Once we had a certain number of machines, we even had to deal with flaky memory, disks, and networking cards.  [...]]]></description>
			<content:encoded><![CDATA[<p><img style="margin: 0pt 0pt 2px 7px; padding: 4px; float: right; display: inline;"  src='http://www.spiteful.com/wp-content/uploads/2008/03/quake-3-bones.jpg' alt='quake-3-bones.jpg' />When you are running a distributed service in a datacenter, you encounter a lot of interesting problems.  At Audiogalaxy, I ran into all the standard application level bugs, crashes, and race conditions.  Once we had a certain number of machines, we even had to deal with flaky memory, disks, and networking cards.  But all of that was pretty typical compared to the weirdest bug I ever had to deal with – the one that was caused by Quake III Arena.  </p>
<p>Audiogalaxy had a small client that simply handled the P2P transfers and a complicated website for everything else, including account settings.  One of the adjustable account settings on the website was the “max number of transfers.”  To encourage users to send as much as they received, we only gave them a single number for this setting.  With a value of 1, a Satellite would only send a single file at a time, but it could only download one file at a time as well.</p>
<p>Things were not so simple on the back-end.  For better or for worse, I had designed some flexibility into the system.  The max transfers value was actually stored in two columns in the Users table – MaxSend and MaxRecv.  The back-end – the part that actually looked at these values when it was setting up transfers – had no idea these columns were linked.  The front-end enforced what went into the database, and the back-end obeyed it.  Whenever the Satellite reconnected to the cloud, our server would read the value out of the database and store it in memory for the duration of that connection.  </p>
<p>Of course, somewhere between the frontend and the backend is mysqlclient, but I’ll get to that in a moment.  </p>
<p>Quake III Arena was my game of choice at the time I worked for AG.  We had a few developers that also enjoyed the game, and it was common to find people staying late on the weekend to take advantage of our nice internet connection.  Unfortunately, our nice internet connection had a dozen people running our p2p music sharing client on one side, so it would periodically slow down when someone’s computer started blasting a file out at high speed.  These slowdowns drove us crazy, particularly when they prevented us from using the game’s rail gun effectively.</p>
<p>Good developers like to fix problems, and developers at startups also tend to have access to the database.  So, you can probably imagine what a developer might do.  And if you know a little bit about SQL, you can also imagine what might go horribly wrong.  I never found out who issued the bad query, but I can just imagine how it played out:</p>
<blockquote><p>Hey, I&#8217;ve got an idea about how we can keep the games from lagging tonight.  I’ll just block everyone in the office from sending files.  One simple &#8216;Update Users set MaxSend = 0&#8217; and we should be good to go for the evening…  Why is that query taking so long?  Uh oh&#8230;</p></blockquote>
<p>SQL is good for a lot of things, but I’ve always marveled at how easy it is to destroy an entire table simply by forgetting a where clause.  And thus, in a few short minutes, every one of our 30 million users had a subtle change applied to their accounts.  Did I mention that the single value we displayed on the website for this setting came from the MaxRecv column?  Whoops…</p>
<p>Monitoring the health of the system was one of my jobs, so I kept a close eye on my graph of the “current transfer rate.”  Ultimately, most problems in the system resulted in fewer files getting transferred, so the global transfer rate was a good proxy for the health of the system.  </p>
<p>Every day of the week plotted a unique and predictable curve that I knew by heart, and so it didn’t take me long to realize that something was wrong.  Transfer rates were dropping.  But why?  I called our ISP and asked if they knew of any problems with the Internet.  Nope.  We had exactly the right number of clients connected.  No one had trenched over a fiber optic cable in the middle of nowhere.  Requests were coming into the system at the normal rate; they just weren’t getting fulfilled.  Microsoft hadn’t pushed any patches out that might have firewalled off half the world. </p>
<p>Clients generally stayed connected for days or weeks at a time.  As they gradually reconnected, more and more of the network got their new MaxSend setting and dutifully started not sending anything. Users weren’t complaining – it was perfectly normal for rare songs to be inaccessible, and nobody noticed if his client just wasn’t sending anything.  </p>
<p>After tearing my hair out for a day or so about this, I finally realized I was seeing a lot more “client busy – no free slots” type messages than I usually did while tail -f’ing the log files. Digging into that, I noticed some other funny messages, and eventually I was staring in shock at the results of a “select MaxSend, MaxRecv from Users limit 1000.”  </p>
<p>Fixing the problem was easy enough: &#8220;Update Users set MaxSend = MaxRecv,&#8221; but you can imagine I spent quite some time staring at that query before issuing it.  </p>
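<p>The whole incident can be replayed on a toy table.  The sketch below uses Python with an in-memory SQLite database rather than MySQL, and the <code>in_office</code> column is hypothetical: the story does not say how office accounts would have been identified, so it stands in for whatever the missing where clause should have tested.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Users ("
    "id INTEGER PRIMARY KEY, in_office INTEGER, "
    "MaxSend INTEGER, MaxRecv INTEGER)"
)
# The front-end always wrote the same value to both columns.
conn.executemany(
    "INSERT INTO Users (in_office, MaxSend, MaxRecv) VALUES (?, ?, ?)",
    [(1, 3, 3), (0, 2, 2), (0, 5, 5)],
)

# The bad query: no WHERE clause, so every row is touched.
bad_rows = conn.execute("UPDATE Users SET MaxSend = 0").rowcount

# The repair works only because MaxRecv still holds the original value.
conn.execute("UPDATE Users SET MaxSend = MaxRecv")
repaired = conn.execute(
    "SELECT MaxSend, MaxRecv FROM Users ORDER BY id"
).fetchall()

# What was presumably intended: restrict the update to office accounts.
scoped_rows = conn.execute(
    "UPDATE Users SET MaxSend = 0 WHERE in_office = 1"
).rowcount

print(bad_rows, repaired, scoped_rows)
```

<p>Checking the affected row count before committing is a cheap habit: a count equal to the size of the table is a strong hint that the where clause went missing.</p>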
<p><img style="margin: 0pt 0pt 2px 7px; padding: 4px; float: right; display: inline;"  src='http://www.spiteful.com/wp-content/uploads/2008/03/mysql-i-am-a-dummy.png' alt='mysql-i-am-a-dummy' />So what&#8217;s the moral of the story?  Don&#8217;t let your developers have access to the production database?  Maybe, but that isn’t practical for a small startup.  Better logging?  That certainly could help.  Force everyone to access the database using the --i-am-a-dummy flag for MySQL?  That is not a bad idea and will get you some of the way there, but a shoddily written script can do exactly the same kind of damage.  Backups?  Sure, we had backups, but we were adding customers so quickly that restoring data more than a few hours old would have pissed off many thousands of people.  An Admin class of users, with configurable policy that prevented them from sending files between 7pm and 3am on weekends?   Yeah, right.  </p>
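<p>Scripts bypass the interactive client, so a similar guard would have to live in the code that issues the queries.  The following is a deliberately crude Python sketch of that idea.  It is not how MySQL implements its safe-updates mode, the function and exception names are made up, and a plain regular expression can be fooled (for example by the word "where" inside a string literal).</p>

```python
import re

# Statements that modify or remove rows.
_WRITE = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


class UnsafeStatement(Exception):
    """Raised when an UPDATE or DELETE has no WHERE clause."""


def check_statement(sql):
    """Return sql unchanged, or raise if it would touch every row."""
    if _WRITE.match(sql) and not _WHERE.search(sql):
        raise UnsafeStatement(f"refusing to run without WHERE: {sql!r}")
    return sql


# Usage: wrap every statement before handing it to the database driver.
check_statement("UPDATE Users SET MaxSend = 0 WHERE id = 7")  # passes
```

<p>A script would call <code>check_statement</code> on each query before executing it, so the query from the story would raise instead of rewriting 30 million rows.</p>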
<p>If you run a big and complicated system, problems you never predicted are going to happen, and they will make your system do impossibly weird things.  You must invest in tools to give you visibility into your system.  My transfer rate graph was the only reason I was even able to go looking for a problem.  I knew something was wrong, and it was just a matter of digging until I found it.  Let your admins see into the system (specifically – how the system is behaving right now) so that they can develop intuition about what it should look like.  Finding a bug in production is never fun.  But it is going to happen, and it is always better if you find it before your users do.</p>
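<p>The intuition about what the graph "should look like" can be approximated in code.  This Python sketch compares the current transfer rate with the average of past readings for the same weekday and hour.  The function name, the shape of the history dictionary, and the 25% tolerance are all assumptions for illustration, not a description of Audiogalaxy's monitoring.</p>

```python
from statistics import mean


def rate_looks_wrong(history, weekday, hour, current_rate, tolerance=0.25):
    """Return True if current_rate strays too far from the usual value.

    history maps (weekday, hour) to a list of past transfer rates for that
    slot. With fewer than three past readings there is no baseline yet.
    """
    past = history.get((weekday, hour), [])
    if len(past) < 3:
        return False
    expected = mean(past)
    return abs(current_rate - expected) > tolerance * expected


# Hypothetical readings for Mondays at 20:00, in files per second.
history = {(0, 20): [100.0, 110.0, 90.0]}
print(rate_looks_wrong(history, 0, 20, 95.0))
print(rate_looks_wrong(history, 0, 20, 60.0))
```

<p>A check like this does not replace the graph, but it can page someone when a slow decline like the one in this story starts overnight.</p>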
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/03/24/things-that-are-important-where-clauses/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>
