<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>igvita.com</title>
	
	<link>http://www.igvita.com</link>
	<description>A goal is a dream with a deadline.</description>
	<pubDate>Mon, 13 Jul 2009 16:37:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/igvita" type="application/rss+xml" /><feedburner:emailServiceId>igvita</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>Extending Tokyo Cabinet DB with Lua</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/5iq57HwtKVM/</link>
		<comments>http://www.igvita.com/2009/07/13/extending-tokyo-cabinet-db-with-lua/#comments</comments>
		<pubDate>Mon, 13 Jul 2009 16:37:12 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Databases]]></category>

		<category><![CDATA[db]]></category>

		<category><![CDATA[lua]]></category>

		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=583</guid>
		<description><![CDATA[Tokyo Cabinet is a trove of hidden of gems, the more you learn about it, the more you will appreciate the design and technical decisions behind it. By database standards it is a young project (started in 2007), but since it is a successor to the QDBM project developed by Hirabayashi-san (2000-2007), we could make [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" src="http://www.igvita.com/posts/09/tc.png" style="margin-right: 1em;"/>Tokyo Cabinet is a <a href="http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/">trove of hidden of gems</a>, the more you learn about it, the more you will appreciate the design and technical decisions behind it. By database standards it is a young project (started in 2007), but since it is a successor to the <a href="http://qdbm.sourceforge.net/">QDBM</a> project developed by Hirabayashi-san (2000-2007), we could make the argument that it has been, in fact, nine years in the making.</p>
<p>Best of all, the rewrite allowed the project to shed its past baggage and build on a more modern stack with a better perspective of the required features for a modern deployment (<a href="https://launchpad.net/drizzle">Drizzle</a> is another recent example from the MySQL camp). Case in point, <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/#luaext">Tokyo Tyrant supports Lua scripting</a> on the server, allowing us to add arbitrary user defined functions (<a href="http://dev.mysql.com/doc/refman/5.1/en/plugin-api.html">UDF's in MySQL</a> parlance) and to extend the database itself!</p>
<h4><strong>Lowering the Barrier With Lua Scripting</strong></h4>
<p><img align="left" src="http://www.igvita.com/posts/09/lua.png" style="margin-right: 1em;"/>If you have played with Lego Mindstorms, or ever tried to script your phone or a gaming environment (<a href="http://www.wowwiki.com/Lua">WoW</a>), chances are you've worked with <a href="http://www.lua.org/">Lua</a>. Due to its extremely small footprint (~212kb for source, documentation, and examples), fast interpreter, portability and easy to understand syntax it has become the de-facto embedded scripting language for many applications. Explore the <a href="http://lua-users.org/wiki/TutorialDirectory">available tutorials</a>, or pick up "<a href="http://www.amazon.com/Programming-Lua-Second-Roberto-Ierusalimschy/dp/8590379825/ref=sr_1_1?ie=UTF8&s=books&qid=1247449280&sr=8-1">Programming Lua</a>" by the creator of the language for a more in depth look, it's a fun language!</p>
<p>So why is Lua scripting such an exciting feature for Tokyo Tyrant? Because it lowers the barrier for programming and extending the database by an order of magnitude! If you have ever worked with MySQL UDF functions, you'll definitely appreciate the ease and the speed of development. No need for mucking with internal API's, nothing to compile or link against, and execution is done within a stable and sandboxed environment. </p>
<h4><strong>Extending Tokyo Tyrant with Lua</strong></h4>
<p>Anytime you bootup a Tokyo Tyrant server you can tell it to load arbitrary Lua code alongside which will then be evaluated at runtime if the client requests it. From there, the developer of the extension has full access to the incoming request and the underlying Tokyo Cabinet database, allowing us to inject arbitrary functionality. To get a flavor for the workflow, take a look at some of the examples in the <a href="http://www.slideshare.net/igrigorik/lean-mean-tokyo-cabinet-recipes-with-lua">following slides</a>:</p>
<div style="width:600px;text-align:left" id="__ss_1708861"><object style="margin:0px" width="600" height="490"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=tokyo-recipes-090711100331-phpapp02&rel=0&stripped_title=lean-mean-tokyo-cabinet-recipes-with-lua" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=tokyo-recipes-090711100331-phpapp02&rel=0&stripped_title=lean-mean-tokyo-cabinet-recipes-with-lua" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="600" height="490"></embed></object></div>
<p><div class='download-link'>
							<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/tokyo-recipes/tree/master/.git'><img alt='Download' class='leftalign' src='http://www.igvita.com/wp-content/plugins/dBeautifier/icons/github.png' /></a>
							<h4>
								<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/tokyo-recipes/tree/master/.git'>tokyo-recipes.git (Lua Extensions for Tokyo Tyrant)</a>
							</h4><p>Downloads: 94 File Size: 0.0 KB </p>
						</div> </p>
<p>Because the Lua scripting interface is relatively new, the number of available extensions and documentation is not very large. For that reason, I've started the <a href="http://github.com/igrigorik/tokyo-recipes/tree/master">tokyo-recipes repo</a> on GitHub, in which I've aggregated some of the available extensions, and added extra documentation and examples to help grease the wheels: <a href="http://github.com/igrigorik/tokyo-recipes/tree/1fdaf47a1fcf7a77e721379b1efa9cc0a510f72f/expire">TTL functionality</a> (ala memcached), <a href="http://github.com/igrigorik/tokyo-recipes/tree/1fdaf47a1fcf7a77e721379b1efa9cc0a510f72f/sets">working with Sets</a> (ala Redis), <a href="http://github.com/igrigorik/tokyo-recipes/tree/1fdaf47a1fcf7a77e721379b1efa9cc0a510f72f/session-trail">session timestamping</a> and a <a href="http://github.com/igrigorik/tokyo-recipes/tree/1fdaf47a1fcf7a77e721379b1efa9cc0a510f72f/map-reduce">wordcount map-reduce</a> example just to name a few. Give it a try, it is an extremely powerful feature!</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=5iq57HwtKVM:Wu0wjjQMQrI:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=5iq57HwtKVM:Wu0wjjQMQrI:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=5iq57HwtKVM:Wu0wjjQMQrI:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=5iq57HwtKVM:Wu0wjjQMQrI:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=5iq57HwtKVM:Wu0wjjQMQrI:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=5iq57HwtKVM:Wu0wjjQMQrI:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=5iq57HwtKVM:Wu0wjjQMQrI:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=5iq57HwtKVM:Wu0wjjQMQrI:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=5iq57HwtKVM:Wu0wjjQMQrI:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/5iq57HwtKVM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/07/13/extending-tokyo-cabinet-db-with-lua/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/07/13/extending-tokyo-cabinet-db-with-lua/</feedburner:origLink></item>
		<item>
		<title>HTTP PubSub: Webhooks &amp; PubSubHubbub</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/-zbWrPTcnHE/</link>
		<comments>http://www.igvita.com/2009/06/29/http-pubsub-webhooks-pubsubhubbub/#comments</comments>
		<pubDate>Mon, 29 Jun 2009 16:14:07 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Architecture]]></category>

		<category><![CDATA[Ruby]]></category>

		<category><![CDATA[pubsub]]></category>

		<category><![CDATA[pubsubhubbub]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=514</guid>
		<description><![CDATA[With all the recent buzz about real-time web, surely this is the year XMPP/AMQP Publish-Subscribe (PubSub) makes it to the big leagues! Or maybe not. Ejabberd (XMPP), RabbitMQ (AMQP) and other pubsub server implementations have come a long way but they remain cumbersome to setup and maintain, and perhaps more importantly, the clients require special [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" style="margin-right: 1em;" src="http://www.igvita.com/posts/09/webhooks.png"/>With all the recent buzz about real-time web, surely this is the year <a href="http://xmpp.org/extensions/xep-0060.html">XMPP</a>/<a href="http://jira.amqp.org/confluence/display/AMQP/Advanced+Message+Queuing+Protocol">AMQP</a> Publish-Subscribe (PubSub) makes it to the big leagues! Or maybe not. <a href="http://www.ejabberd.im/">Ejabberd</a> (XMPP), <a href="http://www.rabbitmq.com/">RabbitMQ</a> (AMQP) and other pubsub server implementations have come a long way but they remain cumbersome to setup and maintain, and perhaps more importantly, the clients require special libraries and a steep learning curve. That is not to say that either XMPP or AMQP are doomed for failure, in fact, they will continue to thrive, but there is a great case for a simplified PubSub implementation to cover the ad-hoc cases where a dedicated TCP channel might be an overkill: enter Webhooks.</p>
<p>The best part about <a href="http://www.webhooks.org/">Webhooks</a> is that most of us are already familiar with them: callbacks over HTTP. Pioneered by PayPal and Subversion as a way to send real-time notifications to the client, they have found their way into many dozens of products we all use every day. Need pre or post commit hooks for your SVN or Git repository? Both GitHub and SVN support HTTP callbacks. Need a payment alert from PayPal, or an alert when a wiki page is modified? There are webhooks for that too. This simple mechanism allows us to build web services that work together via a simple and ubiquitous protocol we can all understand: HTTP!</p>
<h4><strong>Working with Webhooks</strong></h4>
<p><img align="left" style="margin-right: 1em;" src="http://www.igvita.com/posts/09/webhook-callback.png"/>A good way to think about webhooks is as Unix pipes for the web. While PubSub is a great application of the technology, Webhooks allow bidirectional communication, which opens up our possibilities to: publishing notifications, chaining web services to perform complex actions, or allowing external plugins to enter the workflow. All of this functionality is accomplished via simple HTTP queries (both POST and GET), which means no special libraries or servers to make it all happen. In fact, building your own Webhooks powered PubSub server takes <a href="http://github.com/jcapote/watercoolr/blob/125707e211b0a3cb6462ea405bb77aefb1618f8b/watercoolr.rb">less than a hundred lines</a> with the help of Sinatra, as demonstrated by the Julio Capote and his <a href="http://github.com/jcapote/watercoolr/tree/master">watercoolr</a> project on GitHub. To see it in action, we can use <a href="http://www.postbin.org/">PostBin</a> as our mock recipient:</p>
<blockquote><p>
[ilya@igvita ~]# ruby postbin.rb <a href="http://www.postbin.org/1987c4m">http://www.postbin.org/1987c4m</a><br />
creating channel...<br />
adding subscriber to channel sej7u7<br />
<em>{"status":"OK"}</em><br />
Post message: Hello Webhooks PubSub!<br />
<em>{"status":"OK"}</em>
</p></blockquote>
<p>In this scenario, our local <a href="http://watercoolr.nuklei.com/">watercoolr service</a> acts as a PubSub server, accepting requests to create custom channels, maintaining a subscriber list, and pushing out notifications when an update arrives at the channel. Best of all, interacting with the server is as simple as writing a RESTful client:</p>
<p><a href="javascript:showme('4746_1');"> <b>> postbin.rb</b></a>
<div style=" background:white;" id=4746_1>
<pre class="ruby"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">&quot;rubygems&quot;</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">&quot;rest_client&quot;</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">&quot;json&quot;</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;creating channel...&quot;</span>
resp = RestClient.<span style="color:#9900CC;">post</span> <span style="color:#996600;">&quot;http://localhost:4567/channels&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:data</span> =&gt; <span style="color:#996600;">&quot;&quot;</span>
id = JSON.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span>resp<span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;id&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;adding subscriber to channel #{id}&quot;</span>
resp = RestClient.<span style="color:#9900CC;">post</span> <span style="color:#996600;">&quot;http://localhost:4567/subscribers&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:data</span> =&gt; <span style="color:#006600; font-weight:bold;">&#123;</span>:channel =&gt; id, <span style="color:#ff3333; font-weight:bold;">:url</span> =&gt; ARGV<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#125;</span>.<span style="color:#9900CC;">to_json</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">loop</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  <span style="color:#CC0066; font-weight:bold;">print</span> <span style="color:#996600;">&quot;Post message: &quot;</span>
  msg = STDIN.<span style="color:#CC0066; font-weight:bold;">gets</span>.<span style="color:#CC0066; font-weight:bold;">chomp</span>
&nbsp;
  resp = RestClient.<span style="color:#9900CC;">post</span> <span style="color:#996600;">&quot;http://localhost:4567/messages&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:data</span> =&gt; <span style="color:#006600; font-weight:bold;">&#123;</span>:channel =&gt; id, <span style="color:#ff3333; font-weight:bold;">:message</span> =&gt; msg<span style="color:#006600; font-weight:bold;">&#125;</span>.<span style="color:#9900CC;">to_json</span>
  <span style="color:#CC0066; font-weight:bold;">puts</span> resp
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;</pre>
</div>
<h4><strong>PubSubHubbub Spec: Auth, Discovery & Protocol Implementation</strong></h4>
<p>Brad Fitzpatrick of Livejournal fame and Brett Slatkin, both of whom are currently at Google <a href="http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html">drafted a PubSubHubbub spec</a> for a "simple web-scale pubsub protocol" which preserves the spirit of HTTP webhooks but adds a much needed layer of basic security, discovery, and coherent protocol definition:</p>
<blockquote><p>
We offer this spec in hopes that it fills a need or at least advances the state of the discussion in the pubsub space. Polling sucks. We think a decentralized pubsub layer is a fundamental, missing layer in the Internet architecture today and its existence, more than just enabling the obvious lower latency feed readers, would enable many cool applications, most of which we can't even imagine. But we're looking forward to decentralized social networking.
</p></blockquote>
<p><iframe src='http://docs.google.com/EmbedSlideshow?docid=ajd8t6gk4mh2_34dvbpchfs&amp;size=s' frameborder='0' width='555' height='340'></iframe></p>
<p>The <a href="http://code.google.com/p/pubsubhubbub/">google code project</a> also provides an AppEngine implementation with a <a href="http://pubsubhubbub.appspot.com/publish">public hub</a>, which we can test via an asynchronous (EventMachine based) PubSubHubbub Ruby library:</p>
<p><a href="javascript:showme('4746_2');"> <b>> pubsub-gae.rb</b></a>
<div style=" background:white;" id=4746_2>
<pre class="ruby"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">&quot;rubygems&quot;</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">&quot;pubsubhubbub&quot;</span>
&nbsp;
EventMachine.<span style="color:#9900CC;">run</span> <span style="color:#006600; font-weight:bold;">&#123;</span>
  <span style="color:#008000; font-style:italic;"># publish single URL</span>
  pub = <span style="color:#6666ff; font-weight:bold;">EventMachine::PubSubHubbub</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'http://pubsubhubbub.appspot.com/publish'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">publish</span> <span style="color:#996600;">&quot;http://www.test.com/&quot;</span>
  pub.<span style="color:#9900CC;">callback</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Successfully notified hub.&quot;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
  pub.<span style="color:#9900CC;">errback</span>  <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Uh oh, something broke: #{pub.response}&quot;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
  <span style="color:#008000; font-style:italic;"># publish multiple URL's to hub</span>
  feeds = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;http://www.test.com&quot;</span>, <span style="color:#996600;">&quot;http://www.test.com/2&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  pub = <span style="color:#6666ff; font-weight:bold;">EventMachine::PubSubHubbub</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'http://pubsubhubbub.appspot.com/publish'</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">publish</span> feeds
&nbsp;
  pub.<span style="color:#9900CC;">callback</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Successfully notified hub.&quot;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
  pub.<span style="color:#9900CC;">errback</span>  <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Uh oh, something broke: #{pub.response}&quot;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
<span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;</pre>
</div>
<p><div class='download-link'>
							<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/PubSubHubbub/tree/master/.git'><img alt='Download' class='leftalign' src='http://www.igvita.com/wp-content/plugins/dBeautifier/icons/github.png' /></a>
							<h4>
								<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/PubSubHubbub/tree/master/.git'>PubSubHubbub.git (Asynchronous PubSubHubbub Ruby Client)</a>
							</h4><p>Downloads: 201 File Size: 0.0 KB </p>
						</div> </p>
<h4><strong>Webhooks & PubSubHubbub</strong></h4>
<p>Of course, neither Webhooks nor PubSubHubbub are the answer to every problem. Both XMPP and AMQP will continue to exist alongside, but chances are, will take on the brunt of high-velocity feeds while Webhooks can happily power the remaining 90% of the simpler use-cases. In fact, there is already a <a href="http://github.com/tonyg/rabbithub/tree/master">RabbitMQ PubSubHubbub hub server implemenation</a> in the works! For further reading, flip through Jeff Lindsay's <a href="http://www.slideshare.net/progrium/web-hooks-and-the-programmable-world-of-tomorrow-presentation">presentations</a> on <a href="http://blog.webhooks.org/2009/04/23/slides-from-pivotal-labs-talk/">webhooks</a>, and if you are interested in PubSubHubbub check out <a href="http://www.veodia.com/player.php?vid=fCNU1qQ1oSs">this presentation</a> by Brad Fitzpatrick and Brett Slatkin.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=-zbWrPTcnHE:W_-CvtdNiIM:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=-zbWrPTcnHE:W_-CvtdNiIM:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=-zbWrPTcnHE:W_-CvtdNiIM:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=-zbWrPTcnHE:W_-CvtdNiIM:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=-zbWrPTcnHE:W_-CvtdNiIM:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=-zbWrPTcnHE:W_-CvtdNiIM:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=-zbWrPTcnHE:W_-CvtdNiIM:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=-zbWrPTcnHE:W_-CvtdNiIM:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=-zbWrPTcnHE:W_-CvtdNiIM:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/-zbWrPTcnHE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/06/29/http-pubsub-webhooks-pubsubhubbub/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/06/29/http-pubsub-webhooks-pubsubhubbub/</feedburner:origLink></item>
		<item>
		<title>Measuring &amp; Optimizing I/O Performance</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/RtVFWofIDfs/</link>
		<comments>http://www.igvita.com/2009/06/23/measuring-optimizing-io-performance/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 17:56:05 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Architecture]]></category>

		<category><![CDATA[io]]></category>

		<category><![CDATA[iostat]]></category>

		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=521</guid>
		<description><![CDATA[Measuring and optimizing IO performance is somewhat of a black art: the tools are there, the resources and discussions are plenty, but it is also incredibly easy to get lost in the forest. I speak from recent experience. Having gone down multiple false starts with filesystem optimization, RAID tweaking, and even app-level changes it really [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.igvita.com/posts/09/disk-architecture.png" align="left" style="margin-right:1em;" />Measuring and optimizing IO performance is somewhat of a black art: the tools are there, the resources and discussions are plenty, but it is also incredibly easy to get lost in the forest. I speak from recent experience. Having gone down multiple false starts with filesystem optimization, RAID tweaking, and even app-level changes it really helped to finally step back and revisit the basics. Many man pages and discussion threads later, a few useful realizations emerged: iostat is your best friend, but it can also be incredibly deceiving; refreshing your memory of disk latencies will go a long way; disks and filesystems are fast, but not <em>that</em> fast.</p>
<h4><strong>Monitoring IO Performance with iostat</strong></h4>
<p>If IO performance is suspect, iostat is your best friend. Having said that, the <a href="http://linux.die.net/man/1/iostat">man pages</a> are cryptic so don't be surprised if you find yourself reading the source. To get started, identify the device in question and start a monitoring process:</p>
<blockquote><p>
# -k output rates in kB<br />
# -x output extended stats<br />
# -d monitoring single device<br />
# sample stats every 5 seconds for device /dev/sdh<br />
<strong>[ilya@igvita]> iostat -dxk /dev/sdi 5</strong>
</p></blockquote>
<p>Next, allocate yourself a couple of hours to understand the output or expect to find yourself down a wrong path in no time flat (been there, done that). iostat is a popular tool amongst the database crowd, so not surprisingly you'll find a lot of <a href="http://dammit.lt/2009/03/11/iostat/">great</a> <a href="http://www.pythian.com/news/247/basic-io-monitoring-on-linux">discussions</a> <a href="http://jcole.us/blog/archives/2007/05/08/on-iostat-disk-latency-iohist-onward/">documenting</a> the use. Depending on your application you will need to focus on different metrics, but as a gentle introduction let's take a look at <em>await</em>, <em>svctime</em> and <em>avgque</em>:</p>
<blockquote><p>
<strong>await</strong> - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.<br />
<strong>svctime</strong> - The average service time (in milliseconds) for I/O requests that were issued to the device.<br />
<strong>avgqu-sz</strong> - The average queue length of the requests that were issued to the device.
</p></blockquote>
<p align="center"><img src="http://www.igvita.com/posts/09/iostat.png" alt="iostat" /></p>
<p>First off, await is a <a href="http://developer.amazonwebservices.com/connect/thread.jspa?threadID=33228">deceiving metric</a>! Even though it claims to measure average time, it is better understood as an aggregate function, so don't be mislead by it: <strong>avgqu-sz * svctm / (%util/100)</strong>. Ideally, await should be roughly equal to your svctime, which leads us to a corollary: your average queue size is ideally hovering around single digits. Understanding these variables alone can tell you volumes about the application generating the load.</p>
<h4><strong>Disk Latencies Refresher & EBS Performance</strong></h4>
<p><img src="http://www.igvita.com/posts/09/ebs.png" align="left" style="margin-right:1em;" />Disk <a href="http://en.wikipedia.org/wiki/Access_time">access time</a> is determined via the sum of several variables: spin-up, seek, rotational delay, and transfer time. Assuming your disk is not is not sleeping we can discount the <a href="http://en.wikipedia.org/wiki/Spin-up">spin-up time</a>, which leaves us with <a href="http://en.wikipedia.org/wiki/Seek_time">seek</a> (time for the disk arm to find the track: ~10ms), <a href="http://en.wikipedia.org/wiki/Rotational_delay">rotational delay</a> (time to get the right sector under the head: depends on RPM), and the actual transfer time. Hence, in the worst case we will take ~10ms to seek, 60s/7200RPM ~= 8ms in rotational delay, plus the read time. On average, for a 7.2k RPM disk this translates into roughly ~5ms access time (~20ms in worst case) to read the first byte!</p>
<p>Armed with this knowledge we can now put Amazon's EBS performance in context: on average our EBS mounts show 10~30ms svctime, which all things considered is not outrageous for a SAN. This number also dips into low single digits at nights and on weekends, which points to the fact that as with any shared resource, the performance of <a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=124227#124227">EBS degrades during the day</a>. Having said that, a 6x performance difference based on time of day is definitely not anything to sneeze at, so let's hope Amazon is on top of this!</p>
<p>Average queue size (avgqu-sz) is a popular metric in the DBA circles, but <a href="http://sqlblog.com/blogs/linchi_shea/archive/2007/11/12/disk-queue-length-some-data-points-may-help.aspx">do be careful with it</a> <a href="http://www.r71.nl/index.php?option=com_content&view=article&catid=7:technical-docs&id=185:disk-queue-length-vs-disk-latency-times-which-is-best-for-measuring-database-performance&Itemid=50">when running on a SAN</a> or any multi-spindle device. Ideally, your queue size (avgqu-sz) for a single disk should be in single digits, which means that the underlying device is well matched to the IO load generated by the application. Conversely, if the queue size is artificially low, chances are your application code can benefit from some tuning: do less disk flushing, think about caching or buffering, or in other words, double check the assumption that IO is the bottleneck! </p>
<h4><strong>Disks, Filesystems and Facebook Case Study: Haystack</strong></h4>
<p><img src="http://www.igvita.com/posts/09/facebook.png" align="left" style="margin-right:1em;" />Average access time on our disks places some hard limits on the number of IOPs - at 5ms average, we get a very optimistic 200 req/s with no read time. Hence, if you're trying to store several hundred files a second, you might want to revisit the architecture or seriously think about switching to SSD's! Databases such as MySQL work around this constraint by minimizing the number of file handles, caching data, and using aggressive buffering techniques. Willing to potentially loose a little bit of data with InnoDB? Set <a href="http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/">flush_log_at_trx_commit to 2</a> to avoid flushing on every transaction in favor of a periodic one second flush. In similar fashion, you can tweak your <a href="http://dev.mysql.com/doc/refman/5.0/en/myisam-storage-engine.html">MyISAM</a> key buffers, or even place your index and data files on different drives. </p>
<p>Facebook team recently <a href="http://www.facebook.com/note.php?note_id=76191543919">released the details of their Haystack photo storage system</a> which serves as a great case study of working around the IO bottlenecks: over 15PB of photo storage, and ~360 new photos being uploaded every second as of April '09. To meet the requirements, they dropped the POSIX filesystem semantics and went for an append only structure with a separate in-memory index which stores the direct inode offsets for each photo. As a result, each photo access is translated into a single IO request - a huge win. Read through it, <a href="http://www.facebook.com/note.php?note_id=76191543919">fascinating stuff</a> and an illustrative example of optimizing for IO.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=RtVFWofIDfs:Ky5MJ2qTxGM:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=RtVFWofIDfs:Ky5MJ2qTxGM:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=RtVFWofIDfs:Ky5MJ2qTxGM:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=RtVFWofIDfs:Ky5MJ2qTxGM:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=RtVFWofIDfs:Ky5MJ2qTxGM:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=RtVFWofIDfs:Ky5MJ2qTxGM:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=RtVFWofIDfs:Ky5MJ2qTxGM:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=RtVFWofIDfs:Ky5MJ2qTxGM:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=RtVFWofIDfs:Ky5MJ2qTxGM:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/RtVFWofIDfs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/06/23/measuring-optimizing-io-performance/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/06/23/measuring-optimizing-io-performance/</feedburner:origLink></item>
		<item>
		<title>Profiling Ruby With Google’s Perftools</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/iLBoEZZ5xSk/</link>
		<comments>http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/#comments</comments>
		<pubDate>Sat, 13 Jun 2009 16:18:04 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Ruby]]></category>

		<category><![CDATA[debugging]]></category>

		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=501</guid>
		<description><![CDATA[Benchmarking, profiling and debugging are all areas where better tool support could really benefit the Ruby community. Built in benchmark library and extensions such as ruby-prof provide us with a minimal level of introspection to help identify the common bottlenecks, but they still fall short of the available tools for the JVM, or other dynamic [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" style="margin-right: 1em;" src="http://www.igvita.com/posts/09/google-perftools.png"/>Benchmarking, profiling and debugging are all areas where better tool support could really benefit the Ruby community. Built in <a href="http://www.ruby-doc.org/stdlib/libdoc/benchmark/rdoc/index.html">benchmark</a> library and extensions such as <a href="http://ruby-prof.rubyforge.org/">ruby-prof</a> provide us with a minimal level of introspection to help identify the common bottlenecks, but they still fall short of the available tools for the JVM, or other dynamic runtimes.</p>
<p>If you're working with Rails, the story is better, as within the past year companies such as <a href="http://www.fiveruns.com/">FiveRuns</a> and <a href="http://www.newrelic.com/">New Relic</a> have built a number of great performance and monitoring applications. However, if you're working on the bowels of the framework itself, or attempting to optimize a codepath in any other Ruby application, you're out of luck. For that reason, one interesting project to explore is <a href="http://github.com/tmm1/perftools.rb/tree/master">perftools.rb</a>, an adaptation of <a href="http://code.google.com/p/google-perftools/">Google's perftools</a> library to the Ruby land by <a href="http://twitter.com/tmm1">Aman Gupta</a>.</p>
<h4><strong>Hidden Gems in Google's Perftools: CPU + Heap profilers</strong></h4>
<p>Google's <a href="http://code.google.com/p/google-perftools/">perftools library</a>, which recently <a href="http://www.linux-magazine.com/online/news/improved_multi_threading_performance_google_s_perf_tools_version_1_0">reached 1.0 status</a> is largely known for its fast <a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html">TCMalloc</a> which offers impressive <a href="http://mysqlha.blogspot.com/2009/03/more-on-using-google-perftools-with.html">drop in speed improvements</a> over the regular <a href="http://en.wikipedia.org/wiki/Malloc">malloc</a> (memory allocation). However, also hidden inside is a set of heap and CPU performance analysis and visualizations tools that are all too often overlooked.</p>
<p>At the moment perftools.rb only supports the CPU profiler, but the heap profiler is next on the agenda and offers a lot of promise. To get started, simply download and install the gem (<em>gem install perftools.rb</em>), the installer will download Google's perftools library, patch it on the fly to enable Ruby instrumentation and install it on your system. From here, you have two options on how to perform the profiling: explicitly instrument your code, or include the perftools library on the fly.</p>
<p><a href="javascript:showme('3944_1');"> <b>> perftools-instrument.rb</b></a>
<div style=" background:white;" id=3944_1>
<pre class="ruby"><span style="color:#008000; font-style:italic;">## instrument your code</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'perftools'</span>
<span style="color:#6666ff; font-weight:bold;">PerfTools::CpuProfiler</span>.<span style="color:#9900CC;">start</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;/tmp/add_numbers_profile&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  5_000_000.<span style="color:#9900CC;">times</span><span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006666;">1</span><span style="color:#006666;">+2</span><span style="color:#006666;">+3</span><span style="color:#006666;">+4</span><span style="color:#006666;">+5</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#008000; font-style:italic;">## profile existing application</span>
CPUPROFILE=/tmp/my_app_profile RUBYOPT=<span style="color:#996600;">&quot;-r`gem which perftools | tail -1`&quot;</span> ruby my_app.<span style="color:#9900CC;">rb</span></pre>
</div>
<p>The actual profiling of your code is done via a sampling process where the interpreter is interrupted on a periodic interval, the state is frozen for a few milliseconds and a snapshot of the stack is taken. This does mean that your process has to run for long enough for the profiler to collect a few samples (be careful with short scripts). Once the process is complete you can extract a quick text summary. For example, here is sample output for a simple Sinatra application:</p>
<p><a href="javascript:showme('3944_2');"> <b>> sinatra-perf.rb</b></a>
<div style=" background:white;" id=3944_2>
<pre class="ruby"><span style="color:#006600; font-weight:bold;">&#91;</span>ilya@igvita r<span style="color:#006600; font-weight:bold;">&#93;</span>&gt; CPUPROFILE=/tmp/sinatra RUBYOPT=<span style="color:#996600;">&quot;-r`gem which perftools | tail -1`&quot;</span> ruby hello.<span style="color:#9900CC;">rb</span>
<span style="color:#006600; font-weight:bold;">&#91;</span>ilya@igvita r<span style="color:#006600; font-weight:bold;">&#93;</span>&gt; ab -c <span style="color:#006666;">10</span> -n <span style="color:#006666;">2000</span> localhost:<span style="color:#006666;">4567</span>/
&nbsp;
<span style="color:#006600; font-weight:bold;">&#91;</span>ilya@igvita r<span style="color:#006600; font-weight:bold;">&#93;</span>&gt; pprof.<span style="color:#9900CC;">rb</span> --text --ignore=Gem /tmp/sinatra
Total: <span style="color:#006666;">502</span> samples
     <span style="color:#006666;">180</span>  <span style="color:#006666;">35.9</span>%  <span style="color:#006666;">35.9</span>%      <span style="color:#006666;">412</span>  <span style="color:#006666;">82.1</span>% EventMachine.<span style="color:#9900CC;">run_machine</span>
     <span style="color:#006666;">150</span>  <span style="color:#006666;">29.9</span>%  <span style="color:#006666;">65.7</span>%      <span style="color:#006666;">150</span>  <span style="color:#006666;">29.9</span>% <span style="color:#CC00FF; font-weight:bold;">Time</span><span style="color:#008000; font-style:italic;">#initialize</span>
      <span style="color:#006666;">38</span>   <span style="color:#006666;">7.6</span>%  <span style="color:#006666;">73.3</span>%       <span style="color:#006666;">38</span>   <span style="color:#006666;">7.6</span>% garbage_collector
      <span style="color:#006666;">24</span>   <span style="color:#006666;">4.8</span>%  <span style="color:#006666;">78.1</span>%       <span style="color:#006666;">24</span>   <span style="color:#006666;">4.8</span>% <span style="color:#CC00FF; font-weight:bold;">Time</span><span style="color:#008000; font-style:italic;">#-</span>
      <span style="color:#006666;">17</span>   <span style="color:#006666;">3.4</span>%  <span style="color:#006666;">81.5</span>%       <span style="color:#006666;">17</span>   <span style="color:#006666;">3.4</span>% <span style="color:#CC0066; font-weight:bold;">String</span><span style="color:#008000; font-style:italic;">#split</span>
      <span style="color:#006666;">10</span>   <span style="color:#006666;">2.0</span>%  <span style="color:#006666;">83.5</span>%       <span style="color:#006666;">10</span>   <span style="color:#006666;">2.0</span>% <span style="color:#CC00FF; font-weight:bold;">IO</span><span style="color:#008000; font-style:italic;">#write</span>
       <span style="color:#006666;">6</span>   <span style="color:#006666;">1.2</span>%  <span style="color:#006666;">84.7</span>%        <span style="color:#006666;">6</span>   <span style="color:#006666;">1.2</span>% <span style="color:#CC00FF; font-weight:bold;">StringIO</span><span style="color:#008000; font-style:italic;">#size</span>
&nbsp;
<span style="color:#008000; font-style:italic;"># generate a visualization of the callstack</span>
<span style="color:#006600; font-weight:bold;">&#91;</span>ilya@igvita r<span style="color:#006600; font-weight:bold;">&#93;</span>&gt; pprof.<span style="color:#9900CC;">rb</span> --gif /tmp/sinatra &gt; /tmp/sinatra.<span style="color:#9900CC;">gif</span></pre>
</div>
<p><div class='download-link'>
							<a href='http://www.igvita.com/download.php?file=http://www.github.com/tmm1/perftools.rb/tree/master/.git'><img alt='Download' class='leftalign' src='http://www.igvita.com/wp-content/plugins/dBeautifier/icons/github.png' /></a>
							<h4>
								<a href='http://www.igvita.com/download.php?file=http://www.github.com/tmm1/perftools.rb/tree/master/.git'>perftools.rb (Google-perftools for Ruby code)</a>
							</h4><p>Downloads: 267 File Size: 0.0 KB </p>
						</div></p>
<h4><strong>Visualizing Performance & Bottlenecks With pprof.rb</strong></h4>
<p><a href="http://www.igvita.com/posts/09/sinatra.gif"><img align="left" style="margin-right: 1em;" src="http://www.igvita.com/posts/09/sinatra-perf-preview.png"/></a> However, while the text analysis is definitely useful, the real gold nugget in Google's perftools library is its ability to visualize the profiling information. By converting our profiler output into GIF or PDF formats, we immediately find several interesting codepaths in our Sinatra application (click on preview): majority of the time is spent in EventMachine (as expected), about 8% of time is taken up by the GC, and <strong>30%</strong> of time is spent in the logging code calling Time.new! Armed with that information we now know where the bottlenecks are, their associated codepaths, and we can now work on optimizing our application!</p>
<p>In similar fashion, Aman produced visualizations for <a href="http://perftools-rb.rubyforge.org/examples/rubygems.gif">rubygems</a>, <a href="http://perftools-rb.rubyforge.org/examples/merb.gif">Merb</a>, <a href="http://perftools-rb.rubyforge.org/examples/rails.gif">Rails</a>, and <a href="http://perftools-rb.rubyforge.org/examples/eventmachine-epoll+nothreads.gif">EventMachine</a> - tons of insights in each graph! Definitely a nice library to add to your debug and performance optimization toolkit!</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=iLBoEZZ5xSk:p54VXuL84WY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=iLBoEZZ5xSk:p54VXuL84WY:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=iLBoEZZ5xSk:p54VXuL84WY:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=iLBoEZZ5xSk:p54VXuL84WY:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=iLBoEZZ5xSk:p54VXuL84WY:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=iLBoEZZ5xSk:p54VXuL84WY:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=iLBoEZZ5xSk:p54VXuL84WY:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=iLBoEZZ5xSk:p54VXuL84WY:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=iLBoEZZ5xSk:p54VXuL84WY:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/iLBoEZZ5xSk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/</feedburner:origLink></item>
		<item>
		<title>Easy Map-Reduce With Hadoop Streaming</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/0D9r7R_1Kcg/</link>
		<comments>http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 15:05:01 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Architecture]]></category>

		<category><![CDATA[hadoop]]></category>

		<category><![CDATA[map-reduce]]></category>

		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=486</guid>
		<description><![CDATA[If you're considering doing large scale analysis of structured data (access logs, for example), there are dozens of enterprise-level solutions ranging from specialized streaming databases, to the more mundane data warehousing solutions with star topologies and column store semantics. Google, facing the same problem, developed a system called Sawzall, which leverages their existing Map-Reduce clusters [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" src="http://www.igvita.com/posts/09/hadoop.png" style="margin-right: 1em;"/>If you're considering doing large scale analysis of structured data (access logs, for example), there are dozens of enterprise-level solutions ranging from specialized streaming databases, to the more mundane data warehousing solutions with star topologies and column store semantics. Google, facing the same problem, developed a system called <a href="http://labs.google.com/papers/sawzall.html">Sawzall</a>, which leverages their existing Map-Reduce clusters for large scale parallel data analysis by adding a DSL for easy manipulation of data. </p>
<p>Undoubtedly inspired by Google's work, the guys at NY Times recently <a href="http://open.blogs.nytimes.com/2009/05/11/announcing-the-mapreduce-toolkit/">released MRToolkit</a>, a Ruby framework for setting up and running Apache Hadoop jobs with a heavy Sawzall flavor. Short of a great community contribution, and a great <a href="http://code.google.com/p/mrtoolkit/source/browse/trunk">source tree</a> to get familiar with, perhaps the most interesting part of the project is the great showcase of <a href="http://hadoop.apache.org/core/docs/r0.15.2/streaming.html">Hadoop Streaming</a> at work - an interface worth getting familiar with. </p>
<h4><strong>Hadoop Streaming: Simple Map-Reduce</strong></h4>
<p>Writing a Map/Reduce job in Hadoop usually entails writing two Java functions: a map which splits the dataset into independent chunks, and a reduce which combines the results to perform some useful analysis. The framework takes care of the sorting, coordination and scheduling, and thus provides the underlying abstraction for distributed computing. However, although the framework is implemented in Java, Hadoop's Streaming interface is an important, and an often overlooked feature, which allows us to write Map/Reduce applications in any language that is capable of working with STDIN and STDOUT!</p>
<p>In fact, we can create a simple Map/Reduce word count application with nothing but standard *nix tools available on any machine: <a href="http://linux.die.net/man/1/cat">cat</a>, and <a href="http://linux.die.net/man/1/wc">wc</a>. Assuming your data is already loaded into an HDFS cluster, we can kick-off our job:</p>
<blockquote><p>
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \<br />
    -input myInputDirs \<br />
    -output myOutputDir \<br />
    -mapper /bin/cat \<br />
    -reducer /bin/wc
</p></blockquote>
<p>As the example shows, our map and reduce scripts can be any executable that reads input from STDIN (line by line) and emit output to STDOUT. Have a Ruby/Python/Bash script that is capable of both? Congratulations, you can write a Map/Reduce job on Hadoop!</p>
<h4><strong>MRToolkit: Hadoop + Ruby = Sawzall </strong></h4>
<p><a href="http://code.google.com/p/mrtoolkit/">Map/Reduce Toolkit</a> by NY Times engineers is a great example of a Ruby DSL on top of the Hadoop Streaming interface. Specifically aimed at simplifying their internal log processing jobs, it exposes just the necessary bits for handling the access log inputs and provides a number of predefined reduce steps: unique, counter, etc. For example, to get a list of all unique visitor IP's, the entire program consists of:</p>
<p><a href="javascript:showme('2598_1');"> <b>> mr-unique-ip.rb</b></a>
<div style=" background:white;" id=2598_1>
<pre class="ruby"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'mrtoolkit'</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> MainMap &lt; MapBase
  <span style="color:#9966CC; font-weight:bold;">def</span> declare
    <span style="color:#008000; font-style:italic;"># declare log fields</span>
    field <span style="color:#ff3333; font-weight:bold;">:ip</span>
    field <span style="color:#ff3333; font-weight:bold;">:request</span>
    field <span style="color:#ff3333; font-weight:bold;">:status</span>
&nbsp;
    emit <span style="color:#ff3333; font-weight:bold;">:ip_ua</span>
    emit <span style="color:#ff3333; font-weight:bold;">:ip</span>
    emit <span style="color:#ff3333; font-weight:bold;">:ua</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> process<span style="color:#006600; font-weight:bold;">&#40;</span>input, output<span style="color:#006600; font-weight:bold;">&#41;</span>
    ua = input.<span style="color:#9900CC;">ua</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span>/\\s/<span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    output.<span style="color:#9900CC;">ip_ua</span> = <span style="color:#996600;">&quot;#{input.ip}|#{ua}&quot;</span>
    output.<span style="color:#9900CC;">ip</span> = input.<span style="color:#9900CC;">ip</span>
    output.<span style="color:#9900CC;">ua</span> = ua
    output
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> MainJob &lt; JobBase
  <span style="color:#9966CC; font-weight:bold;">def</span> job
    mapper MainMap
    reducer UniqueCountReduce, <span style="color:#006666;">2</span>
    indir <span style="color:#996600;">&quot;logs&quot;</span>
    outdir <span style="color:#996600;">&quot;ip-ua&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre>
</div>
<p>Take a <a href="http://code.google.com/p/mrtoolkit/source/browse/trunk/lib/stream_runner.rb">closer look at the execution logic</a>, as that's where all the magic happens for spooling up Hadoop jobs and invoking the Ruby interpreter. This pattern can be easily adapted for any language and offers a very powerful way to leverage Hadoop instead of trying to build your own distributed processing layer.</p>
<p>In fact, we should think of other higher-order frameworks we can build on top of this pattern: data mining & machine learning, video & audio processing, etc., it's all fair game!</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=0D9r7R_1Kcg:0pBlGxK9Yls:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=0D9r7R_1Kcg:0pBlGxK9Yls:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=0D9r7R_1Kcg:0pBlGxK9Yls:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=0D9r7R_1Kcg:0pBlGxK9Yls:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=0D9r7R_1Kcg:0pBlGxK9Yls:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=0D9r7R_1Kcg:0pBlGxK9Yls:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=0D9r7R_1Kcg:0pBlGxK9Yls:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=0D9r7R_1Kcg:0pBlGxK9Yls:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=0D9r7R_1Kcg:0pBlGxK9Yls:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/0D9r7R_1Kcg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/</feedburner:origLink></item>
		<item>
		<title>Fibers &amp; Cooperative Scheduling in Ruby</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/KbihHxZw5fs/</link>
		<comments>http://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/#comments</comments>
		<pubDate>Wed, 13 May 2009 17:43:51 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Ruby]]></category>

		<category><![CDATA[fibers]]></category>

		<category><![CDATA[mri]]></category>

		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=455</guid>
		<description><![CDATA[Continuations have been a part of the Ruby API for quite some time, but for a variety of reasons have not seen much practical use: early Ruby 1.8 implementations suffered from serious memory leak problems, which were consequently mostly resolved, and the somewhat academic nature of the concept didn't help either. However, with the production [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" style="margin-right: 1em;border:1px solid #999" src="http://www.igvita.com/posts/09/ruby-fibers.png"/>Continuations have been a part of the Ruby API for quite some time, but for a variety of reasons have not seen much practical use: early <a href="http://www.ruby-doc.org/docs/ProgrammingRuby/html/ref_c_continuation.html">Ruby 1.8 implementations</a> suffered from serious memory leak problems, which were consequently mostly resolved, and the somewhat academic nature of the concept didn't help either. However, with the production versions of Ruby 1.9 out in the wild, continuations (now known as <strong>Fibers</strong>) deserve a second look.</p>
<p>Unlike a function, which has a defined entry and exit point, a continuation can return (yield) many times over, which means that the execution of a Fiber can be arbitrarily suspended and then resumed from the same point. While somewhat subtle, this is actually a very practical feature: continuations implicitly preserve state without having the programmer to worry about it.</p>
<h4><strong>Fiber Performance in Ruby</strong></h4>
<p>Continuations in Ruby 1.8 (MRI) suffered from a number of <a href="http://www.infoq.com/news/2009/01/ruby-patches-fix-leaks">serious performance problems</a>, but the introduction of Fibers in Ruby 1.9 warrants that we revisit the concept. First of all, Fibers are now <a href="http://oldmoe.blogspot.com/2008/08/ruby-fibers-vs-ruby-threads.html">cheaper to create and perform better than threads</a> due to their nature as a lightweight context. And even better, they give us the ability to do cooperative scheduling - instead of using the preemptive context switch model we're all used to with threading, we can manually schedule Fibers to pass control back and forth whenever we want to. Let's take a look at an admittedly contrived, but also a realistic scenario: two execution threads; one thread blocks for 40ms on an IO call and then takes 10ms to post-process the data; second thread needs 50ms of pure CPU time; vs same scenario implemented with Fibers and cooperative scheduling.</p>
<p align="center"><img src="http://www.igvita.com/posts/09/fibers-vs-threads.png" alt="Fibers vs Threads" /></p>
<p>By default, Ruby MRI uses a fair scheduler, which means that each thread is given an equal time to run (10ms quantum) before it is suspended and the next thread is put in control. If one of your threads is inside a blocking call during those 10ms, think of it as time wasted - everyone is sleeping! By contrast, Fibers force the programmer to do explicit scheduling which can certainly add to the complexity of the program, but offer us the full flexibility of determining how our CPU resources are used and also help us avoid the need for locks in mutexes in our code!</p>
<h4><strong>Fibers and Cooperative IO Scheduling</strong></h4>
<p>Event-driven and asynchronous programming models are great candidates for application of Fibers. <a href="http://www.igvita.com/2008/05/27/ruby-eventmachine-the-speed-demon/">Ruby EventMachine</a> allows us to asynchronously defer blocks of code to be executed when the resource (socket, file descriptor, etc) responds, which means that unlike a regular Net::HTTP call, we don't have to spin-wait indefinitely for the response. However, this model comes at a cost: it is often hard to retrofit code with asynchronous libraries because the callbacks have to be nested and the abstraction often breaks down (compare <a href="http://github.com/igrigorik/em-http-request/tree">em-http-request</a> to <a href="http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html">net/http</a> API). Fibers solve this problem for us by allowing us to turn an asynchronous library into what looks like a synchronous API without losing the advantages of IO-scheduling of the asynchronous execution:</p>
<p><a href="javascript:showme('6245_1');"> <b>> fibered-http.rb</b></a>
<div style=" background:white;" id=6245_1>
<pre class="ruby"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'eventmachine'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'em-http'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'fiber'</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> async_fetch<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  f = Fiber.<span style="color:#9900CC;">current</span>
  http = <span style="color:#6666ff; font-weight:bold;">EventMachine::HttpRequest</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">get</span> <span style="color:#ff3333; font-weight:bold;">:timeout</span> =&gt; <span style="color:#006666;">10</span>
  http.<span style="color:#9900CC;">callback</span> <span style="color:#006600; font-weight:bold;">&#123;</span> f.<span style="color:#9900CC;">resume</span><span style="color:#006600; font-weight:bold;">&#40;</span>http<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
  http.<span style="color:#9900CC;">errback</span> <span style="color:#006600; font-weight:bold;">&#123;</span> f.<span style="color:#9900CC;">resume</span><span style="color:#006600; font-weight:bold;">&#40;</span>http<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
  <span style="color:#0000FF; font-weight:bold;">return</span> Fiber.<span style="color:#9966CC; font-weight:bold;">yield</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
EventMachine.<span style="color:#9900CC;">run</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  Fiber.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#123;</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Setting up HTTP request #1&quot;</span>
    data = async_fetch<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'http://www.google.com/'</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Fetched page #1: #{data.response_header.status}&quot;</span>
&nbsp;
    EventMachine.<span style="color:#9900CC;">stop</span>
  <span style="color:#006600; font-weight:bold;">&#125;</span>.<span style="color:#9900CC;">resume</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#008000; font-style:italic;"># Setting up HTTP request #1</span>
<span style="color:#008000; font-style:italic;"># Fetched page #1: 302</span>
&nbsp;</pre>
</div>
<p>Running the following code shows in-order output, which means that we've simulated a synchronous API while using an asynchronous library! Now imagine running multiple fibers in parallel while repeating this pattern: you will have interleaved execution which will be scheduled via IO interrupts - no extra context switches, no locks! Joe Damato and Aman Gupta <a href="http://timetobleed.com/fibers-implemented-for-ruby-1867/">ported the functionality of Fibers to Ruby 1.8.(6|7)</a>, and also provide a <a href="http://timetobleed.com/fibers-implemented-for-ruby-1867/">great example</a> of converting the <a href="http://www.igvita.com/2008/10/27/scaling-activerecord-with-mysqlplus/">asynchronous MySQL driver</a> to work in synchronous fashion. Looking for more? Take a look at an <a href="http://github.com/careo/fiber-examples/blob/ad11d8c787206464f56869add13a219b10faf62c/rpc/future_rpc.rb">implementation of Futures in Ruby</a> and the <a href="http://www.espace.com.eg/docs/neverblock/index.html">Neverblock library</a>.</p>
<h4><strong>Cooperative Scheduling: M Threads + N Fibers Model</strong></h4>
<p>While manual scheduling of Fibers may seem like a high cost, much of it can be abstracted into the underlying libraries and the potential benefits are enormous: cooperative scheduling is the optimal scheduling model and it does not incur the complexity inherent with threading. Having said that, there is still a place for threads! Fibers will not give you concurrency on multi-core systems and they can only be resumed in the thread that created them. Hence, if we were aiming for the optimal resource utilization: run M threads, where M corresponds to number of cores, and optimize the number of Fibers (N) which can run within each thread. This way we can utilize all the available cores, and also take advantage of the IO scheduling model!</p>
<h4><strong>Fibers in JRuby, Rubinius and Other Applications</strong></h4>
<p><img align="left" style="margin-right: 1em;" src="http://www.igvita.com/posts/09/jruby-rubinius-ironruby.png"/>Unlike the MRI Ruby VM, both JRuby and Rubinius implement "Poor Man's Fibers" where each Fiber is in fact, mapped to a native thread. This of course looses on the conceptual efficiency side when compared to MRI, but it still remains a viable and a more efficient option than the pure preemptive scheduling.</p>
<p>In fact, while Fibers are great tool for abstracting asynchronous execution, they are also great for lazy evaluation patterns (such as <a href="http://en.wikipedia.org/wiki/Generator_%28computer_science%29">Generators</a>) and can even be used to <a href="http://www.double.co.nz/pdf/continuations.pdf">add state/session persistence to the HTTP protocol</a>! Wee is a Ruby <a href="http://www.infoq.com/news/2009/03/wee">continuations based web framework</a>, which is likely <a href="http://lib.store.yahoo.net/lib/paulgraham/bbnexcerpts.txt">inspired by some of Paul Grahams work at Viaweb</a> (great read). Get Ruby 1.9 on your system, give Fibers a try!</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=KbihHxZw5fs:WeW1-uKrrHk:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=KbihHxZw5fs:WeW1-uKrrHk:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=KbihHxZw5fs:WeW1-uKrrHk:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=KbihHxZw5fs:WeW1-uKrrHk:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=KbihHxZw5fs:WeW1-uKrrHk:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=KbihHxZw5fs:WeW1-uKrrHk:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=KbihHxZw5fs:WeW1-uKrrHk:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=KbihHxZw5fs:WeW1-uKrrHk:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=KbihHxZw5fs:WeW1-uKrrHk:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/KbihHxZw5fs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/</feedburner:origLink></item>
		<item>
		<title>Open Source Software Apprentice</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/P5fiC4lmGEA/</link>
		<comments>http://www.igvita.com/2009/04/27/open-source-software-apprentice/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 16:11:45 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Agile Methods]]></category>

		<category><![CDATA[apprentice]]></category>

		<category><![CDATA[foss]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=396</guid>
		<description><![CDATA[Learning through experimentation and tinkering is one of the best ways to proficiency in any field. Combined with a feedback loop from an experienced mentor, it is almost a guaranteed path to success. This model of apprenticeship first pioneered in the Middle Ages by numerous craft guilds exchanged masters time for inexpensive labor from the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/papajack/321166072/in/photostream/"><img align="left" src="http://www.igvita.com/posts/09/apprentice.png" style="margin: 0.5em 1em 0pt 0pt; border:1px solid #999"/></a>Learning through experimentation and tinkering is one of the best ways to proficiency in any field. Combined with a feedback loop from an experienced mentor, it is almost a guaranteed path to success. This model of apprenticeship first pioneered in the Middle Ages by numerous craft guilds exchanged masters time for inexpensive labor from the apprentice, and by all accounts, gave us thousands of accomplished individuals who may not have made it otherwise. Trade tricks and regional specialties were passed on and methods were evolved and improved. I think there are a few things we (the open source community) can learn from that past.</p>
<p>Listening to the <a href="http://pivotallabs.com/users/chad/blog/articles/791-gogaruco-09-discussion-panel-ruby-application-frameworks">frameworks panel</a> at GoGaRuCo (<a href="http://rubyonrails.org/">Rails</a>, <a href="http://merbivore.com/">Merb</a>, <a href="http://www.sinatrarb.com/">Sinatra</a>, <a href="http://adhearsion.com/">Adhearsion</a>, <a href="http://rad.rubyforge.org/">RAD</a>, <a href="http://shoooes.net/">Shoes</a>) the message  was abundantly clear: distributed version control systems (DVCS - <a href="http://git-scm.com/">Git</a>, <a href="http://www.selenic.com/mercurial/wiki/">Mercurial</a>, <a href="http://bazaar-vcs.org/">Bazaar</a>) are changing the game in the open source world. Every project attributed at least part of their success to the new "easy fork, branch and merge" model, which both lowers the entry barrier and encourages experimentation. Visualization of <a href="http://www.igvita.com/2009/01/27/ruby-swarms-visualizing-rails-git/">Ruby on Rails contributions</a> is a great example of this trend (with a big explosion in new contributors after the switch to Git, <a href="http://weblog.rubyonrails.org/2008/4/2/rails-is-moving-from-svn-to-git">circa April '08</a>). </p>
<h4><strong>Social Coding & Distributed Version Control</strong></h4>
<p><img align="left" src="http://www.igvita.com/posts/09/dvcs.png" style="margin: 0pt 1em 0pt 0pt;"/>It is hard to scientifically measure the impact of a switch to DVCS, since there are a number of hidden variables in the equation. For example, at least one reason why the number of contributors rises dramatically is due to the fixed attribution model: distributed version control allows to trace code and patches back to the original author instead of having the commit guardians merge changes under their own name. Let's face it, we all like to see our name and get recognition, there is nothing wrong with that, and in fact, it should be encouraged.</p>
<p>"<em>Social coding, community, and switch to DVCS</em>" is the new mantra in the open source world. But while definitely a step in the right direction, it also strikes me as a passive strategy: "<em>put it in DVCS and they will come, community will grow</em>". This approach hasn't worked for many companies in the past when it comes to product development, nor do I think it will work wonders in the open source world. After all, while building a community and contributing code are not mutually exclusive, they are very different tasks. Some developers can successfully juggle both roles, some can't, and majority doesn't even think about it. A source tree with one contributor is a personal itch, two is a project, and three is a tribe. <strong>We need to build tribes</strong>.</p>
<h4><strong>Fostering the Role of the Software Apprentice</strong></h4>
<p>Every field has a rockstar (or a few), it may be the original author of a successful project (<a href="http://en.wikipedia.org/wiki/Linus_Torvalds">Linus</a>: Linux, <a href="http://en.wikipedia.org/wiki/Michael_(Monty)_Widenius">Monty</a>: MySQL, <a href="http://en.wikipedia.org/wiki/Rasmus_Lerdorf">Rasmus</a>: PHP, <a href="http://en.wikipedia.org/wiki/Yukihiro_Matsumoto">Matz</a>: Ruby), or a core contributor who knows the source tree inside out. Their time is scarce, but I am always amazed how approachable and willing to help most of them are. Databases tickle your fancy? Did you know that <a href="https://launchpad.net/drizzle/+topcontributors">Drizzle team</a> (one of the most promising DB initiatives) is <a href="https://launchpad.net/drizzle/+mentoring">looking for people to mentor</a>? In fact, I think it is more than a curious coincidence that any successful open source project today is also one that is actively engaging in apprenticeship programs. Master-Apprentice model is an incredible community builder.</p>
<p><img align="left" src="http://www.igvita.com/posts/09/pairon.png" style="margin: 0pt 1em 0pt 0pt;"/>Like most Ruby developers, I am a big fan of GitHub as it allows me to track and contribute to dozens of projects in a fraction of time I would have spent previously. However, I think some of the tooling is missing. <a href="https://launchpad.net/">Launchpad</a> which uses bazaar DVCS and hosts projects such as <a href="https://launchpad.net/ubuntu">Ubuntu</a> and <a href="https://launchpad.net/mysql">MySQL</a> is already tapping into the master-apprentice community (ex. <a href="https://launchpad.net/drizzle/+mentoring">Drizzle mentor projects</a>). But we can do better. Those <em>roadmap</em> and <em>todo</em> txt's in our source trees or a <a href="http://github.com/kr/beanstalkd/issues">hijacked bug-tracker</a> as a roadmap tool is pointing to one thing: they need to be social artifacts. If the author publishes his wishlist and offers his time to mentor and guide the contributor, everyone wins. Instead of guessing and bickering over the features, I can learn from the best, and together we can start a tribe. Let me vote on features, propose new ones, let's discuss them - hardly revolutionary concepts for the social web.</p>
<h4><strong>Mentors, please stand up!</strong></h4>
<p>Have a project? Let the community know what you want, put yourself out there, offer your time. <a href="http://code.google.com/soc/">Google Summer of Code</a> is a fantastic initiative to pair students and mentors (<a href="http://wiki.rubyonrails.org/gsoc/2009/ideas">Rails + GSoC '09 projects</a>), but I don't think it's the money that ultimately motivates either side to contribute. Of course, this is no magic bullet, and make no mistake, as a mentor you'll have to invest just as much, if not more of your time to make a feature come true. Outside contributors will make 'newbie mistakes', and they will try your patience, but the endgame is worth it.</p>
<p>I want to help, I want to learn, I want to see roadmaps, I want to discuss wishlists, I want to build tribes. </p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=P5fiC4lmGEA:2e3ZQVUJe4c:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=P5fiC4lmGEA:2e3ZQVUJe4c:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=P5fiC4lmGEA:2e3ZQVUJe4c:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=P5fiC4lmGEA:2e3ZQVUJe4c:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=P5fiC4lmGEA:2e3ZQVUJe4c:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=P5fiC4lmGEA:2e3ZQVUJe4c:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=P5fiC4lmGEA:2e3ZQVUJe4c:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=P5fiC4lmGEA:2e3ZQVUJe4c:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=P5fiC4lmGEA:2e3ZQVUJe4c:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/P5fiC4lmGEA" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/04/27/open-source-software-apprentice/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/04/27/open-source-software-apprentice/</feedburner:origLink></item>
		<item>
		<title>Ruby Proxies for Scale and Monitoring</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/vIPemocY9bk/</link>
		<comments>http://www.igvita.com/2009/04/20/ruby-proxies-for-scale-and-monitoring/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 19:33:14 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Architecture]]></category>

		<category><![CDATA[eventmachine]]></category>

		<category><![CDATA[proxy]]></category>

		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=386</guid>
		<description><![CDATA[Lift the curtain behind any modern web application and you will find at least a few proxy servers orchestrating the show. Caching proxies such as Varnish and Squid help us take the load of our application servers; reverse proxies such as Haproxy and Nginx help us partition and distribute the workload to multiple workers, all [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.igvita.com/posts/09/duplex-proxy.png" align="left" style="margin-right:1em;" />Lift the curtain behind any modern web application and you will find at least a few proxy servers orchestrating the show. Caching proxies such as <a href="http://varnish.projects.linpro.no/">Varnish</a> and <a href="http://www.squid-cache.org/">Squid</a> help us take the load of our application servers; reverse proxies such as <a href="http://haproxy.1wt.eu/">Haproxy</a> and <a href="http://wiki.nginx.org/Main">Nginx</a> help us partition and distribute the workload to multiple workers, all without revealing the underlying architecture to the user. In the Ruby world, <a href="http://railscasts.com/episodes/151-rack-middleware">Rack middleware</a> and <a href="http://weblog.rubyonrails.org/2008/12/17/introducing-rails-metal">Rails Metal</a> are sister concepts: both allow the programmer to inject functionality in the pre or post-processing step of the HTTP request. </p>
<p>However, nobody said that we should limit ourselves to HTTP, or that the proxy server has to be transparent to the user! After all, there is a great number of other potential use cases which we can use in our infrastructure: intercepting data, validating requests, benchmarking, logging, etc. In fact, a proxy server can be a powerful swiss-army knife in the right hands. Want to intercept SMTP traffic to detect spam? Maybe encrypt or decrypt a datastream on the fly? It’s all surprisingly simple with Ruby.</p>
<h4><strong>Proxy language: Transparent, Intercepting, Cut-through ...</strong></h4>
<div style="width:600px;text-align:left" id="__ss_1307248"><object style="margin:0px" width="600" height="490"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=igvita-gogaruco-proxies-090417152908-phpapp01&stripped_title=ruby-proxies-for-scale-performance-and-monitoring-gogaruco-igvitacom" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=igvita-gogaruco-proxies-090417152908-phpapp01&stripped_title=ruby-proxies-for-scale-performance-and-monitoring-gogaruco-igvitacom" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="600" height="490"></embed></object></div>
<p>Proxy servers can be placed in numerous places between the user and the destination, they can be chained and they can even alter the data at will. A transparent proxy will not modify the request or response and is commonly used for load balancing, authentication, or validation. On the other hand, an intercepting proxy is often used to modify the request or response to provide some added service to user or architect: transform data on the fly, encryption, extending a protocol, etc. Needless to say, intercepting proxies are a wonderful tool!</p>
<p align="center"><img src="http://www.igvita.com/posts/09/proxy-server.png" /></p>
<h4><strong>Ruby Proxy: Duplex Benchmarking</strong></h4>
<p>With minimal overhead, a proxy server can allow us to duplicate a request to multiple servers, for example to production and staging. Instead of the "<a href="http://www.igvita.com/2008/09/30/load-testing-with-log-replay/">record and replay headache</a>" we can instrument ourselves with a real-time performance debugging and monitoring tool: a request gets forked but only the production response is forwarded back to the client. From there, we can analyze the response time in realtime, compare the response bodies, or even alter the data at will: </p>
<p><a href="javascript:showme('2103_1');"> <b>> em-proxy.rb</b></a>
<div style=" background:white;" id=2103_1>
<pre class="ruby">options = <span style="color:#006600; font-weight:bold;">&#123;</span>
  <span style="color:#ff3333; font-weight:bold;">:proxy</span> =&gt; <span style="color:#006600; font-weight:bold;">&#123;</span>:port =&gt; <span style="color:#006666;">9000</span>, <span style="color:#ff3333; font-weight:bold;">:host</span> =&gt; <span style="color:#996600;">&quot;10.2.1.0&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span>,
  <span style="color:#ff3333; font-weight:bold;">:production</span> =&gt; <span style="color:#006600; font-weight:bold;">&#123;</span>:port =&gt; <span style="color:#006666;">9000</span>, <span style="color:#ff3333; font-weight:bold;">:host</span> =&gt; <span style="color:#996600;">&quot;10.2.1.1&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span>,
  <span style="color:#ff3333; font-weight:bold;">:benchmark</span> =&gt; <span style="color:#006600; font-weight:bold;">&#123;</span>:post =&gt; <span style="color:#006666;">9000</span>, <span style="color:#ff3333; font-weight:bold;">:host</span> =&gt; <span style="color:#996600;">&quot;10.2.1.2&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
<span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
EventMachine.<span style="color:#9900CC;">run</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  <span style="color:#6666ff; font-weight:bold;">EventMachine::ProxyServer</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>options<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">start</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre>
</div>
<p align="center"><img src="http://www.igvita.com/posts/09/live-benchmark.png" /></p>
<p><div class='download-link'>
							<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/em-proxy/tree/master/.git'><img alt='Download' class='leftalign' src='http://www.igvita.com/wp-content/plugins/dBeautifier/icons/github.png' /></a>
							<h4>
								<a href='http://www.igvita.com/download.php?file=http://www.github.com/igrigorik/em-proxy/tree/master/.git'>em-proxy (Ruby EventMachine Duplex Proxy)</a>
							</h4><p>Downloads: 445 File Size: 0.0 KB </p>
						</div></p>
<p>EM-Proxy is a barebones proxy implemented with <a href="http://rubyeventmachine.com/">Ruby EventMachine</a> which uses the reactor pattern for handling network connections. The performance overhead in the simplest proxy implementation is roughly 3-5% in the latency - a very low cost for the added functionality. Best of all, it is only ~300 lines of Ruby start to finish, and easily extensible. Take a <a href="http://github.com/igrigorik/em-proxy/tree/9310a9d3c14c064585947e83e98560e1221f6af1/lib/em-proxy">look through the source</a>, it's a powerful and wonderful hammer!</p>
<p class="download">Updated <a href="http://www.slideshare.net/igrigorik/ruby-proxies-for-scale-performance-and-monitoring-gogaruco-igvitacom-1396734">slides from RailsConf 2009</a> and updated code for <a href="http://github.com/igrigorik/em-proxy/tree/master">EM-Proxy on Github</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=vIPemocY9bk:7EXEdQXBrYw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=vIPemocY9bk:7EXEdQXBrYw:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=vIPemocY9bk:7EXEdQXBrYw:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=vIPemocY9bk:7EXEdQXBrYw:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=vIPemocY9bk:7EXEdQXBrYw:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=vIPemocY9bk:7EXEdQXBrYw:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=vIPemocY9bk:7EXEdQXBrYw:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=vIPemocY9bk:7EXEdQXBrYw:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=vIPemocY9bk:7EXEdQXBrYw:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/vIPemocY9bk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/04/20/ruby-proxies-for-scale-and-monitoring/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/04/20/ruby-proxies-for-scale-and-monitoring/</feedburner:origLink></item>
		<item>
		<title>Henry Ford &amp; Event Driven Architecture</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/32hBimdKu2g/</link>
		<comments>http://www.igvita.com/2009/04/06/henry-ford-event-driven-architecture/#comments</comments>
		<pubDate>Mon, 06 Apr 2009 15:00:51 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Architecture]]></category>

		<category><![CDATA[seda]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=376</guid>
		<description><![CDATA[Talking about architecture, as with most things in the computer industry, often requires deciphering a soup of three or four letter acronyms (XML, API, SOAP), all of which can often be described in much simpler terms. Working on a presentation on "Event Driven Architecture" for MeshU, I was struck how we've managed to obscure such [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.igvita.com/posts/09/assembly-line.png" align="left" style="margin-right:1em;" />Talking about architecture, as with most things in the computer industry, often requires deciphering a soup of three or four letter acronyms (<a href="http://en.wikipedia.org/wiki/Xml">XML</a>, <a href="http://en.wikipedia.org/wiki/Api">API</a>, <a href="http://en.wikipedia.org/wiki/SOAP">SOAP</a>), all of which can often be described in much simpler terms. Working on a presentation on "<a href="http://www.meshu.ca/blog/2009/02/27/qa-with-ilya-grigorik-from-aiderss/">Event Driven Architecture</a>" for <a href="http://www.meshu.ca/">MeshU</a>, I was struck how we've managed to obscure such a simple and familiar concept into something that few developers can understand: <strong>it's an assembly line</strong>!</p>
<p>As an industry, we've come a long way in the past decade. Not so long ago, everyone rolled their own flat-file databases, nowadays the challenge is with storing and accessing terabytes of data. However, a lot of this innovation is often a process of reinventing the wheel. It may not seem so trendy, but looking back exactly a century ago is a good place to start: the manufacturing & assembly line revolution.</p>
<h4><strong>Henry Ford: Beyond Horizontal Scalability</strong></h4>
<p align="center"><img src="http://www.igvita.com/posts/09/proxy-model.png" /></p>
<p>Horizontal scalability is all the rage, that is, if you can achieve it. If you've managed to abstract your database layer, then just keep adding servers, hide them behind a collection of proxy servers, and you're in the money. It works, up to a point.</p>
<p><img src="http://www.igvita.com/posts/09/manufacturing-model.png" align="left" style="margin-right:1em;" />What's remarkable is how similar this pattern is to the way a car was manufactured in the early 1900's. A team of two or three men would work a car, start to finish, and would take anywhere from a day or two to finish the order. Need more capacity, hire more teams! Life is great.</p>
<p>Then <a href="http://en.wikipedia.org/wiki/Henry_ford">Henry Ford came along</a> and changed the game. The simple insight that the efficiency can be increased by an order of magnitude by creating a workflow, as opposed to optimizing the start-to-finish process, is actually quite remarkable. Ford invented the assembly line, we're calling it "<a href="http://en.wikipedia.org/wiki/Staged_event-driven_architecture">Staged Event Driven Architecture (SEDA)</a>". End result, ninety three seconds to manufacture a car vs one to two days, and over one million cars produced every year. </p>
<h4><strong>Event Driven Architecture: The steps</strong></h4>
<p>First and foremost, stop thinking about time to complete. Think workflow instead. There is no reason to hold up the front of the line if someone else can do the work later. By deferring the work you cut down your response time, which means snappy user experience, and it also means that you're able to take advantage of the latest industry buzzwords: elastic computing, and real-time web. Keywords to explore: <a href="http://jira.amqp.org/confluence/display/AMQP/Advanced+Message+Queuing+Protocol">AMQP</a>, <a href="http://xmpp.org/">XMPP</a>, <a href="http://en.wikipedia.org/wiki/Cloud_computing">cloud computing</a>. Or, take a look at the slides from my MeshU presentation:</p>
<div style="text-align:center;width:600px;text-align:left" id="__ss_1254702"><object style="margin:0px" width="600" height="500"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=igvita-eda-meshu-090406095245-phpapp01&rel=0&stripped_title=event-driven-architecture-meshu-ilya-grigorik" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=igvita-eda-meshu-090406095245-phpapp01&rel=0&stripped_title=event-driven-architecture-meshu-ilya-grigorik" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="600" height="500"></embed></object></div>
<h4><strong>Cloud Computing: Assembly Line 2.0</strong></h4>
<p>It's tacky, but cloud computing is "<em>assembly line 2.0</em>". Even given all the industry hype, it is hard to underestimate the impact that cloud computing is going to have on our industry. All of the sudden, the cost of hiring / firing another server is zero, and the datacenter (factory) costs are swallowed by the provider - we can build and disassemble our factories in a matter of minutes! Toyota's "<a href="http://en.wikipedia.org/wiki/Cloud_computing">just in time</a>" manufacturing is as close as it gets in the physical world, but it's only a fraction of the efficiency we can achieve in the virtual world.</p>
<p>Having said that, we still have a thing or two to learn from the physical world.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=32hBimdKu2g:YZDG9DT1IAk:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=32hBimdKu2g:YZDG9DT1IAk:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=32hBimdKu2g:YZDG9DT1IAk:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=32hBimdKu2g:YZDG9DT1IAk:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=32hBimdKu2g:YZDG9DT1IAk:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=32hBimdKu2g:YZDG9DT1IAk:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=32hBimdKu2g:YZDG9DT1IAk:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=32hBimdKu2g:YZDG9DT1IAk:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=32hBimdKu2g:YZDG9DT1IAk:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/32hBimdKu2g" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/04/06/henry-ford-event-driven-architecture/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/04/06/henry-ford-event-driven-architecture/</feedburner:origLink></item>
		<item>
		<title>Ruby Algorithms: Sorting, Trie &amp; Heaps</title>
		<link>http://feedproxy.google.com/~r/igvita/~3/SATgvhVbDMg/</link>
		<comments>http://www.igvita.com/2009/03/26/ruby-algorithms-sorting-trie-heaps/#comments</comments>
		<pubDate>Thu, 26 Mar 2009 16:02:52 +0000</pubDate>
		<dc:creator>Ilya Grigorik</dc:creator>
		
		<category><![CDATA[Ruby]]></category>

		<category><![CDATA[heap]]></category>

		<category><![CDATA[sort]]></category>

		<category><![CDATA[trie]]></category>

		<guid isPermaLink="false">http://www.igvita.com/?p=347</guid>
		<description><![CDATA[A good choice of an algorithm or a data structure can make or break an application. By and large, Ruby provides enough native primitives such as Array, Hash, and Set, to get you by in most cases, but we've all ran into situations where the performance or the memory footprint of these structures left us [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" style="margin: 0pt 1em 0pt 0pt;" src="http://www.igvita.com/blog/posts/09/ruby-gsoc.png"/>A good choice of an algorithm or a data structure can make or break an application. By and large, Ruby provides enough native primitives such as <a href="http://www.ruby-doc.org/core/classes/Array.html">Array</a>, <a href="http://www.ruby-doc.org/core/classes/Hash.html">Hash</a>, and <a href="http://www.ruby-doc.org/core/classes/Set.html">Set</a>, to get you by in most cases, but we've all ran into situations where the performance or the memory footprint of these structures left us wanting for more.</p>
<p>Sensing a nice opportunity, <a href="http://kanwei.com/">Kanwei Li</a> and <a href="http://www.halostatue.ca/">Austin Ziegler</a> partnered up for the '08 Google Summer of Code (<a href="http://socghop.appspot.com/">GSoC</a>) to translate some of the most well known algorithms and data structures into Ruby, resulting in implementations of nine different containers and ten different search and sorting algorithms! And while I would encourage you to <a href="http://github.com/kanwei/algorithms/tree/master">explore the entire project</a>, a few aspects deserve special attention: first, there is the question of performance of sorting algorithms in Ruby, and second, Trie, Heap and Priority Queues are data structures worth keeping in mind.</p>
<h4><strong>Sorting Performance in Ruby</strong></h4>
<p>Turns out, the default sorting implementation in Ruby is very hard to beat. As part of the GSoC project, Kanwei Li implemented the naive (<a href="http://en.wikipedia.org/wiki/Bubble_sort">bubble sort</a>) and all the way up to some of the best in the trade (<a href="http://en.wikipedia.org/wiki/Quicksort">quicksort</a>), but the performance of all of them is well below the native implementation.</p>
<p>A pass through the MRI source code reveals that the the Array.sort! is implemented via quicksort. And while the benchmarks do show a four times performance hit when compared to the raw C++ version sorting an array of integers, you have to remember that an equivalent implementation in Ruby is operating on array of integer objects! In other words, native Array.sort! is a highly optimized piece of code: use it, embrace it, and don't worry about it!</p>
<h4><strong>Trie: Ordered Tree Data Structure</strong></h4>
<p><img align="left" style="margin: 0pt 1em 0pt 0pt;" src="http://www.igvita.com/blog/posts/09/trie.png"/><a href="http://en.wikipedia.org/wiki/Trie">Trie</a> is both an incredibly useful and an often overlooked data structure. More commonly known as a 'prefix tree', it can be an very compact way to store and pattern match on a collection of objects. A good example is working with a set of strings which are used as keys to a value, but the same pattern could also be used to match on <a href="http://github.com/kanwei/algorithms/blob/81987340f24edb7fd8ca60885b1f04d1907f8c03/lib/containers/trie.rb">any sequence of objects</a>. For example, a Trie can be used to implement a route pattern matcher in Rails. Of course, an example is worth a thousand words:</p>
<p><a href="javascript:showme('1940_1');"> <b>> trie.rb</b></a>
<div style=" background:white;" id=1940_1>
<pre class="ruby">t = <span style="color:#6666ff; font-weight:bold;">Containers::Trie</span>.<span style="color:#9900CC;">new</span>
&nbsp;
<span style="color:#008000; font-style:italic;">#       key     value</span>
t.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Hello&quot;</span>, <span style="color:#996600;">&quot;World&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
t.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Hilly&quot;</span>, <span style="color:#996600;">&quot;Billy&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
t.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Hello Bob&quot;</span>, <span style="color:#996600;">&quot;Heya&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
<span style="color:#008000; font-style:italic;"># regular lookup</span>
t<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;Hilly&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#008000; font-style:italic;"># =&gt; &quot;Billy&quot;</span>
&nbsp;
<span style="color:#008000; font-style:italic;"># find longest matching prefix</span>
t.<span style="color:#9900CC;">longest_prefix</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Hello stranger&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; &quot;Hello&quot;</span>
&nbsp;
<span style="color:#008000; font-style:italic;"># find all keys that start with H*llo (* is a wildcard)</span>
t.<span style="color:#9900CC;">wildcard</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;H*ll*&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; [&quot;Hello&quot;, &quot;Hilly&quot;]</span>
&nbsp;</pre>
</div>
<h4><strong>Heaps and Priority Queues</strong></h4>
<p><img align="left" style="margin: 0pt 1em 0pt 0pt;" src="http://www.igvita.com/blog/posts/09/priority-queue.png"/><a href="http://en.wikipedia.org/wiki/Heap_(data_structure)">Heaps</a> and <a href="http://en.wikipedia.org/wiki/Priority_queue">Priority Queues</a> are often a drop in performance replacement for a hand sorted hash. Unlike a simple associative container, heaps and priority queues build up an internal sorted tree, which allows you to have constant time access to the highest (or lowest) priority object without having to sort or manually groom the container. Best of all, the underlying <a href="http://github.com/kanwei/algorithms/blob/81987340f24edb7fd8ca60885b1f04d1907f8c03/lib/containers/priority_queue.rb">Ruby implementation</a> uses a <a href="http://en.wikipedia.org/wiki/Fibonacci_heap">Fibonacci heap</a>, which means constant time lookup and inserts, and O(log n) removal:</p>
<p><a href="javascript:showme('1940_2');"> <b>> priority-queue.rb</b></a>
<div style=" background:white;" id=1940_2>
<pre class="ruby">pq = <span style="color:#6666ff; font-weight:bold;">Containers::PriorityQueue</span>.<span style="color:#9900CC;">new</span>
&nbsp;
pq.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Job A&quot;</span>, <span style="color:#006666;">15</span><span style="color:#006600; font-weight:bold;">&#41;</span>
pq.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Job B&quot;</span>, <span style="color:#006666;">12</span><span style="color:#006600; font-weight:bold;">&#41;</span>
pq.<span style="color:#9900CC;">push</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Job C&quot;</span>, <span style="color:#006666;">33</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
pq.<span style="color:#9900CC;">pop</span> <span style="color:#008000; font-style:italic;"># =&gt; &quot;Job C&quot;</span>
pq.<span style="color:#9900CC;">pop</span> <span style="color:#008000; font-style:italic;"># =&gt; &quot;Job A&quot;</span>
pq.<span style="color:#9900CC;">size</span> <span style="color:#008000; font-style:italic;"># =&gt; 1</span></pre>
</div>
<p><div class='download-link'>
							<a href='http://www.igvita.com/download.php?file=http://www.github.com/kanwei/algorithms/tree/master/.git'><img alt='Download' class='leftalign' src='http://www.igvita.com/wp-content/plugins/dBeautifier/icons/github.png' /></a>
							<h4>
								<a href='http://www.igvita.com/download.php?file=http://www.github.com/kanwei/algorithms/tree/master/.git'>algorithms.git (Ruby Algorithms)</a>
							</h4><p>Downloads: 432 File Size: 0.0 KB </p>
						</div> </p>
<h4><strong>Exploring data structures & GSoC '09</strong></h4>
<p>The project also <a href="http://github.com/kanwei/algorithms/tree/81987340f24edb7fd8ca60885b1f04d1907f8c03/lib/containers">contains implementations</a> of the <a href="http://en.wikipedia.org/wiki/Kd-tree">KD-Tree</a>, <a href="http://en.wikipedia.org/wiki/Deque">Deque</a>, <a href="http://en.wikipedia.org/wiki/Stack_(data_structure)">Stack</a>, <a href="http://en.wikipedia.org/wiki/Suffix_array">Suffix Array</a>, and a few others, all of which are definitely worth exploring for educational value. For those who are curious, take a look at <a href="http://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/">Red-Black Trees</a> and <a href="http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/">Bloomfilters</a> as well.</p>
<p>Kudos to Kanwei Li and Austin Ziegler for their great work and it is also worth highlighting that the <a href="http://wiki.rubyonrails.org/gsoc/2009/ideas#students">student applications for the '09 projects</a> are due the end of the end of next week! There is a <a href="http://wiki.rubyonrails.org/gsoc/2009/ideas">great lineup of project ideas for Rails</a> and some great <a href="http://wiki.rubyonrails.org/gsoc/2009/ideas#mentors">mentors</a> have recruited their time to help with the cause. If you're a student, or have a friend who is into Ruby and qualifies, do let them know! It is an amazing opportunity to learn from the best and make a great contribution at the same time.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/igvita?a=SATgvhVbDMg:UK_MQHMhrkE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/igvita?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=SATgvhVbDMg:UK_MQHMhrkE:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/igvita?i=SATgvhVbDMg:UK_MQHMhrkE:D7DqB2pKExk" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=SATgvhVbDMg:UK_MQHMhrkE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/igvita?i=SATgvhVbDMg:UK_MQHMhrkE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=SATgvhVbDMg:UK_MQHMhrkE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/igvita?i=SATgvhVbDMg:UK_MQHMhrkE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/igvita?a=SATgvhVbDMg:UK_MQHMhrkE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/igvita?i=SATgvhVbDMg:UK_MQHMhrkE:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/igvita/~4/SATgvhVbDMg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.igvita.com/2009/03/26/ruby-algorithms-sorting-trie-heaps/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.igvita.com/2009/03/26/ruby-algorithms-sorting-trie-heaps/</feedburner:origLink></item>
	</channel>
</rss>
