<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Eigenjoy</title>
	
	<link>http://eigenjoy.com</link>
	<description>a programming blog</description>
	<lastBuildDate>Sat, 05 May 2012 13:43:50 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/xcombinator" /><feedburner:info uri="xcombinator" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Michael Ellsberg’s (Mixergy) Advice on Copywriting</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/pUXPbFyWZCE/</link>
		<comments>http://eigenjoy.com/2011/12/14/michael-ellsbergs-mixergy-advice-on-copywriting/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 18:19:53 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[marketing]]></category>

		<guid isPermaLink="false">http://eigenjoy.com/?p=577</guid>
		<description><![CDATA[Today we&#8217;re going to take a detour from the regular topics at eigenjoy and talk about a topic a lot of hackers need help with: copywriting.
If you&#8217;re reading this blog you probably need help copywriting.
A common trend I&#8217;ve seen with hackers is that while we&#8217;re strong on technical prowess, we&#8217;re short on communicating why our [...]]]></description>
			<content:encoded><![CDATA[<p>Today we&#8217;re going to take a detour from the regular topics at eigenjoy and talk about a topic a lot of hackers need help with: copywriting.</p>
<p>If you&#8217;re reading this blog you probably need help copywriting.</p>
<p>A common trend I&#8217;ve seen with hackers is that while we&#8217;re strong on technical prowess, we&#8217;re short on communicating why our work is important. We can whip up a beautiful Rails app in a few hours and craft elegant functional combinators, but we struggle to explain why our product is worth someone&#8217;s attention (and money).</p>
<p>You may be thinking: wait a minute, copywriting is sleazy and I don&#8217;t want to stoop to that. I would argue that while some direct response marketing is scummy, copywriting, in the broader sense, is a lot more than cheap tricks. Being able to craft empathetic and direct copy can mean the difference between failure and success with your product.</p>
<h2>Enter Michael Ellsberg.</h2>
<p>I first took note of Michael when I read his <a href="http://www.forbes.com/sites/michaelellsberg/2011/05/16/3-steps-to-build-your-own-social-economy/">excellent Forbes article on building a social economy by helping others</a>. (He also expounds on this idea in <a href="http://www.forbes.com/sites/michaelellsberg/2011/08/31/how-to-network-your-way-to-world-class-mentors-the-thiel-fellowship-lecture-part-1/">his talk to the Thiel Fellows</a>. Highly recommended.)</p>
<p><center><br />
<iframe src="http://player.vimeo.com/video/28292408?title=0&amp;byline=0&amp;portrait=0" width="600" height="450" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p><a href="http://vimeo.com/28292408">Michael Ellsberg at the Thiel Fellowship Retreat</a> from <a href="http://vimeo.com/user2013376">Michael Ellsberg</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p></center></p>
<p>Michael was recently <a href="http://mixergy.com/michael-ellsberg-education-of-millionaires-interview/">interviewed by Andrew Warner on Mixergy</a>, and they spent a bit of time talking about how to become a better copywriter.</p>
<p>Michael&#8217;s advice was this: <strong>create a new email address</strong> and <strong>sign up for the email lists of master copywriters</strong>. This lets you read and learn their techniques for free. In the interview, Michael also mentioned a number of classic books in the field of copywriting.</p>
<p>I&#8217;ve collected links to each of these resources and I&#8217;ve listed them for you below.*</p>
<h2>Direct Response Copywriters</h2>
<p>Please note! These folks range from the soft-sell to the <strong>extreme-hard-sell</strong> (e.g. Dan Kennedy). I&#8217;m not necessarily endorsing the hard-edge techniques; rather, I want to make you aware of the full range of tools in your toolbox.</p>
<ul>
<li><a href="http://www.ebenpaganvideos.com">Eben Pagan</a></li>
<li><a href="http://dankennedy.com/">Dan Kennedy</a></li>
<li><a href="http://mattfurey.com/">Matt Furey</a></li>
<li><a href="http://marieforleo.com/">Marie Forleo</a></li>
<li><a href="http://www.jonathanfields.com/blog/">Jonathan Fields</a></li>
</ul>
<h2>Copywriting Books</h2>
<ul>
<li><a href="http://www.amazon.com/Advertising-Methods-Prentice-Business-Classics/dp/0130957011">Tested Advertising Methods by John Caples</a></li>
<li><a href="http://www.amazon.com/dp/0844231010/">Scientific Advertising by Claude Hopkins</a>, also available as a <a href="http://www.scientificadvertising.com/ScientificAdvertising.pdf">free pdf</a></li>
<li><a href="http://www.amazon.com/Ogilvy-Advertising-David/dp/039472903X">Ogilvy on Advertising by David Ogilvy</a></li>
<li><a href="http://www.amazon.com/Breakthrough-Advertising-Eugene-M-Schwartz/dp/0887232981">Breakthrough Advertising by Eugene Schwartz</a></li>
</ul>
<p>Hope this helps!</p>
<p>P.S.<br />
Since we&#8217;re on the topic of copywriting, I&#8217;d suggest you check out Marc-André Cournoyer&#8217;s <a href="http://copywritingforgeeks.com/">Copy Writing for Geeks</a>. It&#8217;s a great intro for hackers who want to quickly improve their writing skills.</p>
<p>* Note: None of these are affiliate links; I have no financial incentive with any of these folks.</p>
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/pUXPbFyWZCE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/12/14/michael-ellsbergs-mixergy-advice-on-copywriting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/12/14/michael-ellsbergs-mixergy-advice-on-copywriting/</feedburner:origLink></item>
		<item>
		<title>Understanding run n with conde in the Reasoned Schemer</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/yDPsEK9eyWo/</link>
		<comments>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/#comments</comments>
		<pubDate>Sun, 11 Dec 2011 21:13:38 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://eigenjoy.com/?p=567</guid>
		<description><![CDATA[I&#8217;m working through The Reasoned Schemer and had some trouble seeing clearly how results are returned when using run n. After Googling a bit, I found a clear walkthrough of the logic on an obscure message board. I&#8217;m reposting here to make sure it doesn&#8217;t disappear.
The key insight is that the n in run n [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m working through <a href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&#038;tid=10663">The Reasoned Schemer</a> and had some trouble seeing clearly how results are returned when using <code>run n</code>. After Googling a bit, I found a clear walkthrough of the logic on <a href="http://www.groupsrv.com/computers/about538892.html">an obscure message board</a>. I&#8217;m reposting here to make sure it doesn&#8217;t disappear.</p>
<p>The key insight is that the <code>n</code> in <code>run n</code> refers not to the number of <em>goals</em> that should be tried, but rather the number of <em>answers</em> that should be returned, if available. Full text below:</p>
<blockquote><p>
Posted: Mon Jan 14, 2008 2:19 pm	</p>
<p>Hi all, </p>
<p>I&#8217;m currently re-reading the Reasoned Schemer, but am having a bit of<br />
difficulty understanding how values are computed for recursive<br />
functions. For example, in chapter 2, the recursive function lolo and<br />
section 24 asks what is the value of the following: </p>
<p>(run 5 (x)<br />
(lolo ((a b) (c d) . x))) </p>
<p>The first value (), is obvious to me. The variable x is fresh, so it<br />
is associated with () via the nullo check on line 4 of the method. </p>
<p>Here&#8217;s my explanation for how the second value is derived. Since we<br />
are asking for another value, and the expressions are evaluated within<br />
the conde, we refresh x and try the next line. At this point, the caro<br />
associates the fresh variable x with the cons of the fresh variable a<br />
and the fresh variable d-prime (introduced by the caro). Then, listo<br />
is called on a. Since a is fresh, this call associates a with ().<br />
Since this question succeeds, we try the answer. The cdro associates x<br />
with the cons of fresh variable a-prime (introduced by the cdro) and<br />
the fresh variable d. Then, lolo is called on d, and so d (being<br />
fresh), is associated with (). </p>
<p>Since a-prime and a co-share, and a is (), and d and d-prime co-share,<br />
and d is (), x can be successfully associated with the cons of () and<br />
(), so the result is (()). </p>
<p>Here&#8217;s where I get confused. This invocation of run 2 has actually<br />
produced #s 3 times, once to produce () and twice to produce (())<br />
(once in each of the calls to listo and the recursive call to lolo).<br />
However, we have *asked* for only two goals to be met, in order to<br />
produce two values. If I invoke &#8220;run 3 &#8230;,&#8221; however, which conde<br />
lines in which recursive frames are being evaluated, and having their<br />
associations preserved? Or rather, how is the result expanding beyond<br />
(())? </p>
<p>At the high level, I understand why the answer is (() (()) (() ()) (()<br />
() ()) (() () () ())). But I&#8217;m still missing something at the lower<br />
level that makes it harder for me to understand how some of the more<br />
advanced examples, like the adder, work. </p>
<p>Thanks,<br />
Joe<br />
=====================</p>
<p>Posted: Tue Jan 15, 2008 3:30 pm</p>
<p>(run 2 (q) g1 g2 g3) does *not* specify that at most two goal<br />
invocations should succeed, or that at most two #s goals should be<br />
tried. Rather, run 2 specifies that we want two answers (if there are<br />
two answers to be had), regardless of how many goals must be tried,<br />
must succeed, or must fail in order to get those answers. </p>
<p>Hope this helps. </p>
<p>&#8211;Will
</p>
</blockquote>
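<p>To make the point in Will&#8217;s reply concrete, here is a rough Python analogy. It is only a sketch, not how The Reasoned Schemer implements <code>run</code>: the hypothetical generator <code>lolo_answers</code> stands in for the answer stream of the <code>lolo</code> query above (it is not a logic engine), and the hypothetical helper <code>run</code> simply takes the first <code>n</code> answers, however much work each one required.</p>

```python
from itertools import islice


def lolo_answers():
    """Lazily yield (), (()), (() ()), ... encoded as nested tuples.

    Hypothetical stand-in for the answers to
    (run n (x) (lolo ((a b) (c d) . x))); it does no unification.
    """
    answer = ()
    while True:
        yield answer
        answer = answer + ((),)


def run(n, answers):
    """Return up to n answers; n says nothing about how many goals were tried."""
    return list(islice(answers, n))


first_five = run(5, lolo_answers())
only_two_available = run(3, iter(["a", "b"]))
```

<p>Asking for three answers from a stream that only has two returns just those two, which mirrors the &#8220;if there are two answers to be had&#8221; caveat.</p>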
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/yDPsEK9eyWo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/</feedburner:origLink></item>
		<item>
		<title>Challenges in Large-Scale Web Crawling</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/DKlpl1_uOkU/</link>
		<comments>http://eigenjoy.com/2011/09/14/challenges-in-large-scale-web-crawling/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 21:11:14 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[big-data]]></category>
		<category><![CDATA[crawling]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=551</guid>
		<description><![CDATA[Simple web crawling is easy but when you start crawling several hundred million pages there are a number of difficult challenges.
Last Friday, I gave a talk on how to overcome some of the challenges of large-scale web crawling at Berkeley. Below are the slides from that talk.
 Challenges in Large-Scale Web Crawling 
 View more [...]]]></description>
			<content:encoded><![CDATA[<p>Simple web crawling is easy but when you start crawling several hundred million pages there are a number of difficult challenges.</p>
<p>Last Friday, I gave a talk on how to overcome some of the challenges of large-scale web crawling at Berkeley. Below are the slides from that talk.</p>
<div style="width:780px" id="__ss_9259880"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/jashmenn/challenges-in-largescale-web-crawling" title="Challenges in Large-Scale Web Crawling" target="_blank">Challenges in Large-Scale Web Crawling</a></strong> <object id="__sse9259880" width="780" height="651"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=web-crawling-sept-2011-110914155702-phpapp02&#038;stripped_title=challenges-in-largescale-web-crawling&#038;userName=jashmenn" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse9259880" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=web-crawling-sept-2011-110914155702-phpapp02&#038;stripped_title=challenges-in-largescale-web-crawling&#038;userName=jashmenn" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="780" height="651"></embed></object></p>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/jashmenn" target="_blank">Nate Murray</a> </div>
</p>
</div>
<div class="overflow-div">
<p>introduction to<br />
WEB CRAWLING<br />
&#038; extraction	by Nate Murray<br />
ask for show of hands &#8211; how many have written any code to download data automatically? &#8211; downloaded megabytes / gigabytes / terabytes?<br />
WHO AM I ?<br />
Nate Murray<br />
AT&#038;T Interactive (Yellowpages.com)<br />
TB-scale data since 2009<br />
Various crawlers since 2005<br />
what is<br />
WEB CRAWLING?<br />
definition:<br />
web crawler<br />
a program that browses the web.<br />
crawler downloads pages and puts them in storage somewhere<br />
definition:<br />
web extraction<br />
transforming web data into data<br />
unstructured<br />
structured<br />
but the reality is, if your data is really unstructured then you’re quickly talking about NLP.<br />
definition:<br />
web extraction<br />
transforming	web data into data<br />
semistructured<br />
structured<br />
motivation<br />
- find internet friends (delicious bookmarks)<br />
- extract hours of operation<br />
- make a video player service (co-occurring commenters)<br />
- power tools search engine (affiliate fees)<br />
motivation:	bookmark buddies<br />
URL Title Users<br />
motivation:	business hours<br />
Day | Openness<br />
Mon | Closed<br />
Tue | 11:30-14:30, 17:30-22:00<br />
Wed | 11:30-14:30, 17:30-22:00<br />
Thur | 11:30-14:30, 17:30-22:00<br />
Fri | 11:30-14:30, 17:30-22:00<br />
Sat | 12:00-14:30, 17:00-22:00<br />
Sun | -, 17:00-21:00<br />
motivation:	recommend videos<br />
Users<br />
motivation:	vertical search<br />
Image<br />
SKU<br />
Name<br />
Price<br />
Rating<br />
DESIRED PROPERTIES<br />
SPEED<br />
speed. really you just want to download as many pages as you can as fast as possible<br />
• Politeness: it’s easy to burden small servers<br />
• Distributed (for any significant crawl)<br />
• Linear Scalability: n machines = n*m pages-per-second<br />
• Even partitioning: every machine should perform equal work<br />
• Minimum overlap: crawl each page exactly once<br />
CONSTRAINTS<br />
BASIC ALGORITHM<br />
Initialize:<br />
UrlsDone = null<br />
UrlFrontier = {&#8217;google.com/index.html&#8217;, ..}<br />
Repeat:<br />
url = UrlFrontier.getNext()<br />
ip = DNSlookup(url.getHostname())<br />
html = DownloadPage(ip, url.getPath())<br />
UrlsDone.insert(url)<br />
newUrls = parseForLinks(html)<br />
For each newUrl:<br />
If not UrlsDone.contains(newUrl) then UrlFrontier.insert(newUrl)<br />
crawler downloads pages and puts them in storage somewhere<br />
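</p>
<p>The basic algorithm above can be sketched in Python roughly as follows. This is a minimal single-machine sketch, not the code from the talk: the names <code>crawl</code>, <code>fetch_links</code> and <code>FAKE_WEB</code> are made up for illustration, the DNS lookup, download and link parsing steps are collapsed into one caller-supplied function, and a tiny in-memory &#8220;web&#8221; stands in for the network so the example runs offline. Unlike the pseudocode, it also remembers URLs that are already queued, so nothing is queued twice.</p>

```python
from collections import deque
from urllib.parse import urljoin


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl.

    fetch_links(url) is an assumed callback that downloads a page and
    returns the href strings found on it.
    """
    frontier = deque(seeds)   # UrlFrontier
    seen = set(seeds)         # URLs already queued or crawled
    done = []                 # UrlsDone, in crawl order
    while frontier and max_pages > len(done):
        url = frontier.popleft()
        hrefs = fetch_links(url)
        done.append(url)
        for href in hrefs:
            new_url = urljoin(url, href)  # resolve relative links
            if new_url not in seen:
                seen.add(new_url)
                frontier.append(new_url)
    return done


# Hypothetical three-page web used instead of real HTTP requests.
FAKE_WEB = {
    "http://a.example/": ["/b", "http://c.example/"],
    "http://a.example/b": ["/"],
    "http://c.example/": [],
}
crawled = crawl(["http://a.example/"], lambda url: FAKE_WEB.get(url, []))
```

<p>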
architecture overview<br />
FETCHER<br />
CRAWL<br />
PLANNER URL<br />
QUEUE<br />
URLs<br />
Web Data<br />
INTERNET<br />
Web Data<br />
Web Data<br />
STORAGE<br />
CHALLENGES<br />
challenges:<br />
depends on your ambitions<br />
as the old saying goes ‘everything is fast for small n’<br />
challenges:<br />
Google’s Index Size:<br />
1998 &#8211; 26 million<br />
2005 &#8211; 8 billion<br />
2008 &#8211; 1 trillion<br />
http://www.nytimes.com/2005/08/15/technology/15search.html<br />
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html<br />
challenges:<br />
< 10MM<br />
small crawls are easy<br />
and not that interesting [click] i hate to draw too hard of a line, but i also want to give you an intuition about the numbers we’re talking about here. i’d say a small crawl is roughly less than 10mm<br />
challenges:<br />
large crawls are interesting<br />
challenges:<br />
DNS Lookup<br />
URLs Crawled<br />
Politeness<br />
URL Frontier<br />
Queueing URLs<br />
Extracting URLs<br />
challenges:<br />
DNS LOOKUP<br />
can easily be a bottleneck<br />
challenges:<br />
DNS LOOKUP<br />
• consider running your own DNS servers<br />
• djbdns<br />
• PowerDNS<br />
• etc.<br />
challenges:<br />
DNS LOOKUP<br />
• be aware of software limitations<br />
• gethostbyaddr is synchronized<br />
• same with many “default” DNS clients<br />
challenges:<br />
DNS LOOKUP<br />
You’ll know when you need it<br />
challenges:<br />
URLs CRAWLED<br />
1 machine, store in memory<br />
NAPKIN CALCULATION<br />
~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations<br />
+8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712<br />
x 100 million =~ 5.4 gigabytes<br />
doable if you have a lot of memory. but turn that 100mm into 1 billion and you’re really in trouble.<br />
can we do better?<br />
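</p>
<p>The napkin calculation above can be checked in a few lines of Python. The constant names are made up for illustration, and the figure ignores any per-entry overhead of the data structure holding the URLs, so treat it as a lower bound. The 5.4 on the slide matches the result in binary gigabytes (GiB).</p>

```python
BYTES_PER_URL = 50          # average URL length from the slide
BYTES_PER_TIMESTAMP = 8     # time-last-crawled stored as a long
URL_COUNT = 100_000_000

total_bytes = (BYTES_PER_URL + BYTES_PER_TIMESTAMP) * URL_COUNT
total_gib = total_bytes / 2**30

print(f"{total_bytes / 1e9:.1f} GB = {total_gib:.1f} GiB")  # prints "5.8 GB = 5.4 GiB"
```

<p>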
BLOOM FILTERS<br />
BLOOM FILTERS<br />
answers the question:<br />
is this item in the set?<br />
BLOOM FILTERS<br />
Have we crawled: http://www.xcombinator.com?<br />
answers either:<br />
• yes, probably<br />
• definitely not<br />
challenges:<br />
URLs CRAWLED<br />
1 machine, bloom filter<br />
NAPKIN CALCULATION<br />
100 million URLs<br />
1 in 100 million chance of false positive<br />
=~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&#038;p=1.0E-8<br />
roughly 1/10th of the memory of the in-memory approach<br />
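</p>
<p>The 457 megabyte figure follows from the standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. Below is a minimal sketch that computes them and implements a small filter. The class and function names are made up for illustration, and deriving the k positions from two halves of one SHA-1 digest (double hashing) is my assumption, not something the slides specify.</p>

```python
import hashlib
import math


def bloom_parameters(n_items, false_positive_rate):
    """Return (bits, hash_count) from the standard sizing formulas."""
    m_bits = math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


class BloomFilter:
    def __init__(self, n_items, false_positive_rate):
        self.m, self.k = bloom_parameters(n_items, false_positive_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: k bit positions from two 64-bit halves of a SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Sizing for the slide's numbers (no 457 MB allocation happens here).
m_bits, k_hashes = bloom_parameters(100_000_000, 1e-8)
size_mib = m_bits / 8 / 2**20

# A small filter to show the "yes, probably" / "definitely not" behaviour.
seen = BloomFilter(n_items=1000, false_positive_rate=0.01)
seen.add("http://www.xcombinator.com")
```

<p>An added URL always tests as present; a URL that was never added tests as absent except for the occasional false positive.</p>
<p>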
BLOOM FILTER<br />
drawbacks and solutions:<br />
• probabilistic &#8211; occasional errors: acceptable<br />
• estimate # of items ahead of time: not hard, see Dynamic BFs<br />
• can’t delete: pick granularity (days), cascade them<br />
BLOOM FILTERS<br />
references:<br />
http://en.wikipedia.org/wiki/Bloom_filter<br />
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html<br />
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/<br />
challenges:<br />
POLITENESS<br />
obey robots.txt<br />
this won’t always work. especially for large sites<br />
rule of thumb: wait 2 seconds (w.r.t. ip)<br />
this won’t always work. especially for large sites<br />
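</p>
<p>The two-second rule of thumb can be sketched for a single process as below. The class <code>PolitenessGate</code> and its methods are hypothetical names; the clock is injectable only so the behaviour can be demonstrated without sleeping. This does not address robots.txt or the distributed case discussed next, where several machines must agree on when an IP was last hit.</p>

```python
import time


class PolitenessGate:
    """Tracks the earliest time each IP may be fetched again."""

    def __init__(self, delay_seconds=2.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock
        self.next_allowed = {}  # ip -> earliest allowed fetch time

    def seconds_until_allowed(self, ip):
        earliest = self.next_allowed.get(ip, float("-inf"))
        return max(0.0, earliest - self.clock())

    def record_fetch(self, ip):
        self.next_allowed[ip] = self.clock() + self.delay


# Demonstration with a fake clock instead of real waiting.
now = [100.0]
gate = PolitenessGate(delay_seconds=2.0, clock=lambda: now[0])
wait_before = gate.seconds_until_allowed("174.132.225.106")
gate.record_fetch("174.132.225.106")
wait_right_after = gate.seconds_until_allowed("174.132.225.106")
now[0] = 101.5
wait_later = gate.seconds_until_allowed("174.132.225.106")
wait_other_ip = gate.seconds_until_allowed("74.125.224.115")
```

<p>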
centralized politeness<br />
SPOF, contention<br />
zookeeper<br />
challenges:<br />
POLITENESS<br />
Options:<br />
• central database<br />
• distributed locks (paxos/sigma/zookeeper)<br />
• controlled URL distribution<br />
http://en.wikipedia.org/wiki/Paxos_(computer_science)<br />
http://zookeeper.apache.org/<br />
challenges:<br />
URL FRONTIER<br />
url frontier<br />
idea:<br />
consistently distribute URLs based on IP<br />
modulo<br />
IP | SHA-1 | bucket (mod 5)<br />
174.132.225.106 | 4dd14b0b&#8230; | 2<br />
74.125.224.115 | cf4b7594&#8230; | 1<br />
157.166.255.19 | 0ac4d141&#8230; | 4<br />
69.22.138.129 | 6c1584fa&#8230; | 4<br />
98.139.50.166 | 327252c5&#8230; | 3<br />
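</p>
<p>A minimal sketch of the modulo scheme in Python follows. The function name <code>bucket_for_ip</code> is made up, and the slides do not say exactly how the SHA-1 digest is reduced to a bucket number; treating the whole hex digest as one integer is my assumption, so the buckets it produces may not match the table above.</p>

```python
import hashlib


def bucket_for_ip(ip, n_buckets):
    """Map an IP string to a bucket by hashing it and taking the remainder."""
    digest = hashlib.sha1(ip.encode("ascii")).hexdigest()
    return int(digest, 16) % n_buckets


ips = ["174.132.225.106", "74.125.224.115", "157.166.255.19"]
assignment = {ip: bucket_for_ip(ip, 5) for ip in ips}
```

<p>The same IP always lands in the same bucket, but changing <code>n_buckets</code> changes almost every assignment, which is the pain point the drawbacks below describe.</p>
<p>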
benefits:<br />
same IP always goes to same machine<br />
simple<br />
this won’t always work. especially for large sites<br />
drawbacks:<br />
susceptible to skew<br />
can’t add / remove nodes without pain<br />
rehash the whole keyspace, every url will go to a new machine<br />
consistent hashing<br />
source: http://michaelnielsen.org/blog/consistent-hashing/<br />
benefits:<br />
~ 1/(n+1) URLs move on add/remove<br />
virtual nodes help skew<br />
robust (no SPOF)<br />
drawbacks:<br />
naive solution won’t work for large sites<br />
further reading:<br />
Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al.<br />
Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007<br />
Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.<br />
challenges:<br />
QUEUEING URLS<br />
situation:<br />
• URL not recently crawled<br />
• allowed by robots.txt<br />
• polite<br />
you’ve got urls&#8230;<br />
how do you order them?<br />
(within a single machine)<br />
threads<br />
hash each lane:<br />
123<br />
http://yachtmaintenanceco.com/<br />
http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu<br />
http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/<br />
http://www.musiciansdfw.org/ http://www.ariana.org/<br />
ERLANG<br />
lookup: erlang B / C / engset<br />
Agner Erlang<br />
telephone lines &#8211; used in call centers to calculate number of agents you need relative to call volume<br />
as many threads as possible<br />
number of agents doesn’t really translate<br />
don’t sort input URLs<br />
fetch<br />
http://abcnews.go.com/<br />
http://abcnews.go.com/2020/ABCNEWSSpecial/<br />
http://abcnews.go.com/2020/story?id=207269&amp;page=1<br />
http://abcnews.go.com/2020/story?id=207269&amp;page=1<br />
http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395<br />
http://abcnews.go.com/International/News/story?id=203089&amp;page=1<br />
http://abcnews.go.com/International/Pope/<br />
http://abcnews.go.com/International/story?id=81417&amp;page=1<br />
wait fetch wait fetch wait<br />
no waiting!<br />
http://yachtmaintenanceco.com/<br />
http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu<br />
http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/<br />
challenges:<br />
EXTRACTING URLS<br />
the internet is full of garbage<br />
• enormous pages<br />
• terrible markup<br />
• ridiculous urls<br />
☃.net/ “unicode snowman dot net”<br />
be prepared:<br />
• use a streaming XML parser<br />
• use a library that handles bad markup<br />
• be aware that URLs aren’t ASCII<br />
• use a URL normalizer<br />
SOFTWARE<br />
software advice:<br />
• goals determine scale<br />
• someone else has already done it<br />
- web crawling and scraping isn’t new<br />
2 second crawler:<br />
function wgetspider() {<br />
wget --html-extension --convert-links --mirror \<br />
--page-requisites --progress=bar --level=5 \<br />
--no-parent --no-verbose \<br />
--no-check-certificate &quot;$@&quot;;<br />
}<br />
$ wgetspider http://www.ischool.berkeley.edu/<br />
java crawlers:<br />
• Heritrix (Internet Archive): http://crawler.archive.org/<br />
• Nutch (Lucene): http://nutch.apache.org/<br />
• Bixo (Hadoop / Cascading): http://bixo.101tec.com/<br />
extraction packages:<br />
• mechanize: http://wwwsearch.sourceforge.net/mechanize/<br />
• BeautifulSoup &#038; urllib2: http://www.crummy.com/software/BeautifulSoup/<br />
• Scrapy: http://scrapy.org/<br />
wrapper induction(ish)<br />
• Ariel: http://ariel.rubyforge.org/index.html<br />
• RoadRunner: http://www.dia.uniroma3.it/db/roadRunner/<br />
• TemplateMaker: http://code.google.com/p/templatemaker/<br />
• scrubyt: http://scrubyt.rubyforge.org/files/README.html<br />
QUESTIONS?<br />
FEEDBACK:<br />
nate@xcombinator.com<br />
www.xcombinator.com @xcombinator</p>
</div>
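<p>The consistent-hashing idea from the URL frontier slides can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the code behind the talk: the slides hash with SHA-1, while this sketch swaps in FNV-1a to stay dependency-free, and the names <code>HashRing</code>, <code>fnv1a</code> and the <code>crawler-N</code> node labels are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// FNV-1a, a simple non-cryptographic 64-bit hash. The slides use SHA-1;
// FNV-1a is an assumption made here to avoid external libraries.
std::uint64_t fnv1a(const std::string& s)
{
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// A hash ring with virtual nodes (hypothetical class, for illustration).
class HashRing
{
public:
    explicit HashRing(unsigned virtual_nodes = 64) : virtual_nodes_(virtual_nodes) {}

    // Place virtual_nodes_ points for this node on the ring.
    void add_node(const std::string& node)
    {
        for (unsigned v = 0; v < virtual_nodes_; ++v)
            ring_.emplace(point(node, v), node);  // keeps any existing point on collision
    }

    // Remove only the points that belong to this node.
    void remove_node(const std::string& node)
    {
        for (unsigned v = 0; v < virtual_nodes_; ++v)
        {
            auto it = ring_.find(point(node, v));
            if (it != ring_.end() && it->second == node) ring_.erase(it);
        }
    }

    // Owner of a key (e.g. an IP): the first ring point at or after the
    // key's hash, wrapping around to the start of the ring.
    const std::string& owner(const std::string& key) const
    {
        assert(!ring_.empty());  // at least one node must have been added
        auto it = ring_.lower_bound(fnv1a(key));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }

private:
    static std::uint64_t point(const std::string& node, unsigned v)
    {
        return fnv1a(node + "#" + std::to_string(v));
    }

    unsigned virtual_nodes_;
    std::map<std::uint64_t, std::string> ring_;
};
```

<p>With five nodes on the ring, adding a sixth only reassigns keys to the new node. Every other key keeps its owner, which is the property that plain modulo bucketing lacks.</p>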
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/DKlpl1_uOkU" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/09/14/challenges-in-large-scale-web-crawling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/09/14/challenges-in-large-scale-web-crawling/</feedburner:origLink></item>
		<item>
		<title>Binary Search Revisited</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/Ymw_kPxD6NM/</link>
		<comments>http://eigenjoy.com/2011/09/09/binary-search-revisited/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 17:11:08 +0000</pubDate>
		<dc:creator>Matt Pulver</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[C++]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=506</guid>
		<description><![CDATA[As ubiquitous as it is, the standard binary search algorithm can still be further optimized by using bit operations to iterate through its sorted list, in place of arithmetic. Admittedly, this is primarily an academic discussion, since the code improvement does not decrease the logarithmic complexity of the standard algorithm.  Nevertheless, a well-developed programming [...]]]></description>
			<content:encoded><![CDATA[<p>As ubiquitous as it is, the standard binary search algorithm can still be further optimized by using bit operations to iterate through its sorted list, in place of arithmetic. Admittedly, this is primarily an academic discussion, since the code improvement does not decrease the logarithmic complexity of the standard algorithm.  Nevertheless, a well-developed programming intuition should by default implement (or at a minimum consider) a solution similar to the one presented here, prior to &#8220;the obvious&#8221; standard arithmetic solution.</p>
<p>Here&#8217;s an example. We want to find the largest index i such that haystack[i] <= needle. Needle = 15, and haystack is a sorted list of the first 8 prime numbers, indexed by binary numbers:</p>
<p><code>000</code> &nbsp; &nbsp; 2<br />
<code>001</code> &nbsp; &nbsp; 3<br />
<code>010</code> &nbsp; &nbsp; 5<br />
<code>011</code> &nbsp; &nbsp; 7<br />
<code>100</code> &nbsp; &nbsp; 11<br />
<code>101</code> &nbsp; &nbsp; 13<br />
<code>110</code> &nbsp; &nbsp; 17<br />
<code>111</code> &nbsp; &nbsp; 19</p>
<p>Note first that the haystack index requires no more than 3 bits.  Therefore we start with the highest order bit set: b=<code>100</code>. (<code>This</code> <code>font</code> denotes binary numbers here.) Is haystack[<code>100</code>] <= 15?  Yes, 11<=15. Observe this means that the first bit of the index we are looking for is <code>1</code>. We look at the next bit. Is haystack[<code>110</code>] <= 15? No, 17 <= 15 is false. This means the 2nd bit is <code>0</code>. Finally, is haystack[<code>101</code>] <= 15? Yes, 13 <= 15. Therefore the index we are looking for is <code>101</code>.</p>
<p>In essence, the main loop to find the index i is simply:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>i<span style="color: #000040;">|</span>b<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000040;">|</span><span style="color: #000080;">=</span> b<span style="color: #008080;">;</span></pre></div></div>

<p>where b has exactly one bit set, starting at b=<code>100</code>, and i starts at <code>0</code>.</p>
<p><em>It is the natural structure of the bits within the index variable that automatically tracks both the upper and lower bounds of the search window for each iteration.</em> Compare this with the less efficient and more verbose arithmetic performed in the main loops of standard binary search algorithms.</p>
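<p>The worked example with the first 8 primes can be checked with a small standalone function. This is a minimal sketch that assumes the array size is a power of two and that haystack[0] &lt;= needle; the name <code>bit_search_pow2</code> is made up for this illustration.</p>

```cpp
#include <cassert>

// Largest index i with haystack[i] <= needle, for a sorted array whose size
// is a power of two. Assumes haystack[0] <= needle, so an answer exists.
unsigned bit_search_pow2(const int haystack[], unsigned size, int needle)
{
    assert(size != 0 && (size & (size - 1)) == 0);  // power-of-two sizes only
    unsigned i = 0;
    for (unsigned b = size >> 1; b; b >>= 1)    // b = 100, 010, 001 for size 8
        if (haystack[i | b] <= needle) i |= b;  // keep the bit if still <= needle
    return i;
}
```

<p>Called with the 8 primes above and needle = 15, the loop tests indices <code>100</code>, <code>110</code> and <code>101</code> in that order and returns 5 (<code>101</code>), matching the walk-through.</p>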
<p>The full C++ code for the improved binary search algorithm fbsearch() is:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// Binary search revisited.</span>
&nbsp;
<span style="color: #666666;">// Define only one of these:</span>
<span style="color: #666666;">// #define SETBIT_FAST</span>
<span style="color: #666666;">// #define SETBIT_FASTER</span>
<span style="color: #339900;">#define SETBIT_FASTEST</span>
&nbsp;
<span style="color: #339900;">#ifdef SETBIT_FAST</span>
<span style="color: #339900;">#include &lt;math.h&gt;</span>
<span style="color: #339900;">#endif</span>
&nbsp;
<span style="color: #666666;">// Return 1 &lt;&lt; log_2(list_size-1), or 0 if list_size == 1.</span>
<span style="color: #666666;">// This sets the initial value of b in fbsearch().</span>
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">unsigned</span> init_bit<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">unsigned</span> list_size <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
<span style="color: #339900;">#ifdef SETBIT_FAST</span>
    <span style="color: #0000ff;">return</span> list_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">?</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">:</span>
        <span style="color: #0000dd;">1</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #0000ff;">int</span><span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">log</span><span style="color: #008000;">&#40;</span>list_size<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">/</span> M_LN2 <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #339900;">#ifdef SETBIT_FASTER</span>
    <span style="color: #0000ff;">return</span> list_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">?</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">:</span>
        <span style="color: #0000dd;">1</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#41;</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #0000dd;">3</span> <span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> <span style="color: #0000dd;">1</span>
             <span style="color: #000040;">-</span> __builtin_clz<span style="color: #008000;">&#40;</span>list_size<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #339900;">#ifdef SETBIT_FASTEST</span>
    <span style="color: #0000ff;">unsigned</span> b<span style="color: #008080;">;</span>
    __asm__ <span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;decl %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;je 1f;&quot;</span> <span style="color: #666666;">// local label: safe when inlined more than once</span>
              <span style="color: #FF0000;">&quot;bsrl %%eax, %%ecx;&quot;</span> <span style="color: #666666;">// BSR - Bit Scan Reverse (386+)</span>
              <span style="color: #FF0000;">&quot;movl $1, %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;shll %%cl, %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;1:&quot;</span> <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;=a&quot;</span> <span style="color: #008000;">&#40;</span>b<span style="color: #008000;">&#41;</span> <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;a&quot;</span> <span style="color: #008000;">&#40;</span>list_size<span style="color: #008000;">&#41;</span>
              <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;ecx&quot;</span>, <span style="color: #FF0000;">&quot;cc&quot;</span> <span style="color: #666666;">// registers and flags the asm overwrites</span>
    <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">return</span> b<span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #666666;">// Return the greatest unsigned i where haystack[i] &lt;= needle.</span>
<span style="color: #666666;">// If i does not exist (haystack is empty, or needle &lt; haystack[0])</span>
<span style="color: #666666;">// then return unsigned(-1). T can be any type for which the binary</span>
<span style="color: #666666;">// operator &lt;= is defined.</span>
<span style="color: #0000ff;">template</span> <span style="color: #000080;">&lt;</span><span style="color: #0000ff;">typename</span> T<span style="color: #000080;">&gt;</span>
<span style="color: #0000ff;">unsigned</span> fbsearch<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">const</span> T haystack<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span>, <span style="color: #0000ff;">unsigned</span> haystack_size,
                   <span style="color: #0000ff;">const</span> T<span style="color: #000040;">&amp;</span> needle <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">0</span> <span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">unsigned</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">unsigned</span> b <span style="color: #000080;">=</span> init_bit<span style="color: #008000;">&#40;</span>haystack_size<span style="color: #008000;">&#41;</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">unsigned</span> j <span style="color: #000080;">=</span> i <span style="color: #000040;">|</span> b<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack_size <span style="color: #000080;">&lt;=</span> j <span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">continue</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000080;">=</span> j<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">else</span>
        <span style="color: #008000;">&#123;</span>
            <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
                <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>i<span style="color: #000040;">|</span>b<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000040;">|</span><span style="color: #000080;">=</span> b<span style="color: #008080;">;</span>
            <span style="color: #0000ff;">break</span><span style="color: #008080;">;</span>
        <span style="color: #008000;">&#125;</span>
    <span style="color: #008000;">&#125;</span>
    <span style="color: #0000ff;">return</span> i <span style="color: #000040;">||</span> <span style="color: #000040;">*</span>haystack <span style="color: #000080;">&lt;=</span> needle <span style="color: #008080;">?</span> i <span style="color: #008080;">:</span> <span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #666666;">// Example Usage</span>
<span style="color: #339900;">#include &lt;iostream&gt;</span>
<span style="color: #0000ff;">using</span> <span style="color: #0000ff;">namespace</span> std<span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">int</span> main<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">int</span> sorted_list<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000dd;">2</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">5</span>, <span style="color: #0000dd;">7</span>, <span style="color: #0000dd;">11</span>, <span style="color: #0000dd;">13</span>, <span style="color: #0000dd;">17</span>, <span style="color: #0000dd;">19</span>, <span style="color: #0000dd;">23</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">unsigned</span> list_size <span style="color: #000080;">=</span> <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span>sorted_list<span style="color: #008000;">&#41;</span><span style="color: #000040;">/</span><span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">int</span> needle <span style="color: #000080;">=</span> <span style="color: #0000dd;">15</span><span style="color: #008080;">;</span>
    <span style="color: #0000dd;">cout</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #FF0000;">&quot;fbsearch(sorted_list,&quot;</span><span style="color: #000080;">&lt;&lt;</span>list_size<span style="color: #000080;">&lt;&lt;</span><span style="color: #FF0000;">','</span><span style="color: #000080;">&lt;&lt;</span>needle<span style="color: #000080;">&lt;&lt;</span><span style="color: #FF0000;">&quot;) = &quot;</span>
         <span style="color: #000080;">&lt;&lt;</span> fbsearch<span style="color: #008000;">&#40;</span>sorted_list,list_size,needle<span style="color: #008000;">&#41;</span> <span style="color: #000080;">&lt;&lt;</span> endl<span style="color: #008080;">;</span>
    <span style="color: #666666;">// fbsearch(sorted_list,9,15) = 5</span>
    <span style="color: #0000ff;">return</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>Ancillary notes:</p>
<ul>
<li>The function init_bit() sets the initial value of b: a single bit in the highest position needed to index the array. The three code blocks SETBIT_FAST, SETBIT_FASTER, and SETBIT_FASTEST are simply 3 different ways of calculating it, whose speed ranks are the reverse of their cross-platform compatibility. The first method is the most portable and calculates it from a base-2 logarithm (note that M_LN2 is a POSIX extension rather than standard C++, and that floating-point rounding near exact powers of two may deserve a check). The second method uses the faster GNU __builtin_clz built-in (thanks Chao Xu for the suggestion), which counts the leading zero bits, i.e. it locates the highest set bit counting from the left. The third and fastest method uses the x86 assembly instruction BSR, which returns the position of the highest set bit counting from the right.</li>
<li>The null return value of unsigned(-1) (=4294967295 for 32-bit unsigned) is the unique value that will never otherwise be returned. Therefore it can safely be checked for by the calling function without risk of misinterpretation.</li>
<li>The one <code lang="cpp">else</code> block can safely be removed without changing the overall functionality. It exists simply to avoid checking that the array index is in bounds on every iteration: once haystack[j] &lt;= needle has failed, every later index i|b is smaller than j and is therefore guaranteed to be in bounds.</li>
</ul>
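<p>Two helpers may be handy when porting or testing the code above. Both are sketches with made-up names, not part of the original listing: <code>init_bit_portable</code> computes the same starting bit with integer shifts only, and <code>reference_search</code> uses <code>std::upper_bound</code> to produce the answer fbsearch() is documented to return, so the two can be compared.</p>

```cpp
#include <algorithm>
#include <cassert>

// Portable stand-in for init_bit(): the highest set bit of list_size - 1,
// or 0 when list_size == 1. Integer-only, so no rounding concerns.
unsigned init_bit_portable(unsigned list_size)
{
    assert(list_size != 0);   // fbsearch() handles the empty case itself
    unsigned n = list_size - 1;
    if (n == 0) return 0;
    unsigned b = 1;
    while (n >>= 1) b <<= 1;  // shift b up once per remaining bit of n
    return b;
}

// Reference answer from the standard library: the index just before the
// first element greater than needle, or unsigned(-1) if there is none.
unsigned reference_search(const int haystack[], unsigned size, int needle)
{
    const int* it = std::upper_bound(haystack, haystack + size, needle);
    return it == haystack ? unsigned(-1) : unsigned(it - haystack) - 1;
}
```

<p>For the 9-element list in main() and needle = 15, <code>init_bit_portable(9)</code> gives 8 and <code>reference_search</code> gives 5, the same index the example usage prints.</p>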
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/Ymw_kPxD6NM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/09/09/binary-search-revisited/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/09/09/binary-search-revisited/</feedburner:origLink></item>
		<item>
		<title>activeuuid – binary uuid primary keys in Rails 3.1 on MySQL</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/qaR-RaITq-M/</link>
		<comments>http://eigenjoy.com/2011/08/29/activeuuid-binary-uuid-primary-keys-in-rails-3-1-on-mysql/#comments</comments>
		<pubDate>Mon, 29 Aug 2011 14:42:38 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[rails]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=494</guid>
		<description><![CDATA[I&#8217;ve been debating a lot recently if I want to use NoSQL (Cassandra) for my next project or if I should just use MySQL. I plan to write a longer post on this soon, but for now I want to share with you a little gem I wrote.
I plan to use NoSQL in the not-too-distant [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been debating a lot recently if I want to use NoSQL (Cassandra) for my next project or if I should just use MySQL. I plan to write a longer post on this soon, but for now I want to share with you a little gem I wrote.</p>
<p>I plan to use NoSQL in the not-too-distant future, but not today. When I port the data over I don&#8217;t want to have to map &#8220;old ids&#8221; to &#8220;new ids&#8221;. By using a <tt>uuid</tt> from day one I can easily maintain referential integrity when I migrate to a new datastore.</p>
<p>I looked up how this is typically done in Rails using MySQL and all I could find were examples that store the UUID as strings (e.g. <a href="https://gist.github.com/937739">here</a>). The UUID is natively just 16 bytes, so storing it in a <tt>VARCHAR(36)</tt> seemed like a poor solution. Ideally we would store the UUID as binary and let the ORM deal with translating it to and from a UUID object.</p>
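<p>To make the size difference concrete, here is a small sketch of packing the 36-character text form into its 16 raw bytes. It is written in C++ purely as an illustration and is not how any Rails plugin does it; the function names <code>hex_value</code> and <code>uuid_to_bytes</code> are hypothetical.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Convert one hexadecimal digit to its value, or throw on invalid input.
static unsigned hex_value(char c)
{
    if (c >= '0' && c <= '9') return unsigned(c - '0');
    if (c >= 'a' && c <= 'f') return unsigned(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return unsigned(c - 'A' + 10);
    throw std::invalid_argument("not a hex digit");
}

// Pack the canonical 36-character form (8-4-4-4-12 hex digits separated by
// dashes) into the 16 bytes it represents.
std::array<std::uint8_t, 16> uuid_to_bytes(const std::string& text)
{
    if (text.size() != 36) throw std::invalid_argument("expected 36 characters");
    std::array<std::uint8_t, 16> bytes{};
    std::size_t out = 0;
    for (std::size_t i = 0; i < text.size(); )
    {
        if (i == 8 || i == 13 || i == 18 || i == 23)  // dash positions
        {
            if (text[i] != '-') throw std::invalid_argument("expected '-'");
            ++i;
            continue;
        }
        // Two hex digits make one byte.
        bytes[out++] = std::uint8_t((hex_value(text[i]) << 4) | hex_value(text[i + 1]));
        i += 2;
    }
    assert(out == 16);  // 32 hex digits always yield 16 bytes
    return bytes;
}
```

<p>The text form spends 36 characters on what fits in 16 bytes, which is the overhead a <tt>binary(16)</tt> column avoids.</p>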
<p><tt>activeuuid</tt> solves this problem. <tt>activeuuid</tt> is a Rails plugin that will let you use <tt>UUIDTools::UUID</tt> objects as primary keys for your ActiveRecords and it stores them in MySQL as <tt>binary(16)</tt>.</p>
<p>Here&#8217;s how to use it:</p>
<h2>Create a Migration</h2>
<p><tt>activeuuid</tt> adds the <tt>uuid</tt> type to your migrations. Example:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#9966CC; font-weight:bold;">class</span> CreateEmails <span style="color:#006600; font-weight:bold;">&lt;</span> <span style="color:#6666ff; font-weight:bold;">ActiveRecord::Migration</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#0000FF; font-weight:bold;">self</span>.<span style="color:#9900CC;">up</span>
    create_table <span style="color:#ff3333; font-weight:bold;">:emails</span>, <span style="color:#ff3333; font-weight:bold;">:id</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#0000FF; font-weight:bold;">false</span>  <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>t<span style="color:#006600; font-weight:bold;">|</span>
      t.<span style="color:#9900CC;">uuid</span> <span style="color:#ff3333; font-weight:bold;">:id</span>, <span style="color:#ff3333; font-weight:bold;">:unique</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#0000FF; font-weight:bold;">true</span>
      t.<span style="color:#9900CC;">uuid</span> <span style="color:#ff3333; font-weight:bold;">:sender_id</span>  <span style="color:#008000; font-style:italic;"># belongs_to :sender</span>
&nbsp;
      t.<span style="color:#CC0066; font-weight:bold;">string</span> <span style="color:#ff3333; font-weight:bold;">:subject</span>
      t.<span style="color:#9900CC;">text</span> <span style="color:#ff3333; font-weight:bold;">:body</span>
&nbsp;
      t.<span style="color:#9900CC;">timestamp</span> <span style="color:#ff3333; font-weight:bold;">:sent_at</span>
      t.<span style="color:#9900CC;">timestamps</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
    add_index <span style="color:#ff3333; font-weight:bold;">:emails</span>, <span style="color:#ff3333; font-weight:bold;">:id</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#0000FF; font-weight:bold;">self</span>.<span style="color:#9900CC;">down</span>
    drop_table <span style="color:#ff3333; font-weight:bold;">:emails</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<h2>Include <tt>ActiveUUID::UUID</tt> in your model</h2>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#9966CC; font-weight:bold;">class</span> Email <span style="color:#006600; font-weight:bold;">&lt;</span> <span style="color:#6666ff; font-weight:bold;">ActiveRecord::Base</span>
  <span style="color:#9966CC; font-weight:bold;">include</span> <span style="color:#6666ff; font-weight:bold;">ActiveUUID::UUID</span>
  belongs_to <span style="color:#ff3333; font-weight:bold;">:sender</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<h2>Use it</h2>
<p>Here are some example specs:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'spec_helper'</span>
&nbsp;
describe Email <span style="color:#9966CC; font-weight:bold;">do</span>
&nbsp;
  context <span style="color:#996600;">&quot;when using uuid's as keys&quot;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
    before<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#ff3333; font-weight:bold;">:each</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      Email.<span style="color:#9900CC;">delete_all</span>
      <span style="color:#0066ff; font-weight:bold;">@guid</span> = <span style="color:#996600;">&quot;1dd74dd0-d116-11e0-99c7-5ac5d975667e&quot;</span>
      <span style="color:#0066ff; font-weight:bold;">@e</span> = Email.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#ff3333; font-weight:bold;">:subject</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;hello&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:body</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;world&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>e<span style="color:#006600; font-weight:bold;">|</span> e.<span style="color:#9900CC;">id</span> = <span style="color:#6666ff; font-weight:bold;">UUIDTools::UUID</span>.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span>@guid<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
      <span style="color:#0066ff; font-weight:bold;">@e</span>.<span style="color:#9900CC;">save</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
    it <span style="color:#996600;">&quot;the id guid should be equal to the uuid&quot;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      <span style="color:#0066ff; font-weight:bold;">@e</span>.<span style="color:#9900CC;">id</span>.<span style="color:#9900CC;">to_s</span>.<span style="color:#9900CC;">should</span> eql<span style="color:#006600; font-weight:bold;">&#40;</span>@guid<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
    it <span style="color:#996600;">&quot;should be able to find an email by the uuid&quot;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      f = Email.<span style="color:#9900CC;">find</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#6666ff; font-weight:bold;">UUIDTools::UUID</span>.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span>@guid<span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
      f.<span style="color:#9900CC;">id</span>.<span style="color:#9900CC;">to_s</span>.<span style="color:#9900CC;">should</span> eql<span style="color:#006600; font-weight:bold;">&#40;</span>@guid<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<h2>Benefits</h2>
<p>Using UUIDs as primary keys has a number of benefits:</p>
<ul>
<li>no id conflict during multi-master write</li>
<li>no locking due to auto-increment</li>
<li>with time-based UUIDs you can store a timestamp within your UUID</li>
<li>you can create natural keys (based on the SHA of model attributes)</li>
</ul>
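<p>The last two points can be sketched in plain Ruby. This is only an illustration under stated assumptions, not how <tt>activeuuid</tt> or <tt>UUIDTools</tt> implement it: <tt>uuid_v1_time</tt> and <tt>natural_uuid</tt> are hypothetical helpers, and the namespace string and the NUL separator between attributes are arbitrary choices made for this example.</p>

```ruby
require "digest/sha1"

# 100-nanosecond ticks between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000

# Recover the creation time embedded in a version 1 (time-based) UUID string.
def uuid_v1_time(uuid_string)
  time_low, time_mid, time_hi = uuid_string.split("-").first(3).map { |part| part.to_i(16) }
  raise ArgumentError, "not a version 1 UUID" unless time_hi / 0x1000 == 1
  # The top 4 bits of time_hi hold the version; the remaining 12 are timestamp bits.
  ticks = (time_hi % 0x1000) * 2**48 + time_mid * 2**32 + time_low
  Time.at(Rational(ticks - GREGORIAN_OFFSET, 10_000_000)).utc
end

# Build a deterministic, version 5 style UUID from model attributes.
# Simplification: a real version 5 UUID hashes the bytes of a namespace UUID,
# while this sketch hashes an arbitrary namespace string.
def natural_uuid(namespace, *attributes)
  bytes = Digest::SHA1.digest(namespace + attributes.join("\x00")).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0F) | 0x50  # set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3F) | 0x80  # set the RFC 4122 variant bits
  hex = bytes.pack("C*").unpack("H*").first
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

puts uuid_v1_time("1dd74dd0-d116-11e0-99c7-5ac5d975667e")
puts natural_uuid("emails", "alice@example.com", "hello")
```

<p>Because the natural key is a pure function of the attributes, saving the same logical record twice yields the same id, which is what makes it usable as a de-duplicating primary key.</p>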
<h2>Future work</h2>
<ul>
<li>more transparent support for natural and composite keys</li>
<li>support for MySQL&#8217;s <tt>INSERT &#8230; ON DUPLICATE KEY UPDATE</tt> syntax</li>
<li>support a primary column name other than <tt>id</tt></li>
<li>work on other databases (Postgres, etc.)</li>
<li>tests</li>
</ul>
<h2>Installation</h2>
<p>Add this to your <tt>Gemfile</tt></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    gem <span style="color:#996600;">&quot;activeuuid&quot;</span></pre></div></div>

<p>Or get the code here: <a href="https://github.com/jashmenn/activeuuid">https://github.com/jashmenn/activeuuid</a></p>
<h2>References</h2>
<ul>
<li><a href="http://bret.appspot.com/entry/how-friendfeed-uses-mysql">http://bret.appspot.com/entry/how-friendfeed-uses-mysql</a></li>
<li><a href="http://kekoav.com/blog/36-computers/58-uuids-as-primary-keys-in-mysql.html">http://kekoav.com/blog/36-computers/58-uuids-as-primary-keys-in-mysql.html</a></li>
<li><a href="https://gist.github.com/937739">https://gist.github.com/937739</a></li>
<li><a href="http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html">http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html</a></li>
<li><a href="http://krow.livejournal.com/497839.html">http://krow.livejournal.com/497839.html</a></li>
<li><a href="https://github.com/jamesgolick/friendly">https://github.com/jamesgolick/friendly</a></li>
</ul>
<h2>Dependencies</h2>
<p>Note that this depends on Rails 3.1 because it uses the custom column serialization added by Aaron Patterson. (See: <a href="https://github.com/rails/rails/commit/ebe485fd8ec80a1a9b86516bc6f74bc5bbba3476">https://github.com/rails/rails/commit/ebe485fd8ec80a1a9b86516bc6f74bc5bbba3476</a>)</p>
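<p>For readers unfamiliar with that feature: as of Rails 3.1, <tt>serialize</tt> accepts any object that responds to <tt>load</tt> and <tt>dump</tt>. The class below is a hypothetical coder written for this post to show the shape of that protocol; it is not the code <tt>activeuuid</tt> ships, and it works on plain strings rather than <tt>UUIDTools::UUID</tt> objects.</p>

```ruby
# Hypothetical coder: any object that responds to `dump` and `load`.
# This is NOT activeuuid's actual implementation.
class BinaryUuidCoder
  # Ruby value -> database value: canonical string to 16 raw bytes.
  def dump(uuid_string)
    return nil if uuid_string.nil?
    [uuid_string.delete("-")].pack("H*")
  end

  # Database value -> Ruby value: 16 raw bytes back to a canonical string.
  def load(bytes)
    return nil if bytes.nil?
    bytes.unpack("H8H4H4H4H12").join("-")
  end
end

coder  = BinaryUuidCoder.new
stored = coder.dump("1dd74dd0-d116-11e0-99c7-5ac5d975667e")
puts stored.bytesize
puts coder.load(stored)
```

<p>In a model you would presumably hand such an object to <tt>serialize</tt> for the column in question; treat that wiring as an assumption and check the plugin source for what it really does.</p>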
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/08/29/activeuuid-binary-uuid-primary-keys-in-rails-3-1-on-mysql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/08/29/activeuuid-binary-uuid-primary-keys-in-rails-3-1-on-mysql/</feedburner:origLink></item>
		<item>
		<title>hector.rb: the pleasant JRuby Cassandra client (wraps Hector)</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/DW6UACh5OtM/</link>
		<comments>http://eigenjoy.com/2011/08/24/hector-rb-jruby-cassandra-client-wraps-hector/#comments</comments>
		<pubDate>Wed, 24 Aug 2011 17:04:58 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[big-data]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=479</guid>
		<description><![CDATA[Hector is a Java Cassandra client. It&#8217;s a nice abstraction over making raw Thrift calls. Hector&#8217;s features include:

an object-oriented way to interface with Cassandra
serialization helpers
failover support
connection pooling
JMX support

There is already a Ruby cassandra gem, but it uses the Ruby Thrift bindings which do not work well for JRuby. In any case, I want to be [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.xcombinator.com/wp-content/uploads/2011/08/hector.jpeg"><img src="http://www.xcombinator.com/wp-content/uploads/2011/08/hector.jpeg" alt="hector" title="hector" width="300" height="286" class="alignleft size-full wp-image-480" /></a><a href="http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client/">Hector</a> is a Java <a href="http://wiki.apache.org/cassandra/">Cassandra</a> client. It&#8217;s a nice abstraction over making raw Thrift calls. Hector&#8217;s features include:</p>
<ul>
<li>an object-oriented way to interface with Cassandra</li>
<li>serialization helpers</li>
<li>failover support</li>
<li>connection pooling</li>
<li>JMX support</li>
</ul>
<p>There is already a Ruby <tt><a href="https://github.com/fauna/cassandra">cassandra</a></tt> gem, but it uses the Ruby Thrift bindings which do not work well for JRuby. In any case, I want to be able to seamlessly serialize Java objects and store them in Cassandra.</p>
<p>Thus <tt><a href="https://github.com/jashmenn/hector.rb">hector.rb</a></tt> was born. <tt>hector.rb</tt> is a JRuby Cassandra client that wraps Hector. The interface is based on <tt><a href="https://github.com/pingles/clj-hector">clj-hector</a></tt>. Eventually, I&#8217;d like the interface to be API compatible with the <tt>cassandra</tt> gem.</p>
<h2>Example Usage</h2>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'java'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'hector'</span>
cluster = Hector.<span style="color:#9900CC;">cluster</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Hector&quot;</span>, <span style="color:#996600;">&quot;127.0.0.1:9160&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
ks_name = java.<span style="color:#9900CC;">util</span>.<span style="color:#9900CC;">UUID</span>.<span style="color:#9900CC;">randomUUID</span>.<span style="color:#9900CC;">to_s</span>.<span style="color:#CC0066; font-weight:bold;">gsub</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;-&quot;</span>,<span style="color:#996600;">&quot;&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
client = Hector.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#0000FF; font-weight:bold;">nil</span>, cluster, <span style="color:#ff3333; font-weight:bold;">:retries</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#006666;">2</span>, <span style="color:#ff3333; font-weight:bold;">:exception_classes</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
column_families = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#123;</span>:name <span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;a&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span>, <span style="color:#006600; font-weight:bold;">&#123;</span>:name <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;b&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:type</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#ff3333; font-weight:bold;">:super</span><span style="color:#006600; font-weight:bold;">&#125;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
client.<span style="color:#9900CC;">add_keyspace</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#123;</span>:name <span style="color:#006600; font-weight:bold;">=&gt;</span> ks_name, <span style="color:#ff3333; font-weight:bold;">:strategy</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#ff3333; font-weight:bold;">:local</span>, <span style="color:#ff3333; font-weight:bold;">:replication</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#006666;">1</span>, <span style="color:#ff3333; font-weight:bold;">:column_families</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> column_families<span style="color:#006600; font-weight:bold;">&#125;</span><span style="color:#006600; font-weight:bold;">&#41;</span> 
client.<span style="color:#9900CC;">keyspace</span> = ks_name
&nbsp;
sopts = <span style="color:#006600; font-weight:bold;">&#123;</span>:n_serializer <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#ff3333; font-weight:bold;">:string</span>, <span style="color:#ff3333; font-weight:bold;">:v_serializer</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#ff3333; font-weight:bold;">:string</span>, <span style="color:#ff3333; font-weight:bold;">:s_serializer</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#ff3333; font-weight:bold;">:string</span><span style="color:#006600; font-weight:bold;">&#125;</span>
client.<span style="color:#9900CC;">put_row</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;a&quot;</span>, <span style="color:#996600;">&quot;row-key&quot;</span>, <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#996600;">&quot;k&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;v&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
client.<span style="color:#9900CC;">get_rows</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;a&quot;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;row-key&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, sopts<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; {&quot;row-key&quot; =&gt; {'k' =&gt; 'v'}}</span>
client.<span style="color:#9900CC;">get_columns</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;a&quot;</span>, <span style="color:#996600;">&quot;row-key&quot;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;k&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, sopts<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; {'k' =&gt; 'v'}</span>
&nbsp;
&nbsp;
client.<span style="color:#9900CC;">put_row</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;b&quot;</span>, <span style="color:#996600;">&quot;row-key&quot;</span>, 
               <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#996600;">&quot;SuperCol&quot;</span>  <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#996600;">&quot;k&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;v&quot;</span>, <span style="color:#996600;">&quot;k2&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;v2&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span>,
                 <span style="color:#996600;">&quot;SuperCol2&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#996600;">&quot;k&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;v&quot;</span>, <span style="color:#996600;">&quot;k2&quot;</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;v2&quot;</span><span style="color:#006600; font-weight:bold;">&#125;</span> <span style="color:#006600; font-weight:bold;">&#125;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
client.<span style="color:#9900CC;">get_super_columns</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;b&quot;</span>, <span style="color:#996600;">&quot;row-key&quot;</span>, <span style="color:#996600;">&quot;SuperCol&quot;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;k2&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, sopts<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; {&quot;k2&quot; =&gt; &quot;v2&quot;}</span></pre></div></div>
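<p>If the nesting in those return values is unfamiliar, the following in-memory stand-in may help. It uses plain Ruby hashes, does not talk to Cassandra or use <tt>hector.rb</tt> at all, and only mirrors the shapes from the example: a standard column family maps row key to columns, while a super column family adds one more level.</p>

```ruby
# In-memory stand-in for the nesting used above; it does not talk to Cassandra.
store = Hash.new do |families, name|
  families[name] = Hash.new { |rows, key| rows[key] = {} }
end

# Standard column family "a": row key -> column name -> value
store["a"]["row-key"].merge!("k" => "v")

# Super column family "b": row key -> super column -> column name -> value
store["b"]["row-key"].merge!(
  "SuperCol"  => { "k" => "v", "k2" => "v2" },
  "SuperCol2" => { "k" => "v", "k2" => "v2" }
)

# Reading selected columns is a filter on the innermost hash.
columns     = store["a"]["row-key"].select { |name, _| ["k"].include?(name) }
sub_columns = store["b"]["row-key"]["SuperCol"].select { |name, _| ["k2"].include?(name) }

p columns
p sub_columns
```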

<p>For more examples, <a href="https://github.com/jashmenn/hector.rb/blob/master/spec/hector_spec.rb">see the tests</a>.</p>
<h2>Installation</h2>
<p><code>gem install --source http://gems.xcombinator.com hector</code><br />
or<br />
<a href="https://github.com/jashmenn/hector.rb">Get the source here</a>.</p>
<h2>Future Work</h2>
<p>Now that we can easily deal with [super]rows and columns in Cassandra, the next step is to put an ActiveModel abstraction over it. I&#8217;m currently working on modifying <tt>cassandra_object</tt> to work on JRuby with <tt>hector.rb</tt>. If you&#8217;d like to follow development <a href="https://github.com/jashmenn/cassandra_object/tree/hector">the branch is here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/08/24/hector-rb-jruby-cassandra-client-wraps-hector/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/08/24/hector-rb-jruby-cassandra-client-wraps-hector/</feedburner:origLink></item>
		<item>
		<title>cascading-simhash: a library to cluster by minhashes in Hadoop</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/t31gqDEOz6A/</link>
		<comments>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/#comments</comments>
		<pubDate>Mon, 09 May 2011 22:21:11 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[big-data]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=455</guid>
		<description><![CDATA[simhashing
Say you have a large corpus of web documents and you want to group them together by some notion of &#8220;similarity&#8221;. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.  
In this scenario, it&#8217;s impractical to do a pairwise comparison of all documents. Fortunately, [...]]]></description>
			<content:encoded><![CDATA[<h2>simhashing</h2>
<p>Say you have a large corpus of web documents and you want to group them together by some notion of &#8220;similarity&#8221;. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.  </p>
<p>In this scenario, it&#8217;s impractical to do a pairwise comparison of all documents. Fortunately, we can use <em>simhashing</em>. </p>
<p>Broadly speaking, simhashing is an algorithm that calculates a &#8220;cluster id&#8221; (the minimum hash, or <em>minhash</em>) from the content. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.</p>
<p>Ryan Moulton has written a <a href="http://knol.google.com/k/simple-simhashing">wonderful article on Simhashing</a>. I&#8217;m not going to repeat his content here, so if you&#8217;re unfamiliar with simhashing I encourage you to go and read his article first.</p>
<p>In his article, Ryan sketches the proof that the probability that any two sets (in this case, documents) share the same minhash is equal to their <a href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity coefficient</a>. This is a really neat result because we are able to get the Jaccard index without having to actually compare the intersection of the two sets directly. </p>
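<p>You can check that result empirically. The sketch below is plain Ruby and is not the library&#8217;s code: MD5 with a seed prefix stands in for a family of independent hash functions, and the two sample sentences are made up. It computes the exact Jaccard index of two token sets, then estimates it as the fraction of hash functions on which the two sets share a minhash.</p>

```ruby
require "digest/md5"
require "set"

# Exact Jaccard similarity: size of the intersection over size of the union.
def jaccard(a, b)
  (a & b).size.to_f / (a | b).size
end

# Minhash of a token set under one seeded hash function. MD5 is just a
# convenient deterministic hash here, an assumption made for this sketch.
def minhash(tokens, seed)
  tokens.map { |token| Digest::MD5.hexdigest("#{seed}:#{token}").to_i(16) }.min
end

# Fraction of seeds on which two sets agree; this estimates their Jaccard index.
def estimated_jaccard(a, b, seeds)
  seeds.count { |seed| minhash(a, seed) == minhash(b, seed) }.to_f / seeds.size
end

doc_a = Set.new(%w[the quick brown fox jumps over the lazy dog])
doc_b = Set.new(%w[the quick brown fox leaps over the sleepy dog])

puts jaccard(doc_a, doc_b)                      # exact value, needs both sets
puts estimated_jaccard(doc_a, doc_b, 0...200)   # estimate from minhashes only
```

<p>The two sets share 6 of 10 distinct words, so the exact index is 0.6, and the estimate should land close to that as the number of hash functions grows.</p>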
<h2>cascading-simhash</h2>
<p>I&#8217;ve created a <a href="https://github.com/jashmenn/cascading-simhash">library for calculating simhashes in Hadoop</a>. It&#8217;s written in Clojure and Java and uses Casacalog and Cascading. </p>
<p>To use it, you 1) input tuples consisting of a <tt>(document_id, body)</tt> and 2) define how to tokenize your <tt>body</tt>. The job emits tuples of the form <tt>(minhash, document_id, body)</tt>. You can then use <tt>minhash</tt> as the key for your next phase. (All records that share a <tt>minhash</tt> are potential duplicates.) </p>
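<p>Here is the same data flow as a small, single-process Ruby sketch, just to show the shape of the tuples. It is not the library&#8217;s API: <tt>tokenize</tt> is a hypothetical word tokenizer, a single MD5-based hash function is assumed, and the three documents are made up.</p>

```ruby
require "digest/md5"

# Hypothetical tokenizer: lowercase words. The library lets you supply your own.
def tokenize(body)
  body.downcase.scan(/\w+/)
end

# Smallest hash value over a document's tokens (single hash function).
def minhash(tokens)
  tokens.map { |token| Digest::MD5.hexdigest(token).to_i(16) }.min
end

documents = [
  [1, "The quick brown fox"],
  [2, "the QUICK brown fox!"],
  [3, "An entirely different sentence"]
]

# Map step: (document_id, body) becomes (minhash, document_id, body).
tuples = documents.map { |id, body| [minhash(tokenize(body)), id, body] }

# Next phase: records sharing a minhash are candidate duplicates.
groups     = tuples.group_by { |tuple| tuple.first }
candidates = groups.values.map { |group| group.map { |_, id, _| id } }

p candidates
```

<p>Documents 1 and 2 tokenize to the same words, so they land in the same group, while document 3 gets a group of its own.</p>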
<p>The library can be called from either Clojure or Java. Additionally, <tt>Simhash</tt> returns a <tt><a href="http://www.cascading.org/javadoc/cascading/flow/Flow.html">Flow</a></tt> so you can use it in your <tt>Cascade</tt> if you want to make it part of a bigger pipeline.</p>
<h2>A Java example</h2>
<p>Here&#8217;s a quick example of how to use the library from Java:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Simple Simhash - an example of how to use Simhash
 *
 * To run this example:
 *   lein uberjar
 *   lein classpath &gt; classpath
 *   java -cp `cat classpath`:build/cascading-simhash-1.0.0-SNAPSHOT-standalone.jar simhash.examples.SimpleSimhash &quot;test-resources/test-documents.txt&quot;
 **/</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> SimpleSimhash <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> Logger LOG <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span> SimpleSimhash.<span style="color: #000000; font-weight: bold;">class</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #008000; font-style: italic; font-weight: bold;">/**
   * Create a tokenizer that is a subclass of clojure.lang.AFn and
   * implements invoke(Object body)
   **/</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> Tokenizer <span style="color: #000000; font-weight: bold;">extends</span> AFn <span style="color: #009900;">&#123;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/**
     * Your tokenization logic goes here
     *
     * @param String body
     * @return something seq-able
     */</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">Object</span> invoke<span style="color: #009900;">&#40;</span><span style="color: #003399;">Object</span> body<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">String</span> b <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span>body<span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span> b.<span style="color: #006633;">split</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    Tap inputTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span> <span style="color: #000000; font-weight: bold;">new</span> TextDelimited<span style="color: #009900;">&#40;</span> 
                                <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;docid&quot;</span>, <span style="color: #0000ff;">&quot;body&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> <span style="color: #009900;">&#41;</span>,
                            args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap outputTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> StdoutTap<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create the flow</span>
    Flow simhashFlow <span style="color: #339933;">=</span> Simhash.<span style="color: #006633;">simhash</span><span style="color: #009900;">&#40;</span>inputTap, outputTap, 
                                       <span style="color: #cc66cc;">2</span>, <span style="color: #666666; font-style: italic;">// combine the n lowest minhashes (here, 2) </span>
                                       SimpleSimhash.<span style="color: #006633;">Tokenizer</span>.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    simhashFlow.<span style="color: #006633;">complete</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// or add to your Cascade, etc</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Notice a few things here:</p>
<ul>
<li>We&#8217;re inputting a tap of two fields: <tt>(docid, body)</tt></li>
<li>The <tt>2</tt> parameter is the number of minhashes to combine. In this case, we combine the 2 lowest hashes to create one minhash. This parameter controls the overlap required for a match: here, the two sets must share the same 2 lowest hashes in order to match.</li>
<li>The Tokenizer is a subclass of <tt>clojure.lang.AFn</tt>. Override the <tt>invoke(Object)</tt> method and you will be passed the body of the current record. In this case, we&#8217;re tokenizing by doing a simple <tt>String</tt> split.</li>
</ul>
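<p>As a rough illustration of what &#8220;combining the n lowest hashes&#8221; means, the following standalone Java sketch hashes every token, keeps the <tt>n</tt> smallest distinct values, and XORs them into a single key. The class <tt>CombinedMinhash</tt>, the 64-bit mixer, and the use of XOR as the combiner are assumptions made for this example; the library&#8217;s own hash function and combination step may differ.</p>

```java
import java.util.Arrays;

final class CombinedMinhash {

    /** Illustrative 64-bit hash (splitmix64 finalizer); not the library's hash. */
    static long hash(String token) {
        long z = token.hashCode() + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** XOR together the n lowest distinct token hashes. */
    static long combinedMinhash(String[] tokens, int n) {
        long[] lowest = Arrays.stream(tokens)
                .mapToLong(CombinedMinhash::hash)
                .distinct()
                .sorted()
                .limit(n)
                .toArray();
        long combined = 0L;
        for (long h : lowest) {
            combined ^= h;
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] docA = "my dog has fleas".split(" ");
        String[] docB = "my dog has fleas".split(" ");
        String[] docC = "see spot run".split(" ");
        System.out.println("DocA " + Long.toHexString(combinedMinhash(docA, 2)));
        System.out.println("DocB " + Long.toHexString(combinedMinhash(docB, 2)));
        System.out.println("DocC " + Long.toHexString(combinedMinhash(docC, 2)));
    }
}
```

<p>Identical token sets always produce the same key regardless of token order. Raising <tt>n</tt> makes a match stricter, because every one of the <tt>n</tt> lowest hashes has to agree.</p>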
<p>If you&#8217;ve <a href="https://github.com/jashmenn/cascading-simhash">checked out the source</a> you can run it like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">lein uberjar
lein classpath <span style="color: #000000; font-weight: bold;">&gt;</span> classpath
java <span style="color: #660033;">-cp</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">cat</span> classpath<span style="color: #000000; font-weight: bold;">`</span>:build<span style="color: #000000; font-weight: bold;">/</span>cascading-simhash-1.0.0-SNAPSHOT-standalone.jar simhash.examples.SimpleSimhash <span style="color: #ff0000;">&quot;test-resources/test-documents.txt&quot;</span></pre></div></div>

<p>Given:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;"># test-resources/test-documents.txt</span>
<span style="color: #666666; font-style: italic;"># docid \t body</span>
DocA	my dog has fleas
DocB	my dog has fleas
DocC	my dog has hair
DocD	see spot run
DocE	We hold these truths</pre></div></div>

<p>We get:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">RESULTS
<span style="color: #660033;">-----------------------</span>
23fd68296bc65391799c8c441faf4403c729256f	DocE	We hold these truths
402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD	see spot run
49c31c1459a7603bd5680d11285a5716c4ba3903	DocA	my dog has fleas
49c31c1459a7603bd5680d11285a5716c4ba3903	DocB	my dog has fleas
58e5a2035461323a37102e22273c9b25cbb9df61	DocC	my dog has hair
<span style="color: #660033;">-----------------------</span></pre></div></div>
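<p>Because the <tt>minhash</tt> is the first column, the next phase only has to group on it. Here is a small in-memory Java sketch of that grouping step, assuming tab-separated <tt>(minhash, docid, body)</tt> lines like the ones above; <tt>GroupByMinhash</tt> is a hypothetical helper, and in a real pipeline this would be a group-by in Cascading or Cascalog rather than a local loop.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupByMinhash {

    /** Group document ids by the minhash column of (minhash, docid, body) lines. */
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t", 3);
            if (fields.length < 2) {
                continue; // skip malformed lines
            }
            clusters.computeIfAbsent(fields[0], key -> new ArrayList<>()).add(fields[1]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "23fd68296bc65391799c8c441faf4403c729256f\tDocE\tWe hold these truths",
                "402183e1cbc52e7c87eb230c281f35e4b27c2a39\tDocD\tsee spot run",
                "49c31c1459a7603bd5680d11285a5716c4ba3903\tDocA\tmy dog has fleas",
                "49c31c1459a7603bd5680d11285a5716c4ba3903\tDocB\tmy dog has fleas",
                "58e5a2035461323a37102e22273c9b25cbb9df61\tDocC\tmy dog has hair");
        // Only clusters with more than one member are candidate duplicates.
        group(lines).forEach((minhash, ids) -> {
            if (ids.size() > 1) {
                System.out.println(minhash + " -> " + ids);
            }
        });
    }
}
```

<p>On the sample output this reports a single cluster containing <tt>DocA</tt> and <tt>DocB</tt>.</p>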

<h2>A Clojure example</h2>
<p>Similarly, here&#8217;s how to run the library from Clojure. This time we use bi-grams as the tokens.</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #66cc66;">&#40;</span>ns simhash<span style="color: #66cc66;">.</span>examples<span style="color: #66cc66;">.</span>bigrams
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">use</span> 
   <span style="color: #66cc66;">&#91;</span>simhash core util<span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>cascalog api testing<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> 
   <span style="color: #66cc66;">&#91;</span>simhash <span style="color: #66cc66;">&#91;</span>taps <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> t<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#91;</span>ops <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> ops<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>str-utils <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> stu<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">gen-class</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn my-source <span style="color: #66cc66;">&#91;</span>path<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&lt;</span>- <span style="color: #66cc66;">&#91;</span>?docid ?body<span style="color: #66cc66;">&#93;</span>
      <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span>hfs-textline path<span style="color: #66cc66;">&#41;</span> ?line<span style="color: #66cc66;">&#41;</span>
      <span style="color: #66cc66;">&#40;</span>ops/re-split-op <span style="color: #66cc66;">&#91;</span>#<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#93;</span> ?line <span style="color: #66cc66;">:&gt;</span> ?docid ?body<span style="color: #66cc66;">&#41;</span>
      <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">distinct</span> false<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn tokenize 
  <span style="color: #ff0000;">&quot;tokenize into bi-grams (sliding window)&quot;</span>
  <span style="color: #66cc66;">&#91;</span>body<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>map
   <span style="color: #66cc66;">&#40;</span>fn <span style="color: #66cc66;">&#91;</span>tokens<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#40;</span>stu/str-join <span style="color: #ff0000;">&quot; &quot;</span> tokens<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
   <span style="color: #66cc66;">&#40;</span>partition <span style="color: #cc66cc;">2</span> <span style="color: #cc66cc;">1</span> <span style="color: #66cc66;">&#40;</span>stu/re-split #<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\s</span>+&quot;</span> body<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn -main <span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&amp;</span> args<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>?- <span style="color: #66cc66;">&#40;</span>stdout<span style="color: #66cc66;">&#41;</span> 
      <span style="color: #66cc66;">&#40;</span>simhash-q <span style="color: #66cc66;">&#40;</span>my-source <span style="color: #66cc66;">&#40;</span>first args<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
                 <span style="color: #cc66cc;">2</span> <span style="color: #808080; font-style: italic;">;; number of minhashes</span>
                 tokenize<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>A few things to point out about the Clojure example:</p>
<ul>
<li><tt>simhash-q</tt> is just a Cascalog query. Unlike the Java example (which required a <tt>Tap</tt> as the input) <tt>simhash-q</tt> can accept any other Cascalog query as the input.</li>
<li>You must use <tt>gen-class</tt> on the namespace that holds your <tt>tokenize</tt> function. This is because Cascading serializes your <tt>Flow</tt> and has a hard time with functions generated at run-time. Generally speaking, if the <tt>tokenize</tt> function isn&#8217;t AOT-compiled into a class you&#8217;re going to run into problems.</li>
</ul>
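<p>For readers who think in Java, the sliding window that <tt>(partition 2 1 ...)</tt> produces in the Clojure <tt>tokenize</tt> function can be sketched as a plain loop. This is only an illustration of the bi-gram idea: the <tt>Bigrams</tt> class is hypothetical, it is not wrapped in <tt>AFn</tt>, and unlike the Clojure version it trims leading and trailing whitespace first.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class Bigrams {

    /** Sliding-window bi-grams over whitespace-separated words. */
    static List<String> tokenize(String body) {
        String[] words = body.trim().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        // Stop one word early so every window holds exactly two words.
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.add(words[i] + " " + words[i + 1]);
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("my dog has fleas"));
        // prints [my dog, dog has, has fleas]
    }
}
```

<p>Bi-grams keep some word order in the tokens, so two documents that use the same words in a different sequence are less likely to collide than with single-word tokens.</p>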
<p>The project also includes a tokenizer for extracting text from HTML documents. See <tt><a href="https://github.com/jashmenn/cascading-simhash/blob/master/src/clj/simhash/tokenizers/html_text.clj">tokenizers.html_text.clj</a></tt> for an example of how to write a tokenizer in Clojure, and <tt><a href="https://github.com/jashmenn/cascading-simhash/blob/master/src/java/simhash/examples/HtmlSimhash.java">HtmlSimhash.java</a></tt> for a Java example of how to use it.</p>
<h2>Summary</h2>
<p>Simhashing in MapReduce is a quick way to find clusters in a huge amount of data. By using Cascading and Cascalog we&#8217;re able to work with MapReduce jobs at the level of functions rather than individual map-reduce phases.</p>
<p>Have any data you need clustered? Try <tt><a href="https://github.com/jashmenn/cascading-simhash">cascading-simhash</a></tt> and let me know how it goes!</p>
<p>Learn more about big data <a href="http://twitter.com/xcombinator">by following me on twitter</a>.</p>
<p>You can get the jars via clojars:<br />
leiningen:</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">  <span style="color: #66cc66;">&#91;</span>cascading-simhash <span style="color: #ff0000;">&quot;1.0.0-SNAPSHOT&quot;</span><span style="color: #66cc66;">&#93;</span></pre></div></div>

<p>maven</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;dependency<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;groupId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>cascading-simhash<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/groupId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;artifactId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>cascading-simhash<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/artifactId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;version<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1.0.0-SNAPSHOT<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/version<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/dependency<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p><a href="https://github.com/jashmenn/cascading-simhash">View the source on github</a>.</p>
<p>References</p>
<ol>
<li><a href="http://knol.google.com/k/simple-simhashing">http://knol.google.com/k/simple-simhashing</a></li>
<li><a href="http://en.wikipedia.org/wiki/Jaccard_index">http://en.wikipedia.org/wiki/Jaccard_index</a></li>
<li><a href="http://en.wikipedia.org/wiki/MinHash">http://en.wikipedia.org/wiki/MinHash</a></li>
</ol>
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/t31gqDEOz6A" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/</feedburner:origLink></item>
		<item>
		<title>Why is XOR the default way to combine hashes</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/6jvEjVyWkgI/</link>
		<comments>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/#comments</comments>
		<pubDate>Wed, 04 May 2011 21:24:08 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=438</guid>
		<description><![CDATA[I&#8217;ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton&#8217;s excellent article on Simhashing. Ryan&#8217;s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he XORs them. This got me [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton&#8217;s <a href="http://knol.google.com/k/simple-simhashing">excellent article on Simhashing</a>. Ryan&#8217;s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he <tt>XOR</tt>s them. This got me wondering:</p>
<p>Why is <tt>XOR</tt> the default way to combine hashes? </p>
<p>I posted this question on Stack Overflow and got <a href="http://stackoverflow.com/questions/5889238/why-is-xor-the-default-way-to-combine-hashes">this excellent and concise answer</a> from Greg Hewgill:</p>
<blockquote><p>
Assuming random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.</p>
<p>The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.
</p>
</blockquote>
<p>The distributions become clear when we chart the output of each of the operations:</p>
<p><a href="http://www.xcombinator.com/wp-content/uploads/2011/05/prob.jpg"><img src="http://www.xcombinator.com/wp-content/uploads/2011/05/prob.jpg" alt="and-or-xor" title="and-or-xor" width="347" height="528" class="size-full wp-image-441" /></a></p>
<p>If you are combining two random bits with AND you have a 75% chance of <tt>0</tt>. Similarly, if you combine two random bits with OR you have a 75% chance of <tt>1</tt>.</p>
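<p>The same numbers fall out of enumerating the four equally likely input pairs. Here is a small Java sketch of that truth-table count; the class and method names are made up for the example.</p>

```java
import java.util.function.IntBinaryOperator;

final class BitOpDistribution {

    /** Fraction of the four equally likely (a, b) bit pairs for which op yields 1. */
    static double probabilityOfOne(IntBinaryOperator op) {
        int ones = 0;
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                ones += op.applyAsInt(a, b);
            }
        }
        return ones / 4.0;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + probabilityOfOne((a, b) -> a & b)); // AND: 0.25
        System.out.println("OR:  " + probabilityOfOne((a, b) -> a | b)); // OR:  0.75
        System.out.println("XOR: " + probabilityOfOne((a, b) -> a ^ b)); // XOR: 0.5
    }
}
```

<p>Only XOR leaves the output bit as evenly split as its inputs.</p>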
<p>My friend <a href="http://www.zinkov.com/">Rob Zinkov</a> [1] explains it by framing it in terms of entropy:</p>
<blockquote><p>
[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.</p>
<p>With AND and OR lots of the variation in bits is being removed, so the hash you end up with tends to be biased towards 1s (as in the OR case) or 0s (as in the AND case)</p>
<p>That bias means each bit is storing slightly less information (since that bias is present).
</p>
</blockquote>
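<p>To put a number on &#8220;slightly less information&#8221;, here is a short Java sketch that computes the Shannon entropy of a single output bit. It assumes the inputs are independent, uniformly random bits, so the output is <tt>1</tt> with probability 0.25 for AND, 0.75 for OR, and 0.5 for XOR; the <tt>BitEntropy</tt> class is hypothetical.</p>

```java
final class BitEntropy {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Shannon entropy, in bits, of one bit that is 1 with probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a constant bit carries no information
        }
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    public static void main(String[] args) {
        System.out.printf("AND (p=0.25): %.3f bits%n", entropy(0.25));
        System.out.printf("OR  (p=0.75): %.3f bits%n", entropy(0.75));
        System.out.printf("XOR (p=0.50): %.3f bits%n", entropy(0.50));
    }
}
```

<p>A bit produced by AND or OR carries roughly 0.81 bits of information, while a bit produced by XOR keeps the full 1 bit.</p>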
<p>Summary: If you need to combine two hashes, XOR them together and you&#8217;ll have a better chance at maintaining the entropy of the original hashes.</p>
<p><em><br />
[1] Rob runs the <a href="http://www.meetup.com/LA-Machine-Learning/">L.A. Machine Learning</a> meetup group. If you&#8217;re in L.A. <tt>AND</tt> interested in machine learning, stop by sometime.<br />
</em></p>
<img src="http://feeds.feedburner.com/~r/xcombinator/~4/6jvEjVyWkgI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/</feedburner:origLink></item>
		<item>
		<title>Custom Hive UDFs in Clojure</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/J6cxofzLL6Q/</link>
		<comments>http://eigenjoy.com/2011/04/29/custom-hive-udfs-in-clojure/#comments</comments>
		<pubDate>Fri, 29 Apr 2011 15:19:57 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[big-data]]></category>
		<category><![CDATA[crawling]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=424</guid>
		<description><![CDATA[
Introduction


We process all of our web-crawl data in Hadoop. If I&#8217;m writing jobs that will only be run by my team, then Cascalog is my tool of choice. But unfortunately, not everyone is going to learn Cascalog (much less Cascading or Clojure). However, many people know a little SQL and the best tool for them [...]]]></description>
			<content:encoded><![CDATA[<div id="outline-container-1_1" class="outline-3">
<h3 id="sec-1_1">Introduction</h3>
<div class="outline-text-3" id="text-1_1">
<p>
We process all of our web-crawl data in Hadoop. If I&#8217;m writing jobs that will only be run by my team, then <a href="https://github.com/nathanmarz/cascalog">Cascalog</a> is my tool of choice. But unfortunately, not everyone is going to learn Cascalog (much less Cascading or Clojure). However, many people know a little SQL and the best tool for them to use data in Hadoop is Hive.
</p>
<p>
Hive is great for straightforward, ad-hoc queries and it makes Hadoop accessible for SQL-minded folks who may not be programmers.
</p>
<p>
Hive&#8217;s functionality can be extended by writing User Defined Functions (UDFs). By writing custom UDFs you can create little mappers and reducers that can be easily stuffed into queries.
</p>
<p>
<i>This article was written with Hive 0.5.0. YMMV.</i>
</p>
</div>
</div>
<div id="outline-container-1_2" class="outline-3">
<h3 id="sec-1_2">UDFs &#8211; 1 to 1 </h3>
<div class="outline-text-3" id="text-1_2">
<p>
Let&#8217;s begin with the simplest case, lower-casing a string:
</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">  <span style="color: #66cc66;">&#40;</span>ns smoker<span style="color: #66cc66;">.</span>udf<span style="color: #66cc66;">.</span>MyLowerCase
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">import</span> <span style="color: #66cc66;">&#91;</span>org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>hive<span style="color: #66cc66;">.</span>ql<span style="color: #66cc66;">.</span>exec UDF<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">import</span> <span style="color: #66cc66;">&#91;</span>org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>io Text<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>str-utils2 <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> su<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">gen-class</span>
     <span style="color: #66cc66;">:</span><span style="color: #b1b100;">name</span> smoker<span style="color: #66cc66;">.</span>udf<span style="color: #66cc66;">.</span>MyLowerCase
     <span style="color: #66cc66;">:</span><span style="color: #555;">extends</span> org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>hive<span style="color: #66cc66;">.</span>ql<span style="color: #66cc66;">.</span>exec<span style="color: #66cc66;">.</span>UDF
     <span style="color: #66cc66;">:</span><span style="color: #555;">methods</span> <span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>evaluate <span style="color: #66cc66;">&#91;</span>org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>io<span style="color: #66cc66;">.</span>Text<span style="color: #66cc66;">&#93;</span> org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>io<span style="color: #66cc66;">.</span>Text<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
  <span style="color: #66cc66;">&#40;</span>defn #<span style="color: #66cc66;">^</span>Text -evaluate 
    <span style="color: #ff0000;">&quot;Lower-case the text&quot;</span>
    <span style="color: #66cc66;">&#91;</span>this #<span style="color: #66cc66;">^</span>Text s<span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">when</span> s
      <span style="color: #66cc66;">&#40;</span>Text<span style="color: #66cc66;">.</span> <span style="color: #66cc66;">&#40;</span>su/lower-<span style="color: #b1b100;">case</span> <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">.</span>toString s<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>
Here we use <code>gen-class</code> to subclass <code>exec.UDF</code> and generate a <code>.class</code> file that can be called from Java.
</p>
<p>
We can build the jar, register the function, and use it in a query like so:
</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">  <span style="color: #666666; font-style: italic;"># make the jar</span>
  lein compile
  lein uberjar   <span style="color: #666666; font-style: italic;"># include dependencies for Hive/Hadoop</span>
&nbsp;
  <span style="color: #666666; font-style: italic;"># tell hive about your jars</span>
  hive <span style="color: #660033;">--auxpath</span> .<span style="color: #000000; font-weight: bold;">/</span>build
  add jar <span style="color: #000000; font-weight: bold;">/</span>home<span style="color: #000000; font-weight: bold;">/</span>nmurray<span style="color: #000000; font-weight: bold;">/</span>hive-jars<span style="color: #000000; font-weight: bold;">/</span>smoker-standalone.jar;
  list jars; <span style="color: #666666; font-style: italic;"># verify it is there</span>
&nbsp;
  <span style="color: #666666; font-style: italic;"># create your operations</span>
  create temporary <span style="color: #000000; font-weight: bold;">function</span> my_lower <span style="color: #c20cb9; font-weight: bold;">as</span> <span style="color: #ff0000;">'smoker.udf.MyLowerCase'</span>;
&nbsp;
  <span style="color: #666666; font-style: italic;"># given:  a_table</span>
  <span style="color: #666666; font-style: italic;"># format: id,sentence</span>
  <span style="color: #000000;">1</span>,My dog has fleas
  <span style="color: #000000;">2</span>,My <span style="color: #c20cb9; font-weight: bold;">cat</span> Mr. Mittens has fleas
&nbsp;
  SELECT my_lower<span style="color: #7a0874; font-weight: bold;">&#40;</span>sentence<span style="color: #7a0874; font-weight: bold;">&#41;</span> from a_table;
&nbsp;
  <span style="color: #666666; font-style: italic;"># returns:</span>
  my dog has fleas
  my <span style="color: #c20cb9; font-weight: bold;">cat</span> mr. mittens has fleas</pre></div></div>

<p>
Easy!
</p>
</div>
</div>
<div id="outline-container-1_3" class="outline-3">
<h3 id="sec-1_3">UDTFs &#8211; 1 to Many </h3>
<div class="outline-text-3" id="text-1_3">
<p>
One problem with an <code>exec.UDF</code> is that you can only return one record. Often, we will want to take one record and transform it into multiple records. For this we can use a <code>GenericUDTF</code>.
</p>
<p>
A Generic User-defined Table Generating Function (<code>GenericUDTF</code>) generates a variable number of output rows for a single input row.
</p>
<p>
For instance, say we want to take our sentences and generate a count for each word:
</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">&nbsp;
  <span style="color: #666666; font-style: italic;"># given:  a_table</span>
  <span style="color: #666666; font-style: italic;"># format: id,sentence</span>
  <span style="color: #000000;">1</span>,My dog has fleas
  <span style="color: #000000;">2</span>,My <span style="color: #c20cb9; font-weight: bold;">cat</span> Mr. Mittens has fleas
&nbsp;
  SELECT tokenize<span style="color: #7a0874; font-weight: bold;">&#40;</span>sentence<span style="color: #7a0874; font-weight: bold;">&#41;</span> AS <span style="color: #7a0874; font-weight: bold;">&#40;</span>word, count<span style="color: #7a0874; font-weight: bold;">&#41;</span> FROM a_table;
&nbsp;
  <span style="color: #666666; font-style: italic;"># returns:</span>
  my <span style="color: #000000;">1</span>
  dog <span style="color: #000000;">1</span>
  has <span style="color: #000000;">1</span>
  fleas <span style="color: #000000;">1</span>
  my <span style="color: #000000;">1</span>
  <span style="color: #c20cb9; font-weight: bold;">cat</span> <span style="color: #000000;">1</span>
  mr. <span style="color: #000000;">1</span>
  mittens <span style="color: #000000;">1</span>
  has <span style="color: #000000;">1</span>
  fleas <span style="color: #000000;">1</span></pre></div></div>

<p>
(From there you could easily add a <code>GROUP BY</code> clause and find the global count for each word.)
</p>
<p>
To do this we subclass a <code>UDTF</code>. Subclassing a <code>UDTF</code> in Clojure is more involved than with a <code>UDF</code>. However, with <a href="https://github.com/jashmenn/smoker">smoker</a> I&#8217;ve created some functions that make it much easier to create custom <code>UDTFs</code>.
</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">  <span style="color: #66cc66;">&#40;</span>ns smoker<span style="color: #66cc66;">.</span>udf<span style="color: #66cc66;">.</span>MyTokenize
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> <span style="color: #66cc66;">&#91;</span>smoker<span style="color: #66cc66;">.</span>udtf<span style="color: #66cc66;">.</span>gen <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> gen<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">import</span> <span style="color: #66cc66;">&#91;</span>org<span style="color: #66cc66;">.</span>apache<span style="color: #66cc66;">.</span>hadoop<span style="color: #66cc66;">.</span>hive<span style="color: #66cc66;">.</span>serde2<span style="color: #66cc66;">.</span>objectinspector<span style="color: #66cc66;">.</span>primitive 
              PrimitiveObjectInspectorFactory<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>str-utils2 <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> su<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
  <span style="color: #66cc66;">&#40;</span>gen/gen-udtf<span style="color: #66cc66;">&#41;</span>
  <span style="color: #66cc66;">&#40;</span>gen/gen-wrapper-methods 
   <span style="color: #66cc66;">&#91;</span>PrimitiveObjectInspectorFactory/javaStringObjectInspector
    PrimitiveObjectInspectorFactory/javaIntObjectInspector<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
  <span style="color: #66cc66;">&#40;</span>defn -operate <span style="color: #66cc66;">&#91;</span>this line<span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span>map 
     <span style="color: #66cc66;">&#40;</span>fn <span style="color: #66cc66;">&#91;</span>token<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#91;</span>token <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">Integer</span>/valueOf <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> 
     <span style="color: #66cc66;">&#40;</span>su/split line #<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\s</span>+&quot;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>
The function <code>gen-udtf</code> creates the <code>gen-class</code> directive needed to compile this namespace into a Java <code>.class</code> file. <code>gen-wrapper-methods</code> lets you specify the types your function will emit in each tuple.
</p>
<p>
Use the <code>PrimitiveObjectInspectorFactory</code> to specify the types you plan on returning. For example:
</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #66cc66;">&#40;</span>gen/gen-wrapper-methods 
 <span style="color: #66cc66;">&#91;</span>PrimitiveObjectInspectorFactory/javaStringObjectInspector
  PrimitiveObjectInspectorFactory/javaIntObjectInspector<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>
This will allow you to return a tuple of <code>(String, int)</code>.
</p>
<p>
Now we write an <code>-operate</code> method that accepts <code>[this &amp; args]</code> and returns a seq of tuples that match the specified types.
</p>
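<p>
As a minimal sketch, the body of <code>-operate</code> is an ordinary function of the input line, so it can be tried at a REPL without Hive. The sketch assumes <code>clojure.string</code> in place of the <code>clojure.contrib.str-utils2</code> namespace used above, and <code>tokenize-line</code> is a hypothetical helper name, not part of <code>smoker</code>:
</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">  (require '[clojure.string :as str])
&nbsp;
  ;; Hypothetical helper: the same logic as the -operate body above,
  ;; without the `this` argument.
  (defn tokenize-line [line]
    (map (fn [token] [token (int 1)])
         (str/split line #"\s+")))
&nbsp;
  (tokenize-line "My dog has fleas")
  ;; =&gt; (["My" 1] ["dog" 1] ["has" 1] ["fleas" 1])</pre></div></div>

<p>
Each element of the returned seq is one <code>(String, int)</code> tuple, matching the two object inspectors passed to <code>gen-wrapper-methods</code>.
</p>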
</div>
</div>
<div id="outline-container-1_4" class="outline-3">
<h3 id="sec-1_4">UDTFs + LATERAL VIEW </h3>
<div class="outline-text-3" id="text-1_4">
<p>
One problem with <code>UDTFs</code> (in Hive 0.5.0) is that you can only have one <code>UDTF</code> per query and the <code>UDTF</code> is the only thing you can <code>SELECT</code> in that query. This is a problem. Let&#8217;s say that we want to get the count of each word <i>within a particular document</i>.
</p>
<p>
This simple approach doesn&#8217;t work:
</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">  <span style="color: #666666; font-style: italic;"># wrong</span>
  SELECT <span style="color: #c20cb9; font-weight: bold;">id</span>, tokenize<span style="color: #7a0874; font-weight: bold;">&#40;</span>sentence<span style="color: #7a0874; font-weight: bold;">&#41;</span> AS <span style="color: #7a0874; font-weight: bold;">&#40;</span>word, count<span style="color: #7a0874; font-weight: bold;">&#41;</span> FROM a_table;
  <span style="color: #666666; font-style: italic;"># Hive ERROR...</span></pre></div></div>

<p>
Instead, we must use the <code>LATERAL VIEW</code> syntax <sup><a class="footref" name="fnr.1" href="#fn.1">1</a></sup>:
</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">  SELECT tokenize<span style="color: #7a0874; font-weight: bold;">&#40;</span>sentence<span style="color: #7a0874; font-weight: bold;">&#41;</span> AS <span style="color: #7a0874; font-weight: bold;">&#40;</span>word, count<span style="color: #7a0874; font-weight: bold;">&#41;</span> FROM a_table;
&nbsp;
  SELECT <span style="color: #c20cb9; font-weight: bold;">id</span>, word, count
  FROM a_table 
  LATERAL VIEW tokenize<span style="color: #7a0874; font-weight: bold;">&#40;</span>sentence<span style="color: #7a0874; font-weight: bold;">&#41;</span> tokenizedTable <span style="color: #c20cb9; font-weight: bold;">as</span> word, count;
&nbsp;
  <span style="color: #666666; font-style: italic;"># returns:</span>
  <span style="color: #666666; font-style: italic;"># id word count</span>
  <span style="color: #000000;">1</span> my <span style="color: #000000;">1</span>
  <span style="color: #000000;">1</span> dog <span style="color: #000000;">1</span>
  <span style="color: #000000;">1</span> has <span style="color: #000000;">1</span>
  <span style="color: #000000;">1</span> fleas <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> my <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> <span style="color: #c20cb9; font-weight: bold;">cat</span> <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> mr. <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> mittens <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> has <span style="color: #000000;">1</span>
  <span style="color: #000000;">2</span> fleas <span style="color: #000000;">1</span></pre></div></div>

<p>
Now you can easily add a <code>GROUP BY id,word</code> clause and <code>SUM</code> the <code>count</code> to get a count of the words <span style="text-decoration:underline;">within</span> each document.
</p>
</div>
</div>
<div id="outline-container-1_5" class="outline-3">
<h3 id="sec-1_5">Summary</h3>
<div class="outline-text-3" id="text-1_5">
<p>
Hive <code>UDFs</code> are a powerful way to extend the functionality of Hive. Writing them in Clojure using <a href="https://github.com/jashmenn/smoker">smoker</a> is fast and easy.
</p>
<p>
If you&#8217;re interested in how the Hive boilerplate code is generated, check out the code in <a href="https://github.com/jashmenn/smoker/blob/master/src/clj/smoker/udtf/gen.clj">gen.clj</a>, or browse the rest of the <code>smoker</code> project on <a href="https://github.com/jashmenn/smoker">GitHub</a>.
</p>
<p>
You can follow me on <a href="http://twitter.com/xcombinator">twitter here</a>.
</p>
</div>
</div>
<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> In Hive 0.5.0 if you have a <code>WHERE</code> clause with your <code>LATERAL VIEW</code> you may get a <code>FAILED: Unknown exception: null</code> error. The temporary fix is to put <code>set hive.optimize.ppd=false;</code> before your query. See: <a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView">LateralView in HiveQL</a> and <a href="https://issues.apache.org/jira/browse/HIVE-1056">HIVE-1056</a>.
</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/04/29/custom-hive-udfs-in-clojure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/04/29/custom-hive-udfs-in-clojure/</feedburner:origLink></item>
		<item>
		<title>Clojure’s keyword can fill up your PermGen space</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/hfnbZlL0JLg/</link>
		<comments>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 15:49:08 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[clojure]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=413</guid>
		<description><![CDATA[We&#8217;ve been working on a custom web-crawler for a few months now. Recently we were having a problem where after a few minutes the JVM would run out of PermGen space.
If you&#8217;re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information about [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been working on a custom web-crawler for a few months now. Recently we were having a problem where after a few minutes the JVM would run out of PermGen space.</p>
<p>If you&#8217;re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information about Classes and interned Strings. </p>
<p>When you intern a String the JVM stores a single copy of that String in the PermGen space.  This can save RAM because only one copy of the String will exist in the system. It can also speed up <tt>==</tt> comparisons for two interned Strings because you only have to compare the reference not the characters.</p>
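<p>As a small illustration at a Clojure REPL (the names <tt>a</tt> and <tt>b</tt> exist only for this sketch), <tt>identical?</tt> is Clojure&#8217;s reference comparison, the counterpart of Java&#8217;s <tt>==</tt> on objects. It holds for interned Strings but not for two separately constructed ones:</p>
<pre>
;; two distinct String objects holding the same characters
(def a (String. "disallow"))
(def b (String. "disallow"))

(= a b)                               ; true: same characters
(identical? a b)                      ; false: different objects
(identical? (.intern a) (.intern b))  ; true: one shared interned copy
</pre>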
<p>The problem is, the PermGen space is typically very small (64m is a common default). So if you have many classes or a lot of interned Strings, you can easily blow out the PermGen space.</p>
<p>This problem was showing up in our crawler and we traced it to how we were parsing <tt>robots.txt</tt>.</p>
<blockquote><p>
<tt>robots.txt</tt> is a convention website owners can use that will instruct<br />
crawlers how to act while they are on their site. All polite crawlers<br />
use them. For example: </p>
<pre>
Disallow: /no-crawl/
Allow: /
Sitemap: http://www.foo.com/sitemap.xml
</pre>
</blockquote>
<p>In our crawler, we&#8217;ve written a custom <tt>robots.txt</tt> parsing library: <a href="https://github.com/retiman/clj-robots">clj-robots (github)</a>.</p>
<p>In <tt>clj-robots</tt>, there was one section of the code where we were taking the left-hand side of each <tt>robots.txt</tt> line (the directive name) and converting it to a keyword. This made for cleaner code than comparing Strings. Since there are only a fixed number of <tt>robots.txt</tt> directives, this should be safe, right?</p>
<p>It turns out it isn&#8217;t safe. First, you don&#8217;t know what people are actually going to put in their <tt>robots.txt</tt>. Second, we forgot that many sites without a <tt>robots.txt</tt> don&#8217;t return an empty 404; they often return their custom 404 HTML page. We were parsing that HTML page as a <tt>robots.txt</tt> and then interning everything that looked like a <tt>robots.txt</tt> directive. </p>
<p>The result was a spectacular &#8220;java.lang.OutOfMemoryError: PermGen space&#8221; after just a few minutes. The general principle here is that you should never allow user-generated input to become an interned String.</p>
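<p>One way to follow this rule, sketched here under assumptions (this is not necessarily how <tt>clj-robots</tt> was changed, and <tt>known-directives</tt> and <tt>directive-key</tt> are hypothetical names): look the directive up in a fixed map keyed by Strings, so the only keywords that ever exist are the ones written in the source.</p>
<pre>
(require '[clojure.string :as str])

;; The only keywords ever created are the ones in this literal map.
(def known-directives
  {"user-agent"  :user-agent
   "disallow"    :disallow
   "allow"       :allow
   "sitemap"     :sitemap
   "crawl-delay" :crawl-delay})

(defn directive-key
  "Returns the keyword for a known directive name, or nil for anything else."
  [s]
  (get known-directives (str/lower-case (str/trim s))))

(directive-key "Disallow")   ; =&gt; :disallow
(directive-key "&lt;html")      ; =&gt; nil, and no new keyword is created
</pre>
<p>Lines whose left-hand side maps to <tt>nil</tt> can simply be skipped, so a 404 HTML page parsed by mistake costs nothing in PermGen.</p>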
<p>Lessons learned:</p>
<ul>
<li>PermGen stores Classes and interned Strings</li>
<li>Clojure&#8217;s <tt>keyword</tt> interns a String </li>
<li>Don&#8217;t call <tt>keyword</tt> on user-generated input</li>
<li>A profiling tool (e.g. <a href="http://www.ej-technologies.com/products/jprofiler/overview.html">JProfiler</a>) can be your best friend in these situations</li>
</ul>
<p>References:</p>
<ul>
<li><a href="http://java.sun.com/docs/hotspot/gc1.4.2/faq.html">Java GC FAQ</a></li>
<li><a href="https://github.com/richhickey/clojure/blob/master/src/jvm/clojure/lang/Keyword.java">Keyword.java from Clojure Core</a></li>
<li><a href="http://mindprod.com/jgloss/interned.html">interned Strings : Java Glossary</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/</feedburner:origLink></item>
		<item>
		<title>World’s Fastest Binary Search?</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/TyjBsv8aoao/</link>
		<comments>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/#comments</comments>
		<pubDate>Fri, 21 Jan 2011 12:17:48 +0000</pubDate>
		<dc:creator>Matt Pulver</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=379</guid>
		<description><![CDATA[(Edited 9 Sept 2011: A few minor improvements have been made; see the updated post instead: Binary Search Revisited.)
Every CS student knows how to write a binary search algorithm. The question is, are we really making full use of the fact that we are using a binary computer to do a binary search? While it [...]]]></description>
			<content:encoded><![CDATA[<p>(Edited 9 Sept 2011: A few minor improvements have been made; see the updated post instead: <a href="http://www.xcombinator.com/2011/09/09/binary-search-revisited/">Binary Search Revisited</a>.)</p>
<p>Every CS student knows how to write a <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">binary search algorithm</a>. The question is, are we really making full use of the fact that we are using a binary computer to do a binary search? While it is true that computers are good at doing arithmetic, they are even better at doing pure bit logic. Often our grammar-school arithmetic way of thinking can get in the way of solving a problem that has a more elegant solution in terms of lower-level bit logic. Binary search in an ordered list is one such problem.</p>
<p>Your homework problem: re-write the typical textbook binary search in an ordered list algorithm, making it <em>faster</em>, without doing any arithmetic within the main loop of the search algorithm (underlying pointer arithmetic by direct access of array elements is ok).<br />
<span id="more-379"></span></p>
<p>Solution in C++:<br />
<font face="monospace"><br />
<font color="#8080ff">// Fastest binary search</font><br />
<font color="#ff40ff">#include </font><font color="#ff6060">&lt;iostream&gt;</font><br />
<font color="#ff40ff">#include </font><font color="#ff6060">&lt;math.h&gt;</font></p>
<p><font color="#8080ff">// Given a value x, return the index of the largest</font><br />
<font color="#8080ff">// element in a sorted list less than or equal to x.</font><br />
<font color="#8080ff">// Return -1 if x is less than all elements, or list</font><br />
<font color="#8080ff">// is empty. T can be any type for which &lt;= is defined.</font><br />
<font color="#00ff00">template</font>&nbsp;&lt;<font color="#00ff00">typename</font>&nbsp;T&gt;<br />
<font color="#00ff00">int</font>&nbsp;fbsearch( <font color="#00ff00">const</font>&nbsp;T *sorted_list, <font color="#00ff00">size_t</font>&nbsp;list_size, T x )<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">if</font>( list_size &lt;= <font color="#ff6060">1</font>&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;list_size &amp;&amp; *sorted_list &lt;= x ? <font color="#ff6060">0</font>&nbsp;: -<font color="#ff6060">1</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;i = <font color="#ff6060">0</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;b = <font color="#ff6060">1</font>&nbsp;&lt;&lt; <font color="#00ff00">int</font>( log(list_size-<font color="#ff6060">1</font>) / <font color="#ff6060">M_LN2</font>&nbsp;);<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">for</font>( ; b ; b &gt;&gt;= <font color="#ff6060">1</font>&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;j = i | b;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">if</font>( j &lt; list_size &amp;&amp; sorted_list[j] &lt;= x ) i = j;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;i || *sorted_list &lt;= x ? i : -<font color="#ff6060">1</font>;<br />
}</p>
<p><font color="#ffff00">using</font>&nbsp;<font color="#00ff00">namespace</font>&nbsp;std;</p>
<p><font color="#00ff00">int</font>&nbsp;main()<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">const</font>&nbsp;<font color="#00ff00">int</font>&nbsp;sorted_list[] =<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ <font color="#ff6060">2</font>, <font color="#ff6060">3</font>, <font color="#ff6060">5</font>, <font color="#ff6060">7</font>, <font color="#ff6060">11</font>, <font color="#ff6060">13</font>, <font color="#ff6060">17</font>, <font color="#ff6060">19</font>, <font color="#ff6060">23</font>, <font color="#ff6060">29</font>&nbsp;};<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">const</font>&nbsp;<font color="#00ff00">size_t</font>&nbsp;list_size = <font color="#ffff00">sizeof</font>(sorted_list)/<font color="#ffff00">sizeof</font>(<font color="#00ff00">int</font>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">int</font>&nbsp;i = <font color="#ff6060">7</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;cout &lt;&lt; <font color="#ff6060">&quot;fbsearch(sorted_list,&quot;</font>&lt;&lt;list_size&lt;&lt;<font color="#ff6060">&#39;,&#39;</font>&lt;&lt;i&lt;&lt;<font color="#ff6060">&quot;) = &quot;</font><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;&lt; fbsearch(sorted_list,list_size,i) &lt;&lt; endl;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;<font color="#ff6060">0</font>;<br />
}<br />
</font></p>
<p>Technical Notes:</p>
<ul>
<li>Feel free to add a statement within the loop that returns the index as soon as x is found. Note, however, that for large n you will on average spend more time checking for equality than you save by returning early.</li>
<li>Instead of dividing by ln(2), feel free to save the constant 1/ln(2) and multiply by it instead.</li>
</ul>
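<p>A side note on the starting bit: the C++ above derives it from a floating-point logarithm, and a floating-point quotient can in principle land just under a whole number when <code>list_size-1</code> is an exact power of two. An integer-only computation sidesteps that question. Here is a minimal sketch in Ruby; <code>top_bit</code> is a hypothetical helper name, not part of the original solution.</p>

```ruby
# Hypothetical helper: largest power of two that is <= n, using integer ops only.
# Integer#bit_length is the number of bits needed to represent n.
def top_bit(n)
  raise ArgumentError, "n must be positive" unless n.positive?
  1 << (n.bit_length - 1)
end

# Starting bit for a 10-element list: the highest set bit of the last index, 9 (binary 1001).
puts top_bit(10 - 1).to_s(2)   # prints "1000"
```

<p>In C++ the same idea could be expressed with a short shift loop or a compiler intrinsic, which is left as an assumption here rather than shown.</p>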
<p>So what&#8217;s going on here? Take a list of the first 10 prime numbers, for example, and enumerate them using base-2:</p>
<pre>
0000   2
0001   3
0010   5
0011   7
0100  11
0101  13
0110  17
0111  19
1000  23
1001  29
</pre>
<p>If we are given one of the items on the list, say 13, then we are looking for its 4-bit index, 0101. Each bit is the answer to a yes/no question. The first bit asks, &#8220;Are you at index 1000 or higher?&#8221; No, so the second bit asks, &#8220;Are you at index 0100 or higher?&#8221; Yes, so the third bit asks, &#8220;Are you at index 0110 or higher?&#8221; No, so the last bit asks, &#8220;Are you at index 0101 or higher?&#8221; Yes. We have now asked all 4 questions, and the answers (no, yes, no, yes) are the 4 bits of the index: 0101.</p>
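<p>The four questions can be replayed in code. The following is a rough Ruby port of the same idea, written for illustration only (it is a sketch, not the original C++ and not tuned for speed). The optional block that reports each question is an addition made here so the trace is visible.</p>

```ruby
# Bitwise binary search, mirroring the C++ fbsearch: returns the index of the
# largest element <= x, or -1 if there is none. If a block is given, it is
# called with each candidate index and the yes/no answer.
def fbsearch(sorted, x)
  return -1 if sorted.empty? || sorted.first > x
  i = 0
  b = 1 << [(sorted.size - 1).bit_length - 1, 0].max  # highest bit of the last index
  while b > 0
    j = i | b                                 # tentatively set this bit
    yes = j < sorted.size && sorted[j] <= x   # "are you at index j or higher?"
    yield j, yes if block_given?
    i = j if yes                              # keep the bit only on a "yes"
    b >>= 1
  end
  i
end

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
index = fbsearch(primes, 13) do |j, yes|
  puts format("index %04b or higher? %s", j, yes ? "yes" : "no")
end
puts format("result: %04b", index)
```

<p>For 13 the trace reads no, yes, no, yes, and the result is 0101, matching the walk-through above.</p>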
<p>Moral of the story: Sometimes it&#8217;s better to think like a silicon being than a carbon one.</p>
<p>p.s. Happy birthday, Nate!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/</feedburner:origLink></item>
		<item>
		<title>URL Normalization in Clojure</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/9oxr_IjXjJk/</link>
		<comments>http://eigenjoy.com/2010/12/02/url-normalization-in-clojure/#comments</comments>
		<pubDate>Thu, 02 Dec 2010 18:15:28 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[crawling]]></category>
		<category><![CDATA[clojure]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=367</guid>
		<description><![CDATA[Bandwidth is often one of the first bottlenecks you&#8217;ll hit when web crawling. So, it&#8217;s in your best interest to crawl each page only once (ignoring recrawls). In order to know that you&#8217;ve already crawled a page you need to keep an identifier of each page that you&#8217;ve crawled. 
The naive solution to this is [...]]]></description>
			<content:encoded><![CDATA[<p>Bandwidth is often one of the first bottlenecks you&#8217;ll hit when web crawling. So, it&#8217;s in your best interest to crawl each page only once (ignoring recrawls). In order to know that you&#8217;ve already crawled a page you need to keep an identifier of each page that you&#8217;ve crawled. </p>
<p>The naive solution to this is to just use the URL as the key. But it&#8217;s easy to see that this will cause duplicate pages to be downloaded because:</p>
<ul>
<li>URLs for a given page aren&#8217;t consistent even within a given site. (e.g. <code>http://www.foo.com/</code> and <code>http://www.foo.com/index.html</code>)</li>
<li>Many pages link to anchors that all live on the same page (e.g. <code>http://www.foo.com/index.html</code> and <code>http://www.foo.com/index.html#locations</code>)</li>
</ul>
<p>Pop Quiz: What&#8217;s the &#8220;normal&#8221; form of each of these URLs?</p>
<pre><code>http://www.foo.com:80/foo
http://www.foo.com/foo/../foo#bam
http://:@www.FOO.com/foo/../foo
</code></pre>
<p>The answer: <code>http://www.foo.com/foo</code>.</p>
<p>We need a URL <em>normalizer</em> that will return a consistent URL for all URLs that point to a given page. Again, note that a single <em>page</em> has many <em>URL</em>s.</p>
<blockquote>
<p>If the URL <code>http://:@www.FOO.com/foo/../foo</code> seems a bit contrived,<br />
  let me tell you: it isn&#8217;t. As soon as you start crawling you learn the web is full of hideous markup including non-intuitive (and nonsensical) URLs.</p>
</blockquote>
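<p>To make the pop quiz concrete, here is a deliberately small sketch in Ruby of the rules it exercises: lowercase the scheme and host, drop an empty <code>:@</code> userinfo, drop the default port, resolve <code>.</code> and <code>..</code> path segments, and discard the fragment. The names <code>normalize_url</code>, <code>remove_dot_segments</code>, <code>URL_PARTS</code> and <code>DEFAULT_PORTS</code> are made up for this illustration, and the regular expression is an assumption that only covers simple absolute URLs (no IPv6 hosts, no percent-encoding rules, no <code>index.html</code> equivalence).</p>

```ruby
# Illustrative only: a hand-rolled sketch, not the url-normalizer library.
DEFAULT_PORTS = { "http" => "80", "https" => "443" }.freeze

# scheme://[userinfo@]host[:port][/path][?query][#fragment]
URL_PARTS = %r{\A([a-z][a-z0-9+.-]*)://(?:([^/?@#]*)@)?([^/?:#]*)(?::(\d*))?(/[^?#]*)?(?:\?([^#]*))?(?:#.*)?\z}i

# Resolve "." and ".." path segments, e.g. "/foo/../foo" becomes "/foo".
def remove_dot_segments(path)
  segments = path.split("/", -1)
  out = []
  segments.each do |seg|
    if seg == ".."
      out.pop if out.size > 1       # never pop the leading empty segment
    elsif seg != "."
      out << seg
    end
  end
  out << "" if [".", ".."].include?(segments.last)  # keep the trailing slash
  out.join("/")
end

def normalize_url(url)
  m = URL_PARTS.match(url.strip)
  raise ArgumentError, "not an absolute URL: #{url}" unless m
  scheme, userinfo, host, port, path, query = m.captures
  scheme = scheme.downcase
  userinfo = nil if userinfo && userinfo.delete(":").empty?  # drop an empty ":@"
  port = nil if port && (port.empty? || port == DEFAULT_PORTS[scheme])
  result = "#{scheme}://"
  result += "#{userinfo}@" if userinfo
  result += host.downcase
  result += ":#{port}" if port
  result += remove_dot_segments(path || "/")
  result += "?#{query}" if query && !query.empty?
  result                            # the fragment is always dropped
end

[
  "http://www.foo.com:80/foo",
  "http://www.foo.com/foo/../foo#bam",
  "http://:@www.FOO.com/foo/../foo"
].each { |u| puts normalize_url(u) }
```

<p>All three quiz URLs come out as <code>http://www.foo.com/foo</code>. A crawler could then use the normalized string as the key in its set of already-visited pages.</p>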
<p>URL normalization is one of those problems that seems simple but, in fact, the details get pretty hairy. So <a href="http://yakkstr.com/users/ddonnell">Jay Donnell</a> and I have been working on a small URL normalizer that makes it easy. It&#8217;s still young but already passes a <a href="https://github.com/jashmenn/url-normalizer/blob/master/test/url_normalizer/test/core.clj">large number of tests</a>, including most of the <a href="http://www.intertwingly.net/wiki/pie/PaceCanonicalIds">Pace URL Normalization Tests</a>.</p>
<h2>Usage</h2>
<pre><code> (ns my.namespace
     (:use [url-normalizer.core]))

 (canonicalize-url "http://www.example.com:80/foo#bar")
 ;; =&gt; "http://www.example.com/foo"
</code></pre>
<p>The code is <a href="https://github.com/jashmenn/url-normalizer">on GitHub</a> and the jar is <a href="http://clojars.org/url-normalizer">on Clojars</a>.</p>
<p>The inspiration for this library comes from <a href="http://intertwingly.net/blog/2004/08/04/Urlnorm">Sam Ruby&#8217;s <code>urlnorm.py</code></a>.</p>
<blockquote>
<p>Interested in URL Normalization? Want to write a large-scale web-crawler in Clojure? We&#8217;re hiring. <a href="mailto:nmurray@attinteractive.com">Send me an email</a>.</p>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/12/02/url-normalization-in-clojure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/12/02/url-normalization-in-clojure/</feedburner:origLink></item>
		<item>
		<title>Extract Text from a HTML Document in Clojure</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/P5fGCZ1DtqA/</link>
		<comments>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/#comments</comments>
		<pubDate>Wed, 01 Dec 2010 21:03:32 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[clojure]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=361</guid>
		<description><![CDATA[There are many Java HTML parsers and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document I&#8217;d recommend using the Jericho HTML Parser.
Here&#8217;s a quick example on how to use it:

;; lein dependency: [net.htmlparser.jericho/jericho-html &#34;3.1&#34;]
&#40;ns foo.preprocess
  &#40;:import 
   [...]]]></description>
			<content:encoded><![CDATA[<p>There are many <a href="http://java-source.net/open-source/html-parsers">Java HTML parsers</a> and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document I&#8217;d recommend using the <a href="http://jericho.htmlparser.net/docs/index.html">Jericho HTML Parser</a>.</p>
<p>Here&#8217;s a quick example of how to use it:</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">;; lein dependency: [net.htmlparser.jericho/jericho-html &quot;3.1&quot;]</span>
<span style="color: #66cc66;">&#40;</span>ns foo<span style="color: #66cc66;">.</span>preprocess
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">import</span> 
   <span style="color: #66cc66;">&#91;</span>java<span style="color: #66cc66;">.</span>io File BufferedInputStream FileInputStream<span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>net<span style="color: #66cc66;">.</span>htmlparser<span style="color: #66cc66;">.</span>jericho Source TextExtractor<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn extract-text 
  <span style="color: #ff0000;">&quot;given File returns a String of the extracted text&quot;</span>
  <span style="color: #66cc66;">&#91;</span>f<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">let</span> <span style="color: #66cc66;">&#91;</span>source <span style="color: #66cc66;">&#40;</span>Source<span style="color: #66cc66;">.</span> <span style="color: #66cc66;">&#40;</span>BufferedInputStream<span style="color: #66cc66;">.</span> <span style="color: #66cc66;">&#40;</span>FileInputStream<span style="color: #66cc66;">.</span> f<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">.</span>toString <span style="color: #66cc66;">&#40;</span>TextExtractor<span style="color: #66cc66;">.</span> source<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>def filename <span style="color: #ff0000;">&quot;data/some-index.html&quot;</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#40;</span>extract-text <span style="color: #66cc66;">&#40;</span>java<span style="color: #66cc66;">.</span>io<span style="color: #66cc66;">.</span>File<span style="color: #66cc66;">.</span> filename<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p><code>TextExtractor</code> has sensible defaults and ignores CSS and JavaScript by default. See the <a href="http://jericho.htmlparser.net/docs/javadoc/index.html">TextExtractor</a> class documentation for more details.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/</feedburner:origLink></item>
		<item>
		<title>Boyer-Moore string search algorithm in Ruby</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/1u070d8_lBM/</link>
		<comments>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/#comments</comments>
		<pubDate>Wed, 27 Oct 2010 14:09:46 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=318</guid>
		<description><![CDATA[Just a quick post. I&#8217;ve converted the C code from the wikipedia entry (this version) on the Boyer-Moore string search algorithm to Ruby. I&#8217;ve extended it to support searches on token arrays and regular expressions.
You can find the code on github.
Usage:

    BoyerMoore.search&#40;haystack, needle&#41;   # returns index of needle or nil

Examples:
Basic [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick post. I&#8217;ve converted the C code from the <a href="http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm">wikipedia entry</a> <small>(<a href="http://en.wikipedia.org/w/index.php?title=Boyer%E2%80%93Moore_string_search_algorithm&#038;diff=391986850&#038;oldid=391398281">this version</a>)</small> on the Boyer-Moore string search algorithm to Ruby. I&#8217;ve extended it to support searches on token arrays and regular expressions.</p>
<p>You can find the <a href="http://github.com/jashmenn/boyermoore">code on github</a>.</p>
<p>Usage:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span>haystack, needle<span style="color:#006600; font-weight:bold;">&#41;</span>   <span style="color:#008000; font-style:italic;"># returns index of needle or nil</span></pre></div></div>

<p>Examples:</p>
<p>Basic search in string:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;ANPANMAN&quot;</span>, <span style="color:#996600;">&quot;ANP&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>   <span style="color:#008000; font-style:italic;"># =&gt; 0</span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;ANPANMAN&quot;</span>, <span style="color:#996600;">&quot;ANPXX&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; nil </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;foobar&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>     <span style="color:#008000; font-style:italic;"># =&gt; 3</span></pre></div></div>

<p>You can also search an array of tokens:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;&lt;b&gt;&quot;</span>, <span style="color:#996600;">&quot;hi&quot;</span>, <span style="color:#996600;">&quot;&lt;/b&gt;&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;hi&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>         <span style="color:#008000; font-style:italic;"># =&gt; 1 </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;bam&quot;</span>, <span style="color:#996600;">&quot;foo&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;foo&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; 1 </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;bam&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span>, <span style="color:#996600;">&quot;baz&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;foo&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>        <span style="color:#008000; font-style:italic;"># =&gt; nil</span></pre></div></div>

<p>A token can be a regular expression:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;Sing&quot;</span>, <span style="color:#996600;">&quot;99&quot;</span>, <span style="color:#996600;">&quot;Luftballon&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">/</span>\d<span style="color:#006600; font-weight:bold;">+/</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; 1</span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;Nate Murray&quot;</span>, <span style="color:#996600;">&quot;5 Pine Street&quot;</span>, <span style="color:#996600;">&quot;Los Angeles&quot;</span>, <span style="color:#996600;">&quot;CA&quot;</span>, <span style="color:#996600;">&quot;90210&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">/</span>^\w<span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#125;</span>$<span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#006600; font-weight:bold;">/</span>^\d<span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006666;">5</span><span style="color:#006600; font-weight:bold;">&#125;</span>$<span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; 3</span></pre></div></div>

<p>Notes:</p>
<p>The regular-expression token matching is a bit of a hack and will be fairly slow because every hash miss is compared against every regular expression key. You probably shouldn&#8217;t use the regular expression token search for anything more than a toy.</p>
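<p>For readers who want the core idea without reading the full source, here is a simplified sketch. It is not the code from the GitHub repository: it implements only the Horspool variant (the bad-character shift, without the good-suffix rule), it does not handle regular-expression tokens, and the name <code>horspool_search</code> is made up for this illustration. It does show why the same routine works for both strings and token arrays: the shift table is just a hash keyed by whatever the elements are.</p>

```ruby
# Simplified Boyer-Moore-Horspool search over strings or token arrays.
# Returns the index of the first match of needle in haystack, or nil.
def horspool_search(haystack, needle)
  hay = haystack.is_a?(String) ? haystack.chars : haystack
  pat = needle.is_a?(String) ? needle.chars : needle
  m = pat.length
  return 0 if m.zero?
  return nil if m > hay.length

  # Bad-character table: for every token except the last one in the pattern,
  # the distance from its last occurrence to the end of the pattern.
  # Tokens that are not in the table shift by the full pattern length.
  shift = Hash.new(m)
  pat[0...-1].each_with_index { |tok, idx| shift[tok] = m - 1 - idx }

  pos = 0
  while pos <= hay.length - m
    k = m - 1
    k -= 1 while k >= 0 && hay[pos + k] == pat[k]   # compare right to left
    return pos if k < 0                             # whole pattern matched
    pos += shift[hay[pos + m - 1]]                  # skip ahead using the table
  end
  nil
end

p horspool_search("ANPANMAN", "ANP")
p horspool_search("foobar", "bar")
p horspool_search(["bam", "foo", "bar"], ["foo", "bar"])
```

<p>On the plain string and token examples from this post, the sketch returns the same indexes as the usage examples above.</p>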
<p>Download the <a href="http://github.com/jashmenn/boyermoore">Boyer-Moore string search algorithm in Ruby</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/</feedburner:origLink></item>
		<item>
		<title>A Paging UIScrollView in Cocos2d (with previews)</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/uyZ-rtYC8nA/</link>
		<comments>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/#comments</comments>
		<pubDate>Thu, 09 Sep 2010 03:46:36 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=300</guid>
		<description><![CDATA[I&#8217;ve created a sample project that shows how to do a paged UIScrollView within Cocos2d. Here&#8217;s a video showing the effect:

You can find the code on github
My solution&#8217;s main ideas are adapted from these two pages:

http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/
http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html

My contribution is combining the UIScrollView with previews with Cocos2d and cleaning it up.
If you haven&#8217;t tried to implement this [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve created a sample project that shows how to do a paged <code>UIScrollView</code> within Cocos2d. Here&#8217;s a video showing the effect:</p>
<p><object width="499" height="306"><param name="movie" value="http://www.youtube.com/v/2IgbRzGfBHk?fs=1&amp;hl=en_US&amp;rel=0&amp;hd=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/2IgbRzGfBHk?fs=1&amp;hl=en_US&amp;rel=0&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="499" height="306"></embed></object></p>
<p>You can find the code <a href="http://github.com/jashmenn/shapes-panels">on GitHub</a>.</p>
<p>My solution&#8217;s main ideas are adapted from these two pages:</p>
<ul>
<li><a href="http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/">http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/</a></li>
<li><a href="http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html">http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html</a></li>
</ul>
<p>My contribution is combining the preview-style paging <code>UIScrollView</code> with Cocos2d and cleaning up the result.</p>
<p>If you haven&#8217;t tried this before, it might not be obvious why it is tricky to implement. Apple&#8217;s <code>UIScrollView</code> gives you a view that scrolls and optionally snaps to pages. The effect you see in the video above (and in the Angry Birds level selection and many other apps) shows a preview of the neighboring panel on either side. This lets you easily see whether a next or previous page exists, along with a preview of that page.</p>
<p>The problem is that Apple&#8217;s <code>UIScrollView</code> doesn&#8217;t let you set the page width separately from the width of its frame, so you can&#8217;t page by less than a whole screen (well, a whole width of the <code>UIScrollView</code>; more on that later).</p>
<p>To get around this I originally tried writing my own paging controller. If you&#8217;ve tried this you&#8217;ll know that it is extremely tricky to get the same interaction dynamics as Apple&#8217;s. (For instance, pull out your phone and play with the Photos application. Notice that if you drag slowly you lack enough inertia to go to the next page, so it snaps back to the frame you are on. If you flick fast over a small area, the page skips to the next frame. And so on.) While at first glance the rules seem easy to reimplement, you have to cover a lot of edge cases to recreate the familiar paging interaction.</p>
<p>So ideally we want to find a way to use Apple&#8217;s <code>UIScrollView</code>, which would save us a lot of work.</p>
<p>As we said above, a <code>UIScrollView</code> will only page by the width of the entire <code>UIScrollView</code>. So to get this preview effect you can create a <code>UIScrollView</code> that is narrower than the screen. The problem here is that any touches that land outside of that <code>UIScrollView</code> (say, on the edge of the screen) won&#8217;t be sent to the <code>UIScrollView</code>.</p>
<p>Our solution (again, borrowed largely from the above links) looks like this:</p>
<p><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/shapes-panels-post.jpg"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/shapes-panels-post.jpg" alt="Jacob's Shapes Panels Layers" title="shapes-panels-post" width="507" height="235" class="aligncenter size-full wp-image-306" /></a></p>
<p>The idea is this:</p>
<ul>
<li>We create a <code>CCMenu</code> and add it to a <code>CCLayer</code></li>
<li>The <code>UIScrollView</code> is resized to the width of our panel images (smaller than the whole screen)</li>
<li>The <code>UIScrollView</code> transforms its scrolling action into moving the position of the <code>CCLayer</code> containing our <code>CCMenu</code></li>
<li>We create a full-screen <code>TouchDelegatingView</code> that simply forwards its touches on to the <code>UIScrollView</code></li>
</ul>
<h2>More Details</h2>
<p>In <a href="http://www.littlehiccup.com">Jacob&#8217;s Shapes</a> (JS), we have a <code>GameController</code> which knows all of the levels. For the sake of this example, we&#8217;re just going to store all the level names in an <code>NSArray</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #6e371a;">// HCUPPanelScene.m (in onEnter)</span>
<span style="color: #400080;">NSArray</span><span style="color: #002200;">*</span> panelNames <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #400080;">NSArray</span> arrayWithObjects<span style="color: #002200;">:</span> 
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;amazon&quot;</span>, <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;arctic&quot;</span>,
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;brkfst&quot;</span>, <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;camp&quot;</span>, 
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;city&quot;</span>, <span style="color: #a61390;">nil</span><span style="color: #002200;">&#93;</span>;
<span style="color: #a61390;">int</span> numberOfPages <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>panelNames count<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// create an empty layer for us to work with</span>
CCLayer<span style="color: #002200;">*</span> panels <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>CCLayer node<span style="color: #002200;">&#93;</span>;</pre></div></div>

<h2>Custom CCMenu and CCMenuItem</h2>
<p>We use a custom subclass of <code>CCMenu</code> and <code>CCMenuItem</code>, <code>NMPanelMenu</code> and <code>NMPanelMenuItem</code>, respectively. <code>NMPanelMenu</code> tweaks how the current item is determined. Overriding <code>NMPanelMenuItem</code> allows us to add metadata about the panel, play sounds, and optimize how we use the images for selected panels.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #6e371a;">// HCUPPanelScene.m</span>
NMPanelMenu<span style="color: #002200;">*</span> menu <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>NMPanelMenu menuWithItems<span style="color: #002200;">:</span> <span style="color: #a61390;">nil</span><span style="color: #002200;">&#93;</span>;
<span style="color: #a61390;">float</span> onePanelWide <span style="color: #002200;">=</span> <span style="color: #002200;">-</span><span style="color: #2400d9;">1</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// Now add the panels</span>
<span style="color: #a61390;">for</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">int</span> i<span style="color: #002200;">=</span><span style="color: #2400d9;">0</span>; i <span style="color: #002200;">&lt;</span> numberOfPages; i<span style="color: #002200;">++</span><span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#123;</span>
    <span style="color: #400080;">NSString</span><span style="color: #002200;">*</span> currentName <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>panelNames objectAtIndex<span style="color: #002200;">:</span>i<span style="color: #002200;">&#93;</span>;
    CCSprite<span style="color: #002200;">*</span> pane2 <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>CCSprite spriteWithFile<span style="color: #002200;">:</span><span style="color: #002200;">&#91;</span><span style="color: #400080;">NSString</span> stringWithFormat<span style="color: #002200;">:</span> <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;%@-panel.png&quot;</span>, currentName<span style="color: #002200;">&#93;</span><span style="color: #002200;">&#93;</span>;
    NMPanelMenuItem<span style="color: #002200;">*</span> menuItem2 <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>NMPanelMenuItem alloc<span style="color: #002200;">&#93;</span> initFromNormalSprite<span style="color: #002200;">:</span>pane2 
                                                                selectedSprite<span style="color: #002200;">:</span>pane2
                                                                  activeSprite<span style="color: #002200;">:</span>pane2
                                                                disabledSprite<span style="color: #002200;">:</span>pane2
                                                                          name<span style="color: #002200;">:</span>currentName
                                                                        target<span style="color: #002200;">:</span>self selector<span style="color: #002200;">:</span><span style="color: #a61390;">@selector</span><span style="color: #002200;">&#40;</span>levelPicked<span style="color: #002200;">:</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#93;</span>;
    menuItem2.world <span style="color: #002200;">=</span> i;
    menuItem2.name <span style="color: #002200;">=</span> currentName;
    <span style="color: #002200;">&#91;</span>menu addChild<span style="color: #002200;">:</span> menuItem2<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#91;</span>menuItem2 release<span style="color: #002200;">&#93;</span>;
    <span style="color: #11740a; font-style: italic;">// set onePanelWide to be the width of the first panel</span>
    <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>i<span style="color: #002200;">==</span><span style="color: #2400d9;">0</span><span style="color: #002200;">&#41;</span> onePanelWide <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>pane2 textureRect<span style="color: #002200;">&#93;</span>.size.width;
<span style="color: #002200;">&#125;</span></pre></div></div>

<p>Here we used <code>CCSprite#spriteWithFile</code>, but in JS we use Zwoptex-created sprite sheets for the panels and then create sprites from those frames. This makes a huge difference in the load time of this scene when you have 20 panels. In JS, instead of loading 20 textures (one for each panel), we load only 2 textures, each containing 10 panels.</p>
<p>JS is graphics-heavy and we definitely had to pay attention to file sizes to keep it under 22MB. Originally I had created two versions of each panel, one for &#8220;off&#8221; and one with a glow for &#8220;on&#8221; (active/selected). Each panel, as a transparent PNG, was somewhere around 100KB. So 100KB x 2 states x 20 panels came to somewhere around 4MB just for this single scene.</p>
<p>We decided to sacrifice a bit of the quality of the glow for the &#8220;on&#8221; state and create just one transparent image for the glow, which we reuse for every panel.</p>
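<p>As a rough back-of-the-envelope check of that trade-off, the sketch below compares the two approaches. The 100KB panel size and the 20 panels come from the text; the 100KB size of the shared glow image is purely an assumption for illustration, and the helper names are hypothetical.</p>

```ruby
PANEL_KB = 100   # rough size of one transparent PNG panel (from the text)
PANELS = 20      # number of panels in the scene (from the text)

# Original approach: a separate "off" and "on" image for every panel.
def two_state_total_kb(panel_kb, panels)
  panel_kb * 2 * panels
end

# Revised approach: one image per panel plus a single shared glow image.
def shared_glow_total_kb(panel_kb, panels, glow_kb)
  panel_kb * panels + glow_kb
end

puts two_state_total_kb(PANEL_KB, PANELS)          # 4000 (about 4MB)
puts shared_glow_total_kb(PANEL_KB, PANELS, 100)   # 2100, with an assumed 100KB glow
```

<p>Under these assumptions, sharing the glow roughly halves the size of the scene&#8217;s images.</p>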
<p>To use the glow, a portion of our <code>NMPanelMenuItem</code> looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #6e371a;">// NMPanelMenuItem.m</span>
<span style="color: #002200;">-</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">void</span><span style="color: #002200;">&#41;</span> activate
<span style="color: #002200;">&#123;</span>
    isActive_ <span style="color: #002200;">=</span> <span style="color: #a61390;">YES</span>;
    <span style="color: #11740a; font-style: italic;">// play sound here</span>
    <span style="color: #002200;">&#91;</span>super activate<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#125;</span>
&nbsp;
<span style="color: #002200;">-</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">void</span><span style="color: #002200;">&#41;</span> draw
<span style="color: #002200;">&#123;</span>
    <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>isActive_<span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#123;</span>
        <span style="color: #002200;">&#91;</span>self.activeImage draw<span style="color: #002200;">&#93;</span>;
        <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>self.showGlow<span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#91;</span>self.glow draw<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#125;</span> <span style="color: #a61390;">else</span> <span style="color: #002200;">&#123;</span>
        <span style="color: #002200;">&#91;</span>super draw<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#125;</span>
<span style="color: #002200;">&#125;</span></pre></div></div>

<p>Where <code>self.glow</code> is a <code>CCSprite</code> attached to the <code>NMPanelMenuItem</code>. </p>
<h2>Adding the Cocos2d Panels</h2>
<p>Next we need to set up some basic options for how much padding we want and what the total width of the panels layer is going to be. Then we add the panels to our scene and set the position.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #6e371a;">// HCUPPanelScene.m</span>
<span style="color: #a61390;">float</span> padding <span style="color: #002200;">=</span> <span style="color: #2400d9;">15</span>;
<span style="color: #a61390;">float</span> totalPanelWidth <span style="color: #002200;">=</span> onePanelWide <span style="color: #002200;">+</span> padding<span style="color: #002200;">*</span><span style="color: #2400d9;">2</span>;
<span style="color: #a61390;">float</span> totalWidth <span style="color: #002200;">=</span> numberOfPages <span style="color: #002200;">*</span> totalPanelWidth; <span style="color: #11740a; font-style: italic;">// totalPanelWidth already includes the padding on both sides</span>
&nbsp;
<span style="color: #a61390;">int</span> currentWorldOffset <span style="color: #002200;">=</span> <span style="color: #2400d9;">0</span>;    <span style="color: #11740a; font-style: italic;">// current world number. </span>
<span style="color: #11740a; font-style: italic;">// int currentWorldOffset = 1; // Try changing to 1 and see what happens</span>
&nbsp;
<span style="color: #002200;">&#91;</span>menu alignItemsHorizontallyWithPadding<span style="color: #002200;">:</span> padding<span style="color: #002200;">*</span><span style="color: #2400d9;">2</span><span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// add our panels layer</span>
<span style="color: #002200;">&#91;</span>panels addChild<span style="color: #002200;">:</span>menu<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>self addChild<span style="color: #002200;">:</span>panels<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// set the position of the menu to the center of the very first panel</span>
menu.position <span style="color: #002200;">=</span> ccpAdd<span style="color: #002200;">&#40;</span>menu.position, ccp<span style="color: #002200;">&#40;</span>totalWidth<span style="color: #002200;">/</span><span style="color: #2400d9;">2</span> <span style="color: #002200;">-</span> totalPanelWidth<span style="color: #002200;">/</span><span style="color: #2400d9;">2</span>, <span style="color: #2400d9;">0</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#41;</span>;</pre></div></div>
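<p>To see what these numbers work out to, here is the same arithmetic as a small Ruby sketch. The padding of 15 and the five pages come from the code above; the panel width of 200 is an assumed value for illustration (the real value comes from the first panel&#8217;s texture), and <code>panel_layout</code> is a hypothetical helper.</p>

```ruby
# Mirrors the layout arithmetic from the Objective-C snippet above.
def panel_layout(one_panel_wide, padding, number_of_pages)
  total_panel_width = one_panel_wide + padding * 2          # panel plus padding on both sides
  total_width = number_of_pages * total_panel_width         # width of the whole strip of panels
  # Horizontal shift applied to the menu so the first panel is centered.
  menu_offset_x = total_width / 2.0 - total_panel_width / 2.0
  { total_panel_width: total_panel_width, total_width: total_width, menu_offset_x: menu_offset_x }
end

layout = panel_layout(200, 15, 5)   # 200 is an assumed panel width
puts layout[:total_panel_width]     # 230
puts layout[:total_width]           # 1150
puts layout[:menu_offset_x]         # 460.0
```

<p>With these values the menu is shifted 460 points to the right, half the strip minus half a panel, which lines up with the comment in the snippet about centering on the very first panel.</p>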

<p>Note that the panels are the visual representation but we haven&#8217;t added in any scrolling dynamics. To do that we need to add a <code>UIScrollView</code>.</p>
<h2>Adding the UIScrollView</h2>
<p>Here we do two things: </p>
<ol>
<li>We add our <code>CocosOverlayScrollView</code>, which is only one panel wide (less than the whole screen). With only this view, we wouldn&#8217;t be notified of touches on the edge of the screen.</li>
<li>We add the <code>TouchDelegatingView</code>, which is full screen. The <code>TouchDelegatingView</code> delegates any touches it receives to our paging scroll view.</li>
</ol>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #6e371a;">// HCUPPanelScene.m</span>
<span style="color: #11740a; font-style: italic;">// Note that we're only concerned with a horizontal iPhone. If your game is</span>
<span style="color: #11740a; font-style: italic;">// vertical, change accordingly</span>
touchDelegatingView <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>TouchDelegatingView alloc<span style="color: #002200;">&#93;</span> initWithFrame<span style="color: #002200;">:</span>CGRectMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">320</span>, <span style="color: #2400d9;">480</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#93;</span>;
scrollView <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CocosOverlayScrollView alloc<span style="color: #002200;">&#93;</span> initWithFrame<span style="color: #002200;">:</span>CGRectMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">320</span>, totalPanelWidth<span style="color: #002200;">&#41;</span>
                                                  numPages<span style="color: #002200;">:</span> numberOfPages
                                                     width<span style="color: #002200;">:</span> totalPanelWidth
                                                     layer<span style="color: #002200;">:</span> panels<span style="color: #002200;">&#93;</span>;
touchDelegatingView.scrollView <span style="color: #002200;">=</span> scrollView;
&nbsp;
<span style="color: #11740a; font-style: italic;">// this is just to pre-set the scroll view to a particular panel</span>
<span style="color: #002200;">&#91;</span>scrollView setContentOffset<span style="color: #002200;">:</span> CGPointMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, currentWorldOffset <span style="color: #002200;">*</span> totalPanelWidth<span style="color: #002200;">&#41;</span> animated<span style="color: #002200;">:</span> <span style="color: #a61390;">NO</span><span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// Add views to cocos2d</span>
<span style="color: #11740a; font-style: italic;">// We called it a TouchDelegatingView, but it actually isn't containing anything at all.</span>
<span style="color: #11740a; font-style: italic;">// In reality it is just taking up any space under our ScrollView and delegating the touches. </span>
<span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CCDirector sharedDirector<span style="color: #002200;">&#93;</span> openGLView<span style="color: #002200;">&#93;</span> addSubview<span style="color: #002200;">:</span>touchDelegatingView<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CCDirector sharedDirector<span style="color: #002200;">&#93;</span> openGLView<span style="color: #002200;">&#93;</span> addSubview<span style="color: #002200;">:</span>scrollView<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #002200;">&#91;</span>scrollView release<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>touchDelegatingView release<span style="color: #002200;">&#93;</span>;</pre></div></div>

<p>You can configure your <code>UIScrollView</code> options by changing the code in <code>CocosOverlayScrollView#initWithFrame:numPages:width:layer:</code>. (Note that this class was originally written by <a href="http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html">Alexander Repty</a>.)</p>
<p>The <code>TouchDelegatingView</code> simply delegates any touches it receives to the <code>CocosOverlayScrollView</code>.</p>
<p>And there you have it! Feel free to <a href="http://github.com/jashmenn/shapes-panels">fork and make any changes to the code</a> and send me a pull request.</p>
<p>What do you think? Have any ideas for cleaning it up? Leave your comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/</feedburner:origLink></item>
		<item>
		<title>a crawler using wget and xargs</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/M18Ci1Uw0Qg/</link>
		<comments>http://eigenjoy.com/2010/09/06/a-crawler-using-wget-and-xargs/#comments</comments>
		<pubDate>Mon, 06 Sep 2010 20:18:14 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[crawling]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=252</guid>
		<description><![CDATA[How long would it take to crawl a billion pages using wget and xargs?
We&#8217;re on a quest to write a scalable web crawler.  Our goal is to build a web crawler that will download a billion pages a week.  We&#8217;ve calculated that to download a billion pages in a week we need to [...]]]></description>
			<content:encoded><![CDATA[<h2>How long would it take to crawl a billion pages using <code>wget</code> and <code>xargs</code>?</h2>
<p>We&#8217;re on a quest to write a scalable web crawler. Our goal is to build a web crawler that will download a billion pages a week. We&#8217;ve calculated that to download a billion pages in a week we need to sustain a rate of <em>1653 pages per second</em>.</p>
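<p>That figure is just a billion pages divided by the number of seconds in a week. A quick sketch of the arithmetic (the constant and method names are only for illustration):</p>

```ruby
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604800 seconds
TARGET_PAGES = 1_000_000_000          # one billion pages

# Sustained rate needed to fetch the given number of pages in the given time.
def pages_per_second(pages, seconds)
  pages / seconds.to_f
end

puts pages_per_second(TARGET_PAGES, SECONDS_PER_WEEK).round   # 1653
```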
<p>The problem with these kinds of numbers is that, unless you are familiar with web-crawling, it is not obvious how fast that really is. How fast  can a simple crawler go? 10 pages per second? A thousand?  </p>
<p>We set out to benchmark the simplest thing that could possibly work: <code>wget</code> and <code>xargs</code>.</p>
<h2>Our Tools</h2>
<p><a href="http://en.wikipedia.org/wiki/Wget"><code>wget</code></a> is a popular tool used for downloading files from the web. It has a flexible set of options and built-in support for crawling.</p>
<p><a href="http://en.wikipedia.org/wiki/Xargs"><code>xargs</code></a> is used to run a command repeatedly over a given set of inputs. In our case, we&#8217;re using a fixed URL list as our input. We use <code>xargs</code> as our &#8220;thread-pool&#8221; (it&#8217;s actually a &#8220;process-pool&#8221;). With the <code>-P &lt;numprocs&gt;</code> option, <code>xargs</code> will run through the input file of URLs, and each <code>wget</code> process will take a URL off the list and run until it finishes the crawl for that domain. The number of concurrent processes is limited by <code>&lt;numprocs&gt;</code>.</p>
<h2>Napkin Calculations</h2>
<p>Before we actually run our jobs, let&#8217;s try to predict the kind of results we&#8217;ll get. I&#8217;m running the jobs on my MacBookPro Intel Core 2 Duo with 4GB RAM. I&#8217;m on my home network where I have AT&amp;T U-Verse with advertised download speed of 18Mbps (mega <em>bits</em> per second).</p>
<p><code>wget</code> measures rate limiting in kilobytes rather than kilobits, so we&#8217;ll use bytes rather than bits:</p>
<pre><code>18Mbps = 2.25 megabytes per second =~ 2300 kilobytes/s
</code></pre>
<p>We&#8217;re just doing rough calculations at this point, so let&#8217;s just guess that the mean size of each page is 10KB. At this page size, the absolute best number we can expect to get is around 230 pages/second before we saturate my connection.</p>
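<p>The napkin math above can be sketched as a tiny shell function. The function name <code>max_pages_per_sec</code> is made up for this post, and it assumes the link is the only bottleneck.</p>

```shell
#!/bin/bash
# max_pages_per_sec MBPS PAGE_KB
# Bandwidth-bound ceiling: megabits/s -> kilobytes/s -> pages/s (rounded down).
max_pages_per_sec() {
  awk -v mbps="$1" -v page_kb="$2" 'BEGIN {
    kb_per_sec = mbps / 8 * 1024      # 18 Mbps -> 2304 KB/s
    printf "%d\n", kb_per_sec / page_kb
  }'
}

max_pages_per_sec 18 10   # prints 230
```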
<h2>Politeness</h2>
<p>While we want our crawler as a system to go as fast as possible, we don&#8217;t want to hit any one server too many times. Not only might we get banned, but it isn&#8217;t kind to the site owners. Many smaller servers can&#8217;t handle the load of a crawler thrown against them at full speed. </p>
<p>So if we want to get to 200+ pages/sec we&#8217;re going to have to have many concurrent connections. <code>wget</code> has a number of options that we can set to be nicer to each individual server. So our strategy will be to crawl many servers concurrently, but only hit a particular server lightly.</p>
<p>Here are a few of the relevant <code>wget</code> options we will set:</p>
<ul>
<li><code>--wait=2</code> and <code>--random-wait</code> &#8211; wait a random amount of time between requests, averaging 2 seconds. The waiting time is for the server&#8217;s benefit but the random time is for ours. Given that we are going to be running a large number of processes in parallel, we&#8217;d rather have them be out of step with each other.</li>
<li><code>--tries=5</code> &#8211; only retry 5 times</li>
<li><code>--timestamping</code> &#8211; If the file exists on disk, send the server a HEAD request and check the <code>Last-Modified</code> header. If the file on disk has a timestamp greater than or equal to the <code>Last-Modified</code> date, we don&#8217;t request the whole page. This extra <code>HEAD</code> request doesn&#8217;t really slow us down because <code>wget</code> will only request it if the file already exists on disk. This is just a little extra protection in case our separate processes start to visit the same sites.</li>
</ul>
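<p>The 2-second wait also puts a floor on how many processes we need. Each process fetches at most one page per wait period, so N processes can deliver at most N divided by the wait in pages per second. Here is a rough sketch of that bound; <code>procs_needed</code> is a hypothetical helper, and it ignores download time, so the real number will be higher.</p>

```shell
#!/bin/bash
# procs_needed TARGET_PAGES_PER_SEC WAIT_SECONDS
# Lower bound on concurrent wget processes when each one waits
# WAIT_SECONDS (on average) between requests.
procs_needed() {
  echo $(( $1 * $2 ))
}

procs_needed 230 2   # prints 460
```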
<h2>DMOZ Sample Set</h2>
<p>For multiple runs of our test we don&#8217;t want to hit one particular server repeatedly. We&#8217;re going to use <a href="http://www.dmoz.org/">DMOZ</a> to get a random sample of URLs to test and use a few commands to extract some random URLs:</p>
<pre><code>mkdir -p data/dmoz
curl -0 http://rdf.dmoz.org/rdf/content.rdf.u8.gz &gt; data/dmoz/dmoz-content.rdf.u8.gz
cd data/dmoz
gunzip dmoz-content.rdf.u8.gz
cat dmoz-content.rdf.u8 | grep http | grep r:resource | \
    grep -o '&lt;link r:resource=['"'"'"][^"'"'"']*['"'"'"]' | \
    sed -e 's/^&lt;link r:resource=["'"'"']//' -e 's/["'"'"']$//' \
    &gt; urls.txt
ruby random-lines.rb urls.txt 300 &gt; random-urls.txt
</code></pre>
<blockquote>
<p>(You can get the <a href="http://gist.github.com/raw/262758/1c981f2c77d32614da8bfcfe36366a19fccfea4a/random-lines.rb"><code>random-lines.rb</code> script here</a>).</p>
</blockquote>
<p>The DMOZ file is around 300MB so this will take a few minutes. The DMOZ RDF file is well formed, so we&#8217;re just using <code>grep</code> and <code>sed</code> to extract the URLs.</p>
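<p>If you would rather not use the Ruby script, the sampling step can probably be done with <code>shuf</code>. This is an alternative sketch, not what we ran: it assumes GNU coreutils is installed (it is not there by default on a Mac), and <code>sample_lines</code> is a name invented here.</p>

```shell
#!/bin/bash
# sample_lines FILE N
# Print N distinct lines chosen at random from FILE (requires GNU shuf).
sample_lines() {
  shuf -n "$2" "$1"
}

# Usage (assumes urls.txt exists):
#   sample_lines urls.txt 300 > random-urls.txt
```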
<h2>Shaping <code>wget</code></h2>
<p>Our <code>wget</code> command is below. You can see we aren&#8217;t trying very hard to access a page that doesn&#8217;t respond quickly (the various <code>timeout</code> options). Also, we&#8217;re only following links 5 levels deep per URL. We are not visiting any &#8220;parent&#8221; pages, that is, we&#8217;re not crawling up any directories. We don&#8217;t want any images or binary files (the <code>reject</code> options) and we don&#8217;t care about invalid SSL certificates (<code>no-check-certificate</code>).</p>
<pre><code>wget \
  --tries=5 \
  --dns-timeout=30 \
  --connect-timeout=5 \
  --read-timeout=5 \
  --timestamping \
  --directory-prefix=data/pages \
  --wait=2 \
  --random-wait \
  --recursive \
  --level=5 \
  --no-parent \
  --no-verbose \
  --reject *.jpg --reject *.gif \
  --reject *.png --reject *.css \
  --reject *.pdf --reject *.bz2 \
  --reject *.gz  --reject *.zip \
  --reject *.mov --reject *.fla \
  --reject *.xml \
  --no-check-certificate
</code></pre>
<h2>DNS</h2>
<p>A nice thing about our setup is that each <code>wget</code> process is assigned to one domain. <code>wget</code> caches the DNS lookup so we only need to make one DNS request per process. A problem with this setup is that <code>wget</code> uses <code>gethostbyname</code> (or <code>getaddrinfo</code> depending on your platform). A quick check on <code>man gethostbyname</code> shows that on my BSD-based Mac <code>gethostbyname</code> is thread-safe, i.e. it is synchronized. The result is that there is going to be some resource starvation when we have hundreds of <code>wget</code> processes all trying to call BIND at the same time.  <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.6331">1</a></p>
<p>We set a DNS timeout of 30 seconds here, but in practice I found that it didn&#8217;t matter much. All of the processes race to grab the DNS lookup lock at the beginning and a large number time out (waiting for the lock), but the requests even out over a couple of minutes.</p>
<h2>xargs</h2>
<p><code>xargs</code> acts as our thread-pool or, more specifically, our process-pool.   </p>
<p><code>-P</code> specifies the number of processes to use. <code>-I &lt;sub&gt;</code> is a substitution parameter. It means &#8220;for each line in <code>STDIN</code> run <code>CMD</code> substituting the current line for <code>&lt;sub&gt;</code>&#8221;. So below we substitute <code>_URL_</code> with the actual URL contained in the <code>URLS_FILE</code>. </p>
<pre><code>cat $URLS_FILE | xargs -P $CRAWLERS -I _URL_ $WGET_CMD _URL_
</code></pre>
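<p>To see the substitution and the process pool without hitting the network, you can swap <code>wget</code> for <code>echo</code>. This is only a dry-run sketch: <code>run_pool</code> and the <code>example</code> URLs are placeholders, and with more than one process the output order is not guaranteed.</p>

```shell
#!/bin/bash
# run_pool NUMPROCS
# Read one URL per line from stdin and run a stand-in command for each,
# with at most NUMPROCS commands running at once.
run_pool() {
  xargs -P "$1" -I _URL_ echo "fetch _URL_"
}

printf '%s\n' http://a.example/ http://b.example/ http://c.example/ | run_pool 2
```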
<h2>Code</h2>
<p>Our crawler script looks like this:</p>
<pre><code>#!/bin/bash
# a basic crawler in bash
# usage: crawl.sh urlfile.txt &lt;num-procs&gt;
URLS_FILE=$1
CRAWLERS=$2

mkdir -p data/pages

WGET_CMD="wget \
  --tries=5 \
  --dns-timeout=30 \
  --connect-timeout=5 \
  --read-timeout=5 \
  --timestamping \
  --directory-prefix=data/pages \
  --wait=2 \
  --random-wait \
  --recursive \
  --level=5 \
  --no-parent \
  --no-verbose \
  --reject *.jpg --reject *.gif \
  --reject *.png --reject *.css \
  --reject *.pdf --reject *.bz2 \
  --reject *.gz  --reject *.zip \
  --reject *.mov --reject *.fla \
  --reject *.xml \
  --no-check-certificate"

cat $URLS_FILE | xargs  -P $CRAWLERS -I _URL_ $WGET_CMD _URL_
</code></pre>
<p>I&#8217;ve put this <a href="http://github.com/jashmenn/bashpider">code on github</a> with a <code>Rakefile</code> so you can follow along.</p>
<pre><code>git clone git://github.com/jashmenn/bashpider.git
cd bashpider
rake data:get_urls # downloads and parses DMOZ, will take a while
rake crawl:restart # this will run a crawl
</code></pre>
<p>If you want to monitor the downloads per second, in another window type the following:</p>
<pre><code>rake crawl:watch
</code></pre>
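<p>If you are not using the <code>Rakefile</code>, a rough equivalent is to count the files under <code>data/pages</code> twice and divide by the interval. This is a sketch and not what <code>rake crawl:watch</code> actually does; <code>pages_per_sec</code> is a made-up helper.</p>

```shell
#!/bin/bash
# pages_per_sec DIR INTERVAL
# Files added under DIR per second, measured over INTERVAL seconds.
pages_per_sec() {
  local before after
  before=$(find "$1" -type f | wc -l)
  sleep "$2"
  after=$(find "$1" -type f | wc -l)
  awk -v a="$after" -v b="$before" -v s="$2" \
    'BEGIN { printf "%.1f\n", (a - b) / s }'
}

# Usage while a crawl is running:
#   pages_per_sec data/pages 10
```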
<p>When you feel you&#8217;ve gathered enough data, <code>CTRL-C</code> to kill both windows and then type:</p>
<pre><code>rake results:process
</code></pre>
<h2>Results at Home</h2>
<p>As you can see from the chart, on my home computer through U-Verse we max out at about 150 agents at 27 pages/sec, far below our original estimate of roughly 230 pages/sec.</p>
<p>
<img src="http://www.xcombinator.com/wp-content/uploads/2010/09/wget-pages-per-second-500.png" alt="wget-pages-per-second" title="wget-pages-per-second" width="500" height="335" class="aligncenter size-full wp-image-255" />
</p>
<pre><code>procs pages/sec
10     3.9
25     8.9
50    15.8
75    19.7
100   25.8
125   26.1
150   27.3
175   22.3
200    6.0
</code></pre>
<p>First of all, our initial estimate of 10KB per page was too low. In reality we observed a mean page size of around 37KB. This means on our 2300KB/s connection we can only expect a best-case download rate of 62 pages/sec. Still, 27 pages/sec is not even half that speed. </p>
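<p>Put another way, we can estimate how much of the link we were actually using. A small sketch, using the rounded numbers from this section (<code>link_utilization</code> is a hypothetical helper):</p>

```shell
#!/bin/bash
# link_utilization PAGES_PER_SEC PAGE_KB LINK_KB_PER_SEC
# Percentage of the link used by page downloads, rounded to a whole number.
link_utilization() {
  awk -v r="$1" -v p="$2" -v l="$3" \
    'BEGIN { printf "%.0f\n", r * p / l * 100 }'
}

link_utilization 27 37 2300   # prints 43
```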
<p>The other problem could be DNS requests. My home router also serves as my DNS server. It&#8217;s good enough for home use, but I&#8217;m pretty sure it&#8217;s not up to this task. Come to think of it, I&#8217;m not even sure how fast this CAT5 cable is.  </p>
<p>I think it&#8217;s time to try out this setup in a better environment.</p>
<h2>In the Data Center</h2>
<p>For this experiment we loaded our script onto a beefy 8-core machine with a fat bandwidth connection. </p>
<p>The results were much better:</p>
<p>
<img src="http://www.xcombinator.com/wp-content/uploads/2010/09/wget-pages-per-second-datacenter-500.png" alt="wget-pages-per-second-datacenter" title="wget-pages-per-second-datacenter" width="500" height="334" class="aligncenter size-full wp-image-257" /></p>
<pre><code>procs  pages/sec
  150   54
  200   71
  300  107
  400  141
  500  178
  600  214
  700  244
  800  386
  900  327
 1000  203
 1100  222
 1200  392
 1300  202
 1500  249
 1600  485
 2000  577
 3000  679
 3500  459
 4000  336
</code></pre>
<blockquote>
<p>Take these numbers as rough estimates. For each of these entries I only let them run for a few minutes.</p>
</blockquote>
<h3>Processes</h3>
<p>When I started getting into the thousands of processes, I wondered if I would hit the user process limit. <a href="http://yakkstr.com/users/ddonnell">Jay Donnell</a> pointed out to me that <code>ulimit</code> will give the process limit:</p>
<pre><code> ulimit -a
 max user processes              (-u) 268287
</code></pre>
<p>So with 260k+ processes available, we have no problem there.</p>
<h3>Files</h3>
<p>With <code>wget</code>, each page gets its own file, which is uncompressed. So we&#8217;ve got a lot of disk IO going on. We&#8217;d probably save a good amount of time if each process just opened one file and appended content to it. We&#8217;d also save the file system from creating hundreds of thousands of inodes.</p>
<p>Also, the decline we see around 3000 agents may be due, in part, to the max number of open files on our system:</p>
<pre><code>$ ulimit -a
open files                      (-n) 1024
</code></pre>
<p>Each crawler waits an average of 2 seconds before making the next request, at which time it makes the request and then downloads the file. So each process creates one file every 2 or more seconds. This means the number of <code>wget</code> processes we can run should be at least twice the max number of open files (2048). </p>
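<p>Before blaming file descriptors, it is worth looking at the soft and hard limits separately. Keep in mind that the <code>-n</code> limit is applied per process rather than across the whole system. The sketch below only reads the limits; the commented-out line shows how you might raise the soft limit, and the value 4096 is just an example.</p>

```shell
#!/bin/bash
# Show the soft and hard limits on open files for this shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

# To raise the soft limit (up to the hard limit) for this shell and
# the processes it starts, something like:
#   ulimit -Sn 4096
```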
<p>My theory is this: at around 3000 concurrent agents, the time it takes to actually download the content spreads the writes out enough that we will probably still have enough file descriptors available.  However, once we have 4000 concurrent agents, the probability that many agents will need to write a file at the same time is much higher, so we see a significant performance drop.</p>
<p>I think we&#8217;re going to need to look at the file system format. Currently we&#8217;re using ext3, but I&#8217;m not sure if we should switch to xfs. While monitoring the file count using <code>find</code> I kept getting the following error: </p>
<pre><code>find: WARNING: Hard link count is wrong for &lt;some file&gt;: this may be a bug
    in your filesystem driver.  Automatically turning on find's -noleaf option.
    Earlier results may have failed to include directories that should have been searched.
</code></pre>
<p>Also <code>kjournald</code> seemed to be working very hard to keep up with all the file writes. I&#8217;m not sure if this is unavoidable or not. I&#8217;m going to leave this problem for future work.</p>
<h2>Summary</h2>
<p>This crawler is just a baseline to see what performance is possible with basic Unix utilities. Obviously, this approach used a static list of URLs, and in a &#8220;real&#8221; crawler you probably want to have a mechanism for communicating and prioritizing URLs throughout the system.</p>
<p>That said, if you already know the list of URLs you want to download, you could download tens of millions of pages over a 24-hour window. For instance, if we assume a sustained rate of 600 pages per second you could download <em>51.8 million</em> pages in 24 hours.</p>
<p>So how long would it take to download a billion pages with <code>xargs</code> and <code>wget</code>?</p>
<p>If you had the list of URLs beforehand, according to these numbers it would take <em>19 days</em>.</p>
<p>To download a billion pages in a week we&#8217;re going to need to figure out a way to download at least 1000 more pages per second. </p>
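<p>The arithmetic behind those figures, as a short sketch. <code>days_to_crawl</code> is a name invented here, and it assumes the rate is sustained around the clock.</p>

```shell
#!/bin/bash
# days_to_crawl PAGES PAGES_PER_SEC
# Days needed to fetch PAGES at a sustained rate, to one decimal place.
days_to_crawl() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r / 86400 }'
}

days_to_crawl 1000000000 600    # prints 19.3
days_to_crawl 1000000000 1653   # prints 7.0
echo $(( 1653 - 600 ))          # prints 1053, the shortfall in pages/sec
```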
<p>What we&#8217;ve learned:</p>
<ul>
<li>watch out for file limits</li>
<li>append to a single file rather than creating thousands of tiny files</li>
<li>run your own non-locking DNS server</li>
<li>unix tools are handy and powerful</li>
</ul>
<p>What do you think?</p>
<p>Any suggestions for cranking out more performance out of <code>wget</code>? Should I try increasing my open file limit and see what happens? Think these numbers are ridiculous? Leave a comment below!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/06/a-crawler-using-wget-and-xargs/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/06/a-crawler-using-wget-and-xargs/</feedburner:origLink></item>
		<item>
		<title>index and working tree do not reflect changes that are now in HEAD</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/hHW6acJKpxU/</link>
		<comments>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/#comments</comments>
		<pubDate>Sat, 04 Sep 2010 17:30:54 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=296</guid>
		<description><![CDATA[After my recent git class someone asked this question:

 I was trying some remote git repo tests last night, and when I went to push my changes, I received this warning message:
warning: updating the currently checked out branch; this may cause confusion, as the index and working tree do not reflect changes that are now [...]]]></description>
			<content:encoded><![CDATA[<p>After my recent git class someone asked this question:</p>
<blockquote><p>
 I was trying some remote git repo tests last night, and when I went to push my changes, I received this warning message:</p>
<p><code>warning: updating the currently checked out branch; this may cause confusion, as the index and working tree do not reflect changes that are now in HEAD.</code></p>
<p>On remote server I created a directory and did &#8220;<code>git init</code>&#8221;. Then cloned it from my local machine, did changes, committed, and then push. All seemed straightforward there. Any thoughts?</p>
</blockquote>
<p>The issue git is warning you about is that your remote has both the <em>repository</em> and a <em>working copy</em>. That is, on the remote server you have a directory <code>project/</code> with files in it (the working copy) and the folder <code>project/.git</code> (the repository).</p>
<p>If you push from your local machine to the remote, you will only be updating files in the repository and <strong>not</strong> the working copy. That is, the non-git files will not be changed. This can be confusing because you might log into the remote after you push and expect the working copy to be different.</p>
<p>To deal with this possible confusion <code>git init</code> provides a <code>--bare</code> option. What this does is create the repository only (no working copy).  You can then <code>push</code> and <code>pull</code> from the remote like you might a central svn server.</p>
<p>Let me show an example. Say I have an existing git repository on my local machine and I want to create a new remote to back it up. My workflow would look like this:<br />
<code><br />
ssh me@myserver.com<br />
mkdir ~/git/newproject.git<br />
cd ~/git/newproject.git<br />
git init --bare<br />
exit<br />
git remote add myserver me@myserver.com:git/newproject.git<br />
git push myserver master</code></p>
<p>If you want, you could even chain these together as a single command:</p>
<p><code>ssh me@myserver.com "mkdir ~/git/newproject.git &#038;&#038; cd ~/git/newproject.git &#038;&#038; git init --bare" &#038;&#038; git remote add myserver me@myserver.com:git/newproject.git</code></p>
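<p>If you want to convince yourself that a bare repository really has no working copy, you can try it locally first. This is a throwaway sketch in a temp directory; nothing here touches a real server, and the directory name is just an example.</p>

```shell
#!/bin/bash
# Create a bare repository in a scratch directory and inspect it.
demo=$(mktemp -d)
git init --quiet --bare "$demo/newproject.git"

# Ask git whether this repository is bare.
git -C "$demo/newproject.git" rev-parse --is-bare-repository

# The directory holds git's own files (HEAD, config, objects, refs, ...)
# directly, with no checked-out project files and no nested .git folder.
ls "$demo/newproject.git"
```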
<p>Hope this helps!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/</feedburner:origLink></item>
		<item>
		<title>Getting Cascading to Read Sequence Files Created Somewhere Else</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/KDCaQpKv5ZI/</link>
		<comments>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 20:54:09 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=291</guid>
		<description><![CDATA[Sometimes you can&#8217;t control where your data comes from or how it&#8217;s formatted. For instance, where I work a lot of data is stored in SequenceFiles. Unfortunately, the files are not taking advantage of the typing SequenceFiles provide and instead each record is a single field containing a delimited string.
I like to use Cascading (or cascalog) for [...]]]></description>
			<content:encoded><![CDATA[<p><em>Sometimes you can&#8217;t control</em> where your data comes from or how it&#8217;s formatted. For instance, where I work a lot of data is stored in <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html"><code>SequenceFile</code>s</a>. Unfortunately, the files are not taking advantage of the typing <code>SequenceFile</code>s provide and instead each record is a single field containing a delimited string.</p>
<p>I like to use Cascading (or cascalog) for my Hadoop jobs, but out of the box Cascading doesn&#8217;t support using <code>SequenceFile</code>s that were created outside of Cascading. That is to say, Cascading requires that your <code>SequenceFile</code> values be instances of <code>Tuple</code>.</p>
<p>The solution is to create your own <code>Scheme</code> that parses a <code>SequenceFile</code> according to your own format. In my case I just want to parse each record as a list of strings.</p>
<p>The code is simple but may not be obvious for a first-time Cascading user. I hope this will save someone a few minutes.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    <span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tap.Tap</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuple</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.TupleEntry</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuples</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.scheme.SequenceFile</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.JobConf</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.OutputCollector</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.SequenceFileInputFormat</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.SequenceFileOutputFormat</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/**
     * A SequenceFileAsText is a type of {@link SequenceFile}, however the
     * SequenceFile has been created outside of Cascading and is assumed to have a
     * value of a string.
     */</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> SequenceFileAsText <span style="color: #000000; font-weight: bold;">extends</span> SequenceFile
      <span style="color: #009900;">&#123;</span>
      <span style="color: #008000; font-style: italic; font-weight: bold;">/** Field serialVersionUID */</span>
      <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">long</span> serialVersionUID <span style="color: #339933;">=</span> 1L<span style="color: #339933;">;</span>
&nbsp;
      <span style="color: #008000; font-style: italic; font-weight: bold;">/** Protected for use by TempDfs and other subclasses. Not for general consumption. */</span>
      <span style="color: #000000; font-weight: bold;">protected</span> SequenceFileAsText<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">null</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
      <span style="color: #008000; font-style: italic; font-weight: bold;">/**
       * Creates a new SequenceFileAsText instance that stores the given field names.
       *
       * @param fields
       */</span>
      <span style="color: #000000; font-weight: bold;">public</span> SequenceFileAsText<span style="color: #009900;">&#40;</span> Fields fields <span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span> fields <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
      @Override
      <span style="color: #000000; font-weight: bold;">public</span> Tuple source<span style="color: #009900;">&#40;</span> <span style="color: #003399;">Object</span> key, <span style="color: #003399;">Object</span> value <span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #000000; font-weight: bold;">instanceof</span> Tuple<span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #009900;">&#40;</span>Tuple<span style="color: #009900;">&#41;</span> value<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #000000; font-weight: bold;">instanceof</span> <span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span> value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span>.<span style="color: #006633;">valueOf</span><span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span><span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/</feedburner:origLink></item>
		<item>
		<title>git cheatsheet and class notes</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/6hiiyuA9CI0/</link>
		<comments>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 19:27:48 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=274</guid>
		<description><![CDATA[I recently gave a talk at work about git. I created a cheat sheet based on <a href="http://clojure.org/cheatsheet">Steve Tayon's Clojure Cheatsheet</a>. 

[caption id="attachment_276" align="center" width="496" caption="Git Cheat Sheet Preview"]<a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/git-cheat-sheet-preview.jpg" alt="Git Cheat Sheet Preview" title="Git Cheat Sheet Preview" width="496" height="347" class="size-full wp-image-276" /></a>[/caption]

I realize there are a <a href="http://zrusin.blogspot.com/2007/09/git-cheat-sheet.html">number</a> <a href="http://github.com/guides/git-cheat-sheet">of</a> <a href="http://cheat.errtheblog.com/s/git">cheatsheets</a> for git already. However, I wanted a simple, one-page sheet specifically for my attendees. 

You can download it here:
<ul>
	<li><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf">git cheatsheet pdf</a></li>
	<li><a href="http://github.com/jashmenn/talks/raw/master/git/cheat-sheet/git-class-cheat-sheet.tex">git cheatsheet LaTeX source</a></li>
</ul>

You can find the raw notes of my talk after the jump.




 
]]></description>
			<content:encoded><![CDATA[<p>I recently gave a talk at work about git. I created a cheatsheet based on <a href="http://clojure.org/cheatsheet">Steve Tayon&#8217;s Clojure Cheatsheet</a>. </p>
<div id="attachment_276" class="wp-caption center" style="width: 506px"><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/git-cheat-sheet-preview.jpg" alt="Git Cheat Sheet Preview" title="Git Cheat Sheet Preview" width="496" height="347" class="size-full wp-image-276" /></a><p class="wp-caption-text">Git Cheat Sheet Preview</p></div>
<p>I realize there are a <a href="http://zrusin.blogspot.com/2007/09/git-cheat-sheet.html">number</a> <a href="http://github.com/guides/git-cheat-sheet">of</a> <a href="http://cheat.errtheblog.com/s/git">cheatsheets</a> for git already. However, I wanted a simple, one-page sheet specifically for my attendees. </p>
<p>You can download it here:</p>
<ul>
<li><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf">git cheatsheet pdf</a></li>
<li><a href="http://github.com/jashmenn/talks/raw/master/git/cheat-sheet/git-class-cheat-sheet.tex">git cheatsheet LaTeX source</a></li>
</ul>
<p>Like it? Hate it? Find a typo? <a href="http://www.xcombinator.com/2010/09/01/git-cheat-sheet-and-class-notes/#comments">Leave your feedback in the comments!</a></p>
<hr/>
<p>Here are my raw notes from the talk:<br />
<code></p>
<pre>
;; -*- mode: Markdown; -*-

# How to read:
commands are indented
actions to perform while presenting are marked with @
left to right

# Welcome
see progit.org
what is version control

why use it:

  * backup/restore
  * synchronization sharing
  * track changes
  * ownership
  * branching and merging

who has used subversion 

git
  * you've heard it's distributed
  * b/c branching and merging

pace - slow, no slides

leave with practical understanding

# Install &amp; Config

    sudo port install git-core +svn
    git config --global user.name "Nate Murray"
    git config --global user.email "nate@natemurray.com"

# Basic Commands

    cd ~
    mkdir -p projects/demo       # explain only a little
    cd projects/demo
    git init
    git status                   # nothing here
    ls -a                        # talk .git repository vs. working copy
    echo "version 1" > README.txt
    git status                   # untracked file
    git add README.txt
    git status                   # changes to be committed
    git commit -m "added version one of the file"
    git status                   # clean

stop, draw the picture of the local operation phases - e.g. svn vs. git

> Principle 1: (almost) everything is local

so now that you know about the staging area, let's do it again

    echo "new file" > sheep.rb
    git status                   # draw untracked
    git add sheep.rb
    git status                   # draw staged
    git commit -m "added"

    cat README.txt                 # draw unmodified
    echo "version 2" > README.txt
    git status                   # draw modified
    git commit -a -m "updated version" # -a stages modified tracked files, then commits
    git status

Tips:

    git config --global alias.st status
    git st

# Git Internals

* Before we can talk about branching you *have* to understand how git stores data (tried to avoid this)
* files and folders

three objects -  @ Draw first commit

  * blob        - raw data
  * tree        - folder (stores blobs and trees)
  * commit      - snapshot of the repo + meta 

You won't need to use `git cat-file` on a daily basis. However, understanding
the concepts we're going to talk about is really important for branching.

    git log # view the log
    git show ----  # first commit, whatever it is

    git cat-file -p  ---- # first commit
    git cat-file -p  ---- # tree
    git cat-file -p  ---- # blob

draw the rest using `git cat-file`

    git log           # show the log again
    git cat-file -p ---- # second commit

draw the picture. point out the parent connection.
note committer / author

    git cat-file -p ---- # tree

note here there are two blobs!

finish drawing out the second commit
* git stores reference to first file.
* snapshot of the *whole project*
* git stores each file once
* filename is in the `tree` 

draw the last commit

     git log
     git cat-file -p ---- # third commit

> Principle #2 : Git commits are snapshots

* A commit in git is a snapshot of the entire project, not just a list of diffs.
* each snapshot is identified by a SHA-1 hash of its content, which guarantees integrity

# refs/branches

questions?

@ stop. redraw commits as *linear* . looking only at commits

ready to define a branch
a branch is a pointer to a commit
text file with a sha. that's it. 

start with one branch called `master`

    git branch

bash prompt

    # skip this
    tree .git/refs/
    cat .git/refs/heads/master
    git log
    # compare the SHAs

update diagram by adding a `ref` to our commit. (`master`). 

@ draw circle pointing to commit

create testing branch

# branching

So let's create another branch:

    git branch testing
    git branch

only created, didn't switch. just created a ref pointing to this
commit

@ update diagram

How does git know what branch we are "on"?

special ref called `HEAD` that points to the branch we currently have checked out
since we are still on master HEAD points to master

@ add HEAD

To switch working copy, use `git checkout`

    git checkout testing
    git branch

HEAD moves from `master` to `testing`

@ update diagram

master and testing point to the same commit, working directory isn't changed

checkout means something different in git than it does in svn.
in git, checkout switches our working directory to a particular commit. 

now make changes:

    cat README.txt
    echo "we are on the testing branch!" > README.txt
    cat README.txt
    git commit -a -m "updated the readme"
    git log

@ update diagram, adding new commit. move the testing ref and the HEAD ref with it

add a "test"

    echo "this is a test" > test.rb
    git add test.rb                    # stage it for our commit
    git commit -m "added a test"       # now commit
    git log

@ update diagram - should have two commits

hotfix - scenario: you need to switch back to master

    git checkout master
    ls

@ move HEAD

so notice two things.
1) switching to this branch was fast - everything is local
2) our file test.rb is absent

and if we

    cat README.txt

it says 'version 2' just like we would expect

    echo "applying fix" >> sheep.rb
    cat sheep.rb
    git commit -a -m "applied important fix"
    git log
    git cat-file -p ---- # last commit

@ draw the new commit, and draw its reference back to the parent. move HEAD and master

now fixed, can push into production
and get back to work in `testing`

    git checkout testing
    cat README.txt
    cat test.rb

This is a general pattern:

> Principle #3: Branching is cheap, use it often

If you are working on a particular feature, create a branch. 

If you're coming from svn, making frequent branches might seem unnatural.
in svn, a branch is global -> namespace issues.
vs. git: private branches
name your branch 'test' and it won't collide with anyone else's

But branching itself isn't that useful unless it's easy to merge.

* how many of you have merged a branch in svn?
* how many of you enjoyed it?

merging is one of git's strengths, and git makes it relatively easy

# merging

    cat sheep.rb

two branches: `master` and `testing` - need to merge

    git checkout master
    git merge testing
    git show HEAD

instead of a 'parent' we have a line that says 'merge'
a merge commit has more than one parent

@ draw the commit object
@ draw lines to the commits

    gitx

sometimes merging doesn't go as planned - conflicts

    git checkout -b breaker

this is shorthand for create and then checkout a new branch based on the
current HEAD

    vi sheep.rb # changing fix
    git commit -a -m "changed the fix"
    git checkout master
    vi sheep.rb # improving fix
    git commit -a -m "improved the fix"

@(update diagram, adding breaker and master refs)

    git merge breaker
    git status

there are many diff viewing tools.
* perforce
* opendiff - from apple

    git mergetool -t opendiff

I don't really like using the visual tools.
Sometimes you need character level editing

    vi sheep.rb
    git add sheep.rb
    git commit -a

talk about merge with conflicts

@ update diagram draw new merge commit

    gitx

Questions?

# Remotes

Everything so far on one machine. 

I work offline (I take the train)
If I break something I can roll back and see where I was an hour ago 

want to share our changes.
might seem scary or messy because changes are made on totally independent lines of the code.
but in practice it's not a problem.

svn version numbers are incremental - so two repos would get out of
step
no easy way of merging two separate repositories. 

git blob identifiers are a SHA of the content.
if the same content is created anywhere in the universe you'll still
have the same SHA

git doesn't care about where your commits come from or how you get them

Protocols:
  * ssh
  * git
  * http
  * local file system

sample project on our github

    cd ..
    open http://XXX/nmurray/simple-echo
    git clone git@XXX:nmurray/simple-echo.git
    cd simple-echo
    git log

svn checkout just HEAD
vs. git - whole repo

To be able to collaborate with others you have to manage 'remote repositories'.
When you clone a project, you have a default remote called 'origin'. 

    git remote -v

Remotes are pointers to other repositories that are _usually_ over the network.
'pull' and 'push' changes.

    vi README.mkd
    # make a change
    git commit -a -m "make a change"
    git push

If someone else makes a change:

    git pull origin master

This means pull from `origin` the branch `master` into the current local branch (here `master`). You can often just run

    git pull

which means pull from origin whatever branch I'm on (i.e. HEAD) into this branch.

Now let's say someone pushes a change and I make a change
I can't push unless I pull first. This is good.

# remote forks

So that is while we are on the same line. What if we're on different lines?

@(open up webbrowser again)

Bh has also forked my project. But when we say forked, all that means is he has
created his own development line from some of my commits

    git remote add bh git@XXX:bhenderson/simple-echo.git
    git remote -v

Now you shouldn't be surprised to learn that adding the remote doesn't change
anything. First we have to `fetch` his changes

    git fetch bh

`fetch` brings his commits into my repo but again, doesn't change my working copy.

fetch brought branches + commits into repo
work with those branches just like any other branch.

    git branch -a

So you see here we have 

* `master`, which is our local master
* at the bottom we have `origin/master`, which is the `master` branch of `origin`, the remote we cloned from
* and then we have `bh/master`, which is bhenderson's master branch

These are all regular branches: they are just pointers to commits. We
can even check one out like any other branch 

    git checkout bh/master

scary message

    git checkout master

So how would we merge bhenderson's changes with our own? I'm sure you could guess by now. Simply:

    git merge bh/master # don't press enter!!!

But let's take it up a notch.
say you didn't want to merge bh's changes into your master branch.
real world, you might not know if his changes would merge cleanly
don't want to mess up your master branch.  

What we are going to do is
* create a new branch,
* merge bh's branch into THAT branch
* then we're going to merge to master.

It will make more sense when we do it. Let's try:

Okay we first want to create a new branch based on our master

    git checkout -b bh-merge
    git branch -a 

Now let's merge his changes

    cat simple-echo.rb
    git merge bh/master
    cat simple-echo.rb
    git log                  # see bh as the author of the commit

okay everything was clean! *phew* now let's go back to master

    git checkout master
    git merge bh-merge
    git log

and there we go! merged nicely.
now I don't need bhenderson's merge branch anymore, so let's delete it

    git branch -d bh-merge
    git branch -a

git is distributed

Instead of one central server, that everyone has to sync to,
* independent lines of work can go on.
* If someone creates something good in their branch, they just tell people about it.
* permission-less 

you can see why it is so good for open-source development

questions about branching?

# Advanced

* tagging
* rebase
* cherry pick
* git bisect
* hooks
* tracking branches
* submodules
* interactive staging
* squashing commits
* git-svn
* setting up your own server
* patches via email
* gitjour
</pre>
<p></code></p>
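<p>The notes above say that git blob identifiers are a SHA of the content, so the same content gets the same ID on any machine. Here is a minimal Python sketch of that idea, assuming a repository that uses git's default SHA-1 object format. The function name <code>git_blob_id</code> is mine, not part of git.</p>

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # README.txt from the demo contains "version 1" plus a newline.
    print(git_blob_id(b"version 1\n"))
```

<p>In the demo repository the printed ID should match what <code>git hash-object README.txt</code> reports for the first version of the file. The filename plays no part in the hash, which is why the name lives in the tree rather than in the blob.</p>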
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/</feedburner:origLink></item>
		<item>
		<title>Desirable Properties for a Web Crawler</title>
		<link>http://feedproxy.google.com/~r/xcombinator/~3/CFQvAMR13ws/</link>
		<comments>http://eigenjoy.com/2010/09/01/desirable-properties-for-a-web-crawler/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 15:15:13 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[crawling]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=247</guid>
		<description><![CDATA[I aim to build a web crawler that can download a billion pages in a week.
Below are some desirable properties any web crawler should have:
Scalability
The web is enormous and continually growing. A crawler should scale
linearly with the number of agent-machines that are added to the
system. This allows us to add more agents as our needs [...]]]></description>
			<content:encoded><![CDATA[<p>I aim to build a web crawler that can download a billion pages in a week.<br />
Below are some desirable properties any web crawler should have:</p>
<h2>Scalability</h2>
<p>The web is enormous and continually growing. A crawler should scale<br />
linearly with the number of agent-machines that are added to the<br />
system. This allows us to add more agents as our needs increase.</p>
<h2>Speed</h2>
<p>Speed is a significant issue at this scale. For example, if we want to crawl 1<br />
billion pages in a week (this is less than 1/1000th of the web), our system<br />
will have to sustain a rate of 1653 downloads per second.</p>
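<p>The rate above is just the page count divided by the number of seconds in a week. A quick Python check of the arithmetic (the names <code>SECONDS_PER_WEEK</code> and <code>required_rate</code> are only for illustration):</p>

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def required_rate(pages: int, seconds: int = SECONDS_PER_WEEK) -> float:
    """Average downloads per second needed to fetch `pages` in `seconds`."""
    return pages / seconds


if __name__ == "__main__":
    rate = required_rate(1_000_000_000)
    print(f"{rate:.1f} downloads per second")  # prints "1653.4 downloads per second"
```

<p>This is an average. Any downtime or slow period during the week has to be made up by a higher rate elsewhere.</p>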
<p>To achieve this speed we need to employ a number of techniques such as<br />
concurrent connections, data compression, DNS caching, and minimizing disk<br />
seeks.</p>
<h2>Politeness</h2>
<p>While we need a high rate of download, we must be <em>polite</em> and not<br />
overload one particular server. Najork et al. propose limiting requests to<br />
a single server by waiting 10 times the time it took to download the<br />
last page. <a href="http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.html">1</a></p>
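<p>One way to apply that rule is to remember, per host, the earliest time the next request may go out. The sketch below is my own illustration, not code from the paper. It assumes a "server" is identified by the URL's hostname, and the class and method names are hypothetical.</p>

```python
import time
from urllib.parse import urlsplit

POLITENESS_FACTOR = 10  # wait 10x the last download time, per the rule above


class PolitenessTracker:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, factor=POLITENESS_FACTOR, clock=time.monotonic):
        self.factor = factor
        self.clock = clock  # injectable so the tracker can be tested
        self.next_allowed = {}  # host -> earliest permitted request time

    def record_download(self, url, download_seconds):
        """Call after a fetch completes; schedules the host's next slot."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = self.clock() + self.factor * download_seconds

    def seconds_until_allowed(self, url):
        """How long to wait before this URL's host may be contacted again."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())
```

<p>A download loop would call <code>seconds_until_allowed</code> before each fetch and either sleep or pick a URL from a different host. A page that took half a second to download keeps its host off-limits for the next five seconds.</p>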
<h2>Quality</h2>
<p>We aim to build a crawler that visits &#8220;high-quality&#8221; or &#8220;relevant&#8221; pages<br />
first, as judged by a well-known quality metric. However, Najork et al. found that a simple<br />
breadth-first crawl tends to visit high-quality pages first <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6162&amp;rep=rep1&amp;type=pdf">2</a>.<br />
We&#8217;ll eventually build a more intelligent page-selection mechanism, but for now<br />
breadth-first will work.</p>
<h2>Agents as a Distributed System</h2>
<p>Given that the crawling problem cannot be solved by a single machine,<br />
we are required to form our solution as a distributed<br />
system. Distributed systems introduce more room for failures<br />
and errors in coordination. Therefore we define the following desirable<br />
features for our distributed system:</p>
<h3>Fault Tolerance</h3>
<p>Hardware failure is unavoidable. The failure of one node should not<br />
prevent survivors from continuing to operate.</p>
<h3>Even Partitioning</h3>
<p>The URL frontier should be evenly distributed across all agents to<br />
evenly assign the work. Many crawlers use a hashing function to distribute<br />
the URLs among machines.</p>
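<p>As a rough sketch of hash-based partitioning, the Python function below maps a URL to one of <code>num_agents</code> agents. Hashing the hostname rather than the full URL is my assumption: it keeps every URL for a given server on the same agent, which makes per-server politeness easier to enforce. The function name is hypothetical.</p>

```python
import hashlib
from urllib.parse import urlsplit


def agent_for_url(url: str, num_agents: int) -> int:
    """Map a URL to an agent index in range(num_agents) by hashing its host."""
    host = urlsplit(url).hostname or ""
    # hashlib gives the same digest on every machine; Python's built-in
    # hash() is randomized per process for strings, so it is unsuitable here.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_agents
```

<p>Plain modulo hashing has a known weakness: changing <code>num_agents</code> reassigns most hosts to a different agent. Schemes such as consistent hashing reduce that reshuffling, which matters when agents join or leave during a crawl.</p>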
<h3>Minimize Overlap</h3>
<p>Overlap is defined as <code>(n-u)/u</code>, where <code>n</code> is the total number of<br />
crawled pages and <code>u</code> is the number of <em>unique</em> pages (sometimes <code>u &lt; n</code><br />
because the same page has been erroneously fetched several times).<br />
Optimally, we want an overlap of 0. <a href="http://vigna.dsi.unimi.it/ftp/papers/UbiCrawler.pdf">3</a></p>
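<p>For concreteness, here is the overlap formula as a small Python function. It assumes two fetches count as the same page when their URL strings are identical; a real crawler would normalize URLs first.</p>

```python
def overlap(fetched_urls):
    """Overlap = (n - u) / u for a list of fetched page URLs."""
    n = len(fetched_urls)       # total number of pages crawled
    u = len(set(fetched_urls))  # number of unique pages among them
    if u == 0:
        raise ValueError("overlap is undefined when nothing has been crawled")
    return (n - u) / u


if __name__ == "__main__":
    # Three fetches, two unique pages: (3 - 2) / 2
    print(overlap(["/a", "/b", "/a"]))  # prints 0.5
```

<p>An overlap of 0.5 here means that for every two unique pages, one extra redundant fetch was made.</p>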
<h3>Agent churn</h3>
<p>During the crawl we may want to add additional resources. The system<br />
should support agents joining and leaving the group. </p>
<h2>Next Steps</h2>
<p>There are five major parts to a web crawler:</p>
<ul>
<li>The URL frontier</li>
<li>IP address lookup</li>
<li>Page download</li>
<li>Page processing</li>
<li>Tracking URLs encountered</li>
</ul>
<p>Over the next few articles we will be designing each of the five<br />
components. The list of desirable features gives us guidelines<br />
that help shape the decisions about each component.</p>
<ol>
<li>Najork M. High-Performance Web Crawling. Systems Research. 2001.</li>
<li>Najork M, Wiener JL. Breadth-First Search Crawling Yields High-Quality Pages. Systems Research. 2001:114-118.</li>
<li>Boldi P, Codenotti B, Santini M, Vigna S. UbiCrawler: a scalable fully distributed Web crawler. Software: Practice and Experience. 2004;34(8):711-726. </li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/01/desirable-properties-for-a-web-crawler/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://eigenjoy.com/2010/09/01/desirable-properties-for-a-web-crawler/</feedburner:origLink></item>
	</channel>
</rss>

