<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>atbrox</title>
	
	<link>http://atbrox.com</link>
	<description />
	<lastBuildDate>Sat, 06 Mar 2010 06:17:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/atbrox" /><feedburner:info uri="atbrox" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Initial Thoughts on Yahoo’s Ranking Challenge</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/jhu-QCPFDyY/</link>
		<comments>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/#comments</comments>
		<pubDate>Sat, 27 Feb 2010 23:15:09 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[netflix]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[yahoo]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=855</guid>
		<description><![CDATA[
			
				
			
		
Yesterday Yahoo announced the Learning to Rank Challenge &#8211; a pretty interesting challenge (as the somewhat similar Netflix Prize Challenge also was). 
Data and Problem
The data sets contain (in my interpretation), per line:

url &#8211; implicitly encoded as line number in the data set file
relevance &#8211; low number=high relevance and vice versa
query &#8211; represented as an [...]]]></description>
			<content:encoded><![CDATA[
<p>Yesterday Yahoo <a href="http://groups.google.com/group/ml-news/browse_thread/thread/bec89f7abee8f9c7#">announced</a> the <a href="http://learningtorankchallenge.yahoo.com/">Learning to Rank Challenge</a> &#8211; a pretty interesting challenge (<em>as the somewhat similar <a href="http://www.netflixprize.com//community/viewtopic.php?id=1537">Netflix Prize Challenge</a> also was</em>). </p>
<p><strong>Data and Problem</strong><br />
The data sets contain (in my interpretation), per line:</p>
<ol>
<li>url &#8211; implicitly encoded as line number in the data set file
<li>relevance &#8211; low number=high relevance and vice versa
<li>query &#8211; represented as an id
<li>features &#8211; up to several hundreds
</ol>
<p>and the problem is to find a function that gives <a href="http://learningtorankchallenge.yahoo.com/instructions.php">relevance numbers per url per query id</a>.</p>
<p><strong>Initial Observation</strong><br />
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are on average 473/19 ~ 24 relevance numbers for each query (see the actual distribution of counts in the figure below), i.e. corresponding to search results 1 to 24, but it seems that several URLs per unique query have the same relevance (e.g. URLx and URLy can both have relevance 2 for queryZ). The paper <a href="http://portal.acm.org/citation.cfm?id=1390382">Learning to Rank with Ties</a> seems potentially relevant for dealing with this.</p>
<p><img src="http://spreadsheets.google.com/oimg?key=0AtUpNWn0bYdJdGlOZWZ0TTgwLUU2Vy1QYXZJT2lUWXc&#038;oid=1&#038;v=1267315743217" /></p>
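<p>The per-query counts discussed above can be computed with a few lines of Python. This is only a minimal sketch: it assumes each line looks like <code>relevance qid:ID feature:value ...</code> (an SVMlight-style layout, which is an assumption here and should be checked against the actual files), and the function name <code>count_urls_per_query</code> is made up for illustration.</p>

```python
from collections import Counter, defaultdict


def count_urls_per_query(lines):
    """Return (urls_per_query, ties_per_query) for SVMlight-style lines.

    Assumed line layout (an assumption, not confirmed by the post):
        relevance qid:QUERY_ID feature_index:value ...
    """
    urls_per_query = Counter()
    labels_per_query = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].startswith("qid:"):
            continue  # skip blank or malformed lines
        relevance = parts[0]
        query_id = parts[1][len("qid:"):]
        urls_per_query[query_id] += 1
        labels_per_query[query_id][relevance] += 1
    # A tie is a relevance level shared by more than one URL within a query.
    ties_per_query = {
        qid: {rel: n for rel, n in labels.items() if n > 1}
        for qid, labels in labels_per_query.items()
    }
    return urls_per_query, ties_per_query


# Tiny hand-made sample, not real challenge data.
sample = [
    "2 qid:1 1:0.5 7:0.1",
    "2 qid:1 3:0.9",
    "0 qid:1 2:0.4",
    "1 qid:2 1:0.3",
]
counts, ties = count_urls_per_query(sample)
print(counts["1"], counts["2"])  # URLs per query
print(ties["1"])                 # relevance levels shared by several URLs
```

<p>Feeding the real file to the function (e.g. <code>count_urls_per_query(open(path))</code>, where <code>path</code> is wherever the data set was saved) would give the distribution plotted above and show how common ties are.</p>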
<p>Multiple URLs that share a relevance for a unique query can perhaps be due to:</p>
<ol>
<li>similar/duplicate content between the URLs?
<li>a frequent query (due to sampling of examples?)
<li>uncertainty about which URL to select for a particular relevance and query?
<li>there is a tie, i.e. they are equally relevant
</ol>
<p><strong>Potential classification approach?</strong><br />
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:</p>
<ol>
<li>Use relevance levels as classes (nominal regression) and use a multiclass-classifier
<li>Train classifier as binary competition within query, i.e. relevance 1 against 2, 3, .., and relevance n against n+1, .. (probably get some sparsity problems due to this)
<li>Binary competition across queries. This is problematic because a relevance of 4 for one query could be more relevant than a relevance of 1 for another query (and there is no easy way to determine that directly from the data). However, if the observation about multiple URLs per relevance level per query (see above) is caused by uncertainty, one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
<ol>
<li>support training across queries, e.g. a URL with relevance 1 for a query where it is the only URL at that level is better than a URL with relevance 1 for another query that has 37 URLs at that level. This approach could perhaps be used with regression. The problem is comparing across different relevance levels, e.g. is a relevance 2 for a query with 1 URL more confident than a relevance 1 for a query with 37 URLs?
<li>use a classifier that supports weighting examples together with the approach in 1 or 2.
</ol>
</ol>
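<p>The weight 1/(number of URLs per relevance level per query) mentioned in the last item can be sketched as follows. The function name <code>tie_weights</code> and the <code>(query_id, relevance)</code> input layout are assumptions made for illustration, not part of the challenge data format.</p>

```python
from collections import Counter


def tie_weights(examples):
    """examples: list of (query_id, relevance) pairs, one per URL.

    Returns one weight per example: 1 / (number of URLs that share the
    same relevance level within the same query).
    """
    group_sizes = Counter(examples)
    return [1.0 / group_sizes[pair] for pair in examples]


# Hand-made sample: q1 has two URLs tied at relevance 1.
examples = [("q1", 1), ("q1", 1), ("q1", 2), ("q2", 1)]
weights = tie_weights(examples)
print(weights)
```

<p>The two tied URLs each get half the weight of an untied one. Such a list could then be handed to any learner that accepts per-example weights.</p>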
<p><strong>Conclusion</strong><br />
Still have more questions than answers, so next step is the <a href="http://research.microsoft.com/en-us/um/beijing/projects/letor/paper.aspx">learning to rank bibliography</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/</feedburner:origLink></item>
		<item>
		<title>So, what is Hadoop?</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/kZIQgRnxtAE/</link>
		<comments>http://atbrox.com/2010/02/17/hadoop/#comments</comments>
		<pubDate>Wed, 17 Feb 2010 21:39:10 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[bigtable]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[thrift]]></category>
		<category><![CDATA[yahoo]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=727</guid>
		<description><![CDATA[
			
				
			
		
Hadoop is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business critical and required data companies gather (e.g. required due to Sarbanes–Oxley (SOX) or  EU Data Retention Directive), Hadoop becomes increasingly relevant. 
Hadoop Technologies
Several Hadoop technologies are inspired [...]]]></description>
			<content:encoded><![CDATA[
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business-critical and legally required data that companies gather (e.g. required due to <a href="http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act">Sarbanes–Oxley (SOX)</a> or the <a href="http://en.wikipedia.org/wiki/Data_Retention_Directive">EU Data Retention Directive</a>), Hadoop is becoming increasingly relevant. </p>
<h2>Hadoop Technologies</h2>
<p>Several Hadoop technologies are inspired by <a href="http://research.google.com/pubs/DistributedSystemsandParallelComputing.html">Google&#8217;s infrastructure</a>.</p>
<h4>1. Processing and Storage</h4>
<p><strong>1.1 Processing &#8211; Mapreduce</strong><br />
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions &#8211; <a href="http://www.slideshare.net/cloudera/hw09-large-scale-transaction-analysis">Visa reduced their processing time for transactional statistics from 1 month to 13 minutes with Hadoop</a>. In order to use Mapreduce, developers need to parallelize their problem and program against an API &#8211; see <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">this example of machine learning with Hadoop</a>. Hadoop&#8217;s Mapreduce is inspired by the paper <a href="http://research.google.com/archive/mapreduce.html">MapReduce: Simplified Data Processing on Large Clusters</a>. </p>
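<p>To give a feel for what "parallelizing the problem" means, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop's API. The function names (<code>map_phase</code>, <code>reduce_phase</code>, <code>run_mapreduce</code>) and the word-count task are assumptions chosen for illustration.</p>

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record (here: one line of text)."""
    for word in record.lower().split():
        yield word, 1


def reduce_phase(key, values):
    """Combine all values that were emitted for one key."""
    return key, sum(values)


def run_mapreduce(records):
    # Shuffle step: group mapped values by key, as the framework would do
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = run_mapreduce(["hadoop stores data", "hadoop processes data"])
print(result)
```

<p>Each call to <code>map_phase</code> looks at one record only, and each call to <code>reduce_phase</code> looks at one key only. That independence is what lets a framework such as Hadoop spread the calls over many machines.</p>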
<p><strong>1.2 File Storage &#8211; HDFS</strong><br />
HDFS is a scalable, distributed file system. It supports a configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper <a href="http://research.google.com/archive/gfs-sosp2003.pdf">The Google File System</a>.</p>
<p><strong>1.3 Database &#8211; HBase</strong><br />
HBase is a distributed database that runs on top of HDFS and supports storing billions of rows with millions of columns. HBase can replace traditional databases if they run into scaling problems or become too expensive licence-wise, see <a href="http://www.docstoc.com/docs/document-preview.aspx?doc_id=12426408&#038;C">this presentation about HBase</a>. HBase is inspired by the paper <a href="http://research.google.com/archive/bigtable-osdi06.pdf">Bigtable: A Distributed Storage System for Structured Data</a>.</p>
<h4>2. Data Analysis</h4>
<p>Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and is highly flexible, but for more structured data the following Hadoop technologies can be used:</p>
<p><strong>2.1 Pig</strong><br />
Pig is an SQL-like language/system running on top of Mapreduce. <a href="http://glinden.blogspot.com/2007/04/yahoo-pig-and-google-sawzall.html">Pig is developed by Yahoo</a> and inspired by the paper <a href="http://research.google.com/pubs/pub61.html">Interpreting the Data: Parallel Analysis with Sawzall</a>.</p>
<p><strong>2.2 Hive</strong><br />
Hive is a data warehouse running on top of Hadoop, developed by Facebook. Its query language is very similar to SQL.</p>
<h4>3. Distributed Systems Development</h4>
<p><strong>3.1 Avro</strong><br />
Avro is used for efficient serialization of data and communication between services. It is in several ways similar to <a href="http://code.google.com/apis/protocolbuffers/">Google&#8217;s protocolbuffers</a> and <a href="http://developers.facebook.com/thrift/">Facebook&#8217;s Thrift</a>.</p>
<p><strong>3.2 Zookeeper</strong><br />
Zookeeper provides coordination between distributed processes. It is inspired by the paper <a href="http://research.google.com/archive/chubby-osdi06.pdf">The Chubby lock service for loosely-coupled distributed systems</a>.</p>
<p><strong>3.3 Chukwa</strong><br />
Monitoring of distributed systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/17/hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/17/hadoop/</feedburner:origLink></item>
		<item>
		<title>Mapreduce &amp; Hadoop Algorithms in Academic Papers (updated)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/ZU_alB_G58o/</link>
		<comments>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 19:19:37 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=662</guid>
		<description><![CDATA[
			
				
			
		
This posting is an update to the similar posting from October 2009 and roughly doubles the number of papers from that posting. The new ones are marked with *
Motivation
Learn from academic literature about how the mapreduce parallel model and hadoop implementation are used to solve algorithmic problems.
Which areas do the papers cover?
 Bioinformatics/Medical Informatics
* MapReduce-Based [...]]]></description>
			<content:encoded><![CDATA[
<p>This posting is an update to the <a href="http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/">similar posting from October 2009</a> and roughly doubles the number of papers from that posting. The new ones are marked with <span style="color: #ff0000;"><strong>*</strong></span></p>
<p><strong>Motivation</strong><br />
Learn from academic literature about how the mapreduce parallel model and hadoop implementation are used to solve algorithmic problems.</p>
<p><strong>Which areas do the papers cover?</strong></p>
<ul> <strong>Bioinformatics/Medical Informatics</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/861l014845934682/">MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network</a> (2009)<br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.biomedcentral.com/1471-2105/11/S1/S15">MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees</a></p>
<p><strong>Machine Translation</strong><br />
<a href="http://www.cs.cmu.edu/~zollmann/publications/samt-toolkit.pdf"> Grammar based statistical MT on Hadoop</a> (2009)<br />
<a href="http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf"> Large Language Models in Machine Translation</a> (2008)</p>
<p><strong>Spatial Data Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://users.cis.fiu.edu/~vagelis/publications/Spatial-MapReduce-SSDBM2009.pdf">Experiences on Processing Spatial Data with MapReduce</a></p>
<p><strong>Information Extraction and Text Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1620950.1620951">Data-intensive text processing with MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.aclweb.org/anthology/D/D09/D09-1098.pdf"> Web-Scale Distributional Similarity and Entity Set Expansion</a> (2009)<br />
<a href="http://www.aclweb.org/anthology-new/D/D09/D09-1071.pdf"> The infinite HMM for unsupervised PoS tagging</a> (2009)</p>
<p><strong>Artificial Intelligence/Machine Learning/Data Mining</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.cs.cmu.edu/~ylow/paraml_aistats2009.pdf">Residual Splash for Optimally Parallelizing Belief Propagation</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1646301">Stochastic gradient boosted distributed decision trees</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://jmlr.csail.mit.edu/papers/volume10/newman09a/newman09a.pdf">Distributed Algorithms for Topic Models</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://verma7.com/wp/wp-content/uploads/2009/10/meandre-mapreduce.pdf">When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/m28617946158t788/">Cloud Computing Boosts Business Intelligence of Telecommunication Industry</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/c621194607866223/">Parallel K-Means Clustering Based on MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1631067">Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1631272.1631451">Parallel algorithms for mining large-scale rich-media data</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://verma7.com/wp/wp-content/uploads/2009/09/CS597_Spring09_GA.pdf">Scaling Simple and Compact Genetic Algorithms using MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.cs.vu.nl/~frankh/postscript/ISWC09.pdf">Scalable Distributed Reasoning using Mapreduce</a><br />
<a href="http://www.cse.nd.edu/~dthain/papers/classify-icdm08.pdf"> Scaling Up Classifiers to Cloud Computers</a> (2008)</p>
<ul>
For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">previous blog post</a>.
</ul>
<p><strong>Ads Analysis</strong><br />
<a href="http://www.cc.gatech.edu/~zha/CSE8801/ad/p209-chen.pdf"> Large-Scale Behavioral Targeting</a> (2009)<br />
<a href="http://research.yahoo.com/files/cikm2008-search%20advertising.pdf "> Search Advertising using Web Relevance Feedback</a> (2008)<br />
<a href="http://research.yahoo.com/workshops/troa-2008/papers/submission_12.pdf"> Predicting Ads’ ClickThrough Rate with Decision Rules </a>(2008)</p>
<p><strong>Search Query Analysis</strong><br />
<a href="http://research.microsoft.com/apps/pubs/default.aspx?id=80592"> BBM: Bayesian Browsing Model from Petabyte-scale Data</a> (2009)<br />
<a href="http://portal.acm.org/citation.cfm?id=1559990&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492464&amp;CFTOKEN=33063869"> AIDE: Ad-hoc Intents Detection Engine over Query Logs </a>(2009)</p>
<p><strong>Information Retrieval (Search)</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://paginas.fe.up.pt/~eol/PUBLICATIONS/2009/Efficient%20clustering%20of%20web-derived%20data%20sets.pdf">Efficient Clustering of Web Derived Data Sets</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://web.phys.ntu.edu.tw/phystalks/Theory_seminar_Fall_2009/PageRank_PingYeh.pdf">The PageRank algorithm and application on searching of academic papers</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/h411850464229625/">A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1572106&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492520&amp;CFTOKEN=63253841"> On Single-Pass Indexing with MapReduce</a> (2009)<br />
<a href="http://bhavik.me/docs/Paper.pdf"> A Data Parallel Algorithm for XML DOM Parsing</a> (2009)<br />
<a href="http://www.springerlink.com/content/t607305788356537/"> Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web</a> (2008)</p>
<p><strong>Spam &amp; Malware Detection</strong><br />
<a href="http://www.usenix.org/event/leet08/tech/full_papers/zhuang/zhuang.pdf">Characterizing Botnets from Email Spam Records</a> (2008)<br />
- Clustering of emails into spam campaigns<br />
- Finding the probability that 2 spam messages are sent from the same machine<br />
- Estimating the likelihood of botnets based on common senders in spam campaigns<br />
<a href="http://www.usenix.org/event/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost In The Browser Analysis of Web-based Malware</a> (2007)</p>
<p><strong>Image and Video Processing</strong><br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Video Stream Re-Rendering<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Location detection in images</p>
<p><strong>Networking</strong><br />
<a href="http://wwwse.inf.tu-dresden.de/papers/preprint-pfeifer2008reducible.pdf">Reducible Complexity in DNS</a></p>
<p><strong>Simulation</strong><br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Simulation of earthquakes (geology)</p>
<p><strong>Statistics</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_SIGIR2009.pdf">Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce</a> (2009)<br />
<a href="http://thepublicgrid.org/papers/koufakou_wcci_08.pdf">Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce</a> (2009)<br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Digg.com story recommendations<br />
<a href="http://www.infosci.cornell.edu/weblab/papers/Bank2008.pdf">Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia</a> (2008)<br />
- Measuring Wikipedia Editor similarity<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Netflix video recommendation<br />
<a href="http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf">Large-scale Parallel Collaborative Filtering for the Netflix Prize</a> (2008)</p>
<p><strong>Numerical Mathematics</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://arxiv.org/PS_cache/arxiv/pdf/1001/1001.0421v1.pdf">Mapreduce for Integer Factorization</a></p>
<p><strong>Graphs</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://www.springerlink.com/content/654725g772674533/">Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework</a><br />
<span style="color: #ff0000;">*</span> <a href="http://www.springerlink.com/content/l805560670136163/">Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce</a><br />
<span style="color: #ff0000;">*</span> <a href="http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120">Graph Twiddling in a MapReduce World</a><br />
<a href="http://www.cis.temple.edu/~vasilis/research/Publications/kdd09.pdf">DOULION: Counting Triangles in Massive Graphs with a Coin</a> (2009)<br />
<a href="http://reports-archive.adm.cs.cmu.edu/anon/ml2008/CMU-ML-08-103.pdf">Fast counting of triangles in real-world networks: proofs, algorithms and observations</a> (2008)</ul>
<p><strong>Who wrote the above papers?</strong> <em>(<font color="#ff0000">section added 20100307</font>)</em><br />
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.<br />
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&#038;M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas</p>
<hr />
<p><font color="#0000ff"><strong>Do you need help with Hadoop/Mapreduce?</strong></font></p>
<div>Contact <a href="http://atbrox.com/about/">Atbrox</a> if you need help with development or parallelization of algorithms for Hadoop/Mapreduce &#8211; <a href="mailto:info@atbrox.com">info@atbrox.com</a>. See <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">our previous posting</a> for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.</div>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/</feedburner:origLink></item>
		<item>
		<title>Parallel Machine Learning for Hadoop/Mapreduce – A Python Example</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/qoTnKURhfes/</link>
		<comments>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 21:27:37 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[svm]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=572</guid>
		<description><![CDATA[
			
				
			
		
This posting gives an example of how to use Mapreduce, Python and Numpy to parallelize a linear machine learning classifier algorithm for Hadoop Streaming. It also discusses several Hadoop/Mapreduce-specific ways to improve or extend the example.
1. Background
Classification is an everyday task, it is about selecting one out of several outcomes based on their [...]]]></description>
			<content:encoded><![CDATA[
<p>This posting gives an example of how to use Mapreduce, Python and Numpy to parallelize a linear machine learning classifier algorithm for Hadoop Streaming. It also discusses several Hadoop/Mapreduce-specific ways to improve or extend the example.</p>
<h2>1. Background</h2>
<p>Classification is an everyday task: selecting one of several outcomes based on their features, e.g.</p>
<ul>
<li>In recycling of garbage you select the bin based on the material, e.g. plastic, metal or organic.</li>
<li>When purchasing, you select the store based on e.g. its reputation, prior experience, service, inventory and prices.</li>
</ul>
<p>Computational Classification &#8211; Supervised Machine Learning &#8211; is quite similar, but requires (relatively) well-formed input data combined with classification algorithms.</p>
<h3>1.1 Examples of classification problems</h3>
<ul>
<li>Finance/Insurance
<ul>
<li>Classify investment opportunities as good or not e.g. based on industry/company metrics, portfolio diversity and currency risk.</li>
<li>Classify credit card transactions as valid or invalid based on e.g. the location of the transaction and of the credit card holder, date, amount, purchased item or service, history of transactions and similar transactions.</li>
</ul>
<li>Biology/Medicine
<ul>
<li>Classification of proteins into structural or functional classes</li>
<li>Diagnostic classification, e.g. <a href="http://www.csie.ntu.edu.tw/~rfchang/prof/ar0302.pdf">cancer tumours based on images</a></li>
</ul>
<li>Internet
<ul>
<li><a href="http://en.wikipedia.org/wiki/Document_classification">Document Classification</a> and <a href="http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html">Ranking</a></li>
<li>Malware classification, email/tweet/web spam classification</ul>
<li>Production Systems (e.g. in energy or petrochemical industries)
<ul>
<li>Classify and detect situations (e.g. sweet spots or risk situations) based on realtime and historic data from sensors</ul>
</li>
</ul>
<h3>1.2 Classification Algorithms</h3>
<p>Classification algorithms come in various types (e.g. linear, nonlinear, discriminative etc.), see my prior postings <a href="http://amundblog.blogspot.com/2008/04/pragmatic-classification-very-basics.html">Pragmatic Classification: The Very Basics</a> and <a href="http://amundblog.blogspot.com/2008/06/pragmatic-classification-of-classifiers.html">Pragmatic Classification of Classifiers</a>.<br />
<strong><font color="#0000ff"><br />
Key takeaways about classifiers:<br />
</font></strong></p>
<ol>
<li>There is no silver bullet classifier algorithm or feature extraction method.
<li>Classification algorithms tend to be computationally expensive to train. This encourages a parallel approach, in this case with Hadoop/Mapreduce.
</ol>
<h2>2. Parallel Classification for Hadoop Streaming</h2>
<p>The classifier described here belongs to a family of classifiers that can all be described mathematically as Tikhonov Regularization with a Square loss function. This family includes Proximal SVM, Ridge Regression, Shrinkage Regression and Regularized Least-Squares Classification. (<em>note: If you replace the Square Loss function with a Hinge-Loss function you get Support Vector Machine classification</em>). The implemented classifier &#8211; proximal SVM &#8211; is from the paper <a href="ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-08.ps">Incremental Support Vector Machine Classification</a>, referred to below as the paper.</p>
<h3>2.1 training data</h3>
<p>The classifier assumes numerical training data, where each class is either -1.0 or +1.0 (negative or positive class), and features are represented as vectors of positive floating point numbers. The algorithm below uses:</p>
<pre class="brush: plain;">
A  - a matrix with one feature vector per row, e.g. [[2.9, 3.3, 11.1, 2.4], .. ]
D  - a diagonal matrix with the class of each training example (-1.0 or +1.0)
     on the diagonal, e.g. diag([-1.0, 1.0, 1.0, .. ])
e  - a column vector filled with ones, e.g. [1.0, 1.0, .., 1.0]
E  = [A -e]   # A with the column -e appended
mu - a scalar constant # used to tune the classifier
</pre>
<h3>2.2 the classifier algorithm</h3>
<p>Training the classifier can be done with the right-hand side of equation (13) from the paper:</p>
<pre class="brush: plain;">(omega, gamma) = (I/mu + E.T*E).I*(E.T*D*e)
</pre>
<p>Classification of an incoming feature vector x can then be done by calculating:</p>
<pre class="brush: plain;">x.T*omega - gamma</pre>
<p>which returns a number, and the sign of the number corresponds to the class, i.e. positive or negative.</p>
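<p>As a minimal single-machine sketch of the two formulas above, the following pure-Python code trains on a handful of examples and classifies a new vector. The names solve, train and classify are made up for this illustration and are not taken from the paper; the linear system is solved with plain Gauss-Jordan elimination rather than an explicit matrix inverse, which yields the same (omega, gamma).</p>
<pre class="brush: plain;">
def solve(M, b):
    # Gauss-Jordan elimination with partial pivoting, solves M*x = b
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def train(features, classes, mu=0.1):
    # E = [A -e]: every feature vector gets -1.0 appended
    E = [list(row) + [-1.0] for row in features]
    n = len(E[0])
    # I/mu + E.T*E
    M = [[(1.0 / mu if i == j else 0.0) + sum(row[i] * row[j] for row in E)
          for j in range(n)] for i in range(n)]
    # E.T*D*e, where the diagonal of D holds the classes
    b = [sum(d * row[i] for row, d in zip(E, classes)) for i in range(n)]
    result = solve(M, b)
    return result[:-1], result[-1]   # omega, gamma

def classify(x, omega, gamma):
    # x.T*omega - gamma, the sign gives the class
    return sum(xi * wi for xi, wi in zip(x, omega)) - gamma
</pre>
<p>For example, train([[1.0], [2.0], [8.0], [9.0]], [-1.0, -1.0, 1.0, 1.0], mu=100.0) gives a classifier for which classify([1.0], omega, gamma) is negative and classify([9.0], omega, gamma) is positive.</p>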
<p>2. Parallelization of the classifier with Hadoop Streaming and Python</p>
<p>Expression (16) in the paper has a nice property: it supports increments (and decrements). In the example there are 2 increments (and 2 decrements), but by induction there can be as many as you want:</p>
<pre class="brush: plain;">
(omega, gamma) = (I/mu + E_1.T*E_1 + .. + E_i.T*E_i).I*
                 (E_1.T*D_1*e + .. + E_i.T*D_i*e)
</pre>
<p>where</p>
<pre class="brush: plain;">
E.T*E = E_1.T*E_1 + .. + E_i.T*E_i
</pre>
<p>and</p>
<pre class="brush: plain;">
E.T*D*e = E_1.T*D_1*e + .. + E_i.T*D_i*e
</pre>
<p>This means that we can parallelize the calculation of E.T*E and E.T*D*e by having Hadoop mappers calculate each of the elements of the sums, as in the Python map() code below (sent to reducers as tuples).</p>
<p><img width="500" src="http://atbrox.com/wp-content/uploads/2010/02/parclassifiersinglereducer.png" alt="map() and reduce() - dataflow - basic case" /></p>
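<p>The sum property above can be checked on a toy matrix. The sketch below is pure Python and only illustrative; the helper names gram and add and the sample numbers are made up. It splits E row-wise into two increments and compares the sum of the per-increment products with the product for the whole matrix.</p>
<pre class="brush: plain;">
def gram(E):
    # E.T*E for a matrix E given as a list of rows
    n = len(E[0])
    return [[sum(row[i] * row[j] for row in E) for j in range(n)]
            for i in range(n)]

def add(X, Y):
    # elementwise sum of two equally sized matrices
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# four training examples with two features each, last column is -e
E = [[1.0, 2.0, -1.0], [3.0, 4.0, -1.0], [5.0, 6.0, -1.0], [7.0, 8.0, -1.0]]
E_1, E_2 = E[:2], E[2:]          # two increments, e.g. one per mapper

whole = gram(E)                  # E.T*E
parts = add(gram(E_1), gram(E_2))  # E_1.T*E_1 + E_2.T*E_2
print(whole == parts)
</pre>
<p>The same row-wise split works for E.T*D*e, since each training example contributes its own term to the sum.</p>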
<h3>2.3 &#8211; the mapper</h3>
<pre class="brush: plain;">
import base64
import pickle
import numpy

def map(key, value):
   # input key = class for one training example, e.g. &quot;-1.0&quot;
   classes = [float(item) for item in key.split(&quot;,&quot;)]   # e.g. [-1.0]
   D = numpy.diag(classes)

   # input value = feature vector for one training example, e.g. &quot;3.0, 7.0, 2.0&quot;
   featurematrix = [float(item) for item in value.split(&quot;,&quot;)]
   A = numpy.matrix(featurematrix)

   # create matrix E = [A -e] and vector e
   e = numpy.matrix(numpy.ones(len(A)).reshape(len(A),1))
   E = numpy.matrix(numpy.append(A,-e,axis=1))

   # create a tuple with the values to be used by reducer
   # and encode it with base64 to avoid potential trouble with '\t' and '\n' used
   # as default separators in Hadoop Streaming
   producedvalue = base64.b64encode(pickle.dumps( (E.T*E, E.T*D*e) )).decode(&quot;ascii&quot;)

   # note: a single constant key &quot;producedkey&quot; sends to only one reducer
   # somewhat &quot;atypical&quot; due to low degree of parallelism on reducer side
   print(&quot;producedkey\t%s&quot; % producedvalue)
</pre>
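<p>To illustrate why the mapper output is pickled and base64-encoded, here is a small round trip using only the standard library. The payload is a made-up stand-in (nested lists instead of Numpy matrices); the point is that the encoded value contains neither tabs nor newlines, so it survives the key/value line format of Hadoop Streaming.</p>
<pre class="brush: plain;">
import base64
import pickle

# stand-in for the (E.T*E, E.T*D*e) tuple produced by the mapper
payload = ([[1.0, 2.0], [2.0, 4.0]], [[-1.0], [-2.0]])

# mapper side: pickle, then base64-encode into plain ASCII text
encoded = base64.b64encode(pickle.dumps(payload)).decode("ascii")
line = "producedkey\t%s" % encoded

# reducer side: split on the first tab, then decode and unpickle
key, value = line.split("\t", 1)
decoded = pickle.loads(base64.b64decode(value))
print(decoded == payload)
</pre>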
<h3>2.4 &#8211; the Reducer</h3>
<pre class="brush: plain;">
import base64
import pickle
import numpy

def reduce(key, values, mu=0.1):
  sumETE = None
  sumETDe = None

  # the key inside each record isn't used, so ignoring it with _ (underscore).
  for _, value in values:
    # unpickle values
    ETE, ETDe = pickle.loads(base64.b64decode(value))
    if sumETE is None:
      # create the I/mu with correct dimensions
      sumETE = numpy.matrix(numpy.eye(ETE.shape[1])/mu)
    sumETE += ETE

    if sumETDe is None:
      # create sumETDe with correct dimensions
      sumETDe = ETDe
    else:
      sumETDe += ETDe

  # all increments are summed, so the classifier can be calculated once
  # note: omega = result[:-1] and gamma = result[-1]
  # but printing entire vector as output
  result = sumETE.I*sumETDe
  print(&quot;%s\t%s&quot; % (key, str(result.tolist())))
</pre>
<h3>2.5 &#8211; Mapper and Reducer Utility Code</h3>
<p>Code used to run the map() and reduce() methods, inspired by the iterator/generator approach from <a href="http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python">this mapreduce tutorial</a>.</p>
<pre class="brush: plain;">
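import sys
from itertools import groupby
from operator import itemgetter

def read_input(file, separator=&quot;\t&quot;):
    # yield each input line as a [key, value] list
    for line in file:
        yield line.rstrip().split(separator)
</pre>
<pre class="brush: plain;">
def run_mapper(map, separator=&quot;\t&quot;):
    data = read_input(sys.stdin,separator)
    for (key,value) in data:
        map(key,value)
</pre>
<pre class="brush: plain;">
def run_reducer(reduce,separator=&quot;\t&quot;):
    data = read_input(sys.stdin, separator)
    # reducer input arrives sorted by key, so groupby sees
    # all records for one key as a single consecutive run
    for key, values in groupby(data, itemgetter(0)):
        reduce(key, values)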
</pre>
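<p>For testing without a cluster, the sort-and-group step that Hadoop performs between mappers and reducers can be imitated locally. The function below is a hypothetical helper written for this illustration (simulate_shuffle is not part of Hadoop or of the code above); it sorts mapper output lines by key and groups the values per key.</p>
<pre class="brush: plain;">
from itertools import groupby
from operator import itemgetter

def simulate_shuffle(mapper_output_lines, separator="\t"):
    # split every line into [key, value]
    pairs = [line.rstrip("\n").split(separator, 1)
             for line in mapper_output_lines]
    # Hadoop sorts by key before the reducer sees the data
    pairs.sort(key=itemgetter(0))
    # collect the values of each consecutive run of equal keys
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, itemgetter(0))]

print(simulate_shuffle(["b\t1\n", "a\t2\n", "b\t3\n"]))
</pre>
<p>Each (key, values) pair in the result corresponds to one call of reduce() on a real cluster.</p>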
<h2>3. Finished?</h2>
<p>Assume your running time goes through the roof even with the above parallel approach. What can you do?</p>
<h3>3.1 Mapper Increment Size really makes a difference!</h3>
<p>Since there is only 1 reducer in the presented implementation, it is useful to let the mappers do most of the work. The size of the (increment) matrices &#8211; E.T*E and E.T*D*e &#8211; given as input to the reducer is independent of the number of training examples, but dependent on the number of classification features. The workload on the reducer also depends on the number of matrices it receives from the mappers (i.e. on the increment size). For example, with 1000 mappers that each have one billion examples with 100 features, the reducer would need to sum one trillion 101&#215;101 matrices and one trillion 101&#215;1 vectors if the mappers sent one matrix pair per training example. If each mapper instead sent a single pair of E.T*E and E.T*D*e representing all of its one billion training examples, the reducer would only need to sum 1000 matrix pairs.</p>
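<p>A rough sketch of such a large-increment mapper is shown below, in pure Python so that it is self-contained. The function name accumulate and the input format (one line per example, class and comma-separated features divided by a tab) are assumptions made for this illustration. It adds the contribution of every training example to running sums and returns a single pair at the end, instead of one pair per example.</p>
<pre class="brush: plain;">
def accumulate(lines):
    # running sums of E.T*E and E.T*D*e over all input lines
    sumETE, sumETDe = None, None
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        d = float(key)                                  # class, -1.0 or +1.0
        row = [float(item) for item in value.split(",")] + [-1.0]  # row of E
        if sumETE is None:
            n = len(row)
            sumETE = [[0.0] * n for _ in range(n)]
            sumETDe = [0.0] * n
        for i in range(len(row)):
            sumETDe[i] += d * row[i]
            for j in range(len(row)):
                sumETE[i][j] += row[i] * row[j]
    # one matrix pair for the whole input split
    return sumETE, sumETDe
</pre>
<p>The returned pair would then be encoded and printed once, in the same way as in the map() function above.</p>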
<h3>3.2 Avoid stressing the reducer</h3>
<p>Add more (intermediate) reducers (combiners) that calculate partial sums of matrices. In the case of many small increments (and correspondingly many matrices) it can be useful to add an intermediate step that (in parallel) calculates sums of E.T*E and E.T*D*e before sending the sums to the final reducer. This means that the final reducer gets fewer matrices to sum before calculating the final answer, see figure below.<br />
<img width="500" src="http://atbrox.com/wp-content/uploads/2010/02/machinelearning2.png" alt="flow with intermediate mapreduce step" /></p>
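<p>The intermediate step works because matrix addition is associative: partial sums can be summed again without changing the result. A minimal pure-Python sketch follows (the name combine and the list-based matrix format are assumptions for illustration); the same function could serve both as combiner and as the summing part of the final reducer.</p>
<pre class="brush: plain;">
def combine(pairs):
    # pairs: iterable of (ETE, ETDe), ETE as list of rows, ETDe as flat list
    sumETE, sumETDe = None, None
    for ETE, ETDe in pairs:
        if sumETE is None:
            sumETE = [list(row) for row in ETE]
            sumETDe = list(ETDe)
        else:
            sumETE = [[a + b for a, b in zip(ra, rb)]
                      for ra, rb in zip(sumETE, ETE)]
            sumETDe = [a + b for a, b in zip(sumETDe, ETDe)]
    return sumETE, sumETDe
</pre>
<p>Summing three pairs directly, or first combining two of them and then adding the third, gives the same matrices.</p>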
<h3>3.3 Parallelize (or replace) the matrix inversion in the reduction step</h3>
<p>If someone comes along with a training data set with a very high feature-dimension (e.g. recommender systems, bioinformatics or text classification), the matrix inversion in the reducer can become a real bottleneck since such algorithms typically are O(n^3) (and lower bound of <a href="http://amundtveit.info/publications/2003/ComplexityOfMatrixInversion.pdf">Omega(n^2 lg n)</a>), where n is the number of features. A solution to this can be to use or develop hadoop/mapreduce-based parallel matrix inversion, e.g. <a href="http://incubator.apache.org/hama/">Apache Hama</a>, or <a href="http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/">don&#8217;t invert the matrix..</a>.</p>
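<p>As an illustration of the advice not to invert the matrix: the reducer only needs the solution of (I/mu + E.T*E)*x = E.T*D*e, and since that matrix is symmetric and positive definite the system can be solved with a Cholesky factorization and two triangular substitutions. The pure-Python function below is a sketch under that assumption (cholesky_solve is a made-up name); with Numpy, numpy.linalg.solve serves the same purpose.</p>
<pre class="brush: plain;">
import math

def cholesky_solve(M, b):
    # solves M*x = b for a symmetric positive definite M
    n = len(b)
    # factorize M = L*L.T with L lower triangular
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    # forward substitution: L*y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # back substitution: L.T*x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
</pre>
<p>This is still O(n^3) in the number of features, but it avoids forming the inverse explicitly.</p>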
<h3>3.4 Feature Dimensionality Reduction</h3>
<p>Another approach when having training data with high feature-dimension could be to reduce feature-dimensionality, for more info check out <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">Latent Semantic Indexing</a> (and Analysis), <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a> or <a href="http://ict.ewi.tudelft.nl/~lvandermaaten/t-SNE.html">t-Distributed Stochastic Neighbor Embedding</a></p>
<h3>3.5 Reduce IO between mappers and reducers with compression</h3>
<p><a href="http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression">Twitter presented using LZO compression (on the Cloudera blog) to speed up Hadoop</a>. Inspired by this one could in the case of high feature dimension, i.e. large E.T*E and E.T*D*e matrices, compress the output in the mapper and decompress in the reducer by replacing base64encoding/decoding and pickling above with:</p>
<pre class="brush: plain;">
# second argument to lzo.compress is the compression level
producedvalue = base64.b64encode(lzo.compress(pickle.dumps( (E.T*E, E.T*D*e) ), 1)).decode(&quot;ascii&quot;)
</pre>
<p>and</p>
<pre class="brush: plain;">
ETE, ETDe = pickle.loads(lzo.decompress(base64.b64decode(value)))
</pre>
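<p>If the LZO bindings are not available, the same pattern can be tried with zlib from the Python standard library. Note that this swaps in a different compression algorithm than the one discussed above, so the speed and size trade-offs will differ; the function names encode and decode and the sample payload are made up for this sketch.</p>
<pre class="brush: plain;">
import base64
import pickle
import zlib

def encode(payload):
    # pickle, compress with zlib level 1 (fast), then base64-encode
    compressed = zlib.compress(pickle.dumps(payload), 1)
    return base64.b64encode(compressed).decode("ascii")

def decode(value):
    # reverse order: base64-decode, decompress, unpickle
    return pickle.loads(zlib.decompress(base64.b64decode(value)))

payload = ([[0.0] * 101 for _ in range(101)], [0.0] * 101)
value = encode(payload)
print(decode(value) == payload)
</pre>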
<h3>3.6 Do more work with approximately the same computing resources</h3>
<p>The D matrix above represents binary classification with a value of +1 or -1 representing each class. It is quite common to have classification problems with more than 2 classes. Supporting multiple classes is usually done by training several classifiers, either 1-against-all (1 classifier trained per class) or 1-against-1 (1 classifier trained per unique pair of classes), and then running a tournament between them and picking the most confident. In the case of 1-against-all classification the mapper could probably send multiple E.T*D_c*e &#8211; with one D_c per class &#8211; and keep the same E.T*E. The reducer would then need to calculate (I/mu + E.T*E).I once and independently multiply it with several E.T*D_c*e sums to create a set of (omega,gamma) classifiers. For 1-against-1 classification it becomes somewhat more complicated, because it involves creating several E matrices, since in the 1-against-1 case only the rows in E where the 2 competing classes occur are relevant.</p>
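<p>A possible shape of the 1-against-all idea is sketched below in pure Python. It is untested on Hadoop and all names (one_against_all_sums, most_confident) are hypothetical. The first function computes one E.T*D_c*e vector per class from a shared E, and the second picks the class whose classifier returns the highest value of x.T*omega - gamma.</p>
<pre class="brush: plain;">
def one_against_all_sums(E, labels):
    # E: list of rows [features..., -1.0], labels: class name per row
    n = len(E[0])
    sums = {}
    for c in sorted(set(labels)):
        # diagonal of D_c: +1.0 for examples of class c, -1.0 for all others
        signs = [1.0 if label == c else -1.0 for label in labels]
        sums[c] = [sum(s * row[i] for row, s in zip(E, signs))
                   for i in range(n)]
    return sums

def most_confident(x, classifiers):
    # classifiers maps class name to its (omega, gamma)
    def score(c):
        omega, gamma = classifiers[c]
        return sum(xi * wi for xi, wi in zip(x, omega)) - gamma
    return max(classifiers, key=score)
</pre>
<p>E.T*E does not depend on the class, so it only has to be calculated and sent once per increment.</p>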
<h2>4. Code</h2>
<p>(Early) Python code of the algorithm presented above can be found at <a href="http://code.google.com/p/snabler/">http://code.google.com/p/snabler/</a> (open source with Apache Licence). Please let <a href="mailto:amund@atbrox.com">me</a> know if you want to contribute to the project, e.g. with algorithms from <a href="http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/">mapreduce and hadoop algorithms in academic papers</a>.</p>
<h2>5. More resources about machine learning with Hadoop/Mapreduce?</h2>
<ul>
<li><a href="http://lucene.apache.org/mahout/">Apache Mahout</a> &#8211; active project that implements (in Java) several machine learning algorithms (also unsupervised machine learning, i.e. clustering)
<li>Good paper about machine learning algorithms with mapreduce &#8211; <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf</a>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/</feedburner:origLink></item>
		<item>
		<title>Atbrox Customer Case Study – Scalable Language Processing with Elastic Mapreduce (Hadoop)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/8sx94xca6LM/</link>
		<comments>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 07:04:32 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[language processing]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=507</guid>
		<description><![CDATA[
			
				
			
		
We developed a tool for scalable language processing for our customer Lingit using Amazon&#8217;s Elastic Mapreduce.
More details: http://aws.amazon.com/solutions/case-studies/atbrox/
Contact us if you need help with Hadoop/Elastic Mapreduce.
]]></description>
			<content:encoded><![CDATA[
<p>We developed a tool for scalable language processing for our customer <a href="http://www.lingit.no">Lingit</a> using Amazon&#8217;s Elastic Mapreduce.</p>
<p><strong>More details:</strong> <a href="http://aws.amazon.com/solutions/case-studies/atbrox/">http://aws.amazon.com/solutions/case-studies/atbrox/</a></p>
<p><a href="http://atbrox.com/contact/">Contact us</a> if you need help with Hadoop/Elastic Mapreduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/</feedburner:origLink></item>
		<item>
		<title>How to combine Elastic Mapreduce/Hadoop with other Amazon Web Services</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/WWB0nBggORA/</link>
		<comments>http://atbrox.com/2009/11/11/how-to-combine-elastic-mapreducehadoop-with-other-amazon-web-services/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 00:29:28 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[simpledb]]></category>
		<category><![CDATA[sqs]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=465</guid>
		<description><![CDATA[
			
				
			
		
Elastic Mapreduce&#8217;s default behavior is to read from and store to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to dynamically load boto on each [...]]]></description>
			<content:encoded><![CDATA[
<p>Elastic Mapreduce&#8217;s default behavior is to read from and store to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to dynamically load boto on each mapper and reducer. Cloudera&#8217;s Jeff Hammerbacher <a href="http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/">outlined how to do that using the Hadoop Distributed Cache</a>, and Peter Skomoroch <a href="http://datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">suggested how to load Boto to access Elastic Block Store (EBS)</a>. This posting builds on those ideas and describes the steps in detail.</p>
<p><strong>How to combine Elastic Mapreduce with other AWS Services</strong></p>
<p>This posting shows how to load boto in an Elastic Mapreduce mapper and gives a simple example of how to use SimpleDB from the same mapper. For accessing other AWS services from Elastic Mapreduce, e.g. SQS, check out the Boto documentation (it is quite easy once the boto + EMR integration is in place).</p>
<p><strong>Other tools used (prerequisites)</strong>: </p>
<ul>
<li><a href="http://s3tools.org/s3cmd">s3cmd</a> &#8211; to upload/download files to S3</li>
<li><a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264">Elastic Mapreduce Ruby Client</a> &#8211; to launch Elastic Mapreduce jobs</li>
</ul>
<p><strong>Step 1 &#8211; getting and preparing the Boto library</strong></p>
<pre class="brush: plain;">
wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
# note: using virtualenv can be useful if you want to
# keep your local Python installation clean
tar -zxvf boto-1.8d.tar.gz ; cd boto-1.8d ; python setup.py install
cd /usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg
zip -r boto.mod boto
</pre>
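<p>The zip step above can also be scripted from Python, which avoids changing into the exact dist-packages directory by hand. This is only a minimal sketch using the standard library; the helper name <code>zip_package</code> and its arguments are hypothetical and not part of boto or the original setup:</p>

```python
import os
import zipfile


def zip_package(package_dir, archive_path):
    """Zip a pure-Python package directory so it can be imported from the archive.

    package_dir  -- path to the package directory (e.g. .../boto)
    archive_path -- output file (e.g. boto.mod)
    Returns the sorted list of archive member names.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled files; they are version specific
                full_path = os.path.join(root, name)
                # Store paths relative to the parent directory so the
                # package name (e.g. "boto/") is the top-level entry.
                archive.write(full_path, os.path.relpath(full_path, parent))
        return sorted(archive.namelist())
```

<p>With the paths from step 1 this would be called as <code>zip_package("/usr/local/lib/python2.6/dist-packages/boto-1.8d-py2.6.egg/boto", "boto.mod")</code>, assuming boto was installed to that location.</p>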
<p><strong>Step 2 &#8211; mapper that loads boto.mod and uses it to access SimpleDB</strong></p>
<pre class="brush: python;">
# this was tested by adding the code underneath to the mapper
# s3://elasticmapreduce/samples/wordcount/wordSplitter.py

# make boto.mod (shipped via -cacheFile, see step 3) importable
import sys
sys.path.append(&quot;.&quot;)
import zipimport
importer = zipimport.zipimporter('boto.mod')
boto = importer.load_module('boto')

# access simpledb
sdb = boto.connect_sdb(&quot;YourAWSKey&quot;, &quot;YourSecretAWSKey&quot;)
# create_domain() is a method on the connection; or use sdb.get_domain()
sdb_domain = sdb.create_domain(&quot;mymapreducedomain&quot;)
# ..
# write words to simpledb (inside the mapper's loop over input lines)
  for word in pattern.findall(line):
      item = sdb_domain.create_item(word)
      item[&quot;reversedword&quot;] = word[::-1]
      item.save()
      # ...
</pre>
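<p>The loading part of the mapper can be tried locally without any AWS access. The sketch below is an alternative to the explicit <code>zipimporter</code> call above, assuming a recent Python 3 where <code>zipimporter.load_module()</code> is deprecated: it puts the archive on <code>sys.path</code> and lets the regular import machinery read it. The helper name <code>load_from_archive</code> is hypothetical:</p>

```python
import importlib
import sys


def load_from_archive(archive_path, module_name):
    """Import module_name from a zip archive such as boto.mod.

    Python's import system handles zip files placed on sys.path
    (via zipimport), so no explicit zipimporter object is needed.
    """
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

<p>In a mapper this would be used as <code>boto = load_from_archive("boto.mod", "boto")</code>, with boto.mod delivered to the task's working directory by the <code>-cacheFile</code> argument. Whether the boto version inside the archive runs on the Python version installed on the cluster is a separate question that this sketch does not address.</p>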
<p><strong>Step 3 &#8211; json config file &#8211; bototest.json &#8211; for Elastic Mapreduce Ruby Client</strong></p>
<pre class="brush: plain;">
[
  {
    &quot;Name&quot;: &quot;Step 1: testing boto with elastic mapreduce&quot;,
    &quot;ActionOnFailure&quot;: &quot;&lt;action_on_failure&gt;&quot;,
    &quot;HadoopJarStep&quot;: {
      &quot;Jar&quot;: &quot;/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar&quot;,
      &quot;Args&quot;: [
        &quot;-input&quot;, &quot;s3n://elasticmapreduce/samples/wordcount/input&quot;,
        &quot;-output&quot;, &quot;s3n://yours3bucket/result&quot;,
        &quot;-mapper&quot;, &quot;s3://yours3bucket/botoWordSplitter.py&quot;,
        &quot;-cacheFile&quot;, &quot;s3n://yours3bucket/boto.mod#boto.mod&quot;
      ]
    }
  }
]
</pre>
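<p>Hand-written JSON is easy to get slightly wrong (a stray trailing comma is enough to make it invalid), so the config file can also be generated. A minimal sketch with the standard <code>json</code> module; the function name <code>build_step_config</code> and its parameters are hypothetical, and it writes the failure action directly into the file instead of using a placeholder:</p>

```python
import json


def build_step_config(bucket, action_on_failure="TERMINATE_JOB_FLOW"):
    """Return the job flow step list from step 3 as a JSON string."""
    step = {
        "Name": "Step 1: testing boto with elastic mapreduce",
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
            "Args": [
                "-input", "s3n://elasticmapreduce/samples/wordcount/input",
                "-output", "s3n://%s/result" % bucket,
                "-mapper", "s3://%s/botoWordSplitter.py" % bucket,
                "-cacheFile", "s3n://%s/boto.mod#boto.mod" % bucket,
            ],
        },
    }
    # json.dumps always emits valid JSON, so no trailing commas can slip in.
    return json.dumps([step], indent=2)
```

<p>Writing the result of <code>build_step_config("yours3bucket")</code> to bototest.json gives a file with the same structure as the one above. Since the placeholder is already filled in, the <code>--param</code> option in step 5 would have nothing left to substitute for a file produced this way.</p>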
<p><strong>Step 4 &#8211; Copy necessary files to s3</strong></p>
<pre class="brush: plain;">
s3cmd put boto.mod s3://yours3bucket
s3cmd put botoWordSplitter.py s3://yours3bucket
</pre>
<p><strong>Step 5 &#8211; And run your Elastic Mapreduce job</strong></p>
<pre class="brush: plain;">
 elastic-mapreduce --create \
                   --stream \
                   --json bototest.json \
                   --param &quot;&lt;action_on_failure&gt;=TERMINATE_JOB_FLOW&quot;
</pre>
<p><strong>Conclusion</strong><br />
This showed how to dynamically load boto and use it to access one other AWS service &#8211; SimpleDB &#8211; from Elastic Mapreduce. Boto supports most AWS services, so the same integration approach should also work for other AWS services, e.g. SQS (Queuing Service), <a href="http://www.elastician.com/2009/10/using-rds-in-boto.html">RDS (MySQL Service)</a> and EC2. Check out the <a href="http://boto.s3.amazonaws.com/index.html">Boto API documentation</a> or <a href="http://www.slideshare.net/lucamea/controlling-the-cloud-with-python-1407502">Controlling the Cloud with Python</a> for details.</p>
<p><em>Note: a very similar integration approach should work for most Python libraries, including those that use/wrap C/C++ code (e.g. machine learning libraries such as PyML and others). In that case you may need to do step 1 on a Debian AMI similar to the ones Elastic Mapreduce uses; check out a <a href="http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/">previous posting</a> for more info about such AMIs.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/11/11/how-to-combine-elastic-mapreducehadoop-with-other-amazon-web-services/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/11/11/how-to-combine-elastic-mapreducehadoop-with-other-amazon-web-services/</feedburner:origLink></item>
		<item>
		<title>Preliminary Experiences Crawling with 80legs</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/d87ez7eeneI/</link>
		<comments>http://atbrox.com/2009/11/04/preliminary-experiences-crawling-with-80legs/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 10:43:44 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[crawling]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[web services]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=394</guid>
		<description><![CDATA[
			
				
			
		
Back in May 2000 I wrote that &#8220;It seems likely that the specialization in the Internet Information Retrieval (IIR) business will continue.  Internet information crawling, pre-processing, indexing, searching and presentation requires different types of technologies and know-how, this might create opportunities for new companies specializing in only one step of the IIR &#8220;food chain&#8221;&#8221;
80legs
80legs [...]]]></description>
			<content:encoded><![CDATA[
<p>Back in May 2000 <a href="http://amundtveit.info/publications/2000/iir.php">I wrote</a> that <em>&#8220;It seems likely that the specialization in the Internet Information Retrieval (IIR) business will continue.  Internet information crawling, pre-processing, indexing, searching and presentation requires different types of technologies and know-how, this might create opportunities for new companies specializing in only one step of the IIR &#8220;food chain&#8221;&#8221;</em></p>
<p><strong>80legs</strong></p>
<p><a href="http://80legs.com">80legs</a> is a company specializing in the crawling and preprocessing part, where you can upload your seed URLs (where to start crawling), configure your crawl job (depth, domain restrictions etc.) and also run existing or custom analysis code (uploaded as Java JAR files) on the fetched pages. When you upload seed files, 80legs does some filtering before starting to crawl (e.g. of seed URLs that are not well-formed), and also handles domain throttling and robots.txt (and perhaps other things).</p>
<p>Computational model: Since you can run custom code per page, it can be seen as the mapper part of a MapReduce (Hadoop) job (one map() call per page); for reduce-type processing (over several pages) you need to move your data elsewhere (e.g. EC2 in the cloud). <em>Side note: another domain with &#8220;reduce-less&#8221; mapreduce is quantum computing, check out Michael Nielsen&#8217;s <a href="http://michaelnielsen.org/blog/quantum-computing-for-everyone/">Quantum Computing for Everyone</a></em>.</p>
<p><strong>Testing 80legs</strong></p>
<p>Note: We have only tried the built-in functionality so far, not custom code.</p>
<p>1) URL extraction</p>
<p>Job description: We used a seed of approximately 1,000 URLs and crawled and analyzed ~2.8 million pages within those domains. The regexp configuration was used (we only provided the URL matching regexp).</p>
<p>Result: Approximately 1 billion URLs were found, and results came in 106 zip files (each ~14MB packed and ~100MB unpacked) in addition to zip files of the URLs that were crawled.</p>
<p><em>Note: Based on a few smaller similar jobs, it looks like the parallelism of 80legs is somewhat dependent on the number of domains in the crawl and perhaps also on their ordering. If you have a set of URLs where each domain has more than one URL, it can be useful to randomize your seed URL file before uploading and running the crawl job, e.g. by using <a href="http://arthurdejong.org/rl/">rl</a> or <a href="http://www.gnu.org/software/coreutils/">coreutils&#8217; shuf</a>.</em></p>
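<p>If neither rl nor shuf is at hand, the same randomization can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and it assumes a seed file with one URL per line.</p>

```python
import random


def shuffle_seed_urls(urls, seed=None):
    """Return a new list with the seed URLs in random order (input is untouched)."""
    shuffled = list(urls)
    # A dedicated Random instance makes the shuffle repeatable when a seed is given.
    random.Random(seed).shuffle(shuffled)
    return shuffled


def shuffle_seed_file(in_path, out_path, seed=None):
    """Read one URL per line, skip blank lines, and write the URLs back shuffled."""
    with open(in_path) as infile:
        urls = [line.strip() for line in infile if line.strip()]
    with open(out_path, "w") as outfile:
        for url in shuffle_seed_urls(urls, seed):
            outfile.write(url + "\n")
```

<p>Calling shuffle_seed_file("seeds.txt", "seeds_shuffled.txt") (hypothetical file names) would then produce the file to upload.</p>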
<p>2) Fetching pages</p>
<p>Job description: We built a set of ~80k URLs that we wanted to fetch as HTML (using their sample application called 80App Get Raw HTML) for further processing. The URLs were split into 4 jobs of ~20k URLs each.</p>
<p>Result: Each job took roughly one hour (they all ran in parallel, so the total time spent was 1 hour). We ended up with 5 zip files per job, each zip file having ~25MB of data (100MB unpacked), i.e. ~4*5*100MB = 2GB of raw HTML when unpacked for all jobs.</p>
<p><strong>Conclusion</strong></p>
<p>80legs is an interesting service that has already proved useful for us, and we will continue to use it in combination with AWS and EC2. Custom code still needs to be built (e.g. for AJAX crawling).</p>
<p>(May 2000 &#8211; <a href="http://amundtveit.info/publications/2000/iir.php">A few thoughts about the future of Internet Information Retrieval</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/11/04/preliminary-experiences-crawling-with-80legs/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/11/04/preliminary-experiences-crawling-with-80legs/</feedburner:origLink></item>
		<item>
		<title>Unstructured Search for Amazon’s SimpleDB</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/0J5xewjFlaU/</link>
		<comments>http://atbrox.com/2009/10/27/unstructuredsearchforsimpledb/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 14:56:53 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[simpledb]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[structured search]]></category>
		<category><![CDATA[unstructured search]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=371</guid>
		<description><![CDATA[
			
				
			
		
SimpleDB is a service primarily for storing and querying structured data (can e.g. be used for  a product catalog with descriptive features per products, or an academic event service with extracted features such as event dates, locations, organizers and topics). (If one wants &#8220;heavier data&#8221; in SimpleDB, e.g. video or images, a good approach be [...]]]></description>
			<content:encoded><![CDATA[
<p>SimpleDB is a service primarily for storing and querying structured data (it can e.g. be used for a product catalog with descriptive features per product, or an academic event service with extracted features such as event dates, locations, organizers and topics). (If one wants &#8220;heavier data&#8221; in SimpleDB, e.g. video or images, a good approach would be to add paths to Hadoop DFS or S3 objects in the attributes instead of storing the data directly.)</p>
<p><strong>Unstructured Search for SimpleDB</strong></p>
<div>This posting presents an approach for adding (flexible) unstructured search support to SimpleDB (with some preliminary query latency numbers below, and very preliminary Python code). The motivation is to:</div>
<ol>
<li>Support unstructured search with very low maintenance</li>
<li>Combine structured and unstructured search</li>
<li>Figure out the feasibility of unstructured search on top of SimpleDB</li>
</ol>
<p><strong>The Structure of SimpleDB</strong></p>
<p>SimpleDB is roughly a persistent hashtable of hashtables, where each row (a named item in the outer hashtable) has another hashtable with up to 256 key-value pairs (called attributes). Each attribute value can be up to 1024 bytes, so a row can hold 256 kilobytes in total in its values (<em>note: twice that amount if you also store data in the keys, plus 1024 bytes in the item name</em>). Check out <a href="http://en.wikipedia.org/wiki/Amazon_SimpleDB">Wikipedia for detailed SimpleDB storage characteristics</a>.</p>
<p><strong>Inverted files</strong></p>
<p>Inverted files are a common way of representing indices for unstructured search. In their basic form they (logically) contain each word together with a list of the pages or files the word occurs in. When a query arrives, one looks up the query words in the inverted file and finds the pages or files where they occur. (Note: if you are curious about inverted file representations, check out the survey <a href="http://portal.acm.org/citation.cfm?id=1132959">Inverted files for text search engines</a>.)</p>
<p>One way of representing inverted files on SimpleDB is to map the inverted file on top of the attributes, i.e. have one SimpleDB domain with one item per word (term), and let the attributes of that item store the list of URLs containing the word. Since each URL contains many words, it can be useful to have a separate SimpleDB domain containing a mapping from hash of URL to URL, and to use the URL hash in the inverted file (this keeps the inverted file smaller). In the draft code we created 250 key-value attributes where each key was a string from &#8220;0&#8221; to &#8220;249&#8221; and each corresponding value contained hashes of URLs (and positions of the term) joined with two different string separators. If there is too little space per item &#8211; e.g. for stop words &#8211; one could &#8220;wrap&#8221; the inverted file entry by adding the same term combined with an incremental postfix (note: if that also gave too little space, one could also wrap across SimpleDB domains).</p>
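<p>To make the layout concrete, here is a minimal sketch of how the postings for one term could be packed into attribute values. It is not the draft code linked below: the separators, the MD5-prefix URL hash and all function names are assumptions chosen for illustration, and no SimpleDB calls are made.</p>

```python
import hashlib

MAX_ATTRIBUTES = 250     # keys "0" to "249", as described above
MAX_VALUE_BYTES = 1024   # size limit per attribute value
POSTING_SEP = "|"        # assumed separator between postings
POSITION_SEP = ":"       # assumed separator between URL hash and term positions


def url_hash(url):
    """Short fixed-length stand-in for a URL (assumption: 12 hex chars of MD5)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:12]


def encode_posting(url, positions):
    """One posting: the URL hash followed by the term positions in that URL."""
    fields = [url_hash(url)]
    fields.extend(str(p) for p in positions)
    return POSITION_SEP.join(fields)


def pack_postings(postings):
    """Greedily pack encoded postings into values that respect the size limit."""
    values = []
    current = ""
    for posting in postings:
        if len(posting.encode("utf-8")) > MAX_VALUE_BYTES:
            raise ValueError("a single posting exceeds the attribute size limit")
        candidate = posting if not current else current + POSTING_SEP + posting
        if len(candidate.encode("utf-8")) > MAX_VALUE_BYTES:
            values.append(current)   # current value is full, start a new one
            current = posting
        else:
            current = candidate
    if current:
        values.append(current)
    if len(values) > MAX_ATTRIBUTES:
        raise ValueError("entry does not fit in one item and needs wrapping")
    return {str(i): value for i, value in enumerate(values)}


def unpack_postings(attributes):
    """Inverse of pack_postings: a list of (url_hash, positions) tuples."""
    result = []
    for key in sorted(attributes, key=int):
        for posting in attributes[key].split(POSTING_SEP):
            parts = posting.split(POSITION_SEP)
            result.append((parts[0], list(map(int, parts[1:]))))
    return result
```

<p>The dictionary returned by pack_postings is what would be stored as the attributes of the item named by the term; the ValueError for more than 250 values marks the point where the &#8220;wrapping&#8221; described above would have to take over.</p>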
<p><strong>Preliminary query latency results </strong></p>
<p>Warning: The data set used was <a href="http://nltk.org">NLTK</a>&#8217;s inaugural collection, so it is far from the biggest.</p>
<p><img class="alignnone size-full wp-image-376" title="Inverted File Entry Fetch latency Distribution (in seconds)" src="http://atbrox.com/wp-content/uploads/2009/10/simpledb_-_inverted_file_fetchtime_distribution-1.png" alt="Inverted File Entry Fetch latency Distribution (in seconds)" width="450" height="320" /></p>
<p><strong>Conclusion</strong>: the results from 1000 fetches of inverted file entries are relatively stable, clustered around 0.020s (20 milliseconds), which is promising enough to pursue further (but it is still too early to decide, given that we have only tested on small data sets so far). Combining this with e.g. memcached could also be explored, in order to get the average fetch time even lower.</p>
<p><a href="http://atbrox.com/wp-content/uploads/2009/10/sdbsearch1.tgz">Preliminary Python code</a> including timing results (this was run on a Fedora large EC2 node somewhere in a US east coast data center).</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/10/27/unstructuredsearchforsimpledb/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/10/27/unstructuredsearchforsimpledb/</feedburner:origLink></item>
		<item>
		<title>How to use C++ Compiled Python for Amazon’s Elastic Mapreduce (Hadoop)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/DJXaWhK_9D4/</link>
		<comments>http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 08:35:36 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[shedskin]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=275</guid>
		<description><![CDATA[
			
				
			
		
Sometimes it can be useful to compile Python code for Amazon&#8217;s Elastic Mapreduce into C++ and then into a binary. The motivation for that could be to integrate with (existing) C or C++ code, or increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:

Start a small EC2 node with [...]]]></description>
			<content:encoded><![CDATA[
<p>Sometimes it can be useful to compile Python code for Amazon&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/">Elastic Mapreduce</a> into C++ and then into a binary. The motivation for that could be to integrate with (existing) C or C++ code, or increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:</p>
<ol>
<li>Start a <a href="http://aws.amazon.com/ec2/#instance">small EC2 node</a> with an AMI similar to the one <a href="http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/DeveloperGuide/index.html?introduction.html">Elastic Mapreduce is using</a> (<a href="http://www.debian.org/releases/stable/">Debian Lenny Linux</a>)</li>
<ul>
<li>note: <a href="http://atbrox.com/about/">We</a> used <a href="http://alestic.com/">Alestic</a>&#8217;s <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1615&#038;categoryID=101">ami-ff46a796</a>
         </ul>
<li>Skim quickly through the <a href="http://shedskin.googlecode.com/files/shedskin-tutorial-0.2.html">Shedskin tutorial</a>
<li>Log into the EC2 node and install the <a href="http://code.google.com/p/shedskin/">Shedskin Python compiler</a></li>
<ul>
<li>Shedskin requires a few libraries: 1) <a href="http://www.hpl.hp.com/personal/Hans_Boehm/gc/">the Boehm-Demers-Weiser garbage collector for C++</a>, 2) <a href="http://www.pcre.org/">PCRE &#8211; Perl Compatible Regular Expressions</a>. See the <a href="http://shedskin.googlecode.com/files/shedskin-tutorial-0.2.html">Shedskin tutorial</a> for detailed install instructions.
<li>note: The Alestic Debian AMI is fairly slim, so we had to add some more software to make Shedskin work, i.e. GDB
        </ul>
<li>Write your Python mapper or reducer program and compile it into C++ with Shedskin</li>
<ul>
<li>E.g. the command <em>python ss.py mapper.py</em> would generate the C++ files <em>mapper.hpp</em> and <em>mapper.cpp</em>, a <em>Makefile</em> and an annotated Python file <em>mapper.ss.py</em>.
        </ul>
<li>Optionally update the C++ code generated by Shedskin to use other C or C++ libraries</li>
<ul>
<li>note: with <a href="http://en.wikipedia.org/wiki/F2c">Fortran-to-C</a> you can probably integrate your Python code with existing Fortran code (e.g. numerical/high performance computing libraries). Similarly for Cobol (e.g. in the financial industry) with <a href="http://www.opencobol.org/">OpenCobol</a> (compiling Cobol into C). Please <a href="http://atbrox.com/about/">let us know</a> if you try this or need help with it.
         </ul>
<li>Add -static as the first CCFLAGS parameter in the generated Makefile to make it a static executable
<li>Compile the C++ code into a binary with make, and use ldd to check that you did not get a dynamic executable (you want a static executable)
<li>Run <a href="http://en.wikipedia.org/wiki/Strip_(Unix)">strip</a> on the binary to make it smaller
<li>Upload your (ready) binary to a chosen location in Amazon S3
<ul>
<li>e.g. via commandline with <a href="http://s3tools.org/s3cmd">S3CMD</a>, with a UI using <a href="http://s3fox.net/">S3Fox</a> or <a href="http://cloudberrylab.com/?id=7">Cloudberry S3 Explorer</a> or programmatically with <a href="http://code.google.com/p/boto/">Boto</a>.
         </ul>
<li>Read <a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/">Elastic Mapreduce Documentation</a> on how to use the binary to run Elastic Mapreduce jobs.
<ul>
<li>note: <a href="http://twitter.com/peteskomoroch">Peter Skomoroch</a> has written a <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294">good tutorial for Elastic Mapreduce</a>
         </ul>
</ol>
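<p>As a rough illustration of the kind of mapper the steps above start from, here is a minimal word-count sketch written in a plain, statically inferable style of Python. The function names and the sample input are our own assumptions for illustration, and we have not checked that this exact file compiles with every Shedskin version.</p>

```python
import sys


def map_line(line):
    """Turn one input line into a list of (word, 1) pairs."""
    pairs = []
    for word in line.strip().split():
        pairs.append((word.lower(), 1))
    return pairs


def run_mapper(lines, out):
    """Write tab-separated key/value pairs, one per line."""
    for line in lines:
        for word, count in map_line(line):
            out.write("%s\t%d\n" % (word, count))


# Hypothetical sample input; a streaming job would pass sys.stdin instead.
sample = ["to be or not to be"]
run_mapper(sample, sys.stdout)
```

<p>In a real streaming job you would replace <em>sample</em> with <em>sys.stdin</em>, so that the compiled binary reads its input records from standard input and writes key/value pairs to standard output.</p>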
<p>Note: if you skip the shedskin-related steps this approach would also work if you are looking for how to use C or C++ mappers or reducers with Elastic Mapreduce.</p>
<p>Note: this approach should probably also work with <a href="http://www.cloudera.com/blog/2009/09/10/cdh2-clouderas-distribution-for-hadoop-2/">Cloudera&#8217;s distribution for Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/</feedburner:origLink></item>
		<item>
		<title>Hadoop World 2009 – some notes from application session</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/4Ar0j0Ubxn0/</link>
		<comments>http://atbrox.com/2009/10/03/hadoop-world-2009-notes-from-application-session/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 19:01:32 +0000</pubDate>
		<dc:creator>amund</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[finance]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopworld]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=266</guid>
		<description><![CDATA[
			
				
			
		
Other recommended writeups :

Hadoop World NYC (Hilary Mason)
The View from HadoopWorld (Stephen O&#8217;Grady)
Post Hadoop World Thoughts (Deepak Singh)
Hadoop World, NYC 2009 (Dan Milstein)
Hadoop World Impressions (Steve Laniel)

&#8212;
Location: Roosevelt Hotel, NYC
1235 Joe Cunningham &#8211; Visa &#8211; Large scale transaction analysis
 &#8211; responsible for Visa Technology Strategy and Innovation
been playing with Hadoop for 9 months
probably many in [...]]]></description>
			<content:encoded><![CDATA[
<p>Other recommended writeups:</p>
<ul>
<li><a href="http://www.hilarymason.com/blog/hadoop-world-nyc/">Hadoop World NYC</a> (Hilary Mason)</li>
<li><a href="http://redmonk.com/sogrady/2009/10/02/hadoopworld/">The View from HadoopWorld</a> (Stephen O&#8217;Grady)</li>
<li><a href="http://mndoci.com/2009/10/03/post-hadoop-world-thoughts/trackback/">Post Hadoop World Thoughts</a> (Deepak Singh)</li>
<li><a href="http://dev.hubspot.com/bid/27047/Hadoop-World-NYC-2009?source=BlogTwitter_[Hadoop+World,+NYC+20]">Hadoop World, NYC 2009</a> (Dan Milstein)</li>
<li><a href="http://dev.hubspot.com/bid/27054/Hadoop-World-impressions">Hadoop World Impressions</a> (Steve Laniel)</li>
</ul>
<p>&#8212;</p>
<p>Location: Roosevelt Hotel, NYC</p>
<p><strong>1235 Joe Cunningham &#8211; Visa &#8211; Large scale transaction analysis<br />
</strong> &#8211; responsible for Visa Technology Strategy and Innovation<br />
been playing with Hadoop for 9 months<br />
probably many in audience learning and starting out with Hadoop</p>
<p>Agenda:<br />
1) VisaNet overview<br />
2) Value-added information products<br />
3) Hadoop@Visa &#8211; research results</p>
<p>About Visa:<br />
- 60 Billion market cap<br />
- well-known card products, and also behind-the-scenes information products<br />
- Visa brand has high trust<br />
- For a card-holder a Visa-card means global acceptance<br />
- For a shop owner, if you get a Visa payment approval you will be paid</p>
<p>VisaNet<br />
VisaNet is the largest, most advanced payment network in the world<br />
characteristics:<br />
28M locations,<br />
130M authorizations/day,<br />
1500 endpoints,<br />
Processes transactions faster than 1s<br />
1.4M ATMs,<br />
Processes in 175 currencies,<br />
Less than 2s unavailability per year (!)<br />
- according to my calculations seven 9s (0.9999999366)<br />
16300 financial institutions</p>
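<p>The availability figure follows from dividing the yearly downtime by the number of seconds in a year. A small sketch of that calculation, assuming a 365-day year (the function names are ours):</p>

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def availability(downtime_seconds_per_year):
    """Fraction of the year the system is available."""
    return 1.0 - downtime_seconds_per_year / SECONDS_PER_YEAR


def count_nines(downtime_seconds_per_year):
    """Number of leading 9s in the availability figure."""
    unavailable = downtime_seconds_per_year / SECONDS_PER_YEAR
    return int(math.floor(-math.log10(unavailable)))


print("%.10f" % availability(2))
print(count_nines(2))
```

<p>For 2 seconds of downtime per year the unavailable fraction is about 6.3e-8, which corresponds to seven 9s.</p>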
<p>Visa Processing Architecture<br />
Security/Access Services -&gt; Message|File|Web<br />
VisaNet Services Integration -&gt; Authorization|Clearing&amp;Settlement<br />
Dispute handling, Risk, Information<br />
Scoring every transaction (used for issuer to approve/decline transaction)</p>
<p>Value added Info products<br />
- Info services<br />
Client: Portfolio Analysis, Visa Incentive Network<br />
Accountholder: transaction alerts, account updater, tailored rewards<br />
- Risk management services<br />
Account monitoring<br />
Authentication<br />
Encryption</p>
<p>Hadoop@Visa<br />
Run a pipeline of prototypes in lab facility in SF<br />
Any technology taken into Visa needs to match scalability and reliability requirements</p>
<p>Research Lab Setup<br />
- VM System:<br />
Custom Analytic Stacks<br />
Encryption Processing<br />
Relational Database<br />
- Hadoop Systems<br />
Management Stack<br />
Hadoop #1  ~40TB / 42 nodes (2 years of raw transaction data)<br />
Hadoop #2 ~300TB / 28 nodes</p>
<p>Risk Product Use Case<br />
Create critical data model elements, such as keys and transaction statistics, which feed our real-time risk-scoring systems<br />
Input: Transactions &#8211; Merchant Category, Country/Zip<br />
Output: Key &amp; Statistics &#8211; MCCZIP Key &#8211; stats related to account, trans. type, approval, fraud, IP address etc.<br />
Research Sample: 500M distinct accounts, 100M transactions per day, 200 bytes per transaction, 2 years &#8211; 73B transactions (36TB)<br />
Processing time from 1 month to 13 minutes! (note: ~3000 times faster)<br />
(Generate synthetic transactions used to test the model)</p>
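<p>A quick sanity check of the transaction count and the speedup quoted above. This is only arithmetic on the numbers from the talk; treating &#8220;1 month&#8221; as 30 days is our assumption.</p>

```python
TRANSACTIONS_PER_DAY = 100_000_000
DAYS = 2 * 365  # two years of data

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month
MINUTES_ON_HADOOP = 13


def speedup(before_minutes, after_minutes):
    """How many times faster the new processing time is."""
    return before_minutes / after_minutes


total_transactions = TRANSACTIONS_PER_DAY * DAYS
print(total_transactions)
print(round(speedup(MINUTES_PER_MONTH, MINUTES_ON_HADOOP)))
```

<p>This gives 73 billion transactions and a speedup of a little over 3300 times, in line with the &#8220;~3000 times faster&#8221; note.</p>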
<p>Financial Enterprise Fit<br />
- key questions under research:<br />
- what will the Hadoop Solution Stack(s) look like?<br />
- File system, Transaction Sample System, Relational Back-end (integration path), Analytics Processing<br />
- Internal vs external cloud<br />
- How do I get data into a cloud in a secure way?<br />
- How do HSM and security integration work in Hadoop?<br />
- What are the missing pieces?</p>
<p>Why Hadoop@Visa?<br />
- analyze volumes of data with response times that are not possible today<br />
- requirement: need to fit with existing solutions</p>
<p><strong>Cross Data Center Log Processing &#8211; Stu Hood, Rackspace</strong></p>
<p>(Email and apps division, work on search team)</p>
<p>Agenda<br />
Use Case Background<br />
- &#8220;Rackapps&#8221; &#8211; Hybrid Mail Hosting, 40% use a mix of Exchange and Rackspace mail</p>
<p>Use Case: Log Types</p>
<p>Use Case: Querying<br />
- was the mail delivered?<br />
- spam &#8211; why was it (not) marked as spam<br />
- access &#8211; who checked/failed to check mail?<br />
more advanced questions:<br />
- which delivery routes have the highest latency?<br />
- which are the spammiest IPs?<br />
- where in the world do customers log in from?<br />
Elsewhere:<br />
- billing</p>
<p>Previous Solutions<br />
- 1999-2006 &#8211; go to where log files are generated, querying with grep<br />
- 2006-2007 &#8211; bulk load to MySQL &#8211; worked for a year</p>
<p>Hadoop Solution<br />
- V3 &#8211; Lucene indexes in Hadoop<br />
- 2007-present<br />
- store 7 days uncompressed<br />
- queries take seconds<br />
- long term queries with mapreduce (6 months available for MR queries)<br />
- all 3 datacenters</p>
<p>Alternatives considered:<br />
- Splunk &#8211; good for realtime, but not great for archiving<br />
- Data warehouse package &#8211; not realtime, but fantastic for longterm analysis<br />
- Partitioned MySQL &#8211; half-baked solution<br />
=&gt; Hadoop hit the sweet spot</p>
<p>Hadoop Implementation<br />
SW<br />
- collect data using syslog-ng (considering Scribe)<br />
- storage: deposits into Hadoop (scribe will remove that)<br />
HW<br />
- 2-4 collector machines per datacenter<br />
- hundreds of source machines<br />
- 20 Solr nodes</p>
<p>Implementation: Indexing/Querying<br />
- indexing &#8211; unique processing code for schema<br />
- querying<br />
- &#8220;realtime&#8221;<br />
- sharded Lucene/Solr instances merge index chunks from Hadoop<br />
- using Solr-API<br />
- raw logs<br />
- using Hadoop Streaming and unix grep<br />
- Mapreduce</p>
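<p>The talk described querying raw logs with Hadoop Streaming and unix grep. As a rough Python equivalent (not what Rackspace ran), a streaming mapper that keeps only matching log lines could be sketched like this. The log format, the pattern and the function name are hypothetical.</p>

```python
import re
import sys


def grep_lines(lines, pattern):
    """Return the lines that match the regular expression, like grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical mail log lines; a streaming job would read sys.stdin instead.
sample_log = [
    "2009-10-02 12:00:01 delivered to=alice@example.com",
    "2009-10-02 12:00:02 rejected to=bob@example.com reason=spam",
]

for line in grep_lines(sample_log, r"reason=spam"):
    sys.stdout.write(line + "\n")
```

<p>Since a mapper like this emits matching lines unchanged, such a job can run map-only, with no reducer.</p>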
<p>Implementation: Timeframe<br />
- development &#8211; 1.5 people in 3 months<br />
- deployments &#8211; using Cloudera&#8217;s distribution<br />
- roadblocks &#8211; bumped into job-size limits</p>
<p>Have run close to 1 million jobs on our cluster, and it has not gone down (except for other reasons such as maintenance)</p>
<p>Advantages &#8211; storage<br />
- all storage in one place<br />
Raw logs: 3 days, in HDFS<br />
Indexes: 7 days<br />
Archived Indexes: 6 months</p>
<p>Advantages &#8211; analysis<br />
- Java Mapreduce API<br />
- Apache Pig<br />
- ideal for one-off queries<br />
- Hadoop Streaming</p>
<p>Pig Example &#8211; whitehouse.gov mail spoofing</p>
<p>Advantages &#8211; Scalability, Cost, Community<br />
- scalability &#8211; easy to add nodes<br />
- cost &#8211; only hardware<br />
- community &#8211; Cloudera has been a benefit, deployment is trivial</p>
<p><strong>Data Processing for Financial Services &#8211; Peter Krey and Sin Lee, JP Morgan Chase</strong></p>
<p>Innovation &amp; Shared Services, Firmwide Engineering &amp; Architecture</p>
<p>note: there are certain constraints on what can be shared due to regulations</p>
<p>JPMorgan Chase + Open Source<br />
- Qpid (AMQP) &#8211; top-level Apache project<br />
- Tyger &#8211; Apache + Tomcat + Spring</p>
<p>Hadoop in the Enterprise &#8211; Economics Driven<br />
- attractive: economics<br />
- Many big lessons from Web 2.0 community<br />
- Potential for Large Capex and Opex &#8220;Dislocation&#8221;<br />
- reduce consumption of enterprise premium resources<br />
- grid computing economics brought to data intensive computing<br />
- stagnant data innovation<br />
- Enabling &amp; potentially disruptive platform<br />
- many historical similarities<br />
- java, linux, tomcat, web/internet<br />
- minis to client/server, client/server to web, solaris to linux, ..<br />
- Key question: what can be built on top of Hadoop?<br />
Back to economics driven &#8211; very cost-effective</p>
<p>Hadoop in the Enterprise &#8211; Choice Driven<br />
- Overuse of relational database containers<br />
- institutional &#8220;Muscle memory&#8221; &#8211; not too much else to choose from<br />
- increasingly large percentage of static data stored in proprietary transactional DBs<br />
- Over-Normalized Schemas: do they still make sense with cheap compute &amp; storage?</p>
<p>- Enterprise Storage &#8220;Prisoners&#8221;<br />
- Captive to the economics &amp; technology of &#8220;a few&#8221; vendors<br />
- Developers need more choice<br />
- Too much proprietary, single-source data infrastructure<br />
- increasing need for minimal/no systems + storage admins</p>
<p>Hadoop in the Enterprise &#8211; Other Drivers<br />
- Growing developer interest in &#8220;Reduced RDBMS&#8221; Data technologies<br />
- open source, distributed, non-relational databases<br />
- growing influence of web 2.0 technologies &amp; thinking of enterprise<br />
- hadoop, cassandra, hbase, hive, couchdb, hadoopDB, .. , others<br />
- memcached for caching</p>
<p>FSI Industry Drivers<br />
- Increased regulatory oversight + reporting = More data needed over longer period of time<br />
- triple data amounts from 2007 to 2009<br />
- growing need for less expensive data repository/store<br />
- increased need to support &#8220;one off&#8221; analysis on large data</p>
<p>Active POC Pipeline<br />
- Growing stream of real projects to gauge hadoop &#8220;goodness of fit&#8221;<br />
- broad spectrum of use cases<br />
- driven by need to impact/dislocate OPEX+CAPEX<br />
- looking for orders of magnitude<br />
- evaluated on metric based performance, functional and economic measures<br />
- avoid the &#8220;data falling on the floor&#8221; phenomenon<br />
- tools are really really important, keep tools and programming models simple</p>
<p>Hadoop Positioning<br />
- Latency x Storage amount curve</p>
<p>Cost comparisons<br />
- SAN vs Hadoop HDFS cost comparison (GB/month)<br />
- Hadoop much cheaper</p>
<p>Hadoop Additions and Must Haves:<br />
- Improved SQL Front-End Tool Interoperability<br />
- Improved Security &amp; ACL enforcement &#8211; Kerberos Integration<br />
- Grow Developer Programming Model Skill Sets<br />
- Improve Relational Container Integration &amp; Interop for Data Archival<br />
- Management &amp; Monitoring Tools<br />
- Improved Developer &amp; Debugging Tools<br />
- Reduce Latency via integration with open source data caching<br />
- memcached &#8211; others<br />
- Invitation to FSI or Enterprise roundtable</p>
<p><strong>Protein Alignment &#8211; Paul Brown, Booz Allen</strong></p>
<p>Biological information<br />
- Body &#8211; Cells &#8211; Chromosomes &#8211; Gene &#8211; DNA/RNA</p>
<p>Bioinformatics &#8211; The Pain<br />
- too much data</p>
<p>So What? Querying a database of sequences for similar sequences<br />
- one-to-many comparison<br />
- 58000 proteins in PDB<br />
- Protein alignment frequently used in the development of medicines<br />
- Looking for a certain sequence across species, helps indicate function<br />
Implementation in Hadoop<br />
- distribute database sequences across the nodes<br />
- send query seq. inside Mapreduce (or dist.cache)<br />
- scales well<br />
- existing algorithms port easily</p>
<p>So What? Comparing sequences in bulk<br />
- many-to-many<br />
- DNA hybridization (reconstruction)<br />
Ran on AWS<br />
Hadoop:<br />
- if the whole dataset fits on one computer<br />
- Used distributed cache, assign each node a piece of the list<br />
- But if the dataset does not fit on one computer&#8230;<br />
- pre-join all possible pairs with one MapReduce</p>
<p>So What? Analyzing really big sequences<br />
- one big sequence to many small sequences<br />
- scanning dna for structure<br />
- population genetics<br />
- Hadoop implementation</p>
<p>Demonstration Implementation: Smith-Waterman Alignment<br />
- one of the more computationally intensive matching and alignment techniques<br />
- big matrix &#8211; (sequences to compare on row and column and calculations within)</p>
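<p>To make the &#8220;big matrix&#8221; concrete, here is a minimal sketch of the Smith-Waterman local alignment score for two sequences. It is not the speaker&#8217;s implementation: the scoring values (match, mismatch, gap) and the simple linear gap penalty are assumptions chosen for illustration.</p>

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions; gaps use a linear penalty.
    """
    rows = len(a) + 1
    cols = len(b) + 1
    # Scoring matrix: one row per symbol of a, one column per symbol of b.
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                diagonal = scores[i - 1][j - 1] + match
            else:
                diagonal = scores[i - 1][j - 1] + mismatch
            up = scores[i - 1][j] + gap
            left = scores[i][j - 1] + gap
            # Local alignment: a cell never drops below zero.
            cell = max(0, diagonal, up, left)
            scores[i][j] = cell
            if cell > best:
                best = cell
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

<p>Filling the matrix takes time proportional to the product of the two sequence lengths, which is why one-to-many and many-to-many comparisons are attractive to spread over many nodes.</p>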
<p>Amazon implementation<br />
- 250 machines<br />
- EC2<br />
- run in 10 minutes for a single sequence. Runs in 24hrs for NxN comparison<br />
- cost $40/hr</p>
<p>==&gt; very cool 3D video of amazon ec2 nodes<br />
- failing job due to 10% of nodes stuck on something (e.g. very long sequences)</p>
<p><strong>Real-time Business Intelligence, Bradford Stephens</strong></p>
<p>Topics<br />
- Scalability and BI<br />
- Costs and Abilities<br />
- Search as BI</p>
<p>Tools: Zookeeper, Hbase, Katta (dist.search on Hadoop) and Bobo (faceted search for lucene)<br />
- http://sourceforge.net/projects/bobo-browse/<br />
- http://sourceforge.net/projects/katta/develop</p>
<p>100TB structured and unstructured data &#8211; Oracle $100M, Hadoop and Katta $250K</p>
<p>Building data cubes in real time (with faceted search)</p>
<p>Real-time Mapreduce on HBase<br />
Search/BI as a platform &#8211; &#8220;google my datawarehouse&#8221;</p>
<p><strong>Counting, Clustering and other data tricks, Derek Gottfrid, New York Times</strong></p>
<p>back in 2007 &#8211; would like to try as many EC2 instances as possible</p>
<p>Problem<br />
- freeing up historical archives of NYTimes.com (1851-1922)<br />
(in the public domain)</p>
<p>Currently:<br />
- 2009 &#8211; web analytics<br />
3 big data buckets:<br />
1) registration/demographics<br />
2) articles 1851-today<br />
- a lot of metadata about each article<br />
- unique data, extract people, places, .. to each article =&gt; high precision search<br />
3) usage data/web logs<br />
- biggest piece &#8211; piles up</p>
<p>How do we merge the 3 datasets?</p>
<p>Using EC2 &#8211; 20 machines<br />
Hadoop 0.20.0<br />
12 TB of data<br />
Straight MR in Java<br />
(mostly java + postprocessing in python)</p>
<p>combining weblog data with demographic data, e.g. Twitter click-backs by age group</p>
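<p>The talk used straight MapReduce in Java for this. Purely to illustrate the idea of joining the two datasets on a user id, here is a small in-memory sketch in Python. The record layout, field values and function name are hypothetical.</p>

```python
from collections import defaultdict


def clicks_by_age_group(demographics, weblog, referrer="twitter.com"):
    """Count weblog hits from one referrer per age group.

    demographics: list of (user_id, age_group) pairs
    weblog: list of (user_id, referrer) pairs
    """
    age_of = dict(demographics)  # lookup table for the join
    counts = defaultdict(int)
    for user_id, ref in weblog:
        # Keep only hits from the chosen referrer with known demographics.
        if ref == referrer and user_id in age_of:
            counts[age_of[user_id]] += 1
    return dict(counts)


# Hypothetical sample records.
demographics = [("u1", "18-24"), ("u2", "25-34"), ("u3", "18-24")]
weblog = [("u1", "twitter.com"), ("u2", "google.com"),
          ("u3", "twitter.com"), ("u9", "twitter.com")]

print(clicks_by_age_group(demographics, weblog))
```

<p>When the demographic table is too large for memory, the same join can be expressed as a MapReduce job keyed on the user id, with the reducer combining both record types.</p>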
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/10/03/hadoop-world-2009-notes-from-application-session/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/10/03/hadoop-world-2009-notes-from-application-session/</feedburner:origLink></item>
	</channel>
</rss>
