<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>atbrox</title>
	
	<link>http://atbrox.com</link>
	<description />
	<lastBuildDate>Thu, 02 Sep 2010 04:44:43 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/atbrox" /><feedburner:info uri="atbrox" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Recommended Mapreduce Workshop</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/-pVkq904OSo/</link>
		<comments>http://atbrox.com/2010/08/31/recommended-mapreduce-workshop/#comments</comments>
		<pubDate>Tue, 31 Aug 2010 08:02:43 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=1390</guid>
		<description><![CDATA[If you are interested in Hadoop or Mapreduce, I would like to recommend attending or submitting a paper to the First International Workshop on Theory and Practice of Mapreduce (MAPRED&#8217;2010), held in conjunction with the 2nd IEEE International Conference on Cloud Computing Technology and Science. (I just joined the workshop as a program committee member) [...]]]></description>
			<content:encoded><![CDATA[
<p>If you are interested in <a href="http://atbrox.com/hadoop">Hadoop</a> or <a href="http://atbrox.com/mapreduce">Mapreduce</a>, I would like to recommend attending or submitting a paper to the <a href="http://mapreduce.cloudcom.org/">First International Workshop on Theory and Practice of Mapreduce (MAPRED&#8217;2010)</a>, held in conjunction with the <a href="http://2010.cloudcom.org/">2nd IEEE International Conference on Cloud Computing Technology and Science</a>.</p>
<p>(I just joined the workshop as a program committee member)</p>
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a> (co-founder of Atbrox)</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/08/31/recommended-mapreduce-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/08/31/recommended-mapreduce-workshop/</feedburner:origLink></item>
		<item>
		<title>Word Count with MapReduce on a GPU – A Python Example</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/K52GkHF9sVs/</link>
		<comments>http://atbrox.com/2010/08/20/word-count-with-mapreduce-on-a-gpu-a-python-example/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 12:01:57 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cuda]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[nvidia]]></category>
		<category><![CDATA[pycuda]]></category>
		<category><![CDATA[tesla]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=1304</guid>
		<description><![CDATA[Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. A GPU (Graphics Processing Unit) such as the NVIDIA Tesla is fascinating hardware, in particular for its extreme parallelism (hundreds of cores) and memory bandwidth (tens of gigabytes per second). The main programming languages for programming GPUs are [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and research.</em></p>
<p>A GPU (Graphics Processing Unit) such as the <a href="http://www.amazon.com/gp/product/B003WQNUI8?ie=UTF8&#038;tag=amuw-20&#038;linkCode=as2&#038;camp=1789&#038;creative=9325&#038;creativeASIN=B003WQNUI8">NVIDIA Tesla</a> is fascinating hardware, in particular for its extreme parallelism (hundreds of cores) and memory bandwidth (tens of gigabytes per second). The main languages for programming GPUs are the C-based OpenCL and Nvidia&#8217;s Cuda, and there are wrappers for both in many languages. The following example uses Andreas Klöckner&#8217;s <a href="http://mathema.tician.de/software/pycuda">PyCuda</a> for Python.</p>
<h2>Word Count with PyCuda and MapReduce</h2>
<p>One of the classic mapreduce examples is word frequency count (i.e. individual word frequencies), but let us start with an even simpler example &#8211; word count, i.e. how many words are there in a (potentially big) string?</p>
<p>In Python the default approach would perhaps be:</p>
<pre class="brush: plain;">
wordcount = len(bigstring.split())
</pre>
<p>But assuming that you didn&#8217;t have split() or that split() was too slow, what would you do?</p>
<p><b>How to calculate word count?</b><br />
If you have the string <code>mystring = "this is a string"</code>, you could iterate through it and count the number of spaces, e.g. with</p>
<pre class="brush: plain;">sum([1 for c in mystring if c == ' '])</pre>
<p>(<em>notice the off-by-one error: three spaces separate four words</em>), and perhaps split the string up and parallelize the counting somehow. However, this algorithm fails if the string contains several spaces in a row, and it doesn&#8217;t use the GPU horsepower.</p>
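<p>A minimal sketch of the two problems just mentioned, the off-by-one count and the failure on repeated spaces. The helper <code>count_spaces()</code> and the sample strings are made up for illustration and are not part of the original code:</p>
<pre class="brush: plain;">
```python
def count_spaces(text):
    """Naive word count: assume one word per space character."""
    return sum(1 for c in text if c == ' ')

single = "this is a string"     # 4 words separated by 3 single spaces
double = "this  is a  string"   # same 4 words, but two of the gaps are doubled

naive_single = count_spaces(single)   # one short of the real word count
naive_double = count_spaces(double)   # repeated spaces inflate the count
true_single = len(single.split())     # reference count from split()
true_double = len(double.split())
```
</pre>
<p>Comparing <code>naive_single</code> with <code>true_single</code> shows the missing word, and comparing <code>naive_double</code> with <code>true_double</code> shows how repeated spaces throw the count off.</p>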
<p><b>The MapReduce approach</b><br />
Assuming you still have <code>mystring = "this is a string"</code>, try to align the string <em>almost</em> with itself, i.e. have one string being all characters in <code>mystring</code> except the last &#8211; <code>"this is a strin" == mystring[:-1]</code> (<em>called prefix from here</em>), and another string with all characters in <code>mystring</code> except the first &#8211; <code>"his is a string" == mystring[1:]</code> (<em>called suffix from here</em>), and align those two like this:</p>
<pre class="brush: plain;">
this is a strin # prefix
his is a string # suffix
</pre>
<p>You can see that counting all positions where the character in the upper string (prefix) is a space and the corresponding character in the lower string (suffix) is not a space gives the correct word count (<em>with the same off-by-one error as above, which can be fixed by checking whether the first character is a non-space</em>). This way of counting also handles multiple spaces in a row, which the approach above doesn&#8217;t. It can be expressed in Python with <code>map()</code> and <code>reduce()</code> as:</p>
<pre class="brush: plain;">
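from functools import reduce # built in on Python 2, must be imported on Python 3
mystring = &quot;this is a string&quot;
prefix = mystring[:-1]
suffix = mystring[1:]
# map: 1 where a space is followed by a non-space, else 0
mapoutput = map(lambda x,y: (x == ' ')*(y != ' '), prefix, suffix)
# reduce: add up all the ones
reduceoutput = reduce(lambda x,y: x+y, mapoutput)
wordcount = reduceoutput + (mystring[0] != ' ') # fix off-by-one error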
</pre>
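<p>To check the claim about multiple spaces, the same idea can be wrapped in a function and compared with <code>split()</code>. This is only a sketch: the function name <code>mapreduce_wordcount()</code>, the handling of empty strings and the sample strings are my own assumptions, and it only treats the space character as a separator:</p>
<pre class="brush: plain;">
```python
from functools import reduce  # built in on Python 2, must be imported on Python 3

def mapreduce_wordcount(text):
    """Count words as space-to-non-space transitions (space-separated text only)."""
    if not text:
        return 0
    prefix, suffix = text[:-1], text[1:]
    # map: 1 where a space is followed by a non-space, else 0
    mapoutput = map(lambda x, y: int(x == ' ' and y != ' '), prefix, suffix)
    # reduce: add up the transitions, starting from the neutral value 0
    transitions = reduce(lambda x, y: x + y, mapoutput, 0)
    # a word starting at position 0 has no space in front of it
    return transitions + int(text[0] != ' ')

samples = ["this is a string", "this  is a  string", " leading space", "word", "trailing "]
results = [mapreduce_wordcount(s) for s in samples]
reference = [len(s.split()) for s in samples]  # what split() reports
```
</pre>
<p>For these samples <code>results</code> and <code>reference</code> agree, including the strings with doubled, leading and trailing spaces.</p>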
<p><strong>Mapreduce with PyCuda</strong></p>
<p>PyCuda lets you use Python and the numpy library with Cuda, and it also has a library for mapreduce-type calls on data structures loaded onto the GPU (typically arrays). Below is my complete code for calculating word count with PyCuda. As test dataset I used the complete works of Shakespeare (<a href="http://manybooks.net/titles/shakespeetext94shaks12.html">downloaded as plain text</a>), replicated a hundred times, 493820800 bytes in total (~1/2 gigabyte). I uploaded it to our Nvidia Tesla C1060 GPU and ran word count on it (the results were compared with the unix command line tool wc and with len(dataset.split()) for smaller datasets).</p>
<pre class="brush: plain;">
import pycuda.autoinit
import numpy
from pycuda import gpuarray, reduction
import time

def createCudaWordCountKernel():
    initvalue = &quot;0&quot; # neutral element of the sum
    mapper = &quot;(a[i] == 32)*(b[i] != 32)&quot; # 32 is the ascii code for space
    reducer = &quot;a+b&quot; # add up the ones emitted by the mapper
    cudafunctionarguments = &quot;char* a, char* b&quot;
    # note: float32 represents integers exactly only up to 2**24 (16777216),
    # so very large counts may get rounded
    wordcountkernel = reduction.ReductionKernel(numpy.float32, neutral = initvalue,
                                            reduce_expr=reducer, map_expr = mapper,
                                            arguments = cudafunctionarguments)
    return wordcountkernel

def createBigDataset(filename):
    print &quot;reading data&quot;
    dataset = file(filename).read()
    print &quot;creating a big dataset&quot;
    words = &quot; &quot;.join(dataset.split()) # in order to get rid of \t and \n
    chars = [ord(x) for x in words]
    bigdataset = []
    for k in range(100):
        bigdataset += chars
    print &quot;dataset size = &quot;, len(bigdataset)
    print &quot;creating numpy array of dataset&quot;
    bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8)
    return bignumpyarray

def wordCount(wordcountkernel, bignumpyarray):
    print &quot;uploading array to gpu&quot;
    gpudataset = gpuarray.to_gpu(bignumpyarray)
    datasetsize = len(bignumpyarray)
    start = time.time()
    wordcount = wordcountkernel(gpudataset[:-1],gpudataset[1:]).get()
    stop = time.time()
    seconds = (stop-start)
    estimatepersecond = (datasetsize/seconds)/(1024*1024*1024)
    print &quot;word count took &quot;, seconds*1000, &quot; milliseconds&quot;
    print &quot;estimated throughput &quot;, estimatepersecond, &quot; Gigabytes/s&quot;
    return wordcount

if __name__ == &quot;__main__&quot;:
    bignumpyarray = createBigDataset(&quot;dataset.txt&quot;)
    wordcountkernel = createCudaWordCountKernel()
    wordcount = wordCount(wordcountkernel, bignumpyarray)
    print &quot;word count = &quot;, wordcount
</pre>
<p><strong>Results</strong></p>
<pre class="brush: plain;">
python wordcount_pycuda.py
reading data
creating a big dataset, about 1/2 GB of Shakespeare text
dataset size =  493820800
creating numpy array of dataset
uploading array to gpu
word count took  38.4578704834  milliseconds
estimated throughput  11.9587084015  Gigabytes/s (95.67 Gigabit/s)
word count =  89988104.0
</pre>
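<p>For checking GPU results on smaller inputs, the same map and reduce expressions can be written with plain numpy on the CPU. This is a sketch under my own assumptions: the function name <code>numpy_wordcount()</code> is hypothetical, the input is assumed to be ASCII text, and unlike the kernel above it adds the correction for the first word and sums into an integer:</p>
<pre class="brush: plain;">
```python
import numpy

def numpy_wordcount(text):
    """CPU reference: same map expression as the kernel, integer sum as reduce."""
    data = numpy.frombuffer(text.encode("ascii"), dtype=numpy.uint8)
    if data.size == 0:
        return 0
    a, b = data[:-1], data[1:]          # prefix and suffix as views, no copy
    mapped = (a == 32) * (b != 32)      # 32 is the ASCII code for space
    # integer accumulator, so large counts are not rounded
    transitions = int(mapped.sum(dtype=numpy.int64))
    return transitions + int(data[0] != 32)  # count a word starting at position 0

check = numpy_wordcount("this  is a  string")
```
</pre>
<p>On space-separated text the result should match <code>len(text.split())</code>, which makes it a convenient reference when the dataset is too large for <code>split()</code> to be comfortable.</p>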
<p><strong>Improvement Opportunities?</strong><br />
There are plenty of improvement opportunities, in particular speeding up the creation of the numpy array &#8211;  <code>bignumpyarray = numpy.array( bigdataset, dtype=numpy.uint8)</code> &#8211; which took almost all of the total time.</p>
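<p>One possible way to speed this up is to skip the Python list of integers entirely and let numpy read the bytes and do the replication. The sketch below assumes the file content is available as raw bytes (e.g. read with <code>open(filename, "rb")</code>); the function name <code>create_big_array()</code> is hypothetical, and I have not timed it on the Shakespeare dataset:</p>
<pre class="brush: plain;">
```python
import numpy

def create_big_array(raw_bytes, copies=100):
    """Build the replicated uint8 array without a Python list of ints."""
    words = b" ".join(raw_bytes.split())                # get rid of tabs and newlines, as before
    chars = numpy.frombuffer(words, dtype=numpy.uint8)  # view the bytes as uint8, no per-character ord()
    return numpy.tile(chars, copies)                    # replicate inside numpy, not in a Python loop

small = create_big_array(b"to be\tor not\nto be", copies=3)
```
</pre>
<p>The returned array has the same content and dtype as the one built from a list of <code>ord()</code> values, so it can be passed to <code>gpuarray.to_gpu()</code> in the same way.</p>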
<p>It is also interesting to notice that this approach doesn&#8217;t gain from using combiners as in Hadoop/Mapreduce. A combiner is basically a reducer that sits at the tail of the mapper and creates partial results when the reduce method is associative and commutative; for all practical purposes it can be compared to an afterburner on a jet engine.</p>
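<p>For readers who haven&#8217;t used combiners, here is a plain Python sketch of what one does for this word count. The function names, the sample text and the chunk size of 8 are assumptions for illustration only; in Hadoop the chunks would be the outputs of separate mapper tasks:</p>
<pre class="brush: plain;">
```python
from functools import reduce  # built in on Python 2, must be imported on Python 3

def mapper(text):
    """Emit 1 for every space followed by a non-space, else 0."""
    return [int(x == ' ' and y != ' ') for x, y in zip(text[:-1], text[1:])]

def combiner(values):
    """Partial reduce on the mapper side; same operation as the reducer."""
    return reduce(lambda x, y: x + y, values, 0)

def reducer(partials):
    """Final reduce over whatever arrives: raw values or partial sums."""
    return reduce(lambda x, y: x + y, partials, 0)

text = "to be or not to be that is the question"
mapped = mapper(text)
# pretend the mapper output is spread over chunks of 8 values (hypothetical split)
chunks = [mapped[i:i + 8] for i in range(0, len(mapped), 8)]
with_combiner = reducer([combiner(chunk) for chunk in chunks])
without_combiner = reducer(mapped)
```
</pre>
<p>Because addition is associative and commutative, both paths give the same sum. With a combiner the reducer only receives one partial sum per chunk instead of every mapped value.</p>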
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a> (Atbrox co-founder)</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/08/20/word-count-with-mapreduce-on-a-gpu-a-python-example/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/08/20/word-count-with-mapreduce-on-a-gpu-a-python-example/</feedburner:origLink></item>
		<item>
		<title>Statistics about Hadoop and Mapreduce Algorithm Papers</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/sVx3K1TXCP0/</link>
		<comments>http://atbrox.com/2010/05/25/statistics-about-hadoop-and-mapreduce-algorithm-papers/#comments</comments>
		<pubDate>Tue, 25 May 2010 16:11:50 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[china mobile]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[yahoo]]></category>
		<category><![CDATA[yandex]]></category>
		<category><![CDATA[zhejiang university]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=1045</guid>
		<description><![CDATA[Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. most popular at spot 1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for [...]]]></description>
			<content:encoded><![CDATA[
<p>Below are statistics on which 20 papers (of <a href="http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/">about 80 papers</a>) were read most in our 3 <a href="http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/">previous postings</a> about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. the most popular paper is at spot 1.</p>
<ol>
<li><a href="http://www.springerlink.com/content/861l014845934682/">MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network</a><br />
authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma  and Xiangyu Zhang &#8211; Zhejiang University</p>
<li><a href="http://portal.acm.org/citation.cfm?id=1620950.1620951">Data-intensive text processing with Mapreduce</a><br />
authors: Jimmy Lin and Chris Dyer &#8211; University of Maryland</p>
<li><a href="http://www.cc.gatech.edu/~zha/CSE8801/ad/p209-chen.pdf">Large-Scale Behavioral Targeting</a><br />
authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)</p>
<li><a href="http://www.wsdm-conference.org/2010/proceedings/docs/p361.pdf">Improving Ad Relevance in Sponsored Search</a><br />
authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)</p>
<li><a href="http://users.cis.fiu.edu/~vagelis/publications/Spatial-MapReduce-SSDBM2009.pdf">Experiences on Processing Spatial Data with MapReduce</a><br />
authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe &#8211; Florida International University</p>
<li><a href="http://portal.acm.org/citation.cfm?id=1779599.1779603">Extracting user profiles from large scale data</a><br />
authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki &#8211; IBM Research, Haifa</p>
<li><a href="http://web2py.iiit.ac.in/publications/default/download/techreport.pdf.a373bbf4a5b76063.4164436c69636b5468726f7567685261746549494954485265706f72742e706466.pdf">Predicting the Click-Through Rate for Rare/New Ads</a><br />
authors: Kushal Dave and Vasudeva Varma &#8211; IIIT Hyderabad</p>
<li><a href="http://www.springerlink.com/content/c621194607866223/">Parallel K-Means Clustering Based on MapReduce</a><br />
authors: Weizhong Zhao, Huifang Ma  and Qing He &#8211; Chinese Academy of Sciences</p>
<li><a href="http://www.springerlink.com/content/l805560670136163/">Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce</a><br />
authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham &#8211; University of Texas at Dallas</p>
<li><a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a><br />
authors: Shimin Chen and Steven W. Schlosser &#8211; Intel Research</p>
<li><a href="http://arxiv.org/ftp/arxiv/papers/1003/1003.0951.pdf">LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems</a><br />
authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)</p>
<li><a href="http://paginas.fe.up.pt/~eol/PUBLICATIONS/2009/Efficient%20clustering%20of%20web-derived%20data%20sets.pdf">Efficient Clustering of Web-Derived Data Sets</a><br />
authors: Luís Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)</p>
<li><a href="http://portal.acm.org/citation.cfm?id=1779599.1779601">A novel approach to multiple sequence alignment using hadoop data grids</a><br />
authors: G. Sudha Sadasivam and G. Baktavatchalam &#8211; PSG College of Technology</p>
<li><a href="http://www.aclweb.org/anthology/D/D09/D09-1098.pdf">Web-Scale Distributional Similarity and Entity Set Expansion</a><br />
authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and  Arkady Borkovsky (Yandex Labs)</p>
<li><a href="http://www.cs.cmu.edu/~zollmann/publications/samt-toolkit.pdf">Grammar based statistical MT on Hadoop</a><br />
authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)</p>
<li><a href="http://jmlr.csail.mit.edu/papers/volume10/newman09a/newman09a.pdf">Distributed Algorithms for Topic Models</a><br />
authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling &#8211; University of California, Irvine</p>
<li><a href="http://portal.acm.org/citation.cfm?id=1631272.1631451">Parallel algorithms for mining large-scale rich-media data</a><br />
authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu &#8211; Google Research</p>
<li><a href="http://www.cs.ubc.ca/~goyal/research/wsdm339-goyal.pdf">Learning Influence Probabilities In Social Networks</a><br />
authors: Amit Goyal,  Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)</p>
<li><a href="http://www.biomedcentral.com/1471-2105/11/S1/S15">MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees</a><br />
authors: Suzanne J Matthews and Tiffani L Williams &#8211; Texas A&#038;M University</p>
<li><a href="http://www.computer.org/portal/web/csdl/doi/10.1109/WKDD.2010.54">User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop</a><br />
authors: Zhi-Dan Zhao and Ming-sheng Shang
</ul>
<p><a href="http://www.linkedin.com/companies/atbrox" ><img src="http://static.linkedin.com/scds/common/u/img/webpromo/btn_cofollow_badge.png" alt="Atbrox on LinkedIn"></a></p>
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a> (Atbrox co-founder)</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/05/25/statistics-about-hadoop-and-mapreduce-algorithm-papers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/05/25/statistics-about-hadoop-and-mapreduce-algorithm-papers/</feedburner:origLink></item>
		<item>
		<title>Towards Cloud Supercomputing</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/ArbDGwB7C2Q/</link>
		<comments>http://atbrox.com/2010/05/24/towards-cloud-supercomputing/#comments</comments>
		<pubDate>Sun, 23 May 2010 22:00:29 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[animoto]]></category>
		<category><![CDATA[cray]]></category>
		<category><![CDATA[dell]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[eli lilly]]></category>
		<category><![CDATA[genentech]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[johnson&johnson]]></category>
		<category><![CDATA[justin.tv]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[mpi]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[rackspace]]></category>
		<category><![CDATA[sun]]></category>
		<category><![CDATA[supercomputing]]></category>
		<category><![CDATA[zynga]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=1054</guid>
		<description><![CDATA[Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Update 2010-July-13: The word "towards" can be removed from the title of this posting today: Amazon just launched cluster compute instances with 10 Gbit/s network bandwidth between nodes (and presents a run that enters top 500 list at 146th [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup company providing <a href="http://atbrox.com/technology/">technology</a> and <a href="http://atbrox.com/services/">services</a> for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and research.</em></p>
<p><font color="#ff0000">Update 2010-July-13:</font> The word "towards" can be removed from the title of this posting today: <a href="http://www.allthingsdistributed.com/2010/07/cluster_compute_instance_amazon_ec2.html">Amazon just launched cluster compute instances with 10 Gbit/s network bandwidth between nodes</a> (and presents a run that enters the Top 500 list at 146th place; I estimate the run to cost ~$20k).</p>
<p><a href="http://www.top500.org/list/2009/11/100">The Top 500 list</a> is for <a href="http://en.wikipedia.org/wiki/Supercomputer">supercomputers</a> what <a href="http://money.cnn.com/magazines/fortune/fortune500/">Fortune 500</a> is for companies. About 80% of the list are supercomputers built by either <a href="http://www.hp.com">Hewlett Packard</a> or <a href="http://www.ibm.com">IBM</a>; other major supercomputing vendors on the list include <a href="http://www.dell.com">Dell</a>, <a href="http://www.sun.com">Sun </a>(<a href="http://www.oracle.com">Oracle</a>), <a href="http://www.cray.com">Cray </a>and <a href="http://www.cray.com">SGI</a>. The <a href="http://www.top500.org/project/linpack">parallel Linpack benchmark result</a> is used as the ranking function for the list position (a derived list &#8211; <a href="http://www.green500.org">green 500</a> &#8211; also includes power efficiency in the ranking). </p>
<p><strong>Trends towards Cloud Supercomputing</strong><br />
To our knowledge the entire top 500 list is currently based on physical supercomputer installations and no cloud computing configurations (i.e. virtual configurations lasting long enough to calculate the linpack benchmark); that will probably change within a few years. There are however trends towards cloud-based supercomputing already (in particular within consumer internet services and pharmaceutical computations). Here are some concrete examples:</p>
<ol>
<li><a href="http://www.zynga.com/">Zynga</a> (online casual games, e.g. Farmville and Mafia Wars)<br />
Zynga uses 12000 <a href="http://aws.amazon.com/ec2/">Amazon EC2 nodes</a> (ref: <a href="http://www.linkedin.com/in/jaymecox">Manager of Cloud Operations at Zynga</a>)</p>
<li><a href="http://www.animoto.com/">Animoto</a> (online video production service)<br />
Animoto scaled from 40 to 4000 EC2 nodes in 3 days (ref: <a href="http://www.rightscale.com/customers/">CTO, Animoto</a>)</p>
<li><a href="http://highscalability.com/blog/2010/3/4/how-myspace-tested-their-live-site-with-1-million-concurrent.html">Myspace</a> (social network)<br />
Myspace simulated 1 million simultaneous users using 800 large EC2 nodes (3200 cores) (ref: <a href="http://highscalability.com/blog/2010/3/4/how-myspace-tested-their-live-site-with-1-million-concurrent.html">highscalability.com</a>)</p>
<li><a href="http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/">New York Times</a><br />
New York Times used hundreds of EC2 nodes to process their archives in 36 hours (ref: <a href="http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/">The New York Times Archives + Amazon Web Services = TimesMachine</a>)</p>
<li><a href="http://reddit.com">Reddit</a> (news service)<br />
Reddit uses 218 EC2 nodes (ref: <a href="http://www.reddit.com/r/IAmA/comments/a2zte/i_run_reddits_servers_and_do_a_bunch_of_other/">I run reddit&#8217;s servers</a>)
</ol>
<p><strong>Examples with (rough) estimates</strong></p>
<ol>
<li><a href="http://justin.tv">Justin.tv</a> (video service)<br />
In October 2009 Justin.tv users watched 50 million hours of video, and their cost (reported earlier) was about 1 penny per user-video-hour. A very rough estimate would be monthly costs of 50M*$0.01 = $500k, i.e. 12*$500k = $6M annually. Assuming that half their costs are computational, this would be about $3M/(24*365*0.085) ~ 4029 EC2 nodes running 24&#215;7 through the year. Since they are a video site, bandwidth is probably a significant fraction of the cost, so we cut the rough estimate in half to around 2000 EC2 nodes.<br />
(ref: <a href="http://www.nytimes.com/2010/01/04/technology/internet/04couch.html">Watching TV Together, Miles Apart</a> and <a href="http://newteevee.com/2007/10/02/justintv-wins-funding-opens-platform/">Justin.tv wins funding, opens platform</a>)</p>
<li><a href="http://www.newsweek.com">Newsweek</a><br />
Newsweek saves up to $500,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $500,000/(24h/day*365d/y*0.085$/h) ~ 670 EC2 nodes running 24&#215;7 through the year (probably a little less due to storage and bandwidth costs).<br />
(ref: <a href="http://www.mediaweek.com/mw/content_display/news/magazines-newspapers/e3ieae2fa145a05b6f7b9de4c1f6d4d6ba1">Newsweek.com Explores Amazon Cloud Computing</a>)</p>
<li><a href="http://www.recovery.gov">Recovery.gov</a><br />
Recovery.gov saves up to $420,000 per year by moving to the cloud. Assuming they cut their spending in half by using the cloud, that would correspond to $420,000/(24h/day*365d/y*0.085$/h) ~ 560 EC2 nodes running 24&#215;7 through the year (probably a little less due to storage and bandwidth costs). (ref: <a href="http://www.smartplanet.com/business/blog/smart-takes/feds-embrace-cloud-computing-move-recoverygov-to-amazon-ec2/6871/">Feds embrace cloud computing; move Recovery.gov to Amazon EC2</a>)
</ol>
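<p>The rough estimates above all follow the same arithmetic: divide a yearly compute budget by the cost of running one node around the clock. A minimal Python sketch of that calculation is shown here. The function and variable names are made up for illustration, and the 0.085 $/h rate is simply the hourly figure used in the estimates above, not a quoted price list.</p>

```python
HOURS_PER_YEAR = 24 * 365        # nodes are assumed to run 24x7 all year
EC2_RATE_USD_PER_HOUR = 0.085    # hourly rate taken from the estimates above


def node_equivalents(annual_compute_usd, rate=EC2_RATE_USD_PER_HOUR):
    """Number of always-on nodes that a yearly compute budget would pay for."""
    return annual_compute_usd / (HOURS_PER_YEAR * rate)


# Justin.tv: 50M viewer-hours per month at about 1 cent per viewer-hour,
# with half of the yearly cost assumed to be computational.
justin_monthly_usd = 50_000_000 * 0.01
justin_annual_usd = 12 * justin_monthly_usd
justin_nodes = node_equivalents(justin_annual_usd / 2)

# Newsweek and Recovery.gov: reported savings taken as the cloud spend,
# under the assumption that the move cut spending in half.
newsweek_nodes = node_equivalents(500_000)
recovery_nodes = node_equivalents(420_000)

for name, nodes in [
    ("Justin.tv", justin_nodes),
    ("Newsweek", newsweek_nodes),
    ("Recovery.gov", recovery_nodes),
]:
    print(f"{name}: about {nodes:.0f} node equivalents")
```

<p>These are upper bounds on compute only: storage and bandwidth costs would lower each figure, which is why the Justin.tv estimate is halved again in the text.</p>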
<p><strong>Other examples of Cloud Supercomputing</strong></p>
<ol>
<li>Pharmaceutical companies <a href="http://www.lilly.com">Eli Lilly</a>, <a href="http://www.jnj.com/connect/">Johnson &#038; Johnson</a> and <a href="http://www.gene.com">Genentech</a><br />
Offloading computations to the cloud (ref: <a href="http://www.hpcwire.com/blogs/Biotech-HPC-in-the-Cloud-46965352.html">Biotech HPC in the Cloud</a> and <a href="http://pubs.acs.org/cen/coverstory/87/8721cover.html">The new computing pioneers</a>)</p>
<li>Pathwork Diagnostics<br />
Using EC2 for cancer diagnostics (ref: <a href="http://www.hpcwire.com/features/Of-Unknown-Origin-Diagnosing-Cancer-in-the-Cloud-40305727.html">Of Unknown Origin: Diagnosing Cancer in the Cloud</a>)</p>
</ol>
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a></p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/05/24/towards-cloud-supercomputing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/05/24/towards-cloud-supercomputing/</feedburner:origLink></item>
		<item>
		<title>Mapreduce &amp; Hadoop Algorithms in Academic Papers (3rd update)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/UjoLBtFnjFM/</link>
		<comments>http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/#comments</comments>
		<pubDate>Sat, 08 May 2010 16:14:41 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[yahoo]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=971</guid>
		<description><![CDATA[Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce. This posting is the May 2010 update to the similar posting from February 2010, with 30 new papers compared to the prior posting, new [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and research. <a href="http://atbrox.com/">Contact us</a> if you need help with algorithms for mapreduce.</em></p>
<p>This posting is the May 2010 update to the <a href="http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/">similar posting from February 2010</a>, with 30 new papers compared to the prior posting. New ones are marked with <span style="color: #ff0000;"><strong>*</strong></span>. </p>
<p><strong>Motivation</strong><br />
Learn from academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.</p>
<p><strong>Which areas do the papers cover?</strong></p>
<ul>
<strong>Ads Analysis</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.wsdm-conference.org/2010/proceedings/docs/p361.pdf">Improving ad relevance in sponsored search</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://web2py.iiit.ac.in/publications/default/download/techreport.pdf.a373bbf4a5b76063.4164436c69636b5468726f7567685261746549494954485265706f72742e706466.pdf">Predicting the Click-Through Rate for Rare/New Ads</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.cs.ubc.ca/~goyal/research/wsdm339-goyal.pdf">Learning Influence Probabilities in Social Networks</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://pages.stern.nyu.edu/~narchak/wfp0828-archak.pdf">Mining advertiser-specific user behavior using adfactors</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1779599.1779603">Extracting user profiles from large scale data</a><br />
<a href="http://www.cc.gatech.edu/~zha/CSE8801/ad/p209-chen.pdf"> Large-Scale Behavioral Targeting</a> (2009)<br />
<a href="http://research.yahoo.com/files/cikm2008-search%20advertising.pdf "> Search Advertising using Web Relevance Feedback</a> (2008)<br />
<a href="http://research.yahoo.com/workshops/troa-2008/papers/submission_12.pdf"> Predicting Ads’ ClickThrough Rate with Decision Rules </a>(2008)</p>
<p><strong>Bioinformatics/Medical Informatics</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1779599.1779601">A novel approach to multiple sequence alignment using hadoop data grids</a><br />
<a href="http://www.springerlink.com/content/861l014845934682/">MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network</a> (2009)<br />
<a href="http://www.biomedcentral.com/1471-2105/11/S1/S15">MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees</a></p>
<p><strong>Machine Translation</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://ufal.mff.cuni.cz/pbml/93/art-gao-vogel.pdf">Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski</a><br />
<a href="http://www.cs.cmu.edu/~zollmann/publications/samt-toolkit.pdf"> Grammar based statistical MT on Hadoop</a> (2009)<br />
<a href="http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf"> Large Language Models in Machine Translation</a> (2008)</p>
<p><strong>Spatial Data Processing</strong><br />
<a href="http://users.cis.fiu.edu/~vagelis/publications/Spatial-MapReduce-SSDBM2009.pdf">Experiences on Processing Spatial Data with MapReduce</a></p>
<p><strong>Information Extraction and Text Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.cs.uchicago.edu/files/ms_paper/soner.pdf">Statistical Sentence Chunking Using Map Reduce</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1620950.1620951">Data-intensive text processing with MapReduce</a><br />
<a href="http://www.aclweb.org/anthology/D/D09/D09-1098.pdf"> Web-Scale Distributional Similarity and Entity Set Expansion</a> (2009)<br />
<a href="http://www.aclweb.org/anthology-new/D/D09/D09-1071.pdf"> The infinite HMM for unsupervised PoS tagging</a> (2009)</p>
<p><strong>Artificial Intelligence/Machine Learning/Data Mining</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://arxiv.org/pdf/1003.0951">LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://cseweb.ucsd.edu/~kyocum/pubs/socc122-logothetis.pdf">Stateful Bulk Processing for Incremental Analytics</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.signatures.nu/papers/Mining%20Dependency%20in%20Distributed%20Systems%20through%20Unstructured%20Logs%20Analysis.pdf">Mining dependency in distributed systems through unstructured logs analysis</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.zib.de/andrzejak/my-papers/MDAC2010-(draft).pdf">Beyond online aggregation: parallel and incremental data mining with online mapreduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://researchweb.iiit.ac.in/~jaideep/adm_ctrl.pdf">Learning based opportunistic admission control algorithm for mapreduce as a service</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.cs.vu.nl/~frankh/postscript/ESWC10.pdf">OWL reasoning with WebPIE: calculating the closure of 100 billion triples</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.illigal.uiuc.edu/pub/papers/IlliGALs/2010001.pdf">Scaling ECGA model building via data-intensive computing</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1779605">SPARQL basic graph pattern processing with iterative mapreduce</a><br />
<a href="http://www.cs.cmu.edu/~ylow/paraml_aistats2009.pdf">Residual Splash for Optimally Parallelizing Belief Propagation</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1646301">Stochastic gradient boosted distributed decision trees</a><br />
<a href="http://jmlr.csail.mit.edu/papers/volume10/newman09a/newman09a.pdf">Distributed Algorithms for Topic Models</a><br />
<a href="http://verma7.com/wp/wp-content/uploads/2009/10/meandre-mapreduce.pdf">When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing</a><br />
<a href="http://www.springerlink.com/content/m28617946158t788/">Cloud Computing Boosts Business Intelligence of Telecommunication Industry</a><br />
 <a href="http://www.springerlink.com/content/c621194607866223/">Parallel K-Means Clustering Based on MapReduce</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1631067">Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1631272.1631451">Parallel algorithms for mining large-scale rich-media data</a><br />
<a href="http://verma7.com/wp/wp-content/uploads/2009/09/CS597_Spring09_GA.pdf">Scaling Simple and Compact Genetic Algorithms using MapReduce</a><br />
<a href="http://www.cs.vu.nl/~frankh/postscript/ISWC09.pdf">Scalable Distributed Reasoning using Mapreduce</a><br />
<a href="http://www.cse.nd.edu/~dthain/papers/classify-icdm08.pdf"> Scaling Up Classifiers to Cloud Computers</a> (2008)</p>
<ul>
<li>For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">previous blog post</a>.</li>
</ul>
<p><strong>Search Query Analysis</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1779599.1779607">Parallelizing Random Walk with Restart for large-scale query recommendation</a><br />
<a href="http://research.microsoft.com/apps/pubs/default.aspx?id=80592"> BBM: Bayesian Browsing Model from Petabyte-scale Data</a> (2009)<br />
<a href="http://portal.acm.org/citation.cfm?id=1559990&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492464&amp;CFTOKEN=33063869"> AIDE: Ad-hoc Intents Detection Engine over Query Logs </a>(2009)</p>
<p><strong>Information Retrieval (Search)</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.cis.upenn.edu/~zives/research/auto-integrate.pdf">Automatically Incorporating New Sources in Keyword Search-Based Data Integration</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://nlp.cs.nyu.edu/pubs/papers/sekine-ngram10.pdf">Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.wsdm-conference.org/2010/proceedings/docs/p381.pdf">Learning URL patterns for webpage de-duplication</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www-users.cs.umn.edu/~echi/papers/2010-IUI/tagsearch-ASC-PARC.pdf">Information Seeking with Social Signals: Anatomy of a Social Tag-based EXploratory Search Browser</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://eprints.eemcs.utwente.nl/17797/01/mirex.pdf">MIREX: Mapreduce Information Retrieval Experiments</a><br />
<a href="http://paginas.fe.up.pt/~eol/PUBLICATIONS/2009/Efficient%20clustering%20of%20web-derived%20data%20sets.pdf">Efficient Clustering of Web Derived Data Sets</a><br />
<a href="http://web.phys.ntu.edu.tw/phystalks/Theory_seminar_Fall_2009/PageRank_PingYeh.pdf">The PageRank algorithm and application on searching of academic papers</a><br />
<a href="http://www.springerlink.com/content/h411850464229625/">A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1572106&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492520&amp;CFTOKEN=63253841"> On Single-Pass Indexing with MapReduce</a> (2009)<br />
<a href="http://bhavik.me/docs/Paper.pdf"> A Data Parallel Algorithm for XML DOM Parsing</a> (2009)<br />
<a href="http://www.springerlink.com/content/t607305788356537/"> Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web</a> (2008)</p>
<p><strong>Spam &amp; Malware Detection</strong><br />
<a href="http://www.usenix.org/event/leet08/tech/full_papers/zhuang/zhuang.pdf">Characterizing Botnets from Email Spam Records</a> (2008)<br />
- Clustering of emails into spam campaigns<br />
- Finding the probability that 2 spam messages are sent from the same machine<br />
- Estimating the likelihood of botnets based on common senders in spam campaigns<br />
<a href="http://www.usenix.org/event/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost in the Browser: Analysis of Web-based Malware</a> (2007)</p>
<p><strong>Image and Video Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.hpl.hp.com/techreports/2009/HPL-2009-181.pdf">Font rendering on a GPU-based raster image processor</a><br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Video Stream Re-Rendering<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Location detection in images</p>
<p><strong>Networking</strong><br />
<a href="http://wwwse.inf.tu-dresden.de/papers/preprint-pfeifer2008reducible.pdf">Reducible Complexity in DNS</a></p>
<p><strong>Simulation</strong><br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Simulation of earthquakes (geology)</p>
<p><strong>Statistics</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.computer.org/portal/web/csdl/doi/10.1109/WKDD.2010.54">User-based collaborative filtering recommendation algorithms on hadoop</a><br />
<a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_SIGIR2009.pdf">Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce</a> (2009)<br />
<a href="http://thepublicgrid.org/papers/koufakou_wcci_08.pdf">Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce</a> (2009)<br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Digg.com story recommendations<br />
<a href="http://www.infosci.cornell.edu/weblab/papers/Bank2008.pdf">Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia</a> (2008)<br />
- Measuring Wikipedia Editor similarity<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Netflix video recommendation<br />
<a href="http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf">Large-scale Parallel Collaborative Filtering for the Netflix Prize</a> (2008)</p>
<p><strong>Numerical Mathematics</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://research.microsoft.com/pubs/119077/DNMF.pdf">Distributed non-negative matrix factorization for dyadic data analysis on mapreduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://www.cs.uwaterloo.ca/conferences/dl2010/papers/dl2010.pdf#page=464">A mapreduce algorithm for SC</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://jeff.bleugris.com/gpmr_mapreduce2010.pdf">Multi-GPU Volume Rendering using MapReduce</a><br />
<a href="http://arxiv.org/PS_cache/arxiv/pdf/1001/1001.0421v1.pdf">Mapreduce for Integer Factorization</a></p>
<p><strong>Sets &#038; Graphs</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1779599.1779604">Towards scalable RDF graph analytics on MapReduce </a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://asterix.ics.uci.edu/pub/sigmod10-vernica-long.pdf">Efficient Parallel Set-Similarity Joins using Mapreduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span><a href="http://portal.acm.org/citation.cfm?id=1772715">Max-cover algorithm in map-reduce</a><br />
<a href="http://www.springerlink.com/content/654725g772674533/">Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework</a><br />
<a href="http://www.springerlink.com/content/l805560670136163/">Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce</a><br />
<a href="http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120">Graph Twiddling in a MapReduce World</a><br />
<a href="http://www.cis.temple.edu/~vasilis/research/Publications/kdd09.pdf">DOULION: Counting Triangles in Massive Graphs with a Coin</a> (2009)<br />
<a href="http://reports-archive.adm.cs.cmu.edu/anon/ml2008/CMU-ML-08-103.pdf">Fast counting of triangles in real-world networks: proofs, algorithms and observations</a> (2008)</ul>
<p><strong>Who wrote the above papers?</strong><br />
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.<br />
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&#038;M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas</p>
<p>Best regards,<br />
<a href="http://atbrox.com/about/">Amund Tveit</a> (Atbrox co-founder)</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/</feedburner:origLink></item>
		<item>
		<title>Initial Thoughts on Yahoo’s Ranking Challenge</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/jhu-QCPFDyY/</link>
		<comments>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/#comments</comments>
		<pubDate>Sat, 27 Feb 2010 23:15:09 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[netflix]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[yahoo]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=855</guid>
		<description><![CDATA[Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research. Yahoo recently announced the Learning to Rank Challenge &#8211; a pretty interesting web search challenge (as the somewhat similar Netflix Prize Challenge also was). Data and Problem The data sets contain (to my interpretation) per [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup company providing <a href="http://atbrox.com/technology/">technology</a> and <a href="http://atbrox.com/services/">services</a> for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and Research.</em></p>
<p>Yahoo recently <a href="http://groups.google.com/group/ml-news/browse_thread/thread/bec89f7abee8f9c7#">announced</a> the <a href="http://learningtorankchallenge.yahoo.com/">Learning to Rank Challenge</a> &#8211; a pretty interesting web search challenge (<em>as the somewhat similar <a href="http://www.netflixprize.com//community/viewtopic.php?id=1537">Netflix Prize Challenge</a> also was</em>).</p>
<p><strong>Data and Problem</strong><br />
Each line of the data sets contains (as I interpret it):</p>
<ol>
<li>url &#8211; implicitly encoded as line number in the data set file</li>
<li>relevance &#8211; low number=high relevance and vice versa</li>
<li>query &#8211; represented as an id</li>
<li>features &#8211; up to several hundreds</li>
</ol>
<p>and the problem is to find a function that gives <a href="http://learningtorankchallenge.yahoo.com/instructions.php">relevance numbers per url per query id</a>.</p>
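<p>To make the per-line structure above concrete, here is a minimal Python sketch of a line parser. The post does not spell out the exact file layout, so the <code>relevance qid:ID index:value ...</code> format, the <code>Example</code> type and the <code>parse_line</code> helper are all assumptions made for illustration.</p>

```python
from typing import Dict, NamedTuple


class Example(NamedTuple):
    """One line of the data set (hypothetical container type)."""
    url_id: int                 # the URL is implicit: its line number in the file
    relevance: int
    query_id: str
    features: Dict[int, float]  # sparse: feature index to value


def parse_line(line: str, line_number: int) -> Example:
    """Parse one 'relevance qid:ID index:value ...' line (assumed layout)."""
    tokens = line.split()
    relevance = int(tokens[0])
    query_id = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return Example(line_number, relevance, query_id, features)


# Made-up line, only to show the shape of the result.
example = parse_line("2 qid:17 3:0.5 40:0.25", line_number=1)
print(example)
```

<p>Keeping the features in a dictionary is one possible choice when each line only lists some of the several hundred features.</p>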
<p><strong>Initial Observation</strong><br />
In dataset 1 there are ~473k URLs and ~19k queries. At first I thought this meant that there are on average 473/19 ~ 24 relevance numbers for each query (see the actual distribution of counts in the figure below), i.e. corresponding to search results 1 to 24, but it seems that there are several URLs per unique query that have the same relevance (e.g. URLx and URLy can both have relevance 2 for queryZ). The paper <a href="http://portal.acm.org/citation.cfm?id=1390382">Learning to Rank with Ties</a> seems potentially relevant for dealing with this.</p>
<p><img src="http://spreadsheets.google.com/oimg?key=0AtUpNWn0bYdJdGlOZWZ0TTgwLUU2Vy1QYXZJT2lUWXc&amp;oid=1&amp;v=1267315743217" alt="" /></p>
<p>Multiple URLs that share relevance for a unique query can perhaps be due to:</p>
<ol>
<li>similar/duplicate content between the URLs?</li>
<li>a frequent query (due to sampling of examples?)</li>
<li>uncertainty about which URL to select for a particular relevance and query?</li>
<li>there is a tie, i.e. they are equally relevant</li>
</ol>
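<p>The counts discussed above (URLs per query, and URLs sharing a relevance level within a query) could be computed roughly as in the following sketch. The <code>rows</code> list is made-up toy data, not taken from the challenge data set, and the variable names are only illustrative.</p>

```python
from collections import Counter, defaultdict

# Hypothetical (query_id, relevance) pairs, one per URL (i.e. per line).
rows = [("q1", 1), ("q1", 2), ("q1", 2),
        ("q2", 1), ("q2", 1), ("q2", 1), ("q2", 3)]

# Number of URLs per query (the distribution shown in the figure).
urls_per_query = Counter(query_id for query_id, _ in rows)

# Number of URLs per relevance level within each query.
urls_per_level = defaultdict(Counter)
for query_id, relevance in rows:
    urls_per_level[query_id][relevance] += 1

# Relevance levels shared by more than one URL for the same query.
ties = {
    (query_id, relevance): count
    for query_id, levels in urls_per_level.items()
    for relevance, count in levels.items()
    if count > 1
}

average = sum(urls_per_query.values()) / len(urls_per_query)
print(average, ties)
```

<p>On the real data the same two counters would show how far the per-query counts deviate from the average, and how common shared relevance levels are.</p>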
<p><strong>Potential classification approach?</strong><br />
From a classification perspective there are several (perhaps naive?) approaches that could be tried out:</p>
<ol>
<li>Use relevance levels as classes (nominal regression) and use a multiclass classifier</li>
<li>Train a classifier as a binary competition within each query, i.e. relevance 1 against 2, 3, ..., and relevance n against n+1, ... (this would probably cause some sparsity problems)</li>
<li>Binary competition across queries. This is problematic because a relevance of 4 for one query could be more relevant than a relevance of 1 for another query (and there is no easy way to determine that directly from the data). However, if the observation about multiple URLs per relevance level per query (see above) is caused by uncertainty, one could perhaps use 1/(number of URLs per relevance level per query) as a weight to either:
<ol>
<li>support training across queries, e.g. a single URL with relevance 1 for one query is a more confident example than a URL from another query that has 37 URLs with relevance 1. Perhaps this could be used with regression? The problem is comparing across different relevance levels, e.g. is a relevance 2 for a query with 1 URL more confident than a relevance 1 for a query with 37 URLs?</li>
<li>use a classifier that supports weighting examples, combined with approach 1 or 2.</li>
</ol>
</li>
</ol>
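<p>As a rough illustration of the ideas above, the sketch below builds within-query preference pairs (approach 2) and attaches the 1/(number of URLs per relevance level per query) weight. The toy <code>rows</code>, the helper names and the choice to multiply the two weights for a pair are assumptions made for this example, not something the challenge prescribes.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical (url_id, query_id, relevance) triples. Following the
# interpretation in this post, a lower relevance number means more relevant.
rows = [(1, "q1", 1), (2, "q1", 2), (3, "q1", 2), (4, "q2", 1)]

# Number of URLs per (query, relevance level).
level_sizes = Counter((query_id, relevance) for _, query_id, relevance in rows)


def weight(query_id, relevance):
    """1 / (number of URLs sharing this relevance level for this query)."""
    return 1.0 / level_sizes[(query_id, relevance)]


def within_query_pairs(rows):
    """Yield (better_url, worse_url, pair_weight) for URLs of the same query."""
    by_query = {}
    for url_id, query_id, relevance in rows:
        by_query.setdefault(query_id, []).append((relevance, url_id))
    for query_id, items in by_query.items():
        for first, second in combinations(items, 2):
            if first[0] == second[0]:
                continue  # a tie expresses no preference, so skip it
            # Sorting puts the lower (better) relevance number first.
            (rel_better, url_better), (rel_worse, url_worse) = sorted([first, second])
            # Assumption: combine the two example weights by multiplying them.
            pair_weight = weight(query_id, rel_better) * weight(query_id, rel_worse)
            yield url_better, url_worse, pair_weight


pairs = list(within_query_pairs(rows))
print(pairs)
```

<p>The resulting pairs and weights could then be fed to any binary classifier that accepts per-example weights. Queries with a single URL (like <code>q2</code> here) produce no pairs, which is one source of the sparsity mentioned above.</p>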
<p><strong>More about ranking with machine learning?</strong><br />
Check out the <a href="http://research.microsoft.com/en-us/um/beijing/projects/letor/paper.aspx">learning to rank bibliography</a>.</p>
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a></p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/28/initial-thoughts-on-yahoos-ranking-challenge/</feedburner:origLink></item>
		<item>
		<title>So, what is Hadoop?</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/kZIQgRnxtAE/</link>
		<comments>http://atbrox.com/2010/02/17/hadoop/#comments</comments>
		<pubDate>Wed, 17 Feb 2010 21:39:10 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[bigtable]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[thrift]]></category>
		<category><![CDATA[yahoo]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=727</guid>
		<description><![CDATA[Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research. Hadoop is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business critical and required data companies gather (e.g. required [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup company providing <a href="http://atbrox.com/technology/">technology</a> and <a href="http://atbrox.com/services/">services</a> for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and Research.</em></p>
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a set of open source technologies that supports reliable and cost-efficient ways of dealing with large amounts of data. Companies gather vast amounts of business-critical data, and some of it must be retained because of regulations such as <a href="http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act">Sarbanes–Oxley (SOX)</a> or the <a href="http://en.wikipedia.org/wiki/Data_Retention_Directive">EU Data Retention Directive</a>, so Hadoop is becoming increasingly relevant.</p>
<h2>Hadoop Technologies</h2>
<p>Several Hadoop technologies are inspired by <a href="http://research.google.com/pubs/DistributedSystemsandParallelComputing.html">Google&#8217;s infrastructure</a>.</p>
<h4>1. Processing and Storage</h4>
<p><strong>1.1 Processing &#8211; Mapreduce</strong><br />
Mapreduce can be used to process and extract knowledge from arbitrary amounts of data, e.g. web data, measurement data or financial transactions &#8211; <a href="http://www.slideshare.net/cloudera/hw09-large-scale-transaction-analysis">Visa reduced their processing time for transactional statistics from 1 month to 13 minutes with Hadoop</a>. In order to use Mapreduce, developers need to parallelize their problem and program against an API &#8211; see <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">this example of machine learning with Hadoop</a>. Hadoop&#8217;s Mapreduce is inspired by the paper <a href="http://research.google.com/archive/mapreduce.html">MapReduce: Simplified Data Processing on Large Clusters</a>.</p>
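<p>To make the programming model concrete, here is a minimal sketch of a word count written as a map step and a reduce step. It runs sequentially in a single Python process and does not use Hadoop&#8217;s API; the function names (<code>map_phase</code>, <code>shuffle</code>, <code>reduce_phase</code>, <code>word_count</code>) are made up for this illustration. On a real cluster the framework would run many map and reduce tasks in parallel and do the grouping step itself.</p>

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another in a single process."""
    emitted = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

<p>Because each call to <code>map_phase</code> only looks at its own line and each call to <code>reduce_phase</code> only looks at one key, both steps can be spread over many machines without changing the result.</p>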
<p><strong>1.2 File Storage &#8211; HDFS</strong><br />
HDFS is a scalable, distributed file system. It supports a configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired by the paper <a href="http://research.google.com/archive/gfs-sosp2003.pdf">The Google File System</a>.</p>
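<p>The effect of replication can be illustrated with a toy model. The sketch below is not HDFS code and does not reproduce HDFS&#8217;s real placement policy; it simply assumes a round-robin placement of each block on several distinct nodes, and the names <code>place_replicas</code> and <code>readable_blocks</code> are hypothetical. It shows why a higher replication degree keeps data readable when nodes fail.</p>

```python
def place_replicas(block_ids, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for index, block in enumerate(block_ids):
        holders = [nodes[(index + k) % len(nodes)] for k in range(replication)]
        placement[block] = holders
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return sorted(
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    )


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    blocks = ["b0", "b1", "b2"]
    print(readable_blocks(place_replicas(blocks, nodes, 3), ["n1", "n2"]))
    print(readable_blocks(place_replicas(blocks, nodes, 1), ["n1"]))
```

<p>With three replicas, every block survives the loss of two nodes in this example; with a single replica, losing one node already makes a block unreadable.</p>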
<p><strong>1.3 Database &#8211; HBase</strong><br />
HBase is a distributed database that runs on top of HDFS and supports storing billions of rows with millions of columns. HBase can replace traditional databases if they run into scaling problems or become too expensive licence-wise, see <a href="http://www.docstoc.com/docs/document-preview.aspx?doc_id=12426408&#038;C">this presentation about HBase</a>. HBase is inspired by the paper <a href="http://research.google.com/archive/bigtable-osdi06.pdf">Bigtable: A Distributed Storage System for Structured Data</a>.</p>
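<p>The Bigtable paper describes its data model as a sparse map from row key, column and timestamp to a value. The following in-memory sketch mimics that idea only; it is not the HBase client API, and the class <code>SparseTable</code> with its methods is a hypothetical name chosen for illustration. Rows store only the columns that were actually written, which is why very wide tables with mostly empty cells stay cheap.</p>

```python
import time
from collections import defaultdict


class SparseTable:
    """Toy table: row key, then column name, then a list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row, column, value, timestamp=None):
        """Store a new version of a cell; older versions are kept."""
        ts = time.time() if timestamp is None else timestamp
        self._rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column):
        """Return the most recent value of a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        newest = max(versions, key=lambda version: version[0])
        return newest[1]

    def columns(self, row):
        """List only the columns that exist for this row."""
        return sorted(self._rows.get(row, {}))


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ada", timestamp=1)
    table.put("user1", "info:name", "Ada L.", timestamp=2)
    table.put("user2", "stats:visits", 7, timestamp=1)
    print(table.get("user1", "info:name"))
    print(table.columns("user2"))
```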
<h4>2. Data Analysis</h4>
<p>Mapreduce can be used to analyze all kinds of data (e.g. text, multimedia, numerical data) and is highly flexible, but for more structured data the following Hadoop technologies can be used:</p>
<p><strong>2.1 Pig</strong><br />
SQL-like language/system running on top of Mapreduce. <a href="http://glinden.blogspot.com/2007/04/yahoo-pig-and-google-sawzall.html">Pig is developed by Yahoo</a> and inspired by the paper <a href="http://research.google.com/pubs/pub61.html">Interpreting the Data: Parallel Analysis with Sawzall</a></p>
<p><strong>2.2 Hive</strong><br />
A data warehouse running on top of Hadoop, developed by Facebook. Its query language is very similar to SQL.</p>
<h4>3. Distributed Systems Development</h4>
<p><strong>3.1 Avro</strong><br />
Avro is used for efficient serialization of data and communication between services. It is in several ways similar to <a href="http://code.google.com/apis/protocolbuffers/">Google&#8217;s Protocol Buffers</a> and <a href="http://developers.facebook.com/thrift/">Facebook&#8217;s Thrift</a>.</p>
<p><strong>3.2 Zookeeper</strong><br />
Zookeeper provides coordination between distributed processes. It is inspired by the paper <a href="http://research.google.com/archive/chubby-osdi06.pdf">The Chubby lock service for loosely-coupled distributed systems</a>.</p>
<p><strong>3.3 Chukwa</strong><br />
Monitoring of distributed systems.</p>
<hr />
<p><font color="#0000ff"><strong>Do you need help with Hadoop/Mapreduce?</strong></font><br />
A good start could be to read <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&#038;tag=amuw-20&#038;linkCode=as2&#038;camp=1789&#038;creative=9325&#038;creativeASIN=0596521979">this book</a>, or contact <a href="http://atbrox.com/about/">Atbrox</a> if you need help with development or parallelization of algorithms for Hadoop/Mapreduce &#8211; <a href="mailto:info@atbrox.com">info@atbrox.com</a>. See <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">our posting</a> for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.</p>
<p>Best regards,</p>
<p><a href="http://atbrox.com/about/">Amund Tveit</a></p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/17/hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/17/hadoop/</feedburner:origLink></item>
		<item>
		<title>Mapreduce &amp; Hadoop Algorithms in Academic Papers (updated)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/ZU_alB_G58o/</link>
		<comments>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 19:19:37 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=662</guid>
		<description><![CDATA[The newest and most up-to-date version (May 2010) of this blog post is available at http://mapreducebook.org Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research. This posting is an update to the similar posting from October 2009, roughly doubling the number of papers from [...]]]></description>
			<content:encoded><![CDATA[
<p><b><font color="#ff0000">The newest and most up-to-date version (May 2010) of this blog post is available at <a href="http://mapreducebook.org">http://mapreducebook.org</a></font></b></p>
<p><em>Atbrox is a startup company providing <a href="http://atbrox.com/technology/">technology</a> and <a href="http://atbrox.com/services/">services</a> for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and Research.</em></p>
<p>This posting is an update to the <a href="http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/">similar posting from October 2009</a> and roughly doubles the number of papers compared to the previous posting. The new ones are marked with <span style="color: #ff0000;"><strong>*</strong></span></p>
<p><strong>Motivation</strong><br />
Learn from the academic literature how the Mapreduce parallel model and its Hadoop implementation are used to solve algorithmic problems.</p>
<p><strong>Which areas do the papers cover?</strong></p>
<ul> <strong>Bioinformatics/Medical Informatics</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/861l014845934682/">MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network</a> (2009)<br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.biomedcentral.com/1471-2105/11/S1/S15">MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees</a></p>
<p><strong>Machine Translation</strong><br />
<a href="http://www.cs.cmu.edu/~zollmann/publications/samt-toolkit.pdf"> Grammar based statistical MT on Hadoop</a> (2009)<br />
<a href="http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf"> Large Language Models in Machine Translation</a> (2008)</p>
<p><strong>Spatial Data Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://users.cis.fiu.edu/~vagelis/publications/Spatial-MapReduce-SSDBM2009.pdf">Experiences on Processing Spatial Data with MapReduce</a></p>
<p><strong>Information Extraction and Text Processing</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1620950.1620951">Data-intensive text processing with MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.aclweb.org/anthology/D/D09/D09-1098.pdf"> Web-Scale Distributional Similarity and Entity Set Expansion</a> (2009)<br />
<a href="http://www.aclweb.org/anthology-new/D/D09/D09-1071.pdf"> The infinite HMM for unsupervised PoS tagging</a> (2009)</p>
<p><strong>Artificial Intelligence/Machine Learning/Data Mining</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.cs.cmu.edu/~ylow/paraml_aistats2009.pdf">Residual Splash for Optimally Parallelizing Belief Propagation</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1646301">Stochastic gradient boosted distributed decision trees</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://jmlr.csail.mit.edu/papers/volume10/newman09a/newman09a.pdf">Distributed Algorithms for Topic Models</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://verma7.com/wp/wp-content/uploads/2009/10/meandre-mapreduce.pdf">When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/m28617946158t788/">Cloud Computing Boosts Business Intelligence of Telecommunication Industry</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/c621194607866223/">Parallel K-Means Clustering Based on MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1631067">Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://portal.acm.org/citation.cfm?id=1631272.1631451">Parallel algorithms for mining large-scale rich-media data</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://verma7.com/wp/wp-content/uploads/2009/09/CS597_Spring09_GA.pdf">Scaling Simple and Compact Genetic Algorithms using MapReduce</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.cs.vu.nl/~frankh/postscript/ISWC09.pdf">Scalable Distributed Reasoning using Mapreduce</a><br />
<a href="http://www.cse.nd.edu/~dthain/papers/classify-icdm08.pdf"> Scaling Up Classifiers to Cloud Computers</a> (2008)</p>
<ul>
For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">previous blog post</a>.
</ul>
<p><strong>Ads Analysis</strong><br />
<a href="http://www.cc.gatech.edu/~zha/CSE8801/ad/p209-chen.pdf"> Large-Scale Behavioral Targeting</a> (2009)<br />
<a href="http://research.yahoo.com/files/cikm2008-search%20advertising.pdf "> Search Advertising using Web Relevance Feedback</a> (2008)<br />
<a href="http://research.yahoo.com/workshops/troa-2008/papers/submission_12.pdf"> Predicting Ads’ ClickThrough Rate with Decision Rules </a>(2008)</p>
<p><strong>Search Query Analysis</strong><br />
<a href="http://research.microsoft.com/apps/pubs/default.aspx?id=80592"> BBM: Bayesian Browsing Model from Petabyte-scale Data</a> (2009)<br />
<a href="http://portal.acm.org/citation.cfm?id=1559990&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492464&amp;CFTOKEN=33063869"> AIDE: Ad-hoc Intents Detection Engine over Query Logs </a>(2009)</p>
<p><strong>Information Retrieval (Search)</strong><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://paginas.fe.up.pt/~eol/PUBLICATIONS/2009/Efficient%20clustering%20of%20web-derived%20data%20sets.pdf">Efficient Clustering of Web Derived Data Sets</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://web.phys.ntu.edu.tw/phystalks/Theory_seminar_Fall_2009/PageRank_PingYeh.pdf">The PageRank algorithm and application on searching of academic papers</a><br />
<span style="color: #ff0000;"><strong>*</strong></span> <a href="http://www.springerlink.com/content/h411850464229625/">A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures</a><br />
<a href="http://portal.acm.org/citation.cfm?id=1572106&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=54492520&amp;CFTOKEN=63253841"> On Single-Pass Indexing with MapReduce</a> (2009)<br />
<a href="http://bhavik.me/docs/Paper.pdf"> A Data Parallel Algorithm for XML DOM Parsing</a> (2009)<br />
<a href="http://www.springerlink.com/content/t607305788356537/"> Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web</a> (2008)</p>
<p><strong>Spam &amp; Malware Detection</strong><br />
<a href="http://www.usenix.org/event/leet08/tech/full_papers/zhuang/zhuang.pdf">Characterizing Botnets from Email Spam Records</a> (2008)<br />
- Clustering of emails into spam campaigns<br />
- Finding the probability that 2 spam messages are sent from the same machine<br />
- Estimating the likelihood of botnets based on common senders in spam campaigns<br />
<a href="http://www.usenix.org/event/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost in the Browser: Analysis of Web-based Malware</a> (2007)</p>
<p><strong>Image and Video Processing</strong><br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Video Stream Re-Rendering<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Location detection in images</p>
<p><strong>Networking</strong><br />
<a href="http://wwwse.inf.tu-dresden.de/papers/preprint-pfeifer2008reducible.pdf">Reducible Complexity in DNS</a></p>
<p><strong>Simulation</strong><br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Simulation of earthquakes (geology)</p>
<p><strong>Statistics</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_SIGIR2009.pdf">Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce</a> (2009)<br />
<a href="http://thepublicgrid.org/papers/koufakou_wcci_08.pdf">Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce</a> (2009)<br />
<a href="http://www.hpl.hp.com/personal/Thomas_Sandholm/sandholm2009a.pdf">MapReduce Optimization Using Regulated Dynamic Prioritization</a> (2009)<br />
- Digg.com story recommendations<br />
<a href="http://www.infosci.cornell.edu/weblab/papers/Bank2008.pdf">Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia</a> (2008)<br />
- Measuring Wikipedia Editor similarity<br />
<a href="http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf">Map-Reduce Meets Wider Varieties of Applications</a> (2008)<br />
- Netflix video recommendation<br />
<a href="http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf">Large-scale Parallel Collaborative Filtering for the Netflix Prize</a> (2008)</p>
<p><strong>Numerical Mathematics</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://arxiv.org/PS_cache/arxiv/pdf/1001/1001.0421v1.pdf">Mapreduce for Integer Factorization</a></p>
<p><strong>Graphs</strong><br />
<strong><span style="color: #ff0000;">*</span></strong> <a href="http://www.springerlink.com/content/654725g772674533/">Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework</a><br />
<span style="color: #ff0000;">*</span> <a href="http://www.springerlink.com/content/l805560670136163/">Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce</a><br />
<span style="color: #ff0000;">*</span> <a href="http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120">Graph Twiddling in a MapReduce World</a><br />
<a href="http://www.cis.temple.edu/~vasilis/research/Publications/kdd09.pdf">DOULION: Counting Triangles in Massive Graphs with a Coin</a> (2009)<br />
<a href="http://reports-archive.adm.cs.cmu.edu/anon/ml2008/CMU-ML-08-103.pdf">Fast counting of triangles in real-world networks: proofs, algorithms and observations</a> (2008)</ul>
<p><strong>Who wrote the above papers?</strong> <em>(<font color="#ff0000">section added 20100307</font>)</em><br />
Companies: China Mobile, eBay, Google, Hewlett Packard, Intel, Microsoft, Wikipedia, Yahoo and Yandex.<br />
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University, National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&#038;M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas</p>
<hr />
<p><font color="#0000ff"><strong>Do you need help with Hadoop/Mapreduce?</strong></font><br />
A good start could be to read <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&#038;tag=amuw-20&#038;linkCode=as2&#038;camp=1789&#038;creative=9325&#038;creativeASIN=0596521979">this book</a>, or contact <a href="http://atbrox.com/about/">Atbrox</a> if you need help with development or parallelization of algorithms for Hadoop/Mapreduce &#8211; <a href="mailto:info@atbrox.com">info@atbrox.com</a>. See <a href="http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/">our previous posting</a> for an example of parallelizing and implementing a machine learning algorithm for Hadoop/Mapreduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/</feedburner:origLink></item>
		<item>
		<title>Parallel Machine Learning for Hadoop/Mapreduce – A Python Example</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/qoTnKURhfes/</link>
		<comments>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 21:27:37 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[Hadoop and Mapreduce]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ridge regression]]></category>
		<category><![CDATA[svm]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=572</guid>
		<description><![CDATA[Atbrox is a startup providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and Research. Update 2010-June-17 Code for this posting is now on github -http://github.com/atbrox/Snabler This posting gives an example of how to use Mapreduce, Python and Numpy to parallelize a linear machine learning classifier algorithm for Hadoop Streaming. [...]]]></description>
			<content:encoded><![CDATA[
<p><em>Atbrox is a startup providing technology and services for Search and Mapreduce/Hadoop. <a href="http://atbrox.com/about/">Our background</a> is from Google, IBM and Research.</em></p>
<p><font color="#ff0000">Update 2010-June-17</font> Code for this posting is now on github -<a href="http://github.com/atbrox/Snabler">http://github.com/atbrox/Snabler</a> </p>
<p>This posting gives an example of how to use Mapreduce, Python and Numpy to parallelize a linear machine learning classifier algorithm for Hadoop Streaming. It also discusses various hadoop/mapreduce-specific approaches to potentially improve or extend the example.</p>
<h2>1. Background</h2>
<p>Classification is an everyday task. It is about selecting one out of several outcomes based on their features, e.g.</p>
<ul>
<li>In recycling of garbage you select the bin based on the material, e.g. plastic, metal or organic.</li>
<li>When purchasing you select the store based on e.g. its reputation, prior experience, service, inventory and prices.</li>
</ul>
<p>Computational Classification &#8211; Supervised Machine Learning &#8211; is quite similar, but requires (relatively) well-formed input data combined with classification algorithms.</p>
<h3>1.1 Examples of classification problems</h3>
<ul>
<li>Finance/Insurance
<ul>
<li>Classify investment opportunities as good or not, e.g. based on industry/company metrics, portfolio diversity and currency risk.</li>
<li>Classify credit card transactions as valid or invalid, e.g. based on the location of the transaction and of the credit card holder, date, amount, purchased item or service, history of transactions and similar transactions.</li>
</ul>
</li>
<li>Biology/Medicine
<ul>
<li>Classification of proteins into structural or functional classes</li>
<li>Diagnostic classification, e.g. <a href="http://www.csie.ntu.edu.tw/~rfchang/prof/ar0302.pdf">cancer tumours based on images</a></li>
</ul>
</li>
<li>Internet
<ul>
<li><a href="http://en.wikipedia.org/wiki/Document_classification">Document Classification</a> and <a href="http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html">Ranking</a></li>
<li>Malware classification, email/tweet/web spam classification</li>
</ul>
</li>
<li>Production Systems (e.g. in energy or petrochemical industries)
<ul>
<li>Classify and detect situations (e.g. sweet spots or risk situations) based on realtime and historic data from sensors</li>
</ul>
</li>
</ul>
<h3>1.2 Classification Algorithms</h3>
<p>Classification algorithms come in various types (e.g. linear, nonlinear, discriminative etc), see my prior postings <a href="http://amundblog.blogspot.com/2008/04/pragmatic-classification-very-basics.html">Pragmatic Classification: The Very Basics</a> and <a href="http://amundblog.blogspot.com/2008/06/pragmatic-classification-of-classifiers.html">Pragmatic Classification of Classifiers</a>.<br />
<strong><font color="#0000ff"><br />
Key takeaways about classifiers:<br />
</font></strong></p>
<ol>
<li>There is no silver bullet classifier algorithm or feature extraction method.</li>
<li>Classification algorithms tend to be computationally hard to train. This encourages using a parallel approach, in this case with Hadoop/Mapreduce.</li>
</ol>
<h2>2. Parallel Classification for Hadoop Streaming</h2>
<p>The classifier described belongs to a family of classifiers which have in common that they can mathematically be described as Tikhonov Regularization with a Square loss function. This family includes Proximal SVM, Ridge Regression, Shrinkage Regression and Regularized Least-Squares Classification. (<em>note: If you replace the Square Loss function with a Hinge-Loss function you get Support Vector Machine classification</em>). The implemented classifier &#8211; proximal SVM &#8211; is from the paper <a href="ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-08.ps">Incremental Support Vector Machine Classification</a>, referred to as the paper below.</p>
<h3>2.1 training data</h3>
<p>The classifier assumes numerical training data, where each class is either -1.0 or +1.0 (negative or positive class), and features are represented as vectors of positive floating point numbers. The algorithm below uses the following symbols:</p>
<pre class="brush: plain;">
A - a matrix with feature vectors, one row per training example, e.g. [[2.9, 3.3, 11.1, 2.4], .. ]
D - a diagonal matrix with the training classes (-1.0 or +1.0) on the diagonal
e - a column vector filled with ones, e.g. [1.0, 1.0, .., 1.0]
E = [A -e]
mu = scalar constant # used to tune classifier
</pre>
<h3>2.2 the classifier algorithm</h3>
<p>Training the classifier can be done with the right-hand side of equation (13) from the paper:</p>
<pre class="brush: plain;">(omega, gamma) = (I/mu + E.T*E).I*(E.T*D*e)
</pre>
<p>Classification of an incoming feature vector x can then be done by calculating:</p>
<pre class="brush: plain;">x.T*omega - gamma</pre>
<p>which returns a number, and the sign of the number corresponds to the class, i.e. positive or negative.</p>
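<p>As a minimal sketch of the two formulas above, the training and classification steps might look like this in plain NumPy on a single machine. The function names, the toy data and the choice mu=10.0 are assumptions made for illustration and are not taken from the paper or from the Snabler code. The sketch uses numpy.linalg.solve instead of an explicit inverse, which gives the same (omega, gamma).</p>

```python
import numpy as np

def train_proximal_svm(A, d, mu=0.1):
    """Return (omega, gamma) solving (I/mu + E.T*E) * result = E.T*D*e."""
    m, n = A.shape
    e = np.ones((m, 1))
    E = np.hstack([A, -e])               # E = [A -e]
    D = np.diag(d)                       # diagonal matrix of -1.0/+1.0 classes
    lhs = np.eye(n + 1) / mu + E.T @ E
    rhs = E.T @ D @ e
    result = np.linalg.solve(lhs, rhs)   # same value as lhs.I * rhs
    return result[:-1], result[-1, 0]

def classify(x, omega, gamma):
    """The sign of x.T*omega - gamma gives the class."""
    return 1.0 if (x @ omega).item() - gamma > 0 else -1.0

# toy, made-up training data: two features per example
A = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])
d = np.array([-1.0, -1.0, 1.0, 1.0])

omega, gamma = train_proximal_svm(A, d, mu=10.0)
predictions = [classify(x, omega, gamma) for x in A]
print(predictions)
```

<p>The value of mu matters: a small mu makes the I/mu term dominate, which regularizes heavily and can misclassify even such a tiny training set.</p>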
<p>2. Parallelization of the classifier with Hadoop Streaming and Python</p>
<p>Expression (16) in the paper has a nice property: it supports increments (and decrements). In the example there are 2 increments (and 2 decrements), but by induction there can be as many as you want:</p>
<pre class="brush: plain;">
(omega, gamma) = (I/mu + E_1.T*E_1 + .. + E_i.T*E_i).I*
                 (E_1.T*D_1*e + .. + E_i.T*D_i*e)
</pre>
<p>where</p>
<pre class="brush: plain;">
E.T*E = E_1.T*E_1 + .. + E_i.T*E_i
</pre>
<p>and</p>
<pre class="brush: plain;">
E.T*D*e = E_1.T*D_1*e + .. + E_i.T*D_i*e
</pre>
<p>This means that we can parallelize the calculation of E.T*E and E.T*D*e by having Hadoop mappers calculate each of the elements of the sums, as in the Python map() code below (sent to reducers as tuples).</p>
<p><img width="500" src="http://atbrox.com/wp-content/uploads/2010/02/parclassifiersinglereducer.png" alt="map() and reduce() - dataflow - basic case" /></p>
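<p>The additive property can be checked numerically. The following is a small sketch, assuming random made-up data and an arbitrary chunk size of 4 rows (the helper name increment is hypothetical): it computes E.T*E and E.T*D*e once over all rows and once as a sum over chunks, and compares the two.</p>

```python
import numpy as np

def increment(A_i, d_i):
    """Return (E_i.T*E_i, E_i.T*D_i*e) for one chunk of training rows."""
    e = np.ones((A_i.shape[0], 1))
    E = np.hstack([A_i, -e])
    return E.T @ E, E.T @ np.diag(d_i) @ e

rng = np.random.default_rng(0)                    # made-up data
A = rng.random((10, 3))
d = np.where(rng.random(10) > 0.5, 1.0, -1.0)

# all rows at once
full_ETE, full_ETDe = increment(A, d)

# the same rows split into chunks of 4, 4 and 2 rows
parts = [increment(A[i:i + 4], d[i:i + 4]) for i in range(0, 10, 4)]
sum_ETE = sum(p[0] for p in parts)
sum_ETDe = sum(p[1] for p in parts)

print(np.allclose(full_ETE, sum_ETE), np.allclose(full_ETDe, sum_ETDe))
```

<p>Since the chunks do not need to have the same size, each mapper can work on whatever slice of the training data Hadoop hands it.</p>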
<h3>2.3 &#8211; the mapper</h3>
<pre class="brush: plain;">
import base64
import pickle
import numpy

def map(key, value):
   # input key= class for one training example, e.g. &quot;-1.0&quot;
   classes = [float(item) for item in key.split(&quot;,&quot;)]   # e.g. [-1.0]
   D = numpy.diag(classes)

   # input value = feature vector for one training example, e.g. &quot;3.0, 7.0, 2.0&quot;
   featurematrix = [float(item) for item in value.split(&quot;,&quot;)]
   A = numpy.matrix(featurematrix)

   # create matrix E and vector e
   e = numpy.matrix(numpy.ones(len(A)).reshape(len(A),1))
   E = numpy.matrix(numpy.append(A,-e,axis=1))

   # create a tuple with the values to be used by reducer
   # and encode it with base64 to avoid potential trouble with '\t' and '\n' used
   # as default separators in Hadoop Streaming
   producedvalue = base64.b64encode(pickle.dumps( (E.T*E, E.T*D*e) ))

   # note: a single constant key &quot;producedkey&quot; sends to only one reducer
   # somewhat &quot;atypical&quot; due to low degree of parallelism on reducer side
   # decode so the base64 bytes are printed as plain text
   print(&quot;producedkey\t%s&quot; % producedvalue.decode(&quot;ascii&quot;))
<h3>2.4 &#8211; the Reducer</h3>
<pre class="brush: plain;">
def reduce(key, values, mu=0.1):
  sumETE = None
  sumETDe = None

  # key isn't used, so ignoring it with _ (underscore).
  for _, value in values:
    # unpickle values
    ETE, ETDe = pickle.loads(base64.b64decode(value))
    if sumETE is None:
      # create the I/mu with correct dimensions
      sumETE = numpy.matrix(numpy.eye(ETE.shape[1])/mu)
    sumETE += ETE

    if sumETDe is None:
      # create sumETDe with correct dimensions
      sumETDe = ETDe
    else:
      sumETDe += ETDe

  # all increments are summed, so solve once after the loop
  # note: omega = result[:-1] and gamma = result[-1]
  # but printing entire vector as output
  result = sumETE.I*sumETDe
  print(&quot;%s\t%s&quot; % (key, str(result.tolist())))
<h3>2.5 &#8211; Mapper and Reducer Utility Code</h3>
<p>Code used to run the map() and reduce() methods, inspired by the iterator/generator approach from <a href="http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python">this mapreduce tutorial</a>.</p>
<pre class="brush: plain;">
import sys
from itertools import groupby
from operator import itemgetter
def read_input(file, separator=&quot;\t&quot;):
    for line in file:
        yield line.rstrip().split(separator)
</pre>
<pre class="brush: plain;">
def run_mapper(map, separator=&quot;\t&quot;):
    data = read_input(sys.stdin,separator)
    for (key,value) in data:
        map(key,value)
</pre>
<pre class="brush: plain;">
def run_reducer(reduce,separator=&quot;\t&quot;):
    data = read_input(sys.stdin, separator)
    for key, values in groupby(data, itemgetter(0)):
        reduce(key, values)
</pre>
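<p>The reducer driver relies on Hadoop Streaming delivering the lines sorted by key, because itertools.groupby only groups adjacent items with the same key. A small sketch with made-up in-memory lines instead of sys.stdin shows the grouping (the sample keys and values are assumptions for illustration):</p>

```python
from itertools import groupby
from operator import itemgetter

def read_input(lines, separator="\t"):
    for line in lines:
        yield line.rstrip().split(separator)

# made-up, already sorted key/value lines as a reducer would receive them
lines = ["a\t1\n", "a\t2\n", "b\t3\n"]

grouped = [(key, [value for _, value in values])
           for key, values in groupby(read_input(lines), itemgetter(0))]
print(grouped)
```

<p>Each group still contains the full [key, value] pairs, which is why the reduce() function above unpacks them with an underscore for the key.</p>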
<h2>3. Finished?</h2>
<p>Assume your running time goes through the roof even with the above parallel approach. What can you do?</p>
<h3>3.1 Mapper Increment Size really makes a difference!</h3>
<p>Since there is only one reducer in the presented implementation, it is useful to let the mappers do most of the work. The size of the (increment) matrices &#8211; E.T*E and E.T*D*e &#8211; given as input to the reducer is independent of the number of training examples, but dependent on the number of classification features. The workload on the reducer also depends on the number of matrices received from the mappers (i.e. the increment size). For example, if you have 1000 mappers, each with one billion examples of 100 features, the reducer would need to sum one trillion 101&#215;101 matrices and one trillion 101&#215;1 vectors if the mappers sent one matrix pair per training example. If each mapper instead sent a single pair of E.T*E and E.T*D*e representing all of that mapper&#8217;s billion training examples, the reducer would only need to sum 1000 matrix pairs.</p>
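<p>One way to get larger increments is to let the mapper accumulate the sums over all of its input lines and emit a single record at the end. The sketch below assumes the same input format as the mapper above (class, tab, comma-separated features); the function name map_batch and the sample lines are hypothetical.</p>

```python
import base64
import pickle
import numpy as np

def map_batch(lines, separator="\t"):
    """Accumulate E.T*E and E.T*D*e over all lines and emit one record."""
    sum_ETE = None
    sum_ETDe = None
    for line in lines:
        key, value = line.rstrip().split(separator)
        label = float(key)
        features = [float(item) for item in value.split(",")]
        E = np.array([features + [-1.0]])   # one row of E = [A -e]
        ETE = E.T @ E
        ETDe = E.T * label                  # D and e are 1x1 for a single row
        sum_ETE = ETE if sum_ETE is None else sum_ETE + ETE
        sum_ETDe = ETDe if sum_ETDe is None else sum_ETDe + ETDe
    payload = base64.b64encode(pickle.dumps((sum_ETE, sum_ETDe)))
    return "producedkey\t%s" % payload.decode("ascii")

# two made-up training examples handled by the same mapper
record = map_batch(["-1.0\t1.0,2.0\n", "1.0\t3.0,4.0\n"])
print(record[:40])
```

<p>In a real streaming job the lines would come from sys.stdin and the returned record would be printed once, after the last input line.</p>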
<h3>3.2 Avoid stressing the reducer</h3>
<p>Add more (intermediate) reducers (combiners) that calculate partial sums of matrices. In the case of many small increments (and correspondingly many matrices) it can be useful to add an intermediate step that (in parallel) calculates sums of E.T*E and E.T*D*e before sending the sums to the final reducer. This means that the final reducer gets fewer matrices to sum before calculating the final answer, see figure below.<br />
<img width="500" src="http://atbrox.com/wp-content/uploads/2010/02/machinelearning2.png" alt="flow with intermediate mapreduce step" /></p>
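<p>A sketch of such an intermediate step, assuming the same base64 and pickle encoding as the mapper above. The helper names encode, decode and combine and the made-up pairs are hypothetical. Note that the combiner only sums: the I/mu term and the inversion stay in the final reducer, otherwise I/mu would be added once per combiner.</p>

```python
import base64
import pickle
import numpy as np

def encode(pair):
    return base64.b64encode(pickle.dumps(pair)).decode("ascii")

def decode(value):
    return pickle.loads(base64.b64decode(value))

def combine(values):
    """Sum encoded (E.T*E, E.T*D*e) pairs and re-encode them as one pair."""
    sum_ETE, sum_ETDe = None, None
    for value in values:
        ETE, ETDe = decode(value)
        sum_ETE = ETE if sum_ETE is None else sum_ETE + ETE
        sum_ETDe = ETDe if sum_ETDe is None else sum_ETDe + ETDe
    return encode((sum_ETE, sum_ETDe))

# three made-up increments, as they could arrive from three mappers
pairs = [(np.eye(2) * k, np.ones((2, 1)) * k) for k in (1.0, 2.0, 3.0)]
combined_ETE, combined_ETDe = decode(combine(encode(p) for p in pairs))
print(combined_ETE.tolist(), combined_ETDe.tolist())
```

<p>Because the output has the same format as the input, the step can be repeated in several layers before the final reducer.</p>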
<h3>3.3 Parallelize (or replace) the matrix inversion in the reduction step</h3>
<p>If someone comes along with a training data set with a very high feature-dimension (e.g. recommender systems, bioinformatics or text classification), the matrix inversion in the reducer can become a real bottleneck since such algorithms typically are O(n^3) (and lower bound of <a href="http://amundtveit.info/publications/2003/ComplexityOfMatrixInversion.pdf">Omega(n^2 lg n)</a>), where n is the number of features. A solution to this can be to use or develop hadoop/mapreduce-based parallel matrix inversion, e.g. <a href="http://incubator.apache.org/hama/">Apache Hama</a>, or <a href="http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/">don&#8217;t invert the matrix..</a>.</p>
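<p>On a single reducer, a cheap first step in the "don't invert" direction is to solve the linear system directly instead of forming the inverse. A minimal sketch with numpy.linalg.solve follows; the function name solve_classifier and the random data are assumptions for illustration. Solving is still cubic in the number of features, so this does not remove the bottleneck, but it avoids the extra work and rounding error of an explicit inverse.</p>

```python
import numpy as np

def solve_classifier(sum_ETE, sum_ETDe, mu=0.1):
    """Solve (I/mu + E.T*E) * result = E.T*D*e without forming the inverse."""
    lhs = np.eye(sum_ETE.shape[0]) / mu + sum_ETE
    return np.linalg.solve(lhs, sum_ETDe)

rng = np.random.default_rng(1)                       # made-up data
E = rng.random((50, 4))
d = np.where(rng.random((50, 1)) > 0.5, 1.0, -1.0)   # D*e as a column vector
ETE, ETDe = E.T @ E, E.T @ d

via_solve = solve_classifier(ETE, ETDe)
via_inverse = np.linalg.inv(np.eye(4) / 0.1 + ETE) @ ETDe
print(np.allclose(via_solve, via_inverse))
```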
<h3>3.4 Feature Dimensionality Reduction</h3>
<p>Another approach when the training data has a high feature-dimension could be to reduce the feature-dimensionality. For more info check out <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">Latent Semantic Indexing</a> (and Analysis), <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a> or <a href="http://ict.ewi.tudelft.nl/~lvandermaaten/t-SNE.html">t-Distributed Stochastic Neighbor Embedding</a>.</p>
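<p>As a rough single-machine sketch of the Singular Value Decomposition variant, one could project the feature matrix A onto its top-k right singular vectors before building E. The function name fit_projection, the random data and k=2 are assumptions for illustration; a real data set of that size would need a distributed or randomized SVD.</p>

```python
import numpy as np

def fit_projection(A, k):
    """Return the top-k right singular vectors of A as an (n, k) matrix."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T

rng = np.random.default_rng(2)      # made-up data: 20 examples, 6 features
A = rng.random((20, 6))

V_k = fit_projection(A, 2)
A_reduced = A @ V_k                 # 20 examples, 2 features
print(A_reduced.shape)
```

<p>The same V_k would have to be applied to every incoming feature vector x before classification, so it needs to be stored together with (omega, gamma).</p>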
<h3>3.5 Reduce IO between  mappers and reducers with compression</h3>
<p><a href="http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression">Twitter presented using LZO compression (on the Cloudera blog) to speed up Hadoop</a>. Inspired by this, one could, in the case of high feature dimension (i.e. large E.T*E and E.T*D*e matrices), compress the output in the mapper and decompress it in the reducer by replacing the base64 encoding/decoding and pickling above with:</p>
<pre class="brush: plain;">
producedvalue = base64.b64encode(lzo.compress(pickle.dumps( (E.T*E, E.T*D*e) ), level=1))
</pre>
<p>and</p>
<pre class="brush: plain;">
ETE, ETDe = pickle.loads(lzo.decompress(base64.b64decode(value)))
</pre>
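<p>If the lzo module is not installed, the same round trip can be tried with zlib from the Python standard library. This is a stand-in for illustration, not the LZO setup from the Twitter posting, and the helper names and the made-up payload are hypothetical:</p>

```python
import base64
import pickle
import zlib

def encode_compressed(obj, level=1):
    """Pickle, compress and base64-encode an object for one streaming line."""
    return base64.b64encode(zlib.compress(pickle.dumps(obj), level)).decode("ascii")

def decode_compressed(value):
    """Reverse of encode_compressed."""
    return pickle.loads(zlib.decompress(base64.b64decode(value)))

# made-up payload shaped like a 100x100 matrix and a 100x1 vector
payload = ([[0.0] * 100 for _ in range(100)], [[0.0] for _ in range(100)])

encoded = encode_compressed(payload)
plain = base64.b64encode(pickle.dumps(payload)).decode("ascii")
print(len(plain), len(encoded))
```

<p>How much is saved depends on the data: an all-zero payload like this one compresses far better than dense real-valued matrices would.</p>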
<h3>3.6 Do more work with approximately the same computing resources</h3>
<p>The D matrix above represents binary classification with a value of +1 or -1 representing each class. It is quite common to have classification problems with more than 2 classes. Supporting multiple classes is usually done by training several classifiers, either 1-against-all (1 classifier trained per class) or 1-against-1 (1 classifier trained per unique pair of classes), and then running a tournament of them against each other and picking the most confident. In the case of 1-against-all classification the mapper could probably send multiple E.T*D_c*e &#8211; with one D_c per class &#8211; and keep the same E.T*E. The reducer would then need to calculate (I/mu + E.T*E).I once and independently multiply it with several E.T*D_c*e sums to create a set of (omega,gamma) classifiers. For 1-against-1 classification it becomes somewhat more complicated, because it involves creating several E matrices, since in the 1-against-1 case only the rows in E where the 2 competing classes occur are relevant.</p>
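<p>A single-machine sketch of the 1-against-all idea, under the assumption that all classes share the same E.T*E: the per-class vectors D_c*e are stacked as columns, so one solve yields one (omega, gamma) column per class. The function names, the toy one-hot data and the class labels are hypothetical.</p>

```python
import numpy as np

def train_one_vs_all(A, labels, mu=0.1):
    """Train one (omega, gamma) column per class, sharing (I/mu + E.T*E)."""
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])
    classes = sorted(set(labels))
    # column c holds D_c*e: +1.0 where the row belongs to class c, else -1.0
    targets = np.array([[1.0 if y == c else -1.0 for c in classes]
                        for y in labels])
    lhs = np.eye(n + 1) / mu + E.T @ E
    W = np.linalg.solve(lhs, E.T @ targets)   # one solve, many right-hand sides
    return classes, W

def predict(x, classes, W):
    """Pick the class whose classifier is most confident."""
    scores = np.append(x, -1.0) @ W           # x.T*omega_c - gamma_c per class
    return classes[int(np.argmax(scores))]

# toy, made-up data: one example per class
A = np.eye(3)
labels = ["a", "b", "c"]

classes, W = train_one_vs_all(A, labels)
predicted = [predict(x, classes, W) for x in A]
print(predicted)
```

<p>In the mapreduce setting the mapper would emit E.T*E together with the stacked E.T*D_c*e columns, and the reducer would sum both before the single solve.</p>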
<h2>4. Code</h2>
<p>(Early) Python code of the algorithm presented above can be found at <a href="http://github.com/atbrox/Snabler">http://github.com/atbrox/Snabler</a> (open source with Apache Licence). Please let <a href="mailto:amund@atbrox.com">me</a> know if you want to contribute to the project, e.g. from  <a href="http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/">mapreduce and hadoop algorithms in academic papers</a>.</p>
<h2>5. More resources about machine learning with Hadoop/Mapreduce?</h2>
<ul>
<li><a href="http://lucene.apache.org/mahout/">Apache Mahout</a> &#8211; active project that implements (in Java) several machine learning algorithms (also unsupervised machine learning, i.e. clustering)
<li>Good paper about machine learning algorithms with mapreduce &#8211; <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf</a>
</ul>
<hr />
<p><font color="#0000ff"><strong>Do you need help with Hadoop/Mapreduce?</strong></font><br />
A good start could be to read <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&#038;tag=amuw-20&#038;linkCode=as2&#038;camp=1789&#038;creative=9325&#038;creativeASIN=0596521979">this book</a>, or contact <a href="http://atbrox.com/about/">Atbrox</a> if you need help with development or parallelization of algorithms for Hadoop/Mapreduce &#8211; <a href="mailto:info@atbrox.com">info@atbrox.com</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/</feedburner:origLink></item>
		<item>
		<title>Atbrox Customer Case Study – Scalable Language Processing with Elastic Mapreduce (Hadoop)</title>
		<link>http://feedproxy.google.com/~r/atbrox/~3/8sx94xca6LM/</link>
		<comments>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 07:04:32 +0000</pubDate>
		<dc:creator>Amund Tveit</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[language processing]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://atbrox.com/?p=507</guid>
		<description><![CDATA[We developed a tool for scalable language processing for our customer Lingit using Amazon&#8217;s Elastic Mapreduce. More details: http://aws.amazon.com/solutions/case-studies/atbrox/ Contact us if you need help with Hadoop/Elastic Mapreduce.]]></description>
			<content:encoded><![CDATA[
<p>We developed a tool for scalable language processing for our customer <a href="http://www.lingit.no">Lingit</a> using Amazon&#8217;s Elastic Mapreduce.</p>
<p><strong>More details:</strong> <a href="http://aws.amazon.com/solutions/case-studies/atbrox/">http://aws.amazon.com/solutions/case-studies/atbrox/</a></p>
<p><a href="http://atbrox.com/contact/">Contact us</a> if you need help with Hadoop/Elastic Mapreduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://atbrox.com/2009/11/14/atbrox-customer-case-study-scalable-language-processing-with-elastic-mapreduce-hadoop/</feedburner:origLink></item>
	</channel>
</rss>
