<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" gd:etag="W/&quot;CEIERX0_fip7ImA9WhRUE0U.&quot;"><id>tag:blogger.com,1999:blog-12987422</id><updated>2012-01-23T21:21:44.346-08:00</updated><category term="technology" /><category term="bioinformatics" /><category term="computing" /><category term="science" /><title>Piccolblog</title><subtitle type="html">A data scientist's notebook</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://blog.piccolboni.info/" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>22</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/piccolboni/qdqo" /><feedburner:info uri="piccolboni/qdqo" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>piccolboni/qdqo</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry gd:etag="W/&quot;DUUDQXc7fSp7ImA9WhdVEUw.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-720920467893968175</id><published>2011-09-15T13:07:00.000-07:00</published><updated>2011-09-15T13:07:50.905-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-09-15T13:07:50.905-07:00</app:edited><title>The connected components example, rewritten using RHadoop/rmr</title><content type="html">My new implementation of random mate for mapreduce, using the package &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rmr&lt;/span&gt; from &lt;a href="http://www.revolutionanalytics.com/"&gt;Revolution Analytics&lt;/a&gt; open source project &lt;a href="http://github.com/RevolutionAnalytics/RHadoop"&gt;RHadoop&lt;/a&gt;.&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;This story has now three episodes. First, I got interested in how to compute &lt;a href="http://en.wikipedia.org/wiki/Connected_component_(graph_theory)"&gt;connected components&lt;/a&gt; in &lt;a href="http://en.wikipedia.org/wiki/MapReduce"&gt;map reduce&lt;/a&gt; in a way that works even for large diameter graphs and &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;proposed an algorithm&lt;/a&gt;, in fact a port to map reduce of an old PRAM algorithm. Then I learned &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;a href="http://ml.stat.purdue.edu/rhipe/"&gt;RHIPE&lt;/a&gt;&lt;/span&gt;, a map reduce package for R and &lt;a href="http://blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html"&gt;implemented that algorithm&lt;/a&gt; using it. Then Revolution Analytics knocked at my door and asked me to work on a similar package and give my best stab at an API that's both easy and powerful. You can read a bit more about &lt;a href="https://github.com/RevolutionAnalytics/RHadoop/wiki/Philosophy"&gt;the ideas behind the package&lt;/a&gt;, but keep in mind that this is a young project and there is more work to do to turn those lofty goals into reality. You can help by just trying it out and providing feedback or by contributing to it. From documentation to unit tests, to minor features, to very significant ones, there's plenty to do.&lt;br /&gt;
Granted, clarity and simplicity are in the eye of the beholder and I am hardly unbiased, being the author of the code, the main committer to &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rmr&lt;/span&gt; and having Revolution as a client. But I invite you to compare this example with its previous incarnation using &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;RHIPE&lt;/span&gt;, namely how the mapreduce jobs are described and invoked, &amp;nbsp;how much less boilerplate code there is and how completely devoid of uncommon R constructs it is, replacing unevaluated expressions and expression substitution with regular functions. Another interesting comparison is with &lt;a href="http://www.hortonworks.com/transitive-closure-in-apache-pig/"&gt;this alternate algorithm&lt;/a&gt; for the same problem, implemented in a mix of Pig and Python. A number of additional examples can be found in the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rmr&lt;/span&gt; &lt;a href="http://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial"&gt;tutorial&lt;/a&gt;, including k-means, logistic regression and linear least squares.  Joshua Block of &lt;i&gt;Effective Java&lt;/i&gt; fame recommends to "&lt;a href="http://lcsd05.cs.tamu.edu/slides/keynote.pdf"&gt;write to your API early and often&lt;/a&gt;" (p. 9) and in my mind I had been writing to something like this for a long time, in some respects even before I had heard of mapreduce. I just needed the opportunity and environment to make it happen and I am grateful to Revolution for providing them. But let the &lt;a href="git://gist.github.com/1219908.git"&gt;code&lt;/a&gt;&amp;nbsp;(git repo) speak:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: r"&gt;library(rmr)

connected.components = function(graph, forest = NULL, i = 0) {
  key.from = function(k,v) keyval(v[['from']], v)
  key.to = function(k,v) keyval(v[['to']], v)
  if(is.null(forest)) {
    ##create trivial forest
    forest = mapreduce(
      input = graph,
      map = function(k, v) list(keyval(v[['from']], NULL), 
                                keyval(v[['to']], NULL)),
      reduce = function(k, vv) keyval(NULL, list(from = k, to = k)))}
  ##merge graph and forest
  graph.forest = equijoin(leftinput =
    equijoin(leftinput = graph,
             rightinput = forest,
             map.left = key.from,
             map.right = key.from,
             reduceall = 
               function(k, vl, vr) 
                 keyval(NULL, 
                        list(parentfrom = vr[['to']], 
                             from = vl[['from']], 
                             to = vl[['to']]))),
    rightinput = forest,
    map.left = key.to,
    map.right = key.from,
    reduceall = function(k, vl, vr) keyval(NULL, 
                                           list(parentfrom = vl[['parentfrom']], 
                                                parentto = vr[['to']], 
                                                from = vl[['from']], 
                                                to = vl[['to']])))
  ##find component merge candidates
  if(active.edges.count(graph.forest) &amp;gt; 0) {
    forest.update = update.parent(graph.forest, i)
    ##depth-reducing step
    forest.new = equijoin(input =
      ##forest update
      equijoin(leftinput = forest,
               rightinput = forest.update,
               outer = "left",
               map.left = key.from,
               map.right = key.from,
               reduceall = function(k,vl,vr) keyval(NULL, 
                                                    if(all(is.na(vr))) {vl} 
                                                    else {vr})),
      outer = "left",
      map.left = key.to,
      map.right = key.from,
      reduceall = function(k,vl,vr) keyval(NULL, 
                                           list(from = vl[['from']], 
                                                to = if(all(is.na(vr)){vl[['to']]} 
                                                     else {vr[['to']]})))
    ##recursion
    connected.components(graph, forest.new, i+1)}
  else  forest}
  
active.edges.count = function(gf) {
  from.dfs(mapreduce(input = gf, 
                     map = function(k,v) 
                       if(v[['parentfrom']] != v[['parentto']] ) 
                       keyval(NULL,1) else keyval(NULL, 0), 
                     reduce = function(k, vv) keyval(NULL, sum(unlist(vv))), 
                     combine = T))[[1]]$val}

update.parent = function(input, iter) {
  library(digest)
  symmetry.break = function(x,r) {
    as.integer(paste("0x", substr(digest(c(x,r)), 1,1), sep = ""))%%2 == 1}
  mapreduce(input = input,
            map = function(k, v) {
              if(v[['parentfrom']] != v[['parentto']]) {
                symmetry.from = symmetry.break(v[['parentfrom']], iter);
                symmetry.to = symmetry.break(v[['parentto']], iter);
                if(symmetry.from != symmetry.to) {
                  if(symmetry.from) {
                    keyval(v[['parentfrom']],list(from = v[['parentfrom']], 
                                                  to = v[['parentto']]))}
                  else {
                    keyval(v[['parentto']], list(from = v[['parentto']], 
                                                 to = v[['parentfrom']]))}}}},
            reduce = function(k, vv) keyval(NULL, vv[[1]]),
            combine = F)}

random.graph = to.dfs(unique(lapply(1:10,
                                    function(i)keyval(NULL, 
                                                      c(from = sample(1:10,1), 
                                                        to = sample(1:10,1))))))


&lt;/pre&gt;
One remark to better understand how this code works (you may need to take the &lt;a href="http://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial"&gt;rmr tutorial&lt;/a&gt; to really get all the details): the objects returned by mapreduce calls do not directly contain the data but point to it. Since we are potentially working on big data, the data stays on HDFS unless the programmer explicitly requests otherwise. These objects take care of creating temp filenames, cleaning them up when they are not referenced anymore, and can be passed around from mapreduce call to mapreduce call. It is a simple but recurring concern that rmr hopefully takes off the programmer's plate once and for all. The line-by-line description may not be so interesting to you if you are familiar with the old RHIPE based implementation, but here it is for completeness sake.&lt;br /&gt;
&lt;i&gt;Line 1&lt;/i&gt;: Load the library.
&lt;br /&gt;
&lt;i&gt;Line 3&lt;/i&gt;: The main function. It takes the graph to analyze, optionally a forest representing a partition into connected components, which start as individual nodes and end up as maximal connected components by the time the algorithm terminates, and an integer. The forest shares the nodes with the input graph and the roots are marked by a self-loop. Forest elements are trees and in this algorithms have depth 1 most of the time, that is it's a forest of stars. The integer argument represents the iteration and is used to generate correlated pseudorandomness across all nodes, details later.
&lt;br /&gt;
&lt;i&gt;Lines 4–5&lt;/i&gt;: Nodes are represented by integers and edges by a list with a &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;from&lt;/span&gt; and a &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;to&lt;/span&gt; element, one per record. These two functions, used over and over again as map functions, set the key to the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;from&lt;/span&gt; and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;to&lt;/span&gt; node of an edge resp., while leaving the value alone.
&lt;br /&gt;
&lt;i&gt;Lines 6–12&lt;/i&gt;: If a forest is not provided as argument, initialize it to the set of all  isolated nodes. Add a self-loop to mark the root.
&lt;br /&gt;
&lt;i&gt;Lines 14–32&lt;/i&gt;: merge the graph and the forest so that every node in the graph is labelled with its parent in the forest, that is the component it belongs too. This is implemented with two successive join-like operations, which in turn are implemented as map-reduce jobs. &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rmr&lt;/span&gt; takes care of some of the ugly details of performing joins in mapreduce providing equijoins as part of the package. Build from the dev branch to get this feature.
&lt;br /&gt;
&lt;i&gt;Lines 34–53&lt;/i&gt;: active edges are edges connecting nodes belonging to different components. If none exists, then components are maximal and the algorithm terminates. The test for active edges is implemented as a separate map-reduce job whereas it could be folded into the forest update job to be described shortly, but we went for modularity here. The forest.update call decides, for each active edge, which components should be merged and how and expresses that as additional edges to be merged into the forest. The next step is formed by two chained equijoins: the inner one, which executes first, merges in the new edges into the forest while keeping the trees trees (invariant: each node has one parent). The outer one transforms each tree, now potentially of depth two after a merger with another tree, into a star.
&lt;br /&gt;
&lt;i&gt;Lines 55–56&lt;/i&gt;: recursive call. Keep growing that forest or return it when done.
&lt;br /&gt;
&lt;i&gt;Lines 58–64&lt;/i&gt;: count active edges. Works on the merged graph-forest data structure, so it's pretty trivial to check for components that should be merged, that is edges incident on different components. The interesting bit is that this is a potentially very large computation that ends with a single number. That's why we can simply grab that data from HDFS with &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;from.dfs&lt;/span&gt; and bring into memory, where it can be used for the termination condition.
&lt;br /&gt;
&lt;i&gt;Lines 66–83&lt;/i&gt;: This is the logic that decides which components to actually merge and I think the most technically sophisticated bit (not my idea). If we just tried to merge every pair of components connected by an active edge, mayhem would ensue. The forest would not be a forest anymore, and cycles would be possible. We need to elect "donor" components and "acceptor" components and exclusively merge the former into the latter. That will leave the forest a forest after the merger step is performed. The smart part is the very simple and distributed mechanism employed to decide donors and acceptors. It's a pseudorandom boolean computed as a hash function of the component root node name, which works as component id, and the iteration number (the somewhat convoluted &lt;i&gt;line 70&lt;/i&gt; does just that, would love to hear of a simpler way). The latter is included so that the "donor" and "acceptor" assignments change at each iteration and eventually all active edges have a chance to be used for a merger, thus becoming active no more. 
&lt;br /&gt;
&lt;i&gt;Lines 85–88&lt;/i&gt;: How to create some test data


&lt;br /&gt;
&lt;br /&gt;
If you liked this or anything else piqued your interest, please give &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rmr&lt;/span&gt; a spin and &lt;a href="mailto:antonio@piccolboni.info"&gt;let me&lt;/a&gt; know of any problems and comments or just new applications and examples.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-720920467893968175?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=V_bEOEKPZKE:weA74DiQRNQ:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=V_bEOEKPZKE:weA74DiQRNQ:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=V_bEOEKPZKE:weA74DiQRNQ:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/V_bEOEKPZKE" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/720920467893968175/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2011/09/connected-components-example-rewritten.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/720920467893968175?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/720920467893968175?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/V_bEOEKPZKE/connected-components-example-rewritten.html" title="The connected components example, rewritten using RHadoop/rmr" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>2</thr:total><feedburner:origLink>http://blog.piccolboni.info/2011/09/connected-components-example-rewritten.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEUNQXs7eCp7ImA9WhZQGU8.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-4726269619950367059</id><published>2011-04-27T10:44:00.000-07:00</published><updated>2011-04-27T10:44:50.500-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-27T10:44:50.500-07:00</app:edited><title>A map reduce algorithm for connected components: implementation</title><content type="html">At long last, a complete implementation of the &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;algorithm&lt;/a&gt; I described some time ago.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;You are kindly advised to go back and check the algorithm motivation and description in my &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;older post&lt;/a&gt;, but the short of it is that it is a map reduce algorithm for connected components that is not sensitive to the diameter of the graph, a first at that time to the best of my knowledge. It works by merging subset of nodes that are connected by an edge and it does so in a highly parallel way — the number of subsets drops exponentially at each iteration, by \(1/2\) in expectation in the undirected version of the algorithm, in an adaptation of a pre-existing algorithm for the PRAM model.&lt;br /&gt;
&lt;br /&gt;
As I discussed in a &lt;a href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html"&gt;recent post&lt;/a&gt;, I got interested in Rhipe, an R library for Hadoop, and this is what we are going to use here. Wait, you are saying, a graph algorithm in a statistics oriented language? That's exactly right. First, I thought it would be more interesting to test R and Rhipe outside their confort zone of computational statistics. Second, in the same post I didn't find another Hadoop library that fit all my criteria, so Rhipe had to do. Third, the R community does not limit itself to statistics and machine learning, even if those are its fundamental strengths. In fact there are a graph package — &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;graph&lt;/span&gt; — and even a graph drawing package — &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Rgraphviz&lt;/span&gt;. A couple more notes on the implementation. This is not production-ready code. For instance, I did not bother removing the various temp files, since I wanted to inspect them after the fact. There is also a counter update that is placed in the middle of a loop without much consideration for scalability. These are not fundamental problems and can be easily fixed. Moreover, you will see that I call multiple times the map reduce relational join about which I reported in &lt;a href="http://blog.piccolboni.info/2011/04/bringing-relational-joins-to-rhipe.html"&gt;this post&lt;/a&gt;. In fact, there are only two "custom" map reduce jobs involved here, one of which is trivial and I think this supports the idea that the relational join is an important abstraction for map reduce programs, not just a case of SQL-envy. And without further ado, the program:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: r"&gt;function(E, F_ = NULL, i =0) {
  T1 = tempfile()
  T2 = tempfile()
  Fupdate = tempfile()
  Fnew  = tempfile()
  Ftemp = tempfile()

  if (is.null(F_)) {
                                        #create trivial forest
    F_ = tempfile()
    rhex(
         rhmr(
              ifolder = E,
              ofolder = F_,
              map = expression({lapply(map.values, function(v) {
                rhcollect(as.integer(v$from), NULL)
                rhcollect(as.integer(v$to), NULL)})}),
              reduce  = expression(post = {rhcollect(reduce.key, list(from = reduce.key, to = reduce.key))}),
              inout = c("sequence", "sequence")))}

                                        #merge graphs and forest
  rhreljoin (ileftfolder= E,
             irightfolder= F_,
             ofolder=T1,
             map.left.key = function(k,v) as.integer(v$from),
             map.right.key = function(k,v) as.integer(v$from),
             reduce.value = function(k,ve,vf) list(parentfrom = vf$to, from = ve$from, to = ve$to))
  rhreljoin (ileftfolder= T1,
             irightfolder= F_,
             ofolder=T2,
             map.left.key = function(k,v) as.integer(v$to),
             map.right.key = function(k,v) as.integer(v$from),
             reduce.value = function(k,vt1,vf) list(parentfrom = vt1$parentfrom, parentto = vf$to, from = vt1$from, to = vt1$to))
  
                                        #find component merge candidates
  live_edges = update_parent (ifolder= T2, ofolder= Fupdate, i)$counters$live_edges
  if (!is.null(live_edges)) {
    print(paste("live edges:", live_edges))
                                        #forest update
    rhreljoin (ileftfolder= F_ ,
               irightfolder= Fupdate,
               leftouter= T,
               map.left.key = function(k,v) as.integer(v$from),
               map.right.key = function(k,v) as.integer(v$from),
               reduce.value= function(k,vl,vr) if(is.na(vr)) {vl} else {vr},
               ofolder= Ftemp)
                                        #depth-reducing step
    rhreljoin (ifolder= Ftemp ,
               ofolder = Fnew,
               leftouter = T,
               map.left.key = function(k,v) as.integer(v$to),
               map.right.key = function(k,v) as.integer(v$from),
               reduce.value= function(k,vl,vr) list(from = vl$from, to = if(is.na(vr$to)){vl$to} else {vr$to}))
                                        #recursion
    randmate(E, Fnew, i+1)}
  else  return (F_)
}
&lt;/pre&gt;&lt;br /&gt;
&lt;i&gt;Line 1&lt;/i&gt;&lt;br /&gt;
The signature has three arguments, the graph itself, the forest used to represent node sets that will evolve into connected components and recursion level that is used to control pseudo-random choices in a coordinate fashion across processes. The second and third argument need not be provided by the end user, they are only useful for recursive calls. Graphs are represented as lists of lists with two elements, named "from" and "to" — albeit this algorithm is for undirected graphs.&lt;br /&gt;
&lt;i&gt;Lines 2–6&lt;/i&gt;&lt;br /&gt;
Some temp files. I do not remove them in this implementation, but in production you should.&lt;br /&gt;
&lt;i&gt;Lines 8–20&lt;/i&gt;&lt;br /&gt;
On the first call, the forest is not provided. Initialize it with the trivial forest, one node per tree — here self-loops mark the root in a slight departure from standard forests.&lt;br /&gt;
&lt;i&gt;Lines 22–34&lt;/i&gt;&lt;br /&gt;
With two simple joins, create a combined view of the graph and the forest where each record contains one edge of the graph and the parents in the forest for each of the end-nodes.&lt;br /&gt;
&lt;i&gt;Line 36&lt;/i&gt;&lt;br /&gt;
The only "custom" map reduce job of interest in the whole program, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;update_parent&lt;/span&gt; makes the actual decisions on what node subsets are going to be merged. It returns the number of &lt;i&gt;live edges&lt;/i&gt;, edges that bridge two subsets. When this is 0 — actually, NULL in the implementation — we are done. I will explain this after the main program.&lt;br /&gt;
&lt;i&gt;Line 37&lt;/i&gt;&lt;br /&gt;
This check tells us we are done.&lt;br /&gt;
&lt;i&gt;Line 38&lt;/i&gt;&lt;br /&gt;
Trace the decline in the number of live edges, just for my education.&lt;br /&gt;
&lt;i&gt;Line 40–46&lt;/i&gt;&lt;br /&gt;
Merge some pairs of trees in the forest as already decided at line 36. This actually executes the merger and can be easily expressed as a join.&lt;br /&gt;
&lt;i&gt;Lines 48–54&lt;/i&gt;&lt;br /&gt;
This modifies the forest to go from a max depth of 2 back to 1 as we started with. You guessed, it's just another join, this time of the self- variety.&lt;br /&gt;
&lt;i&gt;Lines 55–56&lt;/i&gt;&lt;br /&gt;
Recurse or return the forest. It's tail recursion, the algorithm is iterative, but we don't do loops unless we have to. It's a matter of style.&lt;br /&gt;
&lt;br /&gt;
The next function is the implementation of the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;update_parent&lt;/span&gt; function above.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: r"&gt;function(ifolder, ofolder, mrround)
  rhex(
       rhmr(
            ifolder= ifolder,
            ofolder = ofolder,
            map = eval(
              substitute(
                         expression({
                           library(digest)
                           is.high = function(x,r) {
                             as.integer(
                                        paste("0x", substr(digest(c(x,r)), 1,1), sep = ""))%%2 == 1}
                           lapply(map.values,
                                  function(v){
                                    if(v$parentfrom != v$parentto) {
                                      rhcounter("live_edges", "live_edges", 1)
                                      hfrom = is.high(v$parentfrom, mrround);
                                      hto = is.high(v$parentto, mrround);
                                      if (hfrom != hto) {
                                        if (hfrom) {
                                          rhcollect(v$parentfrom,list(from = v$parentfrom, to = v$parentto))}
                                        else {
                                          rhcollect(v$parentto, list(from = v$parentto, to = v$parentfrom))}}}})}),
                         list(mrround = mrround))),
            reduce = expression(
                post = {rhcollect(reduce.key, reduce.values[[1]])}),
            inout = c("sequence","sequence"),
            combiner = T))
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Line 1&lt;/i&gt;&lt;br /&gt;
This function takes in input the combined graph-forest view and outputs a list of edge updates for the forest into its ofolder argument. It also needs a round number that is used to generate correlated pseudorandomness across tasks.&lt;br /&gt;
&lt;i&gt;Lines 2–5&lt;/i&gt;&lt;br /&gt;
The function consists of one map reduce job with said input and output files.&lt;br /&gt;
&lt;i&gt;Lines 6–24&lt;/i&gt;&lt;br /&gt;
The mapper randomly assigns a pseudo-random binary label to each node that depends only on the node itself and the round number. Then it checks that the two nodes defining an edge belong to different trees in the forest (&lt;i&gt;liveness&lt;/i&gt;) and that they have different pseudo-random labels. If so it emits a key, value pair that represents an edge from the root of one tree to the other, keyed on the origin node, which is like a green light to merging the two trees.&lt;br /&gt;
&lt;i&gt;Lines 25–28&lt;/i&gt;&lt;br /&gt;
The reducer just arbitrarily picks one of these update edges per tree. One tree can only merge into another one during one round; on the other hand the "receiving" tree can accept many trees — it's an asymmetrical merge. This is equivalent to the arbitrary CRCW PRAM model in the original PRAM algorithm from which this is derived and allows to use a combiner, which is important in case of high degree nodes. An animation depicting the algorithm's working is below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://s1134.photobucket.com/albums/m613/piccolbo/?action=view&amp;amp;current=randmate.gif" style="margin-left: auto; margin-right: auto;" target="_blank"&gt;&lt;img alt="Photobucket" border="0" src="http://i1134.photobucket.com/albums/m613/piccolbo/randmate.gif" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;An animation of the algorithm at work on a small graph. The border/fill color combinations represent membership to a specific tree in the forest and eventually coincide with the connected components of the graph. It starts with a different combination for each node, and then some expand until only one per connected component is left. Animation generated with custom R code, Rgraphviz and the Gimp.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;
&lt;br /&gt;
One detail: if you are going to try this at home, my setup is Ubuntu maverick, R 2.11.1, Rhipe 0.65.4 and CDH3. With this setup, to avoid what I suppose is a Rhipe bug, one needs to specify certain map reduce options to rhmr, specifically &lt;br /&gt;
&lt;pre&gt;mapred = list(mapreduce.fileoutputcommitter.marksuccessfuljobs = "false").&lt;/pre&gt;Annoying to figure out but not hard to work around. Most of this post was prepared with older versions of everything with which this problem did not manifest itself, and hopefully it will be fixed soon.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Epilogue&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
In all of 85 lines — and I didn't skimp on line breaks to get there — we have implemented a non-trivial, scalable algorithm in the map reduce framework. You could factor in the 77 lines that went into implementing joins, but that would grossly underestimate the reusability of that abstraction, which is provided by several languages and libraries for map reduce independent of this algorithm, or even graph algorithms in general. Readability is in the eye of the beholder, but if you are used to moving around functions as first class objects and are basically familiar with map reduce, I hope this code is reasonably clear. &lt;br /&gt;
Besides an implementation for the algorithm presented in &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;an older post&lt;/a&gt;, the present post also provides a beefier Rhipe exercise as I had promised in a &lt;a href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html"&gt;review post&lt;/a&gt; of map reduce libraries and languages. Therein I had also anticipated the following friendly challenge, where to take part is to win: can anyone implement this algorithm in any of the alternative map reduce environments? The list includes, but doesn't have to be limited to: Java/Hadoop, Java/Cascading, C++/Pipes, Python/Dumbo, Pig, Hive and Cascalog. Is it possible? How hard is it? Can we learn anything comparing two implementations? This is not &lt;i&gt;wordcount&lt;/i&gt; and I suspect we stand to learn a lot more from the comparison of implementations of a non-trivial algorithm.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-4726269619950367059?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=eVW27zL1Zic:izL4h2K_G94:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=eVW27zL1Zic:izL4h2K_G94:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=eVW27zL1Zic:izL4h2K_G94:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/eVW27zL1Zic" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/4726269619950367059/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4726269619950367059?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4726269619950367059?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/eVW27zL1Zic/map-reduce-algorithm-for-connected.html" title="A map reduce algorithm for connected components: implementation" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>3</thr:total><feedburner:origLink>http://blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUUBSX47cSp7ImA9WhZRGU0.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-7979440164268043910</id><published>2011-04-15T15:40:00.000-07:00</published><updated>2011-04-15T15:40:58.009-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-15T15:40:58.009-07:00</app:edited><title>Bringing relational joins to Rhipe</title><content type="html">Relational operations are a very common way to express map-reduce computations at a higher level, but Rhipe, an R package for mapreduce, doesn't have any. Let's start to fix this with a basic join function.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
This is going to be a little dry and technical, in preparation of better things to come.&lt;br /&gt;
As I was working on implementing a non-trivial map-reduce algorithm in Rhipe, I realized I needed joins, in the relational sense — Rhipe has an &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rhjoin&lt;/span&gt; call but it is unrelated. The not-so-deep reason I needed them is that that's the way I described the algorithm in my &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;pseudo-code&lt;/a&gt;, but is that the only or the best way to do it? I don't know the answer, but it looks like several Hadoop derivatives that try to offer higher level alternatives have joins and a number of other SQL-like features: Hive of course but also Cascading, Cascalog, and Pig. Dumbo has sort of a lower level support for joins in the form of "join keys". While these are hardly independent validations of the use of relational operations for mapreduce, they give me some confort that other people see them as important. In fact, I don't think we have seen enough examples of mapreduce algorithms yet to know which techniques are going to be the most important.&lt;br /&gt;
Before we get onto the actual code, please bear in mind that this is a very basic join and doesn't use any of the techniques that have been devised to make joins faster and less memory intensive, from secondary keys to map-side joins. I thought it would be premature optimization to try and implement them right away and also Rhipe poses some restrictions on the partitioner, I suspect inherited from Hadoop Streaming, that would have made using them in this context hard. So, without further ado, the code (syntax highlighting is approximate at best, my apologies):&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: r"&gt;function(
         ileftfolder = NULL,
         irightfolder = NULL,
         ifolder = NULL,
         ofolder,
         leftouter = F,
         rightouter = F,
         fullouter = F,
         map.left.key,
         map.right.key,
         map.right.value = function(k, v) v,
         map.left.value = function(k, v) v,
         reduce.key = function(k, vl, vr) k,
         reduce.value = function (k, vl, vr) list(left=vl, right=vr)){if (is.null(ileftfolder)) {
    ileftfolder = ifolder}
  mapcollect = function(k,v, kfun, vfun, left) rhcollect(kfun(k,v), list(value = vfun(k,v), left = left))
  rhex(
       rhmr(
            map = eval(
              substitute(
                         if (is.null(ifolder)) {
                           expression({
                             lapply(seq_along(map.keys), function(i) {
                               if(length(grep(paste("^file:", ileftfolder, sep = ""), 
                                              Sys.getenv("mapred.input.file")))) {
                                   mapcollect(map.keys[[i]], map.values[[i]], map.left.key, map.left.value, T)}
                                 else {
                                   mapcollect(map.keys[[i]], map.values[[i]], map.right.key, map.right.value, F)}
                             })})}
                         else {
                           expression({
                             lapply(seq_along(map.keys), function(i) {
                               mapcollect(map.keys[[i]], map.values[[i]], map.left.key, map.left.value, T)
                               mapcollect(map.keys[[i]], map.values[[i]], map.right.key, map.right.value, F)
                             })})},
                         list(map.left.key = map.left.key,
                              map.right.key = map.right.key,
                              map.right.value = map.right.value,
                              map.left.value = map.left.value,
                              ileftfolder = ileftfolder,
                              mapcollect = mapcollect))),
            reduce = eval(
              substitute(
                         expression(
                             pre = {
                               reduce.split = 
                                 function(values, left) lapply(values[which(lapply(values, 
                                                                                   function(x) x$left) == left)], 
                                                               function(x) x$value)
                               all.values = c()},
                             reduce = {
                               all.values = c(all.values, reduce.values)},
                             post = {
                               reduce.values.left = reduce.split(all.values, T)
                               if(length(reduce.values.left) == 0 &amp;amp;&amp;amp; (rightouter || fullouter)) {
                                 reduce.values.left = c(NA)}
                               reduce.values.right = reduce.split(all.values, F)
                               if(length(reduce.values.right) == 0 &amp;amp;&amp;amp; (leftouter || fullouter)) {
                                 reduce.values.right = c(NA)}
                               lapply(reduce.values.left,
                                      function(x) lapply(reduce.values.right,
                                                         function(y) rhcollect(reduce.key.fun(reduce.key, x, y),
                                                                               reduce.value(reduce.key, x, y))))
                             }),
                         list(reduce.key.fun = reduce.key,
                              reduce.value = reduce.value,
                              leftouter = leftouter,
                              rightouter = rightouter,
                              fullouter = fullouter
                              ))),
            combiner = F,
            inout = c('sequence','sequence'),
            ifolder=c(ileftfolder,irightfolder),
            ofolder= ofolder
            )
       )
}&lt;/pre&gt;&lt;br /&gt;
&lt;i&gt;Lines 1–15&lt;br /&gt;
&lt;/i&gt; This is the signature. There are left and right input files for regular joins or a single input for self joins. There's an output and then three flags to determine if it's going to be a left, right or full outer join. Then we have two functions, one for the left argument and one for the right argument to the join, to specify the join key (we need both also for a self-join). These functions are called for each key, value pair in the mapper and their output becomes the key for the shuffle and reduce steps. We can also specify which values will be sent into the reduce step from the left input and the right input, but these are optional and default to letting the values go through unmodified. Then we have two more functions that determine the key, value pairs emitted by the reduce phase. They take as arguments the value for the join key, the value from the left input and the value from the right input, of course record by record. The key defaults to simply the join key and the value is a named pair with the left side and the right side simply containing the value from each side resp., sort of an equivalent to SELECT *.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Lines 17–18&lt;br /&gt;
&lt;/i&gt; &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rhex&lt;/span&gt; executes a job and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rhmr&lt;/span&gt; creates the specs for a job. The latter takes several arguments, that we are going to describe in detail&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Lines 18–41&lt;br /&gt;
&lt;/i&gt; This is the map phase of the job, provided as an R expression. "Expressions" in R are unevaluated expressions. You get a value out of them by using eval and can modify them using substitute, which we use &lt;i&gt;in lieu&lt;/i&gt; of passing arguments. For added confusion, substitute returns an expression generator, so you need eval to get again an unevaluated expression. Like, you replace the spark plugs in a car and you are left with a car generator — sounds too good to be true. Passing an expression instead of a function is a choice that the author of Rhipe himself &lt;a href="https://groups.google.com/d/topic/rhipe/KU0fibsMLkk/discussion"&gt;explains&lt;/a&gt;. More precisely, there are two expressions, one for the join of two inputs and one for the self join. For the first case, it is enough, for each record, to decide which input the current record came from, evaluate the right key and value functions and emit a key, value pair. In the self-join case one has to emit both a left and right key value pair for each record.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Lines 42–70&lt;br /&gt;
&lt;/i&gt;This is the reduce phase of the job. The drill about passing an expression and substituting arguments is exactly the same as for map. The expression this time is composed of three sub-expressions. For each key, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;pre&lt;/span&gt; is evaluated before the first record, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;reduce&lt;/span&gt; several times, one per block of records, and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;post&lt;/span&gt; after the last one. Here &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;pre&lt;/span&gt; simply defines a function and a variable for later use, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;reduce&lt;/span&gt; accumulates all the records and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;post&lt;/span&gt; splits left and right side of the join, adds NA entries when necessary for outer joins and emits all the output records with two nested loops, applying the functions in the arguments to create them. &lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Lines 71–74&lt;br /&gt;
&lt;/i&gt;Some additional &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rhmr&lt;/span&gt; arguments that specify whether to use a combiner, and format and names of inputs and outputs. The returned value is that returned by &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rhex&lt;/span&gt;.&lt;p&gt;&lt;br /&gt;
And that's it, unless you are developing in Rhipe this is not so interesting. But in all of 77 lines we have built a missing component to Rhipe and we'll put it to good use in the next installment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-7979440164268043910?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CIHWOovaCLI:9NLOL6DRt8I:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CIHWOovaCLI:9NLOL6DRt8I:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CIHWOovaCLI:9NLOL6DRt8I:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/CIHWOovaCLI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/7979440164268043910/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2011/04/bringing-relational-joins-to-rhipe.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/7979440164268043910?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/7979440164268043910?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/CIHWOovaCLI/bringing-relational-joins-to-rhipe.html" title="Bringing relational joins to Rhipe" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2011/04/bringing-relational-joins-to-rhipe.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CU8HRHo8fCp7ImA9WhZRFUk.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-3561471346328338615</id><published>2011-04-11T10:43:00.000-07:00</published><updated>2011-04-11T10:43:55.474-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-11T10:43:55.474-07:00</app:edited><title>Let a million Twitters bloom</title><content type="html">Why are some people uncomfortable with cloud computing? What are the limitations and is there a way forward?&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
The recent sudden change in Twitter terms of service for developers — the consensus is, despite attempts to backtrack, that they are against third party clients — has unleashed a debate about the current model for web development. Big server farms hold the data and do the number crunching while users access services through thin clients, mostly browsers. At the same time, web services are run by a single company in a monolithic fashion: Google and Microsoft for search, Facebook and Linkedin for social networking, Twitter for microblogging. But it doesn't have to be that way. We all use email and there is ample choice of email providers; you can even run your own server. On top of that, all email providers participate in the same network: you can email anyone independent of what their email address is. This is possible because all email providers use protocols like POP, IMAP and SMTP, that have been donated to the commons by their inventors — notably, the late Jon Postel. The internal workings of the servers can be completely different, as long as they use SMTP to talk to each other and POP or IMAP to talk to the clients. It didn't have to go the way it did. At some point there were several &amp;nbsp;competing incompatible email services: AOL, Compuserve etc. We could all be using AOL email addresses today if AOL had won, but an open standard won instead. Email is not the only example. The DNS, the system that translates domain names into numeric addresses is based on an international collaborative effort, is distributed in architecture and ownership and based on open, free protocols. Skype, while proprietary, has a decentralized architecture whereby central servers provide directory and search services, but the heavy lifting of voice and video communication is taken on directly by the software running on the user machines. In this case it works well for Skype which retains total control of the network while keeping its server and network costs in check. Then there is the extreme example, the BitTorrent network, where almost every feature, with the partial exception of search, works in a completely decentralized fashion: there are no servers and the clients talk to each other for all their needs. The protocol is open and while several companies offer BitTorrent related software and services, none of them owns or controls significant portions of the network. Among the advantages of decentralized solutions:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Competition: different providers can compete on price and service — I see ad supported models as a degradation of service.&lt;/li&gt;
&lt;li&gt;Lack of lock in: users can switch to alternative providers of equivalent services.&lt;/li&gt;
&lt;li&gt;Privacy: with appropriate design choices, data can be kept close to the owner and protected with cryptography when on route elsewhere.&lt;/li&gt;
&lt;li&gt;No need for giant server farms.&lt;/li&gt;
&lt;li&gt;More local resources, less sensitive to network latency.&lt;/li&gt;
&lt;li&gt;Disconnected operation.&lt;/li&gt;
&lt;li&gt;Reliability: harder for an individual, organization or accident to take down a distributed service — not impossible though.&lt;/li&gt;
&lt;li&gt;Can't go out of business: if there is no central owner, he can't file for bankruptcy. But even distributed services can fall to spam or&amp;nbsp;obsolescence, think of USENET or gopher for instance.&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;Disadvantages include limitations in what features can be implemented, for instance distributed web search is still an open problem, despite claims to the contrary; difficulty in policing the system — see the email spam problem, or the malware problem on BitTorrent for instance; and lack of economic incentives deriving from the control of a centralized network.&lt;/div&gt;&lt;div&gt;In particular people have been discussing the role and future of open source software in the age of centralized web services. The freedom promise becomes empty for the user if the free software runs on a central server, with the exception of the browser or other thin client.&lt;/div&gt;&lt;div&gt;Several projects have focused on alternatives for specific services, like Diaspora for Facebook and identi.ca or rstatus for Twitter. These are positive developments and I hope they get the traction they deserve (I can be reached as piccolbo@identi.ca). But they fall short in two respects, at least as far as I understand them. The first is that they are still bound to a multi-server model. So if you are an open source developer, you still have to run some server to make your software available. If your software takes off, you need to find money to host the system, possibly a lot of money. As a user your data is still in somebody else's hands: he could sell it, destroy it or what not. The other problem with these projects is that they solve one specific problem but they don't provide a programming environment for open source to go distributed. While still based on a multi-server approach, the now open source Wave protocol represents a step in the right direction. It is a distributed framework to create a variety of networked applications on top of it. It doesn't help though that Google shut down Wave the application that was built on top of it. Another general answer is proposed by unhosted. Their system is still multi-server, but the server provides only passive, encrypted storage with discovery and authentication services on top of it and all the processing happens in the client. It's not clear what limitations are entailed by having passive, non programmable servers — imagine for instance receiving a message in such an architecture — but advantages like increased privacy and reduced lock-in are clear. Another interesting solution is to build couchapps, javascript apps backed by a couchdb, a document store with replication and version merging features, extensible with server-side javascript. Again it is not clear to me how general couchapp is as a basis for web development, but it is interesting. These are all very encouraging developments, but it seems to me we haven't seen one or a few solutions attract the open source community in significant numbers, let alone gain a user base. Let me list some elements that should be part of this solution.&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;A directory service. It should be possible for a user to be addressable with his email address, user@domain, for a variety of services besides email, without the actual domain being that of a server that runs anything. For instance, my address is antonio@piccolboni.info. You couldn't tell from this address that I use gmail, and I could switch to another provider without you having to know. Since this little magic is performed by the DNS, it seems like the first candidate to look into.&lt;/li&gt;
&lt;li&gt;Authentication. It should be easy for people to establish a communication with a party knowing that they are who they say they are. On the other hand it should be possible to interact anonymously, when both parties are OK with that.&lt;/li&gt;
&lt;li&gt;Communication. Applications can send messages to each other. XMPP and extensions, such as wave, or RSS and pubsubhubbub.&lt;/li&gt;
&lt;li&gt;Storage. Data should be moved near where the users are for performance and ownership, but also copied elsewhere for reliability and availability, such as in a distributed social network all of your friends could hold a copy of your data, so that at least one copy will be available at all times with high probability.&lt;/li&gt;
&lt;li&gt;Encryption. Since your data is going to be all over the place, to keep it secure we need encryption and a system for key distribution.&lt;/li&gt;
&lt;li&gt;Availability. Since in the most fully distributed scenario nodes can be laptops that are on and off all the time, availability can be achieved not only by replication, but also with delegation. That is, when your laptop is off, one of your friends' machines would take over and, for instance, send you reminders. This means that the discovery service need to find an alternate server when the main one is off, communication can either fail (synchronous) or be delayed (asynchronous) etc. An alternative could be to use smartphones or wireless routers as supplementary servers, since they are always on. They might not be able to store a lot of data or serve it, but they could queue messages or send alerts when the main server is off-line.&lt;/li&gt;
&lt;/ol&gt;&lt;div&gt;It seems to me most or all of the pieces of the puzzle exist. We have TaoheFS for&amp;nbsp;distributed&amp;nbsp;storage, fully P2P networks such as BitTorrent, XMPP or RSS for messaging and public key criptography. Key distribution is still a problem and the availability of unreliable, personal servers is also a challenge, but hopefully somebody can package all these things into an open source P2P platform so that people with product ideas, but who can not solve all these problems on their own, can &amp;nbsp;start creating and unleash another wave of innovation and freedom.&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-3561471346328338615?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=yWMv7ghRgCc:xUXS5imI83I:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=yWMv7ghRgCc:xUXS5imI83I:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=yWMv7ghRgCc:xUXS5imI83I:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/yWMv7ghRgCc" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/3561471346328338615/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2011/04/let-million-twitters-bloom.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3561471346328338615?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3561471346328338615?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/yWMv7ghRgCc/let-million-twitters-bloom.html" title="Let a million Twitters bloom" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2011/04/let-million-twitters-bloom.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DkIFR3w8fip7ImA9WhRQEUw.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-672812696491855954</id><published>2011-04-07T22:14:00.000-07:00</published><updated>2011-12-05T11:48:36.276-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-12-05T11:48:36.276-08:00</app:edited><title>Looking for a map reduce language</title><content type="html">On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is …&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
Update: since writing this post, I was approached by Revolution Analytics to write yet another map reduce library, this time for R, and it is, of course, my new favorite. &lt;a href="https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr"&gt;Check it out&lt;/a&gt;!&lt;br /&gt;
&lt;br /&gt;
In a couple of blog entries from my personal blog I described some map-reduce algorithms for &lt;a href="http://blog.piccolboni.info/2010/07/algorithm-for-sample-quantiles-in-map.html"&gt;statistical&lt;/a&gt; and &lt;a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html"&gt;graph&lt;/a&gt; problems and sketched their implementation using pseudo-code. Pseudo-code has two problems: not everybody agrees on what a statement means and it doesn't run, so you can't test it or use it. Real programming languages on the other hand tend to obscure the logic of a program with unnecessary detail and have other issues that hinder readability, the reason why people resort to pseudo-code. But there is more to it than just aesthetics. Conciseness of code is related to programming abstractions, constructs that achieve higher generality and remove unnecessary detail; to reuse, whereby the same code is used in different contexts, reducing total program size; and even &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.567"&gt;testing&lt;/a&gt;, that is concise programs can be tested more easily. In short, short is better. The elegance of less is hardly my own or a software engineering discovery. As Antoine de Saint-Exupery, French writer and aviator, so eloquently put it :&lt;br /&gt;
&lt;blockquote&gt;
&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.&lt;/div&gt;
&lt;/blockquote&gt;
Unfortunately, in some circles, dull, predictable, repetitive code is considered simpler than short and to the point code, or at least tolerable. From java.util.Arrays:&lt;br /&gt;
&lt;blockquote&gt;
478     /*&lt;br /&gt;
479      * The code for each of the seven primitive types is largely identical.&lt;br /&gt;
480      * C'est la vie.&lt;br /&gt;
481      */&lt;/blockquote&gt;
In this case, repetition gets a free pass in exchange for efficiency. Very expressive languages tend to exact a higher toll on resources, and the different map-reduce environments we will look into are no exception.&lt;br /&gt;
I will present for each language or library the implementation of a word count program, lifted from its documentation, since this has become sort of the "Hello World" for map reduce. I don't think such a simple program is the ultimate test of the quality of a language, so this is just to give a taste of the language. What I am most interested in is:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Can I write reasonably concise, abstract programs in this language or library?&lt;/li&gt;
&lt;li&gt;Can I write the "inside" of map reduce, that is the code for the mapper and the reducer, as well as the "outside", the logic that decides which map reduce jobs to run?&lt;/li&gt;
&lt;li&gt;Is it general? Can I write any map-reduce program, including programs that require multiple map-reduce jobs, including the case of a data dependent number and type of jobs?&lt;/li&gt;
&lt;/ul&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Java Hadoop &lt;/span&gt;&lt;br /&gt;
This is the original, the real thing, the current performance champion and what "real men" write in. It is also the most mature of the different options. But take a look:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: java"&gt;public static class MapClass extends MapReduceBase
 public class WordCount {

     public static class Map extends MapReduceBase implements Mapper&amp;lt;LongWritable, Text, Text, IntWritable&amp;gt; {
       private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();
 
       public void map(LongWritable key, Text value, OutputCollector&amp;lt;Text, IntWritable&amp;gt; output, Reporter reporter) throws IOException {
         String line = value.toString();
         StringTokenizer tokenizer = new StringTokenizer(line);
         while (tokenizer.hasMoreTokens()) {
           word.set(tokenizer.nextToken());
           output.collect(word, one);
         }
       }
     }
 
     public static class Reduce extends MapReduceBase implements Reducer&amp;lt;Text, IntWritable, Text, IntWritable&amp;gt; {
       public void reduce(Text key, Iterator&amp;lt;IntWritable&amp;gt; values, OutputCollector&amp;lt;Text, IntWritable&amp;gt; output, Reporter reporter) throws IOException {
         int sum = 0;
         while (values.hasNext()) {
           sum += values.next().get();
         }
         output.collect(key, new IntWritable(sum));
       }
     }
 
     public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(WordCount.class);
       conf.setJobName("wordcount");
 
       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);
 
       conf.setMapperClass(Map.class);
       conf.setCombinerClass(Reduce.class);
       conf.setReducerClass(Reduce.class);
 
       conf.setInputFormat(TextInputFormat.class);
       conf.setOutputFormat(TextOutputFormat.class);
 
       conf.setInputPath(new Path(args[0]));
       conf.setOutputPath(new Path(args[1]));
 
       JobClient.runJob(conf);
     }
 }
 
&lt;/pre&gt;
48 lines to write a word count program (and I stripped the import statements at the top out of mercy)! My favorite line is number 5, a line devoted to redefining the number one. This makes sense in a world where programmer productivity is measured by number of lines of code written or for a production job that runs on a 1,000 node cluster for 5 hours every night, in which case efficiency may trump other considerations. But for a blog, for discussing and enjoying code, anything remotely more interesting than a word count program would not fit the size of an entry but would have to be an attachment, as John Mount did with his painstaking &lt;a href="http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/"&gt;Java/Hadoop implementation of logistic regression&lt;/a&gt;. I wonder how many people opened that tar file and read it through and through. &lt;br /&gt;
Hadoop was started by Doug Cutting and is developed by a large community with a significant group employed at Yahoo.&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Cascading&lt;/span&gt;&lt;br /&gt;
Cascading is a Java library written on top of Hadoop. It enables programming in a dataflow style, with some primitives inspired by SQL (like GroupBy). But according to a person closely related to the project, "it's still Java, it's still boilerplate code". My favorite line is number 18. Remarkably, it trims down the line count for the word count program to half as many as plain Hadoop. I don't have first hand experience with Cascading, but since there is no or little performance penalty compared to the real thing — depending on programmer skill, it could actually be better — it's worth a try for production work. &lt;br /&gt;
Cascading is developed by Chris Wensel at Concurrent.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: java"&gt;Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );

Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "wordcount" );

String regex = "(?&amp;gt;!\\pL)(?=\\pL)[^ ]*(?&amp;lt;=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

assembly = new GroupBy( assembly, new Fields( "word" ) );

Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

flow.complete();
&lt;/pre&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Pipes — C++&lt;/span&gt;&lt;br /&gt;
C++ fits into the Java environment not without some effort, which is encapsulated in a library called Pipes. The word count program looks more compact than in Java/Hadoop. I've read opposite comments on the efficiency of Pipes/C++ vs Hadoop/Java and I suspect it may depend on the specific problem being tackled. Even if I used to be quite proficient in C++, I do not remember fondly the 8000 characters template-induced error messages and I don't think it is the type of language I would want to use to discuss algorithms or for prototyping. &lt;br /&gt;
Pipes is developed as part of the Hadoop project.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: cpp"&gt;class WordCountMap: public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext&amp;amp; context){}
  void map(HadoopPipes::MapContext&amp;amp; context) {
    std::vector&amp;lt;std::string&amp;gt; words =
      HadoopUtils::splitString(context.getInputValue(), " ");
    for(unsigned int i=0; i &amp;lt; words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

class WordCountReduce: public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext&amp;amp; context){}
  void reduce(HadoopPipes::ReduceContext&amp;amp; context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory&amp;lt;WordCountMap,
                              WordCountReduce&amp;gt;());
}
&lt;/pre&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Hive&lt;/span&gt;&lt;br /&gt;
Hive is a SQL-like language that is interpreted on top of Hadoop. It can also be combined with small programs written in a variety of languages, to make up for the fact that the language itself is not general purpose. For what it does, it is very concise and expressive, but outside that you need to supplement it with other languages. Case in point, the word count example where two additional scripts are left as an exercise for the reader. &lt;br /&gt;
Hive started as part of the Hadoop project.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;FROM
(MAP docs.contents USING 'tokenizer_script' AS word, cnt
FROM docs
CLUSTER BY word) map_output

REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;
&lt;/pre&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Pig&lt;/span&gt;&lt;br /&gt;
Pig adds to the limitations of Hive the hubris of creating a brand new language, as if creating a new programming language were easy. As you can see, it is inspired by SQL to a degree. It is not a general purpose language as clearly explained &lt;a href="http://wiki.apache.org/pig/TuringCompletePig"&gt;here&lt;/a&gt;. It interfaces with any JVM based language for custom extensions.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
&lt;/pre&gt;
Pig development was started at Yahoo.&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Rhipe&lt;/span&gt;&lt;br /&gt;
Rhipe is an R package to describe and execute map-reduce jobs. It is reasonably high level and satisfies all the criteria I listed above. It's not a speed daemon, because of R itself, there are some quirks in the API and it's still at an initial stage of development, but interesting. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;rhinit()
m &amp;lt;- expression({
  y &amp;lt;- strsplit(unlist(map.values)," ")
  lapply(y,function(r) rhcollect(r,T))
})
r &amp;lt;- expression(
    pre={
      count=0
    },
    reduce={
      count &amp;lt;- sum(as.numeric(unlist(reduce.values)),count)
    },post={
      rhcollect(reduce.key,count)
    })
z=rhmr(map=m,reduce=r,comb=T,inout=c("text","sequence"),ifolder="/tmp/50mil",ofolder='/tmp/tof')
rhex(z)
&lt;/pre&gt;
Rhipe is developed by Saptarshi Guha at Mozilla.&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Dumbo&lt;/span&gt;&lt;br /&gt;
Dumbo is a Hadoop library for python, but also imposes a set of tools to run dumbo program. If you look at the word count program in Dumbo, below, it almost looks like pseudo-code! Finally! But there is a serious catch. There can only be a run statement per dumbo-powered program — I &lt;a href="https://groups.google.com/d/msg/dumbo-user/9FFGeFAZqQc/N3uuo-S-61YJ"&gt;asked the author himself&lt;/a&gt; after seeing some outlandish looking errors. To coordinate two runs, for instance one that starts based on the output of the first, one has to run separate python programs and go through the unix shell. This is different from static composition of jobs, which is well supported, but not general enough for my purposes. Other options for python include MR Job and pydoop, but I haven't had time to look into these yet.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: python"&gt;def mapper(key,value):
   for word in value.split(): yield word,1
def reducer(key,values):
   yield key,sum(values)
if __name__ == "__main__":
   import dumbo
   dumbo.run(mapper,reducer)
&lt;/pre&gt;
Dumbo is developed by Klaas Bosteels at last.fm.&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Cascalog&lt;/span&gt;&lt;br /&gt;
Built on top of the already powerful cascading as a domain specific language within Clojure, Cascalog wins the word count conciseness contest with a one-liner. Indeed, word counting is simple enough that a line is all that it should take. But look at what a line:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;(?&amp;lt;- (stdout) [?word ?count] (sentence ?s) (split ?s :&amp;gt; ?word) (c/ count ?count))&lt;/pre&gt;
It probably looks familiar to anybody who's familiar with it. Conciseness can become terseness, but once some domain specific concepts have been grasped a terse program such as this might become perfectly clear. It was to me at some point. My misgivings here are more about the JVM-powered revival of LISP in the form of Clojure. LISP has been around some 50-odd years without taking off despite several attempts at its revival (Common LISP, Scheme, Arc and now Clojure). I suspect something is wrong with it, even if popularity is not an accurate gauge of language quality, as BASIC has long proved. Personally, I dislike LISP odd syntax, the widespread use of side effects in a functional language and the poor abstraction that lists represent over RAM, from a performance point of view — indeed LISP variants often add additional data structures, somehow negating the "LIS" part of the language. In the specific case of Clojure, the fact that a compiled language is compiled into an interpreted one, JVM bytecode, combining a slow dev cycle with suboptimal performance, makes me think Clojure users must be glutton for punishment.  &lt;br /&gt;
Cascalog is  developed by Nathan Marz at Backtype.&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Final thoughts&lt;/span&gt;&lt;br /&gt;
At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn't. There isn't because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely. First, it is just a library, which means that one can continue to use the tools he's familiar with. I found it particularly useful to run map-reduce jobs in the interpreter, inspecting the inputs and outputs of each, an invaluable debugging help — but no, you can not step into a mapper or reducer, I use counters instead to trace what's going on in there. I also like that one can read and write sequence files with one call, to examine the output of previous jobs and decide what to do next. Additionally since R is a statistical language and Hadoop is the tool of choice for big data analytics, this seems like a natural fit. Personally, I am familiar with both, which helps, and I have used R, in combination with Hive or Hadoop, to do analytics in the past, but not at this level of integration. Since there is nothing like trying a more substantial example than word count to figure out a language pros and cons, stay tuned for a fairly complex &lt;a href="http://blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html"&gt;example&lt;/a&gt;. After that is published, I plan to pose a friendly challenge to experts in the languages and libraries above or other Hadoop related languages and see what an implementation of the same algorithm would look like in their language of choice and learn something from the comparison. Maybe among my "25 readers" there is someone who will take it up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-672812696491855954?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=gDDGTPrghDU:UVudc9CJY9M:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=gDDGTPrghDU:UVudc9CJY9M:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=gDDGTPrghDU:UVudc9CJY9M:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/gDDGTPrghDU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/672812696491855954/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html#comment-form" title="11 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/672812696491855954?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/672812696491855954?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/gDDGTPrghDU/looking-for-map-reduce-language.html" title="Looking for a map reduce language" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>11</thr:total><feedburner:origLink>http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html</feedburner:origLink></entry><entry gd:etag="W/&quot;A08AQ389eip7ImA9WhZQEk4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-16217976917203424</id><published>2010-11-29T23:59:00.000-08:00</published><updated>2011-04-19T12:04:02.162-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-19T12:04:02.162-07:00</app:edited><title>Find the odd bag</title><content type="html">From a job interview challenge, an interesting probability exercise in two parts. One of the themes here is pretty standard fare. You are given a clearly defined random procedure whose outcome is a mixture of two distributions. The problem is, given a certain set of outcomes, find which of the two distributions it is coming from. For instance, imagine you have to assign one of two classes to an item based on repeated noisy measurements and you know the relative size of the two classes (a priori probability of belonging to one of the two). The second part of the challenge is a bit more interesting but also eccentric. It is asking for a best case outcome that would make it easiest (smallest sample) to detect the class of the item with a certain error probability. I am not aware of any practical statistical question where such a best case problem arises, even if we consider the converse, the worst case outcome. But being used to worst case analysis from my CS training, I came up with an optimality proof based on induction and manipulation of binomial coefficients, which confirms the intuition that a very unlikely, extreme outcome is the best one. The main idea is that when lower bounding an expression including binomial coefficients, it is somehow easier to prove a tight lower bound because the binomial coefficients on the two sides of the inequality are very similar in that case and one can simplify a lot and then use simple algebra. It won't set the world of Mathematics abuzz, but it seemed interesting enough to share.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;h2&gt;Part 1&lt;/h2&gt;The following random experiment is described. There are 5 identical bags, 4 of which contain 4 read beads and 96 black ones, the 5th instead has 7 and 93 resp. Select one bag according to the uniform distribution and sample three beads, 1 red and 2 black. What is the probability that the selected bag was the 5th?&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Part 2&lt;/h2&gt;Let's go back to the initial condition, pick one bag and then pick one bead at a time from that bag and stop when the probability of having picked the 5th bag is greater than 1/2. In the best case, how many beads do you need to pick?&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Solution&lt;/h2&gt;&lt;br /&gt;
Let \(N\) be the number of beads in each bag, \(n\) the size of the sample, \(m\) and \(m'\) the number of red beads in each bag with \(m' &amp;gt; m\)  and \(k\) the number of read beads in the sample. \(N=100\), \(n = 3\), \(m' = 7\), \(m = 4\) and \(k = 1\) in the first part of the interview challenge, with \(n\) and \(k\) becoming variable in the second part. Let \(X\) be the random variable corresponding to the number of red beads present in a sample. Conditional to the knowledge of the bag from which the extraction occurred, this variable has an hypergeometric distribution. Let M be the random variable corresponding to the number of red beads in the chosen bag.&lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(X = k|M=m) = \frac{\binom{m}{m}\binom{N-m}{n-k}}{\binom{N}{n}}&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
that is \(X\) follows the hypergeometric distribution with parameters&lt;br /&gt;
\(N,n,m\) conditional to having selected a type of bag and &lt;br /&gt;
&lt;br /&gt;
\[  &lt;br /&gt;
P(M=m) = 4/5\]&lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(M=m') = 1/5  &lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
assuming the uniform distribution in bag selection.&lt;br /&gt;
We are interested in:&lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(M = m'|X=k)&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
that is distribution over bag types conditional to the outcome of a random draw.&lt;br /&gt;
Using the definition of conditional probability we have&lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(M = m'|X = k) = \frac{P(M = m' \wedge X = k)}{P(X = k)}&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
and applying the same definition again we have &lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(M = m'|X = k) = \frac{P(X = k | M = m')P(M = m')}{P(X = k)}&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
The numerator is the product of a  hypergeometric distribution with&lt;br /&gt;
parameters \(N,n,m'\) and a constant. At the numerator, we apply the&lt;br /&gt;
law of alternatives to get&lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
P(X = k) = P(X = k| M= m') P(M = m')+P(X = k| M= m) P(M = m)  &lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
Combining the last two we have&lt;br /&gt;
&lt;br /&gt;
\[P(M = m'|X = k) = \]&lt;br /&gt;
\[&lt;br /&gt;
\frac{P(X = k | M = m')P(M = m')}{P(X = k| M= m') P(M = m')+P(X = k| M= m) P(M = m)}  = \]&lt;br /&gt;
\[&lt;br /&gt;
\frac{1}{1+\frac{P(X = k| M= m) P(M = m)}{P(X = k| M= m') P(M = m')}} = \]&lt;br /&gt;
\[&lt;br /&gt;
\frac{1} {1+ \frac{\frac{\binom{m}{m}\binom{N-m}{n-k}}{\binom{N}{n}} \frac{4}{5}}  {\frac{\binom{m}{m}\binom{N-m'}{n-k}}{\binom{N}{n}} \frac{1}{5} }} &lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
which can be simplified to &lt;br /&gt;
&lt;br /&gt;
\[\frac{1}{1+\frac{4\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}}}\qquad(1)&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
We need only to substitute in the values to obtain the desired probability.&lt;br /&gt;
&lt;br /&gt;
Now to the second part of the challenge, whereby one needs to find the smallest \(n\) such that for some \(k\)  the above expression is greater than 1/2 (with the other parameters as before and within the allowed range for \(n\) and \(k\)). We want to show that the solution is \(n=k=3\) and we will do it in two parts. First we will show that Eq. (1) is maximized, for any given \(n\), when \(k=n\), which supports the intuition that the red bead rich bag will be more promptly identified when all the sampled beads are red. Then we will show that the smallest \(n\) such that Eq. (1) with \(k=n\) is greater than 1/2 is 3. To establish the first part we will show that &lt;br /&gt;
&lt;br /&gt;
\[\frac{1}{1+\frac{4\binom{m}{k'}\binom{N-m}{n-k'}}{\binom{m'}{k'}\binom{N-m'}{n-k'}}} \ge&lt;br /&gt;
\frac{1}{1+\frac{4\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}}} \qquad (2)&lt;br /&gt;
\]&lt;br /&gt;
if and only if \(k' \ge k\). We will prove the special case \(k' = k + 1\) from which the general case follows by induction.&lt;br /&gt;
&lt;br /&gt;
By simple algebraic manipulation and substituting \(k'\) Eq. (2) is equivalent to:&lt;br /&gt;
&lt;br /&gt;
\[\frac{\binom{m}{k+1}\binom{N-m}{n-k-1}}{\binom{m'}{k+1}\binom{N-m'}{n-k-1}} &amp;lt; \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}} \] &lt;br /&gt;
&lt;br /&gt;
Expanding the binomial coefficients we get:&lt;br /&gt;
&lt;br /&gt;
\[\frac{\frac{m!}{(k+1)!(m-k-1)!}\frac{(N-m)!}{(n-k-1)!(N-m-n+k+1)!}} {\frac{m'!}{(k+1)!(m'-k-1)!}\frac{(N-m')!}{(n-k-1)!(N-m'-n+k+1)!}} &amp;lt; \frac{\frac{m!}{k!(m-k)!}\frac{(N-m)!}{(n-k)!(N-m-n+k)!}} {\frac{m'!}{k!(m'-k)!}\frac{(N-m')!}{(n-k)!(N-m'-n+k)!}} \] &lt;br /&gt;
&lt;br /&gt;
By simple algebraic manipulations we have: &lt;br /&gt;
&lt;br /&gt;
\[\frac{N-m'-n+k+1}{N-m-n+k+1}  &amp;lt; \frac{m'-k}{m-k} \] &lt;br /&gt;
&lt;br /&gt;
Since \(m' &amp;gt; m\) we can upper bound the left side with 1 and lower bound&lt;br /&gt;
the left side with 1, which completes this part of the proof.&lt;br /&gt;
&lt;br /&gt;
Now we have established Eq. (2), we know that Eq. (1) is&lt;br /&gt;
maximized, for every \(n\), by setting \(k=n\). With this substitution&lt;br /&gt;
our goal becomes:&lt;br /&gt;
&lt;br /&gt;
\[\frac{1}{1+\frac{4\binom{m}{n}}{\binom{m'}{n}}} \ge \frac{1}{2}&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
which is equivalent to &lt;br /&gt;
&lt;br /&gt;
\[&lt;br /&gt;
4\binom{m}{n}\le \binom{m'}{n}&lt;br /&gt;
\]&lt;br /&gt;
&lt;br /&gt;
Substituting in the values of \(m\) and \(m'\) and trying &lt;br /&gt;
\(n \in \{1,2,3\}\) we find that \(n=3\) is the solution.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-16217976917203424?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=_aa9hkVFyzo:eFKPOB8fGYo:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=_aa9hkVFyzo:eFKPOB8fGYo:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=_aa9hkVFyzo:eFKPOB8fGYo:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/_aa9hkVFyzo" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/16217976917203424/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/11/find-odd-bag.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/16217976917203424?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/16217976917203424?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/_aa9hkVFyzo/find-odd-bag.html" title="Find the odd bag" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/11/find-odd-bag.html</feedburner:origLink></entry><entry gd:etag="W/&quot;D0ECRH0zeSp7ImA9Wx5aFUk.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-5181707010769881202</id><published>2010-11-11T23:07:00.000-08:00</published><updated>2010-11-11T23:07:45.381-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-11-11T23:07:45.381-08:00</app:edited><title>Social network integration the hard way</title><content type="html">&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;There is a jungle of &lt;a href="http://en.wikipedia.org/wiki/List_of_social_networking_websites"&gt;social&lt;/a&gt; and &lt;a href="http://hubpages.com/hub/40-Microblogging-Sites-list-for-Communication-Twitter-Alternatives"&gt;blogging&lt;/a&gt; services out there, and it's hard to pick one to use and impossible to use them all. Some people asked me how I deal with it and the short answer is I use several of them and tie them together using RSS feeds and APIs. Now for the long answer with pretty graphics.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Main goals were:&lt;/div&gt;&lt;ol&gt;&lt;li style="text-align: justify;"&gt;Keep using the preferred service for each task: &lt;a href="http://picasaweb.google.com/"&gt;Picasaweb&lt;/a&gt; for pictures, &lt;a href="http://vimeo.com/"&gt;Vimeo&lt;/a&gt; for videos, &lt;a href="http://blogger.com/"&gt;Blogger&lt;/a&gt; for blogging etc.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;Simplify subscription down to 0 clicks; support e-mail as alternative to feeds for technophobes.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;Merge data streams down to two, personal and work. Keep it simple, but don't mix the unmixable.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;Reach people where they are, don't invite them to use services they don't want to use. Have data, will travel. &lt;/li&gt;
&lt;li style="text-align: justify;"&gt;No manual intervention.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;To the extent possible, use own domain names: companies come and go, names are here to stay.&lt;/li&gt;
&lt;/ol&gt;&lt;div style="text-align: justify;"&gt;The solution's broad lines are:&lt;/div&gt;&lt;ol&gt;&lt;li style="text-align: justify;"&gt;Extract feeds from everywhere I create content, by whatever means necessary.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;Create two Posterous blogs and import everything there nightly using the API.&lt;/li&gt;
&lt;li style="text-align: justify;"&gt;From there export things where they will be read using posterous own extensive export features.&lt;/li&gt;
&lt;/ol&gt;&lt;div style="text-align: justify;"&gt;This is a graph of how things are plumbed, first the so called serious stuff:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_0zKhPf101t0/TNxBGLS_FSI/AAAAAAAAHn8/qL9bKKm7SZU/s1600/feedmapwork.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="190" src="http://3.bp.blogspot.com/_0zKhPf101t0/TNxBGLS_FSI/AAAAAAAAHn8/qL9bKKm7SZU/s400/feedmapwork.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;And then the recreational or personal stuff:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_0zKhPf101t0/TNxB3Y4WUiI/AAAAAAAAHoA/dLN2EGblgnI/s1600/feedmaplife.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="243" src="http://1.bp.blogspot.com/_0zKhPf101t0/TNxB3Y4WUiI/AAAAAAAAHoA/dLN2EGblgnI/s400/feedmaplife.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;Now for the gory details. Why Posterous as the central hub? It has some killer features such as:&amp;nbsp;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;Create as many blogs as you want, in this case two. No need to have grandma check out the latest on hadoop, or the uber-geek be briefed on baby's latest trick.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;0-click subscription: if grandma doesn't do feeds and doesn't do subscriptions but does email, you can force subscribe her via email, a few clicks for you but 0 for her. It's a little invasive but the only thing that worked with some technophobe friends. Use with care.&lt;/li&gt;
&lt;li&gt;Export to XYZ. Posterous supports a lot of different targets for content delivery — called autopost. This way you don't need to convert everybody to Posterous, which will never happen, you can just reach them where they are. I know, everyone is on Facebook. But most people I talked to have ended up befriending everyone, they are overwhelmed by the newsfeed and again I didn't want to mix geek content with baby pictures. One wrinkle: can't get the export to Flickr to work. I've been on and off with support, but no good news so far.&lt;/li&gt;
&lt;li&gt;In the opposite direction, they also have a much trumpeted &lt;a href="http://www.readwriteweb.com/start/2010/06/posterous-and-ambition-a-lesso.php#comment-264020"&gt;import feature&lt;/a&gt;, but it is one-time only, for migrating your pre-existing blog, whereas export is ongoing, from when you set it up until you stop it. Why this&amp;nbsp;asymmetry, I don't know — it looks like a lock-in play. Another "post everywhere" type service is ping.fm.&amp;nbsp;&lt;a href="http://friendfeed.com/"&gt;Friendfeed&lt;/a&gt;, &lt;a href="http://soup.io/"&gt;soup.io&lt;/a&gt; and also &lt;a href="http://tumblr.com/"&gt;Tumblr&lt;/a&gt;, to a degree, really push in the opposite direction: "give me all your content, we'll take good care of it". They have ongoing import and little or no export. If someone gets both right, they might be onto something. Also, Posterous doesn't support importing from a generic RSS feed, only specific services of their choosing&amp;nbsp;— so much for supporting standards. Anyway, Posterous has a read/write API and I am a computer person, so I wrote my own import from RSS and ATOM. You are welcome to&amp;nbsp;&lt;a href="http://imposterous.piccolboni.info/gateway/rss2posterous"&gt;try it out&lt;/a&gt;. I decided to import short form or abridged versions of most feeds to avoid copyright and rendering issues. &amp;nbsp;&lt;/li&gt;
&lt;li&gt;Custom domains. I can switch to a different service with less pain since I own the domain names. You never know how long a startup is going to last — hopefully Posterous will be around for a long time, but one never knows.&lt;/li&gt;
&lt;/ol&gt;Instead of straight sharing from Google reader I use two tags that are marked as public, which gives them a public web page with feeds. This is because I want to keep two separate streams, &lt;a href="http://lifestream.piccolboni.info/"&gt;one for grandma&lt;/a&gt; and &lt;a href="http://workstream.piccolboni.info/"&gt;one for the uber-geek&lt;/a&gt;, and that can be accomplished with a tag for each. I wish I used this more often, I will try.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;The feed from a Google bookmark list is the only unimplemented part of this plan: neither it is a feature provided by Google, nor several feed generators I tried have managed to generate a feed from that page. Feed generators are web services that check a page periodically for changes and extract what's new. I use page2rss with some success and Google Reader used to have a built in one, now retired. Neither those two nor others I tested worked for Google Bookmarks. No updates were ever generated. I thought I had cracked the problem, since that page is Ajax-heavy but has a static version for crawlers and other robots. Switching to that version apparently didn't help the generators. My testing was not extensive, I've got other business to attend, but if anyone has a solution please share. Yes I could switch to delicious but the integration of bookmarks and search is, well, too delicious to give up. Come on Google, every web page should have its own feed already.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;Finally you might ask about that intermediate step through a Friendfeed group I have set up. This is so for historical reasons, since Friendfeed groups were my hubs of choice before the switch to Posterous. I run into some issues pulling the Pandora and Netflix feeds myself, so I left it to Friendfeed to do that and pulled the latter instead.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;Is this network of data streams the nirvana of social interoperability? Of course not. First, it's way too hard to set up. Only a stubborn geek can pull this one off. Second, the feeds are incomplete and unidirectional. Comments may be lost in the process, and certainly any conversation that develops can &amp;nbsp;be fragmented. Like when a friend sees a new collection of pictures in my Posterous and follows the link to Picasaweb and enters a comment there — if he has an account. That comment will be only there, neither visible on Posterous nor any other of my social streams. If for some reason that picture is particularly good, my friends on Facebook will never know. It is for social web companies to decide that interoperability is in their best interest but if they wait a bit longer, it might just be &lt;a href="http://www.vincos.it/wp-content/uploads/2010/06/wmsn-01-10.png"&gt;too late&lt;/a&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-5181707010769881202?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CCk9LtE9tQM:ZwbHNZPeI_c:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CCk9LtE9tQM:ZwbHNZPeI_c:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=CCk9LtE9tQM:ZwbHNZPeI_c:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/CCk9LtE9tQM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/5181707010769881202/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/11/social-network-integration-hard-way.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/5181707010769881202?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/5181707010769881202?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/CCk9LtE9tQM/social-network-integration-hard-way.html" title="Social network integration the hard way" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_0zKhPf101t0/TNxBGLS_FSI/AAAAAAAAHn8/qL9bKKm7SZU/s72-c/feedmapwork.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/11/social-network-integration-hard-way.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEYFQn07fCp7ImA9Wx5XF0s.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-4118032758529268423</id><published>2010-09-17T15:21:00.000-07:00</published><updated>2010-09-17T15:21:53.304-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-17T15:21:53.304-07:00</app:edited><title>On lenses for small cameras: a data-driven counterargument</title><content type="html">Andy Westlake of dpreview.com &lt;a href="http://blog.dpreview.com/editorial/2010/01/on-lenses-for-small-cameras.html"&gt;takes apart&lt;/a&gt; the current lens offering for lightweight interchangeable lens cameras (LILC) like the micro four thirds and related mirrorless designs, but I was unconvinced. Let's see what the data says.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Andy Westlake is a photographer and camera reviewer at dpreview.com and his opinions carry some weight in the digital photography community. His position, heavily summarized, is that the new type of LILCs such as the micro 4/3 format cameras need to maximize portability to differentiate themselves from standard DSLRs. To do so, they need to be coupled with lightweight, bright prime lenses. Many readers chimed in in the comments and I am not sure I am going to add to that. I just looked at the data. Since it's very hard to match one to one micro 4/3 lenses with equivalent say Canon EF lenses, because of limited available range and several differentiating parameters, I used stats. I started with a google square search for lenses in different mounts (Panasonic micro 4/3, Canon EF and EF-S and Nikon F) and their optical and weight data and I fitted a robust linear model to it (caveat: this is not a random sample and I had to clean the data manually: please let me know if you see something wrong with the data). This is what the model does in plain english: it tries to express weight of the lens as a weighted sum of zoom factor, f-number(s), focal length and mount type. Why weight? Because life is too short, but it would be lovely to try the same for length and diameter, and also add price into the equation. This model turned out to fit the data well enough, with the exception of three huge and bright tele lenses. You can find the details &lt;a href="https://spreadsheets.google.com/pub?key=0ApAU7JMALMM-dDR1T1Rva2dZRGlYdDUtQkpGNVlKbmc&amp;amp;hl=en&amp;amp;gid=2"&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The take home message is that being a zoom adds only about 27 grams per 1X factor increase (measured in degrees, not in mm to avoid any crop factor issues). The standard deviation here is 12 grams, so there is some uncertainty on this number. A typical 3X zoom is expected to add 81 grams to a system compared to a prime of similar aperture and focal length (mid-way in between the extremes of the zoom). If you look at mounts, micro 4/3 is some 260 grams lighter than EF. EF-S is only 90 grams lighter than  EF. Another big contribution after the mount is the aperture, a hefty 123 grams per f number. Keeping it constant across focal lengths costs another 118 grams (that is, you can save 118 grams by dropping the aperture at tele by 1, keeping the same at wide). Primes are included in the analysis as 1X zooms with constant max aperture.&lt;br /&gt;
&lt;br /&gt;
So how does this translates into comments to Andy's editorial? I think the data does not support the statement that the portability advantage is limited to the combination of micro 4/3 and prime lenses. There is a 260 grams advantage when keeping other factors equal. Going to a prime might take away another 90 grams, no discussion, but in the big scheme of things it doesn't seem like a lot. Adding a camera into the equation, let's say a Panasonic GF-1 at 285 grams body only vs a Canon 60D at 755 grams it's clear to me that the biggest factor is body weight, followed by mount type followed by f-number and finally zoom range. Is my linear model misleading? Is there a threshold at which even 1 gram makes a huge difference? There is, and it's called pocketability, but we are not there yet. I know because I use one of the smallest LILCs with one of the smallest lenses and there is no way I am going to carry them in my pocket. They might fit some jackets, but it's theoretical. Only if the lens were recessed into the body they would have a chance. Maybe we'll se that one day.&lt;br /&gt;
&lt;br /&gt;
Other assumptions in Andy's analysis are harder to check with hard data. He says serious photographers buying LILCs as second cameras are "half the story", but how about half the market? How many people are going to fork $800 for a GF-1 as a second camera, and buy expensive lenses that they can not share with their main system? And how about people wanting to take pictures of friends and family indoor and with existing light? I am one of those but we are called amateurs, and we have only one system, and many of us use the flash liberally and they are happy with the results. We have one system that has to represent a compromise, which sometimes includes using a zoom and a flash. Sometimes we don't have time to change lens and sometimes we don't own all the lenses to replace a zoom range. Sometimes we shoot stopped down to f-4.5 for a whole sunny day,&amp;nbsp;on the beach, with sand ready to fly onto that delicate CMOS sensor, so a bright prime lens would not make much of a difference, maybe a tad on sharpness if it is high quality. Disclosure: I don't own a zoom and I use the flash sparingly. I hope that Andy is right and camera makers will start catering to people just like me, I am not holding my breath just yet. And I like to throw data behind my opinions.&lt;br /&gt;
&lt;br /&gt;
And to finish on a colorful note, a chart summarizing the data used in this post.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_0zKhPf101t0/TJPhOJEHcxI/AAAAAAAAHVY/C_rdeqf1i0Y/s1600/lens+chart.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="449" src="http://1.bp.blogspot.com/_0zKhPf101t0/TJPhOJEHcxI/AAAAAAAAHVY/C_rdeqf1i0Y/s640/lens+chart.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Lens specs chart&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Georgia, serif; font-size: 16px; line-height: 25px;"&gt;I tried to compress a lot of information in this graphics, so bear with me. On the axes are the min and max field of view for each lens. That puts tele lenses in the lower left corner and wide angles in the upper right. Extreme zooms (wide to tele) are at the bottom right and the diagonal line that you can guess upper left of center has all the primes. The symbol is a square for micro four third lenses, and a circle otherwise. The color is weight, with blue being the lightest and orange the heaviest (I used topo map colors). The symbol size is inversely proportional to the f-number: larger means brighter. I printed a few lens names just as examples, I hope they are readable if you go to the full size version. Despite all my efforts, the results of the linear model are not immediately apparent from the graphic. Your best bet is to focus on the the micro 4/3 lenses and look for comparisons in the surroundings, but there's seldom a closely equivalent lens to compare.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-4118032758529268423?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=LFVRJfQJTVQ:7_fO0JOK5A8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=LFVRJfQJTVQ:7_fO0JOK5A8:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=LFVRJfQJTVQ:7_fO0JOK5A8:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/LFVRJfQJTVQ" height="1" width="1"/&gt;</content><link rel="related" href="http://blog.dpreview.com/editorial/2010/01/on-lenses-for-small-cameras.html" title="On lenses for small cameras: a data-driven counterargument" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/4118032758529268423/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/09/on-lenses-for-small-cameras-data-driven.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4118032758529268423?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4118032758529268423?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/LFVRJfQJTVQ/on-lenses-for-small-cameras-data-driven.html" title="On lenses for small cameras: a data-driven counterargument" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_0zKhPf101t0/TJPhOJEHcxI/AAAAAAAAHVY/C_rdeqf1i0Y/s72-c/lens+chart.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/09/on-lenses-for-small-cameras-data-driven.html</feedburner:origLink></entry><entry gd:etag="W/&quot;D0AGRH09cCp7ImA9WhRSFUg.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-4700908475954907973</id><published>2010-09-16T15:47:00.000-07:00</published><updated>2011-11-17T10:48:45.368-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-11-17T10:48:45.368-08:00</app:edited><title>Thoughts on A/B testing</title><content type="html">A/B testing is part of a push towards software engineering as an experimental science, which I support, but there are plenty of open problems.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
I've been mulling over these points for a long while, but, after running into this &lt;a href="http://www.win-vector.com/blog/2010/08/statsmanship-failure-through-analytics-sabotage/"&gt;excellent and amusing post&lt;/a&gt; by John Moult, about the pains and perils of doing analytics, I was led by example to write up my thoughts about the more specific subject of A/B testing.&lt;br /&gt;
&lt;br /&gt;
There is undoubtedly a strong trend in the software industry toward the adoption of A/B testing as a powerful antidote to &lt;a href="http://exp-platform.com/hippo.aspx"&gt;arbitrary and opinion-based decision making&lt;/a&gt;. I think this is part of a larger push to inject powerful methods of the natural sciences into software engineering and make it more "scientific" and I wholeheartedly support that, but now there is an extreme programming like attitude to do too much of a good thing, a &lt;a href="http://www.startuplessonslearned.com/2008/09/one-line-split-test-or-how-to-ab-all.html"&gt;very visible advocate&lt;/a&gt; of which has been Eric Ries — note though his own caveats in the linked entry. I agree with him: there are many problems with A/B testing. Why shouldn't we A/B test everything, from the almost proverbial &lt;a href="http://www.nytimes.com/2009/03/01/.../01marissa.html"&gt;41 shades of blue&lt;/a&gt; to bug fixes?&lt;br /&gt;
&lt;h2&gt;
&lt;br /&gt;&lt;/h2&gt;
&lt;h2&gt;

Experiments do not replace theory&lt;/h2&gt;
When we perform a test, we need to pick a probability of rejecting the null when the null is true, or, in plain english, of promoting a change that is actually detrimental for some important metric. People set this probability fairly low, say 5% and think their job is done. They should read "&lt;a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/"&gt;Why most published research findings are false&lt;/a&gt;". The jargon of that paper might be a little unfamiliar to a software developer, so let me try and explain some ideas using the following example. If we test a very carefully selected change, where multiple people have used their knowledge and experience to reach a consensus that that change is promising and should be given a run, the pre-test probability that it's going to be successful is, based on historical data, 90%. At the opposite end, we hire monkeys to throw random changes at the A/B testing system and their pre-test probability of success is 1%. Now that 5% chance of erroneously giving the green light to a harmful change has very different consequences in the two cases. In the first, we first need to hit the 10% when the expert committee makes a mistake, then the 5% where the A/B test doesn't catch it. These two events are most likely independent, so we end up with a 0.5% chance of a bad change pushed to the site. In the monkeys case, the two probabilities are 99% and and 5%, with a 4.95% chance of a bad change making its way to the users.&lt;br /&gt;
&lt;br /&gt;
Statistics and computational molecular biology have been dealing with monkeys for a long time: it's actually called &lt;i&gt;&lt;a href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;multiple testing&lt;/a&gt;&lt;/i&gt;. Think of those genome-wide scans for genes of interests in some disease: for each gene, before seeing the data, the chance of being involved in the disease in negligible. I've been involved in some of them &lt;a href="http://scholar.google.com/scholar?cluster=18083597716113492455&amp;amp;hl=en&amp;amp;as_sdt=2000"&gt;myself&lt;/a&gt;. The solutions involve carefully raising the significance bar, which means that some potentially successful changes will be discarded, victim to the huge number of tests being performed. Another consequence is that monkeys raise the bar for everyone, including the team of experts, by submitting large numbers of random changes. There you have it: a statistical view of why teams are often bogged down by the weakest members. My point of view is that instead of succumbing to the &lt;a href="http://en.wikipedia.org/wiki/Lake_Wobegon#The_Lake_Wobegon_effect"&gt;Lake Wobegon effect&lt;/a&gt;, according to which even the most mediocre startup is hiring only the best and brightest, one should be able to model pre-test likelihood of success based on repeated observation of the same developer or team. There are always going to be performance differences in any team and it's better to try and make the best of them instead of trying to eliminate them.&lt;br /&gt;
&lt;br /&gt;
It seems to me the quarrel between designers, who seem to be &lt;a href="http://stopdesign.com/archive/2009/03/20/goodbye-google.html"&gt;opposed to A/B testing&lt;/a&gt;, and engineers, who seem more inclined to &lt;a href="https://www.google.com/accounts/ServiceLogin?service=websiteoptimizer&amp;amp;continue=http://www.google.com/analytics/siteopt/%3Fhl%3Den&amp;amp;hl=en"&gt;support it&lt;/a&gt;, derives from a misunderstanding of the role of A/B testing: it doesn't formulate interesting and important hypotheses to test, it doesn't direct our efforts where it matters. Experiments do not replace theory.&lt;br /&gt;
&lt;h2&gt;
&lt;br /&gt;&lt;/h2&gt;
&lt;h2&gt;

All observable metrics can be misleading&lt;/h2&gt;
In theory, we all agree that we are working towards the long term success of our employer or client. In practice, nobody can observe that in the present. What is the closest thing? Maximizing revenue? Adoption or user satisfaction? Content generation or consumption? A/B tests rigorously asses the impact of a change on one or more of these proxies for success. The test is rigorous, but the proxy relation is an educated guess at best. For example more page views are good, but using AJAX or adding video can reduce the number of page views while increasing user satisfaction. More impressions are good, right? Sure, but one can replace all content with ads based on this metric alone and short term measurements. In fact, such a thing happens: it's called a &lt;i&gt;homepage takeover&lt;/i&gt;. If you A/B tested the takeover, you would likely find that revenue has increased, but nobody would be so ill-advised as to deploy the takeover for good. Aware of the limitations and uncertainties of each of these metrics, management often uses the "let's do both (all)" heuristic and metrics inflation ensues. Let's measure everything: page views, page by page, content creation, by type of content, time on site, impressions, ad clicks, adoption, virality. Each metrics requires a separate test, compounding the multiple testing issue described above. Even if we can take advantage of very large samples, it is very often the case that some metrics are significantly up and some are down, just because people's time and patience is finite and using any feature competes with using every other. Add video and people have less time for pictures. Add chat and messages decline. If you have, say, 100 metrics, every decision becomes a judgement call because some will be up and some down, and this is the opposite of what we were trying to achieve with A/B testing in the first place. The quest for the practical measure that best correlates with long term goals is still ongoing. The issue of long term objectives is also covered in a recent and highly recommended &lt;a href="http://www.springerlink.com/content/r28m75k77u145115/fulltext.pdf"&gt;paper on A/B testing&lt;/a&gt; where it is somewhat downplayed. They suggest to incorporate them into the analysis, but how can you A/B test ROI over 5 years? Another recommendation they make is to use complex metrics that combine revenue with user satisfaction, such as a function of revenue a visit frequency. I think they are somewhat in denial of the key issue. At one level, we can use A/B testing to verify an assertion like: "feature X increases visit frequency". It's a statement about the present or a very near future we can wait for so that we can measure it and we'd better not be wrong about it. At another level, we want A/B testing to replace human decision making in all its fallibility, and then we are entering the prediction business, which is a lot more difficult.&lt;br /&gt;
&lt;br /&gt;
Speed is important, but is the enemy of accuracy -- unless your name is Sundance Kid. As Google CEO Eric Schmidt &lt;a href="http://news.cnet.com/8301-13860_3-20012724-56.html"&gt;recently observed&lt;/a&gt;, when launching a new product what counts is the speed of adoption after the first burst of interest has declined. I believe the same to be true for all but the least visible features, albeit on different time scales: some need just a look to get used to, like a different background, and some cause protests and user churn before taking off. The great chess master Capablanca, asked about how many moves he considered before making one, famously answered: "Only one, but it's always the right one". For everyone else it is necessary to look several moves ahead. Almost any reasonable heuristic becomes effective if the position space is searched thoroughly enough. The converse seems to be true with web related metrics: in the short term all of them can be misleading, but from what I hear in the industry, most A/B tests are run over fixed time periods, ignoring temporal effects. It is wise to experiment with the duration of A/B tests and model time dependent effects. The way to deploy more changes faster is not to perform shorter A/B tests, but to run many in parallel. Check out &lt;a href="http://radar.oreilly.com/2009/06/bing-and-google-agree-slow-pag.html"&gt;this report&lt;/a&gt; of two experimental studies from Microsoft and Google on the impact of web search speed on user behavior. From the graphs you can tell the experiments lasted at least 6 and 11 weeks respectively. The aforementioned &lt;a href="http://www.springerlink.com/content/r28m75k77u145115/fulltext.pdf"&gt;paper&lt;/a&gt; recommends a full week at least to steer clear of day of the week effects. I think those are very consistent and there is a chance to model them successfully. It's the rare events that worry me, like a sudden concentration of soccer matches on TV and the variable nature of users' reaction and learning curve. If you combine 11 week long tests with the &lt;a href="http://timothyfitz.wordpress.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/"&gt;50 releases per day&lt;/a&gt; that are possible with modern build-test-deploy systems that means 50 * 5 * 11 = 2750 ongoing experiments at any given time! That's an extreme I am not sure can be managed effectively without an equally extreme level of modularity, and it might not be necessary. Moreover lengthy tests affect time to market and delay useful feedback, no matter how many are performed at the same time. One way to work around that is to run a "B/A test" for some time after a release, that is keep a small sample of users behind to observe long term effects without slowing down the development of the product. Only when the supplemental, long term analysis disagrees with the initial, fast assessment a change will have to be reconsidered and the analyst has the time to really understand what is happening.&lt;br /&gt;
&lt;h2&gt;
&lt;br /&gt;&lt;/h2&gt;
&lt;h2&gt;

Testability bias&lt;/h2&gt;
Some things are easier to test than others. If we mandate that any product change undergo an A/B test, we might end up introducing a bias not only for changes that are successful but also for changes that are testable. This is a non exhaustive list of testability biases:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;User attitude: users of many web sites have shown resistance to change, the most famous example of which might be Facebook's introduction of the newsfeed, now a mandatory feature of any social network. Visible changes also create user confusion and envy, because users might become aware of different features available to other users. Broadcast communication channels to announce new features, like company blogs and PR channels, can not be used because they don't reach selectively the B population. This means that users are surprised by the changes, exacerbating a negative attitude. This creates a bias toward small incremental changes that go unnoticed.&lt;/li&gt;
&lt;li&gt;Learning curve: related to the previous, any feature that requires learning will undergo a ramp up period during which it could be rejected by an A/B test. So the bias is toward intuitive features vs. powerful but difficult to use ones. Or it could be a novelty curve, a feature gets an initial spike of interest followed by oblivion, for the opposite bias.&lt;/li&gt;
&lt;li&gt;Networked features such as messaging or chat or multiplayer games or anything social: it's not clear how to sample communities in a statistically unbiased way. Facebook revealed they sample whole countries (can't locate the reference, could any reader help?), with all the limits of that approach. People at the "social boundaries" of a sampled community will find it harder to engage in the new feature. On the other hand some simple sampling techniques are biased toward more connected users. The bias here could be positive or negative.&lt;/li&gt;
&lt;li&gt;Non user vs users: users are more readily available for an A/B test then non users, but if user base growth is a priority one needs to take non users' needs into account: what is it that prevents a potential user from joining in? It is unlikely that deploying a new feature to a small subset of users will make a sizeable difference to non-user that would join if aware of that feature. Therefore user acquisition efforts focus on marketing and virality, which can be A/B tested, not on catering to non-user needs — that is creating a more broadly appealing product. One can A/B test casual visitors of a site and maximize their conversion into regular users, but this focuses mostly on the subscription process and the perception of the product more than its reality and potential users need to be visiting the site in the first place. The bias here is in favor of current users.&lt;/li&gt;
&lt;li&gt;Premium features: since adoption is generally lower than for free features, sample sizes are smaller, sometimes by orders of magnitude. Gains need to be bigger to be detectable, all else being equal. The bias is toward free features.&lt;/li&gt;
&lt;li&gt;Resource intensive features: depending on deployment and load balancing techniques, it's very possible that resource intensive features might be at an unfair advantage or disadvantage during testing. Let's say you have a fixed pool of machines to use for testing and that pool is not resized when testing a resource intensive feature: the test might indicate to drop that feature, but the only reason is that we needed to allocate more resources and the additional cost could have been justified. The opposite scenario is when the load is balanced across A and B pool users, using the same set of resources. Then any speed problem with a new feature is going to be hidden.&lt;/li&gt;
&lt;li&gt;Interaction: a change that affects several sections of a web site is going to interact with other changes we are trying to A/B test in parallel, making testing more difficult. Even when features appear to be unrelated, they are competing for user time and attention. The same influential &lt;a href="http://www.springerlink.com/content/r28m75k77u145115/fulltext.pdf"&gt;paper&lt;/a&gt; on A/B testing states that interactions are not very important and common, but my experience does not support that. It could be that different types of sites are more prone to interaction: for instance on a purely recreational site people are going to spend 10 minutes on any given day, and if offered activity X they won't engage in activity Y, which was the hot new thing just the week before. On an e-commerce web site or search engine things might work in a different way.&lt;/li&gt;
&lt;/ol&gt;
Contrasting hypothesis is at the heart of the scientific method, but science has the luxury of picking which problems are within its domain. When creating a product, one doesn't have that luxury. If these biases go unnoticed and are not addressed, we might end up with a product that is conservative in its development, dumbed down, catering to a niche of early adopters, free and so forth. The point of using statistics was to help us go where we want to go, not to introduce biases in decision making. Are there ways to mitigate these effects?&lt;br /&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Educate users creating channels specific for test users when the new feature or improvement being tested is important enough. In doubt, split the B population into two random samples and educate B1, let B2 figure it out, study the differences.&lt;/li&gt;
&lt;li&gt;Look at temporal effects. Let the test run long enough until a steady state is reached or incorporate those effects into the model.&lt;/li&gt;
&lt;li&gt;This is an open research problem as far as I know. I would start from &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.8589&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;this paper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;This is a hard one. I will risk two suggestions. One is to advertise or announce non-existing or prototype level features and see what the response is. Some call it "&lt;a href="http://gigaom.com/2010/04/26/the-aardvark-theory-of-product-fake-it-till-you-make-it/"&gt;fake it till you make it&lt;/a&gt;". The other one is that newcomers are likely to be more similar to potential users than early adopters. You could even try to model the shifts in usage patterns and extrapolate. I don't have experience in either, so take them as brainstorming.&lt;/li&gt;
&lt;li&gt;Put more data gathering effort into measuring premium features. If you are sampling, be sure to use adjustable sampling rates so that you don't end up with 99% of your data being about page views of the home page. Run longer tests involving a larger user base when dealing with premium features. Use better statistics: when the sample sizes are small, accurate modeling is more important and errors and changes of directions are less costly and public.&lt;/li&gt;
&lt;li&gt;In general, testing should always be done in a real-life setting and therefore resource constraints should be part of it. Developers and managers that propose new features should also provide estimates of the resources necessary to make them successful. This way, a cost benefit analysis is possible. But for bolder, complicated features, it might be helpful to know first what the user reaction is, assuming there is no resource issue, then try to get the feature out within a realistic budget. It's a form of prototyping: leave optimizations for later. Of course one can't assume &lt;a href="http://en.wikipedia.org/wiki/P_versus_NP_problem"&gt;P = NP&lt;/a&gt; during prototyping and leave that as an implementation detail.&lt;/li&gt;
&lt;li&gt;Modeling interaction is best — you need to know if two independently useful changes will be a disaster when combined. One can also try and avoid them using multiple test pools, but subsequent combination testing is necessary.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
&lt;br /&gt;&lt;/h2&gt;
&lt;h2&gt;

Statistical testing might not be the right way&lt;/h2&gt;
I am not going to say that we should give up on statistics in software engineering. I just think hypothesis testing in particular has been overemphasized in its application to software development and we need to look at a richer set of statistical tools. Imagine this situation: after an A/B test, the data is not enough to reject the hypothesis that A and B perform equally well. Of course the numbers are not identical, but the test doesn't reach the predetermined significance level. What is a manager to do?&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Coin toss&lt;/li&gt;
&lt;li&gt;More experiments and analysis&lt;/li&gt;
&lt;li&gt;Go with the currently deployed version&lt;/li&gt;
&lt;li&gt;Go with the experimental version&lt;/li&gt;
&lt;li&gt;Go with the version that had slightly better numbers&lt;/li&gt;
&lt;li&gt;All of the above&lt;/li&gt;
&lt;/ol&gt;
Imagine the converse. The data shows that B is better than A. Unfortunately, B requires 10% more servers to meet response time requirements. The test doesn't tell you by how much B is ahead. In both cases and in my experience, an estimate of the performance gap and a confidence interval are much more useful or, as they say, &lt;i&gt;actionable&lt;/i&gt; than a test. In the first case, if the potential difference is big, albeit uncertain, let's say the 95% confidence interval is -3–11%, one might go with 2, more experiments, to try and narrow it down. If it is 0–1%, it could be better to move on — my pick is option 5 all else being equal. In the second case, if the lower end of the confidence interval is related to business goals that are deemed worth the cost of additional servers, the manager might go ahead, otherwise try and get more data. Of course there is a cost to gathering more data and delaying decisions as well.&lt;br /&gt;
&lt;br /&gt;
Another way in which statistical testing is not appropriate is illustrated again by the "41 shades of blue" example. In this case regression seems a more natural approach and multivariate regression would allow us to study border colors and background colors in a coordinated fashion, which makes a lot of sense to me. The expertise of both the domain expert, like a UI designer, and the data analyst here becomes important to formulate appropriate models. For instance, we have a strong expectation that similar shades of color will have very close performance, based on color perception theory: therefore the model should use a continuous or&amp;nbsp;Lipschitz&amp;nbsp;function to relate color and user behavior. In summary, there is a range of statistical tools beyond testing that might be more appropriate for different problems that arise in experimental software development: use them!&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;

Check your assumptions&lt;/h2&gt;
In the aforementioned post, John Moult finds that fastidious attention to every single factor that can compromise the validity of a statistical study is a killer for the work of the analytics engineer or scientist, and I agree with that. But in the case of A/B testing there is something practical we can do to allay some concerns. Perform periodic A/A tests and look for differences, and also look at the distribution of p-values, which should be uniform in this case. An A/A test is one in which everything is done according to the protocol for an A/B test, but the two products being compared are actually identical. In my experience I had a few surprises from this kind of check, which triggered somewhat painful investigations, but eventually generated more confidence in the method and better understanding of how our system worked. On the other hand, it's a little harder to check for power, or false negatives. If a generative model of the data is available, it is possible to run the test on synthetic data and see how it goes. One can also subsample the data and see what size effects are detected at what sample size. Under normality assumptions, there is an analytic solution that connects effect size, variance and sample size — see yet again &lt;a href="http://www.springerlink.com/content/r28m75k77u145115/fulltext.pdf"&gt;this paper.&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;

Conclusions&lt;/h2&gt;
&lt;div&gt;
I believe that the experimental approach to software engineering is here to stay and will develop further, becoming easier to apply, more predictive of long term success and more generally accepted. On the contrary, overemphasizing statistical tests as the only and infallible tool of this trade or as a shrink-wrapped solution for all sort of development decisions is not helpful and not supported by evidence or theory and is generating a backlash among less quantitatively oriented people. The answer is more, better statistics but also an acceptance of its role and limits.&lt;/div&gt;
&lt;br /&gt;
&lt;h2&gt;

Credits&lt;/h2&gt;
John Moult suggested an important reference for this post&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-4700908475954907973?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4f4nSYzm-YI:f7Kze2VPGCA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4f4nSYzm-YI:f7Kze2VPGCA:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4f4nSYzm-YI:f7Kze2VPGCA:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/4f4nSYzm-YI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/4700908475954907973/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/09/thoughts-on-ab-testing.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4700908475954907973?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4700908475954907973?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/4f4nSYzm-YI/thoughts-on-ab-testing.html" title="Thoughts on A/B testing" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>1</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/09/thoughts-on-ab-testing.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUECRHYzeyp7ImA9Wx5TEks.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-4939613169076845136</id><published>2010-07-27T13:47:00.000-07:00</published><updated>2010-07-27T13:47:45.883-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-27T13:47:45.883-07:00</app:edited><title>An algorithm for sample quantiles in map reduce</title><content type="html">A simple but often occurring problem is computing &lt;i&gt;sample quantiles&lt;/i&gt;, sometimes named &lt;i&gt;top $k$ elements&lt;/i&gt;, in a large data set. Here I show a solution for the MapReduce model of computation.&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
The standard in memory algorithm for this problem is similar to quicksort, with the main difference that only one branch of the recursion is followed. An important detail is that if we are looking for the top $k$ element, after the pivoting phase we might still be looking for it among the $m$ elements higher than the pivot, if $m \geq k$, but if that's not the case then we need to look for the top $k-m$ element among the elements smaller than the pivot. This is slightly more complicated to state in terms of quantiles, but the idea is the same. To port this algorithm to the map reduce model, we need to avoid pivoting, which requires random access, and adopt a style more similar to that of &lt;i&gt;streaming algorithms&lt;/i&gt;. Pivoting is replaced by a combination of counting and filtering. In each iteration, we maintain an upper and  lower bound for the quantile of interest. To refine it, we split that interval in half at a threshold $t$. We count how many data points are above the threshold. If that indicates that the target quantile is higher than the threshold, it becomes our new lower bound, otherwise we update the upper bound. Now we could simply go for a new pass, until upper and lower bound identify only one value in the sample (it could be repeated multiple times though), but if writing is not too costly and we have spare disk space, we can write out only the elements that are in between the upper and lower bound, bringing the complexity from $O(n \log(n))$ to $O(n)$. In the filtered sample we will be looking for a different quantile.&lt;br /&gt;
&lt;br /&gt;
To add some detail, this is how the counting happens (using a combiner is mandatory for this algorithm, or the only two reducers will have to process the whole dataset):&lt;br /&gt;
&lt;pre class="brush: python"&gt;def map(v):
  if (v &lt;= t):
    return [(T, 1)]
  else:
    return [(F, 1)]
&lt;/pre&gt;
&lt;pre class="brush: python"&gt;def reduce(list):
  sum = 0
  for (is_low, count) in list:
    sum += count
  return[(is_low, sum)]
&lt;/pre&gt;This is the filtering, just for completeness sake.
&lt;pre class="brush: python"&gt;def map(v):
  if (lower &lt;= v &lt; upper)
    return [v]
  else:
    return []
&lt;/pre&gt;
And this is the update rule for the bounds and quantile values:
&lt;pre class="brush: python"&gt;def bounds(u, l, t, ca, cb, q):
  # u, l: quantile upper and lower bounds
  # t: threshold
  # ca, cb: count of values above and below threshold
  # q: desired quantile
  n = ca + cb
  topk = (1 - q) * n
  if topk &lt;= ca:
    l = t
    q =  1 - topk/ca
  else:
    u = t
    q = 1 - (topk - ca)/cb
  t = floor(u + l / 2)
  return (u, l, t, q)
&lt;/pre&gt;
There is a little more work to glue everything together, but it should be clear at this point.
The algorithm terminates when the max and min of the current filtered sample are equal.
An optimization is to use more than one threshold during each iteration. This allows to trade iterations with memory usage. More precisely, we could maintain as many bins as memory allows and perform the counting bin by bin. Once this is done, we pick the bin that contains the quantile, split its range into equal sized bins and repeat. Filtering works exactly the same way as above.


&lt;p&gt;A couple of observations about the applications of this algorithm. Often we are trying to estimate a quantile from a distribution from which the data is sampled. In that case, calculating a sample quantile is only the start. As estimates of distribution quantiles, sample quantiles are biased towards the median and without confidence intervals or other measure of variability we don't know anything about their variability. To obtain confidence intervals for any quantile one can use the technique explained &lt;a href="http://turing.une.edu.au/~stat354/notes/node72.html"&gt;here&lt;/a&gt; and for extreme quantiles &lt;a href="http://cran.r-project.org/web/packages/evir/index.html"&gt;special techniques&lt;/a&gt; might give better results. In both cases we need to extract sample quantiles as an intermediate step, so the algorithm described here can be useful, but not necessarily from the whole dataset. We might opt for a single pass map reduce subsampling step followed by in memory analysis and get a tight enough confidence interval faster and using fewer resources. The best trade off between precision, speed and resources is application dependent.&lt;/p&gt;&lt;p&gt;This post was inspired by a job interview question, where we covered in memory algorithms and only hinted at the mapreduce solution, but I thought it would be interesting to write it down.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-4939613169076845136?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=3CNGF8JjpSM:gEe9Kiw4hdE:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=3CNGF8JjpSM:gEe9Kiw4hdE:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=3CNGF8JjpSM:gEe9Kiw4hdE:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/3CNGF8JjpSM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/4939613169076845136/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/07/algorithm-for-sample-quantiles-in-map.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4939613169076845136?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4939613169076845136?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/3CNGF8JjpSM/algorithm-for-sample-quantiles-in-map.html" title="An algorithm for sample quantiles in map reduce" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>1</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/07/algorithm-for-sample-quantiles-in-map.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEIGSX0-cCp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-1133700587202708684</id><published>2010-07-19T16:37:00.000-07:00</published><updated>2010-07-24T21:35:28.358-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:35:28.358-07:00</app:edited><title>A map reduce algorithm for connected components</title><content type="html">In a recently published &lt;a href="http://www.umiacs.umd.edu/~jimmylin/book.html"&gt;book&lt;/a&gt; about algorithms for the map reduce model of computation, a simple connected components algorithm based on lablel propagation is proposed, but its complexity depends on the diameter of the graph, which can be very large. It turns out we can get rid of that dependency with a completely different algorithm, ported from the PRAM model.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors observe the many if not most practically occurring web-sized graphs have small diameters (the "small world" phenomenon) and therefore their algorithm is of practical importance. At the same time, new types of large graphs become available to researchers and practitioners who want to data mine them and do not know a priori what the diameter of these graphs is. And if you don't buy this motivation for an algorithm with a diameter-independent upper bound, it's just "because it's there".&lt;br /&gt;
&lt;br /&gt;
Researching the literature for parallel algorithms for connected components, I found one called &lt;i&gt;Random Mate&lt;/i&gt; (I am not sure if the correct attribution is to Hillel Gazit or John H. Reif, my access to older literature is limited, but I found a nice &lt;a href="http://www.cs.unc.edu/~prins/Classes/633/Handouts/pram.pdf"&gt;write up&lt;/a&gt; ) that seems amenable to a map reduce implementation. The basic idea is that, instead of extending components one or few nodes at a time, we should merge them and enjoy an exponential increase in component size for each iteration, on average. More specifically, components are represented with directed forests overlaid on top of the input graph, that is using the same nodes but a separate set of edges. The trees in the forest are kept very shallow, essentially stars or very close to stars, so that eventually each node will have as parent the root of a tree, providing a convenient representation for each component. The most technical bit is that at each step, nodes are randomly assigned to two sets, let's call them "high" and "low" (the above write up uses "male" and "female"), but what counts really is the type of the parent in the forest. Edges that hit two nodes with parents of the same type are out of the game for that specific iteration. Edges that hit nodes with one high and one low parent can only be used in one direction, that is to attach the forest of the low parent under the forest of the high parent. If we didn't do this, working in parallel we could end up creating loops, thus turning trees into general graphs (specifically, DAGs) and violating a key invariant of the algorithm. It is still possible that a low node could have multiple neighboring high nodes belonging to different trees, but only one wins based on the properties of the PRAM CRCW model of parallel computation which specifies a way for concurrent writes to be sorted out. &lt;br /&gt;
&lt;br /&gt;
The following is a sketch of a map-reduce version of this PRAM algorithm. Its correctness and performance bounds follow from the fact that it closely emulates the original, step by step and therefore the original proof is still valid. In my pseudo code, we will consider tables as on disk data structures and consider join operations as primitive operations. Tables are implemented as distributed file system files. See the aforementioned &lt;a href="http://www.umiacs.umd.edu/~jimmylin/book.html"&gt;book&lt;/a&gt; for the details on how to implement joins, or the Hive software. We will build a directed forest on top of the graph, so we will devote a table $F$ to the edge list of the forest and on to $E$, the input graph $G=(V,E)$ (undirected edges are represented as pairs of directed ones). $F$ is initialized as $(v,v) \;\forall v \in V$ that is all the self-loops (this is a slight departure from the definition of a forest, but let me still call it a forest, it makes for simpler code with no special handling of the root nodes). The first operation is a join described in SQL-like language:&lt;br /&gt;
&lt;pre class="brush: sql"&gt;select E.u as u , E.v as v, F1.v as p1, F2.v as p2 from F F1 join E on E.u = F1.u join F F2 on E.v = F2.u&lt;/pre&gt;That is, we "annotate" each edge with the parents in the forest of each vertex hit by the edge. Now we describe a map phase with input $(u, v, p_1, p_2)$. $r$ is the map reduce round number.&lt;br /&gt;
&lt;pre class="brush: python"&gt;def map(u, v, p1, p2):
  if p1 == p2: #same component
    return [] #do nothing
  else: 
    h1 = hash(p1, r)%2 #randomly assign "high" and "low"
    h2 = hash(p2, r)%2
    if h1 == h2:
      return [] #if same give up
    else:
      return [(p1,p2) if h1 else (p2,p1)] #otherwise merge one into the other&lt;/pre&gt;This phase, in the original formulation for the PRAM, requires a CRCW model, without preference for a specific variant, so we will use the arbitrary CRCW model and implement it in the reduce phase, the first node of the edge being designated as the key.&lt;br /&gt;
&lt;pre class="brush: python"&gt;def reduce(list):
  return list[0]&lt;/pre&gt;(this is amenable to be used as a combiner as well)&lt;br /&gt;
The output of this reducer becomes F', a table of updates to the forest that needs to be merged with the existing F.&lt;br /&gt;
&lt;pre class="brush: sql"&gt;select coalesce(F'.u, F.u), coalesce(F'.v, F.v) from F left outer join F' on F.u = F'.u
&lt;/pre&gt;which is a way of saying: keep all the edges in F unless there is one in F' with the same starting node, then replace it.&lt;br /&gt;
&lt;br /&gt;
Finally,  there is a path shortening phase that turns trees in the forest into stars. This is implemented with a self join on the forest.&lt;br /&gt;
&lt;pre class="brush: sql"&gt;select F1.u, F2.v from F F1 join F F2 on F1.v = F2.u&lt;/pre&gt;Again this replaces F. The termination criterion is that if there are no edges in G with different parents, we are done.&lt;br /&gt;
&lt;br /&gt;
As in the original algorithm, the number of iteration is $O(\log(N))$, each of which requires sorting of all the edges, the graph's and the forest's. There are highly optimized sorting implementations for the map reduce model, but I haven't found a discussion of their asymptotic complexity. From a quick look at Terasort, I think it requires $O(N\log(N))$ work and takes $O(\log(N))$ time with $\Omega(N)$ available processors. These are the dominating computational costs, as the mappers and reducers above all execute in constant time, for an overall $O(N(\log(N))^2)$ work and $O((\log(N))^2)$ time. There is one detail to take into account for the reducer that implements the effects of the arbitrary CRCW PRAM model: for a high degree node, one reducer might get a very large proportion of the input. This is not a problem, as the reducer can just return after reading the first line of input and we can use the reducer as a combiner for and additional optimization. &lt;br /&gt;
&lt;br /&gt;
Another potential optimization could be to use, instead of a random bipartition of the nodes, a random priority. This way more mergers would happen at each step. The central part of the algorithm becomes:&lt;br /&gt;
&lt;pre class="brush: python"&gt;def map(u, v, p1, p2):
  if p1 == p2: #same component
    return [] #do nothing
  else:
    h1 = hash(p1, r) #randomly assign priority
    h2 = hash(p2, r)
    if h1 == h2:
      return [] #if same give up
    else:
      return [(p1,p2) if h1 &amp;gt; h2 else (p2,p1)] #otherwise merge one into the other&lt;/pre&gt;That would require though a more powerful path shortening phase, since the correctness proof relies on the trees in the forest having depth one at the end of each iteration, and the current path shortening can only halve the depth. I expect that this modified algorithm would be faster for dense graphs for which the forest has many fewer edges.&lt;br /&gt;
&lt;br /&gt;
Even without this optimization, you might have noticed that the diameter of the graph was notably absent from the discussion, and in fact this algorithm works well even for paths, which was our original goal. It could require more iterations though than the label propagation algorithm when the diameter of the graph is $\Omega(\log(N))$ as far as I can tell from these bounds, but I don't have an example where it does. It might be possible to show that this algorithm works faster than $O(\log(N))$ when the diameter is small.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-1133700587202708684?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=hGbHzJskmPM:X1o0y1axMyA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=hGbHzJskmPM:X1o0y1axMyA:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=hGbHzJskmPM:X1o0y1axMyA:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/hGbHzJskmPM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/1133700587202708684/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1133700587202708684?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1133700587202708684?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/hGbHzJskmPM/map-reduce-algorithm-for-connected.html" title="A map reduce algorithm for connected components" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEAGQXo-eip7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-3731536890180279009</id><published>2010-07-16T10:34:00.000-07:00</published><updated>2010-07-24T21:38:40.452-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:38:40.452-07:00</app:edited><title>Rapleaf Array Absurdity or On streaming problems in disguise</title><content type="html">From the interview challenges of an up and coming web startup, three problems that range from the trivial to the impossible. The key to the the solution is to recognize that the setting is close to that of &lt;i&gt;streaming algorithms&lt;/i&gt;, which allows for very limited space resources compared to the size of the input. This challenge assumes the data is in an array, an additional benefit not assumed in the streaming literature, but it's close enough to use some of those ideas, since random access to the array doesn't seem to help for the specific questions.  &lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt; &lt;br /&gt;
Imagine we have an immutable array of size N which we know to be filled with integers ranging from $0$ to $N-2$, inclusive.&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Write a function that checks to see if the array contains a duplicated entry. This function should run as quickly as possible. That is, complete the follow ing function:&lt;/li&gt;
&lt;blockquote&gt;bool contains_duplicate(int* array, int N) { // ... }&lt;/blockquote&gt;Solution: Always true by the &lt;i&gt;pigeonhole principle&lt;/i&gt;.
&lt;li&gt;Suppose we know that the array contains exactly one duplicated entry and that duplicate appears exactly twice. Find, in constant space and time proportional to $N$, the duplicated entry. That is, complete the following function:&lt;/li&gt;
&lt;blockquote&gt;int find_unique_duplicate(int* array, int N) { // ... }&lt;/blockquote&gt;Solution: \[\sum_i \mathrm{array}[i] - \frac{(N-2)(N-1)}{2}\] This is a known trick in the field of streaming algorithms.
&lt;li&gt;Suppose that we drop the guarantee that the array contains exactly one duplicated entry. Write another find_duplicate function so that it returns one of the duplicated entries. It should still run in constant space and time proportional to $N$.&lt;/li&gt;
&lt;blockquote&gt;int find_duplicate(int* array, int N) { // ... }&lt;/blockquote&gt;Solution or lack thereof: It's impossible in the asymptotic complexity sense. Check J. Tarui, "Finding a duplicate and a missing item in a stream". In TAMC, pages 128–135, 2007 -- specifically Theorem 1 in Section 3. There is $O(N log N)$ solution, see P. Gopalan, J. Radhakrishnan "Finding duplicates in a data stream". Briefly, one can do a binary search for the value of a duplicated item by splitting the current search interval and counting in one array pass if there are more items above or below the threshold. Pick the interval with the larger count and repeat. If you make the size of the array and size of the range of elements independent, then this solution is O(N log M) and since C ints are a finite set, you could see this as $O(N)$, but if $M = N - 2$ then this is a finite problem and there is no asymptotic analysis one can apply. I guess this is more about clarifying the problem then actually solving it&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-3731536890180279009?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=H5a7y7oQ9W4:7bVHuqgVC7o:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=H5a7y7oQ9W4:7bVHuqgVC7o:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=H5a7y7oQ9W4:7bVHuqgVC7o:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/H5a7y7oQ9W4" height="1" width="1"/&gt;</content><link rel="related" href="http://www.webcitation.org/5kVOQTX9I" title="Rapleaf Array Absurdity or On streaming problems in disguise" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/3731536890180279009/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2010/07/rapleaf-array-absurdity-or-on-streaming.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3731536890180279009?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3731536890180279009?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/H5a7y7oQ9W4/rapleaf-array-absurdity-or-on-streaming.html" title="Rapleaf Array Absurdity or On streaming problems in disguise" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>4</thr:total><feedburner:origLink>http://blog.piccolboni.info/2010/07/rapleaf-array-absurdity-or-on-streaming.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEADRHk-cCp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-4775725705740920907</id><published>2008-10-05T20:10:00.000-07:00</published><updated>2010-07-24T21:39:35.758-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:39:35.758-07:00</app:edited><title>Six Products, Six Carbon Footprints - WSJ.com</title><content type="html">Interesting &lt;a href="http://online.wsj.com/article/SB122304950601802565.html"&gt;article&lt;/a&gt; in the Wall Street Journal about attempts to study the carbon footprint of different products&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
The surprising lesson, at least for me, is that transportation does not account for much of the carbon footprint of several products. Sometimes is one of the raw materials (leather, polyester), sometimes is the presentation (refrigerated beer), sometimes the natural production processes (milk -- cows produce lots of methane). It's also a very practical read that you can turn into action.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-4775725705740920907?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Q596-eyUMGg:N7YkkIG5Hks:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Q596-eyUMGg:N7YkkIG5Hks:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Q596-eyUMGg:N7YkkIG5Hks:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/Q596-eyUMGg" height="1" width="1"/&gt;</content><link rel="related" href="http://online.wsj.com/article/SB122304950601802565.html" title="Six Products, Six Carbon Footprints - WSJ.com" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/4775725705740920907/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/10/six-products-six-carbon-footprints.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4775725705740920907?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/4775725705740920907?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/Q596-eyUMGg/six-products-six-carbon-footprints.html" title="Six Products, Six Carbon Footprints - WSJ.com" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/10/six-products-six-carbon-footprints.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUEEQ3s6fip7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-2204670956275188003</id><published>2008-08-17T12:15:00.000-07:00</published><updated>2010-07-24T21:53:22.516-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:53:22.516-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="technology" /><category scheme="http://www.blogger.com/atom/ns#" term="computing" /><title>Google returns more AdSense rich results than Live: conflict of interest?</title><content type="html">An experiment in comparative search engine analysis seems to show that who is serving ads in the pages listed in the result page matters.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Several search pundits have commented on conflicts of interest Google is potentially exposed to.  Google itself &lt;a href="http://www.techcrunch.com/2008/04/02/google-to-sell-performics/"&gt;admitted&lt;/a&gt; to not wanting to hold onto Performics, the search engine marketing division of their recent purchase Doubleclick, and in fact they quickly sold it. But that's only one of many conflicts of interest Google has to deal with. John Battelle &lt;a href="http://battellemedia.com/archives/004574.php"&gt;observes&lt;/a&gt; in his searchblog that, in a comparison between Google Ad Planner and Comscore data, sites that use AdSense, Google's advertising platform, get better traffic numbers. If you use the free Ad Planner to plan an ad campaign your money is more likely to flow to Google's coffers than if you were using Comscore. Google provided a statement denying any deliberate bias, but it's hard to find alternative explanations. A similar conflict of interest arises for Quantcast and their Media Planner and rating services w.r.t. being fair to publishers that don't use their tag (they are not &lt;i&gt;quantified&lt;/i&gt; in their jargon). They provide ratings and demographic analyses for both, but directly get data only from quantified publishers. Since I am a former employee I will abstain from a more detailed analysis in this case.&lt;br /&gt;
&lt;br /&gt;
Another conflict of interest is between Google the search company and Google the media company: will Google search give prominence to Google properties Picasa, Knol, Blogger and YouTube over the rest of the net? Timo Paloheimo is afraid that could be the case and so he created a &lt;a href="http://www.startupbin.com/2008/08/11/google-minus-google/"&gt;custom search engine&lt;/a&gt; that excludes such properties, a draconian solution in my opinion but a clear expression of the malaise generated by Google's dominant position in search and the multiple conflicts of interest with other Google products. &lt;br /&gt;
&lt;br /&gt;
But this is only the tip of the iceberg IMHO: the main conflict of interest is between search and advertisment. When companies like Answers.com report that their traffic is down 28% because of a change in their Google ranking, Om Malik rightfully &lt;a href="http://gigaom.com/2007/08/02/answerscom-raises-questions-about-googles-power/"&gt;comments&lt;/a&gt; that this very fact &lt;i&gt;"raises questions about Google’s power"&lt;/i&gt;.&lt;br /&gt;
Google has an interest in returning the most relevant and useful results in each search; and also has an interest in directing traffic to sites that use adSense because they provide a big part of Google's revenue. How do they balance the two? Google's answer is that the two businesses are run as independently as possible (I forgot the source here, but I am almost sure it was a Google spokesperson). But is that a sufficient answer? Or should we trust Microsoft not to tweak their live search? I think I have &lt;a href="http://spreadsheets.google.com/pub?key=pVGoEOC1mxMvU79Ok8b0-AA"&gt;&lt;b&gt;strong evidence&lt;/b&gt;&lt;/a&gt; that Google search results are more likely to carry adSense ads than Microsoft's live.com.&lt;br /&gt;
&lt;!-- more --&gt;&lt;br /&gt;
The methodology is as follows:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Sample random search keywords&lt;/li&gt;
&lt;li&gt;Use each of them as a query on google.com and live.com&lt;/li&gt;
&lt;li&gt;Check if any of the results in the first page uses adSense&lt;/li&gt;
&lt;/ol&gt;I considered the ratio between the number of search matches using adSense and the total number of accessible search results (normally 10, but sometimes some could not be accessed within a reasonable time) for each keyword. Out of 632 searches, Google-provided matches were more adSense rich than live.com ones 316 times, the opposite was true 126 times and the ties were 190. Pretty remarkable, isn't it? If you are not convinced by your intuition or are statistically inclined, applying a Wilcoxon signed rank test rejects the hypothesis of the two search engines results being equally adSense rich with a p-value smaller than 2.2e-16. &lt;br /&gt;
&lt;br /&gt;
Correlation is not causation. It could be that either company is tweaking their engine to favor or penalize adSense; or it could be some other mechanism I haven't figured out yet. But consider this hypothetical scenario: Google doesn't feel it needs to compete so hard in search anymore, wants to boost revenue, decides to give a small but competitively significant rank boost to adSense-running sites, those in turn outcompete their non-adSense-running competitors. We really don't want this to happen. We just recovered from another stranglehold on the computing world.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-2204670956275188003?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=0eOMXynDWls:GAxkR04W9AQ:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=0eOMXynDWls:GAxkR04W9AQ:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=0eOMXynDWls:GAxkR04W9AQ:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/0eOMXynDWls" height="1" width="1"/&gt;</content><link rel="related" href="http://spreadsheets.google.com/pub?key=pVGoEOC1mxMvU79Ok8b0-AA" title="Google returns more AdSense rich results than Live: conflict of interest?" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/2204670956275188003/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/08/google-returns-more-adsense-rich.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/2204670956275188003?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/2204670956275188003?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/0eOMXynDWls/google-returns-more-adsense-rich.html" title="Google returns more AdSense rich results than Live: conflict of interest?" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>2</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/08/google-returns-more-adsense-rich.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUYBRnw4fyp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-1039392969032293063</id><published>2008-08-11T09:20:00.001-07:00</published><updated>2010-07-24T21:45:57.237-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:45:57.237-07:00</app:edited><title>Our energy future up close and personal</title><content type="html">Saul Griffith, CEO of a wind power startup, tries to map out a possible energy future in terms of global and personal choices.&lt;br /&gt;
 &lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
It's the first time I see somebody mapping sustainable energy choices down to how many trips one can take or how many energy drinks one can have. Extremely instructive, a "&lt;a href="http://blip.tv/file/1018152"&gt;must watch&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-1039392969032293063?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4tZyEtZO6yc:z9IdICHErNU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4tZyEtZO6yc:z9IdICHErNU:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=4tZyEtZO6yc:z9IdICHErNU:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/4tZyEtZO6yc" height="1" width="1"/&gt;</content><link rel="related" href="http://blip.tv/file/1018152" title="Our energy future up close and personal" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/1039392969032293063/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/08/our-energy-future-up-close-and-personal.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1039392969032293063?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1039392969032293063?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/4tZyEtZO6yc/our-energy-future-up-close-and-personal.html" title="Our energy future up close and personal" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/08/our-energy-future-up-close-and-personal.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUYNR3w5eyp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-3276089771649253998</id><published>2008-07-30T15:34:00.000-07:00</published><updated>2010-07-24T21:46:36.223-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:46:36.223-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="technology" /><category scheme="http://www.blogger.com/atom/ns#" term="computing" /><title>reCAPTCHA, a bogus motivation</title><content type="html">The authors claim they use spare cycles for a good cause, but it isn't true.&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
CAPTCHAs are little reading tests that allow to distinguish humans from machines when accessing a computer interface. A word is displayed in a slightly noisy form and the user is required to type it in, easy for humans but not so for machines. The authors of reCAPTCHA wanted to associate a useful task to this test. They say&lt;br /&gt;
&lt;blockquote&gt;... in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort?&lt;/blockquote&gt;So instead of picking random words for the test, they pick word-sized fragment from large public digitization efforts (currently the one led by the Internet Archive) and submit them to the users. That is, reCAPTCHA words are words that are not in digital format, appear in a paper document of interest and machines could not understand. So by solving a reCAPTCHA you are helping a useful project. So where is the catch? Well it took me a moment of thought but it's kind of obvious: if these are fragments of real digitization projects, by definition we don't know the solution. If we don't know it, we can't use them as CAPTCHAs! So I went back to the reCAPTHCA web site, and what they do is to propose two CAPTCHAs, one with a known but useless solution and the other with an unknown but useful one, in a random order. If you want to use a website, you have to complete both. Therefore there is no use of spare cycles whatsoever: the effort for the user has been doubled. If reCAPTCHA were the standard, there would be another 150,000 hours a day spent solving the second word. Fine by me in exchange for a free service, but not as efficient as advertised, not even close.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-3276089771649253998?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=10dlIX4gmqE:tV7YIPNXUeA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=10dlIX4gmqE:tV7YIPNXUeA:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=10dlIX4gmqE:tV7YIPNXUeA:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/10dlIX4gmqE" height="1" width="1"/&gt;</content><link rel="related" href="http://recaptcha.net/learnmore.html" title="reCAPTCHA, a bogus motivation" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/3276089771649253998/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/07/recaptcha-bogus-motivation.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3276089771649253998?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/3276089771649253998?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/10dlIX4gmqE/recaptcha-bogus-motivation.html" title="reCAPTCHA, a bogus motivation" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/07/recaptcha-bogus-motivation.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUUHQ3s6fSp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-8493780365767406416</id><published>2008-07-23T00:01:00.000-07:00</published><updated>2010-07-24T21:47:12.515-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:47:12.515-07:00</app:edited><title>Live search cashback not so cheap</title><content type="html">It's been &lt;a href="http://www.techcrunch.com/2008/05/22/the-empire-strikes-back-our-analysis-of-microsoft-live-search-cashback/"&gt;said&lt;/a&gt; that the cashback program with which Microsoft is trying to increase it search market share is a bold if not desperate move, because MS is giving away to the user its search revenue. Not so fast.&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
I did an unscientific test simulating the purchase of a North Face Dyad 22 (a backpacking tent) and a pair of Wilson Trance all court (tennis shoes). I searched with &lt;a href="http://search.live.com/cashback"&gt;live&lt;/a&gt; and &lt;a href="http://www.google.com/products"&gt;google products&lt;/a&gt;. In both cases, I found a better deal with google products ($188 and $79 vs $201 and $82), live cashback included. One could speculate that google indexes more merchants or that there is a fee to be included in the cashback program, but I have no way to tell which is closer to the truth. Did I hit two outliers or is this something other people noticed?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-8493780365767406416?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=iqP6bqCh6ns:K-EIZls9yZ4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=iqP6bqCh6ns:K-EIZls9yZ4:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=iqP6bqCh6ns:K-EIZls9yZ4:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/iqP6bqCh6ns" height="1" width="1"/&gt;</content><link rel="related" href="http://search.live.com/cashback" title="Live search cashback not so cheap" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/8493780365767406416/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/07/live-search-cashback-not-charitable.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8493780365767406416?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8493780365767406416?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/iqP6bqCh6ns/live-search-cashback-not-charitable.html" title="Live search cashback not so cheap" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/07/live-search-cashback-not-charitable.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUUAR38_cSp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-5641416747338185821</id><published>2008-06-08T23:47:00.000-07:00</published><updated>2010-07-24T21:47:26.149-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:47:26.149-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="technology" /><category scheme="http://www.blogger.com/atom/ns#" term="science" /><title>Map of Science</title><content type="html">Fascinating map of the relation between scientific disciplines. Not all of the methodology is clear, but it is fun to look at. Notice the lack of ties between CSE and Biology.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://mapofscience.com/"&gt;Map of Science&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-5641416747338185821?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=t3zW8sZ2dyU:tTRIcMY04wA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=t3zW8sZ2dyU:tTRIcMY04wA:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=t3zW8sZ2dyU:tTRIcMY04wA:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/t3zW8sZ2dyU" height="1" width="1"/&gt;</content><link rel="related" href="http://mapofscience.com/" title="Map of Science" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/5641416747338185821/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2008/06/map-of-science.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/5641416747338185821?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/5641416747338185821?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/t3zW8sZ2dyU/map-of-science.html" title="Map of Science" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2008/06/map-of-science.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUUDRn47fip7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-8382607382437603581</id><published>2007-11-04T20:57:00.000-08:00</published><updated>2010-07-24T21:47:57.006-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:47:57.006-07:00</app:edited><title>Facebook Illegal Wiretaps</title><content type="html">&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
The formulation of this problem is quite creative, but overall it is just describing a matrix where the rows are workers and the columns are tasks. Workers have numbers and tasks have names and the job completion time depend on whether the worker is odd or even, the number of vowels and consonants in the task name and even common factors between the task name length and the worker number. Contrived but trivial.  On top of that one is called to solve an assignment problem or  weighted bipartite matching in graph speak. One algorithm to do just that is the &lt;a href="http://search.cpan.org/src/ANAGHAKK/Algorithm-Munkres-0.06/lib/Algorithm/Munkres.pm"&gt;Hungarian Algorithm&lt;/a&gt; (I wish I could speak of the Italian Algorithm, but there are none I know of). To create the input matrix one can use the program below, but it doesn't really achieve the full generality called for by the Facebook people. I wrote it down just to convince myself there wasn't some obvious structure in the matrix. It is written in my latest loop-free style (what do you do if your company doesn't switch to a functional language? Simple, write functionally in any language to the extent that is possible).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;use strict;

sub nconsonants {
my $!$a = shift;
$!$a=~s/[aeiouAEIOU]+//g;
return length($!$a);
}

sub factors {
my $!$a = shift;
my @factors = ([],[],[2],[3],[2,2], [5], [2,3], [7], [2,2,2], [3,3], [2,5]);
return $!$factors[length($!$a)];
}


sub name2info {
my $!$a = shift;
return {ncon=&gt;nconsonants($!$a),
nvow=&gt;length($!$a) - nconsonants ($!$a),
fact=&gt;factors($!$a)
};
}

sub progtime {
my ($!$prog, $!$info) = @_;
my $!$sum = 0;
map({$!$sum+=(($!$prog % $!$_)?0:4)} @{$!$info-&gt;{fact}});
return ($!$prog % 2 ? 1.5 * $!$info-&gt;{nvow} : $!$info-&gt;{ncon}) + $!$sum + 1;
}

my @names = qw/ANDROMEDA BARBARA CAMERON DAGMAR EKATERINA FLANNERY GREGORY HAMILTON ISABELLA
JEBEDIAH KIMBERLEY LARISSA MEREDITH NORMAN OSWALD PENELOPE QUENTIN RANDALL SAVANNAH TABITHA
URSULA VIVIENNE WINONA XAVIER YVONNE ZENOBIA/;

my @res=
map ({my $!$info = $!$_;
[map({my $!$prog = $!$_;
progtime($!$prog, $!$info);
}
1..26)]
}
map({name2info $!$_
}
@names));

print join "\n", map({join " \t", @$!$_} @res);

&lt;pre&gt;&lt;/pre&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-8382607382437603581?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=L5XL_nvtCPk:atwepkiuTZc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=L5XL_nvtCPk:atwepkiuTZc:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=L5XL_nvtCPk:atwepkiuTZc:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/L5XL_nvtCPk" height="1" width="1"/&gt;</content><link rel="related" href="http://www.facebook.com/jobs_puzzles/?puzzle_id=11" title="Facebook Illegal Wiretaps" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/8382607382437603581/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2007/11/facebook-illegal-wiretaps.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8382607382437603581?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8382607382437603581?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/L5XL_nvtCPk/facebook-illegal-wiretaps.html" title="Facebook Illegal Wiretaps" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>1</thr:total><feedburner:origLink>http://blog.piccolboni.info/2007/11/facebook-illegal-wiretaps.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUQAQng9cCp7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-8137554217911239431</id><published>2007-11-02T15:24:00.000-07:00</published><updated>2010-07-24T21:49:03.668-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:49:03.668-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="computing" /><title>Facebook Prime Bits</title><content type="html">&lt;div style="text-align: justify;"&gt;This is one of Facebook job candidate puzzles. Given a range [a,b] of positive integer numbers, test for the primality of the number of 1 bits in the binary representation of each number, and do so in  O(n)  where n is b -  a.&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
Unfortunately the puzzle goes on to assume that n is a 64 bit integer, and that makes asymptotic analysis quite meaningless, so I will discard this assumption for now and comment on it later. The naive solution loops through the given range converting each number to its binary representation, counting 1 bits and testing primality. The complexity of this algorithm is O((b-a) log(b)), using the polynomial time deterministic primality test of Agrawal,  Saxena and Kayal. One might think that the primality test is the hard part of the algorithm, but the numbers to be tested are  pretty small. The representation of b takes log(b) bits. The number of 1 bits in it is at most, of course, log(b) and that can be represented with log(log(b)) bits, and the ASK algorithm is polynomial w.r.t to the size of its input, or poly(log(log(b))). So is it possible to get rid of the log(b) term depending on counting the bits in the binary representation of a number? The answer is in the following functions and is positive. The main idea is to avoid recomputing the same partial sum of bits over and over again. Imagine a trie or prefix tree of the binary representations of all the numbers in the range. Imagine a traversal in preorder that is performed while keeping a count of 1 bits seen so far between the root (most significant bit) and the current position. Every time a leaf (complete number) is reached, the results are extended with one item.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;use strict;

sub countbits {
return f(@_, 0, []);
}

sub f {
my ($a, $b, $count, $retv) = @_;
if ($a != $b) {
my $k = int(log2($b));
if (topbits($a, $k) == topbits($b, $k)) {
return  f(bottombits($a, $k),
bottombits($b, $k),
$count + topbits($a, $k),
$retv );
}
else {
f(bottombits($a, $k),  
2**$k - 1,
$count,
$retv);
return  f(0,
bottombits($b, $k),
$count + 1,  
$retv);
}
}
else {
push @$retv, $count + g($a);
return $retv;
}
}

sub g {
my $a = shift;
return ($a == 0 ? 0 :$a % 2 + g ($a &gt;&gt; 1));
}

sub topbits{
my ($a, $k) = @_;
return $a &gt;&gt; $k;
}

sub bottombits{
my ($a, $k) = @_;
return $a % 2**$k;
}

sub log2 {
my $n = shift;
return log($n)/log(2);
}

&lt;/pre&gt;&lt;br /&gt;
The shift and mod operations are not strictly constant time, but they are used just to select the first bit and the remaining bits, so that can be done in constant time but I thought the function was clearer this way. The worst case analysis assumes that the double recursive call is always performed and relies on the fact that the number of significant digits is decreasing by one with each level of recursion. G is of course O(log(a)). This establishes a O((b-a)+log(a)) bound. So there is a log(a) term in there, and we still need the primality test, either precomputed (O((b-a)+log(a)+log(b)poly(log(log(b)))) total time) or performed on each number in O((b-a) poly(log(log(b)))). Neither is superior depending on a. So one can not strictly achieve O(b-a) as claimed by the Facebook gurus. This is not a limitation of my approach. If it were possible, it would imply for a=b that primality can be tested in constant time. Two friends of mine and programming superstars both  used the additional assumption  of 64 bit numbers to consider log(b) &lt;=64 and therefore constant as we were discussing this problem. Personally, I think mixing finite and asymptotic analysis doesn't really help making things more clear and ridding oneself of a log(b) factor when  it can be as large as 64 can have practical consequences. Actually, I think this is a great example that asymptotic analysis is insightful. Still it would be nice to make sure that my algorithm doesn't have horrible constants in its time complexity function. This is left as an exercise for the reader (suggestion: use generating functions). &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-8137554217911239431?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Vu18lvgYB8Y:PFCL12SZHXs:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Vu18lvgYB8Y:PFCL12SZHXs:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=Vu18lvgYB8Y:PFCL12SZHXs:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/Vu18lvgYB8Y" height="1" width="1"/&gt;</content><link rel="related" href="http://www.facebook.com/jobs_puzzles/" title="Facebook Prime Bits" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/8137554217911239431/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2007/11/facebook-prime-bits.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8137554217911239431?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/8137554217911239431?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/Vu18lvgYB8Y/facebook-prime-bits.html" title="Facebook Prime Bits" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2007/11/facebook-prime-bits.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUMBRH46fSp7ImA9WB9XEk0.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-1337825549252782027</id><published>2007-05-24T12:17:00.000-07:00</published><updated>2007-11-04T12:17:35.015-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-11-04T12:17:35.015-08:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="computing" /><title>ProjectDescription - Lucene-hadoop Wiki</title><content type="html">Implementation of simple parallel computing, based on Google's map-reduce, runs over Amazon's EC2. Supercomputing for the rest of us   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://wiki.apache.org/lucene-hadoop/ProjectDescription"&gt;ProjectDescription - Lucene-hadoop Wiki&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-1337825549252782027?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=e2hZndNky4w:dbnoTm0zfc4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=e2hZndNky4w:dbnoTm0zfc4:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=e2hZndNky4w:dbnoTm0zfc4:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/e2hZndNky4w" height="1" width="1"/&gt;</content><link rel="related" href="http://wiki.apache.org/lucene-hadoop/ProjectDescription" title="ProjectDescription - Lucene-hadoop Wiki" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/1337825549252782027/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2007/05/projectdescription-lucene-hadoop-wiki.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1337825549252782027?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/1337825549252782027?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/e2hZndNky4w/projectdescription-lucene-hadoop-wiki.html" title="ProjectDescription - Lucene-hadoop Wiki" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2007/05/projectdescription-lucene-hadoop-wiki.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUMERHc9eip7ImA9Wx5TEE4.&quot;"><id>tag:blogger.com,1999:blog-12987422.post-2067834701045677398</id><published>2007-03-22T12:01:00.000-07:00</published><updated>2010-07-24T21:50:05.962-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-24T21:50:05.962-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="bioinformatics" /><category scheme="http://www.blogger.com/atom/ns#" term="science" /><title>PLoS Computational Biology - “Antedisciplinary” Science</title><content type="html">Sean Eddy, a computational biologist, critiques the big science team mentality promoted by the NIH. New disciplines are always founded amid resistance from the existing powers. When the rise of a new discipline is unavoidable, existing powers try to split the pie by forming teams -- and requiring everyone else to do so if they want funding. Well, the original spin is not so political, read the paper yourself. It voices the opinions of many computational biologists I know, who are struggling to fit into some department and get funding.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/12987422-2067834701045677398?l=blog.piccolboni.info' alt='' /&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=TUVtPWgVY68:fNS66Ap1D4I:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=TUVtPWgVY68:fNS66Ap1D4I:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/piccolboni/qdqo?a=TUVtPWgVY68:fNS66Ap1D4I:bcOpcFrp8Mo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/piccolboni/qdqo?d=bcOpcFrp8Mo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/piccolboni/qdqo/~4/TUVtPWgVY68" height="1" width="1"/&gt;</content><link rel="related" href="http://compbiol.plosjournals.org/perlserv/?request=get-document&amp;doi=10.1371/journal.pcbi.0010006#journal-pcbi-0010006-b02" title="PLoS Computational Biology - “Antedisciplinary” Science" /><link rel="replies" type="application/atom+xml" href="http://blog.piccolboni.info/feeds/2067834701045677398/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://blog.piccolboni.info/2007/03/plos-computational-biology.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/2067834701045677398?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/12987422/posts/default/2067834701045677398?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/piccolboni/qdqo/~3/TUVtPWgVY68/plos-computational-biology.html" title="PLoS Computational Biology - “Antedisciplinary” Science" /><author><name>Antonio Piccolboni</name><uri>http://www.blogger.com/profile/18181181557046696245</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://blog.piccolboni.info/2007/03/plos-computational-biology.html</feedburner:origLink></entry></feed>

