<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>BEST IN CLASS</title>
	
	<link>http://www.bestinclass.dk</link>
	<description>Software Innovator</description>
	<lastBuildDate>Sun, 28 Feb 2010 17:25:41 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/bestinclass-the-blog" /><feedburner:info uri="bestinclass-the-blog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Getting benchmarking right</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/dMwlra3uYnM/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/benchmarking-jvm-languages/#comments</comments>
		<pubDate>Sun, 28 Feb 2010 15:30:56 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[consensus]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scala]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=1121</guid>
		<description><![CDATA[
			
				
			
		
Several times on this blog, I’ve dealt with issues where some kind of benchmarking was required. The method, implementation, and environment all play a role and have subsequently been the object of much discussion. In this post let’s see if we can agree on some way of benchmarking.



Benchmarking is 4D
In my opinion there are 4 dimensions [...]]]></description>
			<content:encoded><![CDATA[
<p>Several times on this blog, I’ve dealt with issues where some kind of benchmarking was required. The method, implementation, and environment all play a role and have subsequently been the object of much discussion. In this post let’s see if we can agree on some way of benchmarking.</p>
<p><br class="spacer_" /></p>
<p><span id="more-1121"></span></p>
<p><br class="spacer_" /></p>
<h1>Benchmarking is 4D</h1>
<p>In my opinion there are 4 dimensions to benchmarking which we need to deal with separately. The first is industry benchmarking, which can be crucial to architecting your solution correctly and/or dealing with performance issues. Industry benchmarking needs to come as close to the real scenario as possible. If you’re working on a massive scale, benchmarking on a laptop is a no go; if your setup is heavy on I/O, you need to clone your production environment in order to produce reliable results, and so on.</p>
<p>The second type of benchmarking is what is typical for bloggers, and I’m no exception: microbenchmarking. Microbenchmarking is where we take certain routines out and benchmark them individually; my recent <a href="http://www.bestinclass.dk/index.php/2010/02/haskell-ruby-clojure/" target="_blank">Fibonacci</a> <a href="http://www.bestinclass.dk/index.php/2010/02/haskellrubyscalaclojure-tweaked/" target="_blank">posts</a> are examples of this. As the commenters pointed out, there needs to be an emphasis on equality where possible. A Ruby solution won’t look like a Clojure solution, but differences like returning versus printing the result should be eliminated, and defining the specific area to benchmark is very important: are we timing the function or the inner loop?</p>
<p>This second form of benchmarking is divided into 2 distinct categories as well (bringing us up to the 4D benchmarking), because we have many languages which put the JVM to good effect with all the added benefits, but there are also languages that don’t. Benchmarking on the JVM requires a good knowledge of both the JVM and HotSpot, whereas benchmarking outside of the JVM requires a good knowledge of whichever compiler you are working with. In that regard I couldn’t do Haskell any favors, as I don’t have a clue about how the compiled code runs, what its startup times are, etc.</p>
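<p>To make the “returning versus printing” point concrete, here is a minimal sketch in Ruby. The helper names are made up for illustration and are not taken from any of the posts; the idea is only that the routine returns its result, the timed region wraps the call alone, and printing happens afterwards.</p>

```ruby
# Hypothetical helper names; only the shape of the measurement matters here.

# The routine under test returns its result instead of printing it.
def first_fib_index_with_digits(digits)
  limit = 10**(digits - 1)
  a, b, i = 0, 1, 1
  until b >= limit
    a, b = b, a + b
    i += 1
  end
  i
end

# Time only the call itself, using a monotonic clock, and return
# both the result and the elapsed milliseconds.
def timed_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
  [result, elapsed]
end

result, ms = timed_ms { first_fib_index_with_digits(1000) }
puts "index #{result} in #{ms.round(3)} ms" # printing happens outside the timed region
```

<p>Whether you time the whole function (as here) or only the inner loop is a separate decision, but it should be the same decision for every language in the comparison.</p>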
<p><br class="spacer_" /></p>
<h1>JVM Benchmarking</h1>
<p>Being able to benchmark on the JVM is important and requires some background. I recommend reading <a href="http://www.ibm.com/developerworks/java/library/j-jtp02225.html" target="_blank">this post</a> in particular.</p>
<p>The Unix time command is ruled out for the obvious reasons that <strong>1)</strong> we don’t want to be affected by the startup time of the JVM and <strong>2)</strong> we don’t need it. Especially when we’re looking at a single algorithm, loop, etc., what we really want to know is how well that given body of code performs. Its interaction with the rest of our system can be disregarded for the sake of comparison, and since we’re often dealing with very small numbers, subtle differences quickly become not so subtle. Before we can get into the actual benchmarking we need to boot the JVM.</p>
<h3>What we don’t care about</h3>
<p>For the sake of blogging/microbenchmarking we’re typically not doing any major optimizations, because usually idioms are being compared. There are exceptions, but we can deal with them once we get there. The JVM is a fantastic ecosystem which allows for rigorous introspection, all of which we can disregard for our simple exercises.</p>
<p>Things we can disregard:</p>
<blockquote><p><strong>JVM Parameters:</strong></p>
<p><strong>-XX:+PrintCompilation</strong>: This gives us a heads-up whenever a method is compiled.</p>
<p><strong>-verbose:gc</strong>: Lets us keep tabs on when the GC is running</p>
<p><strong>-XX:+PrintGC</strong>: Print messages at GC</p>
<p><strong>-XX:+PrintGCDetails</strong>: More verbose GC messages</p>
<p><strong>-XX:+TraceClassLoading</strong>: Trace the loading of classes</p>
<p><strong>-agentlib:hprof[<em>=options</em>]</strong>: Heap/CPU profiling, read this: <a href="http://java.sun.com/developer/technicalArticles/Programming/HPROF.html" target="_blank">options</a></p>
</blockquote>
<p>All of these (and dim sum) can be very helpful in tracing performance bottlenecks, but for microbenchmarking they seem overkill. Depending on how much allocation we’re doing we might trigger some heavy GC, but in that case we should account for it by stripping high/low values from our timings.</p>
<p>The things we cannot disregard:</p>
<blockquote><p><strong>JVM Parameters:</strong></p>
<p><strong>-Xms128M</strong>: The initial (minimum) amount of heap memory the JVM allocates</p>
<p><strong>-Xmx512M</strong>: The maximum amount</p>
<p><strong>-server/-client</strong>: Depending on which we choose for a given task, we actually get <a href="http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client" target="_blank">different compilers</a></p>
</blockquote>
<p>We should strive to keep these values identical for every language in a comparison, because depending on the task they can make all the difference. The minimum amount is allocated at JVM boot, meaning no time will be spent growing the memory space while the program is running, as long as you don’t exceed it. Though I haven’t checked, Jarkko Oranen suggested that the performance degradation which comes from approaching the Xmx value is due to the GC working double time to free up memory. It’s therefore important to set the max to an appropriate value, which will vary from test to test.</p>
<p><br class="spacer_" /></p>
<h1>Methodology</h1>
<p>I don’t think there’s a set standard for how we benchmark when doing blog comparisons, so allow me to introduce an outline for how we can handle comparisons done on this blog:</p>
<blockquote><p><strong>Do multiple passes</strong>: The code being benched should be executed repeatedly a given number of times; a default for smaller algorithms could be 20 passes.</p>
</blockquote>
<p>Because we eventually hit GCs or other burps of the system, multiple passes ensure that we get the overall picture.</p>
<blockquote><p><strong>Garbage collect: </strong>Don’t heap it up between runs, but clean after every pass.</p>
</blockquote>
<p>This is the simplest way in which you can tame the GC and help get uniform results.</p>
<blockquote><p><strong>Filter highest and lowest values</strong>: Take the highest and lowest value and remove them from the timings.</p>
</blockquote>
<p>When timing on the JVM some of the bumps are very significant. You can have a series of runs going at <strong><em>5ms</em></strong> and then suddenly a pass that takes <strong><em>40ms</em></strong>. Since it’s more a reflection of the disturbance than of the actual algorithm, it can safely be stripped.</p>
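<p>As a small illustration of the filtering rule, here is a sketch in Ruby. The function name and the sample numbers are invented for this example; it simply drops one highest and one lowest timing before averaging.</p>

```ruby
# Hypothetical helper: average of the timings after dropping one highest
# and one lowest value, mirroring the filtering rule described above.
def trimmed_average(timings)
  raise ArgumentError, "need more than 2 timings" unless timings.size > 2
  total = timings.sum
  (total - timings.max - timings.min) / (timings.size - 2).to_f
end

timings_ms = [5.0, 5.2, 40.0, 5.1, 4.9] # made-up numbers, one "burp" at 40 ms
puts trimmed_average(timings_ms)         # the 40.0 and the 4.9 are dropped
```

<p>Note that this removes exactly one value at each end, so two disturbances in the same run would still leave one of them in the average.</p>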
<blockquote><p><strong>Warm up the JVM:</strong> To trigger all the optimizations etc, the main loop should be repeated a number of times</p>
</blockquote>
<p>Since we’re blogging, I recommend just going for either 1 pass of the algorithm, or repeated passes of the algorithm until we’ve been crunching for 1 minute, whichever takes longer. So if your algorithm takes 30 seconds for 1 pass, it will run twice; if it takes 2 minutes per pass, it will run once.</p>
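<p>The warm-up rule can be sketched in a few lines of Ruby. This is only an illustration of the rule itself (at least one pass, then keep going until the time budget is spent); the name <em>warm_up</em> and its parameter are assumptions of this sketch, and how much a warm-up actually helps depends on the runtime you are on.</p>

```ruby
# Hypothetical helper: run the block at least once, and keep repeating it
# until `budget_seconds` have elapsed. Returns the number of warm-up passes.
def warm_up(budget_seconds = 60)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  passes = 0
  loop do
    yield
    passes += 1
    GC.start # clean up after every pass, as in the rule above
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) - start >= budget_seconds
  end
  passes
end

# A tiny budget so the example finishes quickly; the post suggests 60 seconds.
puts warm_up(0.05) { (1..1000).reduce(:+) }
```

<p>With a 60 second budget, a 30 second pass runs twice and a 2 minute pass runs once, which matches the rule above.</p>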
<p><br class="spacer_" /></p>
<h1>Implementation</h1>
<p>To make this fair I’ll provide a benchmark macro in Clojure and if you are also the happy user of some other JVM language which you want included in future benchmarks on this site, please provide me with a similar function/macro, which I will then both post here and use in future comparisons.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">microbench</span>
  <span style="color: #87cefa;">" Evaluates the expression n number of times, returning the average
    time spent in computation, removing highest and lowest values.

    If the body of expr returns nil, only the timing is returned otherwise
    the result is printed - does not affect timing.

    Before timings begin, a warmup is performed lasting either 1 minute or
    1 full computational cycle, whichever takes longer."</span>
  [n expr] {<span style="color: #7fffd4;">:pre</span> [(<span style="color: #7fffd4;">&gt;</span> n 2)]}
  `(<span style="color: #afeeee; font-weight: bold;">let</span> [warm-up#  (<span style="color: #afeeee; font-weight: bold;">let</span> [start# (System/currentTimeMillis)]
                     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Warming up!"</span>)
                     (<span style="color: #7fffd4;">while</span> (<span style="color: #7fffd4;">&lt;</span> (System/currentTimeMillis) (<span style="color: #7fffd4;">+</span> start# (<span style="color: #7fffd4;">*</span> 60 1000)))
                            (<span style="color: #7fffd4;">with-out-str</span> ~expr)
                            (System/gc))
                     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Benchmarking..."</span>))
         timings#  (<span style="color: #afeeee; font-weight: bold;">doall</span>
                    (<span style="color: #afeeee; font-weight: bold;">for</span> [pass# (<span style="color: #7fffd4;">range</span> ~n)]
                      (<span style="color: #afeeee; font-weight: bold;">let</span> [start#    (System/nanoTime)
                            retr#     ~expr
                            timing#   (<span style="color: #7fffd4;">/</span> (<span style="color: #7fffd4;">double</span> (<span style="color: #7fffd4;">-</span> (System/nanoTime) start#))
                                         1000000.0)]
                        (<span style="color: #afeeee; font-weight: bold;">when</span> retr# (<span style="color: #7fffd4;">println</span> retr#))
                        (System/gc)
                        timing#)))
         runtime#  (<span style="color: #7fffd4;">reduce</span> + timings#)
         highest#  (<span style="color: #7fffd4;">apply</span> max timings#)
         lowest#   (<span style="color: #7fffd4;">apply</span> min timings#)]
     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Total runtime: "</span> runtime#)
     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Highest time : "</span> highest#)
     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Lowest time  : "</span> lowest#)
     (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Average      : "</span> (<span style="color: #7fffd4;">/</span> (<span style="color: #7fffd4;">-</span> runtime# (<span style="color: #7fffd4;">+</span> highest# lowest#))
                                   (<span style="color: #7fffd4;">-</span> (<span style="color: #7fffd4;">count</span> timings#) 2)))
     timings#))
</pre>
<p>The code is effectively divided into 3 sections:</p>
<p style="padding-left: 30px;"><strong>Warm-up:</strong> A while loop runs for at least 1 minute, trapping all output so we don’t see any printing from whatever we’re benchmarking.</p>
<p style="padding-left: 30px;"><strong>Timings</strong>: The actual expression is repeatedly run and the milliseconds spent in each pass are returned</p>
<p style="padding-left: 30px;"><strong>Stats</strong>: Highest/Lowest values are filtered, results are printed</p>
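<p>For readers who don’t speak Clojure, the timings section could look roughly like this in Ruby. This is a sketch of the same idea, not a port of the macro: <em>time_passes</em> is a made-up name, and the garbage collection happens outside the timed region, as it does in the macro.</p>

```ruby
# Hypothetical helper: run the block n times, garbage collecting between
# passes, and return the per-pass timings in milliseconds.
def time_passes(n)
  raise ArgumentError, "n must be greater than 2" unless n > 2
  Array.new(n) do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
    yield
    elapsed_ns = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - start
    GC.start # outside the timed region
    elapsed_ns / 1_000_000.0
  end
end

timings = time_passes(5) { (1..10_000).reduce(:+) }
puts "Highest: #{timings.max} ms, lowest: #{timings.min} ms"
```

<p>The guard on n plays the same role as the macro’s precondition: with 2 or fewer passes there is nothing left to average once the highest and lowest values are stripped.</p>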
<p>Using this to test the Fibonacci code from my previous post then becomes:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">user&gt; </span>(microbench 20
              (<span style="color: #afeeee; font-weight: bold;">let</span> [limit (.pow (BigInteger/TEN) 999)]
                   (<span style="color: #afeeee; font-weight: bold;">loop</span> [a 0 b 1 i 1]
                      (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">&lt;</span> b limit)
                          (<span style="color: #afeeee; font-weight: bold;">recur</span> b (<span style="color: #7fffd4;">+</span> a b) (<span style="color: #7fffd4;">inc</span> i))
                          nil))))
<span style="color: #87cefa;">Warming up!
Benchmarking...
Total runtime:  110.08787899999999
Highest time :  7.98691
Lowest time  :  5.135988
Average      :  5.386943388888889</span>
</pre>
<p>Notice the final ‘nil’ in the code. If I had returned ‘i’ instead it would have printed the result of each pass before printing the final stats — It would not affect the timing. Or if we used the more idiomatic (?) version:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">user&gt; </span><span style="font-weight: bold;">(microbench 20
</span>                  (<span style="color: #afeeee; font-weight: bold;">let</span> [limit (.pow (BigInteger/TEN) 999)]
                       (<span style="color: #7fffd4;">count</span> (<span style="color: #7fffd4;">take-while</span> #(<span style="color: #7fffd4;">&lt;</span> % limit) fib-seq))))
<span style="color: #87cefa;">Warming up!
Benchmarking...
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
Total runtime:  74.73764200000001
Highest time :  5.031017
Lowest time  :  3.168419
Average      :  3.696567</span></pre>
<p>Of course in that case it’s cheating, because the fib-seq is calculated once and then kept in memory.</p>
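<p>The same effect is easy to reproduce in any language with a cache that outlives a single pass. Here is a sketch in Ruby, with made-up names, where a global array plays the role of the top-level fib-seq and a counter shows how much real work each pass does:</p>

```ruby
# Hypothetical illustration: a cache that lives across passes, playing the
# role that the top-level fib-seq plays in the Clojure example.
$fib_cache = [0, 1]
$additions = 0 # counts how much real work each pass performs

def count_fibs_below(limit)
  # Extend the shared cache only as far as needed; later passes reuse it.
  until $fib_cache.last >= limit
    $fib_cache.push($fib_cache[-1] + $fib_cache[-2])
    $additions += 1
  end
  $fib_cache.count { |x| limit > x }
end

limit = 10**999
2.times do |pass|
  before = $additions
  count = count_fibs_below(limit)
  puts "pass #{pass + 1}: count=#{count}, additions=#{$additions - before}"
end
```

<p>Only the first pass performs any additions; the second one just counts what is already in memory, so timing it says very little about the algorithm.</p>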
<p><br class="spacer_" /></p>
<h1>Conclusion</h1>
<p>Benchmarks can be fun, but I think it’s important to <strong>1)</strong> agree on some methodology and <strong>2)</strong> not go nuts over a few milliseconds here and there. Certain algorithms perform better than others, and certain algorithms can be more or less idiomatically expressed in various languages. Ultimately when looking at the results we have to apply some common sense, as we’re not always doing 1:1 comparisons. Using these benchmarks to definitively declare one language superior to the other should not be our goal, nor is it possible.</p>
<p>If you’ve taken a look at Alex Osborne’s <a href="http://meshy.org/2009/12/13/widefinder-2-with-clojure.html" target="_blank">attempt</a> at the WideFinder 2 challenge, you saw how he plowed through <strong>45 Gigabytes</strong> of text in a blazing <strong>8m 4s</strong>, blowing both Scala and Java out of the water. Does that then conclusively state that Clojure is faster than both those languages? Of course not. Clojure compiles to bytecode, just like Scala and Java, so the exact same speed could be obtained by both of them. The point is that concise and elegant Clojure code can be as powerful as code in other languages without the same level of incidental complexity being put on the user, so it makes sense to focus on the quality of the code while keeping half an eye on the performance. If all that mattered was speed, we’d all still be writing ASM.</p>
<p>This is my proposal for a uniform way to benchmark in the future. Let me know what you think; I’m only too happy to accept changes, improvements, implementations in other languages, etc. And JVM outsiders shouldn’t feel left out.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/benchmarking-jvm-languages/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/benchmarking-jvm-languages/</feedburner:origLink></item>
		<item>
		<title>Haskell,Ruby,Scala,Clojure — Tweaked!</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/UjPBxJi5wVM/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/haskellrubyscalaclojure-tweaked/#comments</comments>
		<pubDate>Wed, 24 Feb 2010 20:17:11 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[euler]]></category>
		<category><![CDATA[fibonnacci]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scala]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=1112</guid>
		<description><![CDATA[
			
				
			
		
Last night I did a very small, superficial comparison of 3 ways of getting the first Fibonacci number consisting of 1000 digits. This attracted a lot of attention from Rubyists, Haskaloonies and Clojurians alike; here I’ll share their contributions.



Preface
Microbenchmarking sucks, especially on the JVM. HotSpot is doing such a great job of optimizing code, that sometimes you end [...]]]></description>
			<content:encoded><![CDATA[
<p>Last night I did a very <a href="http://www.bestinclass.dk/index.php/2010/02/haskell-ruby-clojure/" target="_blank">small superficial comparison</a> of 3 ways of getting the first Fibonacci number consisting of 1000 digits. This attracted a lot of attention from Rubyists, Haskaloonies and Clojurians alike; here I’ll share their contributions.</p>
<p><br class="spacer_" /></p>
<p><span id="more-1112"></span></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>Microbenchmarking sucks, especially on the JVM. HotSpot is doing such a great job of optimizing code that sometimes you end up getting an incorrect idea of how your code performs. Nevertheless, I got many tips for improving performance, so it seems the Internet Performance Rule of Thumb is: <strong>Always benchmark the Fibs</strong>!</p>
<p><br class="spacer_" /></p>
<h2>Haskell</h2>
<p>Team Haskell weren’t pleased with completing the job in 7 msecs, so they suggested that I recompile the program using -O2. I used GHC v. 6.10:</p>
<pre class="sh_haskell" name="code">limit = 10^999
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
main = print . length . takeWhile (&lt; limit) $ fibs
</pre>
<p style="text-align: right;"><em> (thanks to: “laulau”) </em></p>
<p>Real time, average of 20: <strong>7 msecs.</strong><br />
 Nothing gained, but 7 msecs is blazing.</p>
<p><br class="spacer_" /></p>
<h2>Ruby</h2>
<p>Ruby had big room for improvement, as the very clever Matrix approach didn’t yield very satisfying performance. Here’s one commenter’s approach:</p>
<pre class="sh_ruby" name="code">limit = 10**999
def fib2(a,b)
 return b,b+a
end
a,b = fib2(0,1)
while b &lt; limit
 a,b = fib2(a,b)
end
puts b
</pre>
<p style="text-align: right;"><em> (thanks to: Aaron) </em></p>
<p>Average on 20 runs: <strong>35 msecs.</strong></p>
<p>Huge improvement and honor is restored to the Ruby camp!</p>
<p><br class="spacer_" /></p>
<h2>Scala</h2>
<p>Camp Scala was good enough to contribute a solution as well:</p>
<pre class="sh_scala" name="code">object Main {
  def main(args: Array[String]) {
    def f(a: BigInt, b: BigInt) = (b, b + a)
    val max = BigInt(10).pow(999)
    var (a, b) = f(0, 1)
    while(b &lt; max) {
      val temp = f(a, b)
      a = temp._1
      b = temp._2
    }
    println("\nb = " + b)
  }
}
</pre>
<p style="text-align: right;"><em> (thanks to: Rahul G, <strong>*updated*</strong>) </em></p>
<p>Average: <strong>18 ms</strong>.</p>
<p><br class="spacer_" /></p>
<h2>Clojure</h2>
<p>Clojure’s loops are recursive and side-effect free, but even so it looks a little clunky:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #7fffd4;">time</span> (<span style="color: #afeeee; font-weight: bold;">let</span> [limit (.pow (BigInteger/TEN) 999)]
        (<span style="color: #afeeee; font-weight: bold;">loop</span> [a 0 b 1 i 1]
          (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">&lt;</span> b limit)
            (<span style="color: #afeeee; font-weight: bold;">recur</span> b (<span style="color: #7fffd4;">+</span> a b) (<span style="color: #7fffd4;">inc</span> i))
            i))))</pre>
<p style="text-align: right;"><em> (thanks to: Duc) </em></p>
<p>Nevertheless, it averages at: <strong>5 msecs</strong>. Hooray! It’s a new record!</p>
<p><br class="spacer_" /></p>
<h1>Conclusion</h1>
<p>If nothing else I think it’s fun to see these challenges approached from different languages, which is also what I like about Project Euler. And I’m certain that this will not be the last time I bring Haskell along for the ride. Last night’s post was meant as small reflections on 3 very different languages, so I was a little surprised by the attention that it got. But it’s positive; I think it’s great when we can all learn a little from one another.</p>
<p>I’m open for ideas on another arena for comparisons. We could do something crazy like solve Global Warming.. oh wait, <a href="http://www.bestinclass.dk/index.php/2010/01/global-warming/">I already did that</a>. How about <a href="http://projecteuler.net/index.php?section=problems&amp;id=200">Euler 200</a> then?</p>
<script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_haskell.js"></script><script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_ruby.js"></script><script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_scala.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/haskellrubyscalaclojure-tweaked/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/haskellrubyscalaclojure-tweaked/</feedburner:origLink></item>
		<item>
		<title>Clojure, Haskell &amp; Ruby Vs Euler 25</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/4xTpA5QEmk8/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/haskell-ruby-clojure/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 18:53:47 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[euler]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[vs]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=1092</guid>
		<description><![CDATA[
			
				
			
		
Haskell is a mature, statically typed, functional language which was recently compared to Ruby in an attempt to solve Euler #25. In this post I’ll share the code, the benchmark and add a Clojure version for those interested.



Preface
Normally it’s bad to start with the disclaimer, but for the sake of all you angry young men [...]]]></description>
			<content:encoded><![CDATA[
<p>Haskell is a mature, statically typed, functional language which was <a href="http://blog.mostof.it/ruby-vs.-haskell-project-euler-25-deathmatch" target="_blank">recently compared</a> to Ruby in an attempt to solve <a href="http://projecteuler.net/index.php?section=problems&amp;id=25" target="_blank">Euler #25</a>. In this post I’ll share the code, the benchmark and add a Clojure version for those interested.</p>
<p><br class="spacer_" /></p>
<p><span id="more-1092"></span></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>Normally it’s bad to start with the disclaimer, but for the sake of all you angry young men with ill tempers I’ll say it anyway: Similar to the <a href="http://www.bestinclass.dk/index.php/2009/12/clojure-vs-ruby-scala-transient-newsgroups/" target="_blank">Clojure Vs Ruby &amp; Scala</a> post, this is in no way an exhaustive comparison between the languages. It’s intended as a superficial glance at a simple challenge, namely <a href="http://projecteuler.net/index.php?section=problems&amp;id=25" target="_blank">Euler #25</a>, all in the name of good clean fun. Haskell is by far too vast, too impressive and too powerful to be fully dealt with in a simple exercise like this, and the Ruby code is still running :)</p>
<h2>Euler #25</h2>
<p>Create a Fibonacci sequence and walk through it until you find the first number which has 1000 digits. The Fibonacci sequence is exceedingly simple. It’s seeded with 0 and 1, and every subsequent value is generated by adding the two values before it: F(n) = F(n-1) + F(n-2).</p>
<blockquote><p>0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,…</p>
<p><br class="spacer_" /></p>
</blockquote>
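<p>As a quick sanity check of the definition, here is a minimal Ruby sketch (the name <em>fib_prefix</em> is made up for this example) that builds the sequence from the two seeds and prints the 20 values listed above:</p>

```ruby
# Build the first n Fibonacci numbers from the seeds 0 and 1.
def fib_prefix(n)
  seq = [0, 1]
  seq.push(seq[-1] + seq[-2]) while n > seq.size
  seq.first(n)
end

puts fib_prefix(20).join(",")
```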
<h1>Solutions</h1>
<p>All 3 versions are compact and powerful — I’m assuming they’re idiomatic, but if I’m wrong let me know. Like I said — If you can produce a better version, please submit it instead of taking this as an invitation to a flamewar :)</p>
<p><br class="spacer_" /></p>
<h2>Ruby</h2>
<p>Ruby goes like this:</p>
<pre class="sh_ruby" name="code">require 'matrix'

limit = 10**999
FIBONACCI_MATRIX = Matrix[[1,1],[1,0]]
def fibonacci(n)
  (FIBONACCI_MATRIX**(n-1)) [0,0]
end
i = 1
i+=1 while fibonacci(i) &lt; limit
puts i
</pre>
<p>The real trick here is building the Fibonacci seq (fib-seq), which is done by applying the matrix form, a clever way to leverage the difference equation behind the sequence: powers of the matrix [[1,1],[1,0]] contain consecutive Fibonacci numbers. According to the author this runs faster than a regular memoized variant.</p>
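<p>To see why the matrix form works without pulling in the Matrix library, here is a hand-rolled sketch in Ruby. It relies on the identity that the (n-1)th power of [[1,1],[1,0]] has F(n) in its top-left corner, and it uses repeated squaring; the function names are made up for this illustration.</p>

```ruby
# Multiply two 2x2 matrices given as [[a, b], [c, d]].
def mul2(x, y)
  [[x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]],
   [x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]]]
end

# Raise [[1,1],[1,0]] to the (n-1)th power by repeated squaring and read
# F(n) from the top-left entry. Assumes n >= 1.
def fib_by_matrix(n)
  result = [[1, 0], [0, 1]] # identity matrix
  base = [[1, 1], [1, 0]]
  power = n - 1
  while power > 0
    result = mul2(result, base) if power.odd?
    base = mul2(base, base)
    power /= 2
  end
  result[0][0]
end

puts((1..10).map { |n| fib_by_matrix(n) }.join(","))
```

<p>A single F(n) is cheap this way, but note that the loop in the post recomputes the matrix power from scratch for every i, which may go some way towards explaining its running time.</p>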
<p><br class="spacer_" /></p>
<h2>Haskell</h2>
<p>Haskell code looks very different from both Ruby and Clojure, but for a simple task like this it reads almost like plain English:</p>
<pre name="code" class="sh_haskell">limit = 10^999
fibonacci_numbers = 0:1:(zipWith (+)
                        fibonacci_numbers (tail fibonacci_numbers))

index = length w where w = takeWhile (&lt; limit) fibonacci_numbers

main = do
  putStrLn(show(index))
</pre>
<p>First the limit is defined as the first number having 1000 digits, and then a sequence is seeded with 0 and 1. When you have a sequence which requires a constant look-behind of n values, it can be represented as a recursive set of vectors. I call that a rule of thumb, <a href="http://clj-me.cgrand.net" target="_blank">Christophe</a> says it’s a <em><strong>theorem</strong></em>. One thing which is really great about the Haskell code is the automatic currying which enables the author to do the elegant w = takeWhile (&lt; limit). Although this example is not flaunting it, Haskell is very different from Clojure in that it’s statically typed, and there have been many discussions back and forth on which to prefer, dynamic or static typing. The truth is, no one answer fits all, but there’s a simple test:</p>
<p><br class="spacer_" /></p>
<p style="text-align: center;"><img title="Static Typing" src="http://www.bestinclass.dk/wp-content/uploads/2009/ducktyping.jpeg" alt="Static Typing" width="477" height="396" /></p>
<p><br class="spacer_" /></p>
<p>If the arrow and the explanation actually helped you, then static typing is for you.</p>
<p><br class="spacer_" /></p>
<h2>Clojure</h2>
<p>To produce the fib-seq, we can mimic Haskell almost point for point, but where Haskell has direct support for sequences in the syntax, I need to call a few functions:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">fib-seq</span> (<span style="color: #7fffd4;">lazy-cat</span> [0 1] (<span style="color: #7fffd4;">map</span> + fib-seq (<span style="color: #7fffd4;">rest</span> fib-seq))))

(<span style="color: #afeeee; font-weight: bold;">let</span> [limit (.pow (BigInteger/TEN) 999)]
           (<span style="color: #7fffd4;">count</span> (<span style="color: #7fffd4;">take-while</span> #(<span style="color: #7fffd4;">&lt;</span> % limit) fib-seq)))</pre>
<p>The noticeable exterior differences are the call to lazy-cat and the anonymous function, which all readers of this blog should be able to read without any explanation. Under the hood, Haskell is fully lazy where Clojure is now chunky-lazy. Chunked seqs mean that I work with the sequences as if they were fully lazy, but under the hood Clojure is realizing the sequence one chunk at a time. If I take 50 values, 64 may in fact be calculated (two chunks of 32). This is for improved performance without wrecking the use of infinite sequences. According to one of the authors of <a href="http://joyofclojure.com/buy" target="_blank">this book</a> (which looks extremely promising btw, so go look), Rich Hickey has provided an example which eliminates the need for chunked seqs: <a href="http://gist.github.com/312649" target="_blank">here</a>. That is idiomatic Clojure running faster than our old seq implementation written primarily in Java.</p>
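<p>Chunking is easy to observe from the REPL. The sketch below counts how many elements map touches when only the first one is requested. The names realized and counting-inc are made up for the illustration, and the exact chunk size is an implementation detail (32 on the Clojure versions I know of), so treat the count as indicative:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; Hypothetical helper names; only first, map, range and atoms come from clojure.core.
(def realized (atom 0))

(defn counting-inc [x]
  (swap! realized inc)   ; record that this element was actually computed
  (inc x))

(first (map counting-inc (range 100)))
;; =&gt; 1

@realized
;; typically a whole chunk rather than 1, because range hands out chunked seqs
</pre>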
<p><br class="spacer_" /></p>
<h1>Results</h1>
<p>So the usual indicators for expressiveness and performance are LOC and runtime speed, so here we go:</p>
<div style="margin: auto auto;">
<table style="border: 1px solid black; margin: auto auto;">
<tbody>
<tr>
<td><strong>Language</strong></td>
<td><strong>LOC</strong></td>
<td><strong>Runtime</strong></td>
</tr>
<tr>
<td style="width: 150px;">Clojure</td>
<td style="text-align: center;"><span style="color: #ff0000;"><strong>3</strong></span></td>
<td style="text-align: center;">0.021s</td>
</tr>
<tr>
<td>Haskell</td>
<td style="text-align: center;">6</td>
<td style="text-align: center;"><span style="color: #ff0000;"><strong><span style="color: #ff0000;">0.007s</span></strong></span></td>
</tr>
<tr>
<td>Ruby</td>
<td style="text-align: center;">9</td>
<td style="text-align: center;">7.1s</td>
</tr>
</tbody>
</table>
</div>
<p><br class="spacer_" /></p>
<p>Last time I benchmarked Ruby it also came in… a little late, and some commentators suggested trying out other Ruby compilers. The truth is, however, that small benchmarks like this really show <em>very little</em>, and if you are working in a performance-critical zone, you’ll want to test something which mimics the actual project and the actual environment. Secondly, neither the Haskell nor the Clojure version is optimized for performance, which accounts for some of the extra Ruby code.</p>
<p>The lesson we can take away, is that Haskell and Clojure are blazing, and Haskell is blazing to the point where I’m not finding words good enough to describe it.</p>
<script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_ruby.js"></script><script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_haskell.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/haskell-ruby-clojure/feed/</wfw:commentRss>
		<slash:comments>37</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/haskell-ruby-clojure/</feedburner:origLink></item>
		<item>
		<title>My tribute to Steve Ballmer</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/FN_hG7Xd9c4/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/my-tribute-to-steve-ballmer/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 21:54:00 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[ballmer]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[macros]]></category>
		<category><![CDATA[tribute]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=1075</guid>
		<description><![CDATA[
			
				
			
		
These days Microsoft is often being hammered in both the news and in Open Source communities across the globe, so on behalf of the Clojure community I would like to submit a small tribute to the man at the wheel, Steve Ballmer.




Preface
Microsoft makes good money, but they are going through tough times. So to make [...]]]></description>
			<content:encoded><![CDATA[
<p>These days Microsoft is often being hammered in both the news and in Open Source communities across the globe, so on behalf of the Clojure community I would like to submit a small tribute to the man at the wheel, Steve Ballmer.</p>
<p><br class="spacer_" /></p>
<p><span id="more-1075"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>Microsoft makes good money, but they are going through tough times. So to make life a little happier at Camp Microsoft, I’ve decided to write a little Image to Ascii Art converter, demonstrating, among other things, the power of macros. I don’t think I’ll get any arguments that when you conjoin clojure and ascii-art, you get claskii art, so I’ve named this project appropriately.</p>
<p><br class="spacer_" /></p>
<h1>Processing an image</h1>
<p>To convert an image to ascii we need to look at it as a bunch of colored pixels, converting them to characters one at a time. Disregarding the details of which image-holder we will use, we always find ourselves doing some tedious java-interop when working with Java Classes. Here’s a quick example of how to get the brightest color component from a pixel:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">get-pixel</span> [image x y]
  <span style="color: #afeeee; font-weight: bold;"></span>; .getRGB returns a packed int, so wrap it in a java.awt.Color first
  (<span style="color: #afeeee; font-weight: bold;">let</span> [color (Color. (.getRGB image x y))
        red   (.getRed color)
        green (.getGreen color)
        blue  (.getBlue color)]
    (<span style="color: #7fffd4;">apply</span> max [red green blue])))
</pre>
<p>Very simple right? Yes, and very boring. What I would like to be able to do, is just get at the fields more directly, like:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(def image (ImageIO/read (File. "steve-ballmer.jpg")))</span>
#'image
<span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(get-properties (Color. (.getRGB image 10 10)) .getRed .getBlue)</span>
[8 60]</pre>
<p>Which of course isn’t possible with a plain function, because as soon as .getRed is evaluated I’ll get a <em><strong>Symbol not defined error</strong></em>. Enter Macros! A recommended first step when writing macros is:</p>
<ul>
<li><strong>Don’t</strong>. If you can manage to solve the job without macros, you’re better off that way. Since macros generate code at compile time, they’re more complex than regular code and thus more error-prone.</li>
</ul>
<p>The second piece of advice is:</p>
<ul>
<li>Write out the code that you want the macro to emit, before writing the macro</li>
</ul>
<p>In our case we want get-properties to expand into something like</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #ccffcc;">get-properties</span> object .getRed .getBlue)
&gt;&gt;&gt; [(.getRed object) (.getBlue object)]</pre>
<p>So to make this happen we need to walk through our method-list (ie. .getRed .getBlue) and return a sequence which results from applying the methods to the first argument, the object:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">get-properties</span> [obj &amp; properties]
  (<span style="color: #afeeee; font-weight: bold;">for</span> [property properties]
    (property obj)))
</pre>
<p>Because Lisp is homoiconic we treat code exactly like data, in that (.getRed obj) is nothing more than a list whose first item is .getRed and whose second is obj. To see what our macro expands to, call</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(macroexpand-1 `(get-properties (Color. (.getRGB image 10 10)) .getRed .getBlue))</span>
(nil nil)</pre>
<p><em>Not</em> what we wanted! The reason is that we haven’t taken control of evaluation, using the <a href="http://clojure.org/reader" target="_blank">macro-characters</a>. I recommend you check them all out before proceeding, as these puppies give you a lot of power, but also make macro-definitions a little hard on the eyes. The backquote stops evaluation, while allowing us to prefix items with a tilde ~ to evaluate them, or with ~@ to evaluate and splice them in: ~@(list 1 2 3) =&gt; 1 2 3.</p>
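<p>A quick REPL sketch of those three characters in isolation, using a throwaway var xs that is not part of claskii:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(def xs [1 2 3])   ; throwaway data for the illustration

`(list ~xs)
;; =&gt; (clojure.core/list [1 2 3])     ~ evaluates xs and inserts the vector

`(list ~@xs)
;; =&gt; (clojure.core/list 1 2 3)       ~@ evaluates xs and splices its items in

(= `(list xs) (list 'clojure.core/list (symbol (str (ns-name *ns*)) "xs")))
;; =&gt; true                            without ~ the symbol is only namespace-qualified
</pre>
<p>With that in hand, back to the macro:</p>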
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">get-properties</span> [obj &amp; properties]
  `(<span style="color: #7fffd4;">vector</span>
    ~@(<span style="color: #afeeee; font-weight: bold;">for</span> [property properties]
        (property obj))))
</pre>
<p>Unfortunately when you check that using macroexpand-1, you’ll see the same nils as above. The reason is that we’re passing .getRed as a symbol, and symbols are similar to :keywords in that they are functions which look themselves up in their argument. When you call the symbol on the object form you get nil in return. But why is the symbol being evaluated?</p>
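<p>The symbol-as-function behaviour can be seen on its own at the REPL. This is only an illustration with quoted data, not code from claskii:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; A symbol called as a function looks itself up in its argument.
('a {'a 1 'b 2})
;; =&gt; 1

;; Inside the macro, obj is the unevaluated list (Color. ...), and a list
;; is not something a symbol can look itself up in, hence nil.
('.getRed '(Color. 255))
;; =&gt; nil
</pre>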
<p>The splicing ~@ forces evaluation within its body, so we need to add an extra backquote:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">get-properties</span> [obj &amp; properties]
  `(<span style="color: #7fffd4;">vector</span>
    ~@(<span style="color: #afeeee; font-weight: bold;">for</span> [property properties]
        `(property obj))))
</pre>
<p>Then check it:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(macroexpand-1 `(get-properties (Color. (.getRGB image 10 10)) .getRed .getBlue))</span>
(clojure.core/vector (claskii/property claskii/obj) (claskii/property claskii/obj))
</pre>
<p>Now that’s more like it! But calling it will throw an error of course, since property is not defined anywhere. Instead of taking property literally, we want the value it is bound to in the for-loop, so it has to be unquoted with a tilde, and the same goes for obj. Although direct evaluation would work in most cases, you always have to be on your toes when using macros, as one day a user will submit an argument which clashes with one of your names. The solution to that is (gensym), which generates a unique name for your variables. Gensym can either be called directly or simply via the syntactic sugar #:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">get-properties</span> [obj &amp; properties]
  `(<span style="color: #7fffd4;">vector</span>
    ~@(<span style="color: #afeeee; font-weight: bold;">for</span> [property# properties]
        `(~property# ~obj))))
</pre>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(macroexpand-1 `(get-properties (Color. (.getRGB image 10 10)) .getRed .getBlue))</span>
(clojure.core/vector
    (.getRed (java.awt.Color. (.getRGB claskii/image 10 10)))
    (.getBlue (java.awt.Color. (.getRGB claskii/image 10 10))))
<span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(get-properties (Color. (.getRGB image 10 10)) .getRed .getBlue)</span>
[8 60]
</pre>
<p>Nice! It expands like we want and it gets us the result we want! There are a couple of issues which are begging to be fixed. Firstly, the object is being created once for each method argument. This is because I’m not actually passing an object but rather the code which constructs an object, and the macro pastes that code into every call. Secondly, binding the object once with the auto-gensym sugar, as in `(let [obj# ~obj] ...), doesn’t work here: the for-loop body sits inside its own backquote, and each backquote generates a different symbol for obj#. So the ugly, stable and fully functioning version, using a manual gensym, comes out looking like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">get-properties</span> [obj &amp; properties]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [target (<span style="color: #7fffd4;">gensym</span>)]
    `(<span style="color: #afeeee; font-weight: bold;">let</span> [~target ~obj]
       (<span style="color: #7fffd4;">vector</span> ~@(<span style="color: #afeeee; font-weight: bold;">for</span> [property properties]
                   `(~property ~target))))))
</pre>
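<p>To check that the final version really evaluates its object only once, here is a small sketch that repeats the macro and calls it on a plain string instead of a Color. The evaluations counter is a made-up name for the experiment:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(defmacro get-properties [obj &amp; properties]
  (let [target (gensym)]
    `(let [~target ~obj]
       (vector ~@(for [property properties]
                   `(~property ~target))))))

;; Works on any object, here a java.lang.String.
(get-properties "claskii" .length .isEmpty)
;; =&gt; [7 false]

;; The object expression runs once, however many methods are listed.
(def evaluations (atom 0))
(get-properties (do (swap! evaluations inc) "claskii") .length .isEmpty .toUpperCase)
@evaluations
;; =&gt; 1

;; Each backquote gets its own auto-gensym, which is why obj# cannot be
;; shared between the outer let and the inner backquoted form.
(= `x# `x#)
;; =&gt; false
</pre>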
<p><br class="spacer_" /></p>
<h1>Scared yet?</h1>
<p>I realize and admit that macro definitions look awful because we need all of our syntactic weaponry rolled out in order to get the expansion we want. On the other hand, if you can look past the superficial, Lisp’s homoiconicity lets us treat code exactly as data, which is just an incredible tool to be sitting with, as you can freely extend the very language itself by relatively simple means!</p>
<p>Well, to calm the nerves again, look at how easy java-interop has become. We have gone from:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">old-school</span> [image]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [w     (.getWidth image)
        h     (.getHeight image)
        color (Color. (.getRGB image 10 10)) ; the color getters live on Color, not on the image
        r     (.getRed color)
        g     (.getGreen color)
        b     (.getBlue color)]))
</pre>
<p>To:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">new-school</span> [image]
     (<span style="color: #afeeee; font-weight: bold;">let</span> [[w h]   (get-properties image .getWidth .getHeight)
           [r g b] (get-properties (Color. (.getRGB image 10 10)) .getRed .getGreen .getBlue)]))
</pre>
<p><br class="spacer_" /></p>
<h1>Ascii-Art</h1>
<p><br class="spacer_" /></p>
<p>So making the ascii-art itself should be very simple. This is my strategy:</p>
<ol>
<li>Define a list of ascii characters of descending density</li>
<li>Scale the image down to ascii-output-size</li>
<li>Look at every pixel</li>
<li>Examine the brightest color component</li>
<li>Scale that value to the number of characters available and pick the appropriate one</li>
<li>Output result</li>
</ol>
<p>So begin by defining a list which you think looks good:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">ascii-chars</span> [\# \A \@ \% \$ \+ \= \* \: \, \. \space])</pre>
<p>Then pick out the peak value and convert it to an index in the above data, by simple division:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">ascii</span> [img x y color?]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [[red green blue] (get-properties ( Color. (.getRGB img x y))
                                         .getRed .getGreen .getBlue)
        peak    (<span style="color: #7fffd4;">apply</span> max [red green blue])
        idx     (<span style="color: #afeeee; font-weight: bold;">if</span> (zero? peak)
                  (<span style="color: #7fffd4;">dec</span> (<span style="color: #7fffd4;">count</span> ascii-chars))
                  (<span style="color: #7fffd4;">dec</span> (<span style="color: #7fffd4;">int</span> (<span style="color: #7fffd4;">+</span> 1/2 (<span style="color: #7fffd4;">*</span> (<span style="color: #7fffd4;">count</span> ascii-chars) (<span style="color: #7fffd4;">/</span> peak 255))))))
        output  (<span style="color: #7fffd4;">nth</span> ascii-chars (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">pos?</span> idx) idx 0)) ]
</pre>
<p>…And depending on the output-type selected by the user, return either that character or an html-version:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">    (<span style="color: #afeeee; font-weight: bold;">if</span> color?
      (html [<span style="color: #7fffd4;">:span</span> {<span style="color: #7fffd4;">:style</span> (<span style="color: #7fffd4;">format</span> <span style="color: #87cefa;">"color: rgb(%s,%s,%s);"</span> red green blue)} output])
      output)))
</pre>
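<p>To make the index arithmetic concrete, the sketch below pulls it out of ascii into a stand-alone function and runs a few peak values through it. The name peak-index is hypothetical and the formula is copied unchanged from the code above:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(def ascii-chars [\# \A \@ \% \$ \+ \= \* \: \, \. \space])

(defn peak-index
  "Hypothetical helper: the index arithmetic from ascii, on its own."
  [peak]
  (let [idx (if (zero? peak)
              (dec (count ascii-chars))
              (dec (int (+ 1/2 (* (count ascii-chars) (/ peak 255))))))]
    (if (pos? idx) idx 0)))   ; clamp negative indices to the densest character

(map peak-index [10 128 255])
;; =&gt; (0 5 11)

(map (fn [peak] (nth ascii-chars (peak-index peak))) [10 128 255])
;; =&gt; (\# \+ \space)
</pre>
<p>For a peak of 128 the steps are 12 * 128/255, which is roughly 6.02, plus 1/2, truncated to 6, minus 1, giving index 5. Note that a peak of exactly 0 takes the special branch and lands on the last character.</p>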
<p><br class="spacer_" /></p>
<h1>Putting it together</h1>
<p>Now that we have a way of processing each pixel, we just need to walk them all:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">convert-image</span> [uri w color?]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [raw-image   (scale-image uri w)
        ascii-image (<span style="color: #afeeee; font-weight: bold;">-&gt;&gt;</span> (<span style="color: #afeeee; font-weight: bold;">for</span> [y (<span style="color: #7fffd4;">range</span> (.getHeight raw-image))
                               x (<span style="color: #7fffd4;">range</span> (.getWidth  raw-image))]
                           (ascii raw-image x y color?))
                         (<span style="color: #7fffd4;">partition</span> w))
</pre>
<p>So that walks every X for every Y, returning the ascii-representation of each pixel. When the for-loop completes you’re sitting with a 1D stream of characters representing the image. In order to distinguish lines you need to partition the sequence, chopping it up every width-number-of-characters. Now we’re sitting with a sequence of lines, which we can properly format:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">        output      (<span style="color: #afeeee; font-weight: bold;">-&gt;&gt;</span> ascii-image
                         (<span style="color: #7fffd4;">interpose</span> (<span style="color: #afeeee; font-weight: bold;">if</span> color? <span style="color: #87cefa;">"&lt;BR/&gt;"</span> \newline))
                         flatten)]
</pre>
<p>The only difference between the html version and ascii at this point is how to separate the lines. Once that’s done we can flatten the sequence for safe printing.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">    (<span style="color: #afeeee; font-weight: bold;">if</span> color?
      (html [<span style="color: #7fffd4;">:pre</span> {<span style="color: #7fffd4;">:style</span> <span style="color: #87cefa;">"font-size:5pt; letter-spacing:1px;
                           line-height:4pt; font-weight:bold;"</span>}
             output])
      (<span style="color: #7fffd4;">println</span> output))))</pre>
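<p>The partition, interpose and flatten pipeline is easier to follow on a tiny input. The sketch below uses a made-up 3x2 "image" that has already been converted to characters. It assumes flatten is available in clojure.core (Clojure 1.2 or later), and the final (apply str ...) is an addition for the illustration that is not in convert-image:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(def pixels [\# \# \. \. \# \#])   ; hypothetical data: 2 rows of 3 characters

(-&gt;&gt; pixels
     (partition 3)            ; ((\# \# \.) (\. \# \#))
     (interpose \newline)     ; ((\# \# \.) \newline (\. \# \#))
     flatten                  ; (\# \# \. \newline \. \# \#)
     (apply str))
;; =&gt; "##.\n.##"
</pre>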
<p>Adjust the html-settings to your liking — I’m no Ascii artist so consider this a raw prototype which more creative people can improve upon should they want to. Anyway, now’s the time to test.</p>
<p><br class="spacer_" /></p>
<h1>Getting Steve Ballmer</h1>
<p>First I hit Google to get an image of the guy:</p>
<p><br class="spacer_" /></p>
<div class="wp-caption aligncenter" style="width: 310px"><img title="Steve Ballmer" src="http://www.bestinclass.dk/wp-content/uploads/claskii/steve.jpg" alt="Steve Ballmer" width="300" height="420" /><p class="wp-caption-text">Steve Ballmer</p></div>
<p><br class="spacer_" /></p>
<p>Second, let’s try and run it directly from the REPL:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span>(convert-image "/home/lau/Desktop/steve.jpg" 50 nil)</pre>
<p><img class="aligncenter" title="ASCII REPL" src="http://www.bestinclass.dk/wp-content/uploads/claskii/ascii-repl.png" alt="ASCII REPL" width="638" height="877" /></p>
<p>Since that’s cooked down to only 50 characters it doesn’t really do the guy justice, so let’s try the HTML renderer:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">claskii&gt; </span><span style="font-weight: bold;">(spit "h.html" (convert-image "steve.jpg" 120 true))</span></pre>
<p><img class="aligncenter" title="ASCII HTML" src="http://www.bestinclass.dk/wp-content/uploads/claskii/ascii-html.png" alt="ASCII HTML" width="658" height="969" /></p>
<p>That’s more like it, although I’ll admit that the A’s look bad.</p>
<p><br class="spacer_" /></p>
<h1>Conclusion</h1>
<p>You’ve seen how macros are functions that control evaluation and output code. Macros are both fun and tricky, so use them carefully. The entire program weighs in at 55 lines and I’ve put it on Github: <a href="http://github.com/LauJensen/Claskii" target="_blank">here</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/my-tribute-to-steve-ballmer/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/my-tribute-to-steve-ballmer/</feedburner:origLink></item>
		<item>
		<title>Reddit Clone — Now accepting registrations!</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/60BDGNUtIVg/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-with-user-registration/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 19:30:22 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clone]]></category>
		<category><![CDATA[compojure]]></category>
		<category><![CDATA[reddit]]></category>
		<category><![CDATA[sessions]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=1033</guid>
		<description><![CDATA[
			
				
			
		
About 1 week ago I wrote a small Reddit Clone in about 90 lines of Clojure. The amount of interest and feedback was unexpectedly high so I’ve decided to extend the example to a whopping 160 lines as well as echo some of the chatter.



Preface
My original Reddit Clone was a reaction to a Common Lisp program [...]]]></description>
			<content:encoded><![CDATA[
<p>About 1 week ago I wrote a small <a href="http://www.bestinclass.dk/index.php/2010/02/reddit-clone-in-10-minutes-and-91-lines-of-clojure/" target="_blank">Reddit Clone</a> in about 90 lines of Clojure. The amount of interest and feedback was unexpectedly high so I’ve decided to extend the example to a whopping 160 lines as well as echo some of the chatter.</p>
<p><br class="spacer_" /></p>
<p><span id="more-1033"></span></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>My original <a href="http://www.bestinclass.dk/index.php/2010/02/reddit-clone-in-10-minutes-and-91-lines-of-clojure/" target="_blank">Reddit Clone</a> was a reaction to a Common Lisp program which did very much the same thing. The Clojure version added some extra deployment goodness in that it compiled to a cross-platform jar which would launch on any OS that supports Java. Following my blogpost, many more or less interesting versions popped up across the web. These caught my eye:</p>
<p><br class="spacer_" /></p>
<h3>Reddit Clone in 61 minutes and 97 lines of PHP</h3>
<p><a href="http://blargh.tommymontgomery.com/2010/02/reddit-in-61-minutes-and-97-lines-of-php/" target="_blank">This guy</a> actually sat down and wrote out a Reddit Clone in PHP, which I thought was amazing. If PHP could do the same as Clojure in only 6 extra lines I would be greatly impressed. However, it lacked almost every feature except rendering links.</p>
<h3>Reddit Clone in 30 minutes and 4 lines of Perl</h3>
<p>Wow — Half the time of PHP and only 4 lines! See the code here: <a href="http://gist.github.com/299579" target="_blank">Gist</a></p>
<p>Unfortunately it also lacks every single feature of the PHP version and when you unpack it, it weighs in at about 80 lines which you can read: <a href="http://gist.github.com/299580" target="_blank">Here</a></p>
<h3>Reddit Clone in QBasic!</h3>
<p>This is by far one of the funniest contributions to this Reddit Clone War — QBasic emitted via CGI-BIN. Despite weighing in at 250 lines, you will probably enjoy the read: <a href="http://www.ryanbroomfield.com/projects/old-stuff/qbasic/cgi-bin-reddit-clone" target="_blank">here</a></p>
<h3>Echoes</h3>
<p>The reason I mention these is that I think it’s a lot of fun to see a project like this performed in various languages, but if we’re to learn something from each other we should aim to offer the same features. So while you guys are catching up, I’ll go ahead and implement user management, as in login/logout, registration etc. :) <strong><em>ps:</em><span style="font-weight: normal;"> Can we see some </span><span style="font-weight: normal;"><em>Scala</em></span><span style="font-weight: normal;"> and </span><span style="font-weight: normal;"><em>Haskell</em></span><span style="font-weight: normal;"> versions soon?</span></strong></p>
<p><br class="spacer_" /></p>
<h1>Back to work</h1>
<p>Since we’re quickly adding pages we should consider extracting as much code as possible into wrapper functions. As the code-base itself increases it would also make a lot of sense to separate logic from views into their own files and namespaces, but because this is purely for demonstrative purposes of Clojure/Compojure I’ll stick to a single file.</p>
<p><br class="spacer_" /></p>
<h1>Storing Data</h1>
<p>We can store our data in any way shape or form we want, but I’ll stick with the approach of my last post, ie. keeping all the data in memory. Because we run the server (unlike something like PHP) data, functions, threads etc live between the requests.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">data</span>  (<span style="color: #7fffd4;">ref</span> {<span style="color: #87cefa;">"http://www.bestinclass.dk"</span> {<span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"Best in Class"</span> <span style="color: #7fffd4;">:points</span> 1 <span style="color: #7fffd4;">:date</span> (DateTime.) <span style="color: #7fffd4;">:poster</span> <span style="color: #87cefa;">"LauJensen"</span>}}))
(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">users</span> (<span style="color: #7fffd4;">ref</span> {<span style="color: #87cefa;">"lau.jensen@bestinclass.dk"</span> {<span style="color: #7fffd4;">:username</span> <span style="color: #87cefa;">"LauJensen"</span> <span style="color: #7fffd4;">:password</span> <span style="color: #87cefa;">"way2secret"</span>}}))
(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">online-users</span> (<span style="color: #7fffd4;">ref</span> {}))
</pre>
<p>The data struct is similar to what I used in the first post, but I’ve added a new keyword :poster to keep track of who is posting what. The users list would normally be put in a database as it simply contains all registered users. The final one is a hash-map of all logged-in users: instead of storing the login information in cookies with the client, we’ll keep tabs on who is online via the Jetty server.</p>
<p>Why refs? Well, with the current complexity (or lack of it), we might actually be able to pull through using just atoms, but in the near future we might need coordinated change, hence the STM. Whichever we pick, Clojure’s language-level concurrency support makes it a breeze to handle concurrent users.</p>
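<p>To make “coordinated change” a bit more concrete, here is a minimal sketch, assuming two throwaway refs (demo-users and demo-online are hypothetical names, chosen so they don’t clash with the vars above). Both alters happen inside one transaction, so no other thread can observe the user as registered but not yet online. With two atoms, each swap! would be an independent step.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; Hypothetical refs, for illustration only
(def demo-users  (ref {}))
(def demo-online (ref {}))

(defn register-and-login!
  "Adds the user and marks the session as online in one transaction."
  [session-id email user]
  (dosync
   (alter demo-users  assoc email user)
   (alter demo-online assoc session-id user)))

(register-and-login! "abc123" "jane@example.com" {:username "Jane"})

@demo-users   ;; {"jane@example.com" {:username "Jane"}}
@demo-online  ;; {"abc123" {:username "Jane"}}
</pre>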
<p>For those of you who read the <a href="http://www.bestinclass.dk/index.php/2009/12/beating-the-arc-challenge-in-clojure/" target="_blank">Beating The Arc Challenge</a> post, you might be wondering why I’m tracking the session in my own data structure instead of using the built-in session/assoc/dissoc functions of Compojure. The simple reason is that I might as well show off both approaches.</p>
<p><br class="spacer_" /></p>
<h1>Sessions</h1>
<p>In order to get to Jetty’s session information, we need to activate Compojure’s session middleware, so our old main function becomes:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">-main</span> [&amp; args]
  (run-server {<span style="color: #7fffd4;">:port</span> 8080} <span style="color: #87cefa;">"/*"</span> (<span style="color: #afeeee; font-weight: bold;">-&gt;&gt;</span> reddit with-session servlet)))</pre>
<p>Our routes are passed to the middleware and the modified routes are then passed to servlet. Now all of the routes can make use of the ‘session’ variable, which contains the Jetty ID: a unique ID which we can use to track users across the site.</p>
<p><br class="spacer_" /></p>
<h1>Logging In/Out</h1>
<p>Logging in is simple: we see if the supplied email is live in the system; if it is, we check whether the password is a match, and if that’s the case we associate the user details from <strong><em>users</em></strong> with the Jetty ID:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">login-user</span> [session [email password]]
  (redirect-to
   (<span style="color: #afeeee; font-weight: bold;">if-let</span> [user (@users email)]
     (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">=</span> password (<span style="color: #7fffd4;">:password</span> user))
       (<span style="color: #afeeee; font-weight: bold;">dosync</span>
        (<span style="color: #7fffd4;">alter</span> online-users assoc (<span style="color: #7fffd4;">:id</span> session) user)
        <span style="color: #87cefa;">"/"</span>)
       <span style="color: #87cefa;">"/login/?msg=Bad username/password combo"</span>)
    <span style="color: #87cefa;">"/login/?msg=User does not exist"</span>)))
</pre>
<p>Again the hash-maps make access very simple, so to log out we simply need to disassociate that Jetty ID from the online-users list. The argument might look a little weird to you, but that’s a clever way to destructure the session variable: any hash-map can be broken down into <strong>named keys</strong> by writing {:keys [k1 k2 k3]} in place of the argument, like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">logout-user</span> [{<span style="color: #7fffd4;">:keys</span> [id]}]
  (<span style="color: #afeeee; font-weight: bold;">dosync</span>
   (<span style="color: #7fffd4;">alter</span> online-users dissoc id))
  (redirect-to <span style="color: #87cefa;">"/"</span>))
</pre>
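<p>If :keys destructuring is new to you, here is a small sketch of what it does on its own. The session map and the session-summary function are made up for illustration; a real session map will carry other keys.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; A hypothetical session map, for illustration only
(def example-session {:id "abc123" :created 1265313600})

(defn session-summary [{:keys [id created missing]}]
  ;; id and created are bound to (:id m) and (:created m);
  ;; a key that is absent from the map is simply bound to nil
  [id created missing])

(session-summary example-session)  ;; ["abc123" 1265313600 nil]
</pre>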
<p>In a very short space we’ve now written our own backend session handling, so all that’s left is giving users a way into the system. First I would like to avoid always writing out the same head/css/js includes, so we’ll make a wrapper. Basically all it has to do is write out a head tag with a user-supplied title, and then add a login link if the user isn’t logged in. If the user is logged in, we want to see the username instead:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">with-head</span> [session title &amp; body]
  (html
   [<span style="color: #7fffd4;">:head</span>
    [<span style="color: #7fffd4;">:title</span> title]
    (include-css <span style="color: #87cefa;">"/styles/reddit.css"</span>)]
   [<span style="color: #7fffd4;">:body</span>
    (<span style="color: #afeeee; font-weight: bold;">if-let</span> [user (@online-users (<span style="color: #7fffd4;">:id</span> session))]
      [<span style="color: #7fffd4;">:div#user</span> (<span style="color: #7fffd4;">:username</span> user) (link-to <span style="color: #87cefa;">"/logout/"</span> <span style="color: #87cefa;">"(Log out)"</span>)]
      [<span style="color: #7fffd4;">:div#user</span> (link-to <span style="color: #87cefa;">"/login/"</span> <span style="color: #87cefa;">"(Log in)"</span>)])
    body]))
</pre>
<p>So with that out of the way, the login form is simply:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">login-form</span> [session msg]
  (with-head session <span style="color: #87cefa;">"Reddit.Clojure - Login screen"</span>
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Login"</span>]
    (<span style="color: #afeeee; font-weight: bold;">when</span> msg [<span style="color: #7fffd4;">:h4</span> msg])
    (form-to [<span style="color: #7fffd4;">:post</span> <span style="color: #87cefa;">"/login/"</span>]
             [<span style="color: #7fffd4;">:table</span>
              [<span style="color: #7fffd4;">:tr</span> [<span style="color: #7fffd4;">:td</span> <span style="color: #87cefa;">"email"</span>]   [<span style="color: #7fffd4;">:td</span> (text-field <span style="color: #87cefa;">"email"</span>) ]]
              [<span style="color: #7fffd4;">:tr</span> [<span style="color: #7fffd4;">:td</span> <span style="color: #87cefa;">"password"</span>][<span style="color: #7fffd4;">:td</span> (password-field <span style="color: #87cefa;">"psw"</span>) ]]]
             (submit-button <span style="color: #87cefa;">"Login"</span>))))
</pre>
<p>Giving you:</p>
<p style="text-align: center;"><img class="aligncenter" style="border: 0;" title="Login Form" src="http://www.bestinclass.dk/wp-content/uploads/rc/login-form.png" alt="Login Form" width="465" height="320" /></p>
<p><br class="spacer_" /></p>
<p>When you hit the submit button, this will POST to the backend functions above, resulting in either an error message or the front-page:</p>
<p style="text-align: center;"><img class="aligncenter" style="border: 0;" title="Front page" src="http://www.bestinclass.dk/wp-content/uploads/rc/frontpage.png" alt="Front page" width="573" height="416" /></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Registration</h1>
<p>So now users who are in the system can authenticate and submit links which are joined to their usernames. The logical next step is to allow newcomers to register. I would really enjoy doing some kind of fusion between Captcha and an IQ test, but that’s probably best left for a separate blogpost.</p>
<p>Starting with the backend, we need some kind of fall-through input validation like you saw in the last post, and I’ll leave it up to you and your imagination to put stuff in there that makes sense. If all the input is good, we need to check whether the email is already live in the system and, if so, report an error. If it’s not live, a user should be created both in <strong><em>users</em></strong> and in <strong><em>online-users</em></strong>, joining the latter to the Jetty ID:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">add-user</span> [session-id [email user password]]
  (redirect-to
   (<span style="color: #afeeee; font-weight: bold;">cond</span>
    (invalid-email? email) <span style="color: #87cefa;">"/register/?msg=Invalid email"</span>
    <span style="color: #7fffd4;">:else</span>
    (<span style="color: #afeeee; font-weight: bold;">dosync</span>
     (<span style="color: #afeeee; font-weight: bold;">if</span> (@users email)
       <span style="color: #87cefa;">"/register/?msg=Email already registered"</span>
       (<span style="color: #afeeee; font-weight: bold;">do</span>
         (<span style="color: #7fffd4;">alter</span> users assoc email {<span style="color: #7fffd4;">:username</span> user <span style="color: #7fffd4;">:password</span> password})
         (<span style="color: #7fffd4;">alter</span> online-users assoc session-id (@users email))
         <span style="color: #87cefa;">"/"</span>))))))
</pre>
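<p>The code above calls invalid-email?, which I left to your imagination. One possible stand-in, purely as an assumption on my part and deliberately loose (it only checks for something shaped like name@host.tld, nothing close to a full RFC check), could look like this:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(defn invalid-email?
  "Hypothetical helper: true unless email looks roughly like name@host.tld."
  [email]
  (not (and (string? email)
            ;; re-matches must match the whole string, or it returns nil
            (re-matches #"[^@\s]+@[^@\s]+\.[^@\s]+" email))))

(invalid-email? "lau.jensen@bestinclass.dk")  ;; false
(invalid-email? "not-an-email")               ;; true
</pre>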
<p>With the backend ready for customers, we just need a simple front-end registration form. This isn’t optimal but I just wanted to show off how to make life a little simpler with a small for-loop:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">registration-form</span> [session msg]
  (with-head session <span style="color: #87cefa;">"Reddit.Clojure - Registration form"</span>
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Registration"</span>]
    (<span style="color: #afeeee; font-weight: bold;">when</span> msg [<span style="color: #7fffd4;">:h4</span> msg])
    (form-to [<span style="color: #7fffd4;">:post</span> <span style="color: #87cefa;">"/register/"</span>]
             [<span style="color: #7fffd4;">:table</span>
              (<span style="color: #afeeee; font-weight: bold;">for</span> [field [<span style="color: #87cefa;">"Email"</span> <span style="color: #87cefa;">"Username"</span> <span style="color: #87cefa;">"Password"</span>]]
                [<span style="color: #7fffd4;">:tr</span>
                 [<span style="color: #7fffd4;">:td</span> field]
                 [<span style="color: #7fffd4;">:td</span> (text-field field)]])]
             (submit-button <span style="color: #87cefa;">"Sign up"</span>))))
</pre>
<p>Instead of adding a specific link to this functionality, I’ll bundle it directly with the route for submitting links, i.e. if you’re not logged in you should either log in or register:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">  (GET  <span style="color: #87cefa;">"/new/*"</span>      (<span style="color: #afeeee; font-weight: bold;">if</span> (@online-users (<span style="color: #7fffd4;">:id</span> session))
                        (reddit-new-link session (<span style="color: #7fffd4;">:msg</span> params))
                        (redirect-to <span style="color: #87cefa;">"/register/"</span>)))
</pre>
<p>So if you’re not logged in, submitting a link and subsequently trying to use my email will give you this:</p>
<p style="text-align: center;"><img class="aligncenter" title="Registration" src="http://www.bestinclass.dk/wp-content/uploads/rc/registration.png" alt="Registration" width="573" height="415" /></p>
<p><br class="spacer_" /></p>
<h1>Up ahead</h1>
<p>Like I mentioned earlier, the key to Clojure happiness is structure, so if you’re considering booting up a Compojure project, split the codebase into logical units. This makes for much faster development later as the project grows. Depending on your network interface, consider where you launch the clone. This morning I checked my referrers list and found an IP calling from port :8080. I figured someone was playing with Compojure so I followed the link, only to find myself looking at my own Reddit Clone launched from somewhere in the US. Cooool :)</p>
<p><strong>Code still</strong>: <a href="http://github.com/LauJensen/cloneit" target="_blank">here</a></p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-with-user-registration/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-with-user-registration/</feedburner:origLink></item>
		<item>
		<title>Simplicity on Steroids</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/Qlz0QQCc-ow/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/clojure-list-comprehension/#comments</comments>
		<pubDate>Thu, 04 Feb 2010 20:00:07 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[complexity]]></category>
		<category><![CDATA[for]]></category>
		<category><![CDATA[idioms]]></category>
		<category><![CDATA[scala]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=999</guid>
		<description><![CDATA[
			
				
			
		
Today I had a lot of fun reading about a for-constructs on steroids. The author of that post explores the possibilities which come with the built-in pattern matching in Scalas for-construct, so I’ll do the same with Clojure.



Preface
Before reading this post I hope you’ll at least skim the code in the article I linked above, [...]]]></description>
			<content:encoded><![CDATA[
<p>Today I had a lot of fun reading about <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=281160" target="_blank">for-constructs on steroids</a>. The author of that post explores the possibilities which come with the built-in pattern matching in Scala’s for-construct, so I’ll do the same with Clojure.</p>
<p><span id="more-999"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>Before reading this post I hope you’ll at least skim the code in the article I linked above, just to get a feel for the complexity involved. But complex or not, with Scala you’re able to do pattern matching directly in your for-loop, <em>yielding </em>results as you move along. As the author of that post demonstrates, here’s one way to go:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">case</span> <span style="color: #afeeee; font-weight: bold;">class</span> <span style="color: #87ceeb; font-weight: bold;">Person</span>(<span style="color: #40e0d0; font-weight: bold;">firstName</span>:<span style="color: #87ceeb; font-weight: bold;">String</span>, <span style="color: #40e0d0; font-weight: bold;">lastName</span>: <span style="color: #87ceeb; font-weight: bold;">String</span>);

<span style="color: #afeeee; font-weight: bold;">val</span> <span style="color: #40e0d0; font-weight: bold;">people</span> = List(
  Person("Jane", "Smith"),
  Person("John", "Doe"),
  Person("Jane", "Eyre"));

<span style="color: #afeeee; font-weight: bold;">for</span> (Person("Jane", last) &lt;- people) <span style="color: #afeeee; font-weight: bold;">yield</span> "Ms. " + last;

» <span style="color: #ccffff;">List</span>("Ms. Smith", "Ms. Eyre")</pre>
<p>So is that handy? Yes, that is absolutely handy, but Clojure doesn’t do steroids. In fact Lisps are almost solely identified by their powerful idioms, which you use again and again, basically forming lists/data structures in new and creative ways. Clojure hosts 2 functions (macros) which act in the same domain (looping/comprehension), but are 2 different beasts entirely.</p>
<p><br class="spacer_" /></p>
<h3>First we have doseq:</h3>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">doseq</span> [i (<span style="color: #7fffd4;">range</span> 5)]
    (<span style="color: #7fffd4;">println</span> i))
 0
 1
 2
 3
 4
 nil    <span style="color: #ffcc00;"><span style="color: #ff6600;">;;</span> return</span>
</pre>
<p>Notice the final nil? That’s your return from a doseq, since it only acts on the list for the purpose of side-effects, in this case printing. Though it supports the same binding options as <strong><em>for</em></strong> (i.e. :let, :while, :when) its purpose is totally different. Most people coming from imperative languages will recognize this immediately as it is your typical loop construct, similar to <strong><em>foreach</em></strong>. In the same arena you’ll find <strong><em>dotimes</em></strong>, which is handy for running a body of code multiple times, and for that purpose it’s faster than (doseq [i (range 10)]).</p>
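<p>As a small sketch of those binding options in a doseq, and of dotimes next to it (the numbers are arbitrary and only there for illustration):</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; doseq with :let and :when, the same modifiers that for accepts
(doseq [i (range 6)
        :let [sq (* i i)]
        :when (even? i)]
  (println i "squared is" sq))
;; prints:
;; 0 squared is 0
;; 2 squared is 4
;; 4 squared is 16

;; dotimes binds i to 0, 1 and 2 in turn, and also returns nil
(dotimes [i 3]
  (println "run" i))
</pre>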
<p><br class="spacer_" /></p>
<h3>For on the other hand</h3>
<p>…is list-comprehension at its best:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">for</span> [i (<span style="color: #7fffd4;">range</span> 5)]
  i)
(0 1 2 3 4)</pre>
<p>This is implicitly doing what the yield statement does in Scala and is a nice way to functionally build up a data structure, i.e. totally different from doseq. Don’t mix <strong><em>for</em></strong> with side-effects, both because it’s bad practice and because <strong><em>for</em></strong> is lazy. But let’s say that we want to do matching like the Scala example and only see those items with the first name ‘Jane’:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">people</span> [[<span style="color: #87cefa;">"Jane"</span> <span style="color: #87cefa;">"Smith"</span>] [<span style="color: #87cefa;">"John"</span> <span style="color: #87cefa;">"Doe"</span>] [<span style="color: #87cefa;">"Jane"</span> <span style="color: #87cefa;">"Eyre"</span>]])

(<span style="color: #afeeee; font-weight: bold;">for</span> [person people <span style="color: #7fffd4;">:let</span> [[name surname] person] <span style="color: #7fffd4;">:when</span> (<span style="color: #7fffd4;">=</span> <span style="color: #87cefa;">"Jane"</span> name)]
  (<span style="color: #7fffd4;">str</span> <span style="color: #87cefa;">"Ms. "</span> surname))

&gt;&gt; (<span style="color: #87cefa;">"Ms. Smith"</span> <span style="color: #87cefa;">"Ms. Eyre"</span>)</pre>
<p>So the <strong><em>:when</em></strong> clause only evaluates the body when the predicate (= "Jane" name) is true; however, it does not halt the process when the predicate is false, it simply moves on to the next item.</p>
<p>Since <strong><em>for</em></strong> is lazy, nothing gets evaluated unless it needs to be. In some programs it makes a lot of sense to keep things lazy, because although each item carries some overhead you can usually architect the code so that you end up saving computations. Let’s imagine that we’re chewing our way through an endless stream of computational results, but for our final analysis we need only the first 5000 samples:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">for</span> [sample samples <span style="color: #7fffd4;">:while</span> (<span style="color: #7fffd4;">&lt;</span> (<span style="color: #7fffd4;">:id</span> sample) 5000)]
  (<span style="color: #7fffd4;">:value</span> sample))</pre>
<p>That will run 5000 times and then abort. This could also be achieved like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #7fffd4;">take</span> 5000 (<span style="color: #afeeee; font-weight: bold;">for</span> [sample samples ...

(<span style="color: #7fffd4;">take-while</span> #(<span style="color: #7fffd4;">&lt;</span> (<span style="color: #7fffd4;">:id</span> %) 5000) (<span style="color: #afeeee; font-weight: bold;">for</span> [sample ...</pre>
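<p>To try the three variants at a REPL, here is a minimal sketch with a made-up endless stream. The samples seq and its :id/:value contents are assumptions for illustration, and I’ve cut the limit down to 5 so the results fit on a line. Nothing is computed until a result is actually consumed.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">;; Hypothetical endless stream of samples, built lazily
(def samples (map (fn [i] {:id i :value (* i i)}) (range)))

;; :while stops the comprehension at the first sample that fails the test
(for [sample samples :while (&lt; (:id sample) 5)]
  (:value sample))
;; (0 1 4 9 16)

;; the same values via take ...
(take 5 (for [sample samples] (:value sample)))

;; ... and via take-while on the samples themselves
(map :value
     (take-while #(&lt; (:id %) 5) (for [sample samples] sample)))
</pre>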
<p>But getting comfortable with the full power of for/doseq usually ends up giving you a nicer end result. Through all the examples, you’re seeing the power of the seq-abstraction. Let’s say you need to work on a <a href="http://clj-me.cgrand.net/2010/01/19/clojure-refactoring-flattening-reduces/" target="_blank">nested structure</a>, only working on the innermost data: a double-bound <strong><em>for</em></strong> is your friend. Imagine you have a map where each name is bound to a vector containing the scores achieved that day. Parsing them in a way which handles empty vectors, nil values, ints, floats, doubles, whatever you throw at it, looks like this:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">let</span>  [scoreboard {<span style="color: #87cefa;">"Allison"</span> [<span style="color: #ffffff;">10 11 12</span> <span style="color: #ffff00;">1e3</span>]
                   <span style="color: #87cefa;">"Franky"</span>  [<span style="color: #ffffff;">5</span> <span style="color: #ccffcc;">2.34</span> <span style="color: #ffcc00;">1/4</span>]
                   <span style="color: #87cefa;">"John"</span>    <strong>nil</strong>}
       scores     (<span style="color: #afeeee; font-weight: bold;">for</span> [[player scores] scoreboard, score scores]
                    score)]
  (<span style="color: #7fffd4;">/</span> (<span style="color: #7fffd4;">reduce</span> + scores) (<span style="color: #7fffd4;">count</span> scores)))

&gt;&gt; 148.65571428571428</pre>
<p>Notice how the core idioms of Clojure all work together neatly. We’ve used <strong><em>for</em></strong> in many ways: adding predicates and abort conditions, and here destructuring <em>scoreboard</em> into both player and scores, then iterating over each <em>score</em> individually. Of course the values could also have been extracted using another idiom:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #7fffd4;">mapcat</span> val scoreboard)</pre>
<p>But we’ll save mapcat for a rainy day.</p>
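<p>For the impatient, a quick sanity check that the two forms agree. This is only a sketch: the scoreboard is the same data as above, pulled out into a def so it can be evaluated on its own.</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(def scoreboard {"Allison" [10 11 12 1e3]
                 "Franky"  [5 2.34 1/4]
                 "John"    nil})

;; both walk the map entries in the same order and skip the nil entry
(= (for [[player scores] scoreboard, score scores] score)
   (mapcat val scoreboard))
;; true
</pre>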
<p><br class="spacer_" /></p>
<h1>Idioms</h1>
<p>If you’ve read the first chapter of <a href="http://joyofclojure.com/buy" target="_blank">Joy of Clojure</a>, you have seen this quote</p>
<blockquote><p>The only difference between Shakespeare and you was the size of his idiom list - not the size of his vocabulary.</p>
<div id="_mcePaste" style="text-align: right;">– Alan Perlis</div>
</blockquote>
<p>Clojure packs many extremely powerful idioms for working with data, and in Lisp there’s nothing but data. All of Clojure implicitly treats data as immutable, meaning no freely roaming <strong><em>state</em></strong>, and contrary to many languages you have to be explicit if you want <em>mutability</em>. This greatly reduces the incidental complexity of our programs, leading to more robust software. Since we are in no shape or form introducing mutability (like iterators), but instead building all our data functionally, we don’t wind up with MatchError Exceptions or anything else which results from <strong><em>unexpected state</em></strong>.</p>
<p><br class="spacer_" /></p>
<h1>Conclusion</h1>
<p>This blogpost serves as a primer for using all that <strong><em>for</em></strong> has, but also as a reminder that mutable state should always be handled very, very carefully, as results can quickly become unpredictable. As you move into the higher layers of programming (read high-level languages), sometimes the fear sets in that you’ll end up missing certain low-level features which used to be so crucial for your application, but I’m very pleased to see that Clojure is deep enough that you won’t find yourself missing anything, any time soon. On the other hand, while low-level can feel comfy and safe (because we’ve all done it for many years), it does make for <strong>many </strong>areas where you need to pay extreme attention to ensure stable bug-free programs.</p>
<p><br class="spacer_" /></p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/clojure-list-comprehension/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/clojure-list-comprehension/</feedburner:origLink></item>
		<item>
		<title>Reddit Clone in 10 minutes and 91 lines of Clojure</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/dLCs3DpXJYE/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-in-10-minutes-and-91-lines-of-clojure/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 20:00:47 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clone]]></category>
		<category><![CDATA[common lisp]]></category>
		<category><![CDATA[compojure]]></category>
		<category><![CDATA[reddit]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=980</guid>
		<description><![CDATA[
			
				
			
		
Recently I had the good pleasure of reading this blog post, which demonstrates how to build a Reddit Clone in 100 lines of Common Lisp. I thought it might be interesting to see a port to Clojure, contrasting a couple of idioms and core functions of both languages.




Preface
Why contrast 2 Lisps you ask? Because the [...]]]></description>
			<content:encoded><![CDATA[
<p>Recently I had the good pleasure of reading this <a href="http://homepage.mac.com/svc/LispMovies/index.html" target="_blank">blog post</a>, which demonstrates how to build a Reddit Clone in 100 lines of Common Lisp. I thought it might be interesting to see a port to Clojure, contrasting a couple of idioms and core functions of both languages.</p>
<p><br class="spacer_" /></p>
<p><span id="more-980"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>Why contrast 2 Lisps you ask? Because the subtle differences are interesting to me and both languages take up about the same amount of code-space. Following the link above you’ll actually be able to see how Sven writes out his entire Clone in a couple of screencasts, demonstrating Lisp Works. Whether you watch it or not, I recommend opening up <a href="http://homepage.mac.com/svc/LispMovies/reddit.lisp.html" target="_blank">his code</a> in a 2nd tab while reading this post.</p>
<p><br class="spacer_" /></p>
<h1>The not so subtle differences</h1>
<p>Rich Hickey once remarked that cl-Loop and cl-Format were in themselves more complex than the entire Clojure language. In this case Common Lisp has a function which I would very much like to see moved into Clojure, namely format. Format can render your input in more ways than you can think of, automatically figuring out whether to suffix an extra “s” to “sec” and what not. It’s an impressive function, which we sadly don’t have in Clojure yet (<strong><span style="color: #ff0000;">update</span></strong>: Please see Tom Faulhaber’s comment below the article; it turns out Clojure does have cl-format). As you can see from Sven’s Reddit Clone he formats the links like so:</p>
<p><strong>Title</strong> <em>posted 1 minute 5 seconds ago </em><strong>Up Down</strong></p>
<p>In the absence of format in Clojure I turn to <a href="http://joda-time.sourceforge.net/" target="_blank">Joda Time</a> to mimic that behavior. Joda lets me define a PeriodFormatterBuilder, which I can use to coerce ‘durations’, which the timespan is, into text formatted like above:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">formatter</span>
     (.toPrinter (<span style="color: #afeeee; font-weight: bold;">doto</span> (org.joda.time.format.PeriodFormatterBuilder.)
                   .appendDays    (.appendSuffix <span style="color: #87cefa;">" day "</span>    <span style="color: #87cefa;">" days "</span>)
                   .appendHours   (.appendSuffix <span style="color: #87cefa;">" hour "</span>   <span style="color: #87cefa;">" hours "</span>)
                   .appendMinutes (.appendSuffix <span style="color: #87cefa;">" minute "</span> <span style="color: #87cefa;">" minutes "</span>)
                   .appendSeconds (.appendSuffix <span style="color: #87cefa;">" second "</span> <span style="color: #87cefa;">" seconds "</span>))))
</pre>
<p>This Printer can then be used to coerce a timestamp into the text you see above, by manually making a Duration (datatype) between the timestamp and DateTime/now:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">pprint</span> [stamp]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [retr   (StringBuffer.)
        period (Period. (Duration. stamp (DateTime.)))]
    (.printTo formatter retr period java.util.Locale/US)
    (<span style="color: #7fffd4;">str</span> retr)))
</pre>
<p>We can test it by making up a timestamp for January 31st at 12:00:00 and 0 milliseconds:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">cloneit> </span><span style="font-weight: bold;">(pprint (DateTime. 2010 1 31 12 00 00 00))</span>
"2 hours 52 minutes 4 seconds "</pre>
<p>Nice. This in no way emulates all of what cl-format can do, but enough for this exercise.</p>
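<p>Since the update above notes that Clojure does have cl-format, here is a minimal sketch of the same pluralization done that way. It assumes a recent Clojure where cl-format lives in clojure.pprint (older builds shipped it in clojure.contrib.pprint), and the helper name ago is made up for illustration:</p>
<pre class="sh_clojure" name="code">(require '[clojure.pprint :refer [cl-format]])

;; ~D prints an integer. ~:P backs up one argument and appends "s"
;; unless that argument is 1.
(defn ago [minutes seconds]
  (cl-format nil "posted ~D minute~:P ~D second~:P ago" minutes seconds))

(ago 1 5)   ; "posted 1 minute 5 seconds ago"
</pre>
<p>This only covers the plural suffixes; splitting a Joda Duration into minutes and seconds would still be up to you.</p>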
<p><br class="spacer_" /></p>
<h1>Rendering Links</h1>
<p>To render links we first need to agree on a datastructure and for simplicity I’ll go with a hash-map where the keys are the URLs and the values are hash-maps containing the properties of that URL. This makes for easy access later. To set up some test data:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">def</span> <span style="color: #7fffd4; font-weight: bold;">data</span>  (<span style="color: #7fffd4;">ref</span> {<span style="color: #87cefa;">"http://www.bestinclass.dk"</span> {<span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"Best in Class"</span> <span style="color: #7fffd4;">:points</span> 1 <span style="color: #7fffd4;">:date</span> (DateTime.)}}))</pre>
<p>We know that our users will want to sort the data on various columns, so it makes sense to write a render-links function to which we can pass our criteria for sorting. Clojure’s sort-by is handy in that you can pass it both a function (keyfn), which extracts the data we want to sort by, and a comparator to apply. Render-links thus becomes:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">render-links</span> [keyfn cmp]
  (<span style="color: #afeeee; font-weight: bold;">for</span> [link (<span style="color: #7fffd4;">sort-by</span> keyfn cmp @data)]
    (<span style="color: #afeeee; font-weight: bold;">let</span> [[url {<span style="color: #7fffd4;">:keys</span> [title points date]}] link]
      [<span style="color: #7fffd4;">:li</span>
       (link-to url title)
       [<span style="color: #7fffd4;">:span</span> (<span style="color: #7fffd4;">format</span> <span style="color: #87cefa;">" Posted %s ago. %d %s "</span> (pprint date) points <span style="color: #87cefa;">"points"</span>)]
       (link-to (<span style="color: #7fffd4;">str</span> <span style="color: #87cefa;">"/up/"</span> url)   <span style="color: #87cefa;">"Up"</span>)
       (link-to (<span style="color: #7fffd4;">str</span> <span style="color: #87cefa;">"/down/"</span> url) <span style="color: #87cefa;">"Down"</span>)])))
</pre>
<p>What that does is walk through every link in the sequence which results from sorting. For every link it extracts the key, which is the URL, as well as the keys in the hash-map attached to that key. The return value is a sequence of vectors starting with [:li …], and Compojure knows how to convert this to HTML.</p>
<p>I think the specific Compojure helpers like (link-to) are pretty self-explanatory, but it’s worth noting that if you don’t know them all you could still make a link like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">[<span style="color: #7fffd4;">:a</span> {<span style="color: #7fffd4;">:href</span> <span style="color: #87cefa;">"http://www.bestinclass.dk"</span>} <span style="color: #87cefa;">"My favorite blog"</span>]</pre>
<p>So the entrance fee is pretty low, as you can explore away. Let’s say you want to sort all links by the number of points they have. Call it like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(render-links #(<span style="color: #7fffd4;">:points</span> (<span style="color: #7fffd4;">val</span> %))  >)</pre>
<p>So that hopefully makes sense right away. You get the key by calling :points on the value of each item, and you sort those using Greater Than as the comparator. Sorting by date might be a little more tricky:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(render-links #(.getMillis (Duration. (<span style="color: #7fffd4;">:date</span> (<span style="color: #7fffd4;">val</span> %)) (DateTime.))) >)</pre>
<p>As you can see I pull out the age of each item in milliseconds and again compare them using Greater Than.</p>
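<p>To see what sort-by does with a hash-map outside of any rendering, here is a small sketch. The URLs and points are made-up test data in the same shape as the data ref:</p>
<pre class="sh_clojure" name="code">;; Hypothetical links, keyed by URL like the data ref above.
(def links {"http://a.example" {:title "A" :points 1}
            "http://b.example" {:title "B" :points 7}
            "http://c.example" {:title "C" :points 3}})

;; Sorting a map walks its entries; val gives the property map of each entry,
;; and > as the comparator puts the highest score first.
(def by-points (sort-by #(:points (val %)) > links))

(map key by-points)   ; ("http://b.example" "http://c.example" "http://a.example")
</pre>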
<p><br class="spacer_" /></p>
<h1>Rendering Home</h1>
<p>So to render a main-page almost exactly like the one Sven has, we do the following:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">reddit-home</span> []
  (html
   [<span style="color: #7fffd4;">:head</span>
    [<span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"Reddit.Clojure"</span>]]
   [<span style="color: #7fffd4;">:body</span>
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Reddit.Clojure"</span>]
    [<span style="color: #7fffd4;">:h3</span> (<span style="color: #7fffd4;">format</span> <span style="color: #87cefa;">"In exactly %d lines of gorgeous Clojure"</span>
                 (<span style="color: #afeeee; font-weight: bold;">->></span> (this-file) reader line-seq count))]
    [<span style="color: #7fffd4;">:a</span> {<span style="color: #7fffd4;">:href</span> <span style="color: #87cefa;">"/"</span>} <span style="color: #87cefa;">"Refresh"</span>] [<span style="color: #7fffd4;">:a</span> {<span style="color: #7fffd4;">:href</span> <span style="color: #87cefa;">"/new/"</span>} <span style="color: #87cefa;">"Add link"</span>]
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Highest ranking list"</span>]
    [<span style="color: #7fffd4;">:ol</span> (render-links #(<span style="color: #7fffd4;">:points</span> (<span style="color: #7fffd4;">val</span> %))  >)]
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Latest link"</span>]
    [<span style="color: #7fffd4;">:ol</span> (render-links #(.getMillis (Duration. (<span style="color: #7fffd4;">:date</span> (<span style="color: #ccffcc;">val</span> %)) (DateTime.))) >)]]))
</pre>
<p>The reason I said ‘almost exactly’ is that Sven’s version outputs “In about 100 lines of Lisp”, where mine will dynamically output the exact number of lines. But looking past that small detail, I think it’s a very clean representation of that webpage.</p>
<p>To get the actual line count is a little tricky. Clojure stores the filename relative to the classpath when loading the file. That means that the only way to get the actual filename is to store it while Clojure is loading my file. As soon as Clojure has moved on to the next file, *file* changes:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">this-file</span> [] (<span style="color: #7fffd4;">str</span> <span style="color: #87cefa;">"src/"</span> *file*))</pre>
<p>Hackery you say? A little, but nevertheless it does dynamically output the number of lines in the source file.</p>
<p><strong><span style="color: #ff0000;">important</span></strong><strong>: </strong>If you’re running this program from REPL (ie. not from a .jar file), this-file won’t work because there is no file. Replace it with a dummy value.</p>
<p><br class="spacer_" /></p>
<h1>Adding Links</h1>
<p>So now we need to enable our users to add new links to the website, and I’ll implement the same validation as Sven, i.e. a valid, non-empty URL, a non-empty title, etc. To begin, I’ll make a predicate to verify the URL:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">invalid-url?</span> [url]
  (<span style="color: #afeeee; font-weight: bold;">or</span> (<span style="color: #7fffd4;">empty?</span> url)
      (<span style="color: #7fffd4;">not</span> (<span style="color: #afeeee; font-weight: bold;">try</span> (java.net.URL. url) (<span style="color: #afeeee; font-weight: bold;">catch</span> Exception e nil)))))
</pre>
<p>That makes our lives a little easier when writing the main logic. Secondly we need to set up a page which contains the input fields:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">reddit-new-link</span> [msg]
  (html
   [<span style="color: #7fffd4;">:head</span>
    [<span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"Reddit.Clojure - Submit to our authority"</span>]]
   [<span style="color: #7fffd4;">:body</span>
    [<span style="color: #7fffd4;">:h1</span> <span style="color: #87cefa;">"Reddit.Clojure - Submit a new link"</span>]
    [<span style="color: #7fffd4;">:h3</span> <span style="color: #87cefa;">"Submit a new link"</span>]
    (<span style="color: #afeeee; font-weight: bold;">when</span> msg [<span style="color: #7fffd4;">:p</span> {<span style="color: #7fffd4;">:style</span> <span style="color: #87cefa;">"color: red;"</span>} msg])
    (form-to [<span style="color: #7fffd4;">:post</span> <span style="color: #87cefa;">"/new/"</span>]
     [<span style="color: #7fffd4;">:input</span> {<span style="color: #7fffd4;">:type</span> <span style="color: #87cefa;">"Text"</span> <span style="color: #7fffd4;">:name</span> <span style="color: #87cefa;">"url"</span> <span style="color: #7fffd4;">:value</span> <span style="color: #87cefa;">"http://"</span> <span style="color: #7fffd4;">:size</span> 48 <span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"URL"</span>}]
     [<span style="color: #7fffd4;">:input</span> {<span style="color: #7fffd4;">:type</span> <span style="color: #87cefa;">"Text"</span> <span style="color: #7fffd4;">:name</span> <span style="color: #87cefa;">"title"</span> <span style="color: #7fffd4;">:value</span> <span style="color: #87cefa;">""</span> <span style="color: #7fffd4;">:size</span> 48 <span style="color: #7fffd4;">:title</span> <span style="color: #87cefa;">"Title"</span>}]
     (submit-button <span style="color: #87cefa;">"Add link"</span>))
    (link-to <span style="color: #87cefa;">"/"</span> <span style="color: #87cefa;">"Home"</span>)]))
</pre>
<p>That function takes an argument for the sole reason, that I want to be able to call it with an error message while instructing the user on how to provide good input. You see that directly in the middle:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">when</span> msg [<span style="color: #7fffd4;">:p</span> {<span style="color: #7fffd4;">:style</span> <span style="color: #87cefa;">"color: red;"</span>} msg])</pre>
<p>That only kicks in if msg is non-nil, in which case it will output a p-tag with a red font containing msg. Now that we have all the rendering out of the way, we can implement the logic:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">add-link</span> [[title url]]
  (redirect-to
   (<span style="color: #afeeee; font-weight: bold;">cond</span>
    (invalid-url? url) <span style="color: #87cefa;">"/new/?msg=Invalid URL"</span>
    (<span style="color: #7fffd4;">empty?</span> title)     <span style="color: #87cefa;">"/new/?msg=Invalid Title"</span>
    (@data url)        <span style="color: #87cefa;">"/new/?msg=Link already submitted"</span>
    <span style="color: #7fffd4;">:else</span>
    (<span style="color: #afeeee; font-weight: bold;">dosync</span>
     (<span style="color: #7fffd4;">alter</span> data assoc url {<span style="color: #7fffd4;">:title</span> title <span style="color: #7fffd4;">:date</span> (DateTime.) <span style="color: #7fffd4;">:points</span> 1})
     <span style="color: #87cefa;">"/"</span>))))
</pre>
<p>Call that function with both a title and an url (ie. the user input) and it will run a fall-through validation of that data, meaning if none of the predicates are true, then we start an STM transaction in which we associate the url with the title, a Timestamp and an initial point. All the strings you see, as well as the final “/” are the return of the conditional, which then becomes the argument to “redirect-to”.</p>
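<p>The fall-through is easier to poke at when the decision is separated from the redirect and the transaction. The following is only a sketch of that idea: redirect-target is a hypothetical helper that is not part of the original code, and it takes the link map as a plain argument instead of reading the ref:</p>
<pre class="sh_clojure" name="code">(defn invalid-url? [url]
  (or (empty? url)
      (not (try (java.net.URL. url) (catch Exception e nil)))))

;; Hypothetical helper: returns the path add-link would redirect to.
;; The first clause whose test is truthy wins; :else is the success case.
(defn redirect-target [links title url]
  (cond
    (invalid-url? url) "/new/?msg=Invalid URL"
    (empty? title)     "/new/?msg=Invalid Title"
    (links url)        "/new/?msg=Link already submitted"
    :else              "/"))

(redirect-target {} "Best in Class" "not a url")                  ; "/new/?msg=Invalid URL"
(redirect-target {} "" "http://www.bestinclass.dk")               ; "/new/?msg=Invalid Title"
(redirect-target {} "Best in Class" "http://www.bestinclass.dk")  ; "/"
</pre>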
<p><br class="spacer_" /></p>
<h1>Rating Posts</h1>
<p>Now there’s only 2 things missing, rating and the server-setup. With our data rolled in a native Clojure structure it becomes extremely easy to rate an item:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">rate</span> [url mfn]
  (<span style="color: #afeeee; font-weight: bold;">dosync</span>
   (<span style="color: #afeeee; font-weight: bold;">when</span> (@data url)
     (<span style="color: #7fffd4;">alter</span> data update-in [url <span style="color: #7fffd4;">:points</span>] mfn)))
  (redirect-to <span style="color: #87cefa;">"/"</span>))
</pre>
<p>That function takes the URL in question, as well as a function (modify-fn). The function can be inc, dec, #(+ 5 %) or whatever you’d like; it’s just a function of the current points. Calling (when (@data url)) looks up the item specified by the URL. If this is nil (i.e. somebody tried to work around the system), then nothing happens. But if there is a URL by that name in the map, then we alter the data by updating [url :points] directly within an STM transaction. That keeps the update consistent even with many concurrent users.</p>
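<p>Stripped of the redirect, the same update can be tried at the REPL. This is a sketch with made-up data; the names store and rate! are mine, chosen so they don’t clash with the real data and rate:</p>
<pre class="sh_clojure" name="code">;; Hypothetical stand-in for the data ref, with a single link.
(def store (ref {"http://example.com" {:title "Example" :points 1}}))

;; Same logic as rate above, minus the redirect.
(defn rate! [url mfn]
  (dosync
   (when (@store url)
     (alter store update-in [url :points] mfn))))

(rate! "http://example.com" inc)
(rate! "http://example.com" #(+ 5 %))
(get-in @store ["http://example.com" :points])   ; 7

;; Unknown URLs fall through the when, so the map is left untouched.
(rate! "http://unknown.example" inc)             ; nil
</pre>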
<p><br class="spacer_" /></p>
<h1>Finalizing</h1>
<p>So with all of the logic and rendering in place, we just need to bundle it in a set of routes which Compojure then serves our visitors:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(defroutes reddit
  (GET  <span style="color: #87cefa;">"/"</span>         (reddit-home))
  (GET  <span style="color: #87cefa;">"/new/*"</span>    (reddit-new-link (<span style="color: #7fffd4;">:msg</span> params)))
  (POST <span style="color: #87cefa;">"/new/"</span>     (add-link (<span style="color: #7fffd4;">map</span> #(params %) [<span style="color: #7fffd4;">:title</span> <span style="color: #7fffd4;">:url</span>])))
  (GET  <span style="color: #87cefa;">"/up/*"</span>     (rate (<span style="color: #7fffd4;">:*</span> params) inc))
  (GET  <span style="color: #87cefa;">"/down/*"</span>   (rate (<span style="color: #7fffd4;">:*</span> params) dec))
  (GET  <span style="color: #87cefa;">"/styles/*"</span> (serve-file <span style="color: #87cefa;">"res"</span> (params <span style="color: #7fffd4;">:*</span>)))
  (ANY <span style="color: #87cefa;">"*"</span>  404))
</pre>
<p>Firstly we serve the main page to visitors hitting the root. If you request the “/new/” address you get our input form, but if you post to it, the logic from (add-link) runs. The result, as you recall, is a redirect, either to the same page with an error or to the front page.</p>
<p>The 4th item serves “/up/” and then feeds the remainder of the URL into the key “*”. That allows me to feed that directly to (rate) with the final parameter inc, which causes the :points property to be incremented by one. The opposite is true for the 5th item.</p>
<p>The call to serve-file allows me to serve statics like CSS, JS files etc.</p>
<p>Finally I have my failsafe (ANY “*” 404), which of course means that all requests other than those I’ve defined above will receive a 404 response. It’s not necessary, just nice to have. Launch these routes on a network interface by calling my main function:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">-main</span> [&amp; args]
  (run-server {<span style="color: #7fffd4;">:port</span> 8080} <span style="color: #87cefa;">"/*"</span> (servlet reddit)))</pre>
<p>Throw in a call to (include-css “res/reddit.css”) and you get this:</p>
<p style="text-align: center;"><img class="aligncenter" style="border: 0;" title="CloneIT" src="http://www.bestinclass.dk/wp-content/uploads/cloneit.png" alt="CloneIT" width="484" height="456" /></p>
<p><br class="spacer_" /></p>
<h1>Deployment</h1>
<p>The reason I felt like following Sven’s lead in producing a Reddit Clone was that I think Clojure gives you a lot of mileage in this domain, which hopefully a few of you agree with after reading this. I’ve added the code to a <a href="http://github.com/LauJensen/cloneit" target="_blank">Git Repo</a> which I hope you newcomers will really enjoy:</p>
<pre>$ git clone git://github.com/LauJensen/cloneit.git</pre>
<p>That gives you the code. Now download Leiningen:</p>
<pre>$ wget http://github.com/technomancy/leiningen/raw/stable/bin/lein</pre>
<p>Put that on your path and make it executable</p>
<pre>$ export PATH=$PATH:/path/to/lein
$ chmod +x lein</pre>
<p>And then install it simply by calling</p>
<pre>$ lein self-install</pre>
<p>Now you’re sitting with my code and one of the build tools which Clojurians use. Why is this great? It’s great because now you don’t have to scour the net to find Clojure, Contrib, Joda etc., just run</p>
<pre>$ lein deps</pre>
<p>And you’ll have <strong>all of the dependencies</strong> on your own system. Want to run my program to experiment with the webservice? No problem:</p>
<pre>$ lein compile
$ lein uberjar
$ java -jar cloneit-standalone.jar
2010-01-31 15:22:09.694::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
cloneit.proxy$javax.servlet.http.HttpServlet$0
2010-01-31 15:22:09.725::INFO:  jetty-6.1.x
2010-01-31 15:22:09.767::INFO:  Started SocketConnector@0.0.0.0:8080
 </pre>
<p>Yes, it really is that easy to deploy! Now you have a portable Reddit Clone which will run on Linux, BSD, Mac OSX and even Windows, all with very little effort and fewer than 100 lines of code!</p>
<h1>Conclusion</h1>
<p>Now you know how to write a webservice, implement Reddit like functions, build it, handle dependencies and deploy cross platform — The language level support for concurrency is becoming invaluable at every turn these days and with Clojure infrastructure pieces quickly being put in place, Clojure is giving us a lot of mileage. Hope you all have some fun with it.</p>
<p>PS: Big thanks to Sven for getting the ball rolling.</p>
<p><strong>Code here:</strong> <a href="http://github.com/LauJensen/cloneit" target="_blank">Github</a></p>
<p><br class="spacer_" /></p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-in-10-minutes-and-91-lines-of-clojure/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/02/reddit-clone-in-10-minutes-and-91-lines-of-clojure/</feedburner:origLink></item>
		<item>
		<title>Global Warming Vs Clojure!</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/btuPkzLE54c/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/01/global-warming/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 19:00:39 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[co2]]></category>
		<category><![CDATA[global]]></category>
		<category><![CDATA[gzip]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[tar]]></category>
		<category><![CDATA[warming]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=932</guid>
		<description><![CDATA[
			
				
			
		
Nobody who’s connected to the rest of the world, either via TV or the Internet is unaware of Global Warming — This phenomenon which threatens to destroy us all if we don’t collectively assume responsibility for the globe. Here’s my contribution to a solution in 98 lines of heavy computational Clojure!


Preface
As a Danish Citizen I [...]]]></description>
			<content:encoded><![CDATA[
<p>Nobody who’s connected to the rest of the world, either via TV or the Internet is unaware of Global Warming — This phenomenon which threatens to destroy us all if we don’t collectively assume responsibility for the globe. Here’s my contribution to a solution in 98 lines of heavy computational Clojure!<span id="more-932"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Preface</h1>
<p>As a Danish Citizen I feel it’s mandatory to engage in this debate. Denmark recently hosted a major conference to facilitate solutions to the climate threats of this, the 21st century. The result, as you all probably know, was unfortunately a huge failure, so tons of CO2 were needlessly emitted by flying in all the foreign officials, security motorcades etc.</p>
<p>I plan on getting a better result. I’ve learned that the National Oceanic and Atmospheric Administration (NOAA) has published about 3 Gigabytes of tarballed weather data, going back as far as 1929. My mission is now to organize and parse that data, to see exactly what our recent boom in CO2 emissions is doing to the environment. From the total effect I’ll be able to approximate Clojure’s contribution to Global Warming.</p>
<p><br class="spacer_" /></p>
<h1>Why should I read this post?</h1>
<p>Well, if you should read it, it’s because one or more of the following applies:</p>
<ul>
<li>I care about Global Warming</li>
<li>I love heavy computational parallelized Clojure</li>
<li>I want to know how to stream Tarballs and GZips</li>
</ul>
<p><br class="spacer_" /></p>
<h1>Data</h1>
<p>First we have to get the data from NOAA. For the sake of all of you who want to repeat these calculations to show the neighborhood kids that they shouldn’t spend their spare time lighting dumpsters on fire and generally wasting energy, I’ll go through every step:</p>
<p><strong>#1</strong>: Preparing URLS</p>
<p>Every dataset is found on their ftp in a subdirectory named according to the year the data is from and in that directory there’s a tar-ball called gsod_<em>year</em>.tar. I’m not a wget expert so we will tag-team with Clojure:</p>
<pre class="sh_clojure" name="code">(spit "urls"
   (apply str
        (map #(format "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%d/gsod_%d.tar\n" % %)
              (range 1929 2010))))

;; --- OR ---

(->> (range 1929 2010)
	   (map #(format "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%d/gsod_%d.tar\n" % %))
	   (apply str)
	   (spit "urls"))
</pre>
<p style="text-align: right;"><em><strong>(spit is from clojure.contrib.duck-streams)</strong></em></p>
<p><strong> </strong></p>
<p style="text-align: left;">I show both variants because you’ll want to get comfortable with both <strong>-&gt;</strong> (thread as the 2nd item) and <strong>-&gt;&gt;</strong> (thread as the last item); they’re here to stay and are quite handy. Either of those snippets produces a file called ‘urls’ which contains links to all the tar-balls available (except 2010, which of course isn’t complete yet), totalling about 3 Gigabytes.</p>
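<p>If the two threading macros are new to you, subtraction makes the difference visible, since argument order matters there. A small sketch you can paste into a REPL:</p>
<pre class="sh_clojure" name="code">;; -> inserts the value as the second item of the form, i.e. the first argument.
(-> 10 (- 3))     ; expands to (- 10 3), which is 7

;; ->> inserts the value as the last item of the form.
(->> 10 (- 3))    ; expands to (- 3 10), which is -7

;; With sequence functions the collection comes last, so ->> is the natural fit.
(->> (range 1929 1932)
     (map str)
     (apply str))  ; "192919301931"
</pre>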
<p>To download all the data issue this command from the terminal</p>
<pre> $ mkdir dataset &amp;&amp; cd dataset
 $ wget -i ../urls
--2010-01-20 17:18:53--  ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1929/gsod_1929.tar
           => `gsod_1929.tar'
Resolving ftp.ncdc.noaa.gov... 205.167.25.101
Connecting to ftp.ncdc.noaa.gov|205.167.25.101|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /pub/data/gsod/1929 ... done.
==> SIZE gsod_1929.tar ... 71680
==> PASV ... done.    ==> RETR gsod_1929.tar ... done.
Length: 71680 (70K)

100%[=====================================>] 71.680      94,5K/s   in 0,7s
</pre>
<p>The first file is just 70K as you can see, but the last is 90M; they grow as the number of weather stations increases. Now you’ve got all of your tars sitting in the same directory, so<strong> <span style="color: #ff0000;">if</span> </strong>you wanted to extract the data, do this:</p>
<pre> $ for i in *.tar; do echo $i &amp;&amp; tar -xf $i &amp;&amp; rm $i; done
(..wait about 20 minutes..)
</pre>
<p>That would give you about 450,000 gzips weighing in at a heavy <strong>10 Gigabytes</strong>. If you unroll these files they’ll expand into about <strong>50 Gigabytes</strong> of raw text! Expanding that is too ambitious on my little laptop, so let’s think this through. When we’ve processed all the weather data we need it sorted by year (and perhaps month). To do a running sort of the data we have to carry the head, which is close to impossible in this case: even ‘ls’ struggles with the 450k files, so Java’s heap will too. The best possible way to attack this problem would be to write a tar-ball reader, which lets us peek at the GZips, and then a GZip reader which lets us work with the text without <strong>any pre-unpacking</strong>. Java’s IO is centered around Input/Output streams, so if we can expose the data via streams it will be presorted and I don’t have to overload my harddrive.</p>
<h1>Processing</h1>
<p>We’re looking down the barrel of several Gigabytes of raw text data, so we need to set up some kind of headless processing. When I speak about holding/losing the head, I’m referring to how we handle memory. If you keep the head of the sequence while processing, that entire sequence is accumulated in memory. Not holding the head means keeping only one item or chunk in memory at any given moment.</p>
<p>First we have to narrow the field as much as possible, so let’s look at how the professionals do it:</p>
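<p>As a rough illustration of processing without holding the head, here is a sketch that counts lines from a reader. The function name count-lines is made up, and a StringReader stands in for a file that would really be gigabytes in size:</p>
<pre class="sh_clojure" name="code">(import '(java.io BufferedReader StringReader))

;; reduce consumes the lazy line-seq one line at a time, and nothing else
;; refers to the start of the sequence, so lines that have already been
;; counted can be garbage collected.
(defn count-lines [rdr]
  (reduce (fn [n _] (inc n)) 0 (line-seq rdr)))

(with-open [rdr (BufferedReader. (StringReader. "line 1\nline 2\nline 3"))]
  (count-lines rdr))   ; 3
</pre>
<p>Binding the result of line-seq to a name and using it again after the reduce would be the opposite case: the name keeps the head alive, and every line read stays in memory.</p>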
<div class="mceIEcenter">
<dl class="aligncenter">
<dt><img title="Hockey Stick" src="http://www.bestinclass.dk/wp-content/uploads/hstick.jpg" alt="Hockey Stick" width="600" height="412" /></dt>
<dd>Hockey Stick — © Al Gore</dd>
</dl>
</div>
<p>As we can clearly see, the temperatures in the Northern Hemisphere explode somewhere between the start and the middle of the 1900s. So we’ll follow Mr. Gore’s lead and filter out the stations which record temperature in the Northern Hemisphere (NH). Hopefully by the end of this blogpost we can reproduce the Hockey Stick using Clojure.</p>
<p>To do that, download the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt" target="_blank">history file</a> from NOAA’s website. It contains the IDs of all weather stations as well as their positions by latitude and longitude. The NH is defined by the latitude coordinates that are positive. The file starts with 20 lines of information and then many lines like these:</p>
<pre>010014 99999 SOERSTOKKEN                   NO NO    ENSO  +59783 +005350 +00490
010015 99999 BRINGELAND                    NO NO    ENBL  +61383 +005867 +03270
010016 99999 RORVIK/RYUM                   NO NO          +64850 +011233 +00140
010017 99999 FRIGG                         NO NO    ENFR  +59933 +002417 +00480
010020 99999 VERLEGENHUKEN                 NO SV          +80050 +016250 +00080
010030 99999 HORNSUND                      NO SV          +77000 +015500 +00120
010040 99999 NY-ALESUND II                 NO SV    ENAS  +78917 +011933 +00080
010050 99999 ISFJORD RADIO                 NO NO    ENIS  +78067 +013633 +00050
</pre>
<p>The predefined structure of the file is based on indices, meaning I know that the WBAN ID always starts at IDX 8 and ends on 12 (7–11 when zero-based). We cannot split this on words (\w+) because some elements are occasionally left out, but the index structure remains. That means that we can filter out those stations that are in the Northern Hemisphere like so:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">northern-stations</span> [filename]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [data   (<span style="color: #afeeee; font-weight: bold;">->></span> (<span style="color: #7fffd4;">line-seq</span> (<span style="color: #7fffd4;">reader</span> filename)) (<span style="color: #7fffd4;">drop</span> 20))
        north  (<span style="color: #afeeee; font-weight: bold;">for</span> [station data <span style="color: #7fffd4;">:when</span> (<span style="color: #7fffd4;">=</span> \+ (<span style="color: #7fffd4;">nth</span> station 58))]
                 (<span style="color: #7fffd4;">vec</span> (<span style="color: #7fffd4;">take</span> 2 (.split station <span style="color: #87cefa;">" "</span>))))]
    (<span style="color: #7fffd4;">reduce</span> #(<span style="color: #7fffd4;">assoc</span> %1  <span style="color: #7fffd4;">:stn</span>  (<span style="color: #7fffd4;">conj</span> (<span style="color: #7fffd4;">:stn</span> %1) (%2 0))
                        <span style="color: #7fffd4;">:wban</span> (<span style="color: #7fffd4;">conj</span> (<span style="color: #7fffd4;">:wban</span> %1) (%2 1)))
            {} north)))
</pre>
<p>If you’re a regular here that should be very straightforward, but in case you’re not:</p>
<ol>
<li>Start a line-reader, skip 20 lines</li>
<li>Walk the lines and :when the character at index 58 is “+”, take out the first 2 elements, STN and WBAN, and bundle them in a vector</li>
<li>Reduce that series of vectors into a hashmap ala {:stn [id1 id2 id3] :wban [id1 id2 id3]}</li>
</ol>
<p>The reason this doesn’t break is that neither STN nor WBAN is left out anywhere in the data; <strong>NOAA</strong> consistently uses 99999 to indicate that a field is blank. With all the valid stations extracted we can now set up a data-parser which only extracts data from these stations.</p>
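<p>The index-based reading can also be written with subs on fixed columns. The sketch below assumes the column layout implied by the sample lines above (STN in 0–5, WBAN in 7–11, latitude sign at index 58), so check it against the real file before relying on it. Both make-line and parse-station are hypothetical helpers, and make-line only exists to build a test line with the right spacing:</p>
<pre class="sh_clojure" name="code">;; Builds a line in the assumed layout; "XXXX" is a dummy call sign.
(defn make-line [stn wban station-name lat]
  (format "%s %s %-29s NO NO    XXXX  %s" stn wban station-name lat))

;; Slices by position instead of splitting on whitespace, so missing
;; optional fields elsewhere on the line cannot shift the result.
(defn parse-station [line]
  {:stn    (subs line 0 6)
   :wban   (subs line 7 12)
   :north? (= \+ (nth line 58))})

(parse-station (make-line "010014" "99999" "SOERSTOKKEN" "+59783"))
;; {:stn "010014", :wban "99999", :north? true}
</pre>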
<h1>Gunzip</h1>
<p>We can start by unpacking the 1929 tarball manually to inspect the data. java.util.zip has a few classes for working with GZip archives, so let’s try them out and see if we can’t integrate them nicely.</p>
<p>My personal preference would be to write some kind of wrapper, which would enable me to work on the data like so:</p>
<pre class="sh_clojure" name="code">(with-zipstream [stream "/path/to/gzip.gz"]
   (...work on data...))
</pre>
<p>Well, as you know it’s easy to conform a Lisp to your liking, so here’s my personal Gunship:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defmacro</span> <span style="color: #7fffd4; font-weight: bold;">with-zipstream</span> [bindings &amp; body]
  `(<span style="color: #afeeee; font-weight: bold;">with-open</span> [~(bindings 0) (<span style="color: #afeeee; font-weight: bold;">->></span> (FileInputStream. ~(bindings 1))
                                  GZIPInputStream. InputStreamReader.
                                  BufferedReader.)]
     (<span style="color: #afeeee; font-weight: bold;">do</span> ~@body)))
</pre>
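<p>A quick way to try the macro without the NOAA data, as a sketch: the temp file and its two lines are invented for the demo, and the macro is repeated (minus the redundant do) so the snippet runs on its own.</p>
<pre class="sh_clojure" name="code">(import '(java.io File FileInputStream FileOutputStream
                  InputStreamReader OutputStreamWriter BufferedReader)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

(defmacro with-zipstream [bindings &amp; body]
  `(with-open [~(bindings 0) (->> (FileInputStream. ~(bindings 1))
                                  GZIPInputStream. InputStreamReader.
                                  BufferedReader.)]
     ~@body))

;; Write a tiny gzip to a temp file (made-up content, just for the demo)
(def tmp (File/createTempFile "sample" ".gz"))
(with-open [w (-> (FileOutputStream. tmp) GZIPOutputStream. OutputStreamWriter.)]
  (.write w "line one\nline two\n"))

;; Read it back as text; doall forces the lines before the stream closes
(def lines
  (with-zipstream [rdr (.getPath tmp)]
    (doall (line-seq rdr))))

(println lines)
</pre>
<p>The doall matters: line-seq is lazy, and with-open closes the reader as soon as the body returns.</p>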
<p>But although it’s tempting to abstract away using macros, this constitutes macro-abuse, since plain functions are equally helpful:</p>
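<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">dump-stream</span> [stream sz]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [buffer    (<span style="color: #7fffd4;">make-array</span> Byte/TYPE sz)]
    ;; A single .read may return fewer than sz bytes: loop until full or EOF
    (<span style="color: #afeeee; font-weight: bold;">loop</span> [off 0]
      (<span style="color: #afeeee; font-weight: bold;">let</span> [n (.read stream buffer off (<span style="color: #7fffd4;">-</span> sz off))]
        (<span style="color: #afeeee; font-weight: bold;">when</span> (<span style="color: #afeeee; font-weight: bold;">and</span> (pos? n) (> sz (+ off n)))
          (recur (+ off n)))))
    (ByteArrayInputStream. buffer)))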

(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">line-stream</span>
  [tarstream tarentry]
  (<span style="color: #afeeee; font-weight: bold;">with-open</span> [zipfile (<span style="color: #afeeee; font-weight: bold;">->></span> (dump-stream tarstream (.getSize tarentry))
                           GZIPInputStream. InputStreamReader. BufferedReader.)]
    (<span style="color: #afeeee; font-weight: bold;">doall</span> (<span style="color: #afeeee; font-weight: bold;">for</span> [line (<span style="color: #7fffd4;">repeatedly</span> #(.readLine zipfile)) <span style="color: #7fffd4;">:while</span> line] line))))
</pre>
<p>This does very much the same as core’s line-seq, except that it returns a non-lazy (doall) sequence of all the lines in the GZip.</p>
<p>Now you have access to the weather data which looks like so:</p>
<pre>STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB
030050 99999  19291001    45.3  4    40.0  4  1001.6  4  9999.9  0   17.1
030050 99999  19291002    49.5  4    45.2  4   977.6  4  9999.9  0    9.3
030050 99999  19291003    49.0  4    41.7  4   975.7  4  9999.9  0   10.9
030050 99999  19291004    45.7  4    38.5  4   992.0  4  9999.9  0    6.2
030050 99999  19291005    46.5  4    41.5  4   997.8  4  9999.9  0    7.8
030050 99999  19291006    49.5  4    46.5  4   990.1  4  9999.9  0    7.8
030050 99999  19291007    48.2  4    44.8  4   979.1  4  9999.9  0    9.3
030050 99999  19291008    46.5  4    39.2  4   994.3  4  9999.9  0   12.4
030050 99999  19291009    44.7  4    40.0  4  1005.4  4  9999.9  0   10.9
030050 99999  19291010    48.7  4    47.0  4  1000.6  4  9999.9  0    8.4
030050 99999  19291011    48.7  4    39.2  4   995.5  4  9999.9  0   12.4
</pre>
<p>With direct access to the text compressed away in the GZips it’s easier to implement a tarball reader, because we can see the result of our experiments immediately. Just printing the binary values from each entry (file) in the tarball wouldn’t tell me much about my success or failure in uncompressing the data.</p>
<h1>Tar</h1>
<p>To work with tarballs I’ve downloaded this <a href="http://www.gjt.org/download/time/java/tar/javatar-2.5.tar.gz" target="_blank">jar file</a>. You can also browse the <a href="http://www.gjt.org/javadoc/com/ice/tar/package-summary.html" target="_blank">Javadocs</a>. First things first: let’s see if we can peek at the data, starting with the smallest set, <strong>1929</strong>.</p>
<p>As always when we’re in Java-land we have to think carefully about how we want to abstract methods and workflows. The class TarInputStream lets me view a tarball as a series of TarEntries (compressed files).</p>
<p>With the ability to walk a zipstream as pure text already in place, we can now wrap the tarball in a process-tarball function. It’s handy to go through the data one tarball at a time, both for sorting, as I mentioned before, and for keeping both cores busy with the data. To give you an idea of what I’m thinking, here’s the top-level processing:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">process-tarball</span>
  [filename stn-ids wban-ids headers]
  (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Parsing: "</span> (.getName filename))
  (<span style="color: #7fffd4;">flush</span>)
  (<span style="color: #afeeee; font-weight: bold;">let</span> [tarstream    (<span style="color: #afeeee; font-weight: bold;">->></span> filename FileInputStream. TarInputStream.)
        readings     (extract-readings tarstream stn-ids wban-ids)]
    {<span style="color: #7fffd4;">:year</span> (<span style="color: #7fffd4;">re-find</span> #<span style="color: #87cefa;">"\d{4}"</span> (.getName filename))
     <span style="color: #7fffd4;">:mean</span> (<span style="color: #afeeee; font-weight: bold;">if-let</span> [cnt (<span style="color: #7fffd4;">count</span> readings)]
             (<span style="color: #afeeee; font-weight: bold;">when-not</span> (zero? cnt)
               (<span style="color: #7fffd4;">/</span> (<span style="color: #7fffd4;">reduce</span> + readings) cnt))))}))
</pre>
<p>If you imagine that extract-readings just walks through all readings and returns a sequence of the temperatures it picks up, then this will collect all readings from a tarball and return a hash-map containing the year parsed and the mean temperature for that year. Quite neat for about 20 lines of Clojure.</p>
<p>With the ability to peek through the tar into the GZip and through the compression into the text, we can start writing extract-readings. The easiest thing to do would be to have a for-loop run through every line of the GZips and merge those lines into 1 huge sequence, giving us a line-seq of all the GZips. The problem, however, is that Java spends 2 bytes per char in a String. If we apply the math to the set from 2002, it goes like this:</p>
<p style="text-align: center;"><strong>77</strong> Mb Tarball =&gt; 10.000 Gzips weighing <strong>360</strong> Mb =&gt; Strings worth at least <strong>3.6 Gb</strong></p>
<p style="text-align: left;">That means that on a weak system like mine, I’m down for the count once we reach 1950. That deprives me of the luxury of picking columns out at the highest level of the parser, meaning my extract-readings function needs to be more specific than just pulling out 1 line at a time:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">extract-readings</span>
  [tarstream stn-ids wban-ids]
  (<span style="color: #afeeee; font-weight: bold;">->></span> (Double. (<span style="color: #7fffd4;">nth</span> (cols data) 3))
       (<span style="color: #afeeee; font-weight: bold;">for</span> [data (<span style="color: #7fffd4;">rest</span> (line-stream tarstream file))])
       (<span style="color: #afeeee; font-weight: bold;">for</span> [file (<span style="color: #7fffd4;">repeatedly</span> #(.getNextEntry tarstream))
             <span style="color: #7fffd4;">:while</span> file
             <span style="color: #7fffd4;">:when</span> (<span style="color: #afeeee; font-weight: bold;">let</span> [[_ stn wban] (<span style="color: #7fffd4;">re-find</span> #<span style="color: #87cefa;">"(\d+)-(\d+)"</span> (.getName file))]
                     (<span style="color: #afeeee; font-weight: bold;">and</span> (<span style="color: #7fffd4;">not</span> (.isDirectory file))
                          (<span style="color: #afeeee; font-weight: bold;">or</span> (stn-ids stn) (wban-ids wban))))])
       flatten))
</pre>
<p>If you’re new to -&gt;&gt; that’s probably not easy to read. First I know that I want to work on every line of each file, so I pull out the 4th column from calling ‘cols’ on that line. Cols just splits on spaces, and I then convert that column to a Double. That expression is then fed to the first for loop, which runs through all of the lines in the file, where file is bound by the final for loop. The final for loop runs through all the entries in the tarball, picking out those entries which are not directories and are in the valid stations list, i.e. stn-ids &amp; wban-ids respectively.</p>
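<p>Since cols isn’t shown here, this is a minimal guess at it, not the version from the repository: it assumes the columns are separated by runs of whitespace (as in the sample data above), so it splits on the regex \s+ rather than on a single space. temp-of is a hypothetical name for the column extraction.</p>
<pre class="sh_clojure" name="code">;; Assumed helper: split a data line into columns on runs of whitespace.
(defn cols [line]
  (vec (.split (.trim line) "\\s+")))

;; Hypothetical name for "take the 4th column and make it a Double"
(defn temp-of [line]
  (Double/parseDouble (nth (cols line) 3)))

;; First data row from the sample shown earlier
(def sample-line
  "030050 99999  19291001    45.3  4    40.0  4  1001.6  4  9999.9  0   17.1")

(println (cols sample-line))
(println (temp-of sample-line))
</pre>
<p>Splitting on a single space instead would leave empty strings between the padded columns, which shifts the indices, so the whitespace regex is the safer assumption for this layout.</p>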
<p>Currently we’re running about <strong>30 lines of Clojure</strong> and we’ve already got our main data extraction function set up. On the JVM we have several options when wanting to process data concurrently: <em>Agents</em> that carry state in asynchronous processes, <em>futures</em> that are just threads, <em>promises</em> and more. For this job two ways seemed appealing: either a LinkedBlockingQueue or a parallelized map. I opted for #2. Pmap walks the data using a ‘sliding window’ approach, meaning that if a thread is falling too far behind, pmap will wait. This is good because of the heavy memory load inherent to this challenge. With LinkedBlockingQueue you have to decide on a number of workers, but with pmap you just have to think in chunks, i.e. how big do I want them? For this job it’s a no-brainer: a chunk = a tarball.</p>
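<p>To get a feel for pmap before pointing it at tarballs, here is a toy sketch. slow-mean and the chunks are invented stand-ins for process-tarball and the yearly files; only pmap and doall are the real thing.</p>
<pre class="sh_clojure" name="code">;; slow-mean is a stand-in for process-tarball: it pretends each chunk
;; takes a while, then returns the mean of that chunk.
(defn slow-mean [xs]
  (Thread/sleep 50)
  (/ (reduce + xs) (count xs)))

(def chunks [[1 2 3] [10 20 30] [100 200 300]])

;; pmap is lazy, so doall forces every chunk to be processed
(def means (doall (pmap slow-mean chunks)))

(println means)
;; In a standalone script, (shutdown-agents) lets the JVM exit promptly.
</pre>
<p>The results come back in the same order as the input, just like with map, so nothing downstream needs to know the work was done in parallel.</p>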
<p><br class="spacer_" /></p>
<h1>Ready to Launch</h1>
<p>Now, with the above functions implemented, you have very free hands to decide how you want to attack the data, show/save the output and so on. Here’s one way to go:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">process-weather-data</span>
  [dataset history-file]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [stations   (northern-stations history-file)
        stn-ids    (<span style="color: #7fffd4;">disj</span> (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">:stn</span> stations))  <span style="color: #87cefa;">"999999"</span>)
        wban-ids   (<span style="color: #7fffd4;">disj</span> (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">:wban</span> stations)) <span style="color: #87cefa;">"99999"</span>)
        dataset    (<span style="color: #afeeee; font-weight: bold;">->></span> (File. dataset) file-seq (<span style="color: #7fffd4;">filter</span> #(.isFile %)) sort)
        headers    [<span style="color: #7fffd4;">:stn</span> <span style="color: #7fffd4;">:wban</span> <span style="color: #7fffd4;">:yearmoda</span> <span style="color: #7fffd4;">:temp</span>]
        result     (<span style="color: #afeeee; font-weight: bold;">->></span> dataset
                        (<span style="color: #7fffd4;">pmap</span> #(process-tarball % stn-ids wban-ids headers))
                        doall)]
    (spit <span style="color: #87cefa;">"result"</span> (<span style="color: #7fffd4;">sort-by</span> <span style="color: #7fffd4;">:year</span> result))))
</pre>
<p style="text-align: center;"><strong><em>(my runtime: 1 hour 20 minutes)</em></strong></p>
<p>First we pull out the stations on the Northern Hemisphere and break that down into 2 sets (for fast comparisons). Then from both of those I disjoin the all-9s placeholder which per convention marks an empty field (see the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/gsod/readme.txt" target="_blank">README</a> on <strong>NOAA</strong>). Then I take the path ‘dataset’ and turn it into a sequence of the files, sorted in ascending order. The sort is nice because 1) it allows me to track progress, and 2) the 2 largest data-sets will only have 2 threads running instead of 4 for the majority of the time. Then I manually define some headers; it wouldn’t be hard to rip them from a tarball, but there’s no need. Finally I start off the process by calling pmap, which launches 4 threads.</p>
<p>To keep my system from buckling (and spending excessive time), I booted a lean Arch installation, which claims less than 100 MB of RAM for itself. After 5 minutes this was the scenario:</p>
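<p>To make the set lookups concrete, here is a sketch with invented IDs. The entry name format (STN-WBAN-YEAR.op.gz) and the name tracked? are assumptions for illustration; note that in the sample data the blank WBAN is the 5-digit 99999, while STN has 6 digits.</p>
<pre class="sh_clojure" name="code">;; Made-up IDs; the all-9s placeholders are removed with disj
(def stn-ids  (disj (set ["030050" "030750" "999999"]) "999999"))
(def wban-ids (disj (set ["13743" "99999"]) "99999"))

;; A set is a function of its elements: it returns the element or nil
(defn tracked? [entry-name]
  (let [[_ stn wban] (re-find #"(\d+)-(\d+)" entry-name)]
    (boolean (or (stn-ids stn) (wban-ids wban)))))

(println (tracked? "030050-99999-1929.op.gz"))  ; known STN
(println (tracked? "123456-99999-1929.op.gz"))  ; unknown STN, blank WBAN
(println (tracked? "999999-13743-1950.op.gz"))  ; blank STN, known WBAN
</pre>
<p>If a placeholder were left in either set, every entry carrying that placeholder would pass the filter, which is why the disj step is worth double-checking.</p>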
<div class="wp-caption aligncenter" style="width: 560px"><a href="http://www.bestinclass.dk/wp-content/uploads/5min.jpg"><img class=" " title="5 minutes in" src="http://www.bestinclass.dk/wp-content/uploads/5min.jpg" alt="5 minutes in" width="550" height="382" /></a><p class="wp-caption-text">5 minutes in</p></div>
<p>Couldn’t be better, both cores are boiling and the memory isn’t headed toward a heap explosion. After 45 minutes, this was the situation:</p>
<div class="wp-caption aligncenter" style="width: 560px"><a href="http://www.bestinclass.dk/wp-content/uploads/45min.jpg"><img class=" " title="45 minutes in" src="http://www.bestinclass.dk/wp-content/uploads/45min.jpg" alt="45 minutes in" width="550" height="382" /></a><p class="wp-caption-text">45 minutes in</p></div>
<p><br class="spacer_" /></p>
<p>The memory consumption has stabilized at 79.2% and both cores are still going <strong>full speed</strong> — excellent!</p>
<p><br class="spacer_" /></p>
<h1>Results — Round 1</h1>
<p>With the data ‘spit’ directly into a result file, it’s easy to read it back and mangle it any way we want. For instance:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">doseq</span> [{<span style="color: #7fffd4;">:keys</span> [year mean]} (<span style="color: #7fffd4;">read-string</span> (<span style="color: #7fffd4;">slurp</span> <span style="color: #87cefa;">"result"</span>))]
             (<span style="color: #7fffd4;">println</span> (<span style="color: #7fffd4;">format</span> <span style="color: #87cefa;">"%s\t%s"</span> year (<span style="color: #afeeee; font-weight: bold;">if</span> mean
                                                (<span style="color: #7fffd4;">str</span> (<span style="color: #7fffd4;">/</span> (<span style="color: #7fffd4;">*</span> (<span style="color: #7fffd4;">-</span> mean 32) 5) 9) )
                                                <span style="color: #87cefa;">"null"</span>))))
</pre>
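<p>The Fahrenheit-to-Celsius arithmetic buried in that format call can be pulled out into its own function. A small sketch; the name fahrenheit->celsius is made up for this example:</p>
<pre class="sh_clojure" name="code">;; Same arithmetic as in the doseq above: (F - 32) * 5 / 9
(defn fahrenheit->celsius [f]
  (/ (* (- f 32) 5) 9))

(println (fahrenheit->celsius 212))   ; boiling point of water
(println (fahrenheit->celsius 45.3))  ; first TEMP reading in the sample data
</pre>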
<p>The doseq loop gives you 2 columns, the year and the mean reading converted to Celsius, which you can copy/paste into your favorite spreadsheet editor and observe the following:</p>
<div class="wp-caption aligncenter" style="width: 611px"><a href="http://www.bestinclass.dk/wp-content/uploads/all-stations.png"><img title="Temperature Graph #1" src="http://www.bestinclass.dk/wp-content/uploads/all-stations.png" alt="Temperature Graph #1" width="601" height="289" /></a><p class="wp-caption-text">Temperature Graph #1</p></div>
<p><br class="spacer_" /></p>
<p>And now perhaps you’re wondering, where’s the Hockey Stick? Well, one explanation could be that we’re actually not doing a very good job at picking our input data, as we have simply parsed all available data. Because we see an explosion in the number of weather stations in the years 1995 to 2009, it’s a fair assumption that if these are unevenly distributed then, because of their great number, they distort the graph quite a bit.</p>
<p><br class="spacer_" /></p>
<h1>A closer look</h1>
<p>So we need to be more picky about the stations we use to avoid distorted data. Let’s try to extract all the stations used in 1929 (our first recorded year) and follow their readings throughout the following years. That will give us a clear indication of the variations in global temperatures without any weight distortion. To accomplish this, we can help ourselves by making a function which extracts all station IDs from a given tarball:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">get-stations</span> [filename]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [tarstream    (<span style="color: #afeeee; font-weight: bold;">->></span> filename FileInputStream. TarInputStream.)
        all-stations (<span style="color: #afeeee; font-weight: bold;">for</span> [file (<span style="color: #7fffd4;">repeatedly</span> #(.getNextEntry tarstream))
                           <span style="color: #7fffd4;">:while</span> file
                           <span style="color: #7fffd4;">:when</span> (<span style="color: #7fffd4;">not</span> (.isDirectory file))]
                       (<span style="color: #afeeee; font-weight: bold;">let</span> [[_ stn wban] (<span style="color: #7fffd4;">re-find</span> #<span style="color: #87cefa;">"(\d+)-(\d+)-"</span> (.getName file))]
                         {<span style="color: #7fffd4;">:stn</span>  stn <span style="color: #7fffd4;">:wban</span> wban}))]
    {<span style="color: #7fffd4;">:stn</span>  (<span style="color: #7fffd4;">disj</span> (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">map</span> <span style="color: #7fffd4;">:stn</span> all-stations)) <span style="color: #87cefa;">"999999"</span>)
     <span style="color: #7fffd4;">:wban</span> (<span style="color: #7fffd4;">disj</span> (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">map</span> <span style="color: #7fffd4;">:wban</span> all-stations)) <span style="color: #87cefa;">"99999"</span>)}))

climate> (get-stations <span style="color: #87cefa;">"../dataset/gsod_1929.tar"</span>)
{<span style="color: #7fffd4;">:stn</span> #{<span style="color: #87cefa;">"037950"</span> <span style="color: #87cefa;">"033110"</span> <span style="color: #87cefa;">"038940"</span> <span style="color: #87cefa;">"034970"</span>
<span style="color: #87cefa;">"039800"</span> <span style="color: #87cefa;">"033960"</span> <span style="color: #87cefa;">"032620"</span> <span style="color: #87cefa;">"030750"</span> <span style="color: #87cefa;">"030910"</span> <span style="color: #87cefa;">"038040"</span> <span style="color: #87cefa;">"038560"</span>
<span style="color: #87cefa;">"038110"</span> <span style="color: #87cefa;">"990061"</span> <span style="color: #87cefa;">"037770"</span> <span style="color: #87cefa;">"036010"</span> <span style="color: #87cefa;">"039530"</span> <span style="color: #87cefa;">"038640"</span> <span style="color: #87cefa;">"031590"</span> <span style="color: #87cefa;">"033790"</span>
<span style="color: #87cefa;">"030050"</span> <span style="color: #87cefa;">"039730"</span>},
<span style="color: #7fffd4;">:wban</span> #{}}
</pre>
<p>So all we need to do in order to follow these stations is keep only those in the Northern Hemisphere and re-run the job. For the sake of faster experiments while we wash our data, I’ll accept the station IDs and the output filename as arguments:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">process-weather-data</span>
  [dataset history-file stations output]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [dataset   (<span style="color: #afeeee; font-weight: bold;">->></span> (File. dataset) file-seq (<span style="color: #7fffd4;">filter</span> #(.isFile %)) sort)
        nstations (northern-stations history-file)
        stn-ids   (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">filter</span> #((<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">:stn</span>  nstations)) %) (<span style="color: #7fffd4;">:stn</span>  stations)))
        wban-ids  (<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">filter</span> #((<span style="color: #7fffd4;">set</span> (<span style="color: #7fffd4;">:wban</span> nstations)) %) (<span style="color: #7fffd4;">:wban</span> stations)))
        headers   [<span style="color: #7fffd4;">:stn</span> <span style="color: #7fffd4;">:wban</span> <span style="color: #7fffd4;">:yearmoda</span> <span style="color: #7fffd4;">:temp</span>]
        result    (<span style="color: #afeeee; font-weight: bold;">doall</span>
                   (<span style="color: #7fffd4;">pmap</span> #(process-tarball % stn-ids wban-ids headers) dataset))]
    (spit (<span style="color: #7fffd4;">str</span> output <span style="color: #87cefa;">".raw"</span>) (<span style="color: #7fffd4;">with-out-str</span> (<span style="color: #7fffd4;">prn</span> result)))
    (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">"Done"</span>)))

(<span style="color: #afeeee; font-weight: bold;">let</span> [tracked-stations  (get-stations <span style="color: #87cefa;">"res/dataset/gsod_1929.tar"</span>)]
  (process-weather-data <span style="color: #87cefa;">"res/dataset/"</span> <span style="color: #87cefa;">"res/history"</span>
                        tracked-stations <span style="color: #87cefa;">"stats-1929"</span>))
</pre>
<p style="text-align: center;"><strong><em>(my runtime: 11 minutes)</em></strong></p>
<p>Now we don’t have as much data as before (although still a lot), but we know that it’s not distorted by the addition of a huge number of stations in various locations. That gives us the following graph:</p>
<div class="wp-caption aligncenter" style="width: 610px"><a href="http://www.bestinclass.dk/wp-content/uploads/1929-stations.png"><img title="Temperature Graph #2" src="http://www.bestinclass.dk/wp-content/uploads/1929-stations.png" alt="Temperature Graph #2" width="600" height="256" /></a><p class="wp-caption-text">Temperature Graph #2 — Tracking 14 stations</p></div>
<p><br class="spacer_" /></p>
<p>Hmm… I’m really starting to wonder how that Hockey Stick was produced, because what we can clearly see from following about 14 stations is a gradual decrease in global temperature. But let’s give Mr. Gore the benefit of the doubt and expand our scope to include all stations used from 1929 to 1940 and then follow those throughout the years. That will give us a ton of data and keep us somewhat safe from weight distortion. I’ll introduce a helper to compile all the stations from a given range:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">get-station-series</span> [base start end]
  (<span style="color: #7fffd4;">apply</span> merge-with into
         (<span style="color: #afeeee; font-weight: bold;">for</span> [i (<span style="color: #7fffd4;">range</span> start (<span style="color: #7fffd4;">inc</span> end))]
           (get-stations
            (<span style="color: #7fffd4;">str</span> base (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">not=</span> \/ (<span style="color: #7fffd4;">last</span> base)) <span style="color: #87cefa;">"/"</span>) <span style="color: #87cefa;">"gsod_"</span> i <span style="color: #87cefa;">".tar"</span>)))))
</pre>
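<p>The merging step is the heart of that helper, so here it is in isolation. The two maps below are made-up results in the shape get-stations returns; only merge-with and into are doing the work.</p>
<pre class="sh_clojure" name="code">;; Two made-up per-year results in the shape get-stations returns
(def y1929 {:stn #{"030050" "030750"} :wban #{}})
(def y1930 {:stn #{"030050" "033110"} :wban #{"13743"}})

;; merge-with calls into on values sharing a key, i.e. a set union here
(def combined (apply merge-with into [y1929 y1930]))

(println combined)
</pre>
<p>Because the values are sets, a station that shows up in several years is only kept once.</p>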
<p>Call get-station-series with the directory containing the tars as a base and then 2 integers (1929 1940) and you’ll get in return a map of 2 sets containing all stations used in the period. Then all you need to do is start the job with those stations, filtered down to the ones located in the NH:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">let</span> [tracked-stations  (get-station-series <span style="color: #87cefa;">"res/dataset"</span> 1929 1940)]
  (process-weather-data <span style="color: #87cefa;">"res/dataset/"</span> <span style="color: #87cefa;">"res/history"</span>
                        tracked-stations <span style="color: #87cefa;">"stats-1929-1940"</span>))
</pre>
<p style="text-align: center;"><strong><em>(my runtime: 10 minutes)</em></strong></p>
<p>Let that run for a while and you’ll get a lot of data, resulting in the following graph:</p>
<div class="wp-caption aligncenter" style="width: 613px"><a href="http://www.bestinclass.dk/wp-content/uploads/1940-stations.png"><img title="Temperature Graph #3" src="http://www.bestinclass.dk/wp-content/uploads/1940-stations.png" alt="Temperature Graph #3" width="603" height="266" /></a><p class="wp-caption-text">Temperature Graph #3 — Tracking 450 stations</p></div>
<p><br class="spacer_" /></p>
<p>It seems that when we leave out the great number of weather stations that were introduced in the last 50 years or so, the tendency is absolutely not a rise in temperature.</p>
<p><br class="spacer_" /></p>
<h1>Final crack at the Hockey Stick</h1>
<p>Ok. If at first you don’t succeed, try harder. The reason the data is distorted is that the stations aren’t being weighted correctly. Some areas have a higher density of stations, and some stations report more frequently than others. Long story short: we need to get a visual impression of each station’s readings. By looking at each station individually instead of compressing them into an unevenly weighted average, we will be able to clearly deduce how the weather has changed through the years recorded.</p>
<p>To make this really easy, I’ll introduce a helper which spits out files ready for OpenOffice Spreadsheet:</p>
<pre style="color: #bebebe; background-color: #262626; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">emit-dataset</span> [data]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [uids (<span style="color: #7fffd4;">distinct</span> (flatten (<span style="color: #7fffd4;">map</span> #(<span style="color: #7fffd4;">map</span> <span style="color: #7fffd4;">:uid</span> %) (<span style="color: #7fffd4;">map</span> <span style="color: #7fffd4;">:reads</span> data))))]
    (<span style="color: #7fffd4;">with-out-str</span>
      (<span style="color: #afeeee; font-weight: bold;">doseq</span> [{<span style="color: #7fffd4;">:keys</span> [year reads]} data]
        (<span style="color: #7fffd4;">print</span> year)
        (<span style="color: #afeeee; font-weight: bold;">doseq</span> [uid uids]
          (<span style="color: #afeeee; font-weight: bold;">if-let</span> [reading (<span style="color: #7fffd4;">first</span> (<span style="color: #7fffd4;">filter</span> #(<span style="color: #7fffd4;">=</span> uid (<span style="color: #7fffd4;">:uid</span> %)) reads))]
            (<span style="color: #7fffd4;">print</span> <span style="color: #87cefa;">","</span> (<span style="color: #7fffd4;">:mean</span> reading))
            (<span style="color: #7fffd4;">print</span> <span style="color: #87cefa;">",null"</span>)))
        (<span style="color: #7fffd4;">println</span> <span style="color: #87cefa;">""</span>)))))

(<span style="color: #ccffcc;">spit</span> output (emit-dataset result))
</pre>
<p>As you can see, this little helper just outputs a CSV file, but it does so calling out the :uid on each reading. To get that bit of info I changed the reader, but I won’t go through it here; if you’re interested you can read the source on <a href="http://github.com/LauJensen/Climate/blob/master/src/climate.clj" target="_blank">Github</a>. The last line will dump the CSV in a file, so you can place that at the bottom of your main func.</p>
<h3>The 1929 stations:</h3>
<div class="wp-caption aligncenter" style="width: 560px"><a href="http://www.bestinclass.dk/wp-content/uploads/1929-individually.png"><img class=" " title="14 stations from 1929" src="http://www.bestinclass.dk/wp-content/uploads/1929-individually.png" alt="14 stations from 1929" width="550" height="250" /></a><p class="wp-caption-text">14 stations from 1929 seen individually</p></div>
<p>Despite the fact that some years are w/o readings, it’s clear that the general tendency is in direct contradiction with what Mr. Gore has shown. His graph shows a massive warmth increase throughout the 1900s, meaning that the entire graph above should be rising quite drastically. The inconvenient truth seems to be, however, that there is no significant rise in temperature.</p>
<div class="wp-caption aligncenter" style="width: 560px"><a href="http://www.bestinclass.dk/wp-content/uploads/1933-individually.png"><img class=" " title="100+ stations from 1929-33" src="http://www.bestinclass.dk/wp-content/uploads/1933-individually.png" alt="100+ stations from 1929-33" width="550" height="250" /></a><p class="wp-caption-text">100+ stations from 1929–33 seen individually</p></div>
<p><br class="spacer_" /></p>
<p>It’s hard to get a good grasp of the data when visualized like this, but reviewing over 100 stations we see that a few areas are seeing a rise in temperature while most aren’t. I did a small hack ‘n’ slash sparkline viewer to try and get a feel for the larger sets, coming up on 500 stations, and they seem to show the same tendency.</p>
<p><br class="spacer_" /></p>
<h1>Hockey Game is Over</h1>
<p><span style="font-weight: normal; font-size: 13px;">I’ve worked the official numbers from NOAA and now you’re all able to re-run the computations at home. It’s clear from the official data that the globe isn’t wildly heating up; in fact, in some places it is now cooling down. I’m always a bit skeptical when politicians try to solve the world’s problems using taxes, and indeed the proposed Carbon Tax is a dangerous idea. Whenever a human being exhales they emit CO2, so the Carbon Tax is in effect <strong>a tax on life</strong>. Activists, Greenpeace, lobbyists etc. are willing to die, but for what, humanity? No, for knowing neither Math nor Clojure.</span></p>
<p>Code here: <a href="http://github.com/LauJensen/Climate/blob/master/src/climate.clj" target="_blank">Github</a></p>
<script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_clojure.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/01/global-warming/feed/</wfw:commentRss>
		<slash:comments>56</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/01/global-warming/</feedburner:origLink></item>
		<item>
		<title>Hadoop — Feeding Reddit to Hadoop</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/VidqNxAPIgo/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/01/hadoop-feeding-reddit-to-hadoop/#comments</comments>
		<pubDate>Sat, 09 Jan 2010 22:51:42 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[leiningen]]></category>
		<category><![CDATA[reddit]]></category>
		<category><![CDATA[scraper]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=894</guid>
		<description><![CDATA[
			
				
			
		
With Hadoop installed on our lean mean Arch machine, we’re ready to fire up the first computations. Hadoop opens a world of fun with the promise of some heavy lifting and in order to feed the beast I’ve written a Reddit-scraper in just 30 lines of Clojure.  



The Task
When you start to consider the [...]]]></description>
			<content:encoded><![CDATA[
<p>With Hadoop installed on our lean mean <a href="http://www.bestinclass.dk/index.php/2010/01/hadoop-installation/" target="_blank">Arch machine</a>, we’re ready to fire up the first computations. Hadoop opens a world of fun with the promise of some heavy lifting, and in order to feed the beast I’ve written a Reddit scraper in just 30 lines of Clojure.  <span id="more-894"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>The Task</h1>
<p>When you start to consider the possibilities that come with Hadoop, you can probably think of many interesting stats you want to compute. Amazon describes the top Cluster as “equivalent of a system with <strong>15 GB of memory</strong>, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.” That certainly opens up a few options which I don’t have on my little laptop.</p>
<p>For my first experiment I want to scrape some data from Reddit’s programming channels, to give me a better overview of what makes the development community there tick. Then maybe for kicks I can try to do a second blog post which gets a thousand upvotes, but more on social engineering later.</p>
<p><br class="spacer_" /></p>
<h1>The Scraper</h1>
<p>Scraping Reddit is pretty simple. They have been kind enough to let us download each page as JSON simply by appending “.json” to the URL. For some obscure reason, they’ve also tagged each item with a hidden ID (it shows up in the JSON, but that’s it) that you have to insert into the URL as the value of ‘after’, e.g.</p>
<blockquote><p>http://reddit.com/r/programming/.json?count=25&amp;after=t3_hiddenid</p>
</blockquote>
<p>First let’s get a page to see what it looks like:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">{<span style="color: #87cefa;">"kind"</span>: <span style="color: #87cefa;">"Listing"</span>,
 <span style="color: #87cefa;">"data"</span>: {
         <span style="color: #87cefa;">"after"</span>: <span style="color: #87cefa;">"t3_an6md"</span>,
         <span style="color: #87cefa;">"children"</span>: [{
                     <span style="color: #87cefa;">"kind"</span>: <span style="color: #87cefa;">"t3"</span>,
                     <span style="color: #87cefa;">"data"</span>: {
                             <span style="color: #87cefa;">"domain"</span>: <span style="color: #87cefa;">"dadhacker.com"</span>,
                             <span style="color: #87cefa;">"media_embed"</span>: {},
                             <span style="color: #87cefa;">"subreddit"</span>: <span style="color: #87cefa;">"programming"</span>,
                             <span style="color: #87cefa;">"selftext_html"</span>: <span style="color: #98fb98;">null</span>,
                             <span style="color: #87cefa;">"selftext"</span>: <span style="color: #87cefa;">""</span>,
                             <span style="color: #87cefa;">"likes"</span>: <span style="color: #98fb98;">null</span>,
                             <span style="color: #87cefa;">"saved"</span>: <span style="color: #98fb98;">false</span>,
                             <span style="color: #87cefa;">"id"</span>: <span style="color: #87cefa;">"ang32"</span>,
                             <span style="color: #87cefa;">"clicked"</span>: <span style="color: #98fb98;">false</span>,
                             <span style="color: #87cefa;">"author"</span>: <span style="color: #87cefa;">"SicSemperTyrannosaur"</span>,
                             <span style="color: #87cefa;">"media"</span>: <span style="color: #98fb98;">null</span>,
                             <span style="color: #87cefa;">"score"</span>: 330,
                             <span style="color: #87cefa;">"over_18"</span>: <span style="color: #98fb98;">false</span>,
                             <span style="color: #87cefa;">"hidden"</span>: <span style="color: #98fb98;">false</span>,
                             <span style="color: #87cefa;">"thumbnail"</span>: <span style="color: #87cefa;">""</span>,
                             <span style="color: #87cefa;">"subreddit_id"</span>: <span style="color: #87cefa;">"t5_2fwo"</span>,
                             <span style="color: #87cefa;">"downs"</span>: 223,
                             <span style="color: #87cefa;">"name"</span>: <span style="color: #87cefa;">"t3_ang32"</span>,
                             <span style="color: #87cefa;">"created"</span>: 1263019161.0,
                             <span style="color: #87cefa;">"url"</span>: <span style="color: #87cefa;">"http://www.dadhacker.com/blog/?p=1193"</span>,
                             <span style="color: #87cefa;">"title"</span>: <span style="color: #87cefa;">"DadHacker: Things I am not allowed to do any more"</span>,
                             <span style="color: #87cefa;">"created_utc"</span>: 1263019161.0,
                             <span style="color: #87cefa;">"num_comments"</span>: 112,
                             <span style="color: #87cefa;">"ups"</span>: 553}},{
                 ....
</pre>
<p>You can get the JSON file yourself <a href="http://reddit.com/r/programming/.json" target="_blank">here</a>, and you’ll notice that the raw data looks a little different from what I pasted just above, but that’s only because I let a little Emacs keyboard macro loose on it. The thing to notice is the structure, which is “data” -&gt; “children” -&gt; a sequence of 25 “data” items.</p>
<pre>\data
---after
---children
------data
------data
------data
</pre>
<p>Each of those final data items is an actual post on Reddit with all its individual properties. The top-level ‘after’ tag also contains the hidden ID mentioned above, but more importantly it’s “null” when there are no more subpages.</p>
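<p>As a minimal sketch of how ‘after’ can drive the paging, assume a page has already been parsed into nested maps with string keys. The names sample-page and next-page-url are made up for this illustration, the ‘after’ value is used as-is, and the count parameter is left out to keep it short:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">;; Hypothetical stand-in for one parsed listing page (string keys, as in the JSON above).
(def sample-page
  {"kind" "Listing"
   "data" {"after"    "t3_an6md"
           "children" [{"kind" "t3", "data" {"id" "ang32", "ups" 553}}]}})

(defn next-page-url
  "Returns the URL of the next subpage, or nil when 'after' is null."
  [base page]
  (when-let [after (get-in page ["data" "after"])]
    (str base ".json?after=" after)))

(next-page-url "http://reddit.com/r/programming/" sample-page)
;; => "http://reddit.com/r/programming/.json?after=t3_an6md"

(next-page-url "http://reddit.com/r/programming/" {"data" {"after" nil}})
;; => nil
</pre>
<p>Because of when-let the caller gets nil back, and can stop, as soon as there are no further subpages.</p>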
<p>To extract the first data item, you could do this:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;"><span style="color: #afeeee; font-weight: bold;">user> </span><span style="font-weight: bold;">(<span style="color: #ccffcc;">first</span> ((json-as-string "data") "children"))</span>
{"<span style="color: #00ccff;">data</span>" {"<span style="color: #ffcc00;">domain</span>" "dadhacker.com", "<span style="color: #ffcc00;">media</span>" nil, "<span style="color: #ffcc00;">clicked</span>" false, "<span style="color: #ffcc00;">saved</span>" false,
"<span style="color: #ffcc00;">created</span>" 1.263019161E9, "<span style="color: #ffcc00;">hidden</span>" false, "<span style="color: #ffcc00;">author</span>" "SicSemperTyrannosaur",
"<span style="color: #ffcc00;">name</span>" "t3_ang32", "<span style="color: #ffcc00;">thumbnail</span>" "", "<span style="color: #ffcc00;">num_comments</span>" 112, "<span style="color: #ffcc00;">created_utc</span>" 1.263019161E9,
"<span style="color: #ffcc00;">url</span>" "http://www.dadhacker.com/blog/?p=1193", "<span style="color: #ffcc00;">downs</span>" 223, "<span style="color: #ffcc00;">selftext_html</span>" nil,
"<span style="color: #ffcc00;">over_18</span>" false, "<span style="color: #ffcc00;">score</span>" 330, "<span style="color: #ffcc00;">ups</span>" 553,
"<span style="color: #ffcc00;">title</span>" "DadHacker: Things I am not allowed to do any more", "<span style="color: #ffcc00;">selftext</span>" "",
"<span style="color: #ffcc00;">id</span>" "ang32", "<span style="color: #ffcc00;">subreddit_id</span>" "t5_2fwo", "<span style="color: #ffcc00;">likes</span>" nil, "<span style="color: #ffcc00;">subreddit</span>" "programming", "<span style="color: #ffcc00;">media_embed</span>" {}}, "<span style="color: #00ccff;">kind</span>" "t3"}
</pre>
<p>If you follow through on that thought, you could move through all the ‘data’ items using a for-loop (list comprehension), and at each stop extract:</p>
<blockquote><p>{:<strong><span style="color: #ff0000;">id</span></strong> (<span style="color: #ccffcc;"><strong><span style="color: #003366;">item</span></strong></span> “<strong><span style="color: #ff0000;">id</span></strong>”) :<span style="color: #ff0000;"><strong>url</strong></span> (<span style="color: #003366;"><strong>item</strong></span> “<span style="color: #ff0000;"><strong>url</strong></span>”) :<span style="color: #ff0000;"><strong>author</strong></span> (<span style="color: #003366;"><strong>item</strong></span> “<span style="color: #ff0000;"><strong>author</strong></span>”)}</p>
<p style="text-align: right;"><em>(repetitive patterns colored red/blue)</em></p>
</blockquote>
<p>When you are accustomed to Lisp (any Lisp) and you see that kind of repetitive action, you can be sure it’s a chance to optimize. In our case we’re writing each key twice (“author” and :author), and we’re repeatedly writing “item”. Using Clojure’s powerful zipmap we can eliminate all repetition:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">parse-page</span> [url]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [page (read-json-string (download-url url))
        ks   [<span style="color: #7fffd4;">:id :title</span> <span style="color: #7fffd4;">:domain</span> <span style="color: #7fffd4;">:author</span> <span style="color: #7fffd4;">:ups</span> <span style="color: #7fffd4;">:downs</span> <span style="color: #7fffd4;">:subreddit</span> <span style="color: #7fffd4;">:num_comments</span>]
        vmap (<span style="color: #7fffd4;">vec</span> (<span style="color: #afeeee; font-weight: bold;">for</span> [child ((page <span style="color: #87cefa;">"data"</span>) <span style="color: #87cefa;">"children"</span>)]
                    (zipmap ks (<span style="color: #7fffd4;">map</span> #((child <span style="color: #87cefa;">"data"</span>) (<span style="color: #7fffd4;">name</span> %)) ks))))]
    (<span style="color: #afeeee; font-weight: bold;">if-not</span> (<span style="color: #7fffd4;">nil?</span> ((page <span style="color: #87cefa;">"data"</span>) <span style="color: #87cefa;">"after"</span>))
      vmap
      [vmap <span style="color: #7fffd4;">:done</span>])))
</pre>
<p>Pass that a URL and in return you get the data you want, attached to keys in hashmaps wrapped in a vector. The final expression tells us when we’ve run out of subpages, so that we can abort the scrape even if we haven’t reached the ‘max’ number of subpages. <strong>**Caution**</strong>: Because the ID of the last post is carried over in the URL, you have to query for the :id; the rest are optional.</p>
<p>That was the page processing, so the two details we still need are:</p>
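<p>To see what the zipmap line does in isolation, here is a small sketch run against a hand-written item. The item map and the name pick-keys are invented for illustration; zipmap, map and name are the same core functions used in parse-page:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">;; Hypothetical post data with string keys, shaped like one "data" entry.
(def item {"id" "ang32", "author" "SicSemperTyrannosaur", "ups" 553, "downs" 223})

(defn pick-keys
  "Builds a keyword-keyed map holding only the wanted fields of a string-keyed map."
  [ks m]
  ;; (name :id) gives "id", which is then looked up in the string-keyed map.
  (zipmap ks (map #(m (name %)) ks)))

(pick-keys [:id :author :ups] item)
;; => {:id "ang32", :author "SicSemperTyrannosaur", :ups 553}
</pre>
<p>Each keyword is written once, and adding a field to the result means adding one keyword to the vector.</p>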
<ol>
<li>Emitting the data in a way which Hadoop can read</li>
<li>Walking through all subpages, accumulating data</li>
</ol>
<p>Starting with (1), we have a lot of freedom. Hadoop will run a Clojure job, so we can read the data any way we want. For now I’ll just walk through the data and print it as a string. The data is structured as a vector of pages, every page being a vector of items, and we need to call ‘prn’ on every item:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">emit-results</span> [outfile data]
  ((<span style="color: #afeeee; font-weight: bold;">if</span> (.isFile (java.io.File. outfile)) append-spit spit)
   outfile (<span style="color: #7fffd4;">with-out-str</span> (<span style="color: #afeeee; font-weight: bold;">->></span> (<span style="color: #7fffd4;">butlast</span> data)
                              (<span style="color: #7fffd4;">map</span> #(<span style="color: #afeeee; font-weight: bold;">doall</span> (<span style="color: #7fffd4;">map</span> prn %)))
                              <span style="color: #ccffcc;">doall</span>))))</pre>
<p>The ‘if’ expression just checks whether the file already exists to determine if we’re appending or writing a new file. The rest should make sense when you consider the data’s structure. The reason I remove the last item is that parse-page appends a :done, which we don’t want hanging on to every channel’s data. Be aware that butlast discards the whole final page, not just the :done marker.</p>
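<p>The resulting format is simply one printed map per line. A minimal sketch of that, using only clojure.core and a made-up two-page data vector (the names pages and as-lines are assumptions, and nothing is written to a file here):</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">;; Hypothetical scrape result: a vector of pages, each page a vector of item maps.
(def pages [[{:id "a1", :ups 3} {:id "a2", :ups 5}]
            [{:id "b1", :ups 8}]])

(defn as-lines
  "Prints every item of every page with prn and returns the text as one string."
  [data]
  (with-out-str
    (doseq [page data, item page]
      (prn item))))

(print (as-lines pages))
;; {:id "a1", :ups 3}
;; {:id "a2", :ups 5}
;; {:id "b1", :ups 8}

;; Each line reads back into the same map.
(read-string "{:id \"a1\", :ups 3}")
;; => {:id "a1", :ups 3}
</pre>
<p>I used doseq here instead of the nested map/doall because printing is a pure side effect; the text that ends up in the string is the same kind of output.</p>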
<p>Now for the accumulation and parsing of all subpages I’ll define a scraper which takes a maximum number of subpages, an output file for the dataset and finally the name of the channel we want to rip. For convenience I’ve pasted the <strong>entire</strong> program, so you can run it at home:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #7fffd4;">use</span> 'clojure.contrib.json.read
     'clojure.contrib.duck-streams)

(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">download-url</span> [url]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [s (.openStream (java.net.URL. url))]
    (<span style="color: #7fffd4;">apply</span> str
           (<span style="color: #7fffd4;">map</span> #(<span style="color: #7fffd4;">char</span> %) (<span style="color: #7fffd4;">take-while</span> pos? (<span style="color: #7fffd4;">repeatedly</span> #(.read s)))))))

(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">parse-page</span> [url]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [page (read-json-string (download-url url))
        ks   [<span style="color: #7fffd4;">:id :title</span> <span style="color: #7fffd4;">:domain</span> <span style="color: #7fffd4;">:author</span> <span style="color: #7fffd4;">:ups</span> <span style="color: #7fffd4;">:downs</span> <span style="color: #7fffd4;">:subreddit</span> <span style="color: #7fffd4;">:num_comments</span>]
        vmap (<span style="color: #7fffd4;">vec</span> (<span style="color: #afeeee; font-weight: bold;">for</span> [child ((page <span style="color: #87cefa;">"data"</span>) <span style="color: #87cefa;">"children"</span>)]
                    (zipmap ks (<span style="color: #7fffd4;">map</span> #((child <span style="color: #87cefa;">"data"</span>) (<span style="color: #7fffd4;">name</span> %)) ks))))]
    (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">nil?</span> ((page <span style="color: #87cefa;">"data"</span>) <span style="color: #87cefa;">"after"</span>))
      [vmap <span style="color: #7fffd4;">:done</span>] vmap)))

(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">emit-results</span> [outfile data]
  ((<span style="color: #afeeee; font-weight: bold;">if</span> (.isFile (java.io.File. outfile)) append-spit spit)
   outfile (<span style="color: #7fffd4;">with-out-str</span> (<span style="color: #afeeee; font-weight: bold;">->></span> (<span style="color: #7fffd4;">butlast</span> data)
                              (<span style="color: #7fffd4;">map</span> #(<span style="color: #afeeee; font-weight: bold;">doall</span> (<span style="color: #7fffd4;">map</span> prn %)))
                              doall))))

(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">scrape-channel</span> [max target channel]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [base   (<span style="color: #7fffd4;">format</span> <span style="color: #87cefa;">"http://reddit.com/r/%s/"</span> channel)
        page   (<span style="color: #7fffd4;">fn</span> [data idx]
                 (<span style="color: #7fffd4;">str</span> base <span style="color: #87cefa;">".json?count="</span> idx <span style="color: #87cefa;">"&amp;after=t3_"</span>
                      (<span style="color: #afeeee; font-weight: bold;">-></span> data peek peek <span style="color: #7fffd4;">:id</span>)))
        scrape (<span style="color: #7fffd4;">reduce</span> (<span style="color: #7fffd4;">fn</span> [data idx]
                         (<span style="color: #afeeee; font-weight: bold;">if</span> (<span style="color: #7fffd4;">=</span> <span style="color: #7fffd4;">:done</span> (<span style="color: #afeeee; font-weight: bold;">-></span> data peek peek))
                           data
                           (<span style="color: #7fffd4;">conj</span> data (parse-page (page data idx)))))
                       [(parse-page (<span style="color: #7fffd4;">str</span> base <span style="color: #87cefa;">".json"</span>))]
                       (<span style="color: #7fffd4;">take</span> max (<span style="color: #7fffd4;">iterate</span> #(<span style="color: #7fffd4;">+</span> 25 %) 0)))]
    (emit-results target scrape)))</pre>
<p>Originally I wrote that as a loop/recur, but as my old buddy Meikel pointed out: most loops can/should be implemented as a reduce. If you’re really digging in, consider adding a Thread/sleep to give poor Reddit some time for recovery and perhaps avoid an IP blacklisting as well. I didn’t come across any query limits when reading the API. (<strong>**<span style="color: #ff0000;">update</span>**</strong>): <strong><em>An official from Reddit has asked that sleeps be inserted into the code, so please add (Thread/sleep 200) or something similarly appropriate as the first line of parse-page</em></strong>.</p>
<p>Letting this puppy loose on your favorite channels looks like so:</p>
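<p>One way to add the pause without editing parse-page itself is a small wrapper. This is only a sketch: the name throttled is hypothetical, and it assumes the wrapped function takes a single argument, as parse-page does:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(defn throttled
  "Returns a one-argument function that sleeps delay-ms milliseconds, then calls f."
  [delay-ms f]
  (fn [x]
    (Thread/sleep (long delay-ms))
    (f x)))

;; Usage sketch with the scraper above (not run here):
;; (def polite-parse-page (throttled 200 parse-page))

((throttled 200 identity) "http://reddit.com/r/programming/.json")
;; => "http://reddit.com/r/programming/.json"   (returned after a 200 ms pause)
</pre>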
<p style="text-align: center;"><a href="http://www.bestinclass.dk/wp-content/uploads/scraper-doseq.png"><img class="aligncenter size-full wp-image-905" title="Reddit Scraper" src="http://www.bestinclass.dk/wp-content/uploads/scraper-doseq.png" alt="" width="512" height="400" />(click to enlarge)</a></p>
<p><br class="spacer_" /></p>
<h1>The Data</h1>
<p>Now that we’re sitting with almost unlimited insight into the posts which make Redditors tick, we can think of many stats that would be fun to compute. Since this is a tutorial I’ll go with the simplest version, i.e. something like calculating the total number of upvotes per domain/author. For a future experiment it would be fun to pull out the top authors/posts and also scrape the URLs they link to, categorizing them by content length, keywords, number of graphical elements etc., just to get the recipe for a successful post.</p>
<p><br class="spacer_" /></p>
<h1>Hadoop-De-Doop</h1>
<p>To interface with Hadoop we need to compile a JAR file, which is then passed to Hadoop as a job. Hadoop handles the distribution and computation. If it weren’t for two interesting contributions to the Clojure community, I would have to take you through a gruesome exercise in Java interop. Fortunately for us, the team behind <a href="http://www.flightcaster.com" target="_blank">FlightCaster</a> has released <a href="http://github.com/bradford/crane" target="_blank">Crane</a>, which is their home-grown tool for Hadoop jobs (platform specific), and Stuart Sierra has released <a href="http://github.com/stuartsierra/clojure-hadoop" target="_blank">Clojure-Hadoop</a>, which simplifies job creation substantially. For this post I’ll run with Stuart’s lib first and hopefully find enough room for improvement that I can fork it, extend it and give it a proper name like Hadoop-de-Doop or Cladoop.</p>
<p><br class="spacer_" /></p>
<h1>Creating The Job</h1>
<p>To create a job, we need to set up a project which we can build. For this we have several options:</p>
<ol>
<li><a href="http://www.socaldims.com/Giant%20Ant.jpg" target="_blank">Ant</a></li>
<li><a href="http://github.com/technomancy/leiningen" target="_blank">Leiningen</a></li>
<li><a href="http://kotka.de/blog/2009/12/Clojuresque_1.2.0_released.html" target="_blank">Clojuresque</a></li>
</ol>
<p>For Clojure projects both Leiningen and Clojuresque should work almost equally well. I think Clojuresque has a small advantage in its ant-task interop, but I don’t know for sure. For kicks I’ll go with Leiningen this time ’round.</p>
<p>You need to set up a directory structure like so:</p>
<pre>haddit/project.clj
haddit/src/haddit.clj
haddit/lib/clojure-hadoop-1.0-SNAPSHOT.jar
haddit/lib/clojure-hadoop-1.0-SNAPSHOT-job.jar
</pre>
<p>I started out by wasting some time depending on clojure-hadoop from <a href="http://www.clojars.org" target="_blank">Clojars</a>, but as I learned, that .jar file is broken (which is often the case with Clojars?). Therefore you need to clone the <a href="http://github.com/stuartsierra/clojure-hadoop" target="_blank">Git repo</a> and build it as Stuart instructs in the readme. The project.clj is what Leiningen uses to handle the compilation process, so you can set that up first:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(defproject haddit <span style="color: #87cefa;">"0.0.1"</span>
  <span style="color: #7fffd4;">:description</span>      <span style="color: #87cefa;">"Unifying Clojure/Hadoop power!"</span>
  <span style="color: #7fffd4;">:url</span>              <span style="color: #87cefa;">"http://www.bestinclass.dk"</span>
  <span style="color: #7fffd4;">:library-path</span>     <span style="color: #87cefa;">"lib/"</span>
  <span style="color: #7fffd4;">:namespaces</span>       [haddit]
  <span style="color: #7fffd4;">:dependencies</span>     [[org.clojure/clojure <span style="color: #87cefa;">"1.1.0-alpha-SNAPSHOT"</span>]
                     [org.apache.hadoop/hadoop-core <span style="color: #87cefa;">"0.20.2-dev"</span>]])
</pre>
<p>As I mentioned above I couldn’t handle all my dependencies directly through Leiningen, so to get around that I manually link the “lib/” directory and put clojure-hadoop in there. Now we can get to the fun part, namely the map-reduce job.</p>
<p>Map-Reduce jobs differ from Clojure’s map/reduce functions in that they work on key/value pairs and return them as well, whereas Clojure’s reduce returns a single item. There’s also the added trickery of data types: Hadoop comes with its own set, which doesn’t naturally work with Clojure’s many functions for data munging.</p>
<p>CH (Clojure-Hadoop) has defined map-reduce readers in wrap.clj, which come with the promise of letting us work purely in Clojure, so let’s try that out first. Create a new file at haddit/src/haddit.clj and start by declaring the namespace and doing the necessary imports:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">ns</span> haddit
  (<span style="color: #7fffd4;">:gen-class</span>)
  (<span style="color: #7fffd4;">:require</span> [clojure-hadoop.wrap <span style="color: #7fffd4;">:as</span> wrap]
            [clojure-hadoop.defjob <span style="color: #7fffd4;">:as</span> defjob])
  (<span style="color: #7fffd4;">:import</span>  [java.io BufferedReader InputStreamReader]))
</pre>
<p>It used to be the case that if you wanted a stand-alone jar which you could run with java -jar myjar.jar, you had to start out with (:gen-class :main true), but I’ve learned that this is no longer the case. If you want this jar to be executable, just add “:main haddit” to your project.clj and compile with ‘lein uberjar’.</p>
<p>With the formalities in order we can move on to defining both our mapper and reducer. Since we dumped all our data using the printed representation of a hashmap, the idea is that using CH we can have our mapper work on each line individually seeing it as a hashmap. So to get the ball rolling, let’s say that we want to use the domain name as key and its ‘up votes’ as the value, that way we can sum up all the up-votes each domain got. The mapper is then:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">mapper</span> [key value]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [{<span style="color: #7fffd4;">:keys</span> [domain ups]} value]
    [[domain ups]]))
</pre>
<p>I get in a hash-map with all the data I made available in the scraper and I extract the 2 keys I want to look at, domain and ups. They are then fed back to the Clojure-writer function which looks in the outer vector for key/value pairs. It’ll find one where the domain is the key and ups is the value.</p>
<p>When the reducer is passed this data, it’ll get one domain name as the key and all of the ‘ups’ which have been found for that domain name as values-fn. When you call values-fn, you get the values. The only trick is then to get the sum of all the ‘ups’, i.e. the values:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(<span style="color: #afeeee; font-weight: bold;">defn</span> <span style="color: #7fffd4; font-weight: bold;">reducer</span> [key values-fn]
  (<span style="color: #afeeee; font-weight: bold;">let</span> [values  (values-fn)]
    [[key (<span style="color: #7fffd4;">reduce</span> + values)]]))
</pre>
<p>The return type is similar to the mapper’s, and you see me first evaluating the values and then calling reduce + on them, treating them like any other Clojure structure. But this is still theoretical, so let’s prepare the test run. Hadoop needs some info about the job, which we can elegantly define using defjob:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">(defjob/defjob   job
  <span style="color: #7fffd4;">:map</span>           mapper
  <span style="color: #7fffd4;">:reduce</span>        reducer
  <span style="color: #7fffd4;">:map-reader</span>    wrap/clojure-map-reader
  <span style="color: #7fffd4;">:inputformat</span> <span style="color: #7fffd4;">:text</span>)
</pre>
<p>Check out wrap.clj from CH to see the other readers which Stuart has made available for us. The clojure-map-reader does what the name leads you to think: it calls read-string on each line, giving us the Clojure datatype. The inputformat is set to :text, which is Hadoop’s default. That ensures that the input file is split by Hadoop on newlines, and the byte offset thus becomes the key.</p>
<p>Now we need to compile the thing, and if you’ve set everything up as I’ve outlined here you compile it like so:</p>
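<p>The mapper and reducer can also be exercised in plain Clojure, without Hadoop. The sketch below is an in-memory imitation of the flow, not Clojure-Hadoop’s actual machinery: the function simulate-job and the sample lines are invented here, the byte-offset key is faked with the line index, and a sorted-map stands in for Hadoop’s ordering of keys:</p>
<pre style="color: #bebebe; background-color: #262626; font-weight: bold; font-size: 8pt;">;; Same mapper and reducer as above, repeated so the sketch is self-contained.
(defn mapper [key value]
  (let [{:keys [domain ups]} value]
    [[domain ups]]))

(defn reducer [key values-fn]
  [[key (reduce + (values-fn))]])

(defn simulate-job
  "Feeds text lines through mapper and reducer in memory, imitating one map-reduce pass."
  [lines]
  (let [;; read each line into a Clojure map, keyed by a fake offset
        records (map (fn [offset line] [offset (read-string line)])
                     (iterate inc 0) lines)
        ;; map phase: every record yields zero or more [key value] pairs
        pairs   (mapcat (fn [[k v]] (mapper k v)) records)
        ;; shuffle phase: collect the values per key, with keys kept sorted
        grouped (reduce (fn [acc [k v]] (assoc acc k (conj (get acc k []) v)))
                        (sorted-map)
                        pairs)]
    ;; reduce phase: one reducer call per key, values handed over as a function
    (vec (mapcat (fn [[k vs]] (reducer k (fn [] vs))) grouped))))

(simulate-job ["{:domain \"dadhacker.com\", :ups 553}"
               "{:domain \"bestinclass.dk\", :ups 10}"
               "{:domain \"dadhacker.com\", :ups 7}"])
;; => [["bestinclass.dk" 10] ["dadhacker.com" 560]]
</pre>
<p>This catches typos in the destructuring or the summing before you spend time on building the jar.</p>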
<pre class="sh_sh" name="code">$ cd haddit/
$ lein uberjar
[INFO] snapshot org.clojure:clojure:1.1.0-alpha-SNAPSHOT: checking for updates from central
[INFO] snapshot org.clojure:clojure:1.1.0-alpha-SNAPSHOT: checking for updates from clojure-snapshots
[INFO] snapshot org.clojure:clojure:1.1.0-alpha-SNAPSHOT: checking for updates from clojars
Compiling haddit
Including haddit.jar
Including ant-launcher-1.7.0.jar
Including commons-cli-1.2.jar
Including commons-logging.jar
Including clojure-hadoop-1.0-SNAPSHOT.jar
Including ant-1.7.0.jar
....
</pre>
<p>After about 30 more lines of inclusions, you’ll get haddit-standalone.jar out. If lein fails because of some file named src/#.something, it’s because you haven’t saved your Emacs file and it’s holding a lock on it. This is a good failsafe.</p>
<p>To avoid spending unnecessary time on the Hadoop server, we start by testing the job locally:</p>
<pre class="sh_sh" name="code">$ java -Xmx1024m -Xms512m -cp haddit-standalone.jar clojure_hadoop.job -job haddit/job -input data/dataset -output out
10/01/11 11:03:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/01/11 11:03:05 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/11 11:03:05 INFO mapred.JobClient: Running job: job_local_0001
10/01/11 11:03:05 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/11 11:03:05 INFO mapred.MapTask: numReduceTasks: 1
</pre>
<p>After about 20 more lines the job will hopefully stop with no errors and the JobClient will print out some stats. The result is now stored in the out/ directory in a Hadoop sequence file which you can read+pipe like so:</p>
<pre class="sh_sh" name="code">$ java -cp haddit-standalone.jar org.apache.hadoop.fs.FsShell -text out/part-00000 >> rawtext
10/01/11 11:05:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/01/11 11:05:18 INFO compress.CodecPool: Got brand-new decompressor
10/01/11 11:05:18 INFO compress.CodecPool: Got brand-new decompressor
10/01/11 11:05:18 INFO compress.CodecPool: Got brand-new decompressor
10/01/11 11:05:18 INFO compress.CodecPool: Got brand-new decompressor
</pre>
<p>And if you open the file ‘rawtext’, you’ll see that every domain from our scrape now has an integer next to it, indicating how many total upvotes that domain has, sorted alphabetically. If everything looks good, start up your Hadoop server and launch the job on it:</p>
<pre class="sh_sh" name="code">[hadoop@myhost]$ scp youruser@192.168.ur.ip:/home/you/scraper/dataset .
[hadoop@myhost]$ scp youruser@192.168.ur.ip:/home/you/haddit/haddit-standalone.jar .
[hadoop@myhost]$ hadoop fs -put dataset dataset
[hadoop@myhost]$ hadoop jar haddit-standalone.jar clojure_hadoop.job -job haddit/job -input dataset -output hadditresult
</pre>
<p>First I download the files from the host system into the virtual box using scp, then I put the dataset on the HDFS, Hadoop’s filesystem. Now the Hadoop server starts crunching like there’s no tomorrow:</p>
<p style="text-align: center;"><a href="http://www.bestinclass.dk/wp-content/uploads/hadoopcli.jpg"><img class="aligncenter size-full wp-image-910" title="hadoopcli" src="http://www.bestinclass.dk/wp-content/uploads/hadoopcli.jpg" alt="" width="410" height="109" />(click to enlarge)</a></p>
<p>In addition to showing progress on the CLI you can also keep track of your jobs using the JobTracker WebUI:</p>
<p><a href="http://www.bestinclass.dk/wp-content/uploads/hadoopwebui.png"><img class="aligncenter size-full wp-image-910" title="hadoopcli" src="http://www.bestinclass.dk/wp-content/uploads/hadoopwebui.png" alt="" width="546" height="420" /></a></p>
<p>Once both the mapper and reducer jobs are 100% complete you can retrieve the result from the HDFS like so:</p>
<pre name="code" class="sh_sh">[hadoop@myhost]$ hadoop fs -get hadditresult
</pre>
<p>That will download that folder from the HDFS and it will also contain a Hadoop Sequence file, identical to the one produced on your local test.</p>
<p><br class="spacer_" /></p>
<h1>Sorting the Set</h1>
<p>When you examine the data you’ll have one grievance: it’s sorted alphabetically and not by number of upvotes. To change that it’s helpful to look at the structure with which Hadoop works:</p>
<p style="text-align: center;"><a href="http://www.bestinclass.dk/wp-content/uploads/hadoop.png"><img class="aligncenter size-full wp-image-905" title="Reddit Scraper" src="http://www.bestinclass.dk/wp-content/uploads/hadoop.png" alt="" width="520" height="277" />(click to enlarge)</a></p>
<p><br class="spacer_" /></p>
<p>Hadoop sends a chunk of data to the reader, which sets up the type classes as you’d like and passes the data to the mapper. The mapper works on it and passes it to the writer, which returns the data to Hadoop. Hadoop, now holding the mapped data, sorts it by key before sending it further down the stream, hence the alphabetical sort you just saw.</p>
<p>If you want to sort this set by the number of upvotes, you really only need two things:</p>
<ol>
<li>Set the :inputtype to :seq</li>
<li>Make the mapper return [value key] instead of [key value], promoting the value to key.</li>
</ol>
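<p>The effect of step 2 can be tried in miniature without Hadoop. The sketch below is only an analogy on the local shell, not Clojure-Hadoop code: the three sample rows are made up, awk plays the mapper that swaps key and value, and sort -n stands in for Hadoop ordering the records by key.</p>
<pre name="code" class="sh_sh"># hypothetical sample data: name, tab, upvote count
# awk swaps the two columns so the count becomes the key; sort -n orders by that key
$ printf 'clojure\t12\nhaskell\t7\nscala\t30\n' | awk -F'\t' '{ print $2 "\t" $1 }' | sort -n
</pre>
<p>Like Hadoop with numeric keys, sort -n orders ascending, so the entry with the most upvotes ends up last.</p>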
<p>Ah, but then you’ll have a problem! Clojure-Hadoop doesn’t currently provide the configuration options needed to make the reader/writer functions handle keys of the type LongWritable. That’s perfect! Now I have a chance to do Hadoop-De-Doop! More on that later :)</p>
<p>(<strong>update</strong>: Shortly after this post, Stuart Sierra compiled version 1.0 of Clojure-Hadoop and added the missing configuration options!)</p>
<h1>Conclusion</h1>
<p>I saw a benchmark showing how Hadoop sorted 9TB of data in 1.5 hours. As data sizes continue to rise, it’s important to be able to harness the power of distributed computing. Hadoop offers an extremely friendly and powerful way of doing that. With much of the Java interop hidden inside Clojure-Hadoop, we as Clojurians can keep doing what we do best: writing beautiful, functional, highway-tearing, forest-burning, bytecode-blazing, run-like-the-wind code.</p>
<p><br class="spacer_" /></p>
<script type="text/javascript" src="/wp-content/plugins/shjs-syntax-hiliter/shjs/lang/sh_sh.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/01/hadoop-feeding-reddit-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/01/hadoop-feeding-reddit-to-hadoop/</feedburner:origLink></item>
		<item>
		<title>Hadoop — Installation</title>
		<link>http://feedproxy.google.com/~r/bestinclass-the-blog/~3/aKgkvwLducc/</link>
		<comments>http://www.bestinclass.dk/index.php/2010/01/hadoop-installation/#comments</comments>
		<pubDate>Wed, 06 Jan 2010 13:31:04 +0000</pubDate>
		<dc:creator>Lau</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[arch]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[installation]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.bestinclass.dk/?p=877</guid>
		<description><![CDATA[
Since we’ve had so much fun with multiple cores running at once, how about upping the game to play with multiple servers? Hadoop is a framework for distributed computing, which lets us process jobs on multiple servers at once giving more power *grunt*. In this first post I’ll run through how to set up your [...]]]></description>
			<content:encoded><![CDATA[
<p>Since we’ve had so much fun with multiple cores running at once, how about upping the game to play with multiple servers? Hadoop is a framework for distributed computing, which lets us process jobs on multiple servers at once giving <strong>more power</strong> *grunt*. In this first post I’ll run through how to set up your first Hadoop server running in a VirtualBox using Arch.</p>
<p><span id="more-877"></span></p>
<p><br class="spacer_" /></p>
<p><br class="spacer_" /></p>
<h1>Why Arch?</h1>
<p>I’m doing these experiments on my tiny MacBook Pro laptop, so I want my Linux installation in the VBox to be as lean and clean as possible. Arch strikes a perfect balance between functionality and bloat, and for something as simple as running a Hadoop server it’s very easy to set up.</p>
<p>I think it’s a beautiful thing when a cleanly installed Linux replies “No entries” to “netstat -lnput” after installation. Arch lets you build your system from the ground up, and although that takes a little longer than Ubuntu, it might just make for a better end result.</p>
<p><br class="spacer_" /></p>
<h1>Why Hadoop?</h1>
<p>Clojure is an excellent language for writing data parsers et al., so what could be more fun than taking our regular code and running it on a multiserver network? In industry, many tasks are of such dimensions that it’s pointless to run them on a single server, so if you have something like Flightcaster in mind, you need to get comfortable with distributed computing. Secondly, it’s Java-based, which makes getting my hands all the way from Clojure into the engine room very doable.</p>
<p>Also worth mentioning is that there are already a couple of Clojure interfaces out in the open. As most people know, the crew behind Flightcaster released Crane, and Stuart Sierra released the creatively named clojure-hadoop library.</p>
<p><br class="spacer_" /></p>
<h1>The Installation</h1>
<p>Thanks to the kind donations I was able to purchase Vimeo Plus, so you can now follow the screencasts in HD, hopefully giving you a clearer rendering of the text! If you know all there is to know about installing Arch and getting Hadoop up and running in pseudo-distributed mode, feel free to skip this entire post. It’s a mandatory first stop for me, to ensure that everyone can follow future experiments using Hadoop.</p>
<p>Since this is HD, double-click for fullscreen or go to the Vimeo site.</p>
<p><br class="spacer_" /></p>
<h2 style="text-align: center;">The Video (16 min)</h2>
<p style="text-align: center;"><em>(<strong>double click</strong></em><em> for full-screen — if you’re not seeing it, try hitting F5 or using Firefox)</em></p>
<p style="text-align: center;"><object width="622" height="350"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=8615628&amp;server=vimeo.com&amp;show_title=0&amp;show_byline=0&amp;show_portrait=0&amp;color=00adef&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=8615628&amp;server=vimeo.com&amp;show_title=0&amp;show_byline=0&amp;show_portrait=0&amp;color=00adef&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="622" height="350"></embed></object></p>
<p><br class="spacer_" /></p>
<h1>Configuration</h1>
<p>For your own setup, these are the things you need to change:</p>
<h3>/etc/hosts.allow</h3>
<blockquote><p>sshd: ALL: ALLOW</p>
<p>java: ALL: ALLOW</p>
</blockquote>
<h3>/etc/rc.conf   (to autostart services)</h3>
<blockquote><p>DAEMONS=(… sshd rsyncd …)</p>
</blockquote>
<h3>~/hadoop/conf/hadoop-env.sh</h3>
<blockquote><p>export JAVA_HOME=/usr/lib/jvm/java-6-openjdk</p>
</blockquote>
<h3>Hadoop XML configs</h3>
<blockquote>
<h4>~/hadoop/conf/core-site.xml</h4>
<p>Pseudo xml:  <strong>property</strong>: <strong>name</strong>: fs.default.name <strong>value</strong>: hdfs://localhost:9000</p>
<h4>~/hadoop/conf/hdfs-site.xml</h4>
<p>Pseudo xml: <strong>property</strong>: <strong>name</strong>: dfs.replication <strong>value</strong>: 1</p>
<h4>~/hadoop/conf/mapred-site.xml</h4>
<p>Pseudo xml: <strong>property</strong>: <strong>name</strong>: mapred.job.tracker <strong>value</strong>:  localhost:9001</p>
</blockquote>
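<p>Spelled out, each of those files has the same small shape. A minimal sketch that writes all three from the shell, assuming Hadoop lives in ~/hadoop as above; write_conf is a hypothetical helper, not part of Hadoop, and it overwrites whatever is already in the file:</p>
<pre name="code" class="sh_sh"># write_conf FILE NAME VALUE: hypothetical helper that writes a one-property config file
write_conf() {
  cat &gt; "$HOME/hadoop/conf/$1" &lt;&lt;EOF
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;$2&lt;/name&gt;
    &lt;value&gt;$3&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
EOF
}

# the three properties listed above, one per file
write_conf core-site.xml   fs.default.name    hdfs://localhost:9000
write_conf hdfs-site.xml   dfs.replication    1
write_conf mapred-site.xml mapred.job.tracker localhost:9001
</pre>
<p>Editing the files by hand works just as well; the helper only saves typing the same wrapper three times.</p>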
<p>All of the XML configuration files are 6 lines long — I hope everybody is cool with that :)</p>
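<p>With the files in place, bringing up the pseudo-distributed node usually comes down to the commands below. This is a sketch assuming a Hadoop 0.20-style layout under ~/hadoop, so check the scripts in your own bin directory. Formatting the namenode wipes any existing HDFS metadata, so it is only for a fresh install.</p>
<pre name="code" class="sh_sh"># one-time: initialise an empty HDFS
[hadoop@myhost]$ ~/hadoop/bin/hadoop namenode -format
# start the HDFS and MapReduce daemons (uses ssh to localhost, hence sshd above)
[hadoop@myhost]$ ~/hadoop/bin/start-all.sh
# list running JVMs; NameNode, DataNode, JobTracker and TaskTracker should be among them
[hadoop@myhost]$ jps
</pre>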
<p><br class="spacer_" /></p>
<h1>Next Up</h1>
<p>This was the obligatory step which we just have to get over with. The next step is making/using some kind of Clojure interface to Hadoop in order to run jobs on it. Stay tuned for round #2.</p>
<p><br class="spacer_" /></p>

]]></content:encoded>
			<wfw:commentRss>http://www.bestinclass.dk/index.php/2010/01/hadoop-installation/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.bestinclass.dk/index.php/2010/01/hadoop-installation/</feedburner:origLink></item>
	</channel>
</rss>
