<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Coderholic</title>
	
	<link>http://www.coderholic.com</link>
	<description>Addicted to Development</description>
	<lastBuildDate>Thu, 29 Oct 2009 19:45:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/coderholic" type="application/rss+xml" /><feedburner:emailServiceId>coderholic</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>Munin Popularity Plugins</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/P_EXXO0f2Ec/</link>
		<comments>http://www.coderholic.com/munin-popularity-plugins/#comments</comments>
		<pubDate>Thu, 29 Oct 2009 19:45:24 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[scripts]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=405</guid>
		<description><![CDATA[
After writing about Munin last week and mentioning some of its uses outside of system performance tracking I decided to write a collection of plugins for tracking website popularity. The full code is available on GitHub.
The plugins include:

Alexa traffic rank
Number of pages in the Google index
Technorati authority
Feedburner RSS subscribers
Twitter followers

Munin will automatically handle the generation [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone size-full wp-image-408" title="google-index" src="http://www.coderholic.com/wp-content/uploads/google-index.png" alt="google-index" width="504" height="304" /></p>
<p>After writing about <a href="http://www.coderholic.com/server-monitoring-with-munin/">Munin</a> last week and mentioning some of its uses outside of system performance tracking I decided to write a collection of plugins for tracking website popularity. The full code is available on <a href="http://github.com/coderholic/munin-popularity-plugins">GitHub</a>.</p>
<p>The plugins include:</p>
<ul>
<li>Alexa traffic rank</li>
<li>Number of pages in the Google index</li>
<li>Technorati authority</li>
<li>Feedburner RSS subscribers</li>
<li>Twitter followers</li>
</ul>
<p>Munin will automatically handle the generation of daily, weekly, monthly and yearly graphs. All of the plugins support graphing of multiple sites/accounts, so you can show several sites on the same graph, as is the case with the Google index graph above, or you can decide to show a single site/account per graph. Some example graphs are shown below:</p>
<p><img class="alignnone size-full wp-image-410" title="twitter" src="http://www.coderholic.com/wp-content/uploads/twitter.png" alt="twitter" width="503" height="280" /></p>
<p><img class="alignnone size-full wp-image-409" title="technorati" src="http://www.coderholic.com/wp-content/uploads/technorati.png" alt="technorati" width="505" height="281" /></p>
<p><img class="alignnone size-full wp-image-407" title="feedburner" src="http://www.coderholic.com/wp-content/uploads/feedburner.png" alt="feedburner" width="503" height="280" /></p>
<p><img class="alignnone size-full wp-image-406" title="alexa" src="http://www.coderholic.com/wp-content/uploads/alexa.png" alt="alexa" width="504" height="304" /></p>
<img src="http://feeds.feedburner.com/~r/coderholic/~4/P_EXXO0f2Ec" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/munin-popularity-plugins/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/munin-popularity-plugins/</feedburner:origLink></item>
		<item>
		<title>Server Monitoring with Munin</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/6PWxu6NLToc/</link>
		<comments>http://www.coderholic.com/server-monitoring-with-munin/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 21:53:56 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[scripts]]></category>
		<category><![CDATA[munin server opensource monitoring sysadmin]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=378</guid>
		<description><![CDATA[Munin is an excellent open source tool for monitoring and graphing server performance metrics. It can be configured to send out alert emails when something goes wrong with your server, and the graphs make it easy to view trends over time: You could see that your site gets much less traffic on a Sunday, for [...]]]></description>
			<content:encoded><![CDATA[<p>Munin is an excellent open source tool for monitoring and graphing server performance metrics. It can be configured to send out alert emails when something goes wrong with your server, and the graphs make it easy to view trends over time: You could see that your site gets much less traffic on a Sunday, for example, or that the number of database queries performed per day has doubled in the last 2 months.</p>
<p><img style="border: 1px solid gray;" src="/wp-content/uploads/munin-graphs1.png" alt="" /></p>
<p>On a Debian-based system installing Munin is as simple as running the following command, and then going to http://your-server/munin/ in your browser:</p>
<pre>sudo aptitude install munin</pre>
<p>Munin comes with lots of monitoring plugins by default, including those for MySQL, PostgreSQL, Apache, Tomcat, and Squid, as well as for things such as CPU and memory usage, load average, network traffic, and many more. You can also find lots of user-submitted plugins on sites like <a href="http://muninexchange.projects.linpro.no/">Munin Exchange</a>.</p>
<p>Munin doesn&#8217;t have to be used solely for monitoring server performance though. Because it is so easy to extend, Munin is also a great tool for tracking trends over time that have nothing to do with server performance. In just a few lines of code you could write plugins to track the following stats about your website:</p>
<ul>
<li>Number of user signups</li>
<li>Google PageRank</li>
<li>Pages in Google&#8217;s index</li>
<li>Number of backlinks</li>
<li>Number of Twitter mentions</li>
<li>Alexa traffic rank</li>
</ul>
<p>The <em>number of pages in Google&#8217;s index</em> is actually a plugin I&#8217;ve written. Simply put the following code in your /etc/munin/plugins directory to see it in action:</p>
<pre name="code" class="python">
#!/bin/sh
# Munin Plugin to display the number of pages in the
# google index for all of the given websites
# Ben Dowling - www.coderholic.com

# Change this to whatever sites you're interested in
websites="www.yahoo.com www.google.com www.twitter.com"

if [ "$1" = "autoconf" ]; then
        echo yes
        exit 0
fi

if [ "$1" = "config" ]; then

        echo 'graph_title Number of Pages in Google Index'
        echo 'graph_args --base 1000 -l 0 '
        echo 'graph_vlabel number of pages'
        echo 'graph_category google'
        echo 'graph_info This graph shows the number of pages in the Google index for a given website.'

        i=0
        for site in $websites
        do
                name="site_${i}"
                echo "${name}.label ${site}"
                echo "${name}.draw LINE2"
                echo "${name}.info The number of pages in the google index."
                i=$((i+1))
        done
        exit 0
fi

i=0
for site in $websites
do
        name="site_${i}"
        value=$(wget -q --user-agent=Firefox -O - "http://www.google.com/search?q=site:${site}" | grep -E "of about <b>[0-9,]+</b>" -o | grep -E "[0-9,]+" -o | sed "s/,//g")
        echo "${name}.value ${value}"

        i=$((i+1))
done
</pre>
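<p>The same plugin protocol can also be sketched in Python. This is only an illustration, not part of the plugin above: <code>fetch_page_count</code> is a hypothetical placeholder that you would replace with a real lookup, and the function names are my own. Munin runs a plugin with the <code>config</code> argument to learn about the graph, and with no arguments to read the current values.</p>
<pre name="code" class="python">
#!/usr/bin/env python
# Sketch of a multi-site Munin plugin, mirroring the shell script above.
import sys

WEBSITES = ["www.yahoo.com", "www.google.com", "www.twitter.com"]


def fetch_page_count(site):
    # Hypothetical placeholder: replace with a real lookup for the site.
    return len(site)


def config_lines(sites):
    # Lines printed when Munin runs the plugin with the "config" argument.
    lines = [
        "graph_title Number of Pages in Google Index",
        "graph_args --base 1000 -l 0",
        "graph_vlabel number of pages",
        "graph_category google",
    ]
    for i, site in enumerate(sites):
        lines.append("site_%d.label %s" % (i, site))
        lines.append("site_%d.draw LINE2" % i)
    return lines


def value_lines(sites, fetch=fetch_page_count):
    # Lines printed when Munin runs the plugin with no arguments.
    return ["site_%d.value %s" % (i, fetch(site)) for i, site in enumerate(sites)]


def main(argv):
    args = argv[1:]
    if args == ["autoconf"]:
        print("yes")
    elif args == ["config"]:
        print("\n".join(config_lines(WEBSITES)))
    else:
        print("\n".join(value_lines(WEBSITES)))


if __name__ == "__main__":
    main(sys.argv)
</pre>
<p>Keeping the field names (<code>site_0</code>, <code>site_1</code>, and so on) identical between the config and value output is what lets Munin match each value to its label.</p>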
<p>For more details about Munin see their <a href="http://munin.projects.linpro.no/">homepage</a>, which also includes detailed documentation on writing your own plugins. </p>
<p>Let me know if you can think of any more Munin plugins that could be interesting, or if you&#8217;ve used any yourself!</p>
<img src="http://feeds.feedburner.com/~r/coderholic/~4/6PWxu6NLToc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/server-monitoring-with-munin/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/server-monitoring-with-munin/</feedburner:origLink></item>
		<item>
		<title>10 More Puzzle Websites to Sharpen Your Programming Skills</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/_GqSN6wCnIc/</link>
		<comments>http://www.coderholic.com/10-more-puzzle-websites-to-sharpen-your-programming-skills/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 17:08:13 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=338</guid>
		<description><![CDATA[My recently published Six Revisions guest post, 10 Puzzle Websites to Sharpen Your Programming Skills, got a great response, hitting the front page of Hacker News, Reddit, and doing fairly well on Digg too. 
Lots of comments were left pointing out some sites which weren&#8217;t included in my list, so I&#8217;m following up here with [...]]]></description>
			<content:encoded><![CDATA[<p>My recently published Six Revisions guest post, <a href="http://sixrevisions.com/resources/10-puzzle-websites-to-sharpen-your-programming-skills/">10 Puzzle Websites to Sharpen Your Programming Skills</a>, got a great response, hitting the front page of <a href="http://news.ycombinator.com/item?id=885481">Hacker News</a>, <a href="http://www.reddit.com/r/programming/comments/9uotv/10_puzzle_websites_to_sharpen_your_programming/">Reddit</a>, and doing fairly well on <a href="http://digg.com/programming/10_Puzzle_Websites_to_Sharpen_Your_Programming_Skills">Digg</a> too. </p>
<p>Lots of comments were left pointing out some sites which weren&#8217;t included in my list, so I&#8217;m following up here with a list of 10 more top programming puzzle websites:</p>
<h3>1. <a href="http://www.codechef.com/">Code Chef</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/wwwcodechefcom-screen-capture-2009-10-17-15-14-50_thumb.png"/></p>
<p>Code Chef has lots of practice puzzles, and monthly competitions with cash prizes. The site officially supports over 35 programming languages!</p>
<h3>2. <a href="http://www.spoj.pl/">SPOJ</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/wwwspojpl-screen-capture-2009-10-17-15-14-36_thumb.png"/></p>
<p>The Sphere Online Judge contains 1871 different programming problems. More points are awarded for better performing solutions, which can be submitted in a range of languages.</p>
<h3>3. <a href="http://codegolf.com/">Code Golf</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/codegolfcom-screen-capture-2009-10-17-15-14-40_thumb.png"/></p>
<p>The aim with code golf is to submit a solution using the fewest characters possible. Solutions can be submitted in Perl, Python, PHP or Ruby.</p>
<h3>4. <a href="http://uva.onlinejudge.org/">UVa Online Judge</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/uvaonlinejudgeorg-screen-capture-2009-10-17-15-14-41_thumb.png"/></p>
<p>Over 2600 great programming puzzles, and also regular contests. Submissions in C, C++, Java or Pascal are automatically checked for you.</p>
<h3>5. <a href="http://acm.timus.ru/">Timus Online Judge</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/acmtimusru-screen-capture-2009-10-17-15-14-43_thumb.png"/></p>
<p>An online competition site that automatically checks your submissions. Supports Java, C#, Pascal, C and C++.</p>
<h3>6. <a href="http://code.google.com/codejam/">Google Code Jam</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/codegooglecom-screen-capture-2009-10-17-15-14-46_thumb.png"/></p>
<p>The Code Jam is a programming contest from Google. The top 25 contestants get to travel to Google&#8217;s HQ in California. Entries are accepted in any programming language.</p>
<h3>7. <a href="http://train.usaco.org/usacogate">USA Computing Olympiad</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/trainusacoorg-screen-capture-2009-10-17-15-14-48_thumb.png"/></p>
<p>Programming puzzles designed to provide &#8220;pre-college students with opportunities to sharpen their computer programming skills&#8221;. The puzzles are still interesting and fun even if you&#8217;ve got a CS degree!</p>
<h3>8. <a href="http://www.olympiad.org.uk/">Informatics Olympiad</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/wwwolympiadorguk-screen-capture-2009-10-17-15-14-52_thumb.png"/></p>
<p>A British version of the computing olympiad. Again aimed at school and college students, but fun and interesting for everyone.</p>
<h3>9. <a href="http://cplus.about.com/od/programmingchallenges/Programming_Challenges.htm">Programming Challenges in C, C++ and C#</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/aboutcom-screen-capture-2009-10-17-15-14-54_thumb.png"/></p>
<p>About.com&#8217;s C/C++/C# section regularly posts interesting programming puzzles. Successful solutions get acknowledged on the site once the deadline has passed.</p>
<h3>10. <a href="http://www.javabat.com/">Java Bat</a></h3>
<p><img style="border:1px solid grey;" src="/wp-content/uploads/wwwjavabatcom-screen-capture-2009-10-17-15-14-57_thumb.png"/></p>
<p>A site dedicated to practical Java programming problems. You can type your code directly into the website, and it&#8217;ll tell you whether you&#8217;ve solved the problem correctly or not.</p>
<img src="http://feeds.feedburner.com/~r/coderholic/~4/_GqSN6wCnIc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/10-more-puzzle-websites-to-sharpen-your-programming-skills/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/10-more-puzzle-websites-to-sharpen-your-programming-skills/</feedburner:origLink></item>
		<item>
		<title>MySQL table size reporting script</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/7_sQgAb11Yo/</link>
		<comments>http://www.coderholic.com/mysql-table-size-reporting-script/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 07:40:04 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[scripts]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=330</guid>
		<description><![CDATA[I&#8217;ve written a small shell script to report how much disk space each table in a given MySQL database is using. For example, below is the output of the script when run against this site&#8217;s database:

bmd /~: ./dbSize.sh coderholic root ********
wp_comments Data: 6.60MB Indexes: .15MB Total: 6.75MB
wp_links Data: 0MB Indexes: 0MB Total: 0MB
wp_options Data: 1.57MB [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve written a small shell script to report how much disk space each table in a given MySQL database is using. For example, below is the output of the script when run against this site&#8217;s database:</p>
<pre>
bmd /~: ./dbSize.sh coderholic root ********
wp_comments Data: 6.60MB Indexes: .15MB Total: 6.75MB
wp_links Data: 0MB Indexes: 0MB Total: 0MB
wp_options Data: 1.57MB Indexes: .01MB Total: 1.58MB
wp_postmeta Data: .01MB Indexes: .01MB Total: .02MB
wp_posts Data: .57MB Indexes: .02MB Total: .60MB
wp_term_relationships Data: 0MB Indexes: .01MB Total: .02MB
wp_term_taxonomy Data: 0MB Indexes: 0MB Total: 0MB
wp_terms Data: 0MB Indexes: 0MB Total: 0MB
wp_tla_data Data: 0MB Indexes: 0MB Total: 0MB
wp_tla_rss_map Data: 0MB Indexes: 0MB Total: 0MB
wp_usermeta Data: 0MB Indexes: 0MB Total: 0MB
wp_users Data: 0MB Indexes: 0MB Total: 0MB
*** 12 Tables | Data: 8.78MB Indexes: .25MB Total: 9.03MB ***
</pre>
<p>With a few small modifications I&#8217;m sure it&#8217;d be possible to get the script working with PostgreSQL or other RDBMSes. The full code is below:</p>
<pre name="code" class="python">
#!/bin/bash
# Calculate the storage space used up by all tables in a given MySQL database
# Ben Dowling - www.coderholic.com
database=$1
username=$2
password=$3

if [ ${#database} -eq 0 ]
then
	echo "Usage: $0 &lt;database&gt; [username [password]]"
	exit
fi

if [ "$password" ]
then
   password="-p$password"
fi

mysql="mysql -u $username $password $database"

$mysql -se "USE $database";

tables=$($mysql -se "SHOW TABLES")

totalData=0
totalIndex=0
totalTables=0

for table in $tables
do
   output=$($mysql -se "SHOW TABLE STATUS LIKE \"$table\"\G")
   data=$(echo "$output" | grep Data_length | awk -F': ' '{print $2}')
   dataMegs=$(echo "scale=2;$data/1048576" | bc)
   index=$(echo "$output" | grep Index_length | awk -F': ' '{print $2}')
   indexMegs=$(echo "scale=2;$index/1048576" | bc)
   total=$(($index+$data))
   totalMegs=$(echo "scale=2;$total/1048576" | bc)

   echo "$table Data: ${dataMegs}MB Indexes: ${indexMegs}MB Total: ${totalMegs}MB"

   totalData=$(($totalData+$data))
   totalIndex=$(($totalIndex+$index))
   totalTables=$(($totalTables+1))
done

dataMegs=$(echo "scale=2;$totalData/1048576" | bc)
indexMegs=$(echo "scale=2;$totalIndex/1048576" | bc)
total=$(($totalIndex+$totalData))
totalMegs=$(echo "scale=2;$total/1048576" | bc)

echo "*** $totalTables Tables | Data: ${dataMegs}MB Indexes: ${indexMegs}MB Total: ${totalMegs}MB ***"
</pre>
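<p>The size arithmetic can also be sketched in Python. This is an illustration under assumptions rather than a port of the script: the sample text only imitates the vertical output that the script greps through, its values are made up, and the function names are my own. Note that <code>bc</code> with <code>scale=2</code> truncates and drops the leading zero, whereas <code>%.2f</code> rounds, so the last digit can differ from the shell output.</p>
<pre name="code" class="python">
# Sketch: the same arithmetic as the script, applied to "Key: value" text
# like the vertical output of SHOW TABLE STATUS.
SAMPLE_OUTPUT = """\
           Name: wp_posts
    Data_length: 1048576
   Index_length: 524288
"""
# The sample values above are made up for illustration.


def parse_status(output):
    # Turn "Key: value" lines into a dictionary of strings.
    fields = {}
    for line in output.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key.strip()] = value.strip()
    return fields


def to_megabytes(num_bytes):
    # 1048576 bytes = 1 MB, as in the shell script.
    return num_bytes / 1048576.0


def table_report(output):
    fields = parse_status(output)
    data = int(fields["Data_length"])
    index = int(fields["Index_length"])
    return "%s Data: %.2fMB Indexes: %.2fMB Total: %.2fMB" % (
        fields["Name"], to_megabytes(data), to_megabytes(index),
        to_megabytes(data + index))


print(table_report(SAMPLE_OUTPUT))
# wp_posts Data: 1.00MB Indexes: 0.50MB Total: 1.50MB
</pre>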
<img src="http://feeds.feedburner.com/~r/coderholic/~4/7_sQgAb11Yo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/mysql-table-size-reporting-script/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/mysql-table-size-reporting-script/</feedburner:origLink></item>
		<item>
		<title>Parsing CSV data in Python</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/ZSyGxg0KDzc/</link>
		<comments>http://www.coderholic.com/parsing-csv-data-in-python/#comments</comments>
		<pubDate>Thu, 03 Sep 2009 21:19:51 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=310</guid>
		<description><![CDATA[Python provides the csv module for parsing comma separated value files. It allows you to iterate over each line in a csv file and gives you a list of items on that row. For example, given the following csv data:

id, name, date
0, name, 2009-01-01
1, another name, 2009-02-01

You&#8217;d end up with something like:

["id", "name", "date"],
["0", "name", [...]]]></description>
			<content:encoded><![CDATA[<p>Python provides the <a href="http://docs.python.org/library/csv.html">csv module</a> for parsing comma separated value files. It allows you to iterate over each line in a csv file and gives you a list of items on that row. For example, given the following csv data:</p>
<pre>
id, name, date
0, name, 2009-01-01
1, another name, 2009-02-01
</pre>
<p>You&#8217;d end up with something like:</p>
<pre>
["id", "name", "date"],
["0", "name", "2009-01-01"],
["1", "another name", "2009-02-01"]
</pre>
<p>In some situations it is nice to have a dictionary of keys and values though, so that instead of a simple list of columns we end up with:</p>
<pre>
{"id": "0", "name": "name", "date": "2009-01-01"},
{"id": "1", "name": "another name", "date": "2009-02-01"}
</pre>
<p>This would allow us to refer to fields by name rather than position in the list. Do you really want to remember that date is in position 2? And what happens if the input data changes, and a new column is added between name and date? If we&#8217;re referring to columns by position then we&#8217;ll have to change our existing code, but by referring to it by name we won&#8217;t have to change anything.</p>
<p>It turns out this is pretty easy to achieve, in only a few lines of python:</p>
<pre name="code" class="python">
import csv
data = csv.reader(open('data.csv'))
# Read the column names from the first line of the file,
# stripping the space that follows each comma
fields = [name.strip() for name in next(data)]
for row in data:
    # Zip together the field names and values
    items = zip(fields, row)
    item = {}
    # Add the value to our dictionary
    for (name, value) in items:
        item[name] = value.strip()
</pre>
<p>The csv module allows you to specify a delimiter, so if your data is tab separated you just need to make a single change:</p>
<pre name="code" class="python">
data = csv.reader(open('data.tsv'), delimiter='\t')
</pre>
<p><strong>Update</strong></p>
<p>Thanks to several people for mentioning <a href="http://docs.python.org/library/csv.html#csv.DictReader">csv.DictReader</a>, which does exactly what I&#8217;ve mentioned here. Having a look at the code it does something very similar, but also takes into account rows of different length, ignores empty columns, and uses the method Tim mentioned in the comments for creating the dictionary:</p>
<pre name="code" class="python">
    # From csv.py
    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
</pre>
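<p>For comparison, here is a short sketch of using <code>csv.DictReader</code> directly. It assumes Python 3 and reads the sample data from an in-memory string rather than <code>data.csv</code>; the <code>skipinitialspace</code> option takes care of the space after each comma, so no manual stripping is needed.</p>
<pre name="code" class="python">
import csv
import io

# Same sample data as above, held in memory instead of data.csv.
SAMPLE = "id, name, date\n0, name, 2009-01-01\n1, another name, 2009-02-01\n"

# skipinitialspace drops the space that follows each comma.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = [dict(row) for row in reader]

for row in rows:
    # Fields are looked up by column name rather than by position.
    print("%s %s" % (row["name"], row["date"]))
</pre>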
<img src="http://feeds.feedburner.com/~r/coderholic/~4/ZSyGxg0KDzc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/parsing-csv-data-in-python/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/parsing-csv-data-in-python/</feedburner:origLink></item>
		<item>
		<title>Getting started with Hadoop</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/Bf96cVfN9vw/</link>
		<comments>http://www.coderholic.com/getting-started-with-hadoop/#comments</comments>
		<pubDate>Sat, 29 Aug 2009 10:13:01 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[mapreduce]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=300</guid>
		<description><![CDATA[Hadoop is an open source Java implementation of Google&#8217;s MapReduce, a distributed programming technique. I&#8217;ve been investigating Hadoop lately for some data processing tasks at work. It is a bit of a minefield at first, so I&#8217;m writing this post partly as a way for me to keep track of everything, and also to hopefully [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop is an open source Java implementation of Google&#8217;s <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>, a distributed programming technique. I&#8217;ve been investigating Hadoop lately for some data processing tasks at work. It is a bit of a minefield at first, so I&#8217;m writing this post partly as a way for me to keep track of everything, and also to hopefully save somebody else some time, or maybe to spark an interest in Hadoop.</p>
<p><strong>Writing Hadoop applications</strong></p>
<p>Hadoop takes care of distributing your application over several nodes, but the rest is down to you. You can write your application in Java, or as a &#8220;streaming&#8221; application in any language. Streaming apps simply read data from stdin and write their output to stdout. Writing a Java application seems to give you much more control over various options, and presumably improved performance too, but I opted for streaming for simplicity. The following post gives a good example of a streaming program for Hadoop: <a href="http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python">Writing an Hadoop MapReduce program in Python</a>.</p>
<p>One of the best things I&#8217;ve found with streaming jobs is that you can test them on a small amount of data on the command line, with no need for Hadoop:</p>
<p><code><br />
cat input-data | ./mapper | sort | ./reducer<br />
</code></p>
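<p>To make that pipeline concrete, here is a minimal sketch of the three stages in plain Python. Word counting is only a hypothetical example job, and the function names are my own. The key point is that the reducer relies on its input being sorted, which is what the <code>sort</code> step (or Hadoop itself) guarantees.</p>
<pre name="code" class="python">
from itertools import groupby


def mapper(lines):
    # Emit one tab-separated "word, 1" record per word.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_records):
    # Records arrive sorted by key, so equal words are adjacent.
    keyed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(keyed, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    # Equivalent of: cat input-data | ./mapper | sort | ./reducer
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["to be or not", "to be"]))
</pre>
<p>In a real streaming job the mapper and reducer would be separate scripts reading from stdin and printing to stdout, but the logic stays the same.</p>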
<p><strong>Running Hadoop</strong></p>
<p>Rather than installing a local copy of Hadoop I used a virtual machine from Yahoo! which comes with Hadoop pre-installed. The virtual machine is available from their <a href="http://developer.yahoo.com/hadoop/tutorial/module3.html">Hadoop tutorial</a> page, which includes full details of how to run the VM and start running your Hadoop application on it. The VM even includes an Eclipse plugin for Hadoop, which is mainly focused on writing Java applications but is still useful if you&#8217;re writing streaming apps because it provides you with job progress details and easy access to the Hadoop file system on the VM.</p>
<p><strong>Multiple Nodes</strong></p>
<p>The Yahoo VM is a great way to test your Hadoop apps locally, but doing distributed computing on a single node kind of misses the point. If you want to do any serious processing with Hadoop you&#8217;ll need lots of machines. Amazon&#8217;s <a href="http://aws.amazon.com/ec2/">Elastic Compute Cloud</a> (EC2) is perfect for this. </p>
<p>Amazon actually provide an <a href="http://aws.amazon.com/elasticmapreduce/">Elastic Map Reduce</a> service which sits on top of EC2 and manages a cluster of EC2 instances for you automatically while running your Hadoop application. The article <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294&amp;categoryID=265">Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming</a> contains details of how to use the service with a real-world example.</p>
<p>An alternative option for running Hadoop on EC2 is to use the Cloudera distribution. Their <a href="http://www.cloudera.com/hadoop-ec2">guide</a> gives full details on how to set this up, and also links to client side scripts which make it extremely simple to bring up a cluster and start running your Hadoop job. Using the Cloudera distribution has the benefit of being cheaper than using the Elastic MapReduce service, because you only pay for your EC2 usage. Another advantage is that you have more control over your EC2 instances, so you can install any packages that you wish. For example, you could boot up your instances with subversion or git installed and automatically run a script to check out the latest version of your code and data and run your Hadoop program!</p>
<p><strong>Alternatives to Java and streaming apps</strong></p>
<p>There are several projects built on top of Hadoop that allow you to access or process your data in different ways. <a href="http://wiki.apache.org/hadoop/Hive">Hive</a> allows you to write SQL-like queries on your data, which are automatically converted to MapReduce tasks. <a href="http://wiki.github.com/klbostee/dumbo">Dumbo</a> allows you to write Hadoop applications in Python at a lower level than you can with streaming apps.</p>
<p>For more information check out the <a href="http://wiki.apache.org/hadoop/">main Hadoop page</a>, which is packed full of documentation.</p>
<img src="http://feeds.feedburner.com/~r/coderholic/~4/Bf96cVfN9vw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/getting-started-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/getting-started-with-hadoop/</feedburner:origLink></item>
		<item>
		<title>Getting started with Google Android development</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/JlK54gpVgHE/</link>
		<comments>http://www.coderholic.com/getting-started-with-google-android-development/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 23:33:34 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[mobile]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=296</guid>
		<description><![CDATA[When I upgraded my Blackberry for an Android G1 last December I was hoping to get stuck in with some Android development as soon as possible. 9 months later and I&#8217;m finally getting around to it! My initial impressions of the development platform are good. There is great documentation, and lots of sample applications.
One of [...]]]></description>
			<content:encoded><![CDATA[<p>When I upgraded my Blackberry for an Android G1 last December I was hoping to get stuck in with some Android development as soon as possible. 9 months later and I&#8217;m finally getting around to it! My initial impressions of the development platform are good. There is great <a href="http://developer.android.com/reference/packages.html">documentation</a>, and lots of sample applications.</p>
<p>One of the best sample applications I&#8217;ve found is <a href="http://code.google.com/p/wherearemyfriends/downloads/list?q=label:Featured">WhereAreMyFriends</a>, which uses the phone&#8217;s location awareness and Google Maps integration to display your location and that of everyone in your phone book.</p>
<p><img src="http://developer.motorola.com/docstools/library/wherearemyfriends/images/figure1.png"></p>
<p>Getting WhereAreMyFriends up and running with the 1.5 simulator is a little tricky, but this <a href="http://developer.motorola.com/docstools/library/wherearemyfriends/">guide</a> runs through the required steps.</p>
<p>Now that I&#8217;ve finally started writing some code for the Android I&#8217;ll be writing some posts on the subject. If anybody has any suggestions on any specifics they&#8217;d like me to write about let me know!</p>
<img src="http://feeds.feedburner.com/~r/coderholic/~4/JlK54gpVgHE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/getting-started-with-google-android-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/getting-started-with-google-android-development/</feedburner:origLink></item>
		<item>
		<title>linewatch – an alternative to linux’s watch</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/MAlDTJiBOUI/</link>
		<comments>http://www.coderholic.com/linewatch-an-alternative-to-linuxs-watch/#comments</comments>
		<pubDate>Sat, 18 Jul 2009 20:11:23 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[scripts]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=290</guid>
		<description><![CDATA[I often use the linux watch command to monitor the status of certain commands. When I&#8217;m copying lots of files say, I&#8217;d watch the files in the target directory to see what files have already been copied across with the following command:

watch ls -l

The watch program clears the screen and displays the output of &#8220;ls [...]]]></description>
			<content:encoded><![CDATA[<p>I often use the linux <a href="http://linux.die.net/man/1/watch">watch</a> command to monitor the status of certain commands. When I&#8217;m copying lots of files, say, I&#8217;ll watch the target directory to see which files have already been copied across, with the following command:</p>
<pre>
watch ls -l
</pre>
<p>The watch program clears the screen and displays the output of &#8220;ls -l&#8221; every 2 seconds. </p>
<p>Sometimes I&#8217;ll want to monitor a command that only outputs a single line. If I wanted to see the total number of files in a directory rather than the files themselves I could use the command &#8220;ls -l | wc -l&#8221;. The fact that watch clears the whole screen can be a little annoying here though, because the command is only outputting a single line. That is why I came up with the following small bash script, linewatch. </p>
<p>Linewatch runs the command passed to it every 2 seconds (in the same way watch does), but only clears a single line rather than the whole screen. Here is the code:</p>
<pre name="code" class="python">
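#!/bin/bash
# Escape sequence: erase the current line and return the cursor to column 0
clearline="\033[2K\r"
command="$*"

while true
do
    # $(...) drops the trailing newline, so the cursor stays on the
    # output line and the next pass can overwrite it in place
    output=$(eval "$command")
    printf "$clearline"
    printf '%s' "$output"
    sleep 2
done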
</pre>
<p>And here is an example of how to call it:</p>
<pre>
$ ./linewatch "ls -l | wc -l"
24
</pre>
<p>The number of files in the current directory (24 in the example) will keep updating every 2 seconds. Just hit Ctrl-C when you want to quit.</p>
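<p>As a rough sketch of how the idea could be extended, the version below takes the refresh interval and a repeat count as arguments instead of hard-coding 2 seconds and looping forever. The function names render_line and linewatch_n are hypothetical and not part of the original script. It captures the command output first, so the trailing newline is dropped and the cursor stays on the line being redrawn.</p>

```shell
#!/bin/bash
# Sketch only: render_line and linewatch_n are hypothetical helper names.

# Redraw the current terminal line with the given text.
render_line() {
    # \r returns to column 0, ESC[2K erases the whole line,
    # and %s prints the text without a trailing newline.
    printf '\r\033[2K%s' "$1"
}

# linewatch_n INTERVAL COUNT COMMAND...
# Run COMMAND COUNT times, INTERVAL seconds apart, redrawing one line.
linewatch_n() {
    local interval=$1
    local count=$2
    shift 2
    local i=0
    while [ "$i" -lt "$count" ]; do
        # Command substitution strips the trailing newline of the output
        render_line "$(eval "$*")"
        i=$((i + 1))
        # No need to wait after the final run
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    # Finish with a newline so the shell prompt starts on a fresh line
    echo
}
```

<p>For example, linewatch_n 5 12 "ls -l | wc -l" would refresh the count every 5 seconds for about a minute and then return to the prompt.</p>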
]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/linewatch-an-alternative-to-linuxs-watch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/linewatch-an-alternative-to-linuxs-watch/</feedburner:origLink></item>
		<item>
		<title>SQL Antipatterns</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/a84J9X_LBq8/</link>
		<comments>http://www.coderholic.com/sql-antipatterns/#comments</comments>
		<pubDate>Sat, 11 Jul 2009 17:43:56 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[software design]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=283</guid>
		<description><![CDATA[I was really pleased to come across the &#8220;SQL Antipatterns Strike Back&#8221; presentation recently, which discusses common mistakes with SQL database design. It gives some really good advice on how best to design databases to avoid these issues. I&#8217;ve certainly made some of the mistakes mentioned, and I&#8217;m sure I&#8217;ll be referring back to this [...]]]></description>
			<content:encoded><![CDATA[<p>I was really pleased to come across the &#8220;SQL Antipatterns Strike Back&#8221; presentation recently, which discusses common mistakes with SQL database design. It gives some really good advice on how best to design databases to avoid these issues. I&#8217;ve certainly made some of the mistakes mentioned, and I&#8217;m sure I&#8217;ll be referring back to this presentation again and again!</p>
<div style="width:425px;text-align:left" id="__ss_1319559"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back" title="Sql Antipatterns Strike Back">Sql Antipatterns Strike Back</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=sqlantipatternsstrikeback-090421005946-phpapp01&#038;stripped_title=sql-antipatterns-strike-back" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=sqlantipatternsstrikeback-090421005946-phpapp01&#038;stripped_title=sql-antipatterns-strike-back" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">documents</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/billkarwin">Bill Karwin</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/sql-antipatterns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/sql-antipatterns/</feedburner:origLink></item>
		<item>
		<title>SVN Change Monitoring Script</title>
		<link>http://feedproxy.google.com/~r/coderholic/~3/bYa0HlxqRjw/</link>
		<comments>http://www.coderholic.com/svn-change-monitoring-script/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 19:56:00 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[scripts]]></category>

		<guid isPermaLink="false">http://www.coderholic.com/?p=280</guid>
		<description><![CDATA[I came up with the following shell script recently to monitor code changes in a subversion repository. On the first run it will email out the 10 most recent changes. After that the script mails out all changes since the last time it was run. You can set it up to run as a daily [...]]]></description>
			<content:encoded><![CDATA[<p>I came up with the following shell script recently to monitor code changes in a subversion repository. On the first run it will email out the 10 most recent changes. After that the script mails out all changes since the last time it was run. You can set it up to run as a daily cron job which mails you all changes made to your favourite open source project!</p>
<p>It wouldn&#8217;t take much to get it working with other version control systems such as Git or Bazaar, or to do some nice formatting of the output instead of outputting the raw svn log as-is. Let me know if you find it useful!</p>
<pre name="code" class="python">
#!/bin/bash
# Shell script to email the latest changes in an SVN
# repository to a specified email address.
# Ben Dowling - www.coderholic.com

svnUrl="http://anonsvn.wireshark.org/wireshark/trunk/"
lastRevisionFile="./.last-revision"
mailto="ben@coderholic.com"

function getCurrentRevision {
  # Get the current SVN revision, eg. "r4670"
  currentRevision=$(svn log "$svnUrl" -r HEAD 2&gt;/dev/null | head -n2 | grep -v -- "-------" | awk '{ print $1 }')
  # Strip off the 'r'
  currentRevision="${currentRevision:1}"
  echo "$currentRevision"
}

currentRevision=$(getCurrentRevision)

# If we've run this program before then we've stored the SVN revision at the time
if [ -f "$lastRevisionFile" ]
then
  lastRevision=$(cat "$lastRevisionFile")
  #  Check what the current revision is, and exit if there
  # haven't been any changes since we last checked
  if [ $currentRevision -lt $lastRevision ]
  then
      echo "No changes since last check"
      exit
  fi
else
  # We haven't run this program before, so set the last revision to the current revision - 10
  lastRevision=$(echo "$currentRevision - 10" | bc)
fi

# Mail the SVN changes
svn log "$svnUrl" -r "HEAD:${lastRevision}" | mail -s "SVN changes for $svnUrl" $mailto

# Store the current revision + 1 as the last revision
revision=$(echo "$currentRevision + 1" | bc)
echo "$revision" &gt; "$lastRevisionFile"
</pre>
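<p>The revision handling above could also be factored into two small helpers, sketched here under some assumptions. The names extract_revision and start_revision are hypothetical. The first reads svn log output on standard input and prints the number of the first revision line it finds, replacing the head, grep and awk chain with one awk filter. The second works out the first-run starting point but never goes below revision 1, which matters for repositories with fewer than 10 commits.</p>

```shell
#!/bin/bash
# Sketch only: extract_revision and start_revision are hypothetical names.

# Read "svn log" output on stdin and print the first revision number
# without its leading "r", e.g. "r4670 | ben | ..." becomes "4670".
extract_revision() {
    awk '/^r[0-9]+ [|]/ { sub(/^r/, "", $1); print $1; exit }'
}

# Print the revision to start from on a first run: the current
# revision minus 10, but never lower than 1.
start_revision() {
    local current=$1
    local start=$((current - 10))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    echo "$start"
}
```

<p>In the script this might be used as currentRevision=$(svn log "$svnUrl" -r HEAD | extract_revision), followed by lastRevision=$(start_revision "$currentRevision") on the first run.</p>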
]]></content:encoded>
			<wfw:commentRss>http://www.coderholic.com/svn-change-monitoring-script/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.coderholic.com/svn-change-monitoring-script/</feedburner:origLink></item>
	</channel>
</rss>
