<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" gd:etag="W/&quot;CEICRXY8cSp7ImA9WhVVFkU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251</id><updated>2012-05-10T21:42:44.879+01:00</updated><category term="Mobile" /><category term="Query Languages" /><category term="Broadband" /><category term="Lucene" /><category term="Visualization" /><category term="MapReduce" /><category term="Cloud Computing" /><category term="Family" /><category term="Regular Expressions" /><category term="Web Services" /><category term="Music" /><category term="Hashing" /><category term="RPC" /><category term="Thrift" /><category term="Java" /><category term="Cloudera" /><category term="Open Source" /><category term="Testing" /><category term="Amazon Web Services" /><category term="Distributed Systems" /><category term="Amazon EC2" /><category term="Amazon S3" /><category term="Conferences" /><category term="Quantum Mechanics" /><category term="Data" /><category term="Whirr" /><category term="Hadoop" /><category term="HBase" /><category term="Hardware" /><category term="Apache" /><category term="Easter" /><category term="Book" /><category term="Serialization" /><title>Tom White</title><subtitle type="html" /><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://www.lexemetech.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;v=2" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>50</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/TomWhite" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="tomwhite" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry gd:etag="W/&quot;CEICRXYyfSp7ImA9WhVVFkU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1017419998967425216</id><published>2012-05-10T21:42:00.000+01:00</published><updated>2012-05-10T21:42:44.895+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-05-10T21:42:44.895+01:00</app:edited><title>Volcanoes!</title><content type="html">I've just finished reading &lt;a href="http://www.amazon.com/Super-Volcano-Ticking-Yellowstone-National/dp/0760329257"&gt;"Super Volcano: The Ticking Time Bomb Beneath Yellowstone National Park"&lt;/a&gt; by Greg Breinin. Despite the hyperbolic title, it's a really good introduction to the subject. Actually, the title is entirely appropriate, since the previous Yellowstone eruption around 600,000 years ago was one &lt;i&gt;thousand&lt;/i&gt; times as powerful as the 1980 Mount St. Helens eruption. And it's likely to erupt again, but no one knows when.&lt;br /&gt;
&lt;br /&gt;
We've been on a bit of a volcano tour recently. First &lt;a href="http://faites-simple.blogspot.com/2011/10/lake-almanor-eagle-lake-and-lassen.html"&gt;we visited Lassen Volcanic National Park&lt;/a&gt; in October (climbing the Cinder Cone was a highlight), and &lt;a href="http://faites-simple.blogspot.com/2012/04/towards-seattle-and-other-thoughts.html"&gt;we stopped in on Mount St. Helens visitor center&lt;/a&gt; on our way to Seattle last month. Yesterday &lt;a href="http://faites-simple.blogspot.com/2012/05/yellowstone-mammoth-hot-springs-to-old.html"&gt;we ventured into the Yellowstone caldera&lt;/a&gt; (the bit that blew out in the last eruption).&lt;br /&gt;
&lt;br /&gt;
Before reading the book I hadn't appreciated how recent our understanding of Yellowstone's geology is. It was only in the 1960s that scientists combined new empirical data about the ages of different rock formations in the park with the then emerging theory of plate tectonics. One of the scientists was Robert Christiansen of the U.S. Geological Survey, who, with Richard Blank, collected samples from all over Yellowstone and pieced together the puzzle of how Yellowstone formed. (He also wrote &lt;a href="http://pubs.usgs.gov/pp/pp729g/"&gt;the definitive account of Yellowstone's geology&lt;/a&gt; in 2001.)&lt;br /&gt;
&lt;br /&gt;
They realized that the series of calderas between Oregon and Wyoming were all eruptions caused by what is now known as the &lt;a href="http://en.wikipedia.org/wiki/Yellowstone_Hotspot"&gt;Yellowstone hotspot&lt;/a&gt; over the last 16 million years. The continental plate is moving south west, which makes the newer volcanoes appear in the north east.&lt;br /&gt;
&lt;br /&gt;
This &lt;a href="http://en.wikipedia.org/wiki/File:HotspotsSRP.jpg"&gt;diagram from Wikipedia&lt;/a&gt; summarizes it nicely:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/4/46/HotspotsSRP.jpg/800px-HotspotsSRP.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="219" src="http://upload.wikimedia.org/wikipedia/commons/thumb/4/46/HotspotsSRP.jpg/800px-HotspotsSRP.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1017419998967425216?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/sESWxYOpVAs" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1017419998967425216/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1017419998967425216" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1017419998967425216?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1017419998967425216?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2012/05/volcanoes.html" title="Volcanoes!" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEUFR3Yzeip7ImA9WhZUE04.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7077861757926689669</id><published>2011-06-04T23:04:00.017+01:00</published><updated>2011-06-06T04:50:16.882+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-06-06T04:50:16.882+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Whirr" /><category scheme="http://www.blogger.com/atom/ns#" term="Apache" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>What's new in Apache Whirr 0.5.0-incubating</title><content type="html">&lt;a href="http://incubator.apache.org/whirr/"&gt;Apache Whirr&lt;/a&gt; 0.5.0-incubating is &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;now available&lt;/a&gt;. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing       Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be&lt;br /&gt;fully endorsed by the ASF. Please read the full &lt;a href="http://svn.apache.org/repos/asf/incubator/whirr/trunk/DISCLAIMER.txt"&gt;disclaimer&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In this release the &lt;a href="http://incubator.apache.org/whirr/team-list.html"&gt;Whirr development team&lt;/a&gt; have added many new features while still making the core more solid. This post covers some of the more important changes. The full list can be found in the &lt;a href="http://incubator.apache.org/whirr/release-notes.html"&gt;release notes&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improving the new user experience&lt;/h3&gt;Orchestrating multiple services on cloud instances is a challenge to make simple, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point with new users. The new &lt;a href="http://incubator.apache.org/whirr/whirr-in-5-minutes.html"&gt;Whirr in 5 Minutes&lt;/a&gt; guide walks through the minimum number of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the &lt;a href="http://incubator.apache.org/whirr/quick-start-guide.html"&gt;Quick Start Guide&lt;/a&gt; and the &lt;a href="http://incubator.apache.org/whirr/configuration-guide.html"&gt;Configuration Guide&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The sample configurations in the &lt;i&gt;recipes&lt;/i&gt; directory in the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the &lt;a href="http://incubator.apache.org/whirr/mail-lists.html"&gt;community&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;New services&lt;/h3&gt;Elastic Search and Voldemort have been added to the roster of services that come with Whirr. This brings the total to six; adding to Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;API improvements&lt;/h3&gt;Whirr is still a young project so it is not surprising that its API is rapidly evolving. In &lt;a href="https://issues.apache.org/jira/browse/WHIRR-245"&gt;WHIRR-245&lt;/a&gt;, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the &lt;a href="http://incubator.apache.org/whirr/apidocs/org/apache/whirr/package-summary.html"&gt;org.apache.whirr&lt;/a&gt; package; whereas the service API is in &lt;a href="http://incubator.apache.org/whirr/apidocs/org/apache/whirr/service/package-summary.html"&gt;org.apache.whirr.service&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;You can find out more about writing Whirr services in &lt;a href="https://cwiki.apache.org/confluence/download/attachments/25199475/how-to-write-a-whirr-service.pdf?version=1&amp;amp;modificationDate=1301953266000"&gt;this presentation&lt;/a&gt; (PDF).&lt;br /&gt;&lt;br /&gt;The firewall API that service writers use to open ports for services was simplified and made more powerful in &lt;a href="https://issues.apache.org/jira/browse/WHIRR-275"&gt;WHIRR-275&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Overriding scripts&lt;/h3&gt;This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to make a modification to the scripts that run on cloud instances - to tweak some settings, for instance - you would have to upload your modifications (as well as all the other scripts) to a publicly available web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.&lt;br /&gt;&lt;br /&gt;The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the &lt;a href="http://incubator.apache.org/whirr/faq.html#How_can_I_modify_the_instance_installation_and_configuration_scripts"&gt;FAQ&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Running scripts on nodes&lt;br /&gt;&lt;/h3&gt;In 0.5.0 the scripts that run on cloud instances have been broken up to be more fine-grained, so many services have individual start and stop scripts (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-266"&gt;WHIRR-266&lt;/a&gt;). Combined with the ability to run scripts on  sets of nodes in the cluster (by ID or role), users now have more control of the cluster once it has launched (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-173"&gt;WHIRR-173&lt;/a&gt;). Try running &lt;code&gt;whirr run-script&lt;/code&gt; at the command line to use this feature. There's a contrib script to run the &lt;a href="https://github.com/brianfrankcooper/YCSB"&gt;Yahoo! Cloud Serving Benchmark&lt;/a&gt; (YCSB) against an HBase cluster, which takes advantage of the run-script command (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-287"&gt;WHIRR-287&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Also useful is &lt;a href="https://issues.apache.org/jira/browse/WHIRR-291"&gt;WHIRR-291&lt;/a&gt;, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with &lt;code&gt;whirr run-script&lt;/code&gt;, run arbitrary scripts on them to bring them into the state you want.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Custom service builds&lt;/h3&gt;Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-220"&gt;WHIRR-220&lt;/a&gt;). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying &lt;code&gt;whirr.zookeeper.tarball.url&lt;/code&gt; as a local &lt;i&gt;file://&lt;/i&gt; URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.&lt;br /&gt;&lt;br /&gt;I used a variation of this feature to try out a &lt;a href="https://builds.apache.org/job/Hadoop-22-Build/"&gt;nightly Hadoop 0.22 build&lt;/a&gt; on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Service improvements&lt;/h3&gt;Whirr is only able to exist because of the powerful abstraction that &lt;a href="http://code.google.com/p/jclouds/"&gt;jclouds&lt;/a&gt; provides for interacting with cloud providers. A great example of this power is the API that jclouds provides for discovering the hardware capabilities of an instance running on any provider. &lt;a href="https://issues.apache.org/jira/browse/WHIRR-282"&gt;WHIRR-282&lt;/a&gt; took advantage of the jclouds API to find the number of cores on a node to dynamically configure the number of slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.&lt;br /&gt;&lt;br /&gt;This is just the beginning - there is more work to use memory capabilities to set configuration (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-229"&gt;WHIRR-229&lt;/a&gt;), and to use hardware capabilities generally in services other than Hadoop.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cluster state storage&lt;/h3&gt;In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (&lt;i&gt;~/.whirr/&amp;lt;cluster-name&amp;gt;/instances&lt;/i&gt;). With &lt;a href="https://issues.apache.org/jira/browse/WHIRR-288"&gt;WHIRR-288&lt;/a&gt;, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Bring Your Own Nodes&lt;/h3&gt;Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, etc. - then Whirr will start services on them. The nodes just need to have a base OS like Centos or Ubuntu installed. You can find an example BYON configuration in the &lt;i&gt;recipes&lt;/i&gt; directory of the download.&lt;br /&gt;&lt;br /&gt;BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;A hummingbird&lt;/h3&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://incubator.apache.org/whirr/images/whirr-logo.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 247px; height: 260px;" src="http://incubator.apache.org/whirr/images/whirr-logo.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Credits&lt;/h3&gt;I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;downloading the new release&lt;/a&gt; and joining us on the &lt;a href="http://incubator.apache.org/whirr/mail-lists.html"&gt;mailing lists&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;What's next?&lt;/h3&gt;It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+resolution+%3D+Unresolved"&gt;open issues&lt;/a&gt;, but the general themes include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Adding &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+component+%3D+%22new+service%22+and+resolution+%3D+Unresolved"&gt;more services&lt;/a&gt;. In tandem, we want to make it easier to write new services by pushing common patterns into the core (e.g. &lt;a href="https://issues.apache.org/jira/browse/WHIRR-326"&gt;WHIRR-326&lt;/a&gt; is one example of this).&lt;/li&gt;&lt;li&gt;Improving existing services. By making them more flexible, better configured, easier to manage.&lt;/li&gt;&lt;li&gt;Adding &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+component+%3D+%22new+provider%22+and+resolution+%3D+Unresolved"&gt;more cloud providers&lt;/a&gt;. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.&lt;/li&gt;&lt;li&gt;Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-255"&gt;WHIRR-255&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;Supporting elastic clusters, so new nodes can be added to running clusters (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-214"&gt;WHIRR-214&lt;/a&gt;).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7077861757926689669?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/eAzX-CN-NJg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7077861757926689669/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7077861757926689669" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7077861757926689669?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7077861757926689669?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2011/06/whats-new-in-apache-whirr-050.html" title="What's new in Apache Whirr 0.5.0-incubating" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;DUMGQn4zeyp7ImA9WhZXGUw.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-8764174624555956580</id><published>2011-05-09T04:23:00.008+01:00</published><updated>2011-05-09T06:03:43.083+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-05-09T06:03:43.083+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Data" /><title>Do Donors Choose Local Schools?</title><content type="html">&lt;a href="http://www.donorschoose.org/"&gt;DonorsChoose.org&lt;/a&gt; is a site where people donate money to school projects. For example, a teacher in Iowa might create a project request for some beanbags to create a reading area for her pupils. Then, via the website, donors can give as much or as little as they like to the project, and once the target is reached DonorsChoose purchase and deliver the beanbags to the school.&lt;br /&gt;&lt;br /&gt;DonorsChoose are running a contest. They have opened up their data, and are challenging developers to "&lt;a href="http://www.donorschoose.org/hacking-education"&gt;make discoveries and build apps that improve education in America&lt;/a&gt;".&lt;br /&gt;&lt;br /&gt;I thought I'd do a little hack to answer the question "Do donors tend to choose their local schools?"&lt;br /&gt;&lt;br /&gt;I wrote a short Python program to calculate the distance between each donor's address (where it was provided) and the address of the school for the project they were donating to. Then, using R, I plotted the following histogram:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s1600/distance-hist.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 400px;" src="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s400/distance-hist.png" alt="" id="BLOGGER_PHOTO_ID_5604562315133730578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's striking that many donors are local. In fact, in my analysis, one in four donors live within four miles of the school they are donating to, and the median distance is 128 miles. However, there is a long tail reaching to over 5000 miles!&lt;br /&gt;&lt;br /&gt;If we use a logarithmic scale for the y-axis (count), then a couple of features jump out. This plot is a scatter plot where counts are bucketed by integer distance.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-RARTtyCNHUo/TcdoApfJQZI/AAAAAAAAAIs/-QRnmc0tgDA/s1600/distance-log.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 400px;" src="http://1.bp.blogspot.com/-RARTtyCNHUo/TcdoApfJQZI/AAAAAAAAAIs/-QRnmc0tgDA/s400/distance-log.png" alt="" id="BLOGGER_PHOTO_ID_5604562621631644050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There is a small peak at around 2500 miles, which is puzzling until you realize that this is the approximate distance between the East Coast and West Coast of the USA, where the majority of the population is located. I'm guessing that this bump corresponds to people who donate to schools of friends and relatives on the other coast.&lt;br /&gt;&lt;br /&gt;The other noticeable feature is the significant drop off after 2500 miles. This small number of donations is where the donor or school is located in the non-contiguous states (Alaska and Hawaii), which have only a small fraction of the total population.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How I produced the images&lt;/h3&gt;I wrote a Python program to parse the CSV &lt;a href="http://developer.donorschoose.org/the-data/"&gt;data&lt;/a&gt; from DonorsChoose. It reads two data files - the projects file and the donations file. The files are joined by the project ID field, which means we can access the school ZIP code (from the projects file), and the partial ZIP code of the donor (from the donations file). The donor's ZIP code is optional (and was actually only present in 46% of donations, so the results are restricted to this subset of donations). Also, for privacy reasons, only the first 3 digits of the donor's ZIP code are provided by DonorsChoose. This makes the distance measurements less accurate, particularly for local donors.&lt;br /&gt;&lt;br /&gt;In the case of the partial ZIP code matching the school ZIP code, I set the distance to zero, on the assumption that the donor lives close to the school. This assumption will tend to overcount the zero distance case, and undercount small distances.&lt;br /&gt;&lt;br /&gt;If the partial ZIP code did not match the school ZIP code, I chose a ZIP code with that prefix at random and calculate the distance between that ZIP code and the school's ZIP code. For this calculation I used Kevin T. Ryan's Python &lt;a href="http://code.activestate.com/recipes/393241-calculating-the-distance-between-zip-codes/"&gt;code at ActiveState&lt;/a&gt;, which I modified slightly to support partial ZIP codes.&lt;br /&gt;&lt;br /&gt;The program buckets integers distances and writes the counts to a file. I then used R to plot the distributions show above.&lt;br /&gt;&lt;br /&gt;I've put all my code into a &lt;a href="https://github.com/tomwhite/donors-choose-hack"&gt;GitHub repository&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This hack just scratches the surface of the dataset, and I look forward to seeing some of the cool things that others do in this &lt;a href="http://www.donorschoose.org/hacking-education"&gt;contest&lt;/a&gt;. The closing date is June 30, 2011.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-8764174624555956580?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/cT-FnRhLYYQ" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/8764174624555956580/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=8764174624555956580" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8764174624555956580?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8764174624555956580?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2011/05/do-donors-choose-local-schools.html" title="Do Donors Choose Local Schools?" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s72-c/distance-hist.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;A0YBRX4zcCp7ImA9WhZQEEQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-8434577759482729644</id><published>2011-04-16T23:23:00.005+01:00</published><updated>2011-04-18T04:59:14.088+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-18T04:59:14.088+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Whirr" /><title>Whirr in 5 Minutes</title><content type="html">A couple of days ago I wrote down a sequence of command lines to install &lt;a href="http://incubator.apache.org/whirr/"&gt;Apache Whirr&lt;/a&gt; (an incubator project for running distributed systems on various cloud providers) and run a service from scratch. You just need Java, SSH, and some &lt;a href="http://incubator.apache.org/whirr/faq.html#How_do_I_find_my_cloud_credentials"&gt;cloud credentials&lt;/a&gt; (Amazon EC2 in this case): I've reproduced the commands here:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;export AWS_ACCESS_KEY_ID=...&lt;br /&gt;export AWS_SECRET_ACCESS_KEY=...&lt;br /&gt;curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.4.0-incubating/whirr-0.4.0-incubating.tar.gz&lt;br /&gt;tar zxf whirr-0.4.0-incubating.tar.gz; cd whirr-0.4.0-incubating&lt;br /&gt;ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr&lt;br /&gt;bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;At this point you should have a 3 node ZooKeeper cluster running, which is easily checked with&lt;br /&gt;&lt;code&gt;&lt;br /&gt;echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;You can shutdown the cluster with the following command.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;There are recipes for more services in the &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;Whirr download&lt;/a&gt; package, and more detailed instructions in the &lt;a href="http://incubator.apache.org/whirr/quick-start-guide.html"&gt;Quick Start Guide&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-8434577759482729644?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/WRmV9uXMWnI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/8434577759482729644/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=8434577759482729644" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8434577759482729644?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8434577759482729644?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2011/04/whirr-in-5-minutes.html" title="Whirr in 5 Minutes" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CU4ERX4zfyp7ImA9Wx9TGU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1093096056004881234</id><published>2010-11-28T05:22:00.004Z</published><updated>2010-11-28T05:58:24.087Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-11-28T05:58:24.087Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Java" /><title>My favourite talk at Devoxx 2010</title><content type="html">I went to &lt;a href="http://www.devoxx.com/display/Devoxx2K10/Home"&gt;Devoxx&lt;/a&gt; in Antwerp for the first time this year, and really enjoyed it. I didn't go to that many talks, but the quality seemed very high. My favourite talk was "Performance Anxiety" by &lt;a href="http://en.wikipedia.org/wiki/Joshua_Bloch"&gt;Josh Bloch&lt;/a&gt;, because he's a great speaker and because he presented a single important idea so well.&lt;br /&gt;&lt;br /&gt;The idea was this: determining the performance of programs should be treated as an empirical science. We should give up any hope (if any existed) that predicting a program's performance will become easier in the future, since every layer in the deep stack of a modern computer is becoming more complex. Increased complexity is actually the price we must pay for increased performance. And increased complexity leads, almost inevitably, to reduced predictability.&lt;br /&gt;&lt;br /&gt;As an experimental demonstration, Josh ran a micro benchmark to sort an array of integers. (The demo actually failed to show what he wanted to show, but he assured us it had worked earlier... It's somehow reassuring when live demos don't work for Java demigods either.) Each invocation of the benchmark did a number of runs, and the timings of the runs converged on a stable value. However, between benchmark invocations, the stable values that they converged on varied by up to 20%.&lt;br /&gt;&lt;br /&gt;The reason is subtle: the HotSpot compiler produces different compile plans on different runs, and these have different performance profiles. (This is explained in Cliff Click's 2009 JavaOne presentation, &lt;a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf"&gt;"The Art of (Java) Benchmarking"&lt;/a&gt;.) They all converge on stable values, but &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; stable values for &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; runs. The fact that HotSpot is non-deterministic may not be particularly surprising, but Josh said that the same behaviour has been shown in C code and even assembler, since non-determinism exists at lower levels of the stack too.&lt;br /&gt;&lt;br /&gt;The practical upshot is that we need to change how we iteratively benchmark code. No longer is it permissible to run a benchmark, make a change, run the benchmark again, see that the execution time was faster (even across a number of runs in one VM) and legitimately conclude that it was due to the change we made. We have to reach for statistical tools that tell us the improved execution time was significant after we have run enough VMs.&lt;br /&gt;&lt;br /&gt;How many VMs? The short answer is "30", the longer answer is in &lt;a href="http://buytaert.net/files/oopsla07-georges.pdf"&gt;"Statistically Rigorous Java Performance Evaluation"&lt;/a&gt; by Andy Georges, Dries Buytaert, and Lieven Eeckhout.&lt;br /&gt;&lt;br /&gt;Thankfully there is a Java framework called &lt;a href="http://code.google.com/p/caliper"&gt;Caliper&lt;/a&gt; which can help you run microbenchmarks and which even plots the error bars for you. This stuff needs to see wider adoption in the industry.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1093096056004881234?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/NA2fidEOTaQ" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1093096056004881234/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1093096056004881234" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1093096056004881234?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1093096056004881234?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2010/11/my-favourite-talk-at-devoxx-2010.html" title="My favourite talk at Devoxx 2010" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEcHQnc4fyp7ImA9WxJSFkk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7308341528704841683</id><published>2009-05-01T16:52:00.008+01:00</published><updated>2009-05-06T21:33:53.937+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-05-06T21:33:53.937+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Book" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>"Hadoop: The Definitive Guide" Coming Soon</title><content type="html">&lt;a href="http://www.hadoopbook.com/"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 240px; height: 240px;" src="http://1.bp.blogspot.com/_IhqEHw4_Ick/SfseHCO7weI/AAAAAAAAAHc/N6vYnyvN6wQ/s320/htdg.jpg" alt="" id="BLOGGER_PHOTO_ID_5330887690130538978" border="0" /&gt;&lt;/a&gt;After a busy couple of months I've finished the writing for "&lt;a href="http://www.hadoopbook.com/"&gt;Hadoop: The Definitive Guide&lt;/a&gt;". It's now going through the production process at O'Reilly.&lt;br /&gt;&lt;br /&gt;You can pre-order it on &lt;a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=pd_bbs_sr_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241193282&amp;amp;sr=8-1"&gt;Amazon&lt;/a&gt; and &lt;a href="http://oreilly.com/catalog/9780596521998/"&gt;O'Reilly&lt;/a&gt;. You can also get the Rough Cuts version from O'Reilly to read today, although it hasn't yet been refreshed with my latest draft (I hope that will happen in the next few days).&lt;br /&gt;&lt;br /&gt;Here's the final chapter listing. Readers of earlier drafts will notice that the number of chapters has grown: this is because the elephantine MapReduce chapter has been split into three (chapters 6, 7, and 8) to make things more digestible.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Meet Hadoop&lt;/li&gt;&lt;li&gt;MapReduce&lt;/li&gt;&lt;li&gt;The Hadoop Distributed Filesystem&lt;/li&gt;&lt;li&gt;Hadoop I/O&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Developing a MapReduce Application&lt;/li&gt;&lt;li&gt;How MapReduce Works&lt;/li&gt;&lt;li&gt;MapReduce Types and Formats&lt;/li&gt;&lt;li&gt;MapReduce Features&lt;/li&gt;&lt;li&gt;Setting Up a Hadoop Cluster&lt;/li&gt;&lt;li&gt;Administering Hadoop&lt;/li&gt;&lt;li&gt;Pig&lt;/li&gt;&lt;li&gt;HBase&lt;/li&gt;&lt;li&gt;ZooKeeper&lt;/li&gt;&lt;li&gt;Case Studies&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The writing's done but I still have to package up the example code. I'll be doing this soon, and it will appear on the &lt;a href="http://hadoopbook.com/"&gt;book's website&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7308341528704841683?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/rlkX8xhbBEo" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7308341528704841683/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7308341528704841683" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7308341528704841683?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7308341528704841683?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2009/05/hadoop-definitive-guide-coming-soon.html" title="&quot;Hadoop: The Definitive Guide&quot; Coming Soon" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_IhqEHw4_Ick/SfseHCO7weI/AAAAAAAAAHc/N6vYnyvN6wQ/s72-c/htdg.jpg" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;C0MHSX4yeSp7ImA9WxVQEEg.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1506167963919896927</id><published>2009-01-27T10:01:00.002Z</published><updated>2009-01-27T10:17:18.091Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-01-27T10:17:18.091Z</app:edited><title>Draft Pig Chapter</title><content type="html">A couple of quick updates on the &lt;a href="http://www.hadoopbook.com/"&gt;Hadoop book&lt;/a&gt; I'm writing. The Pig chapter is now available on &lt;a href="http://safari.oreilly.com/9780596521974?cid=orm-cat-readnow-9780596521974"&gt;Safari&lt;/a&gt;. It still has a few holes, but I'd love to hear feedback on it.&lt;br /&gt;&lt;br /&gt;Also included is a Hadoop case study from Last.fm. Thanks to Adrian Woodhead and Marc de Palol for writing it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1506167963919896927?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/cMPdGogAYKs" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1506167963919896927/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1506167963919896927" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1506167963919896927?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1506167963919896927?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2009/01/draft-pig-chapter.html" title="Draft Pig Chapter" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;AkYAR30yeyp7ImA9WxRUEUo.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5053020334356534059</id><published>2008-11-20T10:39:00.005Z</published><updated>2008-11-20T10:49:06.393Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-11-20T10:49:06.393Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Cloudera" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop Developer Zeitgeist</title><content type="html">The Cloudera team have just released a &lt;a href="http://community.cloudera.com/"&gt;website&lt;/a&gt; which has a few reports on various Hadoop development metrics. I like the &lt;a href="http://community.cloudera.com/reports/2/issues/"&gt;Most Watched Open Jira Issues&lt;/a&gt;, as it gives a good summary of what Hadoop Core developers are thinking about.&lt;br /&gt;&lt;br /&gt;Personally, I can't wait for the new MapReduce API (&lt;a href="http://issues.apache.org/jira/browse/HADOOP-1230"&gt;HADOOP-1230&lt;/a&gt;), which is currently the third most watched issue.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-5053020334356534059?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/TUNSDjjh6Ls" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/5053020334356534059/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5053020334356534059" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5053020334356534059?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5053020334356534059?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/11/hadoop-developer-zeitgeist.html" title="Hadoop Developer Zeitgeist" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DkcAQ3s-cSp7ImA9WxRXEUg.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1310920520805089454</id><published>2008-10-16T11:36:00.003+01:00</published><updated>2008-10-16T11:47:22.559+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-10-16T11:47:22.559+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Cloudera" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Cloudera</title><content type="html">I'm pleased to announce that I've joined &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;, a new startup providing support for Hadoop. Amr Awadallah (who's one of the founders) has got more details in his &lt;a href="http://www.awadallah.com/blog/2008/10/13/the-startup-is-cloudera-the-business-is-hadoop-mapreduce/"&gt;blog post&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1310920520805089454?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/06ocxckUFU0" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1310920520805089454/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1310920520805089454" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1310920520805089454?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1310920520805089454?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/10/cloudera.html" title="Cloudera" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;DUcDQXg6eCp7ImA9WxRSFUU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1667539494188985504</id><published>2008-09-16T11:26:00.002+01:00</published><updated>2008-09-16T18:44:30.610+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-16T18:44:30.610+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Book" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop: The Definitive Guide</title><content type="html">&lt;span style="font-weight:bold;"&gt;Update:&lt;/span&gt; Fixed feedback link.&lt;br /&gt;&lt;br /&gt;The Rough Cut of &lt;a href="http://oreilly.com/catalog/9780596521998/index.html"&gt;Hadoop: The Definitive Guide&lt;/a&gt; is now up on O'Reilly's site. There are a few chapters available already, at various stages of completion. Remember, it's still pretty rough. I'd love to hear any suggestions for improvements that you may have though. You can submit feedback from &lt;a href="http://safari.oreilly.com/9780596521974"&gt;Safari&lt;/a&gt; where the book is hosted. As the &lt;a href="http://oreilly.com/roughcuts/faq.csp"&gt;Rough Cuts FAQ&lt;/a&gt; explains, I'd like feedback on missing topics, if something is not understandable, and technical mistakes.&lt;br /&gt;&lt;br /&gt;Now I just need to go and write the rest of it...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1667539494188985504?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/q0PRvygljmU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1667539494188985504/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1667539494188985504" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1667539494188985504?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1667539494188985504?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/09/hadoop-definitive-guide.html" title="Hadoop: The Definitive Guide" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;AkAMQHw_eSp7ImA9WxRSEkk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7867943272436546694</id><published>2008-09-04T22:00:00.004+01:00</published><updated>2008-09-12T20:46:21.241+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-12T20:46:21.241+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Amazon S3" /><category scheme="http://www.blogger.com/atom/ns#" term="Data" /><category scheme="http://www.blogger.com/atom/ns#" term="Amazon EC2" /><title>Hosting Large Public Datasets on Amazon S3</title><content type="html">&lt;span style="font-style:italic;"&gt;&lt;span style="font-weight:bold;"&gt;Update&lt;/span&gt;: I just thought of a quick and dirty way of doing this: just store your content on an extra large EC2 instance (holds up to 1690GB) and make the image public. Anyone can access it using their EC2 account, you just get charged for hosting the image.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's a great deal of interest in large, publicly available datasets (see, for example, &lt;a href="http://groups.google.com/group/get-theinfo/browse_thread/thread/79e5b1159e533d52"&gt;this thread&lt;/a&gt; from &lt;a href="http://theinfo.org/"&gt;theinfo.org&lt;/a&gt;), but for very large datasets it is still expensive to provide the bandwidth to distribute them. Imagine if you could get your hands on the data from a large web crawl, the kind of thing that the &lt;a href="http://www.archive.org/"&gt;Internet Archive&lt;/a&gt; produces. I'm sure people would discover some interesting things from it.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aws.amazon.com/s3"&gt;Amazon S3&lt;/a&gt; is an obvious choice for storing data for public consumption, but while the cost for storage may be reasonable, the cost for transfer can be crippling since the cost is not under the control of the data provider, being incurred &lt;span style="font-style: italic;"&gt;for each transfer&lt;/span&gt; (which is initiated by the user).&lt;br /&gt;&lt;br /&gt;For example, consider a 1TB dataset. With storage running at $0.15 per GB per month this works out at around $150 per month to host. With transfer costs costing $0.18 per GB, this dataset costs around $180 for each transfer out of Amazon! It's not surprising large datasets are not publicly hosted on S3.&lt;br /&gt;&lt;br /&gt;However, transferring data between S3 and EC2 is free, so could we limit transfers from S3 so they are only possible to EC2? You (or anyone else) could run an analysis on EC2 (using Hadoop, say) and only pay for the EC2 time. Or you could transfer it out of EC2 at your own expense. S3 doesn't support this option directly, but it is possible to emulate it with a bit of code.&lt;br /&gt;&lt;br /&gt;The idea (suggested by &lt;a href="http://blog.lucene.com/"&gt;Doug Cutting&lt;/a&gt;) is to make objects private on S3 to restrict access generally, then run a proxy on EC2 that is authorized to access the objects. The proxy only accepts connections from within EC2: any client that is outside Amazon's cloud is firewalled out. This combination ensures only EC2 instances can access the S3 objects, thus removing any bandwidth costs.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Implementation&lt;/h3&gt;I've written such a proxy. It's a Java servlet that uses the &lt;a href="https://jets3t.dev.java.net/"&gt;JetS3t&lt;/a&gt; library to add the correct &lt;a href="http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?RESTAuthentication.html"&gt;Amazon S3 &lt;code&gt;Authorization&lt;/code&gt; HTTP header&lt;/a&gt; to gain access to the owner's objects on S3. If the proxy is running on the EC2 instance with hostname &lt;span style="font-style: italic;"&gt;ec2-67-202-43-67.compute-1.amazonaws.com&lt;/span&gt;, for example, then a request for&lt;br /&gt;&lt;pre&gt;http://ec2-67-202-43-67.compute-1.amazonaws.com/bucket/object&lt;br /&gt;&lt;/pre&gt;is proxied to the protected object at&lt;br /&gt;&lt;pre&gt;http://s3.amazonaws.com/bucket/object&lt;br /&gt;&lt;/pre&gt;To ensure that only clients on EC2 can get access to the proxy I set up an EC2 security group (which limits access to port 80):&lt;br /&gt;&lt;pre&gt;ec2-add-group ec2-private-subnet -d "Group for all Amazon EC2 instances."&lt;br /&gt;ec2-authorize ec2-private-subnet -p 80 -s 10.0.0.0/8&lt;/pre&gt;Then by launching the proxy in this group, only machines on EC2 can connect. (Initially, I thought I had to add public IP addresses to the group -- which, incidentally, I found in &lt;a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51028"&gt;this forum posting&lt;/a&gt; -- but this is not necessary as the public DNS name of an EC2 instance resolves to the private IP address within EC2.) The AWS credentials to gain access to the S3 objects are passed in the user data, along with the hostname of S3:&lt;br /&gt;&lt;pre&gt;ec2-run-instances -k gsg-keypair -g ec2-private-subnet \&lt;br /&gt;-d "&amp;lt;aws_access_key&amp;gt; &amp;lt;aws_secret_key&amp;gt; s3.amazonaws.com" ami-fffd1996&lt;/pre&gt;This AMI (ID &lt;code&gt;ami-fffd1996&lt;/code&gt;) is publicly available, so anyone can use it by using the commands shown here. (The code is available &lt;a href="http://s3.amazonaws.com/s3proxy/s3proxy-0.1.tar.gz"&gt;here&lt;/a&gt;, under an Apache 2.0 license, but you don't need this if you only intend to run or use a proxy.)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Demo&lt;/h3&gt;Here's a resource on S3 that is protected: &lt;span style="font-style:italic;"&gt;http://s3.amazonaws.com/tiling/private.txt&lt;/span&gt;. When you try to retrieve it you get an authorization error:&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://s3.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;br /&gt;&amp;lt;Error&amp;gt;&lt;br /&gt;&amp;lt;Code&amp;gt;AccessDenied&amp;lt;/Code&amp;gt;&lt;br /&gt;&amp;lt;Message&amp;gt;Access Denied&amp;lt;/Message&amp;gt;&lt;br /&gt;&amp;lt;RequestId&amp;gt;57E370CDDD9FE044&amp;lt;/RequestId&amp;gt;&lt;br /&gt;&amp;lt;HostId&amp;gt;dA+9II1dYAjPE5aNsnRxhVoQ5qy3KCa6frkLg3SyTwzP3i2SQNCU534/v8NXXEnN&amp;lt;/HostId&amp;gt;&lt;br /&gt;&amp;lt;/Error&amp;gt;&lt;/pre&gt;With a proxy running, I still can't retrieve the resource via the proxy from outside EC2. It just times out due to the firewall rule:&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;curl: (7) couldn't connect to host&lt;/pre&gt;But it does works from an EC2 machine (any EC2 machine):&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;secret&lt;/pre&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;By running a proxy on EC2, at 10 cents per hour (small instance) - or $72 a month, you can allow folks using EC2 to access your data on S3 for free. While running the proxy is not free, it is a fixed cost that might be acceptable to some organizations, particularly those that have an interest in making data publicly available (but can't stomach large bandwidth costs).&lt;br /&gt;&lt;br /&gt;A few questions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is this useful?&lt;/li&gt;&lt;li&gt;Is there a better way of doing it?&lt;/li&gt;&lt;li&gt;Can we have this built into S3 (please, Amazon)?&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7867943272436546694?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/Cv9BYG1TpRo" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7867943272436546694/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7867943272436546694" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7867943272436546694?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7867943272436546694?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/09/hosting-large-public-datasets-on-amazon.html" title="Hosting Large Public Datasets on Amazon S3" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;DkcNRHs6fyp7ImA9WxdaFU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7570203001448264185</id><published>2008-08-23T21:39:00.002+01:00</published><updated>2008-08-23T21:41:35.517+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-08-23T21:41:35.517+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Amazon Web Services" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Elastic Hadoop Clusters with Amazon's Elastic Block Store</title><content type="html">I gave a &lt;a href="http://skillsmatter.com/podcast/cloud-grid/hadoop-on-amazon-s3ec2"&gt;talk&lt;/a&gt; on Tuesday at the first &lt;a href="http://skillsmatter.com/event/java-jee/hadoop-user-group-meeting"&gt;Hadoop User Group UK&lt;/a&gt; about Hadoop and Amazon Web services - how and why you can run Hadoop with AWS. I mentioned how integrating Hadoop with Amazon's "Persistent local storage", which Werner Vogels had &lt;a href="http://www.allthingsdistributed.com/2008/04/persistent_storage_for_amazon.html"&gt;pre-announced&lt;/a&gt; in April, would be a great feature to have to enable truly elastic Hadoop clusters that you could stop and start on demand.&lt;br /&gt;&lt;br /&gt;Well, the very next day Amazon launched this service, called &lt;a href="http://www.amazon.com/b/ref=sc_fe_c_0_201590011_1?ie=UTF8&amp;amp;node=689343011&amp;amp;no=201590011"&gt;Elastic Block Store&lt;/a&gt; (EBS). So in this post I thought I'd sketch out how an elastic Hadoop might work.&lt;br /&gt;&lt;br /&gt;A bit of background. Currently there are three main ways to use Hadoop with AWS:&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;1. MapReduce with S3 source and sink&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s1600-h/s3-mapred.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s320/s3-mapred.png" alt="" id="BLOGGER_PHOTO_ID_5237387812913173602" border="0"&gt;&lt;/a&gt;In this set up, the data resides on S3, and the MapReduce daemons run on a temporary EC2 cluster for the duration of the job run. This works, and is especially convenient if you've already store your data on S3, but you don't get any data locality. Data locality is what enables the magic of MapReduce to work efficiently - the computation is scheduled to run on the machine where the data is stored, so you get huge savings in not having to ship terabytes of data around the network. EC2 does not share nodes with S3 storage, in fact they are often in different data centres, so performance is nowhere near as good as a regular Hadoop cluster where the data in stored in HDFS (see 3. below).&lt;br /&gt;&lt;br /&gt;It's not all doom and gloom, as the bandwidth between EC2 and S3 is actually pretty good, as Rightscale &lt;a href="http://blog.rightscale.com/2007/10/28/network-performance-within-amazon-ec2-and-to-amazon-s3/"&gt;found when they did some measurements&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2. MapReduce from S3 with HDFS staging&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_IhqEHw4_Ick/SK7wv4AZxeI/AAAAAAAAAFA/66XE-v6mgFE/s1600-h/s3-hdfs-mapred.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://1.bp.blogspot.com/_IhqEHw4_Ick/SK7wv4AZxeI/AAAAAAAAAFA/66XE-v6mgFE/s320/s3-hdfs-mapred.png" alt="" id="BLOGGER_PHOTO_ID_5237388121957058018" border="0"&gt;&lt;/a&gt;Data is stored on S3 but copied to a temporary HDFS cluster running on EC2. This is just a variation of the previous set-up, which is good if you want to run several jobs against the same input data. You save by only copying the data across the network once, but you pay a little more due to HDFS replication.&lt;br /&gt;&lt;br /&gt;The bottleneck is still copying the data out of S3. (Copying the results back into S3 isn't usually as bad as the output is often an order or two of magnitude smaller than the input.)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;3. HDFS on Amazon EC2&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7w--YOZvI/AAAAAAAAAFI/iu8En_A_Arw/s1600-h/ec2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7w--YOZvI/AAAAAAAAAFI/iu8En_A_Arw/s320/ec2.png" alt="" id="BLOGGER_PHOTO_ID_5237388381365626610" border="0"&gt;&lt;/a&gt;Of course, you could just run a Hadoop cluster on EC2 and store your data there (and not in S3). In this scenario, you are committed to running your EC2 cluster long term, which can prove expensive, although the locality is excellent.&lt;br /&gt;&lt;br /&gt;These three scenarios demonstrate that you pay for locality. However, there is a gulf between S3 and local disks that EBS fills nicely. EBS does not have the bandwidth of local disks, but it's significantly better than S3. Rightscale &lt;a href="http://blog.rightscale.com/2008/08/20/amazon-ebs-explained/"&gt;again&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;The bottom line though is that performance exceeds what we’ve seen for filesystems striped across the four local drives of x-large instances.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;h3&gt;Implementing Elastic Hadoop&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_IhqEHw4_Ick/SK7xRq_YwjI/AAAAAAAAAFY/0ypyV3MrprQ/s1600-h/ec2-with-storage-volumes.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://3.bp.blogspot.com/_IhqEHw4_Ick/SK7xRq_YwjI/AAAAAAAAAFY/0ypyV3MrprQ/s320/ec2-with-storage-volumes.png" alt="" id="BLOGGER_PHOTO_ID_5237388702578688562" border="0"&gt;&lt;/a&gt;The main departure from the current Hadoop on EC2 approach is the need to maintain a map from storage volume to node type: i.e. we need to remember which volume is a master volume (storing the namenode's data) and which is a worker volume (storing the datanode's data). It would be nice if you could just start up EC2 instances for all the volumes, and have them figure out which is which, but this might not work as the master needs to be started first so its address can be given to the workers in their user data. (This choreography problem could be solved by introducing ZooKeeper, but that's another story.) So for a first cut, we could simply keep two files (locally, or on S3 or even SimpleDB) called &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt;, and &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt;, which simply list the volume IDs for each node type, one per line.&lt;br /&gt;&lt;br /&gt;Assume there is one master running the namenode and jobtracker, and &lt;font style="font-style: italic;"&gt;n&lt;/font&gt; worker nodes each running a datanode and tasktracker.&lt;br /&gt;&lt;br /&gt;To create a new cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create &lt;font style="font-style: italic;"&gt;n&lt;/font&gt; + 1 volumes.&lt;/li&gt;&lt;li&gt;Create the &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt; file and write the first volume ID into it.&lt;/li&gt;&lt;li&gt;Create the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file and write the remaining volume IDs to it.&lt;/li&gt;&lt;li&gt;Follow the steps for starting a cluster.&lt;/li&gt;&lt;/ol&gt;To start a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Start the master instance passing it the &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt; as user data. On startup the instance attaches to the volume it was told to. It then formats the namenode if it isn't already formatted, then starts the namenode, secondary namenode and jobtracker.&lt;/li&gt;&lt;li&gt;Start &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; worker instances passing it the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; as user data. On startup each instance attaches to the volume on line &lt;font style="font-style: italic;"&gt;i&lt;/font&gt;, where &lt;font style="font-style: italic;"&gt;i&lt;/font&gt; is the &lt;font face="courier new"&gt;ami-launch-index&lt;/font&gt; of the instance. Each instance then starts a datanode and tasktacker.&lt;/li&gt;&lt;li&gt;If any worker instances failed to start then launch them again.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;To shutdown a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Shutdown the Hadoop cluster daemons.&lt;/li&gt;&lt;li&gt;Detach the EBS volumes.&lt;/li&gt;&lt;li&gt;Shutdown the EC2 instances.&lt;/li&gt;&lt;/ol&gt;To grow a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; new volumes, where &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; is the size to grow by.&lt;/li&gt;&lt;li&gt;Append the &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; new volume IDs to the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file.&lt;/li&gt;&lt;li&gt;Start &lt;span style="font-style: italic;"&gt;m&lt;/span&gt; worker instances passing it the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; as user data. On startup each instance attaches to the volume on line &lt;font style="font-style: italic;"&gt;n + i&lt;/font&gt;, where &lt;font style="font-style: italic;"&gt;i&lt;/font&gt; is the &lt;font face="courier new"&gt;ami-launch-index&lt;/font&gt; of the instance. Each instance then starts a datanode and tasktacker.&lt;/li&gt;&lt;/ol&gt;Future enhancements: attach multiple volumes for performance/storage growth on the namenode, or resilience on the namenode; integrate the secondary namenode backup facility with EBS snapshots to S3; provide tools for managing the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file (for example, integrating with datanode decommissioning).&lt;br /&gt;&lt;br /&gt;Building this would be a great project to work on - I hope someone does it!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7570203001448264185?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/mf1EoYJwcEs" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7570203001448264185/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7570203001448264185" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7570203001448264185?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7570203001448264185?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html" title="Elastic Hadoop Clusters with Amazon's Elastic Block Store" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s72-c/s3-mapred.png" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;DU4FRXY9cSp7ImA9WxdVGEk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1050541487495156879</id><published>2008-07-23T20:26:00.002+01:00</published><updated>2008-07-23T22:18:34.869+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-23T22:18:34.869+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Pluggable Hadoop</title><content type="html">&lt;span style="font-weight:bold;"&gt;Update&lt;/span&gt;: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via &lt;a href="http://raibledesigns.com/rd/entry/oscon_2008_the_keynote"&gt;Matt Raible&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that in software is through modularity.&lt;br /&gt;&lt;br /&gt;Some examples:&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Job scheduling&lt;/h3&gt;The current scheduler is a simple FIFO scheduler which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use &lt;a href="http://hadoop.apache.org/core/docs/current/hod.html"&gt;HOD&lt;/a&gt; (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation led to a number of proposals to make the scheduler pluggable (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-2510"&gt;HADOOP-2510&lt;/a&gt;, &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3412"&gt;HADOOP-3412&lt;/a&gt;, &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3444"&gt;HADOOP-3444&lt;/a&gt;). Already there is a &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3746"&gt;fair scheduler implementation&lt;/a&gt; (like the &lt;a href="http://en.wikipedia.org/wiki/Completely_Fair_Scheduler"&gt;Completely Fair Scheduler&lt;/a&gt; in Linux) from Facebook.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;HDFS block placement&lt;/h3&gt;Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has &lt;a href="https://issues.apache.org/jira/browse/HADOOP-2559"&gt;evolved&lt;/a&gt;, it is clear that a one-size-fits-all approach is not necessarily the best approach. Hence the new proposal to support &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3799"&gt;pluggable block placement algorithms&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Instrumentation&lt;/h3&gt;Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/metrics/package-summary.html"&gt;metrics API&lt;/a&gt; (for gathering statistics from the main components of Hadoop), but there is interest in adding other logging systems, such as &lt;a href="http://radlab.cs.berkeley.edu/wiki/Projects/X-Trace_on_Hadoop"&gt;X-Trace&lt;/a&gt;, via a new &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3772"&gt;instrumentation API&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Serialization&lt;/h3&gt;The ability to use &lt;a href="https://issues.apache.org/jira/browse/HADOOP-1986"&gt;pluggable serialization frameworks&lt;/a&gt; in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around &lt;a href="http://incubator.apache.org/thrift/"&gt;Apache Thrift&lt;/a&gt; and &lt;a href="http://code.google.com/p/protobuf/"&gt;Google Protocol Buffers&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Component lifecycle&lt;/h3&gt;There is work being done to &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3628"&gt;add a lifecyle interface to Hadoop components&lt;/a&gt;. One of the goals is to make it easier to subclass components, so they can be customized.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Remove dependency cycles&lt;/h3&gt;This is really just good engineering practice, but the existence of dependencies makes it harder to understand, modify and extend code. Bill de hÓra did a great &lt;a href="http://www.dehora.net/journal/2008/07/06/3-12-minutes-to-sort-a-terabyte-hadoops-code-structure/"&gt;analysis&lt;/a&gt; of Hadoop's code structure (and its deficiencies), which has lead to some work to enforce module dependencies and remove the cycles.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-1050541487495156879?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/7AwXt1E09Fg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/1050541487495156879/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1050541487495156879" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1050541487495156879?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1050541487495156879?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/07/pluggable-hadoop.html" title="Pluggable Hadoop" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEUGQX4-eip7ImA9WxdWFU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-2466668333674859777</id><published>2008-07-08T10:43:00.012+01:00</published><updated>2008-07-08T14:03:40.052+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-08T14:03:40.052+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Thrift" /><category scheme="http://www.blogger.com/atom/ns#" term="Serialization" /><category scheme="http://www.blogger.com/atom/ns#" term="RPC" /><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>RPC and Serialization with Hadoop, Thrift, and Protocol Buffers</title><content type="html">Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's &lt;a href="http://google-code-updates.blogspot.com/2008/07/protocol-buffers-our-serialized.html"&gt;newly released Protocol Buffers&lt;/a&gt; might fit in.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;RPC and Writables&lt;/h3&gt;Hadoop has its own RPC mechanism that dates back to when Hadoop was a part of &lt;a href="http://lucene.apache.org/nutch/"&gt;Nutch&lt;/a&gt;. It's used throughout Hadoop as the mechanism by which daemons talk to each other. For example, a &lt;code&gt;DataNode&lt;/code&gt; communicates with the &lt;code&gt;NameNode&lt;/code&gt;  using the RPC interface &lt;code&gt;DatanodeProtocol&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Protocols are defined using Java interfaces whose arguments and return types are primitives, Strings, Writables, or arrays. These types can all be serialized using Hadoop's specialized serialization format, based on &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Writable.html"&gt;Writable&lt;/a&gt;. Combined with the magic of &lt;a href="http://java.sun.com/j2se/1.3/docs/guide/reflection/proxy.html"&gt;Java dynamic proxies&lt;/a&gt;, we get a simple RPC mechanism which for the caller appears to be a Java interface.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;MapReduce and Writables&lt;/h3&gt;Hadoop uses Writables for another, quite different, purpose: as a serialization format for MapReduce programs. If you've ever written a Hadoop MapReduce program you will have used Writables for the key and value types. For example:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;public class MapClass&lt;br /&gt;implements Mapper&amp;lt;LongWritable, Text, Text, IntWritable&amp;gt; {&lt;br /&gt;&lt;br /&gt;// ...&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;(&lt;code&gt;Text&lt;/code&gt; is just a Writable version of Java &lt;code&gt;String&lt;/code&gt;.)&lt;br /&gt;&lt;br /&gt;The primary benefit of using Writables is in their efficiency. Compared to &lt;a href="http://java.sun.com/j2se/1.4.2/docs/guide/serialization/index.html"&gt;Java serialization&lt;/a&gt;, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected. For the MapReduce code above, the input key is a &lt;code&gt;LongWritable&lt;/code&gt;, so an empty &lt;code&gt;LongWritable&lt;/code&gt; instance is asked to populate itself from the input data stream.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;More flexible MapReduce&lt;/h3&gt;There are downsides of having to use Writables for MapReduce types, however. For a newcomer to Hadoop it's another hurdle: something else to learn ("why can't I just use a String?"). More seriously, perhaps, is that it's hard to use different binary storage formats for MapReduce input and output. For example, Apache Thrift (see below) is an increasingly popular way of storing binary data. It's possible, but cumbersome and inefficient, to read or write Thrift data from MapReduce.&lt;br /&gt;&lt;br /&gt;From Hadoop 0.17.0 onwards &lt;a href="https://issues.apache.org/jira/browse/HADOOP-1986"&gt;you no longer have to use Writables for key and value types in MapReduce programs&lt;/a&gt;. You can use any serialization framework. (Note that this is change is completely independent of Hadoop's RPC mechanism, which still uses Writables - and can only use Writables - as its on-wire format.) So it's easier to use Thrift types, say, throughout your MapReduce program. Or you can even use Java serialization (with &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3566"&gt;some limitations&lt;/a&gt; which will be fixed). What's more, you can add your own serialization framework if you like.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Record I/O, Thrift and Protocol Buffers&lt;br /&gt;&lt;/h3&gt;Another problem with Writables, at least for the MapReduce programmer, is that creating new types is a burden. You have to implement the Writable interface, which means designing the on-wire format, and writing two methods: one to write the data in that format and one to read it back.&lt;br /&gt;&lt;br /&gt;Hadoop's Record I/O was created to solve this problem. You write a definition of your types using a record definition language, then run a record compiler to generate Java source code representations of your types. All Record I/O types are Writable, so they plug into Hadoop very easily. As a bonus, you can generate bindings for other languages, so it's easy to read your data files from other programs.&lt;br /&gt;&lt;br /&gt;For whatever reason, Record I/O never really took off. It's used in &lt;a href="http://zookeeper.sourceforge.net/"&gt;ZooKeeper&lt;/a&gt;, but that's about it (and ZooKeeper will &lt;a href="http://publists.facebook.com/pipermail/thrift/2008-January/000330.html"&gt;move away&lt;/a&gt; from it someday). Momentum has switched to &lt;a href="http://incubator.apache.org/thrift/"&gt;Thrift&lt;/a&gt; (from Facebook, now in the Apache Incubator), which offers a very similar proposition, but in more languages. Thrift also makes it easy to build a (cross-language) RPC mechanism.&lt;br /&gt;&lt;br /&gt;Yesterday, Google open sourced &lt;a href="http://code.google.com/apis/protocolbuffers/"&gt;Protocol Buffers&lt;/a&gt;, its "language-neutral, platform-neutral, extensible mechanism for serializing structured data". Record I/O, Thrift and Protocol Buffers are really solving the same problem, so it will be interesting to see how this develops. Of course, since we're talking about persistent data formats, nothing's going to go away in the short or medium term while people have significant amounts of data locked up in these formats.&lt;br /&gt;&lt;br /&gt;That's why it makes sense to add support in Hadoop for MapReduce using Thrift and Protocol Buffers: so people can process data in the format they have it in. This will be a relatively simple addition.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;What Next?&lt;/h3&gt;For RPC, where a message is short-lived, changing the mechanism is more viable in the short term. Going back to Hadoop's RPC mechanism, now that both Thrift and Protocol Buffers offer an alternative, it may well be time to evaluate them to see if either can offer a performance boost. It would be a big job to retrofit RPC in Hadoop with another implementation, but if there are significant performance gains to be had, then it would be worth doing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-2466668333674859777?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/CFZCumMzSLI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/2466668333674859777/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=2466668333674859777" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2466668333674859777?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2466668333674859777?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html" title="RPC and Serialization with Hadoop, Thrift, and Protocol Buffers" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;CEcERno7fSp7ImA9WxdWEEQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5017354658620266704</id><published>2008-07-03T14:09:00.005+01:00</published><updated>2008-07-03T14:33:27.405+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-03T14:33:27.405+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop beats terabyte sort record</title><content type="html">Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's &lt;a href="http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html"&gt;blog post&lt;/a&gt; (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the &lt;a href="http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/terasort/"&gt;code in trunk&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Well done Owen and well done Hadoop!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-5017354658620266704?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/tGveeimyv2Q" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/5017354658620266704/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5017354658620266704" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5017354658620266704?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5017354658620266704?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/07/hadoop-beats-terabyte-sort-record.html" title="Hadoop beats terabyte sort record" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;DkYERXYyfyp7ImA9WxdQGUo.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7956405302605780166</id><published>2008-06-20T16:01:00.001+01:00</published><updated>2008-06-20T16:01:44.897+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-06-20T16:01:44.897+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Query Languages" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop Query Languages</title><content type="html">If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://incubator.apache.org/pig/"&gt;Pig&lt;/a&gt;, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.jaql.org/"&gt;Jaql&lt;/a&gt;, from IBM and soon to be open sourced, is a declarative query language for JSON data.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Hive, from Facebook and soon to &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3601"&gt;become a Hadoop contrib module&lt;/a&gt;, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;All three projects have different strengths, but there is plenty of scope for collaboration and cross-pollination, particularly in the query language. For example, at the &lt;a href="http://research.yahoo.com/node/2104"&gt;Hadoop Summit&lt;/a&gt; in March, Joydeep Sen Sarma of Facebook said that they would be receptive to users who wanted to use Pig Latin or Jaql in Hive. And Kevin Beyer of IBM Research said that Pig and Jaql are converging, and they've had discussions with the Pig team about how to bring them even closer together.&lt;br /&gt;&lt;br /&gt;Meanwhile, to learn more I recommend &lt;a href="http://www.cs.cmu.edu/%7Eolston/publications/sigmod08.pdf"&gt;Pig Latin: A Not-So-Foreign Language for Data Processing&lt;/a&gt; (by Chris Olston &lt;span style="font-style: italic;"&gt;et al&lt;/span&gt;&lt;span&gt;), and the &lt;a href="http://research.yahoo.com/node/2104"&gt;slides and videos from the Hadoop Summit&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(And I haven't even included &lt;a href="http://www.cascading.org/"&gt;Cascading&lt;/a&gt;, from &lt;a href="http://chris.wensel.net/"&gt;Chris K. Wensel&lt;/a&gt;, which, while not a query language &lt;span style="font-style: italic;"&gt;per se&lt;/span&gt;, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7956405302605780166?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/GG1lc17Rd-c" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7956405302605780166/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7956405302605780166" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7956405302605780166?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7956405302605780166?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/06/hadoop-query-languages.html" title="Hadoop Query Languages" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;A0MESX87fSp7ImA9WxdQE0g.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-3339094449895398610</id><published>2008-06-13T13:15:00.000+01:00</published><updated>2008-06-13T13:16:48.105+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-06-13T13:16:48.105+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>"The Next Big Thing"</title><content type="html">&lt;a href="http://mvdirona.com/jrh/perspectives/2007/10/29/WelcomeToPerspectives.aspx"&gt;James Hamilton&lt;/a&gt; on &lt;a href="http://perspectives.mvdirona.com/2008/01/15/TheNextBigThing.aspx"&gt;The Next Big Thing&lt;/a&gt;:&lt;blockquote&gt;Storing blobs in the sky is fine but pretty reproducible by any competitor.  Storing structured data as well as blobs is considerably more interesting but what has even more lasting business value is the storing data in the cloud AND providing a programming platform for multi-thousand node data analysis.  Almost every reasonable business on the planet has a complex set of dimensions that need to be optimized.&lt;/blockquote&gt;&lt;br /&gt;I think we're only beginning to see interesting data processing being done in the cloud - there's much more to come.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-3339094449895398610?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/3aYL0e3cJUg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/3339094449895398610/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=3339094449895398610" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3339094449895398610?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3339094449895398610?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/05/next-big-thing.html" title="&quot;The Next Big Thing&quot;" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;AkQBSHw_cCp7ImA9WxdREUs.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5153878883282354355</id><published>2008-05-30T17:15:00.004+01:00</published><updated>2008-05-30T18:25:59.248+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-05-30T18:25:59.248+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Family" /><category scheme="http://www.blogger.com/atom/ns#" term="Mobile" /><title>Bluetooth Castle</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s1600-h/raglan_castle.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s200/raglan_castle.jpg" alt="" id="BLOGGER_PHOTO_ID_5206216384927168578" border="0" /&gt;&lt;/a&gt;Today I visited &lt;a href="http://en.wikipedia.org/wiki/Raglan_Castle"&gt;Raglan Castle&lt;/a&gt; in Monmouthshire with my family. &lt;a href="http://www.cadw.wales.gov.uk/"&gt;Cadw&lt;/a&gt;, the government body that manages the castle, were running a trial to deliver audio files to visitors' mobile phones using Bluetooth. As I walked through the entrance I simply made my phone discoverable, waited a few seconds for the MP3 to download, then started listening to a &lt;a href="http://www.cadw-feedback.org.uk/audio_en.html"&gt;guided tour&lt;/a&gt; of the castle. It's a great use of the technology: it just worked.&lt;br /&gt;&lt;br /&gt;The talk only lasted a few minutes, so we had plenty of time afterwards to run around the ruins.&lt;br /&gt;&lt;br /&gt;A couple of technical questions that sprang to mind:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;How would you set up a server to push files over Bluetooth? (There are loads of ways you could use this - maps of the local area at transport hubs, sharing the schedule at conferences, random photo of the day at home, etc.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Can you make audio files navigable? That is, make it easy to go to the part of audio file that is about a given exhibit by typing in the exhibit's number? (This problem reminds me of Cliff Schmidt's &lt;a href="http://www.eu.apachecon.com/eu2008/program/talk/2625.html"&gt;talk&lt;/a&gt; about the Talking Book Device at ApacheCon EU 2008.)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-5153878883282354355?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/fVWvxKw7UZM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/5153878883282354355/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5153878883282354355" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5153878883282354355?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5153878883282354355?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/05/bluetooth-castle.html" title="Bluetooth Castle" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s72-c/raglan_castle.jpg" height="72" width="72" /><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;A0ICSHw8eyp7ImA9WxZaFUU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-4309697238982889737</id><published>2008-04-22T21:05:00.012+01:00</published><updated>2008-04-30T22:06:09.273+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-04-30T22:06:09.273+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>Portable Cloud Computing</title><content type="html">Last July I asked "&lt;a href="http://www.lexemetech.com/2007/07/why-are-there-no-amazon-s3ec2.html"&gt;Why are there no Amazon S3/EC2 competitors?&lt;/a&gt;", lamenting the lack of competition in the utility or cloud computing market and the implications for disaster recovery. Closely tied to disaster recover is &lt;span style="font-style: italic;"&gt;portability&lt;/span&gt; -- the ability to switch between different utility computing providers as easily as I switch electricity suppliers. (OK, that can be a pain, at least in the UK, but it doesn't require rewiring my house.)&lt;br /&gt;&lt;br /&gt;It's useful to compare &lt;a href="http://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; with Google's recently launched &lt;a href="http://code.google.com/appengine/"&gt;App Engine&lt;/a&gt; in these terms. In some sense they compete, but they are strikingly different offerings. Amazon's services are &lt;span style="font-style: italic;"&gt;bottom up&lt;/span&gt;: "here's a CPU, now install your own software". Google's is &lt;span style="font-style: italic;"&gt;top down&lt;/span&gt;: "write some code to these APIs and we'll scale it for you". There's no way I can take my EC2 services and run them on App Engine. But I can do the reverse -- sort of -- thanks to &lt;a href="http://appdrop.com/"&gt;AppDrop.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And that's the thing. What I would love is a utility service from different providers that I can switch between. That's one of the meanings of "utility", after all. (Moving lots of data between providers may make this difficult or expensive to do in practice -- "data inertia" -- but that's not a reason not to have the capability.)&lt;br /&gt;&lt;br /&gt;There are at least two ways this can happen. One's the AppDrop approach -- emulate Google's API and provide an alternative place to run applications, in this case it's EC2.&lt;br /&gt;&lt;br /&gt;However, there's another way: build "standard, non-proprietary cloud APIs with open-source implementations", as Doug Cutting says on his blog post &lt;a href="http://blog.lucene.com/2008/04/09/cloud-commodity-or-proprietary/"&gt;Cloud: commodity or proprietary?&lt;/a&gt; In this world, applications use a common API, and host with whichever provider they fancy. Bottom up offerings like Amazon's facilitate this approach: the underlying platforms may differ, but it's easy to run your application on the provided platform -- for example, by building an Amazon AMI. Google's top down approach is not so amenable, application developers are locked-in to the APIs Google provide. (Of course, Google may open this platform up more over time, but it remains to be seen if they will open it up to the extent of being able to run arbitrary executables.)&lt;br /&gt;&lt;br /&gt;As Doug notes, &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is providing a lot of the building blocks for building cloud services: filesystem (&lt;a href="http://hadoop.apache.org/core/"&gt;HDFS&lt;/a&gt;), database (&lt;a href="http://hadoop.apache.org/hbase/"&gt;HBase&lt;/a&gt;), computation (&lt;a href="http://hadoop.apache.org/core/"&gt;MapReduce&lt;/a&gt;), coordination (&lt;a href="http://zookeeper.sourceforge.net/"&gt;ZooKeeper&lt;/a&gt;). And here, perhaps, is where the two approaches may meet -- AppDrop could be backed by HBase (or indeed &lt;a href="http://www.hypertable.org/"&gt;Hypertable&lt;/a&gt;), or HBase (and Hypertable) could have an open API which your application could use directly.&lt;br /&gt;&lt;br /&gt;Rails or Django on HBase, anyone?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-4309697238982889737?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/4MI2vRCdN24" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/4309697238982889737/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=4309697238982889737" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/4309697238982889737?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/4309697238982889737?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/04/portable-cloud-computing.html" title="Portable Cloud Computing" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;D0UNQ3w6eip7ImA9WxZbEk0.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5714730535162695628</id><published>2008-04-14T20:43:00.005+01:00</published><updated>2008-04-14T21:34:52.212+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-04-14T21:34:52.212+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Conferences" /><category scheme="http://www.blogger.com/atom/ns#" term="Apache" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop at ApacheCon Europe</title><content type="html">On Friday in Amsterdam there was a lot of Hadoop on the menu at ApacheCon. I kicked it off at 9am with &lt;a href="http://eu.apachecon.com/eu2008/program/talk/2403"&gt;A Tour of Apache Hadoop&lt;/a&gt;, Owen O'Malley followed with &lt;a href="http://eu.apachecon.com/eu2008/program/talk/2524"&gt;Programming with Hadoop’s Map/Reduce&lt;/a&gt;, and &lt;span class="speaker_name"&gt;Allen Wittenauer&lt;/span&gt; finished off after lunch with &lt;a href="http://eu.apachecon.com/eu2008/program/talk/2535"&gt;Deploying Grid Services using Apache Hadoop&lt;/a&gt;. Find the slides on the &lt;a href="http://wiki.apache.org/hadoop/HadoopPresentations"&gt;Hadoop Presentations&lt;/a&gt; page of the wiki. I've also embedded mine below.&lt;br /&gt;&lt;br /&gt;I only saw half of Allen's talk as I had to catch my plane, but I was there long enough to see his interesting choice of HDFS users... :)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="width: 425px; text-align: left;" id="__ss_352863"&gt;&lt;object style="margin: 0px;" height="355" width="425"&gt;&lt;param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=apacheconeu2008hadooptourtomwhite-1208199764129275-9"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowScriptAccess" value="always"&gt;&lt;embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=apacheconeu2008hadooptourtomwhite-1208199764129275-9" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="355" width="425"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;"&gt;&lt;a href="http://www.slideshare.net/?src=embed"&gt;&lt;img src="http://static.slideshare.net/swf/logo_embd.png" style="border: 0px none ; margin-bottom: -5px;" alt="SlideShare" /&gt;&lt;/a&gt; | &lt;a href="http://www.slideshare.net/tomwhite/apache-con-eu2008-hadoop-tour-tom-white?src=embed" title="View 'Apache Con Eu2008 Hadoop Tour Tom White' on SlideShare"&gt;View&lt;/a&gt; | &lt;a href="http://www.slideshare.net/upload?src=embed"&gt;Upload your own&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;Also at ApacheCon I enjoyed meeting the &lt;a href="http://lucene.apache.org/mahout/"&gt;Mahout&lt;/a&gt; people (Grant, Karl, Isabel and Erik), seeing &lt;a href="http://eu.apachecon.com/eu2008/program/talk/2625"&gt;Cliff Schmidt's keynote&lt;/a&gt;, and generally meeting interesting people.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-5714730535162695628?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/E0Bl1HQ-ESg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/5714730535162695628/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5714730535162695628" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5714730535162695628?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5714730535162695628?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/04/hadoop-at-apachecon-europe.html" title="Hadoop at ApacheCon Europe" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0YGRHYzfCp7ImA9WxZVGUw.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-8179153698514682004</id><published>2008-03-30T20:48:00.007+01:00</published><updated>2008-03-30T22:05:25.884+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-03-30T22:05:25.884+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Amazon EC2" /><title>Turn off the lights when you're not using them, please</title><content type="html">One of the things that struck me about this week's &lt;a href="http://developer.amazonwebservices.com/connect/ann.jspa?annID=295"&gt;new Amazon EC2 features&lt;/a&gt; was the pricing model for Elastic IP addresses:&lt;br /&gt;&lt;blockquote&gt;$0.01 per hour when &lt;span style="font-weight: bold;"&gt;not&lt;/span&gt; mapped to a running instance&lt;br /&gt;&lt;/blockquote&gt;The idea is to encourage people to stop hogging public IP addresses, which are a limited resource, when they don't need them.&lt;br /&gt;&lt;br /&gt;I think one way of viewing EC2 - and the other Amazon utility services - is as a way of putting very fine-grained costs on various computing operations.  So will such a pricing model drive us to minimise the computing resources we use to solve a particular problem? My hope is that making computing costs more transparent will at least make us think about what we're using more, in the way metered electricity makes (some of) us think twice about leaving the lights on. Perhaps we'll even start talking about &lt;a href="http://weblogs.java.net/blog/tomwhite/archive/2006/08/s3map.html"&gt;optimizing for monetary cost&lt;/a&gt; or energy usage rather than purely raw speed?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-8179153698514682004?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/J9_ShxlyQcw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/8179153698514682004/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=8179153698514682004" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8179153698514682004?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8179153698514682004?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/03/turn-off-lights-when-youre-not-using.html" title="Turn off the lights when you're not using them, please" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;A0UFR3k7eip7ImA9WxZVE00.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7735453237479750871</id><published>2008-03-23T18:49:00.005Z</published><updated>2008-03-23T21:53:36.702Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-03-23T21:53:36.702Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Visualization" /><category scheme="http://www.blogger.com/atom/ns#" term="Easter" /><title>Visualizing Easter</title><content type="html">I made this image a few years ago (as a postcard to give to friends), but it's appropriate to show again today as it's a neat visual demonstration that Easter this year is the earliest this century.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://flickr.com/photos/8797345@N02/2355962134/"&gt;&lt;img src="http://farm3.static.flickr.com/2288/2355962134_b29b261631_d.jpg" width="349" height="500" alt="101 Easters"/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The scale at the bottom shows the maximum range of Easter: from 22 March to 25 April. You can read more about the image &lt;a href="http://tiling.org/longview/index.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The image is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"&gt;Creative Commons Attribution-Noncommercial-Share Alike license&lt;/a&gt;, so you are free to share and remix it for non-commercial purposes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7735453237479750871?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/OT1uM2dUsVs" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7735453237479750871/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7735453237479750871" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7735453237479750871?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7735453237479750871?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/03/visualizing-easter.html" title="Visualizing Easter" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DkYMQX06eSp7ImA9WxZaFUs.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7837841027525509394</id><published>2008-03-22T09:44:00.004Z</published><updated>2008-04-30T15:03:00.311+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-04-30T15:03:00.311+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Learning MapReduce</title><content type="html">&lt;span style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Update&lt;/span&gt;: I've posted &lt;/span&gt;&lt;a style="font-style: italic;" href="http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/MapReduce-SPA2008-answers.pdf"&gt;my answers&lt;/a&gt;&lt;span style="font-style: italic;"&gt; to the exercises. Let me know if you find any mistakes. &lt;span style="font-weight: bold;"&gt;Also&lt;/span&gt;: Tamara Petroff has posted a &lt;a href="http://rfa.blog.idnet.com/?p=14"&gt;write up&lt;/a&gt; of the session.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On Wednesday [19 March], I ran a session at &lt;a href="http://www.spaconference.org/spa2008/"&gt;SPA 2008&lt;/a&gt; entitled "&lt;a href="http://www.spaconference.org/spa2008/sessions/session153.html"&gt;Understanding MapReduce with Hadoop&lt;/a&gt;". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.&lt;br /&gt;&lt;br /&gt;It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.&lt;br /&gt;&lt;br /&gt;There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Some struggled to know what input data formats to use. Perhaps I glossed over this too much - I didn't want people to worry about precisely how the data was encoded - but I could have emphasised more that you can have the data presented to your map function in any way that's convenient.&lt;/li&gt;&lt;li&gt;While most people understood the notation I used for writing the map and reduce functions, it did cause some confusion. For example, someone wanted to see the example code again so they could understand what was going on. And another person said it took a while to realise that they could do arbitrary processing as a part of the map and reduce functions. It would be interesting to do the session again but using Java notation.&lt;/li&gt;&lt;li&gt;It was quite common for people to try to do complex things in their map and reduce functions - they felt bad if they just used an identity function, because it was somehow a waste. And on a related note, chaining map reduce jobs together wasn't obvious to many. But once pointed out, folks had an "aha!" moment and were quick to exploit it.&lt;/li&gt;&lt;li&gt;The fact that you typically get multiple reduce outputs prompted questions from some - "but how do you combine them into a single answer?". Talking about chained MapReduce helped here again.&lt;/li&gt;&lt;li&gt;Everyone agreed that it wasn't much like functional programming.&lt;/li&gt;&lt;/ul&gt;You can find the &lt;a href="http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/MapReduce-SPA2008.pdf"&gt;slides&lt;/a&gt; on the Hadoop wiki. They include the six exercises, which I've reproduced below, in rough order of difficulty. (I'll post my answers next week.)&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Find the [number of] hits by 5 minute timeslot for a website given its access logs.&lt;/li&gt;&lt;li&gt;Find the pages with over 1 million hits in day for a website given its access logs.&lt;/li&gt;&lt;li&gt;Find the pages that link to each page in a collection of webpages.&lt;/li&gt;&lt;li&gt;Calculate the proportion of lines that match a given regular expression for a collection of documents.&lt;/li&gt;&lt;li&gt;Sort tabular data by a primary and secondary column.&lt;/li&gt;&lt;li&gt;Find the most popular pages for a website given its access logs.&lt;/li&gt;&lt;/ol&gt;Is this a good list of exercises? Do you have any exercises that you've found useful for learning MapReduce?&lt;br /&gt;&lt;br /&gt;Finally, thanks to &lt;a href="http://chatley.com/"&gt;Robert Chatley&lt;/a&gt; for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7837841027525509394?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/EsMCLJ166UE" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7837841027525509394/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7837841027525509394" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7837841027525509394?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7837841027525509394?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/03/learning-mapreduce.html" title="Learning MapReduce" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;D0MMQ3s7eCp7ImA9WxZWGEk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-2860680421013796857</id><published>2008-03-18T13:04:00.000Z</published><updated>2008-03-18T13:04:42.500Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-03-18T13:04:42.500Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hardware" /><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>"Disks have become tapes"</title><content type="html">MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the &lt;span style="font-style: italic;"&gt;transfer rate&lt;/span&gt; of the disk. Contrast this to accessing data from a relational database that operates at the &lt;span style="font-style: italic;"&gt;seek rate&lt;/span&gt; of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).&lt;br /&gt;&lt;br /&gt;So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has grown at about 5% a year, whereas transfer rate at about 20% &lt;a href="http://www.blogger.com/post-edit.g?blogID=8898949683610477251&amp;amp;postID=2860680421013796857#1"&gt;[1]&lt;/a&gt;. Seek time is growing more slowly than transfer rate - &lt;span style="font-style: italic;"&gt;so it pays to use a model that operates at the transfer rate&lt;/span&gt;. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at &lt;a href="http://conferences.oreillynet.com/os2007/"&gt;OSCON&lt;/a&gt; last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the &lt;a href="http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf"&gt;slides&lt;/a&gt; and &lt;a href="http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/hadoop.m4v"&gt;video&lt;/a&gt; for more detail.)&lt;br /&gt;&lt;br /&gt;The general point was well summed up by Jim Gray in an &lt;a href="http://www.acmqueue.org/modules.php?name=Content&amp;amp;pa=showpage&amp;amp;pid=43"&gt;interview&lt;/a&gt; in ACM Queue from 2003:&lt;br /&gt;&lt;blockquote&gt;... programmers have to start thinking of the disk as a sequential device rather than a random access device.&lt;/blockquote&gt;Or the more pithy: "Disks have become tapes." (Quoted by &lt;a href="http://www.databasecolumn.com/2007/09/disk-trends.html"&gt;David DeWitt&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt &lt;a href="http://www.databasecolumn.com/2007/09/disk-trends.html"&gt;argues&lt;/a&gt; that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says:&lt;br /&gt;&lt;blockquote&gt;Already we see transaction processing systems running on farms of mostly empty disk drives to obtain enough seeks/second to satisfy their transaction processing rates.&lt;/blockquote&gt;But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;a name="1"&gt;[1]&lt;/a&gt; See &lt;a href="http://www.cs.utexas.edu/users/dahlin/techTrends/trends.disk.ps"&gt;Trends in Disk Technology&lt;/a&gt; by Michael D. Dahlin for changes between 1987-1994. For the period since then these figures still hold - as it's relatively easy to check using manufacturer's data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-2860680421013796857?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/ZqqRYzvD_18" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/2860680421013796857/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=2860680421013796857" title="17 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2860680421013796857?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2860680421013796857?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/03/disks-have-become-tapes.html" title="&quot;Disks have become tapes&quot;" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>17</thr:total></entry><entry gd:etag="W/&quot;C0QBRHg7eyp7ImA9WxZXFE8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7201201271658646958</id><published>2008-03-02T01:03:00.002Z</published><updated>2008-03-02T01:29:15.603Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-03-02T01:29:15.603Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>MapReduce without the Reduce</title><content type="html">There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:&lt;br /&gt;&lt;br /&gt;1. Little or no input data of note. (Certainly not large files stored in HDFS.)&lt;br /&gt;2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.&lt;br /&gt;3. Little or map output.&lt;br /&gt;4. No reducers (set by &lt;code&gt;conf.setNumReduceTasks(0)&lt;/code&gt;).&lt;br /&gt;&lt;br /&gt;This seems to work well - indeed the &lt;a href="http://hadoop.apache.org/core/docs/current/api/index.html"&gt;CopyFiles&lt;/a&gt; program in Hadoop (aka &lt;code&gt;distcp&lt;/code&gt;) follows this pattern to efficiently copy files between distributed filesystems:&lt;br /&gt;&lt;br /&gt;1. The input to each map task is a source file and a destination.&lt;br /&gt;2. The map task is limited by its ability to copy the source to the destination (IO-bound).&lt;br /&gt;3. The map output is used as a convenience to record files that were skipped.&lt;br /&gt;4. There are no reducers.&lt;br /&gt;&lt;br /&gt;Combined with &lt;a href="http://hadoop.apache.org/core/docs/current/streaming.html"&gt;Streaming&lt;/a&gt; this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8898949683610477251-7201201271658646958?l=www.lexemetech.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/ZcoJFCU-3z0" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.lexemetech.com/feeds/7201201271658646958/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7201201271658646958" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7201201271658646958?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7201201271658646958?v=2" /><link rel="alternate" type="text/html" href="http://www.lexemetech.com/2008/03/mapreduce-without-reduce.html" title="MapReduce without the Reduce" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry></feed>

