<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" gd:etag="W/&quot;CEYEQ3o4eyp7ImA9WhBaEUQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251</id><updated>2013-05-22T04:08:22.433+01:00</updated><category term="Mobile" /><category term="Query Languages" /><category term="Broadband" /><category term="Lucene" /><category term="Visualization" /><category term="MapReduce" /><category term="Cloud Computing" /><category term="Family" /><category term="Regular Expressions" /><category term="Web Services" /><category term="Music" /><category term="Hashing" /><category term="RPC" /><category term="Thrift" /><category term="Java" /><category term="Cloudera" /><category term="Open Source" /><category term="Testing" /><category term="Amazon Web Services" /><category term="Distributed Systems" /><category term="Amazon EC2" /><category term="Amazon S3" /><category term="Conferences" /><category term="Quantum Mechanics" /><category term="Data" /><category term="Whirr" /><category term="Hadoop" /><category term="HBase" /><category term="Hardware" /><category term="Apache" /><category term="Easter" /><category term="Book" /><category term="Serialization" /><title>Tom White</title><subtitle type="html" /><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;v=2" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>56</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/TomWhite" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="tomwhite" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry gd:etag="W/&quot;DU4AR3YzfSp7ImA9WhBXFks.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-2077960426562913693</id><published>2013-03-30T18:25:00.000Z</published><updated>2013-03-30T18:25:46.885Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-03-30T18:25:46.885Z</app:edited><title>Making a Kitchen Table</title><content type="html">&lt;br /&gt;
&lt;div class="p1"&gt;
A couple of weeks ago I made a new kitchen table.&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-y9oOrPhfkQA/UVcrvF1MIwI/AAAAAAAAALc/E-8Uy1C0Ejo/s1600/20130316_163046.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-y9oOrPhfkQA/UVcrvF1MIwI/AAAAAAAAALc/E-8Uy1C0Ejo/s320/20130316_163046.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;span id="goog_1670473399"&gt;&lt;/span&gt;&lt;span id="goog_1670473400"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="p2"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
It was much easier than it looks as all I had to do was attach some hairpin legs to a worktop. If you haven't seen hairpin legs before, here's a closeup:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-SvfV44yLBIY/UVcsBASbMVI/AAAAAAAAALk/EGczx1Mi8WU/s1600/20130316_152120.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-SvfV44yLBIY/UVcsBASbMVI/AAAAAAAAALk/EGczx1Mi8WU/s320/20130316_152120.jpg" width="240" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p2"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
Eliane got the idea for the design after seeing something similar on the web, and she ordered the worktop from &lt;a href="http://www.worktop-express.co.uk/"&gt;Worktop Express&lt;/a&gt;, and the hairpin legs from the &lt;a href="http://www.theironmill.co.uk/"&gt;Iron Mill&lt;/a&gt;.&lt;/div&gt;
&lt;div class="p2"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
I worked out what size screws to use (#12) and the pilot drill size using &lt;a href="http://www.wlfuller.com/html/wood_screw_chart.html"&gt;this handy chart&lt;/a&gt;. I also found a tip somewhere that said putting a little wax on the screw makes it easier to drive in with hardwoods (our worktop is oak).&lt;/div&gt;
&lt;div class="p2"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
The table is pretty sturdy, and hasn't collapsed! It was quicker to put together than some Ikea furniture, and it's very satisfying having an everyday piece of furniture that we designed and built ourselves.&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/3FeZKkzYyjY" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/2077960426562913693/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=2077960426562913693" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2077960426562913693?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2077960426562913693?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2013/03/making-kitchen-table.html" title="Making a Kitchen Table" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-y9oOrPhfkQA/UVcrvF1MIwI/AAAAAAAAALc/E-8Uy1C0Ejo/s72-c/20130316_163046.jpg" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CU4CQXc6fSp7ImA9WhNaGUw.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-3144519400768005075</id><published>2013-02-03T17:52:00.001Z</published><updated>2013-02-03T17:52:40.915Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-02-03T17:52:40.915Z</app:edited><title>Have you put the chickens to bed?</title><content type="html">"Have you put the chickens to bed?" -- it's a question we ask each other frequently in our house, since we are the &lt;a href="http://welshchickengirl.blogspot.co.uk/"&gt;proud owners&lt;/a&gt; of seven beautiful hens. Normally Eliane has, but when Lottie, our younger daughter, asked long after it had got dark one evening last week it turned out that none of us had, despite having &lt;a href="http://www.tom-e-white.com/2012/12/ifttt.html"&gt;IFTTT alerts set up to remind us&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The problem with the alert is that it is set to go off at sunset, which is all that IFTTT allows, and that's a bit too early as it's not dark enough for the chickens to be in their house. So we wait a bit, then we forget.&lt;br /&gt;
&lt;br /&gt;
So I decided to write an Android app to send an alert a fixed amount of time (say 45 minutes) after sunset, so that when we received it, it would be dark, the chickens would be in their house, and we could close the door there and then.&lt;br /&gt;
&lt;br /&gt;
This is the result:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-iTi2rcr5yJI/UQ6anIfdqaI/AAAAAAAAALM/fDHbWAsLzL0/s1600/chickenalerts.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-iTi2rcr5yJI/UQ6anIfdqaI/AAAAAAAAALM/fDHbWAsLzL0/s1600/chickenalerts.png" height="142" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Eliane is currently beta testing it, so we'll see how well it works. (Obviously the long term goal is an automatic sensor to open and close the chicken house door, but we're not there yet.)&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Writing Android Apps&lt;/h3&gt;
&lt;div&gt;
This is the first Android app I've written, and overall I found the process very straightforward. A couple of years ago I ran a "Hello World" Android tutorial, and I seem to remember most of the time taken to get the app running was installing the Eclipse plugin. This time the Android Developer Tools (ADT) include a customized version of Eclipse, making the getting started process much smoother.&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The Android API is huge and fairly intimidating. It is, however, incredibly well documented, and the user guides are invaluable. The hardest part of writing the app was figuring out which parts of the API to use - do I need a &lt;span style="font-family: Courier New, Courier, monospace;"&gt;BroadcastReceiver&lt;/span&gt; or a &lt;span style="font-family: Courier New, Courier, monospace;"&gt;Service&lt;/span&gt;?, how do &lt;span style="font-family: Courier New, Courier, monospace;"&gt;AlarmManager&lt;/span&gt; and &lt;span style="font-family: Courier New, Courier, monospace;"&gt;Notification&lt;/span&gt; interact? - that kind of thing. There's a lot of material online covering how to do various things in Android, and these offered general pointers, but not necessarily useful code, since the API evolves rapidly from release to release. And although the older code is generally supported, since compatibility is taken very seriously, there may be a better way of doing things in later versions.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The ADT tooling is good and encourages you to do the right thing - for example, extracting natural language strings from your app so it's easy to change them (or translate them) later. In this case, a class called &lt;span style="font-family: Courier New, Courier, monospace;"&gt;R&lt;/span&gt;&amp;nbsp;is generated which has references to all the assets that you need in you app: icons, sound files, strings, etc. For example, the audio file which plays when the notification is received is referred to with:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;







&lt;div class="p1"&gt;
&lt;span style="font-family: Courier New, Courier, monospace;"&gt;R.raw.&lt;span class="s1"&gt;cluck&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
To generate the icons I drew a chicken on a piece of paper with a sharpie, then took a photo of it and used an &lt;a href="http://www.iaza.com/index-ln.html"&gt;online image editor&lt;/a&gt;&amp;nbsp;to make the background transparent. The &lt;a href="http://android-ui-utils.googlecode.com/hg/asset-studio/dist/index.html"&gt;Android Asset Studio&lt;/a&gt; completed the job of converting the image to a set of icons. (I didn't use Inkscape in the end, but &lt;a href="http://envyandroid.com/archives/271/easiest-way-to-create-android-icons"&gt;this blog entry&lt;/a&gt; shows how to convert from an Inkscape drawing.)&amp;nbsp;&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;
What's Next?&lt;/h3&gt;
&lt;div&gt;
The biggest limitation in the app at the moment is that the &lt;a href="https://github.com/mikereedell/sunrisesunsetlib-java"&gt;calculation for sunset time&lt;/a&gt; is hardcoded for the UK. Using the &lt;a href="http://developer.android.com/guide/topics/location/strategies.html"&gt;Location API&lt;/a&gt; is the obvious next step there.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
There are also some complications to do with making sure that notifications will still be sent even the phone is rebooted. I want to make sure that works properly before putting the app on Google Play.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The UI is pretty rudimentary too and could do with some work.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
And before we get to the fully-automated solution, we could have a sensor that detects if the door is open or closed and only sends the reminder if the door hasn't been closed for the night.&lt;/div&gt;
&lt;br /&gt;
Source is on &lt;a href="https://github.com/tomwhite/chickenalerts"&gt;GitHub&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/AqHAqDV4SKw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/3144519400768005075/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=3144519400768005075" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3144519400768005075?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3144519400768005075?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2013/02/have-you-put-chickens-to-bed.html" title="Have you put the chickens to bed?" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-iTi2rcr5yJI/UQ6anIfdqaI/AAAAAAAAALM/fDHbWAsLzL0/s72-c/chickenalerts.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;A04CSX0zeip7ImA9WhNVGUs.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5000894151668084463</id><published>2012-12-31T16:06:00.000Z</published><updated>2012-12-31T16:06:08.382Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-12-31T16:06:08.382Z</app:edited><title>How far away is the sea?</title><content type="html">I wanted an app to answer this question, so I wrote one:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-epoJi4Fc_t0/UOGfMOSFluI/AAAAAAAAAK4/1zRh_2FyrJ8/s1600/how-far-away-is-the-sea.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-epoJi4Fc_t0/UOGfMOSFluI/AAAAAAAAAK4/1zRh_2FyrJ8/s1600/how-far-away-is-the-sea.gif" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
You can try it out at&amp;nbsp;&lt;a href="http://how-far-away-is-the-sea.appspot.com/"&gt;http://how-far-away-is-the-sea.appspot.com/&lt;/a&gt;. It works well on phones too, so you can use it when you are out and about.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
How does it work?&lt;/h3&gt;
I used the &lt;a href="http://www.naturalearthdata.com/downloads/10m-physical-vectors/"&gt;dataset of land polygons&lt;/a&gt; from &lt;a href="http://www.naturalearthdata.com/"&gt;Natural Earth&lt;/a&gt;, which as the name suggests covers the whole world. The scale is 1:10 million, so inevitably there is some inaccuracy near the coast, particularly where it's wiggly.&lt;br /&gt;
&lt;br /&gt;
The app uses your current location (or a location you selected by clicking on the map) and computes the closest point in the set of land polygons. This calculation is performed using the &lt;a href="http://www.vividsolutions.com/jts/JTSHome.htm"&gt;JTS Topology Suite&lt;/a&gt;, a library for 2D spatial work, and it runs as a Java webapp hosted on Google App Engine.&lt;br /&gt;
&lt;br /&gt;
Originally I used &lt;a href="http://www.geotools.org/"&gt;Geotools&lt;/a&gt; to perform the geospatial calculations, but unfortunately it doesn't run on GAE, so I wrote an offline tool to convert the Natural Earth shapefiles to a &lt;a href="http://www.vividsolutions.com/jts/javadoc/com/vividsolutions/jts/io/WKBWriter.html"&gt;JTS binary format&lt;/a&gt;. JTS works fine on GAE, but it lacks a distance calculator. Luckily &lt;a href="https://github.com/spatial4j/spatial4j"&gt;spatial4j&lt;/a&gt; has the requisite distance functions, and it too works on GAE.&lt;br /&gt;
&lt;br /&gt;
The webapp exposes a simple query endpoint, so a request for the following URL, for example:&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;http://how-far-away-is-the-sea.appspot.com/query?lat=51.856479&amp;amp;lng=-3.13551&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
will return a JSON document with the closest point on the coast, whether the (origin) location is on land or at sea, and the distance in metres to the coast:&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;{&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"latitude":51.856479,&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"longitude":-3.13551,&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"coastLatitude":51.55853913000007,&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"coastLongitude":-2.984038865999878,&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"onLand":true,&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;"distanceToCoast":34734.59501052392&lt;/pre&gt;
&lt;pre style="white-space: pre-wrap; word-wrap: break-word;"&gt;}&lt;/pre&gt;
&lt;br /&gt;
The page that the user sees is a simple static HTML page that uses the Google Maps API (v3) to render the map and the markers, and jQuery to query the Java webapp.&lt;br /&gt;
&lt;br /&gt;
The complete source code is on Github at&amp;nbsp;&lt;a href="https://github.com/tomwhite/how-far-away-is-the-sea"&gt;https://github.com/tomwhite/how-far-away-is-the-sea&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Further ideas&lt;/h3&gt;
Some of the polygons are a poor approximation to the coastline, so it would be nice to get a higher-resolution dataset. There are likely many potential sources, such as &lt;a href="http://cdu.mimas.ac.uk/2011/data/boundary-data/index.html"&gt;this one&lt;/a&gt; for the UK.&lt;br /&gt;
&lt;br /&gt;
It would be interesting to use the dataset to answer the question: "which is the furthest point from the sea [in the UK/in X/in the world]?". I'd like to find time to do that sometime. Adding in spatial indexes might be helpful too.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
If you liked this app then you might like...&lt;/h3&gt;
&lt;a href="https://github.com/tomwhite/isitdayornight"&gt;Is it day or night?&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/gB_Uds3NRLw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/5000894151668084463/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5000894151668084463" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5000894151668084463?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5000894151668084463?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2012/12/how-far-away-is-sea.html" title="How far away is the sea?" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-epoJi4Fc_t0/UOGfMOSFluI/AAAAAAAAAK4/1zRh_2FyrJ8/s72-c/how-far-away-is-the-sea.gif" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0YFR3k_eCp7ImA9WhNWFks.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1747695397940033386</id><published>2012-12-16T12:31:00.001Z</published><updated>2012-12-16T12:31:56.740Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-12-16T12:31:56.740Z</app:edited><title>IFTTT</title><content type="html">&lt;a href="https://ifttt.com/"&gt;IFTTT&lt;/a&gt;, pronounced "ift", and which stands for "if this then that", is a great service for wiring bits of the internet together.

The idea is that you create rules for performing actions, based on triggers.&lt;br /&gt;
&lt;b&gt;&lt;br /&gt;&lt;/b&gt;
&lt;b&gt;If this&lt;/b&gt;&amp;nbsp;[trigger] occurs&amp;nbsp;&lt;b&gt;then&lt;/b&gt; perform &lt;b&gt;that&lt;/b&gt;&amp;nbsp;[action].&lt;br /&gt;
&lt;br /&gt;
There are lots of triggers and actions, provided by &lt;a href="https://ifttt.com/channels"&gt;channels&lt;/a&gt;. For example, the &lt;a href="https://ifttt.com/weather"&gt;Weather Channel&lt;/a&gt; provides a trigger which fires at sunset. And the &lt;a href="https://ifttt.com/google_talk"&gt;Google Talk Channel&lt;/a&gt; provides an action to send a chat message. I combined the trigger and action into a &lt;a href="https://ifttt.com/recipes/70456"&gt;recipe&lt;/a&gt; called "Did you put the chickens to bed?" which will remind me (and Eliane) to close the chicken shed in the evening.&lt;br /&gt;
&lt;action&gt;&lt;br /&gt;&lt;/action&gt;
&lt;action&gt;I love the simplicity of the whole thing. I quickly added a recipe to send a weekly SMS to remind me to put the rubbish out. And one to send an email to Lottie when there is a full moon. Emilia created a recipe to send her an email when a friend of hers posts something on his blog. I fear the recipe that tells me when it has started raining will be deleted soon due to email overload.&lt;/action&gt;&lt;br /&gt;
&lt;action&gt;&lt;br /&gt;&lt;/action&gt;When you start thinking in this way, the more interesting uses invariably involve the the physical world in some way. I want to have a recipe that says "if we're running out of coffee beans then order some more", or "if I'm on Skype light up a lamp outside my office so the kids know not to come in" (this one is close with the &lt;a href="https://ifttt.com/blink1"&gt;blink(1)&lt;/a&gt; device), or even "it's actually dark now and you still haven't closed the chicken shed door".&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/ulGLvzn2aTo" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1747695397940033386/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1747695397940033386" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1747695397940033386?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1747695397940033386?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2012/12/ifttt.html" title="IFTTT" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CkEMRXc-fyp7ImA9WhNWEEQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-86665386429028061</id><published>2012-12-09T22:04:00.001Z</published><updated>2012-12-09T22:04:44.957Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-12-09T22:04:44.957Z</app:edited><title>Apportionment</title><content type="html">&lt;script type="text/x-mathjax-config"&gt;   MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); &lt;/script&gt; &lt;script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"&gt; &lt;/script&gt;
&lt;script type="text/x-mathjax-config"&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: "AMS" } }
});
&lt;/script&gt;

&lt;p&gt;&lt;i&gt;[I wrote this in July, but never got round to posting it.]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;
Last weekend I visited the U.S. Capitol in Washington, D.C., with my family, and I learned that the House of Representatives has 435 seats which are appointed
so that each state has a number of seats that is proportional to its population.
It sounded simple when the tour guide said it, but I wondered how are fractions handled fairly?
Simply rounding off quotas doesn't work&amp;mdash;firstly because some states could get no seats, which would be unfair, and 
secondly, how do you make sure that the rounding is both fair and assigns all 435 seats?
&lt;/p&gt;

&lt;p&gt;
When I got home I read about the apportionment problem, as it is known, which has a long and interesting history.
Wikipedia &lt;a id="footnote-1-ref" href="#footnote-1"&gt;[1]&lt;/a&gt; is a good read, as usual; and &lt;a id="footnote-2-ref" href="#footnote-2"&gt;[2]&lt;/a&gt; goes into the history and mathematics of different apportionment algorithms in depth, at least one of which suffers causes a paradox.
Here I'm interested in looking at the algorithm that is used today to calculate apportionments for the House of Representatives,
and why it is considered to be the fairest.
&lt;/p&gt;

&lt;h3&gt;The Algorithm&lt;/h3&gt;

&lt;p&gt;The algorithm in use today for apportioning seats is due to Huntington and Hill and is known as the Huntington-Hill method, or the method of equal proportions.
It's best understood as a dynamic process, which works as follows:&lt;p&gt;

&lt;p&gt;
To start, each state is given one seat. (This ensures that states with relatively small populations, like Wyoming, get at least one seat.)
Then, each remaining seat is allocated in turn to the state is allocated to the state with the
highest &lt;i&gt;priority&lt;/i&gt;, where the priority of a state of population \(P\) and \(n\) previously-allocated seats is defined as
&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {P} {\sqrt{n(n+1)}}\label{pri}
\end{align}&lt;/p&gt;

&lt;p&gt;We'll see why the priority is defined as it is below, but for now notice that it is approximately \(P/n\), so the seat is given to the state
that has the least number of representatives per person, roughly speaking.&lt;/p&gt;

&lt;h3&gt;Results for the 2010 Census&lt;/h3&gt;

&lt;p&gt;Running the algorithm for the state populations from the 2010 Census (using a program I wrote &lt;a id="footnote-5-ref" href="#footnote-5"&gt;[5]&lt;/a&gt;)
gives the following apportionment, which agrees with the U.S. Census Bureau &lt;a id="footnote-3-ref" href="#footnote-3"&gt;[3]&lt;/a&gt;. (The quota column is the percentage of the population for each state.)&lt;/p&gt;

&lt;p&gt;
&lt;table&gt;
&lt;tr&gt;&lt;th&gt;State&lt;/th&gt; &lt;th&gt;Seats&lt;/th&gt; &lt;th&gt;Population&lt;/th&gt; &lt;th&gt;Quota&lt;/th&gt; &lt;th&gt;People per representative&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alabama&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;4802982&lt;/td&gt; &lt;td&gt;6.76&lt;/td&gt; &lt;td&gt;686140&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alaska&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;721523&lt;/td&gt; &lt;td&gt;1.02&lt;/td&gt; &lt;td&gt;721523&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Arizona&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;6412700&lt;/td&gt; &lt;td&gt;9.02&lt;/td&gt; &lt;td&gt;712522&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Arkansas&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2926229&lt;/td&gt; &lt;td&gt;4.12&lt;/td&gt; &lt;td&gt;731557&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;California&lt;/td&gt; &lt;td&gt;53&lt;/td&gt; &lt;td&gt;37341989&lt;/td&gt; &lt;td&gt;52.54&lt;/td&gt; &lt;td&gt;704565&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Colorado&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;5044930&lt;/td&gt; &lt;td&gt;7.10&lt;/td&gt; &lt;td&gt;720704&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Connecticut&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;3581628&lt;/td&gt; &lt;td&gt;5.04&lt;/td&gt; &lt;td&gt;716325&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Delaware&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;900877&lt;/td&gt; &lt;td&gt;1.27&lt;/td&gt; &lt;td&gt;900877&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Florida&lt;/td&gt; &lt;td&gt;27&lt;/td&gt; &lt;td&gt;18900773&lt;/td&gt; &lt;td&gt;26.59&lt;/td&gt; &lt;td&gt;700028&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Georgia&lt;/td&gt; &lt;td&gt;14&lt;/td&gt; &lt;td&gt;9727566&lt;/td&gt; &lt;td&gt;13.69&lt;/td&gt; &lt;td&gt;694826&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hawaii&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1366862&lt;/td&gt; &lt;td&gt;1.92&lt;/td&gt; &lt;td&gt;683431&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Idaho&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1573499&lt;/td&gt; &lt;td&gt;2.21&lt;/td&gt; &lt;td&gt;786749&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Illinois&lt;/td&gt; &lt;td&gt;18&lt;/td&gt; &lt;td&gt;12864380&lt;/td&gt; &lt;td&gt;18.10&lt;/td&gt; &lt;td&gt;714687&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Indiana&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;6501582&lt;/td&gt; &lt;td&gt;9.15&lt;/td&gt; &lt;td&gt;722398&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Iowa&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;3053787&lt;/td&gt; &lt;td&gt;4.30&lt;/td&gt; &lt;td&gt;763446&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Kansas&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2863813&lt;/td&gt; &lt;td&gt;4.03&lt;/td&gt; &lt;td&gt;715953&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Kentucky&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;4350606&lt;/td&gt; &lt;td&gt;6.12&lt;/td&gt; &lt;td&gt;725101&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Louisiana&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;4553962&lt;/td&gt; &lt;td&gt;6.41&lt;/td&gt; &lt;td&gt;758993&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Maine&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1333074&lt;/td&gt; &lt;td&gt;1.88&lt;/td&gt; &lt;td&gt;666537&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Maryland&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;5789929&lt;/td&gt; &lt;td&gt;8.15&lt;/td&gt; &lt;td&gt;723741&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Massachusetts&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;6559644&lt;/td&gt; &lt;td&gt;9.23&lt;/td&gt; &lt;td&gt;728849&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Michigan&lt;/td&gt; &lt;td&gt;14&lt;/td&gt; &lt;td&gt;9911626&lt;/td&gt; &lt;td&gt;13.94&lt;/td&gt; &lt;td&gt;707973&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Minnesota&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;5314879&lt;/td&gt; &lt;td&gt;7.48&lt;/td&gt; &lt;td&gt;664359&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Mississippi&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2978240&lt;/td&gt; &lt;td&gt;4.19&lt;/td&gt; &lt;td&gt;744560&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Missouri&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;6011478&lt;/td&gt; &lt;td&gt;8.46&lt;/td&gt; &lt;td&gt;751434&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Montana&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;994416&lt;/td&gt; &lt;td&gt;1.40&lt;/td&gt; &lt;td&gt;994416&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Nebraska&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;1831825&lt;/td&gt; &lt;td&gt;2.58&lt;/td&gt; &lt;td&gt;610608&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Nevada&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2709432&lt;/td&gt; &lt;td&gt;3.81&lt;/td&gt; &lt;td&gt;677358&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;New Hampshire&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1321445&lt;/td&gt; &lt;td&gt;1.86&lt;/td&gt; &lt;td&gt;660722&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;New Jersey&lt;/td&gt; &lt;td&gt;12&lt;/td&gt; &lt;td&gt;8807501&lt;/td&gt; &lt;td&gt;12.39&lt;/td&gt; &lt;td&gt;733958&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;New Mexico&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;2067273&lt;/td&gt; &lt;td&gt;2.91&lt;/td&gt; &lt;td&gt;689091&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;New York&lt;/td&gt; &lt;td&gt;27&lt;/td&gt; &lt;td&gt;19421055&lt;/td&gt; &lt;td&gt;27.32&lt;/td&gt; &lt;td&gt;719298&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;North Carolina&lt;/td&gt; &lt;td&gt;13&lt;/td&gt; &lt;td&gt;9565781&lt;/td&gt; &lt;td&gt;13.46&lt;/td&gt; &lt;td&gt;735829&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;North Dakota&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;675905&lt;/td&gt; &lt;td&gt;0.95&lt;/td&gt; &lt;td&gt;675905&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Ohio&lt;/td&gt; &lt;td&gt;16&lt;/td&gt; &lt;td&gt;11568495&lt;/td&gt; &lt;td&gt;16.28&lt;/td&gt; &lt;td&gt;723030&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Oklahoma&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;3764882&lt;/td&gt; &lt;td&gt;5.30&lt;/td&gt; &lt;td&gt;752976&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Oregon&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;3848606&lt;/td&gt; &lt;td&gt;5.41&lt;/td&gt; &lt;td&gt;769721&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pennsylvania&lt;/td&gt; &lt;td&gt;18&lt;/td&gt; &lt;td&gt;12734905&lt;/td&gt; &lt;td&gt;17.92&lt;/td&gt; &lt;td&gt;707494&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Rhode Island&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1055247&lt;/td&gt; &lt;td&gt;1.48&lt;/td&gt; &lt;td&gt;527623&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;South Carolina&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;4645975&lt;/td&gt; &lt;td&gt;6.54&lt;/td&gt; &lt;td&gt;663710&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;South Dakota&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;819761&lt;/td&gt; &lt;td&gt;1.15&lt;/td&gt; &lt;td&gt;819761&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tennessee&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;6375431&lt;/td&gt; &lt;td&gt;8.97&lt;/td&gt; &lt;td&gt;708381&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Texas&lt;/td&gt; &lt;td&gt;36&lt;/td&gt; &lt;td&gt;25268418&lt;/td&gt; &lt;td&gt;35.55&lt;/td&gt; &lt;td&gt;701900&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Utah&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2770765&lt;/td&gt; &lt;td&gt;3.90&lt;/td&gt; &lt;td&gt;692691&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Vermont&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;630337&lt;/td&gt; &lt;td&gt;0.89&lt;/td&gt; &lt;td&gt;630337&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Virginia&lt;/td&gt; &lt;td&gt;11&lt;/td&gt; &lt;td&gt;8037736&lt;/td&gt; &lt;td&gt;11.31&lt;/td&gt; &lt;td&gt;730703&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Washington&lt;/td&gt; &lt;td&gt;10&lt;/td&gt; &lt;td&gt;6753369&lt;/td&gt; &lt;td&gt;9.50&lt;/td&gt; &lt;td&gt;675336&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;West Virginia&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;1859815&lt;/td&gt; &lt;td&gt;2.62&lt;/td&gt; &lt;td&gt;619938&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Wisconsin&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;5698230&lt;/td&gt; &lt;td&gt;8.02&lt;/td&gt; &lt;td&gt;712278&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Wyoming&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;568300&lt;/td&gt; &lt;td&gt;0.80&lt;/td&gt; &lt;td&gt;568300&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/p&gt;

&lt;h3&gt;The Mathematics&lt;/h3&gt;

&lt;p&gt;
The algorithm finally settled on by Congress was chosen because it was thought to be the fairest. There are different ways of
defining what "fair" means, and so it cannot be settled mathematically.
In this context "fair" is taken to mean "minimizes the relative difference in representatives per person between states".
&lt;/p&gt;

&lt;p&gt;
To see how the algorithm meets this definition of fairness, let's see what happens when we examine any two states to see if
transferring one seat between them would improve the apportionment. This is the argument published by E. V. Huntington in &lt;a id="footnote-4-ref" href="#footnote-4"&gt;[4]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;
Suppose after the apportionment, state \(A\) has received \(x+1\) seats, and state \(B\) has received \(y\) seats.
Furthermore, also suppose that \(A\) is over-represented because the number of people per representative is less than for \(B\):
&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {A} {x+1} &amp;\lt \frac {B} {y}\label{Aover}
\end{align}&lt;/p&gt;

&lt;p&gt;We can check this in the case of California and New York:&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {37,341,989} {53} &amp;\lt \frac {19,421,055} {27}\nonumber
\end{align}&lt;/p&gt;

&lt;p&gt;Or&lt;/p&gt;

&lt;p&gt;\begin{align}
704565.83 &amp;\lt 719298.33 \nonumber
\end{align}&lt;/p&gt;

&lt;p&gt;Now let's see what happens if we try to transfer one seat from \(A\) to \(B\)&amp;mdash;does that make things fairer?&lt;/p&gt;

&lt;p&gt;In the round when \(A\) won its last seat (number \(x+1\)), we know that its priority (defined by (\ref{pri})) was higher than \(B\)'s. That is,&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {A^2} {x(x+1)} &amp;\gt \frac {B^2} {y(y+1)}\label{priority}
\end{align}&lt;/p&gt;

&lt;p&gt;(Note that even if \(B\) hadn't won its last seat (number \(y\)) at that point, the inequality still holds,
since the number of seats it had would be less than \(y\).)&lt;/p&gt;

&lt;p&gt;Again we can check this in the case of California and New York:&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {37,341,989^2} {52 \times 53} &amp;\gt \frac {19,421,055^2} {27 \times 28}\nonumber
\end{align}&lt;/p&gt;

&lt;p&gt;Which is true. (The numbers also tally with the U.S. Census Bureau &lt;a id="footnote-6-ref" href="#footnote-6"&gt;[6]&lt;/a&gt;, and my program to calculate apportionments &lt;a id="footnote-5-ref" href="#footnote-5"&gt;[5]&lt;/a&gt;, where the priority value for California's last seat is \(711,308\), which is \(37,341,989/\sqrt{52 \times 53}\).)&lt;/p&gt;

&lt;p&gt;Dividing (\ref{priority}) by (\ref{Aover}) we get &lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {A} {x} &amp;\gt \frac {B} {y+1}\label{Bover}
\end{align}&lt;/p&gt;

&lt;p&gt;which we can interpret as saying that \(B\) would be over-represented if one seat were transferred to it from \(A\).
For our example of California and New York, this becomes&lt;/p&gt;

&lt;p&gt;\begin{align}
718115.17 &amp;\gt 693609.11 \nonumber
\end{align}&lt;/p&gt;

&lt;p&gt;The question now is, which over-representation is the smallest? That is, which is fairer, and therefore, to be preferred?&lt;/p&gt;

&lt;p&gt;Using (\ref{Aover}), we calculate the relative difference before the transfer as&lt;/p&gt;

&lt;p&gt;\begin{align}
\newcommand{\slfrac}[2]{\left.#1\middle/#2\right.}
\slfrac{ \left( \frac {B} {y} - \frac {A} {x+1} \right) } {\frac {A} {x+1}}  = \frac {B(x+1)} {Ay} - 1 \label{Adiff}
\end{align}&lt;/p&gt;

&lt;p&gt;And, using (\ref{Bover}), the relative difference after the transfer is&lt;/p&gt;

&lt;p&gt;\begin{align}
\newcommand{\slfrac}[2]{\left.#1\middle/#2\right.}
\slfrac{ \left( \frac {A} {x} - \frac {B} {y+1} \right) } {\frac {B} {y+1}}  = \frac {A(y+1)} {Bx} - 1 \label{Bdiff}
\end{align}&lt;/p&gt;

&lt;p&gt;To compare these relative differences, note that we can rewrite (\ref{priority}) as&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {A(y+1)} {Bx} &amp;\gt \frac {B(x+1)} {Ay}
\end{align}&lt;/p&gt;

&lt;p&gt;Thus&lt;/p&gt;

&lt;p&gt;\begin{align}
\frac {A(y+1)} {Bx} - 1 &amp;\gt \frac {B(x+1)} {Ay} - 1
\end{align}&lt;/p&gt;

&lt;p&gt;and the relative difference is smaller before the seat transfer (using (\ref{Adiff}) and (\ref{Bdiff})). So the original apportionment is optimal.
There was nothing special about the choice of \(A\) and \(B\), so we can conclude that the apportionment is optimal overall.&lt;/p&gt;

&lt;p&gt;Again, this checks out for our example. The relative difference for 53 seats for California and 27 for New York is \(0.021\), versus \(0.035\) for 52 for California and 28 for New York.&lt;/p&gt;

&lt;h3&gt;References&lt;/h3&gt;

&lt;p id="footnote-1"&gt;
[1] &lt;a href="http://en.wikipedia.org/wiki/United_States_congressional_apportionment"&gt;United States congressional apportionment&lt;/a&gt;, Wikipedia. &lt;a href="#footnote-1-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;

&lt;p id="footnote-2"&gt;
[2] &lt;a href="http://www.ams.org/samplings/feature-column/fcarc-apportion1"&gt;Apportionment: Introduction&lt;/a&gt;, American Mathematical Society. &lt;a href="#footnote-2-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;

&lt;p id="footnote-3"&gt;
[3] &lt;a href="http://2010.census.gov/news/pdf/apport2010_table1.pdf"&gt;"APPORTIONMENT POPULATION AND NUMBER OF REPRESENTATIVES, BY STATE: 2010 CENSUS"&lt;/a&gt;, U.S. Census Bureau.  &lt;a href="#footnote-3-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;

&lt;p id="footnote-4"&gt;
[4] &lt;a href="http://links.jstor.org/sici?sici=0002-9947%28192801%2930%3A1%3C85%3ATAORIC%3E2.0.CO%3B2-H"&gt;The Apportionment of Representatives in Congress&lt;/a&gt;, E. V. Huntington, &lt;i&gt;Transactions of the American Mathematical Society&lt;/i&gt;, Vol. 30, No. 1. (Jan., 1928), pp. 85-110.  &lt;a href="#footnote-4-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;

&lt;p id="footnote-5"&gt;
[5] &lt;a href="https://github.com/tomwhite/apportionment"&gt;A program to calculate apportionments&lt;/a&gt;, Tom White, July 2012.  &lt;a href="#footnote-5-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;

&lt;p id="footnote-6"&gt;
[6] &lt;a href="http://www.census.gov/population/apportionment/files/Priority%20Values%202010.pdf"&gt;PRIORITY VALUES FOR 2010 CENSUS&lt;/a&gt;, U.S. Census Bureau.  &lt;a href="#footnote-6-ref"&gt;&amp;#8617&lt;/a&gt;
&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/tlfO4ok4d2E" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/86665386429028061/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=86665386429028061" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/86665386429028061?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/86665386429028061?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2012/12/apportionment.html" title="Apportionment" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CUECRn07fip7ImA9WhNXFUo.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-623392643395801041</id><published>2012-12-03T22:27:00.000Z</published><updated>2012-12-03T22:27:47.306Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-12-03T22:27:47.306Z</app:edited><title>d3troit</title><content type="html">







&lt;br /&gt;
&lt;div class="p1"&gt;
I wrote a &lt;a href="http://tomwhite.github.com/d3troit/index.html"&gt;visualization&lt;/a&gt; of the populations of the largest cities in the US over the years. I got the idea when I was in Detroit in the summer, and read about the huge decline in Detroit's city population since the 1950s when the automotive industry was at its peak. According to Wikipedia's &lt;a href="http://en.wikipedia.org/wiki/Shrinking_cities_in_the_USA"&gt;Shrinking cities in the USA&lt;/a&gt;&amp;nbsp;page, Detroit has declined by 61.4% from its peak population. This is a decline of over 1 million, which makes it the largest decline among US cities in absolute numbers, but not in percentage terms since St. Louis has a slightly higher percentage decline (62.7%, or 537,502 people).&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;







&lt;/div&gt;
&lt;div class="p1"&gt;
These figures are all for city populations, not for the greater urban or metropolitan areas, which, in the case of Detroit, have both significantly increased in size since the 1950s. Detroit has therefore seen one of the largest population shifts to the suburbs since the middle of the 20th century. Much has been written on the dramatic decline of Detroit's city population, and the impact on life there. These are some pieces that I found interesting:&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;For more information on Detroit's history, particularly its buildings, I highly recommend&amp;nbsp;Dan Austin's&amp;nbsp;&lt;span class="s1"&gt;&lt;a href="http://historicdetroit.org/"&gt;http://historicdetroit.org&lt;/a&gt;.&lt;/span&gt;&lt;div class="p1"&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;The Guardian has an amazing &lt;a href="http://www.guardian.co.uk/artanddesign/gallery/2011/jan/02/photography-detroit"&gt;photo gallery&lt;/a&gt; of some of&amp;nbsp;Yves Marchand and Romain Meffre's pictures of Detroit in ruins.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.motherjones.com/politics/2012/06/detroit-economy-art-recovery?page=1"&gt;How to Bring Detroit Back From the Grave&lt;/a&gt;&amp;nbsp;by&amp;nbsp;Josh Harkinson, Mother Jones.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.wired.com/geekmom/2012/02/the-maker-culture-is-reinventing-detroit/"&gt;The Maker Culture is Reinventing Detroit&lt;/a&gt; by&amp;nbsp;Gina Clifford, Wired.&lt;/li&gt;
&lt;li&gt;We didn't get to see it, but a friend highly recommended &lt;a href="http://heidelberg.org/"&gt;the heidelberg project&lt;/a&gt;, a street art project.&lt;/li&gt;
&lt;li&gt;Eliane on our visit:&amp;nbsp;&lt;a href="http://faites-simple.blogspot.co.uk/2012/06/journey-into-past-detroit-in-words.html"&gt;A journey into the past: Detroit in words&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="p1"&gt;
&lt;span class="s1"&gt;







&lt;/span&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div class="p1"&gt;







&lt;/div&gt;
&lt;h3&gt;
How I wrote the visualization&lt;/h3&gt;
&lt;div&gt;







&lt;div class="p1"&gt;
I used the wonderful &lt;a href="http://d3js.org/"&gt;d3 library&lt;/a&gt; to write the visualization, combining elements of the &lt;a href="http://mbostock.github.com/d3/ex/population.html"&gt;Population Pyramid example&lt;/a&gt;&amp;nbsp;with an &lt;a href="http://mbostock.github.com/d3/tutorial/bar-2.html"&gt;updating bar chart&lt;/a&gt;. The documentation on &lt;a href="http://bost.ocks.org/mike/constancy/"&gt;Object Constancy&lt;/a&gt;&amp;nbsp;was critical to getting the visualization to work.&lt;/div&gt;
&lt;div class="p2"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
The population data is from the US Census Bureau, although I found the actual files linked from Wikipedia's&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Largest_cities_in_the_United_States_by_population_by_decade"&gt;Largest cities in the United States by population by decade&lt;/a&gt;. I had to write some simple scripts to turn the source data into CSV files that d3 could read.&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="p1"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/cEpK9ozUqxk" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/623392643395801041/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=623392643395801041" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/623392643395801041?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/623392643395801041?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2012/12/d3troit.html" title="d3troit" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;CEICRXYyfSp7ImA9WhVVFkU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1017419998967425216</id><published>2012-05-10T21:42:00.000+01:00</published><updated>2012-05-10T21:42:44.895+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-05-10T21:42:44.895+01:00</app:edited><title>Volcanoes!</title><content type="html">I've just finished reading &lt;a href="http://www.amazon.com/Super-Volcano-Ticking-Yellowstone-National/dp/0760329257"&gt;"Super Volcano: The Ticking Time Bomb Beneath Yellowstone National Park"&lt;/a&gt; by Greg Breinin. Despite the hyperbolic title, it's a really good introduction to the subject. Actually, the title is entirely appropriate, since the previous Yellowstone eruption around 600,000 years ago was one &lt;i&gt;thousand&lt;/i&gt; times as powerful as the 1980 Mount St. Helens eruption. And it's likely to erupt again, but no one knows when.&lt;br /&gt;
&lt;br /&gt;
We've been on a bit of a volcano tour recently. First &lt;a href="http://faites-simple.blogspot.com/2011/10/lake-almanor-eagle-lake-and-lassen.html"&gt;we visited Lassen Volcanic National Park&lt;/a&gt; in October (climbing the Cinder Cone was a highlight), and &lt;a href="http://faites-simple.blogspot.com/2012/04/towards-seattle-and-other-thoughts.html"&gt;we stopped in on Mount St. Helens visitor center&lt;/a&gt; on our way to Seattle last month. Yesterday &lt;a href="http://faites-simple.blogspot.com/2012/05/yellowstone-mammoth-hot-springs-to-old.html"&gt;we ventured into the Yellowstone caldera&lt;/a&gt; (the bit that blew out in the last eruption).&lt;br /&gt;
&lt;br /&gt;
Before reading the book I hadn't appreciated how recent our understanding of Yellowstone's geology is. It was only in the 1960s that scientists combined new empirical data about the ages of different rock formations in the park with the then emerging theory of plate tectonics. One of the scientists was Robert Christiansen of the U.S. Geological Survey, who, with Richard Blank, collected samples from all over Yellowstone and pieced together the puzzle of how Yellowstone formed. (He also wrote &lt;a href="http://pubs.usgs.gov/pp/pp729g/"&gt;the definitive account of Yellowstone's geology&lt;/a&gt; in 2001.)&lt;br /&gt;
&lt;br /&gt;
They realized that the series of calderas between Oregon and Wyoming were all eruptions caused by what is now known as the &lt;a href="http://en.wikipedia.org/wiki/Yellowstone_Hotspot"&gt;Yellowstone hotspot&lt;/a&gt; over the last 16 million years. The continental plate is moving south west, which makes the newer volcanoes appear in the north east.&lt;br /&gt;
&lt;br /&gt;
This &lt;a href="http://en.wikipedia.org/wiki/File:HotspotsSRP.jpg"&gt;diagram from Wikipedia&lt;/a&gt; summarizes it nicely:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/4/46/HotspotsSRP.jpg/800px-HotspotsSRP.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="219" src="http://upload.wikimedia.org/wikipedia/commons/thumb/4/46/HotspotsSRP.jpg/800px-HotspotsSRP.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/-WhhnGH_i9A" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1017419998967425216/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1017419998967425216" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1017419998967425216?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1017419998967425216?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2012/05/volcanoes.html" title="Volcanoes!" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEUFR3Yzeip7ImA9WhZUE04.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7077861757926689669</id><published>2011-06-04T23:04:00.017+01:00</published><updated>2011-06-06T04:50:16.882+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-06-06T04:50:16.882+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Whirr" /><category scheme="http://www.blogger.com/atom/ns#" term="Apache" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>What's new in Apache Whirr 0.5.0-incubating</title><content type="html">&lt;a href="http://incubator.apache.org/whirr/"&gt;Apache Whirr&lt;/a&gt; 0.5.0-incubating is &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;now available&lt;/a&gt;. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing       Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be&lt;br /&gt;fully endorsed by the ASF. Please read the full &lt;a href="http://svn.apache.org/repos/asf/incubator/whirr/trunk/DISCLAIMER.txt"&gt;disclaimer&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In this release the &lt;a href="http://incubator.apache.org/whirr/team-list.html"&gt;Whirr development team&lt;/a&gt; have added many new features while still making the core more solid. This post covers some of the more important changes. The full list can be found in the &lt;a href="http://incubator.apache.org/whirr/release-notes.html"&gt;release notes&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improving the new user experience&lt;/h3&gt;Orchestrating multiple services on cloud instances is a challenge to make simple, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point with new users. The new &lt;a href="http://incubator.apache.org/whirr/whirr-in-5-minutes.html"&gt;Whirr in 5 Minutes&lt;/a&gt; guide walks through the minimum number of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the &lt;a href="http://incubator.apache.org/whirr/quick-start-guide.html"&gt;Quick Start Guide&lt;/a&gt; and the &lt;a href="http://incubator.apache.org/whirr/configuration-guide.html"&gt;Configuration Guide&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The sample configurations in the &lt;i&gt;recipes&lt;/i&gt; directory in the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the &lt;a href="http://incubator.apache.org/whirr/mail-lists.html"&gt;community&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;New services&lt;/h3&gt;Elastic Search and Voldemort have been added to the roster of services that come with Whirr. This brings the total to six; adding to Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;API improvements&lt;/h3&gt;Whirr is still a young project so it is not surprising that its API is rapidly evolving. In &lt;a href="https://issues.apache.org/jira/browse/WHIRR-245"&gt;WHIRR-245&lt;/a&gt;, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the &lt;a href="http://incubator.apache.org/whirr/apidocs/org/apache/whirr/package-summary.html"&gt;org.apache.whirr&lt;/a&gt; package; whereas the service API is in &lt;a href="http://incubator.apache.org/whirr/apidocs/org/apache/whirr/service/package-summary.html"&gt;org.apache.whirr.service&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;You can find out more about writing Whirr services in &lt;a href="https://cwiki.apache.org/confluence/download/attachments/25199475/how-to-write-a-whirr-service.pdf?version=1&amp;amp;modificationDate=1301953266000"&gt;this presentation&lt;/a&gt; (PDF).&lt;br /&gt;&lt;br /&gt;The firewall API that service writers use to open ports for services was simplified and made more powerful in &lt;a href="https://issues.apache.org/jira/browse/WHIRR-275"&gt;WHIRR-275&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Overriding scripts&lt;/h3&gt;This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to make a modification to the scripts that run on cloud instances - to tweak some settings, for instance - you would have to upload your modifications (as well as all the other scripts) to a publicly available web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.&lt;br /&gt;&lt;br /&gt;The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the &lt;a href="http://incubator.apache.org/whirr/faq.html#How_can_I_modify_the_instance_installation_and_configuration_scripts"&gt;FAQ&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Running scripts on nodes&lt;br /&gt;&lt;/h3&gt;In 0.5.0 the scripts that run on cloud instances have been broken up to be more fine-grained, so many services have individual start and stop scripts (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-266"&gt;WHIRR-266&lt;/a&gt;). Combined with the ability to run scripts on  sets of nodes in the cluster (by ID or role), users now have more control of the cluster once it has launched (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-173"&gt;WHIRR-173&lt;/a&gt;). Try running &lt;code&gt;whirr run-script&lt;/code&gt; at the command line to use this feature. There's a contrib script to run the &lt;a href="https://github.com/brianfrankcooper/YCSB"&gt;Yahoo! Cloud Serving Benchmark&lt;/a&gt; (YCSB) against an HBase cluster, which takes advantage of the run-script command (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-287"&gt;WHIRR-287&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Also useful is &lt;a href="https://issues.apache.org/jira/browse/WHIRR-291"&gt;WHIRR-291&lt;/a&gt;, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with &lt;code&gt;whirr run-script&lt;/code&gt;, run arbitrary scripts on them to bring them into the state you want.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Custom service builds&lt;/h3&gt;Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-220"&gt;WHIRR-220&lt;/a&gt;). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying &lt;code&gt;whirr.zookeeper.tarball.url&lt;/code&gt; as a local &lt;i&gt;file://&lt;/i&gt; URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.&lt;br /&gt;&lt;br /&gt;I used a variation of this feature to try out a &lt;a href="https://builds.apache.org/job/Hadoop-22-Build/"&gt;nightly Hadoop 0.22 build&lt;/a&gt; on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Service improvements&lt;/h3&gt;Whirr is only able to exist because of the powerful abstraction that &lt;a href="http://code.google.com/p/jclouds/"&gt;jclouds&lt;/a&gt; provides for interacting with cloud providers. A great example of this power is the API that jclouds provides for discovering the hardware capabilities of an instance running on any provider. &lt;a href="https://issues.apache.org/jira/browse/WHIRR-282"&gt;WHIRR-282&lt;/a&gt; took advantage of the jclouds API to find the number of cores on a node to dynamically configure the number of slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.&lt;br /&gt;&lt;br /&gt;This is just the beginning - there is more work to use memory capabilities to set configuration (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-229"&gt;WHIRR-229&lt;/a&gt;), and to use hardware capabilities generally in services other than Hadoop.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cluster state storage&lt;/h3&gt;In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (&lt;i&gt;~/.whirr/&amp;lt;cluster-name&amp;gt;/instances&lt;/i&gt;). With &lt;a href="https://issues.apache.org/jira/browse/WHIRR-288"&gt;WHIRR-288&lt;/a&gt;, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Bring Your Own Nodes&lt;/h3&gt;Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, etc. - then Whirr will start services on them. The nodes just need to have a base OS like Centos or Ubuntu installed. You can find an example BYON configuration in the &lt;i&gt;recipes&lt;/i&gt; directory of the download.&lt;br /&gt;&lt;br /&gt;BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;A hummingbird&lt;/h3&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://incubator.apache.org/whirr/images/whirr-logo.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 247px; height: 260px;" src="http://incubator.apache.org/whirr/images/whirr-logo.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Credits&lt;/h3&gt;I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;downloading the new release&lt;/a&gt; and joining us on the &lt;a href="http://incubator.apache.org/whirr/mail-lists.html"&gt;mailing lists&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;What's next?&lt;/h3&gt;It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+resolution+%3D+Unresolved"&gt;open issues&lt;/a&gt;, but the general themes include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Adding &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+component+%3D+%22new+service%22+and+resolution+%3D+Unresolved"&gt;more services&lt;/a&gt;. In tandem, we want to make it easier to write new services by pushing common patterns into the core (e.g. &lt;a href="https://issues.apache.org/jira/browse/WHIRR-326"&gt;WHIRR-326&lt;/a&gt; is one example of this).&lt;/li&gt;&lt;li&gt;Improving existing services. By making them more flexible, better configured, easier to manage.&lt;/li&gt;&lt;li&gt;Adding &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;amp;jqlQuery=project+%3D+WHIRR+AND+component+%3D+%22new+provider%22+and+resolution+%3D+Unresolved"&gt;more cloud providers&lt;/a&gt;. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.&lt;/li&gt;&lt;li&gt;Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-255"&gt;WHIRR-255&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;Supporting elastic clusters, so new nodes can be added to running clusters (&lt;a href="https://issues.apache.org/jira/browse/WHIRR-214"&gt;WHIRR-214&lt;/a&gt;).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/5nw7MkG9Zec" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/7077861757926689669/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7077861757926689669" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7077861757926689669?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7077861757926689669?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2011/06/whats-new-in-apache-whirr-050.html" title="What's new in Apache Whirr 0.5.0-incubating" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;DUMGQn4zeyp7ImA9WhZXGUw.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-8764174624555956580</id><published>2011-05-09T04:23:00.008+01:00</published><updated>2011-05-09T06:03:43.083+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-05-09T06:03:43.083+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Data" /><title>Do Donors Choose Local Schools?</title><content type="html">&lt;a href="http://www.donorschoose.org/"&gt;DonorsChoose.org&lt;/a&gt; is a site where people donate money to school projects. For example, a teacher in Iowa might create a project request for some beanbags to create a reading area for her pupils. Then, via the website, donors can give as much or as little as they like to the project, and once the target is reached DonorsChoose purchase and deliver the beanbags to the school.&lt;br /&gt;&lt;br /&gt;DonorsChoose are running a contest. They have opened up their data, and are challenging developers to "&lt;a href="http://www.donorschoose.org/hacking-education"&gt;make discoveries and build apps that improve education in America&lt;/a&gt;".&lt;br /&gt;&lt;br /&gt;I thought I'd do a little hack to answer the question "Do donors tend to choose their local schools?"&lt;br /&gt;&lt;br /&gt;I wrote a short Python program to calculate the distance between each donor's address (where it was provided) and the address of the school for the project they were donating to. Then, using R, I plotted the following histogram:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s1600/distance-hist.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 400px;" src="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s400/distance-hist.png" alt="" id="BLOGGER_PHOTO_ID_5604562315133730578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's striking that many donors are local. In fact, in my analysis, one in four donors live within four miles of the school they are donating to, and the median distance is 128 miles. However, there is a long tail reaching to over 5000 miles!&lt;br /&gt;&lt;br /&gt;If we use a logarithmic scale for the y-axis (count), then a couple of features jump out. This plot is a scatter plot where counts are bucketed by integer distance.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-RARTtyCNHUo/TcdoApfJQZI/AAAAAAAAAIs/-QRnmc0tgDA/s1600/distance-log.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 400px;" src="http://1.bp.blogspot.com/-RARTtyCNHUo/TcdoApfJQZI/AAAAAAAAAIs/-QRnmc0tgDA/s400/distance-log.png" alt="" id="BLOGGER_PHOTO_ID_5604562621631644050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There is a small peak at around 2500 miles, which is puzzling until you realize that this is the approximate distance between the East Coast and West Coast of the USA, where the majority of the population is located. I'm guessing that this bump corresponds to people who donate to schools of friends and relatives on the other coast.&lt;br /&gt;&lt;br /&gt;The other noticeable feature is the significant drop off after 2500 miles. This small number of donations is where the donor or school is located in the non-contiguous states (Alaska and Hawaii), which have only a small fraction of the total population.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How I produced the images&lt;/h3&gt;I wrote a Python program to parse the CSV &lt;a href="http://developer.donorschoose.org/the-data/"&gt;data&lt;/a&gt; from DonorsChoose. It reads two data files - the projects file and the donations file. The files are joined by the project ID field, which means we can access the school ZIP code (from the projects file), and the partial ZIP code of the donor (from the donations file). The donor's ZIP code is optional (and was actually only present in 46% of donations, so the results are restricted to this subset of donations). Also, for privacy reasons, only the first 3 digits of the donor's ZIP code are provided by DonorsChoose. This makes the distance measurements less accurate, particularly for local donors.&lt;br /&gt;&lt;br /&gt;In the case of the partial ZIP code matching the school ZIP code, I set the distance to zero, on the assumption that the donor lives close to the school. This assumption will tend to overcount the zero distance case, and undercount small distances.&lt;br /&gt;&lt;br /&gt;If the partial ZIP code did not match the school ZIP code, I chose a ZIP code with that prefix at random and calculate the distance between that ZIP code and the school's ZIP code. For this calculation I used Kevin T. Ryan's Python &lt;a href="http://code.activestate.com/recipes/393241-calculating-the-distance-between-zip-codes/"&gt;code at ActiveState&lt;/a&gt;, which I modified slightly to support partial ZIP codes.&lt;br /&gt;&lt;br /&gt;The program buckets integers distances and writes the counts to a file. I then used R to plot the distributions show above.&lt;br /&gt;&lt;br /&gt;I've put all my code into a &lt;a href="https://github.com/tomwhite/donors-choose-hack"&gt;GitHub repository&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This hack just scratches the surface of the dataset, and I look forward to seeing some of the cool things that others do in this &lt;a href="http://www.donorschoose.org/hacking-education"&gt;contest&lt;/a&gt;. The closing date is June 30, 2011.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/V2lH1BYdMbc" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/8764174624555956580/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=8764174624555956580" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8764174624555956580?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8764174624555956580?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2011/05/do-donors-choose-local-schools.html" title="Do Donors Choose Local Schools?" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-YG-bpy0WG_4/TcdnuzsWPxI/AAAAAAAAAIk/y6918RNNchU/s72-c/distance-hist.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;A0YBRX4zcCp7ImA9WhZQEEQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-8434577759482729644</id><published>2011-04-16T23:23:00.005+01:00</published><updated>2011-04-18T04:59:14.088+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-18T04:59:14.088+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Whirr" /><title>Whirr in 5 Minutes</title><content type="html">A couple of days ago I wrote down a sequence of command lines to install &lt;a href="http://incubator.apache.org/whirr/"&gt;Apache Whirr&lt;/a&gt; (an incubator project for running distributed systems on various cloud providers) and run a service from scratch. You just need Java, SSH, and some &lt;a href="http://incubator.apache.org/whirr/faq.html#How_do_I_find_my_cloud_credentials"&gt;cloud credentials&lt;/a&gt; (Amazon EC2 in this case): I've reproduced the commands here:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;export AWS_ACCESS_KEY_ID=...&lt;br /&gt;export AWS_SECRET_ACCESS_KEY=...&lt;br /&gt;curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.4.0-incubating/whirr-0.4.0-incubating.tar.gz&lt;br /&gt;tar zxf whirr-0.4.0-incubating.tar.gz; cd whirr-0.4.0-incubating&lt;br /&gt;ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr&lt;br /&gt;bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;At this point you should have a 3 node ZooKeeper cluster running, which is easily checked with&lt;br /&gt;&lt;code&gt;&lt;br /&gt;echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;You can shutdown the cluster with the following command.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;There are recipes for more services in the &lt;a href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/"&gt;Whirr download&lt;/a&gt; package, and more detailed instructions in the &lt;a href="http://incubator.apache.org/whirr/quick-start-guide.html"&gt;Quick Start Guide&lt;/a&gt;.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/SQhhVJn9ZWY" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/8434577759482729644/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=8434577759482729644" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8434577759482729644?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/8434577759482729644?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2011/04/whirr-in-5-minutes.html" title="Whirr in 5 Minutes" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CU4ERX4zfyp7ImA9Wx9TGU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1093096056004881234</id><published>2010-11-28T05:22:00.004Z</published><updated>2010-11-28T05:58:24.087Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-11-28T05:58:24.087Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Java" /><title>My favourite talk at Devoxx 2010</title><content type="html">I went to &lt;a href="http://www.devoxx.com/display/Devoxx2K10/Home"&gt;Devoxx&lt;/a&gt; in Antwerp for the first time this year, and really enjoyed it. I didn't go to that many talks, but the quality seemed very high. My favourite talk was "Performance Anxiety" by &lt;a href="http://en.wikipedia.org/wiki/Joshua_Bloch"&gt;Josh Bloch&lt;/a&gt;, because he's a great speaker and because he presented a single important idea so well.&lt;br /&gt;&lt;br /&gt;The idea was this: determining the performance of programs should be treated as an empirical science. We should give up any hope (if any existed) that predicting a program's performance will become easier in the future, since every layer in the deep stack of a modern computer is becoming more complex. Increased complexity is actually the price we must pay for increased performance. And increased complexity leads, almost inevitably, to reduced predictability.&lt;br /&gt;&lt;br /&gt;As an experimental demonstration, Josh ran a micro benchmark to sort an array of integers. (The demo actually failed to show what he wanted to show, but he assured us it had worked earlier... It's somehow reassuring when live demos don't work for Java demigods either.) Each invocation of the benchmark did a number of runs, and the timings of the runs converged on a stable value. However, between benchmark invocations, the stable values that they converged on varied by up to 20%.&lt;br /&gt;&lt;br /&gt;The reason is subtle: the HotSpot compiler produces different compile plans on different runs, and these have different performance profiles. (This is explained in Cliff Click's 2009 JavaOne presentation, &lt;a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf"&gt;"The Art of (Java) Benchmarking"&lt;/a&gt;.) They all converge on stable values, but &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; stable values for &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; runs. The fact that HotSpot is non-deterministic may not be particularly surprising, but Josh said that the same behaviour has been shown in C code and even assembler, since non-determinism exists at lower levels of the stack too.&lt;br /&gt;&lt;br /&gt;The practical upshot is that we need to change how we iteratively benchmark code. No longer is it permissible to run a benchmark, make a change, run the benchmark again, see that the execution time was faster (even across a number of runs in one VM) and legitimately conclude that it was due to the change we made. We have to reach for statistical tools that tell us the improved execution time was significant after we have run enough VMs.&lt;br /&gt;&lt;br /&gt;How many VMs? The short answer is "30", the longer answer is in &lt;a href="http://buytaert.net/files/oopsla07-georges.pdf"&gt;"Statistically Rigorous Java Performance Evaluation"&lt;/a&gt; by Andy Georges, Dries Buytaert, and Lieven Eeckhout.&lt;br /&gt;&lt;br /&gt;Thankfully there is a Java framework called &lt;a href="http://code.google.com/p/caliper"&gt;Caliper&lt;/a&gt; which can help you run microbenchmarks and which even plots the error bars for you. This stuff needs to see wider adoption in the industry.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/QG4OwT732Rw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1093096056004881234/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1093096056004881234" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1093096056004881234?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1093096056004881234?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2010/11/my-favourite-talk-at-devoxx-2010.html" title="My favourite talk at Devoxx 2010" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEcHQnc4fyp7ImA9WxJSFkk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7308341528704841683</id><published>2009-05-01T16:52:00.008+01:00</published><updated>2009-05-06T21:33:53.937+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-05-06T21:33:53.937+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Book" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>"Hadoop: The Definitive Guide" Coming Soon</title><content type="html">&lt;a href="http://www.hadoopbook.com/"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 240px; height: 240px;" src="http://1.bp.blogspot.com/_IhqEHw4_Ick/SfseHCO7weI/AAAAAAAAAHc/N6vYnyvN6wQ/s320/htdg.jpg" alt="" id="BLOGGER_PHOTO_ID_5330887690130538978" border="0" /&gt;&lt;/a&gt;After a busy couple of months I've finished the writing for "&lt;a href="http://www.hadoopbook.com/"&gt;Hadoop: The Definitive Guide&lt;/a&gt;". It's now going through the production process at O'Reilly.&lt;br /&gt;&lt;br /&gt;You can pre-order it on &lt;a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=pd_bbs_sr_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241193282&amp;amp;sr=8-1"&gt;Amazon&lt;/a&gt; and &lt;a href="http://oreilly.com/catalog/9780596521998/"&gt;O'Reilly&lt;/a&gt;. You can also get the Rough Cuts version from O'Reilly to read today, although it hasn't yet been refreshed with my latest draft (I hope that will happen in the next few days).&lt;br /&gt;&lt;br /&gt;Here's the final chapter listing. Readers of earlier drafts will notice that the number of chapters has grown: this is because the elephantine MapReduce chapter has been split into three (chapters 6, 7, and 8) to make things more digestible.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Meet Hadoop&lt;/li&gt;&lt;li&gt;MapReduce&lt;/li&gt;&lt;li&gt;The Hadoop Distributed Filesystem&lt;/li&gt;&lt;li&gt;Hadoop I/O&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Developing a MapReduce Application&lt;/li&gt;&lt;li&gt;How MapReduce Works&lt;/li&gt;&lt;li&gt;MapReduce Types and Formats&lt;/li&gt;&lt;li&gt;MapReduce Features&lt;/li&gt;&lt;li&gt;Setting Up a Hadoop Cluster&lt;/li&gt;&lt;li&gt;Administering Hadoop&lt;/li&gt;&lt;li&gt;Pig&lt;/li&gt;&lt;li&gt;HBase&lt;/li&gt;&lt;li&gt;ZooKeeper&lt;/li&gt;&lt;li&gt;Case Studies&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The writing's done but I still have to package up the example code. I'll be doing this soon, and it will appear on the &lt;a href="http://hadoopbook.com/"&gt;book's website&lt;/a&gt;.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/NpTz-GvHuSU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/7308341528704841683/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7308341528704841683" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7308341528704841683?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7308341528704841683?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2009/05/hadoop-definitive-guide-coming-soon.html" title="&quot;Hadoop: The Definitive Guide&quot; Coming Soon" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_IhqEHw4_Ick/SfseHCO7weI/AAAAAAAAAHc/N6vYnyvN6wQ/s72-c/htdg.jpg" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;C0MHSX4yeSp7ImA9WxVQEEg.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1506167963919896927</id><published>2009-01-27T10:01:00.002Z</published><updated>2009-01-27T10:17:18.091Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-01-27T10:17:18.091Z</app:edited><title>Draft Pig Chapter</title><content type="html">A couple of quick updates on the &lt;a href="http://www.hadoopbook.com/"&gt;Hadoop book&lt;/a&gt; I'm writing. The Pig chapter is now available on &lt;a href="http://safari.oreilly.com/9780596521974?cid=orm-cat-readnow-9780596521974"&gt;Safari&lt;/a&gt;. It still has a few holes, but I'd love to hear feedback on it.&lt;br /&gt;&lt;br /&gt;Also included is a Hadoop case study from Last.fm. Thanks to Adrian Woodhead and Marc de Palol for writing it.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/h8dvkotwAJM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1506167963919896927/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1506167963919896927" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1506167963919896927?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1506167963919896927?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2009/01/draft-pig-chapter.html" title="Draft Pig Chapter" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;AkYAR30yeyp7ImA9WxRUEUo.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5053020334356534059</id><published>2008-11-20T10:39:00.005Z</published><updated>2008-11-20T10:49:06.393Z</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-11-20T10:49:06.393Z</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Cloudera" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop Developer Zeitgeist</title><content type="html">The Cloudera team have just released a &lt;a href="http://community.cloudera.com/"&gt;website&lt;/a&gt; which has a few reports on various Hadoop development metrics. I like the &lt;a href="http://community.cloudera.com/reports/2/issues/"&gt;Most Watched Open Jira Issues&lt;/a&gt;, as it gives a good summary of what Hadoop Core developers are thinking about.&lt;br /&gt;&lt;br /&gt;Personally, I can't wait for the new MapReduce API (&lt;a href="http://issues.apache.org/jira/browse/HADOOP-1230"&gt;HADOOP-1230&lt;/a&gt;), which is currently the third most watched issue.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/c5hnPLuJZ1s" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/5053020334356534059/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5053020334356534059" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5053020334356534059?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5053020334356534059?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/11/hadoop-developer-zeitgeist.html" title="Hadoop Developer Zeitgeist" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DkcAQ3s-cSp7ImA9WxRXEUg.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1310920520805089454</id><published>2008-10-16T11:36:00.003+01:00</published><updated>2008-10-16T11:47:22.559+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-10-16T11:47:22.559+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Cloudera" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Cloudera</title><content type="html">I'm pleased to announce that I've joined &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;, a new startup providing support for Hadoop. Amr Awadallah (who's one of the founders) has got more details in his &lt;a href="http://www.awadallah.com/blog/2008/10/13/the-startup-is-cloudera-the-business-is-hadoop-mapreduce/"&gt;blog post&lt;/a&gt;.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/Ib32m9JS60Q" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1310920520805089454/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1310920520805089454" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1310920520805089454?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1310920520805089454?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/10/cloudera.html" title="Cloudera" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;DUcDQXg6eCp7ImA9WxRSFUU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1667539494188985504</id><published>2008-09-16T11:26:00.002+01:00</published><updated>2008-09-16T18:44:30.610+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-16T18:44:30.610+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Book" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop: The Definitive Guide</title><content type="html">&lt;span style="font-weight:bold;"&gt;Update:&lt;/span&gt; Fixed feedback link.&lt;br /&gt;&lt;br /&gt;The Rough Cut of &lt;a href="http://oreilly.com/catalog/9780596521998/index.html"&gt;Hadoop: The Definitive Guide&lt;/a&gt; is now up on O'Reilly's site. There are a few chapters available already, at various stages of completion. Remember, it's still pretty rough. I'd love to hear any suggestions for improvements that you may have though. You can submit feedback from &lt;a href="http://safari.oreilly.com/9780596521974"&gt;Safari&lt;/a&gt; where the book is hosted. As the &lt;a href="http://oreilly.com/roughcuts/faq.csp"&gt;Rough Cuts FAQ&lt;/a&gt; explains, I'd like feedback on missing topics, if something is not understandable, and technical mistakes.&lt;br /&gt;&lt;br /&gt;Now I just need to go and write the rest of it...&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/y1_XSTXr3L4" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1667539494188985504/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1667539494188985504" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1667539494188985504?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1667539494188985504?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/09/hadoop-definitive-guide.html" title="Hadoop: The Definitive Guide" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;AkAMQHw_eSp7ImA9WxRSEkk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7867943272436546694</id><published>2008-09-04T22:00:00.004+01:00</published><updated>2008-09-12T20:46:21.241+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-12T20:46:21.241+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Amazon S3" /><category scheme="http://www.blogger.com/atom/ns#" term="Data" /><category scheme="http://www.blogger.com/atom/ns#" term="Amazon EC2" /><title>Hosting Large Public Datasets on Amazon S3</title><content type="html">&lt;span style="font-style:italic;"&gt;&lt;span style="font-weight:bold;"&gt;Update&lt;/span&gt;: I just thought of a quick and dirty way of doing this: just store your content on an extra large EC2 instance (holds up to 1690GB) and make the image public. Anyone can access it using their EC2 account, you just get charged for hosting the image.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's a great deal of interest in large, publicly available datasets (see, for example, &lt;a href="http://groups.google.com/group/get-theinfo/browse_thread/thread/79e5b1159e533d52"&gt;this thread&lt;/a&gt; from &lt;a href="http://theinfo.org/"&gt;theinfo.org&lt;/a&gt;), but for very large datasets it is still expensive to provide the bandwidth to distribute them. Imagine if you could get your hands on the data from a large web crawl, the kind of thing that the &lt;a href="http://www.archive.org/"&gt;Internet Archive&lt;/a&gt; produces. I'm sure people would discover some interesting things from it.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aws.amazon.com/s3"&gt;Amazon S3&lt;/a&gt; is an obvious choice for storing data for public consumption, but while the cost for storage may be reasonable, the cost for transfer can be crippling since the cost is not under the control of the data provider, being incurred &lt;span style="font-style: italic;"&gt;for each transfer&lt;/span&gt; (which is initiated by the user).&lt;br /&gt;&lt;br /&gt;For example, consider a 1TB dataset. With storage running at $0.15 per GB per month this works out at around $150 per month to host. With transfer costs costing $0.18 per GB, this dataset costs around $180 for each transfer out of Amazon! It's not surprising large datasets are not publicly hosted on S3.&lt;br /&gt;&lt;br /&gt;However, transferring data between S3 and EC2 is free, so could we limit transfers from S3 so they are only possible to EC2? You (or anyone else) could run an analysis on EC2 (using Hadoop, say) and only pay for the EC2 time. Or you could transfer it out of EC2 at your own expense. S3 doesn't support this option directly, but it is possible to emulate it with a bit of code.&lt;br /&gt;&lt;br /&gt;The idea (suggested by &lt;a href="http://blog.lucene.com/"&gt;Doug Cutting&lt;/a&gt;) is to make objects private on S3 to restrict access generally, then run a proxy on EC2 that is authorized to access the objects. The proxy only accepts connections from within EC2: any client that is outside Amazon's cloud is firewalled out. This combination ensures only EC2 instances can access the S3 objects, thus removing any bandwidth costs.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Implementation&lt;/h3&gt;I've written such a proxy. It's a Java servlet that uses the &lt;a href="https://jets3t.dev.java.net/"&gt;JetS3t&lt;/a&gt; library to add the correct &lt;a href="http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?RESTAuthentication.html"&gt;Amazon S3 &lt;code&gt;Authorization&lt;/code&gt; HTTP header&lt;/a&gt; to gain access to the owner's objects on S3. If the proxy is running on the EC2 instance with hostname &lt;span style="font-style: italic;"&gt;ec2-67-202-43-67.compute-1.amazonaws.com&lt;/span&gt;, for example, then a request for&lt;br /&gt;&lt;pre&gt;http://ec2-67-202-43-67.compute-1.amazonaws.com/bucket/object&lt;br /&gt;&lt;/pre&gt;is proxied to the protected object at&lt;br /&gt;&lt;pre&gt;http://s3.amazonaws.com/bucket/object&lt;br /&gt;&lt;/pre&gt;To ensure that only clients on EC2 can get access to the proxy I set up an EC2 security group (which limits access to port 80):&lt;br /&gt;&lt;pre&gt;ec2-add-group ec2-private-subnet -d "Group for all Amazon EC2 instances."&lt;br /&gt;ec2-authorize ec2-private-subnet -p 80 -s 10.0.0.0/8&lt;/pre&gt;Then by launching the proxy in this group, only machines on EC2 can connect. (Initially, I thought I had to add public IP addresses to the group -- which, incidentally, I found in &lt;a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51028"&gt;this forum posting&lt;/a&gt; -- but this is not necessary as the public DNS name of an EC2 instance resolves to the private IP address within EC2.) The AWS credentials to gain access to the S3 objects are passed in the user data, along with the hostname of S3:&lt;br /&gt;&lt;pre&gt;ec2-run-instances -k gsg-keypair -g ec2-private-subnet \&lt;br /&gt;-d "&amp;lt;aws_access_key&amp;gt; &amp;lt;aws_secret_key&amp;gt; s3.amazonaws.com" ami-fffd1996&lt;/pre&gt;This AMI (ID &lt;code&gt;ami-fffd1996&lt;/code&gt;) is publicly available, so anyone can use it by using the commands shown here. (The code is available &lt;a href="http://s3.amazonaws.com/s3proxy/s3proxy-0.1.tar.gz"&gt;here&lt;/a&gt;, under an Apache 2.0 license, but you don't need this if you only intend to run or use a proxy.)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Demo&lt;/h3&gt;Here's a resource on S3 that is protected: &lt;span style="font-style:italic;"&gt;http://s3.amazonaws.com/tiling/private.txt&lt;/span&gt;. When you try to retrieve it you get an authorization error:&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://s3.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;br /&gt;&amp;lt;Error&amp;gt;&lt;br /&gt;&amp;lt;Code&amp;gt;AccessDenied&amp;lt;/Code&amp;gt;&lt;br /&gt;&amp;lt;Message&amp;gt;Access Denied&amp;lt;/Message&amp;gt;&lt;br /&gt;&amp;lt;RequestId&amp;gt;57E370CDDD9FE044&amp;lt;/RequestId&amp;gt;&lt;br /&gt;&amp;lt;HostId&amp;gt;dA+9II1dYAjPE5aNsnRxhVoQ5qy3KCa6frkLg3SyTwzP3i2SQNCU534/v8NXXEnN&amp;lt;/HostId&amp;gt;&lt;br /&gt;&amp;lt;/Error&amp;gt;&lt;/pre&gt;With a proxy running, I still can't retrieve the resource via the proxy from outside EC2. It just times out due to the firewall rule:&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;curl: (7) couldn't connect to host&lt;/pre&gt;But it does works from an EC2 machine (any EC2 machine):&lt;br /&gt;&lt;pre&gt;% &lt;span style="font-weight: bold;"&gt;curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt&lt;/span&gt;&lt;br /&gt;secret&lt;/pre&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;By running a proxy on EC2, at 10 cents per hour (small instance) - or $72 a month, you can allow folks using EC2 to access your data on S3 for free. While running the proxy is not free, it is a fixed cost that might be acceptable to some organizations, particularly those that have an interest in making data publicly available (but can't stomach large bandwidth costs).&lt;br /&gt;&lt;br /&gt;A few questions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is this useful?&lt;/li&gt;&lt;li&gt;Is there a better way of doing it?&lt;/li&gt;&lt;li&gt;Can we have this built into S3 (please, Amazon)?&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/9s8s8xrVFmI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/7867943272436546694/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7867943272436546694" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7867943272436546694?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7867943272436546694?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/09/hosting-large-public-datasets-on-amazon.html" title="Hosting Large Public Datasets on Amazon S3" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;DkcNRHs6fyp7ImA9WxdaFU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7570203001448264185</id><published>2008-08-23T21:39:00.002+01:00</published><updated>2008-08-23T21:41:35.517+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-08-23T21:41:35.517+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Amazon Web Services" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Elastic Hadoop Clusters with Amazon's Elastic Block Store</title><content type="html">I gave a &lt;a href="http://skillsmatter.com/podcast/cloud-grid/hadoop-on-amazon-s3ec2"&gt;talk&lt;/a&gt; on Tuesday at the first &lt;a href="http://skillsmatter.com/event/java-jee/hadoop-user-group-meeting"&gt;Hadoop User Group UK&lt;/a&gt; about Hadoop and Amazon Web services - how and why you can run Hadoop with AWS. I mentioned how integrating Hadoop with Amazon's "Persistent local storage", which Werner Vogels had &lt;a href="http://www.allthingsdistributed.com/2008/04/persistent_storage_for_amazon.html"&gt;pre-announced&lt;/a&gt; in April, would be a great feature to have to enable truly elastic Hadoop clusters that you could stop and start on demand.&lt;br /&gt;&lt;br /&gt;Well, the very next day Amazon launched this service, called &lt;a href="http://www.amazon.com/b/ref=sc_fe_c_0_201590011_1?ie=UTF8&amp;amp;node=689343011&amp;amp;no=201590011"&gt;Elastic Block Store&lt;/a&gt; (EBS). So in this post I thought I'd sketch out how an elastic Hadoop might work.&lt;br /&gt;&lt;br /&gt;A bit of background. Currently there are three main ways to use Hadoop with AWS:&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;1. MapReduce with S3 source and sink&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s1600-h/s3-mapred.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s320/s3-mapred.png" alt="" id="BLOGGER_PHOTO_ID_5237387812913173602" border="0"&gt;&lt;/a&gt;In this set up, the data resides on S3, and the MapReduce daemons run on a temporary EC2 cluster for the duration of the job run. This works, and is especially convenient if you've already store your data on S3, but you don't get any data locality. Data locality is what enables the magic of MapReduce to work efficiently - the computation is scheduled to run on the machine where the data is stored, so you get huge savings in not having to ship terabytes of data around the network. EC2 does not share nodes with S3 storage, in fact they are often in different data centres, so performance is nowhere near as good as a regular Hadoop cluster where the data in stored in HDFS (see 3. below).&lt;br /&gt;&lt;br /&gt;It's not all doom and gloom, as the bandwidth between EC2 and S3 is actually pretty good, as Rightscale &lt;a href="http://blog.rightscale.com/2007/10/28/network-performance-within-amazon-ec2-and-to-amazon-s3/"&gt;found when they did some measurements&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2. MapReduce from S3 with HDFS staging&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_IhqEHw4_Ick/SK7wv4AZxeI/AAAAAAAAAFA/66XE-v6mgFE/s1600-h/s3-hdfs-mapred.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://1.bp.blogspot.com/_IhqEHw4_Ick/SK7wv4AZxeI/AAAAAAAAAFA/66XE-v6mgFE/s320/s3-hdfs-mapred.png" alt="" id="BLOGGER_PHOTO_ID_5237388121957058018" border="0"&gt;&lt;/a&gt;Data is stored on S3 but copied to a temporary HDFS cluster running on EC2. This is just a variation of the previous set-up, which is good if you want to run several jobs against the same input data. You save by only copying the data across the network once, but you pay a little more due to HDFS replication.&lt;br /&gt;&lt;br /&gt;The bottleneck is still copying the data out of S3. (Copying the results back into S3 isn't usually as bad as the output is often an order or two of magnitude smaller than the input.)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;3. HDFS on Amazon EC2&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7w--YOZvI/AAAAAAAAAFI/iu8En_A_Arw/s1600-h/ec2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7w--YOZvI/AAAAAAAAAFI/iu8En_A_Arw/s320/ec2.png" alt="" id="BLOGGER_PHOTO_ID_5237388381365626610" border="0"&gt;&lt;/a&gt;Of course, you could just run a Hadoop cluster on EC2 and store your data there (and not in S3). In this scenario, you are committed to running your EC2 cluster long term, which can prove expensive, although the locality is excellent.&lt;br /&gt;&lt;br /&gt;These three scenarios demonstrate that you pay for locality. However, there is a gulf between S3 and local disks that EBS fills nicely. EBS does not have the bandwidth of local disks, but it's significantly better than S3. Rightscale &lt;a href="http://blog.rightscale.com/2008/08/20/amazon-ebs-explained/"&gt;again&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;The bottom line though is that performance exceeds what we’ve seen for filesystems striped across the four local drives of x-large instances.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;h3&gt;Implementing Elastic Hadoop&lt;/h3&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_IhqEHw4_Ick/SK7xRq_YwjI/AAAAAAAAAFY/0ypyV3MrprQ/s1600-h/ec2-with-storage-volumes.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://3.bp.blogspot.com/_IhqEHw4_Ick/SK7xRq_YwjI/AAAAAAAAAFY/0ypyV3MrprQ/s320/ec2-with-storage-volumes.png" alt="" id="BLOGGER_PHOTO_ID_5237388702578688562" border="0"&gt;&lt;/a&gt;The main departure from the current Hadoop on EC2 approach is the need to maintain a map from storage volume to node type: i.e. we need to remember which volume is a master volume (storing the namenode's data) and which is a worker volume (storing the datanode's data). It would be nice if you could just start up EC2 instances for all the volumes, and have them figure out which is which, but this might not work as the master needs to be started first so its address can be given to the workers in their user data. (This choreography problem could be solved by introducing ZooKeeper, but that's another story.) So for a first cut, we could simply keep two files (locally, or on S3 or even SimpleDB) called &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt;, and &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt;, which simply list the volume IDs for each node type, one per line.&lt;br /&gt;&lt;br /&gt;Assume there is one master running the namenode and jobtracker, and &lt;font style="font-style: italic;"&gt;n&lt;/font&gt; worker nodes each running a datanode and tasktracker.&lt;br /&gt;&lt;br /&gt;To create a new cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create &lt;font style="font-style: italic;"&gt;n&lt;/font&gt; + 1 volumes.&lt;/li&gt;&lt;li&gt;Create the &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt; file and write the first volume ID into it.&lt;/li&gt;&lt;li&gt;Create the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file and write the remaining volume IDs to it.&lt;/li&gt;&lt;li&gt;Follow the steps for starting a cluster.&lt;/li&gt;&lt;/ol&gt;To start a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Start the master instance passing it the &lt;font style="font-style: italic;"&gt;master-volumes&lt;/font&gt; as user data. On startup the instance attaches to the volume it was told to. It then formats the namenode if it isn't already formatted, then starts the namenode, secondary namenode and jobtracker.&lt;/li&gt;&lt;li&gt;Start &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; worker instances passing it the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; as user data. On startup each instance attaches to the volume on line &lt;font style="font-style: italic;"&gt;i&lt;/font&gt;, where &lt;font style="font-style: italic;"&gt;i&lt;/font&gt; is the &lt;font face="courier new"&gt;ami-launch-index&lt;/font&gt; of the instance. Each instance then starts a datanode and tasktacker.&lt;/li&gt;&lt;li&gt;If any worker instances failed to start then launch them again.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;To shutdown a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Shutdown the Hadoop cluster daemons.&lt;/li&gt;&lt;li&gt;Detach the EBS volumes.&lt;/li&gt;&lt;li&gt;Shutdown the EC2 instances.&lt;/li&gt;&lt;/ol&gt;To grow a cluster&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; new volumes, where &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; is the size to grow by.&lt;/li&gt;&lt;li&gt;Append the &lt;font style="font-style: italic;"&gt;m&lt;/font&gt; new volume IDs to the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file.&lt;/li&gt;&lt;li&gt;Start &lt;span style="font-style: italic;"&gt;m&lt;/span&gt; worker instances passing it the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; as user data. On startup each instance attaches to the volume on line &lt;font style="font-style: italic;"&gt;n + i&lt;/font&gt;, where &lt;font style="font-style: italic;"&gt;i&lt;/font&gt; is the &lt;font face="courier new"&gt;ami-launch-index&lt;/font&gt; of the instance. Each instance then starts a datanode and tasktacker.&lt;/li&gt;&lt;/ol&gt;Future enhancements: attach multiple volumes for performance/storage growth on the namenode, or resilience on the namenode; integrate the secondary namenode backup facility with EBS snapshots to S3; provide tools for managing the &lt;font style="font-style: italic;"&gt;worker-volumes&lt;/font&gt; file (for example, integrating with datanode decommissioning).&lt;br /&gt;&lt;br /&gt;Building this would be a great project to work on - I hope someone does it!&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/Q-zoiOc-7uM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/7570203001448264185/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7570203001448264185" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7570203001448264185?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7570203001448264185?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/08/elastic-hadoop-clusters-with-amazons.html" title="Elastic Hadoop Clusters with Amazon's Elastic Block Store" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_IhqEHw4_Ick/SK7wd4umVGI/AAAAAAAAAE4/pnzP5XjfjtI/s72-c/s3-mapred.png" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;DU4FRXY9cSp7ImA9WxdVGEk.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-1050541487495156879</id><published>2008-07-23T20:26:00.002+01:00</published><updated>2008-07-23T22:18:34.869+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-23T22:18:34.869+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Pluggable Hadoop</title><content type="html">&lt;span style="font-weight:bold;"&gt;Update&lt;/span&gt;: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via &lt;a href="http://raibledesigns.com/rd/entry/oscon_2008_the_keynote"&gt;Matt Raible&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that in software is through modularity.&lt;br /&gt;&lt;br /&gt;Some examples:&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Job scheduling&lt;/h3&gt;The current scheduler is a simple FIFO scheduler which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use &lt;a href="http://hadoop.apache.org/core/docs/current/hod.html"&gt;HOD&lt;/a&gt; (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation led to a number of proposals to make the scheduler pluggable (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-2510"&gt;HADOOP-2510&lt;/a&gt;, &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3412"&gt;HADOOP-3412&lt;/a&gt;, &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3444"&gt;HADOOP-3444&lt;/a&gt;). Already there is a &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3746"&gt;fair scheduler implementation&lt;/a&gt; (like the &lt;a href="http://en.wikipedia.org/wiki/Completely_Fair_Scheduler"&gt;Completely Fair Scheduler&lt;/a&gt; in Linux) from Facebook.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;HDFS block placement&lt;/h3&gt;Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has &lt;a href="https://issues.apache.org/jira/browse/HADOOP-2559"&gt;evolved&lt;/a&gt;, it is clear that a one-size-fits-all approach is not necessarily the best approach. Hence the new proposal to support &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3799"&gt;pluggable block placement algorithms&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Instrumentation&lt;/h3&gt;Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/metrics/package-summary.html"&gt;metrics API&lt;/a&gt; (for gathering statistics from the main components of Hadoop), but there is interest in adding other logging systems, such as &lt;a href="http://radlab.cs.berkeley.edu/wiki/Projects/X-Trace_on_Hadoop"&gt;X-Trace&lt;/a&gt;, via a new &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3772"&gt;instrumentation API&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Serialization&lt;/h3&gt;The ability to use &lt;a href="https://issues.apache.org/jira/browse/HADOOP-1986"&gt;pluggable serialization frameworks&lt;/a&gt; in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around &lt;a href="http://incubator.apache.org/thrift/"&gt;Apache Thrift&lt;/a&gt; and &lt;a href="http://code.google.com/p/protobuf/"&gt;Google Protocol Buffers&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Component lifecycle&lt;/h3&gt;There is work being done to &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3628"&gt;add a lifecyle interface to Hadoop components&lt;/a&gt;. One of the goals is to make it easier to subclass components, so they can be customized.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Remove dependency cycles&lt;/h3&gt;This is really just good engineering practice, but the existence of dependencies makes it harder to understand, modify and extend code. Bill de hÓra did a great &lt;a href="http://www.dehora.net/journal/2008/07/06/3-12-minutes-to-sort-a-terabyte-hadoops-code-structure/"&gt;analysis&lt;/a&gt; of Hadoop's code structure (and its deficiencies), which has lead to some work to enforce module dependencies and remove the cycles.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/I4nzpaMFwRM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/1050541487495156879/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=1050541487495156879" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1050541487495156879?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/1050541487495156879?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/07/pluggable-hadoop.html" title="Pluggable Hadoop" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEUGQX4-eip7ImA9WxdWFU8.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-2466668333674859777</id><published>2008-07-08T10:43:00.012+01:00</published><updated>2008-07-08T14:03:40.052+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-08T14:03:40.052+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Thrift" /><category scheme="http://www.blogger.com/atom/ns#" term="Serialization" /><category scheme="http://www.blogger.com/atom/ns#" term="RPC" /><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>RPC and Serialization with Hadoop, Thrift, and Protocol Buffers</title><content type="html">Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's &lt;a href="http://google-code-updates.blogspot.com/2008/07/protocol-buffers-our-serialized.html"&gt;newly released Protocol Buffers&lt;/a&gt; might fit in.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;RPC and Writables&lt;/h3&gt;Hadoop has its own RPC mechanism that dates back to when Hadoop was a part of &lt;a href="http://lucene.apache.org/nutch/"&gt;Nutch&lt;/a&gt;. It's used throughout Hadoop as the mechanism by which daemons talk to each other. For example, a &lt;code&gt;DataNode&lt;/code&gt; communicates with the &lt;code&gt;NameNode&lt;/code&gt;  using the RPC interface &lt;code&gt;DatanodeProtocol&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Protocols are defined using Java interfaces whose arguments and return types are primitives, Strings, Writables, or arrays. These types can all be serialized using Hadoop's specialized serialization format, based on &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Writable.html"&gt;Writable&lt;/a&gt;. Combined with the magic of &lt;a href="http://java.sun.com/j2se/1.3/docs/guide/reflection/proxy.html"&gt;Java dynamic proxies&lt;/a&gt;, we get a simple RPC mechanism which for the caller appears to be a Java interface.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;MapReduce and Writables&lt;/h3&gt;Hadoop uses Writables for another, quite different, purpose: as a serialization format for MapReduce programs. If you've ever written a Hadoop MapReduce program you will have used Writables for the key and value types. For example:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;public class MapClass&lt;br /&gt;implements Mapper&amp;lt;LongWritable, Text, Text, IntWritable&amp;gt; {&lt;br /&gt;&lt;br /&gt;// ...&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;(&lt;code&gt;Text&lt;/code&gt; is just a Writable version of Java &lt;code&gt;String&lt;/code&gt;.)&lt;br /&gt;&lt;br /&gt;The primary benefit of using Writables is in their efficiency. Compared to &lt;a href="http://java.sun.com/j2se/1.4.2/docs/guide/serialization/index.html"&gt;Java serialization&lt;/a&gt;, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected. For the MapReduce code above, the input key is a &lt;code&gt;LongWritable&lt;/code&gt;, so an empty &lt;code&gt;LongWritable&lt;/code&gt; instance is asked to populate itself from the input data stream.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;More flexible MapReduce&lt;/h3&gt;There are downsides of having to use Writables for MapReduce types, however. For a newcomer to Hadoop it's another hurdle: something else to learn ("why can't I just use a String?"). More seriously, perhaps, is that it's hard to use different binary storage formats for MapReduce input and output. For example, Apache Thrift (see below) is an increasingly popular way of storing binary data. It's possible, but cumbersome and inefficient, to read or write Thrift data from MapReduce.&lt;br /&gt;&lt;br /&gt;From Hadoop 0.17.0 onwards &lt;a href="https://issues.apache.org/jira/browse/HADOOP-1986"&gt;you no longer have to use Writables for key and value types in MapReduce programs&lt;/a&gt;. You can use any serialization framework. (Note that this is change is completely independent of Hadoop's RPC mechanism, which still uses Writables - and can only use Writables - as its on-wire format.) So it's easier to use Thrift types, say, throughout your MapReduce program. Or you can even use Java serialization (with &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3566"&gt;some limitations&lt;/a&gt; which will be fixed). What's more, you can add your own serialization framework if you like.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Record I/O, Thrift and Protocol Buffers&lt;br /&gt;&lt;/h3&gt;Another problem with Writables, at least for the MapReduce programmer, is that creating new types is a burden. You have to implement the Writable interface, which means designing the on-wire format, and writing two methods: one to write the data in that format and one to read it back.&lt;br /&gt;&lt;br /&gt;Hadoop's Record I/O was created to solve this problem. You write a definition of your types using a record definition language, then run a record compiler to generate Java source code representations of your types. All Record I/O types are Writable, so they plug into Hadoop very easily. As a bonus, you can generate bindings for other languages, so it's easy to read your data files from other programs.&lt;br /&gt;&lt;br /&gt;For whatever reason, Record I/O never really took off. It's used in &lt;a href="http://zookeeper.sourceforge.net/"&gt;ZooKeeper&lt;/a&gt;, but that's about it (and ZooKeeper will &lt;a href="http://publists.facebook.com/pipermail/thrift/2008-January/000330.html"&gt;move away&lt;/a&gt; from it someday). Momentum has switched to &lt;a href="http://incubator.apache.org/thrift/"&gt;Thrift&lt;/a&gt; (from Facebook, now in the Apache Incubator), which offers a very similar proposition, but in more languages. Thrift also makes it easy to build a (cross-language) RPC mechanism.&lt;br /&gt;&lt;br /&gt;Yesterday, Google open sourced &lt;a href="http://code.google.com/apis/protocolbuffers/"&gt;Protocol Buffers&lt;/a&gt;, its "language-neutral, platform-neutral, extensible mechanism for serializing structured data". Record I/O, Thrift and Protocol Buffers are really solving the same problem, so it will be interesting to see how this develops. Of course, since we're talking about persistent data formats, nothing's going to go away in the short or medium term while people have significant amounts of data locked up in these formats.&lt;br /&gt;&lt;br /&gt;That's why it makes sense to add support in Hadoop for MapReduce using Thrift and Protocol Buffers: so people can process data in the format they have it in. This will be a relatively simple addition.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;What Next?&lt;/h3&gt;For RPC, where a message is short-lived, changing the mechanism is more viable in the short term. Going back to Hadoop's RPC mechanism, now that both Thrift and Protocol Buffers offer an alternative, it may well be time to evaluate them to see if either can offer a performance boost. It would be a big job to retrofit RPC in Hadoop with another implementation, but if there are significant performance gains to be had, then it would be worth doing.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/y6qV5CBXDBU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/2466668333674859777/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=2466668333674859777" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2466668333674859777?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/2466668333674859777?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/07/rpc-and-serialization-with-hadoop.html" title="RPC and Serialization with Hadoop, Thrift, and Protocol Buffers" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;CEcERno7fSp7ImA9WxdWEEQ.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5017354658620266704</id><published>2008-07-03T14:09:00.005+01:00</published><updated>2008-07-03T14:33:27.405+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-07-03T14:33:27.405+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop beats terabyte sort record</title><content type="html">Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's &lt;a href="http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html"&gt;blog post&lt;/a&gt; (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the &lt;a href="http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/terasort/"&gt;code in trunk&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Well done Owen and well done Hadoop!&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/24uWmVN0Bk4" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/5017354658620266704/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5017354658620266704" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5017354658620266704?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5017354658620266704?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/07/hadoop-beats-terabyte-sort-record.html" title="Hadoop beats terabyte sort record" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;DkYERXYyfyp7ImA9WxdQGUo.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-7956405302605780166</id><published>2008-06-20T16:01:00.001+01:00</published><updated>2008-06-20T16:01:44.897+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-06-20T16:01:44.897+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="MapReduce" /><category scheme="http://www.blogger.com/atom/ns#" term="Query Languages" /><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><title>Hadoop Query Languages</title><content type="html">If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://incubator.apache.org/pig/"&gt;Pig&lt;/a&gt;, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.jaql.org/"&gt;Jaql&lt;/a&gt;, from IBM and soon to be open sourced, is a declarative query language for JSON data.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Hive, from Facebook and soon to &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3601"&gt;become a Hadoop contrib module&lt;/a&gt;, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;All three projects have different strengths, but there is plenty of scope for collaboration and cross-pollination, particularly in the query language. For example, at the &lt;a href="http://research.yahoo.com/node/2104"&gt;Hadoop Summit&lt;/a&gt; in March, Joydeep Sen Sarma of Facebook said that they would be receptive to users who wanted to use Pig Latin or Jaql in Hive. And Kevin Beyer of IBM Research said that Pig and Jaql are converging, and they've had discussions with the Pig team about how to bring them even closer together.&lt;br /&gt;&lt;br /&gt;Meanwhile, to learn more I recommend &lt;a href="http://www.cs.cmu.edu/%7Eolston/publications/sigmod08.pdf"&gt;Pig Latin: A Not-So-Foreign Language for Data Processing&lt;/a&gt; (by Chris Olston &lt;span style="font-style: italic;"&gt;et al&lt;/span&gt;&lt;span&gt;), and the &lt;a href="http://research.yahoo.com/node/2104"&gt;slides and videos from the Hadoop Summit&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(And I haven't even included &lt;a href="http://www.cascading.org/"&gt;Cascading&lt;/a&gt;, from &lt;a href="http://chris.wensel.net/"&gt;Chris K. Wensel&lt;/a&gt;, which, while not a query language &lt;span style="font-style: italic;"&gt;per se&lt;/span&gt;, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/zMmed1L2000" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/7956405302605780166/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=7956405302605780166" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7956405302605780166?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/7956405302605780166?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/06/hadoop-query-languages.html" title="Hadoop Query Languages" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>4</thr:total></entry><entry gd:etag="W/&quot;A0MESX87fSp7ImA9WxdQE0g.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-3339094449895398610</id><published>2008-06-13T13:15:00.000+01:00</published><updated>2008-06-13T13:16:48.105+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-06-13T13:16:48.105+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>"The Next Big Thing"</title><content type="html">&lt;a href="http://mvdirona.com/jrh/perspectives/2007/10/29/WelcomeToPerspectives.aspx"&gt;James Hamilton&lt;/a&gt; on &lt;a href="http://perspectives.mvdirona.com/2008/01/15/TheNextBigThing.aspx"&gt;The Next Big Thing&lt;/a&gt;:&lt;blockquote&gt;Storing blobs in the sky is fine but pretty reproducible by any competitor.  Storing structured data as well as blobs is considerably more interesting but what has even more lasting business value is the storing data in the cloud AND providing a programming platform for multi-thousand node data analysis.  Almost every reasonable business on the planet has a complex set of dimensions that need to be optimized.&lt;/blockquote&gt;&lt;br /&gt;I think we're only beginning to see interesting data processing being done in the cloud - there's much more to come.&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/Xf4dh57LFFM" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/3339094449895398610/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=3339094449895398610" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3339094449895398610?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/3339094449895398610?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/05/next-big-thing.html" title="&quot;The Next Big Thing&quot;" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;AkQBSHw_cCp7ImA9WxdREUs.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-5153878883282354355</id><published>2008-05-30T17:15:00.004+01:00</published><updated>2008-05-30T18:25:59.248+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-05-30T18:25:59.248+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Family" /><category scheme="http://www.blogger.com/atom/ns#" term="Mobile" /><title>Bluetooth Castle</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s1600-h/raglan_castle.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s200/raglan_castle.jpg" alt="" id="BLOGGER_PHOTO_ID_5206216384927168578" border="0" /&gt;&lt;/a&gt;Today I visited &lt;a href="http://en.wikipedia.org/wiki/Raglan_Castle"&gt;Raglan Castle&lt;/a&gt; in Monmouthshire with my family. &lt;a href="http://www.cadw.wales.gov.uk/"&gt;Cadw&lt;/a&gt;, the government body that manages the castle, were running a trial to deliver audio files to visitors' mobile phones using Bluetooth. As I walked through the entrance I simply made my phone discoverable, waited a few seconds for the MP3 to download, then started listening to a &lt;a href="http://www.cadw-feedback.org.uk/audio_en.html"&gt;guided tour&lt;/a&gt; of the castle. It's a great use of the technology: it just worked.&lt;br /&gt;&lt;br /&gt;The talk only lasted a few minutes, so we had plenty of time afterwards to run around the ruins.&lt;br /&gt;&lt;br /&gt;A couple of technical questions that sprang to mind:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;How would you set up a server to push files over Bluetooth? (There are loads of ways you could use this - maps of the local area at transport hubs, sharing the schedule at conferences, random photo of the day at home, etc.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Can you make audio files navigable? That is, make it easy to go to the part of audio file that is about a given exhibit by typing in the exhibit's number? (This problem reminds me of Cliff Schmidt's &lt;a href="http://www.eu.apachecon.com/eu2008/program/talk/2625.html"&gt;talk&lt;/a&gt; about the Talking Book Device at ApacheCon EU 2008.)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/DNegKFBKkXU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/5153878883282354355/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=5153878883282354355" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5153878883282354355?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/5153878883282354355?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/05/bluetooth-castle.html" title="Bluetooth Castle" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_IhqEHw4_Ick/SEAyN-dm-EI/AAAAAAAAAEw/o8lOSEi9SRM/s72-c/raglan_castle.jpg" height="72" width="72" /><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;A0ICSHw8eyp7ImA9WxZaFUU.&quot;"><id>tag:blogger.com,1999:blog-8898949683610477251.post-4309697238982889737</id><published>2008-04-22T21:05:00.012+01:00</published><updated>2008-04-30T22:06:09.273+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-04-30T22:06:09.273+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Hadoop" /><category scheme="http://www.blogger.com/atom/ns#" term="Cloud Computing" /><title>Portable Cloud Computing</title><content type="html">Last July I asked "&lt;a href="http://www.lexemetech.com/2007/07/why-are-there-no-amazon-s3ec2.html"&gt;Why are there no Amazon S3/EC2 competitors?&lt;/a&gt;", lamenting the lack of competition in the utility or cloud computing market and the implications for disaster recovery. Closely tied to disaster recover is &lt;span style="font-style: italic;"&gt;portability&lt;/span&gt; -- the ability to switch between different utility computing providers as easily as I switch electricity suppliers. (OK, that can be a pain, at least in the UK, but it doesn't require rewiring my house.)&lt;br /&gt;&lt;br /&gt;It's useful to compare &lt;a href="http://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; with Google's recently launched &lt;a href="http://code.google.com/appengine/"&gt;App Engine&lt;/a&gt; in these terms. In some sense they compete, but they are strikingly different offerings. Amazon's services are &lt;span style="font-style: italic;"&gt;bottom up&lt;/span&gt;: "here's a CPU, now install your own software". Google's is &lt;span style="font-style: italic;"&gt;top down&lt;/span&gt;: "write some code to these APIs and we'll scale it for you". There's no way I can take my EC2 services and run them on App Engine. But I can do the reverse -- sort of -- thanks to &lt;a href="http://appdrop.com/"&gt;AppDrop.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And that's the thing. What I would love is a utility service from different providers that I can switch between. That's one of the meanings of "utility", after all. (Moving lots of data between providers may make this difficult or expensive to do in practice -- "data inertia" -- but that's not a reason not to have the capability.)&lt;br /&gt;&lt;br /&gt;There are at least two ways this can happen. One's the AppDrop approach -- emulate Google's API and provide an alternative place to run applications, in this case it's EC2.&lt;br /&gt;&lt;br /&gt;However, there's another way: build "standard, non-proprietary cloud APIs with open-source implementations", as Doug Cutting says on his blog post &lt;a href="http://blog.lucene.com/2008/04/09/cloud-commodity-or-proprietary/"&gt;Cloud: commodity or proprietary?&lt;/a&gt; In this world, applications use a common API, and host with whichever provider they fancy. Bottom up offerings like Amazon's facilitate this approach: the underlying platforms may differ, but it's easy to run your application on the provided platform -- for example, by building an Amazon AMI. Google's top down approach is not so amenable, application developers are locked-in to the APIs Google provide. (Of course, Google may open this platform up more over time, but it remains to be seen if they will open it up to the extent of being able to run arbitrary executables.)&lt;br /&gt;&lt;br /&gt;As Doug notes, &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is providing a lot of the building blocks for building cloud services: filesystem (&lt;a href="http://hadoop.apache.org/core/"&gt;HDFS&lt;/a&gt;), database (&lt;a href="http://hadoop.apache.org/hbase/"&gt;HBase&lt;/a&gt;), computation (&lt;a href="http://hadoop.apache.org/core/"&gt;MapReduce&lt;/a&gt;), coordination (&lt;a href="http://zookeeper.sourceforge.net/"&gt;ZooKeeper&lt;/a&gt;). And here, perhaps, is where the two approaches may meet -- AppDrop could be backed by HBase (or indeed &lt;a href="http://www.hypertable.org/"&gt;Hypertable&lt;/a&gt;), or HBase (and Hypertable) could have an open API which your application could use directly.&lt;br /&gt;&lt;br /&gt;Rails or Django on HBase, anyone?&lt;img src="http://feeds.feedburner.com/~r/TomWhite/~4/NluiyKOcNo8" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.tom-e-white.com/feeds/4309697238982889737/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=8898949683610477251&amp;postID=4309697238982889737" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/4309697238982889737?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8898949683610477251/posts/default/4309697238982889737?v=2" /><link rel="alternate" type="text/html" href="http://www.tom-e-white.com/2008/04/portable-cloud-computing.html" title="Portable Cloud Computing" /><author><name>Tom White</name><uri>http://www.blogger.com/profile/02418758537880869494</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="http://farm2.static.flickr.com/1358/822201572_051b33f802_s.jpg" /></author><thr:total>1</thr:total></entry></feed>
