Tom White

Bristech 2019

2019-11-08T11:38:00.000+00:00

Yesterday I went to Bristech. What a great conference! It’s one day, three tracks.

The thing I like about it is the variety of the talks. It’s not organized around a single programming language or technology, so it’s easy to go to talks outside your usual sphere of interest (it’s quite hard not to). I went to talks about MLOps (keynote by Luke Marsden), the Language Server Protocol (by Krzysztof Cieślak), a reactive JS compiler called Svelte (by Peter Allen), accessibility (by Svetlana Kouznetsova), Cloud Native ML (by Ant Kennedy), and building autonomous Mars rovers (by Mark Woods). I gave a talk on Single Cell data and algorithms (thanks to Steve Loughran for suggesting I submit it, as well as live tweeting it!).

Nic Hemley and the other organizers put a lot of work into lots of small details that made the day more memorable. To name a few: the Icelandic thunder clap; caterpillar and butterfly stickers to encourage attendees to talk to each other; lunchtime music and visuals; cinema popcorn.

As a speaker I was very impressed by the care taken by track chairs on the speaker intros. Speakers were asked beforehand to fill in a short “interview” document, which chairs then used to write an intro for each speaker. My track chair was Hannah Smith, who was also very good at gently reminding people in the Q&A session to ask questions rather than give their opinions. So we didn’t have any “this is more of a comment than a question” style ramblings, which I’m sure everyone was pleased about.

And the Watershed is a very nice venue, even if it gets “night club busy” between talks.

Great conference. Highly recommended.

How I manage my diabetes

2019-11-02T13:07:00.006+00:00

I was diagnosed with Type 1 Diabetes (T1D) in February last year. Since then I have learnt a lot about the condition, and how to manage it. The first few months were quite turbulent, but now it’s become more settled and is a part of life, and thankfully I don’t spend every minute thinking about it. It never goes away though.

In this post I describe the technology I’m using to manage T1D. Every PWD (person with diabetes) is different, so what works for me won’t necessarily work for others. Also, some of these things don’t even work for me all the time, such is the unpredictable nature of diabetes. So there are definitely improvements I could make. I’ll mention some of them at the end of this piece.

Some of my diabetes kit

The main tools I use are a FreeStyle Libre to measure my blood glucose levels, and insulin pens to administer multiple daily injections (MDI) of insulin. I also use a number of apps and websites to manage the data.

Quite simply, the Libre is a superb piece of technology. It gives an amazing amount of insight into blood glucose levels - you can see what happens after you eat a particular food, or the effect of exercise, and even what happened to your levels during the night. I’ve been using one for over a year, and without it I really think I would be a lot more stressed about BG levels, and probably overcompensating for lows and oblivious to highs.

There are two parts to the Libre system: the sensor and the reader. The sensor is a small white disc that you stick to your upper arm. It has a small needle that sits under the skin where the glucose sensor is located. Each sensor lasts 14 days before it must be changed. The reader is a custom device, like a tiny phone (see picture above), but you can also use your phone to read the sensor using NFC. The sensor only stores 8 hours’ worth of measurements, so you need to tap the reader on the sensor at least that often to avoid gaps in history.

I use the reader rather than my phone to read the sensor (I only recently upgraded my phone to a version that can read the sensor). The reader can store 90 days of readings, so I periodically download the data as a backup and as a way to run different analyses on it.

Abbott, the company that manufactures the Libre, provides a desktop application to download the data from the Libre over USB. I wrote a script to upload the resulting data to a service called Nightscout.

Nightscout

Nightscout is an open source project created by the #WeAreNotWaiting community to allow people with T1D to store their CGM data in the cloud. It was started as a way for parents of children with T1D to monitor their BG levels (particularly at night, without disturbing them), but it is now widely used by many in the T1D community, and forms a basis for DIY looping systems, like OpenAPS.

I use Nightscout as a historical data store, rather than for its realtime capabilities. I like Nightscout because it's open source and therefore not tied to a corporation (which could pull the service, change its terms, etc), but the downside is that you do have to run your own service, although that’s pretty easy to do on Heroku.

Weekly BG summaries with dboard

Nightscout provides its own analytics, which are very useful. I also wrote a webpage to provide a weekly summary called dboard that reads data from Nightscout and shows a few key stats (time in range, average BG, estimated HbA1c) for the week, so I can see at a glance how the week was.
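As a rough sketch of the kind of calculation behind a weekly summary like this (it is not dboard's actual code), the three stats can be computed from a list of readings in mmol/l. The function name, the 4 to 7 mmol/l default range, and the use of the ADAG regression for estimated HbA1c are all assumptions for illustration; dboard may well use different choices. Time in range is approximated as the share of readings in range, which assumes the readings are evenly spaced.

```python
from statistics import mean


def weekly_summary(readings_mmol, low=4.0, high=7.0):
    """Summarise a week of blood glucose readings given in mmol/l.

    The default range and the HbA1c formula are illustrative assumptions.
    """
    if not readings_mmol:
        raise ValueError("need at least one reading")
    avg = mean(readings_mmol)
    in_range = sum(1 for r in readings_mmol if low <= r <= high)
    return {
        # Share of readings inside [low, high], as a percentage.
        "time_in_range_pct": 100.0 * in_range / len(readings_mmol),
        "average_bg_mmol": avg,
        # ADAG regression: HbA1c (%) = (average glucose in mmol/l + 2.59) / 1.59
        "estimated_hba1c_pct": (avg + 2.59) / 1.59,
    }


# Made-up readings, purely to show the call.
summary = weekly_summary([5.0, 6.0, 8.0, 5.0])
print(summary)
```

With real data the readings would come from the Nightscout store rather than a hard-coded list.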

Turning to insulin, MDI uses two types of insulin - a basal dose of long-acting insulin (I use Levemir) every night before bed that provides a background supply of insulin over the day, and bolus doses of rapid-acting insulin (I use NovoRapid) taken before every meal or snack and which are adjusted to “cover” the carbs in that meal.

I don’t have an insulin pump. After I was diagnosed I assumed that I would eventually move on to one, since they offer a high degree of blood glucose control. However, my doctor said that since my control was very good using MDI, there wasn’t a strong reason to move to a pump. I'm still on MDI, and it works well for me. I was quite surprised to find that the needles on insulin pens are so fine that you can barely feel them, so injections are not normally painful. Extracting a drop of blood from a finger with a lancet for testing the blood glucose level hurts more.

Interestingly, it’s very likely that I’m still in the “honeymoon phase” of T1D, which is when the body still produces some insulin. This helps with control, and is probably another reason MDI works well for me. The literature on the duration of the honeymoon phase is maddeningly vague, but it’s typically between months and years. (One interesting paper I found suggests that doing regular exercise can prolong the honeymoon. A good reason to keep up the running…)

How do I know how much insulin to take? The nightly basal dose is fixed and the same every day - the dose was arrived at by trial and error soon after diagnosis, and it hasn’t changed much since.

Bolus doses for meals depend on two main variables: the amount of carbohydrate in the meal, and (to a lesser extent) the amount of exercise I’m expecting to do over the following few hours. This is the most fiddly part of managing T1D, since I have to work out the number of carbs in everything I eat. What’s more, I have to try to take my insulin at the right time before eating since injected insulin takes a while to get into the blood stream and become effective. To complicate things, some foods release glucose slowly, others a lot faster.

For carb counting at home where we cook most of our meals from scratch I wrote a web app called Ingreedy that calculates the number of carbs in a meal from its ingredients. This was especially useful in the early days when we had no idea about the carbs in particular foods. I also eat fewer carbs than I used to, around 120g a day.

Carb counting a recipe with Ingreedy

I'm now better at eyeballing the (rough) number of carbs in food, which I need to do when eating out. If I’m unsure I use the Carbs & Cals app which has photos of foods for different portion sizes and their carb counts.

After getting a carb count I have to convert this into the amount of insulin to take - there is a simple ratio to calculate this (e.g. 15g carbs translates to 1 unit of insulin), but the final amount is adjusted by my current BG reading, insulin already "on board" (i.e. in my body), and any exercise I’m going to do in the next few hours. I have a Google spreadsheet that acts as a food diary and does the calculation for me. Then I manually log the meal entry into Nightscout, so I have a log of number of carbs eaten and insulin taken.

When I was first diagnosed I was told that it’s important not to always inject in the same place, since it can make injections less effective (and also cause damage to the area concerned). So I needed a scheme to rotate injection sites. At first I recorded a note of the site I just used on paper so I could choose the next one by looking at the note, but that quickly became tiresome, so I made a little gadget out of cardboard with a pin that I moved every time I did an injection. I still use that for my nightly basal injections.

For my meal bolus injections I now use a scheme that maps day of week and meal to an injection site on my tummy. This means I rotate through sites once a week, which is fine, and most importantly it’s stateless (i.e. I don’t need to remember where the previous site was). (Writing this prompts me to think about moving to a similar scheme for my basal injections.)
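The stateless idea can be sketched in a few lines: the site is a pure function of the day and the meal, so nothing needs to be remembered between injections. The three-meals-a-day layout and the numbering of sites from 1 to 21 are hypothetical; the post does not describe the actual mapping.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MEALS = ["breakfast", "lunch", "dinner"]  # assumed: three boluses a day


def injection_site(day, meal):
    """Return a site number from 1 to 21 for a given day and meal.

    Each (day, meal) pair maps to its own site, so every site is
    used exactly once a week and no previous state is needed.
    """
    return DAYS.index(day) * len(MEALS) + MEALS.index(meal) + 1


print(injection_site("Wednesday", "lunch"))
```

Because the mapping depends only on the calendar, a missed or extra injection never throws the rotation out of step.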

Low blood sugar is an inevitable part of T1D. Sometimes you misjudge the amount of insulin and take too much (the pens I use can only administer whole units, so you have to round up or down to the nearest unit), or you do more exercise than you anticipated, or for some other reason your BG goes lower than you want it to. Although I can feel a hypo coming on, the Libre is a very useful tool for heading them off as I can see how quickly my BG is changing and possibly pre-empt a hypo. (For example, if my BG is a bit lower than expected and I’m going to go out soon I might eat something so I don’t go too low.)

When I’m having a hypo I will take glucose tablets, which I like for two reasons. First, they have a fixed dose (4g of carbs), and second they don’t taste that nice so you aren’t tempted to overdo the glucose and go too high.

I’ll also fall back to measuring BG with a finger prick test since blood generally gives a more accurate and more up-to-date reading than the Libre (which measures glucose from interstitial fluid, and has a 5 to 10 minute delay to changes in BG). I have a spare meter for this, or I use the Libre with a test strip.

It’s easily overlooked, but making sure I have all the supplies I need requires a bit of organization. I thought about building an app that notifies me when I need to order something, but I never got round to it. Instead I have a calendar reminder each week to check supplies and order more of any that are running low. Low tech, but works for me so far.

Finally, there are a few things I’ve been thinking about changing or tweaking slightly.

Can I replace the Libre reader with the phone app?
Do I need dboard given that both Libre and Nightscout have good analytics?
Can I log insulin doses automatically (e.g. with CLIPSULIN)?
Do I need to log meals and carbs?
Is there a way of recording hypos? (Preferably automatically since when you are having a hypo you don’t think about logging stuff.)
How can we make it easier to move someone’s entire T1D data between systems?

Pastures new-ish

2018-04-27T14:41:00.000+01:00

After nine and a half years, today is my last day at Cloudera. It’s difficult to write those words as so much of my life has been bound up with this company. On the day I started, I didn’t meet my co-workers as I was living several thousand miles away in a barn in Wales. (The others were in a borrowed meeting room in San Mateo.) As I leave I am still in a barn in Wales (different barn though), but a lot has happened in the intervening period.

On the personal side, my family and I lived in San Francisco during the early formative years of Cloudera, a time we will always treasure for the lifelong friendships we made.

On the professional side, it is no exaggeration to say that working at Cloudera has been the highlight of my career. I already knew that Hadoop was pretty special when I joined (I may have been biased as I was writing a book on it), but I had no idea how it would transform the industry and how it would be used in every sector you could imagine.

To all of you I have worked with over the last decade—at Apache, Cloudera and elsewhere, on many projects—I consider myself to be incredibly fortunate to have had the opportunity to work with you. Thank you.

So what’s next for me?

Jim Waldo, who worked on distributed systems at Sun, once said that he alternated six month periods between the lab and the outside world: in the lab he and his team built systems software, and in the outside world he saw how people used the system he was building. Doing so gave him valuable feedback on the system design, even though it was time away from being able to build the system.

In some ways this is another way of framing the explore/exploit tradeoff, where you decide between exploring new technological ground—building a new system—and exploiting that system to solve particular problems you are interested in, which is why you built the system in the first place. (Of course, this framing is oversimplified, since there are many people working on both parts simultaneously. It’s a useful way of thinking about things as an individual actor though.)

For the past few years I have been working on a few open source biology and healthcare projects (like GATK, Hail, and OHDSI). I think that the problems in biology are big enough and messy enough that new systems will need to be built. We can’t stop exploring the technological ground since the sheer amount of data will overwhelm even the best of today’s cutting-edge technology. (I like to cite the paper Big Data: Astronomical or Genomical? here for some concrete numbers.)

Having said that, there is still a lot of mileage left in our current crop of tools—which include Spark, TensorFlow, Jupyter, and the cloud. And this is what I am going to do: continue the work to apply tools like these to more bio projects, only now working as a freelancer. I plan to write more about what I’m up to on this blog, so please follow along.

Cloudera Inbox Zero for the first time ever!

Type 1 Diabetes

2018-04-15T15:40:00.000+01:00

On 16 February this year I was diagnosed with Type 1 Diabetes.

I had been feeling under the weather - a bit weak, but also persistently thirsty and hungry. I would gulp down a large glass of water in one go with ease (a bit like I did when I was 10 years old after running around outside for hours on a hot day), and I would do this several times every day. I would eat a sandwich after dinner, even after having seconds. Strangely, I was losing weight despite eating a lot. And I found it hard to complete my usual morning run, and when I did manage it, it was noticeably slower than normal.

In retrospect these are all classic symptoms of type 1 diabetes: weight loss, increased thirst and hunger. My body was not producing enough insulin, which is needed to use the glucose in my blood. The weight loss occurred because my body was using fat reserves for energy. My heightened thirst was my body’s way of trying to flush the excess glucose from my system by getting me to urinate more.

So I went to see the doctor and she arranged a blood test, which I had the next day. That was at lunchtime on Friday, and the nurse who took my blood said it would be a week or two before the results came back. So I was surprised when the phone rang at 6pm, and the doctor’s receptionist asked me to come in. On Friday evening? Weren’t they closed then?

A few minutes later I went in, and she said that my blood glucose level was over 30, and that normally it would be 7. “Tom, you have diabetes.” I didn’t know what to say. I remember asking what the blood glucose level was measured in (millimoles per litre). (I’m always impressing on my daughters the importance of units in science.)

I also asked how they knew it was Type 1. Mainly from the symptoms - I'm quite skinny! Most people with diabetes have Type 2, which is characterised by insulin resistance: the body is still producing insulin, but it can’t use it as effectively. Less than 10% of diabetes sufferers have Type 1, and while most of those are children, adults can get the disease too.

The doctor sent me to the hospital straightaway so they could check if my body was coping with the high sugar levels. Left untreated, high blood glucose levels can lead to a dangerous condition called ketoacidosis.

I rang Eliane to tell her, and we both had a bit of a wobble. She drove over to take me to the hospital. We went to the Emergency Admissions Unit, where they would check my ketone levels (which, thankfully, were normal) and give me a shot of insulin. After that I could go home - there was no need to stay overnight, but they asked me to go back the next day and Sunday to be given insulin again.

The next day, Monday, I saw the Diabetic Specialist Nurse (DSN) who gave me a blood glucose monitor and insulin pens so I could manage the condition myself. She explained my new routine: before every meal and before bed I have to check my blood glucose level and inject insulin. The bedtime insulin is a slow-acting background insulin that lasts for almost a whole day, whereas the mealtime insulin is fast-acting and is meant to compensate for the blood glucose level rise caused by the carbohydrates in the meal. The idea is that you count the number of carbohydrates in the meal you are about to eat, then calculate the number of units of insulin that are needed to cover it.

It doesn’t sound like much, but I hadn’t really paid much attention to the nutritional composition of meals before. And nor had Eliane. We eat healthily, and mainly cook from scratch, but having to analyse each meal was a big change. In the first few weeks it felt like we were spending all our time analysing recipes. It gets easier, but it’s still time consuming.

The goal with diabetes is to keep the blood glucose level between 4 and 7 mmol/l. A person without diabetes has a pancreas that does this for them. Unfortunately, my pancreas has stopped performing this role, hence the need for injected insulin. There are two things to avoid: hyperglycaemia, which is when the blood glucose level is too high, and hypoglycaemia when it is too low.

Broadly speaking, hyperglycaemia has long term effects (such as cardiovascular problems), while hypoglycaemia needs to be treated immediately since its more severe form can require hospitalisation. Normally though, treating a mild “hypo” involves eating something with very fast acting sugar in it (like glucose tablets) and waiting 15 minutes for the level to get back in range. During this time you likely feel weak and shaky.

This chart shows all my blood glucose readings. In the first couple of weeks all my readings were out of range, but then they started stabilising, and now they are mainly in range.

When someone is newly diagnosed with diabetes in the UK, the full support network of the NHS swings into action. In addition to my GP, I have a diabetes consultant, two DSNs, and a dietitian. I saw all of these people in the first week, and I have ongoing support from the DSNs and the dietitian, who I can phone or email if I have a question, or need some help with my insulin dose adjustment. I have a recurring appointment with the consultant every six months. I’ve also been added to the system for annual screening for eye disease. The NHS also provides education programs for carb counting and insulin dose adjustment (one has the fabulous acronym DAFYDD - dose adjustment for your daily diet), as well as an excellent series of videos for newly diagnosed patients. All of this is free. And in Wales, where I live, there are no prescription charges for anyone (for people with diabetes in England, prescriptions are free too), so I don’t have to pay for the medical supplies that I now depend on every day.

All of the medical staff that Eliane and I have encountered in the last two months have been unfailingly kind and supportive, even when working under pressure (like the first night in the EAU). It’s incredibly reassuring to have access to the resources the NHS provides. I would like to thank everyone there, along with my family, friends, and colleagues from work who have helped me get through the last two months.

Be Part of Something Bigger - Vote #Remain

2016-06-22T14:10:00.000+01:00

David Cameron did not have to call this referendum. There is no new EU treaty that we are voting on; he called it in an attempt to settle the European issue within his own party. The referendum is not binding either, as the legal blogger David Allen Green pointed out.

Far from quelling the debate within the Tory party, the lead up to the referendum has had the opposite effect. The debate over the last couple of months has been increasingly toxic, with both sides making outlandish claims. Parts of the Leave campaign have been xenophobic and racist, in an attempt to scare people into leaving the EU - this is the true Project Fear. And then last week the appalling murder of the Labour MP Jo Cox brought about some reflection on how we’ve moved away from a more respectful, kinder politics. In the words of Stephen Kinnock, “When insecurity, fear and anger are used to light a fuse, an explosion is inevitable.”

But there is a referendum tomorrow, so we have a duty to vote. The vote is about a host of issues, and on all of them I believe we are better staying as a member of the EU. In my mind it boils down to being a part of something bigger than yourself. This is true on a personal level - being part of a company, a team, a club, or an organisation allows you to achieve more than if you go it alone. I’ve seen this in my professional life where loose-knit groups of programmers build open source software that an individual could never dream of. Of course, where there are many personalities pulling in different directions you get conflict, things get messy, compromises are needed, and you don’t always get your own way. But on some decisions you do have influence, and you do get to shape the results.

Being a part of the EU is about the UK being a part of something bigger, and being able to influence policy on issues that affect the UK. The world is a messy chaotic place, and there are many deeply-ingrained, complex problems that require complex policy interventions. Climate change, migration, tax havens, peace - to name a few - all of these need a coordinated approach that crosses national borders. Leaving would squander our influence in attacking these problems, while doing nothing to solve them - for us or for the rest of the world.

One of the more worrying themes of the Leave campaign is not to trust the experts. This allows them to conveniently dismiss the overwhelming opinion amongst economists that Brexit would mean the UK is worse off outside the EU. It’s like climate change denial, and running a country with that kind of gut-feeling policy making is terrifying.

Britain has been at its best when it has been an outward-looking nation, one that works with others and trades with others. That’s why I am going to vote to Remain in the EU.

The Earth Moon Game

2015-07-05T22:18:00.000+01:00

If the Moon were the size of a tennis ball then the Earth would be the size of a basketball. How far apart should the balls be placed so that the distance is to scale?

Before you read on, you might like to have a go yourself. If you don't have a tennis ball and basketball to hand, you can play with this online version I wrote.

My kids and I had a stall at our school fair this Friday where we played this as a game:

The Earth's diameter is 12,742 km and the Moon's is 3,475 km, so the Earth's diameter is about 3.7 times larger. (We measured the basketball's diameter to be 23.5 cm, and the tennis ball to be 6.5 cm, so the ratio is about 3.6, which is pretty close!)

The Moon is (on average) 384,400 km from the Earth (the Lunar distance, measured from the centres of the two bodies), which is 111 times the Moon's diameter. Scaling this to the tennis ball, we get a distance of 111 × 6.5 cm = 7.2 metres.
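The same arithmetic can be written as a short Python sketch, which makes it easy to try other ball sizes. The constants are the figures quoted above; the function name is just for illustration.

```python
EARTH_DIAMETER_KM = 12_742
MOON_DIAMETER_KM = 3_475
LUNAR_DISTANCE_KM = 384_400
TENNIS_BALL_CM = 6.5  # measured diameter of the model Moon


def scaled_distance_m(model_moon_cm):
    """Distance in metres between the balls when the Moon is model_moon_cm wide."""
    # How many Moon diameters fit between the Earth and the Moon.
    moon_diameters = LUNAR_DISTANCE_KM / MOON_DIAMETER_KM
    return moon_diameters * model_moon_cm / 100


distance = scaled_distance_m(TENNIS_BALL_CM)
print(f"Earth/Moon diameter ratio: {EARTH_DIAMETER_KM / MOON_DIAMETER_KM:.1f}")
print(f"Scaled distance: {distance:.1f} m")
```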

Here's a picture showing the results at the end of the fair:

The basketball representing the Earth is in the bottom of the picture, and the tennis ball just visible at the top is the Moon, 7.2 metres away. To the right of the green tape are white flags that are the players' guesses for where the Moon would be.

It's striking that all the guesses were too low. This seems to be a mixture of two things. Firstly, people really do think that the Moon is closer than it actually is. Secondly, people tend to copy other people, so they would place their flags close to where the others were. (We told everyone that the Moon didn't have to be restricted to the green tape - that just happened to be how long it was.)

We saw a few interesting tactics though. One girl put one flag so it was the closest to the Earth compared to all the other flags, then another so it was the furthest out. She seemed to think that everyone else had either over- or underestimated the distance - which of course they had! (She didn't win though, as someone put their flag even further out later on.) Someone else put five flags over a range of about 25cm where she thought the Moon would be.

The most successful approach seemed to be for the player to stand where the Earth is, and have someone walk away holding the tennis ball until it subtends the same angle as the Moon does in the sky (or your mind's eye). This is easier said than done, however. The player in fourth place (who was about five years old) used this technique.

Here's the data plotted graphically, with each flag shown as a line. The blue line represents Earth, and the orange line the Moon.

Interestingly, the guesses did not benefit from the Wisdom of Crowds effect, where the average tends to be a good predictor of the actual answer:

The opening anecdote [of the book of the same name by James Surowiecki] relates Francis Galton's surprise that the crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged

For the Earth Moon Game, however, the median distance was 2.6 metres, and the mean was 2.7 metres, which was 2.3 standard deviations (sd=1.96 metres) from the true distance, 7.2 metres.
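The "2.3 standard deviations" figure is just the gap between the true distance and the mean guess, divided by the standard deviation. A minimal sketch using the summary numbers above (the individual guesses are not reproduced here, and the function name is illustrative):

```python
def deviations_from_mean(true_value, mean, sd):
    """How many standard deviations true_value lies from the mean."""
    return (true_value - mean) / sd


z = deviations_from_mean(true_value=7.2, mean=2.7, sd=1.96)
print(f"{z:.1f} standard deviations")
```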

The Hay Dark Skies Festival, Reverend Thomas William Webb, and Jupiter

2015-04-19T18:09:00.000+01:00

In 2013, the Brecon Beacons was designated a Dark Sky Reserve, and a year later the first Dark Skies Festival was held in Hay-on-Wye. The second festival took place this weekend, and my family went along to some of the activities.

Young stargazers, Lottie and Millie

In the morning, we found ourselves in a planetarium tent, then we looked at sunspots, and held pieces of meteorite.

The evening event was stargazing at Holy Trinity Church in Hardwicke, just outside Hay. Quite apart from the lack of light pollution, the location was a special one, since the vicar of the parish from 1856 until 1885 was Reverend Thomas William Webb, who in his spare time observed the night sky with telescopes and an observatory he had built himself.

Holy Trinity Church, Hardwicke

In 1859, while at Hardwicke he wrote the classic book, Celestial Objects for the Common Telescope, the object of which was "to furnish the possessors of ordinary telescopes with plain directions for their use, and a list of objects for their advantageous employment".

The book remained in print well into the following century (and was recently republished by Cambridge University Press), and it's probably difficult to overemphasise the importance of this book in encouraging generation after generation of amateur stargazers.

In the words of Janet and Mark Robinson, who used to live in the vicarage and have edited a book about Webb,

Like Patrick Moore, he was an enthusiast who wanted to inspire as many people as possible to look through a telescope. Even at the choir party he "arranged the telescope and acted as showman and all in turn had a look at Saturn".

Webb would no doubt have been pleased to see yesterday's gathering of enthusiastic amateurs (including the Robinsons) with an impressive range of telescopes, on a cold but very clear night. The highlight for us was seeing Jupiter and its four brightest moons (Io, Europa, Ganymede and Callisto) through a large reflecting telescope. We could even see the north and south belts, and the Great Red Spot (or Pink Splodge as Lottie named it).

Sunset. Venus is visible top centre

Thank you to the organisers of the Hay Dark Skies Festival, and the volunteers from the Usk Astronomical Society (the oldest astronomical society in the UK), the Abergavenny Astronomy Society and the Heads of the Valleys Astronomical Society.

Tennis Ball Parabola

2015-03-08T15:33:00.000+00:00

Here's an image of me throwing a tennis ball to Lottie:

Millie filmed the video and edited it down to a shorter segment. I turned the resulting video frames into a series of JPEGs by running:

ffmpeg -i Tennis\ Ball.mp4 tennis-%03d.jpeg

Then I composed them into a single image using ImageMagick:

convert -compose lighten tennis-014.jpeg tennis-015.jpeg \
  -composite tennis-016.jpeg \
  -composite tennis-017.jpeg \
  ... \
  -composite tennis-043.jpeg \
  -composite result.jpeg
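To see why the lighten mode works here: for each pixel channel it keeps the lighter of the two values, so a bright ball against a darker background survives from every frame. The following is a toy pure-Python sketch of that idea on tiny grayscale "frames" held as nested lists; it is not how ImageMagick is implemented, and the pixel values are made up.

```python
def lighten(frames):
    """Combine equally sized grayscale frames, keeping the brightest value per pixel."""
    result = [row[:] for row in frames[0]]  # copy the first frame
    for frame in frames[1:]:
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                result[y][x] = max(result[y][x], value)
    return result


# Three 2x3 frames: a bright "ball" (200) moving over a dark background (10).
frames = [
    [[200, 10, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 200, 10]],
    [[10, 10, 200], [10, 10, 10]],
]
composite = lighten(frames)
print(composite)
```

The composite contains the ball in all three positions, which is the effect seen in the picture.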

Millie then used Desmos (an online graphing editor) to superimpose a parabola on the image.

Update: Dima Spivak suggested I use the picture to estimate g, the acceleration due to gravity.

My head measures 0.22 m (chin to crown), and is 49 pixels on the picture.
The vertical distance, d, from the highest ball to the ball above Lottie's hands is 204 pixels, or 0.916 m.
The time, t, it took to travel this distance was between 12 and 13 frames (it's hard to say more precisely than this from the picture), which at 29.97 frames per second is between 0.4 and 0.434 seconds.

The acceleration is 2d/t², which works out at between 9.7 and 11.4 m/s². This range contains the accepted value of g, which is 9.8 m/s².
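The formula comes from d = ½gt² for a ball starting with no vertical velocity, which holds at the top of the arc. Here is the calculation as a small Python sketch using the measurements above; the names are illustrative.

```python
HEAD_M = 0.22    # chin to crown, in metres
HEAD_PX = 49     # the same length in pixels
DROP_PX = 204    # vertical drop from the highest ball, in pixels
FPS = 29.97      # video frame rate


def estimate_g(drop_px, frames):
    """Estimate g in m/s^2 from a drop in pixels and the number of frames it took."""
    metres_per_pixel = HEAD_M / HEAD_PX
    d = drop_px * metres_per_pixel  # drop in metres
    t = frames / FPS                # elapsed time in seconds
    return 2 * d / t ** 2           # from d = (1/2) g t^2, starting at rest


g_high = estimate_g(DROP_PX, 12)  # fewer frames means a shorter time and a larger g
g_low = estimate_g(DROP_PX, 13)
print(f"g is between {g_low:.1f} and {g_high:.1f} m/s^2")
```

The width of the range comes almost entirely from the one-frame uncertainty in t, since t is squared.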

Hadoop for Science

2015-01-16T15:57:00.004+00:00

Some of the largest datasets are generated by the sciences. For example, the Large Hadron Collider produces around 30PB of data a year. I'm interested in the technologies and tools for analyzing these kinds of datasets, and how they work with Hadoop, so here's a brief post.

Open Data

Amazon S3 seems to be emerging as the de facto solution for sharing large datasets. In particular, AWS curates a variety of public data sets that can be accessed for free (from within AWS; there are egress charges otherwise). To take one example from genomics, the 1000 Genomes project hosts a 200TB dataset on S3.

Hadoop has long supported S3 as a filesystem, but recently there has been a lot of work to make it more robust and scalable. It’s natural to process S3-resident data in the cloud, and here there are many options for Hadoop. The recently released Cloudera Director, for example, makes it possible to run all the components of CDH in the cloud.

Notebooks

By "notebooks" I mean web-based, computational scientific notebooks, exemplified by the IPython Notebook. Notebooks have been around in the scientific community for a long time (they were added to IPython in 2011), but increasingly they seem to be reaching the larger data scientist and developer community. Notebooks combine prose and computation, which is great for exposition and interactivity. They are also easy to share, which helps foster collaboration and reproducibility of research.

It’s possible to run IPython against PySpark (notebooks are inherently interactive, so working with Spark is the natural Hadoop lead in), but it requires a bit of manual set up. Hopefully that will get easier—ideally Hadoop distributions like CDH will come with packages to run an appropriately-configured IPython notebook server.

Distributed Data Frames

IPython supports many different languages and libraries. (Despite its name IPython is not restricted to Python; in fact, it is being refactored into more modular pieces as a part of the Jupyter project.) Most notebook users are data scientists, and the central abstraction that they work with is the data frame. Both R and pandas, for example, use data frames, although both systems were designed to work on a single machine.

The challenge is to make systems like R and pandas work with distributed data. Many of the solutions to date have addressed this problem by adding MapReduce user libraries. This is unsatisfactory for several reasons, primarily because the user has to explicitly think about the distributed case and can’t use the existing libraries on distributed data. Instead, what’s needed is a deeper integration so that the same R and pandas libraries work on local and distributed data.

There are several projects and teams working on distributed data frames, including Sparkling Pandas (which has the best name), Adatao’s distributed data frame, and Blaze. All are at an early stage, but as they mature the experience of working with distributed data frames from R or Python will become practically seamless. Of course, Spark already provides machine learning libraries for Scala, Java, and Python, which is a different approach to getting existing libraries like R or Pandas running on Hadoop. Having multiple competing solutions is broadly a good thing, and something that we see a lot of in open source ecosystems.

Combining the Pieces

Imagine if you could share a large dataset and the notebooks containing your work in a form that makes it easy for anyone to run them—it’s a sort of holy grail for researchers.

To see what this might look like, have a look at the talk by Andy Petrella and Xavier Tordoir on Lightning fast genomics, where they used a Spark Notebook and the ADAM genomics processing engine to run a clustering algorithm over a part of the 1000 Genomes dataset. It combines all the topics above—open data, cloud computing, notebooks, and distributed data frames—into one.

There’s still work to be done to expand the tooling and to make the whole experience smoother. Nevertheless, this demo shows that it's possible for scientists to analyse large amounts of data, on demand and in a way that is repeatable, using powerful high-level machine learning libraries. I'm optimistic that tools like this will become commonplace in the not-too-distant future.

Marmalade

2015-01-11T16:53:00.000+00:00

I made some marmalade. I've never made it before, although I have memories of my parents making it every January, and how slicing the peel seemed to take hours. I used this meta recipe from Felicity Cloake that Eliane found, and it seemed to work pretty well.

Five years at Cloudera

2013-10-13T21:13:00.000+01:00

Five years ago today was my first day at Cloudera. The team I joined consisted of the four founders—Mike Olson, Amr Awadallah, Jeff Hammerbacher, Christophe Bisciglia—as well as Aaron Kimball who had joined a week or so before, Alex Loddengaard who was working as an intern, and Matei Zaharia who joined on the same day as me as a part-time consultant.

Before I joined I had been working as an independent Apache Hadoop consultant for a year (probably the first Hadoop consultant anywhere), and was halfway through writing a book on Hadoop. The interview process had involved speaking to all four founders, and I remember when I came off the phone after the last call it was late in the UK but I couldn't sleep because the vision they had described was exactly what I wanted to see for Hadoop: a company that wanted to make Hadoop accessible to everyone, by making it easier to use and run, while maintaining a strong commitment to open source. The last point sealed the deal for me, and really at that point there was no way I could not join, and five years on I can say without exaggeration that it was the best decision of my professional life.

When I started I was living in Wales, which meant that on my first day I didn't see any of my new colleagues! That was remedied a few weeks later on when I visited California (and ApacheCon in New Orleans) in early November 2008. Initially the others were working out of a single room in AdMob's offices in San Mateo, but it wasn't long before we moved to a smart brick-lined office in Burlingame. I was around for the moving in day, which involved more flatpack assembly skills than programming.

From the very beginning we worked on making Hadoop easier to use, run, and support, and better integrated with other systems, so that it could enjoy broader adoption. That was borne out in the early projects at Cloudera which included creating training material, creating packages for Red Hat and Debian (CDH, and later Bigtop), writing tools for data ingest (Flume and Sqoop), creating a rich web UI for Hadoop users (Hue), as well as making contributions to the core project. I was mainly involved in the latter, which I did at the same time as completing the book in time for the Hadoop Summit 2009, which would never have been possible without the time and space my teammates gave me.

Over the first year I would visit every three months or so, and naturally each time the team would have grown. I always enjoyed meeting the new people who had joined since my last visit, but I realized that at such a formative time in a company's life, when the culture was being laid down, being closer to the team would make it easier for me to stay involved. The opportunity to move to California came up, and on the last day of October 2009 I arrived in San Francisco with my wife, Eliane, and two girls.

As anyone who has moved to a new country knows, there's a lot of things to sort out—somewhere to live, a school for the girls, reams of paperwork—and during this time the folks at Cloudera were incredibly helpful and supportive. When we moved into our new apartment (which Eliane had found a mere two weeks after we arrived) half of the engineering team turned up to help with Ikea flatpack assembly.

At the end of our three year sojourn in the US, we left having made many friends, sad to leave, but happy knowing we'd be living closer to our family again. Cloudera was an order of magnitude larger than when I had arrived, and was now an international company with offices in several countries across the world.

Over the last five years I've been lucky enough to have been given the freedom to work on many parts of the Hadoop stack, in different parts of the Hadoop community, and with different teams at Cloudera. In the course of doing so, I've worked with the most talented and intelligent group of people in my life. It's hard work, and challenging, but also a lot of fun and incredibly enriching. I have every reason to expect it to continue. Thanks Cloudera!

Update on October 14: reworded to state that ApacheCon 2008 was held in New Orleans, not California. Thanks to Isabel Drost-Fromm for pointing out the error.

Making a Kitchen Table

2013-03-30T18:25:00.000+00:00

A couple of weeks ago I made a new kitchen table.

It was much easier than it looks as all I had to do was attach some hairpin legs to a worktop. If you haven't seen hairpin legs before, here's a closeup:

Eliane got the idea for the design after seeing something similar on the web, and she ordered the worktop from Worktop Express, and the hairpin legs from the Iron Mill.

I worked out what size screws to use (#12) and the pilot drill size using this handy chart. I also found a tip somewhere that said putting a little wax on the screw makes it easier to drive in with hardwoods (our worktop is oak).

The table is pretty sturdy, and hasn't collapsed! It was quicker to put together than some Ikea furniture, and it's very satisfying having an everyday piece of furniture that we designed and built ourselves.

Have you put the chickens to bed?

2013-02-03T17:52:00.001+00:00

"Have you put the chickens to bed?" -- it's a question we ask each other frequently in our house, since we are the proud owners of seven beautiful hens. Normally Eliane has, but when Lottie, our younger daughter, asked long after it had got dark one evening last week it turned out that none of us had, despite having IFTTT alerts set up to remind us.

The problem with the alert is that it is set to go off at sunset, which is all that IFTTT allows, and that's a bit too early as it's not dark enough for the chickens to be in their house. So we wait a bit, then we forget.

So I decided to write an Android app to send an alert a fixed amount of time (say 45 minutes) after sunset, so that when we received it, it would be dark, the chickens would be in their house, and we could close the door there and then.

This is the result:

Eliane is currently beta testing it, so we'll see how well it works. (Obviously the long term goal is an automatic sensor to open and close the chicken house door, but we're not there yet.)

Writing Android Apps

This is the first Android app I've written, and overall I found the process very straightforward. A couple of years ago I ran a "Hello World" Android tutorial, and I seem to remember most of the time taken to get the app running was installing the Eclipse plugin. This time the Android Developer Tools (ADT) include a customized version of Eclipse, making the getting started process much smoother.

The Android API is huge and fairly intimidating. It is, however, incredibly well documented, and the user guides are invaluable. The hardest part of writing the app was figuring out which parts of the API to use - do I need a BroadcastReceiver or a Service?, how do AlarmManager and Notification interact? - that kind of thing. There's a lot of material online covering how to do various things in Android, and these offered general pointers, but not necessarily useful code, since the API evolves rapidly from release to release. And although the older code is generally supported, since compatibility is taken very seriously, there may be a better way of doing things in later versions.

The ADT tooling is good and encourages you to do the right thing - for example, extracting natural language strings from your app so it's easy to change them (or translate them) later. In this case, a class called R is generated which has references to all the assets that you need in your app: icons, sound files, strings, etc. For example, the audio file which plays when the notification is received is referred to with:

R.raw.cluck

To generate the icons I drew a chicken on a piece of paper with a sharpie, then took a photo of it and used an online image editor to make the background transparent. The Android Asset Studio completed the job of converting the image to a set of icons. (I didn't use Inkscape in the end, but this blog entry shows how to convert from an Inkscape drawing.)

What's Next?

The biggest limitation in the app at the moment is that the calculation for sunset time is hardcoded for the UK. Using the Location API is the obvious next step there.

There are also some complications to do with making sure that notifications will still be sent even if the phone is rebooted. I want to make sure that works properly before putting the app on Google Play.

The UI is pretty rudimentary too and could do with some work.

And before we get to the fully-automated solution, we could have a sensor that detects if the door is open or closed and only sends the reminder if the door hasn't been closed for the night.

Source is on GitHub.

How far away is the sea?

2012-12-31T16:06:00.000+00:00

I wanted an app to answer this question, so I wrote one:

You can try it out at http://how-far-away-is-the-sea.appspot.com/. It works well on phones too, so you can use it when you are out and about.

How does it work?

I used the dataset of land polygons from Natural Earth, which as the name suggests covers the whole world. The scale is 1:10 million, so inevitably there is some inaccuracy near the coast, particularly where it's wiggly.

The app uses your current location (or a location you selected by clicking on the map) and computes the closest point in the set of land polygons. This calculation is performed using the JTS Topology Suite, a library for 2D spatial work, and it runs as a Java webapp hosted on Google App Engine.

Originally I used Geotools to perform the geospatial calculations, but unfortunately it doesn't run on GAE, so I wrote an offline tool to convert the Natural Earth shapefiles to a JTS binary format. JTS works fine on GAE, but it lacks a distance calculator. Luckily spatial4j has the requisite distance functions, and it too works on GAE.
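To give a feel for what the closest-point query involves, here is a minimal sketch in Python. It is not the app's Java code and not how JTS implements it: the function names are made up for illustration, coordinates are treated as flat (x, y) pairs, and a polygon boundary is just a closed list of points. The idea is to project the query point onto each edge of the boundary and keep the nearest projection.

```python
def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is nearest to p (planar geometry)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a  # degenerate segment: both ends are the same point
    # Position of the projection along the segment, clamped to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def closest_point_on_ring(p, ring):
    """Return the nearest boundary point to p; ring is a closed list of vertices."""
    best, best_d2 = None, float("inf")
    for a, b in zip(ring, ring[1:]):
        q = closest_point_on_segment(p, a, b)
        d2 = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
        if d2 < best_d2:
            best, best_d2 = q, d2
    return best


# Hypothetical square "island"; the first vertex is repeated to close the ring.
square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
print(closest_point_on_ring((1, 2), square))
```

A real coastline has many polygons with thousands of edges each, which is where a spatial index would start to pay off.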

The webapp exposes a simple query endpoint, so a request for the following URL, for example:

http://how-far-away-is-the-sea.appspot.com/query?lat=51.856479&lng=-3.13551

will return a JSON document with the closest point on the coast, whether the (origin) location is on land or at sea, and the distance in metres to the coast:

{
"latitude":51.856479,
"longitude":-3.13551,
"coastLatitude":51.55853913000007,
"coastLongitude":-2.984038865999878,
"onLand":true,
"distanceToCoast":34734.59501052392
}

The page that the user sees is a simple static HTML page that uses the Google Maps API (v3) to render the map and the markers, and jQuery to query the Java webapp.
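The distance in the sample response above can be roughly reproduced with the haversine formula. This is a sketch in Python under my own assumptions, not the spatial4j code the app calls: the function name and the Earth radius constant are illustrative, and the app may use a different formula or radius, so the result should only be expected to agree to within a few metres.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Origin and coast point taken from the sample JSON response.
d = haversine_m(51.856479, -3.13551, 51.55853913000007, -2.984038865999878)
print(round(d))
```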

The complete source code is on GitHub at https://github.com/tomwhite/how-far-away-is-the-sea.

Further ideas

Some of the polygons are a poor approximation to the coastline, so it would be nice to get a higher-resolution dataset. There are likely many potential sources, such as this one for the UK.

It would be interesting to use the dataset to answer the question: "which is the furthest point from the sea [in the UK/in X/in the world]?". I'd like to find time to do that sometime. Adding in spatial indexes might be helpful too.

If you liked this app then you might like...

Is it day or night?

IFTTT

2012-12-16T12:31:00.001+00:00

IFTTT (pronounced "ift"), which stands for "if this then that", is a great service for wiring bits of the internet together. The idea is that you create rules for performing actions, based on triggers.

If this [trigger] occurs then perform that [action].

There are lots of triggers and actions, provided by channels. For example, the Weather Channel provides a trigger which fires at sunset. And the Google Talk Channel provides an action to send a chat message. I combined the trigger and action into a recipe called "Did you put the chickens to bed?" which will remind me (and Eliane) to close the chicken shed in the evening.

I love the simplicity of the whole thing. I quickly added a recipe to send a weekly SMS to remind me to put the rubbish out. And one to send an email to Lottie when there is a full moon. Emilia created a recipe to send her an email when a friend of hers posts something on his blog. I fear the recipe that tells me when it has started raining will be deleted soon due to email overload.

When you start thinking in this way, the more interesting uses invariably involve the physical world in some way. I want to have a recipe that says "if we're running out of coffee beans then order some more", or "if I'm on Skype light up a lamp outside my office so the kids know not to come in" (this one is close with the blink(1) device), or even "it's actually dark now and you still haven't closed the chicken shed door".

Apportionment

2012-12-09T22:04:00.001+00:00

[I wrote this in July, but never got round to posting it.]

Last weekend I visited the U.S. Capitol in Washington, D.C., with my family, and I learned that the House of Representatives has 435 seats which are apportioned so that each state has a number of seats that is proportional to its population. It sounded simple when the tour guide said it, but I wondered: how are fractions handled fairly? Simply rounding off quotas doesn't work: firstly because some states could get no seats, which would be unfair, and secondly, how do you make sure that the rounding is both fair and assigns all 435 seats?

When I got home I read about the apportionment problem, as it is known, which has a long and interesting history. Wikipedia [1] is a good read, as usual, and [2] goes into the history and mathematics of different apportionment algorithms in depth, at least one of which gives rise to a paradox. Here I'm interested in looking at the algorithm that is used today to calculate apportionments for the House of Representatives, and why it is considered to be the fairest.

The Algorithm

The algorithm in use today for apportioning seats is due to Huntington and Hill and is known as the Huntington-Hill method, or the method of equal proportions. It's best understood as a dynamic process, which works as follows:

To start, each state is given one seat. (This ensures that states with relatively small populations, like Wyoming, get at least one seat.) Then, each remaining seat is allocated in turn to the state with the highest priority, where the priority of a state of population \(P\) and \(n\) previously-allocated seats is defined as

\begin{align} \frac {P} {\sqrt{n(n+1)}}\label{pri} \end{align}

We'll see why the priority is defined as it is below, but for now notice that it is approximately \(P/n\), so the seat is given to the state that has the least number of representatives per person, roughly speaking.
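The process is short enough to sketch in Python. This is an illustration of the method as described above, not the program referenced in [5]; the function names and the three-state example are made up. A heap keeps track of which state currently has the highest priority.

```python
import heapq
from math import sqrt


def priority(population, seats):
    """Priority of a state that already holds `seats` seats."""
    return population / sqrt(seats * (seats + 1))


def apportion(populations, total_seats):
    """Huntington-Hill apportionment; populations maps state name to population."""
    if total_seats < len(populations):
        raise ValueError("need at least one seat per state")
    # Every state starts with one seat.
    seats = {state: 1 for state in populations}
    # heapq is a min-heap, so store negated priorities to pop the largest first.
    heap = [(-priority(pop, 1), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(total_seats - len(populations)):
        _, state = heapq.heappop(heap)
        seats[state] += 1
        heapq.heappush(heap, (-priority(populations[state], seats[state]), state))
    return seats


# Hypothetical three-state example with six seats to share out.
print(apportion({"A": 1000, "B": 500, "C": 100}, 6))
```

Passing the 50 state populations and 435 seats runs the same loop on the real data.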

Results for the 2010 Census

Running the algorithm for the state populations from the 2010 Census (using a program I wrote [5]) gives the following apportionment, which agrees with the U.S. Census Bureau [3]. (The quota column is each state's share of the total population multiplied by 435, that is, the number of seats it would receive if fractional seats were allowed.)

State	Seats	Population	Quota	People per representative
Alabama	7	4802982	6.76	686140
Alaska	1	721523	1.02	721523
Arizona	9	6412700	9.02	712522
Arkansas	4	2926229	4.12	731557
California	53	37341989	52.54	704565
Colorado	7	5044930	7.10	720704
Connecticut	5	3581628	5.04	716325
Delaware	1	900877	1.27	900877
Florida	27	18900773	26.59	700028
Georgia	14	9727566	13.69	694826
Hawaii	2	1366862	1.92	683431
Idaho	2	1573499	2.21	786749
Illinois	18	12864380	18.10	714687
Indiana	9	6501582	9.15	722398
Iowa	4	3053787	4.30	763446
Kansas	4	2863813	4.03	715953
Kentucky	6	4350606	6.12	725101
Louisiana	6	4553962	6.41	758993
Maine	2	1333074	1.88	666537
Maryland	8	5789929	8.15	723741
Massachusetts	9	6559644	9.23	728849
Michigan	14	9911626	13.94	707973
Minnesota	8	5314879	7.48	664359
Mississippi	4	2978240	4.19	744560
Missouri	8	6011478	8.46	751434
Montana	1	994416	1.40	994416
Nebraska	3	1831825	2.58	610608
Nevada	4	2709432	3.81	677358
New Hampshire	2	1321445	1.86	660722
New Jersey	12	8807501	12.39	733958
New Mexico	3	2067273	2.91	689091
New York	27	19421055	27.32	719298
North Carolina	13	9565781	13.46	735829
North Dakota	1	675905	0.95	675905
Ohio	16	11568495	16.28	723030
Oklahoma	5	3764882	5.30	752976
Oregon	5	3848606	5.41	769721
Pennsylvania	18	12734905	17.92	707494
Rhode Island	2	1055247	1.48	527623
South Carolina	7	4645975	6.54	663710
South Dakota	1	819761	1.15	819761
Tennessee	9	6375431	8.97	708381
Texas	36	25268418	35.55	701900
Utah	4	2770765	3.90	692691
Vermont	1	630337	0.89	630337
Virginia	11	8037736	11.31	730703
Washington	10	6753369	9.50	675336
West Virginia	3	1859815	2.62	619938
Wisconsin	8	5698230	8.02	712278
Wyoming	1	568300	0.80	568300
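The two derived columns can be recomputed with a few lines of Python. This is a sketch with names of my own choosing, not the program referenced in [5]. Two things here are assumptions: the total population constant, which should equal the sum of the Population column, and the use of integer division, which is inferred from rows such as Idaho where the figure appears to be truncated rather than rounded.

```python
TOTAL_POPULATION = 309_183_463  # assumed: sum of the Population column
HOUSE_SEATS = 435


def quota(population, total=TOTAL_POPULATION, seats=HOUSE_SEATS):
    """Number of seats a state would get if fractional seats were allowed."""
    return population * seats / total


def people_per_representative(population, seats):
    """Whole number of people per seat (truncated, as the table appears to be)."""
    return population // seats


print(round(quota(37_341_989), 2))              # California
print(people_per_representative(1_573_499, 2))  # Idaho
```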

The Mathematics

The algorithm finally settled on by Congress was chosen because it was thought to be the fairest. There are different ways of defining what "fair" means, and so it cannot be settled mathematically. In this context "fair" is taken to mean "minimizes the relative difference in representatives per person between states".

To see how the algorithm meets this definition of fairness, let's see what happens when we examine any two states to see if transferring one seat between them would improve the apportionment. This is the argument published by E. V. Huntington in [4].

Suppose after the apportionment, state \(A\) has received \(x+1\) seats, and state \(B\) has received \(y\) seats (we also write \(A\) and \(B\) for the populations of the two states). Furthermore, suppose that \(A\) is over-represented because the number of people per representative is less than for \(B\):

\begin{align} \frac {A} {x+1} &\lt \frac {B} {y}\label{Aover} \end{align}

We can check this in the case of California and New York:

\begin{align} \frac {37,341,989} {53} &\lt \frac {19,421,055} {27}\nonumber \end{align}

\begin{align} 704565.83 &\lt 719298.33 \nonumber \end{align}

Now let's see what happens if we try to transfer one seat from \(A\) to \(B\)—does that make things fairer?

In the round when \(A\) won its last seat (number \(x+1\)), we know that its priority (defined by (\ref{pri})) was higher than \(B\)'s. That is,

\begin{align} \frac {A^2} {x(x+1)} &\gt \frac {B^2} {y(y+1)}\label{priority} \end{align}

(Note that even if \(B\) hadn't won its last seat (number \(y\)) at that point, the inequality still holds, since the number of seats it had would be less than \(y\).)

Again we can check this in the case of California and New York:

\begin{align} \frac {37,341,989^2} {52 \times 53} &\gt \frac {19,421,055^2} {27 \times 28}\nonumber \end{align}

Which is true. (The numbers also tally with the U.S. Census Bureau [6], and my program to calculate apportionments [5], where the priority value for California's last seat is \(711,308\), which is \(37,341,989/\sqrt{52 \times 53}\).)

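Dividing (\ref{priority}) by (\ref{Aover}), which is allowed because all the quantities are positive and the two inequalities point in opposite directions, we get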

\begin{align} \frac {A} {x} &\gt \frac {B} {y+1}\label{Bover} \end{align}

which we can interpret as saying that \(B\) would be over-represented if one seat were transferred to it from \(A\). For our example of California and New York, this becomes

\begin{align} 718115.17 &\gt 693609.11 \nonumber \end{align}

The question now is, which over-representation is the smaller? That is, which is fairer, and therefore to be preferred?

Using (\ref{Aover}), we calculate the relative difference before the transfer as

\begin{align} \newcommand{\slfrac}[2]{\left.#1\middle/#2\right.} \slfrac{ \left( \frac {B} {y} - \frac {A} {x+1} \right) } {\frac {A} {x+1}} = \frac {B(x+1)} {Ay} - 1 \label{Adiff} \end{align}

And, using (\ref{Bover}), the relative difference after the transfer is

\begin{align} \newcommand{\slfrac}[2]{\left.#1\middle/#2\right.} \slfrac{ \left( \frac {A} {x} - \frac {B} {y+1} \right) } {\frac {B} {y+1}} = \frac {A(y+1)} {Bx} - 1 \label{Bdiff} \end{align}

To compare these relative differences, note that we can rewrite (\ref{priority}) as

\begin{align} \frac {A(y+1)} {Bx} &\gt \frac {B(x+1)} {Ay} \end{align}

Thus

\begin{align} \frac {A(y+1)} {Bx} - 1 &\gt \frac {B(x+1)} {Ay} - 1 \end{align}

and the relative difference is smaller before the seat transfer (using (\ref{Adiff}) and (\ref{Bdiff})). So the original apportionment is optimal. There was nothing special about the choice of \(A\) and \(B\), so we can conclude that the apportionment is optimal overall.

Again, this checks out for our example. The relative difference for 53 seats for California and 27 for New York is \(0.021\), versus \(0.035\) for 52 for California and 28 for New York.
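The arithmetic for this example is easy to check in Python. The function below is a small illustrative helper of my own, not part of the program referenced in [5]; it computes the relative difference in people per representative between an over-represented state and an under-represented one, as in (\ref{Adiff}) and (\ref{Bdiff}).

```python
def relative_difference(pop_over, seats_over, pop_under, seats_under):
    """Relative difference in people per representative, measured against
    the over-represented state (the one with fewer people per seat)."""
    per_rep_over = pop_over / seats_over
    per_rep_under = pop_under / seats_under
    return (per_rep_under - per_rep_over) / per_rep_over


CA, NY = 37_341_989, 19_421_055

# Actual apportionment: California 53 (over-represented), New York 27.
before = relative_difference(CA, 53, NY, 27)
# After moving one seat: New York 28 (now over-represented), California 52.
after = relative_difference(NY, 28, CA, 52)

print(round(before, 3), round(after, 3))
```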

References

[1] United States congressional apportionment, Wikipedia.

[2] Apportionment: Introduction, American Mathematical Society.

[3] "APPORTIONMENT POPULATION AND NUMBER OF REPRESENTATIVES, BY STATE: 2010 CENSUS", U.S. Census Bureau.

[4] The Apportionment of Representatives in Congress, E. V. Huntington, Transactions of the American Mathematical Society, Vol. 30, No. 1. (Jan., 1928), pp. 85-110.

[5] A program to calculate apportionments, Tom White, July 2012.

[6] PRIORITY VALUES FOR 2010 CENSUS, U.S. Census Bureau.

d3troit

2012-12-03T22:27:00.000+00:00

I wrote a visualization of the populations of the largest cities in the US over the years. I got the idea when I was in Detroit in the summer, and read about the huge decline in Detroit's city population since the 1950s when the automotive industry was at its peak. According to Wikipedia's Shrinking cities in the USA page, Detroit has declined by 61.4% from its peak population. This is a decline of over 1 million, which makes it the largest decline among US cities in absolute numbers, but not in percentage terms since St. Louis has a slightly higher percentage decline (62.7%, or 537,502 people).

These figures are all for city populations, not for the greater urban or metropolitan areas, which, in the case of Detroit, have both significantly increased in size since the 1950s. Detroit has therefore seen one of the largest population shifts to the suburbs since the middle of the 20th century. Much has been written on the dramatic decline of Detroit's city population, and the impact on life there. These are some pieces that I found interesting:

For more information on Detroit's history, particularly its buildings, I highly recommend Dan Austin's http://historicdetroit.org.
The Guardian has an amazing photo gallery of some of Yves Marchand and Romain Meffre's pictures of Detroit in ruins.
How to Bring Detroit Back From the Grave by Josh Harkinson, Mother Jones.
The Maker Culture is Reinventing Detroit by Gina Clifford, Wired.
We didn't get to see it, but a friend highly recommended the Heidelberg Project, a street art project.
Eliane on our visit: A journey into the past: Detroit in words

How I wrote the visualization

I used the wonderful d3 library to write the visualization, combining elements of the Population Pyramid example with an updating bar chart. The documentation on Object Constancy was critical to getting the visualization to work.

The population data is from the US Census Bureau, although I found the actual files linked from Wikipedia's Largest cities in the United States by population by decade. I had to write some simple scripts to turn the source data into CSV files that d3 could read.

Volcanoes!

2012-05-10T21:42:00.000+01:00

I've just finished reading "Super Volcano: The Ticking Time Bomb Beneath Yellowstone National Park" by Greg Breining. Despite the hyperbolic title, it's a really good introduction to the subject. Actually, the title is entirely appropriate, since the previous Yellowstone eruption around 600,000 years ago was one thousand times as powerful as the 1980 Mount St. Helens eruption. And it's likely to erupt again, but no one knows when.

We've been on a bit of a volcano tour recently. First we visited Lassen Volcanic National Park in October (climbing the Cinder Cone was a highlight), and we stopped in at the Mount St. Helens visitor center on our way to Seattle last month. Yesterday we ventured into the Yellowstone caldera (the bit that blew out in the last eruption).

Before reading the book I hadn't appreciated how recent our understanding of Yellowstone's geology is. It was only in the 1960s that scientists combined new empirical data about the ages of different rock formations in the park with the then emerging theory of plate tectonics. One of the scientists was Robert Christiansen of the U.S. Geological Survey, who, with Richard Blank, collected samples from all over Yellowstone and pieced together the puzzle of how Yellowstone formed. (He also wrote the definitive account of Yellowstone's geology in 2001.)

They realized that the calderas between Oregon and Wyoming were all formed by eruptions caused by what is now known as the Yellowstone hotspot over the last 16 million years. The continental plate is moving south west, which makes the newer volcanoes appear in the north east.

This diagram from Wikipedia summarizes it nicely:

What's new in Apache Whirr 0.5.0-incubating

2011-06-04T23:04:00.017+01:00

Apache Whirr 0.5.0-incubating is now available. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be fully endorsed by the ASF. Please read the full disclaimer.

In this release the Whirr development team have added many new features while still making the core more solid. This post covers some of the more important changes. The full list can be found in the release notes.

Improving the new user experience

Orchestrating multiple services on cloud instances is a challenge to make simple, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point with new users. The new Whirr in 5 Minutes guide walks through the minimum number of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the Quick Start Guide and the Configuration Guide.

The sample configurations in the recipes directory in the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the community.

New services

Elastic Search and Voldemort have been added to the roster of services that come with Whirr. This brings the total to six, the others being Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper.

API improvements

Whirr is still a young project so it is not surprising that its API is rapidly evolving. In WHIRR-245, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the org.apache.whirr package; whereas the service API is in org.apache.whirr.service.

You can find out more about writing Whirr services in this presentation (PDF).

The firewall API that service writers use to open ports for services was simplified and made more powerful in WHIRR-275.

Overriding scripts

This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to make a modification to the scripts that run on cloud instances - to tweak some settings, for instance - you would have to upload your modifications (as well as all the other scripts) to a publicly available web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.

The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the FAQ.

Running scripts on nodes

In 0.5.0 the scripts that run on cloud instances have been broken up to be more fine-grained, so many services have individual start and stop scripts (WHIRR-266). Combined with the ability to run scripts on sets of nodes in the cluster (by ID or role), users now have more control of the cluster once it has launched (WHIRR-173). Try running whirr run-script at the command line to use this feature. There's a contrib script to run the Yahoo! Cloud Serving Benchmark (YCSB) against an HBase cluster, which takes advantage of the run-script command (WHIRR-287).

Also useful is WHIRR-291, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with whirr run-script, run arbitrary scripts on them to bring them into the state you want.

Custom service builds

Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (WHIRR-220). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying whirr.zookeeper.tarball.url as a local file:// URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.

I used a variation of this feature to try out a nightly Hadoop 0.22 build on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.

Service improvements

Whirr is only able to exist because of the powerful abstraction that jclouds provides for interacting with cloud providers. A great example of this power is the API that jclouds provides for discovering the hardware capabilities of an instance running on any provider. WHIRR-282 took advantage of the jclouds API to find the number of cores on a node to dynamically configure the number of slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.

This is just the beginning - there is more work to use memory capabilities to set configuration (WHIRR-229), and to use hardware capabilities generally in services other than Hadoop.

Cluster state storage

In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (~/.whirr/<cluster-name>/instances). With WHIRR-288, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.

Bring Your Own Nodes

Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, etc. - then Whirr will start services on them. The nodes just need to have a base OS like CentOS or Ubuntu installed. You can find an example BYON configuration in the recipes directory of the download.

BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.

A hummingbird

Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.

Credits

I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by downloading the new release and joining us on the mailing lists.

What's next?

It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many open issues, but the general themes include:

Adding more services. In tandem, we want to make it easier to write new services by pushing common patterns into the core (WHIRR-326 is one example of this).
Improving existing services, by making them more flexible, better configured, and easier to manage.
Adding more cloud providers. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.
Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (WHIRR-255).
Supporting elastic clusters, so new nodes can be added to running clusters (WHIRR-214).

Do Donors Choose Local Schools?

2011-05-09T04:23:00.008+01:00

DonorsChoose.org is a site where people donate money to school projects. For example, a teacher in Iowa might create a project request for some beanbags to create a reading area for her pupils. Then, via the website, donors can give as much or as little as they like to the project, and once the target is reached DonorsChoose purchase and deliver the beanbags to the school.

DonorsChoose are running a contest. They have opened up their data, and are challenging developers to "make discoveries and build apps that improve education in America".

I thought I'd do a little hack to answer the question "Do donors tend to choose their local schools?"

I wrote a short Python program to calculate the distance between each donor's address (where it was provided) and the address of the school for the project they were donating to. Then, using R, I plotted the following histogram:

It's striking that many donors are local. In fact, in my analysis, one in four donors live within four miles of the school they are donating to, and the median distance is 128 miles. However, there is a long tail reaching to over 5000 miles!

If we use a logarithmic scale for the y-axis (count), then a couple of features jump out. This plot is a scatter plot where counts are bucketed by integer distance.

There is a small peak at around 2500 miles, which is puzzling until you realize that this is the approximate distance between the East Coast and West Coast of the USA, where the majority of the population is located. I'm guessing that this bump corresponds to people who donate to schools of friends and relatives on the other coast.

The other noticeable feature is the significant drop off after 2500 miles. This small number of donations is where the donor or school is located in the non-contiguous states (Alaska and Hawaii), which have only a small fraction of the total population.

How I produced the images

I wrote a Python program to parse the CSV data from DonorsChoose. It reads two data files - the projects file and the donations file. The files are joined by the project ID field, which means we can access the school ZIP code (from the projects file), and the partial ZIP code of the donor (from the donations file). The donor's ZIP code is optional (and was actually only present in 46% of donations, so the results are restricted to this subset of donations). Also, for privacy reasons, only the first 3 digits of the donor's ZIP code are provided by DonorsChoose. This makes the distance measurements less accurate, particularly for local donors.
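As a rough illustration of the join described above, here is a minimal sketch using only the standard library. It is not the code from the repository: the column names (project_id, school_zip, donor_zip) and the sample rows are assumptions made up for the example, and the real DonorsChoose files may be laid out differently.

```python
import csv
import io

# Hypothetical column names and sample rows; the real files may differ.
PROJECTS_CSV = """project_id,school_zip
p1,50309
p2,94110
"""

DONATIONS_CSV = """donation_id,project_id,donor_zip
d1,p1,503
d2,p1,
d3,p2,100
"""


def load_school_zips(projects_file):
    """Map each project ID to the ZIP code of its school."""
    return {row["project_id"]: row["school_zip"]
            for row in csv.DictReader(projects_file)}


def join_donations(donations_file, school_zips):
    """Yield (donor ZIP prefix, school ZIP) pairs.

    Donations without a donor ZIP code, or for an unknown project,
    are skipped.
    """
    for row in csv.DictReader(donations_file):
        donor_zip = row["donor_zip"].strip()
        school_zip = school_zips.get(row["project_id"])
        if donor_zip and school_zip:
            yield donor_zip, school_zip


school_zips = load_school_zips(io.StringIO(PROJECTS_CSV))
pairs = list(join_donations(io.StringIO(DONATIONS_CSV), school_zips))
print(pairs)
```

Loading the projects file into a dictionary first keeps the join to a single pass over the (much larger) donations file.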

In the case of the partial ZIP code matching the school ZIP code, I set the distance to zero, on the assumption that the donor lives close to the school. This assumption will tend to overcount the zero distance case, and undercount small distances.

If the partial ZIP code did not match the school ZIP code, I chose a ZIP code with that prefix at random and calculated the distance between that ZIP code and the school's ZIP code. For this calculation I used Kevin T. Ryan's Python code at ActiveState, which I modified slightly to support partial ZIP codes.
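The matching rules above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the modified ActiveState code: the haversine great-circle formula is swapped in for the distance calculation, and ZIP_COORDS is a tiny hypothetical lookup table with approximate coordinates.

```python
import math
import random

# Hypothetical ZIP -> (latitude, longitude) table, approximate values only.
ZIP_COORDS = {
    "50309": (41.59, -93.62),   # Des Moines, IA
    "10001": (40.75, -73.99),   # New York, NY
    "10002": (40.72, -73.99),   # New York, NY
    "94110": (37.75, -122.42),  # San Francisco, CA
}

EARTH_RADIUS_MILES = 3959.0


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def donor_distance(donor_prefix, school_zip, rng=random):
    """Estimate the donor-to-school distance from a partial donor ZIP code.

    Returns 0.0 when the prefix matches the school ZIP, and None when
    either ZIP cannot be located in the lookup table.
    """
    if school_zip.startswith(donor_prefix):
        return 0.0  # assume the donor lives close to the school
    if school_zip not in ZIP_COORDS:
        return None
    candidates = sorted(z for z in ZIP_COORDS if z.startswith(donor_prefix))
    if not candidates:
        return None
    donor_zip = rng.choice(candidates)  # pick a ZIP with that prefix at random
    return haversine_miles(ZIP_COORDS[donor_zip], ZIP_COORDS[school_zip])


print(donor_distance("503", "50309"))
print(donor_distance("100", "94110", random.Random(0)))
```

Passing in a seeded random.Random makes the random choice repeatable, which is handy when checking results.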

The program buckets integer distances and writes the counts to a file. I then used R to plot the distributions shown above.
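A minimal sketch of the bucketing step, assuming that each distance is truncated to a whole number of miles and that the output is a tab-separated file of distance and count (both are assumptions; the actual program may round or format differently):

```python
import io
from collections import Counter


def bucket_distances(distances):
    """Count distances by whole mile, truncating each one to an integer."""
    return Counter(int(d) for d in distances)


def write_counts(counts, out):
    """Write one 'miles<TAB>count' line per bucket, in ascending distance order."""
    for miles in sorted(counts):
        out.write(f"{miles}\t{counts[miles]}\n")


# Made-up sample distances, in miles.
counts = bucket_distances([0.0, 0.4, 3.9, 3.2, 128.7])
buffer = io.StringIO()
write_counts(counts, buffer)
print(buffer.getvalue(), end="")
```

A file in this shape is easy to read into R as a two-column table for plotting.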

I've put all my code into a GitHub repository.

This hack just scratches the surface of the dataset, and I look forward to seeing some of the cool things that others do in this contest. The closing date is June 30, 2011.

Whirr in 5 Minutes

2011-04-16T23:23:00.005+01:00

A couple of days ago I wrote down a sequence of command lines to install Apache Whirr (an incubator project for running distributed systems on various cloud providers) and run a service from scratch. You just need Java, SSH, and some cloud credentials (Amazon EC2 in this case). I've reproduced the commands here:


export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.4.0-incubating/whirr-0.4.0-incubating.tar.gz
tar zxf whirr-0.4.0-incubating.tar.gz; cd whirr-0.4.0-incubating
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

At this point you should have a three-node ZooKeeper cluster running, which you can easily check with:


echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo

You can shut down the cluster with the following command:


bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties

There are recipes for more services in the Whirr download package, and more detailed instructions in the Quick Start Guide.

My favourite talk at Devoxx 2010

2010-11-28T05:22:00.004+00:00

I went to Devoxx in Antwerp for the first time this year, and really enjoyed it. I didn't go to that many talks, but the quality seemed very high. My favourite talk was "Performance Anxiety" by Josh Bloch, because he's a great speaker and because he presented a single important idea so well.

The idea was this: determining the performance of programs should be treated as an empirical science. We should give up any hope (if any existed) that predicting a program's performance will become easier in the future, since every layer in the deep stack of a modern computer is becoming more complex. Increased complexity is actually the price we must pay for increased performance. And increased complexity leads, almost inevitably, to reduced predictability.

As an experimental demonstration, Josh ran a micro benchmark to sort an array of integers. (The demo actually failed to show what he wanted to show, but he assured us it had worked earlier... It's somehow reassuring when live demos don't work for Java demigods either.) Each invocation of the benchmark did a number of runs, and the timings of the runs converged on a stable value. However, between benchmark invocations, the stable values that they converged on varied by up to 20%.

The reason is subtle: the HotSpot compiler produces different compile plans on different runs, and these have different performance profiles. (This is explained in Cliff Click's 2009 JavaOne presentation, "The Art of (Java) Benchmarking".) They all converge on stable values, but different stable values for different runs. The fact that HotSpot is non-deterministic may not be particularly surprising, but Josh said that the same behaviour has been shown in C code and even assembler, since non-determinism exists at lower levels of the stack too.

The practical upshot is that we need to change how we iteratively benchmark code. No longer is it permissible to run a benchmark, make a change, run the benchmark again, see that the execution time was faster (even across a number of runs in one VM) and legitimately conclude that it was due to the change we made. We have to reach for statistical tools that tell us the improved execution time was significant after we have run enough VMs.

How many VMs? The short answer is "30", the longer answer is in "Statistically Rigorous Java Performance Evaluation" by Andy Georges, Dries Buytaert, and Lieven Eeckhout.
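To make the idea concrete, here is a simplified sketch of the kind of comparison involved. It assumes each sample is the stable timing from one separate VM invocation, and it uses a normal-approximation 95% confidence interval (z = 1.96), which is reasonable for around 30 samples or more. The function names and the timings are made up for illustration; the paper describes the rigorous methodology.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (low, high) bounds of the confidence interval for the mean."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * sem, mean + z * sem


def significantly_faster(before, after, z=1.96):
    """True if the 'after' interval lies entirely below the 'before' interval."""
    before_low, _ = mean_confidence_interval(before, z)
    _, after_high = mean_confidence_interval(after, z)
    return after_high < before_low


# Made-up timings (ms), one per VM invocation, 30 invocations each.
before = [100 + (i % 5) for i in range(30)]
big_change = [90 + (i % 5) for i in range(30)]
small_change = [99.5 + (i % 5) for i in range(30)]

print(significantly_faster(before, big_change))
print(significantly_faster(before, small_change))
```

With these numbers the large improvement clears the bar, while the half-millisecond one does not, because its interval overlaps the baseline's.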

Thankfully there is a Java framework called Caliper which can help you run microbenchmarks and which even plots the error bars for you. This stuff needs to see wider adoption in the industry.

"Hadoop: The Definitive Guide" Coming Soon

2009-05-01T16:52:00.008+01:00

After a busy couple of months I've finished the writing for "Hadoop: The Definitive Guide". It's now going through the production process at O'Reilly.

You can pre-order it on Amazon and O'Reilly. You can also get the Rough Cuts version from O'Reilly to read today, although it hasn't yet been refreshed with my latest draft (I hope that will happen in the next few days).

Here's the final chapter listing. Readers of earlier drafts will notice that the number of chapters has grown: this is because the elephantine MapReduce chapter has been split into three (chapters 6, 7, and 8) to make things more digestible.

Meet Hadoop
MapReduce
The Hadoop Distributed Filesystem
Hadoop I/O
Developing a MapReduce Application
How MapReduce Works
MapReduce Types and Formats
MapReduce Features
Setting Up a Hadoop Cluster
Administering Hadoop
Pig
HBase
ZooKeeper
Case Studies

The writing's done but I still have to package up the example code. I'll be doing this soon, and it will appear on the book's website.

Draft Pig Chapter

2009-01-27T10:01:00.002+00:00

A couple of quick updates on the Hadoop book I'm writing. The Pig chapter is now available on Safari. It still has a few holes, but I'd love to hear feedback on it.

Also included is a Hadoop case study from Last.fm. Thanks to Adrian Woodhead and Marc de Palol for writing it.

Hadoop Developer Zeitgeist

2008-11-20T10:39:00.005+00:00

The Cloudera team have just released a website which has a few reports on various Hadoop development metrics. I like the Most Watched Open Jira Issues, as it gives a good summary of what Hadoop Core developers are thinking about.

Personally, I can't wait for the new MapReduce API (HADOOP-1230), which is currently the third most watched issue.