<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" xml:lang="en-US">
  <id>tag:oobaloo.co.uk,2013:/posts</id>
  <link rel="alternate" type="text/html" href="http://oobaloo.co.uk" />
  
  <title>pingles</title>
  <updated>2013-05-30T18:53:49Z</updated>
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/oobaloo" /><feedburner:info uri="oobaloo" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><link rel="license" type="text/html" href="http://creativecommons.org/licenses/by-nc-sa/2.0/" /><entry>
    <id>tag:oobaloo.co.uk,2013:Post/579269</id>
    <published>2013-05-30T18:46:00Z</published>
    <updated>2013-05-30T18:53:49Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/mPUy_8IaKjc/amazon-redshift-plus-r-analytics-flow" />
    <title>Amazon Redshift + R: Analytics Flow</title>
    <content type="html">
      &lt;p&gt;Ok, so it’s a slightly fanboy-ish title but I’m starting to really like the early experimentation we’ve been doing with &lt;a target="_blank" href="http://aws.amazon.com/redshift/"&gt;Amazon’s Redshift&lt;/a&gt; service at &lt;a target="_blank" href="http://www.uswitch.com"&gt;uSwitch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our current data platform is a mix of &lt;a target="_blank" href="http://kafka.apacheorg"&gt;Apache Kafka&lt;/a&gt;, Apache &lt;a target="_blank" href="http://hadoop.apache.org" title="Link: http://hadoop.apache.org"&gt;Hadoop&lt;/a&gt;/&lt;a target="_blank" href="http://hive.apache.org"&gt;Hive&lt;/a&gt; and a set of heterogenous data sources mixed across the organisation (given &lt;a target="_blank" href="http://skillsmatter.com/podcast/design-architecture/keynote-3623"&gt;we’re fans of letting the right store find it’s place&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The data we ingest is reasonably sizeable (gigabytes a day); certainly enough to trouble the physical machines uSwitch used to host with. However, for nearly the last 3 years we’ve been breaking uSwitch’s infrastructure and systems apart and it’s now &lt;a target="_blank" href="http://www.quantisan.com/how-a-few-screws-cost-2000-and-a-240gb-multinodes-cluster-cost-50/"&gt;much easier to consume whatever resources you need&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Building &lt;a target="_blank" href="http://vimeo.com/45136211"&gt;data systems on immutable principles&lt;/a&gt; also makes this kind of experimentation so much easier. For a couple of weeks we (&lt;a target="_blank" href="http://www.quantisan.com"&gt;Paul&lt;/a&gt; and I) have been re-working some of our data warehousing ETL to see what a Redshift analytics world looks like.&lt;/p&gt;

&lt;p&gt;Of course it’s possible to just connect any JDBC SQL client to Redshift but we want to be able to do some more interactive analysis on the data we have. We want an Analytics &lt;a target="_blank" href="http://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop"&gt;REPL&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Redshift in R&lt;/h2&gt;

&lt;p&gt;I’m certainly still a novice when it comes to both statistical analyses and R but it’s something I’m enjoying- and I’m lucky to work with people who are great at both.&lt;/p&gt;

&lt;p&gt;R already has a package for connecting to databases using &lt;a target="_blank" href="http://www.rforge.net/RJDBC/"&gt;JDBC&lt;/a&gt; but I built a small R package that includes both the Postgresql 8.4 JDBC driver and a few functions to make it nicer to interact with: &lt;a target="_blank" href="https://github.com/pingles/redshift-r/"&gt;Redshift.R&lt;/a&gt;. N.B. this was partly so I could learn about writing R packages, and partly about making it trivial for other R users in the company to get access to our experimental cluster.&lt;/p&gt;

&lt;p&gt;The package is pretty easy to install- download the tarball, uncompress and run an R statement. The &lt;a target="_blank" href="https://github.com/pingles/redshift-r/"&gt;full instructions are available on the project’s homepage&lt;/a&gt;. Once you’ve installed it you’re done- no need to download anything else.&lt;/p&gt;

&lt;h2&gt;Flow&lt;/h2&gt;

&lt;p&gt;What I found really interesting, however, was how I found my workflow once data was accessible in Redshift and directly usable from inside my R environment; the 20 minute lead/cycle time for a Hive query was gone and I could work interactively.&lt;/p&gt;

&lt;p&gt;I spent about half an hour working through the following example- it’s pretty noddy analytics but shows why I’m starting to get a little excited about Redshift: I can work mostly interactively without needing to break my work into pieces and switch around the whole time.&lt;/p&gt;&lt;script src="https://gist.github.com/pingles/5590842.js"&gt;&lt;/script&gt;&lt;h2&gt;Disclosure&lt;/h2&gt;

&lt;p&gt;It would be remiss of me not to mention that R already has packages for connecting to Hadoop and &lt;a target="_blank" href="http://cran.r-project.org/web/packages/RHive/"&gt;Hive&lt;/a&gt;, and work to provide faster querying through tools like &lt;a target="_blank" href="http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/"&gt;Cloudera’s Impala&lt;/a&gt;. My epiphany is probably also very old news to those already familiar with connecting to Vertica or Teradata warehouses with ODBC and R. &lt;/p&gt;

&lt;p&gt;The killer thing for me is that it cost us probably a few hundred dollars to create a cluster with production data in, kick the tyres, and realise there’s a much better analytics cycle for us out there. We're really excited to see where this goes.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/mPUy_8IaKjc" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/amazon-redshift-plus-r-analytics-flow</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85011</id>
    <published>2013-01-09T09:43:00Z</published>
    <updated>2013-05-21T03:39:04Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/PXJT6Cy5xTY/kafka-for-uswitchs-event-pipeline" />
    <title>Kafka for uSwitch's Event Pipeline</title>
    <content type="html">
      &lt;p&gt;&lt;a href="http://incubator.apache.org/kafka/"&gt;Kafka&lt;/a&gt; is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of uSwitch.com’s new data analytics pipeline and this post will cover a little about Kafka and how we’re using it.&lt;/p&gt;
&lt;p&gt;Kafka is both performant and durable. To make it easier to achieve high throughput on a single node it also does away with lots of stuff message brokers ordinarily provide (making it a &lt;em&gt;simpler&lt;/em&gt; distributed messaging system).&lt;/p&gt;
&lt;h2&gt;Messaging&lt;/h2&gt;
&lt;p&gt;Over the past 2 years we’ve migrated from a monolithic environment based around Microsoft .NET and SQL Server to a mix of databases, applications and services. These change over time: applications and servers will come and go.&lt;/p&gt;
&lt;p&gt;This diversity is great for productivity but has made data analytics as a whole more difficult.&lt;/p&gt;
&lt;p&gt;We use Kafka to make it easier for the assortment of micro-applications and services, that compose to form uSwitch.com, to exchange and publish data.&lt;/p&gt;
&lt;p&gt;Messaging helps us decouple the parts of the infrastructure letting consumers and producers evolve and grow over time with less centralised coordination or control; I’ve referred to this as building a Data Ecosystem before.&lt;/p&gt;
&lt;p&gt;Kafka lets us consume data in realtime (so we can build reactive tools and products) and provides a unified way of getting data into long-term storage (HDFS).&lt;/p&gt;
&lt;p&gt;&lt;img src="http://oobaloo-assets.s3.amazonaws.com/kafka_broker.jpg" alt="Consumers and producers"&gt;&lt;/p&gt;
&lt;p&gt;Kafka’s model is pretty general; messages are published onto topics by producers, stored on disk and made available to consumers. It’s important to note that messages are pulled by consumers to avoid needing any complex throttling in the event of slow consumption.&lt;/p&gt;
&lt;p&gt;Kafka doesn’t dictate any serialisation it just expects a payload of &lt;code&gt;byte[]&lt;/code&gt;. We’re using &lt;a href="http://code.google.com/p/protobuf/"&gt;Protocol Buffers&lt;/a&gt; for most of our topics to make it easier to evolve schemas over time. Having a repository of definitions has also made it slightly easier for teams to see what events they can publish and what they can consume.&lt;/p&gt;
&lt;p&gt;This is what it looks like in Clojure code using &lt;a href="http://github.com/pingles/clj-kafka"&gt;clj-kafka&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/3529067.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We use messages to record the products that are shown across our site, the searches that people perform, emails that are sent (and bounced), web requests and more. In total it’s probably a few million messages a day.&lt;/p&gt;
&lt;h2&gt;Metadata and State&lt;/h2&gt;
&lt;p&gt;Kafka uses Zookeeper for various bits of meta-information, including tracking which messages have already been retrieved by a consumer. To that end, it is the &lt;em&gt;consumers&lt;/em&gt; responsibility to track consumption- not the broker. Kafka’s client library already contains a Zookeeper consumer that will track the message offsets that have been consumed.&lt;/p&gt;
&lt;p&gt;As an side, the broker keeps no state about any of the consumers directly. This keeps it simple and means that there’s no need for complex structures kept in memory reducing the need for garbage collections.&lt;/p&gt;
&lt;p&gt;When messages are received they are written to a log file (well, handed off to the OS to write) named after the topic; these are serial append files so individual writes don’t need to block or interfere with each other.&lt;/p&gt;
&lt;p&gt;When reading messages consumers simply access the file and read data from it. It’s possible to perform parallel consumption through partitioned topics although this isn’t something we’ve needed yet.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://oobaloo-assets.s3.amazonaws.com/kafka_storage.png" alt="Topic and message storage"&gt;&lt;/p&gt;
&lt;p&gt;Messages are tracked by their offset- letting consumers access from a given point into the topic. A consumer can connect and ask for all messages that Kafka has stored currently, or from a specified offset. This relatively long retention (compared to other messaging systems) makes Kafka extremely useful to support both real-time and batch reads. Further, because it takes advantage of disk throughput it makes it a cost-effective system too.&lt;/p&gt;
&lt;p&gt;The broker can be configured to keep messages up to a specified quantity or for a set period of time. Our broker is configured to keep messages for up to 20 days, after that and you’ll need to go elsehwere (most topics are stored on HDFS afterwards). This characteristic that has made it so useful for us- it makes getting data out of applications and servers and into other systems much easier, and more reliable, than periodically aggregating log files.&lt;/p&gt;
&lt;h2&gt;Performance&lt;/h2&gt;
&lt;p&gt;Kafka’s performance (and the design that achieves it) is derived from the observation that disk throughput has outpaced latency; it writes and reads sequentially and uses the operating system’s file system caches rather than trying to maintain its own- minimising the JVM working set, and again, avoiding garbage collections.&lt;/p&gt;
&lt;p&gt;The plot below shows results published within an &lt;a href="http://queue.acm.org/detail.cfm?id=1563874"&gt;ACM article&lt;/a&gt;; their experiment was to measure how quickly they could read 4-byte values sequentially and randomly from different storage.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://oobaloo-assets.s3.amazonaws.com/disk_perf.png" alt="Performance"&gt;&lt;/p&gt;
&lt;p&gt;Please note the scale is logarithmic because the difference between random and sequential is so large for both SSD and spinning disks.&lt;/p&gt;
&lt;p&gt;Interestingly, it shows that &lt;strong&gt;sequential disk access, spinning or SSD, is faster than random memory access&lt;/strong&gt;. It also shows that, in their tests, sequential spinning disk performance was higher than SSD.&lt;/p&gt;
&lt;p&gt;In short, using sequential reads lets Kafka get performance close to random memory access. And, by keeping very little in the way of metadata, the broker can be extremely lightweight.&lt;/p&gt;
&lt;p&gt;If anyone is interested, the &lt;a href="http://incubator.apache.org/kafka/design.html"&gt;Kafka design document&lt;/a&gt; is very interesting and accessible.&lt;/p&gt;
&lt;h2&gt;Batch Load into HDFS&lt;/h2&gt;
&lt;p&gt;As I mentioned earlier, most topics are stored on HDFS so that we can maximise the amount of analysis we can perform over time.&lt;/p&gt;
&lt;p&gt;We use a Hadoop job that is derived from the code included within the Kafka distribution.&lt;/p&gt;
&lt;p&gt;The process looks a little like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://oobaloo-assets.s3.amazonaws.com/kafka_hadoop.png" alt="Hadoop Loading"&gt;&lt;/p&gt;
&lt;p&gt;Each topic has a directory on HDFS that contains 2 further subtrees: these contain offset token files and data files. The input to the Hadoop job is an offset token file which contains the details of the broker to consume from, the message offset to read from, and the name of the topic. Although it’s a &lt;code&gt;SequenceFile&lt;/code&gt; the value bytes contain a string that looks like this:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;broker.host.com topic-name  102991&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The job uses a &lt;code&gt;RecordReader&lt;/code&gt; that connects to the Kafka broker and passes the message payload directly through to the mapper. Most of the time the mapper will just write the whole message bytes directly out which is then written using Hadoop’s &lt;a href="http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.html"&gt;SequenceFileOutputFormat&lt;/a&gt; (so we can compress and split the data for higher-volume topics) and Hadoop’s &lt;a href="http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html"&gt;MultipleOutputs&lt;/a&gt; so we can write out 2 files- the data file and a newly updated offset token file.&lt;/p&gt;
&lt;p&gt;For example, if we run the job and consume from offset &lt;code&gt;102991&lt;/code&gt; to offset &lt;code&gt;918280&lt;/code&gt;, this will be written to the offset token file:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;broker.host.com topic-name  918280&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Note that the contents of the file is exactly the same as before just with the offset updated. All the state necessary to perform incremental loads is managed by the offset token files.&lt;/p&gt;
&lt;p&gt;This ensures that the next time the job runs we can incrementally load only the new messages. If we introduce a bug into the Hadoop load job we can just delete one or more of the token files to cause the job to load from further back in time.&lt;/p&gt;
&lt;p&gt;Again, Kafka’s inherent persistence makes dealing with these kinds of HDFS loads much easier than dealing with polling for logs. Previously we’d used other databases to store metadata about the daily rotated logs we’d pulled but there was lots of additional computation in splitting apart files that would span days- incremental loads with Kafka are infinitely cleaner and efficient.&lt;/p&gt;
&lt;p&gt;Kafka has helped us both simplify our data collection infrastructure, letting us evolve and grow it more flexibly, and provided the basis for building real-time systems. It’s extremely simple and very easy to setup and configure, I’d highly recommend it for anyone playing in a similar space.&lt;/p&gt;
&lt;h2&gt;Related Stuff&lt;/h2&gt;
&lt;p&gt;As I publish this LinkedIn have just announced the release of &lt;a href="https://github.com/linkedin/camus"&gt;Camus&lt;/a&gt;: their Kafka to HDFS pipeline. The pipeline I’ve described above was inspired by the early Hadoop support within Kafka but has since evolved into something specific for use at uSwitch.&lt;/p&gt;
&lt;p&gt;Twitter also just published about their use of &lt;a href="http://engineering.twitter.com/2013/01/improving-twitter-search-with-real-time.html"&gt;Kafka and Storm to provide real-time search&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I can also recommend reading &lt;a href="http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf"&gt;“The Unified Logging Infrastructure for Data Analytics at Twitter”&lt;/a&gt; paper that was published late last year.&lt;/p&gt;
&lt;p&gt;Finally, this post was based on a brief presentation I gave internally in May last year: &lt;a href="https://speakerdeck.com/pingles/kafka-a-little-introduction"&gt;Kafka a Little Introduction&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/PXJT6Cy5xTY" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/kafka-for-uswitchs-event-pipeline</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85052</id>
    <published>2012-12-19T15:35:00Z</published>
    <updated>2013-05-21T03:39:04Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/J9mP13_y7bM/clojure-from-callbacks-to-sequences" />
    <title>Clojure - From Callbacks to Sequences</title>
    <content type="html">
      &lt;p&gt;I was doing some work with a &lt;a href="http://twitter.com/Quantisan"&gt;colleague&lt;/a&gt; earlier this week which involved connecting to an internal RabbitMQ broker and transforming some messages before forwarding them to our &lt;a href="http://kafka.apache.org/"&gt;Kafka&lt;/a&gt; broker.&lt;/p&gt;
&lt;p&gt;We’re using &lt;a href="https://github.com/michaelklishin/langohr"&gt;langohr&lt;/a&gt; to connect to RabbitMQ. &lt;a href="http://clojurerabbitmq.info/articles/queues.html"&gt;Its consumer and queue documentation&lt;/a&gt; shows how to use the &lt;code&gt;subscribe&lt;/code&gt; function to connect to a broker and print messages that arrive:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4335657.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;The example above is pretty close to what we started working with earlier today. It’s also quite similar to a lot of other code I’ve written in the past: connect to a broker or service and provide a block/function to be called when something interesting happens.&lt;/p&gt;
&lt;h2&gt;Sequences, not handlers&lt;/h2&gt;
&lt;p&gt;Although there’s nothing wrong with this I think there’s a nicer way: flip the responsibility so instead of the subscriber &lt;em&gt;pushing&lt;/em&gt; to our handler function we consume it through Clojure’s sequence abstraction.&lt;/p&gt;
&lt;p&gt;This is the approach I took when I wrote &lt;a href="https://github.com/pingles/clj-kafka"&gt;clj-kafka&lt;/a&gt;, a Clojure library to interact with &lt;a href="http://kafka.apache.org"&gt;LinkedIn’s Kafka&lt;/a&gt; (as an aside, Kafka is really cool- I’m planning a blog post on how we’ve been building a new data platform for &lt;a href="http://www.uswitch.com/"&gt;uSwitch.com&lt;/a&gt; but it’s well worth checking out).&lt;/p&gt;
&lt;p&gt;Here’s a little example of consuming messages through a sequence that’s taken from the &lt;a href="https://github.com/pingles/clj-kafka/blob/master/README.md"&gt;clj-kafka README&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4327700.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We create our consumer and access messages through a sequence abstraction by calling &lt;code&gt;messages&lt;/code&gt; with the topic we wish to consume from.&lt;/p&gt;
&lt;p&gt;The advantage of exposing the items through a sequence is that it becomes instantly composable with the many functions that already exist within Clojure: &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt; etc.&lt;/p&gt;
&lt;p&gt;In my experience, when writing consumption code that uses handler functions/callbacks I’ve ended up with code that looks like this:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4327722.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;It makes consuming data more complicated and pulls more complexity into the handler function than necessary.&lt;/p&gt;
&lt;h2&gt;Push to Pull&lt;/h2&gt;
&lt;p&gt;This is all made possible thanks to a &lt;a href="http://clj-me.cgrand.net/2010/04/02/pipe-dreams-are-not-necessarily-made-of-promises/"&gt;lovely function written by Christophe Grande&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4327681.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;The function returns a vector containing 2 important parts: the sequence, and a function to put things into that sequence.&lt;/p&gt;
&lt;p&gt;Returning to our original RabbitMQ example, we can change the subscriber code to use &lt;code&gt;pipe&lt;/code&gt; to return the sequence that accesses the queue of messages:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4327837.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We can then &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt; and more.&lt;/p&gt;
&lt;p&gt;We pull responsibility out of the handler function and into the consumption of the sequence. This is really important, and it compliments something else which I’ve recently noticed myself doing more often.&lt;/p&gt;
&lt;p&gt;In the handler function above I convert the function parameters to a map containing &lt;code&gt;:payload&lt;/code&gt;, &lt;code&gt;:ch&lt;/code&gt; and &lt;code&gt;:msg-meta&lt;/code&gt;. In our actual application we’re only concerned with reading the message payload and converting it from a JSON string to a Clojure map.&lt;/p&gt;
&lt;p&gt;Initially, we started writing something similar to this:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4328407.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We have a function that exposes the messages through a sequence, but we pass a kind of transformation function as the last argument to &lt;code&gt;subscriber-seq&lt;/code&gt;. This initially felt ok: &lt;code&gt;subscriber-seq&lt;/code&gt; calls our handler and extracts the payload into our desired representation before putting it into the queue that backs the sequence.&lt;/p&gt;
&lt;p&gt;But we’re pushing more responsibility into &lt;code&gt;subscriber-seq&lt;/code&gt; than needs to be there.&lt;/p&gt;
&lt;p&gt;We’re just extracting and transforming messages as they appear in the sequence so we can and &lt;em&gt;should&lt;/em&gt; be building upon Clojure's existing functions: &lt;span&gt;map &lt;/span&gt;and the like. The code below feels much better:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4328456.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;It feels better for a similar reason as moving the handler to a sequence- we’re making our function less complex and encouraging the composition through the many functions that already exist. Line 13 is a great example of this for me- &lt;code&gt;map&lt;/code&gt;’ing a composite function to transform the incoming data rather than adding more work into &lt;code&gt;subscriber-seq&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Pipe&lt;/h2&gt;
&lt;p&gt;I’ve probably used Christophe’s &lt;code&gt;pipe&lt;/code&gt; function 3 or 4 times this year to take code that started with handler functions and evolved it to deal with sequences. I think it’s a really neat way of making callback-based APIs more elegant.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/J9mP13_y7bM" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/clojure-from-callbacks-to-sequences</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85058</id>
    <published>2012-12-17T13:47:00Z</published>
    <updated>2013-05-21T03:39:04Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/xkX8ZTMU8sc/multi-armed-bandit-optimisation-in-clojure" />
    <title>Multi-armed Bandit Optimisation in Clojure</title>
    <content type="html">
      &lt;p&gt;The multi-armed (also often referred to as K-armed) bandit problem models the problem a gambler faces when attempting maximise the reward from playing multiple machines with varying rewards.&lt;/p&gt;
&lt;p&gt;For example, let’s assume you are standing in front of 3 arms and that you don’t know the rate at which they will reward you. How do you set about the task of pulling the arms to maximise your cumulative reward?&lt;/p&gt;
&lt;p&gt;It turns out there’s a fair bit of &lt;a href="http://en.wikipedia.org/wiki/Multi-armed_bandit"&gt;literature on this topic&lt;/a&gt;, and it’s also the subject of a recent O’Reilly book: &lt;a href="http://shop.oreilly.com/product/0636920027393.do"&gt;“Bandit Algorithms for Website Optimization”&lt;/a&gt; by &lt;a href="http://www.johnmyleswhite.com/"&gt;John Myles White&lt;/a&gt; (who also co-wrote the excellent &lt;a href="http://shop.oreilly.com/product/0636920018483.do"&gt;Machine Learning for Hackers&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This article discusses my implementation of the algorithms which is available on GitHub and Clojars: &lt;a href="https://github.com/pingles/clj-bandit"&gt;https://github.com/pingles/clj-bandit&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Enter Clojure&lt;/h2&gt;
&lt;p&gt;I was happy to attend this year’s clojure-conj and started reading through the PDF whilst on the flight out. Over the next few evenings, afternoons and mornings (whenever I could squeeze in time) I spent some time hacking away at implementing the algorithms in Clojure. It was great fun and I pushed the results into a library: &lt;a href="https://github.com/pingles/clj-bandit"&gt;clj-bandit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I was initially keen on implementing the same algorithms and being able to reproduce the results shown in the book. Since then I’ve spent a little time tweaking parts of the code to be a bit more functional/idiomatic. The bulk of this post covers this transition.&lt;/p&gt;
&lt;p&gt;I started with a structure that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4219087.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;From there I started layering on functions that select the arm with each algorithm’s &lt;code&gt;select-arm&lt;/code&gt; implemented in its own namespace.&lt;/p&gt;
&lt;p&gt;One of the simplest algorithms is Epsilon-Greedy: it applies a fixed probability when deciding whether to explore (try other arms) or exploit (pull the currently highest-rewarding arm).&lt;/p&gt;
&lt;p&gt;The code, &lt;a href="https://github.com/pingles/clj-bandit/blob/master/src/clj_bandit/algo/epsilon.clj#L6"&gt;as implemented in clj-bandit&lt;/a&gt;, looks like this:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4219137.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We generate a random number (between 0 and 1) and either pick the best performing or a random item from the arms.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://github.com/pingles/clj-bandit/blob/0542e9b14ca11f6ffe450c083cd05319e27c96ca/src/clj_bandit/algo/epsilon.clj"&gt;my initial implementation&lt;/a&gt; I kept algorithm functions together in a protocol, and used another protocol for storing/retrieving arm data. These were reified into an ‘algorithm’:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4219164.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Applying &lt;code&gt;select-arm&lt;/code&gt; to the current arm state would select the next arm to pull. Having pulled the arm, &lt;code&gt;update-reward&lt;/code&gt; would let the ‘player’ track whether they were rewarded or not.&lt;/p&gt;
&lt;p&gt;This worked, but it looked a little kludgey and made the corresponding &lt;a href="https://github.com/pingles/clj-bandit/blob/0542e9b14ca11f6ffe450c083cd05319e27c96ca/src/clj_bandit/simulate.clj"&gt;monte-carlo simulation code&lt;/a&gt; equivalently disgusting.&lt;/p&gt;
&lt;p&gt;I initially wanted to implement all the algorithms so I could reproduce the same results that were included in the book but the resulting code definitely didn’t feel right.&lt;/p&gt;
&lt;h2&gt;More Functional&lt;/h2&gt;
&lt;p&gt;After returning from the conference I started looking at the code and started moving a few functions around. I dropped the protocols and went back to the original datastructure to hold the algorithm’s current view of the world.&lt;/p&gt;
&lt;p&gt;I decided to change my approach slightly and introduced a record to hold data about the arm.&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4317018.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;The algorithms don’t need to know about the identity of any individual arm- they just need to pick one from the set. It tidied a lot of the code in the algorithms. For example, here’s the &lt;code&gt;select-arm&lt;/code&gt; code from the UCB algorithm:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4317047.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;h2&gt;Functional Simulation&lt;/h2&gt;
&lt;p&gt;The cool part about the book and it’s &lt;a href="http://github.com/johnmyleswhite/BanditsBook"&gt;accompanying code&lt;/a&gt; is that includes a simulation suitable for measuring the performance and behaviour of each of the algorithms.&lt;/p&gt;
&lt;p&gt;This is important because the bandit algorithms have a complex feedback cycle: their behaviour is constantly changing in the light of data given to them during their lifetime.&lt;/p&gt;
&lt;p&gt;For example, following from the code by John Myles White in his book, we can visualise the algorithm’s performance over time. One measure is accuracy (that is, how likely is the algorithm to pick the highest paying arm at a given iteration) and we can see the performance across algorithms over time, and according to their exploration/exploitation parameters, in the plot below:&lt;/p&gt;
&lt;p&gt;        &lt;p class="posthaven-file posthaven-file-image posthaven-file-state-processed"&gt;
          &lt;img class="posthaven-gallery-image" src="https://phaven-prod.s3.amazonaws.com/files/image_part/asset/138932/NC4Fm7P4RDp1LBENfUxRF7JRy7I/medium_media_httpclojureband_cFrGk.png" data-posthaven-state='processed'
data-medium-src='https://phaven-prod.s3.amazonaws.com/files/image_part/asset/138932/NC4Fm7P4RDp1LBENfUxRF7JRy7I/medium_media_httpclojureband_cFrGk.png'
data-medium-width='750'
data-medium-height='641'
data-large-src='https://phaven-prod.s3.amazonaws.com/files/image_part/asset/138932/NC4Fm7P4RDp1LBENfUxRF7JRy7I/large_media_httpclojureband_cFrGk.png'
data-large-width='750'
data-large-height='641'
data-thumb-src='https://phaven-prod.s3.amazonaws.com/files/image_part/asset/138932/NC4Fm7P4RDp1LBENfUxRF7JRy7I/thumb_media_httpclojureband_cFrGk.png'
data-thumb-width='200'
data-thumb-height='200'
data-orig-src='https://phaven-prod.s3.amazonaws.com/files/image_part/asset/138932/NC4Fm7P4RDp1LBENfUxRF7JRy7I/media_httpclojureband_cFrGk.png'
data-orig-width='750'
data-orig-height='641'
data-posthaven-id='138932' /&gt;
        &lt;/p&gt;
&lt;/p&gt;
&lt;p&gt;The simulation works by using a series of simulated bandit arms. These will reward randomly according to a specified probability:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4317088.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We can model the problem neatly by creating a function representing the arm, we can then pull the arm by applying the function.&lt;/p&gt;
&lt;p&gt;As I mentioned earlier, when the code included protocols for algorithms and storage, the simulation code ended up being pretty messy. After I’d dropped those everything felt a little cleaner and more Clojure-y. This felt more apparent when it came to rewriting the simulation harness.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/pingles/clj-bandit/blob/master/src/clj_bandit/simulate.clj"&gt;clj-bandit.simulate&lt;/a&gt; has all the code, but the key part was the introduction of 2 functions that performed the simulation:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4317131.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;simulation-seq&lt;/code&gt; creates a sequence through &lt;code&gt;iterate&lt;/code&gt;’ing the &lt;code&gt;simulate&lt;/code&gt; function. &lt;code&gt;simulate&lt;/code&gt; is passed a map that contains the current state of the algorithm (the performance of the arms), it returns the updated state (based on the pull during that iteration) and tracks the cumulative reward. Given we’re most interested in the performance of the algorithm we can then just &lt;code&gt;(map :result ...)&lt;/code&gt; across the sequence. Much nicer than nested &lt;code&gt;doseq&lt;/code&gt;’s!&lt;/p&gt;
&lt;h2&gt;Further Work&lt;/h2&gt;
&lt;p&gt;At &lt;a href="http://www.uswitch.com"&gt;uSwitch&lt;/a&gt; we’re interested in experimenting with multi-armed bandit algorithms. We can use simulation to estimate performance using already observed data. But, we’d also need to do a little work to consume these algorithms into web applications.&lt;/p&gt;
&lt;p&gt;There are existing Clojure libraries for embedding optimisations into your Ring application:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/ptaoussanis/touchstone"&gt;Touchstone&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jeandenis/bestcase"&gt;Bestcase&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;These both provide implementations of the algorithms and for storing their state in stores like Redis.&lt;/p&gt;
&lt;p&gt;I chose to work through the book because I was interested in learning more about the algorithms but I also like the idea of keeping the algorithms and application concerns separate.&lt;/p&gt;
&lt;p&gt;Because of that I’m keen to work on a separate library that makes consuming the algorithms from &lt;a href="https://github.com/pingles/clj-bandit"&gt;clj-bandit&lt;/a&gt; into a Ring application easier. I’m hoping that over the coming holiday season I’ll get a chance to spend a few more hours working on it.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/xkX8ZTMU8sc" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/multi-armed-bandit-optimisation-in-clojure</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85066</id>
    <published>2012-11-22T13:51:00Z</published>
    <updated>2013-05-21T03:39:04Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/cJdFgEitXTE/analysing-and-predicting-performance-of-our-n" />
    <title>Analysing and predicting performance of our Neo4j Cascading Connector with linear regression in R</title>
    <content type="html">
      &lt;p&gt;As I mentioned in an &lt;a href="http://oobaloo.co.uk/sinking-data-to-neo4j-from-hadoop-with-cascad"&gt;earlier article&lt;/a&gt;, Paul and I have &lt;a href="http://github.com/pingles/cascading.neo4j"&gt;produced a library to help connect Cascading and Neo4j&lt;/a&gt; making it easy to sink data from Hadoop with  Cascading flows into a Neo4j instance. Whilst we were waiting for our jobs to run we had a little fun with some regression analysis to optimise the performance. This post covers how we did it with R.&lt;/p&gt;
&lt;p&gt;I’m posting because it wasn’t something I’d done before and it turned out to be pretty good fun. We played with it for a day and haven’t done much else with it since so I’m also publishing in case it’s useful for others.&lt;/p&gt;
&lt;p&gt;We improved the write performance of our library by adding support for batching- collecting mutations into sets of transactions that are batched through Neo4j’s REST API. This improved performance (rather than using a request/response for every mutation) but also meant we needed to specify a chunk size; writing all mutations in a single transaction would be impossible.&lt;/p&gt;
&lt;p&gt;There are 2 indepent variables that we could affect to tweak performance: the batch size and the number of simultaneous connections that are making those batch calls. N.B this assumes any other hidden factors remain constant.&lt;/p&gt;
&lt;p&gt;For us, running this on a Hadoop cluster, these 2 variables determine the batch size in combination with the number of Hadoop’s reduce or map tasks concurrently executing.&lt;/p&gt;
&lt;p&gt;We took some measurements during a few runs of the connector across our production data to help understand whether we were making the library faster. We then produced a regression model from the data and use the optimize function to help identify the sweet spot for our job’s performance.&lt;/p&gt;
&lt;p&gt;We had 7 runs on our production Hadoop cluster. We let the reduce tasks (where the Neo4j write operations were occurring) run across genuine data for 5 minutes and measured how many nodes were successfully added to our Neo4j server. Although the cluster was under capacity (so the time wouldn’t include any idling/waiting) our Neo4j server instance runs on some internal virtualised infrastructure and so could have exhibited variance beyond our changes.&lt;/p&gt;
&lt;p&gt;The results for our 7 observerations are in the table below:&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Test No. &lt;/th&gt;
&lt;th&gt;Number of Reducers&lt;/th&gt; &lt;th&gt;Batch Size&lt;/th&gt; &lt;th&gt;Nodes per minute&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5304.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;13218.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;13265.636&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;11289.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;17682.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;20984.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;20201.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;h2&gt;Regression in R&lt;/h2&gt;
&lt;p&gt;A regression lets us attempt to predict the value of a continuous variable based upon the value of one or more other independent variables. It also lets us quantify the strength of the relationship between the dependent variable and independent variables.&lt;/p&gt;
&lt;p&gt;Given our experiment, we could determine whether batch size and the number of reducers (the independent variables) affected the number of Neo4j nodes we could create per minute (the dependent variable). If there was, we would use values for those 2 variables to predict performance.&lt;/p&gt;
&lt;p&gt;The first stage is to load the experiment data into R and get it into a data frame. Once we’ve loaded it we can use R’s lm function to fit a linear model and look at our data.&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4131196.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;In the above, the formula parameter to lm lets us describe that &lt;code&gt;nodes.per.minute&lt;/code&gt; is our dependent variable (our outcome), and &lt;code&gt;reducers&lt;/code&gt; and &lt;code&gt;batch.size&lt;/code&gt; are our independent variables (our predictors).&lt;/p&gt;
&lt;p&gt;Much like other analysis in R, the first thing we can look at is a summary of this model, which produces the following:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;lm(formula = nodes.per.minute ~ reducers + batch.size, data = results)

Residuals:
    1     2     3     4     5     6     7 
-2330  2591 -1756 -1135  3062 -1621  1188 

Coefficients:
            Estimate Std. Error t value Pr(&amp;gt;|t|)  
(Intercept)   2242.8     3296.7   0.680   0.5336  
reducers       998.1      235.6   4.236   0.0133 *
batch.size     439.3      199.3   2.204   0.0922 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2735 on 4 degrees of freedom
Multiple R-squared: 0.8362, Adjusted R-squared: 0.7543 
F-statistic: 10.21 on 2 and 4 DF,  p-value: 0.02683&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The summary data tells us that the model supports the data relatively well. Our R-squared is 0.075 and both batch size and reducer size are considered significant.&lt;/p&gt;
&lt;p&gt;But, what if we tweak our model? We suspect that the shape of the performance through increasing reducers and batch size is unlikely to exhibit linear growth. We can change the formula of our model and see whether we can improve the accuracy of our model:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4131211.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;And let’s the the results of calling &lt;code&gt;summary(model)&lt;/code&gt;:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;Call:
lm(formula = nodes.per.minute ~ reducers + I(reducers^2) + batch.size + 
    I(batch.size^2), data = results)

Residuals:
         1          2          3          4          5          6          7 
-2.433e+02  9.318e+02 -3.995e+02  9.663e-13 -7.417e+02  5.323e+01  3.995e+02 

Coefficients:
                 Estimate Std. Error t value Pr(&amp;gt;|t|)  
(Intercept)     -15672.16    3821.48  -4.101   0.0546 .
reducers          2755.10     337.07   8.174   0.0146 *
I(reducers^2)     -101.74      18.95  -5.370   0.0330 *
batch.size        2716.07     540.07   5.029   0.0373 *
I(batch.size^2)    -85.94      19.91  -4.316   0.0497 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 948.6 on 2 degrees of freedom
Multiple R-squared: 0.9901, Adjusted R-squared: 0.9704 
F-statistic: 50.25 on 4 and 2 DF,  p-value: 0.01961&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Our R-squared is now 0.9704- our second model fits the data better than our first model.&lt;/p&gt;
&lt;h2&gt;Optimization&lt;/h2&gt;
&lt;p&gt;Given the above, we’d like to understand the values for batch size and number of reducers that will give us the highest throughput.&lt;/p&gt;
&lt;p&gt;R has an &lt;code&gt;optimize&lt;/code&gt; function that, given a range of values for a function parameter, returns the optimal argument for the return value.&lt;/p&gt;
&lt;p&gt;We can create a function that calls &lt;code&gt;predict.lm&lt;/code&gt; with our model to predict values. We can then use the &lt;code&gt;optimize&lt;/code&gt; function to find our optimal solution:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4131246.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;We use a set batch size of 20 and optimize to discover that the optimal number of reducers is 13 with a throughput of 22,924 nodes/minute. The second command optimizes for batch size with a fixed number of reducers. Again, it suggests a batch size of 15 for an overall throughput of 24,409 nodes/minute.&lt;/p&gt;
&lt;p&gt;This supports what we observed earlier with the summary data: number of reducers is more significant than batch size for predicting throughput.&lt;/p&gt;
&lt;p&gt;I’m extremely new to most of R (and statistics too if I’m honest- the last year is the most I’ve done since university) so if anyone could tell me if there’s a way to perform an optimization for &lt;em&gt;both&lt;/em&gt; variables that would be awesome.&lt;/p&gt;
&lt;p&gt;Please note this post was more about our experimentation and the process- I suspect our data might be prone to systematic error and problems because we only have a few observations. I’d love to run more experiments and get more measurements but we moved on to a new problem :)&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/cJdFgEitXTE" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/analysing-and-predicting-performance-of-our-n</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85075</id>
    <published>2012-11-22T10:53:00Z</published>
    <updated>2013-05-21T03:39:05Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/GcrC9NbvYf4/sinking-data-to-neo4j-from-hadoop-with-cascad" />
    <title>Sinking Data to Neo4j from Hadoop with Cascading</title>
    <content type="html">
      &lt;p&gt;Recently, I worked with a colleague (Paul Lam, aka &lt;a href="http://twitter.com/Quantisan"&gt;@Quantisan&lt;/a&gt; on building a connector library to let Cascading interoperate with Neo4j: &lt;a href="https://github.com/pingles/cascading.neo4j"&gt;cascading.neo4j&lt;/a&gt;. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs and we wanted an easy way to flow our existing data on Hadoop into Neo4j.&lt;/p&gt;
&lt;p&gt;The data processing pipeline we’ve been growing at uSwitch.com is built around Cascalog, Hive, Hadoop and Kafka.&lt;/p&gt;
&lt;p&gt;Once the data has been aggregated and stored a lot of our ETL is performed upon Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive. This layer is built upon our long-term data storage system: Hadoop; this, all combined, lets us store high-resolution data immutably at a much lower cost than uSwitch’s previous platform.&lt;/p&gt;
&lt;p&gt;Cascading is:&lt;/p&gt;
&lt;blockquote class="posterous_short_quote"&gt;
&lt;p&gt;application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascading provides a model on top of Hadoop’s very file-oriented, raw MapReduce API. It models the world around flows of data (Tuples), between Taps and according to Schemes.&lt;/p&gt;
&lt;p&gt;For example, in Hadoop you might configure a job that reads data from a SequenceFile at a given location on HDFS. Cascading separates the 2 into slightly different concerns- the Tap would represent file storage on HDFS, the Scheme would be configured to read data according to the SequenceFile format.&lt;/p&gt;
&lt;p&gt;Cascading provides a lot of classes to make it easier to build flows that join and aggregate data without you needing to write lots of boiler-plate MapReduce code.&lt;/p&gt;
&lt;p&gt;Here’s an example of writing a Cascading flow (taken from the README of our cascading.neo4j library). It reads data from a delimited file (representing the Neo4j Nodes we want to create), and flows every record to our Neo4j Tap.&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4130522.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Although we use Cascading, really the majority of that is through Cascalog- a Clojure library that provides a datalog like language for analysing our data.&lt;/p&gt;
&lt;p&gt;We wrote a small test flow we could use whilst testing our connector, a similar Cascalog example for the Cascading code above looks like this:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/4130542.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Our library, &lt;a href="http://github.com/pingles/cascading.neo4j"&gt;cascading.neo4j&lt;/a&gt; (hosted on &lt;a href="http://conjars.org"&gt;Conjars.org&lt;/a&gt;) provides a Tap and Scheme suitable for sinking (writing data) to a Neo4j server using their &lt;a href="http://github.com/neo4j/java-rest-binding"&gt;REST connector&lt;/a&gt;. We extend and implement Cascading classes and interfaces making it easy to then integrate into our Cascalog processing.&lt;/p&gt;
&lt;p&gt;cascading.neo4j allows you to create nodes, set properties on those node, add nodes to an exact match index, and create relationships (again with properties) between those nodes.&lt;/p&gt;
&lt;p&gt;I should point out that cascading.neo4j may not be suitable for all batch processing purposes: most significantly, its pretty slow. Our production Neo4j server (on our internal virtualised infrastructure) lets us sink around 20,000 nodes per minute through the REST api. This is certainly a lot slower than &lt;a href="http://docs.neo4j.org/chunked/stable/batchinsert.html"&gt;Neo4j’s Batch Insert API&lt;/a&gt; and may make it unusable in some situations.&lt;/p&gt;
&lt;p&gt;However, if the connector is fast enough for you it means you can sink data directly to Neo4j from your existing Cascading flows.&lt;/p&gt;
&lt;h2&gt;Performance&lt;/h2&gt;
&lt;p&gt;Whilst tweaking and tuning the connector we ran through &lt;a href="http://docs.neo4j.org/chunked/stable/linux-performance-guide.html"&gt;Neo4j’s Linux Performance Guide&lt;/a&gt; (a great piece of technical documentation) that helped us boost performance a fair bit.&lt;/p&gt;
&lt;p&gt;We also noticed the REST library allows for transactions to hold batch operations- to include multiple mutations in the same roundtrip. Our Neo4j RecordWriter will chunk batches- rather than writing all records in one go, you can specify the size.&lt;/p&gt;
&lt;p&gt;We ran some tests and, on our infrastructure, using batch sizes of around 15 and 13 reducers (that is 13 ‘connections’ to our Neo4j REST api) yield the best performance of around 20,000 nodes per minute. We collected some numbers and Paul suggested we could have some fun putting those through a regression which will be the subject of my next post :)&lt;/p&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;It’s currently still evolving a little and there’s a bit of duplicate code between the Hadoop and Local sections of the code. The biggest restrictions are it currently only supports sinking (writing) data and it’s speed may make it unsuitable for flowing very large graphs.&lt;/p&gt;
&lt;p&gt;Hopefully this will be useful for some people and Paul and I would love pull requests for our little project.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/GcrC9NbvYf4" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/sinking-data-to-neo4j-from-hadoop-with-cascad</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85081</id>
    <published>2012-08-18T13:44:00Z</published>
    <updated>2013-05-21T03:39:05Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/EZykawKXF7s/compressing-cloudfront-assets-and-dfl8co" />
    <title>Compressing CloudFront Assets and dfl8.co</title>
    <content type="html">
      &lt;p&gt;Amazon’s web services have made rebuilding &lt;a href="http://www.uswitch.com"&gt;uSwitch.com&lt;/a&gt; so much easier. We’re gradually moving more and more static assets to CloudFront (although most visitors are in the UK responses have much lower latencies than direct from S3 or even our own nginx servers). CloudFront doesn't support serving gzip'ed content direct from S3 out of the box.&lt;/p&gt;
&lt;p&gt;Because of this, up until last week we were serving uncompressed assets, at least anything that wasn’t already compressed (such as images). Last week we put together a simple static assets nginx server to help compress things.&lt;/p&gt;
&lt;p&gt;Whilst doing the work for uSwitch.com I realised it would be trivial to write an application that would let any CloudFront user compress to any S3 bucket by using an equivalent URL structure. So I knocked up a quick &lt;a href="http://nodejs.org"&gt;node.js&lt;/a&gt; app that’s hosted on Heroku for all to use: &lt;a href="http://dfl8.co"&gt;dfl8.co&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;dfl8.co&lt;/h3&gt;
&lt;p&gt;S3 assets can be referenced through a pretty simple URL structure. By creating an app that behaves in the same way, and proxies (whilst compressing) the response, it would be easy to create a compressible S3 for everyone.&lt;/p&gt;
&lt;p&gt;For example, the URL &lt;code&gt;http://pingles-example.s3.amazonaws.com/sample.css&lt;/code&gt; references the S3 bucket &lt;code&gt;pingles-example&lt;/code&gt; and the object we want to retrieve is identified by the name &lt;code&gt;/sample.css&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The same resource can be accessed through &lt;code&gt;http://pingles-example.dfl8.co/sample.css&lt;/code&gt; and will be gzip compressed. CloudFront now lets you specify &lt;a href="http://aws.typepad.com/aws/2010/11/amazon-cloudfront-support-for-custom-origins.html"&gt;custom origins&lt;/a&gt; so for the above you’d add &lt;code&gt;http://pingles-example.dfl8.co&lt;/code&gt; to setup a CloudFront distribution for the &lt;code&gt;pingles-example&lt;/code&gt; S3 bucket.&lt;/p&gt;
&lt;p&gt;At the moment it will only proxy public resources. Response latency also seems quite high at the moment but given the aim is to get content into the highly-cached and optimised CloudFront I’m not too fussed by it.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/EZykawKXF7s" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/compressing-cloudfront-assets-and-dfl8co</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85088</id>
    <published>2012-07-15T21:06:00Z</published>
    <updated>2013-05-21T03:39:05Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/r4EZKxNEAsw/evaluating-classifier-results-with-r-part-2" />
    <title>Evaluating classifier results with R part 2</title>
    <content type="html">
      &lt;p&gt;In a previous article I showed &lt;a href="http://oobaloo.co.uk/visualising-classifier-results-with-ggplot2"&gt;how to visualise the results of a classifier using ggplot2 in R&lt;/a&gt;. In the same article I mentioned that &lt;a href="http://alexfarquhar.posterous.com/"&gt;Alex&lt;/a&gt;, a colleague at Forward, had suggested looking further at R’s &lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret package&lt;/a&gt; that would produce more detailed statistics about the overall performance of the classifer and within individual classes.&lt;/p&gt;
&lt;h3&gt;Confusion Matrix&lt;/h3&gt;
&lt;p&gt;Using ggplot2 we can produce a plot like the one below: a visual representation of a &lt;a href="http://en.wikipedia.org/wiki/Confusion_matrix"&gt;confusion matrix&lt;/a&gt;. It gives us a nice overview but doesn’t reveal much about the specific performance characteristics of our classifier.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://cl.ly/1X1U1D0b133T0t1d3I0G"&gt;&lt;img src="http://cl.ly/1X1U1D0b133T0t1d3I0G/confusion_matrix_small.png" height="320" alt="" width="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To produce our measures, we run our classifier across a set of test data and capture both the actual class and the predicted class. Our results are stored in a CSV file and will look a little like this:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;actual, predicted
A, B
B, B,
C, C
B, A&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;Analysing with Caret&lt;/h3&gt;
&lt;p&gt;With our results data as above we can run the following to produce a confusion matrix with caret:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/3116401.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;results.matrix&lt;/code&gt; now contains a confusionMatrix full of information. Let’s take a look at some of what it shows. The first table shows the contents of our matrix:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;Reference
Prediction              A     B     C     D
A                     211   3     1     0
B                     9     26756 6     17
C                     1     12    1166  1
D                     0     18    3     1318&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Each column holds the &lt;em&gt;reference&lt;/em&gt; (or actual) data and within each row is the prediction. The diagonal represents instances where our observation correctly predicted the class of the item.&lt;/p&gt;
&lt;p&gt;The next section contains summary statistics for the results:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;Overall Statistics

                     Accuracy : 0.9107          
                       95% CI : (0.9083, 0.9131)
          No Information Rate : 0.5306          
          P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Overall accuracy is calculated at just over 90% with a &lt;a href="http://en.wikipedia.org/wiki/P-value"&gt;p-value&lt;/a&gt; of &lt;code&gt;2 x 10^-16&lt;/code&gt;, or &lt;code&gt;0.00000000000000022&lt;/code&gt;. Our classifier seems to be doing a pretty reasonable job of classifying items.&lt;/p&gt;
&lt;p&gt;Our classifier is being tested by putting items into 1 of 13 categories- caret also produces a final section of statistics for the performance of each class.&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;Class: A        Class: B  ...   Class: J
Sensitivity             0.761733        0.9478          0.456693
Specificity             0.998961        0.9748          0.999962
Pos Pred Value          0.793233        0.9770          0.966667 
Neg Pred Value          0.998753        0.9429          0.998702
Prevalence              0.005206        0.5306          0.002387
Detection Rate          0.003966        0.5029          0.001090
Detection Prevalence    0.005000        0.5147          0.001128&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The above shows some really interesting data.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Sensitivity_and_specificity"&gt;Sensitivity and specificity&lt;/a&gt; respectively help us measure the performance of the classifier in correctly predicting the actual class of an item and &lt;em&gt;not&lt;/em&gt; predicting the class of an item that is of a different class; it measures true positive and true negative performance.&lt;/p&gt;
&lt;p&gt;From the above data we can see that our classifier correctly identified class B 94.78% of the time. That is, when we should have predicted class B we did. Further, when we shouldn’t have predicted class B we didn’t for 97.48% of examples. We can contrast this to class J: our specificity (true negative) is over 99% but our sensitivity (true positive) is around 45%; we do a poor job of positively identifying items of this class.&lt;/p&gt;
&lt;p&gt;Caret has also calculated a prevalence measure- that is, of all observations, how many were of items that actually belonged to the specified class; it calculates the &lt;em&gt;prevalence&lt;/em&gt; of a class within a population.&lt;/p&gt;
&lt;p&gt;Using the previously defined sensitivity and specificity, and prevalance measures caret can calculate &lt;a href="http://en.wikipedia.org/wiki/Positive_predictive_value"&gt;Positive predictive value&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Negative_predictive_value"&gt;Negative predictive value&lt;/a&gt;. These are important as they reflect the probability that a true positive/true negative is correct given knowledge about the prevalence of classes within the population. Class J has a &lt;em&gt;positive predictive value&lt;/em&gt; of over 96%: despite our classifier only being able to positively identify objects 45% of the time there’s a 96% chance that, when it does, such a classification is correct.&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://cran.r-project.org/web/packages/caret/caret.pdf"&gt;caret documentation&lt;/a&gt; has some references to relevant papers discussing the measures it calculates.&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/r4EZKxNEAsw" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/evaluating-classifier-results-with-r-part-2</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85092</id>
    <published>2012-07-14T15:13:00Z</published>
    <updated>2013-05-21T03:39:05Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/KENU-KdyehA/visualising-classifier-results-with-ggplot2" />
    <title>Visualising classifier results with R and ggplot2</title>
    <content type="html">
      &lt;p&gt;Earlier in the year, myself and some colleagues started working on building better data processing tools for &lt;a href="http://www.uswitch.com"&gt;uSwitch.com&lt;/a&gt;. Part of the theory/reflection of this is captured in a presentation I was privileged to give at EuroClojure (titled &lt;a href="https://vimeo.com/45136211"&gt;Users as Data&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In the last few days, our data team (&lt;a href="https://github.com/thibautsacreste"&gt;Thibaut&lt;/a&gt;, &lt;a href="http://www.quantisan.com"&gt;Paul&lt;/a&gt; and I) have been playing around with some of the data we collect and using it to build some classifiers. &lt;a href="http://en.wikipedia.org/wiki/Precision_and_recall"&gt;Precision and Recall&lt;/a&gt; provide quantitative measures but reading through &lt;a href="http://shop.oreilly.com/product/0636920018483.do"&gt;Machine Learning for Hackers&lt;/a&gt; showed some nice ways to visualise results.&lt;/p&gt;
&lt;h3&gt;Binary Classifier&lt;/h3&gt;
&lt;p&gt;Our first classifier attempted to classify data into 2 groups. Using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; and &lt;a href="http://had.co.nz/ggplot2/"&gt;ggplot2&lt;/a&gt; I produced a plot (similar to the one presented in the Machine Learning for Hackers book) to show the results of the classifier.&lt;/p&gt;
&lt;p&gt;Our results were captured in a CSV file and looked a little like this:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;A,0.25,0.15
A,0.2,0.19
B,0.19,0.25&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Each line contains the item's actual class, the predicted probability for membership of class A, and the predicted probability for membership of class B. Using ggplot2 we produce the following:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://cl.ly/150l2h02120x0S0u1Y1d"&gt;&lt;img src="http://cl.ly/150l2h02120x0S0u1Y1d/classify-results-sample-small.png" height="285" alt="binary classification plot" width="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Items have been classified into 2 groups- A and B. The axis show the log probability (we’re using Naive Bayes to classify items) that the item belongs to the specified class. We use colour to identify the actual class for items and draw a line to represent the decision boundary (i.e. which of the 2 classes did our model predict).&lt;/p&gt;
&lt;p&gt;This lets us nicely see the relationship between predicted and actual classes.&lt;/p&gt;
&lt;p&gt;We can see there’s a bit of an overlap down the decision boundary line and we’re able to do a better job for classifying items in category B than A.&lt;/p&gt;
&lt;p&gt;The R code to produce the plot above is as follows. Note that because we had many millions of observations I randomly sampled to make it possible to compute on my laptop :)&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/3111413.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;h3&gt;More Classes!&lt;/h3&gt;
&lt;p&gt;But what if we want to see compare the results when we’re classifying items into more than 1 group?&lt;/p&gt;
&lt;p&gt;After chatting to &lt;a href="http://alexfarquhar.posterous.com/"&gt;Alex Farquhar&lt;/a&gt; (another data guy at Forward) he suggested plotting a &lt;a href="http://en.wikipedia.org/wiki/Confusion_matrix"&gt;confusion matrix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Below shows the plot we produced that compares the actual and predicted classes for 14 items.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://cl.ly/1X1U1D0b133T0t1d3I0G"&gt;&lt;img src="http://cl.ly/1X1U1D0b133T0t1d3I0G/confusion_matrix_small.png" height="320" alt="" width="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The y-axis shows the predicted class for all items, and the x-axis shows the actual class. The tiles are coloured according to the frequency of the intersection of the two classes thus the diagonal represents where we predict the actual class. The colour represents the relative frequency of that observation in our data; given some classes occur more frequently we normalize the values before plotting.&lt;/p&gt;
&lt;p&gt;Any row of tiles (save for the diagonal) represents instances where we falsely identified items as belonging to the specified class. In the rendered plot we can see that items in Class G were often identified for items belonging to all other classes.&lt;/p&gt;
&lt;p&gt;Our input data looked a little like this:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;1,0,0
0,3,0
1,0,2&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;It’s a direct encoding of our matrix- each column represents data for classes A to N, and each row represents data for classes A to N. The diagonal holds data for &lt;code&gt;A,A&lt;/code&gt;, &lt;code&gt;B,B&lt;/code&gt;, etc.&lt;/p&gt;
&lt;p&gt;The R code to plot the confusion matrix is as follows:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/3111748.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Alex also suggested using the &lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt; package which includes a function to build the confusion matrix from observations directly and also provides some useful summary statistics. I’m going to hack on our classifier’s Clojure code a little more and will be sure to post again with the findings!&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/KENU-KdyehA" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/visualising-classifier-results-with-ggplot2</feedburner:origLink></entry>
  <entry>
    <id>tag:oobaloo.co.uk,2013:Post/85095</id>
    <published>2012-01-20T10:08:00Z</published>
    <updated>2013-05-21T03:39:05Z</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/oobaloo/~3/w5siGIQn_aM/protocol-buffers-with-clojure-and-leiningen" />
    <title>Protocol Buffers with Clojure and Leiningen</title>
    <content type="html">
      &lt;p&gt;This week I’ve been prototyping some data processing tools that will  work across the platforms we use (Ruby, Clojure, .NET). Having not tried &lt;a href="http://code.google.com/apis/protocolbuffers/"&gt;Protocol Buffers&lt;/a&gt; before I thought I’d spike it out and see how it might fit.&lt;/p&gt;
&lt;h2&gt;Protocol Buffers&lt;/h2&gt;
&lt;p&gt;The Google page obviously has a lot more detail but for anyone who’s not seen them: you define your messages in an intermediate language before compiling into your target language.&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/1646419.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;There’s a &lt;a href="http://code.google.com/p/ruby-protobuf/"&gt;Ruby library&lt;/a&gt; that makes it trivially easy to generate Ruby code so you can create messages as follows:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/1646428.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;h2&gt;Clojure and Leiningen&lt;/h2&gt;
&lt;p&gt;The next step was to see how these messages would interact with Clojure and Java. Fortunately, there’s already a few options and I tried out &lt;a href="https://github.com/flatland/clojure-protobuf"&gt;clojure-protobuf&lt;/a&gt; which conveniently includes a &lt;a href="http://github.com/technomancy/leiningen"&gt;Leiningen&lt;/a&gt; task for running both the Protocol Buffer compiler &lt;code&gt;protoc&lt;/code&gt; and &lt;code&gt;javac&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I added the dependency to my project.clj:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;[protobuf "0.6.0-beta2"]&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;At the time, the &lt;code&gt;protobuf&lt;/code&gt; library expected your &lt;code&gt;.proto&lt;/code&gt; files to be placed in a &lt;code&gt;./proto&lt;/code&gt; directory under your project root. &lt;a href="https://github.com/pingles/clojure-protobuf"&gt;I forked&lt;/a&gt; to add a &lt;code&gt;:proto-path&lt;/code&gt; so that I could pull in the files from a git submodule.&lt;/p&gt;
&lt;p&gt;Assuming you have a proto file or two in your proto source directory, you should be able to invoke the compiler by running&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;$ lein protobuf compile
Compiling person.proto to /Users/paul/Work/forward/data-spike/protosrc
Compiling 1 source files to /Users/paul/Work/forward/data-spike/classes&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;You should now see some Java &lt;code&gt;.class&lt;/code&gt; files in your &lt;code&gt;./classes&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;Using &lt;a href="https://github.com/flatland/clojure-protobuf"&gt;clojure-protobuf&lt;/a&gt; to load an object from a byte array looks as follows:&lt;/p&gt;
&lt;p&gt;&lt;script src="https://gist.github.com/1646482.js"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;h2&gt;Uberjar Time&lt;/h2&gt;
&lt;p&gt;I ran into a little trouble when I came to build the command-line tool and deploy it. When building with &lt;code&gt;lein uberjar&lt;/code&gt; it seemed that the &lt;code&gt;./classes&lt;/code&gt; directory was being cleaned causing the protobuf compiled Java classes to be unavailable to the application (causing the rest of the application to fail to build- I was using &lt;a href="https://github.com/clojure/tools.cli"&gt;tools.cli&lt;/a&gt; with a &lt;code&gt;main&lt;/code&gt; fn which meant using &lt;a href="http://clojure.org/compilation"&gt;:gen-class&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I always turn to &lt;a href="https://github.com/technomancy/leiningen/blob/master/sample.project.clj"&gt;Leiningen’s sample project.clj&lt;/a&gt; and saw &lt;a href="https://github.com/technomancy/leiningen/blob/master/sample.project.clj#L45"&gt;:clean-non-project-classes&lt;/a&gt;. The comment mentioned it was set to &lt;code&gt;false&lt;/code&gt; by default so that wasn’t it.&lt;/p&gt;
&lt;p&gt;It turns out that Leiningen’s &lt;code&gt;uberjar&lt;/code&gt; task checks a different option when determining whether to clean the project before executing: &lt;a href="https://github.com/technomancy/leiningen/blob/master/src/leiningen/uberjar.clj#L79"&gt;:disable-implicit-clean&lt;/a&gt;. I added &lt;code&gt;:disable-implicit-clean true&lt;/code&gt; to our &lt;code&gt;project.clj&lt;/code&gt; and all was good:&lt;/p&gt;
&lt;div class="CodeRay"&gt;
  &lt;div class="code"&gt;&lt;pre&gt;$ lein protobuf compile, uberjar&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I wasn’t a registered user of the &lt;a href="http://groups.google.com/group/leiningen"&gt;Leiningen mailing list&lt;/a&gt; (and am waiting for my question to be moderated) but it feels like &lt;code&gt;uberjar&lt;/code&gt; should honour &lt;code&gt;:clean-non-project-class&lt;/code&gt; too. I’d love to &lt;a href="http://lein-survey.herokuapp.com/"&gt;submit a patch to earn myself a sticker&lt;/a&gt; :)&lt;/p&gt;
    &lt;img src="http://feeds.feedburner.com/~r/oobaloo/~4/w5siGIQn_aM" height="1" width="1"/&gt;</content>
    <author>
      <name>Paul Ingles</name>
    </author>
  <feedburner:origLink>http://oobaloo.co.uk/protocol-buffers-with-clojure-and-leiningen</feedburner:origLink></entry>
</feed>
