Reid Draper's blog

Writing simple-check

2013-11-03 00:00:00

Nov 3, 2013

For the past several months I’ve been working on a QuickCheck (QC) library for Clojure: simple-check. In this post, we’ll look at three issues I ran into porting QC from Haskell to Clojure: typing, shrinking, and laziness. This will not act as an introduction to QC, or property-based testing. Further, this post assumes some familiarity with Haskell and Clojure.

Typing

One of the major differences between writing a QC in a statically-typed language and a dynamically-typed language is that with static types, we get to use that information to tell QC which generators to use to test our function. For example, if our function has the type [Int] -> Bool, Haskell QC will use this information to generate [Int]s. Furthermore, this takes advantage of the fact that we can be polymorphic on return type in Haskell. The Arbitrary type class in Haskell has a function, arbitrary, whose signature is Gen a. This allows the compiler to fill in the specialized version of Gen a for us, depending on context. In Clojure, we can only use type-based dispatch on an argument, not the return value. So, in dynamically-typed languages, we resort to explicitly specifying the generators to use for our test. Let’s see a concrete example:

In Haskell:

import Data.List (sort)
import Test.QuickCheck (quickCheck)

sortIdempotent :: [Int] -> Bool
sortIdempotent xs = sort xs == sort (sort xs)

-- at the GHCi prompt:
quickCheck sortIdempotent
-- +++ OK, passed 100 tests.

In Clojure:

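(require '[simple-check.core :as sc])
(require '[simple-check.generators :as gen])
(require '[simple-check.properties :as prop])

(defn sort-idempotent?
  [coll]
  (= (sort coll) (sort (sort coll))))

;; run 100 trials with randomly generated vectors of integers
(sc/quick-check 100
  (prop/for-all [coll (gen/vector gen/int)]
    (sort-idempotent? coll)))
;; {:result true, :num-tests 100, :seed 1383433754854}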

In Erlang (also dynamically typed), using Erlang QuickCheck (EQC):

sort_idempotent(Xs) ->
  lists:sort(Xs) =:= lists:sort(lists:sort(Xs)).

prop_sort_idempotent() ->
    ?FORALL(Xs, list(int()),
            sort_idempotent(Xs)).

eqc:quickcheck(prop_sort_idempotent()).
%% OK, passed 100 tests

As you can see, with simple-check and Erlang QuickCheck, we have to explicitly provide the generator to use to test our function.

Shrinking

Some QC implementations have a feature called shrinking. This allows failing tests to be shrunk to ‘smaller’ failing cases that are easier for the programmer to debug, where ‘smaller’ is data-type specific. For example, if your function fails with a randomly-generated 100-element list, QC will try to remove elements, as long as the test continues to fail. In Haskell QuickCheck, random element generation and shrinking are treated separately. That is, if you want your type to shrink, you have to implement that separately from generating random values of your type. Let’s see the type class where these two functions live, Arbitrary:

class Arbitrary a where
  arbitrary :: Gen a

  -- the returned list is the first-level of the shrink tree
  shrink :: a -> [a]
  -- default implementation
  shrink _ = []

Most (all?) of the standard Prelude types have an Arbitrary instance already written, but you’ll need to write one for your own types. Generally you’ll write your implementation of arbitrary using the provided generator-combinators, like choose, elements and oneof. If you want your type to shrink, you’ll have to implement shrink yourself, because value generation and shrinking are treated separately.

simple-check and Erlang QuickCheck take a different approach. When you write a generator using generator-combinators, you get shrinking ‘for free’, because generating values and shrinking them are tied together in these implementations. This is handy because it saves us from writing boilerplate code to implement shrinking. Further, because it’s not nearly as common to create our own types in Clojure, let alone possible in Erlang, we don’t want to have to create a new type solely to implement some shrink protocol.

Tying the two together also means that even implicit constraints in our generator are respected during shrinking. For example, suppose we write a new generator which multiplies randomly generated integers by two. This will always generate an even number, and that remains true during shrinking. This works because in simple-check, instead of the arbitrary function generating random values, we generate random values along with the shrink tree for each value. Erlang QuickCheck is proprietary, but I imagine it works similarly. Let’s imagine how this might look using Haskell’s types:

-- a RoseTree is just an n-ary tree
data RoseTree a = RoseTree a [RoseTree a]

class Arbitrary a where
  -- instead of generating an `a`, we generate a shrink tree of `a`
  arbitrary :: Gen (RoseTree a)

The top of the tree is a randomly generated value, and its children are the first level of shrinking. Generator-combinators can then manipulate this shrink tree. Because we now act on these shrink trees, we simply create larger trees as we create more complex generators. To give a concrete example, the expression (gen/fmap (partial * 2) gen/int) will create a new generator based on gen/int, which multiplies the randomly generated elements by two. But since this function is also applied to the children in the shrink tree, every element in the shrink tree will be multiplied by two. We can also now write generator-combinators like elements, which creates a generator by choosing a random element from a list. This generator will shrink toward choosing earlier elements in the list. Were we to use elements in our arbitrary function in Haskell QC, we’d have to write the shrinking logic ourselves. It’s important to note, however, that this is specific to Haskell QC, and not the language itself. We could’ve implemented Haskell’s QC as described here.
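
To make the shrink-tree idea concrete, here is a minimal sketch in Python. It is not simple-check’s implementation: the names RoseTree, int_shrink_tree and fmap are hypothetical, and the shrinking strategy for integers (try 0, then half the value) is an assumption chosen to keep the example small. It shows that mapping a function over the whole tree preserves the ‘always even’ constraint at every level of shrinking.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator, List


@dataclass
class RoseTree:
    """An n-ary tree: a value plus the shrink trees of its smaller candidates."""
    value: Any
    children: List["RoseTree"] = field(default_factory=list)


def int_shrink_tree(n: int) -> RoseTree:
    """Build a shrink tree for n (assumed strategy: try 0, then n halved)."""
    candidates: List[int] = []
    for c in (0, int(n / 2)):  # int() rounds toward zero, so |c| < |n|
        if c != n and c not in candidates:
            candidates.append(c)
    return RoseTree(n, [int_shrink_tree(c) for c in candidates])


def fmap(f: Callable[[Any], Any], tree: RoseTree) -> RoseTree:
    """Apply f to the root and to every node below it."""
    return RoseTree(f(tree.value), [fmap(f, child) for child in tree.children])


def all_values(tree: RoseTree) -> Iterator[Any]:
    """Yield every value in the tree, root first."""
    yield tree.value
    for child in tree.children:
        yield from all_values(child)


# A "generated" value of 7, turned into an even-number shrink tree.
even_tree = fmap(lambda x: x * 2, int_shrink_tree(7))
print(even_tree.value)                         # 14
print([c.value for c in even_tree.children])   # [0, 6]
print(all(v % 2 == 0 for v in all_values(even_tree)))  # True
```

Because the doubling is applied to the tree rather than to a single value, no separate shrink function has to know about the evenness constraint.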

Laziness

Haskell QuickCheck takes advantage of whole-program laziness. For example, when shrinking, instead of traversing a tree of arguments to the function under test, and applying the function to values as the tree is traversed, we’re able to use fmap to lazily apply the function to the entire tree. We then need only traverse a tree of booleans (representing test success or failure). This allows for a higher level of abstraction. Fortunately, Clojure lets us mimic this, as long as our types are represented as lazy sequences. To represent a large tree, we use a two-element vector, where the first element is the top value in the tree, and the second element is a lazy sequence, representing the children. Using Clojure’s lazy functions like map, filter and concat, we’re able to retain this laziness as we process the tree. However, as this tree can become large when fully evaluated, finding bugs can be difficult. In Haskell, we’re able to find type mistakes during compilation, whereas in Clojure we need to run our program, potentially sifting through a large tree to find our bugs, which may have been introduced several call-sites away from where we’re looking. In order to combat this, I specifically debugged with values I knew had small shrink trees, and could be easily printed at the REPL.
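
The same idea can be sketched in Python, with generators standing in for Clojure’s lazy sequences. This is an illustration under assumptions, not simple-check’s code: the function names and the integer shrinking strategy are hypothetical, and each node of the result tree carries a (passed, value) pair rather than a bare boolean so the failing value can be reported. A tree is a (value, children) pair whose children are only computed when iterated.

```python
from typing import Iterator


def shrink_candidates(n: int) -> Iterator[int]:
    """Yield smaller candidates for n (assumed strategy, not simple-check's)."""
    if n == 0:
        return
    toward_zero = n - 1 if n > 0 else n + 1
    seen = set()
    for candidate in (0, int(n / 2), toward_zero):
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


def lazy_int_tree(n: int):
    """A (value, children) pair; children is a generator, so nothing below
    the root is built until someone iterates over it."""
    return (n, (lazy_int_tree(c) for c in shrink_candidates(n)))


def lazy_fmap(f, tree):
    """Apply f to the root now and to each child only when it is visited."""
    value, children = tree
    return (f(value), (lazy_fmap(f, child) for child in children))


def smallest_failure(result_tree):
    """Walk a tree of (passed, value) results, descending into the first
    failing child at each level; return the last failing value seen."""
    (passed, value), children = result_tree
    if passed:
        return None
    smallest = value
    while True:
        for (child_passed, child_value), grandchildren in children:
            if not child_passed:
                smallest, children = child_value, grandchildren
                break
        else:
            return smallest  # every child passed: cannot shrink further


checked = []  # records which inputs the property was actually run on


def prop(n):
    checked.append(n)
    return (n < 10, n)  # the property under test: "n is less than 10"


result = smallest_failure(lazy_fmap(prop, lazy_int_tree(100)))
print(result)  # 10
```

Only the nodes along the search path are ever evaluated, which the checked list makes visible; the full tree for 100 is far larger than the handful of values the property actually sees.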

If you like this post, you should follow me on twitter.

Data Traceability

2013-05-16 00:00:00

May 16, 2013

This text appears as Chapter 17 in O’Reilly’s Bad Data Handbook (ISBN-13: 978-1449321888). It is released under the CC BY-SA license.

Your software consistently provides impressive music recommendations by combining cultural and audio data. Customers are happy. However, things aren’t always perfect. Sometimes that Beyoncé track is attributed to Beyonce. The artist for the Béla Fleck solo album shows up as Béla Fleck and the Flecktones. Worse, the ボリス biography has the artist name listed as ???. Where did things go wrong? Did one of your customers provide you with data in an incorrect character encoding? Did one of the web-crawlers have a bug? Perhaps the name resolution code was incorrectly combining a solo artist with his band?

How do we solve this problem? We’d like to be able to trace data back to its origin, following each transformation. This is reified as data provenance. In this chapter, we’ll explore ways of keeping track of the source of our data, techniques for backing out bad data, and the business value of adopting such an ability.

Why?

The ability to trace a datum back to its origin is important for several reasons. It helps us to back out or reprocess bad data, and conversely, it allows us to reward and boost good data sources and processing techniques. Furthermore, local privacy laws can mandate things like auditability, data transfer restrictions and more. For example, California’s Shine the Light Law requires businesses to disclose the personal information that has been shared with third parties, should a resident request it. Europe’s Data Protection Directive provides even more stringent regulation to businesses collecting data about residents.

We’ll also later see how data traceability can provide further business value by allowing us to provide stronger measurements on the worth of a particular source, realize where to focus our development effort, and even manage blame.

Personal Experience

I previously worked on the data ingestion team at a music data company. We provided artist and song recommendations, artist biographies, news, and detailed audio analysis of digital music. We exposed those data feeds via web services and raw dumps. Behind the scenes, these feeds were composed of many sources of data, which were in turn cleaned, transformed, and put through machine learning algorithms.

One of the first issues we ran into was learning how to trace a particular result back to its constituent parts. If a given artist recommendation was poor, was it because of our machine learning algorithm? Did we simply not have enough data for that artist? Was there some obviously wrong data from one of our sources? Being able to debug our product became a business necessity.

We developed several mechanisms for being able to debug our data woes, some of which I’ll explore here.

Snapshotting

Many of the data sources were updated frequently. At the same time, the web pages we crawled for news, reviews, biography information and similarity were updated inconsistently. This meant that even if we were able to trace a particular datum back to its source, that source may have changed drastically since we last crawled or processed it. So we needed to capture not only the source of our data, but also the time and an exact copy of the source. Our database columns or keys would then have an extra field for a timestamp.

Keeping track of the time and the original data also allows you to track changes from that source. You get closer to answering the question, “why were my recommendations for The Sea and Cake great last week, but terrible today?”

This process of writing data once and never changing it is called immutability, and it plays a key role in data traceability. I’ll return to it later, when I walk through an example.

Saving the source

Our data was stored in several different types of databases, including relational and key-value stores. However, nearly every schema had a source field. This field would contain one or more values. For original sources there would be a single source listed. As data was processed and transformed into roll-ups or learned-data, we would preserve the list of sources that went into creating that new piece of data. This allowed us to trace the final data product back to its constituent parts.
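
As a rough sketch of the source field, here is one way to carry a list of sources through a transformation in Python. The Datum type, the helper names and the source labels are all hypothetical; the only idea taken from the text is that a derived record keeps the union of its inputs’ sources.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Datum:
    """A value together with every source that contributed to it."""
    value: str
    sources: FrozenSet[str]


def original(value: str, source: str) -> Datum:
    """An original record lists exactly one source."""
    return Datum(value, frozenset([source]))


def derive(value: str, *inputs: Datum) -> Datum:
    """A roll-up or learned datum preserves the sources of all its inputs."""
    sources = frozenset().union(*(d.sources for d in inputs))
    return Datum(value, sources)


# Hypothetical source labels, for illustration only.
bio = original("Chicago band formed in 1994", "crawler:example-blog")
tags = original("indie, post-rock", "partner:feed-a")
summary = derive("indie band from Chicago", bio, tags)

print(sorted(summary.sources))
# ['crawler:example-blog', 'partner:feed-a']
```

With the sources attached to the final record, tracing a bad summary back to its inputs is a lookup rather than an investigation.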

Weighting sources

One of the most important reasons we collected data was to learn about new artists, albums and songs. That said, we didn’t always want to create a new entity that would end up in our final data product. Certain data sources were more likely to have errors, misspellings and other inaccuracies, so we wanted them to be vetted before they would progress through our system.

Furthermore, we wanted to be able to give priority processing to certain sources that either had higher information value or were for a particular customer. For applications like learning about new artists, we’d assign a trust-score to each source that would, among other things, determine whether a new artist was created.

If the artist wasn’t created based solely on this source, it would add weight to that artist being created if we ever heard of them again. In this way, the combined strength of several lower-weighted sources could lead to the artist being created in our application.
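
A minimal sketch of this accumulation might look like the following. The trust scores, the threshold, the source names and the choice to count each source at most once per artist are all assumptions made for illustration; the text does not say how the real scores were combined.

```python
from collections import defaultdict

# Hypothetical trust scores and creation threshold.
TRUST = {"label-feed": 1.0, "fan-blog": 0.5, "forum-scrape": 0.5}
CREATE_THRESHOLD = 1.0


class ArtistCandidates:
    """Accumulates per-artist weight from the sources that mention them."""

    def __init__(self):
        self.weight = defaultdict(float)
        self.seen = defaultdict(set)

    def observe(self, artist: str, source: str) -> bool:
        """Record a sighting; return True once the artist should be created."""
        if source not in self.seen[artist]:  # assumption: one vote per source
            self.seen[artist].add(source)
            self.weight[artist] += TRUST[source]
        return self.weight[artist] >= CREATE_THRESHOLD


candidates = ArtistCandidates()
print(candidates.observe("New Artist", "fan-blog"))       # False
print(candidates.observe("New Artist", "forum-scrape"))   # True
print(candidates.observe("Signed Artist", "label-feed"))  # True
```

Keeping the set of contributing sources alongside the weight also preserves traceability: if the artist turns out to be bogus, we know which sources vouched for it.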

Backing out data

Sometimes we identified that data was simply incorrect or otherwise bad. In such cases, we had to remove the data from our production offering.

Recall, our data would pass through several stages of transformation on its way to the production offering. A backout, then, required that we first identify potential sources of the bad data, remove it, then reprocess the product without that source. (Sometimes the data transformations were so complex that it was easier to generate all permutations of source data, to spot the offender.) This is only possible since we had kept track of the sources that went into the final product.

Because of this observation, we had to make it easy to redo any stage of the data transformation with an altered source list. We designed our data processing pipeline to use parameterized source lists, so that it was easy to exclude a particular source, or explicitly declare the sources that were allowed to affect this particular processing stage.
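
As an illustration, a processing stage with a parameterized source list could be sketched like this. The function name, record layout and source labels are hypothetical; the point is only that excluding or whitelisting a source is an argument to the stage, not a code change.

```python
def build_rollup(records, allowed_sources=None, excluded_sources=frozenset()):
    """Combine records into a roll-up, honoring the given source lists.

    allowed_sources=None means "every source is allowed".
    """
    kept = [
        r for r in records
        if (allowed_sources is None or r["source"] in allowed_sources)
        and r["source"] not in excluded_sources
    ]
    return {
        "names": sorted({r["name"] for r in kept}),
        # keep the sources that actually contributed, for later tracing
        "sources": sorted({r["source"] for r in kept}),
    }


# Hypothetical input; "crawler-b" has a character-encoding bug.
records = [
    {"name": "ボリス", "source": "partner-feed"},
    {"name": "???", "source": "crawler-b"},
    {"name": "Béla Fleck", "source": "crawler-a"},
]

everything = build_rollup(records)
backed_out = build_rollup(records, excluded_sources={"crawler-b"})
print(backed_out["names"])    # ['Béla Fleck', 'ボリス']
print(backed_out["sources"])  # ['crawler-a', 'partner-feed']
```

Rerunning the stage once per suspect source, each time with that source excluded, is a simple way to spot the offender.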

Separating phases (and keeping them pure)

Often we would divide our data processing into several stages. It’s important to identify the state barriers in your application, as doing this allowed us to both write better code, and create more efficient infrastructure.

From a code perspective, keeping each of our stages separate allowed us to reduce side effects (such as I/O). In turn, this made code easier to test, because we didn’t have to set up mocks for half of our side-effecting infrastructure.

From an infrastructure perspective, keeping things separate allowed us to make isolated decisions about each stage of the process, ranging from compute power, to parallelism, to memory constraints.

Identifying the root cause

Identifying the root cause of data issues is important to being able to fix them, and control customer relationships. For instance, if a particular customer is having a data quality issue, it is helpful to know whether the origin of the issue was from data they gave you, or from your processing of the data they gave you. In the former case, there is real business value in being able to show the customer the exact source of the issue, as well as your solution.

Finding areas for improvement

Related to blame is the ability to find sources of improvement in your own processing pipeline and infrastructure. This means the steps in your processing pipeline become data sources in their own right.

It’s useful to know, for instance, when and how you derived a certain piece of data. Should an issue arise, you can immediately focus on the place it was created. Conversely, if a particular processing stage tends to produce excellent results, it is helpful to be able to understand why that is so. Ideally you can then replicate this into other parts of your system.

Organizationally, this type of knowledge also allows you to determine where to focus your teams’ effort, and even to reorganize your team structure. For example, you might want to place a new member of the team on one of the infrastructure pieces that is doing well, and should be a model for other pieces, so as to give them a good starting place for learning the system. A more senior team member may be more effective on pieces of the infrastructure that are struggling.

Immutability: borrowing an idea from functional programming

Considering the examples above, a core element of our strategy was immutability: even though our processing pipeline transformed our data several times over, we never changed (overwrote) the original data.

This is an idea we borrowed from functional programming. Consider imperative languages like C, Java and Python, in which data tends to be mutable. For example, if we want to sort a list, we might call myList.sort(). This will sort the list in-place. Consequently, all references to myList will be changed. If we now want to review myList’s original state, we’re out of luck: we should have made a copy before calling sort().

By comparison, functional languages like Haskell, Clojure and Erlang tend to treat data as immutable. Our list sorting example becomes something closer to myNewSortedList = sort(myList). This retains the unsorted list myList. One of the advantages of this immutability is that many functions become simply the result of processing the values passed in. Given a stack trace, we can often reproduce bugs immediately.

With mutable data, there is no guarantee that the value of a particular variable remains the same throughout the execution of the function. Because of this, we can’t necessarily rely on a stack trace to reproduce bugs.

Concerning our data processing pipeline, we could save each step of transformation and debug it later. For example, consider this workflow:

rawData = downloadFrom(someSite)
cleanData = cleanup(rawData)
newArtistData = extractNewArtists(cleanData)

Let’s say we’ve uncovered a problem in the cleanup() function. We would only have to correct the code and rerun that stage of the pipeline. We never replaced rawData and hence it would be available for any such debugging later.

To take further advantage of immutability, we persisted our data under a compound key of identifier and timestamp. This helped us find the exact inputs to any of our data processing steps, which saved time when we had to debug an issue.
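
Here is a small sketch of such a write-once store in Python. The SnapshotStore class, its method names, the integer timestamps and the sample data are hypothetical; it only illustrates a compound (identifier, timestamp) key and a refusal to overwrite.

```python
class SnapshotStore:
    """Write-once storage keyed by (identifier, timestamp)."""

    def __init__(self):
        self._rows = {}

    def put(self, identifier, timestamp, payload):
        key = (identifier, timestamp)
        if key in self._rows:
            # immutability: a snapshot is never replaced
            raise ValueError(f"snapshot {key!r} already exists")
        self._rows[key] = payload
        return key

    def get(self, identifier, timestamp):
        return self._rows[(identifier, timestamp)]

    def latest_before(self, identifier, timestamp):
        """Return the newest snapshot taken at or before the given time."""
        times = [ts for (ident, ts) in self._rows
                 if ident == identifier and ts <= timestamp]
        if not times:
            raise KeyError(identifier)
        return self._rows[(identifier, max(times))]


def cleanup(raw):
    """Stand-in for the cleanup() stage: trim surrounding whitespace."""
    return raw.strip()


store = SnapshotStore()
store.put("raw:some-site", 1, "  Beyoncé \n")
store.put("raw:some-site", 2, "  Beyoncé Knowles \n")

# Rerun just the cleanup stage against the exact input from time 1.
print(cleanup(store.get("raw:some-site", 1)))            # Beyoncé
print(cleanup(store.latest_before("raw:some-site", 5)))  # Beyoncé Knowles
```

Because the raw snapshot at time 1 is still there, a corrected cleanup() can be rerun against it at any point.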

An Example

As an example, let me walk you through creating a news aggregation site. Along the way, I’ll apply the lessons I describe above to demonstrate how data traceability affects the various aspects of the application.

Let’s say that our plan is to display the top stories of the day, with the ability to drill down by topic. Each story will also have a link to display coverage of the same event from other sources.

We’ll need to be able to do several things:

Crawl the web for news stories.
Determine a story’s popularity and timeliness based on social media activity, and perhaps its source. (For example, we assume a story on the New York Times homepage is important and/or popular).
Cluster stories about the same event together.
Determine event popularity. (Maybe this will be aggregate popularity of the individual stories?)

Crawlers

We’ll seed our crawlers with a number of known news sites. Every so often we’ll download the contents of the page and store it under a composite key with URL, source and timestamp, or a relational database row with these attributes. (Let’s say we crawl frequently-updated pages several times a day, and just once a day for other pages.)

From each of these home pages we crawl, we’ll download the individual linked stories. The stories will also be saved with URL, source and timestamp attributes. Additionally, we’ll store the composite ID of the homepage where we were linked to this story. That way if, for example, later we suspect we have a bug with the way we assign story popularity based on home page placement, we can review the home page as it was retrieved at a particular point in time. Ideally we should be able to trace data from our own homepage all the way back to the original HTML that our crawler downloaded.

In order to help determine popularity, and to further feed our news crawlers, we’ll also crawl social media sites. Just like with the news crawlers, we’ll want to keep a timestamped record of the HTML and other assets we crawl. Again, this will let us go back later and debug our code. One example of why this would be useful is if we suspect we are incorrectly counting links from shares of a particular article.

Change

Keeping previous versions of the sites we crawl allows for some interesting analytics. Historically, how many articles does the Boston Globe usually link to on their home page? Is there a larger variety of news articles in the summer? Another useful byproduct of this is that we can run new analytics on past data. Because immutability can give us a basis from the past, we’re not confined to just the data we’ve collected since we turned on our new analytics.

Clustering

Clustering data is a difficult problem. Outlying or mislabeled data can completely change our clusters. For this reason, it is important to be able to cheaply (in human and compute time) experiment with rerunning our clustering with altered inputs. The inputs we alter may remove data from a particular source, or add a new topic modelling stage between crawling and clustering. In order to achieve this, our infrastructure must be loosely coupled such that we can just as easily provide inputs to our clustering system for testing as we do in production.

Popularity

Calculating story popularity shares many of the same issues as clustering stories. As we experiment, or debug an issue, we want to quickly test our changes and see the result. We also want to see the most popular story on our own page and trace all the way through our own processing steps, back to the origin site we crawled. If we find out we’ve ranked a story as more popular than we would’ve liked, we can trace it back to our origin crawl to see if, perhaps, we had put too much weight in its position on its source site.

Conclusion

You will need to debug data processing code and infrastructure just like normal code. By taking advantage of techniques like immutability, you can dramatically improve your ability to reason about your system. Furthermore, we can draw from decades of experience in software design to influence our data processing and infrastructure decisions.

Introducing Knockbox

2011-12-10 00:00:00

Dec 10, 2011

For the past few weeks I’ve been working on a Clojure library called knockbox. It’s a library meant to make dealing with conflict-resolution in eventually-consistent databases easier. If you’re not familiar with eventual-consistency, I’d suggest this article by Amazon CTO Werner Vogels.

Distributed databases like Riak let you trade consistency for availability. This means that at any given moment, all of the replicas of your data might not be synchronized. In exchange for this, your database cluster can still operate when all but one replica of your data is unavailable. Amazon’s shopping-cart session state has been the iconic example. In their case, a write to add an item to your cart may go to a replica that is not up to date. At some point, the database notices that the replicas are in conflict, and you must resolve them. But how do you do this? If a coffee maker is in one replica and not the other, what happened? Was the coffee maker recently added and that just hasn’t been reflected in the other replica yet? Or was the coffee maker recently deleted? It turns out that you often have to change the way you represent your data in order to preserve the original intentions.

Developers who wanted to implement data-types with conflict-resolution semantics have had to figure it out themselves, or read academic papers like A comprehensive study of Convergent and Commutative Replicated Data Types. statebox was the first popular open source project to help ease the burden for developers wanting to take advantage of eventual-consistency. As I’ve been learning Clojure recently, I thought I’d try my hand at putting together a similar library.

The main goal has been to have the types conform to all appropriate Clojure Protocols and Java interfaces. This means my last-write-wins set should quack like a normal Clojure set. This lets you reuse existing code that expects normal Clojure data types. Next, I’ve defined a Resolvable Protocol for all of these types to implement. There’s only a single method, which looks like:

(resolve [a b])

This function should take two conflicting objects and return a new, resolved object.

Resolving a list of replicas (often called siblings when they’re in conflict) is as simple as providing the resolve function to reduce. This is, however, provided for you, as knockbox.core/resolve. Note that this function is in a different namespace than the resolve that you implement as part of the Resolvable Protocol (this lives in knockbox.resolvable).

There are currently two data-types implemented, sets and registers. A register is simply a container for another type. I also intend to implement counters, but have yet to come up with an implementation that has space-efficiency and pruning characteristics that I like.

Let’s now create some conflicting replicas, and see how they get resolved. Here we’ll use a last-write-wins (lww) set. The resolution semantics used here are to use timestamps to resolve an add/delete conflict for a particular item. This is not the same as using timestamps for the whole set, because we’re doing it per item. To get a REPL with the correct classpath, you can either add [knockbox "0.0.1-SNAPSHOT"] to your project.clj, or clone the knockbox repository and type lein repl.

(require 'knockbox.core)
(require '[knockbox.sets :as kbsets])
;; needed for the clojure.set/union comparison below
(require 'clojure.set)

(def original (into (kbsets/lww) #{:mug :kettle}))

(def a (disj original :kettle))
(def b (conj original :coffee))

(def c (conj original :coffee-roaster))
;; this one wins because its
;; timestamp is later
(def d (disj original :coffee-roaster))

(println (knockbox.core/resolve [a b c d]))
; => #{:coffee :mug}

;; notice that this is different
;; than simply taking the union of
;; the four sets
(println (clojure.set/union a b c d))
; => #{:coffee :coffee-roaster :kettle :mug}
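
To show how per-item timestamps lead to that result, here is a sketch of a last-write-wins set in Python. It is not knockbox’s implementation: the LWWSet class and its methods are hypothetical, timestamps are passed in explicitly so the example is deterministic, and the choice that a delete wins a timestamp tie is an assumption.

```python
from functools import reduce


class LWWSet:
    """Tracks, per item, the latest add time and the latest remove time."""

    def __init__(self, adds=None, removes=None):
        self.adds = dict(adds or {})
        self.removes = dict(removes or {})

    def conj(self, item, ts):
        adds = dict(self.adds)
        adds[item] = max(ts, adds.get(item, ts))
        return LWWSet(adds, self.removes)  # never mutate the original

    def disj(self, item, ts):
        removes = dict(self.removes)
        removes[item] = max(ts, removes.get(item, ts))
        return LWWSet(self.adds, removes)

    def members(self):
        # present if the latest add is strictly newer than the latest remove
        return {item for item, ts in self.adds.items()
                if ts > self.removes.get(item, float("-inf"))}


def merge_latest(x, y):
    """Merge two item -> timestamp maps, keeping the later time per item."""
    merged = dict(x)
    for item, ts in y.items():
        merged[item] = max(ts, merged.get(item, ts))
    return merged


def resolve(a, b):
    return LWWSet(merge_latest(a.adds, b.adds),
                  merge_latest(a.removes, b.removes))


original = LWWSet().conj("mug", 1).conj("kettle", 1)
a = original.disj("kettle", 2)
b = original.conj("coffee", 3)
c = original.conj("coffee-roaster", 4)
d = original.disj("coffee-roaster", 5)  # later than c, so the delete wins

print(sorted(reduce(resolve, [a, b, c, d]).members()))  # ['coffee', 'mug']
```

As in the post, resolving a list of siblings is just a reduce over the pairwise resolve function.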

Using timestamps is fine for some domains, but what if our update-rate is high enough that we can’t trust our clocks to be synchronized enough? The observed-remove set works by assigning a UUID to each addition. Deletes will then override any UUIDs they have seen for a particular item in the set. This means that when add/delete conflicts happen, addition will win because the delete action couldn’t have seen the UUID created by the addition. Let’s see this in action.

(require 'knockbox.core)
(require '[knockbox.sets :as kbsets])

(def original (into (kbsets/observed-remove) #{:gin :rum}))

(def a (conj original :vodka))
(def b (conj original :vodka))

;; we've only seen the addition
;; of :vodka from a, not b
(def c (disj a :vodka))

;; don't include a in here because
;; vector clocks will take care of
;; figuring out that c supersedes it
(println (knockbox.core/resolve [b c]))
; => #{:vodka :gin :rum}
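
The observed-remove behaviour can be sketched in Python as well. Again this is not knockbox’s implementation; the ORSet class and its method names are hypothetical. Each addition gets a fresh UUID, and a delete only cancels the UUIDs it has observed.

```python
import uuid


class ORSet:
    """Per item: the UUID tags of its additions and the tags seen by deletes."""

    def __init__(self, adds=None, removes=None):
        self.adds = dict(adds or {})
        self.removes = dict(removes or {})

    def conj(self, item):
        adds = dict(self.adds)
        adds[item] = adds.get(item, frozenset()) | {uuid.uuid4()}
        return ORSet(adds, self.removes)

    def disj(self, item):
        # a delete covers only the add-tags visible in this replica
        removes = dict(self.removes)
        removes[item] = (removes.get(item, frozenset())
                         | self.adds.get(item, frozenset()))
        return ORSet(self.adds, removes)

    def members(self):
        # present if at least one add-tag has not been observed by a delete
        return {item for item, tags in self.adds.items()
                if tags - self.removes.get(item, frozenset())}


def union_tags(x, y):
    """Merge two item -> tag-set maps by taking the union per item."""
    merged = dict(x)
    for item, tags in y.items():
        merged[item] = merged.get(item, frozenset()) | tags
    return merged


def resolve(a, b):
    return ORSet(union_tags(a.adds, b.adds),
                 union_tags(a.removes, b.removes))


original = ORSet().conj("gin").conj("rum")
a = original.conj("vodka")
b = original.conj("vodka")  # a concurrent add with its own UUID
c = a.disj("vodka")         # has only seen a's UUID for vodka

print(sorted(c.members()))              # ['gin', 'rum']
print(sorted(resolve(b, c).members()))  # ['gin', 'rum', 'vodka']
```

The delete in c cannot cancel b’s addition because it never saw b’s UUID, which is why the addition wins the conflict.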

That’s all for this first post, so go ahead and take a look at knockbox on github.

Writing Your First Chef Recipe

2011-04-18 00:00:00

Apr 18, 2011

Chef is an infrastructure automation tool that lets you write Ruby code to describe how your machines should be set up. Applications for Chef vary from configuring complicated multi-node applications, to setting up your personal workstation.

As great as Chef is, getting started can be a bit daunting. It’s worse if you’re not sure exactly what Chef provides, and you’ve never written a lick of Ruby. This was me a few days ago, so I thought I’d write a quick Chef introduction from that perspective. In this tutorial, we’ll be creating a Chef recipe for the popular database Redis.

Before we get started, there are two terms we need to define: recipes and cookbooks. In Chef, recipes are what you write to install and configure things on your machine like Redis, sshd or Apache2. A cookbook is a collection of related recipes. For example, the MySQL cookbook might include two recipes, mysql::client and mysql::server. A cookbook might also have a recipe for installing something via package management, or from source. Our Redis cookbook will contain just one recipe, which installs Redis from source.

This recipe is available on github.

Getting Set Up

The first thing you’ll want to do is:

$ git clone https://github.com/opscode/chef-repo.git

This gives us the skeleton of our cookbook repository. Next, we’ll create an empty cookbook:

$ cd chef-repo 
$ rake new_cookbook COOKBOOK=redis

Our rake task created some folders we won’t need for this simple recipe, so we’ll remove them:

$ cd cookbooks/redis/
$ rm -rf definitions/ files/ libraries/ providers/ resources/
$ cd ../..

The folders we’ll be looking at are:

cookbooks/redis
cookbooks/redis/attributes
cookbooks/redis/templates/default
cookbooks/redis/recipes

Next we’ll create the files we’ll be editing to create our recipe:

$ touch cookbooks/redis/attributes/default.rb
$ touch cookbooks/redis/recipes/default.rb
$ touch cookbooks/redis/templates/default/redis.conf.erb
$ touch cookbooks/redis/templates/default/redis.upstart.conf.erb

To run and test our cookbook, we’ll be using Vagrant, a tool for managing local virtual machines. Instructions for installing Vagrant can be found here. Create a file called Vagrantfile in the root of the repository. Edit it to look like this:

Vagrant::Config.run do |config|
  config.vm.box = "lucid32"
  config.vm.provision :chef_solo do |chef|
    chef.cookbooks_path = "cookbooks"
    chef.add_recipe "redis"
    chef.log_level = :debug
  end
end

The two most important things to note here are that we’re telling our VM to use Chef to install Redis, and that we want the log level set to debug.

Now run this to download the Ubuntu 10.04 VM we’ll be using:

# note: this download is roughly 500MB
$ vagrant box add lucid32 http://files.vagrantup.com/lucid32.box

Writing Our Recipe

Now we are set up and ready to start writing our first recipe. We’ll start by looking at cookbooks/redis/metadata.rb. It records metadata about our cookbook, including other cookbooks it depends on, and supported operating systems. For this tutorial, we don’t need to edit it.

Attributes

Next we’ll look at cookbooks/redis/attributes/default.rb, which is where we’ll be defining the variable options for installing and running Redis. Edit it to look like:

default[:redis][:dir]       = "/etc/redis"
default[:redis][:data_dir]  = "/var/lib/redis"
default[:redis][:log_dir]   = "/var/log/redis"
# one of: debug, verbose, notice, warning
default[:redis][:loglevel]  = "notice"
default[:redis][:user]      = "redis"
default[:redis][:port]      = 6379
default[:redis][:bind]      = "127.0.0.1"

This file gives default values for configuration options. The defaults can be overridden by a specific machine. For example, on your development box you might want the data_dir to be someplace different. Since it’s just Ruby code, we can also use control statements to change these defaults based on things like the host OS. One of the most powerful parts of Chef is that the attributes we’re defining here will be available to all of our configuration file templates. This means we only have to declare the user variable once, and it will be used to create a new user, and start Redis running as that same user. We’re programming our config files.

A quick note for the non-Ruby programmers out there: when you see :redis, this is called a symbol. The short story is that it’s a string just like "redis", but is more memory efficient if used more than once. In Python, one of the above lines might look like:

default["redis"]["dir"] = "/etc/redis"

Templates

In Chef we use ERB templates to write our config files. In this recipe we’re using two templates, one for the configuration to redis-server and the other for upstart. Upstart is a replacement for /etc/init.d/ scripts. Edit cookbooks/redis/templates/default/redis.conf.erb to look like:

port <%= node[:redis][:port] %>
bind <%= node[:redis][:bind] %>
loglevel <%= node[:redis][:loglevel] %>
dir <%= node[:redis][:data_dir] %>

daemonize no
logfile stdout
databases 16
save 900 1
save 300 10
save 60 10000
rdbcompression yes
dbfilename dump.rdb

and cookbooks/redis/templates/default/redis.upstart.conf.erb like:

#!upstart
description "Redis Server"

env USER=<%= node[:redis][:user] %>

start on startup
stop on shutdown

respawn

exec sudo -u $USER sh -c "/usr/local/bin/redis-server \
  /etc/redis/redis.conf >> \
  <%= node[:redis][:log_dir] %>/redis.log 2>&1"
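
To get a feel for what Chef does with these templates, here is a small sketch that renders the first four lines of redis.conf.erb using Ruby's standard ERB library. A plain hash named NODE stands in for Chef's node object, and render_redis_conf is a hypothetical helper; Chef's own rendering pipeline does more than this.

```ruby
require "erb"

# Stand-in for Chef's node object: a plain hash (an assumption for this sketch).
NODE = {
  redis: {
    port: 6379,
    bind: "127.0.0.1",
    loglevel: "notice",
    data_dir: "/var/lib/redis"
  }
}.freeze

# The first four lines of redis.conf.erb.
TEMPLATE = <<~ERB_SOURCE
  port <%= node[:redis][:port] %>
  bind <%= node[:redis][:bind] %>
  loglevel <%= node[:redis][:loglevel] %>
  dir <%= node[:redis][:data_dir] %>
ERB_SOURCE

# Render the template; the local variable `node` is visible to the
# template through the method's binding.
def render_redis_conf(node)
  ERB.new(TEMPLATE).result(binding)
end

puts render_redis_conf(NODE)
# port 6379
# bind 127.0.0.1
# loglevel notice
# dir /var/lib/redis
```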

The Recipe File

Now it’s time to write the actual recipe. Having little Ruby experience, I’ll have to do some hand-waving in explaining that the following code is both Chef’s DSL, and perfectly valid Ruby code.

The following code is run from top to bottom. It uses Chef resources to create a user, make directories, download and compile Redis, and write out the templates.

Edit cookbooks/redis/recipes/default.rb to look like:

package "build-essential" do
  action :install
end

user node[:redis][:user] do
  action :create
  system true
  shell "/bin/false"
end

directory node[:redis][:dir] do
  owner "root"
  mode "0755"
  action :create
end

directory node[:redis][:data_dir] do
  owner node[:redis][:user]
  mode "0755"
  action :create
end

directory node[:redis][:log_dir] do
  owner node[:redis][:user]
  mode "0755"
  action :create
end

remote_file "#{Chef::Config[:file_cache_path]}/redis.tar.gz" do
  source "https://github.com/antirez/redis/tarball/v2.0.4-stable"
  action :create_if_missing
end

bash "compile_redis_source" do
  cwd Chef::Config[:file_cache_path]
  code <<-EOH
    tar zxf redis.tar.gz
    cd antirez-redis-55479a7
    make && make install
  EOH
  creates "/usr/local/bin/redis-server"
end

service "redis" do
  provider Chef::Provider::Service::Upstart
  subscribes :restart, resources(:bash => "compile_redis_source")
  supports :restart => true, :start => true, :stop => true
end

template "redis.conf" do
  path "#{node[:redis][:dir]}/redis.conf"
  source "redis.conf.erb"
  owner "root"
  group "root"
  mode "0644"
  notifies :restart, resources(:service => "redis")
end

template "redis.upstart.conf" do
  path "/etc/init/redis.conf"
  source "redis.upstart.conf.erb"
  owner "root"
  group "root"
  mode "0644"
  notifies :restart, resources(:service => "redis")
end

service "redis" do
  action [:enable, :start]
end
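
Two details in the recipe keep it safe to run repeatedly: remote_file uses :create_if_missing, and the bash resource declares the file it creates, so Chef skips the step when that file already exists. The idea behind both can be sketched in plain Ruby. This is an illustration only; run_unless_exists and demo_guard are hypothetical helpers, not part of Chef.

```ruby
require "tmpdir"

# Run the block only if path does not exist yet, mimicking the idea behind
# `creates` and `:create_if_missing`. Returns true if the block ran.
def run_unless_exists(path)
  return false if File.exist?(path)

  yield
  true
end

# Try the same "build" step twice against a throwaway directory.
def demo_guard
  Dir.mktmpdir do |dir|
    marker = File.join(dir, "redis-server") # stand-in for the compiled binary
    build = -> { File.write(marker, "fake binary") }

    first = run_unless_exists(marker, &build)
    second = run_unless_exists(marker, &build)
    [first, second]
  end
end

p demo_guard # [true, false]
```

The second call is skipped because the marker file already exists, which is why re-running the recipe does not recompile Redis.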

Trying Our Recipe

Now that we’ve written our recipe, it’s time to try it out. In the root of your repository, run vagrant up. This will start the virtual machine and set up Redis using Chef. Once the command finishes, run this:

$ vagrant ssh
$ echo "ping" | nc localhost 6379
$ exit

If all went well, you should have seen +PONG. If you change something and want to re-run Chef, type vagrant provision.
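
If nc isn’t available, roughly the same check can be done from Ruby. The sketch below assumes the server accepts a bare PING line, just as in the nc example; redis_ping is a hypothetical helper name.

```ruby
require "socket"

# Send PING to a Redis server and return its one-line reply, e.g. "+PONG".
def redis_ping(host = "localhost", port = 6379)
  Socket.tcp(host, port) do |sock|
    sock.write("PING\r\n")
    sock.gets.to_s.strip
  end
end

# Usage, from inside the virtual machine:
#   puts redis_ping
```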

When you’re done working, run vagrant destroy to reclaim your RAM.

Closing Thoughts

Chef is much more powerful than what I’ve presented, but I hope I’ve been able to show how easy it is to get started writing and editing recipes. If you’d like to learn more about Chef, check out the Opscode wiki.

If you like this post, you should follow me on Twitter.

100-Node Riak Cluster for $2

2011-04-03 00:00:00

Apr 3, 2011

Riak is a distributed key-value store; data is replicated and partitioned across your cluster. Increasing the cluster size allows you to scale both performance and fault-tolerance. One of the most powerful parts of Riak is the ability to add a new node to your cluster with one command:

riak-admin join riak@example.com

With the recent trend toward operations-as-code, I thought I would challenge myself to write a script to set up a 100-node Riak cluster with one command. Using Amazon EC2 micro-instances, the cluster costs $2 to run for an hour.

Riak works by splitting a 160-bit hash-space into a certain number of virtual nodes (vnodes), say 1024. Each physical node is then responsible for 1024 / N vnodes, where N is the number of physical nodes in the cluster. As a new node joins, it takes some vnodes from the rest of the cluster.
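
The arithmetic for a 100-node cluster can be worked through in a few lines of Ruby. This is only a sketch of an idealized even split; Riak's actual assignment of vnodes to nodes may differ, and the helper names are hypothetical.

```ruby
# Size of each vnode's slice of the 160-bit hash space.
def partition_size(vnodes)
  (2**160) / vnodes
end

# Idealized even split of vnodes over physical nodes.
# Returns a hash of { vnodes_per_node => number_of_nodes }.
def vnode_split(vnodes, nodes)
  base, extra = vnodes.divmod(nodes)
  split = { base => nodes - extra }
  split[base + 1] = extra if extra > 0
  split
end

p vnode_split(1024, 100)             # {10=>76, 11=>24}
puts partition_size(1024) == 2**150  # true
```

With 1024 vnodes and 100 nodes, 1024 / N is not a whole number, so in an even split 76 nodes would hold 10 vnodes each and 24 nodes would hold 11.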

I’ve written a simple Python script to launch a 100-node cluster. The script launches a master node, and notes its IP address. The other 99 nodes are launched and told to join the master. Riak doesn’t currently have provisions to deal with many nodes trying to join the cluster at once. To avoid the thundering-herd problem I simply have each node sleep for a random time, such that nodes are joining, on average, one every 15 seconds. Some sort of queueing system, and this bugfix, would eliminate the need for nodes to stagger their join requests. Here is a snippet from the Riak IRC about this. I didn’t get a chance to try it, but using Chef-server, there’s also a Riak cluster recipe.
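
The staggering idea can be sketched as follows. This is not the launch script itself (which is written in Python) but a Ruby illustration under an assumption: each node picks a uniformly random delay within a window of 99 × 15 seconds, so joins arrive about 15 seconds apart on average. The join_delays helper and the seed value are hypothetical.

```ruby
# Pick a random start delay (in seconds) for each joining node, spread
# uniformly over a window of nodes * avg_gap_seconds.
def join_delays(nodes, avg_gap_seconds, rng = Random.new)
  window = (nodes * avg_gap_seconds).to_f
  Array.new(nodes) { rng.rand(0.0..window) }
end

# Simulate 99 joining nodes with a fixed seed so the run is repeatable.
delays = join_delays(99, 15, Random.new(42)).sort
gaps = delays.each_cons(2).map { |earlier, later| later - earlier }

puts "window: #{99 * 15} seconds"
puts format("mean gap between joins: %.1f seconds", gaps.sum / gaps.size)
```

In a real deployment each node would compute its own delay independently before running riak-admin join. Random delays only make simultaneous joins unlikely; they do not rule them out, which is why a queueing system would be the more robust fix.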

After getting my script working with a 20-node cluster, I tried to launch 100, only to learn that AWS accounts are, by default, limited to 20 instances. Fortunately, the spot instance limit is 100, so I was able to use those.

The script is simple, and usage looks like:

./launch.py keypair ~/.ssh/keypair.pem user_data.sh 100

Approximately 35 minutes after running the script, I had a 95-node cluster. The command riak-admin ringready told me that two nodes were down. After starting them, I had a 97-node cluster. I wasn’t able to diagnose the problem with the other three nodes. I was impressed with how easy it was to automate Riak, and it’s clear that Basho has plans to make things even easier.

Now is a good time to note that the script doesn’t launch a truly production-ready cluster. For starters, it probably isn’t a good idea to use spot instances for a database. You would also be wise to use a smaller number of more powerful machines, rather than 100 micro instances. Finally, I would recommend something like Chef for more complicated infrastructure automation.

If you’d like to run your own 100-node cluster, check out this GitHub repository. If you decide to keep your cluster up for more than an hour, here’s some data to play with.

It’s exciting to see how infrastructure automation is making it easy for small teams to build massive systems in short periods of time. Databases like Riak fit perfectly with this, as their administrative cost is low, and configuration remains simple regardless of how many nodes are in the cluster.

For those of you considering writing something similar, I highly recommend trying Vagrant for testing virtual machine setups before spending a dime on EC2.

If you like this post, you should follow me on Twitter.