<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Reid Draper's blog</title>
    <subtitle><![CDATA[Reid Draper's blog]]></subtitle>
    <link href="reiddraper.com/atom.xml" rel="self" />
    <link href="reiddraper.com" />
    <id>reiddraper.com/atom.xml</id>
    <author>
        <name>Reid Draper</name>
        
        <email>reiddraper@gmail.com</email>
        
    </author>
    <updated>2013-11-03T00:00:00Z</updated>
    <entry>
    <title>Writing simple-check</title>
    <link href="reiddraper.com/writing-simple-check/index.html" />
    <id>reiddraper.com/writing-simple-check/index.html</id>
    <published>2013-11-03T00:00:00Z</published>
    <updated>2013-11-03T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1>Writing simple-check</h1>
<h3>Nov  3, 2013</h3>
<p>For the past several months I’ve been working on a
<a href="http://en.wikipedia.org/wiki/QuickCheck">QuickCheck</a> (QC) library for Clojure:
<a href="https://github.com/reiddraper/simple-check">simple-check</a>. In this post, we’ll
look at three issues I ran into porting QC from Haskell to Clojure: typing,
shrinking, and laziness. This will not act as an introduction to QC, or
property-based testing. Further, this post assumes some familiarity with
Haskell and Clojure.</p>
<h2 id="typing">Typing</h2>
<p>One of the major differences between writing a QC in a statically-typed
language and a dynamically-typed language is that with static types, we get to
use that information to inform QC of the generators to use to test our
function. For example, if our function has the type <code>[Int] -&gt; Bool</code>, Haskell QC
will use this information to generate <code>[Int]</code>s. Furthermore, this takes
advantage of the fact that we can be polymorphic on <em>return</em> type in Haskell.
The <code>Arbitrary</code> type class in Haskell has a function, <code>arbitrary</code>, whose
signature is <code>Gen a</code>. This allows the compiler to fill in the specialized
version of <code>Gen a</code> for us, depending on context. In Clojure, we can only use
type-based dispatch on an <em>argument</em>, not the return value. So, in
dynamically-typed languages, we resort to explicitly specifying the generators
to use for our test. Let’s see a concrete example:</p>
<p>In Haskell:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="ot">sortIdempotent ::</span> [<span class="dt">Int</span>] <span class="ot">-&gt;</span> <span class="dt">Bool</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>sortIdempotent xs <span class="ot">=</span> (<span class="fu">sort</span> xs) <span class="op">==</span> (<span class="fu">sort</span> (<span class="fu">sort</span> xs))</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>quickCheck sortIdempotent</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="co">-- +++ OK, passed 100 tests.</span></span></code></pre></div>
<p>In Clojure:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode clojure"><code class="sourceCode clojure"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>(<span class="bu">defn</span><span class="fu"> sort-idempotent?</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>  [coll]</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>  (<span class="kw">=</span> (<span class="kw">sort</span> coll) (<span class="kw">sort</span> (<span class="kw">sort</span> coll))))</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>(sc/quick-check <span class="dv">100</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>  (prop/for-all [coll (gen/vector gen/int)]</span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>    (sort-idempotent? coll)))</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="co">;; {:result true, :num-tests 100, :seed 1383433754854}</span></span></code></pre></div>
<p>In Erlang (also dynamically typed), using <a href="http://www.quviq.com/index.html">Erlang QuickCheck (EQC)</a>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode erlang"><code class="sourceCode erlang"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="fu">sort_idempotent(</span><span class="va">Xs</span><span class="fu">)</span> <span class="op">-&gt;</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">lists:sort(</span><span class="va">Xs</span><span class="fu">)</span> <span class="op">=:=</span> <span class="fu">lists:sort(lists:sort(</span><span class="va">Xs</span><span class="fu">)).</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="fu">prop_sort_idempotent()</span> <span class="op">-&gt;</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>    <span class="fu">?</span><span class="va">FORALL</span><span class="fu">(</span><span class="va">Xs</span><span class="fu">,</span> <span class="fu">list(int()),</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>            <span class="fu">sort_idempotent(</span><span class="va">Xs</span><span class="fu">)).</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="fu">eqc:quickcheck(prop_sort_idempotent()).</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="co">%% OK, passed 100 tests</span></span></code></pre></div>
<p>As you can see, with <strong>simple-check</strong> and Erlang QuickCheck, we have to
explicitly provide the generator to use to test our function.</p>
<h2 id="shrinking">Shrinking</h2>
<p>Some QC implementations have a feature called shrinking. This allows failing
tests to be shrunk to ‘smaller’ failing cases, where ‘smaller’ is data-type
specific, something that’d be easier for the programmer to debug. For example,
if your function fails with a randomly-generated 100-element list, QC will try
to remove elements, as long as the test continues to fail. In Haskell
QuickCheck, random element generation and shrinking are treated separately.
That is, if you want your type to shrink, you have to implement that separately
from generating random values of your type. Let’s see the type class where
these two functions live, <code>Arbitrary</code>:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> <span class="dt">Arbitrary</span> a <span class="kw">where</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="ot">  arbitrary ::</span> <span class="dt">Gen</span> a</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>  <span class="co">-- the returned list is the first-level of the shrink tree</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="ot">  shrink ::</span> a <span class="ot">-&gt;</span> [a]</span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>  <span class="co">-- default implementation</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a>  shrink _ <span class="ot">=</span> []</span></code></pre></div>
<p>Most (all?) of the standard Prelude types have an <code>Arbitrary</code> instance already
written, but you’ll need to write one for your own types. Generally you’ll
write your implementation of <code>arbitrary</code> based on the provided
generator-combinators, like <code>choose</code>, <code>elements</code> and <code>oneof</code>. If you want your
type to shrink, you’ll have to implement this on your own. Again, this is due
to the fact that value generation and shrinking are treated separately.
<em>simple-check</em> and Erlang QuickCheck take a different approach. When you write
a generator, using generator-combinators, you get shrinking ‘for free’. That’s
because the notion of generating values and shrinking are tied together in
these implementations. This is handy because it saves us from having to write
boilerplate code to implement shrinking. Further, because it’s much less
common to create our own types in Clojure, and not possible at all in Erlang, we
don’t want to have to create a new type solely to implement some shrink
protocol. Another benefit is that even implicit constraints in our generator are
respected during shrinking. For example, suppose we write a new generator which
multiplies randomly generated integers by two. This will always result in an
even number being generated, and this will remain true during shrinking. This
works because in simple-check, instead of generating only a random value, a
generator produces a random value along with the shrink tree for that
value. Erlang QuickCheck is proprietary, but I imagine it works similarly.
Let’s imagine how this might look using Haskell’s types:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co">-- a RoseTree is just an n-ary tree</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="kw">data</span> <span class="dt">RoseTree</span> a <span class="ot">=</span> <span class="dt">RoseTree</span> a [<span class="dt">RoseTree</span> a]</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> <span class="dt">Arbitrary</span> a <span class="kw">where</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>  <span class="co">-- instead of generating an `a`, we generate a shrink tree of `a`</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="ot">  arbitrary ::</span> <span class="dt">Gen</span> (<span class="dt">RoseTree</span> a)</span></code></pre></div>
<p>The top of the tree is a randomly generated value, and its children are the
first level of shrinking. Generator-combinators can then manipulate
this shrink tree. Because we now act on these shrink trees, we simply create
larger trees as we create more complex generators. To give a concrete example,
the expression <code>(gen/fmap (partial * 2) gen/int)</code> will create a new generator based
on <code>gen/int</code>, which multiplies the randomly generated elements by two. But
since this function is also applied to the children in the shrink tree, every
element in the shrink tree will be multiplied by two. We can also now write
generator-combinators like <code>elements</code>, which creates a generator by choosing a
random element from a list. This generator will shrink toward choosing earlier
elements in the list. Were we to use <code>elements</code> in our <code>arbitrary</code> function in
Haskell QC, we’d have to write the shrinking logic ourselves. It’s
important to note, however, that this is specific to Haskell QC, and
not the language itself: we could’ve implemented Haskell’s QC as
described here.</p>
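<p>To make the shrink-tree idea concrete, here is a minimal sketch in Python rather than Clojure or Haskell. It is not simple-check’s implementation: <code>RoseTree</code>, <code>int_shrink_tree</code>, <code>gen_int</code> and the halving shrink strategy are all assumptions made for illustration, and for brevity <code>fmap</code> acts on a tree directly rather than on a generator of trees.</p>
<pre><code>import random


class RoseTree:
    """A value (the root) plus the shrink trees of its smaller candidates."""

    def __init__(self, root, children):
        self.root = root
        self.children = children  # list of RoseTree


def int_shrink_tree(n):
    """Shrink tree for an int: candidates are 0 and n halved toward zero."""
    candidates = []
    if n != 0:
        candidates.append(0)
        half = int(n / 2)  # int() truncates toward zero, also for negatives
        if half != 0:
            candidates.append(half)
    return RoseTree(n, [int_shrink_tree(c) for c in candidates])


def fmap(f, tree):
    """Apply f to the root and to every node below it."""
    return RoseTree(f(tree.root), [fmap(f, c) for c in tree.children])


def gen_int(rng, size=100):
    """Hypothetical generator: a random int bundled with its shrink tree."""
    return int_shrink_tree(rng.randint(-size, size))


def all_values(tree):
    """Every value in the tree, root first."""
    values = [tree.root]
    for child in tree.children:
        values.extend(all_values(child))
    return values


def double(x):
    return x * 2


def gen_even(rng):
    """A derived generator: doubling is applied throughout the shrink tree."""
    return fmap(double, gen_int(rng))


even_tree = gen_even(random.Random(42))
print(even_tree.root, [child.root for child in even_tree.children])</code></pre>
<p>Because <code>fmap</code> rebuilds every node, each shrink candidate of <code>gen_even</code> is even by construction, with no separate shrink function to write.</p>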
<h2 id="laziness">Laziness</h2>
<p>Haskell QuickCheck takes advantage of whole-program laziness. For example, when
shrinking, instead of traversing a tree of arguments to the function under
test, and applying the function to the values as the tree is traversed, we’re able
to use <code>fmap</code> to lazily apply the function to the entire tree. We then need only
traverse a tree of booleans (representing test success or failure). This allows
for a higher level of abstraction. Fortunately, Clojure lets us mimic this, as
long as our types are represented as lazy sequences. To represent a large tree,
we use a two-element vector, where the first element is the top value in the
tree, and the second element is a lazy sequence, representing the children.
Using Clojure’s lazy functions like <code>map</code>, <code>filter</code> and <code>concat</code>, we’re able to
retain this laziness as we process the tree. However, as this tree can become
large when fully-evaluated, finding bugs can be difficult. In Haskell, we’re
able to find type-mistakes during compilation, whereas in Clojure we need to
run our program, potentially sifting through a large tree to find our bugs,
which may have been introduced several call-sites away from where we’re
looking. In order to combat this, I specifically debugged with values I knew
had small shrink trees, and could be easily printed at the REPL.</p>
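<p>The lazy two-element representation can be imitated in Python with generators. The following is only a sketch of the idea, not simple-check’s code: the names <code>lazy_int_tree</code>, <code>lazy_fmap</code> and <code>smallest_failure</code>, the halving shrink strategy and the <code>single_digit</code> property are all assumptions chosen to keep the tree small enough to inspect.</p>
<pre><code>def lazy_int_tree(n):
    """A (value, children) pair; children is a generator, so subtrees
    are only built when something iterates over them."""

    def children():
        if n != 0:
            yield lazy_int_tree(0)
            half = int(n / 2)  # truncates toward zero
            if half != 0:
                yield lazy_int_tree(half)

    return (n, children())


def lazy_fmap(f, tree):
    """Map f over the tree; f runs on a subtree's root only once that
    subtree is reached."""
    value, children = tree
    return (f(value), (lazy_fmap(f, child) for child in children))


def smallest_failure(result_tree):
    """Walk a tree of (passed, value) results, always descending into the
    first failing child. Assumes the root itself is a failure."""
    (_, value), children = result_tree
    for child in children:
        child_passed = child[0][0]
        if not child_passed:
            return smallest_failure(child)
    return value


calls = []  # records every value the property was actually run on


def single_digit(n):
    """Property under test: holds only for 0 through 9."""
    calls.append(n)
    return n in range(10)


def check(n):
    return (single_digit(n), n)


smallest = smallest_failure(lazy_fmap(check, lazy_int_tree(100)))
print(smallest, calls)</code></pre>
<p>Starting from 100, the search settles on 12 after running the property on only 9 of the 14 nodes in the full tree; the remaining subtrees are never built. A tree this small is also easy to print in full, which is the kind of value I found useful for debugging.</p>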
<p class="twitter_follow">If you like this post, you should follow me on
  <a href="http://twitter.com/reiddraper">twitter.</a>
</p>
]]></summary>
</entry>
<entry>
    <title>Data Traceability</title>
    <link href="reiddraper.com/data-traceability/index.html" />
    <id>reiddraper.com/data-traceability/index.html</id>
    <published>2013-05-16T00:00:00Z</published>
    <updated>2013-05-16T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1>Data Traceability</h1>
<h3>May 16, 2013</h3>
<p>This text appears as Chapter 17 in O’Reilly’s <a href="http://shop.oreilly.com/product/0636920024422.do">Bad Data
Handbook</a> (ISBN-13:
978-1449321888). It is released under the <a href="http://creativecommons.org/licenses/by-sa/3.0/">CC
BY-SA</a> license.</p>
<hr />
<p>Your software consistently provides impressive music recommendations
by combining cultural and audio data. Customers are happy.
However, things aren’t always perfect. Sometimes that Beyoncé track is
attributed to Beyonce. The artist for the Béla Fleck solo album shows up as
Béla Fleck and the Flecktones. Worse, the ボリス biography
has the artist name listed as ???. Where did things go wrong?
Did one of your customers provide you with data in an incorrect
character encoding? Did one of the web-crawlers have a bug? Perhaps
the name resolution code was incorrectly combining a solo artist
with his band?</p>
<p>How do we solve this problem?
We’d like to be able to trace data back to its origin, following
each transformation. This is reified as <em>data provenance</em>.
In this chapter, we’ll explore ways of keeping track of
the source of our data, techniques for backing out
bad data, and the business value of adopting such
an ability.</p>
<h2 id="why">Why?</h2>
<p>The ability to trace a datum back to its origin is important
for several reasons. It helps us to back-out or reprocess bad data,
and conversely, it allows us to reward and boost good data
sources and processing techniques. Furthermore, local privacy
laws can mandate things like auditability, data transfer
restrictions and more. For example, California’s Shine the Light
Law requires businesses to disclose the personal information that has
been shared with third parties, should a resident request it. Europe’s
Data Protection Directive provides even more stringent regulation
to businesses collecting data about residents.</p>
<p>We’ll also later see how data traceability can provide further
business value by allowing us to provide stronger measurements
on the worth of a particular source, realize where to
focus our development effort, and even manage blame.</p>
<h2 id="personal-experience">Personal Experience</h2>
<p>I previously worked in the data ingestion team at a music data
company. We provided artist and song recommendations, artist
biographies, news, and detailed audio analysis of digital music.
We exposed those data feeds via web services and raw dumps.
Behind the scenes, these feeds were composed of many sources of
data, which were in turn cleaned, transformed, and put
through machine learning algorithms.</p>
<p>One of the first issues we ran into was learning how to trace a
particular result back to its constituent parts. If a given
artist recommendation was poor, was it because of our machine
learning algorithm? Did we simply not have enough data for that
artist? Was there some obviously wrong data from one of our
sources? Being able to debug our product became a business
necessity.</p>
<p>We developed several mechanisms for being able to debug our data
woes, some of which I’ll explore here.</p>
<h3 id="snapshotting">Snapshotting</h3>
<p>Many of the data sources were updated frequently. At the same
time, the web pages we crawled for news, reviews, biography
information and similarity were updated inconsistently. This
meant that even if we were able to trace a particular datum back
to its source, that source may have changed drastically since
the time we had crawled or processed it. In
turn, we needed to capture not only the source of our data, but
also the time and an exact copy of the source. Our database columns or
keys would then have an extra field for a timestamp.</p>
<p>Keeping track of the time and the original data also allows you
to track changes from that source. You get closer to answering
the question, “why were my recommendations for The Sea and Cake
great last week, but terrible today?”</p>
<p>This process of writing data once and never changing it is called
<em>immutability,</em> and it plays a key role in data traceability.
I’ll return to it later, when I walk through an example.</p>
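<p>As a rough illustration of snapshotting, the sketch below keeps every crawl under a key of source plus timestamp and refuses to overwrite an existing entry. It is written in Python for brevity and is not the system we ran: the <code>SnapshotStore</code> class, its method names, the in-memory dictionary and the placeholder URL are all assumptions.</p>
<pre><code>from datetime import datetime, timezone


class SnapshotStore:
    """Write-once store: every snapshot is filed under (source, fetched_at)."""

    def __init__(self):
        self._snapshots = {}

    def put(self, source, fetched_at, content):
        key = (source, fetched_at)
        if key in self._snapshots:
            raise ValueError("snapshots are immutable: %r already stored" % (key,))
        self._snapshots[key] = content
        return key

    def get(self, key):
        return self._snapshots[key]

    def history(self, source):
        """All (fetched_at, content) pairs for one source, oldest first."""
        keys = sorted(k for k in self._snapshots if k[0] == source)
        return [(k[1], self._snapshots[k]) for k in keys]


store = SnapshotStore()
url = "http://example.com/artists/the-sea-and-cake"  # placeholder URL
last_week = datetime(2013, 5, 9, tzinfo=timezone.utc)
today = datetime(2013, 5, 16, tzinfo=timezone.utc)
store.put(url, today, "bio, second crawl")
store.put(url, last_week, "bio, first crawl")

for fetched_at, content in store.history(url):
    print(fetched_at.isoformat(), content)</code></pre>
<p>Comparing consecutive entries of <code>history</code> is one way to start answering the “great last week, terrible today” question.</p>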
<h3 id="saving-the-source">Saving the source</h3>
<p>Our data was stored in several different types of databases, including
relational and key-value stores. However, nearly every schema had
a <em>source</em> field. This field would contain one or more values.
For original sources there would be a single source listed.
As data was processed and transformed into roll-ups or
learned-data, we would preserve the list of sources that went
into creating that new piece of data. This allowed us to trace
the final data product back to its constituent parts.</p>
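<p>One way to picture the <em>source</em> field is as a set that every derived record inherits from its inputs. The Python sketch below assumes records are plain dictionaries; the helper names <code>original</code> and <code>derive</code> and the source labels are hypothetical, not our actual schema.</p>
<pre><code>def original(value, source):
    """A record taken directly from one source."""
    return {"value": value, "sources": frozenset([source])}


def derive(value, inputs):
    """A record computed from other records; it inherits all their sources."""
    sources = frozenset().union(*(record["sources"] for record in inputs))
    return {"value": value, "sources": sources}


# Hypothetical source names, for illustration only.
crawl = original("Béla Fleck", "crawler:fan-site")
feed = original("Béla Fleck and the Flecktones", "customer:label-feed")
play_counts = original(1200, "partner:play-logs")

resolved_name = derive("Béla Fleck and the Flecktones", [crawl, feed])
recommendation = derive({"artist": resolved_name["value"], "score": 0.9},
                        [resolved_name, play_counts])

print(sorted(recommendation["sources"]))</code></pre>
<p>Even after two levels of roll-up, the final record still lists all three original sources.</p>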
<h3 id="weighting-sources">Weighting sources</h3>
<p>One of the most important reasons we collected data was to learn
about new artists, albums and songs. That said, we didn’t always
want to create a new entity that would end up in our final data
product. Certain data sources were more likely to have errors,
misspellings and other inaccuracies, so we wanted them to be
vetted before they would progress through our system.</p>
<p>Furthermore, we wanted to be able to give priority processing to
certain sources that either had higher information value or were
for a particular customer. For applications like learning about
new artists, we’d assign a trust-score to each source that would,
among other things, determine whether a new artist was created.</p>
<p>If the artist wasn’t created based solely on this source, it
would add weight to that artist being created if we ever heard of
them again. In this way, the combined strength of several
lower-weighted sources could lead to the artist being created in our
application.</p>
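<p>The accumulation of trust can be sketched in a few lines of Python. Everything specific here is an assumption made for the example: the <code>ArtistCandidates</code> class, integer trust points, and the threshold of 100 are not the values or structures we actually used.</p>
<pre><code>from collections import defaultdict


class ArtistCandidates:
    """Accumulates trust per artist name until a creation threshold is met."""

    def __init__(self, threshold=100):
        self.threshold = threshold  # hypothetical cut-off, in trust points
        self.weights = defaultdict(int)
        self.created = set()

    def missing_trust(self, artist):
        """Trust points still needed before the artist is created."""
        return max(0, self.threshold - self.weights[artist])

    def observe(self, artist, trust_score):
        """Record one sighting; return True once the artist exists."""
        self.weights[artist] += trust_score
        if self.missing_trust(artist) == 0:
            self.created.add(artist)
        return artist in self.created


candidates = ArtistCandidates()
print(candidates.observe("Boris", 100))  # one highly trusted source
print(candidates.observe("The Sea and Cake", 40))
print(candidates.observe("The Sea and Cake", 30))
print(candidates.observe("The Sea and Cake", 30))</code></pre>
<p>The first artist is created from a single trusted sighting, while the second only appears once three weaker sources have agreed.</p>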
<h3 id="backing-out-data">Backing out data</h3>
<p>Sometimes we identified that data was simply incorrect or otherwise
bad. In such cases, we had to remove the data from our
production offering.</p>
<p>Recall, our data would pass through several stages of transformation
on its way to the production offering. A backout, then, required that
we first identify potential sources of the bad data, remove it,
then reprocess the product without that source. (Sometimes the data
transformations were so complex that it was easier to generate all
permutations of source data, to spot the offender.) This is only
possible since we had kept track of the sources that went into the
final product.</p>
<p>Because of this observation, we had to make it easy to redo any
stage of the data transformation with an altered source list. We
designed our data processing pipeline to use parameterized source
lists, so that it was easy to exclude a particular source, or
explicitly declare the sources that were allowed to affect this
particular processing stage.</p>
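<p>A parameterized source list might look like the following Python sketch. The processing stage (<code>resolve_names</code>, which picks the most common spelling), the leave-one-out helper <code>suspects</code> and the sample records are hypothetical stand-ins for a real pipeline stage, and excluding one source at a time is a simpler search than trying every combination.</p>
<pre><code>from collections import Counter


def resolve_names(records, excluded_sources=()):
    """Pick the most common spelling per artist, ignoring excluded sources."""
    spellings = {}
    for record in records:
        if record["source"] not in excluded_sources:
            spellings.setdefault(record["artist_id"], Counter())[record["name"]] += 1
    return {artist_id: counts.most_common(1)[0][0]
            for artist_id, counts in spellings.items()}


def suspects(records, artist_id, expected):
    """Sources whose removal alone makes the result match the expected name."""
    sources = sorted({record["source"] for record in records})
    return [source for source in sources
            if resolve_names(records, [source]).get(artist_id) == expected]


# Hypothetical records; the source names are made up for illustration.
records = [
    {"artist_id": "a1", "name": "Beyoncé", "source": "label-feed"},
    {"artist_id": "a1", "name": "Beyonce", "source": "crawler-x"},
    {"artist_id": "a1", "name": "Beyonce", "source": "crawler-x"},
]

print(resolve_names(records))
print(suspects(records, "a1", "Beyoncé"))
print(resolve_names(records, excluded_sources=["crawler-x"]))</code></pre>
<p>Because the stage takes its sources as an argument, rerunning it without the offender is a one-line change rather than a code edit.</p>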
<h3 id="separating-phases-and-keeping-them-pure">Separating phases (and keeping them pure)</h3>
<p>Often we would divide our data processing into several stages. It’s
important to identify the state barriers in your application, as doing
this allowed us to both write better code, and create more efficient
infrastructure.</p>
<p>From a code perspective, keeping each of our stages separate allowed us to
reduce side effects (such as I/O). In turn, this made code easier to
test, because we didn’t have to set up mocks for half of our side-effecting
infrastructure.</p>
<p>From an infrastructure perspective, keeping things separate allowed us to make
isolated decisions about each stage of the process, ranging from compute power,
to parallelism, to memory constraints.</p>
<h3 id="identifying-the-root-cause">Identifying the root cause</h3>
<p>Identifying the root cause of data issues is important to being able
to fix them, and control customer relationships. For instance, if a
particular customer is having a data quality issue, it is helpful to
know whether the origin of the issue was from data they gave you, or
from your processing of the data they gave you. In the former case,
there is real business value in being able to show the customer the
exact source of the issue, as well as your solution.</p>
<h3 id="finding-areas-for-improvement">Finding areas for improvement</h3>
<p>Related to blame is the ability to find sources of improvement in your
own processing pipeline and infrastructure. This means the steps in
your processing pipeline become data sources in their own right.</p>
<p>It’s useful to know, for instance, when and how you derived a certain
piece of data. Should an issue arise, you can immediately focus on
the place it was created. Conversely, if a particular processing stage
tends to produce excellent results, it is helpful to be able to
understand why that is so. Ideally you can then replicate this into
other parts of your system.</p>
<p>Organizationally, this type of knowledge also allows you to determine
where to focus your teams’ effort, and even to reorganize your team
structure. For example, you might want to place a new member of the
team on one of the infrastructure pieces that is doing well, and
should be a model for other pieces, so as to give them a good starting
place for learning the system. A more senior team member may be more
effective on pieces of the infrastructure that are struggling.</p>
<h2 id="immutability-borrowing-an-idea-from-functional-programming">Immutability: borrowing an idea from functional programming</h2>
<p>Considering the examples above, a core element of our strategy was
<em>immutability</em>: even though our processing pipeline transformed our
data several times over, we never changed (overwrote) the original
data.</p>
<p>This is an idea we borrowed from functional programming. Consider
imperative languages like C, Java and Python, in which data tends to
be mutable. For example, if we want to sort a list, we might call
<code>myList.sort()</code>. This will sort the list in-place. Consequently, all
references to <code>myList</code> will be changed. If we now want to review
<code>myList</code>’s original state, we’re out of luck: we should have made a
copy before calling <code>sort()</code>.</p>
<p>By comparison, functional languages like Haskell, Clojure and Erlang
tend to treat data as immutable. Our list sorting example becomes
something closer to <code>myNewSortedList = sort(myList)</code>. This retains the
unsorted list <code>myList</code>. One of the advantages of this immutability is
that many functions become simply the result of processing the values
passed in. Given a stack trace, we can often reproduce bugs
immediately.</p>
<p>With mutable data, there is no guarantee that the value of a
particular variable remains the same throughout the execution of the
function. Because of this, we can’t necessarily rely on a stack trace
to reproduce bugs.</p>
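<p>The difference is easy to see in Python, which offers both styles. This small sketch uses only the built-in <code>list.sort</code> and <code>sorted</code>; the wrapper names <code>sort_in_place</code> and <code>sort_to_copy</code> are just labels for the example.</p>
<pre><code>def sort_in_place(values):
    values.sort()  # mutates the caller's list
    return values


def sort_to_copy(values):
    return sorted(values)  # builds a new list, leaves the input alone


my_list = [3, 1, 2]
my_new_sorted_list = sort_to_copy(my_list)
print(my_list, my_new_sorted_list)

alias = my_list  # a second reference to the same list
sort_in_place(my_list)
print(alias)</code></pre>
<p>After the in-place sort, <code>alias</code> is sorted too, and the original ordering can no longer be recovered from either name.</p>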
<p>Concerning our data processing pipeline, we could save each step of
transformation and debug it later. For example, consider this
workflow:</p>
<pre><code>rawData = downloadFrom(someSite)
cleanData = cleanup(rawData)
newArtistData = extractNewArtists(cleanData)</code></pre>
<p>Let’s say we’ve uncovered a problem in the <code>cleanup()</code> function. We
would only have to correct the code and rerun that stage of the
pipeline. We never replaced <code>rawData</code> and hence it would be
available for any such debugging later.</p>
<p>To take further advantage of immutability, we persisted our data under a
compound key of identifier and
timestamp. This helped us find the exact inputs to any of our data
processing steps, which saved time when we had to debug an issue.</p>
<h2 id="an-example">An Example</h2>
<p>As an example, let me walk you through creating a news aggregation
site. Along the way, I’ll apply the lessons I describe above to
demonstrate how data traceability affects the various aspects of the
application.</p>
<p>Let’s say that our plan is to display the top stories of the day, with
the ability to drill down by topic. Each story will also have a link
to display coverage of the same event from other sources.</p>
<p>We’ll need to be able to do several things:</p>
<ol type="1">
<li>Crawl the web for news stories.</li>
<li>Determine a story’s popularity and timeliness based on social media
activity, and perhaps its source. (For example, we assume a story on
the New York Times homepage is important and/or popular).</li>
<li>Cluster stories about the same event together.</li>
<li>Determine event popularity. (Maybe this will be aggregate popularity
of the individual stories?)</li>
</ol>
<h3 id="crawlers">Crawlers</h3>
<p>We’ll seed our crawlers with a number of known news sites. Every so
often we’ll download the contents of the page and store it under a
composite key with URL, source and timestamp, or a relational database
row with these attributes. (Let’s say we crawl frequently-updated
pages several times a day, and just once a day for other pages.)</p>
<p>From each of these home pages we crawl, we’ll download the individual
linked stories. The stories will also be saved with URL, source and
timestamp attributes. Additionally, we’ll store the composite ID of
the homepage where we were linked to this story. That way if, for
example, later we suspect we have a bug with the way we assign story
popularity based on home page placement, we can review the home page
as it was retrieved at a particular point in time. Ideally we should
be able to trace data from our own homepage all the way back to the
original HTML that our crawler downloaded.</p>
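<p>Here is one way the crawl records could be linked, sketched in Python. The <code>PageKey</code> and <code>Page</code> tuples, the in-memory <code>pages</code> dictionary, and the placeholder URLs and timestamps are assumptions for illustration; a real crawler would persist these in a database.</p>
<pre><code>from collections import namedtuple

# Composite key: where the page came from and when we fetched it.
PageKey = namedtuple("PageKey", ["url", "source", "fetched_at"])
Page = namedtuple("Page", ["key", "html", "linked_from"])

pages = {}


def save(url, source, fetched_at, html, linked_from=None):
    """Store one crawled page; linked_from is the key of the referring page."""
    key = PageKey(url, source, fetched_at)
    pages[key] = Page(key, html, linked_from)
    return key


def trace(key):
    """Follow linked_from references from a page back to the crawl seed."""
    chain = []
    while key is not None:
        chain.append(key)
        key = pages[key].linked_from
    return chain


# Placeholder URLs and timestamps, for illustration only.
home = save("http://news.example.com/", "example-news",
            "2013-05-16T08:00:00Z", "(home page HTML)")
story = save("http://news.example.com/story-1", "example-news",
             "2013-05-16T08:01:00Z", "(story HTML)", linked_from=home)

for step in trace(story):
    print(step.url, step.fetched_at)</code></pre>
<p>Since the home page key includes its timestamp, <code>trace</code> leads back to the exact version of the home page that linked to the story, not whatever the site shows today.</p>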
<p>In order to help determine popularity, and to further feed
our news crawlers, we’ll also crawl social media
sites. Just like with the news crawlers, we’ll want
to keep a timestamped record of the HTML and other assets
we crawl. Again, this will let us go back later and debug
our code. One example of why this would be useful is if
we suspect we are incorrectly
counting links from shares of a particular article.</p>
<h3 id="change">Change</h3>
<p>Keeping previous versions of the sites we crawl allows for some
interesting analytics. Historically, how many articles does the Boston
Globe usually link to on their home page? Is there a larger variety of
news articles in the summer? Another useful byproduct of this is that
we can run new analytics on past data. Because immutability can give
us a basis from the past, we’re not confined to just the data we’ve
collected since we turned on our new analytics.</p>
<h3 id="clustering">Clustering</h3>
<p>Clustering data is a difficult problem. Outlying or mislabeled data
can completely change our clusters. For this reason, it is important to
be able to experiment cheaply (in human and compute time) with
rerunning our clustering with altered inputs. The inputs we alter may
remove data from a particular source, or add a new topic modelling
stage between crawling and clustering. In order to achieve this, our
infrastructure must be loosely coupled such that we can just as easily
provide inputs to our clustering system for testing as we do in production.</p>
<h3 id="popularity">Popularity</h3>
<p>Calculating story popularity shares many of the same issues as
clustering stories. As we experiment, or debug an issue, we want to
quickly test our changes and see the result. We also want to see the
most popular story on our own page and trace all the way through our
own processing steps, back to the origin site we crawled. If we find
out we’ve ranked a story as more popular than we would’ve liked, we
can trace it back to our origin crawl to see if, perhaps, we had put
too much weight in its position on its source site.</p>
<h2 id="conclusion">Conclusion</h2>
<p>You will need to debug data processing code and infrastructure just
like normal code. By taking advantage of techniques like immutability,
you can dramatically improve your ability to reason about your system.
Furthermore, we can draw from decades of experience in software design
to influence our data processing and infrastructure decisions.</p>
]]></summary>
</entry>
<entry>
    <title>Introducing Knockbox</title>
    <link href="reiddraper.com/introducing-knockbox/index.html" />
    <id>reiddraper.com/introducing-knockbox/index.html</id>
    <published>2011-12-10T00:00:00Z</published>
    <updated>2011-12-10T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1>Introducing Knockbox</h1>
<h3>Dec 10, 2011</h3>
<p>For the past few weeks I’ve been working on a Clojure
library called <a href="https://github.com/reiddraper/knockbox">knockbox</a>.
It’s a library meant to make dealing with conflict-resolution
in eventually-consistent databases easier. If you’re not familiar
with eventual-consistency, I’d suggest
<a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">this</a> article
by Amazon CTO Werner Vogels.</p>
<p>Distributed databases like <a href="https://github.com/basho/riak">Riak</a> let you trade
consistency for availability. This means that at any given moment,
all of the replicas of your data might not be synchronized.
In exchange for this, your database cluster can still operate when
all but one replica of your data is unavailable. Amazon’s shopping-cart
session state has been the iconic example. In their case, a write to add an
item to your cart may go to a replica that is not up to date. At some point,
the database notices that the replicas are in conflict, and you must resolve them.
But how do you do this? If a coffee maker is in one replica and not the other, what happened?
Was the coffee maker recently added and that just hasn’t been reflected in the other replica yet?
Or was the coffee maker recently deleted? It turns out that you often have to change the
way you represent your data in order to preserve the original intentions.</p>
<p>Developers who wanted to implement data-types with conflict-resolution semantics
have had to figure it out themselves, or read academic papers like
<a href="http://hal.archives-ouvertes.fr/inria-00555588/">A comprehensive study of Convergent and Commutative Replicated Data Types</a>.
<a href="https://github.com/mochi/statebox">statebox</a> was the first popular open source
project to help ease the burden for developers wanting to take advantage of
eventual-consistency. As I’ve been learning Clojure recently, I thought
I’d try my hand at putting together a similar library.</p>
<p>The main goal has been to have the types conform to all appropriate
Clojure Protocols and Java interfaces. This means my last-write-wins
set should quack like a normal Clojure set. This lets you reuse existing
code that expects normal Clojure data types. Next, I’ve defined
a <code>Resolvable</code> Protocol for all of these types to implement. There’s
only a single method, which looks like:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode clojure"><code class="sourceCode clojure"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>(<span class="kw">resolve</span> [a b])</span></code></pre></div>
<p>This function should take two conflicting objects and return a new,
resolved object.</p>
<p>Resolving a list of replicas (often called siblings when they’re in conflict)
is as simple as providing the <code>resolve</code> function to <code>reduce</code>. This is, however,
provided for you, as <code>knockbox.core/resolve</code>. Note that this function is in
a different namespace than the <code>resolve</code> that you implement as part of
the <code>Resolvable</code> Protocol (this lives in <code>knockbox.resolvable</code>).</p>
<p>There are currently two data-types implemented, sets and registers.
A register is simply a container for another type. I also intend to
implement counters, but have yet to come up with an implementation
that has space-efficiency and pruning characteristics that I like.</p>
<p>Let’s now create some conflicting replicas, and see how they
get resolved. Here we’ll use a last-write-wins (<code>lww</code>) set. The resolution
semantics used here are to use timestamps to resolve an add/delete
conflict for a particular item. This is not the same as using
timestamps for the whole set, because we’re doing it per
item. To get a REPL with the correct classpath, you
can either add <code>[knockbox "0.0.1-SNAPSHOT"]</code> to your <code>project.clj</code>,
or clone the knockbox repository and type <code>lein repl</code>.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode clojure"><code class="sourceCode clojure"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>(<span class="kw">require</span> <span class="at">&#39;knockbox.core</span>)</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>(<span class="kw">require</span> &#39;[knockbox.sets <span class="at">:as</span> kbsets])</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> original </span>(<span class="kw">into</span> (kbsets/lww) #{<span class="at">:mug</span> <span class="at">:kettle</span>}))</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> a </span>(<span class="kw">disj</span> original <span class="at">:kettle</span>))</span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> b </span>(<span class="kw">conj</span> original <span class="at">:coffee</span>))</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> c </span>(<span class="kw">conj</span> original <span class="at">:coffee-roaster</span>))</span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="co">;; this one wins because its</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="co">;; timestamp is later</span></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> d </span>(<span class="kw">disj</span> original <span class="at">:coffee-roaster</span>))</span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>(<span class="kw">println</span> (knockbox.core/resolve [a b c d]))</span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a><span class="co">; =&gt; #{:coffee :mug}</span></span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a><span class="co">;; notice that this is different</span></span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a><span class="co">;; than simply taking the union of</span></span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a><span class="co">;; the four sets</span></span>
<span id="cb2-20"><a href="#cb2-20" aria-hidden="true" tabindex="-1"></a>(<span class="kw">println</span> (clojure.set/union a b c d))</span>
<span id="cb2-21"><a href="#cb2-21" aria-hidden="true" tabindex="-1"></a><span class="co">; =&gt; #{:coffee :coffee-roaster :kettle :mug}</span></span></code></pre></div>
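<p>To make the per-item idea concrete, here’s a minimal sketch of last-write-wins resolution. It’s written in plain Ruby and is not how knockbox is implemented: the <code>lww_resolve</code> method, the integer timestamps, and the <code>[timestamp, operation]</code> records are all assumptions made for illustration.</p>
<pre><code># Each replica maps an item to its most recent [timestamp, operation] record.
# Resolution keeps, per item, whichever record has the latest timestamp.
def lww_resolve(replicas)
  latest = {}
  replicas.each do |replica|
    replica.each do |item, record|
      candidates = [latest[item], record].compact
      latest[item] = candidates.max_by { |timestamp, _operation| timestamp }
    end
  end
  # Only items whose winning record is an add are members of the set.
  latest.select { |_item, (_timestamp, operation)| operation == :add }.keys
end

original = { mug: [1, :add], kettle: [1, :add] }
a = original.merge(kettle: [2, :remove])
b = original.merge(coffee: [3, :add])
c = original.merge(coffee_roaster: [4, :add])
d = original.merge(coffee_roaster: [5, :remove])

p lww_resolve([a, b, c, d])</code></pre>
<p>With these made-up timestamps, the kettle and the coffee roaster are dropped because their latest record is a remove, while the mug and the coffee survive.</p>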
<p>Using timestamps is fine for some domains, but what if our update-rate is high
enough that we can’t trust our clocks to be synchronized enough? The
<code>observed-remove</code> set works by assigning a UUID to each addition. Deletes
will then override any UUIDs they have seen for a particular item in the set.
This means that when add/delete conflicts happen, addition will win because
the delete action couldn’t have seen the UUID created by the addition. Let’s
see this in action.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode clojure"><code class="sourceCode clojure"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>(<span class="kw">require</span> <span class="at">&#39;knockbox.core</span>)</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>(<span class="kw">require</span> &#39;[knockbox.sets <span class="at">:as</span> kbsets])</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> original </span>(<span class="kw">into</span> (kbsets/observed-remove) #{<span class="at">:gin</span> <span class="at">:rum</span>}))</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> a </span>(<span class="kw">conj</span> original <span class="at">:vodka</span>))</span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> b </span>(<span class="kw">conj</span> original <span class="at">:vodka</span>))</span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="co">;; we&#39;ve only seen the addition</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a><span class="co">;; of :vodka from a, not b</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a>(<span class="bu">def</span><span class="fu"> c </span>(<span class="kw">disj</span> a <span class="at">:vodka</span>))</span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a><span class="co">;; don&#39;t include a in here because</span></span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a><span class="co">;; vector clocks will take care of</span></span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a><span class="co">;; figuring out that c supersedes it</span></span>
<span id="cb3-16"><a href="#cb3-16" aria-hidden="true" tabindex="-1"></a>(<span class="kw">println</span> (knockbox.core/resolve [b c]))</span>
<span id="cb3-17"><a href="#cb3-17" aria-hidden="true" tabindex="-1"></a><span class="co">; =&gt; #{:vodka :gin :rum}</span></span></code></pre></div>
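<p>The same behaviour can be sketched in a few lines of plain Ruby. This is only a toy model of the idea, not knockbox’s implementation: the <code>ToyORSet</code> class and its method names are hypothetical, and it ignores vector clocks entirely.</p>
<pre><code>require "securerandom"
require "set"

# Toy observed-remove set: every add gets a fresh UUID tag, and a remove
# only cancels the tags it has actually seen.
class ToyORSet
  attr_reader :added, :removed

  def initialize(added = Set.new, removed = Set.new)
    @added = added      # [item, uuid] pairs, one per add
    @removed = removed  # pairs that some remove has observed
  end

  def add(item)
    ToyORSet.new(added | [[item, SecureRandom.uuid]], removed)
  end

  def remove(item)
    observed = added.select { |added_item, _tag| added_item == item }
    ToyORSet.new(added, removed | observed)
  end

  # Resolving two siblings is a union of both histories.
  def merge(other)
    ToyORSet.new(added | other.added, removed | other.removed)
  end

  def items
    (added - removed).map { |item, _tag| item }.to_set
  end
end

original = ToyORSet.new.add(:gin).add(:rum)
a = original.add(:vodka)
b = original.add(:vodka)
c = a.remove(:vodka)

# Resolving a list of siblings is a reduce over the pairwise merge.
resolved = [b, c].reduce { |left, right| left.merge(right) }
p resolved.items</code></pre>
<p>Here <code>:vodka</code> survives because the remove in <code>c</code> only saw the tag created by <code>a</code>, never the one created by <code>b</code>.</p>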
<p>That’s all for this first post, so go ahead and take a look
at <a href="https://github.com/reiddraper/knockbox">knockbox on github</a>.</p>
<p class="twitter_follow">If you like this post, you should follow me on
  <a href="http://twitter.com/reiddraper">twitter.</a>
</p>
]]></summary>
</entry>
<entry>
    <title>Writing Your First Chef Recipe</title>
    <link href="reiddraper.com/first-chef-recipe/index.html" />
    <id>reiddraper.com/first-chef-recipe/index.html</id>
    <published>2011-04-18 00:00:00</published>
    <updated>2011-04-18T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1>Writing Your First Chef Recipe</h1>
<h3>Apr 18, 2011</h3>
<p><a href="http://www.opscode.com/chef/" title="Opscode Chef">Chef</a> is an infrastructure automation tool
that lets you write Ruby code to describe how your machines should be set up.
Applications for Chef vary from configuring complicated multi-node applications, to
<a href="http://jtimberman.posterous.com/managing-my-workstations-with-chef">setting up your personal workstation</a>.</p>
<p>As great as Chef is, getting started can be a bit daunting. It’s worse if you’re not
sure exactly what Chef provides, and you’ve never written a lick of Ruby. This was
me a few days ago, so I thought I’d write a quick Chef introduction from
that perspective. In this tutorial, we’ll be creating a Chef recipe for the
popular database <a href="http://redis.io/" title="Redis">Redis</a>.</p>
<p>Before we get started, there are two terms we need to define: recipes and cookbooks.
In Chef, recipes are what you write to install and configure things
on your machine like Redis, sshd or Apache2.
A cookbook is a collection of related recipes. For example, the MySQL
cookbook might include two recipes, <code>mysql::client</code> and <code>mysql::server</code>.
A cookbook might also have a recipe for installing something via package management,
or from source.
Our Redis cookbook will contain just one recipe, which installs Redis
from source.</p>
<p>This recipe is available <a href="https://github.com/reiddraper/your-first-chef-recipe">on github</a>.</p>
<h2 id="getting-set-up">Getting Set Up</h2>
<p>The first thing you’ll want to do is:</p>
<pre><code>$ git clone https://github.com/opscode/chef-repo.git</code></pre>
<p>This gives us the skeleton of our cookbook repository. Next, we’ll create an empty
cookbook:</p>
<pre><code>$ cd chef-repo 
$ rake new_cookbook COOKBOOK=redis</code></pre>
<p>Our <code>rake</code> task created some folders we won’t need for this simple recipe, so we’ll remove them:</p>
<pre><code>$ cd cookbooks/redis/
$ rm -rf definitions/ files/ libraries/ providers/ resources/
$ cd ../..</code></pre>
<p>The folders we’ll be looking at are:</p>
<pre><code>cookbooks/redis
cookbooks/redis/attributes
cookbooks/redis/templates/default
cookbooks/redis/recipes</code></pre>
<p>Next we’ll create the files we’ll be editing to create our recipe:</p>
<pre><code>$ touch cookbooks/redis/attributes/default.rb
$ touch cookbooks/redis/recipes/default.rb
$ touch cookbooks/redis/templates/default/redis.conf.erb
$ touch cookbooks/redis/templates/default/redis.upstart.conf.erb</code></pre>
<p>To run and test our cookbook, we’ll be using <a href="http://vagrantup.com/">Vagrant</a>,
a tool for managing local virtual machines.
Instructions for installing Vagrant can be found
<a href="http://vagrantup.com/docs/getting-started/index.html" title="installation">here</a>.
Create a file called <code>Vagrantfile</code> in the root of the repository.
Edit it to look like this:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode ruby"><code class="sourceCode ruby"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="dt">Vagrant</span><span class="op">::</span><span class="dt">Config</span><span class="at">.run</span> <span class="cf">do</span> <span class="op">|</span>config<span class="op">|</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>  config<span class="at">.vm.box</span> <span class="op">=</span> <span class="st">&quot;lucid32&quot;</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>  config<span class="at">.vm.provision</span> <span class="wa">:chef_solo</span> <span class="cf">do</span> <span class="op">|</span>chef<span class="op">|</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>    chef<span class="at">.cookbooks_path</span> <span class="op">=</span> <span class="st">&quot;cookbooks&quot;</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>    chef<span class="at">.add_recipe</span> <span class="st">&quot;redis&quot;</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>    chef<span class="at">.log_level</span> <span class="op">=</span> <span class="wa">:debug</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>  <span class="cf">end</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span></code></pre></div>
<p>The two most important things to note here are that we’re telling our
VM to use Chef to install Redis, and that we want the log level set
to debug.</p>
<p>Now run this to download the Ubuntu 10.04 VM we’ll be using:</p>
<pre><code># note: this download is roughly 500MB
$ vagrant box add lucid32 http://files.vagrantup.com/lucid32.box</code></pre>
<h2 id="writing-our-recipe">Writing Our Recipe</h2>
<p>Now we are set up and ready to start writing our first recipe.
We’ll start by looking at <code>cookbooks/redis/metadata.rb</code>. It records
metadata about our cookbook, including other cookbooks it depends
on, and supported operating systems. For this tutorial, we don’t need to edit it.</p>
<h3 id="attributes">Attributes</h3>
<p>Next we’ll look at <code>cookbooks/redis/attributes/default.rb</code>,
which is where we’ll be defining the variable options for installing
and running Redis. Edit it to look like:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode ruby"><code class="sourceCode ruby"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:dir</span><span class="kw">]</span>       <span class="op">=</span> <span class="st">&quot;/etc/redis&quot;</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:data_dir</span><span class="kw">]</span>  <span class="op">=</span> <span class="st">&quot;/var/lib/redis&quot;</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:log_dir</span><span class="kw">]</span>   <span class="op">=</span> <span class="st">&quot;/var/log/redis&quot;</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="co"># one of: debug, verbose, notice, warning</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:loglevel</span><span class="kw">]</span>  <span class="op">=</span> <span class="st">&quot;notice&quot;</span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:user</span><span class="kw">]</span>      <span class="op">=</span> <span class="st">&quot;redis&quot;</span></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:port</span><span class="kw">]</span>      <span class="op">=</span> <span class="dv">6379</span></span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>default<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:bind</span><span class="kw">]</span>      <span class="op">=</span> <span class="st">&quot;127.0.0.1&quot;</span></span></code></pre></div>
<p>This file gives default values for configuration options.
The defaults can be overridden by a specific machine.
For example, on your development box you might want
the <code>data_dir</code> to be someplace different.
Since it’s just Ruby code,
we can also use control statements to change these defaults
based on things like the host OS. One of the most powerful
parts of Chef is that the attributes we’re defining here
will be available to all of our configuration file templates.
This means we only have to declare the <code>user</code> variable once,
and it will be used to create a new user, and start Redis running
as that same user. We’re programming our config files.</p>
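<p>As a rough illustration of the control-statement point, here’s a standalone Ruby sketch. The <code>redis_defaults</code> method, its <code>platform</code> argument, and the Mac OS X path are hypothetical stand-ins so the snippet runs outside of Chef; in a real attributes file, <code>default</code> is provided for you.</p>
<pre><code># Hypothetical helper: builds the defaults for a given platform name.
def redis_defaults(platform)
  default = Hash.new { |hash, key| hash[key] = {} }
  default[:redis][:data_dir] = "/var/lib/redis"

  # Plain Ruby control flow picks a different default on some hosts.
  if platform == "mac_os_x"
    default[:redis][:data_dir] = "/usr/local/var/redis"
  end

  default
end

puts redis_defaults("ubuntu")[:redis][:data_dir]
puts redis_defaults("mac_os_x")[:redis][:data_dir]</code></pre>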
<p>A quick note for the non-Ruby programmers out there: when you see
<code>:redis</code>, this is called a symbol. The short story is that it’s
a string just like <code>"redis"</code>, but is more memory efficient if
used more than once. In Python, one of the above lines might
look like:</p>
<pre><code>default[&quot;redis&quot;][&quot;dir&quot;] = &quot;/etc/redis&quot;</code></pre>
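<p>If you want to poke at symbols yourself, something like the following should work in <code>irb</code>. The <code>settings</code> hash is just a made-up example, and it’s a plain Ruby <code>Hash</code> rather than a Chef attribute object.</p>
<pre><code># A symbol and a string with the same characters convert to each other...
puts :redis.to_s == "redis"
puts "redis".to_sym == :redis

# ...and every mention of :redis refers to one shared object.
puts :redis.equal?(:redis)

# In a plain Hash they are still different keys, so stick to one style.
settings = { redis: { dir: "/etc/redis" } }
puts settings[:redis][:dir]
puts settings["redis"].inspect</code></pre>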
<h3 id="templates">Templates</h3>
<p>In Chef we use <a href="http://ruby-doc.org/stdlib/libdoc/erb/rdoc/classes/ERB.html">ERB</a>
templates to write our config files.
In this recipe we’re using two templates, one for the configuration to
<code>redis-server</code> and the other for <code>upstart</code>.
<a href="http://upstart.ubuntu.com/">Upstart</a> is a replacement for
<code>/etc/init.d/</code> scripts.
Edit <code>cookbooks/redis/templates/default/redis.conf.erb</code> to look like:</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="ex">port</span> <span class="op">&lt;</span>%= node<span class="pp">[</span><span class="ss">:redis</span><span class="pp">][</span><span class="ss">:port</span><span class="pp">]</span> %<span class="op">&gt;</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="bu">bind</span> <span class="op">&lt;</span>%= node<span class="pp">[</span><span class="ss">:redis</span><span class="pp">][</span><span class="ss">:bind</span><span class="pp">]</span> %<span class="op">&gt;</span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="ex">loglevel</span> <span class="op">&lt;</span>%= node<span class="pp">[</span><span class="ss">:redis</span><span class="pp">][</span><span class="ss">:loglevel</span><span class="pp">]</span> %<span class="op">&gt;</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="fu">dir</span> <span class="op">&lt;</span>%= node<span class="pp">[</span><span class="ss">:redis</span><span class="pp">][</span><span class="ss">:data_dir</span><span class="pp">]</span> %<span class="op">&gt;</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="ex">daemonize</span> no</span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="ex">logfile</span> stdout</span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="ex">databases</span> 16</span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="ex">save</span> 900 1</span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="ex">save</span> 300 10</span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a><span class="ex">save</span> 60 10000</span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="ex">rdbcompression</span> yes</span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="ex">dbfilename</span> dump.rdb</span></code></pre></div>
<p>and <code>cookbooks/redis/templates/default/redis.upstart.conf.erb</code> like:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co">#!upstart</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="ex">description</span> <span class="st">&quot;Redis Server&quot;</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a><span class="fu">env</span> USER=<span class="op">&lt;</span>%= node<span class="pp">[</span><span class="ss">:redis</span><span class="pp">][</span><span class="ss">:user</span><span class="pp">]</span> %<span class="op">&gt;</span></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a><span class="ex">start</span> on startup</span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="ex">stop</span> on shutdown</span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a><span class="ex">respawn</span></span>
<span id="cb11-10"><a href="#cb11-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-11"><a href="#cb11-11" aria-hidden="true" tabindex="-1"></a><span class="bu">exec</span> sudo <span class="at">-u</span> <span class="va">$USER</span> sh <span class="at">-c</span> <span class="st">&quot;/usr/local/bin/redis-server </span><span class="dt">\</span></span>
<span id="cb11-12"><a href="#cb11-12" aria-hidden="true" tabindex="-1"></a><span class="st">  /etc/redis/redis.conf &gt;&gt; </span><span class="dt">\</span></span>
<span id="cb11-13"><a href="#cb11-13" aria-hidden="true" tabindex="-1"></a><span class="st">  &lt;%= node[:redis][:log_dir] %&gt;/redis.log 2&gt;&amp;1&quot;</span></span></code></pre></div>
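<p>To see what happens to those <code>&lt;%= ... %&gt;</code> tags, you can render a template by hand with Ruby’s standard <code>erb</code> library. This is only a sketch: the <code>node</code> hash below is a stand-in I made up for the attributes Chef would pass in, and Chef does the rendering for you when the recipe runs.</p>
<pre><code>require "erb"

# Stand-in for the node attributes (an assumption for this sketch).
node = { redis: { port: 6379, bind: "127.0.0.1" } }

# Two lines borrowed from the redis.conf.erb template above.
template = "port &lt;%= node[:redis][:port] %&gt;\nbind &lt;%= node[:redis][:bind] %&gt;\n"

# Each tag is replaced by the value of the Ruby expression inside it.
rendered = ERB.new(template).result_with_hash(node: node)
puts rendered</code></pre>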
<h3 id="the-recipe-file">The Recipe File</h3>
<p>Now it’s time to write the actual recipe.
Having little Ruby experience, I’ll have to do
some hand-waving in explaining that the following
code is both Chef’s DSL, and perfectly valid
Ruby code.</p>
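<p>One way to convince yourself of the “valid Ruby” part is a toy version of a resource. Everything here is hypothetical (the <code>TOY_RESOURCES</code> list and these definitions of <code>package</code> and <code>action</code> are mine, not Chef’s); it only shows that a resource block is an ordinary method call that takes a block.</p>
<pre><code># Toy stand-ins (assumptions), only to show that the DSL is ordinary Ruby.
TOY_RESOURCES = []

# `package "name" do ... end` is a method call with a string and a block.
def package(name)
  TOY_RESOURCES.push({ type: :package, name: name })
  yield
end

# `action :install` is another method call; here it records onto the newest resource.
def action(value)
  TOY_RESOURCES.last[:action] = value
end

package "build-essential" do
  action :install
end

p TOY_RESOURCES</code></pre>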
<p>The following code is run from top to bottom. It uses Chef
<a href="http://wiki.opscode.com/display/chef/Resources">resources</a>
to create a user, make directories, download and compile Redis,
and write out the templates.</p>
<p>Edit <code>cookbooks/redis/recipes/default.rb</code>
to look like:</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode ruby"><code class="sourceCode ruby"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>package <span class="st">&quot;build-essential&quot;</span> <span class="cf">do</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:install</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a>user node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:user</span><span class="kw">]</span> <span class="cf">do</span></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:create</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a>  <span class="fu">system</span> <span class="dv">true</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a>  shell <span class="st">&quot;/bin/false&quot;</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a>directory node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:dir</span><span class="kw">]</span> <span class="cf">do</span></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a>  owner <span class="st">&quot;root&quot;</span></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a>  mode <span class="st">&quot;0755&quot;</span></span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:create</span></span>
<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-17"><a href="#cb12-17" aria-hidden="true" tabindex="-1"></a>directory node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:data_dir</span><span class="kw">]</span> <span class="cf">do</span></span>
<span id="cb12-18"><a href="#cb12-18" aria-hidden="true" tabindex="-1"></a>  owner node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:user</span><span class="kw">]</span></span>
<span id="cb12-19"><a href="#cb12-19" aria-hidden="true" tabindex="-1"></a>  mode <span class="st">&quot;0755&quot;</span></span>
<span id="cb12-20"><a href="#cb12-20" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:create</span></span>
<span id="cb12-21"><a href="#cb12-21" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-22"><a href="#cb12-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-23"><a href="#cb12-23" aria-hidden="true" tabindex="-1"></a>directory node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:log_dir</span><span class="kw">]</span> <span class="cf">do</span></span>
<span id="cb12-24"><a href="#cb12-24" aria-hidden="true" tabindex="-1"></a>  mode <span class="bn">0755</span></span>
<span id="cb12-25"><a href="#cb12-25" aria-hidden="true" tabindex="-1"></a>  owner node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:user</span><span class="kw">]</span></span>
<span id="cb12-26"><a href="#cb12-26" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:create</span></span>
<span id="cb12-27"><a href="#cb12-27" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-28"><a href="#cb12-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-29"><a href="#cb12-29" aria-hidden="true" tabindex="-1"></a>remote_file <span class="st">&quot;</span><span class="sc">#{</span><span class="dt">Chef</span><span class="op">::</span><span class="dt">Config</span><span class="kw">[</span><span class="wa">:file_cache_path</span><span class="kw">]</span><span class="sc">}</span><span class="st">/redis.tar.gz&quot;</span> <span class="cf">do</span></span>
<span id="cb12-30"><a href="#cb12-30" aria-hidden="true" tabindex="-1"></a>  source <span class="st">&quot;https://github.com/antirez/redis/tarball/v2.0.4-stable&quot;</span></span>
<span id="cb12-31"><a href="#cb12-31" aria-hidden="true" tabindex="-1"></a>  action <span class="wa">:create_if_missing</span></span>
<span id="cb12-32"><a href="#cb12-32" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-33"><a href="#cb12-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-34"><a href="#cb12-34" aria-hidden="true" tabindex="-1"></a>bash <span class="st">&quot;compile_redis_source&quot;</span> <span class="cf">do</span></span>
<span id="cb12-35"><a href="#cb12-35" aria-hidden="true" tabindex="-1"></a>  cwd <span class="dt">Chef</span><span class="op">::</span><span class="dt">Config</span><span class="kw">[</span><span class="wa">:file_cache_path</span><span class="kw">]</span></span>
<span id="cb12-36"><a href="#cb12-36" aria-hidden="true" tabindex="-1"></a>  code <span class="op">&lt;&lt;-</span><span class="cn">EOH</span></span>
<span id="cb12-37"><a href="#cb12-37" aria-hidden="true" tabindex="-1"></a>    tar zxf redis<span class="at">.tar.gz</span></span>
<span id="cb12-38"><a href="#cb12-38" aria-hidden="true" tabindex="-1"></a>    cd antirez<span class="op">-</span>redis<span class="op">-</span><span class="dv">55479</span><span class="er">a7</span></span>
<span id="cb12-39"><a href="#cb12-39" aria-hidden="true" tabindex="-1"></a>    make <span class="op">&amp;&amp;</span> make install</span>
<span id="cb12-40"><a href="#cb12-40" aria-hidden="true" tabindex="-1"></a>  <span class="cn">EOH</span></span>
<span id="cb12-41"><a href="#cb12-41" aria-hidden="true" tabindex="-1"></a>  creates <span class="st">&quot;/usr/local/bin/redis-server&quot;</span></span>
<span id="cb12-42"><a href="#cb12-42" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-43"><a href="#cb12-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-44"><a href="#cb12-44" aria-hidden="true" tabindex="-1"></a>service <span class="st">&quot;redis&quot;</span> <span class="cf">do</span></span>
<span id="cb12-45"><a href="#cb12-45" aria-hidden="true" tabindex="-1"></a>  provider <span class="dt">Chef</span><span class="op">::</span><span class="dt">Provider</span><span class="op">::</span><span class="dt">Service</span><span class="op">::</span><span class="dt">Upstart</span></span>
<span id="cb12-46"><a href="#cb12-46" aria-hidden="true" tabindex="-1"></a>  subscribes <span class="wa">:restart</span>, resources(<span class="wa">:bash</span> <span class="op">=&gt;</span> <span class="st">&quot;compile_redis_source&quot;</span>)</span>
<span id="cb12-47"><a href="#cb12-47" aria-hidden="true" tabindex="-1"></a>  supports <span class="wa">:restart</span> <span class="op">=&gt;</span> <span class="dv">true</span>, <span class="wa">:start</span> <span class="op">=&gt;</span> <span class="dv">true</span>, <span class="wa">:stop</span> <span class="op">=&gt;</span> <span class="dv">true</span></span>
<span id="cb12-48"><a href="#cb12-48" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-49"><a href="#cb12-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-50"><a href="#cb12-50" aria-hidden="true" tabindex="-1"></a>template <span class="st">&quot;redis.conf&quot;</span> <span class="cf">do</span></span>
<span id="cb12-51"><a href="#cb12-51" aria-hidden="true" tabindex="-1"></a>  path <span class="st">&quot;</span><span class="sc">#{</span>node<span class="kw">[</span><span class="wa">:redis</span><span class="kw">][</span><span class="wa">:dir</span><span class="kw">]</span><span class="sc">}</span><span class="st">/redis.conf&quot;</span></span>
<span id="cb12-52"><a href="#cb12-52" aria-hidden="true" tabindex="-1"></a>  source <span class="st">&quot;redis.conf.erb&quot;</span></span>
<span id="cb12-53"><a href="#cb12-53" aria-hidden="true" tabindex="-1"></a>  owner <span class="st">&quot;root&quot;</span></span>
<span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a>  group <span class="st">&quot;root&quot;</span></span>
<span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a>  mode <span class="st">&quot;0644&quot;</span></span>
<span id="cb12-56"><a href="#cb12-56" aria-hidden="true" tabindex="-1"></a>  notifies <span class="wa">:restart</span>, resources(<span class="wa">:service</span> <span class="op">=&gt;</span> <span class="st">&quot;redis&quot;</span>)</span>
<span id="cb12-57"><a href="#cb12-57" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-58"><a href="#cb12-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-59"><a href="#cb12-59" aria-hidden="true" tabindex="-1"></a>template <span class="st">&quot;redis.upstart.conf&quot;</span> <span class="cf">do</span></span>
<span id="cb12-60"><a href="#cb12-60" aria-hidden="true" tabindex="-1"></a>  path <span class="st">&quot;/etc/init/redis.conf&quot;</span></span>
<span id="cb12-61"><a href="#cb12-61" aria-hidden="true" tabindex="-1"></a>  source <span class="st">&quot;redis.upstart.conf.erb&quot;</span></span>
<span id="cb12-62"><a href="#cb12-62" aria-hidden="true" tabindex="-1"></a>  owner <span class="st">&quot;root&quot;</span></span>
<span id="cb12-63"><a href="#cb12-63" aria-hidden="true" tabindex="-1"></a>  group <span class="st">&quot;root&quot;</span></span>
<span id="cb12-64"><a href="#cb12-64" aria-hidden="true" tabindex="-1"></a>  mode <span class="st">&quot;0644&quot;</span></span>
<span id="cb12-65"><a href="#cb12-65" aria-hidden="true" tabindex="-1"></a>  notifies <span class="wa">:restart</span>, resources(<span class="wa">:service</span> <span class="op">=&gt;</span> <span class="st">&quot;redis&quot;</span>)</span>
<span id="cb12-66"><a href="#cb12-66" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span>
<span id="cb12-67"><a href="#cb12-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-68"><a href="#cb12-68" aria-hidden="true" tabindex="-1"></a>service <span class="st">&quot;redis&quot;</span> <span class="cf">do</span></span>
<span id="cb12-69"><a href="#cb12-69" aria-hidden="true" tabindex="-1"></a>  action <span class="kw">[</span><span class="wa">:enable</span>, <span class="wa">:start</span><span class="kw">]</span></span>
<span id="cb12-70"><a href="#cb12-70" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span></code></pre></div>
<h2 id="trying-our-recipe">Trying Our Recipe</h2>
<p>Now that we’ve written our recipe, it’s time to try it out. In the root of your
repository, run <code>vagrant up</code>. This will start the virtual machine and set up
Redis using Chef. Once the command finishes, run this:</p>
<pre><code>$ vagrant ssh
$ echo &quot;ping&quot; | nc localhost 6379
$ exit</code></pre>
<p>If all went well, you should have seen <code>+PONG</code>. If you change something and
want to re-run Chef, type <code>vagrant provision</code>.</p>
<p>When you’re done working, run <code>vagrant destroy</code> to shut down and delete the virtual machine, reclaiming its RAM and disk space.</p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>Chef is much more powerful than what I’ve presented, but I hope I’ve been
able to show how easy it is to get started writing and editing recipes.
If you’d like to learn more about Chef, check out the
<a href="http://wiki.opscode.com/display/chef/Home">Opscode wiki</a>.</p>
<p class="twitter_follow">If you like this post, you should follow me on
  <a href="http://twitter.com/reiddraper">Twitter</a>.
</p>
]]></summary>
</entry>
<entry>
    <title>100-Node Riak Cluster for $2</title>
    <link href="reiddraper.com/100-node-riak-cluster/index.html" />
    <id>reiddraper.com/100-node-riak-cluster/index.html</id>
    <published>2011-04-03 00:00:00</published>
    <updated>2011-04-03T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1>100-Node Riak Cluster for $2</h1>
<h3>Apr  3, 2011</h3>
<p>Riak is a distributed key-value store; data is replicated and partitioned
across your cluster. Increasing the cluster size allows you to scale both performance and
fault-tolerance. One of the most powerful parts of <a href="http://wiki.basho.com/" title="Riak">Riak</a> is the ability to add a new node to your cluster with one command:</p>
<pre><code>riak-admin join riak@example.com</code></pre>
<p>With the recent trend toward <a href="http://en.wikipedia.org/wiki/DevOps" title="DevOps">operations-as-code</a>,
I thought I would challenge myself to write a script to set up a 100-node Riak cluster with
one command. Using <a href="http://aws.amazon.com/about-aws/whats-new/2010/09/09/announcing-micro-instances-for-amazon-ec2/" title="micro instances">Amazon EC2 micro-instances</a>, the cluster costs $2 to run for an hour.</p>
<p>Riak works by splitting a 160-bit hash-space into a fixed number of
<a href="http://wiki.basho.com/How-Things-Work.html#The-Ring" title="vnode">virtual nodes (vnodes)</a>, say 1024.
Each physical node is then responsible for roughly <code>1024 / N</code> vnodes, where <code>N</code> is the number of physical
nodes in the cluster. As a new node joins, it takes over some vnodes from the rest of
the cluster.</p>
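<p>To make the arithmetic concrete, here is a minimal Python sketch of an even split. It is only an illustration: the function name <code>vnodes_per_node</code> is hypothetical, and it models an idealized even distribution rather than Riak’s actual claim algorithm.</p>
<pre><code>def vnodes_per_node(ring_size, node_count):
    # Hypothetical helper: idealized even split, not Riak's claim logic.
    base, extra = divmod(ring_size, node_count)
    # `extra` nodes carry one more vnode than the rest.
    return [base + 1] * extra + [base] * (node_count - extra)


counts = vnodes_per_node(1024, 100)
print(max(counts), min(counts), sum(counts))  # prints: 11 10 1024</code></pre>
<p>With 1024 vnodes and 100 nodes, 24 nodes would hold 11 vnodes and the other 76 would hold 10.</p>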
<p>I’ve written a simple Python script to launch a 100-node cluster.
The script launches a master node, and notes its IP address.
The other 99 nodes are launched and told to join the master. Riak doesn’t currently have provisions
to deal with many nodes trying to join the cluster at once. To avoid the
<a href="http://en.wikipedia.org/wiki/Thundering_herd_problem" title="thundering herd problem">thundering-herd problem</a>
I simply have each node sleep for a random time, such that nodes are joining, on average,
one every 15 seconds. Some sort of queueing system, and
<a href="https://issues.basho.com/show_bug.cgi?id=869" title="bug 869">this bugfix</a>, would eliminate the
need for nodes to stagger their join requests. <a href="https://gist.github.com/891586">Here is</a> a snippet
from the Riak IRC channel about this.
I didn’t get a chance to try it, but if you use Chef Server, there’s also a
<a href="https://github.com/opscode/cookbooks/blob/master/riak/providers/cluster.rb" title="cluster recipe">Riak cluster recipe</a>.</p>
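<p>The staggering idea can be sketched in a few lines of Python. This is not the code from the launch script, just an illustration under one assumption: each node draws its delay uniformly from a window of <code>15 * N</code> seconds, which spreads <code>N</code> joins out to roughly one every 15 seconds on average. The names <code>join_delay</code> and <code>AVERAGE_GAP_SECONDS</code> are made up for this example.</p>
<pre><code>import random

AVERAGE_GAP_SECONDS = 15  # target average spacing between joins


def join_delay(node_count, rng=random):
    # Uniform delay over a window sized so that node_count joins
    # average out to one every AVERAGE_GAP_SECONDS.
    return rng.uniform(0, node_count * AVERAGE_GAP_SECONDS)


# Each of the 99 joining nodes would sleep for its own delay, for example
# time.sleep(join_delay(99)), before running: riak-admin join riak@example.com
delays = sorted(join_delay(99) for _ in range(99))
print(round(delays[-1] / 60, 1), "minutes until the last join")</code></pre>
<p>With 99 joining nodes the window is 99 × 15 seconds, or just under 25 minutes. Random delays only make simultaneous joins unlikely; they do not rule them out.</p>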
<p>After getting my script working with a 20-node cluster, I tried to launch 100, only to learn
that AWS accounts are, by default, limited to 20 instances. Fortunately, the
<a href="http://aws.amazon.com/ec2/spot-instances/" title="EC2 spot instance">spot instance</a> limit is
100, so I was able to use those.</p>
<p>The script is simple, and usage looks like:</p>
<pre><code>./launch.py keypair ~/.ssh/keypair.pem user_data.sh 100</code></pre>
<p>Approximately 35 minutes after running the script, I had a 95-node cluster. The command
<code>riak-admin ringready</code> told me that two nodes were down. After starting them,
I had a 97-node cluster. I wasn’t able to
diagnose the problem with the other three nodes.
I was impressed with how easy it was to automate Riak, and it’s clear that
<a href="http://www.basho.com/" title="Basho">Basho</a> has plans to make things even easier.</p>
<p>Now is a good time to note that the script doesn’t launch a truly production-ready cluster.
For starters, it probably isn’t a good idea to use spot instances for a database.
You would also be wise to have a smaller number of more powerful machines, rather than
100 micro instances. Finally, I would recommend something like
<a href="http://www.opscode.com/chef/" title="Chef">Chef</a> for more complicated infrastructure automation.</p>
<p>If you’d like to run your own 100-node cluster, check out
<a href="https://github.com/reiddraper/riak-ec2-cluster-launcher">this GitHub repository</a>.
If you decide to keep your cluster up for more than an hour,
<a href="http://wiki.basho.com/Sample-Data.html" title="Riak sample data">here’s some data</a>
to play with.</p>
<p>It’s exciting to see how infrastructure automation is making it easy for small teams
to build massive systems in short periods of time. Databases like Riak fit perfectly
with this, as their administrative cost is low, and configuration remains simple
regardless of how many nodes are in the cluster.</p>
<p>For those of you considering writing something similar, I highly recommend trying
<a href="http://vagrantup.com/" title="Vagrant">Vagrant</a> for testing virtual machine setups before
spending a dime on EC2.</p>
]]></summary>
</entry>

</feed>
