<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Karl Seguin</title>
 
 <link href="http://openmymind.net/" />
 <updated>2012-02-18T18:27:50-08:00</updated>
 <id>http://openmymind.net/</id>
 <author>
   <name>Karl Seguin</name>
   <email>karl@openmymind.net</email>
 </author>

 
 <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/KarlSeguinsBlog" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="karlseguinsblog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
   <title>Let's Build Something Using Amazon's DynamoDB</title>
   <link href="http://openmymind.net/2012/2/6/Lets-Build-Something-Using-Amazons-DynamoDB" />
   <updated>2012-02-06T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/2/6/Lets-Build-Something-Using-Amazons-DynamoDB</id>
   <content type="html">&lt;p&gt;A couple weeks ago, Amazon released DynamoDB as part of AWS. DynamoDB is a NoSQL database with a focus on scalability, reliability and performance. DynamoDB has generated a lot of excitement, if for no other reason than the fact that Amazon is an authoritative figure in the NoSQL space. Their Dynamo paper, published in 2007, has been exceedingly influential.&lt;/p&gt;

&lt;h3&gt;DynamoDB At A Glance&lt;/h3&gt;
&lt;p&gt;The most important thing to understand about DynamoDB is that it doesn't support secondary indexes. Data can only be retrieved by the key. However, there are two types of keys. The first is called a hash. It's a single value and it's what you would normally think of when you are talking about a key. The second is a composite of a hash and a range. This type of key lets you query data by either the hash component or the hash and range component. Additionally, records are automatically sorted by the range component.&lt;/p&gt;

&lt;p&gt;I know, that's pretty vague and it sounds a little crazy. If you've never dealt with this sort of system, you might think it far too limiting. You will have to model your data differently, but hopefully when we get our hands dirty, not only will it make sense, but it won't seem so odd.&lt;/p&gt;

&lt;p&gt;Beyond this technical point, the draw of DynamoDB is all about the infrastructure. You get fast and reliable performance (which has historically been a major shortcoming of storage solutions using AWS/EBS), transparent scalability and reliability via replication. It essentially makes it possible to scale up to extreme levels without having to do anything special.&lt;/p&gt;


&lt;h3&gt;&amp;lt;Application&amp;gt;&lt;/h3&gt;
&lt;p&gt;The application that we are going to build is a simple API that can be used to store and retrieve change logs. A change log record would look something like: &lt;/p&gt;

&lt;pre class="brush: js"&gt;
{
  user: 'leto',
  asset: 'video-of-sand',
  changes: [
    {field: 'rating', old: 3, new: 10},
    {field: 'author', old: 'paaul', new: 'paul'}
  ]
}
&lt;/pre&gt;

&lt;p&gt;Once saved, we'll be able to retrieve a change log item by id. We'll also be able to get all the change log items made by a user, or for an asset, or a combination of the two. If you think about it, it's the core of what we'd need if we were building an audit log for a system.&lt;/p&gt;

&lt;p&gt;We'll only look at the DynamoDB-related parts, but if you are interested, you can get the working example from &lt;a href="https://github.com/karlseguin/auditor"&gt;github&lt;/a&gt;. It's written using node.js + CoffeeScript along with a 3rd party DynamoDB driver which I've contributed to.&lt;/p&gt;


&lt;h3&gt;Saving Change Logs&lt;/h3&gt;
&lt;p&gt;The first thing we'll do is save a change log. To do that, we must first create a table in DynamoDB: &lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
ddb.createTable 'logs', { hash: ['id', ddb.schemaTypes().string] }, {read: 10, write: 10} , -&gt;
&lt;/pre&gt;

&lt;p&gt;The above code creates a table named &lt;code&gt;logs&lt;/code&gt; which will use a hash key (as opposed to a hash+range key). We've named the key field &lt;code&gt;id&lt;/code&gt; and said it'll be a &lt;code&gt;string&lt;/code&gt;. The &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; values have to do with how DynamoDB distributes workload and scales. They are the expected read and write capacity, measured in what Amazon calls a &lt;em&gt;capacity unit&lt;/em&gt;, which is 1KB read or written per second. For the purpose of this post, it really doesn't matter.&lt;/p&gt;

&lt;p&gt;Now that our table is created, we can write new log records: &lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
class Log
  constructor: (@user, @asset, @changes) -&gt;
    @id = uuid.v4()
    @created = new Date()

  save: (callback) =&gt;
    ddb.putItem 'logs', this.serialize(), {}, callback

  serialize: =&gt;
    {id: @id, user: @user, asset: @asset, created: @created.getTime(), changes: JSON.stringify(@changes)}
&lt;/pre&gt;

&lt;p&gt;The above code could be used like so: &lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
app.post '/log', (req, res, next) -&gt;
  log = new Log(req.body.user, req.body.asset, req.body.changes)
  log.save -&gt;
    location = config.api.locationRoot + 'log/' + log.id
    res.writeHead(201, {'Content-Type': 'application/json', 'Location': location})
    res.end(JSON.stringify(log))
&lt;/pre&gt;

&lt;p&gt;The most important line in all of that is &lt;code&gt;ddb.putItem('logs', this.serialize(), {}, callback)&lt;/code&gt; which is where the data is actually sent to DynamoDB. &lt;code&gt;putItem&lt;/code&gt; can be used to do inserts or upserts, which is what the 3rd parameter controls (we left it blank which makes it default to upsert).&lt;/p&gt;

&lt;p&gt;There are a couple things worth taking a good look at. First of all, DynamoDB only supports strings and numbers, which is where the &lt;code&gt;serialize&lt;/code&gt; method comes into play. Our &lt;code&gt;created&lt;/code&gt; date is converted to a number, and &lt;code&gt;changes&lt;/code&gt; is turned into a string (a real app might be interested in storing this as a compressed value). DynamoDB doesn't support embedded objects like some other NoSQL solutions, so &lt;code&gt;changes&lt;/code&gt; can't be stored as-is. Besides this, all DynamoDB really cares about is that we provide a field with the name and type that we defined our table key as, which we do with the &lt;code&gt;id&lt;/code&gt; field.&lt;/p&gt;
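
&lt;p&gt;As a rough sketch of that flattening step in plain JavaScript, the round trip looks something like the following. The names &lt;code&gt;serializeLog&lt;/code&gt; and &lt;code&gt;deserializeLog&lt;/code&gt; are made up for illustration and aren't part of any driver, and the sample values are invented.&lt;/p&gt;

```javascript
// Flatten a change log into values DynamoDB can store: strings and numbers only.
function serializeLog(log) {
  return {
    id: log.id,
    user: log.user,
    asset: log.asset,
    created: log.created.getTime(),       // Date becomes a number (ms since epoch)
    changes: JSON.stringify(log.changes)  // array of objects becomes a string
  };
}

// Undo the flattening when the record is read back.
function deserializeLog(data) {
  return {
    id: data.id,
    user: data.user,
    asset: data.asset,
    created: new Date(data.created),
    changes: JSON.parse(data.changes)
  };
}

// Invented sample record, only here to exercise the two functions.
var log = {
  id: 'abc-123',
  user: 'leto',
  asset: 'video-of-sand',
  created: new Date(1328515200000),
  changes: [{field: 'rating', old: 3, new: 10}]
};
var stored = serializeLog(log);
var restored = deserializeLog(stored);
console.log(typeof stored.created, typeof stored.changes);
console.log(restored.changes[0].field);
```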

&lt;h3&gt;Getting a Change Log&lt;/h3&gt;
&lt;p&gt;Next we want to make it so that change logs can be retrieved by id. So, given the following code:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
app.get '/log/:id', (req, res, next) -&gt;
  Log.load req.params.id, (err, log) -&gt;
    res.end(JSON.stringify(log))
&lt;/pre&gt;

&lt;p&gt;We'll write our &lt;code&gt;Log.load&lt;/code&gt; method as:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
class Log
  ...

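  @load: (id, callback) =&gt;
    ddb.getItem 'logs', id, null, {}, (err, res) =&gt;
      # pass driver errors through instead of trying to deserialize a missing result
      return callback(err) if err?
      callback(null, this.deserialize(res))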

  @deserialize: (data) =&gt;
    data.created = new Date(data.created)
    data.changes = JSON.parse(data.changes)
    return data
&lt;/pre&gt;


&lt;p&gt;The first parameter is the name of the table, next is the hash value we want to retrieve. The next two parameters are the range key value (which we'll never have with this table since it uses a hash key only) and an options parameter, used to specify things such as which fields to get. Our &lt;code&gt;deserialize&lt;/code&gt; method undoes the work &lt;code&gt;serialize&lt;/code&gt; did when we first stored our record.&lt;/p&gt;

&lt;h3&gt;Searching&lt;/h3&gt;
&lt;p&gt;So far we've kept things simple. Creating a table involved identifying our key, inserting involved sending along a bunch of attributes, and to get a specific item we submitted its id. Few apps can be built with just that functionality. In fact, even for our simple demo app, it's unlikely that we'll ever want individual change logs. Rather, we'll want change logs belonging to an asset, or possibly a user.&lt;/p&gt;

&lt;p&gt;To achieve this, we need to maintain our own indexes. An index is nothing more than the value we are indexing and a reference to the record the value belongs to. We can achieve this by using another table. And, while we are at it, it makes sense to get change logs back ordered by creation date. Let's visualize what we are talking about: &lt;/p&gt;

&lt;pre class="brush: text"&gt;
logs table:
id   |   user   |  asset  |  created  |  changes
1        leto      sand      4909449     ....
2        leto      spice     4939494     ....
3        paul      spice     5001001     ....

logs_by_asset table:
asset  |  created   |  id
sand      4909449      1
spice     4939494      2
spice     5001001      3
&lt;/pre&gt;

&lt;p&gt;Now, if we set the key of &lt;code&gt;logs_by_asset&lt;/code&gt; to be a hash on &lt;code&gt;asset&lt;/code&gt; and a range  on &lt;code&gt;created&lt;/code&gt; we'll be able to find change logs by asset. How? Well, first we'll get all the ids which belong to a certain asset via &lt;code&gt;logs_by_asset&lt;/code&gt;, then we can retrieve those records by id in &lt;code&gt;logs&lt;/code&gt;.&lt;/p&gt;
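
&lt;p&gt;To make the two-phase idea concrete before touching DynamoDB, here's a small in-memory sketch in plain JavaScript. The object and array stand in for the two tables, and &lt;code&gt;queryByHash&lt;/code&gt; and &lt;code&gt;findByAsset&lt;/code&gt; are hypothetical helpers, not driver calls. They only mimic the behaviour described above: match on the hash, get results back ordered by the range.&lt;/p&gt;

```javascript
// In-memory stand-ins for the two tables (these are not DynamoDB calls).
var logs = {
  '1': {id: '1', user: 'leto', asset: 'sand', created: 4909449},
  '2': {id: '2', user: 'leto', asset: 'spice', created: 4939494},
  '3': {id: '3', user: 'paul', asset: 'spice', created: 5001001}
};
var logsByAsset = [
  {asset: 'spice', created: 5001001, id: '3'},
  {asset: 'sand', created: 4909449, id: '1'},
  {asset: 'spice', created: 4939494, id: '2'}
];

// Hypothetical helper: rows matching the hash value, ordered by the range value.
function queryByHash(table, hashField, rangeField, hashValue) {
  return table
    .filter(function(row) { return row[hashField] === hashValue; })
    .sort(function(a, b) { return a[rangeField] - b[rangeField]; });
}

// Phase 1: ids from the index table. Phase 2: full records from the main table.
function findByAsset(asset) {
  var ids = queryByHash(logsByAsset, 'asset', 'created', asset)
    .map(function(row) { return row.id; });
  return ids.map(function(id) { return logs[id]; });
}

var spiceLogs = findByAsset('spice');
console.log(spiceLogs.map(function(l) { return l.id; }));
```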

&lt;p&gt;First thing we need to do is create our new table:&lt;/p&gt;
&lt;pre class="brush: ruby"&gt;
ddb.createTable 'logs_by_asset', { hash: ['asset', ddb.schemaTypes().string], range: ['created', ddb.schemaTypes().number] }, {read: 10, write: 10} , -&gt;
&lt;/pre&gt;

&lt;p&gt;Next we change our &lt;code&gt;save&lt;/code&gt; method to also create the index:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
save: (callback) =&gt;
  ddb.putItem 'logs', this.serialize(), {}, (err, res) =&gt;
    ddb.putItem 'logs_by_asset', {asset: @asset, created: @created.getTime(), id: @id}, {}, -&gt;
    callback(err, res)
&lt;/pre&gt;

&lt;p&gt;Finally, we can do our two-phase lookup:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
@find: (asset, callback) =&gt;
  ddb.query 'logs_by_asset', asset, null, {attributesToGet: ['id']}, (err, res) =&gt;
    ids = (key.id for key in res.Items)
    ddb.batchGetItem {table: 'logs', keys: ids}, (err, res) =&gt;
      items = []
      for item in res.Items
        items.push(this.deserialize(item))
      callback(null, items)
&lt;/pre&gt;

&lt;p&gt;There's a bit going on here. First, we get all the ids for a specific asset. If we wanted to, we could also specify a created value (a table with a composite key can be queried by hash or hash and range). We transform those ids into an array because they come back looking like &lt;code&gt;[{id: '1'}, {id: '2'}, {id: '3'}]&lt;/code&gt; and we want to just query via &lt;code&gt;['1', '2', '3']&lt;/code&gt;. Finally we use &lt;code&gt;batchGetItem&lt;/code&gt; to get all the change logs that match the ids. As you can probably guess, &lt;code&gt;batchGetItem&lt;/code&gt; works a lot like &lt;code&gt;getItem&lt;/code&gt; except that it takes an array of keys (in fact, it can even batch get from multiple tables at once).&lt;/p&gt;

&lt;h3&gt;&amp;lt;/Application&amp;gt;&lt;/h3&gt;
&lt;p&gt;The real focus here is to introduce DynamoDB and show how to deal with the restrictions it imposes. Namely, how to create your own indexes to support more advanced queries. If you also want to query by &lt;code&gt;user&lt;/code&gt; you'll need another table and if you want to query by &lt;code&gt;asset+user&lt;/code&gt; you'll need yet another (in this case the hash key can be a &lt;code&gt;@asset + ':' + @user&lt;/code&gt;). If you want a different sort, you guessed it, you'll need another table.&lt;/p&gt;
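
&lt;p&gt;For the &lt;code&gt;asset+user&lt;/code&gt; case, the index row could be built along these lines. This is only a sketch: &lt;code&gt;assetUserKey&lt;/code&gt;, &lt;code&gt;indexEntry&lt;/code&gt; and the &lt;code&gt;asset_user&lt;/code&gt; field name are invented for the example.&lt;/p&gt;

```javascript
// Hypothetical helper: the hash value for an asset+user index table.
function assetUserKey(asset, user) {
  return asset + ':' + user;
}

// Hypothetical helper: the row we would store in that index table,
// hashed on the combined key and ranged on the creation time.
function indexEntry(log) {
  return {
    asset_user: assetUserKey(log.asset, log.user),
    created: log.created,
    id: log.id
  };
}

console.log(indexEntry({id: '3', user: 'paul', asset: 'spice', created: 5001001}));
```

&lt;p&gt;One thing to watch with this kind of concatenation: if asset or user values can themselves contain the separator, two different pairs can end up with the same key.&lt;/p&gt;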

&lt;p&gt;There's more you can do with DynamoDB (like deleting), or even doing a linear scan for arbitrary fields (which is expensive and won't scale, so I'm not sure when you'd do it). But understanding that records are retrieved by hash key or hash+range key, and what that means with respect to modeling, is the best place to start.&lt;/p&gt;

&lt;h3&gt;My Thoughts on DynamoDB&lt;/h3&gt;
&lt;p&gt;From an infrastructure point of view, DynamoDB is a dream come true. Take everything you know about scaling a database and throw it out. Stop worrying about RAID, or worse, RAIDED EBS, replication, availability zones and so on. I generally like to manage all my own stuff and run my own servers, but there's something simply awesome about DynamoDB's potential.&lt;/p&gt;

&lt;p&gt;However, beyond the infrastructure, the actual storage engine leaves a lot to be desired. It's where other NoSQL solutions were 1-2 years ago, which is significant when you consider how fast the field has evolved. The lack of secondary indexes isn't a deal breaker for me, but it's an increasingly rare limitation.&lt;/p&gt;

&lt;p&gt;For me, paging records is always a good measure of how helpful a database wants to be. Paging records in SQL Server or Oracle, for example, feels a lot like being given the finger. DynamoDB doesn't fare any better. Commands that can return multiple items take a &lt;code&gt;Limit&lt;/code&gt; option, which is good. But for an offset you need to provide an &lt;code&gt;ExclusiveStartKey&lt;/code&gt;, which is the last key that you received. Worse, even when you don't provide a &lt;code&gt;Limit&lt;/code&gt; you might still get a partial result if the full result is too big (&gt;1MB). DynamoDB will let you know this happened by also providing a &lt;code&gt;LastEvaluatedKey&lt;/code&gt; in the reply. In other words, if you are hoping for a &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt;, which I believe every database solution should strive to provide, you'll be as disappointed as I am.&lt;/p&gt;
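
&lt;p&gt;To show what that style of paging means for calling code, here's an in-memory simulation in plain JavaScript. &lt;code&gt;fakeQuery&lt;/code&gt; and &lt;code&gt;fetchAll&lt;/code&gt; are invented for the example and are not driver calls, and the key is simplified to a single number where the real thing would be a key structure. The point is just the loop: keep feeding the last key back in until none comes back.&lt;/p&gt;

```javascript
// In-memory simulation of cursor-style paging; fakeQuery is NOT a driver call.
var rows = [10, 20, 30, 40, 50].map(function(created) {
  return {asset: 'spice', created: created};
});

// Returns up to options.Limit rows that come after options.ExclusiveStartKey.
// LastEvaluatedKey is only set when more rows remain. Assumes Limit is 1 or more.
function fakeQuery(options) {
  var start = 0;
  if (options.ExclusiveStartKey !== undefined) {
    start = rows.map(function(r) { return r.created; })
      .indexOf(options.ExclusiveStartKey) + 1;
  }
  var page = rows.slice(start, start + options.Limit);
  var result = {Items: page};
  if (start + page.length !== rows.length) {
    result.LastEvaluatedKey = page[page.length - 1].created;
  }
  return result;
}

// Keep asking for pages, feeding the last key back in, until none is returned.
function fetchAll(limit) {
  var all = [];
  var options = {Limit: limit};
  var result;
  do {
    result = fakeQuery(options);
    all = all.concat(result.Items);
    options = {Limit: limit, ExclusiveStartKey: result.LastEvaluatedKey};
  } while (result.LastEvaluatedKey !== undefined);
  return all;
}

console.log(fetchAll(2).length);
```

&lt;p&gt;Notice there's no way to jump straight to "page 3" here. You have to walk through pages 1 and 2 first, which is exactly the complaint.&lt;/p&gt;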

&lt;p&gt;There's also the fact that it only supports strings and numbers and doesn't support embedded objects. This isn't too uncommon, but I think we can agree more type support is better than less.&lt;/p&gt;

&lt;p&gt;Then there's the pricing. Billing per write and read capacity unit doesn't bother me. Sure, it's ambiguous at first, but I can see how it better reflects Amazon's actual cost than say, charging $X for Y RAM and Z HDD. They are essentially charging by I/O, which is probably a better all around measure of CPU, HDD and RAM usage. What does bother me though is that they round up to the nearest 1KB. Now, maybe in a real world app that would just be a blip. However, given that a high number of queries will likely go to a secondary index (and thus only return short ids), I have a feeling it really could add up. It kinda feels like they are providing an inferior experience (lack of secondary indexes), and forcing you to pay more because of it.&lt;/p&gt;

&lt;p&gt;My last point is about the communication protocol. Admittedly, this is something most devs won't have to worry about or even know about. I'm quite familiar with the MongoDB and Redis protocols, and I can safely say that, in comparison, I &lt;strong&gt;hate&lt;/strong&gt; the DynamoDB protocol. First of all, even though it's JSON, they've somehow made it feel like XML. Not only is it incredibly verbose, but whenever you send attributes over, you have to encode them as such: &lt;code&gt;{"S": "MyStringValue"}&lt;/code&gt; or &lt;code&gt;{"N":"MyNumericalValue"}&lt;/code&gt;. If only JSON had a built-in way to distinguish strings from numbers... There are also a couple of inconsistencies, which is a shame to see in such a young protocol. These inconsistencies are quite evident in the way errors are handled. I tried to build a &lt;a href="https://github.com/karlseguin/alternator"&gt;local emulator backed by MongoDB&lt;/a&gt; for development purposes, but abandoned the project after being frustrated with DynamoDB's error handling.&lt;/p&gt;
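
&lt;p&gt;To give a feel for that encoding, here's roughly what a driver has to do to every item before sending it. &lt;code&gt;encodeAttributes&lt;/code&gt; is a made-up helper for illustration, not the code of any actual driver, and it only covers the two types discussed in this post.&lt;/p&gt;

```javascript
// Wrap each attribute in the typed envelope described above.
// Only strings and numbers are handled; anything else throws.
function encodeAttributes(item) {
  var encoded = {};
  Object.keys(item).forEach(function(name) {
    var value = item[name];
    if (typeof value === 'string') {
      encoded[name] = {S: value};
    } else if (typeof value === 'number') {
      encoded[name] = {N: String(value)};  // numbers travel as strings too
    } else {
      throw new Error('unsupported type for attribute ' + name);
    }
  });
  return encoded;
}

console.log(JSON.stringify(encodeAttributes({id: 'abc', created: 1328515200000})));
```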

&lt;p&gt;Ultimately, I think the idea is great, but the execution is a couple years behind what's currently available. The real question is where do they plan on taking it and when do they plan on getting there.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Node.js, Require and Exports</title>
   <link href="http://openmymind.net/2012/2/3/Node-Require-and-Exports" />
   <updated>2012-02-03T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/2/3/Node-Require-and-Exports</id>
   <content type="html">&lt;p&gt;Back when I first started playing with node.js, there was one thing that always made me uncomfortable. Embarrassingly, I'm talking about &lt;code&gt;module.exports&lt;/code&gt;. I say &lt;em&gt;embarrassingly&lt;/em&gt; because it's such a fundamental part of node.js and it's quite simple. In fact, looking back, I have no idea what my hang up was...I just remember being fuzzy on it. Assuming I'm not the only one who's had to take a second, and third, look at it before it finally started sinking in, I thought I could do a little write up.&lt;/p&gt;

&lt;p&gt;In Node, things are only visible to other things in the same file. By &lt;em&gt;things&lt;/em&gt;, I mean variables, functions, classes and class members. So, given a file &lt;code&gt;misc.js&lt;/code&gt; with the following contents:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var x = 5;
var addX = function(value) {
  return value + x;
};
&lt;/pre&gt;

&lt;p&gt;Another file cannot access the &lt;code&gt;x&lt;/code&gt; variable or &lt;code&gt;addX&lt;/code&gt; function. This has nothing to do with the use of the &lt;code&gt;var&lt;/code&gt; keyword. Rather, the fundamental Node building block is called a &lt;em&gt;module&lt;/em&gt; which maps directly to a file. So we could say that the above file corresponds to a module named &lt;code&gt;misc&lt;/code&gt; and everything within that module (or any module) is private.&lt;/p&gt;


&lt;p&gt;Now, before we look at how to expose things out of a module, let's look at loading a module. This is where &lt;code&gt;require&lt;/code&gt; comes in. &lt;code&gt;require&lt;/code&gt; is used to load a module, which is why its return value is typically assigned to a variable:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var misc = require('./misc');
&lt;/pre&gt;

&lt;p&gt;Of course, as long as our module doesn't expose anything, the above isn't very useful. To expose things we use &lt;code&gt;module.exports&lt;/code&gt; and export everything we want: &lt;/p&gt;

&lt;pre class="brush: js"&gt;
var x = 5;
var addX = function(value) {
  return value + x;
};
module.exports.x = x;
module.exports.addX = addX;
&lt;/pre&gt;

&lt;p&gt;Now we can use our loaded module:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var misc = require('./misc');
console.log("Adding %d to 10 gives us %d", misc.x, misc.addX(10));
&lt;/pre&gt;

&lt;p&gt;There's another way to expose things in a module:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var User = function(name, email) {
  this.name = name;
  this.email = email;
};
module.exports = User;
&lt;/pre&gt;

&lt;p&gt;The difference is subtle but important. See it? We are exporting &lt;code&gt;User&lt;/code&gt; directly, without any indirection. The difference between: &lt;/p&gt;

&lt;pre class="brush: js"&gt;
module.exports.User = User;
//vs
module.exports = User;
&lt;/pre&gt;

&lt;p&gt;is all about how it's used:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var user = require('./user');

var u = new user.User();
//vs
var u = new user();
&lt;/pre&gt;

&lt;p&gt;It's pretty much a matter of whether your module is a container of exported values or not. You can actually mix the two within the same module, but I think that leads to a pretty ugly API.&lt;/p&gt;

&lt;p&gt;Finally, the last thing to consider is what happens when you directly export a function:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var powerLevel = function(level) {
  return level &gt; 9000 ? "it's over 9000!!!" : level;
};
module.exports = powerLevel;
&lt;/pre&gt;

&lt;p&gt;When you require the above file, the returned value is the actual function. This means that you can do:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
require('./powerlevel')(9050);
&lt;/pre&gt;

&lt;p&gt;Which is really just a condensed version of:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
var powerLevel = require('./powerlevel')
powerLevel(9050);
&lt;/pre&gt;

&lt;p&gt;Hope that helps!&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>MongoDB: Embedded Documents vs Multiple Collections</title>
   <link href="http://openmymind.net/2012/1/30/MongoDB-Embedded-Documents-vs-Multiple-Collections" />
   <updated>2012-01-30T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/30/MongoDB-Embedded-Documents-vs-Multiple-Collections</id>
   <content type="html">&lt;p&gt;The most common MongoDB modeling question that gets asked has to do with when to use embedded documents versus multiple collections. First, for those unfamiliar with MongoDB, a quick recap.&lt;/p&gt;

&lt;p&gt;As a document database, MongoDB can store embedded documents. As an example, we might have:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
{
  email: 'leto@dune.gov',
  password: 'spice',
  address: {
    street: 'The Citadel',
    city: 'Arrakeen',
    planet: 'Arrakis'
  }
}
&lt;/pre&gt;

&lt;p&gt;Not only can you have embedded documents, but you can also have arrays of embedded documents, and you can have multiple levels of nesting. MongoDB lets you index and query based on embedded keys, and it has special operators that let you push and pull values from an embedded array. All this to say that, in MongoDB, embedded documents and arrays are first-class citizens.&lt;/p&gt;

&lt;p&gt;It's also worth pointing out that MongoDB doesn't support joins, which is important to keep in mind when you consider the two approaches. (You do a "join" on the client by issuing 2+ queries.) In the end, MongoDB gives you two choices to model relationships: use embedded documents, or go the traditional route and store things in two separate collections (a collection is essentially a table). So the question is, which should you pick?&lt;/p&gt;

&lt;p&gt;There's no single answer to the question. It depends on a number of factors, not least of which is the data you are dealing with. Let's look at another example, a post with comments:&lt;/p&gt;

&lt;pre class="brush: js"&gt;
{
  title: 'MongoDB: Embedded Documents vs Multiple Collections',
  slug: 'embedded-documents-vs-multiple-collections',
  body: '...',
  comments: [
    {by: 'leto', body: '...', date: '...'},
    {by: 'ghanima', body: '...', date: '...'},
    {by: 'jessica', body: '...', date: '...'},
    {by: 'paul', body: '...', date: '...'}
  ]
}
&lt;/pre&gt;

&lt;p&gt;The two examples represent different uses of embedded documents. The most obvious difference is with respect to the number of elements. The first example is finite (in this case 1), while the second could grow forever. More importantly though, while an &lt;code&gt;address&lt;/code&gt; can certainly be thought of as a standalone entity (especially if multiple users can have the same address), there's something just natural about saying that a user and an address are a single coherent entity. Conversely, I don't think it's a stretch to think of both a post and a comment as having meaning/purpose by themselves.&lt;/p&gt;

&lt;p&gt;Beyond this intuition, there are some technical matters to consider. In terms of performance, embedded documents don't require a client-side join, and the entire document can be fetched in a single seek. This can have a particularly meaningful impact when mixed with sharding. On the flip side, growing an array of embedded documents can force MongoDB to physically move where the document is stored (and to update all of its indexes).&lt;/p&gt;

&lt;p&gt;It is important to understand the performance implications, and more importantly the reasons why the performance characteristics are the way they are. However, if you don't already know which approach to take, performance probably shouldn't be what's driving your decision.&lt;/p&gt;

&lt;p&gt;The other, much more relevant, consideration is, in my opinion, query flexibility. As much as embedded documents are first-class citizens, there's a limit to what you can do for queries. In the above example, we can't just get comments made by &lt;code&gt;leto&lt;/code&gt;. We have to get the entire document. The next release of MongoDB should fix this (by allowing the positional operator &lt;code&gt;$&lt;/code&gt; to be used in field selection). Further down the road, virtual collections on embedded documents will also be added. However, for the time being, the only possible selection is index-based (get comments 1-5, 5-10, ...), which is a serious drag.&lt;/p&gt;
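
&lt;p&gt;In practice that means the filtering happens in your own code, after the whole post has already come over the wire. A minimal sketch in plain JavaScript, where the &lt;code&gt;post&lt;/code&gt; object stands in for a fetched document and &lt;code&gt;commentsBy&lt;/code&gt; is a helper made up for the example:&lt;/p&gt;

```javascript
// A post as it would come back from the database, comments embedded.
var post = {
  title: '...',
  comments: [
    {by: 'leto', body: 'first'},
    {by: 'ghanima', body: 'second'},
    {by: 'leto', body: 'third'}
  ]
};

// Client-side filter: the whole document was already fetched to get here.
function commentsBy(post, author) {
  return post.comments.filter(function(comment) {
    return comment.by === author;
  });
}

console.log(commentsBy(post, 'leto').length);
```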

&lt;p&gt;Which should you use? They both have their use. Given that an embedded document is something you can't do in relational databases, they do get talked (and hyped) about a lot. However, I believe that &lt;strong&gt;you should favor separate collections&lt;/strong&gt;. This is especially true if you are dealing with a potentially ever-growing array, or if you want to pull out specific embedded values. So the approach I would suggest is to first think of it stored in a separate collection and only then consider if it makes sense to embed. Distinct pieces of data, like our address, obviously make sense to embed. Small and simple arrays, such as tags, also make sense.&lt;/p&gt;

&lt;p&gt;Worst case, pick one, try it out, and if it doesn't work out, change it.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>The Little Redis Book</title>
   <link href="http://openmymind.net/2012/1/23/The-Little-Redis-Book" />
   <updated>2012-01-23T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/23/The-Little-Redis-Book</id>
   <content type="html">&lt;p&gt;&lt;a href="http://openmymind.net/redis.pdf"&gt;&lt;img style="border:1px solid #aaa;padding:10px;width:245px;height:103px" src="/redis_cover.png"/&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;a href="http://openmymind.net/redis.epub"&gt;epub version&lt;/a&gt; is now available&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are also &lt;a href="https://github.com/kondratovich/the-little-redis-book"&gt;Russian&lt;/a&gt; and &lt;a href="https://github.com/sandroconforto/the-little-redis-book"&gt;Italian&lt;/a&gt; translations available&lt;/p&gt;

&lt;p&gt;It's hard to believe that 10 months ago I wrote and released The Little MongoDB Book. Equally unbelievable has been the reception and feedback I've gotten. For a long time now I've wanted to write a similar book for Redis, but I just never felt like I could tell a good story. Redis is wonderfully simple, which makes it awesome to use, but I thought it would turn any book into little more than reference material. Well, I decided to give it a try and hopefully you'll agree with me that The Little Redis Book is a solid addition to the &lt;em&gt;Little&lt;/em&gt; family.&lt;/p&gt;

&lt;p&gt;You can download the &lt;a href="http://openmymind.net/redis.pdf"&gt;PDF version here&lt;/a&gt;. It comes in at 29 pages. I hope this helps people who are new to Redis. I also hope there's maybe one or two useful things in here for developers already familiar with it.&lt;/p&gt;

&lt;p&gt;I wrote it in markdown, and the source is &lt;a href="http://github.com/karlseguin/the-little-redis-book"&gt;available on github&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The book was written in 2 short days. Again, &lt;a href="http://twitter.com/perryneal"&gt;Perry Neal&lt;/a&gt; provided some great feedback and edits in a short period of time. It would have been impossible to pull this off without him. Of course, given the speed at which this was done, corrections and feedback are welcomed in any format (email, comment, pull requests).&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Refactoring Common API Functionality Into A Node.js Proxy</title>
   <link href="http://openmymind.net/2012/1/17/Refactoring-Common-API-Functionaly-Into-A-Node-Proxy" />
   <updated>2012-01-17T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/17/Refactoring-Common-API-Functionaly-Into-A-Node-Proxy</id>
   <content type="html">&lt;p&gt;The handful of web services that I maintain all share common functionality. For example, at the start of each request, they load an account based on some key (in the query string or the message body). They also ensure that a request is valid by validating the provided sha1 signature. They handle versioning and do logging. It's almost the exact same thing from project to project, but not all of these projects are written in the same language so traditional re-use (dll, gem, package) isn't the best solution.&lt;/p&gt;

&lt;p&gt;This weekend I wrote &lt;a href="https://github.com/karlseguin/aproxi"&gt;aproxi&lt;/a&gt; which is a simple node.js proxy built on &lt;a href="https://github.com/senchalabs/connect/"&gt;connect&lt;/a&gt;. Connect is a middleware layer like ruby's rack or .NET's OWIN. The idea behind this project is to have it sit between a webserver, say nginx, and the application (which could be written in anything) and provide all of this common functionality.&lt;/p&gt;

&lt;p&gt;There are a couple hosted services that do this, like apigee and mashery. I think those are wonderful services, but I also think having something more custom for your applications can be beneficial (for example, I don't think either of them supports method signing).&lt;/p&gt;

&lt;p&gt;Let's look at the most basic example, ensuring that the API key we received truly belongs to a valid account:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
store = require('./../store')
appLoader =  -&gt;
  appLoader = (request, response, next) -&gt;
    return next() if request._appLoader
    request._appLoader = true

    key = if request.method == 'GET' || request.method == 'DELETE' then request.query['key'] else request.body['key']
    return invalid(response) unless key?

    store.findOne 'apps', {_id: key}, {fields: {secret: true}}, (err, app) =&gt;
      return invalid(response) if err? || !app?
      request._app = app
      return next()

invalid = (response) -&gt;
  response.writeHead(400, {'Content-Type': 'application/json'});
  response.end(JSON.stringify({error: 'the key is not valid'}))    

module.exports = appLoader
&lt;/pre&gt;

&lt;p&gt;There's some connect-specific code in here (like the nested functions and checking to see if this middleware already ran (which I'm not sure why I need, but all the built-in ones do that)), but it's overall quite simple. We load the key from the query or body and if it's either invalid or doesn't correspond to an actual application, we respond with an error. Otherwise we move to the next middleware.&lt;/p&gt;

&lt;p&gt;Notice that we are only retrieving the app's secret value. This will be used in a following method to verify the signature. For more complex APIs, we might retrieve an account level (small, medium, large) which other middlewares might use to limit what can and can't happen.&lt;/p&gt;
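
&lt;p&gt;If the &lt;code&gt;next()&lt;/code&gt; dance is new to you, this stripped-down sketch shows the general idea of a middleware chain. It's a toy runner written for this post, not how connect is actually implemented, and &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;loadApp&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt; and the &lt;code&gt;apps&lt;/code&gt; lookup are all invented names.&lt;/p&gt;

```javascript
// Toy middleware runner; real connect does much more than this.
function run(middlewares, request, response) {
  var index = 0;
  function next() {
    var middleware = middlewares[index++];
    if (middleware) { middleware(request, response, next); }
  }
  next();
}

// Stand-in for appLoader: reject requests without a known key.
var apps = {'abc': {secret: 'shh'}};
function loadApp(request, response, next) {
  var app = apps[request.key];
  if (!app) {
    response.status = 400;
    return;  // the chain stops here; later middlewares never run
  }
  request._app = app;
  next();
}

// Stand-in for a later middleware that relies on request._app.
function finish(request, response, next) {
  response.status = 200;
  response.secret = request._app.secret;
  next();
}

var ok = {};
run([loadApp, finish], {key: 'abc'}, ok);
var rejected = {};
run([loadApp, finish], {key: 'nope'}, rejected);
console.log(ok.status, rejected.status);
```

&lt;p&gt;The important bit is that a middleware which doesn't call &lt;code&gt;next()&lt;/code&gt; ends the request, which is how an invalid key never makes it to the application server.&lt;/p&gt;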

&lt;p&gt;Once all middlewares have passed we use node's http package to proxy our request to the application server. This is our final middleware:&lt;/p&gt;

&lt;pre class="brush: csharp"&gt;
http = require('http')

proxy = (config) -&gt;
  proxy = (request, response, next) -&gt;
    return next() if request._proxy
    request._proxy = true

    options = 
      port: config.port
      host: config.host
      method: request.method
      path: request.url
      headers: request.headers

    prequest = http.request options, (presponse) -&gt;
      presponse.on 'data', (chunk) -&gt; response.write(chunk, 'binary')
      presponse.on 'end', -&gt; response.end()
      response.writeHead(presponse.statusCode, presponse.headers);

    prequest.on 'error', (err) -&gt;
      response.statusCode = 503
      response.end('connection to application server refused')
      
    prequest.write(request.bodyRaw, 'binary') if request.bodyRaw?
    prequest.end()

module.exports = proxy
&lt;/pre&gt;

&lt;p&gt;The node.js documentation describes the &lt;a href="http://nodejs.org/docs/latest/api/http.html"&gt;http object&lt;/a&gt;, but the code is fairly simple. We open a request to the application server, write out the body, and any data we receive we stream back out to the requesting client (which in our case would be nginx).&lt;/p&gt;

&lt;p&gt;One question you might have is why do we still need nginx? Well, nginx provides a ton of reliable and fast modules. In theory we could get rid of it, but then we'd have to handle SSL, caching, blacklisting, throttling and so on. Conversely, we could write all of this as nginx modules (or using Varnish VCL), but the ease and power of node.js is unsurpassed. It would be ideal if nginx would embed V8 so that modules could be written in JavaScript, but that &lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ru&amp;tl=en&amp;u=http://sysoev.ru/prog/v8.html"&gt;doesn't seem like it'll happen&lt;/a&gt; any day soon.&lt;/p&gt;

&lt;p&gt;I'm actually quite excited by this project. It is somewhat configurable, but the goal isn't really for other people to use it. It's so that I can use it. But if you build APIs and you find yourself writing similar code over and over again, hopefully this project will give you some ideas and possibly act as a launching pad for your own custom proxy.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Understanding CoffeeScript Comprehensions</title>
   <link href="http://openmymind.net/2012/1/15/Understanding-CoffeeScript-Comprehensions" />
   <updated>2012-01-15T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/15/Understanding-CoffeeScript-Comprehensions</id>
   <content type="html">&lt;p&gt;If there's anything better than CoffeeScript, it's the amazing quality of documentation available. The main homepage, &lt;a href="http://coffeescript.org/"&gt;coffeescript.org&lt;/a&gt; is a stellar example many projects should be copying. And then there's the &lt;a href="http://arcturo.github.com/library/coffeescript/"&gt;The Little Book on CoffeeScript&lt;/a&gt; which I have nothing but praise for (as a Little Book author myself!).&lt;/p&gt;

&lt;p&gt;Nevertheless, I find that most resources don't properly explain CoffeeScript comprehensions. So, what's a comprehension? .NET readers would find it similar to LINQ...it's essentially a way to loop over and act on values in arrays or hashes.&lt;/p&gt;

&lt;p&gt;Even though that sounds simple, I had a hard time grasping the syntax. I don't know what it was. Most of the resources show JavaScript code along with the corresponding CoffeeScript version. What really helped me understand comprehensions, though, was breaking them down and first looking at the plain way to write them.&lt;/p&gt;

&lt;p&gt;In CoffeeScript, if you want to loop through an array, you use &lt;code&gt;for in&lt;/code&gt;. To loop through hashes/objects you use &lt;code&gt;for of&lt;/code&gt;. For example:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  heroes = ['leto', 'duncan', 'goku']

  for hero in heroes
    console.log(hero)

  # Or, including the index
  for hero, index in heroes
    console.log('The hero at index %d is %s', index, hero)


  likes = 
    leto: 'spice'
    paul: 'chani'
    duncan: 'murbella'

  for key of likes
    console.log(key)

  # Or, including the value
  for key, value of likes
    console.log('%s likes %s', key, value)
&lt;/pre&gt;

&lt;p&gt;Now, that's pretty basic and understandable. I understood the syntax around comprehensions when I recognized the similarities to statement modifiers. Statement modifiers are a neat and useful way in CoffeeScript and in Ruby to execute an &lt;code&gt;if&lt;/code&gt; (or &lt;code&gt;unless&lt;/code&gt;) statement.&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  # instead of
  if x == 0
    x = 1

  # we can do
  x = 1 if x == 0
&lt;/pre&gt;

&lt;p&gt;If you aren't used to these, it might seem like pretty useless syntactical sugar, but I believe that they contribute enough to code readability to make them worth using. Anyways, even though most people are used to thinking of code executing from left to right, I think the above statement modifiers (which first check the condition on the right) are easy enough to understand.&lt;/p&gt;

&lt;p&gt;Being familiar with this type of syntactical trick, we can go back to our loops and fancy them up:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  heroes = ['leto', 'duncan', 'goku']

  console.log(hero) for hero in heroes


  likes = 
    leto: 'spice'
    paul: 'chani'
    duncan: 'murbella'

  console.log('%s likes %s', key, value) for key, value of likes
&lt;/pre&gt;

&lt;p&gt;Comprehensions allow for conditions via &lt;code&gt;when&lt;/code&gt;, so we can do:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  heroes = ['leto', 'duncan', 'goku']

  # boring
  for hero, index in heroes
    if index % 2 == 0
      console.log(hero)

  # cool
  console.log hero for hero, index in heroes when index % 2 == 0
&lt;/pre&gt;


&lt;p&gt;Notice also that I dropped the parentheses around the &lt;code&gt;console.log&lt;/code&gt; arguments. In CoffeeScript parentheses are often optional and people tend to not use them while writing comprehensions.&lt;/p&gt;

&lt;p&gt;There are other things you can do with comprehensions. For example, you won't always want to execute some code (like &lt;code&gt;console.log&lt;/code&gt;) on each item. Instead you might want to select or map the results into another variable:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  evenHeroes = (hero for hero, index in heroes when index % 2 == 0)
&lt;/pre&gt;

&lt;p&gt;The parentheses in the above code are critical. Without them, only the last value of our comprehension ('goku') will be assigned. With them, the filtered array is assigned. The very first &lt;code&gt;hero&lt;/code&gt; in the above code is what's being selected from our comprehension.&lt;/p&gt;
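
&lt;p&gt;As a small sketch of the mapping case (the &lt;code&gt;shout&lt;/code&gt; helper is made up purely for illustration), the expression in front of the &lt;code&gt;for&lt;/code&gt; doesn't have to be the item itself. It can be anything computed from it:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  heroes = ['leto', 'duncan', 'goku']

  # hypothetical helper, only here for the example
  shout = (name) -&gt; name.toUpperCase() + '!'

  # map every hero
  loudHeroes = (shout(hero) for hero in heroes)
  # loudHeroes is ['LETO!', 'DUNCAN!', 'GOKU!']

  # map and filter at the same time
  loudEvenHeroes = (shout(hero) for hero, index in heroes when index % 2 == 0)
  # loudEvenHeroes is ['LETO!', 'GOKU!']
&lt;/pre&gt;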

</content>
 </entry>
 
 <entry>
   <title>entitlement</title>
   <link href="http://openmymind.net/2012/1/13/Entitlement" />
   <updated>2012-01-13T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/13/Entitlement</id>
   <content type="html">We live in luxury as a reward from those we keep in poverty. We grow fat on fabricated nutrition; children starve. The atrocity of our entitlement is unequaled. We laugh, we are happy, our victims are numbered in billions.</content>
 </entry>
 
 <entry>
   <title>Reading from TCP streams</title>
   <link href="http://openmymind.net/2012/1/12/Reading-From-TCP-Streams" />
   <updated>2012-01-12T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/12/Reading-From-TCP-Streams</id>
   <content type="html">&lt;p&gt;Over the last couple days I wrote a Redis &lt;a href="https://github.com/karlseguin/redispy-web"&gt;monitoring tools&lt;/a&gt; which leveraged a &lt;a href="https://github.com/karlseguin/redispy"&gt;redis monitoring stream parser&lt;/a&gt; I also wrote. They aren't particularly complicated or even useful projects, but it did involve writing code I've written a number of times before: TCP stream parsing.&lt;/p&gt;

&lt;p&gt;You see, the first time most people read data from a TCP stream they probably try to read (or write) it like a file. So they think that if the other end issues 3 writes, then they'll need to do 3 reads. Unfortunately, when you are doing local development, that's probably going to be true. It's unfortunate because it doesn't mirror what happens in the real world.&lt;/p&gt;

&lt;p&gt;TCP is a stream protocol, which means that those 3 writes might come in via a single read, or via 10 reads. It has no concept of your message boundaries. Issuing three writes, each ending with a newline character, is meaningless...it's all just a stream of continuous bytes. What's a developer to do?&lt;/p&gt;

&lt;p&gt;There are two common approaches to this problem. The first is to prefix each message with a length (say a 32bit integer). The other is to include delimiters (like those new line characters). In the first case, once you've read the length you know just how many bytes you need to read to get the complete message. This also has the benefit of letting you pre-allocate a byte array of the correct size (assuming you want to load the whole thing into memory). In the second case, you need to scan the message byte by byte looking for your special delimiters. They are both commonly used. For example, MongoDB uses the first approach while Redis uses the second one.&lt;/p&gt;

&lt;p&gt;Every time you read you have to deal with three possible situations. First, you might not have read enough (either because you know how many bytes to expect and haven't received them all, or because you haven't hit a delimiter yet). In this case you need to store your partial message and issue another read. Second, you might read the perfect amount. Finally, you might read more than one message, which means the data needs to be properly split and the remainder re-processed as though it were a new read. You can get any combination of these: you might read part of a message, and then read the rest of it plus two and a half more messages.&lt;/p&gt;
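
&lt;p&gt;To make those three situations concrete, here's a minimal sketch of a newline-delimited reader. It isn't the redispy code; &lt;code&gt;LineReader&lt;/code&gt; and &lt;code&gt;onMessage&lt;/code&gt; are names made up for this example, and it assumes chunks arrive as strings rather than raw buffers:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  # hypothetical reader: buffers partial data and emits complete messages
  class LineReader
    constructor: (@onMessage) -&gt;
      @buffer = ''

    # call this with whatever each read returns
    feed: (chunk) -&gt;
      @buffer += chunk
      # emit every complete message currently sitting in the buffer
      while (index = @buffer.indexOf('\n')) != -1
        @onMessage(@buffer.substring(0, index))
        @buffer = @buffer.substring(index + 1)
      # whatever is left is a partial message, kept for the next read
      return

  messages = []
  reader = new LineReader (message) -&gt; messages.push(message)

  reader.feed('PI')            # not enough, nothing is emitted
  reader.feed('NG\nPONG\nPI')  # the rest of one message, a whole one, and part of a third
  reader.feed('NG\n')          # completes the third
  # messages is ['PING', 'PONG', 'PING']
&lt;/pre&gt;

&lt;p&gt;The only state is the leftover buffer, which is what makes every combination of short, exact and oversized reads behave the same way.&lt;/p&gt;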

&lt;p&gt;What I find particularly interesting is that either your length prefix or your delimiters themselves might be cut up. For example, say you prefix each of your messages with a 32-bit integer. It's entirely possible that your first read will only return 1, 2, or 3 of the 4 bytes you need. The same thing can happen with multi-byte delimiters (which are used to avoid conflicts). &lt;a href="https://github.com/karlseguin/redispy/blob/master/lib/reader.coffee"&gt;Here's the basic solution&lt;/a&gt; I came up with for newline terminated messages in redispy and &lt;a href="https://gist.github.com/1598236"&gt;here's a solution&lt;/a&gt; for length-prefixed messages.&lt;/p&gt;

&lt;p&gt;None of it is particularly difficult to deal with. You pretty much just need to keep some context/state around as you read from the socket. This could lead to a DoS-type attack (sending multiple partial messages to eat up server memory), though for most apps a reasonable timeout will be sufficient.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Does My MongoDB Replica Set Need An Arbiter?</title>
   <link href="http://openmymind.net/2012/1/7/Does-My-Replica-Set-Need-An-Arbiter" />
   <updated>2012-01-07T00:00:00-08:00</updated>
   <id>http://openmymind.net/2012/1/7/Does-My-Replica-Set-Need-An-Arbiter</id>
   <content type="html">&lt;p id="intro"&gt;MongoDB replica sets provide a number of features that most MongoDB users are going to want to leverage. Best of all they are relatively easy to setup. However, first timers often hesitate when it comes to the role of arbiters. Knowing whether or not you need an arbiter is dead simple. Understanding why is critical.&lt;/p&gt;

&lt;blockquote&gt;TL;DR: You need an arbiter if you have an even number of votes. As an extension to this, at most you should only ever have 1 arbiter. If you aren't sure how many votes you have, it's probably the same as the number of servers in the set you have (including slaves, hidden, arbiters).&lt;/blockquote&gt;

&lt;h2&gt;Elections and Majority&lt;/h2&gt;
&lt;p&gt;When the primary becomes unavailable, an election is held to pick a new primary server from those servers still available. One of the key points to understand is that a majority is required to elect a new primary. Not just a majority from available servers, but a majority from all of the servers in the set.&lt;/p&gt;

&lt;p&gt;For example if you have 3 servers, A(primary), B and C then it'll &lt;strong&gt;always&lt;/strong&gt; take 2 servers to elect a new primary. If A(p) goes down, then B and C can and will elect a primary. However, if B also goes down, C will not elect itself as primary because it will only get 1/3 of the votes (as opposed to the necessary 2/3).&lt;/p&gt;

&lt;p&gt;In other words, a 3 server replica set can tolerate a single failure. A 5 server replica set can tolerate 2 failed servers (3/5 is a majority). A 7 server replica set can survive 3 failed servers (4/7).&lt;/p&gt;
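
&lt;p&gt;The arithmetic behind those numbers is simple enough to sketch. This isn't MongoDB code, and &lt;code&gt;majority&lt;/code&gt; and &lt;code&gt;tolerated&lt;/code&gt; are just names I'm using for illustration:&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;
  # votes needed to elect a primary: more than half of ALL the votes in the set
  majority = (votes) -&gt; Math.floor(votes / 2) + 1

  # how many voting members can be lost while a majority is still reachable
  tolerated = (votes) -&gt; votes - majority(votes)

  for votes in [3, 4, 5, 7]
    console.log('%d votes: majority is %d, tolerates %d failures', votes, majority(votes), tolerated(votes))

  # 3 votes: majority is 2, tolerates 1 failures
  # 4 votes: majority is 3, tolerates 1 failures
  # 5 votes: majority is 3, tolerates 2 failures
  # 7 votes: majority is 4, tolerates 3 failures
&lt;/pre&gt;

&lt;p&gt;Notice that going from 3 votes to 4 doesn't buy any extra tolerance.&lt;/p&gt;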

&lt;h2&gt;Network Splits and Ties&lt;/h2&gt;
&lt;p&gt;To understand the other key point it helps if we change our perspective from thinking of servers going down, to thinking in terms of network splits. What?&lt;/p&gt;

&lt;p&gt;You see, when you think of servers crashing, it's reasonable to expect any remaining server to be elected - even without a majority. Instead of servers going down, consider what would happen if the servers all remain operational but can't see each other. Let's look at an example.&lt;/p&gt;

&lt;p&gt;Pretend we have 4 servers - representing an [evil] even number of votes: A, B, C and D. AB are in one data center (or availability zone) while CD are in another. Now, the link between the two data centers dies. What would happen? Each group thinks the other is down, and both could elect their own primary. This would cause data conflicts (two servers would be accepting writes).&lt;/p&gt;

&lt;p&gt;So let's introduce an arbiter (E) and with it, an uneven number of votes. Now each server knows that the set is made up of 5 votes, and thus 3 votes are required to elect a new primary. Whatever group E ends up with (either ABE or CDE), a primary will be elected and, more importantly, whatever group E doesn't end up with won't be able to elect a primary.&lt;/p&gt;

&lt;p&gt;What happens if AB, CD and E do a three way split? Then there is no majority and thus no primary and the set will go down (but that's much better than having 2 primaries).&lt;/p&gt;

&lt;h2&gt;Separate Servers?&lt;/h2&gt;
&lt;p&gt;People often wonder whether an arbiter can run on the same box as one of the main &lt;code&gt;mongod&lt;/code&gt; processes. Ideally, no. Imagine we have AB and C, where B is an arbiter running on the same server as A. If that server goes down, you've lost your majority. In other words, in this setup, if the wrong server goes down you have no redundancy. However, if C goes down, A (with B's vote) can and will remain primary (and if you think you can just stick a 2nd arbiter on C, then this article has failed you miserably, and I'm very sorry for having wasted your time).&lt;/p&gt;

&lt;p&gt;Arbiters don't store any of the data and are lightweight processes. An EC2 micro instance is more than powerful enough.&lt;/p&gt;


&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The most important thing is to have an uneven number of votes. Knowing this, it should be obvious that you either want 0 arbiters or 1...but never, ever more.&lt;/p&gt;

&lt;p&gt;It's also important to understand that election doesn't rely on the majority of available servers, but of all servers in the set. A 3-server replica set will not tolerate 2 servers failing. Thinking in terms of network splits in the context of separate data centers helps me visualize this.&lt;/p&gt;

&lt;p&gt;ps: I'm cross-posting this to my MongoDB collection available at &lt;a href="http://mongly.com/"&gt;mongly.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;pps: that's the longest I've not written in a very long time&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>I Just Don't Like Object Mappers</title>
   <link href="http://openmymind.net/2011/11/18/I-Just-Dont-Like-Object-Mappers" />
   <updated>2011-11-18T00:00:00-08:00</updated>
   <id>http://openmymind.net/2011/11/18/I-Just-Dont-Like-Object-Mappers</id>
   <content type="html">&lt;p&gt;It seems like people are talking about ORMs again lately, so I wanted to share my thoughts. In fact, it's something I'm always interested in talking about since my take on this appears to be in the minority, and I've always been uncomfortable about that (on this topic).&lt;/p&gt;

&lt;p&gt;I have a hard time getting behind mappers. It doesn't matter if we are talking about full-blown ORMs, the DataMapper pattern or lighter-weight ODMs. My problem with them is always about relationship management. In a lot of projects, I think the complete object graph you get with mappers makes a lot of sense. But for a lot of (most?) web applications, where you are often dealing with short-lived units of work (so short-lived that calling them units of work is silly), it has always felt heavy-handed to me. And the more we push out to the client with things like backbone.js, the more I prefer to hand-write explicit methods.&lt;/p&gt;

&lt;p&gt;Even for simple stuff, I find it heavy. Say you are letting people vote on something. I'd just rather write code that explicitly does (via some hard-coded SQL):&lt;/p&gt;

&lt;pre class="brush:text"&gt;
  update answers set votes = votes + 1 where id = @id
&lt;/pre&gt;

&lt;p&gt;than to load an answer by id, update its votes, and then save the object.&lt;/p&gt; 

&lt;p&gt;Yes, part of my problem is performance. At my last job we started troubleshooting why a list of 20 items was rendering so slowly, only to find the database being hit thousands of times to render that page (select n + 1 causing select n + 1 type thing). And this was written by senior and experienced Java developers (teehee)...and it isn't the first time that I've seen it.&lt;/p&gt;

&lt;p&gt;But the real problem is the complexity and the obscurity of what's going on. And, when mappers start getting clever with things like proxying values, lazy loading and identity maps, the complexity and the chance of something blowing up on you skyrockets. Call me stupid, but I spent years working with NHibernate and when we ran into problems with detached objects, I still had no clue (nor did anyone else on the team) how to fix it.&lt;/p&gt;

&lt;p&gt;Again, when you are doing so much of your work in very short-lived and highly-focused requests, the idea of loading the current answer to get its comments just seems silly. Just write a method like &lt;code&gt;gets_comments_by_answer_id&lt;/code&gt; and move on.&lt;/p&gt;

&lt;p&gt;For me, the only benefit of a mapper which could be enticing is writing database-agnostic code. But, I only consider that a benefit (a huge benefit) if you actually need it (like, you sell a boxed product that users can configure to run against X, Y or Z databases). If you are building a website on your own servers, for your own stack, it just isn't worth it.&lt;/p&gt;</content>
 </entry>
 
 
</feed>

