codeaholics.org

Giving Docker/LXC containers a routable IP address

Danny — Wed, 02 Oct 2013 22:20:32 +0000

At work, we are evaluating Docker as part of our “epic next generation deployment platform”. One of the requirements that our operations team has given us is that our containers have “identity” by virtue of their IP address. In this post, I will describe how we achieved this.

But first, let me explain that a little. Docker (as of version 0.6.3) has 2 networking modes: one which goes slightly further than what full virtualisation platforms would typically call “host only”, and no networking at all (if you can call that a mode!). In “host only” mode, the “host” (that is, the server running the container) can communicate with the software inside the container very easily. However, accessing the container from beyond the host (say, a client – shock! horror!) isn’t possible.

As mentioned, Docker goes a little bit further by providing “port forwarding” via iptables/NAT on the host. It selects a “random” port, say 49153 and adds iptables rules such that if you attempt to access this port on the host’s IP address, you will actually reach the container. This is fine, until you stop the container and restart it. In this case, your container will get a new port. Or, if you restart it on a different host, it will get a new port AND IP address.

One way to address this is via a service discovery mechanism, whereby when the service comes up it registers itself with some kind of well-known directory and clients can discover the service’s location by looking it up in the directory. This has its own problems – not least of which is that the service inside the container has no way of knowing what the IP address of the host it’s running on is, and no way of knowing which port Docker has selected to forward to it!

So, back to the problem in hand. Our ops guys want to treat each container as a well-known service in its own right and give it a proper, routable IP address. This is very much like what a virtualisation platform would call “bridged mode” networking. In my opinion, this is a very sensible choice.

Here’s how I achieved this goal:

Defining a new virtual interface

In order to have a separate IP address from the host, we will need a new virtual interface. I chose to use a macvlan type interface so that it could have its own MAC address and therefore receive an IP by DHCP if required.

ip link add virtual0 link eth0 type macvlan mode bridge

This creates a new virtual interface called virtual0. You will need a separate interface for each container. I am using bridged mode as it means container-to-container traffic doesn’t travel across the wire to the switch/router and back.

If you need a fixed MAC address – perhaps because you want a fixed IP address and you achieve this by putting a fixed MAC-to-IP mapping in your DHCP server – you need to add the following to the end of the line:

ip link add ... address 00:11:22:33:44:55

(substitute your own MAC address!)

If you do not do this, the kernel will randomly generate a MAC for you.

Giving the interface an IP address

If you want a fixed IP address and you’re happy to statically configure it:

ip address add 10.10.10.88/24 broadcast 10.10.10.255 dev virtual0

(substitute your own IP address, subnet mask and broadcast address)

If, however, you need to use DHCP to get an address (development environment, for example):

dhclient virtual0

This will start the DHCP client managing just this one interface and leave it running in the background dealing with lease renewals, etc. Once you’re ready to tear down the container and networking, you need to remember to kill the DHCP client, or, better yet:

dhclient -d -r virtual0

will locate the existing instance of the client for that interface, tell it to properly release the DHCP lease, and then exit.

Finally, bring the interface up:

ip link set virtual0 up

Find the container’s internal IP address

If you haven’t done so already, start your container as a daemon.

Use docker inspect to find the internal IP address under NetworkSettings/IPAddress.
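If you want to script this step, the lookup can be sketched in Python. This is a minimal sketch, assuming you have captured the JSON that docker inspect prints and that it is either a single object or an array holding one object per container. The function name container_ip and the sample document are made up for illustration.

```python
import json


def container_ip(inspect_output: str) -> str:
    """Return NetworkSettings/IPAddress from captured `docker inspect` JSON."""
    data = json.loads(inspect_output)
    # Assumption: the output may be a JSON array (one object per container)
    # or a bare object, so accept both shapes.
    if isinstance(data, list):
        data = data[0]
    return data["NetworkSettings"]["IPAddress"]


# Hypothetical, heavily trimmed sample of the inspect output.
sample = '[{"ID": "abc123", "NetworkSettings": {"IPAddress": "172.17.0.2"}}]'
print(container_ip(sample))
```

The value returned here is the internal address that the NAT rules later in this post need.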

Setting up inbound routing/NAT

Create a chain for the inbound routing rules for this container. I prefer to use a separate chain for each container as it makes cleaning up easier.

iptables -t nat -N BRIDGE-VIRTUAL0 # some unique name for the iptables chain for this container

Now, send all incoming traffic to this chain. We add a rule in PREROUTING which will match traffic coming in from outside, and a rule in OUTPUT which will match traffic generated on the host. Here, EXTERNAL_IP is the routable IP allocated statically or via DHCP to the virtual0 interface.

iptables -t nat -A PREROUTING -p all -d EXTERNAL_IP -j BRIDGE-VIRTUAL0
iptables -t nat -A OUTPUT -p all -d EXTERNAL_IP -j BRIDGE-VIRTUAL0

Finally, NAT all inbound traffic to the container. Here, CONTAINER_IP is the internal address of the container as discovered using the docker inspect command.

iptables -t nat -A BRIDGE-VIRTUAL0 -p all -j DNAT --to-destination CONTAINER_IP

This particular rule will forward inbound traffic for any port on the external IP to the same port on the internal IP – effectively exposing anything that the container exposes to the outside world. As an alternative, you can expose individual ports like so (note that matching on a port requires a specific protocol, hence -p tcp):

iptables -t nat -A BRIDGE-VIRTUAL0 -p tcp -m tcp --dport 80 -j DNAT --to-destination CONTAINER_IP:8080

In this example, we forward any traffic hitting port 80 on our external IP to port 8080 on the internal IP – effectively exposing a web server on port 8080 inside the container as port 80 on the external IP. Nice tidy URLs that don’t have port numbers on. Neat, huh? You can even map multiple external ports to the same internal port by using multiple rules if you wish.

Setting up outbound NAT

What we have so far works, but if the container initiates any outbound requests, they will appear (by virtue of Docker’s default MASQUERADE rule) to come from the host’s own IP. This may be fine depending on your circumstances. But it probably won’t play well if you want to get firewalls involved anywhere, and it might be a nuisance if you’re using netstat or similar to diagnose where all the incoming connections to a heavily loaded server are coming from.

So… we’ll add “source NAT” into the equation so that outbound connections from the container come from its own IP:

iptables -t nat -I POSTROUTING -p all -s CONTAINER_IP -j SNAT --to-source EXTERNAL_IP

This rule says any traffic we’re routing outbound which has a source address of the container’s internal IP (CONTAINER_IP) should have the source address changed to the container’s external IP (EXTERNAL_IP).

Note that we use -I (not -A) to ensure our rule goes ahead of Docker’s default MASQUERADE rule.
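Since every container needs the same handful of rules, it can help to generate them. Here is a minimal sketch in Python, assuming you already know the external and internal addresses. The function nat_rules and its parameters are hypothetical. It only builds the argument lists and leaves running them (as root) to you. Port-specific rules use -p tcp because --dport only works with a specific protocol.

```python
def nat_rules(chain, external_ip, container_ip, ports=None):
    """Build the iptables argument lists for one container.

    ports: optional dict of external TCP port -> internal port.
           None forwards every port, as in the catch-all rule above.
    """
    nat = ["iptables", "-t", "nat"]
    rules = [
        nat + ["-N", chain],
        # Traffic arriving from outside the host.
        nat + ["-A", "PREROUTING", "-p", "all", "-d", external_ip, "-j", chain],
        # Traffic generated on the host itself.
        nat + ["-A", "OUTPUT", "-p", "all", "-d", external_ip, "-j", chain],
    ]
    if ports is None:
        rules.append(nat + ["-A", chain, "-p", "all", "-j", "DNAT",
                            "--to-destination", container_ip])
    else:
        for external_port, internal_port in sorted(ports.items()):
            rules.append(nat + ["-A", chain, "-p", "tcp", "-m", "tcp",
                                "--dport", str(external_port), "-j", "DNAT",
                                "--to-destination",
                                f"{container_ip}:{internal_port}"])
    # -I (insert) so the rule sits ahead of Docker's MASQUERADE rule.
    rules.append(nat + ["-I", "POSTROUTING", "-p", "all", "-s", container_ip,
                        "-j", "SNAT", "--to-source", external_ip])
    return rules


# Example addresses only; substitute your own.
for rule in nat_rules("BRIDGE-VIRTUAL0", "10.10.10.88", "172.17.0.2", {80: 8080}):
    print(" ".join(rule))
```

Keeping the rules as lists rather than strings means they can be handed straight to something like subprocess.run without shell quoting worries.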

And finally

This all worked fine when I set it up at home, and then failed in the office. It turns out that it fails when you’re on a routed network (my network at home is a simple, flat network) and you try and reach your container from a different subnet. To cut a long story short, you need to loosen “reverse path filtering”! (Setting rp_filter to 2 selects loose mode rather than switching the check off entirely.)

echo 2 > /proc/sys/net/ipv4/conf/eth0/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/virtual0/rp_filter
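For anyone scripting the whole setup, the same change can be sketched in Python. The helper names and the root parameter are hypothetical. The value meanings come from the kernel's documentation for rp_filter (0 is no source validation, 1 is strict, 2 is loose). Writing to /proc needs root.

```python
from pathlib import PurePosixPath

# Meaning of each rp_filter value, per the kernel's ip-sysctl documentation.
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def rp_filter_path(interface, root="/proc/sys"):
    """Return the sysctl file that controls rp_filter for one interface."""
    return PurePosixPath(root) / "net" / "ipv4" / "conf" / interface / "rp_filter"


def set_rp_filter(interface, value=2, root="/proc/sys"):
    """Write the chosen mode for one interface (must be run as root)."""
    if value not in RP_FILTER_MODES:
        raise ValueError(f"unknown rp_filter value: {value}")
    with open(str(rp_filter_path(interface, root)), "w") as handle:
        handle.write(f"{value}\n")


for name in ("eth0", "virtual0"):
    print(rp_filter_path(name))
```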

For more information about this, see this ServerFault question, this article and this article (linked from the previous one).

Summary

There’s a lot of information in this post, but in summary what we’ve done is:

1) Created a new virtual interface for the container
2) Given the virtual interface an IP address either statically or using DHCP
3) Used iptables to set up inbound routing to allow the container to be reached via its external/routable IP from both the host and the network beyond the host
4) Optionally forwarded (and even more optionally, aliased) only specific inbound ports
5) Set up source NAT to rewrite outbound traffic from the container to use the correct external IP
6) Given the kernel a bit of a helping hand to understand that we’re not terrorists and it’s safe to “do the right thing” with our slightly odd-looking traffic

The first 5 steps of this took me about 20 minutes to figure out. Step 6 took me 3 days, the help of several network engineers, one extremely patient ops guy and no small amount of blind luck!

I hope having it all written down in one place will be helpful to others.

Redis memory optimisation

Danny — Thu, 15 Aug 2013 15:51:23 +0000

According to Alexa, my employer’s web site is the 13th most popular web site in the UK, the 66th most popular in the US and the 127th most popular in the world. At least it is today. Not too shabby, eh?

So real-time analytics is fun. At our peak time (UK lunch time) we get in the region of 500 views per second on our article pages. This doesn’t include the various index pages or the home page, or any of the analytics data we capture around video views, etc. And our editorial team want to see in real time how their articles are performing so they can tweak them, move them around on the home page, etc.

To enable real-time slice and dice capabilities we hold LOTS of counters in Redis. The general schema for these is:

dimension1:dimension2:dimension3:dimension4:articleId:dimension5:date:hour:minute = counter

What the individual dimensions are is largely irrelevant for this post. Many (but not all) of the fields can also hold “*” instead of real data. And there can be many permutations of fields with real data in and fields with “*” in. So, for example, we might have the following:

d1:d2:d3:d4:articleId:d5:date:hour:* = counter

to represent all the views to a particular article across the 5 other dimensions in a given hour. Or

d1:d2:d3:d4:*:d5:date:*:* = counter

to represent all the views across dimensions d1-d5 for any article on a given date.
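To get a feel for how one hit fans out into many keys, here is a rough sketch in Python. The field values and the choice of which fields may be wildcarded are invented for illustration, since the real rules aren't given here. It also generates every combination, which is an upper bound rather than what the platform necessarily stores.

```python
from itertools import product

FIELDS = ("d1", "d2", "d3", "d4", "articleId", "d5", "date", "hour", "minute")


def expand_keys(hit, wildcardable):
    """Return every key for one hit, with '*' allowed in the named fields."""
    choices = [
        (hit[field], "*") if field in wildcardable else (hit[field],)
        for field in FIELDS
    ]
    return [":".join(combo) for combo in product(*choices)]


# Hypothetical hit; the real dimension values are not shown in this post.
hit = {"d1": "web", "d2": "uk", "d3": "a", "d4": "b", "articleId": "98737",
       "d5": "c", "date": "20130815", "hour": "15", "minute": "51"}

keys = expand_keys(hit, wildcardable={"articleId", "hour", "minute"})
for key in keys:
    print(key)
```

With three wildcardable fields a single hit already touches 2^3 = 8 counters, which is how a few hundred hits per second turn into many thousands of Redis operations.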

So each of the 500 hits per second that the analytics platform receives can generate a large number of Redis updates to satisfy the various combinations of wildcards. Our two primary Redis instances rumble along at a sustained rate of around 20,000 operations per second.

Now, all of this data takes space. Serious space. We shard this data across a number of Redis instances based on what type of data it is (page views, visitor data, video data, metadata, etc.). Following the release of some new business requirements recently, our two primary instances were using in the region of 20-24 GB of RAM each, and holding something like 110,000,000 keys each. Ouch!

A bit of lateral thinking was required.

Those keys work out on average around 40 bytes each. And the counters in them are, on average, around 2-3 bytes. That’s a massive overhead for the key names. The key names represent around 90% of the memory usage and the actual data only around 10%. And the keys contain an awful lot of duplication too – for example, dimension 1 can only take 2 possible values, but they are represented by a 4 character string and a 5 character string. So we store that 4 or 5 character string in EVERY SINGLE KEY.

There are a number of possible approaches to this including simply shortening the segments by making them a bit less semantically meaningful (for example, instead of storing dates as `20130815` we could store something like the number of days since January 1st 2010), or using some kind of dictionary-based approach whereby we assign every unique value that a dimension could have to an integer and then have keys that look like 12:3917:2385:1222:98737:3746:5543. Good luck debugging either of those.

In the end we took a pretty simple approach. We realised that the article ID dimension had by far the largest cardinality. So instead of making it part of the key and having Redis keep simple counters, we instead changed the Redis data structures to hashes keyed on the article ID. Thus:

d1:d2:d3:d4:articleId:d5:date:hour:minute = counter

becomes

d1:d2:d3:d4:d5:date:hour:minute = hash of articleID -> counter

By this simple expedient, we managed to reduce the memory usage on the primary servers from around 24GB to around 8GB (a saving of 66%) and the number of keys from 110,000,000 to around 2,500,000 (a saving of around 98%!). Ops team love us. CTO’s budget spreadsheet is looking more chilled out. Epic win.
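The restructuring itself is easy to sketch. The following Python is an illustration of the key transformation using plain dictionaries in place of Redis, not the migration code we actually ran. The function names and sample keys are made up.

```python
from collections import defaultdict


def to_hash_entry(old_key):
    """Split an old flat key into (new hash key, article ID field)."""
    d1, d2, d3, d4, article_id, d5, date, hour, minute = old_key.split(":")
    return ":".join((d1, d2, d3, d4, d5, date, hour, minute)), article_id


def migrate(old_counters):
    """Regroup flat counters into one hash per key, keyed on article ID."""
    hashes = defaultdict(dict)
    for old_key, count in old_counters.items():
        hash_key, article_id = to_hash_entry(old_key)
        hashes[hash_key][article_id] = count
    return dict(hashes)


def saving(before, after):
    """Percentage reduction from before to after."""
    return 100 * (before - after) / before


old = {
    "a:b:c:d:1001:e:20130815:15:51": 3,
    "a:b:c:d:1002:e:20130815:15:51": 7,
}
print(migrate(old))
print(f"memory: {saving(24, 8):.1f}%  keys: {saving(110_000_000, 2_500_000):.1f}%")
```

Two old keys that differ only in article ID collapse into a single hash with two fields, which is where the drop in key count comes from.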

There are a number of other advantages to this approach beyond the obvious reduction in hardware:

1) Because we’re using so much less memory, dumping RDB files is much quicker.
2) Because writing RDB files is much quicker, the number of changes that happen during the dump is greatly reduced, meaning less memory pressure during the dump.
3) The new data structure exhibits much better locality of reference, meaning that for any given workload that happens during the dump, fewer pages need to be copied. This means, again, even less memory pressure during RDB dump.

It does have some drawbacks though. In particular, you have changed the data access paths.

One of our screens has a “sparkline” of how a particular article has trended over the last 2 hours. Previously, we would gather the data for this by issuing a single Redis MGET command with 120 keys (one for each minute) where the hour and minute fields in the key ranged over the time period we were interested in.

Now, however, we cannot do that because there is no command in Redis to get one hash member (the article ID) out of 120 different hashes. The solution to this is a story for a different post…
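To make the change in access path concrete, here is a sketch in Python of the lookups involved. The date format matches the `20130815` style mentioned earlier, but the two-digit hour and minute formats and all of the names are assumptions. It only builds the key lists and does not talk to Redis, and it is not the solution alluded to above.

```python
from datetime import datetime, timedelta


def sparkline_keys(dims, article_id, d5, end, minutes=120):
    """Old layout: one flat key per minute, suitable for a single MGET."""
    d1, d2, d3, d4 = dims
    keys = []
    for offset in range(minutes - 1, -1, -1):
        t = end - timedelta(minutes=offset)
        keys.append(":".join((d1, d2, d3, d4, article_id, d5,
                              t.strftime("%Y%m%d"), t.strftime("%H"),
                              t.strftime("%M"))))
    return keys


def sparkline_hash_lookups(dims, article_id, d5, end, minutes=120):
    """New layout: one (hash key, field) pair per minute, each a separate read."""
    d1, d2, d3, d4 = dims
    lookups = []
    for offset in range(minutes - 1, -1, -1):
        t = end - timedelta(minutes=offset)
        hash_key = ":".join((d1, d2, d3, d4, d5, t.strftime("%Y%m%d"),
                             t.strftime("%H"), t.strftime("%M")))
        lookups.append((hash_key, article_id))
    return lookups


end = datetime(2013, 8, 15, 15, 51)
old_keys = sparkline_keys(("a", "b", "c", "d"), "1001", "e", end)
new_lookups = sparkline_hash_lookups(("a", "b", "c", "d"), "1001", "e", end)
print(len(old_keys), old_keys[0], old_keys[-1])
print(len(new_lookups), new_lookups[-1])
```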

So the moral of this story is this. When you are working with large volumes of data in a data store (but an in-memory one in particular):

1) Never let a prototype go live
2) Think about whether your primary workload will be storage or retrieval.
3) Optimise your schema (in this case, the Redis key structure) accordingly.
4) Understand your data and think about how you’re storing it.

(This is also a good opportunity to mention that we took the data in the old storage format and migrated it to the new format using my Node.JS-based streaming RDB toolkit).

Emulating Maven’s “provided” scope in Gradle

Danny — Tue, 02 Oct 2012 20:27:28 +0000

I’m not a fan of Maven. I don’t make any secret of it. But one thing they’ve got right is the “provided” scope – that is, a scope whose dependencies are used at compile time, but not required at runtime because they’re expected to already be present. The canonical example of the use of this scope is the Servlet or JSP API – it needs to be present at compile time to compile against, but the container provides it at runtime, so it doesn’t need to be explicitly pulled in as a runtime dependency.

I am, however, a big fan of Gradle. Gradle’s Java plugin provides a number of default “configurations” (revealing Gradle’s close relationship to Ivy) which bear a close resemblance to Maven’s scopes. However, the Java plugin doesn’t provide anything like Maven’s “provided” scope. (Note: the War plugin does provide both “providedCompile” and “providedRuntime” configurations.)

Why would you want a “provided” configuration in a Java project? Well, my particular use case was that I was writing a custom Ant task (I’m aware of the irony of using Gradle to build this!) and I needed to compile against the Ant API – but, of course, at run time the Ant JAR is already present.

Having pieced together a number of posts on the internet, I came up with the following:

configurations {
    provided
}

sourceSets {
    main.compileClasspath += configurations.provided
    test.compileClasspath += configurations.provided
    test.runtimeClasspath += configurations.provided
}

dependencies {
    provided 'org.apache.ant:ant:1.8.4'
}

This works, but it feels a bit dirty.

If you’re an Eclipse user (of course you are!) it also has one major flaw. If you’re using the Eclipse STS Gradle plugin to create a managed classpath in your Eclipse project (rather than using Gradle’s Eclipse plugin to generate your .classpath and .project files), you’ll find that Eclipse doesn’t know anything about your “provided” dependencies and you have a bunch of compile errors.

Here’s how to sort that out:

apply plugin: 'eclipse'

eclipse.classpath.plusConfigurations += configurations.provided

So, you have to add Gradle’s Eclipse plugin – even if you wouldn’t otherwise be using it – and add your “provided” dependencies to the Eclipse classpath configuration. Even if you don’t use the Gradle Eclipse plugin, the Eclipse STS Gradle plugin uses the same model and will now pick up your “provided” dependencies.

There is an outstanding Gradle bug report/enhancement request for this feature, but for some reason it’s struggling to get traction. If you feel this is important enough to be a built-in part of the Gradle Java plugin, please vote for this issue.

World’s fastest break-in attempt

Danny — Mon, 16 Apr 2012 19:14:16 +0000

Actually, I’m pretty sure it isn’t, but still…

From spinning up a new EC2 instance today to getting the first e-mail from fail2ban took a little under 6 hours.

I don’t know if this makes me happy or sad. Happy because I have a Puppet-based bootstrap system which can bring a freshly minted box up to code in around 5 minutes (including iptables, fail2ban and a locked down SSH configuration), or sad because… well… have people really got nothing better to do?

In related news, when will fail2ban support IPv6? There seem to be lots of threads in lots of different issue tracking systems (most lately Github), many of which include patches, but no actual IPv6 action. Now that makes me sad.

WordPress, nginx, W3TC and robots.txt

Danny — Mon, 16 Apr 2012 15:29:58 +0000

A quick note to try and save somebody else the hours of pain I just experienced…

Here’s the scenario: you’re being dead clever and ditching Apache in favour of Nginx to run your WordPress blog/site and pretty much have everything right. You’re NOT using a plugin to generate robots.txt for you – after all, WordPress does a good enough job through the Settings > Privacy page. You browse to http://domain.com/robots.txt and everything looks pretty sweet. Heck, you might even go and change the privacy settings and grab robots.txt again to make sure it’s all working the way you expect.

Then… you drop the W3 Total Cache bomb. Now, W3TC is pretty well regarded, but it hasn’t had any love for several months. In fact, it hasn’t even been updated to say it’s compatible with WordPress 3.3.0+ (which it appears to be, AFAICT, although some people have had issues with Minify). What it does have though, is Nginx support out of the box.

What does that mean? Well, if W3TC detects that it is running on Nginx, it will write out a snippet of Nginx configuration which deals with all the cleverness needed to get Nginx to serve W3TC page cache files statically off the disk without having to go through PHP. (This, my friends, is a large part of the secret sauce that makes an Nginx/PHP stack so much faster than Apache/PHP.) Theoretically, all you have to do is use the include directive to pull this snippet into your virtual host configuration file, and you’re good to go. (If you do this then don’t forget to nginx -s reload every time you tweak your W3TC settings.)

And then it hits you. robots.txt has stopped working.

Here’s my solution (in my virtual host file, if you care):

    location = /robots.txt {
        # Force robots.txt through PHP. This supersedes a match in the
        # generated W3TC rules which forced a static file lookup
        rewrite ^ /index.php;
    }

This is a pretty specific location (using = and not having a regexp), so it trumps anything in the W3TC generated config. Any request for robots.txt is rewritten to index.php which your regular Nginx rules should then hand off to PHP-FPM, which means WordPress will dynamically generate the content for you.

Wow. That took me, literally, 2-3 hours to figure out. Mostly because I didn’t notice it had stopped working when I added W3TC into the mix. Once I’d figured out W3TC (or rather the W3TC generated config) was the culprit, the actual fix was pretty quick.

I’ll be writing more about my Nginx config and the relative performance against Apache2 on an Amazon EC2 Micro instance soon. In the mean time, I hope I saved you some time!

Adventures with DD-WRT and IPv6 (with a dash of TomatoUSB)

Danny — Sat, 28 Jan 2012 21:59:59 +0000

A little under a year ago, I decided two things: first, that it was about time my ageing home network got GigE and 5GHz wireless-N (dual band, of course, to support devices that would only do 2.4GHz); and second, that I would separate the jobs of BEING my network from CONNECTING my network to the internet (since I couldn’t find a good router which would meet these requirements AND had an ADSL modem in it).

So I bought a Linksys/Cisco E3000, made it the backbone of my network and connected it to the internet via my ISP-supplied ADSL modem.

Then an unfortunate incident happened which involved the Linksys/Cisco setup CD, an unwanted but non-removable guest WiFi network, and me swearing a lot.

The time had come (after about 16 hours!) to put DD-WRT on my router. As this post describes, choosing a version of DD-WRT that won’t “brick” your router (as the developers like to describe it) is treacherous to say the least. I eventually settled on dd-wrt.v24-16758_NEWD-2_K2.6_mega (specifically, the nv60k version). Despite the trepidation caused by the dire warnings on the web site, the flashing went well, and I’ve been pleased with DD-WRT ever since. Until…

Last week, I had a 40Mbit/sec fibre broadband connection installed. Amongst other things, my new ISP provides me with a block of IPv6 addresses. Actually 2^80 of them. I seriously need to think about what I’m going to do with them all.

My excitement at having 1,208,925,819,614,629,174,706,176 IP addresses was somewhat dampened when, after a day or so of fiddling and researching, I discovered that DD-WRT’s supposed IPv6 support was limited to the various types of v6-over-v4 tunnels (e.g. Hurricane Electric). Specifically, the PPP daemon doesn’t support IPv6 – so this might just be an issue for PPPoE users. There was no way for me to use all that space natively.
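If you want to check the arithmetic, Python's ipaddress module will do it. This is only a sketch: the prefix below is from the reserved documentation range rather than my real allocation, and the /48 is simply what 2^80 addresses implies, since 128 - 80 = 48.

```python
import ipaddress

ADDRESS_BITS = 128           # size of an IPv6 address
host_bits = 80               # 2^80 addresses in the block
prefix_length = ADDRESS_BITS - host_bits

# 2001:db8::/32 is reserved for documentation, so this is an example only.
block = ipaddress.ip_network(f"2001:db8::/{prefix_length}")
print(f"/{prefix_length} holds {block.num_addresses:,} addresses")
```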

It should be noted here that even if you do want to use a tunnel to reach the IPv6 internet, you will still need to write startup scripts for DD-WRT to load the kernel module (the “Enable IPv6” checkbox doesn’t actually do anything), start radvd (the “Enable radvd” checkbox doesn’t actually do anything), configure the tunnel interfaces and WAN IP addresses, etc. And even after all of this, you’ll find that the IPv6 user tools (ip6tables, ping6, traceroute6, etc.) aren’t installed, so you’ll have to locate them and hope you have room on your device somewhere.

So the time has come to make the move to TomatoUSB. To some extent, this suffers from the same issues as DD-WRT when it comes to variants, etc., but the information is more logically presented, and there do seem to be fewer choices and fewer potential traps. After looking at the comparison of “mods” on Wikipedia, I chose Toastman’s mod. It seems to have all the features I wanted and he seems to do frequent builds with all the latest updates and patches – in fact, the latest build (1.28.7494.3) was made only 6 days ago. This compares well with DD-WRT which doesn’t appear to have had any real active work/releases for a year or so now.

My first impressions of TomatoUSB are positive. The GUI feels snappy, and has most of the same features as DD-WRT. The real-time bandwidth monitor is definitely prettier than DD-WRT’s. And, most importantly, the IPv6 support works out of the box.

Out-of-the-box, ip6tables is configured to allow ICMP packets of every type (so I can ping all my machines from various online ping sites), but disallow all inbound traffic. So, Linux ip6tables bugs aside, I’m secure by default, which is nice. There doesn’t seem to be a GUI interface to set up firewall rules for IPv6, so I guess if I ever want to let anything in, I’ll have to ssh to the router and do it by hand – but why would I ever want that?

And that’s that. I took under 2 hours to flash TomatoUSB, reproduce all my configuration on it, and get IPv6 working. Nice. I can now browse ipv6.google.com, www.v6.facebook.com/, and I get a dancing turtle when I visit www.kame.net.

One last thing: don’t forget to enable IPv6 privacy extensions on all of your hosts!

iTunes 10.5 upgrade woes

Danny — Wed, 12 Oct 2011 21:26:02 +0000

iOS 5 is upon us, so I thought I would snag a copy and see what’s what. But, first things first… I apparently needed to upgrade iTunes to 10.5. (Why? Why do I need a particular version of a media player to install a particular version of a mobile phone OS?)

I’m running Windows Vista (yes, really) Ultimate x64 SP2 with all current patches applied. After the obligatory unchecking of unwanted crap from the Apple software update tool (specifically, MobileMe and Safari), I settled in to watch the very slow download. I guess Apple’s servers are a bit overloaded right now. And then the very slow installation process begins. And then… the very slow installation process aborts.

Hmmm. Try again. At least it didn’t seem to need to do the download again. Same failure. No real error message. Just “failed to install” or something equally unhelpful.

A quick Google turned up lots of people having this problem. Some on Windows 7, some on Vista. All on x64. The typical advice was to try installing as an administrator, try downloading and running the MSI by hand, try both (manual install as an administrator). None of it helped. It did, however, reveal a more useful error message. “Service ‘iPod Service’ (iPod Service) could not be installed. Verify that you have sufficient privileges to install system services.”

Googling this turned up a year-old blog post by David Lesault which hit the spot.

Essentially, it seems the installer has issues uninstalling the iPod Service sometimes (I’ve never had this problem before, others seem to have had it since the genesis of iTunes 10.x). The service is marked for deletion, but not quite gone yet. Hence trying to install the new version of the service failed. This is similar to that funky Windows thing where it can’t delete files that are in use by a process, but remembers them and deletes them when you reboot. Which, incidentally, is one of the primary reasons why Windows insists on reboots after various kinds of patches, although this is much, much better in Vista and later.

So, I slightly altered David’s process. I got myself to the error message and then simply switched my machine off (hold the power button for 4 seconds). When I restarted and tried the install again, MSI said that a pending installation was in progress and I would need to roll that back before continuing, which I duly did. I was a bit worried that the “rollback” would reinstall the old version of the service, but it didn’t seem to. And that’s that… the iTunes 10.5 upgrade successfully installed.

Now, only 7 minutes left of the iOS 5 download, and who knows how long it will take to actually upgrade the phone and what issues I will have…?

Thanks Apple. :-/

Disruptor.NET

Danny — Sun, 03 Jul 2011 19:53:13 +0000

It’s interesting to see people getting interested in porting the Disruptor to .NET (although what’s wrong with The One True Language, I don’t know!).

Tim Gebhardt has a port on Github
Matt Davey (Technical Director at Lab49) also has a couple of posts about porting the Disruptor and comparing the performance of his port to the Java version

I’ll try to keep this post updated as I learn of more .NET interest. Alternatively, please feel free to post a comment below.

The Disruptor – Lock-free publishing

Danny — Tue, 28 Jun 2011 16:57:11 +0000

In case you’ve been living on another planet, we recently open-sourced our high performance message passing framework.

I’m going to give a quick run-down on how we put messages into the ring buffer (the core data structure within the Disruptor) without using any locks.

Before going any further, it’s worth a quick read of Trish’s post, which gives a high-level overview of the ring buffer and how it works.

The salient points from this post are:

The ring buffer is nothing but a big array.
All “pointers” into the ring buffer (otherwise known as sequences or cursors) are Java longs (64 bit signed numbers) and count upward forever. (Don’t panic – even at 1,000,000 messages per second, it would take the best part of 300,000 years to wrap around the sequence numbers).
These pointers are then “mod’ed” by the ring buffer size to figure out which array index holds the given entry. For performance, we actually force the ring buffer size to be the next power of two bigger than the size you ask for, and then we can use a simple bit-mask to figure out the array index.
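The arithmetic behind those last two points can be sketched in a few lines of Python. This illustrates the idea rather than the Disruptor's actual code. The function names are made up, and treating a size that is already a power of two as acceptable as-is is an assumption.

```python
def ceil_power_of_two(requested):
    """Smallest power of two that is >= requested (for requested >= 1)."""
    return 1 << (requested - 1).bit_length()


def slot_index(sequence, size):
    """Array index for a sequence number; size must be a power of two."""
    return sequence & (size - 1)


size = ceil_power_of_two(7)
print(size, slot_index(11, size), 11 % size)

# How long a signed 64-bit sequence lasts at 1,000,000 messages per second.
seconds_per_year = 365.25 * 24 * 3600
years_to_wrap = 2 ** 63 / 1_000_000 / seconds_per_year
print(f"{years_to_wrap:,.0f} years")
```

The mask gives the same answer as the modulo because the size is a power of two, and a single AND is cheaper than a division.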

Basic ring buffer structure

WARNING: In terms of the organisation of the code, much of what I’m about to say is a simplification. Conceptually, I think it’s simpler to understand starting from how I describe it.

The ring buffer maintains two pointers, “next” and “cursor”:

In the picture above, a ring buffer of size 7 (hey, you know how these hand-drawn diagrams work out sometimes!) has slots 0 through 2 filled with data. The next pointer refers to the first free slot. The cursor refers to the last filled slot. In an idle ring buffer, they will be adjacent to each other as shown.

Claiming a slot

The Disruptor API has a transactional feel about it. You “claim” a slot in the ring buffer, then you write your data into the claimed slot, then you “commit” the data.

Let’s assume there’s a thread that wants to put the letter “D” into the ring buffer. It claims a slot. The claim operation is nothing more than a CAS “get-and-increment” operation on the next pointer. That is, this thread (let’s call it thread D) simply does an atomic get-and-increment which moves the next pointer to 4, and returns 3. Thread D has now claimed slot 3:

Next, another thread (thread E) claims slot 4 in the same manner:

Committing the writes

Now, threads D and E can both safely and simultaneously write their data into their respective slots. But let’s say that thread E finishes first for some reason…

Thread E attempts to commit its write. The commit operation consists of a CAS operation in a busy-loop. Since thread E claimed slot 4, it does a CAS waiting for the cursor to get to 3 and then setting it to 4. Again, this is an atomic operation. So, as the ring buffer stands right now, thread E is going to spin because the cursor is set to 2 and it (thread E) is waiting for the cursor to be at 3.

Now thread D commits. It does a CAS operation and sets the cursor to 3 (the slot it claimed) iff the cursor is currently at 2. The cursor is currently at 2, so the CAS succeeds and the commit succeeds. At this point, cursor has been updated to 3 and all data up to that sequence number is available for reading.

This is an important point. Knowing how “full” the ring buffer is – i.e. how much data has been written, which sequence number represents the highest write, etc. – is purely a function of the cursor. The next pointer is only used for the transactional write protocol.

The final step in the puzzle is making thread E’s write visible. Thread E is still spinning trying to do an atomic update of the cursor from 3 to 4. Now the cursor is at 3, its next attempt will succeed:
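The walkthrough above can be replayed with a small model. This is a single-threaded Python sketch with made-up names: ordinary integers stand in for the atomic variables and a plain comparison stands in for the hardware CAS, so it shows the ordering of events and is not a lock-free implementation.

```python
class RingBufferModel:
    """Single-threaded model of the claim/commit protocol (not thread-safe)."""

    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.next = 0      # first free sequence number
        self.cursor = -1   # last committed sequence number

    def claim(self):
        # Stands in for an atomic get-and-increment on `next`.
        sequence = self.next
        self.next += 1
        return sequence

    def write(self, sequence, value):
        self.slots[sequence & self.mask] = value

    def try_commit(self, sequence):
        # Stands in for CAS(cursor, expected=sequence - 1, new=sequence).
        if self.cursor == sequence - 1:
            self.cursor = sequence
            return True
        return False  # a real writer would spin and try again


ring = RingBufferModel(8)
for letter in "ABC":               # slots 0 to 2 are already filled
    seq = ring.claim()
    ring.write(seq, letter)
    ring.try_commit(seq)

d = ring.claim()                   # thread D claims slot 3
e = ring.claim()                   # thread E claims slot 4
ring.write(d, "D")
ring.write(e, "E")

e_first_attempt = ring.try_commit(e)    # cursor is 2, so E has to keep spinning
d_attempt = ring.try_commit(d)          # cursor moves from 2 to 3
e_second_attempt = ring.try_commit(e)   # cursor moves from 3 to 4
print(e_first_attempt, d_attempt, e_second_attempt, ring.cursor)
```

Thread E's data is in the array before thread D's commit, but readers only ever look at the cursor, so they never see slot 4 until slot 3 is visible too.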

Summary

The order that writes are visible is defined by the order in which threads claim slots rather than the order they commit their writes, but if you imagine these threads are pulling messages off a network messaging layer then this is really no different from the messages arriving at slightly different times, or the two threads racing to the slot claim in a different order.

So there we have it. It’s a pretty simple and elegant algorithm. (OK, I admit I was heavily involved in its creation!) Writes are atomic, transactional and lock-free, even with multiple writing threads.

(Thanks to Trish for the inspiration for the hand-drawn diagrams!)

Open sourcing the Disruptor

Danny — Tue, 21 Jun 2011 09:14:37 +0000

LMAX recently open-sourced The Disruptor – one of the core frameworks upon which we build our ultra-high performance financial exchange. Today, we published a white paper detailing how The Disruptor works, and highlighting the sorts of performance benefits that can be achieved by using it.

The Disruptor is essentially a library which we (and now you!) can use to do message passing within your application. If you like, it’s a queue on steroids. But this stuff is far more fascinating than just that for a number of reasons.

Firstly, the raw performance figures. Our testing shows that the latency you can achieve with The Disruptor is 3 orders of magnitude less than you can achieve with ArrayBlockingQueue. And with that, comes throughput that’s an order of magnitude higher! Win-win. But there’s more. The Disruptor actually goes faster under higher load. We’ve had this monster passing messages with latencies as low as 50ns. That’s approaching the theoretical limit of what you can achieve with the hardware. Still think Java is slow?

Here’s a chart from the white paper, showing the relative latencies. It’s worth keeping in mind that this chart uses a log-log scale.

Secondly, the implementation approach. The Disruptor has been designed and built by stepping away from the problem, and re-evaluating it from a CS101 perspective. A lot of the principles used fly in the face of modern, main-stream concurrency ideas. For example, most deployments of The Disruptor will allow you to pass messages from multiple producers to multiple consumers without a single lock. Not locking means not going to the kernel for lock arbitration, and that means no latency spikes.

Thirdly, the consistency. In our tests, ArrayBlockingQueue was giving us a mean latency of over 30,000ns, and a 99.99% tail of 4,000,000ns. The Disruptor was showing a mean latency of just 52ns, and a 99.99% tail of around 8,000ns. The consistency with which The Disruptor out-performs traditional queueing/message-passing techniques leads to less jitter and latency spikes. This is vital in a financial environment. Latency spikes lead to unhappy market makers who have no confidence in the prices they’re making. That leads to wider spreads. And that, ultimately, leads to unhappy customers.

Finally, you really have to hear Martin Thompson (our CTO) and Mike Barker (one of our tech leads) talk about this stuff to understand the passion that they put into its creation.

EDIT: Turns out Mike Barker has written a similar post today!

EDIT 2: Also, Trisha Gee has written a post which goes into some of the basics about what a ring buffer is. Can you tell we’re all quite excited?

For more information, check out the Google code project and the white paper.