<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Tammer Saleh</title>
  <id>http://tammersaleh.com/posts</id>
  <link href="http://tammersaleh.com/"/>
  <link href="http://tammersaleh.com/posts.atom" rel="self"/>
  <updated>2016-05-24T10:44:00+00:00</updated>
  <author>
    <name>Tammer Saleh</name>
  </author>
  <entry>
    <title>A Modern Cloud Vocabulary</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/a-modern-cloud-vocabulary/"/>
    <id>http://tammersaleh.com/posts/a-modern-cloud-vocabulary/</id>
    <published>2016-05-24T10:44:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p&gt;The Cloud is a quantum flux that pushes the limits of the definition of "eventually consistent".  I've seen things you people wouldn't believe.  VMs returning from a deprovision command &lt;em&gt;a year later&lt;/em&gt;.  Entire regions destroyed by a router misconfig.  API calls returning completely different results ten times in a row.  Networks on fire off the shoulder of Orion.&lt;/p&gt;

&lt;p&gt;Sorry… got carried away.&lt;/p&gt;

&lt;p&gt;In any case, &lt;strong&gt;the Cloud is Chaos&lt;/strong&gt;. Cloud-resilient systems must adhere to a new vocabulary in order to withstand it, but many developers have never been exposed to these concepts.  They have their trade-offs, but systems built using these patterns are stronger, more resilient, and are overall easier to reason about.&lt;/p&gt;

&lt;h4 id="declarative-vs-imperative"&gt;Declarative (vs Imperative)&lt;/h4&gt;

&lt;p&gt;When I was a kid, just learning about computers and programming and such, my teacher asked us to describe the precise steps necessary to make a sandwich.  She then paired us up, had us perform those steps verbatim, and sat back and cackled as we made disasters of our lunch.  She was imparting an important lesson about how we interact with computers: their power is that they do exactly what we tell them to do; their fundamental flaw is the same.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;imperative&lt;/strong&gt; model.  We tell the machine, step by step, what actions to perform.  We cross our fingers and hope that these actions will result in a sandwich.  If our instructions are wrong, then we end up with something else.  If our assumptions were wrong, and the mayo was on the left while the peanut butter was on the right, then we end up with something else.&lt;/p&gt;

&lt;p&gt;Imperative systems work well when the conditions are controlled and predictable.  If you know &lt;code&gt;i&lt;/code&gt; started at &lt;code&gt;1&lt;/code&gt;, then you know &lt;code&gt;i++&lt;/code&gt; will result in &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Imperative systems make &lt;strong&gt;actual&lt;/strong&gt; and &lt;strong&gt;desired&lt;/strong&gt; state &lt;em&gt;implicit&lt;/em&gt;, and make moving between them &lt;em&gt;explicit&lt;/em&gt;.  In that way, they are the opposite of declarative systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declarative&lt;/strong&gt; systems make actual and desired states first-class concepts, and take on the responsibility of bridging the gap between them (which we call &lt;strong&gt;the delta&lt;/strong&gt;).  We interact with these systems by telling them how the world is &lt;em&gt;already supposed to be&lt;/em&gt;, and watch as they scramble to catch up.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I should have a sandwich in front of me.&lt;/p&gt;

  &lt;p&gt;Crap! You're right - there should be a sandwich there!  Let me make one!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Declarative systems are much harder to build.  They have to be smart enough to find their own path to the desired state, routing around any obstacles in the way.  They can't make assumptions, and instead have to be able to discover the actual state of the world.  They're also less predictable.  How are we getting there?  Who knows?  Who cares?  Look - we're here!  And they can be verbose.  Instead of instructing the system to "add mayo" to fix your sandwich, you have to declare the entire sandwich all over again (mayo included).&lt;/p&gt;
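&lt;p&gt;A tiny shell illustration of the two mindsets (a loose analogy - &lt;code&gt;mkdir -p&lt;/code&gt; is not a full declarative system, but it captures the "say how the world should be" shape):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Imperative: a step that assumes a particular starting state.
mkdir logs        # blows up if logs already exists

# Declarative-ish: state the world you want; the tool finds the delta.
mkdir -p logs     # "there should be a directory named logs"
&lt;/code&gt;&lt;/pre&gt;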

&lt;p&gt;If imperative systems thrive in controlled environments, declarative systems excel in chaos.&lt;/p&gt;

&lt;p&gt;(The Cloud is chaos)&lt;/p&gt;

&lt;p&gt;Declarative systems are one piece of the puzzle in resiliency.  Another is…&lt;/p&gt;

&lt;h4 id="convergence"&gt;Convergence&lt;/h4&gt;

&lt;p&gt;When I think of convergence, I think of &lt;a href="https://en.wikipedia.org/wiki/The_Oak_and_the_Reed"&gt;the parable of the oak and the reed&lt;/a&gt;.  The oak is strong but rigid, while the reed is weak but survives the storm.&lt;/p&gt;

&lt;p&gt;(The Cloud is that storm)&lt;/p&gt;

&lt;p class="img_right"&gt;&lt;img src="/posts/a-modern-cloud-vocabulary/useless_box.gif" alt="Image" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Convergence"&gt;Convergence&lt;/a&gt; is a property of a system that says "no matter how hard you blow, I'll eventually be standing again."  Convergence goes hand in hand with declarative systems, as a convergent system needs to understand an explicit desired state.  Convergence works by continually monitoring the delta between actual and desired state and triggering the declarative system when that delta exists.  Often this works by taking a declarative system and simply running it in a loop.&lt;/p&gt;

&lt;h4 id="idempotence"&gt;Idempotence&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning"&gt;Idempotence&lt;/a&gt; means an action has the same effect no matter if it's called once or many times.  Mathematically speaking, that means &lt;code&gt;f(f(x)) == f(x)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Idempotence is important in cloudy systems such as distributed databases.  Operating in the Cloud means operating in chaos (see a theme forming?), where messages may or may not make it to their destination.  We need to know that we can safely run the same command twice and get the desired result.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code&gt;x = 5&lt;/code&gt; is idempotent, but &lt;code&gt;x++&lt;/code&gt; isn't.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;HTTP PUT&lt;/code&gt; is idempotent, but &lt;code&gt;HTTP POST&lt;/code&gt; isn't.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;echo "foo" &amp;gt; bar&lt;/code&gt; is idempotent, but &lt;code&gt;echo "foo" &amp;gt;&amp;gt; bar&lt;/code&gt; isn't.&lt;/li&gt;
&lt;/ul&gt;
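&lt;p&gt;You can watch that last pair play out in a terminal - the end state of the idempotent command is identical no matter how many times it runs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;echo "foo" &amp;gt; bar
echo "foo" &amp;gt; bar
cat bar           # still just "foo" - one line

echo "foo" &amp;gt;&amp;gt; bar
echo "foo" &amp;gt;&amp;gt; bar
cat bar           # three lines of "foo" - grows on every call
&lt;/code&gt;&lt;/pre&gt;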

&lt;p&gt;Declarative systems are inherently idempotent.&lt;/p&gt;

&lt;h4 id="immutability"&gt;Immutability&lt;/h4&gt;

&lt;p&gt;When I was a Unix geek&lt;sup id="fnref:geek"&gt;&lt;a href="#fn:geek" class="footnote"&gt;1&lt;/a&gt;&lt;/sup&gt;, I managed 400 machines the old-fashioned way – SSH and hand-crafted bash loops.  This was a nightmare for a number of reasons…&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It was error-prone – what if the server's down?&lt;/li&gt;
  &lt;li&gt;It created snowflakes with slightly different configurations than the rest.&lt;/li&gt;
  &lt;li&gt;It was impossible to reason about the contents of a server.  The only way to know the actual state was to SSH in and find out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chef and Puppet automated this process, but perpetuated the fundamental flaw of gradually applying changes to the system.&lt;/p&gt;

&lt;p&gt;The programming world has long had the concept of immutable data structures (C's &lt;code&gt;const&lt;/code&gt; being the simplest), but this concept has been extended to everything from server configurations to distributed databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://chadfowler.com/2013/06/23/immutable-deployments.html"&gt;Immutable infrastructure&lt;/a&gt; is the practice of burning entire VM images for each software deployment, and reprovisioning those servers entirely.  This &lt;a href="https://medium.com/built-to-adapt/the-three-r-s-of-enterprise-security-rotate-repave-and-repair-f64f6d6ba29d#.tbxfshlry"&gt;increases security&lt;/a&gt; and reliability, removes snowflakes, and makes the servers easier to reason about.&lt;/p&gt;

&lt;p&gt;Immutability can also have a big impact on data stores.  It's easy to make performance optimisations if you can decide that the data will never be altered.  This also makes many distributed data strategies tractable.  Even if your data &lt;em&gt;can&lt;/em&gt; be changed, you can fake it by storing immutable versions and reading the most recent one.  This is the underlying principle of many modern eventually-consistent systems.  I don't know this for sure, but I'm willing to bet cash money that Twitter uses this trick to store tweets.&lt;/p&gt;
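&lt;p&gt;The "store immutable versions and read the most recent" trick can be sketched with nothing but the filesystem (a hypothetical layout, just to show the idea):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Never overwrite - every change writes a brand new version.
echo "draft"     &amp;gt; record.v1
echo "published" &amp;gt; record.v2

# "The" record is simply whichever version is newest.
ls record.v* | sort -t v -k2 -n | tail -1 | xargs cat   # published
&lt;/code&gt;&lt;/pre&gt;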

&lt;h3 id="know-your-vocabulary"&gt;Know Your Vocabulary&lt;/h3&gt;

&lt;p&gt;Understanding concepts like Declarative systems, Convergence, Idempotence, and Immutability, and knowing when and how to apply them will help you as you build architectures stable enough to withstand the chaotic mess of the Cloud.&lt;/p&gt;

&lt;div class="footnotes"&gt;
  &lt;ol&gt;
    &lt;li id="fn:geek"&gt;
      &lt;p&gt;Past tense… Yeah, right.&amp;nbsp;&lt;a href="#fnref:geek" class="reversefootnote"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
  </entry>
  <entry>
    <title>Stop Looking for Trouble</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/stop-looking-for-trouble/"/>
    <id>http://tammersaleh.com/posts/stop-looking-for-trouble/</id>
    <published>2015-01-25T10:16:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p&gt;One of the hardest judgement calls in programming is determining the balance between refactoring and progress.  Focusing on the quality of the codebase is an investment in its maintainability. Of course, when taken to the extreme, this can completely derail our progress on shipping functionality to our users.&lt;/p&gt;

&lt;p&gt;So, how do we find the right balance?  In an agile development workflow, we rely upon the TDD cycle: red, green, refactor.  This produces a rhythm that maintains feature development without sacrificing quality.&lt;/p&gt;

&lt;p&gt;However, some developers who are new to this workflow question whether it allows them to step back and find better ways to design the overall system.  One of the misconceptions that leads to this doubt is believing that the refactor step in "red, green, refactor" should only apply to the code you're currently writing. This creates a myopic behavior of only touching those bits of the codebase that are currently visible in the editor.  Clearly that would be foolish – refactorings often apply throughout an entire project.&lt;/p&gt;

&lt;p&gt;This isn't what the agile process encourages.  Instead, it dictates that the refactoring &lt;em&gt;must&lt;/em&gt; be driven by pain you're experiencing in the story you're currently working on.&lt;/p&gt;

&lt;p&gt;But why do we not spend that time looking for refactorings outside the scope of the current story?  Surely cleaning up code is a good thing, no matter where or why?&lt;/p&gt;

&lt;h3 id="you-cant-predict-the-future"&gt;You can't predict the future.&lt;/h3&gt;

&lt;p&gt;Code changes constantly.  Continual refactoring, shifting customer requirements, the adoption of 3rd-party libraries, and more all translate into entire continents of code disappearing overnight.  Investing time in cleaning up these undiscovered countries is often utterly wasted when those swaths of code are later removed.&lt;/p&gt;

&lt;p&gt;What's more: these premature architectural changes solidify design choices.  Since we can't predict future demands that will be placed upon our code, these new walls often have to be moved at great expense.&lt;/p&gt;

&lt;p&gt;Most importantly, when we base our refactorings on tangible pain being felt at this moment, it's much more likely that we're actually avoiding future pain.  We can say with confidence: "That hurt.  Let's fix it to make sure it doesn't hurt again."&lt;/p&gt;

&lt;h3 id="stop-looking-for-trouble"&gt;Stop looking for trouble.&lt;/h3&gt;

&lt;p&gt;To help keep my attention on useful refactorings, I like to remind myself to &lt;strong&gt;stop looking for trouble&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you find yourself finished with a story and itching to find undiscovered places where code could be improved, &lt;strong&gt;you're just looking for trouble&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you come up with clever ways of abstracting modules that aren't yet being used in other places, &lt;strong&gt;you're just looking for trouble&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you start your greenfield application using Microservices™ and a message bus, &lt;strong&gt;you're just looking for trouble&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you say to your pair "Did you hear that noise down the alley?  Let's take a look!" then &lt;strong&gt;you're just looking for trouble&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id="trust"&gt;Trust.&lt;/h3&gt;

&lt;p&gt;There are two realisations that allowed me to stop looking for trouble:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If code is never touched, the interest on the debt it contains is zero.&lt;/li&gt;
  &lt;li&gt;My teammates are more than capable of tackling said code with a machete if and when they finally touch it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This really all comes down to trust.  I realised that when I wanted to seek out strange and new places for technical debt, it was because I wasn't respecting my team.  I didn't trust them to clean up after themselves.&lt;/p&gt;

&lt;p&gt;By basing your refactorings on the pain you're feeling, and by trusting that your team will do the same, you can be confident that you're making the product better, without spinning your wheels by over-engineering.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>UNIX Programming by Example: Runit</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/unix-programming-by-example-runit/"/>
    <id>http://tammersaleh.com/posts/unix-programming-by-example-runit/</id>
    <published>2014-07-09T11:29:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p&gt;I've often heard from fellow developers that they don't feel they have a strong foundation of UNIX principles.  &lt;a href="http://en.wikipedia.org/wiki/Unix_philosophy"&gt;The Unix Philosophy&lt;/a&gt; is a good set of meta-rules (much like SOLID principles), but I do better by seeing concrete examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://rubyists.github.io/2011/05/02/runit-for-ruby-and-everything-else.html"&gt;This description of Runit&lt;/a&gt; is such an example.  It makes graceful use of the filesystem, symlinks, convention-over-configuration, process semantics, tool modularity and so on.&lt;/p&gt;

&lt;p&gt;If you read that article and fully understand all of the features Runit gives you and the benefit of the features it doesn't, then you're on your way to really understanding good UNIX development.&lt;/p&gt;

&lt;p&gt;Take a second to read through &lt;a href="http://rubyists.github.io/2011/05/02/runit-for-ruby-and-everything-else.html"&gt;"Runit for Ruby (And Everything Else)"&lt;/a&gt; if you haven't already.  We're going to go through it, point by point, to get a practical understanding of what we should be aiming for when building UNIX friendly tools.&lt;/p&gt;

&lt;h3 id="conventions-over-requirements-over-configuration"&gt;Conventions over Requirements over Configuration&lt;/h3&gt;

&lt;p&gt;Good UNIX tools favor rewarding conformity over enforcing dogma.  Runit demonstrates this in spades.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As you can see, all services live in one place, defined by &lt;code&gt;$SVDIR&lt;/code&gt; or &lt;code&gt;/service&lt;/code&gt; by default, which &lt;code&gt;runsvdir&lt;/code&gt; manages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…and, later:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We can […] give a user control over their own &lt;code&gt;$SVDIR&lt;/code&gt;, &lt;code&gt;/home/username/service&lt;/code&gt; here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything Runit needs in order to operate is encapsulated in that one directory, conventionally &lt;code&gt;/service&lt;/code&gt;, but also easily changed by a non-root user.&lt;/p&gt;

&lt;p&gt;It would be tempting but misguided for the Runit developers to add a check to ensure it's being run as root.  By not enforcing that, and by allowing the directory to be configured by the environment, they've enabled a major feature: per-user service directories.&lt;/p&gt;

&lt;p&gt;This reminds me of the Ruby duck-typing philosophy.  Don't sprinkle aggressive type checks throughout your code. Instead, be more flexible and rely on the fact that your code will break cleanly when used improperly.  By doing so, you can open entirely new use-cases.&lt;/p&gt;

&lt;p&gt;We see this again in the implementation of runlevels:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Runlevels become (unlimited amount of) directories of services (in &lt;code&gt;/etc/runit/runsvdir&lt;/code&gt;) which can be switched to quickly and simply.&lt;/p&gt;

  &lt;pre&gt;&lt;code&gt;# Switch to the services described in `/etc/runit/runsvdir/primary`
$ runsvchdir primary

# Switch to the services described in `/etc/runit/runsvdir/failover`
$ runsvchdir failover
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;Runit doesn't place unnecessary and arbitrary constraints on the user.  It's simpler to outline conventions and allow the user to build functionality through normal tools like symlinks.  See this description of &lt;code&gt;runsvchdir&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;runsvchdir switches to the directory &lt;code&gt;/etc/runit/runsvdir/&lt;/code&gt;, copies &lt;code&gt;current&lt;/code&gt; to &lt;code&gt;previous&lt;/code&gt;, and replaces &lt;code&gt;current&lt;/code&gt; with a symlink pointing to dir.&lt;/p&gt;

  &lt;p&gt;Normally &lt;code&gt;/service&lt;/code&gt; is a symlink to &lt;code&gt;current&lt;/code&gt;, and runsvdir(8) is running &lt;code&gt;/service/&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, all &lt;code&gt;runsvchdir&lt;/code&gt; actually does is move symlinks around.  This is just an encouragement to follow good conventions. A user who "knows better" could easily build scripts to enable more complex tactics.  This is similar to how &lt;a href="https://github.com/capistrano/capistrano/wiki/Capistrano-Tasks#deploycreate_symlink"&gt;capistrano manages the currently deployed application&lt;/a&gt;.  Symlinks are also used for the &lt;code&gt;/service/*&lt;/code&gt; entries:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The services in &lt;code&gt;/service/*&lt;/code&gt; are symlinks to directories (usually in &lt;code&gt;/etc/sv/&lt;/code&gt;) which must contain one executable file, named ‘run’.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that &lt;code&gt;/etc/sv/&lt;/code&gt; is just an encouraged location for your actual service directories.  As King of Your Domain™, you could decide to keep those in &lt;code&gt;/tmp/goingawaynextboot&lt;/code&gt; – Runit doesn't need to care, so it doesn't care.  But it does strongly suggest the convention of using &lt;code&gt;/etc/sv/&lt;/code&gt; by repeatedly referring to it in the documentation.&lt;/p&gt;

&lt;p&gt;On the other hand, Runit &lt;em&gt;does&lt;/em&gt; need to know how to run your service.  It could allow you to configure that in a &lt;code&gt;runit.rc&lt;/code&gt; file or some such.  But it takes a much easier and more opinionated approach: it requires that you have a &lt;code&gt;run&lt;/code&gt; script.  It's simple for both Runit and the user, and it's much more predictable.&lt;/p&gt;

&lt;p&gt;Runit also leans on convention to enable an entire set of features around failure notifications:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When a process stops, if a file named &lt;code&gt;finish&lt;/code&gt; exists and is executable, finish will be run with two arguments, the exit code and exit status of run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that the name of the script isn't configurable (since there's no real need) - it's just &lt;code&gt;finish&lt;/code&gt;.  It's also entirely optional, since Runit can operate without it.  Runit is able to gracefully punt on the entire notification quagmire, because it respects the fact that…&lt;/p&gt;
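&lt;p&gt;A minimal &lt;code&gt;finish&lt;/code&gt; script might look like the following.  The &lt;code&gt;mail&lt;/code&gt; command is a stand-in - plug in whatever your team actually pages with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/bin/sh
# Runit invokes this as: finish &amp;lt;exit-code&amp;gt; &amp;lt;exit-status&amp;gt;
code="$1"
if [ "$code" != "0" ]; then
  echo "service exited with $code" | mail -s "service down" ops@example.com
fi
&lt;/code&gt;&lt;/pre&gt;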

&lt;h3 id="your-users-are-developers"&gt;Your Users are Developers&lt;/h3&gt;

&lt;p&gt;Let's look at the output of &lt;code&gt;sv s&lt;/code&gt; (short for &lt;code&gt;sv status&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;$ sudo sv s /service/*
run: /service/callcenter: (pid 2870) 5266009s
run: /service/postgres: (pid 3732) 7700117s
run: /service/lighttpd: (pid 27321) 5208602s; run: log: (pid 3724) 7700117s
run: /service/ssh: (pid 3757) 7700116s; run: log: (pid 3731) 7700117s  
...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that all of the details for each service are listed on a single line, in a simple, but well-defined format.  This is easily parsable by both machines and humans.&lt;/p&gt;

&lt;p&gt;Of course it'd be easier for me to read if it converted the seconds to hours or days as appropriate, or if it aligned all of the columns.  But that would make it much harder for me to drive &lt;code&gt;sv&lt;/code&gt; through other scripts.  As a UNIX user, I'm more than happy to make that trade-off.&lt;/p&gt;
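&lt;p&gt;For example, driving that output from another script is a one-liner (the field positions here assume the exact format shown above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Show only the services that aren't running.
sudo sv s /service/* | grep -v '^run:'

# Print the name and pid of each running service.
sudo sv s /service/* | awk -F'[:() ]+' '/^run:/ { print $2, $4 }'
&lt;/code&gt;&lt;/pre&gt;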

&lt;p&gt;You can see this respect for developers in the way Runit enables dependencies between services.  Runit simply provides the ability to wait for another service to come up, and expects the user to write &lt;code&gt;run&lt;/code&gt; scripts that use it:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Basic dependency example &lt;code&gt;/service/lighttpd/run&lt;/code&gt;:&lt;/p&gt;

  &lt;pre&gt;&lt;code&gt;#!/bin/sh -e
sv -w7 check postgresql
exec 2&amp;gt;&amp;amp;1 myprocess
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;This single feature (wait up to 7 seconds for the &lt;code&gt;postgresql&lt;/code&gt; service to be running, exiting with an error if that timeout is reached), when used by the end user in their &lt;code&gt;run&lt;/code&gt; script, solves a whole world of dependency-management problems in one blow.&lt;/p&gt;

&lt;p&gt;You can even see this focus on scriptability in the lack of output in the happy paths:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Generally there will be no output from such commands, use -v to get some output (examples from here on out will use -v)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This follows Eric Raymond’s &lt;a href="http://en.wikipedia.org/wiki/Unix_philosophy"&gt;"Rule of Silence"&lt;/a&gt;: When a program has nothing surprising to say, it should say nothing.&lt;/p&gt;

&lt;p&gt;Similarly, consider Runit's strategy for managing environment variables for each service:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A directory of files where an environment variable will be created for each file, with the value set to the contents of that file, may appear cumbersome at first glance.&lt;/p&gt;

  &lt;p&gt;In practice, however, we find that changing one or two options is the most likely workflow.&lt;/p&gt;

  &lt;p&gt;With the envdir setup, this becomes&lt;/p&gt;

  &lt;pre&gt;&lt;code&gt;echo "value" &amp;gt; /service/foo/env/VARIABLE_NAME
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you respect the fact that your users are programmers, then it becomes immediately clear that this is much better than a flat file of key/value pairs.&lt;/p&gt;
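&lt;p&gt;Because each variable is just a file, bulk changes are ordinary shell (the variable names here are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Change one setting for one service...
echo "5432" &amp;gt; /service/foo/env/DB_PORT

# ...or sweep a setting across every service at once.
for dir in /service/*/env; do
  echo "warn" &amp;gt; "$dir/LOG_LEVEL"
done
&lt;/code&gt;&lt;/pre&gt;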

&lt;p&gt;This also shows how we can gain simplicity by leaning heavily on the UNIX filesystem…&lt;/p&gt;

&lt;h3 id="the-filesystem-is-your-database"&gt;The Filesystem is Your Database&lt;/h3&gt;

&lt;p&gt;Consider, again, the example above.  To set environment variables for your service, you simply create a file.  Contrast that with &lt;a href="https://devcenter.heroku.com/articles/config-vars"&gt;the heroku CLI&lt;/a&gt; (&lt;code&gt;heroku config:set GITHUB_USERNAME=joesmith&lt;/code&gt;), or with &lt;a href="http://docs.travis-ci.com/user/build-configuration/#Set-environment-variables"&gt;setting variables in the &lt;code&gt;.travis.yml&lt;/code&gt; file&lt;/a&gt;.  Both of these solutions require more code on the author's part, more cognitive load on the user, and (at least in the Travis case) more effort to script.&lt;/p&gt;

&lt;p&gt;Runit's approach, however, is both easy and incredibly simple.  It achieves this simplicity because the authors understood that the UNIX filesystem wasn't just intended to hold MP3s.  It's your local database.&lt;/p&gt;

&lt;p&gt;You can see this same simplicity in adding and removing services:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Remove a service (stop it and make it not start back up, even on boot)&lt;/p&gt;

  &lt;pre&gt;&lt;code&gt;$ rm /service/sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it.  Runit polls the filesystem, and constantly converges on the expected state.  That's much simpler for the user than learning yet another command-line option.&lt;/p&gt;
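&lt;p&gt;Adding a service is just as filesystem-native.  A sketch, assuming you've already prepared a service directory for a hypothetical &lt;code&gt;myapp&lt;/code&gt; under the conventional &lt;code&gt;/etc/sv/&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Enable: runsvdir notices the new symlink and starts supervising it.
ln -s /etc/sv/myapp /service/myapp

# Disable: remove the symlink; the service stops and stays down, even on boot.
rm /service/myapp
&lt;/code&gt;&lt;/pre&gt;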

&lt;h3 id="simplicity-through-composability"&gt;Simplicity through Composability&lt;/h3&gt;

&lt;p&gt;Runit also achieves simplicity by breaking itself up into many different tools with clear responsibilities: &lt;a href="http://smarden.org/runit/runsvdir.8.html"&gt;runsvdir&lt;/a&gt;, &lt;a href="http://smarden.org/runit/runsvchdir.8.html"&gt;runsvchdir&lt;/a&gt;, &lt;a href="http://smarden.org/runit/runsv.8.html"&gt;runsv&lt;/a&gt;, &lt;a href="http://smarden.org/runit/svlogd.8.html"&gt;svlogd&lt;/a&gt;, &lt;a href="http://smarden.org/runit/chpst.8.html"&gt;chpst&lt;/a&gt;, &lt;a href="http://smarden.org/runit/sv.8.html"&gt;sv&lt;/a&gt;, and more.  Most programmers understand the need to break their code into clean classes and modules, but they often fail to extend that into the overall user interface.&lt;/p&gt;

&lt;p&gt;By breaking the interface up into multiple executables like this, you simplify the implementation and make each individual tool easier for the end user to understand.  The overall system may seem more complex, but that comes with the benefit of being able to easily drive various parts of the system through user-written scripts.&lt;/p&gt;

&lt;p&gt;Consider the Runit process hierarchy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;runit
`- runsvdir
   |- runsv
   |  `- apache-ssl
   |- runsv
   |  `- crond
   `- runsv
      `- tinydns
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The responsibilities are cleanly split between running many services (&lt;code&gt;runsvdir&lt;/code&gt;) and monitoring/restarting a single service (&lt;code&gt;runsv&lt;/code&gt;).  By moving the restarting logic into many &lt;code&gt;runsv&lt;/code&gt; processes, the semantics of each are much simpler to understand and debug.  Having many processes with a single task each is better than a single master process with too many responsibilities.&lt;/p&gt;

&lt;p&gt;Runit also gains simplicity through a strong understanding of processes and pipes.&lt;/p&gt;

&lt;h3 id="understand-processes"&gt;Understand Processes&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Processes run in the foreground logging to stdout/stderr.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The only thing cool about daemons is the name.  They're &lt;a href="http://en.wikipedia.org/wiki/Daemon_(computing)#Creation"&gt;a terrible hack&lt;/a&gt; involving a voodoo ritual of over 10 steps that makes &lt;a href="http://en.wikipedia.org/wiki/Terminate_and_stay_resident_program"&gt;old TSRs&lt;/a&gt; look like "good architecture."  Daemons were only ever devised as a workaround for the poor design of the original &lt;a href="http://www.unix.com/man-page/linux/5/init/"&gt;init&lt;/a&gt;.  Runit understands this, and expects that the services it monitors simply run in the foreground as a normal god-fearing executable.&lt;/p&gt;

&lt;p&gt;Furthermore, it makes graceful use of stdout and UNIX pipes to handle service logging:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If the directory &lt;code&gt;log/&lt;/code&gt; exists, it will be treated as a log service.&lt;/p&gt;

  &lt;p&gt;runsv will create a pipe and redirect standard out of the service (and its optional finish script) to &lt;code&gt;log/run&lt;/code&gt;.&lt;/p&gt;

  &lt;p&gt;Log Service (for sshd) &lt;code&gt;/etc/sv/sshd/log/run&lt;/code&gt;:&lt;/p&gt;

  &lt;pre&gt;&lt;code&gt;#!/bin/sh
exec svlogd -t /var/log/sshd/
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using UNIX pipes simplifies the entire logging problem.  Processes no longer need to understand how to talk to syslog - they just print to STDOUT.  This is something &lt;a href="http://12factor.net/logs"&gt;Twelve-Factor systems like Heroku and Cloud Foundry&lt;/a&gt; get right.&lt;/p&gt;

&lt;h3 id="become-a-better-developer-by-understanding-unix"&gt;Become a Better Developer by Understanding UNIX&lt;/h3&gt;

&lt;p&gt;The logging example above actually sums up everything we've talked about quite nicely:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;By respecting the fact that the user is a programmer, Runit enables them to implement whatever filtering or strange logging logic they need.&lt;/li&gt;
  &lt;li&gt;Runit doesn't enforce that users manage logs in any specific way, but encourages conventions by providing the &lt;code&gt;svlogd&lt;/code&gt; logging tool.&lt;/li&gt;
  &lt;li&gt;Runit achieves simplicity by treating the logging service like any other process.&lt;/li&gt;
  &lt;li&gt;It leans on the filesystem: the &lt;code&gt;log&lt;/code&gt; subdirectory is just another service directory, and could be shared amongst other services via symlinks.&lt;/li&gt;
  &lt;li&gt;It only enforces the conventions it has to: the &lt;code&gt;log&lt;/code&gt; directory is entirely optional, but the &lt;code&gt;run&lt;/code&gt; file is necessary (and isn't configurable).  Similarly, logging isn't required, but if it's used, then all logs &lt;em&gt;must&lt;/em&gt; come via &lt;code&gt;STDOUT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These principles are old, but there's a lot to learn from revisiting them.  You can become a significantly better developer by studying how a single, well-written UNIX tool makes graceful use of the filesystem, symlinks, convention-over-configuration, process semantics, tool modularity and so on.&lt;/p&gt;

&lt;h4 class="hero" id="were-looking-for-engineers-who-are-excited-by-this-sort-of-topic-on-the-london-services-team-if-youre-interested-in-joining-us-then-id-love-to-hear-from-you"&gt;We're looking for engineers who are excited by this sort of topic on &lt;a href="http://t.co/SyEnZoyzlW"&gt;the London Services team&lt;/a&gt;. If you're interested in joining us, then I’d love to hear from you!&lt;/h4&gt;

&lt;p&gt;Finally, thanks to &lt;a href="http://www.mikeperham.com/"&gt;Mike Perham&lt;/a&gt;, who recently wrote &lt;a href="http://www.mikeperham.com/2014/07/07/use-runit"&gt;a nice little piece about how much he appreciates runit&lt;/a&gt;.  It prompted this post by reminding me of how wonderful an example of proper UNIX programming runit (and its lineage, &lt;a href="http://en.wikipedia.org/wiki/Daemontools"&gt;Daemontools&lt;/a&gt;) truly is.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>Bringing Data to the Cloud</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/bringing-data-to-the-cloud/"/>
    <id>http://tammersaleh.com/posts/bringing-data-to-the-cloud/</id>
    <published>2014-07-02T13:15:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p&gt;I lead a team of engineers here in the London Pivotal office, who are responsible for building the next generation of Data Services for &lt;a href="http://www.gopivotal.com/platform-as-a-service/pivotal-cf"&gt;Pivotal CF&lt;/a&gt;.  Here's some of our thinking around building reliable, self-sustaining, dynamically provisioned data services.&lt;/p&gt;

&lt;p&gt;Bringing data services to a platform like Cloud Foundry is absolutely key.  Building a production-grade PaaS is by no means easy, but superficial stateless platforms are becoming commodities. This is a good thing - it's the progress driving agility and the DevOps movement.  However, &lt;strong&gt;everything changes when you add state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Managing stateful processes, juggling volumes, managing backups, ensuring uptime, gardening AOFs and snapshots, dealing with startup dependencies…  It's hard.  And doing all of this in a completely automated way, behind unknown firewalls, in unknown environments is crazy-hard.  However, we have a &lt;a href="http://bosh.cloudfoundry.org/"&gt;secret weapon&lt;/a&gt; which moves this problem just out of "insane" and into merely "ill-advised."&lt;/p&gt;

&lt;h4 id="challenges"&gt;Challenges&lt;/h4&gt;

&lt;p&gt;We have &lt;em&gt;a lot&lt;/em&gt; of services to build for 2014 - Cassandra, Redis, Neo4j, MongoDB, Memcached, Elasticsearch, and more.  Because of this, we're forced to look at deploying and managing services from a much more holistic point of view.  We have to concentrate on building patterns that can be reproduced instead of bespoke snowflakes.&lt;/p&gt;

&lt;p&gt;&lt;img src="/posts/bringing-data-to-the-cloud/logos.png" alt="Logos" /&gt;&lt;/p&gt;

&lt;p&gt;Also, because we're building Cloud Foundry Services, we have the added complexity of building these to be provisioned on-demand - either as multi-tenant or single-tenant instances.&lt;/p&gt;

&lt;h4 id="cloud-foundry-services-api"&gt;Cloud Foundry Services API&lt;/h4&gt;

&lt;p&gt;Because services that work with the Cloud Foundry platform must provide databases to many different application developers, they need to be able to perform &lt;a href="http://docs.cloudfoundry.org/services/api.html"&gt;two important actions&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a new instance.&lt;/li&gt;
  &lt;li&gt;Return a binding to that instance.&lt;/li&gt;
&lt;/ol&gt;
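
&lt;p&gt;In terms of the service broker API linked above, those two actions map roughly onto two HTTP endpoints.  Sketched here with &lt;code&gt;curl&lt;/code&gt; (the broker host and the GUIDs are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# 1. Create a new instance
curl -X PUT http://broker.example.com/v2/service_instances/:instance_guid \
  -d '{"service_id": "...", "plan_id": "..."}'

# 2. Return a binding to that instance (the response carries the credentials)
curl -X PUT http://broker.example.com/v2/service_instances/:instance_guid/service_bindings/:binding_guid \
  -d '{"service_id": "...", "plan_id": "..."}'
&lt;/code&gt;&lt;/pre&gt;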

&lt;p&gt;The most important question you have to answer when building a new service is: &lt;strong&gt;Exactly what is "an instance?"&lt;/strong&gt;  This informs the rest of your service design, and is the most important aspect of the application developer's experience.  There are roughly two paths to go down: Multi- and Single-tenant.  Of course, the Devil is in the details.&lt;/p&gt;

&lt;h4 id="multi-tenant-cluster"&gt;Multi-tenant Cluster&lt;/h4&gt;

&lt;p&gt;By far, the easiest path is building a multi-tenant service.  In the multi-tenant scenario, we deploy a single service cluster which is shared amongst all of the application developers.  This basically pushes all of the scaling and resource isolation down into the service itself.&lt;/p&gt;

&lt;p&gt;&lt;img src="/posts/bringing-data-to-the-cloud/Multi-tenant.png" alt="Multi-tenant" /&gt;&lt;/p&gt;

&lt;p&gt;Some services handle this well.  MySQL and PostgreSQL, for example, work fairly well from a multi-tenant point of view.  While they're difficult to scale horizontally and don't deal well with noisy neighbors, they do have good authorization and authentication.&lt;/p&gt;

&lt;p&gt;Cassandra is another example of a service that can work in a multi-tenant environment.  It scales very well horizontally, and has strong authentication.  However, there are still issues with its authorization system, and it has no good way of dealing with noisy neighbors.&lt;/p&gt;

&lt;h4 id="single-tenant-processes"&gt;Single-tenant Processes&lt;/h4&gt;

&lt;p class="img_right"&gt;&lt;img src="/posts/bringing-data-to-the-cloud/Single-Tenant-Processes.png" alt="Single-tenant" /&gt;&lt;/p&gt;

&lt;p&gt;Some services, however, don't work at all in a multi-tenant environment.  Redis is such a service - it has no multi-user authentication or authorization, and it doesn't yet scale horizontally.  Furthermore, its single-threaded, single-process architecture makes it extremely sensitive to noisy neighbors.&lt;/p&gt;

&lt;p&gt;For services such as this, it's better to provision dedicated processes for each requested instance.  While the VM is still shared amongst users, because users get a dedicated process, we categorize these as &lt;strong&gt;single-tenant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Security with single-tenant services is much easier. However, now we have to deal with real-time provisioning, configuration and orchestration.  We have to keep hundreds of dynamic processes running, and we have to upgrade each of them when a new release is deployed.  Each process has its own directory for persistent storage, which needs to be babysat during this whole process.  And we still haven't solved the noisy neighbor problem, since a single process &lt;em&gt;could&lt;/em&gt; hog a single CPU or all of the available IOPS on the VM.&lt;/p&gt;

&lt;h4 id="single-tenant-clusters-of-processes"&gt;Single-tenant Clusters of Processes&lt;/h4&gt;

&lt;p&gt;&lt;img src="/posts/bringing-data-to-the-cloud/Single-Tenant-Clustered-Processes.png" alt="Single-tenant Clustered Processes" /&gt;&lt;/p&gt;

&lt;p&gt;The difficulty of provisioning and orchestrating these processes becomes much greater when we need to scale out horizontally.  Now we need to manage &lt;em&gt;clusters&lt;/em&gt; of processes across multiple VMs.  We must migrate processes when VMs are removed, maintain state across VMs, control the boot-sequence, manage master/slave IPs, etc.  For this, we need some level of process scheduling - likely using our own &lt;a href="https://github.com/cloudfoundry-incubator/diego-release"&gt;diego&lt;/a&gt;.&lt;/p&gt;

&lt;h4 class="clear" id="containers"&gt;Containers&lt;/h4&gt;

&lt;p class="img_right"&gt;&lt;img src="/posts/bringing-data-to-the-cloud/docker.png" alt="Docker" /&gt;&lt;/p&gt;

&lt;p&gt;Wrapping each process in a Docker container seems like a no-brainer at first glance.  This solves the noisy neighbor problem, and adds another layer of security around the service.  It certainly is a good idea, but it's also not trivial.  Docker hasn't yet figured out the software-defined-networking problem.  It also doesn't solve the issues around stateful services and inter-VM data migrations.&lt;/p&gt;

&lt;h4 id="single-tenant-clusters-of-virtual-machines"&gt;Single-tenant Clusters of Virtual Machines&lt;/h4&gt;

&lt;p&gt;Once we start getting into real production workloads, splitting up VMs via processes or containers becomes a non-starter.  Production workloads require &lt;strong&gt;all&lt;/strong&gt; of the resources available on the VM.&lt;/p&gt;

&lt;p&gt;While BOSH (our large-scale deployment tool of choice) is generally geared towards static deployments, we &lt;em&gt;can&lt;/em&gt; use it to dynamically spin up entire dedicated clusters of VMs when an application developer asks for an instance of a service.  The processes running on those VMs may also be containerized, if only for consistency and another layer of security, but the VMs are dedicated to a single instance.  There are no more noisy neighbor problems, and the cluster can make full use of the CPU, IOPS, and network.&lt;/p&gt;

&lt;p&gt;&lt;img src="/posts/bringing-data-to-the-cloud/Single-Tenant-Clustered-VMs.png" alt="Single-tenant Clustered VMs" /&gt;&lt;/p&gt;

&lt;h4 id="still-exploring"&gt;Still Exploring&lt;/h4&gt;

&lt;p&gt;This is exactly the topic that Chris Brown and I spoke about at the &lt;a href="http://cfsummit.com/"&gt;2014 Cloud Foundry Summit&lt;/a&gt;.  You can find &lt;a href="/speaking/"&gt;the video here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To be clear, this is still all very much a work in progress.  We're constantly exploring new and improved ways of tackling this challenge, and we fully expect to get it wrong before we get it right.  It's a process that's just as fulfilling as it is challenging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If this sounds exciting to you and you’re interested in &lt;a href="http://t.co/SyEnZoyzlW"&gt;joining the London Services team&lt;/a&gt;, I’d love to hear from you!&lt;/strong&gt;&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>Mind the Map</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/mind-the-map/"/>
    <id>http://tammersaleh.com/posts/mind-the-map/</id>
    <published>2014-05-20T10:00:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p&gt;I've dealt with a lot of technical debt in my life.  &lt;a href="http://www.amazon.com/Rails-AntiPatterns-Refactoring-Addison-Wesley-Professional/dp/0321604814"&gt;Enough to fill a book&lt;/a&gt;.  I've come to a conclusion:  &lt;strong&gt;Technical Debt is almost always manageable.  Product debt often isn't.&lt;/strong&gt;&lt;/p&gt;

&lt;h4 id="surface-area"&gt;Surface Area&lt;/h4&gt;

&lt;p&gt;What do I mean by "product debt?"&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Features that drastically and immeasurably increase your product's surface area.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p class="img_right"&gt;&lt;img src="/posts/mind-the-map/terrain.jpg" alt="Terrain" /&gt;&lt;/p&gt;

&lt;p&gt;This form of debt usually comes around because of laziness on the product owner's part.  Instead of dedicating time and energy towards discovering exactly what knobs customers need, they give them everything in one fell swoop.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Want to clean up your configuration file?  &lt;a href="http://www.getchef.com/chef/"&gt;How about we just give you all of Ruby?&lt;/a&gt;"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Want to install a package in the cloud?  &lt;a href="https://www.engineyard.com/"&gt;How about we just give you full root access?&lt;/a&gt;"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Want to gather stats from your processes?  &lt;a href="http://en.wikipedia.org/wiki/Java_Management_Extensions"&gt;How about we just expose the entire runtime over the network?&lt;/a&gt;"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At first these feel like clever solutions to entire classes of customer feature requests.  "If we just give them X, then we never have to deal with feature requests like this again."  But by being lazy, you've created a number of huge problems for yourself.&lt;/p&gt;

&lt;h5 id="unknown-unknowns"&gt;Unknown unknowns&lt;/h5&gt;

&lt;p&gt;These decisions hand end users far more features than they actually need.  It's hard as hell to take features away from customers, even the rarely used ones.  You're now committed to maintaining features you haven't yet thought of.&lt;/p&gt;

&lt;h5 id="lack-of-visibility"&gt;Lack of Visibility&lt;/h5&gt;

&lt;p&gt;To make matters worse, when you add sweeping features such as these, you're left blind as to how customers are actually using your product.  You've lost the tooling to give you insight as to how to make your product really shine.&lt;/p&gt;

&lt;p&gt;Which packages are they installing?  Which commands are they running?  Which metrics do they care about?&lt;/p&gt;

&lt;h5 id="painted-in-a-corner"&gt;Painted in a Corner&lt;/h5&gt;

&lt;p&gt;Iterating yourself out of technical debt is easier because only your team deals with each change.  The channels of communication are few, and you can coordinate migrations through each step.&lt;/p&gt;

&lt;p&gt;With product debt, however, each change impacts a huge number of people who you have no direct contact with.  Removing an all-encompassing feature requires that you gradually add focused replacements until you believe the majority of your customers will be satisfied.  But, because you can't control when they migrate to these new tools, and you can't see exactly why they're still using the meta-feature you gave them years ago, this becomes a herculean task.&lt;/p&gt;

&lt;p class="img_left"&gt;&lt;img src="/posts/mind-the-map/topo.jpg" alt="Topo" /&gt;&lt;/p&gt;

&lt;h4 id="maintain-the-map"&gt;Maintain the Map&lt;/h4&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; get yourself out of product debt, but it's incredibly painful.  Apple did it when they abandoned MacOS 9 in favor of a daring new OS X.  Heroku did it when they abandoned Bamboo for Cedar, and that took them years.  It involves patience, communication, insight and bravery.&lt;/p&gt;

&lt;p&gt;To avoid this situation in the first place, you need to start actively thinking about your product's surface area.&lt;/p&gt;

&lt;p&gt;Imagine a topography that describes the way your customers interact with your product.  Each new feature broadens that landscape, increasing the size of the terrain that you have to monitor and maintain.  Some features add fjords, and some add entire continents.  Elegant products achieve a high degree of focused utility by exposing very little territory for the customer to understand and interact with.  Instead, they focus on keeping that area meticulously maintained.  This focus on quality over quantity produces highly maintainable and flexible products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your job is to figure out what customers need, and give it to them using as little land as possible.&lt;/strong&gt;&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>Building an Encrypted USB Drive for your SSH Keys</title>
    <link rel="alternate" href="http://tammersaleh.com/posts/building-an-encrypted-usb-drive-for-your-ssh-keys-in-os-x/"/>
    <id>http://tammersaleh.com/posts/building-an-encrypted-usb-drive-for-your-ssh-keys-in-os-x/</id>
    <published>2014-05-10T11:00:00+00:00</published>
    <updated>2020-06-19T14:40:27+00:00</updated>
    <author>
      <name>Tammer Saleh</name>
    </author>
    <content type="html">&lt;p class="img_right"&gt;&lt;img src="/posts/building-an-encrypted-usb-drive-for-your-ssh-keys-in-os-x/keychain.jpg" alt="Erase" /&gt;&lt;/p&gt;

&lt;p&gt;UPDATE: This post's process has been encoded and published &lt;a href="https://github.com/pivotal/usb-login-scripts"&gt;in this repo, pivotal/usb-login-scripts&lt;/a&gt;. Try out the "scripts-autoexpire" for a similar experience with a few extra features.&lt;/p&gt;

&lt;p&gt;Working on a Platform like &lt;a href="http://pivotal.io/platform"&gt;Cloud Foundry&lt;/a&gt;, which is relied upon by a growing community of "serious" companies, requires us to take security seriously as well.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Security is something you know, something you have, and something you are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A commonly agreed-upon tenet of strong security is that it requires a combination of "something you know, something you have, and something you are."  Two factor authentication combines two of these - usually something you know and something you have.&lt;/p&gt;

&lt;p&gt;Here's how we've implemented two factor authentication across the board for our SSH keys using &lt;a href="http://www.amazon.com/Kingston-Digital-DataTraveler-DTSE9H-16GBZET/dp/B00DYQYITG"&gt;USB keychain drives&lt;/a&gt;.  This strengthens our access to Github repositories and the numerous deployments we manage.&lt;/p&gt;

&lt;p&gt;Follow these instructions to increase your security at home and work as well.&lt;/p&gt;

&lt;h3 class="clear" id="choosing-a-usb-key"&gt;Choosing a USB Key&lt;/h3&gt;

&lt;p&gt;We prefer the &lt;a href="http://www.amazon.com/Kingston-Digital-DataTraveler-DTSE9H-16GBZET/dp/B00DYQYITG"&gt;Kingston DataTraveler&lt;/a&gt; drive due to its size and cost.  Once you've found a USB keychain drive to your liking, you'll want to reformat it using macOS's built-in encrypted filesystem.&lt;/p&gt;

&lt;h3 id="format-the-drive"&gt;Format the Drive&lt;/h3&gt;

&lt;p&gt;&lt;img src="/posts/building-an-encrypted-usb-drive-for-your-ssh-keys-in-os-x/erase-el-capitan.png" alt="Erase" /&gt;&lt;/p&gt;

&lt;p&gt;Plug your drive into your computer and open Disk Utility.  Select the disk (not the volume) on the left and navigate to the "Erase" tab.  You'll want to name the volume something simple (such as "keys") to make it easier to access on the command line.&lt;/p&gt;

&lt;p&gt;If you see the encrypted options in the dropdown, then just go ahead and format your drive with &lt;code&gt;Mac OS Extended (Case-sensitive, Journaled, Encrypted)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, some USB keys' partition tables are MBR, which doesn't support encryption, and you won't see encrypted partitions as options in the "Format" dropdown.  In that case, you'll have to do a two-step dance, formatting the drive twice:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Once as &lt;code&gt;Mac OS Extended (Journaled)&lt;/code&gt; using the &lt;code&gt;GUID Partition Map&lt;/code&gt;, then…&lt;/li&gt;
  &lt;li&gt;Again, using &lt;code&gt;Mac OS Extended (Case-sensitive, Journaled, Encrypted)&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
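
&lt;p&gt;The same two-step dance can be sketched on the command line with &lt;code&gt;diskutil&lt;/code&gt; (assuming your drive shows up as &lt;code&gt;/dev/disk2&lt;/code&gt; - check with &lt;code&gt;diskutil list&lt;/code&gt; first):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Step 1: GUID Partition Map with a case-sensitive, journaled HFS+ volume
diskutil eraseDisk JHFSX keys GPT /dev/disk2

# Step 2: convert that volume to an encrypted CoreStorage volume
# (you'll be prompted for the new passphrase)
diskutil coreStorage convert /Volumes/keys -passphrase
&lt;/code&gt;&lt;/pre&gt;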

&lt;p&gt;Now, you'll be prompted for your decryption password whenever you insert the drive.  Be sure not to save the password into the OS X Keychain.&lt;/p&gt;

&lt;h3 class="clear" id="add-your-ssh-keys"&gt;Add your SSH Keys&lt;/h3&gt;

&lt;p&gt;If you don't already have SSH keys, then you'll want to generate a new set.  In fact, it's probably a good idea to use this as a chance to create a fresh set either way, just in case yours have been compromised.&lt;/p&gt;

&lt;p&gt;You create a new SSH key pair by running &lt;code&gt;ssh-keygen&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;$ ssh-keygen -f /Volumes/keys/id_rsa -C "Tammer Saleh"
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Volumes/keys/id_rsa.
Your public key has been saved in /Volumes/keys/id_rsa.pub.
The key fingerprint is:
xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx Tammer Saleh
The key's randomart image is:
+--[ RSA 2048]----+
|             x   |
|            x  x |
|      x x xxxxxxx|
|     x x x x xxxx|
|      x x x   x  |
|     x x x     x |
|    x x x x      |
|     x   x       |
|                 |
+-----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then add the newly-created public key to &lt;a href="https://github.com/settings/keys"&gt;your Github account&lt;/a&gt;.&lt;/p&gt;
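
&lt;p&gt;A quick way to get the public key into your clipboard for pasting into Github (assuming you named the volume &lt;code&gt;keys&lt;/code&gt; as above):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;pbcopy &lt; /Volumes/keys/id_rsa.pub
&lt;/code&gt;&lt;/pre&gt;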

&lt;h3 id="script-to-load-keys-and-eject"&gt;Script to Load Keys and Eject&lt;/h3&gt;

&lt;p&gt;At this point, you could use the drive by manually adding the keys to your running agent and ejecting the drive.  But that's a lot of typing and feels fairly error prone.  Instead, let's script it.&lt;/p&gt;

&lt;p&gt;Create the following script on your drive, and name it &lt;code&gt;load&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;#!/usr/bin/env bash

set -e

# Lifetime of the loaded key, in hours (defaults to 1)
HOURS=${1:-1}

DIR=$(dirname "$0")
KEY="$DIR/id_rsa"

/usr/bin/ssh-add -D                       # drop any previously loaded identities
/usr/bin/ssh-add -t "${HOURS}H" "$KEY"    # load the key with an expiry
/usr/sbin/diskutil umount force "$DIR"    # eject the drive when done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You will need to mark the script as executable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;chmod +x /Volumes/keys/load
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, you can simply run &lt;code&gt;/Volumes/keys/load&lt;/code&gt; to load your keys and eject the drive automatically.  This makes for a very quick workflow.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;$ /Volumes/keys/load
All identities removed.
Enter passphrase for /Volumes/keys/id_rsa: **********
Identity added: /Volumes/keys/id_rsa (/Volumes/keys/id_rsa)
Lifetime set to 3600 seconds
Volume keys on disk3 force-unmounted
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="make-a-backup-image"&gt;Make a Backup Image&lt;/h3&gt;

&lt;p&gt;Finally, you can make an encrypted backup image in case your USB drive shits the bed.&lt;/p&gt;

&lt;p&gt;&lt;img src="/posts/building-an-encrypted-usb-drive-for-your-ssh-keys-in-os-x/backup-image-el-capitan.png" alt="Backup Image" /&gt;&lt;/p&gt;

&lt;p&gt;Again, open Disk Utility.  This time, select the &lt;code&gt;New Image&lt;/code&gt;&amp;gt;&lt;code&gt;Image from "keys"&lt;/code&gt; menu item.  Save the image as &lt;code&gt;keys_backup&lt;/code&gt;, and configure it to be compressed and encrypted.&lt;/p&gt;
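
&lt;p&gt;If you prefer the command line, &lt;code&gt;hdiutil&lt;/code&gt; can build a rough equivalent of that image (it will prompt for the image's passphrase):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Compressed (UDZO), AES-encrypted image of the mounted drive
hdiutil create -srcfolder /Volumes/keys -encryption AES-256 \
  -format UDZO keys_backup.dmg
&lt;/code&gt;&lt;/pre&gt;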

&lt;p&gt;Ideally, you'll store your backup somewhere super secure.  Another option is to simply create two USB drives, and store one in a locked box.  My wife and I have a &lt;a href="http://www.amazon.com/SentrySafe-500-FIRE-SAFE-Cubic-Black/dp/B000W8J75E"&gt;locked fireproof box&lt;/a&gt; that we use for all of our personal documents (passports, etc), which is an ideal location.&lt;/p&gt;

&lt;h3 class="clear" id="weaknesses-and-future-improvements"&gt;Weaknesses and Future Improvements&lt;/h3&gt;

&lt;p&gt;While this is infinitely better than leaving your ssh key unprotected on your computer, there are some weaknesses and potential future improvements.&lt;/p&gt;

&lt;p&gt;The major weakness is that we're trusting that the host machine hasn't been tampered with.  If it has, then we're handing our private keys over to it.  However, this risk exists either way, and isn't made worse by this technique.&lt;/p&gt;

&lt;p&gt;The forced eject is blatant cargo culting.  When tried without forcing, the drive reports being in use, but &lt;code&gt;lsof&lt;/code&gt; shows no processes using it.  This is the case even when the script is run from outside the drive, so I'm at a loss.&lt;/p&gt;

&lt;p&gt;Ideally, we could store our entire &lt;code&gt;~/.ssh&lt;/code&gt; directory on the keychain.  This requires a symlink from &lt;code&gt;~/.ssh&lt;/code&gt; to &lt;code&gt;/Volumes/keys/.ssh&lt;/code&gt;, and has a number of other complications around permissions.  We haven't investigated further.&lt;/p&gt;

&lt;h3 id="alternatives"&gt;Alternatives&lt;/h3&gt;

&lt;p&gt;A friend of mine &lt;a href="http://plumber.sublimia.nl/post/introducing-keyguard/"&gt;recently blogged about his alternative approach&lt;/a&gt;, which makes use of &lt;a href="https://github.com/cromega/keyguard"&gt;an HTTP service that serves your SSH keys&lt;/a&gt;, and authenticates with a &lt;a href="https://www.yubico.com/why-yubico/for-individuals/"&gt;YubiKey&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
</feed>
