Moserware

Back from the Future Bugs

Wed, 21 Oct 2015 16:29:00 +0000

(In honor of this quasi-historic day, I wanted to share my most memorable bug.)

Ordinary software bugs are merely annoying but straightforward to find and fix. Legendary bugs are the ones that actively resist being found and make you question your sanity for prolonged periods of time.

In early 2011, I was working on Kaggle’s submission handling code. Kaggle hosts competitions where the goal is to make the best predictions on a dataset. For example, given an image of a person’s retina, determine if they’re suffering damage from diabetes. You upload your predictions and they get evaluated against the known (but private) solution in order to compute your score which shows up on a leaderboard.

Long after deploying the submission handling code, a user emailed us saying that his score didn’t appear on the leaderboard. When I went to investigate, I saw his score was indeed on the leaderboard. I tried to reproduce his issue in production and I couldn’t. I painstakingly verified everything in a local debugger session and it worked perfectly without any issue. Given that everything seemed to be working fine, I closed the issue because I couldn’t reproduce it.

Much time passed without any further problems.

And then, it happened again: another user reported that their score didn’t appear on the leaderboard. This time I was convinced that there was some genuine edge case causing the issue. I once again carefully went through the whole submission process in a debugger and couldn’t replicate it. At this point, I was sure it was some weird database issue. Perhaps some obscure exception was thrown that caused the database to never update. I added some checks and exception handling code and the problem seemed to go away.

But in June of 2014, it came back with a vengeance. A new developer on the team was handling support requests at the time and received several reports of submissions not showing up on the leaderboard. When he explained the bug report, I was immediately haunted by memories of it. I recounted the mystery of this bug and how I couldn’t reproduce the issue. In frustration, I suggested that he add code to manually force an update to the leaderboard to work around this bug.

I was disgusted by my own suggestion. It wasn’t a fix, it was duct-taping around the issue. But soon after offering this advice I had the forehead-slapping moment: different clocks can have different time!

It’s embarrassingly obvious in hindsight.

Here’s what happened:

When the submission was created, I set its status to “pending” and set its timestamp to the current time of the web server.
Another machine dequeued the submission, calculated its score, and then sent the result back.
The web server then updated the status of the submission which caused a stored procedure in the database to update the leaderboard based on all the submissions up to the current time of the database server.

The reason why I could never reproduce this problem locally was because my local web server and database server were on the same physical machine with the same clock, so the results were always consistent. Further, most of the time the clocks on the web server and database server in production were carefully synchronized over the network to within a few milliseconds of each other.

This bug only surfaced when:

The database server’s clock had drifted a second or two behind the web server’s clock.
Processing the submission took less time than the amount of drift between the two machines.
The database recalculated the leaderboard based on all submissions up to its current time and thus ignored the submission that was in its perceived future.
No subsequent submissions happened that would have forced another leaderboard recalculation based on the then current time.

Once I understood it was a clock drift bug, there was a very simple fix: always use the same clock. In this case, we chose to always use the same database clock and haven’t had a problem with it in well over a million submissions since then.
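To make the failure concrete, here is a minimal sketch of the race, not the production code. The `Submission` class, the `leaderboard` function, and the drift and processing numbers are all hypothetical; the only thing taken from the story is the rule that the leaderboard counts submissions stamped at or before the database's current time.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    user: str
    score: float
    created_at: float  # seconds, as stamped by whichever clock created it


def leaderboard(submissions, db_now):
    """Stand-in for the stored procedure: only submissions stamped at or
    before the database's current time are counted."""
    visible = [s for s in submissions if s.created_at <= db_now]
    return sorted(visible, key=lambda s: s.score, reverse=True)


TRUE_TIME = 1000.0   # hypothetical "real" time of the upload
DRIFT = 2.0          # assumed: database clock runs 2 seconds behind
PROCESSING = 0.5     # assumed: scoring finishes faster than the drift


def web_clock(t):
    return t


def db_clock(t):
    return t - DRIFT


# Buggy: the web server stamps the submission, the database decides the cutoff.
buggy = [Submission("alice", 0.91, created_at=web_clock(TRUE_TIME))]
buggy_board = leaderboard(buggy, db_now=db_clock(TRUE_TIME + PROCESSING))

# Fixed: the same (database) clock is used for both the stamp and the cutoff.
fixed = [Submission("alice", 0.91, created_at=db_clock(TRUE_TIME))]
fixed_board = leaderboard(fixed, db_now=db_clock(TRUE_TIME + PROCESSING))

print(len(buggy_board), len(fixed_board))  # 0 1
```

In the buggy run the submission is stamped 1000.0 but the cutoff is 998.5, so it sits in the database's future and is skipped. With one clock the drift cancels out, whatever its size.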

In hindsight, it was a distributed systems rookie mistake. I had painfully rediscovered Segal’s law: “A man with a watch knows what time it is. A man with two watches is never sure.”

In addition to the simple solution we chose of always using the same physical clock, you can get around this problem by using fancier techniques like vector clocks or TrueTime in Google’s Spanner database that uses GPS synchronized clocks to carefully keep track of the uncertainty of the current time in order to provide transactional consistency across the planet.

As we increasingly write software that executes across multiple machines, it’s important to have some strategy for handling clock drift (even if it’s less than a second). Because, if you don’t, you too might be bitten by bugs caused by data that’s coming back… from the future.

Life, Death, and Splitting Secrets

Mon, 21 Nov 2011 08:43:00 +0000

(Summary: I created a program to help back up important data like your master password in case something happens to you. By splitting your secret into pieces, it provides a circuit breaker against a single point of failure. I’m giving it away as a free open source program with the hope that others might find it useful in addressing this aspect of our lives. Feel free to use the program and follow along with just the screenshots below or read all sections of this post if you want more context.)

Background

I just couldn’t do it.

My grandma died at this time last year from a stroke. She was a great woman. I still miss her. In that emotional last week, I was reminded of great memories with her and the fragility of life. I was also reminded about important documents that I still didn’t have.

When something happens to you, be it death or incapacitation, there are some important steps that need to occur that can be greatly assisted by legal documents. For example:

An advance health care directive (aka “Living Will”) specifies what actions should (or shouldn’t) be taken with regards to your healthcare if you’re no longer able to make decisions for yourself.
A durable power of attorney allows you to designate someone to legally act as you if you become incapacitated.
A last will and testament allows you to legally assign caregivers for minor children as well as designate where you’d like your possessions to go.

My grandma had these and it helped reduce stress and anxiety in this difficult time. We knew what she would have wanted and these documents helped legally enforce that.

I had assumed that these documents were expensive and time-consuming to create. Furthermore, as a guy in my 20’s, death still seems like a distant rumor. As a Christian, I’m not overly concerned about death itself, but my grandma’s death reminded me that these documents are not really for me, but rather the people I’d leave behind. I knew that if something happened to me, I’d potentially be leaving behind a mess, and that concern of irresponsibility compelled me to investigate what I could do.

It turns out that creating these documents is essentially a matter of filling out a form template. I bought a program that made it about as easy as preparing taxes online. In most cases, you just need disinterested third parties, such as friends or coworkers, to witness you signing them to make them fully legal. At most, you might have to get them notarized or filed in your county for a small fee.

One of the steps involved in filling out the “Information for Caregivers and Survivors” document is to list “Secured Places and Passwords.” It’s a helpful section that your executor can turn to if something happened to you in order to do things like unlock your cell phone or access your online accounts. Sure, your survivors might be able to use legal force to get access without it, but only after months of sending official documentation. That’s a lot of hassle to put someone through. Also, it’s very likely that a lot of important things will be missed and no one would ever know they existed.

It’s probably rational to just write your passwords down and put them in a safe which your executor knows the location of and can access in a timely manner. Alternatively, you could pay for an attorney or a third-party service and leave your password list with them. However, this seemed like it would cause a maintenance problem, especially as I might add or update my passwords frequently. These options would also force me to trust someone I haven’t known for a long time. Most importantly, the thought of writing down my passwords on a piece of paper, even if it was in a relatively safe place, went against every fiber of my security being.

I just couldn’t do it.

DISCLAIMER: The above simple approaches are probably fine and have worked for a lot of people over the years. If you’re comfortable with these basic approaches, by all means use them and ignore this post. These simpler approaches have fewer moving parts and are easy to understand. However, if you want a little more security, or need to liven up this process with a little spy novel-esque fun, read on.

The Modern Password & Encryption Problem

As an online citizen, you don’t want to be that person. You know, the one whose password was so easy to guess that his email account was broken into and who “wrote” to you saying that he decided to go to Europe on a whim this past weekend but now needs you to wire him money right now and he’ll explain everything later: that guy.

You’ve learned that passwords like “thunder”, “thunder56”, and even “L0u|>Thund3r” are terrible because they’re easily guessed. You now know that the most important aspect of a password is its length combined with basic padding and character variation such as “/* Thunder is coming! */”, “I hear thunder!”, or “1.big.BOOM@thunder.mil”.

In fact, you’re probably clever enough that you don’t create or remember most of your passwords anymore. You use a password manager like LastPass or KeePass to automatically generate and store unique and completely random passwords for all of your accounts. This has simplified your life so that you only have to remember your “master password” that will get you into where you keep all the rest of your usernames and passwords.

You also understand that your email account credentials are a “skeleton key” for almost everything else due to the widespread use of simple password reset emails. For this very reason, you probably realize that it’s critical to protect your email login with “two-factor” authentication. That is, your email account should at least be protected by:

Something you know (your password) and
Something you have (your cellphone), that creates or receives a one-time use code when you want to login.

On top of all of this, you try your best to follow the trusty advice that your passwords should be ones that nobody could guess and you never ever write them down.

But what if something happens to you? If you’ve done everything “right,” then your master password and all your second factor details go with you.

And then there are your encrypted files. Maybe you’re keeping a private journal for your children to read when they grow up. Perhaps you’re living in some spy novel life where you’re worried that people will take you out to prevent something you know from being discovered. Wherever you fall on the spectrum, what do you do with such encrypted data?

Modern encryption is a bit scary because it’s so good. If you use a decent encryption program with a good password/key, then it’s very likely that no one, not even a major government, could decrypt the file even after hundreds of years. Encryption is great for keeping prying eyes out, but it could sadden survivors that you want to have access to your data. The thought of something being lost forever might make you almost yearn for the days when you just put everything into a good safe that’s rated by how many minutes it might slow somebody down.

On a much lighter note, the “something” that happens to you doesn’t have to be so grim. Maybe you had a really relaxing three week vacation and now you can’t remember the exact keyboard combination of your password. Given that our brains have to recreate memories each time we recall something, it’s possible that you could stress yourself out so much trying to remember your password that you effectively “forget” it. What do you do then?

When you put all your eggs into a password manager basket, you really want to watch that basket. Fortunately, creating a basic plan isn’t that hard.

A Proposed Solution

Let’s borrow an ancient yet incredibly useful idea: if it’s really important to get your facts right about something, be sure to have at least two or three witnesses. This is especially true concerning matters of life and death but it also comes up when protecting really valuable things.

By the 20th century, this “two-man rule” was implemented in hardware to protect nuclear missiles from being launched by a lone rogue person without proper authorization. The main vault at Fort Knox is locked by multiple combinations such that no single person is entrusted with all of them. On the Internet, the master key for protecting the new secure domain name system (DNSSEC) is split among 7 people from 6 different countries such that at least 5 people are needed to reconstruct it in the event of an Internet catastrophe.

If this idea is good enough for protecting nuclear weapons, the Fort Knox vault, and one of the most critical security aspects on the Internet, it’s probably good enough for your password list. Besides, it can make a somewhat uncomfortable process a little more fun.

Let’s start with a simple example. Let’s say that your master password is “1.big.BOOM@thunder.mil”. You could just write it out on a piece of paper and then use scissors to cut it up. This would work if you wanted to split it among 2 people, but it has some notable downsides:

It doesn’t work if you want redundancy (i.e. any 2 of 3 people being able to reconstruct it)
Each piece would tell you something about the password and thus has value on its own. Ideally, we’d like the pieces to be worthless unless a threshold of people came together.
It doesn’t really work for more complicated scenarios like requiring 5 of 7 people.

Fortunately, some clever math can fix these issues and give you this ability for free. I created a program called SecretSplitter to automate all of this to hopefully make the whole process painless.

Let’s say you want to require at least 2 witnesses to agree that something happened to you before your secret is available. You also want to build in redundancy such that any pair of people can find out your password. For this scenario, you can use the default settings and press the “split” button:

You’ll get this list of split pieces:

Notice that each piece is twice as long as your original message (about twice the size of a package tracking number). This is by design.

Now comes the hard part: you have to select three people you trust. You should have high confidence in anyone you’d entrust with a secret piece. It’s easy to get caught up in gee-whiz cryptography and miss fundamentals: you ultimately have to trust something, especially with important matters. SecretSplitter provides a trust circuit breaker just in case (because even well-meaning people can lose important things). The splitting process adds a bit of complexity, but so do real circuit breakers. If you trust no one, then you can’t have anyone help you if something happens.

For demonstration purposes, let’s say you trust 3 people.

You now have to distribute these secret pieces. You could do all sorts of clever things like send letters to people that will be delivered far in the future or read them over the phone. However, distributing them in person is a pretty good option:

It can make the upcoming holiday table discussions even more fun:

Let’s pretend that something happened to you. Two of the three family members that you gave pieces to would come together and agree that “something” indeed has happened to you. What happens now?

Well, either you included a note with each secret piece or you emailed them previously with instructions that they’d just need to download and run this small program. The pair comes together at a laptop and they each type their piece in quickly and then press “Recover”:

Oops… they typed so quickly that they mixed up one of the digits. It told us where to look:

They fix the typo and press recover again:

And immediately they see:

Password recovered! They could now use this master password to log into your password manager where you’ve stored further details.

This “message” approach is useful if you have a small amount of data such as a password that you could write on a piece of paper. One downside is that each piece is twice the size of the text message. If your message becomes much larger then it will no longer be feasible to type it in manually.

One alternative approach is to bundle together all of your important files into a zip file:

To split this file, you’d click the “Create” tab and then find the file, set the number of shares and click “Save”:

You’ll then be told:

And then you pick where to save the encrypted file:

Finally, you’ll see this screen:

This creates a slightly more complicated scenario because you now have 2 things to share: the secret pieces and the encrypted file with all your data. The encrypted file doesn’t have to be secret at all. You can safely email it to people that have a secret piece:

Now, if something happens to you, they’d run the program, and type in two shares and press “Recover”:

It’ll then tell them:

They’d then go to their email and search for the email from you that includes your encrypted file:

Then they’d find the single message (or the latest one if you sent out updates) and download your encrypted attachment:

They’d then go back to the program to open it up:

and then they’d see a message to be careful where they saved it:

and then they’d save it:

They’d then be asked if they want to open the decrypted file, to which they’d say “Yes”:

Now they can see everything:

It might sound complicated, but if you’re familiar with the process, it might only take a minute. If you’re not tech savvy and have never done it before and type slowly, it might take 30 minutes. In either case, it’s faster than having to drive to your home and search around for a folder and it contains everything you wanted people to know (especially when things are time sensitive).

That’s it! Your master password and important data are now backed up. The risk is distributed: if any one piece is compromised (i.e. gets lost or misplaced), you can have everyone else destroy their secret piece and nothing will be leaked. Also, the program has an advanced feature that lets you save the file encryption key. This feature allows you to send out updated encrypted files that can be decrypted with the pieces you’ve already established in person.

SecretSplitter implements a “(t,n) threshold cryptosystem” which can be thought of as a mathematical generalization of the physical two-man rule. The idea is that you split up a secret into pieces (called “shares”) and require at least a threshold of “t” shares to be present in order to recover the secret. If you have fewer than “t” shares, you gain no information about the secret. Whatever threshold you use, it’s really important that each “shareholder” know the threshold number of shares.

You can be quite creative in setting the threshold and distributing shares. For example, you can trust your spouse more by giving her more shares than anyone else. The key idea is that a share is an atomic unit of trust. You can give more than one unit of trust to a person, but you can never give less.

Another important practical concern is that you should consider adding redundancy to any threshold system. This is easily achieved by creating more shares than the threshold number. The reason is that if you’re going out of your way to use a threshold system, then you probably want to make sure you have a backup plan in case one or more of the shares are unavailable.

IMPORTANT LEGAL NOTE: It’s tempting to keep everything, including the important directives and your will in only electronic form (even when they’re signed). Unfortunately, most states require the original signed documents to be considered legal and most courts will not accept a copy. For this reason, you should still have the paper originals somewhere such as a fireproof safe. However, be careful where you put the originals: although it might sound convenient to put them in a bank safety deposit box, there’s usually a rather long waiting period before a bank can legally provide access to your box to a survivor, so don’t put any time sensitive items there. My recommendation at the current time would be to include copies of the signed originals in your encrypted file and also include detailed instructions on where the originals are located and how to access them.

How It Works

Given the sensitive nature of the data being protected, I wanted to make sure I understood every part of the mathematics involved and literally every bit of the encrypted file. You’re more than welcome to just use the program without fully understanding the details, but I encourage people to verify my math and code if you’re able and curious.

To get started, recall that computers work with bits: 1’s and 0’s that can represent anything. For example, the most popular way of encoding text will encode “thunder” in binary as

01110100 01101000 01110101 01101110 01100100 01100101 01110010

We can write this more efficiently using hexadecimal notation as: 74 68 75 6E 64 65 72. We can also treat this entire sequence of bits as a single 55 bit number whose decimal representation just happens to be 32,765,950,870,971,762. In fact, any piece of data can be converted to a single number.
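If you want to check these conversions yourself, here is a small sketch in Python (my choice of language for illustration; SecretSplitter itself is written in C#). The variable names are arbitrary.

```python
text = "thunder"
data = text.encode("ascii")  # the 7 bytes that encode the text

# Binary, hexadecimal, and single-number views of the same bytes.
bits = " ".join(f"{byte:08b}" for byte in data)
hex_form = " ".join(f"{byte:02X}" for byte in data)
number = int.from_bytes(data, "big")

print(bits)
print(hex_form)             # 74 68 75 6E 64 65 72
print(number)               # 32765950870971762
print(number.bit_length())  # 55

# The conversion is reversible, so nothing is lost by treating text as a number.
size = (number.bit_length() + 7) // 8
round_trip = number.to_bytes(size, "big").decode("ascii")
print(round_trip)           # thunder
```

The number is 55 bits rather than 56 because the leading bit of the first byte is a zero.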

Now that we have a single number, let’s go back to your algebra class and remember the equation for a line: y=mx+b.

In this equation, “b” is the “y-intercept”, which is where the line crosses the y-axis. The “m” value is the slope and represents how steep the line is (i.e. its “grade” if it were a hill).

This is all the core math you need to understand splitting secrets. In our particular case, our secret message is always represented by the y-intercept (i.e. “b” in y=mx+b). We want to create a line that will go through this point. Recall that a line could go through this point at any angle. The slope (i.e. “m” in y=mx+b) will direct us where it goes. For things to work securely, the slope must be a random number.

Although we use large numbers in practice for security reasons, let’s keep it simple here. Let’s say our secret number is “7” and our random slope is “3.” These choices generate this line:

With this equation, we can generate an infinite number of points on the line. For example, we can pick the first three points: (1, 10), (2, 13), and (3, 16):

You can see that if you had any two of these points, you could find the y-intercept.

It’s critical to realize that having just one of these points gives us no useful information about the line. However, having any other point on the line would allow us to use a ruler and draw a straight line to the y-intercept and thus reveal the secret (we could also work it out algebraically). Each point represents a secret piece or “share” and has a unique “x” and “y” value.
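The ruler trick can also be sketched in a few lines of code. This uses the toy numbers from above (secret 7, slope 3) and ordinary fractions; the function names are made up for the example, and a real split would pick the slope at random.

```python
from fractions import Fraction


def make_line_shares(secret, slope, count):
    """Points (x, y) on y = slope * x + secret for x = 1..count."""
    return [(x, slope * x + secret) for x in range(1, count + 1)]


def recover_intercept(p1, p2):
    """Find where the line through two points crosses the y-axis."""
    (x1, y1), (x2, y2) = p1, p2
    slope = Fraction(y2 - y1, x2 - x1)
    return y1 - slope * x1


shares = make_line_shares(secret=7, slope=3, count=3)
print(shares)                                    # [(1, 10), (2, 13), (3, 16)]
print(recover_intercept(shares[0], shares[2]))   # 7
print(recover_intercept(shares[1], shares[2]))   # 7
```

Any pair of shares leads back to the same y-intercept, which is exactly the "any 2 of 3" redundancy described earlier.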

The mathematically fascinating part about this idea is that a line is just a simple polynomial (curve) and this technique works for polynomials of arbitrarily large degrees. For example, a second degree polynomial is a parabola that requires 3 unique points to completely define it (one more than a line). Its equation is of the form y=ax^2 + bx + c. In our case “c” is the y-intercept and “a” and “b” are random as in y = 2x^2 + 3x + 7:

Given this equation, we can generate as many “shares” as we’d like: (1,12), (2,21), (3,34), (4,51), etc.

Keep in mind that a parabola requires three points to uniquely define it. If you just had two points, as in (1,12) and (2,21), you could create an infinite number of parabolas going through these points and thus have infinite choices for what the y-intercept (i.e. your secret) could be:

However, a third point will define the parabola and its y-intercept exactly:
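As a rough sketch of both claims, the code below uses the example parabola y = 2x^2 + 3x + 7. The `intercept` function applies Lagrange interpolation at x = 0 over ordinary fractions, and `intercept_given_a` (a hypothetical helper) shows that two points alone are consistent with many different secrets.

```python
from fractions import Fraction


def intercept(points):
    """Lagrange interpolation evaluated at x = 0."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(-xj, xi - xj)
        total += term
    return total


def intercept_given_a(p1, p2, a):
    """The y-intercept of the parabola y = a*x^2 + b*x + c through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    b = Fraction(y2 - y1 - a * (x2**2 - x1**2), x2 - x1)
    return y1 - a * x1**2 - b * x1


shares = [(x, 2 * x**2 + 3 * x + 7) for x in range(1, 5)]
print(shares)                 # [(1, 12), (2, 21), (3, 34), (4, 51)]

# Three shares pin down the secret, whichever three you pick.
print(intercept(shares[:3]))  # 7
print(intercept(shares[1:]))  # 7

# Two shares do not: each guess for "a" gives a different intercept.
candidates = [intercept_given_a(shares[0], shares[1], a) for a in (1, 2, 5)]
print(candidates)
```

With only (1,12) and (2,21), the guesses a = 1, 2, and 5 produce intercepts 5, 7, and 13, so the true secret is just one candidate among infinitely many.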

You’ve just learned that splitting a secret that requires three people is just a matter of creating a parabola. Requiring more people is just a matter of creating a higher-degree polynomial such as a cubic or quartic polynomial. If you understand this basic idea, the rest is just details:

Instead of using numbers, we translate the data to a big polynomial with binary coefficients.
Instead of using middle school algebra, we use a “finite field.” This helps keep results about the same size as the input and adds some security.

Don’t be intimidated by these changes. The core ideas are the same as the basic case. The only noticeable difference is that you have to think of operations like multiplication and division in a more abstract way. For details, check out my source code’s use of Horner’s scheme for evaluating polynomials, peasant multiplication, irreducible polynomials with the fewest terms, Lagrange polynomial interpolation to find the y-intercept, and using Euclidean inverses for division.
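To give a feel for those pieces, here is a toy sketch and not SecretSplitter's actual code. It assumes the small field GF(2^8) with the polynomial used by AES and splits a message one byte at a time, whereas the real scheme described here treats the whole secret as one big polynomial. It also finds inverses by repeated multiplication rather than the Euclidean method. All function names are hypothetical.

```python
import secrets

IRREDUCIBLE = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed here; it is the AES polynomial)


def gf_mul(a, b):
    """Peasant (shift-and-add) multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in this field is XOR
        a <<= 1
        if a & 0x100:
            a ^= IRREDUCIBLE     # reduce back to 8 bits
        b >>= 1
    return result


def gf_inv(a):
    """Inverse as a^254; a simple stand-in for the Euclidean inverse."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def horner(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... using Horner's scheme."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def split_byte(secret, threshold, count):
    """The secret is the constant term; the other coefficients are random."""
    coeffs = [secret] + [secrets.randbelow(256) for _ in range(threshold - 1)]
    return [(x, horner(coeffs, x)) for x in range(1, count + 1)]


def recover_byte(shares):
    """Lagrange interpolation at x = 0 (subtraction is also XOR)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = gf_mul(num, xj)
                den = gf_mul(den, xi ^ xj)
        secret ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return secret


message = b"thunder"
per_byte = [split_byte(byte, threshold=3, count=5) for byte in message]
# Use shares 1, 3, and 5 of each byte to rebuild the message.
recovered = bytes(recover_byte([s[0], s[2], s[4]]) for s in per_byte)
print(recovered)  # b'thunder'
```

Because every result stays inside the field, each share value is a single byte, which is the "results about the same size as the input" property mentioned above.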

Again, it probably sounds more complicated than it really is. At its core, it’s simple. This technique is formally known as a Shamir Secret Sharing Scheme and it was discovered in the 1970’s.

I didn’t want to invent anything new unless I felt I absolutely had to. There was already a good tool called “ssss-split” that generates shares similar to how I wanted. This program adds a special twist by scrambling the resulting y-intercept point and therefore adds an extra layer of protection. Since this program was already the de-facto standard, I wanted to be fully compatible with it. To make sure I was compatible, I had to copy its method of “diffusing” (i.e. scrambling) the bits using the public domain XTEA algorithm. However, to ensure complete fidelity, I had to look at the source code. The only problem was that it was originally released under the GNU General Public License (GPL) and it used a GPL library for working with large numbers. My goal was to make my implementation as open as I could, so I asked the author if I could look at his code to derive my own implementation that I’d release under the more permissive MIT license and he graciously allowed me to do this.

To prove the compatibility, you can use the ssss-split demo page and paste the results into SecretSplitter and it’ll work just fine. In addition, I created command line programs from scratch that are fully compatible with ssss-split and ssss-combine.

After some basic usability testing, I decided to make one small adjustment. The “ssss-split” command allows you to attach a prefix that it ignores. I wanted to add a special prefix that would tell what type of share it was (i.e. a message or a file) as well as a simple checksum because with all those digits it’s easy to mistype one.

Now, you can understand all the pieces of the long share:

In theory, you could “encrypt” a large file directly using this technique. In practice, it doesn’t work well because each share would be huge and not something you’d be able to write down by hand or say over the phone, even using the phonetic alphabet.

For lots of data, we use a hybrid approach: encrypt the file using standard file encryption with a random key and then split the small “key” into pieces.

For file encryption, I again didn’t want to invent anything new. I decided to use the OpenPGP Message Format, the same format used by PGP and GNU Privacy Guard (GPG). I didn’t want to have to worry about licensing restrictions or including a third-party library, so I wrote my own implementation from scratch that did exactly what I wanted. I read RFC4880 and started sketching out what I needed to do. A few bug fixes later and I had a working implementation that was able to interoperate with GPG. To simplify my implementation, I only support a limited subset of features:

I always use AES with a 256-bit key for encryption, even if users select a smaller effective key size. This means that users can pick any size key they want and thus balance security and share length. I picked AES because it’s strong and understandable with stick figures.
The actual file encryption key is always a hashed, salted, and stretched version of the reconstructed shares text.
The encrypted file has an integrity protection packet to detect if the file has been modified and ensure it was decrypted correctly.
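The "hashed, salted, and stretched" step can be illustrated with a short sketch. It uses PBKDF2 from Python's standard library as a stand-in; OpenPGP defines its own iterated and salted string-to-key method, so this will not produce the same bytes as SecretSplitter or GPG. The function name, the iteration count, and the sample secret text are all made up for the example.

```python
import hashlib
import secrets


def derive_file_key(shares_text, salt, iterations=100_000):
    """Stretch the reconstructed secret text into a 256-bit key.

    PBKDF2-HMAC-SHA256 is an assumption for illustration only; it is
    not the OpenPGP string-to-key algorithm.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", shares_text.encode("utf-8"), salt, iterations, dklen=32
    )


salt = secrets.token_bytes(8)  # a random salt is stored alongside the file
key = derive_file_key("hypothetical-reconstructed-secret", salt)
print(len(key) * 8)  # 256
```

The salt means the same secret text yields a different key for every file, and the many iterations make each guess at a short secret more expensive. That is why a smaller effective key size can still feed a full 256-bit AES key.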

Since I used common formats, you can verify the correctness of the generated files using a Linux shell. You can also create files using the shell and have them interoperate with SecretSplitter. I included a sample of how to do this with the source code.

Help Wanted / Future Possibilities

SecretSplitter still looks and feels like a prototype. There are lots of possible improvements that could be made:

1. Secret splitting is a relatively complicated idea. In Cryptography Engineering, the authors write “secret sharing schemes are rarely used because they are too complex. They are complex to implement, but more importantly, they are complex to administrate and operate.” Although I tried to simplify the user experience for broad use, it could still use some user experience enhancements to simplify it further.
2. I wrote it in C# for the .NET platform because that is what I’m most familiar with (and it has some powerful built-in primitives like BigIntegers, AES, and hash functions). I suspect that an HTML5 version using JavaScript, a nice interface, and coming from a trusted domain would get much broader usage. In addition, since this is a problem that affects everyone, having great internationalization support would be a nice touch. It also would be nice to have a polished look with a good logo and other graphics.
3. You could use more elaborate secret sharing schemes than what I implemented in SecretSplitter. I considered these, but ultimately wanted to use a technique that was already compatible with widely deployed tools. I also considered enhancing shares with two-factor support or using existing public key infrastructure, but decided that added too much complexity. Perhaps it’s possible to incorporate these in a good design.
4. It’d be neat if this scheme or something similar to it was integrated into LastPass and KeePass as a core feature.
5. Obviously the shares themselves are long. I tried making them shorter but the downsides outweighed the upsides. Perhaps it could be better. Also, a compelling graphically designed share card might make it more fun for broader use. The long length is somewhat of a safety mechanism that prevents people from memorizing with a quick glance. Also, it discourages overhasty use much like freezing a credit card.
6. I kept the codes in a format that would be easy to write as well as read over the phone. I used a simple character set that avoids ambiguities like “O” vs “0”. One additional strategy could be to embed the share as a QR code or something similar. I didn’t pursue this approach in favor of simplicity, but this could be an option.
7. Really paranoid people might want to back up their encrypted file to paper. This is possible, but I’m not sure if it should belong inside the program itself.
8. It’d be good to have suggestions on how to exchange shares or perhaps borrow ideas from PGP key signing parties. I suspect that if secret splitting were to become popular, then “web of trust” scenarios would naturally occur (i.e. “I’ll hold your secret share if you hold mine”).
9. It’d be fun to compile a list of non-obvious uses for SecretSplitter to share with others. For example, it could make for interesting scavenger hunt clues.

If you’d like to donate your time to any of the above ideas, I’d encourage you to just give it a go. You don’t have to ask for my permission but it would be nice if you posted your results somewhere or left a comment to this post. You can use my code for whatever purpose you’d like. My only hope is that you might get some benefit out of it.

Conclusion

SecretSplitter is just a tool that gives another option for backing up very sensitive information by splitting it up into pieces. It’s not a full solution, only a tool. By relying on people I trust instead of a third-party company, it helped me remove one excuse I had for not preparing somewhat unpleasant but important documents that we should all probably have. I still don’t have this all figured out, but writing SecretSplitter helped me get started.

If you’re young, don’t have any minor children, and don’t care at all what happens to your stuff, then you could run some mental actuarial model and convince yourself that the probability of you or your survivors needing these documents or password recovery procedure anytime soon is low, but you’re not given any guarantees.

At the very least, it’s a good idea to make sure all of your financial assets and life insurance policies have a named beneficiary and perhaps at least one alternate. You can also declare things like organ donor preferences on your driver’s license instead of making declarations in other documents. It’s also a good idea to have an “ICE” entry in your cell phone. However, going the extra step and making very basic final documents doesn’t require that much more work. Besides, once you have baseline documents, keeping them fresh is just a matter of occasional updates due to life events.

The increasing digitization of our lives means that more personal things will only be stored digitally. From our journals to email to videos to health records, all of this will eventually only exist digitally and likely hidden behind passwords. This future needs some safety net for backing up sensitive things in a safe and accessible way.

Not everything needs to be backed up. There are lots of files, usernames and passwords that don’t really matter. Don’t include those. SecretSplitter was built with the assumption that everything that really mattered could be stored in a file small enough to email to others. This helps focus and pare down to what really matters.

It’s also good to have a healthy dose of common sense. Instead of holding out a secret until after your death, maybe you should get that resolved today. You’ll probably live better. My general view is that these final “secrets” should be mostly boring by just containing account details and credentials.

Finally, on a more personal level, I think it’s healthy to be reminded about our own mortality at least once every year or so. It’s a helpful reminder of how much a gift every day is and helps focus what we do and not worry about things that don’t matter.

If a little bit of fancy math can help you sleep better at night, well then, I’d consider it a success.

Special thanks to B. Poettering for creating the original ssss program and allowing me to clone its format.

Notes from porting C# code to PHP

Tue, 26 Oct 2010 08:34:00 +0000

(Summary: I ported my TrueSkill implementation from C# to PHP and posted it on GitHub. It was my first real encounter with PHP and I learned a few things.)

I braced for the worst.

After years of hearing negative things about PHP, I had been led to believe that touching it would rot my brain. Ok, maybe that’s a bit much, but its reputation had me believe it was full of bad problems. Even the cool kids had issues with PHP. But I thought that it couldn’t be too bad because there was that one website that gets a few hits using a dialect of it. When Kaggle offered to sponsor a port of my TrueSkill C# code to PHP, I thought I’d finally have my first real encounter with PHP.

To make the port quick, I kept most of the design and class structure from my C# implementation. This led to a less-than-optimal result since PHP really isn’t object-oriented. I didn’t do a deep dive on redesigning it in the native PHP way. I stuck with the philosophy that you can write quasi-C# in any language. Also, I didn’t use any of the web and database features that motivate most people to choose PHP in the first place. In other words, I didn’t cater to PHP’s specialty, so my reflections are probably an unfair and biased comparison as I was not using PHP the way it was intended. I expect that I missed tons of great things about PHP.

Personal disclaimers aside, even PHP book authors don’t claim that it’s the nicest language. Instead, they highlight the language’s popularity. I sort of got the feeling that people mainly choose PHP in lieu of languages like C# because of its current popularity and its perception of having a lower upfront cost, especially among cash-strapped startups. Matt Doyle, author of Beginning PHP 5.3, wrote the following while comparing PHP to other languages:

“Many would argue that C# is a nicer, better-organized language to program in than PHP, although C# is arguably harder to learn. Another advantage of ASP.NET is that C# is a compiled language, which generally means it runs faster than PHP’s interpreted scripts (although PHP compilers are available).” - p.5

He continued:

“ASP and ASP.NET have a couple of other disadvantages compared to PHP. First of all, they have a commercial license, which can mean spending additional money on server software, and hosting is often more expensive as a result. Secondly, ASP and ASP.NET are fairly heavily tied to the Windows platform, whereas the other technologies in this list are much more cross-platform.” - p.5

Next, he hinted that Ruby might eventually replace PHP’s reign:

“Like Python, Ruby is another general-purpose language that has gained a lot of traction with Web developers in recent years. This is largely due to the excellent Ruby on Rails application framework, which uses the Model-View-Controller (MVC) pattern, along with Ruby’s extensive object-oriented programming features, to make it easy to build a complete Web application very quickly. As with Python, Ruby is fast becoming a popular choice among Web developers, but for now, PHP is much more popular.” - p.6

and then elaborated on why PHP might be popular today:

“[T]his middle ground partly explains the popularity of PHP. The fact that you don’t need to learn a framework or import tons of libraries to do basic Web tasks makes the language easy to learn and use. On the other hand, if you need the extra functionality of libraries and frameworks, they’re there for you.” - p.7

Fair enough. However, to really understand the language, I needed to dive in personally and experience it firsthand. I took notes during the dive about some of the things that stuck out.

The Good Parts

It’s relatively easy to learn and get started with PHP. As a C# developer, I was able to pick up PHP in a few hours after a brief overview of the syntax from a book. Also, PHP has some decent online help.
PHP is available on almost all web hosts these days at no extra charge (in contrast with ASP.NET hosting). I can’t emphasize this enough because it’s a reason why I would still consider writing a small website in it.
I was pleasantly surprised to have unit test support with PHPUnit. This made me feel at home and made it easier to develop and debug code.
It’s very easy and reasonable to create a website in PHP using techniques like Model-View-Controller (MVC) designs that separate the view from the actual database model. The language doesn’t seem to pose any hindrance to this.
PHP has a “static” keyword that is sort of like a static version of a “this” reference. This was useful in creating a quasi-static “subclass” of my “Range” class for validating player and team sizes. This feature is formally known as late static binding.

The “When in Rome…” Parts

Class names use PascalCase while functions tend to use lowerCamelCase like Java whereas C# tends to use PascalCase for both. In addition, .NET in general seems to have more universally accepted naming conventions than PHP has.
PHP variables have a ‘$’ prefix which makes variables stick out:

function increment($someNumber) 
{ 
    $result = $someNumber + 1; 
    return $result; 
}

This convention was probably copied from Perl’s scalar variable sigil. This makes sense because PHP was originally a set of Perl scripts intended to be a simpler Perl.
You access class members and functions using an arrow operator (“->”) like C++ instead of the C#/Java dot notation (“.”). That is, in PHP you say

$someClass->someMethod()

instead of

someClass.someMethod()

The arguments in a “foreach” statement are reversed from what C# uses. In PHP, you write:

foreach($allItems as $currentItem) { ... }

instead of the C# way:

foreach(var currentItem in allItems) { ... }

One advantage to the PHP way is its special syntax that makes iterating through key/value pairs in a map easier:

foreach($someArray as $key => $value) { ... }

vs. the C# way of something like this:

foreach(var pair in someDictionary) 
{
    // use pair.Key and pair.Value 
}

The “=>” operator in PHP denotes a map entry as in

$numbers = array(1 => 'one', 2 => 'two', ...)

In C#, the arrow “=>” is instead used for a lightweight lambda expression syntax:

x => x * x

To define the rough equivalent of the PHP array, you’d have to write this in C#

var numbers = new Dictionary<int, string>{ {1, "one" }, {2, "two"} };

On the one hand, the PHP notation for maps is cleaner, but it comes at a cost of having no lightweight lambda syntax (more on that later).

PHP has some “magic methods” such as “__construct” and “__toString” for the equivalent of C#’s constructor and ToString functionality. I like C#’s approach here, but I’m biased.

The “Ok, I guess” Parts

The free NetBeans IDE for PHP is pretty decent for writing PHP code. Using it in conjunction with PHP’s XDebug debugger functionality is a must. After my initial attempts at writing code with a basic notepad, I found NetBeans to be a very capable editor. My only real complaint with it is that I had some occasional cases where the editor would lock up and the debugger wouldn’t support things like watching variables. That said, it’s still good for being a free editor.
By default, PHP passes function arguments by value, and that includes arrays: an array handed to a function is copied rather than shared, unlike a C# array or collection, where the callee works with the same underlying object. This probably caused the most difficulty with the port. Complicating things further is that PHP references are not like references in other languages. For example, using references usually incurs a performance penalty since extra work is required.
You can’t import types via namespaces alone like you can in C# (and Java for that matter). In PHP, you have to import each type manually:

use Moserware\Skills\FactorGraphs\ScheduleLoop; 
use Moserware\Skills\FactorGraphs\ScheduleSequence; 
use Moserware\Skills\FactorGraphs\ScheduleStep; 
use Moserware\Skills\FactorGraphs\Variable;

whereas in C# you can just say:

using Moserware.Skills.FactorGraphs;

PHP’s way makes things explicit and I can see that viewpoint, but it was a bit of a surprising requirement given how PHP usually required less syntax.
- PHP lacks support for C#-like generics. On the one hand, I missed the generic type safety and performance benefits, but on the other hand it forced me to redesign some classes to not have an army of angle brackets (e.g. compare this class in C# to its PHP equivalent).
- You have to manually call your parent class’s constructor in PHP if you want that feature:

class BaseClass 
{ 
    function __construct() { ... } 
}

class DerivedClass extends BaseClass 
{ 
    function __construct() 
    { 
        // this line is optional, but if you omit it, the BaseClass constructor will *not* be called 
        parent::__construct(); 
    } 
}

This gives you more flexibility, but it doesn’t enforce C#-like assumptions that your parent class’s constructor was called.
- PHP doesn’t seem to have the concept of an implicit “$this” inside of a class. This forces you to always qualify class member variables with $this:

class SomeClass 
{ 
    private $_someLocalVariable; 
    function someMethod() 
    { 
        $someMethodVariable = $this->_someLocalVariable + 1; 
        ... 
    } 
}

I put this in the “OK” category because some C# developers prefer to always be explicit on specifying “this” as well.
- PHP allows you to specify the type of some (but not all kinds) of the arguments of a function:

function myFunction(SomeClass $someClass, array $someArray, $someString) 
{ 
    ... 
}

This is called “type hinting.” It seems that it is designed for enforcing API contracts instead of general IDE help as it actually causes a decrease in performance.
- PHP doesn’t have the concept of LINQ, but it does support some similar functional-like concepts like array_map and array_reduce.
- PHP has support for anonymous functions by using the “function($arg1, ...){}” syntax. This is sort of reminiscent of how C# did the same thing in version 2.0 where you had to type out “delegate.” C# 3.0 simplified this with a lighter weight version (e.g. “x => x*x”). I’ve found that this seemingly tiny change “isn’t about doing the same thing faster, it allows me to work in a completely different manner” by employing functional concepts without thinking. It’s sort of a shame PHP didn’t elevate this concept with concise syntax. When C#’s lambda syntax was introduced in 3.0, it made me want to use them much more often. PHP’s lack of something similar is a strong discourager to the functional style and is a lesson that C++ guys have recently learned.
- Item 4 of the PHP license states:

Products derived from this software may not be called “PHP”, nor may “PHP” appear in their name, without prior written permission from group@php.net. You may indicate that your software works in conjunction with PHP by saying “Foo for PHP” instead of calling it “PHP Foo” or “phpfoo”

This explains why you see carefully worded names like “HipHop for PHP” rather than something like “php2cpp.” This technically doesn’t stop you from having a project with the PHP name in it (e.g. PHPUnit) so long as the official PHP code is not included in it. However, it’s clear that the PHP group is trying to clean up its name from tarnished projects like PHP-Nuke. I understand their frustration, but this leads to an official preference for names like Zope and Smarty that seem to be less clear on what the project actually does. This position would be like Microsoft declaring that you couldn’t use the “#” suffix or the “Implementation Running On .Net (Iron)” prefix in your project name (but maybe that would lead to more creativity?).

The Frustrating Parts

As someone who’s primarily worked with a statically typed language for the past 15 years, I prefer the upfront compiler errors and warnings that C# offers and agree with Anders Hejlsberg’s philosophy:

“I think one of the reasons that languages like Ruby for example (or Python) are becoming popular is really in many ways in spite of the fact that they are not typed… but because of the fact that they [have] very good metaprogramming support. I don’t see a lot of downsides to static typing other than the fact that it may not be practical to put in place, and it is harder to put in place and therefore takes longer for us to get there with static typing, but once you do have static typing. I mean, gosh, you know, like hey – the compiler is going to report the errors before the space shuttle flies instead of whilst it’s flying, that’s a good thing!”

But more dynamic languages like PHP have their supporters. For example, Douglas Crockford raves about JavaScript’s dynamic aspects:

“I found over the years of working with JavaScript… I used to be of the religion that said ‘Yeah, absolutely brutally strong type systems. Figure it all out at compile time.’ I’ve now been converted to the other camp. I’ve found that the expressive power of JavaScript is so great. I’ve not found that I’ve lost anything in giving up the early protection [of statically compiled code]”

I still haven’t seen where Crockford is coming from given my recent work with PHP. Personally, I think that given C# 4.0’s optional support of dynamic objects, the lines between the two worlds are grayer and that with C# you get the best of both worlds, but I’m probably biased here.

You don’t have to define variables in PHP. This reduces some coding “ceremony” to get to the essence of your code, but I think it removes a shock absorber/circuit-breaker that can be built into the language. This “feature” turned my typo into a bug and led to a runtime error. Fortunately, options like E_NOTICE can catch these, but it caught me off guard. Thankfully, NetBeans’ auto-completion saved me from most of these types of errors.
PHP has built-in support for associative arrays, but you can’t use objects as keys or else you’ll get an “Illegal Offset Type” error. Because my C# API heavily relied on this ability and I didn’t want to redesign the structure, I created my own hashmap that supports object keys. This omission tended to reinforce the belief that PHP is not really object oriented. That said, I’m probably missing something and did it wrong.
PHP doesn’t support operator overloading. This made my GaussianDistribution and Matrix classes a little harder to work with by having to invent explicit names for the operators.
PHP lacks support for a C#-like property syntax. Having to write getters and setters made me feel like I was back programming in Java again.
My code ran slower in PHP. To be fair, most of the performance problem was in my horribly naive matrix implementation which could be improved with a better implementation. Regardless, it seems that larger sites deal with PHP’s performance problem by writing critical parts in compiled languages like C/C++ or by using caching layers such as memcached. One interesting observation is that the performance issue isn’t really with the Zend Engine per se but rather the semantics of the PHP language itself. Haiping Zhao on the HipHop for PHP team gave a good overview of the issue:

“Around the time that we started the [HipHop for PHP] project, we absolutely looked into the Zend Engine. The first question you ask is ‘The Zend Engine must be terribly implemented. That’s why it’s slow, right?’

So we looked into the Zend Engine and tried different places, we looked at the hash functions to see if it’s sufficient and look some of the profiles the Zend Engine has and different parts of the Zend Engine.

You finally realize that the Zend Engine is pretty compact. It just does what it promises. If you have that kind of semantics you just cannot avoid the dynamic function table, you cannot avoid the variable table, you just cannot avoid a lot of the things that they built…

that’s the point that [you realize] PHP can also be called C++Script because the syntax is so similar then you ask yourself, ‘What is the difference between the speed of these two different languages and those are the items that are… different like the dynamic symbol lookup (it’s not present in C++), the weak typing is not present in C++, everything else is pretty much the same. The Zend Engine is very close to C implementation. The layer is very very thin. I don’t think we can blame the Zend Engine for the slowness PHP has.”

That said, I don’t think that performance alone would stop me from using PHP. It’s good enough for most things. Furthermore, I’m sure optimizers could use tricks like the ones the DLR and V8 use to squeeze out more performance. However, I think that in practice there are diminishing returns, because I/O (and not CPU time) typically becomes the limiting factor.

Parting Thoughts

Despite my brief encounter, I feel that I learned quite a bit and feel comfortable around PHP code now. I think my quick ramp-up highlights a core value of PHP: its simplicity. I did miss C#-like compiler warnings and type safety, but maybe that’s my own personal acquired taste. Although PHP does have some dubious features, it’s not nearly as bad as some people make it out to be. I think that its simplicity makes it a very respectable choice for the type of things it was originally designed to do like web templates. Although I still wouldn’t pick PHP as my first choice as a general purpose web programming language, I can now look at its features in a much more balanced way.

P.S. I’d love to hear suggestions on how to improve my implementation and learn where I did something wrong. Please feel free to use my PHP TrueSkill code and submit pull requests. As always, feel free to fork the code and port it to another language like Nate Parsons did with his JSkills Java port.

Computing Your Skill

Thu, 18 Mar 2010 08:33:00 +0000

Summary: I describe how the TrueSkill algorithm works using concepts you’re already familiar with. TrueSkill is used on Xbox Live to rank and match players and it serves as a great way to understand how statistical machine learning is actually applied today. I’ve also created an open source project where I implemented TrueSkill three different times in increasing complexity and capability. In addition, I’ve created a detailed supplemental math paper that works out equations that I gloss over here. Feel free to jump to sections that look interesting and ignore ones that seem boring. Don’t worry if this post seems a bit long, there are lots of pictures.

Introduction

It seemed easy enough: I wanted to create a database to track the skill levels of my coworkers in chess and foosball. I already knew that I wasn’t very good at foosball and would bring down better players. I was curious if an algorithm could do a better job at creating well-balanced matches. I also wanted to see if I was improving at chess. I knew I needed to have an easy way to collect results from everyone and then use an algorithm that would keep getting better with more data. I was looking for a way to compress all that data and distill it down to some simple knowledge of how skilled people are. Based on some previous things that I had heard about, this seemed like a good fit for “machine learning.”

But, there’s a problem.

Machine learning is a hot area in Computer Science— but it’s intimidating. Like most subjects, there’s a lot to learn to be an expert in the field. I didn’t need to go very deep; I just needed to understand enough to solve my problem. I found a link to the paper describing the TrueSkill algorithm and I read it several times, but it didn’t make sense. It was only 8 pages long, but it seemed beyond my capability to understand. I felt dumb. Even so, I was too stubborn to give up. Jamie Zawinski said it well:

“Not knowing something doesn’t mean you’re dumb— it just means you don’t know it.”

I learned that the problem isn’t the difficulty of the ideas themselves, but rather that the ideas make too big of a jump from the math that we typically learn in school. This is sad because underneath the apparent complexity lie some beautiful concepts. In hindsight, the algorithm seems relatively simple, but it took me several months to arrive at that conclusion. My hope is that I can short-circuit the haphazard and slow process I went through and take you directly to the beauty of understanding what’s inside the gem that is the TrueSkill algorithm.

Skill ≈ Probability of Winning

Skill is tricky to measure. Being good at something takes deliberate practice and sometimes a bit of luck. How do you measure that in a person? You could just ask someone if they’re skilled, but this would only give a rough approximation since people tend to be overconfident in their ability. Perhaps a better question is “what would the units of skill be?” For something like the 100 meter dash, you could just average the number of seconds of several recent sprints. However, for a game like chess, it’s harder because all that’s really important is if you win, lose, or draw.

It might make sense to just tally the total number of wins and losses, but this wouldn’t be fair to people that played a lot (or a little). Slightly better is to record the percent of games that you win. However, this wouldn’t be fair to people that beat up on far worse players or players who got decimated but maybe learned a thing or two. The goal of most games is to win, but if you win too much, then you’re probably not challenging yourself. Ideally, if all players won about half of their games, we’d say things are balanced. In this ideal scenario, everyone would have a near 50% win ratio, making it impossible to compare using that metric.

Finding universal units of skill is too hard, so we’ll just give up and not use any units. The only thing we really care about is roughly who’s better than whom and by how much. One way of doing this is coming up with a scale where each person has a unit-less number expressing their rating that you could use for comparison. If a player has a skill rating much higher than someone else, we’d expect them to win if they played each other.

The key idea is that a single skill number is meaningless. What’s important is how that number compares with others. This is an important point worth repeating: skill only makes sense if it’s relative to something else. We’d like to come up with a system that gives us numbers that are useful for comparing a person’s skill. In particular, we’d like to have a skill rating system that we could use to predict the probability of winning, losing, or drawing in matches based on a numerical rating.

We’ll spend the rest of our time coming up with a system to calculate and update these skill numbers with the assumption that they can be used to determine the probability of an outcome.

What Exactly is Probability Anyway?

You can learn about probability if you’re willing to flip a coin— a lot. You flip a few times:

Heads, heads, tails!

Each flip has a seemingly random outcome. However, “random” usually means that you haven’t looked long enough to see a pattern emerge. If we take the total number of heads and divide it by the total number of flips, we see a very definite pattern emerge:

But you knew that it was going to be a 50-50 chance in the long run. When saying something is random, we often mean it’s bounded within some range.
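The running ratio is easy to watch for yourself. What follows is a minimal sketch in Python, assuming a fair coin simulated with the standard library's random module. The function name `running_heads_ratio` and the fixed seed are arbitrary choices made for this illustration.

```python
import random

def running_heads_ratio(num_flips, seed=2010):
    """Flip a simulated fair coin num_flips times.

    Returns a list where entry i is (heads so far) / (flips so far)
    after flip i + 1.
    """
    rng = random.Random(seed)  # fixed seed (arbitrary) so runs are repeatable
    heads = 0
    ratios = []
    for flip_number in range(1, num_flips + 1):
        if rng.random() < 0.5:  # treat the lower half of [0, 1) as "heads"
            heads += 1
        ratios.append(heads / flip_number)
    return ratios

ratios = running_heads_ratio(100_000)
for checkpoint in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {checkpoint:>7,} flips: {ratios[checkpoint - 1]:.4f} heads")
```

The early ratios can wander quite a bit, but the later ones should settle close to 0.5.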

It turns out that a better metaphor is to think of a bullseye that archers shoot at. Each arrow will land somewhere near that center. It would be extraordinary to see an arrow hit the bullseye exactly. Most of the arrows will seem to be randomly scattered around it. Although “random,” it’s far more likely that arrows will be near the target than, for example, way out in the woods (well, except if I was the archer).

This isn’t a new metaphor; the Greek word στόχος (stochos) refers to a stick set up to aim at. It’s where statisticians get the word stochastic: a fancy, but slightly more correct word than random. The distribution of arrows brings up another key point:

All things are possible, but not all things are probable.

Probability has changed how ordinary people think, a feat that rarely happens in mathematics. The very idea that you could understand anything about future outcomes is such a big leap in thought that it baffled Blaise Pascal, one of the best mathematicians in history.

In the summer of 1654, Pascal exchanged a series of letters with Pierre de Fermat, another brilliant mathematician, concerning an “unfinished game.” Pascal wanted to know how to divide money among gamblers if they have to leave before the game is finished. Splitting the money fairly required some notion of the probability of outcomes if the game would have been played until the end. This problem gave birth to the field of probability and laid the foundation for lots of fun things like life insurance, casino games, and scary financial derivatives.

But probability is more general than predicting the future— it’s a measure of your ignorance of something. It doesn’t matter if the event is set to happen in the future or if it happened months ago. All that matters is that you lack knowledge in something. Just because we lack knowledge doesn’t mean we can’t do anything useful, but we’ll have to do a lot more coin flips to see it.

Aggregating Observations

The real magic happens when we aggregate a lot of observations. What would happen if you flipped a coin 1000 times and counted the number of heads? Lots of things are possible, but in my case I got 505 heads. That’s about half, so it’s not surprising. I can graph this by creating a bar chart and put all the possible outcomes (getting 0 to 1000 heads) on the bottom and the total number of times that I got that particular count of heads on the vertical axis. For 1 outcome of 505 total heads it would look like this:

Not too exciting. But what if we did it again? This time I got 518 heads. I can add that to the chart:

Doing it 8 more times gave me 489, 515, 468, 508, 492, 475, 511, and once again, I got 505. The chart now looks like this:

And after a billion times, a total of one trillion flips, I got this:

In all the flips, I never got fewer than 407 total heads and I never got more than 600. Just for fun, we can zoom in on this region:

As we do more sets of flips, the jagged edges smooth out to give us the famous “bell curve” that you’ve probably seen before. Math guys love to refer to it as a “Gaussian” curve because it was used by the German mathematician Carl Gauss in 1809 to investigate errors in astronomical data. He came up with an exact formula of what to expect if we flipped a coin an infinite number of times (so that we don’t have to). This is such a famous result that you can see the curve and its equation if you look closely at the middle of an old 10 Deutsche Mark banknote bearing Gauss’s face:

Don’t miss the forest for all the flippin’ trees. The curve is showing you the density of all possible outcomes. By density, I mean how tall the curve gets at a certain point. For example, in counting the total number of heads out of 1000 flips, I expected that 500 total heads would be the most popular outcome and indeed it was. I saw 25,224,637 out of a billion sets that had exactly 500 heads. This works out to about 2.52% of all outcomes. In contrast, if we look at the bucket for 450 total heads, I only saw this happen 168,941 times, or roughly 0.017% of the time. This confirms the observation that the curve is denser, that is, taller at the mean of 500 than further away at 450.
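You don't need a billion sets to check these densities. For a fair coin, the chance of exactly k heads in n flips is C(n, k) / 2^n, and a much smaller simulation lands near the same values. Here is a minimal Python sketch, assuming a fair coin. The helper names `exact_probability` and `simulate_head_counts`, the seed, and the 20,000-set run size are all choices made for this illustration, not the code that produced the charts.

```python
import math
import random
from collections import Counter

FLIPS_PER_SET = 1000

def exact_probability(heads, flips=FLIPS_PER_SET):
    """Exact chance of `heads` heads in `flips` fair flips: C(flips, heads) / 2**flips."""
    return math.comb(flips, heads) / 2 ** flips

def simulate_head_counts(num_sets, flips=FLIPS_PER_SET, seed=1809):
    """Run num_sets sets of fair flips and tally how often each head count occurs."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so runs are repeatable
    tally = Counter()
    for _ in range(num_sets):
        bits = rng.getrandbits(flips)     # one random bit per flip
        tally[bin(bits).count("1")] += 1  # number of 1 bits = number of heads
    return tally

num_sets = 20_000  # far fewer than a billion, so expect noisier numbers
tally = simulate_head_counts(num_sets)
for heads in (450, 500):
    observed = tally[heads] / num_sets
    print(f"{heads} heads: exact {exact_probability(heads):.4%}, observed {observed:.4%}")
```

The exact values come out to about 2.52% for 500 heads and about 0.017% for 450 heads. The simulated frequencies will be rougher at this sample size, especially for the rare 450 bucket.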

This confirms the key point: all things are possible, but outcomes are not all equally probable. There are longshots. Professional athletes panic or ‘choke’. The world’s best chess players have bad days. Additionally, tales about underdogs make us smile— the longer the odds the better. Unexpected outcomes happen, but there’s still a lot of predictability out there.

It’s not just coin flips. The bell curve shows up in lots of places, from casino games to the thickness of tree bark to measurements of a person’s IQ. Lots of people have looked at the world and have come up with Gaussian models. It’s easy to think of the world as one big, bell shaped playground.

But the real world isn’t always Gaussian. History books are full of “Black Swan” events. Stock market crashes and the invention of the computer are statistical outliers that Gaussian models tend not to predict well, but these events shock the world and forever change it. This type of reality isn’t covered by the bell curve, what Black Swan author Nassim Taleb calls the “Great Intellectual Fraud.” These events would have such low probability that no one would predict them actually happening. There’s a different view of randomness that is a fascinating playground of Benoît Mandelbrot and his fractals that better explain some of these events, but we will ignore all of this to keep things simple. We’ll acknowledge that the Gaussian view of the world isn’t always right, no more than a map of the world is the actual terrain.

The Gaussian worldview assumes everything will typically be some average value and then treats everything else as increasingly less likely “errors” as you exponentially drift away from the center (Gauss used the curve to measure errors in astronomical data after all). However, it’s not fair to treat real observations from the world as “errors” any more than it is to say that a person is an “error” from the “average human” that is half male and half female. Some of these same problems can come up treating a person as having skill that is Gaussian. Disclaimers aside, we’ll go along with George Box’s view that “all models are wrong, but some models are useful.”

Gaussian Basics

Gaussian curves are completely described by two values:

The mean (average) value which is often represented by the Greek letter μ (mu)
The standard deviation, represented by the Greek letter σ (sigma). This indicates how far apart the data is spread out.

In counting the total number of heads in 1000 flips, the mean was 500 and the standard deviation was about 16. In general, 68% of the outcomes will be within ± 1 standard deviation (e.g. 484-516 in the experiment), 95% within 2 standard deviations (e.g. 468-532) and 99.7% within 3 standard deviations (452-548):

An important takeaway is that the bell curve allows for all possibilities, but each possibility is most definitely not equally likely. The bell curve gives us a model to calculate how likely something should be given an average value and a spread. Notice how outcomes sharply become less probable as we drift further away from the mean value.
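Here's a small sketch of where those numbers come from, assuming a fair coin and using Python's standard `statistics.NormalDist`. The helper name `within` is just for illustration. The exact standard deviation is closer to 15.8, which the text rounds to 16.

```python
import math
from statistics import NormalDist

flips, p_heads = 1000, 0.5
mean = flips * p_heads                               # expected number of heads
std = math.sqrt(flips * p_heads * (1 - p_heads))     # spread of the head count

def within(k):
    """Fraction of a Gaussian that lies within k standard deviations of the mean."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    low, high = mean - k * std, mean + k * std
    print(f"within {k} sigma: {low:.0f} to {high:.0f} heads covers {within(k):.1%}")
```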

While we’re looking at the Gaussian curve, it’s important to look at -3σ away from the mean on the left side. As you can see, most of the area under the curve is to the right of this point. I mention this because the TrueSkill algorithm uses the -3σ mark as a (very) conservative estimate for your skill. You’re probably better than this conservative estimate, but you’re most likely not worse than this value. Therefore, it’s a stable number for comparing yourself to others and is useful for use in sorting a leaderboard.
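As a rough sketch of that leaderboard idea, here is the conservative estimate used as a sort key. The player names and numbers are invented for illustration and do not come from any real leaderboard.

```python
def conservative_rating(mu, sigma):
    """Mean minus three standard deviations: a deliberately cautious skill estimate."""
    return mu - 3 * sigma

# Hypothetical players as (mean, standard deviation) pairs.
players = {
    "veteran": (30.0, 1.5),    # well-known skill, narrow curve
    "newcomer": (33.0, 7.0),   # higher mean, but very uncertain
    "regular": (27.0, 2.0),
}

leaderboard = sorted(
    players,
    key=lambda name: conservative_rating(*players[name]),
    reverse=True,
)
print(leaderboard)
```

Notice that the newcomer has the highest mean but still sorts last, because the system is not yet sure about that mean.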

3D Bell Curves: Multivariate Gaussians

A non-intuitive observation is that Gaussian distributions can occur in more than the two dimensions that we’ve seen so far. You can sort of think of a Gaussian in three dimensions as a mountain. Here’s an example:

In this plot, taller regions represent higher probabilities. As you can see, not all things are equally probable. The most probable value is the mean value that is right in the middle and then things sharply decline away from it.

In maps of real mountains, you often see a 2D contour plot where each line represents a different elevation (e.g. every 100 feet):

The closer the lines on the map, the sharper the inclines. You can do something similar for 2D representations of 3D Gaussians. In textbooks, you often just see a 2D representation that looks like this:

This is called an “isoprobability contour” plot. It’s just a fancy way of saying “things that have the same probability will be the same color.” Note that it’s still in three dimensions. In this case, the third dimension is color intensity instead of the height you saw on a surface plot earlier. I like to think of contour plots as treasure maps for playing the “you’re getting warmer…” game. In this case, black means “you’re cold,” red means “you’re getting warmer…,” and yellow means “you’re on fire!” which corresponds to the highest probability.

See? Now you understand Gaussians and know that “multivariate Gaussians” aren’t as scary as they sound.

Let’s Talk About Chess

There’s still more to learn, but we’ll pick up what we need along the way. We already have enough tools to do something useful. To warm up, let’s talk about chess because ratings are well-defined there.

In chess, a bright beginner is expected to have a rating around 1000. Keep in mind that ratings have no units; it’s just a number that is only meaningful when compared to someone else’s number. By tradition, a difference of 200 indicates the better ranked player is expected to win 75% of the time. Again, nothing is special about the number 200, it was just chosen to be the difference needed to get a 75% win ratio and effectively defines a “class” of player.

I’ve slowly been practicing and have a rating around 1200. This means that if I play a bright beginner with a rating of 1000, I’m expected to win three out of four games.

We can start to visualize a match between me and a bright beginner by drawing two bell curves that have a mean of 1000 and 1200 respectively with both having a standard deviation of 200:

The above graph shows what the ratings represent: they’re an indicator of how we’re expected to perform if we play a game. The most likely performance is exactly what the rating is (the mean value). One non-obvious point is that you can subtract two bell curves and get another bell curve. The new center is the difference of the means and the resulting curve is a bit wider than the previous curves. By taking my skill curve (red) and subtracting the beginner’s curve (blue), you’ll get this resulting curve (purple):

Note that it’s centered at 1200 - 1000 = 200. Although interesting to look at on its own, it gives some useful information. This curve is representing all possible game outcomes between me and the beginner. The middle shows that I’m expected to be 200 points better. The far left side shows that there is a tiny chance that the beginner has a game where he plays as if he’s 700 points better than I am. The far right shows that there is a tiny chance that I’ll play as if I’m 1100 points better. The curve actually goes on forever in both ways, but the expected probability for those outcomes is so small that it’s effectively zero.

As a player, you really only care about one very specific point on this curve: zero. Since I have a higher rating, I’m interested in all possible outcomes where the difference is positive. These are the outcomes where I’m expected to outperform the beginner. On the other hand, the beginner is keeping his eye on everything to the left of zero. These are the outcomes where the performance difference is negative, implying that he outperforms me.

We can plug a few numbers into a calculator and see that there is about a 24% probability that the performance difference will be negative, implying the beginner wins, and a 76% chance that the difference will be positive, meaning that I win. This is roughly the 75% that we were expecting for a 200 point difference.
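The "calculator" step can be sketched with Python's standard `statistics.NormalDist`, which supports subtracting two independent Gaussians directly. This assumes, as in the example above, that both players have a standard deviation of 200.

```python
from statistics import NormalDist

me = NormalDist(mu=1200, sigma=200)
beginner = NormalDist(mu=1000, sigma=200)

# Subtracting independent Gaussians gives another Gaussian: the means subtract,
# while the variances add, which is why the result is a bit wider.
difference = me - beginner

p_beginner_wins = difference.cdf(0)   # area to the left of zero
p_i_win = 1 - p_beginner_wins         # area to the right of zero

print(f"difference: mean {difference.mean:.0f}, std dev {difference.stdev:.1f}")
print(f"beginner wins: {p_beginner_wins:.0%}, I win: {p_i_win:.0%}")
```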

This has been a bit too concrete for my particular match with a beginner. We can generalize it by creating another curve where the horizontal axis represents the difference in player ratings and the vertical axis represents the total probability of winning given that rating difference:

As expected, having two players with equal ratings, and thus a rating difference of 0, implies the odds of winning are 50%. Likewise, if you look at the -200 mark, you see the curve is at the 24% that we calculated earlier. Similarly, +200 is at the 76% mark. This also shows that outcomes on the far left side are quite unlikely. For example, the odds of me winning a game against Magnus Carlsen, who is at the top of the chess leaderboard with a rating of 2813, would be at the -1613 mark (1200 - 2813) on this chart and have a probability near one in a billion. I won’t hold my breath. (Actually, most chess groups use a slightly different curve, but the ideas are the same. See the accompanying math paper for details.)

All of these curves were probabilities of what might happen, not what actually happened. In actuality, let’s say I lost the game by some silly blunder (oops!). The question that the beginner wants to know is how much his rating will go up. It also makes sense that my rating will go down as a punishment for the loss. The harder question is just how much should the ratings change?

By winning, the beginner demonstrated that he was probably better than the 25% winning probability we thought he would have. One way of updating ratings is to imagine that each player bets a certain amount of his rating on each game. The amount of the bet is determined by the probability of the outcome. In addition, we decide how dramatic the ratings change should be for an individual game. If you believe the most recent game should count 100%, then you’d expect my rating to go down a lot and his to go up a lot. The decision of how much the most recent game should count leads to what chess guys call the multiplicative “K-factor.”

The K-Factor is what we multiply a probability by to get the total amount of a rating change. It reflects the maximum possible change in a person’s rating. A reasonable choice of a weight is that the most recent game counts about 7% which leads to a K-factor of 24. New players tend to have more fluctuations than well-established players, so new players might get a K-Factor of 32 while grand masters have a K-factor around 10. Here’s how the K-Factor changes with respect to how much the latest game should count:

Using a K-Factor of 24 means that my rating will now be lowered to 1182 and the beginner’s will rise to 1018. Our curves are now closer together:

Note that our standard deviations never change. Here are the probabilities if we were to play again:

This method is known as the Elo rating system, named after Arpad Elo, the chess enthusiast who created it. It’s relatively simple to implement and most games that calculate skill end here.
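To make the update concrete, here is a minimal sketch of the Gaussian flavor of Elo described above. The function names are made up for illustration, and it assumes every player has a standard deviation of 200. As noted earlier, most chess groups use a slightly different curve, so treat this as a sketch of the idea, not a federation's official formula.

```python
import math
from statistics import NormalDist

def expected_score(rating, opponent_rating, sigma=200.0):
    """Probability that `rating` outperforms `opponent_rating` in one game."""
    # The performance difference is Gaussian with the two variances added.
    difference = NormalDist(mu=rating - opponent_rating, sigma=math.hypot(sigma, sigma))
    return 1.0 - difference.cdf(0.0)

def elo_update(rating, opponent_rating, actual_score, k_factor=24):
    """New rating after one game; actual_score is 1 for a win and 0 for a loss."""
    surprise = actual_score - expected_score(rating, opponent_rating)
    return rating + k_factor * surprise

# I (1200) lose to the bright beginner (1000).
my_new_rating = elo_update(1200, 1000, actual_score=0.0)
beginner_new_rating = elo_update(1000, 1200, actual_score=1.0)

print(round(my_new_rating), round(beginner_new_rating))
```

The points I lose are exactly the points the beginner gains, which fits the picture of each player betting part of a rating on the game.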

I Thought You Said You’d Talk About TrueSkill?

Everything so far has just been prerequisites to the main event; the TrueSkill paper assumes you’re already familiar with it. It was all sort of new to me, so it took a while to get comfortable with the Elo ideas. Although the Elo model will get you far, there are a few notable things it doesn’t handle well:

Newbies - In the Elo system, you’re typically assigned a “provisional” rating for the first 20 games. These games tend to have a higher K-factor associated with them in order to let the algorithm determine your skill faster before it’s slowed down by a non-provisional (and smaller) K-factor. We would like an algorithm that converges quickly onto a player’s true skill (get it?) to not waste their time having unbalanced matches. This means the algorithm should start giving reasonable approximations of skill within 5-10 games.
Teams - Elo was explicitly designed for two players. Efforts to adapt it to work for multiple people on multiple teams have primarily been unsophisticated hacks. One such approach is to treat teams as individual players that duel against the other players on the opposing teams and then apply the average of the duels. This is the “duelling heuristic” mentioned in the TrueSkill paper. I implemented it in the accompanying project. It’s ok, but seems a bit too hackish and doesn’t converge well.
Draws - Elo treats draws as a half win and half loss. This doesn’t seem fair because draws can tell you a lot. Draws imply you were evenly paired whereas a win indicates you’re better, but unsure how much better. Likewise, a loss indicates you did worse, but you don’t really know how much worse. So it seems that a draw is important to explicitly model.

The TrueSkill algorithm generalizes Elo by keeping track of two variables: your average (mean) skill and the system’s uncertainty about that estimate (your standard deviation). It does this instead of relying on something like a fixed K-factor. Essentially, this gives the algorithm a dynamic K-factor. This addresses the newbie problem because it removes the need to have “provisional” games. In addition, it addresses the other problems in a nice statistical manner. Tracking these two values is so fundamental to the algorithm that Microsoft researchers informally referred to it as the μσ (mu-sigma) system until the marketing guys gave it the name TrueSkill.

We’ll go into the details shortly, but it’s helpful to get a quick visual overview of what TrueSkill does. Let’s say we have Eric, an experienced player that has played a lot and established his rating over time. In addition, we have a newbie: Natalia.

Here’s what their skill curves might look like before a game:

And after Natalia wins:

Notice how Natalia’s skill curve becomes narrower and taller (i.e. makes a big update) while Eric’s curve barely moves. This shows that the TrueSkill algorithm thinks that she’s probably better than Eric, but doesn’t know how much better. Although TrueSkill is a little more confident about Natalia’s mean after the game (i.e. it’s now taller in the middle), it’s still very uncertain. Looking at her updated bell curve shows that her skill could be between 15 and 50.

The rest of this post will explain how calculations like this work and how they extend to much more complicated scenarios. But to understand it well enough to implement it, we’ll need to learn a couple of new things.

Bayesian Probability

Most basic statistics classes focus on frequencies of events occurring. For example, the probability of getting a red marble when randomly drawing from a jar that has 3 red marbles and 7 blue marbles is 30%. Another example is that the probability of rolling two dice and getting a total of 7 is about 17%. The key idea in both of these examples is that you can count each type of outcome and then compute the frequency directly. Although helpful in calculating your odds at casino games, “frequentist” thinking is not that helpful with many practical applications, like finding your skill in a team.

A different approach is to think of probability as degree of belief in something. The basic idea is that you have some prior belief and then you observe some evidence that updates your belief leaving you with an updated posterior belief. As you might expect, learning about new evidence will typically make you more certain about your belief.

Let’s assume that you’re trying to find a treasure on a map. The treasure could be anywhere on the map, but you have a hunch that it’s probably around the center of the map and increasingly less likely as you move away from the center. We could track the probability of finding the treasure using the 3D multivariate Gaussian we saw earlier:

Now, let’s say that after studying a book about the treasure, you’ve learned that there’s a strong likelihood that treasure is somewhere along the diagonal line on the map. Perhaps this was based on some secret clue. Your clue information doesn’t necessarily mean the treasure will be exactly on that line, but rather that the treasure will most-likely be near it. The likelihood function might look like this in 3D:

We’d like to use our prior information and this new likelihood information to come up with a better posterior guess of the treasure. It turns out that we can just multiply the prior and likelihood to obtain a posterior distribution that looks like this:

This is giving us a smaller and more concentrated area to look at.
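Here's a minimal sketch of that multiplication, simplified to one dimension (say, the treasure's position along a single line) so the arithmetic stays readable. The function name and the numbers are invented for illustration. The key fact is that multiplying two Gaussian curves gives another Gaussian whose precision (one over the variance) is the sum of the two precisions.

```python
import math

def multiply_gaussians(prior, likelihood):
    """Posterior (mean, std dev) from multiplying two 1D Gaussian curves."""
    (mu_a, sigma_a), (mu_b, sigma_b) = prior, likelihood
    precision_a = 1.0 / sigma_a ** 2
    precision_b = 1.0 / sigma_b ** 2
    precision = precision_a + precision_b
    # The posterior mean is a precision-weighted average of the two means.
    mu = (mu_a * precision_a + mu_b * precision_b) / precision
    return mu, math.sqrt(1.0 / precision)

prior = (0.0, 2.0)        # hunch: near the center, fairly unsure
likelihood = (3.0, 1.0)   # clue: near position 3, more confident
posterior_mu, posterior_sigma = multiply_gaussians(prior, likelihood)

print(f"posterior mean {posterior_mu:.2f}, std dev {posterior_sigma:.2f}")
```

The posterior lands between the hunch and the clue, closer to whichever was more confident, and it's narrower than either one. Feeding the posterior back in as the next prior is all it takes to fold in another clue.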

If you look at most textbooks, you typically just see this information using 2D isoprobability contour plots that we learned about earlier. Here’s the same information in 2D:

Prior:

Likelihood:

Posterior:

For fun, let’s say we found additional information saying the treasure is along the other diagonal with the following likelihood:

To incorporate this information, we’re able to take our last posterior and make that the prior for the next iteration using the new likelihood information to get this updated posterior:

This is a much more focused estimate than our original belief! We could iterate the procedure and potentially get an even smaller search area.

And that’s basically all there is to it. In TrueSkill, the buried treasure that we look for is a person’s skill. This approach to probability is called “Bayesian” because it was discovered by a Presbyterian minister in the 1700’s named Thomas Bayes who liked to dabble in math.

The central ideas to Bayesian statistics are the prior, the likelihood, and the posterior. There’s detailed math that goes along with this and is in the accompanying paper, but understanding these basic ideas is more important:

“When you understand something, then you can find the math to express that understanding. The math doesn’t provide the understanding.”— Lamport

Bayesian methods have only recently become popular in the computer age because computers can quickly iterate through several tedious rounds of priors and posteriors. Bayesian methods have historically been popular inside of Microsoft Research (where TrueSkill was invented). Way back in 1996, Bill Gates considered Bayesian statistics to be Microsoft Research’s secret sauce.

As we’ll see later on, we can use the Bayesian approach to calculate a person’s skill. In general, it’s highly useful to update your belief based off previous evidence (e.g. your performance in previous games). This usually works out well. However, sometimes “Black Swans” are present. For example, a turkey using Bayesian inference would have a very specific posterior distribution of the kindness of a farmer who feeds it every day for 1000 days only to be surprised by a Thanksgiving event that was so many standard deviations away from the turkey’s mean belief that he never would have seen it coming. Skill has similar potential for a “Thanksgiving” event where an average player beats the best player in the world. We’ll acknowledge that small possibility, but ignore it to simplify things (and give the unlikely winner a great story for the rest of his life).

TrueSkill claims that it is Bayesian, so you can be sure that there is going to be a concept of a prior and a likelihood in it— and there is. We’re getting closer, but we still need to learn a few more details.

The Marginalized, but Not Forgotten Distribution

Next we need to learn about “marginal distributions”, often just called “marginals.” Marginals are a way of distilling information to focus on what you care about. Imagine you have a table of sales for each month for the past year. Let’s say that you only care about total sales for the year. You could take out your calculator and add up all the sales in each month to get the total aggregate sales for the year. Since you care about this number and it wasn’t in the original report, you could add it in the margin of the table. That’s roughly where “margin-al” got its name.

Wikipedia has a great illustration on the topic: consider a guy that ignores his mom’s advice and never looks both ways when crossing the street. Even worse, he’s so engrossed in listening to his iPod that he doesn’t look any way, he just always crosses.

What’s the probability of him getting hit by a car at a specific intersection? Let’s simplify things by saying that it just depends on whether the light is red, yellow, or green.

Light State	Red	Yellow	Green
Probability of getting hit given light state	1%	9%	90%

This is helpful, but it doesn’t tell us what we want. We also need to know how long the light stays a given color:

Light color	Red	Yellow	Green
% Time in Color	60%	10%	30%

There’s a bunch of probability data here that’s a bit overwhelming. If we join the probabilities together, we’ll have a “joint distribution” that’s just a big complicated system that tells us too much information.

We can start to distill this information down by multiplying the probability of getting hit in each light state by how often the light is in that state, and then adding up the results:

Red	Yellow	Green	Total Probability of Getting Hit
1%*60% = 0.6%	9%*10% = 0.9%	90%*30% = 27%	28.5%

In the right margin of the table we get the value that really matters to this guy. There’s a 28.5% marginal probability of getting hit if the guy never looks for cars and just always crosses the street. We obtained it by “summing out” the individual components. That is, we simplified the problem by eliminating variables and we eliminated variables by just focusing on the total rather than the parts.
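The "summing out" step is short enough to sketch directly from the two tables above. The variable names are just for illustration.

```python
# Probability of getting hit, given the light's color (first table).
p_hit_given_light = {"red": 0.01, "yellow": 0.09, "green": 0.90}

# Fraction of the time the light spends in each color (second table).
p_light = {"red": 0.60, "yellow": 0.10, "green": 0.30}

# Joint probability of "light is this color AND he gets hit".
joint = {color: p_hit_given_light[color] * p_light[color] for color in p_light}

# Marginalize: sum out the light color to leave only "he gets hit".
p_hit = sum(joint.values())

for color, probability in joint.items():
    print(f"{color}: {probability:.1%}")
print(f"total: {p_hit:.1%}")
```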

This idea of marginalization is very general. The central question in this article is “computing your skill,” but your skill is complicated. When using Bayesian statistics, we often can’t observe something directly, so we have to come up with a probability distribution that’s more complicated and then “marginalize” it to get the distribution that we really want. We’ll need to marginalize your skill by doing a similar “summing-out” procedure as we did for the reckless guy above.

But before we do that, we need to learn another technique to make calculations simpler.

What’s a Factor Graph, and Why Do I Care?

Remember your algebra class when you worked with expressions like this?

Your teacher showed you that you could simplify this by “factor-ing” out w, like this:

We often factor expressions to make them easier to understand and to simplify calculations. Let’s replace the variables above with w=4, x=1, y=2, and z=3.

Let’s say the numbers on our calculator are circles and the operators are squares. We could come up with an “expression tree” to describe the calculation like this:

You can tell how tedious this computation is by counting 11 “buttons” we’d have to push. We could also factor it like this

This “factorization” has a total of 7 buttons, a savings of 4 buttons. It might not seem like much here, but factorizing is a big idea.
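Here's a sketch of the button counting. It assumes the expression in question is w·x + w·y + w·z, which is consistent with the 11 and 7 button counts above. Each calculation is written as a list of calculator "buttons" in postfix order (numbers first, then the operator), and the tiny `evaluate` helper is made up for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(buttons):
    """Evaluate a postfix list of numbers (circles) and operators (squares)."""
    stack = []
    for button in buttons:
        if button in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[button](left, right))
        else:
            stack.append(button)
    return stack[0]

w, x, y, z = 4, 1, 2, 3

expanded = [w, x, "*", w, y, "*", "+", w, z, "*", "+"]   # w*x + w*y + w*z
factored = [w, x, y, "+", z, "+", "*"]                   # w*(x + y + z)

print(len(expanded), "buttons give", evaluate(expanded))
print(len(factored), "buttons give", evaluate(factored))
```

Both versions produce the same answer, but the factored one reuses w once instead of three times.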

We face a similar problem of how to factor things when we’re looking to simplify a complicated probability distribution. We’ll soon see how your skill is composed of several “factors” in a joint distribution. We can simplify computations based on how variables are related to these factors. We’ll break up the joint distribution into a bunch of factors on a graph. This graph that links factors and variables is called a “factor graph.”

The key idea about a factor graph is that we represent the marginal conditional probabilities as variables and then represent each major function of those variables as a “factor.” We’ll take advantage of how the graph “factorizes” and imagine that each factor is a node on a network that’s optimized for efficiency. A key efficiency trick is that factor nodes send “messages” to other nodes. These messages help simplify further marginal computations. The “message passing” is very important and thus will be highlighted with arrows in the upcoming graphs; gray arrows represent messages going “down” the graph and black show messages coming “up” the graph.

The accompanying code and math paper go into details about exactly how this happens, but it’s important to realize the high level idea first. That is, we want to look at all the factors that go into creating the likelihood function for updating a person’s skill based on a game outcome. Representing this information in a factor graph helps us see how things are related.

Now that we have all the foundational concepts, we’re ready for the main event: the TrueSkill factor graph!

Enough Chess, Let’s Rank Something Harder!

The TrueSkill algorithm is Bayesian because it’s composed of a prior multiplied by a likelihood. I’ve highlighted these two components in the sample factor graph from the TrueSkill paper that looks scary at first glance:

This factor graph shows the outcome of a match that had 3 teams all playing against each other. The first team (on the left) only has one player, but this player was able to defeat both of the other teams. The second team (in the middle) had two players and this team tied the third team (on the right) that had just one player.

In TrueSkill, we just care about a player’s marginal skill. However, as is often the case with Bayesian models, we have to explicitly model other things that impact the variable we care about. We’ll briefly cover each factor (more details are in the code and math paper).

Factor #1: What Do We Already Know About Your Skill?

The first factor starts the whole process. It’s where we get a player’s previous skill level from somewhere (e.g. a player database). At this point, we add some uncertainty to your skill’s standard deviation to keep game dynamics interesting and prevent the standard deviation from hitting zero since the rest of the algorithm will make it smaller (since the whole point is to learn about you and become more certain).

There is a factor and a variable for each player. Each factor is a function that remembers a player’s previous skill. Each variable node holds the current value of a player’s skill. I say “current” because this is the value that we’ll want to know about after the whole algorithm is completed. Note that the message arrow on the factor only goes one way; we never go back to the prior factor. It just gets things going. However, we will come back to the variable.
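As a rough sketch, the "add some uncertainty" step can be pictured as widening the stored standard deviation a little before each game. The name `TAU`, its value of 25/300, and the helper function are assumptions for illustration and are not taken from this post. Variances add, so the widening happens under a square root.

```python
import math

TAU = 25.0 / 300.0  # assumed "dynamics" amount; not a value given in this post

def prior_with_dynamics(mu, sigma, tau=TAU):
    """Return the prior (mean, std dev) after adding a bit of uncertainty."""
    # Independent sources of uncertainty combine by adding variances.
    return mu, math.sqrt(sigma ** 2 + tau ** 2)

stored_skill = (29.0, 0.5)   # hypothetical values read from a player database
prior_mu, prior_sigma = prior_with_dynamics(*stored_skill)

print(f"prior: mean {prior_mu}, std dev {prior_sigma:.4f}")
```

The mean is untouched. Only the uncertainty grows, and only slightly, so it can never shrink all the way to zero.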

But we’re getting ahead of ourselves.

Factor #2: How Are You Going To Perform?

Next, we add in beta (β). You can think of beta as the number of skill points to guarantee about an 80% chance of winning. The TrueSkill inventors refer to beta as defining the length of a “skill chain.”

The skill chain is composed of the worst player on the far left and the best player on the far right. Each subsequent person on the skill chain is “beta” points better and has an 80% win probability against the weaker player. This means that a small beta value indicates a high-skill game (e.g. Go) since smaller differences in points lead to the 80%:20% ratio. Likewise, a game based on chance (e.g. Uno) is a low-skill game that would have a higher beta and smaller skill chain.

Factor #3: How is Your Team Going to Perform?

Now we’re ready for one of the most controversial aspects of TrueSkill: computing the performance of a team as a whole. In TrueSkill, we assume the team’s performance is the sum of each team member’s performance. I say that it’s “controversial” because some members of the team probably work harder than others. Additionally, sometimes special dynamics occur that make the sum greater than the parts. However, we’ll fight the urge to make it much more complicated and heed Makridakis’s advice:

“Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones”

One cool thing about this factor is that you can weight each team member’s contribution by the amount of time that they played. For example, if two players are on a team but each player only played half of the time (e.g. a tag team), then we would treat them differently than if these two players played the entire time. This is officially known as “partial play.” Xbox game titles report the percentage of time a player was active in a game under the “X_PROPERTY_PLAYER_PARTIAL_PLAY_PERCENTAGE” property that is recorded for each player (it defaults to 100%). This information is used by TrueSkill to perform a fairer update. I implemented this feature in the accompanying source code.

Factor #4: How’d Your Team Compare?

Next, we compare team performances in pairs. We do this by subtracting team performances to come up with pairwise differences:

This is similar to what we did earlier with Elo and subtracting curves to get a new curve.
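Factors #2 through #4 can be sketched with plain Gaussians. This sketch leaves out the message passing and assumes that beta adds to each player's variance, that partial play acts as a simple weight on each player's performance, and that beta is 25/6. The function names and player numbers are made up for illustration.

```python
import math
from statistics import NormalDist

BETA = 25.0 / 6.0  # assumed value, not given in this post

def player_performance(mu, sigma, beta=BETA):
    """Factor #2: performance is skill plus beta's worth of game-to-game noise."""
    return NormalDist(mu, math.hypot(sigma, beta))

def team_performance(players, weights=None):
    """Factor #3: weighted sum of each member's performance."""
    weights = weights or [1.0] * len(players)
    mean, variance = 0.0, 0.0
    for (mu, sigma), weight in zip(players, weights):
        performance = player_performance(mu, sigma)
        mean += weight * performance.mean
        variance += weight ** 2 * performance.variance
    return NormalDist(mean, math.sqrt(variance))

solo = team_performance([(30.0, 2.0)])
duo = team_performance([(14.0, 3.0), (15.0, 3.0)])

# Factor #4: subtract team performances to get the pairwise difference.
difference = solo - duo
p_solo_outperforms = 1.0 - difference.cdf(0.0)
print(f"solo outperforms duo with probability {p_solo_outperforms:.0%}")

# A tag team where each member played half of the time.
tag_team = team_performance([(14.0, 3.0), (15.0, 3.0)], weights=[0.5, 0.5])
```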

Factor #5: How Should We Interpret the Team Differences?

The bottom of the factor graph contains a comparison factor based on the team performance differences we just calculated:

The comparison depends on whether the pairwise difference was considered a “win” or a “draw.” Obviously, this depends on the rules of the game. It’s important to realize that TrueSkill only cares about these two types of results. TrueSkill doesn’t care if you won by a little or a lot, the only thing that matters is if you won. Additionally, in TrueSkill we imagine that there is a buffer of space called a “draw margin” where performances are equivalent. For example, in Olympic swimming, two swimmers can “draw” because their times are equivalent to 0.01 seconds even though the times differ by several thousandths of a second. In this case, the “draw margin” is relatively small around 0.005 seconds. Draws are very common in chess at the grandmaster level, so the draw margin would be much greater there.

The output of the comparison factor directly relates to how much your skill’s mean and standard deviation will change.

The exact math involved in this factor is complicated, but the core idea is simple:

Expected outcomes cause small updates because the algorithm already had a good guess of your skill.
Unexpected outcomes (upsets) cause larger updates to make the algorithm more likely to predict the outcome in the future.

The accompanying math paper goes into detail, but conceptually you can think of the performance difference as a number on the bottom (x-axis) of a graph. It represents the difference between the expected winner and the expected loser. A large negative number indicates a big upset (e.g. an underdog won) and a large positive number means the expected person won. The exact update of your skill’s mean will depend on the probability of a draw, but you can get a feel for it by looking at this graph:

Similarly, the update to a skill’s standard deviation (i.e. uncertainty) depends on how expected the outcome was. An expected outcome shrinks the uncertainty by a small amount (e.g. we already knew it was going to happen). Likewise, an unexpected outcome shrinks the standard deviation more because it was new information that we didn’t already have:

One problem with this comparison factor is that we use some fancy math that just makes an approximation (a good approximation, but still an approximation). We’ll refine the approximation in the next step.
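For the simplest case (two players, one winner, no draw) the update can be sketched in a few lines. This is a sketch under assumptions, not the full factor graph: it uses the commonly published two-player form of the mean and standard deviation updates, assumes beta is 25/6, and skips the extra uncertainty added by the prior factor. The function names and the example ratings for Natalia and Eric are invented for illustration.

```python
import math
from statistics import NormalDist

STANDARD = NormalDist()  # mean 0, standard deviation 1

def v_win(t, eps):
    """Mean update multiplier: large for upsets, small for expected wins."""
    return STANDARD.pdf(t - eps) / STANDARD.cdf(t - eps)

def w_win(t, eps):
    """Variance update multiplier, always between 0 and 1."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def update_two_players(winner, loser, beta=25.0 / 6.0, draw_margin=0.0):
    """Return new (mean, std dev) pairs for the winner and the loser."""
    (mu_w, sigma_w), (mu_l, sigma_l) = winner, loser
    c = math.sqrt(2 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c          # negative when the underdog won
    eps = draw_margin / c
    v, w = v_win(t, eps), w_win(t, eps)
    new_winner = (mu_w + sigma_w ** 2 / c * v,
                  sigma_w * math.sqrt(1 - sigma_w ** 2 / c ** 2 * w))
    new_loser = (mu_l - sigma_l ** 2 / c * v,
                 sigma_l * math.sqrt(1 - sigma_l ** 2 / c ** 2 * w))
    return new_winner, new_loser

natalia = (25.0, 25.0 / 3.0)   # newbie: wide curve
eric = (30.0, 1.5)             # veteran: narrow curve
new_natalia, new_eric = update_two_players(winner=natalia, loser=eric)
print(new_natalia, new_eric)
```

Because each player's change is scaled by that player's own variance, the uncertain newbie moves a lot while the well-established veteran barely budges.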

The Inner Schedule: Iterate, Iterate, Iterate!

We can make a better approximation of the team difference factors by passing around the messages that keep getting updated in the following loop:

After a few iterations of this loop, the changes will be less dramatic and we’ll arrive at stable values for each marginal.

Enough Already! Give Me My New Rating!

Once the inner schedule has stabilized the values at the bottom of the factor graph, we can reverse the direction of each factor and propagate messages back up the graph. These reverse messages are represented by black arrows in the graph of each factor. Each player’s new skill rating will be the value of the player’s skill marginal variable once messages have reached the top of the factor graph.

By default, we give everyone a “full” skill update which is the result of the above procedure. However, there are times when a game title might want to not make the match outcome count much because of less optimal playing conditions (e.g. there was a lot of network lag during the game). Games can do this with a “partial update” that is just a way to apply only a fraction of the full update. Game titles specify this via the X_PROPERTY_PLAYER_SKILL_UPDATE_WEIGHTING_FACTOR variable. I implemented this feature in the accompanying source code and describe it in the math paper.

Results

There are some more details left, but we’ll stop for now. The accompanying math paper and source code fill in most of the missing pieces. One of the best ways to learn the details is to implement TrueSkill yourself. Feel free to create a port of the accompanying project in your favorite language and share it with the world. Writing your own implementation will help solidify all the concepts presented here.

The most rewarding part of implementing the TrueSkill algorithm is to see it work well in practice. My coworkers have commented on how it’s almost “eerily” accurate at computing the right skill for everyone relatively quickly. After several months of playing foosball, the top of the leaderboard (sorted by TrueSkill: the mean minus 3 standard deviations) was very stable. Recently, a very good player started playing and is now the #2 player. Here’s a graph of the most recent changes in TrueSkill for the top 5 (of around 40) foosball players:

(Note: Look how quickly the system detected how good this new #2 player is even though his win ratio is right at 50%)

Another interesting aspect of implementing TrueSkill is that it has raised an awareness of ratings among players. People that otherwise wouldn’t have played together now occasionally play each other because they know they’re similarly matched and will have a good game. One advantage of TrueSkill is that it’s not that big of a deal to lose to a much better player, so it’s still ok to have unbalanced games. In addition, having ratings has been a good way to judge if you’re improving in ability with a new shot technique in foosball or learning more chess theory.

Fun Things from Here

The obvious direction to go from here is to add more games to the system and see if TrueSkill handles them equally well. Given that TrueSkill is the default ranking system on Xbox Live, this will probably work out well. Another direction is to see if there’s a big difference in TrueSkill based on position in a team (e.g. midfield vs. goalie in foosball). Given TrueSkill’s sound statistics based on ranking and matchmaking, you might even have some success in using it to decide between several options. You could have each option be a “player” and decide each “match” based on your personal whims of the day. If nothing else, this would be an interesting way to pick your next vacation spot or even your child’s name.

If you broaden the scope of your search to using the ideas that we’ve learned along the way, there are a lot more applications. Microsoft’s AdPredictor (i.e. the part that delivers relevant ads on Bing) was created by the TrueSkill team and uses similar math, but is a different application.

As for me, it was rewarding to work with an algorithm that has fun social applications as well as picking up machine learning tidbits along the way. It’s too bad all of that didn’t help me hit the top of any of the leaderboards.

Oh well, it’s been a fun journey. I’d love to hear if you dived into the algorithm after reading this and would especially appreciate any updates to my code or other language forks.

Links:

The Math Behind TrueSkill - A math-filled paper that fills in some of the details left out of this post.
Moserware.Skills Project on GitHub - My full implementation of Elo and TrueSkill in C#. Please feel free to create your own language forks.
Microsoft’s online TrueSkill Calculators - Allows you to play with the algorithm without having to download anything. My implementation matches the results of these calculators.

Special thanks to Ralf Herbrich, Tom Minka, and Thore Graepel on the TrueSkill team at Microsoft Research Cambridge for their help in answering many of my detailed questions about their fascinating algorithm.

A Stick Figure Guide to the Advanced Encryption Standard (AES)

Tue, 22 Sep 2009 08:12:00 +0000

(A play in 4 acts. Please feel free to exit along with the stage character that best represents you. Take intermissions as you see fit. Click on the stage if you have a hard time seeing it. If you get bored, you can jump to the code. Most importantly, enjoy the show!)

Act 1: Once Upon a Time…

Act 2: Crypto Basics

Act 3: Details

Act 4: Math!

Epilogue

I created a heavily-commented AES/Rijndael implementation to go along with this post and put it on GitHub. In keeping with the Foot-Shooting Prevention Agreement, it shouldn’t be used for production code, but it should be helpful in seeing exactly where all the numbers came from in this play. Several resources were useful in creating this:

The Design of Rijndael is the book on the subject, written by the Rijndael creators. It was helpful in understanding specifics, especially the math (although some parts were beyond me). It’s also where I got the math notation and graphical representation in the left and right corners of the scenes describing the layers (SubBytes, ShiftRows, MixColumns, and AddRoundKey).
The FIPS-197 specification formally defines AES and provides a good overview.
The Puzzle Palace, especially chapter 9, was helpful while creating Act 1. For more on how the NSA modified DES, see this.
More on Intel’s (and now AMD’s) inclusion of native AES instructions can be found here and in detail here.
Other helpful resources include Wikipedia, Sam Trenholme’s AES math series, and this animation.

Please leave a comment if you notice something that can be better explained.

Update #1: Several scenes were updated to fix some errors mentioned in the comments.
Update #2: By request, I’ve created a slide show presentation of this play in both PowerPoint and PDF formats. I’ve licensed them under the Creative Commons Attribution License so that you can use them as you see fit. If you’re teaching a class, consider giving extra credit to any student giving a worthy interpretive dance rendition in accordance with the Foot-Shooting Prevention Agreement.

Just Enough MBA to Be a Programmer

Mon, 20 Jul 2009 08:00:00 +0000

There’s that awkward moment in your software development life when you realize that most of the people in your company aren’t programmers. Scanning your address book reveals Marketing, Sales, Accounting, Human Resources, and yes, the “business people” with their Masters of Business Administration (MBAs).

I’ve always been curious about what MBAs really do. In my weaker moments, I’ve even thought that the only reason people got an MBA was to demand a higher salary or to “move up the corporate ladder” into some management job. What did these MBA ninjas actually learn in school? Would having an MBA help me better understand how I affected my company’s bottom line? Although I had the curiosity, I never acted on it. This changed when another programmer recommended that I read The Ten-Day MBA by Steven Silbiger.

Sure, I knew that no one would anoint me with a real MBA at the end of the book any more than watching MIT lectures online would make me an MIT grad. Besides, going to a nice MBA school is more about being around other motivated people and professors. The real value in having an MBA is in applying the concepts, not the concepts themselves.

Disclaimers aside, I was determined to read the book and take notes on what a programmer should know about an MBA.

Day 1 - Marketing

Every developer painfully learns that technology doesn’t win on its own. At best, it just gives you a shot at marketing. Marketing is proof that software doesn’t sell itself, no matter how good it is.

A software company might have a Marketing Requirements Document (MRD) that outlines what the next version will contain. This usually is the result of a standard marketing analysis that the book outlined:

Consumer Analysis - Who are they? What do they want? How many different segments of people do you have? Is the buyer of your product different than the user? (The book gives the example that women buy the majority of men’s socks and underwear, thus it’s good to market appropriately).
Market Analysis - How big is your target market? Is it new? Is it growing? Where is the product in the life cycle?
Competitive Analysis - How do your Strengths, Weaknesses, Opportunities, and Threats (SWOTs) compare to your competition?
Distribution Analysis - What “channels” does your company use to reach your customer? Who are the intermediate players (e.g. the Apple Store, Amazon.com, etc)? What cuts do they take? What are their motivations?
Plan the Marketing Mix - How will you differentiate your products? How will you place it, promote it, and price it?
Determine the Economics - How long will it take before you break even? What are your fixed costs vs. marginal costs? (Thankfully software has a low marginal cost)
Revise - Tweak and repeat as needed.

One big marketing theme is to “own a word in the consumer’s mind”:

If you establish one benefit in the consumer’s mind, the consumer may attribute other positives as well to your product. FedEx means “overnight delivery.” Only one company can own a word and it is tough to change it once it’s established… The easiest way to own a word is to be first. Consumers tend to stick with products that work for them. Kleenex cleans runny noses. p.26

This explains why your family still uses MapQuest despite your repeated attempts to show them how much better Google Maps is. It’s also helpful if your product name matches what it does. “Drano” is easier to remember than a “Web 2.0” name like Qoop.

I was surprised to learn that the popular online advertising term Cost per Thousand (CPM) is a relatively old term that has long existed in print media. In general, the more targeted a group is, the higher the CPM is. This explains why a programming ad on Stack Overflow can probably fetch a better CPM than the same ad on a site like Pandora, even though programmers use both.

Marketing people typically have their reasons for doing things that frustrate us. For example, if your software will take a long time to get through a distribution channel or marketing foresees a long customer buying process, they might begin to “market” your code long before a beta is available with the belief that it’ll hopefully be ready by the time the customer is ready to buy.

Sometimes marketing has to make an extreme choice. When GTE faced rebuilding its tarnished brand in the 1990’s, it was probably a clever marketing person who suggested that they give up on fixing their brand name and re-brand themselves as Verizon.

Despite all the good advice, I was disappointed by the book’s lack of coverage of the Apple/BMW style of “marketing” the engineering department can do by creating a remarkable product. Creating a product that allows users to quickly jump over the “suck threshold” is just one example where a programmer can make a tremendous “marketing” contribution.

Day 2 - Ethics

Ethics seems easy to understand: “Do to others as you would have them do to you.” The hard part is realizing how the “others” are affected by your actions. Others include customers, executives, shareholders, suppliers, employees (and their families), the government, the planet, and the “future generations.”

Unfortunately, when simplicity is lost, Sarbanes-Oxley Acts are found.

Day 3 - Accounting

In theory, accounting is simple. Just answer these questions about your entity/business:

What does a company own?
How much does a company owe others?
How well did a company’s operations perform?
How does the company get the cash to fund itself? - p.72

If you get nothing else out of accounting, know how to read a balance sheet. Although Microsoft CEO Steve Ballmer dropped out of Stanford’s MBA program to become employee #24, he knew balance sheets were important:

In 1980, I came in to “be a business person” whatever that meant. Didn’t know much. Frankly all I’d ever really done is interview for jobs and market brownie mix. I wasn’t exactly well credentialed. I’d taken the first year at Stanford Business School so I can read a balance sheet – that was pretty important. We didn’t have that much money back then so there wasn’t much to read. But anyway those lessons were important.

Balance sheets are simple to follow:

As the name implies, the balance sheet is a “balance” sheet. The fundamental equation that rules over accounting balance is:
Assets (A) = Liabilities (L) + Owners’ Equity (OE)
What you own (assets) equals the total of what you borrowed (liabilities) and what you have invested (equity) to pay for it. This equation or “identity” explains everything that happens in the accounting records of a company over time. Remember it! - p.83

For example, your work computer is a company asset (which explains the “asset” tag on it). Your company created an equal liability to pay for it. When your company started, the founders gave up some of their money to increase the new company’s cash assets (left side) in exchange for stock in the company (right side).

For example, we can read Google’s balance sheet for the first quarter of 2009 and see:

Assets = $33.51 Billion
Liabilities = $3.66 Billion
Owners’ Equity = $29.85 Billion (which includes $14.98 Billion in “retained earnings” that Google is keeping for growth rather than giving it back to the owners of its 315.75 million shares)

Sure enough, everything “balances”:

From basic data, we can derive a bunch of helpful ratios to see how healthy Google is:

Liquidity/Current Ratio = (Current Assets / Current Liabilities) = 33.51 / 3.66 = 9.14 (Greater than 1 means there’s room to pay for liabilities)
Financial Leverage = (Total Liabilities + Owners’ Equity) / OE = (3.66 + 29.85) / 29.85 = 1.12 (Greater than 2 indicates a company is using a lot of debt to operate)
Return on Equity = (Net Income / Owners’ Equity) = 1.42 / 29.85 = 4.77% (Which indicates how efficiently the company is using shareholder equity)
… and many more …
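As a quick sanity check, the identity and the ratios above can be recomputed in a few lines of Python. This is only a sketch: the variable names are mine, the inputs are the rounded figures quoted in this post (in billions of dollars), and, like the list above, it plugs the total assets and liabilities into the current ratio. Because the inputs are rounded, the last digit can differ slightly from the numbers above.

```python
# Rounded Q1 2009 figures quoted in the post, in billions of dollars.
assets = 33.51
liabilities = 3.66
owners_equity = 29.85
net_income = 1.42

# The accounting identity A = L + OE, allowing for floating-point rounding.
is_balanced = abs(assets - (liabilities + owners_equity)) < 0.005

# The three ratios from the list above, using the same inputs as the post.
current_ratio = assets / liabilities
financial_leverage = (liabilities + owners_equity) / owners_equity
return_on_equity = net_income / owners_equity

print(f"Balanced:           {is_balanced}")
print(f"Current ratio:      {current_ratio:.2f}")
print(f"Financial leverage: {financial_leverage:.2f}")
print(f"Return on equity:   {return_on_equity:.2%}")
```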

Day 4 - Organizational Behavior

The whole purpose of Organizational Behavior (OB) is to get you to think before you act around people. You want to motivate people? OB has an equation for that:

Motivation = Expectation of Work will lead to Performance * Expectation Performance will lead to Reward * Value of Reward.

Feel free to tweak the variables as you see fit. You can Manage by Objective (MBO) where you set goals and then get out of the way or you can Manage by Walking Around (MBWA) where you play a more active role in day-to-day execution. The best choice depends on your environment and culture. You might need to mix the two. Remember that we humans are delicate creatures with our own wants and desires. Be careful.

Day 5 - Quantitative Analysis

Quantitative Analysis (QA) explains why Excel has so many functions that I’d never heard of. A core idea is that “a dollar today is worth more than a dollar received in the future.” (p.173).

Imagine that someone promises to pay you a dollar in a year if you give them money now. What is that worth to you today? Obviously, it depends on how much you trust them to pay you back. The more you trust them, the more you’re willing to give them now. Similarly, the less you trust them, the more you might “discount” that dollar in the future because they’re tying up money that could be used for better investments. This is called the “discount” or “hurdle” rate. Having a 10% discount rate means that the dollar in the future has a net present value of $0.91 today:

$1 * (1 + 10%)^-1 = $0.91

This simple idea has lots of consequences. For example, let’s oversimplify things and say that you can spend $2,000 today to buy and maintain a server that will last for 3 years or you can lock in a price with Amazon for that same server for $800 a year for the same 3 years. A naïve person would just see that $2000 is less than $2400, but a QA person that assigns a 10% discount rate would see:

… and come to the conclusion that it’s about $10 cheaper, in today’s dollars, to have Amazon maintain the server.
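The comparison can be sketched in Python. The function names are mine, and I’m assuming each $800 payment lands at the end of its year, which is the assumption that reproduces the roughly $10 difference:

```python
def present_value(amount, rate, years):
    """Value today of `amount` received `years` from now."""
    return amount / (1 + rate) ** years


def npv_of_payments(payment, rate, num_years):
    """Present value of equal payments made at the end of each year."""
    return sum(present_value(payment, rate, year)
               for year in range(1, num_years + 1))


discount_rate = 0.10
buy_now = 2000.0  # paid today, so no discounting is needed
amazon = npv_of_payments(800.0, discount_rate, 3)
savings = buy_now - amazon

print(f"$1 a year from now is worth ${present_value(1, discount_rate, 1):.2f} today")
print(f"Amazon option in today's dollars: ${amazon:.2f}")
print(f"Savings from choosing Amazon:     ${savings:.2f}")
```

If the payments were due at the start of each year instead, the first $800 wouldn’t be discounted at all and buying the server outright would come out ahead, so the timing assumption matters.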

You can also do the inverse calculation. Assume you’re Amazon and that server costs you $1800 today and you can get someone to pay you $800 a year for it for 3 years. What is your internal rate of return for this investment?

Here we see an internal rate of return of about 16% on the server.
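The internal rate of return is just the discount rate that makes the net present value come out to zero. Here’s a minimal sketch that finds it by bisection. The helper names are mine, and it assumes end-of-year payments and a rate somewhere between 0% and 100%:

```python
def npv(rate, cash_flows):
    """Net present value, where cash_flows[0] happens today."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


def irr(cash_flows, low=0.0, high=1.0, tolerance=1e-9):
    """Bisect for the rate where NPV crosses zero.

    Assumes NPV is positive at `low` and negative at `high`, which holds
    for a single up-front cost followed by positive payments.
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if npv(mid, cash_flows) > 0:
            low = mid   # still profitable, so the break-even rate is higher
        else:
            high = mid
    return (low + high) / 2


# Pay $1800 today, then receive $800 at the end of each of 3 years.
server = [-1800.0, 800.0, 800.0, 800.0]
rate = irr(server)
print(f"Internal rate of return: {rate:.1%}")
```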

We could also use the time value of money to include valuing users. Early adopters of eBay and Twitter were worth more per user than late adopters because the early ones were more likely to tell their friends who hadn’t used the service and thus attract more new people.

Day 6 - Finance

Finance blends time, money, and risk.

To start, a business needs a structure that gives it some capital. Popular options include:

Sole Proprietorships - An individual or a married couple. You are effectively your business. All earnings are treated as personal income and taxed appropriately. You take in all the profits but also have unlimited liability. You can’t divide the company up. It’s simple, but the downside is that it makes it hard to raise money.
Partnerships - Involves more people than a proprietorship. Several people come together and can be general partners (each having unlimited liability) or limited partners (liable up to the investment). As a partner, you pay taxes on your percentage of the business’s income on your personal taxes.
Corporations - Effectively you give birth to a new legal entity that is distinct from the shareholders. Most large companies are “C Corporations” and have a double taxation issue where the corporation’s income is taxed and the dividends it issues to shareholders are taxed as well. If you have a smaller company with fewer than 100 shareholders, you may qualify for “S Corporation” status. S Corporations usually don’t pay income tax and instead rely on shareholders to pay the associated tax on their percentage of the income. This tends to give S Corporations the legal liability benefit of corporation status and the single taxation benefit of partnerships.

Corporations issue stock to raise money. Stock entitles the holder to a residual claim on earnings and assets after other debt obligations have been met. One obvious question is “what’s a good stock price?” This has a lot of factors, such as a company’s growth potential and the company’s earnings. Popular metrics include a company’s ratio of its stock price divided by its earnings (P/E ratio). Higher P/E ratios tend to indicate that shareholders have higher expectations the company will grow and eventually make more money in the future. Some examples:

Company	P/E Ratio
Google	31.47
Microsoft	13.98
Amazon.com	54.98

After you’ve raised some capital, you should think carefully about how you’ll spend it. There are many ways to do this. The Payback Period Method has you calculate how long it’ll take to recover your investment. The shorter the payback period, the less risky the investment is. For example, adding RAM is so cheap that the productivity boost has a short payback period. In contrast, completely rewriting a huge codebase might put your company out of business before you get your money back.

Another approach is to use the Net Present Value Method to see how much the investment will return over its lifetime in terms of today’s dollars. Once you determine the discount factor to reflect the risk, you only consider investments that have a positive Net Present Value.

Day 7 - Operations

Operations is about making stuff. Popular operations guys include Frederick Taylor from the late 1800’s who is famous for breaking up tasks into small pieces and walking around factories with a stopwatch to find the “one right way” of doing them. Elton Mayo’s bold claim was that caring about your employees mattered. You could even get away with terrible working conditions if the employees were otherwise treated well and felt important.

Although some MBAs might use some programming techniques like optimizing flow-charts to improve operations, it’s more likely to see factory techniques used when managing programmers. Oversimplifying things, software development is a factory that turns capital into code. To this end, you’ll often see popular manufacturing processes like Toyota’s Kanban method of using visual cards to control workflow making their way into our world as “new” or “agile” software methodologies.

Day 8 - Economics

Economics is the magic that allows me to write software in exchange for steak burritos. As Adam Smith realized, society as a whole becomes “wealthier” when we seek division of labor to specialize and do something well rather than trying to do everything ourselves poorly.

At a micro level, economics is a simple matter of supply equals demand. When you look at the larger/macro economies, more complicated equations pop up like this one:

Money × Velocity = Price Level × Real Gross National Product

This equation shows that it’s important that money is moving around (e.g. isn’t hidden under your mattress) and that prices are stable or have reasonable growth.

One of the best things about the economics of software is that it has really low marginal costs (e.g. the cost to copy it). With processors, bandwidth, and storage all roughly following Moore’s Law exponential curves, the capacity is doubling every 18 - 24 months which implies that the cost for a fixed amount is falling by half over the same period.

As Chris Anderson points out in his book Free, it can sometimes make sense to round these increasingly lower marginal costs down to zero and make money in different ways such as advertising or selling complements. It’s hard to find other industries that have as many economic freedoms as software.

Day 9 - Strategy

Strategy should be simple: have a remarkable product that people want. Bad things happen if you don’t do this. It’s especially helpful if you have a cash cow you can milk for lots of money to fund new initiatives. For example, Google makes so much money from ads that it can have this strategy:

Revenue = Amount of Web Pages Viewed

Google’s strategy of getting you to view lots of pages (which conveniently have Google ads on them) explains a lot of what it does. From wanting to speed up the web, to making a free phone OS, to creating a ton of free services to keep you hooked on the web. Google really doesn’t care what you do so long as you enjoy it and take in the targeted ads.

The book tended to focus on more traditional forms of strategy such as “cost leadership”, “differentiation”, and “focus on the customer” as well as applying lessons from the famous prisoner’s dilemma. I acknowledge that these are important as well, but I think that at its core, strategy can be simple.

Day 10 - Minicourses

The book ended with “minicourses” in areas relevant to business such as:

Property (real estate, patents, copyright, etc)
Leadership (e.g. schools want to create ‘leaders’ because they’ll be better future donors).

Although those were interesting, the section I enjoyed the most was on business law.

In our jobs, we often bump into legal matters. We face End User License Agreements (EULAs) and Non-Disclosure Agreements (NDAs) that we rarely read and often don’t fully understand. It was interesting to see that any proper contract requires the following four conditions to be valid:

Capacity of Parties - Parties must have legal authorization and be mentally capable to enter into the agreement.
Mutual Agreement (Assent) or Meeting of the Minds - There must be a valid offer and an acceptance.
Consideration Given - Value must be given for the promise to be enforceable.
Legality - You can’t enforce a contract dealing with illegal goods or actions.

When bad things happen, it can sometimes escalate to a “legal action” which has a standard procedure involving steps you sometimes hear in the news:

Jurisdiction - For a court to hear a case, it must have “jurisdiction” to hear the case and power to bind the parties to the decision.
Pleadings - The paperwork to start the trial process. The plaintiff (π) files a complaint asserting that the defendant (Δ) has done something wrong and requests a punishment or remedy.
Discovery - Lawyers gather witnesses and evidence before a trial. Each side is allowed to see the evidence held by the other side.
Pretrial Conference - The lawyers and judge try to focus the case on the most important issues. This is also a good time for out-of-court settlements if possible.
Trial - Occurs before the court. The jury decides the factual disputes. The case can be thrown out by the judge with a “summary judgment” if it has no merit.
Jury Instruction by the Judge and the Verdict - The judge instructs the jury about the relevant law involved and the jury makes its decision about the facts and penalty within its authority.
Posttrial Motions - Options include asking for a retrial if an error of law or procedure occurred (e.g. jury misconduct).
Appeal - Each party in a lawsuit is entitled to one appeal at an appellate court where they can file a written brief with arguments for a new trial.
Secure or Enforce the Judgment - Send the person to jail and/or collect money.

While the short overview was intriguing, it reinforced my belief that it’s important to have an attorney or a lawyer when it comes to legal matters. At the very least, they usually have malpractice insurance if things go really bad.

Conclusion

The Ten-Day MBA helped me move from being unconsciously incompetent about business administration to becoming consciously incompetent in just a few days. I think that alone made it worth the time. I don’t have aspirations to get a real MBA, but I now have more respect for those that do.

And now, back to programming…

The First Few Milliseconds of an HTTPS Connection

Wed, 10 Jun 2009 08:57:00 +0000

Convinced from spending hours reading rave reviews, Bob eagerly clicked “Proceed to Checkout” for his gallon of Tuscan Whole Milk and…

Whoa! What just happened?

In the 220 milliseconds that flew by, a lot of interesting stuff happened to make Firefox change the address bar color and put a lock in the lower right corner. With the help of Wireshark, my favorite network tool, and a slightly modified debug build of Firefox, we can see exactly what’s going on.

By agreement of RFC 2818, Firefox knew that “https” meant it should connect to port 443 at Amazon.com:

Most people associate HTTPS with SSL (Secure Sockets Layer) which was created by Netscape in the mid 90’s. This is becoming less true over time. As Netscape lost market share, SSL’s maintenance moved to the Internet Engineering Task Force (IETF). The first post-Netscape version was re-branded as Transport Layer Security (TLS) 1.0 which was released in January 1999. It’s rare to see true “SSL” traffic given that TLS has been around for 10 years.

Client Hello

TLS wraps all traffic in “records” of different types. We see that the first byte out of our browser is the hex byte 0x16 = 22 which means that this is a “handshake” record:

The next two bytes are 0x0301 which indicate that this is a version 3.1 record which shows that TLS 1.0 is essentially SSL 3.1.
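To make the record layout concrete, here’s a minimal Python sketch that decodes a record header: one byte of content type, two bytes of version, and then a two-byte length. The function and dictionary names are mine, and the length in the sample bytes is made up for illustration since the real capture isn’t reproduced here.

```python
import struct

# Record content types defined by TLS 1.0.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}

# How the on-the-wire version bytes map to protocol names.
VERSION_NAMES = {(3, 0): "SSL 3.0", (3, 1): "TLS 1.0"}


def parse_record_header(data):
    """Decode the 5-byte header that starts every TLS record."""
    content_type, major, minor, length = struct.unpack("!BBBH", data[:5])
    return {
        "type": CONTENT_TYPES.get(content_type, "unknown"),
        "version": VERSION_NAMES.get((major, minor), f"{major}.{minor}"),
        "length": length,
    }


# 0x16 and 0x0301 come from the capture; the length 0x00c4 is hypothetical.
header = parse_record_header(bytes.fromhex("16030100c4"))
print(header)
```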

The handshake record is broken out into several messages. The first is our “Client Hello” message (0x01). There are a few important things here:

Random:

There are four bytes representing the current Coordinated Universal Time (UTC) in the Unix epoch format, which is the number of seconds since January 1, 1970. In this case, 0x4a2f07ca. It’s followed by 28 random bytes. This will be used later on.
Session ID:

Here it’s empty/null. If we had previously connected to Amazon.com a few seconds ago, we could potentially resume a session and avoid a full handshake.
Cipher Suites:

This is a list of all of the encryption algorithms that the browser is willing to support. Its top pick is a very strong choice of “TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA” followed by 33 others that it’s willing to accept. Don’t worry if none of that makes sense. We’ll find out later that Amazon doesn’t pick our first choice anyway.
server_name extension:

This is a way to tell Amazon.com that our browser is trying to reach https://www.amazon.com/. This is really convenient because our TLS handshake occurs long before any HTTP traffic. HTTP has a “Host” header which allows cost-cutting Internet hosting companies to pile hundreds of websites onto a single IP address. SSL has traditionally required a different IP for each site, but this extension allows the server to respond with the appropriate certificate that the browser is looking for. If nothing else, this extension should allow an extra week or so of IPv4 addresses.
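Going back to the “Random” field for a moment, its 32 bytes are easy to pull apart. In this rough Python sketch (the function name is mine), the first four bytes are the timestamp from the capture and the remaining 28 bytes are a stand-in from os.urandom since the real ones aren’t listed above:

```python
import os
import struct
from datetime import datetime, timezone


def split_random(random_bytes):
    """Split a 32-byte hello Random into its timestamp and random tail."""
    if len(random_bytes) != 32:
        raise ValueError("the Random field is always 32 bytes")
    (gmt_unix_time,) = struct.unpack("!I", random_bytes[:4])
    timestamp = datetime.fromtimestamp(gmt_unix_time, tz=timezone.utc)
    return timestamp, random_bytes[4:]


# 0x4a2f07ca is from the capture; the 28 random bytes are placeholders.
client_random = bytes.fromhex("4a2f07ca") + os.urandom(28)
timestamp, tail = split_random(client_random)
print(timestamp.isoformat(), len(tail))
```

The timestamp decodes to June 10, 2009 (UTC), which is the day this post was published.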

Server Hello

Amazon.com replies with a handshake record that’s a massive two packets in size (2,551 bytes). The record has version bytes of 0x0301 meaning that Amazon agreed to our request to use TLS 1.0. This record has three sub-messages with some interesting data:

“Server Hello” Message (2):
- We get the server’s four-byte Unix epoch time representation and its 28 random bytes that will be used later.
- A 32 byte session ID in case we want to reconnect without a big handshake.
- Of the 34 cipher suites we offered, Amazon picked “TLS_RSA_WITH_RC4_128_MD5” (0x0004). This means that it will use the “RSA” public key algorithm to verify certificate signatures and exchange keys, the RC4 encryption algorithm to encrypt data, and the MD5 hash function to verify the contents of messages. We’ll cover these in depth later on. I personally think Amazon had selfish reasons for choosing this cipher suite. Of the ones on the list, it was the one that was least CPU intensive to use so that Amazon could crowd more connections onto each of their servers. A much less likely possibility is that they wanted to pay special tribute to Ron Rivest, who created all three of these algorithms.
Certificate Message (11):
- This message takes a whopping 2,464 bytes and is the certificate that the client can use to validate Amazon’s identity. It isn’t anything fancy. You can view most of its contents in your browser:
“Server Hello Done” Message (14):
- This is a zero byte message that tells the client that it’s done with the “Hello” process and indicates that the server won’t be asking the client for a certificate.
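The numbers in parentheses above are the handshake message types. Each handshake message starts with a one-byte type followed by a three-byte length, and the cipher suite name itself spells out which algorithm does which job. Here’s a small Python sketch of both ideas. The function names are mine and the name-splitting is a convenience for suites of the form TLS_x_WITH_y_z, not something the protocol does:

```python
# The handshake message types mentioned in this post.
HANDSHAKE_TYPES = {
    1: "client_hello",
    2: "server_hello",
    11: "certificate",
    14: "server_hello_done",
}


def parse_handshake_header(data):
    """Decode the 1-byte type and 3-byte length that start each message."""
    message_type = HANDSHAKE_TYPES.get(data[0], "unknown")
    length = int.from_bytes(data[1:4], "big")
    return message_type, length


def describe_cipher_suite(name):
    """Split a TLS_<key exchange>_WITH_<cipher>_<mac> suite name."""
    key_exchange, rest = name[len("TLS_"):].split("_WITH_")
    cipher, mac = rest.rsplit("_", 1)
    return {"key_exchange": key_exchange, "cipher": cipher, "mac": mac}


# A zero-length "Server Hello Done" is just its 4-byte header.
print(parse_handshake_header(bytes.fromhex("0e000000")))
print(describe_cipher_suite("TLS_RSA_WITH_RC4_128_MD5"))
```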

Checking out the Certificate

The browser has to figure out if it should trust Amazon.com. In this case, it’s using certificates. It looks at Amazon’s certificate and sees that the current time is between the “not before” time of August 26, 2008 and the “not after” time of August 27, 2009. It also checks to make sure that the certificate’s public key is authorized for exchanging secret keys.
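The date check itself is about as simple as it sounds. As a rough sketch (the names are mine, and since only the dates are given above, midnight UTC is an assumption for the exact boundaries):

```python
from datetime import datetime, timezone

# Dates from the certificate; the midnight UTC time of day is assumed.
NOT_BEFORE = datetime(2008, 8, 26, tzinfo=timezone.utc)
NOT_AFTER = datetime(2009, 8, 27, tzinfo=timezone.utc)


def is_within_validity(now, not_before=NOT_BEFORE, not_after=NOT_AFTER):
    """True if `now` falls inside the certificate's validity window."""
    return not_before <= now <= not_after


day_of_capture = datetime(2009, 6, 10, tzinfo=timezone.utc)
print(is_within_validity(day_of_capture))
```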

Why should we trust this certificate?

Attached to the certificate is a “signature” that is just a really long number in big-endian format:

Anyone could have sent us these bytes. Why should we trust this signature? To answer that question, we need to make a speedy detour into mathemagic land:

Interlude: A Short, Not Too Scary, Guide to RSA

People sometimes wonder if math has any relevance to programming. Certificates give a very practical example of applied math. Amazon’s certificate tells us that we should use the RSA algorithm to check the signature. RSA was created in the 1970’s by MIT professors Ron Rivest, Adi Shamir, and Len Adleman who found a clever way to combine ideas spanning 2000 years of math development to come up with a beautifully simple algorithm:

You pick two huge prime numbers “p” and “q.” Multiply them to get “n = p*q.” Next, you pick a small public exponent “e” which is the “encryption exponent” and a specially crafted inverse of “e” called “d” as the “decryption exponent.” You then make “n” and “e” public and keep “d” as secret as you possibly can and then throw away “p” and “q” (or keep them as secret as “d”). It’s really important to remember that “e” and “d” are inverses of each other.

Now, if you have some message, you just need to interpret its bytes as a number “M.” If you want to “encrypt” a message to create a “ciphertext”, you’d calculate:

C ≡ M^e (mod n)

This means that you multiply “M” by itself “e” times. The “mod n” means that we only take the remainder (i.e. the “modulus”) when dividing by “n.” For example, 11 AM + 3 hours ≡ 2 (PM) (mod 12 hours). The recipient knows “d” which allows them to invert the operation and recover the original message:

C^d ≡ (M^e)^d ≡ M^(e*d) ≡ M^1 ≡ M (mod n)

Just as interesting is that the person with “d” can “sign” a document by raising a message “M” to the “d” exponent:

M^d ≡ S (mod n)

This works because the “signer” makes public “S”, “M”, “e”, and “n.” Anyone can verify the signature “S” with a simple calculation:

S^e ≡ (M^d)^e ≡ M^(d*e) ≡ M^(e*d) ≡ M^1 ≡ M (mod n)

Public key cryptography algorithms like RSA are often called “asymmetric” algorithms because the encryption key (in our case, “e”) is not equal to (i.e. “symmetric” with) the decryption key “d”. Reducing everything “mod n” makes it impossible to use the easy techniques that we’re used to such as normal logarithms. The magic of RSA works because you can calculate/encrypt C ≡ M^e (mod n) very quickly, but it is really hard to calculate/decrypt C^d ≡ M (mod n) without knowing “d.” Deriving “d” requires knowing “p” and “q”, and recovering those from the public “n” means factoring it, which is a tough problem.
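To see all of this with numbers small enough to check by hand, here’s a toy Python sketch. The tiny primes and the function names are my own choices for illustration, and I’m computing “d” as the inverse of “e” modulo (p-1)*(q-1), which is one common way to craft it. Real keys use primes hundreds of digits long plus padding, so this is nowhere near safe for actual use.

```python
# Toy-sized primes so the arithmetic is easy to follow.
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)   # only computable if you know p and q
e = 17                    # public "encryption exponent"
d = pow(e, -1, phi)       # private inverse of e (needs Python 3.8+)


def encrypt(message):
    return pow(message, e, n)      # C = M^e (mod n)


def decrypt(ciphertext):
    return pow(ciphertext, d, n)   # M = C^d (mod n)


def sign(message):
    return pow(message, d, n)      # S = M^d (mod n)


def verify(message, signature):
    return pow(signature, e, n) == message   # does S^e give back M?


message = 65
ciphertext = encrypt(message)
signature = sign(message)
print(n, d, ciphertext, decrypt(ciphertext), verify(message, signature))
```

Python’s three-argument pow does the “mod n” reduction at every step, which is what keeps the intermediate numbers from exploding even when the exponent is large.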

Verifying Signatures

The big thing to keep in mind with RSA in the real world is that all of the numbers involved have to be big to make things really hard to break using the best algorithms that we have. How big? Amazon.com’s certificate was “signed” by “VeriSign Class 3 Secure Server CA.” From the certificate, we see that this VeriSign modulus “n” is 2048 bits long which has this 617 digit base-10 representation:

1890572922 9464742433 9498401781 6528521078 8629616064 3051642608 4317020197 7241822595 6075980039 8371048211 4887504542 4200635317 0422636532 2091550579 0341204005 1169453804 7325464426 0479594122 4167270607 6731441028 3698615569 9947933786 3789783838 5829991518 1037601365 0218058341 7944190228 0926880299 3425241541 4300090021 1055372661 2125414429 9349272172 5333752665 6605550620 5558450610 3253786958 8361121949 2417723618 5199653627 5260212221 0847786057 9342235500 9443918198 9038906234 1550747726 8041766919 1500918876 1961879460 3091993360 6376719337 6644159792 1249204891 7079005527 7689341573 9395596650 5484628101 0469658502 1566385762 0175231997 6268718746 7514321

(Good luck trying to find “p” and “q” from this “n” - if you could, you could generate real-looking VeriSign certificates.)

VeriSign’s “e” is 2¹⁶ + 1 = 65537. Of course, they keep their “d” value secret, probably on a safe hardware device protected by retinal scanners and armed guards. Before signing, VeriSign checked the validity of the contents that Amazon.com claimed on its certificate using a real-world “handshake” that involved looking at several of their business documents. Once VeriSign was satisfied with the documents, they used the SHA-1 hash algorithm to get a hash value of the certificate that had all the claims. In Wireshark, the full certificate shows up as the “signedCertificate” part:

It’s sort of a misnomer since it actually means that those are the bytes that the signer is going to sign and not the bytes that already include a signature.

The actual signature, “S”, is simply called “encrypted” in Wireshark. If we raise “S” to VeriSign’s public “e” exponent of 65537 and then take the remainder when divided by the modulus “n”, we get this “decrypted” signature hex value:

0001FFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFF00302130 0906052B0E03021A 05000414C19F8786 871775C60EFE0542 E4C2167C830539DB

Per the PKCS #1 v1.5 standard, the first byte is “00” and it “ensures that the encryption block, [when] converted to an integer, is less than the modulus.” The second byte of “01” indicates that this is a private key operation (i.e. it’s a signature). This is followed by a lot of “FF” bytes that are used to pad the result to make sure that it’s big enough. The padding is terminated by a “00” byte. It’s followed by “30 21 30 09 06 05 2B 0E 03 02 1A 05 00 04 14” which is the PKCS #1 v2.1 way of specifying the SHA-1 hash algorithm. The last 20 bytes are the SHA-1 hash digest of the bytes in “signedCertificate.”
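The layout just described can be checked mechanically. The sketch below is an assumption-laden illustration rather than what NSS does: the function name is made up, and the test block is rebuilt from the pieces above (218 “FF” bytes is what a 256-byte, 2048-bit block implies) instead of being copied from the capture.

```python
# SHA-1 DigestInfo prefix quoted in the text above.
SHA1_DIGEST_INFO = bytes.fromhex("3021300906052b0e03021a05000414")


def parse_signature_block(block: bytes) -> bytes:
    """Return the 20-byte SHA-1 digest embedded in a PKCS #1 v1.5 signature block."""
    if block[:2] != b"\x00\x01":
        raise ValueError("block must start with 00 01")
    separator = block.index(b"\x00", 2)          # first 00 after the padding
    padding = block[2:separator]
    if len(padding) < 8 or set(padding) != {0xFF}:
        raise ValueError("padding must be at least eight FF bytes")
    rest = block[separator + 1:]
    if not rest.startswith(SHA1_DIGEST_INFO):
        raise ValueError("unexpected hash algorithm identifier")
    digest = rest[len(SHA1_DIGEST_INFO):]
    if len(digest) != 20:
        raise ValueError("SHA-1 digest must be 20 bytes")
    return digest


# Rebuild a block with the same shape as the "decrypted" signature above.
digest_from_article = bytes.fromhex("c19f8786871775c60efe0542e4c2167c830539db")
block = b"\x00\x01" + b"\xff" * 218 + b"\x00" + SHA1_DIGEST_INFO + digest_from_article
print(parse_signature_block(block).hex())
```

A real verifier would then compare the returned digest with `hashlib.sha1(...)` computed over the “signedCertificate” bytes.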

Since the decrypted value is properly formatted and the last bytes are the same hash value that we can calculate independently, we can assume that whoever knew “VeriSign Class 3 Secure Server CA”’s private key “signed” it. We implicitly trust that only VeriSign knows the private key “d.”

We can repeat the process to verify that “VeriSign Class 3 Secure Server CA”’s certificate was signed by VeriSign’s “Class 3 Public Primary Certification Authority.”

But why should we trust that? There are no more levels on the trust chain.

The top “VeriSign Class 3 Public Primary Certification Authority” was signed by itself. This certificate has been built into Mozilla products as an implicitly trusted good certificate since version 1.4 of certdata.txt in the Network Security Services (NSS) library. It was checked-in on September 6, 2000 by Netscape’s Robert Relyea with the following comment:

“Make the framework compile with the rest of NSS. Include a ‘live’ certdata.txt with those certs we have permission to push to open source (additional certs will be added as we get permission from the owners).”

This decision has had a relatively long impact since the certificate has a validity range of January 28, 1996 - August 1, 2028.

As Ken Thompson explained so well in his “Reflections on Trusting Trust”, you ultimately have to implicitly trust somebody. There is no way around this problem. In this case, we’re implicitly trusting that Robert Relyea made a good choice. We also hope that Mozilla’s built-in certificate policy is reasonable for the other built-in certificates.

One thing to keep in mind here is that all these certificates and signatures were simply used to form a trust chain. On the public Internet, VeriSign’s root certificate is implicitly trusted by Firefox long before you go to any website. In a company, you can create your own root certificate authority (CA) that you can install on everyone’s machine.

Alternatively, you can get around having to pay companies like VeriSign and avoid certificate trust chains altogether. Certificates are used to establish trust by using a trusted third-party (in this case, VeriSign). If you have a secure means of sharing a secret “key”, such as whispering a long password into someone’s ear, then you can use that pre-shared key (PSK) to establish trust. There are extensions to TLS to allow this, such as TLS-PSK, and my personal favorite, TLS with Secure Remote Password (SRP) extensions. Unfortunately, these extensions aren’t nearly as widely deployed and supported, so they’re usually not practical. Additionally, these alternatives impose a burden that we have to have some other secure means of communicating the secret that’s more cumbersome than what we’re trying to establish with TLS (otherwise, why wouldn’t we use that for everything?).

One final check that we need to do is to verify that the host name on the certificate is what we expected. Nelson Bolyard’s comment in the SSL_AuthCertificate function explains why:

/* cert is OK. This is the client side of an SSL connection.
 * Now check the name field in the cert against the desired hostname.
 * NB: This is our only defense against Man-In-The-Middle (MITM) attacks! 
 */

This check helps protect against a man-in-the-middle attack because we are implicitly trusting that the people on the certificate trust chain wouldn’t do something bad, like sign a certificate claiming to be from Amazon.com unless it actually was Amazon.com. If an attacker is able to modify your DNS server by using a technique like DNS cache poisoning, you might be fooled into thinking you’re at a trusted site (like Amazon.com) because the address bar will look normal. This last check implicitly trusts certificate authorities to stop these bad things from happening.

Pre-Master Secret

We’ve verified some claims about Amazon.com and know its public encryption exponent “e” and modulus “n.” Anyone listening in on the traffic can know this as well (as evidenced because we are using Wireshark captures). Now we need to create a random secret key that an eavesdropper/attacker can’t figure out. This isn’t as easy as it sounds. In 1996, researchers figured out that Netscape Navigator 1.1 was using only three sources to seed their pseudo-random number generator (PRNG). The sources were: the time of day, the process id, and the parent process id. As the researchers showed, these “random” sources aren’t that random and were relatively easy to figure out.

Since everything else was derived from these three “random” sources, it was possible to “break” the SSL “security” in 25 seconds on a 1996 era machine. If you still don’t believe that finding randomness is hard, just ask the Debian OpenSSL maintainers. If you mess it up, all the security built on top of it is suspect.
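To get a feel for why guessable seeds are fatal, here is a toy sketch. It is not Netscape’s actual algorithm: the seed packing, the use of Python’s `random` module, and all of the numbers are assumptions chosen so the search finishes quickly.

```python
import random


def weak_secret(seconds: int, pid: int, ppid: int) -> int:
    """Derive a 'secret' from three guessable inputs (toy construction)."""
    seed = (seconds << 32) | (pid << 16) | ppid
    return random.Random(seed).getrandbits(128)


def recover_seed(target: int, time_guess: int, window: int, max_pid: int):
    """Try every plausible (time, pid, ppid) until the secret is reproduced."""
    for seconds in range(time_guess - window, time_guess + window + 1):
        for pid in range(1, max_pid):
            for ppid in range(1, max_pid):
                if weak_secret(seconds, pid, ppid) == target:
                    return seconds, pid, ppid
    return None


victim = weak_secret(1_000_000_007, 42, 7)
# The attacker only knows the time roughly and that process ids are small.
print(recover_seed(victim, 1_000_000_005, 3, 64))
```

The 128-bit output looks random, but the attacker only has to search the few thousand possible seeds, not the 2¹²⁸ possible outputs.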

On Windows, random numbers used for cryptographic purposes are generated by calling the CryptGenRandom function that hashes bits sampled from over 125 sources. Firefox uses this function along with some bits derived from its own function to seed its pseudo-random number generator.

The 48 byte “pre-master secret” random value that’s generated isn’t used directly, but it’s very important to keep it secret since a lot of things are derived from it. Not surprisingly, Firefox makes it hard to find out this value. I had to compile a debug version and set the SSLDEBUGFILE and SSLTRACE environment variables to see it.

In this particular session, the pre-master secret showed up in the SSLDEBUGFILE as:

4456: SSL[131491792]: Pre-Master Secret [Len: 48]
03 01 bb 7b 08 98 a7 49 de e8 e9 b8 91 52 ec 81   ...{...I.....R..
4c c2 39 7b f6 ba 1c 0a b1 95 50 29 be 02 ad e6   L.9{......P)....
ad 6e 11 3f 20 c4 66 f0 64 22 57 7e e1 06 7a 3b   .n.? .f.d"W~..z;

Note that it’s not completely random. The first two bytes are, by convention, the TLS version (03 01).
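As a rough sketch of that structure (not Firefox’s code, and the function name is made up), a pre-master secret is just two version bytes followed by 46 bytes from a cryptographically strong source:

```python
import secrets


def new_pre_master_secret(version: bytes = b"\x03\x01") -> bytes:
    """Two TLS version bytes followed by 46 bytes from the OS's secure RNG."""
    if len(version) != 2:
        raise ValueError("version must be exactly two bytes")
    return version + secrets.token_bytes(46)


pms = new_pre_master_secret()
print(len(pms), pms[:2].hex())
```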

Trading Secrets

We now need to get this secret value over to Amazon.com. By Amazon’s wishes of “TLS_RSA_WITH_RC4_128_MD5”, we will use RSA to do this. You could make your input message equal to just the 48 byte pre-master secret, but the Public Key Cryptography Standard (PKCS) #1, version 1.5 RFC tells us that we should pad these bytes with random data to make the input equal to exactly the size of the modulus (1024 bits/128 bytes). This makes it harder for an attacker to determine our pre-master secret. It also gives us one last chance to protect ourselves in case we did something really bone-headed, like reusing the same secret. If we reused the key, the eavesdropper would likely see a different value placed on the network due to the random padding.

Again, Firefox makes it hard to see these random values. I had to insert debugging statements into the padding function to see what was going on:

wrapperHandle = fopen("plaintextpadding.txt", "a");
fprintf(wrapperHandle, "PLAINTEXT = ");
for(i = 0; i < modulusLen; i++)
{
    fprintf(wrapperHandle, "%02X ", block[i]);
}
fprintf(wrapperHandle, "\r\n");
fclose(wrapperHandle);

In this session, the full padded value was:

00 02 12 A3 EA B1 65 D6 81 6C 13 14 13 62 10 53 23 B3 96 85 FF 24 FA CC 46 11 21 24 A4 81 EA 30 63 95 D4 DC BF 9C CC D0 2E DD 5A A6 41 6A 4E 82 65 7D 70 7D 50 09 17 CD 10 55 97 B9 C1 A1 84 F2 A9 AB EA 7D F4 CC 54 E4 64 6E 3A E5 91 A0 06 00 03 01 BB 7B 08 98 A7 49 DE E8 E9 B8 91 52 EC 81 4C C2 39 7B F6 BA 1C 0A B1 95 50 29 BE 02 AD E6 AD 6E 11 3F 20 C4 66 F0 64 22 57 7E E1 06 7A 3B
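The shape of that block (00 02, non-zero random padding, a 00 separator, then the message) can be reproduced with a few lines. This is a sketch of the PKCS #1 v1.5 encryption padding, not the NSS function I instrumented; the function name and the stand-in secret are assumptions.

```python
import secrets


def pkcs1_v15_encrypt_pad(message: bytes, modulus_len: int) -> bytes:
    """Build 00 02 <non-zero random padding> 00 <message>, modulus_len bytes long."""
    pad_len = modulus_len - 3 - len(message)
    if pad_len < 8:
        raise ValueError("message too long for this modulus")
    # Padding bytes must be non-zero so the 00 separator is unambiguous.
    padding = bytes(secrets.randbelow(255) + 1 for _ in range(pad_len))
    return b"\x00\x02" + padding + b"\x00" + message


secret = b"\x03\x01" + bytes(46)          # stand-in 48 byte pre-master secret
padded = pkcs1_v15_encrypt_pad(secret, 128)
print(len(padded), padded[:2].hex())
```

For a 128 byte modulus and a 48 byte secret, that leaves 77 bytes of random padding, which matches the dump above.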

Firefox took this value and calculated “C ≡ M^e (mod n)” to get the value we see in the “Client Key Exchange” record:

Finally, Firefox sent out one last unencrypted message, a “Change Cipher Spec” record:

This is Firefox’s way of telling Amazon that it’s going to start using the agreed upon secret to encrypt its next message.

Deriving the Master Secret

If we’ve done everything correctly, both sides (and only those sides) now know the 48 byte (384 bit) pre-master secret. There’s a slight trust issue here from Amazon’s perspective: the pre-master secret just has bits that were generated by the client; they don’t take into account anything from the server or anything we said earlier. We’ll fix that by computing the “master secret.” Per the spec, this is done by calculating:

master_secret = PRF(pre_master_secret, 
                    "master secret", 
                    ClientHello.random + ServerHello.random)

The “pre_master_secret” is the secret value we sent earlier. The “master secret” is simply a string whose ASCII bytes (e.g. “6d 61 73 74 65 72 …”) are used. We then concatenate the random values that were sent in the ClientHello and ServerHello (from Amazon) messages that we saw at the beginning.

The PRF is the “Pseudo-Random Function” that’s also defined in the spec and is quite clever. It combines the secret, the ASCII label, and the seed data we give it by using the keyed-Hash Message Authentication Code (HMAC) versions of both MD5 and SHA-1 hash functions. Half of the secret is sent to each hash function. It’s clever because it is quite resistant to attack, even in the face of weaknesses in MD5 and SHA-1. This process can feed back on itself and iterate forever to generate as many bytes as we need.
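Here is a minimal sketch of the TLS 1.0 PRF as I read the spec, written with Python’s standard `hmac` module. The function names are mine, and the random values in the usage lines are placeholders, since the real ClientHello and ServerHello randoms aren’t repeated in this section.

```python
import hmac


def p_hash(hash_name: str, secret: bytes, seed: bytes, length: int) -> bytes:
    """Expand secret+seed to `length` bytes: A(i) = HMAC(secret, A(i-1))."""
    output = b""
    a = seed                                           # A(0)
    while len(output) < length:
        a = hmac.new(secret, a, hash_name).digest()    # A(i)
        output += hmac.new(secret, a + seed, hash_name).digest()
    return output[:length]


def tls10_prf(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """XOR an MD5 expansion of the first half with a SHA-1 expansion of the second."""
    half = (len(secret) + 1) // 2          # halves share a byte if the length is odd
    s1, s2 = secret[:half], secret[len(secret) - half:]
    md5_part = p_hash("md5", s1, label + seed, length)
    sha_part = p_hash("sha1", s2, label + seed, length)
    return bytes(x ^ y for x, y in zip(md5_part, sha_part))


# Placeholder inputs (assumptions): real values come from the handshake.
pre_master_secret = b"\x03\x01" + bytes(46)
client_random = bytes(range(32))
server_random = bytes(range(32, 64))
master_secret = tls10_prf(pre_master_secret, b"master secret",
                          client_random + server_random, 48)
print(len(master_secret))
```

Because each hash only sees half of the secret and the two outputs are XORed together, an attacker would need to break both HMAC-MD5 and HMAC-SHA-1 to predict the result.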

Following this procedure, we obtain a 48 byte “master secret” of

4C AF 20 30 8F 4C AA C5 66 4A 02 90 F2 AC 10 00 39 DB 1D E0 1F CB E0 E0 9D D7 E6 BE 62 A4 6C 18 06 AD 79 21 DB 82 1D 53 84 DB 35 A7 1F C1 01 19

Generating Lots of Keys

Now that both sides have the “master secret”, the spec shows us how to derive all the session keys we need by using the PRF to create a “key block” that we will pull data from:

key_block = PRF(SecurityParameters.master_secret, "key expansion", SecurityParameters.server_random + SecurityParameters.client_random);

The bytes from “key_block” are used to populate the following:

client_write_MAC_secret[SecurityParameters.hash_size]
server_write_MAC_secret[SecurityParameters.hash_size]
client_write_key[SecurityParameters.key_material_length]
server_write_key[SecurityParameters.key_material_length]
client_write_IV[SecurityParameters.IV_size]
server_write_IV[SecurityParameters.IV_size]

Since we’re using a stream cipher instead of a block cipher like the Advanced Encryption Standard (AES), we don’t need the Initialization Vectors (IVs). Therefore, we just need two Message Authentication Code (MAC) keys, one for each side, that are 16 bytes (128 bits) each since the specified MD5 hash digest size is 16 bytes. In addition, the RC4 cipher uses a 16 byte (128 bit) key, and each side needs one of those as well. All told, we need 2×16 + 2×16 = 64 bytes from the key block.

Running the PRF, we get these values:

client_write_MAC_secret = 80 B8 F6 09 51 74 EA DB 29 28 EF 6F 9A B8 81 B0
server_write_MAC_secret = 67 7C 96 7B 70 C5 BC 62 9D 1D 1F 4A A6 79 81 61
client_write_key = 32 13 2C DD 1B 39 36 40 84 4A DE E5 6C 52 46 72
server_write_key = 58 36 C4 0D 8C 7C 74 DA 6D B7 34 0A 91 B6 8F A7
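Carving the key block into those four values is just slicing in the order the spec lists them. In this sketch the 64 byte key block is reassembled from the four values above (an assumption that they came out of the PRF back to back), and the function name is made up.

```python
def split_key_block(key_block: bytes, mac_size: int = 16, key_size: int = 16) -> dict:
    """Slice the key block in spec order; RC4 needs no IVs, so we stop after the keys."""
    layout = [
        ("client_write_MAC_secret", mac_size),
        ("server_write_MAC_secret", mac_size),
        ("client_write_key", key_size),
        ("server_write_key", key_size),
    ]
    keys, offset = {}, 0
    for name, size in layout:
        keys[name] = key_block[offset:offset + size]
        offset += size
    return keys


key_block = bytes.fromhex(
    "80B8F6095174EADB2928EF6F9AB881B0"   # client_write_MAC_secret
    "677C967B70C5BC629D1D1F4AA6798161"   # server_write_MAC_secret
    "32132CDD1B393640844ADEE56C524672"   # client_write_key
    "5836C40D8C7C74DA6DB7340A91B68FA7"   # server_write_key
)
keys = split_key_block(key_block)
for name, value in keys.items():
    print(name, "=", value.hex(" ").upper())
```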

Prepare to be Encrypted!

The last handshake message the client sends out is the “Finished message.” This is a clever message that proves that no one tampered with the handshake and it proves that we know the key. The client takes all bytes from all handshake messages and puts them into a “handshake_messages” buffer. We then calculate 12 bytes of “verify_data” using the pseudo-random function (PRF) with our master key, the label “client finished”, and an MD5 and SHA-1 hash of “handshake_messages”:

verify_data = PRF(master_secret, "client finished", MD5(handshake_messages) + SHA-1(handshake_messages) ) [12]

We take the result and add a record header byte “0x14” to indicate “finished” and length bytes “00 00 0c” to indicate that we’re sending 12 bytes of verify data. Then, like all future encrypted messages, we need to make sure the decrypted contents haven’t been tampered with. Since our cipher suite in use is TLS_RSA_WITH_RC4_128_MD5, this means we use the MD5 hash function.

Some people get paranoid when they hear MD5 because it has some weaknesses. I certainly don’t advocate using it as-is. However, TLS is smart in that it doesn’t use MD5 directly, but rather the HMAC version of it. This means that instead of using MD5(m) directly, we calculate:

HMAC_MD5(Key, m) = MD5((Key ⊕ opad) ++ MD5((Key ⊕ ipad) ++ m))

(The ⊕ means XOR, ++ means concatenate, “opad” is the bytes “5c 5c … 5c”, and “ipad” is the bytes “36 36 … 36”).
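The formula is short enough to write out by hand and compare against Python’s built-in `hmac` module. This is only a sketch to show the construction; the function name is mine, and real code should just call `hmac.new`.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # MD5 processes input in 64 byte blocks


def hmac_md5(key: bytes, message: bytes) -> bytes:
    """MD5((key XOR opad) ++ MD5((key XOR ipad) ++ message))."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.md5(key).digest()        # overly long keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")       # then zero-padded to the block size
    inner_key = bytes(b ^ 0x36 for b in key)   # key XOR ipad
    outer_key = bytes(b ^ 0x5C for b in key)   # key XOR opad
    inner = hashlib.md5(inner_key + message).digest()
    return hashlib.md5(outer_key + inner).digest()


key = bytes.fromhex("80B8F6095174EADB2928EF6F9AB881B0")  # client_write_MAC_secret
print(hmac_md5(key, b"hello") == hmac.new(key, b"hello", "md5").digest())  # True
```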

In particular, we calculate:

HMAC_MD5(client_write_MAC_secret, seq_num + TLSCompressed.type + TLSCompressed.version + TLSCompressed.length + TLSCompressed.fragment)

As you can see, we include a sequence number (“seq_num”) along with attributes of the plaintext message (here it’s called “TLSCompressed”). The sequence number foils attackers who might try to take a previously encrypted message and insert it midstream. If this occurred, the sequence numbers would definitely be different than what we expected. This also protects us from an attacker dropping a message.
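A small sketch shows how the sequence number changes the MAC even when the record contents are identical. The field widths (8 byte sequence number, 1 byte type, 2 byte version, 2 byte length) follow my reading of the TLS 1.0 spec; the function name and the fragment bytes are assumptions.

```python
import hmac
import struct


def record_mac(mac_secret: bytes, seq_num: int, content_type: int,
               version: bytes, fragment: bytes) -> bytes:
    """HMAC-MD5 over seq_num + type + version + length + fragment."""
    header = struct.pack(">QB2sH", seq_num, content_type, version, len(fragment))
    return hmac.new(mac_secret, header + fragment, "md5").digest()


mac_secret = bytes.fromhex("80B8F6095174EADB2928EF6F9AB881B0")  # client_write_MAC_secret
fragment = b"same bytes every time"                             # placeholder plaintext
first = record_mac(mac_secret, 0, 0x17, b"\x03\x01", fragment)
replayed = record_mac(mac_secret, 1, 0x17, b"\x03\x01", fragment)
print(first != replayed)  # True: a replayed record fails the MAC check
```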

All that’s left is to encrypt these bytes.

RC4 Encryption

Our negotiated cipher suite was TLS_RSA_WITH_RC4_128_MD5. This tells us that we need to use Ron’s Code #4 (RC4) to encrypt the traffic. Ron Rivest developed the RC4 algorithm to generate random bytes based on a key of up to 256 bytes (ours is 16 bytes). The algorithm is so simple you can actually memorize it in a few minutes.

RC4 begins by creating a 256-byte “S” byte array and populating it with 0 to 255. You then iterate over the array by mixing in bytes from the key. You do this to create a state machine that is used to generate “random” bytes. To generate a random byte, we shuffle around the “S” array.

Put graphically, it looks like this:

To encrypt a byte, we xor this pseudo-random byte with the byte we want to encrypt. Remember that xor’ing a bit with 1 causes it to flip. Since we’re generating random numbers, on average the xor will flip half of the bits. This random bit flipping is effectively how we encrypt data. As you can see, it’s not very complicated and thus it runs quickly. I think that’s why Amazon chose it.
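The whole algorithm fits in a couple of dozen lines. This is a plain sketch of textbook RC4 (key schedule plus byte generator), not Firefox’s implementation; the key and plaintext in the usage lines are a commonly published test vector rather than anything from this session.

```python
def rc4_keystream(key: bytes):
    """Yield RC4 pseudo-random bytes for the given key."""
    # Key scheduling: start with 0..255 and mix in the key bytes.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Generation: keep shuffling the array and emit one byte per step.
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % 256]


def rc4(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, rc4_keystream(key)))


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```

In a real TLS session the keystream generator stays alive for the whole connection, so each record continues where the previous one left off rather than restarting from the key.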

Recall that we have a “client_write_key” and a “server_write_key.” This means we need to create two RC4 instances: one to encrypt what our browser sends and the other to decrypt what the server sent us.

The first few random bytes out of the “client_write” RC4 instance are “7E 20 7A 4D FE FB 78 A7 33 …” If we xor these bytes with the unencrypted header and verify message bytes of “14 00 00 0C 98 F0 AE CB C4 …”, we’ll get what appears in the encrypted portion that we can see in Wireshark:
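The XOR itself is easy to reproduce for the nine bytes quoted above. This only covers the bytes shown in the text; the rest of the keystream and message are elided there, so they are left out here too.

```python
keystream = bytes.fromhex("7E207A4DFEFB78A733")   # first RC4 output bytes quoted above
plaintext = bytes.fromhex("1400000C98F0AECBC4")   # header + start of verify_data

# XOR each plaintext byte with the matching keystream byte.
ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream))
print(ciphertext.hex(" ").upper())  # 6A 20 7A 41 66 0B D6 6C F7
```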

The server does almost the same thing. It sends out a “Change Cipher Spec” and then a “Finished Message” that includes all handshake messages, including the decrypted version of the client’s “Finished Message.” Consequently, this proves to the client that the server was able to successfully decrypt our message.

Welcome to the Application Layer!

Now, 220 milliseconds after we started, we’re finally ready for the application layer. We can now send normal HTTP traffic that’ll be encrypted by the TLS layer with the client RC4 write instance and decrypt traffic with the server RC4 write instance. In addition, the TLS layer will check each record for tampering by computing the HMAC_MD5 hash of the contents.

At this point, the handshake is over. Our TLS record’s content type is now 23 (0x17). Encrypted traffic begins with “17 03 01” which indicates the record type and TLS version. These bytes are followed by our encrypted size, which includes the HMAC hash.
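As a quick sketch, the five byte record header can be unpacked like this. The function name is mine, and the length value in the example (0x0020) is made up, since the actual size isn’t quoted in the text.

```python
import struct


def parse_record_header(header: bytes) -> dict:
    """Split a TLS record header into content type, version, and payload length."""
    content_type, major, minor, length = struct.unpack(">BBBH", header[:5])
    return {"type": content_type, "version": (major, minor), "length": length}


# 17 = application data, 03 01 = TLS 1.0, 00 20 = hypothetical 32 byte payload
print(parse_record_header(bytes.fromhex("1703010020")))
```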

Encrypting the plaintext of:

GET /gp/cart/view.html/ref=pd_luc_mri HTTP/1.1 
Host: www.amazon.com 
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009060911 Minefield/3.0.10 (.NET CLR 3.5.30729) 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 
Accept-Language: en-us,en;q=0.5 
Accept-Encoding: gzip,deflate 
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 
Keep-Alive: 300 
Connection: keep-alive 
...

will give us the bytes we see on the wire:

The only other interesting fact is that the sequence number increases on each record: it’s now 1 (and the next record will be 2, etc).

The server does the same type of thing on its side using the server_write_key. We see its response, including the tell-tale application data header:

Decrypting this gives us:

HTTP/1.1 200 OK 
Date: Wed, 10 Jun 2009 01:09:30 GMT 
Server: Server 
... 
Cneonction: close 
Transfer-Encoding: chunked

which is a normal HTTP reply that includes a non-descriptive “Server: Server” header and a misspelled “Cneonction: close” header coming from Amazon’s load balancers.

TLS is just below the application layer. The HTTP server software can act as if it’s sending unencrypted traffic. The only change is that it writes to a library that does all the encryption. OpenSSL is a popular open-source library for TLS.

The connection will stay open while both sides send and receive encrypted data until either side sends out a “closure alert” message and then closes the connection. If we reconnect shortly after disconnecting, we can re-use the negotiated keys (if the server still has them cached) without using public key operations, otherwise we do a completely new full handshake.

It’s important to realize that application data records can be anything. The only reason “HTTPS” is special is because the web is so popular. There are lots of other TCP/IP based protocols that ride on top of TLS. For example, TLS is used by FTPS and secure extensions to SMTP. It’s certainly better to use TLS than inventing your own solution. Additionally, you’ll benefit from a protocol that has withstood careful security analysis.

… And We’re Done!

The very readable TLS RFC covers many more details that were missed here. We covered just one single path in our observation of the 220 millisecond dance between Firefox and Amazon’s server. Quite a bit of the process was affected by the TLS_RSA_WITH_RC4_128_MD5 Cipher Suite selection that Amazon made with its ServerHello message. It’s a reasonable choice that slightly favors speed over security.

As we saw, if someone could secretly factor Amazon’s “n” modulus into its respective “p” and “q”, they could effectively decrypt all “secure” traffic until Amazon changes their certificate. Amazon counter-balances this concern with a short one year duration certificate:

One of the cipher suites that was offered was “TLS_DHE_RSA_WITH_AES_256_CBC_SHA” which uses the Diffie-Hellman key exchange that has a nice property of “forward secrecy.” This means that if someone cracked the mathematics of the key exchange, they’d be no better off to decrypt another session. One downside to this algorithm is that it requires more math with big numbers, and thus is a little more computationally taxing on a busy server. The “Advanced Encryption Standard” (AES) algorithm was present in many of the suites that we offered. It’s different than RC4 in that it works on 16 byte “blocks” at a time rather than a single byte. Since its key can be up to 256 bits, many consider this to be more secure than RC4.
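For comparison with the RSA exchange we walked through, here is a toy sketch of the Diffie-Hellman idea. The prime 23 and generator 5 are classroom-sized assumptions; real exchanges use primes that are thousands of bits long.

```python
import secrets

P = 23   # toy prime modulus (assumed for illustration)
G = 5    # toy generator (assumed for illustration)


def new_keypair():
    """Pick a fresh private exponent and compute the public value G^private mod P."""
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def shared_secret(my_private: int, their_public: int) -> int:
    """Both sides arrive at G^(a*b) mod P without ever sending a or b."""
    return pow(their_public, my_private, P)


client_private, client_public = new_keypair()
server_private, server_public = new_keypair()
print(shared_secret(client_private, server_public) ==
      shared_secret(server_private, client_public))  # True
```

Because both private values are generated fresh for each session and then thrown away, learning the server’s long-term RSA key later doesn’t reveal them, which is where the forward secrecy comes from.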

In just 220 milliseconds, two endpoints on the Internet came together, provided enough credentials to trust each other, set up encryption algorithms, and started to send encrypted traffic.

And to think, all of this just so Bob can buy milk.

UPDATE: I wrote a program that walks through the handshake steps mentioned in this article. I posted it to GitHub.

Using Obscure Windows COM APIs in .NET

Fri, 24 Apr 2009 08:37:00 +0000

Most native Windows APIs are simple to call from .NET. For example, if you need to do something special when showing a window, you can use the ShowWindow API using Platform Invocation Services (P/Invoke) like this:

[DllImport("user32.dll")]
static extern bool ShowWindow(IntPtr hWnd, int nCmdShow);

When you call this function, here’s roughly what happens:

The CLR calls LoadLibrary on the file (e.g. “user32.dll”)
The CLR then calls GetProcAddress on the function name (e.g. “ShowWindow”) to get the address of where the function is located.

For the most part, it just magically works. If we had used a function like “MessageBox”, the CLR would notice that it doesn’t exist and would then pick between the ANSI version (e.g. “MessageBoxA”) or the Unicode version (e.g. “MessageBoxW”).

With the address in hand, it’s easy to jump to it and you’re all set. Simple and easy.

I was expecting a simple API like this when I was investigating how to register my program as the default handler for “.wav” files on Vista. In the pre-Vista days, most programs would write directly into a registry key for the file extension (e.g. “HKEY_CLASSES_ROOT\.wav”) and move on. Problems come when your program wants to register itself as a handler for a “popular” extension like .MP3 or .HTM. Some programs go into an all out arms race with other programs in a fight of wills to make sure they keep the extension.

In Windows Vista and later, Microsoft wants us to use the new “Default Programs” feature. The idea is that you register what file extensions your program supports in the registry and then a nice UI allows people to easily pick which of those extensions they want to associate with your program. Digging around the documentation led me to discover that the bulk of the functionality was exposed via the IApplicationAssociationRegistration COM interface.

Ah, COM.

Over the years, I’ve tried to keep my distance from it. This irrational fear came from wizards that “next, next, finish”’d your way into thousands of lines of inscrutable code. It took me years of passing glances to finally understand its basics. Even then, when I needed to use it from .NET, I’d right click on my project references and click “Add Reference”:

I’d pick the library I needed and then somehow I could use the types as if they were .NET objects. I didn’t ask further questions and moved on.

Unfortunately, IApplicationAssociationRegistration was nowhere to be found on the “Add Reference” list since it doesn’t seem to have a registered type library associated with it. Using my basic COM knowledge, I knew that if I wanted to use it I would need to know the interface identifier (IID) as well as a class identifier (CLSID) that pointed to a concrete implementation.

Following the MSDN documentation, I knew I’d probably find success in shobjidl.idl:

Sure enough, shobjidl.idl was sitting in my “C:\Program Files\Microsoft SDKs\Windows\v6.1\Include” directory and had this interface definition:

[
 object,
 uuid(4e530b0a-e611-4c77-a3ac-9031d022281b),
 pointer_default(unique),
 helpstring("Protocol URL and Extension File Application")
]
interface IApplicationAssociationRegistration : IUnknown
{
 HRESULT QueryCurrentDefault(
     [in, string] LPCWSTR pszQuery,
     [in] ASSOCIATIONTYPE atQueryType,
     [in] ASSOCIATIONLEVEL alQueryLevel,
     [out, string] LPWSTR* ppszAssociation);

...
}

A little further down was the declaration for the concrete class (coclass) and its associated class id (CLSID):

// CLSID_ApplicationAssociationRegistration
[ uuid(591209c7-767b-42b2-9fba-44ee4615f2c7) ] coclass ApplicationAssociationRegistration
{
 interface IApplicationAssociationRegistration;
}

In the IDL, we also see the definitions for the enums that the functions use:

typedef [v1_enum] enum tagASSOCIATIONLEVEL
{
 AL_MACHINE,
 AL_EFFECTIVE,
 AL_USER,
} ASSOCIATIONLEVEL;

typedef [v1_enum] enum tagASSOCIATIONTYPE
{
 AT_FILEEXTENSION,
 AT_URLPROTOCOL,
 AT_STARTMENUCLIENT,
 AT_MIMETYPE,
} ASSOCIATIONTYPE;

Getting this to work in .NET was surprisingly easy. The basic idea is that the CLR has to have just enough information to find the types:

The “ComImportAttribute” is almost as simple to use as DllImportAttribute. In addition, you need to use the GuidAttribute to specify the gigantic GUIDs.
You use the “InterfaceTypeAttribute” to specify the basic interface(s) that the interface you’re importing uses. In COM, all interfaces derive from IUnknown. If the interface supports scripting then it implements IDispatch. If you provide a speedy C++ way of accessing your interface (e.g. vtable definition) and the scripting IDispatch interface, you’ve got a “dual” interface.
You need to translate the parameter types to their .NET equivalents. This is an incredibly mechanical process that’s straightforward. If there is a chance that the underlying bits are different between COM and .NET (e.g. they’re not blittable) then you need to use the MarshalAsAttribute to tell the CLR how to convert the types as necessary.
You need to remember that COM handles errors by returning HRESULTs instead of natively using exceptions like .NET uses. By default, the CLR will make the last parameter that is an OUT parameter in the IDL to be the return value (it helps if it’s marked by “retval”). Therefore, you can act as if the function really returns its last parameter and the CLR will automatically check the HRESULT and throw a corresponding .NET exception as needed.
Optionally, and perhaps most controversially, you’re free to de-Hungarianize the parameter names and PascalCase the enum names to make them much more friendly looking to people in .NET. It’s optional since it might confuse people who use the MSDN documentation and expect the original names.
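The HRESULT-to-exception translation mentioned above is easier to see in a small model. This is a language-neutral sketch in Python, not how the CLR is implemented: `ComError`, `call_with_retval`, and the fake COM method are all made-up names, while the HRESULT constants are the standard Windows values.

```python
S_OK = 0x00000000
E_NOINTERFACE = 0x80004002
E_FAIL = 0x80004005


class ComError(Exception):
    """Stand-in for the exception a wrapper raises on a failed HRESULT."""

    def __init__(self, hresult: int):
        super().__init__(f"HRESULT 0x{hresult:08X}")
        self.hresult = hresult


def failed(hresult: int) -> bool:
    """An HRESULT signals failure when its high (severity) bit is set."""
    return bool(hresult & 0x80000000)


def call_with_retval(raw_call, *args):
    """Turn an (HRESULT, out value) pair into a return value or an exception."""
    hresult, out_value = raw_call(*args)
    if failed(hresult):
        raise ComError(hresult)
    return out_value


def fake_query_current_default(query: str):
    """Hypothetical raw COM-style method: returns (HRESULT, out parameter)."""
    if query == ".mp3":
        return S_OK, "SomePlayer.AssocFile.MP3"
    return E_FAIL, None


print(call_with_retval(fake_query_current_default, ".mp3"))
```

The caller sees an ordinary return value on success and an exception on failure, which is the same shape the translated interface below ends up with.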

In a minute or so, I translated the definitions and gladly got rid of the Hungarian prefixes by converting parameter names of “pszQuery” to just “query.” I also converted all the enums and removed their unnecessary prefixes. The end result was this:

[ComImport]
[Guid("4e530b0a-e611-4c77-a3ac-9031d022281b")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
internal interface IApplicationAssociationRegistration
{   
 [return: MarshalAs(UnmanagedType.LPWStr)]
 string QueryCurrentDefault( [MarshalAs(UnmanagedType.LPWStr)] string query,
                           AssociationType queryType,
                           AssociationLevel queryLevel);
 [return: MarshalAs(UnmanagedType.Bool)]
 bool QueryAppIsDefault([MarshalAs(UnmanagedType.LPWStr)] string query,
                        AssociationType queryType,
                        AssociationLevel queryLevel,
                        [MarshalAs(UnmanagedType.LPWStr)] string appRegistryName);
 [return: MarshalAs(UnmanagedType.Bool)]
 bool QueryAppIsDefaultAll(AssociationLevel queryLevel,
                           [MarshalAs(UnmanagedType.LPWStr)] string appRegistryName);
 void SetAppAsDefault([MarshalAs(UnmanagedType.LPWStr)] string appRegistryName,
                      [MarshalAs(UnmanagedType.LPWStr)] string set,
                      AssociationType setType);
 void SetAppAsDefaultAll([MarshalAs(UnmanagedType.LPWStr)] string appRegistryName);
 void ClearUserAssociations();
}

Importing the concrete class that implements the interface was just a matter of specifying its CLSID:

[ComImport]
[Guid("591209c7-767b-42b2-9fba-44ee4615f2c7")]
internal class ApplicationAssociationRegistration
{
 // coclass is implemented by the runtime callable wrapper
}

With all of that goo out of the way, you can use the interface like a normal .NET type:

var aa = new ApplicationAssociationRegistration();
var iaar = (IApplicationAssociationRegistration)aa;
string myCurrentMp3Player = iaar.QueryCurrentDefault(".mp3", AssociationType.FileExtension, AssociationLevel.Effective);

Behind the scenes, the runtime callable wrapper has to do something like this:

Load in ole32.dll where COM functions reside.
Call CoInitialize to initialize COM.
Look up your CLSID and IID in the registry under HKEY_CLASSES_ROOT and find their associated DLL (in our case, “shell32.dll”)
Create a factory for your class.
Use the factory to create an instance.
Call QueryInterface to get the specific interface we want (e.g. IApplicationAssociationRegistration)
Get a pointer to the function we want using the vtable.

After all that, we finally have a place to jump to like we did with P/Invoke.

Why bother with all of this? One reason is that Microsoft has a huge legacy investment in C and C++ in Windows. There’s no compelling reason for them to rewrite things in .NET. A natural consequence is that the C++ code that implements their latest APIs will be exposed using COM for the foreseeable future. Recently, Microsoft has gone ahead and published .NET COM wrappers for some of the popular new APIs like the Libraries feature in Windows 7. With just a little work, you don’t have to wait on Microsoft to do this for you.

Given that .NET was designed as a successor to COM, it’s no surprise that Microsoft has made interoperability with it very seamless. The runtime callable wrapper does a good job of hiding most of the messier details. The garbage collector handles much of the bookkeeping involved with memory management that used to be the bane of COM programming. The runtime is very aware of typical COM semantics of when to allocate and free memory. It’s not always perfect. Sometimes you can be pre-emptive and force your COM object to be cleaned up via Marshal.ReleaseComObject so you don’t have to wait on the garbage collector, but you should be careful.

I just presented the basics of what I learned to get my job done. There’s a lot more out there for more advanced scenarios. I’ve found the free book “COM and .NET Interop” by Andrew Troelsen to be helpful.

There are plenty of obscure Windows APIs out there for the taking. Enjoy!

How .NET Regular Expressions Really Work

Mon, 16 Mar 2009 07:47:00 +0000

Remember when you first tried to parse text?

My early BASIC programs were littered with IF statements that dissected strings using LEFT$, RIGHT$, MID$, TRIM$, and UCASE$. It took me hours to write a program that parsed a simple text file. Just trying to support whitespace and mixed casing was enough to drive me crazy.

Years later when I started programming in Java, I discovered the StringTokenizer class. I thought it was a huge leap forward. I no longer had to worry about whitespace. However, I still had to use functions like “substring” and “toUpperCase”, but I thought that was as good as it could get.

And then one day I found regular expressions.

I almost cried when I realized that I could replace parsing code that took me hours to write with a simple regular expression. It still took me several years to become comfortable with the syntax, but the learning curve was worth the power obtained.

And yet with all of this love, I still had this nagging suspicion that I was doing it wrong. After reading Pragmatic Thinking and Learning, I was determined to try to imagine what life was like inside the code I wrote. But I just couldn’t connect with a regular expression.

The last straw came recently when I was trying to help a coworker craft a regex to properly handle name/value string pairs with escaped strings. In the end, our regex worked, but I felt that it was duct-taped together. I knew there was a better way.

I picked up a copy of Jeffrey Friedl’s book “Mastering Regular Expressions” and couldn’t put it down. In less than a week, I had flown through 400+ pages and had finally started to feel like I understood how regular expressions worked. I finally had a sense for what backtracking really meant and I had a better idea for how a regex could go catastrophically out of control.

I had extremely high hopes for chapter 9 which covered the .NET regular expression “flavor.” Since I work with .NET every day, I thought this would be the best chapter. I did learn a few things like how to properly use RegexOptions.ExplicitCapture, how to use the special per-match replacement sequences that Regex.Replace offers, how to save compiled regular expressions to a DLL, and how to match balanced parentheses – a feat that’s theoretically not possible with a regex. Despite learning all of this in the chapter, I still didn’t feel that I could “connect” with the very .NET regular expression engine that I know and love.

To be fair, the vast benefit of the book comes from the first six chapters that deal with how regular expressions work in general since regex implementations share many ideas. The book laid a solid foundation, but I wanted more.

I wanted to stop all my hand-waving at regular expressions and actually understand how they really work.

I knew I wanted to drill into the code. Although tools like Reflector are amazing, I knew I wanted to see the actual code. It’s fairly easy now to step into the framework source code in the debugger. Unlike understanding the details of locking, which had me dive into C++ and x86 assembly, it was refreshing to see that the .NET regular expression engine was written entirely in C#.

I decided to use a really simple regular expression and search string and then follow it from cradle to grave. If you’d like to follow along at home, I’ve linked to relevant lines in the .NET regular expression source code.

My very simple regex consisted of looking for a basic URL:

string textToSearch = "Welcome to http://www.moserware.com/!";
string regexPattern = @"http://([^\s/]+)/?";
Match m = Regex.Match(textToSearch, regexPattern); 
Console.WriteLine("Full uri = '{0}'", m.Value);
Console.WriteLine("Host ='{0}'", m.Groups[1].Value);

Our journey begins at Regex.Match, where we check an internal cache of the past 15 regex values to see if there is a match for:

"0:ENU:http://([^\\s/]+)/?"

This is a compact representation of:

RegexOptions : Culture : Regex pattern

The regex doesn’t find this in the cache, so it starts scanning the pattern. Note that out of respect for the authors, our regex pattern doesn’t have any comments or whitespace in it:

// It would be nice to get rid of the comment modes, since the 
// ScanBlank() calls are just kind of duct-taped in.

We start creating an internal tree representation of the regex by adding a multi-character (aka “Multi”) node to contain the “http://” part. Next, we see that the scanner made it to the first real capture:

http://([^\s/]+)/?

This capture contains a character class that says that we don’t want to match spaces or a forward slash. It is converted into an obscure six character string:

"\x1\x2\x1\x2F\x30\x64"

Later we’ll see why it had to all fit in one string, but for now we can use a helpful comment to decode each character:

Offset	Hex Value	Meaning
0	0x01	The set should be negated
1	0x02	There are two characters in the character part of the set
2	0x01	There is one Unicode category
3	0x2F	Inclusive lower-bound of the character set. It’s a ‘/’ in Unicode
4	0x30	Exclusive upper-bound of the character set. It’s a ‘0’ in Unicode
5	0x64	This is a magic number that means the “Space” category.

Before I realized that this string had meaning, I was utterly confused.
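
To make the layout in that table concrete, here is a minimal sketch of a decoder for the set string. It follows only the six offsets listed above. The class and method names are my own for illustration (they are not in the .NET source), it assumes the range characters come in lower/upper pairs, and it only knows the one category code (0x64) that the table mentions.

```csharp
using System.Collections.Generic;

// Hypothetical helper: decodes the set string using only the layout in the table above.
static class SetStringDecoder
{
    public static string Describe(string set)
    {
        bool negated = set[0] == '\u0001';   // offset 0: negation flag
        int rangeChars = set[1];             // offset 1: number of range characters
        int categoryChars = set[2];          // offset 2: number of category characters
        var parts = new List<string>();

        // Range characters are read as [inclusive lower, exclusive upper) pairs.
        for (int i = 0; i + 1 < rangeChars; i += 2)
        {
            char lower = set[3 + i];
            char upperExclusive = set[3 + i + 1];
            parts.Add($"range ['{lower}', '{upperExclusive}')");
        }

        // Category codes follow the range characters.
        for (int i = 0; i < categoryChars; i++)
        {
            int code = set[3 + rangeChars + i];
            parts.Add(code == 0x64 ? "category Space" : $"category code 0x{code:X2}");
        }

        return (negated ? "NOT " : "") + string.Join(" | ", parts);
    }
}
```

Calling SetStringDecoder.Describe("\u0001\u0002\u0001\u002F\u0030\u0064") returns "NOT range ['/', '0') | category Space", which is just [^\s/] spelled out the long way.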

As we continue scanning, we find a ‘+’ quantifier:

http://([^\s/]+)/?

This is noted as a Setloop node since it’s a “loop” of what came before (e.g. the character class set). It has arguments of 1 and Int32.MaxValue to denote 1 or more matches. We see that the next character isn’t a ‘?’, so we can assert this is not a lazy match which means it’s a greedy match.

The first group is recorded when we hit the ‘)’ character. At the end of the pattern, we note a One (character) node for the ‘/’ and we see it’s followed by a ‘?’ which is just another quantifier, this time with a minimum of 0 and a maximum of 1.

All those nodes come together to give us this “RegexTree:”

We still need to convert the tree to code that the regular expression “machine” can execute later. The bulk of the work is done by an aptly named RegexCodeFromRegexTree function that has a decent comment:

/*
 * The top level RegexCode generator. It does a depth-first walk 
 * through the tree and calls EmitFragment to emits code before 
 * and after each child of an interior node, and at each leaf. 
 * 
 * It runs two passes, first to count the size of the generated 
 * code, and second to generate the code. 
 * 
 * <CONSIDER>we need to time it against the alternative, which is 
 * to just generate the code and grow the array as we go.</CONSIDER>
 */

I love the anonymous “CONSIDER” comment and would have had a similar reaction. Instead of using an ArrayList or List<int> to store the op codes, which can automatically resize as needed, the code diligently goes through the entire RegexTree twice. The class is peppered with “if(_counting)” expressions that just increase a counter by the size they will use in the next pass.
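
Here is a rough sketch of the two approaches side by side. All names are hypothetical and this is not the real RegexWriter; it only mimics the shape of the “count first, then write” idea next to the “just grow a list” alternative that the CONSIDER comment hints at.

```csharp
using System.Collections.Generic;

// Hypothetical: mimics the two-pass "if (_counting)" pattern described above.
sealed class TwoPassEmitter
{
    private bool _counting;
    private int _count;
    private int[] _code = new int[0];
    private int _position;

    private void Emit(int value)
    {
        if (_counting)
        {
            _count++;                  // first pass: only measure the size
            return;
        }
        _code[_position++] = value;    // second pass: actually write the value
    }

    public int[] Generate(IReadOnlyList<int> values)
    {
        _counting = true;
        _count = 0;
        foreach (int value in values) Emit(value);

        _code = new int[_count];       // allocate exactly once, at the right size
        _position = 0;
        _counting = false;
        foreach (int value in values) Emit(value);
        return _code;
    }
}

// Hypothetical: the alternative that lets the collection resize itself.
static class GrowingEmitter
{
    public static int[] Generate(IReadOnlyList<int> values)
    {
        var code = new List<int>();
        foreach (int value in values) code.Add(value);
        return code.ToArray();
    }
}
```

Both produce the same array. The first pays with two walks over the input, while the second pays with occasional resizes and a final copy. Which one wins is exactly the timing question the comment raises.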

As predicted by the comment, the bulk of the work is done by the 250 line switch statement that makes up the EmitFragment function. This function breaks up RegexTree “fragments” and converts them to a simpler RegexCode. The first fragment is:

EmitFragment(nodetype=RegexNode.Capture | BeforeChild, 
             node=[RegexNode.Capture, Group=0, Length=-1], 
             childIndex=0)

This is shorthand for emitting the RegexCode that should come before the children of the top level “RegexNode.Capture” node that represents group 0 and that goes until the end of the string (e.g. has length -1). The last 0 means that it’s the 0th child of the parent node (this is sort of meaningless since it has no parent). The subsequent calls walk the rest of the tree:

EmitFragment(RegexNode.Concatenate | BeforeChild, [RegexNode.Concatenate], childIndex=0)
EmitFragment(RegexNode.Multi, [RegexNode.Multi, string="http://"], childIndex=0)
EmitFragment(RegexNode.Concatenate | AfterChild, [RegexNode.Concatenate], childIndex=0)
EmitFragment(RegexNode.Concatenate | BeforeChild, [RegexNode.Concatenate], childIndex=1)
EmitFragment(RegexNode.Capture | BeforeChild, [RegexNode.Capture, Group=1, -1], childIndex=0)
EmitFragment(RegexNode.SetLoop, [RegexNode.SetLoop, min=1, max=Int32.MaxValue], childIndex=0)
EmitFragment(RegexNode.Capture | AfterChild, [RegexNode.Capture, Group=1, Length=-1], childIndex=0)
EmitFragment(RegexNode.Concatenate | AfterChild, [RegexNode.Concatenate], childIndex=1)
EmitFragment(RegexNode.Concatenate | BeforeChild, [RegexNode.Concatenate], childIndex=2)
EmitFragment(RegexNode.Oneloop, [RegexNode.Oneloop, min=0, max=1, character='/'], childIndex=0)
EmitFragment(RegexNode.Concatenate | AfterChild, [RegexNode.Concatenate], childIndex=2)
EmitFragment(RegexNode.Capture | AfterChild, [RegexNode.Capture, Group=0, Length=-1], childIndex=0)

The reward for all this work is an integer array that describes the RegexCode “op codes” and their arguments. You can see that some instructions like “Setrep” take a string argument. These arguments point to offsets in a string table. This is why it was critical to pack everything about a set into the obscure string we saw earlier. It was the only way to pass that information to the instruction.

Decoding the code array, we see:

Index	Instruction	Op Code/Argument	String Table Reference	Description
0	Lazybranch	23		Lazily branch to the Stop instruction at offset 21.
1		21		Lazily branch to the Stop instruction at offset 21.
2	Setmark	31		Push our current state onto a stack in case we need to backtrack later.
3	Multi	12		Perform a multi-character match of string table item 0 which is 'http://'.
4		0	"http://"
5	Setmark	31		Push our current state onto a stack in case we need to backtrack later.
6	Setrep	2		Perform a set repetition match of length 1 on the set stored at string table position 1, which represents [^\s/].
7		1	"\x1\x2\x1\x2F\x30\x64"
8		1
9	Setloop	5		Match the set [^\s/] in a loop at most Int32.MaxValue times.
10		1	"\x1\x2\x1\x2F\x30\x64"
11		2147483647
12	Capturemark	32		Capture into group #1, the string between the mark set by the last Setmark and the current position.
13		1
14		-1
15	Oneloop	3		Match Unicode character 47 (a '/') in a loop for a maximum of 1 time.
16		47
17		1
18	Capturemark	32		Capture into group #0, the contents between the first Setmark instruction and the current position.
19		0
20		-1
21	Stop	40		Stop the regex.

We can now see that our regex has turned into a simple “program” that will be executed later.
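
If you want to poke at that “program” yourself, here is a small sketch that walks the integer array and prints one line per instruction. The op code numbers and operand counts come only from the table above; the class name is my own, and anything not in the table is treated as unknown rather than guessed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical listing helper: knows only the op codes that appear in the table above.
static class RegexCodeListing
{
    private static readonly Dictionary<int, (string Name, int Operands)> Ops =
        new Dictionary<int, (string Name, int Operands)>
        {
            { 23, ("Lazybranch", 1) },
            { 31, ("Setmark", 0) },
            { 12, ("Multi", 1) },
            { 2,  ("Setrep", 2) },
            { 5,  ("Setloop", 2) },
            { 32, ("Capturemark", 2) },
            { 3,  ("Oneloop", 2) },
            { 40, ("Stop", 0) },
        };

    public static List<string> Disassemble(int[] code)
    {
        var lines = new List<string>();
        int i = 0;
        while (i < code.Length)
        {
            if (!Ops.TryGetValue(code[i], out var op))
                throw new ArgumentException($"Unknown op code {code[i]} at index {i}");

            // Copy this instruction's operands out of the array.
            var operands = new int[op.Operands];
            Array.Copy(code, i + 1, operands, 0, op.Operands);

            string text = $"{i}: {op.Name}";
            if (operands.Length > 0) text += " " + string.Join(", ", operands);
            lines.Add(text);

            i += 1 + op.Operands;      // skip past the op code and its operands
        }
        return lines;
    }
}
```

Feeding it the 22 integers from the table (23, 21, 31, 12, 0, 31, 2, 1, 1, 5, 1, 2147483647, 32, 1, -1, 3, 47, 1, 32, 0, -1, 40) gives ten instructions, starting with "0: Lazybranch 21" and ending with "21: Stop".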

Prefix Optimizations

We could stop here, but we’d miss the fun “optimizations.” With our pattern and search string, the optimizations will actually slow things down, but the code generator is oblivious to that. The basic idea behind prefix optimizations is to quickly jump to where the match might start. It does this by using a RegexFCD class that I’m guessing stands for “Regex First Character Descriptor.”

With our regex, the FirstChars function notices our “http://” ‘Multi’ node and determines that any match must start with an ‘h’. If we had alternations, the first character of each alternation would be added to make a limited set of potential first characters. With this optimization alone, we can skip all characters in the text that aren’t in this approved “white list” of first characters without having to execute any of the above RegexCode.

But wait… there’s an even trickier optimization! The optimizer discovers that the first thing the regex must match is a simple string literal: a ‘Multi’ node. This means that we can use the RegexBoyerMoore class which applies the Boyer-Moore search algorithm.

The key insight is that we don’t have to check each character of the text. We only need to look at the last character to see if it’s even worth checking the rest.

For example, if our sample text is “Welcome to http://www.moserware.com/!” and we’re searching for “http://” which is 7 characters, we first look at the 7th character of the text which is ‘e’. Since ‘e’ is not the 7th character of what we’re looking for (which is a ‘/’), we know that there couldn’t possibly be a match and so we don’t need to bother checking all previous 6 characters because there isn’t even an ‘e’ in what we’re looking for. The tricky part is what to do if what we find is in the string that we’re trying to match, but it isn’t the last ‘/’ character.

The specifics are handled in a straightforward way with some minor optimizations to reduce memory needs given 65,000+ possible Unicode characters. For each character, the maximum possible skip is calculated.

For “http://”, we come up with this skip table:

Character	Characters to skip ahead
/	0
:	2
h	6
p	3
t	4
all others	7

This table tells us that if we find an ‘e’ then we can skip ahead 7 characters without even checking the previous 6 characters. If we find a ‘p’, then we can skip ahead at least 3 characters before performing a full check, and if we find a ‘/’ then we could be on the last character and need to check other characters (e.g. skip ahead 0).
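
The skip table above can be reproduced with a few lines. This is a simplified sketch with names of my own choosing, not the RegexBoyerMoore class itself: each character’s skip is its distance from its last occurrence to the end of the pattern, and anything not in the pattern skips the full pattern length.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the skip table, not the real RegexBoyerMoore implementation.
static class SkipTableSketch
{
    public static Dictionary<char, int> Build(string pattern)
    {
        var skips = new Dictionary<char, int>();
        int last = pattern.Length - 1;
        for (int i = 0; i <= last; i++)
        {
            // Later occurrences overwrite earlier ones, so the smallest skip wins.
            skips[pattern[i]] = last - i;
        }
        return skips;
    }

    public static int SkipFor(Dictionary<char, int> skips, string pattern, char c)
    {
        // Characters that never appear in the pattern can skip its whole length.
        return skips.TryGetValue(c, out int skip) ? skip : pattern.Length;
    }
}
```

For “http://” this yields ‘/’ = 0, ‘:’ = 2, ‘h’ = 6, ‘p’ = 3, ‘t’ = 4, and 7 for everything else, matching the table.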

There is one more optimization that looks for anchors, but none apply to our regex, so it’s ignored.

We’re done! We made it to the end of the RegexWriter phase. The “RegexCode” internal representation consists of these critical parts:

The regex code we created.
The string table derived from the regex that the code uses (e.g. our “Multi” and “Setrep” instructions have string table references).
The maximum size of our backtracking stack. (Ours is 7, this will make more sense later.)
A mapping of named captures to their group numbers. (We don’t have any in our regex, so this is empty.)
The total number of captures. (We have 2.)
The RegexBoyerMoore prefix that we calculated. (This applies to us since we have a string literal at the start.)
The possible first characters in our prefix. (In our case, we calculated this to be an ‘h’.)
Our anchors. (We don’t have any.)
An indicator whether this should be a RightToLeft match. (In our case, we use the default which is false.)

Every regex passes through this step. It applies to our measly regex with a code size of 21 as much as it does to a gnarly RFC2822 compliant regex that has 175. These nine items completely describe everything that we’ll do with our regex and they never change.

In need of an interpreter

Now that we have the RegexCode, the match method will run and create a RegexRunner which is the “driver” for the regex matching process. Since we didn’t specify the “Compiled” flag, we’ll use the RegexInterpreter runner.

Before the interpreter starts scanning, it notices that we have a valid Boyer-Moore prefix optimization and it uses it to quickly locate the start of the regex:

Index	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36
Character	W	e	l	c	o	m	e		t	o		h	t	t	p	:	/	/	w	w	w	.	m	o	s	e	r	w	a	r	e	.	c	o	m	/	!
Scan Order							1					9	8	2 & 7	6	5	4	3

It first looks at the 7th character and finds an ‘e’ instead of the ‘/’ that it wanted. The skip table tells it that ‘e’ isn’t in any possible match, so it jumps ahead 7 more characters where it finds a ‘t’. The skip table tells it to jump ahead 4 more characters where it finally finds the ‘/’ it wanted. It then verifies that this is the last character of our “http://” prefix. With a valid prefix found, we prepare for a match in case we’re lucky and the rest of the regex matches.
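
The walk described above can be sketched as a small search routine that also records which text offsets it looked at. It is a simplification under my own assumptions: the names are hypothetical, and on a failed verification it just moves ahead one character, which is cruder than what the real class does.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified search that records the scan order.
static class PrefixScanSketch
{
    public static int IndexOf(string text, string pattern, List<int> scanned)
    {
        if (pattern.Length == 0) return 0;

        // Same skip table as described earlier: distance from last occurrence to the end.
        var skips = new Dictionary<char, int>();
        int last = pattern.Length - 1;
        for (int i = 0; i <= last; i++) skips[pattern[i]] = last - i;

        int end = last;                         // text offset aligned with the pattern's last char
        while (end < text.Length)
        {
            scanned.Add(end);
            int skip = skips.TryGetValue(text[end], out int s) ? s : pattern.Length;
            if (skip > 0)
            {
                end += skip;                    // no match can end here, so jump ahead
                continue;
            }

            // The last character matched: verify the rest from right to left.
            int k = last - 1;
            int t = end - 1;
            while (k >= 0)
            {
                scanned.Add(t);
                if (text[t] != pattern[k]) break;
                k--;
                t--;
            }
            if (k < 0) return end - last;       // whole prefix verified
            end += 1;                           // simplification: crawl forward on failure
        }
        return -1;
    }
}
```

On our sample text with “http://”, it returns 11 and records the offsets 6, 13, 17, 16, 15, 14, 13, 12, 11, which is the same nine-step scan order shown in the table (offset 13 is visited twice).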

The bulk of the interpreter is in its “Go” method which is a 700 line switch statement that interprets the RegexCode we created earlier. The only interesting part is that the interpreter keeps two stacks to keep its state in case it needs to backtrack and abandon a path it took. The “run stack” records where in the search string an operation begins while the “run track” records the RegexCode instruction that could potentially backtrack. Any time there is a chance that the interpreter could go down a wrong path, it pushes its state onto these stacks so that it can potentially try something else later.

On our string, the following instructions execute:

Lazybranch - This is a branch that is “lazy.” It will only occur if we fail and have to backtrack to this instruction. In case there are problems, we push 11 (the string offset to the start of “http://”) onto the “run stack” and 0 (the RegexCode offset for this instruction) onto the “run track.” The branch is to code offset 21 which is the “Stop” instruction.
Setmark - We save our position in case we have to backtrack.
Multi - A multi-character match. The string to match is at offset 0 in the string table (which is “http://”).
Setmark - Another position save in case of a backtrack. Since the Multi code succeeded, we push our “run stack” offset of 18 (the start of “www.”) and our “run track” code position of 5.
Setrep - Loads the “\x1\x2\x1\x2F\x30\x64” set representation at offset 1 in the string table that we calculated earlier. It reads an operand from the execution stack telling it to verify that the set repeats exactly once. It calls CharInClassRecursive that does the following:
It sees that the first character, ‘w’, is not in the character range [’/’, ‘0’). This check corresponds to the ‘/’ in the “[^\s/]” part of the regex.
It next tries CharInCategory which notes that ‘w’ is part of the “LowercaseLetter” UnicodeCategory. The magic number 0x64 in our set tells us to do a Char.IsWhiteSpace check on it. This too fails.
Although both checks fail, the interpreter sees that it needs to flip the result since it is a negated (^) set. This makes the character class match succeed.
Setloop - A “loop” instruction is like a “rep” one except that it isn’t forced to match anything. In our case, we see that we loop for a maximum of Int32.MaxValue times on the same set we saw in “Setrep.” Here you can see that the code generation phase turned the “+” in “[^\s/]+” of the regex into a Setrep of 1 followed by a Setloop. This is equivalent to “[^\s/][^\s/]*”. The loop keeps chomping characters until it finds the ‘/’ which causes it to call BackwardNext() which sets the current position to just before the final ‘/’.
CaptureMark - Here we start capturing group 1 by popping the “run stack” which gives us 18. Our current offset is 35. We capture the string between these two positions, “www.moserware.com”, and keep it for later use in case the entire regex succeeds.
Oneloop - Here we do a loop at most one time that will check for the ‘/’ character. It succeeds.
CaptureMark - We capture into group 0 the value between the offset on the “run stack”, which is 11 (the start of “http://”), and the last character of the string at offset 36. The string between these offsets is “http://www.moserware.com/”.
Stop - We’re done executing RegexCode and can stop the interpreter.

Since we stopped with successful captures, the Match is declared a success. Sure enough, if we look at our console window, we see:

Full uri = 'http://www.moserware.com/' 
Host ='www.moserware.com'
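
The offsets from the walkthrough can be checked through the public Match API. This is only a small verification sketch with a class name of my own; it also runs the “[^\s/][^\s/]*” spelling to show that it captures the same host as “[^\s/]+”.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the sample regex from this post.
static class MatchOffsets
{
    private const string Text = "Welcome to http://www.moserware.com/!";

    public static Match WithPlus()
    {
        return Regex.Match(Text, @"http://([^\s/]+)/?");
    }

    // Same pattern with the "+" written the way Setrep followed by Setloop executes it.
    public static Match WithRepThenLoop()
    {
        return Regex.Match(Text, @"http://([^\s/][^\s/]*)/?");
    }
}
```

For both versions, the match starts at Index 11 with Length 25, and Groups[1] starts at Index 18 with Length 17. Those are the same 11, 18, 35, and 36 offsets that showed up on the “run stack.”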

Backtracking Down Unhappy Paths

I can hear the cursing shouts of ^#!@.*#!$ from the regex mob coming towards me. They’re miffed that I used a toy regular expression with a pathetically easy search text that didn’t do anything “interesting.”

The mob really shouldn’t be that worried. We already have all the essential tools we need to understand how things work.

One common issue that you have to deal with in a “real” regular expression is backtracking.

Let’s say you have a search text and pattern like this:

string text = "This text has 1 digit in it";
string pattern = @".*\d";
Regex.Match(text, pattern);

You’d recognize the parse tree:

The only thing new about it is that the ‘.’ pattern was translated into a “Notone” node that matches anything except one particular character (in our case, a line feed). We see that the set follows the obscure, but compact representation. The only thing new to report is that ‘\x09’ is the magic number to represent all Unicode digits (which the Turkey Test showed is more than just [0-9]).

It’s painful to watch the regex interpreter work so hard for this match. The “.*” puts it in a Notoneloop that goes right to the end of the string since it doesn’t find a line feed (‘\n’). It then looks for the Set that represents “\d” and it fails. It has no choice but to backtrack by executing the “RegexCode.Notoneloop | RegexCode.Back” composite instruction which backtracks one character by resetting the “run track” to be the Set instruction again, but this time it will start one character earlier.

Even in our insanely simple search string, the interpreter has to backtrack by executing “RegexCode.Notoneloop | RegexCode.Back” and retesting the Set a total of thirteen times.

An almost identical process occurs if we had used a lazy match regular expression like “.*?\d”. The difference is that it does a “Notonelazy” instruction and then gets caught up in a “RegexCode.Notonelazy | RegexCode.Back” backtrack and Set match attempt that happens fourteen times. Each iteration of the loop causes the “Notonelazy” instruction to add one more character instead of removing one like the “Notoneloop” instruction had to. This is typical:

In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by quantifiers, the engine always chooses to first make the attempt for greedy quantifiers, and to first skip the attempt for lazy (non-greedy) ones. Mastering Regular Expressions, p.159

If we had a little more empathy for the regex interpreter, we would have written “[^\d]*\d” and avoided all the backtracking, but it wouldn’t have shown this common error.
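
To see where the thirteen and fourteen come from, here is a toy model of the two strategies. It is not the interpreter; it is a hypothetical sketch that assumes the text has no line feed, treats “\d” as Char.IsDigit, and simply counts how many times the position has to move before the digit test succeeds.

```csharp
// Hypothetical toy model of ".*\d" and ".*?\d", not the real interpreter.
static class BacktrackCounter
{
    private static bool DigitAt(string text, int pos)
    {
        return pos < text.Length && char.IsDigit(text[pos]);
    }

    // Greedy: ".*" first grabs everything, then gives back one character at a time.
    public static int Greedy(string text)
    {
        int pos = text.Length;
        int backtracks = 0;
        while (!DigitAt(text, pos))
        {
            if (pos == 0) return -1;            // no digit anywhere
            pos--;
            backtracks++;
        }
        return backtracks;
    }

    // Lazy: ".*?" first grabs nothing, then takes one more character at a time.
    public static int Lazy(string text)
    {
        int pos = 0;
        int backtracks = 0;
        while (!DigitAt(text, pos))
        {
            if (pos >= text.Length) return -1;  // no digit anywhere
            pos++;
            backtracks++;
        }
        return backtracks;
    }
}
```

With "This text has 1 digit in it", Greedy returns 13 and Lazy returns 14, since the only digit sits at offset 14 of a 27 character string.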

Alternations such as “hello|world” are handled with backtracking. Before each alternative is attempted, the current position is saved on the “run track” and “run stack.” If the alternate fails, the regex engine resets the position to what it was before the alternate was tried and the next alternate is attempted.

Now, we can even understand how more advanced concepts like atomic grouping work. If we use a regex like:

\w+:

to match the names of email headers as in:

Subject: Hello World!

Things will work well. The problem will come when we try to match against

Subject

We already know that there is going to be backtracking since “\w+” will match the whole string and then backtracking will occur as the interpreter desperately tries to match a ‘:’. If we used atomic grouping, as in:

(?>\w+):

We would see that the generated RegexCode has two extra instructions of Setjump and Forejump in it. These instructions tell the interpreter to do unconditional jumps after matching the “\w+”. As the comment for “Forejump” indicates, these unconditional jumps will “zap backtracking state” and be much more efficient for a failed match since backtracking won’t occur.
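
From the outside, the atomic version behaves the same on these two inputs; the difference is only in how quickly the failure is reached. A minimal sketch, with a helper name of my own:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper around the atomic-group pattern from above.
static class HeaderNames
{
    private static readonly Regex Atomic = new Regex(@"(?>\w+):");

    // Returns the matched header name with its colon, or an empty string when there is no match.
    public static string Find(string line)
    {
        return Atomic.Match(line).Value;
    }
}
```

HeaderNames.Find("Subject: Hello World!") returns "Subject:", while HeaderNames.Find("Subject") returns an empty string because a failed Match has an empty Value.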

Loose Ends

There are some minor details left. The first time you use any regex, a lot of work goes on initializing all the character classes that are stored as static variables. If you just timed a single Regex, your numbers would be highly skewed by this process.

Another common issue is whether you should use the RegexOptions.Compiled flag. Compiling is handled by the RegexCompiler class. The interesting aspects of the IL code generation are handled exactly like the interpreter, as indicated by this comment:

/* 
 * The main translation function. It translates the logic for a single opcode at 
 * the current position. The structure of this function exactly mirrors 
 * the structure of the inner loop of RegexInterpreter.Go(). 
 * 
 * The C# code from RegexInterpreter.Go() that corresponds to each case is 
 * included as a comment. 
 * 
 * Note that since we're generating code, we can collapse many cases that are 
 * dealt with one-at-a-time in RegexIntepreter. We can also unroll loops that 
 * iterate over constant strings or sets. 
 */

We can see that there is some optimization in the generated code. The downside is that we have to generate all the code regardless of whether we use all of it. The interpreter only uses what it needs. Additionally, unless we use Regex.CompileToAssembly to save the compiled code to a DLL, we’ll end up doing the entire process of creating the parse tree, RegexCode, and code generation at runtime.

Thus, for most cases, it seems that RegexOptions.Compiled isn’t worth the effort. But it’s good to keep in mind that there are exceptions when performance is critical and your regex can benefit from it (otherwise, why have the option at all?).

Another option is RegexOptions.IgnoreCase that makes everything case insensitive. The vast majority of the process stays the same. The only difference is that all instructions that compare characters will convert each System.Char to lower case, mostly using the Char.ToLower method. This sounds reasonable, but it’s not quite perfect. For example, in Koine Greek, the word for “moth” goes from uppercase to lowercase like this:

That is, in Greek, when a “sigma” (Σ) appears in lowercase at the end of a word, it uses a different letter (ς) than if it appeared anywhere else (σ). RegexOptions.IgnoreCase can’t handle cases that need more context than a single System.Char even though the string comparison functions can handle this. Consider this example:

string mothLower = "σής";
string mothUpper = mothLower.ToUpper(); // "ΣΉΣ"
bool stringsAreEqualIgnoreCase = mothUpper.Equals(mothLower, StringComparison.CurrentCultureIgnoreCase);  // true 
bool stringsAreEqualRegex = Regex.IsMatch(mothLower, mothUpper, RegexOptions.IgnoreCase); // false

This also means that .NET’s Regex won’t do well with characters outside the Basic Multilingual Plane that need to be represented by more than one System.Char as a “surrogate pair.”
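
A quick way to see the surrogate pair limitation is to match “.” against a single character from outside the Basic Multilingual Plane. This sketch uses the musical G clef (U+1D11E) as an arbitrary example; the class name is my own.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo: one character to a human, two System.Chars to the regex engine.
static class SurrogateDemo
{
    public const string GClef = "\U0001D11E";

    // How many System.Chars a single "." consumes when matched against the G clef.
    public static int CharsMatchedByDot()
    {
        return Regex.Match(GClef, ".").Length;
    }
}
```

SurrogateDemo.GClef.Length is 2, but CharsMatchedByDot() returns 1: the “.” matches only the first half of the pair.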

I bring all of these “cases” up because it obviously troubled one of the Regex programmers who wrote this comment twice:

// We do the ToLower character by character for consistency.  With surrogate chars, doing 
// a ToLower on the entire string could actually change the surrogate pair.  This is more correct 
// linguistically, but since Regex doesn't support surrogates, it's more important to be 
// consistent.

You can tell the author was fully anticipating the bug reports that eventually came as a result of this decision. Unfortunately, due to the way the code is structured, changing this behavior would take a hefty overhaul of the engine and would require a massive amount of regression testing. I’m guessing this is the reason why it won’t be coming in a service pack anytime soon.

The last interesting option that affects most of the code is RegexOptions.RightToLeft. For the most part, this affects where the searching starts and how a “bump” is applied. When the engine wants to move forward or get the characters to the “right”, it checks this option to see if it should move +1 or -1 character from the current position. It’s a simple idea, but it’s implemented with many “if(!runrtl)” statements spread throughout the code.
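
The visible effect of RightToLeft is easy to show with a pattern that can match in several places. A small sketch, with a sample string and class name of my own:

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of where the search starts with and without RightToLeft.
static class DirectionDemo
{
    private const string Text = "a1b2c3";

    public static Match LeftToRight()
    {
        return Regex.Match(Text, @"\d");
    }

    public static Match RightToLeft()
    {
        return Regex.Match(Text, @"\d", RegexOptions.RightToLeft);
    }
}
```

The default search finds "1" at index 1, while the RightToLeft search starts from the end and finds "3" at index 5.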

Finally, you might be interested in how Mono’s regular expression implementation compares with Microsoft’s. The good news is that its code is available online as well. In general, Mono’s implementation is very similar. Here are some of the (minor) differences:

Mono’s parse tree has a similar shape, but it uses more strongly typed classes. For example, sets such as [^\s/] are given their own class rather than encoded as a single string.
The Boyer-Moore prefix optimization is done in the QuickSearch class. It is calculated at run-time and is only used if the search string is longer than 5 characters.
The regex machine doesn’t have a separate string table for referencing strings like “http://”. Each character is passed in as an argument to the instruction.

Conclusion

Weighing in around 14,000 lines of code, .NET’s regular expression engine takes a while to digest. After getting over the shock of its size, it was relatively straightforward to understand. Seeing the real source code, with its occasional funny comments, provided insight that Reflector simply couldn’t offer. In the end, we see that a .NET regular expression pattern is simply a compact representation for its internal RegexCode machine language.

This whole process has allowed me to finally connect with regular expressions and give them a splash of empathy. Seeing the horror of backtracking first hand in the debugger was enough for me to want to do everything in my power to get rid of it. Following the translation process down to the RegexCode level clued me into how my regex pattern will actually execute. Feeling the wind fly by a regex using the Boyer-Moore prefix optimization has encouraged me to do whatever I can to put string literals at the front of a pattern.

It’s all these little things that add up to a blazingly fast regular expression.

Rebooting Computing: Why?

Tue, 03 Feb 2009 12:15:00 +0000

Have you seen “The Most Famous Chart in Computer Science Education?” The exact numbers and data sources vary, but the curve always looks similar:

I used data from college bound seniors who indicated on their SAT that they intended to major in “Computer and Information Sciences and Support Services.” The curve tends to peak between 1999 and 2001 and then you see a huge decline that has just begun to bottom out to numbers less than half their peak value.

Some people like to explain away this drop on another curve:

Although there was a correlation of computer science enrollment and the stock market, you’ll see that the curves diverge around 2003. A popular belief is that the bursting dot-com bubble scared some potential students, but then by 2003 parents thought most software jobs were being offshored and encouraged their kids to pick a different field.

Some jobs did go to Bangalore, but the total number of jobs actually grew in the US, even beyond their 1999 levels. There are still excellent job prospects for the long term. Even in hard times like the 1970’s recession, companies like Apple and Microsoft were founded. In 10 years, we’ll know of several great companies that got their start in the current financial crisis. It’s unfortunate how students and their parents have been misled about the reality.

We’re faced with a pipeline problem. The fresh-outs you’ll be looking to hire in 10 years are in middle school right now. Are you doing anything to woo them to a career in computing? What do you say to a bright young girl to get her to at least consider looking at computer science?

I could start with my story. As a kid, I was captivated by how I could think of an idea for a program and then have a computer execute it exactly. It was as if I could put part of my mind inside the computer. My parents and friends thought this was magic. It was magic, but it was a magic I could understand. There’s always cool new things being computed. It was magical to see how Deep Blue beat Kasparov long before I began to understand the beauty of how it worked. Every day I listen to MP3s that were created by a compression algorithm that gets rid of sound that my ear can’t hear. Companies like Walmart sift terabytes of data to predict market demand so they can send extra strawberry Pop-Tarts when hurricanes are approaching. On a personal level, new algorithms predict with a high accuracy what types of movies we’ll like. But that’s just one tiny sliver of what’s out there. Each person can have their own unique experience. There’s a lot of great computing going on and the demand is only increasing.

Sometimes you’ll hear honest hesitations about a career in computing because of fears that they’ll “sit in front of a computer all day.” The sad irony is that this thinking causes people to go into fields like accounting, graphic design, marketing, or hundreds of other fields where they spend just as much time in front of a computer. The difference is that they’ll spend most of their time using applications like Outlook, Word, or Excel and often have less fun than the software developers that are creating these programs. Moreover, working in the computing fields will give someone a chance to create future interfaces that don’t have everyone in front of giant screens all day.

This isn’t to say that there aren’t boring software development jobs, but there are also plenty of great jobs. It’s a great career. I sometimes feel a little guilty that I can work in a field that I enjoy. As a field, we create software that powers business communications and connects you with friends. Software will play a pivotal role in gene sequencing and analysis that will usher in customized medical treatments and drugs. Software will drive many great innovations of the future.

It’s sad that kids aren’t even given a chance to see the breadth and excitement of our field. I don’t blame them. On the whole, we’re doing a terrible job broadcasting our image.

Professors often teach computer science as if it is some sterile thing with nothing new in it. This is just not true; we’re in our adolescence. Sure, we bumble around and do silly things at times, but it also means we’re growing. We’re in an incredible time. Mehran Sahami captured this well at the conclusion of his intro class at Stanford:

“Think about the time that you’re living in. Don Knuth, who is considered the father of Computer Science is still alive and he’s in this department. It’s sort of like you’re geometers and you’re living in the time of Euclid… It’s all happening now. Don’t think of this stuff as dead people who did this stuff and it just happened and now you’re forced to do it. You’re living in it.”

Mehran is one of the few teachers that do a great job of sharing the new and exciting possibilities of computer science. Too often you see teachers that claim that computer science is some small box that is exactly what they’re teaching.

When I’ve had the honor of having conversations with computing pioneers like Alan Kay, I often hear how their teachers in the 60’s would admit that they didn’t understand the full possibilities of computing, and they wanted their students to do better. The early ARPA community with J.C.R. Licklider at the helm is a great example of this style.

Licklider encouraged and funded wild and imaginative ideas that caused a huge boom in computer science back in the 60’s and early 70’s. Unfortunately, this slowed down in the 1980s as funders became more conservative, took fewer risks, and as a result got incremental improvements rather than something big. I think this is why Alan Kay has difficulty finding significant new inventions in computing since 1980. Alan’s statement sounds crazy until you see just how much of what we use was started before 1980. Some will point to the web and the browser as a huge new invention, but even Marc Andreessen, speaking on how he was able to create the first graphical web browser, revealed:

“I was able to do it so quickly because it was the icing on the cake that had been baking for 30 years.”

Indeed it had. In the mid-60’s, some of the baking started when Licklider funded Doug Engelbart’s amazing oN-Line System that had hypertext links and was operated by a mouse. Len Kleinrock’s Ph.D. proposal on packet theory in 1961 gave him a great start that ultimately led to his team sending the first message over ARPAnet in 1969. Vint Cerf and Bob Kahn had already published a paper on TCP/IP, the bedrock of the Internet protocols, by 1974. All of these technologies were well refined and in production by the time Tim Berners-Lee created HTTP in the early 90’s, to which Andreessen would add a graphical front end.

Are we willing to fund long-term “wild” and “crazy” ideas today to create Internet-sized future results? We’ve been too focused on short-term results. It’s not just academia; most companies focus on short-term results, dismiss the fact that computing is a young field, and miss what really matters:

“Many HR departments haven’t figured this out yet, but in reality, it’s less important to know Java, Ruby, .NET, or the iPhone SDK. There’s always going to be a new technology or a new version of an existing technology to be learned. The technology itself isn’t as important; it’s the constant learning that counts.” - Pragmatic Thinking and Learning, p.145

Bummer, Now What?

Last March, I was given the opportunity to be on a design team to start tackling some of these problems. We knew that we couldn’t change the whole field and that some people would want to keep it the way it is. But we also knew that we had to do something; we didn’t want to settle for the status quo.

Our primary task was to plan a “summit” of the best people from academia, government, and industry and get them all in one room so that we could get a good sample of the entire field. We didn’t want to let people have the chance to point fingers outside and say it was somebody else’s problem.

After working on the basic concept for the summit, we needed to give it a name. I had enjoyed many side discussions of the great days of Licklider, PARC, and the early culture that accomplished great things. Sort of as a joke, I thought that we needed to “reboot” computer science to get rid of the cruft that had accumulated over time and get back to the excitement when the field was brand new. After some discussion, we decided to change “computer science” to a broader field of “computing” and use the “magic and beauty” of computer science to be the driver of the “rebooting.”

Only later would I realize the rebooting metaphor could be stretched a bit further. We can “reboot” the field without throwing away the good parts just as an operating system can reboot and depend on its valuable non-volatile memory being preserved. Rebooting doesn’t mean that we’ll go down the same crufty path (e.g. perhaps we have better “drivers” now). Most importantly, the domain name was available so we ran with it.

After nine months of planning and inviting over 220 people, we had our summit in January at the Computer History Museum. It was the first time that such a broad representation of the computing field came to work together in the same room.

The summit was guided by the Appreciative Inquiry process. It’s a technique that has you work in small teams to discover a positive core of what’s giving the field life and then use that to start dreaming of a better future. As the three days unfolded, we made it to the “design” phase where we kicked off several projects that fell into three rough categories:

Education

K-8 Fundamentals (Creating engaging introductions of computing fundamentals at the elementary level)
Project/Problem Based Learning for Grades 7-14
CS in K-12: Essential Subject (Determining a path to get computing essentials introduced in the K-12 curriculum)
Recruiting CS Teachers (Significantly increase the number of computer science teachers)
National Curriculum for Multi-Disciplinary Collaboration
International Educational Repository (that would include classroom activities and ideas for the K-16 level)

Outreach

LabRats (Build learning communities that include after school activities focused on areas like computer science)
Recruiting Women & Minorities into CS
Image of Computing (Sort of like a marketing campaign to show how computing is changing the world)
Tools for Fun and Beauty (Providing software tools for people sharing the fun and beauty of computing)
Relevant Computer Science Intersecting with Socially Relevant Projects (e.g. providing infrastructure in third world countries or disaster response scenarios)
Defining Future Computing Requirements for IT Service Verticals (e.g. Health Care, Financial, Government)

Internal Growth

Parallel Worlds Initiative (Leverage the boom in multi-core to drive new areas of thinking in computing)
Open Artifacts (e.g. create hardware and software systems that can easily be inspected, understood, assembled, disassembled and reused in new ways to allow for exploration)
Rediscovering Computing “Gems” (Revisit ideas from the past that might have been previously abandoned because they were infeasible but now might be possible)
Computing Field Guide (Create a resource to show the breadth of computing to both novices and experts)

I joined the Computing Field Guide team. I think it’ll be fun to see if we can come up with a way to leverage many of the great existing resources out there and learn some of the breadth of the field as a result.

In the end, the three-day summit was just the beginning of a long journey. It was a bit chaotic and some were disappointed we didn’t fix everything right then or do more, but I think the summit gave a good context for the issues our field faces. My best memories include all of the great people that I met and being able to engage in some great conversations.

Steve Jobs said “you can’t connect the dots looking forward; you can only connect them looking backwards.” I think that holds true for this summit. There are still a lot of dots between here and there, wherever “there” might be. With a lot of hard work, I’m confident that the dots will connect somehow. There’s an interesting road ahead and I want to be a part of it. We need to start baking cakes for future generations to ice.

Your Turn

What do you think of the current state of computer science? What is your dream for the future? What are some things that we can start now to improve the current situation? Are you involved with any existing effort? Would you like to join ours?

I’d love to hear your thoughts.

P.S. There are several blog posts by fellow “Rebooters” ([1] [2] [3] [4] [5] [6] [7] [8] [9]). There are some pictures on flickr and a Rebooting Computing Community site that’s starting to have follow-on discussion and might eventually have videos of highlights from the summit. In addition, you might watch this great talk by Dr. Peter Denning who has been leading this effort for over five years.