Articles by Dustin Boswell

How X Over SSH really works

Dustin Boswell — Sat, 27 Sep 2008 00:00:00 GMT

Imagine you are sitting in front of a machine named "home" that has a keyboard, mouse, display, and is running an X server. Now you open a terminal and ssh to a machine called "remote" (which doesn't need to have an X server running), and run a program like "firefox" which pops up a window on your screen. How does this all work?

First, let's clear up the client/server terminology confusion. When talking about X:

X client: a process (like firefox or xemacs) which uses the X client API to display things and receive mouse/keyboard events.
X server: a process (usually just "X") which X clients connect to. At times (as we'll see below) other processes can act as X servers.

Whenever an X client starts up, it reads the local $DISPLAY environment variable, whose value looks like: [hostname]:display_number[.screen_number]. The X client immediately opens a connection to that X server. If it can't, it fails:

user@home: echo $DISPLAY
:0 # hostname is "localhost" by default
user@home: xcalc # pops up a calculator on my screen
user@home: DISPLAY="nosuchhost:99"
user@home: xcalc
Error: Can't open display: nosuchhost:99
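To make the $DISPLAY format concrete, here is a minimal Python sketch that splits a value into its three parts. The name parse_display is made up for illustration, and the regular expression assumes a plain hostname with no colons in it (so no IPv6 literals). Real Xlib parsing handles more cases than this.

```python
import re

def parse_display(value):
    """Split '[hostname]:display_number[.screen_number]' into its parts.

    Hypothetical helper for illustration only. An omitted hostname comes
    back as '' and an omitted screen number is treated as 0.
    """
    match = re.fullmatch(r"([^:]*):(\d+)(?:\.(\d+))?", value)
    if match is None:
        raise ValueError("not a valid display string: %r" % value)
    host, display, screen = match.groups()
    return host, int(display), (int(screen) if screen is not None else 0)
```

For example, parse_display(":11.0") gives an empty hostname, display 11, and screen 0.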

Now let's see what happens when you ssh to another machine:

user@home: ps aux | grep X # yep, X server running @home
root ... /usr/bin/X :0 ...
user@home: ssh -X user@remote # -X enables X-forwarding
user@remote: ps aux | grep X # nope, no X server here
user@remote: echo $DISPLAY
:11.0
user@remote: xcalc # pops up screen @home

Wait a minute, the $DISPLAY variable is pointing to the localhost ("remote"). Natural questions to ask at this point:

There is no X server @remote, so why didn't xcalc just fail on startup?
Why was the display number "11"?
How did xcalc show something @home?

The answer has to do with the ssh-daemon running @remote:

user@remote: ps aux | grep user
root ... sshd:user@pts/11

What's happening is that there's an "X emulator" running @remote that was set up just for your ssh session and is listening on display 11.

To review, here's a play-by-play:

You type "ssh -X user@remote" in your terminal
The ssh process connects to the sshd server @remote.
sshd spawns a new process that is an X-server-emulator listening on some display number, e.g. "11"
sshd sets the $DISPLAY to point to that local "X-server" (e.g. ":11")
xcalc reads this $DISPLAY and connects to this X-server. xcalc thinks it's displaying to the local machine.
the X-server-emulator simply forwards the X commands from xcalc through the ssh connection, to the original ssh process.
The ssh process @home now acts as a normal X-client and sends those commands to the X-server @home.
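The forwarding steps above boil down to "copy bytes from one connection to another." Here is a rough Python sketch of that idea, using in-process socket pairs as stand-ins for the real connections. The names relay and demo are invented for this example, and the real sshd also encrypts and multiplexes the traffic, which this sketch ignores. The 6000 + N port rule is the usual X convention for TCP displays.

```python
import socket
import threading

X11_BASE_PORT = 6000  # X display N conventionally uses TCP port 6000 + N

def x11_tcp_port(display_number):
    return X11_BASE_PORT + display_number

def relay(src, dst, bufsize=4096):
    """Copy bytes from src to dst until src reaches end-of-stream."""
    while True:
        chunk = src.recv(bufsize)
        if not chunk:
            break
        dst.sendall(chunk)
    dst.shutdown(socket.SHUT_WR)  # pass the end-of-stream along

def demo():
    # client_side <-> proxy_in stands in for xcalc <-> the emulator @remote
    # proxy_out <-> server_side stands in for the tunnel <-> X server @home
    client_side, proxy_in = socket.socketpair()
    proxy_out, server_side = socket.socketpair()
    worker = threading.Thread(target=relay, args=(proxy_in, proxy_out))
    worker.start()
    client_side.sendall(b"draw calculator window")
    client_side.shutdown(socket.SHUT_WR)
    received = b""
    while True:
        chunk = server_side.recv(4096)
        if not chunk:
            break
        received += chunk
    worker.join()
    for sock in (client_side, proxy_in, proxy_out, server_side):
        sock.close()
    return received
```

Calling demo() returns the same bytes the "client" sent, which is all the X-server-emulator needs to do for xcalc to believe it is talking to a real X server.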

Some of you might be wondering: wasn't the X protocol designed to go over the network? Can't you do all this without ssh? You might be tempted to try something like:

user@remote: DISPLAY="home:0"
user@remote: xcalc

This doesn't work because the X server @home won't let other hosts connect to it. To change this (not that you should -- see below) you can do:

user@home: xhost +remote

xhost is a command which says "that host can connect to our X-server". However, everybody uses ssh X forwarding instead. Here are some security reasons why:

Normally, X-traffic (like your keystrokes) is sent unencrypted from X-client to X-server.
Ssh nicely sends that data through an encrypted channel, so it doesn't go over the internet in the clear.
"xhost +remote" is putting a lot of trust in 'remote' being a nice guy. If remote ever gets hacked, it could connect to the X-server @home and listen to all its keystrokes.

There's also an issue with firewalls: by doing "DISPLAY=home:0", you're assuming that a connection can be established from remote -> home. But this isn't always possible -- home might be sitting behind a firewall (like your Netgear router). Since the ssh connection was set up from home -> remote, X forwarding takes advantage of this already-established connection.

Other notes for the curious:
- if $DISPLAY is set to "localhost:0" (or any explicitly named host) it uses TCP/IP to send the X-traffic locally.
- if $DISPLAY is just ":0" it uses a special (more efficient, non-TCP/IP) connection: a local Unix domain socket.
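As a sketch of those two cases, here is how a client might pick its transport from the $DISPLAY value. The function name display_endpoint is hypothetical, and the /tmp/.X11-unix/XN socket path is the common Linux location, assumed here rather than taken from the article.

```python
def display_endpoint(value):
    """Return ('unix', path) or ('tcp', (host, port)) for a display string.

    Illustration only: assumes the usual /tmp/.X11-unix socket directory
    and the 6000 + display_number TCP port convention.
    """
    host, _, rest = value.rpartition(":")
    display_number = int(rest.split(".")[0])
    if host == "":
        return ("unix", "/tmp/.X11-unix/X%d" % display_number)
    return ("tcp", (host, 6000 + display_number))
```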

How to hash passwords securely.

Dustin Boswell — Tue, 09 Feb 2010 00:00:00 GMT

Let's say you're making a website where users can login with a password. How do you handle passwords in a way that is secure and doesn't risk exposing the user's password to the world? Here's a simple recipe I use:

Always one-way-hash (with a salt) the user's password on the client (using Javascript most likely). So when your user types "my_password" into the password field and hits "Log In", the browser will send something like "0x22cd3f2e3f2e56f7ecf5..." instead. There's never any need to "decrypt" this. The rest of the system behaves exactly the same as if "0x22cd3f2e3f2e56f7ecf5..." was their actual password.
Store a random string (a salt) for each user in your database.

Instead of storing the user's password, store hash(salt+password). Here's an example of what this looks like:


username  salt  hash(salt+password)
BillyBob  0xd029d0f092c09a09b  0xa0947cf7abd520
BettySue  0x9017d09082ceaa0fc  0x0014bcfd8be781
...  ...  ...

To verify if a given password is correct, lookup the salt for that user, and compute hash(salt+password) to see if that matches.
If a user forgets their password, send them an email with a link to a one-time url where they can enter a new password. Compute hash(salt+new_password) and store that in the database.
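Here is a minimal Python sketch of the server-side part of this recipe. SHA-256 is only a stand-in for the generic hash() used in this article, and the function names and the dict-shaped "record" are made up for the example, so treat it as an outline rather than a finished implementation.

```python
import hashlib
import hmac
import secrets

def hash_password(salt, password):
    # hash(salt + password), with SHA-256 standing in for hash()
    return hashlib.sha256((salt + password).encode("utf-8")).hexdigest()

def create_user_record(username, password):
    # one fresh random salt per user, stored next to the hash
    salt = secrets.token_hex(16)
    return {"username": username, "salt": salt,
            "hash": hash_password(salt, password)}

def verify_password(record, candidate):
    # recompute with the stored salt and compare in constant time
    actual = hash_password(record["salt"], candidate)
    return hmac.compare_digest(record["hash"], actual)
```

A password reset is just create_user_record's logic again: compute hash(salt + new_password) and overwrite the stored value.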

Why storing plain text passwords is bad

Let's say that you just store all the passwords in plain text. The problem is that some nefarious hacker might gain access to your database:


username  email  password (plain text!)
BillyBob  billybob23@gmail.com  Password123
BettySue  bettysue@yahoo.com  iLoveChocolate
...  ...  ...

The bigger problem is that most people re-use passwords across multiple sites. So a smart hacker could try to login to billybob23@gmail.com's email account using the password "Password123". Think about all the information someone can get out of an email account (bank records for instance).

If that password doesn't work for BillyBob's email, the hacker could then try using that username/email/password for a number of popular banks, like bofa.com and chase.com

And if that doesn't work, the hacker will move on to BettySue's information. The hacker will probably be successful for a good fraction of your users. So do your users a favor, and don't store passwords in plain text.

Why you need salts in your database

Why isn't it good enough to store hash(password) in the database? Hashes are one-way, so it's impossible to "undo" a hash, right?

Well, yes and no. The problem is that hackers have computed Rainbow Tables for many well-known hashes (like md5, sha1, sha256, etc... -- all the hashes that you might be using). In a nutshell, a Rainbow Table is a giant list (like hundreds of millions) of common passwords along with their hash:


password  hash(password)
a  0x2d08d232d9823d
b  0xfe8f8c8a8e8d82
...  ...
Password123  0xc0c27d8f9dee475c
...  ...
iLoveChocolate  0x32c243e37333489e
...  ...

(A Rainbow Table is actually a little smarter/more space efficient than this, but I won't go into the details.) This table took a long time to generate, but once it's built, the hacker community can share it forever.

Now it's easy to "reverse" the simple hash(password) in your database -- they just take the hash like 0x32c243e37333489e, look it up in the Rainbow Table, and voilà! the password is iLoveChocolate.

This is why you need to add a salt to the input of your hash. So instead of storing hash("iLoveChocolate"), you would be storing hash("this is your random long salt." + "iLoveChocolate"). This makes the Rainbow Table ineffective because "this is your random long salt.iLoveChocolate" is unlikely to be in the table. The actual salt you use should be a good long (and random) string, perhaps 100 characters or greater.
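You can see the effect with a toy lookup table in Python. The four-entry dict below is a tiny stand-in for a real Rainbow Table (which is far larger and stored more cleverly), and SHA-256 stands in for whatever hash you use.

```python
import hashlib

def h(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# toy "Rainbow Table": precomputed hash -> password
common_passwords = ["a", "b", "Password123", "iLoveChocolate"]
lookup_table = {h(p): p for p in common_passwords}

# what an attacker finds in the database, without and with a salt
unsalted = h("iLoveChocolate")
salted = h("this is your random long salt." + "iLoveChocolate")

cracked_unsalted = lookup_table.get(unsalted)  # found in the table
cracked_salted = lookup_table.get(salted)      # not in the table
```

The unsalted hash is reversed with a single dictionary lookup, while the salted one simply isn't in the table.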

Why you need per-user salts

You might be tempted to just use a single salt value for all users, and not have to deal with storing each user's salt in the database. But this is less secure.

For one, the hacker could start to build his own Rainbow Table, where "this is your single fixed salt value" is prepended to each password first. It would take a while, since the hacker would have to compute hundreds of millions of hashes. But eventually he might be able to match one of the hashes in your database to one of the entries in his newly-built Rainbow Table.

Also, by having a single salt for all users, the hacker would be able to know if 2 users happen to use the same password. If he sees 2 users with the same hash(fixed_salt+password), it must be because they have the same password. Thankfully, he doesn't know what that password is (yet). But a password that is used by two users is likely to be very weak, and so he could focus his attack on those 2 users.

So it's a good idea to have a per-user salt. This makes the above two attacks much less likely.
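A quick Python sketch of the second attack: with one fixed salt, two users with the same password get identical hashes, and with per-user salts they don't. SHA-256 and the variable names here are assumptions made for the example.

```python
import hashlib
import secrets

def h(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# single fixed salt: equal passwords give equal hashes
fixed_salt = "this is your single fixed salt value"
same_with_fixed = (h(fixed_salt + "Password123") ==
                   h(fixed_salt + "Password123"))

# per-user random salts: equal passwords give different hashes
salt_a = secrets.token_hex(16)
salt_b = secrets.token_hex(16)
same_with_per_user = (h(salt_a + "Password123") ==
                      h(salt_b + "Password123"))
```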

Why you should hash the password on the client also.

When the user types their password into your <input type=password> and submits that form, that password value is sent across the internet in plain text, and given to your web server, which might even log all the parameters for all the requests it gets.

That's a lot of opportunity for the password "iLoveChocolate" to get stolen. That's why it's better to hash that password in the browser first, so their password never leaves their computer. Note that a hacker could still sniff the hashed password going over the network, and use that hash later to send to the server and impersonate you. But at least the hacker can't use your real password for other purposes.

For the same reasons as above, it's better to salt this hash too. (In general, it's always more secure to add a salt before hashing. The saltier, the better.)

Shouldn't you use https/ssl instead?

Yes, you should use it, but I believe it's not enough. You should still use a client-side hash because:

Servers often log incoming requests (and their POST parameters) to a file. This would mean the user's plain-text passwords are just sitting there in a file somewhere.
One day someone might accidentally turn https off, or accidentally use http:// in one of the places where a password is sent.
SSL might be unavailable for a particular client or server (for example, some embedded device).

Example client-side code

Here's some example html:

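[The HTML example did not survive in this copy. Judging from the JavaScript that follows, it was a div holding the #username and #raw_password inputs, presumably with a button that calls login().]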

Note: I use a div instead of a form to prevent it from accidentally getting submitted (thereby sending the raw_password). Here's the corresponding JavaScript: (I haven't tested this - I'm just giving you an idea of how it works.)

// Assumes jQuery is loaded (that's where $ and .val() come from)
var login = function() {
  var username = $('#username').val();
  var raw_password = $('#raw_password').val();

  // Choose which version depending on your needs
  // hash() is a secure hash function like sha-1

  // Version 1: good
  var hashed_password = hash(raw_password);

  // Version 2: better (salt the hash with your domain)
  var hashed_password = hash("example.com" + raw_password);

  // Version 3: best
  var hashed_password = hash(username + "example.com" + raw_password);

  // Now login to server by sending (username, hashed_password) ...
}

Version 3 is technically the most secure, but this only works if username will never change (probably a dangerous assumption). And if your user can login with either a username or email, then it won't work either. So unfortunately, version 3 isn't doable for most sites.

How to use screen to pair program.

Dustin Boswell — Fri, 02 Apr 2010 00:00:00 GMT

Pair programming is a great way for 2 people to work together on the same code. Typically, it's done with the programmers sitting right next to each other. But what if they are in separate places, or they just don't want to accidentally touch elbows (ewww...)?

The UNIX screen command is typically used to run multiple terminal programs inside a single ssh session, and be able to disconnect/re-connect to the session without the programs noticing. It's an awesome utility, but you can also use it to let multiple people interact with the same terminal screen, and hence, allow multiple people to use the same editor at the same time.

First, I'm assuming both programmers have a user account on the same machine, and are already logged in (or ssh'd in) to the machine. (If the second programmer doesn't have an account, he can use the first programmer's account, and the steps below are the same.)

Enabling multi-user with `screen`

There are two ways to do this. One way is to do

chmod u+s /usr/bin/screen

first, and then make sure everyone's ~/.screenrc file contains:

multiuser on
acladd second_programmer_username

The other way is to just have

multiuser on
acladd root

But then the second programmer will need to do sudo screen instead of just screen in the steps below. (There are also more advanced security options.)

First Programmer: run `screen`

The first programmer starts his day by doing:

screen

Hit ENTER to dismiss the screen startup message. Then, go about your normal activities, such as running vim, or grep, or whatever.

Second Programmer: attach to his screen

The second user does the following:

[sudo] screen -rx first_programmer_username/

This attaches to the other user's active screen.

That's it! Now you should both be seeing the same window, and you can both use your keyboards.

Problems with the Delete Button?

When ssh'ing from a terminal in Mac OSX I noticed that my delete button no longer worked. (ssh'ing from PuTTY in Windows didn't have this problem.) To fix this, do:

TERM=screen; screen

wherever you would normally do screen. Or, it's probably easier to just put

alias screen='TERM=screen; screen'

in your ~/.bashrc file.

Essential commands while inside `screen`


Command          Purpose
Ctrl-a ?         Show the screen help menu
Ctrl-a d         Detach from the screen (without killing it)
Ctrl-a c         Create another window inside the screen.
Ctrl-a <space>   Cycle to the next window.
Ctrl-a Ctrl-a    Alternate to the previous window.
Ctrl-a [         Enter "scroll mode" (use Up/Down arrows, then ESC to exit).

Other `screen` flags


Command-line   Purpose
screen -ls     List all the active screens I have on this machine.
screen         Start a new screen session.
screen -x      Reattach to my pre-existing screen session.

How iTunes works under the (file) covers

Dustin Boswell — Thu, 02 Apr 2009 00:00:00 GMT

If you're like me, you have a bunch of mp3s (that you legitimately obtained from your CDs - ahem) in a neatly organized directory tree on your hard drive. And it's a breeze to navigate, play, backup, and share.

Now you've got an iPod and/or a Mac, and you want to take the plunge and use iTunes - but your puny little file-based UNIX-loving brain can't handle the transition. Well, neither could I at first, but now I've finally unraveled how the "file system" of iTunes works.

Here are the juicy nuggets that will help you understand:

About authorization:
1) Instead of an .mp3, purchased songs from iTunes are .m4p files. They are similar to mp3s but they are essentially encrypted with your AppleID as a key. (In fact, if you open up the .m4p file in a text editor, you can see your email address!)

2) Apple allows you to copy those .m4p files to as many computers/ipods as you like.

3) But you can't play an .m4p file unless that computer is "authorized" by your AppleID. You can have up to 5 computers authorized at one time. The "authorize" and "deauthorize" options in iTunes do the trick. (Note: it doesn't touch your library or files.)

Where is all the data?
1) When iTunes starts up, it looks at the iTunes/ directory (inside ~/Music in OSX, somewhere in your Documents & Settings in Windows). This contains a few important files:
- iTunes Music Library.xml - this single file has all your playlists, and playcounts, and has the file paths of your actual music files.
- There is a similar file (without the .xml) which is a binary version of the .xml file. From my understanding, the xml file is for reading convenience, but this file is what iTunes really uses. The two files are kept in sync by iTunes.
- The Album Artwork/ directory is where the album image files are stored.

2) In iTunes->Preferences->Advanced there is an option for the "iTunes Music Folder". This directory contains all the .m4p and other raw music files. The song title and other text metadata is stored inside the .m4p file.

Note that this directory has a Artist/Album/Song directory structure inside it. (This is what the iTunes->Preferences->Advanced->"Keep iTunes Music Folder Organized" option enforces.) You're not meant to touch this directory yourself - the iTunes program manages these subdirectories for you.

3) When you import music via iTunes->"Add to Library" (or when you purchase music, for that matter) then it:
- Adds those raw music files to your iTunes Music Folder. (If you have iTunes->Preferences->Advanced->"Copy files to iTunes Music folder when adding to library" unchecked, it will skip this step.)
- Adds the filepath of those files to the xml database mentioned above.

Implications
1) If you copy a raw music file into your iTunes Music Folder, that doesn't do anything - your iTunes library (the xml database) doesn't know about it.
2) If you move your iTunes Music Folder, your library will be broken, since that xml-database is pointing to file locations that don't exist. You'll have to move them back, or re-import those files.
3) The "Consolidate Library..." action copies all the music files pointed to by your library (where-ever they are) to your current iTunes Music Folder.
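If you want to check implication 2 for yourself, the xml database is a property list, so Python's plistlib can read it. The sketch below is a rough outline: the "Tracks" and "Location" key names are what I'd expect to find in the file rather than something documented here, track_paths is a made-up helper, and the sample library is built in memory instead of being read from a real iTunes Music Library.xml.

```python
import plistlib
from urllib.parse import unquote, urlparse

def track_paths(library_bytes):
    """Return the file paths the library points to (hypothetical helper)."""
    library = plistlib.loads(library_bytes)
    paths = []
    for track in library.get("Tracks", {}).values():
        location = track.get("Location")  # assumed to be a file:// URL
        if location:
            paths.append(unquote(urlparse(location).path))
    return paths

# a tiny in-memory stand-in for the real xml database
sample_library = plistlib.dumps({
    "Tracks": {
        "1": {
            "Name": "Song",
            "Location": "file://localhost/Users/me/Music/iTunes/"
                        "iTunes%20Music/Artist/Album/Song.m4p",
        },
    },
})
```

Running os.path.exists over the result of track_paths would show which library entries broke after the music folder was moved.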

Well, I hope that helps! Let me know if there are any important parts I've left out.

Oh, and if you're curious here's a writeup on how to share an iTunes library across multiple users on the same computer

Good Politician != Good Decider

Dustin Boswell — Fri, 03 Apr 2009 00:00:00 GMT

I've just realized what's wrong with our government: it's run by politicians. Seriously though, hear me out. The qualities that make a good politician aren't the qualities that make an ideal government official.

Here are the traits that make a successful politician:

→ Electability

Because of our media-based election system, you can't get into office unless you have a good on-camera personality, have the gift of gab, and are generally "likable." These are nice qualities for someone to have, and somewhat correlated with leadership, but less so with decision-making ability.

→ Leadership/Charisma

The ability to get others to join in your cause, to spend their efforts on your purpose. I believe people are hard-wired to want to follow leaders - it's part of our tribal ancestry.

→ Networking/Schmoozing/Politicizing

If you're able to rub elbows with the elite, "network", and get into "scratch my back, I'll scratch yours" exchanges with other politicians, you're going to be more effective.

But really, the most important trait we need in a government leader is:

→ Intelligence & decision-making

We need officials who can decide questions like:

"Should we go to war?"
"Should we impose an embargo against Cuba?"
"Should we spend more on military spending?"

by analyzing the facts and making good judgement calls.

Unfortunately, I think the first 3 traits are anti-correlated with good decision-making and analysis. The people who are good at dealing with other people typically aren't the type to study the facts and research a topic.

A new form of government.
If it were up to me, we would separate the "politician" and "decision-maker" roles.

We would have a large body of decision-makers - folks whose only job is to research issues and make informed decisions. These decision-makers would essentially act as a "brain trust" and be isolated from the day-to-day activities a normal politician does. That is, they would be spared from photo-ops, dealing with the media, filibusters, and other wastes of time. You know, so they'd have time to actually read the bills they're voting on.

The politicians would be required to bring decisions to the decision-makers and accept their decision. But note that the politicians still have the power of picking which decisions to present. So the politicians can still choose the issues to focus on.

For example, if a politician thinks that we need to reduce carbon emissions, she might propose a cap-and-trade system for emissions. The decision-makers might review this proposal and reject it because it wouldn't be effective. The politician then might propose a simple carbon tax bill instead. The decision-makers might approve this proposal, and then the approval could be turned into a bill.

The key is to have a decision-maker group that is highly intelligent and unbiased. You might think "you'll have all the same problems of electing decision-makers as you do for electing politicians." I don't think you would, for a number of reasons:

A decision-maker has a much more "boring" job. They are handed proposals, and they get to evaluate them. They can't choose which proposals they are given, so it would be much more difficult for special-interests to penetrate this group. Also, there should be some sort of impartiality-judgement (like they do for jury selection) when decision-makers are selected.

The group would be large (over 1000 people), so it would be more robust to rogue decision-makers. Also, any one decision-maker would have far less power, so it wouldn't attract the power-hungry politician type.

What do you think?

Maximum Likelihood vs. Expected Value

Dustin Boswell — Tue, 05 Apr 2011 00:00:00 GMT

Suppose you pulled a slot machine 5 times, and won 1 time. What's your best estimate for the true underlying payout probability $p$?

This seemingly innocent question is actually quite involved. You might think the answer is $\frac{1}{5}$ but it's actually $\frac{2}{7}$.

(Note: we're making the usual assumption of a uniform prior for $p$, which just means "before we saw any data, we thought any value of $p$ was equally likely.")

Maximum Likelihood Estimate

The maximum likelihood estimate (MLE) of a parameter is the value whose likelihood is highest. If you're looking at the probability density function (PDF) of that parameter, the MLE is simply the highest point on the curve (i.e. the mode).

For a binomial, the MLE of $p$ is indeed $\frac{k}{n}$ (where you saw $k$ "successes" out of $n$ attempts), which is $\frac{1}{5}$ in this case.

Expected Value

However, the MLE is different than the expected value of $p$. If you're looking at the PDF of a parameter, the expected value is the mean of that curve. For a binomial, the expected value turns out to be $\frac{k+1}{n+2}$. (This is also known as Laplace's Rule of Succession.)

For the data above ($k=1$, $n=5$) you get $\frac{2}{7}$. Yes, this seems strange, but this will be a better estimate of $p$ than $\frac{1}{5}$.

Nitty Gritty Math Details

The posterior distribution for a binomial parameter $p$ takes the shape of a beta distribution, which is a function $Beta(x; \alpha, \beta)$. When the data is $k$ successes out of $n$ attempts, the parameters to the function are $\alpha=k+1$ and $\beta=(n-k)+1$.

The function $Beta(x; \alpha, \beta)$ has the following properties: $$mean = \frac{\alpha}{\alpha + \beta} = \frac{k+1}{n+2}$$ and $$mode = \frac{\alpha - 1}{\alpha + \beta - 2} = \frac{k}{n}$$

Lo and behold, when you plug in $k=1$ and $n=5$ you get $\frac{2}{7}$ for the mean, and $\frac{1}{5}$ for the mode.
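If you'd rather check the arithmetic than trust it, here is a small Python sketch using exact fractions. It uses the standard mean $\frac{\alpha}{\alpha+\beta}$ and mode $\frac{\alpha-1}{\alpha+\beta-2}$ of a beta distribution. The function name is made up, and it assumes $n > 0$ so the mode is defined.

```python
from fractions import Fraction

def beta_mean_and_mode(k, n):
    """Posterior mean and mode of p after k successes in n attempts,
    assuming a uniform prior (so alpha = k + 1, beta = (n - k) + 1)."""
    alpha = k + 1
    beta = (n - k) + 1
    mean = Fraction(alpha, alpha + beta)           # (k + 1) / (n + 2)
    mode = Fraction(alpha - 1, alpha + beta - 2)   # k / n
    return mean, mode
```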

Head-to-head Matchup

But don't take my word for it. You can simulate it yourself with the Python code below:

import random

def flip_coin(p_heads):
    # return 'H' with probability p_heads, otherwise 'T'
    if random.random() < p_heads:
        return 'H'
    return 'T'

def compute_errors():
    # pick a true underlying 'p' uniformly, and generate data
    p_heads = random.random()
    NUM_COINS = 5
    coins = [flip_coin(p_heads) for _ in range(NUM_COINS)]

    # come up with estimates of 'p'
    mle_p = coins.count('H') / NUM_COINS
    laplace_p = (coins.count('H') + 1) / (NUM_COINS + 2)

    # return their errors
    return (abs(mle_p - p_heads), abs(laplace_p - p_heads))

win_count = {'mle': 0, 'laplace': 0}
mle_errors = []
laplace_errors = []
for _ in range(1000000):
    (mle_error, laplace_error) = compute_errors()
    if mle_error <= laplace_error:
        win_count["mle"] += 1
    else:
        win_count["laplace"] += 1

    mle_errors.append(mle_error)
    laplace_errors.append(laplace_error)

print("win_counts:", win_count)
print("average mle_error:", sum(mle_errors) / len(mle_errors))
print("average laplace_error:", sum(laplace_errors) / len(laplace_errors))

When I run this, I get

win_counts: {'laplace': 569967, 'mle': 430033}
average mle_error: 0.142
average laplace_error: 0.123

This means the formula $\frac{k+1}{n+2}$ was closer 57% of the time, and had a smaller average error than $\frac{k}{n}$.

WTF?!

If you're having a hard time coming to grips with this reality, let me try to explain some intuition behind why $\frac{k}{n}$ is a bad estimator.

The basic problem is that $\frac{k}{n}$ tends to "believe" extreme data too easily. If we see 0 wins out of 5 attempts, $\frac{k}{n}$ will conclude that the exact value of $p$ must be 0. Of course, this is extremely unlikely. It's more likely that the true value of $p$ is $> 0$, and that 0-out-of-5 was just an unlucky streak. (Similarly for when the data is 5-out-of-5 wins.)
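To see the "extreme data" effect side by side, here is a short Python sketch that lists both estimates for every possible outcome of 5 pulls. The variable names are just for illustration.

```python
from fractions import Fraction

n = 5
# (k, k/n, (k+1)/(n+2)) for every possible number of wins k
estimates = [(k, Fraction(k, n), Fraction(k + 1, n + 2))
             for k in range(n + 1)]

for k, mle, laplace in estimates:
    print(k, "wins:", "k/n =", mle, " (k+1)/(n+2) =", laplace)
```

At the two extremes, $\frac{k}{n}$ jumps all the way to 0 or 1, while $\frac{k+1}{n+2}$ stays at $\frac{1}{7}$ and $\frac{6}{7}$.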

Using Rsync to do local snapshotting/backups

Dustin Boswell — Sat, 09 Jan 2010 00:00:00 GMT

Jeff Atwood's recent data-loss story is a good reminder of why you should do off-site backups. But what if you just accidentally rm'ed a file, or saved over its contents with bad data? It's a lot of work to do a full backup recovery of your entire system just to get back one file. Isn't there an easier way?

Local Snapshotting: not perfect, but super useful

Here's what I do: run rsync with cron to copy all your important files to a local /backup directory.
First, create a new file called crontab.txt with contents like:

@hourly  rsync -a --inplace --max-size=1MB /www /etc /backup/hourly/
@daily   rsync -a --inplace --max-size=1MB /www /etc /backup/daily/
@weekly  rsync -a --inplace --max-size=1MB /www /etc /backup/weekly/
@monthly rsync -a --inplace /www /etc /backup/monthly/

Let me explain:

@hourly (and the others) is a special interval understood by cron
/www and /etc are specific directories I want to keep snapshotted. You might want to include /home as well
The -a option to rsync tells it to do "archive" mode (preserving permissions, etc...)
The --inplace option is just an optimization so that files in /backup are overwritten in place, as opposed to rsync creating an intermediate temp file.
The --max-size=1MB tells rsync to ignore files greater than 1MB in size. I do this so that I don't bother making lots of copies of big log files and videos and other stuff that isn't that important and doesn't change that often.

Now you install this crontab by doing:

crontab crontab.txt

I do this as the root user, but you can do this as any user that can read all the directories you need to backup, and can write to /backup. (Warning: the above command will overwrite any other crontabs you already have installed. Do a crontab -l first, to see what's installed.) Now you can just sit back and relax -- copies of your local files are being copied every hour, day, week, and month to the /backup directory. If you want to make sure you have a full (monthly) backup right away, then you should execute:

rsync -a --inplace /www /etc /backup/monthly/

right now.

How does this help me?

Let's say you just accidentally removed a local file

rm /etc/lighttpd/lighttpd.conf
# Oh shit, I didn't mean to do that!

Not to fear, you can recover it by doing

cp /backup/hourly/etc/lighttpd/lighttpd.conf /etc/lighttpd/lighttpd.conf

Why do hourly, daily, etc...?

For mistakes that you catch right away, the /backup/hourly directory is what you'll go to most often to get a recent version. But sometimes you don't realize something is wrong until days (or weeks) later. In that case, the hourly backup is of no use, since it has already mirrored the mistake, but the daily, weekly, or monthly copy may still hold the good version. If you were really paranoid, you could add a @yearly line to the crontab above.
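When it's time to recover a file, it helps to know where each snapshot of it lives. Here is a small Python sketch that maps an original path to its copies under /backup, newest interval first. The helper name and the SNAPSHOT_KINDS list are made up, and it assumes the same layout as the crontab shown earlier.

```python
import posixpath

# snapshot directories, from most to least frequently refreshed
SNAPSHOT_KINDS = ["hourly", "daily", "weekly", "monthly"]

def snapshot_candidates(original_path, backup_root="/backup"):
    """List where copies of original_path would live under backup_root."""
    relative = original_path.lstrip("/")
    return [posixpath.join(backup_root, kind, relative)
            for kind in SNAPSHOT_KINDS]
```

You would then check each candidate in order and copy back the first one that still has the good contents.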

Source code for putting "a" or "an" in front of a word.

Dustin Boswell — Thu, 11 Feb 2010 00:00:00 GMT

English is a funny language - you say "a usual person", but "an unusual person". There is no simple set of spelling rules for deciding this. Instead, you have to rely on whether the next word starts with a vowel sound.

So I think the "right" solution is just to have a list of the words that should be prefixed with "an".

aardvark
able
... words that should be prefixed with "an" ...

I searched far and wide, but couldn't find one, so I created a list of words that should be prefixed with "an" myself, based on which words in the CMU pronunciation dictionary started with one of the vowel-like phonemes:

["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY", "IH", "IY", "OW", "OY", "UH", "UW"]
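Here is roughly what that phoneme check looks like in Python. The four pronunciations below are hand-written for illustration in the dictionary's general style (vowel phonemes carry a trailing stress digit), not copied from the CMU file, and the function name is made up.

```python
VOWEL_PHONEMES = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                  "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

# illustrative entries only; a real run would load the full dictionary
SAMPLE_PRONUNCIATIONS = {
    "cat": ["K", "AE1", "T"],
    "hour": ["AW1", "ER0"],
    "usual": ["Y", "UW1", "ZH", "AH0", "L"],
    "unusual": ["AH0", "N", "Y", "UW1", "ZH", "AH0", "L"],
}

def starts_with_vowel_sound(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    first_phoneme = pronunciations[word.lower()][0]
    # drop the stress digit (0, 1 or 2) before comparing
    return first_phoneme.rstrip("012") in VOWEL_PHONEMES
```

Words for which this returns True are the ones that go into the "an" list.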

Files for you to Download

I am now sharing this list to the rest of the world (it's public domain, use it how you like.)

preceed_with_an.txt [146KB in size, 16,831 words]
preceed_with_an.py [147KB in size, it embeds the list above]

That Python file contains a function should_preceed_with_an("phrase...") that returns True or False.

A Django Template Filter

I now use this code in my Django templates by doing:

I am looking for {{ phrase|a_or_an }} {{ phrase }}

which will produce:

I am looking for a cat.
I am looking for an hour-glass.
I am looking for an unusual person.
I am looking for a usual person.

To make use of a_or_an you have to define it in one of your app/templatetags/ files:

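from django import template
from preceed_with_an import should_preceed_with_an

register = template.Library()

@register.filter
def a_or_an(phrase):
  # pick the article based on how the phrase starts
  if should_preceed_with_an(phrase): return "an"
  else: return "a"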

Gas Mileage for my 2007 Audi A4

Dustin Boswell — Sun, 06 Jun 2010 00:00:00 GMT

I've collected meticulous notes on how much I fill up each time at the gas pump, and below is the data. (Don't ask why - I'm an information packrat.)

About the Data

This is over a 2 year period, driving a good mix of city traffic and longer freeway trips. I'm a moderate driver in terms of gassing it. (I never put it in "Sport" mode though.)

I figured I had this data, so I might as well share it. Let me know if your mileage is much different.

How to Diagnose your Flaky Internet Connection

Dustin Boswell — Sun, 12 Jul 2009 00:00:00 GMT

I have Verizon Avenue DSL, the worst ISP I've ever had, so I've gotten to learn a few tricks on how to troubleshoot my internet problems. Here are the steps:

Can you ping the outside world?

Try pinging a well-known IP address like 4.2.2.2

In Windows: Click Start -> Run... -> cmd -> type "ping -n 50 4.2.2.2"
In Linux/Mac: Open a Terminal and type "ping -c 50 4.2.2.2"

You should see successful results (0% packet loss).

If you get messages like "no route to host", or get 100% packet loss, you've got much bigger problems. (If so, try doing "ping 192.168.0.1" - if that doesn't even work, then you probably aren't even connected to your router.)

Does resetting just the router help things?

Try unplugging (waiting 20 seconds) and re-plugging the power to your router. Does that help things? If so, you might have a crappy/old/broken router. I've had 3 different Netgear/DLink routers where resetting helped things. (In fairness, 1 of those was my fault: I plugged a 12v power supply into a router that wanted 7.5v -- the plug fit, the router got really hot, and periodically reset itself.)

Does resetting the modem and router help?

Try unplugging (waiting 30 seconds) and re-plugging the power to your dsl/cable modem, and also to your wireless router. Occasionally, your modem can get stuck with a bad IP address, and this will force it to get a new one. This really shouldn't happen if you have a good ISP, but it can. But this is only something that might happen every few months or so, not every day. If doing this helps all the time, you probably have a different problem.

Is it your DNS?

If you are getting a lot of "host/server not found" errors in your browser, and/or the "looking up domain.com ..." message in the status-bar at the bottom takes a long time, the problem might be a bad DNS server.

Background on DNS:

When you plug your wireless router into your cable/dsl modem, the router is given an IP address, as well as the IP address of where to do DNS lookups. (These DNS servers are hosted by your ISP, and are often flaky/overloaded.) When you plug your computer into the router (or connect over wireless), the router tells your computer to use 192.168.0.1 (the IP address of the router) as the DNS server. Your computer thinks that your router is the DNS server, but really, your router just turns around and does the DNS lookup for you.

How to fix your DNS:

One thing you can easily try is to tell your computer to use a different DNS server. Go to opendns.org -- they have instructions on how to do this for your particular computer. Their DNS Server IP addresses are 208.67.222.222 and 208.67.220.220
(Or you can use Verizon's public DNS servers of 4.2.2.2 and 4.2.2.3, or (new) Google's DNS servers of 8.8.8.8)
Alternately, you can change the settings on your router to use these IP addresses. (It's hard to explain how to do this - you have to visit http://192.168.0.1 from a computer that is plugged directly into your router's special port.) This way, all the computers in your home will benefit from having these new DNS servers.

If using these DNS servers fixes your internet woes, then you've found your problem (and your solution).

Is your wireless connection flaky?

Try pinging 192.168.0.1 from your laptop that's connected wirelessly, and see what the packet loss is. (You really need to do 50 or 100 pings to get a fair estimate.) Ideally, the packet loss should be 0% -- if you do a ping from a computer that is plugged directly into the router that is what you'll get.
In my house, I would get packet losses of at least 3%, sometimes as high as 8% or even higher. The symptom is that the internet seemed very flaky. Sometimes web pages don't load, or take extremely long to load. Sometimes my ssh connections would lock up. If your wireless is the true cause, then all of these should be symptoms that you don't have when plugged directly into your router (all ethernet, no wireless).
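If you want to log the packet loss over time rather than eyeballing it, here is a minimal sketch. It assumes the Linux/Mac "ping -c" flag used above and a summary line containing "N% packet loss"; the function names are made up for illustration, and Windows prints a differently worded summary that this pattern will not match.

```python
import re
import subprocess

def parse_packet_loss(ping_output):
    """Return the packet-loss percentage from ping's summary, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None

def measure_packet_loss(host="192.168.0.1", count=50):
    """Run ping against the router and parse the loss percentage."""
    proc = subprocess.run(["ping", "-c", str(count), host],
                          capture_output=True, text=True)
    return parse_packet_loss(proc.stdout)

# A summary line in the style Linux ping prints (illustrative).
summary = "50 packets transmitted, 48 received, 4% packet loss, time 49072ms"
print(parse_packet_loss(summary))   # 4.0
```

Calling measure_packet_loss() once from the wireless laptop and once from a wired machine gives the two numbers to compare.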

How to fix your wireless connection:

I don't have a great solution here, since the problem might be that your house is just a "dead zone" as far as wireless goes. Or there might be too many other routers/microwaves/cordless phones/other interference right around you.

But here are some ideas to try:

try changing the "channel" of your wireless router (it's a number from 1-11) to something very different from what it was before.
try moving your router to a different place in the room (away from bookcases for example)
try upgrading the firmware of your router (a pain, I know)
buy a fancy new router

Update: I bought the Apple MB763LL/A AirPort Extreme Dual-band Base Station and so far everything is great - 0% packet loss, great range. It's a bit pricey, but I've decided I'm not going to skimp on productivity tools that I use every day.

Is Drinking Distilled Water Dangerous?

Dustin Boswell — Sun, 09 Nov 2008 00:00:00 GMT

Note: I'm not a doctor, zealot, or trying to sell anything - I'm just a regular guy summarizing all the research I did on this topic.
There seems to be a lot of controversy about what kind of water people should be drinking and cooking their food with:

tap water
natural "spring" or "rain" water
filtered, reverse-osmosis, or other cleansed/purified/low-mineral water
distilled water

The controversy is about the good & bad components in water. There are a number of good minerals found in "mineralized" water sources including:

calcium
magnesium
potassium
sodium

There are also a number of bad components that might be found in water (whether it's bottled or not):

lead, mercury, and other unhealthy heavy metals
bacteria & other live matter
man-made chemicals like plastics, medicines

There is a wide range of waters, in terms of purity. On the one extreme is distilled water, which should contain absolutely nothing but H2O. On the other extreme is "hard water" which contains a large amount of minerals (and possibly chemicals). There are many types of water in between, some that have only small amounts, or just a subset, of the minerals in question.

The "total dissolved solids" (TDS) is a measure of how much stuff (typically good minerals) is in your water. Highly-purified water (distilled, reverse-osmosis, other highly-filtered) has a TDS well below 50mg/liter. Distilled water ought to have TDS=0. "Hard water" or mineralized water often has TDS > 200. The debate is about which is healthier: low-TDS water, or high-TDS water. Below are the common arguments that come up, and my take on them.

"Distilled/highly-purified water is missing essential minerals that your body needs."
The amount of minerals in normal water is very small compared to the amount found in food (less than 10%). A humorous thought experiment mentioned here was to imagine a blender with a day's worth of food and consider the tiny difference in minerals between adding distilled water to this, or regular water. However, some argue that the calcium/magnesium/etc... in water is more easily absorbed by your body than from food or supplements.
I think this may be more important of an issue if you are already low on these minerals, and don't get enough of them from other sources (eg. eating nutritious food, cooking with tap water, drinking other liquids like orange juice, etc...)

"Purified water can "leach" metals and other bad chemicals from the pipes and containers."
Apparently, pure/low-mineral water is chemically "unstable" and wants to dissolve away the materials around it. If you store your (pure) water for a long time, or get it through pipes that haven't been set up for it, this is something you should think about.
Overall, this is really a contamination issue though, not about how distilled water affects your body. I just wanted to mention it for completeness.

"Purified water leaches vital minerals & ions from your body."
It stands to reason that if you drink distilled water, and urinate any of these vital minerals, there is a net loss. I get the impression that the dissolving/extracting power of distilled water is very high, and that drinking it will draw out many good chemicals (in addition to the bad impurities) from your body. Perhaps distilled water is less dangerous when drunk with a meal? I've read claims that distilled water is particularly bad during exercise (presumably because that is when your body needs those electrolytes the most).
Frustratingly, there doesn't seem to be much research on this. Would it be that hard to measure the amount of these chemicals in the urine of distilled water drinkers compared to normal?

Then why do people drink distilled water?

to avoid all the bad chemicals that might be found in tap water
they like the taste (as I do)
to extract and remove toxic substances from your body

I understand the inclination to avoid tap water (that's a whole controversy of its own) - but good bottled water can achieve this goal. As far as taste, this is a personal matter, but presumably everyone could find a non-distilled alternative that they like (as I am going to do now) for drinking on a daily basis. If you have a particular need to remove impurities from your body, I think distilled water is safe & effective in doing so, understanding that it may remove good materials from your body as well. I've also read that activated charcoal has been used for the same purpose.

So, what water should I be drinking?
I have no idea :) How ironic is it that modern man cannot answer such a simple question, when drinking water is something that every life form on Earth was born from?
If you believe all the evidence in the references below, you should be drinking water with a high TDS (lots of calcium, magnesium, and other good stuff), that doesn't have toxic metals or chemicals.
There doesn't seem to be any specific health benefit from distilled water except for avoiding bad chemicals. From what I've read, the only "danger" with drinking reasonable amounts of distilled water is the long-term mineral-deficiency it might cause in your body. But maybe our Western diets & lifestyles are already so deficient in these minerals that distilled water exacerbates it?
There is also a lot of controversy about whether alkaline water (pH > 7.0) is generally better for you because it helps your body be less acidic (which is the source of most disease according to alkaline-diet proponents). I hope to research this more and post later about it. But one point worth noting is that supposedly distilled waters are actually slightly acidic because they readily absorb CO2 from the air (carbonic acid).

As I find out more about various bottled waters, I'll post them here. For starters, you might consider:

Fiji Water - TDS > 200, pH 7.5. Fiji took a lot of flak when people calculated how much waste goes into a single bottle flown from around the world, but the company now aims to be carbon negative.

If you have a brand of water that you swear by, please comment on this post.

Further reading:
http://www.who.int/water_sanitation_health/dwq/nutdemineralized.pdf - a great report by the World Health Organization citing a lot of research on why water-without-minerals is unhealthy
http://www.cyber-nook.com/water/distilledwater.htm - a page with lots of information on both sides of the issue
http://www.mgwater.com/calcium.shtml - a research paper showing that "hard water" was correlated with lower rates of cardiovascular death

SSH keys in 2 easy steps

Dustin Boswell — Sat, 13 Feb 2010 00:00:00 GMT

These are simple instructions that will let you ssh from one Linux machine to another without needing to type your password.

Step 1) Generate your public signature

On your local machine (where you are ssh-ing from) type:

ssh-keygen

(Then hit ENTER to accept the default key file of ~/.ssh/id_rsa, and ENTER again twice if you're lazy and want to use a blank passphrase. The public key is written next to it as ~/.ssh/id_rsa.pub.) Note that you only have to generate a key once per client machine - the same public key will be used to access all servers.

Step 2) Copy your public signature to the server

Again, from your local machine, type:

cat ~/.ssh/id_rsa.pub | ssh remote_user@remote.example.com "cat >> ~/.ssh/authorized_keys"

(but replace remote_user@remote.example.com with your actual user and server.)

This fancy shell command appends the contents of your public signature to the end of the ~/.ssh/authorized_keys file on the server. (If you did a simple scp it would overwrite any previous authorized keys you've stored.)

You're done!

Next time you ssh into the server

ssh remote_user@remote.example.com

It should do this without prompting for any passwords.

How to fix certain macbook wireless problems.

Dustin Boswell — Tue, 09 Aug 2011 00:00:00 GMT

On a number of occasions, I've opened my MacBook Pro (OS X 10.5.8) quickly and tried to use the internet before AirPort could connect to my usual network. Sometimes this causes the AirPort to get "stuck" in a weird state, where you can no longer use that network. Even rebooting the MacBook doesn't help.

Looking at System Preferences → Network → Advanced → TCP/IP you could see that IPv4 was something like 169.254.x.x and the Subnet was 255.255.0.0. This is the kind of address a Mac assigns itself when it fails to get an IP from the DHCP server. (You might also try turning off your firewall. Here's more info.)

Here's what I've done to fix it:

Turn the AirPort off (under System Preferences → Network)
Open the Terminal App.
type sudo rm /Library/Preferences/SystemConfiguration/com.apple.network.identification.plist and hit ENTER. (You will need to type your password.)
type sudo rm /Library/Preferences/SystemConfiguration/com.apple.airport.preferences.plist and hit ENTER
restart your MacBook
Turn the AirPort on.

Those plist files will get regenerated, and your wireless passwords are even remembered!

My Vim Cheat Sheet

Dustin Boswell — Tue, 07 Aug 2012 00:00:00 GMT

A friend of mine decided to make the switch from Emacs to Vim, and to help him out, I gave him a copy of the cheat sheet I made to help me learn. Here it is - enjoy!

How to Play Yes/No Proposition Bets (and "Lodden Thinks")

Dustin Boswell — Fri, 18 Feb 2011 00:00:00 GMT

If you're a gambling man (like I am), and fancy yourself a good estimator, here are some fun games you can play with your friends...

Game 1: Over/Under betting, auction-style

This is an even-money bet (say for $10), where the first person starts by making a claim like:

I bet $10 that the Statue of Liberty is at least 50 feet tall.

The second person has 2 choices:

Accept the bet (second person wins if Statue of Liberty is actually under 50 feet tall.)
Make a bolder claim by increasing the value-in-question.

The bolder claim might be:

I bet $10 that the Statue of Liberty is at least 200 feet tall.

This keeps going back-and-forth until a bet is accepted. At that point, you have to go look up the fact in question, and resolve the bet.

The value that the bet settles at is the Over-under that combines the two people's estimates. This is assuming both players play rationally, and neither accidentally increases the value too much in one step.

Usually, there is a gentlemen's agreement that each person has to increase the value-in-question by a certain fraction. But in practice, this doesn't come up much, because if one person were to be a sissy and increase the value from 200 feet to 200.01 feet, the second person usually just leapfrogs this to a new reasonable value.

Game 2: Lodden Thinks

The show Poker After Dark has made the "Lodden Thinks" version of this game popular. The "Lodden" is Johnny Lodden, a famous poker player.

This game is exactly the same as Game 1, except that the value-in-question is something that a third person knows and will keep secret until the bet needs to be resolved.

For example, in one episode of Poker After Dark, two players bet on the age Daniel Negreanu lost his virginity.

What's interesting about this game is that the value-in-question doesn't have to be something the third-person knows - it can be that third person's estimate of some unknown value. For instance, in one episode, two players were betting on what a third-player's estimate of Hugh Hefner's age was. Before the betting started, the third player was asked to come up with an estimate, and remember it, without telling anyone. Now the other two players bet on this estimate.

It doesn't matter if the estimate is good or not, what you're really betting on is what Lodden (or whoever the third person is) is thinking. It's an interesting game because you're basically betting over who can do a better job getting into the mind of another person (hence, why poker players like this game).

The other benefit of this game is it doesn't require access to a computer to look up the facts - you can play this game in the car (or at a poker table), for instance.

Game 3: Odds-making on Yes-No Propositions

Both of the previous games require betting on a value-in-question that is a number. But you can play a version where you bet on a yes-no proposition, such as "Will the Lakers win the next championship?"

To play this game, you have to decide on a fixed prize-pool (say $10) and each claim is a fraction of that pool that is being bet against the remaining portion.

Here's an example: the first person starts by saying

I bet $0.10 (vs. your $9.90) that the Lakers will win.

The first claim should always be chosen to be an extremely good bet. In this case, the Lakers will probably win with better than a 1-in-99 chance, so the bet above is a good bet (for the first person).

The second person obviously never takes this first bet, but instead makes a bolder claim, like:

I bet $0.50 (vs. your $9.50) that the Lakers will win.

This goes back and forth, raising the value each time, until someone accepts.
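To see why the opening bet is such a good one, it helps to work out the break-even probability. This is a small sketch under the rules described above (a fixed $10 pool, the bettor risks their stake to win the rest); the function names and the 50% figure are made up for illustration.

```python
def break_even_probability(my_stake, pool=10.00):
    """Win probability at which risking my_stake to win
    (pool - my_stake) has zero expected value."""
    return my_stake / pool

def expected_value(p_win, my_stake, pool=10.00):
    """Expected profit for the bettor, given their true chance of winning."""
    return p_win * (pool - my_stake) - (1 - p_win) * my_stake

for stake in (0.10, 0.50):
    print(stake, round(break_even_probability(stake), 4))

# If the Lakers really had a 50% chance (a hypothetical number),
# the $0.10 opening bet would be worth almost $5 on average:
print(round(expected_value(0.5, 0.10), 2))
```

The $0.10 bet breaks even at a 1% chance of winning and the $0.50 bet at 5%, so each raise is a claim that the true probability is at least that high. A rational player accepts once the implied probability passes their own estimate.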

My friends and I are software engineers, so we tend to bet on weird things like "Is the domain name GreenMonkeyButt.com available?"

Note that you always have to phrase the question in the way that is MOST LIKELY, so that the small initial bet is a good bet. For example, if you wanted to bet on whether the Clippers will win the NBA championship, you'd want to start the betting as:

I bet $0.10 (vs. your $9.90) that the Clippers will lose.

so that the bet amount can increase from there. Otherwise, if you start the betting as:

I bet $0.10 (vs. your $9.90) that the Clippers will win.

Then the second player might just stop right there and accept your bet (oops).

Storing User Passwords Securely: hashing, salting, and Bcrypt

Dustin Boswell — Mon, 18 Jun 2012 00:00:00 GMT

In this article, I'll explain the theory for how to store user passwords securely, as well as some example code in Python using a Bcrypt library.

Bad Solution #1: plain text password

It would be very insecure to store each user's "plain text" password in your database:


user account  plain text password
john@hotmail.com  password
betty@gmail.com  password123
... ...

This is insecure because if a hacker gains access to your database, they'll be able to use that password to login as that user on your system. Or even worse, if that user uses the same password for other sites on the internet, the hacker can now login there as well. Your users will be very unhappy.

(Oh, and if you think no one would ever store passwords this way, Sony did just this in 2011.)

Bad Solution #2: sha1(password)

A better solution is to store a "one-way hash" of the password, typically using a function like md5() or sha1():


user account  sha1(password)
john@hotmail.com  5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
betty@gmail.com  cbfdac6008f9cab4083784cbd1874f76618d2a97
... ...

Even though the server doesn't store the plain text password anywhere, it can still authenticate the user:

import hashlib

def is_password_correct(user, password_attempt):
    attempt_hash = hashlib.sha1(password_attempt.encode()).hexdigest()
    return attempt_hash == user["sha1_password"]

This solution is more secure than storing the plain text password, because in theory it should be impossible to "undo" a one-way hash function and find an input string that outputs the same hash value. Unfortunately, hackers have found ways around this.

One problem is that many hash functions (including md5() and sha1()) aren't so "one-way" after all, and security experts suggest that these functions not be used anymore for security applications. (Instead, you should use better hash functions like sha256() which don't have any known vulnerabilities so far.)

But there's a bigger problem: hackers don't need to "undo" the hash function at all; they can just keep guessing input passwords until they find a match. This is similar to trying all the combinations of a combination lock. Here's what the code would look like:

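import hashlib

def sha1(text):
    return hashlib.sha1(text.encode()).hexdigest()

database_table = {   # leaked table, keyed by password hash
  "5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8": "john@hotmail.com",
  "cbfdac6008f9cab4083784cbd1874f76618d2a97": "betty@gmail.com",
  # ...
}
LIST_OF_COMMON_PASSWORDS = ["123456", "password", "password123"]  # stand-in for a real wordlist

for password in LIST_OF_COMMON_PASSWORDS:
    if sha1(password) in database_table:
        print("Hacker wins! I guessed a password!", password)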

You might think that there are too many possible passwords for this technique to be feasible. But there are far fewer common passwords than you'd think. Most people use passwords that are based on dictionary words (possibly with a few extra numbers or letters thrown in). And most hash functions like sha1() can be executed very quickly -- one computer can literally try billions of combinations each second. That means most passwords can be figured out in under 1 cpu-hour. Programs like John The Ripper are able to do just this.

Aside: years ago, computers weren't this fast, so the hacker community created rainbow tables that have pre-computed a large set of these hashes ahead of time. Today, nobody uses rainbow tables anymore because computers are fast enough without them.

So the bad news is that any user with a simple password like "password" or "password123" or any of the billion most-likely passwords will have their password guessed. If you have an extremely complicated password (over 16 random numbers and letters) you are probably safe.

Also notice that the code above is effectively attacking all of the passwords at the same time. It doesn't matter if there are 10 users in your database, or 10 million, it doesn't take the hacker any longer to guess a matching password. All that matters is how fast the hacker can iterate through potential passwords. (And in fact, having lots of users actually helps the hacker, because it's more likely that someone in the system was using the password "password123".)

sha1(password) is what LinkedIn used to store its passwords. And in 2012, a large set of those password hashes was leaked. Over time, hackers were able to figure out the plain text passwords for most of these hashes.

Summary: storing a simple hash (with no salt) is not secure -- if a hacker gains access to your database, they'll be able to figure out the majority of the passwords of the users.

Bad Solution #3: sha1(FIXED_SALT + password)

One attempt to make things more secure is to "salt" the password before hashing it:


user account  sha1("salt123456789" + password)
john@hotmail.com  b467b644150eb350bbc1c8b44b21b08af99268aa
betty@gmail.com  31aa70fd38fee6f1f8b3142942ba9613920dfea0
... ...

The salt is supposed to be a long random string of bytes. If the hacker gains access to these new password hashes (but not the salt), it will make it much more difficult for the hacker to guess the passwords because they would also need to know the salt. However, if the hacker has broken into your server, they probably also have access to your source code as well, so they'll learn the salt too. That's why security designers just assume the worst, and don't rely on the salt being secret.

But even if the salt is not a secret, it still makes it harder to use those old-school rainbow tables I mentioned before. (Those rainbow tables are built assuming there is no salt, so salted hashes stop them.) However, since no-one uses rainbow tables anymore, adding a fixed salt doesn't help much. The hacker can still execute the same basic for-loop from above:

for password in LIST_OF_COMMON_PASSWORDS:
    # SALT is the fixed salt, which the hacker learned along with the hashes
    if sha1(SALT + password) in database_table:
        print("Hacker wins! I guessed a password!", password)

Summary: adding a fixed salt still isn't secure enough.

Bad Solution #4: sha1(PER_USER_SALT + password)

The next step up in security is to create a new column in the database and store a different salt for each user. The salt is randomly created when the user account is first created (or when the user changes their password).


user account  salt  sha1(salt + password)
john@hotmail.com  2dc7fcc...  1a74404cb136dd60041dbf694e5c2ec0e7d15b42
betty@gmail.com  afadb2f...  e33ab75f29a9cf3f70d3fd14a7f47cd752e9c550
... ... ...

Authenticating the user isn't much harder than before:

def is_password_correct(user, password_attempt):
    return sha1(user["salt"] + password_attempt) == user["password_hash"]

By having a per-user-salt, we get one huge benefit: the hacker can't attack all of your users' passwords at the same time. Instead, the attack code has to try each user one by one:

for user in users:
    PER_USER_SALT = user["salt"]

    for password in LIST_OF_COMMON_PASSWORDS:
        # each guess is re-hashed with this user's salt and only
        # compared against this user's stored hash
        if sha1(PER_USER_SALT + password) == user["password_hash"]:
            print("Hacker wins! I guessed a password!", password)

So basically, if you have 1 million users, having a per-user-salt makes it 1 million times harder to figure out the passwords of all your users. But this still isn't impossible for a hacker to do. Instead of 1 cpu-hour, now they need 1 million cpu-hours, which can easily be rented from Amazon for about $40,000.

The real problem with all the systems we've discussed so far is that hash functions like sha1() (or even sha256()) can be executed on passwords at a rate of 100M+/second (or even faster, by using the GPU). Even though these hash functions were designed with security in mind, they were also designed so they would be fast when executed on longer inputs like entire files. Bottom line: these hash functions were not designed to be used for password storage.

Good Solution: bcrypt(password)

Instead, there are a set of hash functions that were specifically designed for passwords. In addition to being secure "one-way" hash functions, they were also designed to be slow.

One example is Bcrypt. bcrypt() takes about 100ms to compute, which is about 10,000x slower than sha1(). 100ms is fast enough that the user won't notice when they log in, but slow enough that it becomes less feasible to execute against a long list of likely passwords. For instance, if a hacker wants to compute bcrypt() against a list of a billion likely passwords, it will take about 30,000 cpu-hours (about $1200) -- and that's for a single password. Certainly not impossible, but way more work than most hackers are willing to do.
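The arithmetic behind those figures can be checked with a few lines. This sketch assumes the price of $0.04 per cpu-hour implied by the earlier "$40,000 for 1 million cpu-hours" estimate; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600
DOLLARS_PER_CPU_HOUR = 40_000 / 1_000_000   # implied by the estimate above

def cracking_cost(num_guesses, seconds_per_hash):
    """Return (cpu_hours, dollars) to hash every guess once."""
    cpu_hours = num_guesses * seconds_per_hash / SECONDS_PER_HOUR
    return cpu_hours, cpu_hours * DOLLARS_PER_CPU_HOUR

# A billion guesses at 100ms each, against a single user's hash.
hours, dollars = cracking_cost(1_000_000_000, 0.100)
print(round(hours), round(dollars))   # 27778 1111
```

That is roughly the 30,000 cpu-hours and $1200 quoted above, and it has to be paid again for every user because of the per-user salt.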

If you're wondering how Bcrypt works, here's the paper. Basically the "trick" is that it executes an internal encryption/hash function many times in a loop. (There are other alternatives to Bcrypt, such as PBKDF2 that use the same trick.)
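To make the "many times in a loop" idea concrete without installing anything, here is a minimal sketch using PBKDF2 from Python's standard library rather than Bcrypt itself. The iteration count, the 16-byte salt, and the function names are assumptions for illustration, not recommendations from this article.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000   # assumed work factor; raise it as hardware gets faster

def hash_password(password, salt=None):
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)                      # fresh random per-user salt
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                              salt, ITERATIONS)
    return salt, key

def is_password_correct(password_attempt, salt, stored_key):
    """Re-derive the key with the stored salt and compare."""
    _, key = hash_password(password_attempt, salt)
    return hmac.compare_digest(key, stored_key)    # constant-time comparison

salt, stored = hash_password("billy was a turtle for halloween")
print(is_password_correct("billy was a turtle for halloween", salt, stored))  # True
print(is_password_correct("billy123", salt, stored))                          # False
```

Unlike the bcrypt library, this approach leaves you to store the salt and the iteration count yourself.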

Also, Bcrypt is configurable, with a log_rounds parameter that tells it how many times to execute that internal hash function. If all of a sudden, Intel comes out with a new computer that is 1000 times faster than the state of the art today, you can reconfigure your system to use a log_rounds that is 10 more than before (log_rounds is logarithmic), which will cancel out the 1000x faster computer.
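The "10 more" follows from the cost being an exponent of two. A tiny sketch of that relationship (the function name is made up for illustration):

```python
def internal_iterations(log_rounds):
    """bcrypt runs its expensive setup loop 2**log_rounds times."""
    return 2 ** log_rounds

print(internal_iterations(12))                              # 4096
print(internal_iterations(22) // internal_iterations(12))   # 1024
```

So adding 10 to log_rounds multiplies the work by 1024, which roughly cancels a 1000x faster machine, and adding 1 simply doubles it.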

Because bcrypt() is so slow, it makes the idea of rainbow tables attractive again, so a per-user-salt is built into the Bcrypt system. In fact, libraries like bcrypt on pypi store the salt in the same string as the password hash, so you won't even have to create a separate database column for the salt.

Let's see the code in action. First, let's install it:

sudo apt-get install libffi-dev libssl-dev
sudo pip install bcrypt
python -c "import bcrypt"   # did it work?

Now that it's installed, here's the Python code you'd run when creating a new user account (or resetting their password):

from bcrypt import hashpw, gensalt
# plaintext_password is the user's new password (a byte string on Python 3)
hashed = hashpw(plaintext_password, gensalt())
print(hashed)    # save this value to the database for this user
# example output: $2a$12$8vxYfAWCXe0Hm4gNX8nzwuqWNukOkcMJ1a9G2tD71ipotEZ9f80Vu

Let's dissect that output string a little:
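Here is a rough sketch that pulls the pieces apart, assuming the usual bcrypt layout of $version$log_rounds$ followed by a 22-character salt and a 31-character hash. The helper dissect_bcrypt_hash is made up for illustration; it is not part of the bcrypt library.

```python
def dissect_bcrypt_hash(hashed):
    """Split a bcrypt string of the form $<version>$<log_rounds>$<salt><hash>."""
    _, version, log_rounds, rest = hashed.split("$")
    return {
        "version": version,
        "log_rounds": int(log_rounds),
        "salt": rest[:22],    # first 22 characters encode the salt
        "hash": rest[22:],    # remaining 31 characters are the hash itself
    }

parts = dissect_bcrypt_hash(
    "$2a$12$8vxYfAWCXe0Hm4gNX8nzwuqWNukOkcMJ1a9G2tD71ipotEZ9f80Vu")
for name, value in parts.items():
    print(name, "=", value)
# version = 2a
# log_rounds = 12
# salt = 8vxYfAWCXe0Hm4gNX8nzwu
# hash = qWNukOkcMJ1a9G2tD71ipotEZ9f80Vu
```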

As you can see, it stores both the salt, and the hashed output in the string. It also stores the log_rounds parameter that was used to generate the password, which controls how much work (i.e. how slow) it is to compute. If you want the hash to be slower, you pass a larger value to gensalt():

# 13 is the log_rounds work factor, passed positionally
# (newer releases of the bcrypt package name this keyword "rounds")
hashed = hashpw(plaintext_password, gensalt(13))
print(hashed)
# example output: $2a$13$ZyprE5MRw2Q3WpNOGZWGbeG7ADUre1Q8QO.uUUtcbqloU0yvzavOm

Notice that there is now a 13 where there was a 12 before. In any case, you store this string in the database, and when that same user attempts to log in, you retrieve that same hashed value and do this:

if hashpw(password_attempt, hashed) == hashed:
    print("It matches")
else:
    print("It does not match")

You might be wondering why you pass in hashed as the salt argument to hashpw(). The reason this works is that the hashpw() function is smart, and can extract the salt from that $2a$12$... string. This is great, because it means you never have to store, parse, or handle any salt values yourself -- the only value you need to deal with is that single hashed string which contains everything you need.

Final Thoughts: choosing a good password

If your user has the password "password", then no amount of hashing/salting/bcrypt/etc. is going to protect that user. The hacker will always try simpler passwords first, so if your password is toward the top of the list of likely passwords, the hacker will probably guess it.

The best way to prevent your password from being guessed is to create a password that is as far down the list of likely passwords as possible. Any password based on a dictionary word (even if it has simple mutations like a letter/number at the end) is going to be on the list of the first few million password guesses.

Unfortunately, difficult-to-guess passwords are also difficult-to-remember. If that weren't an issue, I would suggest picking a password that is a 16-character random sequence of numbers and letters. Other people have suggested using passphrases instead, like "billy was a turtle for halloween". If your system allows long passwords with spaces, then this is definitely better than a password like "billy123". (But I actually suspect the entropy of most users' passphrases will end up being about the same as a password of 8 random alphanumeric characters.)
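For a sense of scale, the entropy of a truly random password is easy to compute. This sketch assumes each character is chosen uniformly from the 62 letters and digits (a-z, A-Z, 0-9); the function name is made up for illustration.

```python
import math

def random_password_entropy_bits(length, alphabet_size=62):
    """Entropy in bits of a uniformly random password."""
    return length * math.log2(alphabet_size)

print(round(random_password_entropy_bits(8), 1))    # 47.6
print(round(random_password_entropy_bits(16), 1))   # 95.3
```

Every extra bit doubles the number of guesses a hacker needs, so the 16-character password is about 2^48 times harder to guess than the 8-character one.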

div and span: display = 'block', 'inline', or 'inline-block' ?

Dustin Boswell — Tue, 26 Oct 2010 00:00:00 GMT

Background: the difference between `div` and `span`

`<div>`

A "block-level element"
can contain all other elements!
can only be inside other block-level elements
defines a rectangular region on the page
tries to be as wide as possible
begins on a "new line", and has a "carriage return" at the end, like a <p>

`<span>`

An "inline element"
cannot contain block-level elements!!
can be inside any other element
defines a "snake" on the page
tries to be as small as possible
doesn't create any new lines.

[Figure: Simple <div>s and <span>s]

From a rendering point of view,

<span> == <div style="display: inline">

and

<div> == <span style="display: block">.

As for HTML syntax, however, a div cannot be nested inside an inline element, whereas a span cannot contain block-level elements.

But there is also the mysterious "display: inline-block". What is it...?

`block` vs `inline` vs `inline-block`

Below are a bunch of <div style="width: 50px"...> with different display: settings.

[Demo: two 50px-wide <div>s for each setting: display: block, display: inline, and display: inline-block]

As you can see, inline-block is a hybrid that:

Creates a rectangular region (a block)
Doesn't create any new lines (hence "in line")

For more information

ETF's leak money

Dustin Boswell — Sat, 02 Apr 2011 00:00:00 GMT

An ETF (Exchange Traded Fund) is a fund that trades like a stock (much like a mutual fund, but bought and sold on an exchange) and lets ordinary investors invest in commodities and assets like oil and gold.

Without an ETF, it would be difficult to invest in the price of oil. You could invest in an oil company, like Exxon Mobil (XOM) or BP (BP), but the prices of these stocks only loosely correlate with the price of oil. You could theoretically rent out a tanker to physically store the oil, but only big firms like JP Morgan can afford to do this.

So instead you buy an ETF like OLO, whose daily price changes are designed to match the daily changes in oil as closely as possible. If oil goes up by 5% one day, this ETF should go up by about 5%, too. They even have short ETFs (like SZO), whose price should go down by 5% if oil goes up 5%.

The way these ETFs work is by buying and selling short-term oil futures, and "rolling" these contracts from month-to-month, buying new ones to replace expiring ones. They don't actually have to own any barrels of oil.

ETF's leak money

What most people don't realize is that this process leaks money. It takes money to run the ETF - buying and selling futures has its transaction costs, plus you have to pay guys in fancy suits to watch over the whole operation.

By "leaks money", I mean that the price of the ETF slowly goes down over time. It's like a boat with a small hole in it - the waves (oil prices) may bounce the boat up and down, but ultimately the boat will sink.

So how big is the hole? How fast does it leak money? It's hard to measure exactly, but here is some interesting data: the table below shows prices for OLO, SZO, and WTI crude futures, on 2 dates about a year apart:



Date	WTI crude futures	OLO (long oil)	SZO (short oil)
Jan 8, 2010	$83.25	$14.06	$46.15
Dec 31, 2010	$89.68*	$14.00	$44.31
Change:	+8%	+0%	-4%

[* Dec 30, 2010 price, no data available for Dec 31. Was $91.58 Jan 3, 2011]

As you can see, the price of oil went up 8% over that time, yet OLO remained at the same price. (Now, there might be some intricacies I'm missing, in the way futures prices are calculated at the end of the day, or there might be other weird near-expiration effects happening.)

But there's another way to measure the leak: if there were no leak, the price of $100-of-OLO + $100-of-SZO should remain constant. That is, buying equally-weighted shares of each puts you in a combined neutral position where the price increase of one stock should cancel out the price decrease of the other. However, the combined $100-of-OLO + $100-of-SZO went down about 2% during that time. (I should really calculate this leakage on a month-to-month basis, and average over a number of years.)
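The arithmetic behind that "about 2%" figure can be reproduced from the table prices. This is only a sketch of the buy-and-hold calculation described above (put $100 into each fund on the first date, check the value on the second); it ignores fees, spreads, and any rebalancing.

```python
# Prices copied from the table above.
olo_start, olo_end = 14.06, 14.00  # OLO (long oil)
szo_start, szo_end = 46.15, 44.31  # SZO (short oil)

def pct_change(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100.0

# $100 in each fund on Jan 8, 2010, valued on Dec 31, 2010.
olo_value = 100.0 * olo_end / olo_start
szo_value = 100.0 * szo_end / szo_start
combined_change = pct_change(200.0, olo_value + szo_value)

print(f"OLO: {pct_change(olo_start, olo_end):+.1f}%")
print(f"SZO: {pct_change(szo_start, szo_end):+.1f}%")
print(f"$100 of OLO + $100 of SZO: {combined_change:+.1f}%")
```

The combined position comes out a little over 2% lower, even though one leg is long oil and the other is short.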

Don't buy-and-hold ETFs

Some people believe that oil will run out in the next 20 or 30 years. And they might be right, but buying-and-holding an ETF like OLO isn't a good way to make money in the long term. During those 20 or 30 years, OLO is sinking by a few percent each year.

ETFs: the ultimate bookie

On a side note, I'd like to point out how awesome it must be for the ETF companies, like PowerShares (the guys who run OLO and SZO). You're basically just a middle-man between someone betting for the price of oil, and someone else betting against the price of oil. Effectively, the ETF company is just a bookie, and doesn't care which way the price of oil goes. There's very little risk, and they are happy to silently take 3% of your money each year.

Most useful tools for diagnosing UNIX system performance.

Dustin Boswell — Sun, 24 Mar 2013 00:00:00 GMT

Summary:

Command	Example	Install	What it does
top	top	(built-in)	Interactive overview of machine
vmstat	vmstat 2	(built-in)	Overview of memory, swap, cpu, disk
iostat	iostat -dmx 2	sudo apt-get install sysstat	Disk utilization, throughput
iftop	iftop	sudo apt-get install iftop	Interactive overview of network traffic
tcpflow	sudo tcpflow -i any -C -e port 80	sudo apt-get install tcpflow	Sniff live network traffic
lsof	sudo lsof -i TCP	(built-in)	Show processes with open files/sockets

`top` explained

Ignore the "load". It just confuses people, and a 'bad' value depends on how many cores you have. Instead, just look at the CPU usage numbers ("user", "system", "nice", "idle", "iowait"), which are percentages averaged over all cpus. As long as the "idle" value is above 0%, your system probably isn't overloaded (at least, not in a way that having more CPU would help). The "iowait" value is confusing, so I would ignore it (if it's high, that means that a faster disk would improve throughput of your system, but if it's near-0, then either you don't have much io waiting, or that io wait time was used by some other cpu-busy process).
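If you want those same CPU percentages in a script, one way (on Linux) is to compare two samples of the aggregate "cpu" line in /proc/stat. The sketch below is not how top itself is implemented; it just illustrates the idea, and the two sample lines are made-up numbers rather than real measurements.

```python
def cpu_percentages(before: str, after: str) -> dict:
    """Percent of CPU time per category between two 'cpu ...' lines from /proc/stat.

    The counters after the 'cpu' label are: user nice system idle iowait irq
    softirq steal (guest time is already counted inside user, so it is skipped).
    """
    a = [int(v) for v in before.split()[1:9]]
    b = [int(v) for v in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas) or 1  # avoid dividing by zero if the samples are identical
    names = ["user", "nice", "system", "idle", "iowait"]
    return {name: 100.0 * delta / total for name, delta in zip(names, deltas)}

# Hypothetical samples taken a couple of seconds apart. On a real machine you
# would read the first line of /proc/stat twice instead.
sample_1 = "cpu 1000 0 500 8000 500 0 0 0 0 0"
sample_2 = "cpu 1600 0 700 8100 600 0 0 0 0 0"

usage = cpu_percentages(sample_1, sample_2)
for name, pct in usage.items():
    print(f"{name:>7}: {pct:5.1f}%")
```

In this made-up sample, "idle" is still above 0%, so by the rule of thumb above the machine isn't out of CPU yet.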

I usually hit 'M' to sort by memory usage -- the biggest processes are usually the most interesting ones. The "virtual" memory size is misleading -- this is how much memory the process would use if it touched all the memory it was given. The "resident (RES)" memory usage is the one that matters. The "shared" column is mostly useless, so ignore it (for instance, it doesn't account for memory that forked processes still share because copy-on-write hasn't kicked in yet).

`vmstat` explained

The runnable-process count ("--procs-- > r") gives you a sense of how many processes are using (or want) CPU at the moment. The idle percent ("--cpu-- > id") lets you know how often the CPU has nothing to do. If this is 0, then your system is CPU-limited at the moment. The swap columns ("--swap-- > si so") show the amounts "swapped in" and "swapped out". If these are constantly above 0, then your system is swapping to disk a lot, which is probably bad. The memory columns ("--memory-- > free buff cache") show how much memory is free, being used for buffers, or being used for file cache. If these numbers are low (less than 1000 KB each), then your system probably doesn't have enough memory for what it wants.
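Those rules of thumb are easy to apply mechanically. Here is a rough sketch that takes the vmstat column-header line and one data row and returns the warnings described above. The function name and both sample rows are hypothetical. A single row can't tell you whether swapping is "constant", so in practice you would check several rows in a row.

```python
def check_vmstat(header: str, row: str) -> list:
    """Apply the rules of thumb above to one row of vmstat output."""
    stats = dict(zip(header.split(), (int(v) for v in row.split())))
    warnings = []
    if stats["id"] == 0:
        warnings.append("CPU-limited: idle is 0%")
    if stats["si"] > 0 or stats["so"] > 0:
        warnings.append("swapping: si/so above 0")
    if all(stats[key] < 1000 for key in ("free", "buff", "cache")):
        warnings.append("low memory: free, buff and cache all under 1000 KB")
    return warnings

# Column names as printed on vmstat's second header line.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"

# Made-up rows: one struggling machine, one healthy machine.
struggling = "5 0 204800 812 640 900 120 340 10 20 900 1500 85 15 0 0 0"
healthy = "1 0 0 512000 40000 800000 0 0 5 10 200 300 5 2 93 0 0"

print(check_vmstat(HEADER, struggling))
print(check_vmstat(HEADER, healthy))
```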

`iostat` explained

Ignore the first line of output (those are summary stats since bootup, which is rarely useful). Focus on the last column (utilization). If it's near 100%, then your system has a disk bottleneck at the moment.
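As a sketch of how you might watch that column from a script: the function below scans one iostat report for the "%util" column and returns the devices at or above a threshold. The 90% threshold and the sample report are assumptions for illustration. The sample is also trimmed to a few columns (real output has more, and the exact set varies between sysstat versions), which is why the code looks the column up by name instead of by position.

```python
def busy_disks(report: str, threshold: float = 90.0) -> dict:
    """Return {device: %util} for devices whose utilization is >= threshold."""
    util_index = None
    busy = {}
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Header row: remember where the utilization column is.
            util_index = fields.index("%util")
            continue
        if util_index is not None and len(fields) > util_index:
            util = float(fields[util_index])
            if util >= threshold:
                busy[fields[0]] = util
    return busy

# Hypothetical, shortened report (not the first since-boot report).
SAMPLE_REPORT = """\
Device  r/s    w/s   rMB/s  wMB/s  await  %util
sda     310.0  95.0  12.4   3.1    48.2   98.7
sdb     2.0    1.0   0.1    0.0    1.3    0.9
"""

print(busy_disks(SAMPLE_REPORT))
```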

Installing djb-dns on a Linux machine.

Dustin Boswell — Wed, 23 Feb 2011 00:00:00 GMT

Down below is a script you can use to install djb-dns on a Linux system (like Ubuntu).

Specifically, it will install dnscache (a local caching nameserver) which resolves any domain name into an IP address. This is much like Google's public 8.8.8.8 DNS server.

Background on DNS lookups

To be clear: dnscache is not an "authoritative" dns server. A dns cache is simply a middle-man that executes global dns lookups on behalf of an incoming query, and caches the result for subsequent queries. See this clarification.

When a program does a dns lookup (turning a domain name into an IP, or vice versa) it uses a dns client library (e.g. calling the UNIX function gethostbyname()) to connect to a ("recursive") domain name server. That server (typically hosted by your ISP) does all the dirty work of first talking to the root-name-servers and going down the tree of DNS lookups until the full domain name is completely resolved.
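From a program's point of view, all of that work is hidden behind one library call. As a minimal illustration, Python's socket module wraps the same resolver functions; the two helper names below are made up for this example.

```python
import socket

def forward_lookup(hostname: str) -> str:
    """Domain name -> IPv4 address, via the system resolver (gethostbyname)."""
    return socket.gethostbyname(hostname)

def reverse_lookup(ip_address: str) -> str:
    """IP address -> primary host name, via the system resolver (gethostbyaddr)."""
    primary_name, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    return primary_name

# Example calls (they need a working network and resolver, so they are
# left commented out here):
# print(forward_lookup("www.google.com"))
# print(reverse_lookup("8.8.8.8"))
```

Each call blocks until the recursive name server answers, which is why a slow or flaky server shows up directly as latency in your program.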

The file /etc/resolv.conf contains the IP address(es) of the domain name server(s) your system is using. It is a small file that typically looks something like:

nameserver a.b.c.d
nameserver e.f.g.h
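If you want to check from a script which servers a machine is using, the file is simple enough to parse by hand. This is a rough sketch that only looks at "nameserver" lines and ignores every other directive; the sample text is made up.

```python
def nameservers(resolv_conf_text: str) -> list:
    """Return the IP addresses listed on 'nameserver' lines, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

# Hypothetical file contents. On a real system you would pass in
# open("/etc/resolv.conf").read() instead.
SAMPLE = """\
# generated by the system
nameserver 8.8.8.8
nameserver 127.0.0.1
search example.com
"""

print(nameservers(SAMPLE))
```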

Why do I need to run my own dns cache?

The dns cache servers that your ISP is hosting typically aren't very good. Those servers are overloaded, not well maintained, etc... If you are doing a high volume of dns-lookups they won't keep up. For instance, you might be running a web crawler, or doing reverse-lookups on all the IP addresses that visit your site. Your ISP's servers will introduce latency and flakiness. I've personally dealt with 3 ISPs whose servers started returning errors because my volume was too high.

I've even run my own dns cache on my home Linux desktop because my home ISP's was so bad. (Nowadays I just use 8.8.8.8 for my home networks.)

What's so special about djb-dns?

It's rock-solid. It's written by this crazy-smart guy who knows his shit, and even has an unclaimed $1000 prize to find a security bug.

I've used it multiple times and haven't had any problems. The only downside is it's a pain-in-the-ass to install. Thankfully, I've gone through the headache for you.

The Install Script

# Must be run as root
# Also see http://hydra.geht.net/tino/howto/linux/djbdns/

#Create a /package directory:
mkdir -p /package
chmod 1755 /package

cd /package
wget http://cr.yp.to/daemontools/daemontools-0.76.tar.gz
gunzip daemontools-0.76.tar.gz
tar -xpf daemontools-0.76.tar
rm daemontools-0.76.tar
cd admin/daemontools-0.76
# Apply dumb patch to make things compile
cd src; echo gcc -O2 -include /usr/include/errno.h > conf-cc; cd ..
./package/install

cd /package
wget http://cr.yp.to/ucspi-tcp/ucspi-tcp-0.88.tar.gz
rm -rf ucspi-tcp-0.88
tar xfz ucspi-tcp-0.88.tar.gz
cd ucspi-tcp-0.88
# Apply dumb patch to make things compile
echo gcc -O2 -include /usr/include/errno.h > conf-cc
make
make setup check

cd /package
wget http://cr.yp.to/djbdns/djbdns-1.05.tar.gz
gunzip djbdns-1.05.tar.gz
tar -xf djbdns-1.05.tar
cd djbdns-1.05
# Apply dumb patch to make things compile
echo gcc -O2 -include /usr/include/errno.h > conf-cc
# Allow more simultaneous dns requests
sed -i -e "s/MAXUDP 200/MAXUDP 600/g" dnscache.c
make
make setup check

########## Install Users and Service directories ###########
groupadd dnscache
useradd -g dnscache dnscache
useradd -g dnscache dnslog
/usr/local/bin/dnscache-conf dnscache dnslog /var/dnscache
ln -s /var/dnscache /service

# Fix the nameservers to point to current ICANN structure 
# This assumes you have dig installed 
# Patch in the current list of root servers  
for a in a b c d e f g h i j k l m
do
  dig +short $a.root-servers.net.
done > /var/dnscache/root/servers/\@

# Increase the cache to 100MB
echo 100000000 > /service/dnscache/env/CACHESIZE
echo 104857600 > /service/dnscache/env/DATALIMIT

# Change multilog to keep more logs
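# (single quotes so that "!" isn't treated as history expansion if you paste this into an interactive shell)
echo '#!/bin/sh' > /service/dnscache/log/run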
echo "exec setuidgid dnslog multilog t s10000000 ./main" >> /service/dnscache/log/run

Now all the tools and binaries are installed. To verify that the tools were installed you can do:

dnsip www.google.com

Now you just have to kick-off the dnscache server and update /etc/resolv.conf. You will want to run the following script at system startup (if you don't, the file /etc/resolv.conf might get over-written by your system):

# Must be run as root
rm -rf /etc/resolv.conf.prev
mv /etc/resolv.conf /etc/resolv.conf.prev
echo "nameserver 127.0.0.1" > /etc/resolv.conf

## init q  # (is this needed?)
/command/svscanboot &
sleep 5
svc -u /service/dnscache   # -u brings the service up. FYI: -t sends TERM, which restarts it
svstat /service/dnscache
svc -t /service/dnscache/log

Enjoy!

How to Fix the Sharp Edge on your MacBook Pro

Dustin Boswell — Tue, 01 Dec 2009 00:00:00 GMT

Problem:

The MacBook Pro has ridiculously sharp edges. After using it for an hour my wrists had annoying (and painful) indentation lines on them. It doesn't bother everyone, but I'm not the only one.

Solution:

It's pretty easy to shave that edge down with some tools. It only took 10 minutes, and it looks as good as if it was manufactured that way. I wish I had done it sooner.

I was a little hesitant to take a power tool to my new $1700 computer, but in the end it was really easy. The worst case is probably that you'd just scratch the case a little. But in my case it came out smooth and flawless.

And I assume you know how to be safe with power tools - wearing goggles and all that ... If you had a lot of time, and a few emery boards, I suppose you could do it without a Dremel, but I got impatient :)

Step 1: Position your laptop

Cover your open MacBook with a T-shirt to stop the rest of your laptop from getting any aluminum dust in it. Only the edge facing you needs to be exposed.

Then put your laptop on a table so that edge can hang off. You probably want your computer turned off during this time.

Step 2: Dremel

I used various Dremel tips (mostly the stone tips), and never found the perfect one. I'm not sure it really matters though. I turned on the drill at a medium speed and went across the whole length of the edge (with the bit angled so that it would "pull" you across the length of the edge). I only shaved across the front edge - the side edges have the external connection slots, so it seemed more dangerous, plus my wrists aren't bothered by those edges.

At first I was worried that I would shave away too much and cut into the motherboard, but it's not worth worrying about. There's plenty of aluminum between you and the inside (judging by that notch where you open it, at least 1/8th of an inch, maybe more). And you're only looking to shave away 1/32 of an inch or so (depending on your preferences). So as long as you keep moving the drill back and forth across the edge, you'll get a uniform shave that only takes a little off.

It's okay if there are grooves or it looks rough right now. But be careful to only apply the drill to the aluminum right on the very edge - you don't want to leave scratches elsewhere. Also, there will be a lot of aluminum dust, so you'll probably want to wipe that off with a wet napkin every once in a while.

Step 3: Polishing

Now it's time for the emery board. The finer-grit the better. I don't think an ordinary nail file (or sandpaper for that matter) will do as nice a job.

I moved the emery board across the whole length, back and forth, again and again, quickly scrubbing it down. Tilt the board all the way up, and all the way down – you want to get a nice rounded finished look.

(Again, this step will generate lots of aluminum dust, so you'll need to stop every once in a while to clean it off with a wet napkin.) You can't overdo this step, so be sure to spend at least 5 minutes with the emery board.

You're Done

If you did it right, you'll have a rounded edge that looks natural and is very smooth to the touch – like it was made that way.
