<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Recent Articles at 20bits</title>
    <description>20bits by Jesse Farmer</description>
    <link>http://20bits.com/</link>
    <item>
      <title>Keith Olbermann Thinks I'm an Idiot</title>
      <description>&lt;img class="math" src="http://assets.20bits.com/20120410/olbermann-hero.jpg" style="width:629px" alt="Keith Olbermann: Genius"&gt;

&lt;p&gt;This story ends with Keith Olbermann dismissing me as "another idiot" on national television, but it begins on a Monday morning with me sitting on my brown leather IKEA couch in &lt;a href="http://en.wikipedia.org/wiki/Palo_Alto,_California"&gt;Palo Alto&lt;/a&gt;, two blocks from Facebook's then-new &lt;a href="http://en.wikipedia.org/wiki/College_Terrace,_Palo_Alto,_California"&gt;College Terrace&lt;/a&gt; office.  Five months earlier I had started a company with &lt;a href="http://about.me/zellunit"&gt;Matt Humphrey&lt;/a&gt;, &lt;a href="http://twitter.com/joedamato"&gt;Joe Damato&lt;/a&gt;, and &lt;a href="http://twitter.com/tmm1"&gt;Aman Gupta&lt;/a&gt; called Bumba Labs.&lt;/p&gt;

&lt;p&gt;Our first application, Polls, let anyone on Facebook create or vote in a poll.  When someone voted they were prompted to post their response to their newsfeed, where all their friends could see it.  That was enough to make it grow exponentially, and within a month there were millions of people voting.  Over 10,000 polls were created each day.  Even Mark Zuckerberg used our app to create a poll about Gidget, the Taco Bell dog, who had died earlier that year.  Not a bad start.&lt;/p&gt;

&lt;p&gt;By the end, Facebook, the Secret Service, and every major news station would be involved.&lt;/p&gt;

&lt;h3&gt;Monday, September 2009, 9AM&lt;/h3&gt;

&lt;p&gt;I sat down on my couch, opened my laptop, and logged on to Facebook to see how our application was doing.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The application "Polls" is temporarily unavailable due to an issue with its third-party developer. We are investigating the situation and apologize for any inconvenience.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the message I saw when I visited our application that morning.  "Today," I thought, "is going to be awesome!"&lt;/p&gt;

&lt;p&gt;Every successful Facebook developer has seen that message at least once.  It means Facebook found something they didn't like in your application and decided to take it down.  Normally they'd warn you a few days in advance and tell you to fix it or else.  I double-checked my email and, nope, no warning — the app was just gone.&lt;/p&gt;

&lt;p&gt;It would take Facebook at least a day to respond to any questions I had, so in the meantime I connected to the Polls database to see if I could spot anything unusual.  At Facebook's request we had implemented a feature to report offensive polls a few months earlier, and now took time each morning to spot and delete any truly awful polls.&lt;/p&gt;

&lt;p&gt;People would report a poll for anything, offensive or not: the poll's prompt didn't agree with their politics, there was a typo in one of the answers, the poll's photo offended them, etc.  For most polls there was one report every hundred votes or so, but today there was a poll with only a few hundred votes and thousands of reports.&lt;/p&gt;
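&lt;p&gt;The signal itself is simple enough to sketch.  This is a hypothetical version of the check, not our actual code or database schema; it just flags any poll whose report-to-vote ratio blows far past that one-per-hundred baseline:&lt;/p&gt;

```python
# Hypothetical sketch of the anomaly signal described above -- not the
# actual Bumba Labs code.  Baseline: roughly one report per hundred votes.
BASELINE = 1 / 100

def is_anomalous(votes, reports, multiplier=10):
    """Flag a poll whose report rate exceeds `multiplier` times the baseline."""
    if votes == 0:
        return reports > 0
    return reports / votes > BASELINE * multiplier

# A viral poll with a normal report rate vs. the poll in question.
is_anomalous(votes=1_000_000, reports=9_000)  # normal: 0.9 reports per 100 votes
is_anomalous(votes=300, reports=2_000)        # anomalous: ~667 reports per 100 votes
```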

&lt;p&gt;What was it?&lt;/p&gt;

&lt;p&gt;&lt;img src="http://assets.20bits.com/20120410/original.jpeg" class="math"&gt;&lt;/p&gt;

&lt;h3&gt;The Best Laid Plans: 10AM&lt;/h3&gt;

&lt;p&gt;I knew in my gut this was why Facebook shut down our application, but it was still strange that they hadn't warned us.  Playing dumb, I sent an email to someone I knew on the Facebook policy enforcement team.&lt;/p&gt;

&lt;pre&gt;
Hey XXX,

Polls is down and it's displaying the TOS violation screen:

The application "Polls" is temporarily unavailable due to an     
issue with its third-party developer. We are investigating 
the situation and apologize for any inconvenience.

Any idea what's up?

Cheers,
Jesse
&lt;/pre&gt;

&lt;p&gt;Our users had created horrible polls in the past, asking questions like "Should gay people be lynched?" or "Is Mrs. Jones the English teacher at such-and-such High School a racist?"  We'd delete those polls as soon as we found them and ban whoever made them from ever using Polls again.  In this case, the poll was created by a middle school girl from Orange County, California.  She'll be graduating from high school this year.&lt;/p&gt;

&lt;p&gt;This looked less like someone earnestly plotting to kill Obama and more like a bored kid phoning in a fake bomb threat to their high school.  Thankfully, only a few hundred people voted in the poll — the truly viral polls had millions of votes.  How did Facebook find this if it had so few votes, though?  It took at least several thousand votes over the previous hour to break into the list of popular polls, and nobody outside Bumba saw the complaints.&lt;/p&gt;

&lt;p&gt;I decided to wait until Facebook got back to me before I did anything else.  The poll was removed, the provocateur banned, and hardly anyone had voted in the poll.&lt;/p&gt;

&lt;p&gt;I left Palo Alto to spend the day working in San Francisco out of Matt, Joe, and Aman's apartment.&lt;/p&gt;

&lt;h3&gt;The Carnival Begins: 12PM&lt;/h3&gt;

&lt;p&gt;When I arrived at their apartment, Matt, Joe, and Aman had already seen the app was down, so I explained what I had found and we went to work on other projects.  About twenty minutes later, Aman pointed at his monitor and shouted: "Hey, we're on the Huffington Post!  Polls is on the Huffington Post!"&lt;/p&gt;

&lt;p&gt;And there it was on the &lt;a href="http://www.huffingtonpost.com/2009/09/28/obama-facebook-poll-asks_n_301860.html"&gt;Huffington Post&lt;/a&gt;.  And the &lt;a href="http://www.huffingtonpost.com/2009/09/28/kill-obama-facebook-poll-_n_302090.html"&gt;Associated Press Wire&lt;/a&gt;.  And all over the blogosphere.&lt;/p&gt;

&lt;p&gt;After a bit of digging we found "patient zero," a small liberal blog called &lt;a href="http://thepoliticalcarnival.blogspot.com/2009/09/screen-grab-facebook-poll-should-obama.html"&gt;The Political Carnival&lt;/a&gt;.  They started a campaign late Sunday evening to call Facebook, the FBI, and the Secret Service, which quickly spread to larger communities like &lt;a href="http://www.dailykos.com/story/2009/09/28/787194/-UPDATE-3-Facebook-thank-you-for-finally-deleting-"&gt;DailyKos&lt;/a&gt;.  One problem: nobody seemed to understand the difference between Facebook, developers on the Facebook Platform, and Facebook users creating polls with our app.  They were blaming Facebook for hosting the poll and us for creating it!&lt;/p&gt;

&lt;h3&gt;The National Stage&lt;/h3&gt;

&lt;p&gt;The poll was created Sunday morning, discovered by a single blog Sunday evening, and had become a national news story by the time I woke up Monday.  Nobody had reached out to us yet, either, so most articles about the poll were filled with wild speculation.  Left-leaning outfits assumed the person who created it was a middle-aged white man out to kill Obama.  Right-leaning outfits assumed this was a liberal plant, designed to make the right look bad.  Everyone tried to weave this incident into a larger story about race, politics, and the direction of American society.&lt;/p&gt;

&lt;p&gt;I wondered if they would be embarrassed to know that their outrage was unknowingly directed at a 14-year-old girl.  Did they even know how to feel embarrassed?&lt;/p&gt;

&lt;p&gt;&lt;a href="http://twitter.com/eldon"&gt;Eric Eldon&lt;/a&gt; and &lt;a href="http://twitter.com/justinsmith"&gt;Justin Smith&lt;/a&gt; from Inside Network were the first people to reach out.  Eric had been a journalist in Silicon Valley for several years, so he understood how hard it was for any site to police user-generated content.  &lt;a href="http://www.insidefacebook.com/2009/09/28/the-obama-assassination-poll-another-story-about-offensive-user-generated-content/"&gt;His writeup at Inside Facebook&lt;/a&gt; was the most level-headed account by a mile or ten.&lt;/p&gt;

&lt;p&gt;The second person to reach out called with no warning.  I assumed it was a journalist trying to catch me off guard, but instead the caller introduced himself: "This is Special Agent Mark Weller from the United States Secret Service, San Jose office."  Because of their location, he told me, they work with Facebook and other Silicon Valley companies all the time, and were used to dealing with complaints about user-generated content.  I gave him the identity of the girl who created the poll and we ended the call in under 15 minutes.&lt;/p&gt;

&lt;p&gt;Self-deprecating, sarcastic humor being my default coping mechanism, I tweeted the following:&lt;/p&gt;

&lt;blockquote class="twitter-tweet tw-align-center"&gt;&lt;p&gt;Life TODO: [X] Have a phone conversation with an agent from the US Secret Service&lt;/p&gt;— Jesse Farmer (@jessefarmer) &lt;a href="https://twitter.com/jessefarmer/status/4455778065" data-datetime="2009-09-28T23:26:06+00:00"&gt;September 28, 2009&lt;/a&gt;&lt;/blockquote&gt;

&lt;script src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;After I got off the phone with Agent Weller I immediately called Bumba's lawyer: "I'm not sure they'll send one, but don't be surprised if you get a subpoena from the Secret Service.  I gave them your fax number."  The call every lawyer wants to receive.&lt;/p&gt;

&lt;h3&gt;From Would-be Assassin to Idiot&lt;/h3&gt;

&lt;p&gt;Once my name was out there, the coverage escalated.  I was getting unprompted calls from reporters who wanted to talk to "the polls developer."  People would email me asking why I wanted to assassinate Obama.  When I explained that I actually worked on the Obama campaign in 2008, their brains melted.  It just didn't fit into the story they had been telling themselves since the news broke.&lt;/p&gt;

&lt;p&gt;Here are some selected pieces of coverage:&lt;/p&gt;

&lt;div class="math"&gt;
&lt;script src="http://i.cdn.turner.com/cnn/.element/js/2.0/video/evp/module.js?loc=dom&amp;vid=/video/us/2009/09/29/nr.obama.facebook.poll.cnn" type="text/javascript"&gt;&lt;/script&gt;&lt;noscript&gt;Embedded video from &lt;a href="http://www.cnn.com/video"&gt;CNN Video&lt;/a&gt;&lt;/noscript&gt;

&lt;iframe width="420" height="315" src="http://www.youtube.com/embed/l7ohjT9-8X4" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;The best part came Tuesday night, though.  My phone rang.  It was my mother.  Her voice was hurried and cracking a little.  Before I could ask what was wrong she blurted out, "Jesse, Keith Olbermann just called you an idiot on national television."  She was concerned this would hurt my reputation.&lt;/p&gt;

&lt;p&gt;I went through the Countdown transcripts and read his refreshingly thoughtful commentary:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This as the fallout continues over that poll from Facebook. It continues. The Secret Service is now investigating the threat. And the developer of the online application that was used to create the survey has come forward, telling "Politico" "there is definitely a culture of paranoia and fear, and I think both sides are reacting in extreme ways. People carrying guns to town hall meetings. That's scary. People losing their cool over an internet poll like this that doesn't calm the situation."&lt;/p&gt;


&lt;p&gt;Posting such a poll cools and calms the situation? Another idiot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thanks to Keith, I finally recognized my own idiocy and spent the next four months at an "idiot detox" center in eastern Oregon.&lt;/p&gt;

&lt;p&gt;Later that week the Secret Service announced they had investigated the lead and found no credible threat.  By then the media had moved on to the next banal controversy that could generate outrage and attention, so hardly anyone noticed.&lt;/p&gt;

&lt;h3&gt;Fallout&lt;/h3&gt;

&lt;p&gt;I'll stop myself from commenting on the "state of the media" and such.  It's cliché at this point to say that people like Keith Olbermann and the bloggers who first broke this story aren't interested in the truth, but instead their own aggrandizement.  (Oops.)&lt;/p&gt;

&lt;p&gt;Community sites like DailyKos were better at understanding what happened, especially after I took the time to &lt;a href="http://www.dailykos.com/story/2009/09/28/787366/-Facts-from-the-developer-of-the-Facebook-Poll-Application"&gt;patiently answer all their questions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As for the fallout, Polls was down throughout this episode.  Because it spread virally by posting people's votes to their newsfeed, the three or four days of downtime halted all growth.  Traffic dropped to nothing and never recovered.  It wouldn't have mattered much, anyhow, because three months later Facebook changed how the newsfeed worked and made it nearly impossible for apps to grow through the newsfeed alone.&lt;/p&gt;

&lt;p&gt;The four of us shut down Bumba about a month later.  Matt, Joe, and Aman, along with &lt;a href="http://www.crunchbase.com/person/jared-kopf"&gt;Jared Kopf&lt;/a&gt;, went on to start &lt;a href="http://www.crunchbase.com/company/homerun"&gt;HomeRun&lt;/a&gt;. It was at this time, too, that Matt introduced me to &lt;a href="http://www.linkedin.com/in/mpreysman"&gt;Michael Preysman&lt;/a&gt;, a friend of his from &lt;a href="http://www.cmu.edu"&gt;CMU&lt;/a&gt;.  Less than a year later Michael and I started &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In fact, here is the very first email I ever sent Michael:&lt;/p&gt;

&lt;pre&gt;
From: Jesse Farmer &lt;jesse@20bits.com&gt;
To: Michael Preysman &lt;mpreysman@gmail.com&gt;
Date: Mon, Sep 28, 2009 at 12:12 PM
Subject: If you get a call from the Secret Service…

Here's why: 
http://www.sfgate.com/cgi-bin/article.cgi?f=/n/a/2009/09/28/national/w115451D54.DTL

You're still listed as a developer :|

- J
&lt;/pre&gt;

&lt;p&gt;I sure know how to make a graceful first impression.&lt;/p&gt;</description>
      <pubDate>Wed, 11 Apr 2012 04:59:37 +0000</pubDate>
      <link>http://20bits.com/article/keith-olbermann-thinks-i-m-an-idiot</link>
    </item>
    <item>
      <title>Getting Ahead: A Letter to Myself</title>
      <description>&lt;blockquote&gt;
&lt;p&gt;All the pulses of the world,&lt;br&gt;
Falling in they beat for us, with the Western movement beat, &lt;br&gt;
Holding single or together, steady moving to the front, all for us,&lt;br&gt;
Pioneers! O pioneers!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I moved to Silicon Valley the summer of 2006, as soon as I graduated from the University of Chicago.  Two college friends, &lt;a href="http://laptopandarifle.wordpress.com/"&gt;Ryo Chijiiwa&lt;/a&gt; and &lt;a href="http://agnoster.net/"&gt;Isaac Wolkerstorfer (né Wasileski)&lt;/a&gt;, asked me to join their startup &lt;a href="http://chicagomaroon.com/2005/02/04/new-website-showcases-campus-media-libraries/"&gt;OpenHive&lt;/a&gt;, a "social" search engine for college campuses that allowed students to search each other's bookshelves.  I had no expectations.  In fact, before my plane landed in San Jose, I had never even set foot in California.&lt;/p&gt;

&lt;p&gt;I was underprepared.  I suppose everyone is, though.  This is the letter I wish someone had written me that summer.  Since nobody did, I'll have to write it to myself six years later.&lt;/p&gt;

&lt;h3&gt;Dear Jesse&lt;/h3&gt;

&lt;p&gt;Dear Jesse,&lt;/p&gt;

&lt;p&gt;Hello from the future!  You're about to move to California and help Yitz and Ryo with their startup.  There's so much you're going to fuck up, but it's all worth it — honest.  I want to help you get ahead.&lt;/p&gt;

&lt;p&gt;The best thing about Silicon Valley is how friendly, open, and helpful everyone is.  The chattering class can be cynical, but this is where the future is getting built if you look hard enough.  Do look hard enough.&lt;/p&gt;

&lt;p&gt;Here's the big secret: do valuable work and share it.  People out here spend so much time talking about who's up, who's down, who's working with whom, who raised money and on what terms, who sold their company and for how much, etc.  Twitter has nothing on the speed at which gossip travels in Silicon Valley.  It's the work that earns you respect and credibility in the end, though.&lt;/p&gt;

&lt;p&gt;Forget meetups, getting coffee, and "quick" phone calls.  Doing valuable work and sharing it is the best way to build a network — it becomes your calling card.  Idle meetings are the Silicon Valley equivalent of showing up empty-handed to a potluck.  Everyone is happier if you bring something unique and delicious.  Until you can do that, you're better off practicing your kitchen skills.  You want to be able to point to something fantastic and say "I did &lt;em&gt;that&lt;/em&gt;."&lt;/p&gt;

&lt;p&gt;It's hard to know whether your work is valuable, particularly while you're in the middle of doing it.  What if it's not good enough?  What if people you respect think poorly of it?  Being dissatisfied with your own work is what pushes you to improve, but don't give yourself too much credit.  Most people won't think anything of it at all.  You have to trust yourself that if you're thoughtful enough and prolific enough, something amazing will happen.&lt;/p&gt;

&lt;p&gt;Small work can be valuable, too.  When I was first learning Erlang there were no tutorials outside the official and very opaque documentation, so I took the time to write the tutorials I wish existed.  If you think it's valuable someone else will, too.  It might even be an opportunity to work with them.  Don't be trapped by thinking that all your work has to be momentous.  Seeds aren't momentous.&lt;/p&gt;

&lt;p&gt;I know you can be independent and stubborn, but don't be afraid to ask for help in your work.  You'll be surprised at how helpful people are, even people you've only met once or twice.  The Midwesterner in you will say it's impolite to be a burden on other people's time.  He's being overcautious (that's a polite way of saying he's full of shit).  Ask for help especially when you're about to do something you've never done before.&lt;/p&gt;

&lt;p&gt;Conversely, don't take everyone's advice to heart.  You can get every possible piece of advice, if you want.  Take the job, don't take the job.  Work with him, don't work with him.  Hire that guy, don't hire that guy.  Take the money, don't take the money.  You'll feel like a ping pong ball if you try to listen to it all.&lt;/p&gt;

&lt;p&gt;Speaking of advice, someone will give you a copy of this poem when it really matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;since feeling is first&lt;br&gt;
who pays any attention&lt;br&gt;
to the syntax of things&lt;br&gt;
will never wholly kiss you;&lt;br&gt;
wholly to be a fool&lt;br&gt;
while Spring is in the world&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It applies to everything.  Life, love, work, business.  You're great at being logical, mathematical, and methodical.  If that's everything, though, you run the risk of being effective but uninspiring.  Remember that poem and be more audacious (in everything).  When it comes to inspiring people the Daniel Burnham quote — "Make no little plans. They have no magic to stir men's blood." — is absolutely true.&lt;/p&gt;

&lt;p&gt;Finally, surround yourself with talented people you trust and respect.  Keep them close, help them every chance you get, and make sure they know how much they matter.  These are the people who will help you regardless of how much help you can offer in return, and will keep you grounded when you're about to do something really crazy.&lt;/p&gt;

&lt;p&gt;I hope you don't find this letter too self-absorbed.  I thought of ways to make it more clever, funnier, etc., but decided to opt for plain-spoken and sincere.  If it annoys you, well: fuck off. ;)&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Jesse&lt;/p&gt;

&lt;h3&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;Thanks to &lt;a href="http://www.davidcole.me/"&gt;David Cole&lt;/a&gt;, &lt;a href="http://timetobleed.com/"&gt;Joe Damato&lt;/a&gt;, Raph Lee, and David Kaye for reading earlier drafts of this essay.&lt;/p&gt;
&lt;p&gt;Have your own letter you wish you'd received when you were just starting out in your career?  &lt;a href="mailto:jesse@20bits.com"&gt;Send me&lt;/a&gt; a link — I'd love to compile a list.&lt;/p&gt;
&lt;p&gt;If you like this essay, &lt;a href="http://twitter.com/jfarmer"&gt;follow me on Twitter&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Tue, 03 Apr 2012 17:58:59 +0000</pubDate>
      <link>http://20bits.com/article/getting-ahead-a-letter-to-myself</link>
    </item>
    <item>
      <title>The Value of a Social Commerce Referral</title>
      <description>&lt;p&gt;At &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt; we recently shut down a side-project of ours, &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;, where we were selling &lt;a href="http://www.indiecases.com/collections/frontpage/products/made-to-measure"&gt;some&lt;/a&gt; &lt;a href="http://www.indiecases.com/collections/frontpage/products/staches-in-space"&gt;pretty&lt;/a&gt; &lt;a href="http://www.indiecases.com/collections/frontpage/products/peace"&gt;sweet&lt;/a&gt; iPhone cases.  Over the month and a half that the store was live we saw some good traffic from up-and-coming social commerce sites like &lt;a href="http://svpply.com"&gt;Svpply&lt;/a&gt; and &lt;a href="http://pinterest.com"&gt;Pinterest&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;
I thought I'd take the time to share some of the numbers.
&lt;/p&gt;

&lt;h3&gt;The Sites&lt;/h3&gt;
&lt;p&gt;
Startup ideas come in bunches, and these social commerce sites are no exception.  They are all built around aiding discovery, typically by allowing users to share "finds" from around the web and build a following.
&lt;/p&gt;
&lt;p&gt;
These finds may or may not be products with a price tag attached, and a site may restrict them to a specific category (e.g., women's high fashion).
&lt;/p&gt;
&lt;p&gt;
Here's a list of all the sites I know of.  If you know of any others, please &lt;a href="mailto:jesse@20bits.com"&gt;send me an email&lt;/a&gt; and I'll update the list.&lt;/p&gt;
&lt;ul style="list-style: none;margin-left: 20px;"&gt;
&lt;li&gt;&lt;a href="http://buyosphere.com"&gt;Buyosphere&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ly.st"&gt;Ly.st&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://notcouture.notcot.org/"&gt;NotCouture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pinterest.com"&gt;Pinterest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://svpply.com"&gt;Svpply&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://thefancy.com"&gt;The Fancy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Of these sites, only Pinterest, Svpply, and The Fancy sent any traffic to &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Numbers&lt;/h3&gt;
&lt;p&gt;
The hope for these startups is that referrals from these sites are worth more than the average, non-qualified visitor.  Some, like The Fancy, are even experimenting &lt;a href="http://shop.thefancy.com"&gt;directly with commerce&lt;/a&gt;.
&lt;/p&gt;

&lt;table class="monthly-data" style="margin: 0 auto"&gt;
  &lt;tr class="top"&gt;
    &lt;th colspan="3"&gt;Value of a Social Commerce Visitor&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr class="odd"&gt;
    &lt;td&gt;Site&lt;/td&gt;
    &lt;td&gt;Conversion %&lt;/td&gt;
    &lt;td&gt;$/visitor&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Pinterest&lt;/td&gt;
    &lt;td&gt;0.93%&lt;/td&gt;
    &lt;td&gt;$0.31&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class="odd"&gt;
    &lt;td&gt;Svpply&lt;/td&gt;
    &lt;td&gt;3.70%&lt;/td&gt;
    &lt;td&gt;$1.09&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;The Fancy&lt;/td&gt;
    &lt;td&gt;0.48%&lt;/td&gt;
    &lt;td&gt;$0.13&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
I deliberately left out the raw traffic numbers, but will say this: The Fancy drove approximately 10x the traffic of Svpply, which drove the least of the three.
&lt;/p&gt;
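&lt;p&gt;One way to read the table: divide $/visitor by the conversion rate to get the implied average order value.  The figures below come straight from the table; the derived AOVs are my own arithmetic:&lt;/p&gt;

```python
# Implied average order value (AOV) derived from the table above.
# $/visitor = conversion rate * AOV, so AOV = ($/visitor) / conversion.
sites = {
    "Pinterest": (0.0093, 0.31),
    "Svpply":    (0.0370, 1.09),
    "The Fancy": (0.0048, 0.13),
}

for site, (conversion, dollars_per_visitor) in sites.items():
    aov = dollars_per_visitor / conversion
    print(f"{site}: implied AOV of about ${aov:.2f}")
```

&lt;p&gt;The implied AOV lands between roughly $27 and $33 for all three sites, which means the differences in per-visitor value come almost entirely from conversion rate, not order size.&lt;/p&gt;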

&lt;h3&gt;A Wrinkle&lt;/h3&gt;
&lt;p&gt;
Why did Svpply convert so much better than Pinterest or The Fancy?  Well, half the purchases from Svpply were of &lt;a href="http://www.indiecases.com/products/class-a-cellphone"&gt;this iPhone case&lt;/a&gt;, which Ben Pieratt, CEO of Svpply, tweeted about directly.&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://twitter.com/#!/pieratt/status/98392249120993280"&gt;&lt;img src="http://assets.20bits.com/20110822/Screen-shot-2011-08-22-at-12.07.00-AM.png" alt="" title="Ben Pieratt &amp;lt;3 Indie Cases" width="594" height="288" class="aligncenter size-full wp-image-842" /&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;Social Commerce?&lt;/h3&gt;

&lt;p&gt;
Consider the above: a plug from a respected member of a community interested in our products (Svpply) produced roughly the same number of sales as a site driving 10x the traffic (The Fancy).
&lt;/p&gt;

&lt;p&gt;
That, in a nutshell, is the promise of social commerce: the right recommendation at the right time from the right person.
&lt;/p&gt;

&lt;p&gt;
The success of sites like Svpply, Pinterest, and The Fancy will hinge on their ability to consistently produce that.
&lt;/p&gt;

&lt;h3&gt;Additional Info&lt;/h3&gt;
&lt;p&gt;
Just for reference, here are the Compete graphs of the sites that sent traffic to &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;.
&lt;a href='http://siteanalytics.compete.com/pinterest.com+svpply.com+thefancy.com/?metric=uv' class="math"&gt;&lt;img src='http://grapher.compete.com/pinterest.com+svpply.com+thefancy.com_uv_460.png' /&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Pinterest dominates in raw site traffic, but the question is whether their referrals are coming with intent to purchase.  I'm sure they were grilled nonstop about that while they were out fundraising &amp;mdash; a traffic graph like that allows for a lot of leeway. ;)
&lt;/p&gt;

&lt;h3&gt;Does this interest you?&lt;/h3&gt;
&lt;p&gt;
At &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;, we're not building a social commerce platform, we're building a full-on store selling our own products.  Indie Cases was a small preview of what's coming.
&lt;/p&gt;

&lt;p&gt;
If you are an engineer or designer interested in defining what online retail should look like in a world where Twitter and Facebook exist, and YouTube creates stars in weeks (not months), shoot me an email at &lt;a href="mailto:jesse@everlane.com"&gt;jesse@everlane.com&lt;/a&gt; and let's talk.
&lt;/p&gt;</description>
      <pubDate>Mon, 22 Aug 2011 11:44:37 +0000</pubDate>
      <link>http://20bits.com/article/the-value-of-a-social-commerce-referral</link>
    </item>
    <item>
      <title>Click Hacking for Fun and Profit</title>
      <description>&lt;p&gt;A friend IMed me the other day, asking, "You know how to make people click on things.  I'm submitting something to Reddit &amp;mdash; can you help me title the post?" A stark description of my skills, certainly, but it made me laugh and inspired me to write an article about the art of click hacking.&lt;/p&gt;

&lt;h3&gt;What is click hacking?&lt;/h3&gt;
&lt;p&gt;
Aside from being a term I just made up, &lt;strong&gt;click hacking&lt;/strong&gt; is a type of &lt;a href="http://en.wikipedia.org/wiki/Social_engineering_(security)"&gt;social engineering&lt;/a&gt; where the goal is to get someone to click a hyperlink.
&lt;/p&gt;

&lt;p&gt;
The link could be the title of a Reddit post, a button on a landing page form that's trying to capture your email, or an image in a Facebook ad.  Deception can be involved, but isn't necessarily.
&lt;/p&gt;

&lt;p&gt;
I'm going to use Facebook, Hacker News, and Reddit as the primary examples throughout this article because I know them best.  If you have other examples please leave a comment!
&lt;/p&gt;

&lt;h3&gt;For Good or Evil&lt;/h3&gt;
&lt;p&gt;
All the tactics I'm about to describe can be used for good or evil, and I've seen each used for both.  Don't take this article as an endorsement for spamming, scamming, or other internet trickery.
&lt;/p&gt;

&lt;p&gt;
There's plenty of grey area, too.  At its start Reddit faked on-site activity to avoid looking like a ghost town.  Was that unethical?  A mistake?
&lt;/p&gt;

&lt;p&gt;
None of these tactics are a substitute for generating real value for your customer, though they will help you understand how your customers react to what you present to them (and how to incentivize them to react the way you want).  The hard (and most important) parts are still up to you.
&lt;/p&gt;

&lt;p&gt;With that disclaimer aside, let's dive into specifics.&lt;/p&gt;

&lt;h3&gt;Understand Your Audience&lt;/h3&gt;
&lt;p&gt;
Above all else, I take the time to understand my audience.  Each online community has its own customs, norms, and standards of behavior.  You have to understand them before you can blend in or stand out as necessary.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://24.media.tumblr.com/tumblr_ljyor743nc1qzrifqo1_500.png"&gt;&lt;img src="http://24.media.tumblr.com/tumblr_ljyor743nc1qzrifqo1_500.png" width="200" align="right" /&gt;&lt;/a&gt;

For example, Hacker News values civility&lt;span class="footnote"&gt;See, e.g., &lt;a href="http://news.ycombinator.com/item?id=1400882"&gt;Some Tips to Improve the Civility on Hacker News&lt;/a&gt;.&lt;/span&gt;, straight-dealing, and intellectual honesty, but can be a little short-sighted and dour.  Reddit values wit (especially puns and in-jokes/memes) and has a strong sense of community justice.  It can also be completely juvenile. 
&lt;/p&gt;


&lt;p&gt;
From the click hacker's perspective, a pun-filled title could do well on Reddit, but would never see the light of day on Hacker News.
&lt;/p&gt;

&lt;h3&gt;Ask for Help or Feedback&lt;/h3&gt;
&lt;p&gt;
One way to get people to pay attention is to ask for help.  Hacker News has a "&lt;a href="http://news.ycombinator.com/ask"&gt;Ask HN&lt;/a&gt;" feature which evolved purely from community behavior.  There are similar "Tell HN" and "Review my Startup" features.
&lt;/p&gt;

&lt;p&gt;
For example, here is &lt;a href="http://news.ycombinator.com/item?id=8863"&gt;Drew Houston's original Hacker News post&lt;/a&gt; asking for reviews of Dropbox.  Every startup I know wants to publish a "showoff" entry on Hacker News and have it get to the front page, and their motives range from honest (they really want feedback) to self-serving (they just want the attention).
&lt;/p&gt;

&lt;h3&gt;Give a Gift&lt;/h3&gt;
&lt;p&gt;
Everyone loves gifts, but when we receive them we also feel pressure to reciprocate.  The click hacker can use that pressure to get people to do what they want.
&lt;/p&gt;

&lt;p&gt;
This tactic is the difference between "I baked a cake" and "I baked a cake &lt;em&gt;for you&lt;/em&gt;."  A normal person is obliged to accept and even reciprocate.
&lt;/p&gt;

&lt;p&gt;
For example, &lt;a href="http://www.reddit.com/r/programming/comments/e45ch/"&gt;this Reddit user&lt;/a&gt; wrote a CSS hack to change the appearance of the site &amp;mdash; a little present for anyone using Reddit.  The Reddit community reciprocated by giving him upvotes.
&lt;/p&gt;

&lt;p&gt;
Notice how he says he made it "for Reddit."
&lt;/p&gt;

&lt;p&gt;
The &lt;a href="http://www.reddit.com/r/pics/comments/9h520/my_wife_made_me_a_reddit_alien_birthday_cake_it/c0cracz"&gt;cake example&lt;/a&gt; works, too, though. :)
&lt;/p&gt;

&lt;h3&gt;Bribery&lt;/h3&gt;
&lt;p&gt;
Bribery is the opposite of gift-giving.  In this situation the click hacker says, "Do this for me and I'll give you something."  If that something is really awesome people will do almost anything.
&lt;/p&gt;

&lt;p&gt;
The stereotypical example here is those scammy ads for free iPods and the like.  All you have to do is fill out this form and click this link and you'll be entered to win.  Or, hey, instead maybe you can forward this offer to three of your friends, and if one of them wins, you win, too!
&lt;/p&gt;

&lt;p&gt;
This can be done well in certain contexts.  For example, Facebook game developers often trade installs by paying players virtual currency, e.g., "Install this game to earn ten farm dollars."  Or a game developer might have a special landing page offering players virtual currency as a way to encourage the player to click the install button. 
&lt;/p&gt;

&lt;p&gt;
The player gets what they want (virtual currency) and the game developer acquires a customer for free.
&lt;/p&gt;

&lt;p&gt;
Assuming the terms are clearly outlined there's nothing wrong with this.  Of course, there's plenty of room for outright dishonesty by promising goods that never arrive.  Don't be that guy.
&lt;/p&gt;

&lt;h3&gt;Scarcity&lt;/h3&gt;
&lt;p&gt;
People want what they can't have.  Or, more accurately, people want what they think they can't have.
&lt;/p&gt;

&lt;p&gt;
Gilt, for example, relies heavily on this tactic to drive sales.  Better click that buy button now, because we're running out!
&lt;/p&gt;

&lt;p&gt;
In Gilt's case the scarcity is legitimate, but it can be artificial, too.  A game developer might make an in-game object ultra-rare in order to drive specific behavior.  Limited beta invites are a tried-and-true method of generating traffic and interest; give out 20 invites to an audience of 20,000 and collect follow-up information for the people who don't get there in time.
&lt;/p&gt;

&lt;p&gt;
Scarcity also works with time, e.g., "If you install this game within the next thirty seconds, you'll get a ten credit bonus."  Tick, tick, tick, tock.
&lt;/p&gt;

&lt;h3&gt;Social Proof&lt;/h3&gt;
&lt;p&gt;
Similar to scarcity, people don't want to feel like they're missing out.  If they see other people doing something, especially people they know or respect, they'll be more likely to do it.
&lt;/p&gt;

&lt;p&gt;
This is the &lt;em&gt;raison d'être&lt;/em&gt; for Facebook's &lt;a href="http://developers.facebook.com/docs/reference/plugins/facepile/"&gt;Facepile&lt;/a&gt; widget.  Click that Like button.  You know you want to.  Come on, all your friends are doing it.
&lt;/p&gt;

&lt;p&gt;
Even adding faces of random but friendly-looking people can be effective in getting people to click through.  Here's a screenshot of Match.com's homepage.
&lt;/p&gt;

&lt;a href="http://assets.20bits.com/20110422/match-com.jpg"&gt;&lt;img src="http://assets.20bits.com/20110422/match-com.jpg" alt="Match.com homepage" title="Match.com homepage" width="500" class="aligncenter math size-medium wp-image-811" /&gt;&lt;/a&gt;

&lt;p&gt;Look at all those happy people.  Don't you want to be happy, too?&lt;/p&gt;

&lt;h3&gt;Pick a Fight&lt;/h3&gt;
&lt;p&gt;
Rather than trying to blend in with a community, sometimes it helps to generate controversy.  This can mean pitting yourself against the community, or setting two factions within the community against each other.
&lt;/p&gt;

&lt;p&gt;
I'll share a story of my own.  Back in 2009 I was working on a polling application for Facebook.  People would vote in polls and their votes would appear in their newsfeeds.
&lt;/p&gt;

&lt;p&gt;
The most controversial topics were the most successful, so to get things started I created a poll: "Do you support same-sex marriage?"
&lt;/p&gt;

&lt;p&gt;
This was just after the Proposition 8 fight, so I decided to run two sets of ads on Facebook: one targeting 50 miles around San Francisco and another targeting 50 miles around Salt Lake City.
&lt;/p&gt;

&lt;p&gt;
I can't say it was a win for civil political discourse, but it generated enough controversy to make that poll viral and in turn make the entire application viral.&lt;span class="footnote"&gt;The way that application died &lt;a href="http://www.huffingtonpost.com/2009/10/01/facebook-obama-death-poll_n_306637.html"&gt;makes a good story&lt;/a&gt;, too.&lt;/span&gt;
&lt;/p&gt;

&lt;h3&gt;Environmental Flaws&lt;/h3&gt;
&lt;p&gt;
Sometimes the environment has flaws which allow the click hacker to do some interesting things.
&lt;/p&gt;

&lt;p&gt;
When Facebook first launched their user-generated polling product, I remember seeing several polls asking questions like, "Which model is hotter?"  The possible answers were URLs of images.
&lt;/p&gt;

&lt;p&gt;
Unfortunately for the user, clicking on the text of the answer (the URL in this case) would register a vote and post it to their newsfeed.  Their friends, who of course also wanted to see pictures of hot models, clicked the URLs, accidentally voted, and spread the poll to their friends.  Oops!
&lt;/p&gt;

&lt;p&gt;
Click hackers were using this mis-feature deliberately to spread new polls across Facebook.&lt;span class="footnote"&gt;There's a blog post describing data from this social hack &amp;mdash; if anyone has it please share it in the comments!&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
I don't think many people will disagree when I say that most of the growth on the Facebook Platform from 2007-2009 was built on similar environmental flaws.  Here's a two-year-old example from Slide's Super Pocus:
&lt;/p&gt;

&lt;img src="http://assets.20bits.com/20110422/super-pocus.png" alt="" title="Super Pocus click hacking" class="alignnone size-medium wp-image-378 math"&gt;

&lt;h3&gt;What else?&lt;/h3&gt;
&lt;p&gt;What else am I missing?  Post examples in the comments.&lt;/p&gt;

&lt;p&gt;Have you ever been "tricked" into clicking something?  (I know I have.) Have a funny story?  Have experience optimizing landing pages, etc. and think about this all the time anyway?  Leave a comment and share!&lt;/p&gt;

&lt;h3&gt;Everlane is Hiring&lt;/h3&gt;
&lt;p&gt;
My startup, &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;, is hiring.  We're trying to take the best elements of shopping offline &amp;mdash; visual merchandising, personality, and curation &amp;mdash; and bring them online.  Check out our jobs page at &lt;a href="http://www.everlane.com/jobs"&gt;http://www.everlane.com/jobs&lt;/a&gt; and send an email to &lt;a href="mailto:jesse@everlane.com?subject=[20bits]%20Everlane"&gt;jesse@everlane.com&lt;/a&gt; if you're interested in talking.
&lt;/p&gt;</description>
      <pubDate>Fri, 22 Apr 2011 16:12:38 +0000</pubDate>
      <link>http://20bits.com/article/click-hacking-for-fun-and-profit</link>
    </item>
    <item>
      <title>Speed vs. Certainty in A/B Testing</title>
      <description>&lt;p&gt;
&lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B testing&lt;/a&gt; is a great tactical tool for studying customer behavior on the web.  But like any randomized trial there's some chance that the improvement we measure is just statistical noise.
&lt;/p&gt;

&lt;p&gt;
How worried should we be that the feature we thought improved our product actually does nothing, or worse, hurts our bottom line?  How can we ever really know that we're making the correct decision?  And is it better to run tests more quickly or more accurately?
&lt;/p&gt;

&lt;p&gt;
The answers to these questions depend on the cost of a bad decision.  If mistakes are cheap then it's better to make 1,000 decisions and get only 60% of them right than to make 100 decisions and get 100% of them right.
&lt;/p&gt;
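&lt;p&gt;
A quick expected-value sketch makes the comparison concrete.  The payoffs below are invented for illustration (one unit gained per correct decision, one lost per mistake):
&lt;/p&gt;

```python
# "Many fast decisions vs. few certain ones" with symmetric,
# hypothetical payoffs: each correct call earns 1 unit, each wrong
# call costs 1 unit.
def expected_value(decisions, accuracy, gain=1.0, cost=1.0):
    """Net outcome for a batch of decisions made at a given accuracy."""
    right = decisions * accuracy
    wrong = decisions - right
    return right * gain - wrong * cost

fast = expected_value(1000, 0.60)   # many quick, noisy decisions
slow = expected_value(100, 1.00)    # few careful, certain decisions
print(fast, slow)   # fast nets roughly 200 units, slow exactly 100
```

&lt;p&gt;
The balance flips as soon as mistakes get expensive: raise the cost parameter and the careful strategy wins.
&lt;/p&gt;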

&lt;p&gt;
One way to achieve this balance in the context of A/B testing is to tune the confidence level.
&lt;/p&gt;

&lt;h3&gt;Tuning the Confidence Level&lt;/h3&gt;
&lt;p&gt;
Intuitively, the confidence level of an A/B test tells you how certain you can be of the result of the A/B test.  For example, a confidence level of 95% means that there's a 5% chance that a statistically significant result is actually random variation, i.e., there is a 5% chance of a false positive.
&lt;/p&gt;

&lt;p&gt;
Of course, we're free to choose some other confidence level besides 95%.  We could choose 80%, 90%, or 99.999%.  A higher confidence level requires more data before reaching statistical significance, but we will be more certain of the result.
&lt;/p&gt;
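&lt;p&gt;
To see how steeply the data requirement grows, here is a rough sample-size sketch.  The baseline conversion rate and detectable lift are assumptions chosen for illustration, and the formula is a simplified normal-approximation calculation, not a full power analysis:
&lt;/p&gt;

```python
import math
from statistics import NormalDist

def required_n(confidence, base_rate=0.10, lift=0.01):
    """Rough per-variant sample size needed to resolve an absolute
    lift in a conversion rate of base_rate at a one-sided confidence
    level (normal approximation; simplified power assumptions)."""
    z = NormalDist().inv_cdf(confidence)      # z-quantile for the level
    variance = base_rate * (1 - base_rate)    # Bernoulli variance
    return math.ceil((z / lift) ** 2 * variance)

for level in (0.80, 0.90, 0.95, 0.999):
    print(level, required_n(level))   # sample size climbs quickly
```

&lt;p&gt;
The required sample size grows with the square of the z-quantile, which is why the jump from 95% to 99.9% confidence costs far more traffic than the jump from 80% to 90%.
&lt;/p&gt;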

&lt;p&gt;
If you're not comfortable with the nuts and bolts of statistical analysis, confidence levels, and A/B testing I recommend reading my article about &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;statistical analysis and A/B testing&lt;/a&gt;, which explains exactly how one "chooses" a confidence level.
&lt;/p&gt;

&lt;p&gt;
In short, the confidence level acts as a dial between speed and certainty, and we're free to choose where to set that dial depending on the priorities of our business or product.
&lt;/p&gt;

&lt;h3&gt;Speed vs. Certainty&lt;/h3&gt;
&lt;p&gt;
So where on the speed-certainty spectrum should you, as a product manager or startup entrepreneur, sit?
&lt;/p&gt;

&lt;p&gt;
Mike Cassidy has a great presentation where he argues that &lt;a href="http://www.slideshare.net/dmc500hats/best-strategy-is-speed-startup2startup-may-2008"&gt;speed is the primary business strategy&lt;/a&gt; for startups.
&lt;/p&gt;

&lt;p&gt;
Why is speed great for startups?  Because mistakes are cheap and calculated risks are rewarded.  Most product decisions can be undone, and important early tests can be redone at a higher confidence level when the product has more traction.
&lt;/p&gt;

&lt;p&gt;
But mistakes aren't always cheap.  Here are some factors that increase the cost of a mistake.
&lt;/p&gt;

&lt;h4&gt;Volume&lt;/h4&gt;
&lt;p&gt;
Volume is leverage.  If you have millions of customers, like Google or Amazon, a 1% improvement to the bottom line is a huge win.  Conversely, a 1% mistake is a huge hit. 
&lt;/p&gt;

&lt;p&gt;
Fortunately, this problem mitigates itself: increased volume affords you the luxury of running A/B tests at a higher confidence level in the same amount of time.
&lt;/p&gt;

&lt;h4&gt;Reversibility&lt;/h4&gt;
&lt;p&gt;
Most product decisions in a consumer technology startup can be undone, for a price.  For example, it's easy to undo a bad decision for a web-based product, slightly harder to undo a decision for desktop software, and very difficult (and costly) to undo a decision for a physical product.
&lt;/p&gt;

&lt;p&gt;
The less reversible a decision is the more certain you should be before you make it.  In the context of A/B testing a product feature this means a higher confidence level, even if it takes longer to run the test.
&lt;/p&gt;

&lt;h4&gt;Real Money&lt;/h4&gt;
&lt;p&gt;
Imagine you're an ad network.  You're constantly A/B testing formatting, positioning, offers, etc. to see which performs best.  Making a mistake in this regard costs your publishers money.
&lt;/p&gt;

&lt;p&gt;
Like volume, money creates leverage.  But it is more complicated than that: publishers don't just want increased revenues, they want reliable cash flow.  That is, when money is involved, not only do you have to perform better but you have to perform more consistently because of phenomena like the &lt;a href="http://en.wikipedia.org/wiki/Peak-end_rule"&gt;peak-end rule&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
In this case a "three steps forward one step back" strategy might actually be worse than going step-by-step in the right direction, even if the former averages out to better performance.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Maintaining momentum in a startup isn't about making &lt;em&gt;only&lt;/em&gt; correct decisions &amp;mdash; it's about making &lt;em&gt;enough&lt;/em&gt; correct decisions.  This presents a continuum from speed to certainty.  At one extreme you run the business with a magic eight-ball; at the other you agonize over every detail until you're 100% certain that you've made the correct choice.
&lt;/p&gt;

&lt;p&gt;
This thought process extends naturally to A/B testing, where "certainty" and "cost" can be quantified.  To recap:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;A/B testing is a great tactical tool for testing &lt;em&gt;specific&lt;/em&gt; hypotheses about your customers.&lt;/li&gt;
	&lt;li&gt;However, there is a tradeoff between speed and certainty, controlled by the confidence level of the A/B test.&lt;/li&gt;
	&lt;li&gt;The cost of doing A/B tests quickly is that you will make more wrong decisions, but that is ok if mistakes are cheap.&lt;/li&gt;
	&lt;li&gt;For example, it's better to make 1,000 decisions and get only 60% of them right than to make 100 decisions and get 100% of them right, all else being equal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;A Spreadsheet Model&lt;/h3&gt;
&lt;p&gt;
Below is a little spreadsheet model that illustrates all my points above.
&lt;/p&gt;

&lt;p&gt;
The two independent variables are the gain from a good decision and the cost of a bad decision.  The spreadsheet assumes a fixed time period, so a higher confidence level means more certainty but &lt;strong&gt;fewer tests&lt;/strong&gt;.  The ideal confidence level is highlighted as you change the parameters of the model.
&lt;/p&gt;
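&lt;p&gt;
The spreadsheet itself isn't reproduced here, but its logic can be sketched in a few lines.  Every parameter below (total traffic, baseline rate, detectable lift, candidate levels) is an assumption for illustration, not a value taken from the model:
&lt;/p&gt;

```python
import math
from statistics import NormalDist

def best_confidence(gain, cost, traffic=100_000, base_rate=0.10,
                    lift=0.01, levels=(0.80, 0.90, 0.95, 0.99)):
    """Pick the confidence level maximizing expected value over a fixed
    traffic budget: higher confidence means bigger tests, hence fewer
    of them, but also fewer false positives per apparent 'win'."""
    def total_value(level):
        z = NormalDist().inv_cdf(level)
        n_per_test = math.ceil((z / lift) ** 2 * base_rate * (1 - base_rate))
        num_tests = traffic // n_per_test          # fixed time period
        return num_tests * (level * gain - (1 - level) * cost)
    return max(levels, key=total_value)

# Cheap mistakes favor speed; expensive mistakes favor certainty.
print(best_confidence(gain=1, cost=0))     # 0.8
print(best_confidence(gain=1, cost=100))   # 0.99
```

&lt;p&gt;
The qualitative behavior matches the argument above: the optimal confidence level slides upward as the cost of a bad decision grows relative to the gain from a good one.
&lt;/p&gt;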

&lt;p&gt;
You can download &lt;a href="http://assets.20bits.com/downloads/confidence-model.xls"&gt;the A/B testing confidence model&lt;/a&gt; here.
&lt;/p&gt;

&lt;p&gt;
For the statistically inclined this model assumes that traffic increases linearly over time, that the sample statistic is normally distributed, and that a one-tailed t-test is the appropriate statistical test.
&lt;/p&gt;

&lt;h3&gt;Credits&lt;/h3&gt;
&lt;p&gt;
This article was inspired by a conversation with &lt;a href="http://startup-marketing.com/"&gt;Sean Ellis&lt;/a&gt; and edited by my &lt;a href="http://twitter.com/aleeeex"&gt;wonderful girlfriend&lt;/a&gt;, who is probably going to yell at me for linking to her Twitter account.
&lt;/p&gt;</description>
      <pubDate>Mon, 25 May 2009 10:00:18 +0000</pubDate>
      <link>http://20bits.com/article/speed-vs-certainty-in-ab-testing</link>
    </item>
    <item>
      <title>8 Tips for Crafting Metrics That Matter</title>
      <description>&lt;p&gt;
Metrics are the marketer's microscope.  They show him what his customers are actually doing, as opposed to what they say they are doing or intend to do. 
With proper metrics he can make decisions faster and more accurately.
&lt;/p&gt;

&lt;p&gt;
You can decide to measure anything, but which metrics matter and which are just for show?  Here are some rules I hope will guide you toward creating meaningful metrics that help, rather than hinder, the decision-making process.
&lt;/p&gt;


&lt;h3&gt;Be Actionable&lt;/h3&gt;
&lt;p&gt;
If I had to give a one-sentence answer to the question "What metrics should I implement for my product?" it would be "Whatever metrics are actionable."  This means the line from a question to a metric and the line from a metric to an action should be as short as possible.
&lt;/p&gt;

&lt;p&gt;
Most of the tips below are meant to focus attention on this issue.  What can you do to make sure your metrics are actionable?
&lt;/p&gt;

&lt;h3&gt;Be Understandable and Trustworthy&lt;/h3&gt;
&lt;p&gt;
Do you understand what your metric measures?  Does everyone else in your organization understand it, and do they trust it?
&lt;/p&gt;

&lt;p&gt;
Trust is the important part.  Everyone has to trust the metric if you're going to use it to make decisions, otherwise you'll be getting constant pushback.  This will slow the decision-making process and cause a lot of ill-tempered arguments.
&lt;/p&gt;

&lt;h3&gt;Measure Results&lt;/h3&gt;
&lt;p&gt;
Does your metric measure customer behavior or a correlate of customer behavior?  In the past approximations and correlations were necessary because measuring behavior directly was hard, but on the web there's no excuse &amp;mdash; you have access to every single thing a person does on your site, down to where their mouse is hovering and for how long.
&lt;/p&gt;

&lt;p&gt;
For example, if you want to know how good Twitter is for your business don't measure the number of positive tweets about your company.  Instead, measure how many customers it drives to your product and how much money those customers put in your hands.
&lt;/p&gt;

&lt;h3&gt;Understand the Downside&lt;/h3&gt;
&lt;p&gt;
What would you do if your metric were 50% off the mark today?  Would you be able to articulate why this is a problem for your business?  What it costs you?  Would you know where to start looking for possible causes?
&lt;/p&gt;

&lt;p&gt;
As an example, I've worked with a startup that used "number of MySpace friends" as a go-to metric in every marketing meeting.  Is that really material to the business? 
&lt;/p&gt;

&lt;p&gt;
What would happen if tomorrow we had half as many MySpace friends? Would we lose $100?  $10,000?
&lt;/p&gt;

&lt;p&gt;
This number says nothing about the value of MySpace as a marketing channel, which in the most charitable case is what it is trying to approximate, and the downside is completely ambiguous.  Is the number of MySpace friends tied to anything meaningful?
&lt;/p&gt;

&lt;p&gt;
Like the Twitter example above, if I thought MySpace were an important marketing channel for my product I'd be measuring things like the number of qualified leads from MySpace and the value they generate for the business.
&lt;/p&gt;

&lt;h3&gt;Understand the Upside&lt;/h3&gt;
&lt;p&gt;
Conversely, ask yourself, "What value does improving the metric bring to the company?"  Some metrics are blindingly obvious in this regard, e.g., top-line revenue numbers and some "efficiency" metrics like effective CPM and revenue per user. 
&lt;/p&gt;

&lt;p&gt;
Some metrics are less obvious.  What about page views?  Can you imagine a scenario where your page views double but the net effect is bad for your business?
&lt;/p&gt;

&lt;p&gt;
Metrics like these are dangerous because they lull you into a false sense of security.  Everything on your analytics dashboard is going up and to the right, but the fundamentals of the business might still be suffering.
&lt;/p&gt;

&lt;h3&gt;Don't Be Ambiguous&lt;/h3&gt;
&lt;p&gt;
Does your metric measure one thing and one thing only?  Or is it really an aggregation of several variables, each of which can rise or fall independently?
&lt;/p&gt;

&lt;p&gt;
A good example of an ambiguous metric is the notion of a "daily active user," or the total number of people who interacted with your product today.  This number is ambiguous because it is really the sum of two metrics: the number of new users and the number of returning users.
&lt;/p&gt;

&lt;p&gt;
Ambiguous metrics are bad because they obscure the underlying variables that truly reflect customer behavior and delay the decision-making process as you are forced to determine which of these variables is actually affecting the aggregate metric.  Furthermore, it's possible that one variable accounts for all the growth in the metric, e.g., you have millions of new users per day but no returning users.
&lt;/p&gt;

&lt;p&gt;
This latter scenario has been the death of many Facebook apps.
&lt;/p&gt;
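&lt;p&gt;
To make the ambiguity concrete, here is a toy decomposition with invented numbers.  The headline daily-active-user count rises even as the returning-user core, the part that signals a healthy product, collapses:
&lt;/p&gt;

```python
# Invented daily figures: DAU is the sum of two independent variables,
# new users and returning users, and the aggregate hides their mix.
days = [
    {"new": 9000, "returning": 1000},
    {"new": 9500, "returning": 700},
    {"new": 10200, "returning": 400},
]
for day in days:
    dau = day["new"] + day["returning"]
    returning_share = day["returning"] / dau
    print(dau, round(returning_share, 2))
# DAU: 10000, 10200, 10600 -- up and to the right, yet retention is dying.
```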

&lt;h3&gt;Segment by Purpose&lt;/h3&gt;
&lt;p&gt;
Whenever I'm building a web product I divide usage into key segments.  Generally these are acquisition, retention, engagement, and monetization.  That is, how do customers find my product?  Are those customers coming back?  Are they doing the things I want or need them to do?  And how much money are those customers making me?
&lt;/p&gt;

&lt;p&gt;
See Dave McClure's &lt;a href="http://www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version"&gt;Startup Metrics for Pirates&lt;/a&gt; presentation, which focuses on this idea.
&lt;/p&gt;

&lt;p&gt;
Doing this lets you tackle each segment independently and allows for a sharper product focus.  Early on you might want to focus on acquisition, or maybe you want to focus on monetization from day one.  Eventually you'll start caring about longer-term metrics like retention, and you won't be distracted or overwhelmed by unrelated metrics from different segments.
&lt;/p&gt;

&lt;p&gt;
You might select one or two from each category to act as top-line variables in a dashboard that you look at every day, e.g., new users (by channel), returning users (by channel), activity, and revenue.
&lt;/p&gt;

&lt;h3&gt;Appropriate Granularity&lt;/h3&gt;
&lt;p&gt;
Sometimes you need a bird's-eye view and sometimes you need a scanning tunneling microscope.  Know when you need which.
&lt;/p&gt;

&lt;p&gt;
As a general rule of thumb I focus on the microscopic when I'm designing specific optimizations and on the macroscopic when I'm determining whether the decisions we're making are working.  Another test is to finish this sentence: "I know my product is healthy because..."
&lt;/p&gt;

&lt;p&gt;
You'd never finish that sentence with "because the click-through-rate on my login page is 20%."  You'd say something like "because 80% of my customers return every week" or "because our revenue is growing by 5% month-over-month."
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
These are just tips, not hard-and-fast rules.  They are meant to focus the discussion of which metrics matter and why, because the choice of metrics has a long-lasting impact if your team is committed to building a data-driven culture.
&lt;/p&gt;

&lt;p&gt;
If you have any tips on deciding what metrics matter and why, leave them in the comments!  This list wasn't meant to be exhaustive and I know (or hope) people will have some strong opinions!
&lt;/p&gt;</description>
      <pubDate>Tue, 12 May 2009 10:00:31 +0000</pubDate>
      <link>http://20bits.com/article/8-tips-for-crafting-metrics-that-matter</link>
    </item>
    <item>
      <title>Building a Social Network, Island by Island</title>
      <description>&lt;p&gt;
A necessary condition for building a self-sustaining social network is density.  We understand this intuitively.  After all, a network of one person is hardly a "network" at all.
&lt;/p&gt;

&lt;p&gt;
Metcalfe's Law, which states that the value of a network grows in proportion to the square of the number of its users, expresses this idea formally.  The value of a social network rests in its ability to foster communication, in its connections.
&lt;/p&gt;
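&lt;p&gt;
The usual combinatorial reading of Metcalfe's Law counts potential pairwise connections, which is where the n-squared growth comes from:
&lt;/p&gt;

```python
def possible_connections(n):
    """Pairwise connections among n users, the quantity behind
    Metcalfe's Law: it grows on the order of n squared."""
    return n * (n - 1) // 2

# Doubling the user base roughly quadruples the potential connections:
print(possible_connections(100), possible_connections(200))   # 4950 19900
```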

&lt;p&gt;	
If you're building a social network, whether it's a destination website or an application that exists on another social network, density must figure into key strategic decisions.  Let's see how.
&lt;/p&gt;

&lt;h3&gt;Islands&lt;/h3&gt;
&lt;p&gt;
One way to think about social networks is as a network of networks.  On Facebook, for example, if I graph the connections between all my friends I see distinct groups: my high school friends, my college friends, my current circle of friends, and my professional network.
&lt;/p&gt;

&lt;p&gt;
Each group is more or less isolated from the others.  Density exists within each of these islands, but not between them.  I'm fairly certain this is a topological property of any social network.
&lt;/p&gt;

&lt;h3&gt;Case Study: Facebook&lt;/h3&gt;
&lt;p&gt;
Facebook started at Harvard and was initially college-only.  Their growth strategy was explicit from day one: move from school to school as demand warranted.  I've been told that Sean Parker wouldn't consider opening up access to a new school until at least several dozen students from that school requested an account.
&lt;/p&gt;

&lt;p&gt;
Each school was an island.  Once Facebook saturated a specific set of colleges it moved on to the next round.  Eventually there was enough anticipatory buzz that they could launch at large state schools without risk of fading away.
&lt;/p&gt;

&lt;p&gt;
They still pursue this strategy today.  After establishing a critical density among colleges they opened up access to high schools and then to everyone with an email address.  From there they started moving country-to-country.
&lt;/p&gt;

&lt;p&gt;
The countries where Facebook is having the most difficulty gaining traction are the ones with already-established social networks, like Germany with &lt;a href="http://en.wikipedia.org/wiki/StudiVZ"&gt;StudiVZ&lt;/a&gt;.  In fact, if you look at &lt;a href="http://gawker.com/tech/data-junkie/the-world-map-of-social-networks-273201.php"&gt;this old map&lt;/a&gt; of the most popular social network in each country, you get an idea of how isolated this country-by-country growth really is.
&lt;/p&gt;

&lt;h3&gt;Other Networks&lt;/h3&gt;
&lt;p&gt;
Facebook isn't the only social network that grew this way.  hi5 has a similar story, starting with smaller markets overseas and spreading from country to country.  Craigslist did the same, starting small in San Francisco and eventually becoming a presence in most major US cities.
&lt;/p&gt;

&lt;p&gt;
The MySpace team had a background in direct marketing, which is all about targeting specific offers at the people who are most likely to respond.  They started with the club scene in LA and grew from there.
&lt;/p&gt;

&lt;p&gt;
The key to all these strategies was density.  
&lt;/p&gt;

&lt;p&gt;
If you're launching a new social service, even if your end goal is to have everyone and their mother using it, it's important to understand the impact density has on growth.
&lt;/p&gt;

&lt;h3&gt;Multiple Dimensions of Density&lt;/h3&gt;
&lt;p&gt;
So far the only kind of density we've talked about is network density, i.e., multiple people connected through their shared use of a service.  You could call this "product density."
&lt;/p&gt;

&lt;p&gt;
Sometimes product density isn't enough.  Take IM, for example, or any network that requires synchronous communication.  Not only do two people have to be using the same product but they have to be using it at the same time.  What good is your friend being on IM if you're never awake at the same time?
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://en.wikipedia.org/wiki/Xfire"&gt;Xfire&lt;/a&gt;, an IM client for gamers, is an example of a product that innovated in this space by tackling a segment of customers who were already interacting synchronously.
&lt;/p&gt;

&lt;p&gt;
Mobile social networks take this to an even greater extreme.  To connect with people on Loopt or Google Latitude not only do we have to be using the same product at the same time, but we have to be in the same place!
&lt;/p&gt;

&lt;p&gt;
This isn't to say building these networks is impossible.  Rather, they come with an extra handicap in the form of reduced density.  Overcoming that problem has to be a key part of the product strategy.
&lt;/p&gt;

&lt;h3&gt;Conclusion and Counterexamples?&lt;/h3&gt;
&lt;p&gt;
Most product strategy discussions, in my experience, are focused on acquisition or other topline metrics that go "up and to the right."  Instead, if you're building social software, I believe density is a necessary condition for long-term success and needs to be a part of the strategy discussion from day one.
&lt;/p&gt;

&lt;p&gt;
First, understand the density requirements for your product.  Do customers need to sign up for the same service?  Do they need to be using it at the same time?  Do they need to be in the same place?  Is there anything you can do to lower the density requirement?
&lt;/p&gt;

&lt;p&gt;
Second, build a "depth first" strategy.  Are there any naturally dense customer segments that might fit your product?  Do you have the ability to target specific demographics or segments for acquisition?  Which ones respond positively and is it possible to build density there?
&lt;/p&gt;

&lt;p&gt;
Once you've achieved sufficient density on one island, hop to the next and repeat.
&lt;/p&gt;

&lt;p&gt;
And if anyone out there can think of any counterexamples &amp;mdash; social networks or services that got big "all at once" &amp;mdash; leave a comment and let me know!  I honestly can't think of any.
&lt;/p&gt;</description>
      <pubDate>Mon, 11 May 2009 08:30:36 +0000</pubDate>
      <link>http://20bits.com/article/building-a-social-network-island-by-island</link>
    </item>
    <item>
      <title>What Verna Taught Me</title>
      <description>&lt;p&gt;
When I was a student at the University of Chicago I worked for Residential Computing, or ResCom.  ResCom was responsible for maintaining all the computer labs and IT systems in the residential dorms.
&lt;/p&gt;

&lt;p&gt;
I had two jobs.  The first was to help manage the dorm computer labs.  If a new virus broke out and computers were banned from the network because they were spamming everybody on campus, I had to go in and clean it up.  I still remember when the &lt;a href="http://en.wikipedia.org/wiki/Blaster_(computer_worm)"&gt;Blaster worm&lt;/a&gt; infected most of the Windows machines on campus.
&lt;/p&gt;

&lt;p&gt;
The second job was to help build a web-based dorm management system called Chopin.  This system was at the center of the daily operation of the dorms.  Students could use it to submit work requests and report problems with the dorms.  Staff could use it to send out mass mailings, sell students printing credits for the dorm printers, and any number of other routine tasks.
&lt;/p&gt;

&lt;h3&gt;Enter Verna&lt;/h3&gt;

&lt;p&gt;
Verna was a front-desk clerk at one of the dorms.  She was responsible for helping students when problems came up, and she used Chopin to get her job done.
&lt;/p&gt;

&lt;p&gt;
She had also lived in Hyde Park, the neighborhood surrounding the University, for most of her life and just wanted to do her job without having to deal with whiny students or crappy software.  Chopin was supposed to help her do that.
&lt;/p&gt;

&lt;p&gt;
One day while fixing the front-desk computer I asked Verna, "What do you think of Chopin?"  She immediately supplied a laundry list of complaints.  It didn't work well, it was confusing, she never knew where to go or what to do, and so forth.
&lt;/p&gt;

&lt;p&gt;
It would have been easy to blame her.  Maybe she just needed better training or maybe she wasn't trying hard enough.  But that was all ego: I was just upset because she told me the product I helped build was pretty awful.
&lt;/p&gt;

&lt;p&gt;
Her critique was absolutely fair.  I had built solutions I liked for problems I wasn't even sure existed.  In fact, before then, I had never honestly talked to any of the people who actually &lt;em&gt;used&lt;/em&gt; Chopin about the product itself.  In retrospect that seems completely insane.
&lt;/p&gt;

&lt;p&gt;
And watching her use Chopin I saw so much waste.  She would click ten times where two would have sufficed.  Some features never got used; others were used for things they weren't intended for.  Once I was there, talking with her and watching how she used Chopin, I saw that the whole process was messed up from top to bottom.  Most of her energy was spent working around problems I had caused!
&lt;/p&gt;

&lt;h3&gt;Go and See&lt;/h3&gt;

&lt;p&gt;
If my job hadn't required that I work on Chopin and get out of the office I never would have even realized there was a problem.
&lt;/p&gt;

&lt;p&gt;
That experience taught me that whenever I didn't understand a customer's frustration, or thought that maybe they were feeling this way or that, I should just go ask them before building solutions that might be worse than the problem.  When in doubt, go and see for yourself.  Actually, scratch that: &lt;em&gt;always&lt;/em&gt; go see for yourself.
&lt;/p&gt;

&lt;p&gt;
Too often I'd pass off mere belief as knowledge, or generalize from a specific set of circumstances to a fundamentally different set of circumstances.  Maybe this solution worked over there, but why should it work over here?  The only people who can validate your product are your customers &amp;mdash; everyone else, including yourself, can wait their turn.
&lt;/p&gt;

&lt;p&gt;
Verna taught me that.
&lt;/p&gt;</description>
      <pubDate>Wed, 06 May 2009 08:30:28 +0000</pubDate>
      <link>http://20bits.com/article/what-verna-taught-me</link>
    </item>
    <item>
      <title>Notification Strategies for Social Networks</title>
      <description>&lt;p&gt;
You've built a social application and launched a new feature.  The number of notifications you can send out is constrained.  Which set of users should you notify to guarantee the most people start using this new feature?
&lt;/p&gt;

&lt;p&gt;
This problem might seem artificial.  Why not put up an ad on your product, or send a notification to every single person who might be interested?  There are several reasons the number of people you can notify might be constrained.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;There is a technical constraint, e.g., Facebook limits the number of application-to-user notifications at the API level.&lt;/li&gt;
&lt;li&gt;There is a financial constraint, e.g., you're sending notifications over SMS and every message costs you money.&lt;/li&gt;
&lt;li&gt;There is a strategic constraint, e.g., sending notifications too frequently causes fatigue and reduces the effectiveness of future notifications.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
So, the situation is not too far-fetched.  Let's investigate the issue.
&lt;/p&gt;

&lt;p&gt;
For the rest of the article the "application" is going to be a Facebook application and it can only send 100 application-to-user notifications per week.  Which 100 users should we notify?
&lt;/p&gt;

&lt;h3&gt;The Basic Considerations&lt;/h3&gt;
&lt;p&gt;
In the &lt;a href="/article/behavior-adoption-on-social-networks"&gt;linear cascade model&lt;/a&gt; when a user in a social network adopts a new behavior there is a probability that each neighbor in the network will adopt it.
&lt;/p&gt;

&lt;p&gt;
Under this model we probably wouldn't want to notify two people who are friends, and especially not a cluster of friends or a &lt;a href="http://en.wikipedia.org/wiki/Clique_(graph_theory)"&gt;clique&lt;/a&gt;.  The new feature wouldn't spread very far beyond this group.
&lt;/p&gt;

&lt;p&gt;
Likewise, we wouldn't want to notify people who are very far apart on the social network because a user is more likely to adopt a behavior if more than one of his friends has also adopted it.  So there is a balancing act between notifying users who are close together, to achieve density, and notifying users who are far apart, to achieve breadth.
&lt;/p&gt;

&lt;h3&gt;Heuristics and Centrality Measures&lt;/h3&gt;
&lt;p&gt;
The easiest solution is to pick 100 random users to notify, but this is also the most naive since it takes into account neither the structure of the network nor the likelihood that a person will influence their neighbors.
&lt;/p&gt;

&lt;p&gt;
A better&lt;span class="footnote"&gt;"Better" according to what?  As we'll see, randomly selecting seed users performs worse than all the other heuristics.&lt;/span&gt; solution to this problem is to develop a heuristic that ranks every user in the network according to some metric.  If we can only send 100 notifications then they are sent to the first 100 people on this ranked list.
&lt;/p&gt;

&lt;p&gt;
The idea here is to use &lt;a href="http://en.wikipedia.org/wiki/Centrality"&gt;centrality measures&lt;/a&gt; to come up with heuristics.  In graph theory "centrality" is a measure of how important an individual node is.
&lt;/p&gt;

&lt;p&gt;
The simplest measure is called "degree centrality" and is equal to the number of neighbors of a node.  On a social network this is the number of friends of a given user.  So, if we wanted to send out 100 notifications using this heuristic we'd send them to the 100 users with the most friends.  In practice this heuristic amounts to convincing celebrities to use the new feature.
&lt;/p&gt;
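&lt;p&gt;
As a rough sketch, here's what the degree-centrality heuristic looks like in Python.  The edge list and user names are made up for illustration; in practice the friendship graph would come from your own data:
&lt;/p&gt;

```python
from collections import defaultdict

def top_k_by_degree(edges, k):
    """Rank users by degree centrality (friend count); notify the top k."""
    degree = defaultdict(int)
    for u, v in edges:  # undirected friendships
        degree[u] += 1
        degree[v] += 1
    # Stable sort, highest-degree users first.
    return sorted(degree, key=degree.get, reverse=True)[:k]

edges = [("amy", "bob"), ("amy", "cal"), ("amy", "dee"), ("bob", "cal")]
print(top_k_by_degree(edges, 2))  # amy has the most friends, so she ranks first
```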

&lt;p&gt;
There are other, more complex heuristics.  The Wikipedia article linked above has a list of other centrality measures, and I wrote an article about calculating &lt;a href="/article/graph-theory-part-iii-facebook"&gt;eigenvalue centrality&lt;/a&gt;, which is similar to PageRank.  Each of these admits a heuristic which can tell us which users to notify.
&lt;/p&gt;

&lt;p&gt;
Of course, which strategy works best is hard to know beforehand, as it varies with respect to both time and the underlying notification.  A/B testing this is difficult because the effects are intentionally dependent.  If anyone has a good solution to this that doesn't involve collecting massive amounts of data about user behavior I'd be interested in hearing it!
&lt;/p&gt;

&lt;p&gt;
It should be noted that each of these heuristics only takes into account the underlying structure of the graph and not the probability of "infection."  By including the latter we can come up with a nearly exact model of the optimal subset of users to notify.
&lt;/p&gt;

&lt;h3&gt;A Global Solution&lt;/h3&gt;
&lt;p&gt;
&lt;a href="http://www.cs.cmu.edu/~bmeeder/"&gt;Brendan Meeder&lt;/a&gt; at CMU pointed me to a great paper that discuss this very topic, &lt;a href="http://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf"&gt;Maximizing the Spread of Inï¬‚uence through a Social Network&lt;/a&gt; by Kempe, et al.
&lt;/p&gt;

&lt;p&gt;
Rather than take a localized view of the problem by ranking each node individually, we create a statistical model of how the new feature propagates through the network.
&lt;/p&gt;

&lt;p&gt;
First, we start with a finite seed set, A.  In our case A is a set of 100 users.  Say we convert each of these 100 users.
&lt;/p&gt;

&lt;p&gt;
In our model if a user &lt;em&gt;u&lt;/em&gt; is converted then for each neighbor &lt;em&gt;v&lt;/em&gt; there is some probability
&lt;/p&gt;
&lt;div class="math"&gt;
$latex p_{u,v}$
&lt;/div&gt;
&lt;p&gt;
that &lt;em&gt;v&lt;/em&gt; will also be converted.
&lt;/p&gt;

&lt;p&gt;
After the process has run its course some set of users has adopted the new feature.  Because adoption is probabilistic the size of this final configuration is a random variable.  Using the notation from Kempe, et al., for a given seed set A the size of the final set of adopters is a random variable denoted by
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\varphi\left(A\right)}$
&lt;/div&gt;

&lt;p&gt;
Our goal is to pick the set A which maximizes the expected value of this random variable.
&lt;/p&gt;

&lt;p&gt;
Formally, we want to find the subset A such that
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\sigma\left(A\right) = E\left[\varphi\left(A\right)\right]}$
&lt;/div&gt;
&lt;p&gt;
is maximized, where
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\sigma\left(\cdot\right)}$
&lt;/div&gt;
&lt;p&gt;
is called the &lt;em&gt;influence function&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;The Algorithm and The Results&lt;/h3&gt;
&lt;p&gt;
It turns out that calculating the influence function exactly is &lt;a href="http://en.wikipedia.org/wiki/NP-hard"&gt;NP-hard&lt;/a&gt;, but there is a &lt;a href="http://en.wikipedia.org/wiki/Greedy_algorithm"&gt;greedy algorithm&lt;/a&gt; which approximates the optimal value to within a factor of (1 - 1/e) under certain (unrestrictive) conditions.
&lt;/p&gt;
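&lt;p&gt;
Here's a rough Python sketch of that greedy procedure under a cascade model with a single, made-up conversion probability p on every edge.  The graph, p, and the Monte Carlo trial count are all illustrative assumptions, not details from the paper's experiments:
&lt;/p&gt;

```python
import random

def simulate_cascade(graph, seeds, p):
    """One run of the cascade: each newly converted user gets a single
    chance to convert each unconverted neighbor, with probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def expected_spread(graph, seeds, p, trials=200):
    """Monte Carlo estimate of the influence function sigma(A)."""
    return sum(simulate_cascade(graph, seeds, p) for _ in range(trials)) / trials

def greedy_seed_set(graph, k, p, trials=200):
    """Greedily add whichever node most increases the estimated spread."""
    seeds = []
    for _ in range(k):
        candidates = set(graph) - set(seeds)
        best = max(candidates,
                   key=lambda v: expected_spread(graph, seeds + [v], p, trials))
        seeds.append(best)
    return seeds
```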

&lt;p&gt;
If you want more details read the paper linked above or the related &lt;a href="http://www.cs.cornell.edu/home/kleinber/icalp05-inf.pdf"&gt;Influential Nodes in a Diffusion Model for Social Networks&lt;/a&gt; by Kempe, et al.
&lt;/p&gt;

&lt;p&gt;
Using Monte Carlo methods Kempe, et al. simulated the diffusion process using this algorithm versus several of the heuristics I described above.  The results are fairly striking: their algorithm performs at least 18% better than the best-performing heuristic (degree centrality) and 48% better than if the seed set were randomly selected.  I've embedded a graph of their results below.
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20090505/kempe-graph.png" alt="kempe-graph" title="kempe-graph" width="429" height="334" class="alignnone math size-full wp-image-613" /&gt;
&lt;p&gt;
The "target set" is the initial seed set of users to notify, and the "active set" is the final set of users who actually adopted the new feature or product.  The more users who adopt the feature the better the strategy.
&lt;/p&gt;

&lt;h3&gt;Feasibility&lt;/h3&gt;
&lt;p&gt;
Kempe's algorithm is more feasible than many of the heuristics discussed above, although the best performing heuristic — degree centrality — is also the easiest to calculate.  He also doesn't include eigenvalue centrality in his analysis, which I'd be interested in comparing.
&lt;/p&gt;

&lt;p&gt;
The biggest downside to his algorithm is that it requires both full knowledge of the underlying graph and an accounting of all the user-to-user transmission probabilities.  Modeling these probabilities would require a lot of data about users over an extended period of time.
&lt;/p&gt;

&lt;p&gt;
Whether the additional 18% is worth the extra computation and data collection depends a lot on specific circumstances, but personally I'm going to try to implement it in my projects and see how the performance compares first-hand.
&lt;/p&gt;

&lt;h3&gt;Footnotes&lt;/h3&gt;
&lt;ol id="footnotes"&gt;&lt;/ol&gt;</description>
      <pubDate>Tue, 05 May 2009 09:00:02 +0000</pubDate>
      <link>http://20bits.com/article/notification-strategies-for-social-networks</link>
    </item>
    <item>
      <title>Why hi5 Might Have an Edge on Facebook</title>
      <description>&lt;p&gt;
Facebook has been trying hard to find a business model.  Their &lt;a href="http://en.wikipedia.org/wiki/Beacon_(Facebook)"&gt;Beacon&lt;/a&gt; advertising product is probably the most infamous example.  So far they've been left empty handed and have been forced to look outside the company for money, first from Microsoft&lt;span class="footnote"&gt;&lt;a href="http://news.cnet.com/8301-13577_3-9803872-36.html"&gt;Microsoft acquires equity stake in Facebook, expands ad partnership&lt;/a&gt; (cnet)&lt;/span&gt; and then from foreign investors.&lt;span class="footnote"&gt;&lt;a href="http://www.businessinsider.com/2008/11/update-on-facebook-s-dubai-fundraising-trip"&gt;Update On Facebook's Dubai Fundraising Trip&lt;/a&gt; (Business Insider)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
If Facebook wants to be &lt;a href="http://news.cnet.com/8301-10784_3-9946606-7.html"&gt;the internet's cable company&lt;/a&gt; what are they going to have to do to turn themselves into a &lt;a href="http://www.google.com/finance?q=NASDAQ%3ACMCSA"&gt;$40Bn&lt;/a&gt; company?
&lt;/p&gt;

&lt;h3&gt;Are Virtual Goods the Key?&lt;/h3&gt;
&lt;p&gt;
Not all social networks are struggling to find a great business model.  Tencent, a Chinese social networking company, pulled in over $1Bn in revenue last year, primarily through its use of virtual goods.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2009/03/19/the-worlds-most-lucrative-social-network-chinas-tencent-beats-1-billion-revenue-mark/"&gt;The world's most lucrative social network? China's Tencent beats $1 billion revenue mark.&lt;/a&gt; (VentureBeat)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
But Facebook doesn't need to look overseas to see that virtual goods could work for them.  Most of the top Facebook games use virtual currency to make money, powered by leadgen-based ad networks like &lt;a href="http://offerpal.com"&gt;Offerpal&lt;/a&gt; and &lt;a href="http://getgambit.com"&gt;Gambit&lt;/a&gt;.  There are reports that some of these apps are pulling in eight figures per year.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2008/08/25/developer-analytics-facebook-game-mob-wars-making-22000-a-day/"&gt;Developer Analytics: Facebook game Mob Wars making $22,000 a day&lt;/a&gt; (VentureBeat)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
And of course there's Facebook's own gifting service, which has recently moved to a virtual currency system, pricing gifts in "points" that can be bought with real money.&lt;span class="footnote"&gt;&lt;a href="http://blog.facebook.com/blog.php?post=36577782130"&gt;Gift Shop Credits Have Arrived&lt;/a&gt; (Facebook)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
All of this is to say that it appears that virtual goods are a natural business model for social networks and Facebook has enough data to see that.  Why isn't Facebook pursuing this strategy more aggressively?  Why do they seem dead-set on building advertising technologies like Social Ads and Beacon?
&lt;/p&gt;

&lt;h3&gt;The US Advertising Crutch&lt;/h3&gt;
&lt;p&gt;
In the world of advertising not all countries are equal.  US traffic is generally valued the highest, followed by other English-speaking countries, the &lt;a href="http://en.wikipedia.org/wiki/G20_industrial_nations"&gt;G20&lt;/a&gt;, and finally the rest of the world.
&lt;/p&gt;

&lt;p&gt;
Until recently Facebook was concentrated in the English-speaking world.  It's the second largest social network in the US, after MySpace, and the largest in both Canada and the UK.
&lt;/p&gt;

&lt;p&gt;
Unlike other social networks which don't have a significant presence in the English-speaking world, Facebook can support itself through advertising.  This is a crutch that prevents Facebook from making bold decisions with their business model.  I believe Facebook sees themselves as the next Google, one piece of technology away from &lt;a href="http://www.roughtype.com/archives/2007/11/the_social_graf_1.php"&gt;changing the world of advertising&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Demographic Crunch&lt;/h3&gt;
&lt;p&gt;
Not all social networks have Facebook's demographics, of course. hi5, the world's third largest social network after MySpace and Facebook, has an extensive presence throughout Latin America and other countries which advertisers and publishers typically ignore.  The same can be said of the advertising market in China, but recall that Tencent pulled in $1Bn last year through virtual goods.
&lt;/p&gt;

&lt;p&gt;
It's little wonder, then, that hi5 is aggressively pursuing a virtual goods strategy.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2009/01/22/hi5s-virtual-entertainment-plans-could-hit-a-virtual-jackpot/"&gt;Hi5's virtual entertainment plans could hit a virtual jackpot&lt;/a&gt; (VentureBeat)&lt;/span&gt;  Their demographics make this strategy much more appealing.  Facebook has the money and the audience to waste pursuing a pure-advertising strategy for social networks.
&lt;/p&gt;

&lt;p&gt;
What once seemed like a demographic disadvantage might turn out to be a demographic advantage for hi5.  Will they beat Facebook to the business model punch?
&lt;/p&gt;
&lt;p&gt;
And a year from now will we be reading articles about how Facebook's virtual goods strategy compares to hi5's, as opposed to articles about how Facebook's new homepage compares to Twitter?
&lt;/p&gt;

&lt;h3&gt;You're crazy.  You know that, right?&lt;/h3&gt;

&lt;p&gt;
Obviously hi5 has an uphill battle.  Facebook is growing on the order of 500,000 new users &lt;em&gt;per day&lt;/em&gt; and shows no signs of slowing.  But the same was said of MySpace and Friendster when Facebook launched.  I think we still have a few more twists in the story of social networking on the web, and this is just one possible twist among many.
&lt;/p&gt;</description>
      <pubDate>Tue, 28 Apr 2009 10:50:26 +0000</pubDate>
      <link>http://20bits.com/article/why-hi5-might-have-an-edge-on-facebook</link>
    </item>
    <item>
      <title>Behavior Adoption on Social Networks</title>
      <description>&lt;p&gt;
Why and how do people adopt new behaviors?  Why do they start using new products?  Did you sign up for Facebook because all of your friends were on it, or because a specific friend recommended it to you?  Or do you refuse to sign up at all?
&lt;/p&gt;

&lt;p&gt;
In this article I'm going to outline two models that describe how new behaviors, ideas, and messages propagate through social networks.
&lt;/p&gt;

&lt;h3&gt;The Threshold Model&lt;/h3&gt;
&lt;p&gt;
The first model is called the Threshold Model.&lt;span class="footnote"&gt;See &lt;a href="http://rumordynamics.awardspace.com/phfs/Threshold_Models_of_Collective_Behavior.pdf"&gt;Threshold Models of Collective Behavior&lt;/a&gt; (1978) by the famous sociologist Mark Granovetter.&lt;/span&gt;  It says that people adopt a new behavior because a sufficiently large proportion of their friends have adopted that behavior.  Early adopters have a very low threshold, say 5% or 10%, while late adopters would have a much higher threshold. Every person, however, has their own individual threshold.
&lt;/p&gt;

&lt;p&gt;
For example, my girlfriend's stated reason for signing up for Twitter was that "all my friends were using it."  And during the 2008 US Presidential election, some Obama supporters would adopt Hussein as their middle name.&lt;span class="footnote"&gt;See &lt;a href="http://www.huffingtonpost.com/2008/06/28/obama-supporters-adopting_n_109788.html"&gt;Obama Supporters Adopting Middle Name "Hussein" As Their Own&lt;/a&gt;&lt;/span&gt;  When I saw that lots of my friends were doing it I was certainly tempted to do the same.
&lt;/p&gt;

&lt;p&gt;
The underlying psychological principle is one of "missing out" or "when in Rome."  The key variable here is the initial distribution of thresholds across the social network, which completely determines the final extent of the behavior.
&lt;/p&gt;

&lt;p&gt;
It's worth noting that this model says nothing about how people &lt;em&gt;initially&lt;/em&gt; adopt behavior.  That is, it says nothing about innovators, only about the spread of innovation through a social network.
&lt;/p&gt;

&lt;h3&gt;The Cascade Model&lt;/h3&gt;
&lt;p&gt;
The second model is called the Cascade or Word-of-Mouth Model&lt;span class="footnote"&gt;See &lt;a href="http://pluto.huji.ac.il/~msgolden/home_page/pdf/TalkofNetworks.pdf"&gt;Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth&lt;/a&gt; (2001) by Goldenberg, Libai, and Muller.&lt;/span&gt;, and is the method of "viral growth" that most &lt;a href="/article/social-applications-are-social-networks"&gt;social application developers&lt;/a&gt; are familiar with.  It says that every person has a chance of adopting a new behavior whenever one of their neighbors adopts it.
&lt;/p&gt;

&lt;p&gt;
This model describes phenomena like product recommendations or user-to-user notifications on Facebook.  The probability that a person adopts the new behavior is the conversion rate for the notification.&lt;span class="footnote"&gt;More accurately, we'd model the "probability" as a random variable whose mean was the conversion rate.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
This probability is both a function of the sender and the recipient, so more influential people are more likely to convince you to adopt a behavior (or purchase a product, or install an application).
&lt;/p&gt;

&lt;h3&gt;Practical Implications&lt;/h3&gt;
&lt;p&gt;
Both of these models describe facets of real-world interaction on social networks.  My take is that the cascade model is more accurate at the beginning of a social network's life, where behavior is spreading through sparse areas, connected by influencers.  Later on, after a critical density has settled in, people start adopting the behavior because everyone else is adopting it and there's a social cost to not doing the same.
&lt;/p&gt;

&lt;p&gt;
We see this pattern in services like Facebook and MySpace, both of which got their start by harvesting emails and spreading through word-of-mouth (and spam) across a social network.&lt;span class="footnote"&gt;See &lt;a href="http://www.amazon.com/Stealing-MySpace-Control-Popular-Website/dp/1400066948"&gt;Stealing MySpace: The Battle to Control the Most Popular Website in America&lt;/a&gt; for details about the MySpace team's background in direct marketing.  The ConnectU vs. Facebook court documents, which you can find via Google, paint a similar story for Facebook's early years.&lt;/span&gt; Eventually each network reached a point where a sufficient number of people were familiar with the product and new users adopted it not because their friends recommended it (the cascade model), but because there was a social expectation that they do (the threshold model).
&lt;/p&gt;

&lt;p&gt;
Also, with respect to analytics and viral growth, the threshold model is more difficult to track.  In the cascade model we record who sent what to whom and which messages they responded to.  It's clear who gets credit for a user's conversion.  In the threshold model you have to track passive exposures, and there's no clear causal relationship. 
&lt;/p&gt;

&lt;p&gt;
If ten of my friends are doing something and I decide to start doing the same thing, who gets credit?  Most analytics packages will show this behavior as a direct visit, with no connection to other users' behavior, even though there is a viral process underlying it.
&lt;/p&gt;

&lt;p&gt;
In short, the threshold model requires a certain level of behavioral density, while the cascade model doesn't.  However, we see both models expressed in how people actually adopt new behaviors in social contexts.
&lt;/p&gt;

&lt;h3&gt;Formalisms&lt;/h3&gt;
&lt;p&gt;
In the threshold model every person &lt;em&gt;u&lt;/em&gt; has a threshold
&lt;/p&gt;
&lt;div class="latex math"&gt; T_u \in [0,1]&lt;/div&gt;
&lt;p&gt;
and each of their neighbors &lt;em&gt;v&lt;/em&gt; is weighted according to
&lt;/p&gt;
&lt;div class="latex math"&gt; w_{u,v}&lt;/div&gt;
&lt;p&gt;
If
&lt;/p&gt;
&lt;div class="math"$latex \displaystyle{T_u &lt; \sum_{v \in \text{adopters}} w_{u,v}}$&lt;/div&gt;
&lt;p&gt;
then the person &lt;em&gt;u&lt;/em&gt; adopts the behavior.
&lt;/p&gt;

&lt;p&gt;
The set of thresholds, weights, and initial adopters completely determines the extent of the behavior in the social network.
&lt;/p&gt;
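&lt;p&gt;
The threshold dynamics above can be sketched in a few lines of Python.  As a simplifying assumption I weight every neighbor equally, w&lt;sub&gt;u,v&lt;/sub&gt; = 1/deg(u); the graph and thresholds here are invented for illustration:
&lt;/p&gt;

```python
def run_threshold(graph, thresholds, initial_adopters):
    """Iterate the threshold rule to a fixed point: u adopts once the
    fraction of adopting neighbors exceeds T_u (equal edge weights)."""
    adopters = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for u, neighbors in graph.items():
            if u in adopters or not neighbors:
                continue
            frac = sum(v in adopters for v in neighbors) / len(neighbors)
            if frac > thresholds[u]:
                adopters.add(u)
                changed = True
    return adopters

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
thresholds = {"a": 0.4, "b": 0.4, "c": 0.6, "d": 0.4}
# Seeding "a" is enough to tip the whole network here.
print(run_threshold(graph, thresholds, {"a"}))
```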

&lt;p&gt;
In the cascade model, for every person &lt;em&gt;u&lt;/em&gt; and neighbor &lt;em&gt;v&lt;/em&gt; there is a random variable
&lt;/p&gt;
&lt;div class="latex math"&gt; X_{u,v}&lt;/div&gt;
&lt;p&gt;
which describes the likelihood of &lt;em&gt;u&lt;/em&gt; adopting the behavior if &lt;em&gt;v&lt;/em&gt; has adopted it.
&lt;/p&gt;
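&lt;p&gt;
Here's a sketch of one realization of the cascade model, where each X&lt;sub&gt;u,v&lt;/sub&gt; is drawn as a coin flip with a fixed per-edge conversion rate; the graph and rates are invented for illustration:
&lt;/p&gt;

```python
import random

def run_cascade(graph, probs, initial_adopters):
    """One realization of the cascade model: when u adopts, each neighbor v
    gets a single chance to adopt, with per-edge probability probs[(u, v)]."""
    adopters = set(initial_adopters)
    frontier = list(initial_adopters)
    while frontier:
        u = frontier.pop()
        for v in graph.get(u, ()):
            if v not in adopters and random.random() < probs.get((u, v), 0.0):
                adopters.add(v)
                frontier.append(v)
    return adopters
```

Averaging the size of `run_cascade`'s result over many runs estimates the expected reach of a given seed set.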

&lt;h3&gt;Takeaways&lt;/h3&gt;
&lt;p&gt;
I'll try to boil all this down into a few practical takeaways.
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;The Threshold and Cascade Models describe two mechanisms of behavior adoption in social networks.&lt;/li&gt;
	&lt;li&gt;The Threshold Model says that people do something if enough of their friends are doing it.&lt;/li&gt;
	&lt;li&gt;The Cascade Model says that people have a chance of doing something if one of their friends is doing it.&lt;/li&gt;
	&lt;li&gt;Both models correspond to different real-life adoption patterns.&lt;/li&gt;
	&lt;li&gt;The typical "viral loop" involves the cascade model, but most successful social networks rely on the mechanics of the threshold model in the long run, i.e., density is important for long-term success.&lt;/li&gt;
	&lt;li&gt;The cascade model is a good tool for analyzing acquisition scenarios, but the threshold model is probably more helpful for understanding retention and engagement &amp;mdash; it at least implies that &lt;em&gt;density&lt;/em&gt; is a key factor in social network growth, a metric that's not often discussed publicly.&lt;/li&gt;
&lt;/ol&gt; 

&lt;p&gt;
Agree?  Disagree?  Leave a comment, send me an email, or &lt;a href="http://twitter.com/jessefarmer"&gt;follow me on Twitter&lt;/a&gt;!
&lt;/p&gt;</description>
      <pubDate>Fri, 24 Apr 2009 09:20:12 +0000</pubDate>
      <link>http://20bits.com/article/behavior-adoption-on-social-networks</link>
    </item>
    <item>
      <title>Almost Viral: A Hybrid Acquisition Strategy</title>
      <description>&lt;p&gt;
Two common acquisition strategies for a new application are a paid acquisition strategy and a viral acquisition strategy.  The former involves acquiring users at a cost less than the revenue they generate.  The latter involves users inviting their friends to the application.
&lt;/p&gt;

&lt;p&gt;
"Going viral" has become a sort of holy grail and most people would say they'd rather have a viral application than not, but it has a distinct downside: uncontrolled growth and ever-increasing operational costs.  By being almost-but-not-quite viral you can dramatically reduce your cost of acquisition without setting your servers on fire.
&lt;/p&gt;

&lt;h3&gt;The Paid Strategy&lt;/h3&gt;
&lt;p&gt;
The key variables for a paid strategy are cost of acquisition and average revenue per user, or ARPU.  If your ARPU is greater than your cost of acquisition then you can buy as many users as your budget allows, reinvesting the new revenue into acquiring new users.  Generally this is done by advertising your product through something like AdWords or Facebook's Social Ads.
&lt;/p&gt;

&lt;p&gt;
The best thing about a paid strategy is its relative simplicity.  If your ARPU is $2.00 then you can run as many ads you want so long as the cost of acquisition is less than $2.00 and still be profitable.
&lt;/p&gt;

&lt;p&gt;
Eventually, though, you will run out of ad inventory.  There are only so many publishers who are willing to accept $0.10 per click.  At this point your only options are to decrease your cost of acquisition&lt;span class="footnote"&gt;If you increase the conversion rates for your ads then you can pay more per click to get more impressions without hurting your bottom line &amp;mdash; you pay the same to acquire a user, but get more users through the door.&lt;/span&gt; or increase your ARPU.
&lt;/p&gt;

&lt;h3&gt;The Viral Strategy&lt;/h3&gt;
&lt;p&gt;
The viral acquisition strategy requires that you get current users to invite their friends to the application.  The two key variables for the viral strategy are the average number of invites each new user sends and the rate at which those invites convert into new users.
&lt;/p&gt;

&lt;p&gt;
The number of new users each existing user brings in (invites sent times the invite conversion rate) is your viral coefficient, k.  If this is greater than one you will see self-sustaining, viral growth.&lt;span class="footnote"&gt;See my article &lt;a href="/article/three-myths-of-viral-growth"&gt;Three Myths of Viral Growth&lt;/a&gt; for more information about viral growth.&lt;/span&gt;  If the coefficient is less than one each user still brings in a fixed expected number of new users, but the application's growth remains linear.
&lt;/p&gt;

&lt;p&gt;
People who have decided on a viral acquisition strategy focus on this number obsessively.  It's the first big hurdle to clear and if you haven't had experience engineering a viral application it can take months to build something viral.
&lt;/p&gt;

&lt;p&gt;
But being viral isn't an either-or proposition.  Increasing your viral coefficient from 0.5 to 0.8 has other advantages, especially if you integrate it with a paid strategy.  Let's see how.
&lt;/p&gt;

&lt;h3&gt;A Hybrid Strategy&lt;/h3&gt;
&lt;p&gt;
Say you're building a game on Facebook backed by a virtual currency.  Users give you money or fill out offers to get coins, so you have a positive ARPU.  This means you're free to pursue a paid acquisition strategy.
&lt;/p&gt;

&lt;p&gt;
On the other hand, you're on Facebook, which has many viral hooks.  Many of the technical hurdles are much smaller there, so it becomes more a question of design and optimization rather than implementation.  At the very least you encourage players to invite their friends.
&lt;/p&gt;

&lt;p&gt;
The first version of your application has a viral coefficient of k=0.5, that is, every new user who joins brings in 0.5 new users.  Equivalently, for every two users you acquire you get one free.
&lt;/p&gt;

&lt;p&gt;
That's interesting, especially if you're also &lt;em&gt;paying&lt;/em&gt; for users.  If you're paying $1.50 per user then you paid $3.00 to get two users, but acquired a third user for free!  This means that you effectively paid $1.00 per user: $3.00 paid / 3 users. 
&lt;/p&gt;

&lt;p&gt;
This process is actually geometric, however.  If you purchased 4 users with a viral coefficient of 0.5, you'd first get 2 new users for free.  These 2 new users would then bring in 1 additional user, for a total of 3 new users, reducing your cost of acquisition even further.  This is an infinite geometric series, which I'll outline below.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Having a non-zero viral coefficient reduces your effective cost of acquisition.&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
This means that you get more users for every dollar you spend on ads.  Or, if you've run out of inventory, this means you can now spend more per user and retain the same net revenue level.
&lt;/p&gt;

&lt;p&gt;
Let's figure out how to calculate your &lt;em&gt;effective&lt;/em&gt; cost per acquisition.
&lt;/p&gt;

&lt;h3&gt;Effective Cost of Acquisition&lt;/h3&gt;
&lt;p&gt;
If you just want the formula, here it is:
&lt;/p&gt;
&lt;div class="latex math"&gt; C' = C(1-k)&lt;/div&gt;
&lt;p&gt;
Where C is your cost of acquisition, k is your viral coefficient, and C' is your effective cost of acquisition.
&lt;/p&gt;
&lt;p&gt;
If k=0, i.e., you have no viral acquisition, then C' = C.
&lt;/p&gt;
&lt;p&gt;
If k = 1 and your application is viral then C' = 0 and your application grows without spending any additional money.  But rather than being an either-or proposition &amp;mdash; you're either viral or you're not &amp;mdash; there's a sliding scale.  The more viral you are the cheaper it is to acquire users.
&lt;/p&gt;
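&lt;p&gt;
The formula is two lines of Python, and we can sanity-check it against the geometric-series view of the process (the $1.50 and k = 0.5 figures are just the illustrative numbers from above):
&lt;/p&gt;

```python
def effective_cpa(cost, k):
    """Effective cost of acquisition: C' = C * (1 - k), for 0 <= k < 1."""
    if not 0 <= k < 1:
        raise ValueError("formula assumes a sub-viral coefficient, 0 <= k < 1")
    return cost * (1 - k)

def expected_users_per_paid_user(k, steps=60):
    """Partial sum of 1 + k + k^2 + ...; converges to 1/(1-k) for k < 1."""
    return sum(k ** i for i in range(steps))

# Paying $1.50 per user with k = 0.5 works out to $0.75 per user effectively,
# because each paid user ultimately brings in one extra user on average.
print(effective_cpa(1.50, 0.5))              # -> 0.75
print(expected_users_per_paid_user(0.5))     # converges to 2.0
```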

&lt;h3&gt;The Benefits of Being Almost Viral&lt;/h3&gt;
&lt;p&gt;
From the above formula, if you have a viral coefficient of 0.90 then you have reduced your cost of acquisition by 90%.  This is a great situation to be in.  You might ask, "Why wouldn't I want to go the last 0.10 and make my application viral?"
&lt;/p&gt;

&lt;p&gt;
The one benefit of being viral is huge growth, which looks sexy on a graph and can tip an investor to your side if you're looking for outside money, but the growth is unpredictable.  Not only do you have little control over what demographics come to dominate your application, but sometimes it grows so quickly that you run into operational problems (servers on fire, etc.).
&lt;/p&gt;

&lt;p&gt;
By being almost viral you can grow very cheaply, control your rate of growth and demographics, and get enough traffic to conduct meaningful &lt;a href="/article/scientific-product-development"&gt;experiments&lt;/a&gt;.  Need to grow more slowly?  Just decrease your daily ad spend.  Need statistically significant results more quickly?  Increase your daily ad spend.
&lt;/p&gt;

&lt;p&gt;
Put another way, with a viral coefficient of 0.9 you've dealt with your acquisition risk.  Rather than going fully viral and dealing with the operational difficulties, it might be worth your time to deal with other market risks: retention, engagement, and monetization.
&lt;/p&gt;

&lt;p&gt;
So, stop sweating about "being viral."  Sometimes it's better to be almost viral.
&lt;/p&gt;

&lt;h3&gt;Deriving the Formula for Effective Cost of Acquisition&lt;/h3&gt;

&lt;p&gt;
You can skip this and go right to the comments if you're not interested in the math.
&lt;/p&gt;

&lt;p&gt;
We have an application with a viral coefficient of k = 0.5.  Every new user who joins brings in 0.5 new users.  Another way of thinking of it is that for every new user who joins there is a 50% probability that he will bring in another user.
&lt;/p&gt;

&lt;p&gt;
But this potential user also has a 50% chance of bringing in a new user, so the expected number of users is now 1 + 0.5 + 0.5*0.5.  This continues &lt;em&gt;ad infinitum&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Formally, if we acquire one user and have a viral coefficient of k then the number of users we expect to join is N(k), given by the formula
&lt;/p&gt;

&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{N(k) = 1 + k + k^2 + k^3 + \cdots = \sum_{i = 0}^{\infty} k^i}&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
The initial 1 is our original user, whom we paid for, and each k represents the expected number of users from each step in the viral process.
&lt;/p&gt;

&lt;p&gt;
This is a &lt;a href="http://en.wikipedia.org/wiki/Geometric_series"&gt;geometric series&lt;/a&gt;, so we know that
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{N(k) = \frac{1}{1-k}}&lt;/div&gt;
&lt;/div&gt;
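&lt;p&gt;
As a quick sanity check, here is a minimal Python sketch (the function names are my own, purely for illustration) that sums the series numerically and confirms the closed form:
&lt;/p&gt;

```python
def expected_users(k, steps=1000):
    """Partial sum of the viral series 1 + k + k^2 + ...: total expected users per paid user."""
    return sum(k ** i for i in range(steps))

def effective_cac(cost, k):
    """Effective cost of acquisition, C' = C * (1 - k), valid for k < 1."""
    return cost * (1 - k)

# With k = 0.5 the series converges to 1 / (1 - 0.5) = 2 expected users per
# paid user, so a $1.00 cost of acquisition effectively becomes $0.50.
print(expected_users(0.5))       # approaches 2.0
print(effective_cac(1.00, 0.5))  # 0.5
```

&lt;p&gt;
With k = 0.9 the same functions give 10 expected users per paid user, i.e., a 90% reduction in effective cost.
&lt;/p&gt;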

&lt;p&gt;
Therefore, our effective cost of acquisition is
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{C' = \frac{C}{N(k)} = \frac{C}{\frac{1}{1-k}} = C(1-k)}&lt;/div&gt;
&lt;/div&gt;</description>
      <pubDate>Wed, 15 Apr 2009 09:00:54 +0000</pubDate>
      <link>http://20bits.com/article/almost-viral-a-hybrid-acquisition-strategy</link>
    </item>
    <item>
      <title>Social Applications are Social Networks</title>
      <description>&lt;p&gt;
Are all social applications also social networks?  Dave McClure made a passing reference to this a little over a year ago, saying "RockYou &amp; Slide [are] arguably social networks of their own."&lt;span class="footnote"&gt;&lt;a href="http://500hats.typepad.com/500blogs/2007/11/google-open-soc.html"&gt;Google Open Social + Friends vs. Facebook Platform&lt;/a&gt;&lt;/span&gt;  I want to make the stronger claim: social applications are always social networks.
&lt;/p&gt;

&lt;p&gt;
It doesn't matter how large you are, it doesn't matter what your goals are, and it doesn't matter what your product is.  I think if you're building a social application then you're trying to build a new social network.  As we'll see, this has both strategic and technical implications.
&lt;/p&gt;

&lt;h3&gt;What is a Social Network?&lt;/h3&gt;
&lt;p&gt;
First, if I'm going to convince you that something is a social network we should understand what a social network is. If you ask a person to name a few social networks, they will probably list services like Facebook, MySpace, and Twitter.  And if an investor tells you they're "not investing in social networks," they mean it in this concrete, social-network-as-a-product sense.
&lt;/p&gt;

&lt;p&gt;
Others, like Brad Fitzpatrick and Mark Zuckerberg, use the term &lt;em&gt;social graph&lt;/em&gt;&lt;span class="footnote"&gt;See, e.g., &lt;a href="http://bradfitz.com/social-graph-problem/"&gt;Thoughts on the Social Graph&lt;/a&gt;&lt;/span&gt; to distinguish between the underlying social relations between people and the services, called social networks, that are built on top of them.
&lt;/p&gt;
&lt;p&gt;
But if there's one thing I learned from my mathematics education it's this: we're free to define things however we want so long as they're consistent.  Therefore we ought to choose the definition that helps us get our job done.
&lt;/p&gt;

&lt;p&gt;
So, here is my first, and most abstract definition: &lt;blockquote&gt;A social network is a collection of people bound together through a specific set of social relations.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;!-- Let's see if anyone makes me define social relation! --&gt;

&lt;p&gt;
By "social relation" I mean a connection between people that permits the exchange of information.  This prevents artificial relations like "Alex and James are connected if they have the same hair color."
&lt;/p&gt;

&lt;p&gt;
When I say "social network" I always mean the actual collection of people.  Facebook is a social network.  There are actual people engaged with the site, creating relationships, sharing information, and doing all the things they'd do in "real life."  Or, put another way: a family is a social network, a family tree is not.&lt;span class="footnote"&gt;&lt;a href="http://www.artinthepicture.com/artists/Rene_Magritte/pipe.jpeg"&gt;Ceci n'est pas un Social Network&lt;/a&gt;&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
If you don't like the above definition I can give you a functional one which I believe is equivalent. &lt;blockquote&gt;A collection of people is a social network if and only if it is possible for something to spread virally through that collection.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
In Web 2.0 speak, a "social network" is a collection of people over which you can "go viral".  I believe that virality and social networks are fundamentally linked, and that both the above definitions are equivalent.
&lt;/p&gt;

&lt;h3&gt;Social Applications are Social Networks&lt;/h3&gt;
&lt;p&gt;
Accepting the above definitions, even if for the sake of argument, I don't think it's too hard to see why social applications are social networks. Let's take Slide's &lt;a href="http://www.facebook.com/apps/application.php?id=2425101550"&gt;Top Friends&lt;/a&gt; as an example.  Is Top Friends a social network in its own right?
&lt;/p&gt;

&lt;p&gt;
I think it's easier to see that Top Friends meets the first definition.  It is certainly a collection of people: the set of Facebook users who have installed the application.  Are those people bound by specific social relations?  Yes, and those relations are distinct from the ones represented in Facebook.  For example, Alex adding James as a top friend is a social signal distinct from Facebook.
&lt;/p&gt;

&lt;p&gt;
What about the second definition?  Top Friends doesn't have an external API so it's impossible to build apps or plugins for Top Friends.&lt;span class="footnote"&gt;For all I know Slide has an internal Top Friends API that lets them build new services that ride on Top Friends' success, but that's only &lt;a href="http://api.topfriends.com/"&gt;speculation&lt;/a&gt;.&lt;/span&gt;  So, what "goes viral" over Top Friends?  New features and patterns of usage do.&lt;span class="footnote"&gt;This is the essence of &lt;a href="http://startuplessonslearned.blogspot.com/2008/12/engagement-loops-beyond-viral.html"&gt;engagement loops&lt;/a&gt;.  Eric Ries talks about going "beyond viral."  There is no "beyond viral."  Rather, on social networks viral processes govern the whole stack: acquisition, retention, engagement, and monetization.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
I'd also argue that the converse is true: social networks are all social applications.  YouTube spread through MySpace, Facebook spread through email, email spread through the real-life "social graph", and PayPal spread through eBay.&lt;span class="footnote"&gt;Slide is to Facebook as Paypal was to eBay.  Anyone buy it?&lt;/span&gt; All social networks are social applications built off of pre-existing social networks.
&lt;/p&gt;

&lt;h3&gt;Strategic Implications&lt;/h3&gt;
&lt;p&gt;
If Top Friends is a social network in its own right then there are strategic implications for Facebook. &lt;em&gt;Prima facie&lt;/em&gt;, Top Friends is competing with Facebook for users' attention on its own platform.  Before Facebook launched the Platform it was the Eye of Providence, collecting, collating, and analyzing every bit of activity that occurred on its network.
&lt;/p&gt;

&lt;p&gt;
After the Platform launched these third parties were able to infect portions of Facebook's network.  In some cases, e.g., the Causes application, the relationship was symbiotic.  In others, e.g., Top Friends, the relationship was antagonistic, with Facebook actually shutting down Top Friends at one point.&lt;span class="footnote"&gt;See &lt;a href="http://www.techcrunch.com/2008/06/26/did-facebook-shut-down-slides-top-friends-how-very-myspace-of-them/"&gt;this TechCrunch article&lt;/a&gt;.&lt;/span&gt;  
&lt;/p&gt;

&lt;p&gt;
What does Facebook gain by having Top Friends on its Platform?  Nothing substantial, as far as I can tell.  What does it lose?  Control and insight over the activities of its userbase.&lt;span class="footnote"&gt;There's a broader argument that ceding control in this way is the right strategic move, but Facebook is not there yet &amp;mdash; the limit of that argument is something like OpenSocial.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
In effect, Top Friends is a social network bootstrapped off of Facebook, with its own set of communication channels over which Facebook has no authority or insight.  This tension is present everywhere in the Platform because application developers' interests are not wholly aligned with Facebook's and will probably never be.
&lt;/p&gt;

&lt;h3&gt;Technical Implications&lt;/h3&gt;
&lt;p&gt;
I'm going to save the technical implications for another article, but it boils down to this: social networks in the sense that I defined above are fairly well understood.  I believe the techniques used on the web today to grow "viral" applications are underpinned by research from fields like social network analysis and epidemiology.
&lt;/p&gt;

&lt;p&gt;
Since I believe social applications and social networks are synonymous, we can better understand how these applications grow by understanding how social networks grow.
&lt;/p&gt;

&lt;p&gt;
In the meantime, I recommend reading &lt;em&gt;&lt;a href="http://www3.interscience.wiley.com/journal/118986267/abstract"&gt;The Statistical Evaluation of Social Network Dynamics&lt;/a&gt;&lt;/em&gt; by Tom A. B. Snijders from the University of Groningen if you're interested in the technical aspects of social networks and social applications.
&lt;/p&gt;

&lt;p&gt;
And please, leave a comment if you have any thoughts about the above!
&lt;/p&gt;</description>
      <pubDate>Thu, 09 Apr 2009 08:00:51 +0000</pubDate>
      <link>http://20bits.com/article/social-applications-are-social-networks</link>
    </item>
    <item>
      <title>Where the iTunes Store Fails: Community</title>
      <description>&lt;p&gt;
You don't need me to tell you that the iTunes Store has changed the face of music distribution, digital or otherwise.  In April of 2008 it became the top music retailer in the US&lt;span class="footnote"&gt;&lt;a href="http://www.apple.com/pr/library/2008/04/03itunes.html"&gt;iTunes Store Top Music Retailer in the US&lt;/a&gt;&lt;/span&gt; and passed 6 billion songs downloaded earlier this year&lt;span class="footnote"&gt;&lt;a href="http://www.techcrunch.com/2009/01/06/itunes-sells-6-billion-songs-and-other-fun-stats-from-the-philnote/"&gt;iTunes Sells 6 Billion Songs, And Other Fun Stats From The Philnote&lt;/a&gt;&lt;/span&gt;.  That's almost one song downloaded for every person on the planet.
&lt;/p&gt;

&lt;p&gt;
For music startups iTunes figures into most strategic decisions.  If you're streaming music for free to consumers you're going to be an iTunes affiliate&lt;span class="footnote"&gt;Both imeem and Last.FM are, for example&lt;/span&gt;.  If you're selling music to consumers you're going to be competing directly with iTunes &amp;mdash; consumers have no reason to get their music from both you and iTunes if you both have it.
&lt;/p&gt;
&lt;p&gt;
It's understandable if your heart skips a beat when you catch a rumor that Apple will be building a similar product.  How can you maneuver in this environment?
&lt;/p&gt;


&lt;h3&gt;Finding Room to Breathe&lt;/h3&gt;
&lt;p&gt;
The iTunes Store is a lot like Wal-Mart: ubiquitous&lt;span class="footnote"&gt;Who did Apple pass as the top music retailer?  Wal-Mart&lt;/span&gt;, highly integrated, and bland.  People shop there because it's easier, not because it's sexier, even though in other areas Apple is very good at selling precisely that hip, sexy lifestyle.
&lt;/p&gt;

&lt;p&gt;
But Wal-Mart's strategy isn't the only strategy out there.  Companies like Whole Foods can still thrive in their niche even though people can get cheaper food at Wal-Mart.  Where is the Whole Foods of digital music?  Does such a thing even make sense?
&lt;/p&gt;

&lt;h3&gt;Building a Community&lt;/h3&gt;
&lt;p&gt;
Corner record stores are about more than just the transaction.  They attract a certain crowd and embrace a certain culture.  &lt;a href="http://en.wikipedia.org/wiki/High_Fidelity_(film)"&gt;High Fidelity&lt;/a&gt; is a good example of this on film.
&lt;/p&gt;

&lt;p&gt;
iTunes misses out on the cultural and communal aspects of music altogether.  It's very sterile.  It's also a terrible means of &lt;em&gt;discovering&lt;/em&gt; new music, a role which traditional record stores can fulfill. 
&lt;/p&gt;

&lt;p&gt;
As an example, say you're really into electronica.   What good are the reviews on the iTunes music store to you?  They're left by idiots who don't know Aphex Twin from Paul van Dyk.  You go there when you know what you want and leave the second you have it&lt;span class="footnote"&gt;This is a problem with iTunes in general.  It's the last step in your marketing campaign, not the first.  See my article &lt;a href="/article/the-099-app-store"&gt;The $0.99 (App) Store&lt;/a&gt;.&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;
And knowing iTunes, they might not even have music from your favorite bands if they're obscure enough.
&lt;/p&gt;

&lt;p&gt;
Instead, imagine a hub of engaged electronica fans with a custom, electronica-centric store&lt;span class="footnote"&gt;Of course, you can break down genres into subgenres, and so forth.  Maybe the sweet spot is having an ambient store, a trance store, a house store, a D&amp;B store, etc.&lt;/span&gt;.  The community itself spurs demand for its own store due to its reputation for quality electronica-related recommendations.
&lt;/p&gt;

&lt;p&gt;
Improved discovery, better quality merchandise, exclusive deals with bands, and a community of like-minded people are just a few reasons why people might prefer to shop at a genre-specific music store rather than iTunes if they're forced to choose.
&lt;/p&gt;

&lt;p&gt;
Plus, an independent store is freer to experiment with payment models, distribution methods, and marketing campaigns.  This might interest bands who see iTunes as a love-it-or-leave-it environment controlled from top to bottom by Apple.
&lt;/p&gt;

&lt;h3&gt;Will This Work?&lt;/h3&gt;
&lt;p&gt;
I don't know if this will work, but I think it's a reasonable enough &lt;a href="/article/scientific-product-development"&gt;hypothesis to test&lt;/a&gt;.  This is just one possible strategy for building a music product and probably has several flaws I haven't thought through.  Leave a comment and let me know your thoughts.
&lt;/p&gt;

&lt;p&gt;
May a thousand music stores bloom!
&lt;/p&gt;

&lt;h3&gt;Update&lt;/h3&gt;
&lt;p&gt;
&lt;a href="http://adam.blog.heroku.com/"&gt;Adam from Heroku&lt;/a&gt; pointed me towards &lt;a href="https://www.beatport.com/"&gt;Beatport&lt;/a&gt;, which has been pursuing this exact strategy for the last few years.&lt;/p&gt;
&lt;p&gt;
After a little digging I found a few others, too.  &lt;a href="http://www.insound.com/"&gt;Insound.com&lt;/a&gt; for indie music and &lt;a href="http://mondomix.com/"&gt;Mondomix&lt;/a&gt; for world music.  I also know of one stealth startup pursuing this strategy for another genre.  Do you know of any others?  How well does this strategy perform?
&lt;/p&gt;

&lt;p&gt;
In the limit you can imagine a "Ning for iTunes Stores," where the costs of implementing the store are shared but the community-building aspects are left to the company.
&lt;/p&gt;</description>
      <pubDate>Mon, 06 Apr 2009 08:45:04 +0000</pubDate>
      <link>http://20bits.com/article/where-the-itunes-store-fails-community</link>
    </item>
    <item>
      <title>When in Rome: Newcomers on Facebook</title>
      <description>&lt;p&gt;
A &lt;a href="http://zellunit.com"&gt;teammate&lt;/a&gt; of mine recently sent me a link to a paper called "&lt;a href="http://www.thoughtcrumbs.com/publications/paper0778-burke.pdf"&gt;Feed Me: Motivating Newcomer Contribution in Social Network Sites&lt;/a&gt;" and I thought it was worth discussing.  The paper was jointly authored by &lt;a href="http://www.cs.cmu.edu/~mkburke/"&gt;Moira Burke&lt;/a&gt;, a PhD student at Carnegie Mellon, and &lt;a href="http://overstated.net/"&gt;Cameron Marlow&lt;/a&gt; and Thomas Lento, two research scientists at Facebook.
&lt;/p&gt;

&lt;h3&gt;The Chicken and the Egg&lt;/h3&gt;
&lt;p&gt;
The root question addressed in the paper is this: &lt;em&gt;what motivates newcomers to contribute to social networks?&lt;/em&gt;  For social networking sites, getting users to contribute is one of the primary problems, right after acquiring users in the first place.
&lt;/p&gt;

&lt;p&gt;
Let's dive right in and look at their hypotheses, methodology, and conclusions.
&lt;/p&gt;

&lt;h3&gt;Hypotheses&lt;/h3&gt;

&lt;p&gt;
The authors took all users who joined on a random weekday in March 2008 &amp;mdash; amounting to about 140,000 users &amp;mdash; and tried to predict their long-term sharing habits based on the experiences they have in the first two weeks.  Specifically, they looked at how users interacted with photos.
&lt;/p&gt;

&lt;p&gt;
The paper outlines four hypotheses:
&lt;ol&gt;
&lt;li&gt;Social learning: Newcomers whose friends share more content will go on to contribute more content themselves.&lt;/li&gt;
&lt;li&gt;Singling out: Newcomers who are singled out in content will contribute more content.&lt;/li&gt;
&lt;li&gt;Feedback: Newcomers receiving more feedback on their initial content will go on to contribute more content.&lt;/li&gt;
&lt;li&gt;Distribution: Newcomers whose initial content is distributed widely will go on to contribute more content.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;h3&gt;Conclusion: When in Rome...&lt;/h3&gt;
&lt;p&gt;
The authors also broke down the newcomers into two categories, early uploaders and non-early uploaders, depending on whether or not they uploaded more than one photo in the first two weeks.
&lt;/p&gt;

&lt;p&gt;
The two factors that correlated with long-term photo sharing for early uploaders were whether your friends were also sharing photos in your first two weeks, and whether people commented on your photos.  Surprisingly "singling out," i.e., getting tagged in photos, had no &lt;a href="/article/hypothesis-testing-the-basics"&gt;statistically significant&lt;/a&gt; effect.
&lt;/p&gt;

&lt;p&gt;
Singling out, however, did work for non-early uploaders, suggesting that people can be cajoled into uploading photos by tagging them, but that people already uploading photos to Facebook won't upload any more than they were before.
&lt;/p&gt;

&lt;p&gt;
In short, newcomers are susceptible to peer pressure.
&lt;/p&gt;

&lt;h3&gt;How is this useful?&lt;/h3&gt;
&lt;p&gt;
The upside to this paper is that it gives a clear picture of what is worth measuring.  Getting a user to upload a photo doesn't just mean one more photo on the site &amp;mdash; some percentage of their friends will upload a photo, too.
&lt;/p&gt;

&lt;p&gt;
What's more, you can enter into a sort of feedback loop.  The paper didn't address whether "social learning" also correlated with increasing auxiliary activities like feedback, but imagine this: more photos uploaded means more comments, which in turn means more photos.  Is it possible to make this cycle self-sustaining?
&lt;/p&gt;

&lt;p&gt;
The downside is that this doesn't help with the chicken-and-egg problem.  What happens when a user comes to the site and they have no friends?  There are some public spaces on Facebook, but most social networking of that kind is dominated by interactions among friends.
&lt;/p&gt;

&lt;p&gt;
Overall, this is one of the most detailed papers analyzing data from a huge social network.  Leave a comment and let me know your thoughts, especially if you know of any other papers of this kind!
&lt;/p&gt;</description>
      <pubDate>Mon, 19 Jan 2009 03:23:04 +0000</pubDate>
      <link>http://20bits.com/article/when-in-rome-newcomers-on-facebook</link>
    </item>
    <item>
      <title>The $0.99 (App) Store</title>
      <description>&lt;img src="http://assets.20bits.com/20081210/iphone_home.gif" alt="" title="iPhone" width="150" height="250" class="alignleft size-medium wp-image-435" style="float: left"/&gt;
&lt;p&gt;
I was going to hold off writing this article, but after reading this &lt;a href="http://www.macblogz.com/2008/12/09/iphone-developer-writes-personal-letter-to-steve-jobs/"&gt;open letter to Steve Jobs from an iPhone developer&lt;/a&gt; I just couldn't.
&lt;/p&gt;

&lt;p&gt;
Mr. Hockenberry isn't the first to argue that iPhone apps are &lt;a href="http://blogs.oreilly.com/iphone/2008/06/what-should-iphone-application.html"&gt;too cheap&lt;/a&gt;.  So, what gives?
&lt;/p&gt;

&lt;h3&gt;Marketing v. Distribution&lt;/h3&gt;
&lt;p&gt;
The problem is that the App Store is a distribution channel (and a very good one, at that), but developers are using it as their primary means of marketing.  Distribution and marketing aren't one and the same, and this tension is why developers are feeling pinched.
&lt;/p&gt;

&lt;p&gt;
Distribution is the "how," as in, how do you get your product to your customer?  Wal-Mart, Target, and your favorite mom-and-pop store are distribution channels.  Malls are a way of aggregating distribution channels and amortizing the fixed costs.
&lt;/p&gt;

&lt;p&gt;
Marketing is the "why," as in, why do your customers want to buy your product?  Marketing channels like TV ads, direct marketing, etc. are about getting your message in front of your customers and convincing them they should buy your product.
&lt;/p&gt;

&lt;p&gt;
For iTunes apps the only significant distribution channel is the app store itself.  Unfortunately, the primary marketing channel is getting your app on one of the featured lists on the front page of the app store.
&lt;/p&gt;

&lt;h3&gt;Why Apps Are Cheap&lt;/h3&gt;
&lt;p&gt;
Here's a thought experiment. Pretend Borders is the only book store in the world and that they put their best-selling books closer to the front.  10,000 people wander in and out of Borders every day.  People are five times as likely to buy the books out front versus the ones in the back.
&lt;/p&gt;

&lt;p&gt;
Now imagine there is only one prime spot and two books that share the same &lt;a href="http://en.wikipedia.org/wiki/Demand"&gt;demand curve&lt;/a&gt;.  What happens?  The one that has the lower price gets a 5x boost in sales, so each publisher tries to undercut the other until they're both priced at near-zero.
&lt;/p&gt;
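&lt;p&gt;
The undercutting race can be written as a toy simulation (entirely my own model of the thought experiment, not anything from the App Store itself): whichever book is cheaper holds the prime spot, so the other publisher repeatedly cuts its price a cent lower, and the loop only ends at the floor:
&lt;/p&gt;

```python
def undercut_war(price_a, price_b, step=0.01, floor=0.0):
    """Two publishers alternately undercut each other for the single prime spot."""
    while min(price_a, price_b) > floor:
        if price_a >= price_b:
            # A has lost the prime spot (and its 5x sales boost), so A undercuts B
            price_a = max(price_b - step, floor)
        else:
            # B has lost the prime spot, so B undercuts A
            price_b = max(price_a - step, floor)
    return price_a, price_b

print(undercut_war(9.99, 9.99))  # both prices race down to (or just above) zero
```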

&lt;p&gt;
And if one of them is willing to put ads in their book, well, they're happy pricing their book at zero from the start.
&lt;/p&gt;

&lt;p&gt;
This is the app store as it stands now, more or less.  Marketing is about stimulating demand.  App developers are confusing the app store with a marketing channel, and the only way to stimulate demand in that environment is to violently slash prices.
&lt;/p&gt;

&lt;h3&gt;Beyond the App Store&lt;/h3&gt;
&lt;p&gt;
There are two ways to increase demand: by lowering your price, or by marketing.
&lt;/p&gt;

&lt;p&gt;
The app store is a store just like Borders or Wal-Mart.  They make their money by distributing lots of goods that other people make and taking a cut, so of course they give premium spots to the apps that sell the most.
&lt;/p&gt;

&lt;p&gt;
App developers, however, are acting like the people in the store are the only people in the world.  The only way to stimulate demand is to lower the price and hope for that premium spot.
&lt;/p&gt;

&lt;p&gt;
Instead developers should look for creative ways to stimulate demand outside the app store itself.  Lower prices aren't what convince you to buy Beyonce's new album at the record store, it's the other way around. Beyonce's multi-million dollar marketing campaign is what convinces you to go into the record store in the first place.
&lt;/p&gt;

&lt;p&gt;
The first iPhone developer to capitalize on this will make a big splash and reverse the $0.99 App Store trend.  Just remember to link to this article when they do.
&lt;/p&gt;</description>
      <pubDate>Wed, 10 Dec 2008 01:23:05 +0000</pubDate>
      <link>http://20bits.com/article/the-099-app-store</link>
    </item>
    <item>
      <title>The Dangers of Genetic Optimization</title>
      <description>&lt;p&gt;
The guys at &lt;a href="http://weebly.com"&gt;Weebly&lt;/a&gt; just had a round of press for their latest product, &lt;a href="http://snapads.com"&gt;SnapAds&lt;/a&gt;, an ad optimization platform that uses genetic algorithms.  The technology is very cool, so check it out.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://ejohn.org/blog/genetic-ab-testing-with-javascript/"&gt;John Resig&lt;/a&gt; then posted a link to &lt;a href="http://demo.genetify.com/"&gt;Genetify&lt;/a&gt;, the previous incarnation of this technology, which uses genetic algorithms to optimize your website at large by "evolving" your stylesheets.
&lt;/p&gt;

&lt;h3&gt;The Black Box of Genetic Algorithms&lt;/h3&gt;

&lt;p&gt;
SnapAds is a great application of this technology because the guiding metric function is obvious: total ad revenue.  Since we've talked about &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;A/B testing&lt;/a&gt;, you might wonder why not do this automatically for your website at large and optimize other user behavior?
&lt;/p&gt;

&lt;p&gt;
The answer is what we call "black box testing."  You know the results &amp;mdash; maybe users are 50% more likely to click a certain link &amp;mdash; but you don't understand why.
&lt;/p&gt;

&lt;p&gt;
This is a pitfall of normal A/B and multivariate testing, too.  You put up an experiment, measure the outcomes, and pick the one that performs the best according to the metrics that matter.  Bingo bango.
&lt;/p&gt;

&lt;p&gt;
And hey, if you automate the optimization step with something like genetic algorithms, you don't even need to do this.  The machine makes the decision for you!
&lt;/p&gt;

&lt;h3&gt;Analysis Matters&lt;/h3&gt;

&lt;p&gt;
The problem with black box testing &amp;mdash; when you understand the outcome but not the underlying cause &amp;mdash; is that there's no learning.  Analysis matters.  Customer insight matters.
&lt;/p&gt;

&lt;p&gt;
If you're only doing black box testing you don't really understand your customers.  You're just blindly following the dictates of whatever algorithm you've set up.
&lt;/p&gt;

&lt;p&gt;
Your customers might be buying more now, but can you apply that knowledge to your next product?
&lt;/p&gt;</description>
      <pubDate>Thu, 04 Dec 2008 06:00:46 +0000</pubDate>
      <link>http://20bits.com/article/the-dangers-of-genetic-optimization</link>
    </item>
    <item>
      <title>The Cult of the Product</title>
      <description>&lt;p&gt;
In the movies you can build a baseball stadium in an Iowa cornfield and get millions of people to show up.  Who wouldn't want to see the ghost of Mickey Mantle play another game?  In real life there are millions of details that go into constructing a baseball stadium, not the least of which are having a team and fans ready to fill it from day one.
&lt;/p&gt;

&lt;p&gt;
The first scenario might be more romantic, but if you're looking to be a baseball mogul you'd better be operating in the second.
&lt;/p&gt;

&lt;p&gt;
The same is true of your consumer internet venture.  Most hackers and entrepreneurs spend their time thinking about just the product.  "If we just build the most awesomest widget possible," they think, "people will love it and give us money."  Product first, everything else second.
&lt;/p&gt;

&lt;h3&gt;The Cult of the Product&lt;/h3&gt;
&lt;p&gt;
This is the sentiment that embodies what I call the "cult of the product."  Like the Field of Dreams, people in this mindset believe that product is the most important thing and if they build it customers will come flocking.
&lt;/p&gt;

&lt;p&gt;
It's hard to blame anyone &amp;mdash; these signals are everywhere in the technology industry.
If you take Apple's public image at face value, for example, you'd believe that every product idea that comes out of the company springs fully formed from the head of Steve Jobs himself.
&lt;/p&gt;

&lt;img src="http://assets.20bits.com/20081202/cult.jpg" alt="" title="cult" width="300" height="238" class="math size-medium wp-image-403" /&gt;

&lt;p&gt;
This is a carefully crafted illusion.  In reality Apple has one of the most refined (and most well-guarded) design processes in the industry. If Steve Jobs is the heart of the company their design process is the blood.
&lt;/p&gt;

&lt;p&gt;
This process helps them identify market needs and build the product that best satisfies that need.  The function of a design process is to increase the value of the end-product and reduce the risk of shipping it. 
&lt;/p&gt;

&lt;p&gt;
Here's a great video of Steve Jobs discussing product strategy while he was still at NeXT:
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/p9dmcRbuTMY&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/p9dmcRbuTMY&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/div&gt;
&lt;h3&gt;The Components of a Successful Product&lt;/h3&gt;
&lt;p&gt;
In order to succeed every company needs three components: a market, a product, and a distribution channel.  You'll notice Steve Jobs talks about all three in the above video.
&lt;/p&gt;

&lt;p&gt;
Most aspiring internet entrepreneurs think a lot about product and a little about the market.  Their distribution strategy, however, often amounts to little more than "get mentioned on TechCrunch a lot."
&lt;/p&gt;

&lt;p&gt;
Think back to the most successful consumer internet startups: Google, Craigslist, MySpace, YouTube, Facebook, etc.  How many of them needed TechCrunch-style press to drive growth?  None.  And that's not just "luck" &amp;mdash; each had a distribution strategy built in from day one.
&lt;/p&gt;

&lt;p&gt;
In fact, building a web product, like any product, requires you to think about all of these things.  A product that fits a market is useless if you have no way to get it in front of customers, and even the best distribution model can't help if nobody wants to use the product you're distributing.
&lt;/p&gt;

&lt;h3&gt;Vanquishing the Cult&lt;/h3&gt;
&lt;p&gt;
Apple analyzes markets through a top-down, design-centric process, but we can flip this around and apply a bottom-up, data-driven approach.
&lt;/p&gt;

&lt;p&gt;
In fact, we can use the same principles of &lt;a href="/article/scientific-product-development"&gt;scientific product development&lt;/a&gt; to reason about business strategy.
&lt;/p&gt;

&lt;p&gt;
For example, we might start with the hypothesis that 1,000,000 people are willing to pay $50/month for your product.
&lt;/p&gt;

&lt;p&gt;
To test this you need to get real bullets in your gun as fast as possible.  This means talking to potential customers and getting their feedback, implementing simple prototypes and measuring their performance, etc.  Put your product and product ideas through the most rigorous process, using a combination of qualitative and quantitative feedback.
&lt;/p&gt;

&lt;p&gt;
Then, using the data you gathered from your measurements and tests, iterate.  It might turn out that nobody was willing to pay more than $20/month for a simplified version of your product.  This data lets you form new hypotheses.
&lt;/p&gt;

&lt;p&gt;
Did you have the right product/market fit?  Were you approaching the right customers?  Should you lower your price?  Should you improve the product?  Should you have a different pricing scheme entirely?  Each of these questions is itself a testable hypothesis and can be approached through an empirical, data-driven process.
&lt;/p&gt;

&lt;p&gt;
The important point is to have a process, whether design-centric or data-driven, that helps you identify the key market, product, and distribution challenges of your business.
&lt;/p&gt;

&lt;p&gt;
Sacrificing even a little bit of your business to the cult of the product is an unnecessary risk.
&lt;/p&gt;</description>
      <pubDate>Tue, 02 Dec 2008 06:00:03 +0000</pubDate>
      <link>http://20bits.com/article/the-cult-of-the-product</link>
    </item>
    <item>
      <title>Three Myths of Viral Growth</title>
      <description>&lt;h3&gt;Myth 1: Viral Growth is Exponential Growth&lt;/h3&gt;
&lt;p&gt;
Viral growth isn't exponential growth.  Your web product has a maximum audience, but an exponential curve grows forever.  Instead, viral growth follows a logistic curve.
&lt;/p&gt;

&lt;p&gt;
The logistic curve comes from population biology, where the growth of a population has an exponential component, e.g., humans on average have 2.1 children each, but is dampened by competition for resources.  If the population grows too fast it eventually reaches the point where its environment can no longer support pure exponential growth.
&lt;/p&gt;

&lt;p&gt;
This upper limit is called the &lt;strong&gt;carrying capacity&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
Seth Godin misses this in his otherwise excellent &lt;a href="http://sethgodin.typepad.com/seths_blog/2007/08/elephant-math.html"&gt;Elephant Math&lt;/a&gt; essay.  He equates "perfect viral growth" with exponential growth, but no viral growth is exponential.  There are two variables at play: the rate of reproduction and the carrying capacity.
&lt;/p&gt;

&lt;p&gt;
Sometimes the carrying capacity is obvious.  If you're building a Facebook app the carrying capacity can't be any larger than the total size of Facebook, for example.  Other times you don't know until you start feeling its effects.
&lt;/p&gt;
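&lt;p&gt;
To make the difference concrete, here's a minimal Python sketch of discrete-time logistic growth.  The growth rate, carrying capacity, and starting audience below are made-up numbers for illustration, not measurements from a real product.
&lt;/p&gt;

```python
# Discrete-time logistic growth: each step the population grows at rate r,
# damped by how close it already is to the carrying capacity K.
# All parameters here are illustrative.

def logistic_growth(r=0.5, K=1_000_000, users=100.0, steps=60):
    history = [users]
    for _ in range(steps):
        users += r * users * (1 - users / K)  # growth stalls as users -> K
        history.append(users)
    return history

curve = logistic_growth()
print(f"step 1:  {curve[1]:,.0f} users")    # ~150: looks exponential early on
print(f"step 60: {curve[-1]:,.0f} users")   # pinned just under K = 1,000,000
```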

&lt;img src="http://assets.20bits.com/20081118/viral-growth.png" alt="" title="" width="450" height="246" class="alignnone size-medium wp-image-378 math" /&gt;&lt;/a&gt;

&lt;p&gt;
This is what happens to every viral product, whether you like it or not.  Viral growth slows and you have to worry about retaining users rather than acquiring new ones.  If you don't your product runs the risk of &lt;a href="http://andrewchenblog.com/2008/03/05/facebook-viral-marketing-when-and-why-do-apps-jump-the-shark/"&gt;jumping the shark&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
And here's the kicker: the faster your app is growing the sooner you have to care about retention because you reach the carrying capacity much more quickly.
&lt;/p&gt;

&lt;h3&gt;Myth 2: Viral Growth is a Marketing Buzzword&lt;/h3&gt;
&lt;p&gt;
Viral growth isn't a marketing buzzword, although marketers do occasionally misuse the term.  It has the potential to happen any time you're in a situation where people can communicate.
&lt;/p&gt;

&lt;p&gt;
In simple terms viral growth happens when a user comes across your product and recommends it to his friends.  If the average user recommends the product to more than one person you get viral growth.
&lt;/p&gt;

&lt;p&gt;
What most people don't understand is that viral growth is also a function of the &lt;strong&gt;viral substrate&lt;/strong&gt;, or the underlying communication medium.  The easier it is to communicate with other people the more likely something is to go viral.
&lt;/p&gt;

&lt;p&gt;
With modern technologies like email, Facebook, SMS, etc., communication is virtually frictionless.  I could send an email to 1,000 people right now if I wanted, or text ten of my best friends simultaneously.  What's more, these channels are actually &lt;em&gt;easier&lt;/em&gt; to measure than word-of-mouth recommendations.
&lt;/p&gt;

&lt;p&gt;
Instead marketers use vague terms like "word of mouth marketing."  But every step in the viral process can and should be measured, and you should use mathematical models (like the logistic growth curve, above) to understand what is really happening.
&lt;/p&gt;

&lt;h3&gt;Myth 3: Viral Growth Can't be Engineered&lt;/h3&gt;
&lt;p&gt;
Viral growth is one of many distribution strategies and that means it can be engineered.  Innovation in distribution might be boring, but it makes the difference between a K-Mart and a Wal-Mart.
&lt;/p&gt;

&lt;p&gt;
Innovation in viral distribution means building and optimizing a &lt;a href="http://venturebeat.com/2007/06/11/q-a-with-rockyou-three-hit-apps-on-facebook-and-counting/"&gt;viral loop&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
Here is an example, taken from KISSMetrics' &lt;a href="http://productplanner.com"&gt;Product Planner&lt;/a&gt;: the &lt;a href="http://productplanner.com/gallery/hug_me/invite_loop" title="Invite Loop"&gt;invite loop&lt;/a&gt; for the &lt;a href="http://productplanner.com/gallery/hug_me" title="Hug Me"&gt;Hug Me&lt;/a&gt; app &amp;mdash; user allows access, user invites friends, friend clicks link, friend views notification, and around again.
&lt;/p&gt;

&lt;p&gt;
Simply put, your viral loop is the series of steps a user goes through before he invites his friends.  Each step in the loop costs you users &amp;mdash; perhaps only 10% of users click the "accept invitation" link, for example.  The &lt;strong&gt;efficiency&lt;/strong&gt; of the loop is the percentage of users who make it all the way through.
&lt;/p&gt;

&lt;p&gt;
One of the fundamental equations of viral growth is &lt;div class="latex math"&gt; k = e\cdot i&lt;/div&gt; where "e" is the efficiency of your loop and "i" is the average number of invites per user.  k is the &lt;strong&gt;viral coefficient&lt;/strong&gt;, or the average number of additional users each new user brings in.  If k &gt; 1 you get viral growth.
&lt;/p&gt;
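&lt;p&gt;
A quick Python sketch makes the relationship concrete.  The per-step conversion rates and invite count below are invented for illustration; in practice you'd measure each one.
&lt;/p&gt;

```python
# k = e * i, where e is the loop's efficiency (the product of each step's
# conversion rate) and i is the average number of invites per user.
# All numbers below are hypothetical.

step_rates = [0.80, 0.50, 0.25]  # e.g. allows access, sends invites, invitee clicks
invites_per_user = 12

e = 1.0
for rate in step_rates:
    e *= rate                    # overall loop efficiency: 0.10

k = e * invites_per_user         # viral coefficient: 1.2

# Each generation of users is k times the last; k > 1 means viral growth.
sizes = [1000.0]
for _ in range(5):
    sizes.append(sizes[-1] * k)

print(f"k = {k:.2f}, cohorts: {[round(s) for s in sizes]}")
```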

&lt;p&gt;
Phrasing it like this makes "getting viral" into an optimization problem, one that you can &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;A/B test&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Measure, Test, Repeat&lt;/h3&gt;
&lt;p&gt;
Viral growth can be measured, tested, and modeled.  It's not a fuzzy marketing term.  And if you can do it right you'll have yourself a difficult-to-top distribution channel for your next product.
&lt;/p&gt;</description>
      <pubDate>Tue, 18 Nov 2008 15:47:16 +0000</pubDate>
      <link>http://20bits.com/article/three-myths-of-viral-growth</link>
    </item>
    <item>
      <title>Statistical Analysis and A/B Testing</title>
      <description>&lt;p&gt;
In this article we're going to talk about how hypothesis testing can tell you whether your &lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B tests&lt;/a&gt; actually affect user behavior, or whether the variations you see are due to random chance.
&lt;/p&gt;


&lt;p&gt;
First, if you haven't yet, read my previous introductory article on &lt;a href="/article/hypothesis-testing-the-basics"&gt;hypothesis testing&lt;/a&gt;.  It explains the statistical principles behind hypothesis testing using the example of a biased coin.  We're going to move quickly beyond that and dive right into A/B testing.
&lt;/p&gt;

&lt;h3&gt;Landing Page Conversion&lt;/h3&gt;
&lt;p&gt;
You're testing a landing page that has a signup form.  You want to test various layouts to try to maximize the percentage of people who sign up.  This percentage is called the "conversion rate," i.e., the rate at which you convert visitors from passersby to customers.
&lt;/p&gt;

&lt;p&gt;
You have a four-way experiment with a control treatment and three experimental treatments.  How you &lt;a href="http://andrewchenblog.com/2008/10/27/how-to-generate-awesome-test-candidates-for-ab-testing/"&gt;pick your treatments&lt;/a&gt; is a subject worth discussing in its own right, but they should try to move the big levers: copy, layout, and size.
&lt;/p&gt;

&lt;p&gt;
For this experiment we'll just call the treatments control, A, B, and C.  You can use your imagination.
&lt;/p&gt;

&lt;h3&gt;Fake Data&lt;/h3&gt;
&lt;p&gt;
Your totally awesome Project X is attracting users.  You've analyzed your sales pipeline and the point with the highest potential impact is the landing page.  You want to increase the landing page conversion rate by at least 20%.
&lt;/p&gt;

&lt;p&gt;
You create an A/B test with four treatments: control, A, B, and C.  Here is the data you collect:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Project X Landing Page&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td class="negative"&gt;19.23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td class="negative"&gt;25.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td class="negative"&gt;14.81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment C&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td class="negative"&gt;32.45%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
From the data both treatments A and C show at least a 20% improvement in the landing page performance, which was our goal.  You might declare Treatment C "good enough," choose it, and move on.  But how do you know the variation isn't due to random chance?  What if instead of 188 visitors treated we only had 10 visitors treated?  Would you still be so confident?
&lt;/p&gt;

&lt;p&gt;
As usual we're aiming for a 95% confidence level.
&lt;/p&gt;

&lt;p&gt;
Hypothesis testing is all about quantifying our confidence, so let's get to it.
&lt;/p&gt;

&lt;h3&gt;The Statistics&lt;/h3&gt;
&lt;p&gt;
Remember, we need to start with a null hypothesis.  In our case, the null hypothesis will be that the conversion rate of the control treatment is no less than the conversion rate of our experimental treatment.  Mathematically
&lt;/p&gt;
&lt;div class="latex math"&gt; H_0: p - p_c \le 0&lt;/div&gt;
&lt;p&gt;
where p&lt;sub&gt;c&lt;/sub&gt; is the conversion rate of the control and p is the conversion rate of one of our experiments.
&lt;/p&gt;

&lt;p&gt;
The alternative hypothesis is therefore that the experimental page has a &lt;em&gt;higher&lt;/em&gt; conversion rate.  This is what we want to see and quantify.
&lt;/p&gt;

&lt;p&gt;
The sampled conversion rates are all normally distributed random variables.  It's just like the coin flip, except instead of heads or tails we have "converts" or "doesn't convert."  Instead of seeing whether it deviates too far from a fixed percentage we want to measure whether it deviates too far from the control treatment.
&lt;/p&gt;

&lt;p&gt;
Here's an example representation of the distribution of the control conversion rate and the treatment conversion rate.
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20081112/two-normals.png" alt="" title="two-normals" width="472" height="188" class="alignnone size-full wp-image-357 math" /&gt;

&lt;p&gt;
The peak of each curve is the conversion rate we measure, but there's some chance it is actually somewhere else on the curve.  Moreover, what we're &lt;em&gt;really&lt;/em&gt; interested in is the difference between the two conversion rates.  If the difference is large enough we conclude that the treatment really did alter user behavior.
&lt;/p&gt;

&lt;p&gt;
So, let's define a new random variable &lt;div class="latex math"&gt; X = p - p_c&lt;/div&gt; then our null hypothesis becomes &lt;div class="latex math"&gt; H_0 : X \le 0&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
We can now use the same techniques from our coin flip example, using the random variable X.  But to do this we need to know the probability distribution of X.
&lt;/p&gt;

&lt;p&gt;
It turns out that the sum (or difference) of two normally distributed random variables is itself normally distributed.  You can read the &lt;a href="http://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables"&gt;gory mathematical details&lt;/a&gt; yourself, if you're interested.
&lt;/p&gt;



&lt;p&gt;
This gives us a way to calculate a 95% confidence interval.
&lt;/p&gt;

&lt;h3&gt;Z-scores and One-tailed Tests&lt;/h3&gt;
&lt;p&gt;
Mathematically the z-score for X is &lt;div class="latex math"&gt; z = \frac{p - p_c}{\sqrt{\frac{p(1-p)}{N} + \frac{p_c(1-p_c)}{N_c}}}&lt;/div&gt; where N is the sample size of the experimental treatment and N&lt;sub&gt;c&lt;/sub&gt; is the sample size of the control treatment.
&lt;/p&gt;

&lt;p&gt;
Why?  Because the mean of X is p - p&lt;sub&gt;c&lt;/sub&gt; and the variance is the sum of the variances of p and p&lt;sub&gt;c&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
In the coin flip example the 95% confidence interval corresponded to a z-score of 1.96.  But it's different this time.
&lt;/p&gt;

&lt;p&gt;
In the coin flip example we rejected the null hypothesis if the percentage of heads was too high or too low.  The null hypothesis there was &lt;div class="latex math"&gt; p = 0.50&lt;/div&gt; but in this case our null hypothesis is &lt;div class="latex math"&gt; X \le 0&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
In other words, we only care about the positive tail of the normal distribution.  Here's a graphical representation of what I'm talking about.  In the coin example we have &lt;img src="http://assets.20bits.com/20081112/two-tailed.gif" alt="" title="two-tailed" width="455" height="333" class="alignnone size-full wp-image-356 math" /&gt; and we reject the null hypothesis if the percentage of heads is too high or too low.
&lt;/p&gt;
&lt;p&gt;
In this example we only reject the null hypothesis if the experimental conversion rate is significantly higher than the control conversion rate, so we have 
&lt;img src="http://assets.20bits.com/20081112/one-tailed.gif" alt="" title="one-tailed" width="455" height="333" class="alignnone size-full wp-image-356 math" /&gt;
&lt;/p&gt;

&lt;p&gt;
That is, we can reject the null hypothesis with 95% confidence if the z-score is higher than 1.65.  Here's a table with the z-scores calculated using the formula above:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Project X Landing Page&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Z-score&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td class="negative"&gt;19.23%&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td class="negative"&gt;25.00%&lt;/td&gt;
&lt;td&gt;1.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td class="negative"&gt;14.81%&lt;/td&gt;
&lt;td&gt;-1.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment C&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td class="negative"&gt;32.45%&lt;/td&gt;
&lt;td&gt;2.94&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
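&lt;p&gt;
You can reproduce these z-scores yourself.  Here's a short Python snippet that implements the formula from the previous section and runs it against the table's data:
&lt;/p&gt;

```python
# z-score for the difference between a treatment's conversion rate and the
# control's, using the variance-of-a-difference formula from above.
from math import sqrt

def z_score(p, n, p_c, n_c):
    se = sqrt(p * (1 - p) / n + p_c * (1 - p_c) / n_c)
    return (p - p_c) / se

p_control, n_control = 35 / 182, 182

for name, n, signups in [("A", 180, 45), ("B", 189, 28), ("C", 188, 61)]:
    z = z_score(signups / n, n, p_control, n_control)
    verdict = "significant" if z > 1.65 else "not significant"
    print(f"Treatment {name}: z = {z:.2f} ({verdict})")
# Treatment A: z = 1.33 (not significant)
# Treatment B: z = -1.13 (not significant)
# Treatment C: z = 2.94 (significant)
```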

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
From the table above we are safe concluding that Treatment C did, in fact, outperform the control treatment.  Whether the performance of Treatment A is statistically significant is irrelevant at this point because we know the performance of Treatment C is, so we should just pick that one and move on with our lives.
&lt;/p&gt;

&lt;p&gt;
Here are the key take-aways:
&lt;ul&gt;
	&lt;li&gt;The conversion rate for each treatment is a normally distributed random variable&lt;/li&gt;
	&lt;li&gt;We want to measure the difference in performance between a given treatment and the control.&lt;/li&gt;
	&lt;li&gt;The difference itself is a normally distributed random variable.&lt;/li&gt;
	&lt;li&gt;Since we only care if the difference is greater than zero we only need a z-score of 1.65, corresponding to the positive half of the normal curve.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
Statistical significance is important for A/B testing because it lets us know whether we've run the test for long enough.  In fact, we can ask the inverse question, "How long do I need to run an experiment before I can be certain if one of my treatments is more than 20% better than control?"
&lt;/p&gt;

&lt;p&gt;
This becomes more important when money is on the line because it lets you quantify risk, minimizing the impact of potentially risky treatments.
&lt;/p&gt;

&lt;p&gt;
We'll cover these things in future articles.  Until then!
&lt;/p&gt;</description>
      <pubDate>Wed, 12 Nov 2008 06:00:41 +0000</pubDate>
      <link>http://20bits.com/article/statistical-analysis-and-ab-testing</link>
    </item>
    <item>
      <title>Data Management, Facebook-style</title>
      <description>&lt;p&gt;
&lt;a href="http://jeffhammerbacher.com/"&gt;Jeff Hammerbacher&lt;/a&gt;, the former lead of the Data Team at Facebook and now VP of Product at &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;, put up some great slides on the evolution of Facebook's &lt;a href="http://www.cloudera.com/blog/2008/10/24/thrift-scribe-hive-and-cassandra-open-source-data-management-software/"&gt;data management strategy&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
They're very interesting from many perspectives, so take a look and then stay tuned for my two cents.
&lt;/p&gt;

&lt;p&gt;
&lt;div class="math"id="__ss_689126"&gt;&lt;object style="margin:0px" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/div&gt;
&lt;/p&gt;

&lt;h3&gt;Growing With Data&lt;/h3&gt;
&lt;p&gt;
Jeff was at Facebook for about two and a half years and saw Facebook grow from a company dealing with gigabytes of data per day to a company dealing with terabytes of data per day.  It was his job to guide the process of making sense of this pile of semi-structured data.
&lt;/p&gt;

&lt;p&gt;
The technical aspects are interesting, but what's more interesting to me is the story.  A good title for the presentation might be "Growing With Data."
&lt;/p&gt;

&lt;h3&gt;The Three Stages&lt;/h3&gt;
&lt;p&gt;
As I said, the most interesting part to me was how Facebook's data initiatives evolved over time to meet their growing needs.
&lt;/p&gt;

&lt;p&gt;
At first they did what everyone does &amp;mdash; periodic offline batch processing.  But we all know this doesn't scale forever, especially if your data is growing at an exponential rate.
&lt;/p&gt;

&lt;p&gt;
Eventually you wind up in a situation where you produce more data in an hour than you can process.  You can try to scale vertically, getting more bandwidth, more processing power, faster disks, etc., but the exponential nature of the situation will win in the end.
&lt;/p&gt;

&lt;p&gt;
Once the ad hoc ETL system no longer met their needs they built a system for distributed logging.  Unfortunately it didn't provide the flexibility they needed.  Analysts couldn't run SQL and maintaining the system was difficult.
&lt;/p&gt;

&lt;p&gt;
Eventually they hit upon &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;, an open source implementation of Google's MapReduce.  They built &lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Hive&lt;/a&gt;, a system for querying datasets stored in Hadoop files.  This means you get the scalability of Hadoop and the flexibility of a SQL-like language.  It's very slick.
&lt;/p&gt;
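&lt;p&gt;
To get a feel for the MapReduce model underneath Hive, here's a toy, single-machine Python version of the kind of job a GROUP BY query compiles down to.  The log data and function names are invented for illustration; in real Hadoop the map and reduce phases run distributed across a cluster.
&lt;/p&gt;

```python
# A toy MapReduce "GROUP BY url" -- the shape of job that a Hive query
# like SELECT url, COUNT(*) FROM logs GROUP BY url turns into.
from itertools import groupby
from operator import itemgetter

log_lines = [
    "alice /home", "bob /photos", "alice /home",
    "carol /home", "bob /photos", "bob /home",
]

def mapper(line):
    user, url = line.split()
    yield (url, 1)                       # emit one count per page hit

def reducer(url, counts):
    return (url, sum(counts))            # total the counts for one key

# The framework's "shuffle" phase sorts and groups map output by key:
pairs = sorted(kv for line in log_lines for kv in mapper(line))
totals = dict(
    reducer(url, (c for _, c in group))
    for url, group in groupby(pairs, key=itemgetter(0))
)
print(totals)  # {'/home': 4, '/photos': 2}
```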

&lt;p&gt;
They also built &lt;a href="http://code.google.com/p/the-cassandra-project/"&gt;Cassandra&lt;/a&gt;, which provides a &lt;a href="http://en.wikipedia.org/wiki/BigTable"&gt;BigTable&lt;/a&gt;-like system for storing massive amounts of structured data.&lt;/p&gt;

&lt;h3&gt;Evolution, not Revolution&lt;/h3&gt;
&lt;p&gt;
As I said, I like the story.  They didn't start by building these complex tools, but rather they evolved to fit a growing need within the company.  Beyond that I like that their approach to Hive was so customer-centric.  The analysts wanted SQL so they built a SQL-like language on top of their fancy distributed technology.  Very cool.
&lt;/p&gt;

&lt;p&gt;
There's a lot more where that came from over at the &lt;a href="http://www.cloudera.com/blog/"&gt;Cloudera blog&lt;/a&gt;, so check it out.  The future is data.
&lt;/p&gt;</description>
      <pubDate>Mon, 10 Nov 2008 06:00:55 +0000</pubDate>
      <link>http://20bits.com/article/data-management-facebook-style</link>
    </item>
    <item>
      <title>Obama, McCain, and Data-Driven Campaigning</title>
      <description>&lt;p&gt;
On Monday Slate published an article about &lt;a href="http://www.slate.com/id/2203146/"&gt;Obama's text messaging strategy&lt;/a&gt; (h/t &lt;a href="http://andrewchenblog.com/2008/10/29/slate-on-split-testing-in-the-mccain-and-obama-campaign-robo-calling-versus-text-messaging/"&gt;Andrew Chen&lt;/a&gt;) and how it compared to the traditional robo-calling strategy.  Politics is getting more quantitative every year and it's great to see the campaigns using techniques like A/B testing to determine what works and doesn't work in political messaging.
&lt;/p&gt;

&lt;h3&gt;A Channel to Voters&lt;/h3&gt;
&lt;p&gt;
Every campaign has several channels to voters: person-to-person contact, phone calls, mailers, etc.  You can attach metrics like "dollars per vote" or "votes per contact" to each of these channels, and the campaigns do.
&lt;/p&gt;

&lt;p&gt;
From the Slate article, &lt;blockquote&gt;Robo-calls are the pyrotechnics of politics: They create a big disturbance, but they don't have a prolonged effect. Numerous studies of robo-call campaigns show that they're ineffective both as tools of mobilization and persuasion &amp;mdash; they don't convince voters to go to the polls (or to stay away), and they don't change people's minds about which way to vote. So why do campaigns run robo-calls? Because they're cheap and easy. Telemarketing firms charge politicians between 2 and 5 cents per completed robo-call; that's as low as $20,000 to reach 1 million voters right in their homes.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The campaigns also measure votes-per-dollar.  Using the Green and Gerber numbers we get this table:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="3"&gt;Voter Contact Methods&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 6ex;"&gt;Channel&lt;/th&gt;
&lt;th style="width: 10ex;"&gt;Votes-per-contact&lt;/th&gt;
&lt;th style="width: 10ex;"&gt;Dollars-per-vote&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canvassing&lt;/td&gt;
&lt;td&gt;1/14&lt;/td&gt;
&lt;td&gt;$29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Phonebanking&lt;/td&gt;
&lt;td&gt;1/38&lt;/td&gt;
&lt;td&gt;$38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telemarketing&lt;/td&gt;
&lt;td&gt;1/180&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Mailers (non-partisan)&lt;/td&gt;
&lt;td&gt;1/200&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Partisan mailers are even farther down the list, and leaflets, emails, and robo-calls showed "no discernible effect" on the electorate.  Obama's breakthrough is using text messaging which costs an astonishing &lt;em&gt;$1.50 per vote&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;Hierarchy of Personalization and Maturity&lt;/h3&gt;
&lt;p&gt;
The hierarchy is pretty clear: the more personal the contact the more effective it is. Person-to-person contact will always be personal, obviously, and until this campaign text messaging was something you did with your friends and family.
&lt;/p&gt;

&lt;p&gt;
Will this change?  Email marketing is less effective nowadays because everyone is used to spam.  Text messaging has worked brilliantly for Obama this campaign but as the technique becomes more common there's no way to know if, four years from now, people will still respond the same way.&lt;/p&gt;

&lt;p&gt;
&lt;em&gt;Bombarding a communication channel with impersonal messages makes the medium itself less personal and therefore less effective.&lt;/em&gt;
&lt;/p&gt;

&lt;h3&gt;A/B Testing&lt;/h3&gt;
&lt;p&gt;
How did Green and Gerber calculate the effectiveness of voter contact methods? By carefully measuring the results of their A/B test and using &lt;a href="/article/hypothesis-testing-the-basics"&gt;hypothesis testing&lt;/a&gt; to determine whether there were actually differences between the control and test groups:
&lt;/p&gt;

&lt;blockquote&gt;
Rather than merely theorizing about how campaigns might get people to vote, Green, Gerber, and their colleagues favor randomized field experiments to test how different techniques work during real elections. Their method has much in common with double-blind pharmaceutical studies: With the cooperation of political campaigns (often at the state and local level), researchers randomly divide voters into two categories, a treatment group and a control group. They subject the treatment group to a given tactic: robo-calls, e-mail, direct mail, door-to-door canvassing, etc. Then they use statistical analysis to determine whether voters in the treatment group behaved differently from voters in the control group.
&lt;/blockquote&gt;

&lt;h3&gt;Personal Experience&lt;/h3&gt;
&lt;p&gt;
I'm an Obama supporter and have been volunteering on and off since early this year.  Working on the California primary I had a chance to see the campaign's data-driven approach first-hand.
&lt;/p&gt;

&lt;p&gt;
Every state is broken down to the precinct level using a system called VoteBuilder.  In a small town a precinct might cover the whole town, but in a city it could be as small as a few city blocks.  Precinct captains or other field workers can slice up the city using queries like "get me a list of voters who are not strong supporters of either candidate and print off their names in walking order."
&lt;/p&gt;

&lt;p&gt;
What's more, Obama's campaign is as much about his brand as it is about finding the votes it knows are there, even if they're in traditionally Republican areas.  I canvassed my hometown in Northern Michigan, for example, an area classified as a "persuasion area."&lt;/p&gt;

&lt;p&gt;How did the Obama campaign know that this small segment of typically Republican Northern Michigan was persuadable?&lt;/p&gt;

&lt;p&gt;
It's simple, really: data plus an empirical mindset.
&lt;/p&gt;

&lt;h3&gt;Lessons from Republicans&lt;/h3&gt;
&lt;p&gt;
In a lot of ways the Obama campaign has learned from the Republicans.  In the past Republicans have won in large part because they applied their background in quantitative direct marketing to voter mobilization.
&lt;/p&gt;

&lt;p&gt;
And if Obama wins it will be in large part because he absorbed and modernized the data-driven techniques Republicans have been using since the early 90's.
&lt;/p&gt;</description>
      <pubDate>Wed, 29 Oct 2008 09:50:34 +0000</pubDate>
      <link>http://20bits.com/article/obama-mccain-and-ab-testing</link>
    </item>
    <item>
      <title>Hypothesis Testing: The Basics</title>
      <description>&lt;p&gt;
Say I hand you a coin.  How would you tell if it's fair?  If you flipped it 100 times and it came up heads 51 times, what would you say?  What if it came up heads 5 times, instead?
&lt;/p&gt;

&lt;p&gt;
In the first case you'd be inclined to say the coin was fair and in the second case you'd be inclined to say it was biased towards tails.  How certain are you?  Or, even more specifically, how likely is it &lt;em&gt;actually&lt;/em&gt; that the coin is fair in each case?
&lt;/p&gt;

&lt;h3&gt;Hypothesis Testing&lt;/h3&gt;
&lt;p&gt;
Questions like the ones above fall into a domain called &lt;em&gt;hypothesis testing&lt;/em&gt;.  Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.
&lt;/p&gt;

&lt;p&gt;
In the coin example the "experiment" was flipping the coin 100 times.  There are two questions you can ask.  One, assuming the coin is fair, how likely is it that you'd observe the results you did?  Two, what is the likelihood that the coin is fair given the results you observed?
&lt;/p&gt;

&lt;p&gt;
Of course, an experiment can be much more complex than coin flipping.  Any situation where you're taking a random sample of a population and measuring something about it is an experiment, and for our purposes this includes &lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B testing&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;Let's focus on the coin flip example to understand the basics.&lt;/p&gt;

&lt;h3&gt;The Null Hypothesis&lt;/h3&gt;
&lt;p&gt;
The most common type of hypothesis testing involves a &lt;em&gt;null hypothesis&lt;/em&gt;.  The null hypothesis, denoted H&lt;sub&gt;0&lt;/sub&gt;, is a statement about the world which can plausibly account for the data you observe.  Don't read anything into the fact that it's called the "null" hypothesis &amp;mdash; it's just the hypothesis we're trying to test.
&lt;/p&gt;
&lt;p&gt;
For example, "the coin is fair" is an example of a null hypothesis, as is "the coin is biased."  The important part is that the null hypothesis can be expressed in simple mathematical terms.  We'll see how to express these statements mathematically in just a bit.
&lt;/p&gt;

&lt;p&gt;
The main goal of hypothesis testing is to tell us whether we have enough evidence to reject the null hypothesis.  In our case we want to know whether the coin is biased or not, so our null hypothesis should be "the coin is fair."  If we get enough evidence that contradicts this hypothesis, say, by flipping it 100 times and having it come up heads only once, then we can safely reject it.
&lt;/p&gt;

&lt;p&gt;
All of this is perfectly quantifiable, of course.  What constitutes "enough" and "safely" is a matter of statistics.
&lt;/p&gt;

&lt;h3&gt;The Statistics, Intuitively&lt;/h3&gt;
&lt;p&gt;
So, we have a coin.  Our null hypothesis is that this coin is fair.  We flip it 100 times and it comes up heads 51 times.  Do we know whether the coin is biased or not?
&lt;/p&gt;

&lt;p&gt;
Our gut might say the coin is fair, or at least probably fair, but we can't say for sure.  The expected number of heads is 50 and 51 is quite close.  But what if we flipped the coin 100,000 times and it came up heads 51,000 times?  We see 51% heads both times, but in the second instance the coin is more likely to be biased.
&lt;/p&gt;

&lt;p&gt;
Lack of evidence to the contrary is not evidence that the null hypothesis is true.  Rather, it means that we don't have sufficient evidence to conclude that the null hypothesis is false.  The coin might actually have a 51% bias towards heads, after all.
&lt;/p&gt;

&lt;p&gt;
If instead we saw 1 head for 100 flips that would be another story.  Intuitively we know that the chance of seeing this if the null hypothesis were true is so small that we would be comfortable rejecting the null hypothesis and declaring the coin to (probably) be biased.
&lt;/p&gt;
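&lt;p&gt;
Just how small is that chance?  Here's a quick sketch (in Python, purely for illustration) that computes the exact binomial probability of seeing at most &lt;em&gt;k&lt;/em&gt; heads:
&lt;/p&gt;

```python
from math import comb

def prob_at_most_k_heads(k, n=100, p=0.5):
    # Exact binomial probability of seeing k or fewer heads in n flips
    # of a coin that comes up heads with probability p.
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k + 1))

tiny = prob_at_most_k_heads(1)   # at most 1 head in 100 fair flips
```

&lt;p&gt;
The probability of at most 1 head in 100 fair flips works out to roughly 8 &amp;times; 10&lt;sup&gt;-29&lt;/sup&gt; &amp;mdash; small enough that rejecting the null hypothesis is an easy call.
&lt;/p&gt;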

&lt;p&gt;
Let's quantify our intuition.
&lt;/p&gt;

&lt;h3&gt;The Coin Flip&lt;/h3&gt;
&lt;p&gt;
Formally the flip of a coin can be represented by a Bernoulli trial.  A Bernoulli trial is a random variable &lt;strong&gt;X&lt;/strong&gt; such that &lt;div class="latex math"&gt; Pr\left(X = 1\right) = 1 - Pr\left(X = 0\right) = 1 - q = p&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
That is, &lt;strong&gt;X&lt;/strong&gt; takes on the value 1 (representing heads) with probability &lt;em&gt;p&lt;/em&gt;, and 0 (representing tails) with probability &lt;em&gt;1 - p&lt;/em&gt;&lt;span class="footnote"&gt;Of course, 1 can represent either heads or tails so long as you're consistent and 0 represents the opposite outcome&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;
Now, let's say we have 100 coin flips.  Let &lt;strong&gt;X&lt;/strong&gt;&lt;sub&gt;i&lt;/sub&gt; represent the &lt;em&gt;i&lt;sup&gt;th&lt;/sup&gt;&lt;/em&gt; coin flip.  Then the random variable &lt;div class="latex math"&gt; Y = \sum_{i=1}^{100} X_i&lt;/div&gt; represents the run of 100 coin flips.
&lt;/p&gt;
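&lt;p&gt;
If you want to get a feel for these definitions, here's a short simulation sketch (in Python rather than this blog's usual PHP, just for illustration):
&lt;/p&gt;

```python
import random

def bernoulli(p=0.5):
    # One Bernoulli trial: returns 1 (heads) with probability p, else 0 (tails).
    return random.choices([1, 0], weights=[p, 1 - p])[0]

def run_flips(n=100, p=0.5):
    # Y = X_1 + ... + X_n, the total number of heads in n flips.
    return sum(bernoulli(p) for _ in range(n))

y = run_flips(100)   # for a fair coin this lands near 50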

&lt;h3&gt;The Statistics, Mathematically&lt;/h3&gt;
&lt;p&gt;
Say you have a set of observations O and a null hypothesis H&lt;sub&gt;0&lt;/sub&gt;.  In the above coin example we were trying to calculate &lt;div class="latex math"&gt; P\left(O \mid H_0\right)&lt;/div&gt;
i.e., the probability that we observed what we did given the null hypothesis.  If that probability is sufficiently small we're confident in concluding that the null hypothesis is false&lt;span class="footnote"&gt;But remember, if that probability is &lt;em&gt;not&lt;/em&gt; sufficiently small, that doesn't mean the null hypothesis is true!&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
We can use whatever level of confidence we want before rejecting the null hypothesis, but most people choose 90%, 95%, or 99%.  For example if we choose a 95% confidence level we reject the null hypothesis if &lt;div class="latex math"&gt; P\left(O \mid H_0\right) \le 1 - 0.95 = 0.05&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
The &lt;a href="http://en.wikipedia.org/wiki/Central_limit_theorem"&gt;Central Limit Theorem&lt;/a&gt; is the main piece of math here.  Briefly, the Central Limit Theorem says that the average of a large number of independent, identically distributed random variables is approximately normally distributed.
&lt;/p&gt;

&lt;p&gt;
Remember our random variables from before? If we let &lt;div class="latex math"&gt; p = \frac{Y}{N}&lt;/div&gt; where &lt;em&gt;N&lt;/em&gt; = 100 is the number of flips, then &lt;em&gt;p&lt;/em&gt; is the proportion of heads in our sample.  In our case it is equal to 0.51, or 51%.
&lt;/p&gt;

&lt;p&gt;
But by the central limit theorem we also know that &lt;em&gt;p&lt;/em&gt; approximates a normal distribution.  This means we can estimate the standard deviation of &lt;em&gt;p&lt;/em&gt; as &lt;div class="latex math"&gt; \sigma = \sqrt{\frac{p(1-p)}{N}}&lt;/div&gt;
&lt;/p&gt;

&lt;h3&gt;Wrapping It Up&lt;/h3&gt;
&lt;p&gt;
Our null hypothesis is that the coin is fair.  Mathematically we're saying &lt;div class="latex math"&gt; H_0 : p_0 = 0.50&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
Here's the normal curve:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20081027/normal-curve-small.png" alt="" title="" width="600" height="450" /&gt;

&lt;p&gt;
A 95% level of confidence means we reject the null hypothesis if &lt;em&gt;p&lt;/em&gt; falls outside 95% of the area of the normal curve.  Looking at that chart we see that this corresponds to approximately 1.96 standard deviations.
&lt;/p&gt;

&lt;p&gt;
The so-called "z-score" tells us how many standard deviations away from the mean our sample is, and it's calculated as &lt;div class="latex math"&gt; z = \frac{p-0.50}{\sqrt{\frac{0.50(1-0.50)}{N}}}&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
The numerator is "p - 0.50" because our null hypothesis is that p = 0.50.  This measures how far the sample proportion, p, diverges from the expected mean of a fair coin, 0.50.
&lt;/p&gt;
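&lt;p&gt;
In code the whole test collapses to a couple of lines.  Here's a sketch in Python (illustrative only):
&lt;/p&gt;

```python
import math

def z_score(p_hat, p0, n):
    # How many standard errors the observed proportion p_hat lies
    # from the proportion p0 claimed by the null hypothesis.
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z = z_score(0.51, 0.50, 100)
# z is 0.2, well inside 1.96 standard deviations, so we cannot
# reject the null hypothesis that the coin is fair.
```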

&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;
Let's say we flipped three coins 100 times each and got the following data.
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="3"&gt;Data for 100 Flips of a Coin&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 6ex;"&gt;Coin&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Flips&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Pct. Heads&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Z-score&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coin 1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Coin 2&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;2.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coin 3&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;5.77&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Using a 95% confidence level we'd conclude that Coin 2 and Coin 3 are biased using the techniques we've developed so far.  Coin 2 is 2.04 standard deviations from the mean and Coin 3 is 5.77 standard deviations.
&lt;/p&gt;

&lt;p&gt;
When your test statistic meets the 95% confidence threshold we call it &lt;em&gt;statistically significant&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
This means that if the null hypothesis were true there would be only a 5% chance of observing a result as extreme as yours.  Phrased another way, random variation alone would produce a result like this less than 5% of the time.
&lt;/p&gt;

&lt;h3&gt;Recap&lt;/h3&gt;
&lt;p&gt;
Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.  You start by forming a null hypothesis, e.g., "this coin is fair," and then calculate the likelihood that your observations are due to pure chance rather than a real difference in the population.
&lt;/p&gt;

&lt;p&gt;
The confidence level is the threshold at which you reject the null hypothesis.  If your observations would occur less than 5% of the time under the null hypothesis, you reject it at the 95% confidence level.  This also means there is up to a 5% chance you're wrong and the difference is due to random fluctuations.
&lt;/p&gt;

&lt;p&gt;
The null hypothesis can be any mathematical statement and the test you use depends on both the underlying data and your null hypothesis.  In our coin flipping example the underlying data approximated a normal distribution and we wanted to test whether the observed proportion of heads was different enough to be significant.  In this case we were measuring the &lt;em&gt;sample mean&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
We can measure anything, though: the sample variance, correlation, etc.  Different tests need to be used to determine whether these are statistically significant, as we'll see in coming articles.
&lt;/p&gt;

&lt;h3&gt;What's Next?&lt;/h3&gt;
&lt;p&gt;
Now that we understand the innards of hypothesis testing we can apply our knowledge to A/B tests to determine whether new features &lt;em&gt;actually&lt;/em&gt; affect user behavior.  Until then!
&lt;/p&gt;</description>
      <pubDate>Mon, 27 Oct 2008 14:09:54 +0000</pubDate>
      <link>http://20bits.com/article/hypothesis-testing-the-basics</link>
    </item>
    <item>
      <title>Scientific Product Development</title>
      <description>&lt;p&gt;
Growing up every kid learned about the scientific method, about hypotheses, testing, measurement, and analysis.  Data-driven development is about taking these scientific principles and applying them (at least in part) to all aspects of a business &amp;mdash; especially product development.
&lt;/p&gt;

&lt;p&gt;
It's about subjecting your decisions to empirical reality and letting the data guide your intuition.  Mike Speiser &lt;a href="http://laserlike.com/2008/09/22/scientific-product-development/"&gt;wrote about this&lt;/a&gt; last September and it's a good phrase, so I'm going to steal it.
&lt;/p&gt;

&lt;p&gt;
Let's take a look!
&lt;/p&gt;

&lt;h3&gt;Scientific Product Development&lt;/h3&gt;

&lt;p&gt;
In that spirit, let's break down the process of scientific product development.  A picture is worth a thousand words, so here goes:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20081020/cmta.png" alt="" title="HMTA"/&gt;

&lt;p&gt;
Let's break it down.
&lt;/p&gt;

&lt;dl&gt;
	&lt;dt&gt;Hypothesize&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	The first step in scientific product development is to come up with a testable hypothesis.  This can be a hypothesis about anything: your userbase, your market, whatever.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Measure&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Many scientific breakthroughs happen because of improvements in instrumentation.  Think microscope, telescope, x-ray crystallography, etc.  The nuts and bolts of instrumentation, data collection, and measurement are just as important in building a data-driven business as they are in science.
	&lt;/p&gt;
	
	&lt;p&gt;
	What data do you need to collect?  How do you collect it?  How do you &lt;em&gt;store&lt;/em&gt; it?  And finally, how do you extract it in a way your analysts can make sense of it?
	&lt;/p&gt;
	
	&lt;p&gt;
	This is also the step for the &lt;a href="http://startonomics.com/"&gt;metrics-obsessed&lt;/a&gt;.
	&lt;/p&gt;
	
	&lt;p&gt;
	For most startups measurement and instrumentation involves little more than SQL and your database of choice, but the big boys use technologies like &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;, &lt;a href="http://code.google.com/p/the-cassandra-project/"&gt;Cassandra&lt;/a&gt;, and &lt;a href="http://en.wikipedia.org/wiki/BigTable"&gt;BigTable&lt;/a&gt; to solve various problems in this domain.  Google Analytics fits here, too.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Test&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	I've been talking about this step the most.  Once you are recording data and have the ability to extract intelligence from it you can subject your products to A/B tests, multivariate tests, and other experimental techniques.
	&lt;/p&gt;
	
	&lt;p&gt;
	You can either &lt;a href="/article/implementing-ab-testing"&gt;build your own testing infrastructure&lt;/a&gt; or use off-the-shelf stuff like &lt;a href="https://www.google.com/analytics/siteopt/?hl=en"&gt;Google Website Optimizer&lt;/a&gt;.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Analyze&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;The final step in the process is analysis.  This means taking the data you've collected and generating &lt;em&gt;insight&lt;/em&gt;.&lt;/p&gt;
	
	&lt;p&gt;
	Analysis involves looking at all the interlocking variables, isolating the ones that matter most, and presenting them in a way that is easily understood.  Statistics is helpful here, as is a background in &lt;a href="http://en.wikipedia.org/wiki/Information_Design"&gt;Information Design&lt;/a&gt;.
	&lt;/p&gt;
	
	&lt;p&gt;
	Your analysis is what gives you the data to contradict or support your hypothesis.  It also tells you what you're missing as you iterate your data-driven process.  Are you not collecting data you need?  Are there additional hypotheses that you need to test?  Good analysis yields a few good answers and several more questions.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;

&lt;h3&gt;Directed versus Undirected Analysis&lt;/h3&gt;
&lt;p&gt;
When I was learning the scientific method as a kid it was always presented as a rigid process.  You form a hypothesis then you go about methodically testing that hypothesis, refining it until it's a "theory."
&lt;/p&gt;

&lt;p&gt;
This model is too rigid.  Flashes of insight often come when you're freely exploring the data.  There isn't always a statement to test &amp;mdash; sometimes you need to look at the data to figure out what's worth testing.
&lt;/p&gt;


&lt;h3&gt;I'm Not Alone&lt;/h3&gt;
&lt;p&gt;
Other people, like &lt;a href="http://laserlike.com/2008/09/22/scientific-product-development/"&gt;Mike Speiser&lt;/a&gt; and &lt;a href="http://startuplessonslearned.blogspot.com/2008/09/thoughts-on-scientific-product.html"&gt;Eric Ries&lt;/a&gt;, are talking explicitly about building products in this way.
&lt;/p&gt;

&lt;p&gt;On the strategic and measurement side there are people like my friend &lt;a href="http://andrewchenblog.com/"&gt;Andrew Chen&lt;/a&gt; and &lt;a href="http://500hats.typepad.com/"&gt;Dave McClure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;
If you have any recommendations of other scientific-minded bloggers I'm all ears!
&lt;/p&gt;</description>
      <pubDate>Mon, 20 Oct 2008 06:00:53 +0000</pubDate>
      <link>http://20bits.com/article/scientific-product-development</link>
    </item>
    <item>
      <title>Implementing A/B Testing</title>
      <description>&lt;p&gt;
Before you can start doing A/B tests you need a system that can support them.  That means either you find one off the shelf or you build it yourself.
&lt;/p&gt;

&lt;h3&gt;Off-the-shelf A/B Testing&lt;/h3&gt;

&lt;p&gt;
Most off-the-shelf A/B testing software is geared towards marketers as they were the first group online to adopt the technique en masse.  The two pieces of software I'm most familiar with are &lt;a href="http://www.omniture.com/en/products/conversion/testandtarget"&gt;Omniture Test &amp;amp; Target&lt;/a&gt;, which used to be part of Offermatica before Omniture acquired them, and &lt;a href="http://www.google.com/websiteoptimizer"&gt;Google Website Optimizer&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
Omniture Test &amp;amp; Target costs money and is designed for big corporate clients with equally big wallets.  It's very nice software, but probably not what you're looking for if you're just getting started out.
&lt;/p&gt;

&lt;p&gt;
Google Website Optimizer, however, is free and much simpler.  It lets you do both A/B testing and multivariate testing, but is limited in that it only has a notion of "conversions."
&lt;/p&gt;

&lt;p&gt;
You place a bit of code on every page that is part of an experiment and another bit of code on the page that counts as a "conversion."  You can then track conversion rates across your treatments (or "variations" as GWO calls them).
&lt;/p&gt;

&lt;p&gt;
Conversion Rate Experts has a good &lt;a href="http://www.conversion-rate-experts.com/articles/101-google-website-optimizer-tips/"&gt;introductory article&lt;/a&gt; on Google Website Optimizer if you're interested.
&lt;/p&gt;

&lt;h3&gt;Rolling Your Own&lt;/h3&gt;
&lt;p&gt;
Rolling your own A/B testing system isn't that hard.  Let's say we're running an email campaign called &lt;tt&gt;buy_our_book&lt;/tt&gt; where we're trying to advertise our new book.  Here's how the code might look (in PHP):
&lt;/p&gt;

&lt;pre class="brush: php"&gt;function send_mail($recipient, $campaign) {
	$subject = get_mail_subject($recipient, $campaign);
	$copy = get_mail_copy($recipient, $campaign);
	mail($recipient, $subject, $copy);
}

function get_mail_subject($recipient, $campaign) {
	if ($campaign == 'buy_our_book') {
		// Get treatment if it exists, else get random treatment
		$treatment = get_treatment_for('book_subject', $recipient);
		switch($treatment) {
			case 'control':
				return get_default_subject($recipient, $campaign);
			case 'discount_price':
				return "Get 50% off our latest book!";
			case 'direct_appeal':
				return "Buy our book, we beg you!";
		}
	} else {
		return get_default_subject($recipient, $campaign);
	}
}&lt;/pre&gt;

&lt;p&gt;
&lt;tt&gt;get_treatment_for&lt;/tt&gt; does the meat of the work.  It should do the following:
&lt;ol&gt;
	&lt;li&gt;Use an existing treatment if the recipient already has one.&lt;/li&gt;
	&lt;li&gt;Otherwise, assign that user a random treatment&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
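&lt;p&gt;
The article's code is in PHP, but the sticky-assignment logic is easy to sketch in any language.  Here it is in Python, with a plain dictionary standing in for the &lt;tt&gt;users_treatments&lt;/tt&gt; table (names are illustrative):
&lt;/p&gt;

```python
import random

def get_treatment_for(experiment, user_id, treatments, assignments):
    # Reuse the stored treatment if this user already has one for this
    # experiment; otherwise assign a random one and remember it.
    key = (user_id, experiment)
    if key not in assignments:
        assignments[key] = random.choice(treatments)
    return assignments[key]

store = {}  # stands in for the users_treatments table
first = get_treatment_for('book_subject', 42,
                          ['control', 'discount_price', 'direct_appeal'], store)
again = get_treatment_for('book_subject', 42,
                          ['control', 'discount_price', 'direct_appeal'], store)
```

&lt;p&gt;
Because the assignment is stored on first contact, a recipient sees the same subject line on every send.
&lt;/p&gt;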

&lt;p&gt;
Here is an example MySQL schema for A/B testing:
&lt;/p&gt;

&lt;pre class="brush: sql"&gt;CREATE TABLE treatments (
	id INT UNSIGNED NOT NULL auto_increment,
	name VARCHAR(255),
	experiment_id INT UNSIGNED NOT NULL,
	PRIMARY KEY(id)
);

CREATE TABLE experiments (
	id INT UNSIGNED NOT NULL auto_increment,
	name VARCHAR(255),
	PRIMARY KEY(id)
);

CREATE TABLE users_treatments (
	user_id INT UNSIGNED NOT NULL,
	experiment_id INT UNSIGNED NOT NULL,
	treatment_id INT UNSIGNED NOT NULL,
	PRIMARY KEY(user_id, experiment_id)
);&lt;/pre&gt;

&lt;p&gt;
This schema assumes each user can be uniquely identified by an integer, but if the requirements of your website are different you can change it.  For example, if users aren't required to sign up and you track their activities using cookies you'd store the cookie ID in the database.
&lt;/p&gt;


&lt;h3&gt;Use Weights&lt;/h3&gt;
&lt;p&gt;
It's also advisable that you add weights to your treatments so that, e.g., you can select one treatment 90% of the time and another 10% of the time.  All this logic can be encapsulated in the &lt;tt&gt;get_treatment_for&lt;/tt&gt; function.  I even wrote an article about how to get &lt;a href="/article/random-weighted-elements-in-php"&gt;weighted random elements&lt;/a&gt; in PHP.
&lt;/p&gt;

&lt;p&gt;
Why weights? If you're dealing with revenue, running a 50/50 split between the control and a total unknown puts your bottom line at risk.
&lt;/p&gt;

&lt;p&gt;
Even if you're not monetizing your traffic you probably don't want to put your traffic itself at risk.  Weighting your treatments takes care of this.
&lt;/p&gt;
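&lt;p&gt;
A weighted pick takes only a few lines; here's an illustrative Python sketch (the treatment names are made up):
&lt;/p&gt;

```python
import random

def weighted_treatment(weights):
    # weights is a dict of treatment name to weight, e.g.
    # {'control': 90, 'new_subject': 10} picks the control about
    # 90% of the time.
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

&lt;p&gt;
With weights like 90/10 the control keeps most of your traffic while the experimental treatment gathers data slowly and safely.
&lt;/p&gt;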

&lt;h3&gt;What's Next&lt;/h3&gt;
&lt;p&gt;
Hopefully you understand enough to go out and implement a basic A/B testing system yourself.  In my next two articles I'm going to cover instrumentation and analysis.  That is, what should you be measuring (and how) and how do you know which treatment was successful?
&lt;/p&gt;</description>
      <pubDate>Tue, 14 Oct 2008 06:00:24 +0000</pubDate>
      <link>http://20bits.com/article/implementing-ab-testing</link>
    </item>
    <item>
      <title>An Introduction to A/B Testing</title>
      <description>&lt;p&gt;
A/B testing is one of the primary tools in any &lt;a href="/article/data-driven-development"&gt;data-driven&lt;/a&gt; environment.  You can think of it as a big cage match.  Send in your champion versus several other challengers and out comes a victor.&lt;/p&gt;

&lt;p&gt;
Of course, on the web there's less blood and more statistics, but the principle remains the same: how do you know who will win unless you force them to fight to the death?
&lt;/p&gt;

&lt;p&gt;
A/B Testing lets you compare several alternate versions of the same web page simultaneously and see which produces the best outcome, e.g., increased click-through, engagement, or any other metric of your choice. 
&lt;/p&gt;

&lt;h3&gt;Ok, What is A/B Testing, Really?&lt;/h3&gt;
&lt;p&gt;
A/B Testing is a way of conducting an experiment where you compare a control group to the performance of one or more test groups by randomly assigning each group a specific single-variable treatment.  Let's break that down.
&lt;/p&gt;

&lt;p&gt;
First, you decide on an experiment.  Maybe you're building a web application that forces users to register and you want to experiment on your landing page.  You want to see if you can improve the percentage of people who register.&lt;/p&gt;

&lt;p&gt;
The &lt;em&gt;conversion rate&lt;/em&gt; for your landing page is &lt;div class="latex math"&gt; \text{conversion rate} = \frac{\text{\# of visitors who register}}{\text{\# of total visitors}}&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
For example, if 100 people visit your landing page today and 20 of those people register then you have a conversion rate of 20%.  All else being equal, the landing page with the higher conversion rate is better&lt;span class="footnote"&gt;"All else being equal" is important here &amp;mdash; if one of your landing pages promises free candy to people who register you might get a higher conversion rate, but the resulting users will have less long-term value once they realize you're a big fat liar.  I'm also not going to talk about statistical significance, yet.&lt;/span&gt;.
&lt;/p&gt;
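&lt;p&gt;
The calculation itself is a one-liner; sketching it in Python for concreteness:
&lt;/p&gt;

```python
def conversion_rate(registered, visitors):
    # Fraction of visitors who completed the goal action (registration).
    return registered / visitors

rate = conversion_rate(20, 100)   # 0.20, i.e. a 20% conversion rate
```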

&lt;h3&gt;Building Treatments&lt;/h3&gt;
&lt;p&gt;
Once you know &lt;em&gt;what&lt;/em&gt; you want to test you have to create treatments to test it.  One of the treatments will be the control, i.e., your current landing page.  The other treatments will be variations on that.  Here are some things worth testing:
&lt;ul&gt;
	&lt;li&gt;Layout.  Move the registration forms around.  Add fields, remove fields.&lt;/li&gt;
	&lt;li&gt;Headings.  Add headings.  Make them different colors.  Change the copy.&lt;/li&gt;
	&lt;li&gt;Copy.  Change the size, color, placement, and content of any text you have on the page.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
You can have as many treatments as you want, but you get better data more quickly with fewer treatments.  I rarely conduct A/B tests with more than four treatments.
&lt;/p&gt;

&lt;h3&gt;Randomization Means Control&lt;/h3&gt;
&lt;p&gt;
You can't just throw up one landing page on Friday and another landing page on Saturday and compare the conversion rates &amp;mdash; there's no reason to believe that the conversion rate for users who visit on a Friday is the same for users who visit on a Saturday.  In fact, they're probably not.
&lt;/p&gt;

&lt;p&gt;
A/B testing solves this by running the experiment in parallel and &lt;em&gt;randomly&lt;/em&gt; assigning a treatment to each person who visits.  This controls for any time-sensitive variables and distributes the population proportionally across the treatments.
&lt;/p&gt;


&lt;p&gt;Let's look at an example data set.&lt;/p&gt;
&lt;h3&gt;An Example&lt;/h3&gt;
&lt;p&gt;
Say we have a service called "Foobar" and we're conducting an experiment on our landing page.  Our goal is to improve the conversion rate by at least 10%.  When a new visitor arrives on the landing page we randomly assign them one of three treatments: the control, Treatment A, or Treatment B.
&lt;/p&gt;

&lt;p&gt;
Let's also say these treatments involve the headline copy. For example, the control treatment's headline copy might be "Foobar is a great service!  Sign up here."  One of the experimental treatments might have "Foobar lets you stay in touch with family all across the country &amp;mdash; easily."
&lt;/p&gt;

&lt;p&gt;
You run the experiment for a few days and get the following data:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;A/B Testing Example Data for the Foobar Service&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;1,406&lt;/td&gt;
&lt;td&gt;356&lt;/td&gt;
&lt;td class="negative"&gt;25.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;1,488&lt;/td&gt;
&lt;td&gt;382&lt;/td&gt;
&lt;td class="negative"&gt;25.67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;1,392&lt;/td&gt;
&lt;td&gt;425&lt;/td&gt;
&lt;td class="negative"&gt;30.53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
From the data above you'd conclude that Treatment B is the winner, but you have to be careful &amp;mdash; if the conversion rates were closer or if your &lt;em&gt;sample size&lt;/em&gt; were smaller you wouldn't be able to tell which treatment won.  For example, can you say for certain that Treatment A is better than the control treatment, or could it just be due to chance?
&lt;/p&gt;
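&lt;p&gt;
Quantifying that certainty is the subject of an upcoming article, but as a preview here's a sketch of one standard approach, a pooled two-proportion z-test, in Python (illustrative only):
&lt;/p&gt;

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    # z-statistic for whether two conversion rates differ by more than chance.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # overall rate if there were no difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(356, 1406, 425, 1392)   # control vs. Treatment B
# z is about 3.1, beyond the conventional 1.96 cutoff, so Treatment B's
# lift is unlikely to be chance; the same test on Treatment A's row stays
# well below 1.96, so its small edge over the control proves nothing.
```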

&lt;h3&gt;Sample Size Matters&lt;/h3&gt;
&lt;p&gt;
The &lt;em&gt;sample size&lt;/em&gt; of a treatment is the number of people who received that treatment.  The larger the sample size the more certain you are that the sample's performance reflects the real performance of the treatment.
&lt;/p&gt;

&lt;p&gt;
For example, what if the above data looked like this, instead?
&lt;/p&gt;

&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;A/B Testing Example Data for the Foobar Service&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td class="negative"&gt;30.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td class="negative"&gt;50.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td class="negative"&gt;44.44%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Which treatment is the best, now?  You might be inclined to say that Treatment A is the winner because it has a higher conversion rate.  But this is akin to saying that you know a coin is biased because you flipped it three times and got all heads.
&lt;/p&gt;

&lt;p&gt;
That might be unlikely, but it's not impossible.  The larger the sample size the more certain you are that the effects you're observing are from real differences in the treatments and not from pure chance.  In fact, none of these results are &lt;em&gt;statistically significant&lt;/em&gt;, i.e., they could easily be explained by chance rather than by real differences in the treatments.
&lt;/p&gt;

&lt;p&gt;
Since sample size is per-treatment there are primarily two ways to increase it: use fewer treatments or run the experiment for longer.
&lt;/p&gt;

&lt;h3&gt;What's Next?&lt;/h3&gt;
&lt;p&gt;
There's a lot more to cover when it comes to A/B testing.  Here are a few topics I'll be writing about over the coming weeks:
&lt;dl&gt;
	&lt;dt&gt;Implementation&lt;/dt&gt;
	&lt;dd&gt;Once we understand what A/B testing is about, how do we implement it?  Do different products require different implementations?&lt;/dd&gt;
	&lt;dt&gt;Statistical Significance&lt;/dt&gt;
	&lt;dd&gt;Once we have results from our A/B test, how can we quantify our level of certainty?  How long do we have to run an experiment before we can be certain of the results?&lt;/dd&gt;
	&lt;dt&gt;Hypothesis Testing&lt;/dt&gt;
	&lt;dd&gt;What if we want to test more complex behavior?  What if the data we get back can't be modeled as a simple percentage?&lt;/dd&gt;
	&lt;dt&gt;Best Practices&lt;/dt&gt;
	&lt;dd&gt;What is worth testing?  How do you balance short-term and long-term goals in the context of testing?&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
That's it for today.  Feel free to &lt;a href="http://20bits.com/2008/10/05/an-introduction-to-ab-testing/#comments"&gt;leave a comment&lt;/a&gt; and let me know what you want me to write about next.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 06 Oct 2008 04:00:15 +0000</pubDate>
      <link>http://20bits.com/article/an-introduction-to-ab-testing</link>
    </item>
    <item>
      <title>Data-Driven Development</title>
      <description>&lt;p&gt;
There are &lt;a href="http://andrewchenblog.com/"&gt;lots&lt;/a&gt; &lt;a href="http://500hats.typepad.com/"&gt;of&lt;/a&gt; &lt;a href="http://startuplessonslearned.blogspot.com/"&gt;smart&lt;/a&gt; people out there talking about metrics and tech startups, but the one thing they all have in common is an empirical mind-set.
&lt;/p&gt;

&lt;p&gt;
Another common thread is that a lot of these practices come from other, older industries that have had time to mature.  It's high time we applied them to the startup world.
&lt;/p&gt;

&lt;h3&gt;What is Data-Driven Development?&lt;/h3&gt;
&lt;p&gt;
Data-Driven Development is centered around the belief that business decisions &amp;mdash; whether technical, artistic, or financial &amp;mdash; are best made based on what is &lt;em&gt;actually happening&lt;/em&gt; rather than your personal model of the world.
&lt;/p&gt;

&lt;p&gt;
Of course, everyone agrees with that. The problem is that everyone always believes their version of the world is the correct one.  It's not, at least not all the time.  How do you know when you're correct and when you're not?
&lt;/p&gt;

&lt;p&gt;
Applying the principles of Data-Driven Development helps you understand what actually works and what doesn't.
&lt;/p&gt;

&lt;h3&gt;Principles of Data-Driven Development&lt;/h3&gt;
&lt;p&gt;
Three key principles are as follows:
&lt;dl&gt;
	&lt;dt&gt;Everyone Is Biased&lt;/dt&gt;
	&lt;dd&gt;Decisions should be made through the lens of empiricism rather than the lens of intuition.&lt;/dd&gt;
	
	&lt;dt&gt;Universal Instrumentation&lt;/dt&gt;
	&lt;dd&gt;Without visibility you can't tell when you're succeeding &amp;mdash; or failing.&lt;/dd&gt;
	
	&lt;dt&gt;No Sacred Cows&lt;/dt&gt;
	&lt;dd&gt;The most dangerous beliefs are the ones held universally.  Test (and measure) everything.&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;h3&gt;Why Data-Driven Development?&lt;/h3&gt;
&lt;p&gt;
Business decisions center around risk/reward calculations.  Data-Driven Development gives you clearer pictures of both.  The best decisions an early-stage startup can make are the high-yield, low-risk ones, i.e., the low-hanging fruit.  Without real data you won't know what tree you're picking from, let alone what fruit.
&lt;/p&gt;

&lt;p&gt;
Also, it's worth noting that Data-Driven Development isn't about any single practice, at least not in the same way that Agile is associated with things like XP and Scrum.  It is pragmatic and understands its own limitations.  Data-Driven Development won't help you know what to value, nor does it aim to make perfect decisions &amp;mdash; only calculated ones.
&lt;/p&gt;

&lt;h3&gt;Practical Examples and the Future of 20bits&lt;/h3&gt;
&lt;p&gt;
As you can see, the byline of this blog has changed to "Driven by Data."  I'm going to refocus this blog on Data-Driven Development.  Here are some topics you can expect to read about:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Case studies to illustrate Data-Driven Development&lt;/li&gt;
	&lt;li&gt;Tutorials on probability, statistics, and data analysis&lt;/li&gt;
	&lt;li&gt;A/B and multivariate testing&lt;/li&gt;
	&lt;li&gt;Metrics, their use (and misuse)&lt;/li&gt;
	&lt;li&gt;Operational problems like data warehousing and extraction&lt;/li&gt;
	&lt;li&gt;...and so much more!&amp;trade;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The rule I've written for myself is that every post has to involve at least one of code, data, or insight.  Hopefully this one counts towards the last.  And in the spirit of Data-Driven Development, please drop me a line and tell me directly what you want to hear about.
&lt;/p&gt;

&lt;p&gt;
Cheers, and happy coding!
&lt;/p&gt;</description>
      <pubDate>Wed, 01 Oct 2008 07:00:08 +0000</pubDate>
      <link>http://20bits.com/article/data-driven-development</link>
    </item>
    <item>
      <title>Erlang: A Generalized TCP Server</title>
      <description>&lt;p&gt;
In my last few articles about Erlang we've covered the basics of &lt;a href="/article/network-programming-in-erlang/"&gt;network programming&lt;/a&gt; with &lt;tt&gt;gen_tcp&lt;/tt&gt; and Erlang/OTP's &lt;a href="http://20bits.com/articles/erlang-a-generic-server-tutorial"&gt;gen_server&lt;/a&gt;, or generic server, module.  Let's combine the two.

&lt;/p&gt;

&lt;p&gt;
In most people's minds "server" means network server, but Erlang uses the terminology in the most abstract sense.  &lt;tt&gt;gen_server&lt;/tt&gt; is really a server that operates using Erlang's message passing as its base protocol.  We can graft a TCP server onto that framework, but it requires some work.
&lt;/p&gt;

&lt;h3&gt;The Structure of a Network Server&lt;/h3&gt;
&lt;p&gt;
Most network servers have a similar architecture.  First they create a listening socket that waits for incoming connections.  They then enter an accept state in which they loop until termination, accepting each new connection as it arrives and starting the real client/server work.
&lt;/p&gt;

&lt;p&gt;
To see this in action recall the simple echo server from my network programming article:
&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;-module(echo).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-export([listen/1]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call echo:listen(Port) to start the service.
listen(Port) -&gt;
    {ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
    accept(LSocket).

% Wait for incoming connections and spawn the echo loop when we get one.
accept(LSocket) -&gt;
    {ok, Socket} = gen_tcp:accept(LSocket),
    spawn(fun() -&gt; loop(Socket) end),
    accept(LSocket).

% Echo back whatever data we receive on Socket.
loop(Socket) -&gt;
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} -&gt;
            gen_tcp:send(Socket, Data),
            loop(Socket);
        {error, closed} -&gt;
            ok
    end.&lt;/pre&gt;

&lt;p&gt;
As you can see, &lt;tt&gt;listen&lt;/tt&gt; creates a listening socket and immediately calls &lt;tt&gt;accept&lt;/tt&gt;.  This waits for an incoming connection, spawns a new worker (&lt;tt&gt;loop&lt;/tt&gt;) that does the real work, and then waits for the next incoming connection.
&lt;/p&gt;

&lt;p&gt;
In this code the parent process owns both the listen socket and the accept loop.  As we'll see this doesn't work so well when we try to integrate the accept/listen loop with &lt;tt&gt;gen_server&lt;/tt&gt;.
&lt;/p&gt;

&lt;h3&gt;Abstracting The Network Server&lt;/h3&gt;
&lt;p&gt;
Network servers come in two parts: connection handling and business logic.  As I described above, the connection handling is basically the same for every network server.  Ideally we'd be able to do something like
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(my_server).
start(Port) -&gt;
	connection_handler:start(my_server, Port, business_logic).

business_logic(Socket) -&gt;
	% Read data from the network socket and do our thang.&lt;/pre&gt;

&lt;p&gt;
Let's go ahead and do just this.
&lt;/p&gt;

&lt;h3&gt;Implementing A Generic Network Server&lt;/h3&gt;
&lt;p&gt;
The problem with implementing a network server using &lt;tt&gt;gen_server&lt;/tt&gt; is that the call to &lt;tt&gt;gen_tcp:accept&lt;/tt&gt; is blocking.  If we were to call this in the server's initialization routine, for example, the whole &lt;tt&gt;gen_server&lt;/tt&gt; mechanism would block until a client connected.
&lt;/p&gt;

&lt;p&gt;
There are two ways to get around this.  One involves using a lower-level connection mechanism that supports non-blocking (or asynchronous) accepting.  There is then a whole family of functions, most notably &lt;tt&gt;gen_tcp:controlling_process&lt;/tt&gt;, that helps you manage who receives what messages when clients connect.
&lt;/p&gt;

&lt;p&gt;
A simpler and, in my opinion, more elegant solution is to have a single process that owns the listening socket.  This process does two things: spawns new acceptors and listens for "connection received" messages.  When it receives a message it knows to spawn a new acceptor.&lt;/p&gt;

&lt;p&gt;
An acceptor is free to call the blocking &lt;tt&gt;gen_tcp:accept&lt;/tt&gt; since it's running in its own process.  When it receives a connection it fires an asynchronous message back to the parent process and immediately calls the business logic function.
&lt;/p&gt;

&lt;p&gt;
Here's the code.  I've commented where appropriate, so hopefully it's readable.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(socket_server).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-behavior(gen_server).

-export([init/1, code_change/3, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).
-export([accept_loop/1]).
-export([start/3]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

-record(server_state, {
		port,
		loop,
		ip=any,
		lsocket=null}).

start(Name, Port, Loop) -&gt;
	State = #server_state{port = Port, loop = Loop},
	gen_server:start_link({local, Name}, ?MODULE, State, []).

init(State = #server_state{port=Port}) -&gt;
	case gen_tcp:listen(Port, ?TCP_OPTIONS) of
   		{ok, LSocket} -&gt;
   			NewState = State#server_state{lsocket = LSocket},
   			{ok, accept(NewState)};
   		{error, Reason} -&gt;
   			{stop, Reason}
	end.

handle_cast({accepted, _Pid}, State=#server_state{}) -&gt;
	{noreply, accept(State)}.

accept_loop({Server, LSocket, {M, F}}) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	% Let the server spawn a new process and replace this loop
	% with the echo loop, to avoid blocking 
	gen_server:cast(Server, {accepted, self()}),
	M:F(Socket).
	
% To be more robust we should be using spawn_link and trapping exits
accept(State = #server_state{lsocket=LSocket, loop = Loop}) -&gt;
	proc_lib:spawn(?MODULE, accept_loop, [{self(), LSocket, Loop}]),
	State.

% These are just here to suppress warnings.
handle_call(_Msg, _Caller, State) -&gt; {noreply, State}.
handle_info(_Msg, Library) -&gt; {noreply, Library}.
terminate(_Reason, _Library) -&gt; ok.
code_change(_OldVersion, Library, _Extra) -&gt; {ok, Library}.&lt;/pre&gt;

&lt;p&gt;
We use &lt;tt&gt;gen_server:cast&lt;/tt&gt; to pass asynchronous messages back to the listening process.  When the listening process receives the message &lt;tt&gt;accepted&lt;/tt&gt; it spawns a new acceptor.
&lt;/p&gt;

&lt;p&gt;
Right now this server is not very robust because if the active acceptor fails, for whatever reason, the server will stop accepting connections.  To make it more OTP-like we should be trapping exits and firing off a new acceptor in the event that a connection fails.
&lt;/p&gt;
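&lt;p&gt;
As a rough sketch of what that could look like (untested, and not part of the module above): trap exits in &lt;tt&gt;init&lt;/tt&gt;, spawn acceptors with &lt;tt&gt;spawn_link&lt;/tt&gt;, and restart the accept loop from &lt;tt&gt;handle_info&lt;/tt&gt; when an acceptor dies abnormally:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% Hypothetical hardening of socket_server.
init(State = #server_state{port=Port}) -&gt;
	process_flag(trap_exit, true),  % exit signals arrive as messages
	case gen_tcp:listen(Port, ?TCP_OPTIONS) of
		{ok, LSocket} -&gt;
			{ok, accept(State#server_state{lsocket = LSocket})};
		{error, Reason} -&gt;
			{stop, Reason}
	end.

% Link to each acceptor so we hear about its death.
accept(State = #server_state{lsocket=LSocket, loop = Loop}) -&gt;
	proc_lib:spawn_link(?MODULE, accept_loop, [{self(), LSocket, Loop}]),
	State.

% A normal exit means the client loop finished; an abnormal one means
% the acceptor crashed, so we spawn a replacement.
handle_info({'EXIT', _Pid, normal}, State) -&gt;
	{noreply, State};
handle_info({'EXIT', _Pid, _Reason}, State) -&gt;
	{noreply, accept(State)};
handle_info(_Msg, State) -&gt;
	{noreply, State}.&lt;/pre&gt;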

&lt;h3&gt;A "Generic" Echo Server&lt;/h3&gt;
&lt;p&gt;
The echo server is the easiest server to write, so let's do it using our new abstract socket server.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(echo_server).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([start/0, loop/1]).

% echo_server specific code
start() -&gt;
	socket_server:start(?MODULE, 7000, {?MODULE, loop}).
loop(Socket) -&gt;
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} -&gt;
            gen_tcp:send(Socket, Data),
            loop(Socket);
        {error, closed} -&gt;
            ok
    end.&lt;/pre&gt;

&lt;p&gt;
As you can see the "server" becomes nothing more than its business logic.  The connection handling has been generalized and pushed off into its own &lt;tt&gt;socket_server&lt;/tt&gt;.  The loop in our generic server is actually identical to the loop in our original echo server, too.
&lt;/p&gt;
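&lt;p&gt;
To try it out, compile both modules and start the server from the Erlang shell (the port, 7000, is hard-coded in &lt;tt&gt;echo_server:start/0&lt;/tt&gt; above; the pid below is illustrative):
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;1&gt; c(socket_server).
{ok,socket_server}
2&gt; c(echo_server).
{ok,echo_server}
3&gt; echo_server:start().
{ok,&lt;0.40.0&gt;}&lt;/pre&gt;

&lt;p&gt;
Then &lt;tt&gt;telnet localhost 7000&lt;/tt&gt; from another terminal and whatever you type should be echoed back.
&lt;/p&gt;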

&lt;p&gt;
Hopefully you all can learn from this as much as I did.  I finally feel like I'm starting to understand Erlang.
&lt;/p&gt;

&lt;p&gt;
Also, feel free to leave a comment, especially if you have any thoughts on how I can improve my code.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 16 Jun 2008 06:00:28 +0000</pubDate>
      <link>http://20bits.com/article/erlang-a-generalized-tcp-server</link>
    </item>
    <item>
      <title>Erlang: An Introduction to Records</title>
      <description>&lt;p&gt;
Erlang has only two compound data types: lists and tuples.  Neither supports named access, so creating associative arrays a la PHP, Ruby, or Python is impossible without additional libraries.
&lt;/p&gt;

&lt;p&gt;
That is, in Ruby, I could do: &lt;/p&gt;
&lt;pre class="brush: ruby"&gt;server_opts = {:port =&gt; 8080, :ip =&gt; '127.0.0.1', :max_connections =&gt; 10}&lt;/pre&gt;

&lt;p&gt;
while in Erlang there's no such support at the language (syntax) level.
&lt;/p&gt;

&lt;p&gt;
To get around this limitation Erlang provides a pseudo data type called &lt;em&gt;records&lt;/em&gt;.  Records support named access, with some cruft.  We'll see why I call them "pseudo" data types later on.
&lt;/p&gt;

&lt;h3&gt;Defining Records&lt;/h3&gt;
&lt;p&gt;
Records are more similar to &lt;tt&gt;structs&lt;/tt&gt; in C than to associative arrays in that they require you to define their contents up front and they can only hold data.  Here's an example record that stores connection options for a server of some kind.&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;-module(my_server).

-record(server_opts,
	{port,
	ip="127.0.0.1",
	max_connections=10}).

% The rest of your code goes here.&lt;/pre&gt;

&lt;p&gt;
Records are defined using the &lt;tt&gt;-record&lt;/tt&gt; directive.  The first parameter is the name of the record and the second parameter is a tuple that contains the fields in the record and their default values.
&lt;/p&gt;

&lt;p&gt;
In our case we've defined a &lt;tt&gt;server_opts&lt;/tt&gt; record that has three fields: a port, a binding IP, and the maximum number of connections allowed.  There is no default port, but the default value of &lt;tt&gt;ip&lt;/tt&gt; is "127.0.0.1" and the default value of &lt;tt&gt;max_connections&lt;/tt&gt; is 10.
&lt;/p&gt;

&lt;h3&gt;Creating Records&lt;/h3&gt;
&lt;p&gt;
Records are created by using the hash (#) symbol.  Using the &lt;tt&gt;server_opts&lt;/tt&gt; record from above the following are all valid ways to create a record.&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;Opts1 = #server_opts{port=80}.&lt;/pre&gt;

&lt;p&gt;
This creates a &lt;tt&gt;server_opts&lt;/tt&gt; record with &lt;tt&gt;port&lt;/tt&gt; set to 80.  The other fields have their default value.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;Opts2 = #server_opts{port=80, ip="192.168.0.1"}.&lt;/pre&gt;
&lt;p&gt;This creates a &lt;tt&gt;server_opts&lt;/tt&gt; record like the one above, except now &lt;tt&gt;ip&lt;/tt&gt; is set to "192.168.0.1".&lt;/p&gt;

&lt;p&gt;
In short, when creating a record you can include whatever fields you like.  Omitted fields will take on their default value.
&lt;/p&gt;

&lt;h3&gt;Accessing Records&lt;/h3&gt;
&lt;p&gt;
Accessing records is clumsy and it's where they start to reveal their cruft.  If I want to access the &lt;tt&gt;port&lt;/tt&gt; field I can do&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;Opts = #server_opts{port=80, ip="192.168.0.1"},
Opts#server_opts.port&lt;/pre&gt;

&lt;p&gt;
Yep, that's right, any time you want to access a record you have to include the record's name.  Why?  Because records aren't really internal data types, they're a compiler trick.
&lt;/p&gt;

&lt;p&gt;
Internally records are tuples that look something like this:&lt;/p&gt;&lt;pre class="brush: erlang"&gt;{server_opts, 80, "127.0.0.1", 10}&lt;/pre&gt;

&lt;p&gt;
The compiler maps the named fields to their position in the tuple.
&lt;/p&gt;

&lt;p&gt;
Record definitions are tracked at compile time: the compiler translates all the record logic into tuple logic when it compiles your Erlang program.  That is, there is no record "type" at runtime, so you have to tell Erlang which record you're talking about every time you access one.
&lt;/p&gt;
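&lt;p&gt;
To see this for yourself, here's an illustrative function (it assumes the &lt;tt&gt;server_opts&lt;/tt&gt; definition from above is in scope) that returns &lt;tt&gt;true&lt;/tt&gt;, because the record and the tagged tuple are one and the same term:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;records_are_tuples() -&gt;
	Opts = #server_opts{port=80},
	% The record matches its underlying tuple, field for field.
	{server_opts, 80, "127.0.0.1", 10} =:= Opts.&lt;/pre&gt;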

&lt;h3&gt;Updating Records&lt;/h3&gt;
&lt;p&gt;
Updating records works much like creating records.  For example, &lt;/p&gt; &lt;pre class="brush: erlang"&gt;Opts = #server_opts{port=80, ip="192.168.0.1"},
NewOpts = Opts#server_opts{port=7000}.&lt;/pre&gt;

&lt;p&gt;
would first create a &lt;tt&gt;server_opts&lt;/tt&gt; record.  &lt;tt&gt;NewOpts = Opts#server_opts{port=7000}&lt;/tt&gt; then creates a copy of &lt;tt&gt;Opts&lt;/tt&gt; with a port number of 7000 rather than 80 and binds it to
&lt;tt&gt;NewOpts&lt;/tt&gt;.&lt;/p&gt;

&lt;h3&gt;Matching Records and Guard Statements&lt;/h3&gt;
&lt;p&gt;
This wouldn't be a tutorial about Erlang unless we talked about pattern matching.  Let's say we want to do something particular with a server if it is running on port 8080 and something else otherwise.&lt;/p&gt; &lt;pre class="brush: erlang"&gt;handle(_Opts = #server_opts{port=8080}) -&gt;
	% do special port 8080 stuff
	special;
handle(_Opts = #server_opts{}) -&gt;
	% default stuff
	default.&lt;/pre&gt;

&lt;p&gt;
Guard statements work similarly.  For example, binding to a port below 1024 typically requires root access, so we might want to special-case that:
&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;handle(Opts) when Opts#server_opts.port &lt;= 1024 -&gt;
	% requires root access
handle(Opts=#server_opts{}) -&gt;
	% Doesn't require root access&lt;/pre&gt;

&lt;h3&gt;Using Records&lt;/h3&gt;
&lt;p&gt;
In my limited time using Erlang I've seen records used primarily for two things.  First, records are used to keep state, especially when using the generic server behaviour.  Since Erlang is side-effect free, state cannot be kept globally; instead it must be passed around from function to function.
&lt;/p&gt;

&lt;p&gt;
Second, and perhaps as a subset of the first, records are used to keep track of configuration options.
&lt;/p&gt;

&lt;p&gt;
There are limitations to records, however.  Most notably, you can't add or remove fields on the fly; like C structs, the structure of the record is defined beforehand.
&lt;/p&gt;

&lt;p&gt;
If you want to add and remove fields on the fly, or if you don't know what fields you'll have until runtime, you should use &lt;a href="http://www.erlang.org/doc/man/dict.html"&gt;dicts&lt;/a&gt; rather than records.
&lt;/p&gt;
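&lt;p&gt;
For comparison, here's a sketch of the server options built as a dict instead (the key names are my own invention):
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% No -record directive needed; keys can be decided at runtime.
Opts0 = dict:from_list([{port, 8080}, {ip, "127.0.0.1"}]),
% Fields can be added (or replaced) on the fly...
Opts1 = dict:store(max_connections, 10, Opts0),
% ...and read back by name.
{ok, Port} = dict:find(port, Opts1).&lt;/pre&gt;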

&lt;h3&gt;Further Reading&lt;/h3&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;a href="http://www.erlang.org/doc/reference_manual/records.html"&gt;Records Reference Manual&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="http://www.erlang.org/doc/programming_examples/records.html"&gt;Records Programming Examples&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="http://damienkatz.net/2008/03/what_sucks_abou.html"&gt;What Sucks About Erlang&lt;/a&gt; (includes section on records)&lt;/li&gt;
&lt;/ul&gt;</description>
      <pubDate>Sun, 15 Jun 2008 07:00:17 +0000</pubDate>
      <link>http://20bits.com/article/erlang-an-introduction-to-records</link>
    </item>
    <item>
      <title>Erlang: A Generic Server Tutorial</title>
      <description>&lt;p&gt;
One of the benefits of working with Erlang is that it was designed with real-world applications in mind.  This is reflected in OTP, the Open Telecom Platform, a set of standard libraries that ships with the default Erlang VM.
&lt;/p&gt;

&lt;p&gt;
Erlang/OTP implements lots of networking paradigms in a generic way, including finite state machines (&lt;a href="http://www.erlang.org/doc/design_principles/fsm.html"&gt;&lt;tt&gt;gen_fsm&lt;/tt&gt;&lt;/a&gt;), event handling (&lt;a href="http://www.erlang.org/doc/design_principles/events.html"&gt;&lt;tt&gt;gen_event&lt;/tt&gt;&lt;/a&gt;), and client/server interaction (&lt;a href="http://www.erlang.org/doc/design_principles/gen_server_concepts.html"&gt;&lt;tt&gt;gen_server&lt;/tt&gt;&lt;/a&gt;).  We're going to cover the last of these, &lt;tt&gt;gen_server&lt;/tt&gt;, Erlang/OTP's generic server library.
&lt;/p&gt;

&lt;h3&gt;The Client/Server Model&lt;/h3&gt;
&lt;p&gt;
The client/server model is based around many clients connecting to a single, central server.  The clients can send messages to and receive messages from the server while the server maintains a global state.
&lt;/p&gt;

&lt;p&gt;
Here's a picture. &lt;img src="http://assets.20bits.com/20080609/client-server.png" alt="" title="client-server" width="305" height="188" class="math size-full wp-image-157" /&gt;
&lt;/p&gt;

&lt;p&gt;
A common instance where the client/server model makes sense is when you have some resource you want to distribute among several people.  The server controls access and allocation of the resource and the clients consume it.
&lt;/p&gt;

&lt;h3&gt;The Code&lt;/h3&gt;
&lt;p&gt;
Code speaks louder than words, so without further ado here is a simple server that simulates a library.  People can check out and return books from the library, but there's only one copy of each book.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(library).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-behaviour(gen_server).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

-export([start/0, checkout/2, lookup/1, return/1]).

% These are all wrappers for calls to the server
start() -&gt; gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
checkout(Who, Book) -&gt; gen_server:call(?MODULE, {checkout, Who, Book}).	
lookup(Book) -&gt; gen_server:call(?MODULE, {lookup, Book}).
return(Book) -&gt; gen_server:call(?MODULE, {return, Book}).

% This is called when the server is started
init([]) -&gt;
	Library = dict:new(),
	{ok, Library}.

% handle_call is invoked in response to gen_server:call
handle_call({checkout, Who, Book}, _From, Library) -&gt;
	Response = case dict:is_key(Book, Library) of
		true -&gt;
			NewLibrary = Library,
			{already_checked_out, Book};
		false -&gt;
			NewLibrary = dict:append(Book, Who, Library),
			ok
	end,
	{reply, Response, NewLibrary};

handle_call({lookup, Book}, _From, Library) -&gt;
	Response = case dict:is_key(Book, Library) of
		true -&gt;
			{who, lists:nth(1, dict:fetch(Book, Library))};
		false -&gt;
			{not_checked_out, Book}
	end,
	{reply, Response, Library};

handle_call({return, Book}, _From, Library) -&gt;
	NewLibrary = dict:erase(Book, Library),
	{reply, ok, NewLibrary};

handle_call(_Message, _From, Library) -&gt;
	{reply, error, Library}.

% We get compile warnings from gen_server unless we define these
handle_cast(_Message, Library) -&gt; {noreply, Library}.
handle_info(_Message, Library) -&gt; {noreply, Library}.
terminate(_Reason, _Library) -&gt; ok.
code_change(_OldVersion, Library, _Extra) -&gt; {ok, Library}.&lt;/pre&gt;

&lt;h3&gt;Breaking It Down&lt;/h3&gt;
&lt;p&gt;
The first line of interest is &lt;tt&gt;-behaviour(gen_server).&lt;/tt&gt;  This tells Erlang that we'll be using the &lt;tt&gt;gen_server&lt;/tt&gt; module for our behaviour.
&lt;/p&gt;

&lt;p&gt;
Next we implement wrappers for server calls.  We start the library server by calling &lt;tt&gt;library:start/0&lt;/tt&gt;, which in turn calls &lt;tt&gt;gen_server:start_link/4&lt;/tt&gt;.
&lt;/p&gt;
&lt;p&gt;
Whatever we pass as the third argument to &lt;tt&gt;start_link/4&lt;/tt&gt; is handed to &lt;tt&gt;init/1&lt;/tt&gt;, the callback that initializes the server's state.  In our case we just want to create a new dictionary to store which books have been checked out.
&lt;/p&gt;

&lt;p&gt;
Once we've started the server we want to be able to check out books, see if a book has been checked out, and return books.  We implement wrappers to handle these functions, each of which invokes &lt;tt&gt;gen_server:call/2&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;gen_server:call&lt;/tt&gt; is used for synchronous communication between the client and the server.  That is, it is used when the server expects a response.  These calls are handled by &lt;tt&gt;handle_call&lt;/tt&gt; (big surprise, huh?).
&lt;/p&gt;
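&lt;p&gt;
For contrast, &lt;tt&gt;gen_server:cast/2&lt;/tt&gt; is the asynchronous counterpart: it sends a message without waiting for a reply.  As a sketch (not part of the library module above), a fire-and-forget version of &lt;tt&gt;return&lt;/tt&gt; might look like this:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% The caller gets ok back immediately, before the server has acted.
return_async(Book) -&gt; gen_server:cast(?MODULE, {return, Book}).

% Handled by handle_cast rather than handle_call; note there's no reply.
handle_cast({return, Book}, Library) -&gt;
	{noreply, dict:erase(Book, Library)}.&lt;/pre&gt;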

&lt;p&gt;
All of the meat is in the &lt;tt&gt;handle_call&lt;/tt&gt; definitions.  As you can see the server understands three messages: &lt;tt&gt;checkout&lt;/tt&gt;, &lt;tt&gt;lookup&lt;/tt&gt;, and &lt;tt&gt;return&lt;/tt&gt;.  We have one definition of &lt;tt&gt;handle_call&lt;/tt&gt; for each possible message and a default action that returns an error when it receives a message it doesn't understand.
&lt;/p&gt;

&lt;p&gt;
Here's an example of how you'd actually use the library server.  All of the commands are executed in the Erlang shell, &lt;tt&gt;erl&lt;/tt&gt;.
&lt;pre class="brush: erlang"&gt;1&gt; c(library).
{ok,library}
2&gt; library:start().
{ok,&lt;0.39.0&gt;}
3&gt; library:checkout(jesse, "American Creation").
ok
4&gt; library:lookup("American Creation").
{who,jesse}
5&gt; library:checkout(james, "American Creation").
{already_checked_out,"American Creation"}
6&gt; library:return("American Creation").
ok
7&gt; library:checkout(james, "American Creation").
ok&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;Other Goodies and Caveats&lt;/h3&gt;
&lt;p&gt;
Writing code with gen_server isn't all academic.  There are real benefits.
&lt;/p&gt;

&lt;h4&gt;Abstraction&lt;/h4&gt;
&lt;p&gt;
The greatest benefit of gen_server is the abstraction it provides.  By encapsulating the essence of the client/server model we can focus on the business logic rather than low-level event management.
&lt;/p&gt;

&lt;p&gt;
More importantly, however, it abstracts away the &lt;em&gt;protocol&lt;/em&gt;.  The code behind the scenes can change without affecting the client/server behavior.
&lt;/p&gt;

&lt;h4&gt;Supervision&lt;/h4&gt;
&lt;p&gt;
Although we don't make use of it here, gen_server supports supervision behaviors.  If a call throws an exception the server can capture it and restart the appropriate section of code.  This is handled using &lt;tt&gt;handle_info&lt;/tt&gt;.  This becomes more important if the server is spawning additional processes.
&lt;/p&gt;

&lt;h4&gt;Code Swapping&lt;/h4&gt;
&lt;p&gt;
We don't make use of this either, but gen_server supports hot code swapping using the &lt;tt&gt;code_change&lt;/tt&gt; callback.  This is one place where Erlang really shines, and gen_server carries it through to the client/server model.
&lt;/p&gt;

&lt;h4&gt;Caveats&lt;/h4&gt;
&lt;p&gt;
It's not all awesome, though.  It's surprisingly tricky to write gen_server code that handles TCP/IP connections.  I'll give an example of mixing networking and gen_server in a future article, but there are all sorts of control and blocking issues that have to be dealt with.
&lt;/p&gt;

&lt;p&gt;
Leave a comment if you have any cool &lt;tt&gt;gen_server&lt;/tt&gt; examples out there.
&lt;/p&gt;</description>
      <pubDate>Mon, 09 Jun 2008 07:00:33 +0000</pubDate>
      <link>http://20bits.com/article/erlang-a-generic-server-tutorial</link>
    </item>
    <item>
      <title>Politics and Tufte's Lie Factor</title>
      <description>&lt;p&gt;
I admit it, I'm a political junkie.  I'm also a math guy who loves design.  Politics gets emotional fast and people are quick to stretch whatever data they have to fit their small, partisan aims.
&lt;/p&gt;

&lt;p&gt;
Pundits and partisans misuse statistics all the time, but I happened upon a real gem that perfectly illustrates Edward Tufte's "Lie Factor."
&lt;/p&gt;

&lt;h3&gt;The Lie Factor&lt;/h3&gt;
&lt;p&gt;
In 1983 Edward Tufte wrote a book called &lt;em&gt;The Visual Display of Quantitative Information&lt;/em&gt;.  In it he states the following principle: &lt;blockquote&gt;The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The Lie Factor measures the extent to which a graph violates this principle.  Mathematically it can be stated as follows: &lt;img src="http://assets.20bits.com/20080525/result.png" alt="" title="result" width="395" height="44" class="math size-full wp-image-146" /&gt;
&lt;/p&gt;

&lt;p&gt;
The Lie Factor should be between 0.95 and 1.05.  If it falls outside that range then either the graph's creator didn't know what they were doing or they were intentionally trying to distort the facts.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update&lt;/strong&gt;: I realized after experimenting with Excel that Jay's graph looks the way it does because that's the Excel default.  Stupid on Excel's part, but it's still careless not to notice.
&lt;/p&gt;

&lt;h3&gt;The Culprit&lt;/h3&gt;
&lt;p&gt;
On Friday &lt;a href="http://www.realclearpolitics.com/horseraceblog/2008/05/a_review_of_obamas_voting_coal.html"&gt;Jay Cost&lt;/a&gt; over at Real Clear Politics made a post entitled "A Review of Obama's Voting Coalition."  It contained no commentary, only six graphs.  Here's the fifth graph: &lt;img src="http://assets.20bits.com/20080525/votes-per-pledged-delegate.gif" alt="" title="votes-per-pledged-delegate" width="499" height="362" class="math size-full wp-image-147" /&gt;
&lt;/p&gt;

&lt;p&gt;
For those who don't know, the Democrats nominate their candidate based on the number of delegates.  Most states allocate their delegates proportionally based on the popular vote in each congressional district.  One side-effect of this is that a vote in a sparsely populated congressional district can be worth more delegates than one in a densely populated congressional district.
&lt;/p&gt;

&lt;p&gt;
But looking at this graph I was taken aback.  Is it really true that each vote received by Obama was worth three times as many delegates as a vote received by Clinton?  Take a closer look, though: the "zero point" on the graph is not zero but 10,200.  The absolute difference is the same but the relative difference is skewed.  Lie factor!
&lt;/p&gt;

&lt;p&gt;
To see why this matters look at the corrected graph. &lt;img src="http://assets.20bits.com/20080525/untitled3.png" alt="" title="untitled3" width="496" height="363" class="math size-full wp-image-148" /&gt;
&lt;/p&gt;

&lt;p&gt;
As you can see the difference is much less stark.
&lt;/p&gt;

&lt;h3&gt;The Effect&lt;/h3&gt;
&lt;p&gt;
It looks like Clinton gets around 11,750 votes per delegate and Obama gets around 10,800.  This is around a 13.2% difference in the data.
&lt;/p&gt;

&lt;p&gt;
The size of the effect on the graph, however, shows a 61.3% difference between the two numbers.  That's a Lie Factor of around &lt;strong&gt;4.64&lt;/strong&gt;!  Someone needs to review their Tufte.
&lt;/p&gt;
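&lt;p&gt;
To make the arithmetic explicit, plugging the two percentages into Tufte's formula gives:
&lt;/p&gt;
&lt;pre&gt;Lie Factor = (effect shown in graphic) / (effect in data)
           = 61.3% / 13.2%
           &amp;asymp; 4.64&lt;/pre&gt;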

&lt;h3&gt;The Echo Chamber&lt;/h3&gt;
&lt;p&gt;
One reason I don't like the political blogosphere is that it's totally predictable.  The same characters say and act the same way, all the time.  They may as well be giving out advance copies of their script.
&lt;/p&gt;

&lt;p&gt;
So, of course, &lt;a href="http://www.mydd.com/story/2008/5/25/5473/21652"&gt;Jerome Armstrong&lt;/a&gt;, the creator of MyDD and a vociferous Clinton supporter, placed this graph on the front page of his site without a hint of irony or self-reflection.  He didn't even bother to analyze the graph and see if it really said what he thought it did. 
&lt;/p&gt;

&lt;p&gt;
This is a great example of another phenomenon called &lt;a href="http://en.wikipedia.org/wiki/Confirmation_bias"&gt;confirmation bias&lt;/a&gt;, where people search out or skew information so that it conforms to their currently held beliefs.  In this case, Jerome just blindly posted a highly misleading graph because it supported his thesis that Clinton should be the Democratic nominee.
&lt;/p&gt;

&lt;p&gt;
It's a comedy of errors, to be sure, but at least we can learn what &lt;em&gt;not&lt;/em&gt; to do if we don't want to make ourselves look clueless.
&lt;/p&gt;


&lt;p&gt;
P.S., &lt;a href="http://www.math.yorku.ca/SCS/Gallery/lie-factor.html"&gt;this website&lt;/a&gt; has a list of the graph examples that Tufte himself used to illustrate the Lie Factor principle.  Check it out.
&lt;/p&gt;
      <pubDate>Sun, 25 May 2008 10:57:58 +0000</pubDate>
      <link>http://20bits.com/article/politics-and-tuftes-lie-factor</link>
    </item>
    <item>
      <title>An EventMachine Tutorial</title>
      <description>&lt;p&gt;
&lt;a href="http://rubyeventmachine.com/"&gt;Ruby / EventMachine&lt;/a&gt; is an event-driven networking library for Ruby, similar to &lt;a href="http://twistedmatrix.com/"&gt;Twisted&lt;/a&gt; for Python.  Certain aspects of it are also similar to Erlang/OTP's &lt;a href="http://www.erlang.org/doc/man/gen_server.html"&gt;gen_server&lt;/a&gt; module.
&lt;/p&gt;

&lt;h3&gt;Why EventMachine?&lt;/h3&gt;
&lt;p&gt;
EventMachine satisfies two key requirements.  First, because EventMachine is an implementation of the &lt;a href="http://en.wikipedia.org/wiki/Reactor_pattern"&gt;reactor pattern&lt;/a&gt;, it separates networking logic from application logic.  This means you don't have to worry about handling the low-level connections and socket logic.  Instead, you just implement callbacks for the appropriate networking events.
&lt;/p&gt;

&lt;p&gt;
Second, EventMachine is lightweight and supports system-level networking primitives.  This means that Ruby's speed issues aren't a problem: the performance-critical stuff is in C/C++ and it uses the best your OS has to offer (e.g., epoll in Linux).
&lt;/p&gt;

&lt;p&gt;
In short, EventMachine makes it really easy to write scalable networking services in Ruby.  And who doesn't want to do that?
&lt;/p&gt;

&lt;h3&gt;Installing EventMachine&lt;/h3&gt;
&lt;p&gt;
EventMachine comes in a Ruby gem called &lt;tt&gt;eventmachine&lt;/tt&gt;.  Just run &lt;tt&gt;gem install eventmachine&lt;/tt&gt; to get the ball rolling.
&lt;/p&gt;
&lt;p&gt;
Note that EventMachine requires a C++ compiler on your system to build the native extensions.
&lt;/p&gt;

&lt;h3&gt;Using EventMachine&lt;/h3&gt;
&lt;p&gt;
I learn by example, so let's dive in.
&lt;/p&gt;

&lt;h4&gt;echo service&lt;/h4&gt;
&lt;p&gt;
The echo service is a traditional UNIX service that accepts incoming network connections and echoes back whatever the client sends, byte-for-byte.  With EventMachine it's really simple.
&lt;/p&gt;
&lt;pre class="brush: ruby"&gt;#!/usr/bin/env ruby

require 'rubygems'
require 'eventmachine'

module EchoServer  
  def receive_data(data)
    send_data(data)
  end
end

EventMachine::run do
  host = '0.0.0.0'
  port = 8080
  EventMachine::start_server host, port, EchoServer
  puts "Started EchoServer on #{host}:#{port}..."
end&lt;/pre&gt;

&lt;p&gt;
Running the above code will start an echo daemon listening on port 8080 for all incoming connections.  Let's break it down.
&lt;/p&gt;

&lt;p&gt;
First look at the bottom of the code, where we call &lt;tt&gt;EventMachine::run&lt;/tt&gt;.  This starts the event loop, which runs until we explicitly call &lt;tt&gt;EventMachine::stop_event_loop&lt;/tt&gt;.
It takes a block in which we typically start whatever clients or servers will live inside that loop.
&lt;/p&gt;

&lt;p&gt;
In our case we start our echo service using &lt;tt&gt;EventMachine::start_server&lt;/tt&gt;.  The first two parameters are the host and port.  The combination 0.0.0.0:8080 means that EchoServer will listen for connections on port 8080 from any IP address.
&lt;/p&gt;

&lt;p&gt;
The third parameter is the handler.  Typically the handler is a Ruby module which defines the appropriate callbacks.  This is so we don't pollute the global namespace with callback functions.  EchoServer only defines &lt;tt&gt;receive_data&lt;/tt&gt;, which is called whenever we receive data over a connection.
&lt;/p&gt;

&lt;p&gt;
Finally, in the EchoServer module we call &lt;tt&gt;send_data&lt;/tt&gt; whenever EventMachine invokes &lt;tt&gt;receive_data&lt;/tt&gt;, which sends data over the connection that initiated the callback.
&lt;/p&gt;

&lt;h3&gt;HTTP client&lt;/h3&gt;
&lt;p&gt;
EventMachine can also be used to create clients.  Rather than calling &lt;tt&gt;EventMachine::start_server&lt;/tt&gt; we call &lt;tt&gt;EventMachine::connect&lt;/tt&gt;.  Here's a program that connects to an HTTP server and prints out the headers it receives, EventMachine-style.
&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;#!/usr/bin/env ruby

require 'rubygems'
require 'eventmachine'

module HttpHeaders 
  def post_init
    send_data "GET /\r\n\r\n"
    @data = ""
  end
  
  def receive_data(data)
    @data &lt;&lt; data
  end
  
  def unbind
    if @data =~ /[\n][\r]*[\n]/m
      $`.each_line {|line| puts "&gt;&gt;&gt; #{line}" }
    end
    
    EventMachine::stop_event_loop
  end
end

EventMachine::run do
  EventMachine::connect 'microsoft.com', 80, HttpHeaders
end&lt;/pre&gt;

&lt;p&gt;
If you change 'microsoft.com' to &lt;tt&gt;ARGV[0]&lt;/tt&gt; you can pass in whatever website you'd like on the command line and see what headers it returns.
&lt;/p&gt;

&lt;p&gt;
Here we see a new callback, &lt;tt&gt;post_init&lt;/tt&gt;.  This is called after a connection is set up.  If you're a client this means you've just connected to a server and if you're a server this means a new client has just connected.
&lt;/p&gt;

&lt;p&gt;
We also use the &lt;tt&gt;unbind&lt;/tt&gt; callback, which is called when either end of the connection is closed.  In our case this means the server closed the connection because it has sent us all the data it's going to send.  If you were implementing a server it would mean that a client had disconnected.
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;unbind&lt;/tt&gt; and &lt;tt&gt;post_init&lt;/tt&gt; are complementary.  The former is called whenever a connection is closed and the latter whenever a connection is created.  I'm not sure why they weren't named in a way that implies their relationship, but there you have it.
&lt;/p&gt;

&lt;p&gt;
Those are the three main callbacks: do something when a connection is created, do something when we receive data, and do something when a connection is closed.  Everything else is pretty much a variation on these, plus &lt;tt&gt;send_data&lt;/tt&gt; for sending data.
&lt;/p&gt;

&lt;h3&gt;Further Reading&lt;/h3&gt;
&lt;p&gt;
The mini-tutorial above covers the basics, but EventMachine supports a lot more.  I'd recommend looking over the &lt;a href="http://rubyeventmachine.com/"&gt;official EventMachine website&lt;/a&gt; and the &lt;a href="http://doachristianturndaily.info/eventmachine_rdoc/"&gt;EventMachine RDoc&lt;/a&gt; for more technical details.
&lt;/p&gt;

&lt;p&gt;
There's also an article about using &lt;a href="http://nutrun.com/weblog/distributed-programming-with-jabber-and-eventmachine/"&gt;EventMachine with Jabber&lt;/a&gt; to create a Jabber Bot that's worth reading.
&lt;/p&gt;

&lt;p&gt;
Leave a comment if you have any suggestions or insights.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Wed, 21 May 2008 19:59:46 +0000</pubDate>
      <link>http://20bits.com/article/an-eventmachine-tutorial</link>
    </item>
    <item>
      <title>Facebook Users Just Want Entertainment</title>
      <description>&lt;p&gt;
Starting in late 2007 Facebook began instituting its media strategy in earnest with &lt;a href="http://blog.facebook.com/blog.php?post=6972252130"&gt;Facebook Pages&lt;/a&gt;.  &lt;a href="http://www.facebook.com/business/?pages"&gt;According to Facebook&lt;/a&gt;, Pages offer "a unique experience where users can become more deeply connected with your business or brand."
&lt;/p&gt;

&lt;p&gt;
It's good to see Facebook becoming conscious about how they can help shape the future of branding, since this is where the real money is for social networks.  Let's see how Facebook Pages has evolved since last November.
&lt;/p&gt;

&lt;h3&gt;Factoids&lt;/h3&gt;
&lt;p&gt;
I went into this project without any pre-conceptions of what I would find.  I never really used Facebook pages and wasn't sure if there were any definite conclusions to be found in the data.  At best I thought the numbers might be useful for third parties.  Here are some interesting facts:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;As of May 18th, 2008 there are &lt;strong&gt;190,365&lt;/strong&gt; pages and &lt;strong&gt;50,800,399&lt;/strong&gt; fans across all pages.&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;One third&lt;/strong&gt; of all pages are dedicated to &lt;strong&gt;musicians&lt;/strong&gt;, but this category represents &lt;strong&gt;37%&lt;/strong&gt; of all fans.&lt;/li&gt;
	&lt;li&gt;After musicians, the category with the second-largest number of fans is &lt;strong&gt;TV Shows&lt;/strong&gt;, even though it has &lt;strong&gt;3.8 times&lt;/strong&gt; fewer fans than musicians.&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;7.9%&lt;/strong&gt; of pages are in the "other business" category, the largest business category, but only &lt;strong&gt;3.6%&lt;/strong&gt; of fans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;strong&gt;NOTE&lt;/strong&gt;: Facebook changes the copy from "fans" to something type-specific.  For example, politicians don't have "fans" they have "supporters."  I'm going to use fan in the general sense.
&lt;/p&gt;

&lt;p&gt;
It turns out that sports, entertainment, and politics are the three broad categories that perform the best on Facebook, as we'll see below.
&lt;/p&gt;

&lt;h3&gt;Trends&lt;/h3&gt;
&lt;p&gt;
There are two ways to measure the size of a category: one, by the number of pages in that category; two, by the number of fans in that category.  Let's look at both.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/pct-pages.png'&gt;&lt;img src="http://assets.20bits.com/20080519/pct-pages.png" alt="" title="pct-pages" width="274" height="300" class="math size-medium wp-image-141" /&gt;&lt;/a&gt;

&lt;p&gt;
The graph includes the ten largest categories by the number of pages in each, with the remaining 55 categories grouped into one.  The interesting thing is that the graph is divided almost evenly into thirds: the single largest category (musicians), the next nine largest categories, and the remaining 55 categories.
&lt;/p&gt;

&lt;p&gt;
This doesn't really tell us how Facebook Pages are performing by category, only how people are investing in them.  Let's look at the users' side of things.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/pct-fans.png'&gt;&lt;img src="http://assets.20bits.com/20080519/pct-fans.png" alt="" title="pct-fans" width="293" height="300" class="math size-medium wp-image-140" /&gt;&lt;/a&gt;
&lt;p&gt;
There are two things worth noting in this graph.  One, the categories make a qualitative shift towards entertainment and politics and away from general businesses.  Two, the graph becomes even more lop-sided, with musicians taking up almost 40% of the graph and "other" falling to around 20%.
&lt;/p&gt;


&lt;h3&gt;Usage&lt;/h3&gt;
&lt;p&gt;
So, here's a question: which categories fare best?  Let's look at the 100 largest pages by number of fans and see how they break down by category.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/top100.png'&gt;&lt;img src="http://assets.20bits.com/20080519/top100.png" alt="" title="top100" width="300" height="219" class="math size-medium wp-image-142" /&gt;&lt;/a&gt;

&lt;p&gt;
The difference here is even more stark.  &lt;strong&gt;48%&lt;/strong&gt; of the top 100 pages are musician pages and &lt;strong&gt;17%&lt;/strong&gt; are for TV shows.  No other category accounts for more than 10% of the top 100.
&lt;/p&gt;

&lt;p&gt;
Let's take a look at how pages are paying off by taking the difference between the percentage of fans in a category and the percentage of pages.  All else being equal we'd expect pages in different categories to have a similar "return on investment."  Anything beyond that can only be explained by how Facebook users interact with pages.
&lt;/p&gt;

&lt;p&gt;
That is, if 10% of all pages are in a category but 15% of all fans are in that same category, we say that category has a 5% "ROI."  This metric allows us to see which categories are most likely to pay off.
&lt;/p&gt;
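&lt;p&gt;
The metric is simple enough to check yourself.  Here's a minimal Python sketch; the figures are the TV Show and Musician rows from the table, and small rounding differences come from the table being computed on unrounded data:
&lt;/p&gt;

```python
# "ROI" of a Facebook Page category: its share of all fans minus
# its share of all pages, in percentage points.
def page_roi(pct_pages, pct_fans):
    return round(pct_fans - pct_pages, 2)

print(page_roi(1.20, 9.65))    # TV Show: 8.45
print(page_roi(34.11, 37.03))  # Musician: 2.92
```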

&lt;center&gt;
&lt;table class="monthly-data"&gt;
	&lt;tr class="top"&gt;
		&lt;th colspan="4"&gt;Facebook Page Categories by ROI&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;th style="width: 15ex;"&gt;Category&lt;/th&gt;
		&lt;th style="width: 9ex;" class="date"&gt;% of Pages&lt;/th&gt;
		&lt;th style="width: 9ex;" class="date"&gt;% of Fans&lt;/th&gt;
		&lt;th style="width: 8ex;" class="date"&gt;ROI&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;TV Show                                   &lt;/td&gt;&lt;td&gt;1.20%&lt;/td&gt;&lt;td&gt;9.65%&lt;/td&gt;&lt;td&gt;8.45%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Film                                      &lt;/td&gt;&lt;td&gt;1.44%&lt;/td&gt;&lt;td&gt;5.75%&lt;/td&gt;&lt;td&gt;4.31%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Musician                                  &lt;/td&gt;&lt;td&gt;34.11%&lt;/td&gt;&lt;td&gt;37.03%&lt;/td&gt;&lt;td&gt;2.93%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Politician                                &lt;/td&gt;&lt;td&gt;1.85%&lt;/td&gt;&lt;td&gt;4.43%&lt;/td&gt;&lt;td&gt;2.57%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Actor                                     &lt;/td&gt;&lt;td&gt;1.48%&lt;/td&gt;&lt;td&gt;3.20%&lt;/td&gt;&lt;td&gt;1.72%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Comedian                                  &lt;/td&gt;&lt;td&gt;0.94%&lt;/td&gt;&lt;td&gt;2.02%&lt;/td&gt;&lt;td&gt;1.09%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Game                                      &lt;/td&gt;&lt;td&gt;0.62%&lt;/td&gt;&lt;td&gt;1.41%&lt;/td&gt;&lt;td&gt;0.78%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Food and Beverage                         &lt;/td&gt;&lt;td&gt;0.91%&lt;/td&gt;&lt;td&gt;1.67%&lt;/td&gt;&lt;td&gt;0.77%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Athlete                                   &lt;/td&gt;&lt;td&gt;0.92%&lt;/td&gt;&lt;td&gt;1.65%&lt;/td&gt;&lt;td&gt;0.72%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Sports Team                               &lt;/td&gt;&lt;td&gt;1.37%&lt;/td&gt;&lt;td&gt;1.96%&lt;/td&gt;&lt;td&gt;0.59%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Sports / Athletics                        &lt;/td&gt;&lt;td&gt;1.16%&lt;/td&gt;&lt;td&gt;1.72%&lt;/td&gt;&lt;td&gt;0.57%&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
The most striking thing, for me, is how conventional these categories are: entertainment, politics, and sports.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Facebook's future rests in the branding of traditional media verticals.  They have captured a powerful demographic and have nearly perfected a distribution mechanism.  They deny they're a media company even as their VP of Product Marketing &lt;a href="http://www.news.com/8301-10784_3-9946606-7.html"&gt;says&lt;/a&gt; they're the "net's cable company."
&lt;/p&gt;

&lt;p&gt;
Entertainment and sports pages perform above expectations, while generic "business" pages perform below expectations.  The popularity of the political categories is to be expected since Facebook's largest userbase is in the US and 2008 is a Presidential election year.
&lt;/p&gt;

&lt;p&gt;
MySpace seems to understand that social networking and entertainment go hand-in-hand.  It's about time Facebook embraced the same &amp;mdash; their users already have.
&lt;/p&gt;

&lt;p&gt;
&lt;div class="download"&gt;Download &lt;a href="http://assets.20bits.com/downloads/facebook-pages-data.xls"&gt;the full dataset&lt;/a&gt; as a Microsoft Excel spreadsheet&lt;/div&gt;
&lt;/p&gt;</description>
      <pubDate>Mon, 19 May 2008 16:40:32 +0000</pubDate>
      <link>http://20bits.com/article/facebook-users-just-want-entertainment</link>
    </item>
    <item>
      <title>Facebook Bans Google Friend Connect</title>
      <description>&lt;p&gt;
Facebook announced today on their official developers' blog that they have &lt;a href="http://developers.facebook.com/news.php?blog=1&amp;story=111"&gt;banned Google Friend Connect&lt;/a&gt;, stating privacy concerns.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://www.google.com/friendconnect/"&gt;Google Friend Connect&lt;/a&gt; is a service that allows users to share their social data, such as personal information and friends, with websites that embed the Google-created widgets.  This data can come from many social networks, including Facebook, Hi5, Orkut, and Google Talk.
&lt;/p&gt;

&lt;p&gt;
The key section is the second-to-last paragraph: &lt;blockquote&gt;Now that Google has launched Friend Connect, we've had a chance to evaluate the technology. We've found that it redistributes user information from Facebook to other developers without users' knowledge, which doesn't respect the privacy standards our users have come to expect and is a violation of our Terms of Service. Just as we've been forced to do for other applications that redistribute data in a way users might not expect or understand, we've had to suspend Friend Connect's access to Facebook user information until it comes into compliance.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
They claim that they have "reached out to Google several times about this issue," but do not state what conversations, if any, took place.  Nor do they spell out exactly how Google Friend Connect violates the Terms of Service.
&lt;/p&gt;

&lt;p&gt;
Facebook announced on May 9th, 2008 that they will be launching their own competitor to Google Friend Connect, &lt;a href="http://developers.facebook.com/news.php?blog=1&amp;story=108"&gt;Facebook Connect&lt;/a&gt;.  Both Google Friend Connect and Facebook Connect came on the heels of MySpace's May 8th announcement of their &lt;a href="http://www.news.com/8301-13577_3-9939286-36.html"&gt;Data Availability&lt;/a&gt; project.
&lt;/p&gt;

&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;
First, it's exciting to see competition in the data portability space.  What seemed like a fantasy just a year ago is now an inexorable trend: data will flow freely across all social networks.  Or, as &lt;a href="http://blogs.forrester.com/charleneli/2008/03/the-future-of-s.html"&gt;Charlene Li said&lt;/a&gt;, "Social networks will be like air."
&lt;/p&gt;

&lt;p&gt;
Google, MySpace, Yahoo!, and Facebook all have huge stakes in this game, each controlling a slice of the social networking pie.  Facebook and MySpace have "social networks" in their own right, but don't forget that friend data can come from services like email and IM, too.
&lt;/p&gt;


&lt;p&gt;
Second, Facebook is skirting a fine, legalistic line.  They don't claim they have a problem with Google Friend Connect taking data from Facebook.  Rather, their problem is that Google Friend Connect supposedly then shares this data with third-parties.  Of course, the blog post announcing all this is rather opaque and gives no specifics.
&lt;/p&gt;

&lt;p&gt;
But does anyone sincerely believe this isn't just Facebook pressing its competitive advantages?  They're about to launch their own version of Friend Connect and crippling your competitor in anticipation is a play right out of the Microsoft platform handbook.
&lt;/p&gt;

&lt;p&gt;
I think the folks at Facebook are just upset because Google, for once, got the drop on them.  The only way they know how to respond is with muscle rather than grace.
&lt;/p&gt;

&lt;p&gt;
Facebook is a business, so I understand it has to operate out of self-interest, but I hope they're not so self-deluded as to believe this move was motivated by privacy concerns.  The original launch of Facebook Beacon is enough to know that Facebook doesn't have privacy on the mind all the time.
&lt;/p&gt;

&lt;p&gt;
On a more general level, Facebook likes to play the world domination game, as &lt;a href="http://discussionleader.hbsp.com/haque/2008/05/http20bitscom20080506thestateo.html"&gt;Umair Haque&lt;/a&gt; has pointed out countless times.  Using privacy as a front Facebook acts the paternalist.
&lt;/p&gt;

&lt;p&gt;
Does Facebook know best?  Are they the best arbiters of my privacy? Thanks, Facebook, but no thanks.  I should be able to do with my data as I please.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="http://www.techcrunch.com/2008/05/15/he-said-she-said-in-google-v-facebook/#comment-2299812"&gt;TechCrunch&lt;/a&gt; has more, including a follow-up from both Google and Facebook.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update 2&lt;/strong&gt;: John Furrier has an &lt;a href="http://furrier.org/2008/05/15/facebook-just-pulled-a-netscape-hey-facebook-what-are-you-thinking/"&gt;interesting post&lt;/a&gt; where he compares Facebook's strategy to Netscape rather than Microsoft.
&lt;/p&gt;</description>
      <pubDate>Thu, 15 May 2008 13:27:16 +0000</pubDate>
      <link>http://20bits.com/article/facebook-bans-google-friend-connect</link>
    </item>
    <item>
      <title>Interview Questions: Database Indexes</title>
      <description>&lt;p&gt;
Continuing my series on &lt;a href="http://20bits.com/tag/interview"&gt;interview questions&lt;/a&gt;, I'm going to spend some time covering ops and sysadmin questions.  We'll start by writing up an introduction to database indexes and their structure.
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
Most consumer-facing web startups these days use one of the major open source databases, either MySQL or PostgreSQL, to some degree.  If you want to prove your worth it's a good idea to get down to the nitty gritty and gain some understanding about these databases' internals.
&lt;/p&gt;

&lt;p&gt;
So, the question: "Explain to me what database indexes are and how they work."
&lt;/p&gt;

&lt;h3&gt;The Answer&lt;/h3&gt;
&lt;p&gt;
In a nutshell, a database index is an auxiliary data structure that allows for faster retrieval of data stored in the database.  An index is keyed off a specific column so that queries like "Give me all people with a last name of 'Smith'" are fast.
&lt;/p&gt;

&lt;h3&gt;The Theory&lt;/h3&gt;
&lt;p&gt;
Database tables, at least conceptually, look something like this: &lt;pre&gt;id	age	last_name	hometown
--	--	--		--
1	10	Johnson		San Francisco, CA
2	27	Smith		San Jose, CA
3	15	Rose		Palo Alto, CA
4	64	Farmer		Mill Valley, CA
5	55	Pauling		San Francisco, CA
6	17	Smith		Oakland, CA
...	...	...		...
100	49	Meyer		Berkeley, CA
101	30	Wayne		Monterey, CA
102	18	Schwartz	San Francisco, CA
104	6	Johnson		San Francisco, CA
...	...	...		...
10000	41	Fetterman	Mountain View, CA
10001	25	Breyer		Redwood City, CA&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
That is, a table is a collection of &lt;a href="http://en.wikipedia.org/wiki/Tuple"&gt;tuples&lt;/a&gt;&lt;span class="footnote"&gt;For bonus points, the "relational" in "relational database" comes from this fact, not from the idea that there are "relations" between tables.&lt;/span&gt;.  If we have a file like this sitting on disk how do we get all records that have a last name of 'Smith?'
&lt;/p&gt;

&lt;p&gt;
The code would wind up looking something like this: &lt;pre class="brush: python"&gt;results = []
for row in rows:
	if row[2] == 'Smith':
		results.append(row)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Finding the appropriate records requires checking the conditions (here, having a last name of 'Smith') for each row.  This is linear in the number of rows which, for many databases, could be millions or billions of rows.  Bad news.
&lt;/p&gt;

&lt;p&gt;
How can we make it faster?
&lt;/p&gt;

&lt;h3&gt;Database Indexes&lt;/h3&gt;
&lt;p&gt;
Any type of data structure that allows for (potentially) faster access can be considered an index.  Let's look at some.
&lt;/p&gt;

&lt;h4&gt;Hash Indexes&lt;/h4&gt;
&lt;p&gt;
Take the same example from above, finding all people with a last name of 'Smith.'  One solution would be to create a &lt;a href="http://en.wikipedia.org/wiki/Hash_function#Hash_tables"&gt;hash table&lt;/a&gt;.  The keys of the hash would be based off of the &lt;tt&gt;last_name&lt;/tt&gt; field and the values would be pointers to the database row.
&lt;/p&gt;

&lt;p&gt;
This type of index is called, unsurprisingly, a "hash index."  Most databases support them but they're generally not the default type.  Why?
&lt;/p&gt;

&lt;p&gt;
Well, consider a query like this: "Find all people who are younger than 45."  Hashes can deal with equality but not inequality.  That is, given the hashes of two fields, there's just no way for me to tell which is greater than the other, only whether they're equal or not.
&lt;/p&gt;
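&lt;p&gt;
As a rough sketch (a plain Python dict standing in for the database's hash table, with row data mirroring the example table above), a hash index on &lt;tt&gt;last_name&lt;/tt&gt; might look like this:
&lt;/p&gt;

```python
# Toy stand-in for the table on disk: rows keyed by id, each holding (age, last_name).
rows = {
    1: (10, 'Johnson'), 2: (27, 'Smith'),   3: (15, 'Rose'),
    4: (64, 'Farmer'),  5: (55, 'Pauling'), 6: (17, 'Smith'),
}

# Build the hash index: each last_name maps to the ids of the rows containing it.
index = {}
for row_id, (age, last_name) in rows.items():
    index.setdefault(last_name, []).append(row_id)

# Equality lookups are now a single hash probe instead of a full scan.
print(index['Smith'])  # [2, 6]

# But the keys carry no ordering, so a query like "everyone younger
# than 45" still requires scanning every row.
```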

&lt;h4&gt;B-tree Indexes&lt;/h4&gt;

&lt;p&gt;
The data structure most commonly used for database indexes is the &lt;a href="http://en.wikipedia.org/wiki/B-tree"&gt;B-tree&lt;/a&gt;, a specific kind of self-balancing tree.  A picture's worth a thousand words, so here's an example.
&lt;img src="http://assets.20bits.com/20080513/b-tree.png" alt="B-tree" title="b-tree" width="494" height="206" class="math size-full wp-image-135" /&gt;
&lt;/p&gt;

&lt;p&gt;
The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario.  And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.
&lt;/p&gt;

&lt;p&gt;
For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.
&lt;/p&gt;
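&lt;p&gt;
A real B-tree is more machinery than fits here, but the ordering property it exploits can be sketched with a sorted list plus binary search, with Python's &lt;tt&gt;bisect&lt;/tt&gt; standing in for the tree descent (the ages and row ids echo the example table above):
&lt;/p&gt;

```python
import bisect

# An ordered index on age: sorted (age, row_id) pairs.  A B-tree keeps the
# same ordering while also supporting logarithmic inserts and deletes.
age_index = [(6, 104), (10, 1), (15, 3), (17, 6), (27, 2), (55, 5), (64, 4)]

def younger_than(index, cutoff):
    # One binary search finds the split point; every entry before it qualifies.
    cut = bisect.bisect_left(index, (cutoff, 0))
    return [row_id for _, row_id in index[:cut]]

print(younger_than(age_index, 13))  # [104, 1]
```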

&lt;h4&gt;Other Indexes&lt;/h4&gt;
&lt;p&gt;
Hash indexes and B-tree indexes are the most common types of database indexes, but there are others, too.  MySQL supports &lt;a href="http://en.wikipedia.org/wiki/R-tree"&gt;R-tree&lt;/a&gt; indexes, which are used to query spatial data, e.g., "Show me all cities within ten miles of San Francisco, CA."
&lt;/p&gt;

&lt;p&gt;
There are also &lt;a href="http://en.wikipedia.org/wiki/Bitmap_index"&gt;bitmap indexes&lt;/a&gt;, which allow for almost instantaneous read operations but are expensive to change and take up a lot of space.  They are best for columns which have only a few possible values.
&lt;/p&gt;

&lt;h3&gt;Subtleties&lt;/h3&gt;
&lt;h4&gt;Performance&lt;/h4&gt;
&lt;p&gt;
Indexes don't come for free.  What you gain in retrieval speed you lose in insertion and deletion speed, because every time you alter a table the indexes must be updated accordingly.  If your table is updated frequently it's possible that having indexes will cause the overall performance of your database to suffer.
&lt;/p&gt;

&lt;p&gt;
There is also a space penalty, as the indexes take up space in memory or on disk.  A single index is smaller than the table because it doesn't contain all the data, only pointers to the data, but in general the larger the table the larger the index&lt;span class="footnote"&gt;Technically the size of an index is going to be proportional to the cardinality of the column being indexed.&lt;/span&gt;.
&lt;/p&gt;

&lt;h4&gt;Design&lt;/h4&gt;
&lt;p&gt;
Nodes in a B-tree contain a value and a number of pointers to child nodes.  For database indexes the "value" is really a pair of values: the indexed field and a pointer to a database row. That is, rather than storing the row data right in the index, you store a pointer to the row on disk.
&lt;/p&gt;

&lt;p&gt;
For example, if we have an index on an &lt;tt&gt;age&lt;/tt&gt; column, the value in the B-tree might be something like (34, 0x875900).  34 is the age and 0x875900 is a reference to the location of the data, rather than the data itself.
&lt;/p&gt;

&lt;p&gt;
This often allows indexes to be stored in memory even for tables that are so large they can only be stored on disk.
&lt;/p&gt;

&lt;p&gt;
Furthermore, B-tree indexes are typically designed so that each node takes up &lt;a href="http://en.wikipedia.org/wiki/Block_(data_storage)"&gt;one disk block&lt;/a&gt;.  This allows each node to be read in with a single disk operation.
&lt;/p&gt;

&lt;p&gt;
Also, for the pedants among us, many databases use &lt;a href="http://en.wikipedia.org/wiki/B%2B_tree"&gt;B+ trees&lt;/a&gt; rather than classic B-trees for generic database indexes.  InnoDB's &lt;tt&gt;BTREE&lt;/tt&gt; index type is closer to a B+ tree than a B-tree, for example.
&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;
Database indexes are auxiliary data structures that allow for quicker retrieval of data.  The most common type of index is a B-tree index because it has very good general performance characteristics and allows a wide range of comparisons, including both equality and inequalities.
&lt;/p&gt;

&lt;p&gt;
The penalty for having a database index is the cost required to update the index, which must happen any time the table is altered.  There is also a certain amount of space overhead, although indexes will be smaller than the table they index.
&lt;/p&gt;

&lt;p&gt;
For specific data types different indexes might be better suited than a B-tree.  R-trees, for example, allow for quicker retrieval of spatial data.  For fields with only a few possible values bitmap indexes might be appropriate.
&lt;/p&gt;

&lt;h3&gt;Good Question, Bad Question&lt;/h3&gt;
&lt;p&gt;
I like this question because it shows whether the interviewee is curious enough to dive into these details.  For certain higher-level engineering positions knowing this should be second-nature, but even for a generic web development position knowing how your database works will only help you improve the performance of your web application.
&lt;/p&gt;

&lt;p&gt;
Also, it's just arcane enough that you can go through the motions without knowing it, but not so arcane that it's inaccessible to someone without an advanced education.  Any decent programmer should be able to understand it &amp;mdash; the exceptional ones will go out of their way to learn it.
&lt;/p&gt;</description>
      <pubDate>Tue, 13 May 2008 12:56:53 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-database-indexes</link>
    </item>
    <item>
      <title>Powerset Launches. Verdict: Meh.</title>
      <description>&lt;p&gt;
&lt;a href="http://powerset.com"&gt;Powerset&lt;/a&gt;, the much-hyped natural-language search company, has finally &lt;a href="http://www.techcrunch.com/2008/05/11/powerset-launches-showcase-for-user-search-experience/"&gt;launched a public product&lt;/a&gt;: a showcase for its search technology that "enhances the Wikipedia experience."  It's live right now on its homepage, so go check it out.
&lt;/p&gt;

&lt;p&gt;
Are you back?  That sound you heard is the technology world shrugging in unison.  For all the hype Powerset has gotten over the last year and a half this showcase leaves a &lt;a href="http://en.wikipedia.org/wiki/Chicxulub_Crater"&gt;Chicxulub-sized&lt;/a&gt; gap between expectation and execution.
&lt;/p&gt;

&lt;p&gt;
Even ignoring all the press, it's not that impressive on the face of it.  Using their example queries as templates it took me about five seconds to find queries which not only returned appropriate results on Google but simultaneously returned nonsense on Powerset.
&lt;/p&gt;

&lt;p&gt;
On a personal note, I really wanted to like Powerset.  The people working there are all super-smart and I know they've put a lot of blood and sweat into this launch.  But I have to be honest.  If someone over there reads this, just know I'm writing it because I want to see the company launch a great product.
&lt;/p&gt;

&lt;h3&gt;A Failure of Execution&lt;/h3&gt;
&lt;p&gt;
Let's start by diving into a little Google vs. Powerset one-on-one.
&lt;/p&gt;


&lt;p&gt;
&lt;table class="pset-compare"&gt;
	&lt;tr&gt;
		&lt;th style="width: 40ex;"&gt;Query&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;Winner&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;Who is on Google's board?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+is+on+Google%27s+board%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=who+is+on+google's+board&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Powerset&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;Who is on both Google and Apple's board?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+is+on+both+Google+and+Apple%27s+board%3F&amp;submit.x=32&amp;submit.y=7"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=Who+is+on+both+Google+and+Apple%27s+board%3F&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;How did Hitler die?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=How+did+Hitler+die%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=how+did+hitler+die%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;How did Adolf Hitler die?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=How+did+Adolf+Hitler+die%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=how+did+adolf+hitler+die%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Tie&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;What is the longest suspension bridge in the United States?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=What+is+the+longest+suspension+bridge+in+the+United+States%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;pwst=1&amp;sa=X&amp;oi=spell&amp;resnum=0&amp;ct=result&amp;cd=1&amp;q=What+is+the+longest+suspension+bridge+in+the+United+States%3F&amp;spell=1"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;When did the United States enter Iraq?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=When+did+the+United+States+enter+Iraq%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=When+did+the+United+States+enter+Iraq%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Powerset&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;Who was the tenth President of the US?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+was+the+tenth+President+of+the+US%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=Who+was+the+tenth+President+of+the+US%3F&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
&lt;/table&gt;
&lt;/p&gt;
&lt;br /&gt;
&lt;p&gt;
You get the idea.  I tried to pick questions that Powerset is designed to answer, i.e., fact-based trivia easily found on Wikipedia.
&lt;/p&gt;
&lt;p&gt;
Some of the failures are pretty egregious, honestly.  Searching for "Who was the tenth President of the US?" fails to return a single relevant result on Powerset's first page, whereas Google's entire first page is relevant, even when Google is limited to searching just Wikipedia.
&lt;/p&gt;

&lt;p&gt;
In other cases it appears Powerset has a poor understanding of synonymy, returning irrelevant results for "How did Hitler die?" but returning the correct answer for "How did Adolf Hitler die?"  Of course, Google returns the correct answer in both cases.
&lt;/p&gt;

&lt;p&gt;
Guys, this is supposed to be the exact area where you excel.  What gives?  You're not even living up to your new, lowered expectations.
&lt;/p&gt;

&lt;h3&gt;A Failure of Marketing&lt;/h3&gt;
&lt;p&gt;
Marketing is a two-edged sword.  The net effect of good marketing is to solidify your brand in the minds of consumers.  This is good if you execute, but can also make it difficult to change course later.
&lt;/p&gt;

&lt;p&gt;
Powerset fell into this trap.  They started with ambitions of being a Google-killer, but reading their &lt;a href="http://www.powerset.com/about/"&gt;about page&lt;/a&gt; now it sounds more like they're aiming to be the Google-enhancer.  This is a respectable business, of course, but it is hard to swallow after a year and a half of being promised a revolutionary new search paradigm.
&lt;/p&gt;

&lt;p&gt;
They might be &lt;a href="http://venturebeat.com/2008/04/10/powerset-dont-call-us-a-search-engine/"&gt;repudiating the Google-killer label&lt;/a&gt; now, but here's an excerpt from a February 2007 &lt;a href="http://www.powerset.com/news/parc.pdf"&gt;press release&lt;/a&gt;: &lt;blockquote&gt;"The time is right to tell the world about &lt;strong&gt;the game-changing technology we've created&lt;/strong&gt;," said Ron Kaplan, Powerset Chief Technology and Science Officer, who previously created and managed the Natural Language Research Group at PARC. "I am glad to join Powerset's team of world-class linguists and search engineers to help this technology &lt;strong&gt;revolutionize the way people access information&lt;/strong&gt;."&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Too much marketing before a product launches can back you into a corner, and in Powerset's case it will now be difficult to avoid direct comparisons to Google in the press.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Powerset was started in 2005 and has been using Xerox PARC's natural language processing technology for over a year now.  In that time they've been pumping out press releases talking about how they will revolutionize not just search but the way humans and computers interact.
&lt;/p&gt;

&lt;p&gt;
What do they have to show for it?  Not much, judging by their latest product.  As a search tool it is more interesting than useful, shining in only a few, pre-selected cases.  The advantages over Google are so minimal and the defects so large that I would never consider using this as my main means of searching Wikipedia, let alone the Web at large.
&lt;/p&gt;

&lt;p&gt;
To me this product smells like a tech demo intended to convince someone outside Powerset that they really are producing something amazing, not a fully-featured product launch.  &lt;a href="http://www.news.com/8301-13953_3-9940887-80.html"&gt;There are rumors&lt;/a&gt; that Powerset is looking to sell or raise another round of financing, and it has recently hired David Wehner, a managing director at Allen &amp;amp; Co.
&lt;/p&gt;

&lt;p&gt;
This launch might be enough to convince investors to re-up or buyers to fork over the dough, but speaking as an end-user I'll take another look at what Powerset has to offer when it can tell me who &lt;a href="http://en.wikipedia.org/wiki/John_Tyler"&gt;John Tyler&lt;/a&gt; was.
&lt;/p&gt;</description>
      <pubDate>Mon, 12 May 2008 11:59:53 +0000</pubDate>
      <link>http://20bits.com/article/powerset-launches-verdict-meh</link>
    </item>
    <item>
      <title>How I Grow My Blog</title>
      <description>&lt;p&gt;
I was talking with &lt;a href="http://zellunit.com/"&gt;Matt Humphrey&lt;/a&gt; the other day and he asked me, "How did you grow your blog?"  My answer at the time wasn't very enlightening, so I thought I'd sit down and hammer out my strategy for growing 20bits.
&lt;/p&gt;

&lt;h3&gt;General Principles&lt;/h3&gt;
&lt;ol&gt;
	&lt;li&gt;
	&lt;h4&gt;Play to Your Strengths&lt;/h4&gt;
	&lt;p&gt;
	Not everyone is witty and not everyone is penetrating.  Personally I'm good at writing longer, article-sized posts, so that's what I write most of the time.  I'm also not great at written humor &amp;mdash; I usually just come off sounding like a smug asshole.
	&lt;/p&gt;
	&lt;p&gt;
	As an exercise I might try to write some shorter articles or include a parenthetical joke, but I understand my strengths and use them to my advantage.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Pick Your Audience&lt;/h4&gt;
	&lt;p&gt;
	Before you even start writing you have to decide on an audience.  When I first started blogging I was all over the map.  Since then I've tightened up this blog to focus on technology, technology news, and some aspects of Silicon Valley life.
	&lt;/p&gt;
	&lt;p&gt;
	This limits me in some respects, but helps me in others.  Since, you know, I'm &lt;em&gt;working in technology&lt;/em&gt; in the Bay Area, it actually does a lot to advance my career, even if I'm never going to be quoted in Time magazine.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Interesting, and Know When You Aren't&lt;/h4&gt;
	&lt;p&gt;
	Try to write interesting things.  This means things your audience would be interested in reading, by the way.  I can't count the number of times I wrote an article that I was really proud of only to realize that I was the only one who gave a crap.
	&lt;/p&gt;
	&lt;p&gt;
	Of course, nobody can be interesting all the time.  Take stock of your mistakes and learn to identify when a post will actually be interesting.  And please, don't fall into the trap of thinking you're better than your audience.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Succinct&lt;/h4&gt;
	&lt;p&gt;
	Be as short as you can be without compromising your central argument.  This applies to any kind of expository writing, in my opinion, and blogging is no different.
	&lt;/p&gt;
	&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Tactical Principles&lt;/h3&gt;
&lt;ol&gt;
	&lt;li&gt;
	&lt;h4&gt;Make Friends&lt;/h4&gt;
	&lt;p&gt;
	Reach out to other people in your field and make connections.  Ask people out for coffee to discuss their latest work.  Promote other people when they say something worthwhile.
	&lt;/p&gt;
	&lt;p&gt;
	Basically, you want people to be able to associate your website with your face.  Think of it as a branding exercise with the pleasant side-effect of getting to meet really awesome people.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Quick in Spotting Popularity&lt;/h4&gt;
	&lt;p&gt;
	If you spot something you know is going to take off and you have a response, write it up as quickly as you can and get it out there.
	&lt;/p&gt;
	&lt;p&gt;
	Nowadays you can use &lt;a href="http://www.techmeme.com/"&gt;TechMeme&lt;/a&gt; to measure this.  Find an upcoming article there that few people have responded to and be the first to respond.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Controversial, but not a Jerk&lt;/h4&gt;
	&lt;p&gt;
	Controversy generates interest.  Don't just use a linkbait headline with a milquetoast body.  That's just half-assing it.
	&lt;/p&gt;
	&lt;p&gt;
	That said, don't be a jerk.  If you call people out expect to get called out in return.  And be ready to change your tune when someone shows you the error of your ways.  In short, have &lt;a href="http://bobsutton.typepad.com/my_weblog/2006/07/strong_opinions.html"&gt;strong opinions, weakly held.&lt;/a&gt;
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Know When To Promote, and Then Go All Out&lt;/h4&gt;
	&lt;p&gt;
	When you have a post you know will get traction don't be afraid to promote it by calling in favors.  But don't be the boy who cried wolf, either, asking your friends and connections to promote every single story you write.  Save it for the good stuff.
	&lt;/p&gt;
	&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
I try to follow the above consciously with every article I write.  So far it has paid dividends.
&lt;/p&gt;</description>
      <pubDate>Fri, 09 May 2008 11:18:02 +0000</pubDate>
      <link>http://20bits.com/article/how-i-grow-my-blog</link>
    </item>
    <item>
      <title>The State of the Platform: Update</title>
      <description>&lt;p&gt;
My article about &lt;a href="/article/the-state-of-the-facebook-platform/"&gt;The State of the Facebook Platform&lt;/a&gt; has been spreading through the blogosphere like a game of telephone.  &lt;a href="http://andrewchen.typepad.com/andrew_chens_blog/2008/05/has-the-faceboo.html#comments"&gt;Lots&lt;/a&gt; &lt;a href="http://www.sarahlacy.com/sarahlacy/2008/05/facebook-platfo.html"&gt;of&lt;/a&gt; &lt;a href="http://blog.playfish.com/2008/05/07/facebooks-stricter-app-regulations-are-a-good-thing/"&gt;people&lt;/a&gt; &lt;a href="http://twitter.com/Scobleizer/statuses/805095480"&gt;have&lt;/a&gt; &lt;a href="http://venturebeat.com/2008/05/06/facebooks-platform-issues-fewer-developers-slower-app-growth"&gt;chimed&lt;/a&gt; in with their own opinions.
&lt;/p&gt;

&lt;p&gt;
I wanted to write a follow-up post to clarify my opinion and address some of the responses.
&lt;/p&gt;

&lt;h3&gt;What I'm Claiming&lt;/h3&gt;
&lt;p&gt;
My claims are simple and uncontroversial.  I observed two things: one, the activity level in the Facebook forums is a fraction of what it was four months ago; two, Facebook apps launched today are much less likely to succeed.
&lt;/p&gt;

&lt;p&gt;
The trends for these two observations are highly correlated and exhibit the same peak around February 2nd, 2008.  What happened around that time?  One, Facebook began instituting increasingly demanding and arbitrary developer policies.  Two, other networks began launching fully-featured competitors to Facebook's platform.
&lt;/p&gt;

&lt;p&gt;
From the high correlation, the timing of events, and comments from people working in the industry, I concluded that developers are less interested in Facebook today because there's less return on their investment of labor.
&lt;/p&gt;

&lt;h3&gt;What I'm NOT Claiming&lt;/h3&gt;
&lt;p&gt;
I'm not claiming that the Facebook Platform is unhealthy.  Nor am I claiming that it was a bad idea for Facebook to implement the policy changes they did.
&lt;/p&gt;

&lt;p&gt;
I'm certainly not claiming that any of the data implies either of the above.  Indeed, it's still possible to find &lt;a href="http://adonomics.com/about/10726707410"&gt;success&lt;/a&gt; on the Facebook Platform.  It just requires more effort than it used to.
&lt;/p&gt;

&lt;p&gt;
Also, most emphatically, &lt;em&gt;I'm not talking about Facebook users&lt;/em&gt;.  The article was only about developers and their decision to create software for Facebook, not about Facebook as a whole, which is still seeing phenomenal success.
&lt;/p&gt;

&lt;h3&gt;Other Hypotheses&lt;/h3&gt;
&lt;p&gt;
The most common alternate hypothesis for these trends was summarized by &lt;a href="http://blog.jeffreymcmanus.com/"&gt;Jeffrey McManus&lt;/a&gt;: &lt;blockquote&gt;This is not a terrific metric for developer activity &amp;mdash; it doesn't measure what you purport to measure. Developers generally view and post to forums when they have problems; if fewer developers are posting to the forums, it may mean that there are more developers who are having less trouble.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
I call this the "documentation hypothesis" and addressed it briefly in my original article.  I think it's an unappealing explanation for a few reasons.
&lt;/p&gt;

&lt;p&gt;
First, if it were true, we'd expect to see spikes in forum activity whenever a new issue arose on the Platform, especially since Facebook's changes tend to be radical and out of nowhere.  The decline in activity is virtually monotonic, however, and the data shows no such spikes.
&lt;/p&gt;

&lt;p&gt;
Second, even if it were true, it doesn't explain the correlation between forum activity and application success, nor does it explain the sudden decline beginning around February 2nd.  As an explanation it just isn't sufficient.
&lt;/p&gt;

&lt;h3&gt;Is the Trend Good or Bad?&lt;/h3&gt;
&lt;p&gt;
I understand the tone of the article was bearish, but I was writing it from the perspective of a developer deciding whether or not to commit to the Facebook Platform.  There are lots of perspectives, though.
&lt;/p&gt;

&lt;dl&gt;
	&lt;dt&gt;Facebook's Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	I believe Facebook is making these changes intentionally.  They have a love-hate relationship with companies like Slide.  Strategically speaking, these companies got in at the very beginning and quickly cordoned off sections of the social graph for themselves, largely out of Facebook's reach.  Messages on FunWall don't go through Facebook, for example.
	&lt;/p&gt;
	&lt;p&gt;
	This is clearly not in Facebook's strategic interest, but they can't just boot these companies out because a significant number of Facebook users would throw a fit.  From Facebook's perspective these trends in developer engagement are good because it allows them to reassert control and improve their image as the "high-quality social network."
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Facebook Users' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Let's face it, most Facebook users don't like to be pestered by applications.  For them these changes are good.  And judging by Facebook's &lt;a href="http://www.alexa.com/data/details/traffic_details/facebook.com?site0=myspace.com&amp;site1=facebook.com&amp;y=r&amp;z=3&amp;h=300&amp;w=610&amp;c=1&amp;u%5B%5D=myspace.com&amp;u%5B%5D=facebook.com&amp;x=2008-05-07T20%3A35%3A12.000Z&amp;check=www.alexa.com&amp;signature=n49bq4%2B6Z5asVqN59LzvZZubXw8%3D&amp;range=max&amp;size=Medium"&gt;traffic stats&lt;/a&gt; it isn't hurting them one bit.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Advertisers' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	For advertisers these developments are universally good.  If the bar for application development is higher it means the applications that succeed will be of a higher quality.  Nobody wants to advertise on "What color barf are you?" and Facebook doesn't want that application to be front-and-center, either.  It just looks bad.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Developers' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	For developers this is a mixed bag.  Facebook's cavalier attitude about platform policy means that you're playing on shifting ground.  On top of that the changes they've already made mean it's harder for applications to succeed, on average.
	&lt;/p&gt;
	&lt;p&gt;
	Still, for companies like &lt;a href="http://www.socialgn.com/"&gt;SGN&lt;/a&gt; and &lt;a href="http://www.playfish.com/"&gt;PlayFish&lt;/a&gt;, who want to make quality applications, this means that they don't have to worry about competing with win-at-all-cost, spammy applications.
	&lt;/p&gt;
	&lt;p&gt;
	I just wouldn't recommend developing &lt;em&gt;only&lt;/em&gt; on Facebook, as they've shown they're willing to change and bend the rules at a whim and for their own benefit.  You know that as soon as Facebook decides they don't like what you're doing they'll do everything in their power to hinder you.  Hedge your bets.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;

&lt;h3&gt;Hype Cycle&lt;/h3&gt;
&lt;p&gt;
Don't forget about the &lt;a href="http://en.wikipedia.org/wiki/Hype_cycle"&gt;hype cycle&lt;/a&gt;, either.  All technologies go through a phase of inflated expectations followed by a trough of disillusionment.
&lt;/p&gt;


&lt;p&gt;
I'd say we're right in the middle of the trough of disillusionment.  Companies like Zynga, SocialMedia, Slide, RockYou, and SGN are going to slog through the slope of enlightenment.
&lt;/p&gt;

&lt;p&gt;
Will we have a social operating system or a revolutionary social commerce system waiting at the end?  Probably not.  Will we have innovative casual gaming platforms?  I'd take that bet.
&lt;/p&gt;

&lt;p&gt;
I'm interested in hearing other perspectives, too, particularly investors' perspectives.  Does anyone have any insight on that?
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="http://runningwithfoxes.com/2008/05/07/facebook-platform-thinning-of-the-herd/"&gt;Nick Gonzalez&lt;/a&gt;, formerly of TechCrunch and now of SocialMedia, makes a similar point about the hype cycle.   I also like the Darwinian nature of his post's title: "The Thinning of the Herd."  Heh.
&lt;/p&gt;</description>
      <pubDate>Wed, 07 May 2008 14:12:44 +0000</pubDate>
      <link>http://20bits.com/article/the-state-of-the-platform-update</link>
    </item>
    <item>
      <title>The State of the Facebook Platform</title>
      <description>&lt;p&gt;
Something is wrong in the Facebook developer community.  Starting in March I began noticing that the level of activity in the &lt;a href="http://forum.developers.facebook.com/" onclick="javascript:urchinTracker('/outbound/forum.developers.facebook.com/');"&gt;Facebook developers forum&lt;/a&gt; was dropping sharply.
&lt;/p&gt;
&lt;p&gt;
But it's numbers that matter, not vague impressions, so does the data back me up?  Is the Facebook developer community retreating from the public space of the forums?  The answer is yes, on both counts.
&lt;/p&gt;
&lt;p&gt;
Looking at five key metrics we'll see that the activity level of the Facebook forums is a fraction of what it was at the beginning of 2008.  The number of active users&lt;span class="footnote"&gt;An "active user" is defined as someone who has made at least one post in the period being considered; here, a month.&lt;/span&gt; has declined &lt;strong&gt;27%&lt;/strong&gt; since January, for example.  And this is the best-performing metric discussed.
&lt;/p&gt;
&lt;p&gt;
Furthermore, this decline in forum activity correlates to an overall decline in activity on the Facebook platform.  Applications launched in early January were on average &lt;strong&gt;1.5 times&lt;/strong&gt; more successful than applications launched at the end of March.
&lt;/p&gt;
&lt;p&gt;
The turning point occurred in early February, when several interlocking factors came into play.  First, Facebook finally saw real competition in the form of other social networking platforms, particularly Hi5.
&lt;/p&gt;
&lt;p&gt;
Second, Facebook started instituting increasingly demanding and arbitrary rules on platform developers, which they then enforced selectively and for their own benefit.
&lt;/p&gt;
&lt;p&gt;
Third, a trend of application consolidation began and accelerated through March, locking up developer resources inside private companies.
&lt;/p&gt;
&lt;h3&gt;Contents&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#activity"&gt;What is Activity?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#data"&gt;The Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#trend"&gt;The Trend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#analysis"&gt;Analysis and Hypotheses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#app-data"&gt;Facebook Application Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a name="activity"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;What is Activity?&lt;/h3&gt;
&lt;p&gt;
You can define activity in a lot of ways and I tried to be as liberal as possible.  The Facebook forums have three main objects worth measuring: users, threads, and posts.
&lt;/p&gt;
&lt;p&gt;
The forum is "active" when users are signing up and creating new threads and posts.  To measure this I created a script in Ruby that can scrape any PunBB-based forum, like the Facebook forum.  This basically results in a local copy of the forum database.
&lt;/p&gt;
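&lt;p&gt;
For the curious, the scraping step is straightforward.  Here's a minimal sketch, not my actual script: it assumes PunBB's standard &lt;code&gt;viewforum.php&lt;/code&gt;/&lt;code&gt;viewtopic.php?id=N&lt;/code&gt; URL layout and just parses thread links out of a page, where the real script crawled every page and saved users, threads, and posts to a local database.
&lt;/p&gt;

```ruby
require 'net/http'
require 'uri'

# PunBB renders thread listings as anchor tags pointing at
# viewtopic.php?id=N, so thread IDs and titles can be pulled
# out of the raw HTML with a regex.
def parse_thread_list(html)
  html.scan(%r{<a href="viewtopic\.php\?id=(\d+)">([^<]+)</a>}).map do |id, title|
    { id: id.to_i, title: title }
  end
end

# Fetching one page of a forum would look like this (network access required):
def fetch_forum_page(base_url, forum_id, page)
  uri = URI("#{base_url}/viewforum.php?id=#{forum_id}&p=#{page}")
  Net::HTTP.get(uri)
end

sample = '<a href="viewtopic.php?id=42">FBML rendering question</a>' \
         '<a href="viewtopic.php?id=43">Notification limits?</a>'
threads = parse_thread_list(sample)
puts threads.length      # 2
puts threads.first[:id]  # 42
```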
&lt;p&gt;
I then broke the data down in two ways.  In the first I compare the activity in January 2008 to the activity in April 2008.  In the second I break down all activity from the launch of the forum in October 2007 through April 2008 on a weekly basis.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Disclaimer:&lt;/strong&gt; I'm the creator of &lt;a href="http://adonomics.com" onclick="javascript:urchinTracker('/outbound/adonomics.com');"&gt;Adonomics&lt;/a&gt;, a key player in the Facebook platform ecosystem, although I no longer work for the company that now owns it.
&lt;/p&gt;
&lt;p&gt;&lt;a name="data"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;&lt;center&gt;&lt;/p&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Monthly Statistics for the Facebook Developer Forum&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Month:&lt;/th&gt;
&lt;th style="width: 8ex;" class="date"&gt;Jan 2008&lt;/th&gt;
&lt;th style="width: 8ex;" class="date"&gt;Apr 2008&lt;/th&gt;
&lt;th style="width: 2ex;"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Posts per day&lt;/td&gt;
&lt;td&gt;461&lt;/td&gt;
&lt;td&gt;225&lt;/td&gt;
&lt;td class="negative"&gt;-51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Signups per day&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td class="negative"&gt;-29%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Threads per day&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td class="negative"&gt;-44%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Active users&lt;/td&gt;
&lt;td&gt;1,606&lt;/td&gt;
&lt;td&gt;1,168&lt;/td&gt;
&lt;td class="negative"&gt;-27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Highly active users&lt;/td&gt;
&lt;td&gt;461&lt;/td&gt;
&lt;td&gt;225&lt;/td&gt;
&lt;td class="negative"&gt;-47%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;
An "active user" is defined as someone who makes at least one post in that month.  A "highly active user" is defined as someone who makes at least five posts in that month.  The per day metrics are averaged over the number of days in each month.
&lt;/p&gt;
&lt;p&gt;
The big picture doesn't look so good.  But is the lower level of activity in April just a fluke or has this been a consistent trend?  Let's extend the timeline from October 2007 to April 2008 and take a look at these metrics on a weekly basis.
&lt;/p&gt;
&lt;p&gt;&lt;a name="trend"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The Trend&lt;/h3&gt;
&lt;p&gt;
I'm going to post weekly graphs for only three metrics: posts per week, signups per week, and active users per week.  The other two metrics show the same trends.
&lt;/p&gt;
&lt;p&gt;
Each graph contains the weekly data plus a red line representing a four-week moving average, which smooths out one-off dips and spikes in activity and reveals the underlying trend.
&lt;/p&gt;
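&lt;p&gt;
For anyone who wants to reproduce the smoothing: each point on the red line is simply the mean of that week and the three weeks before it.  A quick sketch, with illustrative numbers rather than the real forum data:
&lt;/p&gt;

```ruby
# Four-week moving average: slide a window of four consecutive
# weeks across the series and average each window.
def moving_average(series, window = 4)
  series.each_cons(window).map { |w| w.sum / w.size.to_f }
end

weekly_posts = [3200, 2900, 3100, 2800, 2600, 2500]
puts moving_average(weekly_posts).inspect
# [3000.0, 2850.0, 2750.0] -- e.g. (3200 + 2900 + 3100 + 2800) / 4 = 3000
```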
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/posts.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/posts.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/posts-thumb.png" alt="Posts Per Week" title="Posts per Week" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/signups.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/signups.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/signups-thumb.png" alt="Posts Per Week" title="Posts per Week" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can look at the graph for &lt;a target="_blank" href="http://20bits.com/wp-content/uploads/2008/05/active-users.png" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/active-users.png');"&gt;weekly active users&lt;/a&gt;, too, but it exhibits the same downward trend.&lt;/p&gt;
&lt;p&gt;&lt;a name="analysis"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Analysis and Hypotheses&lt;/h3&gt;
&lt;p&gt;
These graphs show the two main trends for the Facebook developers forum: engagement peaked in late-January while new signups have continually dropped since the launch of the forum.
&lt;/p&gt;
&lt;p&gt;
In fact, since the start of 2008 we've seen a &lt;strong&gt;3.4%&lt;/strong&gt; week-over-week decline in new posts.  For new signups and active users these numbers are &lt;strong&gt;1.7%&lt;/strong&gt; and &lt;strong&gt;0.8%&lt;/strong&gt;, respectively.
&lt;/p&gt;
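&lt;p&gt;
One reasonable way to compute such a week-over-week figure (a sketch, not necessarily the exact calculation behind the numbers above) is the geometric mean of the week-to-week ratios:
&lt;/p&gt;

```ruby
# Average weekly rate of change: take the ratio of each week to the
# previous week, then the geometric mean of those ratios, minus one.
def weekly_change(series)
  ratios = series.each_cons(2).map { |prev, cur| cur / prev.to_f }
  ratios.reduce(:*) ** (1.0 / ratios.size) - 1.0
end

# A made-up series declining exactly 10% per week:
series = [1000.0, 900.0, 810.0, 729.0]
puts((weekly_change(series) * 100).round(1))  # -10.0
```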
&lt;p&gt;
The other metrics, new threads and highly active users, show the same downward trend.
&lt;/p&gt;
&lt;p&gt;
Why has there been a steady, three-month decline in activity on the Facebook developer forum?  What explains the peak in late-January?
&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Other platforms are more attractive&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Since January several other social networks have launched their own platforms.  Those based on OpenSocial are wholly incompatible with Facebook, and so far Bebo is still the only company to launch a platform using Facebook's architecture as its base.
	&lt;/p&gt;
&lt;p&gt;
	This means developing for other networks becomes an either-or proposition.  Coding on OpenSocial precludes me from spending that time coding for Facebook.  Hi5 and MySpace, in particular, are interesting because they potentially offer the huge growth opportunities that developers saw in the first six months of the Facebook Platform.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Developers are consolidating&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Networks like Zynga and Social Gaming Network (SGN) have cropped up in the last few months and have made it their business to consolidate the game space on Facebook, probably the only real vertical that has found success on the platform.  Bigger companies like Slide and RockYou have been actively recruiting from the Facebook developer pool all along, too.
	&lt;/p&gt;
&lt;p&gt;
	Perhaps the open community that existed four months ago is closing up.  Developers working for the same company talk to each other rather than on the forums, and developers working for different companies don't want to talk in public for competitive reasons.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Facebook has made it too hard to win&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Starting in the middle of January Facebook began instituting ad hoc solutions to curb the spread of abusive and spammy apps.  The side-effect of these measures is to make it harder for applications to spread.
	&lt;/p&gt;
&lt;p&gt;
	These measures include banning the word "message" from news feed items, disallowing passive news feed stories, instituting feedback-based request and notification limits, and banning "forced invites."
	&lt;/p&gt;
&lt;p&gt;
	The ad hoc and arbitrary nature of these rules makes it hard to keep up because Facebook generally gives less than a week's notice. This only serves to increase the relative cost of developing on the Facebook platform.
	&lt;/p&gt;
&lt;p&gt;
	This is not to forget mini-scandals like the Facebook/CBS partnership, where &lt;a href="http://www.google.com/search?client=safari&amp;#038;rls=en-us&amp;#038;q=cbs+facebook&amp;#038;ie=UTF-8&amp;#038;oe=UTF-8" onclick="javascript:urchinTracker('/outbound/www.google.com/search?client=safari_038_rls=en-us_038_q=cbs+facebook_038_ie=UTF-8_038_oe=UTF-8');"&gt;Facebook removed invite restrictions&lt;/a&gt; on CBS' sponsored March Madness application, even though there were other, independent applications in the same category.  It's hard to say how this affected developer morale, but it showed that Facebook was willing to hurt independent developers when it benefitted them.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;
One other possibility is that the external developer resources are now rich enough that people have to spend less time asking questions.  This might be a contributing factor but is probably not a critical one given that the decline beginning in late-January is so well-defined.
&lt;/p&gt;
&lt;p&gt;&lt;a name="app-data"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Facebook Application Data&lt;/h3&gt;
&lt;p&gt;
If it's true that winning on Facebook is not as easy as it used to be then that should be reflected in the application statistics.  How can we measure this?
&lt;/p&gt;
&lt;p&gt;
Success, for most developers, is defined as attaining a certain level of activity within a specific timeframe.  There are two ways to measure this: one, take time-based cohorts of applications (e.g., all applications started in the same week) and measure their average level of activity some number of weeks later; two, measure the number of active applications as a percentage of the total application space over time.
&lt;/p&gt;
&lt;p&gt;
I will do both.  For the first I'm going to take weekly cohorts and measure their average level of activity three weeks later.  That is, I am going to group all applications by the week they were launched&lt;span class="footnote"&gt;Really, I am going to group them by the week they began to be tracked by Adonomics.&lt;/span&gt; and see how they were doing three weeks later&lt;span class="footnote"&gt;Three weeks is arbitrary, but you can look for yourself &amp;mdash; we see the same trend whether it's one week, two weeks, three weeks, or four weeks.&lt;/span&gt;.
&lt;/p&gt;
&lt;p&gt;
For the second I am going to measure the number of applications with at least 100 daily active users (DAU)&lt;span class="footnote"&gt;Again, 100 is arbitrary, but we see the same trend whether it's 10, 100, or 1,000. &lt;/span&gt; as a percentage of the total number of applications on Facebook.
&lt;/p&gt;
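&lt;p&gt;
In code, the two measurements might look something like this.  This is a hypothetical sketch in Python; the field names and data layout are assumptions for illustration, not Adonomics' actual format.
&lt;/p&gt;

```python
from collections import defaultdict

def cohort_activity(apps, weeks_later=3):
    # apps: list of dicts with "launch_week" (an int) and "dau_by_week"
    # (a dict mapping week number to daily active users).
    # Returns the average DAU of each weekly cohort, measured weeks_later
    # weeks after launch.
    cohorts = defaultdict(list)
    for app in apps:
        week = app["launch_week"] + weeks_later
        cohorts[app["launch_week"]].append(app["dau_by_week"].get(week, 0))
    return {w: sum(dau) / len(dau) for w, dau in cohorts.items()}

def active_share(apps, week, threshold=100):
    # Fraction of all applications with at least `threshold` DAU in a week.
    # min(dau, threshold) == threshold holds exactly when dau is at least
    # threshold.
    active = sum(1 for a in apps
                 if min(a["dau_by_week"].get(week, 0), threshold) == threshold)
    return active / len(apps)
```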
&lt;p&gt;Here are the graphs.&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/app-success.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/app-success.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/app-success-thumb.png" alt="Application Success by Cohort" title="Application Success by Cohort" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/active-pct.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/active-pct.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/active-pct-thumb.png" alt="Active Applications as a Percentage" title="Active Applications as a Percentage" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
These exhibit the same trends as the data from the Facebook developer forums.
&lt;/p&gt;
&lt;p&gt;&lt;a name="conclusions"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Correlation is not causation, of course, so we can't say whether the decline in developer activity is causing the decline in application activity, whether developers are leaving because applications are no longer as successful as they used to be, or whether an unknown factor is causing both of these phenomena.
&lt;/p&gt;
&lt;p&gt;
What we can say is that the vitality of both the Facebook developer community and the Facebook platform is not what it was even four months ago, and that these two phenomena are closely related.
&lt;/p&gt;
&lt;p&gt;
Moreover, talking to developers and investors inside the industry it's clear that the excitement over the Facebook Platform and its promise have waned.  Application companies are branching out to other social networks not because they necessarily show more promise than Facebook, but because the future of the Facebook Platform has become murky.
&lt;/p&gt;
&lt;p&gt;
Nobody knows how committed Facebook is to improving the platform or the role applications are meant to play in the overall Facebook ecosystem.  Signals like the reduced level of direct participation in the developer community, increasingly restrictive developer policies, and the &lt;a href="http://www.facebook.com/FacebookPreviews" onclick="javascript:urchinTracker('/outbound/www.facebook.com/FacebookPreviews');"&gt;Facebook profile redesign&lt;/a&gt; seem to indicate that they are trying to regain control over some, if not all, aspects of application development while maintaining an aloof demeanor towards developers.
&lt;/p&gt;
&lt;p&gt;
It boils down to this: investing most of your man-hours into Facebook at this point in time is a mistake.  The potential return on that investment, a year after launch, is a fraction of what it once was.  And the fact that Facebook continues to change the rules and selectively break them for their own benefit means the risk is comparatively higher.
&lt;/p&gt;
&lt;p&gt;
It is better to branch out into other social networks or to piggy-back on Facebook as a means to establish your own, more independent social network.  This is what the top companies like Slide, RockYou, Zynga, and SGN are doing, and what many of the independent Facebook developers I've talked with want to do.
&lt;/p&gt;
&lt;p&gt;
The luster of the Facebook Platform might be gone, but that doesn't mean there aren't opportunities in the space.  I just wouldn't go looking for them at the other end of Facebook's newsfeed.
&lt;/p&gt;
&lt;h3&gt;Misc.&lt;/h3&gt;
&lt;p&gt;
All application data courtesy of &lt;a href="http://adonomics.com" onclick="javascript:urchinTracker('/outbound/adonomics.com');"&gt;Adonomics&lt;/a&gt;.  A spreadsheet containing the forum data is also available under a Creative Commons Attribution License, below.
&lt;/p&gt;
&lt;div class="download"&gt;Download the &lt;a href="http://assets.20bits.com/downloads/facebook-data.xls" onclick="javascript:urchinTracker('/filehttp://assets.20bits.com/downloads/facebook-data.xls');"&gt;Facebook data&lt;/a&gt; used to generate the graphs for this article.&lt;/div&gt;
&lt;div class="notice warning"&gt;I've posted &lt;a href="/article/the-state-of-the-platform-update"&gt;an update&lt;/a&gt; clarifying some of my opinions.&lt;/div&gt;</description>
      <pubDate>Tue, 06 May 2008 08:00:41 +0000</pubDate>
      <link>http://20bits.com/article/the-state-of-the-facebook-platform</link>
    </item>
    <item>
      <title>Network Programming in Erlang</title>
      <description>&lt;p&gt;
Since I'm learning Erlang I thought my first non-trivial piece of code would be in an area where the language excels: network programming.
&lt;/p&gt;

&lt;p&gt;
Network programming (or socket programming) is a pain in the ass in most languages.  I first learned how to do it in C using &lt;a href="http://beej.us/guide/bgnet/"&gt;Beej's Guide to Network Programming&lt;/a&gt;.  Read it if you dare.
&lt;/p&gt;

&lt;p&gt;
The big roadblock for most server applications is concurrency.  Languages like C, where concurrency was an afterthought, make developing robust server software more difficult than it has to be.
&lt;/p&gt;

&lt;p&gt;
Even so-called modern languages like Java, Ruby, or Python don't handle it all &lt;em&gt;that&lt;/em&gt; well, although you are relieved from the pain of managing all the minute details of the network connections.  Erlang, on the other hand, was built with this purpose in mind.
&lt;/p&gt;

&lt;p&gt;
I won't be writing any user-facing applications in Erlang any time soon, but I thought, "If I'm going to learn Erlang I may as well learn its strong points first."
&lt;/p&gt;

&lt;p&gt;
To that end I decided to try to replicate the suite of classic UNIX daemons like &lt;tt&gt;echo&lt;/tt&gt; and &lt;tt&gt;chargen&lt;/tt&gt;.
&lt;/p&gt;

&lt;h3&gt;echo&lt;/h3&gt;
&lt;p&gt;
Echo is a service that spits back whatever data is handed to it over a TCP connection, bit-for-bit.  Here it is in Erlang. &lt;pre class="brush: erlang"&gt;-module(echo).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([listen/1]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call echo:listen(Port) to start the service.
listen(Port) -&gt;
	{ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
	accept(LSocket).

% Wait for incoming connections and spawn the echo loop when we get one.
accept(LSocket) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	spawn(fun() -&gt; loop(Socket) end),
	accept(LSocket).

% Echo back whatever data we receive on Socket.
loop(Socket) -&gt;
	case gen_tcp:recv(Socket, 0) of
		{ok, Data} -&gt;
			gen_tcp:send(Socket, Data),
			loop(Socket);
		{error, closed} -&gt;
			ok
	end.&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
We can start this service by calling &lt;tt&gt;echo:listen(&amp;lt;port number&amp;gt;).&lt;/tt&gt; from the Erlang shell, e.g., &lt;tt&gt;echo:listen(8888).&lt;/tt&gt; will start the &lt;tt&gt;echo&lt;/tt&gt; service on port 8888 of your machine.  You can then telnet to port 8888 &amp;mdash; &lt;tt&gt;telnet 127.0.0.1 8888&lt;/tt&gt; &amp;mdash; and see it in action.
&lt;/p&gt;

&lt;p&gt;
Here's the breakdown of the program, by function.
&lt;dl&gt;
	&lt;dt&gt;listen(Port)&lt;/dt&gt;
	&lt;dd&gt;
	Creates a socket that listens for incoming connections on port &lt;strong&gt;Port&lt;/strong&gt; and passes off control to &lt;strong&gt;accept&lt;/strong&gt;.
	&lt;/dd&gt;
	&lt;dt&gt;accept(LSocket)&lt;/dt&gt;
	&lt;dd&gt;
	Waits for incoming connections on &lt;strong&gt;LSocket&lt;/strong&gt;.  Once it receives a connection it spawns a new process that runs the &lt;strong&gt;loop&lt;/strong&gt; function and then waits for the next connection.
	&lt;/dd&gt;
	&lt;dt&gt;loop(Socket)&lt;/dt&gt;
	&lt;dd&gt;
	Waits for incoming data on &lt;strong&gt;Socket&lt;/strong&gt;.  Once it receives the data it immediately sends the same data back across the socket.  If there is an error it exits.
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
There are a few things worth discussing in this example.
&lt;/p&gt;

&lt;h3&gt;Spawning Processes&lt;/h3&gt;
&lt;p&gt;
Processes in Erlang are a basic data type.  They follow the &lt;a href="http://en.wikipedia.org/wiki/Actor_model"&gt;actor model&lt;/a&gt; of concurrent computation and make writing network servers a breeze.
&lt;/p&gt;

&lt;p&gt;
We create new processes using &lt;strong&gt;spawn&lt;/strong&gt;, which takes a &lt;strong&gt;Fun&lt;/strong&gt;, or functional object, as its input.  You can think of a Fun as an anonymous function.  Control of the new process is handed off to the Fun passed in, like a callback.
&lt;/p&gt;

&lt;h3&gt;Functional Objects&lt;/h3&gt;
&lt;p&gt;
Erlang, being a functional programming language, supports functions as first-class objects via the &lt;strong&gt;Fun&lt;/strong&gt;, or functional object, data type.  Functions can create new functions, return functions, and pass functions around.
&lt;/p&gt;

&lt;p&gt;
The syntax to create a new functional object is like this: &lt;pre class="brush: erlang"&gt;MyFunction = fun(...) -&gt;
	% Your Erlang code here
	end.&lt;/pre&gt;
&lt;/p&gt;
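
&lt;p&gt;
For contrast with the thread-based concurrency of the languages mentioned earlier, here is a rough Python equivalent of the echo service.  This is a sketch, not production code; the port and buffer size are arbitrary choices.
&lt;/p&gt;

```python
import socket
import threading

def handle(conn):
    # Echo loop for one client: read until EOF, send everything back.
    while True:
        data = conn.recv(4096)
        if not data:
            break
        conn.sendall(data)
    conn.close()

def listen(port):
    # Accept connections forever, spawning one OS thread per client.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen(5)
    while True:
        conn, _addr = server.accept()
        threading.Thread(target=handle, args=(conn,)).start()
```

&lt;p&gt;
Each connection costs an OS thread here, which is far heavier than an Erlang process; that difference is the point of the comparison.
&lt;/p&gt;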

&lt;h3&gt;CHARGEN&lt;/h3&gt;
&lt;p&gt;
&lt;tt&gt;chargen&lt;/tt&gt; is a service that spews back a stream of characters when you connect to it.  You can read &lt;a href="http://en.wikipedia.org/wiki/CHARGEN"&gt;all about it&lt;/a&gt;, but it's not that interesting.  There's a canonical pattern that it prints out.
&lt;/p&gt;

&lt;p&gt;
Here it is in Erlang. &lt;pre class="brush: erlang"&gt;-module(chargen).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([listen/1]).

-define(START_CHAR, 33).
-define(END_CHAR, 127).
-define(LINE_LENGTH, 72).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call chargen:listen(Port) to start the service.
listen(Port) -&gt;
	{ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
	accept(LSocket).

% Wait for incoming connections and spawn the chargen loop when we get one.
accept(LSocket) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	spawn(fun() -&gt; loop(Socket) end),
	accept(LSocket).

loop(Socket) -&gt;
	loop(Socket, ?START_CHAR).

loop(Socket, ?END_CHAR) -&gt;
	loop(Socket, ?START_CHAR);
loop(Socket, StartChar) -&gt;
	Line = make_line(StartChar),
	case gen_tcp:send(Socket, Line) of
		{error, _Reason} -&gt;
			exit(normal);
		ok -&gt;
			loop(Socket, StartChar+1)
	end.


make_line(StartChar) -&gt;
	make_line(StartChar, 0).

% Generate a new chargen line -- [13, 10] is CRLF.
make_line(_, ?LINE_LENGTH) -&gt;
	[13, 10];
make_line(?END_CHAR, Pos) -&gt;
	make_line(?START_CHAR, Pos);
make_line(StartChar, Pos) -&gt;
	[StartChar | make_line(StartChar + 1, Pos + 1)].&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
As with &lt;tt&gt;echo&lt;/tt&gt; we can start this by dropping into the Erlang shell and running &lt;tt&gt;chargen:listen(8888)&lt;/tt&gt; to start chargen running on port 8888 (or another port of your choice).
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;accept&lt;/strong&gt; and &lt;strong&gt;listen&lt;/strong&gt; are identical to the functions in &lt;tt&gt;echo&lt;/tt&gt;, but here are the differences:
&lt;dl&gt;
	&lt;dt&gt;loop(Socket, StartChar)&lt;/dt&gt;
	&lt;dd&gt;
	Calls &lt;strong&gt;make_line(StartChar)&lt;/strong&gt; to get the CHARGEN line starting with StartChar, writes it to the socket, and then advances to the next line. 
	&lt;/dd&gt;
	&lt;dt&gt;make_line(StartChar, Pos)&lt;/dt&gt;
	&lt;dd&gt;
	Recursively generates a CHARGEN line, keeping track of the current position in the line with &lt;strong&gt;Pos&lt;/strong&gt;.
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
There are a few key conceptual differences, too.
&lt;/p&gt;

&lt;h3&gt;Definitions&lt;/h3&gt;
&lt;p&gt;
As in C we can define constants in Erlang with the &lt;tt&gt;-define&lt;/tt&gt; directive.  These are resolved at compile-time.  You can reference the definition by prefixing it with a question mark, &lt;tt&gt;?&lt;/tt&gt;, so as to differentiate it from a variable.
&lt;/p&gt;

&lt;h3&gt;Function Definition Matching&lt;/h3&gt;
&lt;p&gt;
As with assignment, function calls are done via matching.  When you call a function it looks for the first matching definition.  For example, if we invoke &lt;tt&gt;loop(Socket)&lt;/tt&gt; it finds the appropriate definition, viz., the definition that takes a single argument.
&lt;/p&gt;

&lt;p&gt;
We can fix arguments, too, which is how you deal with loop control in Erlang.  &lt;tt&gt;?END_CHAR&lt;/tt&gt; is 127, so if we call &lt;tt&gt;loop(Socket, 127)&lt;/tt&gt; it first matches that definition rather than the more general &lt;tt&gt;loop(Socket, StartChar)&lt;/tt&gt; definition.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;make_line&lt;/strong&gt; works the same way.  If we're at the last position in the line we return a carriage return and line feed and stop recursing.
&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
I created these to be legible and easily understood.  Working through them helped me understand a lot about the inner workings of Erlang and hopefully they'll do the same for you.  A full-on project page will be coming shortly, but for now you can download the package here.
&lt;/p&gt;

&lt;div class="download"&gt;&lt;a href="http://20bits.com/downloads/erlang-services-0.1.zip"&gt;erlang-services-0.1.zip&lt;/a&gt; or &lt;a href="http://assets.20bits.com/downloads/erlang-services-0.1.tar.gz"&gt;erlang-services-0.1.tar.gz&lt;/a&gt;&lt;/div&gt;</description>
      <pubDate>Fri, 02 May 2008 00:00:16 +0000</pubDate>
      <link>http://20bits.com/article/network-programming-in-erlang</link>
    </item>
    <item>
      <title>Help, Facebook's Hacking Me!</title>
      <description>&lt;p&gt;
BBC's technology program, Click, is claiming to have &lt;a href="http://news.bbc.co.uk/2/hi/technology/7376738.stm"&gt;"exposed a security flaw in the social networking site Facebook which could compromise privacy."&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
ReadWriteWeb, without a trace of humor, followed on with an article called &lt;a href="http://www.readwriteweb.com/archives/facebook_hacked_again.php"&gt;Facebook Hacked Again&lt;/a&gt;.  Yes, the title of the post &lt;em&gt;was&lt;/em&gt; that sensationalist.
&lt;/p&gt;

&lt;p&gt;
Fortunately for us Facebook users, the BBC and ReadWriteWeb show a fundamental misunderstanding of what is happening and of how applications can purportedly "steal" user information, and then proceed to scare us by obscuring the possible solutions.
&lt;/p&gt;

&lt;h3&gt;The BBC's Mistakes&lt;/h3&gt;
&lt;p&gt;
Since the BBC's report is all video, here's a screen capture and a transcript of the voice-over that accompanies it.&lt;img src="http://assets.20bits.com/20080501/31337h4x0r.png" alt="An 31337 H4X0R" title="31337h4x0r" width="500" height="281" class="math size-full wp-image-114" /&gt;
&lt;/p&gt;
&lt;p&gt;
And the transcript:
&lt;blockquote&gt;We managed to write a very simple application which steals a user's personal Facebook details, and those of all their friends, without their knowledge.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Their report bothers me first as an engineer because the BBC talks as if this is some sort of sophisticated attack.  Just look at the screen capture.
&lt;/p&gt;

&lt;p&gt;
That's right &amp;mdash; unless you're elite enough to be sitting in a room lit like a rave working with two MacBook Pros, there's just no way you'd be able to pull this shit off.  Leave it to us, kid, we're professionals.
&lt;/p&gt;

&lt;p&gt;
Snark aside, here's what's happening.  In the summer of 2006 Facebook opened up their &lt;a href="http://wiki.developers.facebook.com/index.php/API"&gt;REST API&lt;/a&gt; to third-party websites.  Yes, this actually pre-dates the platform, which launched less than a year ago in May of 2007.
&lt;/p&gt;

&lt;p&gt;
Among other things, the API permits people to grant external websites permission to access a user's data.  Since the launch of the Facebook platform most applications exist on Facebook, but the API remains the same.&lt;/p&gt;

&lt;p&gt;
When you try to log into or add an application, here's an example of what you'd see.  I've highlighted some relevant parts.
&lt;img src="http://assets.20bits.com/20080501/add-application1.png" alt="Add Application screen" title="add-application1" width="410" height="394" class="math size-full wp-image-116" /&gt;
&lt;/p&gt;

&lt;p&gt;
So the BBC's claim that applications can access a user's data "without their knowledge" is dubious at best.  Sure, it's likely that the user will bypass all that text and go right for the big blue button, but the BBC report makes it sound like applications are doing something sneaky.
&lt;/p&gt;

&lt;p&gt;
Sorry, folks, but it's right there: Allow this application to know who I am and access my information.  Check.   
&lt;/p&gt;

&lt;p&gt;
Imagine this exposé instead. "BBC Uncovers Fatal Flaw in Valet Parking System," in which our intrepid reporter poses as a valet and drives off with someone's car.  It's so easy, and there's nothing stopping them!
&lt;/p&gt;

&lt;p&gt;
But we trust valets not to do it because the valet will get fired and the police will arrest him.  And it's the same on Facebook.  In fact, Facebook requires that developers adhere to its &lt;a href="http://developers.facebook.com/terms.php"&gt;Terms of Use&lt;/a&gt;, which explicitly forbid such uses of user information.  Of course, using this data for identity theft is more than just a violation of Facebook's Terms of Use; it's a violation of the law.
&lt;/p&gt;

&lt;h3&gt;Exaggerated Dangers&lt;/h3&gt;
&lt;p&gt;
The BBC mentions the above Terms of Use clause in passing, but then quickly states that your information is at risk if even one of your friends installs an application.  Yikes!  Is that true?
&lt;/p&gt;

&lt;p&gt;
Well, yes and no.  Yes, under certain configurations applications can get information about a user's friends even if those friends haven't installed the application.  But you're nowhere near as helpless as the BBC makes you seem.
&lt;/p&gt;

&lt;p&gt;
Here is a screenshot of Facebook's &lt;a href="http://www.facebook.com/privacy/?view=platform&amp;tab=other"&gt;Application Privacy&lt;/a&gt; page: &lt;img src="http://assets.20bits.com/20080501/application-privacy.png" alt="" title="application-privacy" width="499" height="413" class="math size-full wp-image-117" /&gt;
&lt;/p&gt;

&lt;p&gt;
Notice the text above the field of options. &lt;blockquote&gt;The following settings apply only to Facebook Platform applications to which you have not already granted access or explicitly restricted. For these applications, the information you select will be available to friends and other users who can already see your information on Facebook&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The BBC and ReadWriteWeb are a day late and a dollar short.  Not only is it against Facebook's rules to "steal" user data in this way, but Facebook actually provides mechanisms that allow users to secure their data.  I, personally, don't let applications I haven't installed see more than my Facebook photo.  They can't get my name, date of birth, location &amp;mdash; any of that.
&lt;/p&gt;
&lt;p&gt;
To summarize, the BBC and ReadWriteWeb didn't really uncover anything except a way to abuse a feature intentionally built into the Facebook platform in a way that Facebook anticipated two years ago.  What they claim is technically accurate but the dangers are grossly exaggerated.
&lt;/p&gt;

&lt;p&gt;
There are at least four levels of protection.
&lt;ol&gt;
	&lt;li&gt;Facebook forbids developers from storing user data in their Terms of Use.&lt;/li&gt;
	&lt;li&gt;Facebook provides mechanisms for me to hide data from applications I have installed directly.&lt;/li&gt;
	&lt;li&gt;For applications that I haven't installed but my friends have, I have full control over what they can and cannot see on Facebook's &lt;a href="http://www.facebook.com/privacy/?view=platform&amp;tab=other"&gt;Application Privacy&lt;/a&gt; page.&lt;/li&gt;
	&lt;li&gt;Above all this, there is the law.  Identity theft is illegal and using something like Facebook to steal personal data probably only increases the risks.  If I were looking to steal someone's identity I'd rather just look through their garbage, personally.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
This is not a hack and Facebook has controls for dealing with this on both the developer side and user side.  Don't buy into the BBC's and RWW's sensationalism.  Please.
&lt;/p&gt;</description>
      <pubDate>Thu, 01 May 2008 16:46:10 +0000</pubDate>
      <link>http://20bits.com/article/help-facebooks-hacking-me</link>
    </item>
    <item>
      <title>Interview Questions: Counting Bits</title>
      <description>&lt;p&gt;
Continuing my series of &lt;a href="/tag/interview"&gt;interview questions&lt;/a&gt;, today I bring you the classic bit-counting problem.
&lt;/p&gt;

&lt;p&gt;
The setup usually goes something like this.  We're receiving gigabytes of data per second.  Each chunk of data comes with a header that contains an unsigned 32-bit integer.  Let's call that integer the routing number.  We choose the routing destination based on the number of on bits in the binary representation of the routing number.
&lt;/p&gt;

&lt;p&gt;
Write a routine in C that returns the number of on bits in the binary representation of an unsigned 32-bit integer.
&lt;/p&gt;

&lt;h3&gt;The Naive Solution&lt;/h3&gt;
&lt;p&gt;
As usual there's a naive solution.  In this case you could loop through the bits one at a time, counting the number of ones. &lt;pre class="cpp"&gt;int bitcount(unsigned int n) {
	int count = 0;    
	while (n) {
		count += n &amp; 0x1u;
		n &gt;&gt;= 1;
	}
	return count;
}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;&gt;&gt;&lt;/tt&gt; is the right bit-shift operator.  It drops the right-most bit from the binary representation of an integer.  So, binary &lt;tt&gt;1001&lt;/tt&gt; shifted right by one is binary &lt;tt&gt;0100&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
The above has a few issues.  First, it takes O(b) time, where b is the number of bits in the binary representation of the integer.  Can we do better?  Second, it doesn't take into account the fact that n is a 32-bit integer&lt;span class="footnote"&gt;Let's ignore the subtleties of integer types in C for now, ok?&lt;/span&gt;.
&lt;/p&gt;

&lt;h3&gt;Pre-computation&lt;/h3&gt;
&lt;p&gt;
Since speed was a requirement, something that takes linear time is probably a bad idea.  The key idea is to realize that a deterministic function, like &lt;tt&gt;bitcount&lt;/tt&gt;, is no different from a hash where the keys are the inputs to the function and the values are the outputs of the function.
&lt;/p&gt;

&lt;p&gt;
This is the principle behind memoization, for example, but here we're sitting pretty.  Since both the input and output are unsigned integers we can create a regular array, call it &lt;tt&gt;bit_table&lt;/tt&gt;, where &lt;tt&gt;bit_table[i]&lt;/tt&gt; is the number of on bits in the binary representation of &lt;tt&gt;i&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
Furthermore since we have the constraint that the integer is 32-bits we can, in theory, pre-compute the entirety of &lt;tt&gt;bit_table&lt;/tt&gt; and include it in a header.  It'd work like this: &lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table32[1ull &lt;&lt; 32]; /* 0x1u &lt;&lt; 32 would overflow a 32-bit unsigned; illustrative only */

int bitcount_32(unsigned int n) {
	return bit_table32[n &amp; 0xFFFFFFFFu];
}&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;Size Constraints&lt;/h3&gt;
&lt;p&gt;
&lt;tt&gt;bit_table32&lt;/tt&gt; is going to contain 4,294,967,296 integers.  Depending on the size of an integer on your platform this will probably take up several gigabytes of memory.  If we want a constant-time algorithm that takes up significantly less memory we can create a 16-bit table and use bit arithmetic.
&lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table16[0x1u &lt;&lt; 16];

// This only works for 32-bit integers but takes constant time.
int bitcount_32(unsigned int n) {
	return bit_table16[n &amp; 0xFFFFu] + bit_table16[(n &gt;&gt; 16) &amp; 0xFFFFu];
}&lt;/pre&gt;
&lt;/p&gt;
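
&lt;p&gt;
The C snippets above leave the table pre-computation to "elsewhere."  As an illustration of the scheme &amp;mdash; in Python, with arithmetic standing in for the C bit operators &amp;mdash; building the 16-bit table and combining the two halves looks like this:
&lt;/p&gt;

```python
# bit_table16[i] is the number of on bits in the 16-bit integer i.
bit_table16 = [bin(i).count("1") for i in range(65536)]

def bitcount_32(n):
    # n % 65536 is the low 16 bits (like n AND 0xFFFF in C);
    # n // 65536 is the high 16 bits (like n shifted right by 16).
    return bit_table16[n % 65536] + bit_table16[n // 65536]
```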

&lt;h3&gt;The Unrestricted Case&lt;/h3&gt;
&lt;p&gt;
If we don't know how many bits the integer will contain (say we moved from a 32-bit to a 64-bit platform) then we can iterate over the binary representation 16 bits at a time, using the pre-computed table at each step&lt;span class="footnote"&gt;For the hard-core bit-counters out there, the C specification requires that integers contain &lt;em&gt;at least&lt;/em&gt; 16 bits.&lt;/span&gt;.
&lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table16[0x1u &lt;&lt; 16];

// This works for any sized integer but no longer takes constant time.
int bitcount(unsigned int n) {
	int count = 0;
	while (n) {
		count += bit_table16[n &amp; 0xFFFFu];
		n &gt;&gt;= 16;
	}
	
	return count;
}&lt;/pre&gt;
&lt;/p&gt;
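
&lt;p&gt;
The same idea sketched in Python handles integers of any width, since each iteration consumes 16 bits (again with arithmetic in place of the bit operators):
&lt;/p&gt;

```python
# bit_table16[i] is the number of on bits in the 16-bit integer i.
bit_table16 = [bin(i).count("1") for i in range(65536)]

def bitcount(n):
    # Consume 16 bits per iteration, so the loop runs width/16 times.
    count = 0
    while n:
        count += bit_table16[n % 65536]
        n //= 65536
    return count
```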

&lt;h3&gt;Good or Bad Question?&lt;/h3&gt;
&lt;p&gt;
This question suffers from the same problem as the &lt;a href="/article/interview-questions-loops-in-linked-lists"&gt;linked-list loop detection&lt;/a&gt; question: you probably either know the solution or you don't.
&lt;/p&gt;

&lt;p&gt;
That said, the solution here &amp;mdash; pre-computing a table of values to save CPU time &amp;mdash; is much more common than the tortoise-and-hare solution in the previous question, so the likelihood of it dawning on you during the interview is that much greater.  Plus I've been asked this question so many times that it's one of those must-know exercises, in my opinion, even if the question itself could be better.
&lt;/p&gt;</description>
      <pubDate>Wed, 30 Apr 2008 00:00:59 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-counting-bits</link>
    </item>
    <item>
      <title>Learning Erlang</title>
      <description>&lt;p&gt;
Last week I decided to learn Erlang, a functional programming language developed by Ericsson in 1987 for use in telecommunications environments.  It's probably the strangest non-toy programming language I've ever tried to learn, so I thought I'd share some of my realizations.
&lt;/p&gt;

&lt;h3&gt;Variables vs. Atoms&lt;/h3&gt;
&lt;p&gt;
First, variables in Erlang are not like variables as most programmers think about them.  Fortunately for me they're a lot like variables as mathematicians think of them.
&lt;/p&gt;

&lt;p&gt;
That is, variables in Erlang are either bound or unbound, and bound variables cannot be rebound in the same context.  This means that variables are write-once.  
&lt;/p&gt;
&lt;p&gt;
If I declare &lt;tt&gt;Name = "Jim".&lt;/tt&gt; I cannot later declare &lt;tt&gt;Name = "Betty".&lt;/tt&gt; in the same context.  Erlang will throw a matching error because it's trying to match the right-hand side, "Betty," against the left-hand side, which is bound to "Jim."
&lt;/p&gt;

&lt;p&gt;
When the left-hand side is unbound any match will succeed and assignment will occur, but if the left-hand side is bound Erlang will try to match the right-hand side to the value of the bound variable.  Thus, if &lt;tt&gt;"Jim"&lt;/tt&gt; is bound to &lt;tt&gt;Name&lt;/tt&gt;, both &lt;tt&gt;Name = "Jim".&lt;/tt&gt; and &lt;tt&gt;"Jim" = Name.&lt;/tt&gt; will succeed, but &lt;tt&gt;Name = "Betty".&lt;/tt&gt; will fail.  Weird, huh?
&lt;/p&gt;

&lt;p&gt;
Second, "context" in Erlang means lexical scope.  What's more, there is no global scope.  This is to enforce a no-side-effects style of programming, I suppose.
&lt;/p&gt;

&lt;p&gt;
Finally, variables in Erlang start with a capital letter.  Always. That is, &lt;tt&gt;Var&lt;/tt&gt; is always a variable but &lt;tt&gt;var&lt;/tt&gt; is never a variable.  If you execute &lt;tt&gt;var = 5.&lt;/tt&gt; you'll get a matching error.
&lt;/p&gt;

&lt;p&gt;
In this case &lt;tt&gt;var&lt;/tt&gt; is treated as an atom by Erlang.  Atoms serve the same role that symbols do in Ruby.  Any literal that isn't another data type, variable, or function is an atom.
&lt;/p&gt;

&lt;p&gt;
Atoms usually start with lower-case letters but you can also denote atoms by enclosing the name in single quotes.  So, &lt;tt&gt;Var&lt;/tt&gt; is a variable but &lt;tt&gt;'Var'&lt;/tt&gt; is an atom.  &lt;tt&gt;var&lt;/tt&gt; is an atom, too, and never a variable.
&lt;/p&gt;

&lt;h3&gt;Data Types&lt;/h3&gt;
&lt;p&gt;
In addition to atoms there are other data types.  All the favorites are here, like integers, floats, and strings.  We also have &lt;strong&gt;Funs&lt;/strong&gt;, or "functional objects," which are anonymous functions.
&lt;/p&gt;

&lt;p&gt;
Erlang also has two basic compound data types: lists and tuples.  These are analogues of the same objects in Python.  Items in both lists and tuples are separated by commas, but lists are enclosed by brackets, &lt;tt&gt;[]&lt;/tt&gt;, and tuples by curly braces, &lt;tt&gt;{}&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
For example, &lt;tt&gt;[1,3.4,true]&lt;/tt&gt; is a list and &lt;tt&gt;{person, 25, "Jason"}&lt;/tt&gt; is a tuple.
&lt;/p&gt;

&lt;p&gt;
There are no booleans in Erlang.  Instead the atoms &lt;tt&gt;true&lt;/tt&gt; and &lt;tt&gt;false&lt;/tt&gt; are used.
&lt;/p&gt;

&lt;h3&gt;Assignment vs. Pattern Matching&lt;/h3&gt;
&lt;p&gt;
In every other language I've ever used assignment works something like this: &lt;tt&gt;var x = 5&lt;/tt&gt;.  In Erlang there is no assignment, at least not in this sense.  Rather, Erlang matches patterns, and variables will match any pattern.
&lt;/p&gt;

&lt;p&gt;
Consider the following (using the &lt;tt&gt;erl&lt;/tt&gt; shell): &lt;pre class="brush: erlang"&gt;1&gt; {ip, IP} = {ip, "192.168.0.1"}.
{ip,"192.168.0.1"}
2&gt; IP.
"192.168.0.1"&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Erlang is matching the left-hand and right-hand sides and trying to align them.  &lt;tt&gt;IP&lt;/tt&gt; is a variable (we know this because it starts with a capital letter), so it matches any pattern.  &lt;tt&gt;ip&lt;/tt&gt; is an atom (we know this because it starts with a lower-case letter).
&lt;/p&gt;

&lt;p&gt;
In this case alignment is possible because &lt;tt&gt;ip&lt;/tt&gt; matches on both sides and IP is bound to the value &lt;tt&gt;"192.168.0.1"&lt;/tt&gt;. 
&lt;/p&gt;
&lt;p&gt;
Now consider this:
&lt;pre class="brush: erlang"&gt;1&gt; {foobar, IP} = {ip, "192.168.0.1"}.
** exception error: no match of right hand side value {ip,"192.168.0.1"}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Here we get an error because &lt;tt&gt;foobar&lt;/tt&gt; and &lt;tt&gt;ip&lt;/tt&gt; are different atoms, making a match impossible.  If instead we did
&lt;pre class="brush: erlang"&gt;1&gt; {Atom, IP} = {ip, "192.168.0.1"}.  
{ip,"192.168.0.1"}
2&gt; Atom.
ip
3&gt; IP.
"192.168.0.1"&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Here there's no error because &lt;tt&gt;Atom&lt;/tt&gt; is a variable.  It is bound with a value of &lt;tt&gt;ip&lt;/tt&gt;, which is an atom.
&lt;/p&gt;

&lt;p&gt;
Here's a more subtle example.  &lt;pre class="brush: erlang"&gt;1&gt; {A, {B, C}} = {first, {second, third}}.
{first,{second,third}}
2&gt; A.
first
3&gt; B.
second
4&gt; C.
third
5&gt; {X, Y} = {first, {second, third}}.
{first,{second,third}}
6&gt; X.
first
7&gt; Y.
{second,third}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
If you understand why A, B, C, X, and Y get bound to the values that they do then I think you're a long way towards understanding how &lt;tt&gt;=&lt;/tt&gt; works in Erlang.
&lt;/p&gt;
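&lt;p&gt;
If it helps, Ruby's destructuring assignment offers a loose analogy (my comparison, not anything from Erlang itself): nesting on the left-hand side is matched against the shape of the right-hand side.
&lt;/p&gt;

```ruby
# Loose Ruby analogy to the Erlang examples above.  Nested parentheses
# on the left mirror the nested tuple pattern {A, {B, C}}.
a, (b, c) = [:first, [:second, :third]]
# a == :first, b == :second, c == :third

# Without the nesting, the inner array is bound whole, like {X, Y}.
x, y = [:first, [:second, :third]]
# x == :first, y == [:second, :third]
```

&lt;p&gt;
One important difference: Ruby's destructuring never fails on a mismatch, whereas Erlang raises a badmatch error when a literal like &lt;tt&gt;foobar&lt;/tt&gt; can't be aligned.
&lt;/p&gt;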

&lt;h3&gt;Looping vs. Recursion&lt;/h3&gt;
&lt;p&gt;
Since variables in Erlang are single-assignment, bound once within their scope, procedural-style looping is difficult.  &lt;tt&gt;i++&lt;/tt&gt; is not only verboten, it is syntactically invalid.
&lt;/p&gt;

&lt;p&gt;
Instead loops are done through recursion.  Here is the factorial function:
&lt;pre class="brush: erlang"&gt;-module(factorial).
-export([factorial/1]).

factorial(0) -&gt; 1;
factorial(N) -&gt;
	N * factorial(N-1).&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Briefly, &lt;tt&gt;-module&lt;/tt&gt; defines an Erlang module, which is the mechanism by which the language supports code separation.  &lt;tt&gt;-export&lt;/tt&gt; tells Erlang which functions in this module to export.  The &lt;tt&gt;/1&lt;/tt&gt; after &lt;tt&gt;factorial&lt;/tt&gt; on the export line is the function's &lt;a href="http://en.wikipedia.org/wiki/Arity"&gt;arity&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
As with variable assignment, Erlang uses pattern matching in defining functions.  Since &lt;tt&gt;0&lt;/tt&gt; is an integer literal, all instances of &lt;tt&gt;factorial(0)&lt;/tt&gt; match it.  Any other calls to &lt;tt&gt;factorial&lt;/tt&gt; with a single argument match the second and &lt;tt&gt;N&lt;/tt&gt; is bound to that argument.
&lt;/p&gt;

&lt;h3&gt;Tail Recursion&lt;/h3&gt;
&lt;p&gt;
Since iterative loops are difficult in Erlang, making sure your recursive functions are tail recursive is important.  This means the last call a recursive function makes should be to itself.
&lt;/p&gt;

&lt;p&gt;
The &lt;tt&gt;factorial&lt;/tt&gt; function above is not tail recursive &amp;mdash; the last call it makes is to &lt;tt&gt;*&lt;/tt&gt; rather than &lt;tt&gt;factorial&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
To fix this we need to re-write &lt;tt&gt;factorial&lt;/tt&gt; to make use of an accumulator.
&lt;pre class="brush: erlang"&gt;-module(factorial).
-export([factorial/1]).

factorial(N) -&gt;
	factorial(N,1).

factorial(0, Acc) -&gt;
	Acc;
factorial(N,Acc) -&gt;
	factorial(N-1, N*Acc).&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Thanks to Erlang's pattern matching capabilities we don't even have to redefine the interface.  We export only the single-argument &lt;tt&gt;factorial/1&lt;/tt&gt;; the two-argument helper stays private to the module.
&lt;/p&gt;
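&lt;p&gt;
For comparison, here's the same accumulator trick sketched in Ruby (my translation, not part of the original Erlang; note that standard Ruby doesn't optimize tail calls, so this illustrates the shape of the recursion rather than a performance win):
&lt;/p&gt;

```ruby
# Accumulator-style factorial: the recursive call is the last thing the
# method does, so the partial product rides along in acc instead of
# being multiplied after the call returns.
def factorial(n, acc = 1)
  return acc if n.zero?
  factorial(n - 1, n * acc)  # tail call: nothing left to do afterward
end
```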

&lt;h3&gt;The Future&lt;/h3&gt;
&lt;p&gt;
Erlang is about concurrency and message-passing, so for my first exercise I'm going to try to create some simple network services. 
&lt;/p&gt;

&lt;p&gt;
Also, does anyone know of a GeSHi plugin for Erlang?
&lt;/p&gt;</description>
      <pubDate>Tue, 29 Apr 2008 16:37:54 +0000</pubDate>
      <link>http://20bits.com/article/learning-erlang</link>
    </item>
    <item>
      <title>The Future is Discovery, not Just Search</title>
      <description>&lt;p&gt;
Let's start with a picture from &lt;a href="http://www.radarnetworks.com/"&gt;Radar Networks'&lt;/a&gt; CEO Nova Spivack:
&lt;img src="http://assets.20bits.com/20080425/keyword-search-slide.png" alt="" title="keyword-search-slide" width="500" height="376" class="math size-full wp-image-106" /&gt;
&lt;/p&gt;

&lt;p&gt;
Erick Schonfeld, asking "&lt;a href="http://www.techcrunch.com/2008/04/25/is-keyword-search-about-to-hit-its-breaking-point/"&gt;Is Keyword Search About to Hit its Breaking Point?&lt;/a&gt;," talks about Spivack's view of the future of the web.  According to him it lies in ever-more-refined search technologies such as semantic search, natural language search, and artificial intelligence.  A quote: &lt;blockquote&gt;Keyword search engines return haystacks, but what we really are looking for are the needles.  The problem with keyword search such as Google's approach is that only highly cited pages make it into the top results.  You get a huge pile of results, but the page you want &amp;mdash; the "needle" you are looking for &amp;mdash; may not be highly cited by other pages and so it does not appear on the first page.  This is because keyword search engines don't understand your question, they just find pages that match the words in your question.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Spivack wants to "do for data what the Web did for documents" and develop a standard, uniform system for semantic metadata.  It's the classic "dumb software, smart data" idea.  Tagging works to a degree, but it's neither uniform nor standard &amp;mdash; the same tag can mean two different things for two different people, and two different tags can mean the same thing.
&lt;/p&gt;

&lt;p&gt;
That said, the premise underpinning Spivack's whole argument is that search is the correct interface when faced with a world of exponentially-increasing information.  His version of the future says, "Keyword search will become increasingly inefficient and the solution is to develop semantically-aware systems that search based on meaning, rather than content."
&lt;/p&gt;

&lt;h3&gt;Search and Discovery&lt;/h3&gt;
&lt;p&gt;
Let's take a step back and think of other situations where we are faced with more information than we can handle at once, for example, music.  How do you get new music?  If you want some new hip hop do you search for it?
&lt;/p&gt;

&lt;p&gt;
In truth, nobody I know searches for new music.  How can you search for something you don't know, anyhow?  Search doesn't just profit off intent, it requires it.  To find new hip hop I'd ask a friend who is into that scene and get his opinion, or browse through the new releases at my local record store or iTunes.
&lt;/p&gt;

&lt;p&gt;
The same pattern exists on TV.  People don't search for new shows, they discover them either through friends and advertisements, or by channel surfing.
&lt;/p&gt;

&lt;h3&gt;A Bi-Modal Future&lt;/h3&gt;
&lt;p&gt;
The future of information on the Web does not rest in super-advanced search, but in both search and discovery.  This bi-modal existence makes sense because people behave in two ways depending on whether they have intent or not.
&lt;/p&gt;

&lt;p&gt;
If someone knows what they want, say, the average RBI among hitters in the American League, then search is perfect.  If, however, you're in a channel surfing mood, then search is worthless because you don't know what you want &amp;mdash; but you will when you see it.
&lt;/p&gt;

&lt;p&gt;
Lots of sites straddle this divide.  Yelp, for example, helps in discovery by giving you sensible metadata in the form of ratings.  This fits into Spivack's hypothesis.  I have some level of intent (e.g., "I want Thai food in San Francisco"), but not much.
&lt;/p&gt;

&lt;p&gt;
But sites like YouTube fall clear off the discovery side of the gap.  Who searches YouTube unless they're trying to find a video they've already seen and want to show a friend?  Furthermore, who uses the metadata on the site (besides, perhaps, related videos) to find new content?  Most of the highly-rated and highly-viewed stuff, speaking for myself and my friends, is not what I watch regularly.
&lt;/p&gt;

&lt;p&gt;
Instead I discover videos on YouTube through my social network or by serendipitously finding a great video embedded in a website I happen to be reading.  Indeed, there are whole sites, like &lt;a href="http://www.stumbleupon.com/"&gt;StumbleUpon&lt;/a&gt;, whose main mechanic is serendipity.
&lt;/p&gt;

&lt;p&gt;
I'm still uncovering new information, but I'm sure as heck not searching for it in the search engine sense of the word.
&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;
In short, search is what we do when we have an idea of what we want and discovery is what we do the rest of the time.  When looking for something to watch on TV people don't search, they channel surf.  And when people want to find facts they search; they don't stumble around aimlessly.
&lt;/p&gt;

&lt;p&gt;
As information density increases and more pieces of media, information, knowledge, and, in general, data become available online both mechanics, search and discovery, will have to be developed to accommodate the volume. Why?&lt;/p&gt;

&lt;p&gt;
In a world with more and more data the percentage of data that we are actively able to query becomes smaller and smaller.  That is, if there is more data not only do we know less as a percentage of all the information out there, but we have less knowledge of what we do and do not know.
&lt;/p&gt;

&lt;p&gt;This is where discovery fits and it's a mistake to think the only solution is a single, ultra-intelligent search agent, or a single, unifying data structure for the Web.
&lt;/p&gt;

&lt;p&gt;
Human behavior tells us otherwise.
&lt;/p&gt;</description>
      <pubDate>Fri, 25 Apr 2008 11:35:49 +0000</pubDate>
      <link>http://20bits.com/article/the-future-is-discovery-not-just-search</link>
    </item>
    <item>
      <title>Interview Questions: Loops in Linked Lists</title>
      <description>&lt;p&gt;
This is part of my series on &lt;a href="/tag/interview"&gt;interview questions&lt;/a&gt;, so welcome aboard!
&lt;/p&gt;

&lt;p&gt;
This installment deals with a common question about &lt;a href="http://en.wikipedia.org/wiki/Linked_list"&gt;linked lists&lt;/a&gt; &amp;mdash; how do we detect when one has a loop?
&lt;/p&gt;
&lt;h3&gt;Linked Lists&lt;/h3&gt;

&lt;p&gt;
Linked lists are one of the simplest data structures and most aspiring programmers learn them early on.  But for completeness' sake let's cover that ground.
&lt;/p&gt;

&lt;p&gt;
A linked list is a sequence of nodes.  Each node contains a piece of data and a reference to the next node in the list.  Graphically it looks like this 
&lt;img src="http://assets.20bits.com/20080417/linked-list.png" alt="" title="linked-list" width="501" height="99" class="math" /&gt;
&lt;/p&gt;

&lt;h3&gt;Loopy Linked Lists&lt;/h3&gt;
&lt;p&gt;
It's possible, though, that a node in a linked list might point to a previous element in the list.  This is bad for many reasons, not the least of which is that any loop which iterates over all the nodes in the list by accessing the next node will never terminate.
&lt;/p&gt;
&lt;p&gt;
So, it becomes important to detect when linked lists have loops.  Here's what one such errant linked list looks like.
&lt;/p&gt;
&lt;p&gt;
&lt;img src="http://assets.20bits.com/20080417/loopy-linked-list.png" alt="" title="loopy-linked-list" width="501" height="232" class="math" /&gt;
&lt;/p&gt;

&lt;h3&gt;The Easy Solution&lt;/h3&gt;
&lt;p&gt;
The easy solution is to keep track of every node seen so far and check if the current node is in that list.  Here's a very simple linked list implementation in Ruby.
&lt;pre class="brush: ruby"&gt;class Node
  attr_accessor :data, :next
  
  def initialize(data = nil)
    @data = data
    @next = nil
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;Here is the simple solution for detecting loops using the above implementation &lt;pre class="brush: ruby"&gt;def has_loop?(node)
  seen = []
  until node.next.nil? do
    return true if seen.include? node
    seen &lt;&lt; node
    node = node.next
  end
  false
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This solution is workable but sub-optimal (surprise!).  This has O(n&lt;sup&gt;2&lt;/sup&gt;) complexity in CPU and O(n) complexity in memory, but a solution with O(n) complexity in CPU and O(1) complexity in memory is possible.  In fact, this question is usually posed to preclude the above solution.
&lt;/p&gt;

&lt;h3&gt;The Tortoise and the Hare&lt;/h3&gt;
&lt;p&gt;
The better solution involves a bit of mathematical thinking.  If there is a loop then any iterator, no matter how many steps it takes per iteration, must eventually end up inside it, going around forever.
&lt;/p&gt;

&lt;p&gt;
So, if we have two iterators moving at different speeds, the gap between them changes by a fixed amount each iteration.  With speeds of one and two the gap grows by exactly one node per iteration, so modulo the loop's length it takes on every value and the two iterators eventually land on the same node.  The usual solution is to have one iterator advance one node at a time ("the tortoise") and a second iterator advance two at a time ("the hare").
&lt;/p&gt;
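&lt;p&gt;
To see why they must collide, here's a toy sketch (my illustration, not real list nodes): model a five-node loop as positions 0 through 4, where each node points to the next position mod 5.
&lt;/p&gt;

```ruby
# Model a 5-node cycle with modular arithmetic: node i points to (i+1) % 5.
# The tortoise advances 1 node per step, the hare 2, so the gap between
# them grows by 1 per step and cycles through every position.
slow = 0
fast = 0
meetings = []
10.times do |t|
  slow = (slow + 1) % 5
  fast = (fast + 2) % 5
  meetings.push(t + 1) if slow == fast  # record the step they coincide
end
# They meet every 5 steps: here after iterations 5 and 10.
```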

&lt;p&gt;
That algorithm looks like this &lt;pre class="brush: ruby"&gt;def has_loop?(node)
  slow = node
  fast = node
  until fast.nil? or fast.next.nil? do
    slow = slow.next
    fast = fast.next.next
    return true if slow == fast
  end
  false
end&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;More Questions&lt;/h3&gt;
&lt;p&gt;
Assuming you answer the above correctly and quickly, the interviewer will probably follow up with some related questions.  How do you fix the linked list when you detect a loop?  What if the linked list has multiple loops?  How do you determine the size of the loop?
&lt;/p&gt;

&lt;p&gt;
I'll leave you to ponder these questions.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Thu, 17 Apr 2008 00:00:15 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-loops-in-linked-lists</link>
    </item>
    <item>
      <title>The Best Facebook Ad Network</title>
      <description>&lt;p&gt;
Even though most Facebook application developers make money off of low-CPM display ads from one of the many Facebook-specific ad networks, browsing the &lt;a href="http://forum.developers.facebook.com/"&gt;developer forum&lt;/a&gt; shows that lots of people don't have good information about which ad network is right for them.  I'm here to tell you, once and for all, which ad network is the best.
&lt;/p&gt;

&lt;h3&gt;The State of Advertising&lt;/h3&gt;
&lt;p&gt;
There are dozens of Facebook ad networks with new ones cropping up all the time, including, in no particular order, &lt;a href="http://www.socialmedia.com/"&gt;SocialMedia&lt;/a&gt;, &lt;a href="https://www.rockyouads.com/ams/partner/publisher/index.php"&gt;RockYou&lt;/a&gt;, &lt;a href="http://www.lookery.com/"&gt;Lookery&lt;/a&gt;, &lt;a href="http://cubics.com/"&gt;Cubics&lt;/a&gt;, &lt;a href="http://www.videoegg.com"&gt;VideoEgg&lt;/a&gt;, and &lt;a href="http://adblade.com/"&gt;AdBlade&lt;/a&gt;.  They vary according to their terms, deals, performance, and stability.
&lt;/p&gt;

&lt;p&gt;
As you can see the Facebook ad market is still very immature.  Because of that most ad networks don't have the inventory to satisfy the demand and the fragmentation makes it difficult to get major advertising agencies to spend on Facebook.
&lt;/p&gt;

&lt;p&gt;
It doesn't help that most developers are inexperienced when it comes to advertising, so one-on-one deals are probably out of reach for most of them.
&lt;/p&gt;

&lt;h3&gt;The Best Facebook Ad Network&lt;/h3&gt;

&lt;p&gt;
In light of the above, most developers turn to one of many ad networks to rep their inventory.  If they're lucky they might see a $2 &lt;a href="http://en.wikipedia.org/wiki/Cost_per_mille"&gt;eCPM&lt;/a&gt;, but in truth they'll probably see an order of magnitude or two less.
&lt;/p&gt;

&lt;p&gt;
Lookery, for example, has a "guaranteed 12¢ CPM" program for &lt;strike&gt;US&lt;/strike&gt; your traffic.  They had to stop signing people up because the demand overwhelmed them.  That tells you something about how much you can expect to earn by running simple display ads on Facebook.
&lt;/p&gt;

&lt;p&gt;
So, which ad network is the best?  The best ad network is the one that earns you the most money in the long run&lt;span class="footnote"&gt;Since ad-supported companies exist in a &lt;a href="/article/web-20-and-two-sided-markets"&gt;two-sided market&lt;/a&gt; it's important to realize that short-term gains (e.g., tricking your users into installing spyware) might be readily available, but at the cost of future growth.  How you find the balance point is beyond the scope of this article.&lt;/span&gt;.
&lt;/p&gt;

&lt;h3&gt;You Set Me Up! or A/B Testing for Fun and Profit&lt;/h3&gt;
&lt;p&gt;
Ok, ok, I admit it, I set you up.  But it's true.  Your choice of ad network should be simple.  Does Lookery make me more money in the long run than any of the others?  If yes, use Lookery.
&lt;/p&gt;

&lt;p&gt;
You can determine this quantitatively with &lt;a href="http://en.wikipedia.org/wiki/A/B_testing"&gt;A/B testing&lt;/a&gt;.  Here's a basic example, in PHP:
&lt;/p&gt;
&lt;p&gt;
&lt;pre class="brush: php"&gt;function get_random_ad_code() {
	$ad_codes = array(
		'lookery'     =&gt; 'Your Lookery ad code',
		'adblade'     =&gt; 'Your AdBlade ad code',
		'socialmedia' =&gt; 'Your SocialMedia ad code',
		'rockyou'     =&gt; 'Your RockYou ad code'
	);
	
	return $ad_codes[array_rand($ad_codes)];
}

echo get_random_ad_code();&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;get_random_ad_code&lt;/tt&gt; will return each ad code with equal probability.  Assuming all other variables are constant (i.e., the ads appear in the same place, with the same colors, and without any other ads on the page) then you can look at the earnings reports for each of the above and know, for certain, which ad network performs best on average &lt;em&gt;for your app&lt;/em&gt;.
&lt;/p&gt;
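&lt;p&gt;
The comparison itself is simple arithmetic once the impressions are split evenly.  A sketch in Ruby (the network names and dollar figures here are made up purely for illustration):
&lt;/p&gt;

```ruby
# eCPM is effective earnings per 1,000 impressions.  With equal-probability
# rotation each network serves roughly the same number of impressions, so
# differences in eCPM reflect the networks, not the placement.
def ecpm(earnings, impressions)
  1000.0 * earnings / impressions
end

# Hypothetical month of earnings reports for two networks.
results = {
  network_a: ecpm(120.0, 500_000),  # 0.24 eCPM
  network_b: ecpm(95.0, 500_000)    # 0.19 eCPM
}
best = results.max_by { |_name, rate| rate }.first
# best is :network_a
```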

&lt;p&gt;
Yes, that's right &amp;mdash; there's no single "best" ad network.  Two ad networks perform differently depending on their inventory.  Some might have ads that do well for EU traffic and poorly for US traffic, or vice versa.
&lt;/p&gt;

&lt;p&gt;
If you ask "which ad network is best?" on the Facebook developer forums you'll get a million anecdotes and little data, but &lt;a href="/article/decisions-without-data"&gt;decisions without data are guesses&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
So test your own app with all the ad networks and answer the question yourself.  &lt;em&gt;Then&lt;/em&gt; you'll really know which Facebook ad network is the best.
&lt;/p&gt;</description>
      <pubDate>Wed, 16 Apr 2008 00:00:50 +0000</pubDate>
      <link>http://20bits.com/article/the-best-facebook-ad-network</link>
    </item>
    <item>
      <title>Amazon, the Tech Company</title>
      <description>&lt;p&gt;
Larry Dignan over at ZDNet &lt;a href="http://blogs.zdnet.com/BTL/?p=8471"&gt;writes&lt;/a&gt;:
&lt;blockquote&gt;
Amazon on Monday announced persistent storage for its EC2 service and what's notable is how quickly the e-tailer is running ahead of the competition. In fact, Amazon's real business down the line will be its cloud services. Amazon will be like a book store that sells cocaine out the back door. Books will be just a front to sell storage and cloud computing.
&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
I've been speculating about this for a while.  Why is Amazon pushing their technology so hard?  Their business has been in retail and has been profitable since Q1 2002, if I recall correctly.
&lt;/p&gt;

&lt;p&gt;
But it hit me over lunch with &lt;a href="http://andrewchen.typepad.com/"&gt;Andrew Chen&lt;/a&gt; when we were talking about what it means to "be an X company," where X is media, retail, technology, or whatever.
&lt;/p&gt;

&lt;p&gt;
Think of it like this: what is Amazon's &lt;a href="http://en.wikipedia.org/wiki/Core_competency"&gt;core competency&lt;/a&gt;?  Andrew mentioned how Amazon uses technology to save money at every corner.  They have their warehouses right next to FedEx and use technology to plot the shortest route to pick up all the books needed for a shipment.
&lt;/p&gt;

&lt;p&gt;
In short, their core competency is their ability to develop and leverage their technology stack, including SimpleDB, EC2, and S3, towards making retail ultra-efficient.
&lt;/p&gt;

&lt;p&gt;
All these advantages are worthless in a world where the retail business is mostly digital.  Amazon &lt;a href="http://www.amazon.com/MP3-Music-Download/b?ie=UTF8&amp;node=163856011"&gt;knows this&lt;/a&gt; and that's why they're opening their technology stack.
&lt;/p&gt;

&lt;p&gt;
It would make no sense otherwise, since it's precisely that technology which gave Amazon a competitive advantage in a world where books and music were shipped to your doorstep rather than downloaded to your iPod.
&lt;/p&gt;</description>
      <pubDate>Mon, 14 Apr 2008 20:26:44 +0000</pubDate>
      <link>http://20bits.com/article/amazon-the-tech-company</link>
    </item>
    <item>
      <title>Interview Questions: When It's Your Turn</title>
      <description>&lt;p&gt;
This is part of my series about &lt;a href="http://20bits.com/tag/interview"&gt;interview questions&lt;/a&gt;.  As promised this is about interview strategy rather than specific technical interview questions.  I'll continue with that next week.
&lt;/p&gt;
&lt;p&gt;
Every tech interview I've ever had has four stages:
&lt;ol&gt;
	&lt;li&gt;Small talk and swapping brief personal bios.&lt;/li&gt;
	&lt;li&gt;Questions about your previous employment and projects.&lt;/li&gt;
	&lt;li&gt;Technical questions and brain teasers.&lt;/li&gt;
	&lt;li&gt;Turning the tables: "Do you have any questions for me?"&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
The meat of the interview is in the second and third parts where you can directly show your knowledge, skill, and passion, but don't underestimate the value of the fourth part.
&lt;/p&gt;

&lt;h3&gt;Don't be Afraid to Ask Hard Questions&lt;/h3&gt;
&lt;p&gt;
Most people use the fourth part to ask "What is it like working here?"-type questions.  If you think you're going to get interesting responses by all means ask those, but most interviewers I know lie to some degree to make their job sound approximately ten times more awesome than it really is.  They probably don't want to admit to themselves, let alone to some interviewee, that there are parts of their job they hate.
&lt;/p&gt;

&lt;p&gt;
Besides, if you want to know the bad parts about the job &amp;mdash; and there will be some &amp;mdash; just ask that question directly.  They'll either be forthright or they won't and it's pretty easy to discern between the two cases.
&lt;/p&gt;

&lt;h3&gt;Using the Questions to Show Off&lt;/h3&gt;
&lt;p&gt;
In Joel Spolsky's article &lt;a href="http://www.joelonsoftware.com/articles/GuerrillaInterviewing3.html"&gt;The Guerilla Guide to Interviewing&lt;/a&gt; he says that you want to hire people who are two things: one, smart; two, able to get things done.
&lt;/p&gt;

&lt;p&gt;
Since &lt;a href="http://thedailywtf.com/Articles/Riddle-Me-An-Interview.aspx"&gt;Interview 2.0&lt;/a&gt; is the common interview style in most technology companies these days you don't always have the chance to show off how smart you are, but the fourth part offers a path to redemption.
&lt;/p&gt;

&lt;p&gt;
Let's say you're interviewing at Amazon and have a background in mathematics.  You should be asking the engineers questions about the interesting mathematical things they do, have done, or have tried to do with their massive data sets.  This shows that you're not only engaged with the interviewer and the company, but have knowledge that can be brought to bear.
&lt;/p&gt;

&lt;p&gt;
The same applies if you're a marketer or whatever.  If you feel like you haven't had the chance to show the interviewer all you have to offer then asking intelligent questions that you know something about is a great strategy.
&lt;/p&gt;

&lt;h3&gt;A Hard-Learned Lesson&lt;/h3&gt;
&lt;p&gt;
I learned this lesson the hard way.  About a month after I left &lt;a href="http://sugarinc.com"&gt;Sugar, Inc.&lt;/a&gt; and two months after I launched &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt; I was interviewing at Facebook.  For most of the interviews the technical/quizzy type questions went well.  I had even sent in solutions to two of their job puzzles before I came in.  
&lt;/p&gt;
&lt;p&gt;
I was a little frustrated that most of the CS-type questions were about designing databases (as in, writing one from scratch) since I'd never had to do that before.  You can never know too much, though, so I only blame myself.
&lt;/p&gt;

&lt;p&gt;
When it came time to ask them questions, instead of using the strategy above and showing them I did have a solid grasp of the fundamentals they were looking for, I asked them the following question: "Facebook is a _____ company.  What would you put in the blank?"
&lt;/p&gt;

&lt;p&gt;
Every single person said "technology" and then I probed them about that.  "But you guys make money by selling attention.  How does that not make you a media company?"
&lt;/p&gt;

&lt;p&gt;
This is a bad question to ask engineers, even high-ranking ones, because most engineers don't give a crap &amp;mdash; they just want to create cool products and gizmos and bristle when people interject marketing and business mumbo-jumbo.&lt;/p&gt;

&lt;p&gt;And boy did they bristle.  I won't name names, but it was clear this wasn't a welcome question.  My time would have been better spent asking them technical questions because it would've created a discussion they wanted to take part in.
&lt;/p&gt;

&lt;p&gt;
I thought I was being clever but instead I torpedoed my chances of getting an offer there by annoying my interviewers and reinforcing their opinions about my technical skills.
&lt;/p&gt;

&lt;p&gt;
Not one month later I sold Appaholic/Adonomics, so it worked out well, but I still view it as a strategic mistake.  Lesson learned!
&lt;/p&gt;</description>
      <pubDate>Fri, 11 Apr 2008 18:36:48 +0000</pubDate>
      <link>http://20bits.com/article/when-its-your-turn</link>
    </item>
    <item>
      <title>Web 2.0 and Two-sided Markets</title>
      <description>&lt;p&gt;
Last Friday Andrew Chen &lt;a href="http://andrewchen.typepad.com/andrew_chens_blog/2008/04/your-ad-support.html"&gt;wrote a great article&lt;/a&gt; about ad-supported websites.  His thesis boils down to the following snippet.
&lt;blockquote&gt;
The key thing here is: &lt;strong&gt;The users of your website are not really your customers&lt;/strong&gt;.

Instead, the entire process of gathering eyeballs is just to sell to your ACTUAL customers, who are the ad agencies and advertisers. Get it? Your Web 2.0 consumer startup is actually a B2B that sells inventory to brand advertisers.
&lt;/blockquote&gt;
&lt;/p&gt;

&lt;h3&gt;Subtleties&lt;/h3&gt;
&lt;p&gt;
He brings up a worthwhile point, which is that many developers looking to make the next hot thing don't really understand the role advertisers play in their ecosystem.  For developers business development, marketing, and advertising are all dirty words.  They certainly don't view themselves as trafficking in attention, even though many if not most hot web properties make money by selling their users' attention to advertisers.
&lt;/p&gt;

&lt;p&gt;
This is one extreme &amp;mdash; marketing and advertising are dirty and our first duty is to our users, not our advertisers.  Andrew's is the other &amp;mdash; if you're making money through advertising then you're going to have to serve advertisers above your users.
&lt;/p&gt;

&lt;p&gt;
As it happens the situation is more complicated.  Media companies, which I define as companies that sustain themselves by buying and selling attention, exist in a &lt;a href="http://en.wikipedia.org/wiki/Two-sided_markets"&gt;two-sided market&lt;/a&gt;.  This includes ad-supported companies. (Yes, Facebook and Google, you're media companies, not technology companies.)
&lt;/p&gt;

&lt;p&gt;
Here's a diagram that describes the economics of an ad-supported media company &lt;img src="http://assets.20bits.com/20080408/media-economics.png" alt="" title="media-economics" width="492" height="167" class="math size-full wp-image-92" /&gt;
&lt;/p&gt;

&lt;p&gt;
In a two-sided market both sides can and do affect the other.  Most developers are at least tangentially aware of this when they express fears that ads will drive users away.  But if you do it right the relationship between you, your audience, and advertisers can be symbiotic rather than oppositional.
&lt;/p&gt;

&lt;p&gt;
A good example of this sort of relationship is my former employer, &lt;a href="http://sugarinc.com"&gt;Sugar, Inc.&lt;/a&gt;  On occasion advertisers will pay to dress up the avatars that live at the top of their content sites like &lt;a href="http://popsugar.com"&gt;PopSugar&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
The audience likes it because they enjoy fashion and it adds personality to the site.  The advertisers like it because it's a unique value proposition and it attaches them strongly with Sugar, Inc.'s brand.  And Sugar, Inc. likes it because it makes them money.
&lt;/p&gt;

&lt;h3&gt;Creating Value vs. Stealing Value&lt;/h3&gt;
&lt;p&gt;
It comes down to this: it's possible, if you're clever enough, to create situations where everyone wins, even if you're an ad-supported media company.  And your goal as an entrepreneur &lt;em&gt;should&lt;/em&gt; be to create value.
&lt;/p&gt;

&lt;p&gt;
To create value it's important to align your interests with your customers.  This might be more difficult in a two-sided market since, in effect, you have two sets of customers.
&lt;/p&gt;
&lt;p&gt;
But as Andrew points out there are other economies that are possible in the Web 2.0 world, such as e-commerce, subscription fees, or virtual goods.  These might be easier if they're appropriate, but it's a mistake to think that supporting yourself through advertising automatically makes you an enemy of your audience.
&lt;/p&gt;</description>
      <pubDate>Tue, 08 Apr 2008 00:00:09 +0000</pubDate>
      <link>http://20bits.com/article/web-20-and-two-sided-markets</link>
    </item>
    <item>
      <title>Interview Questions: Shuffling an Array</title>
      <description>&lt;p&gt;
This is part of my &lt;a href="http://20bits.com/tag/interview/"&gt;interview question series&lt;/a&gt;.  It's about shuffling arrays.
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
You have an array A of size N.  Write a routine that shuffles the array in-place.  The only restrictions are that all possible permutations of A must be possible and equally likely.
&lt;/p&gt;

&lt;p&gt;
This interview question serves as a test for basic algorithm construction.  There's a canonical solution that's not too difficult to arrive at if you've never seen it before, so it's a good combination of "what do you know?" and "what can you do?"
&lt;/p&gt;

&lt;h3&gt;Workin' it out&lt;/h3&gt;
&lt;p&gt;
I'm going to create my solution in Ruby because that's the language the company that asked me this question used.
&lt;/p&gt;

&lt;p&gt;
The first solution most people arrive at is subtly wrong.  &lt;a href="http://www.codinghorror.com/blog/archives/001008.html?r=31644"&gt;Jeff Atwood&lt;/a&gt; made the mistake in his blog post.  The algorithm, in words, goes like this: iterate through each item in the array, pick another element at random, and swap the two.
&lt;/p&gt;

&lt;p&gt;
In Ruby the above algorithm would look like this.

&lt;pre class="brush: ruby"&gt;class Array
  def shuffle_naive!
    n = size
    until n == 0
      k = rand(size) # This is the line that proves our undoing
      n = n - 1
      self[n], self[k] = self[k], self[n]
    end
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This solution seems correct if not optimal, but there's a subtle problem: not all outcomes are equally likely. 
&lt;/p&gt;

&lt;p&gt;
The root cause is that this algorithm draws from a sample space of size N&lt;sup&gt;N&lt;/sup&gt;, while the sample space of all permutations of an N-element array has size only N!.
&lt;/p&gt;
&lt;p&gt;
That is, for the naive shuffle, for each of the N steps in the iteration we make one of N decisions for a total of N&lt;sup&gt;N&lt;/sup&gt; possible outcomes.
&lt;/p&gt;

&lt;p&gt;
But N&lt;sup&gt;N&lt;/sup&gt; &gt; N! for all N &gt; 1 and, more importantly, N! is not a divisor of N&lt;sup&gt;N&lt;/sup&gt; for any N &gt; 2.  This means at least one permutation must come up more often than the others, so the algorithm doesn't select among the possible permutations uniformly.
&lt;/p&gt;
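&lt;p&gt;
You can make the bias concrete without any randomness by enumerating all 27 decision paths the naive shuffle can take on a three-element array (my sketch, mirroring the loop above):
&lt;/p&gt;

```ruby
# Enumerate every sequence of rand(size) results the naive shuffle can see
# on [0, 1, 2]: three iterations, three choices each, 3^3 = 27 paths.
counts = Hash.new(0)
(0..2).each do |k1|
  (0..2).each do |k2|
    (0..2).each do |k3|
      a = [0, 1, 2]
      [k1, k2, k3].each_with_index do |k, i|
        n = 2 - i                # n counts down 2, 1, 0 as in the loop
        a[n], a[k] = a[k], a[n]
      end
      counts[a] += 1
    end
  end
end
# 27 equally likely paths land on 6 permutations, and 6 doesn't divide 27,
# so the per-permutation counts can't all be equal.
```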

&lt;h3&gt;KFC, KFY&lt;/h3&gt;
&lt;p&gt;
The "best" solution is the &lt;a href="http://en.wikipedia.org/wiki/Knuth_shuffle"&gt;Knuth-Fisher-Yates shuffle&lt;/a&gt;.  Here it is in Ruby:
&lt;pre class="brush: ruby"&gt;class Array
  def shuffle!
    n = size
    until n == 0
      k = rand(n) #You can see I'm doing rand(n) rather than rand(size)
      n = n - 1
      self[n], self[k] = self[k], self[n]
    end
    self
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This works because it's an iterative version of an essentially recursive algorithm.  If we know how to shuffle an array of size N-1 then shuffling an array of size N is easy &amp;mdash; first shuffle the sub-array consisting of the first N-1 elements and then randomly swap the last element into any of the N slots.
&lt;/p&gt;
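&lt;p&gt;Running the same enumeration against the corrected shuffle confirms uniformity.  In this sketch (mine, under the same 3-element setup as before) Fisher-Yates draws rand(3), then rand(2), then rand(1), giving exactly 3*2*1 = 6 choice sequences, and each sequence lands on a different permutation:&lt;/p&gt;

```ruby
# Enumerate every sequence of rand(n) choices Fisher-Yates can make on a
# 3-element array; there are 3*2*1 = 6 of them, one per permutation.
counts = Hash.new(0)

[0, 1, 2].each do |k1| # rand(3)
  [0, 1].each do |k2|  # rand(2)
    a = [0, 1, 2]
    ks = [k1, k2, 0]   # rand(1) is always 0
    n = a.size
    until n == 0
      k = ks[a.size - n]
      n = n - 1
      a[n], a[k] = a[k], a[n]
    end
    counts[a] += 1
  end
end

puts counts.size # 6 distinct permutations, each produced exactly once
```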

&lt;p&gt;
There's a proper inductive proof in there if you're so inclined, but it's not particularly illuminating.
&lt;/p&gt;

&lt;h3&gt;Good Questions, Bad Questions&lt;/h3&gt;
&lt;p&gt;
My next article is going to be more about the interview process rather than specific questions.  One key thing to understand in an interview is what information the interviewer is looking for in asking their question.  Hint: it's not always the answer.
&lt;/p&gt;

&lt;p&gt;
Among other things they want to suss out the limits of your knowledge, how you solve problems, how quickly you resort to help, and a whole assortment of other behavioral signals that they can only get because you're right there, (hopefully) engaging in a dialogue.
&lt;/p&gt;</description>
      <pubDate>Mon, 07 Apr 2008 00:00:14 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-shuffling-an-array</link>
    </item>
    <item>
      <title>Decisions Without Data</title>
      <description>&lt;p&gt;
If you've ever worked on a project where you have to build something, be it software or anything else, you've seen it happen &amp;mdash; people, especially designers and engineers, argue over the most petty stuff.  
&lt;/p&gt;

&lt;p&gt;
You know what kind of argument I mean.  How many widgets should we let people enter at a time?  Should we use horizontal or vertical navigation?  Everyone knows they have "the" optimal answer and the situation quickly devolves into a game of verbal chicken, where the first one to realize it's a stupid argument loses.
&lt;/p&gt;

&lt;p&gt;
Having seen this over and over at all levels of decision-making I've found a sentiment that stops the situation from devolving.  It goes like this: &lt;strong&gt;decisions without data are guesses&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
Whenever one of these decisions pops up I ask three questions.
&lt;/p&gt;
&lt;p&gt;
&lt;ol&gt;
&lt;li&gt;What is the goal?&lt;/li&gt;
&lt;li&gt;What metrics tell us whether we're closer or farther from the goal?&lt;/li&gt;
&lt;li&gt;What data have we collected and what data do we need?&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
&lt;h3&gt;Example: Publisher Ad Choices&lt;/h3&gt;
&lt;p&gt;
In the world of Facebook applications most developers, when it comes to making money, are members of the Ron Popeil school of business: set it and forget it.  If you browse the forums for ad-related topics there are a few questions that recur over and over.  What ad network is best?  Where should I put ads on my application?  What color scheme should my ads have?
&lt;/p&gt;

&lt;p&gt;
In this case, the developers implicitly have a goal of making money in mind.  The two key metrics are total pageviews and revenue per thousand pageviews (RPM).  The data needed to calculate these metrics are total pageviews, which you can get by using Google Analytics, and revenue, which every ad network reports directly.
&lt;/p&gt;

&lt;p&gt;
So, let's take the first question, which ad network should I use?  I might think &lt;a href="http://adblade.com/"&gt;AdBlade&lt;/a&gt; is the best and have ten stories that back up my claim, while you might think &lt;a href="http://www.rockyouads.com"&gt;RockYou&lt;/a&gt; is the best from your own experiences.  We could go back and forth all day, but there's only one correct answer: the best ad network is the one that, for a given level of traffic, offers the highest RPM.
&lt;/p&gt;

&lt;p&gt;
You can measure this by using &lt;a href="http://en.wikipedia.org/wiki/A/B_testing"&gt;A/B testing&lt;/a&gt; across multiple ad networks.  Once you've collected information about how each ad network performs there's no room for arguments backed by anecdotes.  The best choice is right there in the numbers.
&lt;/p&gt;
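&lt;p&gt;To make this concrete, here's a sketch in Ruby (the network names and numbers are invented) of the calculation that settles the argument once the A/B test has run:&lt;/p&gt;

```ruby
# Hypothetical A/B test results: pageviews served to each network and the
# revenue each network reported for those pageviews.
networks = {
  "NetworkA" => { pageviews: 120_000, revenue: 84.00 },
  "NetworkB" => { pageviews: 115_000, revenue: 97.75 },
}

# RPM: revenue per thousand pageviews.
rpm = networks.transform_values { |d| d[:revenue] / d[:pageviews] * 1000 }
best = rpm.max_by { |_, v| v }

puts rpm.inspect
puts "Best network: #{best[0]}"
```

&lt;p&gt;Whichever network tops this list wins, regardless of anyone's anecdotes.&lt;/p&gt;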

&lt;h3&gt;Example: Website Layout&lt;/h3&gt;
&lt;p&gt;
These arguments crop up all the time when talking about website design.  Let's say you're creating the latest and greatest social networking site.  What should the homepage look like?
&lt;/p&gt;

&lt;p&gt;
First, you need to settle on a goal for what the homepage is supposed to do.  Do you want lots of users to sign up?  Do you want lots of users of a certain &lt;em&gt;type&lt;/em&gt; (e.g., more engaged users, only women, etc.) to sign up?  Let's say you just want as many people to sign up as possible.
&lt;/p&gt;

&lt;p&gt;
The metric you're probably interested in in this case is the percentage of people who visit the homepage and then go through the signup process.  Measuring this requires that you track a user through multiple parts of the site and identify which ones sign up and which ones don't.  You can do this by assigning the potential user a unique identifier and persisting it through the entire signup process.
&lt;/p&gt;
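&lt;p&gt;As a sketch (the identifiers and event lists here are invented), the metric reduces to counting the unique identifiers that appear in both the homepage-visit log and the completed-signup log:&lt;/p&gt;

```ruby
# Hypothetical event logs keyed by the per-visitor identifier that was
# assigned on the homepage and persisted through the signup process.
homepage_visits   = ["u1", "u2", "u3", "u4", "u5", "u2"]
completed_signups = ["u2", "u5"]

conversion = completed_signups.uniq.size.fdiv(homepage_visits.uniq.size)
puts "Signup conversion: #{(conversion * 100).round(1)}%"
```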

&lt;p&gt;
Now you know for a given homepage layout what percentage of users sign up.  What homepage design is the best?  Should we use 14-point or 16-point headers?  Should we use an off-white or grey background?  Depending on the granularity of the design elements you want to test you can do this either through A/B testing, as in the previous example, or through a more complex &lt;a href="http://en.wikipedia.org/wiki/Multivariate_testing"&gt;multivariate testing scheme&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
People love arguing about what designs are "best," but this process forces you to ask, "Best for what?"  If our goal is to get signups then the best design is the one that produces the most signups and that's something we can measure directly.  Once we've done that not only is there no more room for silly arguments but metrics might reveal that both opinions were wrong.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Maybe it's my background as a scientist and mathematician, but I treat website development and design as an empirical venture.  We should come to the task with a definite idea of what we're trying to achieve and at every step make the decision that the data says is best.
&lt;/p&gt;

&lt;p&gt;
Not only does this produce better and more justifiable decisions but it prevents time-wasting arguments.  If someone comes at you with an opinion you can just shoot back, "What data do you have?"  If they have nothing but opinions and anecdotes you know they're not making a decision, they're guessing.
&lt;/p&gt;</description>
      <pubDate>Sat, 05 Apr 2008 00:00:26 +0000</pubDate>
      <link>http://20bits.com/article/decisions-without-data</link>
    </item>
    <item>
      <title>Interview Questions: Two Bowling Balls</title>
      <description>&lt;p&gt;
This post is the first in a series I'm calling "interview questions," where I discuss interview questions I've been handed in my time out here in the Bay Area.  Since I'm an engineer by trade most of the questions relate directly to technical topics.  I'll also cover general interview strategies and advice — probably by serving myself up as an example of what &lt;em&gt;not&lt;/em&gt; to do in an interview.
&lt;/p&gt;

&lt;p&gt;
I know people keep a repertoire of interview questions at hand, so I'm not going to name names when discussing the questions.  Anyhow, let's get started!
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
You're standing in front of a 100 story building with two identical bowling balls.  You've been tasked with testing the bowling balls' resilience.  The building has a stairwell with a window at each story from which you can (conveniently) drop bowling balls.  
&lt;/p&gt;

&lt;p&gt;
To test the bowling balls you need to find the first floor at which they break.  It might be the 100th floor or it might be the 50th floor, but if it breaks somewhere in the middle you know it will break at every floor above.
&lt;/p&gt;

&lt;p&gt;
Devise an algorithm which guarantees you'll find the first floor at which one of your bowling balls will break.  You're graded on your algorithm's worst-case running time.
&lt;/p&gt;

&lt;p&gt;
&lt;h3&gt;Warning: Stop reading here if you're not interested in seeing any of my solutions!&lt;/h3&gt;
&lt;/p&gt;

&lt;h3&gt;A Few Preliminaries&lt;/h3&gt;
&lt;p&gt;
The original problem stated that the building had 100 floors, but it may as well have N floors.  Using N rather than 100 will make it easier to quantify the performance of the algorithm, so that's what I'm going to do.
&lt;/p&gt;
&lt;h3&gt;Solution 1: The Naïve Solution&lt;/h3&gt;
&lt;p&gt;
Ok, there's one blindingly obvious solution: take one of the bowling balls and drop it from every floor, starting from the first.  At worst this will take N tries, where N is the number of stories in the building.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Interview Advice:&lt;/strong&gt; In an actual interview situation don't be afraid to say the obvious solution, even if you know there's a better one.  Problem solving is iterative and your answer should be, too.
&lt;/p&gt;

&lt;h3&gt;Solution 2: Two Bowling Balls&lt;/h3&gt;
&lt;p&gt;
We know the first solution is probably sub-optimal because it doesn't make use of both bowling balls.  To give us some ideas let's just pick a floor, say the 50th floor, and drop one of the balls — we have nothing to lose since we know we can do it with only one.
&lt;/p&gt;

&lt;p&gt;
If our building is 100 floors and we dropped one of the balls from the 50th floor one of two things will happen: the ball will either break or it won't.  If it breaks then we know the floor we're looking for is somewhere between floors 1 and 50.  If it doesn't then we know it's somewhere between floors 51 and 100.  In either case we've halved the size of the search space and now need at most N/2 (or 50) tries.&lt;/p&gt;

&lt;p&gt;
But 50 was arbitrary.  What about other numbers?  What happens if we drop the ball on the third floor?  If it breaks then we can use the second ball to test floors 1-2, taking at most 3 tries.  If it doesn't break then we try the same experiment again, dropping the ball from another floor.
&lt;/p&gt;

&lt;p&gt;
Here's one possible strategy: pick a number S and call it the skip number.  We drop one ball every S floors until it breaks on the k&lt;sup&gt;th&lt;/sup&gt; try.  We then use the second ball to try every floor between floors (k-1)*S and k*S.
&lt;/p&gt;

&lt;p&gt;
As an example, let N=100 and S=4.  We'd try floors 4,8,12,16,... with one bowling ball until it breaks.  Let's say it breaks on the 60th floor.  Since it didn't break on the 56th floor we know the culprit is floor 57, 58, 59, or 60, and we can use the second ball to test floors 57-59 one at a time using the naïve strategy.
&lt;/p&gt;

&lt;p&gt;
What is the best skip size?  Obviously S=100 isn't ideal since it's equivalent to the naïve strategy, as is S=1.  But we know both S=50 and S=4 are better, so there must be an optimal strategy somewhere in between.  To find it, let L(S) be the number of drops required in the worst-case scenario for a skip number of S.  If you work it out you'll get &lt;img class="math" src="http://assets.20bits.com/20080403/latex-4.png" alt="" title="latex" width="163" height="43"/&gt;
&lt;/p&gt;

&lt;p&gt;
We want to minimize this function.  Bringing back our high school calculus, the derivative of L(S) is
&lt;img src="http://assets.20bits.com/20080403/latex-1.png" alt="" title="latex-1" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
Setting the derivative equal to zero implies &lt;img src="http://assets.20bits.com/20080403/latex-2.png" alt="" title="latex-2" width="78" height="23" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
For N=100 this gives an optimal skip of S=10.  If N isn't a perfect square you'll have to work out which skip gives the "correct" solution.
&lt;/p&gt;
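&lt;p&gt;If calculus isn't handy you can also brute-force it.  This sketch (mine; the original article has no code for this problem) evaluates L(S) = N/S + S - 1 for every skip size and picks the smallest:&lt;/p&gt;

```ruby
# Worst-case drop count for skip size s on an n-story building,
# per the formula L(S) = N/S + S - 1.
n = 100
l = lambda { |s| n.to_f / s + s - 1 }

best = (1..n).min_by { |s| l.call(s) }
puts "Best skip: #{best}, worst case: #{l.call(best)} drops"
```

&lt;p&gt;For N=100 this reports a best skip of 10 with a worst case of 19 drops, matching S=&amp;radic;N.&lt;/p&gt;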

&lt;h3&gt;Solution 3: You can do better...&lt;/h3&gt;
&lt;p&gt;
At this point in the interview you're probably pretty happy with yourself.  The above took you a few minutes to work out, perhaps with some prodding by the interviewer.  But then you hear that dreaded question, "Can you do any better?"
&lt;/p&gt;

&lt;p&gt;
The interviewer isn't a jackass, though, and gives you a hint.  He points out that it seems like we should be able to find a solution that works equally well irrespective of where the bad floor is.  That is, it should take the same number of turns if it's on the 100th floor as it would if it were on a lower floor.
&lt;/p&gt;

&lt;p&gt;
We have a baseline for ourselves.  For N=100 and S=10 we know we can do it in at most 19 turns.  This can act as a sort of counter — if we beat this number at every step we've come up with a strictly better algorithm.  So, at every step, we want to be able to find the floor in question in no more than 18 steps.
&lt;/p&gt;
&lt;p&gt;
Let's start by dropping the first ball on the 18th floor.  If it breaks we can test floors 1-17 with the second ball, taking at most 18 turns.  If it doesn't break, we've used up one of our turns, leaving us with 17 turns left.
&lt;/p&gt;
&lt;p&gt;
So, the next floor we should test is 18+17, or the 35th floor.  If it breaks the culprit is somewhere in floors 19-35, which we can pin down in at most 18 total turns.  We can continue this way, shrinking the step size by one each time.  Now we know we can do it in at most 18 steps.
&lt;/p&gt;

&lt;p&gt;
But why not 17?  If we repeat the above steps, starting with a counter of 17 rather than 18, we get an algorithm that takes at most 17 steps.  Then, using 16 as a counter, we get an algorithm that takes at most 16 steps.  We can't do this forever, since there's no possible algorithm that takes at most one step.  So where is the end of the line?
&lt;/p&gt;

&lt;p&gt;
The problem is that for this algorithm to work the first ball needs to be able to skip one fewer each time and still cover all 100 floors.  If we set our counter to C that means we must have 1+2+...+C &gt;= 100.&lt;/p&gt;

&lt;p&gt;
Here's the math:&lt;img src="http://assets.20bits.com/20080403/latex-12.png" alt="" title="latex-12" width="357" height="150" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
Using the quadratic formula to find the exact solution and then taking into account the fact that we want an integer solution gives
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20080403/latex-21.png" alt="" title="latex-21" width="215" height="53" class="math"/&gt;
&lt;p&gt;
as the worst possible case for our third strategy.  &lt;tt&gt;L(100) = 14&lt;/tt&gt;, which checks out.
&lt;/p&gt;
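&lt;p&gt;In code the final strategy is only a few lines.  The sketch below (mine, not the interviewer's) finds the smallest counter C with C(C+1)/2 &gt;= N, then lists the floors to drop the first ball from:&lt;/p&gt;

```ruby
# Smallest C such that 1 + 2 + ... + C covers all n floors.
def min_drops(n)
  c = 1
  c += 1 until c * (c + 1) / 2 >= n
  c
end

# Floors to drop the first ball from: start at C, then step down by one,
# capping the final drop at the top floor.
def drop_floors(n)
  floors = []
  floor = 0
  min_drops(n).downto(1) do |step|
    floor += step
    floors.push([floor, n].min)
    break if floor >= n
  end
  floors
end

puts min_drops(100) # 14
puts drop_floors(100).inspect
```

&lt;p&gt;For N=100 the first ball goes to floors 14, 27, 39, 50, 60, 69, 77, 84, 90, 95, 99, and 100.&lt;/p&gt;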

&lt;p&gt;
That's the best solution I know, and it was the best solution the interviewer knew, too.  Can you do any better?
&lt;/p&gt;

&lt;h3&gt;After The Interview&lt;/h3&gt;
&lt;p&gt;
This was one of a few questions I was asked by one of four interviewers.  I worked through the problem above, basically as it was written out, albeit with more digressions.  How did the interview wind up going?  I wasn't offered a job.  At least I got a good interview question out of it, though.
&lt;/p&gt;</description>
      <pubDate>Thu, 03 Apr 2008 00:00:01 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-two-bowling-balls</link>
    </item>
    <item>
      <title>Memo to OpenSocial: It's about distribution, stupid!</title>
      <description>&lt;p&gt;
With the launch of Google's &lt;a href="http://code.google.com/apis/opensocial/"&gt;OpenSocial&lt;/a&gt; project last week and the subsequent announcement that &lt;a href="http://biz.yahoo.com/bw/071101/20071101006542.html?.v=1"&gt;MySpace will be one of the participating social networks&lt;/a&gt; the developer community on Facebook and the technology blogosphere is wondering what this means for Facebook's platform strategy.  The short answer: not much.  Why?  Because it's not about users &lt;em&gt;per se&lt;/em&gt;, it's about distribution.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Note:&lt;/strong&gt; This was cross-posted to the &lt;a href="http://blog.adonomics.com/2007/11/07/memo-to-opensocial-its-about-distribution-stupid/"&gt;Adonomics blog&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Hype Machine&lt;/h3&gt;
&lt;p&gt;
Talking heads love narratives.  After MySpace announced their partnership the narrative became "the young and arrogant Facebook was shown up by the experienced and prudent Google, who in one fell swoop obsoleted their whole platform strategy."  &lt;a href="http://www.techcrunch.com/2007/11/01/confirmed-myspace-to-join-google-opensocial/"&gt;Michael Arrington&lt;/a&gt; asked if this were "checkmate" for Google.  &lt;a href="http://www.gadgetell.com/2007/11/myspace-joins-opensocial-will-facebook-bow/"&gt;Others&lt;/a&gt; have been asking whether or not Facebook should just give up and join OpenSocial itself.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://weblogs.hitwise.com/bill-tancer/2007/11/opensocial_what_a_difference_a.html"&gt;Hitwise&lt;/a&gt; even published a graph showing the total size of all the OpenSocial partners vs. Facebook.  Yikes!  That sure looks bad for Facebook.
&lt;/p&gt;

&lt;p&gt;
But like most media narratives the story turns out to be more subtle.
&lt;/p&gt;

&lt;h3&gt;The Value of the Facebook Platform&lt;/h3&gt;
&lt;p&gt;
These subtleties first appear when you consider the value application developers get from Facebook.  Nothing in this world is free, including developing applications for Facebook platform.  In exchange for agreeing to dress your application in the Facebook blues they are offering you, the developer, two things:
&lt;ol&gt;
&lt;li&gt;Lower cost of acquisition per user&lt;/li&gt;
&lt;li&gt;An unparalleled distribution mechanism&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
The cost of acquiring a user on Facebook is orders of magnitude cheaper on Facebook than on the web at large.  Facebook effectively offers a single sign-on solution with their API, on top of which it only takes one click for a user to add an application to their profile.  &lt;/p&gt;

&lt;p&gt;
OpenSocial clears this hurdle, too.  If I'm in MySpace an OpenSocial widget will use MySpace's login system to verify whatever it needs to verify.  Ok, so I can install OpenSocial widgets in one click.  Once a user has decided they want to add my widget the barrier to entry is minimal.
&lt;/p&gt;

&lt;p&gt;
Distribution, however, deals with the step before this.  How do I get information about my application in front of users to begin with?
&lt;/p&gt;

&lt;h3&gt;It's About Distribution, Stupid!&lt;/h3&gt;
&lt;p&gt;
Last September something remarkable happened: Facebook &lt;a href="http://blog.facebook.com/blog.php?post=2207967130"&gt;launched the newsfeed and mini-feed&lt;/a&gt;.  Although most people didn't realize it at the time this effectively "activated" the social network underlying Facebook, making it possible for information to flow efficiently through the connections in that network.  Information that I used to have to seek out now came to me without any effort on my part!
&lt;/p&gt;

&lt;p&gt;
This made it possible for Facebook to become a distribution engine.  Through the newsfeed Facebook could, in theory, distribute anything: advertisements, my friends' activity, and even software.  &lt;/p&gt;

&lt;p&gt;
Heck, Facebook could partner with local governments to send out public health announcements to local Facebook users.  
This is powerful stuff and it's what drove the Facebook-made photos and events applications to be larger than Flickr and Evite, respectively.
&lt;/p&gt;

&lt;p&gt;
So when Facebook opened up the platform plenty of people knew it would be possible for them to achieve the same kind of success.  iLike saw &lt;a href="http://mashable.com/2007/06/11/ilike-facebook-app-success/"&gt;three million users&lt;/a&gt; add their application in the first two weeks.  Even now, three and a half months later, it's possible to get &lt;a href="http://adonomics.com/about/17603244640"&gt;a million users in less than a month&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
You might not hear stories like iLike's any more but that doesn't mean the Facebook platform isn't growing.  The number of application installs across all of Facebook has been remarkably consistent since the launch of the platform, growing at a rate of about &lt;strong&gt;1.5%&lt;/strong&gt; or &lt;strong&gt;2.96 million&lt;/strong&gt; installs per day.
&lt;/p&gt;

&lt;img class="math" src='http://20bits.com/wp-content/uploads/2007/11/installs2.png' alt='installs1.png' /&gt;

&lt;p&gt;
That's right.  Every day almost &lt;strong&gt;3 million&lt;/strong&gt; people click that big blue "Install" button and don't remove the application. And this growth shows absolutely no sign of slowing.
&lt;/p&gt;

&lt;h3&gt;OpenSocial vs. Facebook&lt;/h3&gt;
&lt;p&gt;
To quote Dave McClure, "&lt;a href="http://500hats.typepad.com/500blogs/2007/08/facebook-not-fo.html"&gt;open is not better, better is better!&lt;/a&gt;"
&lt;/p&gt;

&lt;p&gt;
Ignoring questions of the quality of the users, OpenSocial's incorporation of MySpace &lt;em&gt;appears&lt;/em&gt; to give them the edge.  MySpace, after all, has 200 million users and Facebook has 50 million.  But the size of the potential userbase doesn't matter nearly as much as the ability for an application developer to activate that userbase.
&lt;/p&gt;

&lt;p&gt;
Peter Chane, the Group Product Manager at Google for OpenSocial, &lt;a href="http://www.insidefacebook.com/2007/11/02/comparing-facebook-platform-to-opensocial-interview-with-peter-chane-group-product-manager-at-google/"&gt;has stated&lt;/a&gt; that OpenSocial will have some components of Facebook's distribution system (newsfeed/mini-feed) but not others (notifications/requests).
&lt;/p&gt;

&lt;p&gt;
Whether the OpenSocial model can surpass Facebook-level distribution is &lt;em&gt;the key question&lt;/em&gt;.  It's not about how many users are using the social networks on OpenSocial &amp;mdash; it's not even clear that there's going to be any real interaction between users on different networks, anyhow.  It's not about whether OpenSocial is more or less "open" than Facebook.  It's about whether developers can build high-quality applications (not just widgets) using the OpenSocial technology and distribute them efficiently through the social graph.
&lt;/p&gt;

&lt;p&gt;
There are all sorts of unknowns even though OpenSocial is a week old. How does OpenSocial's feed-based system compare to Facebook's newsfeed? How does the quality of the social graph wired up with OpenSocial compare to Facebook's? What impact do these factors have on the efficiency of distribution?
&lt;/p&gt;

&lt;h3&gt;In Closing...&lt;/h3&gt;
&lt;p&gt;
The point is this: don't be blinded by the big numbers of MySpace.  Until OpenSocial shows it can activate those users in a way that is more viral than Facebook it's an unproven technology, even ignoring the fact that as of this post OpenSocial doesn't provide any meaningful way to interact between containers.  If you're on MySpace you're not going to be able to switch to another social network and take your MySpace data with you.
&lt;/p&gt;

&lt;p&gt;
That it's been a week since Google launched OpenSocial and the only story about iLike is not that they're on track to get another windfall of users via OpenSocial, but that their &lt;a href="http://www.techcrunch.com/2007/11/05/opensocial-hacked-again/"&gt;Ning application&lt;/a&gt; has been hacked, makes me believe Facebook's king can more than meet the threat from Google's attack.
&lt;/p&gt;</description>
      <pubDate>Wed, 07 Nov 2007 02:59:09 +0000</pubDate>
      <link>http://20bits.com/article/memo-to-opensocial-its-about-distribution-stupid</link>
    </item>
    <item>
      <title>Graph Theory: Part III (Facebook)</title>
      <description>&lt;p&gt;
In the &lt;a href="/article/graph-theory-part-i-introduction/"&gt;first&lt;/a&gt; and &lt;a href="http://20bits.com/articles/graph-theory-part-ii-linear-algebra"&gt;second&lt;/a&gt; parts of my series on graph theory I defined graphs in the abstract, mathematical sense and connected them to matrices.  In this part we'll see a real application of this connection: determining influence in a social network.
&lt;/p&gt;

&lt;p&gt;
Recall that a graph is a collection of vertices (or nodes) and edges between them.  The vertices are abstract nodes and edges represent some sort of relationship between them.  In the case of a social network the vertices are people and the edges represent a kind of social or person-to-person relationship.  "Friends of," "married to," "slept with," and "went to school with" are all examples of possible relationships that could determine edges in a social graph.
&lt;/p&gt;

&lt;p&gt;
So, right away, you can see how this applies to Facebook.  They have a huge collection of just this sort of data: who is friends with whom, who is in a relationship with whom, who is married to whom, who went to college with whom, etc.  Can anything useful be done with this?
&lt;/p&gt;

&lt;h3&gt;What Can Be Done With a Social Graph&lt;/h3&gt;
&lt;p&gt;
Let's step back and think like a marketer for a second.  Facebook, thanks to the newsfeed, is essentially a word-of-mouth engine.  Everything I do, from installing applications to commenting on photos, is broadcast to all my friends via the newsfeed.  Intuitively, however, we know that some people are just more influential than others.&lt;/p&gt;
&lt;p&gt;
If my cool friend writes a note about an awesome new shop he found in the &lt;a href="http://en.wikipedia.org/wiki/Lower_Haight,_San_Francisco,_California"&gt;Lower Haight&lt;/a&gt; I'm probably going to pay more attention.  People like this, who are influential and highly connected, are a marketer's dream.  If I can identify and target these people, "infecting" them with my marketing, I'll get ten times the return I would going after random people in my target demographic.
&lt;/p&gt;

&lt;p&gt;
Facebook is almost certainly doing something like this already with respect to the newsfeed.  They process billions of newsfeed items per day.  How do they know which messages are most important to me?  Well, it stands to reason that the messages that are most important to me are the ones from the &lt;em&gt;people&lt;/em&gt; who are most important to me.  So, as Facebook, I want to be able to calculate the relative level of importance of a person's friends and use that measurement to weight whether their newsfeed items get displayed for their friends.
&lt;/p&gt;

&lt;p&gt;
There are several problems. Can we come up with a good measure of social importance or influence?  Are there multiple measures, and if so, what are their relative merits?
&lt;/p&gt;

&lt;h3&gt;Measurements of Social Influence&lt;/h3&gt;

&lt;p&gt;
Let's start simple.  One way to measure influence is connectivity.  People who have lots of friends tend to have more influence (indeed, it's possible they have more friends precisely because they &lt;em&gt;are&lt;/em&gt; influential).  Recall from the first part that the &lt;em&gt;degree&lt;/em&gt; of a node in a graph is the number of other nodes to which it is connected.  For a graph where "is friends with" is the edge relationship then the degree corresponds to the &lt;em&gt;number of friends&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;Degree Centrality&lt;/h3&gt;
&lt;p&gt;
Let's call this influence function I&lt;sub&gt;d&lt;/sub&gt; ("d" for degree).  Thus, if &lt;em&gt;p&lt;/em&gt; is a person then I&lt;sub&gt;d&lt;/sub&gt;(p) is the measure of their influence.  Mathematically we get
&lt;img class="math" src='http://assets.20bits.com/misc/degree-influence.png' alt='degree-influence.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
The main advantage of this is that it's dead-simple to calculate.  If you represent your graph as an adjacency matrix, as in &lt;a href="/article/graph-theory-part-ii-linear-algebra"&gt;the second part of this series&lt;/a&gt;, then the influence of a node is just the row-sum of the corresponding row — an operation which is very fast and easily parallelizable.
&lt;/p&gt;
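&lt;p&gt;As a sketch (in Ruby rather than the Python used later in this article, and with a made-up four-person graph), degree centrality really is just a row sum:&lt;/p&gt;

```ruby
# Adjacency matrix for a hypothetical 4-person "is friends with" graph.
# Person 0 is friends with everyone; person 1 is friends only with person 0.
adjacency = [
  [0, 1, 1, 1],
  [1, 0, 0, 0],
  [1, 0, 0, 1],
  [1, 0, 1, 0],
]

influence = adjacency.map { |row| row.sum }
puts influence.inspect # [3, 1, 2, 2]
```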

&lt;p&gt;
The downside of this is that it's naive.  Consider the following graphs.
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;h4&gt;Single person with high degree&lt;/h4&gt;
&lt;img src='http://assets.20bits.com/misc/high-degree.png' alt='high-degree.png'/&gt;
&lt;/div&gt;
&lt;div class="math"&gt;
&lt;h4&gt;Single person low degree but high connectivity&lt;/h4&gt;
&lt;img src='http://assets.20bits.com/misc/low-degree.png' alt='low-degree.png'/&gt;
&lt;/div&gt;

&lt;p&gt;
Using I&lt;sub&gt;d&lt;/sub&gt; as a measure of influence the first person, p&lt;sub&gt;1&lt;/sub&gt;, has a higher measure of influence because they are directly connected to eight people.  The second person, p&lt;sub&gt;2&lt;/sub&gt;, however, has the potential to influence up to 9 people.  This happens in the real world, too.  Consider a corporate hierarchy in a large company.  The CEO only has direct relationships with his board, the VPs, and maybe a few other employees.  He is undeniably more influential than an administrative assistant to the deputy regional director of sales for Southern Montana and yet might have fewer direct connections.
&lt;/p&gt;

&lt;h3&gt;Using Eigenvalues&lt;/h3&gt;
&lt;p&gt;
One way to capture this sort of indirect influence is to use a measurement called eigenvalue (or eigenvector) centrality.  The idea is this:  a person's influence is proportional to total influence of the people to whom he is connected.  We'll call this influence measure I&lt;sub&gt;e&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
Let's say I'm the CEO at X Corp.  There are four VPs each of whom has an influence of 5 and these are the people to whom I am connected directly.  Then this measure says that there is some number λ such that &lt;img class="math" src='http://assets.20bits.com/misc/ceo-eigen.png' alt='ceo-eigen.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
λ determines how much influence people share with each other through their connections.  If λ is small then the CEO has a lot of influence, if it is large then he has little.  How do we calculate λ?
&lt;/p&gt;

&lt;p&gt;
Let G be a social network or social graph, where vertices are people and edges are some sort of social relationship, and let A be the adjacency matrix of G.  If there are N people in the social network labeled p&lt;sub&gt;1&lt;/sub&gt;, p&lt;sub&gt;2&lt;/sub&gt;,... then we can generalize the above and say that 
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/eigenvalue-eqn.png' alt='eigenvalue-eqn.png'/&gt;

&lt;p&gt;
Remember that A&lt;sub&gt;i,j&lt;/sub&gt; is 1 if p&lt;sub&gt;i&lt;/sub&gt; and p&lt;sub&gt;j&lt;/sub&gt; are joined by an edge and 0 otherwise.  It's also important to notice that λ is a function of the &lt;em&gt;graph&lt;/em&gt; not of any individual node.
&lt;/p&gt;

&lt;p&gt;
If we call x&lt;sub&gt;i&lt;/sub&gt; = I&lt;sub&gt;e&lt;/sub&gt;(p&lt;sub&gt;i&lt;/sub&gt;) then we can form a vector &lt;b&gt;x&lt;/b&gt; whose i&lt;sup&gt;th&lt;/sup&gt; coordinate is the influence of the i&lt;sup&gt;th&lt;/sup&gt; person.  We can rewrite the above equation using vectors and matrices, then, as follows:
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/eigen-recip.png' alt='eigen-recip.png'/&gt;

&lt;p&gt;
or&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/eigenvector-eqn.png' alt='eigen-recip.png'/&gt;

&lt;h3&gt;Eigenvalue Problem&lt;/h3&gt;
&lt;p&gt;
As you may remember from the &lt;a href="/article/graph-theory-part-ii-linear-algebra"&gt;second part&lt;/a&gt; of this series, this is the classical eigenvalue equation (hence the name of this influence measure).
&lt;/p&gt;
&lt;p&gt;
Calculating someone's influence according to I&lt;sub&gt;e&lt;/sub&gt; is therefore equivalent to calculating what is known as the "principal component" or "principal eigenvector."  Luckily for us there are tons of &lt;a href="http://en.wikipedia.org/wiki/Eigenvalue_algorithm"&gt;eigenvalue algorithms&lt;/a&gt; out there.
&lt;/p&gt;

&lt;h3&gt;The Power Method&lt;/h3&gt;
&lt;p&gt;
The easiest, but not necessarily the most efficient, way of calculating the principal eigenvalue and eigenvector is called the &lt;a href="http://en.wikipedia.org/wiki/Power_method"&gt;power method&lt;/a&gt;.  The idea is that if you repeatedly multiply a starting vector by the matrix A, normalizing at each step, the result converges to the principal eigenvector.
&lt;/p&gt;

&lt;p&gt;
You can read the mathematical justification for why this works on Wikipedia.  Here is some Python code which takes as its input the adjacency matrix of a graph, an initial starting vector, and the level of error you're willing to permit:
&lt;/p&gt;
&lt;pre class="code" lang="python"&gt;from numpy import *

def norm2(v):
	return sqrt(v.T*v).item()
	
def PowerMethod(A, y, e):
	while True:
		v = y/norm2(y)
		y = A*v
		t = dot(v.T,y).item()
		if norm2(y - t*v) &lt;= e*abs(t):
			return (t, v)&lt;/pre&gt;

&lt;p&gt;
The upside to using this method is that it is relatively easy to compute (Google uses a variant of this to calculate PageRank, for example) and that it encompasses more subtleties about how nodes possibly influence each other.  The downsides are mostly technical: there are certain situations where the power method fails to converge, for example when the two largest eigenvalues have the same absolute value.
&lt;/p&gt;
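&lt;p&gt;
To make this concrete, here is a small sketch of the same iteration written with plain numpy arrays rather than the matrix class above.  The network is invented for illustration: people 0, 1, and 2 all know each other, and person 3 knows only person 2.
&lt;/p&gt;

```python
import numpy as np

# A made-up network: a triangle of people 0, 1, 2, plus
# person 3, who knows only person 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1  # undirected, so A is symmetric

def power_method(A, y, e):
    # The same loop as PowerMethod above, for 1-D numpy arrays.
    while True:
        v = y / np.linalg.norm(y)
        y = A @ v
        t = v @ y  # Rayleigh quotient: current eigenvalue estimate
        if np.linalg.norm(y - t * v) <= e * abs(t):
            return t, v

eigenvalue, influence = power_method(A, np.ones(n), 1e-9)
# Person 2 is in the triangle *and* has the extra connection, so
# this measure ranks them as the most influential.
```

Note that the power method needs a strictly dominant eigenvalue: on a bipartite graph like the CEO-and-four-VPs star, where +λ and -λ are both eigenvalues, this loop would cycle forever.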

&lt;h3&gt;Other Measures&lt;/h3&gt;
&lt;p&gt;
There are other measures, too.  The two most common are called &lt;em&gt;betweenness&lt;/em&gt; and &lt;em&gt;closeness&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Without going into detail, the &lt;em&gt;betweenness&lt;/em&gt; of a node P is the number of shortest paths between pairs of other nodes that pass through P, usually normalized by the total number of such shortest paths.  Nodes that bridge otherwise separate regions of the graph therefore have a high betweenness score.  The &lt;em&gt;closeness&lt;/em&gt; of a node P is the average length of the shortest path from P to all nodes which are connected to it by some path.
&lt;/p&gt;

&lt;p&gt;
Both of these measurements are fairly sophisticated and difficult to calculate for large graphs.
&lt;/p&gt;
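&lt;p&gt;
On small graphs, though, closeness is easy to compute directly with a breadth-first search.  Here is an illustrative sketch (the graph and the function are mine, not part of any library); betweenness needs more bookkeeping, so it's omitted.
&lt;/p&gt;

```python
from collections import deque

def closeness(adj, p):
    """Average shortest-path length from p to every node reachable
    from it; adj maps a node to the set of its neighbors."""
    dist = {p: 0}
    queue = deque([p])
    while queue:               # breadth-first search from p
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    reachable = [d for node, d in dist.items() if node != p]
    return sum(reachable) / len(reachable) if reachable else 0.0

# A path graph 0 - 1 - 2 - 3: the middle nodes are "closer" to
# everyone else than the endpoints are.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
closeness(path, 0)  # (1 + 2 + 3) / 3 = 2.0
closeness(path, 1)  # (1 + 1 + 2) / 3 = 1.33...
```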

&lt;h3&gt;Experimental Measurement&lt;/h3&gt;
&lt;p&gt;
The downside to all these measures is that they only take into account the topology of the graph.  That is, they ignore the fact that the nodes, as people, are performing actions.  In the case of Facebook we can measure influence directly by measuring how activity spreads through the network.  Let's say we have a person P with friends F1, F2, F3, etc.  As Facebook we could send out a message on behalf of person P to his friends and count how many act on the message.
&lt;/p&gt;

&lt;p&gt;
Statistically we can model this using a &lt;a href="http://en.wikipedia.org/wiki/Random_variable"&gt;random variable&lt;/a&gt;.  Let X&lt;sub&gt;i&lt;/sub&gt; be the number of people "influenced" by a message sent on behalf of p&lt;sub&gt;i&lt;/sub&gt;, the i&lt;sup&gt;th&lt;/sup&gt; person on our network.  We can then calculate the &lt;a href="http://en.wikipedia.org/wiki/Expected_value"&gt;expected value&lt;/a&gt; of X&lt;sub&gt;i&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
We can then take into account information from the profile data and answer questions like "What is the expected number of people this person will influence given he is a male, 32, a marketing executive who graduated from Harvard with 42 friends and 102 wall posts?"
&lt;/p&gt;
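&lt;p&gt;
As a sketch, estimating E[X&lt;sub&gt;i&lt;/sub&gt;] from observed data is just averaging, and conditioning on profile attributes is averaging within a segment.  All of the numbers below are invented:
&lt;/p&gt;

```python
# Hypothetical observations: how many friends acted on each of
# several messages sent on behalf of one person.
influenced_counts = [3, 0, 5, 2, 4, 1, 3]

# The sample mean estimates E[X_i], this person's expected influence.
expected_influence = sum(influenced_counts) / len(influenced_counts)

# Conditioning on profile attributes is just averaging within a
# segment; the attributes and counts here are made up.
observations = [
    ({"gender": "M", "grad": "Harvard"}, 6),
    ({"gender": "M", "grad": "Harvard"}, 4),
    ({"gender": "F", "grad": "Stanford"}, 2),
]
harvard_men = [x for attrs, x in observations
               if attrs["gender"] == "M" and attrs["grad"] == "Harvard"]
conditional_expectation = sum(harvard_men) / len(harvard_men)  # 5.0
```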

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
These techniques work on any graph, including Facebook's social graph.  When you have a graph as complete as Facebook's you can do a lot of interesting things.  Imagine I'm a marketer who wants a sponsored newsfeed item.  Facebook can charge a premium because they're able to target the influencers using techniques like the ones above.
&lt;/p&gt;

&lt;p&gt;
Of course I can't say whether Facebook is using some, none, or all of the techniques I described.  But that doesn't mean application developers can't.  By keeping track of who influences whom you can use these techniques to maximize your exposure.  Fancy that!
&lt;/p&gt;</description>
      <pubDate>Fri, 02 Nov 2007 07:00:28 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-iii-facebook</link>
    </item>
    <item>
      <title>Graph Theory: Part II (Linear Algebra)</title>
      <description>&lt;p&gt;
This is the second part in my series on graph theory.  &lt;a href="/article/graph-theory-part-i-introduction"&gt;Part I&lt;/a&gt; included the basic definitions of graph theory, gave some concrete examples where one might want to use graph theory to tackle a problem, and concluded with some common objects one finds doing graph theory.
&lt;/p&gt;

&lt;p&gt;
I'm going to cover three things in this post: vector spaces, linear transformations and matrices, and eigenvectors and eigenvalues.
&lt;/p&gt;

&lt;h3&gt;Vector Spaces&lt;/h3&gt;
&lt;p&gt;
Linear algebra is the study of vector spaces and their transformations.  No good mathematical text can begin without definitions, so let's dive in:
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let &lt;b&gt;R&lt;/b&gt; be the &lt;a href="http://en.wikipedia.org/wiki/Real_numbers"&gt;real numbers&lt;/a&gt;.  A &lt;em&gt;vector space&lt;/em&gt; over the reals is a set V and two &lt;a href="http://en.wikipedia.org/wiki/Binary_operation"&gt;binary operations&lt;/a&gt; +: V×V → V and ·: &lt;b&gt;R&lt;/b&gt;×V → V, called &lt;em&gt;vector addition&lt;/em&gt; and &lt;em&gt;scalar multiplication&lt;/em&gt;, respectively, which satisfy the following
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;(V,+) is an &lt;a href="http://en.wikipedia.org/wiki/Abelian_group"&gt;Abelian group&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication distributes over vector addition&lt;/strong&gt;&lt;p&gt;Formally, α·(v+w) = α·v + α·w for all α ∈ &lt;b&gt;R&lt;/b&gt; and v,w ∈ V.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector addition distributes over scalar multiplication&lt;/strong&gt;
&lt;p&gt;Formally, (α + β)·v = α·v + β·v for all α,β ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication is compatible with vector addition&lt;/strong&gt;
&lt;p&gt;Formally, α·(β·v) = (αβ)·v for all α,β ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication has an identity element&lt;/strong&gt;&lt;p&gt;Formally, 1·v = v for all v ∈ V, where 1 is the multiplicative identity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
These properties aren't arbitrary, even though they might look like it.  The most common vector space is n-dimensional Euclidean space &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt;.  &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;2&lt;/sup&gt; is the Cartesian plane we grew up with in grade school and &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt; is the three-dimensional space we live in every day.
&lt;/p&gt;

&lt;p&gt;
A vector is an element of a vector space.  For &lt;b&gt;x&lt;/b&gt; ∈ &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt; we write &lt;b&gt;x&lt;/b&gt; = (x&lt;sub&gt;1&lt;/sub&gt;, x&lt;sub&gt;2&lt;/sub&gt;, ..., x&lt;sub&gt;n&lt;/sub&gt;), i.e., an ordered tuple of n components.  For example, (1,2,3) is an element of &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt;, as is (√2, 5/3, 2.12).
&lt;/p&gt;

&lt;p&gt;
Let's work in &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt;.  Take v = (1,0,0) and u = (1,1,0).  Then v+u = (1+1, 0+1, 0+0) = (2,1,0).  That is, addition on &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt; is just coordinate-wise addition from &lt;b&gt;R&lt;/b&gt;.  Scalar multiplication works the same way, so that √3 · (1,10,0) = (√3,10√3,0).
&lt;/p&gt;
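&lt;p&gt;
The same arithmetic carries over directly to code; here's a quick sketch using numpy arrays (an illustration, not part of the definitions above):
&lt;/p&gt;

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])
u = np.array([1.0, 1.0, 0.0])

# Vector addition is coordinate-wise addition.
assert (v + u == np.array([2.0, 1.0, 0.0])).all()

# Scalar multiplication scales every coordinate.
w = np.sqrt(3) * np.array([1.0, 10.0, 0.0])
# w == (sqrt(3), 10*sqrt(3), 0)
```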

&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;. If we're just talking about some abstract vector &lt;b&gt;v&lt;/b&gt; in some vector space V over the reals then we refer to the i&lt;sup&gt;th&lt;/sup&gt; coordinate of &lt;b&gt;v&lt;/b&gt; as v&lt;sub&gt;i&lt;/sub&gt;.
&lt;/p&gt;

&lt;h3&gt;Linear Transformations and Matrices&lt;/h3&gt;
&lt;p&gt;
Any time you see a mathematical object you should immediately ask yourself, "What are the transformations of this object?"  In linear algebra the transformations between vector spaces are called &lt;em&gt;linear transformations&lt;/em&gt; (boring, eh?).  The definition:
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let V,W be vector spaces over the reals.  A function f: V → W is a &lt;em&gt;linear transformation&lt;/em&gt; if it satisfies the following conditions:
&lt;ol&gt;
&lt;li&gt;f(v+u) = f(v) + f(u) for all v,u ∈ V&lt;/li&gt;
&lt;li&gt;f(α·v) = α·f(v) for all α ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
It turns out that every linear transformation can be written as a &lt;a href="http://en.wikipedia.org/wiki/Matrix_(mathematics)"&gt;matrix&lt;/a&gt;&lt;span class="footnote"&gt;Yeah, yeah, the matrix representation is only unique up to a choice of &lt;a href="http://en.wikipedia.org/wiki/Basis_(linear_algebra)"&gt;basis&lt;/a&gt;.&lt;/span&gt;.  Remember those guys from high-school algebra?
&lt;/p&gt;

&lt;p&gt;
Matrices are nice because &lt;em&gt;matrix multiplication corresponds to the composition of linear transformations&lt;/em&gt;.  That is, let V be a vector space over the reals and let f and g be linear transformations on V whose matrices are A and B respectively.  Then, if &lt;b&gt;v&lt;/b&gt; ∈ V, we have (A · B)&lt;b&gt;v&lt;/b&gt; = f(g(&lt;b&gt;v&lt;/b&gt;)).  Matrix multiplication is a well-defined, computationally simple operation, whereas the composition of linear transformations is comparatively difficult to work with directly.&lt;/p&gt;
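&lt;p&gt;
Here is a small sketch of that correspondence, with two made-up 2×2 matrices:
&lt;/p&gt;

```python
import numpy as np

# Two linear maps on R^2 given by matrices A and B; the entries
# are arbitrary, chosen only for illustration.
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
B = np.array([[0.0, -1.0],
              [1.0,  0.0]])  # rotation by 90 degrees

f = lambda x: A @ x
g = lambda x: B @ x

v = np.array([3.0, 4.0])

# Composing the maps agrees with multiplying the matrices first.
assert np.allclose((A @ B) @ v, f(g(v)))
```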

&lt;p&gt;
I don't want to write a full-on tutorial about multiplying matrices, so I recommend reading the &lt;a href="http://en.wikipedia.org/wiki/Matrix_multiplication"&gt;Wikipedia article&lt;/a&gt; on the subject.  Here's an example of a matrix with two rows and three columns, i.e., a 2×3 matrix.  All of the entries are assumed to be real numbers.
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/matrix.png' alt='matrix.png'/&gt;

&lt;p&gt;
This takes as its input a three-dimensional vector and outputs a two-dimensional vector.  In general, an m×n matrix (one with m rows and n columns) takes as its input an n-dimensional vector and outputs an m-dimensional vector.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let A be an m×n matrix over the reals.  Then the entry in the i&lt;sup&gt;th&lt;/sup&gt; row and j&lt;sup&gt;th&lt;/sup&gt; column is denoted as a&lt;sub&gt;ij&lt;/sub&gt;.  One might even write A = (a&lt;sub&gt;ij&lt;/sub&gt;).
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let A be an m×n matrix over the reals.  The &lt;em&gt;transpose&lt;/em&gt; of A, denoted A&lt;sup&gt;T&lt;/sup&gt;, is the matrix B defined by b&lt;sub&gt;ij&lt;/sub&gt; = a&lt;sub&gt;ji&lt;/sub&gt;, i.e., the matrix in which the rows and columns are swapped.
&lt;/p&gt;

&lt;p&gt;
The transpose is important for all sorts of things, as we'll see.
&lt;/p&gt;

&lt;h3&gt;Eigenvalues and Eigenvectors&lt;/h3&gt;
&lt;p&gt;
Eigenvalues and eigenvectors are two of the most important objects in linear algebra.  They are defined as follows.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let A be a square n×n matrix and let &lt;b&gt;v&lt;/b&gt; ∈ &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt;.  &lt;b&gt;v&lt;/b&gt; is an eigenvector of A if there exists a real number λ such that
&lt;img class="math" src='http://assets.20bits.com/misc/eigenvector.png' alt='eigenvector.png'/&gt;
λ is called an &lt;em&gt;eigenvalue of A&lt;/em&gt; and &lt;b&gt;v&lt;/b&gt; is the &lt;em&gt;eigenvector corresponding to λ&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Who cares about eigenvectors?  They're useful for all sorts of things.  Calculating a webpage's &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt;, for example, is really a problem of finding a certain eigenvector.  You can read about &lt;a href="http://en.wikipedia.org/wiki/Eigenvalue_algorithm"&gt;algorithms to calculate eigenvalues&lt;/a&gt;, but I'll cover more of that ground in Part III.
&lt;/p&gt;
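&lt;p&gt;
If you just want to see the definition in action, numpy will compute eigenvalues and eigenvectors for you.  A quick sketch with an arbitrary symmetric matrix:
&lt;/p&gt;

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # an arbitrary symmetric matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
# The columns of `eigenvectors` are the eigenvectors: A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
# The eigenvalues of this matrix are 3 and 1 (order not guaranteed).
```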

&lt;p&gt;
For now, just keep this idea in your head.  We're going to be doing something very much like PageRank to calculate a person's standing in a social network.
&lt;/p&gt;

&lt;h3&gt;Graphs and Matrices&lt;/h3&gt;
&lt;p&gt;
So far I've not even mentioned graphs, so you're probably wondering what the hell any of this has to do with graph theory.  It turns out every graph has several associated matrices that are very useful for analyzing the properties of that graph.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph with n vertices and m edges, i.e., V = {v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;} and E = {e&lt;sub&gt;1&lt;/sub&gt;, ..., e&lt;sub&gt;m&lt;/sub&gt;}.  The &lt;em&gt;incidence matrix&lt;/em&gt; of G is the m×n matrix C(G) = (c&lt;sub&gt;ij&lt;/sub&gt;) defined by
&lt;img class="math" src='http://assets.20bits.com/misc/incidence.png' alt='incidence.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph with n vertices, V = {v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;}.  The &lt;em&gt;adjacency matrix&lt;/em&gt; of G is the n×n square matrix A(G) = (a&lt;sub&gt;ij&lt;/sub&gt;) defined by 
&lt;img class="math" src='http://assets.20bits.com/misc/adjacency.png' alt='adjacency.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be an undirected graph with n vertices.  The &lt;em&gt;degree matrix&lt;/em&gt; of G, D(G), is the matrix defined by
&lt;img class="math" src='http://assets.20bits.com/misc/degree.png' alt='degree.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
These are the three basic matrices associated with a graph.  The incidence matrix encapsulates vertex-edge relationships, the adjacency matrix encapsulates vertex-vertex relationships, and the degree matrix encapsulates information about the degrees.  For a concrete example consider the cycle graph C&lt;sub&gt;4&lt;/sub&gt;:&lt;img class="math" src='http://assets.20bits.com/misc/cycle-labeled.png' alt='cycle-labeled.png'/&gt;&lt;/p&gt;

&lt;p&gt;
Here are the corresponding matrices: &lt;img class="math" src='http://assets.20bits.com/misc/matrices.png' alt='matrices.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
There's one last matrix that is worth knowing.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph.  The &lt;em&gt;Laplacian matrix&lt;/em&gt; or &lt;em&gt;graph Laplacian&lt;/em&gt; for G is defined as L(G) = D(G) - A(G), where D is the degree matrix and A is the adjacency matrix.
&lt;/p&gt;
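&lt;p&gt;
Here is a sketch of these matrices for C&lt;sub&gt;4&lt;/sub&gt; in code.  I'm assuming the vertices are labeled in cycle order, which may differ from the labeling in the figure:
&lt;/p&gt;

```python
import numpy as np

# The cycle C4 with vertices 0..3 and edges forming 0-1-2-3-0.
# (The figure's labeling may differ; this ordering is an assumption.)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

A = np.zeros((n, n))           # adjacency: vertex-vertex relationships
for i, j in edges:
    A[i, j] = A[j, i] = 1

C = np.zeros((len(edges), n))  # incidence: edge-vertex relationships
for e, (i, j) in enumerate(edges):
    C[e, i] = C[e, j] = 1

D = np.diag(A.sum(axis=1))     # degree matrix: every vertex has degree 2

L = D - A                      # the graph Laplacian
```

Each row of the Laplacian sums to zero, since a vertex's degree equals the number of its neighbors.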

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
It turns out that lots of information about the graph is stored in these matrices.  From these matrices we can calculate things like the number of spanning trees, the algebraic connectivity, etc.  Most of these are well-defined eigenvalue problems, and so are computationally feasible.
&lt;/p&gt;

&lt;p&gt;
In Part III we'll use these matrices to tackle the problem of influence or prestige in social networks.
&lt;/p&gt;</description>
      <pubDate>Thu, 02 Aug 2007 14:56:59 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-ii-linear-algebra</link>
    </item>
    <item>
      <title>Rules of Thumb for Successful Facebook Applications</title>
      <description>&lt;p&gt;
Creating &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt; has given me the opportunity to see what apps succeed and why.  Here are some rules of thumb to consider when writing your Facebook app.
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;strong&gt;The Complexity Ceiling&lt;/strong&gt;&lt;p&gt;
Facebook is simple.  The features on Facebook are simple, even compared to similar features on other sites, e.g., Flickr vs. Photos.  My hypothesis is that no application more complex than the most complex feature on Facebook will succeed. Compare three similar apps: &lt;a href="http://appaholic.com/display/2395952879+2949245143+2481647302"&gt;Bookshelf, Bookshare, and Visual Bookshelf&lt;/a&gt;.  Of these, Visual Bookshelf started the latest, is the simplest, and has two orders of magnitude more users than the other two.  The simpler you make your app, the better it will do.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Try to Be Social&lt;/strong&gt;&lt;p&gt;
Just because you've built it doesn't mean the users will come.  Facebook is still first and foremost a social platform and that's why people are using it.  If your application has no social component it's just going to flop.  Not only will it be difficult to spread virally but there's little compelling reason for people to install it.  For example, let's say you have a blog and want to promote its content with a Facebook app.  You should &lt;em&gt;not&lt;/em&gt; just create a Facebook app which displays your blog posts in people's profiles; rather, try to add a social component.  What are my favorite articles?  What have I commented on?  It's then easy to see what my &lt;em&gt;friends&lt;/em&gt; like and what &lt;em&gt;they&lt;/em&gt; have commented on.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;If You Can't Be Social, Be Viral&lt;/strong&gt;&lt;p&gt;
Writing good social applications is hard.  If it weren't Facebook wouldn't be worth what it is.  You can, however, write an application which does nothing more than spread itself.  If your idea is funny enough, like &lt;a href="http://apps.facebook.com/apps/application.php?id=2458301688" rel="nofollow"&gt;Vampires&lt;/a&gt; or &lt;a href="http://apps.facebook.com/apps/application.php?id=2341504841" rel="nofollow"&gt;Zombies&lt;/a&gt;, people will use it.  It's still too early to tell whether these apps have staying power, but you'll at least get your fifteen minutes of Facebook fame.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Don't Be Too Weird&lt;/strong&gt;&lt;p&gt;
People are used to how Facebook works.  If your application is totally foreign they just won't understand it, even if it's the most usable, straightforward piece of software.  The top applications tend to fall into one of two broad categories (again, iLike is an exception).  In the first category the applications are viral for virality's sake.  Why this works is self-evident, since the applications exist for no other reason than to spread themselves.  In the second the applications augment or complement an existing Facebook feature.  &lt;a href="http://apps.facebook.com/apps/application.php?id=2425101550" rel="nofollow"&gt;Top Friends&lt;/a&gt;, &lt;a href="http://apps.facebook.com/apps/application.php?id=2439131959" rel="nofollow"&gt;Graffiti&lt;/a&gt;, and &lt;a href="http://apps.facebook.com/apps/application.php?id=2345673396" rel="nofollow"&gt;X Me&lt;/a&gt; all fall into this category, behaving like Facebook's friends list, wall, and poke features, respectively.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Fads Exist&lt;/strong&gt;&lt;p&gt;
Even though the Facebook platform is only two months old fads have already come and gone.  If you're just looking to get a respectable number of users in a short period of time then it's worth paying attention to these fads.  The "quotes application" genre is an example.  There are applications which add quotes to your profile from everything from Star Wars and Family Guy to Friends and Scrubs.  Don't blame me if the fad becomes unpopular and your app fizzles, though.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Quality Matters&lt;/strong&gt;&lt;p&gt;
This might be obvious, but quality matters.  Apps like iLike and Causes provide a set of very high-quality, social features, so even though they're not the most simple apps they're still compelling.  You also have to be prepared to deal with your growth.  You can &lt;a href="http://www.insidefacebook.com/2007/07/17/predicting-growth-with-appaholic/"&gt;model your growth using Appaholic&lt;/a&gt; to predict how many users you'll have in a few days, weeks, or whatever.  If it's more than you anticipated make sure you have the hardware to handle it; users &lt;em&gt;will&lt;/em&gt; uninstall apps that are slow or broken, even if it's not the apps' fault.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
These rules aren't hard-and-fast, and many successful applications break some or all of them.  iLike and Causes, for example, are relatively complex compared to most other applications and even Facebook itself, but they're still wildly successful.  They get to bend the rules because they're extremely social and have high-quality features.  If you can match that level of quality then go for it, but I personally think it's easier to create a simple application and add features as it grows than to create a complex application and hope users don't get lost.  Users are more important than features.
&lt;/p&gt;</description>
      <pubDate>Wed, 01 Aug 2007 15:25:02 +0000</pubDate>
      <link>http://20bits.com/article/rules-of-thumb-for-successful-facebook-applications</link>
    </item>
    <item>
      <title>Graph Theory: Part I (Introduction)</title>
      <description>&lt;p&gt;
This is the first in a multi-part series about graph theory here on 20bits.  This grew out of my wanting to write about some of the mathematical aspects of Facebook, but I realized that many people might not have a sufficient background to just jump right in.  Rather than cover all the ground in one article I decided to break it up into multiple parts.  This is the first part, a quick introduction to graph theory.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://en.wikipedia.org/wiki/Graph_theory"&gt;Graph theory&lt;/a&gt; is a fundamental area of study in discrete mathematics.  As the name implies graph theory is about &lt;em&gt;graphs&lt;/em&gt;, so I'll first define graph and then discuss why people are so interested in studying these critters.  I'm going to assume you're familiar with the idea of &lt;a href="http://en.wikipedia.org/wiki/Ordered_pair"&gt;ordered pairs&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Set"&gt;sets&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Some Definitions&lt;/h3&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;graph&lt;/em&gt; G is a pair (V,E) of sets, called the &lt;em&gt;vertex set&lt;/em&gt; and &lt;em&gt;edge set&lt;/em&gt;.  V is a collection of abstract &lt;em&gt;vertices&lt;/em&gt;, written {v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;,...,v&lt;sub&gt;n&lt;/sub&gt;} and E is a collection of ordered pairs of vertices, called &lt;em&gt;edges&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
As you can see, this is pretty abstract.  The definition leaves you free to decide precisely what the vertices and edges are.  Vertices could be cities and edges could be interstate highways.  An example:
&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph-directed.png' alt='graph-directed.png'/&gt;

&lt;p&gt;
This type of graph is called a &lt;em&gt;directed graph&lt;/em&gt; because some of the edges have a direction, i.e., they only go one way.  Going back to the original definition, we have V = {v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;,v&lt;sub&gt;4&lt;/sub&gt;} and E = {(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;),(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;),(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;4&lt;/sub&gt;),(v&lt;sub&gt;2&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;)}
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph, then we write V(G) to mean the vertex set of G and E(G) to mean the edge set.  We will just write V and E if the context makes it clear which graph we're talking about.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph and let u,v ∈ V(G).  We write v ~ u if there is an edge connecting v to u, i.e., if (v,u) ∈ E(G).
&lt;/p&gt;
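&lt;p&gt;
In code the definition is just as direct.  Here's a sketch of the directed graph above as plain Python sets (my own representation, chosen for illustration):
&lt;/p&gt;

```python
# The directed graph from the figure: the vertex set and the edge
# set of ordered pairs.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (1, 4), (2, 3)}

def connected(v, u):
    """v ~ u: is there an edge from v to u?"""
    return (v, u) in E

connected(1, 2)  # True: the edge (v1, v2) exists
connected(2, 1)  # False: edges here are ordered pairs
```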

&lt;p&gt;
Sometimes we don't care about direction and can make edges directionless.  These sorts of graphs are called &lt;em&gt;undirected graphs&lt;/em&gt; and look like this
&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph.png' alt='graph.png'/&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A graph G is an &lt;em&gt;undirected graph&lt;/em&gt; if u ~ v implies v ~ u for all u,v ∈ V(G).
&lt;/p&gt;

&lt;h3&gt;Concrete examples&lt;/h3&gt;
&lt;p&gt;
I'd be remiss if I kept talking like graph theory is some pie-in-the-sky theoretical abstraction.  In fact, many real-world situations can be modeled using graph theory.  Some examples:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shipping routes&lt;/strong&gt;&lt;p&gt;The vertices are shipping hubs and the edges are the routes between them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social networks&lt;/strong&gt;&lt;p&gt;The vertices are people and the edges are social connections (e.g., p ~ q if p is a friend of q)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telecommunications networks&lt;/strong&gt;&lt;p&gt;The vertices are computers on the network and the edges are the network connections between them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disease transmission&lt;/strong&gt;&lt;p&gt;The vertices are organisms which can carry the disease and the edges represent one organism spreading it to another&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sexual networks&lt;/strong&gt;&lt;p&gt;The vertices are people and the edges denote which pairs have slept together (see, e.g., &lt;a href="http://researchnews.osu.edu/archive/chainspix.htm"&gt;The Structure of Romantic and Sexual Relations at Jefferson High&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;More Definitions&lt;/h3&gt;
&lt;p&gt;
Essentially any situation where you want to consider pairs of objects and the connections between those pairs can be analyzed using graph theory.  There are a few more definitions to cover.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;graph loop&lt;/em&gt; or just &lt;em&gt;loop&lt;/em&gt; is an edge connecting a vertex v to itself, i.e., v ~ v.  A graph with no such loops is called a &lt;em&gt;simple&lt;/em&gt; graph.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;. Some authors allow multiple edges between vertices.  In that setting a graph with no loops and at most one edge between any two vertices is called a &lt;em&gt;simple graph&lt;/em&gt;.  Although this type of graph is less than ideal for analysis it occurs relatively frequently in reality, e.g., two different roads connecting a pair of cities or redundant network connections.  I'll make sure to note where we are dealing with such graphs and how we work around them.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph.  A &lt;em&gt;path&lt;/em&gt; or a &lt;em&gt;walk&lt;/em&gt; is a collection of vertices {v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;, ..., v&lt;sub&gt;k&lt;/sub&gt;} such that v&lt;sub&gt;i&lt;/sub&gt; ~ v&lt;sub&gt;i+1&lt;/sub&gt; for all i, 1 ≤ i &lt; k.  A path with no repeated vertices is called &lt;em&gt;simple&lt;/em&gt;, and a path such that v&lt;sub&gt;k&lt;/sub&gt; ~ v&lt;sub&gt;1&lt;/sub&gt; is called a &lt;em&gt;closed path&lt;/em&gt;, &lt;em&gt;closed walk&lt;/em&gt;, or a &lt;em&gt;cycle&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;weighted graph&lt;/em&gt; G is a graph such that each edge in E(G) has an associated weight, typically a real number.
&lt;/p&gt;

&lt;p&gt;
Weighted graphs are the stuff of many famous algorithms.  There are a whole slew of algorithms dedicated to finding the &lt;a href="http://en.wikipedia.org/wiki/Shortest_path_problem"&gt;shortest path&lt;/a&gt; between two vertices in a weighted graph, where "shortest" means the path with the smallest weight.  The algorithms vary in performance depending on several factors: the ratio of vertices to edges, whether the graph has negative weights, whether we have a good heuristic for determining what a path might cost, etc. 
&lt;/p&gt;
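&lt;p&gt;
As an illustration, here is a minimal sketch of Dijkstra's algorithm, probably the most famous of these shortest-path algorithms.  It requires non-negative weights, and the toy graph below is made up:
&lt;/p&gt;

```python
import heapq

def dijkstra(adj, source):
    """Smallest total weight from source to every reachable vertex.
    adj maps a vertex to a list of (neighbor, weight) pairs; all
    weights must be non-negative."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was found
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# A toy weighted graph: the direct edge a-c costs more than a-b-c.
graph = {
    "a": [("b", 1), ("c", 5)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 5), ("b", 2)],
}
dijkstra(graph, "a")  # {'a': 0, 'b': 1, 'c': 3}
```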

&lt;h3&gt;Important Graphs&lt;/h3&gt;
&lt;p&gt;
There are some graphs every student of computer science or discrete mathematics is just sort of expected to know.
&lt;/p&gt;
&lt;h5&gt;The Complete Graph&lt;/h5&gt;
&lt;p&gt;
The complete graph on n vertices, written K&lt;sub&gt;n&lt;/sub&gt;, has n vertices and an edge for every pair of distinct vertices.  K&lt;sub&gt;4&lt;/sub&gt; looks like this, for example:
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/graph-complete.png' alt='graph-complete.png' /&gt;

&lt;h5&gt;The Cycle&lt;/h5&gt;
&lt;p&gt;
The cycle on n vertices, written C&lt;sub&gt;n&lt;/sub&gt;, is the undirected graph on n vertices which consists of precisely one cycle containing every vertex.  C&lt;sub&gt;4&lt;/sub&gt; looks like &lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/graph-cycle.png' alt='graph-cycle.png' /&gt;

&lt;h5&gt;The Complete Bipartite Graph&lt;/h5&gt;
&lt;p&gt;
The complete bipartite graph, written K&lt;sub&gt;n,m&lt;/sub&gt;, is an undirected graph whose vertex set is the union of two disjoint sets of size n and m, respectively, i.e.,  V(K&lt;sub&gt;n,m&lt;/sub&gt;) = V ∪ U, V ∩ U = ∅, |V| = n, and |U| = m.  Its edges satisfy the property that v ~ u for all v ∈ V and all u ∈ U.  It's easier to draw than to write, I think, so here is K&lt;sub&gt;2,2&lt;/sub&gt;: &lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph-bipartite.png' alt='graph-bipartite.png' /&gt;

&lt;p&gt;
Here V = {v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;} and U = {v&lt;sub&gt;3&lt;/sub&gt;, v&lt;sub&gt;4&lt;/sub&gt;}.  Any graph G whose vertex set V(G) can be partitioned into two sets such that no two vertices in a partition share an edge is called &lt;em&gt;bipartite&lt;/em&gt;, meaning "of two parts."  The complete bipartite graph is the bipartite graph which has as many edges as possible without ceasing to be bipartite.
&lt;/p&gt;
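&lt;p&gt;
Checking whether a graph is bipartite amounts to trying to 2-color it, putting neighbors on opposite sides.  Here's a sketch using breadth-first search (the function and graphs are mine, for illustration):
&lt;/p&gt;

```python
from collections import deque

def is_bipartite(adj):
    """Try to 2-color an undirected graph; adj maps each vertex to
    the set of its neighbors.  Succeeds iff the graph is bipartite."""
    color = {}
    for start in adj:          # handle disconnected graphs too
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # opposite side
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd cycle: not bipartite
    return True

# K_{2,2} (a 4-cycle) is bipartite; the triangle K_3 is not.
k22 = {1: {3, 4}, 2: {3, 4}, 3: {1, 2}, 4: {1, 2}}
k3 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
is_bipartite(k22)  # True
is_bipartite(k3)   # False
```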

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
That's it for the basics.  Hopefully you understand what a graph is, some of the problems graph theory is useful in analyzing, and some of the common constructs in graph theory.  In Part II I'm going to cover the relationship between graph theory and &lt;a href="http://en.wikipedia.org/wiki/Linear_algebra"&gt;linear algebra&lt;/a&gt;.  My ultimate plan is to use this relationship to come up with a way to rank people's influence in social networks, using a measure called &lt;a href="http://en.wikipedia.org/wiki/Eigenvector_centrality"&gt;eigenvector centrality&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Errata&lt;/h3&gt;
&lt;p&gt;
I left out a few definitions yesterday, so here they are.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph and let v ∈ V(G) be a vertex.  The &lt;em&gt;out degree&lt;/em&gt; of v, written deg&lt;sup&gt;-&lt;/sup&gt;(v), is the number of edges directed away from v.  Conversely, the &lt;em&gt;in degree&lt;/em&gt; of v, written deg&lt;sup&gt;+&lt;/sup&gt;(v), is the number of edges directed towards v.  If G is undirected then deg&lt;sup&gt;+&lt;/sup&gt;(v) = deg&lt;sup&gt;-&lt;/sup&gt;(v) for all v ∈ V(G), so we call this number the &lt;em&gt;degree&lt;/em&gt; of v and write it deg(v). 
&lt;/p&gt;
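&lt;p&gt;
A small worked example: take the directed graph with vertices v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;, v&lt;sub&gt;3&lt;/sub&gt; and edges v&lt;sub&gt;1&lt;/sub&gt; → v&lt;sub&gt;2&lt;/sub&gt;, v&lt;sub&gt;1&lt;/sub&gt; → v&lt;sub&gt;3&lt;/sub&gt;, and v&lt;sub&gt;2&lt;/sub&gt; → v&lt;sub&gt;3&lt;/sub&gt;.  Then deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;1&lt;/sub&gt;) = 2 and deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;1&lt;/sub&gt;) = 0, deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;2&lt;/sub&gt;) = deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;2&lt;/sub&gt;) = 1, and deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;3&lt;/sub&gt;) = 0 while deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;3&lt;/sub&gt;) = 2.  Summing either the out degrees or the in degrees counts each edge exactly once, so both sums equal the number of edges (here, 3).
&lt;/p&gt;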

&lt;p&gt;
The plus (+) and minus (-) have to do with the idea of &lt;em&gt;flow&lt;/em&gt;.  If you imagine a directed graph G and some substance diffusing through the graph along the edges then the out degree and the in degree measure the degree to which that vertex loses (-) or accumulates (+) the substance, respectively. 
&lt;/p&gt;</description>
      <pubDate>Tue, 31 Jul 2007 17:28:32 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-i-introduction</link>
    </item>
    <item>
      <title>Appaholic and Inside Facebook</title>
      <description>&lt;p&gt;
This is more an update than an article.  For those who don't know, my big project for the last three weeks has been &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt;, a great utility for graphing the growth of Facebook applications.  It's been on the front pages of &lt;a href="http://digg.com/software/Appaholic_Awesome_Alexa_style_Graphs_for_Facebook_Apps"&gt;digg&lt;/a&gt; and &lt;a href="http://mashable.com/2007/07/07/appaholic/"&gt;Mashable&lt;/a&gt; and has been making the rounds in the blogosphere (today &lt;a href="http://scobleizer.com/2007/07/15/if-only-i-got-1-for-each-google-reader-opened/#comments"&gt;Robert Scoble&lt;/a&gt; linked to it).
&lt;/p&gt;

&lt;p&gt;
I've also been given a guest blogger position at &lt;a href="http://insidefacebook.com/"&gt;Inside Facebook&lt;/a&gt;, a blog about Facebook stuff.  I'm going to put most of my Facebook editorial over there for now and keep 20bits more focused on code and technology.  I think I'm going to write an article about using &lt;a href="http://en.wikipedia.org/wiki/Eigenvector_centrality"&gt;eigenvector centrality&lt;/a&gt; to determine the influencers in a social network, for example.
&lt;/p&gt;

&lt;p&gt;
Anyhow, this was just a heads-up and an explanation for why posting has been slower than before.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 16 Jul 2007 00:37:20 +0000</pubDate>
      <link>http://20bits.com/article/appaholic-and-inside-facebook</link>
    </item>
    <item>
      <title>The Social Graph, Facebook, and Virality</title>
      <description>&lt;p&gt;
The social graph is the web of connections between friends, family, and acquaintances that everyone has.  My friend knows someone who works at the company I want to interview at, so he connects us and I get a shiny new job after acing my interview.  It helps me meet new people, find new music I like, and generally navigate my social world.  If I find something because of a connection in my social graph I'm much more apt to trust its worth.  After all, people I know recommended it!
&lt;/p&gt;

&lt;p&gt;
The thing all social networking websites, like MySpace, LinkedIn, Friendster, and Facebook, have in common is that they try to create a virtual copy of the social graph.  The graph serves as a way to spread whatever content the site is pushing.  If the representation of the graph is good enough I'll know what my friends are reading, watching, doing this weekend, etc.
&lt;/p&gt;

&lt;p&gt;
I don't think it's controversial to say that among all the major social networking sites Facebook has the best representation of the social graph.  MySpace's representation is too connected &amp;mdash; my real-life connections might be there, but I also have a million other connections that don't reflect anything in the real social graph.  Friendster's representation isn't connected enough since none of my friends use it.  Facebook, however, has hit the sweet spot: collecting friends a la MySpace is actively discouraged by the features on the site, but they began with a narrow enough demographic that they were able to reach a critical mass among college students.
&lt;/p&gt;

&lt;h3&gt;The Social Graph and Virality&lt;/h3&gt;
&lt;p&gt;
The social graph is closely related to so-called "viral" services.  News can spread faster through the lines in the social graph than it could otherwise.  By having an accurate virtual representation of the social graph Facebook is able to amplify the usual word-of-mouth effect.  Anything I want to share I can broadcast instantly to all my connections.  Facebook does this using the News Feed and Mini-Feed features, showing my friends' activities chronologically and my own activities in my personal profile.
&lt;/p&gt;

&lt;p&gt;
Now we have the Facebook platform.  Some people mistakenly treat it as a fancy "widget platform," but they don't realize that the platform has moved Facebook far beyond the tired MySpace + widgets = success formula.  In addition to providing compelling content every social networking site wants to build a quality copy of the social graph, but in one fell swoop Facebook has given away access to the highest quality copy on the web, namely theirs.  Every developer or entrepreneur looking to build a social networking site will have to ask themselves from now on, "Would it just be better to write my application on the Facebook platform?"
&lt;/p&gt;

&lt;p&gt;
I don't think one can overstate the power of the position this has put Facebook in, from both a business and a technology perspective.  If they can actually cultivate a market around Facebook apps &amp;mdash; a proposition which is becoming increasingly likely with the creation of programs like &lt;a href="http://www.baypartners.com/appfactory/"&gt;appfactory&lt;/a&gt; &amp;mdash; they might very well move into truly revolutionary territory.
&lt;/p&gt;

&lt;h3&gt;Powerful, Yes, But no Magic Bullet&lt;/h3&gt;

&lt;p&gt;
But what does this mean for the app developer who just wants to create something popular?  The integration with the Facebook news feed allows apps to dramatically increase their virality by piggy-backing on Facebook's copy of the social graph.  But the Facebook platform isn't a magic bullet.  Turning your website into a Facebook app doesn't mean it will become an instant viral sensation &amp;mdash; you also have to understand how to use the social graph.
&lt;/p&gt;

&lt;p&gt;
Let's say you have a popular blog and want to increase people's interest by somehow integrating with Facebook.  Your first inclination is probably to let users add a feed of stories from your blog to their profile.  Even if your blog is popular this idea probably won't do much good for two reasons: one, readers will have to leave Facebook to get the content, interrupting whatever they were doing; two, there is no social component.
&lt;/p&gt;

&lt;p&gt;
The first problem is probably not worth your time to solve if all you're doing is recycling content from your blog.  The second problem is more serious because it runs against the very idea of the social graph.  The content being served is in no way related to the person whose profile I'm viewing.  It's not stories they've commented on, stories they've recommended, or anything of that sort; it's just a boring list of all stories.  Unless there is a social component do not expect Facebook to miraculously drive 30 million users to your website.
&lt;/p&gt;

&lt;p&gt;
Actually, don't expect it to drive traffic to your site at all.  Like I said, Facebook users don't want to leave Facebook.  As a concrete example, I wrote the &lt;a href="http://apps.facebook.com/apps/application.php?id=2368380368"&gt;PopSugar 100&lt;/a&gt; application for Sugar Publishing.  It drives about 5K visits per week, which is respectable but a drop in the bucket relative to our usual traffic.  You can try to monetize the application directly on Facebook, but if you're a medium-sized startup already monetizing your userbase in other ways then a Facebook app probably best serves as a way to increase your valuation.
&lt;/p&gt;

&lt;h3&gt;Case by Case&lt;/h3&gt;
&lt;p&gt;
&lt;strong&gt;Edit&lt;/strong&gt;: Joyce Park of Renkoo has written up a guide about &lt;a href="http://renkoo.wordpress.com/2007/07/03/how-to-design-for-the-facebook-platform/"&gt;designing for the Facebook platform&lt;/a&gt; that hits a lot of the points here.  You should read that, too.
&lt;/p&gt;

&lt;p&gt;
To see that the Facebook platform isn't a panacea for your viral distribution problems, let's look at some case studies.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2412849054"&gt;digg.com&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2412849054%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
The digg app is an example of an application which you would expect to do well on Facebook since it has a huge social component but which is essentially floundering.  Where other apps with much less impact than digg are growing at a rate of hundreds of users per hour, digg's app is almost static.  
&lt;/p&gt;
&lt;p&gt;
This is because, ironically, they don't harness the social graph at all.  All the app lets me do is post a list of stories I've dugg into my profile.  I've suggested &lt;a href="/article/5-ways-to-improve-the-digg-app"&gt;five ways&lt;/a&gt; the folks at digg could improve it.  The easiest way is to post all news items I submit or digg, at my preference, to my news feed.  Not only does this benefit digg by bringing in more users but it makes digg much more useful because I can promote my stories on digg through my corner of the social graph.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2341504841&amp;range=7d"&gt;Zombies&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2341504841%26f%3DUsersPerHour%26range%3D7d&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
This app is an example of an application which is entirely viral but provides little content.  As you'd expect, it spread very fast initially but is becoming saturated.  I suspect, as time goes on, people will uninstall it because there is little there to hold people's interest.  Being viral helps your application spread but if there is no substance to back it up people &lt;em&gt;will&lt;/em&gt; eventually leave.  We'll see if the authors of Zombies can create compelling content to keep people involved.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2439131959"&gt;Graffiti&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2439131959%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Graffiti is probably the perfect example of a Facebook application right now.  Not only is it extremely viral, it augments an existing Facebook service so there is essentially zero learning curve.  I'm informed when someone's Graffiti wall is updated via the news feed and people can share particularly clever drawings with each other. 
&lt;/p&gt;
&lt;p&gt;
It also shows the power social networking software can have in our real lives.  So many friends of mine have the Graffiti Wall that it has started seeping into my real social interactions.  Universal access among people connected to me means that it becomes a useful avenue of social expression.  More plainly, when something amusing happens I might think of a great thing to draw on someone's Graffiti Wall.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2360569570"&gt;Booze Mail&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2360569570%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Booze Mail, like Graffiti, does a great job of exploiting the social graph.  I'm informed whenever a friend of mine sends someone a drink, so not only do I learn about Booze Mail, but I get a better glimpse of my friend's social network.  I'd bet money that Booze Mail is going to be one of the million+ user applications within a week.  Indeed, of the 20 apps posting the largest absolute per-day gains, only three have fewer than a million users.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2920895692"&gt;Matches&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2920895692%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Matches is one of a dozen-or-so apps posting regular hourly losses.  It serves as an example of the pitfalls inherent in the Facebook platform.  If you look at the &lt;a href="http://uchicago.facebook.com/apps/application.php?id=2920895692"&gt;Matches discussion board&lt;/a&gt; you'll see that, for whatever reason, it isn't working for most people.  Part of the problem is that Facebook's model requires the app developer to shoulder all the load.  
&lt;/p&gt;
&lt;p&gt;If you created a popular app then you'd best be prepared to deal with Facebook-sized traffic.  If you can't then people &lt;em&gt;will&lt;/em&gt; get fed up.  Apps like this show that people are definitely willing to uninstall applications that don't do what they want (regardless of whose fault it is).  So even once you have a userbase you're not guaranteed to keep it.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
The Facebook platform has the potential to revolutionize the way we think about the social web.  It serves as a gateway to a high-quality copy of the social graph that exists in real life.  This graph, in turn, lets people share content, ideas, money, goods, and all sorts of things with ever-increasing efficiency.  The degree to which the graph permits targeted selection, e.g., find me all influential people between the ages of 20 and 25 who like to read, is an advertiser's wet dream.
&lt;/p&gt;
&lt;p&gt;
But the platform is no magic bullet.  Applications you think would do well sometimes falter and you're not guaranteed to be a viral hit just because you've created a widget for your blog.  Apps are rising and falling all the time and the market is still taking shape.  It's an exciting time, regardless.
&lt;/p&gt;

&lt;div class="notice warning"&gt;
All stats brought to you by &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt;.
&lt;/div&gt;</description>
      <pubDate>Wed, 11 Jul 2007 01:25:51 +0000</pubDate>
      <link>http://20bits.com/article/the-social-graph-facebook-and-virality</link>
    </item>
    <item>
      <title>5 Ways to Improve the Digg App</title>
      <description>&lt;p&gt;
The &lt;a href="http://apps.facebook.com/apps/application.php?id=2412849054"&gt;digg.com Facebook application&lt;/a&gt; has a little under 20,000 users.  According to the digg blog they reached &lt;a href="http://blog.digg.com/?p=67"&gt;1 million registered users&lt;/a&gt; in early March.  Even if we reduce this number to the number of active digg users we can see that only a small percentage of the digg userbase is using the Facebook application.  Why?
&lt;/p&gt;

&lt;p&gt;
Demographics aren't the reason: according to &lt;a href="http://blogs.zdnet.com/micro-markets/?p=446"&gt;some reports&lt;/a&gt; about digg's demographics, I'd be very surprised if most people on digg didn't also have a Facebook account.  The reason is actually pretty simple.  Digg's application doesn't exploit the core of Facebook's platform: the social graph.
&lt;/p&gt;

&lt;h4&gt;Solutions&lt;/h4&gt;
&lt;p&gt;
digg.com is a &lt;em&gt;social&lt;/em&gt; bookmarking site.  More than just letting people submit and vote on links, it lets people see what their friends have submitted, voted on, and commented on.  The social component is important to digg's success, but Facebook's social graph is more interesting because it reflects the social connections we have in real life.  Digg can use this to its advantage.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;Put Digg on Facebook&lt;/h5&gt;
&lt;p&gt;
Digg has more to gain than to lose from Facebook, not least because Facebook has an order of magnitude more active users than digg.  So why not start big and create a Facebook-ified version of digg? 
Users would submit, vote, and comment on items directly from Facebook.
&lt;/p&gt;
&lt;p&gt;
This would pit digg against Facebook's own link-submission mechanism, but with the added benefit of the digg algorithm to surface the interesting content.  Wouldn't it be awesome if the most common way people submitted links to Facebook was via the digg application?  Monetization shouldn't be an issue since Facebook allows for ads.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;More interesting profile box&lt;/h5&gt;
&lt;p&gt;
People like to have interesting and customizable profile pages.  The more personalized it is, the better.  The digg application should give us the option of including all or some of the content we're interested in showing off.
&lt;/p&gt;
&lt;p&gt;
Some people show their interest by digging stories and some by commenting on stories, so both of those should be options.  For active submitters being able to include submitted stories is also important since this makes their Facebook profile act as a mini-advertisement for the stories they've submitted.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use the newsfeed&lt;/h5&gt;
&lt;p&gt;
The newsfeed is one of the keys to viral success on Facebook and the digg app doesn't use it at all.  This is probably one reason why it only has 20k users.  The newsfeed should update whenever I digg a story, comment on a story, or submit a story.
&lt;/p&gt;
&lt;p&gt;
For video submissions (or image submissions once digg gets an image section) the newsfeed post could contain a thumbnail version.  There should always be a "digg this" link directly in the newsfeed item.  This keeps the interaction within the flow of Facebook and would increase my use of digg since I know at least 100-some people would always see what stories I submit.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use the Graph, Luke!&lt;/h5&gt;
&lt;p&gt;
I'm on Facebook.  I have my friends.  They may or may not be my friends on digg, although I know a lot of my friends use digg.  The digg app should let me see the stories they've dugg, submitted, and commented on whether we're friends on digg or not.
&lt;/p&gt;
&lt;p&gt;
The app could also break down the diggs by network.  What's the most popular story in the San Francisco, CA network?  How about the Google, Inc. network?  At the very least I'd like to see what's popular in &lt;em&gt;my&lt;/em&gt; networks, especially since digg doesn't store any of this kind of information on their end.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Sharing and Inviting&lt;/h5&gt;
&lt;p&gt;
The digg app should make it easier to share digg links on Facebook.  I can't count the number of times per day I send links to people via IM or Facebook because I think they're funny, interesting, or whatever else.  More often than not I found them on digg.
&lt;/p&gt;
&lt;p&gt;
The digg app should also include an "invite friends" page which allows me to invite en masse all my friends who haven't installed the digg app.  This is perhaps the easiest way to get Facebook users actively using digg, either directly or via the digg app.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;The Flow&lt;/h4&gt;
&lt;p&gt;
With these features let's imagine the flow of the Facebook app.  I come to the Facebook.  I see that my friend has submitted a story to digg, so I click the "digg this" link without ever leaving Facebook.
&lt;/p&gt;
&lt;p&gt;
I submit a story to digg.  My mini-feed is updated stating that 'Jesse submitted "The most hilarious video EVER!!!!!!" to digg.'  My friends see this in their newsfeed and all click "digg this," because if I say it's the most hilarious video ever then they know I'm serious.
&lt;/p&gt;
&lt;p&gt;
You get the idea.  Basically Facebook provides an excellent way for digg to spread into a richer social setting.
&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;
&lt;p&gt;
Facebook has done something remarkable in modeling the social graph that exists in the real world.  Opening up this data is as remarkable as it would be if Google released their internal graph of the web, in my opinion.
&lt;/p&gt;
&lt;p&gt;
While the most popular apps right now are generally gimmicks, just look at the numbers.  The largest applications have maybe 3-4 million users, but Facebook has over 25 million registered users.  Zuckerberg has stated that only 1/3 of Facebook users have interacted with a Facebook app.  Although he was talking about his surprise at the speed at which the Facebook platform has been adopted, I see the potential for huge gains.  Beyond gimmicks, I believe the apps that will be the most successful and most valuable for Facebook will be those that effectively exploit the social graph.
&lt;/p&gt;
&lt;p&gt;
If you view digg as an enhanced version of Facebook's own link-sharing mechanism, the fit is almost perfect.  Not only would Facebook benefit from digg's technology, but digg would benefit from a more effective and viral distribution mechanism.
&lt;/p&gt;

&lt;p&gt;
Any other ideas?  Leave a comment!
&lt;/p&gt;

&lt;p&gt;
P.S., give my Facebook app, &lt;a href="http://apps.facebook.com/apps/application.php?id=2949245143"&gt;Bookshelf&lt;/a&gt;, a try.  It lets you post your personal collection of books, CDs, DVDs, and video games and share them with your friends and neighbors.
&lt;/p&gt;</description>
      <pubDate>Thu, 21 Jun 2007 22:08:49 +0000</pubDate>
      <link>http://20bits.com/article/5-ways-to-improve-the-digg-app</link>
    </item>
    <item>
      <title>More Facebook Application Gotchas</title>
      <description>&lt;p&gt;
This is a continuation of my previous article, &lt;a href="/article/5-facebook-application-gotchas"&gt;5 Facebook Application Gotchas&lt;/a&gt;.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;User invites&lt;/h5&gt;
&lt;p&gt;
Everyone loves user invites.  Well, every application developer, at least.  Invitation requests are sent by calling &lt;tt&gt;$facebook-&gt;api_client-&gt;notifications_sendRequest(...)&lt;/tt&gt;.  Facebook only allows you to send requests to ten users at a time, however.
&lt;/p&gt;
&lt;p&gt;
To implement an invite page, create a URL called, say, &lt;tt&gt;http://apps.facebook.com/myapp/process&lt;/tt&gt;.  Render a form which lets users select their friends who haven't installed the app, as follows:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;
global $facebook;
// FQL: friends of the current user who haven't added the app
$fql = "SELECT uid
            FROM user
            WHERE uid IN (SELECT uid2 FROM friend WHERE uid1 = {$facebook-&gt;user})
            AND has_added_app = 0";
// fql_query returns an array of rows, one per matching friend
$rows = $facebook-&gt;api_client-&gt;fql_query($fql);
// Render the form using the UIDs above
&lt;/pre&gt;
&lt;p&gt;
Have the form you render submit to &lt;tt&gt;/process&lt;/tt&gt;, which should behave roughly as follows
&lt;/p&gt;
&lt;pre class="brush: php"&gt;global $facebook;
// The form submits either a comma-separated 'uids' string (from the
// redirect below) or a 'users' array of checkboxes
if (isset($_REQUEST['uids'])) {
	if (empty($_REQUEST['uids']))
		$facebook-&gt;redirect('APP URL');

	$array = explode(',', $_REQUEST['uids']);
} elseif (isset($_REQUEST['users'])) {
	$array = array_keys($_REQUEST['users']);
} else {
	$facebook-&gt;redirect('APP URL');
}

// Facebook caps each request at ten recipients, so peel off a batch of ten
$uids = array();
while (count($array) &gt; 0 and count($uids) &lt; 10) {
	$uids[] = array_shift($array);
}

// $msg and $img_url (the invitation text and image) are defined elsewhere
$url = $facebook-&gt;api_client-&gt;notifications_sendRequest($uids, 'MyApp', $msg, $img_url, true);

if ($url) {
	// Send the user to Facebook's confirmation page, then back to /process
	// with the remaining UIDs so the next batch can go out
	$facebook-&gt;redirect($url . "&amp;canvas=1&amp;next=" . urlencode("process?uids=" . implode(',', $array)));
} else {
	$facebook-&gt;redirect('APP URL');
}&lt;/pre&gt;
&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;, the &lt;tt&gt;next&lt;/tt&gt; parameter tells the request URL where to redirect after a batch has been processed.  If it is a canvas page you should include &lt;tt&gt;canvas=1&lt;/tt&gt;.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;CSS ids&lt;/h5&gt;
&lt;p&gt;
Facebook only allows inline styles and styles within &lt;tt&gt;style&lt;/tt&gt; tags.  To prevent you from affecting the layout of the rest of Facebook it inserts a wrapper div around your content and assigns it an id of the form &lt;tt&gt;app_XXXX&lt;/tt&gt;, where XXXX is your application ID number.  It then affixes &lt;tt&gt;#app_XXXX&lt;/tt&gt; to all your CSS rules.
&lt;/p&gt;
&lt;p&gt;
This means that, for example, CSS hacks which involve things like &lt;tt&gt;html &amp;gt; *&lt;/tt&gt; won't work since they will come out as &lt;tt&gt;#app_XXXX html &amp;gt; *&lt;/tt&gt; on the other side.  For ID rules, however, it does something much more annoying: it rewrites the ID itself.  So, e.g., a rule like &lt;tt&gt;#MyDiv h1&lt;/tt&gt; becomes &lt;tt&gt;#app_XXXX_MyDiv h1&lt;/tt&gt; rather than &lt;tt&gt;#app_XXXX #MyDiv h1&lt;/tt&gt;.  There's no good reason for this, AFAIK, but it means using IDs on a page inside Facebook becomes tedious &amp;mdash; you need to know your application ID number.
&lt;/p&gt;
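&lt;p&gt;
To make the rewriting concrete, here's what happens to an ID rule (XXXX stands in for your application ID):
&lt;/p&gt;
&lt;pre class="brush: css"&gt;
/* What you write */
#MyDiv h1 { font-size: 1.5em; }

/* What Facebook actually serves: note the underscore, not a descendant selector */
#app_XXXX_MyDiv h1 { font-size: 1.5em; }
&lt;/pre&gt;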
&lt;p&gt;
To work around this I just use classes when writing Facebook pages.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Facebook is Strict&lt;/h5&gt;
&lt;p&gt;
In order to strip out bad elements and alter the CSS Facebook actually &lt;a href="http://en.wikipedia.org/wiki/Lexical_analysis" rel="nofollow"&gt;lexes&lt;/a&gt; and parses your code.  And it is strict.  Like, &lt;strong&gt;really&lt;/strong&gt; strict &amp;mdash; much more strict than any browser.
&lt;/p&gt;
&lt;p&gt;
If you're the developer you can see the error messages, and I advise you to clean them up.  With CSS, at least, bad rules just get dropped.  I'd also bet money that acceptance into the application directory is contingent on you outputting well-formed FBML.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use an Icon&lt;/h5&gt;
&lt;p&gt;
The process for what applications get accepted and what applications get rejected from the application directory is &lt;a href="http://uchicago.facebook.com/topic.php?uid=2205007948&amp;topic=5855" rel="nofollow"&gt;totally opaque&lt;/a&gt;.  The best we have is that an application must "work," have at least five users, &lt;strong&gt;have an icon&lt;/strong&gt;, and follow the TOS.
&lt;/p&gt;
&lt;p&gt;
Looking over the application directory you'll see plenty of apps with terrible icons, so presence matters more than quality.  Just create one and upload it before you submit your application to the directory.  You can change it later if you want, but you won't get accepted without one.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;No conditional comments&lt;/h5&gt;
&lt;p&gt;
The Facebook JavaScript doesn't always play nice with IE.  That is, you can find permutations which work in Safari and Firefox but fail in IE.  Shucks.  Unfortunately Facebook doesn't allow &lt;a href="http://www.quirksmode.org/css/condcom.html" rel="nofollow"&gt;conditional comments&lt;/a&gt; which would give you the ability to let your application degrade nicely in IE.
&lt;/p&gt;
&lt;p&gt;
What's more, because of the way Facebook rewrites CSS, most of the CSS hacks don't work.  Facebook does, however, pass along the user agent when it requests data from your server, so if you must absolutely have browser-specific code you'll have to push the logic back into the PHP (or Java, if you swing that way).
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
I had a sixth gotcha but I forgot what it was.  Oops!
&lt;/p&gt;</description>
      <pubDate>Thu, 21 Jun 2007 21:55:32 +0000</pubDate>
      <link>http://20bits.com/article/more-facebook-application-gotchas</link>
    </item>
    <item>
      <title>5 Facebook Application Gotchas</title>
      <description>&lt;p&gt;
Everyone and their uncle is writing Facebook applications for the new &lt;a href="http://developers.facebook.com/"&gt;Facebook Platform&lt;/a&gt;.  I, too, have my own offering, written by myself and the other OpenHive guys: &lt;a href="http://apps.facebook.com/apps/application.php?id=2949245143"&gt;Bookshelf&lt;/a&gt;.  Even though the platform was released almost a month ago there are still plenty of tricks, gotchas, and other undocumented oddities that deserve to be brought to light.
&lt;/p&gt;

&lt;h4&gt;Gotchas, tips, and tricks&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;The Timeout&lt;/h5&gt;
&lt;p&gt;
For those who know what I'm talking about already the answer is &lt;strong&gt;12 seconds&lt;/strong&gt;.  Everyone else read on.
&lt;/p&gt;
&lt;p&gt;
Facebook canvas pages (URLs of the form &lt;tt&gt;http://apps.facebook.com/yourapp/foo&lt;/tt&gt;) work on a proxy model.  In the application configuration you specify a callback URL so that when someone visits &lt;tt&gt;http://apps.facebook.com/yourapp/foo&lt;/tt&gt; Facebook in turn requests &lt;tt&gt;http://mydomain.com/myapp/foo&lt;/tt&gt;.  Facebook fetches the FBML from your callback URL and renders it on the canvas page.
&lt;/p&gt;
&lt;p&gt;
If your callback takes too long to respond, Facebook spits out this ugly message: &lt;blockquote&gt;
There are still a few kinks Facebook and the makers of &amp;lt;application name&amp;gt; are trying to iron out. We appreciate your patience as we try to fix these issues. Your problem has been logged - if it persists, please come back in a few days. Thanks!&lt;/blockquote&gt;
&lt;/p&gt;
&lt;p&gt;
Ignore the fact that this error message is awful (try back in a few days?!), for now.  I did some testing (i.e., a PHP file and a call to sleep) and found that the timeout is set to around &lt;strong&gt;12 seconds&lt;/strong&gt;.  Although it should never, ever take this long to render any webpage, if you're doing a lot of processing you might run afoul of this limit, so watch out.
&lt;/p&gt;
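&lt;p&gt;
If you want to reproduce the measurement yourself, a one-file probe is enough.  This is just a sketch (the filename and parameter are made up), and the 12-second figure is only what my own testing turned up &amp;mdash; Facebook could change it at any time:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// timeout_probe.php -- point your callback URL at this file and
// increase ?delay=N until Facebook's error page appears instead
// of the output below.
$delay = isset($_GET['delay']) ? (int) $_GET['delay'] : 0;
sleep($delay);
echo "Rendered after {$delay} seconds.";&lt;/pre&gt;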
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The Load&lt;/h5&gt;
&lt;p&gt;
Because canvas pages work on a proxy model your servers will have to handle the load Facebook throws at them.  For some apps, like iLike, this means growing from zero to three million users in a week.  If you plan on creating a popular app then you'll need to plan and benchmark for high-concurrency situations.
&lt;/p&gt;
&lt;p&gt;
To start you should make sure your database is well optimized.  Read my article on &lt;a href="/article/10-tips-for-optimizing-mysql-queries-that-dont-suck"&gt;MySQL optimization tips&lt;/a&gt; for some ideas of what that means &amp;mdash; most of the tips are database neutral.
&lt;/p&gt;
&lt;p&gt;
Second you should use a tool like &lt;a rel="nofollow" href="http://httpd.apache.org/docs/2.0/programs/ab.html"&gt;ab&lt;/a&gt; with the concurrency set high and try to maximize your requests served per second.  In short, if you're going to be hosting a popular Facebook application be prepared to deal with Facebook-magnitude loads.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The Session&lt;/h5&gt;
&lt;p&gt;
Before you can talk with Facebook you must initialize a session using the &lt;tt&gt;Facebook&lt;/tt&gt; class provided by the Facebook API library.  You cannot tell whether the session is valid by checking whether the session_key field in your object is null &amp;mdash; sometimes it looks completely valid but has actually expired.  The REST client will throw an exception if you try to do anything with an invalid session, so it's something to avoid.
&lt;/p&gt;
&lt;p&gt;
To get your session data call &lt;tt&gt;auth_getSession()&lt;/tt&gt;.  It returns an array that contains the timeout, so you can check directly whether the session has expired.  If the timeout is set to 0 then the session lasts forever.  You can also use try/catch to make sure your sessions are valid:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;$fbuid = $facebook-&gt;get_loggedin_user();
if ($fbuid) {
	try {
		if ($facebook-&gt;api_client-&gt;users_isAppAdded()) {
			// The user has added our app
		} else {
			// The user has not added our app
		}
	
	} catch (Exception $ex) {
		//this will clear cookies for your app and redirect them to a login prompt
		$facebook-&gt;set_user(null, null);
		$facebook-&gt;redirect($_SERVER['SCRIPT_URI']);
		exit;
	}
} else {
	// The user has never used our app
}&lt;/pre&gt;
&lt;p&gt;
The above will guarantee that you always have a valid session. (Thanks to &lt;a href="http://aditya-mukherjee.com/"&gt;Aditya&lt;/a&gt; for information about session expiration.)
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The JS&lt;/h5&gt;
&lt;p&gt;
The Facebook Platform supports three means of dynamic, client-side content: iframes, Flash, and Javascript wrappers.  By using iframes you are essentially given free rein to do what you will.  Flixster uses Javascript in an iframe to create its UI elements, for example.
&lt;/p&gt;
&lt;p&gt;
Flash is flash and can be embedded using the &lt;tt&gt;fb:swf&lt;/tt&gt; FBML tag.  The Javascript wrappers, however, are where the gotchas pop up.  Facebook supports three pieces of Javascript functionality: showing a DOM element, hiding a DOM element, and replacing the contents of a DOM element with HTML returned from a remote URL.
&lt;/p&gt;
&lt;p&gt;
You can show, hide, or toggle an element with id &lt;tt&gt;foo&lt;/tt&gt; by giving an element &lt;tt&gt;clicktoshow&lt;/tt&gt;, &lt;tt&gt;clicktohide&lt;/tt&gt;, or &lt;tt&gt;clicktotoggle&lt;/tt&gt; attributes with the value &lt;tt&gt;foo&lt;/tt&gt;, respectively.
&lt;/p&gt;
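&lt;p&gt;
For example, this sketch (the ids are invented) toggles a hidden block each time the link is clicked; the exact markup this renders into is up to Facebook's FBML parser:
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;a href="#" clicktotoggle="details"&gt;Show or hide details&lt;/a&gt;
&lt;div id="details" style="display: none;"&gt;Hidden until clicked.&lt;/div&gt;&lt;/pre&gt;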
&lt;p&gt;
To swap out the content of an element with remote content use &lt;tt&gt;clickrewriteurl&lt;/tt&gt; and &lt;tt&gt;clickrewriteform&lt;/tt&gt;.  The first attribute contains the URL and the second the id of a form element whose values are passed to the URL.  You can combine &lt;tt&gt;clicktoshow&lt;/tt&gt;, &lt;tt&gt;clicktohide&lt;/tt&gt;, and &lt;tt&gt;clicktotoggle&lt;/tt&gt; in a single element but cannot combine these with &lt;tt&gt;clickrewriteurl&lt;/tt&gt;.
&lt;/p&gt;
&lt;p&gt;
To get around this you can mark it up as follows:
&lt;/p&gt;
&lt;pre class="brush: html"&gt;
&lt;div clickrewriteurl="your_url"&gt;
	&lt;a href="#" clicktoshow="id_to_show"&gt;Click me!&lt;/a&gt;
&lt;/div&gt;
&lt;/pre&gt;
&lt;p&gt;
This is useful, for example, to show a progress indicator or "Saving..." text while you process something asynchronously.  Make sure to test this in all major browsers since I've seen it fail in IE under certain circumstances.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Using Lighttpd&lt;/h5&gt;
&lt;p&gt;
&lt;a href="http://www.lighttpd.net/"&gt;lighttpd&lt;/a&gt; is an increasingly popular webserver.  It is much lighter than Apache at the expense of Apache's modularity and extensibility.  A common scenario would be to use it for serving static content.
&lt;/p&gt;
&lt;p&gt;
However, many people are using it in place of Apache as a full, dedicated webserver.  The problem arises when you try to submit large amounts of data via &lt;tt&gt;POST&lt;/tt&gt; to a Facebook canvas page.  If the data is large enough Facebook will send your app an &lt;tt&gt;Expect: 100-continue&lt;/tt&gt; header, which lighttpd doesn't understand.  This results in lighttpd throwing an HTTP 417 error (pretty obscure, eh?), which Facebook spits right back in the user's face.
&lt;/p&gt;
&lt;p&gt;
To get around this you need to either use something besides lighttpd that supports the &lt;tt&gt;Expect: 100-continue&lt;/tt&gt; header (e.g., Apache) or submit the data directly to your server and then redirect back to the Facebook after the data is processed.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
The Facebook Platform is still young and changes weekly.  Keeping abreast of the changes can be daunting, so let me know if this helped at all.
&lt;/p&gt;</description>
      <pubDate>Tue, 19 Jun 2007 23:40:57 +0000</pubDate>
      <link>http://20bits.com/article/5-facebook-application-gotchas</link>
    </item>
    <item>
      <title>An Introduction to FBML</title>
      <description>&lt;p&gt;
On May 24&lt;sup&gt;th&lt;/sup&gt;, 2007 Facebook released the &lt;a href="http://developers.facebook.com/"&gt;Facebook platform&lt;/a&gt;.  This is the complement to their previous API, based around the &lt;a href="http://developers.facebook.com/documentation.php?v=1.0&amp;doc=fql"&gt;Facebook Query Language (FQL)&lt;/a&gt;.  Where FQL allows you to create applications from Facebook data, the Facebook platform, via the &lt;a href="http://developers.facebook.com/documentation.php?doc=fbml"&gt;Facebook Markup Language (FBML)&lt;/a&gt;, allows you to embed your application in the Facebook.  Finally, Facebook has entered the world of widgets.
&lt;/p&gt;

&lt;p&gt;
Lucky for us Facebook actually has a "widget strategy."  MySpace's "widget strategy" isn't really a strategy at all; rather, it's a consequence of the fact that they basically allow people to enter anything they want into their profiles.  A "MySpace widget" works just as well on MySpace as it does anywhere else.  The Facebook platform, however, gives you access to the jewel of the Facebook universe: the social graph.
&lt;/p&gt;

&lt;p&gt;
On the one hand this means that a Facebook widget really only works on the Facebook (at least until some other website supports FBML).  On the other hand this means that you can create much richer applications by exploiting information about your users' relationships.  Since the mini-feed informs your friends whenever you install an application you also get an excellent viral way to spread your application and your brand.  But to do that you need to understand FBML.
&lt;/p&gt;

&lt;h4&gt;Your Data as FBML&lt;/h4&gt;
&lt;p&gt;
One of the most important concepts associated with Web 2.0 is the independence of data and presentation.  You see this in things like XML/XSLT and HTML/CSS.  Let's say you have a database-backed web application.  Most of the time you're going to be surfacing this data as HTML.  You have other options, of course.  Maybe your reader wants his data in an RSS feed.  The underlying data is the same but the format in which it is presented is different.
&lt;/p&gt;

&lt;p&gt;
For those still in a SAT mindset we get the following analogy: HTML is to a browser as RSS is to a feed reader, and RSS is to a feed reader as FBML is to Facebook.  Graphically the relationship is this:
&lt;/p&gt;
&lt;pre&gt;

                            HTML
User &amp;lt;------&amp;gt;   Browser   &amp;lt;------&amp;gt; Server

                            RSS
User &amp;lt;------&amp;gt; Feed reader &amp;lt;------&amp;gt; Server

                            FBML
User &amp;lt;------&amp;gt;  Facebook   &amp;lt;------&amp;gt; Server
&lt;/pre&gt;


&lt;h4&gt;The Nuts and Bolts of FBML&lt;/h4&gt;
&lt;p&gt;
FBML isn't quite HTML and isn't quite proprietary.  The closest analog I can think of is &lt;a href="http://en.wikipedia.org/wiki/ColdFusion#Code_example"&gt;ColdFusion&lt;/a&gt;, ironically the language in which MySpace is written.  FBML consists of a subset of HTML (no &lt;tt&gt;script&lt;/tt&gt; tags, for example) and a set of proprietary extensions.
&lt;/p&gt;

&lt;p&gt;
These extensions act like HTML tags and can be divided into two broad classes: markup tags and procedural tags.  Markup tags include UI elements and are generally directly translated into HTML.  The &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:header"&gt;fb:header&lt;/a&gt;&lt;/tt&gt; tag, for example, produces the HTML for a Facebook-style header.  
&lt;/p&gt;
&lt;p&gt;Other tags like &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:if-can-see"&gt;fb:if-can-see&lt;/a&gt;&lt;/tt&gt; have a programmatic component.  In this case the content between the tags is rendered only if the current user has permission to do whatever is specified in the tag's attributes.  For example:
&lt;/p&gt;
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;fb:if-can-see uid="12345" what="profile"&gt;
	You're allowed to see 12345's profile, chum!
	
	&lt;fb:else&gt;
		No profile for you!
	&lt;/fb:else&gt;
&lt;/fb:if-can-see&gt;&lt;/pre&gt;

&lt;p&gt;
This would display "You're allowed to see 12345's profile, chum!" if the current user could see user 12345's profile and display "No profile for you!" otherwise.
&lt;/p&gt;

&lt;p&gt;
Some tags are more complicated, like &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:switch"&gt;fb:switch&lt;/a&gt;&lt;/tt&gt;.  &lt;tt&gt;fb:switch&lt;/tt&gt; evaluates each of the &lt;tt&gt;fb:&lt;/tt&gt; tags inside and returns the first one which does not evaluate to an empty string, e.g.,
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;fb:switch&gt;
	&lt;fb:photo pid="12345" /&gt;
	&lt;fb:profile-pic uid="54321" /&gt;
	&lt;fb:default&gt;You can't see either the photo or the profile pic&lt;/fb:default&gt;  
&lt;/fb:switch&gt;&lt;/pre&gt;

&lt;p&gt;
This would display the photo with &lt;tt&gt;pid&lt;/tt&gt; 12345 if it could, otherwise it would try to display the profile picture of user 54321.  If neither of these can be displayed (e.g., the privacy settings are such that you're not allowed to see them) then it will display the content in &lt;tt&gt;fb:default&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
If you want to play around with FBML without installing and configuring your own application you can use Facebook's &lt;a href="http://developers.facebook.com/tools.php?fbml"&gt;FBML test console&lt;/a&gt;.
&lt;/p&gt;

&lt;h4&gt;Integrating With Facebook&lt;/h4&gt;
&lt;p&gt;
FBML itself isn't so complicated, but integrating your existing application with the Facebook Platform can be a pain, especially since the whole process isn't very well documented.  The first thing you need to do is install the &lt;a href="http://www.facebook.com/developers"&gt;Developer Application&lt;/a&gt;, which allows you to manage the applications you create.
&lt;/p&gt;

&lt;p&gt;
Each application has a unique &lt;em&gt;API key&lt;/em&gt; which doesn't ever change.  When you create an application you also get a &lt;em&gt;secret&lt;/em&gt; which you should never share &amp;mdash; it's the only way the Facebook knows that an application is the application it claims to be.
&lt;/p&gt;

&lt;p&gt;
So, to create a new application go to the Developer application, click on &lt;a href="http://www.facebook.com/developers/apps.php"&gt;My Applications&lt;/a&gt; and then &lt;a href="http://www.facebook.com/developers/editapp.php?new"&gt;Apply for another key&lt;/a&gt;.  Here you enter the name of your application.  After agreeing to the Terms of Service click submit and you'll be redirected back to the My Application page.  Once there click on "Edit settings" for your new application.
&lt;/p&gt;

&lt;p&gt;
I'll wait until you get to the "Edit Settings" page.  The key part here is understanding the &lt;strong&gt;Callback (URL)&lt;/strong&gt; field.  If you enter &lt;tt&gt;http://mydomain.com/myapp/&lt;/tt&gt; as the callback URL then all requests directed to &lt;tt&gt;http://apps.facebook.com/myapp&lt;/tt&gt; will go to &lt;tt&gt;http://mydomain.com/myapp&lt;/tt&gt;.  The callback URL serves as the base URL from which all requests are made.  If you ask Facebook for &lt;tt&gt;foo.php&lt;/tt&gt; it will try to fetch the FBML from &lt;tt&gt;http://mydomain.com/myapp/foo.php&lt;/tt&gt;, interpret it, and display the results.
&lt;/p&gt;
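&lt;p&gt;
The mapping is purely mechanical, so it's worth seeing spelled out.  This toy function (the names and URLs are placeholders, not part of any Facebook library) mimics what Facebook does with the path:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// Facebook takes everything after your app's name on the canvas URL
// and appends it to the callback URL you configured.
function callback_for($canvas_path, $callback_base) {
    return rtrim($callback_base, '/') . '/' . ltrim($canvas_path, '/');
}

echo callback_for('foo.php', 'http://mydomain.com/myapp/');
// http://mydomain.com/myapp/foo.php&lt;/pre&gt;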

&lt;h4&gt;The Library&lt;/h4&gt;
&lt;p&gt;
One could write an application consisting solely of static FBML pages, but it would be pretty boring.  To aid integration Facebook provides both Java and PHP &lt;a href="http://developers.facebook.com/resources.php"&gt;client libraries&lt;/a&gt;.  We'll focus on the PHP5 library.
&lt;/p&gt;

&lt;p&gt;
The client library includes an example application called "Footprints" which is very instructive.  The library provides a &lt;tt&gt;Facebook&lt;/tt&gt; object, initialized with your API key and secret, which helps control the flow of the application.
&lt;/p&gt;

&lt;pre class="brush: php"&gt;$api_key = 'YOUR API KEY';
$secret = 'YOUR SECRET';
$facebook = new Facebook($api_key, $secret);&lt;/pre&gt;

&lt;p&gt;
Facebook allows &lt;a href="http://developers.facebook.com/anatomy.php"&gt;several points of integration&lt;/a&gt; and the &lt;tt&gt;$facebook&lt;/tt&gt; object is the glue which allows you to push data to each of those integration points.
&lt;/p&gt;

&lt;p&gt;
An important fact to note is that the Facebook platform contains both push and pull APIs.  All user-specific data follows a push model.  That is, if you want to publish data on a user's profile, send a message, make a request, publish an item on a user's mini-feed, etc., you must push the request.  All other data is fetched from your server by the Facebook when users access URLs like &lt;tt&gt;http://apps.facebook.com/myapp/do_something.php&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
Here is the procedure by which users install an application, giving that application permission to push data to their profile, mini-feed, etc.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;User visits &lt;tt&gt;http://apps.facebook.com/myapp/&lt;/tt&gt; and Facebook requests &lt;tt&gt;http://mydomain.com/myapp/&lt;/tt&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The application requests the user install the application by invoking &lt;tt&gt;$facebook-&gt;require_login()&lt;/tt&gt; if the application plans to push user-specific data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The user/application go through the &lt;a href="http://developers.facebook.com/documentation.php?doc=auth"&gt;authentication process&lt;/a&gt;.  After the end of the authentication process (presuming the user follows through) the application is given the user's Facebook uid and a session key via a &lt;tt&gt;POST&lt;/tt&gt; request.  These are required to push user-specific data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The application can now push data to a user's profile or mini-feed, make application-related requests on their behalf, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5&gt;The Nuts and Bolts of the Facebook Object&lt;/h5&gt;
&lt;p&gt;
The Facebook object contains all the methods you'll need to interact with the Facebook platform.  After a user has authenticated you'll probably be interested in the following:
&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;$facebook-&gt;redirect($url)&lt;/dt&gt;
&lt;dd&gt;Redirects to the given URL.  This is required because the headers have already been sent by the time the Facebook requests data from your application.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;require_login() and $facebook-&gt;require_add()&lt;/dt&gt;
&lt;dd&gt;Requires the user to login to your application or install it, respectively.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;get_login_url() and $facebook-&gt;get_add_url()&lt;/dt&gt;
&lt;dd&gt;Returns the URL for your application's login or install page, respectively.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;feed_publishStoryToUser($title, $body, ...)&lt;/dt&gt;
&lt;dd&gt;Publishes a feed item for the currently authenticated user.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;friends_get()&lt;/dt&gt;
&lt;dd&gt;Returns the friends of the currently authenticated user.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;friends_getAppUsers()&lt;/dt&gt;
&lt;dd&gt;Returns the friends of the currently authenticated user who also have the application installed.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;groups_get($uid=null,$gids=null)&lt;/dt&gt;
&lt;dd&gt;Returns the specified groups (all by default) for the specified user (the current user by default).&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;profile_setFBML($markup, $uid=null)&lt;/dt&gt;
&lt;dd&gt;Sets the profile box FBML for the specified users (defaults to the current user).&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;
This list is by no means comprehensive, but these are the highlights.  There are also functions which deal with photos, notifications, and events.  There's no real documentation for these functions outside of the library source, although there is a one-to-one correspondence between the methods in the api_client and the methods listed in the &lt;a href="http://developers.facebook.com/documentation.php?v=1.0&amp;method=auth.createToken"&gt;sidebar of the developer documentation&lt;/a&gt;.  This is definitely the &lt;em&gt;least&lt;/em&gt; documented part of the Facebook platform.
&lt;/p&gt;

&lt;h4&gt;AJAX and other miscellany&lt;/h4&gt;
&lt;p&gt;
No Web 2.0 application would be complete without AJAX.  Of course the whole point of the Facebook platform is to give developers access to the Facebook without compromising security, so unadorned Javascript and &lt;tt&gt;script&lt;/tt&gt; tags are out of the question.
&lt;/p&gt;

&lt;p&gt;
To solve this Facebook provides a very basic &lt;a href="http://developers.facebook.com/step_by_step.php#redirect"&gt;mock AJAX&lt;/a&gt; system.  You create a dummy form which contains the various values you're interested in and point it at an element which activates the AJAX request.  It's a little hackish but the alternative (no Javascript at all) is probably worse.  The examples in the above documentation are as clear as any explanation I could write, so just read those.
&lt;/p&gt;

&lt;p&gt;
In addition Facebook supports Flash and iframes on canvas pages.  This means you could, in theory, embed your page directly into the Facebook.
&lt;/p&gt;

&lt;h4&gt;Resources&lt;/h4&gt;
&lt;p&gt;
From the above you should understand the basics of how Facebook interacts with an application.  The Facebook expects your application to output FBML which it then transforms into a page for your user.  In addition you can use the &lt;tt&gt;Facebook&lt;/tt&gt; object to get information about the current user, such as their friends, groups, photos, and notifications.
&lt;/p&gt;

&lt;p&gt;
But the above only touches the important parts.  A lot of the platform remains undocumented and the best way to learn more is to just dive in.  Here are some helpful resources.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/documentation.php"&gt;Developers Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/anatomy.php"&gt;Anatomy of a Facebook Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/step_by_step.php"&gt;Step-by-step Guide to Creating an Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/faq.php"&gt;Developer FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wiki.developers.facebook.com/index.php/Main_Page"&gt;Platform Wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/clientlibs/facebook-platform.tar.gz"&gt;PHP5 Client Library&lt;/a&gt;, including a sample Facebook application.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Speculation&lt;/h4&gt;
&lt;p&gt;
There are some totally undocumented aspects of FBML.  One that sticks out, using my ColdFusion analogy above, is the &lt;tt&gt;fb:query&lt;/tt&gt; tag.  You can see the stub on the &lt;a href="http://wiki.f8.facebook.com/index.php/FBML"&gt;FBML documentation&lt;/a&gt; at the wiki.  
&lt;/p&gt;
&lt;p&gt;
One oddity with the current platform is the way it integrates FBML and FQL.  You can issue FQL queries directly via the &lt;tt&gt;Facebook&lt;/tt&gt; object.  This effectively doubles the latency of your application since the Facebook first issues a request to your application which then in turn might issue several FQL queries back to the Facebook before returning the finalized FBML.  My suspicion is that FBML either at one point contained or will contain the ability to execute FQL directly on the Facebook and iterate through the resultset.&lt;/p&gt;

&lt;p&gt;
Cheers, and happy coding!
&lt;/p&gt;</description>
      <pubDate>Mon, 04 Jun 2007 05:50:26 +0000</pubDate>
      <link>http://20bits.com/article/an-introduction-to-fbml</link>
    </item>
    <item>
      <title>Designing Content-focused Websites</title>
      <description>&lt;p&gt;
Every website has two fundamental components: data and one or more users/readers who consume that data.  This data can be produced in many ways &amp;mdash; an author or editorial staff, other users of the website, a database, etc.  I'm not interested in the question of what data a user is interested in consuming.  That is, I'm not interested in giving editorial advice for someone looking to create a popular blog.
&lt;/p&gt;

&lt;p&gt;
Rather, given that a user is at a website which has data they want to consume, I'm interested in the question of how best to deliver that data.  This question intersects the realms of technology, usability, and design.
&lt;/p&gt;

&lt;p&gt;
In thinking about this question I've come up with three categories into which most any website fits.  By analyzing these categories I believe one can arrive at some solid, general advice for how to structure websites.  Some might accuse me of being "too academic," but I think there's something to be learned about designing websites by understanding these categories and your website's relation to them.
&lt;/p&gt;

&lt;h4&gt;Contents&lt;/h4&gt;
&lt;p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;a href="#categories"&gt;The Three Categories&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#content-focused"&gt;Content-focused Websites&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#surface"&gt;Surfacing Content&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#estimate"&gt;Choosing and Estimate&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;a name="categories"&gt;&lt;/a&gt;
&lt;h4&gt;The Three Categories&lt;/h4&gt;

&lt;dl&gt;
	&lt;dt&gt;Application-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Application-focused websites are those which enable the user to complete some specific task.  The primary question to ask of one of these websites is "How well does it work?"  They have little user-user interaction and often no author per se.
	&lt;/p&gt;
	&lt;p&gt;
	Most of Google's websites fall into this category, for example.  Google's base business is centered around aggregating and organizing information.  A more pedestrian example would be a website which helps you complete your annual tax returns or find tickets for nearby concerts.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Content-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Content-focused websites are those which provide regularly updated topical content.  The primary question to ask of one of these websites is "What information does it provide?"  There will always be at least one author and there might be an extensive degree of user-user interaction, but this interaction is always subordinate to the content.
	&lt;/p&gt;
	&lt;p&gt;
	Blogs and other news-oriented websites, including online magazines and newspapers, fall into this category.  Wikipedia is also an example, albeit one where the line between "readership" and "authorship" is blurred.  This is why the categories are defined in terms of data/user interaction rather than author/user interaction.  However, Wikipedia would be no less a website if it had the same &lt;em&gt;content&lt;/em&gt; it currently has but were only authored by, say, a certified editorial staff.  In other words, it is the content that matters, not the means by which the content is generated.
	&lt;/p&gt;
	&lt;p&gt;
	Another example is Livejournal, which allows user-user interaction in comments, groups, and via its "friends" feature (which is really a subscription feature in disguise).  User-user interaction is not the primary focus of LJ, however, and it is generally only used as a way to surface interesting content.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;User-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	User-focused websites are those which are based upon user-user interaction.  The primary question to ask of one of these websites is "Who is using this website?"  There might be topical content or searchable data, but this is incidental to the relationships between users.
	&lt;/p&gt;
	&lt;p&gt;
	Most social networks, like Facebook, Yahoo 360°, Friendster, and MySpace, fall into this category.  Nobody would use Facebook for photo sharing or storing contact information were it not for the fact that all their friends are using it, too.  MySpace was originally a content-focused website, centering around bands and their music, but has since evolved (some might say degenerated) into a user-focused website where most people just use it as a platform to promote their own personality to other users of MySpace.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;
I don't intend for these categories to be absolute, but rather just a useful tool for reasoning about websites and website design.  If you can think of any websites which do not fall into &lt;em&gt;any&lt;/em&gt; of the above categories I'd love to hear about them&lt;span class="footnote"&gt;Some websites themselves are the content, e.g., an art student's website in which the piece of art is the website.  As far as I'm concerned these are one-off affairs with no unifying logic outside of the usual artistic conventions.&lt;/span&gt;.
&lt;/p&gt;

&lt;a name="content-focused"&gt;&lt;/a&gt;
&lt;h4&gt;Content-focused Websites&lt;/h4&gt;

&lt;p&gt;
So, you have a blog and you're writing interesting stuff that has an audience.  This in itself is no small feat, but arguably the harder part is knowing how to present that information so that any given reader gets content they want to read, even if they didn't know they wanted to read it before coming to your site.  This applies to any content-focused website.  How do I give the reader the most relevant and interesting content with the least amount of effort on their part?
&lt;/p&gt;

&lt;p&gt;
Many content-focused websites don't even have real registration, e.g., WordPress blogs where registering doesn't actually confer any additional benefits.  How are the authors of the content supposed to serve up interesting content if they don't know anything about an individual reader's preferences?  And that's the key to designing a good content-focused website &amp;mdash; can you come up with a way to estimate your readers' preferences?  If yes then you just serve up content according to that estimate.
&lt;/p&gt;

&lt;a name="surface"&gt;&lt;/a&gt;
&lt;h4&gt;Surfacing Content&lt;/h4&gt;

&lt;p&gt;
Let's assume that you're an author of a content-focused website (a blog, say) and write quality content which has an audience.  For your website the data consists in a collection of posts and your job, given that people are actually interested in what you have to write, is to surface the content which is most interesting to a given reader.  There are three ways which you can do that.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;h5&gt;Global Preference Estimation&lt;/h5&gt;
&lt;p&gt;
Global preference estimation is the idea that if you know nothing about a specific reader your best estimate is the average case.  If your article about Widgets has been read more than any other article then it's not a bad bet that the average reader would also find it worth reading, for example.  Here are some ways to estimate global preferences, with explanations where necessary.
&lt;ul&gt;
&lt;li&gt;Recency&lt;/li&gt;
&lt;li&gt;Pageviews&lt;/li&gt;
&lt;li&gt;Number or recency of comments&lt;/li&gt;
&lt;li&gt;Number of inbound links&lt;/li&gt;
&lt;li&gt;In general if your site has a feature which requires readers to take a definite action on a post, e.g., commenting, viewing, emailing, etc., then you can measure preferences by the numbers of times a post has been acted upon.&lt;/li&gt;
&lt;li&gt;Featured articles &amp;mdash; if you have a good understanding of your audience explicitly surface content you believe they'd find interesting.&lt;/li&gt;
&lt;li&gt;Average post rating, if your website supports ratings.&lt;/li&gt;
&lt;li&gt;A promotion model ("Promoted Articles") based on explicit votes (X votes marks a story as promoted) or votes over time (a la digg).&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
The pro of global preference estimation is that it is relatively easy to implement and does not suffer from sparsity problems.  That is, a specific reader does not need to register all their preferences for it to return good results.  Instead preferences are collected in aggregate so that one reader's habits are as good as any other's from the perspective of a global estimate.  The con is that this estimate only deals with averages.  At best this will let you please most of the people some of the time.
&lt;/p&gt;
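&lt;p&gt;
To make this concrete, here is one (entirely made-up) global estimate: rank posts by pageviews, discounted by age so that old hits don't dominate forever.  The half-life is an arbitrary knob:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// Views decay by half every $half_life days.
function global_score($views, $age_in_days, $half_life = 30) {
    return $views * pow(0.5, $age_in_days / $half_life);
}

// A month-old post needs twice the views of a brand-new one to tie:
echo global_score(200, 30);  // 100
echo global_score(100, 0);   // 100&lt;/pre&gt;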

&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Local Preference Estimation&lt;/h5&gt;
&lt;p&gt;
Local preference estimation is based on implicit and explicit information you have about a specific reader, such as their reading, browsing, and commenting patterns.  If you can collect enough data you can surface content that often the user doesn't even realize they were looking for.
&lt;/p&gt;

&lt;p&gt;
The easiest way to get a local preference estimation is to use the most obvious fact about a reader &amp;mdash; you know when they are reading something.  It's a fairly safe assumption that the reader is interested in whatever they are reading, so it stands to reason they would also be interested in related content.  Coming up with a way to surface related content is therefore one of the first things a content-focused website should implement, in my opinion.
&lt;/p&gt;

&lt;p&gt;
For sites on which the readers are creating the content another way to measure interest is to allow readers to befriend each other.  Since this friendship is essentially arbitrary they will take "friendship" to mean whatever you tell them it means.  If you use friendship status as a means to surface interesting content then they will befriend people creating interesting content. That suggests presenting the user with the following:
&lt;ul&gt;
&lt;li&gt;Content created by my friends&lt;/li&gt;
&lt;li&gt;Content commented on by my friends&lt;/li&gt;
&lt;li&gt;Content voted on by my friends&lt;/li&gt;
&lt;li&gt;Content read by my friends&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
If you want to get very fancy (and very technical) you can create a content recommendation system.  Reader A likes stories 1, 3, and 5.  Reader B likes stories 3, 5, and 7.  It's probable that Reader A would like story 7 and Reader B would like story 1.  Techniques for content recommendation dive straight into the fields of &lt;a href="http://en.wikipedia.org/wiki/Information_retrieval"&gt;information retrieval&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Data_mining"&gt;data mining&lt;/a&gt;.  Given other local preference estimates you can come up with what it means for a reader to "like" some piece of content.  You register their preference and then use standard IR and data-mining techniques&lt;span class="footnote"&gt;For example, &lt;a href="http://en.wikipedia.org/wiki/Slope_One"&gt;slope one&lt;/a&gt; recommenders or clustering recommenders based on similarity metrics, such as cosine similarity.&lt;/span&gt; to extract patterns about their tastes.  This really only works well if you have a lot of diffuse content and a large, active readership.
&lt;/p&gt;
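&lt;p&gt;
To make the Reader A/Reader B example concrete, here's a minimal sketch of such a recommender (my own toy code, not a production system): readers are represented as sets of liked story ids, similarity between readers is cosine similarity, and unseen stories are scored by how similar their fans are to the target reader.
&lt;/p&gt;

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sets of liked story ids."""
    common = len(u & v)
    if common == 0:
        return 0.0
    return common / (sqrt(len(u)) * sqrt(len(v)))

def recommend(target, others):
    """Score stories liked by similar readers but not yet seen by the target."""
    scores = {}
    for likes in others:
        sim = cosine(target, likes)
        for story in likes - target:
            scores[story] = scores.get(story, 0.0) + sim
    # Highest-scoring unseen stories first
    return sorted(scores, key=scores.get, reverse=True)

reader_a = {1, 3, 5}
reader_b = {3, 5, 7}
print(recommend(reader_a, [reader_b]))  # [7]: story 7 is surfaced for Reader A
```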

&lt;p&gt;
The upside of local preference estimation is that it can give fairly accurate results.  Google, for example, bases much of their business around contextual information.  If you have a Google account they know your searching habits and what Google ads you're seeing around the web.  From this, in turn, they can recommend to you all sorts of things.  The con is that to get accurate results you need a lot of data.  Google and Yahoo! can pull it off because they have terabytes upon terabytes of data.  The average blog, however, will have a harder time. 
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Explicit Preferences&lt;/h5&gt;
&lt;p&gt;
Explicit preferences are just that: preferences which the user has made known or wants to make known.  To accommodate these preferences it is best for the website to simply get out of the reader's way.  Here, search is king.
&lt;/p&gt;

&lt;p&gt;
Let's say the user remembers an old post you wrote on your blog about Widgets, but can't remember the exact title or some of the secondary content.  The first thing they will probably want to do is search for "Widget."  Search isn't easy (otherwise Google wouldn't be a multi-billion-dollar company), so it's not uncommon to leave search up to a third-party application.  For this blog I trust Google to index it and my readers to use Google to search it &amp;mdash; I know Google will do a better job than any native WordPress search functionality would.
&lt;/p&gt;

&lt;p&gt;
Aside from search, a common feature in the Web 2.0 world is the tag cloud.  If you tag your content with semantically meaningful tags then the tag cloud provides a sort of topographical map of your content.  Presumably you tagged that post about Widgets with "widget," so a user looking for some post on Widgets will be able to find it by looking through all Widget-related content.
&lt;/p&gt;
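&lt;p&gt;
Mechanically a tag cloud is little more than a mapping from tag frequencies to font sizes.  Here's an illustrative sketch (the pixel bounds are arbitrary choices, not any particular blog engine's defaults):
&lt;/p&gt;

```python
def tag_cloud(tag_counts, min_px=12, max_px=32):
    """Map each tag's frequency to a font size, scaled linearly
    between min_px and max_px."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    spread = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {tag: min_px + (count - lo) * (max_px - min_px) // spread
            for tag, count in tag_counts.items()}

sizes = tag_cloud({"widget": 40, "python": 10, "meta": 10})
print(sizes)  # "widget" renders largest at 32px; the others at 12px
```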
&lt;/li&gt;
&lt;/ol&gt;

&lt;a name="estimate"&gt;&lt;/a&gt;
&lt;h4&gt;Choosing an Estimate&lt;/h4&gt;
&lt;p&gt;
For content-focused sites with worthwhile content the most important job is to surface the most interesting content.  What constitutes "interesting" varies from site-to-site and audience-to-audience but abstractly speaking the process is the same. That is, you need to come up with some way to measure how interesting a given piece of content is and display views of your content ranked according to that measure.
&lt;/p&gt;

&lt;p&gt;
For example, recency is going to be an important component of what is interesting on a news-focused site, but is hardly sufficient.  The news that Grandma Smith died just isn't as interesting as the news that a Presidential candidate was caught doing drugs, for example.  Traditional news outlets use editorial discretion to surface the interesting news.  Good editorial staffs lead to successful newspapers.
&lt;/p&gt;

&lt;p&gt;
The internet, however, affords more direct access to your audience's tastes.  Sites like digg exploit this by allowing users to vote directly on articles.  The measure of how "interesting" content is then a function of both the recency of the content and the number of votes.  The only essential difference between a site like digg and a traditional news blog is the way in which they measure how interesting a given piece of content is.
&lt;/p&gt;
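&lt;p&gt;
As an illustration, a digg-style measure might combine votes and recency like this.  The decay exponent (call it "gravity") is an invented constant you would tune for your audience, not a number from any real site:
&lt;/p&gt;

```python
def interest(votes, age_hours, gravity=1.8):
    """A vote-and-recency measure: a story's score is its vote count
    damped by its age, so fresh stories can outrank older ones."""
    return votes / (age_hours + 2) ** gravity

fresh = interest(votes=10, age_hours=1)
stale = interest(votes=50, age_hours=24)
print(fresh > stale)  # True: recency lets a lightly-voted story win
```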

&lt;p&gt;
What measures work depends heavily on both the content and audience, however.  A new measure might make for a novel kind of content-focused website but it is no guarantee that that website will be successful, even if the content has an audience.  The mechanics of the metric might not sit well with your audience.  For example they might not understand a digg-like voting mechanism, making any metric based on "votes" totally ineffective.
&lt;/p&gt;

&lt;p&gt;
So the problem for a would-be website author is two-fold: create quality content that has an audience and determine a preference estimate which surfaces the content most interesting to both the audience as a whole and a specific reader.  There are many proven measures listed above which work well, although the truly breakaway successes are usually those that either have some novel means of content creation, preference estimation, or both.
&lt;/p&gt;

&lt;a name="conclusions"&gt;&lt;/a&gt;
&lt;h4&gt;Conclusions&lt;/h4&gt;
&lt;p&gt;
Most every website falls into one of three categories, each of which is defined in terms of data-user interaction.  Content-focused websites are those which regularly generate topical content, such as online newspapers, blogs, digg, or Wikipedia.  The most pertinent question for these websites is "What information does it provide?"
&lt;/p&gt;

&lt;p&gt;
For a reader to answer this question the author of a content-focused website needs to provide a window into their content.  Presuming the author actually wants the reader to stay around and consume more content, these windows need to do more than just show random content: they need to show &lt;em&gt;interesting&lt;/em&gt; content.  It is therefore important for the author to find a way to estimate the preferences of his readers.
&lt;/p&gt;

&lt;p&gt;
This can be accomplished at either the global, aggregate level or the local, contextual level.  A global estimate surfaces content which is interesting to the average reader while a local estimate surfaces content interesting to a specific reader, given what you know about them.  In addition readers sometimes make their preferences known explicitly in which case there should also be a path for readers who are looking for specific content, e.g., a proper search function.  Assuming you are actually writing worthwhile content then a good estimate goes a long way towards converting users to your site.
&lt;/p&gt;

&lt;p&gt;
Above all it is important to think clearly about getting your readers what they want as easily as possible.  I often find it useful to imagine I know nothing about where content resides in my site and go from there.  Is what I see interesting enough for me to keep looking?  If so, how long before it becomes uninteresting?  If not, how long before it does?  Could I find what I wanted if I really had to?
&lt;/p&gt;

&lt;p&gt;
Finally, I'd love to get any and all feedback on this article.  I've been tossing these ideas around in my head for a few months now and thought now was a good time to write them down for the first time.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Thu, 17 May 2007 01:00:57 +0000</pubDate>
      <link>http://20bits.com/article/designing-content-focused-websites</link>
    </item>
    <item>
      <title>Introduction to Dynamic Programming</title>
      <description>&lt;p&gt;
Dynamic programming is a method for efficiently solving a broad range of search and optimization problems which exhibit the characteristics of &lt;a href="http://en.wikipedia.org/wiki/Overlapping_subproblem"&gt;overlapping subproblems&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Optimal_substructure"&gt;optimal substructure&lt;/a&gt;.  I'll try to illustrate these characteristics through some simple examples and end with an exercise.  Happy coding!
&lt;/p&gt;

&lt;!--more--&gt;

&lt;h4&gt;Contents&lt;/h4&gt;
&lt;p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;a href="#subproblems"&gt;Overlapping Subproblems&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#optimal"&gt;Optimal Substructure&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#knapsack"&gt;The Knapsack Problem&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#everyday"&gt;Everyday Dynamic Programming&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;a name="subproblems"&gt;&lt;/a&gt;
&lt;h4&gt;Overlapping Subproblems&lt;/h4&gt;
&lt;p&gt;
A problem is said to have overlapping subproblems if it can be broken down into subproblems which are reused multiple times.  This is closely related to recursion.  To see the difference consider the &lt;a href="http://mathworld.wolfram.com/Factorial.html"&gt;factorial&lt;/a&gt; function, defined as follows (in Python):
&lt;pre class="brush: python"&gt;def factorial(n):
	if n == 0: return 1
	return n*factorial(n-1)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Thus the problem of calculating &lt;tt&gt;factorial(n)&lt;/tt&gt; depends on calculating the subproblem &lt;tt&gt;factorial(n-1)&lt;/tt&gt;.  This problem does &lt;strong&gt;not&lt;/strong&gt; exhibit &lt;em&gt;overlapping&lt;/em&gt; subproblems since &lt;tt&gt;factorial&lt;/tt&gt; is called exactly once for each positive integer less than n.
&lt;/p&gt;

&lt;h5&gt;Fibonacci Numbers&lt;/h5&gt;
&lt;p&gt;
The problem of calculating the n&lt;sup&gt;th&lt;/sup&gt; &lt;a href="http://en.wikipedia.org/wiki/Fibonacci_number"&gt;Fibonacci number&lt;/a&gt; does, however, exhibit overlapping subproblems.  The naïve recursive implementation would be
&lt;pre class="brush: python"&gt;def fib(n):
	if n == 0: return 0
	if n == 1: return 1
	return fib(n-1) + fib(n-2)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
The problem of calculating &lt;tt&gt;fib(n)&lt;/tt&gt; thus depends on both &lt;tt&gt;fib(n-1)&lt;/tt&gt; and &lt;tt&gt;fib(n-2)&lt;/tt&gt;.  To see how these subproblems overlap look at how many times fib is called and with what arguments when we try to calculate &lt;tt&gt;fib(5)&lt;/tt&gt;:&lt;pre&gt;fib(5)
fib(4) + fib(3)
fib(3) + fib(2) + fib(2) + fib(1)
fib(2) + fib(1) + fib(1) + fib(0) + fib(1) + fib(0) + fib(1)
fib(1) + fib(0) + fib(1) + fib(1) + fib(0) + fib(1) + fib(0) + fib(1)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
At the k&lt;sup&gt;th&lt;/sup&gt; stage we only need to know the values of &lt;tt&gt;fib(k-1)&lt;/tt&gt; and &lt;tt&gt;fib(k-2)&lt;/tt&gt;, but we wind up calling each multiple times.  Starting from the bottom and going up we can calculate the numbers we need for the next step, removing the massive redundancy.
&lt;pre class="brush: python"&gt;def fib2(n):
	if n == 0: return 0
	n2, n1 = 0, 1
	for i in range(n-2):
		n2, n1 = n1, n1 + n2
	return n2+n1&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
In &lt;a href="http://en.wikipedia.org/wiki/Big_O_notation"&gt;Big-O&lt;/a&gt; notation the &lt;tt&gt;fib&lt;/tt&gt; function takes O(c&lt;sup&gt;n&lt;/sup&gt;) time, i.e., exponential in n, while the &lt;tt&gt;fib2&lt;/tt&gt; function takes O(n) time.  If this is all too abstract take a look at this graph comparing the runtime (in microseconds) of &lt;tt&gt;fib&lt;/tt&gt; and &lt;tt&gt;fib2&lt;/tt&gt; versus the input parameter.
&lt;/p&gt;

&lt;div class="math"&gt;
&lt;a rel="lightbox" href="/include/images/fib_performance.png"&gt;&lt;img src="http://assets.20bits.com/20070508/fib_performance_thumb.png" width="250" height="187" /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;
The above problem is pretty easy and for most programmers is one of the first examples of the performance issues surrounding recursion versus iteration.  In fact, I've seen many instances where the Fibonacci example leads people to believe that recursion is inherently slow.  This is not true; rather, when a problem with overlapping subproblems is defined recursively, the naïve recursion recomputes the same subproblems over and over, and a bottom-up technique like the one above eliminates that redundancy.
&lt;/p&gt;
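&lt;p&gt;
In fact we can keep the recursive definition and still avoid the redundancy by caching each subproblem's answer the first time we compute it.  Here's a sketch using Python's built-in &lt;tt&gt;functools.lru_cache&lt;/tt&gt;:
&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    """Recursive Fibonacci, but each overlapping subproblem
    is computed only once and then served from the cache."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(50))  # 12586269025, instantly; the naive fib(50) is infeasible
```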

&lt;p&gt;Now, for the second characteristic of dynamic programming: optimal substructure.&lt;/p&gt;

&lt;a name="optimal"&gt;&lt;/a&gt;
&lt;h4&gt;Optimal Substructure&lt;/h4&gt;
&lt;p&gt;
A problem is said to have &lt;a href="http://en.wikipedia.org/wiki/Optimal_substructure"&gt;optimal substructure&lt;/a&gt; if the globally optimal solution can be constructed from locally optimal solutions to subproblems.  The general form of problems in which optimal substructure plays a role goes something like this.  Let's say we have a collection of objects called A.  For each object o in A we have a "cost," c(o).  Now find the subset of A with the maximum (or minimum) cost, perhaps subject to certain constraints.
&lt;/p&gt;

&lt;p&gt;
The brute-force method would be to generate every subset of A, calculate the cost, and then find the maximum (or minimum) among those values.  But if A has n elements in it we are looking at a search space of size 2&lt;sup&gt;n&lt;/sup&gt; if there are no constraints on A.  Oftentimes n is huge making a brute-force method computationally infeasible.  Let's take a look at an example.
&lt;/p&gt;

&lt;h5&gt;Maximum Subarray Sum&lt;/h5&gt;
&lt;p&gt;
Let's say we're given an array of integers.  What (contiguous) subarray has the largest sum?  For example, if our array is [1,2,-5,4,7,-2] then the subarray with the largest sum is [4,7] with a sum of 11.  One might think at first that this problem reduces to finding the subarray with all positive entries, if one exists, that maximizes the sum.  But consider the array [1,5,-3,4,-2,1].  The subarray with the largest sum is [1, 5, -3, 4] with a sum of 7.
&lt;/p&gt;

&lt;p&gt;
First, the brute-force solution.  Because of the constraints on the problem, namely that the subsets under consideration are contiguous, we only have to check O(n&lt;sup&gt;2&lt;/sup&gt;) subarrays (why?).  Here it is, in Python: &lt;pre class="brush: python"&gt;def msum(a):
	return max([(sum(a[j:i]), (j,i)) for i in range(1,len(a)+1) for j in range(i)])&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This returns both the sum and the offsets of the subarray.  Let's see if we can't find an optimal substructure to exploit.
&lt;/p&gt;

&lt;p&gt;
We are given an input array &lt;tt&gt;a&lt;/tt&gt;.  I'm going to use Python notation so that &lt;tt&gt;a[0:k]&lt;/tt&gt; is the subarray starting at 0 and including every element up to and including &lt;tt&gt;k-1&lt;/tt&gt;.  Let's say we know the subarray of &lt;tt&gt;a[0:i]&lt;/tt&gt; with the largest sum (and that sum).  Using just this information can we find the subarray of &lt;tt&gt;a[0:i+1]&lt;/tt&gt; with the largest sum?
&lt;/p&gt;

&lt;p&gt;
Suppose we have processed a[0:i]: s is the largest sum seen so far, t is the running sum of a[j:i], and j is the current candidate left-hand edge.  For the next element a[i], first add it to the running sum: t = t + a[i].  If t is now greater than s then a[j:i+1] is the best subarray so far, so set s = t and record its bounds.  If t is negative, however, the contiguity constraint means that any subarray extending further right would do better to drop a[j:i+1] entirely, since including it can only lower the sum.  So when t goes negative, set t = 0 and move the left-hand bound of the candidate subarray to i+1.
&lt;/p&gt;

&lt;p&gt;
To visualize consider the array [1,2,-5,4,7,-2].
&lt;pre&gt;Set s = -infinity, t = 0, j = 0, bounds = (0,0)
(1   2  -5   4   7  -2 
(1)| 2  -5   4   7  -2  (set t=1.  Since t &gt; s, set s=1 and bounds = (0,1))
(1   2)|-5   4   7  -2  (set t=3.  Since t &gt; s, set s=3, and bounds = (0,2))
 1   2  -5(| 4   7  -2  (set t=-2. Since t &lt; 0, set t=0 and j = 3 )
 1   2  -5  (4)| 7  -2  (set t=4.  Since t &gt; s, set s=4 and bounds = (3,4))
 1   2  -5  (4   7)|-2  (set t=11. Since t &gt; s, set s=11 and bounds = (3,5))
 1   2  -5  (4   7) -2| (set t=9.  Nothing happens since t &lt; s)
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This requires only one pass through the array and at each step we're only keeping track of three variables: the current sum from the left-hand edge of the bounds to the current point (t), the maximal sum (s), and the bounds of the current optimal subarray (bounds).  In Python:
&lt;pre class="brush: python"&gt;def msum2(a):
	bounds, s, t, j = (0,0), -float('infinity'), 0, 0
	
	for i in range(len(a)):
		t = t + a[i]
		if t &gt; s: bounds, s = (j, i+1), t
		if t &lt; 0: t, j = 0, i+1
	return (s, bounds)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
In this problem the "globally optimal" solution corresponds to a subarray with a globally maximal sum, but at each step we only make a decision relative to what we have already seen. That is, at each step we know the best solution &lt;em&gt;thus far&lt;/em&gt;, but might change our decision later based on our previous information and the current information.   This is the sense in which the problem has optimal substructure.  Because we can make decisions locally we only need to traverse the list once, reducing the run-time of the solution to O(n) from O(n&lt;sup&gt;2&lt;/sup&gt;).  Again, a graph:&lt;/p&gt;

&lt;div class="math"&gt;
&lt;a rel="lightbox" href="/include/images/msum_performance.png"&gt;&lt;img src="http://assets.20bits.com/20070508/msum_performance_thumb.png" width="250" height="187" /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;a name="knapsack"&gt;&lt;/a&gt;
&lt;h4&gt;The Knapsack Problem&lt;/h4&gt;
&lt;p&gt;
Let's apply what we've learned so far to a slightly more interesting problem.  You are an art thief who has found a way to break into the impressionist wing at the Art Institute of Chicago.  Obviously you can't take everything.  In particular, you're constrained to take only what your knapsack can hold &amp;mdash; let's say it can only hold W pounds. You also know the market value for each painting.  Given that you can only carry W pounds, what paintings should you steal in order to maximize your profit?
&lt;/p&gt;

&lt;p&gt;
First let's see how this problem exhibits both overlapping subproblems and optimal substructure.  Say there are n paintings with weights w&lt;sub&gt;1&lt;/sub&gt;, ..., w&lt;sub&gt;n&lt;/sub&gt; and market values v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;.  Define A(i,j) as the maximum value that can be attained from considering only the first i items weighing at most j pounds, as follows.
&lt;/p&gt;
&lt;p&gt;
Obviously A(0,j) = 0 and A(i,0) = 0 for any i &amp;le; n and j &amp;le; W.  If w&lt;sub&gt;i&lt;/sub&gt; &gt; j then A(i,j) = A(i-1, j) since we cannot include the i&lt;sup&gt;th&lt;/sup&gt; item.  If, however, w&lt;sub&gt;i&lt;/sub&gt; &amp;le; j then we have a choice: include the i&lt;sup&gt;th&lt;/sup&gt; item or not.  If we do not include it then the value will be A(i-1, j).  If we do include it, however, the value will be v&lt;sub&gt;i&lt;/sub&gt; + A(i-1, j - w&lt;sub&gt;i&lt;/sub&gt;).  Which choice should we make?  Well, whichever is larger, i.e., the maximum of the two.
&lt;/p&gt;

&lt;p&gt;
Expressed formally we have the following recursive definition
&lt;img class="math" src="http://assets.20bits.com/20070508/knapsack_fct.png" alt="Knapsack function" width="491" height="71" /&gt;
&lt;/p&gt;

&lt;p&gt;
This problem exhibits both overlapping subproblems and optimal substructure and is therefore a good candidate for dynamic programming.  The subproblems overlap because at any stage (i,j) we might need to calculate A(k,l) for several k &lt; i and l &lt; j.   We have optimal substructure since at any point we only need information about the choices we have already made.
&lt;/p&gt;

&lt;p&gt;
The recursive solution is not hard to write: &lt;pre class="brush: python"&gt;def A(w, v, i,j):
    if i == 0 or j == 0: return 0
    if w[i-1] &gt; j:  return A(w, v, i-1, j)
    if w[i-1] &lt;= j: return max(A(w,v, i-1, j), v[i-1] + A(w,v, i-1, j - w[i-1]))&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Remember we need to calculate A(n,W).  To do so we're going to need to create an (n+1)-by-(W+1) table whose entry at (i,j) contains the value of A(i,j).  The first time we calculate the value of A(i,j) we store it in the table at the appropriate location.  This technique is called &lt;a href="http://en.wikipedia.org/wiki/Memoization"&gt;memoization&lt;/a&gt; and is one way to exploit overlapping subproblems.  There's also a Ruby module called &lt;a href="http://raa.ruby-lang.org/project/memoize/"&gt;memoize&lt;/a&gt; which automates this.
&lt;/p&gt;

&lt;p&gt;
To exploit the optimal substructure we iterate over all i &lt;= n and j &lt;= W, at each step applying the recursion formula to generate the A(i,j) entry by using the memoized table rather than calling A() again. This gives an algorithm which takes O(nW) time using O(nW) space and our desired result is stored in the A(n,W) entry in the table.
&lt;/p&gt;
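&lt;p&gt;
Putting the two pieces together, here's a sketch of the bottom-up version.  The painting weights and values in the example are invented for illustration:
&lt;/p&gt;

```python
def knapsack(w, v, W):
    """Bottom-up 0/1 knapsack.  A[i][j] holds the best value attainable
    using only the first i items with a weight limit of j."""
    n = len(w)
    A = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, W + 1):
            if w[i - 1] > j:
                # Item i doesn't fit: inherit the best without it
                A[i][j] = A[i - 1][j]
            else:
                # Take the better of skipping or including item i
                A[i][j] = max(A[i - 1][j],
                              v[i - 1] + A[i - 1][j - w[i - 1]])
    return A[n][W]

# Three paintings weighing 3, 4, and 5 pounds, worth 4, 5, and 6,
# with a 7-pound knapsack:
print(knapsack([3, 4, 5], [4, 5, 6], 7))  # 9: take the 3- and 4-pound paintings
```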

&lt;a name="everyday"&gt;&lt;/a&gt;
&lt;h4&gt;Everyday Dynamic Programming&lt;/h4&gt;

&lt;p&gt;
The above examples might make dynamic programming look like a technique which only applies to a narrow range of problems, but many algorithms from a wide range of fields use dynamic programming.  Here's a very partial list.
&lt;ol&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm"&gt;Needleman-Wunsch algorithm&lt;/a&gt;, used in bioinformatics.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/CYK_algorithm"&gt;CYK algorithm&lt;/a&gt;, which is used in the theory of formal languages and natural language processing.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Viterbi_algorithm"&gt;Viterbi algorithm&lt;/a&gt; used in relation to &lt;a href="http://en.wikipedia.org/wiki/Hidden_Markov_model"&gt;hidden Markov models&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finding the &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;string-edit distance&lt;/a&gt; between two strings, useful in writing spellcheckers.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Duckworth-Lewis_method"&gt;D/L method&lt;/a&gt; used in the sport of &lt;a href="http://en.wikipedia.org/wiki/One-day_cricket"&gt;cricket&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
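&lt;p&gt;
To give a taste of the list above, the string-edit distance is a classic table-filling dynamic program, exactly like the knapsack table:
&lt;/p&gt;

```python
def edit_distance(s, t):
    """Levenshtein (string-edit) distance: d[i][j] is the minimum number
    of insertions, deletions, and substitutions turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```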

&lt;p&gt;
That's all for today.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Tue, 08 May 2007 09:23:48 +0000</pubDate>
      <link>http://20bits.com/article/introduction-to-dynamic-programming</link>
    </item>
    <item>
      <title>The Infection Puzzle</title>
      <description>&lt;p&gt;
I first heard this puzzle from the affable Hungarian mathematician and computer scientist &lt;a href="http://people.cs.uchicago.edu/~laci/"&gt;László Babai&lt;/a&gt; in his &lt;a href="http://www.cs.uchicago.edu/courses/description/CMSC/27400"&gt;Combinatorics and Probability&lt;/a&gt; class.  It gave headaches to a lot of people much smarter than I am, but there is what Babai would call an "Ah-haaaa!" solution.  Read on if you're brave enough.
&lt;/p&gt;
&lt;!--more--&gt;

&lt;!--[if IE]&gt;&lt;script type="text/javascript" src="/include/js/excanvas-compressed.js"&gt;&lt;/script&gt;&lt;![endif]--&gt;
&lt;script src="/include/js/infection.js" type="text/javascript"&gt;&lt;/script&gt;
&lt;h4&gt;The Rules&lt;/h4&gt;
&lt;p&gt;
The rules of the puzzle are simple.
&lt;ol&gt;
&lt;li&gt;You are given an n-by-n board, e.g., an 8x8 chessboard.&lt;/li&gt;
&lt;li&gt;Some initial number of the squares are "infected."&lt;/li&gt;
&lt;li&gt;If an uninfected square shares an edge with two or more infected squares then it too becomes infected.&lt;/li&gt;
&lt;li&gt;The infection spreads until there are no more squares which can be infected.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
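&lt;p&gt;
If you'd rather experiment in code than in a browser, here's a sketch of the rules in Python (my own transcription, separate from the JavaScript version below):
&lt;/p&gt;

```python
def spread(infected, n):
    """Run the infection to a standstill on an n-by-n board.
    An uninfected square with 2+ infected edge-neighbors becomes infected."""
    infected = set(infected)
    while True:
        newly = set()
        for r in range(n):
            for c in range(n):
                if (r, c) in infected:
                    continue
                neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                if sum(nb in infected for nb in neighbors) >= 2:
                    newly.add((r, c))
        if not newly:  # nothing changed this time step: the infection has stopped
            return infected
        infected |= newly

# Infecting the main diagonal takes over an entire 8x8 board:
diagonal = {(i, i) for i in range(8)}
print(len(spread(diagonal, 8)))  # 64
```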

&lt;p&gt;
We view the infection as spreading in discrete time steps, which makes this puzzle an example of a &lt;a href="http://mathworld.wolfram.com/CellularAutomaton.html"&gt;cellular automaton&lt;/a&gt;.  Here is an example of an initial infection which stops spreading well before it infects the whole board.

&lt;div class="math"&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_0.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_1.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_2.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_3.png" height="81" width="81" /&gt;
&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;The Puzzle&lt;/h4&gt;
&lt;p&gt;
There are many initial configurations which will infect the whole board.  The most obvious and least interesting configuration is one with every square infected.  It's also not hard to see that we can infect the whole board with fewer than this many initially infected squares.  So, the question is, what is the smallest number of initially infected squares required to infect an n-by-n board?  Does it depend on n?  If so, how?
&lt;/p&gt;

&lt;p&gt;
To help you get a grip on the puzzle I've included a JavaScript implementation below using the &lt;a href="http://en.wikipedia.org/wiki/Canvas_(HTML_element)"&gt;canvas tag&lt;/a&gt;.  This means that IE users are out of luck, but if you're on Windows you should be using &lt;a href="http://www.mozilla.com/en-US/firefox/"&gt;Firefox&lt;/a&gt; or &lt;a href="http://www.opera.com/"&gt;Opera&lt;/a&gt;, anyhow.  &lt;strong&gt;Edit:&lt;/strong&gt; I just learned about Google's &lt;a href="http://excanvas.sourceforge.net/"&gt;Explorer Canvas&lt;/a&gt;, so hopefully this now works in IE.  Since I don't have the means to test it on my laptop "hopefully" is the best I can offer.
&lt;/p&gt;

&lt;p&gt;
To infect or disinfect a square just click on it.  Once you're ready to test your initial configuration click "Run it!".
&lt;/p&gt;
&lt;p class="math"&gt;
&lt;canvas id="canvas" width="161" height="161"&gt;
&lt;/canvas&gt;
&lt;script type="text/javascript"&gt;
var canvas = $('canvas');
var game = new Infection((canvas.width - 1) / 20, canvas);
game.clear();
&lt;/script&gt;
&lt;br /&gt;
&lt;a href="#" onclick="game.start();return false;"&gt;&lt;strong&gt;Run it!&lt;/strong&gt;&lt;/a&gt; | &lt;a href="#" onclick="game.clear();return false"&gt;Clear&lt;/a&gt;
&lt;/p&gt;



&lt;p&gt;
You're free to use the code pursuant to the Creative Commons &lt;a href="http://creativecommons.org/licenses/by-sa/3.0/"&gt;Attribution-ShareAlike 3.0&lt;/a&gt; license.  The code requires a browser that supports the canvas tag, as mentioned above, and also the &lt;a href="http://www.prototypejs.org/"&gt;Prototype&lt;/a&gt; JavaScript framework.
&lt;div class="download"&gt;Download the &lt;a href="/include/js/infection.js"&gt;infection puzzle&lt;/a&gt; code.&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;Update&lt;/h4&gt;
&lt;p&gt;
If you've read the comments then you know you can infect the whole n-by-n board by infecting one of the diagonals.  This leads people to believe that you cannot infect the whole board with fewer than n initially-infected squares.  The first "proof" of this rests on the claim that for the whole board to be infected every row and column must contain at least one initially infected square.  If we start with fewer than n squares infected then at least one row or column would be empty, so that the whole board could not be infected.
&lt;/p&gt;
&lt;p&gt;
This would be a fine proof were it the case that every row or column must contain an initially infected square.  Here's a configuration which infects the whole board but which nevertheless has initially infected squares in fewer than n (here n=8) rows and columns:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20070505/infection_rowcol.png" width="161" height="161" /&gt;


&lt;p&gt;
So, we know we can infect the whole board by infecting n squares initially (one of the diagonals).  But can we do it in fewer?  If not, why not?
&lt;/p&gt;</description>
      <pubDate>Sat, 05 May 2007 00:03:49 +0000</pubDate>
      <link>http://20bits.com/article/the-infection-puzzle</link>
    </item>
    <item>
      <title>Facebook job puzzles: Prime bits</title>
      <description>&lt;p&gt;
Welcome to the second installment of 20bits' &lt;a href="http://www.facebook.com/jobs_puzzles/"&gt;Facebook job puzzles&lt;/a&gt; solution manual.  This time I am going to tackle the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=5"&gt;Prime Bits&lt;/a&gt; puzzle.
&lt;/p&gt;

&lt;p&gt;
Like my &lt;a href="/article/facebook-job-puzzles-korn-shell"&gt;Korn Shell&lt;/a&gt; solution, this puzzle is mostly mathematical.  This time, however, we're going to be wading deep into &lt;a href="http://en.wikipedia.org/wiki/Combinatorics"&gt;combinatorics&lt;/a&gt; territory.  Combinatorics is the mathematics of counting.  If you have three pairs of socks, two pairs of pants, and four shirts, how many outfits can you wear?  If you have a collection of twenty playing cards how many two-card hands are there?  These are the sorts of questions combinatorics exists to answer, although the questions quickly become more complicated.
&lt;/p&gt;

&lt;p&gt;
Again, I'm aiming to make this understandable to an intelligent layperson interested in the puzzle.  If that's you then read on!
&lt;/p&gt;
&lt;!--more--&gt;

&lt;h4&gt;The Question&lt;/h4&gt;
&lt;p&gt;
This time around the question is much more straightforward.  Every (positive) integer can be represented using &lt;a href="http://en.wikipedia.org/wiki/Binary_numeral_system"&gt;binary&lt;/a&gt;.  Normally we write using decimal, which is based around powers of ten.  For example, 215 is really 2*10&lt;sup&gt;2&lt;/sup&gt; + 1*10&lt;sup&gt;1&lt;/sup&gt; + 5*10&lt;sup&gt;0&lt;/sup&gt;.  Binary involves using 2s rather than 10s as place values, so, e.g., 5 is written as 101 = 1*2&lt;sup&gt;2&lt;/sup&gt; + 0*2&lt;sup&gt;1&lt;/sup&gt; + 1*2&lt;sup&gt;0&lt;/sup&gt;.
&lt;/p&gt;

&lt;p&gt;
Every integer therefore can be represented as a string of zeroes and ones.  Define &lt;tt&gt;P(x)&lt;/tt&gt; to be true if the number of ones in the binary representation of x is prime and false otherwise.  So, e.g., P(5) is true but P(4) is false.  Our job is to implement the function &lt;tt&gt;uint64_t prime_bits(uint64_t a, uint64_t b);&lt;/tt&gt; which returns the number of integers k, a &amp;le; k &amp;le; b such that P(k) is true.  &lt;tt&gt;uint64_t&lt;/tt&gt; is a way of designating 64-bit numbers in C/C++.
&lt;/p&gt;
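&lt;p&gt;
Before anything clever, we need &lt;tt&gt;P(x)&lt;/tt&gt; itself.  Here's a straightforward sketch; trial division is plenty for the primality check, since a 64-bit number has at most 64 ones:
&lt;/p&gt;

```python
def P(x):
    """True if the count of ones in x's binary representation is prime."""
    ones = bin(x).count("1")
    if ones < 2:
        return False  # 0 and 1 are not prime
    # Trial division up to the square root is ample for ones <= 64
    return all(ones % d != 0 for d in range(2, int(ones ** 0.5) + 1))

print(P(5), P(4))  # True False: 5 = 101 has two ones (prime), 4 = 100 has one
```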

&lt;h4&gt;The Obvious Solution&lt;/h4&gt;
&lt;p&gt;
Assuming we have implemented &lt;tt&gt;P(x)&lt;/tt&gt; the obvious solution in Python is this
&lt;pre class="brush: python"&gt;def prime_bits(a,b):
	return sum([P(k) for k in range(a,b+1)])&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
That is, for each integer in the desired range we calculate P(k) and then sum all these values.  Since "true" corresponds to "1" and "false" to "0" we get the total number of true entries in our desired range.  A more explicit procedural implementation would be &lt;pre class="brush: python"&gt;def prime_bits(a,b):
	c = 0
	for k in range (a,b+1):
		if P(k):
			c = c + 1
	return c&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Assuming P(n) = O(n) (this is called &lt;a href="http://en.wikipedia.org/wiki/Big_O_notation"&gt;Big-O notation&lt;/a&gt;) then we have prime_bits(a,b) = O(b&lt;sup&gt;2&lt;/sup&gt;).  Surely we can do better.  In fact, according to the constraints on the Facebook page we &lt;em&gt;have&lt;/em&gt; to do better to even be considered in the running.
&lt;/p&gt;

&lt;p&gt;
The above Big-O analysis tells us that we must be careful of two things: first, our implementation, whatever it is, cannot iterate over every integer between a and b; second, our prime-checking function has to be fast enough that it doesn't dwarf the running time of the rest of our algorithm.
&lt;/p&gt;

&lt;h4&gt;Bring a Little Binomial into Your Life&lt;/h4&gt;

&lt;p&gt;
Forget about the fact that we're dealing with numbers and think of our binary representation as a string.  That is, think of "110101" as nothing more than some sequence of zeroes and ones.  It could just as easily be "aababa" or "!!?!?!" or any other two characters we choose.
&lt;/p&gt;

&lt;p&gt;
From this perspective we can turn the question on its head.  Rather than going through each possible string one by one, let's count groups of strings en masse.  How many 6-character strings of zeroes and ones have 3 ones?
&lt;/p&gt;

&lt;p&gt;
The answer rests in that foundational function of combinatorics, the &lt;a href="http://mathworld.wolfram.com/BinomialCoefficient.html"&gt;binomial coefficient&lt;/a&gt;.  It is defined as follows:
&lt;img class="math" src="http://assets.20bits.com/20070427/binomial_def.png" alt="definition of binomial coefficient" /&gt;
pronounced "&lt;em&gt;n&lt;/em&gt; choose &lt;em&gt;k&lt;/em&gt;" and sometimes written n&lt;em&gt;C&lt;/em&gt;k, where
&lt;img class="math" src="http://assets.20bits.com/20070427/factorial_def.png" alt="definition of factorial" /&gt;

is the &lt;a href="http://mathworld.wolfram.com/Factorial.html"&gt;factorial&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
This might seem like gibberish so let's start with what it &lt;em&gt;means&lt;/em&gt;.  Let's say we're given a vat of balls labeled 1 through 6.  The binomial coefficient tells us how many ways we can pick three balls from the vat if we don't care about the order in which they are picked.  We might choose {1,2,3} or {2,4,6}, for example, but {4,5,3} and {3,4,5} are considered to be the same drawing.
&lt;/p&gt;

&lt;p&gt;
How might we go about counting this?  Well, let's start by caring about order because that's easier to count.  If we &lt;em&gt;do&lt;/em&gt; care about order then there are 6 ways to pick the first ball, 5 ways to pick the second ball, and 4 ways to pick the third and final ball.  This means there are 6*5*4 ways to pick the balls if we care about order, i.e., if we treat (4,5,3) and (3,4,5) as different drawings.
&lt;/p&gt;

&lt;p&gt;
Now how many ways are there to arrange each triplet?  Well, there are 3 ways to choose the first element, 2 ways to choose the second, and 1 way to choose the third and final element.  Explicitly, the six following triplets are different  orderings of the same drawing:
&lt;pre&gt;
(1,2,3), (2,3,1), (3,1,2), (1,3,2), (3,2,1), (2,1,3)
&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
But we can write 6*5*4 = 6!/3! and 3*2*1 = 3!, using the factorial notation.  This means that if we don't care about order, there are a total of 6!/(3!*3!) ways of drawing balls from the vat.  Or we could write 6!/((6-3)!*3!) = 6C3, 6 choose 3.  So hopefully you can see that the above formula didn't totally come from nowhere.
&lt;/p&gt;

&lt;p&gt;
Let's go back to our binary brou-ha-ha.  Let's take a 6-bit binary string, 000000.  Now, how many such strings have three ones?  Well, 6C3.  If we want to know how many 6-bit binary strings have a prime number of ones we just need to find all the prime numbers less than 6 and use the binomial coefficient.  Since 2, 3, and 5 are the only primes less than 6 we know there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_choice_6.png" alt="6-bit binary string calculation" /&gt;
6-bit binary strings with either 2, 3 or 5 (i.e., a prime number of) ones.
&lt;/p&gt;
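&lt;p&gt;
A quick sanity check of that sum in Python (&lt;tt&gt;math.comb&lt;/tt&gt;, available in Python 3.8+, is the binomial coefficient):
&lt;/p&gt;

```python
from math import comb

# 6C2 + 6C3 + 6C5: 6-bit strings with a prime number of ones.
count = comb(6, 2) + comb(6, 3) + comb(6, 5)
print(count)  # 15 + 20 + 6 = 41
```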

&lt;h4&gt;Paging Mr. Pascal&lt;/h4&gt;
&lt;p&gt;
Since our motivation for writing a new method is performance the above might very well be worthless if we can't make it perform.  Luckily the binomial coefficient has an excellent recursive definition related to &lt;a href="http://en.wikipedia.org/wiki/Pascal's_triangle"&gt;Pascal's triangle&lt;/a&gt;:
&lt;img class="math" src="http://assets.20bits.com/20070427/binomial_recursion.png" alt="binomial recursion" /&gt;
&lt;/p&gt;

&lt;p&gt;
By using &lt;a href="http://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; we can generate the n&lt;sup&gt;th&lt;/sup&gt; row of Pascal's triangle in O(n&lt;sup&gt;2&lt;/sup&gt;) time.  But wait, weren't we looking for something that beat this?  (Remember our original implementation of prime_bits(a,b) took O(b&lt;sup&gt;2&lt;/sup&gt;) time.)  Here's where you need to be careful: we need to generate as many rows in Pascal's triangle as there are significant bits in our integer.  For a 64-bit integer this means we'll be generating at most 64 rows of Pascal's triangle.  In terms of our inputs a and b this is O((log b)&lt;sup&gt;2&lt;/sup&gt;) time.  Of course we're a far cry from a full implementation, but at least we know we can use binomial coefficients within our performance requirements.
&lt;/p&gt;
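&lt;p&gt;
The dynamic programming step can be sketched directly from the recursion above; each interior entry is the sum of the two entries above it:
&lt;/p&gt;

```python
def pascal_rows(n):
    # Row i holds the binomial coefficients C(i, 0) .. C(i, i),
    # built via C(i, k) = C(i-1, k-1) + C(i-1, k).
    rows = [[1]]
    for i in range(1, n + 1):
        prev = rows[-1]
        rows.append([1] + [prev[k - 1] + prev[k] for k in range(1, i)] + [1])
    return rows
```

&lt;p&gt;
For 64-bit inputs, &lt;tt&gt;pascal_rows(64)&lt;/tt&gt; is all we will ever need.
&lt;/p&gt;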

&lt;h4&gt;Back to Numbers&lt;/h4&gt;
&lt;p&gt;
Let's say we have a number which is a power of two, e.g., 2&lt;sup&gt;8&lt;/sup&gt;.  In binary this is 100000000.  All numbers less than it are of the form 0xxxxxxxx, where x is either 0 or 1.  We have eight places to fill and we're interested in those combinations which have a prime number of 1s.  2, 3, 5, and 7 are the only primes less than 8 so there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_choice_8.png" alt="8-bit integers with prime ones" /&gt;
positive integers less than 2&lt;sup&gt;8&lt;/sup&gt; which have a prime number of ones in their binary representation.
&lt;/p&gt;

&lt;p&gt;
So far so good, but the above method only works for numbers that are powers of two.  What happens if we want to count the number of desired positive integers less than 100101?  111101 has a prime number of ones but it is larger than 100101.  So how do we extend the above method to arbitrary integers?
&lt;/p&gt;

&lt;p&gt;
Consider the following ranges of numbers:
&lt;pre&gt;
Numbers less than 100101 (37)
----
000000 to 011111
100000 to 100011
100100
100101
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
First, do these ranges together encompass every number up to 100101?  The first range is the set of all numbers less than 100000, the second extends that to every number less than 100100, the third adds 100100 itself to cover everything less than 100101, and the fourth is the number itself.  So certainly every number in this set is at most 100101, but is that every such number?
&lt;/p&gt;

&lt;p&gt;
Yes, and we can see that by looking at place values.  If a number agrees with 100101 in the left-most bit then it cannot have a 1 in the next two places without being &lt;em&gt;greater&lt;/em&gt; than the number in question.  Likewise, if it agrees with 100101 in the four left-most bits then it cannot have a 1 in the second position without being greater.
&lt;/p&gt;

&lt;p&gt;
From this we can determine combinatorially how many such numbers have a prime number of 1 bits.  For the range 000000 to 011111 there are
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_range_1.png" alt="prime range 1" /&gt;
numbers which have the desired property, since we have 5 bits to choose freely and 2, 3, and 5 are the only primes less than or equal to 5.
&lt;/p&gt;

&lt;p&gt;
The next range, 100000 to 100011, is similar, except we have one bit fixed and two bits to choose freely.  Thus there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_range_2.png" alt="prime range 2" /&gt;
numbers which have the desired property.  This is because there is already one bit set, so we count all combinations of the free bits which, together with that 1, have a prime number of bits set to 1.
&lt;/p&gt;

&lt;h4&gt;The Algorithm and Some Caveats&lt;/h4&gt;
&lt;p&gt;
So, for a given number we now have a function pb(n) which counts the number of positive integers less than or equal to n which have a prime number of bits set to 1.  &lt;tt&gt;prime_bits(a,b)&lt;/tt&gt; then just becomes &lt;tt&gt;pb(b) - pb(a-1)&lt;/tt&gt;, since the range is inclusive of a.
&lt;/p&gt;

&lt;p&gt;
To find the number of positive integers between 1 and n, inclusive, with a prime number of bits we therefore need to do three things:
&lt;ol&gt;
&lt;li&gt;Generate Pascal's triangle so we can easily extract binomial coefficients.  This takes O((log n)&lt;sup&gt;2&lt;/sup&gt;) time.&lt;/li&gt;
&lt;li&gt;Given a number &lt;em&gt;n&lt;/em&gt; find its most significant (or left-most) bit.  This takes O(log n) time.&lt;/li&gt;
&lt;li&gt;For each bit set to 1 count the number of combinations of bits to its right which yield a prime number of bits.  If we have a pre-generated list of primes this takes O((log n)&lt;sup&gt;2&lt;/sup&gt;), otherwise we can do it in O((log n)&lt;sup&gt;3&lt;/sup&gt;) time.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
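&lt;p&gt;
Since the author's implementation is redacted below at Facebook's request, here is an independent from-scratch sketch of those three steps (all names and the hard-coded prime set are mine; &lt;tt&gt;math.comb&lt;/tt&gt; stands in for a Pascal's-triangle lookup):
&lt;/p&gt;

```python
from math import comb

# Primes up to 64, enough for 64-bit popcounts.
PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61}

def pb(n):
    # Count positive integers <= n with a prime number of 1 bits,
    # walking n's bits from most to least significant.
    count = 0
    ones_so_far = 0  # 1 bits of n strictly above the current position
    for pos in range(n.bit_length() - 1, -1, -1):
        if n & (1 << pos):
            # Numbers matching n above this bit, with a 0 here: the
            # remaining 'pos' bits are free, so count the combinations
            # that bring the total popcount to a prime.
            for p in PRIMES:
                free = p - ones_so_far
                if 0 <= free <= pos:
                    count += comb(pos, free)
            ones_so_far += 1
    if ones_so_far in PRIMES:  # n itself
        count += 1
    return count

def prime_bits(a, b):
    return pb(b) - pb(a - 1)
```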

&lt;p&gt;
Although the Facebook puzzle says it would be "uncouth" to not support integers over 64 bits, generating lists of primes is a well-understood problem.  Using the &lt;a href="http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes"&gt;Sieve of Eratosthenes&lt;/a&gt; one can generate every prime less than k in O(k log log k) time and using the &lt;a href="http://en.wikipedia.org/wiki/Sieve_of_Atkin"&gt;Sieve of Atkin&lt;/a&gt; one can do it in O(k/log log k) time.  For now I'm just going to hard-code every prime less than 1024.  This means the above algorithm will work for up to 1024-bit integers.  If you care about running the above algorithm for arbitrarily large numbers, well, you can implement one of the prime number sieves yourself.
&lt;/p&gt;
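&lt;p&gt;
For reference, a bare-bones Sieve of Eratosthenes (a sketch of mine; assumes k &amp;ge; 2), which is all the hard-coded prime list amounts to:
&lt;/p&gt;

```python
def primes_below(k):
    # Classic sieve: strike out multiples of each prime up to sqrt(k).
    sieve = [True] * k
    sieve[0] = sieve[1] = False
    for i in range(2, int(k ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_p in enumerate(sieve) if is_p]
```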

&lt;p&gt;
Here is my implementation in Python.
&lt;div class="notice warning"&gt;
This code has been redacted at the request of Facebook.  &lt;a href="mailto:twenty.bits@gmail.com?subject=Prime bits solution"&gt;Contact me&lt;/a&gt; if you want the code.
&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;Improvements&lt;/h4&gt;
&lt;p&gt;
The code as-is runs quite fast.  Every input I've given it runs in well under a second, usually less than a tenth of a second, but there are still improvements to be made.
&lt;ol&gt;
&lt;li&gt;Rather than calculate pb(b) - pb(a), integrate the two passes by determining the first bit-position in which a and b differ and running a variation of the above counting algorithm.&lt;/li&gt;
&lt;li&gt;Improve the significant bit calculation.  The current implementation is the most naïve method.&lt;/li&gt;
&lt;li&gt;Generally tweak the loops to improve speed, since Python is notoriously slow in iterating over lists.  Or write it in a language like C (or ASM, you hardcode d00d, you).&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
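&lt;p&gt;
On the second point, a sketch of the two approaches; in modern Python &lt;tt&gt;int.bit_length&lt;/tt&gt; does the naive shift loop's job in one call:
&lt;/p&gt;

```python
def msb_naive(n):
    # Shift right until the number is exhausted, counting positions.
    pos = -1
    while n:
        n >>= 1
        pos += 1
    return pos

def msb_fast(n):
    # int.bit_length gives the same answer directly.
    return n.bit_length() - 1
```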

&lt;p&gt;
Anyhow, none of the above strike me as particularly "interesting," so I'll leave them to more enterprising individuals.  Cheers and happy coding!
&lt;/p&gt;</description>
      <pubDate>Fri, 27 Apr 2007 00:39:05 +0000</pubDate>
      <link>http://20bits.com/article/facebook-job-puzzles-prime-bits</link>
    </item>
    <item>
      <title>Facebook job puzzles: Korn Shell</title>
      <description>&lt;p&gt;
Welcome to the first installment of 20bits' &lt;a href="http://www.facebook.com/jobs_puzzles/"&gt;Facebook job puzzles&lt;/a&gt; solutions manual.  My ultimate goal is to solve every interesting puzzle in the aforelinked list and make a public post with the solution code and an explanation.  Why?  Because I'm a mathematician by training and no good puzzle should go unsolved.
&lt;/p&gt;

&lt;p&gt;
I intended to start with the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=1"&gt;Evil Gambling Monster&lt;/a&gt; puzzle because I thought it would be the best learning experience given my lack of formal CS training.  I did learn some new algorithms in solving it (e.g., the &lt;a href="http://en.wikipedia.org/wiki/A*_search_algorithm"&gt;A&lt;sup&gt;*&lt;/sup&gt; search algorithm&lt;/a&gt;), but that's when the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=7"&gt;Korn Shell&lt;/a&gt; puzzle caught my eye.  My inner mathematician couldn't resist and I believe I have solved the puzzle mathematically.  The solution involves my favorite area of mathematics (&lt;a href="http://en.wikipedia.org/wiki/Group_theory"&gt;group theory&lt;/a&gt;), so I'm going to attempt to explain it in a way that is understandable to a layperson.
&lt;/p&gt;
&lt;!--more--&gt;
&lt;h4&gt;The Rules of the Game&lt;/h4&gt;
&lt;p&gt;
You are sitting at a computer terminal whose 26 alphabetic keys have been randomly rearranged (permuted) and you enter your name, say, "Mike Korn."  Since the keys are scrambled what appears on the screen is also scrambled.  The game consists of you typing the characters you see on the screen until your name appears.  The length of the game is the number of times you must type the string on the screen before your name appears, at which point the game terminates.  
&lt;/p&gt;
&lt;p&gt;
A game therefore consists of two pieces of information, viz., an &lt;strong&gt;input string&lt;/strong&gt; and a &lt;strong&gt;permutation&lt;/strong&gt; of the 26 alphabetic keys.  Everything else is completely determined by these two pieces of information.  The puzzle is this: if I give you a name can you give me the length of the longest possible game which uses that name as its first input?
&lt;/p&gt;
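&lt;p&gt;
To make the game concrete, here is a small Python simulation (a sketch of mine; it represents the permutation as a dictionary from key to the character it prints, and leaves spaces and punctuation alone):
&lt;/p&gt;

```python
def play(name, perm):
    # Type whatever is on screen until the original name reappears.
    rounds = 0
    current = name
    while True:
        current = "".join(perm.get(c, c) for c in current)
        rounds += 1
        if current == name:
            return rounds

# Hypothetical example: swap A and L, and cycle E -> M -> R -> E,
# for both upper and lower case.
perm = {}
for cycle in ("AL", "EMR"):
    for i, c in enumerate(cycle):
        nxt = cycle[(i + 1) % len(cycle)]
        perm[c] = nxt
        perm[c.lower()] = nxt.lower()
```

&lt;p&gt;
With this permutation, "Mel Farb" reappears after six rounds.
&lt;/p&gt;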

&lt;h4&gt;Questions to Ask&lt;/h4&gt;
&lt;p&gt;
The first step in reasoning mathematically is to isolate your assumptions.  The biggest assumption above is that every possible game is guaranteed to terminate in a finite number of steps.  This isn't particularly obvious at the outset.  This leads us to the following questions.

&lt;ol&gt;
&lt;li&gt;Is every possible game guaranteed to terminate in a finite number of steps?&lt;/li&gt;
&lt;li&gt;How does the length of a game vary with respect to the two inputs, i.e., the name and the permutation?&lt;/li&gt;
&lt;li&gt;If every game does terminate in a finite number of steps, what is the longest game?&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
&lt;p&gt;
If we can answer these questions well enough then we are done.  I'm claiming that each of these questions has an explicit answer and that no algorithm is necessary to determine any of them.  So, let's get answering.
&lt;/p&gt;

&lt;h4&gt;Helpful Notation&lt;/h4&gt;
&lt;p&gt;
I will introduce more notation as it becomes necessary but I'd like to introduce a few preliminaries.  Let's say the A and B keys have been swapped on the keyboard and no other keys have been touched.  We can write this as A &amp;rarr; B &amp;rarr; A, since, if you type 'A' the screen prints 'B,' and if you type 'B' the screen prints 'A.' Similarly, if we took the letters Q,W,E and R and moved each to the right (on a QWERTY keyboard), with R going to Q's place, we could write Q &amp;rarr; W &amp;rarr; E &amp;rarr; R &amp;rarr; Q.
&lt;/p&gt;

&lt;p&gt;
Additionally, if I want to refer to a specific permutation I will use the symbol &amp;sigma; or &amp;tau; to represent it.  Those are the lower-case Greek letters sigma and tau, respectively.  This is only in the case where I am talking about a permutation in the abstract &amp;mdash; if I want to talk specifically about what the permutation has done to the keyboard keys then I will use the above notation to describe its action.
&lt;/p&gt;

&lt;h4&gt;The single-letter case&lt;/h4&gt;
&lt;p&gt;
Is every possible game guaranteed to terminate in a finite number of steps?  When in doubt, start simple.  What happens if we have a single letter as an input, say, 'A?'  We enter A and another letter, say, V, appears.  We then enter V and yet another letter appears.  Now, at any point, the next letter that appears &lt;i&gt;could&lt;/i&gt; be A, in which case we're done.  
&lt;/p&gt;
&lt;p&gt;
But can this go on forever?  Well, let's say we've seen every letter from A to Z &lt;i&gt;exactly once&lt;/i&gt; and the last letter on the screen was an X.  What happens when we hit the X key?  The original letter, A in this case, must appear on the screen.  If another letter appears, say B, then the key we pressed earlier which produced B must have been X.  But this means that we have seen X twice, an impossibility!
&lt;/p&gt;

&lt;p&gt;
In fact, we just solved the puzzle for the single-letter input case.  If we are given a single letter as an input the permutation (i.e., the rearrangement of the 26 alphabetic keys) we want is represented by A &amp;rarr; B &amp;rarr; C &amp;rarr; ... &amp;rarr; Y &amp;rarr; Z &amp;rarr; A.  For any single-letter input this produces a game of length 26.  As we showed above, no game with a single-letter input can last longer than 26 turns.
&lt;/p&gt;

&lt;h4&gt;On Cycles and Sufjan Stevens (or, the multi-character case)&lt;/h4&gt;
&lt;p&gt;
What happens when we are given a two-character input string?  When I explained this to my girlfriend I used a music analogy.  Let's say we have two musicians, one playing in 4/4 time and the other playing in 5/4 time (maybe he's Dave Brubeck or Sufjan Stevens).  If both musicians start playing on the same beat how many beats does it take for their measures to come back into alignment?  The answer is 20, the &lt;a href="http://en.wikipedia.org/wiki/Least_common_multiple"&gt;least common multiple&lt;/a&gt; of 4 and 5.
&lt;/p&gt;

&lt;p&gt;
Remember that if A and B are swapped then we write A &amp;rarr; B &amp;rarr; A.  This is called a "cycle" because the path a letter takes through it always eventually comes back to the beginning.  A more common notation is (A B) instead of A &amp;rarr; B &amp;rarr; A, with the understanding that the right side wraps around to the left.  Similarly we write (Q W E R) instead of Q &amp;rarr; W &amp;rarr; E &amp;rarr; R &amp;rarr; Q.
&lt;/p&gt;

&lt;p&gt;
Given an arbitrary permutation &amp;sigma; of the 26 alphabetic keys every letter belongs to exactly one cycle.  We can find all the other letters in this cycle by following the bouncing ball.  For example, if we want to find the cycle of C then we type in C.  'R' appears on screen, so we type in 'R.'  We then see 'H' followed by 'L' followed by 'V' followed by 'C' again.  The whole cycle is therefore (C R H L V).  If, in addition to having C, R, H, L, and V permuted in that fashion, the A and B keys have been swapped, we write (A B)(C R H L V).  By convention if we write down a permutation using this notation and leave off letters that means the corresponding keys haven't been moved.
&lt;/p&gt;
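&lt;p&gt;
"Following the bouncing ball" is easy to sketch in Python (again treating a permutation as a dictionary from key to printed character; the dictionary below encodes the (A B)(C R H L V) example):
&lt;/p&gt;

```python
def cycle_of(perm, letter):
    # Collect letters until we wrap back around to the start.
    cycle = [letter]
    current = perm[letter]
    while current != letter:
        cycle.append(current)
        current = perm[current]
    return cycle

# The example from the text: (A B)(C R H L V).
perm = {"A": "B", "B": "A",
        "C": "R", "R": "H", "H": "L", "L": "V", "V": "C"}
```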

&lt;p&gt;
Let's take the permutation (A L)(E M R), i.e., the A and L keys have been swapped and the E key has been moved to M's position, M to R's, and R to E's.  What happens when we enter the name "Mel Farb?"  Remember that if a letter isn't present in any cycles it means it isn't affected by the permutation (mathematicians would say that the permutation &lt;i&gt;fixes&lt;/i&gt; that letter).
&lt;pre&gt;
Mel Farb
Rma Fleb
Erl Famb
Mea Flrb
Rml Faeb
Era Flmb
Mel Farb
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Going back to the musical analogy, imagine we have a whole orchestra.  Every letter is a musician.  Musicians playing the same part all belong to the same cycle.  The length of a cycle is the time signature in which each part is playing.  The length of the game is the number of beats it takes for the measures in the various parts (i.e., cycles) to realign, which happens to be the least common multiple of the lengths of the cycles.
&lt;/p&gt;

&lt;p&gt;
We have therefore answered our first question and much of the second.  Yes, every game is guaranteed to terminate in a finite number of steps.  If every part is being played by one musician, i.e., every cycle contains at least one letter in our input string, then the length of the game is determined wholly by the lengths of the cycles of the permutation.
&lt;/p&gt;
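&lt;p&gt;
In code, the observation is one line: the game length is the least common multiple of the cycle lengths (a sketch using the reduce-over-gcd idiom):
&lt;/p&gt;

```python
from functools import reduce
from math import gcd

def game_length(cycle_lengths):
    # The measures realign at the lcm of the "time signatures".
    return reduce(lambda a, b: a * b // gcd(a, b), cycle_lengths, 1)
```

&lt;p&gt;
For (A L)(E M R), game_length([2, 3]) is 6, matching the six-step "Mel Farb" run above.
&lt;/p&gt;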

&lt;h4&gt;The Computation&lt;/h4&gt;
&lt;p&gt;
So every permutation can be represented uniquely&lt;span class="footnote"&gt;Up to the ordering of the cycles&lt;/span&gt; using the above cycle notation.  &lt;i&gt;This means that if we construct a cycle representation then we also get a permutation.&lt;/i&gt;  To construct really long games, then, we need to maximize the size of the least common multiple of the cycle lengths under the constraint that the sum of their lengths is no greater than 26.
&lt;/p&gt;

&lt;p&gt;
For now let's work in the case where our input string contains enough distinct letters, i.e., every cycle of our permutation contains at least one letter in the input string.  Let's look at some permutations.  Take (A B)(C D E)(F G H I J).  If the input string is something like "ACF" this permutation leads to a game of length 30, since the least common multiple of 2, 3, and 5 is 5*3*2 = 30.  But we could do better by adding another cycle of length 7, giving us a game length of 210.  We can't keep adding primes, though: a cycle of length 11 is impossible because we only have 26 letters and 2+3+5+7+11 = 28.  And prime powers complicate matters further, since replacing some cycles with ones of length 8 = 2&lt;sup&gt;3&lt;/sup&gt; or 9 = 3&lt;sup&gt;2&lt;/sup&gt; can raise the least common multiple.  Finding the best structure takes a systematic search.
&lt;/p&gt;

&lt;p&gt;
If every cycle in a permutation of the 26 letters contains at least one letter of the input string then we know that the length of the game is wholly determined by the cycle structure of the permutation.  The length is the least common multiple of the size of the cycles.  We therefore want to find the cycle structure which maximizes the length of the game.  We can do that as follows.
&lt;/p&gt;

&lt;p&gt;
Mathematically, writing a number n as sum of positive integers is called a &lt;a href="http://en.wikipedia.org/wiki/Partition_(number_theory)"&gt;partition of n&lt;/a&gt;.  So we're interested in partitions of 26.  We want to find the maximum least common multiple of the sizes of the parts of all possible partitions of 26.  Whew.  For example, we can write 26 = 25+1.  The least common multiple of 25 and 1 is 25.  We can also write 26 = 1+2+3+5+7+8.  The least common multiple of these parts is 210.  It turns out there are 2436 partitions of 26, which is more than we want to check by hand.
&lt;/p&gt;

&lt;p&gt;
I have written some Python code below which calculates this number and it happens to be 1260, corresponding to cycles of length 4, 5, 7, and 9.  This means that the longest possible game, period, is 1260.  No matter what the user inputs we will never be able to beat this number.  But can we always match it?
&lt;/p&gt;
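&lt;p&gt;
The search over partitions is small enough to brute-force directly.  This sketch (my own illustration, not the redacted code) tries every partition with at most &lt;tt&gt;k&lt;/tt&gt; parts summing to at most &lt;tt&gt;n&lt;/tt&gt; and keeps the best least common multiple:
&lt;/p&gt;

```python
from math import gcd

def max_game_length(n=26, k=4):
    # Enumerate partitions of at most k parts with total <= n, in
    # non-decreasing order to avoid duplicates, tracking the max lcm.
    # Parts of length 1 never change the lcm, so start at 2.
    best = 1

    def rec(remaining, parts_left, min_part, acc):
        nonlocal best
        best = max(best, acc)
        if parts_left == 0:
            return
        for part in range(min_part, remaining + 1):
            rec(remaining - part, parts_left - 1, part,
                acc * part // gcd(acc, part))

    rec(n, k, 2, 1)
    return best
```

&lt;p&gt;
With k = 1, 2, 3, 4 this yields 26, 165, 630, and 1260, respectively.
&lt;/p&gt;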

&lt;h4&gt;From one permutation to many&lt;/h4&gt;

&lt;p&gt;
So, we know the "ideal" permutation has four cycles.  As I've said before, if the input string contains fewer than four distinct letters then at least one of the cycles will be empty and won't contribute to the length of the game.  This means that we have four cases for the number of distinct letters in the input string: one, two, three, and four or more.  Let's go case-by-case, focusing on the last case first.
&lt;/p&gt;

&lt;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;h5&gt;Four or more&lt;/h5&gt;
&lt;p&gt;
We know the ideal cycle structure in this situation (cycles of length 4,5,7, and 9), but the input string can still vary.  If the input string contains four or more distinct characters can we always find a permutation which makes the game last 1260 rounds?  Yes, because we are free to choose what letters go in what cycle.
&lt;/p&gt;
&lt;p&gt;
If our input string is "ABCDE" then we might put A in the first cycle, B in the second, C in the third, D in the fourth, and E in any of the cycles.  The remaining spots in the cycles can then be filled with whatever letters we want.  Because the length of the game is determined only by the cycle structure we are guaranteed to get a game that lasts 1260 rounds.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Three characters&lt;/h5&gt;
&lt;p&gt;
If our input string contains three characters we're back to square one.  We might think to try using the three longest cycles from our ideal 4-5-7-9 permutation above, but that doesn't quite work.  This permutation makes the game last 315 rounds, but what is to say that there isn't a permutation with exactly three cycles that lasts longer than 315 rounds?
&lt;/p&gt;
&lt;p&gt;
As it turns out there is a permutation which yields a game that lasts 630 rounds.  This permutation has cycles of length 7, 9, and 10.  Again, there is Python code below to calculate this number exactly.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;The last two cases&lt;/h5&gt;
&lt;p&gt;
This case is essentially the same as the three-character case, except that we wish to find a permutation with two cycles rather than three.  Using the code below we find a permutation which leads to a game of length 165 and has cycles of length 15 and 11.
&lt;/p&gt;
&lt;p&gt;
As for the one-letter case, well, we've already done it!  If you recall a single letter cannot go through a game longer than 26 rounds before appearing again.  What's more, the permutation which produces this game can be named explicitly: (A B C ... X Y Z).
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;h4&gt;The Code&lt;/h4&gt;
&lt;p&gt;
As promised, here is the code which produces the above numbers.  &lt;tt&gt;landau&lt;/tt&gt; implements &lt;a href="http://mathworld.wolfram.com/LandausFunction.html"&gt;Landau's function&lt;/a&gt;, which gives the maximal order of a permutation on n letters &amp;mdash; that means landau(26) = 1260.  &lt;tt&gt;part2&lt;/tt&gt; handles the two-character case and &lt;tt&gt;part3&lt;/tt&gt; the three-character case.  With slight modifications you can make them print out the cycle structure rather than the length of the game.
&lt;div class="notice warning"&gt;
At the request of Facebook portions of the code have been redacted.
&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;pre class="brush: python"&gt;#!/usr/bin/python
import sys

def gcd(a,b):
	if b == 0: return a
	return gcd(b, a % b)

def lcm(a,b):
	return (a*b)/gcd(a,b)

def lcm2(a,b,c):
	return lcm(lcm(a,b),c)

def landau(n):
	ans = [1]*(n+1)
	
	for i in range(1,n+1):
		for k in range (1,i+1):
			test = lcm(k, ans[i-k])
			if (test &gt; ans[i]):
				ans[i] = test
			
	return ans[n]

print [landau(int(n)) for n in sys.argv[1:]]&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
Of course the Facebook puzzle (remember that?) stipulates that, given an input string on the command line, we are to output the longest possible time it could take to produce that input string again.  We've solved the problem mathematically which means the "solution" program is stupidly simple.  Here it is in Ruby, just because I can:
&lt;pre class="brush: ruby"&gt;#!/usr/bin/ruby
case ARGV.join.downcase.scan(/[a-z]/).uniq.size
when 1
  puts 26
when 2
  puts 165
when 3
  puts 630
else
  puts 1260
end&lt;/pre&gt;
&lt;/p&gt;

&lt;h4&gt;The Math&lt;/h4&gt;
&lt;p&gt;
Some of you might be interested in the math, so here it is in a hyper-condensed form.  We start with what is called a &lt;a href="http://mathworld.wolfram.com/Group.html"&gt;group&lt;/a&gt;.  The set of all permutations on n letters forms a group called the &lt;a href="http://mathworld.wolfram.com/SymmetricGroup.html"&gt;symmetric group&lt;/a&gt;, denoted S&lt;sub&gt;n&lt;/sub&gt;.  We are interested in S&lt;sub&gt;26&lt;/sub&gt;, i.e., the set of all permutations of 26 letters.  For two elements &amp;sigma;, &amp;tau; in S&lt;sub&gt;n&lt;/sub&gt; we write &amp;sigma;&amp;tau; to mean their composition under the group operation.  We write &amp;sigma;&lt;sup&gt;2&lt;/sup&gt; to mean &amp;sigma;&amp;sigma;, i.e., the application of sigma twice.
&lt;/p&gt;

&lt;p&gt;
Permutations can be written using cycle notation.  Let's say &amp;sigma;(A) = B and &amp;sigma;(B) = A.  That is, &amp;sigma; swaps the letters A and B.  This corresponds to the cycle (A B).  Similarly, we might have &amp;sigma;(Q) = W, &amp;sigma;(W) = E, &amp;sigma;(E) = R, and &amp;sigma;(R) = Q, which corresponds to the cycle (Q W E R).  As in the explanation of the puzzle we can do the opposite: by writing down cycles we are actually choosing a permutation which contains those cycles.
&lt;/p&gt;

&lt;p&gt;
For every permutation &amp;sigma; in S&lt;sub&gt;n&lt;/sub&gt; there is a smallest positive integer k such that &amp;sigma;&lt;sup&gt;k&lt;/sup&gt; is the identity element, i.e., the permutation which fixes every letter.  This k is called the order of &amp;sigma;, denoted |&amp;sigma;|.  This is equivalent to the statement that every game as described initially will stop in a finite number of steps.  The order of &amp;sigma; is the least common multiple of the lengths of its cycles when written as a product of disjoint cycles.
&lt;/p&gt;

&lt;p&gt;
Since every element has a well-defined order the function g(n) = max{|&amp;sigma;| : &amp;sigma; in S&lt;sub&gt;n&lt;/sub&gt;} is also well-defined.  In fact, this function is called &lt;a href="http://mathworld.wolfram.com/LandausFunction.html"&gt;Landau's function&lt;/a&gt;.  I didn't learn of this function until I showed my solution to some math friends and they pointed me to the definition.  It turns out that g(26) = 1260.  You can see &lt;a href="http://www.research.att.com/~njas/sequences/A000793"&gt;the values&lt;/a&gt; of g(n) for 0 &amp;le; n &amp;le; 47 at the Online Encyclopedia of Integer Sequences (yes, such a thing exists).  Or you can use my Python program to calculate it &amp;mdash; it runs quite quickly!
&lt;/p&gt;


&lt;h4&gt;Das Ende&lt;/h4&gt;
&lt;p&gt;
The statement of the original puzzle is as follows:
&lt;blockquote&gt;Given a particular name (e.g. "Mike Korn"), what is the maximum number of times the user might have to try typing in his name (or whatever has appeared on the screen) until his real name appears, if the manner in which the keys have been mixed up is unknown? &lt;/blockquote&gt;
&lt;/p&gt;
&lt;p&gt;
From the above arguments we have learned that the answer depends &lt;em&gt;only&lt;/em&gt; on the number of distinct characters in the input string.  The Ruby code above outputs that number given the input.  Furthermore, it is not difficult to generate the permutation as well, by filling in the empty spots in the cycle structure of our ideal permutation in each case.  So, in that regard, we have actually solved a harder puzzle than the Facebook puzzle.
&lt;/p&gt;
&lt;p&gt;
I hope you all enjoyed reading this as much as I enjoyed writing it.  My daily life doesn't afford me many opportunities to return to my old math haunts.  Cheers, and happy coding!
&lt;/p&gt;

&lt;h4&gt;PS&lt;/h4&gt;
&lt;p&gt;
Here is some C code to print off the silly Facebook email address.
&lt;pre class="brush: c"&gt;#include &lt;stdio.h&gt;

int main(int argc, char **argv) {
    printf("%u\n", 0xFACEB00C&gt;&gt;2);
    return 0;
}&lt;/pre&gt;
&lt;/p&gt;</description>
      <pubDate>Thu, 19 Apr 2007 23:37:14 +0000</pubDate>
      <link>http://20bits.com/article/facebook-job-puzzles-korn-shell</link>
    </item>
    <item>
      <title>10 Tips for Optimizing MySQL Queries (That don't suck)</title>
      <description>&lt;p&gt;
Justin Silverton at &lt;a href="http://www.whenpenguinsattack.com/"&gt;Jaslabs&lt;/a&gt; has a supposed list of &lt;a href="http://www.whenpenguinsattack.com/2007/04/09/10-tips-for-optimizing-mysql-queries/"&gt;10 tips for optimizing MySQL queries&lt;/a&gt;.  I couldn't read this and let it stand because this list is really, really bad.  Some &lt;a href="http://immike.net/blog/2007/04/09/how-not-to-optimize-a-mysql-query/"&gt;guy named Mike&lt;/a&gt; noted this, too.  So in this entry I'll do two things: first, I'll explain why his list is bad; second, I'll present my own list which, hopefully, is much better.  Onward, intrepid readers!
&lt;/p&gt;
&lt;!--more--&gt;
&lt;h3&gt;Why That List Sucks&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h4&gt;He's swinging for the top of the trees&lt;/h4&gt;
&lt;p&gt;
The rule in any situation where you want to optimize some code is that you first profile it and then find the bottlenecks.  Mr. Silverton, however, aims right for the tippy top of the trees.  I'd say 60% of database optimization is properly understanding SQL and the basics of databases.  You need to understand joins vs. subselects, column indices, how to normalize data, etc.  The next 35% is understanding the performance characteristics of your database of choice.  &lt;tt&gt;COUNT(*)&lt;/tt&gt; in MySQL, for example, can either be almost-free or painfully slow depending on which storage engine you're using.  Other things to consider: under what conditions does your database invalidate caches, when does it sort on disk rather than in memory, when does it need to create temporary tables, etc.  The final 5%, where few ever need venture, is where Mr. Silverton spends most of his time.  Never once in my life have I used &lt;tt&gt;SQL_SMALL_RESULT&lt;/tt&gt;.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Good problems, bad solutions&lt;/h4&gt;
&lt;p&gt;
There are cases when Mr. Silverton does note a good problem.  MySQL will indeed use a dynamic row format if it contains variable length fields like &lt;tt&gt;TEXT&lt;/tt&gt; or &lt;tt&gt;BLOB&lt;/tt&gt;, which, in this case, means sorting needs to be done on disk.  The solution is not to eschew these datatypes, but rather to split off such fields into an associated table.  The following schema represents this idea:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts (
	id int unsigned not null auto_increment,
	author_id int unsigned not null,
	created timestamp not null,
	PRIMARY KEY(id)
);

CREATE TABLE posts_data (
	post_id int unsigned not null,
	body text,
	PRIMARY KEY(post_id)
);&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;That's just...yeah&lt;/h4&gt;
&lt;p&gt;
Some of his suggestions are just mind-boggling, e.g., "remove unnecessary parentheses."  It really doesn't matter whether you do &lt;tt&gt;SELECT * FROM posts WHERE (author_id = 5 AND published = 1)&lt;/tt&gt; or &lt;tt&gt;SELECT * FROM posts WHERE author_id = 5 AND published = 1&lt;/tt&gt;.  None.  Any decent DBMS is going to optimize these away.  This level of detail is akin to wondering, when writing a C program, whether the post-increment or pre-increment operator is faster.  Really, if that's where you're spending your energy, it's a surprise you've written any code at all.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;My list&lt;/h3&gt;

&lt;p&gt;
Let's see if I fare any better.  I'm going to start from the most general.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h4&gt;Benchmark, benchmark, benchmark!&lt;/h4&gt;
&lt;p&gt;
You're going to need numbers if you want to make a good decision.  What queries are the worst?  Where are the bottlenecks?  Under what circumstances are you generating bad queries?  Benchmarking will let you simulate high-stress situations and, with the aid of profiling tools, expose the cracks in your database configuration.  Tools of the trade include &lt;a href="http://vegan.net/tony/supersmack/"&gt;supersmack&lt;/a&gt;, &lt;a href="http://httpd.apache.org/docs/2.2/programs/ab.html"&gt;ab&lt;/a&gt;, and &lt;a href="http://sysbench.sourceforge.net/"&gt;SysBench&lt;/a&gt;.  These tools either hit your database directly (e.g., supersmack) or simulate web traffic (e.g., ab).
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Profile, profile, profile!&lt;/h4&gt;
&lt;p&gt;
So, you're able to generate high-stress situations, but now you need to find the cracks.  This is what profiling is for.  Profiling enables you to find the bottlenecks in your configuration, whether they be in memory, CPU, network, disk I/O, or, what is more likely, some combination of all of them.
&lt;/p&gt;
&lt;p&gt;
The very first thing you should do is turn on the &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html"&gt;MySQL slow query log&lt;/a&gt; and install &lt;a href="http://mtop.sourceforge.net/"&gt;mtop&lt;/a&gt;.  This will give you access to information about the absolute worst offenders.  Have a ten-second query ruining your web application?  These guys will show you the query right off.
&lt;/p&gt;
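&lt;p&gt;
For reference, turning on the slow query log is a two-line addition to &lt;tt&gt;my.cnf&lt;/tt&gt; (a sketch using the 5.0-era option names; the log path and threshold here are just examples, so check the manual for your version):
&lt;/p&gt;
&lt;pre&gt;[mysqld]
log-slow-queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2&lt;/pre&gt;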
&lt;p&gt;
After you've identified the slow queries you should learn about the MySQL internal tools, like &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/explain.html"&gt;&lt;tt&gt;EXPLAIN&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/show-status.html"&gt;&lt;tt&gt;SHOW STATUS&lt;/tt&gt;&lt;/a&gt;, and &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/show-processlist.html"&gt;&lt;tt&gt;SHOW PROCESSLIST&lt;/tt&gt;&lt;/a&gt;.  These will tell you what resources are being spent where, and what side effects your queries are having, e.g., whether your heinous triple-join subselect query is sorting in memory or on disk.  Of course, you should also be using your usual array of command-line profiling tools like top, procinfo, vmstat, etc. to get more general system performance information.
&lt;/p&gt;
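&lt;p&gt;
As a quick illustration, prefixing a suspect query with &lt;tt&gt;EXPLAIN&lt;/tt&gt; costs almost nothing to run and shows how MySQL plans to execute it (the table here is hypothetical):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;EXPLAIN SELECT * FROM posts WHERE author_id = 5;&lt;/pre&gt;
&lt;p&gt;
A &lt;tt&gt;NULL&lt;/tt&gt; in the &lt;tt&gt;key&lt;/tt&gt; column means no index was used, and &lt;tt&gt;Using filesort&lt;/tt&gt; or &lt;tt&gt;Using temporary&lt;/tt&gt; in the &lt;tt&gt;Extra&lt;/tt&gt; column usually marks a query worth rewriting.
&lt;/p&gt;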
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Tighten Up Your Schema&lt;/h4&gt;
&lt;p&gt;
Before you even start writing queries you have to design a schema.  Remember that the memory requirements for a table are going to be around &lt;tt&gt;#entries * size of a row&lt;/tt&gt;.  Unless you expect every person on the planet to register 2.8 trillion times on your website you do not in fact need to make your user_id column a &lt;tt&gt;BIGINT&lt;/tt&gt;.  Likewise, if a text field will always be a fixed length (e.g., a US zipcode, which always has a canonical representation of the form "XXXXX-XXXX") then a &lt;tt&gt;VARCHAR&lt;/tt&gt; declaration just adds a superfluous byte for every row.
&lt;/p&gt;
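&lt;p&gt;
Concretely, the advice above amounts to declarations like these (a hypothetical table, just to illustrate the column types):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE users (
	id int unsigned not null auto_increment,  -- not BIGINT
	zipcode char(10) not null,                -- fixed width, so not VARCHAR
	PRIMARY KEY(id)
);&lt;/pre&gt;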
&lt;p&gt;
Some people pooh-pooh database normalization, saying it produces unnecessarily complex schemas.  However, proper normalization minimizes redundant data.  Fundamentally that means a smaller overall footprint at the cost of performance &amp;mdash; the usual performance/memory tradeoff found everywhere in computer science.  The best approach, IMO, is to normalize first and denormalize where performance demands it.  Your schema will be more logical and you won't be optimizing prematurely.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Partition Your Tables&lt;/h4&gt;
&lt;p&gt;
Often you have a table in which only a few columns are accessed frequently.  On a blog, for example, one might display entry titles in many places (e.g., a list of recent posts) but only ever display teasers or the full post bodies once on a given page.  &lt;strike&gt;Horizontal&lt;/strike&gt; vertical partitioning helps:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts (
	id int unsigned not null auto_increment,
	author_id int unsigned not null,
	title varchar(128),
	created timestamp not null,
	PRIMARY KEY(id)
);

CREATE TABLE posts_data (
	post_id int unsigned not null,
	teaser text,
	body text,
	PRIMARY KEY(post_id)
);&lt;/pre&gt;
&lt;p&gt;
The above represents a situation where one is optimizing for reading.  Frequently accessed data is kept in one table while infrequently accessed data is kept in another.  Since the data is now partitioned, the infrequently accessed data takes up less memory.  You can also optimize for writing: frequently &lt;em&gt;changed&lt;/em&gt; data can be kept in one table, while infrequently changed data can be kept in another.  This allows more efficient caching, since MySQL no longer needs to expire the cache for data which probably hasn't changed.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Don't Overuse Artificial Primary Keys&lt;/h4&gt;
&lt;p&gt;
Artificial primary keys are nice because they can make the schema less volatile.  If we stored geography information in the US based on zip code, say, and the zip code system suddenly changed we'd be in a bit of trouble.  On the other hand, many times there are perfectly fine natural keys.  One example would be a join table for many-to-many relationships.  What &lt;strong&gt;not&lt;/strong&gt; to do:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts_tags (
	relation_id int unsigned not null auto_increment,
	post_id int unsigned not null,
	tag_id int unsigned not null,
	PRIMARY KEY(relation_id),
	UNIQUE INDEX(post_id, tag_id)
);&lt;/pre&gt;
&lt;p&gt;
Not only is the artificial key entirely redundant given the column constraints, but the number of post-tag relations is now limited by the maximum value of an integer.  Instead one should do:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts_tags (
	post_id int unsigned not null,
	tag_id int unsigned not null,
	PRIMARY KEY(post_id, tag_id)
);&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Learn Your Indices&lt;/h4&gt;
&lt;p&gt;
Often your choice of indices will make or break your database.  For those who haven't progressed this far in their database studies, an index is a sort of hash.  If we issue the query &lt;tt&gt;SELECT * FROM users WHERE last_name = 'Goldstein'&lt;/tt&gt; and &lt;tt&gt;last_name&lt;/tt&gt; has no index then your DBMS must scan every row of the table and compare it to the string 'Goldstein.'  An index is usually a B-tree (though there are other options) which speeds up this comparison considerably.
&lt;/p&gt;
&lt;p&gt;
You should probably create indices for any field on which you are selecting, grouping, ordering, or joining.  Obviously each index requires space proportional to the number of rows in your table, so too many indices winds up taking more memory.  You also incur a performance hit on write operations, since every write now requires that the corresponding index be updated.  There is a balance point which you can uncover by profiling your code.  This varies from system to system and implementation to implementation.
&lt;/p&gt;
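<p>
To continue the &lt;tt&gt;last_name&lt;/tt&gt; example, a sketch of adding the index and re-checking the query (the index name is arbitrary):
</p>
&lt;pre class="brush: sql"&gt;ALTER TABLE users ADD INDEX last_name_idx (last_name);

-- EXPLAIN should now report access type "ref" via last_name_idx
-- rather than type "ALL", i.e., a full table scan.
EXPLAIN SELECT * FROM users WHERE last_name = 'Goldstein';&lt;/pre&gt;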
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;SQL is Not C&lt;/h4&gt;
&lt;p&gt;
C is the canonical procedural programming language and the greatest pitfall for a programmer looking to show off his database-fu is that he fails to realize that SQL is not procedural (nor is it functional or object-oriented, for that matter).  Rather than thinking in terms of data and operations on data one must think of sets of data and relationships among those sets.  This usually crops up with the improper use of a subquery:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT a.id, 
	(SELECT MAX(created) 
	FROM posts 
	WHERE author_id = a.id) 
AS latest_post
FROM authors a&lt;/pre&gt;
&lt;p&gt;
Since this subquery is correlated, i.e., references a table in the outer query, one should convert the subquery to a join.
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p
	ON (a.id = p.author_id)
GROUP BY a.id&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Understand your engines&lt;/h4&gt;
&lt;p&gt;
MySQL has two primary storage engines: MyISAM and InnoDB.  Each has its own performance characteristics and considerations.  In the broadest sense MyISAM is good for read-heavy data and InnoDB is good for write-heavy data, though there are cases where the opposite is true.  The biggest gotcha is how the two differ with respect to the &lt;tt&gt;COUNT&lt;/tt&gt; function.
&lt;/p&gt;
&lt;p&gt;
MyISAM keeps an internal cache of table meta-data like the number of rows.  This means that, generally, &lt;tt&gt;COUNT(*)&lt;/tt&gt; incurs no additional cost for a well-structured query.  InnoDB, however, has no such cache.  For a concrete example, let's say we're trying to paginate a query.  If you have a query &lt;tt&gt;SELECT * FROM users LIMIT 5,10&lt;/tt&gt;, let's say, running &lt;tt&gt;SELECT COUNT(*) FROM users&lt;/tt&gt; to get the total number of rows is essentially free with MyISAM but requires a full scan with InnoDB.  MySQL has a &lt;tt&gt;SQL_CALC_FOUND_ROWS&lt;/tt&gt; option which tells the server to calculate the number of matching rows as it runs the query; the count can then be retrieved by executing &lt;tt&gt;SELECT FOUND_ROWS()&lt;/tt&gt;.  This is very MySQL-specific, but can be necessary in certain situations, particularly if you use InnoDB for its other features (e.g., row-level locking, stored procedures, etc.).
&lt;/p&gt;
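&lt;p&gt;
The pagination case above looks like this in practice:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT SQL_CALC_FOUND_ROWS * FROM users LIMIT 5,10;
SELECT FOUND_ROWS();  -- total rows the first query would have matched&lt;/pre&gt;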
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;MySQL specific shortcuts&lt;/h4&gt;
&lt;p&gt;
MySQL provides many extensions to SQL which help performance in common use scenarios.  Among these are &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-select.html"&gt;&lt;tt&gt;INSERT ... SELECT&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html"&gt;&lt;tt&gt;INSERT ... ON DUPLICATE KEY UPDATE&lt;/tt&gt;&lt;/a&gt;, and &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/replace.html"&gt;&lt;tt&gt;REPLACE&lt;/tt&gt;&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
I rarely hesitate to use the above since they are so convenient and provide real performance benefits in many situations.  MySQL has other keywords which are more dangerous, however, and should be used sparingly.  These include &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html"&gt;&lt;tt&gt;INSERT DELAYED&lt;/tt&gt;&lt;/a&gt;, which tells MySQL that it is not important to insert the data immediately (say, e.g., in a logging situation).  The problem is that under high load the insert might be delayed indefinitely, causing the insert queue to balloon.  You can also give MySQL &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/index-hints.html"&gt;index hints&lt;/a&gt; about which indices to use.  MySQL gets it right most of the time, and when it doesn't it is usually because of a bad schema or a poorly written query.
&lt;/p&gt;
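&lt;p&gt;
As an example of the convenient kind, a counter that must either be created or incremented becomes one round trip with &lt;tt&gt;INSERT ... ON DUPLICATE KEY UPDATE&lt;/tt&gt; instead of a &lt;tt&gt;SELECT&lt;/tt&gt; followed by an &lt;tt&gt;INSERT&lt;/tt&gt; or &lt;tt&gt;UPDATE&lt;/tt&gt; (the table is hypothetical, with a unique key on &lt;tt&gt;url&lt;/tt&gt;):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;INSERT INTO page_views (url, views)
VALUES ('/about', 1)
ON DUPLICATE KEY UPDATE views = views + 1;&lt;/pre&gt;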
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;And one for the road...&lt;/h4&gt;
Last, but not least, read Peter Zaitsev's &lt;a href="http://mysqlperformanceblog.com/"&gt;MySQL Performance Blog&lt;/a&gt; if you're into the nitty-gritty of MySQL performance.  He covers many of the finer aspects of database administration and performance.
&lt;/li&gt;
&lt;/ol&gt;</description>
      <pubDate>Tue, 10 Apr 2007 00:27:34 +0000</pubDate>
      <link>http://20bits.com/article/10-tips-for-optimizing-mysql-queries-that-dont-suck</link>
    </item>
    <item>
      <title>What is a word? An introduction to computational linguistics.</title>
      <description>&lt;p&gt;
What is a word?  This question is one of the most deceptively simple ones I know.  Everyone will say they know the answer, or at least say they know one when they see one, but even native speakers of a language can and do disagree.  The dictionary isn't much help since &lt;a href="http://dictionary.reference.com/browse/word"&gt;many&lt;/a&gt; &lt;a href="http://m-w.com/dictionary/word"&gt;dictionaries&lt;/a&gt; have multi-sentence, ad hoc definitions which basically boil down to "a word is a unit of language that means something, sort of."
&lt;/p&gt;

&lt;p&gt;
Let's jump ahead and assume we know what a word is, or that we can get native speakers to identify most words most of the time.  Furthermore, let's say that our goal is to get a computer to understand a given language.  Since humans learn languages initially by learning words and basic grammar it seems like a good choice to try and get computers to recognize words.  So, our goal: given a string of English letters insert spaces between the words.
&lt;/p&gt;

&lt;!--more--&gt;
&lt;p&gt;
&lt;h4&gt;What is a word?&lt;/h4&gt;
To show that the above exercise isn't totally contrived let's look at some of the subtleties in the idea of the word.  This is only for people interested in the "linguistics" part of "computational linguistics," but if you want to read it then &lt;a href="#" onclick="Effect.SlideDown('whatsaword')"&gt;click here&lt;/a&gt;.
&lt;div id="whatsaword" style="display: none"&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Words are linguistic constructs, not orthographic ones.&lt;/h5&gt;
Language precedes writing.  Children can speak and comprehend a language before they learn to read and write in that language.  That is to say nothing of people who are illiterate or languages which have no formalized writing system.  Furthermore, when learning a foreign language one typically learns words and basic grammar long before learning how to write, particularly if the writing system is dramatically different, e.g., an English-speaker learning Chinese or Arabic.  This is all to say that words are not just formal orthographic constructs like quotation marks or apostrophes.  Words appear to have some linguistic reality and are therefore worth studying from a language perspective.&lt;/p&gt;
&lt;p&gt;
None of this tells us what a word is, only that speakers of a language believe there is such a thing as a word.  English speakers can still say, however, that what we see on paper is more-or-less accurate: spaces represent breaks between words.  This dodges the question since the idea of a "word" exists cross-linguistically.  The definition of a word, therefore, should encompass all such contingencies.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Synthetic languages&lt;/h5&gt;
A few definitions, first.  A morpheme is the smallest unit of language with a meaning.  "Dogs" has two morphemes: "dog" and "s," with the former indicating a canine and the latter indicating multiplicity.  Similarly, "looked" has two morphemes, with the suffix "-ed" indicating the tense of the verb "look."  Synthetic languages have a high morpheme/word ratio.  English-speakers might be familiar with the comically long German word.  In fact, German allows for essentially arbitrarily long words.  &lt;em&gt;Donaudampfschiffahrtsgesellschaftskapitän&lt;/em&gt;, for example, means "Danube steamship company captain."  Is this one word or four?  And even in the English translation, is "steamship" one word, or two?
&lt;/p&gt;
&lt;p&gt;
At another extreme are synthetic languages like Hungarian, which have many more grammatical affixes than English or German.  The ideas of "conditional," "future," etc. are all marked by single morphemes attached to a word.  In English "I would go" is three words, but in Hungarian it would appear to be just one or perhaps two.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Phonetics versus Orthography&lt;/h5&gt;
Most linguists, though, would say the above isn't even that relevant.  At best it just provides us with more evidence that words are something real.  When we speak, however, there is no "space" between words in the same way that there are spaces between words in written English.  If you've ever learned a foreign language you probably remember a point at which you realized you were hearing words rather than sounds.  Before that it sounded like one continuous stream of nonsense &amp;mdash; and you were right about the "one continuous stream" part, anyhow.
&lt;/p&gt;
&lt;p&gt;
Once you have learned some of the language, however, the patterns become clear in your mind and words jump out at you.  That is, humans are able to take a single unbroken string of "characters" (i.e., sounds) and break them into words.  Imagine if instead of parsing a string of English characters we were parsing some phonetic representation of a sentence.  Suddenly our exercise no longer seems so uninteresting; indeed, we'd be doing something very much like what humans do when they parse a language they understand.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;h4&gt;Assumptions&lt;/h4&gt;
Obviously we can't integrate all of the subtleties above as that would be tantamount to writing a computer program which actually processed text in the same way humans do.  Rather, we will work under the following assumptions: first, we already have a database (called the "lexicon") of words; second, this database is complete.  The first assumption isn't totally off-the-wall since it's a general working assumption among linguists that humans have just such a database.  The second, however, is much harder to swallow since the lexicon is typically understood to contain root morphemes plus general information about the morphology, phonology, phonotactics, etc. of the language.&lt;/p&gt;

&lt;p&gt;If I said "koop" were a verb, you'd know right away that "kooped," "koops," "kooper," etc. were all also valid words.  Likewise, even though "cromulent" is not actually an English word an English speaker knows that it could be (and that, furthermore, it would probably be an adjective), but that "plkdjfhg" could never be an English word.  Our database, however, is very dumb and very uncompressed: every permutation of every word should be present, otherwise that permutation won't be counted as a word.  We're only making this assumption to simplify the problem.  I may be a pretty good programmer, but I'm not good enough to write a computer program which automatically learns a language's syntax, morphology, and phonology.&lt;/p&gt;

&lt;p&gt;Enough chit chat, let's get to the code.&lt;/p&gt;

&lt;p&gt;
&lt;h4&gt;The Algorithm&lt;/h4&gt;
The algorithm I'm going to use is a simple probabilistic &lt;a href="http://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; algorithm.  Let's say we have a string like "therentisdue" and want to parse it as "the-rent-is-due."  Assuming our training data is representative of the language as a whole (a big assumption, for sure), the probability of each word is the number of occurrences of the word in the data divided by the total number of words in the data.  The idea is that the best parse of a string, given our training data, is the parse which has the highest probability of occurring.
&lt;/p&gt;

&lt;p&gt;
For the CS students out there this should scream "dynamic programming."  For everyone else, I'll explain.  The most obvious way to find the parse with the highest probability is to enumerate every possible parse and pick the one with the highest probability.  Implementing the algorithm this way is intractable since there are 2&lt;sup&gt;n-1&lt;/sup&gt; parses (why?).  Instead we'll do the following.  The pseudocode:
&lt;pre class="brush: pseudocode"&gt;BestParse[0] := ""
FOR i in [1..length of StringToParse] DO
	BestParse[i] := undefined   // an undefined parse has infinite cost
	FOR j in [0..i) DO
		parse := BestParse[j] + StringToParse[j,i]

		IF COST(parse) &lt; COST(BestParse[i]) THEN
			BestParse[i] := parse
		ENDIF
	ENDFOR
ENDFOR

DEFINE COST(parse)
	RETURN -LOG2(PROBABILITY(parse))
END

DEFINE PROBABILITY(parse)
	RETURN product of the frequencies of each word in parse
END&lt;/pre&gt;
&lt;/p&gt;
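&lt;p&gt;
To make the pseudocode concrete, here is a short, self-contained sketch of the same algorithm in Python.  The word frequencies are made up purely for illustration (my actual implementation, linked below, trains on the KJV Bible):
&lt;/p&gt;

```python
import math

def best_parse(s, freq):
    """Return the most probable segmentation of s as a list of words."""
    total = sum(freq.values())

    def cost(word):
        # -log2(probability); strings not in the lexicon get infinite cost
        if word in freq:
            return -math.log2(freq[word] / total)
        return math.inf

    n = len(s)
    # best[i] = (cost, parse) for the first i characters of s
    best = [(0.0, [])] + [(math.inf, None)] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            # extend the best parse of s[:j] with the chunk s[j:i]
            c = best[j][0] + cost(s[j:i])
            if c != math.inf:
                candidates.append((c, best[j][1] + [s[j:i]]))
        if candidates:
            best[i] = min(candidates)  # cheapest = most probable
    return best[n][1]

# Hypothetical lexicon with toy frequencies.
freq = {"the": 50, "rent": 3, "is": 40, "ren": 1, "tis": 1, "due": 2}
print("-".join(best_parse("therentisdue", freq)))  # prints "the-rent-is-due"
```

&lt;p&gt;
Just as in the pseudocode, &lt;tt&gt;best[i]&lt;/tt&gt; holds the cheapest parse of the first &lt;tt&gt;i&lt;/tt&gt; characters, so each prefix is solved exactly once.
&lt;/p&gt;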
&lt;p&gt;
Let the input string be &lt;tt&gt;s&lt;/tt&gt;.  At each point &lt;tt&gt;i&lt;/tt&gt;, that is, for the initial &lt;tt&gt;i&lt;/tt&gt;-length substring of &lt;tt&gt;s&lt;/tt&gt;, determine what the best parse up to &lt;tt&gt;i&lt;/tt&gt; is.  Now, let's say we know what the best parse at &lt;tt&gt;i&lt;/tt&gt; is for some fixed &lt;tt&gt;i&lt;/tt&gt;.  To find the best parse at &lt;tt&gt;i+1&lt;/tt&gt; we try to insert a break after each initial &lt;tt&gt;j&lt;/tt&gt; substring, for &lt;tt&gt;j &amp;lt; i+1&lt;/tt&gt;.  Since we've been keeping track of the best parse (and cost) at each such &lt;tt&gt;j&lt;/tt&gt; the whole time, we just see which break insertion is the cheapest.
&lt;/p&gt;

&lt;p&gt;
Here is an illustration, again with "therentisdue."  Let's say we have "therenti" parsed so far.  This means we know the best parse for each initial substring of this string, e.g., "t", "th", "the", etc.  The best parse will probably be "the-rent-i" since each of these is a word and every other parse contains at least one non-word.  Now let's see how the algorithm determines the best parse of "therentis" from this.&lt;/p&gt;

&lt;p&gt;
After each character in the string we need to decide whether or not to insert a break.  Should we insert a space after the first character?  Well, yes, since the best parse of a single character is definitely that character.  So at the first step we get "t-|herentis."  If we're favoring single letters over non-words (it's our choice to make) then the best parse after the second character would be "t-h-|erentis."  After the third, however, the parse is "the-|rentis" since "the" is a word and therefore the best parse of the first three letters is "the" (we know this because, by assumption, we have already computed the best parse for "the").  Next we get "the-r-|entis," followed by "the-re-|ntis," and so on, until we get to "the-ren-|tis."  After this step we try "the-rent-|is."  This is a very good parse since we have three words.  Finally, we try "the-rent-i-|s," which has a lower probability than the previous parse because "s" is not a word.  Therefore "the-rent-is" is the parse which we save as the best parse of "therentis."
&lt;/p&gt;
&lt;p&gt;
I implemented this algorithm using C++, which you can download here.  By default it uses the KJV Bible as training data, which means what it considers words can be a little funny.  For example, "sin" is considered a very common word.
&lt;div class="download"&gt;
&lt;a href="http://assets.20bits.com/downloads/spaces-1.0.0.tar.bz2"&gt;spaces-1.0.0.tar.bz2&lt;/a&gt; (840K)
&lt;/div&gt;
&lt;/p&gt;</description>
      <pubDate>Wed, 04 Apr 2007 15:09:00 +0000</pubDate>
      <link>http://20bits.com/article/what-is-a-word-an-introduction-to-computaitonal-linguistics</link>
    </item>
  </channel>
</rss>
