<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Recent Articles at 20bits</title>
    <description>20bits by Jesse Farmer</description>
    <link>http://20bits.com/</link>
    <item>
      <title>Keith Olbermann Thinks I'm an Idiot</title>
      <description>&lt;img class="math" src="http://assets.20bits.com/20120410/olbermann-hero.jpg" style="width:629px" alt="Keith Olbermann: Genius"&gt;

&lt;p&gt;This story ends with Keith Olbermann dismissing me as "another idiot" on national television, but it begins on a Monday morning with me sitting on my brown leather IKEA couch in &lt;a href="http://en.wikipedia.org/wiki/Palo_Alto,_California"&gt;Palo Alto&lt;/a&gt;, two blocks from Facebook's then-new &lt;a href="http://en.wikipedia.org/wiki/College_Terrace,_Palo_Alto,_California"&gt;College Terrace&lt;/a&gt; office.  Five months earlier I had started a company with &lt;a href="http://about.me/zellunit"&gt;Matt Humphrey&lt;/a&gt;, &lt;a href="http://twitter.com/joedamato"&gt;Joe Damato&lt;/a&gt;, and &lt;a href="http://twitter.com/tmm1"&gt;Aman Gupta&lt;/a&gt; called Bumba Labs.&lt;/p&gt;

&lt;p&gt;Our first application, Polls, let anyone on Facebook create or vote in a poll.  When someone voted they were prompted to post their response to their newsfeed, where all their friends could see it.  That was enough to make it grow exponentially, and within a month there were millions of people voting.  Over 10,000 polls were created each day.  Even Mark Zuckerberg used our app to create a poll about Gidget, the Taco Bell dog, who had died earlier that year.  Not a bad start.&lt;/p&gt;

&lt;p&gt;By the end, Facebook, the Secret Service, and every major news station would be involved.&lt;/p&gt;

&lt;h3&gt;Monday, September 2009, 9AM&lt;/h3&gt;

&lt;p&gt;I sat down on my couch, opened my laptop, and logged on to Facebook to see how our application was doing.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The application "Polls" is temporarily unavailable due to an issue with its third-party developer. We are investigating the situation and apologize for any inconvenience.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the message I saw when I visited our application that morning.  "Today," I thought, "is going to be awesome!"&lt;/p&gt;

&lt;p&gt;Every successful Facebook developer has seen that message at least once.  It means Facebook found something they didn't like in your application and decided to take it down.  Normally they'd warn you a few days in advance and tell you to fix it or else.  I double-checked my email and, nope, no warning — the app was just gone.&lt;/p&gt;

&lt;p&gt;It would take Facebook at least a day to respond to any questions I had, so in the meantime I connected to the Polls database to see if I could spot anything unusual.  At Facebook's request we had implemented a feature to report offensive polls a few months earlier, and now took time each morning to spot and delete any truly awful polls.&lt;/p&gt;

&lt;p&gt;People would report a poll for anything, offensive or not: the poll's prompt didn't agree with their politics, there was a typo in one of the answers, the poll's photo offended them, etc.  For most polls there was one report every hundred votes or so, but today there was a poll with only a few hundred votes and thousands of reports.&lt;/p&gt;
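&lt;p&gt;The signal itself is simple enough to sketch.  This is a hypothetical version of the check, not our actual code or database schema; it just flags any poll whose report-to-vote ratio blows far past that one-per-hundred baseline:&lt;/p&gt;

```python
# Hypothetical sketch of the anomaly signal described above -- not the
# actual Bumba Labs code.  Baseline: roughly one report per hundred votes.
BASELINE = 1 / 100

def is_anomalous(votes, reports, multiplier=10):
    """Flag a poll whose report rate exceeds `multiplier` times the baseline."""
    if votes == 0:
        return reports > 0
    return reports / votes > BASELINE * multiplier

# A viral poll with a normal report rate vs. the poll in question.
is_anomalous(votes=1_000_000, reports=9_000)  # normal: 0.9 reports per 100 votes
is_anomalous(votes=300, reports=2_000)        # anomalous: ~667 reports per 100 votes
```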

&lt;p&gt;What was it?&lt;/p&gt;

&lt;p&gt;&lt;img src="http://assets.20bits.com/20120410/original.jpeg" class="math"&gt;&lt;/p&gt;

&lt;h3&gt;The Best Laid Plans: 10AM&lt;/h3&gt;

&lt;p&gt;I knew in my gut this was why Facebook shut down our application, but it was still strange that they hadn't warned us.  Playing dumb, I sent an email to someone I knew on the Facebook policy enforcement team.&lt;/p&gt;

&lt;pre&gt;
Hey XXX,

Polls is down and it's displaying the TOS violation screen:

The application "Polls" is temporarily unavailable due to an     
issue with its third-party developer. We are investigating 
the situation and apologize for any inconvenience.

Any idea what's up?

Cheers,
Jesse
&lt;/pre&gt;

&lt;p&gt;Our users had created horrible polls in the past, asking questions like "Should gay people be lynched?" or "Is Mrs. Jones the English teacher at such-and-such High School a racist?"  We'd delete those polls as soon as we found them and ban whoever made them from ever using Polls again.  In this case, the poll was created by a middle school girl from Orange County, California.  She'll be graduating from high school this year.&lt;/p&gt;

&lt;p&gt;This looked less like someone earnestly plotting to kill Obama and more like a bored kid phoning in a fake bomb threat to their high school.  Thankfully, only a few hundred people voted in the poll — the truly viral polls had millions of votes.  How did Facebook find this if it had so few votes, though?  It took at least several thousand votes over the previous hour to break into the list of popular polls, and nobody outside Bumba saw the complaints.&lt;/p&gt;

&lt;p&gt;I decided to wait until Facebook got back to me before I did anything else.  The poll was removed, the provocateur banned, and hardly anyone had voted in the poll.&lt;/p&gt;

&lt;p&gt;I left Palo Alto to spend the day working in San Francisco out of Matt, Joe, and Aman's apartment.&lt;/p&gt;

&lt;h3&gt;The Carnival Begins: 12PM&lt;/h3&gt;

&lt;p&gt;When I arrived at their apartment, Matt, Joe, and Aman had already seen the app was down, so I explained what I had found and we went to work on other projects.  About twenty minutes later, Aman pointed at his monitor and shouted: "Hey, we're on the Huffington Post!  Polls is on the Huffington Post!"&lt;/p&gt;

&lt;p&gt;And there it was on the &lt;a href="http://www.huffingtonpost.com/2009/09/28/obama-facebook-poll-asks_n_301860.html"&gt;Huffington Post&lt;/a&gt;.  And the &lt;a href="http://www.huffingtonpost.com/2009/09/28/kill-obama-facebook-poll-_n_302090.html"&gt;Associated Press Wire&lt;/a&gt;.  And all over the blogosphere.&lt;/p&gt;

&lt;p&gt;After a bit of digging we found "patient zero," a small liberal blog called &lt;a href="http://thepoliticalcarnival.blogspot.com/2009/09/screen-grab-facebook-poll-should-obama.html"&gt;The Political Carnival&lt;/a&gt;.  They started a campaign late Sunday evening to call Facebook, the FBI, and the Secret Service, which quickly spread to larger communities like &lt;a href="http://www.dailykos.com/story/2009/09/28/787194/-UPDATE-3-Facebook-thank-you-for-finally-deleting-"&gt;DailyKos&lt;/a&gt;.  One problem: nobody seemed to understand the difference between Facebook, developers on the Facebook Platform, and Facebook users creating polls with our app.  They were blaming Facebook for hosting the poll and us for creating it!&lt;/p&gt;

&lt;h3&gt;The National Stage&lt;/h3&gt;

&lt;p&gt;The poll was created Sunday morning, discovered by a single blog Sunday evening, and had become a national news story by the time I woke up Monday.  Nobody had reached out to us yet, either, so most articles about the poll were filled with wild speculation.  Left-leaning outfits assumed the person who created it was a middle-aged white man out to kill Obama.  Right-leaning outfits assumed this was a liberal plant, designed to make the right look bad.  Everyone tried to weave this incident into a larger story about race, politics, and the direction of American society.&lt;/p&gt;

&lt;p&gt;I wondered if they would be embarrassed to know that their outrage was unknowingly directed at a 14-year-old girl.  Did they even know how to feel embarrassed?&lt;/p&gt;

&lt;p&gt;&lt;a href="http://twitter.com/eldon"&gt;Eric Eldon&lt;/a&gt; and &lt;a href="http://twitter.com/justinsmith"&gt;Justin Smith&lt;/a&gt; from Inside Network were the first people to reach out.  Eric had been a journalist in Silicon Valley for several years, so he understood how hard it was for any site to police user-generated content.  &lt;a href="http://www.insidefacebook.com/2009/09/28/the-obama-assassination-poll-another-story-about-offensive-user-generated-content/"&gt;His writeup at Inside Facebook&lt;/a&gt; was the most level-headed account by a mile or ten.&lt;/p&gt;

&lt;p&gt;The second person to reach out called with no warning.  I assumed it was a journalist trying to catch me off guard, but instead the caller introduced himself: "This is Special Agent Mark Weller from the United States Secret Service, San Jose office."  Because of their location, he told me, they work with Facebook and other Silicon Valley companies all the time, and were used to dealing with complaints about user-generated content.  I gave him the identity of the girl who created the poll and we ended the call in under 15 minutes.&lt;/p&gt;

&lt;p&gt;Self-deprecating, sarcastic humor being my default coping mechanism, I tweeted the following:&lt;/p&gt;

&lt;blockquote class="twitter-tweet tw-align-center"&gt;&lt;p&gt;Life TODO: [X] Have a phone conversation with an agent from the US Secret Service&lt;/p&gt;— Jesse Farmer (@jessefarmer) &lt;a href="https://twitter.com/jessefarmer/status/4455778065" data-datetime="2009-09-28T23:26:06+00:00"&gt;September 28, 2009&lt;/a&gt;&lt;/blockquote&gt;

&lt;script src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;After I got off the phone with Agent Weller I immediately called Bumba's lawyer: "I'm not sure they'll send one, but don't be surprised if you get a subpoena from the Secret Service.  I gave them your fax number."  The call every lawyer wants to receive.&lt;/p&gt;

&lt;h3&gt;From Would-be Assassin to Idiot&lt;/h3&gt;

&lt;p&gt;Once my name was out there, the coverage escalated.  I was getting unprompted calls from reporters who wanted to talk to "the polls developer."  People would email me asking why I wanted to assassinate Obama.  When I explained that I actually worked on the Obama campaign in 2008, their brains melted.  It just didn't fit into the story they had been telling themselves since the news broke.&lt;/p&gt;

&lt;p&gt;Here are some selected pieces of coverage:&lt;/p&gt;

&lt;div class="math"&gt;
&lt;script src="http://i.cdn.turner.com/cnn/.element/js/2.0/video/evp/module.js?loc=dom&amp;vid=/video/us/2009/09/29/nr.obama.facebook.poll.cnn" type="text/javascript"&gt;&lt;/script&gt;&lt;noscript&gt;Embedded video from &lt;a href="http://www.cnn.com/video"&gt;CNN Video&lt;/a&gt;&lt;/noscript&gt;

&lt;iframe width="420" height="315" src="http://www.youtube.com/embed/l7ohjT9-8X4" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;The best part came Tuesday night, though.  My phone rang.  It was my mother.  Her voice was hurried and cracking a little.  Before I could ask what was wrong she blurted out, "Jesse, Keith Olbermann just called you an idiot on national television."  She was concerned this would hurt my reputation.&lt;/p&gt;

&lt;p&gt;I went through the Countdown transcripts and read his refreshingly thoughtful commentary:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This as the fallout continues over that poll from Facebook. It continues. The Secret Service is now investigating the threat. And the developer of the online application that was used to create the survey has come forward, telling "Politico" "there is definitely a culture of paranoia and fear, and I think both sides are reacting in extreme ways. People carrying guns to town hall meetings. That's scary. People losing their cool over an internet poll like this that doesn't calm the situation."&lt;/p&gt;


&lt;p&gt;Posting such a poll cools and calms the situation? Another idiot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thanks to Keith, I finally recognized my own idiocy and spent the next four months at an "idiot detox" center in eastern Oregon.&lt;/p&gt;

&lt;p&gt;Later that week the Secret Service announced they had investigated the lead and found no credible threat.  By then the media had moved on to the next banal controversy that could generate outrage and attention, so hardly anyone noticed.&lt;/p&gt;

&lt;h3&gt;Fallout&lt;/h3&gt;

&lt;p&gt;I'll stop myself from commenting on the "state of the media" and such.  It's cliché at this point to say that people like Keith Olbermann and the bloggers who first broke this story aren't interested in the truth, but instead their own aggrandizement.  (Oops.)&lt;/p&gt;

&lt;p&gt;Community sites like DailyKos were better at understanding what happened, especially after I took the time to &lt;a href="http://www.dailykos.com/story/2009/09/28/787366/-Facts-from-the-developer-of-the-Facebook-Poll-Application"&gt;patiently answer all their questions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As for the fallout, Polls was down throughout this episode.  Because it spread virally by posting people's votes to their newsfeed, the three or four days of downtime halted all growth.  Traffic dropped to nothing and never recovered.  It wouldn't have mattered much, anyhow, because three months later Facebook changed how the newsfeed worked and made it nearly impossible for apps to grow through the newsfeed alone.&lt;/p&gt;

&lt;p&gt;The four of us shut down Bumba about a month later.  Matt, Joe, and Aman, along with &lt;a href="http://www.crunchbase.com/person/jared-kopf"&gt;Jared Kopf&lt;/a&gt;, went on to start &lt;a href="http://www.crunchbase.com/company/homerun"&gt;HomeRun&lt;/a&gt;. It was at this time, too, that Matt introduced me to &lt;a href="http://www.linkedin.com/in/mpreysman"&gt;Michael Preysman&lt;/a&gt;, a friend of his from &lt;a href="http://www.cmu.edu"&gt;CMU&lt;/a&gt;.  Less than a year later Michael and I started &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In fact, here is the very first email I ever sent Michael:&lt;/p&gt;

&lt;pre&gt;
From: Jesse Farmer &lt;jesse@20bits.com&gt;
To: Michael Preysman &lt;mpreysman@gmail.com&gt;
Date: Mon, Sep 28, 2009 at 12:12 PM
Subject: If you get a call from the Secret Service…

Here's why: 
http://www.sfgate.com/cgi-bin/article.cgi?f=/n/a/2009/09/28/national/w115451D54.DTL

You're still listed as a developer :|

- J
&lt;/pre&gt;

&lt;p&gt;I sure know how to make a graceful first impression.&lt;/p&gt;</description>
      <pubDate>Wed, 11 Apr 2012 04:59:37 +0000</pubDate>
      <link>http://20bits.com/article/keith-olbermann-thinks-i-m-an-idiot</link>
    </item>
    <item>
      <title>Getting Ahead: A Letter to Myself</title>
      <description>&lt;blockquote&gt;
&lt;p&gt;All the pulses of the world,&lt;br&gt;
Falling in they beat for us, with the Western movement beat, &lt;br&gt;
Holding single or together, steady moving to the front, all for us,&lt;br&gt;
Pioneers! O pioneers!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I moved to Silicon Valley the summer of 2006, as soon as I graduated from the University of Chicago.  Two college friends, &lt;a href="http://laptopandarifle.wordpress.com/"&gt;Ryo Chijiiwa&lt;/a&gt; and &lt;a href="http://agnoster.net/"&gt;Isaac Wolkerstorfer (né Wasileski)&lt;/a&gt;, asked me to join their startup &lt;a href="http://chicagomaroon.com/2005/02/04/new-website-showcases-campus-media-libraries/"&gt;OpenHive&lt;/a&gt;, a "social" search engine for college campuses that allowed students to search each other's bookshelves.  I had no expectations.  In fact, before my plane landed in San Jose, I had never even set foot in California.&lt;/p&gt;

&lt;p&gt;I was underprepared.  I suppose everyone is, though.  This is the letter I wish someone had written me that summer.  Since nobody did, I'll have to write it to myself six years later.&lt;/p&gt;

&lt;h3&gt;Dear Jesse&lt;/h3&gt;

&lt;p&gt;Dear Jesse,&lt;/p&gt;

&lt;p&gt;Hello from the future!  You're about to move to California and help Yitz and Ryo with their startup.  There's so much you're going to fuck up, but it's all worth it — honest.  I want to help you get ahead.&lt;/p&gt;

&lt;p&gt;The best thing about Silicon Valley is how friendly, open, and helpful everyone is.  The chattering class can be cynical, but this is where the future is getting built if you look hard enough.  Do look hard enough.&lt;/p&gt;

&lt;p&gt;Here's the big secret: do valuable work and share it.  People out here spend so much time talking about who's up, who's down, who's working with whom, who raised money and on what terms, who sold their company and for how much, etc.  Twitter has nothing on the speed at which gossip travels in Silicon Valley.  It's the work that earns you respect and credibility in the end, though.&lt;/p&gt;

&lt;p&gt;Forget meetups, getting coffee, and "quick" phone calls.  Doing valuable work and sharing it is the best way to build a network — it becomes your calling card.  Idle meetings are the Silicon Valley equivalent of showing up empty-handed to a potluck.  Everyone is happier if you bring something unique and delicious.  Until you can do that, you're better off practicing your kitchen skills.  You want to be able to point to something fantastic and say "I did &lt;em&gt;that&lt;/em&gt;."&lt;/p&gt;

&lt;p&gt;It's hard to know whether your work is valuable, particularly while you're in the middle of doing it.  What if it's not good enough?  What if people you respect think poorly of it?  Being dissatisfied with your own work is what pushes you to improve, but don't give yourself too much credit.  Most people won't think anything of it at all.  You have to trust yourself that if you're thoughtful enough and prolific enough, something amazing will happen.&lt;/p&gt;

&lt;p&gt;Small work can be valuable, too.  When I was first learning Erlang there were no tutorials outside the official and very opaque documentation, so I took the time to write the tutorials I wish existed.  If you think it's valuable someone else will, too.  It might even be an opportunity to work with them.  Don't be trapped by thinking that all your work has to be momentous.  Seeds aren't momentous.&lt;/p&gt;

&lt;p&gt;I know you can be independent and stubborn, but don't be afraid to ask for help in your work.  You'll be surprised at how helpful people are, even people you've only met once or twice.  The Midwesterner in you will say it's impolite to be a burden on other people's time.  He's being overcautious (that's a polite way of saying he's full of shit).  Ask for help especially when you're about to do something you've never done before.&lt;/p&gt;

&lt;p&gt;Conversely, don't take everyone's advice to heart.  You can get every possible piece of advice, if you want.  Take the job, don't take the job.  Work with him, don't work with him.  Hire that guy, don't hire that guy.  Take the money, don't take the money.  You'll feel like a ping pong ball if you try to listen to it all.&lt;/p&gt;

&lt;p&gt;Speaking of advice, someone will give you a copy of this poem when it really matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;since feeling is first&lt;br&gt;
who pays any attention&lt;br&gt;
to the syntax of things&lt;br&gt;
will never wholly kiss you;&lt;br&gt;
wholly to be a fool&lt;br&gt;
while Spring is in the world&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It applies to everything.  Life, love, work, business.  You're great at being logical, mathematical, and methodical.  If that's everything, though, you run the risk of being effective but uninspiring.  Remember that poem and be more audacious (in everything).  When it comes to inspiring people the Daniel Burnham quote — "Make no little plans. They have no magic to stir men's blood." — is absolutely true.&lt;/p&gt;

&lt;p&gt;Finally, surround yourself with talented people you trust and respect.  Keep them close, help them every chance you get, and make sure they know how much they matter.  These are the people who will help you regardless of how much help you can offer in return, and will keep you grounded when you're about to do something really crazy.&lt;/p&gt;

&lt;p&gt;I hope you don't find this letter too self-absorbed.  I thought of ways to make it more clever, funnier, etc., but decided to opt for plain-spoken and sincere.  If it annoys you, well: fuck off. ;)&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Jesse&lt;/p&gt;

&lt;h3&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;Thanks to &lt;a href="http://www.davidcole.me/"&gt;David Cole&lt;/a&gt;, &lt;a href="http://timetobleed.com/"&gt;Joe Damato&lt;/a&gt;, Raph Lee, and David Kaye for reading earlier drafts of this essay.&lt;/p&gt;
&lt;p&gt;Have your own letter you wish you'd received when you were just starting out in your career?  &lt;a href="mailto:jesse@20bits.com"&gt;Send me&lt;/a&gt; a link — I'd love to compile a list.&lt;/p&gt;
&lt;p&gt;If you like this essay, &lt;a href="http://twitter.com/jfarmer"&gt;follow me on Twitter&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Tue, 03 Apr 2012 17:58:59 +0000</pubDate>
      <link>http://20bits.com/article/getting-ahead-a-letter-to-myself</link>
    </item>
    <item>
      <title>The Value of a Social Commerce Referral</title>
      <description>&lt;p&gt;At &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt; we recently shut down a side-project of ours, &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;, where we were selling &lt;a href="http://www.indiecases.com/collections/frontpage/products/made-to-measure"&gt;some&lt;/a&gt; &lt;a href="http://www.indiecases.com/collections/frontpage/products/staches-in-space"&gt;pretty&lt;/a&gt; &lt;a href="http://www.indiecases.com/collections/frontpage/products/peace"&gt;sweet&lt;/a&gt; iPhone cases.  Over the month and a half that the store was live we saw some good traffic from up-and-coming social commerce sites like &lt;a href="http://svpply.com"&gt;Svpply&lt;/a&gt; and &lt;a href="http://pinterest.com"&gt;Pinterest&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;
I thought I'd take the time to share some of the numbers.
&lt;/p&gt;

&lt;h3&gt;The Sites&lt;/h3&gt;
&lt;p&gt;
Startup ideas come in bunches, and these social commerce sites are no exception.  They are all built around aiding discovery, typically by allowing users to share "finds" from around the web and build a following.
&lt;/p&gt;
&lt;p&gt;
These finds may or may not be products with a price tag attached, and a site may restrict them to a specific category (e.g., women's high fashion).
&lt;/p&gt;
&lt;p&gt;
Here's a list of all the sites I know of.  If you know of any others, please &lt;a href="mailto:jesse@20bits.com"&gt;send me an email&lt;/a&gt; and I'll update the list.&lt;/p&gt;
&lt;ul style="list-style: none;margin-left: 20px;"&gt;
&lt;li&gt;&lt;a href="http://buyosphere.com"&gt;Buyosphere&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ly.st"&gt;Ly.st&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://notcouture.notcot.org/"&gt;NotCouture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pinterest.com"&gt;Pinterest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://svpply.com"&gt;Svpply&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://thefancy.com"&gt;The Fancy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Of these sites, only Pinterest, Svpply, and The Fancy sent any traffic to &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Numbers&lt;/h3&gt;
&lt;p&gt;
The hope for these startups is that referrals from these sites are worth more than the average, non-qualified visitor.  Some, like The Fancy, are even experimenting &lt;a href="http://shop.thefancy.com"&gt;directly with commerce&lt;/a&gt;.
&lt;/p&gt;

&lt;table class="monthly-data" style="margin: 0 auto"&gt;
  &lt;tr class="top"&gt;
    &lt;th colspan="3"&gt;Value of a Social Commerce Visitor&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr class="odd"&gt;
    &lt;td&gt;Site&lt;/td&gt;
    &lt;td&gt;Conversion %&lt;/td&gt;
    &lt;td&gt;$/visitor&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Pinterest&lt;/td&gt;
    &lt;td&gt;0.93%&lt;/td&gt;
    &lt;td&gt;$0.31&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class="odd"&gt;
    &lt;td&gt;Svpply&lt;/td&gt;
    &lt;td&gt;3.70%&lt;/td&gt;
    &lt;td&gt;$1.09&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;The Fancy&lt;/td&gt;
    &lt;td&gt;0.48%&lt;/td&gt;
    &lt;td&gt;$0.13&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
I deliberately left out the raw traffic numbers, but will say this: The Fancy drove approximately 10x the traffic of Svpply, which drove the least of the three.
&lt;/p&gt;
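&lt;p&gt;One way to read the table: divide $/visitor by the conversion rate to get the implied average order value.  The figures below come straight from the table; the derived AOVs are my own arithmetic:&lt;/p&gt;

```python
# Implied average order value (AOV) derived from the table above.
# $/visitor = conversion rate * AOV, so AOV = ($/visitor) / conversion.
sites = {
    "Pinterest": (0.0093, 0.31),
    "Svpply":    (0.0370, 1.09),
    "The Fancy": (0.0048, 0.13),
}

for site, (conversion, dollars_per_visitor) in sites.items():
    aov = dollars_per_visitor / conversion
    print(f"{site}: implied AOV of about ${aov:.2f}")
```

&lt;p&gt;The implied AOV lands between roughly $27 and $33 for all three sites, which means the differences in per-visitor value come almost entirely from conversion rate, not order size.&lt;/p&gt;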

&lt;h3&gt;A Wrinkle&lt;/h3&gt;
&lt;p&gt;
Why did Svpply convert so much better than Pinterest or The Fancy?  Well, half the purchases from Svpply were of &lt;a href="http://www.indiecases.com/products/class-a-cellphone"&gt;this iPhone case&lt;/a&gt;, which Ben Pieratt, CEO of Svpply, tweeted about directly.&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://twitter.com/#!/pieratt/status/98392249120993280"&gt;&lt;img src="http://assets.20bits.com/20110822/Screen-shot-2011-08-22-at-12.07.00-AM.png" alt="" title="Ben Pieratt &amp;lt;3 Indie Cases" width="594" height="288" class="aligncenter size-full wp-image-842" /&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;Social Commerce?&lt;/h3&gt;

&lt;p&gt;
Consider the above: a plug from a respected member of a community interested in our products (Svpply) produced roughly the same number of sales as a site driving 10x the traffic (The Fancy).
&lt;/p&gt;

&lt;p&gt;
That, in a nutshell, is the promise of social commerce: the right recommendation at the right time from the right person.
&lt;/p&gt;

&lt;p&gt;
The success of sites like Svpply, Pinterest, and The Fancy will hinge on their ability to consistently produce that.
&lt;/p&gt;

&lt;h3&gt;Additional Info&lt;/h3&gt;
&lt;p&gt;
Just for reference, here are the Compete graphs of the sites that sent traffic to &lt;a href="http://www.indiecases.com"&gt;Indie Cases&lt;/a&gt;.
&lt;a href='http://siteanalytics.compete.com/pinterest.com+svpply.com+thefancy.com/?metric=uv' class="math"&gt;&lt;img src='http://grapher.compete.com/pinterest.com+svpply.com+thefancy.com_uv_460.png' /&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Pinterest dominates in raw site traffic, but the question is whether their referrals are coming with intent to purchase.  I'm sure they were grilled nonstop about that while they were out fundraising &amp;mdash; a traffic graph like that allows for a lot of leeway. ;)
&lt;/p&gt;

&lt;h3&gt;Does this interest you?&lt;/h3&gt;
&lt;p&gt;
At &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;, we're not building a social commerce platform, we're building a full-on store selling our own products.  Indie Cases was a small preview of what's coming.
&lt;/p&gt;

&lt;p&gt;
If you are an engineer or designer interested in defining what online retail should look like in a world where Twitter and Facebook exist, and YouTube creates stars in weeks (not months), shoot me an email at &lt;a href="mailto:jesse@everlane.com"&gt;jesse@everlane.com&lt;/a&gt; and let's talk.
&lt;/p&gt;</description>
      <pubDate>Mon, 22 Aug 2011 11:44:37 +0000</pubDate>
      <link>http://20bits.com/article/the-value-of-a-social-commerce-referral</link>
    </item>
    <item>
      <title>Click Hacking for Fun and Profit</title>
      <description>&lt;p&gt;A friend IMed me the other day, asking, "You know how to make people click on things.  I'm submitting something to Reddit &amp;mdash; can you help me title the post?" A stark description of my skills, certainly, but it made me laugh and inspired me to write an article about the art of click hacking.&lt;/p&gt;

&lt;h3&gt;What is click hacking?&lt;/h3&gt;
&lt;p&gt;
Aside from being a term I just made up, &lt;strong&gt;click hacking&lt;/strong&gt; is a type of &lt;a href="http://en.wikipedia.org/wiki/Social_engineering_(security)"&gt;social engineering&lt;/a&gt; where the goal is to get someone to click a hyperlink.
&lt;/p&gt;

&lt;p&gt;
The link could be the title of a Reddit post, a button on a landing page form that's trying to capture your email, or an image in a Facebook ad.  Deception can be involved, but isn't necessarily.
&lt;/p&gt;

&lt;p&gt;
I'm going to use Facebook, Hacker News, and Reddit as the primary examples throughout this article because I know them best.  If you have other examples please leave a comment!
&lt;/p&gt;

&lt;h3&gt;For Good or Evil&lt;/h3&gt;
&lt;p&gt;
All the tactics I'm about to describe can be used for good or evil, and I've seen each used for both.  Don't take this article as an endorsement for spamming, scamming, or other internet trickery.
&lt;/p&gt;

&lt;p&gt;
There's plenty of grey area, too.  At its start Reddit faked on-site activity to avoid looking like a ghost town.  Was that unethical?  A mistake?
&lt;/p&gt;

&lt;p&gt;
None of these tactics are a substitute for generating real value for your customer, though they will help you understand how your customers react to what you present to them (and how to incentivize them to react the way you want).  The hard (and most important) parts are still up to you.
&lt;/p&gt;

&lt;p&gt;With that disclaimer aside, let's dive into specifics.&lt;/p&gt;

&lt;h3&gt;Understand Your Audience&lt;/h3&gt;
&lt;p&gt;
Above all else, I take the time to understand my audience.  Each online community has its own customs, norms, and standards of behavior.  You have to understand them before you can blend in or stand out as necessary.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://24.media.tumblr.com/tumblr_ljyor743nc1qzrifqo1_500.png"&gt;&lt;img src="http://24.media.tumblr.com/tumblr_ljyor743nc1qzrifqo1_500.png" width="200" align="right" /&gt;&lt;/a&gt;

For example, Hacker News values civility&lt;span class="footnote"&gt;See, e.g., &lt;a href="http://news.ycombinator.com/item?id=1400882"&gt;Some Tips to Improve the Civility on Hacker News&lt;/a&gt;.&lt;/span&gt;, straight-dealing, and intellectual honesty, but can be a little short-sighted and dour.  Reddit values wit (especially puns and in-jokes/memes) and has a strong sense of community justice.  It can also be completely juvenile. 
&lt;/p&gt;


&lt;p&gt;
From the click hacker's perspective, a pun-filled title could do well on Reddit, but would never see the light of day on Hacker News.
&lt;/p&gt;

&lt;h3&gt;Ask for Help or Feedback&lt;/h3&gt;
&lt;p&gt;
One way to get people to pay attention is to ask for help.  Hacker News has a "&lt;a href="http://news.ycombinator.com/ask"&gt;Ask HN&lt;/a&gt;" feature which evolved purely from community behavior.  There are similar "Tell HN" and "Review my Startup" features.
&lt;/p&gt;

&lt;p&gt;
For example, here is &lt;a href="http://news.ycombinator.com/item?id=8863"&gt;Drew Houston's original Hacker News post&lt;/a&gt; asking for reviews of Dropbox.  Every startup I know wants to publish a "showoff" entry on Hacker News and have it get to the front page, and their motives range from honest (they really want feedback) to self-serving (they just want the attention).
&lt;/p&gt;

&lt;h3&gt;Give a Gift&lt;/h3&gt;
&lt;p&gt;
Everyone loves gifts, but when we receive them we also feel pressure to reciprocate.  The click hacker can use that pressure to get people to do what they want.
&lt;/p&gt;

&lt;p&gt;
This tactic is the difference between "I baked a cake" and "I baked a cake &lt;em&gt;for you&lt;/em&gt;."  A normal person is obliged to accept and even reciprocate.
&lt;/p&gt;

&lt;p&gt;
For example, &lt;a href="http://www.reddit.com/r/programming/comments/e45ch/"&gt;this Reddit user&lt;/a&gt; wrote a CSS hack to change the appearance of the site &amp;mdash; a little present for anyone using Reddit.  The Reddit community reciprocated by giving him upvotes.
&lt;/p&gt;

&lt;p&gt;
Notice how he says he made it "for Reddit."
&lt;/p&gt;

&lt;p&gt;
The &lt;a href="http://www.reddit.com/r/pics/comments/9h520/my_wife_made_me_a_reddit_alien_birthday_cake_it/c0cracz"&gt;cake example&lt;/a&gt; works, too, though. :)
&lt;/p&gt;

&lt;h3&gt;Bribery&lt;/h3&gt;
&lt;p&gt;
Bribery is the opposite of gift-giving.  In this situation the click hacker says, "Do this for me and I'll give you something."  If that something is really awesome people will do almost anything.
&lt;/p&gt;

&lt;p&gt;
The stereotypical example here is those scammy ads for free iPods and the like.  All you have to do is fill out this form and click this link and you'll be entered to win.  Or, hey, instead maybe you can forward this offer to three of your friends, and if one of them wins, you win, too!
&lt;/p&gt;

&lt;p&gt;
This can be done well in certain contexts.  For example, Facebook game developers often trade installs by paying players virtual currency, e.g., "Install this game to earn ten farm dollars."  Or a game developer might have a special landing page offering players virtual currency as a way to encourage the player to click the install button. 
&lt;/p&gt;

&lt;p&gt;
The player gets what they want (virtual currency) and the game developer acquires a customer for free.
&lt;/p&gt;

&lt;p&gt;
Assuming the terms are clearly outlined there's nothing wrong with this.  Of course, there's plenty of room for outright dishonesty by promising goods that never arrive.  Don't be that guy.
&lt;/p&gt;

&lt;h3&gt;Scarcity&lt;/h3&gt;
&lt;p&gt;
People want what they can't have.  Or, more accurately, people want what they think they can't have.
&lt;/p&gt;

&lt;p&gt;
Gilt, for example, relies heavily on this tactic to drive sales.  Better click that buy button now, because we're running out!
&lt;/p&gt;

&lt;p&gt;
In Gilt's case the scarcity is legitimate, but it can be artificial, too.  A game developer might make an in-game object ultra-rare in order to drive specific behavior.  Limited beta invites are a tried-and-true method of generating traffic and interest; give out 20 invites to an audience of 20,000 and collect follow-up information for the people who don't get there in time.
&lt;/p&gt;

&lt;p&gt;
Scarcity also works with time, e.g., "If you install this game within the next thirty seconds, you'll get a ten credit bonus."  Tick, tick, tick, tock.
&lt;/p&gt;

&lt;h3&gt;Social Proof&lt;/h3&gt;
&lt;p&gt;
Similar to scarcity, people don't want to feel like they're missing out.  If they see other people doing something, especially people they know or respect, they'll be more likely to do it.
&lt;/p&gt;

&lt;p&gt;
This is the &lt;em&gt;raison d'être&lt;/em&gt; for Facebook's &lt;a href="http://developers.facebook.com/docs/reference/plugins/facepile/"&gt;Facepile&lt;/a&gt; widget.  Click that Like button.  You know you want to.  Come on, all your friends are doing it.
&lt;/p&gt;

&lt;p&gt;
Even adding faces of random but friendly-looking people can be effective in getting people to click through.  Here's a screenshot of Match.com's homepage.
&lt;/p&gt;

&lt;a href="http://assets.20bits.com/20110422/match-com.jpg"&gt;&lt;img src="http://assets.20bits.com/20110422/match-com.jpg" alt="Match.com homepage" title="Match.com homepage" width="500" class="aligncenter math size-medium wp-image-811" /&gt;&lt;/a&gt;

&lt;p&gt;Look at all those happy people.  Don't you want to be happy, too?&lt;/p&gt;

&lt;h3&gt;Pick a Fight&lt;/h3&gt;
&lt;p&gt;
Rather than trying to blend in with a community, sometimes it helps to generate controversy.  This can mean pitting yourself against the community, or setting two factions within the community against each other.
&lt;/p&gt;

&lt;p&gt;
I'll share a story of my own.  Back in 2009 I was working on a polling application for Facebook.  People would vote in polls and their votes would appear in their newsfeeds.
&lt;/p&gt;

&lt;p&gt;
The most controversial topics were the most successful, so to get things started I created a poll: "Do you support same-sex marriage?"
&lt;/p&gt;

&lt;p&gt;
This was just after the Proposition 8 fight, so I decided to run two sets of ads on Facebook: one targeting 50 miles around San Francisco and another targeting 50 miles around Salt Lake City.
&lt;/p&gt;

&lt;p&gt;
I can't say it was a win for civil political discourse, but it generated enough controversy to make that poll viral and in turn make the entire application viral.&lt;span class="footnote"&gt;The way that application died &lt;a href="http://www.huffingtonpost.com/2009/10/01/facebook-obama-death-poll_n_306637.html"&gt;makes a good story&lt;/a&gt;, too.&lt;/span&gt;
&lt;/p&gt;

&lt;h3&gt;Environmental Flaws&lt;/h3&gt;
&lt;p&gt;
Sometimes the environment has flaws which allow the click hacker to do some interesting things.
&lt;/p&gt;

&lt;p&gt;
When Facebook first launched their user-generated polling product, I remember seeing several polls asking questions like, "Which model is hotter?"  The possible answers were URLs of images.
&lt;/p&gt;

&lt;p&gt;
Unfortunately for the user, clicking on the text of the answer (the URL in this case) would register a vote and post it to their newsfeed.  Their friends, who of course also wanted to see pictures of hot models, clicked the URLs, accidentally voted, and spread the poll to their friends.  Oops!
&lt;/p&gt;

&lt;p&gt;
Click hackers were using this mis-feature deliberately to spread new polls across Facebook.&lt;span class="footnote"&gt;There's a blog post describing data from this social hack &amp;mdash; if anyone has it please share it in the comments!&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
I don't think many people will disagree when I say that most of the growth on the Facebook Platform from 2007-2009 was built on similar environmental flaws.  Here's a two-year-old example from Slide's Super Pocus:
&lt;/p&gt;

&lt;img src="http://assets.20bits.com/20110422/super-pocus.png" alt="" title="Super Pocus click hacking" class="alignnone size-medium wp-image-378 math"&gt;

&lt;h3&gt;What else?&lt;/h3&gt;
&lt;p&gt;What else am I missing?  Post examples in the comments.&lt;/p&gt;

&lt;p&gt;Have you ever been "tricked" into clicking something?  (I know I have.) Have a funny story?  Have experience optimizing landing pages, etc. and think about this all the time anyway?  Leave a comment and share!&lt;/p&gt;

&lt;h3&gt;Everlane is Hiring&lt;/h3&gt;
&lt;p&gt;
My startup, &lt;a href="http://www.everlane.com"&gt;Everlane&lt;/a&gt;, is hiring.  We're trying to take the best elements of shopping offline &amp;mdash; visual merchandising, personality, and curation &amp;mdash; and bring them online.  Check out our jobs page at &lt;a href="http://www.everlane.com/jobs"&gt;http://www.everlane.com/jobs&lt;/a&gt; and send an email to &lt;a href="mailto:jesse@everlane.com?subject=[20bits]%20Everlane"&gt;jesse@everlane.com&lt;/a&gt; if you're interested in talking.
&lt;/p&gt;</description>
      <pubDate>Fri, 22 Apr 2011 16:12:38 +0000</pubDate>
      <link>http://20bits.com/article/click-hacking-for-fun-and-profit</link>
    </item>
    <item>
      <title>Speed vs. Certainty in A/B Testing</title>
      <description>&lt;p&gt;
&lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B testing&lt;/a&gt; is a great tactical tool for studying customer behavior on the web.  But like any randomized trial there's some chance that the improvement we measure is just statistical noise.
&lt;/p&gt;

&lt;p&gt;
How worried should we be that the feature we thought improved our product actually does nothing, or worse, hurts our bottom line?  How can we ever really know that we're making the correct decision?  And is it better to run tests more quickly or more accurately?
&lt;/p&gt;

&lt;p&gt;
The answers to these questions depend on the cost of a bad decision.  If mistakes are cheap then it's better to make 1,000 decisions and get only 60% of them right than to make 100 decisions and get 100% of them right.
&lt;/p&gt;
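&lt;p&gt;
A quick expected-value sketch makes the comparison concrete.  The payoffs below are invented for illustration (one unit gained per correct decision, one lost per mistake):
&lt;/p&gt;

```python
# "Many fast decisions vs. few certain ones" with symmetric,
# hypothetical payoffs: each correct call earns 1 unit, each wrong
# call costs 1 unit.
def expected_value(decisions, accuracy, gain=1.0, cost=1.0):
    """Net outcome for a batch of decisions made at a given accuracy."""
    right = decisions * accuracy
    wrong = decisions - right
    return right * gain - wrong * cost

fast = expected_value(1000, 0.60)   # many quick, noisy decisions
slow = expected_value(100, 1.00)    # few careful, certain decisions
print(fast, slow)   # fast nets roughly 200 units, slow exactly 100
```

&lt;p&gt;
The balance flips as soon as mistakes get expensive: raise the cost parameter and the careful strategy wins.
&lt;/p&gt;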

&lt;p&gt;
One way to achieve this balance in the context of A/B testing is to tune the confidence level.
&lt;/p&gt;

&lt;h3&gt;Tuning the Confidence Level&lt;/h3&gt;
&lt;p&gt;
Intuitively, the confidence level of an A/B test tells you how certain you can be of the result of the A/B test.  For example, a confidence level of 95% means that there's a 5% chance that a statistically significant result is actually random variation, i.e., there is a 5% chance of a false positive.
&lt;/p&gt;

&lt;p&gt;
Of course, we're free to choose some other confidence level besides 95%.  We could choose 80%, 90%, or 99.999%.  A higher confidence level requires more data before reaching statistical significance, but we will be more certain of the result.
&lt;/p&gt;
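&lt;p&gt;
To see how steeply the data requirement grows, here is a rough sample-size sketch.  The baseline conversion rate and detectable lift are assumptions chosen for illustration, and the formula is a simplified normal-approximation calculation, not a full power analysis:
&lt;/p&gt;

```python
import math
from statistics import NormalDist

def required_n(confidence, base_rate=0.10, lift=0.01):
    """Rough per-variant sample size needed to resolve an absolute
    lift in a conversion rate of base_rate at a one-sided confidence
    level (normal approximation; simplified power assumptions)."""
    z = NormalDist().inv_cdf(confidence)      # z-quantile for the level
    variance = base_rate * (1 - base_rate)    # Bernoulli variance
    return math.ceil((z / lift) ** 2 * variance)

for level in (0.80, 0.90, 0.95, 0.999):
    print(level, required_n(level))   # sample size climbs quickly
```

&lt;p&gt;
The required sample size grows with the square of the z-quantile, which is why the jump from 95% to 99.9% confidence costs far more traffic than the jump from 80% to 90%.
&lt;/p&gt;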

&lt;p&gt;
If you're not comfortable with the nuts and bolts of statistical analysis, confidence levels, and A/B testing I recommend reading my article about &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;statistical analysis and A/B testing&lt;/a&gt;, which explains exactly how one "chooses" a confidence level.
&lt;/p&gt;

&lt;p&gt;
In short, the confidence level acts as a dial between speed and certainty, and we're free to choose where to set that dial depending on the priorities of our business or product.
&lt;/p&gt;

&lt;h3&gt;Speed vs. Certainty&lt;/h3&gt;
&lt;p&gt;
So where on the speed-certainty spectrum should you, as a product manager or startup entrepreneur, sit?
&lt;/p&gt;

&lt;p&gt;
Mike Cassidy has a great presentation where he argues that &lt;a href="http://www.slideshare.net/dmc500hats/best-strategy-is-speed-startup2startup-may-2008"&gt;speed is the primary business strategy&lt;/a&gt; for startups.
&lt;/p&gt;

&lt;p&gt;
Why is speed great for startups?  Because mistakes are cheap and calculated risks are rewarded.  Most product decisions can be undone, and important early tests can be redone at a higher confidence level when the product has more traction.
&lt;/p&gt;

&lt;p&gt;
But mistakes aren't always cheap.  Here are some factors that increase the cost of a mistake.
&lt;/p&gt;

&lt;h4&gt;Volume&lt;/h4&gt;
&lt;p&gt;
Volume is leverage.  If you have millions of customers, like Google or Amazon, a 1% improvement to the bottom line is a huge win.  Conversely, a 1% mistake is a huge hit. 
&lt;/p&gt;

&lt;p&gt;
Fortunately, this problem mitigates itself: increased volume affords you the luxury of running A/B tests at a higher confidence level in the same amount of time.
&lt;/p&gt;

&lt;h4&gt;Reversibility&lt;/h4&gt;
&lt;p&gt;
Most product decisions in a consumer technology startup can be undone, for a price.  For example, it's easy to undo a bad decision for a web-based product, slightly harder to undo a decision for desktop software, and very difficult (and costly) to undo a decision for a physical product.
&lt;/p&gt;

&lt;p&gt;
The less reversible a decision is the more certain you should be before you make it.  In the context of A/B testing a product feature this means a higher confidence level, even if it takes longer to run the test.
&lt;/p&gt;

&lt;h4&gt;Real Money&lt;/h4&gt;
&lt;p&gt;
Imagine you're an ad network.  You're constantly A/B testing formatting, positioning, offers, etc. to see which performs best.  Making a mistake in this regard costs your publishers money.
&lt;/p&gt;

&lt;p&gt;
Like volume, money creates leverage.  But it is more complicated than that: publishers don't just want increased revenues, they want reliable cash flow.  That is, when money is involved, not only do you have to perform better but you have to perform more consistently because of phenomena like the &lt;a href="http://en.wikipedia.org/wiki/Peak-end_rule"&gt;peak-end rule&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
In this case a "three steps forward one step back" strategy might actually be worse than going step-by-step in the right direction, even if the former averages out to better performance.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Maintaining momentum in a startup isn't about making &lt;em&gt;only&lt;/em&gt; correct decisions &amp;mdash; it's about making &lt;em&gt;enough&lt;/em&gt; correct decisions.  This presents a continuum from speed to certainty.  At one extreme you run the business with a magic eight-ball; at the other you agonize over every detail until you're 100% certain that you've made the correct choice.
&lt;/p&gt;

&lt;p&gt;
This thought process extends naturally to A/B testing, where "certainty" and "cost" can be quantified.  To recap:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;A/B testing is a great tactical tool for testing &lt;em&gt;specific&lt;/em&gt; hypotheses about your customers.&lt;/li&gt;
	&lt;li&gt;However, there is a tradeoff between speed and certainty, controlled by the confidence level of the A/B test.&lt;/li&gt;
	&lt;li&gt;The cost of doing A/B tests quickly is that you will make more wrong decisions, but that is ok if mistakes are cheap.&lt;/li&gt;
	&lt;li&gt;For example, it's better to make 1,000 decisions and get only 60% of them right than to make 100 decisions and get 100% of them right, all else being equal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;A Spreadsheet Model&lt;/h3&gt;
&lt;p&gt;
Below is a little spreadsheet model that illustrates all my points above.
&lt;/p&gt;

&lt;p&gt;
The two independent variables are the gain from a good decision and the cost of a bad decision.  The spreadsheet assumes a fixed time period, so a higher confidence level means more certainty but &lt;strong&gt;fewer tests&lt;/strong&gt;.  The ideal confidence level is highlighted as you change the parameters of the model.
&lt;/p&gt;
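&lt;p&gt;
The spreadsheet itself isn't reproduced here, but its logic can be sketched in a few lines.  Every parameter below (total traffic, baseline rate, detectable lift, candidate levels) is an assumption for illustration, not a value taken from the model:
&lt;/p&gt;

```python
import math
from statistics import NormalDist

def best_confidence(gain, cost, traffic=100_000, base_rate=0.10,
                    lift=0.01, levels=(0.80, 0.90, 0.95, 0.99)):
    """Pick the confidence level maximizing expected value over a fixed
    traffic budget: higher confidence means bigger tests, hence fewer
    of them, but also fewer false positives per apparent 'win'."""
    def total_value(level):
        z = NormalDist().inv_cdf(level)
        n_per_test = math.ceil((z / lift) ** 2 * base_rate * (1 - base_rate))
        num_tests = traffic // n_per_test          # fixed time period
        return num_tests * (level * gain - (1 - level) * cost)
    return max(levels, key=total_value)

# Cheap mistakes favor speed; expensive mistakes favor certainty.
print(best_confidence(gain=1, cost=0))     # 0.8
print(best_confidence(gain=1, cost=100))   # 0.99
```

&lt;p&gt;
The qualitative behavior matches the argument above: the optimal confidence level slides upward as the cost of a bad decision grows relative to the gain from a good one.
&lt;/p&gt;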

&lt;p&gt;
You can download &lt;a href="http://assets.20bits.com/downloads/confidence-model.xls"&gt;the A/B testing confidence model&lt;/a&gt; here.
&lt;/p&gt;

&lt;p&gt;
For the statistically inclined this model assumes that traffic increases linearly over time, that the sample statistic is normally distributed, and that a one-tailed t-test is the appropriate statistical test.
&lt;/p&gt;

&lt;h3&gt;Credits&lt;/h3&gt;
&lt;p&gt;
This article was inspired by a conversation with &lt;a href="http://startup-marketing.com/"&gt;Sean Ellis&lt;/a&gt; and edited by my &lt;a href="http://twitter.com/aleeeex"&gt;wonderful girlfriend&lt;/a&gt;, who is probably going to yell at me for linking to her Twitter account.
&lt;/p&gt;</description>
      <pubDate>Mon, 25 May 2009 10:00:18 +0000</pubDate>
      <link>http://20bits.com/article/speed-vs-certainty-in-ab-testing</link>
    </item>
    <item>
      <title>8 Tips for Crafting Metrics That Matter</title>
      <description>&lt;p&gt;
Metrics are the marketer's microscope.  They show him what his customers are actually doing, as opposed to what they say they are doing or intend to do. 
With proper metrics he can make decisions faster and more accurately.
&lt;/p&gt;

&lt;p&gt;
You can decide to measure anything, but which metrics matter and which are just for show?  Here are some rules I hope will guide you toward creating meaningful metrics that help, rather than hinder, the decision-making process.
&lt;/p&gt;


&lt;h3&gt;Be Actionable&lt;/h3&gt;
&lt;p&gt;
If I had to give a one-sentence answer to the question "What metrics should I implement for my product?" it would be "Whatever metrics are actionable."  This means the line from a question to a metric and the line from a metric to an action should be as short as possible.
&lt;/p&gt;

&lt;p&gt;
Most of the tips below are meant to focus attention on this issue.  What can you do to make sure your metrics are actionable?
&lt;/p&gt;

&lt;h3&gt;Be Understandable and Trustworthy&lt;/h3&gt;
&lt;p&gt;
Do you understand what your metric measures?  Does everyone else in your organization understand it, and do they trust it?
&lt;/p&gt;

&lt;p&gt;
Trust is the important part.  Everyone has to trust the metric if you're going to use it to make decisions, otherwise you'll be getting constant pushback.  This will slow the decision-making process and cause a lot of ill-tempered arguments.
&lt;/p&gt;

&lt;h3&gt;Measure Results&lt;/h3&gt;
&lt;p&gt;
Does your metric measure customer behavior or a correlate of customer behavior?  In the past approximations and correlations were necessary because measuring behavior directly was hard, but on the web there's no excuse &amp;mdash; you have access to every single thing a person does on your site, down to where their mouse is hovering and for how long.
&lt;/p&gt;

&lt;p&gt;
For example, if you want to know how good Twitter is for your business don't measure the number of positive tweets about your company.  Instead, measure how many customers it drives to your product and how much money those customers put in your hands.
&lt;/p&gt;

&lt;h3&gt;Understand the Downside&lt;/h3&gt;
&lt;p&gt;
What would you do if your metric were 50% off the mark today?  Would you be able to articulate why this is a problem for your business?  What it costs you?  Would you know where to start looking for possible causes?
&lt;/p&gt;

&lt;p&gt;
As an example, I've worked with a startup that used "number of MySpace friends" as a go-to metric in every marketing meeting.  Is that really material to the business? 
&lt;/p&gt;

&lt;p&gt;
What would happen if tomorrow we had half as many MySpace friends? Would we lose $100?  $10,000?
&lt;/p&gt;

&lt;p&gt;
This number says nothing about the value of MySpace as a marketing channel, which in the most charitable case is what it is trying to approximate, and the downside is completely ambiguous.  Is the number of MySpace friends tied to anything meaningful?
&lt;/p&gt;

&lt;p&gt;
Like the Twitter example above, if I thought MySpace were an important marketing channel for my product I'd be measuring things like the number of qualified leads from MySpace and the value they generate for the business.
&lt;/p&gt;

&lt;h3&gt;Understand the Upside&lt;/h3&gt;
&lt;p&gt;
Conversely, ask yourself, "What value does improving the metric bring to the company?"  Some metrics are blindingly obvious in this regard, e.g., top-line revenue numbers and some "efficiency" metrics like effective CPM and revenue per user. 
&lt;/p&gt;

&lt;p&gt;
Some metrics are less obvious.  What about page views?  Can you imagine a scenario where your page views double but the net effect is bad for your business?
&lt;/p&gt;

&lt;p&gt;
Metrics like these are dangerous because they lull you into a false sense of security.  Everything on your analytics dashboard is going up and to the right, but the fundamentals of the business might still be suffering.
&lt;/p&gt;

&lt;h3&gt;Don't Be Ambiguous&lt;/h3&gt;
&lt;p&gt;
Does your metric measure one thing and one thing only?  Or is it really an aggregation of several variables, each of which can rise or fall independently?
&lt;/p&gt;

&lt;p&gt;
A good example of an ambiguous metric is the notion of a "daily active user," or the total number of people who interacted with your product today.  This number is ambiguous because it is really the sum of two metrics: the number of new users and the number of returning users.
&lt;/p&gt;

&lt;p&gt;
Ambiguous metrics are bad because they obscure the underlying variables that truly reflect customer behavior and delay the decision-making process as you are forced to determine which of these variables is actually affecting the aggregate metric.  Furthermore, it's possible that one variable accounts for all the growth in the metric, e.g., you have millions of new users per day but no returning users.
&lt;/p&gt;

&lt;p&gt;
This latter scenario has been the death of many Facebook apps.
&lt;/p&gt;
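&lt;p&gt;
To make the ambiguity concrete, here is a toy decomposition with invented numbers.  The headline daily-active-user count rises even as the returning-user core, the part that signals a healthy product, collapses:
&lt;/p&gt;

```python
# Invented daily figures: DAU is the sum of two independent variables,
# new users and returning users, and the aggregate hides their mix.
days = [
    {"new": 9000, "returning": 1000},
    {"new": 9500, "returning": 700},
    {"new": 10200, "returning": 400},
]
for day in days:
    dau = day["new"] + day["returning"]
    returning_share = day["returning"] / dau
    print(dau, round(returning_share, 2))
# DAU: 10000, 10200, 10600 -- up and to the right, yet retention is dying.
```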

&lt;h3&gt;Segment by Purpose&lt;/h3&gt;
&lt;p&gt;
Whenever I'm building a web product I divide usage into key segments.  Generally these are acquisition, retention, engagement, and monetization.  That is, how do customers find my product?  Are those customers coming back?  Are they doing the things I want or need them to do?  And how much money are those customers making me?
&lt;/p&gt;

&lt;p&gt;
See Dave McClure's &lt;a href="http://www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version"&gt;Startup Metrics for Pirates&lt;/a&gt; presentation, which focuses on this idea.
&lt;/p&gt;

&lt;p&gt;
Doing this lets you tackle each segment independently and allows for a sharper product focus.  Early on you might want to focus on acquisition, or maybe you want to focus on monetization from day one.  Eventually you'll start caring about longer-term metrics like retention, and you won't be distracted or overwhelmed by unrelated metrics from different segments.
&lt;/p&gt;

&lt;p&gt;
You might select one or two from each category to act as top-line variables in a dashboard that you look at every day, e.g., new users (by channel), returning users (by channel), activity, and revenue.
&lt;/p&gt;

&lt;h3&gt;Appropriate Granularity&lt;/h3&gt;
&lt;p&gt;
Sometimes you need a bird's-eye view and sometimes you need a scanning tunneling microscope.  Know when you need which.
&lt;/p&gt;

&lt;p&gt;
As a general rule of thumb I focus on the microscopic when I'm designing specific optimizations and on the macroscopic when I'm determining whether the decisions we're making are working.  Another test is to finish this sentence: "I know my product is healthy because..."
&lt;/p&gt;

&lt;p&gt;
You'd never finish that sentence with "because the click-through-rate on my login page is 20%."  You'd say something like "because 80% of my customers return every week" or "because our revenue is growing by 5% month-over-month."
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
These are just tips, not hard-and-fast rules.  They are meant to focus the discussion of which metrics matter and why, because the choice of metrics has a long-lasting impact if your team is committed to building a data-driven culture.
&lt;/p&gt;

&lt;p&gt;
If you have any tips on deciding what metrics matter and why, leave them in the comments!  This list wasn't meant to be exhaustive and I know (or hope) people will have some strong opinions!
&lt;/p&gt;</description>
      <pubDate>Tue, 12 May 2009 10:00:31 +0000</pubDate>
      <link>http://20bits.com/article/8-tips-for-crafting-metrics-that-matter</link>
    </item>
    <item>
      <title>Building a Social Network, Island by Island</title>
      <description>&lt;p&gt;
A necessary condition for building a self-sustaining social network is density.  We understand this intuitively.  After all, a network of one person is hardly a "network" at all.
&lt;/p&gt;

&lt;p&gt;
Metcalfe's Law, which states that the value of a network grows in proportion to the square of the number of its users, expresses this idea formally.  The value of a social network rests in its ability to foster communication, in its connections.
&lt;/p&gt;
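&lt;p&gt;
The usual combinatorial reading of Metcalfe's Law counts potential pairwise connections, which is where the n-squared growth comes from:
&lt;/p&gt;

```python
def possible_connections(n):
    """Pairwise connections among n users, the quantity behind
    Metcalfe's Law: it grows on the order of n squared."""
    return n * (n - 1) // 2

# Doubling the user base roughly quadruples the potential connections:
print(possible_connections(100), possible_connections(200))   # 4950 19900
```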

&lt;p&gt;	
If you're building a social network, whether it's a destination website or an application that exists on another social network, density must figure into key strategic decisions.  Let's see how.
&lt;/p&gt;

&lt;h3&gt;Islands&lt;/h3&gt;
&lt;p&gt;
One way to think about social networks is as a network of networks.  On Facebook, for example, if I graph the connections between all my friends I see distinct groups: my high school friends, my college friends, my current circle of friends, and my professional network.
&lt;/p&gt;

&lt;p&gt;
Each group is more or less isolated from the others.  Density exists within each of these islands, but not between them.  I'm fairly certain this is a topological property of any social network.
&lt;/p&gt;

&lt;h3&gt;Case Study: Facebook&lt;/h3&gt;
&lt;p&gt;
Facebook started at Harvard and was initially college-only.  Their growth strategy was explicit from day one: move from school to school as demand warranted.  I've been told that Sean Parker wouldn't consider opening up access to a new school until at least several dozen students from that school requested an account.
&lt;/p&gt;

&lt;p&gt;
Each school was an island.  Once Facebook saturated a specific set of colleges it moved on to the next round.  Eventually there was enough anticipatory buzz that they could launch at large state schools without risk of fading away.
&lt;/p&gt;

&lt;p&gt;
They still pursue this strategy today.  After establishing a critical density among colleges they opened up access to high schools and then to everyone with an email address.  From there they started moving country-to-country.
&lt;/p&gt;

&lt;p&gt;
The countries where Facebook is having the most difficulty gaining traction are the ones with already-established social networks, like Germany with &lt;a href="http://en.wikipedia.org/wiki/StudiVZ"&gt;StudiVZ&lt;/a&gt;.  In fact, if you look at &lt;a href="http://gawker.com/tech/data-junkie/the-world-map-of-social-networks-273201.php"&gt;this old map&lt;/a&gt; of the most popular social network in each country, you get an idea of how isolated this country-by-country growth really is.
&lt;/p&gt;

&lt;h3&gt;Other Networks&lt;/h3&gt;
&lt;p&gt;
Facebook isn't the only social network that grew this way.  hi5 has a similar story, starting with smaller markets overseas and spreading from country to country.  Craigslist did the same, starting small in San Francisco and eventually becoming a presence in most major US cities.
&lt;/p&gt;

&lt;p&gt;
The MySpace team had a background in direct marketing, which is all about targeting specific offers at the people who are most likely to respond.  They started with the club scene in LA and grew from there.
&lt;/p&gt;

&lt;p&gt;
The key to all these strategies was density.  
&lt;/p&gt;

&lt;p&gt;
If you're launching a new social service, even if your end goal is to have everyone and their mother using it, it's important to understand the impact density has on growth.
&lt;/p&gt;

&lt;h3&gt;Multiple Dimensions of Density&lt;/h3&gt;
&lt;p&gt;
So far the only kind of density we've talked about is network density, i.e., multiple people connected through their shared use of a service.  You could call this "product density."
&lt;/p&gt;

&lt;p&gt;
Sometimes product density isn't enough.  Take IM, for example, or any network that requires synchronous communication.  Not only do two people have to be using the same product but they have to be using it at the same time.  What good is your friend being on IM if you're never awake at the same time?
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://en.wikipedia.org/wiki/Xfire"&gt;Xfire&lt;/a&gt;, an IM client for gamers, is an example of a product that innovated in this space by tackling a segment of customers who were already interacting synchronously.
&lt;/p&gt;

&lt;p&gt;
Mobile social networks take this to an even greater extreme.  To connect with people on Loopt or Google Latitude not only do we have to be using the same product at the same time, but we have to be in the same place!
&lt;/p&gt;

&lt;p&gt;
This isn't to say building these networks is impossible.  Rather, they come with an extra handicap in the form of reduced density.  Overcoming that problem has to be a key part of the product strategy.
&lt;/p&gt;

&lt;h3&gt;Conclusion and Counterexamples?&lt;/h3&gt;
&lt;p&gt;
Most product strategy discussions, in my experience, are focused on acquisition or other topline metrics that go "up and to the right."  Instead, if you're building social software, I believe density is a necessary condition for long-term success and needs to be a part of the strategy discussion from day one.
&lt;/p&gt;

&lt;p&gt;
First, understand the density requirements for your product.  Do customers need to sign up for the same service?  Do they need to be using it at the same time?  Do they need to be in the same place?  Is there anything you can do to lower the density requirement?
&lt;/p&gt;

&lt;p&gt;
Second, build a "depth first" strategy.  Are there any naturally dense customer segments that might fit your product?  Do you have the ability to target specific demographics or segments for acquisition?  Which ones respond positively and is it possible to build density there?
&lt;/p&gt;

&lt;p&gt;
Once you've achieved sufficient density on one island, hop to the next and repeat.
&lt;/p&gt;

&lt;p&gt;
And if anyone out there can think of any counterexamples &amp;mdash; social networks or services that got big "all at once" &amp;mdash; leave a comment and let me know!  I honestly can't think of any.
&lt;/p&gt;</description>
      <pubDate>Mon, 11 May 2009 08:30:36 +0000</pubDate>
      <link>http://20bits.com/article/building-a-social-network-island-by-island</link>
    </item>
    <item>
      <title>What Verna Taught Me</title>
      <description>&lt;p&gt;
When I was a student at the University of Chicago I worked for Residential Computing, or ResCom.  ResCom was responsible for maintaining all the computer labs and IT systems in the residential dorms.
&lt;/p&gt;

&lt;p&gt;
I had two jobs.  The first was to help manage the dorm computer labs.  If a new virus broke out and computers were banned from the network because they were spamming everybody on campus, I had to go in and clean it up.  I still remember when the &lt;a href="http://en.wikipedia.org/wiki/Blaster_(computer_worm)"&gt;Blaster worm&lt;/a&gt; infected most of the Windows machines on campus.
&lt;/p&gt;

&lt;p&gt;
The second job was to help build a web-based dorm management system called Chopin.  This system was at the center of the daily operation of the dorms.  Students could use it to submit work requests and report problems with the dorms.  Staff could use it to send out mass mailings, sell students printing credits for the dorm printers, and any number of other routine tasks.
&lt;/p&gt;

&lt;h3&gt;Enter Verna&lt;/h3&gt;

&lt;p&gt;
Verna was a front-desk clerk at one of the dorms.  She was responsible for helping students when problems came up, and she used Chopin to get her job done.
&lt;/p&gt;

&lt;p&gt;
She had also lived in Hyde Park, the neighborhood surrounding the University, for most of her life and just wanted to do her job without having to deal with whiny students or crappy software.  Chopin was supposed to help her do that.
&lt;/p&gt;

&lt;p&gt;
One day while fixing the front-desk computer I asked Verna, "What do you think of Chopin?"  She immediately supplied a laundry list of complaints.  It didn't work well, it was confusing, she never knew where to go or what to do, and so forth.
&lt;/p&gt;

&lt;p&gt;
It would have been easy to blame her.  Maybe she just needed better training or maybe she wasn't trying hard enough.  But that was all ego: I was just upset because she told me the product I helped build was pretty awful.
&lt;/p&gt;

&lt;p&gt;
Her critique was absolutely fair.  I had built solutions I liked for problems I wasn't even sure existed.  In fact, before then, I had never honestly talked to any of the people who actually &lt;em&gt;used&lt;/em&gt; Chopin about the product itself.  In retrospect that seems completely insane.
&lt;/p&gt;

&lt;p&gt;
And watching her use Chopin I saw so much waste.  She would click ten times where two would have sufficed.  Some features never got used; others were used for things they weren't intended for.  Once I was there, talking with her and watching how she used Chopin, I saw that the whole process was messed up from top to bottom.  Most of her energy was spent working around problems I had caused!
&lt;/p&gt;

&lt;h3&gt;Go and See&lt;/h3&gt;

&lt;p&gt;
If my job hadn't required that I work on Chopin and get out of the office I never would have even realized there was a problem.
&lt;/p&gt;

&lt;p&gt;
That experience taught me that whenever I didn't understand a customer's frustration, or thought that maybe they were feeling this way or that, I should just go ask them before building solutions that might be worse than the problem.  When in doubt, go and see for yourself.  Actually, scratch that: &lt;em&gt;always&lt;/em&gt; go see for yourself.
&lt;/p&gt;

&lt;p&gt;
Too often I'd pass off mere belief as knowledge, or generalize from a specific set of circumstances to a fundamentally different set of circumstances.  Maybe this solution worked over there, but why should it work over here?  The only people who can validate your product are your customers &amp;mdash; everyone else, including yourself, can wait their turn.
&lt;/p&gt;

&lt;p&gt;
Verna taught me that.
&lt;/p&gt;</description>
      <pubDate>Wed, 06 May 2009 08:30:28 +0000</pubDate>
      <link>http://20bits.com/article/what-verna-taught-me</link>
    </item>
    <item>
      <title>Notification Strategies for Social Networks</title>
      <description>&lt;p&gt;
You've built a social application and launched a new feature.  The number of notifications you can send out is constrained.  Which set of users should you notify to guarantee the most people start using this new feature?
&lt;/p&gt;

&lt;p&gt;
This problem might seem artificial.  Why not put up an ad on your product, or send a notification to every single person who might be interested?  There are several reasons the number of people you can notify might be constrained.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;There is a technical constraint, e.g., Facebook limits the number of application-to-user notifications at the API level.&lt;/li&gt;
&lt;li&gt;There is a financial constraint, e.g., you're sending notifications over SMS and every message costs you money.&lt;/li&gt;
&lt;li&gt;There is a strategic constraint, e.g., sending notifications too frequently causes fatigue and reduces the effectiveness of future notifications.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
So, the situation is not too far-fetched.  Let's investigate the issue.
&lt;/p&gt;

&lt;p&gt;
For the rest of the article the "application" is going to be a Facebook application and it can only send 100 application-to-user notifications per week.  Which 100 users should we notify?
&lt;/p&gt;

&lt;h3&gt;The Basic Considerations&lt;/h3&gt;
&lt;p&gt;
In the &lt;a href="/article/behavior-adoption-on-social-networks"&gt;linear cascade model&lt;/a&gt; when a user in a social network adopts a new behavior there is a probability that each neighbor in the network will adopt it.
&lt;/p&gt;

&lt;p&gt;
Under this model we probably wouldn't want to notify two people who are friends, and especially not a cluster of friends or a &lt;a href="http://en.wikipedia.org/wiki/Clique_(graph_theory)"&gt;clique&lt;/a&gt;.  The new feature wouldn't spread very far beyond this group.
&lt;/p&gt;

&lt;p&gt;
Likewise, we wouldn't want to notify people who are very far apart on the social network because a user is more likely to adopt a behavior if more than one of his friends has also adopted it.  So there is a balancing act between notifying users who are close together, to achieve density, and notifying users who are far apart, to achieve breadth.
&lt;/p&gt;

&lt;h3&gt;Heuristics and Centrality Measures&lt;/h3&gt;
&lt;p&gt;
The easiest solution is to pick 100 random users to notify, but this is also the most naive since it takes into account neither the structure of the network nor the likelihood that a person will influence their neighbors.
&lt;/p&gt;

&lt;p&gt;
A better&lt;span class="footnote"&gt;"Better" according to what?  As we'll see, randomly selecting seed users performs worse than all the other heuristics.&lt;/span&gt; solution to this problem is to develop a heuristic that ranks every user in the network according to some metric.  If we can only send 100 notifications then they are sent to the first 100 people on this ranked list.
&lt;/p&gt;

&lt;p&gt;
The idea here is to use &lt;a href="http://en.wikipedia.org/wiki/Centrality"&gt;centrality measures&lt;/a&gt; to come up with heuristics.  In graph theory "centrality" is a measure of how important an individual node is.
&lt;/p&gt;

&lt;p&gt;
The simplest measure is called "degree centrality" and is equal to the number of neighbors of a node.  On a social network this is the number of friends of a given user.  So, if we wanted to send out 100 notifications using this heuristic we'd send them to the 100 users with the most friends.  In practice this heuristic amounts to convincing celebrities to use the new feature.
&lt;/p&gt;
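&lt;p&gt;
As a rough sketch, here's what the degree-centrality heuristic looks like in Python.  The edge list and user names are made up for illustration; in practice the friendship graph would come from your own data:
&lt;/p&gt;

```python
from collections import defaultdict

def top_k_by_degree(edges, k):
    """Rank users by degree centrality (friend count); notify the top k."""
    degree = defaultdict(int)
    for u, v in edges:  # undirected friendships
        degree[u] += 1
        degree[v] += 1
    # Stable sort, highest-degree users first.
    return sorted(degree, key=degree.get, reverse=True)[:k]

edges = [("amy", "bob"), ("amy", "cal"), ("amy", "dee"), ("bob", "cal")]
print(top_k_by_degree(edges, 2))  # amy has the most friends, so she ranks first
```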

&lt;p&gt;
There are other, more complex heuristics.  The Wikipedia article linked above has a list of other centrality measures, and I wrote an article about calculating &lt;a href="/article/graph-theory-part-iii-facebook"&gt;eigenvalue centrality&lt;/a&gt;, which is similar to PageRank.  Each of these admits a heuristic which can tell us which users to notify.
&lt;/p&gt;

&lt;p&gt;
Of course, which strategy works best is hard to know beforehand, as it varies with respect to both time and the underlying notification.  A/B testing this is difficult because the effects are intentionally dependent.  If anyone has a good solution to this that doesn't involve collecting massive amounts of data about user behavior I'd be interested in hearing it!
&lt;/p&gt;

&lt;p&gt;
It should be noted that each of these heuristics only takes into account the underlying structure of the graph and not the probability of "infection."  By including the latter we can come up with a nearly exact model of the optimal subset of users to notify.
&lt;/p&gt;

&lt;h3&gt;A Global Solution&lt;/h3&gt;
&lt;p&gt;
&lt;a href="http://www.cs.cmu.edu/~bmeeder/"&gt;Brendan Meeder&lt;/a&gt; at CMU pointed me to a great paper that discuss this very topic, &lt;a href="http://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf"&gt;Maximizing the Spread of Inï¬‚uence through a Social Network&lt;/a&gt; by Kempe, et al.
&lt;/p&gt;

&lt;p&gt;
Rather than take a localized view of the problem by ranking each node individually, we create a statistical model of how the new feature propagates through the network.
&lt;/p&gt;

&lt;p&gt;
First, we start with a finite seed set, A.  In our case A is a set of 100 users.  Say we convert each of these 100 users.
&lt;/p&gt;

&lt;p&gt;
In our model if a user &lt;em&gt;u&lt;/em&gt; is converted then for each neighbor &lt;em&gt;v&lt;/em&gt; there is some probability
&lt;/p&gt;
&lt;div class="math"&gt;
$latex p_{u,v}$
&lt;/div&gt;
&lt;p&gt;
that &lt;em&gt;v&lt;/em&gt; will also be converted.
&lt;/p&gt;

&lt;p&gt;
After the process has run its course some set of users has adopted the new feature.  Because adoption is probabilistic the size of this final configuration is a random variable.  Using the notation from Kempe, et al., for a given seed set A the size of the final set of adopters is a random variable denoted by
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\varphi\left(A\right)}$
&lt;/div&gt;

&lt;p&gt;
Our goal is to pick the set A which maximizes the expected value of this random variable.
&lt;/p&gt;

&lt;p&gt;
Formally, we want to find the subset A such that
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\sigma\left(A\right) = E\left[\varphi\left(A\right)\right]}$
&lt;/div&gt;
&lt;p&gt;
is maximized, where
&lt;/p&gt;
&lt;div class="math"&gt;
$latex \displaystyle{\sigma\left(\cdot\right)}$
&lt;/div&gt;
&lt;p&gt;
is called the &lt;em&gt;influence function&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;The Algorithm and The Results&lt;/h3&gt;
&lt;p&gt;
It turns out that calculating the influence function exactly is &lt;a href="http://en.wikipedia.org/wiki/NP-hard"&gt;NP-hard&lt;/a&gt;, but there is a &lt;a href="http://en.wikipedia.org/wiki/Greedy_algorithm"&gt;greedy algorithm&lt;/a&gt; which approximates the optimal value to within a factor of (1 - 1/e) under certain (unrestrictive) conditions.
&lt;/p&gt;
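&lt;p&gt;
Here's a rough Python sketch of that greedy procedure under a cascade model with a single, made-up conversion probability p on every edge.  The graph, p, and the Monte Carlo trial count are all illustrative assumptions, not details from the paper's experiments:
&lt;/p&gt;

```python
import random

def simulate_cascade(graph, seeds, p):
    """One run of the cascade: each newly converted user gets a single
    chance to convert each unconverted neighbor, with probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def expected_spread(graph, seeds, p, trials=200):
    """Monte Carlo estimate of the influence function sigma(A)."""
    return sum(simulate_cascade(graph, seeds, p) for _ in range(trials)) / trials

def greedy_seed_set(graph, k, p, trials=200):
    """Greedily add whichever node most increases the estimated spread."""
    seeds = []
    for _ in range(k):
        candidates = set(graph) - set(seeds)
        best = max(candidates,
                   key=lambda v: expected_spread(graph, seeds + [v], p, trials))
        seeds.append(best)
    return seeds
```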

&lt;p&gt;
If you want more details read the paper linked above or the related &lt;a href="http://www.cs.cornell.edu/home/kleinber/icalp05-inf.pdf"&gt;Influential Nodes in a Diffusion Model for Social Networks&lt;/a&gt; by Kempe, et al.
&lt;/p&gt;

&lt;p&gt;
Using Monte Carlo methods Kempe, et al. simulated the diffusion process using this algorithm versus several of the heuristics I described above.  The results are fairly striking: their algorithm performs at least 18% better than the best-performing heuristic (degree centrality) and 48% better than if the seed set were randomly selected.  I've embedded a graph of their results below.
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20090505/kempe-graph.png" alt="kempe-graph" title="kempe-graph" width="429" height="334" class="alignnone math size-full wp-image-613" /&gt;
&lt;p&gt;
The "target set" is the initial seed set of users to notify, and the "active set" is the final set of users who actually adopted the new feature or product.  The more users who adopt the feature the better the strategy.
&lt;/p&gt;

&lt;h3&gt;Feasibility&lt;/h3&gt;
&lt;p&gt;
Kempe's algorithm is more feasible than many of the heuristics discussed above, although the best performing heuristic — degree centrality — is also the easiest to calculate.  He also doesn't include eigenvalue centrality in his analysis, which I'd be interested in comparing.
&lt;/p&gt;

&lt;p&gt;
The biggest downside to his algorithm is that it requires both full knowledge of the underlying graph and an accounting of all the user-to-user transmission probabilities.  Modeling these probabilities would require a lot of data about users over an extended period of time.
&lt;/p&gt;

&lt;p&gt;
Whether the additional 18% is worth the extra computation and data collection depends a lot on specific circumstances, but personally I'm going to try to implement it in my projects and see how the performance compares first-hand.
&lt;/p&gt;

&lt;h3&gt;Footnotes&lt;/h3&gt;
&lt;ol id="footnotes"&gt;&lt;/ol&gt;</description>
      <pubDate>Tue, 05 May 2009 09:00:02 +0000</pubDate>
      <link>http://20bits.com/article/notification-strategies-for-social-networks</link>
    </item>
    <item>
      <title>Why hi5 Might Have an Edge on Facebook</title>
      <description>&lt;p&gt;
Facebook has been trying hard to find a business model.  Their &lt;a href="http://en.wikipedia.org/wiki/Beacon_(Facebook)"&gt;Beacon&lt;/a&gt; advertising product is probably the most infamous example.  So far they've been left empty handed and have been forced to look outside the company for money, first from Microsoft&lt;span class="footnote"&gt;&lt;a href="http://news.cnet.com/8301-13577_3-9803872-36.html"&gt;Microsoft acquires equity stake in Facebook, expands ad partnership&lt;/a&gt; (cnet)&lt;/span&gt; and then from foreign investors.&lt;span class="footnote"&gt;&lt;a href="http://www.businessinsider.com/2008/11/update-on-facebook-s-dubai-fundraising-trip"&gt;Update On Facebook's Dubai Fundraising Trip&lt;/a&gt; (Business Insider)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
If Facebook wants to be &lt;a href="http://news.cnet.com/8301-10784_3-9946606-7.html"&gt;the internet's cable company&lt;/a&gt; what are they going to have to do to turn themselves into a &lt;a href="http://www.google.com/finance?q=NASDAQ%3ACMCSA"&gt;$40Bn&lt;/a&gt; company?
&lt;/p&gt;

&lt;h3&gt;Are Virtual Goods the Key?&lt;/h3&gt;
&lt;p&gt;
Not all social networks are struggling to find a great business model.  Tencent, a Chinese social networking company, pulled in over $1Bn in revenue last year, primarily through its use of virtual goods.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2009/03/19/the-worlds-most-lucrative-social-network-chinas-tencent-beats-1-billion-revenue-mark/"&gt;The world's most lucrative social network? China's Tencent beats $1 billion revenue mark.&lt;/a&gt; (VentureBeat)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
But Facebook doesn't need to look overseas to see that virtual goods could work for them.  Most of the top Facebook games use virtual currency to make money, powered by leadgen-based ad networks like &lt;a href="http://offerpal.com"&gt;Offerpal&lt;/a&gt; and &lt;a href="http://getgambit.com"&gt;Gambit&lt;/a&gt;.  There are reports that some of these apps are pulling in eight figures per year.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2008/08/25/developer-analytics-facebook-game-mob-wars-making-22000-a-day/"&gt;Developer Analytics: Facebook game Mob Wars making $22,000 a day&lt;/a&gt; (VentureBeat)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
And of course there's Facebook's own gifting service, which has recently moved to a virtual currency system, pricing gifts in "points" that can be bought with real money.&lt;span class="footnote"&gt;&lt;a href="http://blog.facebook.com/blog.php?post=36577782130"&gt;Gift Shop Credits Have Arrived&lt;/a&gt; (Facebook)&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
All of this is to say that it appears that virtual goods are a natural business model for social networks and Facebook has enough data to see that.  Why isn't Facebook pursuing this strategy more aggressively?  Why do they seem dead-set on building advertising technologies like Social Ads and Beacon?
&lt;/p&gt;

&lt;h3&gt;The US Advertising Crutch&lt;/h3&gt;
&lt;p&gt;
In the world of advertising not all countries are equal.  US traffic is generally valued the highest, followed by other English-speaking countries, the &lt;a href="http://en.wikipedia.org/wiki/G20_industrial_nations"&gt;G20&lt;/a&gt;, and finally the rest of the world.
&lt;/p&gt;

&lt;p&gt;
Until recently Facebook was concentrated in the English-speaking world.  It's the second largest social network in the US, after MySpace, and the largest in both Canada and the UK.
&lt;/p&gt;

&lt;p&gt;
Unlike other social networks which don't have a significant presence in the English-speaking world, Facebook can support itself through advertising.  This is a crutch that prevents Facebook from making bold decisions with their business model.  I believe Facebook sees themselves as the next Google, one piece of technology away from &lt;a href="http://www.roughtype.com/archives/2007/11/the_social_graf_1.php"&gt;changing the world of advertising&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Demographic Crunch&lt;/h3&gt;
&lt;p&gt;
Not all social networks have Facebook's demographics, of course. hi5, the world's third largest social network after MySpace and Facebook, has an extensive presence throughout Latin America and other countries which advertisers and publishers typically ignore.  The same can be said of the advertising market in China, but recall that Tencent pulled in $1Bn last year through virtual goods.
&lt;/p&gt;

&lt;p&gt;
It's little wonder, then, that hi5 is aggressively pursuing a virtual goods strategy.&lt;span class="footnote"&gt;&lt;a href="http://venturebeat.com/2009/01/22/hi5s-virtual-entertainment-plans-could-hit-a-virtual-jackpot/"&gt;Hi5's virtual entertainment plans could hit a virtual jackpot&lt;/a&gt; (VentureBeat)&lt;/span&gt;  Their demographics make this strategy much more appealing.  Facebook has the money and the audience to waste pursuing a pure-advertising strategy for social networks.
&lt;/p&gt;

&lt;p&gt;
What once seemed like a demographic disadvantage might turn out to be a demographic advantage for hi5.  Will they beat Facebook to the business model punch?
&lt;/p&gt;
&lt;p&gt;
And a year from now will we be reading articles about how Facebook's virtual goods strategy compares to hi5's, as opposed to articles about how Facebook's new homepage compares to Twitter?
&lt;/p&gt;

&lt;h3&gt;You're crazy.  You know that, right?&lt;/h3&gt;

&lt;p&gt;
Obviously hi5 has an uphill battle.  Facebook is growing on the order of 500,000 new users &lt;em&gt;per day&lt;/em&gt; and shows no signs of slowing.  But the same was said of MySpace and Friendster when Facebook launched.  I think we still have a few more twists in the story of social networking on the web, and this is just one possible twist among many.
&lt;/p&gt;</description>
      <pubDate>Tue, 28 Apr 2009 10:50:26 +0000</pubDate>
      <link>http://20bits.com/article/why-hi5-might-have-an-edge-on-facebook</link>
    </item>
    <item>
      <title>Behavior Adoption on Social Networks</title>
      <description>&lt;p&gt;
Why and how do people adopt new behaviors?  Why do they start using new products?  Did you sign up for Facebook because all of your friends were on it, or because a specific friend recommended it to you?  Or do you refuse to sign up at all?
&lt;/p&gt;

&lt;p&gt;
In this article I'm going to outline two models that describe how new behaviors, ideas, and messages propagate through social networks.
&lt;/p&gt;

&lt;h3&gt;The Threshold Model&lt;/h3&gt;
&lt;p&gt;
The first model is called the Threshold Model.&lt;span class="footnote"&gt;See &lt;a href="http://rumordynamics.awardspace.com/phfs/Threshold_Models_of_Collective_Behavior.pdf"&gt;Threshold Models of Collective Behavior&lt;/a&gt; (1978) by the famous sociologist Mark Granovetter.&lt;/span&gt;  It says that people adopt a new behavior because a sufficiently large proportion of their friends have adopted that behavior.  Early adopters have a very low threshold, say 5% or 10%, while late adopters would have a much higher threshold. Every person, however, has their own individual threshold.
&lt;/p&gt;

&lt;p&gt;
For example, my girlfriend's stated reason for signing up for Twitter was that "all my friends were using it."  And during the 2008 US Presidential election, some Obama supporters would adopt Hussein as their middle name.&lt;span class="footnote"&gt;See &lt;a href="http://www.huffingtonpost.com/2008/06/28/obama-supporters-adopting_n_109788.html"&gt;Obama Supporters Adopting Middle Name "Hussein" As Their Own&lt;/a&gt;&lt;/span&gt;  When I saw that lots of my friends were doing it I was certainly tempted to do the same.
&lt;/p&gt;

&lt;p&gt;
The underlying psychological principle is one of "missing out" or "when in Rome."  The key variable here is the initial distribution of thresholds across the social network, which completely determines the final extent of the behavior.
&lt;/p&gt;

&lt;p&gt;
It's worth noting that this model says nothing about how people &lt;em&gt;initially&lt;/em&gt; adopt behavior.  That is, it says nothing about innovators, only about the spread of innovation through a social network.
&lt;/p&gt;

&lt;h3&gt;The Cascade Model&lt;/h3&gt;
&lt;p&gt;
The second model is called the Cascade or Word-of-Mouth Model&lt;span class="footnote"&gt;See &lt;a href="http://pluto.huji.ac.il/~msgolden/home_page/pdf/TalkofNetworks.pdf"&gt;Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth&lt;/a&gt; (2001) by Goldenberg, Libai, and Muller.&lt;/span&gt;, and is the method of "viral growth" that most &lt;a href="/article/social-applications-are-social-networks"&gt;social application developers&lt;/a&gt; are familiar with.  It says that every person has a chance of adopting a new behavior whenever one of their neighbors adopts it.
&lt;/p&gt;

&lt;p&gt;
This model describes phenomena like product recommendations or user-to-user notifications on Facebook.  The probability that a person adopts the new behavior is the conversion rate for the notification.&lt;span class="footnote"&gt;More accurately, we'd model the "probability" as a random variable whose mean was the conversion rate.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
This probability is both a function of the sender and the recipient, so more influential people are more likely to convince you to adopt a behavior (or purchase a product, or install an application).
&lt;/p&gt;

&lt;h3&gt;Practical Implications&lt;/h3&gt;
&lt;p&gt;
Both of these models describe facets of real-world interaction on social networks.  My take is that the cascade model is more accurate at the beginning of a social network's life, where behavior is spreading through sparse areas, connected by influencers.  Later on, after a critical density has settled in, people start adopting the behavior because everyone else is adopting it and there's a social cost to not doing the same.
&lt;/p&gt;

&lt;p&gt;
We see this pattern in services like Facebook and MySpace, both of which got their start by harvesting emails and spreading through word-of-mouth (and spam) across a social network.&lt;span class="footnote"&gt;See &lt;a href="http://www.amazon.com/Stealing-MySpace-Control-Popular-Website/dp/1400066948"&gt;Stealing MySpace: The Battle to Control the Most Popular Website in America&lt;/a&gt; for details about the MySpace team's background in direct marketing.  The ConnectU vs. Facebook court documents, which you can find via Google, paint a similar story for Facebook's early years.&lt;/span&gt; Eventually each network reached a point where a sufficient number of people were familiar with the product and new users adopted it not because their friends recommended it (the cascade model), but because there was a social expectation that they do (the threshold model).
&lt;/p&gt;

&lt;p&gt;
Also, with respect to analytics and viral growth, the threshold model is more difficult to track.  In the cascade model we record who sent what to whom and which messages they responded to.  It's clear who gets credit for a user's conversion.  In the threshold model you have to track passive exposures, and there's no clear causal relationship. 
&lt;/p&gt;

&lt;p&gt;
If ten of my friends are doing something and I decide to start doing the same thing, who gets credit?  Most analytics packages will show this behavior as a direct visit, with no connection to other users' behavior, even though there is a viral process underlying it.
&lt;/p&gt;

&lt;p&gt;
In short, the threshold model requires a certain level of behavioral density, while the cascade model doesn't.  However, we see both models expressed in how people actually adopt new behaviors in social contexts.
&lt;/p&gt;

&lt;h3&gt;Formalisms&lt;/h3&gt;
&lt;p&gt;
In the threshold model every person &lt;em&gt;u&lt;/em&gt; has a threshold
&lt;/p&gt;
&lt;div class="latex math"&gt; T_u \in [0,1]&lt;/div&gt;
&lt;p&gt;
and each of their neighbors &lt;em&gt;v&lt;/em&gt; is weighted according to
&lt;/p&gt;
&lt;div class="latex math"&gt; w_{u,v}&lt;/div&gt;
&lt;p&gt;
If
&lt;/p&gt;
&lt;div class="math"$latex \displaystyle{T_u &lt; \sum_{v \in \text{adopters}} w_{u,v}}$&lt;/div&gt;
&lt;p&gt;
then the person &lt;em&gt;u&lt;/em&gt; adopts the behavior.
&lt;/p&gt;

&lt;p&gt;
The set of thresholds, weights, and initial adopters completely determines the extent of the behavior in the social network.
&lt;/p&gt;
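&lt;p&gt;
The threshold dynamics above can be sketched in a few lines of Python.  As a simplifying assumption I weight every neighbor equally, w&lt;sub&gt;u,v&lt;/sub&gt; = 1/deg(u); the graph and thresholds here are invented for illustration:
&lt;/p&gt;

```python
def run_threshold(graph, thresholds, initial_adopters):
    """Iterate the threshold rule to a fixed point: u adopts once the
    fraction of adopting neighbors exceeds T_u (equal edge weights)."""
    adopters = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for u, neighbors in graph.items():
            if u in adopters or not neighbors:
                continue
            frac = sum(v in adopters for v in neighbors) / len(neighbors)
            if frac > thresholds[u]:
                adopters.add(u)
                changed = True
    return adopters

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
thresholds = {"a": 0.4, "b": 0.4, "c": 0.6, "d": 0.4}
# Seeding "a" is enough to tip the whole network here.
print(run_threshold(graph, thresholds, {"a"}))
```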

&lt;p&gt;
In the cascade model, for every person &lt;em&gt;u&lt;/em&gt; and neighbor &lt;em&gt;v&lt;/em&gt; there is a random variable
&lt;/p&gt;
&lt;div class="latex math"&gt; X_{u,v}&lt;/div&gt;
&lt;p&gt;
which describes the likelihood of &lt;em&gt;u&lt;/em&gt; adopting the behavior if &lt;em&gt;v&lt;/em&gt; has adopted it.
&lt;/p&gt;
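&lt;p&gt;
Here's a sketch of one realization of the cascade model, where each X&lt;sub&gt;u,v&lt;/sub&gt; is drawn as a coin flip with a fixed per-edge conversion rate; the graph and rates are invented for illustration:
&lt;/p&gt;

```python
import random

def run_cascade(graph, probs, initial_adopters):
    """One realization of the cascade model: when u adopts, each neighbor v
    gets a single chance to adopt, with per-edge probability probs[(u, v)]."""
    adopters = set(initial_adopters)
    frontier = list(initial_adopters)
    while frontier:
        u = frontier.pop()
        for v in graph.get(u, ()):
            if v not in adopters and random.random() < probs.get((u, v), 0.0):
                adopters.add(v)
                frontier.append(v)
    return adopters
```

Averaging the size of `run_cascade`'s result over many runs estimates the expected reach of a given seed set.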

&lt;h3&gt;Takeaways&lt;/h3&gt;
&lt;p&gt;
I'll try to boil all this down into a few practical takeaways.
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;The Threshold and Cascade Models describe two mechanisms of behavior adoption in social networks.&lt;/li&gt;
	&lt;li&gt;The Threshold Model says that people do something if enough of their friends are doing it.&lt;/li&gt;
	&lt;li&gt;The Cascade Model says that people have a chance of doing something if one of their friends is doing it.&lt;/li&gt;
	&lt;li&gt;Both models correspond to different real-life adoption patterns.&lt;/li&gt;
	&lt;li&gt;The typical "viral loop" involves the cascade model, but most successful social networks rely on the mechanics of the threshold model in the long run, i.e., density is important for long-term success.&lt;/li&gt;
	&lt;li&gt;The cascade model is a good tool for analyzing acquisition scenarios, but the threshold model is probably more helpful for understanding retention and engagement &amp;mdash; it at least implies that &lt;em&gt;density&lt;/em&gt; is a key factor in social network growth, a metric that's not often discussed publicly.&lt;/li&gt;
&lt;/ol&gt; 

&lt;p&gt;
Agree?  Disagree?  Leave a comment, send me an email, or &lt;a href="http://twitter.com/jessefarmer"&gt;follow me on Twitter&lt;/a&gt;!
&lt;/p&gt;</description>
      <pubDate>Fri, 24 Apr 2009 09:20:12 +0000</pubDate>
      <link>http://20bits.com/article/behavior-adoption-on-social-networks</link>
    </item>
    <item>
      <title>Almost Viral: A Hybrid Acquisition Strategy</title>
      <description>&lt;p&gt;
Two common acquisition strategies for a new application are a paid acquisition strategy and a viral acquisition strategy.  The former involves acquiring users at a cost less than the revenue they generate.  The latter involves users inviting their friends to the application.
&lt;/p&gt;

&lt;p&gt;
"Going viral" has become a sort of holy grail and most people would say they'd rather have a viral application than not, but it has a distinct downside: uncontrolled growth and ever-increasing operational costs.  By being almost-but-not-quite viral you can dramatically reduce your cost of acquisition without setting your servers on fire.
&lt;/p&gt;

&lt;h3&gt;The Paid Strategy&lt;/h3&gt;
&lt;p&gt;
The key variables for a paid strategy are cost of acquisition and average revenue per user, or ARPU.  If your ARPU is greater than your cost of acquisition then you can buy as many users as your budget allows, reinvesting the new revenue into acquiring new users.  Generally this is done by advertising your product through something like AdWords or Facebook's Social Ads.
&lt;/p&gt;

&lt;p&gt;
The best thing about a paid strategy is its relative simplicity.  If your ARPU is $2.00 then you can run as many ads you want so long as the cost of acquisition is less than $2.00 and still be profitable.
&lt;/p&gt;

&lt;p&gt;
Eventually, though, you will run out of ad inventory.  There are only so many publishers who are willing to accept $0.10 per click.  At this point your only options are to decrease your cost of acquisition&lt;span class="footnote"&gt;If you increase the conversion rates for your ads then you can pay more per click to get more impressions without hurting your bottom line &amp;mdash; you pay the same to acquire a user, but get more users through the door.&lt;/span&gt; or increase your ARPU.
&lt;/p&gt;

&lt;h3&gt;The Viral Strategy&lt;/h3&gt;
&lt;p&gt;
The viral acquisition strategy requires that you get current users to invite their friends to the application.  The two key variables for the viral strategy are the average number of invites each new user sends and the rate at which those invites convert into new users.
&lt;/p&gt;

&lt;p&gt;
The number of new users each existing user brings in (invites sent times the invite conversion rate) is your viral coefficient, k.  If this is greater than one you will see self-sustaining, viral growth.&lt;span class="footnote"&gt;See my article &lt;a href="/article/three-myths-of-viral-growth"&gt;Three Myths of Viral Growth&lt;/a&gt; for more information about viral growth.&lt;/span&gt;  If the coefficient is less than one each user still brings in a fixed expected number of new users, but the application's growth remains linear.
&lt;/p&gt;

&lt;p&gt;
People who have decided on a viral acquisition strategy focus on this number obsessively.  It's the first big hurdle to clear and if you haven't had experience engineering a viral application it can take months to build something viral.
&lt;/p&gt;

&lt;p&gt;
But being viral isn't an either-or proposition.  Increasing your viral coefficient from 0.5 to 0.8 has other advantages, especially if you integrate it with a paid strategy.  Let's see how.
&lt;/p&gt;

&lt;h3&gt;A Hybrid Strategy&lt;/h3&gt;
&lt;p&gt;
Say you're building a game on Facebook backed by a virtual currency.  Users give you money or fill out offers to get coins, so you have a positive ARPU.  This means you're free to pursue a paid acquisition strategy.
&lt;/p&gt;

&lt;p&gt;
On the other hand, you're on Facebook, which has many viral hooks.  Many of the technical hurdles are much smaller there, so it becomes more a question of design and optimization rather than implementation.  At the very least you encourage players to invite their friends.
&lt;/p&gt;

&lt;p&gt;
The first version of your application has a viral coefficient of k=0.5, that is, every new user who joins brings in 0.5 new users.  Equivalently, for every two users you acquire you get one free.
&lt;/p&gt;

&lt;p&gt;
That's interesting, especially if you're also &lt;em&gt;paying&lt;/em&gt; for users.  If you're paying $1.50 per user then you paid $3.00 to get two users, but acquired a third user for free!  This means that you effectively paid $1.00 per user: $3.00 paid / 3 users. 
&lt;/p&gt;

&lt;p&gt;
This process is actually geometric, however.  If you purchased 4 users with a viral coefficient of 0.5, you'd first get 2 new users for free.  These 2 new users would then bring in 1 additional user, for a total of 3 new users, reducing your cost of acquisition even further.  This is an infinite geometric series, which I'll outline below.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Having a non-zero viral coefficient reduces your effective cost of acquisition.&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
This means that you get more users for every dollar you spend on ads.  Or, if you've run out of inventory, this means you can now spend more per user and retain the same net revenue level.
&lt;/p&gt;

&lt;p&gt;
Let's figure out how to calculate your &lt;em&gt;effective&lt;/em&gt; cost per acquisition.
&lt;/p&gt;

&lt;h3&gt;Effective Cost of Acquisition&lt;/h3&gt;
&lt;p&gt;
If you just want the formula, here it is:
&lt;/p&gt;
&lt;div class="latex math"&gt; C' = C(1-k)&lt;/div&gt;
&lt;p&gt;
Where C is your cost of acquisition, k is your viral coefficient, and C' is your effective cost of acquisition.
&lt;/p&gt;
&lt;p&gt;
If k=0, i.e., you have no viral acquisition, then C' = C.
&lt;/p&gt;
&lt;p&gt;
If k = 1 and your application is viral then C' = 0 and your application grows without spending any additional money.  But rather than being an either-or proposition &amp;mdash; you're either viral or you're not &amp;mdash; there's a sliding scale.  The more viral you are the cheaper it is to acquire users.
&lt;/p&gt;
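&lt;p&gt;
The formula is two lines of Python, and we can sanity-check it against the geometric-series view of the process (the $1.50 and k = 0.5 figures are just the illustrative numbers from above):
&lt;/p&gt;

```python
def effective_cpa(cost, k):
    """Effective cost of acquisition: C' = C * (1 - k), for 0 <= k < 1."""
    if not 0 <= k < 1:
        raise ValueError("formula assumes a sub-viral coefficient, 0 <= k < 1")
    return cost * (1 - k)

def expected_users_per_paid_user(k, steps=60):
    """Partial sum of 1 + k + k^2 + ...; converges to 1/(1-k) for k < 1."""
    return sum(k ** i for i in range(steps))

# Paying $1.50 per user with k = 0.5 works out to $0.75 per user effectively,
# because each paid user ultimately brings in one extra user on average.
print(effective_cpa(1.50, 0.5))              # -> 0.75
print(expected_users_per_paid_user(0.5))     # converges to 2.0
```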

&lt;h3&gt;The Benefits of Being Almost Viral&lt;/h3&gt;
&lt;p&gt;
From the above formula, if you have a viral coefficient of 0.90 then you have reduced your cost of acquisition by 90%.  This is a great situation to be in.  You might ask, "Why wouldn't I want to go the last 0.10 and make my application viral?"
&lt;/p&gt;

&lt;p&gt;
The one benefit of being viral is huge growth, which looks sexy on a graph and can tip an investor to your side if you're looking for outside money, but the growth is unpredictable.  Not only do you have little control over what demographics come to dominate your application, but sometimes it grows so quickly that you run into operational problems (servers on fire, etc.).
&lt;/p&gt;

&lt;p&gt;
By being almost viral you can grow very cheaply, control your rate of growth and demographics, and get enough traffic to conduct meaningful &lt;a href="/article/scientific-product-development"&gt;experiments&lt;/a&gt;.  Need to grow more slowly?  Just decrease your daily ad spend.  Need statistically significant results more quickly?  Increase your daily ad spend.
&lt;/p&gt;

&lt;p&gt;
Put another way, with a viral coefficient of 0.9 you've dealt with your acquisition risk.  Rather than going fully viral and dealing with the operational difficulties, it might be worth your time to deal with other market risks: retention, engagement, and monetization.
&lt;/p&gt;

&lt;p&gt;
So, stop sweating about "being viral."  Sometimes it's better to be almost viral.
&lt;/p&gt;

&lt;h3&gt;Deriving the Formula for Effective Cost of Acquisition&lt;/h3&gt;

&lt;p&gt;
You can skip this and go right to the comments if you're not interested in the math.
&lt;/p&gt;

&lt;p&gt;
We have an application with a viral coefficient of k = 0.5.  Every new user who joins brings in 0.5 new users.  Another way of thinking of it is that for every new user who joins there is a 50% probability that he will bring in another user.
&lt;/p&gt;

&lt;p&gt;
But this potential user also has a 50% chance of bringing in a new user, so the expected number of users is now 1 + 0.5 + 0.5*0.5.  This continues &lt;em&gt;ad infinitum&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Formally, if we acquire one user and have a viral coefficient of k then the number of users we expect to join is N(k), given by the formula
&lt;/p&gt;

&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{N(k) = 1 + k + k^2 + k^3 + \cdots = \sum_{i = 0}^{\infty} k^i}&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
The initial 1 is our original user, whom we paid for, and each k represents the expected number of users from each step in the viral process.
&lt;/p&gt;

&lt;p&gt;
This is a &lt;a href="http://en.wikipedia.org/wiki/Geometric_series"&gt;geometric series&lt;/a&gt;, so we know that
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{N(k) = \frac{1}{1-k}}&lt;/div&gt;
&lt;/div&gt;
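&lt;p&gt;
As a quick sanity check, here is a minimal Python sketch (the function names are my own, purely for illustration) that sums the series numerically and confirms the closed form:
&lt;/p&gt;

```python
def expected_users(k, steps=1000):
    """Partial sum of the viral series 1 + k + k^2 + ...: total expected users per paid user."""
    return sum(k ** i for i in range(steps))

def effective_cac(cost, k):
    """Effective cost of acquisition, C' = C * (1 - k), valid for k < 1."""
    return cost * (1 - k)

# With k = 0.5 the series converges to 1 / (1 - 0.5) = 2 expected users per
# paid user, so a $1.00 cost of acquisition effectively becomes $0.50.
print(expected_users(0.5))       # approaches 2.0
print(effective_cac(1.00, 0.5))  # 0.5
```

&lt;p&gt;
With k = 0.9 the same functions give 10 expected users per paid user, i.e., a 90% reduction in effective cost.
&lt;/p&gt;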

&lt;p&gt;
Therefore, our effective cost of acquisition is
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;div class="latex math"&gt;\displaystyle{C' = \frac{C}{N(k)} = \frac{C}{\frac{1}{1-k}} = C(1-k)}&lt;/div&gt;
&lt;/div&gt;</description>
      <pubDate>Wed, 15 Apr 2009 09:00:54 +0000</pubDate>
      <link>http://20bits.com/article/almost-viral-a-hybrid-acquisition-strategy</link>
    </item>
    <item>
      <title>Social Applications are Social Networks</title>
      <description>&lt;p&gt;
Are all social applications also social networks?  Dave McClure made a passing reference to this a little over a year ago, saying "RockYou &amp; Slide [are] arguably social networks of their own."&lt;span class="footnote"&gt;&lt;a href="http://500hats.typepad.com/500blogs/2007/11/google-open-soc.html"&gt;Google Open Social + Friends vs. Facebook Platform&lt;/a&gt;&lt;/span&gt;  I want to make the stronger claim: social applications are always social networks.
&lt;/p&gt;

&lt;p&gt;
It doesn't matter how large you are, it doesn't matter what your goals are, and it doesn't matter what your product is.  I think if you're building a social application then you're trying to build a new social network.  As we'll see, this has both strategic and technical implications.
&lt;/p&gt;

&lt;h3&gt;What is a Social Network?&lt;/h3&gt;
&lt;p&gt;
First, if I'm going to convince you that something is a social network we should understand what a social network is. If you ask a person to name a few social networks, they will probably list services like Facebook, MySpace, and Twitter.  And if an investor tells you they're "not investing in social networks," they mean it in this concrete, social-network-as-a-product sense.
&lt;/p&gt;

&lt;p&gt;
Others, like Brad Fitzpatrick and Mark Zuckerberg, use the term &lt;em&gt;social graph&lt;/em&gt;&lt;span class="footnote"&gt;See, e.g., &lt;a href="http://bradfitz.com/social-graph-problem/"&gt;Thoughts on the Social Graph&lt;/a&gt;&lt;/span&gt; to distinguish between the underlying social relations between people and the services, called social networks, that are built on top of them.
&lt;/p&gt;
&lt;p&gt;
But if there's one thing I learned from my mathematics education it's this: we're free to define things however we want so long as they're consistent.  Therefore we ought to choose the definition that helps us get our job done.
&lt;/p&gt;

&lt;p&gt;
So, here is my first, and most abstract definition: &lt;blockquote&gt;A social network is a collection of people bound together through a specific set of social relations.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;!-- Let's see if anyone makes me define social relation! --&gt;

&lt;p&gt;
By "social relation" I mean a connection between people that permits the exchange of information.  This prevents artificial relations like "Alex and James are connected if they have the same hair color."
&lt;/p&gt;

&lt;p&gt;
When I say "social network" I always mean the actual collection of people.  Facebook is a social network.  There are actual people engaged with the site, creating relationships, sharing information, and doing all the things they'd do in "real life."  Or, put another way: a family is a social network, a family tree is not.&lt;span class="footnote"&gt;&lt;a href="http://www.artinthepicture.com/artists/Rene_Magritte/pipe.jpeg"&gt;Ceci n'est pas un Social Network&lt;/a&gt;&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
If you don't like the above definition I can give you a functional one which I believe is equivalent. &lt;blockquote&gt;A collection of people is a social network if and only if it is possible for something to spread virally through that collection.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
In Web 2.0 speak, a "social network" is a collection of people over which you can "go viral".  I believe that virality and social networks are fundamentally linked, and that both the above definitions are equivalent.
&lt;/p&gt;

&lt;h3&gt;Social Applications are Social Networks&lt;/h3&gt;
&lt;p&gt;
Accepting the above definitions, even if for the sake of argument, I don't think it's too hard to see why social applications are social networks. Let's take Slide's &lt;a href="http://www.facebook.com/apps/application.php?id=2425101550"&gt;Top Friends&lt;/a&gt; as an example.  Is Top Friends a social network in its own right?
&lt;/p&gt;

&lt;p&gt;
I think it's easier to see that Top Friends meets the first definition.  It is certainly a collection of people: the set of Facebook users who have installed the application.  Are those people bound by specific social relations?  Yes, and those relations are distinct from the ones represented in Facebook.  For example, Alex adding James as a top friend is a social signal distinct from Facebook.
&lt;/p&gt;

&lt;p&gt;
What about the second definition?  Top Friends doesn't have an external API so it's impossible to build apps or plugins for Top Friends.&lt;span class="footnote"&gt;For all I know Slide has an internal Top Friends API that lets them build new services that ride on Top Friends' success, but that's only &lt;a href="http://api.topfriends.com/"&gt;speculation&lt;/a&gt;.&lt;/span&gt;  So, what "goes viral" over Top Friends?  New features and patterns of usage do.&lt;span class="footnote"&gt;This is the essence of &lt;a href="http://startuplessonslearned.blogspot.com/2008/12/engagement-loops-beyond-viral.html"&gt;engagement loops&lt;/a&gt;.  Eric Ries talks about going "beyond viral."  There is no "beyond viral."  Rather, on social networks viral processes govern the whole stack: acquisition, retention, engagement, and monetization.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
I'd also argue that the converse is true: social networks are all social applications.  YouTube spread through MySpace, Facebook spread through email, email spread through the real-life "social graph", and PayPal spread through eBay.&lt;span class="footnote"&gt;Slide is to Facebook as Paypal was to eBay.  Anyone buy it?&lt;/span&gt; All social networks are social applications built off of pre-existing social networks.
&lt;/p&gt;

&lt;h3&gt;Strategic Implications&lt;/h3&gt;
&lt;p&gt;
If Top Friends is a social network in its own right then there are strategic implications for Facebook. &lt;em&gt;Prima facie&lt;/em&gt;, Top Friends is competing with Facebook for users' attention on its own platform.  Before Facebook launched the Platform it was the Eye of Providence, collecting, collating, and analyzing every bit of activity that occurred on its network.
&lt;/p&gt;

&lt;p&gt;
After the Platform launched these third parties were able to infect portions of Facebook's network.  In some cases, e.g., the Causes application, the relationship was symbiotic.  In others, e.g., Top Friends, the relationship was antagonistic, with Facebook actually shutting down Top Friends at one point.&lt;span class="footnote"&gt;See &lt;a href="http://www.techcrunch.com/2008/06/26/did-facebook-shut-down-slides-top-friends-how-very-myspace-of-them/"&gt;this TechCrunch article&lt;/a&gt;.&lt;/span&gt;  
&lt;/p&gt;

&lt;p&gt;
What does Facebook gain by having Top Friends on its Platform?  Nothing substantial, as far as I can tell.  What does it lose?  Control and insight over the activities of its userbase.&lt;span class="footnote"&gt;There's a broader argument that ceding control in this way is the right strategic move, but Facebook is not there yet &amp;mdash; the limit of that argument is something like OpenSocial.&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
In effect, Top Friends is a social network bootstrapped off of Facebook, with its own set of communication channels over which Facebook has no authority or insight.  This tension is present everywhere in the Platform because application developers' interests are not wholly aligned with Facebook's and will probably never be.
&lt;/p&gt;

&lt;h3&gt;Technical Implications&lt;/h3&gt;
&lt;p&gt;
I'm going to save the technical implications for another article, but it boils down to this: social networks in the sense that I defined above are fairly well understood.  I believe the techniques used on the web today to grow "viral" applications are underpinned by research from fields like social network analysis and epidemiology.
&lt;/p&gt;

&lt;p&gt;
Since I believe social applications and social networks are synonymous, we can better understand how these applications grow by understanding how social networks grow.
&lt;/p&gt;

&lt;p&gt;
In the meantime, I recommend reading &lt;em&gt;&lt;a href="http://www3.interscience.wiley.com/journal/118986267/abstract"&gt;The Statistical Evaluation of Social Network Dynamics&lt;/a&gt;&lt;/em&gt; by Tom A. B. Snijders from the University of Groningen if you're interested in the technical aspects of social networks and social applications.
&lt;/p&gt;

&lt;p&gt;
And please, leave a comment if you have any thoughts about the above!
&lt;/p&gt;</description>
      <pubDate>Thu, 09 Apr 2009 08:00:51 +0000</pubDate>
      <link>http://20bits.com/article/social-applications-are-social-networks</link>
    </item>
    <item>
      <title>Where the iTunes Store Fails: Community</title>
      <description>&lt;p&gt;
You don't need me to tell you that the iTunes Store has changed the face of music distribution, digital or otherwise.  In April of 2008 it became the top music retailer in the US&lt;span class="footnote"&gt;&lt;a href="http://www.apple.com/pr/library/2008/04/03itunes.html"&gt;iTunes Store Top Music Retailer in the US&lt;/a&gt;&lt;/span&gt; and passed 6 billion songs downloaded earlier this year&lt;span class="footnote"&gt;&lt;a href="http://www.techcrunch.com/2009/01/06/itunes-sells-6-billion-songs-and-other-fun-stats-from-the-philnote/"&gt;iTunes Sells 6 Billion Songs, And Other Fun Stats From The Philnote&lt;/a&gt;&lt;/span&gt;.  That's almost one song downloaded for every person on the planet.
&lt;/p&gt;

&lt;p&gt;
For music startups iTunes figures into most strategic decisions.  If you're streaming music for free to consumers you're going to be an iTunes affiliate&lt;span class="footnote"&gt;Both imeem and Last.FM are, for example&lt;/span&gt;.  If you're selling music to consumers you're going to be competing directly with iTunes &amp;mdash; consumers have no reason to get their music from both you and iTunes if you both have it.
&lt;/p&gt;
&lt;p&gt;
It's understandable if your heart skips a beat when you catch a rumor that Apple will be building a similar product.  How can you maneuver in this environment?
&lt;/p&gt;


&lt;h3&gt;Finding Room to Breathe&lt;/h3&gt;
&lt;p&gt;
The iTunes Store is a lot like Wal-Mart: ubiquitous&lt;span class="footnote"&gt;Who did Apple pass as the top music retailer?  Wal-Mart&lt;/span&gt;, highly integrated, and bland.  People shop there because it's easier, not because it's sexier, even though in other areas Apple is very good at selling precisely that hip, sexy lifestyle.
&lt;/p&gt;

&lt;p&gt;
But Wal-Mart's strategy isn't the only strategy out there.  Companies like Whole Foods can still thrive in their niche even though people can get cheaper food at Wal-Mart.  Where is the Whole Foods of digital music?  Does such a thing even make sense?
&lt;/p&gt;

&lt;h3&gt;Building a Community&lt;/h3&gt;
&lt;p&gt;
Corner record stores are about more than just the transaction.  They attract a certain crowd and embrace a certain culture.  &lt;a href="http://en.wikipedia.org/wiki/High_Fidelity_(film)"&gt;High Fidelity&lt;/a&gt; is a good example of this on film.
&lt;/p&gt;

&lt;p&gt;
iTunes misses out on the cultural and communal aspects of music altogether.  It's very sterile.  It's also a terrible means of &lt;em&gt;discovering&lt;/em&gt; new music, a role which traditional record stores can fulfill. 
&lt;/p&gt;

&lt;p&gt;
As an example, say you're really into electronica.   What good are the reviews on the iTunes music store to you?  They're left by idiots who don't know Aphex Twin from Paul van Dyk.  You go there when you know what you want and leave the second you have it&lt;span class="footnote"&gt;This is a problem with iTunes in general.  It's the last step in your marketing campaign, not the first.  See my article &lt;a href="/article/the-099-app-store"&gt;The $0.99 (App) Store&lt;/a&gt;.&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;
And knowing iTunes, they might not even have music from your favorite bands if they're obscure enough.
&lt;/p&gt;

&lt;p&gt;
Instead, imagine a hub of engaged electronica fans with a custom, electronica-centric store&lt;span class="footnote"&gt;Of course, you can break down genres into subgenres, and so forth.  Maybe the sweet spot is having an ambient store, a trance store, a house store, a D&amp;B store, etc.&lt;/span&gt;.  The community itself spurs demand for its own store due to its reputation for quality electronica-related recommendations.
&lt;/p&gt;

&lt;p&gt;
Improved discovery, better quality merchandise, exclusive deals with bands, and a community of like-minded people are just a few reasons why people might prefer to shop at a genre-specific music store rather than iTunes if they're forced to choose.
&lt;/p&gt;

&lt;p&gt;
Plus, an independent store is freer to experiment with payment models, distribution methods, and marketing campaigns.  This might interest bands who see iTunes as a love-it-or-leave-it environment controlled from top to bottom by Apple.
&lt;/p&gt;

&lt;h3&gt;Will This Work?&lt;/h3&gt;
&lt;p&gt;
I don't know if this will work, but I think it's a reasonable enough &lt;a href="/article/scientific-product-development"&gt;hypothesis to test&lt;/a&gt;.  This is just one possible strategy for building a music product and probably has several flaws I haven't thought through.  Leave a comment and let me know your thoughts.
&lt;/p&gt;

&lt;p&gt;
May a thousand music stores bloom!
&lt;/p&gt;

&lt;h3&gt;Update&lt;/h3&gt;
&lt;p&gt;
&lt;a href="http://adam.blog.heroku.com/"&gt;Adam from Heroku&lt;/a&gt; pointed me towards &lt;a href="https://www.beatport.com/"&gt;Beatport&lt;/a&gt;, which has been pursuing this exact strategy for the last few years.&lt;/p&gt;
&lt;p&gt;
After a little digging I found a few others, too.  &lt;a href="http://www.insound.com/"&gt;Insound.com&lt;/a&gt; for indie music and &lt;a href="http://mondomix.com/"&gt;Mondomix&lt;/a&gt; for world music.  I also know of one stealth startup pursuing this strategy for another genre.  Do you know of any others?  How well does this strategy perform?
&lt;/p&gt;

&lt;p&gt;
In the limit you can imagine a "Ning for iTunes Stores," where the costs of implementing the store are shared but the community-building aspects are left to the company.
&lt;/p&gt;</description>
      <pubDate>Mon, 06 Apr 2009 08:45:04 +0000</pubDate>
      <link>http://20bits.com/article/where-the-itunes-store-fails-community</link>
    </item>
    <item>
      <title>When in Rome: Newcomers on Facebook</title>
      <description>&lt;p&gt;
A &lt;a href="http://zellunit.com"&gt;teammate&lt;/a&gt; of mine recently sent me a link to a paper called "&lt;a href="http://www.thoughtcrumbs.com/publications/paper0778-burke.pdf"&gt;Feed Me: Motivating Newcomer Contribution in Social Network Sites&lt;/a&gt;" and I thought it was worth discussing.  The paper was jointly authored by &lt;a href="http://www.cs.cmu.edu/~mkburke/"&gt;Moira Burke&lt;/a&gt;, a PhD student at Carnegie Mellon, and &lt;a href="http://overstated.net/"&gt;Cameron Marlow&lt;/a&gt; and Thomas Lento, two research scientists at Facebook.
&lt;/p&gt;

&lt;h3&gt;The Chicken and the Egg&lt;/h3&gt;
&lt;p&gt;
The root question addressed in the paper is this: &lt;em&gt;what motivates newcomers to contribute to social networks?&lt;/em&gt;  For social networking sites, getting users to contribute is one of the primary problems, right after acquiring users in the first place.
&lt;/p&gt;

&lt;p&gt;
Let's dive right in and look at their hypotheses, methodology, and conclusions.
&lt;/p&gt;

&lt;h3&gt;Hypotheses&lt;/h3&gt;

&lt;p&gt;
The authors took all users who joined on a random weekday in March 2008 &amp;mdash; amounting to about 140,000 users &amp;mdash; and tried to predict their long-term sharing habits based on the experiences they have in the first two weeks.  Specifically, they looked at how users interacted with photos.
&lt;/p&gt;

&lt;p&gt;
The paper outlines four hypotheses:
&lt;ol&gt;
&lt;li&gt;Social learning: Newcomers whose friends share more content will go on to contribute more content themselves.&lt;/li&gt;
&lt;li&gt;Singling out: Newcomers who are singled out in content will contribute more content.&lt;/li&gt;
&lt;li&gt;Feedback: Newcomers receiving more feedback on their initial content will go on to contribute more content.&lt;/li&gt;
&lt;li&gt;Distribution: Newcomers whose initial content is distributed widely will go on to contribute more content.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;h3&gt;Conclusion: When in Rome...&lt;/h3&gt;
&lt;p&gt;
The authors also broke down the newcomers into two categories, early uploaders and non-early uploaders, depending on whether or not they uploaded more than one photo in the first two weeks.
&lt;/p&gt;

&lt;p&gt;
The two factors that correlated with long-term photo sharing for early uploaders were whether your friends were also sharing photos in your first two weeks, and whether people commented on your photos.  Surprisingly "singling out," i.e., getting tagged in photos, had no &lt;a href="/article/hypothesis-testing-the-basics"&gt;statistically significant&lt;/a&gt; effect.
&lt;/p&gt;

&lt;p&gt;
Singling out, however, did work for non-early uploaders, suggesting that people can be cajoled into uploading photos by tagging them, but that people already uploading photos to Facebook won't upload any more than they were before.
&lt;/p&gt;

&lt;p&gt;
In short, newcomers are susceptible to peer pressure.
&lt;/p&gt;

&lt;h3&gt;How is this useful?&lt;/h3&gt;
&lt;p&gt;
The upside to this paper is that it gives a clear picture of what is worth measuring.  Getting a user to upload a photo doesn't just mean one more photo on the site &amp;mdash; some percentage of their friends will upload a photo, too.
&lt;/p&gt;

&lt;p&gt;
What's more, you can enter into a sort of feedback loop.  The paper didn't address whether "social learning" also correlated with increasing auxiliary activities like feedback, but imagine this: more photos uploaded means more comments, which in turn means more photos.  Is it possible to make this cycle self-sustaining?
&lt;/p&gt;

&lt;p&gt;
The downside is that this doesn't help with the chicken-and-egg problem.  What happens when a user comes to the site and they have no friends?  There are some public spaces on Facebook, but most social networking of that kind is dominated by interactions among friends.
&lt;/p&gt;

&lt;p&gt;
Overall, this is one of the most detailed papers analyzing data from a huge social network.  Leave a comment and let me know your thoughts, especially if you know of any other papers of this kind!
&lt;/p&gt;</description>
      <pubDate>Mon, 19 Jan 2009 03:23:04 +0000</pubDate>
      <link>http://20bits.com/article/when-in-rome-newcomers-on-facebook</link>
    </item>
    <item>
      <title>The $0.99 (App) Store</title>
      <description>&lt;img src="http://assets.20bits.com/20081210/iphone_home.gif" alt="" title="iPhone" width="150" height="250" class="alignleft size-medium wp-image-435" style="float: left"/&gt;
&lt;p&gt;
I was going to hold off writing this article, but after reading this &lt;a href="http://www.macblogz.com/2008/12/09/iphone-developer-writes-personal-letter-to-steve-jobs/"&gt;open letter to Steve Jobs from an iPhone developer&lt;/a&gt; I just couldn't.
&lt;/p&gt;

&lt;p&gt;
Mr. Hockenberry isn't the first to argue that iPhone apps are &lt;a href="http://blogs.oreilly.com/iphone/2008/06/what-should-iphone-application.html"&gt;too cheap&lt;/a&gt;.  So, what gives?
&lt;/p&gt;

&lt;h3&gt;Marketing v. Distribution&lt;/h3&gt;
&lt;p&gt;
The problem is that the App Store is a distribution channel (and a very good one, at that), but developers are using it as their primary means of marketing.  Distribution and marketing aren't one and the same, and this tension is why developers are feeling pinched.
&lt;/p&gt;

&lt;p&gt;
Distribution is the "how," as in, how do you get your product to your customer?  Wal-Mart, Target, and your favorite mom-and-pop store are distribution channels.  Malls are a way of aggregating distribution channels and amortizing the fixed costs.
&lt;/p&gt;

&lt;p&gt;
Marketing is the "why," as in, why do your customers want to buy your product?  Marketing channels like TV ads, direct marketing, etc. are about getting your message in front of your customers and convincing them they should buy your product.
&lt;/p&gt;

&lt;p&gt;
For iTunes apps the only significant distribution channel is the app store itself.  Unfortunately, the primary marketing channel is getting your app on one of the featured lists on the front page of the app store.
&lt;/p&gt;

&lt;h3&gt;Why Apps Are Cheap&lt;/h3&gt;
&lt;p&gt;
Here's a thought experiment. Pretend Borders is the only book store in the world and that they put their best-selling books closer to the front.  10,000 people wander in and out of Borders every day.  People are five times as likely to buy the books out front versus the ones in the back.
&lt;/p&gt;

&lt;p&gt;
Now imagine there is only one prime spot and two books that share the same &lt;a href="http://en.wikipedia.org/wiki/Demand"&gt;demand curve&lt;/a&gt;.  What happens?  The one that has the lower price gets a 5x boost in sales, so each publisher tries to undercut the other until they're both priced at near-zero.
&lt;/p&gt;
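&lt;p&gt;
The undercutting race can be written as a toy simulation (entirely my own model of the thought experiment, not anything from the App Store itself): whichever book is cheaper holds the prime spot, so the other publisher repeatedly cuts its price a cent lower, and the loop only ends at the floor:
&lt;/p&gt;

```python
def undercut_war(price_a, price_b, step=0.01, floor=0.0):
    """Two publishers alternately undercut each other for the single prime spot."""
    while min(price_a, price_b) > floor:
        if price_a >= price_b:
            # A has lost the prime spot (and its 5x sales boost), so A undercuts B
            price_a = max(price_b - step, floor)
        else:
            # B has lost the prime spot, so B undercuts A
            price_b = max(price_a - step, floor)
    return price_a, price_b

print(undercut_war(9.99, 9.99))  # both prices race down to (or just above) zero
```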

&lt;p&gt;
And if one of them is willing to put ads in their book, well, they're happy pricing their book at zero from the start.
&lt;/p&gt;

&lt;p&gt;
This is the app store as it stands now, more or less.  Marketing is about stimulating demand.  App developers are confusing the app store with a marketing channel, and the only way to stimulate demand in that environment is to violently slash prices.
&lt;/p&gt;

&lt;h3&gt;Beyond the App Store&lt;/h3&gt;
&lt;p&gt;
There are two ways to increase demand: by lowering your price, or by marketing.
&lt;/p&gt;

&lt;p&gt;
The app store is a store just like Borders or Wal-Mart.  They make their money by distributing lots of goods that other people make and taking a cut, so of course they give premium spots to the apps that sell the most.
&lt;/p&gt;

&lt;p&gt;
App developers, however, are acting like the people in the store are the only people in the world.  The only way to stimulate demand is to lower the price and hope for that premium spot.
&lt;/p&gt;

&lt;p&gt;
Instead developers should look for creative ways to stimulate demand outside the app store itself.  Lower prices aren't what convince you to buy Beyonce's new album at the record store, it's the other way around. Beyonce's multi-million dollar marketing campaign is what convinces you to go into the record store in the first place.
&lt;/p&gt;

&lt;p&gt;
The first iPhone developer to capitalize on this will make a big splash and reverse the $0.99 App Store trend.  Just remember to link to this article when they do.
&lt;/p&gt;</description>
      <pubDate>Wed, 10 Dec 2008 01:23:05 +0000</pubDate>
      <link>http://20bits.com/article/the-099-app-store</link>
    </item>
    <item>
      <title>The Dangers of Genetic Optimization</title>
      <description>&lt;p&gt;
The guys at &lt;a href="http://weebly.com"&gt;Weebly&lt;/a&gt; just had a round of press for their latest product, &lt;a href="http://snapads.com"&gt;SnapAds&lt;/a&gt;, an ad optimization platform that uses genetic algorithms.  The technology is very cool, so check it out.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://ejohn.org/blog/genetic-ab-testing-with-javascript/"&gt;John Resig&lt;/a&gt; then posted a link to &lt;a href="http://demo.genetify.com/"&gt;Genetify&lt;/a&gt;, the previous incarnation of this technology, which uses genetic algorithms to optimize your website at large by "evolving" your stylesheets.
&lt;/p&gt;

&lt;h3&gt;The Black Box of Genetic Algorithms&lt;/h3&gt;

&lt;p&gt;
SnapAds is a great application of this technology because the guiding metric function is obvious: total ad revenue.  Since we've talked about &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;A/B testing&lt;/a&gt;, you might wonder why not do this automatically for your website at large and optimize other user behavior?
&lt;/p&gt;

&lt;p&gt;
The answer is what we call "black box testing."  You know the results &amp;mdash; maybe users are 50% more likely to click a certain link &amp;mdash; but you don't understand why.
&lt;/p&gt;

&lt;p&gt;
This is a pitfall of normal A/B and multivariate testing, too.  You put up an experiment, measure the outcomes, and pick the one that performs the best according to the metrics that matter.  Bingo bango.
&lt;/p&gt;

&lt;p&gt;
And hey, if you automate the optimization step with something like genetic algorithms, you don't even need to do this.  The machine makes the decision for you!
&lt;/p&gt;

&lt;h3&gt;Analysis Matters&lt;/h3&gt;

&lt;p&gt;
The problem with black box testing &amp;mdash; when you understand the outcome but not the underlying cause &amp;mdash; is that there's no learning.  Analysis matters.  Customer insight matters.
&lt;/p&gt;

&lt;p&gt;
If you're only doing black box testing you don't really understand your customers.  You're just blindly following the dictates of whatever algorithm you've set up.
&lt;/p&gt;

&lt;p&gt;
Your customers might be buying more now, but can you apply that knowledge to your next product?
&lt;/p&gt;</description>
      <pubDate>Thu, 04 Dec 2008 06:00:46 +0000</pubDate>
      <link>http://20bits.com/article/the-dangers-of-genetic-optimization</link>
    </item>
    <item>
      <title>The Cult of the Product</title>
      <description>&lt;p&gt;
In the movies you can build a baseball stadium in an Iowa cornfield and get millions of people to show up.  Who wouldn't want to see the ghost of Mickey Mantle play another game?  In real life there are millions of details that go into constructing a baseball stadium, not the least of which are having a team and fans ready to fill it from day one.
&lt;/p&gt;

&lt;p&gt;
The first scenario might be more romantic, but if you're looking to be a baseball mogul you'd better be operating in the second.
&lt;/p&gt;

&lt;p&gt;
The same is true of your consumer internet venture.  Most hackers and entrepreneurs spend their time thinking about just the product.  "If we just build the most awesomest widget possible," they think, "people will love it and give us money."  Product first, everything else second.
&lt;/p&gt;

&lt;h3&gt;The Cult of the Product&lt;/h3&gt;
&lt;p&gt;
This is the sentiment that embodies what I call the "cult of the product."  Like the Field of Dreams, people in this mindset believe that product is the most important thing and if they build it customers will come flocking.
&lt;/p&gt;

&lt;p&gt;
It's hard to blame anyone &amp;mdash; these signals are everywhere in the technology industry.
If you take Apple's public image at face value, for example, you'd believe that every product idea that comes out of the company springs fully formed from the head of Steve Jobs himself.
&lt;/p&gt;

&lt;img src="http://assets.20bits.com/20081202/cult.jpg" alt="" title="cult" width="300" height="238" class="math size-medium wp-image-403" /&gt;

&lt;p&gt;
This is a carefully crafted illusion.  In reality Apple has one of the most refined (and most well-guarded) design processes in the industry. If Steve Jobs is the heart of the company their design process is the blood.
&lt;/p&gt;

&lt;p&gt;
This process helps them identify market needs and build the product that best satisfies that need.  The function of a design process is to increase the value of the end-product and reduce the risk of shipping it. 
&lt;/p&gt;

&lt;p&gt;
Here's a great video of Steve Jobs discussing product strategy while he was still at NeXT:
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/p9dmcRbuTMY&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/p9dmcRbuTMY&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/div&gt;
&lt;h3&gt;The Components of a Successful Product&lt;/h3&gt;
&lt;p&gt;
In order to succeed every company needs three components: a market, a product, and a distribution channel.  You'll notice Steve Jobs talks about all three in the above video.
&lt;/p&gt;

&lt;p&gt;
Most aspiring internet entrepreneurs think a lot about product and a little about the market.  Their distribution strategy, however, often amounts to little more than "get mentioned on TechCrunch a lot."
&lt;/p&gt;

&lt;p&gt;
Think back to the most successful consumer internet startups: Google, Craigslist, MySpace, YouTube, Facebook, etc.  How many of them needed TechCrunch-style press to drive growth?  None.  And that's not just "luck" &amp;mdash; each had a distribution strategy built in from day one.
&lt;/p&gt;

&lt;p&gt;
In fact, building a web product, like any product, requires you to think about all of these things.  A product that fits a market is useless if you have no way to get it in front of customers, and even the best distribution model can't help if nobody wants to use the product you're distributing.
&lt;/p&gt;

&lt;h3&gt;Vanquishing the Cult&lt;/h3&gt;
&lt;p&gt;
Apple analyzes markets through a top-down, design-centric process, but we can flip this around and apply a bottom-up, data-driven approach.
&lt;/p&gt;

&lt;p&gt;
In fact, we can use the same principles of &lt;a href="/article/scientific-product-development"&gt;scientific product development&lt;/a&gt; to reason about business strategy.
&lt;/p&gt;

&lt;p&gt;
For example, we might start with the hypothesis that 1,000,000 people are willing to pay $50/month for your product.
&lt;/p&gt;

&lt;p&gt;
To test this you need to get real bullets in your gun as fast as possible.  This means talking to potential customers and getting their feedback, implementing simple prototypes and measuring their performance, etc.  Put your product and product ideas through the most rigorous process, using a combination of qualitative and quantitative feedback.
&lt;/p&gt;

&lt;p&gt;
Then, using the data you gathered from your measurements and tests, iterate.  It might turn out that nobody was willing to pay more than $20/month for a simplified version of your product.  This data lets you form new hypotheses.
&lt;/p&gt;

&lt;p&gt;
Did you have the right product/market fit?  Were you approaching the right customers?  Should you lower your price?  Should you improve the product?  Should you have a different pricing scheme entirely?  Each of these questions is itself a testable hypothesis and can be approached through an empirical, data-driven process.
&lt;/p&gt;

&lt;p&gt;
The important point is to have a process, whether design-centric or data-driven, that helps you identify the key market, product, and distribution challenges of your business.
&lt;/p&gt;

&lt;p&gt;
Sacrificing even a little bit of your business to the cult of the product is an unnecessary risk.
&lt;/p&gt;</description>
      <pubDate>Tue, 02 Dec 2008 06:00:03 +0000</pubDate>
      <link>http://20bits.com/article/the-cult-of-the-product</link>
    </item>
    <item>
      <title>Three Myths of Viral Growth</title>
      <description>&lt;h3&gt;Myth 1: Viral Growth is Exponential Growth&lt;/h3&gt;
&lt;p&gt;
Viral growth isn't exponential growth.  Your web product has a maximum audience, but an exponential curve grows forever.  Instead, viral growth follows a logistic curve.
&lt;/p&gt;

&lt;p&gt;
The logistic curve comes from population biology, where the growth of a population has an exponential component, e.g., humans on average have 2.1 children each, but is dampened by competition for resources.  If the population grows too fast it eventually reaches the point where its environment can no longer support pure exponential growth.
&lt;/p&gt;

&lt;p&gt;
This upper limit is called the &lt;strong&gt;carrying capacity&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
Seth Godin misses this in his otherwise excellent &lt;a href="http://sethgodin.typepad.com/seths_blog/2007/08/elephant-math.html"&gt;Elephant Math&lt;/a&gt; essay.  He equates "perfect viral growth" with exponential growth, but no viral growth is exponential.  There are two variables at play: the rate of reproduction and the carrying capacity.
&lt;/p&gt;

&lt;p&gt;
Sometimes the carrying capacity is obvious.  If you're building a Facebook app the carrying capacity can't be any larger than the total size of Facebook, for example.  Other times you don't know until you start feeling its effects.
&lt;/p&gt;
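&lt;p&gt;
To make the difference concrete, here's a minimal Python sketch of discrete-time logistic growth.  The growth rate, carrying capacity, and starting audience below are made-up numbers for illustration, not measurements from a real product.
&lt;/p&gt;

```python
# Discrete-time logistic growth: each step the population grows at rate r,
# damped by how close it already is to the carrying capacity K.
# All parameters here are illustrative.

def logistic_growth(r=0.5, K=1_000_000, users=100.0, steps=60):
    history = [users]
    for _ in range(steps):
        users += r * users * (1 - users / K)  # growth stalls as users -> K
        history.append(users)
    return history

curve = logistic_growth()
print(f"step 1:  {curve[1]:,.0f} users")    # ~150: looks exponential early on
print(f"step 60: {curve[-1]:,.0f} users")   # pinned just under K = 1,000,000
```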

&lt;img src="http://assets.20bits.com/20081118/viral-growth.png" alt="" title="" width="450" height="246" class="alignnone size-medium wp-image-378 math" /&gt;&lt;/a&gt;

&lt;p&gt;
This is what happens to every viral product, whether you like it or not.  Viral growth slows and you have to worry about retaining users rather than acquiring new ones.  If you don't your product runs the risk of &lt;a href="http://andrewchenblog.com/2008/03/05/facebook-viral-marketing-when-and-why-do-apps-jump-the-shark/"&gt;jumping the shark&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
And here's the kicker: the faster your app is growing the sooner you have to care about retention because you reach the carrying capacity much more quickly.
&lt;/p&gt;

&lt;h3&gt;Myth 2: Viral Growth is a Marketing Buzzword&lt;/h3&gt;
&lt;p&gt;
Viral growth isn't a marketing buzzword, although marketers do occasionally misuse the term.  It has the potential to happen any time you're in a situation where people can communicate.
&lt;/p&gt;

&lt;p&gt;
In simple terms viral growth happens when a user comes across your product and recommends it to his friends.  If the average user recommends the product to more than one person you get viral growth.
&lt;/p&gt;

&lt;p&gt;
What most people don't understand is that viral growth is also a function of the &lt;strong&gt;viral substrate&lt;/strong&gt;, or the underlying communication medium.  The easier it is to communicate with other people the more likely something is to go viral.
&lt;/p&gt;

&lt;p&gt;
With modern technologies like email, Facebook, SMS, etc., communication is virtually frictionless.  I could send an email to 1,000 people right now if I wanted, or text ten of my best friends simultaneously.  What's more, these channels are actually &lt;em&gt;easier&lt;/em&gt; to measure than word-of-mouth recommendations.
&lt;/p&gt;

&lt;p&gt;
Instead marketers use vague terms like "word of mouth marketing."  But every step in the viral process can and should be measured, and you should use mathematical models (like the logistic growth curve, above) to understand what is really happening.
&lt;/p&gt;

&lt;h3&gt;Myth 3: Viral Growth Can't be Engineered&lt;/h3&gt;
&lt;p&gt;
Viral growth is one of many distribution strategies and that means it can be engineered.  Innovation in distribution might be boring, but it makes the difference between a K-Mart and a Wal-Mart.
&lt;/p&gt;

&lt;p&gt;
Innovation in viral distribution means building and optimizing a &lt;a href="http://venturebeat.com/2007/06/11/q-a-with-rockyou-three-hit-apps-on-facebook-and-counting/"&gt;viral loop&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
Here is an example, taken from KISSMetrics' &lt;a href="http://productplanner.com"&gt;Product Planner&lt;/a&gt;: the &lt;a href="http://productplanner.com/gallery/hug_me/invite_loop" title="Invite Loop"&gt;invite loop&lt;/a&gt; for the &lt;a href="http://productplanner.com/gallery/hug_me" title="Hug Me"&gt;Hug Me&lt;/a&gt; app &amp;mdash; user allows access, user invites friends, friend clicks link, friend views notification, and around again.
&lt;/p&gt;

&lt;p&gt;
Simply put, your viral loop is the series of steps a user goes through before he invites his friends.  Each step in the loop costs you users &amp;mdash; perhaps only 10% of users click the "accept invitation" link, for example.  The &lt;strong&gt;efficiency&lt;/strong&gt; of the loop is the percentage of users who make it all the way through.
&lt;/p&gt;

&lt;p&gt;
One of the fundamental equations of viral growth is &lt;div class="latex math"&gt; k = e\cdot i&lt;/div&gt; where "e" is the efficiency of your loop and "i" is the average number of invites per user.  k is the &lt;strong&gt;viral coefficient&lt;/strong&gt;, or the average number of additional users each new user brings in.  If k &gt; 1 you get viral growth.
&lt;/p&gt;
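&lt;p&gt;
A quick Python sketch makes the relationship concrete.  The per-step conversion rates and invite count below are invented for illustration; in practice you'd measure each one.
&lt;/p&gt;

```python
# k = e * i, where e is the loop's efficiency (the product of each step's
# conversion rate) and i is the average number of invites per user.
# All numbers below are hypothetical.

step_rates = [0.80, 0.50, 0.25]  # e.g. allows access, sends invites, invitee clicks
invites_per_user = 12

e = 1.0
for rate in step_rates:
    e *= rate                    # overall loop efficiency: 0.10

k = e * invites_per_user         # viral coefficient: 1.2

# Each generation of users is k times the last; k > 1 means viral growth.
sizes = [1000.0]
for _ in range(5):
    sizes.append(sizes[-1] * k)

print(f"k = {k:.2f}, cohorts: {[round(s) for s in sizes]}")
```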

&lt;p&gt;
Phrasing it like this makes "getting viral" into an optimization problem, one that you can &lt;a href="/article/statistical-analysis-and-ab-testing"&gt;A/B test&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Measure, Test, Repeat&lt;/h3&gt;
&lt;p&gt;
Viral growth can be measured, tested, and modeled.  It's not a fuzzy marketing term.  And if you can do it right you'll have yourself a difficult-to-top distribution channel for your next product.
&lt;/p&gt;</description>
      <pubDate>Tue, 18 Nov 2008 15:47:16 +0000</pubDate>
      <link>http://20bits.com/article/three-myths-of-viral-growth</link>
    </item>
    <item>
      <title>Statistical Analysis and A/B Testing</title>
      <description>&lt;p&gt;
In this article we're going to talk about how hypothesis testing can tell you whether your &lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B tests&lt;/a&gt; actually affect user behavior, or whether the variations you see are due to random chance.
&lt;/p&gt;


&lt;p&gt;
First, if you haven't yet, read my previous introductory article on &lt;a href="/article/hypothesis-testing-the-basics"&gt;hypothesis testing&lt;/a&gt;.  It explains the statistical principles behind hypothesis testing using the example of a biased coin.  We're going to move quickly beyond that and dive right into A/B testing.
&lt;/p&gt;

&lt;h3&gt;Landing Page Conversion&lt;/h3&gt;
&lt;p&gt;
You're testing a landing page that has a signup form.  You want to test various layouts to try to maximize the percentage of people who sign up.  This percentage is called the "conversion rate," i.e., the rate at which you convert visitors from passersby to customers.
&lt;/p&gt;

&lt;p&gt;
You have a four-way experiment with a control treatment and three experimental treatments.  How you &lt;a href="http://andrewchenblog.com/2008/10/27/how-to-generate-awesome-test-candidates-for-ab-testing/"&gt;pick your treatments&lt;/a&gt; is a subject worth discussing in its own right, but they should try to move the big levers: copy, layout, and size.
&lt;/p&gt;

&lt;p&gt;
For this experiment we'll just call the treatments control, A, B, and C.  You can use your imagination.
&lt;/p&gt;

&lt;h3&gt;Fake Data&lt;/h3&gt;
&lt;p&gt;
Your totally awesome Project X is attracting users.  You've analyzed your sales pipeline and the point with the highest potential impact is the landing page.  You want to increase the landing page conversion rate by at least 20%.
&lt;/p&gt;

&lt;p&gt;
You create an A/B test with four treatments: control, A, B, and C.  Here is the data you collect:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Project X Landing Page&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td class="negative"&gt;19.23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td class="negative"&gt;25.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td class="negative"&gt;14.81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment C&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td class="negative"&gt;32.45%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
From the data both treatments A and C show at least a 20% improvement in the landing page performance, which was our goal.  You might declare Treatment C "good enough," choose it, and move on.  But how do you know the variation isn't due to random chance?  What if instead of 188 visitors treated we only had 10 visitors treated?  Would you still be so confident?
&lt;/p&gt;

&lt;p&gt;
As usual we're aiming for a 95% confidence level.
&lt;/p&gt;

&lt;p&gt;
Hypothesis testing is all about quantifying our confidence, so let's get to it.
&lt;/p&gt;

&lt;h3&gt;The Statistics&lt;/h3&gt;
&lt;p&gt;
Remember, we need to start with a null hypothesis.  In our case, the null hypothesis will be that the conversion rate of the control treatment is no less than the conversion rate of our experimental treatment.  Mathematically
&lt;/p&gt;
&lt;div class="latex math"&gt; H_0: p - p_c \le 0&lt;/div&gt;
&lt;p&gt;
where p&lt;sub&gt;c&lt;/sub&gt; is the conversion rate of the control and p is the conversion rate of one of our experiments.
&lt;/p&gt;

&lt;p&gt;
The alternative hypothesis is therefore that the experimental page has a &lt;em&gt;higher&lt;/em&gt; conversion rate.  This is what we want to see and quantify.
&lt;/p&gt;

&lt;p&gt;
The sampled conversion rates are all normally distributed random variables.  It's just like the coin flip, except instead of heads or tails we have "converts" or "doesn't convert."  Instead of seeing whether it deviates too far from a fixed percentage we want to measure whether it deviates too far from the control treatment.
&lt;/p&gt;

&lt;p&gt;
Here's an example representation of the distribution of the control conversion rate and the treatment conversion rate.
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20081112/two-normals.png" alt="" title="two-normals" width="472" height="188" class="alignnone size-full wp-image-357 math" /&gt;

&lt;p&gt;
The peak of each curve is the conversion rate we measure, but there's some chance it is actually somewhere else on the curve.  Moreover, what we're &lt;em&gt;really&lt;/em&gt; interested in is the difference between the two conversion rates.  If the difference is large enough we conclude that the treatment really did alter user behavior.
&lt;/p&gt;

&lt;p&gt;
So, let's define a new random variable &lt;div class="latex math"&gt; X = p - p_c&lt;/div&gt; then our null hypothesis becomes &lt;div class="latex math"&gt; H_0 : X \le 0&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
We can now use the same techniques from our coin flip example, using the random variable X.  But to do this we need to know the probability distribution of X.
&lt;/p&gt;

&lt;p&gt;
It turns out that the sum (or difference) of two normally distributed random variables is itself normally distributed.  You can read the &lt;a href="http://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables"&gt;gory mathematical details&lt;/a&gt; yourself, if you're interested.
&lt;/p&gt;



&lt;p&gt;
This gives us a way to calculate a 95% confidence interval.
&lt;/p&gt;

&lt;h3&gt;Z-scores and One-tailed Tests&lt;/h3&gt;
&lt;p&gt;
Mathematically the z-score for X is &lt;div class="latex math"&gt; z = \frac{p - p_c}{\sqrt{\frac{p(1-p)}{N} + \frac{p_c(1-p_c)}{N_c}}}&lt;/div&gt; where N is the sample size of the experimental treatment and N&lt;sub&gt;c&lt;/sub&gt; is the sample size of the control treatment.
&lt;/p&gt;

&lt;p&gt;
Why?  Because the mean of X is p - p&lt;sub&gt;c&lt;/sub&gt; and the variance is the sum of the variances of p and p&lt;sub&gt;c&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
In the coin flip example the 95% confidence interval corresponded to a z-score of 1.96.  But it's different this time.
&lt;/p&gt;

&lt;p&gt;
In the coin flip example we rejected the null hypothesis if the percentage of heads was too high or too low.  The null hypothesis there was &lt;div class="latex math"&gt; p = 0.50&lt;/div&gt; but in this case our null hypothesis is &lt;div class="latex math"&gt; X \le 0&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
In other words, we only care about the positive tail of the normal distribution.  Here's a graphical representation of what I'm talking about.  In the coin example we have &lt;img src="http://assets.20bits.com/20081112/two-tailed.gif" alt="" title="two-tailed" width="455" height="333" class="alignnone size-full wp-image-356 math" /&gt; and we reject the null hypothesis if the percentage of heads is too high or too low.
&lt;/p&gt;
&lt;p&gt;
In this example we only reject the null hypothesis if the experimental conversion rate is significantly higher than the control conversion rate, so we have 
&lt;img src="http://assets.20bits.com/20081112/one-tailed.gif" alt="" title="one-tailed" width="455" height="333" class="alignnone size-full wp-image-356 math" /&gt;
&lt;/p&gt;

&lt;p&gt;
That is, we can reject the null hypothesis with 95% confidence if the z-score is higher than 1.65.  Here's a table with the z-scores calculated using the formula above:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Project X Landing Page&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Z-score&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td class="negative"&gt;19.23%&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td class="negative"&gt;25.00%&lt;/td&gt;
&lt;td&gt;1.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td class="negative"&gt;14.81%&lt;/td&gt;
&lt;td&gt;-1.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment C&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td class="negative"&gt;32.45%&lt;/td&gt;
&lt;td&gt;2.94&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
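&lt;p&gt;
You can reproduce these z-scores yourself.  Here's a short Python snippet that implements the formula from the previous section and runs it against the table's data:
&lt;/p&gt;

```python
# z-score for the difference between a treatment's conversion rate and the
# control's, using the variance-of-a-difference formula from above.
from math import sqrt

def z_score(p, n, p_c, n_c):
    se = sqrt(p * (1 - p) / n + p_c * (1 - p_c) / n_c)
    return (p - p_c) / se

p_control, n_control = 35 / 182, 182

for name, n, signups in [("A", 180, 45), ("B", 189, 28), ("C", 188, 61)]:
    z = z_score(signups / n, n, p_control, n_control)
    verdict = "significant" if z > 1.65 else "not significant"
    print(f"Treatment {name}: z = {z:.2f} ({verdict})")
# Treatment A: z = 1.33 (not significant)
# Treatment B: z = -1.13 (not significant)
# Treatment C: z = 2.94 (significant)
```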

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
From the table above we are safe concluding that Treatment C did, in fact, outperform the control treatment.  Whether the performance of Treatment A is statistically significant is irrelevant at this point because we know the performance of Treatment C is, so we should just pick that one and move on with our lives.
&lt;/p&gt;

&lt;p&gt;
Here are the key take-aways:
&lt;ul&gt;
	&lt;li&gt;The conversion rate for each treatment is a normally distributed random variable&lt;/li&gt;
	&lt;li&gt;We want to measure the difference in performance between a given treatment and the control.&lt;/li&gt;
	&lt;li&gt;The difference itself is a normally distributed random variable.&lt;/li&gt;
	&lt;li&gt;Since we only care if the difference is greater than zero we only need a z-score of 1.65, corresponding to the positive half of the normal curve.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
Statistical significance is important for A/B testing because it lets us know whether we've run the test for long enough.  In fact, we can ask the inverse question, "How long do I need to run an experiment before I can be certain if one of my treatments is more than 20% better than control?"
&lt;/p&gt;

&lt;p&gt;
This becomes more important when money is on the line because it lets you quantify risk, minimizing the impact of potentially risky treatments.
&lt;/p&gt;

&lt;p&gt;
We'll cover these things in future articles.  Until then!
&lt;/p&gt;</description>
      <pubDate>Wed, 12 Nov 2008 06:00:41 +0000</pubDate>
      <link>http://20bits.com/article/statistical-analysis-and-ab-testing</link>
    </item>
    <item>
      <title>Data Management, Facebook-style</title>
      <description>&lt;p&gt;
&lt;a href="http://jeffhammerbacher.com/"&gt;Jeff Hammerbacher&lt;/a&gt;, the former lead of the Data Team at Facebook and now VP of Product at &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;, put up some great slides on the evolution of Facebook's &lt;a href="http://www.cloudera.com/blog/2008/10/24/thrift-scribe-hive-and-cassandra-open-source-data-management-software/"&gt;data management strategy&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
They're very interesting from many perspectives, so take a look and then stay tuned for my two cents.
&lt;/p&gt;

&lt;p&gt;
&lt;div class="math"id="__ss_689126"&gt;&lt;object style="margin:0px" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/div&gt;
&lt;/p&gt;

&lt;h3&gt;Growing With Data&lt;/h3&gt;
&lt;p&gt;
Jeff was at Facebook for about two and a half years and saw Facebook grow from a company dealing with gigabytes of data per day to a company dealing with terabytes of data per day.  It was his job to guide the process of making sense of this pile of semi-structured data.
&lt;/p&gt;

&lt;p&gt;
The technical aspects are interesting, but what's more interesting to me is the story.  A good title for the presentation might be "Growing With Data."
&lt;/p&gt;

&lt;h3&gt;The Three Stages&lt;/h3&gt;
&lt;p&gt;
As I said, the most interesting part to me was how Facebook's data initiatives evolved over time to meet their growing needs.
&lt;/p&gt;

&lt;p&gt;
At first they did what everyone does &amp;mdash; periodic offline batch processing.  But we all know this doesn't scale forever, especially if your data is growing at an exponential rate.
&lt;/p&gt;

&lt;p&gt;
Eventually you wind up in a situation where you produce more data in an hour than you can process.  You can try to scale vertically, getting more bandwidth, more processing power, faster disks, etc., but the exponential nature of the situation will win in the end.
&lt;/p&gt;

&lt;p&gt;
Once the ad hoc ETL system no longer met their needs they built a system for distributed logging.  Unfortunately it didn't provide the flexibility they needed.  Analysts couldn't run SQL and maintaining the system was difficult.
&lt;/p&gt;

&lt;p&gt;
Eventually they hit upon &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;, an open source implementation of Google's MapReduce.  They built &lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Hive&lt;/a&gt;, a system for querying datasets stored in Hadoop files.  This means you get the scalability of Hadoop and the flexibility of a SQL-like language.  It's very slick.
&lt;/p&gt;
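&lt;p&gt;
To get a feel for the MapReduce model underneath Hive, here's a toy, single-machine Python version of the kind of job a GROUP BY query compiles down to.  The log data and function names are invented for illustration; in real Hadoop the map and reduce phases run distributed across a cluster.
&lt;/p&gt;

```python
# A toy MapReduce "GROUP BY url" -- the shape of job that a Hive query
# like SELECT url, COUNT(*) FROM logs GROUP BY url turns into.
from itertools import groupby
from operator import itemgetter

log_lines = [
    "alice /home", "bob /photos", "alice /home",
    "carol /home", "bob /photos", "bob /home",
]

def mapper(line):
    user, url = line.split()
    yield (url, 1)                       # emit one count per page hit

def reducer(url, counts):
    return (url, sum(counts))            # total the counts for one key

# The framework's "shuffle" phase sorts and groups map output by key:
pairs = sorted(kv for line in log_lines for kv in mapper(line))
totals = dict(
    reducer(url, (c for _, c in group))
    for url, group in groupby(pairs, key=itemgetter(0))
)
print(totals)  # {'/home': 4, '/photos': 2}
```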

&lt;p&gt;
They also built &lt;a href="http://code.google.com/p/the-cassandra-project/"&gt;Cassandra&lt;/a&gt;, which provides a &lt;a href="http://en.wikipedia.org/wiki/BigTable"&gt;BigTable&lt;/a&gt;-like system for storing massive amounts of structured data.&lt;/p&gt;

&lt;h3&gt;Evolution, not Revolution&lt;/h3&gt;
&lt;p&gt;
As I said, I like the story.  They didn't start by building these complex tools, but rather they evolved to fit a growing need within the company.  Beyond that I like that their approach to Hive was so customer-centric.  The analysts wanted SQL so they built a SQL-like language on top of their fancy distributed technology.  Very cool.
&lt;/p&gt;

&lt;p&gt;
There's a lot more where that came from over at the &lt;a href="http://www.cloudera.com/blog/"&gt;Cloudera blog&lt;/a&gt;, so check it out.  The future is data.
&lt;/p&gt;</description>
      <pubDate>Mon, 10 Nov 2008 06:00:55 +0000</pubDate>
      <link>http://20bits.com/article/data-management-facebook-style</link>
    </item>
    <item>
      <title>Obama, McCain, and Data-Driven Campaigning</title>
      <description>&lt;p&gt;
On Monday Slate published an article about &lt;a href="http://www.slate.com/id/2203146/"&gt;Obama's text messaging strategy&lt;/a&gt; (h/t &lt;a href="http://andrewchenblog.com/2008/10/29/slate-on-split-testing-in-the-mccain-and-obama-campaign-robo-calling-versus-text-messaging/"&gt;Andrew Chen&lt;/a&gt;) and how it compared to the traditional robo-calling strategy.  Politics is getting more quantitative every year and it's great to see the campaigns using techniques like A/B testing to determine what works and doesn't work in political messaging.
&lt;/p&gt;

&lt;h3&gt;A Channel to Voters&lt;/h3&gt;
&lt;p&gt;
Every campaign has several channels to voters: person-to-person contact, phone calls, mailers, etc.  You can attach metrics like "dollars per vote" or "votes per contact" to each of these channels, and the campaigns do.
&lt;/p&gt;

&lt;p&gt;
From the Slate article, &lt;blockquote&gt;Robo-calls are the pyrotechnics of politics: They create a big disturbance, but they don't have a prolonged effect. Numerous studies of robo-call campaigns show that they're ineffective both as tools of mobilization and persuasion &amp;mdash; they don't convince voters to go to the polls (or to stay away), and they don't change people's minds about which way to vote. So why do campaigns run robo-calls? Because they're cheap and easy. Telemarketing firms charge politicians between 2 and 5 cents per completed robo-call; that's as low as $20,000 to reach 1 million voters right in their homes.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The campaigns also measure votes-per-dollar.  Using the Green and Gerber numbers we get this table:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="3"&gt;Voter Contact Methods&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 6ex;"&gt;Channel&lt;/th&gt;
&lt;th style="width: 10ex;"&gt;Votes-per-contact&lt;/th&gt;
&lt;th style="width: 10ex;"&gt;Dollars-per-vote&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canvassing&lt;/td&gt;
&lt;td&gt;1/14&lt;/td&gt;
&lt;td&gt;$29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Phonebanking&lt;/td&gt;
&lt;td&gt;1/38&lt;/td&gt;
&lt;td&gt;$38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telemarketing&lt;/td&gt;
&lt;td&gt;1/180&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Mailers (non-partisan)&lt;/td&gt;
&lt;td&gt;1/200&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Partisan mailers are even farther down the list, and leaflets, emails, and robo-calls showed "no discernible effect" on the electorate.  Obama's breakthrough is using text messaging which costs an astonishing &lt;em&gt;$1.50 per vote&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;Hierarchy of Personalization and Maturity&lt;/h3&gt;
&lt;p&gt;
The hierarchy is pretty clear: the more personal the contact the more effective it is. Person-to-person contact will always be personal, obviously, and until this campaign text messaging was something you did with your friends and family.
&lt;/p&gt;

&lt;p&gt;
Will this change?  Email marketing is less effective nowadays because everyone is used to spam.  Text messaging has worked brilliantly for Obama this campaign but as the technique becomes more common there's no way to know if, four years from now, people will still respond the same way.&lt;/p&gt;

&lt;p&gt;
&lt;em&gt;Bombarding a communication channel with impersonal messages makes the medium itself less personal and therefore less effective.&lt;/em&gt;
&lt;/p&gt;

&lt;h3&gt;A/B Testing&lt;/h3&gt;
&lt;p&gt;
How did Green and Gerber calculate the effectiveness of voter contact methods? By carefully measuring the results of their A/B test and using &lt;a href="/article/hypothesis-testing-the-basics"&gt;hypothesis testing&lt;/a&gt; to determine whether there were actually differences between the control and test groups:
&lt;/p&gt;

&lt;blockquote&gt;
Rather than merely theorizing about how campaigns might get people to vote, Green, Gerber, and their colleagues favor randomized field experiments to test how different techniques work during real elections. Their method has much in common with double-blind pharmaceutical studies: With the cooperation of political campaigns (often at the state and local level), researchers randomly divide voters into two categories, a treatment group and a control group. They subject the treatment group to a given tactic: robo-calls, e-mail, direct mail, door-to-door canvassing, etc. Then they use statistical analysis to determine whether voters in the treatment group behaved differently from voters in the control group.
&lt;/blockquote&gt;

&lt;h3&gt;Personal Experience&lt;/h3&gt;
&lt;p&gt;
I'm an Obama supporter and have been volunteering on and off since early this year.  Working on the California primary I had a chance to see the campaign's data-driven approach first-hand.
&lt;/p&gt;

&lt;p&gt;
Every state is broken down to the precinct level using a system called VoteBuilder.  In a small town a precinct might cover the whole town, but in a city it could be as small as a few city blocks.  Precinct captains or other field workers can slice up the city using queries like "get me a list of voters who are not strong supporters of either candidate and print off their names in walking order."
&lt;/p&gt;

&lt;p&gt;
What's more, Obama's campaign is as much about his brand as it is about finding the votes it knows are there, even if they're in traditionally Republican areas.  I canvassed my hometown in Northern Michigan, for example, an area classified as a "persuasion area."&lt;/p&gt;

&lt;p&gt;How did the Obama campaign know that this small segment of typically Republican Northern Michigan was persuadable?&lt;/p&gt;

&lt;p&gt;
It's simple, really: data plus an empirical mindset.
&lt;/p&gt;

&lt;h3&gt;Lessons from Republicans&lt;/h3&gt;
&lt;p&gt;
In a lot of ways the Obama campaign has learned from the Republicans.  In the past Republicans have won in large part because they applied their background in quantitative direct marketing to voter mobilization.
&lt;/p&gt;

&lt;p&gt;
And if Obama wins it will be in large part because he absorbed and modernized the data-driven techniques Republicans have been using since the early 90's.
&lt;/p&gt;</description>
      <pubDate>Wed, 29 Oct 2008 09:50:34 +0000</pubDate>
      <link>http://20bits.com/article/obama-mccain-and-ab-testing</link>
    </item>
    <item>
      <title>Hypothesis Testing: The Basics</title>
      <description>&lt;p&gt;
Say I hand you a coin.  How would you tell if it's fair?  If you flipped it 100 times and it came up heads 51 times, what would you say?  What if it came up heads 5 times, instead?
&lt;/p&gt;

&lt;p&gt;
In the first case you'd be inclined to say the coin was fair and in the second case you'd be inclined to say it was biased towards tails.  How certain are you?  Or, even more specifically, how likely is it &lt;em&gt;actually&lt;/em&gt; that the coin is fair in each case?
&lt;/p&gt;

&lt;h3&gt;Hypothesis Testing&lt;/h3&gt;
&lt;p&gt;
Questions like the ones above fall into a domain called &lt;em&gt;hypothesis testing&lt;/em&gt;.  Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.
&lt;/p&gt;

&lt;p&gt;
In the coin example the "experiment" was flipping the coin 100 times.  There are two questions you can ask.  One, assuming the coin is fair, how likely is it that you'd observe the results you did?  Two, what is the likelihood that the coin is fair given the results you observed?
&lt;/p&gt;

&lt;p&gt;
Of course, an experiment can be much more complex than coin flipping.  Any situation where you're taking a random sample of a population and measuring something about it is an experiment, and for our purposes this includes &lt;a href="/article/an-introduction-to-ab-testing"&gt;A/B testing&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;Let's focus on the coin flip example to understand the basics.&lt;/p&gt;

&lt;h3&gt;The Null Hypothesis&lt;/h3&gt;
&lt;p&gt;
The most common type of hypothesis testing involves a &lt;em&gt;null hypothesis&lt;/em&gt;.  The null hypothesis, denoted H&lt;sub&gt;0&lt;/sub&gt;, is a statement about the world which can plausibly account for the data you observe.  Don't read anything into the fact that it's called the "null" hypothesis &amp;mdash; it's just the hypothesis we're trying to test.
&lt;/p&gt;
&lt;p&gt;
For example, "the coin is fair" is an example of a null hypothesis, as is "the coin is biased."  The important part is that the null hypothesis can be expressed in simple mathematical terms.  We'll see how to express these statements mathematically in just a bit.
&lt;/p&gt;

&lt;p&gt;
The main goal of hypothesis testing is to tell us whether we have enough evidence to reject the null hypothesis.  In our case we want to know whether the coin is biased or not, so our null hypothesis should be "the coin is fair."  If we get enough evidence that contradicts this hypothesis, say, by flipping it 100 times and having it come up heads only once, then we can safely reject it.
&lt;/p&gt;

&lt;p&gt;
All of this is perfectly quantifiable, of course.  What constitutes "enough" and "safely" is a matter of statistics.
&lt;/p&gt;

&lt;h3&gt;The Statistics, Intuitively&lt;/h3&gt;
&lt;p&gt;
So, we have a coin.  Our null hypothesis is that this coin is fair.  We flip it 100 times and it comes up heads 51 times.  Do we know whether the coin is biased or not?
&lt;/p&gt;

&lt;p&gt;
Our gut might say the coin is fair, or at least probably fair, but we can't say for sure.  The expected number of heads is 50 and 51 is quite close.  But what if we flipped the coin 100,000 times and it came up heads 51,000 times?  We see 51% heads both times, but in the second instance the coin is more likely to be biased.
&lt;/p&gt;

&lt;p&gt;
Lack of evidence to the contrary is not evidence that the null hypothesis is true.  Rather, it means that we don't have sufficient evidence to conclude that the null hypothesis is false.  The coin might actually have a 51% bias towards heads, after all.
&lt;/p&gt;

&lt;p&gt;
If instead we saw 1 head for 100 flips that would be another story.  Intuitively we know that the chance of seeing this if the null hypothesis were true is so small that we would be comfortable rejecting the null hypothesis and declaring the coin to (probably) be biased.
&lt;/p&gt;
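&lt;p&gt;
Just how small is that chance?  Here's a quick sketch (in Python, purely for illustration) that computes the exact binomial probability of seeing at most &lt;em&gt;k&lt;/em&gt; heads:
&lt;/p&gt;

```python
from math import comb

def prob_at_most_k_heads(k, n=100, p=0.5):
    # Exact binomial probability of seeing k or fewer heads in n flips
    # of a coin that comes up heads with probability p.
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k + 1))

tiny = prob_at_most_k_heads(1)   # at most 1 head in 100 fair flips
```

&lt;p&gt;
The probability of at most 1 head in 100 fair flips works out to roughly 8 &amp;times; 10&lt;sup&gt;-29&lt;/sup&gt; &amp;mdash; small enough that rejecting the null hypothesis is an easy call.
&lt;/p&gt;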

&lt;p&gt;
Let's quantify our intuition.
&lt;/p&gt;

&lt;h3&gt;The Coin Flip&lt;/h3&gt;
&lt;p&gt;
Formally the flip of a coin can be represented by a Bernoulli trial.  A Bernoulli trial is a random variable &lt;strong&gt;X&lt;/strong&gt; such that &lt;div class="latex math"&gt; Pr\left(X = 1\right) = 1 - Pr\left(X = 0\right) = 1 - q = p&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
That is, &lt;strong&gt;X&lt;/strong&gt; takes on the value 1 (representing heads) with probability &lt;em&gt;p&lt;/em&gt;, and 0 (representing tails) with probability &lt;em&gt;1 - p&lt;/em&gt;&lt;span class="footnote"&gt;Of course, 1 can represent either heads or tails so long as you're consistent and 0 represents the opposite outcome&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;
Now, let's say we have 100 coin flips.  Let &lt;strong&gt;X&lt;/strong&gt;&lt;sub&gt;i&lt;/sub&gt; represent the &lt;em&gt;i&lt;sup&gt;th&lt;/sup&gt;&lt;/em&gt; coin flip.  Then the random variable &lt;div class="latex math"&gt; Y = \sum_{i=1}^{100} X_i&lt;/div&gt; represents the run of 100 coin flips.
&lt;/p&gt;
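&lt;p&gt;
If you want to get a feel for these definitions, here's a short simulation sketch (in Python rather than this blog's usual PHP, just for illustration):
&lt;/p&gt;

```python
import random

def bernoulli(p=0.5):
    # One Bernoulli trial: returns 1 (heads) with probability p, else 0 (tails).
    return random.choices([1, 0], weights=[p, 1 - p])[0]

def run_flips(n=100, p=0.5):
    # Y = X_1 + ... + X_n, the total number of heads in n flips.
    return sum(bernoulli(p) for _ in range(n))

y = run_flips(100)   # for a fair coin this lands near 50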

&lt;h3&gt;The Statistics, Mathematically&lt;/h3&gt;
&lt;p&gt;
Say you have a set of observations O and a null hypothesis H&lt;sub&gt;0&lt;/sub&gt;.  In the above coin example we were trying to calculate &lt;div class="latex math"&gt; P\left(O \mid H_0\right)&lt;/div&gt;
i.e., the probability that we observed what we did given the null hypothesis.  If that probability is sufficiently small we're confident in concluding that the null hypothesis is false&lt;span class="footnote"&gt;But remember, if that probability is &lt;em&gt;not&lt;/em&gt; sufficiently small, that doesn't mean the null hypothesis is true!&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;
We can use whatever level of confidence we want before rejecting the null hypothesis, but most people choose 90%, 95%, or 99%.  For example if we choose a 95% confidence level we reject the null hypothesis if &lt;div class="latex math"&gt; P\left(O \mid H_0\right) \le 1 - 0.95 = 0.05&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
The &lt;a href="http://en.wikipedia.org/wiki/Central_limit_theorem"&gt;Central Limit Theorem&lt;/a&gt; is the main piece of math here.  Briefly, the Central Limit Theorem says that the average of a large number of independent, identically distributed random variables is approximately normally distributed.
&lt;/p&gt;

&lt;p&gt;
Remember our random variables from before? If we let &lt;div class="latex math"&gt; p = \frac{Y}{N}&lt;/div&gt; where &lt;em&gt;N&lt;/em&gt; = 100 is the number of flips, then &lt;em&gt;p&lt;/em&gt; is the proportion of heads in our sample.  In our case it is equal to 0.51, or 51%.
&lt;/p&gt;

&lt;p&gt;
But by the central limit theorem we also know that &lt;em&gt;p&lt;/em&gt; approximates a normal distribution.  This means we can estimate the standard deviation of &lt;em&gt;p&lt;/em&gt; as &lt;div class="latex math"&gt; \sigma = \sqrt{\frac{p(1-p)}{N}}&lt;/div&gt;
&lt;/p&gt;

&lt;h3&gt;Wrapping It Up&lt;/h3&gt;
&lt;p&gt;
Our null hypothesis is that the coin is fair.  Mathematically we're saying &lt;div class="latex math"&gt; H_0 : p_0 = 0.50&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
Here's the normal curve:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20081027/normal-curve-small.png" alt="" title="" width="600" height="450" /&gt;

&lt;p&gt;
A 95% level of confidence means we reject the null hypothesis if &lt;em&gt;p&lt;/em&gt; falls outside 95% of the area of the normal curve.  Looking at that chart we see that this corresponds to approximately 1.96 standard deviations.
&lt;/p&gt;

&lt;p&gt;
The so-called "z-score" tells us how many standard deviations away from the mean our sample is, and it's calculated as &lt;div class="latex math"&gt; z = \frac{p-0.50}{\sqrt{\frac{0.50(1-0.50)}{N}}}&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
The numerator is "p - 0.50" because our null hypothesis is that p = 0.50.  This measures how far the sample proportion, p, diverges from the expected mean of a fair coin, 0.50.
&lt;/p&gt;
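&lt;p&gt;
In code the whole test collapses to a couple of lines.  Here's a sketch in Python (illustrative only):
&lt;/p&gt;

```python
import math

def z_score(p_hat, p0, n):
    # How many standard errors the observed proportion p_hat lies
    # from the proportion p0 claimed by the null hypothesis.
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z = z_score(0.51, 0.50, 100)
# z is 0.2, well inside 1.96 standard deviations, so we cannot
# reject the null hypothesis that the coin is fair.
```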

&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;
Let's say we flipped three coins 100 times each and got the following data.
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="3"&gt;Data for 100 Flips of a Coin&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 6ex;"&gt;Coin&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Flips&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Pct. Heads&lt;/th&gt;
&lt;th style="width: 6ex;"&gt;Z-score&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coin 1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td&gt;Coin 2&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;2.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coin 3&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;5.77&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Using a 95% confidence level we'd conclude that Coin 2 and Coin 3 are biased using the techniques we've developed so far.  Coin 2 is 2.04 standard deviations from the mean and Coin 3 is 5.77 standard deviations.
&lt;/p&gt;

&lt;p&gt;
When your test statistic meets the 95% confidence threshold we call it &lt;em&gt;statistically significant&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
This means that if the null hypothesis were true there would be only a 5% chance of observing a result as extreme as yours.  Phrased another way, random variation alone would produce a result like this less than 5% of the time.
&lt;/p&gt;

&lt;h3&gt;Recap&lt;/h3&gt;
&lt;p&gt;
Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.  You start by forming a null hypothesis, e.g., "this coin is fair," and then calculate the likelihood that your observations are due to pure chance rather than a real difference in the population.
&lt;/p&gt;

&lt;p&gt;
The confidence level is the threshold at which you reject the null hypothesis.  If your observations would occur less than 5% of the time under the null hypothesis, you reject it at the 95% confidence level.  This also means there is up to a 5% chance you're wrong and the difference is due to random fluctuations.
&lt;/p&gt;

&lt;p&gt;
The null hypothesis can be any mathematical statement and the test you use depends on both the underlying data and your null hypothesis.  In our coin flipping example the underlying data approximated a normal distribution and we wanted to test whether the observed proportion of heads was different enough to be significant.  In this case we were measuring the &lt;em&gt;sample mean&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
We can measure anything, though: the sample variance, correlation, etc.  Different tests need to be used to determine whether these are statistically significant, as we'll see in coming articles.
&lt;/p&gt;

&lt;h3&gt;What's Next?&lt;/h3&gt;
&lt;p&gt;
Now that we understand the innards of hypothesis testing we can apply our knowledge to A/B tests to determine whether new features &lt;em&gt;actually&lt;/em&gt; affect user behavior.  Until then!
&lt;/p&gt;</description>
      <pubDate>Mon, 27 Oct 2008 14:09:54 +0000</pubDate>
      <link>http://20bits.com/article/hypothesis-testing-the-basics</link>
    </item>
    <item>
      <title>Scientific Product Development</title>
      <description>&lt;p&gt;
Growing up every kid learned about the scientific method, about hypotheses, testing, measurement, and analysis.  Data-driven development is about taking these scientific principles and applying them (at least in part) to all aspects of a business &amp;mdash; especially product development.
&lt;/p&gt;

&lt;p&gt;
It's about subjecting your decisions to empirical reality and letting the data guide your intuition.  Mike Speiser &lt;a href="http://laserlike.com/2008/09/22/scientific-product-development/"&gt;wrote about this&lt;/a&gt; last September and it's a good phrase, so I'm going to steal it.
&lt;/p&gt;

&lt;p&gt;
Let's take a look!
&lt;/p&gt;

&lt;h3&gt;Scientific Product Development&lt;/h3&gt;

&lt;p&gt;
In that spirit, let's break down the process of scientific product development.  A picture is worth a thousand words, so here goes:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20081020/cmta.png" alt="" title="HMTA"/&gt;

&lt;p&gt;
Let's break it down.
&lt;/p&gt;

&lt;dl&gt;
	&lt;dt&gt;Hypothesize&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	The first step in scientific product development is to come up with a testable hypothesis.  This can be a hypothesis about anything: your userbase, your market, whatever.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Measure&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Many scientific breakthroughs happen because of improvements in instrumentation.  Think microscope, telescope, x-ray crystallography, etc.  The nuts and bolts of instrumentation, data collection, and measurement are just as important in building a data-driven business as they are in science.
	&lt;/p&gt;
	
	&lt;p&gt;
	What data do you need to collect?  How do you collect it?  How do you &lt;em&gt;store&lt;/em&gt; it?  And finally, how do you extract it in a way your analysts can make sense of it?
	&lt;/p&gt;
	
	&lt;p&gt;
	This is also the step for the &lt;a href="http://startonomics.com/"&gt;metrics-obsessed&lt;/a&gt;.
	&lt;/p&gt;
	
	&lt;p&gt;
	For most startups measurement and instrumentation involves little more than SQL and your database of choice, but the big boys use technologies like &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;, &lt;a href="http://code.google.com/p/the-cassandra-project/"&gt;Cassandra&lt;/a&gt;, and &lt;a href="http://en.wikipedia.org/wiki/BigTable"&gt;BigTable&lt;/a&gt; to solve various problems in this domain.  Google Analytics fits here, too.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Test&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	I've been talking about this step the most.  Once you are recording data and have the ability to extract intelligence from it you can subject your products to A/B tests, multivariate tests, and other experimental techniques.
	&lt;/p&gt;
	
	&lt;p&gt;
	You can either &lt;a href="/article/implementing-ab-testing"&gt;build your own testing infrastructure&lt;/a&gt; or use off-the-shelf stuff like &lt;a href="https://www.google.com/analytics/siteopt/?hl=en"&gt;Google Website Optimizer&lt;/a&gt;.
	&lt;/p&gt;
	&lt;/dd&gt;
	
	&lt;dt&gt;Analyze&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;The final step in the process is analysis.  This means taking the data you've collected and generating &lt;em&gt;insight&lt;/em&gt;.&lt;/p&gt;
	
	&lt;p&gt;
	Analysis involves looking at all the interlocking variables, isolating the ones that matter most, and presenting them in a way that is easily understood.  Statistics is helpful here, as is a background in &lt;a href="http://en.wikipedia.org/wiki/Information_Design"&gt;Information Design&lt;/a&gt;.
	&lt;/p&gt;
	
	&lt;p&gt;
	Your analysis is what gives you the data to contradict or support your hypothesis.  It also tells you what you're missing as you iterate your data-driven process.  Are you not collecting data you need?  Are there additional hypotheses that you need to test?  Good analysis yields a few good answers and several more questions.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;

&lt;h3&gt;Directed versus Undirected Analysis&lt;/h3&gt;
&lt;p&gt;
When I was learning the scientific method as a kid it was always presented as a rigid process.  You form a hypothesis then you go about methodically testing that hypothesis, refining it until it's a "theory."
&lt;/p&gt;

&lt;p&gt;
This model is too rigid.  Flashes of insight often come when you're freely exploring the data.  There isn't always a statement to test &amp;mdash; sometimes you need to look at the data to figure out what's worth testing.
&lt;/p&gt;


&lt;h3&gt;I'm Not Alone&lt;/h3&gt;
&lt;p&gt;
Other people, like &lt;a href="http://laserlike.com/2008/09/22/scientific-product-development/"&gt;Mike Speiser&lt;/a&gt; and &lt;a href="http://startuplessonslearned.blogspot.com/2008/09/thoughts-on-scientific-product.html"&gt;Eric Ries&lt;/a&gt;, are talking explicitly about building products in this way.
&lt;/p&gt;

&lt;p&gt;On the strategic and measurement side there are people like my friend &lt;a href="http://andrewchenblog.com/"&gt;Andrew Chen&lt;/a&gt; and &lt;a href="http://500hats.typepad.com/"&gt;Dave McClure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;
If you have any recommendations of other scientific-minded bloggers I'm all ears!
&lt;/p&gt;</description>
      <pubDate>Mon, 20 Oct 2008 06:00:53 +0000</pubDate>
      <link>http://20bits.com/article/scientific-product-development</link>
    </item>
    <item>
      <title>Implementing A/B Testing</title>
      <description>&lt;p&gt;
Before you can start doing A/B tests you need a system that can support them.  That means either you find one off the shelf or you build it yourself.
&lt;/p&gt;

&lt;h3&gt;Off-the-shelf A/B Testing&lt;/h3&gt;

&lt;p&gt;
Most off-the-shelf A/B testing software is geared towards marketers as they were the first group online to adopt the technique en masse.  The two pieces of software I'm most familiar with are &lt;a href="http://www.omniture.com/en/products/conversion/testandtarget"&gt;Omniture Test &amp;amp; Target&lt;/a&gt;, which used to be part of Offermatica before Omniture acquired them, and &lt;a href="http://www.google.com/websiteoptimizer"&gt;Google Website Optimizer&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
Omniture Test &amp;amp; Target costs money and is designed for big corporate clients with equally big wallets.  It's very nice software, but probably not what you're looking for if you're just getting started out.
&lt;/p&gt;

&lt;p&gt;
Google Website Optimizer, however, is free and much simpler.  It lets you do both A/B testing and multivariate testing, but is limited in that it only has a notion of "conversions."
&lt;/p&gt;

&lt;p&gt;
You place a bit of code on every page that is part of an experiment and another bit of code on the page that counts as a "conversion."  You can then track conversion rates across your treatments (or "variations" as GWO calls them).
&lt;/p&gt;

&lt;p&gt;
Conversion Rate Experts has a good &lt;a href="http://www.conversion-rate-experts.com/articles/101-google-website-optimizer-tips/"&gt;introductory article&lt;/a&gt; on Google Website Optimizer if you're interested.
&lt;/p&gt;

&lt;h3&gt;Rolling Your Own&lt;/h3&gt;
&lt;p&gt;
Rolling your own A/B testing system isn't that hard.  Let's say we're running an email campaign called &lt;tt&gt;buy_our_book&lt;/tt&gt; where we're trying to advertise our new book.  Here's how the code might look (in PHP):
&lt;/p&gt;

&lt;pre class="brush: php"&gt;function send_mail($recipient, $campaign) {
	$subject = get_mail_subject($recipient, $campaign);
	$copy = get_mail_copy($recipient, $campaign);
	mail($recipient, $subject, $copy);
}

function get_mail_subject($recipient, $campaign) {
	if ($campaign == 'buy_our_book') {
		// Get treatment if it exists, else get random treatment
		$treatment = get_treatment_for('book_subject', $recipient);
		switch($treatment) {
			case 'control':
				return get_default_subject($recipient, $campaign);
			case 'discount_price':
				return "Get 50% off our latest book!";
			case 'direct_appeal':
				return "Buy our book, we beg you!";
		}
	} else {
		return get_default_subject($recipient, $campaign);
	}
}&lt;/pre&gt;

&lt;p&gt;
&lt;tt&gt;get_treatment_for&lt;/tt&gt; does the meat of the work.  It should do the following:
&lt;ol&gt;
	&lt;li&gt;Use an existing treatment if the recipient already has one.&lt;/li&gt;
	&lt;li&gt;Otherwise, assign that user a random treatment&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
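&lt;p&gt;
The article's code is in PHP, but the sticky-assignment logic is easy to sketch in any language.  Here it is in Python, with a plain dictionary standing in for the &lt;tt&gt;users_treatments&lt;/tt&gt; table (names are illustrative):
&lt;/p&gt;

```python
import random

def get_treatment_for(experiment, user_id, treatments, assignments):
    # Reuse the stored treatment if this user already has one for this
    # experiment; otherwise assign a random one and remember it.
    key = (user_id, experiment)
    if key not in assignments:
        assignments[key] = random.choice(treatments)
    return assignments[key]

store = {}  # stands in for the users_treatments table
first = get_treatment_for('book_subject', 42,
                          ['control', 'discount_price', 'direct_appeal'], store)
again = get_treatment_for('book_subject', 42,
                          ['control', 'discount_price', 'direct_appeal'], store)
```

&lt;p&gt;
Because the assignment is stored on first contact, a recipient sees the same subject line on every send.
&lt;/p&gt;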

&lt;p&gt;
Here is an example MySQL schema for A/B testing:
&lt;/p&gt;

&lt;pre class="brush: sql"&gt;CREATE TABLE treatments (
	id INT UNSIGNED NOT NULL auto_increment,
	name VARCHAR(255),
	experiment_id INT UNSIGNED NOT NULL,
	PRIMARY KEY(id)
);

CREATE TABLE experiments (
	id INT UNSIGNED NOT NULL auto_increment,
	name VARCHAR(255),
	PRIMARY KEY(id)
);

CREATE TABLE users_treatments (
	user_id INT UNSIGNED NOT NULL,
	experiment_id INT UNSIGNED NOT NULL,
	treatment_id INT UNSIGNED NOT NULL,
	PRIMARY KEY(user_id, experiment_id)
);&lt;/pre&gt;

&lt;p&gt;
This schema assumes each user can be uniquely identified by an integer, but if the requirements of your website are different you can change it.  For example, if users aren't required to sign up and you track their activities using cookies you'd store the cookie ID in the database.
&lt;/p&gt;


&lt;h3&gt;Use Weights&lt;/h3&gt;
&lt;p&gt;
It's also advisable that you add weights to your treatments so that, e.g., you can select one treatment 90% of the time and another 10% of the time.  All this logic can be encapsulated in the &lt;tt&gt;get_treatment_for&lt;/tt&gt; function.  I even wrote an article about how to get &lt;a href="/article/random-weighted-elements-in-php"&gt;weighted random elements&lt;/a&gt; in PHP.
&lt;/p&gt;

&lt;p&gt;
Why weights? If you're dealing with revenue, running a 50/50 split between the control and a total unknown puts your bottom line at risk.
&lt;/p&gt;

&lt;p&gt;
Even if you're not monetizing your traffic you probably don't want to put your traffic itself at risk.  Weighting your treatments takes care of this.
&lt;/p&gt;
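&lt;p&gt;
A weighted pick takes only a few lines; here's an illustrative Python sketch (the treatment names are made up):
&lt;/p&gt;

```python
import random

def weighted_treatment(weights):
    # weights is a dict of treatment name to weight, e.g.
    # {'control': 90, 'new_subject': 10} picks the control about
    # 90% of the time.
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

&lt;p&gt;
With weights like 90/10 the control keeps most of your traffic while the experimental treatment gathers data slowly and safely.
&lt;/p&gt;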

&lt;h3&gt;What's Next&lt;/h3&gt;
&lt;p&gt;
Hopefully you understand enough to go out and implement a basic A/B testing system yourself.  In my next two articles I'm going to cover instrumentation and analysis.  That is, what should you be measuring (and how) and how do you know which treatment was successful?
&lt;/p&gt;</description>
      <pubDate>Tue, 14 Oct 2008 06:00:24 +0000</pubDate>
      <link>http://20bits.com/article/implementing-ab-testing</link>
    </item>
    <item>
      <title>An Introduction to A/B Testing</title>
      <description>&lt;p&gt;
A/B testing is one of the primary tools in any &lt;a href="/article/data-driven-development"&gt;data-driven&lt;/a&gt; environment.  You can think of it as a big cage match.  Send in your champion versus several other challengers and out comes a victor.&lt;/p&gt;

&lt;p&gt;
Of course, on the web there's less blood and more statistics, but the principle remains the same: how do you know who will win unless you force them to fight to the death?
&lt;/p&gt;

&lt;p&gt;
A/B Testing lets you compare several alternate versions of the same web page simultaneously and see which produces the best outcome, e.g., increased click-through, engagement, or any other metric of your choice. 
&lt;/p&gt;

&lt;h3&gt;Ok, What is A/B Testing, Really?&lt;/h3&gt;
&lt;p&gt;
A/B Testing is a way of conducting an experiment where you compare a control group to the performance of one or more test groups by randomly assigning each group a specific single-variable treatment.  Let's break that down.
&lt;/p&gt;

&lt;p&gt;
First, you decide on an experiment.  Maybe you're building a web application that forces users to register and you want to experiment on your landing page.  You want to see if you can improve the percentage of people who register.&lt;/p&gt;

&lt;p&gt;
The &lt;em&gt;conversion rate&lt;/em&gt; for your landing page is &lt;div class="latex math"&gt; \text{conversion rate} = \frac{\text{\# of visitors who register}}{\text{\# of total visitors}}&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
For example, if 100 people visit your landing page today and 20 of those people register then you have a conversion rate of 20%.  All else being equal, the landing page with the higher conversion rate is better&lt;span class="footnote"&gt;"All else being equal" is important here &amp;mdash; if one of your landing pages promises free candy to people who register you might get a higher conversion rate, but the resulting users will have less long-term value once they realize you're a big fat liar.  I'm also not going to talk about statistical significance, yet.&lt;/span&gt;.
&lt;/p&gt;
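&lt;p&gt;
The calculation itself is a one-liner; sketching it in Python for concreteness:
&lt;/p&gt;

```python
def conversion_rate(registered, visitors):
    # Fraction of visitors who completed the goal action (registration).
    return registered / visitors

rate = conversion_rate(20, 100)   # 0.20, i.e. a 20% conversion rate
```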

&lt;h3&gt;Building Treatments&lt;/h3&gt;
&lt;p&gt;
Once you know &lt;em&gt;what&lt;/em&gt; you want to test you have to create treatments to test it.  One of the treatments will be the control, i.e., your current landing page.  The other treatments will be variations on that.  Here are some things worth testing:
&lt;ul&gt;
	&lt;li&gt;Layout.  Move the registration forms around.  Add fields, remove fields.&lt;/li&gt;
	&lt;li&gt;Headings.  Add headings.  Make them different colors.  Change the copy.&lt;/li&gt;
	&lt;li&gt;Copy.  Change the size, color, placement, and content of any text you have on the page.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
You can have as many treatments as you want, but you get better data more quickly with fewer treatments.  I rarely conduct A/B tests with more than four treatments.
&lt;/p&gt;

&lt;h3&gt;Randomization Means Control&lt;/h3&gt;
&lt;p&gt;
You can't just throw up one landing page on Friday and another landing page on Saturday and compare the conversion rates &amp;mdash; there's no reason to believe that the conversion rate for users who visit on a Friday is the same for users who visit on a Saturday.  In fact, they're probably not.
&lt;/p&gt;

&lt;p&gt;
A/B testing solves this by running the experiment in parallel and &lt;em&gt;randomly&lt;/em&gt; assigning a treatment to each person who visits.  This controls for any time-sensitive variables and distributes the population proportionally across the treatments.
&lt;/p&gt;


&lt;p&gt;Let's look at an example data set.&lt;/p&gt;
&lt;h3&gt;An Example&lt;/h3&gt;
&lt;p&gt;
Say we have a service called "Foobar" and we're conducting an experiment on our landing page.  Our goal is to improve the conversion rate by at least 10%.  When a new visitor arrives on the landing page we randomly assign them one of three treatments: the control, Treatment A, or Treatment B.
&lt;/p&gt;

&lt;p&gt;
Let's also say these treatments involve the headline copy. For example, the control treatment's headline copy might be "Foobar is a great service!  Sign up here."  One of the experimental treatments might have "Foobar lets you stay in touch with family all across the country &amp;mdash; easily."
&lt;/p&gt;

&lt;p&gt;
You run the experiment for a few days and get the following data:
&lt;/p&gt;
&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;A/B Testing Example Data for the Foobar Service&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;1,406&lt;/td&gt;
&lt;td&gt;356&lt;/td&gt;
&lt;td class="negative"&gt;25.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;1,488&lt;/td&gt;
&lt;td&gt;382&lt;/td&gt;
&lt;td class="negative"&gt;25.67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;1,392&lt;/td&gt;
&lt;td&gt;425&lt;/td&gt;
&lt;td class="negative"&gt;30.53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
From the data above you'd conclude that Treatment B is the winner, but you have to be careful &amp;mdash; if the conversion rates were closer or if your &lt;em&gt;sample size&lt;/em&gt; were smaller you wouldn't be able to tell which treatment won.  For example, can you say for certain that Treatment A is better than the control treatment, or could it just be due to chance?
&lt;/p&gt;
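&lt;p&gt;
Quantifying that certainty is the subject of an upcoming article, but as a preview here's a sketch of one standard approach, a pooled two-proportion z-test, in Python (illustrative only):
&lt;/p&gt;

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    # z-statistic for whether two conversion rates differ by more than chance.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # overall rate if there were no difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(356, 1406, 425, 1392)   # control vs. Treatment B
# z is about 3.1, beyond the conventional 1.96 cutoff, so Treatment B's
# lift is unlikely to be chance; the same test on Treatment A's row stays
# well below 1.96, so its small edge over the control proves nothing.
```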

&lt;h3&gt;Sample Size Matters&lt;/h3&gt;
&lt;p&gt;
The &lt;em&gt;sample size&lt;/em&gt; of a treatment is the number of people who received that treatment.  The larger the sample size the more certain you are that the sample's performance reflects the real performance of the treatment.
&lt;/p&gt;

&lt;p&gt;
For example, what if the above data looked like this, instead?
&lt;/p&gt;

&lt;center&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;A/B Testing Example Data for the Foobar Service&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Treatment&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Treated&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Visitors Registered&lt;/th&gt;
&lt;th style="width: 8ex;"&gt;Conversion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Control&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td class="negative"&gt;30.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Treatment A&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td class="negative"&gt;50.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Treatment B&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td class="negative"&gt;44.44%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
Which treatment is the best, now?  You might be inclined to say that Treatment A is the winner because it has a higher conversion rate.  But this is akin to saying that you know a coin is biased because you flipped it three times and got all heads.
&lt;/p&gt;

&lt;p&gt;
That might be unlikely, but it's not impossible.  The larger the sample size the more certain you are that the effects you're observing are from real differences in the treatments and not from pure chance.  In fact, none of these results are &lt;em&gt;statistically significant&lt;/em&gt;, i.e., they could easily be explained by chance rather than by real differences in the treatments.
&lt;/p&gt;

&lt;p&gt;
Since sample size is per-treatment there are primarily two ways to increase it: use fewer treatments or run the experiment for longer.
&lt;/p&gt;

&lt;h3&gt;What's Next?&lt;/h3&gt;
&lt;p&gt;
There's a lot more to cover when it comes to A/B testing.  Here are a few topics I'll be writing about over the coming weeks:
&lt;dl&gt;
	&lt;dt&gt;Implementation&lt;/dt&gt;
	&lt;dd&gt;Once we understand what A/B testing is about, how do we implement it?  Do different products require different implementations?&lt;/dd&gt;
	&lt;dt&gt;Statistical Significance&lt;/dt&gt;
	&lt;dd&gt;Once we have results from our A/B test, how can we quantify our level of certainty?  How long do we have to run an experiment before we can be certain of the results?&lt;/dd&gt;
	&lt;dt&gt;Hypothesis Testing&lt;/dt&gt;
	&lt;dd&gt;What if we want to test more complex behavior?  What if the data we get back can't be modeled as a simple percentage?&lt;/dd&gt;
	&lt;dt&gt;Best Practices&lt;/dt&gt;
	&lt;dd&gt;What is worth testing?  How do you balance short-term and long-term goals in the context of testing?&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
That's it for today.  Feel free to &lt;a href="http://20bits.com/2008/10/05/an-introduction-to-ab-testing/#comments"&gt;leave a comment&lt;/a&gt; and let me know what you want me to write about next.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 06 Oct 2008 04:00:15 +0000</pubDate>
      <link>http://20bits.com/article/an-introduction-to-ab-testing</link>
    </item>
    <item>
      <title>Data-Driven Development</title>
      <description>&lt;p&gt;
There are &lt;a href="http://andrewchenblog.com/"&gt;lots&lt;/a&gt; &lt;a href="http://500hats.typepad.com/"&gt;of&lt;/a&gt; &lt;a href="http://startuplessonslearned.blogspot.com/"&gt;smart&lt;/a&gt; people out there talking about metrics and tech startups, but the one thing they all have in common is an empirical mind-set.
&lt;/p&gt;

&lt;p&gt;
Another common thread is that a lot of these practices come from other, older industries that have had time to mature.  It's high time we applied them to the startup world.
&lt;/p&gt;

&lt;h3&gt;What is Data-Driven Development?&lt;/h3&gt;
&lt;p&gt;
Data-Driven Development is centered around the belief that business decisions &amp;mdash; whether technical, artistic, or financial &amp;mdash; are best made based on what is &lt;em&gt;actually happening&lt;/em&gt; rather than your personal model of the world.
&lt;/p&gt;

&lt;p&gt;
Of course, everyone agrees with that. The problem is that everyone always believes their version of the world is the correct one.  It's not, at least not all the time.  How do you know when you're correct and when you're not?
&lt;/p&gt;

&lt;p&gt;
Applying the principles of Data-Driven Development helps you understand what actually works and what doesn't.
&lt;/p&gt;

&lt;h3&gt;Principles of Data-Driven Development&lt;/h3&gt;
&lt;p&gt;
Three key principles are as follows:
&lt;dl&gt;
	&lt;dt&gt;Everyone Is Biased&lt;/dt&gt;
	&lt;dd&gt;Decisions should be made through the lens of empiricism rather than the lens of intuition.&lt;/dd&gt;
	
	&lt;dt&gt;Universal Instrumentation&lt;/dt&gt;
	&lt;dd&gt;Without visibility you can't tell when you're succeeding &amp;mdash; or failing.&lt;/dd&gt;
	
	&lt;dt&gt;No Sacred Cows&lt;/dt&gt;
	&lt;dd&gt;The most dangerous beliefs are the ones held universally.  Test (and measure) everything.&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;h3&gt;Why Data-Driven Development?&lt;/h3&gt;
&lt;p&gt;
Business decisions center around risk/reward calculations.  Data-Driven Development gives you clearer pictures of both.  The best decisions an early-stage startup can make are the high-yield, low-risk ones, i.e., the low-hanging fruit.  Without real data you won't know what tree you're picking from, let alone what fruit.
&lt;/p&gt;

&lt;p&gt;
Also, it's worth noting that Data-Driven Development isn't about any single practice, at least not in the same way that Agile is associated with things like XP and Scrum.  It is pragmatic and understands its own limitations.  Data-Driven Development won't help you know what to value, nor does it aim to make perfect decisions &amp;mdash; only calculated ones.
&lt;/p&gt;

&lt;h3&gt;Practical Examples and the Future of 20bits&lt;/h3&gt;
&lt;p&gt;
As you can see, the byline of this blog has changed to "Driven by Data."  I'm going to refocus this blog on Data-Driven Development.  Here are some topics you can expect to read about:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Case studies to illustrate Data-Driven Development&lt;/li&gt;
	&lt;li&gt;Tutorials on probability, statistics, and data analysis&lt;/li&gt;
	&lt;li&gt;A/B and multivariate testing&lt;/li&gt;
	&lt;li&gt;Metrics, their use (and misuse)&lt;/li&gt;
	&lt;li&gt;Operational problems like data warehousing and extraction&lt;/li&gt;
	&lt;li&gt;...and so much more!&amp;trade;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The rule I've written for myself is that every post has to involve at least one of code, data, or insight.  Hopefully this one counts towards the last.  And in the spirit of Data-Driven Development, please drop me a line and tell me directly what you want to hear about.
&lt;/p&gt;

&lt;p&gt;
Cheers, and happy coding!
&lt;/p&gt;</description>
      <pubDate>Wed, 01 Oct 2008 07:00:08 +0000</pubDate>
      <link>http://20bits.com/article/data-driven-development</link>
    </item>
    <item>
      <title>Erlang: A Generalized TCP Server</title>
      <description>&lt;p&gt;
In my last few articles about Erlang we've covered the basics of &lt;a href="/article/network-programming-in-erlang/"&gt;network programming&lt;/a&gt; with &lt;tt&gt;gen_tcp&lt;/tt&gt; and Erlang/OTP's &lt;a href="http://20bits.com/articles/erlang-a-generic-server-tutorial"&gt;gen_server&lt;/a&gt;, or generic server, module.  Let's combine the two.

&lt;/p&gt;

&lt;p&gt;
In most people's minds "server" means network server, but Erlang uses the terminology in the most abstract sense.  &lt;tt&gt;gen_server&lt;/tt&gt; is really a server that operates using Erlang's message passing as its base protocol.  We can graft a TCP server onto that framework, but it requires some work.
&lt;/p&gt;

&lt;h3&gt;The Structure of a Network Server&lt;/h3&gt;
&lt;p&gt;
Most network servers have a similar architecture.  First they create a listening socket that waits for incoming connections.  They then enter an accept state in which they loop until termination, accepting each new connection as it arrives and starting the real client/server work.
&lt;/p&gt;

&lt;p&gt;
To see this in action recall the simple echo server from my network programming article:
&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;-module(echo).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-export([listen/1]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call echo:listen(Port) to start the service.
listen(Port) -&gt;
    {ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
    accept(LSocket).

% Wait for incoming connections and spawn the echo loop when we get one.
accept(LSocket) -&gt;
    {ok, Socket} = gen_tcp:accept(LSocket),
    spawn(fun() -&gt; loop(Socket) end),
    accept(LSocket).

% Echo back whatever data we receive on Socket.
loop(Socket) -&gt;
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} -&gt;
            gen_tcp:send(Socket, Data),
            loop(Socket);
        {error, closed} -&gt;
            ok
    end.&lt;/pre&gt;

&lt;p&gt;
As you can see, &lt;tt&gt;listen&lt;/tt&gt; creates a listening socket and immediately calls &lt;tt&gt;accept&lt;/tt&gt;.  This waits for an incoming connection, spawns a new worker (&lt;tt&gt;loop&lt;/tt&gt;) that does the real work, and then waits for the next incoming connection.
&lt;/p&gt;

&lt;p&gt;
In this code the parent process owns both the listen socket and the accept loop.  As we'll see this doesn't work so well when we try to integrate the accept/listen loop with &lt;tt&gt;gen_server&lt;/tt&gt;.
&lt;/p&gt;

&lt;h3&gt;Abstracting The Network Server&lt;/h3&gt;
&lt;p&gt;
Network servers come in two parts: connection handling and business logic.  As I described above, the connection handling is basically the same for every network server.  Ideally we'd be able to do something like
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(my_server).
start(Port) -&gt;
	connection_handler:start(my_server, Port, business_logic).

business_logic(Socket) -&gt;
	% Read data from the network socket and do our thang.&lt;/pre&gt;

&lt;p&gt;
Let's go ahead and do just this.
&lt;/p&gt;

&lt;h3&gt;Implementing A Generic Network Server&lt;/h3&gt;
&lt;p&gt;
The problem with implementing a network server using &lt;tt&gt;gen_server&lt;/tt&gt; is that the call to &lt;tt&gt;gen_tcp:accept&lt;/tt&gt; is blocking.  If we were to call this in the server's initialization routine, for example, the whole &lt;tt&gt;gen_server&lt;/tt&gt; mechanism would block until a client connected.
&lt;/p&gt;

&lt;p&gt;
There are two ways to get around this.  One involves using a lower-level connection mechanism that supports non-blocking (or asynchronous) accepting.  There is then a whole family of functions, most notably &lt;tt&gt;gen_tcp:controlling_process&lt;/tt&gt;, that helps you manage who receives what messages when clients connect.
&lt;/p&gt;

&lt;p&gt;
A simpler and, in my opinion, more elegant solution is to have a single process that owns the listening socket.  This process does two things: spawns new acceptors and listens for "connection received" messages.  When it receives a message it knows to spawn a new acceptor.&lt;/p&gt;

&lt;p&gt;
An acceptor is free to call the blocking &lt;tt&gt;gen_tcp:accept&lt;/tt&gt; since it's running in its own process.  When it receives a connection it fires an asynchronous message back to the parent process and immediately calls the business logic function.
&lt;/p&gt;

&lt;p&gt;
Here's the code.  I've commented where appropriate, so hopefully it's readable.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(socket_server).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-behavior(gen_server).

-export([init/1, code_change/3, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).
-export([accept_loop/1]).
-export([start/3]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

-record(server_state, {
		port,
		loop,
		ip=any,
		lsocket=null}).

start(Name, Port, Loop) -&gt;
	State = #server_state{port = Port, loop = Loop},
	gen_server:start_link({local, Name}, ?MODULE, State, []).

init(State = #server_state{port=Port}) -&gt;
	case gen_tcp:listen(Port, ?TCP_OPTIONS) of
   		{ok, LSocket} -&gt;
   			NewState = State#server_state{lsocket = LSocket},
   			{ok, accept(NewState)};
   		{error, Reason} -&gt;
   			{stop, Reason}
	end.

handle_cast({accepted, _Pid}, State=#server_state{}) -&gt;
	{noreply, accept(State)}.

accept_loop({Server, LSocket, {M, F}}) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	% Let the server spawn a new process and replace this loop
	% with the echo loop, to avoid blocking 
	gen_server:cast(Server, {accepted, self()}),
	M:F(Socket).
	
% To be more robust we should be using spawn_link and trapping exits
accept(State = #server_state{lsocket=LSocket, loop = Loop}) -&gt;
	proc_lib:spawn(?MODULE, accept_loop, [{self(), LSocket, Loop}]),
	State.

% These are just here to suppress warnings.
handle_call(_Msg, _Caller, State) -&gt; {noreply, State}.
handle_info(_Msg, Library) -&gt; {noreply, Library}.
terminate(_Reason, _Library) -&gt; ok.
code_change(_OldVersion, Library, _Extra) -&gt; {ok, Library}.&lt;/pre&gt;

&lt;p&gt;
We use &lt;tt&gt;gen_server:cast&lt;/tt&gt; to pass asynchronous messages back to the listening process.  When the listening process receives the message &lt;tt&gt;accepted&lt;/tt&gt; it spawns a new acceptor.
&lt;/p&gt;

&lt;p&gt;
Right now this server is not very robust because if the active acceptor fails, for whatever reason, the server will stop accepting connections.  To make it more OTP-like we should be trapping exits and firing off a new acceptor in the event that a connection fails.
&lt;/p&gt;
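&lt;p&gt;
As a rough sketch of what that could look like (untested, and not part of the module above): trap exits in &lt;tt&gt;init&lt;/tt&gt;, spawn acceptors with &lt;tt&gt;spawn_link&lt;/tt&gt;, and restart the accept loop from &lt;tt&gt;handle_info&lt;/tt&gt; when an acceptor dies abnormally:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% Hypothetical hardening of socket_server.
init(State = #server_state{port=Port}) -&gt;
	process_flag(trap_exit, true),  % exit signals arrive as messages
	case gen_tcp:listen(Port, ?TCP_OPTIONS) of
		{ok, LSocket} -&gt;
			{ok, accept(State#server_state{lsocket = LSocket})};
		{error, Reason} -&gt;
			{stop, Reason}
	end.

% Link to each acceptor so we hear about its death.
accept(State = #server_state{lsocket=LSocket, loop = Loop}) -&gt;
	proc_lib:spawn_link(?MODULE, accept_loop, [{self(), LSocket, Loop}]),
	State.

% A normal exit means the client loop finished; an abnormal one means
% the acceptor crashed, so we spawn a replacement.
handle_info({'EXIT', _Pid, normal}, State) -&gt;
	{noreply, State};
handle_info({'EXIT', _Pid, _Reason}, State) -&gt;
	{noreply, accept(State)};
handle_info(_Msg, State) -&gt;
	{noreply, State}.&lt;/pre&gt;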

&lt;h3&gt;A "Generic" Echo Server&lt;/h3&gt;
&lt;p&gt;
The echo server is the easiest server to write, so let's do it using our new abstract socket server.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(echo_server).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([start/0, loop/1]).

% echo_server specific code
start() -&gt;
	socket_server:start(?MODULE, 7000, {?MODULE, loop}).
loop(Socket) -&gt;
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} -&gt;
            gen_tcp:send(Socket, Data),
            loop(Socket);
        {error, closed} -&gt;
            ok
    end.&lt;/pre&gt;

&lt;p&gt;
As you can see the "server" becomes nothing more than its business logic.  The connection handling has been generalized and pushed off into its own &lt;tt&gt;socket_server&lt;/tt&gt;.  The loop in our generic server is actually identical to the loop in our original echo server, too.
&lt;/p&gt;
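&lt;p&gt;
To try it out, compile both modules and start the server from the Erlang shell (the port, 7000, is hard-coded in &lt;tt&gt;echo_server:start/0&lt;/tt&gt; above; the pid below is illustrative):
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;1&gt; c(socket_server).
{ok,socket_server}
2&gt; c(echo_server).
{ok,echo_server}
3&gt; echo_server:start().
{ok,&lt;0.40.0&gt;}&lt;/pre&gt;

&lt;p&gt;
Then &lt;tt&gt;telnet localhost 7000&lt;/tt&gt; from another terminal and whatever you type should be echoed back.
&lt;/p&gt;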

&lt;p&gt;
Hopefully you all can learn from this as much as I did.  I finally feel like I'm starting to understand Erlang.
&lt;/p&gt;

&lt;p&gt;
Also, feel free to leave a comment, especially if you have any thoughts on how I can improve my code.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 16 Jun 2008 06:00:28 +0000</pubDate>
      <link>http://20bits.com/article/erlang-a-generalized-tcp-server</link>
    </item>
    <item>
      <title>Erlang: An Introduction to Records</title>
      <description>&lt;p&gt;
Erlang has only two compound data types: lists and tuples.  Neither supports named access, so creating associative arrays a la PHP, Ruby, or Python is impossible without additional libraries.
&lt;/p&gt;

&lt;p&gt;
That is, in Ruby, I could do: &lt;/p&gt;
&lt;pre class="brush: ruby"&gt;server_opts = {:port =&gt; 8080, :ip =&gt; '127.0.0.1', :max_connections =&gt; 10}&lt;/pre&gt;

&lt;p&gt;
while in Erlang there's no such support at the language (syntax) level.
&lt;/p&gt;

&lt;p&gt;
To get around this limitation Erlang provides a pseudo data type called &lt;em&gt;records&lt;/em&gt;.  Records support named access, with some cruft.  We'll see why I call them "pseudo" data types later on.
&lt;/p&gt;

&lt;h3&gt;Defining Records&lt;/h3&gt;
&lt;p&gt;
Records are more similar to &lt;tt&gt;structs&lt;/tt&gt; in C than to associative arrays in that they require you to define their contents up front and they can only hold data.  Here's an example record that stores connection options for a server of some kind.&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;-module(my_server).

-record(server_opts,
	{port,
	ip="127.0.0.1",
	max_connections=10}).

% The rest of your code goes here.&lt;/pre&gt;

&lt;p&gt;
Records are defined using the &lt;tt&gt;-record&lt;/tt&gt; directive.  The first parameter is the name of the record and the second parameter is a tuple that contains the fields in the record and their default values.
&lt;/p&gt;

&lt;p&gt;
In our case we've defined a &lt;tt&gt;server_opts&lt;/tt&gt; record that has three fields: a port, a binding IP, and the maximum number of connections allowed.  There is no default port, but the default value of &lt;tt&gt;ip&lt;/tt&gt; is "127.0.0.1" and the default value of &lt;tt&gt;max_connections&lt;/tt&gt; is 10.
&lt;/p&gt;

&lt;h3&gt;Creating Records&lt;/h3&gt;
&lt;p&gt;
Records are created by using the hash (#) symbol.  Using the &lt;tt&gt;server_opts&lt;/tt&gt; record from above the following are all valid ways to create a record.&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;Opts1 = #server_opts{port=80}.&lt;/pre&gt;

&lt;p&gt;
This creates a &lt;tt&gt;server_opts&lt;/tt&gt; record with &lt;tt&gt;port&lt;/tt&gt; set to 80.  The other fields have their default value.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;Opts2 = #server_opts{port=80, ip="192.168.0.1"}.&lt;/pre&gt;
&lt;p&gt;This creates a &lt;tt&gt;server_opts&lt;/tt&gt; record like the one above, except now &lt;tt&gt;ip&lt;/tt&gt; is set to "192.168.0.1".&lt;/p&gt;

&lt;p&gt;
In short, when creating a record you can include whatever fields you like.  Omitted fields will take on their default value.
&lt;/p&gt;

&lt;h3&gt;Accessing Records&lt;/h3&gt;
&lt;p&gt;
Accessing records is clumsy and it's where they start to reveal their cruft.  If I want to access the &lt;tt&gt;port&lt;/tt&gt; field I can do&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;Opts = #server_opts{port=80, ip="192.168.0.1"},
Opts#server_opts.port&lt;/pre&gt;

&lt;p&gt;
Yep, that's right, any time you want to access a record you have to include the record's name.  Why?  Because records aren't really internal data types, they're a compiler trick.
&lt;/p&gt;

&lt;p&gt;
Internally records are tuples that look something like this:&lt;/p&gt;&lt;pre class="brush: erlang"&gt;{server_opts, 80, "127.0.0.1", 10}&lt;/pre&gt;

&lt;p&gt;
The compiler maps the named fields to their position in the tuple.
&lt;/p&gt;

&lt;p&gt;
Record definitions are tracked at compile time: the compiler translates all the record logic into tuple logic when it compiles your Erlang program.  That is, there is no record "type" at runtime, so you have to tell Erlang which record you're talking about every time you access one.
&lt;/p&gt;
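&lt;p&gt;
To see this for yourself, here's an illustrative function (it assumes the &lt;tt&gt;server_opts&lt;/tt&gt; definition from above is in scope) that returns &lt;tt&gt;true&lt;/tt&gt;, because the record and the tagged tuple are one and the same term:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;records_are_tuples() -&gt;
	Opts = #server_opts{port=80},
	% The record matches its underlying tuple, field for field.
	{server_opts, 80, "127.0.0.1", 10} =:= Opts.&lt;/pre&gt;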

&lt;h3&gt;Updating Records&lt;/h3&gt;
&lt;p&gt;
Updating records works much like creating records.  For example, &lt;/p&gt; &lt;pre class="brush: erlang"&gt;Opts = #server_opts{port=80, ip="192.168.0.1"},
NewOpts = Opts#server_opts{port=7000}.&lt;/pre&gt;

&lt;p&gt;
would first create a &lt;tt&gt;server_opts&lt;/tt&gt; record.  &lt;tt&gt;NewOpts = Opts#server_opts{port=7000}&lt;/tt&gt; then creates a copy of &lt;tt&gt;Opts&lt;/tt&gt; with a port number of 7000 rather than 80 and binds it to
&lt;tt&gt;NewOpts&lt;/tt&gt;.&lt;/p&gt;

&lt;h3&gt;Matching Records and Guard Statements&lt;/h3&gt;
&lt;p&gt;
This wouldn't be a tutorial about Erlang unless we talked about pattern matching.  Let's say we want to do something particular with a server if it is running on port 8080 and something else otherwise.&lt;/p&gt; &lt;pre class="brush: erlang"&gt;handle(_Opts = #server_opts{port=8080}) -&gt;
	% do special port 8080 stuff
	special;
handle(_Opts = #server_opts{}) -&gt;
	% default stuff
	default.&lt;/pre&gt;

&lt;p&gt;
Guard statements work similarly.  For example, binding to a port below 1024 typically requires root access, so we might want to special-case that:
&lt;/p&gt;
&lt;pre class="brush: erlang"&gt;handle(Opts) when Opts#server_opts.port &lt;= 1024 -&gt;
	% requires root access
handle(Opts=#server_opts{}) -&gt;
	% Doesn't require root access&lt;/pre&gt;

&lt;h3&gt;Using Records&lt;/h3&gt;
&lt;p&gt;
In my limited time using Erlang I've seen records used primarily for two things.  First, records are used to keep state, especially when using the generic server behaviour.  Since Erlang is side-effect free, state cannot be kept globally; instead it must be passed around from function to function.
&lt;/p&gt;

&lt;p&gt;
Second, and perhaps as a subset of the first, records are used to keep track of configuration options.
&lt;/p&gt;

&lt;p&gt;
There are limitations to records, however.  Most notably, you can't add or remove fields on the fly; like C structs, the structure of the record is defined beforehand.
&lt;/p&gt;

&lt;p&gt;
If you want to add and remove fields on the fly, or if you don't know what fields you'll have until runtime, you should use &lt;a href="http://www.erlang.org/doc/man/dict.html"&gt;dicts&lt;/a&gt; rather than records.
&lt;/p&gt;
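&lt;p&gt;
For comparison, here's a sketch of the server options built as a dict instead (the key names are my own invention):
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% No -record directive needed; keys can be decided at runtime.
Opts0 = dict:from_list([{port, 8080}, {ip, "127.0.0.1"}]),
% Fields can be added (or replaced) on the fly...
Opts1 = dict:store(max_connections, 10, Opts0),
% ...and read back by name.
{ok, Port} = dict:find(port, Opts1).&lt;/pre&gt;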

&lt;h3&gt;Further Reading&lt;/h3&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;a href="http://www.erlang.org/doc/reference_manual/records.html"&gt;Records Reference Manual&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="http://www.erlang.org/doc/programming_examples/records.html"&gt;Records Programming Examples&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="http://damienkatz.net/2008/03/what_sucks_abou.html"&gt;What Sucks About Erlang&lt;/a&gt; (includes section on records)&lt;/li&gt;
&lt;/ul&gt;</description>
      <pubDate>Sun, 15 Jun 2008 07:00:17 +0000</pubDate>
      <link>http://20bits.com/article/erlang-an-introduction-to-records</link>
    </item>
    <item>
      <title>Erlang: A Generic Server Tutorial</title>
      <description>&lt;p&gt;
One of the benefits of working with Erlang is that it was designed with real-world applications in mind.  This is reflected in OTP, the Open Telecom Platform, a set of standard libraries that ships with the default Erlang VM.
&lt;/p&gt;

&lt;p&gt;
Erlang/OTP implements lots of networking paradigms in a generic way, including finite state machines (&lt;a href="http://www.erlang.org/doc/design_principles/fsm.html"&gt;&lt;tt&gt;gen_fsm&lt;/tt&gt;&lt;/a&gt;), event handling (&lt;a href="http://www.erlang.org/doc/design_principles/events.html"&gt;&lt;tt&gt;gen_event&lt;/tt&gt;&lt;/a&gt;), and client/server interaction (&lt;a href="http://www.erlang.org/doc/design_principles/gen_server_concepts.html"&gt;&lt;tt&gt;gen_server&lt;/tt&gt;&lt;/a&gt;).  We're going to cover the last of these, &lt;tt&gt;gen_server&lt;/tt&gt;, Erlang/OTP's generic server library.
&lt;/p&gt;

&lt;h3&gt;The Client/Server Model&lt;/h3&gt;
&lt;p&gt;
The client/server model is based around many clients connecting to a single, central server.  The clients can send messages to and receive messages from the server while the server maintains a global state.
&lt;/p&gt;

&lt;p&gt;
Here's a picture. &lt;img src="http://assets.20bits.com/20080609/client-server.png" alt="" title="client-server" width="305" height="188" class="math size-full wp-image-157" /&gt;
&lt;/p&gt;

&lt;p&gt;
A common instance where the client/server model makes sense is when you have some resource you want to distribute among several people.  The server controls access and allocation of the resource and the clients consume it.
&lt;/p&gt;

&lt;h3&gt;The Code&lt;/h3&gt;
&lt;p&gt;
Code speaks louder than words, so without further ado here is a simple server that simulates a library.  People can check out and return books from the library, but there's only one copy of each book.
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;-module(library).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').
-behaviour(gen_server).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

-export([start/0, checkout/2, lookup/1, return/1]).

% These are all wrappers for calls to the server
start() -&gt; gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
checkout(Who, Book) -&gt; gen_server:call(?MODULE, {checkout, Who, Book}).	
lookup(Book) -&gt; gen_server:call(?MODULE, {lookup, Book}).
return(Book) -&gt; gen_server:call(?MODULE, {return, Book}).

% This is called when the server is started
init([]) -&gt;
	Library = dict:new(),
	{ok, Library}.

% handle_call is invoked in response to gen_server:call
handle_call({checkout, Who, Book}, _From, Library) -&gt;
	Response = case dict:is_key(Book, Library) of
		true -&gt;
			NewLibrary = Library,
			{already_checked_out, Book};
		false -&gt;
			NewLibrary = dict:append(Book, Who, Library),
			ok
	end,
	{reply, Response, NewLibrary};

handle_call({lookup, Book}, _From, Library) -&gt;
	Response = case dict:is_key(Book, Library) of
		true -&gt;
			{who, lists:nth(1, dict:fetch(Book, Library))};
		false -&gt;
			{not_checked_out, Book}
	end,
	{reply, Response, Library};

handle_call({return, Book}, _From, Library) -&gt;
	NewLibrary = dict:erase(Book, Library),
	{reply, ok, NewLibrary};

handle_call(_Message, _From, Library) -&gt;
	{reply, error, Library}.

% We get compile warnings from gen_server unless we define these
handle_cast(_Message, Library) -&gt; {noreply, Library}.
handle_info(_Message, Library) -&gt; {noreply, Library}.
terminate(_Reason, _Library) -&gt; ok.
code_change(_OldVersion, Library, _Extra) -&gt; {ok, Library}.&lt;/pre&gt;

&lt;h3&gt;Breaking It Down&lt;/h3&gt;
&lt;p&gt;
The first line of interest is &lt;tt&gt;-behaviour(gen_server).&lt;/tt&gt;  This tells Erlang that we'll be using the &lt;tt&gt;gen_server&lt;/tt&gt; module for our behaviour.
&lt;/p&gt;

&lt;p&gt;
Next we implement wrappers for server calls.  We start the library server by calling &lt;tt&gt;library:start/0&lt;/tt&gt;, which in turn calls &lt;tt&gt;gen_server:start_link/4&lt;/tt&gt;.
&lt;/p&gt;
&lt;p&gt;
Whatever we pass as the third argument to &lt;tt&gt;start_link/4&lt;/tt&gt; is handed to &lt;tt&gt;init/1&lt;/tt&gt;, the callback that initializes the server's state.  In our case we just want to create a new dictionary to store which books have been checked out.
&lt;/p&gt;

&lt;p&gt;
Once we've started the server we want to be able to check out books, see if a book has been checked out, and return books.  We implement wrappers to handle these functions, each of which invokes &lt;tt&gt;gen_server:call/2&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;gen_server:call&lt;/tt&gt; is used for synchronous communication between the client and the server.  That is, it is used when the server expects a response.  These calls are handled by &lt;tt&gt;handle_call&lt;/tt&gt; (big surprise, huh?).
&lt;/p&gt;
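&lt;p&gt;
For contrast, &lt;tt&gt;gen_server:cast/2&lt;/tt&gt; is the asynchronous counterpart: it sends a message without waiting for a reply.  As a sketch (not part of the library module above), a fire-and-forget version of &lt;tt&gt;return&lt;/tt&gt; might look like this:
&lt;/p&gt;

&lt;pre class="brush: erlang"&gt;% The caller gets ok back immediately, before the server has acted.
return_async(Book) -&gt; gen_server:cast(?MODULE, {return, Book}).

% Handled by handle_cast rather than handle_call; note there's no reply.
handle_cast({return, Book}, Library) -&gt;
	{noreply, dict:erase(Book, Library)}.&lt;/pre&gt;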

&lt;p&gt;
All of the meat is in the &lt;tt&gt;handle_call&lt;/tt&gt; definitions.  As you can see the server understands three messages: &lt;tt&gt;checkout&lt;/tt&gt;, &lt;tt&gt;lookup&lt;/tt&gt;, and &lt;tt&gt;return&lt;/tt&gt;.  We have one definition of &lt;tt&gt;handle_call&lt;/tt&gt; for each possible message and a default action that returns an error when it receives a message it doesn't understand.
&lt;/p&gt;

&lt;p&gt;
Here's an example of how you'd actually use the library server.  All of the commands are executed in the Erlang shell, &lt;tt&gt;erl&lt;/tt&gt;.
&lt;pre class="brush: erlang"&gt;1&gt; c(library).
{ok,library}
2&gt; library:start().
{ok,&lt;0.39.0&gt;}
3&gt; library:checkout(jesse, "American Creation").
ok
4&gt; library:lookup("American Creation").
{who,jesse}
5&gt; library:checkout(james, "American Creation").
{already_checked_out,"American Creation"}
6&gt; library:return("American Creation").
ok
7&gt; library:checkout(james, "American Creation").
ok&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;Other Goodies and Caveats&lt;/h3&gt;
&lt;p&gt;
Writing code with gen_server isn't all academic.  There are real benefits.
&lt;/p&gt;

&lt;h4&gt;Abstraction&lt;/h4&gt;
&lt;p&gt;
The greatest benefit of gen_server is the abstraction it provides.  By encapsulating the essence of the client/server model we can focus on the business logic rather than low-level event management.
&lt;/p&gt;

&lt;p&gt;
More importantly, however, it abstracts away the &lt;em&gt;protocol&lt;/em&gt;.  The code behind the scenes can change without affecting the client/server behavior.
&lt;/p&gt;

&lt;h4&gt;Supervision&lt;/h4&gt;
&lt;p&gt;
Although we don't make use of it here, gen_server supports supervision behaviors.  If a call throws an exception the server can capture it and restart the appropriate section of code.  This is handled using &lt;tt&gt;handle_info&lt;/tt&gt;.  This becomes more important if the server is spawning additional processes.
&lt;/p&gt;

&lt;h4&gt;Code Swapping&lt;/h4&gt;
&lt;p&gt;
We don't make use of this either, but gen_server supports hot code swapping using the &lt;tt&gt;code_change&lt;/tt&gt; callback.  This is one place where Erlang really shines, and gen_server carries it through to the client/server model.
&lt;/p&gt;

&lt;h4&gt;Caveats&lt;/h4&gt;
&lt;p&gt;
It's not all awesome, though.  It's surprisingly tricky to write gen_server code that handles TCP/IP connections.  I'll give an example of mixing networking and gen_server in a future article, but there are all sorts of control and blocking issues that have to be dealt with.
&lt;/p&gt;

&lt;p&gt;
Leave a comment if you have any cool &lt;tt&gt;gen_server&lt;/tt&gt; examples out there.
&lt;/p&gt;</description>
      <pubDate>Mon, 09 Jun 2008 07:00:33 +0000</pubDate>
      <link>http://20bits.com/article/erlang-a-generic-server-tutorial</link>
    </item>
    <item>
      <title>Politics and Tufte's Lie Factor</title>
      <description>&lt;p&gt;
I admit it, I'm a political junkie.  I'm also a math guy who loves design.  Politics gets emotional fast and people are quick to stretch whatever data they have to fit their small, partisan aims.
&lt;/p&gt;

&lt;p&gt;
Pundits and partisans misuse statistics all the time, but I happened upon a real gem that perfectly illustrates Edward Tufte's "Lie Factor."
&lt;/p&gt;

&lt;h3&gt;The Lie Factor&lt;/h3&gt;
&lt;p&gt;
In 1983 Edward Tufte wrote a book called &lt;em&gt;The Visual Display of Quantitative Information&lt;/em&gt;.  In it he states the following principle: &lt;blockquote&gt;The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The Lie Factor measures the extent to which a graph violates this principle.  Mathematically it can be stated as follows: &lt;img src="http://assets.20bits.com/20080525/result.png" alt="" title="result" width="395" height="44" class="math size-full wp-image-146" /&gt;
&lt;/p&gt;

&lt;p&gt;
The Lie Factor should be between 0.95 and 1.05.  If it falls outside that range then either the graph's creator didn't know what they were doing or they were intentionally trying to distort the facts.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update&lt;/strong&gt;: I realized after experimenting with Excel that Jay's graph looks the way it does because that's the Excel default.  Stupid on Excel's part, but it's still careless not to notice.
&lt;/p&gt;

&lt;h3&gt;The Culprit&lt;/h3&gt;
&lt;p&gt;
On Friday &lt;a href="http://www.realclearpolitics.com/horseraceblog/2008/05/a_review_of_obamas_voting_coal.html"&gt;Jay Cost&lt;/a&gt; over at Real Clear Politics made a post entitled "A Review of Obama's Voting Coalition."  It contained no commentary, only six graphs.  Here's the fifth graph: &lt;img src="http://assets.20bits.com/20080525/votes-per-pledged-delegate.gif" alt="" title="votes-per-pledged-delegate" width="499" height="362" class="math size-full wp-image-147" /&gt;
&lt;/p&gt;

&lt;p&gt;
For those who don't know, the Democrats nominate their candidate based on the number of delegates.  Most states allocate their delegates proportionally based on the popular vote in each congressional district.  One side-effect of this is that a vote in a sparsely populated congressional district can be worth more delegates than one in a densely populated congressional district.
&lt;/p&gt;

&lt;p&gt;
But looking at this graph I was taken aback.  Is it really true that each vote received by Obama was worth three times as many delegates as a vote received by Clinton?  Take a closer look, though: the "zero point" on the graph is not zero but 10,200.  The absolute difference is the same but the relative difference is skewed.  Lie factor!
&lt;/p&gt;

&lt;p&gt;
To see why this matters look at the corrected graph. &lt;img src="http://assets.20bits.com/20080525/untitled3.png" alt="" title="untitled3" width="496" height="363" class="math size-full wp-image-148" /&gt;
&lt;/p&gt;

&lt;p&gt;
As you can see the difference is much less stark.
&lt;/p&gt;

&lt;h3&gt;The Effect&lt;/h3&gt;
&lt;p&gt;
It looks like Clinton gets around 11,750 votes per delegate and Obama gets around 10,800.  This is around a 13.2% difference in the data.
&lt;/p&gt;

&lt;p&gt;
The size of the effect on the graph, however, shows a 61.3% difference between the two numbers.  That's a Lie Factor of around &lt;strong&gt;4.64&lt;/strong&gt;!  Someone needs to review their Tufte.
&lt;/p&gt;
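&lt;p&gt;
To make the arithmetic explicit, plugging the two percentages into Tufte's formula gives:
&lt;/p&gt;
&lt;pre&gt;Lie Factor = (effect shown in graphic) / (effect in data)
           = 61.3% / 13.2%
           &amp;asymp; 4.64&lt;/pre&gt;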

&lt;h3&gt;The Echo Chamber&lt;/h3&gt;
&lt;p&gt;
One reason I don't like the political blogosphere is that it's totally predictable.  The same characters say and act the same way, all the time.  They may as well be giving out advance copies of their script.
&lt;/p&gt;

&lt;p&gt;
So, of course, &lt;a href="http://www.mydd.com/story/2008/5/25/5473/21652"&gt;Jerome Armstrong&lt;/a&gt;, the creator of MyDD and a vociferous Clinton supporter, placed this graph on the front page of his site without a hint of irony or self-reflection.  He didn't even bother to analyze the graph and see if it really said what he thought it did. 
&lt;/p&gt;

&lt;p&gt;
This is a great example of another phenomenon called &lt;a href="http://en.wikipedia.org/wiki/Confirmation_bias"&gt;confirmation bias&lt;/a&gt;, where people search out or skew information so that it conforms to their currently held beliefs.  In this case, Jerome just blindly posted a highly misleading graph because it supported his thesis that Clinton should be the Democratic nominee.
&lt;/p&gt;

&lt;p&gt;
It's a comedy of errors, to be sure, but at least we can learn what &lt;em&gt;not&lt;/em&gt; to do if we don't want to make ourselves look clueless.
&lt;/p&gt;


&lt;p&gt;
P.S., &lt;a href="http://www.math.yorku.ca/SCS/Gallery/lie-factor.html"&gt;this website&lt;/a&gt; has a list of the graph examples that Tufte himself used to illustrate the Lie Factor principle.  Check it out.
&lt;/p&gt;
      <pubDate>Sun, 25 May 2008 10:57:58 +0000</pubDate>
      <link>http://20bits.com/article/politics-and-tuftes-lie-factor</link>
    </item>
    <item>
      <title>An EventMachine Tutorial</title>
      <description>&lt;p&gt;
&lt;a href="http://rubyeventmachine.com/"&gt;Ruby / EventMachine&lt;/a&gt; is an event-driven networking library for Ruby, similar to &lt;a href="http://twistedmatrix.com/"&gt;Twisted&lt;/a&gt; for Python.  Certain aspects of it are also similar to Erlang/OTP's &lt;a href="http://www.erlang.org/doc/man/gen_server.html"&gt;gen_server&lt;/a&gt; module.
&lt;/p&gt;

&lt;h3&gt;Why EventMachine?&lt;/h3&gt;
&lt;p&gt;
EventMachine satisfies two key requirements.  First, because EventMachine is an implementation of the &lt;a href="http://en.wikipedia.org/wiki/Reactor_pattern"&gt;reactor pattern&lt;/a&gt;, it separates networking logic from application logic.  This means you don't have to worry about handling the low-level connections and socket logic.  Instead, you just implement callbacks for the appropriate networking events.
&lt;/p&gt;

&lt;p&gt;
Second, EventMachine is lightweight and supports system-level networking primitives.  This means that Ruby's speed issues aren't a problem: the performance-critical stuff is in C/C++ and it uses the best your OS has to offer (e.g., epoll in Linux).
&lt;/p&gt;

&lt;p&gt;
In short, EventMachine makes it really easy to write scalable networking services in Ruby.  And who doesn't want to do that?
&lt;/p&gt;

&lt;h3&gt;Installing EventMachine&lt;/h3&gt;
&lt;p&gt;
EventMachine comes in a Ruby gem called &lt;tt&gt;eventmachine&lt;/tt&gt;.  Just run &lt;tt&gt;gem install eventmachine&lt;/tt&gt; to get the ball rolling.
&lt;/p&gt;
&lt;p&gt;
Note that EventMachine requires a C++ compiler on your system to build the native extensions.
&lt;/p&gt;

&lt;h3&gt;Using EventMachine&lt;/h3&gt;
&lt;p&gt;
I learn by example, so let's dive in.
&lt;/p&gt;

&lt;h4&gt;echo service&lt;/h4&gt;
&lt;p&gt;
The echo service is a traditional UNIX service that accepts incoming network connections and echoes back whatever the client sends, byte-for-byte.  With EventMachine it's really simple.
&lt;/p&gt;
&lt;pre class="brush: ruby"&gt;#!/usr/bin/env ruby

require 'rubygems'
require 'eventmachine'

module EchoServer  
  def receive_data(data)
    send_data(data)
  end
end

EventMachine::run do
  host = '0.0.0.0'
  port = 8080
  EventMachine::start_server host, port, EchoServer
  puts "Started EchoServer on #{host}:#{port}..."
end&lt;/pre&gt;

&lt;p&gt;
Running the above code will start an echo daemon listening on port 8080 for all incoming connections.  Let's break it down.
&lt;/p&gt;

&lt;p&gt;
First look at the bottom of the code, where we call &lt;tt&gt;EventMachine::run&lt;/tt&gt;.  This starts the event loop, which runs until we explicitly call &lt;tt&gt;EventMachine::stop_event_loop&lt;/tt&gt;.
It takes a block in which we typically start whatever clients or servers will live inside that loop.
&lt;/p&gt;

&lt;p&gt;
In our case we start our echo service using &lt;tt&gt;EventMachine::start_server&lt;/tt&gt;.  The first two parameters are the host and port.  The combination 0.0.0.0:8080 means that EchoServer will listen for connections on port 8080 from any IP address.
&lt;/p&gt;

&lt;p&gt;
The third parameter is the handler.  Typically the handler is a Ruby module which defines the appropriate callbacks.  This is so we don't pollute the global namespace with callback functions.  EchoServer only defines &lt;tt&gt;receive_data&lt;/tt&gt;, which is called whenever we receive data over a connection.
&lt;/p&gt;

&lt;p&gt;
Finally, in the EchoServer module we call &lt;tt&gt;send_data&lt;/tt&gt; whenever EventMachine invokes &lt;tt&gt;receive_data&lt;/tt&gt;, which sends data over the connection that initiated the callback.
&lt;/p&gt;

&lt;h3&gt;HTTP client&lt;/h3&gt;
&lt;p&gt;
EventMachine can also be used to create clients.  Rather than calling &lt;tt&gt;EventMachine::start_server&lt;/tt&gt; we call &lt;tt&gt;EventMachine::connect&lt;/tt&gt;.  Here's a program that connects to an HTTP server and prints out the headers it receives, EventMachine-style.
&lt;/p&gt;

&lt;pre class="brush: ruby"&gt;#!/usr/bin/env ruby

require 'rubygems'
require 'eventmachine'

module HttpHeaders 
  def post_init
    send_data "GET /\r\n\r\n"
    @data = ""
  end
  
  def receive_data(data)
    @data &lt;&lt; data
  end
  
  def unbind
    if @data =~ /[\n][\r]*[\n]/m
      $`.each_line {|line| puts "&gt;&gt;&gt; #{line}" }
    end
    
    EventMachine::stop_event_loop
  end
end

EventMachine::run do
  EventMachine::connect 'microsoft.com', 80, HttpHeaders
end&lt;/pre&gt;

&lt;p&gt;
If you change 'microsoft.com' to &lt;tt&gt;ARGV[0]&lt;/tt&gt; you can pass in whatever website you'd like on the command line and see what headers it returns.
&lt;/p&gt;

&lt;p&gt;
Here we see a new callback, &lt;tt&gt;post_init&lt;/tt&gt;.  This is called after a connection is set up.  If you're a client this means you've just connected to a server and if you're a server this means a new client has just connected.
&lt;/p&gt;

&lt;p&gt;
We also use the &lt;tt&gt;unbind&lt;/tt&gt; callback, which is called when either end of the connection is closed.  In our case this means the server closed the connection because it has sent us all the data it's going to send.  If you were implementing a server it would mean that a client had disconnected.
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;unbind&lt;/tt&gt; and &lt;tt&gt;post_init&lt;/tt&gt; are complementary.  The former is called whenever a connection is closed and the latter whenever a connection is created.  I'm not sure why they weren't named in a way that implies their relationship, but there you have it.
&lt;/p&gt;

&lt;p&gt;
Those are the three main callbacks: do something when a connection is created, do something when we receive data, and do something when a connection is closed.  Everything else is pretty much a variation on these, plus &lt;tt&gt;send_data&lt;/tt&gt; for sending data.
&lt;/p&gt;

&lt;h3&gt;Further Reading&lt;/h3&gt;
&lt;p&gt;
The mini-tutorial above covers the basics, but EventMachine supports a lot more.  I'd recommend looking over the &lt;a href="http://rubyeventmachine.com/"&gt;official EventMachine website&lt;/a&gt; and the &lt;a href="http://doachristianturndaily.info/eventmachine_rdoc/"&gt;EventMachine RDoc&lt;/a&gt; for more technical details.
&lt;/p&gt;

&lt;p&gt;
There's also an article about using &lt;a href="http://nutrun.com/weblog/distributed-programming-with-jabber-and-eventmachine/"&gt;EventMachine with Jabber&lt;/a&gt; to create a Jabber Bot that's worth reading.
&lt;/p&gt;

&lt;p&gt;
Leave a comment if you have any suggestions or insights.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Wed, 21 May 2008 19:59:46 +0000</pubDate>
      <link>http://20bits.com/article/an-eventmachine-tutorial</link>
    </item>
    <item>
      <title>Facebook Users Just Want Entertainment</title>
      <description>&lt;p&gt;
Starting in late 2007 Facebook began instituting its media strategy in earnest with &lt;a href="http://blog.facebook.com/blog.php?post=6972252130"&gt;Facebook Pages&lt;/a&gt;.  &lt;a href="http://www.facebook.com/business/?pages"&gt;According to Facebook&lt;/a&gt;, Pages offer "a unique experience where users can become more deeply connected with your business or brand."
&lt;/p&gt;

&lt;p&gt;
It's good to see Facebook becoming conscious about how they can help shape the future of branding, since this is where the real money is for social networks.  Let's see how Facebook Pages has evolved since last November.
&lt;/p&gt;

&lt;h3&gt;Factoids&lt;/h3&gt;
&lt;p&gt;
I went into this project without any pre-conceptions of what I would find.  I never really used Facebook pages and wasn't sure if there were any definite conclusions to be found in the data.  At best I thought the numbers might be useful for third parties.  Here are some interesting facts:
&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;As of May 18th, 2008 there are &lt;strong&gt;190,365&lt;/strong&gt; pages and &lt;strong&gt;50,800,399&lt;/strong&gt; fans across all pages.&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;One third&lt;/strong&gt; of all pages are dedicated to &lt;strong&gt;musicians&lt;/strong&gt;, but this category represents &lt;strong&gt;37%&lt;/strong&gt; of all fans.&lt;/li&gt;
	&lt;li&gt;After musicians, the category with the second-largest number of fans is &lt;strong&gt;TV Shows&lt;/strong&gt;, even though it has &lt;strong&gt;3.8 times&lt;/strong&gt; fewer fans than musicians.&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;7.9%&lt;/strong&gt; of pages are in the "other business" category, the largest business category, but only &lt;strong&gt;3.6%&lt;/strong&gt; of fans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;strong&gt;NOTE&lt;/strong&gt;: Facebook changes the copy from "fans" to something type-specific.  For example, politicians don't have "fans" they have "supporters."  I'm going to use fan in the general sense.
&lt;/p&gt;

&lt;p&gt;
It turns out that sports, entertainment, and politics are the three broad categories that perform the best on Facebook, as we'll see below.
&lt;/p&gt;

&lt;h3&gt;Trends&lt;/h3&gt;
&lt;p&gt;
There are two ways to measure the size of a category: one, by the number of pages in that category; two, by the number of fans in that category.  Let's look at both.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/pct-pages.png'&gt;&lt;img src="http://assets.20bits.com/20080519/pct-pages.png" alt="" title="pct-pages" width="274" height="300" class="math size-medium wp-image-141" /&gt;&lt;/a&gt;

&lt;p&gt;
The graph includes the ten largest categories by the number of pages in each, with the remaining 55 categories grouped into one.  The interesting thing is that the graph is divided almost evenly into thirds: the single largest category (musicians), the next nine largest categories, and the remaining 55 categories.
&lt;/p&gt;

&lt;p&gt;
This doesn't really tell us how Facebook Pages are performing by category, only how people are investing in them.  Let's look at the users' side of things.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/pct-fans.png'&gt;&lt;img src="http://assets.20bits.com/20080519/pct-fans.png" alt="" title="pct-fans" width="293" height="300" class="math size-medium wp-image-140" /&gt;&lt;/a&gt;
&lt;p&gt;
There are two things worth noting in this graph.  One, the categories make a qualitative shift towards entertainment and politics and away from general businesses.  Two, the graph becomes even more lop-sided, with musicians taking up almost 40% of the graph and "other" falling to around 20%.
&lt;/p&gt;


&lt;h3&gt;Usage&lt;/h3&gt;
&lt;p&gt;
So, here's a question: which categories fare best?  Let's look at the 100 largest pages by number of fans and see how they break down by category.
&lt;/p&gt;

&lt;a href='http://assets.20bits.com/20080519/top100.png'&gt;&lt;img src="http://assets.20bits.com/20080519/top100.png" alt="" title="top100" width="300" height="219" class="math size-medium wp-image-142" /&gt;&lt;/a&gt;

&lt;p&gt;
The difference here is even more stark.  &lt;strong&gt;48%&lt;/strong&gt; of the top 100 pages are musician pages and &lt;strong&gt;17%&lt;/strong&gt; are for TV shows.  No other category accounts for more than 10% of the top 100.
&lt;/p&gt;

&lt;p&gt;
Let's take a look at how pages are paying off by taking the difference between the percentage of fans in a category and the percentage of pages.  All else being equal we'd expect pages in different categories to have a similar "return on investment."  Anything beyond that can only be explained by how Facebook users interact with pages.
&lt;/p&gt;

&lt;p&gt;
That is, if 10% of all pages are in a category but 15% of all fans are in that same category, we say that category has a 5% "ROI."  This metric allows us to see which categories are most likely to pay off.
&lt;/p&gt;
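&lt;p&gt;
The metric is simple enough to check yourself.  Here's a minimal Python sketch; the figures are the TV Show and Musician rows from the table, and small rounding differences come from the table being computed on unrounded data:
&lt;/p&gt;

```python
# "ROI" of a Facebook Page category: its share of all fans minus
# its share of all pages, in percentage points.
def page_roi(pct_pages, pct_fans):
    return round(pct_fans - pct_pages, 2)

print(page_roi(1.20, 9.65))    # TV Show: 8.45
print(page_roi(34.11, 37.03))  # Musician: 2.92
```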

&lt;center&gt;
&lt;table class="monthly-data"&gt;
	&lt;tr class="top"&gt;
		&lt;th colspan="4"&gt;Facebook Page Categories by ROI&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;th style="width: 15ex;"&gt;Category&lt;/th&gt;
		&lt;th style="width: 9ex;" class="date"&gt;% of Pages&lt;/th&gt;
		&lt;th style="width: 9ex;" class="date"&gt;% of Fans&lt;/th&gt;
		&lt;th style="width: 8ex;" class="date"&gt;ROI&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;TV Show                                   &lt;/td&gt;&lt;td&gt;1.20%&lt;/td&gt;&lt;td&gt;9.65%&lt;/td&gt;&lt;td&gt;8.45%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Film                                      &lt;/td&gt;&lt;td&gt;1.44%&lt;/td&gt;&lt;td&gt;5.75%&lt;/td&gt;&lt;td&gt;4.31%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Musician                                  &lt;/td&gt;&lt;td&gt;34.11%&lt;/td&gt;&lt;td&gt;37.03%&lt;/td&gt;&lt;td&gt;2.93%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Politician                                &lt;/td&gt;&lt;td&gt;1.85%&lt;/td&gt;&lt;td&gt;4.43%&lt;/td&gt;&lt;td&gt;2.57%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Actor                                     &lt;/td&gt;&lt;td&gt;1.48%&lt;/td&gt;&lt;td&gt;3.20%&lt;/td&gt;&lt;td&gt;1.72%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Comedian                                  &lt;/td&gt;&lt;td&gt;0.94%&lt;/td&gt;&lt;td&gt;2.02%&lt;/td&gt;&lt;td&gt;1.09%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Game                                      &lt;/td&gt;&lt;td&gt;0.62%&lt;/td&gt;&lt;td&gt;1.41%&lt;/td&gt;&lt;td&gt;0.78%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Food and Beverage                         &lt;/td&gt;&lt;td&gt;0.91%&lt;/td&gt;&lt;td&gt;1.67%&lt;/td&gt;&lt;td&gt;0.77%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Athlete                                   &lt;/td&gt;&lt;td&gt;0.92%&lt;/td&gt;&lt;td&gt;1.65%&lt;/td&gt;&lt;td&gt;0.72%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr class="odd"&gt;&lt;td class="statistic"&gt;Sports Team                               &lt;/td&gt;&lt;td&gt;1.37%&lt;/td&gt;&lt;td&gt;1.96%&lt;/td&gt;&lt;td&gt;0.59%&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td class="statistic"&gt;Sports / Athletics                        &lt;/td&gt;&lt;td&gt;1.16%&lt;/td&gt;&lt;td&gt;1.72%&lt;/td&gt;&lt;td&gt;0.57%&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;
The most striking thing, for me, is how conventional these categories are: entertainment, politics, and sports.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Facebook's future rests in the branding of traditional media verticals.  They have captured a powerful demographic and have nearly perfected a distribution mechanism.  They deny they're a media company even as their VP of Product Marketing &lt;a href="http://www.news.com/8301-10784_3-9946606-7.html"&gt;says&lt;/a&gt; they're the "net's cable company."
&lt;/p&gt;

&lt;p&gt;
Entertainment and sports pages perform above expectations, while generic "business" pages perform below expectations.  The popularity of the political categories is to be expected since Facebook's largest userbase is in the US and 2008 is a Presidential election year.
&lt;/p&gt;

&lt;p&gt;
MySpace seems to understand that social networking and entertainment go hand-in-hand.  It's about time Facebook embraced the same &amp;mdash; their users already have.
&lt;/p&gt;

&lt;p&gt;
&lt;div class="download"&gt;Download &lt;a href="http://assets.20bits.com/downloads/facebook-pages-data.xls"&gt;the full dataset&lt;/a&gt; as a Microsoft Excel spreadsheet&lt;/div&gt;
&lt;/p&gt;</description>
      <pubDate>Mon, 19 May 2008 16:40:32 +0000</pubDate>
      <link>http://20bits.com/article/facebook-users-just-want-entertainment</link>
    </item>
    <item>
      <title>Facebook Bans Google Friend Connect</title>
      <description>&lt;p&gt;
Facebook announced today on their official developers' blog that they have &lt;a href="http://developers.facebook.com/news.php?blog=1&amp;story=111"&gt;banned Google Friend Connect&lt;/a&gt;, stating privacy concerns.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://www.google.com/friendconnect/"&gt;Google Friend Connect&lt;/a&gt; is a service that allows users to share their social data, such as personal information and friends, with websites that embed the Google-created widgets.  This data can come from many social networks, including Facebook, Hi5, Orkut, and Google Talk.
&lt;/p&gt;

&lt;p&gt;
The key section is the second-to-last paragraph: &lt;blockquote&gt;Now that Google has launched Friend Connect, we've had a chance to evaluate the technology. We've found that it redistributes user information from Facebook to other developers without users' knowledge, which doesn't respect the privacy standards our users have come to expect and is a violation of our Terms of Service. Just as we've been forced to do for other applications that redistribute data in a way users might not expect or understand, we've had to suspend Friend Connect's access to Facebook user information until it comes into compliance.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
They claim that they have "reached out to Google several times about this issue," but do not state what conversations, if any, took place.  Nor do they spell out exactly how Google Friend Connect violates the Terms of Service.
&lt;/p&gt;

&lt;p&gt;
Facebook announced on May 9th, 2008 that they will be launching their own competitor to Google Friend Connect, &lt;a href="http://developers.facebook.com/news.php?blog=1&amp;story=108"&gt;Facebook Connect&lt;/a&gt;.  Both Google Friend Connect and Facebook Connect came on the heels of MySpace's May 8th announcement of their &lt;a href="http://www.news.com/8301-13577_3-9939286-36.html"&gt;Data Availability&lt;/a&gt; project.
&lt;/p&gt;

&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;
First, it's exciting to see competition in the data portability space.  What seemed like a fantasy just a year ago is now an inexorable trend: data will flow freely across all social networks.  Or, as &lt;a href="http://blogs.forrester.com/charleneli/2008/03/the-future-of-s.html"&gt;Charlene Li said&lt;/a&gt;, "Social networks will be like air."
&lt;/p&gt;

&lt;p&gt;
Google, MySpace, Yahoo!, and Facebook all have huge stakes in this game, each controlling a slice of the social networking pie.  Facebook and MySpace have "social networks" in their own right, but don't forget that friend data can come from services like email and IM, too.
&lt;/p&gt;


&lt;p&gt;
Second, Facebook is skirting a fine, legalistic line.  They don't claim they have a problem with Google Friend Connect taking data from Facebook.  Rather, their problem is that Google Friend Connect supposedly then shares this data with third-parties.  Of course, the blog post announcing all this is rather opaque and gives no specifics.
&lt;/p&gt;

&lt;p&gt;
But does anyone sincerely believe this isn't just Facebook pressing its competitive advantages?  They're about to launch their own version of Friend Connect and crippling your competitor in anticipation is a play right out of the Microsoft platform handbook.
&lt;/p&gt;

&lt;p&gt;
I think the folks at Facebook are just upset because Google, for once, got the drop on them.  The only way they know how to respond is with muscle rather than grace.
&lt;/p&gt;

&lt;p&gt;
Facebook is a business, so I understand it has to operate out of self-interest, but I hope they're not so self-deluded as to believe this move was motivated by privacy concerns.  The original launch of Facebook Beacon is enough to know that Facebook doesn't have privacy on the mind all the time.
&lt;/p&gt;

&lt;p&gt;
On a more general level, Facebook likes to play the world domination game, as &lt;a href="http://discussionleader.hbsp.com/haque/2008/05/http20bitscom20080506thestateo.html"&gt;Umair Haque&lt;/a&gt; has pointed out countless times.  Using privacy as a front Facebook acts the paternalist.
&lt;/p&gt;

&lt;p&gt;
Does Facebook know best?  Are they the best arbiters of my privacy? Thanks, Facebook, but no thanks.  I should be able to do with my data as I please.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="http://www.techcrunch.com/2008/05/15/he-said-she-said-in-google-v-facebook/#comment-2299812"&gt;TechCrunch&lt;/a&gt; has more, including a follow-up from both Google and Facebook.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update 2&lt;/strong&gt;: John Furrier has an &lt;a href="http://furrier.org/2008/05/15/facebook-just-pulled-a-netscape-hey-facebook-what-are-you-thinking/"&gt;interesting post&lt;/a&gt; where he compares Facebook's strategy to Netscape rather than Microsoft.
&lt;/p&gt;</description>
      <pubDate>Thu, 15 May 2008 13:27:16 +0000</pubDate>
      <link>http://20bits.com/article/facebook-bans-google-friend-connect</link>
    </item>
    <item>
      <title>Interview Questions: Database Indexes</title>
      <description>&lt;p&gt;
Continuing my series on &lt;a href="http://20bits.com/tag/interview"&gt;interview questions&lt;/a&gt;, I'm going to spend some time covering ops and sysadmin questions.  We'll start by writing up an introduction to database indexes and their structure.
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
Most consumer-facing web startups these days use one of the major open source databases, either MySQL or PostgreSQL, to some degree.  If you want to prove your worth it's a good idea to get down to the nitty gritty and gain some understanding about these databases' internals.
&lt;/p&gt;

&lt;p&gt;
So, the question: "Explain to me what database indexes are and how they work."
&lt;/p&gt;

&lt;h3&gt;The Answer&lt;/h3&gt;
&lt;p&gt;
In a nutshell, a database index is an auxiliary data structure that allows for faster retrieval of data stored in the database.  An index is keyed off a specific column so that queries like "Give me all people with a last name of 'Smith'" are fast.
&lt;/p&gt;

&lt;h3&gt;The Theory&lt;/h3&gt;
&lt;p&gt;
Database tables, at least conceptually, look something like this: &lt;pre&gt;id	age	last_name	hometown
--	--	--		--
1	10	Johnson		San Francisco, CA
2	27	Smith		San Jose, CA
3	15	Rose		Palo Alto, CA
4	64	Farmer		Mill Valley, CA
5	55	Pauling		San Francisco, CA
6	17	Smith		Oakland, CA
...	...	...		...
100	49	Meyer		Berkeley, CA
101	30	Wayne		Monterey, CA
102	18	Schwartz	San Francisco, CA
104	6	Johnson		San Francisco, CA
...	...	...		...
10000	41	Fetterman	Mountain View, CA
10001	25	Breyer		Redwood City, CA&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
That is, a table is a collection of &lt;a href="http://en.wikipedia.org/wiki/Tuple"&gt;tuples&lt;/a&gt;&lt;span class="footnote"&gt;For bonus points, the "relational" in "relational database" comes from this fact, not from the idea that there are "relations" between tables.&lt;/span&gt;.  If we have a file like this sitting on disk how do we get all records that have a last name of 'Smith?'
&lt;/p&gt;

&lt;p&gt;
The code would wind up looking something like this: &lt;pre class="brush: python"&gt;results = []
for row in rows:
	if row[2] == 'Smith':
		results.append(row)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Finding the appropriate records requires checking the conditions (here, having a last name of 'Smith') for each row.  This is linear in the number of rows which, for many databases, could be millions or billions of rows.  Bad news.
&lt;/p&gt;

&lt;p&gt;
How can we make it faster?
&lt;/p&gt;

&lt;h3&gt;Database Indexes&lt;/h3&gt;
&lt;p&gt;
Any type of data structure that allows for (potentially) faster access can be considered an index.  Let's look at some.
&lt;/p&gt;

&lt;h4&gt;Hash Indexes&lt;/h4&gt;
&lt;p&gt;
Take the same example from above, finding all people with a last name of 'Smith.'  One solution would be to create a &lt;a href="http://en.wikipedia.org/wiki/Hash_function#Hash_tables"&gt;hash table&lt;/a&gt;.  The keys of the hash would be based off of the &lt;tt&gt;last_name&lt;/tt&gt; field and the values would be pointers to the database row.
&lt;/p&gt;

&lt;p&gt;
This type of index is called, unsurprisingly, a "hash index."  Most databases support them but they're generally not the default type.  Why?
&lt;/p&gt;

&lt;p&gt;
Well, consider a query like this: "Find all people who are younger than 45."  Hashes can deal with equality but not inequality.  That is, given the hashes of two fields, there's just no way for me to tell which is greater than the other, only whether they're equal or not.
&lt;/p&gt;
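&lt;p&gt;
As a rough sketch (a plain Python dict standing in for the database's hash table, with row data mirroring the example table above), a hash index on &lt;tt&gt;last_name&lt;/tt&gt; might look like this:
&lt;/p&gt;

```python
# Toy stand-in for the table on disk: rows keyed by id, each holding (age, last_name).
rows = {
    1: (10, 'Johnson'), 2: (27, 'Smith'),   3: (15, 'Rose'),
    4: (64, 'Farmer'),  5: (55, 'Pauling'), 6: (17, 'Smith'),
}

# Build the hash index: each last_name maps to the ids of the rows containing it.
index = {}
for row_id, (age, last_name) in rows.items():
    index.setdefault(last_name, []).append(row_id)

# Equality lookups are now a single hash probe instead of a full scan.
print(index['Smith'])  # [2, 6]

# But the keys carry no ordering, so a query like "everyone younger
# than 45" still requires scanning every row.
```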

&lt;h4&gt;B-tree Indexes&lt;/h4&gt;

&lt;p&gt;
The data structure most commonly used for database indexes is the &lt;a href="http://en.wikipedia.org/wiki/B-tree"&gt;B-tree&lt;/a&gt;, a specific kind of self-balancing tree.  A picture's worth a thousand words, so here's an example.
&lt;img src="http://assets.20bits.com/20080513/b-tree.png" alt="B-tree" title="b-tree" width="494" height="206" class="math size-full wp-image-135" /&gt;
&lt;/p&gt;

&lt;p&gt;
The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario.  And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.
&lt;/p&gt;

&lt;p&gt;
For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.
&lt;/p&gt;
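&lt;p&gt;
A real B-tree is more machinery than fits here, but the ordering property it exploits can be sketched with a sorted list plus binary search, with Python's &lt;tt&gt;bisect&lt;/tt&gt; standing in for the tree descent (the ages and row ids echo the example table above):
&lt;/p&gt;

```python
import bisect

# An ordered index on age: sorted (age, row_id) pairs.  A B-tree keeps the
# same ordering while also supporting logarithmic inserts and deletes.
age_index = [(6, 104), (10, 1), (15, 3), (17, 6), (27, 2), (55, 5), (64, 4)]

def younger_than(index, cutoff):
    # One binary search finds the split point; every entry before it qualifies.
    cut = bisect.bisect_left(index, (cutoff, 0))
    return [row_id for _, row_id in index[:cut]]

print(younger_than(age_index, 13))  # [104, 1]
```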

&lt;h4&gt;Other Indexes&lt;/h4&gt;
&lt;p&gt;
Hash indexes and B-tree indexes are the most common types of database indexes, but there are others, too.  MySQL supports &lt;a href="http://en.wikipedia.org/wiki/R-tree"&gt;R-tree&lt;/a&gt; indexes, which are used to query spatial data, e.g., "Show me all cities within ten miles of San Francisco, CA."
&lt;/p&gt;

&lt;p&gt;
There are also &lt;a href="http://en.wikipedia.org/wiki/Bitmap_index"&gt;bitmap indexes&lt;/a&gt;, which allow for almost instantaneous read operations but are expensive to change and take up a lot of space.  They are best for columns which have only a few possible values.
&lt;/p&gt;

&lt;h3&gt;Subtleties&lt;/h3&gt;
&lt;h4&gt;Performance&lt;/h4&gt;
&lt;p&gt;
Indexes don't come for free.  What you gain in retrieval speed you lose in insertion and deletion speed, because every time you alter a table the indexes must be updated accordingly.  If your table is updated frequently it's possible that having indexes will cause the overall performance of your database to suffer.
&lt;/p&gt;

&lt;p&gt;
There is also a space penalty, as the indexes take up space in memory or on disk.  A single index is smaller than the table because it doesn't contain all the data, only pointers to the data, but in general the larger the table the larger the index&lt;span class="footnote"&gt;Technically the size of an index is going to be proportional to the cardinality of the column being indexed.&lt;/span&gt;.
&lt;/p&gt;

&lt;h4&gt;Design&lt;/h4&gt;
&lt;p&gt;
Nodes in a B-tree contain a value and a number of pointers to child nodes.  For database indexes the "value" is really a pair of values: the indexed field and a pointer to a database row. That is, rather than storing the row data right in the index, you store a pointer to the row on disk.
&lt;/p&gt;

&lt;p&gt;
For example, if we have an index on an &lt;tt&gt;age&lt;/tt&gt; column, the value in the B-tree might be something like (34, 0x875900).  34 is the age and 0x875900 is a reference to the location of the data, rather than the data itself.
&lt;/p&gt;

&lt;p&gt;
This often allows indexes to be stored in memory even for tables that are so large they can only be stored on disk.
&lt;/p&gt;

&lt;p&gt;
Furthermore, B-tree indexes are typically designed so that each node takes up &lt;a href="http://en.wikipedia.org/wiki/Block_(data_storage)"&gt;one disk block&lt;/a&gt;.  This allows each node to be read in with a single disk operation.
&lt;/p&gt;

&lt;p&gt;
Also, for the pedants among us, many databases use &lt;a href="http://en.wikipedia.org/wiki/B%2B_tree"&gt;B+ trees&lt;/a&gt; rather than classic B-trees for generic database indexes.  InnoDB's &lt;tt&gt;BTREE&lt;/tt&gt; index type is closer to a B+ tree than a B-tree, for example.
&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;
Database indexes are auxiliary data structures that allow for quicker retrieval of data.  The most common type of index is a B-tree index because it has very good general performance characteristics and allows a wide range of comparisons, including both equality and inequalities.
&lt;/p&gt;

&lt;p&gt;
The penalty for having a database index is the cost required to update the index, which must happen any time the table is altered.  There is also a certain amount of space overhead, although indexes will be smaller than the table they index.
&lt;/p&gt;

&lt;p&gt;
For specific data types different indexes might be better suited than a B-tree.  R-trees, for example, allow for quicker retrieval of spatial data.  For fields with only a few possible values bitmap indexes might be appropriate.
&lt;/p&gt;

&lt;h3&gt;Good Question, Bad Question&lt;/h3&gt;
&lt;p&gt;
I like this question because it shows whether the interviewee is curious enough to dive into these details.  For certain higher-level engineering positions knowing this should be second-nature, but even for a generic web development position knowing how your database works will only help you improve the performance of your web application.
&lt;/p&gt;

&lt;p&gt;
Also, it's just arcane enough that you can go through the motions without knowing it, but not so arcane that it's inaccessible to someone without an advanced education.  Any decent programmer should be able to understand it &amp;mdash; the exceptional ones will go out of their way to learn it.
&lt;/p&gt;</description>
      <pubDate>Tue, 13 May 2008 12:56:53 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-database-indexes</link>
    </item>
    <item>
      <title>Powerset Launches. Verdict: Meh.</title>
      <description>&lt;p&gt;
&lt;a href="http://powerset.com"&gt;Powerset&lt;/a&gt;, the much-hyped natural-language search company, has finally &lt;a href="http://www.techcrunch.com/2008/05/11/powerset-launches-showcase-for-user-search-experience/"&gt;launched a public product&lt;/a&gt;: a showcase for its search technology that "enhances the Wikipedia experience."  It's live right now on its homepage, so go check it out.
&lt;/p&gt;

&lt;p&gt;
Are you back?  That sound you heard is the technology world shrugging in unison.  For all the hype Powerset has gotten over the last year and a half this showcase leaves a &lt;a href="http://en.wikipedia.org/wiki/Chicxulub_Crater"&gt;Chicxulub-sized&lt;/a&gt; gap between expectation and execution.
&lt;/p&gt;

&lt;p&gt;
Even ignoring all the press, it's not that impressive on the face of it.  Using their example queries as templates it took me about five seconds to find queries which not only returned appropriate results on Google but simultaneously returned nonsense on Powerset.
&lt;/p&gt;

&lt;p&gt;
On a personal note, I really wanted to like Powerset.  The people working there are all super-smart and I know they've put a lot of blood and sweat into this launch.  But I have to be honest.  If someone over there reads this, just know I'm writing it because I want to see the company launch a great product.
&lt;/p&gt;

&lt;h3&gt;A Failure of Execution&lt;/h3&gt;
&lt;p&gt;
Let's start by diving into a little Google vs. Powerset one-on-one.
&lt;/p&gt;


&lt;p&gt;
&lt;table class="pset-compare"&gt;
	&lt;tr&gt;
		&lt;th style="width: 40ex;"&gt;Query&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;&lt;/th&gt;
		&lt;th style="width: 3ex;"&gt;Winner&lt;/th&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;Who is on Google's board?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+is+on+Google%27s+board%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=who+is+on+google's+board&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Powerset&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;Who is on both Google and Apple's board?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+is+on+both+Google+and+Apple%27s+board%3F&amp;submit.x=32&amp;submit.y=7"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=Who+is+on+both+Google+and+Apple%27s+board%3F&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;How did Hitler die?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=How+did+Hitler+die%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=how+did+hitler+die%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;How did Adolf Hitler die?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=How+did+Adolf+Hitler+die%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=how+did+adolf+hitler+die%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Tie&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;What is the longest suspension bridge in the United States?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=What+is+the+longest+suspension+bridge+in+the+United+States%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;pwst=1&amp;sa=X&amp;oi=spell&amp;resnum=0&amp;ct=result&amp;cd=1&amp;q=What+is+the+longest+suspension+bridge+in+the+United+States%3F&amp;spell=1"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr&gt;
		&lt;td&gt;When did the United States enter Iraq?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=When+did+the+United+States+enter+Iraq%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?hl=en&amp;client=safari&amp;rls=en-us&amp;q=When+did+the+United+States+enter+Iraq%3F&amp;btnG=Search"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Powerset&lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class="odd"&gt;
		&lt;td&gt;Who was the tenth President of the US?&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.powerset.com/explore/pset?q=Who+was+the+tenth+President+of+the+US%3F&amp;x=0&amp;y=0"&gt;Powerset&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;&lt;a href="http://www.google.com/search?client=safari&amp;rls=en-us&amp;q=Who+was+the+tenth+President+of+the+US%3F&amp;ie=UTF-8&amp;oe=UTF-8"&gt;Google&lt;/a&gt;&lt;/td&gt;
		&lt;td&gt;Google&lt;/td&gt;
	&lt;/tr&gt;
&lt;/table&gt;
&lt;/p&gt;
&lt;br /&gt;
&lt;p&gt;
You get the idea.  I tried to pick questions that Powerset is designed to answer, i.e., fact-based trivia easily found on Wikipedia.
&lt;/p&gt;
&lt;p&gt;
Some of the failures are pretty egregious, honestly.  Searching for "Who was the tenth President of the US?" fails to return a single relevant result on Powerset's first page, whereas Google's entire first page is relevant, even when Google is limited to searching just Wikipedia.
&lt;/p&gt;

&lt;p&gt;
In other cases it appears Powerset has a poor understanding of synonymy, returning irrelevant results for "How did Hitler die?" but returning the correct answer for "How did Adolf Hitler die?"  Of course, Google returns the correct answer in both cases.
&lt;/p&gt;

&lt;p&gt;
Guys, this is supposed to be the exact area where you excel.  What gives?  You're not even living up to your new, lowered expectations.
&lt;/p&gt;

&lt;h3&gt;A Failure of Marketing&lt;/h3&gt;
&lt;p&gt;
Marketing is a two-edged sword.  The net effect of good marketing is to solidify your brand in the minds of consumers.  This is good if you execute, but can also make it difficult to change course later.
&lt;/p&gt;

&lt;p&gt;
Powerset fell into this trap.  They started with ambitions of being a Google-killer, but reading their &lt;a href="http://www.powerset.com/about/"&gt;about page&lt;/a&gt; now it sounds more like they're aiming to be the Google-enhancer.  This is a respectable business, of course, but it is hard to swallow after a year and a half of being promised a revolutionary new search paradigm.
&lt;/p&gt;

&lt;p&gt;
They might be &lt;a href="http://venturebeat.com/2008/04/10/powerset-dont-call-us-a-search-engine/"&gt;repudiating the Google-killer label&lt;/a&gt; now, but here's an excerpt from a February 2007 &lt;a href="http://www.powerset.com/news/parc.pdf"&gt;press release&lt;/a&gt;: &lt;blockquote&gt;"The time is right to tell the world about &lt;strong&gt;the game-changing technology we've created&lt;/strong&gt;," said Ron Kaplan, Powerset Chief Technology and Science Officer, who previously created and managed the Natural Language Research Group at PARC. "I am glad to join Powerset's team of world-class linguists and search engineers to help this technology &lt;strong&gt;revolutionize the way people access information&lt;/strong&gt;."&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Too much marketing before a product launches can back you into a corner, and in Powerset's case it will now be difficult to avoid direct comparisons to Google in the press.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Powerset was started in 2005 and has been using Xerox PARC's natural language processing technology for over a year now.  In that time they've been pumping out press releases talking about how they will revolutionize not just search but the way humans and computers interact.
&lt;/p&gt;

&lt;p&gt;
What do they have to show for it?  Not much, judging by their latest product.  As a search tool it is more interesting than useful, shining in only a few, pre-selected cases.  The advantages over Google are so minimal and the defects so large that I would never consider using this as my main means of searching Wikipedia, let alone the Web at large.
&lt;/p&gt;

&lt;p&gt;
To me this product smells like a tech demo intended to convince someone outside Powerset that they really are producing something amazing, not a fully-featured product launch.  &lt;a href="http://www.news.com/8301-13953_3-9940887-80.html"&gt;There are rumors&lt;/a&gt; that Powerset is looking to sell or raise another round of financing, and it has recently hired David Wehner, a managing director at Allen &amp;amp; Co.
&lt;/p&gt;

&lt;p&gt;
This launch might be enough to convince investors to re-up or buyers to fork over the dough, but speaking as an end-user I'll take another look at what Powerset has to offer when it can tell me who &lt;a href="http://en.wikipedia.org/wiki/John_Tyler"&gt;John Tyler&lt;/a&gt; was.
&lt;/p&gt;</description>
      <pubDate>Mon, 12 May 2008 11:59:53 +0000</pubDate>
      <link>http://20bits.com/article/powerset-launches-verdict-meh</link>
    </item>
    <item>
      <title>How I Grow My Blog</title>
      <description>&lt;p&gt;
I was talking with &lt;a href="http://zellunit.com/"&gt;Matt Humphrey&lt;/a&gt; the other day and he asked me, "How did you grow your blog?"  My answer at the time wasn't very enlightening, so I thought I'd sit down and hammer out my strategy for growing 20bits.
&lt;/p&gt;

&lt;h3&gt;General Principles&lt;/h3&gt;
&lt;ol&gt;
	&lt;li&gt;
	&lt;h4&gt;Play to Your Strengths&lt;/h4&gt;
	&lt;p&gt;
	Not everyone is witty and not everyone is penetrating.  Personally I'm good at writing longer, article-sized posts, so that's what I write most of the time.  I'm also not great at written humor &amp;mdash; I usually just come off sounding like a smug asshole.
	&lt;/p&gt;
	&lt;p&gt;
	As an exercise I might try to write some shorter articles or include a parenthetical joke, but I understand my strengths and use them to my advantage.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Pick Your Audience&lt;/h4&gt;
	&lt;p&gt;
	Before you even start writing you have to decide on an audience.  When I first started blogging I was all over the map.  Since then I've tightened up this blog to focus on technology, technology news, and some aspects of Silicon Valley life.
	&lt;/p&gt;
	&lt;p&gt;
	This limits me in some respects, but helps me in others.  Since, you know, I'm &lt;em&gt;working in technology&lt;/em&gt; in the Bay Area, it actually does a lot to advance my career, even if I'm never going to be quoted in Time magazine.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Interesting, and Know When You Aren't&lt;/h4&gt;
	&lt;p&gt;
	Try to write interesting things.  This means things your audience would be interested in reading, by the way.  I can't count the number of times I wrote an article that I was really proud of only to realize that I was the only one who gave a crap.
	&lt;/p&gt;
	&lt;p&gt;
	Of course, nobody can be interesting all the time.  Take stock of your mistakes and learn to identify when a post will actually be interesting.  And please, don't fall into the trap of thinking you're better than your audience.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Succinct&lt;/h4&gt;
	&lt;p&gt;
	Be as short as you can be without compromising your central argument.  This applies to any kind of expository writing, in my opinion, and blogging is no different.
	&lt;/p&gt;
	&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Tactical Principles&lt;/h3&gt;
&lt;ol&gt;
	&lt;li&gt;
	&lt;h4&gt;Make Friends&lt;/h4&gt;
	&lt;p&gt;
	Reach out to other people in your field and make connections.  Ask people out for coffee to discuss their latest work.  Promote other people when they say something worthwhile.
	&lt;/p&gt;
	&lt;p&gt;
	Basically, you want people to be able to associate your website with your face.  Think of it as a branding exercise with the pleasant side-effect of getting to meet really awesome people.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Quick in Spotting Popularity&lt;/h4&gt;
	&lt;p&gt;
	If you spot something you know is going to take off and you have a response, write it up as quickly as you can and get it out there.
	&lt;/p&gt;
	&lt;p&gt;
	Nowadays you can use &lt;a href="http://www.techmeme.com/"&gt;TechMeme&lt;/a&gt; to measure this.  Find an upcoming article there that few people have responded to and be the first to respond.
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Be Controversial, but not a Jerk&lt;/h4&gt;
	&lt;p&gt;
	Controversy generates interest.  Don't just use a linkbait headline with a milquetoast body.  That's just half-assing it.
	&lt;/p&gt;
	&lt;p&gt;
	That said, don't be a jerk.  If you call people out expect to get called out in return.  And be ready to change your tune when someone shows you the error of your ways.  In short, have &lt;a href="http://bobsutton.typepad.com/my_weblog/2006/07/strong_opinions.html"&gt;strong opinions, weakly held.&lt;/a&gt;
	&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;h4&gt;Know When To Promote, and Then Go All Out&lt;/h4&gt;
	&lt;p&gt;
	When you have a post you know will get traction don't be afraid to promote it by calling in favors.  But don't be the boy who cried wolf, either, asking your friends and connections to promote every single story you write.  Save it for the good stuff.
	&lt;/p&gt;
	&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
I try to follow the above consciously with every article I write.  So far it has paid dividends.
&lt;/p&gt;</description>
      <pubDate>Fri, 09 May 2008 11:18:02 +0000</pubDate>
      <link>http://20bits.com/article/how-i-grow-my-blog</link>
    </item>
    <item>
      <title>The State of the Platform: Update</title>
      <description>&lt;p&gt;
My article about &lt;a href="/article/the-state-of-the-facebook-platform/"&gt;The State of the Facebook Platform&lt;/a&gt; has been spreading through the blogosphere like a game of telephone.  &lt;a href="http://andrewchen.typepad.com/andrew_chens_blog/2008/05/has-the-faceboo.html#comments"&gt;Lots&lt;/a&gt; &lt;a href="http://www.sarahlacy.com/sarahlacy/2008/05/facebook-platfo.html"&gt;of&lt;/a&gt; &lt;a href="http://blog.playfish.com/2008/05/07/facebooks-stricter-app-regulations-are-a-good-thing/"&gt;people&lt;/a&gt; &lt;a href="http://twitter.com/Scobleizer/statuses/805095480"&gt;have&lt;/a&gt; &lt;a href="http://venturebeat.com/2008/05/06/facebooks-platform-issues-fewer-developers-slower-app-growth"&gt;chimed&lt;/a&gt; in with their own opinions.
&lt;/p&gt;

&lt;p&gt;
I wanted to write a follow-up post to clarify my opinion and address some of the responses.
&lt;/p&gt;

&lt;h3&gt;What I'm Claiming&lt;/h3&gt;
&lt;p&gt;
My claims are simple and uncontroversial.  I observed two things: one, the activity level in the Facebook forums is a fraction of what it was four months ago; two, Facebook apps launched today are much less likely to succeed.
&lt;/p&gt;

&lt;p&gt;
The trends for these two observations are highly correlated and exhibit the same peak around February 2nd, 2008.  What happened around that time?  One, Facebook began instituting increasingly demanding and arbitrary developer policies.  Two, other networks began launching fully-featured competitors to Facebook's platform.
&lt;/p&gt;

&lt;p&gt;
From the high correlation, the timing of events, and comments from people working in the industry, I concluded that developers are less interested in Facebook today because there's less return on their investment of labor.
&lt;/p&gt;

&lt;h3&gt;What I'm NOT Claiming&lt;/h3&gt;
&lt;p&gt;
I'm not claiming that the Facebook Platform is unhealthy.  Nor am I claiming that it was a bad idea for Facebook to implement the policy changes they did.
&lt;/p&gt;

&lt;p&gt;
I'm certainly not claiming that any of the data implies either of the above.  Indeed, it's still possible to find &lt;a href="http://adonomics.com/about/10726707410"&gt;success&lt;/a&gt; on the Facebook Platform.  It just requires more effort than it used to.
&lt;/p&gt;

&lt;p&gt;
Also, most emphatically, &lt;em&gt;I'm not talking about Facebook users&lt;/em&gt;.  The article was only about developers and their decision to create software for Facebook, not about Facebook as a whole, which is still seeing phenomenal success.
&lt;/p&gt;

&lt;h3&gt;Other Hypotheses&lt;/h3&gt;
&lt;p&gt;
The most common alternate hypothesis for these trends was summarized by &lt;a href="http://blog.jeffreymcmanus.com/"&gt;Jeffrey McManus&lt;/a&gt;: &lt;blockquote&gt;This is not a terrific metric for developer activity &amp;mdash; it doesn't measure what you purport to measure. Developers generally view and post to forums when they have problems; if fewer developers are posting to the forums, it may mean that there are more developers who are having less trouble.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
I call this the "documentation hypothesis" and addressed it briefly in my original article.  I think it's an unappealing explanation for a few reasons.
&lt;/p&gt;

&lt;p&gt;
First, if it were true, we'd expect to see spikes in forum activity whenever a new issue arose on the Platform, especially since Facebook's changes tend to be radical and out of nowhere.  The decline in activity is virtually monotonic, however, and the data shows no such spikes.
&lt;/p&gt;

&lt;p&gt;
Second, even if it were true, it doesn't explain the correlation between forum activity and application success, nor does it explain the sudden decline beginning around February 2nd.  As an explanation it just isn't sufficient.
&lt;/p&gt;

&lt;h3&gt;Is the Trend Good or Bad?&lt;/h3&gt;
&lt;p&gt;
I understand the tone of the article was bearish, but I was writing it from the perspective of a developer deciding whether or not to commit to the Facebook Platform.  There are lots of perspectives, though.
&lt;/p&gt;

&lt;dl&gt;
	&lt;dt&gt;Facebook's Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	I believe Facebook is making these changes intentionally.  They have a love-hate relationship with companies like Slide.  Strategically speaking, these companies got in at the very beginning and quickly cordoned off sections of the social graph for themselves, largely out of Facebook's reach.  Messages on FunWall don't go through Facebook, for example.
	&lt;/p&gt;
	&lt;p&gt;
	This is clearly not in Facebook's strategic interest, but they can't just boot these companies out because a significant number of Facebook users would throw a fit.  From Facebook's perspective these trends in developer engagement are good because it allows them to reassert control and improve their image as the "high-quality social network."
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Facebook Users' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Let's face it, most Facebook users don't like to be pestered by applications.  For them these changes are good.  And judging by Facebook's &lt;a href="http://www.alexa.com/data/details/traffic_details/facebook.com?site0=myspace.com&amp;site1=facebook.com&amp;y=r&amp;z=3&amp;h=300&amp;w=610&amp;c=1&amp;u%5B%5D=myspace.com&amp;u%5B%5D=facebook.com&amp;x=2008-05-07T20%3A35%3A12.000Z&amp;check=www.alexa.com&amp;signature=n49bq4%2B6Z5asVqN59LzvZZubXw8%3D&amp;range=max&amp;size=Medium"&gt;traffic stats&lt;/a&gt; it isn't hurting them one bit.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Advertisers' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	For advertisers these developments are universally good.  If the bar for application development is higher it means the applications that succeed will be of a higher quality.  Nobody wants to advertise on "What color barf are you?" and Facebook doesn't want that application to be front-and-center, either.  It just looks bad.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Developers' Perspective&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	For developers this is a mixed bag.  Facebook's cavalier attitude about platform policy means that you're playing on shifting ground.  On top of that the changes they've already made mean it's harder for applications to succeed, on average.
	&lt;/p&gt;
	&lt;p&gt;
	Still, for companies like &lt;a href="http://www.socialgn.com/"&gt;SGN&lt;/a&gt; and &lt;a href="http://www.playfish.com/"&gt;PlayFish&lt;/a&gt;, who want to make quality applications, this means that they don't have to worry about competing with win-at-all-cost, spammy applications.
	&lt;/p&gt;
	&lt;p&gt;
	I just wouldn't recommend developing &lt;em&gt;only&lt;/em&gt; on Facebook, as they've shown they're willing to change and bend the rules at a whim and for their own benefit.  You know that as soon as Facebook decides they don't like what you're doing they'll do everything in their power to hinder you.  Hedge your bets.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;

&lt;h3&gt;Hype Cycle&lt;/h3&gt;
&lt;p&gt;
Don't forget about the &lt;a href="http://en.wikipedia.org/wiki/Hype_cycle"&gt;hype cycle&lt;/a&gt;, either.  All technologies go through a phase of inflated expectations followed by a trough of disillusionment.
&lt;/p&gt;


&lt;p&gt;
I'd say we're right in the middle of the trough of disillusionment.  Companies like Zynga, SocialMedia, Slide, RockYou, and SGN are going to slog through the slope of enlightenment.
&lt;/p&gt;

&lt;p&gt;
Will we have a social operating system or a revolutionary social commerce system waiting at the end?  Probably not.  Will we have innovative casual gaming platforms?  I'd take that bet.
&lt;/p&gt;

&lt;p&gt;
I'm interested in hearing other perspectives, too, particularly investors' perspectives.  Does anyone have any insight on that?
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="http://runningwithfoxes.com/2008/05/07/facebook-platform-thinning-of-the-herd/"&gt;Nick Gonzalez&lt;/a&gt;, formerly of TechCrunch and now of SocialMedia, makes a similar point about the hype cycle.   I also like the Darwinian nature of his post's title: "The Thinning of the Herd."  Heh.
&lt;/p&gt;</description>
      <pubDate>Wed, 07 May 2008 14:12:44 +0000</pubDate>
      <link>http://20bits.com/article/the-state-of-the-platform-update</link>
    </item>
    <item>
      <title>The State of the Facebook Platform</title>
      <description>&lt;p&gt;
Something is wrong in the Facebook developer community.  Starting in March I began noticing that the level of activity in the &lt;a href="http://forum.developers.facebook.com/" onclick="javascript:urchinTracker('/outbound/forum.developers.facebook.com/');"&gt;Facebook developers forum&lt;/a&gt; was dropping sharply.
&lt;/p&gt;
&lt;p&gt;
But it's numbers that matter, not vague impressions, so does the data back me up?  Is the Facebook developer community retreating from the public space of the forums?  The answer is yes, on both counts.
&lt;/p&gt;
&lt;p&gt;
Looking at five key metrics we'll see that the activity level of the Facebook forums is a fraction of what it was at the beginning of 2008.  The number of active users&lt;span class="footnote"&gt;An "active user" is defined as someone who has made at least one post in the period being considered; here, a month.&lt;/span&gt; has declined &lt;strong&gt;27%&lt;/strong&gt; since January, for example.  And this is the best-performing metric discussed.
&lt;/p&gt;
&lt;p&gt;
Furthermore, this decline in forum activity correlates to an overall decline in activity on the Facebook platform.  Applications launched in early January were on average &lt;strong&gt;1.5 times&lt;/strong&gt; more successful than applications launched at the end of March.
&lt;/p&gt;
&lt;p&gt;
The turning point occurred in early February, when several interlocking factors came into play.  First, Facebook finally saw real competition in the form of other social networking platforms, particularly Hi5.
&lt;/p&gt;
&lt;p&gt;
Second, Facebook started instituting increasingly demanding and arbitrary rules on platform developers, which they then enforced selectively and for their own benefit.
&lt;/p&gt;
&lt;p&gt;
Third, a trend of application consolidation began and accelerated through March, locking up developer resources inside private companies.
&lt;/p&gt;
&lt;h3&gt;Contents&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#activity"&gt;What is Activity?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#data"&gt;The Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#trend"&gt;The Trend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#analysis"&gt;Analysis and Hypotheses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#app-data"&gt;Facebook Application Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a name="activity"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;What is Activity?&lt;/h3&gt;
&lt;p&gt;
You can define activity in a lot of ways and I tried to be as liberal as possible.  The Facebook forums have three main objects worth measuring: users, threads, and posts.
&lt;/p&gt;
&lt;p&gt;
The forum is "active" when users are signing up and creating new threads and posts.  To measure this I created a script in Ruby that can scrape any PunBB-based forum, like the Facebook forum.  This basically results in a local copy of the forum database.
&lt;/p&gt;
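&lt;p&gt;
For the curious, the scraping step is straightforward.  Here's a minimal sketch, not my actual script: it assumes PunBB's standard &lt;code&gt;viewforum.php&lt;/code&gt;/&lt;code&gt;viewtopic.php?id=N&lt;/code&gt; URL layout and just parses thread links out of a page, where the real script crawled every page and saved users, threads, and posts to a local database.
&lt;/p&gt;

```ruby
require 'net/http'
require 'uri'

# PunBB renders thread listings as anchor tags pointing at
# viewtopic.php?id=N, so thread IDs and titles can be pulled
# out of the raw HTML with a regex.
def parse_thread_list(html)
  html.scan(%r{<a href="viewtopic\.php\?id=(\d+)">([^<]+)</a>}).map do |id, title|
    { id: id.to_i, title: title }
  end
end

# Fetching one page of a forum would look like this (network access required):
def fetch_forum_page(base_url, forum_id, page)
  uri = URI("#{base_url}/viewforum.php?id=#{forum_id}&p=#{page}")
  Net::HTTP.get(uri)
end

sample = '<a href="viewtopic.php?id=42">FBML rendering question</a>' \
         '<a href="viewtopic.php?id=43">Notification limits?</a>'
threads = parse_thread_list(sample)
puts threads.length      # 2
puts threads.first[:id]  # 42
```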
&lt;p&gt;
I then broke the data down in two ways.  In the first I compare the activity in January 2008 to the activity in April 2008.  In the second I break down all activity from the launch of the forum in October 2007 through April 2008 on a weekly basis.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Disclaimer:&lt;/strong&gt; I'm the creator of &lt;a href="http://adonomics.com" onclick="javascript:urchinTracker('/outbound/adonomics.com');"&gt;Adonomics&lt;/a&gt;, a key player in the Facebook platform ecosystem, although I no longer work for the company that now owns it.
&lt;/p&gt;
&lt;p&gt;&lt;a name="data"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;&lt;center&gt;&lt;/p&gt;
&lt;table class="monthly-data"&gt;
&lt;tr class="top"&gt;
&lt;th colspan="4"&gt;Monthly Statistics for the Facebook Developer Forum&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;th style="width: 15ex;"&gt;Month:&lt;/th&gt;
&lt;th style="width: 8ex;" class="date"&gt;Jan 2008&lt;/th&gt;
&lt;th style="width: 8ex;" class="date"&gt;Apr 2008&lt;/th&gt;
&lt;th style="width: 2ex;"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Posts per day&lt;/td&gt;
&lt;td&gt;461&lt;/td&gt;
&lt;td&gt;225&lt;/td&gt;
&lt;td class="negative"&gt;-51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Signups per day&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td class="negative"&gt;-29%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Threads per day&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td class="negative"&gt;-44%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td class="statistic"&gt;Active users&lt;/td&gt;
&lt;td&gt;1,606&lt;/td&gt;
&lt;td&gt;1,168&lt;/td&gt;
&lt;td class="negative"&gt;-27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class="statistic"&gt;Highly active users&lt;/td&gt;
&lt;td&gt;461&lt;/td&gt;
&lt;td&gt;225&lt;/td&gt;
&lt;td class="negative"&gt;-47%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;
An "active user" is defined as someone who makes at least one post in that month.  A "highly active user" is defined as someone who makes at least five posts in that month.  The per day metrics are averaged over the number of days in each month.
&lt;/p&gt;
&lt;p&gt;
The big picture doesn't look so good.  But is the lower level of activity in April just a fluke or has this been a consistent trend?  Let's extend the timeline from October 2007 to April 2008 and take a look at these metrics on a weekly basis.
&lt;/p&gt;
&lt;p&gt;&lt;a name="trend"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The Trend&lt;/h3&gt;
&lt;p&gt;
I'm going to post weekly graphs for only three metrics: posts per week, signups per week, and active users per week.  The other two metrics show the same trends.
&lt;/p&gt;
&lt;p&gt;
Each graph contains the weekly data plus a red line representing a four-week moving average, which smooths out one-off dips and spikes in activity and reveals the underlying trend.
&lt;/p&gt;
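&lt;p&gt;
For anyone who wants to reproduce the smoothing: each point on the red line is simply the mean of that week and the three weeks before it.  A quick sketch, with illustrative numbers rather than the real forum data:
&lt;/p&gt;

```ruby
# Four-week moving average: slide a window of four consecutive
# weeks across the series and average each window.
def moving_average(series, window = 4)
  series.each_cons(window).map { |w| w.sum / w.size.to_f }
end

weekly_posts = [3200, 2900, 3100, 2800, 2600, 2500]
puts moving_average(weekly_posts).inspect
# [3000.0, 2850.0, 2750.0] -- e.g. (3200 + 2900 + 3100 + 2800) / 4 = 3000
```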
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/posts.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/posts.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/posts-thumb.png" alt="Posts Per Week" title="Posts per Week" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/signups.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/signups.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/signups-thumb.png" alt="Posts Per Week" title="Posts per Week" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can look at the graph for &lt;a target="_blank" href="http://20bits.com/wp-content/uploads/2008/05/active-users.png" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/active-users.png');"&gt;weekly active users&lt;/a&gt;, too, but it exhibits the same downward trend.&lt;/p&gt;
&lt;p&gt;&lt;a name="analysis"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Analysis and Hypotheses&lt;/h3&gt;
&lt;p&gt;
These graphs show the two main trends for the Facebook developers forum: engagement peaked in late-January while new signups have continually dropped since the launch of the forum.
&lt;/p&gt;
&lt;p&gt;
In fact, since the start of 2008 we've seen a &lt;strong&gt;3.4%&lt;/strong&gt; week-over-week decline in new posts.  For new signups and active users these numbers are &lt;strong&gt;1.7%&lt;/strong&gt; and &lt;strong&gt;0.8%&lt;/strong&gt;, respectively.
&lt;/p&gt;
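&lt;p&gt;
One reasonable way to compute such a week-over-week figure (a sketch, not necessarily the exact calculation behind the numbers above) is the geometric mean of the week-to-week ratios:
&lt;/p&gt;

```ruby
# Average weekly rate of change: take the ratio of each week to the
# previous week, then the geometric mean of those ratios, minus one.
def weekly_change(series)
  ratios = series.each_cons(2).map { |prev, cur| cur / prev.to_f }
  ratios.reduce(:*) ** (1.0 / ratios.size) - 1.0
end

# A made-up series declining exactly 10% per week:
series = [1000.0, 900.0, 810.0, 729.0]
puts((weekly_change(series) * 100).round(1))  # -10.0
```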
&lt;p&gt;
The other metrics, new threads and highly active users, show the same downward trend.
&lt;/p&gt;
&lt;p&gt;
Why has there been a steady, three-month decline in activity on the Facebook developer forum?  What explains the peak in late-January?
&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Other platforms are more attractive&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Since January several other social networks have launched their own platforms.  Those based on OpenSocial are wholly incompatible with Facebook, and so far Bebo is still the only company to launch a platform using Facebook's architecture as its base.
	&lt;/p&gt;
&lt;p&gt;
	This means developing for other networks becomes an either-or proposition.  Coding on OpenSocial precludes me from spending that time coding for Facebook.  Hi5 and MySpace, in particular, are interesting because they potentially offer the huge growth opportunities that developers saw in the first six months of the Facebook Platform.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Developers are consolidating&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Networks like Zynga and Social Gaming Network (SGN) have cropped up in the last few months and have made it their business to consolidate the game space on Facebook, probably the only real vertical that has found success on the platform.  Bigger companies like Slide and RockYou have been actively recruiting from the Facebook developer pool all along, too.
	&lt;/p&gt;
&lt;p&gt;
	Perhaps the open community that existed four months ago is closing up.  Developers working for the same company talk to each other rather than on the forums, and developers working for different companies don't want to talk in public for competitive reasons.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Facebook has made it too hard to win&lt;/dt&gt;
&lt;dd&gt;
&lt;p&gt;
	Starting in the middle of January Facebook began instituting ad hoc solutions to curb the spread of abusive and spammy apps.  The side-effect of these measures is to make it harder for applications to spread.
	&lt;/p&gt;
&lt;p&gt;
	These measures include banning the word "message" from news feed items, disallowing passive news feed stories, instituting feedback-based request and notification limits, and banning "forced invites."
	&lt;/p&gt;
&lt;p&gt;
	The ad hoc and arbitrary nature of these rules makes it hard to keep up because Facebook generally gives less than a week's notice. This only serves to increase the relative cost of developing on the Facebook platform.
	&lt;/p&gt;
&lt;p&gt;
	This is not to forget mini-scandals like the Facebook/CBS partnership, where &lt;a href="http://www.google.com/search?client=safari&amp;#038;rls=en-us&amp;#038;q=cbs+facebook&amp;#038;ie=UTF-8&amp;#038;oe=UTF-8" onclick="javascript:urchinTracker('/outbound/www.google.com/search?client=safari_038_rls=en-us_038_q=cbs+facebook_038_ie=UTF-8_038_oe=UTF-8');"&gt;Facebook removed invite restrictions&lt;/a&gt; on CBS' sponsored March Madness application, even though there were other, independent applications in the same category.  It's hard to say how this affected developer morale, but it showed that Facebook was willing to hurt independent developers when it benefitted them.
	&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;
One other possibility is that the external developer resources are now rich enough that people have to spend less time asking questions.  This might be a contributing factor but is probably not a critical one given that the decline beginning in late-January is so well-defined.
&lt;/p&gt;
&lt;p&gt;&lt;a name="app-data"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Facebook Application Data&lt;/h3&gt;
&lt;p&gt;
If it's true that winning on Facebook is not as easy as it used to be then that should be reflected in the application statistics.  How can we measure this?
&lt;/p&gt;
&lt;p&gt;
Success, for most developers, is defined as attaining a certain level of activity within a specific timeframe.  There are two ways to measure this: one, take time-based cohorts of applications (e.g., all applications started in the same week) and measure their average level of activity some number of weeks later; two, measure the number of active applications as a percentage of the total application space over time.
&lt;/p&gt;
&lt;p&gt;
I will do both.  For the first I'm going to take weekly cohorts and measure their average level of activity three weeks later.  That is, I am going to group all applications by the week they were launched&lt;span class="footnote"&gt;Really, I am going to group them by the week they began to be tracked by Adonomics.&lt;/span&gt; and see how they were doing three weeks later&lt;span class="footnote"&gt;Three weeks is arbitrary, but you can look for yourself &amp;mdash; we see the same trend whether it's one week, two weeks, three weeks, or four weeks.&lt;/span&gt;.
&lt;/p&gt;
&lt;p&gt;
For the second I am going to measure the number of applications with at least 100 daily active users (DAU)&lt;span class="footnote"&gt;Again, 100 is arbitrary, but we see the same trend whether it's 10, 100, or 1,000. &lt;/span&gt; as a percentage of the total number of applications on Facebook.
&lt;/p&gt;
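&lt;p&gt;
In code, the two measurements might look something like this.  This is a hypothetical sketch in Python; the field names and data layout are assumptions for illustration, not Adonomics' actual format.
&lt;/p&gt;

```python
from collections import defaultdict

def cohort_activity(apps, weeks_later=3):
    # apps: list of dicts with "launch_week" (an int) and "dau_by_week"
    # (a dict mapping week number to daily active users).
    # Returns the average DAU of each weekly cohort, measured weeks_later
    # weeks after launch.
    cohorts = defaultdict(list)
    for app in apps:
        week = app["launch_week"] + weeks_later
        cohorts[app["launch_week"]].append(app["dau_by_week"].get(week, 0))
    return {w: sum(dau) / len(dau) for w, dau in cohorts.items()}

def active_share(apps, week, threshold=100):
    # Fraction of all applications with at least `threshold` DAU in a week.
    # min(dau, threshold) == threshold holds exactly when dau is at least
    # threshold.
    active = sum(1 for a in apps
                 if min(a["dau_by_week"].get(week, 0), threshold) == threshold)
    return active / len(apps)
```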
&lt;p&gt;Here are the graphs.&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/app-success.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/app-success.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/app-success-thumb.png" alt="Application Success by Cohort" title="Application Success by Cohort" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="lightbox" href="http://20bits.com/wp-content/uploads/2008/05/active-pct.png" class="" onclick="javascript:urchinTracker('/file/wp-content/uploads/2008/05/active-pct.png');"&gt;&lt;img src="http://assets.20bits.com/20080506/active-pct-thumb.png" alt="Active Applications as a Percentage" title="Active Applications as a Percentage" class="math wp-image-120" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
These exhibit the same trends as the data from the Facebook developer forums.
&lt;/p&gt;
&lt;p&gt;&lt;a name="conclusions"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Correlation is not causation, of course, so we can't say whether the decline in developer activity is causing the decline in application activity, whether developers are leaving because applications are no longer as successful as they used to be, or whether an unknown factor is causing both of these phenomena.
&lt;/p&gt;
&lt;p&gt;
What we can say is that the vitality of both the Facebook developer community and the Facebook platform is not what it was even four months ago, and that these two phenomena are closely related.
&lt;/p&gt;
&lt;p&gt;
Moreover, talking to developers and investors inside the industry it's clear that the excitement over the Facebook Platform and its promise have waned.  Application companies are branching out to other social networks not because they necessarily show more promise than Facebook, but because the future of the Facebook Platform has become murky.
&lt;/p&gt;
&lt;p&gt;
Nobody knows how committed Facebook is to improving the platform or the role applications are meant to play in the overall Facebook ecosystem.  Signals like the reduced level of direct participation in the developer community, increasingly restrictive developer policies, and the &lt;a href="http://www.facebook.com/FacebookPreviews" onclick="javascript:urchinTracker('/outbound/www.facebook.com/FacebookPreviews');"&gt;Facebook profile redesign&lt;/a&gt; seem to indicate that they are trying to regain control over some, if not all, aspects of application development while maintaining an aloof demeanor towards developers.
&lt;/p&gt;
&lt;p&gt;
It boils down to this: investing most of your man-hours into Facebook at this point in time is a mistake.  The potential return on that investment, a year after launch, is a fraction of what it once was.  And the fact that Facebook continues to change the rules and selectively break them for their own benefit means the risk is comparatively higher.
&lt;/p&gt;
&lt;p&gt;
It is better to branch out into other social networks or to piggy-back on Facebook as a means to establish your own, more independent social network.  This is what the top companies like Slide, RockYou, Zynga, and SGN are doing, and what many of the independent Facebook developers I've talked with want to do.
&lt;/p&gt;
&lt;p&gt;
The luster of the Facebook Platform might be gone, but that doesn't mean there aren't opportunities in the space.  I just wouldn't go looking for them at the other end of Facebook's newsfeed.
&lt;/p&gt;
&lt;h3&gt;Misc.&lt;/h3&gt;
&lt;p&gt;
All application data courtesy of &lt;a href="http://adonomics.com" onclick="javascript:urchinTracker('/outbound/adonomics.com');"&gt;Adonomics&lt;/a&gt;.  A spreadsheet containing the forum data is also available under a Creative Commons Attribution License, below.
&lt;/p&gt;
&lt;div class="download"&gt;Download the &lt;a href="http://assets.20bits.com/downloads/facebook-data.xls" onclick="javascript:urchinTracker('/filehttp://assets.20bits.com/downloads/facebook-data.xls');"&gt;Facebook data&lt;/a&gt; used to generate the graphs for this article.&lt;/div&gt;
&lt;div class="notice warning"&gt;I've posted &lt;a href="/article/the-state-of-the-platform-update"&gt;an update&lt;/a&gt; clarifying some of my opinions.&lt;/div&gt;</description>
      <pubDate>Tue, 06 May 2008 08:00:41 +0000</pubDate>
      <link>http://20bits.com/article/the-state-of-the-facebook-platform</link>
    </item>
    <item>
      <title>Network Programming in Erlang</title>
      <description>&lt;p&gt;
Since I'm learning Erlang I thought my first non-trivial piece of code would be in an area where the language excels: network programming.
&lt;/p&gt;

&lt;p&gt;
Network programming (or socket programming) is a pain in the ass in most languages.  I first learned how to do it in C using &lt;a href="http://beej.us/guide/bgnet/"&gt;Beej's Guide to Network Programming&lt;/a&gt;.  Read it if you dare.
&lt;/p&gt;

&lt;p&gt;
The big roadblock for most server applications is concurrency.  Languages like C, where concurrency was an afterthought, make developing robust server software more difficult than it has to be.
&lt;/p&gt;

&lt;p&gt;
Even so-called modern languages like Java, Ruby, or Python don't handle it all &lt;em&gt;that&lt;/em&gt; well, although you are relieved from the pain of managing all the minute details of the network connections.  Erlang, on the other hand, was built with this purpose in mind.
&lt;/p&gt;

&lt;p&gt;
I won't be writing any user-facing applications in Erlang any time soon, but I thought, "If I'm going to learn Erlang I may as well learn its strong points first."
&lt;/p&gt;

&lt;p&gt;
To that end I decided to try to replicate the suite of classic UNIX daemons like &lt;tt&gt;echo&lt;/tt&gt; and &lt;tt&gt;chargen&lt;/tt&gt;.
&lt;/p&gt;

&lt;h3&gt;echo&lt;/h3&gt;
&lt;p&gt;
Echo is a service that spits back whatever data is handed to it over a TCP connection, bit-for-bit.  Here it is in Erlang. &lt;pre class="brush: erlang"&gt;-module(echo).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([listen/1]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call echo:listen(Port) to start the service.
listen(Port) -&gt;
	{ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
	accept(LSocket).

% Wait for incoming connections and spawn the echo loop when we get one.
accept(LSocket) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	spawn(fun() -&gt; loop(Socket) end),
	accept(LSocket).

% Echo back whatever data we receive on Socket.
loop(Socket) -&gt;
	case gen_tcp:recv(Socket, 0) of
		{ok, Data} -&gt;
			gen_tcp:send(Socket, Data),
			loop(Socket);
		{error, closed} -&gt;
			ok
	end.&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
We can start this service by calling &lt;tt&gt;echo:listen(&amp;lt;port number&amp;gt;).&lt;/tt&gt; from the Erlang shell, e.g., &lt;tt&gt;echo:listen(8888).&lt;/tt&gt; will start the &lt;tt&gt;echo&lt;/tt&gt; service on port 8888 of your machine.  You can then telnet to port 8888 &amp;mdash; &lt;tt&gt;telnet 127.0.0.1 8888&lt;/tt&gt; &amp;mdash; and see it in action.
&lt;/p&gt;

&lt;p&gt;
Here's the breakdown of the program, by function.
&lt;dl&gt;
	&lt;dt&gt;listen(Port)&lt;/dt&gt;
	&lt;dd&gt;
	Creates a socket that listens for incoming connections on port &lt;strong&gt;Port&lt;/strong&gt; and passes off control to &lt;strong&gt;accept&lt;/strong&gt;.
	&lt;/dd&gt;
	&lt;dt&gt;accept(LSocket)&lt;/dt&gt;
	&lt;dd&gt;
	Waits for incoming connections on &lt;strong&gt;LSocket&lt;/strong&gt;.  Once it receives a connection it spawns a new process that runs the &lt;strong&gt;loop&lt;/strong&gt; function and then waits for the next connection.
	&lt;/dd&gt;
	&lt;dt&gt;loop(Socket)&lt;/dt&gt;
	&lt;dd&gt;
	Waits for incoming data on &lt;strong&gt;Socket&lt;/strong&gt;.  Once it receives the data it immediately sends the same data back across the socket.  If there is an error it exits.
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
There are a few things worth discussing in this example.
&lt;/p&gt;

&lt;h3&gt;Spawning Processes&lt;/h3&gt;
&lt;p&gt;
Processes in Erlang are a basic data type.  They follow the &lt;a href="http://en.wikipedia.org/wiki/Actor_model"&gt;actor model&lt;/a&gt; of concurrent computation and make writing network servers a breeze.
&lt;/p&gt;

&lt;p&gt;
We create new processes using &lt;strong&gt;spawn&lt;/strong&gt;, which takes a &lt;strong&gt;Fun&lt;/strong&gt;, or functional object, as its input.  You can think of a Fun as an anonymous function.  Control of the new process is handed off to the Fun passed in, like a callback.
&lt;/p&gt;

&lt;h3&gt;Functional Objects&lt;/h3&gt;
&lt;p&gt;
Erlang, being a functional programming language, supports functions as first-class objects via the &lt;strong&gt;Fun&lt;/strong&gt;, or functional object, data type.  Functions can create new functions, return functions, and pass functions around.
&lt;/p&gt;

&lt;p&gt;
The syntax to create a new functional object is like this: &lt;pre class="brush: erlang"&gt;MyFunction = fun(...) -&gt;
	% Your Erlang code here
	end.&lt;/pre&gt;
&lt;/p&gt;
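
&lt;p&gt;
For contrast with the thread-based concurrency of the languages mentioned earlier, here is a rough Python equivalent of the echo service.  This is a sketch, not production code; the port and buffer size are arbitrary choices.
&lt;/p&gt;

```python
import socket
import threading

def handle(conn):
    # Echo loop for one client: read until EOF, send everything back.
    while True:
        data = conn.recv(4096)
        if not data:
            break
        conn.sendall(data)
    conn.close()

def listen(port):
    # Accept connections forever, spawning one OS thread per client.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen(5)
    while True:
        conn, _addr = server.accept()
        threading.Thread(target=handle, args=(conn,)).start()
```

&lt;p&gt;
Each connection costs an OS thread here, which is far heavier than an Erlang process; that difference is the point of the comparison.
&lt;/p&gt;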

&lt;h3&gt;CHARGEN&lt;/h3&gt;
&lt;p&gt;
&lt;tt&gt;chargen&lt;/tt&gt; is a service that spews back a stream of characters when you connect to it.  You can read &lt;a href="http://en.wikipedia.org/wiki/CHARGEN"&gt;all about it&lt;/a&gt;, but it's not that interesting.  There's a canonical pattern that it prints out.
&lt;/p&gt;

&lt;p&gt;
Here it is in Erlang. &lt;pre class="brush: erlang"&gt;-module(chargen).
-author('Jesse E.I. Farmer &amp;lt;jesse@20bits.com&amp;gt;').

-export([listen/1]).

-define(START_CHAR, 33).
-define(END_CHAR, 127).
-define(LINE_LENGTH, 72).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false}, {reuseaddr, true}]).

% Call chargen:listen(Port) to start the service.
listen(Port) -&gt;
	{ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
	accept(LSocket).

% Wait for incoming connections and spawn the chargen loop when we get one.
accept(LSocket) -&gt;
	{ok, Socket} = gen_tcp:accept(LSocket),
	spawn(fun() -&gt; loop(Socket) end),
	accept(LSocket).

loop(Socket) -&gt;
	loop(Socket, ?START_CHAR).

loop(Socket, ?END_CHAR) -&gt;
	loop(Socket, ?START_CHAR);
loop(Socket, StartChar) -&gt;
	Line = make_line(StartChar),
	case gen_tcp:send(Socket, Line) of
		{error, _Reason} -&gt;
			exit(normal);
		ok -&gt;
			loop(Socket, StartChar+1)
	end.


make_line(StartChar) -&gt;
	make_line(StartChar, 0).

% Generate a new chargen line -- [13, 10] is CRLF.
make_line(_, ?LINE_LENGTH) -&gt;
	[13, 10];
make_line(?END_CHAR, Pos) -&gt;
	make_line(?START_CHAR, Pos);
make_line(StartChar, Pos) -&gt;
	[StartChar | make_line(StartChar + 1, Pos + 1)].&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
As with &lt;tt&gt;echo&lt;/tt&gt; we can start this by dropping into the Erlang shell and running &lt;tt&gt;chargen:listen(8888)&lt;/tt&gt; to start chargen running on port 8888 (or another port of your choice).
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;accept&lt;/strong&gt; and &lt;strong&gt;listen&lt;/strong&gt; are identical to the functions in &lt;tt&gt;echo&lt;/tt&gt;, but here are the differences:
&lt;dl&gt;
	&lt;dt&gt;loop(Socket, StartChar)&lt;/dt&gt;
	&lt;dd&gt;
	Calls &lt;strong&gt;make_line(StartChar)&lt;/strong&gt; to get the CHARGEN line starting with StartChar, writes it to the socket, and then advances to the next line. 
	&lt;/dd&gt;
	&lt;dt&gt;make_line(StartChar, Pos)&lt;/dt&gt;
	&lt;dd&gt;
	Recursively generates a CHARGEN line, keeping track of the current position in the line with &lt;strong&gt;Pos&lt;/strong&gt;.
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;/p&gt;

&lt;p&gt;
There are a few key conceptual differences, too.
&lt;/p&gt;

&lt;h3&gt;Definitions&lt;/h3&gt;
&lt;p&gt;
As in C we can define constants in Erlang with the &lt;tt&gt;-define&lt;/tt&gt; directive.  These are resolved at compile-time.  You can reference the definition by prefixing it with a question mark, &lt;tt&gt;?&lt;/tt&gt;, so as to differentiate it from a variable.
&lt;/p&gt;

&lt;h3&gt;Function Definition Matching&lt;/h3&gt;
&lt;p&gt;
As with assignment, function calls are done via matching.  When you call a function it looks for the first matching definition.  For example, if we invoke &lt;tt&gt;loop(Socket)&lt;/tt&gt; it finds the appropriate definition, viz., the definition that takes a single argument.
&lt;/p&gt;

&lt;p&gt;
We can fix arguments, too, which is how you deal with loop control in Erlang.  &lt;tt&gt;?END_CHAR&lt;/tt&gt; is 127, so if we call &lt;tt&gt;loop(Socket, 127)&lt;/tt&gt; it first matches that definition rather than the more general &lt;tt&gt;loop(Socket, StartChar)&lt;/tt&gt; definition.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;make_line&lt;/strong&gt; works the same way.  If we're at the last position in the line we return a carriage return and line feed and stop recursing.
&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
I created these to be legible and easily understood.  Working through them helped me understand a lot about the inner workings of Erlang and hopefully they'll do the same for you.  A full-on project page will be coming shortly, but for now you can download the package here.
&lt;/p&gt;

&lt;div class="download"&gt;&lt;a href="http://20bits.com/downloads/erlang-services-0.1.zip"&gt;erlang-services-0.1.zip&lt;/a&gt; or &lt;a href="http://assets.20bits.com/downloads/erlang-services-0.1.tar.gz"&gt;erlang-services-0.1.tar.gz&lt;/a&gt;&lt;/div&gt;</description>
      <pubDate>Fri, 02 May 2008 00:00:16 +0000</pubDate>
      <link>http://20bits.com/article/network-programming-in-erlang</link>
    </item>
    <item>
      <title>Help, Facebook's Hacking Me!</title>
      <description>&lt;p&gt;
BBC's technology program, Click, is claiming to have &lt;a href="http://news.bbc.co.uk/2/hi/technology/7376738.stm"&gt;"exposed a security flaw in the social networking site Facebook which could compromise privacy."&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
ReadWriteWeb, without a trace of humor, followed on with an article called &lt;a href="http://www.readwriteweb.com/archives/facebook_hacked_again.php"&gt;Facebook Hacked Again&lt;/a&gt;.  Yes, the title of the post &lt;em&gt;was&lt;/em&gt; that sensationalist.
&lt;/p&gt;

&lt;p&gt;
Fortunately for us Facebook users, the BBC and ReadWriteWeb show a fundamental misunderstanding of what is happening and of how applications can purportedly "steal" user information, and then proceed to scare us by obscuring the possible solutions.
&lt;/p&gt;

&lt;h3&gt;The BBC's Mistakes&lt;/h3&gt;
&lt;p&gt;
Since the BBC's report is all video, here's a screen capture and a transcript of the voice-over that accompanies it.&lt;img src="http://assets.20bits.com/20080501/31337h4x0r.png" alt="An 31337 H4X0R" title="31337h4x0r" width="500" height="281" class="math size-full wp-image-114" /&gt;
&lt;/p&gt;
&lt;p&gt;
And the transcript:
&lt;blockquote&gt;We managed to write a very simple application which steals a user's personal Facebook details, and those of all their friends, without their knowledge.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Their report bothers me first as an engineer because the BBC talks as if this is some sort of sophisticated attack.  Just look at the screen capture.
&lt;/p&gt;

&lt;p&gt;
That's right &amp;mdash; unless you're elite enough to be sitting in a room lit like a rave working with two MacBook Pros, there's just no way you'd be able to pull this shit off.  Leave it to us, kid, we're professionals.
&lt;/p&gt;

&lt;p&gt;
Snark aside, here's what's happening.  In the summer of 2006 Facebook opened up their &lt;a href="http://wiki.developers.facebook.com/index.php/API"&gt;REST API&lt;/a&gt; to third-party websites.  Yes, this actually pre-dates the platform, which launched less than a year ago in May of 2007.
&lt;/p&gt;

&lt;p&gt;
Among other things, the API permits people to grant external websites permission to access a user's data.  Since the launch of the Facebook platform most applications exist on Facebook, but the API remains the same.&lt;/p&gt;

&lt;p&gt;
When you try to log into or add an application, here's an example of what you'd see.  I've highlighted some relevant parts.
&lt;img src="http://assets.20bits.com/20080501/add-application1.png" alt="Add Application screen" title="add-application1" width="410" height="394" class="math size-full wp-image-116" /&gt;
&lt;/p&gt;

&lt;p&gt;
So the BBC's claim that applications can access a user's data "without their knowledge" is dubious at best.  Sure, it's likely that the user will bypass all that text and go right for the big blue button, but the BBC report makes it sound like applications are doing something sneaky.
&lt;/p&gt;

&lt;p&gt;
Sorry, folks, but it's right there: Allow this application to know who I am and access my information.  Check.   
&lt;/p&gt;

&lt;p&gt;
Imagine this exposé instead. "BBC Uncovers Fatal Flaw in Valet Parking System," in which our intrepid reporter poses as a valet and drives off with someone's car.  It's so easy, and there's nothing stopping them!
&lt;/p&gt;

&lt;p&gt;
But we trust valets not to do it because the valet will get fired and the police will arrest him.  And it's the same on Facebook.  In fact, Facebook requires that developers adhere to its &lt;a href="http://developers.facebook.com/terms.php"&gt;Terms of Use&lt;/a&gt;, which explicitly forbid such uses of user information.  Of course, using this data for identity theft is more than just a violation of Facebook's Terms of Use; it's a violation of the law.
&lt;/p&gt;

&lt;h3&gt;Exaggerated Dangers&lt;/h3&gt;
&lt;p&gt;
The BBC mentions the above Terms of Use clause in passing, but then quickly states that your information is at risk if even one of your friends installs an application.  Yikes!  Is that true?
&lt;/p&gt;

&lt;p&gt;
Well, yes and no.  Yes, under certain configurations applications can get information about a user's friends even if those friends haven't installed the application.  But you're nowhere near as helpless as the BBC makes you seem.
&lt;/p&gt;

&lt;p&gt;
Here is a screenshot of Facebook's &lt;a href="http://www.facebook.com/privacy/?view=platform&amp;tab=other"&gt;Application Privacy&lt;/a&gt; page: &lt;img src="http://assets.20bits.com/20080501/application-privacy.png" alt="" title="application-privacy" width="499" height="413" class="math size-full wp-image-117" /&gt;
&lt;/p&gt;

&lt;p&gt;
Notice the text above the field of options. &lt;blockquote&gt;The following settings apply only to Facebook Platform applications to which you have not already granted access or explicitly restricted. For these applications, the information you select will be available to friends and other users who can already see your information on Facebook&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
The BBC and ReadWriteWeb are a day late and a dollar short.  Not only is it against Facebook's rules to "steal" user data in this way, but Facebook actually provides mechanisms that allow users to secure their data.  I, personally, don't let applications I haven't installed see more than my Facebook photo.  They can't get my name, date of birth, location &amp;mdash; any of that.
&lt;/p&gt;
&lt;p&gt;
To summarize, the BBC and ReadWriteWeb didn't really uncover anything except a way to abuse a feature intentionally built into the Facebook platform in a way that Facebook anticipated two years ago.  What they claim is technically accurate but the dangers are grossly exaggerated.
&lt;/p&gt;

&lt;p&gt;
There are at least four levels of protection.
&lt;ol&gt;
	&lt;li&gt;Facebook forbids developers from storing user data in their Terms of Use.&lt;/li&gt;
	&lt;li&gt;Facebook provides mechanisms for me to hide data from applications I have installed directly.&lt;/li&gt;
	&lt;li&gt;For applications that I haven't installed but my friends have, I have full control over what they can and cannot see on Facebook's &lt;a href="http://www.facebook.com/privacy/?view=platform&amp;tab=other"&gt;Application Privacy&lt;/a&gt; page.&lt;/li&gt;
	&lt;li&gt;Above all this, there is the law.  Identity theft is illegal and using something like Facebook to steal personal data probably only increases the risks.  If I were looking to steal someone's identity I'd rather just look through their garbage, personally.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
This is not a hack and Facebook has controls for dealing with this on both the developer side and user side.  Don't buy into the BBC's and RWW's sensationalism.  Please.
&lt;/p&gt;</description>
      <pubDate>Thu, 01 May 2008 16:46:10 +0000</pubDate>
      <link>http://20bits.com/article/help-facebooks-hacking-me</link>
    </item>
    <item>
      <title>Interview Questions: Counting Bits</title>
      <description>&lt;p&gt;
Continuing my series of &lt;a href="/tag/interview"&gt;interview questions&lt;/a&gt;, today I bring you the classic bit-counting problem.
&lt;/p&gt;

&lt;p&gt;
The setup usually goes something like this.  We're receiving gigabytes of data per second.  Each chunk of data comes with a header that contains an unsigned 32-bit integer.  Let's call that integer the routing number.  We choose the routing destination based on the number of on bits in the binary representation of the routing number.
&lt;/p&gt;

&lt;p&gt;
Write a routine in C that returns the number of on bits in the binary representation of an unsigned 32-bit integer.
&lt;/p&gt;

&lt;h3&gt;The Naive Solution&lt;/h3&gt;
&lt;p&gt;
As usual there's a naive solution.  In this case you could loop through the bits one at a time, counting the number of ones. &lt;pre class="cpp"&gt;int bitcount(unsigned int n) {
	int count = 0;    
	while (n) {
		count += n &amp; 0x1u;
		n &gt;&gt;= 1;
	}
	return count;
}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;&gt;&gt;&lt;/tt&gt; is the right bit-shift operator.  It drops the right-most bit from the binary representation of an integer.  So, binary &lt;tt&gt;1001&lt;/tt&gt; shifted right by one is binary &lt;tt&gt;0100&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
The above has a few issues.  First, it takes O(b) time, where b is the number of bits in the binary representation of the integer.  Can we do better?  Second, it doesn't take into account the fact that n is a 32-bit integer&lt;span class="footnote"&gt;Let's ignore the subtleties of integer types in C for now, ok?&lt;/span&gt;.
&lt;/p&gt;

&lt;h3&gt;Pre-computation&lt;/h3&gt;
&lt;p&gt;
Since speed was a requirement, something that takes linear time is probably a bad idea.  The key idea is to realize that a deterministic function, like &lt;tt&gt;bitcount&lt;/tt&gt;, is no different from a hash where the keys are the inputs to the function and the values are the outputs of the function.
&lt;/p&gt;

&lt;p&gt;
This is the principle behind memoization, for example, but here we're sitting pretty.  Since both the input and output are unsigned integers we can create a regular array, call it &lt;tt&gt;bit_table&lt;/tt&gt;, where &lt;tt&gt;bit_table[i]&lt;/tt&gt; is the number of on bits in the binary representation of &lt;tt&gt;i&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
Furthermore since we have the constraint that the integer is 32-bits we can, in theory, pre-compute the entirety of &lt;tt&gt;bit_table&lt;/tt&gt; and include it in a header.  It'd work like this: &lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table32[1ull &lt;&lt; 32]; /* 0x1u &lt;&lt; 32 would overflow a 32-bit unsigned; illustrative only */

int bitcount_32(unsigned int n) {
	return bit_table32[n &amp; 0xFFFFFFFFu];
}&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;Size Constraints&lt;/h3&gt;
&lt;p&gt;
&lt;tt&gt;bit_table32&lt;/tt&gt; is going to contain 4,294,967,296 integers.  Depending on the size of an integer on your platform this will probably take up several gigabytes of memory.  If we want a constant-time algorithm that takes up significantly less memory we can create a 16-bit table and use bit arithmetic.
&lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table16[0x1u &lt;&lt; 16];

// This only works for 32-bit integers but takes constant time.
int bitcount_32(unsigned int n) {
	return bit_table16[n &amp; 0xFFFFu] + bit_table16[(n &gt;&gt; 16) &amp; 0xFFFFu];
}&lt;/pre&gt;
&lt;/p&gt;
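
&lt;p&gt;
The C snippets above leave the table pre-computation to "elsewhere."  As an illustration of the scheme &amp;mdash; in Python, with arithmetic standing in for the C bit operators &amp;mdash; building the 16-bit table and combining the two halves looks like this:
&lt;/p&gt;

```python
# bit_table16[i] is the number of on bits in the 16-bit integer i.
bit_table16 = [bin(i).count("1") for i in range(65536)]

def bitcount_32(n):
    # n % 65536 is the low 16 bits (like n AND 0xFFFF in C);
    # n // 65536 is the high 16 bits (like n shifted right by 16).
    return bit_table16[n % 65536] + bit_table16[n // 65536]
```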

&lt;h3&gt;The Unrestricted Case&lt;/h3&gt;
&lt;p&gt;
If we don't know how many bits the integer will contain (say we moved from a 32-bit to a 64-bit platform) then we can iterate over the binary representation 16 bits at a time, using the pre-computed table at each step&lt;span class="footnote"&gt;For the hard-core bit-counters out there, the C specification requires that integers contain &lt;em&gt;at least&lt;/em&gt; 16 bits.&lt;/span&gt;.
&lt;pre class="cpp"&gt;// Pre-compute this elsewhere and put it here.
static unsigned int bit_table16[0x1u &lt;&lt; 16];

// This works for any sized integer but no longer takes constant time.
int bitcount(unsigned int n) {
	int count = 0;
	while (n) {
		count += bit_table16[n &amp; 0xFFFFu];
		n &gt;&gt;= 16;
	}
	
	return count;
}&lt;/pre&gt;
&lt;/p&gt;
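
&lt;p&gt;
The same idea sketched in Python handles integers of any width, since each iteration consumes 16 bits (again with arithmetic in place of the bit operators):
&lt;/p&gt;

```python
# bit_table16[i] is the number of on bits in the 16-bit integer i.
bit_table16 = [bin(i).count("1") for i in range(65536)]

def bitcount(n):
    # Consume 16 bits per iteration, so the loop runs width/16 times.
    count = 0
    while n:
        count += bit_table16[n % 65536]
        n //= 65536
    return count
```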

&lt;h3&gt;Good or Bad Question?&lt;/h3&gt;
&lt;p&gt;
This question suffers from the same problem as the &lt;a href="/article/interview-questions-loops-in-linked-lists"&gt;linked-list loop detection&lt;/a&gt; question: you probably either know the solution or you don't.
&lt;/p&gt;

&lt;p&gt;
That said, the solution here &amp;mdash; pre-computing a table of values to save CPU time &amp;mdash; is much more common than the tortoise-and-hare solution in the previous question, so the likelihood of it dawning on you during the interview is that much greater.  Plus I've been asked this question so many times that it's one of those must-know exercises, in my opinion, even if the question itself could be better.
&lt;/p&gt;</description>
      <pubDate>Wed, 30 Apr 2008 00:00:59 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-counting-bits</link>
    </item>
    <item>
      <title>Learning Erlang</title>
      <description>&lt;p&gt;
Last week I decided to learn Erlang, a functional programming language developed by Ericsson in 1987 for use in telecommunications environments.  It's probably the strangest non-toy programming language I've ever tried to learn, so I thought I'd share some of my realizations.
&lt;/p&gt;

&lt;h3&gt;Variables vs. Atoms&lt;/h3&gt;
&lt;p&gt;
First, variables in Erlang are not like variables as most programmers think about them.  Fortunately for me they're a lot like variables as mathematicians think of them.
&lt;/p&gt;

&lt;p&gt;
That is, variables in Erlang are either bound or unbound, and bound variables cannot be rebound in the same context.  This means that variables are write-once.  
&lt;/p&gt;
&lt;p&gt;
If I declare &lt;tt&gt;Name = "Jim".&lt;/tt&gt; I cannot later declare &lt;tt&gt;Name = "Betty".&lt;/tt&gt; in the same context.  Erlang will throw a matching error because it's trying to match the right-hand side, "Betty," against the left-hand side, which is bound to "Jim."
&lt;/p&gt;

&lt;p&gt;
When the left-hand side is unbound any match will succeed and assignment will occur, but if the left-hand side is bound Erlang will try to match the right-hand side to the value of the bound variable.  Thus, if &lt;tt&gt;"Jim"&lt;/tt&gt; is bound to &lt;tt&gt;Name&lt;/tt&gt;, both &lt;tt&gt;Name = "Jim".&lt;/tt&gt; and &lt;tt&gt;"Jim" = Name.&lt;/tt&gt; will succeed, but &lt;tt&gt;Name = "Betty".&lt;/tt&gt; will fail.  Weird, huh?
&lt;/p&gt;

&lt;p&gt;
Second, "context" in Erlang means lexical scope.  What's more, there is no global scope.  This is to enforce a no-side-effects style of programming, I suppose.
&lt;/p&gt;

&lt;p&gt;
Finally, variables in Erlang start with a capital letter.  Always. That is, &lt;tt&gt;Var&lt;/tt&gt; is always a variable but &lt;tt&gt;var&lt;/tt&gt; is never a variable.  If you execute &lt;tt&gt;var = 5.&lt;/tt&gt; you'll get a matching error.
&lt;/p&gt;

&lt;p&gt;
In this case &lt;tt&gt;var&lt;/tt&gt; is treated as an atom by Erlang.  Atoms serve the same role that symbols do in Ruby.  Any literal that isn't another data type, variable, or function is an atom.
&lt;/p&gt;

&lt;p&gt;
Atoms usually start with lower-case letters but you can also denote atoms by enclosing the name in single quotes.  So, &lt;tt&gt;Var&lt;/tt&gt; is a variable but &lt;tt&gt;'Var'&lt;/tt&gt; is an atom.  &lt;tt&gt;var&lt;/tt&gt; is an atom, too, and never a variable.
&lt;/p&gt;

&lt;h3&gt;Data Types&lt;/h3&gt;
&lt;p&gt;
In addition to atoms there are other data types.  All the favorites are here, like integers, floats, and strings.  We also have &lt;strong&gt;Funs&lt;/strong&gt;, or "functional objects," which are anonymous functions.
&lt;/p&gt;

&lt;p&gt;
Erlang also has two basic compound data types: lists and tuples.  These are analogues of the same objects in Python.  Items in both lists and tuples are separated by commas, but lists are enclosed by brackets, &lt;tt&gt;[]&lt;/tt&gt;, and tuples by curly braces, &lt;tt&gt;{}&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
For example, &lt;tt&gt;[1,3.4,true]&lt;/tt&gt; is a list and &lt;tt&gt;{person, 25, "Jason"}&lt;/tt&gt; is a tuple.
&lt;/p&gt;

&lt;p&gt;
There are no booleans in Erlang.  Instead the atoms &lt;tt&gt;true&lt;/tt&gt; and &lt;tt&gt;false&lt;/tt&gt; are used.
&lt;/p&gt;

&lt;h3&gt;Assignment vs. Pattern Matching&lt;/h3&gt;
&lt;p&gt;
In every other language I've ever used assignment works something like this: &lt;tt&gt;var x = 5&lt;/tt&gt;.  In Erlang there is no assignment, at least not in this sense.  Rather, Erlang matches patterns, and variables will match any pattern.
&lt;/p&gt;

&lt;p&gt;
Consider the following (using the &lt;tt&gt;erl&lt;/tt&gt; shell): &lt;pre class="brush: erlang"&gt;1&gt; {ip, IP} = {ip, "192.168.0.1"}.
{ip,"192.168.0.1"}
2&gt; IP.
"192.168.0.1"&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Erlang is matching the left-hand and right-hand sides and trying to align them.  &lt;tt&gt;IP&lt;/tt&gt; is a variable (we know this because it starts with a capital letter), so it matches any pattern.  &lt;tt&gt;ip&lt;/tt&gt; is an atom (we know this because it starts with a lower-case letter).
&lt;/p&gt;

&lt;p&gt;
In this case alignment is possible because &lt;tt&gt;ip&lt;/tt&gt; matches on both sides and IP is bound to the value &lt;tt&gt;"192.168.0.1"&lt;/tt&gt;. 
&lt;/p&gt;
&lt;p&gt;
Now consider this:
&lt;pre class="brush: erlang"&gt;1&gt; {foobar, IP} = {ip, "192.168.0.1"}.
** exception error: no match of right hand side value {ip,"192.168.0.1"}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Here we get an error because &lt;tt&gt;foobar&lt;/tt&gt; and &lt;tt&gt;ip&lt;/tt&gt; are different atoms, making a match impossible.  If instead we did
&lt;pre class="brush: erlang"&gt;1&gt; {Atom, IP} = {ip, "192.168.0.1"}.  
{ip,"192.168.0.1"}
2&gt; Atom.
ip
3&gt; IP.
"192.168.0.1"&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Here there's no error because &lt;tt&gt;Atom&lt;/tt&gt; is a variable.  It is bound with a value of &lt;tt&gt;ip&lt;/tt&gt;, which is an atom.
&lt;/p&gt;

&lt;p&gt;
Here's a more subtle example.  &lt;pre class="brush: erlang"&gt;1&gt; {A, {B, C}} = {first, {second, third}}.
{first,{second,third}}
2&gt; A.
first
3&gt; B.
second
4&gt; C.
third
5&gt; {X, Y} = {first, {second, third}}.
{first,{second,third}}
6&gt; X.
first
7&gt; Y.
{second,third}&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
If you understand why A, B, C, X, and Y get bound to the values that they do then I think you're a long way towards understanding how &lt;tt&gt;=&lt;/tt&gt; works in Erlang.
&lt;/p&gt;
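&lt;p&gt;
If it helps, Ruby's destructuring assignment offers a loose analogy (my comparison, not anything from Erlang itself): nesting on the left-hand side is matched against the shape of the right-hand side.
&lt;/p&gt;

```ruby
# Loose Ruby analogy to the Erlang examples above.  Nested parentheses
# on the left mirror the nested tuple pattern {A, {B, C}}.
a, (b, c) = [:first, [:second, :third]]
# a == :first, b == :second, c == :third

# Without the nesting, the inner array is bound whole, like {X, Y}.
x, y = [:first, [:second, :third]]
# x == :first, y == [:second, :third]
```

&lt;p&gt;
One important difference: Ruby's destructuring never fails on a mismatch, whereas Erlang raises a badmatch error when a literal like &lt;tt&gt;foobar&lt;/tt&gt; can't be aligned.
&lt;/p&gt;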

&lt;h3&gt;Looping vs. Recursion&lt;/h3&gt;
&lt;p&gt;
Since variables in Erlang are single-assignment, bound once within their scope, procedural-style looping is difficult.  &lt;tt&gt;i++&lt;/tt&gt; is not only verboten, it is syntactically invalid.
&lt;/p&gt;

&lt;p&gt;
Instead loops are done through recursion.  Here is the factorial function:
&lt;pre class="brush: erlang"&gt;-module(factorial).
-export([factorial/1]).

factorial(0) -&gt; 1;
factorial(N) -&gt;
	N * factorial(N-1).&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Briefly, &lt;tt&gt;-module&lt;/tt&gt; defines an Erlang module, which is the mechanism by which the language supports code separation.  &lt;tt&gt;-export&lt;/tt&gt; tells Erlang which functions in this module to export.  The &lt;tt&gt;/1&lt;/tt&gt; after &lt;tt&gt;factorial&lt;/tt&gt; on the export line is the function's &lt;a href="http://en.wikipedia.org/wiki/Arity"&gt;arity&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
As with variable assignment, Erlang uses pattern matching in defining functions.  Since &lt;tt&gt;0&lt;/tt&gt; is an integer literal, all instances of &lt;tt&gt;factorial(0)&lt;/tt&gt; match it.  Any other calls to &lt;tt&gt;factorial&lt;/tt&gt; with a single argument match the second and &lt;tt&gt;N&lt;/tt&gt; is bound to that argument.
&lt;/p&gt;

&lt;h3&gt;Tail Recursion&lt;/h3&gt;
&lt;p&gt;
Since iterative loops are difficult in Erlang, making sure your recursive functions are tail recursive is important.  This means the last call a recursive function makes should be to itself.
&lt;/p&gt;

&lt;p&gt;
The &lt;tt&gt;factorial&lt;/tt&gt; function above is not tail recursive &amp;mdash; the last call it makes is to &lt;tt&gt;*&lt;/tt&gt; rather than &lt;tt&gt;factorial&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
To fix this we need to re-write &lt;tt&gt;factorial&lt;/tt&gt; to make use of an accumulator.
&lt;pre class="brush: erlang"&gt;-module(factorial).
-export([factorial/1]).

factorial(N) -&gt;
	factorial(N,1).

factorial(0, Acc) -&gt;
	Acc;
factorial(N,Acc) -&gt;
	factorial(N-1, N*Acc).&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Thanks to Erlang's pattern matching capabilities we don't even have to redefine the interface.  We export only the single-argument &lt;tt&gt;factorial/1&lt;/tt&gt;; the two-argument helper stays private to the module.
&lt;/p&gt;
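&lt;p&gt;
For comparison, here's the same accumulator trick sketched in Ruby (my translation, not part of the original Erlang; note that standard Ruby doesn't optimize tail calls, so this illustrates the shape of the recursion rather than a performance win):
&lt;/p&gt;

```ruby
# Accumulator-style factorial: the recursive call is the last thing the
# method does, so the partial product rides along in acc instead of
# being multiplied after the call returns.
def factorial(n, acc = 1)
  return acc if n.zero?
  factorial(n - 1, n * acc)  # tail call: nothing left to do afterward
end
```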

&lt;h3&gt;The Future&lt;/h3&gt;
&lt;p&gt;
Erlang is about concurrency and message-passing, so for my first exercise I'm going to try to create some simple network services. 
&lt;/p&gt;

&lt;p&gt;
Also, does anyone know of a GeSHi plugin for Erlang?
&lt;/p&gt;</description>
      <pubDate>Tue, 29 Apr 2008 16:37:54 +0000</pubDate>
      <link>http://20bits.com/article/learning-erlang</link>
    </item>
    <item>
      <title>The Future is Discovery, not Just Search</title>
      <description>&lt;p&gt;
Let's start with a picture from &lt;a href="http://www.radarnetworks.com/"&gt;Radar Networks'&lt;/a&gt; CEO Nova Spivack:
&lt;img src="http://assets.20bits.com/20080425/keyword-search-slide.png" alt="" title="keyword-search-slide" width="500" height="376" class="math size-full wp-image-106" /&gt;
&lt;/p&gt;

&lt;p&gt;
Erick Schonfeld, asking "&lt;a href="http://www.techcrunch.com/2008/04/25/is-keyword-search-about-to-hit-its-breaking-point/"&gt;Is Keyword Search About to Hit its Breaking Point?&lt;/a&gt;," talks about Spivack's view of the future of the web.  According to him it lies in ever-more-refined search technologies such as semantic search, natural language search, and artificial intelligence.  A quote: &lt;blockquote&gt;Keyword search engines return haystacks, but what we really are looking for are the needles.  The problem with keyword search such as Google's approach is that only highly cited pages make it into the top results.  You get a huge pile of results, but the page you want &amp;mdash; the "needle" you are looking for &amp;mdash; may not be highly cited by other pages and so it does not appear on the first page.  This is because keyword search engines don't understand your question, they just find pages that match the words in your question.&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
Spivack wants to "do for data what the Web did for documents" and develop a standard, uniform system for semantic metadata.  It's the classic "dumb software, smart data" idea.  Tagging works to a degree, but it's neither uniform nor standard &amp;mdash; the same tag can mean two different things for two different people, and two different tags can mean the same thing.
&lt;/p&gt;

&lt;p&gt;
That said, the premise underpinning Spivack's whole argument is that search is the correct interface when faced with a world of exponentially-increasing information.  His version of the future says, "Keyword search will become increasingly inefficient and the solution is to develop semantically-aware systems that search based on meaning, rather than content."
&lt;/p&gt;

&lt;h3&gt;Search and Discovery&lt;/h3&gt;
&lt;p&gt;
Let's take a step back and think of other situations where we are faced with more information than we can handle at once, for example, music.  How do you get new music?  If you want some new hip hop do you search for it?
&lt;/p&gt;

&lt;p&gt;
In truth, nobody I know searches for new music.  How can you search for something you don't know, anyhow?  Search doesn't just profit off intent, it requires it.  To find new hip hop I'd ask a friend who is into that scene and get his opinion, or browse through the new releases at my local record store or iTunes.
&lt;/p&gt;

&lt;p&gt;
The same pattern exists on TV.  People don't search for new shows, they discover them either through friends and advertisements, or by channel surfing.
&lt;/p&gt;

&lt;h3&gt;A Bi-Modal Future&lt;/h3&gt;
&lt;p&gt;
The future of information on the Web does not rest in super-advanced search, but in both search and discovery.  This bi-modal existence makes sense because people behave in two ways depending on whether they have intent or not.
&lt;/p&gt;

&lt;p&gt;
If someone knows what they want, say, the average RBI among hitters in the American League, then search is perfect.  If, however, you're in a channel surfing mood, then search is worthless because you don't know what you want &amp;mdash; but you will when you see it.
&lt;/p&gt;

&lt;p&gt;
Lots of sites straddle this divide.  Yelp, for example, helps in discovery by giving you sensible metadata in the form of ratings.  This fits into Spivack's hypothesis.  I have some level of intent (e.g., "I want Thai food in San Francisco"), but not much.
&lt;/p&gt;

&lt;p&gt;
But sites like YouTube fall clear off the discovery side of the gap.  Who searches YouTube unless they're trying to find a video they've already seen and want to show a friend?  Furthermore, who uses the metadata on the site (besides, perhaps, related videos) to find new content?  Most of the highly-rated and highly-viewed stuff, speaking for myself and my friends, is not what I watch regularly.
&lt;/p&gt;

&lt;p&gt;
Instead I discover videos on YouTube through my social network or by serendipitously finding a great video embedded in a website I happen to be reading.  Indeed, there are whole sites, like &lt;a href="http://www.stumbleupon.com/"&gt;StumbleUpon&lt;/a&gt;, whose main mechanic is serendipity.
&lt;/p&gt;

&lt;p&gt;
I'm still uncovering new information, but I'm sure as heck not searching for it in the search engine sense of the word.
&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;
In short, search is what we do when we have an idea of what we want and discovery is what we do the rest of the time.  When looking for something to watch on TV people don't search, they channel surf.  And when people want to find facts they search; they don't stumble around aimlessly.
&lt;/p&gt;

&lt;p&gt;
As information density increases and more pieces of media, information, knowledge, and, in general, data become available online both mechanics, search and discovery, will have to be developed to accommodate the volume. Why?&lt;/p&gt;

&lt;p&gt;
In a world with more and more data the percentage of data that we are actively able to query becomes smaller and smaller.  That is, if there is more data not only do we know less as a percentage of all the information out there, but we have less knowledge of what we do and do not know.
&lt;/p&gt;

&lt;p&gt;This is where discovery fits and it's a mistake to think the only solution is a single, ultra-intelligent search agent, or a single, unifying data structure for the Web.
&lt;/p&gt;

&lt;p&gt;
Human behavior tells us otherwise.
&lt;/p&gt;</description>
      <pubDate>Fri, 25 Apr 2008 11:35:49 +0000</pubDate>
      <link>http://20bits.com/article/the-future-is-discovery-not-just-search</link>
    </item>
    <item>
      <title>Interview Questions: Loops in Linked Lists</title>
      <description>&lt;p&gt;
This is part of my series on &lt;a href="/tag/interview"&gt;interview questions&lt;/a&gt;, so welcome aboard!
&lt;/p&gt;

&lt;p&gt;
This installment deals with a common question about &lt;a href="http://en.wikipedia.org/wiki/Linked_list"&gt;linked lists&lt;/a&gt; &amp;mdash; how do we detect when one has a loop?
&lt;/p&gt;
&lt;h3&gt;Linked Lists&lt;/h3&gt;

&lt;p&gt;
Linked lists are one of the simplest data structures and most aspiring programmers learn them early on.  But for completeness' sake let's cover that ground.
&lt;/p&gt;

&lt;p&gt;
A linked list is a sequence of nodes.  Each node contains a piece of data and a reference to the next node in the list.  Graphically it looks like this 
&lt;img src="http://assets.20bits.com/20080417/linked-list.png" alt="" title="linked-list" width="501" height="99" class="math" /&gt;
&lt;/p&gt;

&lt;h3&gt;Loopy Linked Lists&lt;/h3&gt;
&lt;p&gt;
It's possible, though, that a node in a linked list might point to a previous element in the list.  This is bad for many reasons, not the least of which is that any loop which iterates over all the nodes in the list by accessing the next node will never terminate.
&lt;/p&gt;
&lt;p&gt;
So, it becomes important to detect when linked lists have loops.  Here's what one such errant linked list looks like.
&lt;/p&gt;
&lt;p&gt;
&lt;img src="http://assets.20bits.com/20080417/loopy-linked-list.png" alt="" title="loopy-linked-list" width="501" height="232" class="math" /&gt;
&lt;/p&gt;

&lt;h3&gt;The Easy Solution&lt;/h3&gt;
&lt;p&gt;
The easy solution is to keep track of every node seen so far and check if the current node is in that list.  Here's a very simple linked list implementation in Ruby.
&lt;pre class="brush: ruby"&gt;class Node
  attr_accessor :data, :next
  
  def initialize(data = nil)
    @data = data
    @next = nil
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;Here is the simple solution for detecting loops using the above implementation &lt;pre class="brush: ruby"&gt;def has_loop?(node)
  seen = []
  until node.next.nil? do
    return true if seen.include? node
    seen &lt;&lt; node
    node = node.next
  end
  false
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This solution is workable but sub-optimal (surprise!).  This has O(n&lt;sup&gt;2&lt;/sup&gt;) complexity in CPU and O(n) complexity in memory, but a solution with O(n) complexity in CPU and O(1) complexity in memory is possible.  In fact, this question is usually posed to preclude the above solution.
&lt;/p&gt;

&lt;h3&gt;The Tortoise and the Hare&lt;/h3&gt;
&lt;p&gt;
The better solution involves a bit of mathematical thinking.  If there is a loop then any iterator, no matter how many steps it takes per iteration, must eventually end up inside it, going around forever.
&lt;/p&gt;

&lt;p&gt;
So, if we have two iterators moving at different speeds, the gap between them changes by a fixed amount each iteration.  With speeds of one and two the gap grows by exactly one node per iteration, so modulo the loop's length it takes on every value and the two iterators eventually land on the same node.  The usual solution is to have one iterator advance one node at a time ("the tortoise") and a second iterator advance two at a time ("the hare").
&lt;/p&gt;
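&lt;p&gt;
To see why they must collide, here's a toy sketch (my illustration, not real list nodes): model a five-node loop as positions 0 through 4, where each node points to the next position mod 5.
&lt;/p&gt;

```ruby
# Model a 5-node cycle with modular arithmetic: node i points to (i+1) % 5.
# The tortoise advances 1 node per step, the hare 2, so the gap between
# them grows by 1 per step and cycles through every position.
slow = 0
fast = 0
meetings = []
10.times do |t|
  slow = (slow + 1) % 5
  fast = (fast + 2) % 5
  meetings.push(t + 1) if slow == fast  # record the step they coincide
end
# They meet every 5 steps: here after iterations 5 and 10.
```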

&lt;p&gt;
That algorithm looks like this &lt;pre class="brush: ruby"&gt;def has_loop?(node)
  slow = node
  fast = node
  until fast.nil? or fast.next.nil? do
    slow = slow.next
    fast = fast.next.next
    return true if slow == fast
  end
  false
end&lt;/pre&gt;
&lt;/p&gt;

&lt;h3&gt;More Questions&lt;/h3&gt;
&lt;p&gt;
Assuming you answer the above correctly and quickly, the interviewer will probably follow up with some related questions.  How do you fix the linked list when you detect a loop?  What if the linked list has multiple loops?  How do you determine the size of the loop?
&lt;/p&gt;

&lt;p&gt;
I'll leave you to ponder these questions.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Thu, 17 Apr 2008 00:00:15 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-loops-in-linked-lists</link>
    </item>
    <item>
      <title>The Best Facebook Ad Network</title>
      <description>&lt;p&gt;
Even though most Facebook application developers make money off of low-CPM display ads from one of the many Facebook-specific ad networks, browsing the &lt;a href="http://forum.developers.facebook.com/"&gt;developer forum&lt;/a&gt; shows that lots of people don't have good information about which ad network is right for them.  I'm here to tell you, once and for all, which ad network is the best.
&lt;/p&gt;

&lt;h3&gt;The State of Advertising&lt;/h3&gt;
&lt;p&gt;
There are dozens of Facebook ad networks with new ones cropping up all the time, including, in no particular order, &lt;a href="http://www.socialmedia.com/"&gt;SocialMedia&lt;/a&gt;, &lt;a href="https://www.rockyouads.com/ams/partner/publisher/index.php"&gt;RockYou&lt;/a&gt;, &lt;a href="http://www.lookery.com/"&gt;Lookery&lt;/a&gt;, &lt;a href="http://cubics.com/"&gt;Cubics&lt;/a&gt;, &lt;a href="http://www.videoegg.com"&gt;VideoEgg&lt;/a&gt;, and &lt;a href="http://adblade.com/"&gt;AdBlade&lt;/a&gt;.  They vary according to their terms, deals, performance, and stability.
&lt;/p&gt;

&lt;p&gt;
As you can see the Facebook ad market is still very immature.  Because of that most ad networks don't have the inventory to satisfy the demand and the fragmentation makes it difficult to get major advertising agencies to spend on Facebook.
&lt;/p&gt;

&lt;p&gt;
It doesn't help that most developers are inexperienced when it comes to advertising, so one-on-one deals are probably out of reach for most of them.
&lt;/p&gt;

&lt;h3&gt;The Best Facebook Ad Network&lt;/h3&gt;

&lt;p&gt;
In light of the above, most developers turn to one of many ad networks to rep their inventory.  If they're lucky they might see a $2 &lt;a href="http://en.wikipedia.org/wiki/Cost_per_mille"&gt;eCPM&lt;/a&gt;, but in truth they'll probably see an order of magnitude or two less.
&lt;/p&gt;

&lt;p&gt;
Lookery, for example, has a "guaranteed 12¢ CPM" program for &lt;strike&gt;US&lt;/strike&gt; your traffic.  They had to stop signing people up because the demand overwhelmed them.  That tells you something about how much you can expect to earn by running simple display ads on Facebook.
&lt;/p&gt;

&lt;p&gt;
So, which ad network is the best?  The best ad network is the one that earns you the most money in the long run&lt;span class="footnote"&gt;Since ad-supported companies exist in a &lt;a href="/article/web-20-and-two-sided-markets"&gt;two-sided market&lt;/a&gt; it's important to realize that short-term gains (e.g., tricking your users into installing spyware) might be readily available, but at the cost of future growth.  How you find the balance point is beyond the scope of this article.&lt;/span&gt;.
&lt;/p&gt;

&lt;h3&gt;You Set Me Up! or A/B Testing for Fun and Profit&lt;/h3&gt;
&lt;p&gt;
Ok, ok, I admit it, I set you up.  But it's true.  Your choice of ad network should be simple.  Does Lookery make me more money in the long run than any of the others?  If yes, use Lookery.
&lt;/p&gt;

&lt;p&gt;
You can determine this quantitatively with &lt;a href="http://en.wikipedia.org/wiki/A/B_testing"&gt;A/B testing&lt;/a&gt;.  Here's a basic example, in PHP:
&lt;/p&gt;
&lt;p&gt;
&lt;pre class="brush: php"&gt;function get_random_ad_code() {
	$ad_codes = array(
		'lookery'     =&gt; 'Your Lookery ad code',
		'adblade'     =&gt; 'Your AdBlade ad code',
		'socialmedia' =&gt; 'Your SocialMedia ad code',
		'rockyou'     =&gt; 'Your RockYou ad code'
	);
	
	return $ad_codes[array_rand($ad_codes)];
}

echo get_random_ad_code();&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;tt&gt;get_random_ad_code&lt;/tt&gt; will return each ad code with equal probability.  Assuming all other variables are constant (i.e., the ads appear in the same place, with the same colors, and without any other ads on the page) then you can look at the earnings reports for each of the above and know, for certain, which ad network performs best on average &lt;em&gt;for your app&lt;/em&gt;.
&lt;/p&gt;
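&lt;p&gt;
The comparison itself is simple arithmetic once the impressions are split evenly.  A sketch in Ruby (the network names and dollar figures here are made up purely for illustration):
&lt;/p&gt;

```ruby
# eCPM is effective earnings per 1,000 impressions.  With equal-probability
# rotation each network serves roughly the same number of impressions, so
# differences in eCPM reflect the networks, not the placement.
def ecpm(earnings, impressions)
  1000.0 * earnings / impressions
end

# Hypothetical month of earnings reports for two networks.
results = {
  network_a: ecpm(120.0, 500_000),  # 0.24 eCPM
  network_b: ecpm(95.0, 500_000)    # 0.19 eCPM
}
best = results.max_by { |_name, rate| rate }.first
# best is :network_a
```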

&lt;p&gt;
Yes, that's right &amp;mdash; there's no single "best" ad network.  Two ad networks perform differently depending on their inventory.  Some might have ads that do well for EU traffic and poorly for US traffic, or vice versa.
&lt;/p&gt;

&lt;p&gt;
If you ask "which ad network is best?" on the Facebook developer forums you'll get a million anecdotes and little data, but &lt;a href="/article/decisions-without-data"&gt;decisions without data are guesses&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
So test your own app with all the ad networks and answer the question yourself.  &lt;em&gt;Then&lt;/em&gt; you'll really know which Facebook ad network is the best.
&lt;/p&gt;</description>
      <pubDate>Wed, 16 Apr 2008 00:00:50 +0000</pubDate>
      <link>http://20bits.com/article/the-best-facebook-ad-network</link>
    </item>
    <item>
      <title>Amazon, the Tech Company</title>
      <description>&lt;p&gt;
Larry Dignan over at ZDNet &lt;a href="http://blogs.zdnet.com/BTL/?p=8471"&gt;writes&lt;/a&gt;:
&lt;blockquote&gt;
Amazon on Monday announced persistent storage for its EC2 service and what's notable is how quickly the e-tailer is running ahead of the competition. In fact, Amazon's real business down the line will be its cloud services. Amazon will be like a book store that sells cocaine out the back door. Books will be just a front to sell storage and cloud computing.
&lt;/blockquote&gt;
&lt;/p&gt;

&lt;p&gt;
I've been speculating about this for a while.  Why is Amazon pushing their technology so hard?  Their business has been in retail and has been profitable since Q1 2002, if I recall correctly.
&lt;/p&gt;

&lt;p&gt;
But it hit me over lunch with &lt;a href="http://andrewchen.typepad.com/"&gt;Andrew Chen&lt;/a&gt; when we were talking about what it means to "be an X company," where X is media, retail, technology, or whatever.
&lt;/p&gt;

&lt;p&gt;
Think of it like this: what is Amazon's &lt;a href="http://en.wikipedia.org/wiki/Core_competency"&gt;core competency&lt;/a&gt;?  Andrew mentioned how Amazon uses technology to save money at every corner.  They have their warehouses right next to FedEx and use technology to plot the shortest route to pick up all the books needed for a shipment.
&lt;/p&gt;

&lt;p&gt;
In short, their core competency is their ability to develop and leverage their technology stack, including SimpleDB, EC2, and S3, towards making retail ultra-efficient.
&lt;/p&gt;

&lt;p&gt;
All these advantages are worthless in a world where the retail business is mostly digital.  Amazon &lt;a href="http://www.amazon.com/MP3-Music-Download/b?ie=UTF8&amp;node=163856011"&gt;knows this&lt;/a&gt; and that's why they're opening their technology stack.
&lt;/p&gt;

&lt;p&gt;
It would make no sense otherwise, since it's precisely that technology which gave Amazon a competitive advantage in a world where books and music were shipped to your doorstep rather than downloaded to your iPod.
&lt;/p&gt;</description>
      <pubDate>Mon, 14 Apr 2008 20:26:44 +0000</pubDate>
      <link>http://20bits.com/article/amazon-the-tech-company</link>
    </item>
    <item>
      <title>Interview Questions: When It's Your Turn</title>
      <description>&lt;p&gt;
This is part of my series about &lt;a href="http://20bits.com/tag/interview"&gt;interview questions&lt;/a&gt;.  As promised this is about interview strategy rather than specific technical interview questions.  I'll continue with that next week.
&lt;/p&gt;
&lt;p&gt;
Every tech interview I've ever had has four stages:
&lt;ol&gt;
	&lt;li&gt;Small talk and swapping brief personal bios.&lt;/li&gt;
	&lt;li&gt;Questions about your previous employment and projects.&lt;/li&gt;
	&lt;li&gt;Technical questions and brain teasers.&lt;/li&gt;
	&lt;li&gt;Turning the tables: "Do you have any questions for me?"&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
The meat of the interview is in the second and third parts where you can directly show your knowledge, skill, and passion, but don't underestimate the value of the fourth part.
&lt;/p&gt;

&lt;h3&gt;Don't be Afraid to Ask Hard Questions&lt;/h3&gt;
&lt;p&gt;
Most people use the fourth part to ask "What is it like working here?"-type questions.  If you think you're going to get interesting responses by all means ask those, but most interviewers I know lie to some degree to make their job sound approximately ten times more awesome than it really is.  They probably don't want to admit to themselves, let alone to some interviewee, that there are parts of their job they hate.
&lt;/p&gt;

&lt;p&gt;
Besides, if you want to know the bad parts about the job &amp;mdash; and there will be some &amp;mdash; just ask that question directly.  They'll either be forthright or they won't and it's pretty easy to discern between the two cases.
&lt;/p&gt;

&lt;h3&gt;Using the Questions to Show Off&lt;/h3&gt;
&lt;p&gt;
In Joel Spolsky's article &lt;a href="http://www.joelonsoftware.com/articles/GuerrillaInterviewing3.html"&gt;The Guerilla Guide to Interviewing&lt;/a&gt; he says that you want to hire people who are two things: one, smart; two, able to get things done.
&lt;/p&gt;

&lt;p&gt;
Since &lt;a href="http://thedailywtf.com/Articles/Riddle-Me-An-Interview.aspx"&gt;Interview 2.0&lt;/a&gt; is the common interview style in most technology companies these days you don't always have the chance to show off how smart you are, but the fourth part offers a path to redemption.
&lt;/p&gt;

&lt;p&gt;
Let's say you're interviewing at Amazon and have a background in mathematics.  You should be asking the engineers questions about the interesting mathematical things they do, have done, or have tried to do with their massive data sets.  This shows that you're not only engaged with the interviewer and the company, but have knowledge that can be brought to bear.
&lt;/p&gt;

&lt;p&gt;
The same applies if you're a marketer or whatever.  If you feel like you haven't had the chance to show the interviewer all you have to offer then asking intelligent questions that you know something about is a great strategy.
&lt;/p&gt;

&lt;h3&gt;A Hard-Learned Lesson&lt;/h3&gt;
&lt;p&gt;
I learned this lesson the hard way.  About a month after I left &lt;a href="http://sugarinc.com"&gt;Sugar, Inc.&lt;/a&gt; and two months after I launched &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt; I was interviewing at Facebook.  For most of the interviews the technical/quizzy type questions went well.  I had even sent in solutions to two of their job puzzles before I came in.  
&lt;/p&gt;
&lt;p&gt;
I was a little frustrated that most of the CS-type questions were about designing databases (as in, writing one from scratch) since I'd never had to do that before.  You can never know too much, though, so I only blame myself.
&lt;/p&gt;

&lt;p&gt;
When it came time to ask them questions, instead of using the strategy above and showing them I did have a solid grasp of the fundamentals they were looking for, I asked them the following question: "Facebook is a _____ company.  What would you put in the blank?"
&lt;/p&gt;

&lt;p&gt;
Every single person said "technology" and then I probed them about that.  "But you guys make money by selling attention.  How does that not make you a media company?"
&lt;/p&gt;

&lt;p&gt;
This is a bad question to ask engineers, even high-ranking ones, because most engineers don't give a crap &amp;mdash; they just want to create cool products and gizmos and bristle when people interject marketing and business mumbo-jumbo.&lt;/p&gt;

&lt;p&gt;And boy did they bristle.  I won't name names, but it was clear this wasn't a welcome question.  My time would have been better spent asking them technical questions because it would've created a discussion they wanted to take part in.
&lt;/p&gt;

&lt;p&gt;
I thought I was being clever but instead I torpedoed my chances of getting an offer there by annoying my interviewers and reinforcing their opinions about my technical skills.
&lt;/p&gt;

&lt;p&gt;
Not one month later I sold Appaholic/Adonomics, so it worked out well, but I still view it as a strategic mistake.  Lesson learned!
&lt;/p&gt;</description>
      <pubDate>Fri, 11 Apr 2008 18:36:48 +0000</pubDate>
      <link>http://20bits.com/article/when-its-your-turn</link>
    </item>
    <item>
      <title>Web 2.0 and Two-sided Markets</title>
      <description>&lt;p&gt;
Last Friday Andrew Chen &lt;a href="http://andrewchen.typepad.com/andrew_chens_blog/2008/04/your-ad-support.html"&gt;wrote a great article&lt;/a&gt; about ad-supported websites.  His thesis boils down to the following snippet.
&lt;blockquote&gt;
The key thing here is: &lt;strong&gt;The users of your website are not really your customers&lt;/strong&gt;.

Instead, the entire process of gathering eyeballs is just to sell to your ACTUAL customers, who are the ad agencies and advertisers. Get it? Your Web 2.0 consumer startup is actually a B2B that sells inventory to brand advertisers.
&lt;/blockquote&gt;
&lt;/p&gt;

&lt;h3&gt;Subtleties&lt;/h3&gt;
&lt;p&gt;
He brings up a worthwhile point, which is that many developers looking to make the next hot thing don't really understand the role advertisers play in their ecosystem.  For developers business development, marketing, and advertising are all dirty words.  They certainly don't view themselves as trafficking in attention, even though many if not most hot web properties make money by selling their users' attention to advertisers.
&lt;/p&gt;

&lt;p&gt;
This is one extreme &amp;mdash; marketing and advertising are dirty and our first duty is to our users, not our advertisers.  Andrew's is the other &amp;mdash; if you're making money through advertising then you're going to have to serve advertisers above your users.
&lt;/p&gt;

&lt;p&gt;
As it happens the situation is more complicated.  Media companies, which I define as companies that sustain themselves by buying and selling attention, exist in a &lt;a href="http://en.wikipedia.org/wiki/Two-sided_markets"&gt;two-sided market&lt;/a&gt;.  This includes ad-supported companies. (Yes, Facebook and Google, you're media companies, not technology companies.)
&lt;/p&gt;

&lt;p&gt;
Here's a diagram that describes the economics of an ad-supported media company &lt;img src="http://assets.20bits.com/20080408/media-economics.png" alt="" title="media-economics" width="492" height="167" class="math size-full wp-image-92" /&gt;
&lt;/p&gt;

&lt;p&gt;
In a two-sided market both sides can and do affect the other.  Most developers are at least tangentially aware of this when they express fears that ads will drive users away.  But if you do it right the relationship between you, your audience, and advertisers can be symbiotic rather than oppositional.
&lt;/p&gt;

&lt;p&gt;
A good example of this sort of relationship is my former employer, &lt;a href="http://sugarinc.com"&gt;Sugar, Inc.&lt;/a&gt;  On occasion advertisers will pay to dress up the avatars that live at the top of their content sites like &lt;a href="http://popsugar.com"&gt;PopSugar&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
The audience likes it because they enjoy fashion and it adds personality to the site.  The advertisers like it because it's a unique value proposition and it attaches them strongly with Sugar, Inc.'s brand.  And Sugar, Inc. likes it because it makes them money.
&lt;/p&gt;

&lt;h3&gt;Creating Value vs. Stealing Value&lt;/h3&gt;
&lt;p&gt;
It comes down to this: it's possible, if you're clever enough, to create situations where everyone wins, even if you're an ad-supported media company.  And your goal as an entrepreneur &lt;em&gt;should&lt;/em&gt; be to create value.
&lt;/p&gt;

&lt;p&gt;
To create value it's important to align your interests with your customers.  This might be more difficult in a two-sided market since, in effect, you have two sets of customers.
&lt;/p&gt;
&lt;p&gt;
But as Andrew points out there are other economies that are possible in the Web 2.0 world, such as e-commerce, subscription fees, or virtual goods.  These might be easier if they're appropriate, but it's a mistake to think that supporting yourself through advertising automatically makes you an enemy of your audience.
&lt;/p&gt;</description>
      <pubDate>Tue, 08 Apr 2008 00:00:09 +0000</pubDate>
      <link>http://20bits.com/article/web-20-and-two-sided-markets</link>
    </item>
    <item>
      <title>Interview Questions: Shuffling an Array</title>
      <description>&lt;p&gt;
This is part of my &lt;a href="http://20bits.com/tag/interview/"&gt;interview question series&lt;/a&gt;.  It's about shuffling arrays.
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
You have an array A of size N.  Write a routine that shuffles the array in-place.  The only restrictions are that all possible permutations of A must be possible and equally likely.
&lt;/p&gt;

&lt;p&gt;
This interview question serves as a test for basic algorithm construction.  There's a canonical solution that's not too difficult to arrive at if you've never seen it before, so it's a good combination of "what do you know?" and "what can you do?"
&lt;/p&gt;

&lt;h3&gt;Workin' it out&lt;/h3&gt;
&lt;p&gt;
I'm going to create my solution in Ruby because that's the language the company that asked me this question used.
&lt;/p&gt;

&lt;p&gt;
The first solution most people arrive at is subtly wrong.  &lt;a href="http://www.codinghorror.com/blog/archives/001008.html?r=31644"&gt;Jeff Atwood&lt;/a&gt; made the mistake in his blog post.  The algorithm, in words, goes like this: iterate through each item in the array, pick another element at random, and swap the two.
&lt;/p&gt;

&lt;p&gt;
In Ruby the above algorithm would look like this.

&lt;pre class="brush: ruby"&gt;class Array
  def shuffle_naive!
    n = size
    until n == 0
      k = rand(size) # This is the line that proves our undoing
      n = n - 1
      self[n], self[k] = self[k], self[n]
    end
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This solution seems correct if not optimal, but there's a subtle problem: not all outcomes are equally likely. 
&lt;/p&gt;

&lt;p&gt;
The root cause is that this algorithm draws from a sample space of size N&lt;sup&gt;N&lt;/sup&gt;, while the sample space of all permutations of an N-element array has size only N!.
&lt;/p&gt;
&lt;p&gt;
That is, for the naive shuffle, for each of the N steps in the iteration we make one of N decisions for a total of N&lt;sup&gt;N&lt;/sup&gt; possible outcomes.
&lt;/p&gt;

&lt;p&gt;
But N&lt;sup&gt;N&lt;/sup&gt; &gt; N! for all N &gt; 1 and, more importantly, N! is not a divisor of N&lt;sup&gt;N&lt;/sup&gt; for any N &gt; 2.  This means at least one permutation must come up more often than the others, so the algorithm doesn't select among the possible permutations uniformly.
&lt;/p&gt;
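&lt;p&gt;
You can make the bias concrete without any randomness by enumerating all 27 decision paths the naive shuffle can take on a three-element array (my sketch, mirroring the loop above):
&lt;/p&gt;

```ruby
# Enumerate every sequence of rand(size) results the naive shuffle can see
# on [0, 1, 2]: three iterations, three choices each, 3^3 = 27 paths.
counts = Hash.new(0)
(0..2).each do |k1|
  (0..2).each do |k2|
    (0..2).each do |k3|
      a = [0, 1, 2]
      [k1, k2, k3].each_with_index do |k, i|
        n = 2 - i                # n counts down 2, 1, 0 as in the loop
        a[n], a[k] = a[k], a[n]
      end
      counts[a] += 1
    end
  end
end
# 27 equally likely paths land on 6 permutations, and 6 doesn't divide 27,
# so the per-permutation counts can't all be equal.
```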

&lt;h3&gt;KFC, KFY&lt;/h3&gt;
&lt;p&gt;
The "best" solution is the &lt;a href="http://en.wikipedia.org/wiki/Knuth_shuffle"&gt;Knuth-Fisher-Yates shuffle&lt;/a&gt;.  Here it is in Ruby:
&lt;pre class="brush: ruby"&gt;class Array
  def shuffle!
    n = size
    until n == 0
      k = rand(n) #You can see I'm doing rand(n) rather than rand(size)
      n = n - 1
      self[n], self[k] = self[k], self[n]
    end
    self
  end
end&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This works because it's an iterative version of an essentially recursive algorithm.  If we know how to shuffle an array of size N-1 then shuffling an array of size N is easy &amp;mdash; first shuffle the sub-array consisting of the first N-1 elements and then randomly swap the last element into any of the N slots.
&lt;/p&gt;
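&lt;p&gt;Running the same enumeration against the corrected shuffle confirms uniformity.  In this sketch (mine, under the same 3-element setup as before) Fisher-Yates draws rand(3), then rand(2), then rand(1), giving exactly 3*2*1 = 6 choice sequences, and each sequence lands on a different permutation:&lt;/p&gt;

```ruby
# Enumerate every sequence of rand(n) choices Fisher-Yates can make on a
# 3-element array; there are 3*2*1 = 6 of them, one per permutation.
counts = Hash.new(0)

[0, 1, 2].each do |k1| # rand(3)
  [0, 1].each do |k2|  # rand(2)
    a = [0, 1, 2]
    ks = [k1, k2, 0]   # rand(1) is always 0
    n = a.size
    until n == 0
      k = ks[a.size - n]
      n = n - 1
      a[n], a[k] = a[k], a[n]
    end
    counts[a] += 1
  end
end

puts counts.size # 6 distinct permutations, each produced exactly once
```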

&lt;p&gt;
There's a proper inductive proof in there if you're so inclined, but it's not particularly illuminating.
&lt;/p&gt;

&lt;h3&gt;Good Questions, Bad Questions&lt;/h3&gt;
&lt;p&gt;
My next article is going to be more about the interview process rather than specific questions.  One key thing to understand in an interview is what information the interviewer is looking for in asking their question.  Hint: it's not always the answer.
&lt;/p&gt;

&lt;p&gt;
Among other things they want to suss out the limits of your knowledge, how you solve problems, how quickly you resort to help, and a whole assortment of other behavioral signals that they can only get because you're right there, (hopefully) engaging in a dialogue.
&lt;/p&gt;</description>
      <pubDate>Mon, 07 Apr 2008 00:00:14 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-shuffling-an-array</link>
    </item>
    <item>
      <title>Decisions Without Data</title>
      <description>&lt;p&gt;
If you've ever worked on a project where you have to build something, be it software or anything else, you've seen it happen &amp;mdash; people, especially designers and engineers, argue over the most petty stuff.  
&lt;/p&gt;

&lt;p&gt;
You know what kind of argument I mean.  How many widgets should we let people enter at a time?  Should we use horizontal or vertical navigation?  Everyone knows they have "the" optimal answer and the situation quickly devolves into a game of verbal chicken, where the first one to realize it's a stupid argument loses.
&lt;/p&gt;

&lt;p&gt;
Having seen this over and over at all levels of decision-making I've found a sentiment that stops the situation from devolving.  It goes like this: &lt;strong&gt;decisions without data are guesses&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
Whenever one of these decisions pops up I ask three questions.
&lt;/p&gt;
&lt;p&gt;
&lt;ol&gt;
&lt;li&gt;What is the goal?&lt;/li&gt;
&lt;li&gt;What metrics tell us whether we're closer or farther from the goal?&lt;/li&gt;
&lt;li&gt;What data have we collected and what data do we need?&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
&lt;h3&gt;Example: Publisher Ad Choices&lt;/h3&gt;
&lt;p&gt;
In the world of Facebook applications most developers, when it comes to making money, are members of the Ron Popeil school of business: set it and forget it.  If you browse the forums for ad-related topics there are a few questions that recur over and over.  What ad network is best?  Where should I put ads on my application?  What color scheme should my ads have?
&lt;/p&gt;

&lt;p&gt;
In this case, the developers implicitly have a goal of making money in mind.  The two key metrics are total pageviews and revenue per thousand pageviews (RPM).  The data needed to calculate these metrics are total pageviews, which you can get by using Google Analytics, and revenue, which every ad network reports directly.
&lt;/p&gt;

&lt;p&gt;
So, let's take the first question, which ad network should I use?  I might think &lt;a href="http://adblade.com/"&gt;AdBlade&lt;/a&gt; is the best and have ten stories that back up my claim, while you might think &lt;a href="http://www.rockyouads.com"&gt;RockYou&lt;/a&gt; is the best from your own experiences.  We could go back and forth all day, but there's only one correct answer: the best ad network is the one that, for a given level of traffic, offers the highest RPM.
&lt;/p&gt;

&lt;p&gt;
You can measure this by using &lt;a href="http://en.wikipedia.org/wiki/A/B_testing"&gt;A/B testing&lt;/a&gt; across multiple ad networks.  Once you've collected information about how each ad network performs there's no room for arguments backed by anecdotes.  The best choice is right there in the numbers.
&lt;/p&gt;
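&lt;p&gt;To make this concrete, here's a sketch in Ruby (the network names and numbers are invented) of the calculation that settles the argument once the A/B test has run:&lt;/p&gt;

```ruby
# Hypothetical A/B test results: pageviews served to each network and the
# revenue each network reported for those pageviews.
networks = {
  "NetworkA" => { pageviews: 120_000, revenue: 84.00 },
  "NetworkB" => { pageviews: 115_000, revenue: 97.75 },
}

# RPM: revenue per thousand pageviews.
rpm = networks.transform_values { |d| d[:revenue] / d[:pageviews] * 1000 }
best = rpm.max_by { |_, v| v }

puts rpm.inspect
puts "Best network: #{best[0]}"
```

&lt;p&gt;Whichever network tops this list wins, regardless of anyone's anecdotes.&lt;/p&gt;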

&lt;h3&gt;Example: Website Layout&lt;/h3&gt;
&lt;p&gt;
These arguments crop up all the time when talking about website design.  Let's say you're creating the latest and greatest social networking site.  What should the homepage look like?
&lt;/p&gt;

&lt;p&gt;
First, you need to settle on a goal for what the homepage is supposed to do.  Do you want lots of users to sign up?  Do you want lots of users of a certain &lt;em&gt;type&lt;/em&gt; (e.g., more engaged users, only women, etc.) to sign up?  Let's say you just want as many people to sign up as possible.
&lt;/p&gt;

&lt;p&gt;
The metric you're probably interested in in this case is the percentage of people who visit the homepage and then go through the signup process.  Measuring this requires that you track a user through multiple parts of the site and identify which ones sign up and which ones don't.  You can do this by assigning the potential user a unique identifier and persisting it through the entire signup process.
&lt;/p&gt;
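&lt;p&gt;As a sketch (the identifiers and event lists here are invented), the metric reduces to counting the unique identifiers that appear in both the homepage-visit log and the completed-signup log:&lt;/p&gt;

```ruby
# Hypothetical event logs keyed by the per-visitor identifier that was
# assigned on the homepage and persisted through the signup process.
homepage_visits   = ["u1", "u2", "u3", "u4", "u5", "u2"]
completed_signups = ["u2", "u5"]

conversion = completed_signups.uniq.size.fdiv(homepage_visits.uniq.size)
puts "Signup conversion: #{(conversion * 100).round(1)}%"
```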

&lt;p&gt;
Now you know for a given homepage layout what percentage of users sign up.  What homepage design is the best?  Should we use 14-point or 16-point headers?  Should we use an off-white or grey background?  Depending on the granularity of the design elements you want to test you can do this either through A/B testing, as in the previous example, or through a more complex &lt;a href="http://en.wikipedia.org/wiki/Multivariate_testing"&gt;multivariate testing scheme&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
People love arguing about what designs are "best," but this process forces you to ask, "Best for what?"  If our goal is to get signups then the best design is the one that produces the most signups and that's something we can measure directly.  Once we've done that not only is there no more room for silly arguments but metrics might reveal that both opinions were wrong.
&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
Maybe it's my background as a scientist and mathematician, but I treat website development and design as an empirical venture.  We should come to the task with a definite idea of what we're trying to achieve and at every step make the decision that the data says is best.
&lt;/p&gt;

&lt;p&gt;
Not only does this produce better and more justifiable decisions but it prevents time-wasting arguments.  If someone comes at you with an opinion you can just shoot back, "What data do you have?"  If they have nothing but opinions and anecdotes you know they're not making a decision, they're guessing.
&lt;/p&gt;</description>
      <pubDate>Sat, 05 Apr 2008 00:00:26 +0000</pubDate>
      <link>http://20bits.com/article/decisions-without-data</link>
    </item>
    <item>
      <title>Interview Questions: Two Bowling Balls</title>
      <description>&lt;p&gt;
This post is the first in a series I'm calling "interview questions," where I discuss interview questions I've been handed in my time out here in the Bay Area.  Since I'm an engineer by trade most of the questions relate directly to technical topics.  I'll also cover general interview strategies and advice — probably by serving myself up as an example of what &lt;em&gt;not&lt;/em&gt; to do in an interview.
&lt;/p&gt;

&lt;p&gt;
I know people keep a repertoire of interview questions at hand, so I'm not going to name names when discussing the questions.  Anyhow, let's get started!
&lt;/p&gt;

&lt;h3&gt;The Question&lt;/h3&gt;
&lt;p&gt;
You're standing in front of a 100 story building with two identical bowling balls.  You've been tasked with testing the bowling balls' resilience.  The building has a stairwell with a window at each story from which you can (conveniently) drop bowling balls.  
&lt;/p&gt;

&lt;p&gt;
To test the bowling balls you need to find the first floor at which they break.  It might be the 100th floor or it might be the 50th floor, but if it breaks somewhere in the middle you know it will break at every floor above.
&lt;/p&gt;

&lt;p&gt;
Devise an algorithm which guarantees you'll find the first floor at which one of your bowling balls will break.  You're graded on your algorithm's worst-case running time.
&lt;/p&gt;

&lt;p&gt;
&lt;h3&gt;Warning: Stop reading here if you're not interested in seeing any of my solutions!&lt;/h3&gt;
&lt;/p&gt;

&lt;h3&gt;A Few Preliminaries&lt;/h3&gt;
&lt;p&gt;
The original problem stated that the building had 100 floors, but it may as well have N floors.  Using N rather than 100 will make it easier to quantify the performance of the algorithm, so that's what I'm going to do.
&lt;/p&gt;
&lt;h3&gt;Solution 1: The Naïve Solution&lt;/h3&gt;
&lt;p&gt;
Ok, there's one blindingly obvious solution: take one of the bowling balls and drop it from every floor, starting from the first.  At worst this will take N tries, where N is the number of stories in the building.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Interview Advice:&lt;/strong&gt; In an actual interview situation don't be afraid to say the obvious solution, even if you know there's a better one.  Problem solving is iterative and your answer should be, too.
&lt;/p&gt;

&lt;h3&gt;Solution 2: Two Bowling Balls&lt;/h3&gt;
&lt;p&gt;
We know the first solution is probably sub-optimal because it doesn't make use of both bowling balls.  To give us some ideas let's just pick a floor, say the 50th floor, and drop one of the balls — we have nothing to lose since we know we can do it with only one.
&lt;/p&gt;

&lt;p&gt;
If our building is 100 floors and we dropped one of the balls from the 50th floor one of two things will happen: the ball will either break or it won't.  If it breaks then we know the floor we're looking for is somewhere between floors 1 and 50.  If it doesn't then we know it's somewhere between floors 51 and 100.  In either case we've halved the size of the search space and now need at most N/2 (or 50) tries.&lt;/p&gt;

&lt;p&gt;
But 50 was arbitrary.  What about other numbers?  What happens if we drop the ball on the third floor?  If it breaks then we can use the second ball to test floors 1-2, taking at most 3 tries.  If it doesn't break then we try the same experiment again, dropping the ball from another floor.
&lt;/p&gt;

&lt;p&gt;
Here's one possible strategy: pick a number S and call it the skip number.  We drop one ball every S floors until it breaks on the k&lt;sup&gt;th&lt;/sup&gt; try.  We then use the second ball to try every floor between floors (k-1)*S and k*S.
&lt;/p&gt;

&lt;p&gt;
As an example, let N=100 and S=4.  We'd try floors 4,8,12,16,... with one bowling ball until it breaks.  Let's say it breaks on the 60th floor.  Since it didn't break on the 56th floor we know the culprit is floor 57, 58, 59, or 60, and we can use the second ball to test floors 57-59 one at a time using the naïve strategy.
&lt;/p&gt;

&lt;p&gt;
What is the best skip size?  Obviously S=100 isn't ideal since it's equivalent to the naïve strategy, as is S=1.  But we know both S=50 and S=4 are better, so there must be an optimal strategy somewhere in between.  To find it, let L(S) be the number of drops required in the worst-case scenario for a skip number of S.  If you work it out you'll get &lt;img class="math" src="http://assets.20bits.com/20080403/latex-4.png" alt="" title="latex" width="163" height="43"/&gt;
&lt;/p&gt;

&lt;p&gt;
We want to minimize this function.  Bringing back our high school calculus, the derivative of L(S) is
&lt;img src="http://assets.20bits.com/20080403/latex-1.png" alt="" title="latex-1" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
Setting the derivative equal to zero implies &lt;img src="http://assets.20bits.com/20080403/latex-2.png" alt="" title="latex-2" width="78" height="23" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
For N=100 this gives an optimal skip of S=10.  If N isn't a perfect square you'll have to work out which skip gives the "correct" solution.
&lt;/p&gt;
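&lt;p&gt;If calculus isn't handy you can also brute-force it.  This sketch (mine; the original article has no code for this problem) evaluates L(S) = N/S + S - 1 for every skip size and picks the smallest:&lt;/p&gt;

```ruby
# Worst-case drop count for skip size s on an n-story building,
# per the formula L(S) = N/S + S - 1.
n = 100
l = lambda { |s| n.to_f / s + s - 1 }

best = (1..n).min_by { |s| l.call(s) }
puts "Best skip: #{best}, worst case: #{l.call(best)} drops"
```

&lt;p&gt;For N=100 this reports a best skip of 10 with a worst case of 19 drops, matching S=&amp;radic;N.&lt;/p&gt;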

&lt;h3&gt;Solution 3: You can do better...&lt;/h3&gt;
&lt;p&gt;
At this point in the interview you're probably pretty happy with yourself.  The above took you a few minutes to work out, perhaps with some prodding by the interviewer.  But then you hear that dreaded question, "Can you do any better?"
&lt;/p&gt;

&lt;p&gt;
The interviewer isn't a jackass, though, and gives you a hint.  He points out that it seems like we should be able to find a solution that works equally well irrespective of where the bad floor is.  That is, it should take the same number of turns if it's on the 100th floor as it would if it were on a lower floor.
&lt;/p&gt;

&lt;p&gt;
We have a baseline for ourselves.  For N=100 and S=10 we know we can do it in at most 19 turns.  This can act as a sort of counter — if we beat this number at every step we've come up with a strictly better algorithm.  So, at every step, we want to be able to find the floor in question in no more than 18 steps.
&lt;/p&gt;
&lt;p&gt;
Let's start by dropping the first ball on the 18th floor.  If it breaks we can test floors 1-17 with the second ball, taking at most 18 turns.  If it doesn't break, we've used up one of our turns, leaving us with 17 turns left.
&lt;/p&gt;
&lt;p&gt;
So, the next floor we should test is 18+17, or the 35th floor.  If it breaks the culprit is somewhere in floors 19-35, which we can pin down in at most 18 total turns.  We can continue this way, shrinking the step size by one each time.  Now we know we can do it in at most 18 steps.
&lt;/p&gt;

&lt;p&gt;
But why not 17?  If we repeat the above steps, starting with a counter of 17 rather than 18, we get an algorithm that takes at most 17 steps.  Then, using 16 as a counter, we get an algorithm that takes at most 16 steps.  We can't do this forever, since there's no possible algorithm that takes at most one step.  So where is the end of the line?
&lt;/p&gt;

&lt;p&gt;
The problem is that for this algorithm to work the first ball needs to be able to skip one fewer each time and still cover all 100 floors.  If we set our counter to C that means we must have 1+2+...+C &gt;= 100.&lt;/p&gt;

&lt;p&gt;
Here's the math:&lt;img src="http://assets.20bits.com/20080403/latex-12.png" alt="" title="latex-12" width="357" height="150" class="math"/&gt;
&lt;/p&gt;

&lt;p&gt;
Using the quadratic formula to find the exact solution and then taking into account the fact that we want an integer solution gives
&lt;/p&gt;
&lt;img src="http://assets.20bits.com/20080403/latex-21.png" alt="" title="latex-21" width="215" height="53" class="math"/&gt;
&lt;p&gt;
as the worst possible case for our third strategy.  &lt;tt&gt;L(100) = 14&lt;/tt&gt;, which checks out.
&lt;/p&gt;
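&lt;p&gt;In code the final strategy is only a few lines.  The sketch below (mine, not the interviewer's) finds the smallest counter C with C(C+1)/2 &gt;= N, then lists the floors to drop the first ball from:&lt;/p&gt;

```ruby
# Smallest C such that 1 + 2 + ... + C covers all n floors.
def min_drops(n)
  c = 1
  c += 1 until c * (c + 1) / 2 >= n
  c
end

# Floors to drop the first ball from: start at C, then step down by one,
# capping the final drop at the top floor.
def drop_floors(n)
  floors = []
  floor = 0
  min_drops(n).downto(1) do |step|
    floor += step
    floors.push([floor, n].min)
    break if floor >= n
  end
  floors
end

puts min_drops(100) # 14
puts drop_floors(100).inspect
```

&lt;p&gt;For N=100 the first ball goes to floors 14, 27, 39, 50, 60, 69, 77, 84, 90, 95, 99, and 100.&lt;/p&gt;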

&lt;p&gt;
That's the best solution I know, and it was the best solution the interviewer knew, too.  Can you do any better?
&lt;/p&gt;

&lt;h3&gt;After The Interview&lt;/h3&gt;
&lt;p&gt;
This was one of a few questions I was asked by one of four interviewers.  I worked through the problem above, basically as it was written out, albeit with more digressions.  How did the interview wind up going?  I wasn't offered a job.  At least I got a good interview question out of it, though.
&lt;/p&gt;</description>
      <pubDate>Thu, 03 Apr 2008 00:00:01 +0000</pubDate>
      <link>http://20bits.com/article/interview-questions-two-bowling-balls</link>
    </item>
    <item>
      <title>Memo to OpenSocial: It's about distribution, stupid!</title>
      <description>&lt;p&gt;
With the launch of Google's &lt;a href="http://code.google.com/apis/opensocial/"&gt;OpenSocial&lt;/a&gt; project last week and the subsequent announcement that &lt;a href="http://biz.yahoo.com/bw/071101/20071101006542.html?.v=1"&gt;MySpace will be one of the participating social networks&lt;/a&gt; the developer community on Facebook and the technology blogosphere is wondering what this means for Facebook's platform strategy.  The short answer: not much.  Why?  Because it's not about users &lt;em&gt;per se&lt;/em&gt;, it's about distribution.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Note:&lt;/strong&gt; This was cross-posted to the &lt;a href="http://blog.adonomics.com/2007/11/07/memo-to-opensocial-its-about-distribution-stupid/"&gt;Adonomics blog&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;The Hype Machine&lt;/h3&gt;
&lt;p&gt;
Talking heads love narratives.  After MySpace announced their partnership the narrative became "the young and arrogant Facebook was shown up by the experienced and prudent Google, who in one fell swoop obsoleted their whole platform strategy."  &lt;a href="http://www.techcrunch.com/2007/11/01/confirmed-myspace-to-join-google-opensocial/"&gt;Michael Arrington&lt;/a&gt; asked if this were "checkmate" for Google.  &lt;a href="http://www.gadgetell.com/2007/11/myspace-joins-opensocial-will-facebook-bow/"&gt;Others&lt;/a&gt; have been asking whether or not Facebook should just give up and join OpenSocial itself.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://weblogs.hitwise.com/bill-tancer/2007/11/opensocial_what_a_difference_a.html"&gt;Hitwise&lt;/a&gt; even published a graph showing the total size of all the OpenSocial partners vs. Facebook.  Yikes!  That sure looks bad for Facebook.
&lt;/p&gt;

&lt;p&gt;
But like most media narratives the story turns out to be more subtle.
&lt;/p&gt;

&lt;h3&gt;The Value of the Facebook Platform&lt;/h3&gt;
&lt;p&gt;
These subtleties first appear when you consider the value application developers get from Facebook.  Nothing in this world is free, including developing applications for Facebook platform.  In exchange for agreeing to dress your application in the Facebook blues they are offering you, the developer, two things:
&lt;ol&gt;
&lt;li&gt;Lower cost of acquisition per user&lt;/li&gt;
&lt;li&gt;An unparalleled distribution mechanism&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
The cost of acquiring a user on Facebook is orders of magnitude cheaper on Facebook than on the web at large.  Facebook effectively offers a single sign-on solution with their API, on top of which it only takes one click for a user to add an application to their profile.  &lt;/p&gt;

&lt;p&gt;
OpenSocial clears this hurdle, too.  If I'm in MySpace an OpenSocial widget will use MySpace's login system to verify whatever it needs to verify.  Ok, so I can install OpenSocial widgets in one click.  Once a user has decided they want to add my widget the barrier to entry is minimal.
&lt;/p&gt;

&lt;p&gt;
Distribution, however, deals with the step before this.  How do I get information about my application in front of users to begin with?
&lt;/p&gt;

&lt;h3&gt;It's About Distribution, Stupid!&lt;/h3&gt;
&lt;p&gt;
Last September something remarkable happened: Facebook &lt;a href="http://blog.facebook.com/blog.php?post=2207967130"&gt;launched the newsfeed and mini-feed&lt;/a&gt;.  Although most people didn't realize it at the time this effectively "activated" the social network underlying Facebook, making it possible for information to flow efficiently through the connections in that network.  Information that I used to have to seek out now came to me without any effort on my part!
&lt;/p&gt;

&lt;p&gt;
This made it possible for Facebook to become a distribution engine.  Through the newsfeed Facebook could, in theory, distribute anything: advertisements, my friends' activity, and even software.  &lt;/p&gt;

&lt;p&gt;
Heck, Facebook could partner with local governments to send out public health announcements to local Facebook users.  
This is powerful stuff and it's what drove the Facebook-made photos and events applications to be larger than Flickr and Evite, respectively.
&lt;/p&gt;

&lt;p&gt;
So when Facebook opened up the platform plenty of people knew it would be possible for them to achieve the same kind of success.  iLike saw &lt;a href="http://mashable.com/2007/06/11/ilike-facebook-app-success/"&gt;three million users&lt;/a&gt; add their application in the first two weeks.  Even now, three and a half months later, it's possible to get &lt;a href="http://adonomics.com/about/17603244640"&gt;a million users in less than a month&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
You might not hear stories like iLike's any more but that doesn't mean the Facebook platform isn't growing.  The number of application installs across all of Facebook has been remarkably consistent since the launch of the platform, growing at a rate of about &lt;strong&gt;1.5%&lt;/strong&gt; or &lt;strong&gt;2.96 million&lt;/strong&gt; installs per day.
&lt;/p&gt;

&lt;img class="math" src='http://20bits.com/wp-content/uploads/2007/11/installs2.png' alt='installs1.png' /&gt;

&lt;p&gt;
That's right.  Every day almost &lt;strong&gt;3 million&lt;/strong&gt; people click that big blue "Install" button and don't remove the application. And this growth shows absolutely no sign of slowing.
&lt;/p&gt;

&lt;h3&gt;OpenSocial vs. Facebook&lt;/h3&gt;
&lt;p&gt;
To quote Dave McClure, "&lt;a href="http://500hats.typepad.com/500blogs/2007/08/facebook-not-fo.html"&gt;open is not better, better is better!&lt;/a&gt;"
&lt;/p&gt;

&lt;p&gt;
Ignoring questions of the quality of the users, OpenSocial's incorporation of MySpace &lt;em&gt;appears&lt;/em&gt; to give them the edge.  MySpace, after all, has 200 million users and Facebook has 50 million.  But the size of the potential userbase doesn't matter nearly as much as the ability for an application developer to activate that userbase.
&lt;/p&gt;

&lt;p&gt;
Peter Chane, the Group Product Manager at Google for OpenSocial, &lt;a href="http://www.insidefacebook.com/2007/11/02/comparing-facebook-platform-to-opensocial-interview-with-peter-chane-group-product-manager-at-google/"&gt;has stated&lt;/a&gt; that OpenSocial will have some components of Facebook's distribution system (newsfeed/mini-feed) but not others (notifications/requests).
&lt;/p&gt;

&lt;p&gt;
Whether the OpenSocial model can surpass Facebook-level distribution is &lt;em&gt;the key question&lt;/em&gt;.  It's not about how many users are using the social networks on OpenSocial &amp;mdash; it's not even clear that there's going to be any real interaction between users on different networks, anyhow.  It's not about whether OpenSocial is more or less "open" than Facebook.  It's about whether developers can build high-quality applications (not just widgets) using the OpenSocial technology and distribute them efficiently through the social graph.
&lt;/p&gt;

&lt;p&gt;
There are all sorts of unknowns even though OpenSocial is a week old. How does OpenSocial's feed-based system compare to Facebook's newsfeed? How does the quality of the social graph wired up with OpenSocial compare to Facebook's? What impact do these factors have on the efficiency of distribution?
&lt;/p&gt;

&lt;h3&gt;In Closing...&lt;/h3&gt;
&lt;p&gt;
The point is this: don't be blinded by the big numbers of MySpace.  Until OpenSocial shows it can activate those users in a way that is more viral than Facebook it's an unproven technology, even ignoring the fact that as of this post OpenSocial doesn't provide any meaningful way to interact between containers.  If you're on MySpace you're not going to be able to switch to another social network and take your MySpace data with you.
&lt;/p&gt;

&lt;p&gt;
That it's been a week since Google launched OpenSocial and the only story about iLike is not that they're on track to get another windfall of users via OpenSocial, but that their &lt;a href="http://www.techcrunch.com/2007/11/05/opensocial-hacked-again/"&gt;Ning application&lt;/a&gt; has been hacked, makes me believe Facebook's king can more than meet the threat from Google's attack.
&lt;/p&gt;</description>
      <pubDate>Wed, 07 Nov 2007 02:59:09 +0000</pubDate>
      <link>http://20bits.com/article/memo-to-opensocial-its-about-distribution-stupid</link>
    </item>
    <item>
      <title>Graph Theory: Part III (Facebook)</title>
      <description>&lt;p&gt;
In the &lt;a href="/article/graph-theory-part-i-introduction/"&gt;first&lt;/a&gt; and &lt;a href="http://20bits.com/articles/graph-theory-part-ii-linear-algebra"&gt;second&lt;/a&gt; parts of my series on graph theory I defined graphs in the abstract, mathematical sense and connected them to matrices.  In this part we'll see a real application of this connection: determining influence in a social network.
&lt;/p&gt;

&lt;p&gt;
Recall that a graph is a collection of vertices (or nodes) and edges between them.  The vertices are abstract nodes and edges represent some sort of relationship between them.  In the case of a social network the vertices are people and the edges represent a kind of social or person-to-person relationship.  "Friends of," "married to," "slept with," and "went to school with" are all examples of possible relationships that could determine edges in a social graph.
&lt;/p&gt;

&lt;p&gt;
So, right away, you can see how this applies to Facebook.  They have a huge collection of just this sort of data: who is friends with whom, who is in a relationship with whom, who is married to whom, who went to college with whom, etc.  Can anything useful be done with this?
&lt;/p&gt;

&lt;h3&gt;What Can Be Done With a Social Graph&lt;/h3&gt;
&lt;p&gt;
Let's step back and think like a marketer for a second.  Facebook, thanks to the newsfeed, is essentially a word-of-mouth engine.  Everything I do, from installing applications to commenting on photos, is broadcast to all my friends via the newsfeed.  Intuitively, however, we know that some people are just more influential than others.&lt;/p&gt;
&lt;p&gt;
If my cool friend writes a note about an awesome new shop he found in the &lt;a href="http://en.wikipedia.org/wiki/Lower_Haight,_San_Francisco,_California"&gt;Lower Haight&lt;/a&gt; I'm probably going to pay more attention.  People like this, who are influential and highly connected, are a marketer's dream.  If I can identify and target these people, "infecting" them with my marketing, I'll get ten times the return I would going after random people in my target demographic.
&lt;/p&gt;

&lt;p&gt;
Facebook is almost certainly doing something like this already with respect to the newsfeed.  They process billions of newsfeed items per day.  How do they know which messages are most important to me?  Well, it stands to reason that the messages that are most important to me are the ones from the &lt;em&gt;people&lt;/em&gt; who are most important to me.  So, as Facebook, I want to be able to calculate the relative level of importance of a person's friends and use that measurement to weight whether their newsfeed items get displayed for their friends.
&lt;/p&gt;

&lt;p&gt;
There are several problems. Can we come up with a good measure of social importance or influence?  Are there multiple measures, and if so, what are their relative merits?
&lt;/p&gt;

&lt;h3&gt;Measurements of Social Influence&lt;/h3&gt;

&lt;p&gt;
Let's start simple.  One way to measure influence is connectivity.  People who have lots of friends tend to have more influence (indeed, it's possible they have more friends precisely because they &lt;em&gt;are&lt;/em&gt; influential).  Recall from the first part that the &lt;em&gt;degree&lt;/em&gt; of a node in a graph is the number of other nodes to which it is connected.  For a graph where "is friends with" is the edge relationship then the degree corresponds to the &lt;em&gt;number of friends&lt;/em&gt;.
&lt;/p&gt;

&lt;h3&gt;Degree Centrality&lt;/h3&gt;
&lt;p&gt;
Let's call this influence function I&lt;sub&gt;d&lt;/sub&gt; ("d" for degree).  Thus, if &lt;em&gt;p&lt;/em&gt; is a person then I&lt;sub&gt;d&lt;/sub&gt;(p) is the measure of their influence.  Mathematically we get
&lt;img class="math" src='http://assets.20bits.com/misc/degree-influence.png' alt='degree-influence.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
The main advantage of this is that it's dead-simple to calculate.  If you represent your graph as an adjacency matrix, as in &lt;a href="/article/graph-theory-part-ii-linear-algebra"&gt;the second part of this series&lt;/a&gt;, then the influence of a node is just the row-sum of the corresponding row — an operation which is very fast and easily parallelizable.
&lt;/p&gt;
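&lt;p&gt;As a sketch (in Ruby rather than the Python used later in this article, and with a made-up four-person graph), degree centrality really is just a row sum:&lt;/p&gt;

```ruby
# Adjacency matrix for a hypothetical 4-person "is friends with" graph.
# Person 0 is friends with everyone; person 1 is friends only with person 0.
adjacency = [
  [0, 1, 1, 1],
  [1, 0, 0, 0],
  [1, 0, 0, 1],
  [1, 0, 1, 0],
]

influence = adjacency.map { |row| row.sum }
puts influence.inspect # [3, 1, 2, 2]
```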

&lt;p&gt;
The downside of this is that it's naive.  Consider the following graphs.
&lt;/p&gt;
&lt;div class="math"&gt;
&lt;h4&gt;Single person with high degree&lt;/h4&gt;
&lt;img src='http://assets.20bits.com/misc/high-degree.png' alt='high-degree.png'/&gt;
&lt;/div&gt;
&lt;div class="math"&gt;
&lt;h4&gt;Single person low degree but high connectivity&lt;/h4&gt;
&lt;img src='http://assets.20bits.com/misc/low-degree.png' alt='low-degree.png'/&gt;
&lt;/div&gt;

&lt;p&gt;
Using I&lt;sub&gt;d&lt;/sub&gt; as a measure of influence the first person, p&lt;sub&gt;1&lt;/sub&gt;, has a higher measure of influence because they are directly connected to eight people.  The second person, p&lt;sub&gt;2&lt;/sub&gt;, however, has the potential to influence up to 9 people.  This happens in the real world, too.  Consider a corporate hierarchy in a large company.  The CEO only has direct relationships with his board, the VPs, and maybe a few other employees.  He is undeniably more influential than an administrative assistant to the deputy regional director of sales for Southern Montana and yet might have fewer direct connections.
&lt;/p&gt;

&lt;h3&gt;Using Eigenvalues&lt;/h3&gt;
&lt;p&gt;
One way to capture this sort of indirect influence is to use a measurement called eigenvalue (or eigenvector) centrality.  The idea is this:  a person's influence is proportional to total influence of the people to whom he is connected.  We'll call this influence measure I&lt;sub&gt;e&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
Let's say I'm the CEO at X Corp.  There are four VPs each of whom has an influence of 5 and these are the people to whom I am connected directly.  Then this measure says that there is some number λ such that &lt;img class="math" src='http://assets.20bits.com/misc/ceo-eigen.png' alt='ceo-eigen.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
λ determines how much influence people share with each other through their connections.  If λ is small then the CEO has a lot of influence, if it is large then he has little.  How do we calculate λ?
&lt;/p&gt;

&lt;p&gt;
Let G be a social network or social graph, where vertices are people and edges are some sort of social relationship, and let A be the adjacency matrix of G.  If there are N people in the social network labeled p&lt;sub&gt;1&lt;/sub&gt;, p&lt;sub&gt;2&lt;/sub&gt;,... then we can generalize the above and say that 
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/eigenvalue-eqn.png' alt='eigenvalue-eqn.png'/&gt;

&lt;p&gt;
Remember that A&lt;sub&gt;i,j&lt;/sub&gt; is 1 if p&lt;sub&gt;i&lt;/sub&gt; and p&lt;sub&gt;j&lt;/sub&gt; are joined by an edge and 0 otherwise.  It's also important to notice that λ is a function of the &lt;em&gt;graph&lt;/em&gt; not of any individual node.
&lt;/p&gt;

&lt;p&gt;
If we call x&lt;sub&gt;i&lt;/sub&gt; = I&lt;sub&gt;e&lt;/sub&gt;(p&lt;sub&gt;i&lt;/sub&gt;) then we can form a vector &lt;b&gt;x&lt;/b&gt; whose i&lt;sup&gt;th&lt;/sup&gt; coordinate is the influence of the i&lt;sup&gt;th&lt;/sup&gt; person.  We can rewrite the above equation using vectors and matrices, then, as follows:
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/eigen-recip.png' alt='eigen-recip.png'/&gt;

&lt;p&gt;
or&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/eigenvector-eqn.png' alt='eigen-recip.png'/&gt;

&lt;h3&gt;Eigenvalue Problem&lt;/h3&gt;
&lt;p&gt;
As you may remember from the &lt;a href="/article/graph-theory-part-ii-linear-algebra"&gt;second part&lt;/a&gt; of this series, this is the classical eigenvalue equation (hence the name of this influence measure).
&lt;/p&gt;
&lt;p&gt;
Calculating someone's influence according to I&lt;sub&gt;e&lt;/sub&gt; is therefore equivalent to calculating what is known as the "principal component" or "principal eigenvector."  Luckily for us there are tons of &lt;a href="http://en.wikipedia.org/wiki/Eigenvalue_algorithm"&gt;eigenvalue algorithms&lt;/a&gt; out there.
&lt;/p&gt;

&lt;h3&gt;The Power Method&lt;/h3&gt;
&lt;p&gt;
The easiest, but not necessarily the most efficient, way of calculating the principal eigenvalue and eigenvector is called the &lt;a href="http://en.wikipedia.org/wiki/Power_method"&gt;power method&lt;/a&gt;.  The idea is that if you repeatedly multiply a starting vector by the matrix A, normalizing at each step, the result converges to the principal eigenvector.
&lt;/p&gt;

&lt;p&gt;
You can read the mathematical justification for why this works on Wikipedia.  Here is some Python code which takes as its input the adjacency matrix of a graph, an initial starting vector, and the level of error you're willing to permit:
&lt;/p&gt;
&lt;pre class="code" lang="python"&gt;from numpy import *

def norm2(v):
	return sqrt(v.T*v).item()
	
def PowerMethod(A, y, e):
	while True:
		v = y/norm2(y)
		y = A*v
		t = dot(v.T,y).item()
		if norm2(y - t*v) &lt;= e*abs(t):
			return (t, v)&lt;/pre&gt;

&lt;p&gt;
The upside to using this method is that it is relatively easy to compute (Google uses a variant of this to calculate PageRank, for example) and that it encompasses more subtleties about how nodes possibly influence each other.  The downsides are mostly technical: there are certain situations where the power method fails to converge, for example when the two largest eigenvalues have the same absolute value.
&lt;/p&gt;
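&lt;p&gt;
To make this concrete, here is a small sketch of the same iteration written with plain numpy arrays rather than the matrix class above.  The network is invented for illustration: people 0, 1, and 2 all know each other, and person 3 knows only person 2.
&lt;/p&gt;

```python
import numpy as np

# A made-up network: a triangle of people 0, 1, 2, plus
# person 3, who knows only person 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1  # undirected, so A is symmetric

def power_method(A, y, e):
    # The same loop as PowerMethod above, for 1-D numpy arrays.
    while True:
        v = y / np.linalg.norm(y)
        y = A @ v
        t = v @ y  # Rayleigh quotient: current eigenvalue estimate
        if np.linalg.norm(y - t * v) <= e * abs(t):
            return t, v

eigenvalue, influence = power_method(A, np.ones(n), 1e-9)
# Person 2 is in the triangle *and* has the extra connection, so
# this measure ranks them as the most influential.
```

Note that the power method needs a strictly dominant eigenvalue: on a bipartite graph like the CEO-and-four-VPs star, where +λ and -λ are both eigenvalues, this loop would cycle forever.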

&lt;h3&gt;Other Measures&lt;/h3&gt;
&lt;p&gt;
There are other measures, too.  The two most common are called &lt;em&gt;betweenness&lt;/em&gt; and &lt;em&gt;closeness&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Without going into detail, the &lt;em&gt;betweenness&lt;/em&gt; of a node P is the number of shortest paths between pairs of other nodes that pass through P, usually normalized by the total number of such shortest paths.  Nodes that bridge otherwise separate regions of the graph therefore have a high betweenness score.  The &lt;em&gt;closeness&lt;/em&gt; of a node P is the average length of the shortest path from P to all nodes which are connected to it by some path.
&lt;/p&gt;

&lt;p&gt;
Both of these measurements are fairly sophisticated and difficult to calculate for large graphs.
&lt;/p&gt;
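&lt;p&gt;
On small graphs, though, closeness is easy to compute directly with a breadth-first search.  Here is an illustrative sketch (the graph and the function are mine, not part of any library); betweenness needs more bookkeeping, so it's omitted.
&lt;/p&gt;

```python
from collections import deque

def closeness(adj, p):
    """Average shortest-path length from p to every node reachable
    from it; adj maps a node to the set of its neighbors."""
    dist = {p: 0}
    queue = deque([p])
    while queue:               # breadth-first search from p
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    reachable = [d for node, d in dist.items() if node != p]
    return sum(reachable) / len(reachable) if reachable else 0.0

# A path graph 0 - 1 - 2 - 3: the middle nodes are "closer" to
# everyone else than the endpoints are.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
closeness(path, 0)  # (1 + 2 + 3) / 3 = 2.0
closeness(path, 1)  # (1 + 1 + 2) / 3 = 1.33...
```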

&lt;h3&gt;Experimental Measurement&lt;/h3&gt;
&lt;p&gt;
The downside to all these measures is that they only take into account the topology of the graph.  That is, they ignore the fact that the nodes, as people, are performing actions.  In the case of Facebook we can measure influence directly by measuring how activity spreads through the network.  Let's say we have a person P with friends F1, F2, F3, etc.  As Facebook we could send out a message on behalf of person P to his friends and count how many act on the message.
&lt;/p&gt;

&lt;p&gt;
Statistically we can model this using a &lt;a href="http://en.wikipedia.org/wiki/Random_variable"&gt;random variable&lt;/a&gt;.  Let X&lt;sub&gt;i&lt;/sub&gt; be the number of people "influenced" by a message sent on behalf of p&lt;sub&gt;i&lt;/sub&gt;, the i&lt;sup&gt;th&lt;/sup&gt; person on our network.  We can then calculate the &lt;a href="http://en.wikipedia.org/wiki/Expected_value"&gt;expected value&lt;/a&gt; of X&lt;sub&gt;i&lt;/sub&gt;.
&lt;/p&gt;

&lt;p&gt;
We can then take into account information from the profile data and answer questions like "What is the expected number of people this person will influence given he is a male, 32, a marketing executive who graduated from Harvard with 42 friends and 102 wall posts?"
&lt;/p&gt;
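&lt;p&gt;
As a sketch, estimating E[X&lt;sub&gt;i&lt;/sub&gt;] from observed data is just averaging, and conditioning on profile attributes is averaging within a segment.  All of the numbers below are invented:
&lt;/p&gt;

```python
# Hypothetical observations: how many friends acted on each of
# several messages sent on behalf of one person.
influenced_counts = [3, 0, 5, 2, 4, 1, 3]

# The sample mean estimates E[X_i], this person's expected influence.
expected_influence = sum(influenced_counts) / len(influenced_counts)

# Conditioning on profile attributes is just averaging within a
# segment; the attributes and counts here are made up.
observations = [
    ({"gender": "M", "grad": "Harvard"}, 6),
    ({"gender": "M", "grad": "Harvard"}, 4),
    ({"gender": "F", "grad": "Stanford"}, 2),
]
harvard_men = [x for attrs, x in observations
               if attrs["gender"] == "M" and attrs["grad"] == "Harvard"]
conditional_expectation = sum(harvard_men) / len(harvard_men)  # 5.0
```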

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
These techniques work on any graph, including Facebook's social graph.  When you have a graph as complete as Facebook's you can do a lot of interesting things.  Imagine I'm a marketer who wants a sponsored newsfeed item.  Facebook can charge a premium because they're able to target the influencers using techniques like the ones above.
&lt;/p&gt;

&lt;p&gt;
Of course I can't say whether Facebook is using some, none, or all of the techniques I described.  But that doesn't mean application developers can't.  By keeping track of who influences whom you can use these techniques to maximize your exposure.  Fancy that!
&lt;/p&gt;</description>
      <pubDate>Fri, 02 Nov 2007 07:00:28 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-iii-facebook</link>
    </item>
    <item>
      <title>Graph Theory: Part II (Linear Algebra)</title>
      <description>&lt;p&gt;
This is the second part in my series on graph theory.  &lt;a href="/article/graph-theory-part-i-introduction"&gt;Part I&lt;/a&gt; included the basic definitions of graph theory, gave some concrete examples where one might want to use graph theory to tackle a problem, and concluded with some common objects one finds doing graph theory.
&lt;/p&gt;

&lt;p&gt;
I'm going to cover three things in this post: vector spaces, linear transformations and matrices, and eigenvectors and eigenvalues.
&lt;/p&gt;

&lt;h3&gt;Vector Spaces&lt;/h3&gt;
&lt;p&gt;
Linear algebra is the study of vector spaces and their transformations.  No good mathematical text can begin without definitions, so let's dive in:
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let &lt;b&gt;R&lt;/b&gt; be the &lt;a href="http://en.wikipedia.org/wiki/Real_numbers"&gt;real numbers&lt;/a&gt;.  A &lt;em&gt;vector space&lt;/em&gt; over the reals is a set V and two &lt;a href="http://en.wikipedia.org/wiki/Binary_operation"&gt;binary operations&lt;/a&gt; +: V×V → V and ·: &lt;b&gt;R&lt;/b&gt;×V → V, called &lt;em&gt;vector addition&lt;/em&gt; and &lt;em&gt;scalar multiplication&lt;/em&gt;, respectively, which satisfy the following
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;(V,+) is an &lt;a href="http://en.wikipedia.org/wiki/Abelian_group"&gt;Abelian group&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication distributes over vector addition&lt;/strong&gt;&lt;p&gt;Formally, α·(v+w) = α·v + α·w for all α ∈ &lt;b&gt;R&lt;/b&gt; and v,w ∈ V.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector addition distributes over scalar multiplication&lt;/strong&gt;
&lt;p&gt;Formally, (α + β)·v = α·v + β·v for all α,β ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication is compatible with vector addition&lt;/strong&gt;
&lt;p&gt;Formally, α·(β·v) = (αβ)·v for all α,β ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar multiplication has an identity element&lt;/strong&gt;&lt;p&gt;Formally, 1·v = v for all v ∈ V, where 1 is the multiplicative identity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
These properties aren't arbitrary, even though they might look like it.  The most common vector space is n-dimensional Euclidean space &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt;.  &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;2&lt;/sup&gt; is the Cartesian plane we grew up with in grade school and &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt; is the three-dimensional space we live in every day.
&lt;/p&gt;

&lt;p&gt;
A vector is an element of a vector space.  For &lt;b&gt;x&lt;/b&gt; ∈ &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt; we write &lt;b&gt;x&lt;/b&gt; = (x&lt;sub&gt;1&lt;/sub&gt;, x&lt;sub&gt;2&lt;/sub&gt;, ..., x&lt;sub&gt;n&lt;/sub&gt;), i.e., an ordered tuple of n components.  For example, (1,2,3) is an element of &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt;, as is (√2, 5/3, 2.12).
&lt;/p&gt;

&lt;p&gt;
Let's work in &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt;.  Take v = (1,0,0) and u = (1,1,0).  Then v+u = (1+1, 0+1, 0+0) = (2,1,0).  That is, addition on &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;3&lt;/sup&gt; is just coordinate-wise addition from &lt;b&gt;R&lt;/b&gt;.  Scalar multiplication works the same way, so that √3 · (1,10,0) = (√3,10√3,0).
&lt;/p&gt;
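&lt;p&gt;
The same arithmetic carries over directly to code; here's a quick sketch using numpy arrays (an illustration, not part of the definitions above):
&lt;/p&gt;

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])
u = np.array([1.0, 1.0, 0.0])

# Vector addition is coordinate-wise addition.
assert (v + u == np.array([2.0, 1.0, 0.0])).all()

# Scalar multiplication scales every coordinate.
w = np.sqrt(3) * np.array([1.0, 10.0, 0.0])
# w == (sqrt(3), 10*sqrt(3), 0)
```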

&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;. If we're just talking about some abstract vector &lt;b&gt;v&lt;/b&gt; in some vector space V over the reals then we refer to the i&lt;sup&gt;th&lt;/sup&gt; coordinate of &lt;b&gt;v&lt;/b&gt; as v&lt;sub&gt;i&lt;/sub&gt;.
&lt;/p&gt;

&lt;h3&gt;Linear Transformations and Matrices&lt;/h3&gt;
&lt;p&gt;
Any time you see a mathematical object you should immediately ask yourself, "What are the transformations of this object?"  In linear algebra the transformations between vector spaces are called &lt;em&gt;linear transformations&lt;/em&gt; (boring, eh?).  The definition:
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let V,W be vector spaces over the reals.  A function f: V → W is a &lt;em&gt;linear transformation&lt;/em&gt; if it satisfies the following conditions:
&lt;ol&gt;
&lt;li&gt;f(v+u) = f(v) + f(u) for all v,u ∈ V&lt;/li&gt;
&lt;li&gt;f(α·v) = α·f(v) for all α ∈ &lt;b&gt;R&lt;/b&gt; and v ∈ V&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;p&gt;
It turns out that every linear transformation can be written as a &lt;a href="http://en.wikipedia.org/wiki/Matrix_(mathematics)"&gt;matrix&lt;/a&gt;&lt;span class="footnote"&gt;Yeah, yeah, the matrix representation is only unique up to a choice of &lt;a href="http://en.wikipedia.org/wiki/Basis_(linear_algebra)"&gt;basis&lt;/a&gt;.&lt;/span&gt;.  Remember those guys from high-school algebra?
&lt;/p&gt;

&lt;p&gt;
Matrices are nice because &lt;em&gt;matrix multiplication corresponds to the composition of linear transformations&lt;/em&gt;.  That is, let V be a vector space over the reals and let f and g be linear transformations on V whose matrices are A and B respectively.  Then, if &lt;b&gt;v&lt;/b&gt; ∈ V, we have (A · B)&lt;b&gt;v&lt;/b&gt; = f(g(&lt;b&gt;v&lt;/b&gt;)).  Matrix multiplication is a well-defined, computationally simple operation, whereas the composition of linear transformations is comparatively difficult to work with directly.&lt;/p&gt;
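&lt;p&gt;
Here is a small sketch of that correspondence, with two made-up 2×2 matrices:
&lt;/p&gt;

```python
import numpy as np

# Two linear maps on R^2 given by matrices A and B; the entries
# are arbitrary, chosen only for illustration.
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
B = np.array([[0.0, -1.0],
              [1.0,  0.0]])  # rotation by 90 degrees

f = lambda x: A @ x
g = lambda x: B @ x

v = np.array([3.0, 4.0])

# Composing the maps agrees with multiplying the matrices first.
assert np.allclose((A @ B) @ v, f(g(v)))
```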

&lt;p&gt;
I don't want to write a full-on tutorial about multiplying matrices, so I recommend reading the &lt;a href="http://en.wikipedia.org/wiki/Matrix_multiplication"&gt;Wikipedia article&lt;/a&gt; on the subject.  Here's an example of a matrix with two rows and three columns, i.e., a 2×3 matrix.  All of the entries are assumed to be real numbers.
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/matrix.png' alt='matrix.png'/&gt;

&lt;p&gt;
This takes as its input a three-dimensional vector and outputs a two-dimensional vector.  In general, an m×n matrix (one with m rows and n columns) takes as its input an n-dimensional vector and outputs an m-dimensional vector.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let A be an m×n matrix over the reals.  Then the entry in the i&lt;sup&gt;th&lt;/sup&gt; row and j&lt;sup&gt;th&lt;/sup&gt; column is denoted as a&lt;sub&gt;ij&lt;/sub&gt;.  One might even write A = (a&lt;sub&gt;ij&lt;/sub&gt;).
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let A be an m×n matrix over the reals.  The &lt;em&gt;transpose&lt;/em&gt; of A, denoted A&lt;sup&gt;T&lt;/sup&gt;, is the matrix B defined by b&lt;sub&gt;ij&lt;/sub&gt; = a&lt;sub&gt;ji&lt;/sub&gt;, i.e., the matrix in which the rows and columns are swapped.
&lt;/p&gt;

&lt;p&gt;
The transpose is important for all sorts of things, as we'll see.
&lt;/p&gt;

&lt;h3&gt;Eigenvalues and Eigenvectors&lt;/h3&gt;
&lt;p&gt;
Eigenvalues and eigenvectors are two of the most important objects in linear algebra.  They are defined as follows.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let A be a square n×n matrix and let &lt;b&gt;v&lt;/b&gt; ∈ &lt;b&gt;R&lt;/b&gt;&lt;sup&gt;n&lt;/sup&gt;.  &lt;b&gt;v&lt;/b&gt; is an eigenvector of A if there exists a real number λ such that
&lt;img class="math" src='http://assets.20bits.com/misc/eigenvector.png' alt='eigenvector.png'/&gt;
λ is called an &lt;em&gt;eigenvalue of A&lt;/em&gt; and &lt;b&gt;v&lt;/b&gt; is the &lt;em&gt;eigenvector corresponding to λ&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
Who cares about eigenvectors?  They're useful for all sorts of things.  Calculating a webpage's &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt;, for example, is really a problem of finding a certain eigenvector.  You can read about &lt;a href="http://en.wikipedia.org/wiki/Eigenvalue_algorithm"&gt;algorithms to calculate eigenvalues&lt;/a&gt;, but I'll cover more of that ground in Part III.
&lt;/p&gt;
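&lt;p&gt;
If you just want to see the definition in action, numpy will compute eigenvalues and eigenvectors for you.  A quick sketch with an arbitrary symmetric matrix:
&lt;/p&gt;

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # an arbitrary symmetric matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
# The columns of `eigenvectors` are the eigenvectors: A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
# The eigenvalues of this matrix are 3 and 1 (order not guaranteed).
```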

&lt;p&gt;
For now, just keep this idea in your head.  We're going to be doing something very much like PageRank to calculate a person's standing in a social network.
&lt;/p&gt;

&lt;h3&gt;Graphs and Matrices&lt;/h3&gt;
&lt;p&gt;
So far I've not even mentioned graphs, so you're probably wondering what the hell any of this has to do with graph theory.  It turns out every graph has several associated matrices that are very useful for analyzing the properties of that graph.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph with n vertices and m edges, i.e., V = {v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;} and E = {e&lt;sub&gt;1&lt;/sub&gt;, ..., e&lt;sub&gt;m&lt;/sub&gt;}.  The &lt;em&gt;incidence matrix&lt;/em&gt; of G is the m×n matrix C(G) = (c&lt;sub&gt;ij&lt;/sub&gt;) defined by
&lt;img class="math" src='http://assets.20bits.com/misc/incidence.png' alt='incidence.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph with n vertices, V = {v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;}.  The &lt;em&gt;adjacency matrix&lt;/em&gt; of G is the n×n square matrix A(G) = (a&lt;sub&gt;ij&lt;/sub&gt;) defined by 
&lt;img class="math" src='http://assets.20bits.com/misc/adjacency.png' alt='adjacency.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be an undirected graph with n vertices.  The &lt;em&gt;degree matrix&lt;/em&gt; of G, D(G), is the matrix defined by
&lt;img class="math" src='http://assets.20bits.com/misc/degree.png' alt='degree.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
These are the three basic matrices associated with a graph.  The incidence matrix encapsulates vertex-edge relationships, the adjacency matrix encapsulates vertex-vertex relationships, and the degree matrix encapsulates information about the degrees.  For a concrete example consider the cycle graph C&lt;sub&gt;4&lt;/sub&gt;:&lt;img class="math" src='http://assets.20bits.com/misc/cycle-labeled.png' alt='cycle-labeled.png'/&gt;&lt;/p&gt;

&lt;p&gt;
Here are the corresponding matrices: &lt;img class="math" src='http://assets.20bits.com/misc/matrices.png' alt='matrices.png'/&gt;
&lt;/p&gt;

&lt;p&gt;
There's one last matrix that is worth knowing.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph.  The &lt;em&gt;Laplacian matrix&lt;/em&gt; or &lt;em&gt;graph Laplacian&lt;/em&gt; for G is defined as L(G) = D(G) - A(G), where D is the degree matrix and A is the adjacency matrix.
&lt;/p&gt;
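&lt;p&gt;
Here is a sketch of these matrices for C&lt;sub&gt;4&lt;/sub&gt; in code.  I'm assuming the vertices are labeled in cycle order, which may differ from the labeling in the figure:
&lt;/p&gt;

```python
import numpy as np

# The cycle C4 with vertices 0..3 and edges forming 0-1-2-3-0.
# (The figure's labeling may differ; this ordering is an assumption.)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

A = np.zeros((n, n))           # adjacency: vertex-vertex relationships
for i, j in edges:
    A[i, j] = A[j, i] = 1

C = np.zeros((len(edges), n))  # incidence: edge-vertex relationships
for e, (i, j) in enumerate(edges):
    C[e, i] = C[e, j] = 1

D = np.diag(A.sum(axis=1))     # degree matrix: every vertex has degree 2

L = D - A                      # the graph Laplacian
```

Each row of the Laplacian sums to zero, since a vertex's degree equals the number of its neighbors.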

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
It turns out that lots of information about the graph is stored in these matrices.  From these matrices we can calculate things like the number of spanning trees, the algebraic connectivity, etc.  Most of these are well-defined eigenvalue problems, and so are computationally feasible.
&lt;/p&gt;

&lt;p&gt;
In Part III we'll use these matrices to tackle the problem of influence or prestige in social networks.
&lt;/p&gt;</description>
      <pubDate>Thu, 02 Aug 2007 14:56:59 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-ii-linear-algebra</link>
    </item>
    <item>
      <title>Rules of Thumb for Successful Facebook Applications</title>
      <description>&lt;p&gt;
Creating &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt; has given me the opportunity to see what apps succeed and why.  Here are some rules of thumb to consider when writing your Facebook app.
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;strong&gt;The Complexity Ceiling&lt;/strong&gt;&lt;p&gt;
Facebook is simple.  The features on Facebook are simple, even compared to similar features on other sites, e.g., Flickr vs. Photos.  My hypothesis is that no application more complex than the most complex feature on Facebook will succeed. Compare three similar apps: &lt;a href="http://appaholic.com/display/2395952879+2949245143+2481647302"&gt;Bookshelf, Bookshare, and Visual Bookshelf&lt;/a&gt;.  Of these, Visual Bookshelf started the latest, is the simplest, and has two orders of magnitude more users than the other two.  The simpler you make your app, the better it will do.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Try to Be Social&lt;/strong&gt;&lt;p&gt;
Just because you've built it doesn't mean the users will come.  Facebook is still first and foremost a social platform and that's why people are using it.  If your application has no social component it's just going to flop.  Not only will it be difficult to spread virally but there's little compelling reason for people to install it.  For example, let's say you have a blog and want to promote its content with a Facebook app.  You should &lt;em&gt;not&lt;/em&gt; just create a Facebook app which displays your blog posts in people's profiles; rather, try to add a social component.  What are my favorite articles?  What have I commented on?  It's then easy to see what my &lt;em&gt;friends&lt;/em&gt; like and what &lt;em&gt;they&lt;/em&gt; have commented on.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;If You Can't Be Social, Be Viral&lt;/strong&gt;&lt;p&gt;
Writing good social applications is hard.  If it weren't Facebook wouldn't be worth what it is.  You can, however, write an application which does nothing more than spread itself.  If your idea is funny enough, like &lt;a href="http://apps.facebook.com/apps/application.php?id=2458301688" rel="nofollow"&gt;Vampires&lt;/a&gt; or &lt;a href="http://apps.facebook.com/apps/application.php?id=2341504841" rel="nofollow"&gt;Zombies&lt;/a&gt;, people will use it.  It's still too early to tell whether these apps have staying power, but you'll at least get your fifteen minutes of Facebook fame.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Don't Be Too Weird&lt;/strong&gt;&lt;p&gt;
People are used to how Facebook works.  If your application is totally foreign they just won't understand it, even if it's the most usable, straightforward piece of software.  The top applications tend to fall into one of two broad categories (again, iLike is an exception).  In the first category the applications are viral for virality's sake.  Why this works is self-evident, since the applications exist for no other reason than to spread themselves.  In the second the applications augment or complement an existing Facebook feature.  &lt;a href="http://apps.facebook.com/apps/application.php?id=2425101550" rel="nofollow"&gt;Top Friends&lt;/a&gt;, &lt;a href="http://apps.facebook.com/apps/application.php?id=2439131959" rel="nofollow"&gt;Graffiti&lt;/a&gt;, and &lt;a href="http://apps.facebook.com/apps/application.php?id=2345673396" rel="nofollow"&gt;X Me&lt;/a&gt; all fall into this category, behaving like Facebook's friends list, wall, and poke features, respectively.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Fads Exist&lt;/strong&gt;&lt;p&gt;
Even though the Facebook platform is only two months old fads have already come and gone.  If you're just looking to get a respectable number of users in a short period of time then it's worth paying attention to these fads.  The "quotes application" genre is an example.  There are applications which add quotes to your profile from everything from Star Wars and Family Guy to Friends and Scrubs.  Don't blame me if the fad becomes unpopular and your app fizzles, though.&lt;/p&gt;&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;Quality Matters&lt;/strong&gt;&lt;p&gt;
This might be obvious, but quality matters.  Apps like iLike and Causes provide a set of very high-quality, social features, so even though they're not the most simple apps they're still compelling.  You also have to be prepared to deal with your growth.  You can &lt;a href="http://www.insidefacebook.com/2007/07/17/predicting-growth-with-appaholic/"&gt;model your growth using Appaholic&lt;/a&gt; to predict how many users you'll have in a few days, weeks, or whatever.  If it's more than you anticipated make sure you have the hardware to handle it; users &lt;em&gt;will&lt;/em&gt; uninstall apps that are slow or broken, even if it's not the apps' fault.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
These rules aren't hard-and-fast, and many successful applications break some or all of them.  iLike and Causes, for example, are relatively complex compared to most other applications and even Facebook itself, but they're still wildly successful.  They get to bend the rules because they're extremely social and have high-quality features.  If you can match that level of quality then go for it, but I personally think it's easier to create a simple application and add features as it grows than to create a complex application and hope users don't get lost.  Users are more important than features.
&lt;/p&gt;</description>
      <pubDate>Wed, 01 Aug 2007 15:25:02 +0000</pubDate>
      <link>http://20bits.com/article/rules-of-thumb-for-successful-facebook-applications</link>
    </item>
    <item>
      <title>Graph Theory: Part I (Introduction)</title>
      <description>&lt;p&gt;
This is the first in a multi-part series about graph theory here on 20bits.  This grew out of my wanting to write about some of the mathematical aspects of Facebook, but I realized that many people might not have a sufficient background to just jump right in.  Rather than cover all the ground in one article I decided to break it up into multiple parts.  This is the first part, a quick introduction to graph theory.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="http://en.wikipedia.org/wiki/Graph_theory"&gt;Graph theory&lt;/a&gt; is a fundamental area of study in discrete mathematics.  As the name implies graph theory is about &lt;em&gt;graphs&lt;/em&gt;, so I'll first define graph and then discuss why people are so interested in studying these critters.  I'm going to assume you're familiar with the idea of &lt;a href="http://en.wikipedia.org/wiki/Ordered_pair"&gt;ordered pairs&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Set"&gt;sets&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Some Definitions&lt;/h3&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;graph&lt;/em&gt; G is a pair (V,E) of sets, called the &lt;em&gt;vertex set&lt;/em&gt; and &lt;em&gt;edge set&lt;/em&gt;.  V is a collection of abstract &lt;em&gt;vertices&lt;/em&gt;, written {v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;,...,v&lt;sub&gt;n&lt;/sub&gt;} and E is a collection of ordered pairs of vertices, called &lt;em&gt;edges&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
As you can see, this is pretty abstract.  The definition leaves you free to decide precisely what the vertices and edges are.  Vertices could be cities and edges could be interstate highways.  An example:
&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph-directed.png' alt='graph-directed.png'/&gt;

&lt;p&gt;
This type of graph is called a &lt;em&gt;directed graph&lt;/em&gt; because some of the edges have a direction, i.e., they only go one way.  Going back to the original definition, we have V = {v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;,v&lt;sub&gt;4&lt;/sub&gt;} and E = {(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;),(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;),(v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;4&lt;/sub&gt;),(v&lt;sub&gt;2&lt;/sub&gt;,v&lt;sub&gt;3&lt;/sub&gt;)}
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph, then we write V(G) to mean the vertex set of G and E(G) to mean the edge set.  We will just write V and E if the context makes it clear which graph we're talking about.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph and let u,v ∈ V(G).  We write v ~ u if there is an edge connecting v to u, i.e., if (v,u) ∈ E(G).
&lt;/p&gt;
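&lt;p&gt;
In code the definition is just as direct.  Here's a sketch of the directed graph above as plain Python sets (my own representation, chosen for illustration):
&lt;/p&gt;

```python
# The directed graph from the figure: the vertex set and the edge
# set of ordered pairs.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (1, 4), (2, 3)}

def connected(v, u):
    """v ~ u: is there an edge from v to u?"""
    return (v, u) in E

connected(1, 2)  # True: the edge (v1, v2) exists
connected(2, 1)  # False: edges here are ordered pairs
```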

&lt;p&gt;
Sometimes we don't care about direction and can make edges directionless.  These sorts of graphs are called &lt;em&gt;undirected graphs&lt;/em&gt; and look like this
&lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph.png' alt='graph.png'/&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A graph G is an &lt;em&gt;undirected graph&lt;/em&gt; if u ~ v implies v ~ u for all u,v ∈ V(G).
&lt;/p&gt;

&lt;h3&gt;Concrete examples&lt;/h3&gt;
&lt;p&gt;
I'd be remiss if I kept talking like graph theory is some pie-in-the-sky theoretical abstraction.  In fact, many real-world situations can be modeled using graph theory.  Some examples:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shipping routes&lt;/strong&gt;&lt;p&gt;The vertices are shipping hubs and the edges are the routes between them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social networks&lt;/strong&gt;&lt;p&gt;The vertices are people and the edges are social connections (e.g., p ~ q if p is a friend of q)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telecommunications networks&lt;/strong&gt;&lt;p&gt;The vertices are computers on the network and the edges are the network connections between them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disease transmission&lt;/strong&gt;&lt;p&gt;The vertices are organisms which can carry the disease and the edges represent one organism spreading it to another&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sexual networks&lt;/strong&gt;&lt;p&gt;The vertices are people and the edges denote which pairs have slept together (see, e.g., &lt;a href="http://researchnews.osu.edu/archive/chainspix.htm"&gt;The Structure of Romantic and Sexual Relations at Jefferson High&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;More Definitions&lt;/h3&gt;
&lt;p&gt;
Essentially any situation where you want to consider pairs of objects and the connections between those pairs can be analyzed using graph theory.  There are a few more definitions to cover.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;graph loop&lt;/em&gt; or just &lt;em&gt;loop&lt;/em&gt; is an edge connecting a vertex v to itself, i.e., v ~ v.  A graph with no such loops is called a &lt;em&gt;simple&lt;/em&gt; graph.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;. Some authors allow multiple edges between vertices.  In that setting a graph with no loops and at most one edge between any two vertices is called a &lt;em&gt;simple graph&lt;/em&gt;.  Although this type of graph is less than ideal for analysis it occurs relatively frequently in reality, e.g., two different roads connecting a pair of cities or redundant network connections.  I'll make sure to note where we are dealing with such graphs and how we work around them.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;.  Let G be a graph.  A &lt;em&gt;path&lt;/em&gt; or a &lt;em&gt;walk&lt;/em&gt; is a collection of vertices {v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;, ..., v&lt;sub&gt;k&lt;/sub&gt;} such that v&lt;sub&gt;i&lt;/sub&gt; ~ v&lt;sub&gt;i+1&lt;/sub&gt; for all i, 1 ≤ i &lt; k.  A path with no repeated vertices is called &lt;em&gt;simple&lt;/em&gt;, and a path such that v&lt;sub&gt;k&lt;/sub&gt; ~ v&lt;sub&gt;1&lt;/sub&gt; is called a &lt;em&gt;closed path&lt;/em&gt;, &lt;em&gt;closed walk&lt;/em&gt;, or a &lt;em&gt;cycle&lt;/em&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. A &lt;em&gt;weighted graph&lt;/em&gt; G is a graph such that each edge in E(G) has an associated weight, typically a real number.
&lt;/p&gt;

&lt;p&gt;
Weighted graphs are the stuff of many famous algorithms.  There are a whole slew of algorithms dedicated to finding the &lt;a href="http://en.wikipedia.org/wiki/Shortest_path_problem"&gt;shortest path&lt;/a&gt; between two vertices in a weighted graph, where "shortest" means the path with the smallest weight.  The algorithms vary in performance depending on several factors: the ratio of vertices to edges, whether the graph has negative weights, whether we have a good heuristic for determining what a path might cost, etc. 
&lt;/p&gt;
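&lt;p&gt;
As an illustration, here is a minimal sketch of Dijkstra's algorithm, probably the most famous of these shortest-path algorithms.  It requires non-negative weights, and the toy graph below is made up:
&lt;/p&gt;

```python
import heapq

def dijkstra(adj, source):
    """Smallest total weight from source to every reachable vertex.
    adj maps a vertex to a list of (neighbor, weight) pairs; all
    weights must be non-negative."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was found
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# A toy weighted graph: the direct edge a-c costs more than a-b-c.
graph = {
    "a": [("b", 1), ("c", 5)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 5), ("b", 2)],
}
dijkstra(graph, "a")  # {'a': 0, 'b': 1, 'c': 3}
```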

&lt;h3&gt;Important Graphs&lt;/h3&gt;
&lt;p&gt;
There are some graphs every student of computer science or discrete mathematics is just sort of expected to know.
&lt;/p&gt;
&lt;h5&gt;The Complete Graph&lt;/h5&gt;
&lt;p&gt;
The complete graph on n vertices, written K&lt;sub&gt;n&lt;/sub&gt;, has n vertices and an edge for every pair of distinct vertices.  K&lt;sub&gt;4&lt;/sub&gt; looks like this, for example:
&lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/graph-complete.png' alt='graph-complete.png' /&gt;

&lt;h5&gt;The Cycle&lt;/h5&gt;
&lt;p&gt;
The cycle on n vertices, written C&lt;sub&gt;n&lt;/sub&gt;, is the undirected graph on n vertices which consists of precisely one cycle containing every vertex.  C&lt;sub&gt;4&lt;/sub&gt; looks like &lt;/p&gt;
&lt;img class="math" src='http://assets.20bits.com/misc/graph-cycle.png' alt='graph-cycle.png' /&gt;

&lt;h5&gt;The Complete Bipartite Graph&lt;/h5&gt;
&lt;p&gt;
The complete bipartite graph, written K&lt;sub&gt;n,m&lt;/sub&gt;, is an undirected graph whose vertex set is the union of two disjoint sets of size n and m, respectively, i.e.,  V(K&lt;sub&gt;n,m&lt;/sub&gt;) = V ∪ U, V ∩ U = ∅, |V| = n, and |U| = m.  Its edges satisfy the property that v ~ u for all v ∈ V and all u ∈ U.  It's easier to draw than to write, I think, so here is K&lt;sub&gt;2,2&lt;/sub&gt;: &lt;/p&gt;

&lt;img class="math" src='http://assets.20bits.com/misc/graph-bipartite.png' alt='graph-bipartite.png' /&gt;

&lt;p&gt;
Here V = {v&lt;sub&gt;1&lt;/sub&gt;,v&lt;sub&gt;2&lt;/sub&gt;} and U = {v&lt;sub&gt;3&lt;/sub&gt;, v&lt;sub&gt;4&lt;/sub&gt;}.  Any graph G whose vertex set V(G) can be partitioned into two sets such that no two vertices in a partition share an edge is called &lt;em&gt;bipartite&lt;/em&gt;, meaning "of two parts."  The complete bipartite graph is the bipartite graph which has as many edges as possible without ceasing to be bipartite.
&lt;/p&gt;
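&lt;p&gt;
Checking whether a graph is bipartite amounts to trying to 2-color it, putting neighbors on opposite sides.  Here's a sketch using breadth-first search (the function and graphs are mine, for illustration):
&lt;/p&gt;

```python
from collections import deque

def is_bipartite(adj):
    """Try to 2-color an undirected graph; adj maps each vertex to
    the set of its neighbors.  Succeeds iff the graph is bipartite."""
    color = {}
    for start in adj:          # handle disconnected graphs too
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # opposite side
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd cycle: not bipartite
    return True

# K_{2,2} (a 4-cycle) is bipartite; the triangle K_3 is not.
k22 = {1: {3, 4}, 2: {3, 4}, 3: {1, 2}, 4: {1, 2}}
k3 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
is_bipartite(k22)  # True
is_bipartite(k3)   # False
```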

&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
That's it for the basics.  Hopefully you understand what a graph is, some of the problems graph theory is useful in analyzing, and some of the common constructs in graph theory.  In Part II I'm going to cover the relationship between graph theory and &lt;a href="http://en.wikipedia.org/wiki/Linear_algebra"&gt;linear algebra&lt;/a&gt;.  My ultimate plan is to use this relationship to come up with a way to rank people's influence in social networks, using a measure called &lt;a href="http://en.wikipedia.org/wiki/Eigenvector_centrality"&gt;eigenvector centrality&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Errata&lt;/h3&gt;
&lt;p&gt;
I left out a few definitions yesterday, so here they are.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Definition&lt;/strong&gt;. Let G be a graph and let v ∈ V(G) be a vertex.  The &lt;em&gt;out degree&lt;/em&gt; of v, written deg&lt;sup&gt;-&lt;/sup&gt;(v), is the number of edges directed away from v.  Conversely, the &lt;em&gt;in degree&lt;/em&gt; of v, written deg&lt;sup&gt;+&lt;/sup&gt;(v), is the number of edges directed towards v.  If G is undirected then deg&lt;sup&gt;+&lt;/sup&gt;(v) = deg&lt;sup&gt;-&lt;/sup&gt;(v) for all v ∈ V(G), so we call this number the &lt;em&gt;degree&lt;/em&gt; of v and write it deg(v). 
&lt;/p&gt;
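&lt;p&gt;
A small worked example: take the directed graph with vertices v&lt;sub&gt;1&lt;/sub&gt;, v&lt;sub&gt;2&lt;/sub&gt;, v&lt;sub&gt;3&lt;/sub&gt; and edges v&lt;sub&gt;1&lt;/sub&gt; → v&lt;sub&gt;2&lt;/sub&gt;, v&lt;sub&gt;1&lt;/sub&gt; → v&lt;sub&gt;3&lt;/sub&gt;, and v&lt;sub&gt;2&lt;/sub&gt; → v&lt;sub&gt;3&lt;/sub&gt;.  Then deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;1&lt;/sub&gt;) = 2 and deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;1&lt;/sub&gt;) = 0, deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;2&lt;/sub&gt;) = deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;2&lt;/sub&gt;) = 1, and deg&lt;sup&gt;-&lt;/sup&gt;(v&lt;sub&gt;3&lt;/sub&gt;) = 0 while deg&lt;sup&gt;+&lt;/sup&gt;(v&lt;sub&gt;3&lt;/sub&gt;) = 2.  Summing either the out degrees or the in degrees counts each edge exactly once, so both sums equal the number of edges (here, 3).
&lt;/p&gt;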

&lt;p&gt;
The plus (+) and minus (-) have to do with the idea of &lt;em&gt;flow&lt;/em&gt;.  If you imagine a directed graph G and some substance diffusing through the graph along the edges then the out degree and the in degree measure the degree to which that vertex loses (-) or accumulates (+) the substance, respectively. 
&lt;/p&gt;</description>
      <pubDate>Tue, 31 Jul 2007 17:28:32 +0000</pubDate>
      <link>http://20bits.com/article/graph-theory-part-i-introduction</link>
    </item>
    <item>
      <title>Appaholic and Inside Facebook</title>
      <description>&lt;p&gt;
This is more an update than an article.  For those who don't know, my big project for the last three weeks has been &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt;, a great utility for graphing the growth of Facebook applications.  It's been on the front pages of &lt;a href="http://digg.com/software/Appaholic_Awesome_Alexa_style_Graphs_for_Facebook_Apps"&gt;digg&lt;/a&gt; and &lt;a href="http://mashable.com/2007/07/07/appaholic/"&gt;Mashable&lt;/a&gt; and has been making the rounds in the blogosphere (today &lt;a href="http://scobleizer.com/2007/07/15/if-only-i-got-1-for-each-google-reader-opened/#comments"&gt;Robert Scoble&lt;/a&gt; linked to it).
&lt;/p&gt;

&lt;p&gt;
I've also been given a guest blogger position at &lt;a href="http://insidefacebook.com/"&gt;Inside Facebook&lt;/a&gt;, a blog about Facebook stuff.  I'm going to put most of my Facebook editorial over there for now and keep 20bits more focused on code and technology.  I think I'm going to write an article about using &lt;a href="http://en.wikipedia.org/wiki/Eigenvector_centrality"&gt;eigenvector centrality&lt;/a&gt; to determine the influencers in a social network, for example.
&lt;/p&gt;

&lt;p&gt;
Anyhow, this was just a heads-up and an explanation for why posting has been slower than before.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Mon, 16 Jul 2007 00:37:20 +0000</pubDate>
      <link>http://20bits.com/article/appaholic-and-inside-facebook</link>
    </item>
    <item>
      <title>The Social Graph, Facebook, and Virality</title>
      <description>&lt;p&gt;
The social graph is the web of connections between friends, family, and acquaintances that everyone has.  My friend knows someone who works at the company I want to interview at, so he connects us and I get a shiny new job after acing my interview.  It helps me meet new people, find new music I like, and generally navigate my social world.  If I find something because of a connection in my social graph I'm much more apt to trust its worth.  After all, people I know recommended it!
&lt;/p&gt;

&lt;p&gt;
The thing all social networking websites, like MySpace, LinkedIn, Friendster, and Facebook, have in common is that they try to create a virtual copy of the social graph.  The graph serves as a way to spread whatever content the site is pushing.  If the representation of the graph is good enough I'll know what my friends are reading, watching, doing this weekend, etc.
&lt;/p&gt;

&lt;p&gt;
I don't think it's controversial to say that among all the major social networking sites Facebook has the best representation of the social graph.  MySpace's representation is too connected &amp;mdash; my real-life connections might be there, but I also have a million other connections that don't reflect anything in the real social graph.  Friendster's representation isn't connected enough since none of my friends use it.  Facebook, however, has hit the sweet spot: collecting friends a la MySpace is actively discouraged by the features on the site, but they began with a narrow enough demographic that they were able to reach a critical mass among college students.
&lt;/p&gt;

&lt;h3&gt;The Social Graph and Virality&lt;/h3&gt;
&lt;p&gt;
The social graph is closely related to so-called "viral" services.  News can spread faster through the lines in the social graph than it could otherwise.  By having an accurate virtual representation of the social graph Facebook is able to amplify the usual word-of-mouth effect.  Anything I want to share I can broadcast instantly to all my connections.  Facebook does this using the News Feed and Mini-Feed features, showing my friends' activities chronologically and my own activities in my personal profile.
&lt;/p&gt;

&lt;p&gt;
Now we have the Facebook platform.  Some people mistakenly treat it as a fancy "widget platform," but they don't realize that the platform has moved Facebook far beyond the tired MySpace + widgets = success formula.  In addition to providing compelling content every social networking site wants to build a quality copy of the social graph, but in one fell swoop Facebook has given away access to the highest quality copy on the web, namely theirs.  Every developer or entrepreneur looking to build a social networking site will have to ask themselves from now on, "Would it just be better to write my application on the Facebook platform?"
&lt;/p&gt;

&lt;p&gt;
I don't think one can overstate the power of the position this has put Facebook in, from both a business and a technology perspective.  If they can actually cultivate a market around Facebook apps &amp;mdash; a proposition which is becoming increasingly likely with the creation of programs like &lt;a href="http://www.baypartners.com/appfactory/"&gt;appfactory&lt;/a&gt; &amp;mdash; they might very well move into truly revolutionary territory.
&lt;/p&gt;

&lt;h3&gt;Powerful, Yes, But no Magic Bullet&lt;/h3&gt;

&lt;p&gt;
But what does this mean for the app developer who just wants to create something popular?  The integration with the Facebook news feed allows apps to dramatically increase their virality by piggy-backing on Facebook's copy of the social graph.  But the Facebook platform isn't a magic bullet.  Turning your website into a Facebook app doesn't mean it will become an instant viral sensation &amp;mdash; you also have to understand how to use the social graph.
&lt;/p&gt;

&lt;p&gt;
Let's say you have a popular blog and want to increase people's interest by somehow integrating with Facebook.  Your first inclination is probably to let users add a feed of stories from your blog to their profile.  Even if your blog is popular this idea probably won't do much good for two reasons: one, readers will have to leave Facebook to get the content, interrupting whatever they were doing; two, there is no social component.
&lt;/p&gt;

&lt;p&gt;
The first problem is probably not worth your time to solve if all you're doing is recycling content from your blog.  The second problem is more serious because it runs against the very idea of the social graph.  The content being served is in no way related to the person whose profile I'm viewing.  It's not stories they've commented on, stories they've recommended, or anything of that sort; it's just a boring list of all stories.  Unless there is a social component do not expect Facebook to miraculously drive 30 million users to your website.
&lt;/p&gt;

&lt;p&gt;
Actually, don't expect it to drive traffic to your site at all.  Like I said, Facebook users don't want to leave Facebook.  As a concrete example, I wrote the &lt;a href="http://apps.facebook.com/apps/application.php?id=2368380368"&gt;PopSugar 100&lt;/a&gt; application for Sugar Publishing.  It drives about 5K visits per week, which is respectable but a drop in the bucket relative to our usual traffic.  You can try to monetize the application directly on Facebook, but if you're a medium-sized startup already monetizing your userbase in other ways then a Facebook app probably best serves as a way to increase your valuation.
&lt;/p&gt;

&lt;h3&gt;Case by Case&lt;/h3&gt;
&lt;p&gt;
&lt;strong&gt;Edit&lt;/strong&gt;: Joyce Park of Renkoo has written up a guide about &lt;a href="http://renkoo.wordpress.com/2007/07/03/how-to-design-for-the-facebook-platform/"&gt;designing for the Facebook platform&lt;/a&gt; that hits a lot of the points here.  You should read that, too.
&lt;/p&gt;

&lt;p&gt;
To see that the Facebook platform isn't a panacea for your viral distribution problems, let's look at some case studies.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2412849054"&gt;digg.com&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2412849054%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
The digg app is an example of an application which you would expect to do well on Facebook since it has a huge social component but which is essentially floundering.  Where other apps with much less impact than digg are growing at a rate of hundreds of users per hour, digg's app is almost static.  
&lt;/p&gt;
&lt;p&gt;
This is because, ironically, they don't harness the social graph at all.  All the app lets me do is post a list of stories I've dugg into my profile.  I've suggested &lt;a href="/article/5-ways-to-improve-the-digg-app"&gt;five ways&lt;/a&gt; the folks at digg could improve it.  The easiest way is to post all news items I submit or digg, at my preference, to my news feed.  Not only does this benefit digg by bringing in more users but it makes digg much more useful because I can promote my stories on digg through my corner of the social graph.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2341504841&amp;range=7d"&gt;Zombies&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2341504841%26f%3DUsersPerHour%26range%3D7d&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
This app is an example of an application which is entirely viral but provides little content.  As you'd expect, it spread very fast initially but is becoming saturated.  I suspect, as time goes on, people will uninstall it because there is little there to hold people's interest.  Being viral helps your application spread but if there is no substance to back it up people &lt;em&gt;will&lt;/em&gt; eventually leave.  We'll see if the authors of Zombies can create compelling content to keep people involved.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2439131959"&gt;Graffiti&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2439131959%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Graffiti is probably the perfect example of a Facebook application right now.  Not only is it extremely viral, it augments an existing Facebook service so there is essentially zero learning curve.  I'm informed when someone's Graffiti wall is updated via the news feed and people can share particularly clever drawings with each other. 
&lt;/p&gt;
&lt;p&gt;
It also shows the power social networking software can have in our real lives.  So many friends of mine have the Graffiti Wall that it has started seeping into my real social interactions.  Universal access among people connected to me means that it becomes a useful avenue of social expression.  More plainly, when something amusing happens I might think of a great thing to draw on someone's Graffiti Wall.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2360569570"&gt;Booze Mail&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2360569570%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Booze Mail, like Graffiti, does a great job of exploiting the social graph.  I'm informed whenever a friend of mine sends someone a drink, so not only do I learn about Booze Mail, but I get a better glimpse of my friend's social network.  I'd bet money that Booze Mail is going to be one of the million+ user applications within a week.  Indeed, of the 20 apps posting the largest absolute per-day gains, only three have fewer than a million users.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://appaholic.com/display/2920895692"&gt;Matches&lt;/a&gt;&lt;/strong&gt;
&lt;p&gt;
&lt;h4&gt;Users per hour&lt;/h4&gt;
&lt;object width="350" height="220"&gt;&lt;param name="quality" value="high" /&gt;&lt;param name="bgcolor" value="#ffffff" /&gt;&lt;embed src="http://appaholic.com/charts.swf?library_path=http%3A%2F%2Fappaholic.com%2Fcharts_library&amp;stage_width=350&amp;stage_height=220&amp;php_source=http%3A%2F%2Fappaholic.com%2Fxml.php%3Fdisplay%3D2920895692%26f%3DUsersPerHour%26range%3D&amp;license=J1XPVENC9UOL.NS5T4Q79KLYCK07EK" quality="high" bgcolor="#ffffff" width="350" height="220" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
Matches is one of a dozen-or-so apps posting regular hourly losses.  It serves as an example of the pitfalls inherent in the Facebook platform.  If you look at the &lt;a href="http://uchicago.facebook.com/apps/application.php?id=2920895692"&gt;Matches discussion board&lt;/a&gt; you'll see that, for whatever reason, it isn't working for most people.  Part of the problem is that Facebook's model requires the app developer to shoulder all the load.  
&lt;/p&gt;
&lt;p&gt;If you created a popular app then you'd best be prepared to deal with Facebook-sized traffic.  If you can't then people &lt;em&gt;will&lt;/em&gt; get fed up.  Apps like this show that people are definitely willing to uninstall applications that don't do what they want (regardless of whose fault it is).  So even once you have a userbase you're not guaranteed to keep it.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;
The Facebook platform has the potential to revolutionize the way we think about the social web.  It serves as a gateway to a high-quality copy of the social graph that exists in real life.  This graph, in turn, lets people share content, ideas, money, goods, and all sorts of things with ever-increasing efficiency.  The degree to which the graph permits targeted selection, e.g., find me all influential people between the ages of 20 and 25 who like to read, is an advertiser's wet dream.
&lt;/p&gt;
&lt;p&gt;
But the platform is no magic bullet.  Applications you think would do well sometimes falter and you're not guaranteed to be a viral hit just because you've created a widget for your blog.  Apps are rising and falling all the time and the market is still taking shape.  It's an exciting time, regardless.
&lt;/p&gt;

&lt;div class="notice warning"&gt;
All stats brought to you by &lt;a href="http://appaholic.com"&gt;Appaholic&lt;/a&gt;.
&lt;/div&gt;</description>
      <pubDate>Wed, 11 Jul 2007 01:25:51 +0000</pubDate>
      <link>http://20bits.com/article/the-social-graph-facebook-and-virality</link>
    </item>
    <item>
      <title>5 Ways to Improve the Digg App</title>
      <description>&lt;p&gt;
The &lt;a href="http://apps.facebook.com/apps/application.php?id=2412849054"&gt;digg.com Facebook application&lt;/a&gt; has a little under 20,000 users.  According to the digg blog they reached &lt;a href="http://blog.digg.com/?p=67"&gt;1 million registered users&lt;/a&gt; in early March.  Even if we reduce this number to the number of active digg users we can see that only a small percentage of the digg userbase is using the Facebook application.  Why?
&lt;/p&gt;

&lt;p&gt;
Demographics aren't the reason: according to &lt;a href="http://blogs.zdnet.com/micro-markets/?p=446"&gt;some reports&lt;/a&gt; about digg's demographics, I'd be very surprised if most people on digg didn't also have a Facebook account.  The reason is actually pretty simple.  Digg's application doesn't exploit the core of Facebook's platform: the social graph.
&lt;/p&gt;

&lt;h4&gt;Solutions&lt;/h4&gt;
&lt;p&gt;
digg.com is a &lt;em&gt;social&lt;/em&gt; bookmarking site.  More than just letting people submit and vote on links, it lets people see what their friends have submitted, voted on, and commented on.  The social component is important to digg's success, but Facebook's social graph is more interesting because it reflects the social connections we have in real life.  Digg can use this to its advantage.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;Put Digg on Facebook&lt;/h5&gt;
&lt;p&gt;
Digg has more to gain than to lose from Facebook, not least because Facebook has an order of magnitude more active users than digg.  So why not start big and create a Facebook-ified version of digg? 
Users would submit, vote, and comment on items directly from Facebook.
&lt;/p&gt;
&lt;p&gt;
This would pit digg against Facebook's own link-submission mechanism, but with the added benefit of the digg algorithm to surface the interesting content.  Wouldn't it be awesome if the most common way people submitted links to Facebook was via the digg application?  Monetization shouldn't be an issue since Facebook allows for ads.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;More interesting profile box&lt;/h5&gt;
&lt;p&gt;
People like to have interesting and customizable profile pages.  The more personalized it is, the better.  The digg application should give us the option of including all or some of the content we're interested in showing off.
&lt;/p&gt;
&lt;p&gt;
Some people show their interest by digging stories and some by commenting on stories, so both of those should be options.  For active submitters being able to include submitted stories is also important since this makes their Facebook profile act as a mini-advertisement for the stories they've submitted.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use the newsfeed&lt;/h5&gt;
&lt;p&gt;
The newsfeed is one of the keys to viral success on Facebook and the digg app doesn't use it at all.  This is probably one reason why it only has 20k users.  The newsfeed should update whenever I digg a story, comment on a story, or submit a story.
&lt;/p&gt;
&lt;p&gt;
For video submissions (or image submissions once digg gets an image section) the newsfeed post could contain a thumbnail version.  There should always be a "digg this" link directly in the newsfeed item.  This keeps the interaction within the flow of Facebook and would increase my use of digg since I know at least 100-some people would always see what stories I submit.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use the Graph, Luke!&lt;/h5&gt;
&lt;p&gt;
I'm on Facebook.  I have my friends.  They may or may not be my friends on digg, although I know a lot of my friends use digg.  The digg app should let me see the stories they've dugg, submitted, and commented on whether we're friends on digg or not.
&lt;/p&gt;
&lt;p&gt;
The app could also break down the diggs by network.  What's the most popular story in the San Francisco, CA network?  How about the Google, Inc. network?  At the very least I'd like to see what's popular in &lt;em&gt;my&lt;/em&gt; networks, especially since digg doesn't store any of this kind of information on their end.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Sharing and Inviting&lt;/h5&gt;
&lt;p&gt;
The digg app should make it easier to share digg links on Facebook.  I can't count the number of times per day I send links to people via IM or Facebook because I think they're funny, interesting, or whatever else.  More often than not I found them on digg.
&lt;/p&gt;
&lt;p&gt;
The digg app should also include an "invite friends" page which allows me to invite en masse all my friends who haven't installed the digg app.  This is perhaps the easiest way to get Facebook users actively using digg, either directly or via the digg app.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;The Flow&lt;/h4&gt;
&lt;p&gt;
With these features let's imagine the flow of the Facebook app.  I come to the Facebook.  I see that my friend has submitted a story to digg, so I click the "digg this" link without ever leaving Facebook.
&lt;/p&gt;
&lt;p&gt;
I submit a story to digg.  My mini-feed is updated stating that 'Jesse submitted "The most hilarious video EVER!!!!!!" to digg.'  My friends see this in their newsfeed and all click "digg this," because if I say it's the most hilarious video ever then they know I'm serious.
&lt;/p&gt;
&lt;p&gt;
You get the idea.  Basically Facebook provides an excellent way for digg to spread into a richer social setting.
&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;
&lt;p&gt;
Facebook has done something remarkable in modeling the social graph that exists in the real world.  Opening up this data is as remarkable as it would be if Google released their internal graph of the web, in my opinion.
&lt;/p&gt;
&lt;p&gt;
While the most popular apps right now are generally gimmicks, just look at the numbers.  The largest applications have maybe 3-4 million users, but Facebook has over 25 million registered users.  Zuckerberg has stated that only 1/3 of Facebook users have interacted with a Facebook app.  Although he was talking about his surprise at the speed at which the Facebook platform has been adopted, I see the potential for huge gains.  Beyond gimmicks, I believe the apps that will be the most successful and most valuable for Facebook will be those that effectively exploit the social graph.
&lt;/p&gt;
&lt;p&gt;
If you view digg as an enhanced version of Facebook's own link-sharing mechanism, the fit is almost perfect.  Not only would Facebook benefit from digg's technology, but digg would benefit from a more effective and viral distribution mechanism.
&lt;/p&gt;

&lt;p&gt;
Any other ideas?  Leave a comment!
&lt;/p&gt;

&lt;p&gt;
P.S., give my Facebook app, &lt;a href="http://apps.facebook.com/apps/application.php?id=2949245143"&gt;Bookshelf&lt;/a&gt;, a try.  It lets you post your personal collection of books, CDs, DVDs, and video games and share them with your friends and neighbors.
&lt;/p&gt;</description>
      <pubDate>Thu, 21 Jun 2007 22:08:49 +0000</pubDate>
      <link>http://20bits.com/article/5-ways-to-improve-the-digg-app</link>
    </item>
    <item>
      <title>More Facebook Application Gotchas</title>
      <description>&lt;p&gt;
This is a continuation of my previous article, &lt;a href="/article/5-facebook-application-gotchas"&gt;5 Facebook Application Gotchas&lt;/a&gt;.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;User invites&lt;/h5&gt;
&lt;p&gt;
Everyone loves user invites.  Well, every application developer, at least.  Invitation requests are sent by calling &lt;tt&gt;$facebook-&gt;api_client-&gt;notifications_sendRequest(...)&lt;/tt&gt;.  Facebook only allows you to send requests to ten users at a time, however.
&lt;/p&gt;
&lt;p&gt;
To implement an invite page, create a URL called, say, &lt;tt&gt;http://apps.facebook.com/myapp/process&lt;/tt&gt;.  Render a form which lets users select their friends who haven't installed the app, as follows:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;
global $facebook;
// FQL: friends of the current user who haven't added the app
$fql = "SELECT uid
            FROM user
            WHERE uid IN (SELECT uid2 FROM friend WHERE uid1 = {$facebook-&gt;user})
            AND has_added_app = 0";
// fql_query returns an array of rows, one per matching friend
$rows = $facebook-&gt;api_client-&gt;fql_query($fql);
// Render the form using the UIDs above
&lt;/pre&gt;
&lt;p&gt;
Have the form you render submit to &lt;tt&gt;/process&lt;/tt&gt;, which should behave roughly as follows
&lt;/p&gt;
&lt;pre class="brush: php"&gt;global $facebook;
// The form submits either a comma-separated 'uids' string (from the
// redirect below) or a 'users' array of checkboxes
if (isset($_REQUEST['uids'])) {
	if (empty($_REQUEST['uids']))
		$facebook-&gt;redirect('APP URL');

	$array = explode(',', $_REQUEST['uids']);
} elseif (isset($_REQUEST['users'])) {
	$array = array_keys($_REQUEST['users']);
} else {
	$facebook-&gt;redirect('APP URL');
}

// Facebook caps each request at ten recipients, so peel off a batch of ten
$uids = array();
while (count($array) &gt; 0 and count($uids) &lt; 10) {
	$uids[] = array_shift($array);
}

// $msg and $img_url (the invitation text and image) are defined elsewhere
$url = $facebook-&gt;api_client-&gt;notifications_sendRequest($uids, 'MyApp', $msg, $img_url, true);

if ($url) {
	// Send the user to Facebook's confirmation page, then back to /process
	// with the remaining UIDs so the next batch can go out
	$facebook-&gt;redirect($url . "&amp;canvas=1&amp;next=" . urlencode("process?uids=" . implode(',', $array)));
} else {
	$facebook-&gt;redirect('APP URL');
}&lt;/pre&gt;
&lt;p&gt;
&lt;strong&gt;Note&lt;/strong&gt;, the &lt;tt&gt;next&lt;/tt&gt; parameter tells the request URL where to redirect after a batch has been processed.  If it is a canvas page you should include &lt;tt&gt;canvas=1&lt;/tt&gt;.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;CSS ids&lt;/h5&gt;
&lt;p&gt;
Facebook only allows inline styles and styles within &lt;tt&gt;style&lt;/tt&gt; tags.  To prevent you from affecting the layout of the rest of Facebook it inserts a wrapper div around your content and assigns it an id of the form &lt;tt&gt;app_XXXX&lt;/tt&gt;, where XXXX is your application ID number.  It then affixes &lt;tt&gt;#app_XXXX&lt;/tt&gt; to all your CSS rules.
&lt;/p&gt;
&lt;p&gt;
This means that, for example, CSS hacks which involve things like &lt;tt&gt;html &amp;gt; *&lt;/tt&gt; won't work since they will come out as &lt;tt&gt;#app_XXXX html &amp;gt; *&lt;/tt&gt; on the other side.  For ID rules, however, it does something much more annoying: it rewrites the ID itself.  So, e.g., a rule like &lt;tt&gt;#MyDiv h1&lt;/tt&gt; becomes &lt;tt&gt;#app_XXXX_MyDiv h1&lt;/tt&gt; rather than &lt;tt&gt;#app_XXXX #MyDiv h1&lt;/tt&gt;.  There's no good reason for this, AFAIK, but it means using IDs on a page inside Facebook becomes tedious &amp;mdash; you need to know your application ID number.
&lt;/p&gt;
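&lt;p&gt;
To make the rewriting concrete, here's what happens to an ID rule (XXXX stands in for your application ID):
&lt;/p&gt;
&lt;pre class="brush: css"&gt;
/* What you write */
#MyDiv h1 { font-size: 1.5em; }

/* What Facebook actually serves: note the underscore, not a descendant selector */
#app_XXXX_MyDiv h1 { font-size: 1.5em; }
&lt;/pre&gt;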
&lt;p&gt;
To work around this I just use classes when writing Facebook pages.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Facebook is Strict&lt;/h5&gt;
&lt;p&gt;
In order to strip out bad elements and alter the CSS Facebook actually &lt;a href="http://en.wikipedia.org/wiki/Lexical_analysis" rel="nofollow"&gt;lexes&lt;/a&gt; and parses your code.  And it is strict.  Like, &lt;strong&gt;really&lt;/strong&gt; strict &amp;mdash; much more strict than any browser.
&lt;/p&gt;
&lt;p&gt;
If you're the developer you can see the error messages, and I advise you to clean them up.  With CSS, at least, bad rules just get dropped.  I'd also bet money that acceptance into the application directory is contingent on you outputting well-formed FBML.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Use an Icon&lt;/h5&gt;
&lt;p&gt;
The process for what applications get accepted and what applications get rejected from the application directory is &lt;a href="http://uchicago.facebook.com/topic.php?uid=2205007948&amp;topic=5855" rel="nofollow"&gt;totally opaque&lt;/a&gt;.  The best we have is that an application must "work," have at least five users, &lt;strong&gt;have an icon&lt;/strong&gt;, and follow the TOS.
&lt;/p&gt;
&lt;p&gt;
Looking over the application directory you'll see plenty of apps with terrible icons, so presence matters more than quality.  Just create one and upload it before you submit your application to the directory.  You can change it later if you want, but you won't get accepted without one.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;No conditional comments&lt;/h5&gt;
&lt;p&gt;
The Facebook JavaScript doesn't always play nice with IE.  That is, you can find permutations which work in Safari and Firefox but fail in IE.  Shucks.  Unfortunately Facebook doesn't allow &lt;a href="http://www.quirksmode.org/css/condcom.html" rel="nofollow"&gt;conditional comments&lt;/a&gt; which would give you the ability to let your application degrade nicely in IE.
&lt;/p&gt;
&lt;p&gt;
What's more, because of the way Facebook rewrites CSS, most of the CSS hacks don't work.  Facebook does, however, pass along the user agent when it requests data from your server, so if you must absolutely have browser-specific code you'll have to push the logic back into the PHP (or Java, if you swing that way).
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
I had a sixth gotcha but I forgot what it was.  Oops!
&lt;/p&gt;</description>
      <pubDate>Thu, 21 Jun 2007 21:55:32 +0000</pubDate>
      <link>http://20bits.com/article/more-facebook-application-gotchas</link>
    </item>
    <item>
      <title>5 Facebook Application Gotchas</title>
      <description>&lt;p&gt;
Everyone and their uncle is writing Facebook applications for the new &lt;a href="http://developers.facebook.com/"&gt;Facebook Platform&lt;/a&gt;.  I, too, have my own offering, written by myself and the other OpenHive guys: &lt;a href="http://apps.facebook.com/apps/application.php?id=2949245143"&gt;Bookshelf&lt;/a&gt;.  Even though the platform was released almost a month ago there are still plenty of tricks, gotchas, and other undocumented oddities that deserve to be brought to light.
&lt;/p&gt;

&lt;h4&gt;Gotchas, tips, and tricks&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h5&gt;The Timeout&lt;/h5&gt;
&lt;p&gt;
For those who know what I'm talking about already the answer is &lt;strong&gt;12 seconds&lt;/strong&gt;.  Everyone else read on.
&lt;/p&gt;
&lt;p&gt;
Facebook canvas pages (URLs of the form &lt;tt&gt;http://apps.facebook.com/yourapp/foo&lt;/tt&gt;) work on a proxy model.  In the application configuration you specify a callback URL so that when someone visits &lt;tt&gt;http://apps.facebook.com/yourapp/foo&lt;/tt&gt; Facebook in turn requests &lt;tt&gt;http://mydomain.com/myapp/foo&lt;/tt&gt;.  Facebook fetches the FBML from your callback URL and renders it on the canvas page.
&lt;/p&gt;
&lt;p&gt;
If your callback takes too long to respond, Facebook spits out this ugly message: &lt;blockquote&gt;
There are still a few kinks Facebook and the makers of &amp;lt;application name&amp;gt; are trying to iron out. We appreciate your patience as we try to fix these issues. Your problem has been logged - if it persists, please come back in a few days. Thanks!&lt;/blockquote&gt;
&lt;/p&gt;
&lt;p&gt;
Ignore the fact that this error message is awful (try back in a few days?!), for now.  I did some testing (i.e., a PHP file and a call to sleep) and found that the timeout is set to around &lt;strong&gt;12 seconds&lt;/strong&gt;.  Although it should never, ever take this long to render any webpage, if you're doing a lot of processing you might run afoul of this limit, so watch out.
&lt;/p&gt;
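&lt;p&gt;
If you want to reproduce the measurement yourself, a one-file probe is enough.  This is just a sketch (the filename and parameter are made up), and the 12-second figure is only what my own testing turned up &amp;mdash; Facebook could change it at any time:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// timeout_probe.php -- point your callback URL at this file and
// increase ?delay=N until Facebook's error page appears instead
// of the output below.
$delay = isset($_GET['delay']) ? (int) $_GET['delay'] : 0;
sleep($delay);
echo "Rendered after {$delay} seconds.";&lt;/pre&gt;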
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The Load&lt;/h5&gt;
&lt;p&gt;
Because canvas pages work on a proxy model your servers will have to handle the load Facebook throws at them.  For some apps, like iLike, this means growing from zero to three million users in a week.  If you plan on creating a popular app then you'll need to plan and benchmark for high-concurrency situations.
&lt;/p&gt;
&lt;p&gt;
To start you should make sure your database is well optimized.  Read my article on &lt;a href="/article/10-tips-for-optimizing-mysql-queries-that-dont-suck"&gt;MySQL optimization tips&lt;/a&gt; for some ideas of what that means &amp;mdash; most of the tips are database neutral.
&lt;/p&gt;
&lt;p&gt;
Second you should use a tool like &lt;a rel="nofollow" href="http://httpd.apache.org/docs/2.0/programs/ab.html"&gt;ab&lt;/a&gt; with the concurrency set high and try to maximize your requests served per second.  In short, if you're going to be hosting a popular Facebook application be prepared to deal with Facebook-magnitude loads.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The Session&lt;/h5&gt;
&lt;p&gt;
Before you can talk with Facebook you must initialize a session using the &lt;tt&gt;Facebook&lt;/tt&gt; class provided by the Facebook API library.  You cannot tell whether the session is valid by checking whether the session_key field in your object is null &amp;mdash; sometimes it looks completely valid but has actually expired.  The REST client will throw an exception if you try to do anything with an invalid session, so it's something to avoid.
&lt;/p&gt;
&lt;p&gt;
To get your session data call &lt;tt&gt;auth_getSession()&lt;/tt&gt;.  It returns an array that contains the timeout, so you can check directly whether the session has expired.  If the timeout is set to 0 then the session lasts forever.  You can also use try/catch to make sure your sessions are valid:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;$fbuid = $facebook-&gt;get_loggedin_user();
if ($fbuid) {
	try {
		if ($facebook-&gt;api_client-&gt;users_isAppAdded()) {
			// The user has added our app
		} else {
			// The user has not added our app
		}
	
	} catch (Exception $ex) {
		//this will clear cookies for your app and redirect them to a login prompt
		$facebook-&gt;set_user(null, null);
		$facebook-&gt;redirect($_SERVER['SCRIPT_URI']);
		exit;
	}
} else {
	// The user has never used our app
}&lt;/pre&gt;
&lt;p&gt;
The above will guarantee that you always have a valid session. (Thanks to &lt;a href="http://aditya-mukherjee.com/"&gt;Aditya&lt;/a&gt; for information about session expiration.)
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;The JS&lt;/h5&gt;
&lt;p&gt;
The Facebook Platform supports three means of dynamic, client-side content: iframes, Flash, and Javascript wrappers.  By using iframes you are essentially given free rein to do what you will.  Flixster uses Javascript in an iframe to create its UI elements, for example.
&lt;/p&gt;
&lt;p&gt;
Flash is flash and can be embedded using the &lt;tt&gt;fb:swf&lt;/tt&gt; FBML tag.  The Javascript wrappers, however, are where the gotchas pop up.  Facebook supports three pieces of Javascript functionality: showing a DOM element, hiding a DOM element, and replacing the contents of a DOM element with HTML returned from a remote URL.
&lt;/p&gt;
&lt;p&gt;
You can show, hide, or toggle an element with id &lt;tt&gt;foo&lt;/tt&gt; by giving an element &lt;tt&gt;clicktoshow&lt;/tt&gt;, &lt;tt&gt;clicktohide&lt;/tt&gt;, or &lt;tt&gt;clicktotoggle&lt;/tt&gt; attributes with the value &lt;tt&gt;foo&lt;/tt&gt;, respectively.
&lt;/p&gt;
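&lt;p&gt;
For example, this sketch (the ids are invented) toggles a hidden block each time the link is clicked; the exact markup this renders into is up to Facebook's FBML parser:
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;a href="#" clicktotoggle="details"&gt;Show or hide details&lt;/a&gt;
&lt;div id="details" style="display: none;"&gt;Hidden until clicked.&lt;/div&gt;&lt;/pre&gt;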
&lt;p&gt;
To swap out the content of an element with remote content use &lt;tt&gt;clickrewriteurl&lt;/tt&gt; and &lt;tt&gt;clickrewriteform&lt;/tt&gt;.  The first attribute contains the URL and the second the id of a form element whose values are passed to the URL.  You can combine &lt;tt&gt;clicktoshow&lt;/tt&gt;, &lt;tt&gt;clicktohide&lt;/tt&gt;, and &lt;tt&gt;clicktotoggle&lt;/tt&gt; in a single element but cannot combine these with &lt;tt&gt;clickrewriteurl&lt;/tt&gt;.
&lt;/p&gt;
&lt;p&gt;
To get around this you can mark it up as follows:
&lt;/p&gt;
&lt;pre class="brush: html"&gt;
&lt;div clickrewriteurl="your_url"&gt;
	&lt;a href="#" clicktoshow="id_to_show"&gt;Click me!&lt;/a&gt;
&lt;/div&gt;
&lt;/pre&gt;
&lt;p&gt;
This is useful, for example, to show a progress indicator or "Saving..." text while you process something asynchronously.  Make sure to test this in all major browsers since I've seen it fail in IE under certain circumstances.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h5&gt;Using Lighttpd&lt;/h5&gt;
&lt;p&gt;
&lt;a href="http://www.lighttpd.net/"&gt;lighttpd&lt;/a&gt; is an increasingly popular webserver.  It is much lighter than Apache at the expense of Apache's modularity and extensibility.  A common scenario would be to use it for serving static content.
&lt;/p&gt;
&lt;p&gt;
However, many people are using it in place of Apache as a full, dedicated webserver.  The problem arises when you try to submit large amounts of data via &lt;tt&gt;POST&lt;/tt&gt; to a Facebook canvas page.  If the data is large enough Facebook will send your app an &lt;tt&gt;Expect: 100-continue&lt;/tt&gt; header, which lighttpd doesn't understand.  This results in lighttpd throwing an HTTP 417 error (pretty obscure, eh?), which Facebook spits right back in the user's face.
&lt;/p&gt;
&lt;p&gt;
To get around this you need to either use something besides lighttpd that supports the &lt;tt&gt;Expect: 100-continue&lt;/tt&gt; header (e.g., Apache) or submit the data directly to your server and then redirect back to the Facebook after the data is processed.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
The Facebook Platform is still young and changes weekly.  Keeping abreast of the changes can be daunting, so let me know if this helped at all.
&lt;/p&gt;</description>
      <pubDate>Tue, 19 Jun 2007 23:40:57 +0000</pubDate>
      <link>http://20bits.com/article/5-facebook-application-gotchas</link>
    </item>
    <item>
      <title>An Introduction to FBML</title>
      <description>&lt;p&gt;
On May 24&lt;sup&gt;th&lt;/sup&gt;, 2007 Facebook released the &lt;a href="http://developers.facebook.com/"&gt;Facebook platform&lt;/a&gt;.  This is the complement to their previous API, based around the &lt;a href="http://developers.facebook.com/documentation.php?v=1.0&amp;doc=fql"&gt;Facebook Query Language (FQL)&lt;/a&gt;.  Where FQL allows you to create applications from Facebook data, the Facebook platform, via the &lt;a href="http://developers.facebook.com/documentation.php?doc=fbml"&gt;Facebook Markup Language (FBML)&lt;/a&gt;, allows you to embed your application in the Facebook.  Finally, Facebook has entered the world of widgets.
&lt;/p&gt;

&lt;p&gt;
Lucky for us Facebook actually has a "widget strategy."  MySpace's "widget strategy" isn't really a strategy at all; rather, it's a consequence of the fact that they basically allow people to enter anything they want into their profiles.  A "MySpace widget" works just as well on MySpace as it does anywhere else.  The Facebook platform, however, gives you access to the jewel of the Facebook universe: the social graph.
&lt;/p&gt;

&lt;p&gt;
On the one hand this means that a Facebook widget really only works on the Facebook (at least until some other website supports FBML).  On the other hand this means that you can create much richer applications by exploiting information about your users' relationships.  Since the mini-feed informs your friends whenever you install an application you also get an excellent viral way to spread your application and your brand.  But to do that you need to understand FBML.
&lt;/p&gt;

&lt;h4&gt;Your Data as FBML&lt;/h4&gt;
&lt;p&gt;
One of the most important concepts associated with Web 2.0 is the independence of data and presentation.  You see this in things like XML/XSLT and HTML/CSS.  Let's say you have a database-backed web application.  Most of the time you're going to be surfacing this data as HTML.  You have other options, of course.  Maybe your reader wants his data in an RSS feed.  The underlying data is the same but the format in which it is presented is different.
&lt;/p&gt;

&lt;p&gt;
For those still in a SAT mindset we get the following analogy: HTML is to a browser as RSS is to a feed reader, and RSS is to a feed reader as FBML is to Facebook.  Graphically the relationship is this:
&lt;/p&gt;
&lt;pre&gt;

                            HTML
User &amp;lt;------&amp;gt;   Browser   &amp;lt;------&amp;gt; Server

                            RSS
User &amp;lt;------&amp;gt; Feed reader &amp;lt;------&amp;gt; Server

                            FBML
User &amp;lt;------&amp;gt;  Facebook   &amp;lt;------&amp;gt; Server
&lt;/pre&gt;


&lt;h4&gt;The Nuts and Bolts of FBML&lt;/h4&gt;
&lt;p&gt;
FBML isn't quite HTML and isn't quite proprietary.  The closest analog I can think of is &lt;a href="http://en.wikipedia.org/wiki/ColdFusion#Code_example"&gt;ColdFusion&lt;/a&gt;, ironically the language in which MySpace is written.  FBML consists of a subset of HTML (no &lt;tt&gt;script&lt;/tt&gt; tags, for example) and a set of proprietary extensions.
&lt;/p&gt;

&lt;p&gt;
These extensions act like HTML tags and can be divided into two broad classes: markup tags and procedural tags.  Markup tags include UI elements and are generally directly translated into HTML.  The &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:header"&gt;fb:header&lt;/a&gt;&lt;/tt&gt; tag, for example, produces the HTML for a Facebook-style header.  
&lt;/p&gt;
&lt;p&gt;Other tags like &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:if-can-see"&gt;fb:if-can-see&lt;/a&gt;&lt;/tt&gt; have a programmatic component.  In this case the content between the tags is rendered only if the current user has permission to do whatever is specified in the tag's attributes.  For example:
&lt;/p&gt;
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;fb:if-can-see uid="12345" what="profile"&gt;
	You're allowed to see 12345's profile, chum!
	
	&lt;fb:else&gt;
		No profile for you!
	&lt;/fb:else&gt;
&lt;/fb:if-can-see&gt;&lt;/pre&gt;

&lt;p&gt;
This would display "You're allowed to see 12345's profile, chum!" if the current user could see user 12345's profile and display "No profile for you!" otherwise.
&lt;/p&gt;

&lt;p&gt;
Some tags are more complicated, like &lt;tt&gt;&lt;a href="http://wiki.f8.facebook.com/index.php/Fb:switch"&gt;fb:switch&lt;/a&gt;&lt;/tt&gt;.  &lt;tt&gt;fb:switch&lt;/tt&gt; evaluates each of the &lt;tt&gt;fb:&lt;/tt&gt; tags inside and returns the first one which does not evaluate to an empty string, e.g.,
&lt;/p&gt;
&lt;pre class="brush: html"&gt;&lt;fb:switch&gt;
	&lt;fb:photo pid="12345" /&gt;
	&lt;fb:profile-pic uid="54321" /&gt;
	&lt;fb:default&gt;You can't see either the photo or the profile pic&lt;/fb:default&gt;  
&lt;/fb:switch&gt;&lt;/pre&gt;

&lt;p&gt;
This would display the photo with &lt;tt&gt;pid&lt;/tt&gt; 12345 if it could, otherwise it would try to display the profile picture of user 54321.  If neither of these can be displayed (e.g., the privacy settings are such that you're not allowed to see them) then it will display the content in &lt;tt&gt;fb:default&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
If you want to play around with FBML without installing and configuring your own application you can use Facebook's &lt;a href="http://developers.facebook.com/tools.php?fbml"&gt;FBML test console&lt;/a&gt;.
&lt;/p&gt;

&lt;h4&gt;Integrating With Facebook&lt;/h4&gt;
&lt;p&gt;
FBML itself isn't so complicated, but integrating your existing application with the Facebook Platform can be a pain, especially since the whole process isn't very well documented.  The first thing you need to do is install the &lt;a href="http://www.facebook.com/developers"&gt;Developer Application&lt;/a&gt;, which allows you to manage the applications you create.
&lt;/p&gt;

&lt;p&gt;
Each application has a unique &lt;em&gt;API key&lt;/em&gt; which doesn't ever change.  When you create an application you also get a &lt;em&gt;secret&lt;/em&gt; which you should never share &amp;mdash; it's the only way the Facebook knows that an application is the application it claims to be.
&lt;/p&gt;

&lt;p&gt;
So, to create a new application go to the Developer application, click on &lt;a href="http://www.facebook.com/developers/apps.php"&gt;My Applications&lt;/a&gt; and then &lt;a href="http://www.facebook.com/developers/editapp.php?new"&gt;Apply for another key&lt;/a&gt;.  Here you enter the name of your application.  After agreeing to the Terms of Service click submit and you'll be redirected back to the My Application page.  Once there click on "Edit settings" for your new application.
&lt;/p&gt;

&lt;p&gt;
I'll wait until you get to the "Edit Settings" page.  The key part here is understanding the &lt;strong&gt;Callback (URL)&lt;/strong&gt; field.  If you enter &lt;tt&gt;http://mydomain.com/myapp/&lt;/tt&gt; as the callback URL then all requests directed to &lt;tt&gt;http://apps.facebook.com/myapp&lt;/tt&gt; will go to &lt;tt&gt;http://mydomain.com/myapp&lt;/tt&gt;.  The callback URL serves as the base URL from which all requests are made.  If you ask Facebook for &lt;tt&gt;foo.php&lt;/tt&gt; it will try to fetch the FBML from &lt;tt&gt;http://mydomain.com/myapp/foo.php&lt;/tt&gt;, interpret it, and display the results.
&lt;/p&gt;
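&lt;p&gt;
The mapping is purely mechanical, so it's worth seeing spelled out.  This toy function (the names and URLs are placeholders, not part of any Facebook library) mimics what Facebook does with the path:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// Facebook takes everything after your app's name on the canvas URL
// and appends it to the callback URL you configured.
function callback_for($canvas_path, $callback_base) {
    return rtrim($callback_base, '/') . '/' . ltrim($canvas_path, '/');
}

echo callback_for('foo.php', 'http://mydomain.com/myapp/');
// http://mydomain.com/myapp/foo.php&lt;/pre&gt;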

&lt;h4&gt;The Library&lt;/h4&gt;
&lt;p&gt;
One could write an application consisting solely of static FBML pages, but it would be pretty boring.  To aid integration Facebook provides both Java and PHP &lt;a href="http://developers.facebook.com/resources.php"&gt;client libraries&lt;/a&gt;.  We'll focus on the PHP5 library.
&lt;/p&gt;

&lt;p&gt;
The client library includes an example application called "Footprints" which is very instructive.  The library provides a &lt;tt&gt;Facebook&lt;/tt&gt; object, initialized with your API key and secret, which helps control the flow of the application.
&lt;/p&gt;

&lt;pre class="brush: php"&gt;$api_key = 'YOUR API KEY';
$secret = 'YOUR SECRET';
$facebook = new Facebook($api_key, $secret);&lt;/pre&gt;

&lt;p&gt;
Facebook allows &lt;a href="http://developers.facebook.com/anatomy.php"&gt;several points of integration&lt;/a&gt; and the &lt;tt&gt;$facebook&lt;/tt&gt; object is the glue which allows you to push data to each of those integration points.
&lt;/p&gt;

&lt;p&gt;
An important fact to note is that the Facebook platform contains both push and pull APIs.  All user-specific data follows a push model.  That is, if you want to publish data on a user's profile, send a message, make a request, publish an item on a user's mini-feed, etc., you must push the request.  All other data is fetched from your server by the Facebook when users access URLs like &lt;tt&gt;http://apps.facebook.com/myapp/do_something.php&lt;/tt&gt;.
&lt;/p&gt;

&lt;p&gt;
Here is the procedure by which users install an application, giving that application permission to push data to their profile, mini-feed, etc.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;User visits &lt;tt&gt;http://apps.facebook.com/myapp/&lt;/tt&gt; and Facebook requests &lt;tt&gt;http://mydomain.com/myapp/&lt;/tt&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The application requests the user install the application by invoking &lt;tt&gt;$facebook-&gt;require_login()&lt;/tt&gt; if the application plans to push user-specific data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The user/application go through the &lt;a href="http://developers.facebook.com/documentation.php?doc=auth"&gt;authentication process&lt;/a&gt;.  After the end of the authentication process (presuming the user follows through) the application is given the user's Facebook uid and a session key via a &lt;tt&gt;POST&lt;/tt&gt; request.  These are required to push user-specific data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The application can now push data to a user's profile or mini-feed, make application-related requests on their behalf, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5&gt;The Nuts and Bolts of the Facebook Object&lt;/h5&gt;
&lt;p&gt;
The Facebook object contains all the methods you'll need to interact with the Facebook platform.  After a user has authenticated you'll probably be interested in the following:
&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;$facebook-&gt;redirect($url)&lt;/dt&gt;
&lt;dd&gt;Redirects to the given URL.  This is required because the headers have already been sent by the time the Facebook requests data from your application.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;require_login() and $facebook-&gt;require_add()&lt;/dt&gt;
&lt;dd&gt;Requires the user to login to your application or install it, respectively.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;get_login_url() and $facebook-&gt;get_add_url()&lt;/dt&gt;
&lt;dd&gt;Returns the URL for your application's login or install page, respectively.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;feed_publishStoryToUser($title, $body, ...)&lt;/dt&gt;
&lt;dd&gt;Publishes a feed item for the currently authenticated user.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;friends_get()&lt;/dt&gt;
&lt;dd&gt;Returns the friends of the currently authenticated user.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;friends_getAppUsers()&lt;/dt&gt;
&lt;dd&gt;Returns the friends of the currently authenticated user who also have the application installed.&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;groups_get($uid=null,$gids=null)&lt;/dt&gt;
&lt;dd&gt;Returns the specified groups (all by default) for the specified user (the current user by default).&lt;/dd&gt;
&lt;dt&gt;$facebook-&gt;api_client-&gt;profile_setFBML($markup, $uid=null)&lt;/dt&gt;
&lt;dd&gt;Sets the profile box FBML for the specified users (defaults to the current user).&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;
This list is by no means comprehensive, but these are the highlights.  There are also functions which deal with photos, notifications, and events.  There's no real documentation for these functions outside of the library source, although there is a one-to-one correspondence between the methods in the api_client and the methods listed in the &lt;a href="http://developers.facebook.com/documentation.php?v=1.0&amp;method=auth.createToken"&gt;sidebar of the developer documentation&lt;/a&gt;.  This is definitely the &lt;em&gt;least&lt;/em&gt; documented part of the Facebook platform.
&lt;/p&gt;

&lt;h4&gt;AJAX and other miscellany&lt;/h4&gt;
&lt;p&gt;
No Web 2.0 application would be complete without AJAX.  Of course the whole point of the Facebook platform is to give developers access to the Facebook without compromising security, so unadorned Javascript and &lt;tt&gt;script&lt;/tt&gt; tags are out of the question.
&lt;/p&gt;

&lt;p&gt;
To solve this Facebook provides a very basic &lt;a href="http://developers.facebook.com/step_by_step.php#redirect"&gt;mock AJAX&lt;/a&gt; system.  You create a dummy form which contains the various values you're interested in and point it at an element which activates the AJAX request.  It's a little hackish but the alternative (no Javascript at all) is probably worse.  The examples in the above documentation are as clear as any explanation I could write, so just read those.
&lt;/p&gt;

&lt;p&gt;
In addition Facebook supports Flash and iframes on canvas pages.  This means you could, in theory, embed your page directly into the Facebook.
&lt;/p&gt;

&lt;h4&gt;Resources&lt;/h4&gt;
&lt;p&gt;
From the above you should understand the basics of how Facebook interacts with an application.  The Facebook expects your application to output FBML which it then transforms into a page for your user.  In addition you can use the &lt;tt&gt;Facebook&lt;/tt&gt; object to get information about the current user, such as their friends, groups, photos, and notifications.
&lt;/p&gt;

&lt;p&gt;
But the above only touches the important parts.  A lot of the platform remains undocumented and the best way to learn more is to just dive in.  Here are some helpful resources.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/documentation.php"&gt;Developers Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/anatomy.php"&gt;Anatomy of a Facebook Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/step_by_step.php"&gt;Step-by-step Guide to Creating an Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/faq.php"&gt;Developer FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wiki.developers.facebook.com/index.php/Main_Page"&gt;Platform Wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developers.facebook.com/clientlibs/facebook-platform.tar.gz"&gt;PHP5 Client Library&lt;/a&gt;, including a sample Facebook application.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Speculation&lt;/h4&gt;
&lt;p&gt;
There are some totally undocumented aspects of FBML.  One that sticks out, using my ColdFusion analogy above, is the &lt;tt&gt;fb:query&lt;/tt&gt; tag.  You can see the stub on the &lt;a href="http://wiki.f8.facebook.com/index.php/FBML"&gt;FBML documentation&lt;/a&gt; at the wiki.  
&lt;/p&gt;
&lt;p&gt;
One oddity with the current platform is the way it integrates FBML and FQL.  You can issue FQL queries directly via the &lt;tt&gt;Facebook&lt;/tt&gt; object.  This effectively doubles the latency of your application since the Facebook first issues a request to your application which then in turn might issue several FQL queries back to the Facebook before returning the finalized FBML.  My suspicion is that FBML either at one point contained or will contain the ability to execute FQL directly on the Facebook and iterate through the resultset.&lt;/p&gt;

&lt;p&gt;
Cheers, and happy coding!
&lt;/p&gt;</description>
      <pubDate>Mon, 04 Jun 2007 05:50:26 +0000</pubDate>
      <link>http://20bits.com/article/an-introduction-to-fbml</link>
    </item>
    <item>
      <title>Designing Content-focused Websites</title>
      <description>&lt;p&gt;
Every website has two fundamental components: data and one or more users/readers who consume that data.  This data can be produced in many ways &amp;mdash; an author or editorial staff, other users of the website, a database, etc.  I'm not interested in the question of what data a user is interested in consuming.  That is, I'm not interested in giving editorial advice for someone looking to create a popular blog.
&lt;/p&gt;

&lt;p&gt;
Rather, given that a user is at a website which has data they want to consume, I'm interested in the question of how best to deliver that data.  This question intersects the realms of technology, usability, and design.
&lt;/p&gt;

&lt;p&gt;
In thinking about this question I've come up with three categories into which most any website fits.  By analyzing these categories I believe one can arrive at some solid, general advice for how to structure websites.  Some might accuse me of being "too academic," but I think there's something to be learned about designing websites by understanding these categories and your website's relation to them.
&lt;/p&gt;

&lt;h4&gt;Contents&lt;/h4&gt;
&lt;p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;a href="#categories"&gt;The Three Categories&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#content-focused"&gt;Content-focused Websites&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#surface"&gt;Surfacing Content&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#estimate"&gt;Choosing and Estimate&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#conclusions"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;a name="categories"&gt;&lt;/a&gt;
&lt;h4&gt;The Three Categories&lt;/h4&gt;

&lt;dl&gt;
	&lt;dt&gt;Application-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Application-focused websites are those which enable the user to complete some specific task.  The primary question to ask of one of these websites is "How well does it work?"  They have little user-user interaction and often no author per se.
	&lt;/p&gt;
	&lt;p&gt;
	Most of Google's websites fall into this category, for example.  Google's base business is centered around aggregating and organizing information.  A more pedestrian example would be a website which helps you complete your annual tax returns or find tickets for nearby concerts.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;Content-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	Content-focused websites are those which provide regularly updated topical content.  The primary question to ask of one of these websites is "What information does it provide?"  There will always be at least one author and there might be an extensive degree of user-user interaction, but this interaction is always subordinate to the content.
	&lt;/p&gt;
	&lt;p&gt;
	Blogs and other news-oriented websites, including online magazines and newspapers, fall into this category.  Wikipedia is also an example, albeit one where the line between "readership" and "authorship" is blurred.  This is why the categories are defined in terms of data/user interaction rather than author/user interaction.  However, Wikipedia would be no less a website if it had the same &lt;em&gt;content&lt;/em&gt; it currently has but were only authored by, say, a certified editorial staff.  In other words, it is the content that matters, not the means by which the content is generated.
	&lt;/p&gt;
	&lt;p&gt;
	Another example is Livejournal, which allows user-user interaction in comments, groups, and via its "friends" feature (which is really a subscription feature in disguise).  User-user interaction is not the primary focus of LJ, however, and it is generally only used as a way to surface interesting content.
	&lt;/p&gt;
	&lt;/dd&gt;
	&lt;dt&gt;User-focused&lt;/dt&gt;
	&lt;dd&gt;
	&lt;p&gt;
	User-focused websites are those which are based upon user-user interaction.  The primary question to ask of one of these websites is "Who is using this website?"  There might be topical content or searchable data, but this is incidental to the relationships between users.
	&lt;/p&gt;
	&lt;p&gt;
	Most social networks, like Facebook, Yahoo 360°, Friendster, and MySpace, fall into this category.  Nobody would use Facebook for photo sharing or storing contact information were it not for the fact that all their friends are using it, too.  MySpace was originally a content-focused website, centering around bands and their music, but has since evolved (some might say degenerated) into a user-focused website where most people just use it as a platform to promote their own personality to other users of MySpace.
	&lt;/p&gt;
	&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;
I don't intend for these categories to be absolute, but rather just a useful tool for reasoning about websites and website design.  If you can think of any websites which do not fall into &lt;em&gt;any&lt;/em&gt; of the above categories I'd love to hear about them&lt;span class="footnote"&gt;Some websites themselves are the content, e.g., an art student's website in which the piece of art is the website.  As far as I'm concerned these are one-off affairs with no unifying logic outside of the usual artistic conventions.&lt;/span&gt;.
&lt;/p&gt;

&lt;a name="content-focused"&gt;&lt;/a&gt;
&lt;h4&gt;Content-focused Websites&lt;/h4&gt;

&lt;p&gt;
So, you have a blog and you're writing interesting stuff that has an audience.  This in itself is no small feat, but arguably the harder part is knowing how to present that information so that any given reader gets content they want to read, even if they didn't know they wanted to read it before coming to your site.  This applies to any content-focused website.  How do I give the reader the most relevant and interesting content with the least amount of effort on their part?
&lt;/p&gt;

&lt;p&gt;
Many content-focused websites don't even have real registration, e.g., WordPress blogs where registering doesn't actually confer any additional benefits.  How are the authors of the content supposed to serve up interesting content if they don't know anything about an individual reader's preferences?  And that's the key to designing a good content-focused website &amp;mdash; can you come up with a way to estimate your readers' preferences?  If yes then you just serve up content according to that estimate.
&lt;/p&gt;

&lt;a name="surface"&gt;&lt;/a&gt;
&lt;h4&gt;Surfacing Content&lt;/h4&gt;

&lt;p&gt;
Let's assume that you're an author of a content-focused website (a blog, say) and write quality content which has an audience.  For your website the data consists in a collection of posts and your job, given that people are actually interested in what you have to write, is to surface the content which is most interesting to a given reader.  There are three ways which you can do that.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;h5&gt;Global Preference Estimation&lt;/h5&gt;
&lt;p&gt;
Global preference estimation is the idea that if you know nothing about a specific reader your best estimate is the average case.  If your article about Widgets has been read more than any other article then it's not a bad bet that the average reader would also find it worth reading, for example.  Here are some ways to estimate global preferences, with explanations where necessary.
&lt;ul&gt;
&lt;li&gt;Recency&lt;/li&gt;
&lt;li&gt;Pageviews&lt;/li&gt;
&lt;li&gt;Number or recency of comments&lt;/li&gt;
&lt;li&gt;Number of inbound links&lt;/li&gt;
&lt;li&gt;In general if your site has a feature which requires readers to take a definite action on a post, e.g., commenting, viewing, emailing, etc., then you can measure preferences by the numbers of times a post has been acted upon.&lt;/li&gt;
&lt;li&gt;Featured articles &amp;mdash; if you have a good understanding of your audience explicitly surface content you believe they'd find interesting.&lt;/li&gt;
&lt;li&gt;Average post rating, if your website supports ratings.&lt;/li&gt;
&lt;li&gt;A promotion model ("Promoted Articles") based on explicit votes (X votes marks a story as promoted) or votes over time (a la digg).&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
The pro of global preference estimation is that it is relatively easy to implement and does not suffer from sparsity problems.  That is, a specific reader does not need to register all their preferences for it to return good results.  Instead preferences are collected in aggregate so that one reader's habits are as good as any other's from the perspective of a global estimate.  The con is that this estimate only deals with averages.  At best this will let you please most of the people some of the time.
&lt;/p&gt;
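&lt;p&gt;
To make this concrete, here is one (entirely made-up) global estimate: rank posts by pageviews, discounted by age so that old hits don't dominate forever.  The half-life is an arbitrary knob:
&lt;/p&gt;
&lt;pre class="brush: php"&gt;&lt;?php
// Views decay by half every $half_life days.
function global_score($views, $age_in_days, $half_life = 30) {
    return $views * pow(0.5, $age_in_days / $half_life);
}

// A month-old post needs twice the views of a brand-new one to tie:
echo global_score(200, 30);  // 100
echo global_score(100, 0);   // 100&lt;/pre&gt;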

&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Local Preference Estimation&lt;/h5&gt;
&lt;p&gt;
Local preference estimation is based on implicit and explicit information you have about a specific reader, such as their reading, browsing, and commenting patterns.  If you can collect enough data you can surface content that often the user doesn't even realize they were looking for.
&lt;/p&gt;

&lt;p&gt;
The easiest way to get a local preference estimation is to use the most obvious fact about a reader &amp;mdash; you know when they are reading something.  It's a fairly safe assumption that the reader is interested in whatever they are reading, so it stands to reason they would also be interested in related content.  Coming up with a way to surface related content is therefore one of the first things a content-focused website should implement, in my opinion.
&lt;/p&gt;

&lt;p&gt;
For sites on which the readers are creating the content another way to measure interest is to allow readers to befriend each other.  Since this friendship is essentially arbitrary they will take "friendship" to mean whatever you tell them it means.  If you use friendship status as a means to surface interesting content then they will befriend people creating interesting content. That suggests presenting the user with the following:
&lt;ul&gt;
&lt;li&gt;Content created by my friends&lt;/li&gt;
&lt;li&gt;Content commented on by my friends&lt;/li&gt;
&lt;li&gt;Content voted on by my friends&lt;/li&gt;
&lt;li&gt;Content read by my friends&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;p&gt;
If you want to get very fancy (and very technical) you can create a content recommendation system.  Reader A likes stories 1, 3, and 5.  Reader B likes stories 3, 5, and 7.  It's probable that Reader A would like story 7 and Reader B would like story 1.  Techniques for content recommendation dive straight into the fields of &lt;a href="http://en.wikipedia.org/wiki/Information_retrieval"&gt;information retrieval&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Data_mining"&gt;data mining&lt;/a&gt;.  Given other local preference estimates you can come up with what it means for a reader to "like" some piece of content.  You register their preference and then use standard IR and data-mining techniques&lt;span class="footnote"&gt;For example, &lt;a href="http://en.wikipedia.org/wiki/Slope_One"&gt;slope one&lt;/a&gt; recommenders or clustering recommenders based on similarity metrics, such as cosine similarity.&lt;/span&gt; to extract patterns about their tastes.  This really only works well if you have a lot of diffuse content and a large, active readership.
&lt;/p&gt;
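&lt;p&gt;
To make the Reader A/Reader B example concrete, here's a minimal sketch of such a recommender (my own toy code, not a production system): readers are represented as sets of liked story ids, similarity between readers is cosine similarity, and unseen stories are scored by how similar their fans are to the target reader.
&lt;/p&gt;

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sets of liked story ids."""
    common = len(u & v)
    if common == 0:
        return 0.0
    return common / (sqrt(len(u)) * sqrt(len(v)))

def recommend(target, others):
    """Score stories liked by similar readers but not yet seen by the target."""
    scores = {}
    for likes in others:
        sim = cosine(target, likes)
        for story in likes - target:
            scores[story] = scores.get(story, 0.0) + sim
    # Highest-scoring unseen stories first
    return sorted(scores, key=scores.get, reverse=True)

reader_a = {1, 3, 5}
reader_b = {3, 5, 7}
print(recommend(reader_a, [reader_b]))  # [7]: story 7 is surfaced for Reader A
```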

&lt;p&gt;
The upside of local preference estimation is that it can give fairly accurate results.  Google, for example, bases much of their business around contextual information.  If you have a Google account they know your searching habits and what Google ads you're seeing around the web.  From this, in turn, they can recommend to you all sorts of things.  The con is that to get accurate results you need a lot of data.  Google and Yahoo! can pull it off because they have terabytes upon terabytes of data.  The average blog, however, will have a harder time. 
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Explicit Preferences&lt;/h5&gt;
&lt;p&gt;
Explicit preferences are just that: preferences which the user has made known or wants to make known.  To accommodate these preferences it is best for the website to simply get out of the reader's way.  Here, search is king.
&lt;/p&gt;

&lt;p&gt;
Let's say the user remembers an old post you wrote on your blog about Widgets, but can't remember the exact title or some of the secondary content.  The first thing they will probably want to do is search for "Widget."  Search isn't easy (otherwise Google wouldn't be a multi-billion-dollar company), so it's not uncommon to leave search up to a third-party application.  For this blog I trust Google to index it and my readers to use Google to search it &amp;mdash; I know Google will do a better job than any native WordPress search functionality would.
&lt;/p&gt;

&lt;p&gt;
Aside from search, a common feature in the Web 2.0 world is the tag cloud.  If you tag your content with semantically meaningful tags then the tag cloud provides a sort of topographical map of your content.  Presumably you tagged that post about Widgets with "widget," so a user looking for some post on Widgets will be able to find it by looking through all Widget-related content.
&lt;/p&gt;
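&lt;p&gt;
Mechanically a tag cloud is little more than a mapping from tag frequencies to font sizes.  Here's an illustrative sketch (the pixel bounds are arbitrary choices, not any particular blog engine's defaults):
&lt;/p&gt;

```python
def tag_cloud(tag_counts, min_px=12, max_px=32):
    """Map each tag's frequency to a font size, scaled linearly
    between min_px and max_px."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    spread = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {tag: min_px + (count - lo) * (max_px - min_px) // spread
            for tag, count in tag_counts.items()}

sizes = tag_cloud({"widget": 40, "python": 10, "meta": 10})
print(sizes)  # "widget" renders largest at 32px; the others at 12px
```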
&lt;/li&gt;
&lt;/ol&gt;

&lt;a name="estimate"&gt;&lt;/a&gt;
&lt;h4&gt;Choosing an Estimate&lt;/h4&gt;
&lt;p&gt;
For content-focused sites with worthwhile content the most important job is to surface the most interesting content.  What constitutes "interesting" varies from site-to-site and audience-to-audience but abstractly speaking the process is the same. That is, you need to come up with some way to measure how interesting a given piece of content is and display views of your content ranked according to that measure.
&lt;/p&gt;

&lt;p&gt;
For example, recency is going to be an important component of what is interesting on a news-focused site, but is hardly sufficient.  The news that Grandma Smith died just isn't as interesting as the news that a Presidential candidate was caught doing drugs, for example.  Traditional news outlets use editorial discretion to surface the interesting news.  Good editorial staffs lead to successful newspapers.
&lt;/p&gt;

&lt;p&gt;
The internet, however, affords more direct access to your audience's tastes.  Sites like digg exploit this by allowing users to vote directly on articles.  The measure of how "interesting" content is then a function of both the recency of the content and the number of votes.  The only essential difference between a site like digg and a traditional news blog is the way in which they measure how interesting a given piece of content is.
&lt;/p&gt;
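&lt;p&gt;
As an illustration, a digg-style measure might combine votes and recency like this.  The decay exponent (call it "gravity") is an invented constant you would tune for your audience, not a number from any real site:
&lt;/p&gt;

```python
def interest(votes, age_hours, gravity=1.8):
    """A vote-and-recency measure: a story's score is its vote count
    damped by its age, so fresh stories can outrank older ones."""
    return votes / (age_hours + 2) ** gravity

fresh = interest(votes=10, age_hours=1)
stale = interest(votes=50, age_hours=24)
print(fresh > stale)  # True: recency lets a lightly-voted story win
```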

&lt;p&gt;
What measures work depends heavily on both the content and audience, however.  A new measure might make for a novel kind of content-focused website but it is no guarantee that that website will be successful, even if the content has an audience.  The mechanics of the metric might not sit well with your audience.  For example they might not understand a digg-like voting mechanism, making any metric based on "votes" totally ineffective.
&lt;/p&gt;

&lt;p&gt;
So the problem for a would-be website author is two-fold: create quality content that has an audience and determine a preference estimate which surfaces the content most interesting to both the audience as a whole and a specific reader.  There are many proven measures listed above which work well, although the truly breakaway successes are usually those that either have some novel means of content creation, preference estimation, or both.
&lt;/p&gt;

&lt;a name="conclusions"&gt;&lt;/a&gt;
&lt;h4&gt;Conclusions&lt;/h4&gt;
&lt;p&gt;
Most every website falls into one of three categories, each of which is defined in terms of data-user interaction.  Content-focused websites are those which regularly generate topical content, such as online newspapers, blogs, digg, or Wikipedia.  The most pertinent question for these websites is "What information does it provide?"
&lt;/p&gt;

&lt;p&gt;
For a reader to answer this question the author of a content-focused website needs to provide a window into their content.  Presuming the author actually wants the reader to stay around and consume more content, these windows need to do more than just show random content: they need to show &lt;em&gt;interesting&lt;/em&gt; content.  It is therefore important for the author to find a way to estimate the preferences of his readers.
&lt;/p&gt;

&lt;p&gt;
This can be accomplished at either the global, aggregate level or the local, contextual level.  A global estimate surfaces content which is interesting to the average reader while a local estimate surfaces content interesting to a specific reader, given what you know about them.  In addition readers sometimes make their preferences known explicitly in which case there should also be a path for readers who are looking for specific content, e.g., a proper search function.  Assuming you are actually writing worthwhile content then a good estimate goes a long way towards converting users to your site.
&lt;/p&gt;

&lt;p&gt;
Above all it is important to think clearly about getting your readers what they want as easily as possible.  I often find it useful to imagine I know nothing about where content resides in my site and go from there.  Is what I see interesting enough for me to keep looking?  If so, how long before it becomes uninteresting?  If not, how long before it does?  Could I find what I wanted if I really had to?
&lt;/p&gt;

&lt;p&gt;
Finally, I'd love to get any and all feedback on this article.  I've been tossing these ideas around in my head for a few months now and thought now was a good time to write them down for the first time.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Thu, 17 May 2007 01:00:57 +0000</pubDate>
      <link>http://20bits.com/article/designing-content-focused-websites</link>
    </item>
    <item>
      <title>Introduction to Dynamic Programming</title>
      <description>&lt;p&gt;
Dynamic programming is a method for efficiently solving a broad range of search and optimization problems which exhibit the characteristics of &lt;a href="http://en.wikipedia.org/wiki/Overlapping_subproblem"&gt;overlapping subproblems&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Optimal_substructure"&gt;optimal substructure&lt;/a&gt;.  I'll try to illustrate these characteristics through some simple examples and end with an exercise.  Happy coding!
&lt;/p&gt;

&lt;!--more--&gt;

&lt;h4&gt;Contents&lt;/h4&gt;
&lt;p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;a href="#subproblems"&gt;Overlapping Subproblems&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#optimal"&gt;Optimal Substructure&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#knapsack"&gt;The Knapsack Problem&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href="#everyday"&gt;Everyday Dynamic Programming&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;a name="subproblems"&gt;&lt;/a&gt;
&lt;h4&gt;Overlapping Subproblems&lt;/h4&gt;
&lt;p&gt;
A problem is said to have overlapping subproblems if it can be broken down into subproblems which are reused multiple times.  This is closely related to recursion.  To see the difference consider the &lt;a href="http://mathworld.wolfram.com/Factorial.html"&gt;factorial&lt;/a&gt; function, defined as follows (in Python):
&lt;pre class="brush: python"&gt;def factorial(n):
	if n == 0: return 1
	return n*factorial(n-1)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Thus the problem of calculating &lt;tt&gt;factorial(n)&lt;/tt&gt; depends on calculating the subproblem &lt;tt&gt;factorial(n-1)&lt;/tt&gt;.  This problem does &lt;strong&gt;not&lt;/strong&gt; exhibit &lt;em&gt;overlapping&lt;/em&gt; subproblems since &lt;tt&gt;factorial&lt;/tt&gt; is called exactly once for each positive integer less than n.
&lt;/p&gt;

&lt;h5&gt;Fibonacci Numbers&lt;/h5&gt;
&lt;p&gt;
The problem of calculating the n&lt;sup&gt;th&lt;/sup&gt; &lt;a href="http://en.wikipedia.org/wiki/Fibonacci_number"&gt;Fibonacci number&lt;/a&gt; does, however, exhibit overlapping subproblems.  The naïve recursive implementation would be
&lt;pre class="brush: python"&gt;def fib(n):
	if n == 0: return 0
	if n == 1: return 1
	return fib(n-1) + fib(n-2)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
The problem of calculating &lt;tt&gt;fib(n)&lt;/tt&gt; thus depends on both &lt;tt&gt;fib(n-1)&lt;/tt&gt; and &lt;tt&gt;fib(n-2)&lt;/tt&gt;.  To see how these subproblems overlap look at how many times fib is called and with what arguments when we try to calculate &lt;tt&gt;fib(5)&lt;/tt&gt;:&lt;pre&gt;fib(5)
fib(4) + fib(3)
fib(3) + fib(2) + fib(2) + fib(1)
fib(2) + fib(1) + fib(1) + fib(0) + fib(1) + fib(0) + fib(1)
fib(1) + fib(0) + fib(1) + fib(1) + fib(0) + fib(1) + fib(0) + fib(1)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
At the k&lt;sup&gt;th&lt;/sup&gt; stage we only need to know the values of &lt;tt&gt;fib(k-1)&lt;/tt&gt; and &lt;tt&gt;fib(k-2)&lt;/tt&gt;, but we wind up calling each multiple times.  Starting from the bottom and going up we can calculate the numbers we need for the next step, removing the massive redundancy.
&lt;pre class="brush: python"&gt;def fib2(n):
	if n == 0: return 0
	n2, n1 = 0, 1
	for i in range(n-2):
		n2, n1 = n1, n1 + n2
	return n2+n1&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
In &lt;a href="http://en.wikipedia.org/wiki/Big_O_notation"&gt;Big-O&lt;/a&gt; notation the &lt;tt&gt;fib&lt;/tt&gt; function takes O(c&lt;sup&gt;n&lt;/sup&gt;) time, i.e., exponential in n, while the &lt;tt&gt;fib2&lt;/tt&gt; function takes O(n) time.  If this is all too abstract take a look at this graph comparing the runtime (in microseconds) of &lt;tt&gt;fib&lt;/tt&gt; and &lt;tt&gt;fib2&lt;/tt&gt; versus the input parameter.
&lt;/p&gt;

&lt;div class="math"&gt;
&lt;a rel="lightbox" href="/include/images/fib_performance.png"&gt;&lt;img src="http://assets.20bits.com/20070508/fib_performance_thumb.png" width="250" height="187" /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;
The above problem is pretty easy and for most programmers is one of the first examples of the performance issues surrounding recursion versus iteration.  In fact, I've seen many instances where the Fibonacci example leads people to believe that recursion is inherently slow.  This is not true; rather, when a problem with overlapping subproblems is defined recursively, the naïve recursion recomputes the same subproblems over and over, and a bottom-up technique like the one above eliminates that redundancy.
&lt;/p&gt;
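&lt;p&gt;
In fact we can keep the recursive definition and still avoid the redundancy by caching each subproblem's answer the first time we compute it.  Here's a sketch using Python's built-in &lt;tt&gt;functools.lru_cache&lt;/tt&gt;:
&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    """Recursive Fibonacci, but each overlapping subproblem
    is computed only once and then served from the cache."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(50))  # 12586269025, instantly; the naive fib(50) is infeasible
```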

&lt;p&gt;Now, for the second characteristic of dynamic programming: optimal substructure.&lt;/p&gt;

&lt;a name="optimal"&gt;&lt;/a&gt;
&lt;h4&gt;Optimal Substructure&lt;/h4&gt;
&lt;p&gt;
A problem is said to have &lt;a href="http://en.wikipedia.org/wiki/Optimal_substructure"&gt;optimal substructure&lt;/a&gt; if the globally optimal solution can be constructed from locally optimal solutions to subproblems.  The general form of problems in which optimal substructure plays a role goes something like this.  Let's say we have a collection of objects called A.  For each object o in A we have a "cost," c(o).  Now find the subset of A with the maximum (or minimum) cost, perhaps subject to certain constraints.
&lt;/p&gt;

&lt;p&gt;
The brute-force method would be to generate every subset of A, calculate the cost, and then find the maximum (or minimum) among those values.  But if A has n elements in it we are looking at a search space of size 2&lt;sup&gt;n&lt;/sup&gt; if there are no constraints on A.  Oftentimes n is huge making a brute-force method computationally infeasible.  Let's take a look at an example.
&lt;/p&gt;

&lt;h5&gt;Maximum Subarray Sum&lt;/h5&gt;
&lt;p&gt;
Let's say we're given an array of integers.  What (contiguous) subarray has the largest sum?  For example, if our array is [1,2,-5,4,7,-2] then the subarray with the largest sum is [4,7] with a sum of 11.  One might think at first that this problem reduces to finding the subarray with all positive entries, if one exists, that maximizes the sum.  But consider the array [1,5,-3,4,-2,1].  The subarray with the largest sum is [1, 5, -3, 4] with a sum of 7.
&lt;/p&gt;

&lt;p&gt;
First, the brute-force solution.  Because of the constraints on the problem, namely that the subsets under consideration are contiguous, we only have to check O(n&lt;sup&gt;2&lt;/sup&gt;) subarrays (why?).  Here it is, in Python: &lt;pre class="brush: python"&gt;def msum(a):
	return max([(sum(a[j:i]), (j,i)) for i in range(1,len(a)+1) for j in range(i)])&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This returns both the sum and the offsets of the subarray.  Let's see if we can't find an optimal substructure to exploit.
&lt;/p&gt;

&lt;p&gt;
We are given an input array &lt;tt&gt;a&lt;/tt&gt;.  I'm going to use Python notation so that &lt;tt&gt;a[0:k]&lt;/tt&gt; is the subarray starting at 0 and including every element up to and including &lt;tt&gt;k-1&lt;/tt&gt;.  Let's say we know the subarray of &lt;tt&gt;a[0:i]&lt;/tt&gt; with the largest sum (and that sum).  Using just this information can we find the subarray of &lt;tt&gt;a[0:i+1]&lt;/tt&gt; with the largest sum?
&lt;/p&gt;

&lt;p&gt;
Suppose we have processed a[0:i]: s is the largest sum seen so far, t is the running sum of a[j:i], and j is the current candidate left-hand edge.  For the next element a[i], first add it to the running sum: t = t + a[i].  If t is now greater than s then a[j:i+1] is the best subarray so far, so set s = t and record its bounds.  If t is negative, however, the contiguity constraint means that any subarray extending further right would do better to drop a[j:i+1] entirely, since including it can only lower the sum.  So when t goes negative, set t = 0 and move the left-hand bound of the candidate subarray to i+1.
&lt;/p&gt;

&lt;p&gt;
To visualize consider the array [1,2,-5,4,7,-2].
&lt;pre&gt;Set s = -infinity, t = 0, j = 0, bounds = (0,0)
(1   2  -5   4   7  -2 
(1)| 2  -5   4   7  -2  (set t=1.  Since t &gt; s, set s=1 and bounds = (0,1))
(1   2)|-5   4   7  -2  (set t=3.  Since t &gt; s, set s=3, and bounds = (0,2))
 1   2  -5(| 4   7  -2  (set t=-2. Since t &lt; 0, set t=0 and j = 3 )
 1   2  -5  (4)| 7  -2  (set t=4.  Since t &gt; s, set s=4 and bounds = (3,4))
 1   2  -5  (4   7)|-2  (set t=11. Since t &gt; s, set s=11 and bounds = (3,5))
 1   2  -5  (4   7) -2| (set t=9.  Nothing happens since t &lt; s)
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
This requires only one pass through the array and at each step we're only keeping track of three variables: the current sum from the left-hand edge of the bounds to the current point (t), the maximal sum (s), and the bounds of the current optimal subarray (bounds).  In Python:
&lt;pre class="brush: python"&gt;def msum2(a):
	bounds, s, t, j = (0,0), -float('infinity'), 0, 0
	
	for i in range(len(a)):
		t = t + a[i]
		if t &gt; s: bounds, s = (j, i+1), t
		if t &lt; 0: t, j = 0, i+1
	return (s, bounds)&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
In this problem the "globally optimal" solution corresponds to a subarray with a globally maximal sum, but at each step we only make a decision relative to what we have already seen. That is, at each step we know the best solution &lt;em&gt;thus far&lt;/em&gt;, but might change our decision later based on our previous information and the current information.   This is the sense in which the problem has optimal substructure.  Because we can make decisions locally we only need to traverse the list once, reducing the run-time of the solution to O(n) from O(n&lt;sup&gt;2&lt;/sup&gt;).  Again, a graph:&lt;/p&gt;

&lt;div class="math"&gt;
&lt;a rel="lightbox" href="/include/images/msum_performance.png"&gt;&lt;img src="http://assets.20bits.com/20070508/msum_performance_thumb.png" width="250" height="187" /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;a name="knapsack"&gt;&lt;/a&gt;
&lt;h4&gt;The Knapsack Problem&lt;/h4&gt;
&lt;p&gt;
Let's apply what we've learned so far to a slightly more interesting problem.  You are an art thief who has found a way to break into the impressionist wing at the Art Institute of Chicago.  Obviously you can't take everything.  In particular, you're constrained to take only what your knapsack can hold &amp;mdash; let's say it can only hold W pounds. You also know the market value for each painting.  Given that you can only carry W pounds, what paintings should you steal in order to maximize your profit?
&lt;/p&gt;

&lt;p&gt;
First let's see how this problem exhibits both overlapping subproblems and optimal substructure.  Say there are n paintings with weights w&lt;sub&gt;1&lt;/sub&gt;, ..., w&lt;sub&gt;n&lt;/sub&gt; and market values v&lt;sub&gt;1&lt;/sub&gt;, ..., v&lt;sub&gt;n&lt;/sub&gt;.  Define A(i,j) as the maximum value that can be attained from considering only the first i items weighing at most j pounds, as follows.
&lt;/p&gt;
&lt;p&gt;
Obviously A(0,j) = 0 and A(i,0) = 0 for any i &amp;le; n and j &amp;le; W.  If w&lt;sub&gt;i&lt;/sub&gt; &gt; j then A(i,j) = A(i-1, j) since we cannot include the i&lt;sup&gt;th&lt;/sup&gt; item.  If, however, w&lt;sub&gt;i&lt;/sub&gt; &amp;le; j then we have a choice: include the i&lt;sup&gt;th&lt;/sup&gt; item or not.  If we do not include it then the value will be A(i-1, j).  If we do include it, however, the value will be v&lt;sub&gt;i&lt;/sub&gt; + A(i-1, j - w&lt;sub&gt;i&lt;/sub&gt;).  Which choice should we make?  Well, whichever is larger, i.e., the maximum of the two.
&lt;/p&gt;

&lt;p&gt;
Expressed formally we have the following recursive definition
&lt;img class="math" src="http://assets.20bits.com/20070508/knapsack_fct.png" alt="Knapsack function" width="491" height="71" /&gt;
&lt;/p&gt;

&lt;p&gt;
This problem exhibits both overlapping subproblems and optimal substructure and is therefore a good candidate for dynamic programming.  The subproblems overlap because at any stage (i,j) we might need to calculate A(k,l) for several k &lt; i and l &lt; j.   We have optimal substructure since at any point we only need information about the choices we have already made.
&lt;/p&gt;

&lt;p&gt;
The recursive solution is not hard to write: &lt;pre class="brush: python"&gt;def A(w, v, i,j):
    if i == 0 or j == 0: return 0
    if w[i-1] &gt; j:  return A(w, v, i-1, j)
    if w[i-1] &lt;= j: return max(A(w,v, i-1, j), v[i-1] + A(w,v, i-1, j - w[i-1]))&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Remember we need to calculate A(n,W).  To do so we're going to need to create an (n+1)-by-(W+1) table whose entry at (i,j) contains the value of A(i,j).  The first time we calculate the value of A(i,j) we store it in the table at the appropriate location.  This technique is called &lt;a href="http://en.wikipedia.org/wiki/Memoization"&gt;memoization&lt;/a&gt; and is one way to exploit overlapping subproblems.  There's also a Ruby module called &lt;a href="http://raa.ruby-lang.org/project/memoize/"&gt;memoize&lt;/a&gt; which automates this.
&lt;/p&gt;

&lt;p&gt;
To exploit the optimal substructure we iterate over all i &lt;= n and j &lt;= W, at each step applying the recursion formula to generate the A(i,j) entry by using the memoized table rather than calling A() again. This gives an algorithm which takes O(nW) time using O(nW) space and our desired result is stored in the A(n,W) entry in the table.
&lt;/p&gt;
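&lt;p&gt;
Putting the two pieces together, here's a sketch of the bottom-up version.  The painting weights and values in the example are invented for illustration:
&lt;/p&gt;

```python
def knapsack(w, v, W):
    """Bottom-up 0/1 knapsack.  A[i][j] holds the best value attainable
    using only the first i items with a weight limit of j."""
    n = len(w)
    A = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, W + 1):
            if w[i - 1] > j:
                # Item i doesn't fit: inherit the best without it
                A[i][j] = A[i - 1][j]
            else:
                # Take the better of skipping or including item i
                A[i][j] = max(A[i - 1][j],
                              v[i - 1] + A[i - 1][j - w[i - 1]])
    return A[n][W]

# Three paintings weighing 3, 4, and 5 pounds, worth 4, 5, and 6,
# with a 7-pound knapsack:
print(knapsack([3, 4, 5], [4, 5, 6], 7))  # 9: take the 3- and 4-pound paintings
```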

&lt;a name="everyday"&gt;&lt;/a&gt;
&lt;h4&gt;Everyday Dynamic Programming&lt;/h4&gt;

&lt;p&gt;
The above examples might make dynamic programming look like a technique which only applies to a narrow range of problems, but many algorithms from a wide range of fields use dynamic programming.  Here's a very partial list.
&lt;ol&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm"&gt;Needleman-Wunsch algorithm&lt;/a&gt;, used in bioinformatics.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/CYK_algorithm"&gt;CYK algorithm&lt;/a&gt;, which is used in the theory of formal languages and natural language processing.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Viterbi_algorithm"&gt;Viterbi algorithm&lt;/a&gt; used in relation to &lt;a href="http://en.wikipedia.org/wiki/Hidden_Markov_model"&gt;hidden Markov models&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finding the &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;string-edit distance&lt;/a&gt; between two strings, useful in writing spellcheckers.&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/Duckworth-Lewis_method"&gt;D/L method&lt;/a&gt; used in the sport of &lt;a href="http://en.wikipedia.org/wiki/One-day_cricket"&gt;cricket&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
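&lt;p&gt;
To give a taste of the list above, the string-edit distance is a classic table-filling dynamic program, exactly like the knapsack table:
&lt;/p&gt;

```python
def edit_distance(s, t):
    """Levenshtein (string-edit) distance: d[i][j] is the minimum number
    of insertions, deletions, and substitutions turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```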

&lt;p&gt;
That's all for today.  Cheers!
&lt;/p&gt;</description>
      <pubDate>Tue, 08 May 2007 09:23:48 +0000</pubDate>
      <link>http://20bits.com/article/introduction-to-dynamic-programming</link>
    </item>
    <item>
      <title>The Infection Puzzle</title>
      <description>&lt;p&gt;
I first heard this puzzle from the affable Hungarian mathematician and computer scientist &lt;a href="http://people.cs.uchicago.edu/~laci/"&gt;László Babai&lt;/a&gt; in his &lt;a href="http://www.cs.uchicago.edu/courses/description/CMSC/27400"&gt;Combinatorics and Probability&lt;/a&gt; class.  It gave headaches to a lot of people much smarter than I am, but there is what Babai would call an "Ah-haaaa!" solution.  Read on if you're brave enough.
&lt;/p&gt;
&lt;!--more--&gt;

&lt;!--[if IE]&gt;&lt;script type="text/javascript" src="/include/js/excanvas-compressed.js"&gt;&lt;/script&gt;&lt;![endif]--&gt;
&lt;script src="/include/js/infection.js" type="text/javascript"&gt;&lt;/script&gt;
&lt;h4&gt;The Rules&lt;/h4&gt;
&lt;p&gt;
The rules of the puzzle are simple.
&lt;ol&gt;
&lt;li&gt;You are given an n-by-n board, e.g., an 8x8 chessboard.&lt;/li&gt;
&lt;li&gt;Some initial number of the squares are "infected."&lt;/li&gt;
&lt;li&gt;If an uninfected square shares an edge with two or more infected squares then it too becomes infected.&lt;/li&gt;
&lt;li&gt;The infection spreads until there are no more squares which can be infected.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
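&lt;p&gt;
If you'd rather experiment in code than in a browser, here's a sketch of the rules in Python (my own transcription, separate from the JavaScript version below):
&lt;/p&gt;

```python
def spread(infected, n):
    """Run the infection to a standstill on an n-by-n board.
    An uninfected square with 2+ infected edge-neighbors becomes infected."""
    infected = set(infected)
    while True:
        newly = set()
        for r in range(n):
            for c in range(n):
                if (r, c) in infected:
                    continue
                neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                if sum(nb in infected for nb in neighbors) >= 2:
                    newly.add((r, c))
        if not newly:  # nothing changed this time step: the infection has stopped
            return infected
        infected |= newly

# Infecting the main diagonal takes over an entire 8x8 board:
diagonal = {(i, i) for i in range(8)}
print(len(spread(diagonal, 8)))  # 64
```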

&lt;p&gt;
We view the infection as spreading in discrete time steps, which makes this puzzle an example of a &lt;a href="http://mathworld.wolfram.com/CellularAutomaton.html"&gt;cellular automaton&lt;/a&gt;.  Here is an example of an initial infection which stops spreading well before it infects the whole board.

&lt;div class="math"&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_0.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_1.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_2.png" height="81" width="81" /&gt;
&lt;img src="http://assets.20bits.com/20070505/infection_3.png" height="81" width="81" /&gt;
&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;The Puzzle&lt;/h4&gt;
&lt;p&gt;
There are many initial configurations which will infect the whole board.  The most obvious and least interesting configuration is one with every square infected.  It's also not hard to see that we can infect the whole board with fewer than this many initially infected squares.  So, the question is, what is the smallest number of initially infected squares required to infect an n-by-n board?  Does it depend on n?  If so, how?
&lt;/p&gt;

&lt;p&gt;
To help you get a grip on the puzzle I've included a JavaScript implementation below using the &lt;a href="http://en.wikipedia.org/wiki/Canvas_(HTML_element)"&gt;canvas tag&lt;/a&gt;.  This means that IE users are out of luck, but if you're on Windows you should be using &lt;a href="http://www.mozilla.com/en-US/firefox/"&gt;Firefox&lt;/a&gt; or &lt;a href="http://www.opera.com/"&gt;Opera&lt;/a&gt;, anyhow.  &lt;strong&gt;Edit:&lt;/strong&gt; I just learned about Google's &lt;a href="http://excanvas.sourceforge.net/"&gt;Explorer Canvas&lt;/a&gt;, so hopefully this now works in IE.  Since I don't have the means to test it on my laptop "hopefully" is the best I can offer.
&lt;/p&gt;

&lt;p&gt;
To infect or disinfect a square just click on it.  Once you're ready to test your initial configuration click "Run it!".
&lt;/p&gt;
&lt;p class="math"&gt;
&lt;canvas id="canvas" width="161" height="161"&gt;
&lt;/canvas&gt;
&lt;script type="text/javascript"&gt;
var canvas = $('canvas');
var game = new Infection((canvas.width - 1) / 20, canvas);
game.clear();
&lt;/script&gt;
&lt;br /&gt;
&lt;a href="#" onclick="game.start();return false;"&gt;&lt;strong&gt;Run it!&lt;/strong&gt;&lt;/a&gt; | &lt;a href="#" onclick="game.clear();return false"&gt;Clear&lt;/a&gt;
&lt;/p&gt;



&lt;p&gt;
You're free to use the code pursuant to the Creative Commons &lt;a href="http://creativecommons.org/licenses/by-sa/3.0/"&gt;Attribution-ShareAlike 3.0&lt;/a&gt; license.  The code requires a browser that supports the canvas tag, as mentioned above, and also the &lt;a href="http://www.prototypejs.org/"&gt;Prototype&lt;/a&gt; JavaScript framework.
&lt;div class="download"&gt;Download the &lt;a href="/include/js/infection.js"&gt;infection puzzle&lt;/a&gt; code.&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;Update&lt;/h4&gt;
&lt;p&gt;
If you've read the comments then you know you can infect the whole n-by-n board by infecting one of the diagonals.  This leads people to believe that you cannot infect the whole board with fewer than n initially-infected squares.  The first "proof" of this rests on the claim that for the whole board to be infected every row and column must contain at least one initially infected square.  If we start with fewer than n squares infected then at least one row or column would be empty, so that the whole board could not be infected.
&lt;/p&gt;
&lt;p&gt;
This would be a fine proof were it the case that every row or column must contain an initially infected square.  Here's a configuration which infects the whole board but which nevertheless has initially infected squares in fewer than n (here n=8) rows and columns:
&lt;/p&gt;

&lt;img class="math" src="http://assets.20bits.com/20070505/infection_rowcol.png" width="161" height="161" /&gt;


&lt;p&gt;
So, we know we can infect the whole board by infecting n squares initially (one of the diagonals).  But can we do it in fewer?  If not, why not?
&lt;/p&gt;</description>
      <pubDate>Sat, 05 May 2007 00:03:49 +0000</pubDate>
      <link>http://20bits.com/article/the-infection-puzzle</link>
    </item>
    <item>
      <title>Facebook job puzzles: Prime bits</title>
      <description>&lt;p&gt;
Welcome to the second installment of 20bits' &lt;a href="http://www.facebook.com/jobs_puzzles/"&gt;Facebook job puzzles&lt;/a&gt; solution manual.  This time I am going to tackle the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=5"&gt;Prime Bits&lt;/a&gt; puzzle.
&lt;/p&gt;

&lt;p&gt;
Like my &lt;a href="/article/facebook-job-puzzles-korn-shell"&gt;Korn Shell&lt;/a&gt; solution, this puzzle is mostly mathematical.  This time, however, we're going to be wading deep into &lt;a href="http://en.wikipedia.org/wiki/Combinatorics"&gt;combinatorics&lt;/a&gt; territory.  Combinatorics is the mathematics of counting.  If you have three pairs of socks, two pairs of pants, and four shirts, how many outfits can you wear?  If you have a collection of twenty playing cards how many two-card hands are there?  These are the sorts of questions combinatorics exists to answer, although the questions quickly become more complicated.
&lt;/p&gt;

&lt;p&gt;
Again, I'm aiming to make this understandable to an intelligent layperson interested in the puzzle.  If that's you then read on!
&lt;/p&gt;
&lt;!--more--&gt;

&lt;h4&gt;The Question&lt;/h4&gt;
&lt;p&gt;
This time around the question is much more straightforward.  Every (positive) integer can be represented using &lt;a href="http://en.wikipedia.org/wiki/Binary_numeral_system"&gt;binary&lt;/a&gt;.  Normally we write using decimal, which is based around powers of ten.  For example, 215 is really 2*10&lt;sup&gt;2&lt;/sup&gt; + 1*10&lt;sup&gt;1&lt;/sup&gt; + 5*10&lt;sup&gt;0&lt;/sup&gt;.  Binary involves using 2s rather than 10s as place values, so, e.g., 5 is written as 101 = 1*2&lt;sup&gt;2&lt;/sup&gt; + 0*2&lt;sup&gt;1&lt;/sup&gt; + 1*2&lt;sup&gt;0&lt;/sup&gt;.
&lt;/p&gt;

&lt;p&gt;
Every integer therefore can be represented as a string of zeroes and ones.  Define &lt;tt&gt;P(x)&lt;/tt&gt; to be true if the number of ones in the binary representation of x is prime and false otherwise.  So, e.g., P(5) is true but P(4) is false.  Our job is to implement the function &lt;tt&gt;uint64_t prime_bits(uint64_t a, uint64_t b);&lt;/tt&gt; which returns the number of integers k, a &amp;le; k &amp;le; b such that P(k) is true.  &lt;tt&gt;uint64_t&lt;/tt&gt; is a way of designating 64-bit numbers in C/C++.
&lt;/p&gt;
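&lt;p&gt;
Before anything clever, we need &lt;tt&gt;P(x)&lt;/tt&gt; itself.  Here's a straightforward sketch; trial division is plenty for the primality check, since a 64-bit number has at most 64 ones:
&lt;/p&gt;

```python
def P(x):
    """True if the count of ones in x's binary representation is prime."""
    ones = bin(x).count("1")
    if ones < 2:
        return False  # 0 and 1 are not prime
    # Trial division up to the square root is ample for ones <= 64
    return all(ones % d != 0 for d in range(2, int(ones ** 0.5) + 1))

print(P(5), P(4))  # True False: 5 = 101 has two ones (prime), 4 = 100 has one
```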

&lt;h4&gt;The Obvious Solution&lt;/h4&gt;
&lt;p&gt;
Assuming we have implemented &lt;tt&gt;P(x)&lt;/tt&gt; the obvious solution in Python is this
&lt;pre class="brush: python"&gt;def prime_bits(a,b):
	return sum([P(k) for k in range(a,b+1)])&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
That is, for each integer in the desired range we calculate P(k) and then sum all these values.  Since "true" corresponds to "1" and "false" to "0" we get the total number of true entries in our desired range.  A more explicit procedural implementation would be &lt;pre class="brush: python"&gt;def prime_bits(a,b):
	c = 0
	for k in range (a,b+1):
		if P(k):
			c = c + 1
	return c&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Assuming P(n) = O(n) (this is called &lt;a href="http://en.wikipedia.org/wiki/Big_O_notation"&gt;Big-O notation&lt;/a&gt;) then we have prime_bits(a,b) = O(b&lt;sup&gt;2&lt;/sup&gt;).  Surely we can do better.  In fact, according to the constraints on the Facebook page we &lt;em&gt;have&lt;/em&gt; to do better to even be considered in the running.
&lt;/p&gt;

&lt;p&gt;
The above Big-O analysis tells us that we must be careful of two things: first, our implementation, whatever it is, cannot iterate over every integer between a and b; second, our prime-checking function has to be fast enough that it doesn't dwarf the running time of the rest of our algorithm.
&lt;/p&gt;

&lt;h4&gt;Bring a Little Binomial into Your Life&lt;/h4&gt;

&lt;p&gt;
Forget about the fact that we're dealing with numbers and think of our binary representation as a string.  That is, think of "110101" as nothing more than some sequence of zeroes and ones.  It could just as easily be "aababa" or "!!?!?!" or any other two characters we choose.
&lt;/p&gt;

&lt;p&gt;
From this perspective we can turn the question on its head.  Rather than going through each possible string one by one, let's count groups of strings en masse.  How many 6-character strings of zeroes and ones have 3 ones?
&lt;/p&gt;

&lt;p&gt;
The answer rests in that foundational function of combinatorics, the &lt;a href="http://mathworld.wolfram.com/BinomialCoefficient.html"&gt;binomial coefficient&lt;/a&gt;.  It is defined as follows:
&lt;img class="math" src="http://assets.20bits.com/20070427/binomial_def.png" alt="definition of binomial coefficient" /&gt;
pronounced "&lt;em&gt;n&lt;/em&gt; choose &lt;em&gt;k&lt;/em&gt;" and sometimes written n&lt;em&gt;C&lt;/em&gt;k, where
&lt;img class="math" src="http://assets.20bits.com/20070427/factorial_def.png" alt="definition of factorial" /&gt;

is the &lt;a href="http://mathworld.wolfram.com/Factorial.html"&gt;factorial&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
This might seem like gibberish so let's start with what it &lt;em&gt;means&lt;/em&gt;.  Let's say we're given a vat of balls labeled 1 through 6.  The binomial coefficient tells us how many ways we can pick three balls from the vat if we don't care about the order in which they are picked.  We might choose {1,2,3} or {2,4,6}, for example, but {4,5,3} and {3,4,5} are considered to be the same drawing.
&lt;/p&gt;

&lt;p&gt;
How might we go about counting this?  Well, let's start by caring about order because that's easier to count.  If we &lt;em&gt;do&lt;/em&gt; care about order then there are 6 ways to pick the first ball, 5 ways to pick the second ball, and 4 ways to pick the third and final ball.  This means there are 6*5*4 ways to pick the balls if we care about order, i.e., if we treat (4,5,3) and (3,4,5) as different drawings.
&lt;/p&gt;

&lt;p&gt;
Now how many ways are there to arrange each triplet?  Well, there are 3 ways to choose the first element, 2 ways to choose the second, and 1 way to choose the third and final element.  Explicitly, the six following triplets are different  orderings of the same drawing:
&lt;pre&gt;
(1,2,3), (2,3,1), (3,1,2), (1,3,2), (3,2,1), (2,1,3)
&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
But we can write 6*5*4 = 6!/3! and 3*2*1 = 3!, using the factorial notation.  This means that if we don't care about order, there are a total of 6!/(3!*3!) ways of drawing balls from the vat.  Or we could write 6!/((6-3)!*3!) = 6C3, 6 choose 3.  So hopefully you can see that the above formula didn't totally come from nowhere.
&lt;/p&gt;

&lt;p&gt;
Let's go back to our binary brou-ha-ha.  Let's take a 6-bit binary string, 000000.  Now, how many such strings have three ones?  Well, 6C3.  If we want to know how many 6-bit binary strings have a prime number of ones we just need to find all the prime numbers less than 6 and use the binomial coefficient.  Since 2, 3, and 5 are the only primes less than 6 we know there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_choice_6.png" alt="6-bit binary string calculation" /&gt;
6-bit binary strings with either 2, 3 or 5 (i.e., a prime number of) ones.
&lt;/p&gt;
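&lt;p&gt;
A quick sanity check of that sum in Python (&lt;tt&gt;math.comb&lt;/tt&gt;, available in Python 3.8+, is the binomial coefficient):
&lt;/p&gt;

```python
from math import comb

# 6C2 + 6C3 + 6C5: 6-bit strings with a prime number of ones.
count = comb(6, 2) + comb(6, 3) + comb(6, 5)
print(count)  # 15 + 20 + 6 = 41
```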

&lt;h4&gt;Paging Mr. Pascal&lt;/h4&gt;
&lt;p&gt;
Since our motivation for writing a new method is performance the above might very well be worthless if we can't make it perform.  Luckily the binomial coefficient has an excellent recursive definition related to &lt;a href="http://en.wikipedia.org/wiki/Pascal's_triangle"&gt;Pascal's triangle&lt;/a&gt;:
&lt;img class="math" src="http://assets.20bits.com/20070427/binomial_recursion.png" alt="binomial recursion" /&gt;
&lt;/p&gt;

&lt;p&gt;
By using &lt;a href="http://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; we can generate the n&lt;sup&gt;th&lt;/sup&gt; row of Pascal's triangle in O(n&lt;sup&gt;2&lt;/sup&gt;) time.  But wait, weren't we looking for something that beat this?  (Remember our original implementation of prime_bits(a,b) took O(b&lt;sup&gt;2&lt;/sup&gt;) time.)  Here's where you need to be careful: we need to generate as many rows in Pascal's triangle as there are significant bits in our integer.  For a 64-bit integer this means we'll be generating at most 64 rows of Pascal's triangle.  In terms of our inputs a and b this is O((log b)&lt;sup&gt;2&lt;/sup&gt;) time.  Of course we're a far cry from a full implementation, but at least we know we can use binomial coefficients within our performance requirements.
&lt;/p&gt;
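&lt;p&gt;
The dynamic programming step can be sketched directly from the recursion above; each interior entry is the sum of the two entries above it:
&lt;/p&gt;

```python
def pascal_rows(n):
    # Row i holds the binomial coefficients C(i, 0) .. C(i, i),
    # built via C(i, k) = C(i-1, k-1) + C(i-1, k).
    rows = [[1]]
    for i in range(1, n + 1):
        prev = rows[-1]
        rows.append([1] + [prev[k - 1] + prev[k] for k in range(1, i)] + [1])
    return rows
```

&lt;p&gt;
For 64-bit inputs, &lt;tt&gt;pascal_rows(64)&lt;/tt&gt; is all we will ever need.
&lt;/p&gt;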

&lt;h4&gt;Back to Numbers&lt;/h4&gt;
&lt;p&gt;
Let's say we have a number which is a power of two, e.g., 2&lt;sup&gt;8&lt;/sup&gt;.  In binary this is 100000000.  All numbers less than it are of the form 0xxxxxxxx, where x is either 0 or 1.  We have eight places to fill and we're interested in those combinations which have a prime number of 1s.  2, 3, 5, and 7 are the only primes less than 8 so there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_choice_8.png" alt="8-bit integers with prime ones" /&gt;
positive integers less than 2&lt;sup&gt;8&lt;/sup&gt; which have a prime number of ones in their binary representation.
&lt;/p&gt;

&lt;p&gt;
So far so good, but the above method only works for numbers that are powers of two.  What happens if we want to count the number of desired positive integers less than 100101?  111101 has a prime number of ones but it is larger than 100101.  So how do we extend the above method to arbitrary integers?
&lt;/p&gt;

&lt;p&gt;
Consider the following ranges of numbers:
&lt;pre&gt;
Numbers less than 100101 (37)
----
000000 to 011111
100000 to 100011
100100
100101
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
First, do these ranges together encompass every number up to 100101?  The first range is the set of all numbers less than 100000, the second extends that to every number less than 100100, the third adds 100100 itself to cover everything less than 100101, and the fourth is the number itself.  So certainly every number in this set is at most 100101, but is that every such number?
&lt;/p&gt;

&lt;p&gt;
Yes, and we can see that by looking at place values.  If a number agrees with 100101 in the left-most bit then it cannot have a 1 in the next two places without being &lt;em&gt;greater&lt;/em&gt; than the number in question.  Likewise, if it agrees with 100101 in the four left-most bits then it cannot have a 1 in the second position without being greater.
&lt;/p&gt;

&lt;p&gt;
From this we can determine combinatorially how many such numbers have a prime number of 1 bits.  For the range 000000 to 011111 there are
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_range_1.png" alt="prime range 1" /&gt;
numbers which have the desired property, since we have 5 bits to choose freely and 2, 3, and 5 are the only primes less than or equal to 5.
&lt;/p&gt;

&lt;p&gt;
The next range, 100000 to 100011, is similar, except we have one bit fixed and two bits to choose freely.  Thus there are 
&lt;img class="math" src="http://assets.20bits.com/20070427/prime_range_2.png" alt="prime range 2" /&gt;
numbers which have the desired property.  This is because there is already one bit set, so we count all combinations of the free bits which, together with that 1, have a prime number of bits set to 1.
&lt;/p&gt;

&lt;h4&gt;The Algorithm and Some Caveats&lt;/h4&gt;
&lt;p&gt;
So, for a given number we now have a function pb(n) which counts the number of positive integers less than or equal to n which have a prime number of bits set to 1.  &lt;tt&gt;prime_bits(a,b)&lt;/tt&gt; then just becomes &lt;tt&gt;pb(b) - pb(a-1)&lt;/tt&gt;, since the range is inclusive of a.
&lt;/p&gt;

&lt;p&gt;
To find the number of positive integers between 1 and n, inclusive, with a prime number of bits we therefore need to do three things:
&lt;ol&gt;
&lt;li&gt;Generate Pascal's triangle so we can easily extract binomial coefficients.  This takes O((log n)&lt;sup&gt;2&lt;/sup&gt;) time.&lt;/li&gt;
&lt;li&gt;Given a number &lt;em&gt;n&lt;/em&gt; find its most significant (or left-most) bit.  This takes O(log n) time.&lt;/li&gt;
&lt;li&gt;For each bit set to 1 count the number of combinations of bits to its right which yield a prime number of bits.  If we have a pre-generated list of primes this takes O((log n)&lt;sup&gt;2&lt;/sup&gt;), otherwise we can do it in O((log n)&lt;sup&gt;3&lt;/sup&gt;) time.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
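&lt;p&gt;
Since the author's implementation is redacted below at Facebook's request, here is an independent from-scratch sketch of those three steps (all names and the hard-coded prime set are mine; &lt;tt&gt;math.comb&lt;/tt&gt; stands in for a Pascal's-triangle lookup):
&lt;/p&gt;

```python
from math import comb

# Primes up to 64, enough for 64-bit popcounts.
PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61}

def pb(n):
    # Count positive integers <= n with a prime number of 1 bits,
    # walking n's bits from most to least significant.
    count = 0
    ones_so_far = 0  # 1 bits of n strictly above the current position
    for pos in range(n.bit_length() - 1, -1, -1):
        if n & (1 << pos):
            # Numbers matching n above this bit, with a 0 here: the
            # remaining 'pos' bits are free, so count the combinations
            # that bring the total popcount to a prime.
            for p in PRIMES:
                free = p - ones_so_far
                if 0 <= free <= pos:
                    count += comb(pos, free)
            ones_so_far += 1
    if ones_so_far in PRIMES:  # n itself
        count += 1
    return count

def prime_bits(a, b):
    return pb(b) - pb(a - 1)
```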

&lt;p&gt;
Although the Facebook puzzle says it would be "uncouth" to not support integers over 64 bits, generating lists of primes is a well-understood problem.  Using the &lt;a href="http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes"&gt;Sieve of Eratosthenes&lt;/a&gt; one can generate every prime less than k in O(k log log k) time and using the &lt;a href="http://en.wikipedia.org/wiki/Sieve_of_Atkin"&gt;Sieve of Atkin&lt;/a&gt; one can do it in O(k/log log k) time.  For now I'm just going to hard-code every prime less than 1024.  This means the above algorithm will work for up to 1024-bit integers.  If you care about running the above algorithm for arbitrarily large numbers, well, you can implement one of the prime number sieves yourself.
&lt;/p&gt;
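&lt;p&gt;
For reference, a bare-bones Sieve of Eratosthenes (a sketch of mine; assumes k &amp;ge; 2), which is all the hard-coded prime list amounts to:
&lt;/p&gt;

```python
def primes_below(k):
    # Classic sieve: strike out multiples of each prime up to sqrt(k).
    sieve = [True] * k
    sieve[0] = sieve[1] = False
    for i in range(2, int(k ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_p in enumerate(sieve) if is_p]
```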

&lt;p&gt;
Here is my implementation in Python.
&lt;div class="notice warning"&gt;
This code has been redacted at the request of Facebook.  &lt;a href="mailto:twenty.bits@gmail.com?subject=Prime bits solution"&gt;Contact me&lt;/a&gt; if you want the code.
&lt;/div&gt;
&lt;/p&gt;

&lt;h4&gt;Improvements&lt;/h4&gt;
&lt;p&gt;
The code as-is runs quite fast.  Every input I've given it runs in well under a second, usually less than a tenth of a second, but there are still improvements to be made.
&lt;ol&gt;
&lt;li&gt;Rather than calculate pb(b) - pb(a), integrate the two passes by determining the first bit-position in which a and b differ and running a variation of the above counting algorithm.&lt;/li&gt;
&lt;li&gt;Improve the significant bit calculation.  The current implementation is the most naïve method.&lt;/li&gt;
&lt;li&gt;Generally tweak the loops to improve speed, since Python is notoriously slow in iterating over lists.  Or write it in a language like C (or ASM, you hardcode d00d, you).&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
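&lt;p&gt;
On the second point, a sketch of the two approaches; in modern Python &lt;tt&gt;int.bit_length&lt;/tt&gt; does the naive shift loop's job in one call:
&lt;/p&gt;

```python
def msb_naive(n):
    # Shift right until the number is exhausted, counting positions.
    pos = -1
    while n:
        n >>= 1
        pos += 1
    return pos

def msb_fast(n):
    # int.bit_length gives the same answer directly.
    return n.bit_length() - 1
```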

&lt;p&gt;
Anyhow, none of the above strike me as particularly "interesting," so I'll leave them to more enterprising individuals.  Cheers and happy coding!
&lt;/p&gt;</description>
      <pubDate>Fri, 27 Apr 2007 00:39:05 +0000</pubDate>
      <link>http://20bits.com/article/facebook-job-puzzles-prime-bits</link>
    </item>
    <item>
      <title>Facebook job puzzles: Korn Shell</title>
      <description>&lt;p&gt;
Welcome to the first installment of 20bits' &lt;a href="http://www.facebook.com/jobs_puzzles/"&gt;Facebook job puzzles&lt;/a&gt; solutions manual.  My ultimate goal is to solve every interesting puzzle in the aforelinked list and make a public post with the solution code and an explanation.  Why?  Because I'm a mathematician by training and no good puzzle should go unsolved.
&lt;/p&gt;

&lt;p&gt;
I intended to start with the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=1"&gt;Evil Gambling Monster&lt;/a&gt; puzzle because I thought it would be the best learning experience given my lack of formal CS training.  I did learn some new algorithms in solving it (e.g., the &lt;a href="http://en.wikipedia.org/wiki/A*_search_algorithm"&gt;A&lt;sup&gt;*&lt;/sup&gt; search algorithm&lt;/a&gt;), but that's when the &lt;a href="http://www.facebook.com/jobs_puzzles/?puzzle_id=7"&gt;Korn Shell&lt;/a&gt; puzzle caught my eye.  My inner mathematician couldn't resist and I believe I have solved the puzzle mathematically.  The solution involves my favorite area of mathematics (&lt;a href="http://en.wikipedia.org/wiki/Group_theory"&gt;group theory&lt;/a&gt;), so I'm going to attempt to explain it in a way that is understandable to a layperson.
&lt;/p&gt;
&lt;!--more--&gt;
&lt;h4&gt;The Rules of the Game&lt;/h4&gt;
&lt;p&gt;
You are sitting at a computer terminal whose 26 alphabetic keys have been randomly rearranged (permuted) and you enter your name, say, "Mike Korn."  Since the keys are scrambled what appears on the screen is also scrambled.  The game consists of you typing the characters you see on the screen until your name appears.  The length of the game is the number of times you must type the string on the screen before your name appears, at which point the game terminates.  
&lt;/p&gt;
&lt;p&gt;
A game therefore consists of two pieces of information, viz., an &lt;strong&gt;input string&lt;/strong&gt; and a &lt;strong&gt;permutation&lt;/strong&gt; of the 26 alphabetic keys.  Everything else is completely determined by these two pieces of information.  The puzzle is this: if I give you a name can you give me the length of the longest possible game which uses that name as its first input?
&lt;/p&gt;
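&lt;p&gt;
To make the game concrete, here is a small Python simulation (a sketch of mine; it represents the permutation as a dictionary from key to the character it prints, and leaves spaces and punctuation alone):
&lt;/p&gt;

```python
def play(name, perm):
    # Type whatever is on screen until the original name reappears.
    rounds = 0
    current = name
    while True:
        current = "".join(perm.get(c, c) for c in current)
        rounds += 1
        if current == name:
            return rounds

# Hypothetical example: swap A and L, and cycle E -> M -> R -> E,
# for both upper and lower case.
perm = {}
for cycle in ("AL", "EMR"):
    for i, c in enumerate(cycle):
        nxt = cycle[(i + 1) % len(cycle)]
        perm[c] = nxt
        perm[c.lower()] = nxt.lower()
```

&lt;p&gt;
With this permutation, "Mel Farb" reappears after six rounds.
&lt;/p&gt;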

&lt;h4&gt;Questions to Ask&lt;/h4&gt;
&lt;p&gt;
The first step in reasoning mathematically is to isolate your assumptions.  The biggest assumption above is that every possible game is guaranteed to terminate in a finite number of steps.  This isn't particularly obvious at the outset.  This leads us to the following questions.

&lt;ol&gt;
&lt;li&gt;Is every possible game guaranteed to terminate in a finite number of steps?&lt;/li&gt;
&lt;li&gt;How does the length of a game vary with respect to the two inputs, i.e., the name and the permutation?&lt;/li&gt;
&lt;li&gt;If every game does terminate in a finite number of steps, what is the longest game?&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
&lt;p&gt;
If we can answer these questions well enough then we are done.  I'm claiming that each of these questions has an explicit answer and that no algorithm is necessary to determine any of them.  So, let's get answering.
&lt;/p&gt;

&lt;h4&gt;Helpful Notation&lt;/h4&gt;
&lt;p&gt;
I will introduce more notation as it becomes necessary but I'd like to introduce a few preliminaries.  Let's say the A and B keys have been swapped on the keyboard and no other keys have been touched.  We can write this as A &amp;rarr; B &amp;rarr; A, since, if you type 'A' the screen prints 'B,' and if you type 'B' the screen prints 'A.' Similarly, if we took the letters Q,W,E and R and moved each to the right (on a QWERTY keyboard), with R going to Q's place, we could write Q &amp;rarr; W &amp;rarr; E &amp;rarr; R &amp;rarr; Q.
&lt;/p&gt;

&lt;p&gt;
Additionally, if I want to refer to a specific permutation I will use the symbol &amp;sigma; or &amp;tau; to represent it.  Those are the lower-case Greek letters sigma and tau, respectively.  This is only in the case where I am talking about a permutation in the abstract &amp;mdash; if I want to talk specifically about what the permutation has done to the keyboard keys then I will use the above notation to describe its action.
&lt;/p&gt;

&lt;h4&gt;The single-letter case&lt;/h4&gt;
&lt;p&gt;
Is every possible game guaranteed to terminate in a finite number of steps?  When in doubt, start simple.  What happens if we have a single letter as an input, say, 'A?'  We enter A and another letter, say, V, appears.  We then enter V and yet another letter appears.  Now, at any point, the next letter that appears &lt;i&gt;could&lt;/i&gt; be A, in which case we're done.  
&lt;/p&gt;
&lt;p&gt;
But can this go on forever?  Well, let's say we've seen every letter from A to Z &lt;i&gt;exactly once&lt;/i&gt; and the last letter on the screen was an X.  What happens when we hit the X key?  The original letter, A in this case, must appear on the screen.  If another letter appears, say B, then the key we pressed earlier which produced B must have been X.  But this means that we have seen X twice, an impossibility!
&lt;/p&gt;

&lt;p&gt;
In fact, we just solved the puzzle for the single-letter input case.  If we are given a single letter as an input the permutation (i.e., the rearrangement of the 26 alphabetic keys) we want is represented by A &amp;rarr; B &amp;rarr; C &amp;rarr; ... &amp;rarr; Y &amp;rarr; Z &amp;rarr; A.  For any single-letter input this produces a game of length 26.  As we showed above, no game with a single-letter input can last longer than 26 turns.
&lt;/p&gt;

&lt;h4&gt;On Cycles and Sufjan Stevens (or, the multi-character case)&lt;/h4&gt;
&lt;p&gt;
What happens when we are given a two-character input string?  When I explained this to my girlfriend I used a music analogy.  Let's say we have two musicians, one playing in 4/4 time and the other playing in 5/4 time (maybe he's Dave Brubeck or Sufjan Stevens).  If both musicians start playing on the same beat how many beats does it take for their measures to come back into alignment?  The answer is 20, the &lt;a href="http://en.wikipedia.org/wiki/Least_common_multiple"&gt;least common multiple&lt;/a&gt; of 4 and 5.
&lt;/p&gt;

&lt;p&gt;
Remember that if A and B are swapped then we write A &amp;rarr; B &amp;rarr; A.  This is called a "cycle" because the path a letter takes through it always eventually comes back to the beginning.  A more common notation is (A B) instead of A &amp;rarr; B &amp;rarr; A, with the understanding that the right side wraps around to the left.  Similarly we write (Q W E R) instead of Q &amp;rarr; W &amp;rarr; E &amp;rarr; R &amp;rarr; Q.
&lt;/p&gt;

&lt;p&gt;
Given an arbitrary permutation &amp;sigma; of the 26 alphabetic keys every letter belongs to exactly one cycle.  We can find all the other letters in this cycle by following the bouncing ball.  For example, if we want to find the cycle of C then we type in C.  'R' appears on screen, so we type in 'R.'  We then see 'H' followed by 'L' followed by 'V' followed by 'C' again.  The whole cycle is therefore (C R H L V).  If, in addition to having C, R, H, L, and V permuted in that fashion, the A and B keys have been swapped, we write (A B)(C R H L V).  By convention if we write down a permutation using this notation and leave off letters that means the corresponding keys haven't been moved.
&lt;/p&gt;
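&lt;p&gt;
"Following the bouncing ball" is easy to sketch in Python (again treating a permutation as a dictionary from key to printed character; the dictionary below encodes the (A B)(C R H L V) example):
&lt;/p&gt;

```python
def cycle_of(perm, letter):
    # Collect letters until we wrap back around to the start.
    cycle = [letter]
    current = perm[letter]
    while current != letter:
        cycle.append(current)
        current = perm[current]
    return cycle

# The example from the text: (A B)(C R H L V).
perm = {"A": "B", "B": "A",
        "C": "R", "R": "H", "H": "L", "L": "V", "V": "C"}
```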

&lt;p&gt;
Let's take the permutation (A L)(E M R), i.e., the A and L keys have been swapped and the E key has been moved to M's position, M to R's, and R to E's.  What happens when we enter the name "Mel Farb?"  Remember that if a letter isn't present in any cycles it means it isn't affected by the permutation (mathematicians would say that the permutation &lt;i&gt;fixes&lt;/i&gt; that letter).
&lt;pre&gt;
Mel Farb
Rma Fleb
Erl Famb
Mea Flrb
Rml Faeb
Era Flmb
Mel Farb
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Going back to the musical analogy, imagine we have a whole orchestra.  Every letter is a musician.  Musicians playing the same part all belong to the same cycle.  The length of a cycle is the time signature in which each part is playing.  The length of the game is the number of beats it takes for the measures in the various parts (i.e., cycles) to realign, which happens to be the least common multiple of the lengths of the cycles.
&lt;/p&gt;

&lt;p&gt;
We have therefore answered our first question and much of the second.  Yes, every game is guaranteed to terminate in a finite number of steps.  If every part is being played by one musician, i.e., every cycle contains at least one letter in our input string, then the length of the game is determined wholly by the lengths of the cycles of the permutation.
&lt;/p&gt;
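&lt;p&gt;
In code, the observation is one line: the game length is the least common multiple of the cycle lengths (a sketch using the reduce-over-gcd idiom):
&lt;/p&gt;

```python
from functools import reduce
from math import gcd

def game_length(cycle_lengths):
    # The measures realign at the lcm of the "time signatures".
    return reduce(lambda a, b: a * b // gcd(a, b), cycle_lengths, 1)
```

&lt;p&gt;
For (A L)(E M R), game_length([2, 3]) is 6, matching the six-step "Mel Farb" run above.
&lt;/p&gt;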

&lt;h4&gt;The Computation&lt;/h4&gt;
&lt;p&gt;
So every permutation can be represented uniquely&lt;span class="footnote"&gt;Up to the ordering of the cycles&lt;/span&gt; using the above cycle notation.  &lt;i&gt;This means that if we construct a cycle representation then we also get a permutation.&lt;/i&gt;  To construct really long games, then, we need to maximize the size of the least common multiple of the cycle lengths under the constraint that the sum of their lengths is no greater than 26.
&lt;/p&gt;

&lt;p&gt;
For now let's work in the case where our input string contains enough distinct letters, i.e., every cycle of our permutation contains at least one letter in the input string.  Let's look at some permutations.  Take (A B)(C D E)(F G H I J).  If the input string is something like "ACF" this permutation leads to a game of length 30, since the least common multiple of 2, 3, and 5 is 5*3*2 = 30.  But we could do better by adding another cycle of length 7, giving us a game length of 210.  We can't keep adding primes, though: a cycle of length 11 is impossible because we only have 26 letters and 2+3+5+7+11 = 28.  And prime powers complicate matters further, since replacing some cycles with ones of length 8 = 2&lt;sup&gt;3&lt;/sup&gt; or 9 = 3&lt;sup&gt;2&lt;/sup&gt; can raise the least common multiple.  Finding the best structure takes a systematic search.
&lt;/p&gt;

&lt;p&gt;
If every cycle in a permutation of the 26 letters contains at least one letter of the input string then we know that the length of the game is wholly determined by the cycle structure of the permutation.  The length is the least common multiple of the size of the cycles.  We therefore want to find the cycle structure which maximizes the length of the game.  We can do that as follows.
&lt;/p&gt;

&lt;p&gt;
Mathematically, writing a number n as sum of positive integers is called a &lt;a href="http://en.wikipedia.org/wiki/Partition_(number_theory)"&gt;partition of n&lt;/a&gt;.  So we're interested in partitions of 26.  We want to find the maximum least common multiple of the sizes of the parts of all possible partitions of 26.  Whew.  For example, we can write 26 = 25+1.  The least common multiple of 25 and 1 is 25.  We can also write 26 = 1+2+3+5+7+8.  The least common multiple of these parts is 210.  It turns out there are 2436 partitions of 26, which is more than we want to check by hand.
&lt;/p&gt;

&lt;p&gt;
I have written some Python code below which calculates this number and it happens to be 1260, corresponding to cycles of length 4, 5, 7, and 9.  This means that the longest possible game, period, is 1260.  No matter what the user inputs we will never be able to beat this number.  But can we always match it?
&lt;/p&gt;
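&lt;p&gt;
The search over partitions is small enough to brute-force directly.  This sketch (my own illustration, not the redacted code) tries every partition with at most &lt;tt&gt;k&lt;/tt&gt; parts summing to at most &lt;tt&gt;n&lt;/tt&gt; and keeps the best least common multiple:
&lt;/p&gt;

```python
from math import gcd

def max_game_length(n=26, k=4):
    # Enumerate partitions of at most k parts with total <= n, in
    # non-decreasing order to avoid duplicates, tracking the max lcm.
    # Parts of length 1 never change the lcm, so start at 2.
    best = 1

    def rec(remaining, parts_left, min_part, acc):
        nonlocal best
        best = max(best, acc)
        if parts_left == 0:
            return
        for part in range(min_part, remaining + 1):
            rec(remaining - part, parts_left - 1, part,
                acc * part // gcd(acc, part))

    rec(n, k, 2, 1)
    return best
```

&lt;p&gt;
With k = 1, 2, 3, 4 this yields 26, 165, 630, and 1260, respectively.
&lt;/p&gt;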

&lt;h4&gt;From one permutation to many&lt;/h4&gt;

&lt;p&gt;
So, we know the "ideal" permutation has four cycles.  As I've said before, if the input string contains fewer than four distinct letters then at least one of the cycles will be empty and won't contribute to the length of the game.  This means that we have four cases for the number of distinct letters in the input string: one, two, three, and four or more.  Let's go case-by-case, focusing on the last case first.
&lt;/p&gt;

&lt;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;h5&gt;Four or more&lt;/h5&gt;
&lt;p&gt;
We know the ideal cycle structure in this situation (cycles of length 4,5,7, and 9), but the input string can still vary.  If the input string contains four or more distinct characters can we always find a permutation which makes the game last 1260 rounds?  Yes, because we are free to choose what letters go in what cycle.
&lt;/p&gt;
&lt;p&gt;
If our input string is "ABCDE" then we might put A in the first cycle, B in the second, C in the third, D in the fourth, and E in any of the cycles.  The remaining spots in the cycles can then be filled with whatever letters we want.  Because the length of the game is determined only by the cycle structure we are guaranteed to get a game that lasts 1260 rounds.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;Three characters&lt;/h5&gt;
&lt;p&gt;
If our input string contains three characters we're back to square one.  We might think to try using the three longest cycles from our ideal 4-5-7-9 permutation above, but that doesn't quite work.  This permutation makes the game last 315 rounds, but what is to say that there isn't a permutation with exactly three cycles that lasts longer than 315 rounds?
&lt;/p&gt;
&lt;p&gt;
As it turns out there is a permutation which yields a game that lasts 630 rounds.  This permutation has cycles of length 7, 9, and 10.  Again, there is Python code below to calculate this number exactly.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;h5&gt;The last two cases&lt;/h5&gt;
&lt;p&gt;
This case is essentially the same as the three-character case, except that we wish to find a permutation with two cycles rather than three.  Using the code below we find a permutation which leads to a game of length 165 and has cycles of length 15 and 11.
&lt;/p&gt;
&lt;p&gt;
As for the one-letter case, well, we've already done it!  If you recall a single letter cannot go through a game longer than 26 rounds before appearing again.  What's more, the permutation which produces this game can be named explicitly: (A B C ... X Y Z).
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;

&lt;h4&gt;The Code&lt;/h4&gt;
&lt;p&gt;
As promised, here is the code which produces the above numbers.  &lt;tt&gt;landau&lt;/tt&gt; implements &lt;a href="http://mathworld.wolfram.com/LandausFunction.html"&gt;Landau's function&lt;/a&gt;, which gives the maximal order of a permutation on n letters &amp;mdash; that means landau(26) = 1260.  &lt;tt&gt;part2&lt;/tt&gt; handles the two-character case and &lt;tt&gt;part3&lt;/tt&gt; the three-character case.  With slight modifications you can make them print out the cycle structure rather than the length of the game.
&lt;div class="notice warning"&gt;
At the request of Facebook portions of the code have been redacted.
&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;pre class="brush: python"&gt;#!/usr/bin/python
import sys

def gcd(a,b):
	if b == 0: return a
	return gcd(b, a % b)

def lcm(a,b):
	return (a*b)/gcd(a,b)

def lcm2(a,b,c):
	return lcm(lcm(a,b),c)

def landau(n):
	ans = [1]*(n+1)
	
	for i in range(1,n+1):
		for k in range (1,i+1):
			test = lcm(k, ans[i-k])
			if (test &gt; ans[i]):
				ans[i] = test
			
	return ans[n]

print [landau(int(n)) for n in sys.argv[1:]]&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
Of course the Facebook puzzle (remember that?) stipulates that, given an input string on the command line, we are to output the longest possible time it could take to produce that input string again.  We've solved the problem mathematically which means the "solution" program is stupidly simple.  Here it is in Ruby, just because I can:
&lt;pre class="brush: ruby"&gt;#!/usr/bin/ruby
case ARGV.join.downcase.scan(/[a-z]/).uniq.size
when 1
  puts 26
when 2
  puts 165
when 3
  puts 630
else
  puts 1260
end&lt;/pre&gt;
&lt;/p&gt;

&lt;h4&gt;The Math&lt;/h4&gt;
&lt;p&gt;
Some of you might be interested in the math, so here it is in a hyper-condensed form.  We start with what is called a &lt;a href="http://mathworld.wolfram.com/Group.html"&gt;group&lt;/a&gt;.  The set of all permutations on n letters forms a group called the &lt;a href="http://mathworld.wolfram.com/SymmetricGroup.html"&gt;symmetric group&lt;/a&gt;, denoted S&lt;sub&gt;n&lt;/sub&gt;.  We are interested in S&lt;sub&gt;26&lt;/sub&gt;, i.e., the set of all permutations of 26 letters.  For two elements &amp;sigma;, &amp;tau; in S&lt;sub&gt;n&lt;/sub&gt; we write &amp;sigma;&amp;tau; to mean their composition under the group operation.  We write &amp;sigma;&lt;sup&gt;2&lt;/sup&gt; to mean &amp;sigma;&amp;sigma;, i.e., the application of sigma twice.
&lt;/p&gt;

&lt;p&gt;
Permutations can be written using cycle notation.  Let's say &amp;sigma;(A) = B and &amp;sigma;(B) = A.  That is, &amp;sigma; swaps the letters A and B.  This corresponds to the cycle (A B).  Similarly, we might have &amp;sigma;(Q) = W, &amp;sigma;(W) = E, &amp;sigma;(E) = R, and &amp;sigma;(R) = Q, which corresponds to the cycle (Q W E R).  As in the explanation of the puzzle we can do the opposite: by writing down cycles we are actually choosing a permutation which contains those cycles.
&lt;/p&gt;

&lt;p&gt;
For every permutation &amp;sigma; in S&lt;sub&gt;n&lt;/sub&gt; there is a smallest positive integer k such that &amp;sigma;&lt;sup&gt;k&lt;/sup&gt; is the identity element, i.e., the permutation which fixes every letter.  This k is called the order of &amp;sigma;, denoted |&amp;sigma;|.  This is equivalent to the statement that every game as described initially will stop in a finite number of steps.  The order of &amp;sigma; is the least common multiple of the lengths of its cycles when written as a product of disjoint cycles.
&lt;/p&gt;

&lt;p&gt;
Since every element has a well-defined order the function g(n) = max{|&amp;sigma;| : &amp;sigma; in S&lt;sub&gt;n&lt;/sub&gt;} is also well-defined.  In fact, this function is called &lt;a href="http://mathworld.wolfram.com/LandausFunction.html"&gt;Landau's function&lt;/a&gt;.  I didn't learn of this function until I showed my solution to some math friends and they pointed me to the definition.  It turns out that g(26) = 1260.  You can see &lt;a href="http://www.research.att.com/~njas/sequences/A000793"&gt;the values&lt;/a&gt; of g(n) for 0 &amp;le; n &amp;le; 47 at the Online Encyclopedia of Integer Sequences (yes, such a thing exists).  Or you can use my Python program to calculate it &amp;mdash; it runs quite quickly!
&lt;/p&gt;


&lt;h4&gt;Das Ende&lt;/h4&gt;
&lt;p&gt;
The statement of the original puzzle is as follows:
&lt;blockquote&gt;Given a particular name (e.g. "Mike Korn"), what is the maximum number of times the user might have to try typing in his name (or whatever has appeared on the screen) until his real name appears, if the manner in which the keys have been mixed up is unknown? &lt;/blockquote&gt;
&lt;/p&gt;
&lt;p&gt;
From the above arguments we have learned that the answer depends &lt;em&gt;only&lt;/em&gt; on the number of distinct characters in the input string.  The Ruby code above outputs that number given the input.  Furthermore, it is not difficult to generate the permutation as well, by filling in the empty spots in the cycle structure of our ideal permutation in each case.  So, in that regard, we have actually solved a harder puzzle than the Facebook puzzle.
&lt;/p&gt;
&lt;p&gt;
I hope you all enjoyed reading this as much as I enjoyed writing it.  My daily life doesn't afford me many opportunities to return to my old math haunts.  Cheers, and happy coding!
&lt;/p&gt;

&lt;h4&gt;PS&lt;/h4&gt;
&lt;p&gt;
Here is some C code to print off the silly Facebook email address.
&lt;pre class="brush: c"&gt;#include &lt;stdio.h&gt;

int main(int argc, char **argv) {
    printf("%u\n", 0xFACEB00C&gt;&gt;2);
    return 0;
}&lt;/pre&gt;
&lt;/p&gt;</description>
      <pubDate>Thu, 19 Apr 2007 23:37:14 +0000</pubDate>
      <link>http://20bits.com/article/facebook-job-puzzles-korn-shell</link>
    </item>
    <item>
      <title>10 Tips for Optimizing MySQL Queries (That don't suck)</title>
      <description>&lt;p&gt;
Justin Silverton at &lt;a href="http://www.whenpenguinsattack.com/"&gt;Jaslabs&lt;/a&gt; has a supposed list of &lt;a href="http://www.whenpenguinsattack.com/2007/04/09/10-tips-for-optimizing-mysql-queries/"&gt;10 tips for optimizing MySQL queries&lt;/a&gt;.  I couldn't read this and let it stand because this list is really, really bad.  Some &lt;a href="http://immike.net/blog/2007/04/09/how-not-to-optimize-a-mysql-query/"&gt;guy named Mike&lt;/a&gt; noted this, too.  So in this entry I'll do two things: first, I'll explain why his list is bad; second, I'll present my own list which, hopefully, is much better.  Onward, intrepid readers!
&lt;/p&gt;
&lt;!--more--&gt;
&lt;h3&gt;Why That List Sucks&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h4&gt;He's swinging for the top of the trees&lt;/h4&gt;
&lt;p&gt;
The rule in any situation where you want to optimize some code is that you first profile it and then find the bottlenecks.  Mr. Silverton, however, aims right for the tippy top of the trees.  I'd say 60% of database optimization is properly understanding SQL and the basics of databases.  You need to understand joins vs. subselects, column indices, how to normalize data, etc.  The next 35% is understanding the performance characteristics of your database of choice.  &lt;tt&gt;COUNT(*)&lt;/tt&gt; in MySQL, for example, can either be almost-free or painfully slow depending on which storage engine you're using.  Other things to consider: under what conditions does your database invalidate caches, when does it sort on disk rather than in memory, when does it need to create temporary tables, etc.  The final 5%, where few ever need venture, is where Mr. Silverton spends most of his time.  Never once in my life have I used &lt;tt&gt;SQL_SMALL_RESULT&lt;/tt&gt;.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Good problems, bad solutions&lt;/h4&gt;
&lt;p&gt;
There are cases when Mr. Silverton does note a good problem.  MySQL will indeed use a dynamic row format if it contains variable length fields like &lt;tt&gt;TEXT&lt;/tt&gt; or &lt;tt&gt;BLOB&lt;/tt&gt;, which, in this case, means sorting needs to be done on disk.  The solution is not to eschew these datatypes, but rather to split off such fields into an associated table.  The following schema represents this idea:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts (
	id int unsigned not null auto_increment,
	author_id int unsigned not null,
	created timestamp not null,
	PRIMARY KEY(id)
);

CREATE TABLE posts_data (
	post_id int unsigned not null,
	body text,
	PRIMARY KEY(post_id)
);&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;That's just...yeah&lt;/h4&gt;
&lt;p&gt;
Some of his suggestions are just mind-boggling, e.g., "remove unnecessary parentheses."  It really doesn't matter whether you do &lt;tt&gt;SELECT * FROM posts WHERE (author_id = 5 AND published = 1)&lt;/tt&gt; or &lt;tt&gt;SELECT * FROM posts WHERE author_id = 5 AND published = 1&lt;/tt&gt;.  None.  Any decent DBMS is going to optimize these away.  This level of detail is akin to wondering, when writing a C program, whether the post-increment or pre-increment operator is faster.  Really, if that's where you're spending your energy, it's a surprise you've written any code at all.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;My list&lt;/h3&gt;

&lt;p&gt;
Let's see if I fare any better.  I'm going to start from the most general.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;h4&gt;Benchmark, benchmark, benchmark!&lt;/h4&gt;
&lt;p&gt;
You're going to need numbers if you want to make a good decision.  What queries are the worst?  Where are the bottlenecks?  Under what circumstances are you generating bad queries?  Benchmarking will let you simulate high-stress situations and, with the aid of profiling tools, expose the cracks in your database configuration.  Tools of the trade include &lt;a href="http://vegan.net/tony/supersmack/"&gt;supersmack&lt;/a&gt;, &lt;a href="http://httpd.apache.org/docs/2.2/programs/ab.html"&gt;ab&lt;/a&gt;, and &lt;a href="http://sysbench.sourceforge.net/"&gt;SysBench&lt;/a&gt;.  These tools either hit your database directly (e.g., supersmack) or simulate web traffic (e.g., ab).
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Profile, profile, profile!&lt;/h4&gt;
&lt;p&gt;
So, you're able to generate high-stress situations, but now you need to find the cracks.  This is what profiling is for.  Profiling enables you to find the bottlenecks in your configuration, whether they be in memory, CPU, network, disk I/O, or, what is more likely, some combination of all of them.
&lt;/p&gt;
&lt;p&gt;
The very first thing you should do is turn on the &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html"&gt;MySQL slow query log&lt;/a&gt; and install &lt;a href="http://mtop.sourceforge.net/"&gt;mtop&lt;/a&gt;.  This will give you access to information about the absolute worst offenders.  Have a ten-second query ruining your web application?  These guys will show you the query right off.
&lt;/p&gt;
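&lt;p&gt;
For reference, turning on the slow query log is a two-line addition to &lt;tt&gt;my.cnf&lt;/tt&gt; (a sketch using the 5.0-era option names; the log path and threshold here are just examples, so check the manual for your version):
&lt;/p&gt;
&lt;pre&gt;[mysqld]
log-slow-queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2&lt;/pre&gt;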
&lt;p&gt;
After you've identified the slow queries you should learn about the MySQL internal tools, like &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/explain.html"&gt;&lt;tt&gt;EXPLAIN&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/show-status.html"&gt;&lt;tt&gt;SHOW STATUS&lt;/tt&gt;&lt;/a&gt;, and &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/show-processlist.html"&gt;&lt;tt&gt;SHOW PROCESSLIST&lt;/tt&gt;&lt;/a&gt;.  These will tell you what resources are being spent where, and what side effects your queries are having, e.g., whether your heinous triple-join subselect query is sorting in memory or on disk.  Of course, you should also be using your usual array of command-line profiling tools like top, procinfo, vmstat, etc. to get more general system performance information.
&lt;/p&gt;
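&lt;p&gt;
As a quick illustration, prefixing a suspect query with &lt;tt&gt;EXPLAIN&lt;/tt&gt; costs almost nothing to run and shows how MySQL plans to execute it (the table here is hypothetical):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;EXPLAIN SELECT * FROM posts WHERE author_id = 5;&lt;/pre&gt;
&lt;p&gt;
A &lt;tt&gt;NULL&lt;/tt&gt; in the &lt;tt&gt;key&lt;/tt&gt; column means no index was used, and &lt;tt&gt;Using filesort&lt;/tt&gt; or &lt;tt&gt;Using temporary&lt;/tt&gt; in the &lt;tt&gt;Extra&lt;/tt&gt; column usually marks a query worth rewriting.
&lt;/p&gt;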
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Tighten Up Your Schema&lt;/h4&gt;
&lt;p&gt;
Before you even start writing queries you have to design a schema.  Remember that the memory requirements for a table are going to be around &lt;tt&gt;#entries * size of a row&lt;/tt&gt;.  Unless you expect every person on the planet to register 2.8 trillion times on your website you do not in fact need to make your user_id column a &lt;tt&gt;BIGINT&lt;/tt&gt;.  Likewise, if a text field will always be a fixed length (e.g., a US zipcode, which always has a canonical representation of the form "XXXXX-XXXX") then a &lt;tt&gt;VARCHAR&lt;/tt&gt; declaration just adds a superfluous byte for every row.
&lt;/p&gt;
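&lt;p&gt;
Concretely, the advice above amounts to declarations like these (a hypothetical table, just to illustrate the column types):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE users (
	id int unsigned not null auto_increment,  -- not BIGINT
	zipcode char(10) not null,                -- fixed width, so not VARCHAR
	PRIMARY KEY(id)
);&lt;/pre&gt;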
&lt;p&gt;
Some people pooh-pooh database normalization, saying it produces unnecessarily complex schemas.  However, proper normalization minimizes redundant data.  Fundamentally that means a smaller overall footprint at the cost of performance &amp;mdash; the usual performance/memory tradeoff found everywhere in computer science.  The best approach, IMO, is to normalize first and denormalize where performance demands it.  Your schema will be more logical and you won't be optimizing prematurely.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Partition Your Tables&lt;/h4&gt;
&lt;p&gt;
Often you have a table in which only a few columns are accessed frequently.  On a blog, for example, one might display entry titles in many places (e.g., a list of recent posts) but only ever display teasers or the full post bodies once on a given page.  &lt;strike&gt;Horizontal&lt;/strike&gt; vertical partitioning helps:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts (
	id int unsigned not null auto_increment,
	author_id int unsigned not null,
	title varchar(128),
	created timestamp not null,
	PRIMARY KEY(id)
);

CREATE TABLE posts_data (
	post_id int unsigned not null,
	teaser text,
	body text,
	PRIMARY KEY(post_id)
);&lt;/pre&gt;
&lt;p&gt;
The above represents a situation where one is optimizing for reading.  Frequently accessed data is kept in one table while infrequently accessed data is kept in another.  Since the data is now partitioned, the infrequently accessed data takes up less memory.  You can also optimize for writing: frequently &lt;em&gt;changed&lt;/em&gt; data can be kept in one table, while infrequently changed data can be kept in another.  This allows more efficient caching, since MySQL no longer needs to expire the cache for data which probably hasn't changed.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Don't Overuse Artificial Primary Keys&lt;/h4&gt;
&lt;p&gt;
Artificial primary keys are nice because they can make the schema less volatile.  If we stored geography information in the US based on zip code, say, and the zip code system suddenly changed we'd be in a bit of trouble.  On the other hand, many times there are perfectly fine natural keys.  One example would be a join table for many-to-many relationships.  What &lt;strong&gt;not&lt;/strong&gt; to do:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts_tags (
	relation_id int unsigned not null auto_increment,
	post_id int unsigned not null,
	tag_id int unsigned not null,
	PRIMARY KEY(relation_id),
	UNIQUE INDEX(post_id, tag_id)
);&lt;/pre&gt;
&lt;p&gt;
Not only is the artificial key entirely redundant given the column constraints, but the number of post-tag relations is now limited by the maximum value of an integer.  Instead one should do:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;CREATE TABLE posts_tags (
	post_id int unsigned not null,
	tag_id int unsigned not null,
	PRIMARY KEY(post_id, tag_id)
);&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Learn Your Indices&lt;/h4&gt;
&lt;p&gt;
Often your choice of indices will make or break your database.  For those who haven't progressed this far in their database studies, an index is a sort of hash.  If we issue the query &lt;tt&gt;SELECT * FROM users WHERE last_name = 'Goldstein'&lt;/tt&gt; and &lt;tt&gt;last_name&lt;/tt&gt; has no index then your DBMS must scan every row of the table and compare it to the string 'Goldstein.'  An index is usually a B-tree (though there are other options) which speeds up this comparison considerably.
&lt;/p&gt;
&lt;p&gt;
You should probably create indices for any field on which you are selecting, grouping, ordering, or joining.  Obviously each index requires space proportional to the number of rows in your table, so too many indices winds up taking more memory.  You also incur a performance hit on write operations, since every write now requires that the corresponding index be updated.  There is a balance point which you can uncover by profiling your code.  This varies from system to system and implementation to implementation.
&lt;/p&gt;
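<p>
To continue the &lt;tt&gt;last_name&lt;/tt&gt; example, a sketch of adding the index and re-checking the query (the index name is arbitrary):
</p>
&lt;pre class="brush: sql"&gt;ALTER TABLE users ADD INDEX last_name_idx (last_name);

-- EXPLAIN should now report access type "ref" via last_name_idx
-- rather than type "ALL", i.e., a full table scan.
EXPLAIN SELECT * FROM users WHERE last_name = 'Goldstein';&lt;/pre&gt;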
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;SQL is Not C&lt;/h4&gt;
&lt;p&gt;
C is the canonical procedural programming language and the greatest pitfall for a programmer looking to show off his database-fu is that he fails to realize that SQL is not procedural (nor is it functional or object-oriented, for that matter).  Rather than thinking in terms of data and operations on data one must think of sets of data and relationships among those sets.  This usually crops up with the improper use of a subquery:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT a.id, 
	(SELECT MAX(created) 
	FROM posts 
	WHERE author_id = a.id) 
AS latest_post
FROM authors a&lt;/pre&gt;
&lt;p&gt;
Since this subquery is correlated, i.e., references a table in the outer query, one should convert the subquery to a join.
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p
	ON (a.id = p.author_id)
GROUP BY a.id&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;Understand your engines&lt;/h4&gt;
&lt;p&gt;
MySQL has two primary storage engines: MyISAM and InnoDB.  Each has its own performance characteristics and considerations.  In the broadest sense MyISAM is good for read-heavy data and InnoDB is good for write-heavy data, though there are cases where the opposite is true.  The biggest gotcha is how the two differ with respect to the &lt;tt&gt;COUNT&lt;/tt&gt; function.
&lt;/p&gt;
&lt;p&gt;
MyISAM keeps an internal cache of table meta-data like the number of rows.  This means that, generally, &lt;tt&gt;COUNT(*)&lt;/tt&gt; incurs no additional cost for a well-structured query.  InnoDB, however, has no such cache.  For a concrete example, let's say we're trying to paginate a query.  If you have a query &lt;tt&gt;SELECT * FROM users LIMIT 5,10&lt;/tt&gt;, let's say, running &lt;tt&gt;SELECT COUNT(*) FROM users&lt;/tt&gt; to get the total number of rows is essentially free with MyISAM but requires a full scan with InnoDB.  MySQL has a &lt;tt&gt;SQL_CALC_FOUND_ROWS&lt;/tt&gt; option which tells the server to calculate the number of matching rows as it runs the query; the count can then be retrieved by executing &lt;tt&gt;SELECT FOUND_ROWS()&lt;/tt&gt;.  This is very MySQL-specific, but can be necessary in certain situations, particularly if you use InnoDB for its other features (e.g., row-level locking, stored procedures, etc.).
&lt;/p&gt;
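&lt;p&gt;
The pagination case above looks like this in practice:
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;SELECT SQL_CALC_FOUND_ROWS * FROM users LIMIT 5,10;
SELECT FOUND_ROWS();  -- total rows the first query would have matched&lt;/pre&gt;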
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;MySQL specific shortcuts&lt;/h4&gt;
&lt;p&gt;
MySQL provides many extensions to SQL which help performance in common use scenarios.  Among these are &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-select.html"&gt;&lt;tt&gt;INSERT ... SELECT&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html"&gt;&lt;tt&gt;INSERT ... ON DUPLICATE KEY UPDATE&lt;/tt&gt;&lt;/a&gt;, and &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/replace.html"&gt;&lt;tt&gt;REPLACE&lt;/tt&gt;&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
I rarely hesitate to use the above since they are so convenient and provide real performance benefits in many situations.  MySQL has other keywords which are more dangerous, however, and should be used sparingly.  These include &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html"&gt;&lt;tt&gt;INSERT DELAYED&lt;/tt&gt;&lt;/a&gt;, which tells MySQL that it is not important to insert the data immediately (say, e.g., in a logging situation).  The problem is that under high load the insert might be delayed indefinitely, causing the insert queue to balloon.  You can also give MySQL &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/index-hints.html"&gt;index hints&lt;/a&gt; about which indices to use.  MySQL gets it right most of the time, and when it doesn't it is usually because of a bad schema or a poorly written query.
&lt;/p&gt;
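&lt;p&gt;
As an example of the convenient kind, a counter that must either be created or incremented becomes one round trip with &lt;tt&gt;INSERT ... ON DUPLICATE KEY UPDATE&lt;/tt&gt; instead of a &lt;tt&gt;SELECT&lt;/tt&gt; followed by an &lt;tt&gt;INSERT&lt;/tt&gt; or &lt;tt&gt;UPDATE&lt;/tt&gt; (the table is hypothetical, with a unique key on &lt;tt&gt;url&lt;/tt&gt;):
&lt;/p&gt;
&lt;pre class="brush: sql"&gt;INSERT INTO page_views (url, views)
VALUES ('/about', 1)
ON DUPLICATE KEY UPDATE views = views + 1;&lt;/pre&gt;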
&lt;/li&gt;
&lt;li&gt;
&lt;h4&gt;And one for the road...&lt;/h4&gt;
Last, but not least, read Peter Zaitsev's &lt;a href="http://mysqlperformanceblog.com/"&gt;MySQL Performance Blog&lt;/a&gt; if you're into the nitty-gritty of MySQL performance.  He covers many of the finer aspects of database administration and performance.
&lt;/li&gt;
&lt;/ol&gt;</description>
      <pubDate>Tue, 10 Apr 2007 00:27:34 +0000</pubDate>
      <link>http://20bits.com/article/10-tips-for-optimizing-mysql-queries-that-dont-suck</link>
    </item>
    <item>
      <title>What is a word? An introduction to computational linguistics.</title>
      <description>&lt;p&gt;
What is a word?  This question is one of the most deceptively simple ones I know.  Everyone will say they know the answer, or at least say they know one when they see one, but even native speakers of a language can and do disagree.  The dictionary isn't much help since &lt;a href="http://dictionary.reference.com/browse/word"&gt;many&lt;/a&gt; &lt;a href="http://m-w.com/dictionary/word"&gt;dictionaries&lt;/a&gt; have multi-sentence, ad hoc definitions which basically boil down to "a word is a unit of language that means something, sort of."
&lt;/p&gt;

&lt;p&gt;
Let's jump ahead and assume we know what a word is, or that we can get native speakers to identify most words most of the time.  Furthermore, let's say that our goal is to get a computer to understand a given language.  Since humans learn languages initially by learning words and basic grammar it seems like a good choice to try and get computers to recognize words.  So, our goal: given a string of English letters insert spaces between the words.
&lt;/p&gt;

&lt;!--more--&gt;
&lt;p&gt;
&lt;h4&gt;What is a word?&lt;/h4&gt;
To show that the above exercise isn't totally contrived let's look at some of the subtleties in the idea of the word.  This is only for people interested in the "linguistics" part of "computational linguistics," but if you want to read it then &lt;a href="#" onclick="Effect.SlideDown('whatsaword')"&gt;click here&lt;/a&gt;.
&lt;div id="whatsaword" style="display: none"&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Words are linguistic constructs, not orthographic ones.&lt;/h5&gt;
Language precedes writing.  Children can speak and comprehend a language before they learn to read and write in that language.  That is to say nothing of people who are illiterate or languages which have no formalized writing system.  Furthermore, when learning a foreign language one typically learns words and basic grammar long before learning how to write, particularly if the writing system is dramatically different, e.g., an English-speaker learning Chinese or Arabic.  This is all to say that words are not just formal orthographic constructs like quotation marks or apostrophes.  Words appear to have some linguistic reality and are therefore worth studying from a language perspective.&lt;/p&gt;
&lt;p&gt;
None of this tells us what a word is, only that speakers of a language believe there is such a thing as a word.  English speakers can still say, however, that what we see on paper is more-or-less accurate: spaces represent breaks between words.  This dodges the question since the idea of a "word" exists cross-linguistically.  The definition of a word, therefore, should encompass all such contingencies.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Synthetic languages&lt;/h5&gt;
A few definitions, first.  A morpheme is the smallest unit of language with a meaning.  "Dogs" has two morphemes: "dog" and "s," with the former indicating a canine and the latter indicating multiplicity.  Similarly, "looked" has two morphemes, with the suffix "-ed" indicating the tense of the verb "look."  Synthetic languages have a high morpheme/word ratio.  English-speakers might be familiar with the comically long German word.  In fact, German allows for essentially arbitrarily long words.  &lt;em&gt;Donaudampfschiffahrtsgesellschaftskapitän&lt;/em&gt;, for example, means "Danube steamship company captain."  Is this one word or four?  And even in the English translation, is "steamship" one word, or two?
&lt;/p&gt;
&lt;p&gt;
At another extreme are synthetic languages like Hungarian, which have many more grammatical affixes than English or German.  The ideas of "conditional," "future," etc. are all marked by single morphemes attached to a word.  In English "I would go" is three words, but in Hungarian it would appear to be just one or perhaps two.
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
&lt;h5&gt;Phonetics versus Orthography&lt;/h5&gt;
Most linguists, though, would say the above isn't even that relevant.  At best it just provides us with more evidence that words are something real.  When we speak, however, there is no "space" between words in the same way that there are spaces between words in written English.  If you've ever learned a foreign language you probably remember a point at which you realized you were hearing words rather than sounds.  Before that it sounded like one continuous stream of nonsense &amp;mdash; and you were right about the "one continuous stream" part, anyhow.
&lt;/p&gt;
&lt;p&gt;
Once you have learned some of the language, however, the patterns become clear in your mind and words jump out at you.  That is, humans are able to take a single unbroken string of "characters" (i.e., sounds) and break them into words.  Imagine if instead of parsing a string of English characters we were parsing some phonetic representation of a sentence.  Suddenly our exercise no longer seems so uninteresting; indeed, we'd be doing something very much like what humans do when they parse a language they understand.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;h4&gt;Assumptions&lt;/h4&gt;
Obviously we can't integrate all of the subtleties above as that would be tantamount to writing a computer program which actually processed text in the same way humans do.  Rather, we will work under the following assumptions: first, we already have a database (called the "lexicon") of words; second, this database is complete.  The first assumption isn't totally off-the-wall since it's a general working assumption among linguists that humans have just such a database.  The second, however, is much harder to swallow since the lexicon is typically understood to contain root morphemes plus general information about the morphology, phonology, phonotactics, etc. of the language.&lt;/p&gt;

&lt;p&gt;If I said "koop" were a verb, you'd know right away that "kooped," "koops," "kooper," etc. were all also valid words.  Likewise, even though "cromulent" is not actually an English word an English speaker knows that it could be (and that, furthermore, it would probably be an adjective), but that "plkdjfhg" could never be an English word.  Our database, however, is very dumb and very uncompressed: every permutation of every word should be present, otherwise that permutation won't be counted as a word.  We're only making this assumption to simplify the problem.  I may be a pretty good programmer, but I'm not good enough to write a computer program which automatically learns a language's syntax, morphology, and phonology.&lt;/p&gt;

&lt;p&gt;Enough chit chat, let's get to the code.&lt;/p&gt;

&lt;p&gt;
&lt;h4&gt;The Algorithm&lt;/h4&gt;
The algorithm I'm going to use is a simple probabilistic &lt;a href="http://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; algorithm.  Let's say we have a string like "therentisdue" and want to parse it as "the-rent-is-due."  Assuming our training data is representative of the language as a whole (a big assumption, for sure), the probability of each word is the number of occurrences of the word in the data divided by the total number of words in the data.  The idea is that the best parse of a string, given our training data, is the parse which has the highest probability of occurring.
&lt;/p&gt;

&lt;p&gt;
For the CS students out there this should scream "dynamic programming."  For everyone else, I'll explain.  The most obvious way to find the parse with the highest probability is to enumerate every possible parse and pick the one with the highest probability.  Implementing the algorithm this way is intractable since there are 2&lt;sup&gt;n-1&lt;/sup&gt; parses (why?).  Instead we'll do the following.  The pseudocode:
&lt;pre class="brush: pseudocode"&gt;BestParse[0] := ""
FOR i in [1..length of StringToParse] DO
	BestParse[i] := undefined   // an undefined parse has infinite cost
	FOR j in [0..i) DO
		parse := BestParse[j] + StringToParse[j,i]

		IF COST(parse) &lt; COST(BestParse[i]) THEN
			BestParse[i] := parse
		ENDIF
	ENDFOR
ENDFOR

DEFINE COST(parse)
	RETURN -LOG2(PROBABILITY(parse))
END

DEFINE PROBABILITY(parse)
	RETURN product of the frequencies of each word in parse
END&lt;/pre&gt;
&lt;/p&gt;
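&lt;p&gt;
To make the pseudocode concrete, here is a short, self-contained sketch of the same algorithm in Python.  The word frequencies are made up purely for illustration (my actual implementation, linked below, trains on the KJV Bible):
&lt;/p&gt;

```python
import math

def best_parse(s, freq):
    """Return the most probable segmentation of s as a list of words."""
    total = sum(freq.values())

    def cost(word):
        # -log2(probability); strings not in the lexicon get infinite cost
        if word in freq:
            return -math.log2(freq[word] / total)
        return math.inf

    n = len(s)
    # best[i] = (cost, parse) for the first i characters of s
    best = [(0.0, [])] + [(math.inf, None)] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            # extend the best parse of s[:j] with the chunk s[j:i]
            c = best[j][0] + cost(s[j:i])
            if c != math.inf:
                candidates.append((c, best[j][1] + [s[j:i]]))
        if candidates:
            best[i] = min(candidates)  # cheapest = most probable
    return best[n][1]

# Hypothetical lexicon with toy frequencies.
freq = {"the": 50, "rent": 3, "is": 40, "ren": 1, "tis": 1, "due": 2}
print("-".join(best_parse("therentisdue", freq)))  # prints "the-rent-is-due"
```

&lt;p&gt;
Just as in the pseudocode, &lt;tt&gt;best[i]&lt;/tt&gt; holds the cheapest parse of the first &lt;tt&gt;i&lt;/tt&gt; characters, so each prefix is solved exactly once.
&lt;/p&gt;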
&lt;p&gt;
Let the input string be &lt;tt&gt;s&lt;/tt&gt;.  At each point &lt;tt&gt;i&lt;/tt&gt;, that is, for the initial &lt;tt&gt;i&lt;/tt&gt;-length substring of &lt;tt&gt;s&lt;/tt&gt;, determine what the best parse up to &lt;tt&gt;i&lt;/tt&gt; is.  Now, let's say we know what the best parse at &lt;tt&gt;i&lt;/tt&gt; is for some fixed &lt;tt&gt;i&lt;/tt&gt;.  To find the best parse at &lt;tt&gt;i+1&lt;/tt&gt; we try to insert a break after each initial &lt;tt&gt;j&lt;/tt&gt; substring, for &lt;tt&gt;j &amp;lt; i+1&lt;/tt&gt;.  Since we've been keeping track of the best parse (and cost) at each such &lt;tt&gt;j&lt;/tt&gt; the whole time, we just see which break insertion is the cheapest.
&lt;/p&gt;

&lt;p&gt;
Here is an illustration, again with "therentisdue."  Let's say we have "therenti" parsed so far.  This means we know the best parse for each initial substring of this string, e.g., "t", "th", "the", etc.  The best parse will probably be "the-rent-i" since each of these is a word and every other parse contains at least one non-word.  Now let's see how the algorithm determines the best parse of "therentis" from this.&lt;/p&gt;

&lt;p&gt;
After each character in the string we need to decide whether or not to insert a break.  Should we insert a space after the first character?  Well, yes, since the best parse of a single character is definitely that character.  So at the first step we get "t-|herentis."  If we're favoring single letters over non-words (it's our choice to make) then the best parse after the second character would be "t-h-|erentis."  After the third, however, the parse is "the-|rentis" since "the" is a word and therefore the best parse of the first three letters is "the" (we know this because, by assumption, we have already computed the best parse for "the").  Next we get "the-r-|entis," followed by "the-re-|ntis," and so on, until we get to "the-ren-|tis."  After this step we try "the-rent-|is."  This is a very good parse since we have three words.  Finally, we try "the-rent-i-|s," which has a lower probability than the previous parse because "s" is not a word.  Therefore "the-rent-is" is the parse which we save as the best parse of "therentis."
&lt;/p&gt;
&lt;p&gt;
I implemented this algorithm using C++, which you can download here.  By default it uses the KJV Bible as training data, which means what it considers words can be a little funny.  For example, "sin" is considered a very common word.
&lt;div class="download"&gt;
&lt;a href="http://assets.20bits.com/downloads/spaces-1.0.0.tar.bz2"&gt;spaces-1.0.0.tar.bz2&lt;/a&gt; (840K)
&lt;/div&gt;
&lt;/p&gt;</description>
      <pubDate>Wed, 04 Apr 2007 15:09:00 +0000</pubDate>
      <link>http://20bits.com/article/what-is-a-word-an-introduction-to-computaitonal-linguistics</link>
    </item>
  </channel>
</rss>
