<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>quuxlabs</title>
	<atom:link href="http://www.quuxlabs.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.quuxlabs.com</link>
	<description>transmuting ideas into value</description>
	<lastBuildDate>Thu, 19 May 2011 12:28:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Paper Token: Gutenberg&#8217;s version of One Time Passwords</title>
		<link>http://www.quuxlabs.com/blog/2010/09/paper-token-gutenbergs-version-of-one-time-passwords/</link>
		<comments>http://www.quuxlabs.com/blog/2010/09/paper-token-gutenbergs-version-of-one-time-passwords/#comments</comments>
		<pubDate>Wed, 29 Sep 2010 08:21:22 +0000</pubDate>
		<dc:creator>Alexandre Dulaunoy</dc:creator>
				<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[infosec]]></category>
		<category><![CDATA[otp]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[password]]></category>
		<category><![CDATA[rsa]]></category>
		<category><![CDATA[securid]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[simple]]></category>
		<category><![CDATA[token]]></category>

		<guid isPermaLink="false">http://www2.quuxlabs.com/?p=431</guid>
		<description><![CDATA[Nowadays, ensuring proper security for information technology is a mandatory requirement for governments, companies and even individuals. Even protected and hardened system infrastructures are vulnerable to hacking attacks and malware (see the latest press about the &#8220;stuxnet&#8221; worm infiltrating Iran&#8217;s &#8230; <a href="http://www.quuxlabs.com/blog/2010/09/paper-token-gutenbergs-version-of-one-time-passwords/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Nowadays, ensuring proper security for information technology is a mandatory requirement for governments, companies and even individuals. Even protected and hardened system infrastructures are vulnerable to hacking attacks and malware (see the latest press about the <a href="http://www.bbc.co.uk/news/technology-11388018">&#8220;stuxnet&#8221; worm infiltrating Iran&#8217;s nuclear power plants</a>). However, high security comes at a price: a lot of security technologies are not easily accessible to their users &#8212; especially with regard to the cost of their acquisition or maintenance. So-called <a href="http://en.wikipedia.org/wiki/One-time_password">One-Time Passwords (OTP)</a> are one such example.</p>
<p><span id="more-431"></span></p>

<h1>Introducing One-Time Passwords (OTP)</h1>
<p>What are one-time passwords? Wikipedia gives the <a href="http://en.wikipedia.org/wiki/One-time_password">following definition</a>:</p>
<blockquote><p>A one-time password (OTP) is a password that is valid for only one login session or transaction. OTPs avoid a number of shortcomings that are associated with traditional (static) <a href="http://en.wikipedia.org/wiki/Passwords">passwords</a>. The most important shortcoming that is addressed by OTPs is that, in contrast to static passwords, they are not vulnerable to <a href="http://en.wikipedia.org/wiki/Replay_attack">replay attacks</a>. This means that, if a potential intruder manages to record an OTP that was already used to log into a service or to conduct a transaction, he will not be able to abuse it since it will be no longer valid. On the downside, OTPs cannot be memorized by human beings. Therefore they require additional technology in order to work.</p></blockquote>
<p>Now let us look at why and when one would like to use OTP. For instance, imagine being at a place such as a cybercafé or a hotel where you cannot trust the local IT infrastructure. Unfortunately, you really need to access your email inbox through a non-encrypted communication channel (e.g. <a href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP</a> instead of <a href="http://en.wikipedia.org/wiki/HTTP_Secure">HTTPS</a>, or <a href="http://en.wikipedia.org/wiki/IMAP">IMAP</a> instead of <a href="http://en.wikipedia.org/wiki/IMAPS">IMAPS</a>) but at the same time you don&#8217;t want to risk revealing your email password to the rest of the world. An OTP provides a solution: a unique password that changes at each authentication. Even if the password is captured by some bad guys as it travels across the networks, they will not be able to re-use it for their malicious purposes, because it is valid only once, or only for a short period of time.</p>
<div id="attachment_806" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.quuxlabs.com/wp-content/uploads/2010/09/internet-cafe_flickr_user-feserc.jpg"><img class="size-medium wp-image-806" title="Browsing at an Internet Café" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/internet-cafe_flickr_user-feserc-300x225.jpg" alt="Browsing at an Internet Café" width="300" height="225" /></a><p class="wp-caption-text">Browsing at an Internet Café: Not the most secure place for your confidential passwords. (Picture by Flickr user feserc)</p></div>
<h1>Commercial OTP solutions</h1>
<p>So while OTPs are an interesting solution for such scenarios, they quite literally come at a price. A proprietary OTP solution can be quite expensive, especially if we consider that the required security devices such as <a href="http://en.wikipedia.org/wiki/Security_token">hardware security tokens</a> can cost more than 50 EUR per person per year. Additionally, hardware tokens and their corresponding infrastructure are often proprietary and require a lot of effort to set up and maintain.</p>
<div id="attachment_689" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.quuxlabs.com/wp-content/uploads/2010/09/MG_8477.jpg"><img class="size-medium wp-image-689" title="OTP key ring" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/MG_8477-300x200.jpg" alt="Hardware OTP token (RSA and Cryptocard)" width="300" height="200" /></a><p class="wp-caption-text">Hardware OTP tokens (RSA SecureID and Cryptocard)</p></div>
<p>Can we improve the adoption of this security technology? How could we adapt OTP &#8212; the tools, the processes, etc. &#8212; to make this approach more accessible? And can we make it cheaper (e.g. for providing emergency access to a large number of users)?</p>
<p>As OTP rely on cryptographic functions to operate, we started our brainstorming by digging into one of the existing open standards describing OTP operation: <a href="http://www.ietf.org/rfc/rfc4226.txt">&#8220;HOTP: An HMAC-Based One-Time Password Algorithm&#8221; (RFC 4226)</a>. HOTP describes a counter-based OTP relying on the <a href="http://en.wikipedia.org/wiki/HMAC">HMAC-SHA-1 algorithm</a>. In other words, the token relies on a <a href="http://en.wikipedia.org/wiki/Counter">counter</a> that advances with each authentication, combined with the secret embedded in the <a href="http://en.wikipedia.org/wiki/Security_token">token</a> (and shared with the remote server responsible for verifying the authentication).</p>
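<p>To make the HOTP computation concrete, here is a minimal Python sketch of the algorithm as specified in RFC 4226 (the function name <code>hotp</code> is ours; the secret is the RFC&#8217;s own appendix test vector, so, again, do not use it in production):</p>

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # HMAC-SHA-1 over the 8-byte big-endian counter (RFC 4226, section 5)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte window
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep only the requested number of decimal digits, zero-padded
    return str(code % 10 ** digits).zfill(digits)

# RFC 4226 test secret: ASCII "12345678901234567890"
print(hotp(b"12345678901234567890", 0))  # -> 755224
```

<p>The server performs exactly the same computation with its copy of the secret and counter, and compares the results.</p>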
<p>The cost (and complexity) of such a solution depends on two major components:</p>
<ul>
<li>The server side including the information about each token</li>
<li>The client side being the hardware/software token to compute the required OTP value for each authentication requested by the user</li>
</ul>
<p>Large OTP vendors build their business on the OTP scheme outlined above. Even if the OTP algorithms themselves are standardized, the majority of vendors <a href="http://en.wikipedia.org/wiki/Security_token">lock users into their own proprietary solutions</a> so that users cannot choose to buy tokens from other suppliers.</p>
<h1>Paper Token: Gutenberg&#8217;s version of OTP</h1>
<p>Looking for an alternative to the commercial solutions described above, we asked ourselves: Is a simple, low-cost OTP solution possible? What if we could remove the dependency on an electronic device on the client side, i.e. the hardware token? And how could we come up with an affordable solution on the server side?</p>
<p>We remembered an <a href="http://www.schneier.com/blog/archives/2005/06/write_down_your.html">older blog post</a> by security expert <a href="http://en.wikipedia.org/wiki/Bruce_Schneier">Bruce Schneier</a> about password management for users:</p>
<blockquote><p>&#8220;Simply, people can no longer remember passwords good enough to reliably defend against dictionary attacks, and are much more secure if they choose a password too complicated to remember and then write it down. We&#8217;re all good at securing small pieces of paper. I recommend that people write their passwords down on a small piece of paper, and keep it with their other valuable small pieces of paper: in their wallet.&#8221; —Bruce Schneier 2005</p></blockquote>
<p>With this in mind, what if we came up with a security token based on a sheet of plain, old-school paper? As hopeless book aficionados, we thought that this would also be true to the spirit of <a href="http://en.wikipedia.org/wiki/Johannes_Gutenberg">Johannes Gutenberg</a>.</p>
<div id="attachment_812" class="wp-caption aligncenter" style="width: 243px"><a href="http://www.quuxlabs.com/wp-content/uploads/2010/09/Gutenberg.jpeg"><img class="size-medium wp-image-812" title="Johannes Gutenberg" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/Gutenberg-233x300.jpg" alt="Johannes Gutenberg" width="233" height="300" /></a><p class="wp-caption-text">Johannes Gutenberg: Would he have been happy to see a paper token as a simple, low-cost OTP solution?</p></div>
<p>We decided to give it a go and have thus <a href="http://github.com/quuxlabs/paper-token">written a Perl script called &#8220;Paper Token&#8221;</a> which generates a printable page of pre-calculated OTP values.</p>
<div class="software-link">The Paper Token software is available at: <a href="http://github.com/quuxlabs/paper-token">http://github.com/quuxlabs/paper-token</a></div>
<p>You will have to set the initial counter (usually zero for a new token) and the final counter. The Perl script will generate all values from the initial counter up to the final one. The unique key of the token (in this example, this is the test vector value &#8212; so do not use it in a production environment&#8230;) needs to be set to the value known by the remote server.</p>
<p>Here is a simple example of running the Paper Token script:</p>
<pre class="brush:perl">perl paper-token.pl  --output test.pdf --counter 0 --end 200 --secret 3132333435363738393031323334353637383930 --digits 6</pre>
<p>The script generates a PDF document that can be handed to the user either in digital form or directly as a printed paper sheet (see the photo below for how the token sheet looks in practice). The user then reads the paper sheet from left to right and strikes through each value once it has been used for an authentication, i.e. very similar to how <a href="http://en.wikipedia.org/wiki/Transaction_authentication_number">TANs</a> work in the context of online banking (except that a TAN is used in conjunction with your password, whereas here the OTP replaces the password). The user keeps the paper token in his wallet just like any other valuable cards or papers.</p>
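<p>The pre-computation behind such a sheet is straightforward. As an illustration, here is a plain-text Python analogue (the actual Paper Token tool is a Perl script that produces a PDF; <code>token_sheet</code> and <code>per_row</code> are names we made up for this sketch, and the secret is the same test vector as in the example command above):</p>

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # Standard RFC 4226 HOTP value for one counter position
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def token_sheet(secret: bytes, start: int, end: int,
                digits: int = 6, per_row: int = 10) -> str:
    # Pre-compute all codes from the initial counter up to (but excluding) the
    # final one, laid out in rows for printing
    codes = [hotp(secret, c, digits) for c in range(start, end)]
    rows = [" ".join(codes[i:i + per_row]) for i in range(0, len(codes), per_row)]
    return "\n".join(rows)

secret = bytes.fromhex("3132333435363738393031323334353637383930")
print(token_sheet(secret, 0, 30))
```

<p>The user crosses out each printed value as it is consumed, exactly as described above.</p>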
<div id="attachment_739" class="wp-caption alignnone" style="width: 610px"><a href="http://www.quuxlabs.com/wp-content/uploads/2010/09/IMG_6380.jpg"><img class="size-large wp-image-739" title="Paper token (an example)" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/IMG_6380-600x400.jpg" alt="Paper token (an example)" width="600" height="400" /></a><p class="wp-caption-text">What a paper token looks like.</p></div>
<h1>Is there a future for a paper token?</h1>
<p>This experiment shows the feasibility of using paper tokens even without altering the security of the authentication process. Very often, we give up some security for the convenience of using a publicly accessible or shared computer. Why do Web services not allow an alternative way of user authentication by including OTP? Maybe ease of use and user convenience are the answers to this question. Still, a simple OTP approach such as paper tokens might be feasible in practice without investing too much into existing infrastructures or adding too much complexity to existing security processes. Hey, we&#8217;re brainstorming here after all!</p>
<p>Another potential use of such a technology is OTP as a password/account recovery mechanism. A lot of online services rely on those silly &#8220;security questions&#8221; like <em>What is your mother&#8217;s maiden name?</em> or <em>What is your favorite color?</em>, and these are often a source of security problems, as described in <a href="http://www.cl.cam.ac.uk/~jcb82/doc/fc2010_name_guessing.pdf">What’s in a Name? Evaluating Statistical Attacks on Personal Knowledge Questions</a>. (Just ask yourself: how many colors are there that users might have picked as their favorite? Not many? Exactly!) Maybe we could replace these easy-to-guess security questions with a pre-calculated OTP paper sheet?</p>
<h1>Summary</h1>
<p>Of course, we are not claiming to have found a miracle solution. Some questions and tasks are still unfinished (we haven&#8217;t talked much about the server side, for instance). But we want to initiate a discussion about simple, low-cost OTP solutions, and not to come to this discussion with empty hands but with a <a href="http://github.com/quuxlabs/paper-token">working piece of software</a> that could serve as a starting point for further activities. There are plenty of opportunities to improve security while keeping cost and accessibility at a reasonable level. It&#8217;s often just a matter of creativity&#8230;</p>
<h1>References</h1>
<p>From quuxlabs:</p>
<ul>
<li><a href="http://github.com/quuxlabs/paper-token">Paper Token</a> &#8211; our paper token generator described above for creating paper-based OTP tokens according to RFC 4226</li>
<li><a href="http://www.foo.be/cgi-bin/wiki.pl/SettingOOTP">The 5 minutes HOWTO install OpenOTP and test it with a token</a></li>
</ul>
<p>From other people:</p>
<ul>
<li><a href="http://tools.ietf.org/html/rfc4226">RFC4226: HOTP &#8211; An HMAC-Based One-Time Password Algorithm</a></li>
<li><a href="http://www.splintered.net/sw/otp/">OpenOTP</a> &#8211; an implementation of the HOTP protocol including server side (database token management)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.quuxlabs.com/blog/2010/09/paper-token-gutenbergs-version-of-one-time-passwords/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Matrix Factorization: A Simple Tutorial and Implementation in Python</title>
		<link>http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/</link>
		<comments>http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 21:13:58 +0000</pubDate>
		<dc:creator>Albert Au Yeung</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[matrix factorization]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[recommendation system]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.quuxlabs.com/?p=641</guid>
		<description><![CDATA[There is probably no need to say that there is too much information on the Web nowadays. Search engines help us a little bit. What is better is to have something interesting recommended to us automatically without asking. Indeed, from &#8230; <a href="http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>There is probably no need to say that there is too much information on the Web nowadays. Search engines help us a little bit. What is better is to have something interesting recommended to us automatically without asking. Indeed, from as simple as a list of the most popular bookmarks on <a href="http://delicious.com/">Delicious</a>, to some more personalized recommendations we received on <a href="http://www.amazon.com/">Amazon</a>, we are usually offered recommendations on the Web. </p>
<p>Recommendations can be generated by a wide range of algorithms. While user-based or item-based <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a> methods are simple and intuitive, matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items. Of course, matrix factorization is simply a mathematical tool for playing around with matrices, and is therefore applicable in many scenarios where one would like to find out something hidden under the data. </p>
<p>In this tutorial, we will go through the basic ideas and the mathematics of matrix factorization, and then we will present a simple implementation in <a href="http://www.python.org/">Python</a>. We will proceed with the assumption that we are dealing with user ratings (e.g. an integer score from the range of 1 to 5) of items in a recommendation system. </p>
<p><span id="more-641"></span></p>

<h1>Basic Ideas</h1>
<p>Just as its name suggests, matrix factorization is, obviously, about factorizing a matrix, i.e. finding two (or more) matrices such that when you multiply them you get back the original matrix. </p>
<p>As I have mentioned above, from an application point of view, matrix factorization can be used to discover latent features underlying the interactions between two different kinds of entities. (Of course, you can consider more than two kinds of entities and you will be dealing with tensor factorization, which would be more complicated.) And one obvious application is to predict ratings in collaborative filtering. </p>
<p>In a recommendation system such as <a href="http://www.netflix.com/">Netflix</a> or <a href="http://movielens.umn.edu/">MovieLens</a>, there is a group of users and a set of items (movies for the above two systems). Given that each user has rated some items in the system, we would like to predict how the users would rate the items that they have not yet rated, such that we can make recommendations to the users. In this case, all the information we have about the existing ratings can be represented in a matrix. Assume now we have 5 users and 4 items, and ratings are integers ranging from 1 to 5; the matrix may look something like this (a hyphen means that the user has not yet rated the movie):</p>
<table>
<tbody>
<tr>
<td></td>
<td><strong>D1</strong></td>
<td><strong>D2</strong></td>
<td><strong>D3</strong></td>
<td><strong>D4</strong></td>
</tr>
<tr>
<td><strong>U1</strong></td>
<td>5</td>
<td>3</td>
<td>-</td>
<td>1</td>
</tr>
<tr>
<td><strong>U2</strong></td>
<td>4</td>
<td>-</td>
<td>-</td>
<td>1</td>
</tr>
<tr>
<td><strong>U3</strong></td>
<td>1</td>
<td>1</td>
<td>-</td>
<td>5</td>
</tr>
<tr>
<td><strong>U4</strong></td>
<td>1</td>
<td>-</td>
<td>-</td>
<td>4</td>
</tr>
<tr>
<td><strong>U5</strong></td>
<td>-</td>
<td>1</td>
<td>5</td>
<td>4</td>
</tr>
</tbody>
</table>
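<p>In code, this rating matrix might be represented as follows, using 0 to stand in for a missing rating (this is just a convention for the sketch, not part of the data):</p>

```python
import numpy as np

# Rows are users U1..U5, columns are items D1..D4;
# 0 marks a missing rating (the hyphens in the table above)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])
print(R.shape)  # -> (5, 4): 5 users, 4 items
```
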
<p>Hence, the task of predicting the missing ratings can be considered as filling in the blanks (the hyphens in the matrix) such that the values would be consistent with the existing ratings in the matrix. </p>
<p>The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates an item. For example, two users would give high ratings to a certain movie if they both like the actors/actresses of the movie, or if the movie is an action movie, which is a genre preferred by both users. Hence, if we can discover these latent features, we should be able to predict a rating with respect to a certain user and a certain item, because the features associated with the user should match with the features associated with the item. </p>
<p>In trying to discover the different features, we also make the assumption that the number of features would be smaller than the number of users and the number of items. It should not be difficult to understand this assumption because clearly it would not be reasonable to assume that each user is associated with a unique feature (although this is not impossible). And anyway if this is the case there would be no point in making recommendations, because each of these users would not be interested in the items rated by other users. Similarly, the same argument applies to the items. </p>
<h1>The mathematics of matrix factorization</h1>
<p>Having discussed the intuition behind matrix factorization, we can now go on to work on the mathematics. Firstly, we have a set <img src='http://www.quuxlabs.com/wp-content/latex/4c6/4c614360da93c0a041b22e537de151eb-ffffff-000000-0.png' alt='U' title='U' class='latex' /> of users, and a set <img src='http://www.quuxlabs.com/wp-content/latex/f62/f623e75af30e62bbd73d6df5b50bb7b5-ffffff-000000-0.png' alt='D' title='D' class='latex' /> of items. Let <img src='http://www.quuxlabs.com/wp-content/latex/e1f/e1fd601dbae82a538d518550acb1af19-ffffff-000000-0.png' alt='\mathbf{R}' title='\mathbf{R}' class='latex' /> of size <img src='http://www.quuxlabs.com/wp-content/latex/c66/c664f3d3a2c99618e6cd3586e9bd3ce9-ffffff-000000-0.png' alt='|U| \times |D|' title='|U| \times |D|' class='latex' /> be the matrix that contains all the ratings that the users have assigned to the items. Also, we assume that we would like to discover K latent features. Our task, then, is to find two matrices <img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> (a <img src='http://www.quuxlabs.com/wp-content/latex/b61/b6124fad67425b33363bd7ef9d4fe920-ffffff-000000-0.png' alt='|U| \times K' title='|U| \times K' class='latex' /> matrix) and <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' /> (a <img src='http://www.quuxlabs.com/wp-content/latex/ebb/ebb59df8dc321a9ddfc458db87dfb839-ffffff-000000-0.png' alt='|D| \times K' title='|D| \times K' class='latex' /> matrix) such that their product approximates <img src='http://www.quuxlabs.com/wp-content/latex/e1f/e1fd601dbae82a538d518550acb1af19-ffffff-000000-0.png' alt='\mathbf{R}' title='\mathbf{R}' class='latex' />:<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/1dc/1dcf596ef3cc5025b124812d0382659c-ffffff-000000-1.png' alt='\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}' title='\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}' class='latex' /><br />
</center><br/> </p>
<p>In this way, each row of <img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> would represent the strength of the associations between a user and the features. Similarly, each row of <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' /> would represent the strength of the associations between an item and the features. To get the prediction of a rating of an item <img src='http://www.quuxlabs.com/wp-content/latex/fb1/fb1793a0a1f0a7f569eaaceb6bd6e7ff-ffffff-000000-0.png' alt='d_j' title='d_j' class='latex' /> by <img src='http://www.quuxlabs.com/wp-content/latex/eb0/eb00a04135562ae6f74786f084f54327-ffffff-000000-0.png' alt='u_i' title='u_i' class='latex' />, we can calculate the dot product of the two vectors corresponding to <img src='http://www.quuxlabs.com/wp-content/latex/eb0/eb00a04135562ae6f74786f084f54327-ffffff-000000-0.png' alt='u_i' title='u_i' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/fb1/fb1793a0a1f0a7f569eaaceb6bd6e7ff-ffffff-000000-0.png' alt='d_j' title='d_j' class='latex' />:<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/864/864245053b03c6e746cbe83922746032-ffffff-000000-1.png' alt='\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^k{p_{ik}q_{kj}}' title='\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^k{p_{ik}q_{kj}}' class='latex' /><br />
</center><br/> </p>
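<p>As a small worked example with K = 2 latent features (the factor values below are made up purely for illustration), the predicted rating is just the dot product of the corresponding rows:</p>

```python
import numpy as np

# Hand-picked illustrative factor rows, K = 2 latent features
p_i = np.array([1.8, 0.4])   # user u_i's association with the 2 features
q_j = np.array([2.1, 1.2])   # item d_j's association with the 2 features

# \hat{r}_{ij} = sum_k p_ik * q_kj = 1.8*2.1 + 0.4*1.2
r_hat = p_i @ q_j
print(round(float(r_hat), 2))  # -> 4.26
```
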
<p>Now, we have to find a way to obtain <img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' />. One way to approach this problem is to first initialize the two matrices with some values, calculate how `different&#8217; their product is from <img src='http://www.quuxlabs.com/wp-content/latex/e1f/e1fd601dbae82a538d518550acb1af19-ffffff-000000-0.png' alt='\mathbf{R}' title='\mathbf{R}' class='latex' />, and then try to minimize this difference iteratively. Such a method is called gradient descent, aiming at finding a local minimum of the difference. </p>
<p>The difference here, usually called the error between the estimated rating and the real rating, can be calculated by the following equation for each user-item pair: </p>
<p><center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/09b/09b9fe29e79709943fbe8abc12d41025-ffffff-000000-1.png' alt='e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2' title='e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2' class='latex' /><br />
</center><br/></p>
<p>Here we consider the squared error because the estimated rating can be either higher or lower than the real rating. </p>
<p>To minimize the error, we have to know in which direction we have to modify the values of <img src='http://www.quuxlabs.com/wp-content/latex/dec/dec680a5862b2125d6c2757ace17beba-ffffff-000000-0.png' alt='p_{ik}' title='p_{ik}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/ed9/ed970bb4b6c725588b409188d7170f92-ffffff-000000-0.png' alt='q_{kj}' title='q_{kj}' class='latex' />. In other words, we need to know the gradient at the current values, and therefore we differentiate the above equation with respect to these two variables separately:<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/ef6/ef6ddecfa568a7882395b03d9a43b98e-ffffff-000000-1.png' alt='\frac{\partial}{\partial p_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(q_{kj}) = -2 e_{ij} q_{kj}' title='\frac{\partial}{\partial p_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(q_{kj}) = -2 e_{ij} q_{kj}' class='latex' /><br />
<img src='http://www.quuxlabs.com/wp-content/latex/dcf/dcf543615b34c3c92c52ac570697ddb1-ffffff-000000-1.png' alt='  \frac{\partial}{\partial q_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(p_{ik}) = -2 e_{ij} p_{ik}' title='  \frac{\partial}{\partial q_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(p_{ik}) = -2 e_{ij} p_{ik}' class='latex' /><br />
</center><br/> </p>
<p>Having obtained the gradient, we can now formulate the update rules for both <img src='http://www.quuxlabs.com/wp-content/latex/dec/dec680a5862b2125d6c2757ace17beba-ffffff-000000-0.png' alt='p_{ik}' title='p_{ik}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/ed9/ed970bb4b6c725588b409188d7170f92-ffffff-000000-0.png' alt='q_{kj}' title='q_{kj}' class='latex' />:<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/d8a/d8a4070c59529c9e13036a6bafaabdf8-ffffff-000000-1.png' alt='p&#039;_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}}e_{ij}^2 = p_{ik} + 2\alpha e_{ij} q_{kj} ' title='p&#039;_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}}e_{ij}^2 = p_{ik} + 2\alpha e_{ij} q_{kj} ' class='latex' /><br />
<img src='http://www.quuxlabs.com/wp-content/latex/c8c/c8c559c69dab539af106d84538f27a4f-ffffff-000000-1.png' alt='q&#039;_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}}e_{ij}^2 = q_{kj} + 2\alpha e_{ij} p_{ik} ' title='q&#039;_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}}e_{ij}^2 = q_{kj} + 2\alpha e_{ij} p_{ik} ' class='latex' /><br />
</center><br/></p>
<p>Here, <img src='http://www.quuxlabs.com/wp-content/latex/7b7/7b7f9dbfea05c83784f8b85149852f08-ffffff-000000-0.png' alt='\alpha' title='\alpha' class='latex' /> is a constant whose value determines the rate of approaching the minimum. Usually we will choose a small value for <img src='http://www.quuxlabs.com/wp-content/latex/7b7/7b7f9dbfea05c83784f8b85149852f08-ffffff-000000-0.png' alt='\alpha' title='\alpha' class='latex' />, say 0.0002. This is because if we make too large a step towards the minimum we run the risk of overshooting it and ending up oscillating around the minimum. </p>
<p>A question might have come to your mind by now: if we find two matrices <img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' /> such that <img src='http://www.quuxlabs.com/wp-content/latex/4e3/4e37888e71add225aafff9e943e66b88-ffffff-000000-0.png' alt='\mathbf{P} \times \mathbf{Q}' title='\mathbf{P} \times \mathbf{Q}' class='latex' /> approximates <img src='http://www.quuxlabs.com/wp-content/latex/e1f/e1fd601dbae82a538d518550acb1af19-ffffff-000000-0.png' alt='\mathbf{R}' title='\mathbf{R}' class='latex' />, won&#8217;t our predictions of all the unseen ratings be zeros? In fact, we are not really trying to come up with <img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' /> such that we can reproduce <img src='http://www.quuxlabs.com/wp-content/latex/e1f/e1fd601dbae82a538d518550acb1af19-ffffff-000000-0.png' alt='\mathbf{R}' title='\mathbf{R}' class='latex' /> exactly. Instead, we will only try to minimise the errors of the observed user-item pairs. 
In other words, if we let <img src='http://www.quuxlabs.com/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png' alt='T' title='T' class='latex' /> be a set of tuples, each of which is in the form of <img src='http://www.quuxlabs.com/wp-content/latex/40f/40fa271aa4127b967a42623c68e9aeed-ffffff-000000-0.png' alt='(u_i, d_j, r_{ij})' title='(u_i, d_j, r_{ij})' class='latex' />, such that <img src='http://www.quuxlabs.com/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png' alt='T' title='T' class='latex' /> contains all the observed user-item pairs together with the associated ratings, we are only trying to minimise every <img src='http://www.quuxlabs.com/wp-content/latex/6fd/6fd64a8eafc5224488e3523dd225bb7b-ffffff-000000-0.png' alt='e_{ij}' title='e_{ij}' class='latex' /> for <img src='http://www.quuxlabs.com/wp-content/latex/11e/11e227a7d100c0128a184cf35b74bfe1-ffffff-000000-0.png' alt='(u_i, d_j, r_{ij}) \in T' title='(u_i, d_j, r_{ij}) \in T' class='latex' />. (In other words, <img src='http://www.quuxlabs.com/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png' alt='T' title='T' class='latex' /> is our set of training data.) As for the rest of the unknowns, we will be able to determine their values once the associations between the users, items and features have been learnt. </p>
<p>Using the above update rules, we can then iteratively perform the operation until the error converges to its minimum. We can check the overall error as calculated using the following equation and determine when we should stop the process.<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/4af/4afb50f23cb174de170269a10d246862-ffffff-000000-1.png' alt='E = \sum_{(u_i,d_j,r_{ij}) \in T}{e_{ij}} = \sum_{(u_i,d_j,r_{ij}) \in T}{(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2} ' title='E = \sum_{(u_i,d_j,r_{ij}) \in T}{e_{ij}} = \sum_{(u_i,d_j,r_{ij}) \in T}{(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2} ' class='latex' /><br />
</center><br/></p>
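<p>To make the stopping criterion concrete, the overall error E above can be computed with a few lines of numpy. The sketch below is our own illustration (the function name <code>total_error</code> is not part of the tutorial's code); it treats zero entries of R as unobserved, as in the discussion above, and assumes P is an N x K matrix and Q a K x M matrix.</p>

```python
import numpy as np

def total_error(R, P, Q):
    """Sum of squared errors over the observed (non-zero) entries of R.

    P is an N x K user-feature matrix, Q a K x M feature-item matrix;
    zeros in R mark unobserved ratings, as in the text above.
    """
    prediction = np.dot(P, Q)
    observed = R > 0                      # mask of training pairs (u_i, d_j)
    return float(np.sum((R[observed] - prediction[observed]) ** 2))

# A tiny check: with a perfect factorization the error is zero.
R = np.array([[5.0, 0.0], [0.0, 2.0]])
P = np.array([[5.0], [2.0]])
Q = np.array([[1.0, 1.0]])
# predictions are [[5, 5], [2, 2]]; both observed entries match exactly
print(total_error(R, P, Q))  # 0.0
```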
<h1>Regularization</h1>
<p>The above algorithm is a very basic algorithm for factorizing a matrix, and there are many ways to extend it. A common extension is to introduce regularization to avoid overfitting. This is done by adding a parameter <img src='http://www.quuxlabs.com/wp-content/latex/b06/b0603860fcffe94e5b8eec59ed813421-ffffff-000000-0.png' alt='\beta' title='\beta' class='latex' /> and modifying the squared error as follows:<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/717/7175fabccdf1a9df3a07f261fc6d9364-ffffff-000000-1.png' alt='e_{ij}^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2 + \frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}' title='e_{ij}^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2 + \frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}' class='latex' /><br />
</center><br/></p>
<p>In other words, the new parameter <img src='http://www.quuxlabs.com/wp-content/latex/b06/b0603860fcffe94e5b8eec59ed813421-ffffff-000000-0.png' alt='\beta' title='\beta' class='latex' /> is used to control the magnitudes of the user-feature and item-feature vectors such that <img src='http://www.quuxlabs.com/wp-content/latex/44c/44c29edb103a2872f519ad0c9a0fdaaa-ffffff-000000-0.png' alt='P' title='P' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/f09/f09564c9ca56850d4cd6b3319e541aee-ffffff-000000-0.png' alt='Q' title='Q' class='latex' /> would give a good approximation of <img src='http://www.quuxlabs.com/wp-content/latex/e1e/e1e1d3d40573127e9ee0480caf1283d6-ffffff-000000-0.png' alt='R' title='R' class='latex' /> without having to contain large numbers. In practice, <img src='http://www.quuxlabs.com/wp-content/latex/b06/b0603860fcffe94e5b8eec59ed813421-ffffff-000000-0.png' alt='\beta' title='\beta' class='latex' /> is typically set to a small value such as 0.02. The update rules for this regularized squared error can be obtained by a procedure similar to the one described above, and are as follows.<br />
<center><br />
<img src='http://www.quuxlabs.com/wp-content/latex/708/7086061148d1ca3ff05059c7495b97ed-ffffff-000000-1.png' alt='p&#039;_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}}e_{ij}^2 = p_{ik} + \alpha(2 e_{ij} q_{kj} - \beta p_{ik} )' title='p&#039;_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}}e_{ij}^2 = p_{ik} + \alpha(2 e_{ij} q_{kj} - \beta p_{ik} )' class='latex' /><br />
<img src='http://www.quuxlabs.com/wp-content/latex/bbe/bbe62a8a7756fb7654415747ac2eea3d-ffffff-000000-1.png' alt='q&#039;_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}}e_{ij}^2 = q_{kj} + \alpha(2 e_{ij} p_{ik} - \beta q_{kj} )' title='q&#039;_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}}e_{ij}^2 = q_{kj} + \alpha(2 e_{ij} p_{ik} - \beta q_{kj} )' class='latex' /><br />
</center><br/></p>
<h1>Implementation in Python</h1>
<p>Once we have derived the update rules as described above, it actually becomes very straightforward to implement the algorithm. The following is a function that implements the algorithm in Python (note that this implementation requires the <a href="http://numpy.scipy.org/">numpy</a> module).</p>
<div class="software-link">Note: The complete Python code is available for download in section <a href="#source-code">Source Code</a> at the end of this post.</div>
<pre class="brush:python">
import numpy

def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    # R: rating matrix, with zeros marking unobserved entries
    # P: N x K user-feature matrix, Q: M x K item-feature matrix (initial guesses)
    # K: number of latent features; alpha: learning rate; beta: regularization parameter
    Q = Q.T
    for step in xrange(steps):
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
                    for k in xrange(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # compute the total regularized error over the observed entries
        e = 0
        for i in xrange(len(R)):
            for j in xrange(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in xrange(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        if e < 0.001:
            break
    return P, Q.T
</pre>
<p>We can try to apply it to our example mentioned above and see what we would get. Below is a code snippet in Python for running the example. </p>
<pre class="brush:python">
R = [
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ]

R = numpy.array(R)

N = len(R)
M = len(R[0])
K = 2

P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)

nP, nQ = matrix_factorization(R, P, Q, K)
nR = numpy.dot(nP, nQ.T)
</pre>
<p>And the matrix obtained from the above process would look something like this:</p>
<p><center></p>
<table width="250">
<tr>
<td></td>
<td><b>D1</b></td>
<td><b>D2</b></td>
<td><b>D3</b></td>
<td><b>D4</b></td>
</tr>
<tr>
<td><b>U1</b></td>
<td>  4.97</td>
<td> 2.98</td>
<td> 2.18</td>
<td> 0.98</td>
</tr>
<tr>
<td><b>U2</b></td>
<td> 3.97</td>
<td> 2.40</td>
<td> 1.97</td>
<td> 0.99</td>
</tr>
<tr>
<td><b>U3</b></td>
<td> 1.02</td>
<td> 0.93</td>
<td> 5.32</td>
<td> 4.93</td>
</tr>
<tr>
<td><b>U4</b></td>
<td> 1.00</td>
<td> 0.85</td>
<td> 4.59</td>
<td> 3.93</td>
</tr>
<tr>
<td><b>U5</b></td>
<td> 1.36</td>
<td> 1.07</td>
<td> 4.89</td>
<td> 4.12</td>
</tr>
</table>
<p></center> </p>
<p>We can see that for existing ratings we have the approximations very close to the true values, and we also get some 'predictions' of the unknown values. In this simple example, we can easily see that U1 and U2 have similar taste and they both rated D1 and D2 high, while the rest of the users preferred D3 and D4. When the number of features (K in the Python code) is 2, the algorithm is able to associate the users and items with two different features, and the predictions also follow these associations. For example, the predicted rating of U4 on D3 is 4.59: U4 and U5 both rated D4 high, and U5 also rated D3 high, so a high rating of U4 on D3 is predicted as well. </p>
<h1>Further Information</h1>
<p>We have discussed the intuitive meaning of the technique of matrix factorization and its use in collaborative filtering. In fact, there are many different extensions to the above technique. An important extension is the requirement that all the elements of the factor matrices (<img src='http://www.quuxlabs.com/wp-content/latex/ccf/ccf6cb7a07e53d6a5c3e8449ae73d371-ffffff-000000-0.png' alt='\mathbf{P}' title='\mathbf{P}' class='latex' /> and <img src='http://www.quuxlabs.com/wp-content/latex/5e1/5e1ad0579fc06ddcbda6abaa092b7382-ffffff-000000-0.png' alt='\mathbf{Q}' title='\mathbf{Q}' class='latex' /> in the above example) should be non-negative. In this case it is called non-negative matrix factorization (NMF). One advantage of NMF is that it results in intuitive meanings of the resultant matrices. Since no elements are negative, the process of multiplying the resultant matrices to get back the original matrix would not involve subtraction, and can be considered as a process of generating the original data by linear combinations of the latent features. </p>
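<p>For readers curious how NMF is typically computed, below is a minimal sketch of the multiplicative update rules of Lee and Seung for the squared-error objective. Note that, unlike the ratings setting above, this sketch assumes a fully observed non-negative matrix; the function name and parameters are our own choices for illustration.</p>

```python
import numpy as np

def nmf(R, K, steps=200, eps=1e-9):
    """Minimal sketch of Lee & Seung's multiplicative updates for NMF.

    Factorizes a fully observed non-negative matrix R (N x M) into
    W (N x K) and H (K x M), keeping all entries non-negative.
    """
    N, M = R.shape
    rng = np.random.RandomState(0)
    W = rng.rand(N, K)
    H = rng.rand(K, M)
    for _ in range(steps):
        # Multiplicative updates never change the sign of an entry,
        # so W and H stay non-negative throughout.
        H *= np.dot(W.T, R) / (np.dot(np.dot(W.T, W), H) + eps)
        W *= np.dot(R, H.T) / (np.dot(W, np.dot(H, H.T)) + eps)
    return W, H

R = np.random.RandomState(1).rand(5, 4)   # a small non-negative test matrix
W, H = nmf(R, K=2)
print(np.linalg.norm(R - np.dot(W, H)))   # Frobenius reconstruction error
```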
<p><a name="source-code" /></p>
<h1>Source Code</h1>
<p>The full Python source code of this tutorial is available for download at:</p>
<ul>
<li><a href='http://www.quuxlabs.com/wp-content/uploads/2010/09/mf.py_.txt'>mf.py</a></li>
</ul>
<h1>References</h1>
<p>There is a large body of literature on matrix factorization. Below are some related papers.</p>
<ul>
<li>Gábor Takács et al (2008). <a href="http://portal.acm.org/citation.cfm?id=1454049">Matrix factorization and neighbor based algorithms for the Netflix prize problem</a>. In: Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland, October 23 - 25, 267-274.</li>
<li>Patrick Ott (2008). <a href="http://www.comp.leeds.ac.uk/ott/dl/mf_ott.pdf">Incremental Matrix Factorization for Collaborative Filtering</a>. Science, Technology and Design 01/2008, Anhalt University of Applied Sciences.</li>
<li>Daniel D. Lee and H. Sebastian Seung (2001). <a href="http://hebb.mit.edu/people/seung/papers/nmfconverge.pdf">Algorithms for Non-negative Matrix Factorization</a>. Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference. MIT Press. pp. 556–562.</li>
<li>Daniel D. Lee and H. Sebastian Seung (1999). <a href="http://www.nature.com/nature/journal/v401/n6755/abs/401788a0.html">Learning the parts of objects by non-negative matrix factorization</a>. Nature, Vol. 401, No. 6755. (21 October 1999), pp. 788-791.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Location and Friendship: Data Mining in Facebook</title>
		<link>http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/</link>
		<comments>http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/#comments</comments>
		<pubDate>Sun, 05 Sep 2010 00:41:03 +0000</pubDate>
		<dc:creator>Albert Au Yeung</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Social Networks]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[density]]></category>
		<category><![CDATA[distance]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[friendship]]></category>
		<category><![CDATA[geo]]></category>
		<category><![CDATA[geolocation]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[population]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[social network]]></category>
		<category><![CDATA[study]]></category>

		<guid isPermaLink="false">http://www.quuxlabs.com/?p=598</guid>
		<description><![CDATA[In the past, studying social issues such as the mobility of a group of people generally required a huge amount of effort. Questionnaires would have had to be prepared, distributed, and collected after they were filled in. It was and &#8230; <a href="http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In the past, studying social issues such as the mobility of a group  of people generally required a huge amount of effort. Questionnaires  would have had to be prepared, distributed, and collected after they  were filled in. It was and still is a labor-intensive task when  face-to-face interviews are required to obtain various personal data.</p>
<p>Nowadays, we have more and more people connected to the Internet,  and many of these Internet users participate in various kinds of social  interactions on the Web. Most notably, users enjoy social networking  sites such as Facebook to establish their personal profile online and to  keep track of what their friends are doing.</p>
<p>As a result, huge volumes of data of social interactions can be  collected much more easily nowadays. Analyzing this data can give  insight into how different factors such as location, distance and  friendship influence the way people interact with each other.</p>
<p><span id="more-598"></span></p>

<p>The currently largest online social networking site is Facebook.  It claims to have a community of <a href="http://blog.facebook.com/blog.php?post=409753352130">more than 500 million registered users</a>, and it is <a href="http://www.alexa.com/siteinfo/facebook.com">ranked second</a> in terms of Internet traffic.  To serve this large number of users and to keep track of all the user  activities, Facebook is said to have been running <a href="http://www.datacenterknowledge.com/archives/2010/06/28/facebook-server-count-60000-or-more/">more than 60,000  servers</a>.</p>
<p>Facebook may just be a place where friends catch up with each other. However, it has grown to become a world-wide network of frequent and dynamic interactions. It is a place where social theories can be tested due to the diversity of the backgrounds of the users. The social data collected by Facebook is valuable both in generating more personalized services and in understanding more about the users themselves (see for example Facebook&#8217;s project on <a href="http://apps.facebook.com/usa_gnh/">gross national happiness</a>).</p>
<p>Recently, Facebook published a paper in the <a href="http://www.www2010.org/">19th International World Wide Web Conference</a> with the title <a href="http://cameronmarlow.com/papers/find-me-if-you-can">Find Me If You Can: Improving Geographical Prediction  with Social and Spatial Proximity</a> (Backstrom et al., 2010). The paper  offers interesting insights into the relationship between geographical  distance and friendship of users. Let&#8217;s take a look at its content in  this blog post.</p>
<h1>Location, Population and Friendship</h1>
<p>As the title of the paper suggests, it is about predicting a user&#8217;s  location based on information about them available in Facebook. The  application of such a technique is quite broad. It can be used to detect  abnormal usage, to improve user experience, or to optimize the  presentation of advertisements. In addition, the study also gives  insight into the relationship between social ties and geographic  distance: Do users have more friends who are located close to them and  fewer friends who are living further away? Or is it true that the  Internet has removed the limitations of physical distance?</p>
<h2>Facebook Data</h2>
<p>The authors report that 6% of the 100 million Facebook users in the US have put  their home addresses in their profiles. Out of these addresses, 60% can  be converted into geo-locations (latitude and longitude). As a result,  the study makes use of a dataset that contains about 3.5 million user  profiles with precisely known home addresses.</p>
<p>Among these 3.5 million users, 2.9 million of them have at least  one friend with a valid address. On average, these users have about 10  friends with valid addresses. If we consider the social network of these  users, there are about 30.6 million relations between users with  precisely known locations (representing the social network as a graph  would result in a graph with 2.9 million vertices/nodes and 30.6 million  edges). It is a huge data set and, as the authors have noted, it  features much more precise location information compared to previous  studies which usually only deduce location information from IP  addresses.</p>
<h2>Population Density</h2>
<p>One interesting thing revealed in the paper is the distribution of population across the country. The authors divide the population of the US into three groups of roughly the same size according to the population density of their locations, and then plot the average number of people living x miles away against the distance from these locations. The graph shows nicely the changes in population distribution as one moves away from a certain location. For densely populated areas, for instance, the number of people increases with the distance in the beginning as one considers a larger area within a large city, but it decreases as soon as one moves to the suburban areas outside the city. The lines of the three groups join each other at a point when the distance is far enough that the average population density of the country is reflected.</p>
<div id="attachment_630" class="wp-caption aligncenter" style="width: 510px"><a rel="attachment wp-att-630" href="http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/fb-population-density/"><img class="size-full wp-image-630" title="Population Density" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/Fb-population-density.png" alt="" width="500" height="357" /></a><p class="wp-caption-text">Relationship between population and distance for areas with different population densities. The three lines join each other at a point where the average population density is reflected (source: Backstrom, Sun, Marlow).</p></div>
<p>Of course, this result seems to be quite obvious and intuitive. But  still it is interesting that data from a social networking site can be  used to show such a characteristic of population density.</p>
<h2>Friendship and Distance</h2>
<p>The second aspect the paper investigates is the probability of friendship as a function of geographical distance. Firstly, the distance between each pair of friends is calculated. Secondly, the probability of friendship at a certain distance is obtained by calculating the ratio between the number of pairs of friends and the total number of pairs of users at that distance. The authors observed that the probability of friendship is inversely proportional to distance at medium to long-range distances. In other words, you are less likely to have friends who live further away. However, at <em>shorter</em> distances friendship is less sensitive to the effect of distance. Imagine an apartment building with multiple families &#8212; just living there wouldn&#8217;t necessarily imply being friends with everyone.</p>
<div id="attachment_631" class="wp-caption aligncenter" style="width: 510px"><a rel="attachment wp-att-631" href="http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/fb-friendship-probability/"><img class="size-full wp-image-631" title="Probability distribution of friendship over distance" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/Fb-friendship-probability.png" alt="" width="500" height="355" /></a><p class="wp-caption-text">Probability distribution of friendship over distance. It shows that probability of friendship is inversely proportional to distance in the medium and long range distances (source: Backstrom, Sun, Marlow).</p></div>
<p>What is more interesting is that when the authors break down the users into different groups according to the population density of the areas in which they live, they find out that the probability distribution of friendship is different for different groups of users. Users living in low density areas are more likely to have friends within a short distance &#8212; it seems that people in rural areas &#8216;stick together&#8217;. On the other hand, users living in high density areas are slightly more likely to have friends living far away. The authors suggest that this may be because &#8216;people living in metropolitan areas are more cosmopolitan&#8217;.</p>
<div id="attachment_632" class="wp-caption aligncenter" style="width: 510px"><a rel="attachment wp-att-632" href="http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/fb-friendship-probability-2/"><img class="size-full wp-image-632" title="Probability of friendship for different groups of users" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/Fb-friendship-probability-2.png" alt="" width="500" height="355" /></a><p class="wp-caption-text">Probability of friendship for users in areas of different population densities. People living in high density areas are more likely to have long-distance friends (source: Backstrom, Sun, Marlow).</p></div>
<h2>Location Prediction</h2>
<p>The second part of the paper focuses on the problem of predicting the  location of a user. Here, the question is: can we predict the location  of a user if we know the locations of some or all of his friends? One  could simply assume that a user would most likely be located close to  his friends, and thus predict the user&#8217;s location using, for example,  the mean or median of his friends&#8217; locations. However, since  geographically there are places where no people live, such simple  approaches may produce strange &#8212; and funny &#8212; results when a user has  friends from different places in the US. Imagine having two friends, one  on Hawaii and one in Los Angeles &#8212; now calculate the mean of these two  locations and you might end up in the middle of the Pacific Ocean&#8230;</p>
<div id="attachment_635" class="wp-caption aligncenter" style="width: 510px"><a rel="attachment wp-att-635" href="http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/hawaii-los-angeles/"><img class="size-full wp-image-635" title="Hawaii and Los Angeles" src="http://www.quuxlabs.com/wp-content/uploads/2010/09/Hawaii-los-angeles.png" alt="" width="500" height="304" /></a><p class="wp-caption-text">Sometimes &#39;keeping it simple, stupid&#39; will not work that well.</p></div>
<p>In fact, given the probability distribution mentioned above, one can  also consider a maximum likelihood estimator of the location of a user.  For example, we can calculate the probability that a user&#8217;s location is  at point <em>u</em> by multiplying the probability of friendship given the distance between location <em>u</em> and the location of the user&#8217;s friends. However, for a large social  network and a huge geographical area, such an approach would be very  computationally expensive.</p>
<p>The solution to this problem suggested by the authors is to assume that a user would always co-locate with one of his/her friends. In this way the number of points that have to be evaluated becomes much smaller (i.e. the search space becomes much smaller). The paper presents some experiments and shows that this method of predicting the location of a user &#8212; while still being only a rough estimation &#8212; is more accurate than the current &#8216;standard&#8217; of using the IP addresses of the users for deducing location information.</p>
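<p>The co-location idea can be sketched as follows. This is only an illustration: the probability-of-friendship curve below is a made-up placeholder for the empirical curve fitted in the paper, plain Euclidean distance stands in for proper geographical distance, and the function names are our own.</p>

```python
import math

def prob_friend_given_distance(d):
    # Placeholder for the empirical curve: friendship probability
    # decays with distance (the paper fits this from Facebook data).
    return 1.0 / (d + 1.0)

def predict_location(friend_locations):
    """Pick the friend location maximizing the product of friendship
    probabilities to all other friends -- i.e. restrict the maximum-
    likelihood search to places where some friend already lives."""
    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def log_likelihood(candidate):
        return sum(math.log(prob_friend_given_distance(distance(candidate, f)))
                   for f in friend_locations if f != candidate)

    return max(friend_locations, key=log_likelihood)

# Three friends clustered in one city, one far away: the prediction
# lands inside the cluster rather than in the middle of nowhere.
friends = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
print(predict_location(friends))  # (0.0, 0.0) -- a point inside the cluster
```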
<h1>Summary</h1>
<p>The paper of Backstrom, Sun and Marlow describes the analysis of Facebook data and the technique of location prediction in more detail. I think this paper is interesting because it shows how the large volume of user data collected from online social networking sites can be used to study different kinds of social issues. With this kind of research, we can expect computer science to play a more important role in the study of social issues and in the validation of different social theories.</p>
<p>The paper shows that people are still rather bounded by physical distance. It is still more likely that people have friends who are physically located close to themselves, even though the Internet has provided different channels for users to make friends with someone living far away.</p>
<p>In fact, some other studies have also arrived at similar conclusions. For example, <a href="http://sonic.northwestern.edu/">a group of researchers</a> from <a href="http://www.northwestern.edu/">Northwestern University</a> have studied user logs from a popular <a href="http://en.wikipedia.org/wiki/Massively_multiplayer_online_game">massively multi-player online game</a> (<a href="http://en.wikipedia.org/wiki/Massively_multiplayer_online_game">MMOG</a>), EverQuest II (Huang et al. 2009). The researchers found that although players can meet other players from all over the world in the game, and that homophily (similarity in certain characteristics) has a certain influence on user behaviour, players are still more likely to interact with others who are geographically close to themselves, especially when the interaction involves partnership and collaboration in the game.</p>
<p><a href="http://en.wikipedia.org/wiki/Frances_Cairncross">Frances Cairncross</a> first wrote about the &#8216;<a href="http://www.amazon.com/Death-Distance-Communications-Revolution-Change/dp/0875848060">death of distance</a>&#8216; more than a decade ago. However, even now when people can connect to the Internet wherever they want, geographical proximity still seems to play an important role in determining with whom people interact. Is this a barrier that even the Internet and the World Wide Web are not able to remove? Or is it that there is still relatively little benefit for people in interacting with someone far away? These would be interesting questions for both computer scientists and social scientists in the future.</p>
<h1>References</h1>
<ul>
<li><a href="http://www.facebook.com/data">Facebook Data Team</a></li>
<li><a href="http://cameronmarlow.com/papers/find-me-if-you-can">Find Me If You Can: Improving Geographical Prediction with Social and Spatial Proximity</a>, L. Backstrom, E. Sun and C. Marlow, WWW 2010.</li>
<li><a href="http://dmitriwilliams.com/proximity.pdf">Virtually There: The Role of Proximity and Homophily in Virtual World Networks</a>, Y. Huang, C. Shen, D. Williams, N. Contractor, SIN-09.</li>
<li><a href="http://www.amazon.com/Death-Distance-Communications-Revolution-Change/dp/0875848060">The Death of Distance: How the Communications Revolution Will Change Our Lives</a>, F. Cairncross, 1997</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.quuxlabs.com/blog/2010/09/location-and-friendship-data-mining-in-facebook/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hadoop tutorials available</title>
		<link>http://www.quuxlabs.com/blog/2010/09/hadoop-tutorials-available/</link>
		<comments>http://www.quuxlabs.com/blog/2010/09/hadoop-tutorials-available/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 08:23:35 +0000</pubDate>
		<dc:creator>Michael Noll</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cluster]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[parallel processing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">http://www.quuxlabs.com/?p=666</guid>
		<description><![CDATA[With the relaunch of our quuxlabs website, I have also migrated my Hadoop articles from my personal homepage to our tutorials section. The tutorials cover the following topics: Writing An Hadoop MapReduce Program In Python Running Hadoop On Ubuntu Linux &#8230; <a href="http://www.quuxlabs.com/blog/2010/09/hadoop-tutorials-available/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>With the relaunch of our quuxlabs website, I have also migrated my Hadoop articles from <a href="http://www.michael-noll.com/">my personal homepage</a> to our <a href="http://www.quuxlabs.com/tutorials/">tutorials</a> section. The tutorials cover the following topics:</p>
<ul>
<li><a href="http://www.quuxlabs.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">Writing An Hadoop MapReduce Program In Python</a></li>
<li><a href="http://www.quuxlabs.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/">Running Hadoop On Ubuntu Linux (Single-Node Cluster)</a></li>
<li><a href="http://www.quuxlabs.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/">Running Hadoop On Ubuntu Linux (Multi-Node Cluster)</a></li>
</ul>
<p>Of course, we are always happy to receive your feedback. Several great additions and clarifications have come from our readers in the past. Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.quuxlabs.com/blog/2010/09/hadoop-tutorials-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reference implementation of SPEAR ranking algorithm released</title>
		<link>http://www.quuxlabs.com/blog/2010/07/reference-implementation-of-spear-ranking-algorithm-released/</link>
		<comments>http://www.quuxlabs.com/blog/2010/07/reference-implementation-of-spear-ranking-algorithm-released/#comments</comments>
		<pubDate>Sat, 10 Jul 2010 18:24:16 +0000</pubDate>
		<dc:creator>Michael Noll</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[expertise]]></category>
		<category><![CDATA[experts]]></category>
		<category><![CDATA[foss]]></category>
		<category><![CDATA[hits]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[license:gpl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[reference]]></category>
		<category><![CDATA[release]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[spear]]></category>

		<guid isPermaLink="false">http://www2.quuxlabs.com/?p=255</guid>
		<description><![CDATA[We have just released the &#8220;reference&#8221; implementation of our SPEAR ranking algorithm. The library is written in the Python programming language, and should be straight-forward to use. You can install the library via Python&#8217;s setuptools/easy_install or download it from GitHub. &#8230; <a href="http://www.quuxlabs.com/blog/2010/07/reference-implementation-of-spear-ranking-algorithm-released/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We have just released the <a href="http://github.com/quuxlabs/Spear">&#8220;reference&#8221; implementation</a> of our <a href="http://www.spear-algorithm.org/">SPEAR ranking algorithm</a>. The <a href="http://github.com/quuxlabs/Spear">library</a> is written in the Python programming language, and should be straight-forward to use. You can install the library via Python&#8217;s setuptools/easy_install or download it from <a href="http://github.com/quuxlabs/Spear">GitHub</a>.</p>
<p>Here&#8217;s a quick example on how to use it:</p>
<pre class="brush:python">&gt;&gt;&gt; import spear
&gt;&gt;&gt; activities = [
... (datetime.datetime(2010,7,1,9,0,0), "alice", "http://quuxlabs.com/"),
... (datetime.datetime(2010,8,1,12,45,0), "bob", "http://quuxlabs.com/"),
... ]
&gt;&gt;&gt; spear_algorithm = spear.Spear(activities)
&gt;&gt;&gt; expertise_results, quality_results = spear_algorithm.run()</pre>
<p><span id="more-255"></span><br />
Get the top user and his expertise score:</p>
<pre class="brush:python">&gt;&gt;&gt; expertise_score, user = expertise_results[0]
&gt;&gt;&gt; print "%s =&gt; %.4f" % (user, expertise_score)
alice =&gt; 0.5858</pre>
<p>Get the top resource and its quality score:</p>
<pre class="brush:python">&gt;&gt;&gt; quality_score, resource = quality_results[0]
&gt;&gt;&gt; print "%s =&gt; %.4f" % (resource, quality_score)
http://quuxlabs.com/ =&gt; 1.0000</pre>
<p>You can also use the library to simulate the <a href="http://en.wikipedia.org/wiki/HITS_algorithm">HITS algorithm</a> of Jon Kleinberg. Simply supply a credit score function C(x) = 1 to the SPEAR algorithm (see the documentation of the Spear.run() method).</p>
<p>Feel free to play around with it and send us feedback.</p>
<p>PS: The SPEAR Python library requires <a href="http://www.scipy.org/">SciPy/NumPy</a>. If you don&#8217;t have these installed already, here are <a href="http://www.scipy.org/Installing_SciPy">some installation instructions</a> to get you started.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.quuxlabs.com/blog/2010/07/reference-implementation-of-spear-ranking-algorithm-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
