What are one-time passwords? Wikipedia gives the following definition:
A one-time password (OTP) is a password that is valid for only one login session or transaction. OTPs avoid a number of shortcomings that are associated with traditional (static) passwords. The most important shortcoming that is addressed by OTPs is that, in contrast to static passwords, they are not vulnerable to replay attacks. This means that, if a potential intruder manages to record an OTP that was already used to log into a service or to conduct a transaction, he will not be able to abuse it since it will be no longer valid. On the downside, OTPs cannot be memorized by human beings. Therefore they require additional technology in order to work.
Now let us look at why and when one would want to use OTP. For instance, imagine being at a place such as a cybercafé or a hotel where you cannot trust the local IT infrastructure. Unfortunately, you really need to access your email inbox through a non-encrypted communication channel (e.g. HTTP instead of HTTPS, or IMAP instead of IMAPS), but at the same time you don’t want to risk revealing your email password to the rest of the world. An OTP provides a solution: a unique password that changes at each authentication. Even if the password is captured by some bad guys as it travels across the networks, they will not be able to re-use it for their malicious purposes because the password is valid only once, or only for a short period of time.
So while OTPs are an interesting solution for such scenarios, they literally come at a price. A proprietary OTP solution can be quite expensive, especially if we consider that the required security devices such as hardware security tokens can cost more than 50,- EUR per person/year. Additionally, hardware tokens and their corresponding infrastructure are often proprietary and require a lot of effort to set up and maintain.
Can we improve the adoption of this security technology? How could we adapt OTP — the tools, the processes, etc. — to make this approach more accessible? And can we make it cheaper (e.g. using it for emergency access for a large number of users)?
As OTPs rely on cryptographic functions to operate, we started our brainstorming by digging into one of the existing open standards describing OTP operation: “HOTP: An HMAC-Based One-Time Password Algorithm” (RFC 4226). HOTP describes a counter-based OTP relying on the HMAC-SHA-1 algorithm. In other words, the corresponding token relies on a counter that is incremented for each authentication, combined with a secret embedded in the token (and shared with the remote server responsible for verifying the authentication).
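To make this concrete, here is a minimal sketch of the HOTP computation from RFC 4226, written for this post in Python using only the standard hmac, hashlib and struct modules (it illustrates the algorithm itself; it is not the code of any vendor’s token, nor our Perl script):

```python
import hashlib
import hmac
import struct

def hotp(secret, counter, digits=6):
    """Compute an HOTP value as specified in RFC 4226."""
    # HMAC-SHA-1 over the 8-byte big-endian counter
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects an offset
    offset = mac[-1] & 0x0F
    code = ((mac[offset] & 0x7F) << 24 |
            mac[offset + 1] << 16 |
            mac[offset + 2] << 8 |
            mac[offset + 3])
    return code % (10 ** digits)

# RFC 4226 test vector: secret "12345678901234567890", counter 0
print("%06d" % hotp(b"12345678901234567890", 0))  # prints 755224
```

Both the token and the server run this same computation with the shared secret; they only need to agree on the current counter value.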
The cost (and complexity) of such a solution depends on two major components: the security tokens handed out to the users on the client side, and the authentication infrastructure on the server side.
Large OTP vendors rely on the scheme outlined above to make their business. Even though the OTP algorithms themselves are standardized, the majority of vendors lock users into their own proprietary solutions so that users cannot choose to buy tokens from other suppliers.
Looking for an alternative to the commercial solutions described above, we asked ourselves: Is a simple, low-cost OTP solution possible? What if we could remove the dependency on an electronic device on the client side, i.e. the hardware token? And how could we come up with an affordable solution on the server side?
We remembered an older blog post by security expert Bruce Schneier about password management for users:
“Simply, people can no longer remember passwords good enough to reliably defend against dictionary attacks, and are much more secure if they choose a password too complicated to remember and then write it down. We’re all good at securing small pieces of paper. I recommend that people write their passwords down on a small piece of paper, and keep it with their other valuable small pieces of paper: in their wallet.” —Bruce Schneier 2005
With this in mind, what if we came up with a security token based on a sheet of plain, old-school paper? As hopeless book aficionados, we thought that this would also be true to the spirit of Johannes Gutenberg.
We decided to give it a go and have thus written a Perl script called “Paper Token” which generates a printable page of pre-calculated tokens for OTP.
You will have to set the initial counter (for a new token, this will usually start at zero) and the final counter. The Perl script will generate all values from the starting counter up to the final counter. The unique key of the token (in this example, this is the RFC 4226 test vector value — so do not use it in a production environment…) needs to be set to the value known by the remote server.
Here is a simple example of running the Paper Token script:
perl paper-token.pl --output test.pdf --counter 0 --end 200 --secret 3132333435363738393031323334353637383930 --digits 6
The script generates a PDF document that can be handed to the user either in digital form or directly as a printed paper sheet (see the photo below for how the token sheet looks in practice). The user then simply reads the paper token from left to right and strikes through each value once it has been used for an authentication, i.e. very similar to how TANs work in the context of online banking (besides that a TAN is used in conjunction with your password, whereas the OTP replaces the password). The user keeps the paper token in his wallet just like any other valuable cards or papers.
This experiment shows the feasibility of using paper tokens even without altering the security of the authentication process. Very often, we give up some security for the convenience of using a publicly accessible or shared computer. Why don’t Web services allow for an alternative way of user authentication such as OTP? Maybe ease of use and user convenience are the answers to this question. Still, a simple OTP approach such as paper tokens might be feasible in practice without investing too much into existing infrastructures or adding too much complexity to existing security processes. Hey, we’re brainstorming here after all!
Another potential use of such a technology is OTP as a password/account recovery mechanism. A lot of online services rely on those silly “security questions” like What is your mother’s maiden name? or What is your favorite color?, and these are often a source of security problems, as described in What’s in a Name? Evaluating Statistical Attacks on Personal Knowledge Questions. (Just ask yourself: how many colors are there that users might have picked as their favorite? Not many? Exactly!) Maybe we could replace these easy-to-guess security questions with a pre-calculated OTP paper sheet?
Of course, we are not claiming to have found a miracle solution. Some questions and tasks are still unfinished (we haven’t talked much about the server side, for instance). But we want to initiate a discussion about simple, low-cost OTP solutions, and not come to this discussion empty-handed but with a working piece of software that could serve as a starting point for further activities. There are plenty of opportunities to improve security while keeping the cost at a reasonable level and the technology accessible. It’s often just a matter of creativity…
Recommendations can be generated by a wide range of algorithms. While user-based or item-based collaborative filtering methods are simple and intuitive, matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items. Of course, matrix factorization is simply a mathematical tool for playing around with matrices, and is therefore applicable in many scenarios where one would like to find out something hidden under the data.
In this tutorial, we will go through the basic ideas and the mathematics of matrix factorization, and then we will present a simple implementation in Python. We will proceed with the assumption that we are dealing with user ratings (e.g. an integer score from the range of 1 to 5) of items in a recommendation system.
Just as its name suggests, matrix factorization is, obviously, the factorization of a matrix, i.e. finding two (or more) matrices such that when you multiply them you get back the original matrix.
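As a tiny numeric illustration (with made-up numbers), multiplying the two factor matrices reproduces the original one:

```python
import numpy

# A 2x2 matrix R that factors exactly into P and Q (invented toy numbers)
P = numpy.array([[1.0, 2.0],
                 [3.0, 4.0]])
Q = numpy.array([[2.0, 0.0],
                 [1.0, 1.0]])
R = numpy.dot(P, Q)  # multiplying the factors gives back R = [[4, 2], [10, 4]]
```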
As I have mentioned above, from an application point of view, matrix factorization can be used to discover latent features underlying the interactions between two different kinds of entities. (Of course, you can consider more than two kinds of entities and you will be dealing with tensor factorization, which would be more complicated.) And one obvious application is to predict ratings in collaborative filtering.
In a recommendation system such as Netflix or MovieLens, there is a group of users and a set of items (movies for the above two systems). Given that each user has rated some items in the system, we would like to predict how the users would rate the items that they have not yet rated, such that we can make recommendations to the users. In this case, all the information we have about the existing ratings can be represented in a matrix. Assuming now that we have 5 users and 4 items, and ratings are integers ranging from 1 to 5, the matrix may look something like this (a hyphen means that the user has not yet rated the movie):
|    | D1 | D2 | D3 | D4 |
|----|----|----|----|----|
| U1 | 5  | 3  | -  | 1  |
| U2 | 4  | -  | -  | 1  |
| U3 | 1  | 1  | -  | 5  |
| U4 | 1  | -  | -  | 4  |
| U5 | -  | 1  | 5  | 4  |
Hence, the task of predicting the missing ratings can be considered as filling in the blanks (the hyphens in the matrix) such that the values would be consistent with the existing ratings in the matrix.
The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates an item. For example, two users would give high ratings to a certain movie if they both like the actors/actresses of the movie, or if the movie is an action movie, which is a genre preferred by both users. Hence, if we can discover these latent features, we should be able to predict a rating with respect to a certain user and a certain item, because the features associated with the user should match with the features associated with the item.
In trying to discover the different features, we also make the assumption that the number of features would be smaller than the number of users and the number of items. It should not be difficult to understand this assumption because clearly it would not be reasonable to assume that each user is associated with a unique feature (although this is not impossible). And anyway if this is the case there would be no point in making recommendations, because each of these users would not be interested in the items rated by other users. Similarly, the same argument applies to the items.
Having discussed the intuition behind matrix factorization, we can now go on to work on the mathematics. Firstly, we have a set of users $U$ and a set of items $D$. Let $\mathbf{R}$ of size $|U| \times |D|$ be the matrix that contains all the ratings that the users have assigned to the items. Also, we assume that we would like to discover $K$ latent features. Our task, then, is to find two matrices $\mathbf{P}$ (a $|U| \times K$ matrix) and $\mathbf{Q}$ (a $|D| \times K$ matrix) such that their product approximates $\mathbf{R}$:

$$\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}$$
In this way, each row of $\mathbf{P}$ would represent the strength of the associations between a user and the features. Similarly, each row of $\mathbf{Q}$ would represent the strength of the associations between an item and the features. To get the prediction of a rating of an item $d_j$ by user $u_i$, we can calculate the dot product of the two vectors corresponding to $u_i$ and $d_j$:

$$\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{K} p_{ik} q_{kj}$$
Now, we have to find a way to obtain $\mathbf{P}$ and $\mathbf{Q}$. One way to approach this problem is to first initialize the two matrices with some values, calculate how ‘different’ their product is from $\mathbf{R}$, and then try to minimize this difference iteratively. Such a method is called gradient descent, aiming at finding a local minimum of the difference.
The difference here, usually called the error between the estimated rating and the real rating, can be calculated by the following equation for each user-item pair:

$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \left(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^2$$
Here we consider the squared error because the estimated rating can be either higher or lower than the real rating.
To minimize the error, we have to know in which direction we have to modify the values of $p_{ik}$ and $q_{kj}$. In other words, we need to know the gradient at the current values, and therefore we differentiate the above equation with respect to these two variables separately:

$$\frac{\partial}{\partial p_{ik}} e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij}) q_{kj} = -2 e_{ij} q_{kj}$$

$$\frac{\partial}{\partial q_{kj}} e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij}) p_{ik} = -2 e_{ij} p_{ik}$$
Having obtained the gradient, we can now formulate the update rules for both $p_{ik}$ and $q_{kj}$:

$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}} e_{ij}^2 = p_{ik} + 2\alpha e_{ij} q_{kj}$$

$$q'_{kj} = q_{kj} - \alpha \frac{\partial}{\partial q_{kj}} e_{ij}^2 = q_{kj} + 2\alpha e_{ij} p_{ik}$$
Here, $\alpha$ is a constant whose value determines the rate of approaching the minimum. Usually we will choose a small value for $\alpha$, say 0.0002. This is because if we make too large a step towards the minimum, we run the risk of overshooting it and end up oscillating around the minimum.
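To see the update rules in action, here is a single gradient descent step for one observed rating, using invented feature vectors with K = 2:

```python
import numpy

alpha = 0.0002                  # learning rate
r_ij = 5.0                      # one observed rating
p_i = numpy.array([0.5, 0.8])   # current user feature vector (made-up values)
q_j = numpy.array([0.4, 0.3])   # current item feature vector (made-up values)

e_ij = r_ij - numpy.dot(p_i, q_j)        # prediction error
p_i_new = p_i + alpha * 2 * e_ij * q_j   # update rule for p_ik
q_j_new = q_j + alpha * 2 * e_ij * p_i   # update rule for q_kj
```

After the step, the prediction moves slightly closer to the observed rating; repeating this over all observed ratings is exactly the iteration implemented later in this post.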
A question might have come to your mind by now: if we find two matrices $\mathbf{P}$ and $\mathbf{Q}$ such that $\mathbf{P} \times \mathbf{Q}^T$ approximates $\mathbf{R}$, won’t our predictions of all the unseen ratings be zeros? In fact, we are not really trying to come up with $\mathbf{P}$ and $\mathbf{Q}$ such that we can reproduce $\mathbf{R}$ exactly. Instead, we will only try to minimise the errors of the observed user-item pairs. In other words, if we let $T$ be a set of tuples, each of which is of the form $(u_i, d_j, r_{ij})$, such that $T$ contains all the observed user-item pairs together with the associated ratings, we are only trying to minimise every $e_{ij}$ for $(u_i, d_j, r_{ij}) \in T$. (In other words, $T$ is our set of training data.) As for the rest of the unknowns, we will be able to determine their values once the associations between the users, items and features have been learnt.
Using the above update rules, we can then iteratively perform the operation until the error converges to its minimum. We can check the overall error as calculated by the following equation and determine when we should stop the process:

$$E = \sum_{(u_i, d_j, r_{ij}) \in T} e_{ij}^2 = \sum_{(u_i, d_j, r_{ij}) \in T} \left(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^2$$
The above algorithm is a very basic algorithm for factorizing a matrix. There are a lot of methods to make things look more complicated. A common extension to this basic algorithm is to introduce regularization to avoid overfitting. This is done by adding a parameter $\beta$ and modifying the squared error as follows:

$$e_{ij}^2 = \left(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^2 + \frac{\beta}{2} \sum_{k=1}^{K} \left(p_{ik}^2 + q_{kj}^2\right)$$
In other words, the new parameter $\beta$ is used to control the magnitudes of the user-feature and item-feature vectors such that $\mathbf{P}$ and $\mathbf{Q}$ give a good approximation of $\mathbf{R}$ without having to contain large numbers. In practice, $\beta$ is set to some value in the range of 0.02. The new update rules for this squared error can be obtained by a procedure similar to the one described above. The new update rules are as follows:

$$p'_{ik} = p_{ik} + \alpha (2 e_{ij} q_{kj} - \beta p_{ik})$$

$$q'_{kj} = q_{kj} + \alpha (2 e_{ij} p_{ik} - \beta q_{kj})$$
Once we have derived the update rules as described above, it actually becomes very straightforward to implement the algorithm. The following is a function that implements the algorithm in Python (note that this implementation requires the numpy module).
import numpy

def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    Q = Q.T
    for step in range(steps):
        # one pass of stochastic gradient descent over all observed ratings
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:], Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # compute the overall regularized error
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:], Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k], 2) + pow(Q[k][j], 2))
        if e < 0.001:
            break
    return P, Q.T
We can try to apply it to our example mentioned above and see what we would get. Below is a code snippet in Python for running the example.
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]

R = numpy.array(R)

N = len(R)
M = len(R[0])
K = 2

P = numpy.random.rand(N, K)
Q = numpy.random.rand(M, K)

nP, nQ = matrix_factorization(R, P, Q, K)
nR = numpy.dot(nP, nQ.T)
And the matrix obtained from the above process would look something like this:
|    | D1   | D2   | D3   | D4   |
|----|------|------|------|------|
| U1 | 4.97 | 2.98 | 2.18 | 0.98 |
| U2 | 3.97 | 2.40 | 1.97 | 0.99 |
| U3 | 1.02 | 0.93 | 5.32 | 4.93 |
| U4 | 1.00 | 0.85 | 4.59 | 3.93 |
| U5 | 1.36 | 1.07 | 4.89 | 4.12 |
We can see that for the existing ratings we have approximations very close to the true values, and we also get some ‘predictions’ for the unknown values. In this simple example, we can easily see that U1 and U2 have similar taste and both rated D1 high, while the rest of the users preferred D3 and D4. When the number of features (K in the Python code) is 2, the algorithm is able to associate the users and items with two different features, and the predictions also follow these associations. For example, the predicted rating of U4 on D3 is 4.59, because U4 and U5 both rated D4 high and U5 also rated D3 high.
We have discussed the intuitive meaning of the technique of matrix factorization and its use in collaborative filtering. In fact, there are many different extensions to the above technique. An important extension is the requirement that all the elements of the factor matrices ($\mathbf{P}$ and $\mathbf{Q}$ in the above example) should be non-negative. In this case it is called non-negative matrix factorization (NMF). One advantage of NMF is that it results in intuitive meanings of the resultant matrices. Since no elements are negative, the process of multiplying the resultant matrices to get back the original matrix would not involve subtraction, and can be considered as a process of generating the original data by linear combinations of the latent features.
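As a rough sketch of how NMF differs, the classic multiplicative update rules of Lee and Seung can be written down in a few lines (this is a generic illustration on a dense non-negative matrix, not part of this tutorial’s downloadable code):

```python
import numpy

def nmf(V, K, steps=1000, eps=1e-9):
    # Multiplicative updates keep W and H non-negative throughout, so V is
    # approximated by purely additive combinations of the latent features.
    m, n = V.shape
    rng = numpy.random.RandomState(0)
    W = rng.rand(m, K)
    H = rng.rand(K, n)
    for _ in range(steps):
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
        W *= V.dot(H.T) / (W.dot(H).dot(H.T) + eps)
    return W, H
```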
The full Python source code of this tutorial is available for download at:
There is quite a lot of literature on matrix factorization. Below are some of the related papers.
Nowadays, we have more and more people connected to the Internet, and many of these Internet users participate in various kinds of social interactions on the Web. Most notably, users enjoy social networking sites such as Facebook to establish their personal profile online and to keep track of what their friends are doing.
As a result, huge volumes of data of social interactions can be collected much more easily nowadays. Analyzing this data can give insight into how different factors such as location, distance and friendship influence the way people interact with each other.
The currently largest online social networking site is Facebook. It claims to have a community of more than 500 million registered users, and it is ranked second in terms of Internet traffic. To serve this large number of users and to keep track of all the user activities, Facebook is said to have been running more than 60,000 servers.
Facebook may just be a place where friends catch up with each other. However, it has grown to become a world-wide network of frequent and dynamic interactions. It is a place where social theories can be tested due to the diversity of the backgrounds of the users. The social data collected by Facebook is valuable both in generating more personalized services and in understanding more about the users themselves (see for example Facebook’s project on gross national happiness).
Recently, Facebook published a paper in the 19th International World Wide Web Conference with the title Find Me If You Can: Improving Geographical Prediction with Social and Spatial Proximity (Backstrom et al., 2010). The paper offers interesting insights into the relationship between geographical distance and friendship of users. Let’s take a look at its content in this blog post.
As the title of the paper suggests, it is about predicting a user’s location based on information about them available in Facebook. The application of such a technique is quite broad. It can be used to detect abnormal usage, to improve user experience, or to optimize the presentation of advertisements. In addition, the study also gives insight into the relationship between social ties and geographic distance: Do users have more friends who are located close to them and fewer friends who are living further away? Or is it true that the Internet has removed the limitations of physical distance?
The authors report that 6% of the 100 million Facebook users in the US have put their home addresses in their profiles. Out of these addresses, 60% can be converted into geo-locations (latitude and longitude). As a result, the study makes use of a dataset that contains about 3.5 million user profiles with precisely known home addresses.
Among these 3.5 million users, 2.9 million of them have at least one friend with a valid address. On average, these users have about 10 friends with valid addresses. If we consider the social network of these users, there are about 30.6 million relations between users with precisely known locations (representing the social network as a graph would result in a graph with 2.9 million vertices/nodes and 30.6 million edges). It is a huge data set and, as the authors have noted, it features much more precise location information compared to previous studies which usually only deduce location information from IP addresses.
One interesting thing revealed in the paper is the distribution of population across the country. The authors divide the population of the US into three groups of roughly the same size according to the population density of their locations, and then plot the average number of people living x miles away against the distance from these locations. The graph shows nicely the changes in population distribution as one moves away from a certain location. For densely populated areas, for instance, the number of people increases with the distance in the beginning as one considers a larger area within a large city, but it decreases as soon as one moves to the suburban areas outside the city. The lines of the three groups join each other at a point when the distance is far enough that the average population density of the country is reflected.
Of course, this result seems to be quite obvious and intuitive. But still it is interesting that data from a social networking site can be used to show such a characteristic of population density.
The second aspect the paper investigates is the probability of friendship as a function of geographical distance. Firstly, the distance between each pair of friends is calculated. Secondly, the probability of friendship at a certain distance is obtained by calculating the ratio between the number of pairs of friends and the total number of pairs of users at that distance. The authors observed that the probability of friendship is inversely proportional to distance at medium to long-range distances. In other words, you are less likely to have friends who live further away. However, at shorter distances friendship is less sensitive to the effect of distance. Imagine an apartment building with multiple families — just living there wouldn’t necessarily imply being friends with everyone.
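The measurement itself is simple to sketch: bin all user pairs by distance, then divide the number of friend pairs by the number of all pairs in each bin. The following toy example uses invented pair data, purely to illustrate the procedure:

```python
import collections
import math

# (distance in miles, are the two users friends?) -- invented toy data
pairs = [
    (1.0, True), (2.0, True), (3.0, False), (10.0, True),
    (12.0, False), (50.0, False), (55.0, True), (300.0, False),
]

bins = collections.defaultdict(lambda: [0, 0])  # bin -> [friend pairs, all pairs]
for dist, friends in pairs:
    b = int(math.log10(max(dist, 1.0)))  # logarithmic distance bins
    bins[b][1] += 1
    if friends:
        bins[b][0] += 1

for b in sorted(bins):
    friend_count, total = bins[b]
    print("~10^%d miles: P(friendship) = %.2f" % (b, friend_count / total))
```

With real data, plotting these per-bin probabilities against distance yields the inverse relationship the authors report.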
What is more interesting is that when the authors break down the users into different groups according to the population density of the areas in which they live, they find out that the probability distribution of friendship is different for different groups of users. For users living in low density areas, they are more likely to have friends within a short distance — seems like people in rural areas ‘stick together’. On the other hand, users living in high density areas are slightly more likely to have friends living far away. The authors suggest that this may be because that ‘people living in metropolitan areas are more cosmopolitan’.
The second part of the paper focuses on the problem of predicting the location of a user. Here, the question is: can we predict the location of a user if we know the locations of some or all of his friends? One could simply assume that a user would most likely be located close to his friends, and thus predict the user’s location using, for example, the mean or median of his friends’ locations. However, since geographically there are places where no people live, such simple approaches may produce strange — and funny — results when a user has friends from different places in the US. Imagine having two friends, one on Hawaii and one in Los Angeles — now calculate the mean of these two locations and you might end up in the middle of the Pacific Ocean…
In fact, given the probability distribution mentioned above, one can also consider a maximum likelihood estimator of the location of a user. For example, we can calculate the probability that a user’s location is at point u by multiplying the probabilities of friendship given the distances between location u and the locations of each of the user’s friends. However, for a large social network and a huge geographical area, such an approach would be computationally very expensive.
The solution to this problem suggested by the authors is to assume that a user would always co-locate with one of his/her friends. In this way the number of points that have to be evaluated becomes much smaller (i.e. the search space becomes much smaller). The paper presents some experiments and shows that this method of predicting the location of a user — while still being only a rough estimation — is more accurate than the current ‘standard’ of using the IP addresses of the users for deducing location information.
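A toy version of this restricted search is easy to write down: evaluate the likelihood only at the friends’ own coordinates, under some assumed distance-to-friendship-probability model. Both the inverse-distance model and the coordinates below are invented for illustration; the paper fits its own probability curve to the Facebook data:

```python
import math

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def friendship_prob(d):
    # assumed toy model: probability inversely proportional to distance,
    # clipped at short range (a stand-in for the curve fitted in the paper)
    return 1.0 / max(d, 1.0)

def predict_location(friend_locations):
    # evaluate candidate locations only at friends' coordinates
    best, best_score = None, float("-inf")
    for candidate in friend_locations:
        # log-likelihood: sum of log-probabilities over the other friends
        score = sum(math.log(friendship_prob(distance(candidate, f)))
                    for f in friend_locations if f != candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# three friends clustered together plus one far-away friend
friends = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
print(predict_location(friends))  # picks a point inside the cluster
```

Note how restricting candidates to friends’ coordinates avoids the mean-of-locations pitfall: the prediction can never land in the middle of the Pacific Ocean.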
The paper of Backstrom, Sun and Marlow describes the analysis of Facebook data and the technique of location prediction in more detail. I think this paper is interesting because it shows how the large volume of user data collected from online social networking sites can be used to study different kinds of social issues. With this kind of research, we can expect computer science to play a more important role in the study of social issues and in the validation of different social theories.
The paper shows that people are still rather bounded by physical distance. It is still more likely that people have friends who are physically located close to themselves, even though the Internet has provided different channels for users to make friends with someone living far away.
In fact, some other studies have also arrived at similar conclusions. For example, a group of researchers from Northwestern University have studied user logs from a popular massively multi-player online game (MMOG), EverQuest II (Huang et al. 2009). The researchers found that although players can meet other players from all over the world in the game, and that homophily (similarity in certain characteristics) has certain influence on user behaviour, players are still more likely to interact with others who are geographically close to themselves, especially when the interaction involves partnership and collaboration in the game.
Frances Cairncross first wrote about the ‘death of distance’ more than a decade ago. However, even now, when people can connect to the Internet wherever they want, geographical proximity still seems to play an important role in determining with whom people interact. Is this a barrier that even the Internet and the World Wide Web are not able to remove? Or is it that there is still relatively little benefit for people in interacting with someone far away? These would be interesting questions for both computer scientists and social scientists in the future.
Of course, we are always happy to receive your feedback. Several great additions and clarifications have come from our readers in the past. Enjoy!
Here’s a quick example of how to use it:
>>> import datetime
>>> import spear
>>> activities = [
...     (datetime.datetime(2010,7,1,9,0,0), "alice", "http://quuxlabs.com/"),
...     (datetime.datetime(2010,8,1,12,45,0), "bob", "http://quuxlabs.com/"),
... ]
>>> spear_algorithm = spear.Spear(activities)
>>> expertise_results, quality_results = spear_algorithm.run()
Get the top user and his expertise score:
>>> expertise_score, user = expertise_results[0]
>>> print "%s => %.4f" % (user, expertise_score)
alice => 0.5858
Get the top resource and its quality score:
>>> quality_score, resource = quality_results[0]
>>> print "%s => %.4f" % (resource, quality_score)
http://quuxlabs.com/ => 1.0000
You can also use the library to simulate the HITS algorithm of Jon Kleinberg. Simply supply a credit score function C(x) = 1 to the SPEAR algorithm (see the documentation of the Spear.run() method).
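For reference, the core HITS iteration itself is short enough to sketch independently of the spear library; here it is on a toy user-resource matrix (data invented for illustration):

```python
import numpy

# A[u][r] = 1 if user u acted on resource r (toy data: two users, three resources)
A = numpy.array([[1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0]])

hub = numpy.ones(A.shape[0])       # hub scores, akin to user expertise
for _ in range(50):
    auth = A.T.dot(hub)            # authority scores, akin to resource quality
    auth /= numpy.linalg.norm(auth)
    hub = A.dot(auth)
    hub /= numpy.linalg.norm(hub)

# the middle resource is used by both users, so it ends up with the top score
print(auth.argmax())  # prints 1
```

SPEAR refines this mutual-reinforcement idea with its credit score function; setting C(x) = 1 collapses it back to plain HITS, as noted above.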
Feel free to play around with it and send us feedback.
PS: The SPEAR Python library requires SciPy/NumPy. If you don’t have these installed already, here are some installation instructions to get you started.