Revolutions

Because it's Friday: What's a video store?

2013-10-25T17:33:28-07:00

Kids these days have it easy. In my day, you had to walk in the snow uphill both ways just to see a grainy VHS copy of E.T. the Extra-Terrestrial, but now it's just a tap o' the iPad away. But seriously, this video is sweet and sentimental and brings back some good memories of a time before streaming:

Have a great weekend! We'll be back on Monday!

How to switch from spreadsheets to R for data analysis

2013-10-25T16:56:49-07:00

Taking a spreadsheet beyond what it's designed for (data presentation, summarization and simple calculations) into the world of complex data analysis can be an alluring prospect. But it can also be dangerous: consider these examples of spreadsheet errors that led to monumental financial losses, mistaken government policies, and even the wrong drugs being given to cancer patients.

The answer is to move your analysis into a computing environment specifically designed for data analysis: R. Burns Statistics provides a step-by-step tutorial on transitioning from spreadsheets to R. If you care about the accuracy of your analysis, or even just being able to reproduce your results again in the future, you should check it out at the link below.

Burns Statistics: A first step towards R from spreadsheets
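To give a flavour of what the move looks like, here is a minimal sketch in base R. It is not taken from the Burns Statistics tutorial; the column names and values are made up for illustration, standing in for a sheet exported to CSV.

```r
# A few rows as they might look after exporting a spreadsheet to CSV
# (hypothetical columns and made-up values, for illustration only)
csv_text <- "region,quarter,sales
North,Q1,120
North,Q2,135
South,Q1,90
South,Q2,110"

# read.csv() normally takes a file name; text= lets the example stay self-contained
sales <- read.csv(text = csv_text)

# Total sales per region: the equivalent of a small pivot table
totals <- aggregate(sales ~ region, data = sales, FUN = sum)
print(totals)
```

Because every step is a line in a script rather than a formula hidden in a cell, the analysis can be rerun and reviewed later, which is the reproducibility argument made above.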

NESSIS 2013: The future of sports statistics is here!

2013-10-24T07:30:00-07:00

by Joseph Rickert

We have been following sports statistics regularly on the Revolutions Blog with quite a few sports-related posts this year. In one post I did back in April about the Lahman R package for baseball statistics I speculated on how baseball was poised to move from Moneyball-style predictive analytics to real-time descriptive stats by showing strike zone heat maps overlaying TV images of batters swinging away. Sports statistics, however, is moving much more quickly than I imagined. Apparently, the NBA is blowing by this milestone and setting up to do real-time predictions.

Recently Mark Glickman, one of the organizers of NESSIS 2013 (the New England Symposium on Statistics in Sports), sent me links to the slides and videos of the presentations made at the conference. There are several excellent presentations here, but I was astounded by Dan Cervone's presentation on "State of Transition: Estimating Real-Time Expected Possession Value in the NBA with a Spatiotemporal Transition Model and Player Tracking Data".

Dan, a Harvard graduate student, describes how he and his fellow researchers are using data from an optical tracking system developed by STATS, and scheduled to be installed in all 30 NBA arenas, to build predictive state transition models. The optical system tracks the 2D locations of all 10 players on the court as well as the 3D position of the ball by taking 25 images per second. Using the 800 million data points generated from only 515 games, the Harvard researchers are trying to answer questions like "How many points is a team expected to score given the spatial evolution of its possession up to time t?"

EPV = E[X | F(t)], where X = number of points scored on this possession (unknown), and F(t) = space-time information of the possession up to time t.
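To make the conditional expectation concrete: if a model supplied the probability of each possible scoring outcome given the possession so far, EPV would be the probability-weighted sum of those outcomes. The sketch below uses made-up probabilities purely to show the arithmetic; they are not values from the Harvard model, which estimates these quantities from the tracking data.

```r
# Possible outcomes of a possession, in points
points <- c(0, 2, 3)

# Hypothetical P(X = x | F(t)) at one instant; made-up values that sum to 1
prob <- c(0.55, 0.35, 0.10)

# E[X | F(t)] as a probability-weighted sum over the outcomes
epv <- sum(points * prob)
print(epv)
```

As the possession evolves, F(t) changes, the conditional probabilities shift, and the EPV moves with them.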

The following graph shows spatial effect surface plots for some San Antonio Spurs players. These surfaces are components of the predictive model.

Just how big this kind of modeling is expected to be can be inferred from the remarks made by Mike Zarren, Assistant GM of the Boston Celtics, at the beginning of Dan's presentation. Speaking about plans for the continued availability of the data, Mr. Zarren says "I've talked with people on both sides, at the league and also at Stats, and both are still interested in researchers getting some access to this data, but exactly what the model looks like is still up to debate". My guess is that there will be some serious money riding on this data and the predictive models based on it.

All of the NESSIS presentations exhibit a fairly high level of statistical play. In addition to Dan's presentation, there are four more basketball-related studies, one each on the Boston Marathon, soccer and tennis, one on football about using Random Forest models to estimate win probabilities on each play during a game, and three presentations on baseball, including an R-based analysis of "streakiness" by Jim Albert, longtime R contributor and editor of the Journal of Quantitative Analysis in Sports. At the beginning of his talk Jim recounts how early in his career he was surprised to find an analysis of baseball data in a paper by Brad Efron and Carl Morris on Stein's Paradox in Statistics. At that time, Jim remarks, "you don't write about sports to get tenure... maybe times have changed": maybe they have.

Customize your R session with .Rprofile

2013-10-23T18:55:46-07:00

The .Rprofile file is a great way to customize your R session every time you start it up. You can use it to change R's defaults, define handy command-line functions, automatically load your favourite packages: anything you like! The Getting Genetics Done blog has a nice example .Rprofile file to give you some inspiration on what to do. One popular setting is options(stringsAsFactors=FALSE), which prevents R from converting character data into factor objects when you import data frames.

One word of warning: if you often share R scripts with others, don't get too reliant on your .Rprofile file. Your script may silently depend on settings from your profile that your colleagues don't have. Be sure to check your script still runs correctly when you start R with R --no-init-file before you share it. Check help(Startup) in R for details.
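As a starting point, here is a minimal sketch of what an .Rprofile might contain. It is not the Getting Genetics Done example; the helper name hh and the choice of CRAN mirror are assumptions made for illustration. Note that from R 4.0.0 onward stringsAsFactors already defaults to FALSE, so that line only matters on older versions.

```r
# ~/.Rprofile: sourced at startup unless R is started with --no-init-file

# Keep character columns as character when building data frames
# (already the default from R 4.0.0 onward)
options(stringsAsFactors = FALSE)

# Set a default CRAN mirror so install.packages() does not prompt
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Hypothetical convenience function: the top-left corner of a table
hh <- function(x, n = 5) {
  x[seq_len(min(n, nrow(x))), seq_len(min(n, ncol(x))), drop = FALSE]
}

# .First() is called by R at the end of startup, after this file is sourced
.First <- function() {
  if (interactive()) message("Loaded .Rprofile at ", date())
}
```

Anything defined here, such as hh, exists only on your machine, which is exactly why a script that uses it will fail for a colleague or under R --no-init-file.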

Getting Genetics Done: Customize your .Rprofile and Keep Your Workspace Clean

An introduction to Econometrics, using R

2013-10-22T16:57:20-07:00

If your econometrics is a bit rusty and you're also looking to learn the R language, you can kill two birds with one stone with Introductory Econometrics using Quandl and R. The first three parts of this seven-part tutorial introduce the basics of regression analysis, while the remaining sections provide R code you can try yourself to reproduce econometric analyses using data provided by the Quandl package. Get started at the link below.

Quandl: Welcome to Introductory Econometrics using Quandl and R
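For a taste of what regression in R looks like before starting the tutorial, here is a minimal sketch. It uses the longley macroeconomic data set that ships with R rather than data fetched through the Quandl package, so it runs offline; it is not code from the tutorial itself.

```r
# longley: a small classic macroeconomic data set bundled with R (1947-1962)
data(longley)

# Simple linear regression of employment on GNP
fit <- lm(Employed ~ GNP, data = longley)

# Coefficients, standard errors, t-statistics and R-squared
print(summary(fit))
```

The formula interface (response ~ predictors) is the same one used for multiple regression, so adding a regressor is just a matter of writing Employed ~ GNP + Population.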

Video: Time-to-event models

2013-10-21T15:09:38-07:00

If you're trying to predict when an event will occur (for example, a consumer buying a product) or trying to infer why events occur (what were the factors that led to a component failing?), time-to-event models are a useful framework. These models are closely related to survival analysis in life sciences, except that the outcome of interest isn't "time to death" but time to some other event (e.g. in marketing, "time to purchase"). Also, in today's applications the data sizes are much larger (often Hadoop scale) as all kinds of demographic, operational and sensor data are brought to bear to improve the predictions.

In a webinar earlier this month, DataSong's John Wallace and Tess Nesbitt gave an overview of time-to-event models, with examples from marketing attribution and retail, and described their on-demand implementation of these models using Revolution R Enterprise and Hadoop. You can watch the recorded webinar below:
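For readers who have not fitted one of these models before, here is a minimal sketch using the survival package, which ships with standard R installations, and its bundled lung data set. This is ordinary in-memory survival analysis, not DataSong's Hadoop-scale implementation; in a marketing setting the event would be a purchase, and customers who have not yet bought would be the censored observations.

```r
library(survival)  # recommended package included with standard R installs

# lung: time = days of follow-up, status = 1 (censored) or 2 (event observed)
# Kaplan-Meier estimate of the time-to-event curve, split by sex
km <- survfit(Surv(time, status) ~ sex, data = lung)
print(km)

# Cox proportional hazards model: how age and sex shift the event rate
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
print(exp(coef(cox)))  # hazard ratios
```

The Kaplan-Meier fit addresses the prediction question (when is the event likely to occur?), while the Cox model addresses the inference question (which factors are associated with it occurring sooner?).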

You can also download the slides from the webinar at the link below.

Revolution Analytics Webinars: Using Time to Event Models for Prediction and Inference, presented by Revolution Analytics and DataSong

Because it's Friday: An illusion within an illusion

2013-10-18T13:00:00-07:00

As longtime readers of this blog will know, I love optical illusions, and the checkerboard shadow illusion is one of my all-time favourites. If you're not familiar with it, here's a rendering of the illusion done in Maya that I found online. Note that squares C3 (a white square) and B5 (a grey square) look as different as you'd expect in the top frame, but when you add a shadow-casting cylinder to the scene the two squares are almost exactly the same shade of grey onscreen, despite what your eyes are telling you.

While the illusion is real (and demonstrates effectively that you can't always believe what your eyes tell you about colour), there's something fishy going on in this real-world recreation of the scene:

The whole point of the illusion is that the middle tile is actually white, and appears white to our brains, but is dark grey on-screen. Yet the woman in the video drags the white tile to a dark tile, where it should definitely appear as a different colour in the better-lit area. I guess they're maintaining the RGB colour of the dragged tile in CGI, but I don't think this really helps to explain the illusion.

Anyway, that's all for this week (but you can check out older Friday posts here). We're back on Monday.

Forbes on putting R-based analytics in the hands of business analysts

2013-10-18T11:45:49-07:00

Forbes has published an article today on the integration between Alteryx and Revolution R Enterprise, which gives business analysts the ability to drag and drop to connect data sources to R-based models, such as this one for Market Basket analysis:

(Experienced R users can always drill down to see the code behind the analysis, as you can see in this video.) Not only will this be a great way for non-programmers to access the wealth of capabilities in R (even with very large data sets, thanks to Revolution R Enterprise), it's also a great way for R programmers to make their custom data visualizations and models available to business analysts, by adding new icons to the Alteryx palette, or by publishing new workflows to the Alteryx Gallery.

Bob Muenchen (author of R for SAS and SPSS Users) also comments on the promise of a drag-and-drop UI for R in his blog post, What R Has Been Missing.

The ACM 2013 Mining Big Data Camp and "Un-Conference"

2013-10-17T08:29:40-07:00

by Joseph Rickert

The 2013 Mining Big Data Camp was held last Saturday at eBay’s Town Hall Conference Center in San Jose. The San Francisco chapter of the ACM has been sponsoring this data-mining-themed “un-conference” event since 2009. Attendance this year was lighter than I remember from past years; however, the event continues to be a viable way to find out what’s hot in the Silicon Valley Data Mining scene. The buzz this year was about Deep Learning, Data Science and R.

I stumbled into the hall just in time to watch the un-conference take shape. Greg Makowski and his team of ACM volunteers do a superb job of managing chaos. An un-conference self-organizes: people propose sessions, a show of hands decides which will fly, people volunteer or are gently prodded into leading the sessions, and a quick count decides which sessions get the larger rooms. I found myself leading two sessions: “An Overview and Introduction to R”, and a discussion on “How to become a Data Scientist”. The attitude of the participants in the R session was strictly business: “How is R organized?”, “What is the best way to learn R?”, “Show me some code.” The pragmatism and enthusiasm reflected exactly what the polls indicate: R skills have become essential to Data Mining and Data Science.

In addition to the “Data Scientist” session in which I participated there was another parallel session led by eBay hiring managers on getting hired as a Data Scientist. I think the tremendous interest in this topic at the un-conference and elsewhere reflects how much momentum has been built up towards establishing “Data Scientist” as a distinct job position, and also indicates how useful the title has become as a label for a fairly extensive set of interdisciplinary skills. My take is that a Data Scientist needs to be proficient in four areas:

Statistical Inference: an understanding of sampling and experimental design at minimum
Programming Skills: enough to acquire and manipulate large data sets and implement machine learning algorithms
IT Skills: some knowledge of Linux and big data architectures, and how to connect to databases, clusters, clouds and Hadoop
Business Skills: how to take an insufficiently articulated business problem and shape it into a series of relevant technical questions

These are not all that different from Drew Conway’s original Venn diagram, but they include the ability to ask the right questions that Hilary Mason always so eloquently emphasizes.

While R and Data Science are in the realm of the here and now, the buzz around Deep Learning is that it might be the next really big thing. “Deep Learning” refers to using multi-layer neural nets, including Restricted Boltzmann Machines, to solve difficult tasks in machine vision, audio processing and Natural Language Processing. Apparently, the basic ideas have been around for quite some time but recent advances in training these multilayer networks have made them practical for certain classes of problems. Python seems to be the language of choice for working in this area: for example NuPIC (the Numenta Platform for Intelligent Computing, which recently became an open-source project) is a mix of Python and C++. The two very knowledgeable eBay engineers who led the un-conference session worked through an example based on code that I think relied on the Pylearn2 library.

For me, the ACM un-conference brought some clarity to the complementary roles R and Python play in Data Mining, and provided concrete examples that illustrate why KDnuggets advises would-be Data Scientists to learn both languages (and SQL).

If you are interested in learning more about Deep Learning and its role in reviving the dreams of Artificial Intelligence, have a look at the two Google Tech Talks by Geoff Hinton and Andrew Ng.

Fantasy Football Modeling with R

2013-10-16T16:06:44-07:00

Boris Chen, a data scientist for the New York Times, has been running a weekly blog since August with statistical analysis of NFL players, as fodder for Fantasy Football players around the country. Here's how he describes what he does:

My model pulls aggregated expert rankings from fantasypros, and I pass that data into a machine learning clustering algorithm called a gaussian mixture model to find tiers of players each week. Then I plot them in two dimensional space and the result is charts that let you easily decide your line up each week.

He performs the analysis in the R language. He provides more detail about the model itself in a recent feature in the New York Times (and can I say how gratifying it is to see the words "Gaussian mixture model" in a mainstream newspaper article, and in the Sports section, no less!). The article, like his regular blog posts, is illustrated with charts created using R's ggplot2 package such as this one:

Yet another application of R to add to the list!
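To illustrate the tiering idea described in the quote above, here is a minimal sketch of a one-dimensional Gaussian mixture model fitted by EM in base R. It is not Boris Chen's code: the function name fit_gmm_1d is hypothetical, the rankings are simulated rather than pulled from fantasypros, and a real analysis would more likely use a dedicated package and choose the number of tiers from the data.

```r
# Fit a k-component 1-D Gaussian mixture by EM (base R only, illustrative)
fit_gmm_1d <- function(x, k, iters = 200) {
  n <- length(x)
  # Initialise means at evenly spaced quantiles, with a shared spread
  mu <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))
  sigma <- rep(sd(x) / k, k)
  w <- rep(1 / k, k)
  for (i in seq_len(iters)) {
    # E-step: responsibility of each component for each point (n x k matrix)
    dens <- sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)
    # M-step: re-estimate weights, means and standard deviations
    nk <- colSums(resp)
    w <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)  # guard against a component collapsing
  }
  # Assign each point to the component with the highest responsibility
  list(mean = mu, sd = sigma, weight = w,
       tier = max.col(resp, ties.method = "first"))
}

# Simulated average expert ranks for 18 players in three loose groups
set.seed(1)
avg_rank <- c(rnorm(5, 3, 0.5), rnorm(6, 10, 0.8), rnorm(7, 20, 1.5))

fit <- fit_gmm_1d(avg_rank, k = 3)
print(data.frame(avg_rank = round(avg_rank, 1), tier = fit$tier))
```

Players whose average ranks fall under the same fitted Gaussian land in the same tier, which is what makes the resulting charts useful for choosing between players who are otherwise ranked close together.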

New York Times: Turning Advanced Statistics Into Fantasy Football Analysis (via reader JM)