Missing data can be a far from trivial problem when analysing a dataset, and accounting for it is rarely straightforward either.
If the amount of missing data is very small relative to the size of the dataset, then leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However, discarding available datapoints deprives the analysis of some amount of information, so depending on the situation you face, you may want to look for other fixes before wiping out potentially useful datapoints from your dataset.
While some quick fixes such as mean substitution may be fine in some cases, such simple approaches usually introduce bias into the data. For instance, applying mean substitution leaves the mean unchanged (which is desirable) but decreases the variance, which may be undesirable.
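A quick numeric check makes the variance shrinkage concrete (a sketch on simulated data, not from the original post):

```r
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)   # "complete" data
x_miss <- x
x_miss[sample(100, 20)] <- NA        # knock out 20 values at random

m_obs <- mean(x_miss, na.rm = TRUE)  # observed mean
x_imp <- x_miss
x_imp[is.na(x_imp)] <- m_obs         # mean substitution

abs(mean(x_imp) - m_obs)                # ~0: the mean is preserved
var(x_imp) < var(x_miss, na.rm = TRUE)  # TRUE: the variance shrinks
```

The variance shrinks because the imputed points sit exactly at the mean, contributing nothing to the sum of squares while still inflating the denominator.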
The mice package in R helps you impute missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing datapoint.
In this post we are going to impute missing values using the airquality dataset (available in R).
For the purpose of the article I am going to remove some datapoints from the dataset.
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
As far as categorical variables are concerned, replacing missing values is usually not advisable. One common practice is to replace missing categorical values with the mode of the observed ones; however, it is questionable whether that is a good choice. Even though in this case no datapoints are missing from the categorical variables, we remove them from our dataset (we can add them back later if needed) and take a look at the data using summary().
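For illustration only, mode replacement for a factor could be sketched like this (impute_mode is a made-up helper, not part of any package, and the text above explains why this approach is questionable):

```r
# hypothetical helper: replace NAs in a factor with the most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- mode_level
  x
}

f <- factor(c("a", "a", "b", NA, "b", "a"))
impute_mode(f)  # the NA becomes "a"
```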
data <- data[-c(5,6)]
summary(data)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :57.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:73.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.806   Mean   :78.28
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7       NA's   :7        NA's   :5
Apparently Ozone is the variable with the most missing datapoints. Below we are going to dig deeper into the missing data patterns.
There are two types of missing data:

MCAR: missing completely at random. This is the desirable scenario in case of missing data.

MNAR: missing not at random. This is a more serious issue, and in this case it might be wise to check the data gathering process further and try to understand why the information is missing.
Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample exceeds 5%, then you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(data,2,pMiss)
    Ozone   Solar.R      Wind      Temp
24.183007  4.575163  4.575163  3.267974
apply(data,1,pMiss)
  [1]  25  25  25  50 100  50  25  25  25  50  25   0   0   0   0   0   0   0   0   0   0
 [22]   0   0   0  25  25  50   0   0   0   0  25  25  25  25  25  25   0  25   0   0  25
 [43]  25   0  25  25   0   0   0   0   0  25  25  25  25  25  25  25  25  25  25   0   0
 [64]   0  25   0   0   0   0   0   0  25   0   0  25   0   0   0   0   0   0   0  25  25
 [85]   0   0   0   0   0   0   0   0   0   0   0  25  25  25   0   0   0  25  25   0   0
[106]   0  25   0   0   0   0   0   0   0  25   0   0   0  25   0   0   0   0   0   0   0
[127]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[148]   0   0  25   0   0   0
We see that Ozone is missing almost 25% of its datapoints, therefore we might consider either dropping it from the analysis or gathering more measurements. The other variables are below the 5% threshold so we can keep them. As far as the samples are concerned, missing just one feature leads to 25% missing data per sample. Samples that are missing two or more features (>50%) should be dropped if possible.
The mice package provides a nice function md.pattern() to get a better understanding of the pattern of missing data:
library(mice)
md.pattern(data)
    Temp Solar.R Wind Ozone
104    1       1    1     1  0
 34    1       1    1     0  1
  4    1       0    1     1  1
  3    1       1    0     1  1
  3    0       1    1     1  1
  1    1       0    1     0  2
  1    1       1    0     0  2
  1    1       0    0     1  2
  1    0       1    0     1  2
  1    0       0    0     0  4
       5       7    7    37 56
The output tells us that 104 samples are complete, 34 samples miss only the Ozone measurement, 4 samples miss only the Solar.R value and so on.
A perhaps more helpful visual representation can be obtained using the VIM package as follows:
library(VIM)
aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,
                  labels=names(data), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
The plot helps us understand that almost 70% of the samples are not missing any information, 22% are missing the Ozone value, and the remaining ones show other missing patterns. Through this approach the situation looks a bit clearer in my opinion.
Another (hopefully) helpful visual approach is a special box plot
marginplot(data)
Obviously here we are constrained to plotting only two variables at a time, but nevertheless we can gather some interesting insights.
The red box plot on the left shows the distribution of Solar.R with Ozone missing, while the blue box plot shows the distribution of the remaining datapoints. Likewise for the Ozone box plots at the bottom of the graph.
If our assumption of MCAR data is correct, then we expect the red and blue box plots to be very similar.
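Beyond eyeballing the box plots, a quick numeric check along the same lines (my addition, not in the original post) is to compare Solar.R between rows where Ozone is missing and rows where it is observed, for instance with a t-test:

```r
data(airquality)
miss <- is.na(airquality$Ozone)
# under MCAR, Solar.R should look the same whether or not Ozone is missing
t.test(airquality$Solar.R[miss], airquality$Solar.R[!miss])
```

A large p-value is consistent with (though it does not prove) the MCAR assumption for this pair of variables.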
The mice() function takes care of the imputing process:
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
Multiply imputed data set
Call:
mice(data = data, m = 5, method = "pmm", maxit = 50, seed = 500)
Number of multiple imputations:  5
Missing cells per column:
  Ozone Solar.R    Wind    Temp
     37       7       7       5
Imputation methods:
  Ozone Solar.R    Wind    Temp
  "pmm"   "pmm"   "pmm"   "pmm"
VisitSequence:
  Ozone Solar.R    Wind    Temp
      1       2       3       4
PredictorMatrix:
        Ozone Solar.R Wind Temp
Ozone       0       1    1    1
Solar.R     1       0    1    1
Wind        1       1    0    1
Temp        1       1    1    0
Random generator seed value:  500
A couple of notes on the parameters:
m=5
refers to the number of imputed datasets. Five is the default value.
meth='pmm'
refers to the imputation method. In this case we are using predictive mean matching. Other imputation methods can be used; type methods(mice)
for a list of the available imputation methods.
If you would like to check the imputed data, for instance for the variable Ozone, you need to enter the following line of code
tempData$imp$Ozone
    1  2  3  4  5
5  13 20 28 12  9
10  7 16 28 14 20
25  8 14 14  1  8
26  9 19 32  8 37
...
The output shows the imputed data for each observation (first column on the left) within each imputed dataset (first row at the top).
If you need to check the imputation method used for each variable, mice makes it very easy to do:
tempData$meth
  Ozone Solar.R    Wind    Temp
  "pmm"   "pmm"   "pmm"   "pmm"
Now we can get back the completed dataset using the complete()
function. It is almost plain English:
completedData <- complete(tempData,1)
The missing values have been replaced with the imputed values in the first of the five datasets. If you wish to use another one, just change the second parameter in the complete()
function.
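If you want all m completed datasets at once, for instance to inspect between-imputation variability yourself, complete() also accepts a "long" action. A sketch on the unmodified airquality data (so it runs standalone):

```r
library(mice)
imp <- mice(airquality, m = 5, printFlag = FALSE, seed = 500)
long <- complete(imp, action = "long")  # stacks the 5 completed datasets
nrow(long) == 5 * nrow(airquality)      # one copy per imputation
head(long[, c(".imp", ".id", "Ozone")]) # .imp indexes the imputation
```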
Let’s compare the distributions of original and imputed data using some useful plots.
First of all we can use a scatterplot and plot Ozone against all the other variables
xyplot(tempData,Ozone ~ Wind+Temp+Solar.R,pch=18,cex=1)
Here it is
What we would like to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”.
Another helpful plot is the density plot:
densityplot(tempData)
The density of the imputed data for each imputed dataset is shown in magenta, while the density of the observed data is shown in blue. Again, under our previous assumptions we expect the distributions to be similar.
Another useful visual take on the distributions can be obtained using the stripplot()
function that shows the distributions of the variables as individual points
stripplot(tempData, pch = 20, cex = 1.2)
Suppose that the next step in our analysis is to fit a linear model to the data. You may ask what imputed dataset to choose. The mice
package makes it again very easy to fit a model to each of the imputed datasets and then pool the results together
modelFit1 <- with(tempData,lm(Temp ~ Ozone+Solar.R+Wind))
summary(pool(modelFit1))
                     est         se         t       df     Pr(>|t|)
(Intercept) 72.812078768 2.95380500 24.650266 84.18464 0.000000e+00
Ozone        0.163094287 0.02607674  6.254397 57.78569 5.236295e-08
Solar.R      0.009679676 0.00789576  1.225933 37.48960 2.278691e-01
Wind        -0.352582008 0.21639828 -1.629320 92.89136 1.066321e-01
                   lo 95       hi 95 nmis       fmi    lambda
(Intercept) 66.938301817 78.68585572   NA 0.1477818 0.1277731
Ozone        0.110891894  0.21529668   37 0.2155848 0.1888975
Solar.R     -0.006311604  0.02567095    7 0.3004189 0.2640672
Wind        -0.782312735  0.07714872    0 0.1300747 0.1115442
The variable modelFit1 contains the results of the fitting performed over the imputed datasets, while the pool() function pools them all together. Apparently, only the Ozone variable is statistically significant.
Note that there are other columns aside from those typical of the lm()
model: fmi contains the fraction of missing information, while lambda is the proportion of total variance that is attributable to the missing data. For more information I suggest checking out the paper cited at the bottom of the page.
Remember that we initialized the mice
function with a specific seed, therefore the results are somewhat dependent on our initial choice. To reduce this effect, we can impute a higher number of datasets, by changing the default m=5
parameter in the mice()
function as follows
tempData2 <- mice(data,m=50,seed=245435)
modelFit2 <- with(tempData2,lm(Temp ~ Ozone+Solar.R+Wind))
summary(pool(modelFit2))
                     est          se         t       df     Pr(>|t|)
(Intercept) 73.156084276 2.803010282 26.099114 129.3154 0.000000e+00
Ozone        0.166242781 0.024926976  6.669192 118.4408 8.645631e-10
Solar.R      0.009046835 0.007374103  1.226839 114.5471 2.223989e-01
Wind        -0.382700790 0.202976584 -1.885443 136.6735 6.149264e-02
                   lo 95       hi 95 nmis        fmi    lambda
(Intercept) 67.610387851 78.70178070   NA 0.11141367 0.0977762
Ozone        0.116882484  0.21560308   37 0.16290744 0.1488906
Solar.R     -0.005560458  0.02365413    7 0.18096774 0.1667911
Wind        -0.784081566  0.01867999    0 0.07425875 0.0608104
After having taken into account the random seed initialization, we obtain (in this case) more or less the same results as before with only Ozone showing statistical significance.
A gist with the full code for this post can be found here.
Note: I learnt this technique from a paper entitled mice: Multivariate Imputation by Chained Equations in R by Stef van Buuren. It is a great paper and I highly recommend reading it if you are interested in multiple imputation!
Thank you for reading this post; leave a comment below if you have any questions.
Two weeks ago I used STAN to create predictions after just throwing in all independent variables. This week I aim to refine the STAN model. For this it is convenient to use the loo package (Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models). See also the paper by Aki Vehtari, Andrew Gelman and Jonah Gabry.
Since the package does the heavy lifting, it only remains to wrap it in a function so I can quickly compare some models. A potential next step is to automate some more, but I did not do that pending current results.
Same as last time.
To keep the model similar to last time, I need to get a full design matrix for each independent variable in the model. So I made a mechanism which takes a model formulation and creates both the design matrix and a bunch of indices to keep track of which column corresponds to which part of the model. To be specific, terms contains 1 to nterm if the corresponding column in xmat is from variable 1 (intercept) to the last variable. This sits in the function des.matrix.
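The author's des.matrix is not shown, but base R provides the same ingredients: model.matrix builds the full design matrix, and its "assign" attribute maps each column back to a term. A rough reconstruction of the described mechanism (my sketch, not the original code):

```r
# hypothetical reconstruction of a des.matrix-style helper
des.matrix <- function(formula, data) {
  xmat <- model.matrix(formula, data = data)
  # 'assign' is 0 for the intercept and 1..k for the terms;
  # shift by 1 so terms runs from 1 (intercept) to nterm
  terms <- attr(xmat, "assign") + 1
  list(xmat = xmat, terms = terms, nterm = max(terms))
}

dm <- des.matrix(mpg ~ cyl + factor(gear), data = mtcars)
dm$terms  # 1 for the intercept, 2 for cyl, 3 for both gear dummies
```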
The generated quantities block is purely for the LOO statistic.
It is preferred to compile the model only once, hence fit1 is calculated beforehand. Having done that preparation, MySmodel is a function which does the model fitting, computes the LOO statistic, and outputs it all in one step. In this function I can just drop in the formula and get something usable as output, so I can easily examine a bunch of models. It seemed to me that forward selection was a suitable way to examine the model space. I know it is not ideal, but at this point I mainly want to know if this will actually function.
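The wrapper itself is not shown, but its LOO step can be sketched with the loo package (the log_lik name and my_loo function are assumptions based on the text; the demonstration runs loo() on a simulated log-likelihood matrix rather than a real Stan fit, so it executes without Stan):

```r
library(loo)

# sketch: given a stanfit whose generated quantities block fills 'log_lik',
# the LOO statistic is just two calls
my_loo <- function(stanfit) {
  ll <- loo::extract_log_lik(stanfit, parameter_name = "log_lik")
  loo::loo(ll)
}

# demonstrate loo() itself on a fake 4000-draws x 50-observations matrix
set.seed(1)
ll <- matrix(rnorm(4000 * 50, mean = -1, sd = 0.1), nrow = 4000, ncol = 50)
res <- loo::loo(ll)
res$estimates  # elpd_loo, p_loo and looic with standard errors
```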
To my surprise, Title was the parameter which gave the best predictions. I had expected sex to play that role.
Survived ~ Title  -445.4972  16.46314
Computed from 4000 by 891 log-likelihood matrix
Estimate SE
elpd_loo   -445.5 16.5
p_loo 4.1 0.2
looic 891.0 32.9
All Pareto k estimates OK (k < 0.5)
The next variable was passenger class
Survived ~ Title + Pclass  -395.5926  17.42705
The RuleFit algorithm from Friedman and Popescu is an interesting regression and classification approach that uses decision rules in a linear model.
RuleFit is not a completely new idea, but it combines a bunch of algorithms in a clever way. RuleFit consists of two components: the first component produces “rules” and the second component fits a linear model with these rules as input (hence the name “RuleFit”). The cool thing about the algorithm is that the produced model is highly interpretable, because the decision rules have an easily understandable format, yet the approach is still flexible enough to capture complex interactions and get a good fit.
The rules that the algorithm generates have a simple form:
if x2 < 3 and x5 < 7 then 1 else 0
The rules are generated from the covariates matrix X. You can also see the rules simply as new features based on your original features.
The RuleFit paper uses the Boston housing data as example: The goal is to predict the median house value in the Boston neighborhood. One of the rules that is generated by RuleFit is:
if (number of rooms > 6.64) and (concentration of nitric oxide < 0.67) then 1 else 0
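Encoded as a feature, that Boston-housing rule is just an indicator. A sketch using the Boston data from the MASS package (where rm is rooms per dwelling and nox is nitric-oxide concentration):

```r
library(MASS)  # for the Boston housing data
rule <- with(Boston, as.numeric(rm > 6.64 & nox < 0.67))
table(rule)    # how many neighbourhoods satisfy the rule
```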
The interesting part is how those rules are generated: they are derived from decision trees, by basically ripping them apart. Every path in a tree can be turned into a decision rule. You simply chain the binary decisions that lead to a certain node et voilà, you have a rule. It is desirable to generate a lot of diverse and meaningful rules. Gradient boosting is used to fit an ensemble of decision trees (by regressing/classifying y with your original features X). Each resulting tree is turned into multiple rules.
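The tree-ripping step can be sketched with rpart: path.rpart returns, for each node, the chain of binary splits leading to it. (This uses a single rpart tree rather than a boosted ensemble, so it only illustrates how paths become rules.)

```r
library(rpart)
fit <- rpart(Ozone ~ ., data = airquality)

# leaf nodes are the rows of fit$frame where var == "<leaf>"
leaves <- as.numeric(rownames(fit$frame))[fit$frame$var == "<leaf>"]

# each path is a chain of conditions such as "Temp< 82.5";
# chaining the conditions of one path gives one decision rule
paths <- path.rpart(fit, nodes = leaves, print.it = FALSE)
paths[[1]]
```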
Another way to see this step is as a black box that generates a new set of features X’ out of your original features X. Those features are binary and can represent quite complex interactions of your original X. The rules are chosen to maximise performance on the prediction/classification task at hand.
You will get A LOT of rules from the first step (and that is what you want). Since the first step is only a feature transformation applied to your original data set, you are still not done with fitting a model, and you also want to reduce the number of rules. Lasso, or L1-regularised regression, is a good fit in this scenario. Next to the rules, all numeric variables from your original data set also go into the Lasso linear model. Every rule and numeric variable gets a coefficient (beta), and thanks to the regularisation, a lot of those betas will be estimated as zero. The numeric variables are added because trees suck at representing simple linear relationships between y and x. The outcome is a linear model that has linear effects for all of the numeric variables and terms for the rules.
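The second stage can be sketched with glmnet (alpha = 1 gives the Lasso); the rule features below are made up for illustration, standing in for the output of the tree-ripping step:

```r
library(glmnet)
set.seed(7)
n <- 300
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))

# two hypothetical binary rule features of the kind step one would produce
r1 <- as.numeric(X[, "x1"] < 0 & X[, "x2"] > 0.5)
r2 <- as.numeric(X[, "x3"] > 1)
y  <- 1.5 * X[, "x1"] + 3 * r1 + rnorm(n)

# Lasso over numeric covariates plus rules; irrelevant betas shrink to 0
fit <- cv.glmnet(cbind(X, rule1 = r1, rule2 = r2), y, alpha = 1)
coef(fit, s = "lambda.min")
```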
The paper not only introduces RuleFit and evaluates it, but it also comes with a bunch of useful tools, (comparable to Random Forest): Measurement tools for variable importance, degree of relevance of original input variables and interaction effects between variables.
There are currently two implementations out there: one from the authors in R and one in the TMVA toolkit.
tl;dr The RuleFit regression and classification algorithm works by boosting an ensemble of decision trees, creating decision rules from those trees and throwing the rules (+ the numeric covariables) at L1 regularized regression (Lasso). Try it out.
Within the New Zealand Ministry of Business, Innovation and Employment, the Sector Trends team has recently secured resourcing for additional analysts on a range of statistical programmes. That’s the team that I usually manage, although for the next few months I’m doing a stint on a similar team, different topics. The formal details and position descriptions are on the MBIE website of course; this blog post should be considered only as informal additional material.
Here’s a little more about the positions there wasn’t space to put in the formal documentation:
MBIE spends some millions of dollars each year on producing, publishing and analysing data on the tourism sector in New Zealand. Tourism is New Zealand’s second biggest export industry, and the Minister for Tourism happens to also be the Prime Minister. The Principal Analyst Tourism Data will be responsible for the overall programme, including the world-leading Tourism Data Improvement Programme. They’ll lead on engagement with the Minister and with the sector, and oversee overall prioritisation, planning, and quality control.
Principal Analysts don’t have staff or financial delegations – those are with the Manager – but take leadership on all aspects of the content in their area. They need the people and communication skills to engage with Ministers and the industry, and the technical skills to deal with our data-intensive projects and support the Analysts and Senior Analysts producing new data, publishing statistics, and undertaking analysis.
MBIE administers many hundreds of millions of dollars of science investments every year and is the lead agency advising the Government on science and innovation policy. A significant stepping up of resourcing is underway to improve the evidence base in the science and innovation field. Working with the policy teams, we’re identifying a range of areas where improvements are needed, from reusing our own investments data through to better understanding how science translates into innovation at the business level. This programme is at its early stages and needs an experienced leader with vision and mana to oversee it and make it happen.
This role will work across the range of areas the team provides analysis on, which as well as tourism and science and innovation includes regions, cities, sectors, and strategic microeconomic thinking. This is a more hands-on role than the Principals and needs someone who is either an expert with R and data management and visualisation or has demonstrated capacity to quickly become one. Senior Analysts lead complex projects in sensitive areas, and coach and mentor Analysts.
The Analyst role could be an entry-level position or might be for a more experienced person looking for a new challenge. This is the classic “data ninja” role – you need to be an R expert or quickly become one, have great ideas about data analysis and visualisation, and be keen to get your hands dirty with the data that tells us how New Zealand works.
All the basics about MBIE are on the website. You can find out a bit more about the team from this presentation I recently gave for the New Zealand Analytics forum. We work closely with policy teams and industry on difficult change programmes. We led the world in using electronic card transactions to estimate regional tourism spend. We put a major Tier 1 official statistic expenditure survey online and improved understanding of tourist spend and New Zealand household consumption and savings. We’re now moving more into exciting interactive data visualisations. We use R, LaTeX, Git, Shiny, JavaScript, and SQLServer.
Check out examples of our work:
If you want to do stuff like that with data, and also take it the next step to work with policy teams to form advice for Ministers on what to do about what we find, this could be the team for you!
It’s an exciting team, and now large enough to generate a huge buzz as we solve new problems every day. We are obsessively into continuous improvement and constantly training each other in the innovative and new. Many of the things we do day to day we hadn’t heard of six months ago, and quite a few didn’t exist two years ago. It’s not the place to be if you think you’ve learnt everything you need to know already! We have a 20 minute seminar each week, and a one hour lab session too, constantly building our statistical and econometric skills. This is a team for someone with a bit of buzz and pizzazz as well as an obsession with exploring and analysing cool economic data to answer difficult problems.
Apply via the MBIE careers website by 14 October 2015.
Note that these four positions are part of a wider recruitment of eight analysts in total for the Evidence, Monitoring and Governance branch. The other four positions are in the Research and Evaluation teams in the branch and they do cool stuff too. So if the above isn’t quite right for you, check out the role descriptions for those other opportunities too, at the same webpage.
These positions are in Wellington, and you need to be able to work in New Zealand.
During one of our Department’s weekly biostatistics “clinics”, a visitor was interested in creating confidence bands for a Gaussian density estimate (or a Gaussian mixture density estimate). The mean, variance, and two “nuisance” parameters were simultaneously estimated using least-squares. Thus, the approximate sampling variance-covariance matrix (4×4) was readily available. The two nuisance parameters do not directly affect the Gaussian density, but the visitor was concerned that their correlation with the mean and variance estimates would affect the variance of the density estimate. Of course, this might be the case in general, and a nonparametric bootstrap method might be used to account for this. Nevertheless, I proposed using the delta method, in which the variability of the nuisance parameter estimates does not affect that of the density estimate; a consequence of the normality assumption. This can be verified by fiddling with the parameters below.
The code below implements a Wald-type pointwise 95% confidence band for a test case; I made up the values of the estimated parameters and their approximate variance-covariance matrix (note that the mean and variance estimators are statistically independent). After fiddling with this a bit, it’s clear that this delta method approach can perform poorly when the sampling variance is large (e.g., the lower bound of the density estimate can be negative).
## bell curve function
bell <- function(dist, mu=0, sig=1, p1=0, p2=0)
  exp(-(dist-mu)^2/sig/2)/sqrt(2*pi)/sig

## plot bell curve at default parameters
curve(bell(x), from=-5, to=5, ylim=c(0,0.6),
  ylab="density", xlab="distance")

## compute gradient of bell curve on a grid of distances
dgrid <- seq(-5, 5, 1/50)
bderv <- numericDeriv(
  expr=quote(bell(dgrid, mu, sig, p1, p2)),
  theta=c("mu","sig","p1","p2"),
  rho=list2env(list(dgrid=dgrid,mu=0,sig=1,p1=0,p2=0)))
bgrad <- attr(bderv, 'gradient')

## variance-covariance matrix of mu, sig, p1, and p2
pvcov <- matrix(c(1.0,0.0,0.1,0.0,
                  0.0,1.0,0.0,0.1,
                  0.1,0.0,0.2,0.1,
                  0.0,0.1,0.1,0.2)/100, 4,4)

## approximate variance-covariance of bell curve
## induced by variability in parameters
bvcov <- bgrad %*% pvcov %*% t(bgrad)

## add pointwise 95% Wald confidence bands
polygon(x=c(dgrid, rev(dgrid)),
  y=c(bderv + qnorm(0.975)*sqrt(diag(bvcov)),
      rev(bderv - qnorm(0.975)*sqrt(diag(bvcov)))),
  col="lightgray", border=NA)
lines(dgrid, bderv, lwd=2)
abline(h=0, lty=3)
Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.
We can define epsilon-differential privacy through the following game:
| log( Prob[A(S) in Q] / Prob[A(S’) in Q] ) | ≤ epsilon
for all of the adversary’s choices of S, S’ and Q. The probabilities are defined over coin flips in the implementation of A(), not over the data or the adversary’s choices.
The adversary’s goal is to use A() to tell between S and S’, representing a failure of privacy. The learner wants to extract useful statistics from S and S’ without violating privacy. Identifying a unique row (the one which changed markings) violates privacy. If the adversary can tell which set (S or S’) the learner is working on by the value of A(), then privacy is violated.
Notice S and S’ are “data sets” in the machine learning sense (collections of rows carrying information). Q is a set in the mathematical sense: a subset of the possible values that A() can return. To simplify the demonstration, we will take Q to be an interval.
In the following examples, A() returns the (approximate) expected value of a set s. The adversary has chosen two sets S and S’ of size n = 100:
The set Q will be an interval [T, 1], where the adversary picks the threshold T.
The adversary’s goal is to pick T such that when he sees that A(s) ≥ T, he knows that A() has just evaluated S’. The learner has two (competing) goals:
Here, A(s) simply returns the expected value mean(s), so it always returns A(S) = 0 when evaluating S, and A(S’) = 0.01 when evaluating S’. This is clearly not differentially private for any value of epsilon. If the adversary picks T = 1/200 = 0.005, then he can reliably identify when A() has evaluated the set S’, every time they play a round of the game.
In the figures, set1 in green represents the distribution of values returned by calls to A(S), and set2 in orange represents the distribution of values returned by calls to A(S’). I’ve plotted set2 upside down for clarity.
One way for the learner to obscure which set she is evaluating is to add a little random noise to the estimate, to “blur” it. Following Dwork et al.’s methods, we’ll add noise from a Laplace distribution, which is symmetric and falls off more slowly than Gaussian noise would. Here we show adding just a little bit of noise (of “width” sigma = 1/3n):
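Base R has no Laplace sampler, but one is easy to build by inverse-transform sampling (a standard construction; the function names here are mine, not from the linked simulation code):

```r
# draw n samples from Laplace(mu, b) via the inverse CDF
rlaplace <- function(n, mu = 0, b = 1) {
  u <- runif(n) - 0.5
  mu - b * sign(u) * log(1 - 2 * abs(u))
}

# a noisy "differentially private" mean: true mean plus Laplace noise
private_mean <- function(s, b) mean(s) + rlaplace(1, 0, b)

set.seed(3)
x <- rlaplace(1e5, mu = 0, b = 0.03)
c(mean(x), var(x))  # near 0 and 2*b^2 = 0.0018, as expected for Laplace
```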
The shaded green region represents the chance that A(S) will return a value greater than the adversary’s threshold T — in other words, the probability that the adversary will mistake S for S’ in a round of the game. We’ve now made that probability nonzero, but it’s still much more likely that if A(s) > T, then s = S’. We need more noise.
In particular, we need the noise to be bigger than the gap A(S’) - A(S) = 1/n, or 0.01. Let’s pick sigma = 3/n = 0.03:
Now we see that the shaded green region has almost the same area as the shaded orange region — you can think of epsilon as expressing the difference between the shaded green region and the shaded orange region. In fact, the absolute value of the log-ratio of the two areas is epsilon. In other words, observing that A(s) > T is no longer strong evidence that s = S’, and we have achieved differential privacy, to the degree epsilon.
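With Laplace noise the two tail areas, and hence epsilon, can be computed in closed form. A sketch for the toy numbers above (n = 100, gap 1/n, threshold T = 1/200, noise width b = 3/n):

```r
# upper-tail probability P(X > thr) for X ~ Laplace(mu, b)
laplace_upper <- function(thr, mu, b) {
  if (thr >= mu) 0.5 * exp(-(thr - mu) / b) else 1 - 0.5 * exp(-(mu - thr) / b)
}

n   <- 100
b   <- 3 / n    # noise width sigma = 3/n from the text
thr <- 1 / 200  # the adversary's threshold T

p_S  <- laplace_upper(thr, 0,    b)  # P(A(S)  > T), since mean(S)  = 0
p_Sp <- laplace_upper(thr, 0.01, b)  # P(A(S') > T), since mean(S') = 0.01

epsilon <- abs(log(p_S / p_Sp))
epsilon  # about 0.31 for these settings
```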
Of course, A(s) is also no longer a precise estimate of mean(s)
. We’ll return to that point in a bit.
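The construction above is easy to sketch in R. Base R has no Laplace sampler, so the `rlaplace` helper below is our own (drawn by inverse-CDF sampling); the sets S and S’ and the noise width follow the article's running example, but the function names are ours:

```r
# Laplace noise sampler (not in base R): inverse-CDF method,
# with scale parameter b.
rlaplace <- function(n, b) {
  u <- runif(n, min = -0.5, max = 0.5)
  -b * sign(u) * log(1 - 2 * abs(u))
}

# A(s): a noisy, differentially private estimate of mean(s)
A <- function(s, sigma) {
  mean(s) + rlaplace(1, sigma)
}

n  <- 100
S  <- rep(0, n)            # the set S: all zeros, mean 0
Sp <- c(rep(0, n - 1), 1)  # the set S': one element changed, mean 0.01

A(S,  sigma = 3/n)  # a noisy estimate near 0
A(Sp, sigma = 3/n)  # a noisy estimate near 0.01
```

Each call returns a different noisy value; only the distributions of the two estimates are (approximately) indistinguishable, which is exactly the point.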
We can simulate the game I described above in R. I’ve put the code on github, here.
The code sweeps through a range of values of epsilon, and for each value selects a suitable noise width and runs 1000 rounds of the differential privacy game, returning a value for A(S) and A(S’). Each round we record whether A(S) and A(S’) are greater than T = 1/200.
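A minimal, self-contained version of one sweep of that loop might look like the sketch below (the `rlaplace` helper is ours, since base R has no Laplace sampler; the sets, threshold, and noise width follow the running example):

```r
# One sweep of the game: many rounds at a fixed noise width,
# recording how often A(S) and A(S') clear the threshold T = 1/200.
rlaplace <- function(n, b) {
  u <- runif(n, -0.5, 0.5)
  -b * sign(u) * log(1 - 2 * abs(u))
}

n  <- 100
S  <- rep(0, n)            # mean 0
Sp <- c(rep(0, n - 1), 1)  # mean 0.01
threshold <- 1/200
sigma  <- 3/n
rounds <- 1000

hitS  <- replicate(rounds, mean(S)  + rlaplace(1, sigma) >= threshold)
hitSp <- replicate(rounds, mean(Sp) + rlaplace(1, sigma) >= threshold)

# With enough noise the two frequencies are close;
# the absolute log-ratio of the frequencies estimates epsilon.
c(pS = mean(hitS), pSp = mean(hitSp))
abs(log(mean(hitSp) / mean(hitS)))
```

With sigma = 3/n both hit rates land near one half, so the adversary's threshold test carries little evidence about which set was evaluated.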
Differential privacy aims to make the answers to “snooping queries” too vague to distinguish closely related sets (in this case, it makes the probability that A(S) ≥ T about the same as the probability that A(S’) ≥ T). But for machine learning, we are also interested in the output of A(). We can use the simulation above to estimate what happens to A(S) and A(S’) for different values of epsilon. Here we plot the (normalized) gap between the expected values of A(S) and A(S’) as a function of epsilon, from the simulation:
As epsilon gets smaller (implying stricter privacy), the relative gap between the expected values of A(S) and A(S’) gets smaller. The discrepancies in the plot are probably due to poor choices of the noise parameter (I picked them heuristically), but the trend is clear. This makes sense, since privacy implies that A(S) should look a lot like A(S’).
However, as epsilon gets stricter, the estimates of mean(S) = 0
and mean(S') = 0.01
— which is what A() is supposed to estimate — also become poorer.
This is the tradeoff: how can we preserve privacy and still perform useful analysis?
The recent papers by Dwork et al. apply differential privacy to the problem of adaptive data analysis, or reusable test sets. In standard machine learning practice, we use two data sets when modeling: a training set to fit the model, and a test set to evaluate the model. Ideally we only use the test set once; but in practice we go back to it again and again, as we try to improve our model. We already know that performance estimates of a model over its training set are upwardly biased (the model looks more accurate on training than it really is, because it “knows” the training set); if we go back to the test set too many times, then performance estimates of the model on the test set are also upwardly biased, because we also “know” the test set. This problem is exacerbated when we also use the test set to tune the model: for example to pick the variables we use, or tune modeling parameters. In other words, even if we observe the best practice of using a training set to fit models, a calibration set to tune models, and a holdout set to evaluate the model, we can still contaminate the calibration set if we look at it too many times.
Dwork et al. describe how (and how many times) we can go back to a test set without biasing its evaluations of model performance. To do this, we make sure that the modeling procedure only interacts with the test set in a differentially private manner. Because it has limited access to the test set, the modeling procedure can’t learn the test set, so evaluations of a model over the test set still accurately reflect a model’s performance on unseen data.
Describing the proofs and techniques in these papers is outside the scope of this article, but we can demonstrate the results in a simple example, similar to the example application shown in Dwork et al.’s Science paper.
For our example, suppose we have 2000 rows of data with a binary outcome y (50% prevalence of the positive class), and 110 possible input variables to choose from. In our tests we used synthetic data with ten variables (x1...x10
) that carry signal and one hundred noise variables (n1...n100
).
We decide to use forward stepwise logistic regression to choose a suitable model for predicting y. We split the sample data into 1000 rows for the training set, and 1000 rows for the test set. Given that we have selected k−1 variables and have fit a model, M(k−1), that uses those variables, we pick the kth variable by fitting a model Mnew on the training set using the k−1 previously selected variables plus one new variable from the remaining candidates. We then evaluate the accuracy of M(k−1) and Mnew on the test set, and pick as the kth variable the one for which Mnew showed the most improvement. In other words, we fit candidate models on the training set, and evaluate them for improvement on the test set. This is a fairly common way of tuning models. Standard implementations of stepwise regression often use training set AIC as the criterion for picking the next model; we use accuracy improvement to keep the code straightforward, and closer to the assumptions in the paper.
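One step of that naive selection procedure might be sketched as below. This is an illustrative reconstruction, not the authors' actual code: the `accuracy` helper, `pick_next_var`, and the column name `y` are our own assumptions, and we use `glm` for the logistic fits:

```r
# Classification accuracy of a logistic model on a data set,
# thresholding predicted probabilities at 0.5.
accuracy <- function(model, data) {
  pred <- predict(model, newdata = data, type = "response") > 0.5
  mean(pred == (data$y == 1))
}

# One step of naive forward selection: fit each candidate model on the
# training set, score its accuracy improvement on the test set, and
# return the candidate variable with the biggest test-set gain.
pick_next_var <- function(varSet, candidates, trainData, testData) {
  base_fml <- if (length(varSet) == 0) "y ~ 1" else
    paste("y ~", paste(varSet, collapse = " + "))
  baseModel <- glm(as.formula(base_fml), family = binomial, data = trainData)
  baseAcc   <- accuracy(baseModel, testData)
  improvements <- sapply(candidates, function(v) {
    fml <- as.formula(paste(base_fml, "+", v))
    m   <- glm(fml, family = binomial, data = trainData)
    accuracy(m, testData) - baseAcc
  })
  candidates[which.max(improvements)]
}
```

Because the test set arbitrates every choice, it leaks into the selection process, which is exactly the bias the article demonstrates next.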
Normally, the stepwise procedure would terminate when the model stops improving, but for this example we let it run to pick fifty variables. At each step we recorded the accuracy of the selected model on the training set (green circles), the test set (orange triangles) and on a fresh holdout set of 10,000 rows (purple squares). We picked this additional large holdout set (called “fresh”) to get a good estimate of the true accuracy of the model. Here’s the performance of the naive procedure:
The test set performance estimates are more upwardly biased than the training set estimates! This is not too surprising, since we are using the test set to select variables. Despite the optimistic curve of the test scores, we can see from the true out-of-sample performance (freshScore
) that the model performance is not much better than random; in fact, the algorithm only picked one of the ten signal-carrying variables (the first one selected). The test set (and to a lesser extent, the training set) believes that performance continues to improve, when in fact performance degrades as more noise variables swamp out the single signal variable.
We want to use the test set to pick variables, while also using it to accurately estimate out-of-sample error. In their recent Science article, Dwork et al. propose an algorithm called Thresholdout to let us safely reuse the test set. We implemented our diffPrivMethod
based on Thresholdout. diffPrivMethod
evaluates proposed models on both the training and test sets. If the model’s accuracies on the test and the training sets are within a certain tolerance, then diffPrivMethod
returns the training set accuracy with a bit of Laplacian noise as the model score. If the two accuracies are far apart, then diffPrivMethod
adds a bit of Laplacian noise to the test set accuracy, and returns that as the model score. In R, the code would look something like this:
# Return an estimate of model performance in a
# differentially private way.
# v is the new candidate variable;
# varSet is the set of already selected variables.
# (rlaplace() draws Laplace noise; it is not in base R.)
diffPrivMethod = function(v) {
  # Original model was fit on
  # training data using varSet;
  # fit a new model using varSet+v
  # using the training data
  # and evaluate the difference
  # between the models on the test set
  testScore <- modelScoreImprovement(varSet, v,
                                     trainData, testData)
  # Do the same thing, but evaluate
  # the improvement on the training set
  trainScore <- modelScoreImprovement(varSet, v,
                                      trainData, trainData)
  if(abs(testScore - trainScore) <= eps/2) {
    # if the scores are close,
    # return trainScore with a little noise
    sc <- trainScore + rlaplace(1, eps/2)
  } else {
    # otherwise, return testScore
    # with a little more noise
    sc <- testScore + rlaplace(1, eps)
  }
  # make sure the score is in range [0,1]
  pmax(0, pmin(1, sc))
}
We picked the tolerance and the widths of the Laplace noise somewhat arbitrarily.
The beauty of this scheme is that we never look directly at the test set. If the test and training set performances are similar, we see the (noisy) training performance; when we do look at the test set, we only see a noisy view of it, so it leaks information more slowly.
The original Thresholdout algorithm does not add noise to the training set performance, probably under the reasoning that we know it already. We decided to add noise anyway, on the principle that even knowing when you are using the test set performance leaks a little information, and we wanted to reduce the leakage a little more.
Here are the results of the stepwise regression using diffPrivMethod
(we set the tolerance/noise parameter eps=0.04 for this run):
Now the test set estimates the true out-of-sample accuracies quite well. The chosen models achieve a peak accuracy of about 61%, at around 36 variables. This method did a better job of picking variables — it found all ten signal variables — though it did start picking noise variables fairly early on.
However, there is still money left on the table. What if we used a really large test set? Here, we show the results of running the naive stepwise regression (no differential privacy), with the same training set of 1000 rows and a test set of 10,000 rows.
Now the learning algorithm picks nine of the signal variables immediately, and achieves an accuracy of 62%. We can see that the model’s performance on the test set is still upwardly biased, but much less so than with the naive method. Because the test set is larger, we don’t contaminate it as much, even when looking at it directly.
This suggests that we can think of differential privacy as letting us simulate having a larger test set, though obviously we still need our actual test set to be a reasonable size, especially since stepwise regression requires a lot of queries to the test set. Our results also show that while these new differential-privacy-based schemes let us do a good job of estimating out-of-sample performance, that alone is not enough to ensure the best possible model performance.
All code, data, and examples for this experiment can be found here.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “Preserving Statistical Validity in Adaptive Data Analysis”, arXiv:1411.2664, April 2015.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no 6248, pp. 636–638, August 2015. [link to abstract]
Blum, Avrim and Moritz Hardt. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”, arXiv:1502.04585, February 2015.
Hadley Wickham, RStudio's Chief Scientist and prolific author of R books and packages, conducted an AMA (Ask Me Anything) session on Reddit this past Monday. The session was tremendously popular, generating more than 500 questions/comments and promoting the AMA to the front page of Reddit.
If you're not familiar with Hadley's work (which would be a surprise if you're an R user), his own introduction in the Reddit AMA post will fill you in:
Broadly, I'm interested in the process of data analysis/science and how to make it easier, faster, and more fun. That's what has led to the development of my most popular packages like ggplot2, dplyr, tidyr, stringr. This year, I've been particularly interested in making it as easy as possible to get data into R. That's lead to my work on the DBI, haven, readr, readxl, and httr packages. Please feel free to ask me anything about the craft of data science.
I'm also broadly interested in the craft of programming, and the design of programming languages. I'm interested in helping people see the beauty at the heart of R and learn to master it as easily as possible. As well as a number of packages like devtools, testthat, and roxygen2, I've written two books along those lines: Advanced R, which teaches R as a programming language, mostly divorced from its usual application as a data analysis tool; and R packages, which teaches software development best practices for R: documentation, unit testing, etc.
Check out the comments at the link below, where you'll find insights from Hadley on the best way to teach R, Big Data in R, the elegance (or otherwise) of the R language, being productive, the best BBQ, and much more.
There’s a plethora of generic content regarding GSoC preparation already available on the internet.* This blog post will concentrate on specific tips which should help anyone looking to get involved with the R Project, or any R-centric open-source project, begin their journey.
*The post assumes basic programming knowledge (loops, functions, etc.) and familiarity with git/GitHub.*
The basics
To begin with, familiarity with the basic R commands/constructs (functions, the apply family, etc.) is required. Previous experience of writing small-to-mid-sized programs and a good IDE can really come in handy. RStudio is highly recommended.
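As a taste of the constructs mentioned above, here is a tiny illustrative sketch: a user-defined function applied over a vector with `sapply` (the function name is just an example):

```r
# A user-defined function: square a number only if it is even.
squareIfEven <- function(x) {
  if (x %% 2 == 0) x^2 else x
}

# Apply it over a vector without an explicit loop.
sapply(1:6, squareIfEven)  # 1 4 3 16 5 36
```

If reading this feels natural, you have the baseline fluency most projects expect.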
The following resources should help in beginning your R journey.
> Try R
> Learn R with R tutorials and coding challenges – DataCamp
> swirl: Learn R, in R. (Give this a go even if already proficient in R; great resource!)
Carrying the journey forward
If you are looking for more in-depth R knowledge, try some of the numerous free courses on Coursera, edX, Udacity, etc.
(I had started with this one: https://www.coursera.org/course/rprog)
Once you find yourself moderately experienced in R programming (and if books are your thing), go through this ebook by Hadley Wickham: Advanced R.
He happens to be one of the most famous (and awesome) R programmers in the world (and was one of my GSoC mentors too). This book will help you become a much better R programmer, and in my opinion, a better programmer in general.
Getting the feel of things
Once all the R basics have been dealt with, try going through this year’s GSoC project list:
rstats-gsoc: table of proposed coding projects
Go through the proposal column of the list to get an idea about the projects that were done this year. Here’s the page for mine: rstats-gsoc/project page.
As you can see, there were two things a student had to do:
1.) Complete some tests.
2.) Contact the mentors.
This was, more or less, the template for each project. The individual projects, however, varied considerably. You will have to choose one (or more) to apply to, according to your interests, strengths, and experience. In addition to R and basic version control (git/svn), each project will require some additional skills. Some require proficiency in C/C++, others JavaScript, while still others will have you work with the Shiny framework for R.
R packages
The majority of the projects will require you to write an R package. The package system is the backbone of the R ecosystem, and one of the major reasons behind its popularity.
Quoting Quick-R,
Packages are collections of R functions, data, and compiled code in a welldefined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
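In code, the lifecycle described in that quote looks like this (shown with the base `stats` package, which ships with R, so nothing needs to be downloaded):

```r
# A package that ships with R can be loaded straight away:
library(stats)       # attach the package to the current session
median(c(1, 3, 7))   # its functions are now available; returns 3

# Third-party packages are installed into the library once,
# then loaded in every session that uses them:
# install.packages("ggplot2")
# library(ggplot2)
```

The distinction between installing (once, into the library) and loading (per session) trips up many newcomers, so it is worth internalizing early.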
This blog post should help you get started quickly: Writing an R package from scratch.
The ebook R Packages is one of the best and most up-to-date resources on the subject.
Miscellaneous
Join the R mailing lists (R: Mailing Lists) and follow some blogs. R-bloggers is a great one!
Conclusion
In Google Summer of Code (GSoC), there’s no fixed formula for getting accepted. The same applies to the R Project.
R, and statistical computing, are fascinating subjects, and you should explore them as freely as possible. At the time of writing this post (October 2, 2015), there are still a few months to go before GSoC really heats up.
*Some generic GSoC content:
> Student Guide
> There are as many as 137 organizations selected in GSoC this year. For a beginner, it’s really difficult to select an organization. I’m preparing for GSoC 2016. So, how can I choose an appropriate organization?
There are many more resources available on the internet. Look them up.
Have fun while you’re at it. That’s really important.
Open Source is fun!