by Antony Unwin
University of Augsburg, Germany
David Moore's definition of data: numbers that have been given a context.
Here is some context for the finch dataset:
Fig 1: Illustrations of the beaks of four of Darwin's finches from "The Voyage of the Beagle". Note that only one of these (fortis) is included in the dataset.
R's package system is one of its great strengths, offering powerful additional capabilities of all kinds and including many interesting real datasets. Of course not all packages are as good as they might be, and as Bill Venables memorably put it on R-help in 2007: "Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous." So you have to treat any packages with care ("caveat downloader", as the R-omans might have said) and any datasets supplied must be handled carefully too.
You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.
As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.
The issue came up because I was looking for a dataset of the month for the website of my book "Graphical Data Analysis with R". The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising; what's more, it is related to Darwin's work, and there was what looked like an original reference from 1904.
Figure 2 shows the distribution of species:
Fig 2: The numbers of birds of the five different species in the dataset. The distribution is a little unbalanced with over half the birds being from one species, but that is real data for you.
Some of the variable names are clear enough (TailL must be the length of the tail), but what on earth could N.UBkL be? The help for the dataset only says it is a numeric vector. As a first resort I tried to find the 1904 reference on the web and it was surprisingly easy. The complete book is available and searchable from Cornell University Library. N.UBkL must be 'Maxilla from Nostril', i.e. the distance from nose to upper beak — obvious in retrospect really.
Naturally, once you have an original source in front of you, you explore a bit more. It turns out that the dataset only includes the birds found on one island, although the species may be found on more than one. That is OK (although the package authors could have told us). All cases with any missing values have been dropped (9 out of 155). You can understand why that might have been done (methods cannot handle missing values?), but mentioning it would have been nice. Information on sex is available for each bird in the original, but it is not included in the dataset. Perhaps sex is not so relevant for their studies, but it is surely potentially very interesting to others. It is possible that the dataset was actually passed on to the authors by someone else and the authors themselves never looked at the original source. That would be by no means unusual in academic circles (sadly).
There is an extensive literature on Darwin's finches (which incidentally are not finches at all) and a key feature differentiating the species is the beak, as you can see in Fig 1. We can explore differences between species beaks more quantitatively by displaying the data in a suitable way:
Fig 3: A parallel coordinate plot of the nine measurements made on each bird with the five species distinguished by colour. The first two beak variables (BeakW and BeakH) separate the two bigger species from the other three. The following three variables (LBeakL, UBeakL, and N.UBkL) separate the smaller species from one another.
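A plot along the lines of Fig 3 can be produced with parcoord() from MASS. Note that the column layout assumed below (species label in the first column, the nine measurements after it) is an assumption, so check str(finch) first:

```r
library(MASS)    # parcoord() ships with the recommended MASS package
library(dynRB)   # provides the finch dataset
data(finch)
# Assumed layout: species label in column 1, the nine measurements after it
species <- factor(finch[, 1])
parcoord(finch[, -1], col = as.numeric(species), lwd = 0.8)
```

Each bird is one line; colouring by species makes the separating variables stand out.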
Could the two bigger species be separated from one another using some discriminant analysis or machine learning technique? Possibly; I have not tried, but it is worth noting that these two are considered to be two subspecies of the same species, so demonstrable differences are not so likely.
If you have got this far, you will realise that I am grateful to the package authors for providing this dataset in R and I appreciate their efforts. I just wish they had made a little more effort. When you think of how much care and effort went into collecting the real datasets we use (how long would it take you to collect so many birds, classify them and measure them?), we should take more trouble in looking after datasets and offer the original collectors of the data more respect and gratitude.
This is all true of so many datasets in R and you begin to wonder if there should not be a society for the protection of datasets (SPODS?). That might prevent them being so abused and maltreated. Far worse has been done to datasets in R than anything I have detailed here, but this is a family blog and details of graver abuses might upset sensitive readers.
To end on an optimistic note, some further googling led to the discovery of the complete data from the 1904 reference for all the species (there are not just 5 taxa, but 32) and all the Galapagos islands, with the sex variable, and with the cases with missing values. The source was the Dryad Digital Repository, a site that, I confess, was unknown to me. "The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable." Sounds good; we should encourage more sites like that, and we should encourage providers of datasets in R to look after any data in their care better.
And returning to Moore's definition of data, wouldn't it be a help to distinguish proper datasets from mere sets of numbers in R?
This post was originally published on mages' blog.
It seems the summer is coming to an end in London, so I shall take a final look at my ice cream data that I have been playing around with to predict sales statistics based on temperature for the last couple of weeks [1], [2], [3].
Here I will use the new brms (GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested to predict how much ice cream I should hold in stock for a hot day at 35ºC, such that I only run out of ice cream with a probability of 2.5%.
Like in my previous post about the log-transformed linear model with Stan, I will use Bayesian regression models to estimate the 95% prediction credible interval from the posterior predictive distribution.
Thanks to brms this will take less than a minute of coding, because brm allows me to specify my models in the usual formula syntax and I can leave it to the package functions to create and execute the Stan files.
Let’s start. Here is the data again:
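For completeness, this is the data from the earlier posts as I recall it (twelve observations of temperature in ºC and units of ice cream sold; the market size of 800 is the assumption used in the binomial model):

```r
icecream <- data.frame(
  # temperature (ºC) and units sold, from the earlier posts in this series
  temp  = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
            18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units = c(185L, 215L, 332L, 325L, 408L, 421L,
            406L, 412L, 522L, 445L, 544L, 614L)
)
market_size <- 800  # assumed market size for the binomial model
```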
My models are written down in very much the same way as with glm. Only the binomial model requires a slightly different syntax. Here I use the default priors and link functions:
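The four model calls might look as follows (a sketch: the object names, the icecream data frame from the earlier posts, and the market size of 800 are assumptions of mine; each brm() call compiles and runs a Stan model, so expect a wait):

```r
library(brms)
lin.mod     <- brm(units ~ temp,      data = icecream, family = gaussian)
log.lin.mod <- brm(log(units) ~ temp, data = icecream, family = gaussian)
pois.mod    <- brm(units ~ temp,      data = icecream, family = poisson)
# The binomial model needs the slightly different syntax mentioned above,
# declaring the number of trials (the assumed market size of 800)
bin.mod     <- brm(units | trials(800) ~ temp, data = icecream,
                   family = binomial("logit"))
```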
Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of brm. The estimated parameters are quite similar, apart from sigma:
I access the underlying Stan model via log.lin.mod$model and note that the prior on sigma is modelled via a Cauchy distribution, unlike the inverse gamma I used last week. I believe that explains my small difference in sigma.
To review the model I start by plotting the trace and density plots for the MCMC samples.
The predict function gives me access to the posterior predictive statistics, including the 95% prediction credible interval.
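For example, for a Gaussian model fitted with brm (lin.mod is an assumed object name), evaluated on a temperature grid:

```r
# Posterior predictive summaries, including the 95% prediction
# credible interval (the 2.5% and 97.5% quantiles)
newdat <- data.frame(temp = seq(10, 35, by = 0.5))
pp <- predict(lin.mod, newdata = newdat, probs = c(0.025, 0.975))
head(cbind(newdat, pp))
```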
Combining the outputs of all four models into one data frame gives me then the opportunity to compare the prediction credible intervals of the four models in one chart.
There is no big news in respect of the four models, except that here I can look at the posterior prediction credible intervals, rather than the theoretical distributions from two weeks ago. The over-prediction of the log-transformed linear model is apparent again.
Running out of stock on a hot summer's day would be unfortunate, because those are the days when sales will be highest. But how much stock should I hold?
Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5% percentile of the posterior predictive samples.
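In code, this is the 97.5% quantile of the posterior predictive distribution at 35ºC (again a sketch with assumed model object names; for the log-transformed model the quantile is back-transformed with exp, which is legitimate because quantiles are preserved under monotone transformations):

```r
hot <- data.frame(temp = 35)
predict(lin.mod,  newdata = hot, probs = 0.975)  # 97.5% stock level
predict(pois.mod, newdata = hot, probs = 0.975)
predict(bin.mod,  newdata = hot, probs = 0.975)
# Back-transform the log-scale model; only the quantile column
# remains meaningful after exp (the mean column does not)
exp(predict(log.lin.mod, newdata = hot, probs = 0.975))
```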
OK, I have four models and four answers ranging from 761 to 2494. The highest number is more than three times the lowest number!
I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, which explains the high prediction of 2494. The Poisson model, like the log-transformed linear model, has the implicit exponential growth assumption. Its mean forecast is well over 1000, as I pointed out earlier, and hence the 97.5% prediction of 1510 is to be expected.
How much ice cream should I hold in stock? Well, if I believe strongly in my assumption of a market size of 800, then I should stick with the output of the binomial model, or perhaps hold 800 just in case.
Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.
Note, I used the developer version of brms from GitHub.
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lattice_0.20-33 brms_0.4.1.9000 ggplot2_1.0.1
[4] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] codetools_0.2-14 digest_0.6.8 MASS_7.3-43
[4] grid_3.2.2 plyr_1.8.3 gtable_0.1.2
[7] stats4_3.2.2 magrittr_1.5 scales_0.2.5
[10] stringi_0.5-5 reshape2_1.4.1 proto_0.3-10
[13] tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[16] parallel_3.2.2 colorspace_1.2-6
This post provides an overview of performing diagnostics and performance evaluation on logistic regression models in R. After training a statistical model, it's important to understand how well that model did with regard to its accuracy and predictive power. The following content will provide the background and theory to ensure that the right techniques are used for evaluating logistic regression models in R.
We will use the GermanCredit dataset in the caret package for this example. It contains 62 characteristics and 1000 observations, with a target variable (Class) that is already defined. The response variable is coded 0 for bad consumer and 1 for good. It's always recommended that one look at the coding of the response variable to ensure that it's a factor variable coded accurately with a 0/1 scheme or two factor levels in the right order. The first step is to partition the data into training and testing sets.
Using the training dataset, which contains 600 observations, we will use logistic regression to model Class as a function of five predictors.
Bear in mind that the estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale. For example, this model suggests that for every one unit increase in Age, the log-odds of the consumer having good credit increases by 0.018. Because this isn't of much practical value, we'll usually want to use the exponential function to calculate the odds ratios for each predictor.
This informs us that for every one unit increase in Age, the odds of having good credit increase by a factor of 1.01. In many cases, we want to use the model parameters to predict the value of the target variable in a completely new set of observations. That can be done with the predict function. Keep in mind that if the model was created using the glm function, you'll need to add type="response" to the predict command.
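To make the mechanics concrete, here is the same workflow on a small simulated stand-in for the credit data (the variable names and the data-generating process are hypothetical, chosen only so the example is self-contained):

```r
set.seed(42)
# Simulated stand-in for the credit data; names are hypothetical
sim <- data.frame(
  age    = round(runif(500, 20, 70)),
  amount = rlnorm(500, 8, 0.5)
)
sim$good <- rbinom(500, 1, plogis(-1 + 0.02 * sim$age))  # 0/1 response

mod <- glm(good ~ age + amount, data = sim, family = binomial(link = "logit"))
coef(mod)        # estimates on the log-odds scale
exp(coef(mod))   # odds ratios: multiplicative change in the odds per unit
# type = "response" returns predicted probabilities rather than log-odds
probs <- predict(mod, newdata = data.frame(age = 40, amount = 3000),
                 type = "response")
```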
A logistic regression model has been built and the coefficients have been examined. However, some critical questions remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? The rest of this document will cover techniques for answering these questions and provide R code to conduct that analysis.
For the following sections, we will primarily work with the logistic regression model that I created with the glm() function. While I prefer utilizing the caret package, many functions in R work better with a glm object.
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well (i.e. the model will have a lower log likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. Given that H_0 holds that the reduced model is true, a p-value for the overall model fit statistic that is less than 0.05 would compel us to reject the null hypothesis. It would provide evidence against the reduced model in favor of the current model. The likelihood ratio test can be performed in R using the lrtest() function from the lmtest package or using the anova() function in base R.
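A minimal base-R sketch of the comparison via anova(), using simulated data (lrtest() from lmtest gives the same test; all names below are illustrative):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1))                 # x2 carries no signal
full    <- glm(y ~ x1 + x2, data = d, family = binomial)  # current model
reduced <- glm(y ~ x1,      data = d, family = binomial)  # fewer predictors
anova(reduced, full, test = "LRT")  # a large p-value supports the reduced model
```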
Higher values of the LR test statistic lead to small p-values and provide evidence against the reduced model. However, when the test statistic is small and the p-value is greater than 0.05, we cannot reject H_0, and the test provides evidence in support of the reduced model. In such cases, we would want to consider a variety of separate models with different predictors in order to determine the best model for the given dataset.
Unlike linear regression with ordinary least squares estimation, there is no R^2 statistic which explains the proportion of variance in the dependent variable that is explained by the predictors. However, there are a number of pseudo R^2 metrics that could be of value. Most notable is McFadden’s R^2, which is defined as 1 – [ ln(L_M) / ln(L_0) ] where ln(L_M) is the log likelihood value for the fitted model and ln(L_0) is the log likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has no predictive power.
Values for McFadden's R^2 range from 0 to just under 1, with values above 0.2 generally being considered satisfactory. If McFadden's R^2 is less than 0.2, then the model does a poor job of explaining the target variable, and it's recommended that you try to find additional predictive variables that will improve the overall fit of the model.
library(pscl)  # pR2() reports several pseudo R-squared measures
pR2(mod_fit_one) # look for 'McFadden'
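The same statistic can also be computed directly from the fitted and null log-likelihoods in base R (a sketch; pass in any fitted glm object, such as mod_fit_one above):

```r
# McFadden's pseudo R^2: 1 - lnL(fitted model) / lnL(intercept-only model)
mcfadden <- function(mod) {
  null <- update(mod, . ~ 1)   # refit keeping only the intercept
  as.numeric(1 - logLik(mod) / logLik(null))
}
```

The result should agree with the McFadden entry reported by pR2().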
Another approach to determining goodness of fit is the Hosmer-Lemeshow test, whose statistic is computed after the observations have been segmented into groups based on having similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set using a Pearson chi-square test. Small values of the statistic with large p-values indicate a good fit to the data, while large values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data, and in the example below we would reject H_0.
If the p-value for the HL test statistic is greater than 0.05, we fail to reject the null hypothesis that there is no difference between the observed and model-predicted values. This would suggest that the model provides an adequate fit to the data. However, when the p-value is less than 0.05, the fit is poor and the influence of individual observations on the model fit should be examined.
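The test is available as hoslem.test() in the ResourceSelection package; the idea can also be sketched directly in base R (group into deciles of predicted risk, then a chi-square statistic with g - 2 degrees of freedom):

```r
# Hosmer-Lemeshow p-value from a 0/1 response and predicted probabilities
hoslem_p <- function(y, yhat, g = 10) {
  grp  <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1 / g)),
              include.lowest = TRUE)
  obs  <- tapply(y,    grp, sum)     # observed events per risk group
  expe <- tapply(yhat, grp, sum)     # expected events per risk group
  n    <- tapply(yhat, grp, length)
  chisq <- sum((obs - expe)^2 / (expe * (1 - expe / n)))
  pchisq(chisq, df = g - 2, lower.tail = FALSE)
}
```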
A Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient. The idea is to test the hypothesis that the coefficient of an independent variable in the model is not significantly different from zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model. One important point to note is that the Wald statistic is usually printed in most model outputs as the z-score.
If the p-value for the Wald statistic is below 0.05, we reject the null hypothesis and accept that the variable should be included in the model. However, a p-value greater than 0.05 suggests that the explanatory variable can be omitted from the model.
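The z-scores in a standard glm summary are exactly these Wald statistics, estimate divided by standard error (illustrated here on simulated data with hypothetical names):

```r
set.seed(4)
d <- data.frame(x = rnorm(150))
d$y <- rbinom(150, 1, plogis(1.2 * d$x))
fit <- glm(y ~ x, data = d, family = binomial)
ctab <- coef(summary(fit))
# The reported z value is the Wald statistic: estimate / standard error;
# squaring it gives the chi-square form described above
z <- ctab[, "Estimate"] / ctab[, "Std. Error"]
all.equal(unname(z), unname(ctab[, "z value"]))  # TRUE
```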
library(survey)  # regTermTest() performs Wald tests on individual model terms
regTermTest(mod_fit_one, "ForeignWorker")
regTermTest(mod_fit_one, "CreditHistory.Critical")
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models.
These variable importance tables inform us of the strength of the effect associated with each explanatory variable in the model. One should carefully examine these tables and use them to determine whether there are predictors that could be dropped from the regression model.
varImp(mod_fit)
When developing models for prediction, the most critical metric is how well the model predicts the target variable on out-of-sample observations. The process involves using the model estimates to predict values on the testing set and then comparing the predicted target variable against the observed values for each observation. In the example below, you'll notice that our model accurately predicted 67% of the observations in the testing set.
Whether a classification rate is good depends on a number of factors, including the quality of the data, business domain, modeling technique, and so forth. Therefore, one should consider all of these factors when looking at the classification rate and determining whether it’s ‘good enough.’
pred <- predict(mod_fit, newdata = testing)   # predicted classes
accuracy <- table(pred, testing[, "Class"])   # confusion matrix
sum(diag(accuracy)) / sum(accuracy)           # overall classification rate
The receiver operating characteristic (ROC) curve is a measure of classifier performance. Using the proportion of positive data points that are correctly considered as positive and the proportion of negative data points that are mistakenly considered as positive, we generate a graphic that shows the trade-off between the rate at which you can correctly predict something and the rate of incorrectly predicting something. Ultimately, we're concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.50 to 1.00, and values above 0.80 indicate that the model does a good job of discriminating between the two categories which comprise our target variable. Bear in mind that ROC curves can examine both target-predictor pairings and target-model performance. An example of each is presented below.
If the AUROC falls between 0.50 and 0.70, one should consider revisiting the individual predictors that are in the model and whether any other explanatory variables should be included.
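The AUROC has a convenient interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. That makes it easy to sketch in base R via the rank-sum identity (packages such as pROC or ROCR draw the full curves):

```r
# AUROC via the rank-sum identity:
# P(score of a random positive > score of a random negative)
auroc <- function(y, score) {
  r  <- rank(score)                  # mid-ranks handle ties gracefully
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auroc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75: one misordered pair
```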
Once the logistic regression model has been developed and the odds-ratios calculated, it’s advised that we look at the confidence intervals for these estimates. Given that these parameters were estimated from a sample in order to explain phenomena at a larger population, it’s reasonable to expect that the real value of the statistic within the population won’t exactly match our model estimate and will actually fall within some range of values. Therefore, when evaluating statistical models, it’s important to consider the confidence intervals for the predictor estimates and their odds ratios. With logistic regression, we also need to be cognizant of the underlying distributional assumptions and the scale of the interval.
A key question when assessing confidence intervals for the estimates is whether the values are significantly different from 0 (or, for odds ratios, from 1). This can help inform us as to whether an explanatory variable should be included in the model. There is no hard rule, however, and any decision to modify the predictor list should also account for other diagnostic results, domain expertise, and so forth. The important thing to remember is that confidence intervals should be examined for both the individual parameter estimates and the odds ratios. Furthermore, it is suggested that one compare both the Wald and profile likelihood CIs when evaluating the model.
confint(mod_fit_one) # selects the appropriate profile method
confint.default(mod_fit_one) # ci's for the estimates using wald
require(MASS)
exp(confint(mod_fit_one)) # ci's for the odds ratios
Before building a model and evaluating its results, we need to examine the correlations between the variables in the data set. The goal is to identify variables that have a strong linear relationship; this is done by developing a correlation matrix that takes each continuous variable and finds the correlation coefficient for every pairing in the data set. The correlation coefficient is calculated using Pearson or Spearman measures, with values ranging from -1 (negative correlation) to 1 (positive correlation). When working with a large number of variables, the correlation matrix can be visualized using a correlation plot to make the evaluation easier. For non-continuous variables that are categorical or ordinal, techniques such as analysis of variance or chi-square tests can be used to assess the relationship between those variables. In such cases, we want to know if there is a statistically significant association between the two variables that we're testing.
Correlation coefficients range from -1 (negative correlation) to 1 (positive correlation). For each two-by-two correlation, or entry in the correlation matrix, we can see whether the relationship between the two variables is strongly positive, weakly positive, and so forth. As a general rule, values greater than 0.5 or less than -0.5 indicate that the correlation is strong, while values between +/- 0.30 and +/- 0.50 indicate a moderate relationship. When those thresholds are met, the user should examine the results to determine whether there is covariance that could be contributing to multicollinearity in the model. Likewise, if we fail to reject the null hypothesis on the chi-square test of independence, or have point-biserial estimates beyond +/- 0.30, the user should consider whether these metrics indicate that multicollinearity is present within the model.
Multicollinearity refers to a situation in which an explanatory variable within a regression model is correlated with other predictors. The same tests we use to detect multicollinearity in multivariable linear regression models can also be used to detect whether multicollinearity exists within logistic regression, although slight modifications will have to be made in the interpretation of the statistic. The primary test for detecting multicollinearity in a set of explanatory variables is the variance inflation factor (VIF) test. It determines how much of the inflation in the standard error could be caused by collinearity.
Variance inflation factors are calculated for each predictor in the regression model. While there is disagreement on the appropriate cutoff for identifying whether the model suffers from multicollinearity, a common rule of thumb is that VIF values above 5 or 10 warrant attention.
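car::vif() is the standard implementation, but the definition is simple enough to sketch in base R: regress each predictor on the others and inflate by 1 / (1 - R^2). The data below are simulated purely for illustration:

```r
# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on all remaining predictors
vif_by_hand <- function(X) {
  X <- as.matrix(X)
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}
set.seed(5)
a <- rnorm(300); b <- rnorm(300)
c_ <- a + rnorm(300, sd = 0.1)      # c_ nearly duplicates a
vif_by_hand(cbind(a, b, c_))        # a and c_ inflated, b stays near 1
```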
When variables show high correlation between one another, the user should consider ways to mitigate that. One solution could be to throw out an explanatory variable. The user could also find ways to aggregate variables that are correlated into a single index or new predictor. Lastly, the user could always look for a proxy variable that is similar to the variables that suffer from multicolinearity, but can still help with predicting the target variable.
After developing the model, the user will also want to examine whether certain observations in the data set had a significant impact on the estimates. Observations that are significantly different from others may distort the parameter estimates and could indicate data entry errors; such influential observations may also be worth examining further in their own right. To determine whether an observation has a stronger impact on the model estimates than others, we must first look at the residuals to identify observations that are not explained well by the model. Furthermore, we want to look at leverage scores, which measure the potential impact of an individual case on the results. Points with high leverage have a potentially influential impact on the model. However, bear in mind that an observation can have high leverage but little influence.
Because a high-leverage observation is not necessarily influential, the user must consider how much of an impact the observation actually has on the estimates of the model. High leverage does not automatically mean a point is influential, and low leverage doesn't necessarily mean that it's not influential. When you believe that a data point has both high leverage and contributes negatively to the accuracy of the model, you must consider what action to take. It's possible that there was a data entry error or some other reason which could compel us to throw out the observation. Another solution could be to impute that data point with a value that isn't as extreme, such as the average for that variable.
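Base R exposes these diagnostics directly for any fitted glm; a sketch on simulated data (all names hypothetical):

```r
set.seed(6)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = binomial)
lev  <- hatvalues(fit)                    # leverage: potential impact
res  <- residuals(fit, type = "pearson")  # poorly explained observations
cook <- cooks.distance(fit)               # combines leverage and residual size
# A common rough screen: flag observations with Cook's distance above 4/n
flagged <- which(cook > 4 / length(cook))
```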
There you have it, a quick overview of how to evaluate logistic regression models. We outlined a variety of techniques that are used to assess goodness of fit, variable importance, and predictive accuracy. Furthermore, we covered other essential stages of the model-building process, such as outlier detection and testing for multicollinearity. Clearly, it's not enough just to build a model and look at its R^2 value. Practitioners must examine a variety of things to ensure that they are building quality models that can generalize to a larger population.
I’m pretty excited for tomorrow: I’ll begin teaching the Fall 2015 offering of 36-721, Statistical Graphics and Visualization. This is a half-semester course designed primarily for students in our MSP program (Masters in Statistical Practice).
A large part of the focus will be on useful principles and frameworks: human visual perception, the Grammar of Graphics, graphic design and interaction design, and more current dataviz research. As for tools, besides base R and ggplot2, I’ll introduce a bit of Tableau, D3.js, and Inkscape/Illustrator. For assessments, I’m trying a variant of “specs grading”, with a heavy use of rubrics, hoping to make my expectations clear and my TA’s grading easier.
My initial course materials are up on my department webpage.
(I’ll probably just use Blackboard during the semester, but I may post the final materials here again.)
It’s been a pleasant challenge to plan a course that can satisfy statisticians (slice and dice data quickly to support detailed analyses! examine residuals and other model diagnostics! work with data formats from rectangular CSVs through shapefiles to social networks!) … while also passing on lessons from the data journalism and design communities (take design and the user experience seriously! use layout, typography, and interaction sensibly!). I’m also trying to put into practice all the advice from teaching seminars I’ve taken at CMU’s Eberly Center.
Also, in preparation, this summer I finally enjoyed reading more of the classic visualization books on my list.
I hope to post longer notes on each book sometime later.
“…for a general linear model (GLM), a single linear function is a sufficient statistic for each associated parameter…”
The recently arXived paper "Likelihood-free inference in high-dimensional models", by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [arising when the dimension of the parameter and of the corresponding summary statistics is large] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as when using a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves. One may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And setting the tolerance for each parameter is a delicate calibration issue because in principle the tolerance should depend on the other component values. I ran a comparative experiment on a toy normal target, using either empirical mean and variance (blue), or empirical median and mad (brown), with little difference in the (above) output. Especially when considering that I set the tolerance somewhat arbitrarily. This could be due to the fact that the pairs are quite similar in terms of their estimation properties. However, I then realised that the empirical variance is not sufficient for the variance conditional on the mean parameter. I looked at the limiting case (with zero tolerance), which amounts to simulating σ first and then μ given σ, and ran a (Gibbs and iid) simulation.
The difference, as displayed below (red standing for the exact ABC case), is not enormous, even though it produces a fatter tail in μ. Note the interesting feature that I cannot produce the posterior distribution of the parameters given the median and mad statistics. Which is a great introduction to ABC!
N=10
data=rnorm(N,mean=3,sd=.5)

# ABC with insufficient statistics (median and mad)
medata=median(data)
madata=mad(data)
varamad=rep(0,100)
for (i in 1:100)
  varamad[i]=mad(sample(data,N,rep=TRUE))
tol=c(.01*mad(data),.05*mad(varamad))
T=1e6
mu=rep(median(data),T)
sig=rep(mad(data),T)
for (t in 2:T){
  mu[t]=rnorm(1)
  pseudo=rnorm(N,mean=mu[t],sd=sig[t-1])
  if (abs(medata-median(pseudo))>tol[1]) mu[t]=mu[t-1]
  sig[t]=1/rexp(1)
  pseudo=rnorm(N,mean=mu[t],sd=sig[t])
  if (abs(madata-mad(pseudo))>tol[2]) sig[t]=sig[t-1]
}

# ABC with sufficient statistics (mean and sd)
meaata=mean(data)
sddata=sd(data)
varamad=rep(0,100)
for (i in 1:100)
  varamad[i]=sd(sample(data,N,rep=TRUE))
tol=c(.1*sd(data),.1*sd(varamad))
for (t in 2:T){
  mu[t]=rnorm(1)
  pseudo=rnorm(N,mean=mu[t],sd=sig[t-1])
  if (abs(meaata-mean(pseudo))>tol[1]) mu[t]=mu[t-1]
  sig[t]=1/rexp(1)
  pseudo=rnorm(N,mean=mu[t],sd=sig[t])
  if (abs(sddata-sd(pseudo))>tol[2]) sig[t]=sig[t-1]
}

# exact (zero-tolerance) simulation: sigma first, then mu given sigma
sig=1/sqrt(rgamma(T,shape=.5*N,rate=1+.5*var(data)))
mu=rnorm(T,mean(data)/(1+sig^2/N),sd=1/sqrt(1+N/sig^2))
by Richard Kittler, Revolution R Enterprise PM, Microsoft Advanced Analytics
Revolution is excited to announce the availability of its latest release, Revolution R Enterprise 7.4.1 (RRE), as a technical preview on Microsoft Azure via Windows- and Linux-based virtual machines in the Azure Marketplace. Through Azure's world-wide cloud infrastructure, customers now have on-demand access to high-performance predictive analytics to accelerate growth, optimize operations, and expedite data insight and discovery from any place and at any time. Availability in the Azure Marketplace is the first step in Microsoft's plan to integrate Revolution's products with Azure and, in the bigger picture, with Cortana Analytics.
Through the Azure Marketplace, users can run computations on data sets up to 1 terabyte on cloud-based Windows and Linux multi-CPU instances with 4 to 32 vCPUs (virtual CPUs), accessing data copied from an Azure data store (including blob storage or SQL Server) or accessed directly through an ODBC connection. Utility pricing for both versions starts at $1.50 per 4 cores per hour with no long-term commitments. During this preview period, RRE users will be supported through a monitored Azure forum.
The Windows version of RRE is accessed with Windows Remote Desktop and includes Revolution’s Visual Studio-based IDE for R developers (Revolution R Enterprise DevelopR). Access to the Linux version of RRE is via SSH with R access through the RGui. Alternatively, customers have the option of bringing their own license for a favorite IDE such as RStudio, RStudio Server, or StatET. Both the Windows and Linux offerings include Revolution’s DeployR web services module.
RRE features a unique “write once deploy anywhere” functionality that enables data analysts and IT teams to write code once and deploy it anywhere in a variety of data management platforms, enterprise data warehouses, grids, clusters, servers and workstations without re-engineering costs. It is the only Big Data Big Analytics platform to include a library of Big Data-ready algorithms that run inside the Cloudera, Hortonworks, and MapR Hadoop platforms, Teradata databases, and soon SQL Server, with the highest possible performance.
A 30-day free trial is available with no software charge, though standard Azure cloud usage charges apply. To learn more about RRE on Azure Marketplace, please visit the Marketplace posting here.
A very brief post at the end of the field season on two little "details" that are annoying me in papers/analyses that I sometimes see being done around me.
The first one concerns mixed effect models where the model contains a grouping factor (say month or season) that is fitted as both a fixed effect term and a random effect term (on the right-hand side of the | in lme4 model syntax). I don't really understand why anyone would want to do this, so instead of spending time writing equations, let's just make a simple simulation example and see what the consequences of doing this are:
library(lme4)
set.seed(20150830)
# an example: measuring plant biomass in four different months along a
# gradient of temperature
data<-data.frame(temp=runif(100,-2,2),month=gl(n=4,k=25))
modmat<-model.matrix(~temp+month,data)
# the coefficients
eff<-c(1,2,0.5,1.2,-0.9)
data$biom<-rnorm(100,modmat%*%eff,1)
# the simulated coefficients for month are 0.5, 1.2, -0.9

# a simple lm
m_fixed<-lm(biom~temp+month,data)
coef(m_fixed) # not too bad
## (Intercept)        temp      month2      month3      month4
##   0.9567796   2.0654349   0.4307483   1.2649599  -0.8925088

# a lmm with month ONLY as a random term
m_rand<-lmer(biom~temp+(1|month),data)
fixef(m_rand)
## (Intercept)        temp
##    1.157095    2.063714
ranef(m_rand)
## $month
##   (Intercept)
## 1  -0.1916665
## 2   0.2197100
## 3   1.0131908
## 4  -1.0412343
VarCorr(m_rand) # the estimated sd for the month coefficients
##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.87720
##  Residual             0.98016
sd(c(0,0.5,1.2,-0.9)) # the simulated one, not too bad!
## [1] 0.8831761

# now a lmm with month as both a fixed and a random term
m_fixedrand<-lmer(biom~temp+month+(1|month),data)
fixef(m_fixedrand)
## (Intercept)        temp      month2      month3      month4
##   0.9567796   2.0654349   0.4307483   1.2649599  -0.8925088
ranef(m_fixedrand) # very, VERY small
## $month
##     (Intercept)
## 1  0.000000e+00
## 2  1.118685e-15
## 3 -9.588729e-16
## 4  5.193895e-16
VarCorr(m_fixedrand)
##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.40397
##  Residual             0.98018

# how does it affect the estimation of the fixed effect coefficients?
summary(m_fixed)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  0.9567796  0.2039313  4.691676 9.080522e-06
## temp         2.0654349  0.1048368 19.701440 2.549792e-35
## month2       0.4307483  0.2862849  1.504614 1.357408e-01
## month3       1.2649599  0.2772677  4.562233 1.511379e-05
## month4      -0.8925088  0.2789932 -3.199035 1.874375e-03
summary(m_fixedrand)$coefficients
##               Estimate Std. Error    t value
## (Intercept)  0.9567796  0.4525224  2.1143256
## temp         2.0654349  0.1048368 19.7014396
## month2       0.4307483  0.6390118  0.6740851
## month3       1.2649599  0.6350232  1.9919901
## month4      -0.8925088  0.6357784 -1.4038048
# the estimates are not affected, but the standard errors of the intercept and
# the month coefficients are doubled; this makes it less likely that a
# significant p-value will be returned for these effects, i.e. a higher chance
# to infer that there is no month effect when there is one

# and what if we simulate data as the model assumes, i.e. a fixed effect of
# month plus an added random component?
rnd.eff<-rnorm(4,0,1.2)
mus<-modmat%*%eff+rnd.eff[data$month]
data$biom2<-rnorm(100,mus,1)
# an lmm model
m_fixedrand2<-lmer(biom2~temp+month+(1|month),data)
fixef(m_fixedrand2) # weird coefficient values for the fixed effect of month
## (Intercept)        temp      month2      month3      month4
##   -2.064083    2.141428    1.644968    4.590429    3.064715
c(0,eff[3:5])+rnd.eff
# looking at the intervals between the intercept and the different levels, we
# realize that the fixed effect part of the model sucked in the added random part
## [1] -2.66714133 -1.26677658  1.47977624  0.02506236
VarCorr(m_fixedrand2)
##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.74327
##  Residual             0.93435
ranef(m_fixedrand2) # again very, VERY small
## $month
##     (Intercept)
## 1  1.378195e-15
## 2  7.386264e-15
## 3 -2.118975e-14
## 4 -7.752347e-15
# so this is basically not working: it does not make sense to have a grouping
# factor as both a fixed effect term and a random term (i.e. on the right-hand
# side of the |)
Take-home message: don't put a grouping factor as both a fixed and a random effect term in your mixed effect model. lmer is not able to separate the fixed from the random part of the effect (and I don't know how it could be done) and basically gives everything to the fixed effect, leaving very small random effects. The issue is a bit pernicious, because if you only looked at the standard deviation of the random term in the merMod summary output you could not have guessed that something was wrong. You need to actually look at the random effects to realize that they are incredibly small. So when building complex models with many fixed and random terms, beware, and always check the estimated random effects.
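One way to make that check routine (a minimal sketch of my own, not from the post above; the helper name check_ranef is made up) is to test whether the conditional modes returned by ranef() are all numerically zero:

```r
library(lme4)

# hypothetical helper: returns TRUE for each random term whose conditional
# modes are all essentially zero, the degenerate pattern seen above
check_ranef <- function(model, tol = 1e-8) {
  sapply(ranef(model), function(re) all(abs(unlist(re)) < tol))
}

# rebuild the simulated example from above
set.seed(20150830)
data <- data.frame(temp = runif(100, -2, 2), month = gl(n = 4, k = 25))
eff <- c(1, 2, 0.5, 1.2, -0.9)
data$biom <- rnorm(100, model.matrix(~temp + month, data) %*% eff, 1)

check_ranef(lmer(biom ~ temp + (1 | month), data))          # month: FALSE, fine
check_ranef(lmer(biom ~ temp + month + (1 | month), data))  # month: TRUE, degenerate
```

Running this after fitting a complex model gives a quick alarm bell without having to eyeball every block of ranef() output.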
The second issue is maybe a bit older, but I saw it appear in a recent paper (which is a cool one except for this stats detail). After fitting a model with several predictors, one wants to plot their effects on the response, and some people use partial residual plots to do this (wiki). The issue with these plots is that when two variables have a high covariance, the partial residual plot will tend to be over-optimistic concerning the effect of variable X on Y (i.e. the plot will look much nicer than it should). Again, let's do a little simulation on this:
library(MASS)
set.seed(20150830)
# say we measure plant biomass in relation to measured temperature and the
# number of sunny hours, say per week
# the variance-covariance matrix between temperature and sunny hours
sig<-matrix(c(2,0.7,0.7,10),ncol=2,byrow=TRUE)
# simulate some data
xs<-mvrnorm(100,c(5,50),sig)
data<-data.frame(temp=xs[,1],sun=xs[,2])
modmat<-model.matrix(~temp+sun,data)
eff<-c(1,2,0.2)
data$biom<-rnorm(100,modmat%*%eff,0.7)
m<-lm(biom~temp+sun,data)
sun_new<-data.frame(sun=seq(40,65,length=20),temp=mean(data$temp))
# partial residual plot of sun
sun_res<-resid(m)+coef(m)[3]*data$sun
plot(data$sun,sun_res,xlab="Number of sunny hours",ylab="Partial residuals of Sun")
lines(sun_new$sun,coef(m)[3]*sun_new$sun,lwd=3,col="red")
# plot of the sun effect while controlling for temp
pred_sun<-predict(m,newdata=sun_new)
plot(biom~sun,data,xlab="Number of sunny hours",ylab="Plant biomass")
lines(sun_new$sun,pred_sun,lwd=3,col="red")
# same thing for temp
temp_new<-data.frame(temp=seq(1,9,length=20),sun=mean(data$sun))
pred_temp<-predict(m,newdata=temp_new)
plot(biom~temp,data,xlab="Temperature",ylab="Plant biomass")
lines(temp_new$temp,pred_temp,lwd=3,col="red")
The first graph is a partial residual plot; from this graph alone we would be tempted to say that the number of sunny hours has a large influence on the biomass. This conclusion is biased by the fact that the number of sunny hours covaries with temperature, and temperature has a large influence on plant biomass. So which is more important, temperature or sun? The way to resolve this is to plot the actual observations and to add a fitted regression line from a new dataset (sun_new in the example) where one variable is allowed to vary while all others are fixed at their means. This way we see how an increase in the number of sunny hours at an average temperature affects the biomass (the second figure). The final graph then shows the effect of temperature while controlling for the effect of the number of sunny hours.
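As an aside (not in the original post), base R can also produce partial residual plots directly, which makes it easy to compare them with the prediction-based plots: residuals() with type = "partial" returns one column of partial residuals per term, and termplot() draws the corresponding panels:

```r
# partial residuals of the fit m from the example above, one column per term
pres <- residuals(m, type = "partial")
head(pres[, "sun"])

# base R panels with partial residuals overlaid, for comparison with the
# prediction-based plots
termplot(m, partial.resid = TRUE, se = TRUE)
```

The same caveat about covarying predictors applies to these plots too, of course; they are just quicker to produce than computing the residuals by hand.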
Happy modelling!
This is the bimonthly post (for 2015-08-31) for new R Jobs from R-users.com.
Employers: you may visit this link to post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest (or go to R-users.com to see all the R jobs that are currently available).
(you may also look at previous R jobs posts).