In this post we look at one of the most unintuitive results in statistics: the false positive paradox, an instance of the base rate fallacy. It describes a situation where a positive result from a very sensitive medical test indicates that you have the respective disease… yet you are most probably healthy!
The reason is that the disease itself is so rare that even with a very sensitive test a positive result is most probably a false positive: the test says that you have the disease, yet you are healthy.
The key to understanding this result is the difference between two conditional probabilities: the probability of a positive test result given that you are sick, and the probability that you are sick given a positive test result. You are interested in the second one (am I really sick?), but you only know the first.
Now for some notation (the vertical bar means “under the condition that”, P stands for probability): let A be the event that you are sick and B the event of a positive test result. Then P(B|A) is the probability of a positive test given that you are sick, and P(A|B) is the probability of being sick given a positive test.
To calculate one conditional probability from the other we use the famous Bayes’ theorem: P(A|B) = P(B|A) · P(A) / P(B).
In the following example we assume a disease with an infection rate of 1 in 1000 and a test to detect this disease with a sensitivity of 99%. Have a look at the following code which illustrates the situation with Euler diagrams, first the big picture, then a zoomed-in version:
library(eulerr)

A <- 0.001   # prevalence of disease
BlA <- 0.99  # sensitivity of test
B <- A * BlA + (1 - A) * (1 - BlA) # positive test (specificity same as sensitivity)
AnB <- BlA * A
AlB <- BlA * A / B # Bayes's theorem
#AnB / B           # Bayes's theorem in different form
C <- 1             # the whole population

main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))
set.seed(123)
fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))
fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))
As you can see, although this test is very sensitive, when you get a positive test result the probability that you are actually infected is only about 9%!
In the diagrams C is the whole population and A are the infected individuals. B shows the people with a positive test result, and you can see in the second diagram that almost all of the infected A are also part of B (the brown area = true positives), but still most of B lies outside of A (the green area): although these people are not infected, they have a positive test result. They are false positives.
The red area shows the people who are infected (A) but get a negative test result stating that they are healthy: these are the false negatives. The grey area shows the people who are healthy and get a negative test result: the true negatives.
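To double-check the 9% figure, here is a small simulation of my own (not part of the original Euler-diagram code): we simulate one million people and count how many of the positive test results are actually sick.

```r
# simulate one million people with a 1-in-1000 disease
# and a test that is right 99% of the time for everybody
set.seed(42)
n <- 1e6
sick <- runif(n) < 0.001
test_pos <- ifelse(sick, runif(n) < 0.99, runif(n) < 0.01)

# among all positive tests, the fraction of actually sick people
mean(sick[test_pos])  # close to 0.09, i.e. about 9%
```

The simulated proportion agrees with the exact Bayes calculation: roughly 990 true positives are swamped by roughly 9,990 false positives.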
Fitting the occasion, we now come to an even more extreme example: did Jesus rise from the dead? It is inspired by the very good essay “A drop in the sea”: Don’t believe in miracles until you’ve done the math.
Let us assume that we had very, very reliable witnesses (as a side note, it is strange that the gospels cannot even agree on how many men or angels appeared at the tomb: one angel in Matthew, a young man in Mark, two men in Luke and two angels in John… but anyway). The big problem is that not many people so far have been able to defy death. I have only heard of two cases: supposedly the King of Kings (Jesus), but also of course the King himself (Elvis!), whereby sightings of Elvis after his death are much more numerous than sightings of Jesus (just saying…)
Have a look at the following code (source for the number of people who have ever lived: WolframAlpha)
A <- 2/108500000000 # probability of coming back from the dead (the King = Elvis and the King of Kings = Jesus)
BlA <- 0.9999999    # sensitivity of test -> very, very reliable witnesses (many more in the case of Elvis)
B <- A * BlA + (1 - A) * (1 - BlA) # positive test = witnesses say He rose
AnB <- BlA * A
AlB <- BlA * A / B  # Bayes's theorem
C <- 1              # all people

main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))
fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))
fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))
So, in this case C is the unfortunate group of people who have to go for good… it is us. As you can see, although the witnesses are super reliable, when they claim that somebody rose from the dead it is almost certain that they are wrong: from the code above, P(A|B) works out to only about 0.02%.
Or in the words of the above mentioned essay:
No one is justified in believing in Jesus’s resurrection. The numbers simply don’t justify the conclusion.
But this chimes well with a famous Christian saying “I believe because it is absurd” (or in Latin “Credo quia absurdum”) – you can find out more about that in another highly interesting essay: ‘I believe because it is absurd’: Christianity’s first meme
Unfortunately this devastating conclusion is also true in the case of Elvis…
Every so often a problem arises where it’s appropriate to use gradient descent, and it’s fun (and/or easier) to apply it manually. Recently I’ve applied it to problems ranging from optimising a basic recommender system to ‘unsuppressing’ suppressed tabular data. I thought I’d do a series of posts about how I’ve used gradient descent, but figured it was worthwhile starting with the basics as a primer / refresher.
To understand how gradient descent is applied, we’ll use the classic example: linear regression.
A simple linear regression model is of the form

y = b0 + b1*x + e

where e ~ N(0, σ²) is the error term and b0, b1 are the intercept and slope parameters.
The objective is to find the parameters b0 and b1 that minimise the mean squared error, MSE = (1/n) Σ (y_i − (b0 + b1*x_i))².
This is a good problem since we know the analytical solution and can check our results.
In practice you would never use gradient descent to solve a regression problem, but it is useful for learning the concepts.
Set up
library(ggplot2)

set.seed(241)
nobs <- 250
b0 <- 4
b1 <- 2

# simulate data
x <- rnorm(nobs)
y <- b0 + b1*x + rnorm(nobs, 0, 0.5)
df <- data.frame(x, y)

# plot data
g1 <- ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 2) +
  theme_minimal()
The analytical solution is given by β = (XᵀX)⁻¹Xᵀy, where X is the model matrix.
# set model matrix
X <- model.matrix(y ~ x, data = df)
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta
##                 [,1]
## (Intercept) 4.009283
## x           2.016444
And just to convince ourselves this is correct
# linear model formulation
lm1 <- lm(y ~ x, data = df)
coef(lm1)
## (Intercept)           x
##    4.009283    2.016444
g1 + geom_abline(slope = coef(lm1)[2], intercept = coef(lm1)[1], col = "darkmagenta", size = 1)
The objective is to achieve the same result using gradient descent. It works by updating the parameters at each iteration in the direction of the negative gradient to minimise the mean squared error, i.e.

β_(k+1) = β_k − γ ∇MSE(β_k)
where γ is the learning rate and ∇MSE(β) is the gradient of the MSE with respect to the regression parameters. Firstly, we find the partial derivatives of the MSE:

∂MSE/∂b0 = −(2/n) Σ (y_i − b0 − b1*x_i)
∂MSE/∂b1 = −(2/n) Σ x_i (y_i − b0 − b1*x_i)

or, in matrix form, ∇MSE(β) = −(2/n) Xᵀ(y − Xβ).
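To convince ourselves the derivatives are right, we can compare the matrix-form gradient against a finite-difference approximation (a standalone sketch of my own, not part of the original post):

```r
set.seed(1)
x <- rnorm(50)
y <- 4 + 2 * x + rnorm(50, 0, 0.5)
X <- cbind(1, x)

# analytic gradient of the MSE, matching the derivation above
grad <- function(par) as.vector(-2 / nrow(X) * (t(X) %*% (y - X %*% par)))
mse  <- function(par) mean((y - X %*% par)^2)

# central finite differences at an arbitrary point
par <- c(1, 1)
h <- 1e-6
num <- c((mse(par + c(h, 0)) - mse(par - c(h, 0))) / (2 * h),
         (mse(par + c(0, h)) - mse(par - c(0, h))) / (2 * h))

max(abs(grad(par) - num))  # tiny; the analytic gradient checks out
```

Because the MSE is quadratic in the parameters, the central difference is essentially exact here, so any real discrepancy would signal a mistake in the derivative.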
The learning rate ensures we don’t jump too far with each iteration, moving instead by some proportion of the gradient; otherwise we could overshoot the minimum, take much longer to converge, or not find the optimal solution at all.
Applying this to the problem above, we’ll initialise our parameter values to something sensible, e.g. β = (1, 1), and I’ll choose a learning rate of γ = 0.01. This is a slow burn (a learning rate of 0.1–0.2 is more appropriate for this problem), but we’ll get to see the movement of the gradient better. It’s worth trying different values of γ to see how they change convergence. The algorithm is set up as
# gradient descent function
gradientDescent <- function(formula, data, par.init, loss.fun, lr, iters){
  formula <- as.formula(formula)
  X <- model.matrix(formula, data = data)
  y <- data[,all.vars(formula)[1]]
  par <- loss <- matrix(NA, nrow = iters+1, ncol = 2)
  par[1,] <- par.init
  for(k in 1:iters){
    loss[k,] <- loss.fun(X = X, y = y, par = par[k,])
    par[k+1,] <- par[k,] - lr*loss[k,]
  }
  return(list(par = par))
}

# loss function (returns the gradient of the MSE)
loss.fun <- function(X, y, par) return(-2/nrow(X)*(t(X) %*% (y - X %*% par)))

# gradient descent. not much to it really
beta <- gradientDescent(y ~ x, data = df, par.init = c(1, 1), loss.fun = loss.fun, lr = 0.01, iters = 1000)$par

# plotting results
z <- seq(1, 1001, 10)
g1 + geom_abline(slope = beta[z,2], intercept = beta[z,1], col = "darkmagenta", alpha = 0.2, size = 1)
tail(beta, 1)
##             [,1]     [,2]
## [1001,] 4.009283 2.016444
As expected, we obtain the same result. The lines show the gradient path and how the parameters converge to the optimal values. A less reasonable set of starting values still converges quickly to the optimal solution, showing how well gradient descent works on linear regression.
beta <- gradientDescent(y ~ x, data = df, par.init = c(6, -1), loss.fun = loss.fun, lr = 0.01, iters = 1000)$par

# plotting results
z <- seq(1, 1001, 10)
beta.df <- data.frame(b0 = beta[z,1], b1 = beta[z,2])
g1 + geom_abline(data = beta.df, mapping = aes(slope = b1, intercept = b0), col = "darkmagenta", alpha = 0.2, size = 1)
tail(beta, 1)
##             [,1]     [,2]
## [1001,] 4.009283 2.016444
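The effect of the learning rate can be seen with a self-contained re-implementation of the update loop (my own sketch, not the post’s gradientDescent() function; same simulated data and seed as above):

```r
set.seed(241)
x <- rnorm(250)
y <- 4 + 2 * x + rnorm(250, 0, 0.5)
X <- cbind(1, x)

# bare-bones gradient descent for this regression
descend <- function(lr, iters) {
  b <- c(1, 1)
  for (k in 1:iters) {
    b <- b - lr * as.vector(-2 / nrow(X) * (t(X) %*% (y - X %*% b)))
  }
  b
}

descend(0.01, 1000)  # the slow burn used above
descend(0.2, 100)    # a larger step converges in far fewer iterations
```

Both runs land on essentially the same estimates, but the larger learning rate gets there an order of magnitude faster, which is why 0.1–0.2 was called more appropriate for this problem.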
library(dplyr)      # for %>%
library(gganimate)  # for transition_reveal(), animate(), etc.

ggif_minimal <- df %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 2) +
  theme_minimal() +
  geom_abline(data = beta.df, mapping = aes(slope = b1, intercept = b0), col = "darkmagenta", size = 1) +
  geom_text(
    data = data.frame(z, b0 = beta[z,1], b1 = beta[z,2]),
    mapping = aes(
      x = -2.8, y = 9,
      label = paste("b0 = ", round(b0, 2), "\nb1 = ", round(b1, 2))),
    hjust = 0, size = 6
  ) +
  transition_reveal(z) +
  ease_aes("linear") +
  enter_appear() +
  exit_fade()

animate(ggif_minimal, width = 1920, height = 1080, fps = 80)
Those are the basics of applying gradient descent. In practice there is no need to use gradient descent to solve a regression problem, but once you know how to apply it you’ll find real-world applications elsewhere that are more complicated (and interesting). If you can define the objective function and it is differentiable, you can apply gradient descent. In later posts I’ll demonstrate how I’ve applied it to real-world problems. Stay tuned!
The post Applying gradient descent – primer / refresher appeared first on Daniel Oehm | Gradient Descending.
Here are a few of the more commonly used notations found in R code and documentation that confuse coders of any skill level who are new to R.
Be aware that any variable name that begins with a . is usually hidden from view, so won’t be seen in the Environment pane in RStudio or listed when you invoke
ls()
unless you specify
ls(all.names = TRUE)
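For example, in a fresh R session:

```r
.hidden <- 1   # dot-prefixed, so hidden by default
visible <- 2

ls()                  # only "visible"
ls(all.names = TRUE)  # ".hidden" and "visible"
```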
Notation | Meaning
---|---
. | This is a variable name. It’s often found in association with the magrittr pipe operator %>% and is the name used to denote the otherwise anonymous object being passed in from the left-hand side (LHS). You typically see it used when you need to pass the piped object to base-R or other non-tidyverse functions. Normally, tidyverse functions are written to accept a data object as their first parameter and the pipe operator passes the result of the LHS implicitly as the first argument, but base-R functions might expect the data object in any parameter position.
... | Known as “dot-dot-dot”, “ellipsis”, or “three dots”; difficult to search for unless you know those names for it. This is another variable name. It can only be used as a formal parameter to a function and is often simply passed along to other functions called from within that function. It’s a way of coding the intention that “at coding time, we don’t know what additional arguments, or how many, may be passed in, so we’re just going to accept any number of additional arguments.” You can choose to parse them yourself and make use of them, or simply pass them on to other functions you invoke within your function.
.x | Often used as a generic variable name for any type of object: see purrr::map.
.f | Often used as a generic variable name for a function object: see purrr::map.
.data | Often used as a generic variable name for a data.frame object, especially when used as the first parameter of a function written to play nicely in the tidyverse: see dplyr::select.
df | This could mean “degrees of freedom”, but it’s also often used as a mnemonic generic variable name for a data.frame object when you don’t yet know a better, more descriptive name to use. Consider it the “this could be any data.frame” name for a data.frame. It’s often found in early iterations of exploratory code and in example code. You should use more intention-revealing variable names when you can.
.Last.value | This is the variable name R automatically associates with the most recently evaluated object. It is usually hidden, but RStudio has an option to make it visible in your Environment pane: Tools > Global Options… > General > Advanced > Show .Last.value in environment listing
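The first two notations are easy to see in a couple of lines (summarise_vec is a made-up helper for illustration):

```r
# `...` forwards any extra arguments to the functions called inside
summarise_vec <- function(x, ...) mean(x, ...)
summarise_vec(c(1, 2, NA))                # NA, since na.rm was not passed along
summarise_vec(c(1, 2, NA), na.rm = TRUE)  # 1.5, na.rm forwarded to mean()

# `.` names the piped object when using the magrittr pipe:
# library(magrittr)
# c(1, 5, 10) %>% paste0("x = ", .)
```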
JASP is a free and open source statistics package that targets beginners looking to point-and-click their way through analyses. This article is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) for R, which best meets their needs. Most of these reviews also include cursory descriptions of the programming support that each GUI offers.
JASP stands for Jeffreys’ Amazing Statistics Program, a nod to the Bayesian statistician, Sir Harold Jeffreys. It is available for Windows, Mac, Linux, and there is even a cloud version. One of JASP’s key features is its emphasis on Bayesian analysis. Most statistics software emphasizes a more traditional frequentist approach; JASP offers both. However, while JASP uses R to do some of its calculations, it does not currently show you the R code it uses, nor does it allow you to execute your own. The developers hope to add that to a future version. Some of JASP’s calculations are done in C++, so getting that converted to R will be a necessary first step on that path.
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.
The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps), and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!
JASP’s single-step installation is extremely easy and includes its own copy of R. So if you already have a copy of R installed, you’ll have two after installing JASP. That’s a good thing though, as it guarantees compatibility with the version of R that JASP uses, and a standard R installation is by itself harder than JASP’s.
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) to very high (R Commander).
For JASP, plug-ins are called “modules” and they are found by clicking the “+” sign at the top of its main screen. That causes a new menu item to appear. However, unlike most other software, the menu additions are not saved when you exit JASP; you must add them every time you wish to use them.
JASP’s modules are currently included with the software’s main download. However, future versions will store them in their own repository rather than on the Comprehensive R Archive Network (CRAN) where R and most of its packages are found. This makes locating and installing JASP modules especially easy.
Currently there are only four add-on modules for JASP:
Three modules are currently in development: Machine Learning, Circular analyses, and Auditing.
Some user interfaces for R, such as BlueSky, jamovi, and RKWard, start by double-clicking a single icon, which is great for people who prefer not to write code. Others, such as R Commander and Deducer, have you start R, then load a package from your library, and then call a function to activate the GUI. That’s more appropriate for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.
You start JASP directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including BlueSky and jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a dataset from within it.
JASP is the only program in this set of reviews that lacks a data editor. It has only a data viewer (Figure 2, left). If you point to a cell, a message pops up to say, “double-click to edit data” and doing so will transfer the data to another program where you can edit it. You can choose which program will be used to edit your data in the “Preferences>Data Editing” tab, located under the “hamburger” menu in the upper-right corner. The default is Excel.
When JASP opens a data file, it automatically assigns metadata to the variables. As you can see in Figure 2, it has decided my variable “pretest” was a factor and provided a bar chart showing the counts of every value. For the extremely similar “posttest” variable it decided it was numeric, so it binned the values and provided a more appropriate histogram.
While JASP lacks the ability to edit data directly, it does allow you to edit some of the metadata, such as the variable scale and value labels (factor levels). I fixed the problem described above by clicking on the icon to the left of each variable name and changing it from a Venn diagram, representing “nominal”, to a ruler, for “scale”. Note the terminology here, which is statistical rather than based on R’s terms “factor” and “numeric” respectively. Teaching R is not part of JASP’s mission.
JASP cannot handle date/time variables other than to read them as character and convert them to factor. Once JASP decides a character or date/time variable is a factor, it cannot be changed.
Clicking on the name of a factor will open a small window on the top of the data viewer where you can over-write the existing labels. Variable names however, cannot be changed without going back to Excel, or whatever editor you used to enter the data.
The ability to import data from a wide variety of formats is extremely important; you can’t analyze what you can’t access. Most of the GUIs evaluated in this series can open a wide range of file types and even pull data from relational databases. JASP can’t read data from databases, but it can import the following file formats:
The ability to read SAS and Stata files is planned for a future release. Though based on R, JASP cannot read R data files!
The ability to export data to a wide range of file types helps when you need multiple tools to complete a task. Research is commonly a team effort, and in my experience it’s rare to have all team members prefer the same tools. For these reasons, GUIs such as BlueSky, Deducer, and jamovi offer many export formats. Others, such as R Commander and RKWard, can create only delimited text files.
A fairly unique feature of JASP is that it doesn’t save just a dataset; it saves the combination of a dataset plus its associated analyses. To save just the dataset, you go to the “File” tab and choose “Export data.” The only export format is a comma-separated values file (.csv).
It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be computed, transformed, scaled, recoded, or binned; strings and dates need to be manipulated; missing values need to be handled; datasets need to be sorted, stacked, merged, aggregated, transposed, or reshaped (e.g. from “wide” format to “long” and back).
A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time is tedious.
Some GUIs, such as BlueSky and R Commander can handle nearly all of these tasks. Others, such as jamovi and RKWard handle only a few of these functions.
JASP’s data management capabilities are minimal. It has a simple calculator that works by dragging and dropping variable names and math or statistical operators. Alternatively, you can type formulas using R code. Using this approach, you can only modify one variable at a time, making day-to-day analysis quite tedious. It’s also unable to apply functions across rows (jamovi handles this via a set of row-specific functions). Using the calculator, I could never figure out how to later edit a formula or even delete a variable if I made an error. I tried to recreate one, but it told me the name was already in use.
You can filter cases to work on a subset of your data. However, JASP can’t sort, stack, merge, aggregate, transpose, or reshape datasets. The lack of combining datasets may be a result of the fact that JASP can only have one dataset open in a given session.
The goal of pointing and clicking your way through an analysis is to save time by recognizing menu settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.
JASP’s interface uses tabbed windows and toolbars in a way that’s similar to Microsoft Office. As you can see in Figure 3, the “File” tab contains what is essentially a menu, but it’s already in the dropped-down position so there’s no need to click on it. Depending on your selections there, a side menu may pop out, and it stays out without holding the mouse button down.
The built-in set of analytic methods are contained under the “Common” tab. Choosing that yields a shift from menus to toolbar icons shown in Figure 4.
Clicking on any icon on the toolbar causes a standard dialog box to pop out the right side of the data viewer (Figure 2, center). You select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. As soon as you fill in enough options to perform an analysis, its output appears instantly in the output window to the right. Thereafter, every option chosen adds to the output immediately; every option turned off removes output. The dialog box does have an “OK” button, but rather than cause the analysis to run, it merely hides the dialog box, making room for more space for the data viewer and output. Clicking on the output itself causes the associated dialog to reappear, allowing you to make changes.
While nearly all GUIs keep your dialog box settings during your session, JASP keeps those settings in its main file. This allows you to return to a given analysis at a future date and try some model variations. You only need to click on the output of any analysis to have the dialog box appear to the right of it, complete with all settings intact.
Output is saved by using the standard “File> Save” selection.
The JASP Materials web page provides links to a helpful array of information to get you started. The How to Use JASP web page offers a cornucopia of training materials, including blogs, GIFs, and videos. The free book, Statistical Analysis in JASP: A Guide for Students, covers the basics of using the software and includes a basic introduction to statistical analysis.
R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.
JASP’s help files are activated by choosing “Help” from the hamburger menu in the upper right corner of the screen (Figure 5). When checked, a window opens on the right of the output window, and its contents change as you scroll through the output. Given that everything appears in a single window, having a large screen is best.
The help files are very well done, explaining what each choice means, its assumptions, and even journal citations. While there is no reference to the R functions used, nor any link to their help files, the overall set of R packages JASP uses is listed here.
The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as BlueSky, focus on R’s popular ggplot graphics. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it.
There is no “Graphics” menu in JASP; all the plots are created from within the data analysis dialogs. For example, boxplots are found in “Common> Descriptives> Plots.” To get a scatterplot I tried “Common> Regression> Plots” but only residual plots are found there. Next I tried “Common> Descriptives> Plots> Correlation plots” and was able to create the image shown in Figure 6. Apparently, there is no way to get just a single scatterplot.
The plots JASP creates are well done, with a white background and axes that don’t touch at the corners. It’s not clear which R functions are used to create them, as their style does not match the defaults of base R graphics, ggplot2, or lattice.
The most important graphical ability that JASP lacks is the ability to do “small multiples” or “facets”. Faceted plots allow you to compare groups by showing a set of the same type of plot repeated by levels of a categorical variable.
Setting the dots-per-inch is the only graphics adjustment JASP offers. It doesn’t support styles or templates. However, plot editing is planned for a future release.
Here is the selection of plots JASP can create.
The way statistical models (which R stores in “model objects”) are created and used, is an area on which R GUIs differ the most. The simplest and least flexible approach is taken by RKWard. It tries to do everything you might need in a single dialog box. To an R programmer, that sounds extreme, since R does a lot with model objects. However, neither SAS nor SPSS were able to save models for their first 35 years of existence, so each approach has its merits.
Other GUIs, such as BlueSky and R Commander save R model objects, allowing you to use them for scoring tasks, testing the difference between two models, etc. JASP saves a complete set of analyses, including the steps used to create models. It offers a “Sync Data” option on its File menu that allows you to re-use the entire analysis on a new dataset. However, it does not let you save R model objects.
All of the R GUIs offer a decent set of statistical analysis methods. Some also offer machine learning methods. As you can see from the table below, JASP offers the basics of statistical analysis. Included in many of these are Bayesian measures, such as credible intervals. See Plug-in Modules section above for more analysis types.
Analysis | Frequentist | Bayesian |
---|---|---|
1. ANOVA | ✓ | ✓ |
2. ANCOVA | ✓ | ✓ |
3. Binomial Test | ✓ | ✓ |
4. Contingency Tables (incl. Chi-Squared Test) | ✓ | ✓ |
5. Correlation: Pearson, Spearman, Kendall | ✓ | ✓ |
6. Exploratory Factor Analysis (EFA) | ✓ | – |
7. Linear Regression | ✓ | ✓ |
8. Logistic Regression | ✓ | – |
9. Log-Linear Regression | ✓ | ✓ |
10. Multinomial | ✓ | – |
11. Principal Component Analysis (PCA) | ✓ | – |
12. Repeated Measures ANOVA | ✓ | ✓ |
13. Reliability Analyses: α, λ6, and ω | ✓ | – |
14. Structural Equation Modeling (SEM) | ✓ | – |
15. Summary Stats | – | ✓ |
16. T-Tests: Independent, Paired, One-Sample | ✓ | ✓ |
One of the aspects that most differentiates the various GUIs for R is the code they generate. If you decide you want to save code, what type of code is best for you? The base R code as provided by the R Commander which can teach you “classic” R? The tidyverse code generated by BlueSky Statistics? The completely transparent (and complex) traditional code provided by RKWard, which might be the best for budding R power users?
JASP uses R code behind the scenes, but currently, it does not show it to you. There is no way to extract that code to run in R by itself. The JASP developers have that on their to-do list.
Some of the GUIs reviewed in this series of articles include extensive support for programmers. For example, RKWard offers much of the power of Integrated Development Environments (IDEs) such as RStudio or Eclipse StatET. Others, such as jamovi or the R Commander, offer just a text editor with some syntax checking and code completion suggestions.
JASP’s mission is to make statistical analysis easy through the use of menus and dialog boxes. It installs R and uses it internally, but it doesn’t allow you to access that copy (other than in its data calculator.) If you wish to code in R, you need to install a second copy.
One of the biggest challenges that GUI users face is being able to reproduce their work. Reproducibility is useful for re-running everything on the same dataset if you find a data entry error. It’s also useful for applying your work to new datasets so long as they use the same variable names (or the software can handle name changes). Some scientific journals ask researchers to submit their files (usually code and data) along with their written report so that others can check their work.
As important a topic as it is, reproducibility is a problem for GUI users, a problem that has only recently been solved by some software developers. Most GUIs (e.g. the R Commander, Rattle) save only code, but since GUI users don’t write the code, they also can’t read it or change it! Others such as jamovi, RKWard, and the newest version of SPSS, save the dialog box entries and allow GUI users to have reproducibility in the form they prefer.
JASP records the steps of all analyses, providing exact reproducibility. In addition, if you update a data value, all the analyses that used that variable are recalculated instantly. That’s a very useful feature, since people coming from Excel expect this to happen. You can also use “File> Sync Data” to open a new data file and rerun all analyses on that new dataset. However, the dataset must have exactly the same variable names in the same order for this to work. Still, it’s a feature that GUI users will find very useful. If you wish to share your work with a colleague so they too can execute it, they must be JASP users; there is no way to export an R program file for them to use. You need to send them only your JASP file: it contains both the data and the steps you used to analyze it.
A topic related to reproducibility is package management. One of the major advantages to the R language is that it’s very easy to extend its capabilities through add-on packages. However, updates in these packages may break a previously functioning analysis. Years from now you may need to run a variation of an analysis, which would require you to find the version of R you used, plus the packages you used at the time. As a GUI user, you’d also need to find the version of the GUI that was compatible with that version of R.
Some GUIs, such as the R Commander and Deducer, depend on you to find and install R. For them, the problem is left for you to solve. Others, such as BlueSky, distribute their own version of R, all R packages, and all of its add-on modules. This requires a bigger installation file, but it makes dealing with long-term stability as simple as finding the version you used when you last performed a particular analysis. Of course, this depends on all major versions remaining available for the long term, but for open-source software, there are usually multiple archives available to store software even if the original project is defunct.
JASP is firmly in the latter camp. It provides nearly everything you need in a single download. This includes the JASP interface, R itself, and all R packages that it uses. So for the base package, you’re all set.
Ideally, output should be clearly labeled, well organized, and of publication quality. It might also delve into the realm of word processing through R Markdown, knitr or Sweave documents. At the moment, none of the GUIs covered in this series of reviews meets all of these requirements. See the separate reviews to see how each of the other packages is doing on this topic.
The labels for each of JASP’s analyses are provided by a single main title which is editable, and subtitles, which are not. Pointing at a title will cause a black triangle to appear, and clicking that will drop a menu down to edit the title (the single main one only) or to add a comment below (possible with all titles).
The organization of the output is in time-order only. You can remove an analysis, but you cannot move it into an order that may make more sense after you see it.
While tables of contents are commonly used in GUIs to let you jump directly to a section, or to re-order, rename, or delete bits of output, that feature is not available in JASP.
Those limitations aside, JASP’s output quality is very high, with nice fonts and true rich text tables (Figure 7). Tabular output is displayed in the popular style of the American Psychological Association. That means you can right-click on any table and choose “Copy” and the formatting is retained. That really helps speed your work as R output defaults to mono-spaced fonts that require additional steps to get into publication form (e.g. using functions from packages such as xtable or texreg). You can also export an entire set of analyses to HTML, then open the nicely-formatted tables in Word.
LaTeX users can right-click on any output table and choose “Copy special> LaTeX code” to recreate the table in that text formatting language.
Repeating an analysis on different groups of observations is a core task in data science. Software needs to provide the ability to select one group of observations to analyze, then another group to compare it to. All the R GUIs reviewed in this series can do this task. JASP allows you to select the observations to analyze in two ways. First, clicking the funnel icon located at the upper left corner of the data viewer opens a window that allows you to enter your selection logic, such as “gender = Female”. From an R code perspective, it does not use R’s “==” symbol for logical equivalence, nor does it allow you to put value labels in quotes. It generates a subset that you can analyze in the same way as the entire dataset. Second, you can click on the name of a factor, then check or un-check the values you wish to keep. Either way, the data viewer grays out the excluded data lines to give you a visual cue.
Software also needs the ability to automate such selections so that you might generate dozens of analyses, one group at a time. While this has been available in commercial GUIs for decades (e.g. SPSS “split-file”, SAS “by” statement), BlueSky is the only R GUI reviewed here that includes this feature. The closest JASP gets on this topic is to offer a “split” variable selection box in its Descriptives procedure.
Early in the development of statistical software, developers tried to guess what output would be important to save to a new dataset (e.g. predicted values, factor scores), and the ability to save such output was built into the analysis procedures themselves. However, researchers were far more creative than the developers anticipated. To better meet their needs, output management systems were created and tacked on to existing tools (e.g. SAS’ Output Delivery System, SPSS’ Output Management System). One of R’s greatest strengths is that every bit of output can be readily used as input. However, for the simplification that GUIs provide, that’s a challenge.
Output data can be observation-level, such as predicted values for each observation or case. When group-by analyses are run, the output data can also be observation-level, but now the (e.g.) predicted values would be created by individual models for each group, rather than one model based on the entire original data set (perhaps with group included as a set of indicator variables).
You can also use group-by analyses to create model-level data sets, such as one R-squared value for each group’s model. You can also create parameter-level data sets, such as the p-value for each regression parameter for each group’s model. (Saving and using single models is covered under “Modeling” above.)
For example, in our organization, we have 250 departments and want to see if any of them have a gender bias on salary. We write all 250 regression models to a dataset, and then search to find those whose gender parameter is significant (hoping to find none, of course!)
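In R itself, that group-by modelling workflow can be sketched in a few lines. The data frame and column names below are hypothetical stand-ins (simulated with no real gender effect), using base R only:

```r
set.seed(42)
# Simulated stand-in for the real HR data: 250 departments, 20 employees each
salaries <- data.frame(
  dept   = rep(sprintf("D%03d", 1:250), each = 20),
  gender = sample(c("Female", "Male"), 5000, replace = TRUE),
  salary = rnorm(5000, mean = 60000, sd = 8000)
)

# Fit one regression per department and keep the gender coefficient's p-value
results <- do.call(rbind, lapply(split(salaries, salaries$dept), function(d) {
  fit <- lm(salary ~ gender, data = d)
  data.frame(dept     = d$dept[1],
             p_gender = summary(fit)$coefficients["genderMale", "Pr(>|t|)"])
}))

# Departments whose gender parameter comes out "significant"
# (about 5% appear by chance here, since the simulated data has no real effect)
subset(results, p_gender < 0.05)
```

This is the parameter-level output management described above: 250 models reduced to one row of parameter statistics each, ready to be searched.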
BlueSky is the only R GUI reviewed here that does all three levels of output management. JASP not only lacks these three levels of output management, it even lacks the fundamental observation-level saving that SAS and SPSS offered in their first versions back in the early 1970s. This entails saving predicted values or residuals from regression, or scores from principal components analysis or factor analysis. The developers plan to add that capability to a future release.
While most of the R GUI projects encourage module development by volunteers, the JASP project hasn’t done so. However, this is planned for a future release.
JASP is easy to learn and use. The tables and graphs it produces follow the guidelines of the American Psychological Association, making them acceptable by many scientific journals without any additional formatting. Its developers have chosen their options carefully so that each analysis includes what a researcher would want to see. Its coverage of Bayesian methods is the most extensive I’ve seen in this series of software reviews.
As nice as JASP is, it lacks important features, including: a data editor, an R code editor, the ability to see the R code it writes, the ability to handle date/time variables, the ability to read/write R, SAS, and Stata data files, the ability to perform many more fundamental data management tasks, the ability to save new variables such as predicted values or factor scores, the ability to save models so they can be tested on hold-out samples or new data sets, and the ability to reuse an analysis on new data sets using the GUI. While those are quite a few features to add, JASP is funded by several large grants from the Dutch Science Foundation and the ERC, allowing them to guarantee continuous and ongoing development.
Thanks to Eric-Jan Wagenmakers and Bruno Boutin for their help in understanding JASP’s finer points. Thanks also to Rachel Ladd, Ruben Ortiz, Christina Peterson, and Josh Price for their editorial suggestions.
Installing R packages on Linux systems has always been a risky affair. In RStudio
Package Manager 1.0.8, we’re giving administrators and R users the information
they need to make installing packages easier. We’ve also made it
easier to use Package Manager offline and improved search performance.
Download the 45-day evaluation
today to see how RStudio Package Manager can help you, your team, and your
entire organization access and organize R packages. Learn more with our online
demo server or latest webinar.
R packages can depend on one another, but they can also depend on software external to the R ecosystem. On Ubuntu 18.04, for example, in order to install the curl R package, you must have previously run apt-get install libcurl. R packages often note these dependencies inside their DESCRIPTION files, but this information is free-form text that varies by package. In the past, system administrators would need to manually parse these files. In order to install ggplot2, you’d need to look at the system requirements for ggplot2 and all its dependencies. This labor-intensive process rarely goes smoothly. Frequently, system dependencies are not uncovered until a package fails to install, often with a cryptic error message that leaves R users and administrators frantically searching StackOverflow.
To address this problem, we’ve begun cataloging and testing system prerequisites. The result is a list of install commands available for administrators and R users. We’ve tested this list by installing all 14,024 CRAN packages across six Linux distributions.
For any package, Package Manager shows you if there are system prerequisites and the commands you can use to install them. Today this support is limited to Linux, but we plan to support Windows and Mac requirements in the future. Package Manager automatically rolls up prerequisites for dependent R packages. As an example, the httr R package depends on the curl package, which depends on libcurl. Package Manager will show the libcurl prerequisite for the httr package – and for all of httr’s reverse dependencies!
In most cases, RStudio Package Manager provides the checks and governance
controls needed by IT to bridge the gap between offline production systems and
RStudio’s public CRAN service. However, in certain cases it may be necessary to
run RStudio Package Manager offline. Version 1.0.8 introduces a new utility that makes cloning packages into an air-gapped environment safe and fast.
In addition to these major changes, the new release includes the following updates:
Please review the full release notes.
Upgrade Planning
Upgrading to 1.0.8 from 1.0.6 will take less than five minutes. If you are
upgrading from an earlier version, be sure to consult the release notes for the
intermediate releases, as well.
Don’t see that perfect feature? Wondering why you should be worried about
package management? Want to talk about other package-management strategies?
Email us, our product team is happy to help!
Some things are just irresistible to a community manager – PhD student Hugo Gruson’s recent tweets definitely fall into that category.
I was surprised and intrigued to see an example of our software peer review guidelines being used in a manuscript review, independent of our formal collaboration with the journal Methods in Ecology and Evolution (MEE). This is exactly the kind of thing rOpenSci is working to enable by developing a good set of practices that broadly apply to research software.
But who was this reviewer and what was their motivation? What role did the editors handling the manuscript play? I contacted the authors and then the journal, and in less than a week we had everyone on board to talk about their perspectives on the process.
To me, MEE’s role is to help increase the quality of the methods used in ecology and evolution, and this includes research software. It would be great to reach a point where all the research software used in ecology is at the same high standard as the packages that have been through rOpenSci software peer review.
Not all R packages that we receive at MEE fit in with the rOpenSci package scope, but I’d love to see them go through a similar process. This is where the rOpenSci review checklist comes in. In my view, it’s the gold standard for reviewing R packages and I was thrilled to see that Hao (manuscript reviewer) had used it with this paper.
The idea of doing code review as part of reviewing a manuscript is new to a lot of people. Often, invited reviewers decline because they don’t think they have the right experience. If you have experience with creating packages though, reviewing code isn’t something to be worried about. rOpenSci’s guidelines are a great way for people new to reviewing code to become comfortable with the process.
When I was asked to review the code for the pavo 2.0 manuscript^{1}, I had an initial moment of panic – I had no experience doing formal code review. Luckily, I knew that rOpenSci had a set of reviewing guidelines, and that a few MEE Applications papers had used them. The same guidelines are also used by the Journal of Open Source Software (JOSS). Although this submission wasn’t flagged for rOpenSci review, I didn’t see a conflict with using their guidelines for my task.
The checklist helped me to organise my review. I started with the basic package review template, and then focused on a detailed look at the primary vignette (which is where I expect most users start). The rOpenSci guidelines encourage the use of some automated tools, like goodpractice, to facilitate reviewing. The hardest part was providing suggestions to address what the goodpractice::gp() function flagged as complex or redundant code. The remainder of the review went pretty smoothly. I’m a fan of task checklists, so I’m glad that the authors found my comments useful. Hopefully the changes will help with the future maintenance of the package.
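For readers curious what that automated step looks like, a minimal sketch follows (the path is a placeholder for a real package source directory, not a package from this post):

```r
library(goodpractice)

# Run goodpractice's default battery of checks on a package source tree
checks <- gp("path/to/package")

# Printing the result summarises the advice: long lines, complex functions,
# missing tests, and similar issues a reviewer might raise
checks

# Just the names of the checks that failed
failed_checks(checks)
```

The printed advice maps naturally onto review comments, which is part of why the tool fits the rOpenSci checklist so well.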
We were immediately struck by the rigor and thoughtfulness of the reviews and pleasantly surprised to see reference to rOpenSci in Hao’s [anonymous] review. It was clear that Hao and two other reviewers had invested significant time in examining not only the manuscript and documentation, but the codebase itself. An uncommon, but welcome experience.
Our package was singularly improved as a result, both for end-users and ourselves. Many of the suggestions that we implemented – such as comprehensive test coverage, explicit styling, greater code safety, executable examples, and contributor guidelines – will persist and guide the development of this (and related) packages into the future.
We know that software is challenging to review since the overlap of field-specific expertise between developers and biologists is relatively limited. This is where the value of rOpenSci’s work in developing tractable standards for reviewers and developers really comes into focus, as well as the willingness of journals such as MEE to encourage their use. We’re just grateful for the experience and would be thrilled to see the practice expand in scope and reach where possible.
Since the early days of the journal, code and software papers (or Applications articles as we call them) have been really important to MEE. In our Policy on Publishing Code we highlight our commitment to ensuring the quality of code through the peer review process.
We’ve got a team of dedicated Applications Editors who handle code manuscripts and they do a great job of balancing their comments on the manuscript and the code that goes along with it. Resources like the rOpenSci package review guidelines can really help to take the pressure off these Editors, and they give reviewers confidence to comment on the code. It’s great to have the chance to promote them here and we hope that this post will encourage more people to check them out.
We also partner directly with rOpenSci for software peer review. If you have an R package that meets the aims and scope of both MEE and rOpenSci, you can opt for a joint review in which the R package is reviewed by rOpenSci, followed by fast-tracked review of the manuscript by MEE. Manuscripts published through this process are recognized via a mark on both HTML and PDF versions of their paper. We’ve had two articles published to date as a result of this partnership^{2} ^{3}.
Having a manuscript reviewed can often feel like a quite mysterious process. Your work disappears into a black box and comes out with a load of anonymous suggestions for how to improve it. At rOpenSci and Methods in Ecology and Evolution, we want to help open up that black box. Thanks to Hugo’s tweet of gratitude, and the goodwill of the editors, reviewers and authors of the pavo 2.0 paper, this post provides a glimpse of what is possible. Will you give it a try next time?
Do you spend a lot of time on data exploration? If yes, then you will like today’s post about AutoEDA written by Mateusz Staniak.
If you ever dreamt of automating the first, laborious part of data analysis, when you get to know the variables, print descriptive statistics, and draw a lot of histograms and scatter plots, you weren’t the only one. It turns out that a lot of R developers and users have thought of the same thing. There are over a dozen R packages for automated Exploratory Data Analysis and the interest in them is growing quickly. Let’s just look at this plot of the number of downloads from the official CRAN repository.
Replicate this plot with:

library(archivist)  # aread()
library(dplyr)      # filter(), arrange(), group_by(), mutate()
library(ggplot2)
library(scales)     # comma

stats <- archivist::aread("mstaniak/autoEDA-resources/autoEDA-paper/52ec")
stat <- stats %>%
  filter(date > "2014-01-01") %>%
  arrange(date) %>%
  group_by(package) %>%
  mutate(cums = cumsum(count),
         packages = paste0(package, " (", max(cums), ")"))
stat$packages <- reorder(stat$packages, stat$cums, function(x) -max(x))

ggplot(stat, aes(date, cums, color = packages)) +
  geom_step() +
  scale_x_date(name = "",
               breaks = as.Date(c("2014-01-01", "2015-01-01", "2016-01-01",
                                  "2017-01-01", "2018-01-01", "2019-01-01")),
               labels = c(2014:2019)) +
  scale_y_continuous(name = "", labels = comma) +
  DALEX::theme_drwhy() +
  theme(legend.position = "right", legend.direction = "vertical") +
  scale_color_discrete(name = "") +
  ggtitle("Total number of downloads", "Based on CRAN statistics")
New tools arrive each year with a variety of functionalities: creating summary tables, initial visualization of a dataset, finding invalid values, univariate exploration (descriptive and visual) and searching for bivariate relationships.
We compiled a list of R packages dedicated to automated EDA, where we describe twelve packages: their capabilities, their strong aspects and possible extensions. You can read our review paper on arXiv: https://arxiv.org/abs/1904.02101.
Spoiler alert: currently, automated simply means fast. The packages that we describe can perform typical data analysis tasks, like drawing a bar plot for each categorical feature, creating a table of summary statistics, or plotting correlations, with a single command. While this speeds up the work significantly, it can be problematic for high-dimensional data and it does not take advantage of AI tools for actual automation. There is a lot of potential for intelligent data exploration (or model exploration) tools.
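As a concrete sketch of what “automated means fast” looks like, here is the kind of one-call exploration offered by DataExplorer, one of the reviewed packages, applied to a built-in dataset:

```r
library(DataExplorer)

# Univariate and bivariate exploration, one call per view
plot_histogram(iris)    # histograms for every continuous variable
plot_bar(iris)          # bar plots for every discrete variable
plot_correlation(iris)  # correlation heatmap across variables

# Or bundle everything into a single HTML report
# create_report(iris)
```

Each call sweeps the whole data frame, which is exactly the speed-up, and the limitation, discussed above: fast, but not intelligent about which views matter.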
A more extensive list of software (including Python libraries and web applications) and papers is available on Mateusz’s GitHub. Researchers can follow our autoEDA project on ResearchGate.
In the previous post of this series unveiling the relationship between UFO sightings and population, we crossed the threshold of normality underpinning linear models to construct a generalised linear model based on the more theoretically satisfying Poisson distribution.
On inspection, however, this model revealed itself to be less well suited to the data than we had, in our tragic ignorance, hoped. While it appeared, on visual inspection, to capture some features of the data, the posterior predictive density plot demonstrated that it still fell short of addressing the subtleties of the original.
In this post, we will seek to overcome this sad lack in two ways: firstly, we will subject our models to pitiless mathematical scrutiny to assess their ability to describe the data. Secondly, with our eyes irrevocably opened to these techniques, we will construct an ever more complex armillary with which to approach the unknowable truth.
Our previous post showed how the fit of the Poisson model to the data differed from that of the simple Gaussian linear model. When presented with a grim array of potential models, however, it is crucial to have reliable and quantitative mechanisms to select amongst them.
The eldritch procedure most suited to this purpose, model selection, in our framework, draws on information criteria that express the relative effectiveness of models at creating sad mockeries of the original data. The original and most well-known such criterion is the Akaike Information Criterion, which has, in turn, spawned a multitude of successors applicable in different situations and with different properties. Here, we will make use of Leave-One-Out Cross Validation (LOO-CV)^{1} as the most applicable to the style of model and set of techniques applied here.
It is important to reiterate that these approaches do not speak to an absolute underlying truth; information criteria allow us to choose between models, assessing which has most closely assimilated the madness and chaos of the data. For LOO-CV, this results in an expected log predictive density (elpd) for each model. The model with the highest elpd is the least-warped mirror of reality amongst those we subject to scrutiny.
There are many fragile subtleties to model selection, of which we will mention only two here. Firstly, in general, the greater the number of predictors or variables incorporated into a model, the more closely it will be able to mimic the original data. This is problematic, in that a model can become overfit to the original data and thus be unable to represent previously unseen data accurately — it learns to mimic the form of the observed data at the expense of uncovering its underlying reality. The LOO-CV technique avoids this trap by, in effect, withholding data from the model to assess its ability to make accurate inferences on previously unseen data.
The second consideration in model selection is that the information criteria scores of models, such as the elpd in LOO-CV, are subject to standard error in their assessment; the score itself is not a perfect metric of model performance, but a cunning approximation. As such, we will only consider one model to have outperformed its competitors if the difference in their relative elpd is several times greater than this standard error.
With this understanding in hand, we can now ruthlessly quantify the effectiveness of the Gaussian linear model against the Poisson generalised linear model.
The original model presented before our subsequent descent into horror was a simple linear Gaussian, produced through use of ggplot2’s geom_smooth function. To compare this meaningfully against the Poisson model of the previous post, we must now recreate this model using the, now hideously familiar, tools of Bayesian modelling with Stan.
With both models straining in their different directions towards the light, we apply LOO-CV to assess their effectiveness at predicting the data.
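The loo objects being compared can be built from the fitted Stan models roughly as follows. This is a sketch: fit_normal and fit_poisson are assumed names for the stanfit objects, and each model is assumed to compute a log_lik vector in its generated quantities block:

```r
library(rstan)
library(loo)

# Pointwise log-likelihoods from each model's generated quantities block
log_lik_normal  <- extract_log_lik(fit_normal)
log_lik_poisson <- extract_log_lik(fit_poisson)

# LOO-CV estimates (elpd_loo, with standard errors) for each model
loo_normal  <- loo(log_lik_normal)
loo_poisson <- loo(log_lik_poisson)
```

With those objects in hand, the comparison below follows directly.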
> compare(loo_normal, loo_poisson)
elpd_diff        se
  -8576.1     712.5
The information criterion shows that the complexity of the Poisson model does not, in fact, produce a more effective model than the false serenity of the Gaussian^{2}. The negative elpd_diff returned by the compare function supports the first of the two models, and the magnitude being over twelve times greater than the standard error leaves little doubt that the difference is significant. We must, it seems, look further.
With these techniques for selecting between models in hand, then, we can move on to constructing ever more complex attempts to dispel the darkness.
The Poisson distribution, whilst appropriate for many forms of count data, suffers from fundamental limits to its understanding. The single parameter of the Poisson, \(\lambda\), enforces that the mean and variance of the data are equal. When such comforting falsehoods wither in the pale light of reality, we must move beyond the gentle chains in which the Poisson binds us.
The next horrific evolution, then, is the negative binomial distribution, which similarly speaks to count data, but presents a dispersion parameter (\(\phi\)) that allows the variance to exceed the mean^{3}.
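The extra freedom bought by the dispersion parameter can be seen in a quick simulation (base R, where size is R's name for the dispersion parameter):

```r
set.seed(1)

# Poisson: the variance is pinned to the mean (both equal lambda)
pois_draws <- rpois(1e5, lambda = 10)

# Negative binomial: variance = mu + mu^2/size, so it can exceed the mean
nb_draws <- rnbinom(1e5, mu = 10, size = 2)

c(mean = mean(pois_draws), var = var(pois_draws))  # both near 10
c(mean = mean(nb_draws),   var = var(nb_draws))    # mean near 10, variance near 60
```

The negative binomial thus keeps the mean where the Poisson would put it while letting the spread grow to match overdispersed counts.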
With our arcane theoretical library suitably expanded, we can now transplant the still-beating Poisson heart of our earlier generalised linear model with the more complex machinery of the negative binomial:
$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &\sim& \alpha + \beta x\\
\alpha &\sim& \mathcal{N}(0, 1)\\
\beta &\sim& \mathcal{N}(0, 1)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$
As with the Poisson, our negative binomial generalised linear model employs a log link function to transform the linear predictor. The Stan code for this model is given below.
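A sketch of such a program, using Stan's log-link negative binomial parameterisation neg_binomial_2_log; the data block names (N, x, y) are assumptions, and the author's actual listing may differ:

```stan
data {
  int<lower=1> N;        // number of observations
  vector[N] x;           // predictor (population)
  int<lower=0> y[N];     // sighting counts
}
parameters {
  real alpha;
  real beta;
  real<lower=0> phi;     // dispersion
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  phi ~ cauchy(0, 2);    // half-Cauchy via the lower bound on phi
  y ~ neg_binomial_2_log(alpha + beta * x, phi);
}
```

The neg_binomial_2_log distribution applies the log link internally, so the linear predictor can be passed in directly.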
With this model fit, we can compare its whispered falsehoods against both the original linear Gaussian model and the Poisson GLM:
> compare(loo_poisson, loo_negbinom)
elpd_diff        se
   8880.8     721.9
With the first comparison, it is clear that the sinuous flexibility offered by the dispersion parameter, \(\phi\), of the negative binomial allows that model to mould itself much more effectively to the data than the Poisson. The elpd_diff score is positive, indicating that the second of the two compared models is favoured; the difference is over twelve times the standard error, giving us confidence that the negative binomial model is meaningfully more effective than the Poisson.
Whilst superior to the Poisson, does this adaptive capacity allow the negative binomial model to render the naïve Gaussian linear model obsolete?
> compare(loo_normal, loo_negbinom)
elpd_diff        se
    304.7      30.9
The negative binomial model subsumes the Gaussian with little effort. The elpd_diff is almost ten times the standard error in favour of the negative binomial GLM, giving us confidence in choosing it. From here on, we will rely on the negative binomial as the core of our schemes.
The improvements we have seen with the negative binomial model allow us to discard the Gaussian and Poisson models with confidence. It is not, however, sufficient to fill the gaping void induced by our belief that the sightings of abnormal aerial phenomena in differing US states vary differently with their human population.
To address this question we must ascertain whether allowing our models to unpick the individual influence of states will improve their predictive ability. This, in turn, will lead us into the gnostic insanity of hierarchical models, in which we group predictors in our models to account for their shadowy underlying structures.
The first step on this path is to allow part of the linear function underpinning our model, specifically the intercept value, \(\alpha\), to vary between different US states. In a simple linear model, this causes the line of best fit for each state to meet the y-axis at a different point, whilst maintaining a constant slope for all states. In such a model, the result is a set of parallel lines of fit, rather than a single global truth.
This varying intercept can describe a range of possible phenomena for which the rate of change remains constant, but the baseline value varies. In such hierarchical models we employ a concept known as partial pooling to extract as much forbidden knowledge from the reluctant data as possible.
A set of entirely separate models, such as the per-state set of linear regressions presented in the first post of this series, employs a no pooling approach: the data of each state is treated separately, with an entirely different model fit to each. This certainly considers the uniqueness of each state, but cannot benefit from insights drawn from the broader range of data we have available, which we may reasonably assume to have some relevance.
By contrast, the global Gaussian, Poisson, and negative binomial models presented so far represent complete pooling, in which the entire set of data is considered a formless, protean amalgam without meaningful structure. This mindless, groping approach causes the unique features of each state to be lost amongst the anarchy and chaos.
A partial pooling approach instead builds a global mean intercept value across the dataset, but allows the intercept value for each individual state to deviate according to a governing probability distribution. This both accounts for the individuality of each group of observations, in our case the state, but also draws on the accumulated wisdom of the whole.
We now construct a partially-pooled varying intercept model, in which the parameters and observations for each US state in our dataset are individually indexed:
$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &\sim& \alpha_i + \beta x\\
\alpha_i &\sim& \mathcal{N}(\mu_\alpha, \sigma_\alpha)\\
\beta &\sim& \mathcal{N}(0, 1)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$
Note that the intercept parameter, \(\alpha\), in the second line is now indexed by the state, represented here by the subscript \(i\). The slope parameter, \(\beta\), remains constant across all states.
This model can be rendered in Stan code as follows:
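A sketch of how this might be expressed; the data block names (N, N_state, state, x, y) and the hyperpriors on the group-level parameters are assumptions, and the author's actual listing may differ:

```stan
data {
  int<lower=1> N;
  int<lower=1> N_state;
  int<lower=1, upper=N_state> state[N];   // state index per observation
  vector[N] x;
  int<lower=0> y[N];
}
parameters {
  vector[N_state] alpha;       // per-state intercepts
  real mu_alpha;               // global mean intercept
  real<lower=0> sigma_alpha;   // between-state spread of intercepts
  real beta;                   // common slope
  real<lower=0> phi;
}
model {
  mu_alpha ~ normal(0, 1);
  sigma_alpha ~ cauchy(0, 2);
  alpha ~ normal(mu_alpha, sigma_alpha);  // partial pooling of intercepts
  beta ~ normal(0, 1);
  phi ~ cauchy(0, 2);
  y ~ neg_binomial_2_log(alpha[state] + beta * x, phi);
}
```

The partial pooling lives in the line drawing alpha from a shared normal distribution: each state's intercept is shrunk towards the global mean by an amount the data itself determines through sigma_alpha.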
Once the model has twisted itself into the most appropriate form for our data, we can now compare it against our previous completely-pooled model:
> compare(loo_negbinom, loo_negbinom_var_intercept)
elpd_diff        se
    363.2      28.8
Our transcendent journey from the statistical primordial ooze continues: the varying intercept model is favoured over the completely-pooled model by a significant margin.
Now that our minds have apprehended a startling glimpse of the implications of the varying intercept model, it is natural to consider taking a further terrible step and allowing both the slope and the intercept to vary^{4}.
With both the intercept and slope of the underlying linear predictor varying, an additional complexity raises its head: can we safely assume that these parameters, the intercept and slope, vary independently of each other, or may there be arcane correlations between them? Do states with a higher intercept also experience a higher slope in general, or is the opposite the case? Without prior knowledge to the contrary, we must allow our model to determine these possible correlations, or we are needlessly throwing away potential information in our model.
For a varying slope and intercept model, therefore, we must now include a correlation matrix, \(\Omega\), between the parameters of the linear predictor for each state in our model. This correlation matrix, as with all parameters in a Bayesian framework, must be expressed with a prior distribution from which the model can begin its evaluation of the data.
With deference to the authoritative quaint and curious volume of forgotten lore we will use an LKJ prior for the correlation matrix without further discussion of the reasoning behind it.
$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &\sim& \alpha_i + \beta_i x\\
\begin{bmatrix}
\alpha_i\\
\beta_i
\end{bmatrix} &\sim& \mathcal{N}(
\begin{bmatrix}
\mu_\alpha\\
\mu_\beta
\end{bmatrix}, \Sigma )\\
\Sigma &=& \mathrm{diag}(\sigma) \, \Omega \, \mathrm{diag}(\sigma)\\
\sigma &\sim& \mathbf{HalfCauchy}(2)\\
\Omega &\sim& \mathbf{LKJCorr}(2)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$
This model has grown and gained a somewhat twisted complexity compared with the serene austerity of our earliest linear model. Despite this, each further step in the descent has followed its own perverse logic, and the progression should be clear. The corresponding Stan code follows:
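A sketch of the varying slope, varying intercept model; as before, the data block names are assumptions, and the correlated per-state pairs are drawn from a multivariate normal whose covariance is assembled from the scales and the LKJ-distributed correlation matrix:

```stan
data {
  int<lower=1> N;
  int<lower=1> N_state;
  int<lower=1, upper=N_state> state[N];
  vector[N] x;
  int<lower=0> y[N];
}
parameters {
  vector[2] ab[N_state];         // per-state (intercept, slope) pairs
  vector[2] mu_ab;               // population means of intercept and slope
  vector<lower=0>[2] sigma_ab;   // scales of intercept and slope
  corr_matrix[2] Omega;          // correlation between intercept and slope
  real<lower=0> phi;
}
model {
  mu_ab ~ normal(0, 1);
  sigma_ab ~ cauchy(0, 2);
  Omega ~ lkj_corr(2);
  // Covariance = diag(sigma) * Omega * diag(sigma)
  ab ~ multi_normal(mu_ab, quad_form_diag(Omega, sigma_ab));
  phi ~ cauchy(0, 2);
  for (n in 1:N)
    y[n] ~ neg_binomial_2_log(ab[state[n]][1] + ab[state[n]][2] * x[n], phi);
}
```

The quad_form_diag call builds the covariance matrix from the correlation matrix and the per-parameter scales, so the model learns both how much the intercepts and slopes vary and how they co-vary across states.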
The ultimate test of our faith, then, is whether the added complexity of the partially-pooled varying slope, varying intercept model is justified. Once again, we turn to the ruthless judgement of the LOO-CV:
> compare(loo_negbinom_var_intercept, loo_negbinom_var_intercept_slope)
elpd_diff        se
     13.3       2.4
In this final step we can see that our labours in the arcane have been rewarded. The final model is once again a significant improvement over its simpler relatives. Whilst the potential for deeper and more perfect models never ends, we will settle for now on this.
With our final model built, we can now begin to examine its mortifying implications. We will leave the majority of the subjective analysis for the next, and final, post in this series. For now, however, we can reinforce our quantitative analysis with visual assessment of the posterior predictive distribution output of our final model.
In comparison with earlier attempts, the varying intercept and slope model visibly captures the overall shape of the distribution with terrifying ease. As our wary confidence mounts in the mindless automaton we have fashioned, we can now examine its predictive ability on our original data.
The purpose of our endeavours is to show whether or not the frequency of extraterrestrial visitations is merely a sad reflection of the number of unsuspecting humans living in each state. After seemingly endless cryptic calculations, our statistical machinery implies that there are deeper mysteries here: allowing the relationship between sightings and the underlying linear predictors to vary by state more perfectly predicts the data. There are clearly other, hidden, factors in play.
More than that, however, our final model allows us to quantify these differences. We can now retrieve from the very bowels of our inferential process the per-state distribution of parameters for both the slope and intercept of the linear predictor.
It is important to note that, while we are still referring to the \(\alpha\) and \(\beta\) parameters as the slope and intercept, their interpretation is more complex in a generalised linear model with a \(\log\) link function than in the simple linear model. For now, however, this diagram is sufficient to show that the horror visited on innocent lives by our interstellar visitors is not purely arbitrary, but depends at least in part on geographical location.
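To make the log-link point concrete, here is a tiny numeric sketch (the numbers are invented, not drawn from the fitted model): with a \(\log\) link, a one-unit increase in the predictor multiplies the expected count by \(e^\beta\) rather than adding \(\beta\) to it.

```r
# Made-up coefficients for illustration only
alpha <- 1.2
beta  <- 0.3
mu_at <- function(x) exp(alpha + beta * x)  # expected count at predictor x
mu_at(1) / mu_at(0)  # ratio equals exp(beta): a ~35% multiplicative increase
```

So a state's "slope" governs a multiplicative growth rate, not an additive one.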
With this malign inferential process finally complete we will turn, in the next post, to a trembling interpretation of the model and its dark implications for our collective future.
When it comes to data visualization, it can be fun to think of all the flashy and exciting ways to display a dataset. But if you’re trying to convey information, flashy isn’t always the way to go.
In fact, one of the most powerful ways to communicate the relationship between two variables is the simple line graph.
A line graph is a type of graph that displays information as a series of data points connected by straight line segments.
There are many different ways to use R to plot line graphs, but the one I prefer is the ggplot geom_line function.
Before we dig into creating line graphs with the ggplot geom_line function, I want to briefly touch on ggplot and why I think it’s the best choice for plotting graphs in R.
ggplot is a package for creating graphs in R, but it’s also a method of thinking about and decomposing complex graphs into logical subunits.
ggplot takes each component of a graph–axes, scales, colors, objects, etc–and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that’s both flexible and user-friendly. When components are unspecified, ggplot uses sensible defaults. This makes ggplot a powerful and flexible tool for creating all kinds of graphs in R. It’s the tool I use to create nearly every graph I make these days, and I think you should use it too!
Throughout this post, we’ll be using the Orange dataset that’s built into R. This dataset contains information on the age and circumference of 5 different orange trees, letting us see how these trees grow over time. Let’s take a look at this dataset to see what it looks like:
The dataset contains 3 columns: Tree, age, and circumference. There are 7 observations for each Tree, and there are 5 Trees, for a total of 35 observations in all.
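As a quick sanity check (my own base-R aside, not part of the original walkthrough), we can confirm that shape directly, since Orange ships with R:

```r
# Inspect the built-in Orange dataset and confirm its shape
str(Orange)
nrow(Orange)                 # 35 observations
length(unique(Orange$Tree))  # 5 trees
table(Orange$Tree)           # 7 observations per tree
```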
library(tidyverse)
# Filter the data we need
tree_1 <- filter(Orange, Tree == 1)
# Graph the data
ggplot(tree_1) +
geom_line(aes(x = age, y = circumference))
Here we are starting with the simplest possible line graph using geom_line. For this simple graph, I chose to only graph the size of the first tree. I used dplyr to filter the dataset to only that first tree.
If you’re not familiar with dplyr’s filter function, it’s my preferred way of subsetting a dataset in R, and I recently wrote an in-depth guide to dplyr filter if you’d like to learn more!
Once I had filtered out the dataset I was interested in, I then used ggplot + geom_line() to create the graph. Let’s review this in more detail:
First, I call ggplot, which creates a new ggplot graph. It’s essentially a blank canvas on which we’ll add our data and graphics. In this case, I passed tree_1 to ggplot, indicating that we’ll be using the tree_1 data for this particular ggplot graph.
Next, I added my geom_line call to the base ggplot graph in order to create this line. In ggplot, you use the + symbol to add new layers to an existing graph. In this second layer, I told ggplot to use age as the x-axis variable and circumference as the y-axis variable.
And that’s it, we have our line graph!
ggplot + geom_line
Expanding on this example, let’s now experiment a bit with colors.
# Filter the data we need
tree_1 <- filter(Orange, Tree == 1)
# Graph the data
ggplot(tree_1) +
geom_line(aes(x = age, y = circumference), color = 'red')
You’ll note that this geom_line call is identical to the one before, except that we’ve added the modifier color = 'red'
to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.
Now, let’s try something a little different. Compare the ggplot code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result.
# Graph different data
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, color = Tree))
This line graph is quite different from the one we produced above, but we only made a few minor modifications to the code! Did you catch the 3 changes? They were:
1. We changed the input dataset from tree_1 to Orange
2. Instead of color = 'red', we specified color = Tree
3. We moved the color specification within the aes() parentheses
Let’s review each of these changes:
This change is relatively straightforward. Instead of only graphing the data for a single tree, we wanted to graph the data for all 5 trees. We accomplish this by changing our input dataset in the ggplot() call.
Specifying color = Tree and moving it within the aes() parentheses
I’m combining these because these two changes work together.
Before, we told ggplot to change the color of the line to red by adding color = 'red' to our geom_line() call.
What we’re doing here is a bit more complex. Instead of specifying a single color for our line, we’re telling ggplot to map the data in the Tree column to the color aesthetic.
Effectively, we’re telling ggplot to use a different color for each tree in our data! This mapping also lets ggplot know that it also needs to create a legend to identify the trees, and it places it there automatically!
ggplot + geom_line
Let’s look at a related example. This time, instead of changing the color of the line graph, we will change the linetype:
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, linetype = Tree))
This ggplot + geom_line() call is identical to the one we just reviewed, except we’ve substituted linetype for color. The graph produced is quite similar, but it uses different linetypes instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.
aes() (aesthetic) mappings in ggplot
We just saw how we can create graphs in ggplot that map the Tree variable to color or linetype in a line graph. ggplot refers to these mappings as aesthetic mappings, and they encompass everything you see within the aes() in ggplot.
Aesthetic mappings are a way of mapping variables in your data to particular visual properties (aesthetics) of a graph.
This might all sound a bit theoretical, so let’s review the specific aesthetic mappings you’ve already seen as well as the other mappings available within geom_line.
The main aesthetic mappings for ggplot + geom_line() include:
x: Map a variable to a position on the x-axis
y: Map a variable to a position on the y-axis
color: Map a variable to a line color
linetype: Map a variable to a linetype
group: Map a variable to a group (each variable on a separate line)
size: Map a variable to a line size
alpha: Map a variable to a line transparency
From the list above, we’ve already seen the x, y, color, and linetype aesthetic mappings.
x and y are what we used in our first ggplot + geom_line() function call to map the variables age and circumference to x-axis and y-axis values. Then, we experimented with using color and linetype to map the Tree variable to different colored lines or linetypes.
In addition to those, there are 3 other main aesthetic mappings often used with geom_line.
The group mapping allows us to map a variable to different groups. Within geom_line, that means mapping a variable to different lines. Think of it as a pared-down version of the color and linetype aesthetic mappings you already saw. While the color aesthetic mapped each Tree to a different line with a different color, the group aesthetic maps each Tree to a different line, but does not differentiate the lines by color or anything else. Let’s take a look:
The group aesthetic mapping in ggplot + geom_line
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, group = Tree))
You’ll note that the 5 lines are separated as before, but the lines are all black and there is no legend differentiating them. Depending on the data you’re working with, this may or may not be appropriate. It’s up to you as the person familiar with the data to determine how best to represent it in graph form!
In our Orange tree dataset, if you’re interested in investigating how specific orange trees grew over time, you’d want to use the color or linetype aesthetics to make sure you can track the progress for specific trees. If, instead, you’re interested in only how orange trees in general grow, then using the group aesthetic is appropriate, simplifying your graph and discarding unnecessary detail.
ggplot is both flexible and powerful, but it’s up to you to design a graph that communicates what you want to show. Just because you can do something doesn’t mean you should. You should always think about what message you’re trying to convey with a graph, then design from those principles.
Keep this in mind as we review the next two aesthetics. While these aesthetics absolutely have a place in data visualization, in the case of the particular dataset we’re working with, they don’t make very much sense. But this is a guide to using geom_line in ggplot, not graphing the growth of Orange trees, so I’m still going to cover them for the sake of completeness!
ggplot + geom_line with the alpha aesthetic
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, alpha = Tree))
Here we map the Tree variable to the alpha aesthetic, which controls the transparency of the line. As you can see, certain lines are more transparent than others. In this case, transparency does not add to our understanding of the graph, so I would not use this to illustrate this dataset.
The size aesthetic mapping in ggplot + geom_line
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, size = Tree))
Finally, we turn to the size aesthetic, which controls the size of lines. Again, I would say this does not add to our understanding of our data in this context. That said, it does slightly resemble Charles Joseph Minard’s famous graph of the death tolls of Napoleon’s disastrous 1812 Russia Campaign, so that’s kind of cool:
Before, we saw that we are able to use color in two different ways with geom_line. First, we were able to set the color of a line to red by specifying color = 'red' outside of our aes() mappings. Then, we were able to map the variable Tree to color by specifying color = Tree inside of our aes() mappings. How does this work with all of the other aesthetics you just learned about?
Essentially, they all work the same as color! That’s the beautiful thing about graphing in ggplot–once you understand the syntax, it’s very easy to expand your capabilities.
Each of the aesthetic mappings you’ve seen can also be used as a parameter, that is, a fixed value defined outside of the aes() aesthetic mappings. You saw how to do this with color when we set the line to red with color = 'red' before. Now let’s look at an example of how to do this with linetype in the same manner:
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, group = Tree), linetype = 'dotted')
To review what values linetype, size, and alpha accept, just run ?linetype, ?size, or ?alpha from your console window!
When I was getting started with R and ggplot, the distinction between aesthetic mappings (the values included inside your aes()) and parameters (the ones outside your aes()) was the concept that tripped me up the most. You’ll learn how to deal with these issues over time, but I can help you speed along this process with a few common errors that you can keep an eye out for.
Forgetting to put your aesthetic mapping inside the aes() call
If you’re trying to map the Tree variable to linetype, you should include linetype = Tree within the aes() of your geom_line call. What happens if you accidentally include it outside, and instead run ggplot(Orange) + geom_line(aes(x = age, y = circumference), linetype = Tree)? You’ll get an error message that looks like this:
Whenever you see this error about object not found, check to make sure you’re including your aesthetic mappings inside the aes() call!
Including a parameter value inside the aes() call
Alternatively, if we try to specify a specific parameter value (for example, color = 'red') inside of the aes() mapping, we get a less intuitive issue:
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, color = 'red'))
In this case, ggplot actually does produce a line graph (success!), but it doesn’t have the result we intended. The graph it produces looks odd, because it is putting the values for all 5 trees on a single line, rather than on 5 separate lines like we had before. It did change the color to red, but it also included a legend that simply says ‘red’. When you run into issues like this, double check to make sure you’re including the parameters of your graph outside your aes() call!
You should now have a solid understanding of how to use R to plot line graphs using ggplot and geom_line! Experiment with the things you’ve learned to solidify your understanding. As an exercise, try producing a line graph of your own using a different dataset and at least one of the aesthetic mappings you learned about. Leave your graph in the comments or email it to me at mt.toth@gmail.com — I’d love to take a look at what you produce!
Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, sign up here!
Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.
When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an “opinionated” solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported. We will keep the motivation for good versioning and reproducibility short: R projects evolve over time, as do the packages that they rely on. R projects that do not control package versions will eventually break and/or not be shareable or reproducible^{1}.
Since R is a slowly evolving language, it might be reasonable to require that a particular Linux instance have only one version of R installed. However, requiring all R users to use the same versions of all packages to facilitate collaboration is clearly out of the question. The solution is to control package versions at the project level.
We use packrat to control package versions. Already integrated with RStudio Server, packrat ensures that all installed packages are stored with the project^{2}, and that these packages are available when a project is opened. With packrat, we know that project A will always be able to use ggplot2 2.5.0 and project B will always be able to use ggplot2 3.1.0. This is important if we want to be able to reproduce results in the future.
On Linux, packrat stores compiled packages in packrat/lib/, an R-version-specific path, relative to the project’s base directory. An issue arises if we are using R version 3.5.0 one week and then upgrade to R 3.5.1 the next week: a packrat project will not find the 3.5.0 libraries anymore, and we will need to rebuild all the packages to install them in the 3.5.1 path. packrat will automatically build all packages from source (sources are stored in packrat/src) if it notices they are missing. However, this process can take tens of minutes, depending on the number of packages being built. Since this can be cumbersome when collaborating, we also opt to include the packrat/lib path in version control, thereby committing the compiled libraries as well.
Our solution is to bind one fixed R version to an instance^{3} and release fixed-R instance images periodically. We prefer limited, consistent R-versions over continually upgrading to the most recent version of R. This approach helps to ensure reproducibility, makes collaboration easier, and avoids having to use docker containers^{4}. While binding a fixed version of R to an instance may seem restrictive, we have found that it is in fact quite liberating. Since we only update the existing R version infrequently (think once a year), the barrier of agreeing on an R-version is removed and with it any need to agree on package versions at the user level. Instead, packages are distributed with the project via git. The benefits of fixing the R version for a particular instance are:
- Sharing packrat projects and reproducing results are both made easier, since pre-compiled libraries are included with the projects.
- packrat will automatically build and install libraries if an upgraded version is detected. In this way, a project can be opened on an instance with an upgraded R version and have its libraries compiled. Our limited instance image release schedule means the overhead to handle this only occurs at a maximum of once each year.
What we lose by not being on the bleeding edge of (thankfully relatively non-critical) bug fixes we gain in ease of collaboration. Here’s what we’ve done to accomplish this:
- We maintain an rstudio-instance repository with setup scripts. We git clone the repo and git checkout the branch suitable for the Linux flavor, R-version, and RStudio version we want, then run the install script. The scripts also ensure R is not auto-updated in the future.
- We maintain an rstudio-project repository containing a project template with packrat-managed, fixed versions of many popular data-science packages^{5}.
- To start a new project, we git clone rstudio-project to a new project directory locally and remove the existing .git directory so that it can be turned into a new git repo with git init.
- After installing or upgrading packages, we verify that the project is using its packrat library in the “Packages” tab, and then run packrat::snapshot() to save any libraries and upgrades into the project’s packrat/ directory. We can then git add packrat to add any packrat updates to the project’s git repo.
Here is a quick example script showing the workflow:
git clone git@github.com:ras44/rstudio-instance.git
cd rstudio-instance
git checkout centos7_R3.5.0_RSS1.1.453
./install.sh
sudo passwd # set user password for RStudio Server login
cd
git clone git@github.com:ras44/rstudio-project.git new-project
cd new-project
git checkout dev-linux-centos7-R3.5.0
rm -rf .git
git init
Finally, here are some issues with packrat that we have run into, along with our solutions. Note that RStudio support has been very helpful in addressing issues while monitoring and providing solutions via their github issue tracker.
If R crashes and the packrat libraries are not accessible after RStudio restarts the session, the project might need to be re-opened. Run .libPaths() to ensure the project library paths are correct. Verify libraries are accessible by looking at the “Packages” tab in RStudio Server and ensuring a “Project Library” header exists with all packages (see above image). Follow issue discussion.
An issue can arise when some packages are updated but others aren’t. This can be challenging to troubleshoot and raises the question of what to do when package versions become incompatible with each other. This is not a packrat issue, but a version-compatibility one.
Installing packages directly from a private/internal github is evolving. An easy solution exists: simply clone the package to a local directory such as ~/local_repos/. Then use install_local() to install from the local_repos directory. See issue for details.
packrat can occasionally have very slow snapshots, particularly with projects that contain many R-Markdown files and packages. This is likely due to packrat dependency searches. As discussed in the issue, we resolve it by ignoring all of our source directories with packrat::set_opts(ignored.directories = c("all", "my", "R", "src", "directories")) and then running packrat::snapshot(ignore.stale = TRUE, infer.dependencies = FALSE).
Imagine you have a function that only takes one argument, but you would really like it to work on a vector of values. A short example on how the function Vectorize() can accomplish this. Let’s say we have a data.frame
xy <- data.frame(sample = c("C_pre_sample1", "C_post_sample1", "T_pre_sample2",
"T_post_sample2", "NA_pre_sample1"),
value = runif(5))
# sample value
# 1 C_pre_sample1 0.3048032
# 2 C_post_sample1 0.3487163
# 3 T_pre_sample2 0.3359707
# 4 T_post_sample2 0.6698358
# 5 NA_pre_sample1 0.9490707
and you want to subset only samples that start with C_pre or T_pre. Of course you can construct a nice regular expression, implement an anonymous function using lapply/sapply or use one of those fancy tidyverse functions.
A long-winded way would be to find matches using a regular expression for each level, combine them and subset. This is for pedagogical reasons, so please bear with me.
i.ind <- do.call(cbind, list(
grepl(pattern = "^C_pre", x = xy$sample),
grepl(pattern = "^T_pre", x = xy$sample)
))
i.ind
# [,1] [,2]
# [1,] TRUE FALSE
# [2,] FALSE FALSE
# [3,] FALSE TRUE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# Find those rows in `xy` that have at least one TRUE and use that to subset the
# data.frame.
xy[rowSums(i.ind) > 0, ]
# sample value
# 1 C_pre_sample1 0.3048032
# 3 T_pre_sample2 0.3359707
The same can be achieved using a vectorized version of the grepl function. We designate which argument exactly is being vectorized, in our case pattern, because that’s the argument that is varying.
vgrepl <- Vectorize(grepl, vectorize.args = "pattern")
Here we use the function Vectorize and we tell it to vectorize the argument pattern. What this will do is run the grepl function for every element of the vector we pass in, just like we did in the i.ind object a few lines above.
This would be the equivalent of doing it using an anonymous function
tmp <- sapply(c("^C_pre", "^T_pre"), FUN = function(pt, input) {
grepl(pt, x = input)
}, input = xy$sample)
tmp
# ^C_pre ^T_pre
# [1,] TRUE FALSE
# [2,] FALSE FALSE
# [3,] FALSE TRUE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
While this can be somewhat verbose, you can use vgrepl as you would use grepl, with the minor detail that you pass a whole vector to pattern instead of a single regular expression.
i.vec <- vgrepl(pattern = c("^C_pre", "^T_pre"), x = xy$sample)
# ^C_pre ^T_pre
# [1,] TRUE FALSE
# [2,] FALSE FALSE
# [3,] FALSE TRUE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
xy[rowSums(i.vec) > 0, ]
# sample value
# 1 C_pre_sample1 0.3048032
# 3 T_pre_sample2 0.3359707
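To round this off, here is one more toy illustration of my own (not from the post above): Vectorize() is just as handy for scalar-only functions, such as one built around `if`, which cannot take a vector condition.

```r
# A function that only works on a single scalar value
clamp <- function(x, lo, hi) {
  if (x < lo) return(lo)
  if (x > hi) return(hi)
  x
}
# clamp(c(-1, 2), 0, 1) would fail, because `if` expects a single TRUE/FALSE.
# Vectorizing over x fixes that; lo and hi stay as fixed arguments.
vclamp <- Vectorize(clamp, vectorize.args = "x")
vclamp(c(-1, 0.5, 2), lo = 0, hi = 1)
# [1] 0.0 0.5 1.0
```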
John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them.
Fact 1: Two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal.
Cook does not discuss this fact as it is “well known”. Let me demonstrate it empirically. Below, the first function generates a p-dimensional unit vector uniformly at random. The second function takes in two p-dimensional vectors, x1 and x2, and computes the angle between them. (For details, see Cook’s blog post.)
genRandomVec <- function(p) {
  x <- rnorm(p)
  x / sqrt(sum(x^2))
}

findAngle <- function(x1, x2) {
  dot_prod <- sum(x1 * x2) / (sqrt(sum(x1^2) * sum(x2^2)))
  acos(dot_prod)
}
Next, we use the replicate function to generate 100,000 pairs of 10,000-dimensional vectors and plot a histogram of the angles they make:
simN <- 100000  # no. of simulations
p <- 10000
set.seed(100)
angles <- replicate(simN, findAngle(genRandomVec(p), genRandomVec(p)))
angles <- angles / pi * 180  # convert from radians to degrees
hist(angles)
Note the scale of the x-axis: the angles are very closely bunched up around 90 degrees, as claimed.
This phenomenon only happens for “high” dimensions. If we change the value of p above to 2, we obtain a very different histogram:
How “high” does the dimension have to be before we see this phenomenon kick in? Well, it depends on how tightly bunched up we want the angles to be around 90 degrees. The histogram below is the same simulation but for p = 20 (notice the wider x-axis scale):
It seems like the bell-shaped curve already starts to appear with p = 3!
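To quantify the concentration (my own addition, not in Cook's post): the spread of the angle shrinks roughly like \(1/\sqrt{p}\), which we can check by comparing the standard deviation of the angles across a few dimensions. The helper functions are redefined here so the snippet is self-contained, with findAngle simplified for unit vectors:

```r
genRandomVec <- function(p) { x <- rnorm(p); x / sqrt(sum(x^2)) }
findAngle <- function(x1, x2) acos(sum(x1 * x2))  # both inputs are unit vectors

set.seed(1)
sds <- sapply(c(10, 100, 1000), function(p) {
  angles <- replicate(2000, findAngle(genRandomVec(p), genRandomVec(p)))
  sd(angles / pi * 180)  # spread of the angle, in degrees
})
round(sds, 1)  # shrinks by roughly sqrt(10) at each step
```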
Fact 2: Generate 10,000 random vectors in 20,000 dimensional space. Now, generate another random vector in that space. Then the angle between this vector and its projection on the span of the first 10,000 vectors is very likely to be very near 45 degrees.
Cook presents a very cool intuitive explanation of this fact which I highly recommend. Here, I present simulation evidence of the fact.
The difficulty in this simulation is computing the projection of a vector onto the span of many vectors. It can be shown that the projection of a vector \(v\) onto the column span of a (full-rank) matrix \(A\) is given by \(A(A^TA)^{-1}A^Tv\) (see this post and this post for details). For our fact, \(A\) is a \(20000 \times 10000\) matrix, so computing \(A(A^TA)^{-1}A^Tv\) is going to take prohibitively long.
I don’t know another way to compute the projection of a vector onto the span of other vectors. (Does anyone know of better ways?) Fortunately, based on my simulations in Fact 1, this phenomenon will probably kick in for much smaller dimensions too!
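One faster alternative (my suggestion, not from the original post) is R's QR decomposition: qr.fitted(qr(A), v) returns the least-squares fitted values, i.e. the projection of v onto the column span of A, without explicitly inverting t(A) %*% A.

```r
# Projection via QR decomposition, avoiding the explicit normal-equations inverse
projectVecQR <- function(v, A) qr.fitted(qr(A), v)

# Sanity check against the direct formula on a small example
set.seed(1)
A <- matrix(rnorm(30), nrow = 6, ncol = 3)
v <- rnorm(6)
direct <- A %*% solve(t(A) %*% A) %*% t(A) %*% v
max(abs(direct - projectVecQR(v, A)))  # essentially zero: the two agree
```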
First, let’s write up two functions: one that takes a vector v and matrix A and returns the projection of v onto the column span of A:
projectVec <- function(v, A) {
  A %*% solve(t(A) %*% A) %*% t(A) %*% v
}
and a function that does one run of the simulation. Here, p is the dimensionality of each of the vectors, and I assume that we are looking at the span of p/2 vectors:
simulationRun <- function(p) {
  A <- replicate(p/2, genRandomVec(p))
  v <- genRandomVec(p)
  proj_v <- projectVec(v, A)
  findAngle(proj_v, v)
}
The code below runs 10,000 simulations for p = 20, taking about 2 seconds on my laptop:
simN <- 10000  # no. of simulations
p <- 20  # dimension of the vectors
set.seed(54)
angles <- replicate(simN, simulationRun(p))
angles <- angles / pi * 180  # convert from radians to degrees
hist(angles, breaks = seq(0, 90, 5))
abline(v = 45, col = "red", lwd = 3)
We can already see the bunching around 45 degrees:
The simulation for p = 200 takes just under 2 minutes on my laptop, and we see tighter bunching around 45 degrees (note the x-axis scale).
Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long! Starting with:
id xa xb ya yb
1 1 3 6 8
2 2 4 7 9
How can I get to:
id t x y
1 a 1 6
1 b 3 8
2 a 2 7
2 b 4 9
In Stata it's: . reshape long x y, i(id) j(t) string
In R, it's: . an hour of cursing followed by a desperate tweet 👆
Thanks for any help! PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.
First: (and this is the important part) define our problem using an example. Tim Morris did this really well, but let’s repeat it here. We want to realize the following data layout transform.
Second: identify the record ID and record structure in both the before and after examples.
Third: attach the cdata package, and use build_frame() to type in the example "before" data.
library("cdata")
before <- build_frame(
"id" , "xa", "xb", "ya", "yb" |
1 , 1 , 3 , 6 , 8 |
2 , 2 , 4 , 7 , 9 )
knitr::kable(before)
id | xa | xb | ya | yb |
---|---|---|---|---|
1 | 1 | 3 | 6 | 8 |
2 | 2 | 4 | 7 | 9 |
Fourth: (this is the "hard" part) copy the marked column names from the before example into the matching record positions in the after example.
Fifth: copy the annotated "after" record in as your layout transform control table.
ct <- qchar_frame(
"t" , "x" , "y" |
"a", xa , ya |
"b", xb , yb )
knitr::kable(ct)
t | x | y |
---|---|---|
a | xa | ya |
b | xb | yb |
In the above we are using a convention that concrete values are written in quotes, and symbols to be taken from the "before" data frame are written without quotes.
Now specify the many-record transform.
layout_spec <- rowrecs_to_blocks_spec(
ct,
recordKeys = "id")
The layout_spec completely encodes our intent, so we can look at it to double-check what transform we have specified.
print(layout_spec)
## {
## row_record <- wrapr::qchar_frame(
## "id" , "xa", "xb", "ya", "yb" |
## . , xa , xb , ya , yb )
## row_keys <- c('id')
##
## # becomes
##
## block_record <- wrapr::qchar_frame(
## "id" , "t", "x", "y" |
## . , "a", xa , ya |
## . , "b", xb , yb )
## block_keys <- c('id', 't')
##
## # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE)
## }
And we can now apply the layout transform to data.
after <- before %.>% layout_spec
# cdata 1.0.9 adds the non-piped function notation:
# layout_by(layout_spec, before)
knitr::kable(after)
id | t | x | y |
---|---|---|---|
1 | a | 1 | 6 |
1 | b | 3 | 8 |
2 | a | 2 | 7 |
2 | b | 4 | 9 |
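For comparison (my own aside, echoing the Stata one-liner in the tweet), the same layout transform can also be done in base R with reshape(), though the argument names take some getting used to:

```r
d_wide <- data.frame(id = c(1, 2),
                     xa = c(1, 2), xb = c(3, 4),
                     ya = c(6, 7), yb = c(8, 9))
# Wide to long: each element of `varying` is one group of wide columns,
# named by the corresponding entry of `v.names`.
d_long <- reshape(d_wide,
                  direction = "long",
                  idvar     = "id",
                  timevar   = "t",
                  times     = c("a", "b"),
                  varying   = list(c("xa", "xb"), c("ya", "yb")),
                  v.names   = c("x", "y"))
d_long <- d_long[order(d_long$id, d_long$t), ]
d_long  # one row per (id, t) pair, with columns id, t, x, y
```

The cdata approach trades this argument bookkeeping for a visible, diagram-like control table.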
A really fun extra: we can build an inverse layout specification to reverse the transform.
reverse_layout <- t(layout_spec) # invert the spec using t()
print(reverse_layout)
## {
## block_record <- wrapr::qchar_frame(
## "id" , "t", "x", "y" |
## . , "a", xa , ya |
## . , "b", xb , yb )
## block_keys <- c('id', 't')
##
## # becomes
##
## row_record <- wrapr::qchar_frame(
## "id" , "xa", "xb", "ya", "yb" |
## . , xa , xb , ya , yb )
## row_keys <- c('id')
##
## # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE)
## }
after %.>%
reverse_layout %.>%
knitr::kable(.)
id | xa | xb | ya | yb |
---|---|---|---|---|
1 | 1 | 3 | 6 | 8 |
2 | 2 | 4 | 7 | 9 |
And that is it, we have a re-usable layout_spec that can transform future data. We have many tutorials on the method here, and the source code for this note can be found here.
Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning.
About two weeks ago, DataCamp published a blog post outlining an incident of sexual misconduct at the company. The post was published one day after a group of over 100 instructors sent a letter to DataCamp saying that the way the company had handled sexual misconduct was not acceptable and the company needed to do better. I was one of the signers of the letter, and in fact was one of the writers of the letter as well as one of the folks who helped organize this group action. This letter came after many months of instructors like me attempting to engage with DataCamp in productive discussion. It did not threaten to go public or call for the executive’s firing, but it did bring up how the growing rumors and uncertainty around misconduct at DataCamp have been a problem for the personal reputations of instructors. For example, I have had people I don’t know come up to me at conferences and say things like, “I heard something bad happened at DataCamp. What’s up?”
People in the broader data science community often associate instructors with DataCamp, but it turns out we have very little control or influence with the company (outside of what effectively turned into group organizing). As contractors, we have less influence than regular employees. With the way most of our contracts are written, we don’t even have the ability to end our affiliations with the company. I personally regret signing my contract and will bring some hard-learned lessons about such contracts to any future collaborations. I allowed relationships I have with people who work for DataCamp to influence my perception of the contract I was signing.
I think it’s likely that DataCamp management thought or hoped that their post was enough to placate instructors, and that they essentially did what we asked in the letter. I am deeply disappointed that this was their response. The main problem is how the leadership of DataCamp has chosen to deal with and disclose an incident like this. Although the post does clearly say that what happened was inappropriate and that the dynamic between an executive and an employee makes that particularly egregious, detail is used in harmful and victim-blaming ways. Every detail that might possibly put DataCamp in a better light is included, and details that provide a counternarrative are excluded. This is particularly frustrating to me because I have given feedback to multiple individuals at DataCamp that this kind of language is unproductive and unhelpful for rebuilding trust with instructors and the broader community, as well as largely unpersuasive to most readers. We have all read enough of these committee-written, half-hearted “apologies” by now that such strategies are obvious.
Perhaps you have noticed that searching for information about sexual misconduct at DataCamp does not surface their own post. This is because the company added a noindex
flag to this post (and only this post, unlike their other blog posts) so that it would not be indexed by search engines like Google.
If you think @DataCamp deserves credit for a public post that they did not discipline an executive for sexual misconduct, consider that they put this code in the HTML, so no search engines would index it. I don't see it anywhere else on their blog. #rstats #python #datasci #metoo pic.twitter.com/xbQoSaQl5J
— Noam Ross (@noamross) April 12, 2019
This particular choice on the part of DataCamp is true to the character of the rest of my interactions with the company over the past year or so. I have hesitated to go into a lot of detail publicly about what’s happened with me because others have experienced much worse, but I will share a few things for context. Employees refused to respond in email/writing about concerns I raised, and instead always deferred to scheduling (time-consuming and yet unproductive) one-on-one calls. There was one group meeting for instructors who had raised concerns, but it was organized as a webinar where instructors could not speak, could not see who else was in the meeting, and could not see questions typed by other participants.
"Nothing has ever, ever been solved by a webinar." -great wisdom from @alexhanna
— Julia Silge (@juliasilge) February 27, 2019
This has been painful in many ways. For starters, it is painful to learn how harm has come to a respected member of my community through a company that I have directly helped make money, and how little has been done for so long to make this right. It is painful to realize I was not so savvy in signing my contract. It is painful to navigate the souring of a working relationship with a company where people who I know and care about are employees. I don’t really know how to handle this difficult situation other than to affirm the value of these individuals while telling the truth about what has happened.
I don’t expect the companies or people I work with to be perfect, and in fact, I myself work for an imperfect company. What I do need to maintain a continuing relationship with a company and/or people is trustworthiness and accountability. I am an optimist and still hold out hope for the folks at DataCamp to demonstrate that I (and the broader community) should trust them. For now, though, much like Noam Ross writes in his post, I urge everyone to avoid using the materials I have developed for DataCamp, if at all possible for you.
The way to make men face consequences for their decisions is to force repercussions via companies and institutions. So I hope you don’t take my DataCamp course. I hope you will stop using DataCamp and let them know this is why. I hope if your company uses DataCamp you convince them to stop buying licenses. I hope if you invest in or advertise with or accept sponsorships from DataCamp you stop.
Like many other instructors, I am looking into options for making these or similar materials available openly. In the meantime, my book on text mining (written with my valued coauthor David Robinson) is available entirely online, and I have blog posts and tutorials freely available on many aspects of text mining.
Interactive EDA is nice, but customized interactive EDA is even nicer. To celebrate the new CRAN version of my ‘ExPanDaR’ package, I prepared a customized variant of ‘ExPanD’ to explore the U.S. EPA data on fuel economy. Our objective is to develop an interactive display that guides the reader on how to explore the fuel economy data in an intuitive way.
First, let’s load the packages and the data from EPA’s web page. In addition, I prepared a small data set containing the countries of domicile for the car producers with more than 100 different cars listed in the EPA data (covering the vast majority of the EPA sample).
library(tidyverse)
library(ExPanDaR)
# The following two chunks borrow
# from the raw data code of the
# fueleconomy package by Hadley Wickham,
# See: https://github.com/hadley/fueleconomy
if(!file.exists("vehicles.csv")) {
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
tmp, quiet = TRUE)
unzip(tmp, exdir = ".")
}
raw <- read.csv("vehicles.csv", stringsAsFactors = FALSE)
countries <- read.csv("https://joachim-gassen.github.io/data/countries.csv",
stringsAsFactors = FALSE)
Next, I merge the data and re-code some factors to be more stable across time, as the EPA has made several coding changes over the years.
vehicles <- raw %>%
mutate(car = paste(make, model, trany),
mpg_hwy = ifelse(highway08U > 0, highway08U, highway08),
mpg_city = ifelse(city08U > 0, city08U, city08)) %>%
left_join(countries) %>%
select(car, make, country, trans = trany,
year,
class = VClass, drive = drive, fuel = fuelType,
cyl = cylinders, displ = displ,
mpg_hwy, mpg_city) %>%
filter(drive != "",
year > 1985,
year < 2020) %>%
mutate(fuel = case_when(
fuel == "CNG" ~ "gas",
fuel == "Gasoline or natural gas" ~ "hybrid_gas",
fuel == "Gasoline or propane" ~ "hybrid_gas",
fuel == "Premium and Electricity" ~ "hybrid_electro",
fuel == "Premium Gas or Electricity" ~ "hybrid_electro",
fuel == "Premium Gas and Electricity" ~ "hybrid_electro",
fuel == "Regular Gas or Electricity" ~ "hybrid_electro",
fuel == "Electricity" ~ "electro",
fuel == "Diesel" ~ "diesel",
TRUE ~ "gasoline"
),
class = case_when(
grepl("Midsize", class) ~ "Normal, mid-size",
grepl("Compact", class) ~ "Normal, compact",
grepl("Small Station Wagons", class) ~ "Normal, compact",
grepl("Large Cars", class) ~ "Normal, large",
grepl("Minicompact", class) ~ "Normal, sub-compact",
grepl("Subcompact", class) ~ "Normal, sub-compact",
grepl("Two Seaters", class) ~ "Two Seaters",
grepl("Pickup Trucks", class) ~ "Pickups",
grepl("Sport Utility Vehicle", class) ~ "SUVs",
grepl("Special Purpose Vehicle", class) ~ "SUVs",
grepl("Minivan", class) ~ "(Mini)vans",
grepl("Vans", class) ~ "(Mini)vans"
),
drive = case_when(
grepl("4-Wheel", drive) ~ "4-Wheel Drive",
grepl("All-Wheel", drive) ~ "4-Wheel Drive",
grepl("Front-Wheel", drive) ~ "Front-Wheel Drive",
grepl("Rear-Wheel", drive) ~ "Rear-Wheel Drive"
),
trans = case_when(
grepl("Automatic", trans) ~ "Automatic",
grepl("Manual", trans) ~ "Manual"
)) %>%
na.omit()
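One design choice worth flagging in the pipeline above: the class, drive and trans recodes have no TRUE ~ ... catch-all, so case_when() maps any unmatched level to NA, and the closing na.omit() then silently drops those vehicles. A minimal base-R sketch of the same mechanic (toy data, not the EPA sample):

```r
# Unmatched levels become NA, and na.omit() drops those rows silently.
toy <- data.frame(class = c("Midsize Cars", "Compact Cars", "Motorhome"),
                  stringsAsFactors = FALSE)
toy$class2 <- ifelse(grepl("Midsize", toy$class), "Normal, mid-size",
              ifelse(grepl("Compact", toy$class), "Normal, compact",
                     NA_character_))  # no catch-all, mirroring case_when()
nrow(na.omit(toy))  # 2: the motorhome row is gone
```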
Now, the EPA fuel economy data is ready for exploration. Nothing stops you from firing up ‘ExPanD’ right away.
ExPanD(vehicles, cs_id = "car", ts_id = "year")
This works but the display is not optimal for what we are after. It includes components that are not needed (e.g., the missing values display as the data do not contain any missing observations) and lacks documentation. We can do better by using some of the new customization features of the ‘ExPanD’ app. See below for a ‘customized’ version that guides the reader a little bit better through the analysis by omitting unneeded displays, adding variable definitions, setting informative defaults and adding explanatory texts between displays.
df_def <- data.frame(
var_name = names(vehicles),
var_def = c("Make, model and transmission type identifying a unique car in the data",
"Make of car",
"Country where the car-producing firm is located",
"Transmission type (automatic or manual)",
"Year of data",
"Classification type of car (simplified from original data)",
"Drive type of car (Front Wheel, Rear Wheel or 4 Wheel)",
"Fuel type (simplified from original data)",
"Number of engine cylinders",
"Engine displacement in liters",
"Highway miles per gallon (MPG). For electric and CNG vehicles this number is MPGe (gasoline equivalent miles per gallon).",
"City miles per gallon (MPG). For electric and CNG vehicles this number is MPGe (gasoline equivalent miles per gallon)."),
type = c("cs_id", rep("factor", 3), "ts_id", rep("factor", 3), rep("numeric", 4))
)
html_blocks <- c(
paste("",
"By default, this display uses all data from car makes with more",
"than 100 cars in the 'fueleconomy.gov' database.",
"Above, you can limit the analysis to cars from a certain make,",
"class, country, fuel type or other factor present in the data.",
""),
paste("",
"In the display above, remove the check mark to see the absolute",
"number of cars included in the data each year.",
"Also, change the additional factor to see how the distribution",
"of cars across countries, transmission types, etc. changes over time.",
""),
paste("",
"In the two tables above, you can assess the distributions of the",
"four numerical variables of the data set. Which car has the",
"largest engine of all time?",
""),
paste("",
"Explore the numerical variables across factors. You will see,",
"not surprisingly, that fuel economy varies by car class.",
"Does it also vary by drive type?",
""),
paste("",
"The above two panels contain good news. Fuel economy has",
"increased over the last ten years. See for yourself:",
"Has the size of engines changed as well?",
""),
paste("",
"The scatter plot documents a clear link between engine size",
"and fuel economy in terms of miles per gallon.",
"Below, you can start testing for associations.",
""),
paste("",
"Probably, you will want to test for some associations that",
"require you to construct new variables. No problem. Just enter the",
"variable definitions above. Some ideas on what to do:",
"- Define country dummies (e.g., country == 'US') to see",
"whether cars from certain countries are less fuel efficient than others.",
"- Define a dummy for 4-Wheel drive cars to assess the penalty",
"of 4-Wheel drives on fuel economy.",
"- If you are from a metric country, maybe you are mildly annoyed",
"by the uncommon way to assess fuel economy via miles per gallon.",
"Fix this by defining a liter by 100 km measure",
"(hint: 'l100km_hwy := 235.215/mpg_hwy').",
""),
paste("",
"Above, you can play around with certain regression parameters.",
"See how robust coefficients are across car classes by estimating",
"the models by car class ('subset' option).",
"Try a by year regression to assess the development of fuel economy",
"over time.",
"If you like your analysis, you can download the configuration",
"and reload it at a later stage using the buttons below.",
"")
)
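The constant in the 'l100km_hwy' hint above follows from unit arithmetic: with 1 mile ≈ 1.60934 km and 1 US gallon ≈ 3.78541 liters, l/100km = 100 * 3.78541 / (1.60934 * mpg) ≈ 235.215 / mpg. A quick sanity check:

```r
# Derive and check the mpg -> liters per 100 km conversion constant.
l100km <- function(mpg) 235.215 / mpg

round(100 * 3.78541 / 1.60934, 3)  # 235.215, the constant in the hint
round(l100km(47), 1)               # a 47 mpg car burns about 5 l/100km
```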
cl <- list(
ext_obs_period_by = "2019",
bgbg_var = "mpg_hwy",
bgvg_var = "mpg_hwy",
scatter_loess = FALSE,
delvars = NULL,
scatter_size = "cyl",
bar_chart_relative = TRUE,
reg_x = c("cyl", "displ", "trans"),
scatter_x = "displ",
reg_y = "mpg_hwy",
scatter_y = "mpg_hwy",
bgvg_byvar = "class",
quantile_trend_graph_var = "mpg_hwy",
bgbg_byvar = "country",
scatter_color = "country", bar_chart_var2 = "class",
ext_obs_var = "mpg_hwy",
trend_graph_var1 = "mpg_hwy",
trend_graph_var2 = "mpg_city",
sample = "vehicles"
)
ExPanD(vehicles, df_def = df_def, config_list = cl,
title = "Explore the Fuel Economy of Cars in the U.S. Market",
abstract = paste("This interactive display features the ",
"fuel economy data provided by the U.S. Environmental Protection Agency",
"and allows you to explore the fuel economy of cars in the U.S. market",
"across time and other dimensions. Scroll down and enjoy!"),
components = c(subset_factor = TRUE,
html_block = TRUE,
bar_chart = TRUE,
html_block = TRUE,
descriptive_table = TRUE,
ext_obs = TRUE,
html_block = TRUE,
by_group_bar_graph = TRUE,
by_group_violin_graph = TRUE,
html_block = TRUE,
trend_graph = TRUE,
quantile_trend_graph = TRUE,
html_block = TRUE,
scatter_plot = TRUE,
html_block = TRUE,
udvars = TRUE,
html_block = TRUE,
regression = TRUE,
html_block = TRUE),
html_blocks = html_blocks
)
You can access the customized EDA online here. While customizing ‘ExPanD’ requires extra work, it helps to make the EDA flow better. For further information on how to build customized EDAs using ‘ExPanD’ step by step, please refer to the new vignette of the package.
Feel free to comment below. Alternatively, you can reach me via email or twitter.
Enjoy!